Two weeks ago I suggested that learning the art of regular expressions, or its future equivalent, would become a skill for digital literacy in the age of information. Last week I told a story about James Legge using a regular expression (regex) to easily carry out an otherwise odious formatting task. This posting assumes you have read that one, which explained a few basics.
This week, let us try something a little different and a bit more challenging. Using this Project Gutenberg plain text edition of an English translation of the Heimskringla containing fifteen sagas divided up into over 800 sections, let us see if we can use a regular expression to answer the question: how many – and which – separate sections in the text refer to women?
I should start by saying that I am not a regex expert. As in many of our postings here at ProfHacker, in which we share things as we learn them, I write as a novice, and everything I know about regex came from poking around online.
When I set out to write a regular expression, I usually skim over the text I’m working with to familiarize myself with it and then try to think of how I might construct the regex in plain English first. Here is what I came up with: A section in Heimskringla begins with a number, followed by a period. Then look for a word indicating a woman, such as woman, women, lady, girl, princess, queen, daughter, sister, mother, or wife. Oh, and any of those might be capitalized or in all caps if it is in the section title. The end of each section looks like it has four empty lines.
Immediately you may notice some reasons why even my English explanation doesn’t cover everything. However, when you write a regex, you need to decide how accurate you want it to be. The more effort you spend on making it work perfectly, the less time you save over solving the problem some other way. As Jamie Zawinski, one of the developers of Netscape, once put it, “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” (h/t Glenn F. Henriksen) If it takes too long to write a regex good enough for your needs, then it probably isn’t the right way to go. Here, I’m imagining a case where someone just wants to get a rough tally which, depending on the results, they may or may not check with greater care by hand.
Why not search for “she” or “her”? These two words can be found inside many other words, so we would have to make the regular expression much more complicated by looking for the words followed and preceded by spaces, commas, periods, or found at the beginning or end of lines, etc. I decided it was just not worth the trouble. What if there are women who are only mentioned directly by name? This is pretty hard to solve without knowing something about Old Norse naming practices, but fortunately (at least for us, if not for them), looking over the text, it seems that all the women I see mentioned are described in relation to some man. What about false positives? After all, “lady” will also find “malady.” I decided that the number of false positives was likely to be low and not worth the trouble of fixing.
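For readers who do want to chase “she” and “her,” many regex engines offer a word-boundary marker, \b, that may take much of the pain out of it. Here is a minimal sketch in Python; the sample sentence is made up for illustration and is not from the Heimskringla, and whether \b is available in your own editor is something to check.

```python
import re

# Made-up sample sentence, not taken from the Heimskringla text.
sample = "She gave her brother the other sheath, and there he left her."

# Naive search: also finds "her" inside "brother", "other", "there"
# and "she" inside "sheath".
naive = re.findall(r"she|her", sample, flags=re.IGNORECASE)

# \b marks a word boundary, so only the standalone words match.
bounded = re.findall(r"\b(?:she|her)\b", sample, flags=re.IGNORECASE)

print(len(naive))  # 7 hits, most of them false positives
print(bounded)     # ['She', 'her', 'her']
```

The same trick, \blady\b, would keep “malady” out of the results, at the cost of also excluding “ladies” unless you add it to the list.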
So let us put together the regular expression, beginning with the easiest part, which will go into the middle of the regex, finding the terms indicating women:
(wom[ae]n|lady|girl|princess|queen|daughter|sister|mother|wi[vf]e)
If you use a | or “pipe” character, the regex will match any of the items listed. Notice that, like last week, I used a character class, which lists a number of accepted characters, to match both woman and women and both wife and wive, which we know will match “wives.” You can indicate that you want your search to be completely case insensitive (which we do here since these words might be capitalized or found in all caps in the title of sections) by adding the special search option “i,” which is done by adding: (?i)
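To see the pipe, the character classes, and the (?i) option working together, here is a small sketch in Python. The sample sentence is invented for the purpose, and Python is just one of many environments where these pieces behave this way.

```python
import re

# The word list from the text, with the (?i) case-insensitive option in front.
WOMEN = r"(?i)(wom[ae]n|lady|girl|princess|queen|daughter|sister|mother|wi[vf]e)"

# Made-up sample text, not quoted from the Heimskringla.
sample = "OF QUEEN GUNHILD. The king's wives and his daughter cured a malady."

hits = re.findall(WOMEN, sample)
print(hits)  # ['QUEEN', 'wive', 'daughter', 'lady']
```

Note the last hit: “lady” is found inside “malady,” which is exactly the kind of false positive discussed above.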
If you run that regex on the Heimskringla text in a regex-friendly text editor like TextWrangler, it will just find any and all those words (835 hits by my count). But we want to match all sections which contain a mention of women, so our work is not done. We need to wrap these women-related words in a pattern that matches a whole section.
In a regex, the ^ character usually (see below) means “the start of a line,” so I’m going to look for the start of a line, followed by one or more digits, followed by a period:
^[0-9]+\.
Notice that inside the character class, I put 0-9. The - allows me to indicate a range of acceptable characters, and some commonly used regex ranges are 0-9, a-z, and A-Z. As we saw last week, the + character will match one or more of whatever precedes it. Also remember that to indicate that we are looking for a literal period, I have to escape it with a backslash, since a period usually means “match any character except the end of a line.”
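Here is a short Python sketch of the section-number pattern. One assumption to flag: in Python, ^ only matches at the start of every line when the re.MULTILINE flag is set, which may differ from how your text editor behaves. The sample text is made up.

```python
import re

# Start of a line, one or more digits, then a literal period.
SECTION_START = r"^[0-9]+\."

# Made-up sample: two section headings plus a number in the middle of a line.
sample = (
    "1. OF THE FIRST KING.\n"
    "In the year 10. nothing happened.\n"
    "\n"
    "23. OF THE NEXT KING.\n"
    "More text here.\n"
)

# re.MULTILINE lets ^ match at the start of each line, not just the whole text.
starts = re.findall(SECTION_START, sample, flags=re.MULTILINE)
print(starts)  # ['1.', '23.']
```

The “10.” in the second line is skipped because it does not sit at the start of a line.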
Here is the tricky part. Regex usually searches line by line, not across lines. But we are looking for sections which all have more than one line. Fortunately, you can often (and this is true of TextWrangler) indicate that you want to match a pattern found across lines by adding the “s” option to your regex. Let us combine it with the “i” case-insensitive option I mentioned earlier:
(?is)
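A quick way to see what the “s” option changes is to try the same search with and without it. This Python sketch uses a made-up two-line sample; in Python the option makes the period match line breaks too, and I am assuming other tools that support (?s) behave similarly.

```python
import re

# Made-up two-line sample: a heading, then a sentence on the next line.
sample = "1. HALFDAN\nHis mother was a queen."

# Without "s", the period stops at the end of the line, so nothing is found.
without_s = re.search(r"(?i)HALFDAN.*queen", sample)

# With "s", the period may also match the line break, so the search succeeds.
with_s = re.search(r"(?is)HALFDAN.*queen", sample)

print(without_s)        # None
print(with_s.group(0))  # the heading and the next line, up to "queen"
```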
Now to put it all together I want to allow for the fact that there might be lots of words between the section title number and the women, and there are likely to be many more words after the hit for women. The only thing there cannot be is four blank lines. Once you hit four blank lines, you know the section has ended. In other words:
(?is)^[0-9]+\. anything but four lines (wom[ae]n|lady|girl|princess|queen|daughter|sister|mother|wi[vf]e) anything but four lines
To match any character except certain characters, you put a ^ inside a character class. For example, if you want to match anything except a “z” or an “x” you can search for [^zx]. As we saw last time, \r means a carriage return, so I tried [^\r\r\r\r], but a negated character class can only look at a single character, not several consecutive characters. I think I need to use something called a negative lookbehind to do a negative match on multiple characters, but I can’t quite figure out how this works (any regex gurus out there?). So I got around this problem by doing a quick search and replace, replacing all instances of four blank lines with a %. Now instead of looking for anything but four lines, I need only look for anything but %, or [^%], and then a *, which is very similar to + but means “zero or more of the preceding.” The final regex (which should all be on one line) is:
(?is)^[0-9]+\.[^%]*(wom[ae]n|lady|girl|princess|queen|daughter|sister|mother|wi[vf]e)[^%]*
In English: Set the options to case-insensitive and search for matches across lines. Match a number at the beginning of the line with one or more digits, and then a period. Then optionally match some text, as long as it isn’t a “%.” Then match any word in my list referring to women. Then optionally match more text, as long as it isn’t a “%.”
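The whole procedure can be sketched in Python as follows. Several things here are my assumptions rather than part of the original workflow: the miniature three-section “book” is made up, line breaks are taken to be \n rather than \r, four blank lines are taken to mean five line breaks in a row, the text is assumed to contain no % of its own, and re.MULTILINE is added so that ^ matches at the start of each line.

```python
import re

WOMEN = r"(wom[ae]n|lady|girl|princess|queen|daughter|sister|mother|wi[vf]e)"
PATTERN = r"(?is)^[0-9]+\.[^%]*" + WOMEN + r"[^%]*"

# Made-up miniature "book": three sections separated by four blank lines.
text = (
    "1. OF KING OLAF.\n"
    "The king sailed north with his men.\n"
    "\n\n\n\n"
    "2. OF THE QUEEN.\n"
    "The queen sent word to her sister.\n"
    "\n\n\n\n"
    "3. OF THE BATTLE.\n"
    "Many men fell, and the earl's daughter mourned.\n"
)

# Step 1: swap each run of four blank lines for a "%" on its own line.
marked = text.replace("\n" * 5, "\n%\n")

# Step 2: collect every section that mentions one of the words.
sections = [m.group(0) for m in re.finditer(PATTERN, marked, flags=re.MULTILINE)]

# The section number is whatever comes before the first period.
numbers = [s.split(".")[0] for s in sections]
print(len(sections), numbers)  # 2 ['2', '3']
```

Section 1 is left out because it mentions only “men,” and the [^%] parts keep each match from spilling over into the next section.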
When I ran this search in TextWrangler with “Find All” it found 278 separate sections referring to women and gave me a list of hits that I could go through, one section at a time.
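As for the multi-character negative match that stumped me above, one possible route is a negative lookahead, a close relative of the lookbehind: check at every character that four blank lines do not begin there. This is only a sketch, under the same assumptions as before (made-up sample text, \n line breaks, four blank lines meaning five line breaks in a row), and I have not tried it in TextWrangler.

```python
import re

WOMEN = r"(wom[ae]n|lady|girl|princess|queen|daughter|sister|mother|wi[vf]e)"

# "Any one character, provided four blank lines do not start at this spot."
NOT_BREAK = r"(?:(?!\n{5}).)"

LOOKAHEAD_PATTERN = r"(?is)^[0-9]+\." + NOT_BREAK + r"*?" + WOMEN + NOT_BREAK + r"*"

# Made-up sample: three sections separated by four blank lines, no "%" needed.
raw_text = (
    "1. OF KING OLAF.\n"
    "The king sailed north with his men.\n"
    "\n\n\n\n"
    "2. OF THE QUEEN.\n"
    "The queen sent word to her sister.\n"
    "\n\n\n\n"
    "3. OF THE BATTLE.\n"
    "Many men fell, and the earl's daughter mourned.\n"
)

found = [m.group(0) for m in re.finditer(LOOKAHEAD_PATTERN, raw_text, flags=re.MULTILINE)]
found_numbers = [s.split(".")[0] for s in found]
print(found_numbers)  # ['2', '3']
```

This avoids the search-and-replace step, but the pattern is much harder to read, so the % trick may still be the better bargain for a rough tally.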
[] – Matches any of the characters in this character class
- – Indicates a range in a character class like 0-9, a-z, or A-Z
^ – Matches beginning of line, but when inside a character class is used for a negative match
| – Matches either what comes before it or what comes after
() – Used to group things together, and can be referred to later by number, e.g. \1
\ – Escapes metacharacters to make them literal; also used for special characters like \r
+ – Matches one or more of the preceding
* – Matches zero or more of the preceding
\r – Matches a carriage return
(?i) – Case-insensitive search (Note: Only some regex applications support this!)
(?s) – Lets a pattern match across lines (Note: May not be supported in your application)
(?is) – Options can be combined like this
Image: Christian Krohg Illustration for Olav Tryggvasons saga, Heimskringla 1899-edition.