previous article
next article

Why I (sometimes) love regular expressions

Article posted 05/05/2014

by Bob Mesibov

Command line wizards are forever encouraging the rest of us to learn regular expressions (regex). We're told regex is elegant, powerful and incredibly useful.

The downside is that regex can be seriously fiddly. One fly-spot out of place, and the code fails. For this reason I only use 'non-simple' regex when I absolutely have to. But when I do, sigh... Regex really is elegant and powerful, as the following example shows.

I had a 6.7 mb KML text file with more than 1700 placemarks. Here's a typical placemark from the file:

1
Click to enlarge

I needed to shrink the file by doing two things: getting rid of the 'ExtendedData' section in each placemark, and reducing to 4 the number of decimal places in the longitudes and latitudes, e.g., '147.741426249143842' to '147.7414'. With regex, it took just one command:

sed '/<E/,/<\/E/d;s/\(14.\.[0-9]\{4\}\)[0-9]\{9,11\}/\1/g;s/\(-4.\.[0-9]\{4\}\)[0-9]\{9,11\}/\1/g' infile.kml > outfile.kml

See what I mean by fiddly?

What's happening here is that the sed command is modifying 'infile.kml' and saving the modified file as 'outfile.kml', and there are 3 separate sed commands separated by semicolons:

sed 'command1;command2;command3' infile.kml > outfile.kml

The first command is

/<E/,/<\/E/d

This tells sed to look for the pattern '<E' (which begins '<ExtendedData>') and the pattern '</E' (which begins '</ExtendedData>') and to delete (with the 'd' option) those two patterns and anything in between. The backslash in the second pattern ensures that sed doesn't see the '/' in '</E' as part of its instructions; in other words, the '/' in '</E' is escaped with a backslash. You'll see more backslashes below.

Before I explain the next command, notice that each of the latitudes and longitudes in the placemark above is on a single line, and that the number of numerals after the decimal is always either 13, 14 or 15 (that's just how the KML file came to me). Because neither latitudes nor longitudes are split over two lines, sed (which reads line-by-line) can process every figure individually. It was also true in this file that every longitude began with 140-something and every latitude with 40-something.

The second command tells sed to do a substitution (the 's' option) on longitudes. The pattern to be substituted is:

  • '14' followed by a single character ('.', which in this case will be a numeral),
  • then a decimal point ('\.', where the backslash tells sed that the period character is a literal one),
  • then any numeral ('[0-9]') appearing exactly 4 times ('{4}', with the curly braces escaped with backslashes),
  • then any numeral ('[0-9]') appearing 9, 10 or 11 times ('{9,11}', with the curly braces escaped with backslashes)

14.\.[0-9]\{4\}[0-9]\{9,11\}

What sed does after finding this pattern is substitute the same thing, but without those last 9, 10 or 11 numerals. To do that I enclose everything up to the last 9, 10 or 11 numerals in round brackets (escaped with backslashes), and use a back reference ('\1') as shorthand for everything between those round brackets. The command ends with the 'g' option, which tells sed to do this substitution every time it finds the longitude pattern in a line.

s/\(14.\.[0-9]\{4\}\)[0-9]\{9,11\}/\1/g

The third command is just like the second, except that the pattern to be looked for begins with '-4.' ('-4' followed by any character) instead of '14.'.

And here's the result:

2

The regex 'surgery' modified my big file in the blink of an eye. Note that the longitudes and latitudes haven't been rounded to 4 decimal places, just truncated to 4. For my particular purposes, this wasn't important.


About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.

Tags: tutorial regular-expressions scripting linux-shell bash
blog comments powered by Disqus