Building a gazetteer table from KML files
Article posted 10/03/2014
Geoscience Australia has a freely downloadable gazetteer of Australian placenames with more than 370 000 entries. The download is a .zip file (83 MB) containing a user guide together with placenames data in GML, KML and Microsoft Access (.mdb) formats.
What I wanted from all that data was a simple table. Each row would have one placename, its latitude, its longitude, and maybe some other useful items, like the State or Territory in which the place was located. After inspecting the downloaded .zip file, I built the table in a few minutes using GNU/Linux shell commands, as explained below. What made it easy was that the KML file format is just plain text, and GNU/Linux shell commands can do almost anything with plain text.
I first extracted the KML files from the .zip archive to an empty folder. There were 19 .kml files totalling 322 MB. In each .kml file, every placename had an entry like this:
Ugly! But the data items I wanted were hidden there, in the lines beginning *<SimpleData name=* and *<coordinates>*.
First I deleted the .kml file for the Australian Antarctic Territory, which I didn't need. Then, navigating to the KML folder in a terminal, I ran the following one-line command:
Let's tease out what happened. The command took each of the .kml files in turn, did something to the contents, and wrote out a modified file with new_ prefixed to the old filename.
The grep command pulled out each placename record as a set of lines. For the Australian Capital Territory, islands, Northern Territory and Tasmania .kml files, this intermediate result had five lines per placename and looked something like this:
The New South Wales, Queensland, South Australia, Victoria and Western Australia .kml files were organised by sub-area and had an extra line in each entry, like this:
In both cases, the sed command removed all the markup:
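A minimal sketch of how a grep-then-sed loop of this kind could look is given here. It is not necessarily the exact command used. The sample entry, its field names and its values are invented for illustration, and the loop runs in a temporary folder rather than on the real gazetteer files.

```shell
# Hypothetical sample entry: field names and values are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/sample.kml" <<'EOF'
<Placemark>
<SimpleData name="RECORD_ID">X001</SimpleData>
<SimpleData name="NAME">Example Hill</SimpleData>
<SimpleData name="FEAT_CODE">HILL</SimpleData>
<SimpleData name="STATE_ID">ACT</SimpleData>
<coordinates>149.123456,-35.654321,0</coordinates>
</Placemark>
EOF

# For each .kml file: keep only the lines that begin with the wanted tags,
# delete everything between < and >, and save the result with a new_ prefix.
for f in "$workdir"/*.kml; do
    grep -E '^<(SimpleData name=|coordinates>)' "$f" |
        sed 's/<[^>]*>//g' > "$workdir/new_$(basename "$f")"
done

cat "$workdir/new_sample.kml"
```

The sed expression removes every tag on a line, so only the text that sat between the tags survives.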
For the next step I moved the new_ files into different folders according to whether each placename entry had five lines, or six.
Navigating to the 'five line entry' folder, I ran the following one-line command, which modified each new_... KML file and outputted it as finalnew...:
The paste command used here is particularly elegant. It took the lines in groups of five and placed each five in a single line, separated by tabs:
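The grouping trick can be tried on its own with a few lines of made-up input. In this sketch the function name join_five and the two entries are hypothetical; the only real ingredient is paste with five dashes.

```shell
# Join every five input lines into one tab-separated line.
# Each "-" tells paste to take the next line from standard input.
join_five() {
    paste - - - - -
}

# Two invented five-line entries, for illustration only.
printf '%s\n' \
    X001 'Example Hill' HILL ACT '149.123456,-35.654321,0' \
    X002 'Example Creek' STRM ACT '149.2,-35.5,0' |
    join_five
```

Ten input lines come out as two lines of five tab-separated items each. For six-line entries the same idea applies with six dashes.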
The tr command changed the commas in the last item to tabs:
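As a small sketch, the comma-to-tab step can be tested on a single made-up line. The function name commas_to_tabs and the sample values are hypothetical.

```shell
# Replace every comma on each line with a tab.
commas_to_tabs() {
    tr ',' '\t'
}

# One invented line: four tab-separated items followed by "lon,lat,0".
printf 'X001\tExample Hill\tHILL\tACT\t149.123456,-35.654321,0\n' | commas_to_tabs
```

Note that tr changes every comma it sees, not just those in the last item, so a placename that itself contained a comma would be split as well. It may be worth checking the data for that before relying on this step.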
The AWK command then printed out the first four data items as strings, rounded off the longitude and latitude to four decimal places, swapped their order from longitude first to latitude first, and ignored that last *0* item:
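An AWK step of this kind might be sketched as follows, assuming seven tab-separated fields per line with longitude in field 5, latitude in field 6 and the unwanted 0 in field 7. The function name reorder_and_round and the sample line are hypothetical, and this is not necessarily the exact command used.

```shell
# Print fields 1-4 as strings, then latitude (field 6) and longitude (field 5)
# rounded to four decimal places. Field 7 is simply not printed.
reorder_and_round() {
    awk 'BEGIN { FS = "\t" }
         { printf "%s\t%s\t%s\t%s\t%.4f\t%.4f\n", $1, $2, $3, $4, $6, $5 }'
}

# One invented line for illustration.
printf 'X001\tExample Hill\tHILL\tACT\t149.123456\t-35.654321\t0\n' | reorder_and_round
```

Setting the field separator to a tab matters here, because placenames often contain spaces and AWK would otherwise split on them.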
The one-line command for the .kml files with six-line entries in the 'six line entry' folder was only slightly different:
Finally, I moved all the finalnew... KML files back into a single folder, navigated to that folder in a terminal and merged them all with the command:
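A merge like this can be sketched with cat and a filename pattern. The two input filenames and their contents below are invented for illustration; only the output name Oz_gazetteer comes from this article.

```shell
# Two small made-up finalnew files in a temporary folder.
outdir=$(mktemp -d)
printf 'X001\tExample Hill\tHILL\tACT\t-35.6543\t149.1235\n' > "$outdir/finalnew_a.kml"
printf 'X002\tExample Creek\tSTRM\tNSW\t-35.5000\t149.2000\n' > "$outdir/finalnew_b.kml"

# Concatenate every finalnew file into one table, then count its rows.
cat "$outdir"/finalnew* > "$outdir/Oz_gazetteer"
wc -l < "$outdir/Oz_gazetteer"
```

Counting lines with wc -l afterwards is a quick way to confirm that the merged file has one row per placename.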
The combined text file Oz_gazetteer has 372 833 placenames, one per line, and is 17.8 MB. Each line has Geoscience Australia's unique ID for the placename, the placename itself, the kind of feature that's named (town, hill, stream, etc.), the State or Territory where the place is located, the latitude in decimal degrees to four decimal places, and the longitude ditto.
The GNU/Linux commands work zippety-quick with huge files that would choke a spreadsheet or text editor, and can do their jobs batch-wise on any number of files. Love that shell!