I have a script that compiles and sorts museum collection data. One of the data fields is for museum registration codes, and a plain sort doesn't understand that humans expect "KS.99", for example, to come before "KS.100":
My script instead uses the magic "-V" option for GNU sort, and gets the human-friendly result:
Behind the magic trick
"V" is short for "version" (the long option is "--version"). As it says in the GNU coreutils manual, the purpose is to:
Sort by version name and number, lowest first. It behaves like a default sort, except that each sequence of decimal digits is treated numerically as an index/version number.
The "-V" trick also works if the numbers are in the middle of a string:
and it gives the order that humans would expect even when there are numbers embedded in several places in the string:
You can also use sort -V on particular fields in a sort:
Limits to the magic
The GNU developers warn that version-sorting looks for strings built like this:
where "suffix" matches the regular expression
and that some strings may not sort as expected. I haven't had a problem with my museum registration codes, though, even for "sub-sample" codes:
Another warning is that non-numeric "prefix" strings are sorted using C-style collation, with uppercase letters before lowercase. This could indeed be a problem:
Sorting by computer is horribly complicated. I'm very grateful to the GNU folk for sort's "-V" option, and just wish there was something equally straightforward for "version sorting" in Gnumeric and LibreOffice Calc spreadsheets. The simplest way I know to get the expected result in those two programs is to split the field to be sorted into non-numeric and numeric pieces (for example, "KS.99" becomes "KS" and "99"), sort the numbers, then join the pieces together again. There's a "natural sort" option in Calc, but it doesn't work for some strings and is still a little buggy.
Top image: sorting crabs in Scotland; public domain, Wikimedia Commons