Robert Elder Software Inc.

Why The 'sort' Command Is A Favourite For Librarians

2021-02-20 - By Robert Elder

Introduction

     In this article, I will attempt to convince you that the 'sort' command is worth learning.  The 'sort' command is no doubt the favourite Unix command of librarians, and by the end of this article, it should be clear why!  There are a number of other Linux/Unix commands that only work correctly with input that has been pre-sorted, such as the 'uniq' command or the 'comm' command.  Piping the output of 'sort' into these commands can make for quick and easy solutions to many of your text-processing problems.

The Simplest Sort Example

     Let's do a quick example to see how useful the sort command can be.  Here is a text file 'best-novels.txt' that contains some random books from this list of 100 best novels on Wikipedia:

Tropic of Cancer	Henry Miller	1934
Housekeeping	Marilynne Robinson	1981
Deliverance	James Dickey	1970
The Sun Also Rises	Ernest Hemingway	1926
The Great Gatsby	F. Scott Fitzgerald	1925
The Corrections	Jonathan Franzen	2001
The Berlin Stories	Christopher Isherwood	1946
Call It Sleep	Henry Roth	1935
Slaughterhouse-Five	Kurt Vonnegut	1969
Light in August	William Faulkner	1932

     These books have been listed in the file 'best-novels.txt' in a random order, but we want them to be listed in alphabetical order.  We can quickly see them sorted using this command:

sort best-novels.txt

     And this will immediately sort all of the lines in the file, then print them to the terminal:

Call It Sleep	Henry Roth	1935
Deliverance	James Dickey	1970
Housekeeping	Marilynne Robinson	1981
Light in August	William Faulkner	1932
Slaughterhouse-Five	Kurt Vonnegut	1969
The Berlin Stories	Christopher Isherwood	1946
The Corrections	Jonathan Franzen	2001
The Great Gatsby	F. Scott Fitzgerald	1925
The Sun Also Rises	Ernest Hemingway	1926
Tropic of Cancer	Henry Miller	1934

Sort In Reverse

     You can also sort the lines in reverse order with the following command:

sort -r best-novels.txt

     which will output the following:

Tropic of Cancer	Henry Miller	1934
The Sun Also Rises	Ernest Hemingway	1926
The Great Gatsby	F. Scott Fitzgerald	1925
The Corrections	Jonathan Franzen	2001
The Berlin Stories	Christopher Isherwood	1946
Slaughterhouse-Five	Kurt Vonnegut	1969
Light in August	William Faulkner	1932
Housekeeping	Marilynne Robinson	1981
Deliverance	James Dickey	1970
Call It Sleep	Henry Roth	1935

Sorting Numbers - Lexical Vs. Numerical Ordering

     You can also use the 'sort' command to sort numbers.  Here's a file 'some-numbers.txt' that contains some sample numbers:

16454
6123
10538
9446
23666
21749
101
6812

     If we use the following sort command on this file:

sort some-numbers.txt

     we'll get the following result:

101
10538
16454
21749
23666
6123
6812
9446

     which probably isn't what you were expecting, since the numbers aren't sorted in numerically ascending order.  This is because the sort command is defaulting to lexical sorting (used by librarians) rather than numerical sorting.  To force numerical sorting, you can use the '-n' flag:

sort -n some-numbers.txt

     and now the result will be:

101
6123
6812
9446
10538
16454
21749
23666

     which shows the numbers in numerically ascending order.
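
     To see why lexical ordering puts '10538' ahead of '6123', it can help to shrink the problem down to just two numbers.  Here's a minimal sketch (the inline 'printf' input is made up for illustration, and 'LC_ALL=C' is only there to keep the comparison rules predictable):

```shell
# Lexical sorting compares character by character, so '10' sorts
# before '9' because the character '1' comes before '9'.
lexical=$(printf '9\n10\n' | LC_ALL=C sort)

# Numerical sorting compares the values 9 and 10 instead.
numeric=$(printf '9\n10\n' | LC_ALL=C sort -n)

printf 'lexical:\n%s\n' "$lexical"
printf 'numeric:\n%s\n' "$numeric"
```

     The lexical result lists '10' first, while the numeric result lists '9' first.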

Sorting By Column Offset

     Whenever we run the sort command like this:

sort best-novels.txt

     this will perform the sort comparisons using the entire line, which mixes the author together with the book title.  In our use case, we can explicitly sort on the text starting at the first column using the following command (this probably isn't what you want, see next section!):

sort -t $'\t' -k 1 best-novels.txt

     In order to split up the columns, we need to use the '-t' flag to specify the column delimiter.  In this case the $'\t' is a special syntax that is used in bash to specify a literal tab character.  If you were delimiting columns with a space, you'd do this:

sort -t ' ' -k 1 best-novels.txt

     or with a comma, you'd do this:

sort -t ',' -k 1 best-novels.txt

     If you want to start the comparison at the second or third column (also probably not exactly what you really want), you'd do this:

sort -t $'\t' -k 2 best-novels.txt
sort -t $'\t' -k 3 best-novels.txt
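
     One caveat: the $'\t' syntax is a bash feature (some other shells support it too), so it may not be available in a plain POSIX 'sh' script.  A possible workaround, sketched here with a made-up two-line input, is to let 'printf' produce the tab character and store it in a variable (the variable names 'tab' and 'by_second' are my own choices):

```shell
# Build a literal tab character without relying on bash's $'\t' syntax.
tab=$(printf '\t')

# Sort a small tab-delimited sample, starting the comparison at column 2.
by_second=$(printf 'b\t2\na\t1\n' | LC_ALL=C sort -t "$tab" -k 2)
printf '%s\n' "$by_second"
```

     The line ending in '1' should come out ahead of the line ending in '2'.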

Be Careful! Column Sorting Isn't Intuitive

     BUT, the '-k' flag with the sort command is easy to misunderstand!  The correct way to sort based on a specific individual column (and only that column) is to use the '-k' flag like this:

sort -t $'\t' -k 1,1 best-novels.txt

     The -k flag can be easy to misuse since it actually expects you to specify a starting and an ending column, not just a column number.  If you only specify one column number, the 'ending' column is assumed to be the end of the line!  This is a very poor choice of default IMHO, but it's standard behaviour now.
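
     Here's a small sketch of how that default can change the result.  The two comma-separated lines are made up so that they tie on the second column, and 'LC_ALL=C' is only there to keep the comparison rules predictable:

```shell
# '-k 2' compares from column 2 to the end of the line, so the keys
# are 'b,2' and 'b,1', and the third column ends up breaking the tie.
from_col2=$(printf 'x,b,2\ny,b,1\n' | LC_ALL=C sort -t ',' -k 2)

# '-k 2,2' compares only column 2.  Both keys are 'b', so sort falls
# back to comparing the entire lines, and 'x...' comes before 'y...'.
only_col2=$(printf 'x,b,2\ny,b,1\n' | LC_ALL=C sort -t ',' -k 2,2)

printf -- '-k 2:\n%s\n' "$from_col2"
printf -- '-k 2,2:\n%s\n' "$only_col2"
```

     With '-k 2' the line 'y,b,1' comes first, but with '-k 2,2' the line 'x,b,2' comes first.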

     When it comes to sorting on columns it can get hard to understand what's going on, but fortunately, the GNU implementation of the sort command also includes the --debug flag to help you debug what the sorting process is actually looking at.  It will underline the parts of the line that are actually considered in the comparison.  Let's see what the --debug shows us with this basic sort command:

sort --debug best-novels.txt

     and the result is:

Call It Sleep>Henry Roth>1935
_____________________________
Deliverance>James Dickey>1970
_____________________________
Housekeeping>Marilynne Robinson>1981
____________________________________
Light in August>William Faulkner>1932
_____________________________________
Slaughterhouse-Five>Kurt Vonnegut>1969
______________________________________
The Berlin Stories>Christopher Isherwood>1946
_____________________________________________
The Corrections>Jonathan Franzen>2001
_____________________________________
The Great Gatsby>F. Scott Fitzgerald>1925
_________________________________________
The Sun Also Rises>Ernest Hemingway>1926
________________________________________
Tropic of Cancer>Henry Miller>1934
__________________________________

     As you can see, it's obviously underlining the whole of every line.  Let's see what happens when we try to sort based on the first column as shown above:

sort -t $'\t' -k 1 --debug best-novels.txt

     and the result is:

Call It Sleep>Henry Roth>1935
_____________________________
_____________________________
Deliverance>James Dickey>1970
_____________________________
_____________________________
Housekeeping>Marilynne Robinson>1981
____________________________________
____________________________________
...trimmed for space...

     Aha!  You can see clearly now that it's still underlining the whole of every line, so clearly '-k 1' might not be doing what you want.  There are also two underlines for each line, which is something that we'll come back to later.  Let's see what happens if we use '-k 2':

sort -t $'\t' -k 2 --debug best-novels.txt

     and the result is:

The Berlin Stories>Christopher Isherwood>1946
                   __________________________
_____________________________________________
The Sun Also Rises>Ernest Hemingway>1926
                   _____________________
________________________________________
The Great Gatsby>F. Scott Fitzgerald>1925
                 ________________________
_________________________________________
...trimmed for space...

     As you can see above, the underline now starts under the second column and then goes to the end of the line.  That's because the number we specified after the -k flag is treated as the starting column to begin the comparison on.

     Now, let's perform the sort on only the first column.  To do this, we specify a starting column and an ending column like this:

sort -t $'\t' -k 1,1 --debug best-novels.txt

     which produces this result:

Call It Sleep>Henry Roth>1935
_____________
_____________________________
Deliverance>James Dickey>1970
___________
_____________________________
Housekeeping>Marilynne Robinson>1981
____________
____________________________________
...trimmed for space...

     As you can see, the first pass of the sorting comparison considers only the first column (column 1 to column 1).  We can do the same thing with column two like this:

sort --debug  -t $'\t' -k 2,2 best-novels.txt

     which produces this result:

The Berlin Stories>Christopher Isherwood>1946
                   _____________________
_____________________________________________
The Sun Also Rises>Ernest Hemingway>1926
                   ________________
________________________________________
The Great Gatsby>F. Scott Fitzgerald>1925
                 ___________________
_________________________________________
...trimmed for space...

     Above, we can see that the first pass of the comparison will now only consider the author.  We can do this same thing with the year column:

sort --debug  -t $'\t' -k 3,3 best-novels.txt

     which produces this result:

The Great Gatsby>F. Scott Fitzgerald>1925
                                     ____
_________________________________________
The Sun Also Rises>Ernest Hemingway>1926
                                    ____
________________________________________
Light in August>William Faulkner>1932
                                 ____
_____________________________________
...trimmed for space...

     But wait a minute, what's with that extra underline that appears on every single line?  I'm glad you asked, because the answer has to do with sorting stability which we'll discuss in the next section.

Sorting Stability

     Sorting stability describes whether a sorting algorithm preserves the original relative order of input items that compare as equal.  A 'stable' sort keeps such 'tied' items in the same order in which they appeared in the input, while an unstable sort is free to reorder them.

     So, does the 'sort' command provide 'stable' sorting?  Well, it's not included in the POSIX standard, but the standard does note that many implementations do include a '-s' flag that provides stable sorting.  The documentation notes that the '-s' flag disables 'last-resort comparisons', which effectively makes the sorting 'stable'.  It also gets rid of that extra mysterious underline that we saw in the last section:

sort --debug  -t $'\t' -k 1,1 -s best-novels.txt

     will now produce this output:

Call It Sleep>Henry Roth>1935
_____________
Deliverance>James Dickey>1970
___________
Housekeeping>Marilynne Robinson>1981
____________
Light in August>William Faulkner>1932
_______________
...trimmed for space...

sort --debug  -t $'\t' -k 2,2 -s best-novels.txt

     will now produce this output:

The Berlin Stories>Christopher Isherwood>1946
                   _____________________
The Sun Also Rises>Ernest Hemingway>1926
                   ________________
The Great Gatsby>F. Scott Fitzgerald>1925
                 ___________________
Tropic of Cancer>Henry Miller>1934
                 ____________
...trimmed for space...

sort --debug  -t $'\t' -k 3,3 -s best-novels.txt

     will now produce this output:

The Great Gatsby>F. Scott Fitzgerald>1925
                                     ____
The Sun Also Rises>Ernest Hemingway>1926
                                    ____
Light in August>William Faulkner>1932
                                 ____
Tropic of Cancer>Henry Miller>1934
                              ____
...trimmed for space...

     Let's review an example where stable sorting makes a difference.  Here are the contents of a file called 'stable-sort-example.csv':

abc,789,hello
abc,123,hello
def,123,hello
def,456,hello
abc,456,hello

     If we run this command to sort the data based on the first column without '-s' for stable sorting:

sort -t ',' -k 1,1 stable-sort-example.csv

     the result is the following:

abc,123,hello
abc,456,hello
abc,789,hello
def,123,hello
def,456,hello

     But, if we run this command again with the '-s' flag:

sort -t ',' -k 1,1 -s stable-sort-example.csv

     we get the following:

abc,789,hello
abc,123,hello
abc,456,hello
def,123,hello
def,456,hello

     As you can see from above, these two results are different because the '-s' flag will disable the 'last resort comparison' which would have defaulted to sorting the entire lines any time two lines have an identical value in the first column.
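
     One practical use of stable sorting is chaining two sorts together: sort by the secondary column first, then do a stable sort on the primary column so that the earlier ordering survives inside each group.  Here's a minimal sketch with made-up two-column data ('-s' is not part of POSIX, so this assumes an implementation that supports it, such as GNU sort):

```shell
# Pass 1: order by the number in column 2, largest first.
# Pass 2: stable sort on column 1, which keeps the pass 1 ordering
#         for lines that share the same first column.
chained=$(printf 'abc,123\ndef,456\nabc,789\ndef,123\n' \
  | LC_ALL=C sort -t ',' -k 2,2nr \
  | LC_ALL=C sort -t ',' -k 1,1 -s)

printf '%s\n' "$chained"
```

     The 'abc' lines come first and the 'def' lines second, with the numbers in descending order inside each group.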

Multiple Sort Columns At Once

     You can explicitly define sort orders for all columns by using the -k flag multiple times.  Here is an example that will sort our list of novels first according to the book year, then according to author, and finally by the book title:

sort -t $'\t' -k 3,3 -k 2,2 -k 1,1 -s best-novels.txt

     However, the above command has an issue related to the lexical vs. numerical sorting that we saw previously.  The year column (the last column in the file, but our first sort key) won't be sorted in numerically ascending order unless we tell it to, so a book published in a year with only three digits that happens to start with a large digit (like '823') would end up at the end of the list instead of at the start:

Some Really Old Book	Really Old Guy	823

     If we try our sort command on our novels list with this extra book, we'll get this:

The Great Gatsby	F. Scott Fitzgerald	1925
The Sun Also Rises	Ernest Hemingway	1926
Light in August	William Faulkner	1932
Tropic of Cancer	Henry Miller	1934
Call It Sleep	Henry Roth	1935
The Berlin Stories	Christopher Isherwood	1946
Slaughterhouse-Five	Kurt Vonnegut	1969
Deliverance	James Dickey	1970
Housekeeping	Marilynne Robinson	1981
The Corrections	Jonathan Franzen	2001
Some Really Old Book	Really Old Guy	823

     We can fix this by adding the 'n' flag, but only for the 3rd column:

sort -t $'\t' -k 3,3n -k 2,2 -k 1,1 -s best-novels.txt

     and now the result will be this:

Some Really Old Book	Really Old Guy	823
The Great Gatsby	F. Scott Fitzgerald	1925
The Sun Also Rises	Ernest Hemingway	1926
Light in August	William Faulkner	1932
Tropic of Cancer	Henry Miller	1934
Call It Sleep	Henry Roth	1935
The Berlin Stories	Christopher Isherwood	1946
Slaughterhouse-Five	Kurt Vonnegut	1969
Deliverance	James Dickey	1970
Housekeeping	Marilynne Robinson	1981
The Corrections	Jonathan Franzen	2001
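
     Flags attached to a key like this only affect that one key, so you can mix orderings within a single command.  As a small sketch (using made-up comma-separated data rather than the novels file), this sorts the first column alphabetically and the second column numerically in descending order:

```shell
# Key 1: column 1, plain lexical order.
# Key 2: column 2, numeric ('n') and reversed ('r'), for this key only.
mixed=$(printf 'def,456\nabc,123\nabc,789\ndef,1000\n' \
  | LC_ALL=C sort -t ',' -k 1,1 -k 2,2nr)

printf '%s\n' "$mixed"
```

     Within the 'def' group, '1000' lands ahead of '456', which would not happen under a reversed lexical comparison.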

The Dangers of Sorting Collations

     It's common to assume that the final sort ordering of lines in a file would only depend on the actual content of the file itself.  But is that assumption correct?  Nope!  Consider the file 'unicode-example.txt' containing the following text:

A B C
Abc
A b c

     On my machine, if I run this sort command:

sort unicode-example.txt

     I get the following result:

A B C
Abc
A b c

     However, if I set the environment variable 'LC_ALL' to have the value 'C' when running the sort command, like this:

LC_ALL=C sort unicode-example.txt

     then I get this result:

A B C
A b c
Abc

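     which is obviously different.  The difference comes down to the collation rules of the active locale:  with 'LC_ALL=C', lines are compared byte by byte, while other locales (such as 'en_US.UTF-8') can apply language-aware rules along the lines of the unicode collation algorithm, which may give little or no weight to spaces and letter case.  You can also read more about the environment variables that affect sorting by checking the man page for 'setlocale':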

man setlocale
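
     If you need a script to produce the same ordering on every machine, one common approach is to pin the locale for just the sort command.  A quick sketch with made-up input shows the byte-by-byte ordering you get from the 'C' locale, where all uppercase ASCII letters come before all lowercase ones:

```shell
# In the C locale, comparison is by byte value: 'A' (65) and 'B' (66)
# come before 'a' (97) and 'b' (98).
byte_order=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)

printf '%s\n' "$byte_order"
```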

Sort By Random

     Another useful feature of the sort command is the '-R' flag, which will sort the lines in the file by 'random':

sort -R best-novels.txt

     Each time you run this command, it will output the lines in a different order.  This can be very useful for generating test cases, or any situation where you need to purposefully mix up data, such as for producing an unbiased class list (although don't expect it to be cryptographically unbiased).
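
     One subtlety worth knowing: as far as I understand the GNU implementation, '-R' sorts on a random hash of each key, so identical lines always end up next to each other instead of being scattered.  The sketch below (made-up input, and it assumes your 'sort' supports '-R' at all) checks this by counting the runs of identical lines after shuffling:

```shell
# Shuffle five lines that contain only two distinct values.
shuffled=$(printf 'a\nb\na\nb\na\n' | sort -R)

# 'uniq' collapses adjacent duplicates, so if identical lines are always
# grouped together, exactly two runs remain.
runs=$(printf '%s\n' "$shuffled" | uniq | wc -l)

printf '%s\n' "$shuffled"
printf 'runs of identical lines: %s\n' "$runs"
```

     If you need duplicates to be scattered too, GNU coreutils also ships a separate 'shuf' command that may be a better fit.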

Find Unique Lines In File

     The 'sort' command also supports the '-u' flag which will print out the unique set of lines with duplicates removed (which is, confusingly, a completely different behaviour from using the '-u' flag with the 'uniq' command!):

sort -u best-novels.txt
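
     To make the difference between the two '-u' flags concrete, here's a small sketch with made-up input in which the line 'a' appears twice and the line 'b' appears once:

```shell
# 'sort -u' keeps one copy of every distinct line.
sort_u=$(printf 'a\nb\na\n' | LC_ALL=C sort -u)

# 'uniq -u' keeps only the lines that were never repeated.
uniq_u=$(printf 'a\nb\na\n' | LC_ALL=C sort | uniq -u)

printf 'sort -u:\n%s\n' "$sort_u"
printf 'uniq -u:\n%s\n' "$uniq_u"
```

     So 'sort -u' prints both 'a' and 'b', while 'uniq -u' prints only 'b'.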

Using Sort With Other Utilities

     The sort command also pairs well with several other common Unix utilities, some of which require their input to be sorted first.  For example, you can use the sort command together with the head or tail command to extract a subset of items at the start or end of the sorted output.  If we want to find the 3 oldest books from our original novels list, we can use this command:

sort -t $'\t' -k 3,3n -k 2,2 -k 1,1 -s best-novels.txt | head -n 3

     and the result will be:

The Great Gatsby	F. Scott Fitzgerald	1925
The Sun Also Rises	Ernest Hemingway	1926
Light in August	William Faulkner	1932

     To find the newest 3 books, you could use the tail command like this:

sort -t $'\t' -k 3,3n -k 2,2 -k 1,1 -s best-novels.txt | tail -n 3

     and the result will be:

Deliverance	James Dickey	1970
Housekeeping	Marilynne Robinson	1981
The Corrections	Jonathan Franzen	2001

     But the ordering of the list doesn't have the newest ones at the top. You could fix this by using 'head' again and changing the sort order to reverse from newest to oldest:

sort -t $'\t' -k 3,3rn -k 2,2 -k 1,1 -s best-novels.txt | head -n 3

     and the result will be:

The Corrections	Jonathan Franzen	2001
Housekeeping	Marilynne Robinson	1981
Deliverance	James Dickey	1970

     Another common use case for the sort command is to use it in combination with the 'uniq' command, since the 'uniq' command expects its input to be sorted.  There is some overlap between the features of the 'sort' command and the 'uniq' command (because of sort's '-u' flag), but 'uniq' also has a few extra useful features.  The '-u' flag with sort works like this:

sort -u best-novels.txt

     but you could do this to get the same result:

sort best-novels.txt | uniq

     The 'uniq' command also supports a flag that will show you only the set of lines that were duplicated in the file (which is very useful in cases where you're merging data):

sort best-novels.txt | uniq -d

     Another useful flag with 'uniq' is to find the counts for the number of times each line appears in a file:

sort best-novels.txt | uniq -c
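
     A common follow-up is to sort a second time so that the most frequent lines come first.  Here's a sketch with made-up input; the second sort uses '-rn' to order the counts numerically from largest to smallest:

```shell
# Count each distinct line, then rank the counts from high to low.
ranked=$(printf 'b\na\nb\nc\nb\na\n' | sort | uniq -c | sort -rn)

printf '%s\n' "$ranked"
```

     Here 'b' (3 occurrences) is listed first, followed by 'a' (2) and 'c' (1).  The exact padding in front of each count can vary between implementations.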

     And finally, another useful command-line tool that expects pre-sorted data is the 'comm' command.  You can use this command to find the set intersections, unions, and complements of the lines in files or streams:

comm -13 <(sort best-novels.txt) <(sort best-novels2.txt)
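
     The numbers after the dash tell 'comm' which of its three output columns to hide (1 = lines only in the first file, 2 = lines only in the second file, 3 = lines in both).  Here's a self-contained sketch that uses two small made-up files in a temporary directory instead of the novels lists, with 'LC_ALL=C' applied to both 'sort' and 'comm' so that they agree on the ordering:

```shell
dir=$(mktemp -d)
printf 'b\na\nc\n' | LC_ALL=C sort > "$dir/first.txt"   # a b c
printf 'c\nd\nb\n' | LC_ALL=C sort > "$dir/second.txt"  # b c d

# Hide columns 2 and 3: lines that appear only in the first file.
only_first=$(LC_ALL=C comm -23 "$dir/first.txt" "$dir/second.txt")
# Hide columns 1 and 3: lines that appear only in the second file.
only_second=$(LC_ALL=C comm -13 "$dir/first.txt" "$dir/second.txt")
# Hide columns 1 and 2: lines that appear in both files.
in_both=$(LC_ALL=C comm -12 "$dir/first.txt" "$dir/second.txt")

printf 'only in first: %s\n' "$only_first"
printf 'only in second: %s\n' "$only_second"
printf 'in both:\n%s\n' "$in_both"

rm -r "$dir"
```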

Closing Thoughts

     The 'sort' command is a great tool to have at your disposal.  Since it comes built-in on most *nix distributions, it's always there when you need it, and it sure beats writing a from-scratch C program to do the sorting instead!  Whether you're an accomplished sysadmin, or an aspiring librarian, the 'sort' command is sure to make you more productive at your job.

     And that's why the 'sort' command is my favourite Linux command.

© 2025 Robert Elder Software Inc.