Intro To 'comm' Command In Linux
2023-09-06 - By Robert Elder
I use the 'comm' command to find all of the lines that are common between two files:
comm -12 A.txt B.txt
b
c
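The contents of 'A.txt' and 'B.txt' aren't shown above, so here is a minimal sketch with assumed contents (the letters a to d are hypothetical) that would produce the same two lines of output:

```shell
# Hypothetical inputs: the article does not show A.txt or B.txt,
# so these contents are assumed purely for illustration.
workdir=$(mktemp -d)
printf '%s\n' a b c > "$workdir/A.txt"
printf '%s\n' b c d > "$workdir/B.txt"

# -1 hides lines found only in A.txt and -2 hides lines found only in B.txt,
# which leaves just the lines that appear in both files.
common=$(comm -12 "$workdir/A.txt" "$workdir/B.txt")
printf '%s\n' "$common"

rm -r "$workdir"
```

Both assumed files are already in sorted order, so no 'sort' step is needed here.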
Things You Can Do With The 'comm' Command
- Using 'comm' To Compute Only Set A
- Using 'comm' To Compute Only Set B
- Using 'comm' To Compute A \ B (Set Subtraction)
- Using 'comm' To Compute B \ A (Set Subtraction)
- Using 'comm' To Compute A ∩ B (Intersection)
- Using 'comm' To Compute A ∪ B (Union)
- Using 'comm' To Compute (A ∪ B) ∖ (A ∩ B) (Disjunctive Union)
- Using 'comm' To Compute ∅ (Empty Set)
- Understanding The 'comm' Command
- Avoid Tab Indenting
- Input Must Be Sorted
Example Use Cases
In the next few sections, we'll review some example use cases of the 'comm' command that make use of two files, 'plants.txt' and 'foods.txt'.
The file 'plants.txt' contains this list of plants:
Oak Tree
Corn
Poison Ivy
Potato
Wheat
Grass
and this file, 'foods.txt', contains a list of foods:
Wheat
Corn
Potato
Milk
Fish
Energy Drinks
NOTE: In the examples below, it is assumed that your input does not contain tab characters. See the section Avoid Tab Indenting for a special note on this topic.
Using 'comm' To Compute Only Set A
This 'comm' command will show only the list of plants:
# All Plants: Only Set A
comm -2 <(sort plants.txt) <(sort foods.txt) | tr -d '\t'
Corn
Grass
Oak Tree
Poison Ivy
Potato
Wheat
Using 'comm' To Compute Only Set B
This command will show only the list of foods:
# All Foods: Only Set B
comm -1 <(sort plants.txt) <(sort foods.txt) | tr -d '\t'
Corn
Energy Drinks
Fish
Milk
Potato
Wheat
Using 'comm' To Compute A \ B (Set Subtraction)
This will show plants that are not foods:
# Plants that are not foods: (A ∖ B) Set Subtraction
comm -23 <(sort plants.txt) <(sort foods.txt) | tr -d '\t'
Grass
Oak Tree
Poison Ivy
Using 'comm' To Compute B \ A (Set Subtraction)
This will show foods that are not plants:
# Foods that are not plants: (B ∖ A) Set Subtraction
comm -13 <(sort plants.txt) <(sort foods.txt) | tr -d '\t'
Energy Drinks
Fish
Milk
Using 'comm' To Compute A ∩ B (Intersection)
This command shows items that are both plants and foods:
# Plants that are also foods: (A ∩ B), Intersection
comm -12 <(sort plants.txt) <(sort foods.txt) | tr -d '\t'
Corn
Potato
Wheat
Using 'comm' To Compute A ∪ B (Union)
This command shows items that are plants or foods:
# Items that are either plants or foods: (A ∪ B), Union
comm <(sort plants.txt) <(sort foods.txt) | tr -d '\t'
Corn
Energy Drinks
Fish
Grass
Milk
Oak Tree
Poison Ivy
Potato
Wheat
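As a sanity check, and assuming neither file contains duplicate lines, the union above should match what 'sort -u' produces when it is given both files at once. Here is a rough sketch that recreates the two example files in a temporary directory (the variable names are just for illustration) and sorts into temporary files so that it also runs in shells without the '<(...)' syntax:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'Oak Tree' Corn 'Poison Ivy' Potato Wheat Grass > "$workdir/plants.txt"
printf '%s\n' Wheat Corn Potato Milk Fish 'Energy Drinks' > "$workdir/foods.txt"

# comm needs sorted input, so write sorted copies first.
sort "$workdir/plants.txt" > "$workdir/plants.sorted"
sort "$workdir/foods.txt" > "$workdir/foods.sorted"

# Union via comm (tabs stripped) versus union via sort -u.
union_comm=$(comm "$workdir/plants.sorted" "$workdir/foods.sorted" | tr -d '\t')
union_sort=$(sort -u "$workdir/plants.txt" "$workdir/foods.txt")

if [ "$union_comm" = "$union_sort" ]; then
    echo "union matches sort -u"
else
    echo "union differs from sort -u"
fi

rm -r "$workdir"
```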
Using 'comm' To Compute (A ∪ B) ∖ (A ∩ B) (Disjunctive Union)
This will show all plants or foods that are not both plants and foods:
# Items that are either plants or foods, but not both: (A ∪ B) ∖ (A ∩ B), Disjunctive Union
comm -3 <(sort plants.txt) <(sort foods.txt) | tr -d '\t'
Energy Drinks
Fish
Grass
Milk
Oak Tree
Poison Ivy
Using 'comm' To Compute ∅ (Empty Set)
This will always show the empty set:
# This will always produce the empty Set, ∅
comm -123 <(sort plants.txt) <(sort foods.txt) | tr -d '\t'
(no output)
Understanding The 'comm' Command
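Each number that's supplied to the 'comm' command corresponds to an area in the Venn diagram that will be suppressed from the final output: '-1' suppresses lines found only in the first file, '-2' suppresses lines found only in the second file, and '-3' suppresses lines found in both files.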
This explains why 'comm -123' will always show no output. Including '-123' means to suppress all three regions in the Venn diagram.
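To make the mapping between flags and regions concrete, here is a small sketch that suppresses two columns at a time so that exactly one region is left. It recreates the example files in a temporary directory, and the variable names are my own for illustration:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'Oak Tree' Corn 'Poison Ivy' Potato Wheat Grass | sort > "$workdir/plants.sorted"
printf '%s\n' Wheat Corn Potato Milk Fish 'Energy Drinks' | sort > "$workdir/foods.sorted"

# Suppressing two of the three columns leaves a single region.
only_plants=$(comm -23 "$workdir/plants.sorted" "$workdir/foods.sorted")  # column 1
only_foods=$(comm -13 "$workdir/plants.sorted" "$workdir/foods.sorted")   # column 2
in_both=$(comm -12 "$workdir/plants.sorted" "$workdir/foods.sorted")      # column 3
# Suppressing all three columns leaves nothing at all.
nothing=$(comm -123 "$workdir/plants.sorted" "$workdir/foods.sorted")

printf 'Column 1 (only in plants):\n%s\n\n' "$only_plants"
printf 'Column 2 (only in foods):\n%s\n\n' "$only_foods"
printf 'Column 3 (in both):\n%s\n\n' "$in_both"
printf 'All suppressed: [%s]\n' "$nothing"

rm -r "$workdir"
```

When only one column is left, 'comm' prints it without any leading tabs, so no 'tr' step is needed in this particular sketch.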
Avoid Tab Indenting
By default, the 'comm' command will tab indent the lines based on the column number that they belong to:
comm <(sort plants.txt) <(sort foods.txt)
		Corn
	Energy Drinks
	Fish
Grass
	Milk
Oak Tree
Poison Ivy
		Potato
		Wheat
In a previous version of this article, I suggested that you can remove this indentation by specifying the '--output-delimiter' flag with an empty delimiter. However, this is not entirely correct! For example, on my machine, if I specify an empty output delimiter, I'll see the following output:
comm --output-delimiter='' <(sort plants.txt) <(sort foods.txt)
Corn
Energy Drinks
Fish
Grass
Milk
Oak Tree
Poison Ivy
Potato
Wheat
Which looks fine upon cursory inspection. However, if you pipe this into 'xxd':
comm --output-delimiter='' <(sort plants.txt) <(sort foods.txt) | xxd
00000000: 0000 436f 726e 0a00 456e 6572 6779 2044 ..Corn..Energy D
00000010: 7269 6e6b 730a 0046 6973 680a 4772 6173 rinks..Fish.Gras
00000020: 730a 004d 696c 6b0a 4f61 6b20 5472 6565 s..Milk.Oak Tree
00000030: 0a50 6f69 736f 6e20 4976 790a 0000 506f .Poison Ivy...Po
00000040: 7461 746f 0a00 0057 6865 6174 0a tato...Wheat.
you will notice from the above output that using --output-delimiter='' doesn't give you an empty string for the delimiter; it uses a single null character instead!
I am not sure if this should be considered a bug in my own GNU Coreutils v8.30 version of the 'comm' command or not. The '--output-delimiter' flag does not appear to be in the POSIX standard for the 'comm' command, so it's not surprising that the behaviour of this flag is a bit less predictable. I checked the source code in an older version of the GNU comm command, and I believe older versions may even issue an error message if you try to specify an empty output delimiter.
This doesn't matter much if you're simply printing results to the terminal, but it's a big deal if you're using 'comm' to perform some kind of fundamental set operation and you want to send the results to another program (for example, even back into the 'comm' command again)! I only noticed this issue after someone prompted me to verify (A ∪ B) ∖ (A ∩ B) using the 'comm' command, or in other words, basically this:
diff <(comm -23 <(comm <(sort plants.txt) <(sort foods.txt)) <(comm -12 <(sort plants.txt) <(sort foods.txt))) <(comm -3 <(sort plants.txt) <(sort foods.txt))
But, predictably, this doesn't produce an empty diff because the 'comm' command adds tabs to some of the lines:
1d0
< Corn
8,9d6
< Potato
< Wheat
Now, if you add --output-delimiter='', it STILL doesn't compare equally:
diff <(comm --output-delimiter='' -23 <(comm --output-delimiter='' <(sort plants.txt) <(sort foods.txt)) <(comm --output-delimiter='' -12 <(sort plants.txt) <(sort foods.txt))) <(comm --output-delimiter='' -3 <(sort plants.txt) <(sort foods.txt))
comm: file 1 is not in sorted order
comm: input is not in sorted order
Binary files /dev/fd/63 and /dev/fd/62 differ
The fact that it says 'Binary files differ' is itself a clue, since the input files are purely ASCII text. If the --output-delimiter flag is omitted entirely, and the output is instead piped through the 'tr' command to delete any tab characters, the result now successfully compares as being identical:
diff <(comm -23 <(comm <(sort plants.txt) <(sort foods.txt) | tr -d '\t') <(comm -12 <(sort plants.txt) <(sort foods.txt) | tr -d '\t') | tr -d '\t') <(comm -3 <(sort plants.txt) <(sort foods.txt) | tr -d '\t')
(no output)
Based on the above, I would suggest not using the '--output-delimiter' flag to remove the indentation and instead removing the tab characters by piping the output through the 'tr' command, as shown in the examples in this article. Of course, this will not work for you if your input contains tab characters, but it is the best general-purpose solution that I can think of for dealing with this unfortunate corner case of the 'comm' command. If your input does contain tab characters, you could try explicitly using a null delimiter and then deleting it using 'tr'.
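Here is a rough sketch of that last idea. It assumes your 'comm' behaves like the GNU Coreutils v8.30 version described above and emits a null character when given an empty '--output-delimiter'; other versions may behave differently or reject the flag. The two input files and their contents are hypothetical and exist only to show that tabs inside the lines survive:

```shell
workdir=$(mktemp -d)
# Hypothetical, already sorted inputs whose lines contain literal tabs.
printf 'apple\tred\nbanana\tyellow\n' > "$workdir/a.txt"
printf 'banana\tyellow\ncherry\tred\n' > "$workdir/b.txt"

# Assumption: an empty --output-delimiter is emitted as a NUL byte
# (the GNU Coreutils v8.30 behaviour shown in the xxd output above).
# Deleting NUL bytes removes the column indentation and keeps the tabs.
union=$(comm --output-delimiter='' "$workdir/a.txt" "$workdir/b.txt" | tr -d '\000')
printf '%s\n' "$union"

rm -r "$workdir"
```

This only helps if your input never contains null characters itself, which is usually a safe bet for text files.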
The 'comm' Command Only Works With Sorted Inputs
The 'comm' command expects both input files to be sorted, and if they're not, the output may be incorrect:
comm -12 --output-delimiter='' plants.txt foods.txt
comm: file 1 is not in sorted order
Wheat
comm: file 2 is not in sorted order
Fortunately, you can use the 'sort' command and the following '<(...)' syntax in bash to redirect a sorted version of the file directly into the 'comm' command like this:
comm -12 <(sort plants.txt) <(sort foods.txt)
Corn
Potato
Wheat
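If you'd rather check whether a file is already sorted before handing it to 'comm', the standard 'sort -c' option exits with a non-zero status when its input is out of order. Here is a minimal sketch that recreates 'plants.txt' in a temporary directory; the variable names are my own, and setting LC_ALL=C is an assumption on my part to keep 'sort' and 'comm' agreeing on the same byte-wise ordering:

```shell
# Use one fixed collation so sort and comm agree on what "sorted" means.
LC_ALL=C
export LC_ALL

workdir=$(mktemp -d)
printf '%s\n' 'Oak Tree' Corn 'Poison Ivy' Potato Wheat Grass > "$workdir/plants.txt"

# sort -c only checks the order; it exits non-zero on the first disorder.
if sort -c "$workdir/plants.txt" 2>/dev/null; then
    plants_state="sorted"
else
    plants_state="unsorted"
fi
echo "plants.txt is $plants_state"

# After sorting into a new file, the same check passes.
sort "$workdir/plants.txt" > "$workdir/plants.sorted"
if sort -c "$workdir/plants.sorted" 2>/dev/null; then
    sorted_state="sorted"
else
    sorted_state="unsorted"
fi
echo "plants.sorted is $sorted_state"

rm -r "$workdir"
```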
And that's why the 'comm' command is my favourite Linux command.