A solution without regular expressions
There are many ways to select columns in delimited text. The easiest ones use GNU awk, cut or Perl. Let's consider the following file:
    [johndoe@ArchLinux]% cat Sample.txt
    abb_158,we1a,1,r_fadf,4
    b_w2c_W,reka,2,ssd*dd,5
    css_vvv,tebi,3,tfw2_1,6
To select the 3rd column one can use GNU awk:
    [johndoe@ArchLinux]% awk -F"," '{print $3}' Sample.txt
    1
    2
    3
or the cut command from the coreutils package:
    [johndoe@ArchLinux]% cut -d',' -f3 Sample.txt
    1
    2
    3
or Perl:
    [johndoe@ArchLinux]% perl -F, -lane 'print $F[2]' Sample.txt
    1
    2
    3
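For comparison with the regular expression based versions, the two simplest commands can be parameterized by delimiter and column number. The following is only a minimal sketch: the function names column_cut and column_awk are made up for illustration, and the sample data is fed on standard input instead of being read from Sample.txt.

```shell
# Hypothetical helpers (the names are not from the article): print column N
# of delimited text read from standard input.

# column_cut DELIM N : uses cut from coreutils
column_cut() {
    cut -d"$1" -f"$2"
}

# column_awk DELIM N : passes the column number to awk as the variable n
column_awk() {
    awk -F"$1" -v n="$2" '{ print $n }'
}

# The same three lines as in Sample.txt
sample='abb_158,we1a,1,r_fadf,4
b_w2c_W,reka,2,ssd*dd,5
css_vvv,tebi,3,tfw2_1,6'

printf '%s\n' "$sample" | column_cut ',' 3
printf '%s\n' "$sample" | column_awk ',' 2
```

The first call prints the 3rd column and the second call prints the 2nd one, which gives a baseline to check the regular expression results against.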
Regular expressions based solution
Now let's consider using GNU sed, GNU grep, Perl and GNU Emacs with regular expressions to do the same task.
GNU Sed
GNU sed is a stream editor. The basic functionality of sed is replacing text that matches a pattern, e.g.:
    [johndoe@ArchLinux]% echo "This is a text which was typed\nusing a text editor." | sed 's/text/message/g'
    This is a message which was typed
    using a message editor.
In the query "s///g", "s" is the substitute command and the trailing "g" flag makes it global, i.e. every match on a line is replaced rather than only the first one. sed applies the command to every input line in either case.
It can also be used for selecting columns in a delimited file. Look-arounds and non-capturing groups are not supported in sed, so to select the 3rd column one needs a regular expression built from capturing groups only:
    [johndoe@ArchLinux]% sed -E 's/^([^,]*,){2}([^,]*)(,.*){0,}$/\2/g' Sample.txt
    1
    2
    3
Explanation of the regular expression in the query:
    Start:
      ^              Beginning of the line
    Following group:
      [^,]           Any character except a comma
      [^,]*          A sequence of any characters except a comma
      [^,]*,         A sequence of any characters except a comma, followed by a comma
      ([^,]*,)       A capturing group matching a sequence of any characters
                     except a comma, followed by a comma
      ([^,]*,){2}    Two repetitions of that capturing group, each of which matches
                     a sequence of any characters except a comma, followed by a
                     comma (the group keeps only its last repetition)
    Following group (our desired one!):
      ([^,]*)        A capturing group matching a sequence of any characters
                     except a comma
    Following group:
      (,.*)          A capturing group matching a sequence of characters which
                     starts with a comma and contains any characters after it
      (,.*){0,}      Zero or more occurrences of that capturing group
    End:
      $              End of the line
A line which matches this pattern is replaced with the second captured group ("\2") of the pattern. If we use "\1" instead of "\2", the first captured group is fed to the output. Because that group is repeated twice, it holds its last repetition, i.e. the 2nd column followed by a comma:
    [johndoe@ArchLinux]% cat Sample.txt
    abb_158,we1a,1,r_fadf,4
    b_w2c_W,reka,2,ssd*dd,5
    css_vvv,tebi,3,tfw2_1,6

    [johndoe@ArchLinux]% sed -E 's/^([^,]*,){2}([^,]*)(,.*){0,}$/\1/g' Sample.txt
    we1a,
    reka,
    tebi,
If we use "\3" instead of "\2", the third captured group is fed to the output:
    [johndoe@ArchLinux]% sed -E 's/^([^,]*,){2}([^,]*)(,.*){0,}$/\3/g' Sample.txt
    ,r_fadf,4
    ,ssd*dd,5
    ,tfw2_1,6
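The three group references can also be inspected in a single run, which helps when adapting the pattern. The sketch below labels each group in the replacement; the function name show_groups is made up for illustration, and the expected line assumes GNU sed, as used throughout this article.

```shell
# Hypothetical helper: label what each capturing group holds for every
# line read from standard input.
show_groups() {
    sed -E 's/^([^,]*,){2}([^,]*)(,.*){0,}$/1=[\1] 2=[\2] 3=[\3]/'
}

# First line of Sample.txt
echo 'abb_158,we1a,1,r_fadf,4' | show_groups
# prints: 1=[we1a,] 2=[1] 3=[,r_fadf,4]
```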
By changing the number of occurrences of the first capturing group, we can choose the column number:
    [johndoe@ArchLinux]% sed -E 's/^([^,]*,){1}([^,]*)(,.*){0,}$/\2/g' Sample.txt
    we1a
    reka
    tebi

    [johndoe@ArchLinux]% sed -E 's/^([^,]*,){4}([^,]*)(,.*){0,}$/\2/g' Sample.txt
    4
    5
    6
We can wrap the query above into a BASH script with a function-like behavior.
How do we pass arguments to a BASH script? It's easy. Note that with "bash -c" the first argument after the command string becomes $0 and the second one becomes $1:
    [johndoe@ArchLinux]% bash -c 'echo $1 $0' FirstArgument SecondArgument
    SecondArgument FirstArgument
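If the shift by one is confusing, a common alternative is to give "bash -c" a throwaway word for $0, so that the real arguments start at $1. This is just a sketch of that convention; the placeholder "_" is arbitrary and the article's own scripts do not use it.

```shell
# With "bash -c", the first word after the command string becomes $0.
# The placeholder "_" fills that slot, so the real arguments are $1, $2 and $3.
bash -c 'echo "file=$1 delimiter=$2 column=$3"' _ Sample.txt ',' 3
# prints: file=Sample.txt delimiter=, column=3
```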
Another aspect to take into account is that the number of columns to skip equals the desired column number minus one, which requires a subtraction. BASH supports arithmetic expansion:
    [johndoe@ArchLinux]% echo "$((2*4))"
    8
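Applied to the column problem, the arithmetic expansion turns a 1-based column number into the repetition count of the skipping group. A small sketch, where the variable names column and repeat are chosen only for illustration:

```shell
column=3                                  # 1-based column number we want
repeat=$((column - 1))                    # number of columns to skip before it
printf '%s\n' "([^,]*,){${repeat}}"       # prints ([^,]*,){2}
```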
Now we can wrap the whole query into a script which accepts the file name as the 1st argument ($0), the delimiter as the 2nd ($1) and the column number as the 3rd ($2):
    [johndoe@ArchLinux]% bash -c 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' Sample.txt "," 1
    abb_158
    b_w2c_W
    css_vvv

    [johndoe@ArchLinux]% bash -c 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' Sample.txt "," 4
    r_fadf
    ssd*dd
    tfw2_1

    [johndoe@ArchLinux]% bash -c 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' Sample.txt "," 5
    4
    5
    6

    [johndoe@ArchLinux]% cat Sample.txt
    abb_158,we1a,1,r_fadf,4
    b_w2c_W,reka,2,ssd*dd,5
    css_vvv,tebi,3,tfw2_1,6
The script should work with any delimiter. Let's generate another file, SampleVBar.txt, which is similar to Sample.txt but uses a vertical bar as the delimiter:
    [johndoe@ArchLinux]% sed 's/,/|/g' Sample.txt > SampleVBar.txt

    [johndoe@ArchLinux]% cat SampleVBar.txt
    abb_158|we1a|1|r_fadf|4
    b_w2c_W|reka|2|ssd*dd|5
    css_vvv|tebi|3|tfw2_1|6
The same script works if the vertical bar is given as the delimiter. It has to be escaped ("\|"), because a bare vertical bar means alternation in extended regular expressions:
    [johndoe@ArchLinux]% bash -c 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' SampleVBar.txt "\|" 1
    abb_158
    b_w2c_W
    css_vvv
Debugging can be done by viewing the executed BASH script with the -x option:
    [johndoe@ArchLinux]% bash -xc 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' SampleVBar.txt "\|" 1
    + sed -E 's/^([^\|]*\|){0}([^\|]*)(\|.*){0,}$/\2/g' SampleVBar.txt
    abb_158
    b_w2c_W
    css_vvv
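For repeated use, the same query may be easier to keep as a shell function than as a "bash -c" one-liner. The following is a minimal sketch under a few assumptions: the name select_column is made up, the delimiter is a single character, and, as above, the caller escapes it when it is special in extended regular expressions.

```shell
# select_column FILE DELIM N
# Hypothetical helper: prints the N-th (1-based) column of FILE.
# DELIM is pasted into the regular expression as is, so a character that is
# special in extended regular expressions must be passed escaped, e.g. '\|'.
select_column() {
    local file=$1 delim=$2 n=$3
    sed -E "s/^([^${delim}]*${delim}){$((n - 1))}([^${delim}]*)(${delim}.*){0,}\$/\\2/" "$file"
}

# Demo on a temporary file with the first two lines of SampleVBar.txt
tmp=$(mktemp)
printf '%s\n' 'abb_158|we1a|1|r_fadf|4' 'b_w2c_W|reka|2|ssd*dd|5' > "$tmp"
select_column "$tmp" '\|' 4     # prints r_fadf and ssd*dd
rm -f "$tmp"
```

Named arguments avoid the $0/$1/$2 shift of "bash -c", and quoting "$file" keeps file names with spaces intact.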
GNU grep
GNU grep with Perl-compatible regular expressions (the -P option) does not support variable length look-behind. The look-behind in grep can only be of fixed length! In the example below, the look-behind matches exactly 7 characters. A variable length quantifier such as {7,9} will NOT work:
    [johndoe@ArchLinux]% grep -P -o '(?<=^([^,]){7}),.{2}' Sample.txt
    ,we
    ,re
    ,te
In recent versions of grep, look-behind can be emulated with the \K operator, which resets the start of the reported match: everything matched before \K is left out of the output. Together with look-ahead it can be used to select columns:
    [johndoe@ArchLinux]% cat Sample.txt
    abb_158,we1a,1,r_fadf,4
    b_w2c_W,reka,2,ssd*dd,5
    css_vvv,tebi,3,tfw2_1,6

    [johndoe@ArchLinux]% grep -P -o '^([^,]*,){4}\K([^,]*)(?=,.*){0,}' Sample.txt
    4
    5
    6
Instead of the positive look-ahead (?=), a non-capturing group (?:) can be used here, since the 5th column is the last one and nothing follows it:
    [johndoe@ArchLinux]% grep -P -o '^([^,]*,){4}\K([^,]*)(?:,.*){0,}' Sample.txt
    4
    5
    6
Basically, \K here discards 4 repetitions of the first capturing group (i.e. 4 columns). Note that there is no end-of-line anchor "$" in the query. Since the look-ahead consumes no characters, "$" could only match right after the last column; leaving it out allows selecting any column, including the 1st:
    [johndoe@ArchLinux]% grep -P -o '^([^,]*,){0}\K([^,]*)(?=,.*){0,}' Sample.txt
    abb_158
    b_w2c_W
    css_vvv
The behavior of the non-capturing group and the look-ahead differs when more columns follow. The non-capturing group consumes the rest of the line, so -o prints it, while the look-ahead only checks it:
    [johndoe@ArchLinux]% grep -P -o '^([^,]*,){2}\K([^,]*)(?:,.*){0,}' Sample.txt
    1,r_fadf,4
    2,ssd*dd,5
    3,tfw2_1,6

    [johndoe@ArchLinux]% grep -P -o '^([^,]*,){2}\K([^,]*)(?=,.*){0,}' Sample.txt
    1
    2
    3
Wrapping it up into a BASH script:
    [johndoe@ArchLinux]% bash -c 'grep -P -o "^([^$1]*$1){$(($2 - 1))}\K([^$1]*)(?=$1.*){0,}" $0' Sample.txt "," 1
    abb_158
    b_w2c_W
    css_vvv

    [johndoe@ArchLinux]% bash -c 'grep -P -o "^([^$1]*$1){$(($2 - 1))}\K([^$1]*)(?=$1.*){0,}" $0' Sample.txt "," 3
    1
    2
    3

    [johndoe@ArchLinux]% bash -c 'grep -P -o "^([^$1]*$1){$(($2 - 1))}\K([^$1]*)(?=$1.*){0,}" $0' Sample.txt "," 5
    4
    5
    6
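The grep query can be kept as a shell function too. This sketch assumes a GNU grep built with PCRE support (-P); the name column_grep is made up. It also drops the look-ahead, on the assumption that it is not strictly needed: [^DELIM]* stops at the next delimiter on its own, and -o prints only the matched part.

```shell
# column_grep FILE DELIM N
# Hypothetical helper, assuming GNU grep with -P support.
# \K leaves the N-1 skipped columns out of the reported match, and
# [^DELIM]* cannot run past the next delimiter, so no look-ahead is used.
# Note: grep -o prints nothing for a line whose selected column is empty.
column_grep() {
    local file=$1 delim=$2 n=$3
    grep -P -o "^([^${delim}]*${delim}){$((n - 1))}\\K[^${delim}]*" "$file"
}

# Demo on a temporary file with the first two lines of Sample.txt
tmp=$(mktemp)
printf '%s\n' 'abb_158,we1a,1,r_fadf,4' 'b_w2c_W,reka,2,ssd*dd,5' > "$tmp"
column_grep "$tmp" ',' 2        # prints we1a and reka
rm -f "$tmp"
```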
Perl
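Perl behaves the opposite way with non-capturing groups and look-ahead, which may look weird at first. The substitution replaces only the text that was consumed: the non-capturing group consumes the rest of the line, so it is dropped, while the look-ahead consumes nothing, so the rest of the line stays in place:
    [johndoe@ArchLinux]% perl -ple 's/^(?:[^,]*,){2}([^,]*)(?:,.*){0,}/$1/g' Sample.txt
    1
    2
    3

    [johndoe@ArchLinux]% perl -ple 's/^(?:[^,]*,){2}([^,]*)(?=,.*){0,}/$1/g' Sample.txt
    1,r_fadf,4
    2,ssd*dd,5
    3,tfw2_1,6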
GNU Emacs
A similar operation, replacing the whole text with a chosen column, can be done in GNU Emacs. Let's recall the content of our file:
    [johndoe@ArchLinux]% cat Sample.txt
    abb_158,we1a,1,r_fadf,4
    b_w2c_W,reka,2,ssd*dd,5
    css_vvv,tebi,3,tfw2_1,6
Now in order to do the replacement:
    Open the file in Emacs.
    Select the whole text using the "C-x h" shortcut.
    Once selected, run "M-x", then choose "replace-regexp".
    Enter the query for the substitution:
    ^\([^,]*,\)\{3\}\([^,]*\)\(,.*\)
    Enter the group number which will replace the matched pattern:
    \2
The resulting text will be:
    r_fadf
    ssd*dd
    tfw2_1
Useful links
Useful One-Line Scripts for Sed (Unix stream editor)