Using regular expressions with Perl, sed and grep for selecting columns in delimited text

A solution without regular expressions

There are many ways to select columns from delimited text. The simplest ones rely on GNU awk, cut or Perl. Let's consider the following file:

[johndoe@ArchLinux]% cat Sample.txt
abb_158,we1a,1,r_fadf,4
b_w2c_W,reka,2,ssd*dd,5
css_vvv,tebi,3,tfw2_1,6

To select the 3rd column, one can use GNU awk:

[johndoe@ArchLinux]% awk -F"," '{print $3}' Sample.txt
1
2
3

or the cut command from the coreutils package:

[johndoe@ArchLinux]% cut -d',' -f3 Sample.txt
1
2
3

or Perl:

[johndoe@ArchLinux]% perl -F, -lane 'print $F[2]' Sample.txt
1
2
3
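
The awk variant above hard-codes the column number. When the number is only known at run time, it can be passed in as an awk variable. A minimal sketch, assuming an awk that supports -v; the names `sample` and `col` are made up for this illustration:

```shell
# Print the three rows of Sample.txt so the example needs no file on disk
sample() {
    printf '%s\n' 'abb_158,we1a,1,r_fadf,4' 'b_w2c_W,reka,2,ssd*dd,5' 'css_vvv,tebi,3,tfw2_1,6'
}

col=3   # hypothetical shell variable holding the wanted column number

# -v col=... sets an awk variable; inside awk, $col is "field number col"
sample | awk -F, -v col="$col" '{ print $col }'
```

With col=3 this prints 1, 2 and 3, the same as the fixed `$3` version.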

Regular expressions based solution

Now let's use GNU sed, GNU grep, Perl and GNU Emacs with regular expressions to do the same task.

GNU Sed

GNU sed is a stream editor. Its most basic use is replacing text that matches a pattern, e.g.:

[johndoe@ArchLinux]% printf 'This is a text which was typed\nusing a text editor.\n' | sed 's/text/message/g'
This is a message which was typed
using a message editor.

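In the query "s///g", "s" denotes substitution and the "g" flag makes it global, i.e. every match on a line is replaced rather than only the first one. sed applies the command to every line of the input in either case.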
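
The effect of the g flag is easiest to see on a line that contains the pattern twice. A small sketch; the input string is made up for illustration:

```shell
line='a text about a text'

# Without g: only the first match on the line is replaced
printf '%s\n' "$line" | sed 's/text/message/'

# With g: every match on the line is replaced
printf '%s\n' "$line" | sed 's/text/message/g'
```

The first command prints "a message about a text", the second "a message about a message".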

sed can also be used for selecting columns in a delimited file. Look-arounds and non-capturing groups are not supported in sed, so to select the 3rd column from the file one needs a regular expression built only from capturing groups:

[johndoe@ArchLinux]% sed -E 's/^([^,]*,){2}([^,]*)(,.*){0,}$/\2/g' Sample.txt
1
2
3

Explanation of regular expressions in the query:

Start:
^ beginning of the line

Following group:
[^,] Any character except comma
[^,]* A sequence of any characters except comma
[^,]*, A sequence of any characters except comma followed by a comma
([^,]*,) A capturing group matching a sequence of any characters except comma followed by a comma
([^,]*,){2} A sequence of two such groups, each of which matches a sequence of any characters except comma followed by a comma

Following group (our desired one!):
([^,]*) A capturing group matching a sequence of any characters not including comma.

Following group:
(,.*) A capturing group matching a sequence of characters which starts with a comma and contains any characters inside
(,.*){0,} Zero or more occurrences (the same as *) of a capturing group matching a sequence of characters which starts with a comma and contains any characters inside

End:
$ end of line
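
One consequence of anchoring the pattern with ^ and $ is that a line with too few columns does not match at all, and sed then prints it unchanged. The sketch below illustrates this with a made-up short row, and shows how -n together with the p flag prints only the lines where the substitution succeeded:

```shell
# Two complete rows and one short row with only two columns
data() {
    printf '%s\n' 'abb_158,we1a,1,r_fadf,4' 'short,row' 'css_vvv,tebi,3,tfw2_1,6'
}

# Plain substitution: the short row does not match and passes through as-is
data | sed -E 's/^([^,]*,){2}([^,]*)(,.*){0,}$/\2/'

# -n suppresses automatic printing; the p flag prints only substituted lines
data | sed -nE 's/^([^,]*,){2}([^,]*)(,.*){0,}$/\2/p'
```

The first command prints 1, short,row and 3; the second prints only 1 and 3.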

A line that matches this pattern is replaced with the second captured group ("\2"). If we use "\1" instead of "\2", the first captured group is written to the output. Because that group is repeated, it holds only its last repetition, i.e. the second column with its trailing comma:

[johndoe@ArchLinux]% cat Sample.txt
abb_158,we1a,1,r_fadf,4
b_w2c_W,reka,2,ssd*dd,5
css_vvv,tebi,3,tfw2_1,6

[johndoe@ArchLinux]% sed -E 's/^([^,]*,){2}([^,]*)(,.*){0,}$/\1/g' Sample.txt
we1a,
reka,
tebi,

If we use "\3" instead of "\2", the third captured group is written to the output:

[johndoe@ArchLinux]% sed -E 's/^([^,]*,){2}([^,]*)(,.*){0,}$/\3/g' Sample.txt
,r_fadf,4
,ssd*dd,5
,tfw2_1,6

By changing the number of occurrences of the first capturing group, we can choose the column number:

[johndoe@ArchLinux]% sed -E 's/^([^,]*,){1}([^,]*)(,.*){0,}$/\2/g' Sample.txt
we1a
reka
tebi

[johndoe@ArchLinux]% sed -E 's/^([^,]*,){4}([^,]*)(,.*){0,}$/\2/g' Sample.txt
4
5
6

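We can wrap the query above into a BASH script with function-like behavior.
How do we pass arguments to a BASH script? With bash -c, the first word after the command string becomes $0 and the following ones become $1, $2 and so on: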

[johndoe@ArchLinux]% bash -c 'echo $1 $0' FirstArgument SecondArgument
SecondArgument FirstArgument
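
Because the first word after the command string lands in $0, a common convention is to put a throwaway placeholder there so that the real arguments start at $1. A short sketch; the placeholder "_" is arbitrary and not something the scripts in this text use:

```shell
# "_" fills $0; the real arguments arrive as $1 and $2
bash -c 'echo "first=$1 second=$2"' _ FirstArgument SecondArgument
```

This prints "first=FirstArgument second=SecondArgument".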

The other thing to take into account is that the number of columns to skip is the desired column number minus one, which requires a subtraction. BASH supports arithmetic expansion:

[johndoe@ArchLinux]% echo "$((2*4))"
8
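
Applied to the task at hand, the same expansion turns a column number into the repetition count for the first group. A tiny sketch; the variable name `col` is hypothetical:

```shell
col=3

# Number of columns to skip before the wanted one
echo "$((col - 1))"

# The count spliced into the quantifier of the first group
echo "([^,]*,){$((col - 1))}"
```

For col=3 this prints 2 and then ([^,]*,){2}, which is the first group used in the sed queries above.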

Now we can wrap the whole query into a script which accepts the file name as the 1st argument ($0 inside the script), the delimiter as the 2nd argument ($1) and the column number as the 3rd ($2):

[johndoe@ArchLinux]% bash -c 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' Sample.txt "," 1
abb_158
b_w2c_W
css_vvv

[johndoe@ArchLinux]% bash -c 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' Sample.txt "," 4
r_fadf
ssd*dd
tfw2_1

[johndoe@ArchLinux]% bash -c 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' Sample.txt "," 5
4
5
6

[johndoe@ArchLinux]% cat Sample.txt
abb_158,we1a,1,r_fadf,4
b_w2c_W,reka,2,ssd*dd,5
css_vvv,tebi,3,tfw2_1,6
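
For repeated use, the same command can live in a shell function instead of a bash -c string. The following is only a sketch: the name `select_column` and its argument order are assumptions made here, and the temporary file stands in for Sample.txt.

```shell
# Hypothetical helper: select_column FILE DELIM COLUMN
# DELIM is pasted into the regex as-is, so this sketch assumes a single
# character with no special meaning in extended regular expressions.
select_column() {
    # \$ and \\2 reach sed as $ and \2 after shell double-quote processing
    sed -E "s/^([^$2]*$2){$(($3 - 1))}([^$2]*)($2.*){0,}\$/\\2/" "$1"
}

tmp=$(mktemp)
printf '%s\n' 'abb_158,we1a,1,r_fadf,4' 'b_w2c_W,reka,2,ssd*dd,5' > "$tmp"
select_column "$tmp" , 4    # prints r_fadf and ssd*dd
rm -f "$tmp"
```

Unlike the bash -c form, the arguments here start at $1, so the file name is $1, the delimiter $2 and the column number $3.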

The script should work with other single-character delimiters as well. Let's generate another file, SampleVBar.txt, which is identical to Sample.txt except that it uses a vertical bar as the delimiter:

[johndoe@ArchLinux]% sed 's/,/|/g' Sample.txt > SampleVBar.txt

[johndoe@ArchLinux]% cat SampleVBar.txt
abb_158|we1a|1|r_fadf|4
b_w2c_W|reka|2|ssd*dd|5
css_vvv|tebi|3|tfw2_1|6

The same script works when the vertical bar is passed as the delimiter. It has to be escaped as "\|", because with -E an unescaped "|" means alternation:

[johndoe@ArchLinux]% bash -c 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' SampleVBar.txt "\|" 1
abb_158
b_w2c_W
css_vvv
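
Inside a bracket expression the vertical bar is an ordinary character and needs no backslash; only the occurrences outside the brackets have to be escaped under -E. Written out by hand for the third column, that gives the sketch below (the input rows are the first two lines of SampleVBar.txt):

```shell
# [^|] needs no escape; \| outside the brackets is a literal bar
printf '%s\n' 'abb_158|we1a|1|r_fadf|4' 'b_w2c_W|reka|2|ssd*dd|5' |
    sed -E 's/^([^|]*\|){2}([^|]*)(\|.*){0,}$/\2/'
```

This prints 1 and 2. When "\|" is spliced into the brackets as well, as the wrapped script does, the backslash becomes one more excluded character, which is harmless as long as the data contains no backslashes.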

To debug the script, run bash with the -x option, which prints each command after expansion, right before it is executed:

[johndoe@ArchLinux]% bash -xc 'sed -E "s/^([^$1]*$1){$(($2-1))}([^$1]*)($1.*){0,}$/\2/g" $0' SampleVBar.txt "\|" 1
+ sed -E 's/^([^\|]*\|){0}([^\|]*)(\|.*){0,}$/\2/g' SampleVBar.txt
abb_158
b_w2c_W
css_vvv

GNU grep

With the -P option (Perl-compatible regular expressions), GNU grep supports look-behind, but not variable-length look-behind: the look-behind can only be of fixed length! In the example below, exactly 7 characters are matched before the comma. A variable-length quantifier such as {7,9} will NOT work:

[johndoe@ArchLinux]% grep -P -o '(?<=^([^,]){7}),.{2}' Sample.txt
,we
,re
,te

In recent versions of grep, look-behind can be emulated with the \K operator, which resets the start of the reported match: everything matched before \K is required, but left out of the output. Together with look-ahead it can be used to select columns:

[johndoe@ArchLinux]% cat Sample.txt
abb_158,we1a,1,r_fadf,4
b_w2c_W,reka,2,ssd*dd,5
css_vvv,tebi,3,tfw2_1,6

[johndoe@ArchLinux]% grep -P -o '^([^,]*,){4}\K([^,]*)(?=,.*){0,}' Sample.txt
4
5
6
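
The behavior of \K is easier to see on a simpler input. The sketch below assumes a grep built with PCRE support (the -P option); the key=value line is made up for illustration:

```shell
# Everything up to and including "=" must match, but -o reports only
# the part of the match that comes after \K
printf '%s\n' 'name=Sample.txt' | grep -oP '^[^=]*=\K.*'
```

This prints Sample.txt. Unlike a look-behind, the part before \K may have any length.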

For the last column, a non-capturing group (?:) gives the same result as the positive look-ahead (?=):

[johndoe@ArchLinux]% grep -P -o '^([^,]*,){4}\K([^,]*)(?:,.*){0,}' Sample.txt
4
5
6

Here \K discards the 4 repetitions of the first capturing group (i.e. the first 4 columns) from the reported match. Note that there is no end-of-line anchor "$" in the query. Leaving it out makes it possible to select the 1st column:

[johndoe@ArchLinux]% grep -P -o '^([^,]*,){0}\K([^,]*)(?=,.*){0,}' Sample.txt
abb_158
b_w2c_W
css_vvv

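The non-capturing group and the look-ahead behave differently for any column but the last. The non-capturing group consumes the rest of the line, so -o prints it as well, whereas the look-ahead is zero-width and leaves the rest of the line out of the match: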

[johndoe@ArchLinux]% grep -P -o '^([^,]*,){2}\K([^,]*)(?:,.*){0,}' Sample.txt
1,r_fadf,4
2,ssd*dd,5
3,tfw2_1,6

[johndoe@ArchLinux]% grep -P -o '^([^,]*,){2}\K([^,]*)(?=,.*){0,}' Sample.txt
1
2
3

Wrapping it up into a BASH script:

[johndoe@ArchLinux]% bash -c 'grep -P -o "^([^$1]*$1){$(($2 - 1))}\K([^$1]*)(?=$1.*){0,}" $0' Sample.txt "," 1
abb_158
b_w2c_W
css_vvv

[johndoe@ArchLinux]% bash -c 'grep -P -o "^([^$1]*$1){$(($2 - 1))}\K([^$1]*)(?=$1.*){0,}" $0' Sample.txt "," 3
1
2
3

[johndoe@ArchLinux]% bash -c 'grep -P -o "^([^$1]*$1){$(($2 - 1))}\K([^$1]*)(?=$1.*){0,}" $0' Sample.txt "," 5
4
5
6

Perl

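In a Perl substitution, the non-capturing group and the look-ahead again behave differently. The non-capturing group consumes the rest of the line, so it is replaced along with everything else. The look-ahead is zero-width, so the rest of the line is not part of the match and stays in the output:

[johndoe@ArchLinux]% perl -ple 's/^(?:[^,]*,){2}([^,]*)(?:,.*){0,}/$1/g' Sample.txt
1
2
3

[johndoe@ArchLinux]% perl -ple 's/^(?:[^,]*,){2}([^,]*)(?=,.*){0,}/$1/g' Sample.txt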
1,r_fadf,4
2,ssd*dd,5
3,tfw2_1,6

GNU Emacs

A similar operation, replacing the whole text with a chosen column, can be done in GNU Emacs. Let's recall the content of our file:

[johndoe@ArchLinux]% cat Sample.txt
abb_158,we1a,1,r_fadf,4
b_w2c_W,reka,2,ssd*dd,5
css_vvv,tebi,3,tfw2_1,6

Now in order to do the replacement:

Open the file in Emacs.
Select the whole text using the "C-x h" shortcut.
Once selected, run "M-x", then choose "replace-regexp".
Enter the regular expression to search for (\{3\} skips three columns, so the 4th one is selected):
^\([^,]*,\)\{3\}\([^,]*\)\(,.*\)
Enter the replacement, a back-reference to the second group:
\2

The resulting text will be:

r_fadf
ssd*dd
tfw2_1

Useful links

Useful One-Line Scripts for Sed (Unix stream editor)

Using Perl like awk and sed

Lookahead and Lookbehind Zero-Length Assertions

Using Look-ahead and Look-behind

Altynbek Isabekov
