Using regular expressions with Perl, sed and grep for selecting columns in delimited text

A solution without regular expressions

There are many ways to select columns in delimited text. The easiest ones use GNU awk, cut or Perl. Let's consider the following file:
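As a stand-in for the listing, the sketch below creates a hypothetical comma-delimited Sample.txt with three rows and six columns. The contents and the choice of comma as delimiter are assumptions made for illustration, not the original data.

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt
cat Sample.txt
```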

To select the 3rd column, one can use GNU awk:
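A minimal sketch of the awk approach, assuming a hypothetical comma-delimited Sample.txt (the data is illustrative):

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt
# -F sets the field separator; $3 is the third field of each line.
awk -F',' '{ print $3 }' Sample.txt
```

With this data the command prints c1, c2 and c3, one per line.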

or the cut command from the coreutils package:
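A comparable sketch with cut, again assuming the hypothetical comma-delimited Sample.txt:

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt
# -d sets the delimiter, -f picks the field number.
cut -d',' -f3 Sample.txt
```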

or Perl:

A solution based on regular expressions

Now let's consider using GNU sed, GNU grep, Perl and GNU Emacs with regular expressions to do the same task.

GNU Sed

GNU sed is a stream editor. Its basic function is replacing text that matches a pattern, e.g.:
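A small illustration with made-up input, showing the substitute command with and without the g flag:

```shell
# Without the g flag only the first match on each line is replaced.
echo 'cat bat cat' | sed 's/cat/dog/'    # prints: dog bat cat
# With the g flag every match on the line is replaced.
echo 'cat bat cat' | sed 's/cat/dog/g'   # prints: dog bat dog
```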

In the query "s///g", "s" is the substitute command and the "g" flag makes the substitution global, i.e. every match on a line is replaced rather than only the first one.

It can also be used for selecting columns in a delimited file. Look-arounds and non-capturing groups are not supported in sed. To select the 3rd column from the file, one needs a regular expression that contains only capturing groups:
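A minimal sketch of such a query, assuming the hypothetical comma-delimited Sample.txt and extended syntax (-E); it may differ in detail from the original command:

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt
# ([^,]*,){2}  group 1, repeated twice: skips columns 1 and 2 with their commas
# ([^,]*)      group 2: the 3rd field
# (.*)         group 3: the rest of the line
sed -E 's/^([^,]*,){2}([^,]*)(.*)$/\2/' Sample.txt
```

With this data the command prints c1, c2 and c3, one per line.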

Explanation of regular expressions in the query:

A string that matches the pattern is replaced with the second captured group ("\2") of the pattern. If we use "\1" instead of "\2", the first captured group is sent to the output:
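A sketch on one made-up line, assuming a query with the repeated first group `([^,]*,){2}`. A repeated group keeps only its last repetition, so "\1" yields the 2nd field together with its comma:

```shell
echo 'a1,b1,c1,d1,e1,f1' | sed -E 's/^([^,]*,){2}([^,]*)(.*)$/\1/'   # prints: b1,
```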

If we use "\3" instead of "\2", the third captured group is sent to the output:
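On the same made-up line, and assuming the third group is a catch-all `(.*)`, "\3" yields everything after the selected field:

```shell
echo 'a1,b1,c1,d1,e1,f1' | sed -E 's/^([^,]*,){2}([^,]*)(.*)$/\3/'   # prints: ,d1,e1,f1
```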

By changing the number of occurrences of the first capturing group, we can choose the column number:
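For illustration, on a made-up line: the repetition count is the number of columns to skip, so it is one less than the wanted column number.

```shell
# Skip 1 column to get the 2nd field.
echo 'a1,b1,c1,d1,e1,f1' | sed -E 's/^([^,]*,){1}([^,]*)(.*)$/\2/'   # prints: b1
# Skip 4 columns to get the 5th field.
echo 'a1,b1,c1,d1,e1,f1' | sed -E 's/^([^,]*,){4}([^,]*)(.*)$/\2/'   # prints: e1
```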

We can wrap the query above in a BASH script that behaves like a function.
How do we pass arguments to a BASH script? It's easy:
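A minimal sketch of positional arguments; the script name args.sh is hypothetical:

```shell
# Write a tiny script; the quoted 'EOF' keeps $1, $2 and $# unexpanded here.
cat > args.sh <<'EOF'
#!/bin/bash
# $1, $2, ... hold the positional arguments; $# is their count.
echo "first=$1 second=$2 count=$#"
EOF
bash args.sh Sample.txt ','   # prints: first=Sample.txt second=, count=2
```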

Another aspect to take into account is the desired column number: the pattern has to skip one column fewer than that number, which requires a subtraction. BASH supports arithmetic expansion:
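For example, with hypothetical variable names col and skip:

```shell
col=3
# $(( ... )) evaluates integer arithmetic; variables need no $ inside it.
skip=$(( col - 1 ))
echo "$skip"   # prints: 2
```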

Now we can wrap the whole query into a script which accepts the file name as the 1st argument, the delimiter as the 2nd argument and the column number as the 3rd:
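One way such a script could look, as a sketch rather than the original: the name select_column.sh, the variable names and the sample data are all assumptions. The delimiter is placed inside bracket expressions so that characters such as "|" need no escaping; delimiters that are special inside brackets (for example "]" or "^") are not handled.

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt

cat > select_column.sh <<'EOF'
#!/bin/bash
# Usage: select_column.sh FILE DELIMITER COLUMN
file=$1
delim=$2
skip=$(( $3 - 1 ))   # number of columns to skip before the wanted one
# Group 1 (repeated) skips columns, group 2 is the wanted field, group 3 the rest.
sed -E "s/^([^${delim}]*[${delim}]){${skip}}([^${delim}]*)(.*)\$/\\2/" "$file"
EOF

bash select_column.sh Sample.txt ',' 3
```

With this data the last command prints c1, c2 and c3, one per line.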

The script should work with any delimiter. Let's generate another file, SampleVBar.txt, which is similar to Sample.txt but uses a vertical bar as the delimiter:
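One simple way to do this, assuming Sample.txt is comma-delimited and contains no other commas, is to translate every comma into a vertical bar with tr:

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt
# Replace each comma with a vertical bar.
tr ',' '|' < Sample.txt > SampleVBar.txt
head -n 1 SampleVBar.txt   # prints: a1|b1|c1|d1|e1|f1
```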

The same script will work if the vertical bar delimiter is specified:
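The sketch below shows a parameterized sed command inline, as an illustration of why the delimiter can simply be swapped. The variable names delim and col and the data are assumptions:

```shell
# Hypothetical sample data with a vertical bar as delimiter.
printf '%s\n' 'a1|b1|c1|d1|e1|f1' 'a2|b2|c2|d2|e2|f2' 'a3|b3|c3|d3|e3|f3' > SampleVBar.txt
delim='|'
col=3
# Inside a bracket expression the vertical bar is an ordinary character.
sed -E "s/^([^${delim}]*[${delim}]){$(( col - 1 ))}([^${delim}]*)(.*)\$/\\2/" SampleVBar.txt
```

With this data the command prints c1, c2 and c3, one per line.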

Debugging can be done by viewing the commands as the BASH script executes them:
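A common way to do this is the -x option of bash, which prints each command to standard error, prefixed with "+", after all expansions have been applied. The sketch uses a hypothetical two-line script, show_skip.sh:

```shell
cat > show_skip.sh <<'EOF'
skip=$(( $1 - 1 ))
echo "skip=$skip"
EOF
# The trace goes to stderr, the normal output (skip=2) to stdout.
bash -x show_skip.sh 3
```

The trace shows the assignment with the arithmetic already evaluated, which makes it easy to check the pattern that sed or grep actually receives.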

GNU grep

GNU grep (in Perl-compatible mode, option -P) does not support variable-length look-behind: the look-behind can only be of fixed length. In the example below, exactly 7 characters are matched. The variable-length quantifier {7,9} will NOT work:
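A sketch of a fixed-length look-behind on the hypothetical Sample.txt. Here the fixed width is 6 characters (the two fields "a1," and "b1,"), not the 7 of the original example, and it only works because every field in this made-up data has the same width:

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt
# (?<=^.{6}) requires exactly 6 characters between line start and the match.
# -o prints only the matched part, -P enables Perl-compatible syntax.
grep -oP '(?<=^.{6})[^,]+' Sample.txt
```

With this data the command prints c1, c2 and c3, one per line.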

In recent versions of grep, look-behind can be emulated using the \K operator, which resets the start of the reported match: everything matched before \K is left out of the output. Together with look-ahead it can be used to select columns:
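A sketch of this technique on the hypothetical six-column Sample.txt, selecting the 5th column; the exact original query may differ:

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt
# ^([^,]*,){4}  matches the first four columns with their commas
# \K            leaves everything matched so far out of the reported match
# [^,]*(?=,)    matches the 5th field, which must be followed by a comma
grep -oP '^([^,]*,){4}\K[^,]*(?=,)' Sample.txt
```

With this data the command prints e1, e2 and e3, one per line.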

Instead of a positive look-ahead (?=), a non-capturing group (?:) can be used:

Here \K discards 4 instances of the first capturing group (i.e. 4 columns). Note that there is no end-of-line anchor "$" in the query. Removing it allows selecting the 1st column:

The behavior of a non-capturing group and a look-ahead is different:
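One visible difference, shown on a made-up line: a look-ahead only checks what follows, while a non-capturing group consumes it, so the delimiter becomes part of the match printed by -o.

```shell
# Look-ahead: the comma is checked but not consumed, so it is not printed.
echo 'a1,b1,c1' | grep -oP '[^,]+(?=,)'    # prints a1 and b1 on separate lines
# Non-capturing group: the comma is consumed and becomes part of the match.
echo 'a1,b1,c1' | grep -oP '[^,]+(?:,)'    # prints a1, and b1, on separate lines
```

In both cases c1 is not printed, because no comma follows it.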

Wrapping it up into a BASH script:
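A possible wrapper, sketched under the same assumptions as before: the name grep_column.sh and the sample data are hypothetical, and the look-ahead also accepts end of line so that the last column can be selected.

```shell
# Hypothetical sample data: 3 rows, 6 comma-separated columns.
printf '%s\n' 'a1,b1,c1,d1,e1,f1' 'a2,b2,c2,d2,e2,f2' 'a3,b3,c3,d3,e3,f3' > Sample.txt

cat > grep_column.sh <<'EOF'
#!/bin/bash
# Usage: grep_column.sh FILE DELIMITER COLUMN
skip=$(( $3 - 1 ))   # number of columns to skip before the wanted one
# \K drops the skipped columns; the look-ahead accepts a delimiter or end of line.
grep -oP "^([^$2]*[$2]){${skip}}\\K[^$2]*(?=[$2]|\$)" "$1"
EOF

bash grep_column.sh Sample.txt ',' 5
```

With this data the last command prints e1, e2 and e3, one per line. Note that grep -o does not print empty matches, so empty fields would be silently skipped.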

Perl

Perl behaves somewhat unexpectedly with non-capturing groups and look-ahead:

GNU Emacs

A similar operation, replacing the whole text with a chosen column, can be done in GNU Emacs. Let's recall the content of our file:

Now in order to do the replacement:

The resulting text will be:

Useful links

Useful One-Line Scripts for Sed (Unix stream editor)

Using Perl like awk and sed

Lookahead and Lookbehind Zero-Length Assertions

Using Look-ahead and Look-behind
