Emulating logical “OR” in hledger queries

hledger is an excellent tool for generating financial reports. However, it lacks one important feature: a boolean "OR" operator for combining queries. The problem is easiest to demonstrate with a concrete example. Let's generate an example hledger journal file with a decent number of transactions and accounts. For this purpose we will use the bean-example command from beancount:

This yields an example hledger journal file with a variety of accounts. The "Expenses" account has many subaccounts:

Let's say you want to display all transactions that fulfill the following conditions:

  • are in the "Liabilities:US:Chase:Slate" account (which is basically a credit card),
  • do not involve "Expenses" accounts, except "Expenses:Food:Groceries" and "Expenses:Food:Restaurant", i.e. money spent only on these two food categories and not on other expense categories,
  • were executed on or after 2021-11-01,

and feed these transactions into the "hledger register 'acct:Liabilities:US:Chase:Slate'" command to display the running total. In other words, we want to see food expenditures in the two given categories and refills of the credit card balance from other asset accounts. A further requirement is that everything fits into a one-line command without creating temporary files on disk.

The query on accounts implies these logical operations:
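Reconstructed from the three conditions listed above (a sketch in informal notation, not hledger syntax), the combined filter reads:

```
acct:Liabilities:US:Chase:Slate
  AND date >= 2021-11
  AND ( NOT acct:Expenses
        OR acct:Expenses:Food:Groceries
        OR acct:Expenses:Food:Restaurant )
```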

However, at the time of writing hledger supports only the AND operation. There is no OR operation, although a feature request for it is still open and might be addressed in the future.

Since we are printing the transactions that meet the given conditions, their order in the combined output of the emulated OR operation is not important: the cumulative sum of expenses and money transfers is calculated by the final "hledger register" command, which is smart enough to sort them chronologically.

Solution using "tee"

To emulate the "OR" operation, we can use a trick with "tee" and Bash-specific process substitution, which prints the 24 transactions that meet the criteria:

The STDOUT of tee is piped into the STDIN of hledger -f- print "acct:Expenses:Food:(Groceries|Restaurant)", and in place of a file argument such as file.txt the process substitution >(hledger -f- print "not:acct:Expenses") is used, for which Bash creates a temporary file descriptor. Although 'hledger -f- print "not:acct:Expenses"' is a separate process, its STDIN is connected to that temporary descriptor, so tee treats >(hledger -f- print "not:acct:Expenses") as just another file to write to.

What happens if we pipe these transactions into "hledger register"? We have a problem here: the transaction dated 2021-11-07 is not processed:

We can fix it by combining the STDOUT of the two commands (the two "hledger print" calls that filter account names) before the last command ("hledger register"):

Changing the filter in the final "hledger register" command displays the "Expenses:*" accounts of the same transactions:

Explanation: the output of 'hledger -f- print "acct:Expenses:Food:(Groceries|Restaurant)"' is redirected into a new stream &4, and the output of 'hledger -f Finances_2021.journal -b 2021-11 print "acct:Liabilities:US:Chase:Slate"' is redirected into a new stream &3. These output streams are merged into STDOUT (stream &1) by the two redirections "3>&1" and "4>&1".
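The descriptor plumbing can be tried without hledger. Below is a minimal sketch, assuming Bash (process substitution is not POSIX) and using grep as a stand-in for the two "hledger print" filters. The account list, the function name either_filter and the account "Expenses:Home:Rent" are made up for illustration. Each filter writes to its own descriptor, and both descriptors are merged into the STDOUT of the group before the next pipe:

```shell
# Requires Bash: >( ... ) is process substitution.
# Stand-in data: one account name per line instead of hledger transactions.
accounts='Liabilities:US:Chase:Slate
Expenses:Home:Rent
Expenses:Food:Coffee
Expenses:Food:Groceries
Expenses:Food:Restaurant'

either_filter() {
  {
    printf '%s\n' "$accounts" |
      tee >(grep -v '^Expenses' >&3) |
      grep -E '^Expenses:Food:(Groceries|Restaurant)$' >&4
  } 3>&1 4>&1 | LC_ALL=C sort   # sort only makes the demo output deterministic
}

either_filter
```

The two grep processes run concurrently, so their relative output order is not fixed. The sort at the end plays the role that "hledger register" plays in the real pipeline.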

Solution using "pee"

Another alternative is to use the "pee" command from the moreutils package:

It acts like "tee", but pipes its STDIN directly into multiple commands. The STDOUT of these commands is combined before entering the next pipe.
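To see the data flow without installing moreutils, here is a rough sketch of the idea. The function pee_like is hypothetical and is not the real "pee": it buffers STDIN in a shell variable and runs the commands one after another, which only imitates how the input is fanned out and the outputs are merged. The account names are illustrative, and grep stands in for "hledger print":

```shell
# pee_like: hypothetical, simplified imitation of pee's data flow.
# Reads all of STDIN once, then feeds a copy to every command given as an argument.
pee_like() {
  input=$(cat)
  for cmd in "$@"; do
    printf '%s\n' "$input" | sh -c "$cmd"
  done
}

combined=$(printf '%s\n' \
    'Liabilities:US:Chase:Slate' \
    'Expenses:Home:Rent' \
    'Expenses:Food:Groceries' \
    'Expenses:Food:Restaurant' |
  pee_like 'grep -v "^Expenses"' 'grep -E "^Expenses:Food:(Groceries|Restaurant)"')

printf '%s\n' "$combined"
```

Because this imitation runs the commands sequentially, its output order is fixed, which is not something to rely on with the real tool.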

Solution attempt using regular expressions

Yet another way to emulate logical OR when filtering hierarchically structured accounts is to use regular expressions. We want to exclude the "Expenses" account (hence the negation with hledger's "not:acct:" query prefix) and all its children except "Expenses:Food:Groceries" and "Expenses:Food:Restaurant". The first thing that comes to mind is a negative lookahead, i.e. something like "not:acct:Expenses:(?!Food:Groceries)", but as it turns out, lookaheads are not supported by hledger's regular expression engine.

How about matching a single character that is not contained within the brackets? The expression "not:acct:Expenses:[^F]" should filter out all expense accounts except those whose names start with "Expenses:F". The problem is that "Expenses:Food" is not the only account that passes this filter; there is "Expenses:Finances" as well:
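The effect can be reproduced with grep as a stand-in for hledger's filtering. This is a sketch under assumptions: "grep -Ev" plays the role of "not:acct:", the account list is illustrative ("Expenses:Home:Rent" is made up), and hledger's matching may differ in details such as case sensitivity:

```shell
accounts='Liabilities:US:Chase:Slate
Expenses:Finances
Expenses:Food:Groceries
Expenses:Home:Rent'

# Drop every account whose first letter after "Expenses:" is not "F".
kept=$(printf '%s\n' "$accounts" | grep -Ev '^Expenses:[^F]')
printf '%s\n' "$kept"
```

"Expenses:Home:Rent" is dropped, but "Expenses:Finances" survives together with "Expenses:Food:Groceries".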

A workaround has been discussed in a question on Stack Overflow. We need to use a group with an "OR" (alternation) combining at least two items:

  • Matching a single character that is not contained within the brackets (so the first letter of the account name in the brackets): "[^F]",
  • Similarly to the previous item but for the second letter of the account name, with condition that it is preceded by the first letter, i.e. "F[^o]".

These two items should be extended with items for the 3rd, 4th and following letters of the account name, much like the factors in the chain rule for a joint probability distribution: "not:acct:Expenses:([^F]|F[^o]|Fo[^o]|Foo[^d])"

We need to be more specific to avoid letting through "Expenses:Food:Coffee" and "Expenses:Food:Alcohol", so the regular expression should look like "not:acct:Expenses:([^F]|F[^o]|Fo[^o]|Foo[^d]|Food:[^RG])":
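The letter-by-letter alternation can be checked with grep as a stand-in for hledger. This is only a sketch: "grep -Ev" imitates "not:acct:", the account list is illustrative ("Expenses:Home:Rent" is made up), and the pattern is anchored with "^" because grep is applied to bare account names here:

```shell
accounts='Liabilities:US:Chase:Slate
Expenses:Finances
Expenses:Home:Rent
Expenses:Food:Alcohol
Expenses:Food:Coffee
Expenses:Food:Groceries
Expenses:Food:Restaurant'

# One alternative per prefix length: the first letter differs from F,
# or the first is F and the second differs from o, and so on.
chain_re='^Expenses:([^F]|F[^o]|Fo[^o]|Foo[^d]|Food:[^RG])'

kept=$(printf '%s\n' "$accounts" | grep -Ev "$chain_re")
printf '%s\n' "$kept"
```

Only the credit card account and the two wanted food categories remain in this sample.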

Although this regular expression is sufficient to solve our problem, it would also let through and display accounts like "Expenses:Food:Grapes" if they existed. To be more specific, we need to extend it so that only the "Restaurant" and "Groceries" accounts pass. That is not easy, since there is an "OR" operator in the expression we want to invert: "Expenses:Food:(Groceries|Restaurant)".

The solution can be based on matching one letter at a time and combining the corresponding letters of the two words into character classes (square brackets): [RG][er][so][tc][ae][ur][ri][ae][ns][t$], where "Restaurant" and "Groceries" are merged position by position.

In essence, the condition on the preceding character should look like "<skipping the OR'less part>....|Food:[^RG]|Food:[RG][^er]|Food:[RG][er][^so])".

The full solution using only regular expressions is scary and ugly, but it works:

It will also let through and display accounts formed from any mix of the two letters at each position, e.g. "Expenses:Food:Rrscarren", "Expenses:Food:Rrscarrent", "Expenses:Food:Gescauias", "Expenses:Food:Gescauiast", "Expenses:Food:Rroceries", "Expenses:Food:Gestauran" etc. if they existed, so this approach cannot be considered a proper solution.
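The over-matching is easy to confirm on the positive form of the per-letter classes. In this sketch grep stands in for hledger, the name list is illustrative, and the last position is written as "t?$" rather than "[t$]", because inside a POSIX bracket expression "$" is a literal character:

```shell
names='Groceries
Restaurant
Rroceries
Gestauran
Coffee'

# One character class per position, merging "Restaurant" and "Groceries".
class_re='^[RG][er][so][tc][ae][ur][ri][ae][ns]t?$'

matched=$(printf '%s\n' "$names" | grep -E "$class_re")
printf '%s\n' "$matched"
```

Besides the two real category names, the hybrids "Rroceries" and "Gestauran" match as well. Only "Coffee" is rejected.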

Conclusion

The solution using "tee" works fine, but it depends on Bash process substitution, and the POSIX compliance of file descriptor paths like "/dev/fd/3" is not clear. The order of execution of the filtering commands is not guaranteed, but that does not matter here.

The solution using "pee" is the best, but it requires installing the "pee" command. The order of execution of the processes is still not guaranteed.

The attempt using regular expressions without lookaheads, relying purely on hledger's filtering functionality, is intimidating and may not be 100% correct if there are unusual account names that slip through the constructed regular expression.

References

Keeping header of an output while grepping the rest for something else in BASH

Stack Overflow: Output order with process substitution

BASH: Grouping commands

Wikipedia: Chain rule (probability)

Stack Overflow: Regular expression to match a line that doesn't contain a word