Regular expressions filter
This page describes how to write filters in regular expression syntax. i-net PDFC uses Java's regular expression implementation, so a full specification of the syntax can be found in the Java Pattern Documentation.
i-net PDFC specific implementation
i-net PDFC has some specific rules for pattern matching apart from the normal regular expression rules:
- a word will be ignored in the comparison if at least one character of the word is matched by any pattern (you can however match a space at the end or beginning to match the whole word)
- if used, match groups are exclusive by default - please refer to the Match groups section for details
- patterns are by default case sensitive
- the . operator does not match a line break
Developing a pattern
To define a pattern from scratch, simply start with one example of the text sequence which will be matched by the filter. Imagine you want to exclude a "last modified" date from the comparison. Your first step might be to start with a date as filter pattern:
24/12/2013
But that would only match one date. To allow other dates to match as well, replace all parts which may have more than one value with a more generic expression. In our example, these are the day, month and year numbers. Replace them by a generic number expression:
\d+/\d+/\d+
The expression \d matches a single digit. The + specifies that there must be one or more digits. But wait - this would also match other numbers which are not dates at all, such as 1234/521/5122. To avoid accidental matches, use precise expressions wherever possible. As a simple improvement we choose to limit the number of digits:
\d{1,2}/\d{1,2}/\d{2,4}
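The effect of tightening the pattern can be checked outside i-net PDFC with plain java.util.regex. The following is a minimal sketch; the class and method names are illustrative and not part of i-net PDFC.

```java
import java.util.regex.Pattern;

// Illustrative helper, not part of i-net PDFC.
final class DatePatternCheck {

    // Returns true if the pattern matches anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String loose = "\\d+/\\d+/\\d+";
        String strict = "\\d{1,2}/\\d{1,2}/\\d{2,4}";

        System.out.println(found(loose, "24/12/2013"));     // true
        System.out.println(found(loose, "1234/521/5122"));  // true (accidental match)
        System.out.println(found(strict, "24/12/2013"));    // true
        System.out.println(found(strict, "1234/521/5122")); // false
    }
}
```

The backslashes are doubled only because the patterns appear inside Java string literals.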
There are many more ways to optimize the pattern, but one is special for i-net PDFC and helps you to precisely define the context of the pattern to only remove the last modified dates:
Last Modified: (\d{1,2}/\d{1,2}/\d{2,4})
Expressions in parentheses form a so-called match group. The filter will still match the whole expression, but only the content of the match groups will be excluded.
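To see which part of a match a group captures, the pattern can be tried with java.util.regex. The sketch below only prints the whole match and the content of group 1; it does not show how i-net PDFC applies the exclusion internally, and the class name and sample text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows what the whole match and group 1 contain.
final class MatchGroupDemo {

    private static final Pattern LAST_MODIFIED =
            Pattern.compile("Last Modified: (\\d{1,2}/\\d{1,2}/\\d{2,4})");

    // Returns the content of group 1, or null if the pattern is not found.
    static String capturedDate(String text) {
        Matcher m = LAST_MODIFIED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample text containing two dates.
        String text = "Created: 01/02/2010, Last Modified: 24/12/2013";
        Matcher m = LAST_MODIFIED.matcher(text);
        if (m.find()) {
            System.out.println(m.group());  // Last Modified: 24/12/2013
            System.out.println(m.group(1)); // 24/12/2013
        }
    }
}
```

The "Created" date is left alone because the text in front of the group acts as context for the match.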
Character matching types
The smallest unit in matching is a character matcher. The following expressions each match a single character. For a complete list, please refer to the Java Pattern Documentation.
Single characters | |
---|---|
x | the character x |
\\ | the backslash character - keep in mind that a single \ starts an escape sequence in regular expressions |
\t | the tab character (Unicode 0x0009) |
\n | the newline / line feed character (Unicode 0x000A) |
\r | the carriage-return character (Unicode 0x000D) |
Character groups | |
[xyz] | either x, y or z |
[^abc] | any character but a, b or c |
[a-z0-9] | a through z (lower case only) or 0 through 9 |
[a-z&&[^b-y]] | subtraction of ranges - the result here is only a or z |
Predefined groups | |
. | any character (except a line terminator, unless explicitly specified) |
\d / \D | a digit / a NON-digit |
\s / \S | an ASCII whitespace character / a NON-whitespace character |
\w / \W | a word character ([a-zA-Z_0-9]) / a NON-word character |
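A few of these character matchers can be tried directly against single-character inputs. This is a small sketch using java.util.regex; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative helper for trying single-character matchers.
final class CharacterClassDemo {

    // Returns true if the pattern matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[xyz]", "y"));         // true
        System.out.println(matches("[^abc]", "a"));        // false
        System.out.println(matches("[a-z0-9]", "A"));      // false (lower case only)
        System.out.println(matches("[a-z&&[^b-y]]", "z")); // true
        System.out.println(matches("[a-z&&[^b-y]]", "m")); // false
        System.out.println(matches("\\w", "_"));           // true
        System.out.println(matches("\\s", "\t"));          // true
        System.out.println(matches(".", "\n"));            // false (line terminator)
    }
}
```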
Word Matching
To match words, simply use the character matchers and extend them by a quantifier.
Quantifiers | |
---|---|
X? | character matcher X, once or not at all |
X* | character matcher X, zero or more times |
X+ | character matcher X, one or more times |
X{n} | character matcher X, exactly n times |
X{n,} | character matcher X, at least n times |
X{n,m} | character matcher X, at least n but not more than m times |
As an example, the expression \d+ matches any sequence of one or more digits.
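The quantifiers can be checked against whole inputs in the same way. A short sketch with java.util.regex follows; the class name and the sample words are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative helper for trying quantifiers against whole inputs.
final class QuantifierDemo {

    // Returns true if the pattern matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));  // true  (u not present)
        System.out.println(matches("colou?r", "colour")); // true  (u present once)
        System.out.println(matches("\\d*", ""));          // true  (zero digits allowed)
        System.out.println(matches("\\d+", ""));          // false (at least one digit)
        System.out.println(matches("\\d{4}", "2013"));    // true  (exactly four)
        System.out.println(matches("\\d{2,}", "7"));      // false (at least two)
        System.out.println(matches("\\d{1,2}", "123"));   // false (at most two)
    }
}
```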
Greedy / Reluctant quantifier
A very common term to exclude any text between two known expressions is .*. But since the * operator is greedy, it will match the maximum number of characters, which isn't the expected result in most cases.
To change this, simply add a ? to the quantifier. For example:
Pattern | Argument | First matched sequence |
---|---|---|
.* | 12345 | 12345 |
.*? | 12345 | (empty match - nothing is matched!) |
\d+ | 12345 | 12345 |
\d+? | 12345 | 1 |
a.*z | a to z or z to a | a to z or z |
a.*?z | a to z or z to a | a to z |
a{1,3} | aaaaa | aaa |
a{1,3}? | aaaaa | a |
As a general guideline, the non-greedy quantifiers should be preferred.
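The difference between greedy and reluctant quantifiers can be reproduced with java.util.regex by looking at the first match only. This is a minimal sketch; the class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper that returns the first matched sequence.
final class GreedyReluctantDemo {

    // Returns the first matched sequence, or null if nothing is found.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(".*", "12345"));               // 12345
        System.out.println(firstMatch(".*?", "12345").isEmpty());    // true (empty match)
        System.out.println(firstMatch("\\d+", "12345"));             // 12345
        System.out.println(firstMatch("\\d+?", "12345"));            // 1
        System.out.println(firstMatch("a.*z", "a to z or z to a"));  // a to z or z
        System.out.println(firstMatch("a.*?z", "a to z or z to a")); // a to z
        System.out.println(firstMatch("a{1,3}", "aaaaa"));           // aaa
        System.out.println(firstMatch("a{1,3}?", "aaaaa"));          // a
    }
}
```

In plain Java the reluctant .*? yields an empty match rather than no match, which is why it is rarely useful without a following expression to stop at.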
Match groups
Regular expressions allow more complex patterns as well. Any term up to this point can be enclosed in parentheses and then used much like a single matcher. For example:
Pattern | example which matches completely |
---|---|
(123)+ | 123123123 |
(\d{1,3},)*(\d{1,3})(\.\d+)? | 123,456,789.123 |
([0-9A-F]{2}-){5}[0-9A-F]{2} | 0A-12-ED-32-9C-72 |
Note: If you use match groups, i-net PDFC will exclude only the content matched by these groups. Any part of the pattern outside these groups will be used as an anchor but not be removed. While this spares the overhead of rather complex look-ahead and look-behind groups, it can be an issue when pasting regular expressions from external sources.
To avoid this, you can simply put an additional group around the whole pattern. To disable this behavior for individual groups, set the group to non-capturing with the ?: operator, e.g. (?:this|that).
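Whether a pasted expression contains capturing groups can be checked with java.util.regex before using it as a filter. The sketch below counts capturing groups and shows that a non-capturing group is not counted; the class name, method name and sample text are illustrative and not part of i-net PDFC.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper for inspecting the groups of a pattern.
final class GroupCountDemo {

    // Returns the number of capturing groups in the pattern.
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Groups can be quantified like a single matcher.
        System.out.println(Pattern.matches("(123)+", "123123123")); // true
        System.out.println(Pattern.matches(
                "([0-9A-F]{2}-){5}[0-9A-F]{2}", "0A-12-ED-32-9C-72")); // true

        // Non-capturing groups are not counted.
        System.out.println(capturingGroups("(?:this|that)"));         // 0
        System.out.println(capturingGroups("(?:this|that) (\\d+)"));  // 1

        // Only the capturing group is available as group 1.
        Matcher m = Pattern.compile("(?:this|that) (\\d+)").matcher("that 42");
        if (m.find()) {
            System.out.println(m.group(1)); // 42
        }
    }
}
```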
Alternative terms
In case a matcher needs to match a list of alternative terms, these terms can be defined using the | operator. For example:
Pattern | example which matches completely |
---|---|
(a|b)* | abba |
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) | May (for instance) |
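Alternatives behave the same way in plain java.util.regex, which makes them easy to try out. A brief sketch follows; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative helper for trying alternative terms.
final class AlternationDemo {

    private static final Pattern MONTH = Pattern.compile(
            "(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)");

    // Returns true if the whole input is one of the month abbreviations.
    static boolean isMonth(String input) {
        return MONTH.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("(a|b)*", "abba")); // true
        System.out.println(Pattern.matches("(a|b)*", "abca")); // false (c is not an alternative)
        System.out.println(isMonth("May"));                    // true
        System.out.println(isMonth("may"));                    // false (case sensitive)
    }
}
```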
Flags and Switches
Some operators don't match anything but can be used to change the behavior of the matcher.
Operator | Effect |
---|---|
(?i) | Switches to case insensitive matching |
(?s) | Switches the . operator to match line breaks as well, so that a match can span multiple lines |
(?: ... ) | Defines a non-capturing group. i-net PDFC will not exclude matches of such groups unless there are no capturing groups in the pattern |
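The effect of the (?i) and (?s) switches can be seen by running the same pattern with and without them. This is a minimal java.util.regex sketch; the class name, method name and sample texts are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative helper for trying inline flags.
final class FlagDemo {

    // Returns true if the pattern matches anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String line = "Last Modified: 24/12/2013";
        System.out.println(found("last modified", line));     // false (case sensitive)
        System.out.println(found("(?i)last modified", line)); // true

        String twoLines = "begin\nend";
        System.out.println(found("begin.*end", twoLines));     // false (. stops at the line break)
        System.out.println(found("(?s)begin.*end", twoLines)); // true
    }
}
```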
Standard patterns
Pattern | Use case |
---|---|
Page: (\d+) of (\d+) | ignore page numbers in n of m format |
((19|20)\d\d([- /.])(0[1-9]|1[012])([- /.])(0[1-9]|[12][0-9]|3[01])) | Valid dates in year-month-day format |
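Both standard patterns can be tried with java.util.regex before they are used as filters. The following is a sketch with made-up sample inputs; the class and method names are illustrative and not part of i-net PDFC.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper for trying the two standard patterns.
final class StandardPatternDemo {

    private static final Pattern PAGE = Pattern.compile("Page: (\\d+) of (\\d+)");

    private static final Pattern DATE = Pattern.compile(
            "((19|20)\\d\\d([- /.])(0[1-9]|1[012])([- /.])(0[1-9]|[12][0-9]|3[01]))");

    // Returns true if the whole input is a date accepted by the pattern.
    static boolean isValidDate(String input) {
        return DATE.matcher(input).matches();
    }

    // Returns {n, m} for a "Page: n of m" text, or null if not found.
    static String[] pageNumbers(String text) {
        Matcher m = PAGE.matcher(text);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] pages = pageNumbers("Page: 3 of 12");
        System.out.println(pages[0] + " / " + pages[1]); // 3 / 12

        System.out.println(isValidDate("2013-12-24")); // true
        System.out.println(isValidDate("2013-13-24")); // false (no month 13)
        System.out.println(isValidDate("2013/12.24")); // true (the two separators may differ)
    }
}
```

The last line shows a limitation of the date pattern: the two separators are matched independently, so mixed separators are accepted.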