Nre Package Commands


NAME

nrematch - Match a regular expression against a string

SYNOPSIS

package require nre ?3.0?
nrematch ?switches? exp string ?matchVar? ?subMatchVar subMatchVar ...?
nrematch -eval ?switches? exp string ?matchVar matchScript? ?subMatchVar subMatchScript subMatchVar subMatchScript ...?
nrematch -list|-flatten|-split ?switches? exp string matchVar
nrematch -eval -list|-flatten|-split ?switches? exp string matchVar matchScript

DESCRIPTION

Determines whether the regular expression exp matches part or all of string. Returns the number of times exp matched. The number of matches can be greater than 1 if the -all or -split switches are used.

If additional arguments are specified after string then they are treated as the names of variables in which to return information about which part(s) of string matched exp. MatchVar will be set to the range of string that matched all of exp. The first subMatchVar will contain the characters in string that matched the leftmost parenthesized subexpression within exp, the next subMatchVar will contain the characters that matched the next parenthesized subexpression to the right in exp, and so on. nrematch does not use the standard regular expression package; it uses the package described in this man page.
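As a minimal sketch of how the match variables are filled in, the following assumes the nre package is installed. So that it also runs in a plain tclsh, it falls back to Tcl's built-in regexp command, which treats this simple call the same way. The fallback, the pattern, and the variable names are illustrative assumptions and not part of nre.

```tcl
# Assumption: use nre when it is available; otherwise alias nrematch to
# the built-in regexp, which behaves the same for this simple call.
if {[catch {package require nre}]} {
    interp alias {} nrematch {} regexp
}

# whole receives the text matched by the entire expression; key and
# value receive the two parenthesized subexpressions, left to right.
set count [nrematch {([a-z]+)=([0-9]+)} "width=42" whole key value]

puts "matches: $count"
puts "whole=$whole key=$key value=$value"
```

Passing an empty string in place of key or value should make nrematch skip that variable, as described above.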

If there are more subMatchVars than parenthesized subexpressions within exp, or if a particular subexpression in exp doesn't match the string (e.g. because it was in a portion of the expression that wasn't matched), then the corresponding subMatchVar will be set to ``-1 -1'' if -indices is used, to ``0.0 0.0'' if -textidx is used, or to an empty string otherwise. The exception to this is the -prune switch. Use it if you do not want empty items added to the match variables.

If the -list, -flatten, or -split switches are used then matchVar is required and subMatchVar is not allowed.

If the -eval switch is used then each matchVar and subMatchVar has an associated script that will be executed when a match is found.

Any matchVar, subMatchVar, matchScript, and subMatchScript can be an empty string if you want nrematch to ignore it.

If the initial arguments to nrematch start with - then they are treated as switches. The following switches are currently supported:

-nocase
Causes upper-case characters in string to be treated as lower case during the matching process.

-indices
Changes what is stored in the subMatchVars. Instead of storing the matching characters from string, each variable will contain a list of two decimal strings giving the indices in string of the first and last characters in the matching range of characters.

-textidx
Changes what is stored in the subMatchVars. Instead of storing the matching characters from string, each variable will contain a list of two Tk text widget indices that specify the matching range of characters in string. The first points to the first character in the range and the second points to the position after the last character in the range. A text widget index is of the form line.char that indicates the char'th character on line line. Lines are numbered from 1. Within a line, characters are numbered from 0.

-all
Instead of returning after a single match, all ranges in string that match exp are found. Returns the number of matches found. The matchVar and subMatchVars are set to an empty list, and as each match is found an element is appended to each variable's list. If the -indices switch is used then two elements are appended to each list for each match found.

-split
Implies -all and -flatten. In addition, the text matched by the entire exp is not appended to matchVar. Instead the text that preceded the match is appended, followed by any captured subexpressions. Finally, when exp fails to match, any remaining unmatched text from string is appended to matchVar. Note that this switch is used by the nresplit command.

-limit num
Limits the number of matches found by -all or -split to num. num must be an integer. If the limit is reached then nrematch acts as if no more matches exist.

-start num
Start matching input at the offset num. This switch will not change the result index values; those are still computed from the start of string.

-end num
Act as if the input string was only of length num. In combination with -start this can save a call to string range.

-prune
Normally an empty element will be added to the match result if a subexpression did not match at all. The -prune switch changes this behavior so that if a subexpression did not match at all nothing will be added to the result for it. Note that pruning is only done for -all and -split.

-list
Causes all of the matched strings to be put in the required matchVar as a list. The command detects the number of captured subexpressions in exp and adds that number of additional elements to the matchVar list. In the case of the -all switch each element added to matchVar is itself a list whose size is the number of captured subexpressions plus one. In the case of the -split switch an additional element will be added to matchVar for each match if the number of captured subexpressions is greater than 0. That additional element will be a list which will have an element for each captured subexpression.

-flatten
Like -list except a sublist will not be used. Instead the captured subexpressions will be appended directly to matchVar.

-eval
Causes matchScript and any subMatchScripts to be evaluated for each match found. Before any evaluation is done, matchVar and any subMatchVars are set to the values they would have without the -eval switch. If you don't care about a particular part of the match then use an empty string for that match variable. The scripts are evaluated from left to right. A script can be an empty string if no evaluation is desired for a particular match variable. Since the match variables are all set before script evaluation, they can all be accessed from any of the scripts. If a subexpression did not match at all (not even an empty string) then its corresponding subMatchScript will not be evaluated. If an evaluated script executes break or return then no more scripts will be evaluated. In the case of -all or -split, nrematch will act as if the current match did not happen and that no more matches exist. If continue is executed and -all or -split is used then nrematch will act as if the current match did not happen and will try to find another match.

-tryagain varname
If exp could have matched had string contained additional text at its end, then varname will be set to the offset at which additional matches should be attempted once additional input is appended to string. It will be set to -1 if trying again with the current input could not yield a match. This switch should only be used if the input is being read from a stream which may have additional input.

--
Marks the end of switches. The argument following this one will be treated as exp even if it starts with a -.
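The following sketch illustrates two of the switches above, -indices and -all. It assumes the nre package is installed. For -indices it falls back to Tcl's built-in regexp when nre is missing, since both are documented to store first and last character indices. The -all part runs only when nre loads, because the built-in command handles match variables differently under -all. The patterns, inputs, and variable names are illustrative assumptions, and the -all result has not been run against nre here.

```tcl
# Assumption: prefer nre; fall back to the built-in regexp for -indices.
set haveNre [expr {![catch {package require nre}]}]
if {!$haveNre} {
    interp alias {} nrematch {} regexp
}

# -indices: bs receives the first and last index of the (b+) group.
set input "xabbbc"
nrematch -indices {a(b+)c} $input whole bs
set first [lindex $bs 0]
set last  [lindex $bs 1]
puts "bs=$bs text=[string range $input $first $last]"

# -all (nre only): per the description above, nums should collect one
# element per match and count should be the number of matches.
if {$haveNre} {
    set count [nrematch -all {[0-9]+} "a1b22c333" nums]
    puts "count=$count nums=$nums"
}
```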

REGULAR EXPRESSIONS

Regular expressions are implemented using Henry Spencer's package (thanks, Henry!), and much of the description of regular expressions below is copied verbatim from his manual entry.

A regular expression is zero or more branches, separated by ``|''. It matches anything that matches one of the branches.

A branch is zero or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.

A piece is an atom possibly followed by ``*'', ``+'', ``?'', or ``{x,y}'', which in turn might be followed by a ``?''. A ``*'' matches a sequence of 0 or more matches of the atom. A ``+'' matches a sequence of 1 or more matches of the atom. A ``?'' matches a sequence of 0 or 1 matches of the atom. A ``{x}'' matches a sequence of x matches of the atom. A ``{x,}'' matches a sequence of x or more matches of the atom. A ``{x,y}'' matches a sequence of at least x and at most y matches of the atom. By default a piece will match as long a sequence as possible. However, if one of the piece constructs described above has a ``?'' after it then the piece will match as short a sequence as possible.

Note that the ``{x,y}'' repetition construct is only recognized if the p flag is set.
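To make the difference between the default (longest) and ``?''-suffixed (shortest) behavior concrete, here is a small sketch. It assumes the nre package is installed and falls back to Tcl's built-in regexp, which gives the same result for these two patterns, when it is not. The input string and variable names are made up for illustration.

```tcl
# Assumption: use nre when available; otherwise the built-in regexp.
if {[catch {package require nre}]} {
    interp alias {} nrematch {} regexp
}

set input "<a><b>"

# ".+" takes as long a sequence as possible, so the match runs to the
# final ">" in the input.
nrematch {<.+>} $input greedy

# ".+?" takes as short a sequence as possible, so the match stops at
# the first ">".
nrematch {<.+?>} $input lazy

puts "greedy=$greedy"
puts "lazy=$lazy"
```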

An atom is a regular expression in parentheses (matching a match for the regular expression), a range (see below), ``.'' (matching any single character), ``^'' (matching the null string at the beginning of the input string), ``$'' (matching the null string at the end of the input string), a ``\'' followed by a single character (matching that character or matching something special if the p flag is used; see the FLAGS section for details), or a single character with no other significance (matching that character).

A range is a sequence of characters enclosed in ``[]''. It normally matches any single character from the sequence. If the sequence begins with ``^'', it matches any single character not from the rest of the sequence. If two characters in the sequence are separated by ``-'', this is shorthand for the full list of ASCII characters between them (e.g. ``[0-9]'' matches any decimal digit). To include a literal ``]'' in the sequence, make it the first character (following a possible ``^''). To include a literal ``-'', make it the first or last character. A range can also contain POSIX character classes. They represent a sequence of characters just as two characters separated by ``-'' do. However, the sequence is determined using the functions from the C runtime library and the current locale. The following POSIX character classes are supported:

[:alnum:]
Alphabetic and numeric characters. Defined by isalnum().

[:alpha:]
Alphabetic characters. Defined by isalpha().

[:cntrl:]
Control characters. Defined by iscntrl().

[:digit:]
Digit characters. Defined by isdigit().

[:graph:]
Printable characters excluding a space. Defined by isgraph().

[:lower:]
Lowercase alphabetic characters even if the i switch is used. Defined by islower().

[:print:]
Printable characters including a space. Defined by isprint().

[:punct:]
Punctuation characters. Defined by ispunct().

[:space:]
Whitespace characters. Defined by isspace().

[:upper:]
Uppercase alphabetic characters even if the i switch is used. Defined by isupper().

[:xdigit:]
Characters allowed in a hexadecimal number. Defined by isxdigit().

A parentheses atom in which the character immediately after the ``('' is a ``?'' is a special construct with one of the following meanings:

``(?:''regexp``)'' is a shy group. It groups like ``()'' but, unlike ``()'', does not capture the text for backreferences. It matches if regexp matches.

``(?=''regexp``)'' is a non-capturing zero-width positive lookahead assertion. It matches if regexp matches. The matched text is not consumed.

``(?!''regexp``)'' is a non-capturing zero-width negative lookahead assertion. It matches if regexp does not match.

``(?#''any text``)'' is a comment. The entire atom is treated as an empty string.

``(?ipxm)'' is used to set flags. Any combination of the flag characters ``ipxm'' is allowed. The entire atom is treated as an empty string. See the FLAGS section for a description of each flag.

``(?|''range``)'' is an alternate syntax for a character range. Its benefit is that it does not use the Tcl special characters ``[]'' to enclose the range.

FLAGS

Flags can be set using a ``(?''flag-char``)'' atom. Some commands that use regular expressions have options that set some of these same flags. For example the -nocase option sets the i flag. The advantage of having the flags in the regular expression itself is that they can then be used by any command without the need to add new command switches. It is best to set the flags at the very beginning of the regular expression; however they apply to the entire regular expression no matter where they appear.
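As a hedged illustration of the point that -nocase and the i flag have the same effect, the sketch below runs the same match three ways. It assumes the nre package is installed and falls back to Tcl's built-in regexp, which accepts both spellings, when it is not. The input string and variable names are illustrative only.

```tcl
# Assumption: use nre when available; otherwise the built-in regexp.
if {[catch {package require nre}]} {
    interp alias {} nrematch {} regexp
}

set input "HELLO world"

# Case-insensitive via the command switch.
set viaSwitch [nrematch -nocase {hello} $input]

# Case-insensitive via the flag embedded in the expression itself.
set viaFlag [nrematch {(?i)hello} $input]

# Without either, the lowercase pattern does not match the uppercase text.
set plain [nrematch {hello} $input]

puts "switch=$viaSwitch flag=$viaFlag plain=$plain"
```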

The i flag causes case to be ignored when alphabetic characters are compared.

The m flag enables multi-line mode. The ``^'' atom is changed to match at the beginning of the string or the beginning of any line in the string. The ``$'' atom is changed to match at the end of the string or the end of any line in the string. The ``.'' atom is changed to match any character except ``\n''.

The x flag causes whitespace in the regular expression to be ignored and removed during compilation. To include literal whitespace as an atom to be matched, precede it with a backslash ``\''. Whitespace is only ignored between atoms, pieces, branches, and regular expressions. It is not ignored in ranges or in any other complex atom. Comments are treated as whitespace: a comment starts with a ``#'' and continues to the end of the line.

The q flag enables quick compile mode. This turns off expensive optimizations that tend to slow down compilation of the regular expression. This can speed up compile time but may slow down match time. Some regular expressions are already optimized by their writer, so further optimization is a waste of time. Another reason to use this flag is if most of your time is spent compiling. This can happen if the regular expression needs to be compiled each time and the text being matched against is small. This flag can also be used to work around bugs in the optimizer.

The p flag enables extra escape sequences and constructs to be recognized. See the BACKWARDS COMPATIBILITY section for why these constructs are not enabled by default. The following are enabled:

\w
Match a "word" character. Same as ``[_[:alnum:]]''.

\W
Match a non-word character. Same as ``[^_[:alnum:]]''.

\s
Match a whitespace character. Same as ``[[:space:]]''.

\S
Match a non-whitespace character. Same as ``[^[:space:]]''.

\d
Match a digit character. Same as ``[[:digit:]]''.

\D
Match a non-digit character. Same as ``[^[:digit:]]''.

\b
Zero-width assertion matches a word boundary. Current character matches \w and previous character matches \W or current character matches \W and previous character matches \w. The position before the first character in the string and after the last character match \W.

\B
Zero-width assertion matches a non-word boundary. Current character matches \w and previous character matches \w or current character matches \W and previous character matches \W. The position before the first character in the string and after the last character match \W.

\<
Zero-width assertion matches start of word. Current character matches \w and previous character matches \W. The position before the first character in the string and after the last character match \W.

\>
Zero-width assertion matches end of word. Current character matches \W and previous character matches \w. The position before the first character in the string and after the last character match \W.

\A
Zero-width assertion that matches only at the beginning of the string, even if the m flag is set.

\Z
Zero-width assertion that matches only at the end of the string, even if the m flag is set.

\G
Zero-width assertion matches only where previous -all match left off.

\Q
Quote mode. All characters following are treated as literal text until a \E or the end of the regular expression.

\E
End quote mode.

\num
Backreference to the num'th captured substring. The value of num must not be greater than the number of captured substrings to the left of the backreference. The text from the backreference is inserted into the regular expression and is always treated as literal text.

\meta
If the ``\'' is followed by a regular expression meta character then the meta character is treated as literal text. The meta characters are: ``\*+?()|[]{}^$''. If ``\'' is followed by anything else the regexp compiler will raise an error.

{x,y}
This piece construct is a repetition operator and is described above in the piece paragraph.

CHOOSING AMONG ALTERNATIVE MATCHES

In general there may be more than one way to match a regular expression to an input string. For example, consider the command
nrematch  (a*)b*  aabaaabb  x  y
Considering only the rules given so far, x and y could end up with the values aabb and aa, aaab and aaa, ab and a, or any of several other combinations. To resolve this potential ambiguity nrematch chooses among alternatives using the following rules, applied in decreasing order of priority:

  1. If a regular expression could match two different parts of an input string then it will match the one that begins earliest.

  2. If a regular expression contains | operators then the leftmost matching sub-expression is chosen.

  3. In *, +, ?, and {x,y} constructs, longer matches are chosen in preference to shorter ones. These operators are often called greedy because they match the longest possible string that allows the entire regular expression to match. In *?, +?, ??, and {x,y}? constructs, shorter matches are chosen in preference to longer ones. These operators are often called lazy because they match the shortest possible string that allows the entire regular expression to match.

  4. In sequences of expression components the components are considered from left to right.

In the example from above, (a*)b* matches aab: the (a*) portion of the pattern is matched first and it consumes the leading aa; then the b* portion of the pattern consumes the next b. Or, consider the following example:
nrematch  (ab|a)(b*)c  abc  x  y  z
After this command x will be abc, y will be ab, and z will be an empty string. Rule 4 specifies that (ab|a) gets first shot at the input string and Rule 2 specifies that the ab sub-expression is checked before the a sub-expression. Thus the b has already been claimed before the (b*) component is checked and (b*) must match an empty string.
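The two commands discussed in this section can be run as a short script to confirm the stated results. This sketch assumes the nre package is installed and falls back to Tcl's built-in regexp, which yields the same values for these two particular expressions, when it is not. The second set of variables is renamed (x2, y2, z2) only to keep both results visible.

```tcl
# Assumption: use nre when available; otherwise the built-in regexp.
if {[catch {package require nre}]} {
    interp alias {} nrematch {} regexp
}

# Rule 1 picks the earliest start; (a*) then consumes the leading "aa"
# and b* consumes the following "b".
nrematch {(a*)b*} aabaaabb x y
puts "x=$x y=$y"

# (ab|a) is tried first and its left alternative "ab" claims the "b",
# leaving (b*) to match an empty string before the "c".
nrematch {(ab|a)(b*)c} abc x2 y2 z2
puts "x2=$x2 y2=$y2 z2=<$z2>"
```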

LIMITS

The maximum number of capturing subexpressions ``()'' in a single regular expression is 255. This limit does not apply to the non-capturing ``(?:)''.

A compiled regular expression is limited in size to 32678 bytes. If during compilation it is discovered that the regular expression requires more memory then the operation will fail with the error: ``regexp too big''.

The counts in the repetition construct ``{x,y}'' must be greater than or equal to zero and less than or equal to 255.

The maximum number of unique ranges in a regular expression is 64.

BACKWARDS COMPATIBILITY

Regular expressions from previous releases of Tcl should behave exactly the same. The following new constructs:

(?...), *?, +?, and ??

caused compilation errors in older releases, so no existing regular expression can depend on them and they are always recognized in new regular expressions.

All the other new constructs would have meant something else in older regular expressions. So they always have the old meaning unless you turn on one of the new flags. For example you need to start a regular expression with (?p) if you want to use the new ``\'' sequences or the ``{x,y}'' repetition construct.

PERFORMANCE INFORMATION

The first time a regular expression is used it is compiled into a Tcl object. The next time that object needs to be used as a regular expression the compilation step will not be needed if the object still exists and is still a regular expression. So if the regular expression is a constant string:
nrematch {abc|def|zeq} $str
then the first time the above command is executed the string constant object is converted to a regular expression object and will remain so giving a performance boost.

However if the regular expression string is not constant:

nrematch "$W1|$W2|$W3" $str
then the string object will need to be recreated each time the above command executes.

If instead you stored the regular expression string into a variable then the regular expression object would remain and not need to be recreated each time:

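# Build the expression once so its compiled form can be reused.
set re "$W1|$W2|$W3"
proc foo {str} {
    # re is the global variable holding the expression built above.
    global re
    nrematch $re $str
}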
If it is a complex regular expression used in more than one place this can be a win in both time and space.

It is best to use (?i) instead of -nocase if you can because then the text of the regular expression object describes its state.

If you do not need the matchVar or a subMatchVar then you can set that argument to an empty string ``{}''. This tells nrematch to not bother setting a variable to that particular captured subexpression.

BINARY CLEAN

The new regular expression compiler and matcher are binary clean. This means that it is ok for the regular expression and the string being matched to contain binary data including null bytes.

EXAMPLES

To match a C comment:
nrematch {/\*.*?\*/} $str

To match a number if not followed by a period:

nrematch {[0-9]+(?![.])} $str

To match a number if followed by something other than a period:

nrematch {[0-9]+(?=[^.])} $str

To match an item that contains only letters, but not all uppercase:

nrematch {^(?![A-Z]*$)[a-zA-Z]*$} $str

To see if a string contains both 'this' and 'that':

nrematch {^(?=.*?this)(?=.*?that)} $str

KEYWORDS

match, nre, regular expression, string

Last change: 3.0

[ nre3.0 ]

Copyright © 1997 Darrel Schneider.