11 The Perl Compatible Regular Expression Library

The Perl Compatible Regular Expression Library (PCRE) implements the pcre pattern with the following attributes:

Method attributes: init, options, match, matchAll, replace, replaceAll
Value attribute: subPatterns
Exception attribute: compilationError

11.1 Regular Expressions

A regular expression (regexp, for short) is a pattern that denotes a set of strings, possibly an infinite set. Searching for matches for a regexp is a very powerful operation that editors and scripting languages on Unix systems have traditionally offered.

This library provides a regular expression syntax that is compatible with that of the language Perl. It is based on Philip Hazel's PCRE library, and most of Philip Hazel's documentation also applies to the BETA version.

11.2 pcre

The pcre pattern encapsulates a regular expression. It takes a Text reference as its enter parameter and passes it on to the init method.

The pcre pattern has an empty do-part.

The pcre pattern exits a reference to itself.

You can use the pcre pattern as in the following example:

 
   re: ^Pcre;
do 'trigger' -> pcre -> re[];
   (filename[], re[]) -> myGrep;

11.3 init

The init method takes a Text reference as an enter parameter. This string describes the regular expression according to the syntax described in the pcre documentation. Init compiles the regular expression into an internal format suitable for matching against strings. This operation takes some CPU time, so the result (stored in the pcre object) should be kept if the same pattern is to be used many times.

When compiling the regular expression the options defined by the options method are used.

You can call init several times if you want to change the regular expression matched by the pcre object.
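
As a minimal sketch of re-initialising one pcre object (the names re and found are illustrative, and the fragment assumes that match can be invoked on a reference in the same way as on the expressions used elsewhere in this chapter):

(# re: ^Pcre;
   found: @Boolean;
do '\\d+' -> pcre -> re[];        (* compile a first expression *)
   'abcdef' -> re.match -> found; (* no digits in the text *)
   '[a-z]+' -> re.init;           (* recompile the same object *)
   'abcdef' -> re.match -> found; (* now matched against the new expression *)
#);

Each call to init compiles again, so this only saves allocating a new object, not the compilation cost.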

11.4 options

The options method is a virtual pattern, which you can specialise. Put the options you need into the do part. For example:

 
   re: @Pcre (# options:: (# do CASELESS; DO_STUDY; #); #)
do 'tRiGgEr' -> re;
   (filename[], re[]) -> myGrep;

There is an alternative way to specify certain options: place them in the textual representation of the regular expression. For example, the option CASELESS can be specified by prepending the string '(?i)' to the regular expression.

The following options are supported:

Option          Text version  Used in        Description
CASELESS        (?i)          init           Ignore case when matching
MULTILINE       (?m)          init           ^ and $ match after/before newlines
DOTALL          (?s)          init           . matches newlines
EXTENDED        (?x)          init           Extended regexp syntax
ANCHORED        ^             init           Match only at start of string
DOLLAR_ENDONLY  -             init           $ doesn't match before terminal newline
EXTRA           (?X)          init           Support PCRE extensions to Perl regexps
NOTBOL          -             init or match  Do not match ^ at start of string
NOTEOL          -             init or match  Do not match $ at end of string
UNGREEDY        (?U)          init           Quantifiers not greedy by default
NOTEMPTY        -             init or match  Empty string cannot match entire expression
C_LOCALE        -             init           Use C locale instead of default locale
DO_STUDY        -             init           Study regexp after compiling it
RETURN_NONE     -             init or match  Return NONE for subpatterns that didn't match

11.5 match

The match method takes a Text reference and exits true or false, depending on whether the text matched the expression. It also contains a set of methods that can be overridden to provide much more information about the match.

The INNER part of the match method is only called in the case of a match.

The match method contains the following methods:

options
pre
matchPos
matchText
preMatchText
postMatchText
subMatchPos
subMatchText
sub1, sub2, sub3...
noMatch
position

11.6 match.options

This method can be overridden in much the same way as the options method in the pcre pattern in order to pass options to the matching stage of the regular expression engine.

11.7 match.pre

This method is called before any matching takes place. It does nothing, but you can specialise it in your own subclasses.

11.8 match.matchPos

This method can be called from the inner part of the match method. It exits an integer pair, indicating the start and end positions of the matched text in the original text. See the example below.

11.9 match.matchText

This method can be called from the inner part of the match method. It exits a Text reference indicating the text that matched the regular expression. See the example below.

11.10 match.preMatchText and match.postMatchText

These methods can be called from the inner part of the match method. They exit a Text reference indicating the text that preceded (or followed) the text that matched the regular expression.

For example:

(# t1: ^Text;
   t2: ^Text;
   t3: ^Text;
   s: @Integer;
   e: @Integer;
do 'abc123def' -> ('\\d+' -> pcre).match
   (#
   do preMatchText -> t1[];
      matchText -> t2[];
      postMatchText -> t3[];
      matchPos -> (s, e);
   #);
   ...
#);

This puts 'abc' in t1, '123' in t2 and 'def' in t3. It also puts 4 in s and 6 in e.

11.11 match.subMatchPos

This method can be called from the inner part of the match method. It enters an integer and exits an integer pair, indicating the start and end positions of the nth subpattern in the original text. See the example below.

11.12 match.subMatchText

This method can be called from the inner part of the match method. It enters an integer and exits a text, indicating the text matched by the nth subpattern. See the example below.

11.13 match.sub1, match.sub2, match.sub3...

These methods can be called from the inner part of the match method. They exit a text, indicating the text matched by the nth subpattern. They are simply a shorthand method of invoking subMatchText. See the example below:

(# t1: ^Text;
   t2: ^Text;
   t3: ^Text;
   s: @Integer;
   e: @Integer;
do 'abc123def' -> ('([a-z])(\\d+)([a-z]+)' -> pcre).match
   (#
   do sub1 -> t1[];
      sub2 -> t2[];
      3 -> subMatchText -> t3[];
      3 -> subMatchPos -> (s, e);
   #);
   ...
#);

This puts 'c' in t1, '123' in t2 and 'def' in t3. It also puts 7 in s and 9 in e.

11.14 match.noMatch

This method is called by match when no match is found. You can specialise it to specify an action if no match is found.
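
A minimal sketch of specialising noMatch, assuming it is further bound in the same way as rep in the replace example, and using putline from the basic environment for output (the variable found is illustrative):

(# found: @Boolean;
do 'no digits here' -> ('\\d+' -> pcre).match
   (# noMatch:: (# do 'no match found' -> putline #);
   do matchText -> putline;  (* INNER part: only reached on a match *)
   #) -> found;
#);

Here the text contains no digits, so the noMatch action is the one that should run, and found should end up false.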

11.15 match.position

This method controls where in the input string the search for a regular expression match starts. You can specialise it, putting a different number into the variable 'value'.
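
A minimal sketch of specialising position, assuming it is further bound like the other virtual methods in this chapter. Whether the number counts from 1 (as the positions exited by matchPos appear to) is an assumption, so the value is chosen to skip the first group of digits either way:

(# t: ^Text;
do 'abc123def456' -> ('\\d+' -> pcre).match
   (# position:: (# do 7 -> value #);  (* start searching after '123' *)
   do matchText -> t[];
   #);
#);

With the search starting after the first digits, t should refer to '456' rather than '123'.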

11.16 replace

The replace method inherits from the match method. It takes two inputs: first a Text reference to a search string, and second a Text reference to a default replacement string. It exits two values: first a boolean (true or false), depending on whether the text matched the expression, and second a Text reference to the new text with the replacement carried out. If no replacement is carried out, the text exited is a copy of the search string entered. The replace method also contains a set of methods that can be overridden to provide much more information about the match and to control the replacement text more accurately. See the examples under replace.rep below.

The INNER part of the replace method is only called in the case of a match.

The replace method contains the following methods:

options
pre
matchPos
matchText
preMatchText
postMatchText
subMatchPos
subMatchText
sub1, sub2, sub3...
noMatch
position
rep

11.17 replace.rep

This method controls the replacement string. The 'value' variable is a reference to the default replacement text. By assigning a new reference to 'value' you can dynamically choose another replacement string, based on information gleaned from the other methods available in replace.

For example, using only the default replacement text:

(# t1: ^Text;
   p: @Boolean;
do ('The y2k problem', 'year 2000') ->
      ('\\by2k\\b' -> pcre).replace ->
      (p, t1[]);
   ...
#);

This puts 'The year 2000 problem' in t1. (The escape sequence '\b' in a regular expression matches a word boundary. In a BETA string you have to double the backslash.)

The replacement text can also be computed in rep:

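(# t1: ^Text;
   p: @Boolean;
do ('The y3k problem', '') ->
      ('\\by([0-9]+)k\\b' -> pcre).replace
      (# rep::
         (#
         do 'year %s000' -> putFormat (# do sub1 -> s #) -> value[];
         #);
      #) -> (p, t1[]);
   ...
#);

Here rep builds the replacement text from the first subpattern ('3'), so this puts 'The year 3000 problem' in t1.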

11.18 matchAll

This method is similar to match, but calls INNER several times, once for each match. It is not yet fully documented. Please see pcre.bet comments and demo programs.

11.19 replaceAll

This method is similar to replace, but calls INNER several times, once for each match. It is not yet fully documented. Please see pcre.bet comments and demo programs.

11.20 subPatterns

This is a read-only integer pattern that tells you how many subpatterns your regular expression has. It is only useful if you are reading regular expressions from a config file or from the user, since otherwise you should already know this figure.
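
A minimal sketch of using subPatterns to visit every subpattern of an expression whose structure is not known in advance. It assumes that subPatterns can be evaluated as re.subPatterns and that putline from the basic environment is available; the name re is illustrative:

(# re: ^Pcre;
do '(\\d+)-(\\d+)-(\\d+)' -> pcre -> re[];
   '2002-01-04' -> re.match
   (#
   do (for i: re.subPatterns repeat
         i -> subMatchText -> putline;  (* text of the i'th subpattern *)
      for);
   #);
#);

For this expression subPatterns should be 3, so the loop should print the year, month and day parts on separate lines.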

11.21 compilationError

This pattern is executed if your regular expression contains syntax errors. In this case it is not a good idea to call match or replace on that pattern.


Basic Libraries - Reference Manual
© 1990-2002 Mjølner Informatics
[Modified: Friday January 4th 2002 at 13:10]