11 The Perl Compatible Regular Expression Library

The Perl Compatible Regular Expression Library (PCRE) implements the pcre pattern with the method attributes

init
options
match
matchAll
replace
replaceAll

the value attribute

subPatterns

the exception attribute

compilationError

11.1 Regular Expressions

A regular expression (regexp, for short) is a pattern that denotes a set of strings, possibly an infinite set. Searching for matches for a regexp is a very powerful operation that editors and scripting languages on Unix systems have traditionally offered.

This implementation provides a regular expression syntax that is compatible with that of the language Perl. It is based on Philip Hazel's PCRE library, and most of Philip Hazel's documentation also applies to the BETA version.

11.2 pcre

The pcre pattern encapsulates a regular expression. It takes a Text reference as an enter parameter, which is passed on to the init method.

The pcre pattern has an empty do-part.

The pcre pattern exits a reference to itself.

You can use the pcre pattern as in the following example:

 
   re: ^Pcre;
do 'trigger' -> pcre -> re[];
   (filename[], re[]) -> myGrep;

11.3 init

The init method takes a Text reference as an enter parameter. This string describes the regular expression according to the syntax described in the pcre documentation. The init method compiles the regular expression into an internal format suitable for matching against strings. This operation takes some CPU time, so the result (stored in the pcre object) should be kept if the same pattern is to be used many times.

When compiling the regular expression the options defined by the options method are used.

You can call init several times if you want to change the regular expression matched by the pcre object.
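The compile-once advice above can be illustrated outside BETA. The following is a minimal sketch that uses Python's standard re module purely as a stand-in; it is not the BETA pcre interface, and the helper name count_matching_lines is hypothetical. The pattern is compiled a single time and the compiled object is reused for every line.

```python
import re

def count_matching_lines(pattern_text, lines):
    """Compile the pattern once, then reuse the compiled object for each line."""
    compiled = re.compile(pattern_text)  # the costly step, done a single time
    return sum(1 for line in lines if compiled.search(line))

lines = ["pull the trigger", "nothing here", "trigger happy"]
count = count_matching_lines(r"trigger", lines)
```

Recompiling inside the loop would give the same count but repeat the compilation work for every line, which is what keeping the pcre object around avoids.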

11.4 options

The options method is a virtual pattern, which you can specialise. Put the options you need into the do part. For example:

 
   re: @Pcre (# options:: (# do CASELESS; DO_STUDY; #); #)
do 'tRiGgEr' -> re;
   (filename[], re[]) -> myGrep;

There is an alternative way to specify certain options, which involves placing them in the textual representation of the regular expression. For example, the option CASELESS can be specified by prepending the string '(?i)' to the regular expression.
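The two ways of requesting case-insensitive matching can be compared in a short sketch. It uses Python's re module as a stand-in (an assumption for illustration only, not the BETA interface); Python accepts the same (?i) text form, and its re.IGNORECASE flag plays the role of the CASELESS option.

```python
import re

# Option supplied separately from the pattern text.
by_flag = re.compile(r"trigger", re.IGNORECASE)
# The same option embedded at the start of the pattern text.
by_text = re.compile(r"(?i)trigger")
# No option at all, for comparison.
plain = re.compile(r"trigger")

flag_hit = by_flag.search("tRiGgEr") is not None
text_hit = by_text.search("tRiGgEr") is not None
plain_hit = plain.search("tRiGgEr") is not None
```

Both option forms find the mixed-case text, while the plain pattern does not.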

The following options are supported

Option	Text version	Used in	Description
CASELESS	(?i)	init	Ignore case when matching
MULTILINE	(?m)	init	^ and $ match after/before newlines
DOTALL	(?s)	init	. matches newlines
EXTENDED	(?x)	init	Extended regexp syntax
ANCHORED	^	init	Match only at start of string
DOLLAR_ENDONLY		init	$ doesn't match before terminal newline
EXTRA	(?X)	init	Support PCRE extensions to Perl regexps
NOTBOL		init or match	Do not match ^ at start of string
NOTEOL		init or match	Do not match $ at end of string
UNGREEDY	(?U)	init	Quantifiers not greedy by default
NOTEMPTY		init or match	Empty string cannot match entire expression
C_LOCALE		init	Use C locale instead of default locale
DO_STUDY		init	Study regexp after compiling it
RETURN_NONE		init or match	Return NONE for subpatterns that didn't match

Notes:

More information
More details are available in the documentation of the pcre library.
When to set options
Some options are only used in the init method, while others are set to a default in the init method, but may be overridden in the options method of match (inherited by matchAll, replace and replaceAll).
C_LOCALE
Normally pcre will use the locale defined by your C library to determine whether a given character is a letter, etc. If you set this option, then pcre will use the C locale, i.e. only the characters a-z are letters. Most of the time this option will make no difference.
DO_STUDY
If you are going to be using a pattern many times, then this option may improve performance. See more about this in the documentation of the pcre library.
RETURN_NONE
Normally the subMatchText methods and similar will exit an empty string in the case where the subpattern didn't match at all. This helps prevent unexpected ref-NONE exceptions in your code. Unfortunately it also makes it difficult to tell the difference between a subpattern that matched an empty string and a subpattern that didn't match at all (because it was in an alternation that wasn't used). If you set this option then you risk getting NONE back from an invocation of subMatchText and your program must be able to cope with that.
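The distinction that RETURN_NONE exposes can be sketched with Python's re module, used here only as a stand-in for illustration. Python always reports a subpattern that took no part in the match as None, which roughly corresponds to the behaviour described for RETURN_NONE; that correspondence is an assumption, not a statement about the BETA implementation.

```python
import re

# The second alternative is not used when the input is 'a', so group 2
# takes no part in the match at all.
m = re.search(r"(a)|(b)", "a")
unused_group = m.group(2)   # None: the subpattern did not participate

# Here group 1 participates but matches zero characters.
m2 = re.search(r"(x*)b", "b")
empty_group = m2.group(1)   # '': the subpattern matched an empty string
```

Code that receives None in this way has to check for it before treating the value as text, which is the burden the note above describes.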

Clearing options
In some cases you may want to clear an option that has been set by a superpattern. You do this by prefixing the option name with clear. For example:

(* p is a perl regexp with my favourite options, including case
 * insensitivity, but just this once I want a case sensitive
 * regexp.
 *)
p: @PcreWithMyFavouriteOptions (# options:: (# do clearCASELESS #) #);

11.5 match

The match method takes a Text reference and exits true or false, depending on whether the text matched the expression. It also contains a set of methods that can be overridden to provide much more information about the match.

The INNER part of the match method is only called in the case of a match.

options
pre
matchPos
matchText
preMatchText
postMatchText
subMatchPos
subMatchText
sub1, sub2, sub3...
noMatch
position

11.6 match.options

This method can be overridden in much the same way as the options method in the pcre pattern in order to pass options to the matching stage of the regular expression engine.

11.7 match.pre

This method is called before any matching takes place. It does nothing, but you can specialise it in your own subclasses.

11.8 match.matchPos

This method can be called from the inner part of the match method. It exits an integer pair, indicating the start and end positions of the matched text in the original text. See the example below.

11.9 match.matchText

This method can be called from the inner part of the match method. It exits a Text reference indicating the text that matched the regular expression. See the example below.

11.10 match.preMatchText and match.postMatchText

These methods can be called from the inner part of the match method. They exit a Text reference indicating the text that preceded (or followed) the text that matched the regular expression.

For example:

(# t1: ^Text;
   t2: ^Text;
   t3: ^Text;
   s: @Integer;
   e: @Integer;
do 'abc123def' -> ('\\d+' -> pcre).match
   (#
   do preMatchText -> t1[];
      matchText -> t2[];
      postMatchText -> t3[];
      matchPos -> (s, e);
   #);
   ...
#);

Will put 'abc' in t1, '123' in t2 and 'def' in t3. It also puts 4 in s and 6 in e.
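The same split of the input can be reproduced with Python's re module, used here as a stand-in rather than the BETA interface. The one difference to watch is the position convention: Python reports a 0-based, end-exclusive span, so the sketch converts it to the 1-based, inclusive pair that the BETA example reports.

```python
import re

text = "abc123def"
m = re.search(r"\d+", text)

pre_text = text[:m.start()]    # text before the match
match_text = m.group(0)        # the matched text
post_text = text[m.end():]     # text after the match

# Python's span here is (3, 6). Adding 1 to the start gives the 1-based,
# inclusive pair; the exclusive end already equals the inclusive 1-based end.
start, end = m.start() + 1, m.end()
```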

11.11 match.subMatchPos

This method can be called from the inner part of the match method. It enters an integer and exits an integer pair, indicating the start and end positions of the nth subpattern in the original text. See the example below.

11.12 match.subMatchText

This method can be called from the inner part of the match method. It enters an integer and exits a text, indicating the text matched by the nth subpattern. See the example below.

11.13 match.sub1, match.sub2, match.sub3...

These methods can be called from the inner part of the match method. They exit a text, indicating the text matched by the nth subpattern. They are simply a shorthand method of invoking subMatchText. See the example below:

(# t1: ^Text;
   t2: ^Text;
   t3: ^Text;
   s: @Integer;
   e: @Integer;
do 'abc123def' -> ('([a-z])(\\d+)([a-z]+)' -> pcre).match
   (#
   do sub1 -> t1[];
      sub2 -> t2[];
      3 -> subMatchText -> t3[];
      3 -> subMatchPos -> (s, e);
   #);
   ...
#);

Will put 'c' in t1, '123' in t2 and 'def' in t3. It also puts 7 in s and 9 in e.
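A sketch of the same subpattern example in Python's re module (a stand-in for illustration, not the BETA interface) shows why the first subpattern is 'c' rather than 'a': the single letter must be immediately followed by digits, so the match can only begin at the 'c'. Positions are again converted from Python's 0-based, end-exclusive span to a 1-based, inclusive pair.

```python
import re

text = "abc123def"
m = re.search(r"([a-z])(\d+)([a-z]+)", text)

whole = m.group(0)                       # the complete match
first, second, third = m.group(1), m.group(2), m.group(3)

s0, e0 = m.span(3)                       # 0-based, end-exclusive span of group 3
start, end = s0 + 1, e0                  # 1-based, inclusive pair
```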

11.14 match.noMatch

This method is called by match when no match is found. You can specialise it to specify an action if no match is found.

11.15 match.position

This method controls where in the input string the search for a regular expression match starts. You can specialise it, putting a different number into the variable 'value'.
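The effect of a start position can be sketched with Python's re module, again only as a stand-in. The second argument of a compiled pattern's search method plays a role similar to position; note that Python counts from 0, and the numbering used by the BETA position method is not stated here, so the indices below are an assumption tied to Python.

```python
import re

pattern = re.compile(r"\d+")
text = "12 apples and 34 pears"

from_start = pattern.search(text).group(0)      # search begins at index 0
from_later = pattern.search(text, 2).group(0)   # search begins at index 2
```

Starting after the first number makes the search skip it and find the next one instead.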

11.16 replace

The replace method inherits from the match method. It takes two inputs: first a Text reference to a search string, and second a Text reference to a default replacement string. It exits two values: first a boolean (true or false), depending on whether the text matched the expression, and second a Text reference to the new text with the replacement carried out. If no replacement is carried out, the text exited is a copy of the search string entered. Replace also contains a set of methods that can be overridden to provide much more information about the match and to control the replacement text more accurately. See the example below.

The INNER part of the replace method is only called in the case of a match.

options
pre
matchPos
matchText
preMatchText
postMatchText
subMatchPos
subMatchText
sub1, sub2, sub3...
noMatch
position
rep

11.17 replace.rep

This method controls the replacement string. The 'value' variable is a reference to the default replacement text. By assigning a new reference to 'value' you can dynamically choose another replacement string, based on information gleaned from the other methods available in replace.

(# t1: ^Text;
   p: @Boolean;
do ('The y2k problem', 'year 2000') ->
      ('\\by2k\\b' -> pcre).replace ->
      (p, t1[]);
   ...
#);

Will put 'The year 2000 problem' in t1. (The escape sequence '\b' in a regular expression matches a word boundary. In a BETA string you have to double the backslash.)
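The pair of results exited by replace (a boolean and the new text) can be sketched in Python's re module, used as a stand-in only. The helper name replace_first is hypothetical, and limiting the replacement to the first match is an assumption based on the existence of a separate replaceAll method.

```python
import re

def replace_first(pattern_text, text, replacement):
    """Replace the first match only; return (matched, new_text)."""
    new_text, count = re.subn(pattern_text, replacement, text, count=1)
    return count > 0, new_text

matched, result = replace_first(r"\by2k\b", "The y2k problem", "year 2000")
unmatched, unchanged = replace_first(r"\by2k\b", "No match here", "year 2000")
```

When nothing matches, the text comes back unchanged together with a false flag, mirroring the behaviour described for replace.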

(# t1: ^Text;
   p: @Boolean;
do ('The y3k problem', '') ->
      ('\\by([0-9]+)k\\b' -> pcre).replace
      (# rep::
         (#
         do 'year %s000' -> putFormat (# do sub1 -> s #) -> value[];
         #);
      #) -> (p, t1[]);
   ...
#);
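Choosing the replacement text from the match itself, as the rep specialisation does, can be sketched in Python's re module (a stand-in for illustration, not the BETA interface). The function name year_from_match is hypothetical; it builds the replacement from the first subpattern in the same spirit as the 'year %s000' format.

```python
import re

def year_from_match(m):
    """Build the replacement text from the first subpattern of the match."""
    return "year %s000" % m.group(1)

result = re.sub(r"\by([0-9]+)k\b", year_from_match, "The y3k problem", count=1)
```

In this sketch the first subpattern is '3', so result holds 'The year 3000 problem'.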

[Modified: Friday January 4th 2002 at 13:10]