MzScheme provides built-in support for regular expression pattern matching on strings, implemented by Henry Spencer's package. Regular expressions are specified as strings, using the same pattern language as the Unix utility egrep. String-based regular expressions can be compiled into a regexp value for repeated matches. The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 special characters.
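As a minimal sketch of compiling a pattern once and reusing it for repeated matches, the following assumes only the regexp and regexp-match procedures used in the examples of this section. The names digits and first-number are hypothetical and chosen for illustration.

```scheme
;; Compile the pattern once; `digits' is a hypothetical name.
(define digits (regexp "[0-9]+"))

;; Hypothetical helper: returns the first run of digits in s, or #f.
(define (first-number s)
  (let ((m (regexp-match digits s)))
    (and m (car m))))

;; The same compiled regexp value is reused for every string.
(map first-number (list "room 101" "no numbers" "42nd street"))
; => ("101" #f "42")
```

A pattern string can be passed directly to the matching procedures as well; compiling first is only a convenience when the same pattern is applied many times.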
Figure 10.1: Grammar for regular expressions
The format of a regular expression is specified by the grammar in Figure 10.1. A few subtle points about the regexp language are worth noting:
The regular expression procedures are:
For regexp-match, additional strings are returned in the list if pattern contains parenthesized sub-expressions; matches for the sub-expressions are provided in the order of the opening parentheses in pattern. When sub-expressions occur in branches of an ``or'' (``|''), in a ``zero or more'' pattern (``*''), or in a ``zero or one'' pattern (``?''), #f is returned for any sub-expression that did not contribute to the final match. When a single sub-expression occurs in a ``zero or more'' pattern (``*'') or a ``one or more'' pattern (``+'') and is used multiple times in a match, the rightmost match associated with the sub-expression is returned in the list.
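The sub-expression rules can be illustrated with a few small matches. This is a sketch with made-up patterns and inputs, using only regexp-match as it appears elsewhere in this section.

```scheme
;; Sub-expressions are reported in the order of their opening parentheses.
(regexp-match "((a)(b))c" "abc") ; => ("abc" "ab" "a" "b")

;; A branch of ``|'' that did not contribute to the match yields #f.
(regexp-match "(a)|(b)" "b")     ; => ("b" #f "b")

;; A sub-expression under ``*'' that matched zero times yields #f.
(regexp-match "(a)*b" "b")       ; => ("b" #f)

;; A sub-expression under ``?'' that was used yields its match.
(regexp-match "(a)?b" "ab")      ; => ("ab" "a")

;; A sub-expression used several times under ``+'' reports its rightmost match.
(regexp-match "([a-z])+" "abc")  ; => ("abc" "c")
```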
In regexp-replace, if insert-string contains ``&'', then ``&'' is replaced with the matching portion of string before it is substituted into string. If insert-string contains ``\n'' (for some integer n), then it is replaced with the nth matching sub-expression from string. ``&'' and ``\0'' are synonymous. If the nth sub-expression was not used in the match or if n is greater than the number of sub-expressions in pattern, then ``\n'' is replaced with the empty string.
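The following sketch shows how ``&'' and ``\n'' behave in an insert string. The patterns and inputs are made up for illustration. Note that a backslash must itself be escaped inside a Scheme string literal, so ``\1'' is written "\\1".

```scheme
;; ``&'' and ``\0'' both stand for the whole matched portion.
(regexp-replace "c[a-z]*" "mi casa" "[&]")    ; => "mi [casa]"
(regexp-replace "c[a-z]*" "mi casa" "[\\0]")  ; => "mi [casa]"

;; ``\1'' and ``\2'' refer to the first and second sub-expressions.
(regexp-replace "([a-z]+) ([a-z]+)" "mi casa" "\\2 \\1") ; => "casa mi"

;; A sub-expression that was not used in the match becomes the empty string.
(regexp-replace "(x)?casa" "mi casa" "<\\1>") ; => "mi <>"

;; So does a reference beyond the number of sub-expressions in the pattern.
(regexp-replace "casa" "mi casa" "<\\3>")     ; => "mi <>"
```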
A literal ``&'' or ``\'' is specified as ``\&'' or ``\\'', respectively. If insert-string contains ``\$'', then ``\$'' is replaced with the empty string. (This can be used to terminate a number n following a backslash.) If a ``\'' is followed by anything other than a digit, ``&'', ``\'', or ``$'', then it is treated as ``\0''.
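A short sketch of the escaping rules, again with made-up inputs. Each backslash in the insert string is doubled in the Scheme string literal, so ``\&'' is written "\\&" and ``\\'' is written "\\\\".

```scheme
;; ``\&'' inserts a literal ampersand rather than the matched text.
(regexp-replace "and" "salt and pepper" "\\&") ; => "salt & pepper"

;; ``\\'' inserts a single literal backslash
;; (the result is the three-character string a\b).
(regexp-replace "/" "a/b" "\\\\")              ; => "a\\b"

;; ``\$'' expands to nothing, which ends the number after the backslash:
;; here ``\1'' is followed by a literal 0 rather than being read as ``\10''.
(regexp-replace "([0-9])" "v7" "\\1\\$0")      ; => "v70"
```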
(regexp-replace* pattern string insert-string) is the same as regexp-replace, except that every instance of pattern in string is replaced with insert-string. Only non-overlapping instances of pattern in the original string are replaced, so instances of pattern within inserted strings are not replaced recursively.
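The difference between the two procedures, and the non-recursive treatment of inserted text, can be sketched with a few made-up inputs:

```scheme
;; regexp-replace changes only the first instance; regexp-replace* changes all.
(regexp-replace  "a" "banana" "o")  ; => "bonana"
(regexp-replace* "a" "banana" "o")  ; => "bonono"

;; Inserted text is not searched again, so this terminates even though
;; the insert string itself contains the pattern.
(regexp-replace* "a" "banana" "aa") ; => "baanaanaa"

;; Only non-overlapping instances are replaced: "aaaaa" contains two
;; non-overlapping instances of "aa", and the final "a" is left over.
(regexp-replace* "aa" "aaaaa" "b")  ; => "bba"
```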
Examples:
(define r (regexp "(-[0-9]*)+"))
(regexp-match r "a-12-345b") ; => ("-12-345" "-345")
(regexp-match-positions r "a-12-345b") ; => ((1 . 8) (4 . 8))
(regexp-match "x+" "12345") ; => #f
(regexp-replace "mi" "mi casa" "su") ; => "su casa"
(define r2 (regexp "([Mm])i ([a-zA-Z]*)"))
(define insert "\\1y \\2")
(regexp-replace r2 "Mi Casa" insert) ; => "My Casa"
(regexp-replace r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza Mi Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza My Mi Mi"