=head1 TITLE Boolean Regexes =head1 VERSION Maintainer: Richard Proctor Date: 6 Sep 2000 Last Modified: 1 Oct 2000 Mailing List: perl6-language-regex@perl.org Number: 198 Version: 3 Status: Frozen =head1 ABSTRACT This is a development of the proposal for the "not a pattern" concept in RFC 166 V1. Looking deeper into the handling of advanced regexs, there are potential needs for many other concepts, to allow a regex to extract information directly from a complex file in one go, rather than a mixture of splits and nested regexes as is typically needed today. With these parsing data should become easier (in some cases). =head1 CHANGES V2 - Changed the "Fail Pattern", enhanced the wording for many things. V3 - Further clarifications =head1 DESCRIPTION It would be nice (in my opinion) to be able to build more elaborate regexes allowing data to be mined out of a sting in one go. These ideas allow you to apply several patterns to one substring (each must match), to fail a match from within, to look for patterns that do not contain other patterns, and to handle looking for cases such as (foo.*bar)|(bar.*foo) in a more general way of saying "A substring that contains both foo and bar". These are ideas, at present with some proposed syntax. The ideas are more important than the exact syntax at this stage. This is very much work in progress. I have called these boolean regexs as they bring the concepts of and (&&) or (||) and not(!) into the realm of regexes. Within a boolean regex (or the boolean part of a regex), several new symbols have meanings, and some have enhanced meanings. =head2 The Ideas Are these part of a boolean (?&...) construct within an existing regex, or is the advanced syntax (and meaning of &|!^$) invoked by a new flag such as /B? These can look like line noise so the use of white space with /x is used throughout, and it might be appropriate to enforce (or assume) /x within (&...). I am not alone in wanting something like this, Ilya in a posting of lots of things he would like had : onion rings: (?<> A <> B &! C & D) (substring matched by A such that B and D match against it, but C does not, in B, C, D \A and \z denote boundaries of what was matched by A); Which is virtually identical in concept. =head3 Boolean construct (?&...) grabs a substring, and applies one or more tests to the substring. =head3 Substring matching multiple patterns (&&) (?& pattern1 && pattern2 && pattern3 ) A substring is definied that matches each pattern. For example, the first pattern may say specify a substring of at least 30 chars, the next two have a foo and a bar. =head3 Substring matching alternative patterns (||) (?& pattern1 || pattern2 || pattern3) This is similar to the existing alternative syntax "|" but the alternatives to "|" behave as /^pattern$/ rather than /pattern/ (^ and $ taken as refereing to the substring in this case - see below). (pattern1 || pattern2 || pattern3) can be mixed in with the && case above to build up more advanced cases. && and || operators can be nested with brackets in normal ways. =head3 Brackets within boolean regexes Within a complex boolean regex there are likely to be lots and lots of brackets to nest and control the behaviour of the regex. Rather than having to sprinkle the regex with (?:) line noise, it would be nicer to just use ordinary brackets () and only support capturing of elements by using one of the (?$=) or (?%=) constructs that have been proposed elsewhere (RFC 112 and RFC 150). There might be some case for this as a general capability using some flag /b = brackets? =head3 Substring not matching a pattern In RFC 166 I originally proposed (?^ pattern ). This proposal replaces that. Though it could be used as well outside of the (?&) construct. !pattern matches anything that does not match the pattern. On its own it is not terribly useful, but in conjuction with && and || one can do things such as /(?& )//ix; It might be possible to have a regex that simply matches valid perl6 out of this. (though it would be large...)! =head1 IMPLENTATION Implementation detail is not appropriate for this stage in the devlopment of this RFC. If the concepts gain approval then detailed implementation issues become relevant. There are two aspects to regexs - compiling and executing: Compiling of these extended forms should be relativly straight forward, but would need some extensions to recognise the regex as being within (?&) state to handle the extended syntax. Executing - No thoughts at all at present. =head1 REFERENCES RFC 166: Alternative lists and quoting of things RFC 112: Assignment within a regex RFC 150: Extend regex syntax to provide for return of a hash of matched subpatterns RFC 145: Brace-matching for Perl Regular Expressions (or at least the discussion that followed it)