This file is part of the Perl 6 Archive

To see what is currently happening visit http://www.perl6.org/

TITLE

Replace =~, !~, m//, s///, and tr// with match(), subst(), and trade()

VERSION

  Maintainer: Nathan Wiger <nate@wiger.org>
  Date: 27 Aug 2000
  Last Modified: 15 Sep 2000
  Mailing List: perl6-language-regex@perl.org
  Number: 164
  Version: 3
  Status: Frozen
  Requires: RFC 170

ABSTRACT

Several people (including Larry) have expressed a desire to get rid of =~ and !~. This RFC proposes a way to replace m//, s///, and tr/// with three new builtins, match, subst, and trade. It also proposes a way to allow a full backwards-compatible syntax.

NOTES ON FREEZE

There appeared to be two types of people regarding this RFC: those who absolutely loved it, and those who hated it. I received many personal emails saying what a good idea this was, but there was also considerable dissension on the list about it. Most people agreed that the ability to overload regex functions and also use them in common applications, such as:

   print subst /$old/$new/g, @input;

was a distinct benefit. The main uneasiness, which I agree with completely, was about losing the ability to do stuff like this:

   next if /^$/ || /^#/;         # that's Perl, all right

However, the RFC in its current form still allows this through its 100% backwards-compatible syntax, which I think actually satisfied many of the detractors.

Personally, I would accept this proposal only if RFC 170 were also accepted, since that gives us a means for a backwards compatible syntax. RFC 170 also has the nice side effect of extending =~ to other operations as well. My feeling in the end is that the combination of this RFC and RFC 170 gives us the "best of both worlds":

   1. Prototypeable regex functions that can easily work on lists and
      be chained together, just like grep, map, split, and so forth.

   2. 100% backwards compatible, allowing the use of =~ for situations
      where it makes sense, but not where it doesn't.

I honestly feel that the combination of this RFC and RFC 170 gives us all the niceties of current Perl syntax with all the benefits of true functions, and is a win as such.

DESCRIPTION

Overview

Everyone knows how =~ and !~ work. Several proposals, such as RFCs 135 and 138, attempt to fix problems with the current pattern-matching syntax. Most proposals center around minor modifications to m// and s///.

This RFC proposes that m//, s///, and tr/// be dropped from the language, and instead be replaced with new match, subst, and trade builtins, with the following syntaxes:

   $res [, $res] = match /pat/flags, $str [, $str];
   $res [, $res] = subst /pat/new/flags, $str [, $str];
   $res [, $res] = trade /pat/new/flags, $str [, $str];

These subs are designed to mirror the format of split, making them more consistent. Unlike the current forms, these return the modified string, leaving the input $str alone.
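The copy-and-return behavior can be approximated in Perl 5 as a rough illustration. This is only a sketch under assumptions: subst_copy is a hypothetical name, the pattern is passed as a qr// object and the replacement as a plain string (Perl 5 has no /pat/new/ argument syntax), the substitution is always global, and the /r flag it relies on needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed subst: takes the pattern
# first and the strings last (like split), returns modified copies,
# and leaves its arguments untouched.
sub subst_copy {
    my ($pat, $new, @strs) = @_;
    return map { s/$pat/$new/gr } @strs;   # /r returns a copy
}

my $old   = "hello world";
my ($new) = subst_copy(qr/o/, "0", $old);  # list context: one result

# The transliteration equivalent, written the pre-/r Perl 5 way:
(my $upper = $old) =~ tr/a-z/A-Z/;
```

After this runs, $new and $upper hold the modified copies while $old still contains the original text, which is the behavior the proposed builtins would give by default.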

Note that replace is likely a better name for subst since it connotes the action better and is not so close to substr. Short on time :-(, though, I won't have the chance to change all the examples.

Context modifies the return values just as Perl 5 context does, with some extensions:

   1. If called in a void context, they act on and modify $_,
      consistent with current behavior.

   2. If called in a scalar context, match returns the number
      of matches (like now), and the rest return the first (or
      only) string.

   3. If called in a list context, a list of the modified strings
      will be returned.

   4. If called in a numeric context, they all return the number
      of substitutions made.
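The first three rules can be sketched in Perl 5 with wantarray. This is an illustration only, not the proposed implementation: subst_ctx is a hypothetical name, the pattern is a qr// object and the replacement a plain string, and the substitution is always global. Perl 5 has no separate numeric context, so rule 4 is not emulated here.

```perl
use strict;
use warnings;

# Hypothetical emulation of the proposed context rules for subst.
sub subst_ctx {
    my ($pat, $new, @strs) = @_;

    if (!defined wantarray) {
        # Void context: modify $_ in place, as s/// does today.
        s/$pat/$new/g;
        return;
    }

    @strs = ($_) unless @strs;              # default to $_, like split
    my @out = map { s/$pat/$new/gr } @strs; # copies; needs Perl 5.14+

    # List context: every modified string. Scalar context: the first.
    return wantarray ? @out : $out[0];
}

$_ = "a-b-c";
subst_ctx(qr/-/, "+");                          # void: changes $_
my $first = subst_ctx(qr/a/, "A", "aa", "ba");  # scalar
my @all   = subst_ctx(qr/a/, "A", "aa", "ba");  # list
```

Here $_ ends up as "a+b+c", $first is "AA", and @all is ("AA", "bA").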

Extra arguments can be dropped, consistent with split and many other builtins:

   match;                  # all defaults (pattern is /\w+/)
   match /pat/;            # match $_
   match /pat/, $str;      # match $str
   match /pat/, @strs;     # match any of @strs

   subst;                  # strip leading/trailing whitespace
   subst /pat/new/;        # sub on $_
   subst /pat/new/, $str;  # sub on $str
   subst /pat/new/, @strs; # return array of modified strings

   trade;                  # nothing
   trade /pat/new/;        # tr on $_
   trade /pat/new/, $str;  # tr on $str
   trade /pat/new/, @strs; # return array of modified strings

These new builtins eliminate the need for =~ and !~ altogether, since they are functions just like split, join, splice, and so on. There are also shortcut forms; see below.
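For comparison, the list forms above can be written in Perl 5 with map and grep. This is a sketch of the intended semantics rather than of the proposed builtins: the /r flag requires Perl 5.14 or later, and the sample data in @old is made up for illustration.

```perl
use strict;
use warnings;

my @old = ("hello there", "Hello again", "goodbye");

# subst /hello/X/gi, @old;   return array of modified strings
my @new = map { s/hello/X/gir } @old;

# trade /a-z/A-Z/, @old;     return array of modified strings
my @upper = map { tr/a-z/A-Z/r } @old;

# match /\w+/, @old;         true if any element matches
print "Got it\n" if grep { /\w+/ } @old;
```

In each case @old is left unchanged, matching the rule that the proposed functions return modified copies.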

Sometimes examples are easiest, so here are some examples of the new syntax:

   Perl 5                           Perl 6
   -------------------------------- ----------------------------------
   if ( /\w+/ ) { }                 if ( match ) { }
   die "Bad!" if ( $_ !~ /\w+/ );   die "Bad!" if ( ! match ); 
   ($res) = m#^(.*)$#g;             ($res) = match #^(.*)$#g;

   # These are longer, but you can still use the backwards
   # compatible syntax where it makes sense, like here

   next if /\s+/ || /\w+/;          next if match /\s+/ or match /\w+/;
   next if ($str =~ /\s+/) ||       next if match /\s+/, $str or 
           ($str =~ /\w+/)                  match /\w+/, $str;
   next unless $str =~ /^N/;        next unless match /^N/, $str;
   
   $str =~ s/\w+/$bob/gi;           $str = subst /\w+/$bob/gi, $str;
   s/\w+/this/;                     subst /\w+/this/;             

   tr/a-z/Z-A/;                     trade /a-z/Z-A/;
   $new =~ tr/a/b/;                 $new = trade /a/b/, $new;


   # Some become easier and more consistent...

   ($str = $_) =~ s/\d+/&func/ge;   $str = subst /\d+/&func/ge;
   ($new = $old) =~ tr/a/z/;        $new = trade /a/z/, $old;


   # And these are pretty cool...   

   foreach (@old) {                 @new = subst /hello/X/gi, @old;
      s/hello/X/gi;
      push @new, $_;
   }

   foreach (@str) {                 @new = trade /a-z/A-Z/, @str;
      tr/a-z/A-Z/;
      push @new, $_;
   }

   $gotit = 1;                      print "Got it" if match /\w+/, @str;
   foreach (@str) {
      undef $gotit unless /\w+/;
   }
   print "Got it" if $gotit;

This gives us a cleaner, more consistent syntax. In addition, it makes several things easier and is more easily extensible:

   &callsomesub(subst(/old/new/gi, $mystr));
   $str = subst /old/new/i, $r->getsomeval;

and is easier to read English-wise. However, it requires too much typing. For that reason, we include the shortcut form as well.

Shortcut Form

RFC 139 describes a way that the // syntax can be expanded to any function. So, to gain backwards compatibility, we simply allow this syntax along with the shortcut function names s, m, and tr [1]:

   Shortcut Form                    Builtin
   -------------------------------- ----------------------------------
   s/\w+/W/g;                       subst /\w+/W/g;
   /\w+/;                           match /\w+/;
   tr/ae/io/;                       trade /ae/io/;

   $new = s/\s+/X/, $old;           $new = subst /\s+/X/, $old;
   $num = m/\w+/, $str;             $num = match /\w+/, $str;
   $new = tr/a-z/z-a/, $str;        $new = trade /a-z/z-a/, $str;

Note // can still be used as a shortcut to m//. This is the form most people will use, I would imagine. Starting to look like Perl 5...

Use of =~ Syntax

RFC 170 shows how =~ can be used as a more generic assignment operator / rvalue duplicator. With this ability, we can still write all our Perl 5 regex syntax, even though the operations are actually new Perl 6 builtins:

   Shortcut Form + =~               Builtin
   -------------------------------- ----------------------------------
   $str =~ s/\w+/W/g;               $str = subst /\w+/W/g, $str;
   $str =~ tr/a-z/z-a/;             $str = trade /a-z/z-a/, $str;
   $str =~ /\w+/;                   match /\w+/, $str;       # See [2]
   ($match) = /^(.*)$/g;            ($match) = match /^(.*)$/g;

   # Can't do these in Perl 5
   @str =~ s/$foo/$bar/gi;          @str = subst /$foo/$bar/gi, @str;
   @str =~ tr/a-z/A-Z/;             @str = trade /a-z/A-Z/, @str; 
   @str =~ m/^Pass:/;               match /^Pass:/, @str;

So, why all the bother if it looks just like Perl 5? Well, these last two sections are based on more general mechanisms for Perl 6. That is, allowing the generalization of =~ and the // syntax allows us to write these expressions in a way that is backwards compatible. However, there is no explicit relationship between the Perl 5 backwards-compat syntax and the new Perl 6 syntax, even though there appears to be.

In fact, the mechanism covered in RFC 170 allows us to write stuff like:

   $str =~ quotemeta;          # $str = quotemeta($str);
   @a =~ sort { $a <=> $b };   # @a = sort { $a <=> $b } @a;

So you can see how this general purpose mechanism allows us backwards compatibility.
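The "apply the function, then assign the result back" reading of =~ can be mimicked in Perl 5 with a small helper. This is a sketch only: assign_through is a hypothetical name, not part of RFC 170 or of Perl, and it relies on the fact that the elements of @_ alias the caller's variables.

```perl
use strict;
use warnings;

# Hypothetical helper: store func($var) back into $var, which is what
# RFC 170 proposes "$var =~ func" should mean. $_[1] is an alias to
# the caller's variable, so the assignment is visible to the caller.
sub assign_through {
    my ($func) = @_;
    $_[1] = $func->($_[1]);
    return $_[1];
}

my $str = "a.b*c";
assign_through(sub { quotemeta $_[0] }, $str);  # $str = quotemeta($str)

my @a = (10, 9, 100);
@a = sort { $a <=> $b } @a;                     # what "@a =~ sort {...}" expands to
```

Afterwards $str contains the escaped string a\.b\*c and @a is in ascending numeric order.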

Finally, note how we have a good amount of flexible, parallel syntax because of this:

   $str =~ s/$foo/$bar/gi;     # just a general shortcut
   $new = s/$bar/$baz/g, $str; # more consistent when $new != $str

Concerns

Because this proposal can be 100% backwards compatible, it doesn't strike me as problematic anymore. However, changing pattern matching at all should still be a conscious decision. I have no interest in breaking Perl regexes. At all. None.

Still, I have received many personal emails in favor of this idea. So, if implemented correctly, I think it could be a benefit for Perl 6.

Note that trade was chosen because "transliterate" is way too long and "trans" looks to be taken by transactional variables. And trade seems to connote the action pretty well still.

Finally, many have expressed the fear that subst is much too close to substr, and I agree. The best alternative suggested was replace, which I like. The only problem is that the shortcut version would be s/// instead of r/// (to maintain backwards compatibility), but this is a detail that can be worked out later, if this RFC is adopted.

IMPLEMENTATION

Major changes to =~, !~, m//, s///, and tr///. Addition of three new functions: match, trade, and subst / replace.

MIGRATION

There are no longer any syntax changes as of v2. No migration path is required, assuming this is implemented carefully.

NOTES

[1] If most people are going to continue using the shortcut form and names, it might be wise just to make the functions be named m, s, and tr, even though these are silly function names.

[2] match is a bit of a special case, just like m// is when compared to s/// and tr///. The support of !~ and m// will have to be explored some more, but I'll leave that for subsequent discussions.

REFERENCES

This is a synthesis of several ideas from myself, MJD, Ed Mills, Steve Fink, and Tom C

RFC 138: Eliminate =~ operator.

RFC 139: Allow Calling Any Function With A Syntax Like s///

RFC 170: Generalize =~ to a special-purpose assignment operator