Simplifying split()


  Maintainer: Sean M. Burke <sburke@cpan.org>
  Date: 30 Sep 2000
  Mailing List: perl6-language@perl.org
  Number: 361
  Version: 1
  Status: Developing


Perl 5's split function is messy, and should be simplified.


Perl 5 split does five things that I think are just annoying, and which I suggest be removed:

The last three of the above points speak for themselves. I will focus on the first two.

Most notably, I suggest that Perl 6 split('|', ...) should work as most people expect -- splitting on a literal bar. (Under Perl 5, split('|', ...) is synonymous with split(/|/, ...) -- i.e., split on nullstring or nullstring [sic], which matches the empty string and so splits between every pair of characters.)

So I suggest:

   Perl 5:  split /\|/, ...
  be synonymous with (and be better written as)
   Perl 6:  split '|', ...
           # although  split /\|/, $bar...  remains valid
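
To make the Perl 5 side of this concrete, here is a minimal sketch contrasting a string first argument with an escaped pattern. The sample string and variable names are illustrative assumptions, not part of the proposal; quotemeta is the existing Perl 5 way to get a literal match.

```perl
use strict;
use warnings;

# Sample input, chosen only for illustration.
my $record = 'a|b|c';

# Perl 5 treats the string '|' as the pattern /|/, an alternation of two
# empty patterns, so the split happens between every character.
my @as_pattern = split '|', $record;

# Escaping the bar (by hand or with quotemeta) splits on a literal bar.
my @escaped = split /\|/, $record;
my $literal = quotemeta '|';
my @quoted  = split /$literal/, $record;

print scalar(@as_pattern), " fields from split '|'\n";    # 5 fields
print scalar(@escaped),    " fields from split /\\|/\n";  # 3 fields
```

Under the proposal, the first call would return the same three fields as the other two.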

And as to the second point, the removal of trailing blanks, I suggest:

   Perl 5:   @x = split /:/, $bar, -1;
  be synonymous with
   Perl 6:   @x = split ':', $bar;
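
As a small illustration of what the -1 limit changes in Perl 5, the sketch below splits one string both ways. The input string and variable names are assumptions made up for the example.

```perl
use strict;
use warnings;

# Sample input with two trailing empty fields (illustrative only).
my $line = 'a:b::';

# Without a limit, Perl 5 silently drops trailing empty fields.
my @trimmed = split /:/, $line;

# A negative limit keeps every field, including trailing empties.
my @kept = split /:/, $line, -1;

printf "%d fields without a limit, %d with -1\n",
    scalar @trimmed, scalar @kept;
# prints "2 fields without a limit, 4 with -1"
```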

If you want to remove trailing fields, under Perl 6 you should have to do it explicitly:

   Perl 5:   @x = split /:/, $bar;
  be synonymous with
   Perl 6:   @x = split ':', $bar;
             while(@x and !length $x[-1]) { pop @x }

I believe that the current behavior of removing trailing empty fields is unintuitive and surprising to learners; nothing about the concept of splitting a string into a list suggests removing trailing empties. (Moreover, I find that when I need to remove empties, it's not just the trailing ones; so the current behavior is rarely just what I want.)
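
The difference between trimming only trailing empties and removing all empties can be sketched in Perl 5 as follows. The sample string and variable names are assumptions for illustration; the pop loop is the explicit trimming described above, and grep is one common way to drop every empty field.

```perl
use strict;
use warnings;

# Sample input with leading, interior, and trailing empty fields.
my $text = ':a::b::';

# Keep everything, as the proposed split would.
my @fields = split /:/, $text, -1;

# Explicitly trim only the trailing empties (what Perl 5 does by default).
my @popped = @fields;
pop @popped while @popped and !length $popped[-1];

# Remove every empty field, wherever it occurs.
my @nonempty = grep { length } @fields;

printf "%d kept, %d after trimming the tail, %d non-empty\n",
    scalar @fields, scalar @popped, scalar @nonempty;
# prints "6 kept, 4 after trimming the tail, 2 non-empty"
```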


I'll leave the C-coding details to the usual, capable implementers.

But I will note one minor complication with my first suggestion (that literals and regexps be distinguished). Consider:

  Perl 6:   @x = split $foo, $bar;

I suggest that the correct approach is to treat $foo's value as a literal, unless it holds an object of class Regexp (or a class derived from it?), in which case it should be treated as if the above were:

  Perl 6:   @x = split qr/$foo/, $bar;

In other words, in such cases it is not possible to know at compile time whether a given "split" operator means literal-split or regexp-split. I note that such cases are rare.
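
The run-time decision described above can be approximated in Perl 5 today. The following is only a sketch under stated assumptions: clean_split is a hypothetical helper name, not an existing function, and it relies on is_regexp from the core re module (Perl 5.10 or later), which recognizes qr// objects even when they have been blessed into another class.

```perl
use strict;
use warnings;
use re 'is_regexp';

# Hypothetical helper: literal split unless the separator is a qr// object.
# Trailing empty fields are always kept, as the proposal suggests.
sub clean_split {
    my ($sep, $str) = @_;

    # A compiled regexp is used as-is; anything else is escaped so that
    # its characters match literally.
    my $pattern = is_regexp($sep) ? $sep : quotemeta $sep;

    return split /$pattern/, $str, -1;
}

my @by_literal = clean_split('|',     'a|b||');    # literal bar
my @by_regexp  = clean_split(qr/\d+/, 'a1b22c');   # regexp object

printf "%d fields by literal, %d fields by regexp\n",
    scalar @by_literal, scalar @by_regexp;
# prints "4 fields by literal, 3 fields by regexp"
```

Because the check happens on each call, the same call site can perform either kind of split, which is the compile-time ambiguity noted above.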


In conclusion, I'll note that there is a conservative alternative approach possible: if any of the above features of Perl 5 split seem really worth keeping, my suggestion for a "clean split" can be implemented as a separate operator called, for example, "cleave".

(Consider the precedent of Perl 5 chomp being added alongside Perl 4 chop, not replacing it.)

I would consider this suboptimal, though; I think that an operator with as straightforward and intuitive a name as "split" should behave in a straightforward and intuitive way.