Normalisation and unicode::exact
Maintainer: Simon Cozens <simon@brecon.co.uk>
Date: 25 Sep 2000
Mailing List: perl6-internals@perl.org
Number: 295
Version: 1
Status: Developing
Perl 6 should support Unicode normalisation; this is going to make comparing strings confusing.
First, what's normalisation? Unicode gives the user a lot of flexibility
over how data is represented. For instance, there are two ways of
representing é (that's an e with an acute accent): first,
there's U+00E9 (handily named LATIN SMALL LETTER E WITH ACUTE),
and second, there are the two characters U+0065 U+0301 (that's an
ordinary Latin letter 'e' followed by a combining acute accent).
Normalisation is the process of turning all data in the first type of representation to the second type. Well, strictly speaking, this is "decomposition", but the purpose of it is that we can now compare things for their meaning, and not solely for their representation, and doing it for that purpose is normalisation.
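To make the difference between representation and meaning concrete, here is a minimal sketch in present-day Perl 5. It assumes the Unicode::Normalize module and its NFD function (canonical decomposition) as a stand-in for whatever Perl 6 ends up providing; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Two representations of the same text, an e with an acute accent.
my $composed   = "\x{00E9}";          # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "\x{0065}\x{0301}";  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Compared as stored, the strings differ: one character versus two.
my $same_representation = ($composed eq $decomposed) ? 1 : 0;

# Compared after canonical decomposition, they are the same text.
my $same_meaning = (NFD($composed) eq NFD($decomposed)) ? 1 : 0;

printf "lengths: %d and %d\n", length($composed), length($decomposed);
printf "same representation: %d, same meaning: %d\n",
    $same_representation, $same_meaning;
```

The first comparison is what a representation-based eq would answer, the second is what a meaning-based eq would answer.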
Perl 6 should support normalisation. But this creates a problem. Should
the eq operator compare representations or meanings? After plying the
perl5-porters with large quantities of alcohol at YAPC::Europe, the
consensus was that it should compare meanings. Good. Perl's always been
about handling text for meaning. But then how can we tell whether two
strings are really equal in terms of their representation?
The current use bytes pragma (which is the subject of another of my
Unicode RFCs) will allow comparison in terms of representation; there's
also a problem of optimisation.
If we keep the original data, we need to perform decomposition every
time we do any kind of string comparison, and I don't relish the
prospect of cmp becoming really slow. Of course, you could store a
decomposed PV inside the SV as well, but that's big and heavy; the only
sensible, non-destructive optimisation would be to have some kind of
IsNormalised flag in the SV which tells us not to bother decomposing
this string, since it's already in a normalised representation.
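The flag idea can be sketched roughly as follows, again in Perl 5 with Unicode::Normalize. Everything here is an assumption made for illustration: a hash stands in for the SV, and make_string, comparable, compare and the is_normalised key are hypothetical names, not part of any real internals API. The original data is never modified; the flag only lets us skip decomposing strings that are already decomposed.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD checkNFD);

# Hypothetical stand-in for an SV: the original text plus an is_normalised
# flag, computed once when the value is created.
sub make_string {
    my ($text) = @_;
    return { text => $text, is_normalised => checkNFD($text) ? 1 : 0 };
}

my $decompositions = 0;   # counts how often NFD has to run

# Return text suitable for comparison without touching the stored original.
sub comparable {
    my ($s) = @_;
    return $s->{text} if $s->{is_normalised};   # flag set: skip the work
    $decompositions++;
    return NFD($s->{text});                     # temporary decomposed copy
}

sub compare {
    my ($left, $right) = @_;
    return comparable($left) cmp comparable($right);
}

my $x = make_string("caf\x{00E9}");     # composed, so the flag is 0
my $y = make_string("cafe\x{0301}");    # already decomposed, so the flag is 1

my $result = compare($x, $y);
printf "cmp: %d, decompositions: %d, original length kept: %d\n",
    $result, $decompositions, length($x->{text});
```

Only $x needs decomposing here, and it would need it again on every later comparison, which is the cost the default behaviour proposed below avoids.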
I propose that by default, Perl 6 is allowed to chew up your data and
decompose it. If the exact representation is important to the user, the
pragma unicode::exact should be turned on; inside of the scope of
unicode::exact, no normalisation is performed, and cmp and friends
perform normalisation on a temporary copy of the string so as to be
non-destructive to the original data. For instance:
$x = chr(0x00E9); # LATIN SMALL LETTER E WITH ACUTE
... if ($x cmp $y);
# $x is *actually* chr(0x0065).chr(0x0301) now
{
    use unicode::exact;
    $x = chr(0x00E9);
    ... if ($x cmp $y);
    # $x is compared as if it were chr(0x0065).chr(0x0301),
    # but it retains its old value of chr(0x00E9)
}
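Since unicode::exact does not exist, the two behaviours can only be emulated. The following is a rough Perl 5 sketch, assuming Unicode::Normalize; cmp_default and cmp_exact are hypothetical helpers invented for this illustration, standing in for what cmp would do outside and inside the scope of the pragma.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper for the proposed default: decompose the caller's
# variables in place (the elements of @_ are aliases), then compare.
sub cmp_default {
    $_[0] = NFD($_[0]);
    $_[1] = NFD($_[1]);
    return $_[0] cmp $_[1];
}

# Hypothetical helper for the scope of unicode::exact: compare decomposed
# temporary copies and leave the caller's variables untouched.
sub cmp_exact {
    my ($left, $right) = @_;
    return NFD($left) cmp NFD($right);
}

my $y = "\x{0065}\x{0301}";
my $x = chr(0x00E9);

my $exact_result       = cmp_exact($x, $y);
my $length_after_exact = length($x);      # $x is still chr(0x00E9)

my $default_result       = cmp_default($x, $y);
my $length_after_default = length($x);    # $x has been decomposed

printf "exact: cmp=%d length=%d; default: cmp=%d length=%d\n",
    $exact_result, $length_after_exact, $default_result, $length_after_default;
```

Both helpers agree on the result of the comparison; they differ only in whether $x survives it unchanged.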
Outside of unicode::exact, whether the normalisation is done lazily
(necessitating an IsNormalised flag) or when the data is stored is
not specified by this RFC; it works fine both ways. I'd personally say
it should be done lazily.
The Unicode FAQ; www.unicode.org
RFC 300: use unicode::representation
RFC 312: Unicode Combinatorix
RFC ??: When UTF8 Leaks Out