[% setvar title Abstract Internals String Interaction %]
|Note: these documents may be out of date. Do not use as reference!|
To see what is currently happening visit http://www.perl6.org/
Abstract Internals String Interaction
Maintainer: Simon Cozens <firstname.lastname@example.org> Date: 26 Sep 2000 Mailing List: email@example.com Number: 322 Version: 1 Status: Developing
All accesses to strings and portions of strings inside the Perl 6 core should be abstracted.
Although RFC 294 states that data will be in UTF8 internally, there's going to come a time when we'll need to move to UTF16 or other encodings. If we start fiddling around "knowing" that we're dealing with UTF8 data, we're going to make it hard on ourselves when it comes to migrate.
But not only that, UTF8 is really rather gammy to deal with. It's a lot nicer if we can just treat each string as an array of wide characters.
Take, for instance, the operation of changing a character U+2654 (WHITE CHESS KING) into U+003F (QUESTION MARK) in the 3rd position of a 12 character string. Because UTF8 is a variable-width character set, we have a big problem: U+2654 is 3 bytes wide, and U+003F is one byte, we have to shift the rest of the string down two bytes, effectively having to rewrite it.
If we used a function or macro to change this single character, it would also have to rewrite the whole string; using this several times would be horrifically inefficient.
Even if we "pretend" that we're dealing with an array of wide chars, and used macros or functions to allow us to keep up this pretence, we'd still have to do that shifting around, which is nasty. The only way to get it working without the inefficiency would be to really deal with an array of wide chars, and have functions to convert such an array from and to our character set.
The obvious question would be, then, why bother with RFC 294 or UTF8 at all? Why not just store these arrays along with our SVs? Several reasons, notably:
Most of the world's data is still eight-bit; it's probably still even seven bit. Storing it all as, effectively, raw Unicode is wasteful.
We can't get an array of wide chars from the user, but we can get UTF8 from them. It's also likely we'll want to output data in a UTF, at least at times.
The following interface is proposed:
wchar_t * Perl_utf_to_raw( U8* string, STRLEN* len ) U8 * Perl_raw_to_utf( wchar_t* string, STRLEN* len )
All functions which manipulate strings should convert them to raw form and operate on the array, converting it back when it's time to store it in the string.
RFC 294: Internally, data is stored as UTF8
As-yet-unsubmitted RFC : When UTF8 Leaks Out