[% setvar title Internal representation of Pseudo-hashes using attributes. %]
To see what is currently happening visit http://www.perl6.org/
Internal representation of Pseudo-hashes using attributes.
Maintainer: Michael Maraist <maraist@udel.edu> Date: 27 Sep 2000 Last Modified: 30 Sep 2000 Mailing List: perl6-internals@perl.org Number: 273 Version: 2 Status: Frozen
Pseudo-hashes were an attempt to mix the speed of arrays with the named fields of hashes. For compatibility issues, the new data-type used [ { field-mapping }, data-list ], which at the very least looks bizarre. It has the required overhead of a dereference and does not behave well as an array since the field-mapping is required to be the first field.
An attempt was made to make this the basis for classes, making use of pragmas such as 'fields', 'base', and module functions such as fields::new, fields::phash. Performance was really secondary to the ability to perform strict-type checking. It was also necessary to use 'my Foo $dog = new Foo' to perform compile-time optimizations.
There is a growing need to provide structured named parameters. Pseudo-hashes provide the type-checking, and _can_ provide the ordering of fields, though non-transparently.
Other RFC's have proposed plausible structured data-types, namely the 'qs' and 'struct' keywords. This RFC, however, proposes a modification of the existing system that maintains procedural compatibility, and allows for the integration of named structures. It provides enhancements for the OOP model, and provides a 100% compatible method of named subroutine parameters by merely extending the concept of strict-hashes through the use of attributes.
sub getStat {
my %stat: fields( name size ctime mtime inode );
$stat{name} = 'Foo'; # array 0 access
$stat{inode} = 50; # array 4 access
my @fields = keys %stat; # get raw internal key listing
my @stat_list = values %stat; # get raw internal data listing
my %loose_stat = %stat; # interlace internal key and data listings
$stat{nme} = 'Bar'; # compiler error!
$loose_stat{nme} = 'Bar'; # OK, regular hash operations
@stat{qw(name size)}; # compiled to @stat[ 0, 1 ]
# In general, slice works in all cases except modification of index
positions
return \%stat; # references entire structure and meta-data
} # end getStat
my $rh_stat = getStat();
$rh_stat->{size}; # OK, performs internal lookup to find size == 1
$rh_stat->{nme}; # run-time error! (after ! exists internal lookup for
nme )
The above describes how a variable attribute can be used to impose type-checking and compile-time optimizations to a normal hash. The modified hash should always look and feel the same, exept in the case of type-checking.
This is very similar to pseudo-hashes with the following exceptions:
Additionally, this can work transparently with the OO approach to allow compiler optimizations for standard returned hash refs. Just like current pseudo-hashes as in:
{
package Foo;
use class qw( a b ); # same functionality as 'fields', but internal
sub new { class::new(shift) } # again, class works like fields
# It does something similar to:
# eval "my $class \%this; return \\\%this;";
# but in an optimized way.
# a => 0, b => 1
}
{
package Bar;
use extends qw( Foo Other); # similar functionality to 'base', but
internal
# also allows multiple inheritance (compiler error if conflicting
name-space)
# If this turns out to be problematic, single inheritance can be used.
use class qw( c d );
# a => 0, b => 1, c => 2, d => 3
# new inherited
}
my Bar $obj = new Bar;
$obj->{a}; # compile-time mapping of a => 0
$obj->{zap}; # compiler error!
my @keys = keys %$obj; # internal fetching of ordered key-list
my @values = values %$obj; # internal fetching of ordered values
my @array_hash = %$obj; # loses type-checking: ( a, undef, b, undef, c,
undef, d, undef )
my %hash = %$obj; # loses type-checking: same as above
# The following makes use of the above class as a pre-defined version of ":
fields(..)"
# All that follows is optional, but a logical extension of the above
my %h_bar: Bar; # Uses that above class to define a hash
$h_bar{a}; # OK, same as $h_bar[ 0 ]
$h_bar{zap}; # compiler-error
sub doFoo : Bar { # optionally restricts parameters at compile time (when
it knows ahead
# of time)
my %args : Bar = @_; # takes in a normal set of hash-ed parameters, but
restricts them
# Essentially performs a faster hash-assignment
$args{a}; # similar to $args[ 0 ];
} # end doFoo
The above would be similar to:
my %doFoo_valid_args = qw( a 1 b 1 c 1 d 1 ); # global for better
performance
sub doFoo {
my %args = @_;
for my $key ( keys %args ) {
die "invalid arg: $key" unless exists $doFoo_valid_args{ $key };
}
$args{a};
}
Except that validation is _much_ faster, AND occurs at compile-time (where possible). Also, access to %args is significantly faster with strict-hashes.
Because of the performance / storage benifits of this approach (see below), it would be worth while reimplementing internal functions such as stat to work as:
sub stat {
my %data : fields( ... ) = get-stat-data
return wantarray ? values %data : \%data;
}
my $file_stat = stat( 'file' );
my @file_stat = stat( 'file' );
print "Size: $file_stat->{size}";
Things to note:
Typed-hashes can only be created at compile-time, and can not be altered, much like a normal structure. This is desirable for formal definitions and for syntax checking.
High level representation of internal structure:
struct hash:
misc_info_t var_info
perl_hash_table_t data
struct strict_hash:
misc_info_t var_info # same as above
perl_array[] key_names # used for "keys" and "get-hash-as-array"
perl_array[] values # the actual field-values
perl_hash_table_t mapping # optimized mapping of keys to positions
struct strict_hash2: # optional representation
misc_info_t var_info
string[] key_names # only used by c-functions, so reduced overhead
scalar[] values # static c-array of scalars, no overhead of
dynamic @array
c_lookup_t mapping # static DS that maps names to indexes
struct strict_hash3: # alternative representation
misc_info_t var_info
perl_array[] key_values # composit array ( @key_names, @values )
c_lookup_t mapping
struct strict_hash4:
misc_info_t var_info
scalar[] values
c_tree_t mapping # makes fetching keys slower, but can maintain
order
# and alleviates redundant storage of key-names
Providing the seperate list of key-names greatly enhances performance for % -> @, and also could allow extensions that could query attribute information, such as
my %hash : fields( a b ); my $key_name = key_at_index( %hash, 1 ); # 'b'
If redundant data is undesirable, then there are three alternatives. The first is to remove the linear list of keys, since this can extracted from the hash (with performance cost). The other is to remove the hash mapping. This would seriously hurt non-strict usage (by an unsuspecting module user), though it can be avoided by the usage of typed-variables. The third is a hybrid:
Start with the linear list, since it ultimately uses less space. As perl compiles, if it never encounters a non-strict dynamic lookup, the hash never needed to have been generated, both space and performance have been conserved. The first time the compiler notices the use of non-strict dynamic lookup, it generates a mapping and stores it within the variable. This way, performance is not hurt in the run time in either direction.
Since the information is always available to regenerate the mapping, we are not threatened by eval statements which may do things like:
1 my %hash : fields( a b );
2 my $field_name = "a";
3 $hash{a};
4 eval "\$hash{$field_name} for 0 .. 10000";
5 eval '$hash{$field_name} for 0 .. 10000';
At line 1, we produce a strict hash, and maintain a compiler-based symbol mapping table. By compilation of like 5, we haven't noticed any dynamic uses of the fields, so we free up memory by throwing away the internal mapping.
By execution of line 4, we need to compile again. We find that we have to regenerate the mapping table so that we can convert interpolated 'a' into 0. The presence of evals is less common than that of hashes, so this performance hit should be acceptible. It might be better to not regenerate the mapping, and simply perform a linear search for all compiled mappings. Greater knowledge of the compiler internals is needed. Execution of the newly evaled line 4 is just as fast as execution of line 3.
At execution of line 5, we find that compilation reveals a dynamic hash-lookup of a strict hash that does not have a mapping. At this point, we generate a mapping. Execution of the newly evaled line 5 then performs a mapping from 'a' to 0 at a slightly higher cost than for normal hash-lookup. Line 4 is obviously faster than line 5. If we had not used line 5, then the above program would saved considerable space (assuming that even a c-type hash takes up lots of room).
Assuming that optimized c-array's are used, then this strict-hash _can_ take up even less room than @array. This assumes that the overhead used in storing field-names is less than that for dynamic perl arrays. This probably won't be the case, but it is certain that less room will be taken up than with perl hashes.
Additionally, since compiler optimizations would allow direct access to fixed sized c-arrays of perl-scalars, then strict-hashes might even work faster than dynamic perl arrays.
We get the best of all worlds: We waste no unnecessary memory, have minor additional compiler time / complexity, increase overall performance, and enhance readibility of perl code.
The main question I have is what to do with the symbol-mapping table. Do we:
Perl5.005 pseudo-hashes
RFC 188: Objects : Private keys and methods
RFC 241: Pseudo-hashes must die!
RFC 122: types and structures
RFC 196: More direct syntax for hashes
RFC 75: structures and interface definitions
RFC 268: Keyed arrays
RFC 259: Builtins : Make use of hashref context for garrulous builtins