|
|
If we measure the execution time of the two word-counting programs developed in the section, ``A Bag example'' for moderate-size input files, we will observe that the Bag version runs faster than the Map version, and furthermore that the discrepancy between the two versions increases with the size of the input file. For example, running each version on a file containing 175,000 words gives:
Map version: real 2m39.48s user 2m32.28s sys 0m3.18s
Bag version: real 2m16.01s user 1m58.81s sys 0m5.38s
which represents a 22% user-time advantage for the Bag version (27% if we consider insertion alone). The difference can be partially explained by noting that while Map insertion runs in O(ln N), where N is the number of elements in the Map, Bag insertion, runs in O(L*L), where L is the average number of elements with the same hash value.
For the 175,000-word example, the value of L was 1.2. We determined this using the function histogram():
main(){ Bag<String> count; String word; while(cin >> word){ count.insert(word); } Map<Set_or_Bag_hashval,unsigned> m; count.histogram(m); cout << "L = " << count.size()/m.size() << "\ n"; }
histogram() returns a Map that gives, for each hash value currently in the Bag, the number of elements in the Bag with that hash value. Since m.size() is the number of unique hash values and count.size() is the number of elements in the Bag, their ratio is the desired average. You can use other statistics based on the histogram to fine-tune your hash functions.
As L gets larger, the execution speed of Set and Bag operations gets slower, approaching O(N*N) in the limit, which occurs when we use a hash function that hashes every element to the same value. Not surprisingly, however, it is this very degenerate case that results in the lowest storage overhead. (As we will see in, ``Internal representation of Sets'', the elements are all stored in a single list.) When prototyping Set or Bag applications, or when you know in advance that Sets and Bags will never have more than a few elements, this behavior may be acceptable.
To underscore the need for a good hash function, consider the following example:
Set_or_Bag_hashval Set<int>::hash(const int& t) { if(t % 2 == 0){ return 2; }else if (t < 0){ return 1; }else{ return t+2; } }
main(){ Set<int> s; for(int i = -20; i<1000; i++){ s.insert(i); } }
With this hash function, the 510 even values hash to the value 2, the 10 negative odd values hash to the value 1, and each of the remaining 500 values hashes to a unique value. Timing the program on a Sun 3/60 gives
real 0m55.55s user 0m28.18s sys 0m1.45s
an average user time of .03 seconds per element! Using the perfect hash function shown earlier:
Set_or_Bag_hashval Set<int>::hash(const int& t) { return (Set_or_Bag_hashval)t; }
the same program gives
real 0m3.11s user 0m0.93s sys 0m0.26s
an average user time of less than one millisecond per element.