DOC HOME SITE MAP MAN PAGES GNU INFO SEARCH PRINT BOOK
 
A Class for Efficient Key Lookup in C++ - Symbol(3C++)

Removing the baggage

The above class Phstring is very efficient, but it is still not as efficient as it can be. The problem is that every Phstring carries around an embedded String object; this makes copy construction and the assignment operator (among other things) somewhat inefficient. A program like a compiler typically copies and assigns many symbols during its execution. Thus, we wish to make these operations as fast as possible. In the following, we will concentrate on the copy constructor for Phstring. The discussion for the assignment operator is analogous. The copy constructor for Phstring must first copy the base class String object, then copy the hash value. In the reference-counted implementation of String found elsewhere in this library (see String(3C++), copying a String requires:

  1. Copying a pointer to the underlying representation;

  2. Incrementing the reference count; and,

  3. Checking whether the reference count overflowed, and if so, duplicating the representation.

The resulting implementation of the copy constructor for Phstring has a total (assuming no overflow of the underlying String) of two integral copies, three integral comparisons (to the value 0), and one increment operation. This is quite fast (compared to strcpy), but as we shall see in a moment, not as fast as it can be. There are only two ways to speed up the Phstring copy constructor: (1) speed up the String copy constructor, or (2) implement the Phstring constructor in a way that does not require copying a String. We will assume the former is not an option (and in any case, the latter approach, as we shall see, works quite well). For the Phstring constructor not to copy a String, it is necessary to move the String data member out of the Phstring object. We cannot, however, simply get rid of the String (once we have constructed its hash value), since we wish to retain the ability to recover the underlying string from a symbol (for example, for printing). Our approach instead is to move the String data member to a global pool of Strings, shared by all Phstrings. When a Phstring is constructed, the string it represents is copied into the pool if it is not already there. The Phstring keeps a pointer to the string in the pool. Two different Phstrings representing the same symbol share the same String in the pool. Using this scheme, copying a Phstring requires only a pointer copy, rather than a String copy. The resulting code is shown below (Ephstring is short for ``efficient perfect hashed string''):

       class Ephstring {
       private:
           const String* s;
           unsigned long hash;
       public:
           Ephstring(const char* s_) {
               s = private_copy(s_);
               hash = perfect_hash(s);
           }
           unsigned long hashval() const {
               return hash;
           }
           // ...
       };

The function private_copy() returns a pointer to the copy of the string in the global pool:

       class Ephstring {
       private:
           static Stringpool p;
           static String* private_copy(const char* s) {
               // return a pointer to a copy of s in p
           }
           // ...
       };

In the actual implementation, Stringpool would be any type that can efficiently store and retrieve copies of Strings, such as Set<String> (see Set(3C++). As mentioned above, notice that two different Ephstring objects representing the same symbol will have the same value for s, and two different Ephstring objects representing different symbols will have different values for s. But notice this is precisely the definition of ``perfect hash'' --- that is, the value of s is a perfect hash for the string, and the hash member is redundant! The following class (our final version) reflects this optimization:

       class Symbol {
       private:
           const String* s;
       public:
           Symbol(const char* s_) {
               s = private_copy(s_);
           }
           unsigned long hashval() const {
               return s;
           }
           // ...
       };

Remember, our reason for doing all this was to reduce the complexity of the Phstring copy constructor and assignment operator. For the above class, the automatically generated copy constructor has the correct semantics. Here it is:

       Symbol::Symbol(const Symbol& sym) : s(sym.s) {}

Notice that a single inline pointer assignment is all it takes. In fact, there is also an additional increment operator in the copy constructor, as part of symbol's memory management. Contrast this with the copy constructor for Phstring, described at the beginning of this section. The assignment operator (also the automatically generated one) is just as fast:

       Symbol& Symbol::operator=(const Symbol& sym) {
           s = sym.s;
           return *this;
       }

Again, all it takes is one pointer assignment.


Next topic: An example
Previous topic: A perfect hash

© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004