No More String Errors - String(3C++)

There are a few basic data structures that appear in almost every non-trivial computer program. One of these is the character string. The code for manipulation and storage management of strings tends to be scattered throughout C programs, since the operations are typically only a few lines long apiece. This leads to unnecessarily hard-to-read programs, since string manipulations are intermixed with problem-specific operations, and to program bugs, since some of the operations can be tricky.

This tutorial describes a C++ character string datatype, called a String, that behaves like the built-in datatypes of C. The syntax and semantics are similar to built-in types, and performance is comparable to what would be expected of built-in types.

The conventional representation of a string in C is a null-terminated array of characters. A variable that refers to a string is a character pointer that points to the first element of the array. This arrangement, while simple in concept and nearly ideal for the lowest level of software, has two outstanding disadvantages in practice: (1) equal character pointers point at the same chunk of memory, so strings are shared, and (2) there is no management of string storage. This has led to the existence of a variety of not-very-satisfactory techniques for handling strings in C, and to much annoyance for C programmers.
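The two disadvantages can be seen in a few lines of ordinary C-style code. The following is a minimal sketch; the function names are invented for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstring>

// Disadvantage (1): copying a character pointer copies the pointer only,
// so both names refer to the same characters.
bool aliased_write_is_visible()
{
    char buffer[] = "hello";
    char *a = buffer;
    char *b = a;                 // no characters are copied here
    assert(a == b);              // both point at the same storage
    b[0] = 'j';                  // a write through b ...
    return std::strcmp(a, "jello") == 0;   // ... is seen through a
}

// Disadvantage (2): an independent copy needs storage that the programmer
// must size, fill, and release by hand.
bool manual_copy_is_independent()
{
    const char *src = "hello";
    char *copy = new char[std::strlen(src) + 1];   // +1 for the terminating null
    std::strcpy(copy, src);
    copy[0] = 'j';               // changes the copy only
    bool independent = std::strcmp(src, "hello") == 0
                    && std::strcmp(copy, "jello") == 0;
    delete[] copy;               // forgetting this leaks the storage
    return independent;
}
```

Both functions return true. Each bookkeeping step in the second function (the +1, the strcpy, the delete[]) is a place where an error can creep in.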

A C++ String is simply a sequence of 0 or more characters. It is not necessarily null-terminated; therefore, any value that fits into a char (even 0) can be anywhere in the String. Strings do their own storage management; they do not share memory (this is not strictly true in the implementation, but it is true from a user's point of view), and they are automatically extensible. The implementation of Strings relies on C++'s constructors and destructors, member functions, and overloaded operators to encapsulate storage management and provide a more natural syntax for declaring, manipulating, and using sequences of characters. The syntax and semantics of operations on Strings are modeled on those of fixed-size objects. Thus, assignment is by value (the apparently required copy is avoided, or at least postponed, in the implementation), and Strings can be used as function arguments and result types. As usual in C, changes to the formal argument in the called function do not affect the actual argument in the caller.
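The by-value behavior described above can be tried out with a stand-in type. The sketch below assumes that std::string from the standard C++ library is an acceptable substitute for illustration; it is not the String class of this tutorial, and the names Text, append_and_measure, and with_embedded_zero are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in only: std::string, NOT the String class described in this tutorial.
typedef std::string Text;

// The formal argument s is a copy of the caller's value, so the change made
// here is not visible to the caller.
std::size_t append_and_measure(Text s)
{
    s += '!';
    return s.size();
}

// Builds a three-character value whose middle character is 0.
Text with_embedded_zero()
{
    Text t;
    t += 'a';
    t += '\0';                   // a zero char is ordinary data, not a terminator
    t += 'b';
    assert(t.size() == 3);       // the length counts the zero char too
    return t;
}
```

Calling append_and_measure on a Text holding "abc" returns 4 and leaves the caller's value as "abc". Similarly, assigning one Text to another and then changing the second leaves the first unchanged.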

There are functions and overloaded operators for writing String expressions. Also, there are versions of some of the functions in Sections 2 and 3 of the UNIX® System manual that can be called with String instead of character pointer arguments.

A programmer can declare and use pointers to Strings in the usual way (and with the usual risks of dangling pointers), but most of the performance improvement usually associated with pointer usage is already built into the datatype. For example, when a function is called with a String as an argument, a reference count is incremented, but no String copy occurs. Ordinary arrays of Strings are also available.
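One way to picture the reference counting is a handle that shares its characters until one of the sharers is about to change them. The following is a rough sketch of that general technique, not the actual String implementation: the class SharedText and the function owners_seen_by_callee are hypothetical, and the sketch assumes a C++11 compiler so that std::shared_ptr can keep the count.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical illustration of copy-on-write; not the library's code.
class SharedText {
public:
    SharedText(const char *s = "") : rep_(std::make_shared<std::string>(s)) {}

    // The compiler-generated copy constructor and assignment copy rep_,
    // which increments the reference count and copies no characters.

    SharedText &operator+=(char c)
    {
        detach();                        // take a private copy if shared
        assert(rep_.use_count() == 1);   // this handle is now the sole owner
        rep_->push_back(c);
        return *this;
    }

    const std::string &value() const { return *rep_; }
    long owners() const { return rep_.use_count(); }

private:
    // Copies the characters only when another handle shares them.
    void detach()
    {
        if (rep_.use_count() > 1)
            rep_ = std::make_shared<std::string>(*rep_);
    }

    std::shared_ptr<std::string> rep_;
};

// Passing by value creates a second handle on the same characters.
long owners_seen_by_callee(SharedText s) { return s.owners(); }
```

With a single SharedText a in scope, owners_seen_by_callee(a) returns 2, since the argument and the formal parameter share one representation during the call. The count drops back to 1 afterwards. A later += on a copy detaches it, so the original keeps its value.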

The following example shows how the String datatype is used. It is a function that takes a char, c, and a String, in_String, as arguments, and returns a copy of the String with all instances of c removed.

   1:      String
   2:      remove(char c, String in_String)
   3:      {
   4:          String  out_String;
   5:          char    temp;
   6:          while ( in_String.getX(temp) )
   7:              if ( temp != c )
   8:                  out_String += temp;
   9:          return out_String;
   10:     }

In this example, line 1 defines the return type of the function, line 2 defines the name of the function and its arguments and their types, and lines 4 and 5 are automatic variable declarations. In line 6, the getX function removes the first char from in_String, assigns it to temp, and returns 1, as long as in_String is non-empty. The suffix "X" is a lexical convention identifying functions that assign to their first argument. They normally return 1, indicating success, or 0, indicating failure. When in_String is empty, the getX function returns 0, ending the while loop. In line 7, temp is compared to c, and if it is different, the += operator in line 8 adds it to the end of out_String. Thus, in line 9 out_String is the desired result, and the return statement returns it to the caller. Because in_String is passed by value, consuming it in the loop does not affect the caller's String.
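For readers without the String library at hand, the same loop can be exercised against a small stand-in. The sketch below assumes a hypothetical class MiniString, built on std::string, that models only the two operations the example uses (getX and +=). The function is renamed remove_char here to avoid any confusion with the remove function of the standard C library.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for String; only getX and += are modeled.
class MiniString {
public:
    MiniString(const char *s = "") : data_(s) {}

    // Removes the first char and stores it in c, returning 1;
    // returns 0 and leaves c untouched when the string is empty.
    int getX(char &c)
    {
        if (data_.empty())
            return 0;
        c = data_[0];
        data_.erase(0, 1);
        return 1;
    }

    MiniString &operator+=(char c) { data_ += c; return *this; }
    const std::string &value() const { return data_; }

private:
    std::string data_;
};

// Same logic as the tutorial's remove function.
MiniString remove_char(char c, MiniString in_String)
{
    MiniString out_String;
    char temp;
    while (in_String.getX(temp))         // consume the local copy, char by char
        if (temp != c)
            out_String += temp;          // keep everything except c
    assert(in_String.value().empty());   // the by-value copy is used up
    return out_String;
}
```

Given a MiniString s holding "banana", remove_char('a', s) yields a value holding "bnn", and s itself still holds "banana" afterwards.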

The rest of this tutorial describes Strings from the point of view of a user (programmer).