A string constant is created by enclosing a sequence of characters inside quotation marks, as in ``"abc"'' or ``"hello, everyone"''. String constants may contain the C programming language escape sequences for special characters listed in ``Extended regular expressions''.
String expressions are created by concatenating constants, variables, field names, array elements, functions, and other expressions. The program
{ print NR ":" $0 }
prints each record preceded by its record number and a colon, with no blanks. The three strings representing the record number, the colon, and the record are concatenated and the resulting string is printed. The concatenation operator has no explicit representation other than juxtaposition.
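As a minimal sketch of concatenation by juxtaposition, the following runs the program above from a POSIX shell. The two-line input and the shell variable out are made up for illustration.

```shell
# Hypothetical two-record input; NR, ":" and $0 are joined with nothing between them.
out=$(printf 'alpha\nbeta\n' | awk '{ print NR ":" $0 }')
echo "$out"
# 1:alpha
# 2:beta
```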
awk provides the built-in string functions shown in ``awk built-in string functions''. In this table, r represents an extended regular expression (either as a string or as /r/), s and t string expressions, and n and p integers.
awk built-in string functions
Function | Description |
---|---|
gsub(r,s) | substitute s for r globally in current record, return number of substitutions |
gsub(r,s,t) | substitute s for r globally in string t, return number of substitutions |
index(s,t) | return position of string t in s, 0 if not present |
length(s) | return length of s |
match(s,r) | return the position in s where r occurs, 0 if not present |
split(s,a) | split s into array a on FS, return number of fields |
split(s,a,r) | split s into array a on r, return number of fields |
sprintf(fmt,expr-list) | return expr-list formatted according to format string fmt |
sub(r,s) | substitute s for first r in current record, return number of substitutions |
sub(r,s,t) | substitute s for first r in t, return number of substitutions |
substr(s,p) | return substring of s starting at position p |
substr(s,p,n) | return substring of s of length n starting at position p |
tolower(s) | return a string in which each upper case character in string s is replaced by a lower case character |
toupper(s) | return a string in which each lower case character in string s is replaced by an upper case character |
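The table entries for split, toupper, tolower and length can be tried together in a short sketch such as the following. The colon-separated input and the array name parts are assumptions chosen for illustration.

```shell
# Hypothetical input record with ":" as the separator passed to split.
out=$(echo 'red:green:blue' | awk '{
    n = split($0, parts, ":")    # n is the number of pieces stored in parts
    print n, toupper(parts[1]), tolower("GREEN"), length(parts[3])
}')
echo "$out"
# 3 RED green 4
```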
The functions sub and gsub are patterned after the substitute command in the text editor ed(1). The function gsub(r,s,t) replaces successive occurrences of substrings matched by the extended regular expression r with the replacement string s in the target string t. (As in ed, the leftmost match is used, and is made as long as possible.) It returns the number of substitutions made. The function gsub(r,s) is a synonym for gsub(r,s,$0). For example, the program
{ gsub(/USA/, "United States"); print }
transcribes its input, replacing occurrences of USA by United States. The sub functions are similar, except that they only replace the first matching substring in the target string.
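The difference between sub and gsub, and the value each returns, can be seen in a sketch like the one below. The input line and the variables t, n1 and n2 are hypothetical.

```shell
# sub with a third argument edits the copy t; gsub with two arguments edits $0.
out=$(echo 'USA and USA' | awk '{
    t = $0
    n1 = sub(/USA/, "United States", t)   # first match only, in t
    n2 = gsub(/USA/, "US")                # every match, in the current record
    print n1 ": " t
    print n2 ": " $0
}')
echo "$out"
# 1: United States and USA
# 2: US and US
```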
The function index(s,t) returns the leftmost position where the string t begins in s, or zero if t does not occur in s. The first character in a string is at position 1. For example,
index("banana", "an")
returns 2.
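A brief sketch of the three cases (a match inside the string, no match, and a match at position 1), using only a BEGIN action so that no input is read:

```shell
out=$(awk 'BEGIN {
    print index("banana", "an")    # leftmost occurrence starts at position 2
    print index("banana", "nab")   # not present, so 0
    print index("banana", "b")     # positions are counted from 1
}')
echo "$out"
# 2
# 0
# 1
```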
The length function returns the number of characters in its argument string; thus,
{ print length($0), $0 }
prints each record, preceded by its length. ($0 does not include the input record separator.) The program
length($1) > max { max = length($1); name = $1 } END { print name }
when applied to the file countries, prints the longest country name:
Australia
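Because the comparison uses a strict >, a later name of the same length does not replace the one already stored. The sketch below shows this on a three-record excerpt; the excerpt, the tab-separated layout and the extra max in the output are assumptions made for illustration.

```shell
# Australia and Argentina are both nine characters long; the first one seen is kept.
out=$(printf 'Australia\t2968\nIndia\t1269\nArgentina\t1072\n' |
    awk 'length($1) > max { max = length($1); name = $1 } END { print name, max }')
echo "$out"
# Australia 9
```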
The match(s,r) function returns the position in string s where extended regular expression r occurs, or 0 if it does not occur. This function also sets two built-in variables RSTART and RLENGTH. RSTART is set to the starting position of the match in the string; this is the same value as the returned value. RLENGTH is set to the length of the matched string. (If a match does not occur, RSTART is 0, and RLENGTH is -1.) For example, the following program finds the first occurrence of the letter i followed by at most one character followed by the letter a in a record:
{ if (match($0, /i.?a/)) print RSTART, RLENGTH, $0 }
It produces the following output on the file countries:
17 2 USSR 8650 262 Asia
26 3 Canada 3852 24 North America
3 3 China 3692 866 Asia
24 3 USA 3615 219 North America
27 3 Brazil 3286 116 South America
8 2 Australia 2968 14 Australia
4 2 India 1269 637 Asia
7 3 Argentina 1072 26 South America
17 3 Sudan 968 19 Africa
6 2 Algeria 920 18 Africa
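Note that match finds the leftmost match and makes it as long as possible. For example, with the record
AsiaaaAsiaaaaan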
as input, the program
{ if (match($0, /a+/)) print RSTART, RLENGTH, $0 }
matches the first string of a's and sets RSTART to 4 and RLENGTH to 3.
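RSTART and RLENGTH are convenient arguments to substr when the matched text itself is wanted. The following sketch extracts the match from the record above and also shows the values left behind by a failed match; the second pattern, /z/, is an arbitrary choice that does not occur in the record.

```shell
out=$(echo 'AsiaaaAsiaaaaan' | awk '{
    if (match($0, /a+/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    if (!match($0, /z/))          # no match: RSTART is 0 and RLENGTH is -1
        print RSTART, RLENGTH
}')
echo "$out"
# 4 3 aaa
# 0 -1
```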
The function sprintf(format, expr[1], expr[2], . . ., expr[n]) returns (without printing) a string containing expr[1], expr[2], . . ., expr[n] formatted according to the printf specifications in the string format. ``The printf statement'' contains a complete specification of the format conventions. The statement
x = sprintf("%10s %6d", $1, $2)
assigns to x the string produced by formatting the values of $1 and $2 as a ten-character string and a decimal number in a field of width at least six; x may be used in any subsequent computation.
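To make the padding visible, the sketch below brackets the result and prints its length. The one-line input is hypothetical; the brackets are added only for display.

```shell
# %10s right-justifies Canada in ten columns; %6d right-justifies 3852 in six.
out=$(echo 'Canada 3852' |
    awk '{ x = sprintf("%10s %6d", $1, $2); print "[" x "]", length(x) }')
echo "$out"
# [    Canada   3852] 17
```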
The function substr(s,p,n) returns the substring of s that begins at position p and is at most n characters long. If substr(s,p) is used, the substring goes to the end of s; that is, it consists of the suffix of s beginning at position p. For example, we could abbreviate the country names in countries to their first three characters by invoking the program
{ $1 = substr($1, 1, 3); print }
on this file to produce
USS 8650 262 Asia
Can 3852 24 North America
Chi 3692 866 Asia
USA 3615 219 North America
Bra 3286 116 South America
Aus 2968 14 Australia
Ind 1269 637 Asia
Arg 1072 26 South America
Sud 968 19 Africa
Alg 920 18 Africa
Note that setting $1 in the program forces awk to recompute $0 and, therefore, the fields are separated by blanks (the default value of OFS), not by tabs.
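If the tabs should survive the assignment, one possible approach is to set both FS and OFS to a tab before any input is read. The sketch below compares the two behaviors on a single record; the shell function record and the variables blanks and tabs are hypothetical names.

```shell
# One tab-separated record, assumed to have the same layout as the file countries.
record() { printf 'Canada\t3852\t24\tNorth America\n'; }

# Default FS and OFS: $0 is rebuilt with single blanks between fields.
blanks=$(record | awk '{ $1 = substr($1, 1, 3); print }')

# FS and OFS both set to a tab: $0 is rebuilt with tabs.
tabs=$(record | awk 'BEGIN { FS = OFS = "\t" } { $1 = substr($1, 1, 3); print }')

echo "$blanks"
echo "$tabs"
```

The first line printed is Can 3852 24 North America with blanks throughout; the second keeps a tab between fields and leaves North America as a single field.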
Strings are stuck together (concatenated) merely by writing them one after another in an expression. For example, when invoked on the file countries,
{ s = s substr($1, 1, 3) " " } END { print s }
prints
USS Can Chi USA Bra Aus Ind Arg Sud Alg
by building s up a piece at a time from an initially empty string.
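The program above leaves a blank after the last abbreviation. A variant that avoids it, sketched here on a three-record excerpt, places the separator before each piece instead; the variable sep is a hypothetical helper that is empty for the first record only.

```shell
out=$(printf 'USSR\nCanada\nChina\n' |
    awk '{ s = s sep substr($1, 1, 3); sep = " " } END { print s }')
echo "$out"
# USS Can Chi
```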