Intro to AWK

AWK can be used as a super-GREP by typing one-liners to do amazing things to text files, but it's a real programming language, adequate for constructing all sorts of small text-based data processing systems. On the way to learning how to do that, we may discover an easy and fun interpreted language to help us with very simple tasks. All we need are AWK's -f command line option and its BEGIN{ action } pattern. Doing the usual "Hello, world!" thing, we can put

BEGIN { print "Hi, I'm an awk program!" }

in a file named hello.awk, run it with the command

awk -f hello.awk

and open the door to a cool new language interpreter for simple tasks.

The awk language takes a very relaxed approach to data types. It assumes that all variables are either text strings or double-precision floating-point numbers. It doesn't expect you to declare the types of the variables you use, but figures out from their context what their types are. It lets us do arithmetic...

BEGIN {
        A = 219
        B = 17
        print A / B
}

...use most of our favorite C control structures...

BEGIN {
        for (i=0; i<10; i++) {
                square = i * i
                printf("%d squared is %d.\n", i, square)
        }
}

...and define our own functions!

BEGIN {
        rangeMin = -2
        rangeMax = 2
        increment = .125

        inValue = rangeMin
        while (inValue <= rangeMax) {
                printf("X = %f\tY = %f\n", inValue, slopestep(inValue))
                inValue += increment
        }
}

function slopestep(x)
{
        if (x < 0) {
                return 0
        } else if (x > 1) {
                return 1
        } else {
                return x
        }
}
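
To watch the relaxed typing at work, here's a small sketch. The variable names x and n are made up for this illustration, and the program is typed on the command line instead of being loaded with -f. The same variable gets treated as a number or as a string depending on what we do with it, and a variable that was never assigned behaves like zero:

```shell
typed=$(awk 'BEGIN {
        x = "3"
        print x + 4     # arithmetic: x is used as the number 3
        print x "4"     # juxtaposition concatenates: x is used as the string "3"
        print n + 0     # n was never assigned, so it counts as zero
}')
echo "$typed"
```

This should print 7, 34, and 0 on separate lines.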

Is this really worth tinkering with? If you are or intend to become a Perl wizard, probably not. Knowing one all-purpose scripting language is adequate for the kind of back-of-the-envelope problem solving that we're discussing here. But if you prefer vi to emacs, if you like tools that aren't constantly becoming larger and more complex, and if you can get excited about powerful languages that can be described well in concise books like Aho, Kernighan, and Weinberger's The AWK Programming Language, then you too may become an AWK fan, even before venturing into the wonders of regular expressions and pattern matching!

Looking at BEGIN-pattern AWK, we see a conventional (BASIC-like) interpreted programming language. By adding a couple of patterns, we move into the world of "real" AWK. Real AWK processes text files, one line at a time, checking each line against the patterns of various rules and taking the corresponding action.

The END pattern works like the BEGIN pattern in that it gives us another block of code that will be executed once every time the program is run. The one difference is that while the BEGIN-pattern block is executed prior to all other processing, the END-pattern block is - unsurprisingly - executed after all the other processing is finished.

However, if you add an END block to a BEGIN-pattern AWK program and run it with awk -f script-file, what happens is that the BEGIN-block is executed, but the program seems to hang before the END-block begins to run. Why? The END-pattern block runs only after all data file processing is complete, and since we've specified neither a data file nor the operations to be performed on it, AWK uses its defaults: it's waiting eagerly to do nothing with every line it gets from standard input.

There's no reason to write an AWK program that has BEGIN and END patterns but nothing else. Instead of writing a program that stalls while waiting for input it never uses,

BEGIN { action-1 }
END { action-2 }

we write

BEGIN {
        action-1
        action-2
}

and use END only when we're also doing one-line-at-a-time text processing.
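
If you want to try a BEGIN-plus-END program anyway without typing an end-of-file at the keyboard, one option is to hand it an empty input. This sketch assumes a Unix-like system where /dev/null is available:

```shell
# With nothing to read, AWK reaches the END rule right away.
both=$(awk 'BEGIN { print "start" } END { print "done" }' < /dev/null)
echo "$both"
```

Both messages appear immediately, because the "data file processing" step finishes as soon as it starts.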

The simplest pattern for processing text files is what I'll call the nothing-pattern: an action with no pattern, which is executed for every line in the data file, regardless of its content. Let's look at a program which uses the nothing-pattern:

BEGIN { lineCount = 0 }
END { print "lineCount is", lineCount }
{ lineCount += 1 }

When saved as counter.awk and run with

        awk -f counter.awk datafile.txt

this program counts and reports the number of lines in datafile.txt. And, no, we have no particular reason to put the END-pattern rule last: BEGIN and END rules run before and after the input no matter where they appear in the program. (Ordinary rules that match the same line do run in the order they're written.)
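
A quick way to convince yourself that the END rule's position doesn't matter is to put it first and see that it still runs last. The two-line input here is invented for the example:

```shell
# END comes first in the program text but still runs after the last line.
order=$(printf 'a\nb\n' | awk '
END { print "lines:", NR }
{ print "saw", $0 }')
echo "$order"
```

The two "saw" lines come out first, followed by "lines: 2".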

Actually, you don't need to keep your own running line counter - AWK provides a number of built-in variables, one of which is "NR", the number of input records (i.e. lines) read so far. "FILENAME" is the name of the data file the program is processing. "NF" gives you the number of white-space delimited fields in the current record. "$0" is a text string equal to the entire current line, and "$1", "$2", etc. are the first, second, and later fields within the current record. The "records and fields" terminology is a hint that these features are provided to support traditional data processing methods, using table-structured data files. For example, if you use this data file

Joe Nameless
Madonna
Thomas Mark Ellis
Cyril Douglas Moonrock Polychloride
Prince
Calista Flockhart
Billy Joe Royal
Spencer Tracy

with this script

# put names in standard form

BEGIN { printf("Employees\n\n") }

END { printf("\n%d total employees\n", NR) }

{
        if (NF == 1) {
                printf("%d:\t%s\n", NR, $1)
        } else if (NF == 2) {
                printf("%d:\t%s, %s\n", NR, $2, $1)
        } else if (NF == 3) {
                printf("%d:\t%s, %s %s\n", NR, $3, $1, $2)
        } else if (NF > 3) {
                printf("%d:\t%s, %s blah blah\n", NR, $NF, $1)
        }
}

you get

Employees

1:	Nameless, Joe
2:	Madonna
3:	Ellis, Thomas Mark
4:	Polychloride, Cyril blah blah
5:	Prince
6:	Flockhart, Calista
7:	Royal, Billy Joe
8:	Tracy, Spencer

8 total employees

as output.
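
To see the built-in variables change from record to record, a throwaway one-liner helps. This sketch pipes two of the names above into AWK and prints NR, NF, and the last field ($NF) of each line:

```shell
fields=$(printf 'Joe Nameless\nThomas Mark Ellis\n' | awk '{ print NR, NF, $NF }')
echo "$fields"
```

The first line comes back as "1 2 Nameless" and the second as "2 3 Ellis".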

Notice that although this code sample uses a single nothing-pattern block to handle every text line, the first thing it does within that block is examine a characteristic of each line (the number of fields) so it can decide what kind of processing to do for each line. AWK gives us another way to structure this program: we can take the logical expressions used in the multi-way "if" construct, and use them as patterns to trigger corresponding actions. We'll call this - naturally enough - the "expression pattern." Rewriting the above:

# put names in standard form

BEGIN { printf("Employees\n\n") }

END { printf("\n%d total employees\n", NR) }

(NF == 1) { printf("%d:\t%s\n", NR, $1) }

(NF == 2) { printf("%d:\t%s, %s\n", NR, $2, $1) }

(NF == 3) { printf("%d:\t%s, %s %s\n", NR, $3, $1, $2) }

(NF > 3) { printf("%d:\t%s, %s blah blah\n", NR, $NF, $1) }

(As in the C language, a "true" expression is one that has a non-zero value. In AWK, an expression whose value is a non-empty string also counts as true.)
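
Any expression can serve as a pattern this way, not just a comparison. As a small sketch with invented input, the remainder NR % 2 is 1 (true) on odd-numbered lines and 0 (false) on even ones:

```shell
# Only lines 1 and 3 make the pattern non-zero.
odd=$(printf 'first\nsecond\nthird\n' | awk '(NR % 2) { print NR ": " $0 }')
echo "$odd"
```

Only "1: first" and "3: third" are printed.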


The next examples make more practical use of built-in variables and the expression pattern. When reformatting content from other people's complicated web pages into simpler pages, I need something to block long and short lines into more regular paragraphs. Here's a way to do that if the text editor doesn't:

# wrap.awk
 
# Reformat text so that:
#       there are no leading or trailing blank lines
#       a single blank line separates paragraphs
#       all lines begin in the first column
#       lines break iff adding next string exceeds line length
#       strings are not broken
#       no line is longer than lineSize unless a single string is longer
 
BEGIN {
        lineSize = 80
        column = 0
        writeBlank = 0
}

# Blank line - write newline and reset column counter
 
(NF == 0) {
        if (writeBlank) {
                printf("\n\n")
                column = 0
                writeBlank = 0
        }
}
 
# Non blank line - copy all strings.  Cases are:
#       string length > line length - put it on a line by itself
#       string starts in column 0 and fits
#       string starts in other column and fits
#       string forces wrap
 
(NF != 0) {
        writeBlank = 1
        for (i=1; i<=NF; i++) {
                if (length($i) >= lineSize) {
                        if (column != 0) {
                                printf("\n")
                        }
                        printf("%s", $i)
                        column = length($i)
                } else if (column == 0) {
                        printf("%s", $i)
                        column = length($i)
                } else if (column + 1 + length($i) < lineSize) {
                        printf(" %s", $i)
                        column = column + 1 + length($i)
                } else {
                        printf("\n%s", $i)
                        column = length($i)
                }
        }
}

When importing plain text files into a word processor, it's often necessary to strip out linefeeds to let the word processor do its own soft wrapping. That's easy enough...

# unwrap.awk

# Reformat text so that:
#       there are no leading or trailing blank lines
#       a single blank line separates paragraphs
#       all lines begin in the first column
#       each paragraph is one (possibly long) line

# This gets rid of hard line wrapping when text is being
# exported to a word processor that will do its own soft
# line wrapping.

BEGIN {
        writeSpace = 0
        writeBlank = 0
}

# Blank line - end the current paragraph and reset the space flag

(NF == 0) {
        if (writeBlank) {
                printf("\n\n")
                writeSpace = 0
                writeBlank = 0
        }
}

(NF != 0) {
        writeBlank = 1
        for (i=1; i<=NF; i++) {
                if (writeSpace) {
                        printf(" ")
                } else {
                        writeSpace = 1
                }
                printf("%s", $i)
        }
}

As a text-oriented language built around pattern matching, AWK naturally includes a regular-expression pattern. In that pattern, the regular expression to be matched is enclosed between forward slashes. The simplest such pattern is a string of ordinary characters to be interpreted literally:

/ERROR/ { print $0 }

The one-line script above echoes all lines containing the string "ERROR".

Metacharacters used in AWK regular expressions include:
^ standing for "beginning of string",
$ standing for "end of string",
. standing for any single character, and
\ serving as an escape character.

For example, we can print all lines which end with a period by using

/\.$/ { print $0 }
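
The escape character matters here. As a rough illustration with made-up input, compare the escaped period with the bare one, which matches any character and so selects every non-empty line:

```shell
input='Done.
Not yet
v1.2'
# Escaped: only lines whose last character is a literal period.
escaped=$(printf '%s\n' "$input" | awk '/\.$/ { print $0 }')
# Bare: the period matches any final character at all.
bare=$(printf '%s\n' "$input" | awk '/.$/ { print $0 }')
echo "$escaped"
echo "$bare"
```

The escaped version prints only "Done.", while the bare version echoes all three lines.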

Of course the regular expression pattern provides many other features, and AWK gives us a few more kinds of pattern to work with, but this will do for now. The following sample script translates text files using a trivial markup language into HTML. In this markup language, formatting tags begin with a period in the first column of a line, and there are only two kinds of tags: ".h", which tells us that the following text is a centered heading, and ".t", which tells us that the following text is simply normal text. One or more blank lines will be treated as a paragraph break.

# prior to first line - initialize & write HTML header

BEGIN {
        UNKNOWN = 0
        HEADING = 1
        TEXT = 2

        currentState = UNKNOWN
        lastState = UNKNOWN

        print "<html>"
        print "<head>"
        print "<title>Powered by AWK</title>"
        print "</head>"
        print "<body>"
}

# blank line - close current tag if we haven't already

(NF == 0) {
        if (currentState == HEADING) {
                print "</h3>"
                lastState = HEADING
                currentState = UNKNOWN
        } else if (currentState == TEXT) {
                print "</p>"
                lastState = TEXT
                currentState = UNKNOWN
        }
}

# line begins with .h - switch to heading mode

/^\.h/ {
        currentState = HEADING
        print "<h3 align=center>"
        if (NF > 1) {
                for (i=2; i<=NF; i++) {
                        printf("%s ", $i)
                }
                printf("\n");
        }
}

# line begins with .t - switch to normal text mode

/^\.t/ {
        currentState = TEXT
        print "<p>"
        if (NF > 1) {
                for (i=2; i<=NF; i++) {
                        printf("%s ", $i)
                }
                printf("\n");
        }
}

# non-blank lines not beginning with "."

!/^\./ && (NF > 0) {
        if (currentState != UNKNOWN) {
                print $0
        } else if (lastState == TEXT) {
                currentState = TEXT
                print "<p>"
                print $0
        } else if (lastState == HEADING) {
                currentState = HEADING
                print "<h3 align=center>"
                print $0
        }
}

# all lines processed - close open tags

END {
        if (currentState == HEADING) {
                print "</h3>"
        } else if (currentState == TEXT) {
                print "</p>"
        }
        print "</body>"
        print "</html>"
}

This script transforms things like

.h How I Spent My Summer Vacation

.t
I spent way too much time
playing with cryptic and
archaic computer languages.

I had fun anyway.

into things like

<html>
<head>
<title>Powered by AWK</title>
</head>
<body>
<h3 align=center>
How I Spent My Summer Vacation 
</h3>
<p>
I spent way too much time
playing with cryptic and
archaic computer languages.
</p>
<p>
I had fun anyway.
</p>
</body>
</html>

The one slightly interesting thing in this script is its use of the composite pattern "!/^\./ && (NF > 0)" to stand for "all non-blank lines not beginning with a period".
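
The composite pattern can be tried on its own. In this sketch the four-line input (a tag line, a text line, a blank line, and another text line) is invented for the example, and the action just reports which line numbers got through:

```shell
# Tag lines and blank lines fail the pattern, so only lines 2 and 4 print.
kept=$(printf '.t\nhello\n\nworld\n' | awk '!/^\./ && (NF > 0) { print NR ": " $0 }')
echo "$kept"
```

Only "2: hello" and "4: world" are printed.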

Want more AWK? The essential resource for AWK users is the book The AWK Programming Language by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. The book's web page points to other online AWK resources.

Last update: Monday, 5/5/08