go figure

Sooner or later, you’ll want to (or have to) make a graph. Many software programs can do this, and many books tell you how. But if you really care about the appearance of your graph, and perhaps more importantly, about the clear communication of the information contained therein, there is relatively little software flexible enough to give you ideal control of the parameters, and there are only a few guidebooks worth reading.

I’ll cover graphing software in more detail in another entry, focusing primarily on two programs, SigmaPlot and S-Plus, that make it easy to create publication-quality graphs of many kinds. (Both are available, unfortunately, only to users of Microsoft Windows; a free program called R, however, shares many of the features of S-Plus.) Here, I’ll focus on a few of the rules to follow when making graphs, and I’ll illustrate them using only one kind of plot – the scatterplot. I’ll pair bad examples with good ones, and I’ll explain why the rules work.

These rules aren’t mine; they have been formulated and put into practice by many researchers before me. But the person who has perhaps done the most and best work to understand the theory and practice of statistical graphs, and to consolidate and illustrate these rules, is William Cleveland. His book, The Elements of Graphing Data, is to graph-makers and statisticians what Robert Bringhurst’s The Elements of Typographic Style is to typographers. It is a how-to compendium of graph-making, and is indispensable to those of us who plot data on a routine basis.

Cleveland’s overarching message is: Draw the eye to the data; treat the data fairly and carefully. Six of what I consider Cleveland’s most important rules are:

(1) Use a pair of scale lines for each variable. Cleveland makes a strong argument for table look-up here – that “judging the scale value of a point by judging its position along a scale line...is easier and more accurate as the distance of the point from the scale line decreases.” Compare figures 1a and 1b (made-up data set), and notice how much easier table look-up is when two scale lines are used for each variable, rather than just one.

(2) Make the data rectangle slightly smaller than the scale-line rectangle. In figure 2a, the data rectangle and the scale-line rectangle are coincident; some data points therefore fall on the scale lines and are difficult to see. A “padding” of 5% is added to the data rectangle in figure 2b; all data points are contained within the scale-line rectangle and are easily visualized.
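As a rough sketch of rule 2, the padded axis limits can be computed directly from the data. The helper name `padded_limits` and the sample values are hypothetical, and the sketch assumes the 5% is applied to the data range on each side of the axis, which is one reasonable reading of "a padding of 5%".

```python
def padded_limits(values, pad_fraction=0.05):
    """Return (low, high) axis limits that extend the data range
    by pad_fraction of that range on each side.

    Assumption: padding is measured relative to the data range,
    not relative to the final axis length.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: pad by the magnitude of the value (or 1.0).
        span = abs(lo) or 1.0
    pad = span * pad_fraction
    return lo - pad, hi + pad


# Made-up data: the extreme points now sit inside the scale lines.
x_values = [20.0, 22.5, 26.0, 31.0, 40.0]
x_low, x_high = padded_limits(x_values)
print(x_low, x_high)
```

The two numbers would then be passed to whatever axis-range setting the graphing program offers, once for each variable.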

(3) Use outward-pointing tick marks. Inward-pointing ticks, as shown in figure 3a, simply add clutter to the interior of the graph, and in my opinion, make table look-up more difficult. Compare to figure 3b.

(4) Avoid slavishly including zero on the axes. Cleveland here refers to the widely read book by Darrell Huff – How to Lie with Statistics – wherein Huff says that a graph without a zero line is dishonest. Cleveland argues, however, that including zero may waste space and, more importantly, may interfere with our judgment of the data (figure 4a). Therefore, fill the scale-line rectangle with the data (figure 4b). Cleveland emphasizes: “Assume the viewer will look at the tick mark labels and understand them.”
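To put a number on the wasted space, here is a small illustrative calculation. The function `data_fill_fraction` and the sample values are made up for this sketch; it simply reports what fraction of the axis length the data occupy, with and without a forced zero.

```python
def data_fill_fraction(values, include_zero=False):
    """Fraction of the axis length occupied by the data range.

    With include_zero=True the axis is stretched, if necessary,
    so that it also covers zero.
    """
    lo, hi = min(values), max(values)
    axis_lo, axis_hi = lo, hi
    if include_zero:
        axis_lo, axis_hi = min(lo, 0.0), max(hi, 0.0)
    axis_span = axis_hi - axis_lo
    if axis_span == 0:
        return 1.0  # a single repeated value fills a degenerate axis
    return (hi - lo) / axis_span


# Made-up measurements that all lie far from zero.
readings = [95.0, 98.0, 101.0, 105.0]
print(data_fill_fraction(readings, include_zero=True))
print(data_fill_fraction(readings, include_zero=False))
```

For data like these, forcing zero onto the axis squeezes all of the variation into a small strip at one end of the scale, which is the effect Cleveland warns about.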

(5) Use open rather than filled symbols to mark the data points. Invariably, some of the data will fall on or close to the same coordinates; see the points that lie roughly at (26, 7) in figures 5a and 5b. They are hard to distinguish in 5a, in which filled circles are used to denote the data, but the overlap can clearly be seen in 5b.

(6) If summarizing the data or drawing the eye to them with a line, use the line that best fits them. It is tempting to superimpose a straight-line regression fit on the data – this is the easiest (or only) option in some graphing programs – but it may not be fair to the data or to the reader. The data set used here has some curvature, and the straight-line fit shown in figure 6a does not adequately represent it. A technique called locally weighted regression (loess for short) draws a smooth curve through the data: at each point along the x-axis it fits a simple regression to the nearby data, giving the closest points the most weight, and then joins these local fits together (figure 6b).
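For readers curious about what a loess fit involves, the following is a minimal sketch in plain Python, not the implementation used by S-Plus, R, or any other program. It assumes a local straight-line fit with tricube weights and omits the robustness iterations that full implementations offer. The function names, the `frac` parameter, and the data are all made up for illustration.

```python
def tricube(u):
    """Tricube weight: 1 at u = 0, falling smoothly to 0 for u >= 1."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess(xs, ys, frac=0.5):
    """Smoothed y value at each x, from weighted local straight-line fits.

    frac is the share of the data used in each local neighbourhood.
    """
    n = len(xs)
    k = min(n, max(3, round(frac * n)))  # points per neighbourhood
    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        h = sorted(dists)[k - 1]  # bandwidth: distance to k-th nearest point
        if h == 0:
            weights = [1.0 if d == 0 else 0.0 for d in dists]
        else:
            weights = [tricube(d / h) for d in dists]
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / sw
        my = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a local mean
        fitted.append(my + slope * (x0 - mx))
    return fitted


def straight_line(xs, ys):
    """Ordinary least-squares straight-line fit, evaluated at each x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return [my + slope * (x - mx) for x in xs]


def sse(ys, fitted):
    """Sum of squared residuals between the data and a fit."""
    return sum((y - f) ** 2 for y, f in zip(ys, fitted))


# Made-up curved data: compare how closely each fit follows the points.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
print("straight line:", sse(ys, straight_line(xs, ys)))
print("loess:        ", sse(ys, loess(xs, ys)))
```

On curved data such as these, the local fits follow the bend that a single straight line cannot. A smaller `frac` tracks the data more closely, while a larger one gives a smoother curve.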

A review of The Elements of Graphing Data in Meteorological Magazine states, “Ideally, everyone interested in getting the most out of their data or presenting data clearly and concisely should have a copy handy.” My recommendation is no less enthusiastic. Buy it, read it, and digest it (US$52.95); the quality of your graphs will improve, and the clarity of the information you convey will increase dramatically.

23-July 2002