go figure
Sooner or later, you’ll want to (or have to) make a graph. Many software
programs can do this, and many books tell you how. But if you really
care about the appearance of your graph, and perhaps more importantly,
about the clear communication of the information contained therein,
there is relatively little software flexible enough to give you
full control of a graph's parameters, and there are only a few
guidebooks worth reading.
I’ll cover graphing software in more detail in another entry, where
I’ll focus primarily on two programs – SigmaPlot and S-Plus
(available, unfortunately, only to users of Microsoft Windows; a
freeware program called R, however, shares many of the features of
S-Plus) – that enable the easy creation of publication-quality graphs
of many kinds. Here, I’ll focus on a few of the rules to follow when
making graphs, and I’ll illustrate them using only one kind of plot:
the scatterplot. I’ll pair bad examples with good ones and explain
why the rules work.
These rules aren’t mine; they have been formulated and put into
practice by many researchers before me. But the person who has perhaps
done the most and best work to understand the theory and practice
of statistical graphs, and to consolidate and illustrate these rules,
is William Cleveland. His book, The Elements of Graphing Data, is to
graph-makers and statisticians what Robert Bringhurst’s The Elements
of Typographic Style is to typographers.
It is a how-to compendium of graph-making, and is indispensable
to those of us who plot data on a routine basis.
Cleveland’s overarching message is: Draw the eye to the data; treat the data
fairly and carefully. Six of what I consider Cleveland’s most important
rules are:
(1) Use a pair of scale lines for each variable. Cleveland
makes a strong argument for table look-up here – that “judging
the scale value of a point by judging its position along a scale
line...is easier and more accurate as the distance of the point
from the scale line decreases.” Compare figures 1a and 1b (made-up
data set), and notice how much easier table look-up is when two scale
lines are used for each variable, rather than just one.
(2) Make the data rectangle slightly smaller than the scale-line
rectangle. In figure 2a, the data rectangle and the scale-line
rectangle are coincident; some data points therefore fall on the
scale lines and are difficult to see. In figure 2b, a “padding” of 5%
is added to the data rectangle; all data points lie within the
scale-line rectangle and are easy to see.
(3) Use outward-pointing tick marks. Inward-pointing ticks, as shown
in figure 3a, simply add clutter to the interior of the graph and, in
my opinion, make table look-up more difficult. Compare with figure 3b.
(4) Avoid slavishly including zero on the axes. Cleveland here refers
to the widely read book by Darrell Huff, How to Lie with Statistics,
in which Huff says that a graph without a zero line is dishonest.
Cleveland argues, however, that including zero may waste space and,
more importantly, may interfere with our judgment of the data (figure
4a). Therefore, fill the scale-line rectangle with the data (figure
4b). Cleveland emphasizes: “Assume the viewer will look at the tick
mark labels and understand them.”
(5) Use open rather than filled symbols to mark the data points.
Invariably, some of the data will fall on or close to the same coordinates;
see the points that lie roughly at (26, 7) in figures 5a and 5b. They
are hard to distinguish in 5a, in which filled circles are used to
denote the data, but the overlap can clearly be seen in 5b.
(6) If summarizing the data or drawing the eye to them with a line,
use the line that best fits them. It is tempting to superimpose a
straight-line regression fit on the data – this is the easiest (or
only) option in some graphing programs – but it may not be fair to
the data or to the reader. The data set used here has some curvature,
and the straight-line fit shown in figure 6a does not adequately
represent it. A technique called locally weighted regression (loess
for short) draws a smooth curve through the data by fitting a
regression to the points near each location, weighting the closest
points most heavily, and connecting the locally fitted values
(figure 6b).
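Rules 2 and 6 both come down to a small calculation, so a sketch may
make them concrete. The Python below is only an illustration; it is
not how the figures here were drawn, and it is not Cleveland's full
procedure. The names padded_limits and loess are hypothetical, the 5%
padding is assumed to be a fraction of the data range added at each
end of the axis, and the smoother is a simplified locally weighted
straight-line fit with tricube weights and no robustness iterations.

```python
def padded_limits(values, pad=0.05):
    """Axis limits that leave a margin around the data (rules 2 and 4).

    Assumption: `pad` is a fraction of the data range added at each end.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: fall back to a nonzero margin.
        span = abs(lo) or 1.0
    return lo - pad * span, hi + pad * span


def loess(xs, ys, frac=0.5):
    """Simplified locally weighted linear regression (rule 6).

    For each x0 in xs, fit a weighted straight line to the nearest
    `frac` share of the points and return the line's value at x0.
    """
    n = len(xs)
    k = min(n, max(2, round(frac * n)))  # points used in each local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0  # bandwidth: distance to k-th nearest point
        # Tricube weights: close to 1 near x0, falling to 0 at distance h.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted
```

Calling padded_limits on each variable gives axis limits that sit just
outside the smallest and largest values instead of stretching to zero,
and loess(x, y) returns one smoothed value per data point, which can
then be joined with a line. Smaller values of frac follow the data
more closely; larger values give a smoother curve.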
A review of The Elements of Graphing Data in Meteorological
Magazine states, “Ideally, everyone interested in getting
the most out of their data or presenting data clearly and concisely
should have a copy handy.” My recommendation is no less enthusiastic.
Buy it, read it, and digest it (US$52.95); the quality of your graphs
will improve, and the clarity of the information you convey will
increase dramatically.
23 July 2002