It seems I’m always second-guessing the writing for my monthly BI columns. Whenever I re-read a several-month-old analytics article, for example, I question not only the wordsmithing, but also the choice of stats and visuals. Why that wording? That analytic? That graphic? On second look, there always appears to be a better way. I suppose it’s a good thing that a platform as comprehensive as R provides so many choices for analysts.
There are certainly many different ways of visualizing a given set of data in R. R includes several separate graphics subsystems, one being the powerful Lattice, which provides the basis for many types of dimensional visuals. Lattice programmers can work at three levels of complexity, each progressively expanding what the visuals can do. The first approach is simply to script the Lattice function syntax as prescribed, staying cognizant of all options and parameters. Even with simple Lattice programming, data must often be presented in a “stacked” format (one column of values plus a factor identifying the group) amenable to the Lattice calling functions. In a slightly more complex mode, Lattice provides programmable access to the panel functions underpinning each of its graphics, such as xyplot, stripplot, bwplot, and dotplot, along with a set of panel primitives that let programmers embellish and change each graphic’s functionality. Finally, R provides access to its low-level Grid graphics model, giving industrious programmers the ability to develop entirely new visuals and graphics subsystems. Though I work at all three levels, I often find programming panel functions to be quite productive for the effort involved, and I illustrate several possibilities in the graphs that follow.
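As a rough illustration of the first two levels, the sketch below stacks a small wide data frame and then wraps a stock panel function with one panel primitive. The data and the column names `before` and `after` are hypothetical, invented for the example; only the Lattice calls themselves are standard.

```r
library(lattice)

# Hypothetical wide data: two measurements per row (not the New Haven set).
set.seed(1)
wide <- data.frame(before = rnorm(50, mean = 10),
                   after  = rnorm(50, mean = 11))

# Lattice formulas want one value column plus a grouping factor,
# so stack the wide columns into "values" and "ind".
long <- stack(wide)

# A custom panel function: draw the stock box-and-whisker panel,
# then add a dashed reference line at the overall mean.
p <- bwplot(values ~ ind, data = long,
            panel = function(x, y, ...) {
              panel.bwplot(x, y, ...)
              panel.abline(h = mean(y), lty = 2, col = "red")
            })
print(p)
```

The same pattern (call the default panel function, then layer primitives on top) is one common way to embellish most Lattice displays without dropping down to Grid.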
The data used for this column’s graphics derive from the 18,221-case New Haven Residential data set detailed in “Data Analysis and R (the first in a series)”, http://dashboardinsight.com/articles/new-concepts-in-business-intelligence/et-data-analysis-r-the-first-series.aspx . Attributes of interest include appraised housing value (totalCurrVal) and its base 10 logarithm (logtotalCurrVal); square footage (livingArea) and its log (loglivingArea), as well as a quartile grouping of livingArea (livingArea_cat); and zone, a three-valued factor denoting residential area.
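For readers who want to follow along without the original file, here is a minimal sketch of how such derived attributes could be built. The data frame `homes` and its values are simulated stand-ins; only the column names come from the description above, and the actual preparation steps may well have differed.

```r
# Simulated stand-in for the New Haven data: the values are hypothetical.
set.seed(42)
n <- 1000
homes <- data.frame(
  totalCurrVal = round(10^rnorm(n, mean = 5.2, sd = 0.25)),
  livingArea   = round(10^rnorm(n, mean = 3.25, sd = 0.15)),
  zone         = factor(sample(c("RS", "RM", "Other"), n, replace = TRUE))
)

# Base 10 logs of value and square footage.
homes$logtotalCurrVal <- log10(homes$totalCurrVal)
homes$loglivingArea   <- log10(homes$livingArea)

# Quartile grouping: cut livingArea at its 0/25/50/75/100% quantiles.
breaks <- quantile(homes$livingArea, probs = seq(0, 1, by = 0.25))
homes$livingArea_cat <- cut(homes$livingArea, breaks = breaks,
                            labels = c("Q1", "Q2", "Q3", "Q4"),
                            include.lowest = TRUE)

# Group sizes should be roughly equal.
table(homes$livingArea_cat)
```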

Figure 1a is a basic stripplot detailing the distribution of logtotalCurrVal for each of four roughly equal quartile categories of livingArea, with Q4 representing the quartile of largest homes, and Q1 the smallest. The graph deploys jittering to space data within grouping bands. Not surprisingly, there appears to be a strong association between livingArea_cat and logtotalCurrVal. Note also the wider variation of logtotalCurrVal in Q1 and Q4. The reader must do mental algebra to compute the anti-logs of logtotalCurrVal.
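A jittered stripplot of this general kind can be sketched as follows. The data are simulated and the layout is one plausible choice, not necessarily the one behind Figure 1a.

```r
library(lattice)

# Simulated stand-in data; column names follow the article, values do not.
set.seed(42)
n <- 400
loglivingArea <- rnorm(n, mean = 3.25, sd = 0.15)
homes <- data.frame(
  loglivingArea   = loglivingArea,
  logtotalCurrVal = 2 + loglivingArea + rnorm(n, sd = 0.15)
)
homes$livingArea_cat <- cut(homes$loglivingArea,
                            breaks = quantile(homes$loglivingArea, 0:4 / 4),
                            labels = paste0("Q", 1:4), include.lowest = TRUE)

# jitter.data spreads the points within each band so they overlap less.
p <- stripplot(livingArea_cat ~ logtotalCurrVal, data = homes,
               jitter.data = TRUE, factor = 1.5, pch = 1, cex = 0.5)
print(p)

# The "mental algebra" for reading the axis: 10^x undoes log10(x).
10^c(4.5, 5, 5.5)
```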

Figure 1b adds a number of enhancements to 1a. First, it uses color to show the zone in the strip points. RM appears to be a more expensive area than RS, with the smaller Other in between. This graph also introduces a set of grayed rectangles that provide anchoring support for viewing the multiple panels. The darker, inside rectangle envelops the middle 50% (interquartile range) of logtotalCurrVal values; the larger encompassing rectangle surrounds the middle 95% of logtotalCurrVal observations. The irregular scale points translate from log values, noting the min, max, and median, in addition to the 25th, 75th, 2.5th, and 97.5th percentiles. In contrast to 1a, 1b is a lot to digest.
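One way to build anchoring rectangles and anti-log tick labels with a panel function is sketched below. It assumes the quantiles are computed once over all observations and shared by every panel; the data, the helper name `anchored_strip`, and the gray shades are hypothetical choices for illustration.

```r
library(lattice)

# Simulated stand-in data; values and zone shares are hypothetical.
set.seed(42)
n <- 600
homes <- data.frame(
  logtotalCurrVal = rnorm(n, mean = 5.2, sd = 0.25),
  livingArea_cat  = factor(sample(paste0("Q", 1:4), n, replace = TRUE)),
  zone            = factor(sample(c("RS", "RM", "Other"), n, replace = TRUE))
)

# Anchor points on the log scale, over all observations:
# min, 2.5%, 25%, median, 75%, 97.5%, max.
q <- unname(quantile(homes$logtotalCurrVal,
                     c(0, 0.025, 0.25, 0.5, 0.75, 0.975, 1)))

# Tick labels are the anti-logs, so readers see values rather than logs.
tick_labels <- formatC(round(10^q), format = "d", big.mark = ",")

anchored_strip <- function(x, y, ...) {
  ylim <- current.panel.limits()$ylim
  # Draw the rectangles first so the points land on top of them.
  panel.rect(q[2], ylim[1], q[6], ylim[2],
             col = "grey90", border = "transparent")   # middle 95%
  panel.rect(q[3], ylim[1], q[5], ylim[2],
             col = "grey75", border = "transparent")   # middle 50%
  panel.stripplot(x, y, ...)
}

p <- stripplot(~ logtotalCurrVal | livingArea_cat, data = homes,
               groups = zone, jitter.data = TRUE, layout = c(1, 4),
               panel = anchored_strip, auto.key = list(columns = 3),
               scales = list(x = list(at = q, labels = tick_labels, rot = 45)))
print(p)
```

Because the tick positions sit at quantiles rather than round numbers, the spacing is irregular by construction, which is exactly the trade-off the figure makes.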

Figure 2a examines kernel density estimates of logtotalCurrVal by the categories of livingArea_cat. With apologies to purist statisticians, density functions look like smoothed-out histograms, in which the bin size of the histogram approaches zero. Even without anchoring, the densities clearly show a shift to the right, confirming an increase in logtotalCurrVal as livingArea_cat progresses from Q1 to Q4.
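A conditioned density plot along these lines might look like the sketch below, again on simulated data with the article's column names. The per-quartile medians at the end are just a numeric check on the rightward shift.

```r
library(lattice)

# Simulated stand-in data; values are hypothetical.
set.seed(42)
n <- 2000
loglivingArea <- rnorm(n, mean = 3.25, sd = 0.15)
homes <- data.frame(
  loglivingArea   = loglivingArea,
  logtotalCurrVal = 2 + loglivingArea + rnorm(n, sd = 0.1)
)
homes$livingArea_cat <- cut(homes$loglivingArea,
                            breaks = quantile(homes$loglivingArea, 0:4 / 4),
                            labels = paste0("Q", 1:4), include.lowest = TRUE)

# One kernel density panel per quartile, stacked so shifts are easy to see.
p <- densityplot(~ logtotalCurrVal | livingArea_cat, data = homes,
                 layout = c(1, 4), as.table = TRUE, plot.points = FALSE)
print(p)

# The shift, summarized as the median of each quartile group.
med <- tapply(homes$logtotalCurrVal, homes$livingArea_cat, median)
med
```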

Figure 2b adds the same anchoring rectangles as Figure 1b, including identical anti-log quantile scale points. Further visual support is provided by the bottom “All” panel, which includes all observations. The red curve on each panel is an idealized normal distribution with the mean and variance of all logtotalCurrVal observations. Note that for the relatively stable “All” category, there appear to be more values in the middle 50% range than would be expected under a normal distribution, with fewer in the 2.5%-25% and 75%-97.5% bands.
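The “All” panel and the normal reference curve can be approximated as in the sketch below. It assumes the extra panel is built by appending a relabeled copy of the data, which is one common approach and not necessarily the one used for Figure 2b; the data frame `both` and the variable `normal_mid` are hypothetical names. Since the simulated values are close to normal, the final number should land near 0.5, whereas a value well below 0.5 would indicate more mass in the middle than a normal curve predicts.

```r
library(lattice)

# Simulated stand-in data; values are hypothetical.
set.seed(42)
n <- 2000
loglivingArea <- rnorm(n, mean = 3.25, sd = 0.15)
x <- 2 + loglivingArea + rnorm(n, sd = 0.1)        # plays logtotalCurrVal
cat4 <- cut(loglivingArea, breaks = quantile(loglivingArea, 0:4 / 4),
            labels = paste0("Q", 1:4), include.lowest = TRUE)

# Append a second copy of every row, labeled "All", as a fifth panel.
both <- data.frame(
  logtotalCurrVal = c(x, x),
  panel = factor(c(as.character(cat4), rep("All", n)),
                 levels = c("Q1", "Q2", "Q3", "Q4", "All"))
)

# Normal reference with the mean and sd of all observations.
m <- mean(x)
s <- sd(x)

p <- densityplot(~ logtotalCurrVal | panel, data = both,
                 layout = c(1, 5), as.table = TRUE, plot.points = FALSE,
                 panel = function(x, ...) {
                   panel.densityplot(x, ...)
                   panel.mathdensity(dmath = dnorm,
                                     args = list(mean = m, sd = s),
                                     col = "red")
                 })
print(p)

# Share of a normal(m, s) distribution falling between the observed quartiles.
normal_mid <- unname(diff(pnorm(quantile(x, c(0.25, 0.75)), mean = m, sd = s)))
normal_mid
```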

Figure 3a is a scatterplot (xyplot) of logtotalCurrVal by loglivingArea, using log values as scale points on both axes.

Figure 3b adds the range rectangles for both variables, loglivingArea on the x axis and logtotalCurrVal on the y. Translated back from logs, the interquartile range of livingArea is thus 1,240-2,575 square feet. The min, median, and max of livingArea are 330, 1,752, and 13,621, respectively. The middle 95% range is 701-4,248. The red line is the fitted linear regression; the green curve is a local polynomial regression of degree 2. There appears to be some curvature in the relationship between loglivingArea and logtotalCurrVal. Figure 3b is a very busy visual.
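The two fitted curves and the anti-log tick labels can be sketched with standard panel functions, as below. The data are simulated with a purely linear relationship, so the loess curve here will hug the regression line; the helper `antilog` and the colors are hypothetical choices, and the rectangles are omitted for brevity.

```r
library(lattice)

# Simulated stand-in data; values are hypothetical.
set.seed(42)
n <- 2000
loglivingArea <- rnorm(n, mean = 3.25, sd = 0.15)
homes <- data.frame(
  loglivingArea   = loglivingArea,
  logtotalCurrVal = 2 + loglivingArea + rnorm(n, sd = 0.1)
)

# Tick marks at the min, quartiles, and max, labeled with anti-logs.
probs <- c(0, 0.25, 0.5, 0.75, 1)
x_at  <- unname(quantile(homes$loglivingArea, probs))
y_at  <- unname(quantile(homes$logtotalCurrVal, probs))
antilog <- function(v) formatC(round(10^v), format = "d", big.mark = ",")

p <- xyplot(logtotalCurrVal ~ loglivingArea, data = homes,
            scales = list(x = list(at = x_at, labels = antilog(x_at), rot = 45),
                          y = list(at = y_at, labels = antilog(y_at))),
            panel = function(x, y, ...) {
              panel.xyplot(x, y, col = "grey60", cex = 0.4)
              panel.lmline(x, y, col = "red", lwd = 2)            # linear fit
              panel.loess(x, y, degree = 2,
                          col = "darkgreen", lwd = 2)             # local quadratic
            })
print(p)

# The same linear fit, numerically.
fit <- lm(logtotalCurrVal ~ loglivingArea, data = homes)
coef(fit)
```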
I generally get feedback on the graphics to be included in my writing before finally submitting the articles. More often than not, that feedback is positive. This time, however, the reviews were mixed, at best. Reviewers dissed the logarithmic transformations, logtotalCurrVal and loglivingArea, as difficult to grasp. Several were also a bit put off by the irregular tick marks and spacing in the scales for the graphs. And, of course, the logarithmic scales are not linear, so a one-inch distance has different meanings at different points. One said that Figure 3b was “over the top,” simply too busy.
Stats folks often transform their data before delving deeply into analysis. The logarithmic transformation is one of the most common, especially when there are extreme values in the data. In our case, such a transformation was needed to handle the large ranges of both totalCurrVal (3,940 to 999,000) and livingArea (330 to 13,600). Left alone, graphs for those variables would be hopelessly bunched in the low to mid ends.
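The arithmetic behind that compression is easy to check. The snippet below uses only the ranges quoted above; the variable names are hypothetical.

```r
# Ranges quoted in the article.
value_range <- c(3940, 999000)   # totalCurrVal
area_range  <- c(330, 13600)     # livingArea

# On the raw scale the largest value is more than 250 times the smallest.
value_range[2] / value_range[1]

# On the log10 scale each range spans only a few units.
diff(log10(value_range))
diff(log10(area_range))
```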
I’m ambivalent on the irregular appearance of the scales in the embellished graphs. I too am a bit put off by that look, but at the same time love the additional information that’s conveyed. Taken together, the numeric scale labels give eight points of valuable percentile information, including min, max, and median. I also like to see the interquartile range and the 95% range, both of which are included in these charts. A set of anchoring rectangles for a dependent variable might well be helpful, especially to show differences across panels. The more I look at Figure 3b, however, the more I agree with the critic who complained it’s too busy. Maybe one set of anchoring rectangles is enough! I’m very interested in getting reader reactions, both pro and con.
Subsequent columns will highlight 3-dimensional graphs, plots of multidimensional frequencies, unusual plots, and graphics packages contributed by the R community outside the core distribution.
References:
- Paul Murrell. R Graphics. Chapman & Hall/CRC. 2005.
- Deepayan Sarkar. Lattice: Multivariate Data Visualization with R. Springer. 2008.