I wasn’t quite sure how to proceed with this column. The initial R and BI article (http://dashboardinsight.com/articles/new-concepts-in-business-intelligence/et-data-analysis-r-the-first-series.aspx) introduced two data sets for consideration. The first, cps_wages, is small (534 cases) and focuses on determinants of wages in a sample from the 1985 population. The second, NewHavenResidential, is much larger (18,221 cases) and revolves around factors affecting assessed housing values for New Haven, CT in 2006.
At first, I thought I’d rotate between the data sets, illustrating exploratory graphics using each as appropriate. As I continued, however, I started to emphasize case-level rather than summary data – and 534 cases is a little easier to handle than 18,221. Regardless, both are good data sets and will get ample attention as the series progresses.
The analyses presented here look at cps_wages for correlates of wages, with special emphasis on differences between the sexes. Table 1 details frequencies for the data set attributes. I collapsed age, experience, and education into categories for this table rather than use the original numeric values, which would have made the table too lengthy. Though this aggregation is certainly helpful for visualization, most statisticians (me included) prefer atomic to categorized variables for predictive models, since information is lost when detailed data are grouped. Categorized variables, however, are often useful for initial review and for constructing dimension variables to facilitate visualization. I made sensible categories for education, but simply divided age into five roughly equal quantile groups and experience into three. Wages are presented in five quantile groups as well for summarization. Several variables, like race, sector, and union, have skewed distributions that likely diminish their usefulness for analyses.
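The quantile grouping described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame below is synthetic, and the column names (age, experience, wage) and the helper quantile_groups are hypothetical stand-ins, not the actual cps_wages layout or the code behind Table 1.

```r
# Synthetic stand-in for cps_wages; column names and values are assumptions.
set.seed(1985)
n <- 534
wages <- data.frame(
  age        = sample(18:64, n, replace = TRUE),
  experience = sample(0:45, n, replace = TRUE),
  wage       = round(rlnorm(n, meanlog = 2.1, sdlog = 0.5), 2)
)

# Hypothetical helper: cut a numeric vector into k roughly equal-sized
# groups, using its sample quantiles as the break points.
quantile_groups <- function(x, k) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = k + 1)))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

wages$age_grp  <- quantile_groups(wages$age, 5)         # five groups
wages$exp_grp  <- quantile_groups(wages$experience, 3)  # three groups
wages$wage_grp <- quantile_groups(wages$wage, 5)        # five groups

# Frequencies per group, in the spirit of Table 1.
print(table(wages$age_grp))
```

Because the breaks come from the data, group sizes are only roughly equal when values are tied, which is the usual price of quantile-based categories.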
It seems every article I write for DI extols the benefits of R’s unsexy graphics. Maybe R graphs are nails, and I have only a hammer. But for the simple visuals I start with when analyzing data – graphs that let me readily review detail, basic summaries, and relationships – R is rock solid. If, in addition, it’s easy to look at that detail, those summaries, and those relationships conditioned on the values of other attributes, so much the better. R’s lattice graphics package more than meets those needs, with programmable stripplots, dotplots, and xyplots as basic building blocks of exploratory statistical analysis.
Figure 1 presents a series of summary dotplots that display bivariate frequencies between each variable of the cps_wages data set and sex. The visual is actually an assembly of individual dotplots, each produced by a separate function call in R. The y axis of each chart lists the values of a categorical attribute in cps_wages. The x axes are frequency scales; the blue and red dots denote the group variable, sex (gender), and depict the number of male and female cases respectively. The dotplots serve much the same function as bar charts, but with a lot less outlay of ink. And with this efficiency comes the ability to display much more information on a page than many competing tools. As an attention-deficit analyst, I find the one-page rule quite meaningful.

Figure 1 - Dotplots of cps_wages attribute frequencies by sex.
The graphs of Figure 1 confirm the univariate frequencies of Table 1, adding the sex dimension to the mix. For charts where the order of the categories is unimportant, I’ve presented the categories in sorted order of frequencies. For the most part, there isn’t much difference in the relative frequency of gender within attribute categories – with several notable exceptions. Males seem more likely to be white, to be union members, and to have an “other” job, while females are more apt to have clerical and service occupations, and offer a tad more high-end experience.
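A single frequency dotplot of the kind assembled into Figure 1 might look like the sketch below. It is not the code behind the figure: the data are synthetic, and the column names and occupation labels are assumptions chosen to echo the text. It uses lattice's dotplot with a groups argument, with categories sorted by total frequency.

```r
library(lattice)

# Synthetic stand-in data; names and category labels are assumptions.
set.seed(1)
n <- 534
d <- data.frame(
  sex        = factor(sample(c("female", "male"), n, replace = TRUE)),
  occupation = factor(sample(c("management", "professional", "sales",
                               "clerical", "service", "other"),
                             n, replace = TRUE))
)

# Bivariate frequencies, one row per occupation/sex combination.
freq <- as.data.frame(table(occupation = d$occupation, sex = d$sex))

# Sort the categories by total frequency across both sexes.
freq$occupation <- reorder(freq$occupation, freq$Freq, FUN = sum)

# Levels of sex are female, male, so red is female and blue is male.
p <- dotplot(occupation ~ Freq, data = freq, groups = sex,
             par.settings = simpleTheme(col = c("red", "blue"), pch = 16),
             auto.key = list(columns = 2),
             xlab = "Frequency", main = "Occupation by sex")
print(p)
```

Repeating the call for each attribute and arranging the resulting objects on one page would give a display along the lines of Figure 1.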

Figure 2 - xyplots and stripplots of cps_wages attributes against wages.
Figure 2 presents a series of xyplots and stripplots detailing the relationships of data set attributes with wages, showcasing at the same time R’s powerful graphics programming capabilities. For the numeric, atomic variables age, experience, and education, xyplots show the scatter with wages on the x axis to maintain consistency with the stripplots. Each of these charts also includes two superimposed regression fits: simple linear and local polynomial.
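One way to draw such a scatter with both fits superimposed is a small custom panel function, sketched here on synthetic data. The variable names, the simulated relationship, and the line colors are assumptions for illustration; panel.lmline and panel.loess are standard lattice panel functions, and loess is used here as one common local polynomial smoother, not necessarily the one used in Figure 2.

```r
library(lattice)

# Synthetic data; the education/wage relationship is invented for illustration.
set.seed(2)
n <- 534
education <- sample(6:18, n, replace = TRUE)
wage <- round(exp(0.9 + 0.09 * education + rnorm(n, sd = 0.45)), 2)
d <- data.frame(education, wage)

# Wages on the x axis, as in the article. Both fits therefore model the
# y variable (education) as a function of x (wage).
p <- xyplot(education ~ wage, data = d,
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)              # the raw scatter
              panel.lmline(x, y, col = "black")    # simple linear fit
              panel.loess(x, y, col = "red")       # local polynomial fit
            },
            xlab = "Wage ($/hour)", ylab = "Education (years)")
print(p)
```

Where the two lines diverge, the relationship is not well described by a straight line.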
The relationships of the categorical attributes with wages are represented by stripplots, with y axis categories presented in order of median wages. Here I use jittering, a technique that adds a small amount of random noise to the plotting variable so cases can be distinguished in the graph. As a result, this visual has a somewhat different look each time it’s run, though overall it consistently maintains proper order. In addition, the inter-quartile (middle 50%) range of wages for each category is denoted by purple dots; red dots show wages from 0-25% and 75-100% for each attribute category. The vertical lines highlight the inter-quartile range of wages for the top wage category of each attribute, showing how wages within that category contrast with the others. Significant separation suggests a relationship between the attribute and wages.
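The ingredients of such a stripplot (median ordering, jittering, and inter-quartile coloring) can be sketched as follows. The data are synthetic, the names occupation, wage, middle, and in_iqr are hypothetical, and the vertical reference lines are left out to keep the example short.

```r
library(lattice)

# Synthetic stand-in data; names and category labels are assumptions.
set.seed(3)
n <- 534
d <- data.frame(
  occupation = factor(sample(c("management", "professional", "sales",
                               "clerical", "service", "other"),
                             n, replace = TRUE)),
  wage       = round(rlnorm(n, meanlog = 2.1, sdlog = 0.5), 2)
)

# Order the categories by median wage.
d$occupation <- reorder(d$occupation, d$wage, FUN = median)

# Hypothetical helper: TRUE for wages inside their group's middle 50%.
in_iqr <- function(w) {
  q <- unname(quantile(w, c(0.25, 0.75)))
  w >= q[1] & w <= q[2]
}
# ave() applies the helper within each occupation and returns 0/1 values.
d$middle <- ave(d$wage, d$occupation, FUN = in_iqr) == 1

# Group levels are FALSE, TRUE: red for the outer quartiles, purple inside.
p <- stripplot(occupation ~ wage, data = d, groups = middle,
               jitter.data = TRUE,
               par.settings = simpleTheme(col = c("red", "purple"), pch = 16),
               auto.key = list(text = c("outer quartiles", "middle 50%"),
                               columns = 2),
               xlab = "Wage ($/hour)")
print(p)
```

Since the jitter is random, two runs without a fixed seed will place the points slightly differently while keeping the same category order.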
The xyplots for age and experience with wages show modest positive correlation, while the relationship of education with wages appears somewhat stronger. None is simply a straight line. The stripplot of occupation by wages clearly shows differences between the categories, with management at the top and service at the bottom. Similarly, the higher union wages are clearly evident in the union plot, as are the higher wages of males. Visually, the 50th percentile of wages for males appears near the 70th for females. Finally, the stripplot of wages by education categories corroborates the xyplot of education and wages mentioned earlier.

Figure 3 - Stripplots of individual wages by attribute category, colored by sex.
Figure 3 consists of a series of stripplots that display individual case wages for each category of every attribute, distinguishing male from female by color, using jittering as before. Alas, Figure 3 suggests a rather unflattering contrast in earnings between the sexes. For most categories of just about every attribute, the preponderance of blue dots farther to the right than red on the wage scale starkly confirms that males are more highly compensated than females. And the fact that frequencies of the attribute categories are relatively consistent by sex suggests that differences in wages might be due to gender, rather than other characteristics that could separate the sexes. Higher education levels do appear to mitigate gender differences in wages somewhat, as do union membership and a professional occupation. Note the outlier observation with wages > $40/hour – white, 18-26 years old, unmarried, 0-10 years experience, non-union, management, living in the non-South. I suspect a coding error.

Figure 4 - Stripplots of wages by education, experience, and sex, conditioned on experience (top) and on education (bottom).
Figure 4 shows the dimensional power of R’s lattice graphics. The top plot shows wages by education and gender, conditioned on experience level. Thus there’s a separate stripplot panel of wages by education and gender for each category of the conditioning or dimensional attribute, experience. Vertical and horizontal scales are identical for each of the three panels, enabling ready comparison across experience level groups. The lower chart shows essentially the same data (though jittering provides a touch of randomness), interchanging the conditioning and y axis attributes. Also under programming control is the layout of panels within each graph, set up in this case to ensure each chart is a single row and that both charts sit comfortably on a single page – thus satisfying the attention-deficit dictum.
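A rough sketch of this kind of conditioned display follows. It assumes synthetic data and invented category labels; the conditioning uses lattice's | formula operator, layout fixes each chart to a single row, and the split argument of print places both charts on one page.

```r
library(lattice)

# Synthetic stand-in data; names and category labels are assumptions.
set.seed(4)
n <- 534
edu_levels <- c("<HS", "HS", "some college", "college+")
exp_levels <- c("0-10", "11-25", "26+")
d <- data.frame(
  sex        = factor(sample(c("female", "male"), n, replace = TRUE)),
  education  = factor(sample(edu_levels, n, replace = TRUE),
                      levels = edu_levels),
  experience = factor(sample(exp_levels, n, replace = TRUE),
                      levels = exp_levels),
  wage       = round(rlnorm(n, meanlog = 2.1, sdlog = 0.5), 2)
)
theme <- simpleTheme(col = c("red", "blue"), pch = 16)

# Top chart: wages by education and sex, one panel per experience group.
p_top <- stripplot(education ~ wage | experience, data = d, groups = sex,
                   jitter.data = TRUE, layout = c(3, 1),
                   par.settings = theme, auto.key = list(columns = 2),
                   xlab = "Wage ($/hour)")

# Bottom chart: the conditioning and y axis attributes interchanged.
p_bottom <- stripplot(experience ~ wage | education, data = d, groups = sex,
                      jitter.data = TRUE, layout = c(4, 1),
                      par.settings = theme, xlab = "Wage ($/hour)")

# Stack both charts on one page: 1 column by 2 rows, origin at top left.
print(p_top, split = c(1, 1, 1, 2), more = TRUE)
print(p_bottom, split = c(1, 2, 1, 2))
```

By default lattice shares axis scales across the panels of a chart, which is what makes the cross-panel comparison straightforward.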

Figure 5 - Stripplots of wages by experience and sex, conditioned on residence and education.
Figure 5 adds yet another dimension to the visual, displaying panels of wages by experience and gender for each combination of conditioning attributes residence and education. This stripplot thus details individual case wages across four dimensions: residence, education, experience, and gender. And, of course, individual cross-panel comparisons can be easily made, even by an attention-challenged analyst.
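Adding a second conditioning attribute is a small change to the formula, as this sketch suggests. Again the data are synthetic and the labels (including the South / non-South split) are assumptions; the * operator in the conditioning part of the formula produces one panel per combination of education and residence.

```r
library(lattice)

# Synthetic stand-in data; names and category labels are assumptions.
set.seed(5)
n <- 534
edu_levels <- c("<HS", "HS", "some college", "college+")
exp_levels <- c("0-10", "11-25", "26+")
d <- data.frame(
  sex        = factor(sample(c("female", "male"), n, replace = TRUE)),
  residence  = factor(sample(c("non-South", "South"), n, replace = TRUE)),
  education  = factor(sample(edu_levels, n, replace = TRUE),
                      levels = edu_levels),
  experience = factor(sample(exp_levels, n, replace = TRUE),
                      levels = exp_levels),
  wage       = round(rlnorm(n, meanlog = 2.1, sdlog = 0.5), 2)
)

# Four education columns by two residence rows: eight panels in all.
p <- stripplot(experience ~ wage | education * residence, data = d,
               groups = sex, jitter.data = TRUE, layout = c(4, 2),
               par.settings = simpleTheme(col = c("red", "blue"), pch = 16),
               auto.key = list(columns = 2),
               xlab = "Wage ($/hour)")
print(p)
```

With 534 cases spread over eight panels and three experience groups, some cells get thin, so sparse panels deserve cautious reading.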
This column has presented illustrations of R’s fundamental building-block graphs – dotplots, stripplots, and xyplots – for use in exploratory data analysis. That R graphics are extensible with a powerful, object-oriented, statistical programming language is an added benefit. Subsequent columns will focus on the NewHavenResidential data set, additional variants of the plots already presented, and other graphics that showcase predictive models.
