I wasn’t quite sure how to proceed with this column. The initial R and BI article (http://dashboardinsight.com/articles/new-concepts-in-business-intelligence/et-data-analysis-r-the-first-series.aspx) introduced two data sets for consideration. The first, cps_wages, is small (534 cases) and focuses on determinants of wages in a sample from the 1985 population. The second, NewHavenResidential, is much larger (18,221 cases) and revolves around factors affecting assessed housing values in New Haven, CT in 2006.
At first, I thought I’d rotate between the data sets, illustrating exploratory graphics using each as appropriate. As I continued, however, I started to emphasize case-level rather than summary data – and 534 cases is a little easier to handle than 18,221. Regardless, both are good data sets and will get ample attention as the series progresses.
The analyses presented here look at cps_wages for correlates of wages, with special emphasis on differences between the sexes. Table 1 details frequencies for the data set attributes. I collapsed age, experience, and education into categories for this table rather than use the original numeric values, which would have made the table too lengthy. Though this aggregation is certainly helpful for visualization, most statisticians (me included) prefer atomic over categorized variables for predictive models, since information is lost when detailed data are grouped. Categorized variables, however, are often useful for initial review and for constructing dimension variables to facilitate visualization. I made sensible categories for education, but simply divided age into five roughly equal quantile groups and experience into three. Wages are presented in five quantile groups as well for summarization. Several variables, like race, sector, and union, have skewed distributions that likely diminish their usefulness for analyses.
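As a rough illustration of the quantile grouping described above, here is a minimal sketch in base R. The data frame is simulated stand-in data, and the column names (age, wage) and the helper quantile_bins are assumptions for illustration, not the actual cps_wages layout or the code used to build Table 1.

```r
# Simulated stand-in data; the real cps_wages columns may be named differently.
set.seed(1)
cps <- data.frame(
  age  = sample(18:64, 534, replace = TRUE),
  wage = round(rlnorm(534, meanlog = 2, sdlog = 0.5), 2)
)

# Hypothetical helper: cut a numeric vector into n roughly equal-sized groups.
# unique() guards against duplicated break points when the data have many ties.
quantile_bins <- function(x, n) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

cps$age_cat  <- quantile_bins(cps$age, 5)   # five age groups
cps$wage_cat <- quantile_bins(cps$wage, 5)  # five wage groups

# Univariate frequencies of the kind a summary table would report.
table(cps$age_cat)
table(cps$wage_cat)
```

The original numeric columns stay in the data frame, so the categorized versions can serve display purposes while the atomic values remain available for modeling.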
It seems every article I write for DI extols the benefits of R’s unsexy graphics. Maybe R graphs are nails, and I have only a hammer. But for the simple visuals I start with when analyzing data – graphs that let me readily review detail, basic summaries, and relationships – R is rock solid. If, in addition, it’s easy to view that detail, those summaries, and those relationships conditioned on the values of other attributes, so much the better. R’s lattice graphics package more than meets those needs, with programmable stripplots, dotplots, and xyplots as basic building blocks of exploratory statistical analysis.
Figure 1 presents a series of summary dotplots that graphically display bivariate frequencies between the variables of the cps_wages data set and sex. The visual is actually an assembly of individual dotplots, each of which is a separate function call in R. The y axis of each chart represents the values of a categorical attribute in cps_wages. The x axes represent frequency scales; the blue and red dots denote the group variable, sex, and show the number of male and female cases, respectively. The dotplots serve much the same function as bar charts, but with a lot less outlay of ink. And with this efficiency comes the ability to display much more information on a page than many competing tools. As an attention-deficit analyst, I find the one-page rule quite meaningful.
Figure 1. Summary dotplots of cps_wages attribute frequencies by sex.
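For readers who want to try a single panel of the kind assembled into Figure 1, the following is a minimal sketch using lattice’s dotplot. The data are simulated, and the column names and occupation labels are assumptions for illustration; this is not the code behind the published figure.

```r
library(lattice)

# Simulated stand-in data; real cps_wages values and column names may differ.
set.seed(1)
cps <- data.frame(
  sex = factor(sample(c("female", "male"), 534, replace = TRUE)),
  occupation = factor(sample(c("clerical", "service", "professional", "other"),
                             534, replace = TRUE))
)

# Bivariate frequencies: one row per occupation/sex combination.
freq <- as.data.frame(table(occupation = cps$occupation, sex = cps$sex))

# Categories on the y axis, counts on the x axis, one dot per sex.
# Factor levels sort alphabetically (female, male), so red maps to female
# and blue to male. Setting colors through par.settings keeps the legend
# produced by auto.key in sync with the plotted symbols.
p <- dotplot(occupation ~ Freq, data = freq, groups = sex,
             xlab = "Frequency",
             par.settings = list(superpose.symbol = list(pch = 16,
                                                         col = c("red", "blue"))),
             auto.key = list(columns = 2))
print(p)
```

Each attribute would get its own call like this one, and the resulting trellis objects can then be arranged on a single page.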
The graphs of Figure 1 confirm the univariate frequencies of Table 1, adding the sex dimension to the mix. For charts where the order of the categories is unimportant, I’ve sorted the categories by frequency. For the most part, there isn’t much difference in the relative frequency of the sexes within attribute categories – with several notable exceptions. Males seem more likely to be white, to be union members, and to have an “other” job, while females are more apt to have clerical and service occupations, and offer a tad more high-end experience.
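The two ideas in the paragraph above, ordering categories by frequency and comparing the relative frequency of the sexes within each category, can be sketched in base R as follows. The data are again simulated, with assumed column names and labels, so the numbers printed say nothing about the real cps_wages results.

```r
# Simulated stand-in data; real cps_wages values and column names may differ.
set.seed(1)
cps <- data.frame(
  sex = factor(sample(c("female", "male"), 534, replace = TRUE)),
  occupation = factor(sample(c("clerical", "service", "professional", "other"),
                             534, replace = TRUE))
)

tab <- table(occupation = cps$occupation, sex = cps$sex)

# Share of each sex within every occupation; each row sums to 1.
row_share <- prop.table(tab, margin = 1)

# Order the categories by total frequency, smallest first, so the most
# common category ends up at the top of a dotplot's y axis.
ordered_levels <- names(sort(rowSums(tab)))
cps$occupation <- factor(cps$occupation, levels = ordered_levels)

round(row_share[ordered_levels, ], 2)
```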