I’m a big fan of the Exploratory Data Analysis (EDA) methods popularized in the 1970s and 1980s by academics like John Tukey, Frederick Mosteller, and David Hoaglin. 1 In contrast to traditional top-down, confirmatory statistical methods whose probability models impose stringent (and often unmet) assumptions on data, EDA is more "bottom up": it works with data informally and graphically for discovery, with little in the way of preconceptions. Over the years, exploratory and confirmatory methods have come to provide important complementary approaches to data analysis.
Two major tenets of the EDA approach are resistance and revelation. Resistant methods focus on the main body of data and less on extremes or outliers. Resistant statistics are thus less influenced by a small number of aberrant observations. The sample median is a resistant statistic; the sample mean is not. Revelation is an emphasis on, even an obsession with, visual and graphical displays of data. With such a focus on flexible approaches for examining data, it’s not a stretch to categorize EDA as an influential predecessor to modern business intelligence.
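To make the idea of resistance concrete, here is a minimal sketch in Python (not the R used for the figures in this column). The numbers are invented for illustration and are not drawn from any data set discussed here. It swaps one ordinary observation for a wild one and compares how far the mean and the median move.

```python
from statistics import mean, median

# Hypothetical values with no unusual observations.
clean = [11.2, 11.5, 11.7, 11.9, 12.1]

# Same sample, but the largest value is replaced by one wild observation.
contaminated = clean[:-1] + [25.0]

mean_shift = mean(contaminated) - mean(clean)
median_shift = median(contaminated) - median(clean)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

A single aberrant value drags the mean well away from the body of the data, while the median stays where it was.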
One of the important graphics to come from the EDA world that is now popular with BI is the box-and-whiskers plot or, simply, the boxplot. The point of departure for the univariate boxplot is a set of ordered data that is used to summarize a distribution. The “box” of the boxplot contains the middle 50% (25%-75%) of ordered observations; its length is the so-called interquartile range. The upper and lower whiskers or hatches are drawn at the most extreme observations that lie within 1.5 times the interquartile range above the 75th and below the 25th percentiles respectively. Remaining observations outside the upper and lower whiskers are designated as outliers. This simple graphic can tell a lot about the distribution of an attribute.
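The construction just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the figures: the function name boxplot_stats and the sample values are hypothetical, and the quartiles use the "inclusive" interpolation rule from Python's statistics module. Plotting packages, R included, differ slightly in how they define quartiles, so results on small samples may not match exactly.

```python
from statistics import quantiles

def boxplot_stats(values, k=1.5):
    """Return the pieces of a Tukey-style boxplot for a list of numbers."""
    data = sorted(values)
    # Quartile convention is an assumption; other conventions exist.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = boxplot_stats(sample)
for name, value in summary.items():
    print(f"{name:>12}: {value}")
```

The multiplier k is exposed as a parameter because 1.5 is a convention rather than a law; a larger value flags fewer points as outliers.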

Figure 1a
Consider the 18,221-case New Haven Residential data set detailed in “Data Analysis and R (the first in a series)”, http://dashboardinsight.com/articles/new-concepts-in-business-intelligence/et-data-analysis-r-the-first-series. Figure 1a shows a basic R boxplot for totalCurrVal, the log of appraised housing value. The middle 50% of observations fall within the box, with the median designated by the dot. The whiskers connect the box to the two hatches, and the observations beyond the hatches, those more than 1.5 times the interquartile range outside the box, are plotted individually as outliers. It seems there are quite a few outliers in this data vector.

Figure 1b
An added value of the box-and-whiskers derives from having multiple plots representing different values of dimension variables contrasted on the same graph. The variations between plots are often more telling than the information within. Figure 1b details totalCurrVal by the log of livingArea, arranged in categories of roughly equal size. It’s clear that totalCurrVal has a strong positive relationship with livingArea. The vertical red line denoting the overall totalCurrVal median helps reveal that boxplots shift to the right as the livingArea quartiles progress from Q1 to Q4. Most outliers are either to the left of the smallest livingArea category or to the right of the largest.
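The grouping behind a display like Figure 1b can be sketched as follows. This is Python rather than the R used for the figure, the (area, value) pairs are invented and are not the New Haven data, and quartile_label is a hypothetical helper. It cuts one variable into four roughly equal categories and compares each category's median of the other variable with the overall median.

```python
from statistics import median, quantiles

def quartile_label(x, cuts):
    """Map x to 'Q1'..'Q4' given three ascending cut points."""
    for i, cut in enumerate(cuts, start=1):
        if x <= cut:
            return f"Q{i}"
    return "Q4"

# Hypothetical (log livingArea, log totalCurrVal) pairs.
homes = [(6.6, 11.0), (6.8, 11.3), (6.9, 11.2), (7.0, 11.6),
         (7.1, 11.5), (7.2, 11.8), (7.3, 11.9), (7.4, 12.0),
         (7.6, 12.2), (7.7, 12.1), (7.9, 12.6), (8.1, 12.9)]

# Cut points that split the areas into four roughly equal categories.
cuts = quantiles([area for area, _ in homes], n=4, method="inclusive")

groups = {}
for area, value in homes:
    groups.setdefault(quartile_label(area, cuts), []).append(value)

overall = median(value for _, value in homes)
group_medians = {q: median(v) for q, v in sorted(groups.items())}

print(f"overall median: {overall:.2f}")
for q, m in group_medians.items():
    side = "above" if m > overall else "below"
    print(f"{q}: median {m:.2f} ({side} overall)")
```

Medians that climb steadily from Q1 to Q4 are the numeric counterpart of boxes that shift to the right across categories.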

Figure 2a
Over the years, a number of variants of the original box-and-whiskers have emerged, generally to mute criticism that boxplots are difficult for non-stats types to interpret. Figure 2a illustrates my personal favorite, the box-percentile plot – or the box without whiskers and outliers. For each category of livingArea, the plots detail the distribution of totalCurrVal as follows: the central, thickest box holds the middle 50% of observations; the next box holds the middle 75%; the third encompassing box comprises the middle 95% of cases; and, of course, the outside lines show all 100% of observations. The vertical line in the center box represents the median, the dot the mean. I picked the middle 50%, 75%, 95%, and 100% as parameters for this graph, but could just as easily have chosen others.
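The nested boxes of such a plot are just central intervals of increasing coverage. The sketch below, in Python with a hypothetical function name and made-up data, computes the endpoints for the 50%, 75%, 95%, and 100% choices described above. It assumes the "inclusive" percentile rule from Python's statistics module and whole-number coverages; it is not the code that produced Figure 2a.

```python
from statistics import quantiles

def central_intervals(values, coverages=(50, 75, 95, 100)):
    """Return {coverage: (low, high)} for nested central intervals."""
    data = sorted(values)
    # 199 cut points in 0.5% steps: pct[i] is the (i + 1) * 0.5 percentile.
    pct = quantiles(data, n=200, method="inclusive")
    out = {}
    for c in coverages:
        if c == 100:
            out[c] = (data[0], data[-1])      # full range of the data
        else:
            tail = (100 - c) / 2              # e.g. 2.5 for the middle 95%
            low = pct[round(tail * 2) - 1]
            high = pct[round((100 - tail) * 2) - 1]
            out[c] = (low, high)
    return out

# Hypothetical data: the integers 0..200, so percentiles are easy to read.
intervals = central_intervals(range(201))
for coverage, (low, high) in intervals.items():
    print(f"middle {coverage:3d}%: {low} to {high}")
```

Passing a different coverages tuple changes which boxes are drawn, which is the flexibility noted in the paragraph above.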

Figure 2b
Figure 2b extends 2a, adding a dimension variable, zone, to the analysis. Each panel of the graph shows a separate relationship between category of livingArea and distribution of totalCurrVal. The panels are ordered from left to right by median totalCurrVal. Figure 2b confirms the relationship of livingArea to totalCurrVal, suggesting as well the influence of zone.

Figure 3a
Figure 3a introduces a second boxplot variant, the violin plot, named for the obvious similarity with the musical instrument. The beauty of the violin plot is that it combines features of both boxplots and density plots, the latter detailing a more elaborate view of attribute distributions.

Figure 3b
The outliers of Figure 1b, for example, are shown in the negative skewness for livingArea Q1 and positive skewness of Q4. The bi-modality for Q1 is apparent also, as is the “humpiness” of Q2 and Q3. Figure 3b adds the zone dimension to the graph, introducing many interesting density shapes. Those shapes, across zones for livingAreas Q1 and Q4 especially, are anything but “normal”.
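The outline of a violin is a density estimate mirrored about an axis. As a rough illustration of how such an estimate reveals bi-modality, here is a small Gaussian kernel density sketch in Python. The sample, the bandwidth of 0.2, and the function name are all assumptions made for the example; the figures in this column were drawn in R and may use a different kernel and bandwidth rule.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values` at a point."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, then normalize.
        total = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        return total / norm

    return density

# Hypothetical bimodal sample: two clusters of values.
sample = [11.0, 11.1, 11.2, 11.1, 11.0, 12.4, 12.5, 12.6, 12.5, 12.4]
density = gaussian_kde(sample, bandwidth=0.2)

# Text rendering of one side of the violin: bar length tracks the density.
grid = [10.5 + 0.1 * i for i in range(26)]
for x in grid:
    print(f"{x:5.1f} " + "#" * round(40 * density(x)))
```

The printed bars swell around each cluster and pinch in between, the same two-humped shape a violin plot would show for such data.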
Those who’ve followed so far might have noticed a common weakness for all genera of boxplots presented to this point: they consume a lot of space, making it difficult to show multiple dimensions on a single page. It would be nice to have a simpler box-and-whiskers, a “lite” version that would satisfy the no-wasted-ink mantra of visual expert Edward Tufte. With the programmable features of R’s lattice graphics, and the assistance of an excellent new text by lattice architect Deepayan Sarkar, there just might be an answer. 2
One of the beauties of R’s graphics is the capacity to modify and extend the basic plotting functions – scatter, dot, strip, boxplot, etc. – while maintaining the overall trellis dimensional functionality. And R provides access to low-level functions like points, lines, grids, and maps to support programmer initiatives to “roll their own”. So, borrowing liberally from the ideas and code snippets of Deepayan, I put together version 1 of boxplot lite.

Figure 4a
Figure 4a depicts “groomed box-and-whiskers”, with totalCurrVal by livingArea category dimensioned by number of bathrooms. The graph details the totalCurrVal distributions vertically, in contrast to those that preceded, potentially freeing space for subsequent use. Each individual “boxplot” is simply a line connecting points denoting the median in red, the 25th and 75th percentiles in green, and, for this case, the 2.5th and 97.5th percentiles in blue. The max and min are, of course, indicated by the ends of the lines. The horizontal red line depicts the overall median, and provides an anchor to assist in interpretation. The panels are ordered by median totalCurrVal for each bathroom category, with, in this instance, the 1.5-bathroom category ranking higher than the 2-bathroom category.
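The numbers behind a display like this are modest: seven marker positions per line, plus a rule for ordering the panels. The following Python sketch shows both. It is not the lattice code used for Figure 4a; lite_summary, the panel values, and the percentile convention are assumptions chosen for illustration.

```python
from statistics import median, quantiles

def lite_summary(values):
    """Marker positions for one 'boxplot lite' line."""
    data = sorted(values)
    # 199 cut points in 0.5% steps (percentile convention is an assumption).
    pct = quantiles(data, n=200, method="inclusive")
    return {
        "min": data[0], "p2.5": pct[4], "p25": pct[49], "median": pct[99],
        "p75": pct[149], "p97.5": pct[194], "max": data[-1],
    }

# Hypothetical log values keyed by number of bathrooms.
panels = {
    "1":   [11.0, 11.2, 11.3, 11.5, 11.6],
    "1.5": [11.9, 12.0, 12.2, 12.3, 12.5],
    "2":   [11.6, 11.8, 11.9, 12.1, 12.2],
}

# Order panels by their median rather than by their label.
order = sorted(panels, key=lambda k: median(panels[k]))

# Overall median, used as the reference line across panels.
overall = median(v for values in panels.values() for v in values)

print("panel order:", order, "| overall median:", overall)
for key in order:
    print(key, lite_summary(panels[key]))
```

With these made-up values the 1.5-bathroom panel sorts after the 2-bathroom panel, the kind of ordering by median rather than by label that the paragraph above describes.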

Figure 4b
Finally, Figure 4b adds another dimension, zone, to the mix, with an end result of totalCurrVal by livingArea category, dimensioned by zone and number of bathrooms. Note that much of zone RS lies above the overall median, while much of RM falls below. The graph seems to use space efficiently.
This column has focused on the boxplot and several of its variations for exploring data. Boxplots should become a part of the BI analyst’s arsenal for summarizing the distribution of important variables by categories and dimensions of other attributes. Follow-up articles will continue the investigation of sophisticated graphics for BI in R. The next column will highlight 3-dimensional graphs as well as plots of multidimensional frequencies.
1 Hoaglin, D., Mosteller, F., Tukey, J. W. (Eds.). (2000). Understanding robust and exploratory data analysis. New York: John Wiley & Sons Inc.
2 Sarkar, D. (2008). Multivariate data visualization with R. New York: Springer.
About the Author:
Steve Miller is President of OpenBI, LLC, a Chicago-based services firm focused on delivering business intelligence solutions with open source software. A statistician/quantitative analyst by education, Steve has 30 years of BI experience. His charter – and OpenBI's – is to help customers manage performance through optimal deployment of analytics. Steve is a columnist for DMReview and also writes for BIReview and the B-Eye-Network. In addition to R, OpenBI specializes in the Pentaho and JasperSoft open source BI platforms and Weka data mining. Steve can be reached at steve.miller@openbi.com.