I’m a big fan of the Exploratory Data Analysis (EDA) methods popularized in the 1970s and 1980s by academics like John Tukey, Frederick Mosteller, and David Hoaglin. 1 In contrast to traditional top-down, confirmatory statistical methods, whose probability models impose stringent (and often unmet) assumptions on data, EDA is more "bottom up": working with data informally and graphically for discovery, with few preconceptions. Over the years, exploratory and confirmatory methods have come to provide important complementary approaches to data analysis.
Two major tenets of the EDA approach are resistance and revelation. Resistant methods focus on the main body of the data rather than on extremes or outliers, so resistant statistics are little influenced by a small number of aberrant observations. The sample median is a resistant statistic; the sample mean is not. Revelation is an emphasis on (an obsession with, really) visual and graphical displays of data. With such a focus on flexible approaches for examining data, it’s not a stretch to categorize EDA as an influential predecessor to modern business intelligence.
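To make resistance concrete, here is a minimal Python sketch (Python is used only for illustration, and the numbers are made up) that swaps one value in a small sample for a wild one and measures how far the mean and the median move.

```python
import statistics

# Made-up illustrative samples: the second replaces one value with a wild one.
clean = [10, 11, 12, 13, 14]
contaminated = [10, 11, 12, 13, 1400]

# How far does each summary move when a single observation goes astray?
mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
median_shift = statistics.median(contaminated) - statistics.median(clean)

print("shift in mean:  ", mean_shift)
print("shift in median:", median_shift)
```

The mean is dragged far from the bulk of the data by the single bad value, while the median does not move at all.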
One of the important graphics to come from the EDA world that is now popular in BI is the box-and-whiskers plot or, simply, the boxplot. The point of departure for the univariate boxplot is a set of ordered data that is used to summarize a distribution. The “box” spans the middle 50% of the ordered observations, from the 25th to the 75th percentile; its length is the interquartile range. The upper and lower whiskers, or hatches, are drawn at the most extreme observations that lie within 1.5 times the interquartile range above the 75th and below the 25th percentiles, respectively. Observations beyond the whiskers are designated as outliers. This simple graphic can tell a lot about the distribution of an attribute.
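The quantities a boxplot draws can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the routine any particular plotting package uses: the function name `boxplot_stats` is hypothetical, the sample data is made up, and the quartiles come from the standard library's "inclusive" method (other quartile conventions give slightly different box edges).

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Return the box, whisker, and outlier values for a univariate boxplot.

    Hypothetical helper: uses Tukey's k * IQR rule with k = 1.5 by default.
    """
    data = sorted(values)
    # Quartiles by the "inclusive" method; other conventions differ slightly.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme observations still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": inside[0],
        "upper_whisker": inside[-1],
        "outliers": outliers,
    }

# Made-up sample with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

In this sample the value 50 lies beyond the upper fence, so it is reported as an outlier and the upper whisker stops at 9.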
Figure 1a
Consider the 18,221-case New Haven Residential data set detailed in “Data Analysis and R (the first in a series)”, http://dashboardinsight.com/articles/new-concepts-in-business-intelligence/et-data-analysis-r-the-first-series. Figure 1a shows a basic R boxplot for totalCurrVal, the log of appraised housing value. The middle 50% of observations fall within the box, with the median designated by the dot. Whiskers connect the box to the two hatches, and the extremes lying beyond the hatches, more than 1.5 times the interquartile range from the box, are plotted individually. There appear to be quite a few outliers in this data vector.
Figure 1b
An added value of the box-and-whiskers plot derives from placing multiple plots, one for each value of a dimension variable, on the same graph. The variations between plots are often more telling than the information within any one of them. Figure 1b details totalCurrVal by the log of livingArea, grouped into four categories of roughly equal size. It’s clear that totalCurrVal has a strong positive relationship with livingArea. The vertical red line denoting the overall totalCurrVal median helps reveal that the boxplots shift to the right as the livingArea quartiles progress from Q1 to Q4. Most outliers are either to the left of the smallest livingArea category or to the right of the largest.
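The grouping behind a conditioned boxplot like this can be sketched in Python. This is an illustration only, not the code behind Figure 1b: the function name `quartile_groups` is hypothetical and the eight paired values are made up, standing in for the logs of livingArea and totalCurrVal.

```python
import statistics
from bisect import bisect_left

def quartile_groups(x, y):
    """Split y into four groups, Q1 to Q4, by the quartile of the paired x."""
    # Three cut points divide x into four bins of roughly equal size.
    cuts = statistics.quantiles(x, n=4, method="inclusive")
    groups = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for xi, yi in zip(x, y):
        # bisect_left counts the cut points strictly below xi (0 to 3).
        groups[f"Q{bisect_left(cuts, xi) + 1}"].append(yi)
    return groups

# Made-up stand-ins for log(livingArea) and log(totalCurrVal).
log_living_area = [6.5, 6.8, 7.0, 7.1, 7.3, 7.4, 7.6, 7.9]
log_total_value = [11.2, 11.5, 11.6, 11.9, 12.0, 12.1, 12.4, 12.9]

groups = quartile_groups(log_living_area, log_total_value)
group_medians = {name: statistics.median(vals) for name, vals in groups.items()}
overall_median = statistics.median(log_total_value)

for name, med in group_medians.items():
    side = "above" if med > overall_median else "below"
    print(name, med, side, "the overall median")
```

Each group would get its own box in the plot. Comparing the group medians with the overall median plays the same role as the red reference line: with a positive relationship, the medians climb from Q1 to Q4.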