When I started the R and BI series five months ago, I chose two datasets available on the web to showcase graphical and statistical techniques in R: http://www.dashboardinsight.com/articles/new-concepts-in-business-intelligence/et-data-analysis-r-the-first-series.aspx. The second of these, NewHavenResidential, consists of 18,221 records with 8 attributes pertinent to the residential property value market for 2006 in New Haven, CT.
The first time I loaded NewHavenResidential, I composed a scatterplot in R of totalCurrVal (the 2006 assessed value) against livingArea, expecting a strong positive relationship between the two. What I found instead was a more complicated pattern that was difficult to visualize (Figure 1).
It turns out this type of problem, caused by right skewness in the data, is all too common in statistics and BI. Nassim Taleb, author of The Black Swan, distinguishes the data of Mediocristan, which is well-behaved and seems normally distributed, from that of the winner-take-all Extremistan, which exhibits extensive skewness in values and progresses in jumps [1]. Population height and IQ are examples of Mediocristan; economic variables like wealth and housing prices behave more like Extremistan. It appears totalCurrVal and livingArea are more the latter than the former. A challenge for BI is to develop methods and graphics that work equally well with both types of data.
Statisticians and research analysts often treat their miscreant data via transformation, using power functions, reciprocals and logarithms to create new variables that are “better” behaved. Called re-expression by Exploratory Data Analysis (EDA) devotees, these transformations can often make a big difference in analysis. EDA authors Emerson and Stoto discuss several reasons for transforming data, including facilitating interpretation, promoting symmetry in distribution, providing stable dispersion within a dimension, and promoting a straight line relationship between two variables [2]. The authors opine that more than one of these objectives might be in play in a given situation. They also note that a single transformation might have the serendipitous effect of serving several objectives at once, such as stable dispersion and straight line relationships.
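One of the objectives noted above, stable dispersion, can be illustrated with a small simulation. The R sketch below is only an illustration under stated assumptions: the variables area and value are simulated stand-ins (not the NewHavenResidential data), the power-law form and the noise level are arbitrary choices, and spread_ratio is a hypothetical helper defined here for the comparison.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Simulated stand-ins for a size measure and a value measure.
# These are NOT the New Haven data; the parameters are arbitrary assumptions.
n     <- 2000
area  <- rlnorm(n, meanlog = 7.3, sdlog = 0.5)
value <- 300 * area^0.9 * rlnorm(n, meanlog = 0, sdlog = 0.4)  # multiplicative noise

# Hypothetical helper: fit a straight line, then compare the spread of the
# residuals for the larger half of x against the smaller half.
# A ratio near 1 suggests stable dispersion across the range of x.
spread_ratio <- function(x, y) {
  res <- resid(lm(y ~ x))
  big <- x > median(x)
  sd(res[big]) / sd(res[!big])
}

ratio_raw <- spread_ratio(area, value)                 # original scale
ratio_log <- spread_ratio(log10(area), log10(value))   # after re-expression

print(c(raw = ratio_raw, log10 = ratio_log))
```

In this simulated setting the residual spread fans out as area grows on the original scale, while it stays roughly constant after both variables are re-expressed as base 10 logarithms. Real data will rarely be this tidy, so the ratio is best read as a rough diagnostic rather than a test.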
In their classic 1977 book, Data Analysis and Regression: A Second Course in Statistics, EDA authors Mosteller and Tukey articulate a strategy for transforming variables based on data type [3]. Of the types they note, the most common for BI analysts are counts/positive amounts and counted fractions. Counts and positive amounts are bounded by 0 on the left and unbounded at the high end, and are thus prone to right skewness or long tails. Examples include income, spending and customer purchase history by socio-economic status. For attributes that behave more like Extremistan than Mediocristan, the authors propose a family of transformations known as the “ladder of powers” to make the data more tractable for visualization and analytics. The ladder of powers includes exponents greater than 1, such as square or cube; positive powers less than 1, such as square root or cube root; reciprocal powers like 1/x, 1/square root, or 1/square; and the special case of logarithms. The rungs below 1 (roots, logarithms and reciprocals) work on right-skewed data by compressing extreme values more than non-extreme ones, reducing dispersion and often producing a more symmetric distribution, thus making the transformed attribute more “uniform” overall. If most family incomes are in the range of $40K-$80K, but several are $1M+, a base 10 logarithm transformation might make the data appear much more orderly. The base 10 log of $50,000 is about 4.7; the comparable value for $1,000,000 is 6. The transformed outliers are thus pulled much closer to the bulk of the data: a 20-fold gap in dollars becomes a difference of about 1.3 on the log scale. BI analysts would choose the “correcting” transformation after an initial look at the data.
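The income example and a few rungs of the ladder of powers can be tried directly in R. The sketch below is a minimal illustration, not the authors' procedure: the income vector is made up to match the description in the text, the ladder list and the moment-based skewness helper are names introduced here, and the reciprocal is negated (an assumption of this sketch) so that larger incomes still map to larger values.

```r
# Hypothetical family incomes: mostly $40K-$80K, plus two $1M+ outliers.
income <- c(42000, 50000, 55000, 61000, 68000, 75000, 80000, 1000000, 1500000)

# Base 10 logarithms of the two values discussed in the text.
log10(50000)    # about 4.7
log10(1000000)  # 6

# A few rungs of the ladder of powers, written as functions of x.
ladder <- list(
  square      = function(x) x^2,
  raw         = function(x) x,
  square_root = function(x) sqrt(x),
  log10       = function(x) log10(x),
  reciprocal  = function(x) -1 / x  # negated so the ordering of values is preserved
)

# Moment-based sample skewness; 0 indicates a symmetric distribution,
# positive values indicate a long right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Skewness of the income data after each re-expression.
ladder_skew <- sapply(ladder, function(f) skewness(f(income)))
print(round(ladder_skew, 2))
```

Comparing the skewness values across rungs is one rough way to take the “initial look at the data” before settling on a transformation; with only nine made-up observations the numbers are indicative at best, and a histogram or boxplot of each re-expressed variable would normally accompany them.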