Those who’ve faithfully (foolishly?) followed the first four articles in the R and BI series will probably have noticed an emphasis on exploratory graphics that purport to tell a preliminary story on variable distributions and relationships in both the CPS Wages and New Haven Residential data sets. An anticipated result of those analyses would, of course, be suggestions for statistical or machine learning models that could neatly summarize relationships of importance suggested by the graphics. This column focuses on several, simple predictive models for variables in the New Haven Residential data set, along with graphics that show how well the models performed with the data.
There are at least two legacies popular in the predictive modeling world today. Traditional statistical models, such as multiple regression, logistic regression, and the linear discriminant function, come from a heritage of well-specified models with strong underlying assumptions about the behavior of data. In contrast, machine learning models have derived from computer and mathematical sciences with less emphasis on assumptions and models and more on optimization of algorithms for large data sets. Classification and Regression Trees (CART), neural nets, and AdaBoost are examples of machine learning models. I sometimes find it useful to think of statistical models as white box and machine learning as black box. Both can do a good job of predicting the outcomes of new events, though statistical models are sometimes easier to interpret and better at testing specific hypotheses, while machine learning algorithms are often optimal for large, ill-specified data challenges. The statistical and machine learning traditions are starting to merge, much to the benefit of predictive business intelligence.
Statisticians and machine learners generally distinguish supervised from unsupervised learning. A supervised model has a designated dependent variable that is to be predicted from a series of independent attributes, while an unsupervised model searches for clusters, associations or patterns within a group of equal input attributes. For supervised models, classification is a problem for which the dependent attribute is a series of categories, such as “Churn” or “No Churn”, rather than a continuous numeric variable. Regression models from either statistics or machine learning, on the other hand, are focused on continuous dependent attributes such as salary or housing appraisal. Predictive modelers generally subset their supervised learning data randomly into a training component, which is used to develop models, and testing data that calibrates the fit of the models to “new” observations.
The remainder of this column considers two predictive problems involving the New Haven Residential data set. The first attempts to forecast the categorical variable zone from attributes loglivingArea, logsize, dep, acType, bedrms, and bathrms. The second predicts the continuous attribute logtotalCurrVal from loglivingArea, logsize, dep, zone, acType, bedrms, and bathrms. Models are developed on the training data set, which consists of a two thirds random sample of the 18221 record New Haven Residential data, and evaluated on the test data set, which consists of the remaining one third of observations.
For classification of zone, I chose two predictive procedures, one statistical and one machine learning, and ran the models in R. The linear discriminant function is a popular multivariate statistical classification technique for categorical dependent variables; the random forest machine learning algorithm can be used for both classification and regression models. To evaluate the performance of classification models, a good starting point is the matrix of actual categories versus those predicted by the model: the higher percentage of numbers along the diagonal of actual versus predicted, the closer the correspondence of the model to the data. A proportion of actual-predicted matches close to 1 is an important measure of the quality of the model.
To show classification model performance graphically, a reasonable choice is the mosaic plot, a graph that contrasts frequencies within categorical variables. Figure 1, an R community favorite, details Titanic fatalities by cabin class, age group, and sex. Black rectangles indicate fatalities, with rectangle size connoting relative frequencies. The data tell a fascinating story of increasing fatalities down cabin class, with women and children the seeming priority for rescue. Children in cabin classes 1 and 2 appear to have been especially fortunate.
Figure 1 - Click on image to see full size version