Figure 2 contrasts the predictive performance of the linear discriminant function with that of random forests for both the training and test data sets. On the training data, the discriminant function correctly classified zone in 73% of New Haven Residential cases, versus 91% for random forests. The differences show up in the larger black Other-Other rectangles, the larger mid-gray RM-RM rectangles, and the larger light-gray RS-RS rectangles in the random forest training mosaic. The same pattern holds for the test data, though there is some shrinkage in random forest performance from training to test, from 91% to 82% correspondence.
Figure 2
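The comparison summarized in Figure 2 can be sketched in R along the following lines. The New Haven Residential file isn't reproduced here, so as an assumption the built-in iris data and its three-level Species factor stand in for the residential cases and the zone variable; lda() comes from the MASS package and randomForest() from the randomForest package.

```r
library(MASS)          # lda()
library(randomForest)  # randomForest()

set.seed(42)
train <- sample(nrow(iris), 0.7 * nrow(iris))  # 105 training cases

# linear discriminant function on the training cases
ld.fit   <- lda(Species ~ ., data = iris[train, ])
ld.train <- predict(ld.fit)$class                   # training predictions
ld.test  <- predict(ld.fit, iris[-train, ])$class   # test predictions

# random forest on the same training cases
rf.fit  <- randomForest(Species ~ ., data = iris[train, ])
rf.test <- predict(rf.fit, iris[-train, ])

# percent correctly classified on the held-out test cases
mean(ld.test == iris$Species[-train])
mean(rf.test == iris$Species[-train])

# mosaic of actual versus predicted, analogous to Figure 2
mosaicplot(table(iris$Species[-train], rf.test),
           xlab = "actual", ylab = "predicted", main = "random forest, test")
```

The table of actual versus predicted classes drives both the percent-correct figures and the mosaic display; larger rectangles on the diagonal indicate better classification.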
For continuous dependent variables, one measure of goodness of fit is the scatter of actual versus predicted values: the closer the predictions are to the actuals, the better the fit. Figure 3 contrasts the performance of a traditional linear regression model with random forests, depicted in an R lattice graph. Note that the regression model includes only linear terms, assuming no curvature in the relationships between logtotalCurrVal and the independent variables – not necessarily a reasonable assumption. Similarly, the regression model considers no interactions among the independent variables. As in the classification example, random forests appear more accurate with the training data, and in fact more accurate with the test data as well, though again with some shrinkage. The numbers in the top left of each panel are the mean absolute differences between actual and predicted values; the lower figures for random forests confirm the tighter graphical fit.
Figure 3
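A minimal version of the Figure 3 comparison might look like the following sketch, with the built-in mtcars data and mpg standing in for the assessment file and logtotalCurrVal (both stand-ins are assumptions, since the article's data isn't available here).

```r
library(randomForest)  # randomForest()
library(lattice)       # xyplot()

set.seed(42)
train <- sample(nrow(mtcars), 22)  # 22 training cases, 10 test cases

# traditional linear regression: linear terms only, no interactions
lm.fit <- lm(mpg ~ wt + hp, data = mtcars[train, ])
rf.fit <- randomForest(mpg ~ wt + hp, data = mtcars[train, ])

lm.pred <- predict(lm.fit, mtcars[-train, ])
rf.pred <- predict(rf.fit, mtcars[-train, ])

# mean absolute differences between actual and predicted, as in Figure 3
mad.lm <- mean(abs(mtcars$mpg[-train] - lm.pred))
mad.rf <- mean(abs(mtcars$mpg[-train] - rf.pred))

# one lattice panel of actual versus predicted per model,
# with a 45-degree reference line marking a perfect fit
scores <- data.frame(
  actual    = rep(mtcars$mpg[-train], 2),
  predicted = c(lm.pred, rf.pred),
  model     = rep(c("linear regression", "random forest"), each = 10)
)
xyplot(actual ~ predicted | model, data = scores,
       panel = function(...) { panel.xyplot(...); panel.abline(0, 1) })
```

The tighter the points hug the 45-degree line in a panel, the better that model's fit; the mean absolute differences summarize the same comparison numerically.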
The availability of a multiplicity of powerful, readily accessible predictive models in R, along with elegant graphics to convey goodness-of-fit performance, is a boon for BI analysts looking to advance their intelligence to the highest analytical levels. Subsequent columns will focus on additional predictive models, as well as techniques for incorporating such models into broader BI graphical applications.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. 2001.
- Paul Murrell. R Graphics. Chapman & Hall/CRC. 2005.
- Deepayan Sarkar. Lattice: Multivariate Data Visualization with R. Springer. 2008.
About the Author
Steve Miller is President of OpenBI, LLC, a Chicago-based services firm focused on delivering business intelligence solutions with open source software. A statistician/quantitative analyst by education, Steve has 30 years of BI experience. His charter – and OpenBI's – is to help customers manage performance through optimal deployment of analytics. Steve is a columnist for DMReview and also writes for BIReview and the B-Eye-Network. In addition to R, OpenBI specializes in the Pentaho and JasperSoft open source BI platforms and Weka data mining. Steve can be reached at firstname.lastname@example.org.