
R and BI (fifth in a series)
Predictive Modeling

by Steve Miller, President, OpenBI
Monday, July 28, 2008

Figure 2 contrasts the predictive performance of the linear discriminant function with that of random forests for both training and test data sets. With training data, the discriminant function correctly classified zone in 73% of New Haven Residential cases, versus 91% for random forests. The improvement shows up in the random forest training mosaic as larger black Other-Other rectangles, larger mid-gray RM-RM rectangles, and larger light-gray RS-RS rectangles. The same pattern holds for the test data, though the random forest predictions show some shrinkage from training to test, from 91% to 82% correct.

Figure 2 - Predictive performance of the linear discriminant function
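
As a minimal sketch of how a percent-correct figure like those quoted for Figure 2 can be derived, the base R snippet below cross-tabulates actual against predicted zone and takes the share of cases on the diagonal. The label vectors are hypothetical toy values, not the New Haven data, and the helper name percent_correct is an assumption made for illustration.

```r
# Hypothetical toy labels; these are NOT the New Haven data from the article.
zones     <- c("Other", "RM", "RS")
actual    <- factor(c("RS", "RS", "RM", "Other", "RS", "RM", "Other", "RS"),
                    levels = zones)
predicted <- factor(c("RS", "RM", "RM", "Other", "RS", "RS", "Other", "RS"),
                    levels = zones)

# Cross-tabulate actual against predicted zone; these counts are what a
# mosaic display of classification results is drawn from.
confusion <- table(actual = actual, predicted = predicted)

# Share of cases on the diagonal, i.e. predicted zone equals actual zone.
# (percent_correct is an illustrative helper, not a function from a package.)
percent_correct <- function(tab) sum(diag(tab)) / sum(tab)

accuracy <- percent_correct(confusion)
print(confusion)
print(accuracy)
```

Passing the same table to base R's mosaicplot(confusion) would draw a simple mosaic of the results; the shading and layout of the article's figure are not reproduced here.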

For continuous dependent variables, one measure of goodness of fit is the scatter of actual versus predicted values: the closer the predicted values lie to the actual ones, the better the fit. Figure 3 uses an R lattice graph to contrast the performance of a traditional linear regression model with that of random forests. Note that the regression model includes only linear terms, assuming no curvature in the relationships between logtotalCurrVal and the independent variables, which is not necessarily a reasonable assumption. Similarly, the regression model considers no interactions among the independent variables. As in the classification example, random forests appear more accurate with the training data and with the test data as well, though with some shrinkage. The number in the top left of each panel is the mean absolute difference between actual and predicted values for that panel. The lower figures for random forests confirm the visibly tighter fit.

Figure 3 - R lattice graph
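
The mean absolute difference shown in each panel can be computed in a line of base R. The sketch below is illustrative only: it assumes the built-in mtcars data and an arbitrary train/test split rather than the article's data, fits a linear-terms-only regression with lm, and compares the statistic on training and test rows. The helper name mean_abs_diff and the choice of predictors are assumptions.

```r
# Mean absolute difference between actual and predicted values.
# (mean_abs_diff is an illustrative helper, not a function from a package.)
mean_abs_diff <- function(actual, predicted) mean(abs(actual - predicted))

# Illustrative split of the built-in mtcars data (NOT the article's data).
set.seed(1)
train_rows <- sample(nrow(mtcars), 22)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Linear terms only and no interactions, mirroring the kind of
# regression model described in the text.
fit <- lm(log(mpg) ~ wt + hp, data = train)

# Compare fit on the rows used for estimation with fit on held-out rows.
mad_train <- mean_abs_diff(log(train$mpg), predict(fit, newdata = train))
mad_test  <- mean_abs_diff(log(test$mpg),  predict(fit, newdata = test))
print(c(train = mad_train, test = mad_test))
```

Applying the same helper to predictions from any other model gives directly comparable numbers, which is how the per-panel figures support a side-by-side comparison.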

The availability of many powerful, readily accessible predictive models in R, along with elegant graphics to convey goodness of fit, is a boon for BI analysts looking to take intelligence to the highest analytical levels. Subsequent columns will focus on additional predictive models as well as techniques to incorporate such models into broader BI graphical applications.

References:

  1. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. 2001.
  2. Paul Murrell. R Graphics. Chapman & Hall/CRC. 2005.
  3. Deepayan Sarkar. Lattice: Multivariate Data Visualization with R. Springer. 2008.

About the Author

Steve Miller is President of OpenBI, LLC, a Chicago-based services firm focused on delivering business intelligence solutions with open source software. A statistician/quantitative analyst by education, Steve has 30 years of BI experience. His charter – and OpenBI's – is to help customers manage performance through optimal deployment of analytics. Steve is a columnist for DMReview and also writes for BIReview and the B-Eye-Network. In addition to R, OpenBI specializes in the Pentaho and JasperSoft open source BI platforms and Weka data mining. Steve can be reached at steve.miller@openbi.com.


Dashboard Insight © 2013