Kjell Max Methodologies

A Process Recipe for the Creation of Models

The Kjell and Max methodologies for building predictive multivariate models.

Understanding the Data

  • What do we know about the scientific process that generated the data?
  • Investigate the data
  • Anything unusual in the data table?
  • What is the distribution of the response?
  • Are there obvious, important univariate relationships?
  • Pairwise scatterplots of predictors versus response and predictors versus predictors
  • How much missing data or limit-of-detection data is present?
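
The notes name no toolkit, so here is one minimal sketch of these checks in Python with pandas (the column names and data are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data table: two predictors and a response, with some missingness.
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.lognormal(size=100),   # a skewed predictor
    "y":  rng.normal(size=100),
})
df.loc[::10, "x2"] = np.nan          # inject 10% missing values

# Distribution of the response: summary statistics and skewness.
response_summary = df["y"].describe()
response_skew = df["y"].skew()

# Obvious univariate relationships: correlation of each predictor with the response.
# (Pairwise scatterplots would be the graphical version of this check.)
cor_with_response = df.drop(columns="y").corrwith(df["y"])

# How much missing data is present, per column.
missing_fraction = df.isna().mean()

print(response_summary)
print(response_skew)
print(cor_with_response)
print(missing_fraction)
```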

Determine validation approach

  • How much data do we have?
  • How many samples and how many predictors?
  • Do we need an independent test set? Do we have enough data to support one?
  • We need to determine whether or not we have a test set before pre-processing the predictors
  • Which cross-validation method should we use?
  • K-fold, repeated k-fold, leave-group-out, bootstrap?
  • Each of these has a different computational cost, which becomes noticeable as data size increases
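
A sketch of these options with scikit-learn (an assumed toolchain; the data here are synthetic). Leave-group-out CV corresponds to scikit-learn's `ShuffleSplit`; the bootstrap has no dedicated splitter there. The number of model fits per candidate is what drives the computational cost:

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, RepeatedKFold, ShuffleSplit

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)

# With enough samples, hold out the independent test set first,
# before any pre-processing is learned from the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Candidate resampling schemes and the number of fits each implies.
schemes = {
    "k-fold":          KFold(n_splits=10),
    "repeated k-fold": RepeatedKFold(n_splits=10, n_repeats=5),
    "leave-group-out": ShuffleSplit(n_splits=50, test_size=0.2),  # a.k.a. Monte Carlo CV
}
fits = {name: cv.get_n_splits(X_train, y_train) for name, cv in schemes.items()}
for name, n in fits.items():
    print(f"{name}: {n} model fits per candidate")
```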

Pre-process the data

  • Pre-filter samples and predictors on missingness (Kjell's rule of thumb: if a column has more than 30% missing data, remove it)
  • Remove samples with too many missing predictor values
  • Remove predictors with too many missing sample values
  • Transform and impute
  • Transform predictors to resolve skewness (Box-Cox)
  • Center and scale
  • Impute missing data
  • Post-filter uninformative predictors (e.g., near-zero variance)
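
One way to sketch this sequence as a scikit-learn pipeline (again a toolchain assumption, with made-up data). Two liberties are taken and labeled here: Yeo-Johnson is used as a Box-Cox variant that tolerates non-positive values, and imputation is placed before transformation so every step sees complete data, although the notes list transform before impute:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(2)
a = rng.lognormal(size=50)                  # skewed predictor, complete
b = rng.normal(size=50); b[:5] = np.nan     # 10% missing: keep and impute
c = rng.normal(size=50); c[:20] = np.nan    # 40% missing: above the 30% cutoff
df = pd.DataFrame({"a": a, "b": b, "c": c})

# Pre-filter: drop predictors with more than 30% missing values (the rule of thumb above).
keep = df.columns[df.isna().mean() <= 0.30]
filtered = df[keep]

pipe = Pipeline([
    ("impute",     SimpleImputer(strategy="mean")),
    # Resolves skewness; standardize=True also centers and scales.
    ("transform",  PowerTransformer(method="yeo-johnson", standardize=True)),
    ("postfilter", VarianceThreshold()),    # drop zero-variance leftovers
])
Xp = pipe.fit_transform(filtered)
print(Xp.shape)
```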

Select desired performance metric

  • What are we optimizing?
  • R2, RMSE, Accuracy, Kappa, ROC, Sensitivity, specificity...
  • What is an optimal value for this problem?
  • Do we know the measurement error of the response?
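
All of these metrics are available off the shelf; a small sketch with scikit-learn's metrics module, using made-up held-out values for one regression and one classification problem:

```python
import numpy as np
from sklearn.metrics import (r2_score, mean_squared_error, accuracy_score,
                             cohen_kappa_score, roc_auc_score)

# Hypothetical held-out values and predictions.
y_reg_true = np.array([3.0, 5.0, 2.0, 7.0])
y_reg_pred = np.array([2.8, 5.1, 2.5, 6.5])

y_cls_true = np.array([0, 0, 1, 1, 1, 0])
y_cls_pred = np.array([0, 1, 1, 1, 0, 0])
y_cls_prob = np.array([0.2, 0.6, 0.9, 0.8, 0.4, 0.1])  # predicted P(class 1)

rmse  = mean_squared_error(y_reg_true, y_reg_pred) ** 0.5
r2    = r2_score(y_reg_true, y_reg_pred)
acc   = accuracy_score(y_cls_true, y_cls_pred)
kappa = cohen_kappa_score(y_cls_true, y_cls_pred)   # accuracy corrected for chance
auc   = roc_auc_score(y_cls_true, y_cls_prob)       # needs probabilities, not labels
print(rmse, r2, acc, kappa, auc)
```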

Build models

  • Build sentinel models
  • Choose an interpretable, simple model and a highly complex, uninterpretable model
  • Tune each model, and assess model performance
  • Do the models have significantly different predictive performance?
  • If not, then the interpretable model may be sufficient
  • If there is a sufficient range of predictive performance, then build lots of models
  • Linear, non-linear, tree-based, etc.
  • Gather CV performance metrics 
  • Do some models perform better than others?
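
A sketch of the sentinel-model step, assuming scikit-learn: a simple, interpretable linear model against a flexible gradient boosting model, compared on cross-validated RMSE over synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Sentinel models: one interpretable, one highly flexible.
candidates = {
    "linear (interpretable)": LinearRegression(),
    "boosting (flexible)":    GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Scoring is negated RMSE in scikit-learn; flip the sign back.
    cv_rmse = -cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    scores[name] = cv_rmse.mean()
    print(name, round(scores[name], 2))

# If the CV performance gap is small, the interpretable model may be sufficient;
# if it is large, it is worth building many model types and comparing them.
```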

Dig for understanding

  • Compute variable importance to understand which predictors are important to each model
  • Are certain predictors important across most models?
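
One model-agnostic way to compute variable importance is permutation importance, sketched here with scikit-learn on synthetic data where only the first predictor actually matters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=300)   # only predictor 0 is informative

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each predictor in turn and measure the drop in performance:
# a large drop means the model relies on that predictor.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
```

Running the same computation for each fitted model type and comparing the rankings shows whether certain predictors are important across most models.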

Final product

  • Often, no one model has superior performance to all other models
  • We can generate predictions from the top models and use those predictions to inform decisions
  • This approach is especially useful for classification models
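
A minimal regression sketch of this blending idea, again assuming scikit-learn and synthetic data: fit the top models and average their held-out predictions (for classification, the same idea applies to averaged class probabilities):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=4)

# Hypothetical "top models" surviving the comparison step.
models = [Ridge(alpha=1.0), GradientBoostingRegressor(random_state=4)]

# One prediction column per model, then a simple average to inform decisions.
preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in models])
blended = preds.mean(axis=1)
print(blended.shape)
```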