# Kjell Max Methodologies

Posted Monday, January 26, 2015, 7:32am

# A Process Recipe for the Creation of Models

The Kjell Max methodology for building predictive multivariate models!

## Understanding the data

- What do we know about the scientific process that generated the data?
- Investigate the data
  - Is there anything unusual in the data table?
  - What is the distribution of the response?
  - Are there obvious, important univariate relationships?
    - Pairwise scatterplots of predictors versus response and predictors versus predictors
  - How much missing or limit-of-detection data is present? (see the sketch after this list)
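
A minimal sketch of this first pass in Python with pandas and seaborn, assuming the data live in a DataFrame `df` with a numeric `response` column (the file and column names are placeholders):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder data source: any table of predictors plus a "response" column.
df = pd.read_csv("my_data.csv")

# Anything unusual in the data table? Ranges, counts, obvious oddities.
print(df.describe(include="all"))

# Distribution of the response.
df["response"].hist(bins=30)
plt.xlabel("response")
plt.show()

# Pairwise scatterplots: predictors vs. response and predictors vs. predictors.
sns.pairplot(df)
plt.show()

# How much missing data is present, per column?
print(df.isna().mean().sort_values(ascending=False))
```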

## Determine validation approach

- How much data do we have?
  - How many samples and how many predictors?
- Do we need an independent test set? Do we have enough data to have an independent test set?
  - We need to determine whether or not we have a test set before pre-processing the predictors
- Which cross-validation method should we use?
  - K-fold, repeated k-fold, leave-group-out, bootstrap?
  - Each of these has a different computational cost, which becomes noticeable as data size increases (see the sketch after this list)
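
One way to set this up with scikit-learn, assuming `X` (predictors) and `y` (response) are pulled from the DataFrame above; the split fraction and fold counts are illustrative:

```python
from sklearn.model_selection import train_test_split, RepeatedKFold

# Assumed split of the DataFrame into predictors and response.
X = df.drop(columns="response")
y = df["response"]

# Hold out an independent test set only if there is enough data to spare.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Resampling scheme for tuning: repeated 10-fold CV. Every extra repeat
# multiplies the number of model fits, which adds up as the data grow.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=42)
```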

## Pre-process the data

- Pre-filter samples and predictors on missingness (if a column has more than 30% missing data, take it out; a rule of thumb Kjell uses)
  - Remove samples with too many missing predictor values
  - Remove predictors with too many missing sample values
- Transform and impute
  - Transform predictors to resolve skewness (Box-Cox)
  - Center and scale
  - Impute missing data
- Post-filter uninformative predictors (see the sketch after this list)
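
A sketch of these steps with pandas and scikit-learn, reusing `X_train`/`X_test` from above. The 30% threshold follows Kjell's rule of thumb; imputation is placed first here so the later steps never see missing values, and Yeo-Johnson stands in for Box-Cox so that zero or negative predictors do not cause errors:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer
from sklearn.feature_selection import VarianceThreshold

# Pre-filter predictors, then samples, on missingness (~30% rule of thumb).
col_missing = X_train.isna().mean()
X_train = X_train.loc[:, col_missing <= 0.30]

row_missing = X_train.isna().mean(axis=1)
X_train = X_train.loc[row_missing <= 0.30]
y_train = y_train.loc[X_train.index]

# Impute, transform for skewness (center/scale via standardize=True), and
# post-filter zero-variance predictors.
pre = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("power", PowerTransformer(method="yeo-johnson", standardize=True)),
    ("postfilter", VarianceThreshold()),
])
X_train_pp = pre.fit_transform(X_train)
X_test_pp = pre.transform(X_test[X_train.columns])
```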

## Select desired performance metric

- What are we optimizing?
  - R2, RMSE, accuracy, Kappa, ROC, sensitivity, specificity...
- What is an optimal value for this problem?
- Do we know the measurement error of the response? (see the sketch after this list)
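
For instance, fixing the metric up front with scikit-learn's scorers; the choice of RMSE plus R2 for regression and accuracy/ROC/kappa for classification is illustrative:

```python
from sklearn.metrics import make_scorer, cohen_kappa_score

# Regression: optimize RMSE and report R^2 alongside it (built-in scorer names).
regression_scoring = {"rmse": "neg_root_mean_squared_error", "r2": "r2"}

# Classification: accuracy, ROC AUC, and Cohen's kappa wrapped as a scorer.
classification_scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "kappa": make_scorer(cohen_kappa_score),
}
```

Either dictionary can then be passed through the `scoring` argument of `cross_validate` or `GridSearchCV`.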

## Build models

- Build sentinel models
  - Choose an interpretable, simple model and a highly complex, uninterpretable model
  - Tune each model and assess model performance
  - Do the models have significantly different predictive performance?
    - If not, then the interpretable model may be sufficient
- If there is a sufficient range of predictive performance, then build lots of models
  - Linear, non-linear, tree-based, etc.
  - Gather CV performance metrics
  - Do some models perform better than others? (see the sketch after this list)
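
A sketch of the sentinel-then-many-models step with scikit-learn, reusing `cv`, `X_train_pp`, and `y_train` from above. The model list and settings are illustrative, and in practice each model would be tuned (e.g. with GridSearchCV) rather than run at fixed settings:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Sentinels: a simple, interpretable model and a complex, flexible one; the
# rest fill in the range between them (linear, non-linear, tree-based).
models = {
    "elastic_net": ElasticNet(alpha=0.1),                        # interpretable
    "svm_rbf": SVR(C=1.0),                                       # non-linear
    "random_forest": RandomForestRegressor(n_estimators=500, random_state=42),
    "boosted_trees": GradientBoostingRegressor(random_state=42),
}

# Gather cross-validated RMSE for each model on the training data.
for name, model in models.items():
    scores = -cross_val_score(
        model, X_train_pp, y_train, cv=cv,
        scoring="neg_root_mean_squared_error",
    )
    print(f"{name}: RMSE {scores.mean():.3f} (+/- {scores.std():.3f})")
```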

## Dig for understanding

- Compute variable importance to understand which predictors are important to each model (see the sketch after this list)
- Are certain predictors common across most models?
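
Permutation importance is one model-agnostic way to make this comparison across very different model types; a sketch reusing the `models` dictionary and the held-out data from above:

```python
from sklearn.inspection import permutation_importance

# Fit each model, score importance on the held-out set, and look for
# predictors that rank highly across most of the models. The indices
# refer to columns of the pre-processed predictor matrix.
for name, model in models.items():
    model.fit(X_train_pp, y_train)
    result = permutation_importance(
        model, X_test_pp, y_test, n_repeats=10, random_state=42
    )
    top = result.importances_mean.argsort()[::-1][:5]
    print(f"{name}: top predictor columns {top}")
```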

## Final product

- Often, no one model has superior performance to all other models
- We can generate predictions from the top models and use those predictions to inform decisions (see the sketch after this list)
- This approach is especially useful for classification models
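
A sketch of that last step, averaging test-set predictions from the top models; which models count as "top" here is illustrative, and for classification one would average predicted class probabilities (e.g. from `predict_proba`) instead:

```python
import numpy as np

# Take the models that performed best in cross-validation and average their
# predictions; no single model has to win outright.
top_models = [models["random_forest"], models["boosted_trees"]]
preds = np.column_stack(
    [m.fit(X_train_pp, y_train).predict(X_test_pp) for m in top_models]
)
ensemble_pred = preds.mean(axis=1)
```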