Predictive Modeling

Course Instructor - Jimeng Sun

Predictive Modeling Pipeline

Predictive modeling is the process of modeling historical data to predict future events. It is not a single algorithm, but a computational pipeline involving multiple steps. First, we decide the prediction target, for example, whether a patient will develop heart failure in the next few years. Second, we construct the cohort of relevant patients for the study. Third, we define all the potentially relevant features for the study. Fourth, we select which features are relevant for predicting the target. Fifth, we compute the predictive model, and sixth, we evaluate the predictive model. Then we iterate this process several times until we are satisfied with the resulting model.

Prediction target

We should choose a prediction target that addresses the primary question, one that is both interesting to the investigator and answerable with the available data.

Motivation for Early Detection

If we can detect heart failure earlier (for example) we can potentially reduce the cost of hospitalization associated with heart failure (for example). We can also potentially introduce early intervention to try to slow down the progression of heart failure, improve the quality of life, and reduce mortality. In the long term we can improve existing clinical guidelines for heart failure prevention.

Cohort Construction

Cohort construction is about defining the study population. For a given prediction target, only a subset of patients from the whole patient population is relevant, and this subset is the study population. How do we define the study population? There are two axes to consider. On the vertical axis, we have prospective study versus retrospective study. On the horizontal axis, we have cohort study versus case-control study. Depending on the combination, we have four options: prospective cohort study, prospective case-control study, retrospective cohort study, and retrospective case-control study.

Prospective Vs Retrospective

In a prospective study, we first identify the cohort of patients, then decide what information to collect and how to collect it, and then start the data collection. In contrast, in a retrospective study, we identify the patient cohort from existing data. So in a prospective study, we identify the cohort and collect the data from scratch, whereas in a retrospective study the data set already exists; we just need to identify the right subset and retrieve it.

Cohort Study

In a cohort study, the goal is to select a group of patients who are exposed to a particular risk. For example, suppose we want a predictive model for heart failure readmission, where readmission means that a patient, after being discharged from the hospital, comes back to the hospital due to heart failure. In this case, the cohort should contain all heart failure patients who were discharged from the hospital, because they can potentially be readmitted after discharge. The key in a cohort study is to define the right inclusion and exclusion criteria to determine which patients to include.
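
As a minimal sketch, the following pandas snippet applies inclusion and exclusion criteria to a hypothetical encounter table; the column names, codes, and criteria are assumptions made for illustration, not part of the lecture.

import pandas as pd

# Hypothetical encounter table; column names and values are assumptions.
encounters = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "diagnosis":  ["heart_failure", "heart_failure", "diabetes", "heart_failure"],
    "discharged": [True, True, True, False],
    "age":        [67, 45, 70, 80],
})

# Inclusion criteria: heart failure patients who were discharged from the hospital.
included = encounters[(encounters["diagnosis"] == "heart_failure") &
                      (encounters["discharged"])]

# Exclusion criteria: for example, drop patients below a minimum age.
cohort = included[included["age"] >= 18]
print(cohort["patient_id"].tolist())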

Case Control Study

In this design, we identify two sets of patients, namely cases and controls, and put them together to construct the cohort. Cases are patients with the positive outcome, for example, patients who develop the disease. Controls are patients with the negative outcome: they are healthy, but otherwise similar to the cases. For example, they can have the same age and gender and visit the same clinics. The key here is to define the matching criteria between cases and controls.
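
Below is a rough sketch of matching each case to one control of the same gender with the closest age, using a hypothetical patient table; the column names and matching criteria are assumptions, and a real study would typically also prevent the same control from being reused.

import pandas as pd

# Hypothetical patient table; columns and matching rules are assumed for illustration.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "is_case":    [True, False, False, True, False, False],
    "age":        [65, 64, 66, 72, 71, 50],
    "gender":     ["F", "F", "M", "M", "M", "F"],
})

cases = patients[patients["is_case"]]
controls = patients[~patients["is_case"]]

# For each case, pick one control of the same gender with the closest age.
matched = []
for _, case in cases.iterrows():
    candidates = controls[controls["gender"] == case["gender"]].copy()
    if candidates.empty:
        continue
    candidates["age_diff"] = (candidates["age"] - case["age"]).abs()
    best = candidates.sort_values("age_diff").iloc[0]
    matched.append((case["patient_id"], best["patient_id"]))

print(matched)  # list of (case_id, matched_control_id) pairs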

Feature Construction

The goal of feature construction is to construct all potentially relevant features about patients in order to predict the target outcome. The raw patient data arrive as event sequences over time. The diagnosis date is the date on which the target outcome happened; in the heart failure example, each case patient is diagnosed with heart failure on this date. Since control patients do not have a heart failure diagnosis, in theory we could use any date from a control patient as the diagnosis date, but commonly we use the heart failure diagnosis date of the matching case patient as the diagnosis date for the corresponding control.

Before the diagnosis date, we have a time window called the prediction window. Before the prediction window, we have the index date, at which we want to use the learned predictive model to make a prediction about the target outcome. Before the index date, we have another time window called the observation window. All the patient information recorded during this observation window is used to construct features.

There are different ways to construct features. For instance, we can count the number of times an event happens: if a type 2 diabetes code appears three times during the observation window, the corresponding feature for type 2 diabetes equals three. Or we can take the average of the event values: if a patient has two HbA1c measurements during the observation window, we can take the average of these two measurements as the feature for HbA1c. The lengths of the prediction window and the observation window are two important parameters that will impact the model performance.
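
As a rough illustration, the sketch below builds a count feature from diagnosis codes and an average feature from lab values within an assumed observation window; the event codes, column names, and dates are made up for this example.

import pandas as pd

# Hypothetical event stream for one patient; codes and dates are assumptions.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 1],
    "event":      ["dx_type2_diabetes", "dx_type2_diabetes", "dx_type2_diabetes",
                   "lab_hba1c", "lab_hba1c"],
    "value":      [1, 1, 1, 7.2, 6.8],
    "date":       pd.to_datetime(["2020-01-05", "2020-02-10", "2020-03-01",
                                  "2020-01-20", "2020-02-25"]),
})

index_date = pd.Timestamp("2020-04-01")
observation_window = pd.Timedelta(days=365)

# Keep only events inside the observation window (before the index date).
in_window = events[(events["date"] < index_date) &
                   (events["date"] >= index_date - observation_window)]

# Count-based features for diagnosis codes, average-based features for lab values.
dx_counts = in_window[in_window["event"].str.startswith("dx_")] \
    .groupby(["patient_id", "event"]).size()
lab_means = in_window[in_window["event"].str.startswith("lab_")] \
    .groupby(["patient_id", "event"])["value"].mean()

print(dx_counts)   # dx_type2_diabetes count = 3
print(lab_means)   # lab_hba1c mean = 7.0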

Feature Selection

If we look closely at the observation window, we see event sequence data corresponding to different types of clinical events, for example, diagnoses, symptoms, medications, patient demographics, lab results, and vital signs. We can construct features from all those events. However, not all events are relevant for predicting a specific target. The goal of feature selection is to find the truly predictive features to include in the model.
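
One simple way to do this, sketched below on synthetic data, is univariate feature selection with scikit-learn's SelectKBest, which keeps the k features most associated with the target; the synthetic data and the choice of k are assumptions for illustration.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy feature matrix: rows are patients, columns are candidate features.
# In practice, X comes from the feature-construction step above.
rng = np.random.RandomState(0)
X = rng.rand(100, 20)                       # 100 patients, 20 candidate features
y = (X[:, 3] + X[:, 7] > 1.0).astype(int)   # outcome driven by features 3 and 7

# Keep the k features most strongly associated with the target.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))   # indices of the selected features
print(X_selected.shape)                     # (100, 5)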

Predictive Models

A predictive model is a function that maps the input features of a patient to the output target. For example, if we know a patient's past diagnoses, medications, and lab results, and we also know this function, then we can assess how likely the patient is to develop heart failure. Depending on the type of the target, the model is either a regression problem or a classification problem. In a regression problem, the target is continuous. For example, if we want to predict the cost that a patient will incur on the healthcare system, it is a regression problem, and y is the cost in dollars; popular methods include linear regression and the generalized additive model. If the target is categorical, for example whether the patient has heart failure or not, it is a classification problem; popular methods include logistic regression, support vector machines, decision trees, and random forests.
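
A minimal scikit-learn sketch of both cases on synthetic data: logistic regression for a binary target and linear regression for a continuous cost target. The features and targets here are made up for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                         # synthetic patient features

# Classification: binary target such as "develops heart failure" (1) or not (0).
y_class = (X[:, 0] + X[:, 1] > 1.0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print(clf.predict_proba(X[:3])[:, 1])        # predicted risk for the first 3 patients

# Regression: continuous target such as future healthcare cost in dollars.
y_cost = 1000 * X[:, 0] + 500 * X[:, 2] + rng.normal(0, 50, size=200)
reg = LinearRegression().fit(X, y_cost)
print(reg.predict(X[:3]))                    # predicted cost for the first 3 patients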

Performance Evaluation

Evaluation of predictive models is one of the most crucial steps in the pipeline. The basic idea is to develop the model using some training samples but test the trained model on other unseen samples, ideally from future data. It is important to note that the training error is not very useful, because you can easily overfit the training data by using complex models that do not generalize well to future samples. The testing error is the key metric, because it is a better indicator of how the model will perform on future samples. The classical approach for evaluation is the cross-validation (CV) process.
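
The sketch below, on synthetic data, illustrates the gap between training error and testing error: an unconstrained decision tree fits the training samples almost perfectly, while the accuracy on the held-out test set is the number that actually matters. The data and model choice are assumptions for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X = rng.rand(300, 10)
y = (X[:, 0] + 0.5 * rng.rand(300) > 0.75).astype(int)   # noisy synthetic target

# Hold out unseen samples for testing; fit only on the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A deep, unconstrained tree can drive the training error close to zero (overfitting)...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
# ...but the accuracy on unseen test samples is what reflects future performance.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))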

Cross Validation

The main idea behind cross-validation is to iteratively split a data set into training and validation sets. We build the model on the training set and test the model on the validation set, and we do this iteratively, many times. Finally, the performance metric is aggregated across these iterations, often by taking the average. There are three common cross-validation strategies: leave-one-out cross-validation, k-fold cross-validation, and randomized cross-validation.

In leave-one-out cross-validation, we take one example at a time as our validation set and use the remaining examples as the training set. We repeat this process, going through the entire data set. The final performance is computed by averaging the prediction performance across all iterations.
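
A minimal scikit-learn sketch of leave-one-out cross-validation; the synthetic data and the choice of classifier are assumptions.

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(30, 4)                           # small synthetic data set
y = (X[:, 0] > 0.5).astype(int)

# Each iteration holds out exactly one sample as the validation set.
loo_scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print(len(loo_scores), loo_scores.mean())     # 30 iterations, averaged accuracy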

K-fold cross-validation is very similar to leave-one-out cross-validation, but instead of using just one example as the validation set, we have multiple examples in the validation set. More specifically, we split the entire data set into K folds, and we iteratively choose each fold as the validation set and use the remaining folds as the training set.
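
The same idea as a k-fold sketch, again on synthetic data with an assumed k of 5.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

# Split into 5 folds; each fold serves once as the validation set,
# with the remaining 4 folds used for training.
kfold_scores = cross_val_score(LogisticRegression(), X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(kfold_scores, kfold_scores.mean())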

Finally, randomized cross-validation randomly splits the data set into training and validation sets. For each such split, the model is fit to the training data, and the prediction accuracy is assessed on the validation set. The results are then averaged over all the splits. The advantage of this method over k-fold cross-validation is that the proportion of the training and validation sets does not depend on the number of folds. The disadvantage is that some observations may never be selected into the validation set because of the randomization, whereas other samples may be selected more than once. In other words, validation sets may overlap.
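
A sketch of randomized cross-validation using scikit-learn's ShuffleSplit, with an assumed 70/30 split repeated 10 times; the data and split sizes are illustrative assumptions.

import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)

# 10 random 70/30 splits; the validation fraction is independent of any fold count,
# and some samples may appear in several validation sets while others appear in none.
shuffle_scores = cross_val_score(LogisticRegression(), X, y,
                                 cv=ShuffleSplit(n_splits=10, test_size=0.3, random_state=0))
print(shuffle_scores.mean())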
