Linear and Logistic Regression for Predictive Analytics

1 Goals and Introduction

  1. Split data into testing and training sets for use with regression.

  2. Use regression analysis to make predictions.

Link to this document: https://rpubs.com/anshulkumar/LinearLogisticRegressionPredictiveAnalytics

In this tutorial, we will turn back to supervised machine learning to use basic linear and logistic regression models to make predictions. We will again use the student-por.csv data in your assignment this week, but I also encourage you to use your own data if you have some that you want to analyze. I am also available to meet with you to look at your data and/or assist with preparing it for use in predictive analysis.

Previously, we engaged in an Educational Predictive Analytics Tutorial together in which we began to practice some of the methods we will use in this week’s tutorial. It may be useful to you to refer to that tutorial as you go through this one.

The following tutorial shows how to import, split (into training and testing), and create a model with your data. Then, we will look at how to use the models to make predictions and measure how good or bad the model is at making predictions for us.

We will be using supervised machine learning in this tutorial, which means that we must look at a dependent variable that is an outcome of interest. In other words, there should be a variable in the data that it would be useful to us to be able to predict. In the case of this example tutorial, we will treat the variable mpg from within the mtcars dataset as our dependent variable of interest.

2 Import and prepare data

This procedure is essentially the same as it was in our tutorial on unsupervised machine learning that we looked at before.

For the examples in this tutorial, I will again just use the mtcars dataset that is already built into R. However, we will still “import” this and give it a new name—df—so that it is easier to type.

df <- mtcars

Run the following command in your own R console to see information about the data:

?mtcars

And we can see all of the variable names, like we have done before:

names(df)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Number of observations:

nrow(df)
## [1] 32

Remove missing observations:

df <- na.omit(df)

Inspect some rows in the dataset:

head(df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In other instances, we have scaled or standardized our data. It is not necessary for us to do this when using linear or logistic regression.

3 Linear regression to make predictions

This portion of the tutorial will demonstrate how to use linear regression to make predictions. You will then repeat this process with different data and some additional modifications in your assignment for the week.

Our goal in this portion of the tutorial is to predict each car’s mpg as accurately as possible.

3.1 Divide data

As you have now seen multiple times, we will start by dividing the data into testing and training datasets. The training data will be used to create a new predictive model. The testing data, which was not used to create the model and which will be new to the computer, will be plugged into the model to see how well the model makes predictions.

Below, you’ll find the code we use to split up our data. For now, I have decided to use 75% of the data to train and 25% of the data to test. Here’s the first line of code to start the splitting process:

trainingRowIndex <- sample(1:nrow(df), 0.75*nrow(df))  # row indices for training data

The code above does the following:

  • Figure out the number of observations in the dataset df. If you’re using a different dataset, change df to the name of your own data when you use this code.
  • Randomly select 75% of the rows in df. You can change the 0.75 written in the code if you want to change the proportion of training and testing data.
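One optional addition (my suggestion, not part of the original code): if you place a set.seed command before the sample command, the same “random” rows will be chosen every time you re-run your code, which makes your results reproducible. The seed value 999 below is an arbitrary choice:

set.seed(999)  # any fixed number works; this makes the random split reproducible
trainingRowIndex <- sample(1:nrow(df), 0.75*nrow(df))  # row indices for training data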

Above, we randomly selected the rows that we would be including in each dataset, but we didn’t actually create the new datasets. So let’s start by creating the training data:

dtrain <- df[trainingRowIndex, ]  # model training data

Above, we created a new dataset called dtrain which contains the 75% of observations which were selected for inclusion in the training dataset.

And now we’ll create the testing dataset:

dtest  <- df[-trainingRowIndex, ]   # test data
save(dtrain, dtest, file = "TrainTestData.RData")

Above, we created a new dataset called dtest which contains the remaining 25% of observations which are not part of the training dataset.

So we now have three datasets:

  • df – This is the original full dataset. We won’t use this much for the rest of the tutorial, but we could refer to it if we need to.
  • dtrain – A training dataset containing 75% of the observations.
  • dtest – A testing dataset containing 25% of the observations.

We can verify that our data splitting procedure worked:

nrow(dtrain)
## [1] 24
nrow(dtest)
## [1] 8

Above, we see that the training data contains 24 observations and the testing data contains 8 observations.

You can also look at these two datasets by running the following commands, one at a time, in your console (not your RMarkdown or R code file):

View(dtrain)
View(dtest)

Running the commands above will open a spreadsheet viewer on your computer for you to look at the data. There’s no way to display that within a “knitted” RMarkdown document or in the textual results that we usually get from R. That’s why the View command should never be included in your RMarkdown code.

3.2 Run linear regression

We are now ready to make our first predictive model, an ordinary least squares (OLS) linear regression model. We will do this using the training data as follows:

reg <- lm(mpg ~ ., data = dtrain)

The code above creates and stores the results of a new linear regression model. The name of the stored model is reg. The dependent variable is designated as mpg. All of the other variables in the dataset are designated as independent variables; that’s what the ~ . part of the code means. We could also write the name of each IV (independent variable) instead, if we wanted to just select a subset of them, like this: mpg ~ IV1 + IV2 + IV3 and so on. Finally, we specified at the end of the command that we wanted to use the dtrain data to make this regression model.
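For example, here is a hypothetical smaller model that uses just three of the independent variables (the name reg.small and the choice of these particular IVs are mine, purely for illustration):

reg.small <- lm(mpg ~ wt + hp + cyl, data = dtrain)  # predict mpg using only weight, horsepower, and cylinders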

Now that we have created and stored the results of a regression model, we can use it to make predictions for us on our data.

3.3 Make predictions

Below is the key line of code that you will use and see used to plug new data into a preexisting stored model. In this case, our new data on which we want to make predictions is called dtest. Here it is:

dtest$reg.mpg.pred <- predict(reg, newdata = dtest)

Above, we added a variable (column of data) to the dtest data. This new column is called reg.mpg.pred. This new variable stores all of the predicted values of mpg for the testing data. To calculate this prediction, for each row of the test data, the computer plugged the values of the independent variables into the stored regression model reg. Plugging in those independent variables gave the computer a prediction for the dependent variable in that row. This value was then saved as reg.mpg.pred for that row.

The computer knew that this is what we wanted to do because we specified in the predict function that we wanted to make predictions using the stored reg model and we wanted to plug dtest into that model as fresh data.
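If it helps to demystify what predict is doing, here is a minimal sketch (my own addition) that reconstructs the prediction for the first testing car by hand. The prediction is simply the model’s intercept plus each coefficient multiplied by that car’s value of the corresponding independent variable:

coefs <- coef(reg)  # the intercept and one slope per independent variable
x <- unlist(dtest[1, names(coefs)[-1]])  # the first testing car's IV values
sum(coefs * c(1, x))  # intercept + slopes*values; should match dtest$reg.mpg.pred[1]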

Now that we have made our predictions, we can see how good they are!

3.4 Check accuracy

This section demonstrates a few selected ways to test accuracy of a regression model.

3.4.1 Actual and predicted values

First, since our dataset is not that large, let’s start by simply looking at the true and predicted values of mpg for the testing data side by side. Remember, we care the most about how the model performs on the testing data, because our goal is to use this model to make predictions on new data in the future.

Here is the code and result that shows us the true values of mpg and the predicted values of mpg for eight cars for which we made predictions (because these were the cars that were randomly split into the testing dataset, which were not used to train the model):

dtest[c("mpg","reg.mpg.pred")]
##                   mpg reg.mpg.pred
## Hornet 4 Drive   21.4     21.36493
## Duster 360       14.3     15.51104
## Merc 230         22.8     32.26509
## Merc 280C        17.8     20.47901
## Toyota Corona    21.5     29.01088
## Pontiac Firebird 19.2     15.16593
## Fiat X1-9        27.3     29.15442
## Volvo 142E       21.4     26.00033

Above, we see, for example, that the car Pontiac Firebird truly has an mpg of 19.2 and the model predicted that it would have an mpg of 15.2. So the prediction is off by 4 units. Whether that much error is tolerable or not is up to you, the user of the predictive model.

We can also look at the data from the table above as a scatterplot:

plot(dtest$mpg,dtest$reg.mpg.pred)

The graph above shows an overall positive trend, meaning that the predictions increase as the true values increase, but there is also a lot of error. If there were no errors, the dots would all fall on a straight line. You can locate the Pontiac Firebird on this graph: it is the dot between 18 and 20 on the horizontal axis, towards the very bottom, at approximately 15 on the vertical axis.
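One optional tweak (my addition): we can add a dashed 45-degree reference line to the plot. Dots exactly on this line would be perfect predictions, so the vertical distance from each dot to the line shows that prediction’s error:

plot(dtest$mpg, dtest$reg.mpg.pred)
abline(0, 1, lty = 2)  # dashed line where predicted equals actual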

We can also re-plot this and include the labels of each car in the plot:

plot(dtest$mpg,dtest$reg.mpg.pred)
text(dtest$mpg,dtest$reg.mpg.pred, row.names(dtest), cex=0.7, pos=4, col="blue")

Next, we’ll look at another key measurement that is often used when predicting numeric outcomes in predictive analytics.

3.4.2 RMSE

RMSE (root mean square error) is a very common metric that is used to calculate the accuracy of a predictive model when the dependent variable is a continuous numeric variable. This is considered to be better than other metrics—such as \(R^2\)—when measuring how useful a model is with fresh (testing) data.

If you wish, you can read more about RMSE here, but it is not required for you to do so:

Moody, J. (2019). What does RMSE really mean? https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e.

Let’s calculate the RMSE for the predictive model we just made. To do this, we use the rmse function from the Metrics package and we feed the actual and predicted values (which we calculated earlier) into the function:

if(!require(Metrics)) install.packages('Metrics')
library(Metrics)

rmse(dtest$mpg,dtest$reg.mpg.pred)
## [1] 4.943707

Above, we see that the RMSE of this model on this particular testing data is 4.9. You will get a sense for good and bad values of RMSE as you use this metric more and more. Usually, it helps us compare predictive models to each other.
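In case it is useful, RMSE is defined as \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}\), where \(y_i\) is the actual value and \(\hat{y}_i\) is the predicted value for observation \(i\). Here is a base-R version (my own sketch) that should give the same answer as the rmse function above:

sqrt(mean((dtest$mpg - dtest$reg.mpg.pred)^2))  # square the errors, average them, take the square root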

3.4.3 Discretize and check

Another way that we can check our model’s accuracy is to discretize the data. This means that we will sort the data into discrete categories. In this case, we will just sort the mpg data into high and low values of mpg. Let’s start off by saying that a car has high mpg if it has a value of mpg over 22. Let’s do this first for the true values of mpg in the testing data:

dtest$high.mpg <- ifelse(dtest$mpg>22,1,0)

Above, we told the computer to create a new discrete binary variable (meaning a variable with only two possible levels). The value of this variable will be 1 for all cars with mpg greater than 22 and 0 for all cars with mpg less than or equal to 22.

Below, we’ll do the same for the predicted values of mpg (which are saved as reg.mpg.pred):

dtest$high.mpg.pred <- ifelse(dtest$reg.mpg.pred>22,1,0)

Above, the new variable high.mpg.pred was created (and added to dtest), which will also have values of 1 or 0 to indicate whether the predicted (not true/actual) value of mpg for each car is high or low, again using 22 as a cutoff threshold.

Let’s inspect our data again to look at what we just did, just to be clear:

dtest[c("mpg","high.mpg","reg.mpg.pred","high.mpg.pred")]
##                   mpg high.mpg reg.mpg.pred high.mpg.pred
## Hornet 4 Drive   21.4        0     21.36493             0
## Duster 360       14.3        0     15.51104             0
## Merc 230         22.8        1     32.26509             1
## Merc 280C        17.8        0     20.47901             0
## Toyota Corona    21.5        0     29.01088             1
## Pontiac Firebird 19.2        0     15.16593             0
## Fiat X1-9        27.3        1     29.15442             1
## Volvo 142E       21.4        0     26.00033             1

And again, you can inspect the entire dataset yourself by running the following command in the console:

View(dtest)

Now that we have successfully discretized our data, we can make a confusion matrix that shows how many predictions were correct and incorrect:

with(dtest,table(high.mpg,high.mpg.pred))
##         high.mpg.pred
## high.mpg 0 1
##        0 4 2
##        1 0 2

Above, we see that at the threshold of 22:

  • 4 cars truly have low mpg and were correctly predicted to have low mpg (top left).
  • 0 cars that truly have high mpg were incorrectly predicted to have low mpg (bottom left).
  • 2 cars that truly have low mpg were incorrectly predicted to have high mpg (top right).
  • 2 cars that truly have high mpg were correctly predicted to have high mpg (bottom right).
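As a side note, accuracy, sensitivity, and specificity, which come up below and in the assignment, can all be read off of this matrix. Here is a quick sketch of the calculations (my own code, not part of the original tutorial):

cm <- with(dtest, table(high.mpg, high.mpg.pred))  # rows = actual, columns = predicted
(cm[1,1] + cm[2,2]) / sum(cm)  # accuracy: share of all predictions that are correct
cm[2,2] / sum(cm[2, ])  # sensitivity: share of truly high-mpg cars predicted as high
cm[1,1] / sum(cm[1, ])  # specificity: share of truly low-mpg cars predicted as low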

Next, let’s change the threshold. Let’s keep the true meaning of high and low as 22, but change the sensitivity and specificity of our predictions by modifying the threshold for the predicted values only. First, we’ll lower it by 4 units, from 22 to 18. This will make our predictions more likely to classify cars as high mpg, because now all cars with predicted mpg between 18 and 22 will also be classified as high, whereas before only cars with predictions above 22 were.

Here is the modified code which sets the prediction threshold at 18 and then creates a new confusion matrix:

dtest$high.mpg <- ifelse(dtest$mpg>22,1,0)
dtest$high.mpg.pred <- ifelse(dtest$reg.mpg.pred>18,1,0)
with(dtest,table(high.mpg,high.mpg.pred))
##         high.mpg.pred
## high.mpg 0 1
##        0 2 4
##        1 0 2

Above, out of the 6 cars in total that truly have low mpg, 2 were predicted correctly to have low mpg and 4 were predicted incorrectly to have high mpg. If you compare this to the confusion matrix from before (at the 22 threshold), two of these six cars jumped from low prediction to high prediction.

Let’s try this one more time. This time, we’ll make the threshold (only for the predictions) 4 points higher than 22:

dtest$high.mpg <- ifelse(dtest$mpg>22,1,0)
dtest$high.mpg.pred <- ifelse(dtest$reg.mpg.pred>26,1,0)
with(dtest,table(high.mpg,high.mpg.pred))
##         high.mpg.pred
## high.mpg 0 1
##        0 4 2
##        1 0 2

Above, we set the cutoff threshold to 26 for the predictions. We get the very same predictions that we got when we used 22. So the accuracy, sensitivity, and specificity are the same in both cases.

Which of these three prediction cutoff thresholds should you use when making predictions about new cars for which you do not know the true mpg? That is up to you. Do you want cars that the model is uncertain about to be classified as low or high mpg?

3.5 Variable importance

Using the varImp function from the caret package, as shown below, we can rank how important each independent variable was in helping us make a prediction in our linear regression:

if(!require(caret)) install.packages('caret')
library(caret)

varImp(reg)
##         Overall
## cyl  0.98346899
## disp 0.16693025
## hp   0.84666403
## drat 1.08329247
## wt   1.66379377
## qsec 1.62719740
## vs   0.76176942
## am   0.15662440
## gear 0.05210547
## carb 0.09612851

Above, wt has the highest importance score, making it the most useful variable in helping us make a prediction about mpg, followed closely by qsec.

Later, if we wanted, we could try to make a predictive regression model using only the few most important variables as independent variables.
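For example, using the three variables ranked highest above, a trimmed-down model might look like the sketch below (the name reg.top3 is mine; whether the smaller model predicts better would need to be checked with RMSE on the testing data):

reg.top3 <- lm(mpg ~ wt + qsec + drat, data = dtrain)  # only the three highest-ranked IVs
dtest$reg.top3.pred <- predict(reg.top3, newdata = dtest)  # predictions from the smaller model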

3.6 Linear regression wrap-up

Above, you saw how to use linear regression to make predictions of a continuous numeric dependent variable (mpg) on individual observations (cars). You then saw how to check how useful those predictions are.

Next, we’ll repeat the same procedure using logistic regression.

4 Logistic regression to make predictions

Logistic regression helps us predict binary outcomes, which we can think of as yes/no or 1/0 outcomes. In our case, we want to predict whether a car has high or low mpg, so this fits our goal. This procedure will follow the same process you saw for linear regression, with a few differences.

4.1 Prepare and divide data

Let’s use the same training-testing division from before, so that we can easily compare results of the linear and logistic models to each other using the same testing cars. So again, we’ll use dtrain and dtest which are already stored in the computer for us to use.

Now that we are using logistic regression, our dependent variable has to be transformed into a yes or no variable. We will use 1 to mean yes and 0 to mean no. Below, we create a new variable called high.mpg which is yes if mpg is over 22 and no otherwise:[1]

dtrain$high.mpg <- ifelse(dtrain$mpg>22,1,0)

Now our new variable high.mpg has been created and added to our dataset. This will be our dependent variable throughout the logistic regression section of this tutorial. We are no longer using mpg as the dependent variable. Therefore, we should remove mpg from our dataset before we continue:

dtrain$mpg <- NULL

The code above removed the variable mpg. Let’s verify this by looking at the variables that we have in our training and testing datasets:

names(dtrain)
##  [1] "cyl"      "disp"     "hp"       "drat"     "wt"       "qsec"    
##  [7] "vs"       "am"       "gear"     "carb"     "high.mpg"
names(dtest)
##  [1] "mpg"           "cyl"           "disp"          "hp"           
##  [5] "drat"          "wt"            "qsec"          "vs"           
##  [9] "am"            "gear"          "carb"          "reg.mpg.pred" 
## [13] "high.mpg"      "high.mpg.pred"

In dtrain, we have the dependent variable high.mpg and the same independent variables as before.

In dtest, we see some of the leftover additional variables from before when we did linear regression, but that’s fine. It won’t mess up our predictions when we plug the testing data into the new logistic regression model.

4.2 Run regression

We are now ready to create our logistic regression model!

logit <- glm(high.mpg ~ ., data = dtrain, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Above, we did the same thing we did when we were doing linear regression, with the following changes:

  • The result is stored as logit instead of reg.
  • The function used to make the regression model is called glm instead of just lm.
  • Now the dependent variable is high.mpg instead of mpg.
  • We added the argument family = "binomial", which tells the computer that we want a binary logistic regression.

Note that we did get a warning message, but this is likely due to the small sample size we are using right now. When you repeat this procedure in your assignment, this should not be a problem. If you do happen to get a warning like this, it is fine for you to ignore it and proceed as if you did not.

Now that our logistic regression result is stored as logit, we can use it to make predictions on the testing data.

4.3 Make predictions

Like we did before, let’s use the all-important predict function to make predictions on the testing data:

dtest$logit.mpg.pred <- predict(logit, newdata = dtest, type = "response")

Above, we created a new variable called logit.mpg.pred which stores, for each testing car, the predicted probability that it has high mpg.
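A brief aside about type = "response": by default, predict on a glm model returns values on the log-odds (link) scale, and the type = "response" argument is what converts them into probabilities between 0 and 1. You can see the difference by running both versions:

head(predict(logit, newdata = dtest))  # default: log-odds scale
head(predict(logit, newdata = dtest, type = "response"))  # probabilities between 0 and 1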

4.4 Check accuracy

Below, you’ll see how we can make different confusion matrices to see how useful our predictive model is.

4.4.1 Actual and predicted values

Let’s start by inspecting the predicted probabilities that we just calculated:

dtest[c("high.mpg", "logit.mpg.pred")]
##                  high.mpg logit.mpg.pred
## Hornet 4 Drive          0   1.357050e-07
## Duster 360              0   1.185287e-12
## Merc 230                1   1.000000e+00
## Merc 280C               0   2.220446e-16
## Toyota Corona           0   1.000000e+00
## Pontiac Firebird        0   2.220446e-16
## Fiat X1-9               1   1.000000e+00
## Volvo 142E              0   1.000000e+00

Above, on the left, we see high.mpg, which contains the true values of whether each testing car has high or low mpg. On the right, we see logit.mpg.pred, which contains each car’s predicted probability of having high mpg.

We can also visualize this in a box-plot:

boxplot(dtest$logit.mpg.pred ~ dtest$high.mpg)

Above, we can see the predicted probabilities grouped by the true category. From the table, both cars that truly have high mpg (Merc 230 and Fiat X1-9) were given predicted probabilities of essentially 1, so both were predicted correctly. Of the six cars that truly have low mpg, four were given predicted probabilities of essentially 0, which is correct, while two (Toyota Corona and Volvo 142E) were given predicted probabilities of essentially 1, which is incorrect.

4.4.2 Confusion matrix

Now we’ll turn back to confusion matrices. To make a confusion matrix, we have to decide which probability threshold we would like to use to classify a car as having high or low mpg. We’ll start with 0.5 as a cutoff and we will use the same coding approach that we did when discretizing our results with linear regression, earlier.

Here is the code to make a binary variable for our predictions, using the probabilities that we already have:

dtest$high.mpg.pred <- ifelse(dtest$logit.mpg.pred > 0.5, 1, 0)

Now we have a new binary variable in our dtest data called high.mpg.pred. We can use it to make a confusion matrix:

with(dtest,table(high.mpg,high.mpg.pred))
##         high.mpg.pred
## high.mpg 0 1
##        0 4 2
##        1 0 2

Interpretation:

  • 6 cars truly have low mpg (top row). Out of these, 4 were correctly predicted to have low mpg and 2 were incorrectly predicted to have high mpg.
  • 2 cars truly have high mpg (bottom row). Both were correctly predicted to have high mpg; none were incorrectly predicted to have low mpg.

Like before, we can change the cutoff threshold easily. Below, we change it to 0.75, keeping the rest of the code the same:

dtest$high.mpg.pred <- ifelse(dtest$logit.mpg.pred > 0.75, 1, 0)

And again we create a confusion matrix:

with(dtest,table(high.mpg,high.mpg.pred))
##         high.mpg.pred
## high.mpg 0 1
##        0 4 2
##        1 0 2

This time, the results happen to be exactly the same. But with more complicated and larger datasets, the results would potentially change when the threshold is adjusted.
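If you would like to inspect several thresholds at once, here is a small loop (my own sketch, not from the original tutorial) that prints a confusion matrix for each cutoff in a list:

for (cutoff in c(0.25, 0.5, 0.75)) {
  pred <- ifelse(dtest$logit.mpg.pred > cutoff, 1, 0)  # classify at this cutoff
  cat("cutoff =", cutoff, "\n")
  print(table(dtest$high.mpg, pred))  # rows = actual, columns = predicted
}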

4.5 Variable importance

Using the varImp function from the caret package, as shown below, we can rank how important each independent variable was in helping us make a prediction in our logistic regression:

if(!require(caret)) install.packages('caret')
library(caret)

varImp(logit)
##           Overall
## cyl  1.560529e-05
## disp 3.024212e-06
## hp   1.096038e-05
## drat 1.607646e-05
## wt   1.792627e-06
## qsec 2.373447e-06
## vs   4.929542e-06
## am   7.151318e-06
## gear 1.279677e-06
## carb 6.282875e-06

Above, drat has the highest importance score, just ahead of cyl. Note that all of these scores are extremely small, which is likely related to the warning we saw when fitting this model, so this particular ranking should be interpreted with caution.

Later, if we wanted, we could try to make a predictive regression model using only the few most important variables as independent variables.

4.6 Logistic regression wrap-up

Above, you saw how to use binary logistic regression to make predictions of a binary (discrete) dependent variable (high.mpg) on individual observations (cars). You then saw how to check how useful those predictions are. This is an example of what we call classification, because cars were classified (sorted) into groups. It was never our goal to predict a continuous outcome. It was only our goal to sort observations into groups.

Next, we’ll briefly compare the linear and logistic regression results.

5 Compare linear and logistic predictions

Using a simple table, we can easily compare how well our linear and logistic regression models made predictions for us. We should also compare the confusion matrices that we created above to help us decide which model is best, especially if we had a larger dataset.

The stored dtest dataset has conveniently retained all of the data we need. Here’s the code:

dtest[c("mpg","reg.mpg.pred", "logit.mpg.pred")]
##                   mpg reg.mpg.pred logit.mpg.pred
## Hornet 4 Drive   21.4     21.36493   1.357050e-07
## Duster 360       14.3     15.51104   1.185287e-12
## Merc 230         22.8     32.26509   1.000000e+00
## Merc 280C        17.8     20.47901   2.220446e-16
## Toyota Corona    21.5     29.01088   1.000000e+00
## Pontiac Firebird 19.2     15.16593   2.220446e-16
## Fiat X1-9        27.3     29.15442   1.000000e+00
## Volvo 142E       21.4     26.00033   1.000000e+00

Above, for example, we see that the Toyota Corona has a true value (mpg) of 21.5. The linear regression predicted it to be 29.0, and the logistic regression gave it a probability of essentially 1 of having high mpg. In other words, both models overestimated this particular car.
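To compare the two models on more equal footing, one option (the variable names lin.class and logit.class below are mine) is to discretize the linear model’s predictions at 22 and the logistic model’s probabilities at 0.5, and then look at the resulting classifications side by side:

dtest$lin.class <- ifelse(dtest$reg.mpg.pred > 22, 1, 0)  # linear model, discretized at 22 mpg
dtest$logit.class <- ifelse(dtest$logit.mpg.pred > 0.5, 1, 0)  # logistic model, cut at probability 0.5
dtest[c("high.mpg", "lin.class", "logit.class")]  # actual group vs. both models' classifications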

Again, it is up to you to decide which model is best for your purposes.

6 Assignment

In this week’s assignment, you are required to answer all of the questions below, which relate to using linear and logistic regression to conduct supervised machine learning. Please show R code and R output for all questions that require you to do anything in R. It is recommended that you submit this assignment as an HTML, Word, or PDF file that is “knitted” through RStudio.

If you have data of your own, it is strongly recommended that you use it this week instead of the provided data for this assignment. You can also request a meeting to help you prepare your data or provide any other assistance that might be useful.

If you do want to use provided data, you should again use the student-por.csv dataset that we have used before. This is in D2L, if you don’t already have it downloaded.

Here’s how you can load the data into R:[2]

df <- read.csv("student-por.csv")

6.1 Prepare and divide data

In this first part of the assignment, you will prepare the data for further analysis.

Task 1: Divide your data into training and testing datasets.

Task 2: Double-check to make sure that you did this correctly and fix any errors.

Task 3: Specify a dependent variable of interest. This should be a continuous numeric variable.[3] Use the variable G3 as your dependent variable if you are using the provided student-por.csv data.

Task 4: Write out the research question that your predictive analysis will help you answer and why that answer might be useful to know.

6.2 Linear regression

Now you will use linear regression to make predictions.

Keep in mind that at any point, if you would like to produce a list of all variables in your dataset such that they are separated by a + and can easily be pasted into a regression formula, you can run the following code:

(b <- paste(names(df), collapse="+"))

Then just copy-paste the resulting output and modify it as you see fit, replacing df with the name of your own dataset if needed. It is a way to avoid typing out all of your variable names.
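For example, if you are using the student-por.csv data and have already created a training dataset called dtrain, the pasted list could be trimmed down into a formula like this (the particular variables here are hypothetical choices on my part):

reg.example <- lm(G3 ~ studytime + failures + absences, data = dtrain)  # a trimmed formula built from the pasted list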

Task 5: Train a linear regression model to predict your dependent variable of interest and calculate the RMSE for it. For now, you will calculate RMSE on only the training data.

Task 6: Make predictions on your testing data.

Task 7: Make a scatterplot showing actual and predicted values.

Task 8: Show a confusion matrix that you created using a selected cutoff threshold. Please provide an interpretation of the confusion matrix.

Task 9: Make a second confusion matrix showing predictions made at a different cutoff threshold. Please provide an interpretation of the confusion matrix. You can make as many confusion matrices as you would like.

Task 10: Manually[4] calculate the accuracy of at least one of your models, based on the confusion matrix.

Task 11: Manually calculate the specificity of the same confusion matrix.

Task 12: Manually calculate the sensitivity of the same confusion matrix.

Task 13: How did the accuracy, sensitivity, and specificity change as you changed the cutoff threshold used to make the confusion matrix? You do not need to give the exact numbers when you answer this.

Task 14: Which metric—selecting from accuracy, sensitivity, and specificity—is most important for answering this particular predictive question?

Task 15: Create a ranking of variable importance for this model. Do the most important variables make sense to you as the most important ones?

Task 16: Modify the linear regression model you have been using such that there are fewer variables used to make the prediction (if you want, you can refer to the variable importance ranking when you do this). Train this new model and then test it again using the same procedure we used above, using a scatterplot, confusion matrices, accuracy, sensitivity, and specificity. To accomplish this task, you can simply make a copy of your code and slightly modify it. Re-use the same training-testing data split that you used earlier.

Task 17: Calculate and compare the RMSE of your two linear regression models (calculated on the testing data, of course).

6.3 Logistic regression

Turning to logistic regression, you will repeat the process above using a dependent variable that is already binary or by creating a binary variable out of the same continuous variable that you used above for linear regression. Once you create your new binary dependent variable, be sure to remove the original continuous numeric dependent variable from your dataset. We would not want it to be accidentally included as an independent variable.

Task 18: Create a logistic regression model to predict your dependent variable of interest.

Task 19: Make predictions on your testing data and show a confusion matrix that you created using a selected probability cutoff threshold. Please provide an interpretation of the confusion matrix.

Task 20: Make a second confusion matrix showing predictions made at a different cutoff threshold. You can make as many confusion matrices as you would like. Please provide an interpretation of the confusion matrix.

Task 21: Manually calculate the accuracy of at least one of your models, based on the confusion matrix.

Task 22: Manually calculate the specificity of the same confusion matrix.

Task 23: Manually calculate the sensitivity of the same confusion matrix.

Task 24: How did the accuracy, sensitivity, and specificity change as you changed the cutoff threshold used to make the confusion matrix? You do not need to give the exact numbers when you answer this.

Task 25: Which metric—selecting from accuracy, sensitivity, and specificity—is most important for answering this particular predictive question?

Task 26: Create a ranking of variable importance for this model. Do the most important variables make sense to you as the most important ones?

Task 27: Modify the logistic regression model you have been using such that there are fewer variables used to make the prediction (if you want, you can refer to the variable importance ranking when you do this). Train this new model and then test it again using the same procedure we used above (make a few confusion matrices at different cutoffs). Compare the initial and new models, in 2–4 sentences.

6.4 Compare models and make new predictions

Now we’ll practice choosing a predictive model that is best for our data.

Task 28: Compare the linear and logistic regression results that you just created. Would you prefer linear or logistic regression for making predictions on your dependent variable (on fresh data)? Be sure to explain why you made this choice, including the metrics and numbers you used.

Task 29: Using your favorite model that you created, make predictions for your dependent variable on new data, for which the dependent variable is unknown. Present the results in a confusion matrix. This confusion matrix will of course only contain predicted values. If you are using the student-por.csv data, you can click here to download the new data file called student-por-newfake.csv, which is also in D2L, and make predictions on it. If you are using data of your own, you can a) use real new data that you have, b) make fake data to practice, or c) ask for suggestions. The true value of the dependent variable for the new data must be unknown.

6.5 Logistics

Task 30: Write any feedback you have about this week’s content/assignment or anything else at all (optional).

Task 31: Please write any questions you have this week (optional).

Task 32: Please submit your final version of this assignment to the appropriate assignment drop box in D2L.


  [1] I know we already did this earlier in our linear regression code, but this step is so important that I want to demonstrate it again, in this new context. This code was slightly incorrect until July 8, 2021.

  [2] Of course, make sure you have properly set your working directory first!

  [3] Alternatively, you can specify two different dependent variables: one for use with linear regression and another for use with logistic regression.

  [4] Write out the formula for the calculation yourself and then plug in the appropriate values and find the answer.