The purpose of this lab is to further explore the Method of Least Squares as a supervised learning technique using real datasets. Round all decimals to 4 digits (unless instructed otherwise).
In the next few exercises we will use the swiss dataset. This dataset is built into most versions of R. First, type “swiss” into the console to view the data. Then read about the dataset here: https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/swiss
Let’s explore a series of linear regression models that predict the Education variable from other variables in the dataset.
Build a linear regression model that uses all available variables to predict Education. Report the Adjusted R-squared for this model.
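As a minimal sketch of how such a model might be fit, assuming the built-in swiss data frame is available (the names full_model and adj_r2 are illustrative, not part of the lab):

```r
# Fit Education on every other column of swiss; "." stands for "all other variables"
full_model <- lm(Education ~ ., data = swiss)

# summary() stores the Adjusted R-squared in its adj.r.squared component
adj_r2 <- summary(full_model)$adj.r.squared
round(adj_r2, 4)
```

The same summary() component can be read from any lm object, which is convenient when comparing several candidate models.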
Now, use forward selection based on Adjusted R-squared to decide which variables should be included in the model. The Adjusted R-squared for your final model is ____.
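One possible way to automate forward selection is sketched below. The helper forward_select() is hypothetical (it is not a built-in R function), and it assumes a greedy rule: at each step add the single variable that raises Adjusted R-squared the most, and stop when no variable raises it. The procedure covered in class may break ties or stop differently.

```r
# Hypothetical helper: greedy forward selection based on Adjusted R-squared
forward_select <- function(response, data) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  best_adj_r2 <- 0  # Adjusted R-squared of the intercept-only model
  while (length(candidates) > 0) {
    # Adjusted R-squared of the current model plus each remaining candidate
    scores <- sapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    # stop when no candidate improves on the current model
    if (max(scores) <= best_adj_r2) break
    best <- names(scores)[which.max(scores)]
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
    best_adj_r2 <- max(scores)
  }
  list(variables = selected, adj_r2 = best_adj_r2)
}

result <- forward_select("Education", swiss)
result$variables        # variables in the order they were added
round(result$adj_r2, 4) # Adjusted R-squared of the selected model
```

Working through the steps by hand first is still worthwhile, since it shows why each variable enters the model.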
There are two variables that do not appear linearly related to Education. They are _______ and _______ (in alphabetical order).
The final model from Exercise 2 predicts that a province with Fertility = 80.2, Agriculture = 17, Examination = 15, Catholic = 9.96, and Infant.Mortality = 22.2 will have an education level of ___%.
The MSE for the final model in Exercise 2 is ___.
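A short sketch of one way to compute an MSE from a fitted model. It assumes MSE means the average squared residual (sum of squared residuals divided by the number of observations); some texts divide by the residual degrees of freedom instead, so use the definition from class. The model example_model is only a stand-in for the final model.

```r
# Stand-in model for illustration; substitute the final model from the exercise
example_model <- lm(Education ~ Examination, data = swiss)

# Average of the squared residuals (observed minus fitted values)
mse <- mean(residuals(example_model)^2)
round(mse, 4)
```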
The relationship between Education and Examination looks slightly non-linear. Modify the final model from Exercise 2 by including Examination^2 (the square of Examination) in the model. What is the Adjusted R-squared now?
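A squared term can be added inside an R formula by wrapping it in I(). The sketch below uses a deliberately small model with Examination as the only predictor (the name quad_model is illustrative); the exercise's model would keep its other variables as well.

```r
# I() makes ^ mean ordinary squaring rather than formula crossing
quad_model <- lm(Education ~ Examination + I(Examination^2), data = swiss)

names(coef(quad_model))            # includes a coefficient for I(Examination^2)
summary(quad_model)$adj.r.squared  # Adjusted R-squared of this illustrative model
```

Without I(), the formula Examination^2 is interpreted as an interaction expansion and collapses back to Examination, so no squared term would be fit.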
Many modern ML applications involve predicting a categorical output variable from input variables. These problems are often called “classification problems” since the goal is to correctly guess (“classify”) a non-numeric output variable. A number of more advanced models have been developed that perform well on these kinds of problems (see: Logistic Regression, Linear and Quadratic Discriminant Analysis, Support Vector Machines, and Neural Networks, to name a few). Least Squares Regression is generally not preferred for classification tasks because it tends to be inferior to these models, both in accuracy and in desirable statistical properties. That said, least squares is a good place to start with classification since it’s intuitive and we already know how to build these kinds of models.
Let’s start with a fairly simple problem recently presented on the ML competition website Kaggle. Read an overview of the problem at https://www.kaggle.com/c/titanic/overview. Click the data tab to learn about the types of variables within the dataset.
Our goal is to build a linear regression model that takes values of input variables, such as Sex and Age, and uses these values to predict whether or not the passenger survived when the ship sank. For each passenger, the Survived variable takes a value of “1” if that passenger survived, and “0” if they did not. That is, there are two possible values (“lived” and “died”) that the categorical output variable Survived can take, and we are representing those values with “1” and “0”, respectively, in the dataset.
First, download the train.csv dataset included with this lab. This is the same dataset as train.csv posted on Kaggle. We won’t be able to use the test.csv dataset to test our model because the competition has not yet closed, which means that Kaggle hasn’t yet revealed the values of the Survived variable for test.csv. We will separate train.csv into test and training data as we did in class.
Open an R Script file to perform the following exercises, and upload train.csv (from Canvas or Kaggle) into the same project that has the R Script file you just opened. Then type the commands
titanic_data <- read.table("train.csv", header = TRUE, sep = ",")
titanic_data = titanic_data[complete.cases(titanic_data),]
titanic_data$Pclassf = as.factor(titanic_data$Pclass) # Creates new variable Pclassf which is categorical instead of numeric
to load the dataset into your working environment as a data frame, remove incomplete cases from the data, and create the categorical variable Pclassf.
Set the seed at 2, then build a training dataset called titanic_train using 80% of the data from titanic_data. Put the remaining 20% of the data into a test dataset called titanic_test.
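set.seed(2)
n = nrow(titanic_data) # number of complete cases (714 for this dataset)
train_indexes = sample(1:n, floor(0.8*n)) # randomly choose 80% of the row numbers
titanic_train = titanic_data[train_indexes,] # selected rows form the training data
titanic_test = titanic_data[-train_indexes,] # remaining rows form the test data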
Using the dim() function you find the dimensions of titanic_train to be __ by __ which means that there are __ passengers and __ variables (including Pclassf) for each passenger in our training dataset.
Use titanic_train to fit a linear model called titanic_lm to predict Survived from the Pclassf, Sex, Age, SibSp, Parch, Fare, and Embarked variables. Report the Adjusted R-squared for this model.
R-squared (and Adjusted R-squared) can be interpreted as the proportion of variation in the output variable that is explained by variation in the input variables. This means that the model from the previous exercise is able to explain _____ (less, more) than half of the variation in Survived.
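To make this interpretation concrete, the sketch below computes R-squared and Adjusted R-squared by hand and compares them with the values reported by summary(). It uses the built-in swiss data and a one-predictor model purely for illustration; the object names are not part of the lab.

```r
fit <- lm(Education ~ Examination, data = swiss)

# Unexplained variation (residual sum of squares) and total variation in the output
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((swiss$Education - mean(swiss$Education))^2)

# R-squared: share of the total variation that the model explains
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p
n <- nrow(swiss)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2, summary(fit)$r.squared)          # the two values should agree
c(adj_r2, summary(fit)$adj.r.squared)  # the two values should agree
```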
A 20-year-old man in Southampton wins a 3rd class ticket (valued at $8) in a poker game. He’s traveling without relatives. Use the predict() function to predict whether or not he survived. The predicted value of Survived is ___. (Hint: be sure to put quotes on the values of Pclassf, Sex, and Embarked, since these variables are categorical.)
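The general pattern for predicting a single new case with categorical inputs is sketched below. The data frame toy is invented for illustration (it is not the Titanic data), and the names toy_lm, new_case, and pred are assumptions; only the shape of the predict() call carries over to the exercise.

```r
# Invented toy data, for illustration only
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 0, 0, 1, 1),
  Sex      = c("female", "male", "female", "male", "female", "male", "male", "female"),
  Age      = c(22, 35, 28, 40, 19, 50, 31, 45)
)
toy_lm <- lm(Survived ~ Sex + Age, data = toy)

# New cases go in a data frame whose column names match the model's inputs;
# categorical values are quoted, numeric values are not
new_case <- data.frame(Sex = "male", Age = 20)
pred <- predict(toy_lm, newdata = new_case)
pred
```

Every input variable used in the model needs a column in the new data frame, otherwise predict() stops with an error about a missing object.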
We know that actual values of Survived are either “0” or “1” but our linear model gives predicted values that are not necessarily “0” or “1” (this is an obvious downside to using a linear regression model to perform a classification task). Thus, we need a decision rule to decide whether a predicted value of Survived is indicating “0” or “1.” A common decision rule is to take values that are over 0.5 to be “1,” and values less than or equal to 0.5 to be “0.”
Using this decision rule means the model is predicting that the man described in Exercise 10 ___ (lived, died).
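The 0.5 decision rule can be applied to a whole vector of predictions at once. A minimal sketch with made-up predicted values (the vector names are illustrative):

```r
# Made-up predicted values; a linear model can even go below 0 or above 1
predicted <- c(0.12, 0.50, 0.51, 0.97, -0.08)

# Values strictly over 0.5 become 1; values less than or equal to 0.5 become 0
labels <- ifelse(predicted > 0.5, 1, 0)
labels  # 0 0 1 1 0
```

Note that a prediction of exactly 0.5 is classified as 0 under this rule.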
So far we have used MSE to evaluate how well our model performs on test data. The MSE is not a good measure of how well classification models perform because we don’t use the actual predicted values, but rather an interpretation of them (see previous 2 exercises). A better measure of model performance is to compute a “confusion matrix” that gives the accuracy of the model’s predictions. A confusion matrix will show the total number of predicted “0” and “1” values and compare these to the actual number of “0” and “1” values in the test data (“Reference”). Then the model’s “Accuracy” is computed as the number of total predictions it got right, divided by the total predictions it made (which is the number of cases in our test data).
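The idea can be sketched in base R before bringing in any package. The two vectors below are made up for illustration; conf_table and accuracy are illustrative names.

```r
# Made-up actual values and 0/1 predictions for ten cases
actual    <- c(1, 0, 0, 1, 1, 0, 0, 1, 0, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 1, 0, 1)

# Cross-tabulate predictions (rows) against the reference values (columns)
conf_table <- table(Prediction = factor(predicted, levels = c(0, 1)),
                    Reference  = factor(actual, levels = c(0, 1)))
conf_table

# Correct predictions sit on the diagonal of the table
accuracy <- sum(diag(conf_table)) / sum(conf_table)
accuracy  # 0.7
```

Here 7 of the 10 predictions match the reference values, so the accuracy is 0.7; the off-diagonal cells count the two kinds of mistakes separately.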
Install and load the “caret” package. Uncomment and run the first line only once to install the package; run the second line at the start of each R session to load it:
# install.packages("caret")
library(caret)
The following function takes a linear model and a dataset as arguments and returns a confusion matrix. The decision_rule argument is optional; if no value is entered it defaults to 0.5.
confusion_matrix <- function(linear_model, test_data, decision_rule = 0.5){
  # predict a numeric value of Survived for each row of the test data
  p = predict(linear_model, test_data)
  # apply the decision rule: predictions over the cutoff become 1, all others 0
  p_hat = ifelse(p > decision_rule, 1, 0)
  # compute the confusion matrix using the caret package; fixing the factor
  # levels keeps both classes present even if only one class is predicted
  conf_m <- confusionMatrix(data = factor(p_hat, levels = c(0, 1)),
                            reference = factor(test_data$Survived, levels = c(0, 1)),
                            positive = "1")
  return(conf_m)
}
Use the confusion_matrix() function to find the accuracy of the titanic_lm model on the test data. The accuracy is __.
What is the model’s accuracy when predicting values of Survived for the training dataset?
We seek a model that is as accurate as possible when making predictions for the test data. A common way to improve model performance is to “tune” it: what parameters can we adjust to see if different values improve accuracy? Though it isn’t an explicit parameter of titanic_lm, an obvious choice is the decision rule: choosing predicted values to be interpreted as either 0 or 1 based on a cutoff of 0.5 makes intuitive sense since it’s halfway between 0 and 1, but there’s no reason we have to use 0.5 as our cutoff.
Tune the model by trying different values of decision_rule (using confusion_matrix()) to see which improves its accuracy on the test data. What decision_rule value gives the highest accuracy on the test data? Report your answer to the nearest hundredth.
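A search over cutoffs can be sketched as a simple grid. The example below is self-contained and uses made-up predictions and labels rather than the Titanic model; the names cutoffs, accuracy, and best_cutoff are illustrative.

```r
# Made-up predicted values and true 0/1 labels, for illustration only
predicted <- c(0.15, 0.35, 0.42, 0.48, 0.55, 0.62, 0.30, 0.71, 0.44, 0.90)
actual    <- c(0,    0,    1,    1,    1,    1,    0,    1,    0,    1)

# Candidate cutoffs in steps of 0.01
cutoffs <- seq(0.01, 0.99, by = 0.01)

# Accuracy at each cutoff: share of cases where the 0/1 label matches the truth
accuracy <- sapply(cutoffs, function(d) mean(ifelse(predicted > d, 1, 0) == actual))

# which.max() returns the first cutoff that reaches the highest accuracy
best_cutoff <- cutoffs[which.max(accuracy)]
best_cutoff
max(accuracy)
```

With the lab's own objects, the accuracy for a given cutoff d could be read from the caret result as confusion_matrix(titanic_lm, titanic_test, d)$overall["Accuracy"] inside the same kind of loop. Several cutoffs often tie for the best accuracy, so it is worth inspecting the whole accuracy vector rather than only the first maximum.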