1) Conceptual Questions
This first question is conceptual and written responses are expected. For each item below, indicate whether the appropriate method would be classification or regression, and whether we are most interested in inference or prediction. Please include a written sentence or two explaining why you made this choice. Also, indicate what n and p are for each section.
(a) A dataset contains data for 350 manufacturing companies in Europe. The following variables are included in the data for each company: industry, number of employees, salary of the CEO, and total profit. We are interested in learning which variables impact the CEO's salary.
This problem calls for a regression method because we are modeling a continuous variable (CEO salary). We are interested in inference because we want to learn about the relationships between our independent variables and the salaries: we want to find out which company characteristics are associated with variation in CEO pay, as opposed to just predicting how much CEOs are paid. In this case n is 350, the number of companies in the dataset, and p is 3, the number of predictors (industry, number of employees, and total profit), since CEO salary is the response.
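To make the inference framing concrete, here is a minimal sketch of how the model might look in R (the data frame companies and its column names are hypothetical, not part of the provided data):
# Hypothetical data: one row per company (n = 350) with columns
# industry, employees, profit, and ceo_salary (the response)
companies <- read.csv("companies.csv")
# Regression aimed at inference: fit a linear model and examine the
# estimated coefficients rather than generating predictions
fit <- lm(ceo_salary ~ industry + employees + profit, data = companies)
summary(fit)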
(b) A market research company is hired to help a startup analyze their new product. We want to know whether the product will be a success or failure. Similar products exist on the market so the market research company gathers data on 31 similar products. The company records the following data points about each previously launched product: price of the product, competition price, marketing budget, ten other variables, and whether or not it succeeded or failed.
This is a classification problem because we are predicting a discrete outcome; in this case the outcome is binary: success or failure. We are more interested in whether the new product will succeed than in which variables are indicative of success, so this is a prediction problem. The value of n is 31, the number of similar products we have data on to train the model. Here p is 13: price of the product, competition price, marketing budget, and the ten other recorded variables (success or failure is the response).
(c) Every week data is collected for the world stock market in 2012. The data points collected include the % change in the dollar, the % change in the market in the United States, the % change in the market in China, and the % change in the market in France. We are interested in predicting the % change in the dollar in relation to the changes every week in the world stock markets.
This is an instance in which we would want to use a regression method. Like the first problem, we are modeling a continuous outcome, in this case the weekly % change in the dollar. However, unlike the first problem, this calls for prediction rather than inference: we want to use the weekly market data to forecast the % change in the dollar, not to understand which markets are most strongly linked to it. In this case n = 52 because we have one observation (consisting of several attributes) for each week of the year, and p = 3: the % change in the US market, the % change in the Chinese market, and the % change in the French market.
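To contrast the prediction framing with the inference example above, a minimal sketch (the data frame markets and its column names are hypothetical):
# Hypothetical weekly data: one row per week (n = 52) with columns
# usd_change (the response), us_change, china_change, and france_change
markets <- read.csv("markets_2012.csv")
# Regression aimed at prediction: fit on the observed weeks, then forecast
fit <- lm(usd_change ~ us_change + china_change + france_change, data = markets)
# Forecast the % change in the dollar for a new week of market movements
new_week <- data.frame(us_change = 0.4, china_change = -1.1, france_change = 0.2)
predict(fit, newdata = new_week)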
2) Applied Question
For this second applied question you will develop several predictive models. These should be written in R or Python and the code should be submitted. The models will predict whether a car will get high or low gas mileage. The question will be based on the Cars_mileage data set that is a part of this repo.
Within this writeup I have included my code, divided up into chunks and interspersed among written responses and plots. You can find my original script within this GitHub repository. Before I get to each individual component of this question, I run some preliminary functions that prepare the script to run.
Load packages, global options, plot formatting template:
library(ggplot2)
library(caret)
library(tidyr)
library(C50)
library(gmodels)
library(class)
library(randomForest)
library(plyr)
# global options
options(scipen = "999")
options(stringsAsFactors = FALSE)
# define standard theme for all plots
plot_theme <- function(base_size = 12) {
  theme_minimal() %+replace%
    theme(
      axis.title = element_text(hjust = 1, face = 'italic'),
      strip.text = element_text(face = 'italic'),
      plot.title = element_text(hjust = 0.5, face = 'bold', vjust = 0.25),
      legend.title = element_text(face = 'italic')
    )
}
Read cars dataset from GitHub:
cars <- read.csv("https://raw.githubusercontent.com/eneedham/data-analyst-data-test/master/Cars_mileage.csv")
(a) Create a binary variable that represents whether the car's mpg is above or below its median. Above the median should be represented as 1. Name this variable mpg_binary.
Create the new binary variable:
cars$mpg_binary <- ifelse(cars$mpg > median(cars$mpg), 1, 0) %>% as.factor()
table(cars$mpg_binary)
#
# 0 1
# 206 191
Since I am splitting at the median mpg value, there should be roughly the same number of 1's and 0's. However, this is not the case.
nrow(cars[which(cars$mpg == median(cars$mpg)), ])
# [1] 9
It looks like there are 9 cars with exactly the median gas mileage, which explains the discrepancy: the condition is mpg > median, so all 9 of those cars are coded 0.
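For comparison, a quick check (not part of the original script) of how the split would look if cars exactly at the median were instead coded as 1:
# Coding ties at the median as 1 shifts those 9 cars to the other group
table(ifelse(cars$mpg >= median(cars$mpg), 1, 0))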
(b) Which of the other variables seem most likely to be useful in predicting whether a car's mpg is above or below its median? Describe your findings and submit visual representations of the relationship between mpg_binary and other variables.
I want to look at the relationships between each of the individual predictor variables and mpg_binary to determine which are likely to be useful predictors. Since mpg_binary is a categorical variable, I cannot easily use scatterplots. Instead I will compare the distribution of each predictor variable between cars with different mpg_binary values. If there is a dramatic difference in, for example, the distribution of horsepower between cars with an mpg_binary value of 1 and those with a value of 0, then horsepower is likely to be a valuable predictor.
Reshape the dataset into long form:
cars_long <- cars %>% gather(var, val, -c(mpg_binary, name, mpg))
cars_long$val <- cars_long$val %>% as.numeric()
I have a mix of continuous and discrete variables which I need to plot differently. I have chosen to treat 'year' and 'origin' as discrete predictors, so I split them into their own data frame.
cars_cont <- cars_long[!(cars_long$var %in% c("origin", "year")), ]
cars_disc <- cars_long[cars_long$var %in% c("origin", "year"), ]
I visualize the relationships between mpg_binary and both discrete predictors using normalized bar charts. You can see that there is a clear relationship between each of them and mpg_binary. Cars from origin '1' are much more likely to be in the lower half of mpg than those from the other two locations. Similarly, cars from the '80s are much more likely to have an mpg_binary value of 1 than those from other years.
ggplot(cars_disc, aes(factor(val), fill = mpg_binary)) +
geom_bar(position = "fill", color = "white") +
facet_wrap(~var, scales = "free") +
xlab("value") + ylab("proportion") +
ggtitle("Discrete predictor variables") + plot_theme()Next I visualize the relationships between the dependent variable and each of the potential continuous predictors. I do this with two different types of plots (violin plots and boxplots) that both show comparitive distributions but do so in slightly different ways.
Violin plots:
ggplot(cars_cont, aes(mpg_binary, val, fill = mpg_binary)) +
geom_violin(color = "white") +
facet_wrap(~var, scales = "free") +
ggtitle("Continuous predictor variables") + plot_theme()Boxplots:
ggplot(cars_cont, aes(mpg_binary, val, color = mpg_binary)) +
geom_boxplot() +
facet_wrap(~var, scales = "free") +
ggtitle("Continuous predictor variables") + plot_theme()We can see right off the bat that displacement is likely to be a strong predictor. More than 3/4 of cars with an mpg_binary value of 0 have a displacement greater than the upper outlier threshold of cars with a value of 1. By contrast, there does not seem to be much difference in acceleration between the two groups. This indicates that acceleration is not likely to be as useful of a predictor variable. I can use independent 2-group t-tests to evaluate whether or not these differences are statistically significant.
displacement:
t.test(cars$displacement ~ cars$mpg_binary)
##
## Welch Two Sample t-test
##
## data: cars$displacement by cars$mpg_binary
## t = 22.406, df = 265.27, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 140.5644 167.6490
## sample estimates:
## mean in group 0 mean in group 1
## 267.6748 113.5681
acceleration:
t.test(cars$acceleration ~ cars$mpg_binary)
##
## Welch Two Sample t-test
##
## data: cars$acceleration by cars$mpg_binary
## t = -6.8001, df = 394.76, p-value = 0.00000000003882
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.286227 -1.260751
## sample estimates:
## mean in group 0 mean in group 1
## 14.70243 16.47592
As you can see, the differences in means are statistically significant for both predictors. However, the standardized difference in group means is much larger for displacement than for acceleration. Overall, most of these variables seem likely to be useful predictors, but displacement and cylinders are at the top of the list.
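To quantify that comparison, here is a short sketch of a standardized effect size (Cohen's d with a pooled standard deviation); the helper function is mine and is not part of the original script:
# Cohen's d: difference in group means divided by the pooled standard deviation
cohens_d <- function(x, group) {
  g0 <- x[group == "0"]
  g1 <- x[group == "1"]
  pooled_sd <- sqrt(((length(g0) - 1) * var(g0) + (length(g1) - 1) * var(g1)) /
                      (length(g0) + length(g1) - 2))
  (mean(g0) - mean(g1)) / pooled_sd
}
cohens_d(cars$displacement, cars$mpg_binary)  # expected to be large
cohens_d(cars$acceleration, cars$mpg_binary)  # expected to be much smaller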
(c) Split the data into a training set and a test set.
Before splitting the data into training and test sets I need to do some quick cleaning. I change the data type for a few variables. I also remove the name and mpg variables: name is a unique identifier, and mpg directly determines mpg_binary, so keeping either would amount to leakage and lead to overfitting. Finally, I omit a few rows that have NA values, which would cause errors when I try to run models.
cars$horsepower <- cars$horsepower %>% as.numeric()
cars$year <- cars$year %>% as.factor()
cars$origin <- cars$origin %>% as.factor()
dat <- cars[ , !names(cars) %in% c("mpg", "name")]
dat <- na.omit(dat)
I partition the data into a training set with a random sample of 75% of the data and a test set with the remaining 25%.
set.seed(123)
in_train <- createDataPartition(dat$mpg_binary, p = .75, list = FALSE)
train_set <- dat[in_train, ]
test_set <- dat[-in_train, ]
Let's see how many observations we have in the training and test sets:
paste("training set observations:", nrow(train_set), sep = " ")## [1] "training set observations: 295"
paste("test set observations:", nrow(test_set), sep = " ")## [1] "test set observations: 97"
(d) Perform two of the following in order to predict mpg_binary: LDA (Linear discriminant analysis), QDA (Quadratic discriminant analysis), Logistic regression, KNN (K-nearest neighbors), Decision Tree, Random Forests, Gradient Boosting, LASSO (Least Absolute Shrinkage and Selection Operator), Elastic Net Method, Ridge regression analysis
For each of the two you select: What is the test error(s) of the model obtained? Do you have any other observations?
I will use two simple machine learning methods (K-Nearest Neighbors and Decision Trees) to predict whether each car will be in the upper or lower half of the dataset in terms of gas mileage. I have broken this task up into three different sub-sections. First I manually build both models using default parameters. I train each on the training data and predict for the test set. Next, I use caret's model tuning functions to tune each model's parameters and see if I can improve predictive accuracy. Finally, I use a 10-fold cross-validation on the whole dataset in order to evaluate the performance of both models.
K-NEAREST NEIGHBORS:
The K-nearest neighbors (KNN) algorithm assigns a value to each observation (in this case a classification of 1 or 0) based on the values of a predefined number (k) of its nearest neighbors in feature space. I use a default value of k to predict mpg_binary for the test set observations based on their training set neighbors.
test_set$knn_prediction <- knn(train_set, test_set[,c(1:8)], train_set$mpg_binary)
test_set$knn_accuracy <- ifelse(test_set$knn_prediction == test_set$mpg_binary, 'correct', 'incorrect')
Look at the test set error and accuracy rates:
table_knn <- prop.table(table(test_set$knn_accuracy)) %>% print() %>% unname()
#
# correct incorrect
# 0.8350515 0.1649485
Even using an exceedingly simple model, I was able to correctly classify mpg_binary for 83.5% of the test set cars, which corresponds to a test error rate of 16.5%. We can get more detail about these results by looking at a confusion matrix.
CrossTable(test_set$mpg_binary, test_set$knn_prediction)
#
#
# Cell Contents
# |-------------------------|
# | N |
# | Chi-square contribution |
# | N / Row Total |
# | N / Col Total |
# | N / Table Total |
# |-------------------------|
#
#
# Total Observations in Table: 97
#
#
# | test_set$knn_prediction
# test_set$mpg_binary | 0 | 1 | Row Total |
# --------------------|-----------|-----------|-----------|
# 0 | 41 | 10 | 51 |
# | 10.737 | 10.093 | |
# | 0.804 | 0.196 | 0.526 |
# | 0.872 | 0.200 | |
# | 0.423 | 0.103 | |
# --------------------|-----------|-----------|-----------|
# 1 | 6 | 40 | 46 |
# | 11.904 | 11.190 | |
# | 0.130 | 0.870 | 0.474 |
# | 0.128 | 0.800 | |
# | 0.062 | 0.412 | |
# --------------------|-----------|-----------|-----------|
# Column Total | 47 | 50 | 97 |
# | 0.485 | 0.515 | |
# --------------------|-----------|-----------|-----------|
#
#
One takeaway from this matrix is that my model overpredicted slightly for cars with an mpg_binary value of 1 (cars in the top 50% of gas mileage). There were 10 instances in which the model predicted a 1 for an observed 0 and only 6 where the opposite was the case.
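One caveat worth noting before moving on: class::knn computes Euclidean distances on the raw columns, so predictors measured on large scales (such as weight) can dominate the distance calculation. A minimal sketch of a scaled variant, assuming the same train_set and test_set objects and using only the numeric predictors:
# Standardize numeric predictors using the training set's means and standard deviations
num_vars <- c("cylinders", "displacement", "horsepower", "weight", "acceleration")
train_scaled <- scale(train_set[, num_vars])
test_scaled <- scale(test_set[, num_vars],
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))
scaled_prediction <- knn(train_scaled, test_scaled, train_set$mpg_binary)
mean(scaled_prediction == test_set$mpg_binary)  # accuracy with scaled features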
DECISION TREE:
The second method that I used was a decision tree. This algorithm builds a set of hierarchical rules based on the attributes of the training data and uses them to determine predictions for test data. As I did with the knn model, I used default parameters to train a model on the training set.
dt_model <- C5.0(train_set[-8], train_set$mpg_binary)
summary(dt_model)
#
# Call:
# C5.0.default(x = train_set[-8], y = train_set$mpg_binary)
#
#
# C5.0 [Release 2.07 GPL Edition] Wed Mar 15 04:09:20 2017
# -------------------------------
#
# Class specified by attribute `outcome'
#
# Read 295 cases (8 attributes) from undefined.data
#
# Decision tree:
#
# displacement > 181: 0 (126/3)
# displacement <= 181:
# :...year in {70,74,80,81,82}: 1 (79)
# year in {71,72,73,75,76,77,78,79}:
# :...weight <= 2210: 1 (37)
# weight > 2210:
# :...weight > 2870: 0 (12)
# weight <= 2870:
# :...year in {71,72,73}: 0 (17/3)
# year in {75,76,77,78,79}: 1 (24/5)
#
#
# Evaluation on training data (295 cases):
#
# Decision Tree
# ----------------
# Size Errors
#
# 6 11( 3.7%) <<
#
#
# (a) (b) <-classified as
# ---- ----
# 149 5 (a): class 0
# 6 135 (b): class 1
#
#
# Attribute usage:
#
# 100.00% displacement
# 57.29% year
# 30.51% weight
#
#
# Time: 0.0 secs
You can see that this decision tree only uses three variables to determine an outcome for each observation. This model accurately predicts about 96% of the training data but I'm more interested in how it performs for observations it wasn't trained on. I can use the model to predict for the test set:
test_set$dt_prediction <- predict(dt_model, test_set)
test_set$dt_accuracy <- ifelse(test_set$dt_prediction == test_set$mpg_binary, 'correct', 'incorrect')
table_dt <- prop.table(table(test_set$dt_accuracy)) %>% print() %>% unname()
#
# correct incorrect
# 0.8865979 0.1134021
CrossTable(test_set$mpg_binary, test_set$dt_prediction)
#
#
# Cell Contents
# |-------------------------|
# | N |
# | Chi-square contribution |
# | N / Row Total |
# | N / Col Total |
# | N / Table Total |
# |-------------------------|
#
#
# Total Observations in Table: 97
#
#
# | test_set$dt_prediction
# test_set$mpg_binary | 0 | 1 | Row Total |
# --------------------|-----------|-----------|-----------|
# 0 | 46 | 5 | 51 |
# | 12.735 | 14.716 | |
# | 0.902 | 0.098 | 0.526 |
# | 0.885 | 0.111 | |
# | 0.474 | 0.052 | |
# --------------------|-----------|-----------|-----------|
# 1 | 6 | 40 | 46 |
# | 14.120 | 16.316 | |
# | 0.130 | 0.870 | 0.474 |
# | 0.115 | 0.889 | |
# | 0.062 | 0.412 | |
# --------------------|-----------|-----------|-----------|
# Column Total | 52 | 45 | 97 |
# | 0.536 | 0.464 | |
# --------------------|-----------|-----------|-----------|
#
#
This time the accuracy rate was 88.7%, with an 11.3% error rate, an improvement over the knn model. Misclassifications were nearly evenly split between false positives and false negatives (5 versus 6). I created a bar chart to compare accuracy/error rates between my decision tree and knn models.
results_table <- rbind(table_knn, table_dt) %>% data.frame()
names(results_table) <- c("correct", "incorrect")
results_table$model <- c("knn", "dt")Helper functions to reshape plot accuracy data for each model, plot out of sample predictive accuracy for the first two models:
to_long_format <- function(accuracy_variables) {
  test_long <- test_set[ , accuracy_variables] %>% gather(model, outcome)
  test_long$model <- gsub("_accuracy", "", test_long$model)
  test_long$outcome <- factor(test_long$outcome, levels = c('incorrect', 'correct'))
  return(test_long)
}
accuracy_plot <- function() {
  ggplot(test_long, aes(x = model, fill = outcome)) +
    geom_bar(stat = "count", position = "fill", alpha = 0.5) +
    ylab("accuracy rate") + ggtitle("Comparative model accuracy rates") +
    geom_label(data = results_table, aes(x = model, y = correct, label = round(correct, 3)),
               fill = "white") +
    plot_theme()
}
test_long <- to_long_format(c("knn_accuracy", "dt_accuracy"))
accuracy_plot()
My next step was to try to improve predictive accuracy for each model by tuning its parameters. Both of these model types have parameters that can be adjusted to affect the outputs. Using the train function in the caret package, I was able to cycle through combinations of parameter inputs to find the one that produced the most accurate model. I used a repeated cross-validation scheme to train each model, meaning that each combination of parameters was repeatedly evaluated on held-out resamples to estimate its accuracy.
tune_model <- function(model_type) {
  tune_method <- trainControl(method = "repeatedcv", number = 10, repeats = 5, selectionFunction = "best")
  model <- train(mpg_binary ~ ., train_set, method = model_type, trControl = tune_method)
  return(model)
}
Tune knn model:
tuned_knn_mod <- tune_model("knn")
tuned_knn_mod
# k-Nearest Neighbors
#
# 295 samples
# 7 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 266, 265, 266, 264, 265, 266, ...
# Resampling results across tuning parameters:
#
# k Accuracy Kappa
# 5 0.8653126 0.7309017
# 7 0.8654245 0.7313502
# 9 0.8641802 0.7288719
#
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was k = 7.
The knn model only has one parameter to tune: the number (k) of neighbors that it uses. The differences in accuracy were slight but it settled on an optimal value of 7 for k. I used this new model to again predict for the test set.
test_set$knn_tuned_prediction <- predict(tuned_knn_mod, test_set)
test_set$knn_tuned_accuracy <- ifelse(test_set$knn_tuned_prediction == test_set$mpg_binary, "correct", "incorrect")
table_knn_tuned <- prop.table(table(test_set$knn_tuned_accuracy)) %>% print() %>% unname()
#
# correct incorrect
# 0.8556701 0.1443299
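Note that caret's default grid only tried k = 5, 7, and 9. A wider search can be supplied through train's tuneGrid argument; a minimal sketch, using the same repeated cross-validation scheme (results not shown):
# Search a wider range of odd k values instead of caret's default grid
knn_grid <- expand.grid(k = seq(3, 25, by = 2))
tune_method <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
wide_knn_mod <- train(mpg_binary ~ ., data = train_set, method = "knn",
                      trControl = tune_method, tuneGrid = knn_grid)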
Tune decision tree model:
tuned_dt_mod <- tune_model("C5.0")
tuned_dt_mod
# C5.0
#
# 295 samples
# 7 predictor
# 2 classes: '0', '1'
#
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 265, 265, 266, 265, 266, 265, ...
# Resampling results across tuning parameters:
#
# model winnow trials Accuracy Kappa
# rules FALSE 1 0.8904553 0.7810404
# rules FALSE 10 0.9108461 0.8217638
# rules FALSE 20 0.9122024 0.8244935
# rules TRUE 1 0.8726867 0.7458431
# rules TRUE 10 0.8945947 0.7894447
# rules TRUE 20 0.8993534 0.7989934
# tree FALSE 1 0.8890760 0.7781498
# tree FALSE 10 0.9120875 0.8242953
# tree FALSE 20 0.9108691 0.8217509
# tree TRUE 1 0.8713993 0.7433017
# tree TRUE 10 0.8925028 0.7853417
# tree TRUE 20 0.8918361 0.7839329
#
# Accuracy was used to select the optimal model using the largest value.
# The final values used for the model were trials = 20, model = rules
# and winnow = FALSE.
The decision tree model is slightly more complex, with three parameters to tune: trials, model, and winnow. Note that the function cycles through 12 different candidate models before settling on the rules-based model that does not winnow predictors and uses 20 boosting trials. Again, I use this model to predict for the test set and evaluate its accuracy.
test_set$dt_tuned_prediction <- predict(tuned_dt_mod, test_set)
test_set$dt_tuned_accuracy <- ifelse(test_set$dt_tuned_prediction == test_set$mpg_binary, "correct", "incorrect")
table_dt_tuned <- prop.table(table(test_set$dt_tuned_accuracy)) %>% print() %>% unname()
#
# correct incorrect
# 0.92783505 0.07216495
Compile and plot accuracy/error rates for all four models:
results_table_2 <- rbind(table_knn_tuned, table_dt_tuned) %>% data.frame()
names(results_table_2) <- c("correct", "incorrect")
results_table_2$model <- c("knn_tuned", "dt_tuned")
results_table <- rbind(results_table, results_table_2)
test_long <- to_long_format(c("knn_accuracy", "knn_tuned_accuracy", "dt_accuracy", "dt_tuned_accuracy"))
accuracy_plot()
This chart shows the out-of-sample accuracy and error rates for both tuned models as well as the first two that I trained. Both models improved slightly through the tuning process. The tuned decision tree model accurately classified 92.8% of test set cars, for an error rate of 7.2%. By contrast, the tuned knn model had an accuracy rate of 85.6% and an error rate of 14.4%.
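Accuracy alone hides how the errors are distributed between the two classes. caret's confusionMatrix function reports sensitivity, specificity, and kappa alongside accuracy; a short sketch for the tuned models, assuming the prediction columns created above:
# Class-level performance for the tuned models (sensitivity, specificity, kappa, etc.)
confusionMatrix(test_set$knn_tuned_prediction, test_set$mpg_binary, positive = "1")
confusionMatrix(test_set$dt_tuned_prediction, test_set$mpg_binary, positive = "1")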
In the final section I evaluated the accuracy and generalisability of both models. In order to validate them I used 10-fold cross-validation. In this process I divided the dataset up into 10 random samples (or folds). For each fold I trained a model using the tuned parameters and the data from all of the other folds, then used the resulting model to predict for the fold in question. With a set of accuracy rates from ten different samples, I can determine how accurately and consistently each model performed over a series of random samples.
Partition full dataset into 10 folds:
set.seed(123)
folds <- createFolds(dat$mpg_binary, k = 10)
This function performs one fold of a 10-fold cross-validation using either a k-nearest neighbors or decision tree model, based on a 'model_type' global variable.
cross_validation <- function(x) {
  # x is the vector of row indices that make up the held-out fold
  training <- dat[-x, ]
  testing <- dat[x, ]
  if (model_type == 'knn') {
    # knn using the tuned value of k
    testing$prediction <- knn(training, testing, training$mpg_binary, k = tuned_knn_mod$bestTune$k)
    accuracy <- ifelse(testing$prediction == testing$mpg_binary, 1, 0)
  }
  else if (model_type == 'C5.0') {
    # C5.0 using the tuned number of trials (tree form rather than the rules form)
    dt_model <- C5.0(mpg_binary ~ ., data = training, trials = tuned_dt_mod$bestTune[1,1], winnow = FALSE, rules = FALSE)
    testing$prediction <- predict(dt_model, testing)
    accuracy <- ifelse(testing$prediction == testing$mpg_binary, 1, 0)
  }
  # return the proportion of fold observations classified correctly
  return(sum(accuracy) / length(accuracy))
}
Apply the function over a list of all 10 folds using the KNN model:
model_type <- 'knn'
cv_knn <- ldply(folds, cross_validation)
cv_knn$model <- model_type
Decision tree model cross-validation:
model_type <- 'C5.0'
cv_dt <- ldply(folds, cross_validation)
cv_dt$model <- 'dt'
Bind the results together into a data frame, adjust variables, and summarize the average and standard deviation of accuracy per model:
cv_results <- rbind(cv_knn, cv_dt)
cv_results$fold <- cv_results$.id %>% as.factor() %>% as.numeric()
cv_results <- ddply(cv_results, ~model, summarise, avg_acc = round(mean(V1), 3),
sd_acc = round(sd(V1), 3)) %>% join(cv_results, ., by = 'model')
Plot a time series of model accuracy across folds:
ggplot(cv_results, aes(x = fold, y = V1, ymin = (V1 - sd_acc), ymax = (V1 + sd_acc), color = model)) +
geom_hline(aes(yintercept = avg_acc), linetype = 'dashed', alpha = 0.5) +
geom_line() +
geom_pointrange() +
facet_wrap(~model, scales = 'fixed', ncol = 1) +
ggtitle("10-Fold Cross-Validation Results") +
ylab("predictive accuracy rate") + xlab("fold #") +
geom_text(data = cv_results[c(1,20), ],
aes(x = 7.5, y = .78, label = paste("avg. accuracy: ", avg_acc, sep = ""), hjust = "left")) +
geom_text(data = cv_results[c(1,20), ],
aes(x = 7.5, y = .75, label = paste("st. dev. accuracy: ", sd_acc, sep = ""), hjust = "left")) +
geom_text(data = cv_results[c(1,20), ],
aes(x = .5, y = avg_acc + .01, label = paste('avg. accuracy'), hjust = "left", vjust = "below"),
color = 'black') +
scale_x_continuous(breaks = c(1:10), limits = c(0.5, 10)) +
plot_theme()
Across the 10 folds, the decision tree model had a higher average accuracy rate (92.6% compared to 86.2%) and a lower standard deviation of accuracy (3.5% versus 6.4%) than the knn model. This means that the former not only made fewer errors but was also more consistent. Of the two models, the decision tree appears to be the more generalisable and the less likely to overfit.
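Because both models were evaluated on the same 10 folds, one additional check (with the usual caveats about dependence between cross-validation folds) is a paired t-test on the per-fold accuracy rates; a quick sketch, assuming the cv_knn and cv_dt data frames created above:
# Paired comparison of per-fold accuracy rates (folds are in the same order)
t.test(cv_dt$V1, cv_knn$V1, paired = TRUE)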