Setup:

This is the setup required for the final exam.

# Load standard libraries
library(tidyverse)
library(caTools)
library(dplyr)
library(ggplot2)
library(pROC)
library(randomForest)
library(MASS)

Problem 1

For this portion of the exam you will use data from: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is currently being used in a DrivenData competition. You can find more information about the dataset on the DrivenData or UCI ML Repo websites.

The heartData.csv “dataset is from a study of heart disease that has been open to the public for many years. The study collects various measurements on patient health and cardiovascular statistics, and of course makes patient identities anonymous” (DrivenData: Predicting Heart Disease).

#Loading the data set
heartData <- read.csv("heartData.csv")

#Getting the structure of the data
#str(heartData)

#First few rows of the data set
head(heartData)

heartData is a data.frame containing 180 observations of 16 variables:

#Based on the above structure of the heart disease data, the following
#variables (columns) could be coerced into the factor type

#Coercing chest_pain_type, num_major_vessels, resting_ekg_results, heart_disease_present
#into factor type
#heartData$chest_pain_type <- as.factor(heartData$chest_pain_type)
#heartData$num_major_vessels <- as.factor(heartData$num_major_vessels)
#heartData$resting_ekg_results <- as.factor(heartData$resting_ekg_results)
#heartData$heart_disease_present <- as.factor(heartData$heart_disease_present)
#heartData$fasting_blood_sugar_gt_120_mg_per_dl <- as.factor(heartData$fasting_blood_sugar_gt_120_mg_per_dl)
#heartData$slope_of_peak_exercise_st_segment <- as.factor(heartData$slope_of_peak_exercise_st_segment)
#heartData$exercise_induced_angina <- as.factor(heartData$exercise_induced_angina)

#As per the variable description given here http://archive.ics.uci.edu/ml/datasets/Heart+Disease,
#for the sex variable, 1=Male and 0=Female. Therefore, coercing the sex variable:
#heartData$sex <- factor(heartData$sex, levels = c(0,1), labels = c("F","M"))
#str(heartData)
Part (a): Describe the participants (you must include a written response with your code output). Use descriptive, summarization, and exploratory techniques to describe the participants in the study. For example, what proportion of participants are female? What is the average age of participants?
#Finding proportion of female/male patients
cat("\nPercentage of Female participants =", prop.table(table(heartData$sex))[1]*100, "%",
    "\nPercentage of Male participants =", prop.table(table(heartData$sex))[2]*100, "%")
## 
## Percentage of Female participants = 31.11111 % 
## Percentage of Male participants = 68.88889 %
#Finding proportion of female/male patients that have heart disease


#Creating one consolidated table of participants that have heart disease
#categorised by gender
t1 <- table(heartData[heartData$sex==0,]$heart_disease_present)
t2 <- table(heartData[heartData$sex==1,]$heart_disease_present)
t <- rbind(t1,t2)

#Setting the row names and column names of the table.
#Note: table() orders the levels of heart_disease_present as 0 (absent) then 1 (present),
#so the first column counts participants without heart disease
colnames(t) <- c("Does not have heart disease", "Has heart disease")
rownames(t) <- c("Females", "Males")

#printing the table
cat("\n\nNumber of Males/Females having hear disease:\n\n")
## 
## 
## Number of Males/Females having heart disease:
t
##         Does not have heart disease Has heart disease
## Females                          45                11
## Males                            55                69
#Summary of the age of participants
cat("\n\nSummary of the age of participants:\n") 
## 
## 
## Summary of the age of participants:
summary(heartData$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   48.00   55.00   54.81   62.00   77.00
ggplot(heartData, aes(x=age)) +
  geom_histogram(fill="light green", alpha=0.5, position="identity")+ 
  geom_vline(aes(xintercept = mean(age)), heartData)+
  labs(title="Histogram of age of participants", x="Age of patients", y="Count")+
  geom_text(mapping=aes(x=mean(age), y=8.5, label=paste("Mean=", signif(mean(age),4))),
            size=4, angle=90, vjust=-0.4, hjust=0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#making a new data frame to store the mean ages of the male and female
#patients so that it can be included in the ggplot facet_wrap() call
meanAge <- data.frame(sex = c(0, 1), 
                      age = c(mean(heartData[heartData$sex==0,]$age),
                              mean(heartData[heartData$sex==1,]$age)))

#ggplot of age of the patients categorised by sex
agePlot <- ggplot(heartData, aes(x=age, fill=as.factor(sex))) +
  geom_histogram(alpha=0.5, position="identity")+ 
  geom_vline(aes(xintercept = age), meanAge)+
  facet_wrap(~as.factor(sex))+
  labs(title="Histogram of age of participants categorised by sex", 
       x="Age of patients", y="Count", fill="Sex")+
  geom_text(meanAge, mapping=aes(x=age, y=8.5, label=paste("Mean=", signif(age,4))),
            size=4, angle=90, vjust=-0.4, hjust=0)+
  scale_fill_discrete(breaks=c("0", "1"),
                      labels=c("0 - Female", "1 - Male"))

#display the agePlot
agePlot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#ggplot of resting blood pressure of the patients categorised by sex
ggplot(heartData, aes(y=resting_blood_pressure, fill=as.factor(sex)))+
  geom_boxplot()+
  labs(title="Boxplot of the particpants' resting blood pressure 
       categorised by sex", y="Resting Blood Pressure")+
  facet_wrap(~as.factor(sex))+
  scale_fill_discrete(breaks=c("0", "1"),
                      labels=c("0 - Female", "1 - Male"))

#ggplot of serum cholesterol of the patients categorised by sex
ggplot(heartData, aes(y=serum_cholesterol_mg_per_dl, fill=as.factor(sex)))+
  geom_boxplot()+
  labs(title="Boxplot of the participants' serum cholesterol 
       categorised by sex", y="Serum Cholesterol (in mg/dl)")+
  facet_wrap(~as.factor(sex))+
  scale_fill_discrete(breaks=c("0", "1"),
                      labels=c("0 - Female", "1 - Male"))

#ggplot of oldpeak_eq_st_depression of the patients categorised by sex
ggplot(heartData, aes(y=oldpeak_eq_st_depression, fill=as.factor(sex)))+
  geom_boxplot()+
  labs(title="Boxplot of the participants' oldpeak = ST exercise-induced depression relative to rest 
       categorised by sex", y="Measure")+
  facet_wrap(~as.factor(sex))+
  scale_fill_discrete(breaks=c("0", "1"),
                      labels=c("0 - Female", "1 - Male"))

#ggplot of maximum heart rate achieved in participants categorised by sex
ggplot(heartData, aes(y=max_heart_rate_achieved, fill=as.factor(sex)))+
  geom_boxplot()+
  labs(title="Boxplot of the particpants' maximum heart rate achieved 
       categorised by sex", y="Max Heart Rate Achieved")+
  facet_wrap(~as.factor(sex))+
  scale_fill_discrete(breaks=c("0", "1"),
                      labels=c("0 - Female", "1 - Male"))

From the above analyses, it is clear that:

  • About 31% of participants were female and 69% were male.
  • 11 of the 56 women had heart disease, compared with 69 of the 124 men.
  • The mean age of the participants was 54.81 years.
  • The mean age of the female participants was 56.88 years.
  • The mean age of the male participants was 53.88 years.
Part (b): We want to explore the characteristics of participants who have been diagnosed with heart disease. The data includes a binary outcome variable heart_disease_present. Describe what the values within this variable signify.
  • The variable heart_disease_present indicates whether a patient was diagnosed with heart disease.
  • A value of 1 indicates that the patient has heart disease, while a value of 0 indicates that the patient does not. A quick tabulation of this variable is sketched below.
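A minimal check (not part of the original output) of how the two values are distributed in the data; table() simply counts how many participants fall into each class:

#Tabulating the outcome variable: 0 = no heart disease, 1 = heart disease present
table(heartData$heart_disease_present)
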
Part (c): Describe the potential explanatory (independent, predictor) variables in this dataset.
  • In this dataset, the outcome (or dependent) variable is heart_disease_present, since we would like to determine/predict whether a participant actually has heart disease.
  • All the remaining variables can be classified as potential explanatory variables.
Part (d): Split your data into a training and test set based on a 70-30 split, in other words, 70% of the observations will be in the training set (you do not need to create a validation set for this exercise).
  • To test our models, we divide the data into training and testing data sets in a 70:30 ratio.
  • The training dataset is used to train the models.
  • The testing dataset is used to test the models’ accuracy.
#Excluding columns names that are not required: X, patient_id
colNotRequired <- c(1,2)
heartData <- heartData[,-colNotRequired]

#Encoding the factor variable (thal) into a numeric variable where
#1 = fixed_defect, 2 = normal and 3 = reversible_defect.
#Wrapping in as.factor() first keeps this working even when the column is read
#in as character (the default behaviour in R >= 4.0)
heartData$thal <- as.numeric(as.factor(heartData$thal))

#caret package is required for the createDataPartition() function
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Set seed for reproducibility
set.seed(111)

#Splitting training data into training and testing data in 70:30 ratio.
inTrain = createDataPartition(heartData$heart_disease_present, p = 0.7)[[1]]

#training data set containing 70% of the total values
trainingHeartData = heartData[ inTrain,]

#testing data set containing the remaining 30% of the total values
testingHeartData = heartData[-inTrain,]
Part (e): Use an appropriate regression model to explore the relationship between having a diagnosis of heart disease (or not) and all other characteristics in your training data. Comment on which covariates seem to be predictive of having heart disease and which do not.
  • Since the outcome variable, heart_disease_present, is a binary variable, we will model the probability of the outcome being 1 or 0 (i.e. whether heart disease is diagnosed or not).

  • Note that I have not changed some of the variables (chest_pain_type, num_major_vessels, resting_ekg_results, heart_disease_present, sex) to factor because I will be performing logistic regression, where the factor variables would need to be numerically encoded anyway.

  • Therefore, we will use logistic regression to build our model. The code is given below:

#To build our model we use the glm function which is 
#used to fit generalized linear models, specified by 
#giving a symbolic description of the linear predictor 
#and a description of the error distribution denoted by the family argument.
#Therefore, we set family as binomial for a logistic regression model.

modelHeartData <- glm(heart_disease_present~., trainingHeartData, family = "binomial")
summary(modelHeartData)
## 
## Call:
## glm(formula = heart_disease_present ~ ., family = "binomial", 
##     data = trainingHeartData)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.56285  -0.39918  -0.05934   0.21864   2.19198  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                          -14.221195   5.678427  -2.504
## slope_of_peak_exercise_st_segment      2.113287   0.742373   2.847
## thal                                   1.534915   0.545912   2.812
## resting_blood_pressure                 0.054286   0.023159   2.344
## chest_pain_type                        0.649368   0.316170   2.054
## num_major_vessels                      1.401373   0.447963   3.128
## fasting_blood_sugar_gt_120_mg_per_dl  -2.091304   1.041256  -2.008
## resting_ekg_results                    0.215033   0.314428   0.684
## serum_cholesterol_mg_per_dl            0.006567   0.006422   1.023
## oldpeak_eq_st_depression               0.078256   0.380387   0.206
## sex                                    3.661764   1.186009   3.087
## age                                   -0.037790   0.039203  -0.964
## max_heart_rate_achieved               -0.036958   0.019295  -1.915
## exercise_induced_angina                1.323062   0.711395   1.860
##                                      Pr(>|z|)   
## (Intercept)                           0.01227 * 
## slope_of_peak_exercise_st_segment     0.00442 **
## thal                                  0.00493 **
## resting_blood_pressure                0.01907 * 
## chest_pain_type                       0.03999 * 
## num_major_vessels                     0.00176 **
## fasting_blood_sugar_gt_120_mg_per_dl  0.04460 * 
## resting_ekg_results                   0.49405   
## serum_cholesterol_mg_per_dl           0.30654   
## oldpeak_eq_st_depression              0.83700   
## sex                                   0.00202 **
## age                                   0.33507   
## max_heart_rate_achieved               0.05544 . 
## exercise_induced_angina               0.06291 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 173.114  on 125  degrees of freedom
## Residual deviance:  70.135  on 112  degrees of freedom
## AIC: 98.135
## 
## Number of Fisher Scoring iterations: 7

From the above analysis, based on the p-values (less than 0.05), the following predictor variables appear to affect the outcome (heart_disease_present):

  1. slope_of_peak_exercise_st_segment (strong)
  2. thal (strong)
  3. resting_blood_pressure (weak)
  4. chest_pain_type (weak)
  5. num_major_vessels (strong)
  6. fasting_blood_sugar_gt_120_mg_per_dl (weak)
  7. sex (strong)

Based on the p-values, the effect of the predictor variables on the outcome variable can be categorised into weak (0.04 <= p-value <= 0.05) and strong (p-value < 0.04). All other remaining variables, as per this model, cannot be used to predict heart disease because they are not statistically significant.

Part (f): Use an all subsets model selection procedure to obtain a “best” fit model for your training data. Is the model different from the full model you fit in part (e)? Which variables are included in the “best” fit model? (You might find the bestglm() function available in the bestglm package helpful)
#bestglm package is required for using the bestglm() function
library(bestglm)
## Loading required package: leaps
#Determining the bestglm model for the training data.
#The information criterion for this model is AIC because I do not know
#how to interpret the alternative criterion, i.e., the BIC. Also, it will be
#easier to compare it to the step-wise method for determining the best model
#using the stepAIC() function from the MASS package

#Note that the family here used is binomial as it is a binomial dataset.
#The default value of the family argument is gaussian
bestModel <- bestglm(trainingHeartData, IC="AIC", family = binomial)
## Morgan-Tatar search since family is non-gaussian.
cat("\n\nThe best model as per the bestmodel() function:\n\n")
## 
## 
## The best model as per the bestglm() function:
bestModel
## AIC
## BICq equivalent for q in (0.696477110936541, 0.87792447814999)
## Best Model:
##                                          Estimate Std. Error   z value
## (Intercept)                          -14.02547137 4.78252891 -2.932647
## slope_of_peak_exercise_st_segment      2.12917194 0.65176792  3.266764
## thal                                   1.44554007 0.51619751  2.800362
## resting_blood_pressure                 0.04734200 0.02115777  2.237570
## chest_pain_type                        0.58945552 0.29509086  1.997539
## num_major_vessels                      1.34112180 0.42143378  3.182284
## fasting_blood_sugar_gt_120_mg_per_dl  -2.21591478 0.97525903 -2.272129
## sex                                    3.41448066 1.03970280  3.284093
## max_heart_rate_achieved               -0.02929288 0.01708279 -1.714760
## exercise_induced_angina                1.27228511 0.68808217  1.849031
##                                         Pr(>|z|)
## (Intercept)                          0.003360853
## slope_of_peak_exercise_st_segment    0.001087844
## thal                                 0.005104527
## resting_blood_pressure               0.025249091
## chest_pain_type                      0.045766660
## num_major_vessels                    0.001461187
## fasting_blood_sugar_gt_120_mg_per_dl 0.023078691
## sex                                  0.001023112
## max_heart_rate_achieved              0.086389324
## exercise_induced_angina              0.064453369
cat("\n\nSummary of the best model:\n\n")
## 
## 
## Summary of the best model:
summary(bestModel$BestModel)
## 
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.54258  -0.38833  -0.07104   0.27455   2.19328  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          -14.02547    4.78253  -2.933  0.00336
## slope_of_peak_exercise_st_segment      2.12917    0.65177   3.267  0.00109
## thal                                   1.44554    0.51620   2.800  0.00510
## resting_blood_pressure                 0.04734    0.02116   2.238  0.02525
## chest_pain_type                        0.58946    0.29509   1.998  0.04577
## num_major_vessels                      1.34112    0.42143   3.182  0.00146
## fasting_blood_sugar_gt_120_mg_per_dl  -2.21591    0.97526  -2.272  0.02308
## sex                                    3.41448    1.03970   3.284  0.00102
## max_heart_rate_achieved               -0.02929    0.01708  -1.715  0.08639
## exercise_induced_angina                1.27229    0.68808   1.849  0.06445
##                                        
## (Intercept)                          **
## slope_of_peak_exercise_st_segment    **
## thal                                 **
## resting_blood_pressure               * 
## chest_pain_type                      * 
## num_major_vessels                    **
## fasting_blood_sugar_gt_120_mg_per_dl * 
## sex                                  **
## max_heart_rate_achieved              . 
## exercise_induced_angina              . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 173.114  on 125  degrees of freedom
## Residual deviance:  72.431  on 116  degrees of freedom
## AIC: 92.431
## 
## Number of Fisher Scoring iterations: 7
#This best model of the type lm/glm is adopted from this source:
#https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html

cat("\n\nFormula of the best model:\n\n")
## 
## 
## Formula of the best model:
formula(bestModel$BestModel)
## y ~ slope_of_peak_exercise_st_segment + thal + resting_blood_pressure + 
##     chest_pain_type + num_major_vessels + fasting_blood_sugar_gt_120_mg_per_dl + 
##     sex + max_heart_rate_achieved + exercise_induced_angina
## <environment: 0x7fd8b3049ff8>
  • From the above analysis, I have determined a “best” fit model using the best subsets regression technique, which takes into account all possible models, as opposed to the step-wise technique, which either removes the least significant variables or adds the most significant ones.

  • The formula used in the model is given above.

  • Yes, the “best” model obtained above is very similar to the glm model obtained in part (e). The variables slope_of_peak_exercise_st_segment, thal, num_major_vessels and sex are statistically significant, as inferred from the p-values. The estimate for each variable in this “best” model is approximately equal to the corresponding estimate in the previous model; the main difference is that resting_ekg_results, serum_cholesterol_mg_per_dl, oldpeak_eq_st_depression and age have been dropped.

Part (g): Interpret the model parameters of your model from part (f).

From the above model it can be interpreted that the following variables are statistically significant (p-value < 0.05) and appear to have a substantial impact on the outcome variable (heart_disease_present):

  1. slope_of_peak_exercise_st_segment
  2. thal
  3. num_major_vessels
  4. sex

“Estimate” gives the coefficient of the variable on the log-odds scale: holding the other predictors fixed, a one-unit increase in the variable changes the log-odds of having heart disease by the estimate. “Std. Error” measures the uncertainty (precision) of that estimate and shrinks as the sample size grows. The “p-value” indicates how statistically significant the estimate is, i.e. how unlikely an estimate this large would be if the true coefficient were zero.
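Because the estimates are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. A small sketch, not part of the original output (confint.default() returns Wald-type confidence intervals):

#Odds ratios with approximate 95% confidence intervals for the "best" model
exp(cbind(OddsRatio = coef(bestModel$BestModel), confint.default(bestModel$BestModel)))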

Part (h): Use your test dataset and the predict function to obtain predicted probabilities of having heart disease for each case in the test data. Which model did you use for prediction and why? Interpret your results and use a visualization to support your interpretation.
#predicting the y_hat values for the best model calculated above
y_hatBestModel <- predict(bestModel$BestModel, testingHeartData, type="response")


#printing the predicted values
cat("\n\nPredicted probability values of the testing dataset:\n\n")
## 
## 
## Predicted probability values of the testing dataset:
y_hatBestModel
##            3            4            7           15           16 
## 0.9973728531 0.2034847502 0.0344948618 0.9406051498 0.3519723716 
##           18           19           22           23           24 
## 0.0043059863 0.9952160175 0.0432829012 0.8639336544 0.0046142318 
##           30           32           38           40           45 
## 0.0010161530 0.7940618564 0.8267032877 0.7398368159 0.9990901545 
##           49           52           58           60           62 
## 0.9981346791 0.0981655052 0.9623963777 0.0278936348 0.6906993298 
##           70           81           82           85           87 
## 0.0406228432 0.0008967578 0.9986197704 0.0025787815 0.2765928101 
##           89           91           94           96          103 
## 0.8470651556 0.0575775039 0.0142845576 0.0260925784 0.0434455751 
##          108          109          110          112          116 
## 0.1625940509 0.0175996519 0.9737840498 0.9505445747 0.0314488272 
##          117          118          121          124          132 
## 0.1128610836 0.5973134861 0.5484651465 0.9954362444 0.9570988478 
##          134          136          137          142          143 
## 0.9924752887 0.0518740762 0.9906652380 0.2323000724 0.0092950923 
##          144          153          163          164          166 
## 0.9728844860 0.2551556951 0.9964338164 0.7553452218 0.0448140036 
##          167          169          172          178 
## 0.0003217587 0.1395336144 0.0003729200 0.9806091797
# It is needed to choose a decision threshold t in order to label 
#the instances as positives or negatives. If the probability assigned 
#to the instance by the classifier is greater than t, it is labeled as 
#positive and if the probability is lower than the decision threshold, 
#it is labeled as negative.

#https://www.neuraldesigner.com/blog/methods-binary-classification


#In this question, we will assume the threshold to be 0.5. Therefore, 
#any probability above 0.5 is considered to be 1 (has heart disease) 
#otherwise 0 (does not have heart disease).

#Translating the prediction probabilities based on threshold value into 
#actual value for the best model
predictedY_hatValuesBestModel <- y_hatBestModel
predictedY_hatValuesBestModel[predictedY_hatValuesBestModel > 0.5] = 1
predictedY_hatValuesBestModel[predictedY_hatValuesBestModel <= 0.5] = 0

#confusion matrix only works with "factor" type. Hence we need to coerce vectors into factor type
#confusion matrix for Best Model
cat("\n\nConfusion matrix for Best Model:\n")
## 
## 
## Confusion matrix for Best Model:
confusionMatrix(
  factor(testingHeartData$heart_disease_present, levels=0:1), 
  factor(predictedY_hatValuesBestModel, levels=0:1)
  )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 25  5
##          1  4 20
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.7071, 0.9208)
##     No Information Rate : 0.537           
##     P-Value [Acc > NIR] : 4.712e-06       
##                                           
##                   Kappa : 0.6639          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8621          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.5370          
##          Detection Rate : 0.4630          
##    Detection Prevalence : 0.5556          
##       Balanced Accuracy : 0.8310          
##                                           
##        'Positive' Class : 0               
## 
library(pROC)
#finding the ROC for the most significant variable (sex)
#plot(roc(heart_disease_present ~ sex, data = testingHeartData))

#making a replica of the testing data set
dat <- testingHeartData

#adding an additional column of the probabilities of getting a 1 in the predicted output
dat$prob <- y_hatBestModel

#calculating the ROC of logistic regression
roc.hd <- roc(heart_disease_present ~ prob, data = dat)
cat("\n\nROC output for the best model:\n\n")
## 
## 
## ROC output for the best model:
roc.hd
## 
## Call:
## roc.formula(formula = heart_disease_present ~ prob, data = dat)
## 
## Data: prob in 30 controls (heart_disease_present 0) < 24 cases (heart_disease_present 1).
## Area under the curve: 0.8208
#plotting the ROC
plot(roc.hd, main="ROC curve for the best model")

  • I have used the “best” model to predict the probabilities of heart disease in the test dataset because it is a more finely tuned model than the generic glm model: the bestglm() function used to determine the “best” model considers all possible models and returns the best-scoring one. Therefore, the predictions from this “best” model can be expected to be more accurate than those of the generic glm model used previously.

  • Sensitivity is the true positive rate and specificity is the true negative rate.

  • Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.

  • Since the ROC curve (black) lies above the diagonal reference line (grey), the model achieves a higher true positive rate than a random classifier; that is, the predictions from our model are informative.

  • The larger the area under the ROC curve (AUC), the better the model. The current AUC is about 0.82, which corresponds to a satisfactory model. A common follow-up, sketched below, is to read off the threshold that best balances sensitivity and specificity.
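A minimal sketch (not part of the original analysis) of extracting the threshold that maximises Youden's index from the ROC object, using pROC's coords() function:

#Decision threshold that best balances sensitivity and specificity
#("best" uses Youden's index by default)
coords(roc.hd, "best", ret = c("threshold", "sensitivity", "specificity"))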

Problem 2

In this problem we will use the (red) wine quality dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality. More about this data (note we will only use the red wine dataset):

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: Web Link or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

Suppose you want to explore the relationship between wine quality and other characteristics of the wine. Follow the questions below to perform this analysis.

#Loading the data set
wineData <- read.csv("winequality-red.csv")

#Read with the default comma separator, this file collapses into a single column,
#because it is semicolon-delimited. Re-reading it with the sep parameter so the
#values are split into 12 columns
wineData <- read.csv("winequality-red.csv", sep = ";")


#First few rows of the data set
head(wineData)
#Getting the structure of the data
str(wineData)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
Part (a): Examine the bivariate relationships present in the data. Briefly discuss notable results. You might find the scatterplotMatrix() function available in the car package helpful.
#instead of using the scatterplotMatrix() function from the car package,
#I find ggpairs() from the GGally package more useful because, besides
#plotting the pairwise relationships between the variables, it also reports the
#correlation coefficient between the variables in the upper part of the
#plot matrix

#loading the GGally library required for the ggpairs method
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
#plotting the relationships within the data
ggpairs(wineData, lower = list(continuous = wrap("smooth", method = "lm")))

#Exploring the relationship between quality and other variables using the box-plots
for(i in 1:11)
{
  #getting the column data
  s <- wineData[,i]
  #getting the column name
  yLabel <- colnames(wineData)[i]
  #plotting the results
  g <- ggplot(wineData, aes(x=quality, y=s)) + 
  geom_point() + #plotting the points
  geom_boxplot(aes(group = quality), color="blue")+ #plotting the boxplots
    labs(title=paste("Quality vs",yLabel), x="Quality", y=yLabel) #plotting the names
  
  #printing the plots
  print(g)
}

  • From the above scatterplot matrix, we are mainly concerned with the relationship of the quality variable with the other variables. It is clear that a relatively strong correlation exists between quality and:
  1. Volatile Acidity (negative relation)
  2. Citric Acid (positive relation)
  3. Chlorides (negative relation)
  4. Sulphates (positive relation)
  5. Alcohol (positive relation)
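These correlations can also be read off numerically; a small sketch (not part of the original output) sorting each variable's correlation with quality:

#Correlation of every variable with quality, sorted from most positive to most negative
sort(cor(wineData)[, "quality"], decreasing = TRUE)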
Part (b): Fit a multiple linear regression model. How much variance in the wine quality do the predictor variables explain?
#For our subsequent analysis, we will divide the dataset into the training
#and testing data in 70:30 ratio. We will test the model on the train data
#and then predict the values, based on the model, on the test data

library(caret)
# Set seed for reproducibility
set.seed(11456)

#Splitting training data into training and testing data in 70:30 ratio.
inTrain2 = createDataPartition(wineData$quality, p = 0.7)[[1]]

#training data set containing 70% of the total values
trainingWineData = wineData[ inTrain2,]

#testing data set containing the remaining 30% of the total values
testingWineData = wineData[-inTrain2,]

#fitting a multiple linear model on the training set of wine data
modelWine <- lm(quality ~., trainingWineData)

#printing the summary of the linear model
summary(modelWine)
## 
## Call:
## lm(formula = quality ~ ., data = trainingWineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.69690 -0.37926 -0.04435  0.43417  1.95589 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          14.1899863 25.4146170   0.558 0.576725    
## fixed.acidity         0.0172472  0.0312112   0.553 0.580651    
## volatile.acidity     -1.2524461  0.1443816  -8.675  < 2e-16 ***
## citric.acid          -0.2733096  0.1755041  -1.557 0.119689    
## residual.sugar        0.0131848  0.0166756   0.791 0.429310    
## chlorides            -1.8290970  0.5220675  -3.504 0.000477 ***
## free.sulfur.dioxide   0.0046809  0.0025650   1.825 0.068281 .  
## total.sulfur.dioxide -0.0036525  0.0008732  -4.183 3.11e-05 ***
## density              -9.7054519 25.9505040  -0.374 0.708477    
## pH                   -0.4549061  0.2312662  -1.967 0.049430 *  
## sulphates             0.8849532  0.1357895   6.517 1.09e-10 ***
## alcohol               0.2748169  0.0316817   8.674  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6504 on 1108 degrees of freedom
## Multiple R-squared:  0.3669, Adjusted R-squared:  0.3606 
## F-statistic: 58.37 on 11 and 1108 DF,  p-value: < 2.2e-16

Interpreting the above model:

  • Holding all other variables constant, the quality of the wine will decrease by 1.25 for a unit increase in volatile acidity. The std. error is a measure of how precise the estimate is. The t-value is the estimate divided by its std. error. And finally, the p-value indicates how statistically significant the estimate is; if the p-value is more than 0.05, the estimate is not statistically significant. The coefficients of the other variables can be interpreted in the same way.

  • The variance in wine quality explained by the model is given by the R-squared value. The above model, which uses all the predictor variables, explains about 37% of the variance in the quality of the red wine (multiple R-squared = 0.3669). Thus, this is probably not a very good model. The R-squared values can be pulled directly from the model object, as sketched below.
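A minimal sketch (not part of the original output) of extracting the R-squared values from the fitted lm object:

#Proportion of variance in quality explained by the predictors
summary(modelWine)$r.squared      #multiple R-squared
summary(modelWine)$adj.r.squared  #adjusted R-squared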

Part (c): Evaluate the statistical assumptions in your regression analysis from part (b) by performing a basic analysis of model residuals and any unusual observations. Discuss any concerns you have about your model.
#arranging the plot to be made in a 2x2 matrix
par(mfrow=c(2,2))

#plotting the residual plots
plot(modelWine)

Statistical Assumptions in the above linear model:

  1. Linear relationship: There must be a linear relationship between the outcome variable and the independent variables. Looking at the scatterplots, we can say that there is an approximately linear relationship between quality (the response) and the other independent (predictor) variables.

  2. Multivariate Normality: The residuals are normally distributed. That is to say, the error terms in the linear model are normally distributed. This can be checked by interpreting the Q-Q plot. From the above Q-Q plot we can see that the residuals are roughly normally distributed.

  3. No Multicollinearity: The independent variables (predictors) are not highly correlated with each other. This can be checked using one of the following methods: the correlation matrix (no pair of predictors should have a correlation coefficient close to 1), tolerance, or the variance inflation factor (VIF; see the sketch after this list). From the scatterplot matrix obtained above, none of the pairwise correlations are close to 1, so multicollinearity does not appear to be a concern.

  4. Homoscedasticity: The variance of the error terms is similar across the values of the independent variables/predictors. This can be checked using the plot of standardized residuals vs fitted values, which shows whether the residuals are equally spread across all fitted values. Since the spread of the residuals is roughly constant across the fitted values, homoscedasticity appears to hold.

Thus, all the assumptions are satisfied and we can fit a multivariate linear model to our wine data set.
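As a follow-up to assumption 3, variance inflation factors give a quick numeric check for multicollinearity. A minimal sketch (assumes the car package is installed; it is not loaded elsewhere in this document):

#Variance inflation factors for the full linear model;
#values well below 5-10 suggest multicollinearity is not a problem
library(car)
vif(modelWine)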

From the above plots, we can determine:

  • That the model is probably not a very good model. This is also demonstrated by the fact that the R-squared value of the model is only about 36%, i.e., the model only explains about 36% of the variance in the outcome (quality of the wine).

  • The parallel lines in the residual plot are symbolic of the fact that the outcome (quality of the wine), can only take certain values. Looking at the data we found this to be true because the quality can only take numbers from 0 to 10.

  • Since the outcome (quality) is bounded, we are bound to get the parallel lines in our residual vs fitted plot irrespective the type of model we choose.

  • From the Normal Q-Q plot, we can ascertain that the residuals are approximately normally distributed.

  • From the scale-location plot it appears that homoscedasticity holds, since the spread of the standardized residuals is roughly constant across the fitted values.

  • From the residual-leverage plot, we can see that most observations lie within Cook's distance. There are some outliers, but they do not seem to have a profound impact on the model.

Part (d): Use a stepwise model selection procedure of your choice to obtain a “best” fit model. Is the model different from the full model you fit in part (b)? If yes, how so?
#Performing a step-wise regression using the stepAIC function
#and setting trace = F so that the intermediate models are not printed on the console.
modelWineBest <- stepAIC(modelWine, trace = F)

#printing the summary of the best model
summary(modelWineBest)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + citric.acid + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates + 
##     alcohol, data = trainingWineData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68643 -0.38170 -0.04511  0.43883  1.95618 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.8986960  0.5528314   8.861  < 2e-16 ***
## volatile.acidity     -1.2372563  0.1379533  -8.969  < 2e-16 ***
## citric.acid          -0.2187957  0.1433077  -1.527 0.127107    
## chlorides            -1.8860823  0.5023947  -3.754 0.000183 ***
## free.sulfur.dioxide   0.0050508  0.0025308   1.996 0.046211 *  
## total.sulfur.dioxide -0.0037273  0.0008364  -4.457 9.17e-06 ***
## pH                   -0.5478355  0.1601013  -3.422 0.000645 ***
## sulphates             0.8698493  0.1306037   6.660 4.29e-11 ***
## alcohol               0.2836661  0.0204106  13.898  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6498 on 1111 degrees of freedom
## Multiple R-squared:  0.3664, Adjusted R-squared:  0.3618 
## F-statistic:  80.3 on 8 and 1111 DF,  p-value: < 2.2e-16
  • Using the stepAIC() function, we found that the “best” model is quite similar to the full model in part (b). This “best” model keeps the statistically significant variables (p-values < 0.05) from the full model (volatile.acidity, chlorides, total.sulfur.dioxide, pH, sulphates and alcohol) plus citric.acid and free.sulfur.dioxide, while dropping fixed.acidity, residual.sugar and density.

  • The R-squared value remains largely the same in the “best” model. However, this model uses fewer variables (8 predictors instead of 11), so it is more parsimonious.

Part (e): Assess the generalizability of the model (from part (d)). Perform a 10-fold cross validation to estimate model performance. Report the results.

When evaluating models, we often want to assess how well they perform in predicting the target variable on different subsets of the data. One such technique is k-fold cross-validation, which partitions the data into k equally sized segments (called ‘folds’). One fold is held out for validation while the other k-1 folds are used to train the model, which is then used to predict the target variable in the held-out fold. This process is repeated k times, with the performance of each model on its hold-out set tracked using a performance metric such as accuracy. The most common variation of cross-validation is 10-fold cross-validation.

[Reference: https://www.r-bloggers.com/evaluating-logistic-regression-models/]
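As an alternative to the DAAG::cv.lm() approach used below, caret can run the same 10-fold cross-validation; a minimal sketch (not part of the original analysis):

#10-fold cross-validation of the stepwise model using caret
set.seed(11456)
ctrl <- trainControl(method = "cv", number = 10)
cvFit <- train(quality ~ volatile.acidity + citric.acid + chlorides +
                 free.sulfur.dioxide + total.sulfur.dioxide + pH +
                 sulphates + alcohol,
               data = trainingWineData, method = "lm", trControl = ctrl)
cvFit$results   #average RMSE, R-squared and MAE over the 10 folds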

#This library is required for the cv.lm() function which performs the
#k-fold cross-validation
library(DAAG)
## 
## Attaching package: 'DAAG'
## The following object is masked from 'package:MASS':
## 
##     hills
#performing the k-fold cross validation on the "best" model from part (d)
cvlm <- cv.lm(data = trainingWineData, form.lm = modelWineBest, m=10, printit=F)
## Warning in cv.lm(data = trainingWineData, form.lm = modelWineBest, m = 10, : 
## 
##  As there is >1 explanatory variable, cross-validation
##  predicted values for a fold are not a linear function
##  of corresponding overall predicted values.  Lines that
##  are shown for the different folds are approximate

#calculating the root mean squared error
cat("\n\nRoot mean squared error after performing the 10 fold cross validation:\n")
## 
## 
## Root mean squared error after performing the 10 fold cross validation:
sqrt(mean((cvlm$cvpred-cvlm$quality)^2))
## [1] 0.6531881
  • The cross-validated RMSE (about 0.65) is essentially the same as the residual standard error of the fitted model, which suggests the model will fit unseen data about as well as it fits the training data, i.e. it generalizes reasonably well.
Part (f): Fit a regression tree using the same covariates in your “best” fit model from part (d). Use cross validation to select the “best” tree.
#Tree library is required for the regression trees
library(tree)

#Using regression tree with the same covariates as determined from the "best" model
modelWineTree <- tree(quality ~ 
                        volatile.acidity + chlorides + free.sulfur.dioxide + 
                        total.sulfur.dioxide + pH + sulphates + alcohol, 
                      data=trainingWineData)

#Summary of the tree model
cat("\n\nSummary of Regression Tree model")
## 
## 
## Summary of Regression Tree model
summary(modelWineTree)
## 
## Regression tree:
## tree(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = trainingWineData)
## Variables actually used in tree construction:
## [1] "alcohol"              "sulphates"            "volatile.acidity"    
## [4] "total.sulfur.dioxide"
## Number of terminal nodes:  9 
## Residual mean deviance:  0.4179 = 464.3 / 1111 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.5520 -0.2619 -0.1738  0.0000  0.4484  1.9840
#Cross-validating the tree model. We use the default FUN argument,
#which is the deviance, as opposed to prune.misclass,
#which uses the classification error rate to guide the cross-validation and
#pruning process in classification problems

#The cv.tree() function reports the number of terminal nodes of each tree
#considered (size) as well as the corresponding error rate (deviance) and the
#value of the cost-complexity parameter used (k)
cv.wineData <- cv.tree(modelWineTree)

#printing the results of the cross-validation
cat("\n\nResults of cross Validation:\n")
## 
## 
## Results of cross Validation:
cv.wineData
## $size
## [1] 9 8 7 6 5 4 3 2 1
## 
## $dev
## [1] 538.6064 533.0583 532.2636 520.7776 540.2504 557.8732 577.7201 631.4065
## [9] 744.0414
## 
## $k
## [1]       -Inf   8.316953   8.440804  10.446890  14.187303  22.754414
## [7]  29.478106  51.445738 130.863499
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
#Plotting the tree structure
plot(modelWineTree, main = "Regression Tree for the Wine Data Set")

#To display the node labels.
#The argument pretty=0 instructs R to include the category names
#for any qualitative predictors, rather than simply displaying a letter for each category.
text(modelWineTree, pretty=0)

#arranging the plots to be made in a 1x2 matrix
par(mfrow =c(1,2))

#plotting the error rate as a function of both size and k.
plot(cv.wineData$size, cv.wineData$dev, type="b", 
     main = "Size vs Error Rate", xlab = "Size", ylab = "Error Rate")
plot(cv.wineData$k ,cv.wineData$dev ,type="b", 
     main = "k vs Error Rate", xlab = "k", ylab = "Error Rate")

  • Cross-validation is performed to mitigate the problem of overfitting of the data. The tree is pruned to minimize the sum of: 1) the output variable variance in the validation data, taken one terminal node at a time; and 2) the product of the cost complexity factor and the number of terminal nodes.

[Reference: https://www.solver.com/regression-trees][Reference: http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf]

  • dev corresponds to the cross-validation deviance (error) in this instance.

  • The cross-validation deviance is lowest for the tree with 6 terminal nodes, with the trees of 7–9 terminal nodes only slightly worse.

  • Pruning is based on cost complexity, or weakest-link, pruning. The whole idea of pruning a tree back is to reduce the variance.

  • Thus, pruning much further than this will only increase the deviance.

[Reference: https://rstudio-pubs-static.s3.amazonaws.com/264385_739bfda9ffc243d9892644a2762e575e.html]

  • To obtain a slightly simpler tree whose deviance is still close to the minimum, we prune the tree with the prune.tree() function, selecting 8 terminal nodes.
#Pruning the tree with 8 terminal nodes
prunedWineTree <- prune.tree(modelWineTree, best = 8)

#Plotting the pruned tree
plot(prunedWineTree, main="Best-Fit Regression Tree")
text(prunedWineTree, pretty=0)

  • By plotting the above pruned tree, we find that it is almost identical to the original tree, except that it has 8 terminal nodes instead of 9.

The above is the “best” fit tree.

Part (g): Compare the models from part (d) and (f) based on their performance. Which do you prefer? Be sure to justify your preference.
  • The best model always depends on the problem at hand. If the relationship can be approximated well by a linear model, then linear regression will likely dominate. If instead we have a complex, highly non-linear relationship between the features and the response, a decision tree may outperform the classical approaches. Also, improved interpretability is sometimes valued above a small difference in test error.
#To test the performance of both the models, we test both the models
#on the testing data sets
#Finding the y_hat values using the best step AIC model from part (d)
y_hatWineBest <- predict(modelWineBest, testingWineData)

#Finding the y_hat values using the pruned tree from part (f)
y_hatWinePruned <- predict(prunedWineTree, testingWineData)

#rounding y_hat values to nearest whole numbers. 5.5 rounds to 5
#and 5.57 rounds to 6
y_hatWineBestVal <- ifelse((y_hatWineBest-floor(y_hatWineBest))>0.5, 
                           ceiling(y_hatWineBest), floor(y_hatWineBest))

y_hatWinePrunedVal <- ifelse((y_hatWinePruned-floor(y_hatWinePruned))>0.5, 
                           ceiling(y_hatWinePruned), floor(y_hatWinePruned))

#confusionMatrix() only works with the "factor" type, hence we need to
#coerce the vectors into factors before building the confusion matrix for both models
cmWineBest <- confusionMatrix(factor(testingWineData$quality, 
                                     levels = c(3,4,5,6,7,8)), 
                              factor(y_hatWineBestVal, levels = c(3,4,5,6,7,8)))

cmWinePruned <- confusionMatrix(factor(testingWineData$quality, 
                                       levels = c(3,4,5,6,7,8)), 
                                factor(y_hatWinePrunedVal, levels = c(3,4,5,6,7,8)))

#Evaluation of Models
resultsWine <- data.frame(matrix(nrow = 2,ncol = 2))
row.names(resultsWine) <- c("Step AIC Best Model", "Best Pruned Tree Model")
colnames(resultsWine) <- c("Accuracy", "Out-of-sample Error (1-Accuracy)")
resultsWine[1,1] <- cmWineBest$overall[1]
resultsWine[2,1] <- cmWinePruned$overall[1]
resultsWine$`Out-of-sample Error (1-Accuracy)` <- 1-resultsWine$Accuracy
resultsWine
  • Thus, from the above analyses, it is clear that the first model, the stepAIC “best” model, is slightly more accurate (about 59.5%) and has the lower out-of-sample error.

  • Therefore, we prefer the stepAIC model for predicting values on the final test data.
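Since quality is numeric, comparing the test-set RMSE of the two models on the unrounded predictions is another reasonable check; a minimal sketch (not part of the original output):

#Test-set RMSE of the stepAIC model and of the pruned tree (unrounded predictions)
sqrt(mean((y_hatWineBest - testingWineData$quality)^2))
sqrt(mean((y_hatWinePruned - testingWineData$quality)^2))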

Problem 3

The Wisconsin Breast Cancer dataset is available as a comma-delimited text file on the UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Our goal in this problem will be to predict whether observations (i.e. tumors) are malignant or benign.

Part (a): Obtain the data, and load it into R by pulling it directly from the web. (Do not download it and import it from a CSV file.) Give a brief description of the data.
#Loading the data set directly from the url
#As per the info given in the article, missing values are denoted by ?
#Thus all ? must be treated as NAs. Also, no header is present.
cancerData <- 
read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"), sep = ",", header = F, na.strings = "?")

#First few rows of the data
head(cancerData)
#Verifying if the data import was correct
#Last column contains the Class where 2= benign and 4=malignant
#As per the article, benign members = 458 and malignant members = 241
#This is verified below
table(cancerData$V11)
## 
##   2   4 
## 458 241
  • In this dataset, features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

  • Number of Instances: 699 (as of 15 July 1992)

  • Number of Attributes: 10 plus the class attribute. The description of the attributes is given below:

  1. V1 = ID number
  2. V2 = Clump Thickness
  3. V3 = Uniformity of Cell Size
  4. V4 = Uniformity of Cell Shape
  5. V5 = Marginal Adhesion
  6. V6 = Single Epithelial Cell Size
  7. V7 = Bare Nuclei
  8. V8 = Bland Chromatin
  9. V9 = Normal Nucleoli
  10. V10 = Mitoses
  11. V11 = Class

[Reference: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29][Reference: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names] [Reference: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data]

Part (b): Tidy the data, ensuring that each variable is properly named and cast as the correct data type. Discuss any missing data.
#removing the first column (ID) as it is not needed in the analysis
cancerData <- cancerData[,-1]

#Setting the columns names of the data
colnames(cancerData) <- c("Clump.Thickness", "Cell.Size", "Cell.Shape", 
                          "Marginal.Adhesion", "Single.Epithelial.Cell.Size", 
                          "Bare.Nuclei", "Bland.Chromatin", "Normal.Nucleoli",
                          "Mitoses", "Class")

#Encoding the class variable 0 = benign and 1 = malignant
cancerData$Class <- ifelse(cancerData$Class==2, 0, 1)

#Coercing the Class variable as a factor variable
cancerData$Class <- as.factor(cancerData$Class)

#To calculate the number of rows which have missing data
cc <- complete.cases(cancerData)
cat("\nNumber of rows which have missing data =",length(cc[cc==FALSE]))
## 
## Number of rows which have missing data = 16
cat("\nPercentage of incomplete rows that have missing data =",
    length(cc[cc==FALSE])/nrow(cancerData)*100,"%\n")
## 
## Percentage of incomplete rows that have missing data = 2.288984 %
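#A quick per-column count of missing values (an illustrative addition, not part
#of the original analysis): this shows which variable the NAs occur in
colSums(is.na(cancerData))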
#Since only 2.3% of the total cases are incomplete, 
#we will omit all the incomplete cases from the dataset
cancerData <- na.omit(cancerData)

#Tabulating the values of each variable (a check of the value codings; also
#confirms that each type of observational unit forms a single table)
sapply(cancerData[,c(1:10)], function(x)table(x))
## $Clump.Thickness
## x
##   1   2   3   4   5   6   7   8   9  10 
## 139  50 104  79 128  33  23  44  14  69 
## 
## $Cell.Size
## x
##   1   2   3   4   5   6   7   8   9  10 
## 373  45  52  38  30  25  19  28   6  67 
## 
## $Cell.Shape
## x
##   1   2   3   4   5   6   7   8   9  10 
## 346  58  53  43  32  29  30  27   7  58 
## 
## $Marginal.Adhesion
## x
##   1   2   3   4   5   6   7   8   9  10 
## 393  58  58  33  23  21  13  25   4  55 
## 
## $Single.Epithelial.Cell.Size
## x
##   1   2   3   4   5   6   7   8   9  10 
##  44 376  71  48  39  40  11  21   2  31 
## 
## $Bare.Nuclei
## x
##   1   2   3   4   5   6   7   8   9  10 
## 402  30  28  19  30   4   8  21   9 132 
## 
## $Bland.Chromatin
## x
##   1   2   3   4   5   6   7   8   9  10 
## 150 160 161  39  34   9  71  28  11  20 
## 
## $Normal.Nucleoli
## x
##   1   2   3   4   5   6   7   8   9  10 
## 432  36  42  18  19  22  16  23  15  60 
## 
## $Mitoses
## x
##   1   2   3   4   5   6   7   8  10 
## 563  35  33  12   6   3   9   8  14 
## 
## $Class
## x
##   0   1 
## 444 239
  • As per the rules of tidy data and the code above:
  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.
  4. Each type of observational unit forms a table.
  • There were 16 rows in the data set which had NAs, representing about 2.3% of the total rows. Hence, these incomplete rows were removed from the data.
Part (c): Split the data into a training and test set such that a random 70% of the observations are in the training set.
# Set seed for reproducibility
set.seed(11456)

#Splitting training data into training and testing data in 70:30 ratio.
inTrain3 = createDataPartition(cancerData$Class, p = 0.7)[[1]]

#training data set containing 70% of the total values
trainingCancerData = cancerData[ inTrain3,]

#testing data set containing the remaining 30% of the total values
testingCancerData = cancerData[-inTrain3,]
Part (d): Fit a regression model to predict whether tissue samples are malignant or benign. Classify cases in the validation set. Compute and discuss the resulting confusion matrix.
#fitting a logistic regression model to predict whether 
#tissue samples are malignant or benign
modelCancerData <- glm(Class~., trainingCancerData, family = "binomial")

#Printing the summary of the model
summary(modelCancerData)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = trainingCancerData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5514  -0.1352  -0.0617   0.0213   2.0981  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -9.55237    1.30948  -7.295 2.99e-13 ***
## Clump.Thickness              0.55435    0.17017   3.258 0.001123 ** 
## Cell.Size                   -0.22688    0.28429  -0.798 0.424847    
## Cell.Shape                   0.54443    0.29756   1.830 0.067303 .  
## Marginal.Adhesion            0.46871    0.17272   2.714 0.006655 ** 
## Single.Epithelial.Cell.Size -0.05324    0.22214  -0.240 0.810576    
## Bare.Nuclei                  0.46158    0.12376   3.730 0.000192 ***
## Bland.Chromatin              0.32206    0.20263   1.589 0.111978    
## Normal.Nucleoli              0.21299    0.16149   1.319 0.187198    
## Mitoses                      0.41586    0.32800   1.268 0.204842    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 620.686  on 478  degrees of freedom
## Residual deviance:  74.992  on 469  degrees of freedom
## AIC: 94.992
## 
## Number of Fisher Scoring iterations: 8
#Predicting this model on the test data set
y_hatCancerData <- predict(modelCancerData, 
                           testingCancerData ,type = "response")

#The confusion matrix is an m x m matrix, where m is the number of classes to be predicted.
#For binary classification problems, the number of classes is 2, thus the confusion
#matrix has 2 rows and 2 columns.

# It is needed to choose a decision threshold t in order to label the instances 
#as positives or negatives. If the probability assigned to the instance by the 
#classifier is greater than t, it is labeled as positive and if the probability is 
#lower than the decision threshold, it is labeled as negative.

#https://www.neuraldesigner.com/blog/methods-binary-classification

#In our example, the threshold is assumed as 0.5. Therefore, any probability 
#above 0.5 is considered to be 1 otherwise 0.

#Translating the prediction probabilities based on threshold value into actual value
predictedY_hatValuesCancerData <- y_hatCancerData
predictedY_hatValuesCancerData[predictedY_hatValuesCancerData > 0.5] = 1
predictedY_hatValuesCancerData[predictedY_hatValuesCancerData <= 0.5] = 0

#confusion matrix only works with "factor" type. Hence we need to coerce vectors 
#into factor type

#Printing the predicted values:
cat("\n\nResults when fitting a logistic regression model\n\n")
## 
## 
## Results when fitting a logistic regression model
cat("\n\nThe predicted values from the test set are as follows:\n\n")
## 
## 
## The predicted values from the test set are as follows:
predictedY_hatValuesCancerData
##   2   5  21  22  26  28  29  30  34  36  39  44  49  50  51  55  59  75 
##   1   0   1   1   1   0   0   0   0   0   1   0   0   1   1   1   1   1 
##  83  88  91  94  96 101 104 105 106 109 110 111 116 125 126 128 130 131 
##   0   1   0   0   0   1   1   1   1   0   1   0   0   1   0   0   0   0 
## 132 133 134 135 142 145 147 154 155 158 161 163 166 169 172 178 187 189 
##   0   1   0   0   0   0   1   0   0   0   1   0   0   0   0   1   1   1 
## 190 191 199 202 206 209 212 215 222 224 226 232 233 237 238 244 249 251 
##   0   1   0   1   1   0   1   1   1   1   0   1   1   1   1   0   0   0 
## 257 260 261 264 280 282 283 287 289 297 299 300 301 302 308 313 314 319 
##   0   1   1   1   1   0   1   1   1   1   0   1   1   0   0   1   0   0 
## 323 329 332 336 344 347 352 361 366 372 373 381 388 389 391 392 396 398 
##   0   1   0   0   0   0   0   1   0   0   0   0   0   0   0   1   0   0 
## 401 402 403 407 414 419 420 421 425 426 427 428 433 436 437 439 441 442 
##   1   0   0   0   0   0   0   0   0   1   0   1   0   1   1   0   1   0 
## 450 458 459 461 462 467 473 479 480 481 483 484 486 491 495 497 501 502 
##   1   1   0   0   0   1   0   0   1   0   1   1   0   0   1   0   0   0 
## 504 507 514 515 522 524 525 527 533 534 538 539 542 545 548 549 551 554 
##   0   1   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0 
## 572 576 577 578 579 580 581 585 587 595 598 603 605 607 611 616 620 621 
##   1   0   0   0   0   0   0   0   1   1   0   0   1   0   1   0   0   0 
## 623 625 627 628 632 633 636 638 643 645 647 650 659 661 663 664 666 669 
##   0   0   1   0   0   0   0   0   0   0   0   0   1   0   0   0   0   1 
## 670 672 681 682 687 696 
##   1   0   1   1   0   0
cat("\n\nStats for the predicted values from the test set are as follows:\n\n")
## 
## 
## Stats for the predicted values from the test set are as follows:
table(predictedY_hatValuesCancerData)
## predictedY_hatValuesCancerData
##   0   1 
## 129  75
cat("\n\nResulting Confusion Matrix of the Regression Model:\n")
## 
## 
## Resulting Confusion Matrix of the Regression Model:
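#Note: caret's confusionMatrix() expects the predicted values as its first argument;
#since the actual classes are passed first here, the rows of the printed table are
#the actual classes and the columns are the predicted classes.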
confusionMatrix(
  factor(testingCancerData$Class, levels=0:1), 
  factor(predictedY_hatValuesCancerData, levels=0:1)
  )#$overall[[1]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 128   5
##          1   1  70
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9371, 0.9891)
##     No Information Rate : 0.6324          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.936           
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9922          
##             Specificity : 0.9333          
##          Pos Pred Value : 0.9624          
##          Neg Pred Value : 0.9859          
##              Prevalence : 0.6324          
##          Detection Rate : 0.6275          
##    Detection Prevalence : 0.6520          
##       Balanced Accuracy : 0.9628          
##                                           
##        'Positive' Class : 0               
## 
  • From the above, the test set contains 133 cases where the tissue samples are benign and 71 cases where they are malignant; the model predicts 129 benign and 75 malignant samples.

  • From the above confusion matrix (since the actual classes were passed to confusionMatrix() first, the rows of the printed table are the actual classes and the columns are the predicted classes; malignant = 1 is treated as the positive class here, and a quick arithmetic check follows below):
  1. True positives (TP), malignant samples classified as malignant = 70
  2. False positives (FP), benign samples classified as malignant = 5
  3. False negatives (FN), malignant samples classified as benign = 1
  4. True negatives (TN), benign samples classified as benign = 128
  5. Accuracy of the model = (70 + 128) / 204 ≈ 97.1%
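The following quick arithmetic check (a sketch based on the counts above, with malignant = 1 treated as the positive class) reproduces the reported accuracy:

#Confusion-matrix counts read off the table above
TP <- 70; FP <- 5; FN <- 1; TN <- 128
(TP + TN) / (TP + TN + FP + FN)   #overall accuracy, approximately 0.971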
Part (e): Fit a random forest model to predict whether tissue samples are malignant or benign. Classify cases in the validation set. Compute and discuss the resulting confusion matrix.
#Building the random forest model
set.seed(11175)
modelCancerData.rf <- randomForest(Class ~ ., data = trainingCancerData)

#Predicting the class probabilities

#If the type argument is set to "prob", the predicted probabilities of both 
#class 0 and class 1 are returned
y_hatCancerData.rf <- predict(modelCancerData.rf, 
                           testingCancerData, type = "prob")



#In our example, the threshold is assumed to be 0.5. Therefore, any probability 
#above 0.5 is considered to be 1, otherwise 0.

#Translating the prediction probabilities into class labels based on the threshold.
#Since y_hatCancerData.rf contains the probabilities of both 0 and 1, we keep only 
#the probabilities of class 1, i.e. the second column, hence [,2].
predictedY_hatValuesCancerData.rf <- y_hatCancerData.rf[,2]
predictedY_hatValuesCancerData.rf[predictedY_hatValuesCancerData.rf > 0.5] = 1
predictedY_hatValuesCancerData.rf[predictedY_hatValuesCancerData.rf <= 0.5] = 0
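#Note (an alternative, illustrative sketch): for a classification forest, predict()
#with the default type = "response" returns the predicted class labels directly,
#so the manual thresholding above could be skipped. predictedClasses.rf is an
#illustrative name and is not used elsewhere in this analysis.
predictedClasses.rf <- predict(modelCancerData.rf, testingCancerData, type = "response")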

#Printing the predicted values:

cat("\n\nResults when fitting a random forest model\n\n")
## 
## 
## Results when fitting a random forest model
cat("\n\nThe predicted values from the test set are as follows:\n\n")
## 
## 
## The predicted values from the test set are as follows:
predictedY_hatValuesCancerData.rf
##   2   5  21  22  26  28  29  30  34  36  39  44  49  50  51  55  59  75 
##   1   0   1   1   1   0   0   0   0   0   1   1   0   1   1   1   1   1 
##  83  88  91  94  96 101 104 105 106 109 110 111 116 125 126 128 130 131 
##   0   1   0   0   0   1   1   1   1   0   1   0   0   1   0   0   0   0 
## 132 133 134 135 142 145 147 154 155 158 161 163 166 169 172 178 187 189 
##   0   1   0   0   0   0   1   0   0   0   1   0   0   0   0   1   1   1 
## 190 191 199 202 206 209 212 215 222 224 226 232 233 237 238 244 249 251 
##   0   1   0   1   1   0   1   1   1   1   0   1   1   1   1   0   0   0 
## 257 260 261 264 280 282 283 287 289 297 299 300 301 302 308 313 314 319 
##   0   1   1   1   1   0   1   1   1   1   0   1   1   0   0   1   0   0 
## 323 329 332 336 344 347 352 361 366 372 373 381 388 389 391 392 396 398 
##   0   1   0   0   0   0   0   1   0   0   0   0   0   0   0   1   0   0 
## 401 402 403 407 414 419 420 421 425 426 427 428 433 436 437 439 441 442 
##   1   0   0   0   0   0   0   0   0   1   0   1   0   1   1   0   1   0 
## 450 458 459 461 462 467 473 479 480 481 483 484 486 491 495 497 501 502 
##   1   1   0   0   0   1   0   0   1   0   1   1   0   0   0   0   0   0 
## 504 507 514 515 522 524 525 527 533 534 538 539 542 545 548 549 551 554 
##   0   1   0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0 
## 572 576 577 578 579 580 581 585 587 595 598 603 605 607 611 616 620 621 
##   1   0   0   0   0   0   0   0   1   1   0   0   1   0   1   0   0   0 
## 623 625 627 628 632 633 636 638 643 645 647 650 659 661 663 664 666 669 
##   0   0   1   0   0   0   0   0   0   0   0   0   1   0   0   0   0   1 
## 670 672 681 682 687 696 
##   1   0   1   1   0   0
cat("\n\nStats for the predicted values from the test set are as follows:\n\n")
## 
## 
## Stats for the predicted values from the test set are as follows:
table(predictedY_hatValuesCancerData.rf)
## predictedY_hatValuesCancerData.rf
##   0   1 
## 129  75
cat("\n\nResulting Confusion Matrix of the Random Forest Model:\n")
## 
## 
## Resulting Confusion Matrix of the Random Forest Model:
confusionMatrix(
  factor(testingCancerData$Class, levels=0:1), 
  factor(predictedY_hatValuesCancerData.rf, levels=0:1)
  )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 129   4
##          1   0  71
##                                           
##                Accuracy : 0.9804          
##                  95% CI : (0.9506, 0.9946)
##     No Information Rate : 0.6324          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9574          
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9467          
##          Pos Pred Value : 0.9699          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6324          
##          Detection Rate : 0.6324          
##    Detection Prevalence : 0.6520          
##       Balanced Accuracy : 0.9733          
##                                           
##        'Positive' Class : 0               
## 
  • From the above, the test set again contains 133 benign and 71 malignant samples; the random forest predicts 129 benign and 75 malignant samples.

  • From the above confusion matrix (as before, rows are the actual classes and columns are the predicted classes, with malignant = 1 as the positive class):
  1. True positives (TP), malignant samples classified as malignant = 71
  2. False positives (FP), benign samples classified as malignant = 4
  3. False negatives (FN), malignant samples classified as benign = 0
  4. True negatives (TN), benign samples classified as benign = 129
  5. Accuracy of the model = (71 + 129) / 204 ≈ 98%
Part(f): Compare the models from part (d) and (e) using ROC curves. Which do you prefer? Be sure to justify your preference.
  • From the above analyses, the accuracy of the random forest model (98%) is higher than that of the logistic regression model (97.1%).
#Using the pROC library for the ROC function

                              # Evaluation of the logistic regression model

#making a replica of the testing data set
dat <- testingCancerData

#adding a column with the model's predicted class labels (0/1 after thresholding);
#note that roc() is typically given the raw predicted probabilities instead

#calculating the ROC of logistic regression
roc.lr <- roc(Class ~ prob, data = dat)
cat("\n\nROC output for logistic regression model:\n\n")
## 
## 
## ROC output for logistic regression model:
roc.lr
## 
## Call:
## roc.formula(formula = Class ~ prob, data = dat)
## 
## Data: prob in 133 controls (Class 0) < 71 cases (Class 1).
## Area under the curve: 0.9742
#plotting the ROC
plot(roc.lr, main="ROC curve for logistic regression model")

                              # Evaluation of the random forest model

#making a replica of the testing data set
dat <- testingCancerData

#adding a column with the random forest's predicted class labels (0/1 after thresholding)

#calculating the ROC of the random forest model
roc.rf <- roc(Class ~ prob, data = dat)
cat("\n\nROC output for random forest model:\n\n")
## 
## 
## ROC output for random forest model:
roc.rf
## 
## Call:
## roc.formula(formula = Class ~ prob, data = dat)
## 
## Data: prob in 133 controls (Class 0) < 71 cases (Class 1).
## Area under the curve: 0.985
#plotting the ROC
plot(roc.rf, main="ROC curve for random forest model")

From the above ROC analyses of both models, the area under the curve (AUC) for the logistic regression model (0.974) is lower than the AUC for the random forest model (0.985). This is consistent with the accuracy comparison in the previous part, so the random forest model is slightly better than the logistic regression model.
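Note that roc() above was given the thresholded 0/1 predictions rather than the fitted probabilities. A minimal sketch of the more conventional approach, using the raw probabilities (the object names roc.lr.prob and roc.rf.prob are illustrative, and the AUC values may differ slightly from those reported above), would be:

#ROC curves built from the raw predicted probabilities
roc.lr.prob <- roc(testingCancerData$Class, y_hatCancerData)
roc.rf.prob <- roc(testingCancerData$Class, y_hatCancerData.rf[, 2])

#Overlaying both curves for a direct visual comparison
plot(roc.lr.prob, main = "ROC curves from predicted probabilities")
plot(roc.rf.prob, add = TRUE, col = "red")
legend("bottomright", legend = c("Logistic regression", "Random forest"),
       col = c("black", "red"), lty = 1)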

Problem 4

Please answer the questions below by writing a short response.

Part (a): Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal in each application inference or prediction? Explain your answer.
  1. Example #1
  • Problem: Classifying an email as spam or non-spam
  • Response: Type of e-mail (spam vs non-spam)
  • Predictors: Number of repeated words, number of all-caps words, number of punctuation marks
  • End goal: Prediction; we are not trying to establish relationships between the predictors and the response, we simply want to predict the outcome. This analysis is also not especially prone to confounding because all of the predictors are potentially useful for predicting the outcome.
  2. Example #2
  • Problem: Classifying health insurance customers as low, medium or high risk
  • Response: Type of risk (low, medium or high)
  • Predictors: Smoker (yes/no), age, previously operated upon (yes/no), pollution in the area of residence, Body Mass Index
  • End goal: Inference; not all predictors will influence the outcome equally, as some may contribute more to the response than others. Confounding may also occur because one variable can be closely related to another. For instance, a high value of the pollution variable might classify a person as high risk because high pollution is related to an increased chance of contracting a lung disease, which in turn classifies the person as high risk.
  3. Example #3
  • Problem: Classifying whether a visitor will buy the premium subscription after the introduction of a new feature on the website
  • Response: Whether the visitor buys the premium subscription (yes vs no)
  • Predictors: Number of visits, time spent on the website, number of clicks on the website
  • End goal: Prediction; we are not trying to establish relationships between the predictors and the response, we simply want to predict the outcome. This analysis is also not especially prone to confounding because all of the predictors are potentially useful for predicting the outcome.
Part (b): Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal in each application inference or predictions? Explain your answer.
  1. Example #1
  • Problem: Predicting the future sales of a product
  • Response: Sales (amount in dollars)
  • Predictors: Quantity sold, response to a marketing campaign, turnout at a promotional event
  • End goal: Prediction; we are not trying to establish relationships between the predictors and the response, we simply want to predict the outcome. This analysis is also not especially prone to confounding because all of the predictors are potentially useful for predicting the outcome.
  2. Example #2
  • Problem: Finding the relationship between a soccer player's in-game statistics and the probability that the player scores a goal
  • Response: Probability that a goal is scored
  • Predictors: Distance run on the field, number of passes, shots taken off target
  • End goal: Inference; not all predictors will influence the outcome equally, as some may contribute more to the response than others. Confounding may also occur because one variable might be indirectly related to another without being the cause.
  3. Example #3
  • Problem: Impact of GRE scores on graduate program applications
  • Response: Outcome of the graduate program application (accepted vs rejected)
  • Predictors: GRE score, TOEFL score, GPA, number of years of work experience
  • End goal: Inference; not all predictors will influence the outcome equally, as some may contribute more to the response than others. Confounding may also occur because one variable might be indirectly related to another without being the cause.
Part(c): What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
  1. Flexible Models
  • A flexible model is one that, unlike a linear model, can capture non-linear effects of the predictor variables on the response variable. Examples: polynomial regression models and splines.

  • Advantages: It reduces bias; the training mean squared error decreases as flexibility increases because the model fits the data more closely.

  • Disadvantages: It increases variance. The model may over-fit the training data (capturing a lot of noise), so its predictions can vary considerably on the test data set.

  • A more flexible model may be preferred when prediction is the goal and we are not concerned with the relationships between the variables; in that case we want the predictions on new data to be as accurate as possible.

  2. Less Flexible Models
  • A less flexible model, such as a linear model, relates the predictor and response variables in a restricted (e.g., linear) way. A truly linear relationship may not actually exist; the form is chosen for the sake of simplicity and interpretability.

  • A less flexible (e.g., linear) model is preferred when inference, as opposed to prediction, is the goal and the emphasis is on finding relationships among the variables.

Problem 5

Suppose we have a dataset with five predictors, X1 = GPA, X2 = IQ, X3 = Degree (1 for B.A. degree holder, and 0 for B.S. degree holder), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Degree. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model and get B0 = 50, B1 = 20, B2 = 0.07, B3 = 35, B4 = 0.01, and B5 = -10. The fitted model is therefore

y = 50 + 20*GPA + 0.07*IQ + 35*Degree + 0.01*(GPA*IQ) - 10*(GPA*Degree)

where y = starting salary after graduation in thousands of dollars (response variable)

Part (a): Which answer is correct and why?
    1. For a fixed value of IQ and GPA, B.A. degree holders earn more on average than B.S. degree holders.
  • Answer: Incorrect as stated; with Degree = 1 for B.A. and 0 for B.S., the difference in predicted salary (B.A. minus B.S.) at a fixed IQ and GPA is 35 - 10*GPA. This is positive only when GPA < 3.5, so B.A. holders do not earn more for every GPA (see the short numeric check below).

    2. For a fixed value of IQ and GPA, B.S. degree holders earn more on average than B.A. degree holders.
  • Answer: Incorrect as stated; the difference 35 - 10*GPA is negative (B.S. earns more) only when GPA > 3.5, not for every GPA.

    3. For a fixed value of IQ and GPA, B.S. degree holders earn more on average than B.A. degree holders provided that the GPA is high enough.
  • Answer: Correct; once GPA exceeds 3.5, the difference 35 - 10*GPA becomes negative, so B.S. degree holders earn more on average.

    4. For a fixed value of IQ and GPA, B.A. degree holders earn more on average than B.S. degree holders provided that the GPA is high enough.
  • Answer: Incorrect; for a high enough GPA (above 3.5) it is the B.S. degree holders who earn more on average.
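A quick numeric check of the crossover point (the GPA values below are chosen purely for illustration):

#Difference in predicted salary (B.A. minus B.S.) at fixed IQ, as a function of GPA
gpa <- c(3.0, 3.5, 4.0)
35 - 10 * gpa   #positive below GPA 3.5 (B.A. earns more), zero at 3.5, negative above (B.S. earns more)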

Part (b): Predict the salary of a B.A. with IQ of 110 and a GPA of 4.0.
  • Here GPA = 4, IQ = 110, Degree = 1 (since the degree is B.A.)

  • Plugging in these values in the above equation

y <- 50 + 20*(4) + 0.07*(110) + 35*(1) + 0.01*(110*4) - 10*(4*1)
cat("\n\nSalary of a B.A. with IQ of 110 and a GPA of 4.0 is", y*1000,"USD")
## 
## 
## Salary of a B.A. with IQ of 110 and a GPA of 4.0 is 137100 USD
  • Salary of a B.A. with IQ of 110 and a GPA of 4.0 is 137100 USD
Part (c): True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is little evidence of an interaction effect. Justify your answer.
  • Answer: False; the magnitude of the coefficient alone does not tell us how much evidence there is for an interaction effect. To assess that evidence we would test the hypothesis H0: B4 = 0 using the p-value from a t-test, and since the standard error of B4 is not provided, the small estimate by itself does not justify the claim.

Problem 6

Apply boosting, bagging, and random forests to a dataset of your choice that we have used in class. Be sure to fit the models on a training set and evaluate their performance on a test set.

#For this problem I am using the Boston dataset available
#in the MASS package
#Response variable: medv, the median house value in Boston, MA neighborhoods
#Predictor variables: all other variables
bostonData <- Boston

# Set seed for reproducibility
set.seed(11178)

#Splitting training data into training and testing data in 70:30 ratio.
inTrain4 = createDataPartition(bostonData$medv, p = 0.7)[[1]]

#training data set containing 70% of the total values
trainingBostonData = bostonData[ inTrain4,]

#testing data set containing the remaining 30% of the total values
testingBostonData = bostonData[-inTrain4,]

                                #Model 1: Boosting
set.seed(11177)
#library gbm is required for gradient boosting
library(gbm)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:DAAG':
## 
##     lung
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
#Building the boosting model
modelBoston.boost <- gbm(medv ~ ., data = trainingBostonData, 
                         distribution = "gaussian", 
                         n.trees = 100, shrinkage = 0.01,
                         interaction.depth = 8)

#Predicting the results
modelBoston.boost.pred <- predict(modelBoston.boost, testingBostonData, n.trees = 100)

                                #Model 2: Bagging
library(randomForest)
#Bagging is the special case of a random forest in which all predictors are 
#considered at each split (mtry = 13, the number of predictors in the Boston data)
#Building the bagging model with randomForest()
set.seed(11176)
modelBoston.bag <- randomForest(medv ~ ., data = trainingBostonData, mtry=13, importance=T)

#Predicting the results
modelBoston.bag.pred <- predict(modelBoston.bag, testingBostonData)

                                #Model 3: Random Forest


#Growing a random forest proceeds in exactly the same way, except
#that we use a smaller value of the mtry argument.
#By default, randomForest() uses p/3 variables when building a
#random forest of regression trees, and sqrt(p) variables when building a
#random forest of classification trees.
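#As an illustrative check (p is a temporary name introduced here, not part of the
#original analysis), the package default for a regression forest would be:
p <- ncol(trainingBostonData) - 1   #13 predictors in the Boston data
floor(p / 3)                        #default mtry for regression forests (= 4); mtry = 6 is used below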

                                

#Building the random forest model
set.seed(11175)
modelBoston.rf <- randomForest(medv ~ ., data = trainingBostonData, mtry=6, importance=T)

#Predicting the results
modelBoston.rf.pred <- predict(modelBoston.rf, testingBostonData)

                      
                                #Model 4: Linear Regression

#Building the linear regression model
set.seed(11174)
modelBoston.lm <- lm(medv ~ ., data = trainingBostonData)

#Predicting the results
modelBoston.lm.pred <- predict(modelBoston.lm, testingBostonData)

                                #Evaluating the models

#Building an empty data frame to populate the results
resultsBoston <- data.frame(matrix(nrow = 4,ncol = 1))
row.names(resultsBoston) <- c("Boosting", "Bagging with Random Forest", 
                              "Random Forest", "Linear Model")
colnames(resultsBoston) <- c("Mean Squared Error")


#Calculating the Mean Squared Error for each model
resultsBoston[1,1] <- mean((modelBoston.boost.pred-testingBostonData$medv)^2)
resultsBoston[2,1] <- mean((modelBoston.bag.pred-testingBostonData$medv)^2)
resultsBoston[3,1] <- mean((modelBoston.rf.pred-testingBostonData$medv)^2)
resultsBoston[4,1] <- mean((modelBoston.lm.pred-testingBostonData$medv)^2)

resultsBoston
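As a convenience (a small sketch, not part of the original output), the model with the smallest test-set mean squared error can be picked out directly:

#Identifying the row (model) with the lowest mean squared error
rownames(resultsBoston)[which.min(resultsBoston[, 1])]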
Part (a): How accurate are the results compared to simple methods like linear or logistic regression?
  • From the above analyses, the linear regression model is the least accurate: it has the largest mean squared error on the test set compared with the other models.
Part(b): Which of the approaches yields the best performance?
  • From the above analyses, boosting, along with the random forest model, produced the best performance, with the lowest mean squared error on the test set.

[Reference: http://www2.stat.duke.edu/~rcs46/lectures_2017/08-trees/08-tree-advanced.pdf]

Problem 7

Authors of a paper in the journal Appetite, entitled “Individual differences in bitter taste preferences are associated with antisocial personality traits” (Sagioglou & Greitemeyer, 2015) hired Mechanical Turkers to complete surveys about their taste preferences and personalities. They ran correlation tests and multiple regression analyses. A portion of their regression analyses for two separate experiments are reproduced below (and can also be found within the article).

The authors conclude:

“The present research has demonstrated that bitter taste preferences are associated with more pronounced malevolent personality traits, especially robustly with everyday sadism. The sample was a large community sample, thereby representing a wide selection of the population. . . In establishing a robust link between taste preferences and personality traits, this research reveals 5 further real-world behavioral correlates of antisocial personality traits” (p. 306).

Provide a short critical analysis (up to 500 words max) of this study, taking into consideration ethics, empirical frameworks, regression analysis, and any other factors that stand out to you. In order to receive full points for this problem, you will likely need to at least skim the full article, but it is possible to receive partial credit based only on the information I provided above. Think critically and be creative!

I understand the premise behind this research. However, in my opinion, this inferential study should not be used as a guide for determining a person's personality from their taste preferences, because the study is quite prone to confounding. It must also be noted that nowhere in either study do the authors evaluate their model; we do not know how accurate it is, and no test for confounding variables was performed. Another point the authors appear to have missed is that their results could be strongly influenced by outliers: for example, a few respondents who like bitter tastes AND score high on each of the three dark-triad factors could exert more leverage on the overall model than the other observations. Most importantly, the authors performed a multiple linear regression without establishing that the assumptions underlying that model hold, so the analysis is likely prone to bias.

I could not help but notice that about 445 of the 953 respondents had at least an undergraduate degree, so roughly half of the respondents can be regarded as well educated. More educated people tend to be exposed to more cultures and may have tried bitter foods out of curiosity and come to like them. Also, the study never establishes what exactly counts as a bitter taste: some people might describe coffee as bitter, others dark chocolate, and certain hot peppers are also described as bitter. Thus, we cannot be sure that all respondents share the same idea of bitterness.

Also, when interpreting the results, care must be taken that this is an inferential study and that correlation does not imply causation. A preference for bitter tastes DOES NOT cause more pronounced malevolent personality traits.

Thus, I would treat this study as general information only and would not form opinions about people's personalities based on their taste preferences; doing so would be unwise.