This is the setup required for the final exam.
# Load standard libraries
library(tidyverse)
library(caTools)
library(dplyr)
library(ggplot2)
library(pROC)
library(randomForest)
library(MASS)
For this portion of the exam you will use data from: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is currently being used in a DrivenData competition. You can find more information about the dataset on the DrivenData or UCI ML Repo websites.
The heartData.csv dataset “is from a study of heart disease that has been open to the public for many years. The study collects various measurements on patient health and cardiovascular statistics, and of course makes patient identities anonymous” (DrivenData: Predicting Heart Disease).
#Loading the data set
heartData <- read.csv("heartData.csv")
#Getting the structure of the data
#str(heartData)
#First few rows of the data set
head(heartData)
heartData is a data.frame that contains 180 obs. of 16 variables.
#Based on the above structure of the heart disease data, the following
#variables (columns) could be coerced into the factor type
#Coercing chest_pain_type, num_major_vessels, resting_ekg_results, heart_disease_present
#into factor type
#heartData$chest_pain_type <- as.factor(heartData$chest_pain_type)
#heartData$num_major_vessels <- as.factor(heartData$num_major_vessels)
#heartData$resting_ekg_results <- as.factor(heartData$resting_ekg_results)
#heartData$heart_disease_present <- as.factor(heartData$heart_disease_present)
#heartData$fasting_blood_sugar_gt_120_mg_per_dl <- as.factor(heartData$fasting_blood_sugar_gt_120_mg_per_dl)
#heartData$slope_of_peak_exercise_st_segment <- as.factor(heartData$slope_of_peak_exercise_st_segment)
#heartData$exercise_induced_angina <- as.factor(heartData$exercise_induced_angina)
#As per the variable description given here http://archive.ics.uci.edu/ml/datasets/Heart+Disease,
#for the sex variable, 1=Male and 0=Female. Therefore, coercing the sex variable:
#heartData$sex <- factor(heartData$sex, levels = c(0,1), labels = c("F","M"))
#str(heartData)
#Finding proportion of female/male patients
cat("\nPercentage of Female participants =", prop.table(table(heartData$sex))[1]*100, "%",
"\nPercentage of Male participants =", prop.table(table(heartData$sex))[2]*100, "%")
##
## Percentage of Female participants = 31.11111 %
## Percentage of Male participants = 68.88889 %
#Finding proportion of female/male patients that have heart disease
#Creating one consolidated table of participants that have heart disease
#categorised by gender
t1 <- table(heartData[heartData$sex==0,]$heart_disease_present)
t2 <- table(heartData[heartData$sex==1,]$heart_disease_present)
t <- rbind(t1,t2)
#Setting the row names and column names of the table
colnames(t) <- c("Does not have heart disease", "Has heart disease")
rownames(t) <- c("Females", "Males")
#printing the table
cat("\n\nNumber of Males/Females having hear disease:\n\n")
##
##
## Number of Males/Females having heart disease:
t
## Does not have heart disease Has heart disease
## Females 45 11
## Males 55 69
#summary of the age of participants
cat("\n\nSummary of the age of participants:\n")
##
##
## Summary of the age of participants:
summary(heartData$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 48.00 55.00 54.81 62.00 77.00
ggplot(heartData, aes(x=age)) +
geom_histogram(fill="light green", alpha=0.5, position="identity")+
geom_vline(aes(xintercept = mean(age)), heartData)+
labs(title="Histogram of age of participants", x="Age of patients", y="Count")+
geom_text(mapping=aes(x=mean(age), y=8.5, label=paste("Mean=", signif(mean(age),4))),
size=4, angle=90, vjust=-0.4, hjust=0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#making a new data frame to store the mean ages of the male and female
#patients so that it can be included in the ggplot facet_wrap() panels
meanAge <- data.frame(sex = c(0, 1),
age = c(mean(heartData[heartData$sex==0,]$age),
mean(heartData[heartData$sex==1,]$age)))
#ggplot of age of the patients categorised by sex
agePlot <- ggplot(heartData, aes(x=age, fill=as.factor(sex))) +
geom_histogram(alpha=0.5, position="identity")+
geom_vline(aes(xintercept = age), meanAge)+
facet_wrap(~as.factor(sex))+
labs(title="Histogram of age of participants categorised by sex",
x="Age of patients", y="Count", fill="Sex")+
geom_text(meanAge, mapping=aes(x=age, y=8.5, label=paste("Mean=", signif(age,4))),
size=4, angle=90, vjust=-0.4, hjust=0)+
scale_fill_discrete(breaks=c("0", "1"),
labels=c("0 - Female", "1 - Male"))
#display the agePlot
agePlot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#ggplot of resting blood pressure of the patients categorised by sex
ggplot(heartData, aes(y=resting_blood_pressure, fill=as.factor(sex)))+
geom_boxplot()+
labs(title="Boxplot of the particpants' resting blood pressure
categorised by sex", y="Resting Blood Pressure")+
facet_wrap(~as.factor(sex))+
scale_fill_discrete(breaks=c("0", "1"),
labels=c("0 - Female", "1 - Male"))
#ggplot of serum cholestrol of the patients categorised by sex
ggplot(heartData, aes(y=serum_cholesterol_mg_per_dl, fill=as.factor(sex)))+
geom_boxplot()+
labs(title="Boxplot of the particpants' serum cholestrol
categorised by sex", y="Serum Cholestrol (in mg/dl)")+
facet_wrap(~as.factor(sex))+
scale_fill_discrete(breaks=c("0", "1"),
labels=c("0 - Female", "1 - Male"))
#ggplot of oldpeak_eq_st_depression of the patients categorised by sex
ggplot(heartData, aes(y=oldpeak_eq_st_depression, fill=as.factor(sex)))+
geom_boxplot()+
labs(title="Boxplot of the particpants' oldpeak=ST exercise induced depression relative to rest
categorised by sex", y="Measure")+
facet_wrap(~as.factor(sex))+
scale_fill_discrete(breaks=c("0", "1"),
labels=c("0 - Female", "1 - Male"))
#ggplot of maximum heart rate achieved in participants categorised by sex
ggplot(heartData, aes(y=max_heart_rate_achieved, fill=as.factor(sex)))+
geom_boxplot()+
labs(title="Boxplot of the particpants' maximum heart rate achieved
categorised by sex", y="Max Heart Rate Achieved")+
facet_wrap(~as.factor(sex))+
scale_fill_discrete(breaks=c("0", "1"),
labels=c("0 - Female", "1 - Male"))
The above analyses give a picture of the sample's demographics and key cardiovascular measurements; we now prepare the data for modelling.
#Excluding columns names that are not required: X, patient_id
colNotRequired <- c(1,2)
heartData <- heartData[,-colNotRequired]
#Encoding the factor variable (thal) into a numeric variable where
#1 = fixed_defect, 2 = normal and 3 = reversible_defect (the alphabetical
#level order). Note: this relies on read.csv() having read thal as a factor,
#the default before R 4.0; in later R versions thal is read as character
#and would need factor() first
heartData$thal <- as.numeric(heartData$thal)
#caret package is required for the createDataPartition() function
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Set seed for reproducibility
set.seed(111)
#Splitting training data into training and testing data in 70:30 ratio.
inTrain = createDataPartition(heartData$heart_disease_present, p = 0.7)[[1]]
#training data set containing 70% of the total values
trainingHeartData = heartData[ inTrain,]
#testing data set containing the remaining 30% of the total values
testingHeartData = heartData[-inTrain,]
Since the outcome variable, heart_disease_present, is binary, we will model the probability of a 1/0 result (a prediction of whether heart disease is diagnosed or not).
Note that I have not changed some of the variables (chest_pain_type, num_major_vessels, resting_ekg_results, heart_disease_present, sex) to factors because I’ll be performing logistic regression and the factor variables would need to be numerically encoded anyway.
Therefore, we will use logistic regression to build our model. The code is given below:
#To build our model we use the glm function which is
#used to fit generalized linear models, specified by
#giving a symbolic description of the linear predictor
#and a description of the error distribution denoted by the family argument.
#Therefore, we set family as binomial for a logistic regression model.
modelHeartData <- glm(heart_disease_present~., trainingHeartData, family = "binomial")
summary(modelHeartData)
##
## Call:
## glm(formula = heart_disease_present ~ ., family = "binomial",
## data = trainingHeartData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.56285 -0.39918 -0.05934 0.21864 2.19198
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -14.221195 5.678427 -2.504
## slope_of_peak_exercise_st_segment 2.113287 0.742373 2.847
## thal 1.534915 0.545912 2.812
## resting_blood_pressure 0.054286 0.023159 2.344
## chest_pain_type 0.649368 0.316170 2.054
## num_major_vessels 1.401373 0.447963 3.128
## fasting_blood_sugar_gt_120_mg_per_dl -2.091304 1.041256 -2.008
## resting_ekg_results 0.215033 0.314428 0.684
## serum_cholesterol_mg_per_dl 0.006567 0.006422 1.023
## oldpeak_eq_st_depression 0.078256 0.380387 0.206
## sex 3.661764 1.186009 3.087
## age -0.037790 0.039203 -0.964
## max_heart_rate_achieved -0.036958 0.019295 -1.915
## exercise_induced_angina 1.323062 0.711395 1.860
## Pr(>|z|)
## (Intercept) 0.01227 *
## slope_of_peak_exercise_st_segment 0.00442 **
## thal 0.00493 **
## resting_blood_pressure 0.01907 *
## chest_pain_type 0.03999 *
## num_major_vessels 0.00176 **
## fasting_blood_sugar_gt_120_mg_per_dl 0.04460 *
## resting_ekg_results 0.49405
## serum_cholesterol_mg_per_dl 0.30654
## oldpeak_eq_st_depression 0.83700
## sex 0.00202 **
## age 0.33507
## max_heart_rate_achieved 0.05544 .
## exercise_induced_angina 0.06291 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 173.114 on 125 degrees of freedom
## Residual deviance: 70.135 on 112 degrees of freedom
## AIC: 98.135
##
## Number of Fisher Scoring iterations: 7
From the above analysis, based on the p-values (less than 0.05), the following predictor variables appear to affect the outcome (heart_disease_present):
Based on the p-values, the effect of the predictor variables on the outcome variable can be categorised as weak (0.04 <= p-value <= 0.05) or strong (p-value < 0.04). All the remaining variables, as per this model, cannot be used to predict heart disease because they are not statistically significant.
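As a convenience (a sketch, not part of the original output), these groups can be pulled programmatically from the coefficient table of the model summary:
#Extracting predictors by significance from the fitted model's coefficient
#table; the cutoffs mirror the weak/strong categories described above
coefTable <- summary(modelHeartData)$coefficients
pVals <- coefTable[-1, "Pr(>|z|)"]             #drop the intercept row
names(pVals)[pVals < 0.04]                     #strongly significant predictors
names(pVals)[pVals >= 0.04 & pVals <= 0.05]    #weakly significant predictors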
#bestglm package is required for using the bestglm() function
library(bestglm)
## Loading required package: leaps
#Determining the bestglm model for the training data
#The information criterion in this model is set to AIC rather than the
#alternative (BIC), which I do not know how to interpret. Also, AIC makes it
#easier to compare with the step-wise method for determining the best model
#using the stepAIC() function from the MASS package
#Note that the family here used is binomial as it is a binomial dataset.
#The default value of the family argument is gaussian
bestModel <- bestglm(trainingHeartData, IC="AIC", family = binomial)
## Morgan-Tatar search since family is non-gaussian.
cat("\n\nThe best model as per the bestmodel() function:\n\n")
##
##
## The best model as per the bestglm() function:
bestModel
## AIC
## BICq equivalent for q in (0.696477110936541, 0.87792447814999)
## Best Model:
## Estimate Std. Error z value
## (Intercept) -14.02547137 4.78252891 -2.932647
## slope_of_peak_exercise_st_segment 2.12917194 0.65176792 3.266764
## thal 1.44554007 0.51619751 2.800362
## resting_blood_pressure 0.04734200 0.02115777 2.237570
## chest_pain_type 0.58945552 0.29509086 1.997539
## num_major_vessels 1.34112180 0.42143378 3.182284
## fasting_blood_sugar_gt_120_mg_per_dl -2.21591478 0.97525903 -2.272129
## sex 3.41448066 1.03970280 3.284093
## max_heart_rate_achieved -0.02929288 0.01708279 -1.714760
## exercise_induced_angina 1.27228511 0.68808217 1.849031
## Pr(>|z|)
## (Intercept) 0.003360853
## slope_of_peak_exercise_st_segment 0.001087844
## thal 0.005104527
## resting_blood_pressure 0.025249091
## chest_pain_type 0.045766660
## num_major_vessels 0.001461187
## fasting_blood_sugar_gt_120_mg_per_dl 0.023078691
## sex 0.001023112
## max_heart_rate_achieved 0.086389324
## exercise_induced_angina 0.064453369
cat("\n\nSummary of the best model:\n\n")
##
##
## Summary of the best model:
summary(bestModel$BestModel)
##
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.54258 -0.38833 -0.07104 0.27455 2.19328
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.02547 4.78253 -2.933 0.00336
## slope_of_peak_exercise_st_segment 2.12917 0.65177 3.267 0.00109
## thal 1.44554 0.51620 2.800 0.00510
## resting_blood_pressure 0.04734 0.02116 2.238 0.02525
## chest_pain_type 0.58946 0.29509 1.998 0.04577
## num_major_vessels 1.34112 0.42143 3.182 0.00146
## fasting_blood_sugar_gt_120_mg_per_dl -2.21591 0.97526 -2.272 0.02308
## sex 3.41448 1.03970 3.284 0.00102
## max_heart_rate_achieved -0.02929 0.01708 -1.715 0.08639
## exercise_induced_angina 1.27229 0.68808 1.849 0.06445
##
## (Intercept) **
## slope_of_peak_exercise_st_segment **
## thal **
## resting_blood_pressure *
## chest_pain_type *
## num_major_vessels **
## fasting_blood_sugar_gt_120_mg_per_dl *
## sex **
## max_heart_rate_achieved .
## exercise_induced_angina .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 173.114 on 125 degrees of freedom
## Residual deviance: 72.431 on 116 degrees of freedom
## AIC: 92.431
##
## Number of Fisher Scoring iterations: 7
#This best model of the type lm/glm is adopted from this source:
#https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
cat("\n\nFormula of the best model:\n\n")
##
##
## Formula of the best model:
formula(bestModel$BestModel)
## y ~ slope_of_peak_exercise_st_segment + thal + resting_blood_pressure +
## chest_pain_type + num_major_vessels + fasting_blood_sugar_gt_120_mg_per_dl +
## sex + max_heart_rate_achieved + exercise_induced_angina
## <environment: 0x7fd8b3049ff8>
From the above analysis, I have determined a “best” fit model using the best-subsets regression technique, which takes into account all possible models, as opposed to the step-wise technique, which either removes the least significant variables or adds the most significant ones.
The formula used in the model is given above.
Yes, the “best” model obtained above is very similar to the previous glm model obtained in part (e). The variables slope_of_peak_exercise_st_segment, thal, num_major_vessels and sex are statistically significant, as inferred from the p-values. The estimate for each variable in this “best” model is approximately equal to the corresponding estimate in the previous model.
From the above model it can be interpreted that the following variables are statistically significant (p-value < 0.05) and have a notable effect on the outcome variable (heart_disease_present):
“Estimate” gives the coefficient value of the variable. “Std. Error” is the standard error of that estimate, a measure of its precision that depends on the number of samples. The “p-value” tells how statistically significant the estimated coefficient is.
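Because this is a logistic regression, the estimates are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. A small sketch (not part of the original run):
#Converting the log-odds estimates to odds ratios; e.g. exp(3.41) for sex
#suggests the odds of heart disease for males are roughly 30 times those
#for females, holding the other variables constant
exp(coef(bestModel$BestModel))
#approximate (Wald) 95% confidence intervals on the odds-ratio scale
exp(confint.default(bestModel$BestModel))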
#predicting the y_hat values for the best model calculated above
y_hatBestModel <- predict(bestModel$BestModel, testingHeartData, type="response")
#printing the predicted values
cat("\n\nPredicted probability values of the testing dataset:\n\n")
##
##
## Predicted probability values of the testing dataset:
y_hatBestModel
## 3 4 7 15 16
## 0.9973728531 0.2034847502 0.0344948618 0.9406051498 0.3519723716
## 18 19 22 23 24
## 0.0043059863 0.9952160175 0.0432829012 0.8639336544 0.0046142318
## 30 32 38 40 45
## 0.0010161530 0.7940618564 0.8267032877 0.7398368159 0.9990901545
## 49 52 58 60 62
## 0.9981346791 0.0981655052 0.9623963777 0.0278936348 0.6906993298
## 70 81 82 85 87
## 0.0406228432 0.0008967578 0.9986197704 0.0025787815 0.2765928101
## 89 91 94 96 103
## 0.8470651556 0.0575775039 0.0142845576 0.0260925784 0.0434455751
## 108 109 110 112 116
## 0.1625940509 0.0175996519 0.9737840498 0.9505445747 0.0314488272
## 117 118 121 124 132
## 0.1128610836 0.5973134861 0.5484651465 0.9954362444 0.9570988478
## 134 136 137 142 143
## 0.9924752887 0.0518740762 0.9906652380 0.2323000724 0.0092950923
## 144 153 163 164 166
## 0.9728844860 0.2551556951 0.9964338164 0.7553452218 0.0448140036
## 167 169 172 178
## 0.0003217587 0.1395336144 0.0003729200 0.9806091797
#We need to choose a decision threshold t in order to label the instances
#as positive or negative: if the probability the classifier assigns to an
#instance is greater than t, it is labeled positive; if it is lower than
#the decision threshold, it is labeled negative.
#https://www.neuraldesigner.com/blog/methods-binary-classification
#In this question, we will assume the threshold to be 0.5. Therefore,
#any probability above 0.5 is considered to be 1 (has heart disease)
#otherwise 0 (does not have heart disease).
#Translating the prediction probabilities based on threshold value into
#actual value for the best model
predictedY_hatValuesBestModel <- y_hatBestModel
predictedY_hatValuesBestModel[predictedY_hatValuesBestModel > 0.5] = 1
predictedY_hatValuesBestModel[predictedY_hatValuesBestModel <= 0.5] = 0
#confusion matrix only works with "factor" type. Hence we need to coerce vectors into factor type
#confusion matrix for Best Model
cat("\n\nConfusion matrix for Best Model:\n")
##
##
## Confusion matrix for Best Model:
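#Note: confusionMatrix() expects the predictions as its first (data) argument
#and the truth as its second (reference) argument. With the order reversed,
#as below, the table is transposed, which exchanges the reported
#Sensitivity/Specificity with the Pos/Neg Predictive Values.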
confusionMatrix(
factor(testingHeartData$heart_disease_present, levels=0:1),
factor(predictedY_hatValuesBestModel, levels=0:1)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25 5
## 1 4 20
##
## Accuracy : 0.8333
## 95% CI : (0.7071, 0.9208)
## No Information Rate : 0.537
## P-Value [Acc > NIR] : 4.712e-06
##
## Kappa : 0.6639
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8621
## Specificity : 0.8000
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.8333
## Prevalence : 0.5370
## Detection Rate : 0.4630
## Detection Prevalence : 0.5556
## Balanced Accuracy : 0.8310
##
## 'Positive' Class : 0
##
library(pROC)
#finding the ROC for the most significant variable (sex)
#plot(roc(heart_disease_present ~ sex, data = testingHeartData))
#making a replica of the testing data set
dat <- testingHeartData
#adding an additional column of the probabilities of getting a 1 in the predicted output
dat$prob <- y_hatBestModel
#calculating the ROC of logistic regression
roc.hd <- roc(heart_disease_present ~ prob, data = dat)
cat("\n\nROC output for the best model:\n\n")
##
##
## ROC output for the best model:
roc.hd
##
## Call:
## roc.formula(formula = heart_disease_present ~ prob, data = dat)
##
## Data: prob in 30 controls (heart_disease_present 0) < 24 cases (heart_disease_present 1).
## Area under the curve: 0.8208
#plotting the ROC
plot(roc.hd, main="ROC curve for the best model")
I have used the “best” model to predict the probabilities of heart disease in the test dataset because it is a more finely tuned model than the generic glm model: the bestglm() function considers all possible subsets of predictors and returns the best-scoring one. Therefore, predictions from this “best” model can be expected to be more accurate than those from the generic glm model used previously.
Sensitivity is the true positive rate and specificity is the true negative rate.
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
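For reference, these quantities can be recovered by hand from the counts in the confusion matrix above, using caret’s reported ‘positive’ class of 0:
#Recomputing sensitivity and specificity from the confusion matrix counts
25 / (25 + 4)  #sensitivity = TP/(TP+FN) = 0.8621, matching the caret output
20 / (20 + 5)  #specificity = TN/(TN+FP) = 0.8000, matching the caret output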
Since the black curve lies above the grey diagonal, the model has a higher true positive rate than a random classifier; that is, predictions from our model are better than chance.
The larger the area under the black curve (AUC), the better the model. The current AUC is about 82%, which corresponds to a satisfactory model.
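The 0.5 cutoff used earlier was an assumption; pROC can also extract the AUC directly and suggest a data-driven threshold. A sketch:
#Extracting the AUC and asking pROC for the threshold that maximises
#sensitivity + specificity (Youden's J statistic)
auc(roc.hd)
coords(roc.hd, "best", best.method = "youden")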
In this problem we will use the (red) wine quality dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality. More about this data (note we will only use the red wine dataset):
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: Web Link or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
Suppose you want to explore the relationship between wine quality and other characteristics of the wine. Follow the questions below to perform this analysis.
#Loading the data set
wineData <- read.csv("winequality-red.csv")
#Read this way, the data is unreadable: all 12 values end up in one column.
#The file is semicolon-separated, so we re-read it using the sep parameter
#of the read.csv() function
wineData <- read.csv("winequality-red.csv", sep = ";")
#First few rows of the data set
head(wineData)
#Getting the structure of the data
str(wineData)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
#instead of using the scatterplotMatrix() function from the car package,
#I find ggpairs() from the GGally package more useful because, besides
#plotting the pairwise relationships between the variables, it also shows
#the correlation coefficient between each pair of variables in the upper
#part of the plot matrix
#loading the GGally library required for the ggpairs method
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
#plotting the relationships within the data
ggpairs(wineData, lower = list(continuous = wrap("smooth", method = "lm")))
#Exploring the relationship between quality and other variables using the box-plots
for(i in 1:11)
{
#getting the column data
s <- wineData[,i]
#getting the column name
yLabel <- colnames(wineData)[i]
#plotting the results
g <- ggplot(wineData, aes(x=quality, y=s)) +
geom_point() + #plotting the points
geom_boxplot(aes(group = quality), color="blue")+ #plotting the boxplots
labs(title=paste("Quality vs",yLabel), x="Quality", y=yLabel) #plotting the names
#printing the plots
print(g)
}
#For our subsequent analysis, we will divide the dataset into training
#and testing data in a 70:30 ratio. We will fit the model on the training
#data and then predict values, based on the model, on the test data
library(caret)
# Set seed for reproducibility
set.seed(11456)
#Splitting training data into training and testing data in 70:30 ratio.
inTrain2 = createDataPartition(wineData$quality, p = 0.7)[[1]]
#training data set containing 70% of the total values
trainingWineData = wineData[ inTrain2,]
#testing data set containing the remaining 30% of the total values
testingWineData = wineData[-inTrain2,]
#fitting a multiple linear model on the training set of wine data
modelWine <- lm(quality ~., trainingWineData)
#printing the summary of the linear model
summary(modelWine)
##
## Call:
## lm(formula = quality ~ ., data = trainingWineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.69690 -0.37926 -0.04435 0.43417 1.95589
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.1899863 25.4146170 0.558 0.576725
## fixed.acidity 0.0172472 0.0312112 0.553 0.580651
## volatile.acidity -1.2524461 0.1443816 -8.675 < 2e-16 ***
## citric.acid -0.2733096 0.1755041 -1.557 0.119689
## residual.sugar 0.0131848 0.0166756 0.791 0.429310
## chlorides -1.8290970 0.5220675 -3.504 0.000477 ***
## free.sulfur.dioxide 0.0046809 0.0025650 1.825 0.068281 .
## total.sulfur.dioxide -0.0036525 0.0008732 -4.183 3.11e-05 ***
## density -9.7054519 25.9505040 -0.374 0.708477
## pH -0.4549061 0.2312662 -1.967 0.049430 *
## sulphates 0.8849532 0.1357895 6.517 1.09e-10 ***
## alcohol 0.2748169 0.0316817 8.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6504 on 1108 degrees of freedom
## Multiple R-squared: 0.3669, Adjusted R-squared: 0.3606
## F-statistic: 58.37 on 11 and 1108 DF, p-value: < 2.2e-16
Interpreting the above model:
Holding all other variables constant, the quality of the wine will decrease by 1.25 for a unit increase in the volatile acidity value. The std. error is a measure of how precise the value of the estimate is. The t-value is the estimate divided by the std. error. Finally, the p-value indicates how statistically significant the estimate is; if the p-value is more than 0.05, the estimate is not statistically significant. The coefficients of all the other variables can be interpreted similarly.
The proportion of variance in wine quality explained by the model is given by the R-squared value. The above model, which takes into account all the predictor variables, explains about 37% of the variance in the quality of the red wine. Thus, this is probably not a very good model.
#arranging the plot to be made in a 2x2 matrix
par(mfrow=c(2,2))
#plotting the residual plots
plot(modelWine)
Statistical Assumptions in the above linear model:
Linear relationship: There must be a linear relationship between the outcome variable and the independent variables. Looking at the scatterplots, we can say that there is a linear relationship between the quality (response) and all other independent (predictor) variables
Multivariate normality: the residuals are normally distributed; that is to say, the error terms in the linear model are normally distributed. This can be checked by interpreting the Q-Q plot. From the above Q-Q plot we can see that the residuals are approximately normally distributed.
No multicollinearity: the independent variables (predictors) are not highly correlated with each other. This can be checked using one of three methods: the correlation matrix (the pairwise correlation coefficients among the independent variables need to be well below 1), tolerance, or the variance inflation factor (a VIF check is sketched after this list). From the scatterplot matrix obtained above, all the pairwise correlation coefficients are well below 1, so there is no serious multicollinearity.
Homoscedasticity: the variance of the error terms is similar across the values of the independent variables/predictors. This can be ascertained from the plot of standardized residuals vs fitted values, which shows whether points are equally spread across all fitted values. Since the spread of the standardised residuals is roughly constant across the fitted values, homoscedasticity appears to hold.
Thus, all the assumptions are satisfied and we can fit a multiple linear regression model to our wine dataset.
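The no-multicollinearity assumption can also be checked more formally with variance inflation factors; a sketch using the car package (assuming it is installed), where VIFs well below 5-10 are generally taken as acceptable:
#Variance inflation factors for the full wine model; values near 1 indicate
#little collinearity, values above roughly 5-10 would be a concern
library(car)
vif(modelWine)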
From the above plots, we can determine:
That the model is probably not a very good one. This is also demonstrated by the fact that the R-squared value of the model is only about 36%, i.e., the model only explains about 36% of the variance in the outcome (quality of the wine).
The parallel lines in the residual plot reflect the fact that the outcome (quality of the wine) can only take certain values. Looking at the data we found this to be true, because quality can only take integer values from 0 to 10.
Since the outcome (quality) is discrete and bounded, we are bound to get these parallel lines in our residuals vs fitted plot irrespective of the type of model we choose.
From the Normal Q-Q plot, we can ascertain that the residuals are approximately normally distributed.
From the scale-location plot it appears that homoscedasticity holds, since the spread of the standardised residuals is roughly constant across the fitted values.
From the residuals-leverage plot, we can see that most residuals lie within Cook’s distance. There are some outliers, but they do not seem to have a profound impact on the model.
#Performing a step-wise regression using the stepAIC function
#and assigning trace as false so that the multiple functions are not printed on the console.
modelWineBest <- stepAIC(modelWine, trace = F)
#printing the summary of the best model
summary(modelWineBest)
##
## Call:
## lm(formula = quality ~ volatile.acidity + citric.acid + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates +
## alcohol, data = trainingWineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68643 -0.38170 -0.04511 0.43883 1.95618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8986960 0.5528314 8.861 < 2e-16 ***
## volatile.acidity -1.2372563 0.1379533 -8.969 < 2e-16 ***
## citric.acid -0.2187957 0.1433077 -1.527 0.127107
## chlorides -1.8860823 0.5023947 -3.754 0.000183 ***
## free.sulfur.dioxide 0.0050508 0.0025308 1.996 0.046211 *
## total.sulfur.dioxide -0.0037273 0.0008364 -4.457 9.17e-06 ***
## pH -0.5478355 0.1601013 -3.422 0.000645 ***
## sulphates 0.8698493 0.1306037 6.660 4.29e-11 ***
## alcohol 0.2836661 0.0204106 13.898 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6498 on 1111 degrees of freedom
## Multiple R-squared: 0.3664, Adjusted R-squared: 0.3618
## F-statistic: 80.3 on 8 and 1111 DF, p-value: < 2.2e-16
Using the stepAIC() function, we found that the “best” model is quite similar to the previous model in part (b). This “best” model retains the statistically significant variables (p-values < 0.05) that were present in the previous model (volatile.acidity, chlorides, total.sulfur.dioxide, sulphates and alcohol) and also keeps some variables that were only borderline significant there, like pH and free.sulfur.dioxide.
The R-squared value remains largely the same in the “best” model, even though it uses fewer variables than the previous (full) model.
When evaluating models, we often want to assess how well a model performs in predicting the target variable on different subsets of the data. One such technique is k-fold cross-validation, which partitions the data into k equally sized segments (called ‘folds’). One fold is held out for validation while the other k-1 folds are used to train the model, which is then used to predict the target variable in the held-out fold. This process is repeated k times, with the performance of each model on its hold-out set tracked using a performance metric such as accuracy. The most common variation is 10-fold cross-validation.
[Reference: https://www.r-bloggers.com/evaluating-logistic-regression-models/]
#This library is required for the cv.lm() function which performs the
#k-fold cross-validation
library(DAAG)
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:MASS':
##
## hills
#performing the k-fold cross validation on the "best" model from part (d)
cvlm <- cv.lm(data = trainingWineData, form.lm = modelWineBest, m=10, printit=F)
## Warning in cv.lm(data = trainingWineData, form.lm = modelWineBest, m = 10, :
##
## As there is >1 explanatory variable, cross-validation
## predicted values for a fold are not a linear function
## of corresponding overall predicted values. Lines that
## are shown for the different folds are approximate
#calculating the root mean squared error
cat("\n\nRoot mean squared error after performing the 10 fold cross validation:\n")
##
##
## Root mean squared error after performing the 10 fold cross validation:
sqrt(mean((cvlm$cvpred-cvlm$quality)^2))
## [1] 0.6531881
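As a cross-check (a sketch, not part of the original run), the same 10-fold cross-validation can be performed with caret’s train(); the fold assignments differ, so the resulting RMSE will differ slightly from the value above:
#10-fold cross-validation of the "best" wine model via caret; the resampled
#RMSE should be close to the cv.lm() figure above
set.seed(11456)  #reusing the document's seed; an arbitrary choice
cvFit <- train(formula(modelWineBest), data = trainingWineData,
               method = "lm",
               trControl = trainControl(method = "cv", number = 10))
cvFit$results$RMSE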
#Tree library is required for the regression trees
library(tree)
#Using a regression tree with the covariates from the "best" model
#(citric.acid, the one non-significant term there, is dropped)
modelWineTree <- tree(quality ~
volatile.acidity + chlorides + free.sulfur.dioxide +
total.sulfur.dioxide + pH + sulphates + alcohol,
data=trainingWineData)
#Summary of the tree model
cat("\n\nSummary of Regression Tree model")
##
##
## Summary of Regression Tree model
summary(modelWineTree)
##
## Regression tree:
## tree(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = trainingWineData)
## Variables actually used in tree construction:
## [1] "alcohol" "sulphates" "volatile.acidity"
## [4] "total.sulfur.dioxide"
## Number of terminal nodes: 9
## Residual mean deviance: 0.4179 = 464.3 / 1111
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.5520 -0.2619 -0.1738 0.0000 0.4484 1.9840
#Cross-validating the tree model. We use the default deviance as the FUN
#argument (as opposed to prune.misclass, which uses the classification error
#rate to guide the cross-validation and pruning process in classification
#problems).
#The cv.tree() function reports the number of terminal nodes of each tree
#considered (size) as well as the corresponding error rate (deviance) and
#the value of the cost-complexity parameter used (k)
cv.wineData <- cv.tree(modelWineTree)
#printing the results of the cross-validation
cat("\n\nResults of cross Validation:\n")
##
##
## Results of cross Validation:
cv.wineData
## $size
## [1] 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 538.6064 533.0583 532.2636 520.7776 540.2504 557.8732 577.7201 631.4065
## [9] 744.0414
##
## $k
## [1] -Inf 8.316953 8.440804 10.446890 14.187303 22.754414
## [7] 29.478106 51.445738 130.863499
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
#Plotting the tree structure
plot(modelWineTree, main = "Regression Tree for the Wine Data Set")
#To display the node labels
#The argument pretty=0 instructs R to include the category names
#for any qualitative predictors, rather than simply displaying a letter for each category
text(modelWineTree, pretty=0)
#arranging the plots in a 1x2 matrix
par(mfrow =c(1,2))
#plotting the error rate as a function of both size and k.
plot(cv.wineData$size, cv.wineData$dev, type="b",
main = "Size vs Error Rate", xlab = "Size", ylab = "Error Rate")
plot(cv.wineData$k ,cv.wineData$dev ,type="b",
main = "k vs Error Rate", xlab = "k", ylab = "Error Rate")
[Reference: https://www.solver.com/regression-trees][Reference: http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf]
Here, dev corresponds to the cross-validation error rate (total deviance).
The tree with 6 terminal nodes results in the lowest cross-validation error rate (deviance 520.8).
Pruning is based on cost-complexity (weakest-link) pruning; the whole idea of pruning a tree back is to reduce the variance.
Thus, pruning beyond this point will not reduce the deviance.
[Reference: https://rstudio-pubs-static.s3.amazonaws.com/264385_739bfda9ffc243d9892644a2762e575e.html]
#Pruning the tree to the 6 terminal nodes selected by cross-validation
prunedWineTree <- prune.tree(modelWineTree, best = 6)
#Plotting the pruned tree
plot(prunedWineTree, main="Best-Fit Regression Tree")
text(prunedWineTree, pretty=0)
The above is the “best” fit tree.
#To test the performance of both the models, we test both the models
#on the testing data sets
#Finding the y_hat values using the best step AIC model from part (d)
y_hatWineBest <- predict(modelWineBest, testingWineData)
#Finding the y_hat values using the pruned tree from part (f)
y_hatWinePruned <- predict(prunedWineTree, testingWineData)
#rounding y_hat values to nearest whole numbers. 5.5 rounds to 5
#and 5.57 rounds to 6
y_hatWineBestVal <- ifelse((y_hatWineBest-floor(y_hatWineBest))>0.5,
ceiling(y_hatWineBest), floor(y_hatWineBest))
y_hatWinePrunedVal <- ifelse((y_hatWinePruned-floor(y_hatWinePruned))>0.5,
ceiling(y_hatWinePruned), floor(y_hatWinePruned))
#confusionMatrix() only works with the "factor" type, hence we coerce the
#vectors into factors and compute confusion matrices for both models
cmWineBest <- confusionMatrix(factor(testingWineData$quality,
levels = c(3,4,5,6,7,8)),
factor(y_hatWineBestVal, levels = c(3,4,5,6,7,8)))
cmWinePruned <- confusionMatrix(factor(testingWineData$quality,
levels = c(3,4,5,6,7,8)),
factor(y_hatWinePrunedVal, levels = c(3,4,5,6,7,8)))
#Evaluation of Models
resultsWine <- data.frame(matrix(nrow = 2,ncol = 2))
row.names(resultsWine) <- c("Step AIC Best Model", "Best Pruned Tree Model")
colnames(resultsWine) <- c("Accuracy", "Out-of-sample Error (1-Accuracy)")
resultsWine[1,1] <- cmWineBest$overall[1]
resultsWine[2,1] <- cmWinePruned$overall[1]
resultsWine$`Out-of-sample Error (1-Accuracy)` <- 1-resultsWine$Accuracy
resultsWine
Thus, from the above analyses, it is clear that the first model, the stepAIC “best” model, is the slightly more accurate model (59.5%) and has the lower out-of-sample error.
Therefore, we will prefer the stepAIC model to predict the values on the final test data.
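Accuracy on rounded predictions is a coarse metric for an ordinal outcome like quality; comparing the test-set RMSE of the raw (unrounded) predictions is a useful complement. A sketch:
#Test-set RMSE of the raw predictions from both wine models
sqrt(mean((y_hatWineBest - testingWineData$quality)^2))    #step AIC model
sqrt(mean((y_hatWinePruned - testingWineData$quality)^2))  #pruned tree model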
The Wisconsin Breast Cancer dataset is available as a comma-delimited text file on the UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Our goal in this problem will be to predict whether observations (i.e. tumors) are malignant or benign.
#Loading the data set directly from the url
#As per the info given in the article, missing values are denoted by ?
#Thus all ? must be treated as NAs. Also, no header is present.
cancerData <-
read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"), sep = ",", header = F, na.strings = "?")
#First few rows of the data
head(cancerData)
#Verifying if the data import was correct
#Last column contains the Class where 2= benign and 4=malignant
#As per the article, benign members = 458 and malignant members = 241
#This is verified below
table(cancerData$V11)
##
## 2 4
## 458 241
In this dataset, features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
Number of Instances: 699 (as of 15 July 1992)
Number of Attributes: 10 plus the class attribute. The description of the attributes is as given below:
[Reference: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29][Reference: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names] [Reference: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data]
#removing the first column (ID) as it is not needed in the analysis
cancerData <- cancerData[,-1]
#Setting the columns names of the data
colnames(cancerData) <- c("Clump.Thickness", "Cell.Size", "Cell.Shape",
"Marginal.Adhesion", "Single.Epithelial.Cell.Size",
"Bare.Nuclei", "Bland.Chromatin", "Normal.Nucleoli",
"Mitoses", "Class")
#Encoding the class variable 0 = benign and 1 = malignant
cancerData$Class <- ifelse(cancerData$Class==2, 0, 1)
#Coercing the Class variable as a factor variable
cancerData$Class <- as.factor(cancerData$Class)
#To calculate the number of rows which have missing data
cc <- complete.cases(cancerData)
cat("\nNumber of rows which have missing data =",length(cc[cc==FALSE]))
##
## Number of rows which have missing data = 16
cat("\nPercentage of incomplete rows that have missing data =",
length(cc[cc==FALSE])/nrow(cancerData)*100,"%\n")
##
## Percentage of incomplete rows that have missing data = 2.288984 %
#Since only 2.3% of the total cases are incomplete,
#we will omit all the incomplete cases from the dataset
cancerData <- na.omit(cancerData)
#Tabulating each variable to check the distribution of its values
sapply(cancerData[,c(1:10)], function(x)table(x))
## $Clump.Thickness
## x
## 1 2 3 4 5 6 7 8 9 10
## 139 50 104 79 128 33 23 44 14 69
##
## $Cell.Size
## x
## 1 2 3 4 5 6 7 8 9 10
## 373 45 52 38 30 25 19 28 6 67
##
## $Cell.Shape
## x
## 1 2 3 4 5 6 7 8 9 10
## 346 58 53 43 32 29 30 27 7 58
##
## $Marginal.Adhesion
## x
## 1 2 3 4 5 6 7 8 9 10
## 393 58 58 33 23 21 13 25 4 55
##
## $Single.Epithelial.Cell.Size
## x
## 1 2 3 4 5 6 7 8 9 10
## 44 376 71 48 39 40 11 21 2 31
##
## $Bare.Nuclei
## x
## 1 2 3 4 5 6 7 8 9 10
## 402 30 28 19 30 4 8 21 9 132
##
## $Bland.Chromatin
## x
## 1 2 3 4 5 6 7 8 9 10
## 150 160 161 39 34 9 71 28 11 20
##
## $Normal.Nucleoli
## x
## 1 2 3 4 5 6 7 8 9 10
## 432 36 42 18 19 22 16 23 15 60
##
## $Mitoses
## x
## 1 2 3 4 5 6 7 8 10
## 563 35 33 12 6 3 9 8 14
##
## $Class
## x
## 0 1
## 444 239
# Set seed for reproducibility
set.seed(11456)
#Splitting training data into training and testing data in 70:30 ratio.
inTrain3 = createDataPartition(cancerData$Class, p = 0.7)[[1]]
#training data set containing 70% of the total values
trainingCancerData = cancerData[ inTrain3,]
#testing data set containing the remaining 30% of the total values
testingCancerData = cancerData[-inTrain3,]
#fitting a logistic regression model to predict whether
#tissue samples are malignant or benign
modelCancerData <- glm(Class~., trainingCancerData, family = "binomial")
#Printing the summary of the model
summary(modelCancerData)
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = trainingCancerData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5514 -0.1352 -0.0617 0.0213 2.0981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.55237 1.30948 -7.295 2.99e-13 ***
## Clump.Thickness 0.55435 0.17017 3.258 0.001123 **
## Cell.Size -0.22688 0.28429 -0.798 0.424847
## Cell.Shape 0.54443 0.29756 1.830 0.067303 .
## Marginal.Adhesion 0.46871 0.17272 2.714 0.006655 **
## Single.Epithelial.Cell.Size -0.05324 0.22214 -0.240 0.810576
## Bare.Nuclei 0.46158 0.12376 3.730 0.000192 ***
## Bland.Chromatin 0.32206 0.20263 1.589 0.111978
## Normal.Nucleoli 0.21299 0.16149 1.319 0.187198
## Mitoses 0.41586 0.32800 1.268 0.204842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 620.686 on 478 degrees of freedom
## Residual deviance: 74.992 on 469 degrees of freedom
## AIC: 94.992
##
## Number of Fisher Scoring iterations: 8
#Predicting this model on the test data set
y_hatCancerData <- predict(modelCancerData,
testingCancerData ,type = "response")
#The confusion matrix is an mxm, where m is the number of classes to be predicted.
#For binary classification problems, the number of classes is 2, thus the confusion
#matrix will have 2 rows and columns.
#We need to choose a decision threshold t in order to label the instances
#as positive or negative: if the probability the classifier assigns to an
#instance is greater than t, it is labeled positive; if it is lower than
#the decision threshold, it is labeled negative.
#https://www.neuraldesigner.com/blog/methods-binary-classification
#In our example, the threshold is assumed as 0.5. Therefore, any probability
#above 0.5 is considered to be 1 otherwise 0.
#Translating the prediction probabilities based on threshold value into actual value
predictedY_hatValuesCancerData <- y_hatCancerData
predictedY_hatValuesCancerData[predictedY_hatValuesCancerData > 0.5] = 1
predictedY_hatValuesCancerData[predictedY_hatValuesCancerData <= 0.5] = 0
#confusion matrix only works with "factor" type. Hence we need to coerce vectors
#into factor type
#Printing the predicted values:
cat("\n\nResults when fitting a logistic regression model\n\n")
##
##
## Results when fitting a logistic regression model
cat("\n\nThe predicted values from the test set are as follows:\n\n")
##
##
## The predicted values from the test set are as follows:
predictedY_hatValuesCancerData
## 2 5 21 22 26 28 29 30 34 36 39 44 49 50 51 55 59 75
## 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1
## 83 88 91 94 96 101 104 105 106 109 110 111 116 125 126 128 130 131
## 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0
## 132 133 134 135 142 145 147 154 155 158 161 163 166 169 172 178 187 189
## 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1
## 190 191 199 202 206 209 212 215 222 224 226 232 233 237 238 244 249 251
## 0 1 0 1 1 0 1 1 1 1 0 1 1 1 1 0 0 0
## 257 260 261 264 280 282 283 287 289 297 299 300 301 302 308 313 314 319
## 0 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1 0 0
## 323 329 332 336 344 347 352 361 366 372 373 381 388 389 391 392 396 398
## 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
## 401 402 403 407 414 419 420 421 425 426 427 428 433 436 437 439 441 442
## 1 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 0
## 450 458 459 461 462 467 473 479 480 481 483 484 486 491 495 497 501 502
## 1 1 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 0
## 504 507 514 515 522 524 525 527 533 534 538 539 542 545 548 549 551 554
## 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## 572 576 577 578 579 580 581 585 587 595 598 603 605 607 611 616 620 621
## 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0
## 623 625 627 628 632 633 636 638 643 645 647 650 659 661 663 664 666 669
## 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
## 670 672 681 682 687 696
## 1 0 1 1 0 0
cat("\n\nStats for the predicted values from the test set are as follows:\n\n")
##
##
## Stats for the predicted values from the test set are as follows:
table(predictedY_hatValuesCancerData)
## predictedY_hatValuesCancerData
## 0 1
## 129 75
cat("\n\nResulting Confusion Matrix of the Regression Model:\n")
##
##
## Resulting Confusion Matrix of the Regression Model:
confusionMatrix(
factor(testingCancerData$Class, levels=0:1),
factor(predictedY_hatValuesCancerData, levels=0:1)
)#$overall[[1]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 128 5
## 1 1 70
##
## Accuracy : 0.9706
## 95% CI : (0.9371, 0.9891)
## No Information Rate : 0.6324
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.936
## Mcnemar's Test P-Value : 0.2207
##
## Sensitivity : 0.9922
## Specificity : 0.9333
## Pos Pred Value : 0.9624
## Neg Pred Value : 0.9859
## Prevalence : 0.6324
## Detection Rate : 0.6275
## Detection Prevalence : 0.6520
## Balanced Accuracy : 0.9628
##
## 'Positive' Class : 0
##
From the table above, the logistic regression model predicts 129 benign and 75 malignant tissue samples in the test set.
#Building the random forest model
set.seed(11175)
modelCancerData.rf <- randomForest(Class ~ ., data = trainingCancerData)
#Predicting the results probabilities
#If type argument is kept "prob", then probabilities of getting a both 0 and 1
#is the output
y_hatCancerData.rf <- predict(modelCancerData.rf,
testingCancerData ,type = "prob")
#In our example, the threshold is assumed as 0.5. Therefore, any probability
#above 0.5 is considered to be 1 otherwise 0.
#Translating the prediction probabilities based on threshold value into actual value
#Since in the above calculation of y_hatCancerData.rf probabilities of getting both 0
#and 1 is returned, we will consider only the probabilities of getting a 1, which is
#the second column of the data set. Hence, [,2].
predictedY_hatValuesCancerData.rf <- y_hatCancerData.rf[,2]
predictedY_hatValuesCancerData.rf[predictedY_hatValuesCancerData.rf > 0.5] = 1
predictedY_hatValuesCancerData.rf[predictedY_hatValuesCancerData.rf <= 0.5] = 0
#Printing the predicted values:
cat("\n\nResults when fitting a random forest model\n\n")
##
##
## Results when fitting a random forest model
cat("\n\nThe predicted values from the test set are as follows:\n\n")
##
##
## The predicted values from the test set are as follows:
predictedY_hatValuesCancerData.rf
## 2 5 21 22 26 28 29 30 34 36 39 44 49 50 51 55 59 75
## 1 0 1 1 1 0 0 0 0 0 1 1 0 1 1 1 1 1
## 83 88 91 94 96 101 104 105 106 109 110 111 116 125 126 128 130 131
## 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0
## 132 133 134 135 142 145 147 154 155 158 161 163 166 169 172 178 187 189
## 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1
## 190 191 199 202 206 209 212 215 222 224 226 232 233 237 238 244 249 251
## 0 1 0 1 1 0 1 1 1 1 0 1 1 1 1 0 0 0
## 257 260 261 264 280 282 283 287 289 297 299 300 301 302 308 313 314 319
## 0 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1 0 0
## 323 329 332 336 344 347 352 361 366 372 373 381 388 389 391 392 396 398
## 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
## 401 402 403 407 414 419 420 421 425 426 427 428 433 436 437 439 441 442
## 1 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 0
## 450 458 459 461 462 467 473 479 480 481 483 484 486 491 495 497 501 502
## 1 1 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0
## 504 507 514 515 522 524 525 527 533 534 538 539 542 545 548 549 551 554
## 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## 572 576 577 578 579 580 581 585 587 595 598 603 605 607 611 616 620 621
## 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0
## 623 625 627 628 632 633 636 638 643 645 647 650 659 661 663 664 666 669
## 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
## 670 672 681 682 687 696
## 1 0 1 1 0 0
cat("\n\nStats for the predicted values from the test set are as follows:\n\n")
##
##
## Stats for the predicted values from the test set are as follows:
table(predictedY_hatValuesCancerData.rf)
## predictedY_hatValuesCancerData.rf
## 0 1
## 129 75
cat("\n\nResulting Confusion Matrix of the Random Forest Model:\n")
##
##
## Resulting Confusion Matrix of the Random Forest Model:
confusionMatrix(
factor(testingCancerData$Class, levels=0:1),
factor(predictedY_hatValuesCancerData.rf, levels=0:1)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 129 4
## 1 0 71
##
## Accuracy : 0.9804
## 95% CI : (0.9506, 0.9946)
## No Information Rate : 0.6324
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9574
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 1.0000
## Specificity : 0.9467
## Pos Pred Value : 0.9699
## Neg Pred Value : 1.0000
## Prevalence : 0.6324
## Detection Rate : 0.6324
## Detection Prevalence : 0.6520
## Balanced Accuracy : 0.9733
##
## 'Positive' Class : 0
##
From the table above, the random forest model predicts 129 benign and 75 malignant tissue samples in the test set.
#Using the pROC library for the ROC function
# Evaluation of the logistic regression model
#making a replica of the testing data set
dat <- testingCancerData
#adding a column with the thresholded (0/1) predictions; note that passing
#the raw probabilities (y_hatCancerData) instead would trace a smoother ROC curve
dat$prob <- predictedY_hatValuesCancerData
#calculating the ROC of logistic regression
roc.lr <- roc(Class ~ prob, data = dat)
cat("\n\nROC output for logistic regression model:\n\n")
##
##
## ROC output for logistic regression model:
roc.lr
##
## Call:
## roc.formula(formula = Class ~ prob, data = dat)
##
## Data: prob in 133 controls (Class 0) < 71 cases (Class 1).
## Area under the curve: 0.9742
#plotting the ROC
plot(roc.lr, main="ROC curve for logistic regression model")
# Evaluation of the random forest model
#making a replica of the testing data set
dat <- testingCancerData
#adding a column with the thresholded (0/1) random forest predictions
#(the raw probabilities in y_hatCancerData.rf[,2] would give a smoother curve)
dat$prob <- predictedY_hatValuesCancerData.rf
#calculating the ROC of the random forest model
roc.rf <- roc(Class ~ prob, data = dat)
cat("\n\nROC output for logistic regression model:\n\n")
##
##
## ROC output for the random forest model:
roc.rf
##
## Call:
## roc.formula(formula = Class ~ prob, data = dat)
##
## Data: prob in 133 controls (Class 0) < 71 cases (Class 1).
## Area under the curve: 0.985
#plotting the ROC
plot(roc.rf, main="ROC curve for random forest model")
From the above ROC analyses of both models, it is clear that the area under the curve (AUC) for the logistic regression model (97.4%) is less than that for the random forest model (98.5%). This is consistent with the fact that, from the previous question, the accuracy of the latter model is higher than that of the former. Thus, the random forest model is slightly better than the logistic regression model.
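pROC can also test formally whether the two AUCs differ (DeLong’s test for two correlated ROC curves) and overlay the curves for comparison; a sketch:
#DeLong's test for the difference between the two correlated ROC curves;
#a small p-value would indicate a statistically significant AUC difference
roc.test(roc.lr, roc.rf, method = "delong")
#overlaying both curves on one plot for a visual comparison
plot(roc.lr, col = "blue", main = "ROC: logistic regression vs random forest")
lines(roc.rf, col = "red")
legend("bottomright", legend = c("Logistic regression", "Random forest"),
       col = c("blue", "red"), lwd = 2)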
Please answer the questions below by writing a short response.
It is a model which, unlike a linear model, takes into account the non-linear effect of the predictor variables on the response variable. Examples: polynomial regression models and splines.
Advantages: it reduces bias. Therefore, the mean squared error decreases as flexibility increases, given that the model fits the data more closely.
Disadvantages: it increases the variance. This is because the model may overfit the training data (picking up a lot of noise) and will therefore vary a lot on the test dataset.
A more flexible model may be preferred when prediction is the goal and we are not concerned with the relationships between the variables. In that case, we will want our prediction to be as close to the test data as possible.
It is a model which assumes that the relationship between the predictors and the response variable is linear. In actuality, there might not exist a linear relationship between the variables; the assumption is made for the sake of simplicity.
An inflexible model (such as a linear model) may be preferred when inference (as opposed to prediction) is the goal and more emphasis is placed upon finding the relationships amongst the variables.
Suppose we have a dataset with five predictors, X1 = GPA, X2 = IQ, X3 = Degree (1 for B.A. degree holder, and 0 for B.S. degree holder), X4 = interaction between GPA and IQ, and X5 = interaction between GPA and Degree. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model and get B0 = 50, B1 = 20, B2 = 0.07, B3 = 35, B4 = 0.01, and B5 = -10.
The corresponding least squares fit model will be given by the following equation
y = B0 + B1*GPA + B2*IQ + B3*Degree + B4*(GPA*IQ) + B5*(GPA*Degree)
y = 50 + 20*GPA + 0.07*IQ + 35*Degree + 0.01*(GPA*IQ) - 10*(GPA*Degree)
where y = starting salary after graduation in thousands of dollars (response variable)
Answer: Correct; plugging the values into the above equation, we notice that for a B.A. degree holder the Degree value is 1 while for a B.S. degree holder the value is 0. The value of y is clearly higher in the former case provided the values of IQ and GPA are fixed.
Answer: Incorrect; plugging the values into the above equation, we notice that for a B.A. degree holder the Degree value is 1 while for a B.S. degree holder the value is 0. The value of y is clearly lower in the latter case provided the values of IQ and GPA are fixed.
Answer: Cannot say, because of insufficient data. We would need the exact value of the “high enough GPA”.
Answer: Cannot say, because of insufficient data. We would need the exact value of the “high enough GPA”.
Here GPA = 4, IQ= 110, Degree = 1 (since degree is BA)
Plugging in these values in the above equation
y <- 50 + 20*(4) + 0.07*(110) + 35*(1) + 0.01*(110*4) - 10*(4*1)
cat("\n\nSalary of a B.A. with IQ of 110 and a GPA of 4.0 is", y*1000,"USD")
##
##
## Salary of a B.A. with IQ of 110 and a GPA of 4.0 is 137100 USD
Apply boosting, bagging, and random forests to a dataset of your choice that we have used in class. Be sure to fit the models on a training set and evaluate their performance on a test set.
#For this problem I am using the Boston dataset available
#in the MASS package
#Response Variable: medv, Median house value in neighborhoods in Boston, MA
#Predictor Variables: all other variables
bostonData <- Boston
# Set seed for reproducibility
set.seed(11178)
#Splitting training data into training and testing data in 70:30 ratio.
inTrain4 = createDataPartition(bostonData$medv, p = 0.7)[[1]]
#training data set containing 70% of the total values
trainingBostonData = bostonData[ inTrain4,]
#testing data set containing the remaining 30% of the total values
testingBostonData = bostonData[-inTrain4,]
#Model 1: Boosting
set.seed(11177)
#library gbm is required for gradient boosting
library(gbm)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:DAAG':
##
## lung
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
#Building the boosting model
modelBoston.boost <- gbm(medv ~ ., data = trainingBostonData,
distribution = "gaussian",
n.trees = 100, shrinkage = 0.01,
interaction.depth = 8)
#Predicting the results
modelBoston.boost.pred <- predict(modelBoston.boost, testingBostonData, n.trees = 100)
#Model 2: Bagging
library(randomForest)
#here we perform bagging via randomForest() by using all 13 predictors (mtry=13)
#Building the bagging model with a random forest
set.seed(11176)
modelBoston.bag <- randomForest(medv ~ ., data = trainingBostonData, mtry=13, importance=T)
#Predicting the results
modelBoston.bag.pred <- predict(modelBoston.bag, testingBostonData)
#Model 3: Random Forest
#Growing a random forest proceeds in exactly the same way, except
#that we use a smaller value of the mtry argument.
#By default, randomForest() uses p/3 variables when building a
#random forest of regression trees, and sqrt(p) variables when building a
#random forest of classification trees.
#Building the random forest model
set.seed(11175)
modelBoston.rf <- randomForest(medv ~ ., data = trainingBostonData, mtry=6, importance=T)
#Predicting the results
modelBoston.rf.pred <- predict(modelBoston.rf, testingBostonData)
#Model 4: Linear Regression
#Building the linear regression model
set.seed(11174)
modelBoston.lm <- lm(medv ~ ., data = trainingBostonData)
#Predicting the results
modelBoston.lm.pred <- predict(modelBoston.lm, testingBostonData)
#Evaluating the models
#Building an empty data frame to populate the results
resultsBoston <- data.frame(matrix(nrow = 4,ncol = 1))
row.names(resultsBoston) <- c("Boosting", "Bagging with Random Forest",
"Random Forest", "Linear Model")
colnames(resultsBoston) <- c("Mean Squared Error")
#Calculating the Mean Squared Error for each model
resultsBoston[1,1] <- mean((modelBoston.boost.pred-testingBostonData$medv)^2)
resultsBoston[2,1] <- mean((modelBoston.bag.pred-testingBostonData$medv)^2)
resultsBoston[3,1] <- mean((modelBoston.rf.pred-testingBostonData$medv)^2)
resultsBoston[4,1] <- mean((modelBoston.lm.pred-testingBostonData$medv)^2)
resultsBoston
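Since the bagging and random forest models were fit with importance = T, their variable importance measures can also be inspected; a sketch for the random forest model:
#Variable importance from the random forest: %IncMSE is the increase in MSE
#when a predictor is permuted; IncNodePurity is the total decrease in node
#impurity from splits on that predictor
importance(modelBoston.rf)
varImpPlot(modelBoston.rf, main = "Variable importance (random forest)")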
[Reference: http://www2.stat.duke.edu/~rcs46/lectures_2017/08-trees/08-tree-advanced.pdf]
Authors of a paper in the journal Appetite, entitled “Individual differences in bitter taste preferences are associated with antisocial personality traits” (Sagioglou & Greitemeyer, 2015) hired Mechanical Turkers to complete surveys about their taste preferences and personalities. They ran correlation tests and multiple regression analyses. A portion of their regression analyses for two separate experiments are reproduced below (and can also be found within the article).
The authors conclude:
“The present research has demonstrated that bitter taste preferences are associated with more pronounced malevolent personality traits, especially robustly with everyday sadism. The sample was a large community sample, thereby representing a wide selection of the population. . . In establishing a robust link between taste preferences and personality traits, this research reveals 5 further real-world behavioral correlates of antisocial personality traits” (p. 306).
Provide a short critical analysis (up to 500 words max) of this study, taking into consideration ethics, empirical frameworks, regression analysis, and any other factors that stand out to you. In order to receive full points for this problem, you will likely need to at least skim the full article, but it is possible to receive partial credit based only on the information I provided above. Think critically and be creative!
I understand the premise behind this research. However, in my opinion, this inferential research should not be used as a guide to determine a person’s personality based on their taste preferences. I say this because the study is quite prone to confounding. It must also be noted that nowhere in either study have the authors tried to evaluate their model: we do not know for sure how accurate it is, and no test for confounding variables has been done either. Another noteworthy point the authors have missed, in my opinion, is that their results can be greatly influenced by outliers. For example, some respondents who like bitter tastes AND score high on each of the three factors of the dark triad might have greater leverage on the overall model than other observations. Most importantly, the authors have performed a multiple linear regression without establishing the assumptions under which multiple linear regression is valid. Thus, empirically, the analysis is very likely to be prone to bias.
I couldn’t help but notice that about 445 people out of a total of 953 respondents had at least an undergraduate degree, so roughly half of the respondents can be regarded as well educated. More educated people tend to interact with more cultures and may have tried bitter foods out of curiosity and come to like them. Also, we have not established what exactly counts as a bitter taste: some people might find coffee bitter, while others may describe dark chocolate as bitter, and certain hot peppers are also described as tasting bitter. Thus, we do not know for sure whether all the respondents have the same idea of bitterness.
Also, while interpreting the results, care must be taken to remember that this is an inferential study and that correlation does not mean causation. Thus, bitter taste preference DOES NOT cause pronounced malevolent personality traits.
Thus, I would regard this study as mere general information and would not form any opinions about people’s personalities based on their taste preferences. That would be an incredibly stupid thing to do!