# Set the working directory
setwd("C:/Users/mws16/OneDrive/Desktop/OPIM-5603-Statistics in Business Analytics/Homework 4")
Part I
Q1: Categorical Predictor Variables in Regression
Use the BirthSmokers.csv
The BirthSmokers dataset is a random sample of 32 births from a study conducted on pregnant women. It has three variables: Wgt (birth weight of the baby in grams), Gest (length of gestation in weeks) and Smoke (smoking status of the mother, yes/no).
# read in the file
birthsmokers <- read.csv("BirthSmokers.csv")
# Build a linear regression
fit1a <- lm(Wgt ~ .,
data = birthsmokers)
summary(fit1a)
##
## Call:
## lm(formula = Wgt ~ ., data = birthsmokers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.693 -92.063 -9.365 79.663 197.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2389.573 349.206 -6.843 1.63e-07 ***
## Gest 143.100 9.128 15.677 1.07e-15 ***
## Smokeyes -244.544 41.982 -5.825 2.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
## F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15
# Save predicted values to data table
birthsmokers$Preds <- fit1a$fitted.values
birthsmokers
# Plot the residuals vs the fitted values
plot(y = fit1a$residuals,
x = fit1a$fitted.values,
main = "Residuals vs Fitted values",
xlab = "Fitted values",
ylab = "Residuals")
abline(h = 0) # horizontal reference line: points on this line are predicted exactly
# add the regression line
abline(lm(fit1a$residuals ~ fit1a$fitted.values),
col = "red")
Because Smoke is a categorical predictor with two levels, R creates a single dummy variable for the non-baseline level and reports one coefficient for it. Here the estimate is Smokeyes = -244.544, the estimated change in birth weight (in grams) when the mother smokes, holding gestation length fixed.
Regression equation for smoking mothers (Smoke = yes, so Smokeyes = 1): Wgt = -2389.573 + 143.100 * Gest - 244.544 = -2634.117 + 143.100 * Gest
Regression equation for non-smoking mothers (Smoke = no, so Smokeyes = 0): Wgt = -2389.573 + 143.100 * Gest
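As a quick check of these two equations, we can predict the birth weight for a smoking and a non-smoking mother at the same gestation length (a minimal sketch; it assumes Smoke is stored with the values "yes"/"no", as the dummy name Smokeyes suggests):
# hypothetical new observations: 38 weeks of gestation, one smoking and one non-smoking mother
newmoms <- data.frame(Gest = 38, Smoke = c("yes", "no"))
predict(fit1a, newdata = newmoms)
# the two predictions differ by the Smokeyes coefficient, about -244.5 g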
Q2: Interaction Terms
Use the depression.csv
We want to study the effectiveness of three treatments A, B, and C for severe depression. The researchers collected depression data on a random sample of 36 severely depressed individuals. The response variable is effect ("effectiveness"): a higher number means the treatment was more effective, a lower number a less effective treatment. We also have the age of each patient and the treatment (TRT) that they received.
# read in the data
depression <- read.csv("depression.csv")
# Build a linear regression
fit2a <- lm(effect ~ .,
data = depression)
summary(fit2a)
##
## Call:
## lm(formula = effect ~ ., data = depression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5732 -3.3922 0.9829 3.9613 9.5062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.54335 3.58105 9.088 2.23e-10 ***
## age 0.66446 0.06978 9.522 7.42e-11 ***
## TRTB -9.80758 2.46471 -3.979 0.000371 ***
## TRTC -10.25276 2.46542 -4.159 0.000224 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.035 on 32 degrees of freedom
## Multiple R-squared: 0.784, Adjusted R-squared: 0.7637
## F-statistic: 38.71 on 3 and 32 DF, p-value: 9.287e-11
# Save the predictions to data table
depression$preds <- fit2a$fitted.values
depression
Regression Equation: effect = 32.54335 + 0.66446 * age - 9.80758 * TRTB - 10.25276 * TRTC
When a predictor variable is a factor (categorical) with n possible values, R creates n - 1 dummy variables and reports estimates only for those n - 1 levels. The omitted level is the baseline and is absorbed into the intercept: when an observation takes the omitted value, all of its dummy variables are 0, so its prediction comes from the intercept and the remaining predictors alone. Here treatment A is the baseline, so TRTB and TRTC measure how much lower the effectiveness is for treatments B and C than for A at a given age.
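To see exactly which dummy variables R creates, we can inspect the design matrix directly (a quick sketch, assuming the treatment column is named TRT with levels A, B and C, as the dummy names TRTB and TRTC suggest):
# inspect the indicator columns R builds for the factor TRT; the first level is the omitted baseline
head(model.matrix(~ TRT, data = depression))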
# Build regression using interaction term
fit2c <- lm(effect ~ age * TRT,
data = depression)
summary(fit2c)
##
## Call:
## lm(formula = effect ~ age * TRT, data = depression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4366 -2.7637 0.1887 2.9075 6.5634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.51559 3.82523 12.422 2.34e-13 ***
## age 0.33051 0.08149 4.056 0.000328 ***
## TRTB -18.59739 5.41573 -3.434 0.001759 **
## TRTC -41.30421 5.08453 -8.124 4.56e-09 ***
## age:TRTB 0.19318 0.11660 1.657 0.108001
## age:TRTC 0.70288 0.10896 6.451 3.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared: 0.9143, Adjusted R-squared: 0.9001
## F-statistic: 64.04 on 5 and 30 DF, p-value: 4.264e-15
The interaction term age * TRT lets each treatment have its own slope: the interaction coefficients give the deviation of each treatment's slope from the baseline slope, just as the main-effect dummies give the deviation of each treatment's intercept from the baseline intercept.
In simple words, when a categorical and a continuous variable interact, the slope of the continuous variable depends on the level of the categorical variable. In the model above, the slope of effect with respect to age is 0.33051 for treatment A (the baseline), 0.33051 + 0.19318 = 0.52369 for treatment B, and 0.33051 + 0.70288 = 1.03339 for treatment C.
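These treatment-specific slopes can be computed directly from the fitted coefficients; a small sketch:
# slope of effect vs age for each treatment: baseline slope plus the interaction deviation
b <- coef(fit2c)
c(A = unname(b["age"]),
  B = unname(b["age"] + b["age:TRTB"]),
  C = unname(b["age"] + b["age:TRTC"]))
# roughly 0.33, 0.52 and 1.03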
# evaluate the AIC for both models
AIC(fit2a)
## [1] 237.3518
AIC(fit2c)
## [1] 208.0487
# evaluate regression metrics for both models
mymetrics2a <- hydroGOF::gof(sim = fit2a$fitted.values,
obs = depression$effect)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
# evaluate regression metrics for both models
mymetrics2c <- hydroGOF::gof(sim = fit2c$fitted.values,
obs = depression$effect)
mymetrics2a
## [,1]
## ME 0.00
## MAE 4.72
## MSE 32.38
## RMSE 5.69
## NRMSE % 45.80
## PBIAS % 0.00
## RSR 0.46
## rSD 0.89
## NSE 0.78
## mNSE 0.53
## rNSE 0.65
## d 0.94
## md 0.75
## rd 0.90
## cp 0.89
## r 0.89
## R2 0.78
## bR2 0.78
## KGE 0.84
## VE 0.91
mymetrics2c
## [,1]
## ME 0.00
## MAE 2.98
## MSE 12.84
## RMSE 3.58
## NRMSE % 28.90
## PBIAS % 0.00
## RSR 0.29
## rSD 0.96
## NSE 0.91
## mNSE 0.70
## rNSE 0.90
## d 0.98
## md 0.85
## rd 0.97
## cp 0.95
## r 0.96
## R2 0.91
## bR2 0.91
## KGE 0.94
## VE 0.95
The model with the interaction term (fit2c) has a higher R-squared (0.9143 vs 0.784), a lower AIC (208.05 vs 237.35), and lower MAE, MSE and RMSE, so it clearly performs better.
The interaction term does a better job of explaining the data because it allows the effect of age on effectiveness to differ across treatments, instead of forcing a single common age slope for all three treatments.
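Because fit2a is nested inside fit2c (it is the same model without the interaction terms), a partial F-test offers one more way to confirm that the interaction adds explanatory power; a brief sketch:
# nested model comparison: does adding age:TRT significantly reduce the residual sum of squares?
anova(fit2a, fit2c)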
Q3: Detecting Multicollinearity
Use the Cement.csv
The cement dataset gives the heat evolved, in calories per gram, during the hardening of cement. The four predictors are the percentages of four ingredients: tricalcium aluminate, tricalcium silicate, tetracalcium alumino ferrite, and dicalcium silicate.
cement <- read.csv("Cement.csv")
# scatterplot matrix
library(psych)
# pearson correlation
pairs.panels(cement,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
In the Pearson correlation matrix it can be seen that all the predictor variables have a meaningful correlation with the target variable Heat. However, Trical_sil and Dical_Sil have a strong negative correlation with each other, so keeping both would introduce multicollinearity into the regression model; they largely repeat the same information. Preferably, one of these variables should be dropped when building the regression.
Similarly, Trical_Alum and Tetracal_AlumFer also appear to be correlated, although not as strongly as the other pair. Whether dropping one of these variables improves the regression still needs to be checked.
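The same correlations can also be read off numerically, which is sometimes easier than the panel plot; a quick sketch:
# numeric Pearson correlation matrix for the cement data
round(cor(cement), 2)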
# Build a linear regression
fit3b <- lm(Heat ~ .,
data = cement)
summary(fit3b)
##
## Call:
## lm(formula = Heat ~ ., data = cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1750 -1.6709 0.2508 1.3783 3.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.4054 70.0710 0.891 0.3991
## Trical_Alum 1.5511 0.7448 2.083 0.0708 .
## Trical_sil 0.5102 0.7238 0.705 0.5009
## Tetracal_AlumFer 0.1019 0.7547 0.135 0.8959
## Dical_Sil -0.1441 0.7091 -0.203 0.8441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
## F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
# import library car
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
# Find the Variance Inflation Factor
vif(fit3b)
## Trical_Alum Trical_sil Tetracal_AlumFer Dical_Sil
## 38.49621 254.42317 46.86839 282.51286
All of the variables have high VIF values. For a variable to be retained in the model, its VIF should be as low as possible (ideally sqrt(VIF) <= 3, and at most 5). Let's try building a model without Dical_Sil, the variable with the highest VIF.
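As a reminder of what VIF measures, the value for a single predictor can also be computed by hand by regressing that predictor on the other predictors; a sketch for Dical_Sil:
# VIF for Dical_Sil = 1 / (1 - R^2) from regressing it on the remaining predictors
r2 <- summary(lm(Dical_Sil ~ Trical_Alum + Trical_sil + Tetracal_AlumFer,
                 data = cement))$r.squared
1 / (1 - r2) # should match vif(fit3b)["Dical_Sil"], roughly 282.5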
# Fit another regression dropping the variable Dical_Sil
fit3c <- lm(Heat ~ . - Dical_Sil,
data = cement)
summary(fit3c)
##
## Call:
## lm(formula = Heat ~ . - Dical_Sil, data = cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2543 -1.4726 0.1755 1.5409 3.9711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.19363 3.91330 12.315 6.17e-07 ***
## Trical_Alum 1.69589 0.20458 8.290 1.66e-05 ***
## Trical_sil 0.65691 0.04423 14.851 1.23e-07 ***
## Tetracal_AlumFer 0.25002 0.18471 1.354 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.312 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.3 on 3 and 9 DF, p-value: 3.367e-08
# Check the VIF values
vif(fit3c)
## Trical_Alum Trical_sil Tetracal_AlumFer
## 3.251068 1.063575 3.142125
Now the VIF values are low, and according to the p-values the variables are also significant in the regression. The exception is Tetracal_AlumFer, which is not significant but has a reasonably low VIF. Let's build a model without this variable as well and then compare the performance of the two models.
# Fit another regression dropping both Tetracal_AlumFer and Dical_Sil
fit3c_new <- lm(Heat ~ . - Dical_Sil - Tetracal_AlumFer,
data = cement)
summary(fit3c_new)
##
## Call:
## lm(formula = Heat ~ . - Dical_Sil - Tetracal_AlumFer, data = cement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.893 -1.574 -1.302 1.363 4.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735 2.28617 23.00 5.46e-10 ***
## Trical_Alum 1.46831 0.12130 12.11 2.69e-07 ***
## Trical_sil 0.66225 0.04585 14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09
# Find the AIC for models with and without the Tetracal_AlumFer
AIC(fit3c)
## [1] 63.9036
AIC(fit3c_new)
## [1] 64.31239
After removing the variable Tetracal_AlumFer, the R-squared decreases and the residual standard error increases slightly, and the second model also has a higher AIC. Hence only the variable Dical_Sil should be removed when building a regression model for Heat.
Part II
Q4: Logistic Regression
Use Smarket: S&P Stock Market Data. https://rdrr.io/cran/ISLR/man/Smarket.html
These are the daily percentage returns for the S&P 500 stock index between 2001 and 2005. You can download this from within R in the ISLR package.
# install.packages("ISLR")
# import library ISLR
library(ISLR)
# read in the Smarket data
smarket <- Smarket
smarket
str(smarket$Direction)
## Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
# Recode the target variable
smarket$Direction <- as.character(smarket$Direction)
smarket$Direction[smarket$Direction == "Up"] <- 1
smarket$Direction[smarket$Direction == "Down"] <- 0
str(smarket$Direction)
## chr [1:1250] "1" "1" "0" "1" "1" "1" "0" "1" "1" "1" "0" "0" "1" "1" ...
smarket$Direction <- as.factor(smarket$Direction)
fit4a <- glm(Direction ~ . - Year - Today,
data = smarket,
family = binomial(link = "logit"))
summary(fit4a)
##
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial(link = "logit"),
## data = smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
# predict the probability of the market moving Up (Direction = 1) for each observation
mypreds4a <- predict(fit4a, # my model
newdata = smarket, # dataset
type = "response") # to get probabilities
summary(mypreds4a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4084 0.5020 0.5180 0.5184 0.5338 0.6486
smarket$preds <- round(mypreds4a)
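To see how these rounded probabilities behave as class predictions, we can cross-tabulate them against the recoded Direction and compute the in-sample accuracy (a minimal sketch using the columns created above):
# confusion matrix: predicted class (rounded probability) vs actual direction
table(predicted = smarket$preds, actual = smarket$Direction)
# overall in-sample accuracy
mean(smarket$preds == as.numeric(as.character(smarket$Direction)))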
# get the coefficients
coef(fit4a)
## (Intercept) Lag1 Lag2 Lag3 Lag4
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938
## Lag5 Volume
## 0.010313068 0.135440659
In a logit model the coefficients are on the log-odds scale: each coefficient is the change in the log odds of the market moving Up (Direction = 1) for a one-unit increase in the corresponding predictor, holding the other predictors fixed.
To interpret the coefficients more directly, let's exponentiate them so that they become odds ratios.
# exponentiate the coefficients to correctly interpret them as there is log in the output (because we are using logit)
exp(coef(fit4a))
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.8816146 0.9295323 0.9585809 1.0111468 1.0094029 1.0103664
## Volume
## 1.1450412
An exponentiated coefficient greater than 1 indicates a direct relationship with the odds of the market moving Up, and a value less than 1 indicates an inverse relationship. For example, a one-unit increase in Volume multiplies the odds of an Up day by about 1.145, while a one-unit increase in Lag1 multiplies them by about 0.930 (i.e., lowers the odds). Note, however, that none of these predictors is statistically significant in this model.
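The probability for any individual day can also be recovered from the linear predictor by applying the inverse logit; a small sketch for the first observation:
# linear predictor (log odds) for the first observation ...
eta <- predict(fit4a, newdata = smarket[1, ], type = "link")
# ... converted back to a probability; matches predict(..., type = "response")
1 / (1 + exp(-eta))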
Q5: Poisson Regression
Use the AmphibianRoadKills.csv
You are given amphibian road-kill data, which gives the count of the number of animals killed at different points along a 25000 mile road.
options(scipen=999) # to avoid scientific notation
# read in the data
arkill <- read.csv("AmphibianRoadKills.csv")
arkill
# build the regression model with family = poisson
fit5a <- glm(death ~ distance,
data = arkill,
family=poisson())
# check the model built
summary(fit5a)
##
## Call:
## glm(formula = death ~ distance, family = poisson(), data = arkill)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.1100 -1.6950 -0.4708 1.4206 7.3337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.316484980 0.043219983 99.87 <0.0000000000000002 ***
## distance -0.000105851 0.000004387 -24.13 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1071.4 on 51 degrees of freedom
## Residual deviance: 390.9 on 50 degrees of freedom
## AIC: 634.29
##
## Number of Fisher Scoring iterations: 4
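Before switching to a quasi-Poisson model, it is worth checking how much overdispersion there is; a rough check based on the residual deviance:
# residual deviance relative to residual degrees of freedom
# (values well above 1 indicate overdispersion; here roughly 390.9 / 50, i.e. about 7.8)
deviance(fit5a) / df.residual(fit5a)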
#install.packages("qcc")
# import library qcc (handy for overdispersion checks; the quasi-Poisson model itself only needs base R's glm)
library(qcc)
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
# build regression using family = quasipoisson
fit5b <- glm(death ~ distance,
data = arkill,
family = quasipoisson())
# check the model built
summary(fit5b)
##
## Call:
## glm(formula = death ~ distance, family = quasipoisson(), data = arkill)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.1100 -1.6950 -0.4708 1.4206 7.3337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.31648498 0.11938536 36.156 < 0.0000000000000002 ***
## distance -0.00010585 0.00001212 -8.735 0.0000000000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 7.630148)
##
## Null deviance: 1071.4 on 51 degrees of freedom
## Residual deviance: 390.9 on 50 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
metrics5a <- hydroGOF::gof(sim = fit5a$fitted.values,
obs = fit5a$y)
metrics5b <- hydroGOF::gof(sim = fit5b$fitted.values,
obs = fit5b$y)
metrics5a
## [,1]
## ME 0.00
## MAE 10.40
## MSE 282.67
## RMSE 16.81
## NRMSE % 69.30
## PBIAS % 0.00
## RSR 0.69
## rSD 0.79
## NSE 0.51
## mNSE 0.46
## rNSE -1.50
## d 0.83
## md 0.70
## rd 0.13
## cp 0.16
## r 0.72
## R2 0.52
## bR2 0.41
## KGE 0.65
## VE 0.60
metrics5b
## [,1]
## ME 0.00
## MAE 10.40
## MSE 282.67
## RMSE 16.81
## NRMSE % 69.30
## PBIAS % 0.00
## RSR 0.69
## rSD 0.79
## NSE 0.51
## mNSE 0.46
## rNSE -1.50
## d 0.83
## md 0.70
## rd 0.13
## cp 0.16
## r 0.72
## R2 0.52
## bR2 0.41
## KGE 0.65
## VE 0.60
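The two sets of metrics are identical because the quasi-Poisson model keeps the same mean structure and fitted values as the Poisson model; it only rescales the standard errors by the estimated dispersion (about 7.63 here). A quick sketch to confirm:
# identical point estimates for the two fits ...
cbind(poisson = coef(fit5a), quasipoisson = coef(fit5b))
# ... but the quasi-Poisson standard errors are inflated by sqrt of the dispersion
sqrt(summary(fit5b)$dispersion)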
# plot the residuals vs fitted values
plot(y = fit5a$residuals,
x = fit5a$fitted.values,
main = "Residuals vs Fitted values",
xlab = "Fitted values",
ylab = "Residuals")
abline(0, 0) # horizontal reference line at y = 0
# add the regression line
abline(lm(fit5a$residuals ~ fit5a$fitted.values),
col = "red")
Looking at the residuals vs fitted values scatter plot, the residuals lie on both sides of the zero line, but the range of the underpredictions (positive residuals) is much larger than the range of the overpredictions. This indicates there might be some outliers in the data.
# plot residuals vs the actual distances
plot(y = fit5a$residuals,
x = arkill$distance,
main = "Residuals vs Distance",
xlab = "Distance",
ylab = "Residuals")
abline(0, 0) # horizontal reference line at y = 0
In the Residuals vs. Distance scatterplot above, on the whole the model overpredicts more frequently than it underpredicts.
hist(fit5a$residuals,
main = "Residuals frequency",
xlab = "Residual values",
ylab = "Frequency")
The histogram above reinforces the observation that the model overpredicts more often than it underpredicts.