library(HSAUR3)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ISLR)
library(boot)
library("PerformanceAnalytics")
Use the bladdercancer data from the HSAUR3 library to answer the following questions.
Construct graphical and numerical summaries that will show the relationship between tumor size and the number of recurrent tumors. Discuss your discovery. (Hint: mosaic plot may be the best way to assess this)
Judging by the below mosaic plot the observed frequency for 1 or 2 tumors greater than 3 cm is lower than expected and the observed frequency for 3 or 4 tumors less than or equal to 3 cm is also lower than expected.
data(bladdercancer)
#build a mosaic plot with base r for the bladder cancer data
mosaicplot(xtabs(~ number + tumorsize, data = bladdercancer),
main='Number of Tumors to Tumor Size in Bladder Cancer Patients',
xlab='Numbr of Tumors',
ylab='Size of Tumor in cm',
shade=T)
The histogram shows a non-normal distrobution for both tumor sizes greater than 3 cm and less than or equal to 3 cm. The histogram is centered around one tumor and the frequency of more tumors is substantially less than a patient having just one tumor.
#use ggplot to build a histogram and label the axis values
ggplot(bladdercancer, aes(number, fill = tumorsize)) +
geom_histogram(binwidth=.5, position="dodge") +
labs(title='Tumor Size in Bladder Patients',
x='Number of Tumors',
y='Count')
The frequency table justifies the distribution of the histogram. The highest counts for both tumor size factors come at having just one tumor and significantly decrease as the number of tumors increase.
#create a frequency table in data.frame format by grouping tumorsise and number; summarizing the groups with a count, and then spreading the number variable into columns
bladdercancer %>%
group_by(tumorsize,number) %>%
summarize(freq = n()) %>%
spread(number,freq,sep='_of_tumors_')
## # A tibble: 2 x 5
## # Groups: tumorsize [2]
## tumorsize number_of_tumors_1 number_of_tumors_2 number_of_tumors_3
## * <fctr> <int> <int> <int>
## 1 <=3cm 15 5 1
## 2 >3cm 5 2 1
## # ... with 1 more variables: number_of_tumors_4 <int>
Build a Poisson regression that estimates the effect of size of tumor on the number of recurrent tumors. Discuss your results.
#build three separate models with the glm function
mod1_p1 <- glm(number ~time + tumorsize + tumorsize*time,data=bladdercancer,family=poisson(link=log))
mod2_p1 <- glm(number ~ time + tumorsize,data=bladdercancer,family=poisson())
mod3_p1 <- glm(number ~ tumorsize,data=bladdercancer,family=poisson())
It is evident from the first model containing time, tumor size, and the interaction between tumor size and time that none of the variables are significant. However, the residual deviance and null deviance are quite low compared to the degrees of freedom
summary(mod1_p1)
##
## Call:
## glm(formula = number ~ time + tumorsize + tumorsize * time, family = poisson(link = log),
## data = bladdercancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6943 -0.5581 -0.2413 0.2932 1.4644
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03957 0.43088 0.092 0.927
## time 0.02138 0.02418 0.884 0.377
## tumorsize>3cm 0.46717 0.66713 0.700 0.484
## time:tumorsize>3cm -0.01676 0.03821 -0.439 0.661
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12.800 on 30 degrees of freedom
## Residual deviance: 11.566 on 27 degrees of freedom
## AIC: 90.377
##
## Number of Fisher Scoring iterations: 4
Interestingly, if we take out the interaction between tumor size and tim our AIC actually drops. Our resiual and null deviance are approximately the same but there is still no significance with the time or tumor size.
summary(mod2_p1)
##
## Call:
## glm(formula = number ~ time + tumorsize, family = poisson(),
## data = bladdercancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8183 -0.4753 -0.2923 0.3319 1.5446
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.14568 0.34766 0.419 0.675
## time 0.01478 0.01883 0.785 0.433
## tumorsize>3cm 0.20511 0.30620 0.670 0.503
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12.800 on 30 degrees of freedom
## Residual deviance: 11.757 on 28 degrees of freedom
## AIC: 88.568
##
## Number of Fisher Scoring iterations: 4
With model 3 we drop the time variable and notice that our intercept is now significant at the 5% level however, tumor size is still not significant. Our residual and null deviance values are still low compared to the degrees of freedom and our AIC drops further.
summary(mod3_p1)
##
## Call:
## glm(formula = number ~ tumorsize, family = poisson(), data = bladdercancer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6363 -0.3996 -0.3996 0.4277 1.7326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3747 0.1768 2.120 0.034 *
## tumorsize>3cm 0.2007 0.3062 0.655 0.512
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12.80 on 30 degrees of freedom
## Residual deviance: 12.38 on 29 degrees of freedom
## AIC: 87.191
##
## Number of Fisher Scoring iterations: 4
#model1 vs model2 vs model 3
anova(mod1_p1,mod2_p1,mod3_p1,test='Chisq')
## Analysis of Deviance Table
##
## Model 1: number ~ time + tumorsize + tumorsize * time
## Model 2: number ~ time + tumorsize
## Model 3: number ~ tumorsize
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 27 11.566
## 2 28 11.757 -1 -0.19095 0.6621
## 3 29 12.380 -1 -0.62363 0.4297
The output indicates that we can reject the null hypothesis and that a more complicated model may work to explain the data. However, my assumption is that we are simply trying to explain noise and are overfitting by adding additional variables and interaction terms. Additionally, there is no statistical significance of the variables explaining the affect on th number of tumors in bladder cancer patients.
1-pchisq((mod1_p1$null.deviance-mod1_p1$deviance),(mod1_p1$df.null-mod1_p1$df.residual))
## [1] 0.7448414
Comparing deviance on model 2 further justifies my assumption from part 5. We removed the interaction term and the deviance test is reduced to 0.59 which signfifies the model is getting closer to the null hypothesis with no statistical significance between time or tumor size and the number of tumors in bladder cancer patients.
1-pchisq((mod2_p1$null.deviance-mod2_p1$deviance),(mod2_p1$df.null-mod2_p1$df.residual))
## [1] 0.5935891
After we compare our deviance on the third model we notice that this model is no different than the null model and we accept the null hpothesis. There is no interaction between tumor size and the number of tumors with this data set.
1-pchisq((mod3_p1$null.deviancee-mod3_p1$deviance),(mod3_p1$df.null-mod3_p1$df.residual))
## numeric(0)
From the above analysis I accept the null hypothesis that there is nothing within the data to explain an increase in the number of tumors. Neither time nor the tumor size have any affect on increasing the number of tumors. This can be further proved through part one and interpreting the distribution of the data.
The following data is the number of new AIDS cases in Belgium between the years 1981-1993. Let t denote time y < c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240) t <- 1:13
Plot the relationship between #AIDS cases against time. Comment on the plot
The scatter plot indicates a clear relationship. As time increases so does the number of AIDs cases. The relationship peaks at year 11 and then begins to decline.
y <- c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
t <- 1:13
data <- data.frame(number=y,time=t)
#build a scatterplot with ggplot2 and label the axis
ggplot(data,aes(x=time,y=number)) +
geom_point() +
labs(title='Scatterplot of AIDs Cases From 1981 to 1993',
x = 'Time in Years From 1981',
y='Number of AIDs Cases')
Fit a Poisson regression model \(log(\mu_i) = \beta_0 + \beta_1 t_i\). Comment on the model parameters and residuals (deviance) vs Fitted plot.
model_p2a <- glm(number ~ time, data=data,family=poisson())
After reviewing the residual deviance of the model we can determine that the model is overdispersed by a factor of about 7. 80.686 on 11 degrees of freedom.
summary(model_p2a)
##
## Call:
## glm(formula = number ~ time, family = poisson(), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6784 -1.5013 -0.2636 2.1760 2.7306
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.140590 0.078247 40.14 <2e-16 ***
## time 0.202121 0.007771 26.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 872.206 on 12 degrees of freedom
## Residual deviance: 80.686 on 11 degrees of freedom
## AIC: 166.37
##
## Number of Fisher Scoring iterations: 4
The residuals vs fitted plot further proves that the data is overdispersed. At time 1, 2, and 13 the residual values are far from zero which idicates they are outliers. Additionally, there is a clear pattern to the residual plot which indicates that mean does not increase as the variance increase because there is not a constant spread in the residuals.
#call the plot funtion on the model selecting only the residual vs fitted plot
plot(model_p2a,which=1)
Now add a quadratic term in time (i.e., \(log(\mu_i) = \beta_0 + \beta_1 t_i + \beta_2 t^2_i\) ) and fit the model. Comment on the model parameters and assess the residual plots.
model_p2b <- glm(number ~ time + poly(time,degree=2),data=data,family=poisson())
After adding the 2nd order polynomial covariate to the model the residual deviance to the degrees of freedom now indicate there is no overdispersion.
summary(model_p2b)
##
## Call:
## glm(formula = number ~ time + poly(time, degree = 2), family = poisson(),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.45903 -0.64491 0.08927 0.67117 1.54596
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.64858 0.11323 23.390 < 2e-16 ***
## time 0.25716 0.01167 22.038 < 2e-16 ***
## poly(time, degree = 2)1 NA NA NA NA
## poly(time, degree = 2)2 -0.95511 0.11896 -8.029 9.82e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 872.2058 on 12 degrees of freedom
## Residual deviance: 9.2402 on 10 degrees of freedom
## AIC: 96.924
##
## Number of Fisher Scoring iterations: 4
The plot further justifies my thought that the model is not overdispersed and explains the data due to the residuals being evenly spread on the plot.
#call the plot funtion on the model selecting only the residual vs fitted plot
plot(model_p2b,which=1)
Compare the two models using AIC, which is better? Answer - Model 2’s AIC is clearly lower, 96.23 vs 166.37, which further tells us that the second model with the 2nd order polynomial covariate better explains the data.
model_p2a$aic
## [1] 166.3698
model_p2b$aic
## [1] 96.92358
Use anova()-function to perform \(\chi^2\) test for model selection. Did adding the quadratic term improve model? Answer - The second order polynomial improved the model. The p value shows statistical significance in the second model thus we reject the null hypothesis that there is no difference in models.
anova(model_p2a,model_p2b,test='Chisq')
## Analysis of Deviance Table
##
## Model 1: number ~ time
## Model 2: number ~ time + poly(time, degree = 2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 11 80.686
## 2 10 9.240 1 71.446 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Load the Default dataset from ISLR library. The dataset contains information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt. It is a 4 dimensional dataset with 10000 observations. You had developed a logistic regression model on HW #2. Now consider the following two models
Model1 \(\rightarrow\) Default = Student + balance
Model2 \(\rightarrow\) Default = Balance
For the two competing models do the following
With the whole data compare the two models (Use AIC and/or error rate). Answer - Model 1 has a lower AIC value than model 2. Model 1 also has a lower mean square error than model 2, thus model 1 with the added \(student\) covariate is slightly better than the second model without \(student\) as a covariate.
data(Default)
#change the default variable to binary format
Default <- Default %>%
mutate(default=ifelse(default=='Yes',1,0))
#split the data set into 70% training and 30% test
smp_size <- floor(0.7 * nrow(Default))
#set the seed for reproducibility
set.seed(123)
#create an index for the training set
train_ind <- sample(seq_len(nrow(Default)), size = smp_size)
#subset the default data set by the rows in the trainin data set
train <- Default[train_ind, ]
test <- Default[-train_ind, ]
#build the two models
model_p3a <- glm(default ~ student + balance,data=Default,family=binomial())
model_p3b <- glm(default ~ balance,data=Default,family=binomial())
MSE
mean((predict(model_p3a,Default,type='response')-Default$default)^2)
## [1] 0.02130176
AIC
model_p3a$aic
## [1] 1577.682
MSE
mean((predict(model_p3b,Default,type='response')-Default$default)^2)
## [1] 0.02170579
AIC
model_p3b$aic
## [1] 1600.452
Use validation set approach and choose the best model. Be aware that we have few people who defaulted in the data. Answer - There is little difference between the two models even after adding the covariate \(student\). This suggests that \(student\) status is not a good predicting covariate of defaulting on a credit card. Nevertheless, there is less error and a lower AIC by adding student. An interesting differnce between these statistics and the statistics described without a validation approach is that there is a much lower AIC and means square error.
#use the training data set to rebuild both models
model_p3c <- glm(default ~ student + balance,data=train,family=binomial())
model_p3d <- glm(default ~ balance,data=train,family=binomial())
Summary Statistics
summary(model_p3c)
##
## Call:
## glm(formula = default ~ student + balance, family = binomial(),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5220 -0.1382 -0.0528 -0.0188 3.7802
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.100e+01 4.492e-01 -24.49 < 2e-16 ***
## studentYes -7.169e-01 1.753e-01 -4.09 4.32e-05 ***
## balance 5.914e-03 2.807e-04 21.07 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2130.6 on 6999 degrees of freedom
## Residual deviance: 1109.0 on 6997 degrees of freedom
## AIC: 1115
##
## Number of Fisher Scoring iterations: 8
MSE
mean((predict(model_p3c,test,type='response')-test$default)^2)
## [1] 0.02071291
AIC
model_p3c$aic
## [1] 1115.033
Summary Statistics
summary(model_p3d)
##
## Call:
## glm(formula = default ~ balance, family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3391 -0.1439 -0.0555 -0.0206 3.7937
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.090e+01 4.393e-01 -24.80 <2e-16 ***
## balance 5.673e-03 2.672e-04 21.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2130.6 on 6999 degrees of freedom
## Residual deviance: 1126.7 on 6998 degrees of freedom
## AIC: 1130.7
##
## Number of Fisher Scoring iterations: 8
MSE
mean((predict(model_p3d,test,type='response')-test$default)^2)
## [1] 0.02098329
AIC
model_p3d$aic
## [1] 1130.69
Use LOOCV approach and choose the best model. Answer - Model 1 is the bes model using Leave One Out Cross Validation as indicated by the below two outputs.
###LOOCV
##one
#set n to 100
#this method was usd because my computer crashed every time I used the method outlined in the power point and lecture
n <- 100
#build an empty matrix to loop through
loocv_tmp1 <- matrix(NA, nrow = n, ncol = 1)
for (k in 1:n) {
#create a loocv training set for k in 1:100
train_loocv <- Default[-k, ]
test_loocv <- Default[k, ]
#fit the model at each iteration
fitted_models <- glm(default ~ balance + student,data=train_loocv,family='binomial')
#calculation the mse at each iteration
MSE1 <- mean((predict(fitted_models,test_loocv,type='response')-test_loocv$default)^2)
#set the kth row to the mse from the above calculation
loocv_tmp1[k, ] <- MSE1
}
#calculation the mean of the cross validation
mean(loocv_tmp1)
## [1] 0.0008405851
##two
n <- 100
loocv_tmp2 <- matrix(NA, nrow = n, ncol = 1)
for (k in 1:n) {
train_loocv <- Default[-k, ]
test_loocv <- Default[k, ]
fitted_models <- glm(default ~ balance,data=train_loocv,family='binomial')
MSE1 <- mean((predict(fitted_models,test_loocv,type='response')-test_loocv$default)^2)
loocv_tmp2[k, ] <- MSE1
}
mean(loocv_tmp2)
## [1] 0.001071851
Use 10-fold cross-validation approach and choose the best model. Answer - Model 1 has a lower error rate using 10 fold cross validation thus model 1 is the best model in 10-fold cross validation.
cv.glm(train, model_p3c,K=10)$delta[1]
## [1] 0.02162689
cv.glm(train, model_p3d,K=10)$delta[1]
## [1] 0.0221047
It is evident from all three cross validation techniques that adding student as a coviariate reduces the error of predictions slightly, but enough for it to be used in model selection. Thus model 1, with studen as a covariate, is the best model under all three cross validation techniques.
In the ISLR library load the Smarket dataset. This contains Daily percentage returns for the S&P 500 stock index between 2001 and 2005. There are 1250 observations and 9 variables. The variable of interest is Direction which is a factor with levels Down and Up indicating whether the market had a positive or negative return on a given day. Since the goal is to predict the direction of the stock market in the future, here it would make sense to use the data from years 2001 - 2004 as training and 2005 as validation. According to this, create a training set and testing set. Perform logistic regression and assess the error rate.
data(Smarket)
#convert Direction to binary
Smarket <- Smarket %>%
mutate(Direction = ifelse(Direction == 'Up',1,0))
#split the training and test sets by year
train <- Smarket[Smarket$Year != 2005,]
test <- Smarket[Smarket$Year == 2005,]
#build the 7 models
model1 <- glm(Direction ~ Lag1,train, family=binomial())
model2 <- glm(Direction ~ Lag1 + Lag2,train,family=binomial())
model3 <- glm(Direction ~ Lag1 + Lag2+Lag3,train,family=binomial())
model4 <- glm(Direction ~ Lag1 + Lag2+Lag3+Lag4,train,family=binomial())
model5 <- glm(Direction ~ Lag1 + Lag2+Lag3+Lag4+Lag5,train,family=binomial())
model6 <- glm(Direction ~ Lag1 + Lag2+Lag3+Lag4+Lag5+Volume,train,family=binomial())
model7 <- glm(Direction ~ Lag1 + Lag2 + Lag1*Lag3 +Lag1*Lag5 +Lag3*Lag4 + Lag3*Lag5,data = train, family=binomial())
#run an anovva on the 7 different models to find the best model
anov <- anova(model1,model2,model3,model4,model5,model6,model7,test='Chisq')
#calculate the MSE of model 7
MSE <- mean((predict(model7,test,type='response')-test$Direction)^2)
#follow the same steps as problem 3 for LOOCV
n <- 100
loocv_tmp1 <- matrix(NA, nrow = n, ncol = 1)
for (k in 1:n) {
train_loocv <- Smarket[-k, ]
test_loocv <- Smarket[k, ]
fitted_models <- glm(Direction ~ Lag1 + Lag2 + Lag1*Lag3 +Lag1*Lag5 +Lag3*Lag4 + Lag3*Lag5,data = train_loocv, family='binomial')
loocv_tmp1[k, ] <- mean((predict(fitted_models,test_loocv,type='response')-test_loocv$Direction)^2)
}
loocv1 <- mean(loocv_tmp1)
#calculate 10 fold cross validation using the in class function
tenfold <-cv.glm(train, model7,K=10)$delta[1]
I have built 7 different models, all with different covariates, and the 7th model with interaction terms and only selected covariates. Below is the output of the analysis of variance of the models. Model7 is slighltly better due to its near significance, thus we will use it for our final model.
anov
## Analysis of Deviance Table
##
## Model 1: Direction ~ Lag1
## Model 2: Direction ~ Lag1 + Lag2
## Model 3: Direction ~ Lag1 + Lag2 + Lag3
## Model 4: Direction ~ Lag1 + Lag2 + Lag3 + Lag4
## Model 5: Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5
## Model 6: Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
## Model 7: Direction ~ Lag1 + Lag2 + Lag1 * Lag3 + Lag1 * Lag5 + Lag3 *
## Lag4 + Lag3 * Lag5
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 996 1382.1
## 2 995 1381.4 1 0.7428 0.38877
## 3 994 1381.4 1 0.0293 0.86409
## 4 993 1381.3 1 0.0281 0.86694
## 5 992 1381.3 1 0.0040 0.94939
## 6 991 1381.1 1 0.2355 0.62748
## 7 988 1373.8 3 7.2708 0.06375 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary statistics indicate a slight statistical significance of the the percentage return for the previous day and for 5 days previous. The AIC of the model is relatively high.
summary(model7)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag1 * Lag3 + Lag1 *
## Lag5 + Lag3 * Lag4 + Lag3 * Lag5, family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7186 -1.1874 0.9387 1.1588 1.6609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.032001 0.063624 0.503 0.6150
## Lag1 -0.052534 0.053664 -0.979 0.3276
## Lag2 -0.054475 0.052344 -1.041 0.2980
## Lag3 0.002665 0.053421 0.050 0.9602
## Lag5 -0.002196 0.051943 -0.042 0.9663
## Lag4 -0.001858 0.052108 -0.036 0.9716
## Lag1:Lag3 0.039926 0.031925 1.251 0.2111
## Lag1:Lag5 -0.067380 0.034159 -1.973 0.0486 *
## Lag3:Lag4 -0.039895 0.034931 -1.142 0.2534
## Lag3:Lag5 0.033801 0.031114 1.086 0.2773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1373.8 on 988 degrees of freedom
## AIC: 1393.8
##
## Number of Fisher Scoring iterations: 4
The mean square error of our final model is 0.248, which means there is a relatively low error rate for predicting the outcome of the stock market given the previous 5 days worth of returns.
MSE
## [1] 0.248777
loocv1
## [1] 0.2595494
tenfold
## [1] 0.2520341