#load all packages used in the homework
library(dplyr)
library(tidyr)
library(ggplot2)
library(gamair)
library(MASS)
library(HSAUR3)
library(ISLR)
library(scales)
Collett (2003) argues that two outliers need to be removed from the plasma data. Try to identify those two unusual observations by means of a scatterplot. (7.2 on Handbook)
Answer. I believe the two outliers are the observations where the fibrinogen level approaches and exceeds 4. The two points with ESR less than 20 occur at high fibrinogen levels: the first lies just below 4 and the second beyond a fibrinogen level of 5. I deduce these are the two outliers not only because of their high fibrinogen levels, but also because their globulin levels do not increase with fibrinogen the way the trend in the remaining points suggests.
#pipe the plasma data set into a ggplot scatter plot
#add a geom_smooth line to help identify the outliers
plasma %>%
  ggplot(aes(x = fibrinogen, y = globulin, color = ESR)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'Plasma Scatterplot Outlier Detection',
       x = 'Fibrinogen Level in Blood',
       y = 'Globulin Level in Blood')
(Multiple Regression) Continuing from the lecture on the hubble data from the gamair library: a. Fit a quadratic regression model, i.e., a model of the form Model 2: \(velocity = \beta_1 \times distance + \beta_2 \times distance^2 + \epsilon\)
#add the x^2 variable to the hubble data set
data(hubble)
problem2 <- hubble %>%
  mutate(x2 = x^2)
#fit a quadratic model to the hubble data set
model <- lm(y ~ x2 + x, data = problem2)
#create a sequence of x values for prediction
dist <- seq(min(problem2$x), max(problem2$x), 0.1)
#use the above x values to predict the y values from the model
vel <- predict(model, list(x = dist, x2 = dist^2))
#create a data frame of x and y values for plotting
data <- data.frame(dist, vel)
#create the model outlined in chapter 6 (regression through the origin)
hmod <- lm(y ~ x - 1, data = hubble)
#use the model and the dist variable to predict the y values
vel2 <- predict(hmod, list(x = dist))
#create a data frame with distance and the vel2 object created above
data2 <- data.frame(dist, vel2)
#plot the hubble data as a scatter plot and overlay the predicted values of both
#models as lines: green for the quadratic fit, blue for the through-the-origin fit
ggplot(data = problem2, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = data, aes(x = dist, y = vel), color = 'green', lwd = 2) +
  geom_line(data = data2, aes(x = dist, y = vel2), color = 'blue', lwd = 2) +
  labs(title = "Linear and Quadratic Models of Hubble Data",
       x = 'Distance in Megaparsecs',
       y = 'Velocity in km/s')
Answer - The most sensible model to me is the second one, without the quadratic term. I think the quadratic model overfits the data and that the \(x^2\) variable is not significant. Looking at the plot alone, there is an outlier that pulls the quadratic curve in such a way that it does not fit the data as well as the simple linear model.
Answer - The standard errors of the coefficients are higher in the quadratic model. The standard error of the x variable is nearly 10 times larger in the quadratic fit, and the p-value for x is not nearly as significant as in the simple linear model. The R-squared values also favor the simple linear model, 0.9419 vs. 0.7651, though R-squared for a no-intercept model is computed against a different baseline, so that comparison should be read with some caution.
summary(model)
##
## Call:
## lm(formula = y ~ x2 + x, data = problem2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -720.5 -119.5 29.7 143.8 537.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -196.364 196.122 -1.001 0.32811
## x2 -2.096 1.565 -1.339 0.19494
## x 123.871 36.861 3.361 0.00296 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 260.1 on 21 degrees of freedom
## Multiple R-squared: 0.7651, Adjusted R-squared: 0.7428
## F-statistic: 34.21 on 2 and 21 DF, p-value: 2.476e-07
##
summary(hmod)
##
## Call:
## lm(formula = y ~ x - 1, data = hubble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -736.5 -132.5 -19.0 172.2 558.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 76.581 3.965 19.32 1.03e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 258.9 on 23 degrees of freedom
## Multiple R-squared: 0.9419, Adjusted R-squared: 0.9394
## F-statistic: 373.1 on 1 and 23 DF, p-value: 1.032e-15
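The comparison can also be made with a single number; a quick sketch using AIC (AIC values are not reported in the original output, so no claim is made here about which model comes out lower):
#compare the quadratic model and the through-the-origin model by AIC (lower is better)
AIC(model, hmod)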
The leuk data from the package MASS shows the survival times from diagnosis of patients suffering from leukemia and the values of two explanatory variables, the white blood cell count (wbc) and the presence or absence of a morphological characteristic of the white blood cells (ag).
#add a binary column named surv24: 1 if the patient survived at least 24 weeks, 0 otherwise
problem3 <- leuk %>%
  mutate(surv24 = ifelse(time >= 24, 1, 0))
#fit a logistic regression of 24-week survival on log white blood cell count and ag
surv24.model <- glm(surv24 ~ log(wbc) + ag, data = problem3, family = 'binomial')
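A quick tabulation to confirm the direction of the coding before interpreting the fitted curves (surv24 = 1 means the patient survived at least 24 weeks):
#cross-tabulate 24-week survival against the ag test result
table(surv24 = problem3$surv24, ag = problem3$ag)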
Description - The first plot, 'Survival Times and White Blood Cell Counts for Leukaemia Patients', shows a high probability of surviving at least 24 weeks for patients with a 'present' test result and a low white blood cell count. As the white blood cell count increases, the probability of surviving 24 weeks decreases. For patients with an 'absent' test result the probability curve follows the same path, just with lower survival probabilities.
#predict the fitted probabilities
surv24.fitted <- predict(surv24.model, type = 'response')
#bind the fitted probabilities to the data frame
problem3 <- cbind(problem3, fitted = surv24.fitted)
#scatter plot of the data with one fitted curve per test result from the model prediction
ggplot(problem3, aes(x = wbc, y = surv24, color = ag)) +
  geom_point() +
  scale_colour_manual(name = "Test Results", values = c('red', 'blue')) +
  geom_line(data = problem3[problem3$ag == 'absent', ], aes(x = wbc, y = fitted), color = 'red') +
  geom_line(data = problem3[problem3$ag == 'present', ], aes(x = wbc, y = fitted), color = 'blue') +
  labs(title = 'Survival Times and White Blood Cell Counts for Leukaemia Patients',
       x = 'White Blood Cell Count',
       y = 'Probability of Surviving at Least 24 Weeks')
The box plot shows the difference in outcomes between the two test results: most patients with a 'present' test result survived at least 24 weeks, while most patients with an 'absent' result did not.
#create a box plot to examine the survival outcome for the two test results
ggplot(problem3, aes(x = ag, y = surv24)) +
  geom_boxplot() +
  labs(title = 'Boxplot of Leukaemia Test Results to Survival Beyond 24 Weeks',
       x = 'Presence or Absence of a Morphologic Characteristic of White Blood Cells',
       y = 'Survived at Least 24 Weeks (1 = yes)')
#fit the model with the interaction term; note ag * log(wbc) already expands to the
#two main effects plus their interaction, so the formula below is equivalent to it
surv24.model2 <- glm(surv24 ~ log(wbc) + ag + ag * log(wbc), data = problem3, family = 'binomial')
Answer - While the second model has a lower AIC (42.167 vs. 43.498), I believe the first model is a better representation of the data. In the first model (without an interaction term), the 'present' test result is more statistically significant than in the model where the test result interacts with the white blood cell count. Additionally, the interaction model has a larger standard error on the coefficient for a positive test result. I also believe the model with the interaction term tends to overfit the data; there is not enough information gain from the interaction term to justify using it as a working model. Finally, the chi-square test shows that model 2 is not a significant improvement at the 5% level, so model 1 suffices as a working model of the data.
#build a confusion matrix and calculate the error rate
true <- problem3$surv24
problem3_pred <- factor(ifelse(surv24.fitted >= 0.50, "Yes", "No"))
problem3_true <- factor(ifelse(true >= 0.5, "Yes", "No"))
problem3_error <- table(problem3_pred, True = problem3_true)
error_rate <- 1 - (problem3_error[1, 1] + problem3_error[2, 2]) / sum(problem3_error)
error_rate
## [1] 0.2424242
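For a direct comparison, the same confusion-matrix calculation can be repeated for the interaction model; a sketch mirroring the code above (the object names surv24.fitted2, pred2, and error2 are new):
#repeat the error-rate calculation for the model with the interaction term
surv24.fitted2 <- predict(surv24.model2, type = 'response')
pred2 <- factor(ifelse(surv24.fitted2 >= 0.50, "Yes", "No"))
error2 <- table(pred2, True = problem3_true)
1 - (error2[1, 1] + error2[2, 2]) / sum(error2)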
summary(surv24.model)
##
## Call:
## glm(formula = surv24 ~ log(wbc) + ag, family = "binomial", data = problem3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6310 -0.9056 -0.6258 0.8592 2.1032
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.4556 2.9821 1.159 0.2466
## log(wbc) -0.4822 0.3149 -1.531 0.1257
## agpresent 1.7621 0.8093 2.177 0.0295 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45.475 on 32 degrees of freedom
## Residual deviance: 37.498 on 30 degrees of freedom
## AIC: 43.498
##
## Number of Fisher Scoring iterations: 3
summary(surv24.model2)
##
## Call:
## glm(formula = surv24 ~ log(wbc) + ag + ag * log(wbc), family = "binomial",
## data = problem3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9183 -0.7835 -0.6750 0.7310 1.7838
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5946 4.6583 -0.557 0.5775
## log(wbc) 0.1545 0.4746 0.326 0.7447
## agpresent 13.6306 7.0909 1.922 0.0546 .
## log(wbc):agpresent -1.2315 0.7182 -1.715 0.0864 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45.475 on 32 degrees of freedom
## Residual deviance: 34.167 on 29 degrees of freedom
## AIC: 42.167
##
## Number of Fisher Scoring iterations: 4
anova(surv24.model,surv24.model2,test='Chisq')
## Analysis of Deviance Table
##
## Model 1: surv24 ~ log(wbc) + ag
## Model 2: surv24 ~ log(wbc) + ag + ag * log(wbc)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30 37.498
## 2 29 34.167 1 3.3315 0.06797 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Load the Default dataset from the ISLR library. The dataset contains information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt. It is a 4-dimensional dataset with 10000 observations. The question of interest is to predict which individuals will default. We want to examine how each predictor variable is related to the response (default). Do the following on this dataset:
Answer - It is apparent that incomes are higher for non-students than for students, but that has little bearing on default rates. Additionally, default rates rise as the loan balance rises for students and non-students alike.
data(Default)
summary(Default[Default$default == 'Yes',])
## default student balance income
## No : 0 No :206 Min. : 652.4 Min. : 9664
## Yes:333 Yes:127 1st Qu.:1511.6 1st Qu.:19028
## Median :1789.1 Median :31515
## Mean :1747.8 Mean :32089
## 3rd Qu.:1988.9 3rd Qu.:43067
## Max. :2654.3 Max. :66466
summary(Default[Default$default == 'No',])
## default student balance income
## No :9667 No :6850 Min. : 0.0 Min. : 772
## Yes: 0 Yes:2817 1st Qu.: 465.7 1st Qu.:21405
## Median : 802.9 Median :34589
## Mean : 803.9 Mean :33566
## 3rd Qu.:1128.2 3rd Qu.:43824
## Max. :2391.0 Max. :73554
#create a table of means and standard deviations, grouped by the factor variables
Default %>%
  group_by(default, student) %>%
  summarize(Mean_Bal = mean(balance),
            sd_bal = sd(balance),
            Mean_Inc = mean(income),
            sd_Inc = sd(income))
## # A tibble: 4 x 6
## # Groups: default [?]
## default student Mean_Bal sd_bal Mean_Inc sd_Inc
## <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 No No 744.5044 445.5151 39993.52 10002.907
## 2 No Yes 948.4802 450.5537 17937.01 4543.320
## 3 Yes No 1678.4295 330.9141 40625.05 10258.730
## 4 Yes Yes 1860.3791 328.7356 18243.51 4304.513
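A row-wise proportion table gives a more direct check that default rates differ only modestly by student status (a small additional exploration; these proportions are not shown in the original output):
#default rate by student status: rows sum to 1, the 'Yes' column is the default rate
prop.table(table(student = Default$student, default = Default$default), margin = 1)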
#create a ggplot boxplot of income by default status, faceted by student status
Default %>%
  ggplot(aes(x = default, y = income)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar) +
  facet_grid(student ~ .) +
  labs(title = 'Boxplot of Income to Default By Student Status',
       x = 'Default?',
       y = 'Income (USD)')
#create a ggplot boxplot of balance by default status, faceted by student status
Default %>%
  ggplot(aes(x = default, y = balance)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar) +
  facet_grid(student ~ .) +
  labs(title = 'Boxplot of Loan Balance to Default By Student Status',
       x = 'Default?',
       y = 'Balance (USD)')
#convert the variable default to a binary 0/1
Default <- Default %>%
  mutate(default = ifelse(default == 'Yes', 1, 0))
#create a model with the interaction terms student*income, student*balance, and balance*income
model <- glm(default ~ student + balance + income + student*income + balance*student + balance*income,
             data = Default, family = binomial())
#create a second model with no interaction terms
model2 <- glm(default ~ student + balance + income, family = "binomial", data = Default)
With interaction terms, the only identifiable predictor variable is the balance. Without interaction terms the balance still plays a significant role, but student status also carries some weight in predicting default, as it is statistically significant in that model. Additionally, the AIC of the model without interaction terms is lower than that of the model with interaction terms.
summary(model)
##
## Call:
## glm(formula = default ~ student + balance + income + student *
## income + balance * student + balance * income, family = binomial(),
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4848 -0.1417 -0.0554 -0.0202 3.7579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.104e+01 1.866e+00 -5.914 3.33e-09 ***
## studentYes -5.201e-01 1.344e+00 -0.387 0.699
## balance 5.882e-03 1.180e-03 4.983 6.27e-07 ***
## income 4.050e-06 4.459e-05 0.091 0.928
## studentYes:income 1.447e-05 2.779e-05 0.521 0.602
## studentYes:balance -2.551e-04 7.905e-04 -0.323 0.747
## balance:income -1.579e-09 2.815e-08 -0.056 0.955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.1 on 9993 degrees of freedom
## AIC: 1585.1
##
## Number of Fisher Scoring iterations: 8
summary(model2)
##
## Call:
## glm(formula = default ~ student + balance + income, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
Answer - The error rate of the two models, as shown below, is very small for predicting defaults. This tells us that either model predicts default with roughly 97% accuracy. Neither model shows a meaningful difference in deviance, and the less complicated model predicts just as accurately, so to avoid the possibility of overfitting the simpler model should be used. Additionally, the chi-square test shows no significance at the 5% level, so we can conclude that both models perform about the same.
#predict fitted probabilities from both models
def_fit1 <- predict(model, type = 'response')
def_fit2 <- predict(model2, type = 'response')
#build two confusion matrices and calculate the error rates
true <- Default$default
p4_pred1 <- factor(ifelse(def_fit1 >= 0.50, "Yes", "No"))
p4_true1 <- factor(ifelse(true >= 0.5, "Yes", "No"))
p4_error1 <- table(p4_pred1, True = p4_true1)
error_rate <- 1 - (p4_error1[1, 1] + p4_error1[2, 2]) / sum(p4_error1)
error_rate
## [1] 0.0269
p4_pred2 <- factor(ifelse(def_fit2 >= 0.50, "Yes", "No"))
p4_true2 <- factor(ifelse(true >= 0.5, "Yes", "No"))
p4_error2 <- table(p4_pred2, True = p4_true2)
error_rate <- 1 - (p4_error2[1, 1] + p4_error2[2, 2]) / sum(p4_error2)
error_rate
## [1] 0.0269
anova(model,model2,test='Chisq')
## Analysis of Deviance Table
##
## Model 1: default ~ student + balance + income + student * income + balance *
## student + balance * income
## Model 2: default ~ student + balance + income
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9993 1571.1
## 2 9996 1571.5 -3 -0.47911 0.9235
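Since both training error rates are nearly identical, a holdout split is a fairer check that the simpler model generalizes as well as the interaction model. A minimal sketch, assuming a 70/30 split and an arbitrary seed (the object names below are new):
#hold out 30% of the data, refit the simpler model, and compute the test error rate
set.seed(1)
train_idx <- sample(nrow(Default), 0.7 * nrow(Default))
default_train <- Default[train_idx, ]
default_test <- Default[-train_idx, ]
holdout_fit <- glm(default ~ student + balance + income, family = 'binomial', data = default_train)
holdout_pred <- ifelse(predict(holdout_fit, newdata = default_test, type = 'response') >= 0.5, 1, 0)
mean(holdout_pred != default_test$default) #test-set error rate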
(During class on Friday) Go through Section 7.3.1 of the Handbook. Run all the codes (additional exploration of data is allowed) and write your own version of explanation and interpretation. This needs to be submitted with HW 2 problems 1-4 above.
Answer - The two plots below show how the conditional distribution of the ESR categories changes with fibrinogen and globulin.
#load the plasma data and draw the two conditional density plots side by side
data("plasma", package = "HSAUR3")
layout(matrix(1:2, ncol = 2))
cdplot(ESR ~ fibrinogen, data = plasma)
cdplot(ESR ~ globulin, data = plasma)
layout(1) #reset the plotting layout for later base-graphics plots
Answer - The code below fits a logistic regression model in R of the ESR variable on fibrinogen from the plasma data set. The family argument specifies the type of model, in this case binomial for logistic regression. The summary output shows that fibrinogen is significant at the 5% level: an increase of one unit in fibrinogen increases the log-odds of an ESR greater than 20 by approximately 1.83, with the confidence interval below (0.34 to 4.00).
plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma, family=binomial())
confint(plasma_glm_1,parm='fibrinogen')
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.3387619 3.9984921
summary(plasma_glm_1)
##
## Call:
## glm(formula = ESR ~ fibrinogen, family = binomial(), data = plasma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9298 -0.5399 -0.4382 -0.3356 2.4794
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.8451 2.7703 -2.471 0.0135 *
## fibrinogen 1.8271 0.9009 2.028 0.0425 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30.885 on 31 degrees of freedom
## Residual deviance: 24.840 on 30 degrees of freedom
## AIC: 28.84
##
## Number of Fisher Scoring iterations: 5
Answer - The output below exponentiates the fibrinogen coefficient and its confidence interval to put them on the odds scale. The confidence interval is wide due to the low number of observations and because there are very few observations where the ESR value is above 20. It is evident that increased values of fibrinogen lead to a greater probability of an ESR above 20.
exp(coef(plasma_glm_1)['fibrinogen'])
## fibrinogen
## 6.215715
exp(confint(plasma_glm_1, parm='fibrinogen'))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 1.403209 54.515884
Answer - The summary output shows a gamma globulin coefficient close to zero, which means that variable has little impact on the odds of an ESR above 20. The reduction in residual deviance relative to the first model is only 1.87; with the chi-square test on a single degree of freedom this is not significant at the 5% level, so globulin adds no significant explanatory value beyond fibrinogen.
plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial())
summary(plasma_glm_2)
##
## Call:
## glm(formula = ESR ~ fibrinogen + globulin, family = binomial(),
## data = plasma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9683 -0.6122 -0.3458 -0.2116 2.2636
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.7921 5.7963 -2.207 0.0273 *
## fibrinogen 1.9104 0.9710 1.967 0.0491 *
## globulin 0.1558 0.1195 1.303 0.1925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30.885 on 31 degrees of freedom
## Residual deviance: 22.971 on 29 degrees of freedom
## AIC: 28.971
##
## Number of Fisher Scoring iterations: 5
anova(plasma_glm_1, plasma_glm_2, test= 'Chisq')
## Analysis of Deviance Table
##
## Model 1: ESR ~ fibrinogen
## Model 2: ESR ~ fibrinogen + globulin
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30 24.840
## 2 29 22.971 1 1.8692 0.1716
Answer - The bubble plot of the predicted values from the second model indicates an increased probability of an ESR value above 20 as fibrinogen and, to a lesser extent, globulin increase.
prob <- predict(plasma_glm_2, type = 'response')
plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6), ylim = c(25, 55), pch = '.')
symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE)
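The same figure can also be reproduced in ggplot2 for consistency with the earlier plots; a sketch, with the aesthetic choices (open circles, legend title) being my own:
#ggplot2 version of the bubble plot: point size maps to the predicted probability
ggplot(cbind(plasma, prob = prob), aes(x = fibrinogen, y = globulin, size = prob)) +
  geom_point(shape = 1) +
  scale_size_continuous(name = 'Pr(ESR > 20)') +
  labs(title = 'Predicted Probability of ESR > 20',
       x = 'Fibrinogen Level in Blood',
       y = 'Globulin Level in Blood')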