#load all packages used in the homework
library(dplyr)
library(tidyr)
library(ggplot2)
library(gamair)
library(MASS)
library(HSAUR3)
library(ISLR)
library(scales)
Collett (2003) argues that two outliers need to be removed from the plasma data. Try to identify those two unusual observations by means of a scatterplot. (7.2 on Handbook)
Answer. I believe the two outliers are the observations where the fibrinogen level approaches and exceeds 4. The two points with ESR less than 20 occur at high fibrinogen levels: the first lies just below 4 and the second beyond a fibrinogen level of 5. I deduce these are the two outliers not only because of their high fibrinogen levels, but also because their globulin levels do not increase with fibrinogen the way the trend in the remaining points suggests.
#pipe the plasma data set into a ggplot scatter plot
#add a geom_smooth line to help identify the outliers
plasma %>%
  ggplot(aes(x = fibrinogen, y = globulin, color = ESR)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'Plasma Scatterplot Outlier Detection',
       x = 'Fibrinogen Level in Blood',
       y = 'Globulin Level in Blood')
(Multiple Regression) Continuing from the lecture on the hubble data from the gamair library: a. Fit a quadratic regression model, i.e., a model of the form Model 2: \(velocity = \beta_1 \times distance + \beta_2 \times distance^2 + \epsilon\)
#add the x^2 variable to the hubble data set
data(hubble)
problem2 <- hubble %>%
  mutate(x2 = x^2)
#fit a quadratic model to the hubble data set
model <- lm(y ~ x2 + x, data = problem2)
#create a sequence of x values for prediction
dist <- seq(min(problem2$x), max(problem2$x), 0.1)
#use the above x values to predict the y values from the model
vel <- predict(model, list(x = dist, x2 = dist^2))
#create a data frame of x and y values for plotting
data <- data.frame(dist, vel)
#create the model outlined in chapter 6 (regression through the origin)
hmod <- lm(y ~ x - 1, data = hubble)
#use the model and the dist variable to predict the y values
vel2 <- predict(hmod, list(x = dist))
#create a data frame with distance and the vel2 object created above
data2 <- data.frame(dist, vel2)
#plot the hubble data as a scatter plot and overlay the predicted values of both
#models as lines: green for the quadratic fit, blue for the through-the-origin fit
ggplot(data = problem2, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = data, aes(x = dist, y = vel), color = 'green', lwd = 2) +
  geom_line(data = data2, aes(x = dist, y = vel2), color = 'blue', lwd = 2) +
  labs(title = "Linear and Quadratic Models of Hubble Data",
       x = 'Distance in Megaparsecs',
       y = 'Velocity in km/s')
Answer - The most sensible model to me is the second one, without the quadratic term. I think the quadratic model overfits the data and that the \(x^2\) variable is not significant. Looking at the plot alone, there is an outlier that pulls the quadratic curve in such a way that it does not fit the data as well as the simple linear model.
Answer - The standard errors of the coefficients are higher in the quadratic model. The standard error of the x variable is nearly 10 times larger in the quadratic fit, and the p-value for x is not nearly as significant as in the simple linear model. The R-squared values also favor the simple linear model, 0.9419 vs. 0.7651, though R-squared for a no-intercept model is computed against a different baseline, so that comparison should be read with some caution.
summary(model)
##
## Call:
## lm(formula = y ~ x2 + x, data = problem2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -720.5 -119.5 29.7 143.8 537.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -196.364 196.122 -1.001 0.32811
## x2 -2.096 1.565 -1.339 0.19494
## x 123.871 36.861 3.361 0.00296 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 260.1 on 21 degrees of freedom
## Multiple R-squared: 0.7651, Adjusted R-squared: 0.7428
## F-statistic: 34.21 on 2 and 21 DF, p-value: 2.476e-07
##
summary(hmod)
##
## Call:
## lm(formula = y ~ x - 1, data = hubble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -736.5 -132.5 -19.0 172.2 558.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 76.581 3.965 19.32 1.03e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 258.9 on 23 degrees of freedom
## Multiple R-squared: 0.9419, Adjusted R-squared: 0.9394
## F-statistic: 373.1 on 1 and 23 DF, p-value: 1.032e-15
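The comparison can also be made with a single number; a quick sketch using AIC (AIC values are not reported in the original output, so no claim is made here about which model comes out lower):
#compare the quadratic model and the through-the-origin model by AIC (lower is better)
AIC(model, hmod)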
The leuk data from the package MASS shows the survival times from diagnosis of patients suffering from leukemia and the values of two explanatory variables, the white blood cell count (wbc) and the presence or absence of a morphological characteristic of the white blood cells (ag).
#add a binary column named surv24: 1 if the patient survived at least 24 weeks, 0 otherwise
problem3 <- leuk %>%
  mutate(surv24 = ifelse(time >= 24, 1, 0))
#fit a logistic regression of 24-week survival on log white blood cell count and ag
surv24.model <- glm(surv24 ~ log(wbc) + ag, data = problem3, family = 'binomial')
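A quick tabulation to confirm the direction of the coding before interpreting the fitted curves (surv24 = 1 means the patient survived at least 24 weeks):
#cross-tabulate 24-week survival against the ag test result
table(surv24 = problem3$surv24, ag = problem3$ag)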
Description - The first plot, 'Survival Times and White Blood Cell Counts for Leukaemia Patients', shows a high probability of surviving at least 24 weeks for patients with a 'present' test result and a low white blood cell count. As the white blood cell count increases, the probability of surviving 24 weeks decreases. For patients with an 'absent' test result the probability curve follows the same path, just with lower survival probabilities.
#predict the fitted probabilities
surv24.fitted <- predict(surv24.model, type = 'response')
#bind the fitted probabilities to the data frame
problem3 <- cbind(problem3, fitted = surv24.fitted)
#scatter plot of the data with one fitted curve per test result from the model prediction
ggplot(problem3, aes(x = wbc, y = surv24, color = ag)) +
  geom_point() +
  scale_colour_manual(name = "Test Results", values = c('red', 'blue')) +
  geom_line(data = problem3[problem3$ag == 'absent', ], aes(x = wbc, y = fitted), color = 'red') +
  geom_line(data = problem3[problem3$ag == 'present', ], aes(x = wbc, y = fitted), color = 'blue') +
  labs(title = 'Survival Times and White Blood Cell Counts for Leukaemia Patients',
       x = 'White Blood Cell Count',
       y = 'Probability of Surviving at Least 24 Weeks')
The box plot shows the difference in outcomes between the two test results: most patients with a 'present' test result survived at least 24 weeks, while most patients with an 'absent' result did not.
#create a box plot to examine the survival outcome for the two test results
ggplot(problem3, aes(x = ag, y = surv24)) +
  geom_boxplot() +
  labs(title = 'Boxplot of Leukaemia Test Results to Survival Beyond 24 Weeks',
       x = 'Presence or Absence of a Morphologic Characteristic of White Blood Cells',
       y = 'Survived at Least 24 Weeks (1 = yes)')
#fit the model with the interaction term; note ag * log(wbc) already expands to the
#two main effects plus their interaction, so the formula below is equivalent to it
surv24.model2 <- glm(surv24 ~ log(wbc) + ag + ag * log(wbc), data = problem3, family = 'binomial')
Answer - While the second model has a lower AIC (42.167 vs. 43.498), I believe the first model is a better representation of the data. In the first model (without an interaction term), the 'present' test result is more statistically significant than in the model where the test result interacts with the white blood cell count. Additionally, the interaction model has a larger standard error on the coefficient for a positive test result. I also believe the model with the interaction term tends to overfit the data; there is not enough information gain from the interaction term to justify using it as a working model. Finally, the chi-square test shows that model 2 is not a significant improvement at the 5% level, so model 1 suffices as a working model of the data.
#build a confusion matrix and calculate the error rate
true <- problem3$surv24
problem3_pred <- factor(ifelse(surv24.fitted >= 0.50, "Yes", "No"))
problem3_true <- factor(ifelse(true >= 0.5, "Yes", "No"))
problem3_error <- table(problem3_pred, True = problem3_true)
error_rate <- 1 - (problem3_error[1, 1] + problem3_error[2, 2]) / sum(problem3_error)
error_rate
## [1] 0.2424242
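For a direct comparison, the same confusion-matrix calculation can be repeated for the interaction model; a sketch mirroring the code above (the object names surv24.fitted2, pred2, and error2 are new):
#repeat the error-rate calculation for the model with the interaction term
surv24.fitted2 <- predict(surv24.model2, type = 'response')
pred2 <- factor(ifelse(surv24.fitted2 >= 0.50, "Yes", "No"))
error2 <- table(pred2, True = problem3_true)
1 - (error2[1, 1] + error2[2, 2]) / sum(error2)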
summary(surv24.model)
##
## Call:
## glm(formula = surv24 ~ log(wbc) + ag, family = "binomial", data = problem3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6310 -0.9056 -0.6258 0.8592 2.1032
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.4556 2.9821 1.159 0.2466
## log(wbc) -0.4822 0.3149 -1.531 0.1257
## agpresent 1.7621 0.8093 2.177 0.0295 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45.475 on 32 degrees of freedom
## Residual deviance: 37.498 on 30 degrees of freedom
## AIC: 43.498
##
## Number of Fisher Scoring iterations: 3
summary(surv24.model2)
##
## Call:
## glm(formula = surv24 ~ log(wbc) + ag + ag * log(wbc), family = "binomial",
## data = problem3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9183 -0.7835 -0.6750 0.7310 1.7838
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5946 4.6583 -0.557 0.5775
## log(wbc) 0.1545 0.4746 0.326 0.7447
## agpresent 13.6306 7.0909 1.922 0.0546 .
## log(wbc):agpresent -1.2315 0.7182 -1.715 0.0864 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45.475 on 32 degrees of freedom
## Residual deviance: 34.167 on 29 degrees of freedom
## AIC: 42.167
##
## Number of Fisher Scoring iterations: 4
anova(surv24.model,surv24.model2,test='Chisq')
## Analysis of Deviance Table
##
## Model 1: surv24 ~ log(wbc) + ag
## Model 2: surv24 ~ log(wbc) + ag + ag * log(wbc)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30 37.498
## 2 29 34.167 1 3.3315 0.06797 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Load the Default dataset from the ISLR library. The dataset contains information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt. It is a 4-dimensional dataset with 10000 observations. The question of interest is to predict which individuals will default. We want to examine how each predictor variable is related to the response (default). Do the following on this dataset:
Answer - It is apparent that incomes are higher for non-students than for students, but that has little bearing on default rates. Additionally, default rates rise as the loan balance rises for students and non-students alike.
data(Default)
summary(Default[Default$default == 'Yes',])
## default student balance income
## No : 0 No :206 Min. : 652.4 Min. : 9664
## Yes:333 Yes:127 1st Qu.:1511.6 1st Qu.:19028
## Median :1789.1 Median :31515
## Mean :1747.8 Mean :32089
## 3rd Qu.:1988.9 3rd Qu.:43067
## Max. :2654.3 Max. :66466
summary(Default[Default$default == 'No',])
## default student balance income
## No :9667 No :6850 Min. : 0.0 Min. : 772
## Yes: 0 Yes:2817 1st Qu.: 465.7 1st Qu.:21405
## Median : 802.9 Median :34589
## Mean : 803.9 Mean :33566
## 3rd Qu.:1128.2 3rd Qu.:43824
## Max. :2391.0 Max. :73554
#create a table of means and standard deviations, grouped by the factor variables
Default %>%
  group_by(default, student) %>%
  summarize(Mean_Bal = mean(balance),
            sd_bal = sd(balance),
            Mean_Inc = mean(income),
            sd_Inc = sd(income))
## # A tibble: 4 x 6
## # Groups: default [?]
## default student Mean_Bal sd_bal Mean_Inc sd_Inc
## <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 No No 744.5044 445.5151 39993.52 10002.907
## 2 No Yes 948.4802 450.5537 17937.01 4543.320
## 3 Yes No 1678.4295 330.9141 40625.05 10258.730
## 4 Yes Yes 1860.3791 328.7356 18243.51 4304.513
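A row-wise proportion table gives a more direct check that default rates differ only modestly by student status (a small additional exploration; these proportions are not shown in the original output):
#default rate by student status: rows sum to 1, the 'Yes' column is the default rate
prop.table(table(student = Default$student, default = Default$default), margin = 1)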
#create a ggplot boxplot of income by default status, faceted by student status
Default %>%
  ggplot(aes(x = default, y = income)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar) +
  facet_grid(student ~ .) +
  labs(title = 'Boxplot of Income to Default By Student Status',
       x = 'Default?',
       y = 'Income (USD)')
#create a ggplot boxplot of balance by default status, faceted by student status
Default %>%
  ggplot(aes(x = default, y = balance)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar) +
  facet_grid(student ~ .) +
  labs(title = 'Boxplot of Loan Balance to Default By Student Status',
       x = 'Default?',
       y = 'Balance (USD)')
#convert the variable default to a binary 0/1
Default <- Default %>%
  mutate(default = ifelse(default == 'Yes', 1, 0))
#create a model with the interaction terms student*income, student*balance, and balance*income
model <- glm(default ~ student + balance + income + student*income + balance*student + balance*income,
             data = Default, family = binomial())
#create a second model with no interaction terms
model2 <- glm(default ~ student + balance + income, family = "binomial", data = Default)
With interaction terms, the only identifiable predictor variable is the balance. Without interaction terms the balance still plays a significant role, but student status also carries some weight in predicting default, as it is statistically significant in that model. Additionally, the AIC of the model without interaction terms is lower than that of the model with interaction terms.
summary(model)
##
## Call:
## glm(formula = default ~ student + balance + income + student *
## income + balance * student + balance * income, family = binomial(),
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4848 -0.1417 -0.0554 -0.0202 3.7579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.104e+01 1.866e+00 -5.914 3.33e-09 ***
## studentYes -5.201e-01 1.344e+00 -0.387 0.699
## balance 5.882e-03 1.180e-03 4.983 6.27e-07 ***
## income 4.050e-06 4.459e-05 0.091 0.928
## studentYes:income 1.447e-05 2.779e-05 0.521 0.602
## studentYes:balance -2.551e-04 7.905e-04 -0.323 0.747
## balance:income -1.579e-09 2.815e-08 -0.056 0.955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.1 on 9993 degrees of freedom
## AIC: 1585.1
##
## Number of Fisher Scoring iterations: 8
summary(model2)
##
## Call:
## glm(formula = default ~ student + balance + income, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
Answer - The error rate of the two models, as shown below, is very small for predicting defaults. This tells us that either model predicts default with roughly 97% accuracy. Neither model shows a meaningful difference in deviance, and the less complicated model predicts just as accurately, so to avoid the possibility of overfitting the simpler model should be used. Additionally, the chi-square test shows no significance at the 5% level, so we can conclude that both models perform about the same.
#predict fitted probabilities from both models
def_fit1 <- predict(model, type = 'response')
def_fit2 <- predict(model2, type = 'response')
#build two confusion matrices and calculate the error rates
true <- Default$default
p4_pred1 <- factor(ifelse(def_fit1 >= 0.50, "Yes", "No"))
p4_true1 <- factor(ifelse(true >= 0.5, "Yes", "No"))
p4_error1 <- table(p4_pred1, True = p4_true1)
error_rate <- 1 - (p4_error1[1, 1] + p4_error1[2, 2]) / sum(p4_error1)
error_rate
## [1] 0.0269
p4_pred2 <- factor(ifelse(def_fit2 >= 0.50, "Yes", "No"))
p4_true2 <- factor(ifelse(true >= 0.5, "Yes", "No"))
p4_error2 <- table(p4_pred2, True = p4_true2)
error_rate <- 1 - (p4_error2[1, 1] + p4_error2[2, 2]) / sum(p4_error2)
error_rate
## [1] 0.0269
anova(model,model2,test='Chisq')
## Analysis of Deviance Table
##
## Model 1: default ~ student + balance + income + student * income + balance *
## student + balance * income
## Model 2: default ~ student + balance + income
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9993 1571.1
## 2 9996 1571.5 -3 -0.47911 0.9235
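Since both training error rates are nearly identical, a holdout split is a fairer check that the simpler model generalizes as well as the interaction model. A minimal sketch, assuming a 70/30 split and an arbitrary seed (the object names below are new):
#hold out 30% of the data, refit the simpler model, and compute the test error rate
set.seed(1)
train_idx <- sample(nrow(Default), 0.7 * nrow(Default))
default_train <- Default[train_idx, ]
default_test <- Default[-train_idx, ]
holdout_fit <- glm(default ~ student + balance + income, family = 'binomial', data = default_train)
holdout_pred <- ifelse(predict(holdout_fit, newdata = default_test, type = 'response') >= 0.5, 1, 0)
mean(holdout_pred != default_test$default) #test-set error rate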
(During class on Friday) Go through Section 7.3.1 of the Handbook. Run all the codes (additional exploration of data is allowed) and write your own version of explanation and interpretation. This needs to be submitted with HW 2 problems 1-4 above.
Answer - The two plots below show how the conditional distribution of the ESR categories changes with fibrinogen and globulin.
#load the plasma data and draw the two conditional density plots side by side
data("plasma", package = "HSAUR3")
layout(matrix(1:2, ncol = 2))
cdplot(ESR ~ fibrinogen, data = plasma)
cdplot(ESR ~ globulin, data = plasma)
layout(1) #reset the plotting layout for later base-graphics plots
Answer - The code below fits a logistic regression model in R of the ESR variable on fibrinogen from the plasma data set. The family argument specifies the type of model, in this case binomial for logistic regression. The summary output shows that fibrinogen is significant at the 5% level: an increase of one unit in fibrinogen increases the log-odds of an ESR greater than 20 by approximately 1.83, with the confidence interval below (0.34 to 4.00).
plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma, family=binomial())
confint(plasma_glm_1,parm='fibrinogen')
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.3387619 3.9984921
summary(plasma_glm_1)
##
## Call:
## glm(formula = ESR ~ fibrinogen, family = binomial(), data = plasma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9298 -0.5399 -0.4382 -0.3356 2.4794
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.8451 2.7703 -2.471 0.0135 *
## fibrinogen 1.8271 0.9009 2.028 0.0425 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30.885 on 31 degrees of freedom
## Residual deviance: 24.840 on 30 degrees of freedom
## AIC: 28.84
##
## Number of Fisher Scoring iterations: 5
Answer - The output below exponentiates the fibrinogen coefficient and its confidence interval to put them on the odds scale. The confidence interval is wide due to the low number of observations and because there are very few observations where the ESR value is above 20. It is evident that increased values of fibrinogen lead to a greater probability of an ESR above 20.
exp(coef(plasma_glm_1)['fibrinogen'])
## fibrinogen
## 6.215715
exp(confint(plasma_glm_1, parm='fibrinogen'))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 1.403209 54.515884
Answer - The summary output shows a gamma globulin coefficient close to zero, which means that variable has little impact on the odds of an ESR above 20. The reduction in residual deviance relative to the first model is only 1.87; with the chi-square test on a single degree of freedom this is not significant at the 5% level, so globulin adds no significant explanatory value beyond fibrinogen.
plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial())
summary(plasma_glm_2)
##
## Call:
## glm(formula = ESR ~ fibrinogen + globulin, family = binomial(),
## data = plasma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9683 -0.6122 -0.3458 -0.2116 2.2636
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.7921 5.7963 -2.207 0.0273 *
## fibrinogen 1.9104 0.9710 1.967 0.0491 *
## globulin 0.1558 0.1195 1.303 0.1925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30.885 on 31 degrees of freedom
## Residual deviance: 22.971 on 29 degrees of freedom
## AIC: 28.971
##
## Number of Fisher Scoring iterations: 5
anova(plasma_glm_1, plasma_glm_2, test= 'Chisq')
## Analysis of Deviance Table
##
## Model 1: ESR ~ fibrinogen
## Model 2: ESR ~ fibrinogen + globulin
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30 24.840
## 2 29 22.971 1 1.8692 0.1716
Answer - The bubble plot of the predicted values from the second model indicates an increased probability of an ESR value above 20 as fibrinogen and, to a lesser extent, globulin increase.
prob <- predict(plasma_glm_2, type = 'response')
plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6), ylim = c(25, 55), pch = '.')
symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE)
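The same figure can also be reproduced in ggplot2 for consistency with the earlier plots; a sketch, with the aesthetic choices (open circles, legend title) being my own:
#ggplot2 version of the bubble plot: point size maps to the predicted probability
ggplot(cbind(plasma, prob = prob), aes(x = fibrinogen, y = globulin, size = prob)) +
  geom_point(shape = 1) +
  scale_size_continuous(name = 'Pr(ESR > 20)') +
  labs(title = 'Predicted Probability of ESR > 20',
       x = 'Fibrinogen Level in Blood',
       y = 'Globulin Level in Blood')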