suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("tidyr"))
suppressPackageStartupMessages(library("leaps"))

Problem 1

Part A

\[y = \beta_0 + \beta_1x + \beta_2z + \beta_3xz + \epsilon\] z = indicator variables

0 -> Group A

2 -> Group B

This can be rewritten as:

\[y = (\beta_0 + \beta_2z) + (\beta_1 +\beta_3z)x + \epsilon\]

where \((\beta_0 + \beta_2z)\) is the intercept and \((\beta_1 +\beta_3z)\) is the slope.

Graphs
  1. \(\beta_0\), \(\beta_1\), and \(\beta_2\) would be significantly different from zero and the signs of all these coefficients would be positive -> additive model
  2. \(\beta_0\), \(\beta_1\), \(\beta_2\), \(\beta_3\) would be sigificantly different from zero and they would all be positive
  3. \(\beta_0\), \(\beta_1\), and \(\beta_2\) would be significantly different from zero and they would all be positive
  4. \(\beta_0\), \(\beta_1\), \(\beta_2\), \(\beta_3\) would be significantly different from zero. \(\beta_0\), \(\beta_3\) -> positive && \(\beta_1\), \(\beta_2\) -> negative

Part B

For scatterplot (a): Group A: \(y = \beta_0 + \beta_1x + \epsilon\) Group B: \(y = (\beta_0 + \beta_2) + \beta_1x + \epsilon\)

For scatterplot (b): Group A: \(y = \beta_0 + \beta_1x + \epsilon\) Group B: \(y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)x + \epsilon\)

For scatterplot (c): Group A: \(y = \beta_0 + \beta_1x + \epsilon\) Group B: \(y = \beta0 + (\beta1 + \beta3)x + \epsilon\)

For scatterplot (d): d) Group A: \(y = \beta_0 + \beta_1x + \epsilon\) d) Group B: \(y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)x + \epsilon\)

Problem 2

Part A

The above coding is not appropriate because using a 2 or 3 with categorical predictors changes the responsive variable by an incorrect factor of either 2 or 3. This suggests that drug/health stocks has twice the effect on P/E ratio than oil stocks and computers/electronics have triple the effect of oil. This is obvously incorrect as type of industry is nominal (categorical) variable doesn’t have order associated with it.

In order to correct the coding, C-1 binary indicator predictors must be chosen. There are 3 categories so 2 binary predictors must be chosen. If the computer / electronic industory is the baseline, recoding the data would become:

\(X_3\) = 1:Oil, 0:Otherwise

\(X_4\) = 1:Drug/Health, 0:Otherwise

Part B

P2 = read.csv("prob2.csv")
P2$Industry <-as.factor(P2$Industry)
levels(P2$Industry) <- c("OIL", "DRUG", "COMP") # create categorical levels
reg <- lm(Ratio~Margin+Rate+relevel(P2$Industry, ref = "COMP"),data=P2) # make COMP the base category
summary(reg) #obtain p values
## 
## Call:
## lm(formula = Ratio ~ Margin + Rate + relevel(P2$Industry, ref = "COMP"), 
##     data = P2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8630 -1.4804  0.1177  1.4617  3.3626 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                              6.7041     2.1782   3.078
## Margin                                   0.1827     0.2092   0.873
## Rate                                     0.2128     0.1311   1.623
## relevel(P2$Industry, ref = "COMP")OIL    0.8352     1.5091   0.553
## relevel(P2$Industry, ref = "COMP")DRUG   3.8194     1.7793   2.147
##                                        Pr(>|t|)   
## (Intercept)                             0.00819 **
## Margin                                  0.39722   
## Rate                                    0.12696   
## relevel(P2$Industry, ref = "COMP")OIL   0.58868   
## relevel(P2$Industry, ref = "COMP")DRUG  0.04983 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 14 degrees of freedom
## Multiple R-squared:  0.6947, Adjusted R-squared:  0.6075 
## F-statistic: 7.965 on 4 and 14 DF,  p-value: 0.001449

Which predictors are statistically significant at the 0.05 level?

The only predictor that is statistically significant is type of industry. More specifically, the drug coefficient is the only one significant as the 0.05 level (0.04983). The P/E ratio for the drug category is different from the base category (computers/electronics). Oil category is not significantly different than the base category (computers/electronics) at the 0.05 level.

Interpret the coefficients of the dummy variables.

The dummy variables are shift dummy variables (not slope dummy variables). The computers/electronics industry type is the reference type. When the industry type is oil, the ratio is 0.8352 larger compared to computers/electronics. When the industry type is drug, the ratio is 3.8194 larger compared to computers/electronics. #### Part C

If drug/health is the base category:

reg1 <- lm(Ratio~Margin+Rate+relevel(P2$Industry, ref = "DRUG"),data=P2) # set drugs as the base category
summary(reg1) #obtain coefficients 
## 
## Call:
## lm(formula = Ratio ~ Margin + Rate + relevel(P2$Industry, ref = "DRUG"), 
##     data = P2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8630 -1.4804  0.1177  1.4617  3.3626 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                             10.5235     3.2654   3.223
## Margin                                   0.1827     0.2092   0.873
## Rate                                     0.2128     0.1311   1.623
## relevel(P2$Industry, ref = "DRUG")OIL   -2.9842     2.2309  -1.338
## relevel(P2$Industry, ref = "DRUG")COMP  -3.8194     1.7793  -2.147
##                                        Pr(>|t|)   
## (Intercept)                             0.00614 **
## Margin                                  0.39722   
## Rate                                    0.12696   
## relevel(P2$Industry, ref = "DRUG")OIL   0.20234   
## relevel(P2$Industry, ref = "DRUG")COMP  0.04983 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 14 degrees of freedom
## Multiple R-squared:  0.6947, Adjusted R-squared:  0.6075 
## F-statistic: 7.965 on 4 and 14 DF,  p-value: 0.001449

How will the coefficients in the new equation be related to those fitted in (b)?

  • Intercept(new) = Intercept(old) + Drug(old) = 6.7041+3.8194 = 10.5235, which matches the coefficient above.
  • Oil(new) = Oil(old) – Drug(old) = 0.8352 – 3.8194 = -2.9842
  • Comp(new) = Comp(old)-Drug(old) = 0-3.8194 = -3.8194

The coefficients for drug, margin, and rate wont change. The margin and rate cofficients stayed the same since we are still plotting those same numbers. The coefficient for oil changed drastically because we are now comparing type of to type of drug the sign of the computer categorical predictor changed since we are comparing it the opposite way and switched the bases.

Part D

P2new <- P2
P2new$Oil <- (P2new$Industry == "OIL")
P2new$Drug <- (P2new$Industry == "DRUG")
reg2 <- lm(Ratio~Margin+Rate+relevel(P2$Industry, ref = "COMP"), data=P2) # no interaction
reg3 <- lm(Ratio~Margin+Rate+Oil+Drug+Rate:Oil+Rate:Drug, data=P2new) #model with interaction
anova(reg2, reg3)
## Analysis of Variance Table
## 
## Model 1: Ratio ~ Margin + Rate + relevel(P2$Industry, ref = "COMP")
## Model 2: Ratio ~ Margin + Rate + Oil + Drug + Rate:Oil + Rate:Drug
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     14 74.650                           
## 2     12 73.449  2    1.2014 0.0981 0.9072

The F-statistic for the partial F-test is 0.0981, and it has a P-value of 0.9072. This means that at the 0.05 significance level, adding the interaction terms between the ratio and the indicator variables is not statistically significant and therefore should not be included in the predictive model.

Part E

Explain the meaning of an interaction between growth rate and the type of industry.

When the coefficients for both the interaction terms are equal to zero, it means that the effects of growth on the P/E ratio does not depend on the type of industry. If either of the two interaction terms’ coefficients were not equal to zero, it would mean that the effects of growth on the P/E ratio does depend on the industry type.

An interaction between growth rate and the type of industry means that the growth rate would change depending on what type of industry was beign analyzed.

Problem 3

Part A

lm3 <- lm(Ratio~Margin+Rate+relevel(P2$Industry, ref = "COMP"),data=P2) 
lm0 <- lm(Ratio~1, data=P2)
upper <- formula(lm3) 
step(lm0, scope=upper, direction="both", trace=1)
## Start:  AIC=50.54
## Ratio ~ 1
## 
##                                      Df Sum of Sq     RSS    AIC
## + relevel(P2$Industry, ref = "COMP")  2   146.226  98.311 37.230
## + Margin                              1   131.398 113.139 37.899
## + Rate                                1    78.549 165.987 45.182
## <none>                                            244.537 50.544
## 
## Step:  AIC=37.23
## Ratio ~ relevel(P2$Industry, ref = "COMP")
## 
##                                      Df Sum of Sq     RSS    AIC
## + Rate                                1    19.594  78.717 35.007
## <none>                                             98.311 37.230
## + Margin                              1     9.621  88.690 37.273
## - relevel(P2$Industry, ref = "COMP")  2   146.226 244.537 50.544
## 
## Step:  AIC=35.01
## Ratio ~ relevel(P2$Industry, ref = "COMP") + Rate
## 
##                                      Df Sum of Sq     RSS    AIC
## <none>                                             78.717 35.007
## + Margin                              1     4.067  74.650 35.999
## - Rate                                1    19.594  98.311 37.230
## - relevel(P2$Industry, ref = "COMP")  2    87.271 165.987 45.182
## 
## Call:
## lm(formula = Ratio ~ relevel(P2$Industry, ref = "COMP") + Rate, 
##     data = P2)
## 
## Coefficients:
##                            (Intercept)  
##                                 8.0303  
##  relevel(P2$Industry, ref = "COMP")OIL  
##                                 0.4509  
## relevel(P2$Industry, ref = "COMP")DRUG  
##                                 4.8915  
##                                   Rate  
##                                 0.2427
#best model
lm4 <-lm(Ratio~Rate+relevel(P2$Industry, ref = "COMP"), data=P2)
summary(lm4)
## 
## Call:
## lm(formula = Ratio ~ Rate + relevel(P2$Industry, ref = "COMP"), 
##     data = P2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9340 -1.2694 -0.1194  0.9965  4.1953 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                              8.0303     1.5491   5.184
## Rate                                     0.2427     0.1256   1.932
## relevel(P2$Industry, ref = "COMP")OIL    0.4509     1.4321   0.315
## relevel(P2$Industry, ref = "COMP")DRUG   4.8915     1.2777   3.828
##                                        Pr(>|t|)    
## (Intercept)                            0.000111 ***
## Rate                                   0.072432 .  
## relevel(P2$Industry, ref = "COMP")OIL  0.757192    
## relevel(P2$Industry, ref = "COMP")DRUG 0.001645 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.291 on 15 degrees of freedom
## Multiple R-squared:  0.6781, Adjusted R-squared:  0.6137 
## F-statistic: 10.53 on 3 and 15 DF,  p-value: 0.0005572

What model does stepwise regression give as the best model?

The best model given by stepwise regression is

Ratio = Industry + Rate with AIC = 35.01

Which variables are included in the model?

(Ratio = OIL + DRUG + Rate)

How do the P-values for the individual coefficients compare with those in Problem 2b? Explain any differences.

The coefficients indicate that that whether or not the type of industry is drugs/heatlh or computers/electronics it has the most significant effect. Rate and whether the type of industry is oil has minimal effect on the model. THe p-values here compared to the p-value are smaller for Rate and Drug/Health, however, its bigger for Oil. Since this model is smaller, it depends more on Rate and whether or not the type of industry is drugs or computers since it depends heavily on these, the effect of Oil is mitigated.

Interpret the coefficients.

The interpretation of the coefficients it that a 1 unit increase in rate would increase the P/E ratio by 0.2427. If oil was chosen it would increase the P/E ratio by 0.4509 over computers/electronics. If drugs were chosen it would increase the P/E ratio by 0.48915 over computers/electronics.

Part B

best<-leaps(P2new[,c(2,3,6,7)], P2new[,5], method="Cp", nbest=2, names=names(P2new)[2:5]) #best subsets regression
data.frame(size=best$size,Cp=best$Cp,best$which)
## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,4,6 --> row.names NOT used
##   size       Cp Margin  Rate Industry Ratio
## 1    2 3.466117  FALSE FALSE    FALSE  TRUE
## 2    2 6.218196   TRUE FALSE    FALSE FALSE
## 3    3 1.860284  FALSE  TRUE    FALSE  TRUE
## 4    3 3.728071   TRUE FALSE    FALSE  TRUE
## 5    4 3.306316   TRUE  TRUE    FALSE  TRUE
## 6    4 3.762704  FALSE  TRUE     TRUE  TRUE
## 7    5 5.000000   TRUE  TRUE     TRUE  TRUE

Compare the results to part (a).

Best Subsets Regression Performed.

Ratio = Rate + Drug Cp = 1.860284 This results in a different equation than in part (a).

Problem 4

Part A

P4 = read.csv("prob4.csv")
glm1 <-glm(y~x1+x2, family = binomial(link="logit"), data=P4) # model with both predictors
glm0 <-glm(y~1, family = binomial(link="logit"), data=P4) # model with no predictors
summary(glm1)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial(link = "logit"), 
##     data = P4)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.16160   0.00038   0.02340   0.09611   1.12259  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -33.36743   23.08518  -1.445   0.1483  
## x1            0.05688    0.40448   0.141   0.8882  
## x2            0.21042    0.11153   1.887   0.0592 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48.170  on 45  degrees of freedom
## Residual deviance:  8.021  on 43  degrees of freedom
## AIC: 14.021
## 
## Number of Fisher Scoring iterations: 9
anova(glm0, glm1, test = "Chisq") # likelihood ratio test
## Analysis of Deviance Table
## 
## Model 1: y ~ 1
## Model 2: y ~ x1 + x2
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        45     48.170                          
## 2        43      8.021  2   40.149 1.913e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for the likelihood ratio rest is 1.913 x 10^-9 indicating that the fit is statistically significant. The variable x2 (weight) appears to be more significant with a p-value of 0.0592 versues x1 (height) with a p-value of 0.89. The two predictors have moderate values compared to the test of whether any predictor is signficant because both are important to predict whether it is a male or female, however, with no predictors, it is impossible to predict this.

Part B

xnew <- data.frame(x1=71, x2=175)
pred_y <- predict(glm1, xnew, type = "response", se.fit = TRUE)
pred_y
## $fit
##         1 
## 0.9994446 
## 
## $se.fit
##           1 
## 0.001814901 
## 
## $residual.scale
## [1] 1

Fitted Model …. 0.999443 TODO: TODOTODODODODO the predict glm function gave 0.99944

Part C

x_plot <-data.frame(x2 = seq(100,200, by=3), x1 = 68)
p_plot <- predict(glm1, x_plot, type = "response")
plot(x_plot$x2, p_plot, xlab = "Weight", ylab = "Predicted Prob")

Yes, the plot makes sense because as weight increases, the probability that the person is male also increases. Generally, men weight more than women.

Part D

x2 adds ratio = exp { 0.210423 } = 1.23 This means that if x2 changes by a factor of 1, the odds change by a factor of 1.23. #### Part E

#Using Loglik
glm2<-glm(y~x1+x2,family=binomial(link = "logit"),data=P4)
glm3<-glm(y~x2,family=binomial(link = "logit"),data=P4)
logLik(glm2)
## 'log Lik.' -4.010504 (df=3)
logLik(glm3)
## 'log Lik.' -4.020338 (df=2)
G2<- 2*as.numeric(logLik(glm2)-logLik(glm3))
1-pchisq(G2,df=1)
## [1] 0.8884684
#Extracting AIC's
extractAIC(glm2)
## [1]  3.00000 14.02101
extractAIC(glm3)
## [1]  2.00000 12.04068
#Using anova
anova(glm2, glm3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: y ~ x1 + x2
## Model 2: y ~ x2
##   Resid. Df Resid. Dev Df  Deviance Pr(>Chi)
## 1        43     8.0210                      
## 2        44     8.0407 -1 -0.019668   0.8885

The p-value from the chi squared test using the loglik function is 0.868 meaning x1 is not statistically significant and should not be included. The AIC for the model with weight was lower ; this is consistent with the chi-squared test. The anova showed a high p-value of 0.8885 showing weight is not significant.

Part F

glm4<-glm(y~x1+x2, family=binomial(link = "logit"), data=P4) #scatterplot of the response Y versus the fitted probability
plot(glm4$fitted, jitter(P4$y, factor=0.8))

1 was misclassified #### Part G

plot(P4$x1[P4$y==1],P4$x2[P4$y==1],col="red", pch=15, xlab="height", ylab="weight", xlim=c(60,80), ylim=c(90,250))
points(P4$x1[P4$y==0], P4$x2[P4$y==0], col="black", pch=19)

There is one female who is misclassified as a male #### Part H

glm5<-glm(y~x1+x2,family=binomial(link = "logit"),data=P4)
summary(glm5)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial(link = "logit"), 
##     data = P4)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.16160   0.00038   0.02340   0.09611   1.12259  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -33.36743   23.08518  -1.445   0.1483  
## x1            0.05688    0.40448   0.141   0.8882  
## x2            0.21042    0.11153   1.887   0.0592 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48.170  on 45  degrees of freedom
## Residual deviance:  8.021  on 43  degrees of freedom
## AIC: 14.021
## 
## Number of Fisher Scoring iterations: 9
D <- cooks.distance(glm5)
D
##            1            2            3            4            5 
## 1.160707e+00 1.780602e-08 3.879525e-02 7.227238e-02 5.604026e-06 
##            6            7            8            9           10 
## 2.889759e-13 8.395500e-07 4.632908e-06 7.047128e-11 2.460444e-04 
##           11           12           13           14           15 
## 6.054429e-11 2.655760e-04 1.329026e-05 1.356478e-06 1.459591e-02 
##           16           17           18           19           20 
## 5.760908e-03 6.006840e-07 1.101212e-02 1.655535e-05 1.930297e-04 
##           21           22           23           24           25 
## 3.059753e-05 4.754830e-13 4.456592e-04 2.860409e-08 2.070623e-04 
##           26           27           28           29           30 
## 1.146303e-07 3.113488e-05 4.632908e-06 1.811331e-07 6.465401e-04 
##           31           32           33           34           35 
## 1.836941e-04 2.070623e-04 4.584481e-05 6.506773e-14 3.654594e-11 
##           36           37           38           39           40 
## 5.730641e-16 6.519623e-03 3.405853e-04 6.325835e-01 6.593042e-06 
##           41           42           43           44           45 
## 8.068036e-06 6.054429e-11 2.209435e-08 6.593042e-06 2.460444e-04 
##           46 
## 1.642805e-06
Inf1 <- influence(glm5)

The two most influential observations are #1 and #39. They are influential because they significantly change the estimated coefficients. These 2 are influential because their respective cook’s distance is > 4/n. If observation #1 was removed, the intercept changes by -26.55, x1 by .7 and x2 by -0.1429. If observation #39 was removed, the intercept changes by 14.437, x1 by -0.182, and x2 by 0.01835. If each were removed, the line wouldn’t change by much but the slope might become less negative. #### Part I

# THIS PART IS WRONG I THINK
P4new<-data.frame(P4[-1,c(1:3)])
glm6<-glm(y~x1+x2,family=binomial(link = "logit"), data=P4new) #omit the 1st observation
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm6)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial(link = "logit"), 
##     data = P4new)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.54803   0.00000   0.00139   0.03632   1.43945  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.0165    27.1967  -0.442    0.659
## x1           -0.7377     0.8770  -0.841    0.400
## x2            0.4349     0.3051   1.425    0.154
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 47.674  on 44  degrees of freedom
## Residual deviance:  5.255  on 42  degrees of freedom
## AIC: 11.255
## 
## Number of Fisher Scoring iterations: 10
glm7<-glm(y~x1+x2, data=P4[-39,c(1:3)]) #omit the 39th observation
summary(glm7)
## 
## Call:
## glm(formula = y ~ x1 + x2, data = P4[-39, c(1:3)])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.45572  -0.19926   0.00968   0.17444   0.40821  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.590656   0.626017  -4.138 0.000164 ***
## x1           0.032566   0.011071   2.942 0.005293 ** 
## x2           0.006900   0.001591   4.336 8.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.05925923)
## 
##     Null deviance: 7.2000  on 44  degrees of freedom
## Residual deviance: 2.4889  on 42  degrees of freedom
## AIC: 5.4373
## 
## Number of Fisher Scoring iterations: 2
#Graph without observation 1
P4new<-data.frame(P4[-1,c(1:3)])
plot(P4new$x1[P4new$y==1],P4new$x2[P4new$y==1],col="red", pch=15, xlab="height", ylab="weight", xlim=c(60,80), ylim=c(90,250))
points(P4new$x1[P4new$y==0], P4new$x2[P4new$y==0], col="black", pch=19)

#Graph without observation 39
P4new<-data.frame(P4[-39,c(1:3)])
plot(P4new$x1[P4new$y==1],P4new$x2[P4new$y==1],col="red", pch=15, xlab="height", ylab="weight", xlim=c(60,80), ylim=c(90,250))
points(P4new$x1[P4new$y==0], P4new$x2[P4new$y==0], col="black", pch=19)