Part 1: Problem 1 Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model. a: A p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. If the p-value is small, we call it significant and reject the null hypothesis. In the given linear model, each null hypothesis is that there is no relationship between a predictor and the response; a small (significant) p-value means we reject the null and conclude there is a relationship between that predictor and the response. b: In Table 3.4, the p-values of TV, radio, and newspaper are <0.0001, <0.0001, and 0.8599. The p-values for TV and radio are significant, so their null hypotheses can be rejected: there is strong evidence that TV and radio advertising spending affect sales. However, the p-value for newspaper is not significant, so we cannot reject its null hypothesis: there is no evidence that newspaper spending affects sales in the presence of TV and radio spending.
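A minimal sketch to reproduce Table 3.4, assuming the Advertising.csv file from the ISLR website sits in the working directory:
adv <- read.csv("Advertising.csv")
ad_fit <- lm(sales ~ TV + radio + newspaper, data = adv)
summary(ad_fit)$coefficients # TV and radio p-values are near zero; newspaper's is about 0.86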
Problem 3 Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Level (1 for College and 0 for High School), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get β0 = 50, β1 = 20, β2 = 0.07, β3 = 35, β4 = 0.01, β5 = −10. a: (iii) is true: for a fixed IQ and GPA, high school graduates earn more on average than college graduates once GPA exceeds 3.5, since the level effect is 35 − 10·GPA. b: 137.1k is the predicted salary of a college graduate with IQ = 110 and GPA = 4.0:
Iq = 110
Gpa = 4
Level = 1 # college graduate
Salary = 50 + 20*Gpa + 0.07*Iq + 35*Level + 0.01*(Gpa*Iq) - 10*(Gpa*Level)
Salary
## [1] 137.1
c: False. The size of a coefficient by itself says nothing about the evidence for an interaction effect, because it depends on the scale of the variables. For the college graduate with an IQ of 110 and a GPA of 4.0, the interaction term adds 0.01 × 4 × 110 = 4.4k to the predicted salary, which is not negligible; in any case, we would have to examine the p-value (or standard error) of the coefficient to determine whether the interaction is statistically significant.
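As a quick check of part (a): at a fixed IQ, the college-minus-high-school salary gap implied by the model is 35 − 10·GPA, so it flips sign at a GPA of 3.5.
gpa <- c(3.0, 3.5, 4.0)
gap <- 35 - 10*gpa # college minus high school, in thousands of dollars
rbind(gpa, gap) # positive below 3.5, zero at 3.5, negative above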
Problem 4 I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = β0 + β1X + β2X² + β3X³ + ε. a: For training data, RSS decreases as we increase model complexity. The extra polynomial terms allow a closer fit (more degrees of freedom) to the training data, so I would expect the training RSS for cubic regression to be lower than for simple linear regression. b: On test data, the overfitted cubic regression will produce predictions far from the true values, whereas linear regression, which is close to the true function, will perform better. Hence, test RSS will be lower for linear regression. c: Cubic regression will fit non-linear data better, so its training RSS will again be lower. d: We cannot predict whether test RSS will be lower for cubic or linear regression, since we have no clue what the true function is like. However, we can compute the test RSS for both models: if the cubic's test RSS is lower than the linear model's, we can say the true relationship is likely more non-linear. A quick simulation below illustrates parts (a) and (b).
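The following is a small illustrative simulation of my own (with a made-up linear truth), not part of the exercise: the cubic fit always achieves lower training RSS, but typically loses on test RSS when the truth is linear.
set.seed(42) # illustrative simulation with an assumed linear truth: y = 1 + 2x + noise
train = data.frame(x = rnorm(100)); train$y = 1 + 2*train$x + rnorm(100)
test = data.frame(x = rnorm(100)); test$y = 1 + 2*test$x + rnorm(100)
lin = lm(y ~ x, data = train)
cub = lm(y ~ poly(x, 3), data = train)
c(lin = sum(resid(lin)^2), cub = sum(resid(cub)^2)) # (a) training RSS: cubic always <= linear
c(lin = sum((test$y - predict(lin, test))^2),
  cub = sum((test$y - predict(cub, test))^2)) # (b) test RSS: linear usually wins here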
Part 2: Problem 8
# Part a
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
fix(Auto)
data(Auto)
mpg_pwr = lm(mpg~horsepower, data = Auto)
summary(mpg_pwr)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
# (i)There is strong evidence of a relationship between mpg and horsepower as the p-value for horsepower's coefficient is close to zero.
# (ii) The R^2 statistic is 0.61, meaning about 60% of the variance in mpg is explained by horsepower in this model. To gauge the residual error relative to the response, divide the RSE (4.906) by the mean of mpg (about 23.4), which gives roughly a 21% error rate (a quick computation follows the interval output below). Overall, the relationship between mpg and horsepower is reasonably strong.
# (iii) mpg has a negative linear relationship with horsepower: for every unit increase in horsepower, mpg falls by 0.158.
# (iv) y_hat = 39.936 - (0.158 x 98) = 24.47 mpg, matching the predict() output of 24.467 below. The point prediction alone tells us nothing about its uncertainty, so we also compute the prediction and confidence intervals:
predict(mpg_pwr, data.frame(horsepower = c(98)), interval = 'prediction')
## fit lwr upr
## 1 24.46708 14.8094 34.12476
predict(mpg_pwr, data.frame(horsepower = c(98)), interval = 'confidence')
## fit lwr upr
## 1 24.46708 23.97308 24.96108
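# A quick check of the error rate from (ii): the RSE relative to the mean response (illustrative computation):
sigma(mpg_pwr) / mean(Auto$mpg) # about 0.21, i.e. a ~21% error rate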
# Part B
plot(mpg~horsepower,main= "Scatter plot of mpg vs. horsepower", data=Auto)
abline(mpg_pwr, lwd =3, col ="light blue")
# Part C
#The residuals v fitted chart shows a slight u-shaped pattern, and this indicates non-linearity in the data.
#The residuals v leverage chart shows that some observations have high leverage.
#The scale-location chart shows some possible outliers. We can confirm by using studentized residuals to find observations with values greater than 3:
par(mfrow=c(2,2))
plot(mpg_pwr)
print(rstudent(mpg_pwr)[which(rstudent(mpg_pwr)>3)])
## 323 330
## 3.508709 3.149671
Problem 9
# Part A
head(Auto)
pairs(Auto[,1:8])
# Part B
cor(Auto[,1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
# Part C
mpg_all<- lm(mpg~.-name, data = Auto)
summary(mpg_all)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
# Part D
# The Residuals v Fitted values chart indicates some non-linearity in the data, and so polynomial regression could provide a better fit than simple linear regression.
# The standardized residuals vs leverage chart shows that observation 14 has high leverage.
# Some observations are potential outliers:
par(mfrow=c(2,2))
plot(mpg_all)
rstudent(mpg_all)[which(rstudent(mpg_all)>3)]
## 245 323 326 327
## 3.390068 4.029537 3.494823 3.690246
# Additionally, we can check for collinearity between the predictors by finding `VIF` values. The results show some variables with VIF higher than 5 or 10, which according to the text is problematic. Collinearity inflates the variance of the coefficient estimates, and hence increases their standard errors. One illustrative follow-up appears after the VIF output.
library(car)
## Loading required package: carData
vif(mpg_all)
## cylinders displacement horsepower weight acceleration year
## 10.737535 21.836792 9.943693 10.831260 2.625806 1.244952
## origin
## 1.772386
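# One simple, illustrative remedy: drop the highest-VIF predictor (displacement) and re-check the VIF values.
mpg_reduced = lm(mpg~.-name-displacement, data=Auto)
vif(mpg_reduced)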
# Part E
# cylinders:year and horsepower:acceleration are statistically significant. The $R^2$ metric has increased from 0.82 to 0.85.
mpg_interaction = lm(mpg~.-name+ year:cylinders+ acceleration:horsepower, data=Auto)
summary(mpg_interaction)
##
## Call:
## lm(formula = mpg ~ . - name + year:cylinders + acceleration:horsepower,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0203 -1.7318 -0.1015 1.5639 11.9559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.862e+01 1.212e+01 -7.311 1.57e-12 ***
## cylinders 1.181e+01 2.349e+00 5.029 7.61e-07 ***
## displacement -8.775e-03 7.916e-03 -1.108 0.26838
## horsepower 9.151e-02 2.502e-02 3.658 0.00029 ***
## weight -4.269e-03 6.967e-04 -6.127 2.23e-09 ***
## acceleration 8.439e-01 1.590e-01 5.306 1.90e-07 ***
## year 1.521e+00 1.590e-01 9.569 < 2e-16 ***
## origin 1.070e+00 2.609e-01 4.102 5.00e-05 ***
## cylinders:year -1.520e-01 3.017e-02 -5.037 7.32e-07 ***
## horsepower:acceleration -9.837e-03 1.778e-03 -5.533 5.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.049 on 382 degrees of freedom
## Multiple R-squared: 0.8509, Adjusted R-squared: 0.8474
## F-statistic: 242.2 on 9 and 382 DF, p-value: < 2.2e-16
# Part F
# Both mpg_poly and mpg_poly2 models see an increase in $R^2$ metric from 0.82 to 0.86. There are differences in the statistical significance of some predictors between the transformed models and the multiple regression model in part C.
mpg_poly = lm(mpg~.-name + year:cylinders + I(horsepower^2)
+ I(acceleration^2),data=Auto)
summary(mpg_poly)
##
## Call:
## lm(formula = mpg ~ . - name + year:cylinders + I(horsepower^2) +
## I(acceleration^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9986 -1.5525 -0.1194 1.4348 11.7722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.394e+01 1.430e+01 -2.374 0.018075 *
## cylinders 8.481e+00 2.340e+00 3.624 0.000329 ***
## displacement -1.106e-02 7.330e-03 -1.509 0.132051
## horsepower -2.720e-01 3.531e-02 -7.703 1.16e-13 ***
## weight -3.338e-03 6.812e-04 -4.900 1.42e-06 ***
## acceleration -1.378e+00 5.421e-01 -2.542 0.011403 *
## year 1.272e+00 1.594e-01 7.982 1.71e-14 ***
## origin 1.027e+00 2.493e-01 4.121 4.63e-05 ***
## I(horsepower^2) 8.040e-04 1.140e-04 7.054 8.22e-12 ***
## I(acceleration^2) 3.351e-02 1.578e-02 2.124 0.034303 *
## cylinders:year -1.056e-01 3.023e-02 -3.493 0.000533 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.935 on 381 degrees of freedom
## Multiple R-squared: 0.8622, Adjusted R-squared: 0.8586
## F-statistic: 238.3 on 10 and 381 DF, p-value: < 2.2e-16
mpg_poly2 = lm(mpg~.-name-cylinders + log(weight) + log(acceleration) +
sqrt(displacement),data=Auto)
summary(mpg_poly2)
##
## Call:
## lm(formula = mpg ~ . - name - cylinders + log(weight) + log(acceleration) +
## sqrt(displacement), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2104 -1.6665 -0.1085 1.5977 12.5231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 290.113140 49.599189 5.849 1.06e-08 ***
## displacement 0.032477 0.028592 1.136 0.256720
## horsepower -0.043782 0.013065 -3.351 0.000885 ***
## weight 0.006923 0.002226 3.110 0.002013 **
## acceleration 2.001283 0.468834 4.269 2.48e-05 ***
## year 0.801707 0.044950 17.836 < 2e-16 ***
## origin 0.502973 0.262462 1.916 0.056064 .
## log(weight) -34.848861 6.862200 -5.078 5.96e-07 ***
## log(acceleration) -33.152402 7.671145 -4.322 1.98e-05 ***
## sqrt(displacement) -1.043089 0.820337 -1.272 0.204311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.914 on 382 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8607
## F-statistic: 269.3 on 9 and 382 DF, p-value: < 2.2e-16
Problem 13
set.seed(1)
# Part A
x = rnorm(100, mean = 0, sd=1)
# Part B
eps = rnorm(100, mean = 0, sd=0.25)
# Part C
y = -1 +(0.5*x)+eps
# Length of y=100, beta0=-1, beta1=0.5
# Part D
plot(y~x, main="Scatter plot of x vs y", col='red')
# Part E
lm.fit = lm(y~x)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm.fit, lwd = 1, col='cyan')
# Part F
#A positive linear relationship exists between x and y, with added variance introduced by the error term.
#The estimates are beta0_hat = -1.009 and beta1_hat = 0.500, very close to the true values beta0 = -1 and beta1 = 0.5. This is further confirmed by the regression and population lines lying nearly on top of each other. The p-values are near zero and the F-statistic is large, so the null hypothesis can be rejected.
abline(a = -1, b=0.5,lwd = 1, col='black')
legend('bottomright',bty ='n', legend=c("Least Squares Line", "Population Line"),col=c('cyan','black'), lty = c(1,1))
# Part G
# The quadratic term does not improve the model fit. The F-statistic is reduced, and the p-value for the squared term is higher than 0.05 and shows that it isn't statistically significant.
lm.fit2 = lm(y~x+I(x^2))
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4913 -0.1563 -0.0322 0.1451 0.5675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.98582 0.02941 -33.516 <2e-16 ***
## x 0.50429 0.02700 18.680 <2e-16 ***
## I(x^2) -0.02973 0.02119 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2395 on 97 degrees of freedom
## Multiple R-squared: 0.7828, Adjusted R-squared: 0.7784
## F-statistic: 174.8 on 2 and 97 DF, p-value: < 2.2e-16
# Part H
# The points lie closer to the fitted line, the RSE is lower, and R2 and the F-statistic are much higher than in the original model (eps sd = 0.25). The least squares and population lines are very close to each other as the noise is reduced, and the linear trend is much more apparent.
eps = rnorm(100, mean = 0, sd=sqrt(0.01))
y = -1+(0.5*x)+eps
plot(y~x, main='Reduced Noise', col='black')
lm.fit2 = lm(y~x)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.291411 -0.048230 -0.004533 0.064924 0.264157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99726 0.01047 -95.25 <2e-16 ***
## x 0.50212 0.01163 43.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1039 on 98 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9495
## F-statistic: 1864 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm.fit2, lwd=1, col ="red")
abline(a=-1,b=0.5, lwd=1, col="pink")
legend('bottomright', bty='n', legend=c('Least Squares Line', 'Population Line'), col=c('pink','red'), lty = c(1, 1))
# Part I
# The points are more spread out, so the linear trend is harder to see. The RSE is higher, and R2 and the F-statistic are lower than in the original model (eps sd = 0.25).
eps = rnorm(100, mean=0, sd=sqrt(0.5625))
y = -1 +(0.5*x) + eps
plot(y~x, main='Increased Noise', col='light blue')
lm.fit4 = lm(y~x)
summary(lm.fit4)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88719 -0.40893 -0.02832 0.50466 1.40916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.95675 0.07521 -12.721 < 2e-16 ***
## x 0.45824 0.08354 5.485 3.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7466 on 98 degrees of freedom
## Multiple R-squared: 0.2349, Adjusted R-squared: 0.2271
## F-statistic: 30.09 on 1 and 98 DF, p-value: 3.227e-07
abline(lm.fit4, lwd=1, col ="blue")
abline(a=-1,b=0.5, lwd=1, col="red")
legend('bottomright', bty='n', legend=c('Least Squares Line', 'Population Line'), col=c('blue','red'), lty = c(1, 1))
# Part J
# Confidence interval values are narrowest for the lowest variance model, widest for the highest variance model and in-between these two for the original model.
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -1.0575402 -0.9613061
## x 0.4462897 0.5531801
confint(lm.fit2)
## 2.5 % 97.5 %
## (Intercept) -1.0180413 -0.9764850
## x 0.4790377 0.5251957
confint(lm.fit4)
## 2.5 % 97.5 %
## (Intercept) -1.1060050 -0.8074970
## x 0.2924541 0.6240169
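# The slope-interval widths make the comparison explicit (illustrative):
diff(confint(lm.fit)[2,]) # original noise (sd = 0.25)
diff(confint(lm.fit2)[2,]) # reduced noise (sd = 0.1)
diff(confint(lm.fit4)[2,]) # increased noise (sd = 0.75)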
Problem 15
library(MASS)
# Part A
# Regressing crim on each predictor individually (below), we can conclude that every predictor except chas has a significant relationship with crim.
data <- Boston
print(dim(data))
## [1] 506 14
head(data)
simple_coeff <- list()
predictor <- setdiff(names(data), 'crim')
for (col in predictor) {
formula <- as.formula(paste('crim ~', col))
result <- lm(formula, data = data)
coeff <- summary(result)$coefficients[2, 1] # slope estimate for the predictor
pvalue <- summary(result)$coefficients[2, 4] # p-value for the slope
cat(sprintf('%s %.4f %.4f\n', col, coeff, pvalue))
simple_coeff[[col]] <- coeff
plot(data[[col]], data$crim, main = paste('crim vs', col), xlab = col, ylab = 'crim')
abline(result, col = "yellow")}
## zn -0.0739 0.0000
## indus 0.5098 0.0000
## chas -1.8928 0.2094
## nox 31.2485 0.0000
## rm -2.6841 0.0000
## age 0.1078 0.0000
## dis -1.5509 0.0000
## rad 0.6179 0.0000
## tax 0.0297 0.0000
## ptratio 1.1520 0.0000
## black -0.0363 0.0000
## lstat 0.5488 0.0000
## medv -0.3632 0.0000
# Another Way
attach(Boston)
lm.zn = lm(crim~zn)
summary(lm.zn) # yes
##
## Call:
## lm(formula = crim ~ zn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## zn -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
lm.medv = lm(crim~medv)
lm.lstat = lm(crim~lstat)
lm.black = lm(crim~black)
lm.ptratio = lm(crim~ptratio)
lm.tax = lm(crim~tax)
lm.rad = lm(crim~rad)
lm.dis = lm(crim~dis)
lm.age = lm(crim~age)
lm.rm = lm(crim~rm)
lm.nox = lm(crim~nox)
lm.chas = lm(crim~chas)
lm.indus = lm(crim~indus)
# Part B
# Although the estimated coefficients are all nonzero, inspecting the p-values shows we can reject the null hypothesis only for zn, dis, rad, black, and medv.
lm.all = lm(crim~., data=Boston)
summary(lm.all)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
# Part C
# The coefficient for nox is approximately 31 in the univariate model but approximately -10 in the multiple regression model.
x = c(coefficients(lm.zn)[2],
coefficients(lm.indus)[2],
coefficients(lm.chas)[2],
coefficients(lm.nox)[2],
coefficients(lm.rm)[2],
coefficients(lm.age)[2],
coefficients(lm.dis)[2],
coefficients(lm.rad)[2],
coefficients(lm.tax)[2],
coefficients(lm.ptratio)[2],
coefficients(lm.black)[2],
coefficients(lm.lstat)[2],
coefficients(lm.medv)[2])
y = coefficients(lm.all)[2:14]
plot(x, y)
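# Labelling the points makes the nox coefficient easy to spot (illustrative addition):
text(x, y, labels = names(y), pos = 3, cex = 0.7)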
# Part D
# There is evidence of a non-linear association between the response and every predictor except black and chas.
# You could run summary() on each of the fits below, but I left the output out to keep the document short; a compact way to extract the cubic p-values appears after the fits.
lm.zn = lm(crim~poly(zn,3))
# summary(lm.zn)
lm.indus = lm(crim~poly(indus,3))
# summary(lm.indus)
# lm.chas = lm(crim~poly(chas,3)) : qualitative predictor
lm.nox = lm(crim~poly(nox,3))
# summary(lm.nox)
lm.rm = lm(crim~poly(rm,3))
# summary(lm.rm)
lm.age = lm(crim~poly(age,3))
# summary(lm.age)
lm.dis = lm(crim~poly(dis,3))
# summary(lm.dis)
lm.rad = lm(crim~poly(rad,3))
# summary(lm.rad)
lm.tax = lm(crim~poly(tax,3))
# summary(lm.tax)
lm.ptratio = lm(crim~poly(ptratio,3))
# summary(lm.crim)
lm.black = lm(crim~poly(black,3))
# summary(lm.black)
lm.lstat = lm(crim~poly(lstat,3))
# summary(lm.lstat)
lm.medv = lm(crim~poly(medv,3))
# summary(lm.medv)
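# Rather than printing every summary, we can pull the cubic term's p-value out of each fit (illustrative):
fits = list(zn=lm.zn, indus=lm.indus, nox=lm.nox, rm=lm.rm, age=lm.age, dis=lm.dis,
            rad=lm.rad, tax=lm.tax, ptratio=lm.ptratio, black=lm.black, lstat=lm.lstat, medv=lm.medv)
sapply(fits, function(f) summary(f)$coefficients[4, 4]) # row 4 = the cubic term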
Part 3
library(stats)
library(ggplot2)
homes <- read.csv("homes2004.csv")
# conditional vs marginal value
par(mfrow=c(1,2)) # 1 row, 2 columns of plots
hist(homes$VALUE, col="Aquamarine", xlab="home value", main="")
plot(VALUE ~ factor(BATHS),
col=rainbow(8), data=homes[homes$BATHS<8,],
xlab="number of bathrooms", ylab="home value")
# create a var for downpayment being greater than 20%
homes$gt20dwn <-
factor(0.2<(homes$LPRICE-homes$AMMORT)/homes$LPRICE)
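# A quick tabulation of the new variable (illustrative check):
table(homes$gt20dwn)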
# re-factor to drop levels that do not appear, and omit rows with missing values
homes$STATE <- factor(homes$STATE)
homes$FRSTHO<- factor(homes$FRSTHO)
homes <- na.omit(homes)
# some quick plots. Do more to build your intuition!
par(mfrow=c(1,2))
plot(VALUE ~ STATE, data=homes,
col=rainbow(nlevels(homes$STATE)),
ylim=c(0,10^6), cex.axis=.65)
plot(gt20dwn ~ FRSTHO, data=homes,
col=c(1,3), xlab="Buyer's First Home?",
ylab="Greater than 20% down")
## My own model using ggplot: relationship between household income (assuming 'ZINC2') and log home value (assuming 'LPRICE')
ggplot(homes, aes(x = ZINC2, y = LPRICE)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "purple") +
labs(title = "Square Footage vs. Log Home Value", x = "Square Footage", y = "Log Home Value")
## `geom_smooth()` using formula = 'y ~ x'
## Q2
# "ECOM1Y" "EGREENY" "ELOW1Y" "ETRANSY" "ODORAY" are coefficients that are jointly significant above 10%. Also, for less than or equal to 10% it is a long list that is provided under names(pvals)[pvals<=.1]
# regress log(VALUE) on everything except AMMORT and LPRICE
pricey <- glm(log(VALUE) ~ .-AMMORT-LPRICE, data=homes)
pvals <- summary(pricey)$coef[-1,4]
names(pvals)[pvals<=.1]
## [1] "EAPTBLY" "ECOM2Y" "EJUNKY" "ESFDY"
## [5] "EABANY" "HOWHgood" "HOWNgood" "STRNAY"
## [9] "ZINC2" "PER" "ZADULT" "HHGRADBach"
## [13] "HHGRADGrad" "HHGRADHS Grad" "HHGRADNo HS" "NUNITS"
## [17] "INTW" "METROurban" "STATECO" "STATECT"
## [21] "STATEGA" "STATEIL" "STATEIN" "STATELA"
## [25] "STATEMO" "STATEOH" "STATEOK" "STATEPA"
## [29] "STATETX" "STATEWA" "BATHS" "BEDRMS"
## [33] "MATBUYY" "DWNPAYprev home" "FRSTHOY" "gt20dwnTRUE"
# Coefficients with p-values above 0.10:
names(pvals)[pvals>.1]
## [1] "ECOM1Y" "EGREENY" "ELOW1Y" "ETRANSY" "ODORAY"
## Q3:
#1st-time home buyers have a negative effect (-0.3027506) on the log-odds that the down payment exceeds 20%, and the number of bathrooms has a positive effect (0.4238279). Both p-values are small enough to reject the null hypotheses, meaning first-time-buyer status and the number of bathrooms both affect whether the down payment is greater than 20%.
#The interaction between 1st-time home buyer and bathrooms has a negative coefficient (-0.3370586) with a p-value (2.26e-08) small enough to reject the null hypothesis. The interaction term therefore weakens the bathroom effect for first-time buyers: each additional bathroom raises the odds of a down payment greater than 20% by less than it does for repeat buyers.
model3 <- glm(gt20dwn ~ FRSTHO + BATHS + FRSTHO*BATHS - AMMORT - LPRICE,
family=binomial(link = "logit"), data=homes)
summary(model3)
##
## Call:
## glm(formula = gt20dwn ~ FRSTHO + BATHS + FRSTHO * BATHS - AMMORT -
## LPRICE, family = binomial(link = "logit"), data = homes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.36524 0.06524 -20.926 < 2e-16 ***
## FRSTHOY -0.30275 0.11255 -2.690 0.00715 **
## BATHS 0.42383 0.02949 14.371 < 2e-16 ***
## FRSTHOY:BATHS -0.33706 0.06029 -5.591 2.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18873 on 15564 degrees of freedom
## Residual deviance: 17884 on 15561 degrees of freedom
## AIC: 17892
##
## Number of Fisher Scoring iterations: 4
cat("Effect of 1st-time buyer:", coef(model3)['FRSTHOY'], "\n")
## Effect of 1st-time buyer: -0.3027506
cat("Effect of # of bathrooms:", coef(model3)['BATHS'], "\n")
## Effect of # of bathrooms: 0.4238279
interaction_term_name <- paste("FRSTHOY", "BATHS", sep=":")
cat("Interaction effect:", coef(model3)[interaction_term_name], "\n")
## Interaction effect: -0.3370586
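# Since logistic-regression coefficients are log-odds, exponentiating gives odds ratios (illustrative):
exp(coef(model3)[c("FRSTHOY", "BATHS", "FRSTHOY:BATHS")])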
## Q4
# The pseudo R-squared of about 0.24 indicates a weak model: it should not be relied on for prediction as specified, or it is missing key variables that would strengthen the relationship between the independent and dependent variables.
gt100 <- which(homes$VALUE>100000)
# ybar and null deviance
source("deviance.R")
model3 <- glm(gt20dwn ~ . - AMMORT - LPRICE, data = homes[gt100, ], family = binomial())
pred <- predict(model3, homes[-gt100, ], type = "response") # out-of-sample predictions for the homes worth <= 100k
ybar <- mean(homes$gt20dwn[-gt100]==TRUE)
D0 <- deviance(y=homes$gt20dwn[-gt100], pred=rep(ybar, length(homes$gt20dwn[-gt100])), family="binomial") # null deviance on the same held-out homes
D_train <- deviance(y = homes$gt20dwn[-gt100], pred = pred, family = "binomial") # deviance of the out-of-sample predictions
print(paste("Training sample deviance:", D_train))
## [1] "Training sample deviance: 14756.3233679565"
print(paste("Null deviance:", D0))
## [1] "Null deviance: 19530.0332638968"
R_squared <- 1 - (D_train / D0)
print(paste("Pseudo R-squared:", R_squared))
## [1] "Pseudo R-squared: 0.244429173849128"
# My version (at first, the STATE factor did not work)
# Part A
library(ggplot2)
library(readr)
library(stats)
homes <- read_csv("homes2004.csv")
## Rows: 15565 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): EAPTBL, ECOM1, ECOM2, EGREEN, EJUNK, ELOW1, ESFD, ETRANS, EABAN, H...
## dbl (10): AMMORT, ZINC2, PER, ZADULT, NUNITS, INTW, LPRICE, BATHS, BEDRMS, V...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# For example, relationship between household income (assuming 'ZINC2') and log home value (assuming 'LPRICE')
ggplot(homes, aes(x = ZINC2, y = LPRICE)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "cyan") +
labs(title = "Square Footage vs. Log Home Value", x = "Square Footage", y = "Log Home Value")
## `geom_smooth()` using formula = 'y ~ x'
# Part B
# "ECOM1Y" "EGREENY" "ELOW1Y" "ETRANSY" "ODORAY" are coefficients that are jointly significant above 10%
# For less than or equal to 10% it is a long list that is provided under names(pvals)[pvals<=.1]
model1 <- lm(log(VALUE) ~ . - AMMORT - LPRICE, data = homes)
summary(model1)
##
## Call:
## lm(formula = log(VALUE) ~ . - AMMORT - LPRICE, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2738 -0.1572 0.0574 0.2756 2.4649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.159e+01 6.226e-02 186.232 < 2e-16 ***
## EAPTBLY -4.347e-02 2.346e-02 -1.853 0.06390 .
## ECOM1Y -2.568e-02 1.924e-02 -1.335 0.18202
## ECOM2Y -8.645e-02 4.805e-02 -1.799 0.07205 .
## EGREENY 9.391e-03 1.400e-02 0.671 0.50249
## EJUNKY -1.265e-01 5.105e-02 -2.478 0.01324 *
## ELOW1Y 2.870e-02 2.312e-02 1.241 0.21454
## ESFDY 2.945e-01 2.956e-02 9.963 < 2e-16 ***
## ETRANSY -1.515e-02 2.532e-02 -0.598 0.54953
## EABANY -1.621e-01 3.598e-02 -4.506 6.67e-06 ***
## HOWHgood 1.295e-01 2.632e-02 4.922 8.65e-07 ***
## HOWNgood 1.193e-01 2.190e-02 5.445 5.26e-08 ***
## ODORAY 1.026e-02 3.312e-02 0.310 0.75685
## STRNAY -3.618e-02 1.607e-02 -2.251 0.02437 *
## ZINC2 6.244e-07 5.538e-08 11.273 < 2e-16 ***
## PER 9.651e-03 6.253e-03 1.543 0.12277
## ZADULT -1.864e-02 1.088e-02 -1.714 0.08649 .
## HHGRADBach 1.321e-01 2.292e-02 5.766 8.28e-09 ***
## HHGRADGrad 1.973e-01 2.578e-02 7.652 2.09e-14 ***
## HHGRADHS Grad -6.061e-02 2.171e-02 -2.792 0.00524 **
## HHGRADNo HS -1.945e-01 3.183e-02 -6.112 1.01e-09 ***
## NUNITS -9.324e-04 5.203e-04 -1.792 0.07314 .
## INTW -4.637e-02 4.408e-03 -10.518 < 2e-16 ***
## METROurban 8.610e-02 1.807e-02 4.764 1.92e-06 ***
## STATECO -2.921e-01 2.921e-02 -10.001 < 2e-16 ***
## STATECT -3.464e-01 3.125e-02 -11.084 < 2e-16 ***
## STATEGA -6.551e-01 3.108e-02 -21.077 < 2e-16 ***
## STATEIL -8.618e-01 5.768e-02 -14.940 < 2e-16 ***
## STATEIN -7.792e-01 3.070e-02 -25.379 < 2e-16 ***
## STATELA -7.196e-01 3.688e-02 -19.511 < 2e-16 ***
## STATEMO -6.645e-01 3.343e-02 -19.875 < 2e-16 ***
## STATEOH -6.737e-01 3.269e-02 -20.610 < 2e-16 ***
## STATEOK -9.982e-01 3.281e-02 -30.425 < 2e-16 ***
## STATEPA -8.716e-01 3.389e-02 -25.722 < 2e-16 ***
## STATETX -1.049e+00 3.431e-02 -30.575 < 2e-16 ***
## STATEWA -1.228e-01 3.094e-02 -3.970 7.23e-05 ***
## BATHS 2.117e-01 1.159e-02 18.271 < 2e-16 ***
## BEDRMS 8.740e-02 1.006e-02 8.690 < 2e-16 ***
## MATBUYY -2.966e-02 1.368e-02 -2.168 0.03015 *
## DWNPAYprev home 1.209e-01 1.785e-02 6.775 1.29e-11 ***
## FRSTHOY -8.398e-02 1.724e-02 -4.870 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8171 on 15524 degrees of freedom
## Multiple R-squared: 0.3053, Adjusted R-squared: 0.3035
## F-statistic: 170.6 on 40 and 15524 DF, p-value: < 2.2e-16
pvals <- summary(model1)$coef[-1,4]
# example: those variable insignificant at alpha=0.10
names(pvals)[pvals>.1]
## [1] "ECOM1Y" "EGREENY" "ELOW1Y" "ETRANSY" "ODORAY" "PER"
names(pvals)[pvals<=.1]
## [1] "EAPTBLY" "ECOM2Y" "EJUNKY" "ESFDY"
## [5] "EABANY" "HOWHgood" "HOWNgood" "STRNAY"
## [9] "ZINC2" "ZADULT" "HHGRADBach" "HHGRADGrad"
## [13] "HHGRADHS Grad" "HHGRADNo HS" "NUNITS" "INTW"
## [17] "METROurban" "STATECO" "STATECT" "STATEGA"
## [21] "STATEIL" "STATEIN" "STATELA" "STATEMO"
## [25] "STATEOH" "STATEOK" "STATEPA" "STATETX"
## [29] "STATEWA" "BATHS" "BEDRMS" "MATBUYY"
## [33] "DWNPAYprev home" "FRSTHOY"
model1_reduced <- stepAIC(model1, direction = "backward")
## Start: AIC=-6247.13
## log(VALUE) ~ (AMMORT + EAPTBL + ECOM1 + ECOM2 + EGREEN + EJUNK +
## ELOW1 + ESFD + ETRANS + EABAN + HOWH + HOWN + ODORA + STRNA +
## ZINC2 + PER + ZADULT + HHGRAD + NUNITS + INTW + METRO + STATE +
## LPRICE + BATHS + BEDRMS + MATBUY + DWNPAY + FRSTHO) - AMMORT -
## LPRICE
##
## Df Sum of Sq RSS AIC
## - ODORA 1 0.06 10365 -6249.0
## - ETRANS 1 0.24 10365 -6248.8
## - EGREEN 1 0.30 10365 -6248.7
## - ELOW1 1 1.03 10366 -6247.6
## - ECOM1 1 1.19 10366 -6247.3
## <none> 10365 -6247.1
## - PER 1 1.59 10366 -6246.7
## - ZADULT 1 1.96 10367 -6246.2
## - NUNITS 1 2.14 10367 -6245.9
## - ECOM2 1 2.16 10367 -6245.9
## - EAPTBL 1 2.29 10367 -6245.7
## - MATBUY 1 3.14 10368 -6244.4
## - STRNA 1 3.38 10368 -6244.0
## - EJUNK 1 4.10 10369 -6243.0
## - EABAN 1 13.55 10378 -6228.8
## - METRO 1 15.15 10380 -6226.4
## - FRSTHO 1 15.84 10380 -6225.4
## - HOWH 1 16.17 10381 -6224.9
## - HOWN 1 19.80 10384 -6219.4
## - DWNPAY 1 30.64 10395 -6203.2
## - BEDRMS 1 50.42 10415 -6173.6
## - ESFD 1 66.27 10431 -6149.9
## - INTW 1 73.86 10438 -6138.6
## - ZINC2 1 84.85 10450 -6122.2
## - HHGRAD 4 190.13 10555 -5972.2
## - BATHS 1 222.87 10588 -5918.0
## - STATE 12 1493.31 11858 -4176.1
##
## Step: AIC=-6249.03
## log(VALUE) ~ EAPTBL + ECOM1 + ECOM2 + EGREEN + EJUNK + ELOW1 +
## ESFD + ETRANS + EABAN + HOWH + HOWN + STRNA + ZINC2 + PER +
## ZADULT + HHGRAD + NUNITS + INTW + METRO + STATE + BATHS +
## BEDRMS + MATBUY + DWNPAY + FRSTHO
##
## Df Sum of Sq RSS AIC
## - ETRANS 1 0.23 10365 -6250.7
## - EGREEN 1 0.31 10365 -6250.6
## - ELOW1 1 1.02 10366 -6249.5
## - ECOM1 1 1.20 10366 -6249.2
## <none> 10365 -6249.0
## - PER 1 1.60 10366 -6248.6
## - ZADULT 1 1.96 10367 -6248.1
## - ECOM2 1 2.13 10367 -6247.8
## - NUNITS 1 2.15 10367 -6247.8
## - EAPTBL 1 2.29 10367 -6247.6
## - MATBUY 1 3.15 10368 -6246.3
## - STRNA 1 3.33 10368 -6246.0
## - EJUNK 1 4.05 10369 -6245.0
## - EABAN 1 13.51 10378 -6230.8
## - METRO 1 15.23 10380 -6228.2
## - FRSTHO 1 15.84 10380 -6227.3
## - HOWH 1 16.14 10381 -6226.8
## - HOWN 1 19.73 10384 -6221.4
## - DWNPAY 1 30.65 10395 -6205.1
## - BEDRMS 1 50.40 10415 -6175.5
## - ESFD 1 66.25 10431 -6151.9
## - INTW 1 73.82 10438 -6140.6
## - ZINC2 1 84.83 10450 -6124.2
## - HHGRAD 4 190.07 10555 -5974.2
## - BATHS 1 222.81 10588 -5920.0
## - STATE 12 1493.38 11858 -4177.9
##
## Step: AIC=-6250.69
## log(VALUE) ~ EAPTBL + ECOM1 + ECOM2 + EGREEN + EJUNK + ELOW1 +
## ESFD + EABAN + HOWH + HOWN + STRNA + ZINC2 + PER + ZADULT +
## HHGRAD + NUNITS + INTW + METRO + STATE + BATHS + BEDRMS +
## MATBUY + DWNPAY + FRSTHO
##
## Df Sum of Sq RSS AIC
## - EGREEN 1 0.28 10365 -6252.3
## - ELOW1 1 1.00 10366 -6251.2
## <none> 10365 -6250.7
## - ECOM1 1 1.34 10366 -6250.7
## - PER 1 1.60 10366 -6250.3
## - ZADULT 1 1.97 10367 -6249.7
## - NUNITS 1 2.14 10367 -6249.5
## - ECOM2 1 2.37 10367 -6249.1
## - EAPTBL 1 2.40 10367 -6249.1
## - MATBUY 1 3.17 10368 -6247.9
## - STRNA 1 3.49 10368 -6247.4
## - EJUNK 1 4.06 10369 -6246.6
## - EABAN 1 13.58 10378 -6232.3
## - METRO 1 15.23 10380 -6229.8
## - FRSTHO 1 15.85 10381 -6228.9
## - HOWH 1 16.11 10381 -6228.5
## - HOWN 1 19.81 10385 -6223.0
## - DWNPAY 1 30.66 10396 -6206.7
## - BEDRMS 1 50.48 10415 -6177.1
## - ESFD 1 66.27 10431 -6153.5
## - INTW 1 73.80 10439 -6142.3
## - ZINC2 1 85.01 10450 -6125.6
## - HHGRAD 4 190.21 10555 -5975.6
## - BATHS 1 223.01 10588 -5921.3
## - STATE 12 1497.26 11862 -4174.5
##
## Step: AIC=-6252.26
## log(VALUE) ~ EAPTBL + ECOM1 + ECOM2 + EJUNK + ELOW1 + ESFD +
## EABAN + HOWH + HOWN + STRNA + ZINC2 + PER + ZADULT + HHGRAD +
## NUNITS + INTW + METRO + STATE + BATHS + BEDRMS + MATBUY +
## DWNPAY + FRSTHO
##
## Df Sum of Sq RSS AIC
## - ELOW1 1 1.02 10366 -6252.7
## - ECOM1 1 1.31 10366 -6252.3
## <none> 10365 -6252.3
## - PER 1 1.63 10367 -6251.8
## - ZADULT 1 1.98 10367 -6251.3
## - NUNITS 1 2.16 10367 -6251.0
## - ECOM2 1 2.32 10368 -6250.8
## - EAPTBL 1 2.43 10368 -6250.6
## - MATBUY 1 3.16 10368 -6249.5
## - STRNA 1 3.49 10369 -6249.0
## - EJUNK 1 4.00 10369 -6248.3
## - EABAN 1 13.52 10379 -6234.0
## - METRO 1 14.97 10380 -6231.8
## - FRSTHO 1 15.96 10381 -6230.3
## - HOWH 1 16.07 10381 -6230.2
## - HOWN 1 19.96 10385 -6224.3
## - DWNPAY 1 30.72 10396 -6208.2
## - BEDRMS 1 50.37 10416 -6178.8
## - ESFD 1 66.01 10431 -6155.5
## - INTW 1 73.89 10439 -6143.7
## - ZINC2 1 85.32 10450 -6126.7
## - HHGRAD 4 190.61 10556 -5976.6
## - BATHS 1 224.39 10590 -5920.9
## - STATE 12 1501.24 11866 -4170.9
##
## Step: AIC=-6252.73
## log(VALUE) ~ EAPTBL + ECOM1 + ECOM2 + EJUNK + ESFD + EABAN +
## HOWH + HOWN + STRNA + ZINC2 + PER + ZADULT + HHGRAD + NUNITS +
## INTW + METRO + STATE + BATHS + BEDRMS + MATBUY + DWNPAY +
## FRSTHO
##
## Df Sum of Sq RSS AIC
## - ECOM1 1 1.22 10367 -6252.9
## <none> 10366 -6252.7
## - PER 1 1.53 10368 -6252.4
## - EAPTBL 1 1.84 10368 -6252.0
## - ZADULT 1 1.98 10368 -6251.8
## - NUNITS 1 2.23 10368 -6251.4
## - ECOM2 1 2.27 10368 -6251.3
## - MATBUY 1 3.03 10369 -6250.2
## - STRNA 1 3.44 10370 -6249.6
## - EJUNK 1 4.01 10370 -6248.7
## - EABAN 1 13.50 10380 -6234.5
## - METRO 1 15.16 10381 -6232.0
## - FRSTHO 1 15.94 10382 -6230.8
## - HOWH 1 16.10 10382 -6230.6
## - HOWN 1 20.01 10386 -6224.7
## - DWNPAY 1 30.72 10397 -6208.7
## - BEDRMS 1 49.43 10416 -6180.7
## - ESFD 1 64.99 10431 -6157.4
## - INTW 1 74.44 10441 -6143.4
## - ZINC2 1 85.20 10451 -6127.3
## - HHGRAD 4 192.13 10558 -5974.9
## - BATHS 1 225.52 10592 -5919.7
## - STATE 12 1511.06 11877 -4158.7
##
## Step: AIC=-6252.9
## log(VALUE) ~ EAPTBL + ECOM2 + EJUNK + ESFD + EABAN + HOWH + HOWN +
## STRNA + ZINC2 + PER + ZADULT + HHGRAD + NUNITS + INTW + METRO +
## STATE + BATHS + BEDRMS + MATBUY + DWNPAY + FRSTHO
##
## Df Sum of Sq RSS AIC
## <none> 10367 -6252.9
## - PER 1 1.50 10369 -6252.6
## - ZADULT 1 1.98 10369 -6251.9
## - NUNITS 1 2.37 10370 -6251.3
## - EAPTBL 1 2.74 10370 -6250.8
## - ECOM2 1 2.83 10370 -6250.6
## - MATBUY 1 3.09 10370 -6250.3
## - EJUNK 1 4.07 10372 -6248.8
## - STRNA 1 4.07 10372 -6248.8
## - EABAN 1 13.77 10381 -6234.2
## - METRO 1 14.68 10382 -6232.9
## - FRSTHO 1 16.13 10384 -6230.7
## - HOWH 1 16.31 10384 -6230.4
## - HOWN 1 20.04 10388 -6224.8
## - DWNPAY 1 31.00 10398 -6208.4
## - BEDRMS 1 49.74 10417 -6180.4
## - ESFD 1 64.67 10432 -6158.1
## - INTW 1 74.80 10442 -6143.0
## - ZINC2 1 84.90 10452 -6128.0
## - HHGRAD 4 192.34 10560 -5974.8
## - BATHS 1 226.81 10594 -5918.1
## - STATE 12 1510.08 11878 -4160.4
summary(model1_reduced)
##
## Call:
## lm(formula = log(VALUE) ~ EAPTBL + ECOM2 + EJUNK + ESFD + EABAN +
## HOWH + HOWN + STRNA + ZINC2 + PER + ZADULT + HHGRAD + NUNITS +
## INTW + METRO + STATE + BATHS + BEDRMS + MATBUY + DWNPAY +
## FRSTHO, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2761 -0.1553 0.0577 0.2755 2.4690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.161e+01 6.135e-02 189.165 < 2e-16 ***
## EAPTBLY -4.475e-02 2.210e-02 -2.025 0.042896 *
## ECOM2Y -9.674e-02 4.698e-02 -2.059 0.039502 *
## EJUNKY -1.256e-01 5.088e-02 -2.468 0.013590 *
## ESFDY 2.879e-01 2.925e-02 9.842 < 2e-16 ***
## EABANY -1.632e-01 3.593e-02 -4.542 5.61e-06 ***
## HOWHgood 1.300e-01 2.630e-02 4.942 7.81e-07 ***
## HOWNgood 1.198e-01 2.186e-02 5.479 4.34e-08 ***
## STRNAY -3.908e-02 1.583e-02 -2.469 0.013556 *
## ZINC2 6.241e-07 5.535e-08 11.277 < 2e-16 ***
## PER 9.374e-03 6.246e-03 1.501 0.133407
## ZADULT -1.871e-02 1.087e-02 -1.721 0.085353 .
## HHGRADBach 1.329e-01 2.290e-02 5.802 6.69e-09 ***
## HHGRADGrad 1.980e-01 2.576e-02 7.687 1.60e-14 ***
## HHGRADHS Grad -6.085e-02 2.170e-02 -2.804 0.005052 **
## HHGRADNo HS -1.955e-01 3.181e-02 -6.145 8.18e-10 ***
## NUNITS -9.785e-04 5.197e-04 -1.883 0.059743 .
## INTW -4.663e-02 4.405e-03 -10.585 < 2e-16 ***
## METROurban 8.410e-02 1.793e-02 4.689 2.77e-06 ***
## STATECO -2.882e-01 2.904e-02 -9.922 < 2e-16 ***
## STATECT -3.452e-01 3.121e-02 -11.060 < 2e-16 ***
## STATEGA -6.545e-01 3.097e-02 -21.130 < 2e-16 ***
## STATEIL -8.626e-01 5.763e-02 -14.968 < 2e-16 ***
## STATEIN -7.790e-01 3.067e-02 -25.398 < 2e-16 ***
## STATELA -7.215e-01 3.677e-02 -19.621 < 2e-16 ***
## STATEMO -6.644e-01 3.342e-02 -19.880 < 2e-16 ***
## STATEOH -6.757e-01 3.261e-02 -20.718 < 2e-16 ***
## STATEOK -9.981e-01 3.279e-02 -30.440 < 2e-16 ***
## STATEPA -8.685e-01 3.383e-02 -25.676 < 2e-16 ***
## STATETX -1.050e+00 3.427e-02 -30.632 < 2e-16 ***
## STATEWA -1.201e-01 3.091e-02 -3.888 0.000102 ***
## BATHS 2.130e-01 1.156e-02 18.432 < 2e-16 ***
## BEDRMS 8.633e-02 1.000e-02 8.632 < 2e-16 ***
## MATBUYY -2.940e-02 1.367e-02 -2.151 0.031471 *
## DWNPAYprev home 1.216e-01 1.784e-02 6.814 9.84e-12 ***
## FRSTHOY -8.471e-02 1.723e-02 -4.915 8.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8171 on 15529 degrees of freedom
## Multiple R-squared: 0.3051, Adjusted R-squared: 0.3035
## F-statistic: 194.8 on 35 and 15529 DF, p-value: < 2.2e-16
# Compare R-squared values
cat("Full model R-squared: ", summary(model1)$r.squared, "\n")
## Full model R-squared: 0.305303
cat("Reduced model R-squared: ", summary(model1_reduced)$r.squared, "\n")
## Reduced model R-squared: 0.3051141
# Part C
# 1st-time home buyers have a negative effect on the log-odds that the down payment exceeds 20%, while the number of bathrooms has a positive effect. Both p-values let us reject the null hypotheses, meaning first-time-buyer status and bathrooms both affect whether the down payment is greater than 20%.
homes$gt20dwn <- (homes$LPRICE - homes$AMMORT) / homes$LPRICE > 0.2
model2 <- glm(gt20dwn ~ . - AMMORT - LPRICE, family = binomial, data = homes)
summary(model2)
##
## Call:
## glm(formula = gt20dwn ~ . - AMMORT - LPRICE, family = binomial,
## data = homes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.293e+00 1.831e-01 -7.065 1.61e-12 ***
## EAPTBLY 1.505e-02 7.025e-02 0.214 0.830424
## ECOM1Y -1.619e-01 5.809e-02 -2.787 0.005325 **
## ECOM2Y -3.131e-01 1.600e-01 -1.957 0.050385 .
## EGREENY -1.569e-03 3.984e-02 -0.039 0.968582
## EJUNKY -9.697e-03 1.608e-01 -0.060 0.951913
## ELOW1Y 4.635e-02 6.627e-02 0.699 0.484292
## ESFDY -2.670e-01 8.276e-02 -3.227 0.001252 **
## ETRANSY -6.270e-02 7.616e-02 -0.823 0.410416
## EABANY -8.187e-02 1.157e-01 -0.708 0.479137
## HOWHgood -1.372e-01 7.947e-02 -1.726 0.084398 .
## HOWNgood 1.597e-01 6.730e-02 2.372 0.017669 *
## ODORAY 1.041e-01 9.811e-02 1.061 0.288528
## STRNAY -9.644e-02 4.737e-02 -2.036 0.041783 *
## ZINC2 -1.277e-07 1.874e-07 -0.682 0.495530
## PER -1.253e-01 1.855e-02 -6.752 1.46e-11 ***
## ZADULT 1.944e-02 3.188e-02 0.610 0.542024
## HHGRADBach 1.797e-01 6.596e-02 2.725 0.006431 **
## HHGRADGrad 2.729e-01 7.288e-02 3.745 0.000181 ***
## HHGRADHS Grad -2.064e-02 6.376e-02 -0.324 0.746192
## HHGRADNo HS -7.246e-02 9.845e-02 -0.736 0.461720
## NUNITS 2.377e-03 1.428e-03 1.664 0.096100 .
## INTW -6.327e-02 1.372e-02 -4.613 3.98e-06 ***
## METROurban -8.000e-02 5.389e-02 -1.485 0.137672
## STATECO -2.513e-02 8.491e-02 -0.296 0.767257
## STATECT 7.870e-01 8.825e-02 8.918 < 2e-16 ***
## STATEGA -2.223e-01 9.455e-02 -2.351 0.018716 *
## STATEIL 5.870e-01 1.635e-01 3.590 0.000330 ***
## STATEIN 2.431e-01 9.352e-02 2.599 0.009336 **
## STATELA 5.932e-01 1.077e-01 5.506 3.67e-08 ***
## STATEMO 5.309e-01 9.730e-02 5.456 4.87e-08 ***
## STATEOH 7.642e-01 9.480e-02 8.061 7.59e-16 ***
## STATEOK 1.291e-01 1.027e-01 1.257 0.208850
## STATEPA 6.011e-01 1.007e-01 5.968 2.40e-09 ***
## STATETX 2.935e-01 1.073e-01 2.736 0.006221 **
## STATEWA 1.525e-01 8.819e-02 1.730 0.083717 .
## BATHS 2.445e-01 3.419e-02 7.152 8.57e-13 ***
## BEDRMS -2.086e-02 2.908e-02 -0.717 0.473120
## MATBUYY 2.587e-01 3.927e-02 6.588 4.45e-11 ***
## DWNPAYprev home 7.417e-01 4.857e-02 15.272 < 2e-16 ***
## VALUE 1.489e-06 1.452e-07 10.256 < 2e-16 ***
## FRSTHOY -3.700e-01 5.170e-02 -7.156 8.29e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18873 on 15564 degrees of freedom
## Residual deviance: 16969 on 15523 degrees of freedom
## AIC: 17053
##
## Number of Fisher Scoring iterations: 4
# The interaction between 1st-time home buyer and bathrooms has a p-value (2.26e-08) small enough to reject the null hypothesis. The interaction term is negative: it reduces the log-odds that the down payment exceeds 20%, so the bathroom effect is weaker for first-time buyers.
model2_interaction <- glm(gt20dwn ~ FRSTHO + BATHS + FRSTHO:BATHS - AMMORT - LPRICE, family = binomial, data = homes)
summary(model2_interaction)
##
## Call:
## glm(formula = gt20dwn ~ FRSTHO + BATHS + FRSTHO:BATHS - AMMORT -
## LPRICE, family = binomial, data = homes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.36524 0.06524 -20.926 < 2e-16 ***
## FRSTHOY -0.30275 0.11255 -2.690 0.00715 **
## BATHS 0.42383 0.02949 14.371 < 2e-16 ***
## FRSTHOY:BATHS -0.33706 0.06029 -5.591 2.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18873 on 15564 degrees of freedom
## Residual deviance: 17884 on 15561 degrees of freedom
## AIC: 17892
##
## Number of Fisher Scoring iterations: 4
# Part D
# Using (ii), the R-squared for the >100k homes (0.126) and the <=100k homes (0.083) show that the predictors have only a weak linear relationship with gt20dwn in both segments, meaning that the model as specified does not have strong predictive power, or that important variables are missing or incorrectly modeled.
# Using (iii), evaluating the model fit on the >100k homes against the <=100k homes gives a pseudo R-squared of 0.0335. The model generalizes poorly, so it should not be relied on for prediction, or it is missing key variables that would strengthen the relationship between the independent and dependent variables.
homes_gt100k <- subset(homes, VALUE > 100000)
homes_lt100k <- subset(homes, VALUE <= 100000)
model3 <- glm(gt20dwn ~ . - AMMORT - LPRICE, family = binomial(link = "logit"), data = homes_gt100k)
summary(model3)
##
## Call:
## glm(formula = gt20dwn ~ . - AMMORT - LPRICE, family = binomial(link = "logit"),
## data = homes_gt100k)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.493e+00 2.132e-01 -7.004 2.49e-12 ***
## EAPTBLY 9.426e-02 8.202e-02 1.149 0.250490
## ECOM1Y -9.180e-02 6.695e-02 -1.371 0.170325
## ECOM2Y -4.082e-01 2.070e-01 -1.972 0.048625 *
## EGREENY 4.097e-03 4.390e-02 0.093 0.925629
## EJUNKY -2.323e-01 2.179e-01 -1.066 0.286425
## ELOW1Y 2.175e-02 7.399e-02 0.294 0.768779
## ESFDY -3.579e-01 9.838e-02 -3.638 0.000275 ***
## ETRANSY -1.061e-01 8.863e-02 -1.198 0.231104
## EABANY -1.862e-01 1.607e-01 -1.158 0.246670
## HOWHgood -8.069e-02 9.756e-02 -0.827 0.408204
## HOWNgood 1.945e-01 8.100e-02 2.401 0.016361 *
## ODORAY 1.280e-01 1.161e-01 1.103 0.270076
## STRNAY -1.155e-01 5.439e-02 -2.124 0.033712 *
## ZINC2 -2.881e-07 2.189e-07 -1.316 0.188130
## PER -1.262e-01 2.041e-02 -6.181 6.35e-10 ***
## ZADULT 1.013e-02 3.569e-02 0.284 0.776468
## HHGRADBach 2.070e-01 7.311e-02 2.831 0.004633 **
## HHGRADGrad 2.899e-01 7.992e-02 3.627 0.000287 ***
## HHGRADHS Grad 2.554e-02 7.253e-02 0.352 0.724727
## HHGRADNo HS -1.754e-01 1.246e-01 -1.407 0.159381
## NUNITS 1.274e-03 1.790e-03 0.712 0.476706
## INTW -6.134e-02 1.713e-02 -3.582 0.000341 ***
## METROurban -1.251e-01 6.294e-02 -1.987 0.046903 *
## STATECO 4.055e-02 8.739e-02 0.464 0.642607
## STATECT 7.897e-01 9.210e-02 8.574 < 2e-16 ***
## STATEGA -2.147e-01 9.940e-02 -2.160 0.030785 *
## STATEIL 5.716e-01 1.971e-01 2.900 0.003734 **
## STATEIN 3.524e-01 1.016e-01 3.469 0.000522 ***
## STATELA 6.957e-01 1.214e-01 5.730 1.00e-08 ***
## STATEMO 6.164e-01 1.052e-01 5.862 4.57e-09 ***
## STATEOH 7.987e-01 1.024e-01 7.798 6.27e-15 ***
## STATEOK 2.693e-01 1.218e-01 2.211 0.027019 *
## STATEPA 7.626e-01 1.169e-01 6.523 6.88e-11 ***
## STATETX 4.604e-01 1.297e-01 3.549 0.000386 ***
## STATEWA 1.765e-01 9.095e-02 1.941 0.052288 .
## BATHS 2.324e-01 3.750e-02 6.198 5.71e-10 ***
## BEDRMS -1.466e-02 3.233e-02 -0.453 0.650264
## MATBUYY 3.422e-01 4.335e-02 7.893 2.94e-15 ***
## DWNPAYprev home 7.763e-01 5.318e-02 14.597 < 2e-16 ***
## VALUE 1.764e-06 1.608e-07 10.970 < 2e-16 ***
## FRSTHOY -3.457e-01 5.950e-02 -5.810 6.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15210 on 12143 degrees of freedom
## Residual deviance: 13620 on 12102 degrees of freedom
## AIC: 13704
##
## Number of Fisher Scoring iterations: 4
# (ii) To compare the two segments I refit with lm() instead of glm(), since glm() does not report an R-squared.
model3.5 <- lm(gt20dwn ~ . - AMMORT - LPRICE, data = homes_gt100k)
model4 <- lm(gt20dwn ~ . - AMMORT - LPRICE, data = homes_lt100k)
cat("Gt 100k R-squred: ", summary(model3.5)$r.squared, "\n")
## Gt 100k R-squred: 0.1261319
cat("Lt 100k R-squared: ", summary(model4)$r.squared, "\n")
## Lt 100k R-squared: 0.08317762
# (iii) Pseudo R-squared via the GLM: use the model fit on the >100k homes (model3) to predict for the <=100k homes, then compute the out-of-sample R-squared.
predictions_lt100k <- predict(model3, newdata = homes_lt100k, type = "response")
R2_lt100k <- R2(homes_lt100k$gt20dwn, predictions_lt100k, family="binomial")
print(R2_lt100k)
## [1] 0.03347367
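# For reference, a minimal hand-rolled version of the same statistic, assuming deviance.R defines R2 as 1 - dev(y, pred) / dev(y, ybar) with the usual binomial deviance:
bin_dev <- function(y, p) -2*sum(y*log(p) + (1 - y)*log(1 - p))
ybar_lt <- mean(homes_lt100k$gt20dwn)
1 - bin_dev(homes_lt100k$gt20dwn, predictions_lt100k) /
    bin_dev(homes_lt100k$gt20dwn, rep(ybar_lt, nrow(homes_lt100k)))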