Part 1: Problem 1 Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model. a: A p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. If the p-value is small, we call it significant and reject the null hypothesis. In the given linear model, each null hypothesis is that there is no relationship between a predictor and the response; a small (significant) p-value means we reject the null and conclude there is a relationship between that predictor and the response. b: In Table 3.4, the p-values of TV, radio, and newspaper are <0.0001, <0.0001, and 0.8599. The p-values for TV and radio are significant, so their null hypotheses can be rejected: there is strong evidence that TV and radio advertising spending affect sales. However, the p-value for newspaper is not significant, so we cannot reject its null hypothesis: there is no evidence that newspaper spending affects sales in the presence of TV and radio spending.
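A minimal sketch to reproduce Table 3.4, assuming the Advertising.csv file from the ISLR website sits in the working directory:
adv <- read.csv("Advertising.csv")
ad_fit <- lm(sales ~ TV + radio + newspaper, data = adv)
summary(ad_fit)$coefficients # TV and radio p-values are near zero; newspaper's is about 0.86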
Problem 3 Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Level (1 for College and 0 for High School), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get β0 = 50, β1 = 20, β2 = 0.07, β3 = 35, β4 = 0.01, β5 = −10. a: (iii) is true: for a fixed IQ and GPA, high school graduates earn more on average than college graduates once GPA exceeds 3.5, since the level effect is 35 − 10·GPA. b: 137.1k is the predicted salary of a college graduate with IQ = 110 and GPA = 4.0:
Iq = 110
Gpa = 4
Level = 1 # college graduate
Salary = 50 + 20*Gpa + 0.07*Iq + 35*Level + 0.01*(Gpa*Iq) - 10*(Gpa*Level)
Salary
## [1] 137.1
c: False. The size of a coefficient by itself says nothing about the evidence for an interaction effect, because it depends on the scale of the variables. For the college graduate with an IQ of 110 and a GPA of 4.0, the interaction term adds 0.01 × 4 × 110 = 4.4k to the predicted salary, which is not negligible; in any case, we would have to examine the p-value (or standard error) of the coefficient to determine whether the interaction is statistically significant.
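As a quick check of part (a): at a fixed IQ, the college-minus-high-school salary gap implied by the model is 35 − 10·GPA, so it flips sign at a GPA of 3.5.
gpa <- c(3.0, 3.5, 4.0)
gap <- 35 - 10*gpa # college minus high school, in thousands of dollars
rbind(gpa, gap) # positive below 3.5, zero at 3.5, negative above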
Problem 4 I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = β0 + β1X + β2X² + β3X³ + ε. a: For training data, RSS decreases as we increase model complexity. The extra polynomial terms allow a closer fit (more degrees of freedom) to the training data, so I would expect the training RSS for cubic regression to be lower than for simple linear regression. b: On test data, the overfitted cubic regression will produce predictions far from the true values, whereas linear regression, which is close to the true function, will perform better. Hence, test RSS will be lower for linear regression. c: Cubic regression will fit non-linear data better, so its training RSS will again be lower. d: We cannot predict whether test RSS will be lower for cubic or linear regression, since we have no clue what the true function is like. However, we can compute the test RSS for both models: if the cubic's test RSS is lower than the linear model's, we can say the true relationship is likely more non-linear. A quick simulation below illustrates parts (a) and (b).
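The following is a small illustrative simulation of my own (with a made-up linear truth), not part of the exercise: the cubic fit always achieves lower training RSS, but typically loses on test RSS when the truth is linear.
set.seed(42) # illustrative simulation with an assumed linear truth: y = 1 + 2x + noise
train = data.frame(x = rnorm(100)); train$y = 1 + 2*train$x + rnorm(100)
test = data.frame(x = rnorm(100)); test$y = 1 + 2*test$x + rnorm(100)
lin = lm(y ~ x, data = train)
cub = lm(y ~ poly(x, 3), data = train)
c(lin = sum(resid(lin)^2), cub = sum(resid(cub)^2)) # (a) training RSS: cubic always <= linear
c(lin = sum((test$y - predict(lin, test))^2),
  cub = sum((test$y - predict(cub, test))^2)) # (b) test RSS: linear usually wins here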
Part 2: Problem 8
# Part a
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
fix(Auto)
data(Auto)
mpg_pwr = lm(mpg~horsepower, data = Auto)
summary(mpg_pwr)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
# (i)There is strong evidence of a relationship between mpg and horsepower as the p-value for horsepower's coefficient is close to zero.
# (ii) The R^2 statistic is 0.61, meaning about 60% of the variance in mpg is explained by horsepower in this model. To gauge the residual error relative to the response, divide the RSE (4.906) by the mean of mpg (about 23.4), which gives roughly a 21% error rate (a quick computation follows the interval output below). Overall, the relationship between mpg and horsepower is reasonably strong.
# (iii) mpg has a negative linear relationship with horsepower: for every unit increase in horsepower, mpg falls by 0.158.
# (iv) y_hat = 39.936 - (0.158 x 98) = 24.47 mpg, matching the predict() output of 24.467 below. The point prediction alone tells us nothing about its uncertainty, so we also compute the prediction and confidence intervals:
predict(mpg_pwr, data.frame(horsepower = c(98)), interval = 'prediction')
## fit lwr upr
## 1 24.46708 14.8094 34.12476
predict(mpg_pwr, data.frame(horsepower = c(98)), interval = 'confidence')
## fit lwr upr
## 1 24.46708 23.97308 24.96108
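# A quick check of the error rate from (ii): the RSE relative to the mean response (illustrative computation):
sigma(mpg_pwr) / mean(Auto$mpg) # about 0.21, i.e. a ~21% error rate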
# Part B
plot(mpg~horsepower,main= "Scatter plot of mpg vs. horsepower", data=Auto)
abline(mpg_pwr, lwd =3, col ="light blue")
# Part C
#The residuals v fitted chart shows a slight u-shaped pattern, and this indicates non-linearity in the data.
#The residuals v leverage chart shows that some observations have high leverage.
#The scale-location chart shows some possible outliers. We can confirm by using studentized residuals to find observations with values greater than 3:
par(mfrow=c(2,2))
plot(mpg_pwr)
print(rstudent(mpg_pwr)[which(rstudent(mpg_pwr)>3)])
## 323 330
## 3.508709 3.149671
Problem 9
# Part A
head(Auto)
pairs(Auto[,1:8])
# Part B
cor(Auto[,1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
# Part C
mpg_all<- lm(mpg~.-name, data = Auto)
summary(mpg_all)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
# Part D
# The Residuals v Fitted values chart indicates some non-linearity in the data, and so polynomial regression could provide a better fit than simple linear regression.
# The standardized residuals vs leverage chart shows that observation 14 has high leverage.
# Some observations are potential outliers:
par(mfrow=c(2,2))
plot(mpg_all)
rstudent(mpg_all)[which(rstudent(mpg_all)>3)]
## 245 323 326 327
## 3.390068 4.029537 3.494823 3.690246
# Additionally, we can check for collinearity between the predictors by finding `VIF` values. The results show some variables with VIF higher than 5 or 10, which according to the text is problematic. Collinearity inflates the variance of the coefficient estimates, and hence increases their standard errors. One illustrative follow-up appears after the VIF output.
library(car)
## Loading required package: carData
vif(mpg_all)
## cylinders displacement horsepower weight acceleration year
## 10.737535 21.836792 9.943693 10.831260 2.625806 1.244952
## origin
## 1.772386
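# One simple, illustrative remedy: drop the highest-VIF predictor (displacement) and re-check the VIF values.
mpg_reduced = lm(mpg~.-name-displacement, data=Auto)
vif(mpg_reduced)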
# Part E
# cylinders:year and horsepower:acceleration are statistically significant. The $R^2$ metric has increased from 0.82 to 0.85.
mpg_interaction = lm(mpg~.-name+ year:cylinders+ acceleration:horsepower, data=Auto)
summary(mpg_interaction)
##
## Call:
## lm(formula = mpg ~ . - name + year:cylinders + acceleration:horsepower,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0203 -1.7318 -0.1015 1.5639 11.9559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.862e+01 1.212e+01 -7.311 1.57e-12 ***
## cylinders 1.181e+01 2.349e+00 5.029 7.61e-07 ***
## displacement -8.775e-03 7.916e-03 -1.108 0.26838
## horsepower 9.151e-02 2.502e-02 3.658 0.00029 ***
## weight -4.269e-03 6.967e-04 -6.127 2.23e-09 ***
## acceleration 8.439e-01 1.590e-01 5.306 1.90e-07 ***
## year 1.521e+00 1.590e-01 9.569 < 2e-16 ***
## origin 1.070e+00 2.609e-01 4.102 5.00e-05 ***
## cylinders:year -1.520e-01 3.017e-02 -5.037 7.32e-07 ***
## horsepower:acceleration -9.837e-03 1.778e-03 -5.533 5.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.049 on 382 degrees of freedom
## Multiple R-squared: 0.8509, Adjusted R-squared: 0.8474
## F-statistic: 242.2 on 9 and 382 DF, p-value: < 2.2e-16
# Part F
# Both mpg_poly and mpg_poly2 models see an increase in $R^2$ metric from 0.82 to 0.86. There are differences in the statistical significance of some predictors between the transformed models and the multiple regression model in part C.
mpg_poly = lm(mpg~.-name + year:cylinders + I(horsepower^2)
+ I(acceleration^2),data=Auto)
summary(mpg_poly)
##
## Call:
## lm(formula = mpg ~ . - name + year:cylinders + I(horsepower^2) +
## I(acceleration^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9986 -1.5525 -0.1194 1.4348 11.7722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.394e+01 1.430e+01 -2.374 0.018075 *
## cylinders 8.481e+00 2.340e+00 3.624 0.000329 ***
## displacement -1.106e-02 7.330e-03 -1.509 0.132051
## horsepower -2.720e-01 3.531e-02 -7.703 1.16e-13 ***
## weight -3.338e-03 6.812e-04 -4.900 1.42e-06 ***
## acceleration -1.378e+00 5.421e-01 -2.542 0.011403 *
## year 1.272e+00 1.594e-01 7.982 1.71e-14 ***
## origin 1.027e+00 2.493e-01 4.121 4.63e-05 ***
## I(horsepower^2) 8.040e-04 1.140e-04 7.054 8.22e-12 ***
## I(acceleration^2) 3.351e-02 1.578e-02 2.124 0.034303 *
## cylinders:year -1.056e-01 3.023e-02 -3.493 0.000533 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.935 on 381 degrees of freedom
## Multiple R-squared: 0.8622, Adjusted R-squared: 0.8586
## F-statistic: 238.3 on 10 and 381 DF, p-value: < 2.2e-16
mpg_poly2 = lm(mpg~.-name-cylinders + log(weight) + log(acceleration) +
sqrt(displacement),data=Auto)
summary(mpg_poly2)
##
## Call:
## lm(formula = mpg ~ . - name - cylinders + log(weight) + log(acceleration) +
## sqrt(displacement), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2104 -1.6665 -0.1085 1.5977 12.5231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 290.113140 49.599189 5.849 1.06e-08 ***
## displacement 0.032477 0.028592 1.136 0.256720
## horsepower -0.043782 0.013065 -3.351 0.000885 ***
## weight 0.006923 0.002226 3.110 0.002013 **
## acceleration 2.001283 0.468834 4.269 2.48e-05 ***
## year 0.801707 0.044950 17.836 < 2e-16 ***
## origin 0.502973 0.262462 1.916 0.056064 .
## log(weight) -34.848861 6.862200 -5.078 5.96e-07 ***
## log(acceleration) -33.152402 7.671145 -4.322 1.98e-05 ***
## sqrt(displacement) -1.043089 0.820337 -1.272 0.204311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.914 on 382 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8607
## F-statistic: 269.3 on 9 and 382 DF, p-value: < 2.2e-16
Problem 13
set.seed(1)
# Part A
x = rnorm(100, mean = 0, sd=1)
# Part B
eps = rnorm(100, mean = 0, sd=0.25)
# Part C
y = -1 +(0.5*x)+eps
# Length of y=100, beta0=-1, beta1=0.5
# Part D
plot(y~x, main="Scatter plot of x vs y", col='red')
# Part E
lm.fit = lm(y~x)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm.fit, lwd = 1, col='cyan')
# Part F
#A positive linear relationship exists between x and y, with added variance introduced by the error term.
#The estimates are beta0_hat = -1.009 and beta1_hat = 0.500, very close to the true values beta0 = -1 and beta1 = 0.5. This is further confirmed by the regression and population lines lying nearly on top of each other. The p-values are near zero and the F-statistic is large, so the null hypothesis can be rejected.
abline(a = -1, b=0.5,lwd = 1, col='black')
legend('bottomright',bty ='n', legend=c("Least Squares Line", "Population Line"),col=c('cyan','black'), lty = c(1,1))
# Part G
# The quadratic term does not improve the model fit. The F-statistic is reduced, and the p-value for the squared term is higher than 0.05 and shows that it isn't statistically significant.
lm.fit2 = lm(y~x+I(x^2))
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4913 -0.1563 -0.0322 0.1451 0.5675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.98582 0.02941 -33.516 <2e-16 ***
## x 0.50429 0.02700 18.680 <2e-16 ***
## I(x^2) -0.02973 0.02119 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2395 on 97 degrees of freedom
## Multiple R-squared: 0.7828, Adjusted R-squared: 0.7784
## F-statistic: 174.8 on 2 and 97 DF, p-value: < 2.2e-16
# Part H
# The points lie closer to the fitted line, the RSE is lower, and R2 and the F-statistic are much higher than in the original model (eps sd = 0.25). The least squares and population lines are very close to each other as the noise is reduced, and the linear trend is much more apparent.
eps = rnorm(100, mean = 0, sd=sqrt(0.01))
y = -1+(0.5*x)+eps
plot(y~x, main='Reduced Noise', col='black')
lm.fit2 = lm(y~x)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.291411 -0.048230 -0.004533 0.064924 0.264157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99726 0.01047 -95.25 <2e-16 ***
## x 0.50212 0.01163 43.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1039 on 98 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9495
## F-statistic: 1864 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm.fit2, lwd=1, col ="red")
abline(a=-1,b=0.5, lwd=1, col="pink")
legend('bottomright', bty='n', legend=c('Least Squares Line', 'Population Line'), col=c('pink','red'), lty = c(1, 1))
# Part I
# The points are more spread out, so the linear trend is harder to see. The RSE is higher, and R2 and the F-statistic are lower than in the original model (eps sd = 0.25).
eps = rnorm(100, mean=0, sd=sqrt(0.5625))
y = -1 +(0.5*x) + eps
plot(y~x, main='Increased Noise', col='light blue')
lm.fit4 = lm(y~x)
summary(lm.fit4)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88719 -0.40893 -0.02832 0.50466 1.40916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.95675 0.07521 -12.721 < 2e-16 ***
## x 0.45824 0.08354 5.485 3.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7466 on 98 degrees of freedom
## Multiple R-squared: 0.2349, Adjusted R-squared: 0.2271
## F-statistic: 30.09 on 1 and 98 DF, p-value: 3.227e-07
abline(lm.fit4, lwd=1, col ="blue")
abline(a=-1,b=0.5, lwd=1, col="red")
legend('bottomright', bty='n', legend=c('Least Squares Line', 'Population Line'), col=c('blue','red'), lty = c(1, 1))
# Part J
# Confidence interval values are narrowest for the lowest variance model, widest for the highest variance model and in-between these two for the original model.
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -1.0575402 -0.9613061
## x 0.4462897 0.5531801
confint(lm.fit2)
## 2.5 % 97.5 %
## (Intercept) -1.0180413 -0.9764850
## x 0.4790377 0.5251957
confint(lm.fit4)
## 2.5 % 97.5 %
## (Intercept) -1.1060050 -0.8074970
## x 0.2924541 0.6240169
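# The slope-interval widths make the comparison explicit (illustrative):
diff(confint(lm.fit)[2,]) # original noise (sd = 0.25)
diff(confint(lm.fit2)[2,]) # reduced noise (sd = 0.1)
diff(confint(lm.fit4)[2,]) # increased noise (sd = 0.75)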
Problem 15
library(MASS)
# Part A
# Regressing crim on each predictor individually (below), we can conclude that every predictor except chas has a significant relationship with crim.
data <- Boston
print(dim(data))
## [1] 506 14
head(data)
simple_coeff <- list()
predictor <- setdiff(names(data), 'crim')
for (col in predictor) {
formula <- as.formula(paste('crim ~', col))
result <- lm(formula, data = data)
coeff <- summary(result)$coefficients[2, 1] # slope estimate for the predictor
pvalue <- summary(result)$coefficients[2, 4] # p-value for the slope
cat(sprintf('%s %.4f %.4f\n', col, coeff, pvalue))
simple_coeff[[col]] <- coeff
plot(data[[col]], data$crim, main = paste('crim vs', col), xlab = col, ylab = 'crim')
abline(result, col = "yellow")}
## zn -0.0739 0.0000
## indus 0.5098 0.0000
## chas -1.8928 0.2094
## nox 31.2485 0.0000
## rm -2.6841 0.0000
## age 0.1078 0.0000
## dis -1.5509 0.0000
## rad 0.6179 0.0000
## tax 0.0297 0.0000
## ptratio 1.1520 0.0000
## black -0.0363 0.0000
## lstat 0.5488 0.0000
## medv -0.3632 0.0000
# Another Way
attach(Boston)
lm.zn = lm(crim~zn)
summary(lm.zn) # yes
##
## Call:
## lm(formula = crim ~ zn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## zn -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
lm.medv = lm(crim~medv)
lm.lstat = lm(crim~lstat)
lm.black = lm(crim~black)
lm.ptratio = lm(crim~ptratio)
lm.tax = lm(crim~tax)
lm.rad = lm(crim~rad)
lm.dis = lm(crim~dis)
lm.age = lm(crim~age)
lm.rm = lm(crim~rm)
lm.nox = lm(crim~nox)
lm.chas = lm(crim~chas)
lm.indus = lm(crim~indus)
# Part B
# Although the estimated coefficients are all nonzero, inspecting the p-values shows we can reject the null hypothesis only for zn, dis, rad, black, and medv.
lm.all = lm(crim~., data=Boston)
summary(lm.all)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
# Part C
# The coefficient for nox is approximately 31 in the univariate model but approximately -10 in the multiple regression model.
x = c(coefficients(lm.zn)[2],
coefficients(lm.indus)[2],
coefficients(lm.chas)[2],
coefficients(lm.nox)[2],
coefficients(lm.rm)[2],
coefficients(lm.age)[2],
coefficients(lm.dis)[2],
coefficients(lm.rad)[2],
coefficients(lm.tax)[2],
coefficients(lm.ptratio)[2],
coefficients(lm.black)[2],
coefficients(lm.lstat)[2],
coefficients(lm.medv)[2])
y = coefficients(lm.all)[2:14]
plot(x, y)
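# Labelling the points makes the nox coefficient easy to spot (illustrative addition):
text(x, y, labels = names(y), pos = 3, cex = 0.7)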
# Part D
# There is evidence of a non-linear association between the response and every predictor except black and chas.
# You could run summary() on each of the fits below, but I left the output out to keep the document short; a compact way to extract the cubic p-values appears after the fits.
lm.zn = lm(crim~poly(zn,3))
# summary(lm.zn)
lm.indus = lm(crim~poly(indus,3))
# summary(lm.indus)
# lm.chas = lm(crim~poly(chas,3)) : qualitative predictor
lm.nox = lm(crim~poly(nox,3))
# summary(lm.nox)
lm.rm = lm(crim~poly(rm,3))
# summary(lm.rm)
lm.age = lm(crim~poly(age,3))
# summary(lm.age)
lm.dis = lm(crim~poly(dis,3))
# summary(lm.dis)
lm.rad = lm(crim~poly(rad,3))
# summary(lm.rad)
lm.tax = lm(crim~poly(tax,3))
# summary(lm.tax)
lm.ptratio = lm(crim~poly(ptratio,3))
# summary(lm.crim)
lm.black = lm(crim~poly(black,3))
# summary(lm.black)
lm.lstat = lm(crim~poly(lstat,3))
# summary(lm.lstat)
lm.medv = lm(crim~poly(medv,3))
# summary(lm.medv)
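# Rather than printing every summary, we can pull the cubic term's p-value out of each fit (illustrative):
fits = list(zn=lm.zn, indus=lm.indus, nox=lm.nox, rm=lm.rm, age=lm.age, dis=lm.dis,
            rad=lm.rad, tax=lm.tax, ptratio=lm.ptratio, black=lm.black, lstat=lm.lstat, medv=lm.medv)
sapply(fits, function(f) summary(f)$coefficients[4, 4]) # row 4 = the cubic term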
Part 3
library(stats)
library(ggplot2)
homes <- read.csv("homes2004.csv")
# conditional vs marginal value
par(mfrow=c(1,2)) # 1 row, 2 columns of plots
hist(homes$VALUE, col="Aquamarine", xlab="home value", main="")
plot(VALUE ~ factor(BATHS),
col=rainbow(8), data=homes[homes$BATHS<8,],
xlab="number of bathrooms", ylab="home value")
# create a var for downpayment being greater than 20%
homes$gt20dwn <-
factor(0.2<(homes$LPRICE-homes$AMMORT)/homes$LPRICE)
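# A quick tabulation of the new variable (illustrative check):
table(homes$gt20dwn)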
# re-factor to drop levels that do not appear, and omit rows with missing values
homes$STATE <- factor(homes$STATE)
homes$FRSTHO<- factor(homes$FRSTHO)
homes <- na.omit(homes)
# some quick plots. Do more to build your intuition!
par(mfrow=c(1,2))
plot(VALUE ~ STATE, data=homes,
col=rainbow(nlevels(homes$STATE)),
ylim=c(0,10^6), cex.axis=.65)
plot(gt20dwn ~ FRSTHO, data=homes,
col=c(1,3), xlab="Buyer's First Home?",
ylab="Greater than 20% down")
## My own model using ggplot: relationship between household income (assuming 'ZINC2') and log home value (assuming 'LPRICE')
ggplot(homes, aes(x = ZINC2, y = LPRICE)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "purple") +
labs(title = "Square Footage vs. Log Home Value", x = "Square Footage", y = "Log Home Value")
## `geom_smooth()` using formula = 'y ~ x'
## Q2
# "ECOM1Y" "EGREENY" "ELOW1Y" "ETRANSY" "ODORAY" are coefficients that are jointly significant above 10%. Also, for less than or equal to 10% it is a long list that is provided under names(pvals)[pvals<=.1]
# regress log(VALUE) on everything except AMMORT and LPRICE
pricey <- glm(log(VALUE) ~ .-AMMORT-LPRICE, data=homes)
pvals <- summary(pricey)$coef[-1,4]
names(pvals)[pvals<=.1]
## [1] "EAPTBLY" "ECOM2Y" "EJUNKY" "ESFDY"
## [5] "EABANY" "HOWHgood" "HOWNgood" "STRNAY"
## [9] "ZINC2" "PER" "ZADULT" "HHGRADBach"
## [13] "HHGRADGrad" "HHGRADHS Grad" "HHGRADNo HS" "NUNITS"
## [17] "INTW" "METROurban" "STATECO" "STATECT"
## [21] "STATEGA" "STATEIL" "STATEIN" "STATELA"
## [25] "STATEMO" "STATEOH" "STATEOK" "STATEPA"
## [29] "STATETX" "STATEWA" "BATHS" "BEDRMS"
## [33] "MATBUYY" "DWNPAYprev home" "FRSTHOY" "gt20dwnTRUE"
# Coefficients with p-values above 0.10:
names(pvals)[pvals>.1]
## [1] "ECOM1Y" "EGREENY" "ELOW1Y" "ETRANSY" "ODORAY"
## Q3:
#1st-time home buyers have a negative effect (-0.3027506) on the log-odds that the down payment exceeds 20%, and the number of bathrooms has a positive effect (0.4238279). Both p-values are small enough to reject the null hypotheses, meaning first-time-buyer status and the number of bathrooms both affect whether the down payment is greater than 20%.
#The interaction between 1st-time home buyer and bathrooms has a negative coefficient (-0.3370586) with a p-value (2.26e-08) small enough to reject the null hypothesis. The interaction term therefore weakens the bathroom effect for first-time buyers: each additional bathroom raises the odds of a down payment greater than 20% by less than it does for repeat buyers.
model3 <- glm(gt20dwn ~ FRSTHO + BATHS + FRSTHO*BATHS - AMMORT - LPRICE,
family=binomial(link = "logit"), data=homes)
summary(model3)
##
## Call:
## glm(formula = gt20dwn ~ FRSTHO + BATHS + FRSTHO * BATHS - AMMORT -
## LPRICE, family = binomial(link = "logit"), data = homes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.36524 0.06524 -20.926 < 2e-16 ***
## FRSTHOY -0.30275 0.11255 -2.690 0.00715 **
## BATHS 0.42383 0.02949 14.371 < 2e-16 ***
## FRSTHOY:BATHS -0.33706 0.06029 -5.591 2.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18873 on 15564 degrees of freedom
## Residual deviance: 17884 on 15561 degrees of freedom
## AIC: 17892
##
## Number of Fisher Scoring iterations: 4
cat("Effect of 1st-time buyer:", coef(model3)['FRSTHOY'], "\n")
## Effect of 1st-time buyer: -0.3027506
cat("Effect of # of bathrooms:", coef(model3)['BATHS'], "\n")
## Effect of # of bathrooms: 0.4238279
interaction_term_name <- paste("FRSTHOY", "BATHS", sep=":")
cat("Interaction effect:", coef(model3)[interaction_term_name], "\n")
## Interaction effect: -0.3370586
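# Since logistic-regression coefficients are log-odds, exponentiating gives odds ratios (illustrative):
exp(coef(model3)[c("FRSTHOY", "BATHS", "FRSTHOY:BATHS")])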
## Q4
# The pseudo R-squared of about 0.24 indicates a weak model: it should not be relied on for prediction as specified, or it is missing key variables that would strengthen the relationship between the independent and dependent variables.
gt100 <- which(homes$VALUE>100000)
# ybar and null deviance
source("deviance.R")
model3 <- glm(gt20dwn ~ . - AMMORT - LPRICE, data = homes[gt100, ], family = binomial())
pred <- predict(model3, homes[-gt100, ], type = "response") # out-of-sample predictions for the homes worth <= 100k
ybar <- mean(homes$gt20dwn[-gt100]==TRUE)
D0 <- deviance(y=homes$gt20dwn[-gt100], pred=rep(ybar, length(homes$gt20dwn[-gt100])), family="binomial") # null deviance on the same held-out homes
D_train <- deviance(y = homes$gt20dwn[-gt100], pred = pred, family = "binomial") # deviance of the out-of-sample predictions
print(paste("Training sample deviance:", D_train))
## [1] "Training sample deviance: 14756.3233679565"
print(paste("Null deviance:", D0))
## [1] "Null deviance: 19530.0332638968"
R_squared <- 1 - (D_train / D0)
print(paste("Pseudo R-squared:", R_squared))
## [1] "Pseudo R-squared: 0.244429173849128"
# My version (at first, the STATE factor did not work)
# Part A
library(ggplot2)
library(readr)
library(stats)
homes <- read_csv("homes2004.csv")
## Rows: 15565 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): EAPTBL, ECOM1, ECOM2, EGREEN, EJUNK, ELOW1, ESFD, ETRANS, EABAN, H...
## dbl (10): AMMORT, ZINC2, PER, ZADULT, NUNITS, INTW, LPRICE, BATHS, BEDRMS, V...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# For example, relationship between household income (assuming 'ZINC2') and log home value (assuming 'LPRICE')
ggplot(homes, aes(x = ZINC2, y = LPRICE)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "cyan") +
labs(title = "Square Footage vs. Log Home Value", x = "Square Footage", y = "Log Home Value")
## `geom_smooth()` using formula = 'y ~ x'
# Part B
# "ECOM1Y" "EGREENY" "ELOW1Y" "ETRANSY" "ODORAY" are coefficients that are jointly significant above 10%
# For less than or equal to 10% it is a long list that is provided under names(pvals)[pvals<=.1]
model1 <- lm(log(VALUE) ~ . - AMMORT - LPRICE, data = homes)
summary(model1)
##
## Call:
## lm(formula = log(VALUE) ~ . - AMMORT - LPRICE, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2738 -0.1572 0.0574 0.2756 2.4649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.159e+01 6.226e-02 186.232 < 2e-16 ***
## EAPTBLY -4.347e-02 2.346e-02 -1.853 0.06390 .
## ECOM1Y -2.568e-02 1.924e-02 -1.335 0.18202
## ECOM2Y -8.645e-02 4.805e-02 -1.799 0.07205 .
## EGREENY 9.391e-03 1.400e-02 0.671 0.50249
## EJUNKY -1.265e-01 5.105e-02 -2.478 0.01324 *
## ELOW1Y 2.870e-02 2.312e-02 1.241 0.21454
## ESFDY 2.945e-01 2.956e-02 9.963 < 2e-16 ***
## ETRANSY -1.515e-02 2.532e-02 -0.598 0.54953
## EABANY -1.621e-01 3.598e-02 -4.506 6.67e-06 ***
## HOWHgood 1.295e-01 2.632e-02 4.922 8.65e-07 ***
## HOWNgood 1.193e-01 2.190e-02 5.445 5.26e-08 ***
## ODORAY 1.026e-02 3.312e-02 0.310 0.75685
## STRNAY -3.618e-02 1.607e-02 -2.251 0.02437 *
## ZINC2 6.244e-07 5.538e-08 11.273 < 2e-16 ***
## PER 9.651e-03 6.253e-03 1.543 0.12277
## ZADULT -1.864e-02 1.088e-02 -1.714 0.08649 .
## HHGRADBach 1.321e-01 2.292e-02 5.766 8.28e-09 ***
## HHGRADGrad 1.973e-01 2.578e-02 7.652 2.09e-14 ***
## HHGRADHS Grad -6.061e-02 2.171e-02 -2.792 0.00524 **
## HHGRADNo HS -1.945e-01 3.183e-02 -6.112 1.01e-09 ***
## NUNITS -9.324e-04 5.203e-04 -1.792 0.07314 .
## INTW -4.637e-02 4.408e-03 -10.518 < 2e-16 ***
## METROurban 8.610e-02 1.807e-02 4.764 1.92e-06 ***
## STATECO -2.921e-01 2.921e-02 -10.001 < 2e-16 ***
## STATECT -3.464e-01 3.125e-02 -11.084 < 2e-16 ***
## STATEGA -6.551e-01 3.108e-02 -21.077 < 2e-16 ***
## STATEIL -8.618e-01 5.768e-02 -14.940 < 2e-16 ***
## STATEIN -7.792e-01 3.070e-02 -25.379 < 2e-16 ***
## STATELA -7.196e-01 3.688e-02 -19.511 < 2e-16 ***
## STATEMO -6.645e-01 3.343e-02 -19.875 < 2e-16 ***
## STATEOH -6.737e-01 3.269e-02 -20.610 < 2e-16 ***
## STATEOK -9.982e-01 3.281e-02 -30.425 < 2e-16 ***
## STATEPA -8.716e-01 3.389e-02 -25.722 < 2e-16 ***
## STATETX -1.049e+00 3.431e-02 -30.575 < 2e-16 ***
## STATEWA -1.228e-01 3.094e-02 -3.970 7.23e-05 ***
## BATHS 2.117e-01 1.159e-02 18.271 < 2e-16 ***
## BEDRMS 8.740e-02 1.006e-02 8.690 < 2e-16 ***
## MATBUYY -2.966e-02 1.368e-02 -2.168 0.03015 *
## DWNPAYprev home 1.209e-01 1.785e-02 6.775 1.29e-11 ***
## FRSTHOY -8.398e-02 1.724e-02 -4.870 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8171 on 15524 degrees of freedom
## Multiple R-squared: 0.3053, Adjusted R-squared: 0.3035
## F-statistic: 170.6 on 40 and 15524 DF, p-value: < 2.2e-16
pvals <- summary(model1)$coef[-1,4]
# example: those variable insignificant at alpha=0.10
names(pvals)[pvals>.1]
## [1] "ECOM1Y" "EGREENY" "ELOW1Y" "ETRANSY" "ODORAY" "PER"
names(pvals)[pvals<=.1]
## [1] "EAPTBLY" "ECOM2Y" "EJUNKY" "ESFDY"
## [5] "EABANY" "HOWHgood" "HOWNgood" "STRNAY"
## [9] "ZINC2" "ZADULT" "HHGRADBach" "HHGRADGrad"
## [13] "HHGRADHS Grad" "HHGRADNo HS" "NUNITS" "INTW"
## [17] "METROurban" "STATECO" "STATECT" "STATEGA"
## [21] "STATEIL" "STATEIN" "STATELA" "STATEMO"
## [25] "STATEOH" "STATEOK" "STATEPA" "STATETX"
## [29] "STATEWA" "BATHS" "BEDRMS" "MATBUYY"
## [33] "DWNPAYprev home" "FRSTHOY"
model1_reduced <- stepAIC(model1, direction = "backward")
## Start: AIC=-6247.13
## log(VALUE) ~ (AMMORT + EAPTBL + ECOM1 + ECOM2 + EGREEN + EJUNK +
## ELOW1 + ESFD + ETRANS + EABAN + HOWH + HOWN + ODORA + STRNA +
## ZINC2 + PER + ZADULT + HHGRAD + NUNITS + INTW + METRO + STATE +
## LPRICE + BATHS + BEDRMS + MATBUY + DWNPAY + FRSTHO) - AMMORT -
## LPRICE
##
## Df Sum of Sq RSS AIC
## - ODORA 1 0.06 10365 -6249.0
## - ETRANS 1 0.24 10365 -6248.8
## - EGREEN 1 0.30 10365 -6248.7
## - ELOW1 1 1.03 10366 -6247.6
## - ECOM1 1 1.19 10366 -6247.3
## <none> 10365 -6247.1
## - PER 1 1.59 10366 -6246.7
## - ZADULT 1 1.96 10367 -6246.2
## - NUNITS 1 2.14 10367 -6245.9
## - ECOM2 1 2.16 10367 -6245.9
## - EAPTBL 1 2.29 10367 -6245.7
## - MATBUY 1 3.14 10368 -6244.4
## - STRNA 1 3.38 10368 -6244.0
## - EJUNK 1 4.10 10369 -6243.0
## - EABAN 1 13.55 10378 -6228.8
## - METRO 1 15.15 10380 -6226.4
## - FRSTHO 1 15.84 10380 -6225.4
## - HOWH 1 16.17 10381 -6224.9
## - HOWN 1 19.80 10384 -6219.4
## - DWNPAY 1 30.64 10395 -6203.2
## - BEDRMS 1 50.42 10415 -6173.6
## - ESFD 1 66.27 10431 -6149.9
## - INTW 1 73.86 10438 -6138.6
## - ZINC2 1 84.85 10450 -6122.2
## - HHGRAD 4 190.13 10555 -5972.2
## - BATHS 1 222.87 10588 -5918.0
## - STATE 12 1493.31 11858 -4176.1
##
## Step: AIC=-6249.03
## log(VALUE) ~ EAPTBL + ECOM1 + ECOM2 + EGREEN + EJUNK + ELOW1 +
## ESFD + ETRANS + EABAN + HOWH + HOWN + STRNA + ZINC2 + PER +
## ZADULT + HHGRAD + NUNITS + INTW + METRO + STATE + BATHS +
## BEDRMS + MATBUY + DWNPAY + FRSTHO
##
## Df Sum of Sq RSS AIC
## - ETRANS 1 0.23 10365 -6250.7
## - EGREEN 1 0.31 10365 -6250.6
## - ELOW1 1 1.02 10366 -6249.5
## - ECOM1 1 1.20 10366 -6249.2
## <none> 10365 -6249.0
## - PER 1 1.60 10366 -6248.6
## - ZADULT 1 1.96 10367 -6248.1
## - ECOM2 1 2.13 10367 -6247.8
## - NUNITS 1 2.15 10367 -6247.8
## - EAPTBL 1 2.29 10367 -6247.6
## - MATBUY 1 3.15 10368 -6246.3
## - STRNA 1 3.33 10368 -6246.0
## - EJUNK 1 4.05 10369 -6245.0
## - EABAN 1 13.51 10378 -6230.8
## - METRO 1 15.23 10380 -6228.2
## - FRSTHO 1 15.84 10380 -6227.3
## - HOWH 1 16.14 10381 -6226.8
## - HOWN 1 19.73 10384 -6221.4
## - DWNPAY 1 30.65 10395 -6205.1
## - BEDRMS 1 50.40 10415 -6175.5
## - ESFD 1 66.25 10431 -6151.9
## - INTW 1 73.82 10438 -6140.6
## - ZINC2 1 84.83 10450 -6124.2
## - HHGRAD 4 190.07 10555 -5974.2
## - BATHS 1 222.81 10588 -5920.0
## - STATE 12 1493.38 11858 -4177.9
##
## Step: AIC=-6250.69
## log(VALUE) ~ EAPTBL + ECOM1 + ECOM2 + EGREEN + EJUNK + ELOW1 +
## ESFD + EABAN + HOWH + HOWN + STRNA + ZINC2 + PER + ZADULT +
## HHGRAD + NUNITS + INTW + METRO + STATE + BATHS + BEDRMS +
## MATBUY + DWNPAY + FRSTHO
##
## Df Sum of Sq RSS AIC
## - EGREEN 1 0.28 10365 -6252.3
## - ELOW1 1 1.00 10366 -6251.2
## <none> 10365 -6250.7
## - ECOM1 1 1.34 10366 -6250.7
## - PER 1 1.60 10366 -6250.3
## - ZADULT 1 1.97 10367 -6249.7
## - NUNITS 1 2.14 10367 -6249.5
## - ECOM2 1 2.37 10367 -6249.1
## - EAPTBL 1 2.40 10367 -6249.1
## - MATBUY 1 3.17 10368 -6247.9
## - STRNA 1 3.49 10368 -6247.4
## - EJUNK 1 4.06 10369 -6246.6
## - EABAN 1 13.58 10378 -6232.3
## - METRO 1 15.23 10380 -6229.8
## - FRSTHO 1 15.85 10381 -6228.9
## - HOWH 1 16.11 10381 -6228.5
## - HOWN 1 19.81 10385 -6223.0
## - DWNPAY 1 30.66 10396 -6206.7
## - BEDRMS 1 50.48 10415 -6177.1
## - ESFD 1 66.27 10431 -6153.5
## - INTW 1 73.80 10439 -6142.3
## - ZINC2 1 85.01 10450 -6125.6
## - HHGRAD 4 190.21 10555 -5975.6
## - BATHS 1 223.01 10588 -5921.3
## - STATE 12 1497.26 11862 -4174.5
##
## Step: AIC=-6252.26
## log(VALUE) ~ EAPTBL + ECOM1 + ECOM2 + EJUNK + ELOW1 + ESFD +
## EABAN + HOWH + HOWN + STRNA + ZINC2 + PER + ZADULT + HHGRAD +
## NUNITS + INTW + METRO + STATE + BATHS + BEDRMS + MATBUY +
## DWNPAY + FRSTHO
##
## Df Sum of Sq RSS AIC
## - ELOW1 1 1.02 10366 -6252.7
## - ECOM1 1 1.31 10366 -6252.3
## <none> 10365 -6252.3
## - PER 1 1.63 10367 -6251.8
## - ZADULT 1 1.98 10367 -6251.3
## - NUNITS 1 2.16 10367 -6251.0
## - ECOM2 1 2.32 10368 -6250.8
## - EAPTBL 1 2.43 10368 -6250.6
## - MATBUY 1 3.16 10368 -6249.5
## - STRNA 1 3.49 10369 -6249.0
## - EJUNK 1 4.00 10369 -6248.3
## - EABAN 1 13.52 10379 -6234.0
## - METRO 1 14.97 10380 -6231.8
## - FRSTHO 1 15.96 10381 -6230.3
## - HOWH 1 16.07 10381 -6230.2
## - HOWN 1 19.96 10385 -6224.3
## - DWNPAY 1 30.72 10396 -6208.2
## - BEDRMS 1 50.37 10416 -6178.8
## - ESFD 1 66.01 10431 -6155.5
## - INTW 1 73.89 10439 -6143.7
## - ZINC2 1 85.32 10450 -6126.7
## - HHGRAD 4 190.61 10556 -5976.6
## - BATHS 1 224.39 10590 -5920.9
## - STATE 12 1501.24 11866 -4170.9
##
## Step: AIC=-6252.73
## log(VALUE) ~ EAPTBL + ECOM1 + ECOM2 + EJUNK + ESFD + EABAN +
## HOWH + HOWN + STRNA + ZINC2 + PER + ZADULT + HHGRAD + NUNITS +
## INTW + METRO + STATE + BATHS + BEDRMS + MATBUY + DWNPAY +
## FRSTHO
##
## Df Sum of Sq RSS AIC
## - ECOM1 1 1.22 10367 -6252.9
## <none> 10366 -6252.7
## - PER 1 1.53 10368 -6252.4
## - EAPTBL 1 1.84 10368 -6252.0
## - ZADULT 1 1.98 10368 -6251.8
## - NUNITS 1 2.23 10368 -6251.4
## - ECOM2 1 2.27 10368 -6251.3
## - MATBUY 1 3.03 10369 -6250.2
## - STRNA 1 3.44 10370 -6249.6
## - EJUNK 1 4.01 10370 -6248.7
## - EABAN 1 13.50 10380 -6234.5
## - METRO 1 15.16 10381 -6232.0
## - FRSTHO 1 15.94 10382 -6230.8
## - HOWH 1 16.10 10382 -6230.6
## - HOWN 1 20.01 10386 -6224.7
## - DWNPAY 1 30.72 10397 -6208.7
## - BEDRMS 1 49.43 10416 -6180.7
## - ESFD 1 64.99 10431 -6157.4
## - INTW 1 74.44 10441 -6143.4
## - ZINC2 1 85.20 10451 -6127.3
## - HHGRAD 4 192.13 10558 -5974.9
## - BATHS 1 225.52 10592 -5919.7
## - STATE 12 1511.06 11877 -4158.7
##
## Step: AIC=-6252.9
## log(VALUE) ~ EAPTBL + ECOM2 + EJUNK + ESFD + EABAN + HOWH + HOWN +
## STRNA + ZINC2 + PER + ZADULT + HHGRAD + NUNITS + INTW + METRO +
## STATE + BATHS + BEDRMS + MATBUY + DWNPAY + FRSTHO
##
## Df Sum of Sq RSS AIC
## <none> 10367 -6252.9
## - PER 1 1.50 10369 -6252.6
## - ZADULT 1 1.98 10369 -6251.9
## - NUNITS 1 2.37 10370 -6251.3
## - EAPTBL 1 2.74 10370 -6250.8
## - ECOM2 1 2.83 10370 -6250.6
## - MATBUY 1 3.09 10370 -6250.3
## - EJUNK 1 4.07 10372 -6248.8
## - STRNA 1 4.07 10372 -6248.8
## - EABAN 1 13.77 10381 -6234.2
## - METRO 1 14.68 10382 -6232.9
## - FRSTHO 1 16.13 10384 -6230.7
## - HOWH 1 16.31 10384 -6230.4
## - HOWN 1 20.04 10388 -6224.8
## - DWNPAY 1 31.00 10398 -6208.4
## - BEDRMS 1 49.74 10417 -6180.4
## - ESFD 1 64.67 10432 -6158.1
## - INTW 1 74.80 10442 -6143.0
## - ZINC2 1 84.90 10452 -6128.0
## - HHGRAD 4 192.34 10560 -5974.8
## - BATHS 1 226.81 10594 -5918.1
## - STATE 12 1510.08 11878 -4160.4
summary(model1_reduced)
##
## Call:
## lm(formula = log(VALUE) ~ EAPTBL + ECOM2 + EJUNK + ESFD + EABAN +
## HOWH + HOWN + STRNA + ZINC2 + PER + ZADULT + HHGRAD + NUNITS +
## INTW + METRO + STATE + BATHS + BEDRMS + MATBUY + DWNPAY +
## FRSTHO, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2761 -0.1553 0.0577 0.2755 2.4690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.161e+01 6.135e-02 189.165 < 2e-16 ***
## EAPTBLY -4.475e-02 2.210e-02 -2.025 0.042896 *
## ECOM2Y -9.674e-02 4.698e-02 -2.059 0.039502 *
## EJUNKY -1.256e-01 5.088e-02 -2.468 0.013590 *
## ESFDY 2.879e-01 2.925e-02 9.842 < 2e-16 ***
## EABANY -1.632e-01 3.593e-02 -4.542 5.61e-06 ***
## HOWHgood 1.300e-01 2.630e-02 4.942 7.81e-07 ***
## HOWNgood 1.198e-01 2.186e-02 5.479 4.34e-08 ***
## STRNAY -3.908e-02 1.583e-02 -2.469 0.013556 *
## ZINC2 6.241e-07 5.535e-08 11.277 < 2e-16 ***
## PER 9.374e-03 6.246e-03 1.501 0.133407
## ZADULT -1.871e-02 1.087e-02 -1.721 0.085353 .
## HHGRADBach 1.329e-01 2.290e-02 5.802 6.69e-09 ***
## HHGRADGrad 1.980e-01 2.576e-02 7.687 1.60e-14 ***
## HHGRADHS Grad -6.085e-02 2.170e-02 -2.804 0.005052 **
## HHGRADNo HS -1.955e-01 3.181e-02 -6.145 8.18e-10 ***
## NUNITS -9.785e-04 5.197e-04 -1.883 0.059743 .
## INTW -4.663e-02 4.405e-03 -10.585 < 2e-16 ***
## METROurban 8.410e-02 1.793e-02 4.689 2.77e-06 ***
## STATECO -2.882e-01 2.904e-02 -9.922 < 2e-16 ***
## STATECT -3.452e-01 3.121e-02 -11.060 < 2e-16 ***
## STATEGA -6.545e-01 3.097e-02 -21.130 < 2e-16 ***
## STATEIL -8.626e-01 5.763e-02 -14.968 < 2e-16 ***
## STATEIN -7.790e-01 3.067e-02 -25.398 < 2e-16 ***
## STATELA -7.215e-01 3.677e-02 -19.621 < 2e-16 ***
## STATEMO -6.644e-01 3.342e-02 -19.880 < 2e-16 ***
## STATEOH -6.757e-01 3.261e-02 -20.718 < 2e-16 ***
## STATEOK -9.981e-01 3.279e-02 -30.440 < 2e-16 ***
## STATEPA -8.685e-01 3.383e-02 -25.676 < 2e-16 ***
## STATETX -1.050e+00 3.427e-02 -30.632 < 2e-16 ***
## STATEWA -1.201e-01 3.091e-02 -3.888 0.000102 ***
## BATHS 2.130e-01 1.156e-02 18.432 < 2e-16 ***
## BEDRMS 8.633e-02 1.000e-02 8.632 < 2e-16 ***
## MATBUYY -2.940e-02 1.367e-02 -2.151 0.031471 *
## DWNPAYprev home 1.216e-01 1.784e-02 6.814 9.84e-12 ***
## FRSTHOY -8.471e-02 1.723e-02 -4.915 8.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8171 on 15529 degrees of freedom
## Multiple R-squared: 0.3051, Adjusted R-squared: 0.3035
## F-statistic: 194.8 on 35 and 15529 DF, p-value: < 2.2e-16
# Compare R-squared values
cat("Full model R-squared: ", summary(model1)$r.squared, "\n")
## Full model R-squared: 0.305303
cat("Reduced model R-squared: ", summary(model1_reduced)$r.squared, "\n")
## Reduced model R-squared: 0.3051141
# Part C
# 1st-time home buyers have a negative effect on the log-odds that the down payment exceeds 20%, while the number of bathrooms has a positive effect. Both p-values let us reject the null hypotheses, meaning first-time-buyer status and bathrooms both affect whether the down payment is greater than 20%.
homes$gt20dwn <- (homes$LPRICE - homes$AMMORT) / homes$LPRICE > 0.2
model2 <- glm(gt20dwn ~ . - AMMORT - LPRICE, family = binomial, data = homes)
summary(model2)
##
## Call:
## glm(formula = gt20dwn ~ . - AMMORT - LPRICE, family = binomial,
## data = homes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.293e+00 1.831e-01 -7.065 1.61e-12 ***
## EAPTBLY 1.505e-02 7.025e-02 0.214 0.830424
## ECOM1Y -1.619e-01 5.809e-02 -2.787 0.005325 **
## ECOM2Y -3.131e-01 1.600e-01 -1.957 0.050385 .
## EGREENY -1.569e-03 3.984e-02 -0.039 0.968582
## EJUNKY -9.697e-03 1.608e-01 -0.060 0.951913
## ELOW1Y 4.635e-02 6.627e-02 0.699 0.484292
## ESFDY -2.670e-01 8.276e-02 -3.227 0.001252 **
## ETRANSY -6.270e-02 7.616e-02 -0.823 0.410416
## EABANY -8.187e-02 1.157e-01 -0.708 0.479137
## HOWHgood -1.372e-01 7.947e-02 -1.726 0.084398 .
## HOWNgood 1.597e-01 6.730e-02 2.372 0.017669 *
## ODORAY 1.041e-01 9.811e-02 1.061 0.288528
## STRNAY -9.644e-02 4.737e-02 -2.036 0.041783 *
## ZINC2 -1.277e-07 1.874e-07 -0.682 0.495530
## PER -1.253e-01 1.855e-02 -6.752 1.46e-11 ***
## ZADULT 1.944e-02 3.188e-02 0.610 0.542024
## HHGRADBach 1.797e-01 6.596e-02 2.725 0.006431 **
## HHGRADGrad 2.729e-01 7.288e-02 3.745 0.000181 ***
## HHGRADHS Grad -2.064e-02 6.376e-02 -0.324 0.746192
## HHGRADNo HS -7.246e-02 9.845e-02 -0.736 0.461720
## NUNITS 2.377e-03 1.428e-03 1.664 0.096100 .
## INTW -6.327e-02 1.372e-02 -4.613 3.98e-06 ***
## METROurban -8.000e-02 5.389e-02 -1.485 0.137672
## STATECO -2.513e-02 8.491e-02 -0.296 0.767257
## STATECT 7.870e-01 8.825e-02 8.918 < 2e-16 ***
## STATEGA -2.223e-01 9.455e-02 -2.351 0.018716 *
## STATEIL 5.870e-01 1.635e-01 3.590 0.000330 ***
## STATEIN 2.431e-01 9.352e-02 2.599 0.009336 **
## STATELA 5.932e-01 1.077e-01 5.506 3.67e-08 ***
## STATEMO 5.309e-01 9.730e-02 5.456 4.87e-08 ***
## STATEOH 7.642e-01 9.480e-02 8.061 7.59e-16 ***
## STATEOK 1.291e-01 1.027e-01 1.257 0.208850
## STATEPA 6.011e-01 1.007e-01 5.968 2.40e-09 ***
## STATETX 2.935e-01 1.073e-01 2.736 0.006221 **
## STATEWA 1.525e-01 8.819e-02 1.730 0.083717 .
## BATHS 2.445e-01 3.419e-02 7.152 8.57e-13 ***
## BEDRMS -2.086e-02 2.908e-02 -0.717 0.473120
## MATBUYY 2.587e-01 3.927e-02 6.588 4.45e-11 ***
## DWNPAYprev home 7.417e-01 4.857e-02 15.272 < 2e-16 ***
## VALUE 1.489e-06 1.452e-07 10.256 < 2e-16 ***
## FRSTHOY -3.700e-01 5.170e-02 -7.156 8.29e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18873 on 15564 degrees of freedom
## Residual deviance: 16969 on 15523 degrees of freedom
## AIC: 17053
##
## Number of Fisher Scoring iterations: 4
# The interaction between 1st-time home buyer and bathrooms has a p-value (2.26e-08) small enough to reject the null hypothesis. The interaction term is negative: it reduces the log-odds that the down payment exceeds 20%, so the bathroom effect is weaker for first-time buyers.
model2_interaction <- glm(gt20dwn ~ FRSTHO + BATHS + FRSTHO:BATHS - AMMORT - LPRICE, family = binomial, data = homes)
summary(model2_interaction)
##
## Call:
## glm(formula = gt20dwn ~ FRSTHO + BATHS + FRSTHO:BATHS - AMMORT -
## LPRICE, family = binomial, data = homes)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.36524 0.06524 -20.926 < 2e-16 ***
## FRSTHOY -0.30275 0.11255 -2.690 0.00715 **
## BATHS 0.42383 0.02949 14.371 < 2e-16 ***
## FRSTHOY:BATHS -0.33706 0.06029 -5.591 2.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18873 on 15564 degrees of freedom
## Residual deviance: 17884 on 15561 degrees of freedom
## AIC: 17892
##
## Number of Fisher Scoring iterations: 4
# Part D
# Using (ii), the R-squared for the >100k homes (0.126) and the <=100k homes (0.083) show that the predictors have only a weak linear relationship with gt20dwn in both segments, meaning that the model as specified does not have strong predictive power, or that important variables are missing or incorrectly modeled.
# Using (iii), evaluating the model fit on the >100k homes against the <=100k homes gives a pseudo R-squared of 0.0335. The model generalizes poorly, so it should not be relied on for prediction, or it is missing key variables that would strengthen the relationship between the independent and dependent variables.
homes_gt100k <- subset(homes, VALUE > 100000)
homes_lt100k <- subset(homes, VALUE <= 100000)
model3 <- glm(gt20dwn ~ . - AMMORT - LPRICE, family = binomial(link = "logit"), data = homes_gt100k)
summary(model3)
##
## Call:
## glm(formula = gt20dwn ~ . - AMMORT - LPRICE, family = binomial(link = "logit"),
## data = homes_gt100k)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.493e+00 2.132e-01 -7.004 2.49e-12 ***
## EAPTBLY 9.426e-02 8.202e-02 1.149 0.250490
## ECOM1Y -9.180e-02 6.695e-02 -1.371 0.170325
## ECOM2Y -4.082e-01 2.070e-01 -1.972 0.048625 *
## EGREENY 4.097e-03 4.390e-02 0.093 0.925629
## EJUNKY -2.323e-01 2.179e-01 -1.066 0.286425
## ELOW1Y 2.175e-02 7.399e-02 0.294 0.768779
## ESFDY -3.579e-01 9.838e-02 -3.638 0.000275 ***
## ETRANSY -1.061e-01 8.863e-02 -1.198 0.231104
## EABANY -1.862e-01 1.607e-01 -1.158 0.246670
## HOWHgood -8.069e-02 9.756e-02 -0.827 0.408204
## HOWNgood 1.945e-01 8.100e-02 2.401 0.016361 *
## ODORAY 1.280e-01 1.161e-01 1.103 0.270076
## STRNAY -1.155e-01 5.439e-02 -2.124 0.033712 *
## ZINC2 -2.881e-07 2.189e-07 -1.316 0.188130
## PER -1.262e-01 2.041e-02 -6.181 6.35e-10 ***
## ZADULT 1.013e-02 3.569e-02 0.284 0.776468
## HHGRADBach 2.070e-01 7.311e-02 2.831 0.004633 **
## HHGRADGrad 2.899e-01 7.992e-02 3.627 0.000287 ***
## HHGRADHS Grad 2.554e-02 7.253e-02 0.352 0.724727
## HHGRADNo HS -1.754e-01 1.246e-01 -1.407 0.159381
## NUNITS 1.274e-03 1.790e-03 0.712 0.476706
## INTW -6.134e-02 1.713e-02 -3.582 0.000341 ***
## METROurban -1.251e-01 6.294e-02 -1.987 0.046903 *
## STATECO 4.055e-02 8.739e-02 0.464 0.642607
## STATECT 7.897e-01 9.210e-02 8.574 < 2e-16 ***
## STATEGA -2.147e-01 9.940e-02 -2.160 0.030785 *
## STATEIL 5.716e-01 1.971e-01 2.900 0.003734 **
## STATEIN 3.524e-01 1.016e-01 3.469 0.000522 ***
## STATELA 6.957e-01 1.214e-01 5.730 1.00e-08 ***
## STATEMO 6.164e-01 1.052e-01 5.862 4.57e-09 ***
## STATEOH 7.987e-01 1.024e-01 7.798 6.27e-15 ***
## STATEOK 2.693e-01 1.218e-01 2.211 0.027019 *
## STATEPA 7.626e-01 1.169e-01 6.523 6.88e-11 ***
## STATETX 4.604e-01 1.297e-01 3.549 0.000386 ***
## STATEWA 1.765e-01 9.095e-02 1.941 0.052288 .
## BATHS 2.324e-01 3.750e-02 6.198 5.71e-10 ***
## BEDRMS -1.466e-02 3.233e-02 -0.453 0.650264
## MATBUYY 3.422e-01 4.335e-02 7.893 2.94e-15 ***
## DWNPAYprev home 7.763e-01 5.318e-02 14.597 < 2e-16 ***
## VALUE 1.764e-06 1.608e-07 10.970 < 2e-16 ***
## FRSTHOY -3.457e-01 5.950e-02 -5.810 6.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15210 on 12143 degrees of freedom
## Residual deviance: 13620 on 12102 degrees of freedom
## AIC: 13704
##
## Number of Fisher Scoring iterations: 4
# (ii) To compare the two segments I refit with lm() instead of glm(), since glm() does not report an R-squared.
model3.5 <- lm(gt20dwn ~ . - AMMORT - LPRICE, data = homes_gt100k)
model4 <- lm(gt20dwn ~ . - AMMORT - LPRICE, data = homes_lt100k)
cat("Gt 100k R-squred: ", summary(model3.5)$r.squared, "\n")
## Gt 100k R-squred: 0.1261319
cat("Lt 100k R-squared: ", summary(model4)$r.squared, "\n")
## Lt 100k R-squared: 0.08317762
# (iii) Pseudo R-squared via the GLM: use the model fit on the >100k homes (model3) to predict for the <=100k homes, then compute the out-of-sample R-squared.
predictions_lt100k <- predict(model3, newdata = homes_lt100k, type = "response")
R2_lt100k <- R2(homes_lt100k$gt20dwn, predictions_lt100k, family="binomial")
print(R2_lt100k)
## [1] 0.03347367
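# For reference, a minimal hand-rolled version of the same statistic, assuming deviance.R defines R2 as 1 - dev(y, pred) / dev(y, ybar) with the usual binomial deviance:
bin_dev <- function(y, p) -2*sum(y*log(p) + (1 - y)*log(1 - p))
ybar_lt <- mean(homes_lt100k$gt20dwn)
1 - bin_dev(homes_lt100k$gt20dwn, predictions_lt100k) /
    bin_dev(homes_lt100k$gt20dwn, rep(ybar_lt, nrow(homes_lt100k)))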