setwd("~/Desktop/R Materials/mih140/Assignments/'20 Assignments/Data")
house = read.table("house.txt", sep = "\t", header = T)
# Lets continue with our example from last time! Recall we were examining the relationship between price and sq footage.
reg_model = lm(data = house, price ~ sqft_living) # linear regression
summary(reg_model) # Adjusted R^2 = .6382, good explanation
##
## Call:
## lm(formula = price ~ sqft_living, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -198874 -38967 -5966 21908 534936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37199.05 10737.99 3.464 0.00057 ***
## sqft_living 153.50 4.76 32.249 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79820 on 588 degrees of freedom
## Multiple R-squared: 0.6388, Adjusted R-squared: 0.6382
## F-statistic: 1040 on 1 and 588 DF, p-value: < 2.2e-16
# Check assumptions by examining
par(mfrow = c(2,2))
plot(reg_model)
# 1. Linearity - Looked at residuals
# 2. Normality - Looked qqplot of residuals
# 3. Constant Variance - Looked at residuals
# 4. Independence (of error term and response) - Let's do this now!
# Need to install package install.packages("lmtest")
library(lmtest) # loads in package after installation
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(reg_model) # pval = .3569, not strong enough to reject the null of indep.
##
## Durbin-Watson test
##
## data: reg_model
## DW = 1.9698, p-value = 0.3569
## alternative hypothesis: true autocorrelation is greater than 0
# Alternative Method
# install.packages("gvlma")
library(gvlma)
gvlma(reg_model)
##
## Call:
## lm(formula = price ~ sqft_living, data = house)
##
## Coefficients:
## (Intercept) sqft_living
## 37199.1 153.5
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = reg_model)
##
## Value p-value Decision
## Global Stat 3066.164 0.000e+00 Assumptions NOT satisfied!
## Skewness 448.590 0.000e+00 Assumptions NOT satisfied!
## Kurtosis 2575.988 0.000e+00 Assumptions NOT satisfied!
## Link Function 39.115 3.996e-10 Assumptions NOT satisfied!
## Heteroscedasticity 2.472 1.159e-01 Assumptions acceptable.
plot(gvlma(reg_model))
Value p-value Decision
Global Stat 3066.164 0.000e+00 Assumptions NOT satisfied! Skewness 448.590 0.000e+00 Assumptions NOT satisfied! Kurtosis 2575.988 0.000e+00 Assumptions NOT satisfied! Link Function 39.115 3.996e-10 Assumptions NOT satisfied! Heteroscedasticity 2.472 1.159e-01 Assumptions acceptable.
GVLMA gives a more sophifisticated take on assumption testing then we’ll use in this class. The interested reader should refer to: Pena, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473), 341-354. https://doi.org/10.1198/016214505000000637
# Examining the regression
reg_model$coefficients # gets the coeff. of the model
## (Intercept) sqft_living
## 37199.0531 153.5013
confint.lm(reg_model, level = .95) # change confidence level with level param.
## 2.5 % 97.5 %
## (Intercept) 16109.5657 58288.5405
## sqft_living 144.1529 162.8496
# 2.5 % 97.5 %
# (Intercept) 16109.5657 58288.5405
# sqft_living 144.1529 162.8496
Suppose our assumptions are all satisfied and now we’d like to predict the cost of some new observations (i.e. homes) using our model. We saw in lecture that there are two main types of prediction and two main sources of prediction error.
# Suppose we want to predict the price of a home thats 1000, 2000, 20,000 ft
test_data = house[1:3,] # copies some rows
test_data$sqft_living = c(1000, 2000, 20000) # changes sqft to what we want to predict
predict(reg_model, test_data, interval = "confidence") # 95% interval for mean response E[y|x]
## fit lwr upr
## 1 190700.3 178179.3 203221.3
## 2 344201.6 337601.2 350802.0
## 3 3107224.5 2940210.7 3274238.2
predict(reg_model, test_data, interval = "prediction") # 95% interval for predicted obs, y|x
## fit lwr upr
## 1 190700.3 33425.28 347975.4
## 2 344201.6 187286.87 501116.3
## 3 3107224.5 2878156.24 3336292.7