# load the packages for graphing and data wrangling
library(ggplot2)
library(PASWR2)
library(car) # install the car package first, if not already installed
library(dplyr)
library(lattice)
library(boot)
library(MASS)
Note: If you Rmd file submission knits you will receive total of (5 points)
Data and ideas for this case study come from (Militino et al., 2004).
Problem 16, page 899 (not entire problem, only specified parts below)
The goal of this case study is to walk the user through the creation of a parsimonious multiple linear regression model that can be used to predict the total price (totalprice) of apartments by their hedonic (structural) characteristics. The data frame VIT2005 contains several variables, and further description of the data can be found in the help file (listed below).
A data frame with 218 observations on the following 15 variables:
totalprice (the market total price (in Euros) of the apartment including garage(s) and storage room(s))
area (the total living area of the apartment in square meters)
zone (a factor indicating the neighborhood where the apartment is located with levels Z11, Z21, Z31, Z32, Z34, Z35, Z36, Z37, Z38, Z41, Z42, Z43, Z44, Z45, Z46, Z47, Z48, Z49, Z52, Z53, Z56, Z61, and Z62)
category (a factor indicating the condition of the apartment with levels 2A, 2B, 3A, 3B, 4A, 4B, and 5A ordered so that 2A is the best and 5A is the worst)
age (age of the apartment in years)
floor (floor on which the apartment is located)
rooms (total number of rooms including bedrooms, dining room, and kitchen)
out (a factor indicating the percent of the apartment exposed to the elements: The levels E100, E75, E50, and E25, correspond to complete exposure, 75% exposure, 50% exposure, and 25% exposure, respectively.)
conservation (is an ordered factor indicating the state of conservation of the apartment. The levels 1A, 2A, 2B, and 3A are ordered from best to worst conservation.)
toilets (the number of bathrooms)
garage (the number of garages)
elevator (indicates the absence (0) or presence (1) of elevators.)
streetcategory (an ordered factor from best to worst indicating the category of the street with levels S2, S3, S4, and S5)
heating (a factor indicating the type of heating with levels 1A, 3A, 3B, and 4A which correspond to: no heating, low-standard private heating, high-standard private heating, and central heating, respectively.)
storage (the number of storage rooms outside of the apartment)
(10 pts) Quiz-Project 3 Pr.1
totalprice.Hint: Use ggplot function from ggplot2 package to graph the totalprice density function. Use median and IQR to find the median and IQR for the totalprice.
Solution: Hint: Use the template for the graph:
ggplot(data = YOUR DATA, aes(x = variable to plot)) + geom_density(fill = "your favorite color") + theme_bw()
YOUR CODE HERE:
ggplot(data = totalprice, aes(x = IQR)) +
geom_density(fill = "blue") +
theme_bw()
Error in ggplot(data = totalprice, aes(x = IQR)) :
object 'totalprice' not found
Observation: The distribution of totalprice is … with a median of ... and an IQR of ...
(10 pts) Quiz-Project 3 Pr.2
scatterplotMatrix() from car package or pairs() to explore the relationships between totalprice and the numerical explanatory variables area, age, floor, rooms, toilets, garage, elevator, and storage.Hint: To use scatterplotMatrix type
scatterplotMatrix( ~ totalprice + var1 + var2 + ... + var n, data = VIT2005), do not use more than 5 variables to produce input that fits the screen and can be reviewed. Use the command as many times as you need to review how totalprice correlates with other variables in the data.
Solution:
YOUR CODE HERE:
Observation: The variable totalprice appears to have a moderate linear relationship with area.
Compute the correlation between totalprice and all of the other numerical variables. List the three variables in order along with their correlation coefficients that have the highest correlation with totalprice.
VIT2005. Use a “P-value-to remove” of 5%. Store the final model in the object modelA.(5 pts) Quiz-Project 3 Pr.3
The correlation coefficients are:
NUM <- c("area", "toilets", "rooms")
COR <- cor(VIT2005[, "totalprice"], VIT2005[, NUM])
COR
Observation: The highest three correlations with totalprice occur with area (0.8092), toilets (0.6876), and rooms(0.5256).
Model (A) The functions drop1() and update() are used to create a model using backward elimination.
Hint: drop1(model_.be_name, test = "F") test for significance of all individual predictors given all others are already in the model.
model.be <- lm(totalprice ~ ., data = VIT2005)
#
drop1(model.be, test = "F")
Single term deletions
Model:
totalprice ~ area + zone + category + age + floor + rooms + out +
conservation + toilets + garage + elevator + streetcategory +
heating + storage
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 8.5891e+10 4412.6
area 1 4.4519e+10 1.3041e+11 4501.7 87.5972 < 2.2e-16 ***
zone 22 1.1171e+11 1.9760e+11 4550.3 9.9910 < 2.2e-16 ***
category 6 9.5199e+09 9.5411e+10 4423.5 3.1219 0.006303 **
age 1 3.7563e+06 8.5895e+10 4410.6 0.0074 0.931591
floor 1 3.8440e+07 8.5929e+10 4410.7 0.0756 0.783639
rooms 1 1.0656e+09 8.6956e+10 4413.3 2.0967 0.149472
out 3 3.8946e+09 8.9785e+10 4416.3 2.5544 0.057135 .
conservation 3 1.0031e+09 8.6894e+10 4409.2 0.6579 0.579069
toilets 1 4.7971e+09 9.0688e+10 4422.5 9.4389 0.002477 **
garage 1 1.4771e+10 1.0066e+11 4445.2 29.0627 2.328e-07 ***
elevator 1 5.4265e+09 9.1317e+10 4424.0 10.6772 0.001314 **
streetcategory 3 3.5550e+09 8.9446e+10 4415.5 2.3316 0.076019 .
heating 3 4.3202e+09 9.0211e+10 4417.3 2.8335 0.039877 *
storage 1 1.6433e+09 8.7534e+10 4414.8 3.2334 0.073937 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(10 pts) (part c.1) Quiz-Project 3 Pr.4
Which one appears most insignificant (biggest P-value)? Drop it from the model.
E.g. If age is most insignificant, use the update function to update the model with the following code:
model.be <- update(model.be, .~. - age) - the first dot means we use the same response variable, the second all previously included predictors minus age. The again use drop1 and so on, until all remaining variables are of significance 0.05 as specified in the directions of the problem
YOUR CODE HERE:
model.be <- update(model.be, .~. - age)
drop1(model.be, test = "F")
Single term deletions
Model:
totalprice ~ area + zone + category + floor + rooms + out + conservation +
toilets + garage + elevator + streetcategory + heating +
storage
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 8.5895e+10 4410.6
area 1 4.4894e+10 1.3079e+11 4500.3 88.8523 < 2.2e-16 ***
zone 22 1.1220e+11 1.9810e+11 4548.8 10.0942 < 2.2e-16 ***
category 6 9.6605e+09 9.5555e+10 4421.9 3.1866 0.0054598 **
floor 1 3.6864e+07 8.5931e+10 4408.7 0.0730 0.7874016
rooms 1 1.0655e+09 8.6960e+10 4411.3 2.1088 0.1482969
out 3 4.0439e+09 8.9938e+10 4414.7 2.6678 0.0493562 *
conservation 3 1.3497e+09 8.7244e+10 4408.0 0.8905 0.4473466
toilets 1 4.7993e+09 9.0694e+10 4420.5 9.4987 0.0023994 **
garage 1 1.4767e+10 1.0066e+11 4443.2 29.2260 2.153e-07 ***
elevator 1 5.6740e+09 9.1569e+10 4422.6 11.2298 0.0009919 ***
streetcategory 3 3.5550e+09 8.9450e+10 4413.5 2.3453 0.0746761 .
heating 3 4.3716e+09 9.0266e+10 4415.5 2.8840 0.0373395 *
storage 1 1.6950e+09 8.7590e+10 4412.9 3.3547 0.0687625 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
update(model.be, area = FALSE)
In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
extra argument 㤼㸱area㤼㸲 will be disregarded
Call:
lm(formula = totalprice ~ area + zone + category + floor + rooms +
out + conservation + toilets + garage + elevator + streetcategory +
heating + storage, data = VIT2005, area = FALSE)
Coefficients:
(Intercept) area zoneZ21 zoneZ31 zoneZ32 zoneZ34 zoneZ35 zoneZ36 zoneZ37 zoneZ38 zoneZ41 zoneZ42 zoneZ43
78583.7 1348.9 93618.0 77753.4 40574.2 16315.0 60875.7 36073.0 74093.5 21941.1 52112.2 81135.2 36420.4
zoneZ44 zoneZ45 zoneZ46 zoneZ47 zoneZ48 zoneZ49 zoneZ52 zoneZ53 zoneZ56 zoneZ61 zoneZ62 category2B category3A
18260.2 15305.1 8484.9 -10403.1 34476.9 29796.0 2877.1 18527.6 23027.2 16272.6 15309.2 -889.6 -22657.6
category3B category4A category4B category5A floor rooms outE25 outE50 outE75 conservation2A conservation2B conservation3A toilets
-26110.9 -32763.8 -41086.2 -69865.5 230.1 5158.0 35706.5 -4451.9 3090.2 -4938.7 -8117.1 -8949.4 17542.2
garage elevator streetcategoryS3 streetcategoryS4 streetcategoryS5 heating3A heating3B heating4A storage
24874.9 19530.8 13161.3 9543.9 -4119.2 -12243.9 -5427.9 820.0 8530.4
(10 pts) (part c.2) Quiz-Project 3 Pr.5
Which one is to be dropped next?
Answer: conservation
model.be <- update(model.be, .~. - age)
drop1(model.be, test = "F")
update(model.be, conservation = FALSE)
(10 pts) (part c.3) Quiz-Project 3 Pr.6 Which one is to be dropped next?
Answer: rooms
model.be <- update(model.be, .~. - age)
drop1(model.be, test = "F")
update(model.be, rooms = FALSE)
(10 pts) (part c.4) Quiz-Project 3 Pr.7 Which one is to be dropped next?
Answer:streetcategory
model.be <- update(model.be, .~. - age)
drop1(model.be, test = "F")
update(model.be, streetcategory = FALSE)
(10 pts) (part c.5) Quiz-Project 3 Pr.8 Which one is to be dropped next?
Answer: storage
model.be <- update(model.be, .~. - age)
drop1(model.be, test = "F")
update(model.be, storage = FALSE)
If all variable are significant create the object holding the optimal model as per the description of the problem using the code:
formula(model.be)
totalprice ~ area + zone + category + floor + rooms + out + conservation +
toilets + garage + elevator + streetcategory + heating +
storage
modelA <- lm(formula(model.be), data = VIT2005)
Question 1: (5 pts) Quiz-Project 3 Pr.9 Which variable are left in the model, list them?
“area”, “zone”, “category”, “out”, “toilets”, “garage”, “elevator”, “streetcorner”, and “heating”.
Observation: Backward elimination suggests using the variables area, zone, category, out, toilets, garage, elevator, streetcorner, and heating to best predict totalprice.
Compute \(CV_n\), the leave-one-out cross validation error, for modelA. Set the seed to 5 and compute \(CV_5\), the five-fold cross validation error, for modelA. The cross validation error for a generalized linear model can be computed using the cv.glm() from the boot package. Using the function glm() without passing a family argument is the same as using the function lm(). R Code 12.3 provides a template of how to use the cv.glm() function. Note that \(CV_n\) is returned with cv.error$delta[1]. To compute \(CV_5\), pass the value 5 to the argument K inside the cv.glm() function.
mod.glm <- glm(y ~ x1 + x2, data = modelA)
Error in as.data.frame.default(data) :
cannot coerce class ‘"lm"’ to a data.frame
Compute \(R^2\), \(R^2_a\), the AIC, and the BIC for Model (A). What is the proportion of total variability explained by Model (A)?
Your Solution:
The proportion of total variability is 0.9138.
modelAg <- glm(formula(model.be), data = VIT2005)
# # Fro cv.glm - the default is to set K equal to the number of observations in data which gives the usual leave-one-out cross-validation.
#
set.seed(5) # use for replication purposes
cv.errorN <- cv.glm(data = VIT2005, glmfit = modelAg)
CVNa <- cv.errorN$delta[1]
CVNa
modelAg <- glm(formula(model.be), data = VIT2005)
# UNCOMMENT the code below and run it
set.seed(5) # use for replication purposes
cv.error5 <- cv.glm(data = VIT2005, glmfit = modelAg, K = 5)
CV5a <- cv.error5$delta[1]
CV5a
Observation: The \(CV_n = ...\) for Model (A), and \(CV_5 =...\) for Model (A).
mgof() is written to compute the requested values. Use it as shown below.mgof <- function(model = model, data = DF, ...){
R2a <- summary(model)$adj.r.squared
R2 <- summary(model)$r.squared
aic <- AIC(model)
bic <- AIC(model, k = log(nrow(data)))
se <- summary(model)$sigma
form <- formula(model)
ANS <- c(R2 = R2, R2.adj = R2a, AIC = aic, BIC = bic, SE = se)
ANS
}
MGOF <- mgof(model = modelA, data = VIT2005)
MGOF
R2 R2.adj AIC BIC SE
9.175750e-01 8.947869e-01 5.031290e+03 5.197130e+03 2.247804e+04
Observation: The total proportion of variability explained by modelA is 0.9138.
Explore the residuals of the Models (A) using the function residualPlot() or residualPlots() from the package car. Comment on the results. (Diagnostics: Checking the model assumptions)
(d) Question 2: Is there curvature of the residuals on the above plot?
Yes, there is a large curve in this plot.
Observation: The residuals versus the fitted values for Model (A) have a definite curvature indicating the model is not quite adequate.
Extra Credit (10 pts) Quiz-Project 3 Pr.13
(e) Use the function boxCox() from car to find a suitable transformation for totalprice. Build Model (E) Use backward elimination to develop a model that predicts log(totalprice) using the data frame VIT2005. Use a “P-value-to remove” of 5%. Store the final model in the object modelE.
For details on boxCox() see page 856 in text
log transformation is suggested for the response totalprice in Model (A).Extra Credit (20 pts) Quiz-Project 3 Pr.14 - 4 pts for excluding the correct variable at each step.
log(totalprice) using the data frame VIT2005. Use a “P-value-to remove” of 5%. Store the final model in the object modelE.Model (E) The functions drop1() and update() are used to create a model using backward elimination. (as shown at the beginning)
VITNEW <- VIT2005 %>% mutate(logtotalprice = log(totalprice))
# build the fitted full model
model.be <- lm(logtotalprice ~ ., data = VITNEW[ ,-1]) # exclude the totalprice variable since it is no longer response variable
drop1(model.be, test = "F")
Single term deletions
Model:
logtotalprice ~ area + zone + category + age + floor + rooms +
out + conservation + toilets + garage + elevator + streetcategory +
heating + storage
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 0.97895 -1080.46
area 1 0.45808 1.43703 -998.78 79.0794 8.788e-16 ***
zone 22 1.25852 2.23747 -944.25 9.8756 < 2.2e-16 ***
category 6 0.10125 1.08020 -1071.00 2.9132 0.009938 **
age 1 0.00061 0.97956 -1082.32 0.1045 0.746909
floor 1 0.00185 0.98080 -1082.05 0.3191 0.572909
rooms 1 0.00422 0.98317 -1081.52 0.7279 0.394766
out 3 0.06721 1.04616 -1071.98 3.8675 0.010426 *
conservation 3 0.01134 0.99029 -1083.95 0.6523 0.582546
toilets 1 0.08460 1.06356 -1064.39 14.6057 0.000186 ***
garage 1 0.14304 1.12199 -1052.73 24.6935 1.636e-06 ***
elevator 1 0.14529 1.12424 -1052.29 25.0826 1.373e-06 ***
streetcategory 3 0.03040 1.00935 -1079.79 1.7492 0.158880
heating 3 0.04007 1.01902 -1077.71 2.3060 0.078550 .
storage 1 0.03280 1.01175 -1075.27 5.6625 0.018445 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and so on.