Submit your answers as a knitted HTML document with a floating table of contents, showing both code and results. Each question should be its own section. This document has the output format pre-set, so you should get a correctly formatted output document by knitting this R Markdown file. Turn in the HTML file on Canvas, NOT the R Markdown document!
Improperly formatted submissions (submitting an Rmd instead of an HTML, or an HTML with a code error such that the document does not render properly, a document where code is not visible) will receive a 50% grade deduction.
Questions where your code returns an error will receive a 50% deduction. Please try your best to make sure your code runs, even if you’re not sure if the output is correct!
If one chunk is giving you an error and stopping the document from
rendering, but the rest of your code works and you need to knit, you can
change error = FALSE in the setup chunk to
error = TRUE so that error-producing code chunks will not
stop the HTML document from rendering. We recommend not changing this
setting unless you really need to render the document with errors
included. Most of the time, it’s helpful that the HTML document will not
render when your code produces an error, because that helps you go back
and fix your code.
Read the NYC Pluto data about land value from in Manhattan from https://query.data.world/s/WuYif0WbZCxzOLjLh95TY9rK--Tw64
. Ignore any warnings caused by using read_csv(). The data
contain characteristic information about individual lots in Manhattan.
We want to model the value of lots (TotalValue) as a
function of its characteristics.
# Code that loads data here
land_data <- read.csv("https://query.data.world/s/WuYif0WbZCxzOLjLh95TY9rK--Tw64")
First, choose 3 variables in the Pluto data you find interesting. For each of these 3 variables:
TotalValue. Choose the type of plot that makes
the most sense to you for each variable.The first variable I’m going to use, based purely on intuition, is school district, because there’s often a direct correlation between the value of a home and the associated school district it resides in.
# Code that outputs 2 relevant summary statistics for variable 1 here
stat1 <- lm(TotalValue ~ SchoolDistrict, data=land_data)
summary(stat1)
##
## Call:
## lm(formula = TotalValue ~ SchoolDistrict, data = land_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2560075 -1612455 -1050084 -29965 19221306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2929747 37288 78.57 <2e-16 ***
## SchoolDistrict -365892 11105 -32.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3151000 on 32248 degrees of freedom
## Multiple R-squared: 0.03257, Adjusted R-squared: 0.03254
## F-statistic: 1086 on 1 and 32248 DF, p-value: < 2.2e-16
# Code that outputs plot of variable 1 and TotalValue here
ggplot(land_data, aes(x=as.factor(SchoolDistrict), y=TotalValue))+
geom_boxplot(outlier.shape = NA, fill="slateblue", alpha=0.2) +
coord_cartesian(ylim = c(0, 5000000)) +
xlab('School District') + ylab('Total Value') +
scale_y_continuous(labels=scales::dollar_format())
I selected the second variable because there should be a direct linear relationship between the area of the building and its associated value.
# Code that outputs 2 relevant summary statistics for variable 2 here
stat2 <- lm(TotalValue~ BldgArea, data=land_data)
summary(stat2)
##
## Call:
## lm(formula = TotalValue ~ BldgArea, data = land_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75663806 -755812 -503570 66825 19268725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.304e+05 1.379e+04 45.71 <2e-16 ***
## BldgArea 5.655e+01 2.966e-01 190.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2196000 on 32248 degrees of freedom
## Multiple R-squared: 0.5298, Adjusted R-squared: 0.5298
## F-statistic: 3.634e+04 on 1 and 32248 DF, p-value: < 2.2e-16
# Code that outputs plot of variable 2 and TotalValue here
ggplot(land_data, aes(x=BldgArea, y=TotalValue))+
geom_point() + coord_cartesian(ylim = c(0, 5000000)) +
coord_cartesian(xlim = c(0, 200000)) +
xlab('Building Area') + ylab('Total Value') +
scale_y_continuous(labels=scales::dollar_format()) +
scale_x_continuous(labels=scales::unit_format())
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
I selected the third variable because the zoning should have a clear effect on the land’s value depending on whether it’s residential / commercial / etc.
# Code that outputs 2 relevant summary statistics for variable 3 here
stat3 <- lm(TotalValue ~ ZoneDist1, land_data)
summary(stat3)
##
## Call:
## lm(formula = TotalValue ~ ZoneDist1, data = land_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3545215 -1277097 -939879 -92059 18647432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2842356 33680 84.392 < 2e-16 ***
## ZoneDist1Manufacturing 627701 82443 7.614 2.73e-14 ***
## ZoneDist1Mixed Residential -1481232 533940 -2.774 0.00554 **
## ZoneDist1Park 704209 232145 3.033 0.00242 **
## ZoneDist1Residential -1525988 39710 -38.428 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3107000 on 32245 degrees of freedom
## Multiple R-squared: 0.05924, Adjusted R-squared: 0.05912
## F-statistic: 507.6 on 4 and 32245 DF, p-value: < 2.2e-16
# Code that outputs plot of variable 3 and TotalValue here
ggplot(land_data, aes(x=as.factor(ZoneDist1), y=TotalValue))+
geom_boxplot(outlier.shape = NA, fill="slateblue", alpha=0.2) +
coord_cartesian(ylim = c(0, 9000000)) +
xlab('Zoning') + ylab('Total Value') +
scale_y_continuous(labels=scales::dollar_format())
Next, regress TotalValue against
FireService, LotArea, LotFront
and HistoricDistrict.
# Code that fits regression here
regress1 <- glm(TotalValue ~ FireService + LotArea + LotFront + HistoricDistrict, data=land_data)
Use the coefplot package to visualize the regression
results with a coefficient plot.
# Code that outputs coefplot here
coefplot(regress1)
Interpret each coefficient in your own words (1-2 sentences per coefficient).
FILL IN BELOW WITH COEFFICIENT INTERPRETATIONS OF THE FOLLOWING VARIABLES
FireService: The type of fire service that
covers the property has a slight negative impact on the total
value.LotArea: Has a small, albeit positive, impact
on total valueLotFront: Also has a small, albeit positive,
impact on the total valueHistoricDistrict: The property being located in
a historic district has little to no effect on its total
value.Continue adjusting your model on the Pluto data by changing which predictor variables are included in the model until the BIC of the model is below 1,040,000.
# Code that re-fits model here
stat4 <- lm(TotalValue ~ ZoneDist1 + BldgArea, land_data)
Display the BIC of your final model.
# Code that outputs model BIC here
BIC(stat4, stat3, stat2, stat1, regress1)
## df BIC
## stat4 7 1031147
## stat3 6 1055804
## stat2 3 1033403
## stat1 3 1056674
## regress1 7 1051521
Show a coefplot coefficient plot of the final model.
# Code that outputs coefplot here
coefplot(stat4)
Display a coefficient plot, with multiplot, showing the coefficients of both the previous model and this model.
# Code that outputs multiplot here
multiplot(regress1, stat4)
Load the orings dataset by downloading the
daag_orings.csv file from Canvas into the “data” subfolder
of your R project folder, and then reading the data into R. The dataset
contains information on characteristics of space shuttle launches in the
early 1980s, and the number and type of 45 O-ring failures. Investigate
whether the temperature at launch has an impact on the probability of
O-ring failure using binary (logistic) regression.
# Code that loads data here
space_data <- read.csv(file = "daag_orings.csv")
space_data
## Temperature Erosion Blowby Total
## 1 53 3 2 5
## 2 57 1 0 1
## 3 58 1 0 1
## 4 63 1 0 1
## 5 66 0 0 0
## 6 67 0 0 0
## 7 67 0 0 0
## 8 67 0 0 0
## 9 68 0 0 0
## 10 69 0 0 0
## 11 70 1 0 1
## 12 70 0 0 0
## 13 70 1 0 1
## 14 70 0 0 0
## 15 72 0 0 0
## 16 73 0 0 0
## 17 75 0 0 0
## 18 75 0 2 1
## 19 76 0 0 0
## 20 76 0 0 0
## 21 78 0 0 0
## 22 79 0 0 0
## 23 81 0 0 0
First, use mutate() to create a binary variable
indicating whether there is any erosion (Erosion) or
blow-by (Blowby). The variable should be 0 if there is
neither erosion nor blow-by for a given O-ring failure, and 1 if there
is either erosion or blow-by (or both).
# Code that creates new variable here
space_data_new <- space_data %>%
mutate(Damaged = case_when(Erosion > 0 ~ 1,
Blowby > 0 ~ 1,
Erosion == 0 | Blowby == 0 ~ 0))
Next, train a logistic regression where this new variable is the
response and Temperature is a predictor.
# Code that fits model here
model1 <- lm(Damaged ~ Temperature, space_data_new)
model1
##
## Call:
## lm(formula = Damaged ~ Temperature, data = space_data_new)
##
## Coefficients:
## (Intercept) Temperature
## 2.90476 -0.03738
Does Temperature impact the risk of o-ring failure
positively, negatively, or not at all? Write 1-2 sentences interpreting
the result of your regression.
Temperature impacts the risk of o-ring failure negatively, meaning the higher the temperature (within the bounds of the data) the less likely the chance of failure.
Load the DoctorAUS dataset from the Ecdat
package. The dataset contains information on Australians’ doctor visits
from 1977-78. We want to model the number of doctor visits per person
(doctorco) as a function of other health-related predictor
variable(s).
# Code that loads data here
doctor_data <- DoctorAUS
Use Poisson regression to model doctorco as a function
of 1-5 variables of your choice (more is not necessarily better!).
# Code that fits model here
doc_model1 <- glm(doctorco ~ actdays + chcond, data=doctor_data, family=poisson)
summary(doc_model1)
##
## Call:
## glm(formula = doctorco ~ actdays + chcond, family = poisson,
## data = doctor_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1153 -0.7225 -0.5779 -0.5779 5.7162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.789979 0.046454 -38.532 < 2e-16 ***
## actdays 0.146309 0.004585 31.912 < 2e-16 ***
## chcondla 0.446865 0.059536 7.506 6.11e-14 ***
## chcondnla 0.546872 0.075142 7.278 3.39e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5634.8 on 5189 degrees of freedom
## Residual deviance: 4581.8 on 5186 degrees of freedom
## AIC: 6921.4
##
## Number of Fisher Scoring iterations: 6
Write 1-2 sentences describing why you picked each of those variables.
I picked number of actdays because reduced activity due to illness likely means a patient would seek out more doctor consultations, and patients with chronic conditions are also more likely to require more doctor visits.
Report the BIC of your model.
# Code that outputs model BIC here
BIC(doc_model1)
## [1] 6947.603
Display a coefficient plot on the original scale of the data
by setting the trans argument of coefplot() to
the appropriate inverse function. This requires reading (AS YET
UN-NUMBERED TEXTBOOK SECTION).
# Code that outputs coefplot here
coefplot(doc_model1, trans = exp)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
Model the number of doctor visits as a function of your variables of choice again, this time accounting for overdispersion by using a quasipoisson family.
# Code to fit new model here
doc_model2 <- glm(doctorco ~ actdays + chcond, data=doctor_data, family=quasipoisson)
Report the overdispersion parameter of your new model.
# Code that prints out model overdispersion parameter here
summary(doc_model2)
##
## Call:
## glm(formula = doctorco ~ actdays + chcond, family = quasipoisson,
## data = doctor_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1153 -0.7225 -0.5779 -0.5779 5.7162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.78998 0.05461 -32.778 < 2e-16 ***
## actdays 0.14631 0.00539 27.147 < 2e-16 ***
## chcondla 0.44687 0.06999 6.385 1.86e-10 ***
## chcondnla 0.54687 0.08833 6.191 6.44e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1.381926)
##
## Null deviance: 5634.8 on 5189 degrees of freedom
## Residual deviance: 4581.8 on 5186 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
Display a multiplot of the previous model and this one.
# Code that outputs multiplot here
multiplot(doc_model1, doc_model2)