Intro

Submit your answers as a knitted HTML document with a floating table of contents, showing both code and results. Each question should be its own section. This document has the output format pre-set, so you should get a correctly formatted output document by knitting this R Markdown file. Turn in the HTML file on Canvas, NOT the R Markdown document!

Improperly formatted submissions (submitting an Rmd instead of an HTML, or an HTML with a code error such that the document does not render properly, a document where code is not visible) will receive a 50% grade deduction.

Questions where your code returns an error will receive a 50% deduction. Please try your best to make sure your code runs, even if you’re not sure if the output is correct!

If one chunk is giving you an error and stopping the document from rendering, but the rest of your code works and you need to knit, you can change error = FALSE in the setup chunk to error = TRUE so that error-producing code chunks will not stop the HTML document from rendering. We recommend not changing this setting unless you really need to render the document with errors included. Most of the time, it’s helpful that the HTML document will not render when your code produces an error, because that helps you go back and fix your code.

Question 1

Read the NYC Pluto data about land value from in Manhattan from https://query.data.world/s/WuYif0WbZCxzOLjLh95TY9rK--Tw64 . Ignore any warnings caused by using read_csv(). The data contain characteristic information about individual lots in Manhattan. We want to model the value of lots (TotalValue) as a function of its characteristics.

# Code that loads data here
land_data <- read.csv("https://query.data.world/s/WuYif0WbZCxzOLjLh95TY9rK--Tw64")

1.1

First, choose 3 variables in the Pluto data you find interesting. For each of these 3 variables:

  • Write 1-2 sentences describing why you picked that variable (you may use data-driven methods or intuition!),
  • Generate and report at least 2 relevant summary statistics of that variable,
  • Show a plot that helps you illustrate the relationship between that variable and TotalValue. Choose the type of plot that makes the most sense to you for each variable.

The first variable I’m going to use, based purely on intuition, is school district, because there’s often a direct correlation between the value of a home and the associated school district it resides in.

# Code that outputs 2 relevant summary statistics for variable 1 here
stat1 <- lm(TotalValue ~ SchoolDistrict, data=land_data)
summary(stat1)
## 
## Call:
## lm(formula = TotalValue ~ SchoolDistrict, data = land_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2560075 -1612455 -1050084   -29965 19221306 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2929747      37288   78.57   <2e-16 ***
## SchoolDistrict  -365892      11105  -32.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3151000 on 32248 degrees of freedom
## Multiple R-squared:  0.03257,    Adjusted R-squared:  0.03254 
## F-statistic:  1086 on 1 and 32248 DF,  p-value: < 2.2e-16
# Code that outputs plot of variable 1 and TotalValue here
ggplot(land_data, aes(x=as.factor(SchoolDistrict), y=TotalValue))+
    geom_boxplot(outlier.shape = NA, fill="slateblue", alpha=0.2) +
    coord_cartesian(ylim = c(0, 5000000)) + 
    xlab('School District') + ylab('Total Value') +
    scale_y_continuous(labels=scales::dollar_format())

I selected the second variable because there should be a direct linear relationship between the area of the building and its associated value.

# Code that outputs 2 relevant summary statistics for variable 2 here
stat2 <- lm(TotalValue~ BldgArea, data=land_data)
summary(stat2)
## 
## Call:
## lm(formula = TotalValue ~ BldgArea, data = land_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -75663806   -755812   -503570     66825  19268725 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.304e+05  1.379e+04   45.71   <2e-16 ***
## BldgArea    5.655e+01  2.966e-01  190.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2196000 on 32248 degrees of freedom
## Multiple R-squared:  0.5298, Adjusted R-squared:  0.5298 
## F-statistic: 3.634e+04 on 1 and 32248 DF,  p-value: < 2.2e-16
# Code that outputs plot of variable 2 and TotalValue here
ggplot(land_data, aes(x=BldgArea, y=TotalValue))+
    geom_point() + coord_cartesian(ylim = c(0, 5000000)) + 
    coord_cartesian(xlim = c(0, 200000)) + 
    xlab('Building Area') + ylab('Total Value') + 
    scale_y_continuous(labels=scales::dollar_format()) +
    scale_x_continuous(labels=scales::unit_format())
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

I selected the third variable because the zoning should have a clear effect on the land’s value depending on whether it’s residential / commercial / etc.

# Code that outputs 2 relevant summary statistics for variable 3 here
stat3 <- lm(TotalValue ~ ZoneDist1, land_data)
summary(stat3)
## 
## Call:
## lm(formula = TotalValue ~ ZoneDist1, data = land_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3545215 -1277097  -939879   -92059 18647432 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 2842356      33680  84.392  < 2e-16 ***
## ZoneDist1Manufacturing       627701      82443   7.614 2.73e-14 ***
## ZoneDist1Mixed Residential -1481232     533940  -2.774  0.00554 ** 
## ZoneDist1Park                704209     232145   3.033  0.00242 ** 
## ZoneDist1Residential       -1525988      39710 -38.428  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3107000 on 32245 degrees of freedom
## Multiple R-squared:  0.05924,    Adjusted R-squared:  0.05912 
## F-statistic: 507.6 on 4 and 32245 DF,  p-value: < 2.2e-16
# Code that outputs plot of variable 3 and TotalValue here
ggplot(land_data, aes(x=as.factor(ZoneDist1), y=TotalValue))+
    geom_boxplot(outlier.shape = NA, fill="slateblue", alpha=0.2) +
    coord_cartesian(ylim = c(0, 9000000)) + 
    xlab('Zoning') + ylab('Total Value') +
    scale_y_continuous(labels=scales::dollar_format())

1.2

Next, regress TotalValue against FireService, LotArea, LotFront and HistoricDistrict.

# Code that fits regression here
regress1 <- glm(TotalValue ~ FireService + LotArea + LotFront + HistoricDistrict, data=land_data)

Use the coefplot package to visualize the regression results with a coefficient plot.

# Code that outputs coefplot here
coefplot(regress1)

Interpret each coefficient in your own words (1-2 sentences per coefficient).

FILL IN BELOW WITH COEFFICIENT INTERPRETATIONS OF THE FOLLOWING VARIABLES

  • Intercept: Sets the baseline value for properties
  • FireService: The type of fire service that covers the property has a slight negative impact on the total value.
  • LotArea: Has a small, albeit positive, impact on total value
  • LotFront: Also has a small, albeit positive, impact on the total value
  • HistoricDistrict: The property being located in a historic district has little to no effect on its total value.

1.3

Continue adjusting your model on the Pluto data by changing which predictor variables are included in the model until the BIC of the model is below 1,040,000.

# Code that re-fits model here
stat4 <- lm(TotalValue ~ ZoneDist1 + BldgArea, land_data)    

Display the BIC of your final model.

# Code that outputs model BIC here
BIC(stat4, stat3, stat2, stat1, regress1)
##          df     BIC
## stat4     7 1031147
## stat3     6 1055804
## stat2     3 1033403
## stat1     3 1056674
## regress1  7 1051521

Show a coefplot coefficient plot of the final model.

# Code that outputs coefplot here
coefplot(stat4)

Display a coefficient plot, with multiplot, showing the coefficients of both the previous model and this model.

# Code that outputs multiplot here
multiplot(regress1, stat4)

Question 2

Load the orings dataset by downloading the daag_orings.csv file from Canvas into the “data” subfolder of your R project folder, and then reading the data into R. The dataset contains information on characteristics of space shuttle launches in the early 1980s, and the number and type of 45 O-ring failures. Investigate whether the temperature at launch has an impact on the probability of O-ring failure using binary (logistic) regression.

# Code that loads data here
space_data <- read.csv(file = "daag_orings.csv")
space_data
##    Temperature Erosion Blowby Total
## 1           53       3      2     5
## 2           57       1      0     1
## 3           58       1      0     1
## 4           63       1      0     1
## 5           66       0      0     0
## 6           67       0      0     0
## 7           67       0      0     0
## 8           67       0      0     0
## 9           68       0      0     0
## 10          69       0      0     0
## 11          70       1      0     1
## 12          70       0      0     0
## 13          70       1      0     1
## 14          70       0      0     0
## 15          72       0      0     0
## 16          73       0      0     0
## 17          75       0      0     0
## 18          75       0      2     1
## 19          76       0      0     0
## 20          76       0      0     0
## 21          78       0      0     0
## 22          79       0      0     0
## 23          81       0      0     0

2.1

First, use mutate() to create a binary variable indicating whether there is any erosion (Erosion) or blow-by (Blowby). The variable should be 0 if there is neither erosion nor blow-by for a given O-ring failure, and 1 if there is either erosion or blow-by (or both).

# Code that creates new variable here
space_data_new <- space_data %>%
  mutate(Damaged = case_when(Erosion > 0 ~ 1,
                           Blowby > 0 ~ 1,
                           Erosion == 0 | Blowby == 0 ~ 0))

2.2

Next, train a logistic regression where this new variable is the response and Temperature is a predictor.

# Code that fits model here
model1 <- lm(Damaged ~ Temperature, space_data_new)
model1
## 
## Call:
## lm(formula = Damaged ~ Temperature, data = space_data_new)
## 
## Coefficients:
## (Intercept)  Temperature  
##     2.90476     -0.03738

Does Temperature impact the risk of o-ring failure positively, negatively, or not at all? Write 1-2 sentences interpreting the result of your regression.

Temperature impacts the risk of o-ring failure negatively, meaning the higher the temperature (within the bounds of the data) the less likely the chance of failure.

Question 3

Load the DoctorAUS dataset from the Ecdat package. The dataset contains information on Australians’ doctor visits from 1977-78. We want to model the number of doctor visits per person (doctorco) as a function of other health-related predictor variable(s).

# Code that loads data here
doctor_data <- DoctorAUS

3.1

Use Poisson regression to model doctorco as a function of 1-5 variables of your choice (more is not necessarily better!).

# Code that fits model here
doc_model1 <- glm(doctorco ~ actdays + chcond, data=doctor_data, family=poisson)
summary(doc_model1)
## 
## Call:
## glm(formula = doctorco ~ actdays + chcond, family = poisson, 
##     data = doctor_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1153  -0.7225  -0.5779  -0.5779   5.7162  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.789979   0.046454 -38.532  < 2e-16 ***
## actdays      0.146309   0.004585  31.912  < 2e-16 ***
## chcondla     0.446865   0.059536   7.506 6.11e-14 ***
## chcondnla    0.546872   0.075142   7.278 3.39e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5634.8  on 5189  degrees of freedom
## Residual deviance: 4581.8  on 5186  degrees of freedom
## AIC: 6921.4
## 
## Number of Fisher Scoring iterations: 6

Write 1-2 sentences describing why you picked each of those variables.

I picked number of actdays because reduced activity due to illness likely means a patient would seek out more doctor consultations, and patients with chronic conditions are also more likely to require more doctor visits.

Report the BIC of your model.

# Code that outputs model BIC here
BIC(doc_model1)
## [1] 6947.603

Display a coefficient plot on the original scale of the data by setting the trans argument of coefplot() to the appropriate inverse function. This requires reading (AS YET UN-NUMBERED TEXTBOOK SECTION).

# Code that outputs coefplot here
coefplot(doc_model1, trans = exp)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

3.2

Model the number of doctor visits as a function of your variables of choice again, this time accounting for overdispersion by using a quasipoisson family.

# Code to fit new model here
doc_model2 <- glm(doctorco ~ actdays + chcond, data=doctor_data, family=quasipoisson)

Report the overdispersion parameter of your new model.

# Code that prints out model overdispersion parameter here
summary(doc_model2)
## 
## Call:
## glm(formula = doctorco ~ actdays + chcond, family = quasipoisson, 
##     data = doctor_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1153  -0.7225  -0.5779  -0.5779   5.7162  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.78998    0.05461 -32.778  < 2e-16 ***
## actdays      0.14631    0.00539  27.147  < 2e-16 ***
## chcondla     0.44687    0.06999   6.385 1.86e-10 ***
## chcondnla    0.54687    0.08833   6.191 6.44e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.381926)
## 
##     Null deviance: 5634.8  on 5189  degrees of freedom
## Residual deviance: 4581.8  on 5186  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Display a multiplot of the previous model and this one.

# Code that outputs multiplot here
multiplot(doc_model1, doc_model2)