This assignment is to produce a model of median household income using other variables in the countyComplete dataframe, which is in the openintro package.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
library(broom)
Run the commands glimpse() and summary() on the dataframe to understand the meaning of the variables and to verify the integrity of the data. The documentation in the openintro package is useful for understanding the variables.
# Place your code here.
Use lm() to create a model to predict median household income using 5 other variables. Do not include per capita income. Display a summary of the model.
# Place your code here.
lm1 = lm(median_household_income ~ + persons_per_household + density + firms + bachelors + + home_ownership,data = countyComplete)
summary(lm1)
##
## Call:
## lm(formula = median_household_income ~ +persons_per_household +
## density + firms + bachelors + +home_ownership, data = countyComplete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28552 -4161 68 4078 56892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.520e+04 2.160e+03 -20.926 < 2e-16 ***
## persons_per_household 1.169e+04 5.475e+02 21.349 < 2e-16 ***
## density 3.737e-01 8.290e-02 4.508 6.79e-06 ***
## firms 1.894e-02 4.598e-03 4.119 3.91e-05 ***
## bachelors 1.046e+03 1.640e+01 63.773 < 2e-16 ***
## home_ownership 5.449e+02 1.874e+01 29.079 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7125 on 2961 degrees of freedom
## (176 observations deleted due to missingness)
## Multiple R-squared: 0.6273, Adjusted R-squared: 0.6267
## F-statistic: 996.7 on 5 and 2961 DF, p-value: < 2.2e-16
Which 2 numbers in the summary output describe the overall performance of the model. Use these two numbers in appropriate sentences to describe how well your model performed.
The model explained 62.73 percent of the residual variance.
The absolute value of the typical mistake made by the model was 7125.
Place your answer here.
Examine the p-values for the individual coefficients of the model. Can you reject the hypothesis that the true coefficient value is zero in every case?
In all cases, the p-values are below .05. So we can always reject the hypothesis that the true coefficient is zero.
Look at the signs of the coefficients. Do all of them have the signs that you would expect? Note any exceptions.
I expected all of these variables to be indicators of higher income and that turned out to be true.
Construct forecasts of median household income for three different countys: Thurston County Washington, Los Angeles County California and Fayette County Kentucky. Use the predict() or augment() functions in R. Describe how the forecasts compare with the actuals.
test = filter(countyComplete,
state == "Washington" & name == "Thurston County" |
state == "California" & name == "Los Angeles County" |
state == "Kentucky" & name == "Fayette County")
x = predict(lm1,newdata=test)
x
## 1 2 3
## 66846.01 54352.63 53610.43
test %>% select(state,name,median_household_income)
## state name median_household_income
## 1 California Los Angeles County 55476
## 2 Kentucky Fayette County 47469
## 3 Washington Thurston County 60930
Put your verbal answer here.
countyComplete %>% select(median_household_income, persons_per_household, density, firms, bachelors, home_ownership) -> myvars
Now look at the relationships between the pairs of variables in myvars.
plot(myvars)
cor(myvars)
## median_household_income persons_per_household
## median_household_income 1.00000000 0.117497589
## persons_per_household 0.11749759 1.000000000
## density 0.14107089 0.003251426
## firms NA NA
## bachelors 0.68409243 -0.118615183
## home_ownership 0.06705549 -0.073771157
## density firms bachelors home_ownership
## median_household_income 0.141070892 NA 0.6840924 0.06705549
## persons_per_household 0.003251426 NA -0.1186152 -0.07377116
## density 1.000000000 NA 0.2324347 -0.31836611
## firms NA 1 NA NA
## bachelors 0.232434681 NA 1.0000000 -0.27800358
## home_ownership -0.318366108 NA -0.2780036 1.00000000
I see outlying observations for firms and density and a potentially non-linear relationship between my target variable and home ownership. This suggests trying transformed variables.
myvars %>% mutate(ld = log10(density +.01),
lf = log10(firms +.01),
hsq = home_ownership^2) -> myvars
lm2 = lm(median_household_income ~ + persons_per_household + ld + lf + bachelors + + home_ownership + hsq,data = myvars)
summary(lm2)
##
## Call:
## lm(formula = median_household_income ~ +persons_per_household +
## ld + lf + bachelors + +home_ownership + hsq, data = myvars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29155 -4317 -32 3904 55172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31777.391 5687.063 -5.588 2.51e-08 ***
## persons_per_household 10532.432 549.816 19.156 < 2e-16 ***
## ld 306.420 308.502 0.993 0.320669
## lf 2706.699 414.510 6.530 7.72e-11 ***
## bachelors 960.016 18.754 51.190 < 2e-16 ***
## home_ownership 6.285 157.073 0.040 0.968086
## hsq 3.881 1.138 3.410 0.000659 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7032 on 2960 degrees of freedom
## (176 observations deleted due to missingness)
## Multiple R-squared: 0.6371, Adjusted R-squared: 0.6364
## F-statistic: 866.2 on 6 and 2960 DF, p-value: < 2.2e-16
aug2 = augment(lm2)
## Warning: Deprecated: please use `purrr::possibly()` instead
## Warning: Deprecated: please use `purrr::possibly()` instead
## Warning: Deprecated: please use `purrr::possibly()` instead
## Warning: Deprecated: please use `purrr::possibly()` instead
## Warning: Deprecated: please use `purrr::possibly()` instead
lm3 = lm(median_household_income ~ + persons_per_household + bachelors, data = myvars)
summary(lm3)
##
## Call:
## lm(formula = median_household_income ~ +persons_per_household +
## bachelors, data = myvars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35069 -4572 -160 4626 46743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3008.5 1535.9 1.959 0.0502 .
## persons_per_household 9271.5 580.1 15.982 <2e-16 ***
## bachelors 943.7 16.8 56.161 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8102 on 3140 degrees of freedom
## Multiple R-squared: 0.508, Adjusted R-squared: 0.5077
## F-statistic: 1621 on 2 and 3140 DF, p-value: < 2.2e-16