First review the assignment.

This assignment is to produce a model of median household income using other variables in the countyComplete dataframe, which is in the openintro package.

Load the required libraries.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
## 
##     diamonds
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
library(broom)

Problem 1

Run the commands glimpse() and summary() on the dataframe to understand the meaning of the variables and to verify the integrity of the data. The documentation in the openintro package is useful for understanding the variables.

# Place your code here.

Problem 2

Use lm() to create a model to predict median household income using 5 other variables. Do not include per capita income. Display a summary of the model.

# Place your code here.
lm1 = lm(median_household_income ~ + persons_per_household + density + firms + bachelors + + home_ownership,data = countyComplete)
summary(lm1)
## 
## Call:
## lm(formula = median_household_income ~ +persons_per_household + 
##     density + firms + bachelors + +home_ownership, data = countyComplete)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28552  -4161     68   4078  56892 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -4.520e+04  2.160e+03 -20.926  < 2e-16 ***
## persons_per_household  1.169e+04  5.475e+02  21.349  < 2e-16 ***
## density                3.737e-01  8.290e-02   4.508 6.79e-06 ***
## firms                  1.894e-02  4.598e-03   4.119 3.91e-05 ***
## bachelors              1.046e+03  1.640e+01  63.773  < 2e-16 ***
## home_ownership         5.449e+02  1.874e+01  29.079  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7125 on 2961 degrees of freedom
##   (176 observations deleted due to missingness)
## Multiple R-squared:  0.6273, Adjusted R-squared:  0.6267 
## F-statistic: 996.7 on 5 and 2961 DF,  p-value: < 2.2e-16

Problem 3

Which 2 numbers in the summary output describe the overall performance of the model. Use these two numbers in appropriate sentences to describe how well your model performed.

The model explained 62.73 percent of the residual variance.

The absolute value of the typical mistake made by the model was 7125.

Place your answer here.

Problem 4

Examine the p-values for the individual coefficients of the model. Can you reject the hypothesis that the true coefficient value is zero in every case?

In all cases, the p-values are below .05. So we can always reject the hypothesis that the true coefficient is zero.

Problem 5

Look at the signs of the coefficients. Do all of them have the signs that you would expect? Note any exceptions.

I expected all of these variables to be indicators of higher income and that turned out to be true.

Problem 6

Construct forecasts of median household income for three different countys: Thurston County Washington, Los Angeles County California and Fayette County Kentucky. Use the predict() or augment() functions in R. Describe how the forecasts compare with the actuals.

test = filter(countyComplete, 
        state == "Washington" & name == "Thurston County" | 
        state == "California" & name == "Los Angeles County" |
        state == "Kentucky" & name == "Fayette County") 


x = predict(lm1,newdata=test)
x
##        1        2        3 
## 66846.01 54352.63 53610.43
test %>% select(state,name,median_household_income)
##        state               name median_household_income
## 1 California Los Angeles County                   55476
## 2   Kentucky     Fayette County                   47469
## 3 Washington    Thurston County                   60930

Put your verbal answer here.

Things I would do to debug/improve my model.

First get a more convenient datframe with only the variables I want.

countyComplete %>% select(median_household_income, persons_per_household, density, firms, bachelors, home_ownership) -> myvars

Now look at the relationships between the pairs of variables in myvars.

plot(myvars)

cor(myvars)
##                         median_household_income persons_per_household
## median_household_income              1.00000000           0.117497589
## persons_per_household                0.11749759           1.000000000
## density                              0.14107089           0.003251426
## firms                                        NA                    NA
## bachelors                            0.68409243          -0.118615183
## home_ownership                       0.06705549          -0.073771157
##                              density firms  bachelors home_ownership
## median_household_income  0.141070892    NA  0.6840924     0.06705549
## persons_per_household    0.003251426    NA -0.1186152    -0.07377116
## density                  1.000000000    NA  0.2324347    -0.31836611
## firms                             NA     1         NA             NA
## bachelors                0.232434681    NA  1.0000000    -0.27800358
## home_ownership          -0.318366108    NA -0.2780036     1.00000000

I see outlying observations for firms and density and a potentially non-linear relationship between my target variable and home ownership. This suggests trying transformed variables.

myvars %>% mutate(ld = log10(density +.01),
                  lf = log10(firms +.01),
                  hsq = home_ownership^2) -> myvars

lm2 = lm(median_household_income ~ + persons_per_household + ld + lf + bachelors + + home_ownership + hsq,data = myvars)
summary(lm2)
## 
## Call:
## lm(formula = median_household_income ~ +persons_per_household + 
##     ld + lf + bachelors + +home_ownership + hsq, data = myvars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29155  -4317    -32   3904  55172 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -31777.391   5687.063  -5.588 2.51e-08 ***
## persons_per_household  10532.432    549.816  19.156  < 2e-16 ***
## ld                       306.420    308.502   0.993 0.320669    
## lf                      2706.699    414.510   6.530 7.72e-11 ***
## bachelors                960.016     18.754  51.190  < 2e-16 ***
## home_ownership             6.285    157.073   0.040 0.968086    
## hsq                        3.881      1.138   3.410 0.000659 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7032 on 2960 degrees of freedom
##   (176 observations deleted due to missingness)
## Multiple R-squared:  0.6371, Adjusted R-squared:  0.6364 
## F-statistic: 866.2 on 6 and 2960 DF,  p-value: < 2.2e-16
aug2 = augment(lm2)
## Warning: Deprecated: please use `purrr::possibly()` instead
## Warning: Deprecated: please use `purrr::possibly()` instead
## Warning: Deprecated: please use `purrr::possibly()` instead
## Warning: Deprecated: please use `purrr::possibly()` instead
## Warning: Deprecated: please use `purrr::possibly()` instead

Let’s try a model with only the two strongest variables. These are bachelors and persons per household.

lm3 = lm(median_household_income ~ + persons_per_household  + bachelors, data = myvars)
summary(lm3)
## 
## Call:
## lm(formula = median_household_income ~ +persons_per_household + 
##     bachelors, data = myvars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35069  -4572   -160   4626  46743 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3008.5     1535.9   1.959   0.0502 .  
## persons_per_household   9271.5      580.1  15.982   <2e-16 ***
## bachelors                943.7       16.8  56.161   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8102 on 3140 degrees of freedom
## Multiple R-squared:  0.508,  Adjusted R-squared:  0.5077 
## F-statistic:  1621 on 2 and 3140 DF,  p-value: < 2.2e-16