Suppose that you want to build a regression model that predicts the income level of counties in the United States from their educational level (percent of the population that earned a bachelor’s degree) and density (persons per square mile). The data comes from countyComplete in the openintro package.
# Load packages
library(tidyverse)
library(openintro)
countyComplete <- as_tibble(countyComplete)
# Scatterplot of per capita income against percent with a bachelor's degree
countyComplete %>%
  ggplot(aes(bachelors, per_capita_income)) +
  geom_point()
Hint: Make sure to interpret the direction and the magnitude of the relationship. In addition, keep in mind that correlation (or regression) coefficients do not show causation but only association.
cor(countyComplete$per_capita_income, countyComplete$bachelors, use = "pairwise.complete.obs")
## [1] 0.7924464
mod_1 <- lm(per_capita_income ~ bachelors, data = countyComplete)
# View summary of model 1
summary(mod_1)
##
## Call:
## lm(formula = per_capita_income ~ bachelors, data = countyComplete)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18032.7  -1708.2     73.8   1748.0  21756.5
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13087.680    142.091   92.11   <2e-16 ***
## bachelors     494.753      6.795   72.81   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3299 on 3141 degrees of freedom
## Multiple R-squared: 0.628, Adjusted R-squared: 0.6279
## F-statistic: 5302 on 1 and 3141 DF, p-value: < 2.2e-16
Yes. The bachelors coefficient has three stars, so its p-value is below 0.001. It is significant at the 0.1% level, which means it is also significant at the 1% and 5% levels. In other words, the confidence level is above 99.9%.
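The stars can be checked by hand. The sketch below recomputes the t value and a two-sided p-value from the numbers printed in the bachelors row of the summary. The variable names are made up for illustration, and the values are copied from the output above rather than read from the model object.

```r
# Values copied from the bachelors row of summary(mod_1)
estimate  <- 494.753
std_error <- 6.795
df_resid  <- 3141   # residual degrees of freedom

# t statistic: the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 2)                # 72.81, matching the summary
p_value < c(0.05, 0.01, 0.001)   # TRUE at every level
```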
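income = intercept + coefficient × bachelors, where bachelors is the percent of the county's population with a bachelor's degree. For a county where 70% of the population has a bachelor's degree: 13,088 + 494.753 × 70 = 13,088 + 34,633 ≈ 47,720. With rounded coefficients, 13,000 + 500 × 70 = 13,000 + 35,000 = 48,000, so the predicted per capita income is roughly $48,000.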
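The same prediction can be worked out in R. This is a minimal sketch that plugs the coefficients printed in the summary into the regression equation. The value of 70 percent is the example county used in the answer, and the variable names are chosen here for illustration.

```r
# Coefficients copied from summary(mod_1)
intercept <- 13087.680
slope     <- 494.753

# Example county: 70 percent of the population holds a bachelor's degree
bachelors_pct <- 70

# Regression equation: income = intercept + slope * bachelors
predicted_income <- intercept + slope * bachelors_pct
predicted_income   # 47720.39
```

With the fitted model in memory, `predict(mod_1, newdata = data.frame(bachelors = 70))` should give the same figure up to rounding of the coefficients.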
Residual standard error: 3299 on 3141 degrees of freedom. It means that the model's predictions typically miss a county's actual per capita income by about $3,299.
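Adjusted R-squared: 0.6279. About 63% of the variation in per capita income across counties is explained by the percent of the population with a bachelor's degree.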
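In a regression with a single predictor, R-squared is the square of the correlation coefficient, so the figure can be cross-checked against the correlation computed earlier. The sketch below does that and then applies the usual adjustment for the number of predictors. The sample size is inferred from the residual degrees of freedom (3141 + 2), which is an assumption based on the summary output.

```r
# Correlation between per_capita_income and bachelors, from the cor() output
r <- 0.7924464

# Sample size inferred from the residual degrees of freedom (n - 2 = 3141)
n <- 3141 + 2
p <- 1   # number of predictors in mod_1

# R-squared for a one-predictor model is the squared correlation
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(r_squared, 3)       # 0.628
round(adj_r_squared, 4)   # 0.6279
```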
Hint: Google something like “multivariate linear regression model”.
mod2 <- lm(per_capita_income ~ bachelors + density, data = countyComplete)
Hint: Discuss your answer by comparing the residual standard error and the adjusted R squared between the two models.
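One way to make the comparison in the hint concrete is to put the two statistics side by side. The sketch below refits both models and collects the residual standard error and adjusted R-squared in a small table. It assumes an openintro version that still provides countyComplete with the column names used above, and the name `comparison` is chosen here for illustration.

```r
library(openintro)   # assumed to provide countyComplete

# Refit both models so the block runs on its own
mod_1 <- lm(per_capita_income ~ bachelors, data = countyComplete)
mod2  <- lm(per_capita_income ~ bachelors + density, data = countyComplete)

# Collect the two statistics named in the hint
comparison <- data.frame(
  model         = c("bachelors", "bachelors + density"),
  residual_se   = c(sigma(mod_1), sigma(mod2)),
  adj_r_squared = c(summary(mod_1)$adj.r.squared,
                    summary(mod2)$adj.r.squared)
)

print(comparison)
```

A lower residual standard error and a higher adjusted R-squared for the second row would suggest that adding density improves the model.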