# Load packages
library(tidyverse)
library(openintro)
countyComplete <- as_tibble(countyComplete)

Q1 Describe the first observation using the following variables: state, name, pop2010, bachelors, per_capita_income, and density.

This observation is Autauga County, Alabama. In 2010 it had a population of 54,571, of whom 21.6% held a bachelor's degree, a population density of 91.8 people per square mile, and a per-capita income of $24,568.

Q2 Create a scatterplot to examine the relationship between bachelors and per_capita_income.

countyComplete %>%
  ggplot(aes(bachelors, per_capita_income)) +
  geom_point()

Q3 Compute the correlation coefficient between the two variables and interpret it.

Hint: Make sure to interpret the direction and the magnitude of the relationship. In addition, keep in mind that correlation (or regression) coefficients do not show causation but only association.

cor(countyComplete$bachelors, countyComplete$per_capita_income, use = "pairwise.complete.obs")
## [1] 0.7924464

There is a strong, positive correlation (r ≈ 0.79) between per-capita income and the percentage of residents who hold a bachelor's degree: counties with a larger share of degree holders tend to have higher per-capita income. This is an association only. It is consistent with a degree leading to higher earnings, and equally with higher earners moving to counties where more residents hold degrees, but the correlation alone cannot distinguish between these explanations.

Q4 Build a regression model to predict per_capita_income using bachelors, and show the summary result.

mod_1 <- lm(per_capita_income ~ bachelors, data = countyComplete)

summary(mod_1)
## 
## Call:
## lm(formula = per_capita_income ~ bachelors, data = countyComplete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18032.7  -1708.2     73.8   1748.0  21756.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13087.680    142.091   92.11   <2e-16 ***
## bachelors     494.753      6.795   72.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3299 on 3141 degrees of freedom
## Multiple R-squared:  0.628,  Adjusted R-squared:  0.6279 
## F-statistic:  5302 on 1 and 3141 DF,  p-value: < 2.2e-16

Q5 Is the coefficient of bachelors statistically significant at 5%? Interpret the coefficient.

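The p-value for bachelors is less than 2e-16, far below 0.05, so the coefficient is statistically significant at the 5% level. The estimate of 494.753 means that a one-percentage-point increase in the share of residents with a bachelor's degree is associated with a per-capita income about $495 higher, on average.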
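
The significance of the bachelors coefficient can be rechecked from the numbers that summary() prints. The following is a minimal sketch that uses only the reported estimate, standard error, and residual degrees of freedom; the variable names (est, se, df_resid) are illustrative and do not come from the original analysis.

```r
# Values copied from the summary(mod_1) output above
est <- 494.753      # slope estimate for bachelors
se <- 6.795         # standard error of the slope
df_resid <- 3141    # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_value <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(0.975, df = df_resid) * se

round(t_value, 2)   # 72.81, the t value reported by summary()
p_value < 0.05      # TRUE: significant at the 5% level
ci                  # lower and upper bound of the interval
```

Because the confidence interval excludes zero, it leads to the same conclusion as the p-value.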

Q6 How much is a typical person predicted to make per year in a county where 70% of the population has a bachelor’s degree?

13087.68 + 494.753 * 70 = 47720.39

Per-capita income is predicted to be $47,720.39 in a county where 70% of residents hold a bachelor's degree.
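
The same prediction can be written as a small function of the fitted line. This is a sketch built from the rounded coefficients in the summary output; predict_income is a hypothetical helper name, not part of the original analysis.

```r
# Coefficients copied from the summary(mod_1) output above
intercept <- 13087.680
slope <- 494.753

# Hypothetical helper: predicted per-capita income for a given
# percentage of residents with a bachelor's degree
predict_income <- function(bachelors_pct) {
  intercept + slope * bachelors_pct
}

predict_income(70)   # 47720.39

# With the fitted model object in memory, the equivalent call that uses
# the unrounded coefficients would be:
# predict(mod_1, newdata = data.frame(bachelors = 70))
```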

Q7 Interpret the reported residual standard error.

The residual standard error of 3299 means that the model's predictions typically miss a county's actual per-capita income by about $3,299.
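
To make the definition concrete, the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below demonstrates this on synthetic data generated purely for illustration (it is not the county data); the names x, y, and fit are assumptions.

```r
# Synthetic data for illustration only; not the county data
set.seed(1)
x <- runif(200, min = 5, max = 60)
y <- 13000 + 500 * x + rnorm(200, sd = 3300)
fit <- lm(y ~ x)

# Residual standard error: sqrt(residual sum of squares / residual df)
rss <- sum(residuals(fit)^2)
rse_manual <- sqrt(rss / df.residual(fit))

# sigma() extracts the quantity that summary() labels
# "Residual standard error"
all.equal(rse_manual, sigma(fit))   # TRUE
```

Applied to the county model, sigma() on the fitted object should return the 3299 shown in the summary.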

Q8 Interpret the reported adjusted R squared.

The adjusted R-squared of 0.6279 indicates that the percentage of a county's population holding a bachelor's degree accounts for about 62.8% of the variability in per-capita income, after adjusting for the number of predictors.
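
In a regression with a single predictor, R-squared is the square of the correlation coefficient, and adjusted R-squared applies a small penalty for the number of predictors. The sketch below reproduces both reported values from numbers already shown above; the variable names are illustrative.

```r
r <- 0.7924464   # correlation reported by cor() in Q3
n <- 3141 + 2    # residual df plus the two estimated coefficients
p <- 1           # number of predictors in the model

# With one predictor, R-squared equals the squared correlation
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(r_squared, 3)       # 0.628
round(adj_r_squared, 4)   # 0.6279
```

With more than 3,000 counties and a single predictor the penalty is tiny, which is why the two values are nearly identical.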

Q9 Further develop the regression model above by adding another variable, density.

# Store the second model under its own name so model 1 is not overwritten
mod_2 <- lm(per_capita_income ~ bachelors + density, data = countyComplete)

summary(mod_2)
## 
## Call:
## lm(formula = per_capita_income ~ bachelors + density, data = countyComplete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18257.9  -1707.7     71.5   1749.6  22101.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.319e+04  1.433e+02  92.042  < 2e-16 ***
## bachelors   4.872e+02  6.963e+00  69.973  < 2e-16 ***
## density     1.623e-01  3.499e-02   4.639 3.65e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3289 on 3140 degrees of freedom
## Multiple R-squared:  0.6305, Adjusted R-squared:  0.6303 
## F-statistic:  2679 on 2 and 3140 DF,  p-value: < 2.2e-16

Q10 Compare model 1 and model 2. Which of the two models better fits the data?

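Model 2 fits the data slightly better: its residual standard error is lower (3289 vs. 3299) and its adjusted R-squared is higher (0.6303 vs. 0.6279). The density coefficient is statistically significant (p = 3.65e-06), but the improvement in fit is small, so adding density makes the model only marginally more accurate.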
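
One way to check whether the gain from adding density is more than noise is a partial F-test between the nested models. When a single predictor is added, the F statistic equals the square of that predictor's t value, so the test can be sketched from the reported output alone. The variable names below are illustrative.

```r
# Values copied from the summary output of the model that includes density
t_density <- 4.639   # t value for density
df_resid <- 3140     # residual degrees of freedom

# Partial F statistic for adding one predictor equals t squared
f_partial <- t_density^2

# Upper-tail p-value of the F distribution with 1 and df_resid df
p_value <- pf(f_partial, df1 = 1, df2 = df_resid, lower.tail = FALSE)

f_partial        # about 21.5
p_value < 0.05   # TRUE: density adds explanatory power beyond bachelors

# If the two fits are stored under separate names (hypothetically
# fit_small and fit_large), the same test can be run directly with:
# anova(fit_small, fit_large)
```

The p-value agrees with the one reported for density, up to rounding of the t value. A significant test does not mean the improvement is large; here the change in adjusted R-squared is only about 0.002.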