Suppose that you want to build a regression model that predicts the income level of counties in the United States, using their educational level (percent of the population that earned a bachelor’s degree) and density (persons per square mile). The data come from the countyComplete data set in the openintro package.

# Load packages
library(tidyverse)
library(openintro)
countyComplete <- as_tibble(countyComplete)

Q1 Describe the first observation using the following variables: state, name, pop2010, bachelors, per_capita_income, and density.

Q2 Create a scatterplot to examine the relationship between bachelors and per_capita_income.

countyComplete %>%
  ggplot(aes(x = bachelors, y = per_capita_income)) +
  geom_point()

Q3 Compute the correlation coefficient between the two variables and interpret it.

Hint: Make sure to interpret the direction and the magnitude of the relationship. In addition, keep in mind that correlation (or regression) coefficients do not show causation but only association.

cor(countyComplete$per_capita_income, countyComplete$bachelors, use = "pairwise.complete.obs")
## [1] 0.7924464

Q4 Build a regression model to predict per_capita_income using bachelors, and show the summary result.

mod_1 <- lm(per_capita_income ~ bachelors, data = countyComplete)
# View summary of model 1
summary(mod_1)
## 
## Call:
## lm(formula = per_capita_income ~ bachelors, data = countyComplete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18032.7  -1708.2     73.8   1748.0  21756.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13087.680    142.091   92.11   <2e-16 ***
## bachelors     494.753      6.795   72.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3299 on 3141 degrees of freedom
## Multiple R-squared:  0.628,  Adjusted R-squared:  0.6279 
## F-statistic:  5302 on 1 and 3141 DF,  p-value: < 2.2e-16

Q5 Is the coefficient of bachelors statistically significant at 5%? Interpret the coefficient.

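Yes. The p-value for bachelors is below 2e-16, which the three stars flag as significant at the 0.1% level, so it is also significant at the 1% and 5% levels. The coefficient of 494.753 means that a one-percentage-point increase in the share of a county’s population with a bachelor’s degree is associated with a predicted per capita income that is about $495 higher. This is an association, not evidence of causation.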
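As a check on the significance answer, the t value and a two-sided p-value can be recomputed from the estimate and standard error printed in the summary. This is a minimal sketch; the variable names are illustrative and the numbers are copied by hand from the output above.

```r
# Values copied from the summary(mod_1) output above
estimate  <- 494.753   # coefficient of bachelors
std_error <- 6.795     # its standard error
df_resid  <- 3141      # residual degrees of freedom

# t statistic: the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with df_resid degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 2)   # should agree with the reported t value
p_value < 0.05      # TRUE means significant at the 5% level
```

The recomputed t value rounds to the 72.81 shown in the summary table, and the p-value is far below 0.05.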

Q6 How much is a typical person predicted to make per year in a county where 70% of the population has a bachelor’s degree?

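income = intercept + coefficient of bachelors * value of bachelors (percent of the population with a bachelor’s degree) = 13087.680 + 494.753 * 70 ≈ 47,720. Rounding the coefficients to 13,000 and 500 gives 13,000 + 500 * 70 = 13,000 + 35,000 = 48,000.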
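The prediction can be reproduced from the unrounded coefficients reported by summary(mod_1). A minimal sketch, with the coefficients typed in by hand so it runs without the data; the variable names are illustrative.

```r
# Coefficients copied from the summary(mod_1) output above
intercept <- 13087.680
slope     <- 494.753

# County where 70% of the population has a bachelor's degree
bachelors_pct <- 70

# Fitted line: income = intercept + slope * bachelors
predicted_income <- intercept + slope * bachelors_pct
predicted_income
```

With the fitted model in memory, the same value should come from predict(mod_1, newdata = tibble(bachelors = 70)), which avoids copying coefficients by hand.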

Q7 Interpret the reported residual standard error.

The residual standard error is 3299 on 3141 degrees of freedom. It means that the model’s predictions of per capita income typically miss the actual county value by about $3,299.

Q8 Interpret the reported adjusted R squared.

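Adjusted R-squared: 0.6279. About 62.8% of the variation in per capita income across counties is explained by bachelors, after adjusting for the number of predictors in the model.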
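With a single predictor, R squared equals the squared correlation from Q3, and the adjusted value follows from the usual penalty for the number of predictors. The sketch below reproduces both figures from numbers already reported above; it assumes the correlation and the regression were computed on the same complete cases, and the variable names are illustrative.

```r
r <- 0.7924464   # correlation reported in Q3
n <- 3141 + 2    # residual degrees of freedom plus two estimated coefficients
p <- 1           # number of predictors in mod_1

# In simple linear regression, R squared is the squared correlation
r_squared <- r^2

# Adjusted R squared penalizes R squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(r_squared, 3)
round(adj_r_squared, 4)
```

Both results agree with the summary output (0.628 and 0.6279). The adjustment is tiny here because n is large relative to p.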

Q9 Further develop the regression model above by adding another variable, density.

Hint: Google something like “multivariate linear regression model”.

mod2 <- lm(per_capita_income ~ bachelors + density, data = countyComplete)

Q10 Compare model 1 and model 2. Which of the two models better fits the data?

Hint: Discuss your answer by comparing the residual standard error and the adjusted R squared between the two models.
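One way to line up the two statistics named in the hint is a small helper that pulls them from any fitted lm object. This is only a sketch: fit_stats is a hypothetical helper, not part of any package, and the models below are stand-ins fitted to R’s built-in mtcars data so that the block runs without openintro.

```r
# Hypothetical helper: collect the two fit statistics named in the hint
fit_stats <- function(model) {
  c(
    rse    = sigma(model),                  # residual standard error
    adj_r2 = summary(model)$adj.r.squared   # adjusted R squared
  )
}

# Stand-in models on built-in data (assumption: used only for illustration)
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp, data = mtcars)

# One row per model, one column per statistic
comparison <- rbind(small = fit_stats(small), large = fit_stats(large))
comparison
```

For the models in this exercise, the corresponding call would be rbind(mod_1 = fit_stats(mod_1), mod2 = fit_stats(mod2)). The model with the lower residual standard error and the higher adjusted R squared fits the data better.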