Correlation and Regression

The given dataset was computed from a sample of 67,248 New Hampshire residents aged 25 to 65. The sample data come from the U.S. Census Bureau's 2012-2016 ACS PUMS data.

ed_avg is the average years of schooling of New Hampshire residents in 2012-2016.
income_median is the median income of New Hampshire residents in 2012-2016. It is total income, including wages and salaries, self-employment income, and interest, dividend, and rent income.
region indicates the location within New Hampshire: 1 for southeastern regions, 0 otherwise.

Q1. Describe the Grafton and Coos Counties using ALL variables in the data set.

Answer: Grafton and Coos Counties are in row 3 of the data set (X = 3). PUMA_label, the name of the area, is "Grafton & Coos Counties". ed_avg, the average years of schooling in 2012-2016, is 18.52291. income_median, the median total income in 2012-2016 (wages and salaries, self-employment income, and interest, dividend, and rent income), is $30,000. region is 0, meaning the counties are not in the southeastern part of the state.

Q2. Create a scatterplot to examine the relationship between ed_avg and income_median.

## # A tibble: 10 x 5
##        X PUMA_label                    ed_avg income_median region
##    <int> <fct>                          <dbl>         <int>  <int>
##  1     1 Cheshire & Sullivan Counties    18.6         35390      0
##  2     2 Concord City                    19.0         36790      0
##  3     3 Grafton & Coos Counties         18.5         30000      0
##  4     4 Greater Nashua City             18.9         40800      1
##  5     5 Hillsborough County (Western)   19.2         42900      1
##  6     6 Lakes Region                    18.6         33050      0
##  7     7 Manchester City                 18.2         32000      1
##  8     8 Outer Manchester City           19.1         44700      1
##  9     9 Portsmouth City                 19.3         45000      1
## 10    10 Strafford Region                19.0         36200      0
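
One way the scatterplot for Q2 could be drawn is sketched below in base R. It assumes the ten rows printed above are re-entered by hand, so ed_avg is rounded to one decimal place; the data frame name residents_rounded is hypothetical and is not the residents_25to65 object used in the original analysis.

```r
# Hypothetical re-entry of the printed table. ed_avg is rounded to one
# decimal here, so results differ slightly from those on the full data.
residents_rounded <- data.frame(
  PUMA_label = c("Cheshire & Sullivan Counties", "Concord City",
                 "Grafton & Coos Counties", "Greater Nashua City",
                 "Hillsborough County (Western)", "Lakes Region",
                 "Manchester City", "Outer Manchester City",
                 "Portsmouth City", "Strafford Region"),
  ed_avg = c(18.6, 19.0, 18.5, 18.9, 19.2, 18.6, 18.2, 19.1, 19.3, 19.0),
  income_median = c(35390, 36790, 30000, 40800, 42900,
                    33050, 32000, 44700, 45000, 36200),
  region = c(0, 0, 0, 1, 1, 0, 1, 1, 1, 0)
)

# Predictor on the x-axis, outcome on the y-axis
plot(income_median ~ ed_avg, data = residents_rounded,
     xlab = "Average years of schooling (ed_avg)",
     ylab = "Median income in dollars (income_median)",
     main = "Median income vs. average schooling",
     pch = 19)
```

An upward-sloping cloud of points in this plot would be consistent with a positive relationship between the two variables.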

Q3. Compute the correlation coefficient between the two variables and interpret them.

Hint: Make sure to interpret the direction and the magnitude of the relationship. In addition, keep in mind that correlation (or regression) coefficients do not show causation but only association.

## [1] 0.8622811

Answer: The correlation coefficient between ed_avg and income_median is 0.8622811. The sign is positive, so areas with more average schooling tend to have higher median income. By the rule of thumb, a coefficient greater than 0.6 in absolute value indicates a strong relationship, so this is a strong positive correlation. It shows association only, not causation.
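
A minimal sketch of the correlation computation follows, assuming the values are typed in from the printed table. Because ed_avg is rounded to one decimal there, the result should be close to, but not exactly, the reported 0.8622811. The vector names mirror the column names; reported_r is a hypothetical name for the value shown in the output above.

```r
# Values re-entered from the printed table (ed_avg rounded to one decimal)
ed_avg <- c(18.6, 19.0, 18.5, 18.9, 19.2, 18.6, 18.2, 19.1, 19.3, 19.0)
income_median <- c(35390, 36790, 30000, 40800, 42900,
                   33050, 32000, 44700, 45000, 36200)

# Pearson correlation (the default method of cor())
r <- cor(ed_avg, income_median)
r

# Squaring the reported coefficient gives the share of variance in
# income_median associated with ed_avg in a one-predictor model
reported_r <- 0.8622811
reported_r^2   # about 0.7435
```

The squared value agrees with the Multiple R-squared of 0.7435 reported for the one-predictor regression in Q4.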

Q4. Build a regression model to predict income_median using ed_avg, save the regression result in mod_1, and show the summary result.

## 
## Call:
## lm(formula = income_median ~ ed_avg, data = residents_25to65)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3643.9 -2548.6   655.8  1730.7  4150.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -201503      49675  -4.056  0.00365 **
## ed_avg         12695       2636   4.816  0.00133 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2891 on 8 degrees of freedom
## Multiple R-squared:  0.7435, Adjusted R-squared:  0.7115 
## F-statistic: 23.19 on 1 and 8 DF,  p-value: 0.001328

Q5. Is the coefficient of ed_avg statistically significant at 5%? How do you know?

Hint: Discuss your answer in terms of the number of stars in the summary result. Refer to the interpretation section in quiz4_a.

Answer: Yes. In the Coefficients table of the summary, the ed_avg row ends with two stars (**) at the far right. According to the Signif. codes line, two stars mean the p-value is between 0.001 and 0.01; here Pr(>|t|) is 0.00133. Because 0.00133 is less than 0.05, the coefficient is statistically significant at the 5% level (and also at the 1% level).
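
As a check on the stars, the p-value can be recomputed from the t value and the residual degrees of freedom printed in the mod_1 summary. This is a sketch that uses only those two printed numbers; the variable names are illustrative.

```r
t_value  <- 4.816  # t value for ed_avg in the mod_1 summary
df_resid <- 8      # residual degrees of freedom in mod_1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)
p_value          # approximately 0.00133

# Significant at the 5% level if the p-value is below 0.05
p_value < 0.05   # TRUE
```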

Q6. Further develop the regression model above by adding another variable, region, save the regression result in mod_2, and show the summary result.

## 
## Call:
## lm(formula = income_median ~ ed_avg + region, data = residents_25to65)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2016.2  -778.4  -373.5   353.4  2780.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -166192      30700  -5.413 0.000994 ***
## ed_avg         10701       1638   6.532 0.000324 ***
## region          4524       1136   3.981 0.005314 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1711 on 7 degrees of freedom
## Multiple R-squared:  0.9214, Adjusted R-squared:  0.899 
## F-statistic: 41.05 on 2 and 7 DF,  p-value: 0.0001359

Q7. Compare mod_1 and mod_2. Which of the two models better fits the data?

Hint: Discuss your answer by comparing the residual standard error and the adjusted R squared between the two models.

Answer: mod_2 fits the data better. The residual standard error falls from 2891 (on 8 degrees of freedom) in mod_1 to 1711 (on 7 degrees of freedom) in mod_2, so the typical prediction error is much smaller. The adjusted R squared rises from 0.7115 to 0.899. Because the adjusted R squared penalizes extra predictors, this increase means that adding region improves the model more than would be expected by chance. The ed_avg coefficient also becomes more significant, going from two stars (p = 0.00133) to three stars (p = 0.000324).
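
The adjusted R squared values can be reproduced from the Multiple R-squared figures in the two summaries with the standard adjustment formula. The sketch below assumes n = 10 observations (the ten rows of the data set); the helper adjusted_r2 is a hypothetical name, not a function from the original analysis.

```r
# Adjusted R squared: penalizes R squared for the number of predictors k
adjusted_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

n <- 10  # number of rows in the data set

adj_mod_1 <- adjusted_r2(0.7435, n, k = 1)  # one predictor: ed_avg
adj_mod_2 <- adjusted_r2(0.9214, n, k = 2)  # two predictors: ed_avg, region

round(adj_mod_1, 3)  # about 0.711
round(adj_mod_2, 3)  # about 0.899
```

Both values agree with the summaries up to rounding of the inputs, and mod_2 stays ahead even after the penalty for its extra predictor.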

Q8. How much median income does the second model predict for the Grafton and Coos Counties?

Hint: Note that the second model has two predictors. Use both predictors to compute the predicted income.

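Answer: The second model predicts a median income of about $32,022 for Grafton and Coos Counties: -166192 + 10701 × 18.52291 + 4524 × 0 ≈ 32,022, using the rounded coefficients from the summary. This is about $2,000 higher than the observed median income of $30,000 in the data.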
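
The prediction for Q8 can be worked out by plugging the Grafton and Coos values into the mod_2 equation. The sketch below uses the rounded coefficients printed in the summary, so it may differ by a few dollars from what predict() would return on the fitted model; the variable names are illustrative.

```r
# Coefficients as printed in the mod_2 summary (rounded)
b0       <- -166192  # intercept
b_ed     <- 10701    # ed_avg
b_region <- 4524     # region

# Grafton & Coos Counties: ed_avg = 18.52291, region = 0
ed_avg_gc <- 18.52291
region_gc <- 0

predicted <- b0 + b_ed * ed_avg_gc + b_region * region_gc
predicted              # about 32022

# Observed median income minus the prediction
30000 - predicted      # about -2022
```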

Q9. According to the result of the second regression model, are residents of southeastern regions of the State likely to make more income? Why or why not?

Hint: Discuss your answer based on the coefficient of region. You may refer to the interpretation section in quiz4_a.

Answer: Yes. In mod_2 the coefficient of region is 4524, so at the same level of average schooling, southeastern regions (region = 1) are predicted to have a median income about $4,524 higher than the rest of the state (region = 0). The coefficient has two stars (p = 0.005314), so it is statistically significant at the 5% level. This is an association, not proof that living in the southeast causes higher income.
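
To illustrate what the region coefficient means, the sketch below compares two hypothetical areas with the same ed_avg that differ only in region, using the rounded mod_2 coefficients. The function predict_income and the ed_avg value of 19 are assumptions chosen for illustration.

```r
# Coefficients as printed in the mod_2 summary (rounded)
b0       <- -166192
b_ed     <- 10701
b_region <- 4524

# Hypothetical helper: mod_2 prediction from the printed coefficients
predict_income <- function(ed_avg, region) {
  b0 + b_ed * ed_avg + b_region * region
}

# Same schooling, different region: the gap equals the region coefficient
gap <- predict_income(19, region = 1) - predict_income(19, region = 0)
gap        # 4524

# Two-sided p-value for region, from its t value and 7 residual df
p_region <- 2 * pt(-abs(3.981), df = 7)
p_region   # approximately 0.0053
```

The gap does not depend on the ed_avg value chosen, because the model has no interaction between ed_avg and region.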

Josh Lorden

April 15, 2019