Relationship of TN College Graduates & Median Family Income, by County

Attached is the code written for the Week 13 Lab of Dr. Ken Blake’s JOUR 3841 Course (Data Skills for Media Professionals) at MTSU.

After running a summary of the myreg data frame, users can see that the relationship between DV and IV is statistically significant; IV’s value under “Pr(>|t|)” is exceptionally less than 0.05. The Adjusted R-squared value of 0.7001 means that each county’s percentage of residents over 24 with at least a Bachelor’s degree accounts for 70.01% of the variation in each county’s median family income. Additionally, while Williamson County (the most top-right point on the scatterplot) is considered an outlier, its removal from the graph would not make a significant enough impact on the graph to be considered notable.

# Installing required packages
if (!require("dplyr"))
  install.packages("dplyr")
if (!require("tidyverse"))
  install.packages("tidyverse")
library(dplyr)
library(ggplot2) 

mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/Educ_Income_2022.csv")
head(mydata, 10)
##    GEOID          County     State PctCollege FamIncome
## 1  47001 Anderson County Tennessee       24.5     75637
## 2  47003  Bedford County Tennessee       17.1     71159
## 3  47005   Benton County Tennessee       11.1     65800
## 4  47007  Bledsoe County Tennessee       10.3     59695
## 5  47009   Blount County Tennessee       26.0     85194
## 6  47011  Bradley County Tennessee       23.9     75270
## 7  47013 Campbell County Tennessee       12.9     61629
## 8  47015   Cannon County Tennessee       17.5     71000
## 9  47017  Carroll County Tennessee       19.8     68542
## 10 47019   Carter County Tennessee       21.1     61776
# Specify the DV and IV
mydata$DV <- mydata$FamIncome #Edit YOURDVNAME
mydata$IV <- mydata$PctCollege #Edit YOURIVNAME

# Look at the DV and IV
ggplot(mydata, aes(x = DV)) + geom_histogram(color = "black", fill = "#1f78b4")

ggplot(mydata, aes(x = IV)) + geom_histogram(color = "black", fill = "#1f78b4")

# Creating and summarizing an initial regression model called myreg, and checking for bivariate outliers.
options(scipen = 999)
myreg <- lm(DV ~ IV,
            data = mydata)
plot(mydata$IV, mydata$DV)
abline(lm(mydata$DV ~ mydata$IV))

summary(myreg)
## 
## Call:
## lm(formula = DV ~ IV, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15380.9  -5904.4    538.2   4730.4  17582.3 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 44664.55    1903.16   23.47 <0.0000000000000002 ***
## IV           1349.18      90.88   14.85 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7516 on 93 degrees of freedom
## Multiple R-squared:  0.7032, Adjusted R-squared:  0.7001 
## F-statistic: 220.4 on 1 and 93 DF,  p-value: < 0.00000000000000022