Attached is the code written for the Week 13 Lab of Dr. Ken Blake’s JOUR 3841 Course (Data Skills for Media Professionals) at MTSU.
After running a summary of the myreg data frame, users can see that the relationship between DV and IV is statistically significant; IV’s value under “Pr(>|t|)” is exceptionally less than 0.05. The Adjusted R-squared value of 0.7001 means that each county’s percentage of residents over 24 with at least a Bachelor’s degree accounts for 70.01% of the variation in each county’s median family income. Additionally, while Williamson County (the most top-right point on the scatterplot) is considered an outlier, its removal from the graph would not make a significant enough impact on the graph to be considered notable.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/Educ_Income_2022.csv")
head(mydata, 10)
## GEOID County State PctCollege FamIncome
## 1 47001 Anderson County Tennessee 24.5 75637
## 2 47003 Bedford County Tennessee 17.1 71159
## 3 47005 Benton County Tennessee 11.1 65800
## 4 47007 Bledsoe County Tennessee 10.3 59695
## 5 47009 Blount County Tennessee 26.0 85194
## 6 47011 Bradley County Tennessee 23.9 75270
## 7 47013 Campbell County Tennessee 12.9 61629
## 8 47015 Cannon County Tennessee 17.5 71000
## 9 47017 Carroll County Tennessee 19.8 68542
## 10 47019 Carter County Tennessee 21.1 61776
# Specify the DV and IV
mydata$DV <- mydata$FamIncome #Edit YOURDVNAME
mydata$IV <- mydata$PctCollege #Edit YOURIVNAME
# Look at the DV and IV
ggplot(mydata, aes(x = DV)) + geom_histogram(color = "black", fill = "#1f78b4")
ggplot(mydata, aes(x = IV)) + geom_histogram(color = "black", fill = "#1f78b4")
# Creating and summarizing an initial regression model called myreg, and checking for bivariate outliers.
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
plot(mydata$IV, mydata$DV)
abline(lm(mydata$DV ~ mydata$IV))
summary(myreg)
##
## Call:
## lm(formula = DV ~ IV, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15380.9 -5904.4 538.2 4730.4 17582.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44664.55 1903.16 23.47 <0.0000000000000002 ***
## IV 1349.18 90.88 14.85 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7516 on 93 degrees of freedom
## Multiple R-squared: 0.7032, Adjusted R-squared: 0.7001
## F-statistic: 220.4 on 1 and 93 DF, p-value: < 0.00000000000000022