Suppose we have population data for counties in the Midwest, and we want to model the data and find a relationship that lets us predict the high school education level from the percentage of children living below the poverty line.
Let’s first use a scatterplot.
2025-11-08
For reference, this is the midwest dataset we will be using. In particular, we will focus on “perchsd” (percent with a high school diploma) and “percchildbelowpovert” (percent of children below poverty).
str(midwest, 30)
tibble [437 × 28] (S3: tbl_df/tbl/data.frame)
 $ PID                 : int [1:437] 561 562 563 564 565 566 567 568 569 570 ...
 $ county              : chr [1:437] "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
 $ state               : chr [1:437] "IL" "IL" "IL" "IL" ...
 $ area                : num [1:437] 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
 $ poptotal            : int [1:437] 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
 $ popdensity          : num [1:437] 1271 759 681 1812 324 ...
 $ popwhite            : int [1:437] 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
 $ popblack            : int [1:437] 1702 3496 429 127 547 50 1 111 16 16559 ...
 $ popamerindian       : int [1:437] 98 19 35 46 14 65 8 30 8 331 ...
 $ popasian            : int [1:437] 249 48 16 150 5 195 15 61 23 8033 ...
 $ popother            : int [1:437] 124 9 34 1139 6 221 0 84 6 1596 ...
 $ percwhite           : num [1:437] 96.7 66.4 96.6 95.3 90.2 ...
 $ percblack           : num [1:437] 2.575 32.9 2.862 0.412 9.373 ...
 $ percamerindan       : num [1:437] 0.148 0.179 0.233 0.149 0.24 ...
 $ percasian           : num [1:437] 0.3768 0.4517 0.1067 0.4869 0.0857 ...
 $ percother           : num [1:437] 0.1876 0.0847 0.2268 3.6973 0.1028 ...
 $ popadults           : int [1:437] 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
 $ perchsd             : num [1:437] 75.1 59.7 69.3 75.5 68.9 ...
 $ percollege          : num [1:437] 19.6 11.2 17 17.3 14.5 ...
 $ percprof            : num [1:437] 4.36 2.87 4.49 4.2 3.37 ...
 $ poppovertyknown     : int [1:437] 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
 $ percpovertyknown    : num [1:437] 96.3 99.1 95 98.5 82.5 ...
 $ percbelowpoverty    : num [1:437] 13.15 32.24 12.07 7.21 13.52 ...
 $ percchildbelowpovert: num [1:437] 18 45.8 14 11.2 13 ...
 $ percadultpoverty    : num [1:437] 11.01 27.39 10.85 5.54 11.14 ...
 $ percelderlypoverty  : num [1:437] 12.44 25.23 12.7 6.22 19.2 ...
 $ inmetro             : int [1:437] 0 0 0 1 0 0 0 0 0 1 ...
 $ category            : chr [1:437] "AAR" "LHR" "AAR" "ALU" ...
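The code that drew the scatterplot is not shown here. A minimal sketch that should produce a comparable plot is given below. It assumes the midwest data come from the ggplot2 package; the object name scatter and the axis labels are our own choices.

```r
library(ggplot2)  # provides ggplot() and the midwest dataset

# One point per county: child poverty on the x-axis, diploma rate on the y-axis
scatter <- ggplot(midwest, aes(x = percchildbelowpovert, y = perchsd)) +
  geom_point(alpha = 0.5) +  # semi-transparent points reduce overplotting
  labs(x = "Percent of Children Below Poverty",
       y = "Percent with High School Diplomas")

scatter  # printing the object draws the plot
```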
This scatterplot suggests a relationship between the two variables: counties with fewer children living below poverty tend to have a higher percentage of high school diplomas. But how do we predict the percentage of high school diplomas from the percentage of children in poverty?
Linear regression is a method in statistics used to model the relationship between two variables with a best-fit line.
\(Y = \beta_0 + \beta_1\cdot x + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(\mu=0; \,\,\sigma^2)\)
\(\beta_0\) = y-intercept (estimated from the data)
\(\beta_1\) = slope (estimated from the data)
\(\varepsilon\) = random error term, which is:
- independent across observations
- normally distributed with a mean of 0 and some fixed constant variance \(\sigma^2\)
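To make the model concrete, here is a small simulation sketch in base R. The parameter values (beta0, beta1, sigma) and the range of x are hypothetical, chosen only for illustration. It generates data from \(Y = \beta_0 + \beta_1\cdot x + \varepsilon\) and checks that lm() recovers values close to the ones we picked.

```r
set.seed(1)    # make the simulation reproducible

n     <- 500   # number of simulated observations
beta0 <- 2     # hypothetical true intercept
beta1 <- -0.5  # hypothetical true slope
sigma <- 1     # hypothetical standard deviation of the error term

x   <- runif(n, min = 0, max = 10)     # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)  # independent normal errors, mean 0
y   <- beta0 + beta1 * x + eps         # response generated by the model

sim_fit <- lm(y ~ x)  # least-squares fit of a straight line
coef(sim_fit)         # estimates should land near beta0 and beta1
```

The estimates will not match the chosen values exactly, because each simulated sample carries its own random error.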
If the trend is upward, the correlation is positive. A downward trend indicates a negative correlation.
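One way to check the direction of the trend numerically is the sample correlation coefficient. The sketch below assumes the midwest data from ggplot2; the variable name r is our own.

```r
library(ggplot2)  # loaded here only for the midwest dataset

# Pearson correlation between child poverty and high school diploma rates
r <- cor(midwest$percchildbelowpovert, midwest$perchsd)

r       # a negative value indicates a downward trend
sign(r) # -1 for negative correlation, +1 for positive
```

For a regression with a single predictor, the square of this correlation equals the R-squared reported by the model summary.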
Using the fitted model: \(Y = \beta_0 + \beta_1\cdot X\)
Our equation becomes: % of High School Diplomas = \(\beta_0 + \beta_1\cdot\)% of Children Below Poverty
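Judging by the Call line in the output, the summary that follows appears to come from a fit like the one sketched here. The object name fit is our own; the formula and data argument are taken from the output itself.

```r
library(ggplot2)  # loaded here only for the midwest dataset

# Regress diploma percentage on child poverty percentage
fit <- lm(perchsd ~ percchildbelowpovert, data = midwest)

summary(fit)  # coefficients, standard errors, R-squared, and F-statistic
```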
##
## Call:
## lm(formula = perchsd ~ percchildbelowpovert, data = midwest)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -22.9842  -2.4723   0.0752   2.8040  12.8634
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          82.25914    0.54411  151.18   <2e-16 ***
## percchildbelowpovert -0.50425    0.03029  -16.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.572 on 435 degrees of freedom
## Multiple R-squared:  0.3891, Adjusted R-squared:  0.3877
## F-statistic: 277.1 on 1 and 435 DF,  p-value: < 2.2e-16
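The downward-sloping line captures the trend and gives the predicted y-value for a given x-value. With a slope of about -0.50, each additional percentage point of children below poverty is associated with roughly half a percentage point fewer high school diplomas.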
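As a worked example, the fitted coefficients can be plugged into the equation by hand. This sketch copies the estimates from the summary output; the helper name predict_hsd and the input value of 20 are hypothetical, chosen only for illustration.

```r
b0 <- 82.25914  # estimated intercept, copied from the summary output
b1 <- -0.50425  # estimated slope, copied from the summary output

# Predicted percent with high school diplomas for a given
# percent of children below poverty
predict_hsd <- function(pct_child_poverty) {
  b0 + b1 * pct_child_poverty
}

predict_hsd(20)  # 82.25914 - 0.50425 * 20 = 72.17414
```

If the fitted lm object is available, calling predict() on it with a newdata data frame containing a percchildbelowpovert column should give the same answer without copying coefficients by hand.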
Example with three variables in the plot (the third, inmetro, is shown as point color)
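library(ggplot2)  # provides ggplot() and the midwest dataset

ggplot(midwest, aes(x = percchildbelowpovert, y = perchsd)) +
  # inmetro is stored as 0/1 integers, so convert it to a factor for a discrete color scale
  geom_point(aes(color = factor(inmetro)), alpha = 0.5) +
  labs(x = "Percent of Children Below Poverty",
       y = "Percent with High School Diplomas",
       color = "In metro area") +
  # single least-squares line fitted to all counties
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue")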