2025-11-08

Is there a relationship between two variables?

Suppose we have demographic data for counties in the Midwest, and we want to model the data and find a relationship that lets us predict the high school education level from the percentage of children living below the poverty line.

Let’s first use a scatterplot.

Introduction to Dataset midwest

For reference, this is the dataset we will be using. In particular, we will focus on “perchsd” (percent with a high school diploma) and “percchildbelowpovert” (percent of children below poverty).

str(midwest, 30)
tibble [437 × 28] (S3: tbl_df/tbl/data.frame)
 $ PID                 : int [1:437] 561 562 563 564 565 566 567 568 569 570 ...
 $ county              : chr [1:437] "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
 $ state               : chr [1:437] "IL" "IL" "IL" "IL" ...
 $ area                : num [1:437] 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
 $ poptotal            : int [1:437] 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
 $ popdensity          : num [1:437] 1271 759 681 1812 324 ...
 $ popwhite            : int [1:437] 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
 $ popblack            : int [1:437] 1702 3496 429 127 547 50 1 111 16 16559 ...
 $ popamerindian       : int [1:437] 98 19 35 46 14 65 8 30 8 331 ...
 $ popasian            : int [1:437] 249 48 16 150 5 195 15 61 23 8033 ...
 $ popother            : int [1:437] 124 9 34 1139 6 221 0 84 6 1596 ...
 $ percwhite           : num [1:437] 96.7 66.4 96.6 95.3 90.2 ...
 $ percblack           : num [1:437] 2.575 32.9 2.862 0.412 9.373 ...
 $ percamerindan       : num [1:437] 0.148 0.179 0.233 0.149 0.24 ...
 $ percasian           : num [1:437] 0.3768 0.4517 0.1067 0.4869 0.0857 ...
 $ percother           : num [1:437] 0.1876 0.0847 0.2268 3.6973 0.1028 ...
 $ popadults           : int [1:437] 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
 $ perchsd             : num [1:437] 75.1 59.7 69.3 75.5 68.9 ...
 $ percollege          : num [1:437] 19.6 11.2 17 17.3 14.5 ...
 $ percprof            : num [1:437] 4.36 2.87 4.49 4.2 3.37 ...
 $ poppovertyknown     : int [1:437] 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
 $ percpovertyknown    : num [1:437] 96.3 99.1 95 98.5 82.5 ...
 $ percbelowpoverty    : num [1:437] 13.15 32.24 12.07 7.21 13.52 ...
 $ percchildbelowpovert: num [1:437] 18 45.8 14 11.2 13 ...
 $ percadultpoverty    : num [1:437] 11.01 27.39 10.85 5.54 11.14 ...
 $ percelderlypoverty  : num [1:437] 12.44 25.23 12.7 6.22 19.2 ...
 $ inmetro             : int [1:437] 0 0 0 1 0 0 0 0 0 1 ...
 $ category            : chr [1:437] "AAR" "LHR" "AAR" "ALU" ...

Scatterplot
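
The plot itself is not reproduced here. As a minimal sketch, assuming the ggplot2 package (which ships the midwest dataset) is installed, a comparable scatterplot could be drawn as follows; the object name p_scatter and the axis labels are illustrative choices, not taken from the original slide.

```r
library(ggplot2)  # provides the plotting functions and the midwest dataset

# Poverty on the x-axis (predictor), diplomas on the y-axis (response)
p_scatter <- ggplot(midwest, aes(x = percchildbelowpovert, y = perchsd)) +
  geom_point(alpha = 0.5) +  # semi-transparent points reduce overplotting
  labs(x = "Percent of Children Below Poverty",
       y = "Percent with High School Diplomas")

p_scatter  # printing the object renders the plot
```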

This scatterplot suggests a relationship between the two variables: counties with a lower percentage of children living below poverty tend to have a higher percentage of high school diplomas. But how do we predict the percentage of high school diplomas from the percentage of children in poverty?

Simple Linear Regression - Model

A method in statistics used to model the relationship between two variables by fitting a best-fit line.

\(Y = \beta_0 + \beta_1\cdot x + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(\mu=0; \,\,\sigma^2)\)

\(\beta_0\) = y-intercept (estimated from the data)
\(\beta_1\) = slope (estimated from the data)

\(\varepsilon\) = random error term
- Independent
- Normally distributed with mean 0 and some fixed constant variance


If the trend is upward (\(\beta_1 > 0\)), the correlation is positive. A downward trend (\(\beta_1 < 0\)) indicates negative correlation.
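
To check the direction of the trend numerically, one option is the Pearson correlation coefficient. The sketch below assumes ggplot2 is installed (only for the midwest dataset); the variable name r is illustrative.

```r
library(ggplot2)  # only needed here for the midwest dataset

# Pearson correlation between the two percentages
r <- cor(midwest$percchildbelowpovert, midwest$perchsd)

sign(r)  # -1 indicates a downward (negative) trend, +1 an upward one
r^2      # in simple linear regression this equals the model's R-squared
```

A negative r corresponds to a negative slope in the fitted line, and its square is the share of variance in the response that the line explains.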

Simple Linear Regression - Fitted

Using the fitted model, where the hats denote values estimated from the data:
\(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1\cdot X\)

Our equation becomes:
Predicted % of High School Diplomas = \(\hat{\beta}_0 + \hat{\beta}_1\cdot\)% of Children Below Poverty
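
The slides do not show how the coefficients are obtained. Assuming ordinary least squares, which is what R's lm() uses, and writing \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the estimated intercept and slope, \(x_i\) for a county's % of children below poverty, and \(y_i\) for its % of high school diplomas, the estimates are:

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}
\]

Here \(\bar{x}\) and \(\bar{y}\) are the sample means and \(n\) is the number of counties.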

Calculation of Data

## 
## Call:
## lm(formula = perchsd ~ percchildbelowpovert, data = midwest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.9842  -2.4723   0.0752   2.8040  12.8634 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          82.25914    0.54411  151.18   <2e-16 ***
## percchildbelowpovert -0.50425    0.03029  -16.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.572 on 435 degrees of freedom
## Multiple R-squared:  0.3891, Adjusted R-squared:  0.3877 
## F-statistic: 277.1 on 1 and 435 DF,  p-value: < 2.2e-16
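
The output above matches what summary() prints for the lm() call shown in its "Call:" line. A minimal sketch of how to reproduce it and use the fit for a prediction follows, assuming ggplot2 is installed for the dataset; the names fit, new_county, and pred, and the 20% example value, are illustrative.

```r
library(ggplot2)  # only needed here for the midwest dataset

# Fit the simple linear regression: diplomas explained by child poverty
fit <- lm(perchsd ~ percchildbelowpovert, data = midwest)
summary(fit)  # coefficient table, residual standard error, R-squared

# Predict for a hypothetical county where 20% of children are below poverty
new_county <- data.frame(percchildbelowpovert = 20)
pred <- predict(fit, newdata = new_county)
pred
```

By hand, using the printed coefficients: 82.25914 - 0.50425 × 20 ≈ 72.17, so the model predicts about 72% with high school diplomas for such a county. The slope says each additional percentage point of children below poverty is associated with roughly half a percentage point fewer diplomas.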

Percent of Children Below Poverty vs. Percent with High School Diplomas

The downward-sloping regression line shows the trend and gives the predicted y-value for a given x-value.

Children Below Poverty (%) vs. High School Diplomas (%) in States with Simple Linear Regressions
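
The code for this per-state plot is not shown. One plausible way to draw it, purely as an assumption about what the slide contained, is to facet by state so that each panel gets its own least-squares line; the object name p_states is illustrative.

```r
library(ggplot2)  # provides the plotting functions and the midwest dataset

p_states <- ggplot(midwest, aes(x = percchildbelowpovert, y = perchsd)) +
  geom_point(alpha = 0.5) +
  # fitted separately within each panel, without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ state) +  # one panel per state in the dataset
  labs(x = "Children Below Poverty (%)",
       y = "High School Diplomas (%)")

p_states
```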

Example with 3 variables in the plot

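library(ggplot2)

# inmetro is a 0/1 integer, so ggplot2 maps it to a continuous color scale
ggplot(midwest, aes(x = percchildbelowpovert, y = perchsd, color = inmetro)) +
  geom_point(alpha = 0.5) +  # drawn once; a second opaque layer would hide the transparency
  labs(x = "Percent of Children Below Poverty",
       y = "Percent with High School Diplomas") +
  # a single least-squares line over all counties, without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue")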