YPAR Mapping Workshop

Author

Healthy Future of Texas - Jacob M. Souch

🏠 YPAR Mapping Workshop

Introduction

In this workshop, you’ll explore and interpret socioeconomic data, focusing on teen birth rates, poverty, and education levels across Texas counties. Through hands-on analysis and mapping exercises, you’ll gain insights into how these variables interact, identifying patterns and potential areas for intervention. This workshop is divided into three key parts:

Univariate Analysis – Analyzing single variables to understand data distribution and trends.
Bivariate Analysis – Examining relationships between two variables to see correlations and implications.
Mapping in QGIS – Visualizing data spatially to capture geographical patterns and associations.

By the end, you’ll not only have practical data skills but also a deeper understanding of the factors influencing youth outcomes in Texas.

1️⃣PART I - Univariate Analysis (One Variable)

(Estimated time: 6 min)

Goals of Activity:

Combine Data from Different Sources

We’ll be using data from two different sources.

Source	Purpose	Notes
Texas DSHS	Teen Birth Rates, 2022
US Census Bureau	Percent of Population with Less than High School Education; Percent of Population in Poverty

Understanding your data

Bring together Texas data on teen birth rates, poverty rates, and education statistics to understand broader trends.
Organize data by counties or regions within Texas, ensuring data aligns for accurate comparison across each factor.
Merging these data points gives a well-rounded view of how different socioeconomic factors relate to teen birth rates.

Import Data 📦

Click the green triangle in each section to run the code.

Show the code

library(tidyverse)

#TX_Census_2022 <- read_csv("J:/My Drive/TexasDataYPAR.csv")

file_path <- file.choose()  # Opens a dialog for file selection. 

# Choose the "TexasDataYPAR.csv" file!

# Load the file (adjust function as per file type)
# For example, if it's a CSV file:

#Be sure to choose the file called ChooseThisFile.csv

TX_Census_2022 <- read_csv(file_path)

Variable Definitions

📖 Teen Birth Rate: The teen birth rate is defined as the number of live births per 1,000 females aged 15-19 in a given population and time period, typically calculated annually. It serves as a key indicator for adolescent health and social conditions and helps policymakers monitor and address issues related to teenage pregnancy.

📖 Percent of Population without HS Degree: This measure is based on Census variable B06009_002, which counts individuals aged 25 and over who have not completed high school (or the equivalent, such as a GED). Dividing that count by the total population aged 25 and over in a given geographic area gives the percentage. This measure is often used to assess educational attainment and socioeconomic factors within communities.

📖 Percent of Population in Poverty: B17001_002 represents individuals whose income falls below 100% of the federal poverty level, based on household size and composition. Calculating this as a percentage of the total population provides insight into poverty rates within a given area, which is valuable for assessing economic need and directing resources.
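The definitions above boil down to simple arithmetic: a count divided by a population, scaled to "per 1,000" or "per 100". The sketch below works through one made-up county. All counts and variable names are hypothetical and are not taken from the workshop data.

```r
# Hypothetical counts for one county, made up for illustration.
births_to_teens  <- 45     # live births to females aged 15-19
females_15_to_19 <- 1800   # female population aged 15-19
below_poverty    <- 2300   # people with income below the poverty level
total_population <- 20000  # total population used as the denominator

# A rate "per 1,000" multiplies the proportion by 1000.
teen_birth_rate <- births_to_teens / females_15_to_19 * 1000

# A percentage multiplies the proportion by 100.
pct_poverty <- below_poverty / total_population * 100

teen_birth_rate  # 25 births per 1,000 females aged 15-19
pct_poverty      # 11.5 percent
```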

Format Data 🛠️

Note

It’s said that 80% of data analysis is actually cleaning. 🫧🧼.

For the purposes of this exercise, several data cleaning steps have been performed for you. We’ll cover those in more detail in a future workshop. For now, you just have to know that all of these data sources were joined using a LEFT JOIN.

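The three tables (Poverty, Education, and Teen Birth) were matched by county code. A left join keeps every row of the first table and attaches values from the other tables wherever the county code matches; a county with no match in another table gets a missing value (NA) for that variable. 🙂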
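To make the idea of a left join concrete, here is a minimal sketch using dplyr on two small made-up tables. The table names, county codes, and values are hypothetical and are not the workshop data.

```r
library(dplyr)

# Hypothetical toy tables; county codes and values are made up.
poverty <- tibble(
  county_code = c("001", "003", "005"),
  pct_poverty = c(12.1, 8.4, 15.0)
)
teen_birth <- tibble(
  county_code     = c("001", "005"),
  teen_birth_rate = c(30.2, 41.7)
)

# left_join() keeps every row of the left table (poverty) and
# fills in NA where the right table has no matching county code.
joined <- left_join(poverty, teen_birth, by = "county_code")
print(joined)
```

In this toy example, county "003" stays in the result with a missing teen birth rate, which is why later steps may need `na.rm = TRUE`.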

If you’d like to learn more about basic data cleaning practices, click here.

Show the code

#library(tidycensus)
#install.packages("tidyverse")
library(tidyverse)

Show the code

#You can ignore this code, traveler. :) 

# #EDUCATIONAL ATTAINMENT
# #Getting Educational Attainment Data from Census
# library(tidycensus)
# library(dplyr)
# 
# # Load Census API key
# 
# # Fetch ACS data for people with less than high school education by county
# hs_grad_2022 <- get_acs(
#   geography = "county",
#   state = "TX",
#   variables = c(
#     "B15003_002", # No schooling completed
#     "B15003_003", # Nursery to 4th grade
#     "B15003_004", # 5th and 6th grade
#     "B15003_005", # 7th and 8th grade
#     "B15003_006", # 9th grade
#     "B15003_007", # 10th grade
#     "B15003_008", # 11th grade
#     "B15003_009"  # 12th grade, no diploma
#   ),
#   summary_var = "B15003_001", # Total population over 25
#   year = 2022
# )
# 
# # Sum the estimates for all education levels below high school by county
# hs_grad_2022 <- hs_grad_2022 %>%
#   group_by(GEOID, NAME, summary_est) %>%
#   summarize(
#     less_than_hs_estimate = sum(estimate),
#     .groups = "drop"
#   ) %>%
#   mutate(percent_less_than_hs = (less_than_hs_estimate / summary_est) * 100)
# 
# # View the results with percentage by county
# head(hs_grad_2022)
# 
# 
# #POVERTY RATE
# #TEEN BIRTH DATA

Show the code

# You can ignore this scary code, friend. :) Ask questions later if you're interested in learning about APIs.

####################################
# tidycensus::census_api_key(key = "YOUR_CENSUS_API_KEY", overwrite = TRUE, install = TRUE)
#  
#  census_vars <- load_variables(2022, "acs5", cache = TRUE)

Histograms 📊

📖Histogram:
A histogram is a bar graph that shows how often values occur within certain ranges. It helps display the distribution of data in a dataset.
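To see what "how often values occur within certain ranges" means, the sketch below asks `hist()` to return its bins instead of drawing them. The values in `x` are made up for illustration and the bin width of 4 is an arbitrary choice.

```r
# Hypothetical values, made up for illustration (not workshop data).
x <- c(2, 3, 3, 4, 5, 5, 5, 6, 8, 12)

# plot = FALSE returns the bin information instead of drawing the chart.
h <- hist(x, breaks = seq(0, 12, by = 4), plot = FALSE)

h$breaks  # bin edges: 0 4 8 12
h$counts  # values per bin: 4 5 1
```

By default each bin includes its upper edge, so the value 4 is counted in the first bin (0 to 4) and the value 8 in the second (4 to 8).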

Show the code

hist(TX_Census_2022$`Percent Less Than High School Diploma, 2022`, main = "Distribution of Percent of Population without High School Graduation")

Show the code

hist(TX_Census_2022$`Percent in Poverty, 2022`, main = "Distribution of Percent of Population in Poverty, 2022")

Show the code

hist(TX_Census_2022$`Teen Birth Rate, 2022`, main = "Distribution of Teen Birth Rate")

Show the code

# Calculate summary statistics for each variable
less_than_high_school <- summary(TX_Census_2022$`Percent Less Than High School Diploma, 2022`)
poverty <- summary(TX_Census_2022$`Percent in Poverty, 2022`)
teen_birth_rate <- summary(TX_Census_2022$`Teen Birth Rate, 2022`)

# Calculate standard deviations
sd_less_than_high_school <- sd(TX_Census_2022$`Percent Less Than High School Diploma, 2022`, na.rm = TRUE)
sd_poverty <- sd(TX_Census_2022$`Percent in Poverty, 2022`, na.rm = TRUE)
sd_teen_birth_rate <- sd(TX_Census_2022$`Teen Birth Rate, 2022`, na.rm = TRUE)

# Load knitr package
library(knitr)

# Create summary table
results <- data.frame(
  Variable = c("Percent Less Than High School Diploma, 2022", "Percent in Poverty, 2022", "Teen Birth Rate, 2022"),
  Min = c(less_than_high_school["Min."], poverty["Min."], teen_birth_rate["Min."]),
  Q1 = c(less_than_high_school["1st Qu."], poverty["1st Qu."], teen_birth_rate["1st Qu."]),
  Median = c(less_than_high_school["Median"], poverty["Median"], teen_birth_rate["Median"]),
  Mean = c(less_than_high_school["Mean"], poverty["Mean"], teen_birth_rate["Mean"]),
  Q3 = c(less_than_high_school["3rd Qu."], poverty["3rd Qu."], teen_birth_rate["3rd Qu."]),
  Max = c(less_than_high_school["Max."], poverty["Max."], teen_birth_rate["Max."]),
  SD = c(sd_less_than_high_school, sd_poverty, sd_teen_birth_rate)
)

# Display table with kable
kable(results, caption = "Summary Statistics for Texas Census Data, 2022", 
      col.names = c("Variable", "Min", "Q1", "Median", "Mean", "Q3", "Max", "SD"),
      format = "html") # Use "html" or "markdown" based on your environment

Summary Statistics for Texas Census Data, 2022
Variable	Q1	Median	Mean	Q3	Max	SD
Percent Less Than High School Diploma, 2022	2.194723	3.197616	4.040446	4.793414	33.33333	3.261233
Percent in Poverty, 2022	4.826984	6.480668	7.511558	9.384352	30.32870	4.183396
Teen Birth Rate, 2022	18.130000	26.170000	25.884856	34.760000	68.42000	13.405300

Descriptive Statistics

Describe the mean and standard deviation for each variable.

📖Mean:
The mean is the average of a set of numbers. You find it by adding up all the numbers and then dividing by how many numbers there are. It shows the typical value in a group.

🔑Key Idea: It’s the ‘signal’.

📖Standard Deviation:
Standard deviation tells us how spread out the numbers in a set are from the mean. A small standard deviation means the numbers are close to the mean, while a large one means they are more spread out.

🔑Key Idea: It’s the ‘noise’.
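As a small worked example of both definitions, the sketch below computes the mean and standard deviation by hand and compares them with R's built-in functions. The `values` vector is made up for illustration.

```r
# Hypothetical values, made up for illustration.
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Mean: add everything up and divide by how many numbers there are.
mean_by_hand <- sum(values) / length(values)

# Standard deviation: typical distance from the mean.
# R's sd() is the sample SD, which divides by n - 1.
squared_distances <- (values - mean_by_hand)^2
sd_by_hand <- sqrt(sum(squared_distances) / (length(values) - 1))

mean_by_hand  # 5, same as mean(values)
sd_by_hand    # same as sd(values)
```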

The mean percent of the population without a high school diploma is _____ percent. The SD is ____.
The mean percent of the population in poverty is _____ percent. The SD is ____.
The mean teen birth rate is _____ per 1,000. The SD is ____.

Did you get it? 👀

Make sure you understand what a histogram shows and why it’s important to make one before mapping anything.
Are you familiar with the basic descriptive statistics of each variable?
We’ll return to histograms in Part III.

2️⃣PART II - Bivariate Analysis (Two Variables)

(Estimated time: 6 min)

Goals of Activity:

Explore Relationships Between Data Points

📖Scatterplot:
A scatterplot is a graph that shows the relationship between two variables, with each point representing one observation. One variable is on the x-axis and the other on the y-axis.

Investigate connections between income, poverty, education, and teen birth rates to see if certain factors are more strongly associated.
Use statistical analysis to look for trends, like whether areas with lower education levels or higher poverty rates have different teen birth rates.
Understanding these relationships helps identify potential risk factors and areas for improvement within Texas.

Let’s make scatterplots. These plots help us see how variables are related. First, we’ll run a correlation test for each pair of variables.

Show the code

library(broom)

# Run the correlation tests and tidy the results
cor1 <- cor.test(TX_Census_2022$`Percent Less Than High School Diploma, 2022`, 
                 TX_Census_2022$`Percent in Poverty, 2022`) %>% tidy()
cor2 <- cor.test(TX_Census_2022$`Percent Less Than High School Diploma, 2022`, 
                 TX_Census_2022$`Teen Birth Rate, 2022`) %>% tidy()
cor3 <- cor.test(TX_Census_2022$`Percent in Poverty, 2022`, 
                 TX_Census_2022$`Teen Birth Rate, 2022`) %>% tidy()

# Combine the results into a single data frame
cor_results <- rbind(
  data.frame(Variables = "Less Than HS Diploma vs. Poverty", cor1),
  data.frame(Variables = "Less Than HS Diploma vs. Teen Birth Rate", cor2),
  data.frame(Variables = "Poverty vs. Teen Birth Rate", cor3)
)

# Select and rename columns for a pretty table
cor_results <- cor_results %>%
  select(Variables, estimate, statistic, p.value, conf.low, conf.high) %>%
  rename(
    "Correlation Coefficient" = estimate,
    "t-statistic" = statistic,
    "p-value" = p.value,
    "Confidence Interval Lower Bound" = conf.low,
    "Confidence Interval Upper Bound" = conf.high
  )

# Print the table with kable for a nice format
cor_results %>%
  kable(
    caption = "Correlation Results for Texas Census Data, 2022",
    format = "markdown",  # or use "html" for HTML output, "latex" for LaTeX
    digits = 3
  )

Correlation Results for Texas Census Data, 2022
Variables	Correlation Coefficient	t-statistic	Confidence Interval Lower Bound	Confidence Interval Upper Bound
Less Than HS Diploma vs. Poverty	0.269	4.429	0.151	0.379
Less Than HS Diploma vs. Teen Birth Rate	0.455	6.687	0.328	0.566
Poverty vs. Teen Birth Rate	0.382	5.404	0.247	0.503
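As a rough illustration of where the numbers in a correlation table come from, the sketch below runs `cor.test()` on made-up values and rebuilds the t-statistic from the correlation coefficient. The vectors `x` and `y` are hypothetical and are not workshop data.

```r
# Hypothetical paired values, made up for illustration.
x <- c(5, 8, 10, 12, 15, 18, 20, 25)
y <- c(14, 20, 19, 28, 30, 29, 40, 45)

res <- cor.test(x, y)        # Pearson correlation by default
r   <- unname(res$estimate)  # correlation coefficient, between -1 and 1
df  <- unname(res$parameter) # degrees of freedom: number of pairs minus 2

# The t-statistic reported by cor.test() can be rebuilt from r and df.
t_manual <- r * sqrt(df) / sqrt(1 - r^2)

c(r = r, t_reported = unname(res$statistic), t_manual = t_manual)
res$conf.int                 # 95% confidence interval for r
```

A coefficient near 0 suggests little linear relationship, while values closer to 1 or -1 suggest a stronger positive or negative one. A confidence interval that excludes 0 goes along with a large t-statistic.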

Interpreting scatterplots 📊

What kind of correlation do you observe? Does this make sense? Why or why not?

There is a ____________ relationship between poverty and percent of population without high school diploma.
There is a ____________ relationship between poverty and teen birth rate.
There is a ____________ relationship between percent of population without high school diploma and teen birth rate.

Show the code

#TXData <- left_join(TX_Census_2022, teen_birth_2022, by = "County")

panel.lines <- function(x, y) {
  points(x, y)
  abline(lm(y ~ x), col = "blue")  # Add regression line
}


TX_Census_2022 %>%
  select(
    `Teen Birth Rate, 2022`,
    `Percent in Poverty, 2022`,
    `Percent Less Than High School Diploma, 2022`
  ) %>%
  pairs(lower.panel = panel.lines, upper.panel = panel.lines)

What do you notice about the plots? Can you summarize your findings?

Did you expect these results?

Did you get it? 👀

Make sure you understand Parts I and II.
Make sure you’re familiar with normality, histograms, and correlation.
Make sure you’re familiar with the descriptive statistics of all three variables. You’ll need to keep these figures in mind for Part III.

3️⃣PART III - Mapping!

Refer to the QGIS project in the Drive.

Congratulations on making it to this part! A few housekeeping things:

The data have been exported from R after being cleaned, summarized, and formatted.
The QGIS file has everything you need to continue with this workshop.
Any questions, comments or concerns?

See you in QGIS!

Y’all come back now. No seriously, we have to wrap up later.

View after exploring the maps in QGIS.

✨Maps Answer Key and Results✨

Univariate Maps

Multivariate Maps

Interactive QGIS Map

Nothing to see here.

⭐BONUS Normality Vibe Check (Is it a Bell 🔔 Curve?)

Feel free to skip this section!

How do we check for normality? One popular method is the Shapiro-Wilk Test for Normality.

You don’t have to know the nitty-gritty of how it works. Basically, this test asks: how much does the distribution look like a bell curve?

Note

A good W score is close to 1. For the purposes of this exercise, we can proceed with the data as they are. However, in more advanced analyses we would perform transformations and carry them throughout the rest of the analysis.

💡MAIN IDEA: Values of W close to 1 suggest normality.

A significant test (p-value less than 0.05) indicates non-normality ☹️.
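To get a feel for how W and the p-value behave, the sketch below runs the test on simulated data: once on right-skewed values and once after a log transformation, which is one example of the kind of transformation mentioned in the note. The data are simulated for illustration only, and the seed and distribution settings are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed so the simulated values are reproducible

# Simulated right-skewed values (hypothetical, not workshop data).
skewed <- rlnorm(200, meanlog = 3, sdlog = 0.6)

# Taking the log of log-normal values gives bell-shaped values.
logged <- log(skewed)

sw_skewed <- shapiro.test(skewed)
sw_logged <- shapiro.test(logged)

# W is stored in $statistic and the p-value in $p.value.
data.frame(
  Data    = c("skewed", "log-transformed"),
  W       = unname(c(sw_skewed$statistic, sw_logged$statistic)),
  p_value = c(sw_skewed$p.value, sw_logged$p.value)
)
```

The skewed values should give a lower W and a very small p-value, while the log-transformed values should give a W closer to 1.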

The Shapiro-Wilk tests have been performed below for our three variables. What do these results suggest? Interpret each variable separately.

Show the code

shapiro.test(TX_Census_2022$`Percent Less Than High School Diploma, 2022`)


    Shapiro-Wilk normality test

data:  TX_Census_2022$`Percent Less Than High School Diploma, 2022`
W = 0.73845, p-value < 2.2e-16

Show the code

shapiro.test(TX_Census_2022$`Percent in Poverty, 2022`)


    Shapiro-Wilk normality test

data:  TX_Census_2022$`Percent in Poverty, 2022`
W = 0.90559, p-value = 1.505e-11

Show the code

shapiro.test(TX_Census_2022$`Teen Birth Rate, 2022`)


    Shapiro-Wilk normality test

data:  TX_Census_2022$`Teen Birth Rate, 2022`
W = 0.97709, p-value = 0.005879

⭐BONUS Trivariate Choropleth Maps (Only view after QGIS)⭐

What would three variables look like on a single legend?
What are the advantages of this view?
What are the disadvantages of this view?