In this workshop, you’ll explore and interpret socioeconomic data, focusing on teen birth rates, poverty, and education levels across Texas counties. Through hands-on analysis and mapping exercises, you’ll gain insights into how these variables interact, identifying patterns and potential areas for intervention. This workshop is divided into three key parts:
Univariate Analysis – Analyzing single variables to understand data distribution and trends.
Bivariate Analysis – Examining relationships between two variables to see correlations and implications.
Mapping in QGIS – Visualizing data spatially to capture geographical patterns and associations.
By the end, you’ll not only have practical data skills but also a deeper understanding of the factors influencing youth outcomes in Texas.
1️⃣PART I - Univariate Analysis (One Variable)
(Estimated time: 6 min)
Goals of Activity:
Combine Data from Different Sources
We’ll be using data from two different sources.
| Source | Purpose | Notes |
|---|---|---|
| Texas DSHS | Teen Birth Rates, 2022 | |
| US Census Bureau | Percent of Population with Less than High School Education; Percent of Population in Poverty | |
Understanding your data
Bring together Texas data on teen birth rates, poverty rates, and education statistics to understand broader trends.
Organize data by counties or regions within Texas, ensuring data aligns for accurate comparison across each factor.
Merging these data points gives a well-rounded view of how different socioeconomic factors relate to teen birth rates.
Import Data 📦
Click the green triangle in each section to run the code.
library(tidyverse)

# TX_Census_2022 <- read_csv("J:/My Drive/TexasDataYPAR.csv")

# Opens a dialog for file selection.
# Choose the TexasDataYPAR.csv file!
file_path <- file.choose()

# Load the file (adjust function as per file type)
# For example, if it's a CSV file:
# Be sure to choose the file called ChooseThisFile.csv
TX_Census_2022 <- read_csv(file_path)
Variable Definitions
📖 Teen Birth Rate: The teen birth rate is defined as the number of live births per 1,000 females aged 15-19 in a given population and time period, typically calculated annually. It serves as a key indicator for adolescent health and social conditions and helps policymakers monitor and address issues related to teenage pregnancy.
📖 Percent of Population without HS Degree: The Census variable B06009_002 counts individuals aged 25 and over who have not completed high school (or the equivalent, such as a GED). Expressing that count as a percentage of the total population aged 25 and over in a given geographic area gives the Percent of Population without a High School Degree. This measure is often used to assess educational attainment and socioeconomic conditions within communities.
📖 Percent of Population in Poverty: B17001_002 represents individuals whose income falls below 100% of the federal poverty level, based on household size and composition. Calculating this as a percentage of the total population provides insight into poverty rates within a given area, which is valuable for assessing economic need and directing resources.
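The three definitions above come down to simple arithmetic. Here is a minimal sketch using made-up counts for a single hypothetical county; none of these numbers or variable names come from the workshop data.

```r
# Hypothetical counts for one county (illustrative only, not real data)
teen_births      <- 120    # live births to females aged 15-19 in one year
females_15_19    <- 4800   # females aged 15-19
below_poverty    <- 9500   # people with income below the poverty level
total_population <- 70000  # total population
no_hs_diploma    <- 6200   # people aged 25+ without a high school diploma
population_25up  <- 45000  # people aged 25+

# Teen birth rate: live births per 1,000 females aged 15-19
teen_birth_rate <- teen_births / females_15_19 * 1000

# Percentages: the count divided by the relevant total, times 100
pct_poverty <- below_poverty / total_population * 100
pct_no_hs   <- no_hs_diploma / population_25up * 100

teen_birth_rate        # 25 births per 1,000
round(pct_poverty, 1)  # 13.6 percent
round(pct_no_hs, 1)    # 13.8 percent
```

Note the different units: the teen birth rate is per 1,000, while the other two variables are percentages (per 100).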
Format Data 🛠️
Note
It’s said that 80% of data analysis is actually cleaning. 🫧🧼.
For the purposes of this exercise, several data cleaning steps have been performed for you. We’ll cover those in more detail in a future workshop. For now, you just have to know that all of these data sources were joined using a LEFT JOIN.
Counties were matched by county code, and rows for counties that appear in all three tables (Poverty, Education, and Teen Birth) were kept. 🙂
If you’d like to learn more about basic data cleaning practices, click here.
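For the curious, here is a minimal sketch of what a left join by county code looks like in dplyr. The tables, column names, and values are invented for illustration and are not the workshop's actual cleaning code.

```r
library(dplyr)

# Toy tables keyed by a hypothetical county code (not the workshop data)
teen_birth <- tibble(county_code = c("001", "003", "005"),
                     teen_birth_rate = c(22.1, 30.4, 18.9))
poverty    <- tibble(county_code = c("001", "003"),
                     pct_poverty = c(12.5, 9.8))
education  <- tibble(county_code = c("001", "003", "005"),
                     pct_less_than_hs = c(4.2, 3.1, 5.0))

# left_join() keeps every row of the left table and attaches the columns
# of the right table wherever the county code matches
joined <- teen_birth %>%
  left_join(poverty, by = "county_code") %>%
  left_join(education, by = "county_code")

joined  # county "005" has NA for pct_poverty because it had no match

# Keep only the counties that have values from all three tables
complete <- joined %>%
  filter(!is.na(pct_poverty), !is.na(pct_less_than_hs))
```

A left join never drops rows from the left table, so unmatched counties show up as `NA` and have to be filtered out in a separate step if only complete rows are wanted.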
# You can ignore this code, traveler. :)
#
# EDUCATIONAL ATTAINMENT
# Getting Educational Attainment Data from Census
# library(tidycensus)
# library(dplyr)
#
# # Load Census API key
#
# # Fetch ACS data for people with less than high school education by county
# # NOTE: in ACS table B15003 the codes run one grade at a time, so _002 to _009
# # stop at 5th grade; "12th grade, no diploma" is B15003_016.
# # Check the labels with load_variables(2022, "acs5") before reusing this code.
# hs_grad_2022 <- get_acs(
#   geography = "county",
#   state = "TX",
#   variables = c(
#     "B15003_002", # No schooling completed
#     "B15003_003", # Nursery school
#     "B15003_004", # Kindergarten
#     "B15003_005", # 1st grade
#     "B15003_006", # 2nd grade
#     "B15003_007", # 3rd grade
#     "B15003_008", # 4th grade
#     "B15003_009"  # 5th grade
#   ),
#   summary_var = "B15003_001", # Total population aged 25 and over
#   year = 2022
# )
#
# # Sum the estimates for the education levels listed above by county
# hs_grad_2022 <- hs_grad_2022 %>%
#   group_by(GEOID, NAME, summary_est) %>%
#   summarize(
#     less_than_hs_estimate = sum(estimate),
#     .groups = "drop"
#   ) %>%
#   mutate(percent_less_than_hs = (less_than_hs_estimate / summary_est) * 100)
#
# # View the results with percentage by county
# head(hs_grad_2022)
#
# POVERTY RATE
# TEEN BIRTH DATA
# You can ignore this scary code, friend. :) Ask questions later if you're interested in learning about APIs.
####################################
# Use your own Census API key here and never share it publicly.
# tidycensus::census_api_key(key = "YOUR_CENSUS_API_KEY", overwrite = TRUE, install = TRUE)
#
# census_vars <- load_variables(2022, "acs5", cache = TRUE)
Histograms 📊
📖Histogram:
A histogram is a bar graph that shows how often values occur within certain ranges. It helps display the distribution of data in a dataset.
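To see what "how often values occur within certain ranges" means in practice, here is a small sketch with ten made-up values. The values and bin edges are chosen only for illustration.

```r
# Ten made-up values (illustrative only)
values <- c(2, 3, 3, 4, 5, 5, 5, 6, 8, 14)

# Sort each value into a bin of width 5: (0,5], (5,10], (10,15]
bins <- cut(values, breaks = c(0, 5, 10, 15))

# Count how many values land in each bin; these counts are the bar heights
table(bins)

# hist() does the same binning and draws the bars
h <- hist(values, breaks = c(0, 5, 10, 15), main = "Toy histogram")
h$counts  # 7 2 1
```

Most values pile up in the first bin and a single value of 14 sits alone on the right. That lopsided shape is what a right-skewed distribution looks like.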
hist(TX_Census_2022$`Percent Less Than High School Diploma, 2022`,
     main = "Distribution of Percent of Population without High School Graduation")
hist(TX_Census_2022$`Percent in Poverty, 2022`,
     main = "Distribution of Percent of Population in Poverty, 2022")
hist(TX_Census_2022$`Teen Birth Rate, 2022`,
     main = "Distribution of Teen Birth Rate")
# Calculate summary statistics for each variable
less_than_high_school <- summary(TX_Census_2022$`Percent Less Than High School Diploma, 2022`)
poverty <- summary(TX_Census_2022$`Percent in Poverty, 2022`)
teen_birth_rate <- summary(TX_Census_2022$`Teen Birth Rate, 2022`)

# Calculate standard deviations
sd_less_than_high_school <- sd(TX_Census_2022$`Percent Less Than High School Diploma, 2022`, na.rm = TRUE)
sd_poverty <- sd(TX_Census_2022$`Percent in Poverty, 2022`, na.rm = TRUE)
sd_teen_birth_rate <- sd(TX_Census_2022$`Teen Birth Rate, 2022`, na.rm = TRUE)

# Load knitr package
library(knitr)

# Create summary table
results <- data.frame(
  Variable = c("Percent Less Than High School Diploma, 2022",
               "Percent in Poverty, 2022",
               "Teen Birth Rate, 2022"),
  Min = c(less_than_high_school["Min."], poverty["Min."], teen_birth_rate["Min."]),
  Q1 = c(less_than_high_school["1st Qu."], poverty["1st Qu."], teen_birth_rate["1st Qu."]),
  Median = c(less_than_high_school["Median"], poverty["Median"], teen_birth_rate["Median"]),
  Mean = c(less_than_high_school["Mean"], poverty["Mean"], teen_birth_rate["Mean"]),
  Q3 = c(less_than_high_school["3rd Qu."], poverty["3rd Qu."], teen_birth_rate["3rd Qu."]),
  Max = c(less_than_high_school["Max."], poverty["Max."], teen_birth_rate["Max."]),
  SD = c(sd_less_than_high_school, sd_poverty, sd_teen_birth_rate)
)

# Display table with kable
kable(results,
      caption = "Summary Statistics for Texas Census Data, 2022",
      col.names = c("Variable", "Min", "Q1", "Median", "Mean", "Q3", "Max", "SD"),
      format = "html") # Use "html" or "markdown" based on your environment
Summary Statistics for Texas Census Data, 2022
| Variable | Min | Q1 | Median | Mean | Q3 | Max | SD |
|---|---|---|---|---|---|---|---|
| Percent Less Than High School Diploma, 2022 | 0 | 2.194723 | 3.197616 | 4.040446 | 4.793414 | 33.33333 | 3.261233 |
| Percent in Poverty, 2022 | 0 | 4.826984 | 6.480668 | 7.511558 | 9.384352 | 30.32870 | 4.183396 |
| Teen Birth Rate, 2022 | 0 | 18.130000 | 26.170000 | 25.884856 | 34.760000 | 68.42000 | 13.405300 |
Descriptive Statistics
Describe the mean and standard deviation for each variable.
📖Mean:
The mean is the average of a set of numbers. You find it by adding up all the numbers and then dividing by how many numbers there are. It shows the typical value in a group.
🔑Key Idea: it’s the ‘signal’.
📖Standard Deviation:
Standard deviation tells us how spread out the numbers in a set are from the mean. A small standard deviation means the numbers are close to the mean, while a large one means they are more spread out.
🔑Key Idea: It’s the ‘noise’.
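To make the two definitions concrete, here is a minimal sketch that computes the mean and standard deviation by hand for five made-up numbers, then checks the results against R's built-in `mean()` and `sd()`.

```r
# Five made-up values (illustrative only)
x <- c(10, 12, 14, 16, 18)

# Mean: add the values up and divide by how many there are
mean_by_hand <- sum(x) / length(x)

# Standard deviation: roughly the typical distance from the mean
# (R's sd() uses the sample formula, which divides by n - 1)
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (length(x) - 1))

mean_by_hand                      # 14
all.equal(mean_by_hand, mean(x))  # TRUE
all.equal(sd_by_hand, sd(x))      # TRUE
```

If every value were 14, the mean would still be 14 but the standard deviation would be 0: same signal, no noise.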
The mean percent of the population without a high school diploma is _____ percent. The SD is ____.
The mean percent of the population in poverty is _____ percent. The SD is ____.
The mean teen birth rate is _____ per 1,000. The SD is ____.
Did you get it? 👀
Make sure you understand what a histogram shows and why it’s important to make one before mapping anything.
Are you familiar with the basic descriptive statistics of each variable?
We’ll return to histograms in Part III.
2️⃣PART II - Bivariate Analysis (Two Variables)
(Estimated time: 6 min)
Goals of Activity:
Explore Relationships Between Data Points
📖Scatterplot:
A scatterplot is a graph that shows the relationship between two variables, with each point representing one observation. One variable is on the x-axis and the other on the y-axis.
Investigate connections between poverty, education, and teen birth rates to see whether certain factors are more strongly associated.
Use statistical analysis to look for trends, like whether areas with lower education levels or higher poverty rates have different teen birth rates.
Understanding these relationships helps identify potential risk factors and areas for improvement within Texas.
Let’s make scatterplots. These plots help us see how variables are related.
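As a warm-up, here is a minimal sketch of a single scatterplot with a line of best fit. The six data points are made up and stand in for six hypothetical counties; they are not the workshop data.

```r
# Made-up values for six hypothetical counties (not the workshop data)
pct_poverty     <- c(5, 8, 10, 12, 15, 20)
teen_birth_rate <- c(15, 22, 20, 30, 33, 41)

# One point per county: x is poverty, y is teen birth rate
plot(pct_poverty, teen_birth_rate,
     xlab = "Percent in poverty",
     ylab = "Teen births per 1,000",
     main = "Toy scatterplot")

# Add a straight line of best fit to show the overall trend
fit <- lm(teen_birth_rate ~ pct_poverty)
abline(fit, col = "blue")

# Slope of the line: a positive value means the points trend upward
coef(fit)["pct_poverty"]
```

When the points rise from left to right, the two variables tend to increase together. That is a positive relationship.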
library(broom)

# Run the correlation tests and tidy the results
cor1 <- cor.test(TX_Census_2022$`Percent Less Than High School Diploma, 2022`,
                 TX_Census_2022$`Percent in Poverty, 2022`) %>%
  tidy()
cor2 <- cor.test(TX_Census_2022$`Percent Less Than High School Diploma, 2022`,
                 TX_Census_2022$`Teen Birth Rate, 2022`) %>%
  tidy()
cor3 <- cor.test(TX_Census_2022$`Percent in Poverty, 2022`,
                 TX_Census_2022$`Teen Birth Rate, 2022`) %>%
  tidy()

# Combine the results into a single data frame
cor_results <- rbind(
  data.frame(Variables = "Less Than HS Diploma vs. Poverty", cor1),
  data.frame(Variables = "Less Than HS Diploma vs. Teen Birth Rate", cor2),
  data.frame(Variables = "Poverty vs. Teen Birth Rate", cor3)
)

# Select and rename columns for a pretty table
cor_results <- cor_results %>%
  select(Variables, estimate, statistic, p.value, conf.low, conf.high) %>%
  rename(
    "Correlation Coefficient" = estimate,
    "t-statistic" = statistic,
    "p-value" = p.value,
    "Confidence Interval Lower Bound" = conf.low,
    "Confidence Interval Upper Bound" = conf.high
  )

# Print the table with kable for a nice format
cor_results %>%
  kable(
    caption = "Correlation Results for Texas Census Data, 2022",
    format = "markdown", # or use "html" for HTML output, "latex" for LaTeX
    digits = 3
  )
Correlation Results for Texas Census Data, 2022
| Variables | Correlation Coefficient | t-statistic | p-value | Confidence Interval Lower Bound | Confidence Interval Upper Bound |
|---|---|---|---|---|---|
| Less Than HS Diploma vs. Poverty | 0.269 | 4.429 | < 0.001 | 0.151 | 0.379 |
| Less Than HS Diploma vs. Teen Birth Rate | 0.455 | 6.687 | < 0.001 | 0.328 | 0.566 |
| Poverty vs. Teen Birth Rate | 0.382 | 5.404 | < 0.001 | 0.247 | 0.503 |
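To read a table like this one, it helps to see where the correlation coefficient comes from. The sketch below uses made-up numbers (not the workshop data) to compute Pearson's r by hand and compare it with what `cor.test()` reports.

```r
# Made-up values (illustrative only)
x <- c(5, 8, 10, 12, 15, 20)
y <- c(15, 22, 20, 30, 33, 41)

# Pearson's r: the covariance scaled by both standard deviations,
# which always lands between -1 and 1
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# cor.test() reports the same r plus a t-statistic, p-value, and 95% CI
result <- cor.test(x, y)

all.equal(r_by_hand, unname(result$estimate))  # TRUE
result$conf.int  # if this interval excludes 0, the p-value is below 0.05
```

Values of r near 0 mean little linear relationship, values near 1 mean a strong positive one, and values near -1 mean a strong negative one.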
Interpreting scatterplots 📊
What kind of correlation do you observe? Does this make sense? Why or why not?
There is a ____________ relationship between poverty and percent of population without high school diploma.
There is a ____________ relationship between poverty and teen birth rate.
There is a ____________ relationship between percent of population without high school diploma and teen birth rate.
# TXData <- left_join(TX_Census_2022, teen_birth_2022, by = "County")

# Panel function: draw the points and add a regression line
panel.lines <- function(x, y) {
  points(x, y)
  abline(lm(y ~ x), col = "blue") # Add regression line
}

TX_Census_2022 %>%
  select(`Teen Birth Rate, 2022`,
         `Percent in Poverty, 2022`,
         `Percent Less Than High School Diploma, 2022`) %>%
  pairs(lower.panel = panel.lines, upper.panel = panel.lines)
What do you notice about the plots? Can you summarize your findings?
Did you expect these results?
Did you get it? 👀
Make sure you understand Parts I and II.
Make sure you’re familiar with normality, histograms, and correlation.
Make sure you’re familiar with the descriptive statistics of all three variables. You’ll need to keep these figures in mind for Part III.
3️⃣PART III - Mapping!
Refer to QGIS project in the Drive.
Congratulations on making it to this part! A few housekeeping things:
The data have been exported from R after being cleaned, summarized, and formatted.
The QGIS file has everything you need to continue with this workshop.
Any questions, comments or concerns?
See you in QGIS!
Y’all come back now. No seriously, we have to wrap up later.
You don’t have to know the nitty-gritty of how the Shapiro-Wilk test works. Basically, it asks: how much like a bell curve is the distribution?
Note
A good W score is close to 1. For the purposes of this exercise, we can proceed with the data as they are. However, in more advanced analyses we would perform transformations and carry them throughout the rest of the analysis.
💡MAIN IDEA: Values of W close to 1 suggest normality.
A significant result (p-value less than 0.05) indicates non-normality ☹️.
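Here is a minimal sketch of how W and the p-value behave, using simulated data rather than the workshop data. It also shows one common transformation, the log; this is only an example, and the workshop does not specify which transformation a more advanced analysis would use.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# Simulated data (not the workshop data)
skewed      <- rlnorm(200)   # log-normal values: a long right tail
transformed <- log(skewed)   # the log of log-normal data is bell-shaped

test_skewed      <- shapiro.test(skewed)
test_transformed <- shapiro.test(transformed)

# W sits further from 1 for the skewed data than for the transformed data
test_skewed$statistic
test_transformed$statistic

# A p-value below 0.05 flags a departure from normality
test_skewed$p.value < 0.05  # TRUE
```

The same reading applies to the real variables: the further W falls below 1 and the smaller the p-value, the less the histogram resembles a bell curve.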
The Shapiro-Wilk tests have been performed below for our three variables. What do these results suggest? Interpret each variable separately.
shapiro.test(TX_Census_2022$`Percent Less Than High School Diploma, 2022`)
Shapiro-Wilk normality test
data: TX_Census_2022$`Percent Less Than High School Diploma, 2022`
W = 0.73845, p-value < 2.2e-16
shapiro.test(TX_Census_2022$`Percent in Poverty, 2022`)
Shapiro-Wilk normality test
data: TX_Census_2022$`Percent in Poverty, 2022`
W = 0.90559, p-value = 1.505e-11