Hi Professor,

For this assignment I selected a dataset on childcare costs from 2008–2018. While it doesn't align with the time frame of my capstone project (which focuses more on the past five years), I thought it would be valuable background to analyze. My capstone research question is: How do disparities in childcare availability and affordability across San Antonio neighborhoods affect workforce participation among low-income parents, and what policy interventions could reduce these inequities?

Since my capstone is a qualitative study focused on Bexar County, I see this dataset as practice and a way to strengthen my understanding of how childcare costs have shifted over time nationally. Even though the dataset is older and broader, I think it provides useful context as I continue building out my literature review and preparing to analyze more recent and local data for my capstone. I also think it would be a good basis for a research paper because it will be different from my actual capstone: more quantitative and focused on national numbers. If you advise otherwise, I do not mind doing the school districts!

I'll admit I struggled the most with the setup of this assignment, especially uploading the files (the upload itself was not difficult, just figuring out the code to run) and working through knit errors. The statistical parts (summaries, histograms, scatterplots, and correlation) were much easier for me to follow. It took some trial and error (and some help), but I was able to get it working, and I think this exercise really helped me get more comfortable with the process. I still need more practice!

Best,
Lori

# This is the link to where I found my data: https://www.kaggle.com/datasets/sujaykapadnis/childcare-costs/data

library(readxl)   
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load childcare data (paths include the subfolder 'childcare-costs')
childcare_costs <- read_csv("childcare-costs/childcare_costs.csv")

## Rows: 34567 Columns: 61
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (61): county_fips_code, study_year, unr_16, funr_16, munr_16, unr_20to64...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

counties <- read_csv("childcare-costs/counties.csv")

## Rows: 3144 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): county_name, state_name, state_abbreviation
## dbl (1): county_fips_code
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(childcare_costs)

## # A tibble: 6 × 61
##   county_fips_code study_year unr_16 funr_16 munr_16 unr_20to64 funr_20to64
##              <dbl>      <dbl>  <dbl>   <dbl>   <dbl>      <dbl>       <dbl>
## 1             1001       2008   5.42    4.41    6.32        4.6         3.5
## 2             1001       2009   5.93    5.72    6.11        4.8         4.6
## 3             1001       2010   6.21    5.57    6.78        5.1         4.6
## 4             1001       2011   7.55    8.13    7.03        6.2         6.3
## 5             1001       2012   8.6     8.88    8.29        6.7         6.4
## 6             1001       2013   9.39   10.3     8.56        7.3         7.6
## # ℹ 54 more variables: munr_20to64 <dbl>, flfpr_20to64 <dbl>,
## #   flfpr_20to64_under6 <dbl>, flfpr_20to64_6to17 <dbl>,
## #   flfpr_20to64_under6_6to17 <dbl>, mlfpr_20to64 <dbl>, pr_f <dbl>,
## #   pr_p <dbl>, mhi_2018 <dbl>, me_2018 <dbl>, fme_2018 <dbl>, mme_2018 <dbl>,
## #   total_pop <dbl>, one_race <dbl>, one_race_w <dbl>, one_race_b <dbl>,
## #   one_race_i <dbl>, one_race_a <dbl>, one_race_h <dbl>, one_race_other <dbl>,
## #   two_races <dbl>, hispanic <dbl>, households <dbl>, …

head(counties)

## # A tibble: 6 × 4
##   county_fips_code county_name    state_name state_abbreviation
##              <dbl> <chr>          <chr>      <chr>             
## 1             1001 Autauga County Alabama    AL                
## 2             1003 Baldwin County Alabama    AL                
## 3             1005 Barbour County Alabama    AL                
## 4             1007 Bibb County    Alabama    AL                
## 5             1009 Blount County  Alabama    AL                
## 6             1011 Bullock County Alabama    AL

if (exists("childcare_dict")) head(childcare_dict)

R Markdown ## I think this is the part I struggle with the most: just getting set up in R Markdown. I know how to save all the files in one place (having the R Markdown file in the same folder as my data), and I understand the library command. But I know I am doing something wrong, because when I knit and then publish I get multiple errors. I get answers in the console; it is in the script where I need a lot more practice.
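One possible cause of knit errors like these, offered as a guess rather than a diagnosis: knitting runs the document in a fresh R session, so packages and objects that exist only in the console are not available, and relative paths are normally resolved from the folder that holds the .Rmd file. A minimal sketch of a check that could go in an early chunk, assuming the same folder layout as the `read_csv()` calls above:

```r
# Sketch of a check to place in an early chunk of the .Rmd file.
# The path matches the read_csv() call used earlier in this document.
data_path <- file.path("childcare-costs", "childcare_costs.csv")

# When knitting, the working directory is normally the folder holding the .Rmd.
cat("Working directory:", getwd(), "\n")

if (file.exists(data_path)) {
  message("Found ", data_path)
} else {
  message("Could not find ", data_path,
          "; check that the 'childcare-costs' folder sits next to the .Rmd file.")
}
```

If the message says the file is missing when knitting but not in the console, the two sessions are probably using different working directories.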

library(readr)   # for reading csv files 
library(dplyr)   # for tidy data tools



# summarize infant care costs
summary(childcare_costs$mc_infant)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   27.73  108.75  134.50  146.05  166.33  470.00   10974

# summarize toddler care costs
summary(childcare_costs$mc_toddler)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.54  100.00  120.99  130.48  148.71  419.00   10974
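The NA's entry in both summaries counts rows where the cost is missing. A small sketch with made-up numbers (not values from the dataset) shows how `summary()` reports missing values and why functions such as `mean()` need `na.rm = TRUE`:

```r
# Made-up cost values with two missing entries (not taken from the dataset).
toy_costs <- c(100, 120, NA, 150, NA, 180)

summary(toy_costs)             # includes an NA's count alongside the quartiles
sum(is.na(toy_costs))          # number of missing values
mean(toy_costs)                # NA, because missing values propagate
mean(toy_costs, na.rm = TRUE)  # mean of the four observed values
mean(is.na(toy_costs))         # share of values that are missing
```

On the real data, `sum(is.na(childcare_costs$mc_infant))` should match the 10974 reported in the summary above.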

Attempting the Histogram next

ggplot(childcare_costs, aes(x = mc_infant))
ggplot() = what tells R I am making a plot.
childcare_costs = the dataset I am using.
aes(x = mc_infant) = aesthetic mapping; it says "on the x-axis, plot the variable mc_infant."
mc_infant = one of the numeric columns in the dataset I got from Kaggle (average market cost for infant childcare).
binwidth = 100 = each bar covers a range of 100 dollars. Smaller binwidth → more bars; larger binwidth → fewer bars. (Remember this one because you get confused, girly pop.)

# histogram for infant childcare costs
ggplot(childcare_costs, aes(x = mc_infant)) +
  geom_histogram(binwidth = 100, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Infant Childcare Costs",
       x = "Infant cost (market center)",
       y = "Count")

## Warning: Removed 10974 rows containing non-finite outside the scale range
## (`stat_bin()`).

# histogram for toddler childcare costs
ggplot(childcare_costs, aes(x = mc_toddler)) +
  geom_histogram(binwidth = 100, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Toddler Childcare Costs",
       x = "Toddler cost (market center)",
       y = "Count")

## Warning: Removed 10974 rows containing non-finite outside the scale range
## (`stat_bin()`).
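To get a feel for the binwidth rule noted above (smaller binwidth, more bars), here is a rough sketch in base R. The cost values are made up, and `count_bins()` is a hypothetical helper written for this illustration, not a ggplot2 function. ggplot2 positions its bins a little differently, so its exact bar count may differ slightly from this estimate.

```r
# Made-up cost values spanning roughly the same range as mc_infant.
costs <- c(28, 95, 110, 120, 135, 150, 165, 210, 320, 470)

# Hypothetical helper: count the bins of a given width needed to cover the data,
# with bin edges placed on multiples of the binwidth.
count_bins <- function(x, binwidth) {
  lower <- floor(min(x) / binwidth) * binwidth
  upper <- ceiling(max(x) / binwidth) * binwidth
  breaks <- seq(lower, upper, by = binwidth)
  length(breaks) - 1
}

count_bins(costs, 100)  # wide bins, only a handful of bars
count_bins(costs, 25)   # narrow bins, many more bars
```

With costs running from about 28 to 470, a binwidth of 100 leaves only around five bars, so a smaller value such as 25 may show the shape of the distribution more clearly.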

Okay, next step: scatterplots. Here I am comparing infant and toddler childcare costs to see if they are related.
This lets me see whether counties with higher infant costs also tend to have higher toddler costs.
KEY NOTES: ggplot() = start making a plot. childcare_costs = dataset to use. aes(x = mc_infant, y = mc_toddler) = set up the axes: x-axis = infant childcare costs, y-axis = toddler childcare costs.

ggplot(childcare_costs, aes(x = mc_infant, y = mc_toddler)) +
  geom_point(color = "lightblue", alpha = 0.6) +
  labs(title = "Infant vs. Toddler Childcare Costs",
       x = "Infant cost (market center)",
       y = "Toddler cost (market center)")

## Warning: Removed 10974 rows containing missing values or values outside the scale range
## (`geom_point()`).

## What does the scatterplot say, besties? Counties with higher infant costs also tend to have higher toddler costs.

Correlation

Don't forget! cor() is the correlation function, and don't forget about $ to pick a column out of the dataset. use = "complete.obs" tells cor() to ignore rows with missing info (10,974 rows are missing these costs, so it matters here).

cor(childcare_costs$mc_infant, childcare_costs$mc_toddler, use = "complete.obs")

## [1] 0.9616953

Infant and toddler childcare costs have a very strong positive correlation (about 0.96). This means counties that charge more for infant care almost always charge more for toddler care as well. Correlation ranges from -1 to +1, and 0.96 is very close to +1.
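To see what `use = "complete.obs"` does in the `cor()` call above, here is a small sketch with made-up numbers (not values from the dataset). Without that argument, a single missing value makes the result NA; with it, only rows where both values are present are used.

```r
# Made-up infant and toddler costs, each with one missing value.
infant  <- c(100, 120, NA, 160, 200)
toddler <- c(90, 105, 130, NA, 180)

# Default behaviour: any NA in the inputs makes the correlation NA.
r_all <- cor(infant, toddler)

# Keep only rows where both columns are observed (rows 1, 2, and 5 here).
r_complete <- cor(infant, toddler, use = "complete.obs")

r_all
r_complete
```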

The summary statistics, histograms, scatterplot, and correlation (which I was fighting for my life to do) together show that infant and toddler childcare costs are very strongly related.
Counties with high infant costs almost always also have high toddler costs.

The correlation coefficient is close to +1, which confirms a strong positive relationship. This means that as one type of childcare cost goes up, the other tends to go up as well.

This suggests that childcare costs are not random, but move together across counties.


Research Paper Data Selection

Lori Day

2025-09-17
