In this activity, you merge together two datasets, each of which has U.S. counties as the unit of analysis. In other words, an observation in each of these datasets is a U.S. county.
The first dataset, which can be downloaded as a comma separated value
(.csv) file from Canvas, is called
Child Care Costs 2018. It includes information about the
poverty rate, employment, and child care costs and is obtained from the
National Database of Childcare Prices in the United States in 2018.
The second dataset (also a .csv file that can be
downloaded from Canvas) is called Counties 2018.csv and
contains the state each county is in.
The variables in Child Care Costs 2018.csv are:
CountyNumber: a code corresponding to the county the data is collected from
PovertyRateFamily: poverty rate for families
PreschoolCost: aggregated weekly, full-time median price charged for family childcare for preschoolers
SingleMotherTotal: number of households with children between 6 and 17 years old with a single mother
TotalPopulation: total population of the county
White: percent of the population that identifies as being one race and being only White or Caucasian

The variables in Counties 2018.csv are:
CountyNumber: a code corresponding to the county the data is collected from
StateName: the full name of the state in which the county is found
CountyName: county name
StateAbbreviation: the two-letter state abbreviation

You should begin by downloading both of these datasets as well as the
Lab 2 R Markdown template to your computer, saving them all in the same
folder. Then double-click the .Rmd template file to start
RStudio.
Load both datasets (don’t attach either of them yet since we’ll be
merging them, then attaching the merged dataset). Because both of these
datasets are stored as .csv files, you’ll use the
read.csv() command to load each, assigning them each a
name. (I suggest calling them ChildCareCosts and
Counties, respectively).
Then use the tail() command twice to look at the last
several rows of each dataset separately.
Counties <- read.csv("Counties 2018.csv")
ChildCareCosts <- read.csv("ChildCareCosts.csv")
tail(ChildCareCosts)
## CountyNumber TotalPopulation PovertyRateFamily SingleMotherTotal White
## 3137 56035 9951 8.4 166 96.3
## 3138 56037 44117 12.0 1281 93.1
## 3139 56039 23059 7.1 484 90.3
## 3140 56041 20609 12.5 560 93.4
## 3141 56043 8129 12.4 200 89.7
## 3142 56045 7100 17.4 228 92.5
## PreschoolCost
## 3137 138.95
## 3138 119.72
## 3139 331.34
## 3140 93.49
## 3141 95.78
## 3142 109.16
tail(Counties)
## CountyNumber CountyName StateName StateAbbreviation
## 3139 56035 Sublette County Wyoming WY
## 3140 56037 Sweetwater County Wyoming WY
## 3141 56039 Teton County Wyoming WY
## 3142 56041 Uinta County Wyoming WY
## 3143 56043 Washakie County Wyoming WY
## 3144 56045 Weston County Wyoming WY
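Besides tail(), a few other quick checks are useful right after loading a dataset. The sketch below uses a small made-up data frame (costs_toy is a hypothetical stand-in, not the real file) to show how you might check the dimensions, the variable names, and the number of missing values in each column.

```r
# Toy stand-in for ChildCareCosts (values are made up for illustration)
costs_toy <- data.frame(
  CountyNumber  = c(1001, 1003, 1005, 1007),
  PreschoolCost = c(100.5, NA, 95.25, 120)
)

dim(costs_toy)    # number of rows, then number of columns
names(costs_toy)  # variable names

# Count the missing values (NA) in each column
missing_counts <- colSums(is.na(costs_toy))
print(missing_counts)
```

Running the same checks on the real datasets would show how many counties each file contains. The row labels in the tail() output above suggest the two files may not have the same number of rows.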
Merge the two datasets (called ChildCareCosts and
Counties if you used the suggested names) together. Note
that the variable CountyNumber is included in both
datasets. You should call this new merged dataset
USCountyMerge.
(Hint: Remember that the merge() command wants you to
give the variable name on which you’re merging – the one that tells it
how to match observations between the two datasets – in quotes for the
by argument.)
After merging, you should attach the new merged dataset.
USCountyMerge <- merge(ChildCareCosts, Counties, by = "CountyNumber")
attach(USCountyMerge)
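By default, merge() keeps only the observations whose key appears in both datasets. The following sketch, built on made-up toy data frames (costs_toy and counties_toy are hypothetical names), illustrates that behavior and one way to see which keys would be dropped.

```r
# Toy stand-ins for the two datasets (values are made up for illustration)
costs_toy <- data.frame(
  CountyNumber  = c(1, 2, 3),
  PreschoolCost = c(100, 120, 95)
)
counties_toy <- data.frame(
  CountyNumber      = c(1, 2, 3, 4),
  StateAbbreviation = c("AL", "AL", "TX", "TX")
)

# Default merge: only counties present in BOTH data frames are kept
inner <- merge(costs_toy, counties_toy, by = "CountyNumber")
nrow(inner)

# all = TRUE keeps every county; unmatched columns are filled with NA
outer <- merge(costs_toy, counties_toy, by = "CountyNumber", all = TRUE)
nrow(outer)

# Keys that appear in counties_toy but not in costs_toy
unmatched <- setdiff(counties_toy$CountyNumber, costs_toy$CountyNumber)
print(unmatched)
```

Comparing nrow() before and after a merge is a simple way to notice when observations have been silently dropped.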
Make a scatterplot using the plot() command of
PovertyRateFamily against PreschoolCost (with
PovertyRateFamily on the vertical axis) and briefly comment
on what relationship you see, if any.
# Create a scatterplot
plot(PreschoolCost, PovertyRateFamily,
xlab = "Preschool Cost",
ylab = "Poverty Rate for Families",
main = "Scatterplot of Poverty Rate vs. Preschool Cost",
pch = 19,
col = "blue")
Counties with lower family poverty rates tend to have higher preschool costs, and as the poverty rate rises, the typical preschool cost falls. In other words, the two variables appear to be negatively associated.
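A scatterplot shows the direction of a relationship visually. If you also want a single number summarizing it, one option is the correlation coefficient. The sketch below uses made-up vectors (cost_toy and poverty_toy are hypothetical) and assumes that missing values should simply be skipped, which is what use = "complete.obs" does.

```r
# Made-up values for illustration only; one cost is missing (NA)
cost_toy    <- c(90, 110, 130, 150, NA, 200)
poverty_toy <- c(18, 15, 12, 10, 9, 5)

# Correlation using only the pairs where both values are present
r <- cor(cost_toy, poverty_toy, use = "complete.obs")
print(r)
```

A negative value of r indicates that one variable tends to fall as the other rises. It describes an association and does not by itself show that one variable causes the other.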
Create a new variable called PercentSingleMothers that gives single mothers as a percent of each county's total population.
Hint: you will want to divide the number of single mothers (SingleMotherTotal) by the population (TotalPopulation), then multiply the whole thing by 100 so it’s a percent not a proportion (i.e. it’s between 0 and 100 like a percent, not just between 0 and 1 like a proportion).
Next, make a histogram of this new variable and briefly comment on what you learned.
PercentSingleMothers <- (SingleMotherTotal / TotalPopulation) * 100
USCountyMerge$PercentSingleMothers <- PercentSingleMothers
hist(USCountyMerge$PercentSingleMothers,
main = "Histogram of Percent Single Mothers",
xlab = "Percent of Single Mothers",
col = "lightblue",
border = "black")
The histogram shows that in the vast majority of counties, single mothers make up roughly 2 to 4 percent of the population.
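The percent calculation can also be written without relying on attach(), by referring to the columns through the data frame itself. This is only a sketch: toy is a hypothetical data frame whose three rows reuse the TotalPopulation and SingleMotherTotal values printed by tail(ChildCareCosts) above.

```r
# Three rows copied from the tail(ChildCareCosts) output shown earlier
toy <- data.frame(
  CountyNumber      = c(56035, 56037, 56039),
  TotalPopulation   = c(9951, 44117, 23059),
  SingleMotherTotal = c(166, 1281, 484)
)

# Proportion is between 0 and 1; percent is the same quantity times 100
toy$ProportionSingleMothers <- toy$SingleMotherTotal / toy$TotalPopulation
toy$PercentSingleMothers    <- toy$ProportionSingleMothers * 100

print(toy)
summary(toy$PercentSingleMothers)
```

Storing the new variable as a column keeps it lined up with the rest of each county's data, which matters when you later filter or merge.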
Now make another scatterplot, this time with
PercentSingleMothers on the vertical axis and
PovertyRateFamily on the horizontal axis. Briefly comment
on what you see.
plot(PovertyRateFamily, PercentSingleMothers,
xlab = "Poverty Rate for Families",
ylab = "Percent of Single Mothers",
main = "Scatterplot of Percent Single Mothers vs. Poverty Rate",
pch = 19, # Point shape
col = "green") # Point color
This scatterplot shows a positive association between the two variables: counties with a higher family poverty rate tend to have a higher percentage of single mothers.
What percentage of people in Travis County are single mothers? …and what percentage of Travis County is white? …and what is the median weekly child care cost for preschoolers? …and how big is Travis County?
(Hint: Remember you can type x[y=="something"] to have R
print the value of the variable x for the observation that
has variable y equal to “something”. In this dataset
the county name is “Travis County”. You can also use the county number
for Travis County, which is 48453.)
percent_single_mothers_travis <- USCountyMerge$PercentSingleMothers[USCountyMerge$CountyName == "Travis County"]
print(percent_single_mothers_travis)
## [1] 3.373101
percent_white_travis <- USCountyMerge$White[USCountyMerge$CountyName == "Travis County"]
print(percent_white_travis)
## [1] 73.5
median_preschool_cost_travis <- USCountyMerge$PreschoolCost[USCountyMerge$CountyName == "Travis County"]
print(median_preschool_cost_travis)
## [1] 159.35
total_population_travis <- USCountyMerge$TotalPopulation[USCountyMerge$CountyName == "Travis County"]
print(total_population_travis)
## [1] 1203166
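The same lookup can be done with the county number, and all of the values can be pulled out at once by selecting the whole row. The sketch below assumes a small hypothetical data frame (toy). Its first row reuses the Travis County values printed above, and its second row ("Example County") is made up.

```r
# First row: Travis County values from the output above; second row is made up
toy <- data.frame(
  CountyNumber    = c(48453, 99999),
  CountyName      = c("Travis County", "Example County"),
  White           = c(73.5, 80.0),
  PreschoolCost   = c(159.35, 100.00),
  TotalPopulation = c(1203166, 5000)
)

# Select the row whose county number matches, keeping every column
travis_row <- toy[toy$CountyNumber == 48453, ]
print(travis_row)

# Or keep only the columns of interest
travis_row[, c("White", "PreschoolCost", "TotalPopulation")]
```

Matching on the county number can be safer than matching on the name, because the same county name can occur in more than one state.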
Use these data to calculate the proportion of the entire U.S. population that are single mothers (for our purposes here we’ll ignore the fact that these data don’t contain Washington, DC or U.S. territories). Also separately calculate the costs of preschool care divided by the total US population.
Hint: This is not the simple average (mean) of the
SingleMotherTotal variable (or of the
PreschoolCost variable for the second part of the
question). To calculate this appropriately, you can first calculate the
U.S. population based on the populations of all US counties, then
calculate the number of single mothers in all US
counties. Then calculate the costs of preschool care divided by the
total US population. You will have to put na.rm = TRUE within the sum()
function because for some counties the researchers were unable to
collect weekly child care costs. You can do this all with one line of
code for each variable if you think carefully.
total_us_population <- sum(USCountyMerge$TotalPopulation, na.rm = TRUE)
total_single_mothers <- sum(USCountyMerge$SingleMotherTotal, na.rm = TRUE)
proportion_single_mothers <- total_single_mothers / total_us_population
total_preschool_costs <- sum(USCountyMerge$PreschoolCost, na.rm = TRUE)
average_preschool_cost_per_person <- total_preschool_costs / total_us_population
print(paste("Proportion of single mothers in the U.S.:", proportion_single_mothers))
## [1] "Proportion of single mothers in the U.S.: 0.0385594647408542"
print(paste("Average preschool care cost per person in the U.S.:", average_preschool_cost_per_person))
## [1] "Average preschool care cost per person in the U.S.: 0.000838793584563145"
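To see why the hint warns against a simple average, consider a made-up example with one small county and one large county (toy is hypothetical and its numbers are invented for illustration). Averaging the county-level proportions treats both counties equally, while dividing the totals weights each county by its population.

```r
# Two made-up counties: one small, one large
toy <- data.frame(
  TotalPopulation   = c(1000, 100000),
  SingleMotherTotal = c(10, 5000)
)

# County-level proportions: 0.01 and 0.05
county_props <- toy$SingleMotherTotal / toy$TotalPopulation

# Simple average of the two proportions (each county counts equally)
simple_mean <- mean(county_props)

# Pooled proportion: total single mothers divided by total population
pooled <- sum(toy$SingleMotherTotal, na.rm = TRUE) /
  sum(toy$TotalPopulation, na.rm = TRUE)

# The pooled value equals a population-weighted mean of the proportions
weighted <- weighted.mean(county_props, w = toy$TotalPopulation)

print(c(simple = simple_mean, pooled = pooled, weighted = weighted))
```

Here the pooled proportion is pulled toward the large county's value, which is the behavior you want when describing the population as a whole.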