Lab 2: Merging and Analyzing Data

Due by 11:00pm on 9/20, submitted through Canvas

In this activity, you will merge two datasets, each of which has U.S. counties as the unit of analysis. In other words, each observation in these datasets is a U.S. county.

The first dataset, which can be downloaded as a comma-separated values (.csv) file from Canvas, is called Child Care Costs 2018. It includes information about the poverty rate, employment, and child care costs in 2018 and comes from the National Database of Childcare Prices in the United States.

The second dataset (also a .csv file that can be downloaded from Canvas) is called Counties 2018.csv and contains the name and state of each county.

The variables in Child Care Costs 2018.csv are:

CountyNumber: A code corresponding to the county the data were collected from
PovertyRateFamily: Poverty rate for families
PreschoolCost: Aggregated weekly, full-time median price charged for Family Childcare for preschoolers
SingleMotherTotal: Number of households with children between 6 and 17 years old with a single mother
TotalPopulation: Total population of the county
White: Percent of population that identifies as being one race and being only White or Caucasian

The variables in Counties 2018.csv are:

CountyNumber: A code corresponding to the county the data were collected from
StateName: The full name of the state in which the county is found
CountyName: The county name
StateAbbreviation: The two-letter state abbreviation

You should begin by downloading both of these datasets as well as the Lab 2 R Markdown template to your computer, saving them all in the same folder. Then double-click the .Rmd template file to start RStudio.

Question 1: Loading and Exploring the Dataset

Load both datasets (don’t attach either of them yet, since we’ll be merging them and then attaching the merged dataset). Because both of these datasets are stored as .csv files, you’ll use the read.csv() command to load each one, assigning each a name. (I suggest calling them ChildCareCosts and Counties, respectively.)

Then use the tail() command twice to look at the last several rows of each dataset separately.

Counties <- read.csv("Counties 2018.csv")
ChildCareCosts <- read.csv("ChildCareCosts.csv")
tail(ChildCareCosts)

##      CountyNumber TotalPopulation PovertyRateFamily SingleMotherTotal White
## 3137        56035            9951               8.4               166  96.3
## 3138        56037           44117              12.0              1281  93.1
## 3139        56039           23059               7.1               484  90.3
## 3140        56041           20609              12.5               560  93.4
## 3141        56043            8129              12.4               200  89.7
## 3142        56045            7100              17.4               228  92.5
##      PreschoolCost
## 3137        138.95
## 3138        119.72
## 3139        331.34
## 3140         93.49
## 3141         95.78
## 3142        109.16

tail(Counties)

##      CountyNumber        CountyName StateName StateAbbreviation
## 3139        56035   Sublette County   Wyoming                WY
## 3140        56037 Sweetwater County   Wyoming                WY
## 3141        56039      Teton County   Wyoming                WY
## 3142        56041      Uinta County   Wyoming                WY
## 3143        56043   Washakie County   Wyoming                WY
## 3144        56045     Weston County   Wyoming                WY
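The two tail() outputs end at different row numbers (3142 versus 3144), which suggests the two files may not contain exactly the same set of counties. One way to check this before merging is sketched below with small made-up data frames; toy_costs and toy_counties are hypothetical stand-ins, not the lab files.

```r
# Toy stand-ins for the two lab datasets (all values are made up)
toy_costs <- data.frame(
  CountyNumber = c(1001, 1003, 1005),
  PreschoolCost = c(100.5, NA, 120.0)
)
toy_counties <- data.frame(
  CountyNumber = c(1001, 1003, 1005, 1007),
  CountyName = c("A County", "B County", "C County", "D County")
)

# Compare the number of rows in each dataset
print(nrow(toy_costs))
print(nrow(toy_counties))

# County codes that appear in one dataset but not the other
only_in_counties <- setdiff(toy_counties$CountyNumber, toy_costs$CountyNumber)
only_in_costs <- setdiff(toy_costs$CountyNumber, toy_counties$CountyNumber)
print(only_in_counties)
print(only_in_costs)
```

With the real data, the same two setdiff() calls on ChildCareCosts$CountyNumber and Counties$CountyNumber would list any county codes that have no partner in the other file.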

Question 2: Merging data

Merge the two datasets (called ChildCareCosts and Counties if you used the suggested names) together. Note that the variable CountyNumber is included in both datasets. You should call this new merged dataset USCountyMerge.

(Hint: Remember that the merge() command wants you to give the variable name on which you’re merging – the one that tells it how to match observations between the two datasets – in quotes for the by argument.)

After merging, you should attach the new merged dataset.

USCountyMerge <- merge(ChildCareCosts, Counties, by = "CountyNumber")
attach(USCountyMerge)
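By default, merge() keeps only the observations whose key value appears in both datasets. The sketch below illustrates this on made-up data frames (toy_costs and toy_counties are hypothetical, not the lab files) and shows how the all argument changes the result.

```r
# Toy data: county 1005 is only in toy_costs, county 1007 is only in toy_counties
toy_costs <- data.frame(
  CountyNumber = c(1001, 1003, 1005),
  PovertyRateFamily = c(8.4, 12.0, 7.1)
)
toy_counties <- data.frame(
  CountyNumber = c(1001, 1003, 1007),
  StateAbbreviation = c("AL", "AL", "AL")
)

# Default merge: only counties found in both datasets are kept
inner <- merge(toy_costs, toy_counties, by = "CountyNumber")
print(inner)

# all = TRUE keeps every county from both datasets, filling gaps with NA
keep_all <- merge(toy_costs, toy_counties, by = "CountyNumber", all = TRUE)
print(keep_all)
```

Comparing nrow() before and after a merge is a quick way to see whether any observations were dropped.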

Question 3: Examining Association Between Variables

Make a scatterplot using the plot() command of PovertyRateFamily against PreschoolCost (with PovertyRateFamily on the vertical axis) and briefly comment on what relationship you see, if any.

# Create a scatterplot
plot(PreschoolCost, PovertyRateFamily,
     xlab = "Preschool Cost",
     ylab = "Poverty Rate for Families",
     main = "Scatterplot of Poverty Rate vs. Preschool Cost",
     pch = 19,       
     col = "blue")

The scatterplot suggests a negative relationship: counties with lower family poverty rates tend to have higher preschool costs, and the typical preschool cost falls as the poverty rate rises.
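A correlation coefficient is one way to put a number on the pattern seen in a scatterplot like this one. The sketch below uses a small made-up data frame (toy and its values are hypothetical) and assumes, as in the lab data, that some preschool costs are missing.

```r
# Toy data with one missing preschool cost (values are made up)
toy <- data.frame(
  PreschoolCost = c(90, 110, NA, 150, 200),
  PovertyRateFamily = c(18, 14, 12, 9, 6)
)

# With a missing value and the default settings, cor() returns NA
r_default <- cor(toy$PreschoolCost, toy$PovertyRateFamily)
print(r_default)

# use = "complete.obs" drops the pairs that contain a missing value
r <- cor(toy$PreschoolCost, toy$PovertyRateFamily, use = "complete.obs")
print(r)
```

A value of r near -1 would indicate a strong negative linear relationship, and a value near 0 would indicate little linear relationship.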

Question 4: Creating New Variables

Create a new variable called PercentSingleMothers that is the percent of each county’s population who are single mothers.

Hint: you will want to divide the number of single mothers (SingleMotherTotal) by the population (TotalPopulation), then multiply the whole thing by 100 so it’s a percent, not a proportion (i.e. it’s between 0 and 100 like a percent, not just between 0 and 1 like a proportion).

Next, make a histogram of this new variable and briefly comment on what you learned.

PercentSingleMothers <- (SingleMotherTotal / TotalPopulation) * 100
USCountyMerge$PercentSingleMothers <- PercentSingleMothers
hist(USCountyMerge$PercentSingleMothers,
     main = "Histogram of Percent Single Mothers",
     xlab = "Percent of Single Mothers",
     col = "lightblue",
     border = "black")

The histogram shows that in the vast majority of U.S. counties, single mothers make up roughly 2 to 4 percent of the population.
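The difference between a proportion and a percent can be checked on a few rows. The sketch below builds a small data frame from three rows of the tail() output shown earlier (the name toy and the column PropSingleMothers are hypothetical) and refers to columns through the data frame, so it does not depend on attach().

```r
# Three counties taken from the tail(ChildCareCosts) output above
toy <- data.frame(
  CountyNumber = c(56035, 56037, 56039),
  SingleMotherTotal = c(166, 1281, 484),
  TotalPopulation = c(9951, 44117, 23059)
)

# Proportion: between 0 and 1
toy$PropSingleMothers <- toy$SingleMotherTotal / toy$TotalPopulation

# Percent: the proportion multiplied by 100, so between 0 and 100
toy$PercentSingleMothers <- toy$PropSingleMothers * 100

print(toy)
```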

Question 5: Associations Between Variables

Now make another scatterplot, but this time having PercentSingleMothers on the vertical axis and again having PovertyRateFamily on the horizontal axis. Briefly comment on what you see.

plot(PovertyRateFamily, PercentSingleMothers,
     xlab = "Poverty Rate for Families",
     ylab = "Percent of Single Mothers",
     main = "Scatterplot of Percent Single Mothers vs. Poverty Rate",
     pch = 19,       # Point shape
     col = "green")  # Point color

This scatterplot shows a positive association between the two variables: counties with a higher family poverty rate tend to have a higher percentage of single mothers.

Question 6: What about Travis County, Texas?

What percentage of people in Travis County are single mothers? …and what percentage of Travis County is white? …and what is the median weekly child care cost for preschoolers? …and how big is Travis County?

(Hint: Remember you can type x[y=="something"] to have R print the value of the variable x for the observation that has variable y equal to “something”. In this dataset the county name is “Travis County”; you can also use the county number for Travis County, which is 48453.)

percent_single_mothers_travis <- USCountyMerge$PercentSingleMothers[USCountyMerge$CountyName == "Travis County"]
print(percent_single_mothers_travis)

## [1] 3.373101

percent_white_travis <- USCountyMerge$White[USCountyMerge$CountyName == "Travis County"]
print(percent_white_travis)

## [1] 73.5

median_preschool_cost_travis <- USCountyMerge$PreschoolCost[USCountyMerge$CountyName == "Travis County"]
print(median_preschool_cost_travis)

## [1] 159.35

total_population_travis <- USCountyMerge$TotalPopulation[USCountyMerge$CountyName == "Travis County"]
print(total_population_travis)

## [1] 1203166
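County names can repeat across states, so looking up a county by name alone may return more than one value. The sketch below, on a made-up data frame (toy, "Example County", and the abbreviations "AA" and "BB" are hypothetical; only the Travis County number and White value come from this lab), shows two ways to make the lookup unambiguous.

```r
# Toy data: two different counties share the name "Example County"
toy <- data.frame(
  CountyNumber = c(48453, 1001, 2001),
  CountyName = c("Travis County", "Example County", "Example County"),
  StateAbbreviation = c("TX", "AA", "BB"),
  White = c(73.5, 80.0, 60.0)
)

# Lookup by county number always matches at most one county
by_number <- toy$White[toy$CountyNumber == 48453]
print(by_number)

# Lookup by name alone can return several values
by_name <- toy$White[toy$CountyName == "Example County"]
print(by_name)

# Combining name and state with & narrows it to one county
by_name_state <- toy$White[toy$CountyName == "Example County" &
                             toy$StateAbbreviation == "BB"]
print(by_name_state)
```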

Question 7: Summing Variables

Use these data to calculate the proportion of the entire U.S. population that are single mothers (for our purposes here we’ll ignore the fact that these data don’t contain Washington, DC or U.S. territories). Also separately calculate the costs of preschool care divided by the total US population.

Hint: This is not the simple average (mean) of the SingleMotherTotal variable (or of the PreschoolCost variable for the second part of the question). To calculate this appropriately, first calculate the U.S. population by summing the populations of all U.S. counties, then calculate the total number of single mothers across all U.S. counties, and divide the second total by the first. Then calculate the sum of the preschool care costs divided by the total U.S. population. You will have to put na.rm = T within the sum() function because for some counties the researchers were unable to collect weekly child care costs. You can do this all with one line of code for each variable if you think carefully.

total_us_population <- sum(USCountyMerge$TotalPopulation, na.rm = TRUE)

total_single_mothers <- sum(USCountyMerge$SingleMotherTotal, na.rm = TRUE)

proportion_single_mothers <- total_single_mothers / total_us_population

total_preschool_costs <- sum(USCountyMerge$PreschoolCost, na.rm = TRUE)

average_preschool_cost_per_person <- total_preschool_costs / total_us_population

print(paste("Proportion of single mothers in the U.S.:", proportion_single_mothers))

## [1] "Proportion of single mothers in the U.S.: 0.0385594647408542"

print(paste("Average preschool care cost per person in the U.S.:", average_preschool_cost_per_person))

## [1] "Average preschool care cost per person in the U.S.: 0.000838793584563145"
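To see why the hint warns against a simple average, the sketch below compares the two calculations on a made-up data frame with one small and one large county (toy and its values are hypothetical). It also shows what sum() returns with and without na.rm = TRUE when a value is missing.

```r
# Toy data: a small county and a large county, one missing preschool cost
toy <- data.frame(
  TotalPopulation = c(1000, 100000),
  SingleMotherTotal = c(10, 5000),
  PreschoolCost = c(120, NA)
)

# Overall proportion: total single mothers divided by total population
overall_prop <- sum(toy$SingleMotherTotal) / sum(toy$TotalPopulation)
print(overall_prop)

# Simple average of county proportions: treats both counties as equally large
simple_mean <- mean(toy$SingleMotherTotal / toy$TotalPopulation)
print(simple_mean)

# A single NA makes sum() return NA unless na.rm = TRUE is supplied
sum_with_na <- sum(toy$PreschoolCost)
sum_without_na <- sum(toy$PreschoolCost, na.rm = TRUE)
print(sum_with_na)
print(sum_without_na)
```

The overall proportion is pulled toward the large county because that county contributes most of the people, while the simple average gives each county the same weight regardless of size.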