Project 2: US Marriages

The following datasets accompany FiveThirtyEight’s article on US marriage rates over several decades. I used the two separate files, one for men and one for women, to analyze how marriage has changed differently for each sex in relation to education level.

Import and tidy the data

library(tidyverse)  # attaches dplyr, tidyr, and ggplot2, all used below

wm_raw <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/marriage/women.csv")

m_raw <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/marriage/men.csv")

Per the README for the raw CSV files, each column covers a different slice of the population (such as race, region, or employment), further divided by age. For example, SC_2534 holds the figure for people aged 25 to 34 who have some college education.

Most importantly, “figures represent share of the relevant population that has never been married”, not the percent of married people.

A quick print(colnames(wm_raw) == colnames(m_raw)) confirmed that both datasets had the same variables. I then selected the year column, the all_ columns, and every column with an education prefix.

col_prefix <- c("year", "all_", "HS_", "SC_", "BAp_", "BAo_", "GD_")

women_sub <- wm_raw |>
  select(starts_with(col_prefix))

men_sub <- m_raw |>
  select(starts_with(col_prefix))
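The column check and prefix selection above can be sketched on toy data. This is only an illustration: the `toy_` tables and their values are made up, and `identical()` is swapped in for the element-wise `==` comparison because it returns a single TRUE or FALSE.

```r
library(dplyr)

# Toy stand-ins for the two raw tables; the values are invented for illustration.
toy_women <- data.frame(year = c(1960, 1970), all_2534 = c(0.10, 0.12),
                        HS_2534 = c(0.09, 0.11), kids_HS_2534 = c(0.05, 0.06))
toy_men   <- data.frame(year = c(1960, 1970), all_2534 = c(0.15, 0.17),
                        HS_2534 = c(0.14, 0.16), kids_HS_2534 = c(0.07, 0.08))

# identical() gives one TRUE/FALSE for the whole vector of column names.
same_columns <- identical(colnames(toy_women), colnames(toy_men))

# starts_with() accepts a character vector and keeps columns matching any prefix.
# kids_HS_2534 is dropped because its name begins with "kids_", not "HS_".
toy_prefix <- c("year", "all_", "HS_")
toy_sub <- toy_women |> select(starts_with(toy_prefix))
colnames(toy_sub)
```

Because the match is anchored at the start of the name, combination columns that merely contain an education code are excluded without any extra filtering.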

Since no variables captured education without an age split, the best available option (still not ideal) was to average the age brackets for each education level into a new column. These unweighted averages stand in for all ages in a given year (really only 25 to 54, based on the original data). I also combined the BAp and BAo columns, since both designations refer to at least a Bachelor’s degree.

Because the table lists one row per decade at first and then shifts to yearly rows, I kept only the rows for the start of each decade.

all <- rowMeans(select(women_sub, starts_with("all_")))
HS <- rowMeans(select(women_sub, starts_with("HS_")))
SC <- rowMeans(select(women_sub, starts_with("SC_")))
BA <- rowMeans(select(women_sub, starts_with("BA")))
GD <- rowMeans(select(women_sub, starts_with("GD_")))

women <- data.frame(
  year = women_sub$year,
  all, HS, SC, BA, GD)

women <- women |> filter(year %% 10 == 0)

all <- rowMeans(select(men_sub, starts_with("all_")))
HS <- rowMeans(select(men_sub, starts_with("HS_")))
SC <- rowMeans(select(men_sub, starts_with("SC_")))
BA <- rowMeans(select(men_sub, starts_with("BA")))
GD <- rowMeans(select(men_sub, starts_with("GD_")))

men <- data.frame(
  year = men_sub$year,
  all, HS, SC, BA, GD)

men <- men |> filter(year %% 10 == 0)
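The same averaging and decade filter could also be wrapped in one function so it is not repeated per dataset. The sketch below is a hypothetical helper (`average_by_prefix` is not part of the original analysis) run on made-up toy values. It also shows why a level comes out as NA: plain `rowMeans()` returns NA for a row when any of its age brackets is missing.

```r
library(dplyr)

# Hypothetical helper: average each education prefix across its age-bracket
# columns, then keep only rows at the start of a decade.
average_by_prefix <- function(df, prefixes) {
  out <- data.frame(year = df$year)
  for (name in names(prefixes)) {
    # rowMeans() without na.rm returns NA if any bracket is NA in that row.
    out[[name]] <- rowMeans(select(df, starts_with(prefixes[[name]])))
  }
  filter(out, year %% 10 == 0)
}

# Toy data with invented values; GD is missing in the earliest year.
toy <- data.frame(
  year    = c(1960, 1990, 1995),
  HS_2534 = c(0.10, 0.20, 0.22),
  HS_3544 = c(0.06, 0.10, 0.12),
  GD_2534 = c(NA,   0.30, 0.32),
  GD_3544 = c(NA,   0.10, 0.12)
)

toy_avg <- average_by_prefix(toy, c(HS = "HS_", GD = "GD_"))
toy_avg
```

With a helper like this, the women and men tables would each need a single call, and the global name `all`, which masks base R's `all()` in the code above, would no longer be needed.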

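For tidiness and easier grouping and analysis, I pivoted each table longer so that every observation holds a single year, education level, gender, and marriage rate, then combined the two tables with full_join(). Because no by argument is given, the join matches on every shared column, so in practice it stacks the women and men rows into one table. Finally, the rate was inverted from the never-married share to its complement, the share that has ever been married.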

women <- women |>
  pivot_longer(cols = !year, names_to = "education", values_to = "marriage_rate") |>
  mutate(gender = "women")

men <- men |>
  pivot_longer(cols = !year, names_to = "education", values_to = "marriage_rate") |>
  mutate(gender = "men")

marriages <- women |> full_join(men)

marriages <- marriages |>
  arrange(year, education) |>
  mutate(marriage_rate = 1 - marriage_rate)
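As a minimal sketch of the reshaping steps, the toy example below pivots two small tables, stacks them, and inverts the rate. It assumes `bind_rows()` as a stand-in for the `full_join()` call, since stacking is the intended effect here. The helper `to_long`, the column name `never_married`, and all values are illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy never-married shares; the values are invented for illustration.
toy_women <- data.frame(year = c(1960, 1970), HS = c(0.06, 0.07), BA = c(0.27, 0.22))
toy_men   <- data.frame(year = c(1960, 1970), HS = c(0.10, 0.09), BA = c(0.13, 0.12))

# Hypothetical helper: one row per year and education level, tagged with gender.
to_long <- function(df, gender_label) {
  df |>
    pivot_longer(cols = !year, names_to = "education", values_to = "never_married") |>
    mutate(gender = gender_label)
}

# Stack the two long tables, then convert the never-married share
# into its complement.
toy_long <- bind_rows(to_long(toy_women, "women"), to_long(toy_men, "men")) |>
  mutate(marriage_rate = 1 - never_married) |>
  select(year, education, gender, marriage_rate) |>
  arrange(year, education)

toy_long
```

Keeping the raw share under its own name until the inversion step makes it harder to mistake a never-married figure for a marriage rate partway through the pipeline.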

knitr::kable(marriages)

| year | education | marriage_rate | gender |
|-----:|:----------|--------------:|:-------|
| 1960 | BA | 0.7262540 | women |
| 1960 | BA | 0.8672069 | men |
| 1960 | GD | NA | women |
| 1960 | GD | NA | men |
| 1960 | HS | 0.9371548 | women |
| 1960 | HS | 0.8956564 | men |
| 1960 | SC | 0.9048795 | women |
| 1960 | SC | 0.8989768 | men |
| 1960 | all | 0.9279299 | women |
| 1960 | all | 0.8936265 | men |
| 1970 | BA | 0.7841886 | women |
| 1970 | BA | 0.8839521 | men |
| 1970 | GD | NA | women |
| 1970 | GD | NA | men |
| 1970 | HS | 0.9386197 | women |
| 1970 | HS | 0.9029050 | men |
| 1970 | SC | 0.9189273 | women |
| 1970 | SC | 0.9038224 | men |
| 1970 | all | 0.9294602 | women |
| 1970 | all | 0.9003839 | men |
| 1980 | BA | 0.8098409 | women |
| 1980 | BA | 0.8551532 | men |
| 1980 | GD | NA | women |
| 1980 | GD | NA | men |
| 1980 | HS | 0.9278144 | women |
| 1980 | HS | 0.8840218 | men |
| 1980 | SC | 0.9020989 | women |
| 1980 | SC | 0.8737840 | men |
| 1980 | all | 0.9103697 | women |
| 1980 | all | 0.8753598 | men |
| 1990 | BA | 0.8230220 | women |
| 1990 | BA | 0.8034109 | men |
| 1990 | GD | 0.7970178 | women |
| 1990 | GD | 0.8215648 | men |
| 1990 | HS | 0.8832472 | women |
| 1990 | HS | 0.8167907 | men |
| 1990 | SC | 0.8820704 | women |
| 1990 | SC | 0.8343551 | men |
| 1990 | all | 0.8680386 | women |
| 1990 | all | 0.8196127 | men |
| 2000 | BA | 0.8040859 | women |
| 2000 | BA | 0.7697464 | men |
| 2000 | GD | 0.7901273 | women |
| 2000 | GD | 0.7961433 | men |
| 2000 | HS | 0.8369494 | women |
| 2000 | HS | 0.7694291 | men |
| 2000 | SC | 0.8459649 | women |
| 2000 | SC | 0.7917153 | men |
| 2000 | all | 0.8296088 | women |
| 2000 | all | 0.7776125 | men |
| 2010 | BA | 0.7676716 | women |
| 2010 | BA | 0.7242816 | men |
| 2010 | GD | 0.7699064 | women |
| 2010 | GD | 0.7740603 | men |
| 2010 | HS | 0.7434646 | women |
| 2010 | HS | 0.6589072 | men |
| 2010 | SC | 0.7724786 | women |
| 2010 | SC | 0.7106068 | men |
| 2010 | all | 0.7608086 | women |
| 2010 | all | 0.6946467 | men |

NOTE: Kids or no kids? Work or no work?

I had originally intended to include columns that combine education with other attributes, such as whether the sample population had children or had steady employment in the prior year. However, these columns only cover ages 25-34, so they could not be compared against the other columns, which cover the full age range.

Analysis

The final, tidy version of the data above would allow for marriage rates to be grouped and filtered by year, education and/or gender for analysis.
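As a rough sketch of that kind of grouping and filtering, the example below uses a small table in the same shape as marriages. The values are rounded from the 1960 and 2010 HS and BA rows of the table above, and the names `toy_marriages`, `hs_gap`, and `by_gender_year` are illustrative only.

```r
library(dplyr)

# Small table in the same shape as `marriages`, with rounded values.
toy_marriages <- data.frame(
  year = rep(c(1960, 2010), each = 4),
  education = rep(c("HS", "HS", "BA", "BA"), times = 2),
  gender = rep(c("women", "men"), times = 4),
  marriage_rate = c(0.94, 0.90, 0.73, 0.87, 0.74, 0.66, 0.77, 0.72)
)

# Filter to one education level, then compare genders within each year.
hs_gap <- toy_marriages |>
  filter(education == "HS") |>
  group_by(year) |>
  summarise(gap = marriage_rate[gender == "women"] - marriage_rate[gender == "men"])

# Group by two columns at once: mean rate per year and gender.
by_gender_year <- toy_marriages |>
  group_by(year, gender) |>
  summarise(mean_rate = mean(marriage_rate), .groups = "drop")

hs_gap
by_gender_year
```

Both summaries work on column values alone, which is what the long format buys: no column names need to be parsed to recover the education level or gender.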

Since the marriage rates were already averages, I thought a few visual representations of how they changed over time would be more useful than further summaries.

The first visualization filters to the rows where the education column is “all”, then compares those rates by gender, with year on the x-axis:

all_rate <- filter(marriages, education == "all")

ggplot(all_rate, aes(x = year, y = marriage_rate, fill = gender)) +
  geom_col(position = "dodge")  # geom_col() is geom_bar(stat = "identity")

The next visualization also plots marriage rates against the year to examine how the rate changed over time. In this case, a ggplot2 scatterplot distinguishes the observations by the education column (represented by dot color) and the gender column (represented by dot shape).

Of note below: the rate for women with Bachelor’s degrees, which rose from about 0.73 in 1960 to 0.82 in 1990 before easing to 0.77 in 2010, and the rate for men with only a high school education, which fell from about 0.90 in 1960 to 0.66 in 2010.

ggplot(marriages, aes(year, marriage_rate)) +
  geom_point(aes(color = education, shape = gender))
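To put numbers on the two groups highlighted in the plot, the sketch below computes the change in rate between 1960 and 2010. The values are copied from the table above. The `group` labels and the `y` column prefix are illustrative choices, not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Values copied from the table above: women with a BA, men with HS only.
noted <- data.frame(
  year = c(1960, 1990, 2010, 1960, 1990, 2010),
  group = rep(c("women, BA", "men, HS"), each = 3),
  marriage_rate = c(0.7262540, 0.8230220, 0.7676716,
                    0.8956564, 0.8167907, 0.6589072)
)

# Spread the years into columns so the change is a simple subtraction.
change <- noted |>
  pivot_wider(names_from = year, values_from = marriage_rate, names_prefix = "y") |>
  mutate(change_1960_2010 = y2010 - y1960)

change
```

Pivoting wider here is only a convenience for the subtraction. The long table remains the better shape for plotting.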

### Conclusion

In this case, I decided to merge and tidy two datasets into one for direct comparison. An alternative approach would have been to keep each dataset separate and either group and summarize each one before analysis or create separate plots.

Converting the original datasets into this tidy form, though, made it possible to group and filter by column values and to compare the rates for men and women directly. The tidy data also fed easily into ggplot2, with plotting code that was simple to write and read.