The following datasets were included to support FiveThirtyEight’s article on marriage rates in the US over several decades. I used the 2 separate files for men and women to analyze how marriage may have changed for the sexes differently, in relation to education levels.
wm_raw <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/marriage/women.csv")
m_raw <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/marriage/men.csv")
Per the README for the raw CSV files, the columns were all different
types of samples of the population - like race, region, and employment -
which were then further divided by age. For example,
SC_2534
represented the marriage rate for those in the 25
to 34 age range who had some college education.
Most importantly, “figures represent share of the relevant population that has never been married”, not the percent of married people.
Using a quick
print(colnames(wm_raw) == colnames(m_raw))
, I was able to
confirm that the variables for both datasets were the same. Then I
selected all the columns with education prefixes.
col_prefix <- c("year", "all_", "HS_", "SC_", "BAp_", "BAo_", "GD_")
women_sub <- wm_raw |>
select(starts_with(col_prefix))
men_sub <- m_raw |>
select(starts_with(col_prefix))
Since there were no variables for education without age separation,
the best option at this point (still not ideal) was simply to calculate
the average for each type into new columns. These would have to be the
best representations for all age levels in the given time period (really
only 25 to 54, based on the original data). I also chose to combine the
BAp
and BAo
columns, since these designations
both referred to at least a Bachelor’s degree.
Then, since the top part of the table went by decade before shifting to yearly rows, I removed any rows that did not represent the beginning of the decade.
all <- rowMeans(select(women_sub, starts_with("all_")))
HS <- rowMeans(select(women_sub, starts_with("HS_")))
SC <- rowMeans(select(women_sub, starts_with("SC_")))
BA <- rowMeans(select(women_sub, starts_with("BA")))
GD <- rowMeans(select(women_sub, starts_with("GD_")))
women <- data.frame(
year = women_sub$year,
all, HS, SC, BA, GD)
women <- women |> filter(year %% 10 == 0)
all <- rowMeans(select(men_sub, starts_with("all_")))
HS <- rowMeans(select(men_sub, starts_with("HS_")))
SC <- rowMeans(select(men_sub, starts_with("SC_")))
BA <- rowMeans(select(men_sub, starts_with("BA")))
GD <- rowMeans(select(men_sub, starts_with("GD_")))
men <- data.frame(
year = men_sub$year,
all, HS, SC, BA, GD)
men <- men |> filter(year %% 10 == 0)
For a tidiness and easier grouping and analysis, the tables were pivoted longer so that each observation represented a single year, education level, gender and the marriage rate (now inverted to the positive share of the population); then joined together on the year column.
women <- women |>
pivot_longer(cols = !year, names_to = "education", values_to = "marriage_rate") |>
mutate(gender = "women")
men <- men |>
pivot_longer(cols = !year, names_to = "education", values_to = "marriage_rate") |>
mutate(gender = "men")
marriages <- women |> full_join(men)
marriages <- marriages |>
arrange(year, education) |>
mutate(marriage_rate = 1 - marriage_rate)
knitr::kable(marriages)
year | education | marriage_rate | gender |
---|---|---|---|
1960 | BA | 0.7262540 | women |
1960 | BA | 0.8672069 | men |
1960 | GD | NA | women |
1960 | GD | NA | men |
1960 | HS | 0.9371548 | women |
1960 | HS | 0.8956564 | men |
1960 | SC | 0.9048795 | women |
1960 | SC | 0.8989768 | men |
1960 | all | 0.9279299 | women |
1960 | all | 0.8936265 | men |
1970 | BA | 0.7841886 | women |
1970 | BA | 0.8839521 | men |
1970 | GD | NA | women |
1970 | GD | NA | men |
1970 | HS | 0.9386197 | women |
1970 | HS | 0.9029050 | men |
1970 | SC | 0.9189273 | women |
1970 | SC | 0.9038224 | men |
1970 | all | 0.9294602 | women |
1970 | all | 0.9003839 | men |
1980 | BA | 0.8098409 | women |
1980 | BA | 0.8551532 | men |
1980 | GD | NA | women |
1980 | GD | NA | men |
1980 | HS | 0.9278144 | women |
1980 | HS | 0.8840218 | men |
1980 | SC | 0.9020989 | women |
1980 | SC | 0.8737840 | men |
1980 | all | 0.9103697 | women |
1980 | all | 0.8753598 | men |
1990 | BA | 0.8230220 | women |
1990 | BA | 0.8034109 | men |
1990 | GD | 0.7970178 | women |
1990 | GD | 0.8215648 | men |
1990 | HS | 0.8832472 | women |
1990 | HS | 0.8167907 | men |
1990 | SC | 0.8820704 | women |
1990 | SC | 0.8343551 | men |
1990 | all | 0.8680386 | women |
1990 | all | 0.8196127 | men |
2000 | BA | 0.8040859 | women |
2000 | BA | 0.7697464 | men |
2000 | GD | 0.7901273 | women |
2000 | GD | 0.7961433 | men |
2000 | HS | 0.8369494 | women |
2000 | HS | 0.7694291 | men |
2000 | SC | 0.8459649 | women |
2000 | SC | 0.7917153 | men |
2000 | all | 0.8296088 | women |
2000 | all | 0.7776125 | men |
2010 | BA | 0.7676716 | women |
2010 | BA | 0.7242816 | men |
2010 | GD | 0.7699064 | women |
2010 | GD | 0.7740603 | men |
2010 | HS | 0.7434646 | women |
2010 | HS | 0.6589072 | men |
2010 | SC | 0.7724786 | women |
2010 | SC | 0.7106068 | men |
2010 | all | 0.7608086 | women |
2010 | all | 0.6946467 | men |
I had originally intended to include some designations that were combinations with the selected populations, such as whether the sample population had children, or if they had steady employment the prior year. However, these only included observations for ages 25-34, which would be unhelpful for analysis against the other columns that covered the full age range.
The final, tidy version of the data above would allow for marriage rates to be grouped and filtered by year, education and/or gender for analysis.
Since the marriage rates were all averages already, I thought a few visual representations of the changes in marriage rates could be useful.
This visualization selected the marriage rates for rows in which the education column included “all” levels, then compared those by gender and arranged by year on the x-axis:
all_rate <- filter(marriages, education == "all")
ggplot(all_rate, aes(fill = gender, x = year, y = marriage_rate)) +
geom_bar(position="dodge", stat="identity")
This next visualization also plotted marriage rates against the year in order to examine the changes to the rate over time. In this case, using ggplot2’s scatterplot grouped the observations by the gender column (represented by dot color) and the education column (represented by dot shape).
Of note below: the rates of women with Bachelor’s degrees and men with only high school education.
ggplot(marriages, aes(year, marriage_rate)) +
geom_point(aes(color = education, shape = gender))
### Conclusion
In this case, I decided to merge and tidy 2 sets of data into 1 for direct comparison; an alternative approach could have been to keep each dataset separate, and either group and summarize before analyzing each or create separate plots.
Changing the original sets into this tidy form, though, allowed for grouping and selecting by column values and analyzing the rates between men and women directly. Tidy data also fed easily into ggplot2 for visualization using code that was simple to produce and read.