Overview

This is part 1 of Project 2 for Data 607. I am using marital status data from Census.gov. I create a .csv file, read and transform the data in R with the tidyverse, and then analyze it. The goal is to compare the proportions of men and women who have never been married and to find the age bracket where the difference between those proportions is largest.

First, I created a .csv file of the marriage data and added it to my GitHub repository.

# Load the tidyverse (tibble, dplyr and tidyr are used below)
library(tidyverse)

marriage <- read.csv("https://raw.githubusercontent.com/evelynbartley/Data-607/main/ACSST1Y2022.S1201-2024-02-20T191328.csv", header = TRUE)
tibble(marriage)
## # A tibble: 37 × 13
##    Label..Grouping.                United.States..Total…¹ United.States..Total…²
##    <chr>                           <chr>                  <chr>                 
##  1 Population 15 years and over    "273,938,835"          "±39,204"             
##  2     AGE AND SEX                 ""                     ""                    
##  3         Males 15 years and over "134,829,992"          "±33,498"             
##  4             15 to 19 years      "11,167,522"           "±29,157"             
##  5             20 to 34 years      "34,518,927"           "±31,844"             
##  6             35 to 44 years      "22,262,365"           "±21,601"             
##  7             45 to 54 years      "20,300,592"           "±21,231"             
##  8             55 to 64 years      "20,655,942"           "±13,723"             
##  9             65 years and over   "25,924,644"           "±13,171"             
## 10         Females 15 years and o… "139,108,843"          "±29,350"             
## # ℹ 27 more rows
## # ℹ abbreviated names: ¹​United.States..Total..Estimate,
## #   ²​United.States..Total..Margin.of.Error
## # ℹ 10 more variables:
## #   United.States..Now.married..except.separated...Estimate <chr>,
## #   United.States..Now.married..except.separated...Margin.of.Error <chr>,
## #   United.States..Widowed..Estimate <chr>, …
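
The dots in names such as Label..Grouping. come from read.csv(), which by default passes the header through make.names() so that every column name is a syntactically valid R name. Here is a minimal sketch of that behaviour on a small inline CSV. The header text is assumed for illustration and is not copied from the actual file, and the object names csv_text, mangled and verbatim are hypothetical.

```r
# Inline stand-in for the downloaded file; the header text is an assumption
csv_text <- '"Label (Grouping)","United States!!Total!!Estimate"
"Population 15 years and over","273,938,835"
"    15 to 19 years","11,167,522"'

# Default (check.names = TRUE): characters that are invalid in R names become dots
mangled <- read.csv(text = csv_text, header = TRUE)
names(mangled)

# check.names = FALSE keeps the header exactly as written in the file
verbatim <- read.csv(text = csv_text, header = TRUE, check.names = FALSE)
names(verbatim)
```

Keeping the default names, as done above, has the advantage that the columns can be referenced in select() without backticks.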

Next, I select and rename only the columns needed for this analysis: the group label, the total population estimate, and the never-married estimate.

marriage1 <- marriage |> 
  select(
    Group = Label..Grouping.,
    Total = United.States..Total..Estimate,
    Nevermarried = United.States..Never.married..Estimate
  )
tibble(marriage1)
## # A tibble: 37 × 3
##    Group                             Total         Nevermarried
##    <chr>                             <chr>         <chr>       
##  1 Population 15 years and over      "273,938,835" "34.3%"     
##  2     AGE AND SEX                   ""            ""          
##  3         Males 15 years and over   "134,829,992" "37.2%"     
##  4             15 to 19 years        "11,167,522"  "98.9%"     
##  5             20 to 34 years        "34,518,927"  "71.3%"     
##  6             35 to 44 years        "22,262,365"  "29.3%"     
##  7             45 to 54 years        "20,300,592"  "17.1%"     
##  8             55 to 64 years        "20,655,942"  "13.2%"     
##  9             65 years and over     "25,924,644"  "6.9%"      
## 10         Females 15 years and over "139,108,843" "31.6%"     
## # ℹ 27 more rows
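
Both columns are still character strings: Total contains thousands separators and Nevermarried ends in a percent sign, so neither can be used in arithmetic yet. Here is a minimal base R sketch of the conversion, on three rows copied from the output above. The names demo, total_n and never_pct are hypothetical, and readr::parse_number() would be a tidyverse alternative.

```r
# Three age-bracket rows copied from the printed table
demo <- data.frame(
  Group = c("15 to 19 years", "20 to 34 years", "35 to 44 years"),
  Total = c("11,167,522", "34,518,927", "22,262,365"),
  Nevermarried = c("98.9%", "71.3%", "29.3%"),
  stringsAsFactors = FALSE
)

# Drop the thousands separators, then convert to numeric
demo$total_n <- as.numeric(gsub(",", "", demo$Total, fixed = TRUE))

# Drop the percent sign, then convert to numeric (values stay in percent units)
demo$never_pct <- as.numeric(sub("%", "", demo$Nevermarried, fixed = TRUE))

str(demo)
```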

Now I want to get rid of the rows that are irrelevant to my analysis.

# Keep only the age-bracket rows. Rows 4-9 are the brackets listed under
# "Males 15 years and over" (row 3); rows 11-16 are the brackets listed
# under "Females 15 years and over" (row 10).
marriage4 <- marriage1[c(4:9, 11:16), ]

# Add gender to the age brackets; paste() is vectorized, so sapply() is not needed
marriage4$Group[1:6] <- paste("Male", marriage4$Group[1:6])
marriage4$Group[7:12] <- paste("Female", marriage4$Group[7:12])
head(marriage4)
##                                Group      Total Nevermarried
## 4    Male             15 to 19 years 11,167,522        98.9%
## 5    Male             20 to 34 years 34,518,927        71.3%
## 6    Male             35 to 44 years 22,262,365        29.3%
## 7    Male             45 to 54 years 20,300,592        17.1%
## 8    Male             55 to 64 years 20,655,942        13.2%
## 9 Male             65 years and over 25,924,644         6.9%

Let’s separate gender from age.

# Split on the first whitespace character; extra = "merge" keeps the rest in Age
marriage5 <- marriage4 |>
  separate(Group, into = c("Gender", "Age"), sep = "\\s", extra = "merge")
tibble(marriage5)
## # A tibble: 12 × 4
##    Gender Age                           Total      Nevermarried
##    <chr>  <chr>                         <chr>      <chr>       
##  1 Male               15 to 19 years    11,167,522 98.9%       
##  2 Male               20 to 34 years    34,518,927 71.3%       
##  3 Male               35 to 44 years    22,262,365 29.3%       
##  4 Male               45 to 54 years    20,300,592 17.1%       
##  5 Male               55 to 64 years    20,655,942 13.2%       
##  6 Male               65 years and over 25,924,644 6.9%        
##  7 Female             15 to 19 years    10,618,136 98.7%       
##  8 Female             20 to 34 years    33,160,377 63.5%       
##  9 Female             35 to 44 years    21,785,279 23.6%       
## 10 Female             45 to 54 years    20,175,854 14.3%       
## 11 Female             55 to 64 years    21,471,526 10.6%       
## 12 Female             65 years and over 31,897,671 6.4%

Analysis and Conclusion

We are comparing the proportions of men and women who have never been married within each age bracket, to find the bracket where the difference is largest. When we view marriage5 in the RStudio data viewer, the small arrows on the “Age” column header sort the table so that the Male and Female rows for each bracket sit next to each other. Comparing them bracket by bracket, the gap is largest for ages 20 to 34: 71.3% of men and 63.5% of women in that bracket have never been married, a difference of 71.3 - 63.5 = 7.8 percentage points. The next largest gap is 5.7 percentage points at ages 35 to 44, and in every bracket the never-married share is higher for men than for women. These conclusions are based on the 2022 American Community Survey 1-year estimates published by the U.S. Census Bureau.
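
The same comparison can also be made in code instead of by eye. The base R sketch below is one possible approach and not the method used above. It rebuilds the relevant rows from the values printed earlier, takes each row's gender from the nearest preceding "Males ..." or "Females ..." heading in the table instead of from row positions, and then computes the gap per bracket. All object names (raw, brackets, male, female, wide) are hypothetical.

```r
# Rows in table order, with the two gender headings kept in place
raw <- data.frame(
  Group = c("Males 15 years and over",
            "    15 to 19 years", "    20 to 34 years", "    35 to 44 years",
            "    45 to 54 years", "    55 to 64 years", "    65 years and over",
            "Females 15 years and over",
            "    15 to 19 years", "    20 to 34 years", "    35 to 44 years",
            "    45 to 54 years", "    55 to 64 years", "    65 years and over"),
  Nevermarried = c("37.2%", "98.9%", "71.3%", "29.3%", "17.1%", "13.2%", "6.9%",
                   "31.6%", "98.7%", "63.5%", "23.6%", "14.3%", "10.6%", "6.4%"),
  stringsAsFactors = FALSE
)
raw$Group <- trimws(raw$Group)

# Each heading row starts a new block; carry its gender down to the rows below it
is_heading <- grepl("^(Males|Females) ", raw$Group)
heading <- raw$Group[is_heading][cumsum(is_heading)]
raw$Gender <- ifelse(startsWith(heading, "Males"), "Male", "Female")

# Keep the age brackets only and convert the percentages to numbers
brackets <- raw[!is_heading, ]
brackets$pct <- as.numeric(sub("%", "", brackets$Nevermarried, fixed = TRUE))

# One row per age bracket, one column per gender
male <- brackets[brackets$Gender == "Male", c("Group", "pct")]
female <- brackets[brackets$Gender == "Female", c("Group", "pct")]
wide <- merge(male, female, by = "Group", suffixes = c("_male", "_female"))

# Gap in percentage points (male minus female), largest absolute gap first
wide$gap <- round(wide$pct_male - wide$pct_female, 1)
wide <- wide[order(-abs(wide$gap)), ]
print(wide, row.names = FALSE)
```

Under these assumptions the first row of wide is the 20 to 34 years bracket, with a gap of 7.8 percentage points. Reading the gender from the heading rows avoids depending on which block of rows happens to come first in the table.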