Checking rows with a specific pattern
library(readr)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.0 v dplyr 0.8.4
## v tibble 2.1.3 v stringr 1.4.0
## v tidyr 1.0.2 v forcats 0.4.0
## v purrr 0.3.3
## Warning: package 'ggplot2' was built under R version 3.6.3
## -- Conflicts --------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
incomeUS <- read_csv("adultincome.csv")
## Parsed with column specification:
## cols(
##   age = col_double(),
##   workclass = col_character(),
##   fnlwgt = col_double(),
##   education = col_character(),
##   education.num = col_double(),
##   marital.status = col_character(),
##   occupation = col_character(),
##   relationship = col_character(),
##   race = col_character(),
##   sex = col_character(),
##   capital.gain = col_double(),
##   capital.loss = col_double(),
##   hours.per.week = col_double(),
##   native.country = col_character(),
##   income = col_character()
## )
incomeUS %>%
  group_by(marital.status) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))
## # A tibble: 7 x 2
##   marital.status        counts
##   <chr>                  <int>
## 1 Married-civ-spouse     14976
## 2 Never-married          10683
## 3 Divorced                4443
## 4 Separated               1025
## 5 Widowed                  993
## 6 Married-spouse-absent    418
## 7 Married-AF-spouse         23
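The same kind of frequency table can also be produced more compactly with dplyr's `count()`. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `incomeUS`, and `by_hand` and `shorthand` are names introduced here for the comparison.

```r
library(dplyr)

# Hypothetical toy data standing in for incomeUS
toy <- tibble(marital.status = c("Divorced", "Never-married", "Never-married"))

# The group_by / summarise / arrange pipeline used above
by_hand <- toy %>%
  group_by(marital.status) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))

# Shorthand: count() groups, tallies, and sorts in one call
shorthand <- toy %>%
  count(marital.status, sort = TRUE, name = "counts")

by_hand
shorthand
```

Both objects should list the labels in the same order with the same counts.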
We can put all the married people in the same group
This variable has 7 labels, three of which are:
Married-AF-spouse: Married armed forces spouse
Married-civ-spouse: Married civilian spouse
Married-spouse-absent: Married but the spouse was absent
These three levels can be grouped into a single group, “Married”.
Replace values equal to “Married-AF-spouse”, “Married-civ-spouse”, or “Married-spouse-absent” with “Married”.
# A single regular expression; "|" separates the alternative labels to match
patterns <- "Married-AF-spouse|Married-civ-spouse|Married-spouse-absent"
incomeUS <- incomeUS %>%
  mutate(marital = stringr::str_replace_all(marital.status, patterns, "Married"))
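Before running the replacement on the whole column, it can help to try it on a small hand-made vector. In the sketch below, `status` is a hypothetical vector that only mimics the labels in `marital.status`. The second call shows an anchored variant of the pattern, which assumes that every label to collapse starts with "Married-".

```r
library(stringr)

# Hypothetical toy vector mimicking the labels in marital.status
status <- c("Married-civ-spouse", "Never-married", "Married-AF-spouse",
            "Divorced", "Married-spouse-absent")

# Same alternation pattern as in the text
patterns <- "Married-AF-spouse|Married-civ-spouse|Married-spouse-absent"
collapsed <- str_replace_all(status, patterns, "Married")

# Anchored variant: replace any whole label that starts with "Married-"
collapsed_anchored <- str_replace(status, "^Married-.*$", "Married")

collapsed
identical(collapsed, collapsed_anchored)
```

"Never-married" is left alone by both versions, because regular expressions are case sensitive by default and the label does not contain "Married-".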
Check the resulting column:
incomeUS %>%
  group_by(marital) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))
## # A tibble: 5 x 2
##   marital       counts
##   <chr>          <int>
## 1 Married        15417
## 2 Never-married  10683
## 3 Divorced        4443
## 4 Separated       1025
## 5 Widowed          993
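The new "Married" count agrees with the original table: 14976 + 418 + 23 = 15417. One further way to confirm that only the three married labels were recoded is to cross-tabulate the old and new columns. The sketch below uses a small hypothetical tibble `toy` in place of `incomeUS`; `mapping` and `n_married` are names introduced here for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for incomeUS with only the column of interest
toy <- tibble(marital.status = c("Married-civ-spouse", "Married-civ-spouse",
                                 "Married-AF-spouse", "Never-married", "Divorced"))

patterns <- "Married-AF-spouse|Married-civ-spouse|Married-spouse-absent"
toy <- toy %>%
  mutate(marital = str_replace_all(marital.status, patterns, "Married"))

# One row per (original label, new label) pair that occurs in the data
mapping <- toy %>%
  count(marital.status, marital)
mapping

# Number of rows recoded to "Married"
n_married <- sum(toy$marital == "Married")
n_married
```

In `mapping`, every original label should appear exactly once, and only the labels starting with "Married-" should be paired with "Married".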