For this assignment, you must select at least two variables of interest from your dataset, provide a basic description of the variables, clean and recode as needed, and present descriptive statistics and any visualizations. Your R Markdown document should include:
Your goal is to build on the previous assignments, and create a data visualization using the dataset you have been working with in R. Then, use R Markdown to draft a document that tracks the creation of the data visualization. In particular, comment on and discuss:
I will be completing both Homework 4 and 5 in this blog post.
I am using the ActiveDuty_MaritalStatus data set. It includes counts of enlisted, officer, and warrant members of the military, their paygrade, their gender, and their marital status. For this analysis I want to look at the intersection of Pay Grade/Rank and Gender/Marital status.
For the record, and to show the before and after of my cleaning, here is the dataset uncleaned:
military_marriage <- read_excel("../../_data/ActiveDuty_MaritalStatus.xls", sheet = "TotalDoD")
## New names:
## * `` -> ...1
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
paged_table(military_marriage)
military_marriage <- read_excel(
  "../../_data/ActiveDuty_MaritalStatus.xls",
  sheet = "TotalDoD",
  skip = 9,
  # Columns named "del" are placeholders that get dropped in the next step.
  col_names = c(
    "del", "PayGrade",
    "SingleWithoutChildren_Male", "SingleWithoutChildren_Female", "del",
    "SingleWithChildren_Male", "SingleWithChildren_Female", "del",
    "JointServiceMarriage_Male", "JointServiceMarriage_Female", "del",
    "CivilianMarriage_Male", "CivilianMarriage_Female", "del",
    "del", "del", "del"
  )
)
## New names:
## * del -> del...1
## * del -> del...5
## * del -> del...8
## * del -> del...11
## * del -> del...14
## * ...
paged_table(military_marriage)
military_marriage <- military_marriage %>%
  select(!starts_with("del"))
paged_table(military_marriage)
military_marriage <- military_marriage %>%
  filter(PayGrade != "TOTAL ENLISTED" & PayGrade != "TOTAL OFFICER" &
           PayGrade != "TOTAL WARRANT" & PayGrade != "GRAND TOTAL")
datatable(military_marriage, class = 'cell-border stripe', options = list(pageLength = 5, scrollX = TRUE), caption = 'Military Marriages', width = '100%')
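The four `!=` comparisons in the filter above could probably be collapsed with `%in%`. This is only a sketch on a made-up `toy` tibble (the names `toy`, `total_rows`, and `toy_filtered` and all the counts are hypothetical, not the real data). One difference worth knowing: `!=` drops rows where PayGrade is `NA`, while `!PayGrade %in% ...` keeps them.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the imported sheet; the counts are made up for illustration.
toy <- tibble(
  PayGrade = c("E-1", "E-2", "TOTAL ENLISTED", "O-1", "TOTAL OFFICER", "GRAND TOTAL"),
  Count = c(1, 2, 3, 4, 4, 7)
)

# The labels of the subtotal rows that should not be counted twice.
total_rows <- c("TOTAL ENLISTED", "TOTAL OFFICER", "TOTAL WARRANT", "GRAND TOTAL")

# Keep only the rows whose PayGrade is not one of the subtotal labels.
toy_filtered <- toy %>%
  filter(!PayGrade %in% total_rows)

print(toy_filtered)
```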
**See the conclusion: given the way I want to run the analysis, I realized it would be helpful to have an additional column just for “Rank.” So below I add it by splitting the original pay grade value into the rank and the pay grade level:
military_marriage2 <- military_marriage %>%
  separate(col = 1, into = c("Rank", "PayGrade"), sep = "-")
paged_table(military_marriage2)
And, for clarity, I changed the values in the “Rank” column to the actual rank names. I know we’re not supposed to copy and paste, but I couldn’t figure out how to group the statements, so…
military_marriage2[military_marriage2 == "E"] <- "Enlisted"
paged_table(military_marriage2)
military_marriage2[military_marriage2 == "O"] <- "Officer"
paged_table(military_marriage2)
military_marriage2[military_marriage2 == "W"] <- "Warrant"
paged_table(military_marriage2)
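One possible way to group the three replacements above into a single statement is `dplyr::recode()` inside `mutate()`. This is a minimal sketch on a made-up `toy` tibble standing in for military_marriage2 (the names `toy` and `toy_recoded` and the counts are hypothetical). Unlike the whole-table replacement, it only touches the Rank column.

```r
library(dplyr)
library(tibble)

# Toy stand-in for military_marriage2; the counts are made up for illustration.
toy <- tibble(
  Rank = c("E", "O", "W"),
  PayGrade = c("1", "1", "1"),
  SingleWithoutChildren_Male = c(10, 20, 30)
)

# recode() maps each old value to its replacement in one call.
toy_recoded <- toy %>%
  mutate(Rank = recode(Rank, E = "Enlisted", O = "Officer", W = "Warrant"))

print(toy_recoded)
```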
military_marriage3 <- military_marriage2 %>%
  pivot_longer(
    cols = !PayGrade & !Rank,
    names_to = c("MaritalStatus", "Gender"),
    names_sep = "_",
    values_to = "Count"
  )
paged_table(military_marriage3)
military_marriage4 <- military_marriage3
group_by(military_marriage3, Gender) %>%
  summarise(Total = sum(Count))
## # A tibble: 2 x 2
## Gender Total
## <chr> <dbl>
## 1 Female 202723
## 2 Male 1212228
military_marriage5 <- military_marriage3
group_by(military_marriage5, MaritalStatus) %>%
  summarise(Total = sum(Count))
## # A tibble: 4 x 2
## MaritalStatus Total
## <chr> <dbl>
## 1 CivilianMarriage 702716
## 2 JointServiceMarriage 94890
## 3 SingleWithChildren 75900
## 4 SingleWithoutChildren 541445
military_marriage6 <- military_marriage3
group_by(military_marriage6, Gender, MaritalStatus) %>%
  summarise(Total = sum(Count))
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
## # A tibble: 8 x 3
## # Groups: Gender [2]
## Gender MaritalStatus Total
## <chr> <chr> <dbl>
## 1 Female CivilianMarriage 48702
## 2 Female JointServiceMarriage 45715
## 3 Female SingleWithChildren 24472
## 4 Female SingleWithoutChildren 83834
## 5 Male CivilianMarriage 654014
## 6 Male JointServiceMarriage 49175
## 7 Male SingleWithChildren 51428
## 8 Male SingleWithoutChildren 457611
military_marriage7 <- military_marriage3
group_by(military_marriage7, Gender, Rank, PayGrade) %>%
  summarise(Total = sum(Count))
## `summarise()` has grouped output by 'Gender', 'Rank'. You can override using the `.groups` argument.
## # A tibble: 48 x 4
## # Groups: Gender, Rank [6]
## Gender Rank PayGrade Total
## <chr> <chr> <chr> <dbl>
## 1 Female Enlisted 1 6699
## 2 Female Enlisted 2 10924
## 3 Female Enlisted 3 34482
## 4 Female Enlisted 4 40782
## 5 Female Enlisted 5 37306
## 6 Female Enlisted 6 22404
## 7 Female Enlisted 7 10995
## 8 Female Enlisted 8 2422
## 9 Female Enlisted 9 787
## 10 Female Officer 1 4807
## # … with 38 more rows
**(It was at this point that I realized I might want a separate column just for Rank (‘Enlisted’, ‘Officer’, and ‘Warrant’), so I went back and added that in.)
I wish I could do more dynamic summarizing with the summarise command. I want to cut the data by both Rank and Marital Status + Gender. Perhaps this is something that is better done with visualizations, which I will get to in the next homework (below).
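A cut by rank and by marital status plus gender may be possible by grouping on all three columns and then spreading the result with `tidyr::pivot_wider()`. This is a sketch under assumptions: `toy_long` is a made-up stand-in for military_marriage3 with invented counts, and `rank_by_status` is a hypothetical name.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for military_marriage3 (long format); counts are made up.
toy_long <- tribble(
  ~Rank,      ~MaritalStatus,          ~Gender,  ~Count,
  "Enlisted", "CivilianMarriage",      "Male",   50,
  "Enlisted", "CivilianMarriage",      "Female", 10,
  "Enlisted", "SingleWithoutChildren", "Male",   40,
  "Enlisted", "SingleWithoutChildren", "Female", 20,
  "Officer",  "CivilianMarriage",      "Male",   30,
  "Officer",  "CivilianMarriage",      "Female", 5,
  "Officer",  "SingleWithoutChildren", "Male",   15,
  "Officer",  "SingleWithoutChildren", "Female", 5
)

# One row per rank, one column per marital status and gender combination.
rank_by_status <- toy_long %>%
  group_by(Rank, MaritalStatus, Gender) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = c(MaritalStatus, Gender), values_from = Total)

print(rank_by_status)
```

Setting `.groups = "drop"` returns an ungrouped table, which also avoids the "has grouped output by" message.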
Okay! Data visualization time.
The first thing I did was re-categorize PayGrade as a numeric data type instead of a character because I want to make sure I can order the grades from 1 to 10 (instead of alphabetically). I promise, this will pay off.
military_marriage3$PayGrade <- as.numeric(military_marriage3$PayGrade)
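To illustrate why the conversion matters, here is a small sketch with a hypothetical vector of grades (`grades_chr` and `grades_num` are illustrative names, not part of the dataset). As character strings, "10" sorts ahead of "2" because the comparison goes character by character. As numbers, the order is the expected one.

```r
# Pay grades stored as text, the way they come out of separate().
grades_chr <- c("1", "2", "10", "9")

# Character sorting compares strings one character at a time, so "10" lands before "2".
sorted_chr <- sort(grades_chr)
print(sorted_chr)

# After conversion the values sort numerically.
grades_num <- as.numeric(grades_chr)
sorted_num <- sort(grades_num)
print(sorted_num)
```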
Next, a simple breakdown of gender in the military by pay level.
# For reference, our tidy table is military_marriage3.
# PayGrade is numeric, so it needs a continuous x scale; breaks = 1:10 labels every grade.
ggplot(data = military_marriage3, mapping = aes(fill = Gender, x = PayGrade, y = Count)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = comma)
# Same plot, with position = "fill" to show each pay grade's gender split as a proportion.
ggplot(data = military_marriage3, mapping = aes(fill = Gender, x = PayGrade, y = Count)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = percent)
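Because PayGrade now holds only the number, each bar in the plots above combines Enlisted, Officer, and Warrant members who share that number. One option for separating them is `facet_wrap(~Rank)`. The sketch below is not the original analysis: `toy_long` is a made-up stand-in for military_marriage3 with invented counts.

```r
library(ggplot2)
library(tibble)

# Toy stand-in for military_marriage3; the counts are made up for illustration.
toy_long <- tribble(
  ~Rank,      ~PayGrade, ~Gender,  ~Count,
  "Enlisted", 1,         "Male",   80,
  "Enlisted", 1,         "Female", 20,
  "Enlisted", 2,         "Male",   70,
  "Enlisted", 2,         "Female", 30,
  "Officer",  1,         "Male",   60,
  "Officer",  1,         "Female", 40,
  "Officer",  2,         "Male",   75,
  "Officer",  2,         "Female", 25
)

# One panel per rank, with the gender split shown as a proportion per pay grade.
p <- ggplot(toy_long, aes(x = PayGrade, y = Count, fill = Gender)) +
  geom_col(position = "fill") +
  facet_wrap(~Rank) +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = scales::percent)

print(p)
```

`geom_col()` is shorthand for `geom_bar(stat = "identity")`, so the bars are built the same way as in the plots above.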
This was really a basic beginning to data visualization and I am excited to get more involved with the next homework and the final project!