HW 4 & 5 - Univariate Stats & Data Visualization

Instructions

Homework 4

For this assignment, you must select at least two variables of interest from your dataset, provide a basic description of the variables, clean and recode as needed, and present descriptive statistics and any visualizations. Your R Markdown document should include:

Descriptions of the variables: how they were collected, any missing values, etc.
How you cleaned and coded the data, including a before/after comparison as needed
Summary descriptives of the recoded variables
Appropriate visualizations (not required)

Homework 5

Your goal is to build on the previous assignments, and create a data visualization using the dataset you have been working with in R. Then, use R Markdown to draft a document that tracks the creation of the data visualization. In particular, comment on and discuss:

What the visualization demonstrates.
Why you chose the visualization approach you selected, and what alternative approaches you considered but decided not to pursue.
What you wished, if anything, you could have executed but found limited capability to do.

I will be completing both Homework 4 and 5 in this blog post.

HW 4.1. Description of variables

I am using the ActiveDuty_MaritalStatus dataset. It includes counts of enlisted, officer, and warrant members of the military, broken down by pay grade, gender, and marital status. For this analysis I want to look at the intersection of pay grade/rank and gender/marital status.

HW 4.2. Reading in and cleaning the data

For the record, so there is a before and after of the cleaning, here is the dataset as first read in, uncleaned:

military_marriage <- read_excel("../../_data/ActiveDuty_MaritalStatus.xls", sheet = "TotalDoD")

## New names:
## * `` -> ...1
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...

paged_table(military_marriage)

Skipping the first 9 rows gets past the spreadsheet's header rows to the data itself:
Row 7 has information about the cohort (“Single Without Children”, “Single With Children”, “Joint Service Marriage”, “Civilian Marriage”).
I renamed the columns and incorporated the cohort information into the column names themselves (combining rows 7 and 8).
In naming the columns, I marked the total columns for deletion (“del”) since I can always recalculate totals.

military_marriage <- read_excel("../../_data/ActiveDuty_MaritalStatus.xls", sheet = "TotalDoD", skip = 9, col_names = c("del", "PayGrade", "SingleWithoutChildren_Male", "SingleWithoutChildren_Female", "del", "SingleWithChildren_Male", "SingleWithChildren_Female", "del", "JointServiceMarriage_Male", "JointServiceMarriage_Female", "del", "CivilianMarriage_Male", "CivilianMarriage_Female", "del", "del", "del", "del"))

## New names:
## * del -> del...1
## * del -> del...5
## * del -> del...8
## * del -> del...11
## * del -> del...14
## * ...

paged_table(military_marriage)

Time to delete those columns that I don’t need.

military_marriage <- military_marriage %>%
  select( !(starts_with("del")))

paged_table(military_marriage)

Last, delete the TOTAL rows that I don’t need and create the prettier table.

military_marriage <- military_marriage %>%
  filter(PayGrade != "TOTAL ENLISTED" & PayGrade != "TOTAL OFFICER" & PayGrade != "TOTAL WARRANT" & PayGrade != "GRAND TOTAL")

datatable(military_marriage, class = 'cell-border stripe', options = list(pageLength = 5, scrollX = TRUE), caption = 'Military Marriages', width = '100%')

**See the concluding thoughts. Given how I want to run the analysis, I realized it would help to have a separate column just for “Rank.” I added it below by splitting the original pay grade value into the rank and the pay grade level:

military_marriage2 <- military_marriage %>%
  separate(col = 1, into = c("Rank", "PayGrade"), sep = "-")

paged_table(military_marriage2)
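
To make the split concrete, here is a minimal sketch on a toy tibble. The `toy` data and its numbers are invented for illustration; only the `separate()` call mirrors the one above, and it assumes the first column holds values shaped like "E-1".

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for military_marriage (values are made up)
toy <- tibble(
  PayGrade = c("E-1", "O-3", "W-2"),
  SingleWithoutChildren_Male = c(10, 20, 30)
)

# Split column 1 at the hyphen: the letter goes to Rank,
# the number stays in PayGrade (still as character)
toy_split <- toy %>%
  separate(col = 1, into = c("Rank", "PayGrade"), sep = "-")

toy_split
```

Note that `PayGrade` is still a character column after the split, which matters later when ordering the grades.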

And, for clarity, I changed the values of the “Rank” column to be the actual ranks. I know we’re not supposed to copy and paste, but I couldn’t figure out how to group the statements, so….

military_marriage2[military_marriage2 == "E"] <- "Enlisted"
  
paged_table(military_marriage2)

military_marriage2[military_marriage2 == "O"] <- "Officer"
  
paged_table(military_marriage2)

military_marriage2[military_marriage2 == "W"] <- "Warrant"
  
paged_table(military_marriage2)
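
The three replacements above could probably be grouped into one step. A minimal sketch, assuming dplyr's `recode()` and a made-up `toy` tibble standing in for `military_marriage2`:

```r
library(dplyr)

# Hypothetical stand-in for military_marriage2 (values are made up)
toy <- tibble(
  Rank = c("E", "O", "W"),
  PayGrade = c("1", "1", "1")
)

# Recode all three letters in a single mutate(), touching only the Rank column
toy_recoded <- toy %>%
  mutate(Rank = recode(Rank, E = "Enlisted", O = "Officer", W = "Warrant"))

toy_recoded
```

One difference from the approach above: `military_marriage2[military_marriage2 == "E"]` matches any cell in the whole data frame, while this version only changes `Rank`.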

military_marriage3 <- military_marriage2 %>%
  pivot_longer(
    cols = !PayGrade & !Rank,
    names_to = c("MaritalStatus", "Gender"),
    names_sep = "_",
    values_to = "Count")

paged_table(military_marriage3)
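
For anyone following along, here is a small sketch of what that `pivot_longer()` call does. The `toy_wide` tibble and its counts are invented; the arguments are the same as above. Each column name such as `SingleWithChildren_Male` is split at the underscore into a `MaritalStatus` value and a `Gender` value.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table with the same naming pattern (counts are made up)
toy_wide <- tibble(
  Rank = c("Enlisted", "Officer"),
  PayGrade = c("1", "1"),
  SingleWithChildren_Male = c(5, 1),
  SingleWithChildren_Female = c(2, 3)
)

# Every column except Rank and PayGrade becomes rows;
# the old column name is split at "_" into two new columns
toy_long <- toy_wide %>%
  pivot_longer(
    cols = !PayGrade & !Rank,
    names_to = c("MaritalStatus", "Gender"),
    names_sep = "_",
    values_to = "Count"
  )

toy_long
```

Two wide rows with two count columns each become four long rows, one per rank, marital status, and gender combination.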

HW 4.3. Summarize the Data

# Total count by gender
military_marriage4 <- military_marriage3 %>%
  group_by(Gender) %>%
  summarise(Total = sum(Count))
military_marriage4

## # A tibble: 2 x 2
##   Gender   Total
##   <chr>    <dbl>
## 1 Female  202723
## 2 Male   1212228

# Total count by marital status
military_marriage5 <- military_marriage3 %>%
  group_by(MaritalStatus) %>%
  summarise(Total = sum(Count))
military_marriage5

## # A tibble: 4 x 2
##   MaritalStatus          Total
##   <chr>                  <dbl>
## 1 CivilianMarriage      702716
## 2 JointServiceMarriage   94890
## 3 SingleWithChildren     75900
## 4 SingleWithoutChildren 541445

# Total count by gender and marital status
military_marriage6 <- military_marriage3 %>%
  group_by(Gender, MaritalStatus) %>%
  summarise(Total = sum(Count))
military_marriage6

## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.

## # A tibble: 8 x 3
## # Groups:   Gender [2]
##   Gender MaritalStatus          Total
##   <chr>  <chr>                  <dbl>
## 1 Female CivilianMarriage       48702
## 2 Female JointServiceMarriage   45715
## 3 Female SingleWithChildren     24472
## 4 Female SingleWithoutChildren  83834
## 5 Male   CivilianMarriage      654014
## 6 Male   JointServiceMarriage   49175
## 7 Male   SingleWithChildren     51428
## 8 Male   SingleWithoutChildren 457611

# Total count by gender, rank, and pay grade
military_marriage7 <- military_marriage3 %>%
  group_by(Gender, Rank, PayGrade) %>%
  summarise(Total = sum(Count))
military_marriage7

## `summarise()` has grouped output by 'Gender', 'Rank'. You can override using the `.groups` argument.

## # A tibble: 48 x 4
## # Groups:   Gender, Rank [6]
##    Gender Rank     PayGrade Total
##    <chr>  <chr>    <chr>    <dbl>
##  1 Female Enlisted 1         6699
##  2 Female Enlisted 2        10924
##  3 Female Enlisted 3        34482
##  4 Female Enlisted 4        40782
##  5 Female Enlisted 5        37306
##  6 Female Enlisted 6        22404
##  7 Female Enlisted 7        10995
##  8 Female Enlisted 8         2422
##  9 Female Enlisted 9          787
## 10 Female Officer  1         4807
## # … with 38 more rows

Concluding Thoughts/Reflections

**(It was at this point that I realized I might want a separate column just for Rank (‘Enlisted’, ‘Officer’, and ‘Warrant’), so I went back and added that in.)

I wish I could do more dynamic summarizing with the summarise command. I want to cut the data by both Rank and Marital Status + Gender. Perhaps this is something that is better done with visualizations, which I will get to in the next homework (below).
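
One possible way to get that kind of cut, sketched here on invented numbers: group by rank and marital status together with gender, then spread gender across columns with `pivot_wider()`. The `toy_long` tibble is a hypothetical stand-in for `military_marriage3`, and the result name `by_rank` is only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table shaped like military_marriage3 (counts are made up)
toy_long <- tibble(
  Rank = c("Enlisted", "Enlisted", "Enlisted", "Enlisted", "Officer", "Officer"),
  PayGrade = c("1", "1", "2", "2", "1", "1"),
  MaritalStatus = rep("CivilianMarriage", 6),
  Gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Count = c(10, 4, 20, 6, 7, 3)
)

# Sum counts within each Rank x MaritalStatus x Gender cell,
# then put the genders side by side for easier comparison
by_rank <- toy_long %>%
  group_by(Rank, MaritalStatus, Gender) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = Gender, values_from = Total)

by_rank
```

This gives one row per rank and marital status, with a `Female` and a `Male` column holding the totals.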

HW 5. Data visualizations

Okay! Data visualization time.

The first thing I did was re-categorize PayGrade as a numeric data type instead of a character because I want to make sure I can order the grades from 1 to 10 (instead of alphabetically). I promise, this will pay off.

military_marriage3$PayGrade <- as.numeric(military_marriage3$PayGrade)
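
A quick illustration of why the conversion matters, using a made-up vector of grades rather than the real column:

```r
# Hypothetical pay grades stored as text, the way separate() leaves them
grades_chr <- c("1", "2", "10", "9")

# Sorted as text, "10" lands right after "1"
sorted_chr <- sort(grades_chr)

# Sorted as numbers, 10 comes last
sorted_num <- sort(as.numeric(grades_chr))

sorted_chr
sorted_num
```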

Next, a simple breakdown of gender in the military by pay level.

#for reference, our tidy table is military_marriage3
ggplot(data = military_marriage3, mapping = aes(fill = Gender, x = PayGrade, y=Count)) +
  geom_bar(position= "stack", stat= "identity") +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = comma)

ggplot(data = military_marriage3, mapping = aes(fill = Gender, x = PayGrade, y=Count)) +
  geom_bar(position= "fill", stat= "identity") +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = percent)
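
One alternative worth sketching: pay grades repeat across Enlisted, Officer, and Warrant, so the plots above stack all three ranks together at each grade. A faceted version would give each rank its own panel. This is only a sketch on invented counts; `toy_long` is a hypothetical stand-in for `military_marriage3`, and `geom_col()` is used as shorthand for `geom_bar(stat = "identity")`.

```r
library(dplyr)
library(ggplot2)
library(scales)

# Hypothetical long table with numeric pay grades (counts are made up)
toy_long <- tibble(
  Rank = c("Enlisted", "Enlisted", "Officer", "Officer", "Warrant", "Warrant"),
  PayGrade = c(1, 1, 1, 1, 1, 1),
  Gender = rep(c("Male", "Female"), 3),
  Count = c(80, 20, 30, 10, 9, 1)
)

# Share of each gender per pay grade, one panel per rank
p <- ggplot(toy_long, aes(x = PayGrade, y = Count, fill = Gender)) +
  geom_col(position = "fill") +
  facet_wrap(~ Rank) +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = percent)

p
```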

This was really a basic beginning to data visualization and I am excited to get more involved with the next homework and the final project!