library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/njnav/OneDrive/Data 101/Projects")
military <- read_csv("military.csv")
## Rows: 1414593 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): grade, branch, gender, race
## dbl (1): rank
## lgl (1): hisp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Is there a relationship between a person’s rank in the military and their gender?
Many people enlist in the military to serve their country, whether in the Army, Marine Corps, Navy, Air Force, or Coast Guard. Once in the military, a person’s rank can increase based on several factors, such as achievements, performance, and time served. One factor I wanted to examine is whether a person’s gender plays a part in their rank.
The data set I will be using to answer this question is titled military. The data were collected by the Department of Defense on 02-20-2012 and cover the Army, Navy, Air Force, and Marine Corps. There are 1,414,593 observations and 6 variables. The data set can be found on OpenIntro at https://www.openintro.org/data/index.php?data=military.
The variables I will be using are rank and gender. Rank is stored as a number, but I treat both variables as categorical.
rank: This is their numeric rank with higher numbers meaning higher ranks
gender: This is the gender of the person
I start by checking the head and structure of the data set. Everything looks good, so I check whether any of the columns contain NAs. None of them do, so there is not much to clean up.
head(military)
## # A tibble: 6 × 6
## grade branch gender race hisp rank
## <chr> <chr> <chr> <chr> <lgl> <dbl>
## 1 officer army male ami/aln TRUE 2
## 2 officer army male ami/aln TRUE 2
## 3 officer army male ami/aln TRUE 5
## 4 officer army male ami/aln TRUE 5
## 5 officer army male ami/aln TRUE 5
## 6 officer army male ami/aln TRUE 5
str(military)
## spc_tbl_ [1,414,593 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ grade : chr [1:1414593] "officer" "officer" "officer" "officer" ...
## $ branch: chr [1:1414593] "army" "army" "army" "army" ...
## $ gender: chr [1:1414593] "male" "male" "male" "male" ...
## $ race : chr [1:1414593] "ami/aln" "ami/aln" "ami/aln" "ami/aln" ...
## $ hisp : logi [1:1414593] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ rank : num [1:1414593] 2 2 5 5 5 5 5 7 10 2 ...
## - attr(*, "spec")=
## .. cols(
## .. grade = col_character(),
## .. branch = col_character(),
## .. gender = col_character(),
## .. race = col_character(),
## .. hisp = col_logical(),
## .. rank = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colSums(is.na(military))
## grade branch gender race hisp rank
## 0 0 0 0 0 0
Here I convert both variables to factors so the data are easier to plot. I also make a table for each to see the total count in each group.
military <- military|>
mutate(gender_factor = factor(gender))|>
mutate(rank_factor = factor(rank))
table(military$gender_factor)
##
## female male
## 202718 1211875
table(military$rank_factor)
##
## 1 2 3 4 5 6 7 8 9 10 11
## 9 76159 38 113016 301070 324590 279320 183252 98525 28093 10521
Here I group the ranks into five groups: “1-3”, “4-5”, “6-7”, “8-9”, and “10-11”. I did this because rank 1 and rank 3 contain very few people (9 and 38). With the ungrouped ranks, not every expected count was at least 5, which caused problems for the chi-squared test and made its results unreliable.
military <- military |>
mutate(
rank_recoded = case_when(
rank_factor %in% c("1","2","3") ~ "1-3",
rank_factor %in% c("4","5") ~ "4-5",
rank_factor %in% c("6","7") ~ "6-7",
rank_factor %in% c("8","9") ~ "8-9",
rank_factor %in% c("10","11") ~ "10-11"),
rank_recoded = factor(rank_recoded,
levels = c("1-3", "4-5", "6-7", "8-9", "10-11")))
table(military$rank_recoded)
##
## 1-3 4-5 6-7 8-9 10-11
## 76206 414086 603910 281777 38614
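The expected-count problem with the ungrouped ranks can be seen from the totals alone. The sketch below is a hand calculation rather than part of the original analysis: it copies the gender and rank totals from the tables above and computes the expected count for each cell under independence (row total times column total, divided by n). The object names `gender_totals`, `rank_totals`, and `expected_ungrouped` are my own and only for illustration.

```r
# Totals copied from table(military$gender_factor) and table(military$rank_factor)
gender_totals <- c(female = 202718, male = 1211875)
rank_totals <- c("1" = 9, "2" = 76159, "3" = 38, "4" = 113016, "5" = 301070,
                 "6" = 324590, "7" = 279320, "8" = 183252, "9" = 98525,
                 "10" = 28093, "11" = 10521)
n <- sum(gender_totals)

# Expected count under independence: row total * column total / n
expected_ungrouped <- outer(gender_totals, rank_totals) / n

# Look at the two sparse ranks, then list every cell with an expected count below 5
round(expected_ungrouped[, c("1", "3")], 2)
which(expected_ungrouped < 5, arr.ind = TRUE)
```

With these totals, the only cell that falls below 5 is the expected count of women at rank 1 (about 1.3); rank 3 sits just above the threshold at about 5.4.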
Here I create bar plots to show the total number of people of each gender and in each rank. A third bar plot shows the number of each gender in each rank, with pink for women and light blue for men. A fourth bar plot shows the number of people in each rank group. For fun I also made a plot of the rank counts within each gender, but it doesn’t show the data well because the colors are hard to match to the ranks.
barplot(table(military$gender_factor),
main = "Total Count by Gender",
xlab = "Gender",
ylab = "Count",
col = "purple")
barplot(table(military$rank_factor),
main = "Total Count by Rank",
xlab = "Rank",
ylab = "Count",
col = "violet")
barplot(table(military$gender_factor, military$rank_factor),
beside = TRUE,
main = "Amount of Gender in Rank",
xlab = "Rank",
ylab = "Count",
col = c("pink","lightblue"))
barplot(table(military$rank_recoded),
main = "Total Count by Rank Group",
xlab = "Rank",
ylab = "Count",
col = "maroon")
barplot(table(military$rank_factor, military$gender_factor),
beside = TRUE,
main = "Amount of Rank in Gender",
xlab = "Gender",
ylab = "Count",
col = c("black","red","orange","yellow","green","blue","lightblue", "purple", "magenta", "grey", "brown")) # 11 colors, one per rank, so none is recycled
To help me answer my question I am going to perform a chi-squared test for association at a 5% significance level.
Hypotheses
\(H_0\): A person’s rank in the military is not associated with their gender
\(H_a\): A person’s rank in the military is associated with their gender
α = 0.05
observed_dataset<- table(military$gender, military$rank_recoded)
observed_dataset
##
## 1-3 4-5 6-7 8-9 10-11
## female 12359 64196 88155 34755 3253
## male 63847 349890 515755 247022 35361
test <- chisq.test(observed_dataset)
test
##
## Pearson's Chi-squared test
##
## data: observed_dataset
## X-squared = 2731.7, df = 4, p-value < 2.2e-16
test$expected
##
## 1-3 4-5 6-7 8-9 10-11
## female 10920.69 59340.52 86543.22 40380 5533.572
## male 65285.31 354745.48 517366.78 241397 33080.428
test$statistic
## X-squared
## 2731.683
All of the expected counts are well above 5, so the conditions for the chi-squared test are met and its results, including the p-value, are reliable.
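To make the test output less of a black box, the expected counts and the test statistic can be reproduced by hand from the observed table. This is a minimal sketch with the observed counts typed in from the output above; `obs`, `expected`, `x2`, `df`, and `p_value` are names I chose for illustration, not objects from the original analysis.

```r
# Observed counts copied from observed_dataset above
obs <- matrix(c(12359,  64196,  88155,  34755,  3253,
                63847, 349890, 515755, 247022, 35361),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("female", "male"),
                              c("1-3", "4-5", "6-7", "8-9", "10-11")))

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
x2 <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(expected, 2)
x2
df
```

The hand calculation should agree with `test$expected` and `test$statistic`, since `chisq.test()` only applies a continuity correction to 2 × 2 tables.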
The chi-squared test gives a test statistic of 2731.7 with 4 degrees of freedom (df) and a p-value < 2.2e-16. This p-value is extremely small and far below α = 0.05, so we reject the null hypothesis. We have enough evidence at the 0.05 significance level to conclude that there is an association between a person’s rank in the military and their gender.
After performing my chi-squared test for association, I conclude that a person’s gender is associated with their rank in the military. The test gave a p-value < 2.2e-16, which is enough evidence to say there is an association between the two, although a test of association cannot by itself show that gender causes the difference in rank. Looking back at the bar plot of each gender in each rank, you can see that the peak for women is around rank 5 while the peak for men is at rank 6. The observed counts point the same way: there are fewer women than expected in the “8-9” group (34,755 observed versus about 40,380 expected) and in the “10-11” group (3,253 versus about 5,534). This suggests that men hold higher ranks more often than women.
The data set had very few people in rank 1 and rank 3, so I had to group the ranks together in order to perform the chi-squared test. In the future, if I can find a data set that is distributed more evenly, I could run the test again to see whether the results change. It would also be interesting to test for associations with other variables in this data set, such as a person’s race and their rank.
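One way to see the direction of the association, beyond the bar plots, is to look at the share of women within each rank group. The sketch below rebuilds the observed table from the output above and uses `prop.table()` with `margin = 2` to get column proportions; the object names `obs`, `col_props`, and `expected` are illustrative and not part of the original analysis.

```r
# Observed counts copied from observed_dataset above
obs <- matrix(c(12359,  64196,  88155,  34755,  3253,
                63847, 349890, 515755, 247022, 35361),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("female", "male"),
                              c("1-3", "4-5", "6-7", "8-9", "10-11")))

# Column proportions: share of each gender within each rank group
col_props <- prop.table(obs, margin = 2)
round(col_props["female", ], 3)

# Observed minus expected counts; negative values mean fewer people than
# independence would predict
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(obs - expected)
```

In these counts the female share falls steadily from the “1-3” group to the “10-11” group, which matches the pattern described above.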
Data set from OpenIntro at https://www.openintro.org/data/index.php?data=military. Data collected by the Department of Defense on 02-20-2012.