Overview

This project examines whether women are less likely than men to hold a higher military rank.

Introduction

The data used for this project was obtained through the openintro library in R and was provided by the Department of Defense (DOD). It was retrieved in 2012. The DOD is the government agency in charge of the US military and its affairs. It provides and trains the military needed to maintain national security and protect our country from war. The DOD collects demographic data each year to observe who makes up its military.

Using this data, I will be looking to answer the question: Are women less likely to be of higher military rank than men?

I find this question worth researching because it has been 78 years since women were first allowed to join the military, and I'd like to see how much progress has been made with regard to gender equality. Since this data set is from 2012, we will only be observing the progress made in the first 64 of those years.

Exploring the Data

#Load and store data in environment
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
military2012 <- military
#View the structure of data set
str(military2012)
## tibble [1,414,593 × 6] (S3: tbl_df/tbl/data.frame)
##  $ grade : Factor w/ 3 levels "enlisted","officer",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ branch: Factor w/ 4 levels "air force","army",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ gender: Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ race  : Factor w/ 7 levels "ami/aln","asian",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ hisp  : logi [1:1414593] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ rank  : int [1:1414593] 2 2 5 5 5 5 5 7 10 2 ...
#Check for missing data
sum(is.na(military2012))
## [1] 0

The data set contains 1,414,593 observations of US military personnel. Six variables are recorded: grade, branch of military, gender, race, whether the person is Hispanic, and rank. There is no missing data.

#View the summary of the data set 
summary(military2012)
##              grade                  branch          gender       
##  enlisted       :1183889   air force   :331486   female: 202718  
##  officer        : 211525   army        :555849   male  :1211875  
##  warrant officer:  19179   marine corps:202966                   
##                            navy        :324292                   
##                                                                  
##                                                                  
##                                                                  
##       race           hisp              rank       
##  ami/aln: 23984   Mode :logical   Min.   : 1.000  
##  asian  : 51735   FALSE:1265480   1st Qu.: 5.000  
##  black  :241133   TRUE :149113    Median : 6.000  
##  multi  : 26054                   Mean   : 6.194  
##  p/i    :  8703                   3rd Qu.: 7.000  
##  unk    : 71269                   Max.   :11.000  
##  white  :991715

The mean rank is about 6.2 and the median is 6, which is in the middle of the scale. Ranks range from 1 (lowest) to 11 (highest). Most notable from this summary is that there are far more men than women in the military, about six times as many.

barplot(table(military2012$gender, military2012$rank), xlab = "Military Rank", ylab = "Number of People", main = "Female vs Male in Military Ranks", col = c("Light Yellow", "Light Blue"), legend.text = c("Female", "Male"))

Visually, it seems that there are more men of higher rank than women and that the military is disproportionately made up of men. The ratio is approximately six men for every woman.
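The six-to-one ratio can be checked directly from the gender counts in the summary output. The following is a minimal sketch; the names `n_gender` and `men_per_woman` are illustrative and not part of the original analysis.

```r
# Gender counts copied from the summary() output above
n_gender <- c(female = 202718, male = 1211875)

# Number of men for every woman in the data set
men_per_woman <- unname(n_gender["male"] / n_gender["female"])
men_per_woman

# Share of each gender in the whole data set (the two shares sum to 1)
gender_share <- prop.table(n_gender)
gender_share

# Because men outnumber women so heavily, raw counts make the bars hard to
# compare. Assuming military2012 is loaded as above, the share of each rank
# within each gender could be plotted instead:
# rank_share <- prop.table(table(military2012$gender, military2012$rank), margin = 1)
# barplot(rank_share, beside = TRUE, legend.text = c("Female", "Male"))
```

Plotting within-gender shares rather than counts is one possible design choice; it puts both genders on the same scale so that the shapes of the two rank distributions can be compared.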

Analysis

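To answer my question, I will perform a hypothesis test for two independent proportions at a significance level of α = 0.05. Here \(p_F\) and \(p_M\) are the proportions of female and male personnel who hold a high rank, which I define as a rank of 8 or above.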

My hypotheses:

\(H_0: p_F = p_M\)

\(H_A: p_F \neq p_M\)

#Create the subsets between gender to find size
milwomen <- subset(military2012, gender == "female")
milmen <- subset(military2012, gender == "male")

nrow(milwomen)
## [1] 202718
nrow(milmen)
## [1] 1211875
#Create the subsets of higher rank to find size
highmilwomen <- subset(milwomen, rank >= 8)
highmilmen <- subset(milmen, rank >= 8)

nrow(highmilwomen)
## [1] 38008
nrow(highmilmen)
## [1] 282383
#Calculate test statistic and p-value using prop.test
prop.test(x = c(38008, 282383), n = c(202718, 1211875))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(38008, 282383) out of c(202718, 1211875)
## X-squared = 2053.9, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.04738250 -0.04366014
## sample estimates:
##    prop 1    prop 2 
## 0.1874920 0.2330133
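Because the research question is directional, a one-sided version of this test may also be of interest. The following is a minimal sketch, not part of the original analysis, that recomputes the test statistic by hand from the counts above and compares it with a one-sided `prop.test`. The variable names are illustrative, and the sketch omits the continuity correction, so its statistic will differ slightly from the X-squared value reported above.

```r
# Counts copied from the output above
x <- c(female = 38008, male = 282383)     # personnel with rank >= 8
n <- c(female = 202718, male = 1211875)   # all personnel of each gender

# Sample proportions and the pooled proportion under H_0: p_F = p_M
p_hat  <- x / n
p_pool <- sum(x) / sum(n)

# Two-proportion z statistic using the pooled standard error
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z  <- unname((p_hat["female"] - p_hat["male"]) / se)

# One-sided p-value for H_A: p_F < p_M
p_one_sided <- pnorm(z)

# The same one-sided test through prop.test, without continuity correction;
# its X-squared statistic is the square of z
one_sided <- prop.test(x = x, n = n, alternative = "less", correct = FALSE)

z
p_one_sided
one_sided
```

A negative `z` indicates that the female proportion is the smaller of the two, which is the direction the research question asks about.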

To be more confident in the results, I will also perform a Mann-Whitney U test. This test evaluates the difference between two independent groups when the dependent variable is ordinal, as military rank is. I will run a one-sided test to determine whether the rank distribution of male personnel is higher than that of female personnel.

My hypotheses:

\(H_0: rank_F = rank_M\)

\(H_A: rank_F < rank_M\)

#Mann-Whitney U test to check whether the female group ranks systematically lower than the male group
wilcox.test(rank~gender, data = military2012, alternative = "less")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  rank by gender
## W = 1.1546e+11, p-value < 2.2e-16
## alternative hypothesis: true location shift is less than 0

Conclusions

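Although the two proportions seem close in value (about 18.7% of women and 23.3% of men hold a rank of 8 or higher), the hypothesis test for two independent proportions produced a p-value far below the alpha of 0.05. The difference between the two proportions is highly statistically significant and is very unlikely to be due to chance. We can confidently reject the null hypothesis in favor of the alternative: the proportion of high-rank females is not equal to the proportion of high-rank males. The 95% confidence interval for the difference (about -0.047 to -0.044) lies entirely below zero, so the female proportion is the lower of the two.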

The Mann-Whitney U test confirmed this conclusion, providing us with a p-value and a W value. The W value summarizes what happens when every female is paired with every male. A pair in which the two ranks are tied adds 0.5, a pair in which the female outranks the male adds 1, and a pair in which the male outranks the female adds 0.

The maximum W value is the product of the two group sizes, 202,718 × 1,211,875 = 245,668,876,250. If gender had no influence on rank, we would expect W to be about half of the maximum, or 122,834,438,125. Our W value was approximately 115,460,000,000, which is less than this value of equality. This means that when females and males were compared, males tended to outrank females.
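The arithmetic above can be checked in R. This is a small sketch that uses the group sizes and the rounded W value from the output; the variable names are illustrative and not part of the original analysis.

```r
# Group sizes copied from the output above
n_f <- 202718     # female personnel
n_m <- 1211875    # male personnel

# Largest possible W: every female outranks every male
w_max <- n_f * n_m

# Value of W expected if gender had no influence on rank
w_null <- w_max / 2

# Rounded W reported by wilcox.test above
w_obs <- 1.1546e11

# Share of the maximum that was observed. This can be read as the estimated
# probability that a randomly chosen woman outranks a randomly chosen man,
# with ties counted as one half. A value below 0.5 favors the men.
w_share <- w_obs / w_max

format(c(w_max = w_max, w_null = w_null, w_obs = w_obs), big.mark = ",")
w_share
```

The observed share comes out at about 0.47, slightly below the 0.5 expected under equality.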

Furthermore, our p-value from the Mann-Whitney U test was far below 0.05. This is highly statistically significant and means that we can confidently reject the null hypothesis that the two rank distributions are equal. The data show that female ranks tend to be lower than male ranks.

As of 2012, there was very strong evidence that gender inequality was still present in US military authority.

Limitations

The major limitation of this data set is that it is 14 years old. Our results might be different if the data were more recent. It wasn't until 2016 that all military positions became open to women, especially higher positions of authority.

Furthermore, the classifications of gender may be slightly unreliable because this data was recorded four years before transgender people were allowed to serve openly in our military. Even after that policy change, it is not safe for many to serve openly. There may be more women of higher authority than the data suggests.

Our data is otherwise generally unbiased and reliable because it is recorded by the government. The variables are matters of fact, not opinion. Furthermore, it is not just a sample: it represents almost the entire US military in 2012.


This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Alessandra Marenghi
Semester: Spring 2026