Introduction

Research Question: Is there statistical evidence that demographic factors, including race, gender, and Hispanic origin, significantly predict military rank among U.S. service members?

This project investigates whether the rank of a U.S. service member is associated with specific demographic characteristics. For this study, we will use the US Military Demographics dataset from OpenIntro, which contains information on every active duty member of the U.S. Army, Navy, Air Force, and Marine Corps. The dataset consists of 1,414,593 observations (each row is an individual service member) and 6 variables: grade, branch, gender, race, hisp (Hispanic origin), and rank.

While rank could be considered categorical, we are treating it as a quantitative variable in this model. This enables the use of multiple linear regression and lets us interpret predicted rank as a continuous indicator of position within the military hierarchy. If the model produces decimal values (e.g., 5.4), we will round to the nearest whole number to interpret the expected rank. Linear regression also allows us to estimate the independent effect of each demographic predictor on rank while controlling for the others. In the following steps, we will explore our data, fit the multiple linear regression model, and check the regression assumptions (linearity, independence of observations, homoscedasticity, normality of residuals, and absence of multicollinearity).

Data Analysis

We will start by conducting exploratory data analysis to gain a deeper understanding of our data and examine the distribution of military rank. We will also clean and prepare our data by keeping only the relevant variables and converting the predictors to factors. We will then summarize the data and create visualizations to look for any patterns.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

military <- read.csv("military.csv")

# Examine dimensions and preview data

dim(military)
## [1] 1414593       6
head(military)
##     grade branch gender    race hisp rank
## 1 officer   army   male ami/aln TRUE    2
## 2 officer   army   male ami/aln TRUE    2
## 3 officer   army   male ami/aln TRUE    5
## 4 officer   army   male ami/aln TRUE    5
## 5 officer   army   male ami/aln TRUE    5
## 6 officer   army   male ami/aln TRUE    5
# Select only relevant columns and factor

military_sub <- military |>
  select(rank, gender, race, hisp) |>
  mutate(
    gender = as.factor(gender),
    race = as.factor(race),
    hisp = as.factor(hisp)
  )

# Check structure of subset data, ensuring factors

str(military_sub)
## 'data.frame':    1414593 obs. of  4 variables:
##  $ rank  : int  2 2 5 5 5 5 5 7 10 2 ...
##  $ gender: Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ race  : Factor w/ 7 levels "ami/aln","asian",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ hisp  : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 1 ...
# Summarize statistics by demographic group

military_sub |>
  group_by(gender) |>
  summarise(mean_rank = mean(rank), sd_rank = sd(rank), count = n())
## # A tibble: 2 × 4
##   gender mean_rank sd_rank   count
##   <fct>      <dbl>   <dbl>   <int>
## 1 female      6.02    1.73  202718
## 2 male        6.22    1.80 1211875
military_sub |>
  group_by(race) |>
  summarise(mean_rank = mean(rank), sd_rank = sd(rank), count = n())
## # A tibble: 7 × 4
##   race    mean_rank sd_rank  count
##   <fct>       <dbl>   <dbl>  <int>
## 1 ami/aln      5.93    1.63  23984
## 2 asian        6.20    1.72  51735
## 3 black        6.48    1.85 241133
## 4 multi        5.34    1.73  26054
## 5 p/i          5.84    1.63   8703
## 6 unk          6.83    1.79  71269
## 7 white        6.11    1.76 991715
military_sub |>
  group_by(hisp) |>
  summarise(mean_rank = mean(rank), sd_rank = sd(rank), count = n())
## # A tibble: 2 × 4
##   hisp  mean_rank sd_rank   count
##   <fct>     <dbl>   <dbl>   <int>
## 1 FALSE      6.18    1.81 1265480
## 2 TRUE       6.31    1.63  149113

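After summarizing our data, we can make the following observations. Male service members have a slightly higher mean rank than female service members (6.22 vs. 6.02). Across racial groups, mean rank is highest for the unknown category (6.83) and lowest for the multi-ethnic category (5.34), a spread of about 1.5 rank points. Hispanic service members have a slightly higher mean rank than non-Hispanic service members (6.31 vs. 6.18). Standard deviations are similar across all groups (roughly 1.6 to 1.9), and group sizes are highly unequal: about 86% of members are male and about 70% are white.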

We will now visualize the distribution of rank within each category of our predictors. We will use boxplots to show median rankings and the variability per group.

# Boxplot of Rank by Gender

ggplot(military_sub, aes(x = gender, y = rank)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Distribution of Rank by Gender", x = "Gender", y = "Rank")

# Boxplot of Rank by Race

ggplot(military_sub, aes(x = race, y = rank)) +
  geom_boxplot(fill = "maroon") +
  labs(title = "Distribution of Rank by Race", x = "Race", y = "Rank") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Boxplot of Rank by Hispanic Origin

ggplot(military_sub, aes(x = hisp, y = rank)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Distribution of Rank by Hispanic Origin", x = "Hispanic Origin", y = "Rank")

Visual Observations:

The gender vs. rank boxplot shows a slightly higher median for males than for females. The interquartile ranges (IQRs) overlap heavily, indicating that the rank distributions are broadly similar across genders. In the race vs. rank boxplot, the median rank for the unknown category is visibly much higher than the others, and the median for the multi-ethnic category is much lower. The IQRs are similar for most racial groups, with the black category showing the most variability and the multi-ethnic category the least. The Hispanic vs. non-Hispanic boxplot has a similar structure to the gender plot, with similar IQRs and whisker ranges for both groups. These graphs reinforce our numeric findings that there are slight to moderate differences in rank distributions across demographic groups, with the most variation across race categories.
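
The medians and IQRs that the boxplots display can also be computed directly, which helps when groups overlap too much to compare by eye. The following is a minimal sketch on a small made-up data frame (the values in `toy` are illustrative only and are not taken from the military data); the same calls should apply to `military_sub`.

```r
# Toy data with the same column names as military_sub (values are made up)
toy <- data.frame(
  rank   = c(3, 4, 5, 6, 7, 4, 5, 6, 7, 8),
  gender = factor(rep(c("female", "male"), each = 5))
)

# Median and interquartile range of rank within each gender
median_by_gender <- tapply(toy$rank, toy$gender, median)
iqr_by_gender    <- tapply(toy$rank, toy$gender, IQR)

print(median_by_gender)
print(iqr_by_gender)
```

Because rank takes only whole-number values, several groups can share the same median even when their means differ, so the numeric summaries and the plots are best read together.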

Regression Analysis

To address the research question formally, we will fit a multiple linear regression model that predicts rank from gender, race, and Hispanic origin. In regression, a categorical variable needs one category as a reference so the model can compare the others to it. By default, R uses the first level of each factor as the “baseline” to avoid redundancy. We will use levels() to identify the defaults:

# Check the reference levels (first level shown is the default reference)
levels(military_sub$gender)
## [1] "female" "male"
levels(military_sub$race)
## [1] "ami/aln" "asian"   "black"   "multi"   "p/i"     "unk"     "white"
levels(military_sub$hisp)
## [1] "FALSE" "TRUE"

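Our default reference levels are female for gender, ami/aln (American Indian/Alaskan Native) for race, and FALSE (non-Hispanic) for hisp.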
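
If a different baseline were preferred, for example the largest racial group so that every comparison is made against it, the reference level could be changed with `relevel()` before fitting. This is only a sketch of the mechanism on a stand-in factor; the analysis in this report keeps R's defaults.

```r
# Stand-in factor using the race labels from the dataset
race <- factor(c("ami/aln", "asian", "black", "multi", "p/i", "unk", "white"))
levels(race)[1]            # default baseline: the first level

# Move "white" to the front so that lm() would treat it as the reference
race_white_ref <- relevel(race, ref = "white")
levels(race_white_ref)[1]  # new baseline
```

Changing the baseline changes the individual coefficients and their p-values, but not the fitted values or the R-squared of the model.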

# Fit the multiple linear regression model

model <- lm(rank ~ gender + race + hisp, data = military_sub)

# Output the model summary

summary(model)
## 
## Call:
## lm(formula = rank ~ gender + race + hisp, data = military_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8893 -1.1374 -0.1374  1.0095  5.8697 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.712860   0.011998 476.132  < 2e-16 ***
## gendermale   0.254692   0.004296  59.291  < 2e-16 ***
## raceasian    0.277598   0.013856  20.035  < 2e-16 ***
## raceblack    0.570349   0.012015  47.468  < 2e-16 ***
## racemulti   -0.582556   0.015858 -36.736  < 2e-16 ***
## racep/i     -0.079390   0.022177  -3.580 0.000344 ***
## raceunk      0.897819   0.013395  67.028  < 2e-16 ***
## racewhite    0.169835   0.011587  14.657  < 2e-16 ***
## hispTRUE     0.023966   0.005175   4.631 3.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.772 on 1414584 degrees of freedom
## Multiple R-squared:  0.01954,    Adjusted R-squared:  0.01953 
## F-statistic:  3524 on 8 and 1414584 DF,  p-value: < 2.2e-16

Interpretation of Regression Results

The model is statistically significant overall: the p-value of the F-test is less than 2.2e-16, which suggests that at least one of the predictors is significantly related to rank. However, the adjusted R-squared value is 0.01953, meaning that these demographic factors explain only about 2% of the variation in military rank. The majority of the variation in rank is therefore due to factors not included in this model.
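
The overall F-statistic and the adjusted R-squared both follow from R-squared, the number of predictor terms, and the residual degrees of freedom, so the summary values can be cross-checked by hand. The sketch below plugs in the rounded figures printed in the summary output; small discrepancies from rounding are expected.

```r
# Values copied from the summary() output (R-squared is rounded there)
r2     <- 0.01954
k      <- 8           # estimated coefficients other than the intercept
df_res <- 1414584     # residual degrees of freedom
n      <- df_res + k + 1

# Overall F-statistic: explained variation per term
# divided by unexplained variation per residual degree of freedom
f_stat <- (r2 / k) / ((1 - r2) / df_res)

# Adjusted R-squared penalizes R-squared for the number of terms
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

round(f_stat)
round(adj_r2, 5)
```

With these inputs the two results come out at about 3524 and 0.01953, matching the summary. The F-statistic is very large mainly because n exceeds 1.4 million, not because R-squared is large.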

Intercept: The intercept is approximately 5.71, meaning that the expected rank for a female, American Indian/Alaskan Native, non-Hispanic service member is between 5 and 6.

Gender: The coefficient for “gendermale” is +0.25 (p < .001), meaning that male service members average about 0.25 rank points higher than female service members, after controlling for race and Hispanic origin. Although this difference is statistically significant, 0.25 is not a substantial difference in the context of the military rank scale.

Race (Compared to baseline: American Indian/Alaskan Native):

Each race category shows a statistically significant difference from the baseline group (AI/AN). The unknown race category has the largest positive estimate (+0.90), and the multi-ethnic category has the largest negative estimate (-0.58).

Hispanic Origin: The coefficient for hispTRUE is +0.024 (p < .001), which is statistically significant but very small in the context of the rank scale.
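
To make the coefficient interpretation concrete, the expected rank for any demographic profile is the intercept plus the coefficients of the non-baseline categories that apply. The sketch below hard-codes the estimates from the summary output and uses a hypothetical helper, `expected_rank()`, that is not part of the original analysis; with the fitted `model` object available, `predict()` would do the same job.

```r
# Coefficient estimates copied from the summary() output
coefs <- c(
  "(Intercept)" = 5.712860, gendermale = 0.254692,
  raceasian = 0.277598, raceblack = 0.570349, racemulti = -0.582556,
  "racep/i" = -0.079390, raceunk = 0.897819, racewhite = 0.169835,
  hispTRUE = 0.023966
)

# Hypothetical helper: sum the intercept and whichever terms apply.
# Baseline categories (female, ami/aln, FALSE) contribute nothing.
expected_rank <- function(terms = character(0)) {
  unname(coefs["(Intercept)"] + sum(coefs[terms]))
}

baseline   <- expected_rank()
male_white <- expected_rank(c("gendermale", "racewhite"))

round(c(baseline = baseline, male_white = male_white), 2)
round(male_white)  # rounded to a whole rank, as described in the introduction
```

For a male, white, non-Hispanic service member this gives about 6.14, which rounds to an expected rank of 6.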

Conclusion: While gender, race, and Hispanic origin each show a statistically significant association with military rank, the magnitude of each effect is small in practical terms. Additionally, the model has very low explanatory power, meaning that most of the variation in rank is due to factors outside of this model.
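
One rough way to put "practically small" on a scale is to divide each coefficient by the residual standard error (1.772), which expresses each difference in units of the typical unexplained spread in rank. This scaling is an illustrative assumption added here, not a step from the original analysis.

```r
# Residual standard error and coefficient estimates from the summary() output
resid_se <- 1.772
coefs <- c(
  gendermale = 0.254692, raceasian = 0.277598, raceblack = 0.570349,
  racemulti = -0.582556, "racep/i" = -0.079390, raceunk = 0.897819,
  racewhite = 0.169835, hispTRUE = 0.023966
)

# Each coefficient as a fraction of one residual standard deviation
scaled <- coefs / resid_se

# Largest absolute scaled differences first
round(sort(abs(scaled), decreasing = TRUE), 2)
```

On this scale the gender and Hispanic-origin coefficients come out near 0.14 and 0.01 residual standard deviations, while the race contrasts span a wider range, led by the unknown category.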

Model Assumptions and Diagnostic Tests

After fitting our model, we must check linearity, independence, homoscedasticity, normality, and multicollinearity using plots and statistical measures:

# Diagnostic plots for the linear model

par(mfrow = c(2, 2))
plot(model)   

par(mfrow = c(1, 1))
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
crPlots(model)
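
The diagnostic plots above do not address multicollinearity directly. With only categorical predictors, one simple check is how strongly the predictors are associated with one another. The sketch below computes Cramér's V from a chi-squared test on simulated stand-in data (the data frame `toy` and the helper `cramers_v()` are hypothetical and not from the original analysis). Since `car` is already loaded, `vif(model)` would be an alternative; for factor terms it reports generalized variance inflation factors.

```r
# Simulated stand-in data: predictors drawn independently of one another
set.seed(1)
n <- 5000
toy <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE,
                         prob = c(0.15, 0.85))),
  race   = factor(sample(c("asian", "black", "white"), n, replace = TRUE,
                         prob = c(0.1, 0.2, 0.7))),
  hisp   = factor(sample(c("FALSE", "TRUE"), n, replace = TRUE,
                         prob = c(0.9, 0.1)))
)

# Hypothetical helper: Cramér's V for two categorical variables
# (0 = no association, 1 = perfect association)
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

v_gender_race <- cramers_v(toy$gender, toy$race)
v_gender_hisp <- cramers_v(toy$gender, toy$hisp)
v_race_hisp   <- cramers_v(toy$race, toy$hisp)

round(c(gender_race = v_gender_race, gender_hisp = v_gender_hisp,
        race_hisp = v_race_hisp), 3)
```

Values near 0 suggest that the predictors carry largely separate information, which is the situation in which the individual coefficient estimates are most stable.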

Assumption Checks:

Conclusion and Future Directions:

This project examined whether demographic factors, including gender, race, and Hispanic origin, significantly predict U.S. military rank. Our multiple linear regression analysis found that gender, race, and Hispanic origin are all statistically significant predictors of rank. Despite these significant effects, the model has an R-squared of only about 0.02, meaning that these demographic variables explain about 2 percent of the variation in rank. This suggests that most of the variation in rank is due to factors not covered in our model, such as years of service, grade, branch, and education. Adding such factors could improve the R-squared value and explain more of the variation in rank. We treated rank as a continuous numeric variable, but it is technically ordinal, so an ordinal regression would be more appropriate in future studies. Interactions between predictors, such as gender and race, could also be examined to see whether their effects on rank depend on one another. Overall, although the demographic differences are small, they are statistically significant, which may warrant further study of promotion equity in military rank advancement.
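
As a sketch of the ordinal approach suggested above, a proportional-odds model can be fit with `polr()` from the MASS package, which treats rank as an ordered factor rather than a number. The data here are simulated purely for illustration (the gender shift and the cut points are arbitrary assumptions), so the estimates say nothing about the military data.

```r
# Simulated stand-in data; not the military dataset
set.seed(42)
n <- 1000
toy <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Latent score with an assumed gender shift, cut into four ordered ranks
latent   <- 0.5 * (toy$gender == "male") + rlogis(n)
toy$rank <- cut(latent, breaks = c(-Inf, -1, 0, 1, Inf),
                labels = 1:4, ordered_result = TRUE)

# Proportional-odds logistic regression. Calling MASS::polr() directly
# avoids attaching MASS, which would mask dplyr::select().
# Hess = TRUE keeps the Hessian so summary() can report standard errors.
fit <- MASS::polr(rank ~ gender, data = toy, Hess = TRUE)

coef(fit)  # log odds ratio of a higher rank, male vs. female
```

Unlike the linear model, this approach does not assume that the distance between adjacent ranks is equal, and its predictions are probabilities for each rank rather than decimal rank values that need rounding.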