Research Question: Is there statistical evidence that demographic factors, including race, gender, and Hispanic origin, significantly predict military rank among U.S. service members?
This project investigates whether the rank of a U.S. service member is associated with specific demographic characteristics. For this study, we will use the US Military Demographics dataset from OpenIntro. This dataset contains information on every active duty member of the U.S. Army, Navy, Air Force, and Marine Corps. It consists of 1,414,593 observations (each row is an individual service member) and 6 variables: grade, branch, gender, race, hisp (Hispanic origin), and rank.
While rank could be considered categorical, we treat it as a quantitative variable in this model so that we can use multiple linear regression and interpret predicted rank as a continuous indicator of position within the military hierarchy. If the model produces decimal values (e.g., 5.4), we will round to the nearest whole number to interpret the expected rank. Linear regression also lets us estimate the independent effect of each demographic predictor on rank while controlling for the others. In the following steps, we will explore our data, fit the multiple linear regression model, and check regression assumptions (linearity, independence of observations, homoscedasticity, normality of residuals, and multicollinearity).
We will start with exploratory data analysis to gain a deeper understanding of our data and examine the distribution of military rank. We will also clean and prepare our data by selecting only the relevant variables and converting the predictors to factors. We will then summarize the data and create visualizations to look for patterns.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
military <- read.csv("military.csv")
# Examine dimensions and preview data
dim(military)
## [1] 1414593 6
head(military)
## grade branch gender race hisp rank
## 1 officer army male ami/aln TRUE 2
## 2 officer army male ami/aln TRUE 2
## 3 officer army male ami/aln TRUE 5
## 4 officer army male ami/aln TRUE 5
## 5 officer army male ami/aln TRUE 5
## 6 officer army male ami/aln TRUE 5
# Select only relevant columns and factor
military_sub <- military |>
  select(rank, gender, race, hisp) |>
  mutate(
    gender = as.factor(gender),
    race = as.factor(race),
    hisp = as.factor(hisp)
  )
# Check structure of subset data, ensuring factors
str(military_sub)
## 'data.frame': 1414593 obs. of 4 variables:
## $ rank : int 2 2 5 5 5 5 5 7 10 2 ...
## $ gender: Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
## $ race : Factor w/ 7 levels "ami/aln","asian",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hisp : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 1 ...
# Summarize statistics by demographic group
military_sub |> group_by(gender) |>
  summarise(mean_rank = mean(rank), sd_rank = sd(rank), count = n())
## # A tibble: 2 × 4
## gender mean_rank sd_rank count
## <fct> <dbl> <dbl> <int>
## 1 female 6.02 1.73 202718
## 2 male 6.22 1.80 1211875
military_sub |> group_by(race) |>
  summarise(mean_rank = mean(rank), sd_rank = sd(rank), count = n())
## # A tibble: 7 × 4
## race mean_rank sd_rank count
## <fct> <dbl> <dbl> <int>
## 1 ami/aln 5.93 1.63 23984
## 2 asian 6.20 1.72 51735
## 3 black 6.48 1.85 241133
## 4 multi 5.34 1.73 26054
## 5 p/i 5.84 1.63 8703
## 6 unk 6.83 1.79 71269
## 7 white 6.11 1.76 991715
military_sub |> group_by(hisp) |>
  summarise(mean_rank = mean(rank), sd_rank = sd(rank), count = n())
## # A tibble: 2 × 4
## hisp mean_rank sd_rank count
## <fct> <dbl> <dbl> <int>
## 1 FALSE 6.18 1.81 1265480
## 2 TRUE 6.31 1.63 149113
After summarizing our data, we can make the following observations:
Gender: The mean rank is 6.02 for female service members and 6.22 for male service members. There are about 1.2 million males and about 203,000 females in the data. This suggests that, before controlling for other factors, males have a slightly higher average rank than females.
Race: Mean rank varies noticeably across racial groups. For example, the average rank of Black service members is 6.48, whereas multi-ethnic (multi) service members average 5.34. Other average ranks are American Indian/Alaskan Native (ami/aln, 5.93), White (6.11), Unknown (unk, 6.83), Pacific Islander (p/i, 5.84), and Asian (6.20).
Hispanic Origin: There is little difference between service members who identify as Hispanic (mean rank 6.31) and those who do not (6.18).
We will now visualize the distribution of rank within each category of our predictors. We will use boxplots to show median rankings and the variability per group.
# Boxplot of Rank by Gender
ggplot(military_sub, aes(x = gender, y = rank)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Distribution of Rank by Gender", x = "Gender", y = "Rank")
# Boxplot of Rank by Race
ggplot(military_sub, aes(x = race, y = rank)) +
  geom_boxplot(fill = "maroon") +
  labs(title = "Distribution of Rank by Race", x = "Race", y = "Rank") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Boxplot of Rank by Hispanic Origin
ggplot(military_sub, aes(x = hisp, y = rank)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Distribution of Rank by Hispanic Origin", x = "Hispanic Origin", y = "Rank")
Visual Observations:
The gender boxplot shows a slightly higher median rank for males than for females. The interquartile ranges (IQRs) overlap heavily, indicating that the rank distributions are broadly similar by gender. In the race boxplot, the median rank for the unknown category is visibly higher than the others, and the median for the multi-ethnic category is visibly lower. The IQRs are similar for most racial groups, with Black showing the most variability and multi-ethnic the least. The Hispanic origin boxplot has a structure similar to the gender boxplot, with similar IQRs and whisker ranges for both groups. These graphs reinforce our numeric findings that there are slight to moderate differences in rank distributions across demographic groups, with the most variation across race.
To address the research question formally, we will fit a multiple linear regression model that predicts rank from gender, race, and Hispanic origin. In regression, a categorical variable needs one category as a reference so the model can compare the others to it. By default, R uses the first level of each factor (in alphabetical order here) as the “baseline” to avoid redundancy. We will use levels() to identify the defaults:
# Check the reference levels (first level shown is the default reference)
levels(military_sub$gender)
## [1] "female" "male"
levels(military_sub$race)
## [1] "ami/aln" "asian" "black" "multi" "p/i" "unk" "white"
levels(military_sub$hisp)
## [1] "FALSE" "TRUE"
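The first level printed for each factor acts as the baseline, so the defaults here are female, ami/aln, and FALSE. If a different comparison group were preferred (for example the largest group, white), base R's relevel() can change the reference. The sketch below is only an illustration on a small stand-in factor (the names race and race_releveled are hypothetical); the model in this report keeps the defaults.

```r
# Small stand-in factor with the same levels as military_sub$race
race <- factor(c("ami/aln", "asian", "black", "multi", "p/i", "unk", "white"))

# Default baseline: the first level
levels(race)[1]

# Make "white" the reference level instead (illustrative choice only)
race_releveled <- relevel(race, ref = "white")
levels(race_releveled)
```

Changing the reference level does not change the fitted values or R-squared; it only changes which group each race coefficient is compared against.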
Our default reference levels are female (gender), ami/aln (race), and FALSE (hisp).
# Fit the multiple linear regression model
model <- lm(rank ~ gender + race + hisp, data = military_sub)
# Output the model summary
summary(model)
##
## Call:
## lm(formula = rank ~ gender + race + hisp, data = military_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8893 -1.1374 -0.1374 1.0095 5.8697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.712860 0.011998 476.132 < 2e-16 ***
## gendermale 0.254692 0.004296 59.291 < 2e-16 ***
## raceasian 0.277598 0.013856 20.035 < 2e-16 ***
## raceblack 0.570349 0.012015 47.468 < 2e-16 ***
## racemulti -0.582556 0.015858 -36.736 < 2e-16 ***
## racep/i -0.079390 0.022177 -3.580 0.000344 ***
## raceunk 0.897819 0.013395 67.028 < 2e-16 ***
## racewhite 0.169835 0.011587 14.657 < 2e-16 ***
## hispTRUE 0.023966 0.005175 4.631 3.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.772 on 1414584 degrees of freedom
## Multiple R-squared: 0.01954, Adjusted R-squared: 0.01953
## F-statistic: 3524 on 8 and 1414584 DF, p-value: < 2.2e-16
The model is statistically significant overall. The p-value for the overall F-test is less than 2.2e-16, which suggests that at least one of the predictors is significantly related to rank. On the other hand, the adjusted R-squared value is 0.01953, meaning that these demographic factors explain only about 2% of the variation in military rank. The majority of the variation in rank is therefore due to factors not included in this model.
Intercept: The intercept is approximately 5.71, meaning that the expected rank for a female, American Indian/Alaskan Native, non-Hispanic service member is between 5 and 6.
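The combination of a very large F-statistic and a very small R-squared follows from the sample size. As a rough check, the sketch below recovers R-squared from the F-statistic and degrees of freedom printed in the summary, using the identity R^2 = (F * df_model) / (F * df_model + df_resid). The variable names are chosen here for illustration, and the F value is the rounded one from the printout.

```r
# Values copied from the model summary above
f_stat   <- 3524      # overall F-statistic (rounded in the printout)
df_model <- 8         # numerator degrees of freedom
df_resid <- 1414584   # residual degrees of freedom

# R^2 = (F * df_model) / (F * df_model + df_resid)
r_squared <- (f_stat * df_model) / (f_stat * df_model + df_resid)
print(round(r_squared, 4))
```

With about 1.4 million residual degrees of freedom, an R-squared near 0.02 is already enough to produce an F-statistic in the thousands, which is why statistical significance alone says little about practical importance here.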
Gender: The coefficient for “gendermale” is +0.25 (p < .001), meaning that male service members average 0.25 points higher than female service members, after controlling for other demographic factors. Although this value is statistically significant, 0.25 is not a substantial difference in the context of the military rank scale.
Race (Compared to baseline: American Indian/Alaskan Native):
Asian: +0.28 (p < 0.001): Asian members have slightly higher ranks on average than AI/AN.
Black: +0.57 (p < 0.001): Black members have the largest positive effect relative to AI/AN among the identified race categories.
Multi-ethnic: –0.58 (p < 0.001): Multi-ethnic members are predicted to have significantly lower ranks than AI/AN.
Pacific Islander: –0.08 (p = 0.0003): A small but significant negative difference.
Unknown: +0.90 (p < 0.001): Those with unknown race have higher average ranks.
White: +0.17 (p < 0.001): White members have slightly higher ranks than AI/AN.
Each race category shows a statistically significant difference from the baseline group (AI/AN), with the unknown race category showing the largest estimated increase in rank and multi-ethnic showing the largest decrease.
Hispanic Origin: The coefficient for hispTRUE is +0.024 (p < 0.001), which is statistically significant but very small in the context of the rank scale.
Conclusion: While gender, race, and Hispanic origin each show a statistically significant association with military rank, the estimated differences are small in practical terms (all less than one rank level). Additionally, the model has very low explanatory power, meaning that the majority of the variation in rank is due to factors outside of this model.
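As a worked illustration of how the coefficients combine, the sketch below adds up the estimates from the summary output for one hypothetical profile (a male, Black, Hispanic service member) and rounds the result as described at the start of the report. The profile and the names b and pred are chosen only for illustration.

```r
# Coefficient estimates copied from the model summary above
b <- c(intercept  = 5.712860,
       gendermale = 0.254692,
       raceblack  = 0.570349,
       hispTRUE   = 0.023966)

# Predicted rank = intercept + the coefficients that apply to this profile
pred <- sum(b)
print(pred)          # about 6.56
print(round(pred))   # rounds to 7
```

The baseline profile (female, AI/AN, non-Hispanic) uses the intercept alone, so every other profile differs from 5.71 by the sum of its applicable coefficients.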
After fitting our model, we must check linearity, independence, homoscedasticity, normality, and multicollinearity using plots and statistical measures:
# Diagnostic plots for the linear model
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
crPlots(model)
Assumption Checks:
Linearity: The component-plus-residual plots for gender, race, and Hispanic origin show mostly flat, horizontal boxplots, which suggests no major violation of the linearity assumption. The “Residuals vs Fitted” plot also shows no systematic pattern, further supporting linearity.
Homoscedasticity: The “Scale-Location” plot shows a mostly consistent spread of residuals, with some widening at the upper end. This suggests mild, but not severe, heteroscedasticity.
Normality of Residuals: The “Q-Q” plot deviates slightly from the diagonal at both ends, suggesting mild non-normality. With a sample this large, such a mild departure has little effect on inference about the coefficients.
Influence / Leverage: The “Residuals vs Leverage” plot shows some higher-leverage observations, but no extreme Cook’s distances are flagged. This suggests that there are no major concerns with influence.
Summary: The assumptions examined here are reasonably satisfied. These checks support the validity of the model and increase our confidence in the results.
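Multicollinearity is listed among the assumptions to check, but the diagnostic plots do not address it directly. One common measure is the generalized variance inflation factor (GVIF), which car::vif() reports for models with factor predictors. The sketch below is a base-R approximation of that idea, run on simulated stand-in data rather than the military dataset; the objects sim, fit, and gvif and all simulated values are assumptions made for illustration.

```r
set.seed(1)
n <- 5000

# Simulated stand-in data with independent categorical predictors
sim <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE, prob = c(0.15, 0.85))),
  race   = factor(sample(c("asian", "black", "white"), n, replace = TRUE, prob = c(0.1, 0.2, 0.7))),
  hisp   = factor(sample(c("FALSE", "TRUE"), n, replace = TRUE, prob = c(0.9, 0.1)))
)
sim$rank <- round(6 + 0.25 * (sim$gender == "male") + rnorm(n, sd = 1.8))

fit <- lm(rank ~ gender + race + hisp, data = sim)

# Correlation matrix of the coefficient estimates, intercept excluded
R <- cov2cor(vcov(fit)[-1, -1])
assign_idx <- attr(model.matrix(fit), "assign")[-1]  # which term each column belongs to
term_labels <- attr(terms(fit), "term.labels")

# GVIF for a term: det(R_term) * det(R_others) / det(R)
gvif <- sapply(seq_along(term_labels), function(k) {
  cols <- which(assign_idx == k)
  det(R[cols, cols, drop = FALSE]) * det(R[-cols, -cols, drop = FALSE]) / det(R)
})
names(gvif) <- term_labels
print(round(gvif, 3))
```

Values close to 1 indicate that a predictor is nearly unrelated to the others. On the actual model, calling car::vif(model) would be the more direct route, since car is already loaded.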
This project examined whether demographic factors, including gender, race, and Hispanic origin, significantly predict U.S. military rank. Our multiple linear regression analysis found that gender, race, and Hispanic origin are statistically significant predictors of rank. Despite these significant effects, the model has an R-squared of only about 0.02, meaning that these demographics explain about 2 percent of the variation in rank. This suggests that the majority of the variation in rank is due to factors not covered in our model, such as years of service, grade, branch, and education. Adding such factors could improve the R-squared value and explain more of the variation in rank. We treated rank as a continuous numeric variable, but it is technically ordinal, so an ordinal regression would be more appropriate in future studies. Terms that may interact, such as gender and race, could also be examined to see whether their combined effect on rank differs from their separate effects. Overall, the demographic differences were small but statistically significant, which is expected to some degree with more than 1.4 million observations and may still warrant further study of promotion equity in military rank advancement.
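The ordinal regression suggested above could be sketched with MASS::polr(), which fits a proportional-odds model. The example below uses simulated stand-in data with a four-level ordered outcome, not the military dataset; the names sim and ord_fit, the cut points, and the effect size are all assumptions made for illustration.

```r
set.seed(42)
n <- 2000

# Simulated stand-in predictors
sim <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  hisp   = factor(sample(c("FALSE", "TRUE"), n, replace = TRUE))
)

# Ordered outcome built from a latent logistic variable (illustrative cut points)
latent <- 0.3 * (sim$gender == "male") + rlogis(n)
sim$rank <- cut(latent, breaks = c(-Inf, -1, 0, 1, Inf),
                labels = 1:4, ordered_result = TRUE)

# Proportional-odds logistic regression; Hess = TRUE keeps the Hessian for standard errors
ord_fit <- MASS::polr(rank ~ gender + hisp, data = sim, Hess = TRUE)
print(coef(ord_fit))
```

Calling MASS::polr() with the namespace prefix, rather than attaching MASS, avoids masking dplyr's select(). The coefficients are on the log-odds scale, so they are not directly comparable to the linear regression estimates reported above.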