Author: Ryann Laky
With the appointment of the Indiana National Guard’s latest Adjutant General, Major General R. Dale Lyles, the motto for the state has also changed: ‘People First’. Over the years, the whole of the United States Army has come to realize that its greatest asset is its people. However, given that the Army is an operations-based organization, it is often easy to forget to care for the greatest asset. With the INNG’s motto of ‘People First’, a few well-respected leaders were asked to give their definitions of Soldier Care:
“Soldier Care is a function of basic leadership at all levels. Leaders provide purpose, motivation, and direction. This applies equally to mission accomplishment as well as Soldier well-being. An engaged leader must genuinely care to ensure that all aspects of a Soldier’s life are in balance so they may offer maximum effort toward their military task.”
“To care for Soldiers is to create an environment of living, learning, and training such that it gives the Soldier the greatest opportunity to thrive unrestricted [to] become the best and most lethal person that they could possibly be; and to identify, build on, and reinforce their strengths, and identify, reduce, and eliminate their weaknesses [to] allow that individual to do their part when the time comes to ensure mission success.”
“Soldier care is taking care of the whole Soldier: making sure they have what they need to do their job now, providing the mentorship and training to do their job in the future, and ensuring they are physically, medically, and mentally prepared. This includes taking care of the individual personal needs.”
Vision: The Indiana National Guard will be the premier community-based military force for state and international missions, by putting our people first.
The INNG has a dedicated personnel section, much like human resources in the civilian sector, called J1. However, J1 covers much more than just human resources; they handle various pay issues, personnel actions (such as changes of address), recruiting, retention, resolution of medical issues, and various training. One of the biggest problems the INNG faces is the retention of talent. Service Members usually join the Guard for education benefits, to gain skills, and to travel the world at low cost – all benefits for Service Members in a weakened economy. As Service Members stay in the Guard, they gain a set of indispensable skills and talents that translate and market well to the civilian market, pulling them away from the Guard and making the Guard more of an inconvenience than an asset for the individual. Retention has posed a major problem for the INNG, especially as the need for more specialized occupations increases and societal interest in a part-time military career decreases. Therefore, J1 must work harder to retain this talent by caring for Service Members and putting them and their families first.
Before addressing retention, there are a handful of definitions that will be used moving forward:
Absent Without Leave (AWOL) – any Soldier who has taken an unauthorized leave from his/her training or duty station is considered AWOL. Additional stipulations can be found here.
Civilian Education Level (CIVED) – the highest level of civilian education received by the Soldier at the date of the data pull.
Date of Rank (DOR) – Promotion date for the Soldier’s current rank.
Department of the Army (DOA) – Military Department within the United States Department of Defense.
Director’s Personnel Readiness Overview (DPRO) – the Army’s primary database and personnel readiness data collection tool that pulls from various personnel data sources, to include information regarding Soldier demographics, promotion periods, sources of enlistment/commission, etc.
Expiration Term of Service (ETS) – an enlisted Soldier’s last date in service; colloquially, the term is used interchangeably for enlisted Soldiers and commissioned Officers.
Fiscal Year (FY) – financial planning year for the US military, spanning from the first day of October through the last day of the following September; as an example, FY23 starts in October 2022 and ends in September 2023.
J1 – Director of Manpower and Personnel; equivalent of a Director of Human Resources in the private sector; J implies joint.
Joint – implies an operation or function including multiple branches of service, to include US entities and multinational agencies; in this case, it refers to the joining of the US Army and the US Air Force.
Initial Entry Training (IET) – formerly known as Basic Training, is the program of physical and mental training required in order for an individual to become a Soldier in the United States Army, Army Reserve or Army National Guard; both officers and enlisted Soldiers must complete IET to be considered qualified for entry into a formation.
Mandatory Removal Date (MRD) – the date at which a commissioned officer is required to be removed from service, usually on the date they reach 60 years of age or later (pending waivers).
Military Education Level (MILED) – the highest level of military education received by the Soldier at the date of the data pull.
Military Occupational Specialty (MOS) – a coded job for each service member, specific to their duties and assignments aligned within the military’s organizational structure.
National Guard Bureau (NGB) – the federal instrument responsible for the administration of the National Guard established by the United States Congress as a joint bureau of the Department of the Army and the Department of the Air Force, created by the Militia Act of 1903.
Pay Entry Base Date (PEBD) – the date at which the Soldier signed their first contract, whether commissioned or enlisted.
Soldier Care – a relatively new concept in the Army, it is the holistic approach to Soldier health and wellbeing and is defined differently by leaders across the United States Army.
Source of Commissioning (SOC) – the source through which a Soldier commissioned (e.g., Direct Commission, Officer Candidate School, Reserve Officer Training Corps, etc.).
Source of Enlistment (SOE) – the source through which a Soldier enlisted (e.g., Voluntary Enrollment, Voluntary Enlistment, etc.).
The Adjutant General (TAG) – the highest position of any state’s National Guard; formally politically appointed by the state’s governor, but with jurisdiction over all assets within a state’s National Guard.
Training Pay Category (TPC) – high-level generic code for pay status the Soldier is in, particularly as it pertains to their status of IET to become qualified in the position they hold.
War Fighting Function (WFF) – broad coverage of MOSs or series numbers categorized into the overall function in which they fall, to include Command and Control, Movement and Maneuver, Intelligence, Fires, Sustainment, and Protection.
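Because the loss and strength tables are organized by FY, the Fiscal Year definition above can be written as a small helper. This is a minimal sketch in base R; the function name `fiscal_year` is hypothetical and not part of the original analysis.

```r
# Map a calendar date to its US military fiscal year (FY).
# FY n runs from 1 October of year n-1 through 30 September of year n.
fiscal_year <- function(date) {
  date  <- as.Date(date)
  year  <- as.integer(format(date, "%Y"))
  month <- as.integer(format(date, "%m"))
  # October, November, and December belong to the next fiscal year.
  ifelse(month >= 10, year + 1L, year)
}

# Boundary dates around FY23, matching the example in the definition.
fiscal_year(c("2022-10-01", "2023-09-30", "2023-10-01"))
```

The first two dates fall in FY23 and the third opens FY24, which agrees with the FY23 example given in the definition.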
Because of the stratification of the J1, the response to the business problem has been phased to best serve the INNG, as follows:
Phase I: Gather and analyze retention data through DPRO data extraction from INNG Service Members to identify the generic profile of a non-retained Soldier. This project will primarily focus on this phase.
Phase II: Use this generic profile of a non-retained Soldier to address potential solutions in retaining this type of Soldier.
Phase III: Identify trends in self-identified justifications for Soldiers leaving the INNG, and pair those responses with profiles pulled from the DPRO data extraction.
Phase IV: Use this trend in self-identified attributes of non-retained Soldiers to address potential solutions in retaining this type of Soldier.
Phase V: Develop and implement a Service Member re-enlistment survey (much like the exit survey) to allow Service Members the opportunity to self-identify their reasons for staying in the INNG.
There are a handful of key questions that can be answered to help feed this phased approach in tackling retention, such as:
What, if any, indicators exist in a Soldier’s DPRO profile (i.e., sex, rank, MOS, etc.) to better determine which Soldiers are more likely to get out? This is largely dependent on consistencies between Soldiers within the data provided.
What are some quantifiable justifications that affect whether they get out (pay, education benefits, rank stagnancy, etc.)? This would further shape the efforts of the Indiana National Guard to retain Soldiers via tangible actions in their careers.
What factors are being identified within that retention window and being used to shape the Soldier’s exit interview? In other words, are the opinions shared in the exit survey reflective of their DPRO profile? This would provide legitimacy (or lack thereof) to the survey used when Soldiers leave the INNG.
What factors cause Soldiers to stay in the Guard (based on what career markers exist within a Soldier’s profile who does re-enlist prior to their contract end-date)? This is also largely dependent on consistencies between Soldiers within the data provided.
To the average person, retention in the INNG is not at all a priority. In fact, most would argue that even members of the INNG are not worried about retention unless it is their own and they are approaching their ETS window. However, leaders across the National Guards of all states are interested in retention of their Soldiers. Without people (again, the greatest asset), the INNG and other states cannot operate. To assess the criticality of stakeholders to this project, the image to the right is used to generate a scale on which to prioritize stakeholders, with quadrants addressed from left-to-right, top-to-bottom.
Quadrant 1 (High Power, Low Interest): require some attention and must be satisfied
Quadrant 2 (High Power, High Interest): require the most attention and must be managed closely
Quadrant 3 (Low Power, Low Interest): require minimal attention and must be monitored with minimal effort
Quadrant 4 (Low Power, High Interest): require some attention and must be informed
Some key stakeholders of this project are outlined below, in order of their priority according to the image on the right:
TAG of the INNG: With his ‘People First’ strategy and the amount of power he holds in the state, he is the greatest stakeholder in this project. He specifically requested formal regular updates on this project as it progresses to redirect his staff to assist and to reallocate resources for further research. The retention of Soldiers in the INNG also greatly affects his annual budget allotments sanctioned through NGB. TAG is assessed to have both high power and high interest.
J1 of the INNG: This is the primary section responsible for pulling data that contributes to this project and the primary implementation group, pending the results. While they are under the authority of TAG, they hold the main power (with about 50 employees) for assistance with and implementation of this project. INNG’s J1 is assessed to have both high power and high interest, but less than that of TAG.
Soldiers and Leaders of the INNG: While most Soldiers and Leaders are unaware of this project, they all hold power over this project in the tactical implementation of its results, their implementation of Soldier Care, and the effects they can have on this project moving forward. Soldiers and Leaders are assessed to have some power and low interest.
The General Public: The general public is not aware of this project, but if the INNG can use this project to better shape its retention, that will leave the INNG with a sustained and experienced force ready to assist, protect, and defend the civilian populace of Indiana. The general public is assessed to have low power and low interest.
DOA, NGB, all 54 States/Territories: While no individuals at NGB, DOA, or other states are aware of this ongoing project, this project could have impacts that may be worth implementing at their levels if it proves fruitful. Retention is an issue across all of DOA, to include NGB and all 54 states/territories. If this retention project and analysis affects the state appropriately, the results could benefit all leaders across all echelons. As a group, these are assessed to have high, but not applicable, power and very low interest. This is all contingent on the outcome of and the feasibility to implement the project.
Below is information regarding the raw data files pulled from DPRO, including their specific number of rows and columns. Because this data is fragmented, the data understanding tables are the same (where duplicated) as labeled in the headers. Everything from this point forward is considered unclassified but for official use only (with limited distribution) – it is not to be used for public release.
Fiscal Year Losses: These tables contain the Soldier profiles for those not retained at the end of a FY. There are a total of six tables (by FY). This data can be used to compare qualities or attributes of Soldiers that stayed in versus those who exited the guard between FYs. This data contains basic Soldier information such as unit, months in current grade and MOS, gender, race, ethnicity, and other demographic data. It also includes the type of Soldier (whether enlisted, warrant officer, or commissioned officer) and basic unit information. These tables also contain information on why the Soldier exited the INNG, according to codes (including descriptions), as determined by the INNG J1 (i.e., medical retirement, regular retirement, adverse action, etc.).
Fiscal Year Strength: These tables contain basic Soldier information for the entire starting strength for the FY, which also includes the Soldiers that were lost at the end of a fiscal year. In contrast to the losses data, this strength data does not indicate codes for exiting the INNG. However, this does contain additional information regarding ETS or MRD.
Below are the tables for data understanding for the data. Columns for tables across fiscal year losses are the same, just as columns for tables across fiscal year strength are the same. However, there are some duplicate columns between losses/strength, and some that are inconsistent between losses/strength.
FYXX Losses.xlsx:
FYXX Assigned Strength.xlsx:
Cleaning of this data used several methods, to include manual entry, removal of duplicates, addition of columns for simpler filtration, and down-scaling sets for targeted analysis.
Duplicate entries, identified by Soldier Name jointly with Last Four, were removed from the tables for total strength and total losses. This leaves the following two tables:
Unit State, UPC, Unit Name, POD, Soldier Name, Last Four, Months in Grade Completed, Grade, DMOS, Gender, Race/Ethnicity, Loss Reason, Unit State-1, POD-1, UPC-1, TPC, Mo in Grd, RSP Site, Military Personnel Class, and RSID (Key). This leaves one table with duplicate information removed:
Soldier Name-1 → Name
Last Four-1 → Last Four
Gender-1 → Gender
Race / Ethnicity-1 → Race/Ethnicity
Grade-1 → Grade
DMOS-1 → MOS
Months in Grade Completed-1 → Months in Grade
Mo in Svc → Months in Service
ETS or MRD Date → ETS/MRD
CIVED Cert → CIVED
Unit Name-1 → Unit Name
TPC-Desc → TPC
Loss Reason-Desc → Loss Reason
DOR: String to Date
ETS/MRD: String to Date
PEBD: String to Date
Date of Commission: String to Date
Months in Service is complete only for those considered as losses. However, this information was reasonably deduced using the difference between the PEBD and the date for which this data was pulled. Reported Months in Service remained, while missing values were imputed using the DATEDIFF() function. This leaves one table with Months in Service imputed:
Loss Reason has many closely-related loss reasons. These were consolidated for more meaningful analysis later on. This leaves one table with more consolidated content: Joined_v6 20 columns x 18,835 rows. Below is a summary of the remaining categories for separation:
A column titled Loss was added, with the values of Y and N to indicate whether the Soldier was a loss, leaving the following table:
Because Name is a column containing full unique names for each Soldier, using this data isn’t in accordance with the Privacy Act. An additional column titled ID containing unique identifiers for each Soldier’s input was created, leaving the following table:
The table still contained Last Four in conjunction with Name, which together would be considered Personally Identifiable Information (PII) and in violation of the Privacy Act. To mitigate bias in identifying Soldiers’ names, the columns were removed, leaving:
For MOS, it is much easier to analyze the series than the explicit MOS for each Soldier. The explicit MOS of a Soldier is outlined by several factors: their series number (indicated by the first two numbers of the MOS), the specification (indicated by the following letter of the MOS), and their proficiency (indicated by the string following the letter). As Soldiers rise in the ranks of their MOS, their series and specification remain the same but their proficiency changes. Delineating between specific MOS and series number is an easier way to group Soldiers for analysis, as a series specifies the general area they specialize in. As an example, a 25-series has a plethora of specific MOSs, but all have the same number of 25 designating their series. This leaves the following:
New columns were derived from MOS and Grade. For MOS, a new column titled WFF for War Fighting Function was added. This WFF is a broader category that defines the generic purpose for each MOS. For Grade, a new column titled Type was added. These types include Warrant, Officer, and Enlisted. These changes leave the following table:
Race/Ethnicity has many microcategories. For ease of analysis, each Soldier’s race/ethnicity was defaulted to the first race/ethnicity listed in the column, leaving the following table:
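Three of the cleaning steps above (imputing Months in Service, reducing MOS to its series, and deriving Type from Grade) can be sketched in base R. This is an illustrative sketch only: the toy rows, the MOS strings, the `W2` grade code, the pull date, and the helper `months_between` are all assumptions, and counting completed whole months is just one reasonable analogue of DATEDIFF(). The MOS-to-WFF mapping is not reproduced because the source does not list it.

```r
# Toy rows standing in for the DPRO extract; every value here is made up.
toy <- data.frame(
  MOS   = c("25B30", "11B10", "92Y20"),
  Grade = c("E6", "O3", "W2"),
  PEBD  = as.Date(c("2010-03-15", "2003-08-15", "2015-11-01")),
  `Months in Service` = c(NA, 229, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
pull_date <- as.Date("2022-09-30")  # assumed data-pull date

# Completed whole months between two dates.
months_between <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt   <- as.POSIXlt(to)
  (to_lt$year - from_lt$year) * 12 +
    (to_lt$mon - from_lt$mon) -
    (to_lt$mday < from_lt$mday)  # subtract 1 if the final month is incomplete
}

# Keep reported Months in Service; impute only the missing values from PEBD.
missing_mis <- is.na(toy$`Months in Service`)
toy$`Months in Service`[missing_mis] <-
  months_between(toy$PEBD[missing_mis], pull_date)

# Series = first two characters of the MOS.
toy$Series <- substr(toy$MOS, 1, 2)

# Type from the first letter of Grade (E/W/O).
type_lookup <- c(E = "Enlisted", W = "Warrant", O = "Officer")
toy$Type <- unname(type_lookup[substr(toy$Grade, 1, 1)])

toy
```

Keeping the reported values and filling only the gaps mirrors the approach described above, so Soldiers with a recorded Months in Service are never overwritten by the estimate.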
Below is the heading of the final table representing Joined_v12, including the data types and a snippet into the values these columns contain.
#######################
# Final Excel Heading #
#######################
str(retention)
## tibble [18,835 × 22] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:18835] 10000 10001 10002 10003 10004 ...
## $ Gender : chr [1:18835] "M" "F" "M" "M" ...
## $ Race/Ethnicity : chr [1:18835] "White" "Black" "White" "White" ...
## $ MOS : chr [1:18835] "01" "01" "01" "00" ...
## $ WFF : chr [1:18835] "Immaterial" "Immaterial" "Immaterial" "Immaterial" ...
## $ Grade : chr [1:18835] "O3" "O4" "O3" "E9" ...
## $ Type : chr [1:18835] "Officer" "Officer" "Officer" "Enlisted" ...
## $ DOR : POSIXct[1:18835], format: NA NA ...
## $ Months in Grade : num [1:18835] 9 57 13 46 58 20 53 16 2 97 ...
## $ Months in Service : num [1:18835] 229 357 235 398 203 358 435 154 188 407 ...
## $ ETS/MRD : POSIXct[1:18835], format: "2039-06-30" "2027-09-30" ...
## $ MILED : chr [1:18835] NA NA NA "SSD Level 6 Grad" ...
## $ CIVED : chr [1:18835] NA NA NA "Completed 1 Semester to 1-4 Years College" ...
## $ Unit Name : chr [1:18835] "DET 1 (CERF P) 81ST TC" "81ST TROOP COMMAND (-)" "81ST TROOP COMMAND (-)" "81ST TROOP COMMAND (-)" ...
## $ TPC : chr [1:18835] "Completed Training" "Completed Training" "Completed Training" "Completed Training" ...
## $ PEBD : POSIXct[1:18835], format: "2003-08-15" "1992-12-21" ...
## $ Transaction Date : POSIXct[1:18835], format: NA NA ...
## $ Attrition : chr [1:18835] NA NA NA "Y" ...
## $ SOC/SOE : chr [1:18835] NA NA NA "Vol Enrl RC, Under 10 USC651 on/After 1 June 84" ...
## $ Date of Commission: POSIXct[1:18835], format: NA NA ...
## $ Loss Reason : chr [1:18835] NA NA NA "Regular Retirement" ...
## $ Loss : chr [1:18835] "N" "N" "N" "Y" ...
Below is a summary of missing values per column in Joined_v12. One special attribute to note about the total number of missing values is that many columns have a total of 9,874. This is due to the initial data sets not having matching columns on this basic demographic data. In this case, the initial Strengths data set was missing information such as MILED, CIVED, and SOC/SOE. And of course, initial strength also did not include the Transaction Date, Attrition, or Loss Reason columns as these Soldiers were not initially counted as losses unless truly lost following the join.
##################
# Number Missing #
##################
colSums(is.na(retention))
## ID Gender Race/Ethnicity MOS
## 0 0 0 0
## WFF Grade Type DOR
## 0 0 0 9874
## Months in Grade Months in Service ETS/MRD MILED
## 0 0 0 9874
## CIVED Unit Name TPC PEBD
## 9874 0 0 0
## Transaction Date Attrition SOC/SOE Date of Commission
## 9874 9874 9874 18211
## Loss Reason Loss
## 9874 0
Below is the final data understanding table for Joined_v12 to be used throughout the remainder of this analysis.
The target value for this predictive analysis is: Loss.
This would allow factors in the data set to potentially be used to
predict whether a Soldier is lost or retained.
This section provides an overview of some exploratory analysis into the data set, including two interactive portions from a Tableau Dashboard. This also provides some insight into the methods used and the goals of their use.
Below is a shallow step into some analysis to learn more about the configuration of the data set. The below figures explore all of the data, to include both standing strength and losses. These first three visualizations built in R detail several counts across the force, including counts by Gender, WFF, Type, and Race/Ethnicity. Some key trends can be identified in these exploratory graphs:
The below figures explore only information known for Soldiers who left the INNG. Because this data is not available for both lost and retained Soldiers, it’ll only be used here for broader understanding of the data set. The below bar plots show the counts of the highest level of military education achieved by Soldiers who left the INNG, as well as some basic counts of lost Soldiers by WFF, Grade, and Gender. From the below visualizations, a few observations can be made:
The below shows some basic counts of lost Soldiers by WFF, Grade, and Gender. From the below visualizations, a few observations can be made:
Below is a description of the analytic methods used for predicting a loss based on the provided data that exists for all Soldiers.
Correlation Heat
Maps: While not an explicit predictive method, correlation maps
and their associated matrices can assist in identifying relationships in
the data prior to the analysis. In this case, WFF and
Gender were correlated, Soldier Type and
Gender were correlated, and Loss Reason and
Gender were correlated. The correlation maps done were limited to
explore the data and identify key relationships for future
recommendations. However, this list is not exhaustive of all
possibilities for correlation exploration.
Classification
Tree: Classification trees are used to predict categorical
dependent variables using categorical and numeric covariates. In this
case, Gender, WFF, Grade,
Months in Grade, and Months in Service were
used to build this model.
KNN: KNN is used to predict categorical dependent variables using numeric independent variables. In this case, Months in Grade and Months in Service were used as predictors for whether the Soldier would be considered a loss.
Logistic
Regression: Logistic Regression is also used to predict
categorical dependent variables using categorical and numeric
independent variables, although conversion of character variables to
factor variables is required. In this case, following multiple
iterations of backwards selection (removing insignificant variables),
the covariates of Gender, Months in Grade, and
Months in Service are reliable and significant in
predicting Loss as Y or
N.
Where required, the same testing and training sets were used throughout all analytic methods, with a seed set to ensure consistent sampling for each of the sets.
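A seeded split of this kind can be sketched as follows. The seed value `123` is hypothetical, since the value used in the original analysis is not stated; the row count and the 50% proportion are taken from the data summary and the sampling code in this report.

```r
set.seed(123)  # hypothetical seed; the original value is not stated

n <- 18835                                  # rows in the joined table
train_index <- sample(n, floor(n * 0.50))   # row numbers for the training set
test_index  <- setdiff(seq_len(n), train_index)

length(train_index)
length(test_index)

# Resetting the same seed reproduces exactly the same training rows,
# which is what lets every method share one train/test split.
set.seed(123)
identical(sample(n, floor(n * 0.50)), train_index)
```

With an odd number of rows the halves differ by one, which gives 9,417 training rows and 9,418 test rows.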
Below is a table of R packages used, their justifications for use, and the designated definitions for each.
Definitions for these R packages were taken directly from the R Project.
Below are two correlation heat maps identifying correlation between the War Fighting Functions (WFF) and Gender, the first for losses and the second for Soldiers that were retained. While there are no obvious significant correlations between the variables given for Soldiers lost nor between the variables given for Soldiers retained, some subtle differences can be noted:
Sustainment
Movement & Maneuver and Fires
Intelligence slightly more and Males trending in Protection, Immaterial, Fires, and Command and Control
#####################
# Correlation Plots #
#####################
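#Packages used below: dplyr (filter), fastDummies (dummy_cols), reshape2 (melt), ggplot2 (plots)
library(dplyr)
library(fastDummies)
library(reshape2)
library(ggplot2)
#Correlation for Losses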
loss_Y<-filter(retention, Loss=="Y") #filter for lost
retention_loss_cor1 <- dummy_cols(loss_Y, select_columns = c("Gender", "WFF"))
retention_cor_mat <- round(cor(retention_loss_cor1[23:31]), 4) #to four decimals
melt_retention_cor_mat <- melt(retention_cor_mat) #melted correlation
#Correlation Plot for Losses
ggplot(data = melt_retention_cor_mat, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + geom_text(aes(label = value), color = "black", size = 2) + theme(axis.text.x = element_text(angle = 45, hjust=1)) + scale_fill_gradient(low = "white", high = "black", guide = "colorbar")
#Correlation for Retains
loss_N<-filter(retention, Loss=="N") #filter for retained
retention_loss_cor2 <- dummy_cols(loss_N, select_columns = c("Gender", "WFF"))
retention_cor_mat <- round(cor(retention_loss_cor2[23:31]), 4) #to four decimals
melt_retention_cor_mat <- melt(retention_cor_mat) #melted correlation
#Correlation Plot for Retains
ggplot(data = melt_retention_cor_mat, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + geom_text(aes(label = value), color = "black", size = 2) + theme(axis.text.x = element_text(angle = 45, hjust=1)) + scale_fill_gradient(low = "white", high = "black", guide = "colorbar")
Below are two correlation heat maps identifying correlation between the Type of Soldier (Officer, Enlisted, Warrant) and Gender, the first for losses and the second for Soldiers that were retained. While there are no obvious significant correlations between the variables given for Soldiers lost nor between the variables given for Soldiers retained, some subtle differences can be noted:
Warrant and Officer
Enlisted
#####################
# Correlation Plots #
#####################
retention_loss_cor3 <- dummy_cols(loss_Y, select_columns = c("Gender", "Type"))
retention_cor_mat <- round(cor(retention_loss_cor3[23:27]), 4) #to four decimals
melt_retention_cor_mat <- melt(retention_cor_mat)
ggplot(data = melt_retention_cor_mat, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + geom_text(aes(label = value), color = "black", size = 2) + theme(axis.text.x = element_text(angle = 45, hjust=1)) + scale_fill_gradient(low = "white", high = "black", guide = "colorbar")
retention_loss_cor4 <- dummy_cols(loss_N, select_columns = c("Gender", "Type"))
retention_cor_mat <- round(cor(retention_loss_cor4[23:27]), 4) #to four decimals
melt_retention_cor_mat <- melt(retention_cor_mat)
ggplot(data = melt_retention_cor_mat, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + geom_text(aes(label = value), color = "black", size = 2) + theme(axis.text.x = element_text(angle = 45, hjust=1)) + scale_fill_gradient(low = "white", high = "black", guide = "colorbar")
Below is a correlation heat map identifying correlation between Loss Reason and Gender for Soldiers lost. While there are no obvious significant correlations between these variables, some subtle differences can be noted:
IET Discharge and Hardship or Religious
Resigned Commission, Regular Retirement, Obligation Complete, Non-Criminal Misconduct, Failure to Meet Requirements, Enrolled in ROTC, Death, Criminal Misconduct, AWOL, and Accepted Commission
Medical Retirement/Separation, Erroneous Enlistment, Component/Service Transfer, and Administrative
While correlations are not inherently predictive nor do they show causation, they can help tell a story about the data. From the above relationships identified, the heat map suggests that females tend to leave the INNG for more administrative or medical reasons, whereas males tend to leave the INNG for either professional growth (e.g., accepting a commission or enrolling in ROTC) or various forms of misconduct (criminal, non-criminal, or absence without leave). This, alone, can suggest changes in the INNG to capitalize on professional growth, reprimand misconduct, and shape the well-being of Soldiers.
#####################
# Correlation Plots #
#####################
retention_loss_cor5 <- dummy_cols(loss_Y, select_columns = c("Gender", "Loss Reason"))
retention_cor_mat <- round(cor(retention_loss_cor5[23:40]), 4) #to four decimals
melt_retention_cor_mat <- melt(retention_cor_mat)
ggplot(data = melt_retention_cor_mat, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + geom_text(aes(label = value), color = "black", size = 1) + theme(axis.text.x = element_text(angle = 45, hjust=1)) + scale_fill_gradient(low = "white", high = "black", guide = "colorbar")
While correlations can provide some basic insight into the internal relationships in the data, they are not predictive on their own and do not inherently point to causation. Even so, the demographic relationships present in the above heat maps are consistent with common stereotypes surrounding service in the military. Correlation maps and their associated matrices can assist in identifying relationships in the data prior to the analysis. In this case, WFF and Gender were correlated, Soldier Type and Gender were correlated, and Loss Reason and Gender were correlated. The correlation maps were limited to exploring the data and identifying key relationships for future recommendations. However, this list is not exhaustive of all possibilities for correlation exploration.
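To make the mechanics of these heat maps concrete, the sketch below builds 0/1 indicator columns and correlates them on a small made-up data set. It uses base R's `model.matrix()` rather than `dummy_cols()`, which is a swap made here only to keep the example self-contained; the toy values are not real Soldier data.

```r
# Toy categorical data; values are invented for illustration only.
toy <- data.frame(
  Gender = c("M", "F", "M", "M", "F", "M", "F", "M"),
  Type   = c("Enlisted", "Officer", "Enlisted", "Warrant",
             "Enlisted", "Officer", "Enlisted", "Enlisted"),
  stringsAsFactors = TRUE
)

# One 0/1 indicator column per level ("- 1" keeps every level, no baseline dropped).
gender_dummies <- model.matrix(~ Gender - 1, data = toy)
type_dummies   <- model.matrix(~ Type - 1, data = toy)

# Pearson correlation between indicator columns, rounded to four decimals.
cor_mat <- round(cor(cbind(gender_dummies, type_dummies)), 4)
cor_mat

# The two Gender indicators are exact complements, so this cell is always -1.
cor_mat["GenderF", "GenderM"]
```

Two points follow from this sketch. Cells comparing the indicators of a single two-level variable (such as the two Gender columns) carry no information, so only the cross-variable cells are worth reading. Referring to columns by name, as here, also avoids depending on positional ranges such as `[23:31]`, which shift if a column is added upstream.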
Classification trees are inherently very easy to understand and digest as they tend to follow more intuitive processing for decisions. They also provide a simple visual representation of relationships that work well with qualitative data without the need to use dummy variables, as is true with correlation. One of the disadvantages of classification trees is the room for error due to a lower level of predictive accuracy.
This classification tree below uses the factors of
Gender, WFF, Grade,
Months in Grade, and Months in Service to
predict whether Soldiers are retained or lost. To read the tree below,
begin with the first node:
This methodology can be followed for all nodes. Without going into exhaustive detail on the classification tree, some important information can be extracted:
While Gender and WFF were used in the development of this classification tree, the variables were not deemed important enough to be held in any aspect of the tree. Classification trees are generally flexible models that do not increase the parameters by adding more variables.
#################
# Decision Tree #
#################
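#Packages used below: rpart (tree fitting), rpart.plot (tree plotting), caret (confusionMatrix)
library(rpart)
library(rpart.plot)
library(caret)
#Random Sampling (50/50 train/test split; the seed is set beforehand for reproducibility)
index <- sample(nrow(retention), floor(nrow(retention)*0.50))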
ret_train <- retention[index,]
ret_test <- retention[-index,]
#Fit and Display Decision Tree
ret_rpart <- rpart(formula = Loss ~ Gender + WFF + Grade + `Months in Grade` + `Months in Service`, data = ret_train, method = 'class')
rpart.plot(ret_rpart, extra = 106)
#Prediction
ret_pred_dt <- predict(ret_rpart, newdata = ret_test, type = "class")
#Confusion Matrix
cm1 <- confusionMatrix(ret_pred_dt, as.factor(ret_test$Loss))
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 4129 1942
## Y 838 2509
##
## Accuracy : 0.7048
## 95% CI : (0.6955, 0.714)
## No Information Rate : 0.5274
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4001
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8313
## Specificity : 0.5637
## Pos Pred Value : 0.6801
## Neg Pred Value : 0.7496
## Prevalence : 0.5274
## Detection Rate : 0.4384
## Detection Prevalence : 0.6446
## Balanced Accuracy : 0.6975
##
## 'Positive' Class : N
##
The accuracy score for this prediction tool is listed below; it is the percentage of test-set values correctly predicted using this method. While this is not the strongest method for prediction, as will soon be shown, it can still be useful for quickly and intuitively categorizing Soldiers and estimating, from the probabilities in the tree above, the likelihood that they will stay in or get out.
####################
# Display Accuracy #
####################
cm1$overall['Accuracy']
## Accuracy
## 0.7048206
Note: Whether a Soldier is retained is predicted in this model using
Gender, WFF, Grade, Months in Grade, and
Months in Service.
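The headline statistics in the output above follow directly from the four cell counts of the confusion matrix. The short sketch below recomputes them by hand from the classification tree's counts; the object names are illustrative, and the positive class is taken to be `N`, as reported in the output.

```r
# Counts copied from the classification tree's confusion matrix
# (rows = prediction, columns = reference; positive class is "N")
cm <- matrix(c(4129, 838, 1942, 2509), nrow = 2,
             dimnames = list(Prediction = c("N", "Y"),
                             Reference  = c("N", "Y")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / all predictions
sensitivity <- cm["N", "N"] / sum(cm[, "N"])    # retained Soldiers correctly predicted "N"
specificity <- cm["Y", "Y"] / sum(cm[, "Y"])    # lost Soldiers correctly predicted "Y"
nir         <- max(colSums(cm)) / sum(cm)       # no-information rate: share of the larger class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_information_rate = nir), 4)
```

Comparing accuracy with the no-information rate shows how much the model improves on simply predicting the more common class for every Soldier.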
K-nearest neighbors (KNN), used below, predicts a categorical variable
using numeric covariates. In this case, we use the numeric
Months in Grade and Months in Service as
predictors of whether Loss is recorded as Y or
N. This algorithm stores all available training data and classifies
new data points based on their similarity to the stored points. Through
trial-and-error in verifying the performance of the algorithm, the
best-performing value for k was 5, shown below.
#############
# KNN Model #
#############
#Develop KNN Model
ret_knn <- knn(train = ret_train[, 9:10], test = ret_test[, 9:10], cl = as.vector(as.matrix(ret_train[, 22])), k = 5)
#Confusion Matrix for Misclassified
cm2 <- confusionMatrix(ret_knn, as.factor(ret_test$Loss))
cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 4077 1311
## Y 890 3140
##
## Accuracy : 0.7663
## 95% CI : (0.7576, 0.7748)
## No Information Rate : 0.5274
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5289
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8208
## Specificity : 0.7055
## Pos Pred Value : 0.7567
## Neg Pred Value : 0.7792
## Prevalence : 0.5274
## Detection Rate : 0.4329
## Detection Prevalence : 0.5721
## Balanced Accuracy : 0.7631
##
## 'Positive' Class : N
##
The accuracy score for this prediction tool is listed below, which displays the percentage of values correctly predicted using this method. This model is more accurate than the classification tree previously used (76.63% versus 70.48%), although KNN predictions are not as intuitive to visualize as decision trees, so the visualization has been removed.
####################
# Display Accuracy #
####################
cm2$overall['Accuracy']
## Accuracy
## 0.7662986
Note: Whether a Soldier is retained is predicted in this model using
Months in Grade and Months in Service.
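The trial-and-error search for k described above can be sketched as a loop over candidate values, scoring each on held-out data. The example below is a minimal base-R illustration on synthetic data; `knn_one`, `acc_for_k`, and the simulated columns are hypothetical and do not reproduce the INNG data or the `knn()` call used in this analysis.

```r
# Classify one query point by majority vote among its k nearest training rows
knn_one <- function(train_x, train_y, query, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))   # Euclidean distance to each training row
  votes <- table(train_y[order(d)[seq_len(k)]])    # class counts among the k closest rows
  names(votes)[which.max(votes)]
}

set.seed(1)  # assumption: synthetic data for illustration only
n <- 200
x <- cbind(months_in_grade   = runif(n, 0, 60),
           months_in_service = runif(n, 0, 240))
y <- ifelse(x[, 1] + rnorm(n, sd = 10) > 30, "Y", "N")

train <- 1:100
test  <- 101:200

# Held-out accuracy for a given k
acc_for_k <- function(k) {
  pred <- apply(x[test, ], 1, function(q) knn_one(x[train, ], y[train], q, k))
  mean(pred == y[test])
}

ks  <- c(1, 3, 5, 7, 9)
acc <- sapply(ks, acc_for_k)
names(acc) <- ks
acc
best_k <- ks[which.max(acc)]   # candidate k with the highest held-out accuracy
```

Because KNN relies on distances, a variable measured on a larger scale weighs more heavily in the vote; standardizing the predictors before fitting is a common option worth testing.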
Logistic regression, a type of generalized linear model, can be used to
predict a binary categorical variable using numeric and categorical
covariates. In this case, following iterative removal of insignificant
variables, only Gender, Months in Grade, and
Months in Service remain.
#######################
# Logistic Regression #
#######################
#Convert Train Variables
ret_train$Loss <- as.factor(ret_train$Loss)
#Convert Test Variables
ret_test$Loss <- as.factor(ret_test$Loss)
#Run Regression
ret_glm <- glm(Loss ~ Gender + `Months in Grade` + `Months in Service`, data = ret_train, family = binomial)
summary(ret_glm)
##
## Call:
## glm(formula = Loss ~ Gender + `Months in Grade` + `Months in Service`,
## family = binomial, data = ret_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5936 -1.1136 -0.7922 1.1736 2.3485
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.083725 0.052798 -1.586 0.1128
## GenderM 0.138438 0.054027 2.562 0.0104 *
## `Months in Grade` 0.023110 0.001238 18.670 <2e-16 ***
## `Months in Service` -0.005759 0.000343 -16.789 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13038 on 9416 degrees of freedom
## Residual deviance: 12586 on 9413 degrees of freedom
## AIC: 12594
##
## Number of Fisher Scoring iterations: 4
#Generate Prediction
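pred_ret_glm <- predict(ret_glm, newdata = ret_test, type = "response")
#Note: type = "response" returns probabilities between 0 and 1, so a cutoff of 0 labels every Soldier "Y"; 0.5 is the conventional cutoff
y_or_n <- ifelse(pred_ret_glm >= 0, "Y", "N")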
p_class <- factor(y_or_n, levels = levels(ret_test$Loss))
#Confusion Matrix
cm3 <- confusionMatrix(p_class, as.factor(ret_test$Loss))
cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 0 0
## Y 4967 4451
##
## Accuracy : 0.4726
## 95% CI : (0.4625, 0.4827)
## No Information Rate : 0.5274
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.4726
## Prevalence : 0.5274
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : N
##
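The accuracy score for this prediction tool is listed below, which displays the percentage of values correctly predicted using this method. This is much less accurate than the classification tree or the KNN model previously used. Another thing to note: the confusion matrix shows that every Soldier in the test set was classified as Y, so the model produced only true negatives and false negatives relative to the positive class N. This follows from the cutoff of 0 applied to predicted probabilities, which always lie between 0 and 1, so the 47.26% accuracy equals the share of losses in the test set rather than a measure of the regression’s ability to separate the two groups. As applied here, this model is unreliable for predicting whether a Soldier will be retained.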
####################
# Display Accuracy #
####################
cm3$overall['Accuracy']
## Accuracy
## 0.4726056
Note: Whether a Soldier is retained is predicted in this model using
Gender, Months in Grade, and
Months in Service.
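The effect of the classification cutoff can be seen in a small sketch. The example below fits a logistic regression to synthetic data and compares a cutoff of 0 with the conventional 0.5; the data, coefficients, and object names are assumptions made for illustration and are not results from the INNG data set.

```r
set.seed(42)  # assumption: synthetic data for illustration only
n <- 500
sim <- data.frame(months_in_grade = runif(n, 0, 60))
p_true <- plogis(-1.5 + 0.05 * sim$months_in_grade)        # made-up true loss probability
sim$loss <- factor(ifelse(runif(n) < p_true, "Y", "N"), levels = c("N", "Y"))

fit <- glm(loss ~ months_in_grade, data = sim, family = binomial)

# With a two-level factor response, glm() models the probability of the second level ("Y")
prob_y <- predict(fit, newdata = sim, type = "response")

# A cutoff of 0 is met by every probability, so every row is labelled "Y"
pred_cut0  <- factor(ifelse(prob_y >= 0,   "Y", "N"), levels = c("N", "Y"))
# A cutoff of 0.5 labels a row "Y" only when a loss is the more likely outcome
pred_cut05 <- factor(ifelse(prob_y >= 0.5, "Y", "N"), levels = c("N", "Y"))

table(Prediction = pred_cut0,  Reference = sim$loss)
table(Prediction = pred_cut05, Reference = sim$loss)
```

The 0.5 cutoff is only a convention; a different value may be appropriate when missing a likely loss is costlier than flagging a Soldier who stays.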
This analysis began with the discussion of Soldier care, and the intent of this analysis was initially to provide some insight into statistics around retention of Soldiers to lend further insights to leaders in the Indiana National Guard on just how to care for Soldiers. One of the ways to approach this question of Soldier care is by examining why Soldiers get out through statistically sound means. In this case, this project’s aim evolved into utilizing available data to determine the algorithm that would best reflect and predict whether a Soldier is to be retained.
Each Soldier and their leaders have their own perceptions of how and why Soldiers leave the INNG. With a wide selection of predictor variables to choose from, an analysis such as this can be clouded by bias, or its variables can be selected based on our own perceptions. This analysis is by no means inclusive of all factors affecting retention. However, it does provide insight into some factors that can realistically affect whether a Soldier stays in.
According to the analyses done, both Months in Grade and
Months in Service can be used in a variety of analytic
methods to predict whether a Soldier will get out, alongside other
variables in some cases (pending limitations of the methods used).
Classification Tree: This tree used the variables Gender,
WFF, Grade, Months in Grade, and
Months in Service for its development. Viewing the tree,
Gender and WFF were not deemed important
enough factors to determine the retention of a Soldier in the data set.
However, from the Grade at the root node, both
Months in Grade and Months in Service were
used to break out the decisions. Using this decision tree,
which serves both as a visual aid in determining a Soldier’s retention
and as an intuitive tool for understanding retention, we can be 95%
confident that this tree can correctly categorize retention of Soldiers
with an accuracy between 69.55% and 71.40%. While not a perfect model,
it’s usable and easy to understand, and has an acceptable level of
accuracy.
KNN Model: While KNN models are not easy to visualize, they can capture
layered relationships that exist within the data set. In this model,
only the variables of Months in Grade and Months in Service were
used, primarily due to the numeric nature required of covariates in
this form of model. Using this model, we can (with 95% confidence)
correctly categorize retention of Soldiers with an accuracy between
75.76% and 77.48%. This model predicts retention better than the
Classification Tree above and the Logistic Regression below, but is
more limited in its application as it does not accept
non-numeric (or character) variables as inputs. As an accurate tool
requiring minimal data collection, it lets leaders in the INNG rapidly
predict the retention of their Soldiers.
Logistic Regression: This regression model, following multiple steps of
backwards selection, used the variables of Gender,
Months in Grade, and Months in Service for its
predictions. As applied here, the model classified every Soldier in the
test set as a loss, so it produced only true negatives and false
negatives. We can (with 95% confidence) place its accuracy between
46.25% and 48.27%, which is below the no-information rate of 52.74%.
This model performed much worse than the KNN model and the
Classification Tree.
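The 95% intervals quoted for the three models can be recomputed from the counts of correct predictions alone. The sketch below uses base R's exact binomial test, which is understood to be what `confusionMatrix()` reports; the object names are illustrative, and the counts are copied from the confusion matrices above.

```r
# Correct predictions (diagonal of each confusion matrix) out of 9418 test Soldiers
correct <- c(tree = 4129 + 2509, knn = 4077 + 3140, logistic = 0 + 4451)
total   <- 9418

# Exact (Clopper-Pearson) 95% confidence interval for each accuracy
ci <- t(sapply(correct, function(k) binom.test(k, total)$conf.int))
colnames(ci) <- c("lower", "upper")

round(cbind(accuracy = correct / total, ci), 4)
```

Because the three intervals do not overlap, the ranking of the models on this test set is unlikely to be an artifact of sampling noise alone.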
Overall, it is highly recommended that leaders primarily refer to the classification tree when determining whether a Soldier is likely to stay in. While the classification tree is not the most accurate model, it provides a relatively intuitive way to analyze whether a Soldier will leave the INNG and requires minimal data entry for its decisions. Given the breadth of the data (spanning over 7 years), leaders scanning their formations can use this tool quickly to estimate the likelihood that a Soldier will leave the Guard. Because it needs little additional data input, it can practically be deployed by leaders at all echelons of the INNG.
Data for this project is by no means all-inclusive and exhaustive. There is a plethora of data points that can be used to increase the breadth and accuracy of this project. Initially, this project started with two distinct types of data sets: the total strength at the start of a fiscal year (by Soldier name) and the total losses by the end of that fiscal year (by Soldier name). However, columns existed in the losses tables but not in the strength table. These columns included data about
These known attributes alone could provide more insight into predictors for whether a Soldier could get out, but because the data only existed for Soldiers that left the guard, it could not be used in this analysis.
There are additional qualities available for further examination to predict whether a Soldier leaves the INNG. The Director’s Personnel Readiness Overview, or DPRO, has a lot of versatility and variability in the predictors one can use in an analysis like this, such as
This list, again, is by no means exhaustive. In beginning this project, the hope was to analyze the data itself as opposed to weighing the methods used for prediction. However, given that some prediction methods are stronger than others using the minimal variables available, it is reasonable to expect that additional input variables would not only increase the accuracy of the prediction methods but also expand the methods available for further analysis.
While this project was limited in its data, leaders in the INNG are
still strongly encouraged to use ADP
6-22: Army Leadership and the Profession to assist in developing a
conducive leadership style that serves Soldiers, promotes a team
mindset, and lends to the self-policing profession that is the US Army.
Although this data does not suggest causation into the true reasons for
Soldiers leaving the INNG, it does give leaders the ability to identify
Soldiers within their formations that are at risk for leaving the guard,
even if only using the variables of Months in Grade and
Months in Service. If this document is applied
appropriately, leaders can identify those at-risk Soldiers and work with
them to identify courses of action to keep them invested in the INNG.
Through personal anecdotal experience, the foundation of a Soldier’s
leaders is what causes them to get out or stay in:
Deep predictive analysis into the retention of Soldiers is a very broad and complicated endeavor, especially in searching to identify a predictive tool that can be used to shape retention programs. The long-term objective of this project is to identify, through quantitative and qualitative means, which factors affect an Indiana National Guard Soldier’s retention, and then to use the results of this predictive analysis to prescribe a method for maintaining Soldier Care and Soldier retention programs. Some of this data can be pulled from DPRO (such as with this project), which primarily covers each Soldier’s profile. However, additional data would be needed to cultivate a holistic approach in developing the feedback loop that serves as a true Soldier retention protocol. Some of this data includes, but is not limited to, command climate surveys conducted annually to assess leadership within a command echelon, surveys detailing personal revelations from Soldiers leaving the INNG, performance rates of the unit(s) as a whole, career progression tracking for each Soldier, and more.
In addition to quantitative data provided by DPRO used in this initial analysis, each Soldier is strongly encouraged to complete a standard paragraph-entry survey when exiting the Indiana National Guard. This lengthy survey requests that the Soldiers provide true reasons as to why they’re leaving the service. Unfortunately, this data is available only for Indiana, as other states have the freedom to implement surveys as they see fit, and the survey changes frequently, adding another layer of complication to its applicability. As it stands now, this survey is also not required of Soldiers as they exit the INNG. As a result, this survey obtains only about 10% participation, with some of the entries being unusable as they contain no information. Further textual analysis into the exit survey data available could lend to a justification to make the exit survey mandatory, as well as to develop a streamlined exit survey requirement for all 54 states and territories.
And finally, this project will be able to lend to the most important topic of all: the definition and implementation of Soldier Care. Developing this definition, while not a direct outcome of the project, will allow leaders across the National Guard to build and implement a program dedicated to the care of its Service Members by tying every fiber of action to the point of this definition. Further analysis of this problem set will solidify, using quantifiable analysis to support it, the intention behind caring for Soldiers.
**In Closing**: In an organization such as the Indiana National Guard, whose vision is focused on “putting our people first”, it is absolutely critical that every leader at every echelon applies this vision and this mission statement into Soldier Care.