DACSS 601 – Submission 2
First, we will import a cleaned dataset that covers counts and demographic data for the Air Force.
library(tidyverse)
library(readxl)
setwd("D:/Academic/UMass/DACSS_601/Datasets")
airforce <- read_excel("airforce_cleaned.xlsx")
head(airforce)
# A tibble: 6 x 18
  enlisted pay_grade `single withoutchildren~ `single withoutchildren~
  <chr>        <dbl>                    <dbl>                    <dbl>
1 E                1                     7721                     1550
2 E                2                     4380                     1010
3 E                3                    29725                     7108
4 E                4                    20805                     4756
5 E                5                    14623                     4104
6 E                6                     3660                     1377
# ... with 14 more variables: single withoutchildren total <dbl>,
# single withchildren male <dbl>, single withchildren female <dbl>,
# single withchildren total <dbl>, married jointservice male <dbl>,
# married jointservice female <dbl>,
# married jointservice total <dbl>, married civilian female <dbl>,
# married civilian male <dbl>, married civilian total <dbl>,
# married male total <dbl>, married female total <dbl>,
# married total total <dbl>, branch <chr>
The dataset is already a tibble, so no conversion is needed. Note that head() prints only the first 6 rows; the full dataset is 19 rows by 18 columns, as the str() output below confirms. Of the 18 columns, 2 are character (enlisted and branch) and the remaining 16 are dbl (double-precision floating-point numbers, R’s default numeric type).
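As a quick check of shape and column types, here is a minimal sketch. The tibble `demo` is a hypothetical three-column stand-in (not the real spreadsheet), sized to mirror the 19 rows of the actual data:

```r
library(tibble)

# Hypothetical stand-in for the real data: 19 rows, 3 of the 18 columns.
demo <- tibble(
  enlisted  = c(rep("E", 9), rep("O", 10)),
  pay_grade = as.numeric(c(1:9, 1:10)),
  branch    = "AirForce"
)

nrow(head(demo))            # head() keeps only the first 6 rows
dim(demo)                   # full dimensions: rows, then columns
table(sapply(demo, class))  # number of columns of each type
```

Running the same three calls on `airforce` would show the real dimensions and type counts.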
To see every column at once, str() lists each column on its own line, with its type and first few values:
str(airforce)
tbl_df [19 x 18] (S3: tbl_df/tbl/data.frame)
$ enlisted : chr [1:19] "E" "E" "E" "E" ...
$ pay_grade : num [1:19] 1 2 3 4 5 6 7 8 9 1 ...
$ single withoutchildren male : num [1:19] 7721 4380 29725 20805 14623 ...
$ single withoutchildren female: num [1:19] 1550 1010 7108 4756 4104 ...
$ single withoutchildren total : num [1:19] 9271 5390 36833 25561 18727 ...
$ single withchildren male : num [1:19] 27 33 396 987 2755 ...
$ single withchildren female : num [1:19] 5 9 266 842 2171 ...
$ single withchildren total : num [1:19] 32 42 662 1829 4926 ...
$ married jointservice male : num [1:19] 49 97 1258 3036 6154 ...
$ married jointservice female : num [1:19] 27 105 1687 3207 5519 ...
$ married jointservice total : num [1:19] 76 202 2945 6243 11673 ...
$ married civilian female : num [1:19] 1064 802 10436 15363 31711 ...
$ married civilian male : num [1:19] 178 163 1631 1769 2889 ...
$ married civilian total : num [1:19] 1242 965 12067 17132 34600 ...
$ married male total : num [1:19] 8861 5312 41815 40191 55243 ...
$ married female total : num [1:19] 1760 1287 10692 10574 14683 ...
$ married total total : num [1:19] 10621 6599 52507 50765 69926 ...
$ branch : chr [1:19] "AirForce" "AirForce" "AirForce" "AirForce" ...
Okay, let’s dig a little deeper into some of these columns.
First, we’re going to review subsets of the dataset. We’ll start with a simple cross-tabulation of two columns, one character (enlisted) and one double (pay_grade).
        pay_grade
enlisted 1 2 3 4 5 6 7 8 9 10
       E 1 1 1 1 1 1 1 1 1  0
       O 1 1 1 1 1 1 1 1 1  1
The “enlisted” column takes two values: E (Enlisted) and O (Officer). The “pay_grade” column takes the values 1 through 10. What makes this dataset tricky is that the pay grade numbers repeat across E and O, so a pay grade only identifies a rank when combined with the enlisted column. The one exception is grade 10, which exists only for officers (there is no E10). That gives E1-E9 plus O1-O10, which adds up to 9 + 10 = 19 observations (rows).
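The code behind the table above is not shown in this write-up. A minimal sketch of how it could be reproduced, assuming `airforce_selected` (the name used in the filter step) holds just these two columns. The tibble `airforce_demo` is a hypothetical stand-in rebuilt from the E1-E9 / O1-O10 layout described above; with the real data, `airforce` would take its place:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the two columns of interest (E1-E9, O1-O10).
airforce_demo <- tibble(
  enlisted  = c(rep("E", 9), rep("O", 10)),
  pay_grade = as.numeric(c(1:9, 1:10))
)

# Assumption: airforce_selected was built by selecting these two columns.
airforce_selected <- select(airforce_demo, enlisted, pay_grade)

# Cross-tabulate: one row per enlisted value, one column per pay grade.
tab <- table(airforce_selected)
tab
```

Each cell counts the rows with that combination, so the single 0 marks the missing E10.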
Second, we’re going to keep only the Officer rows with a pay grade above 5, which drops all Enlisted rows and the bottom half of the Officer pay grades (O1-O5).
filter(airforce_selected, enlisted == "O", pay_grade > 5)
# A tibble: 5 x 2
  enlisted pay_grade
  <chr>        <dbl>
1 O                6
2 O                7
3 O                8
4 O                9
5 O               10
Now we have the five highest-ranking pay grades in the Air Force, O6 through O10.
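In filter(), conditions separated by commas are combined with a logical AND. A small sketch of that equivalence, using a hypothetical stand-in tibble `ranks_demo` in place of the real data:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in with the same enlisted / pay_grade layout.
ranks_demo <- tibble(
  enlisted  = c(rep("E", 9), rep("O", 10)),
  pay_grade = as.numeric(c(1:9, 1:10))
)

# Comma-separated conditions ...
with_comma <- filter(ranks_demo, enlisted == "O", pay_grade > 5)
# ... select the same rows as a single condition joined with &.
with_and <- filter(ranks_demo, enlisted == "O" & pay_grade > 5)

all.equal(with_comma, with_and)
with_comma$pay_grade
```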
For future exploration, we can bring in the “total” columns, such as Single With vs. Without Children, Civilian Male vs. Female, and Married Male vs. Female. We could then compare these across the two primary variables, enlisted and pay_grade, to see how Enlisted and Officers differ and how gender composition, marital status, and the share of members with children vary by rank. For now, we’ll keep things simple, but this gives me a foundation if I use this dataset for future work.
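As one possible starting point for that comparison, here is a minimal sketch that computes the share of married members by pay grade. It uses only the first five rows (E1-E5) of four “total” columns, copied from the str() output above; the shortened column names and the tibble `totals_demo` are hypothetical:

```r
library(dplyr)
library(tibble)

# E1-E5 values copied from the str() output; column names shortened (hypothetical).
totals_demo <- tibble(
  enlisted         = "E",
  pay_grade        = 1:5,
  single_nochild   = c(9271, 5390, 36833, 25561, 18727),  # single withoutchildren total
  single_child     = c(32, 42, 662, 1829, 4926),          # single withchildren total
  married_joint    = c(76, 202, 2945, 6243, 11673),       # married jointservice total
  married_civilian = c(1242, 965, 12067, 17132, 34600)    # married civilian total
)

marriage_share <- totals_demo %>%
  mutate(
    # For these five rows the sum matches the `married total total` column.
    all_members   = single_nochild + single_child + married_joint + married_civilian,
    share_married = (married_joint + married_civilian) / all_members
  ) %>%
  select(enlisted, pay_grade, all_members, share_married)

print(marriage_share)
```

With the full dataset, the same mutate() step could be followed by group_by(enlisted) to compare Enlisted and Officers directly.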