Synopsis

On this page, we’ll look through some fictional datasets looking at costs across years at US Airforce bases.

data source: https://www.dropbox.com/sh/qgv90e945updkzq/AADpwjjWq-7u3IArV0rYl3-9a?dl=1 data source description: http://uc-r.github.io/data_wrangling/week-5-assignment-data

Packages required

To run the scripts below, you will need tidyverse, dplyr, rio, ggplot2 and magrittr.

library(tidyverse)
library(dplyr)
library(rio)
library(ggplot2)
library(magrittr)

Question 1

Here we import the bomber_wide.rds dataset, convert to a tibble with as_tibble() and gather the columns to make a vertical, tidy dataset:

#import the bomber_wide.rds dataset, convert to tibble and gather the years
b <- import("bomber_wide.rds")
b <- as_tibble(b)
b <- gather(b, c(3:21), key = "Year", value = "FH")
b
## # A tibble: 57 x 4
##      Type    MD  Year    FH
##     <chr> <chr> <chr> <int>
## 1  Bomber   B-1  1996 26914
## 2  Bomber   B-2  1996  2364
## 3  Bomber  B-52  1996 28511
## 4  Bomber   B-1  1997 25219
## 5  Bomber   B-2  1997  2776
## 6  Bomber  B-52  1997 26034
## 7  Bomber   B-1  1998 24205
## 8  Bomber   B-2  1998  2166
## 9  Bomber  B-52  1998 25639
## 10 Bomber   B-1  1999 23306
## # ... with 47 more rows

Question 2

Here we import the bomber_long.rds dataset, convert to a tibble and spread a long column depending on the Value column:

#import, tibble, spread to split "Output" using "Value" as the rows
b2 <- import("bomber_long.rds")
b2 <- as_tibble(b2)
b2 <- spread(b2, key = "Output", value = "Value")
b2
## # A tibble: 57 x 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <int>     <int> <int>     <int>
## 1  Bomber   B-1  1996  72753781 26914  88594449
## 2  Bomber   B-1  1997  71297263 25219  85484074
## 3  Bomber   B-1  1998  84026805 24205  85259038
## 4  Bomber   B-1  1999  71848336 23306  79323816
## 5  Bomber   B-1  2000  58439777 25013  86230284
## 6  Bomber   B-1  2001  94946077 25059  86892432
## 7  Bomber   B-1  2002  96458536 26581  89198262
## 8  Bomber   B-1  2003  68650070 21491  74485788
## 9  Bomber   B-1  2004 101895634 28118 101397707
## 10 Bomber   B-1  2005 124816690 21859  78410415
## # ... with 47 more rows

Question 3

Here we import the bomber_combined.rds dataset, convert to a tibble and separate the AC column into Type and MD

#Question 3
#import, tibble, and separate column AC into "Type" and "MD"
b3 <- import("bomber_combined.rds")
b3 <- as_tibble(b3)
b3 <- b3 %>% separate(AC, into = c("Type", "MD"), sep = " ")
b3
## # A tibble: 57 x 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <int>     <int> <int>     <int>
## 1  Bomber   B-1  1996  72753781 26914  88594449
## 2  Bomber   B-1  1997  71297263 25219  85484074
## 3  Bomber   B-1  1998  84026805 24205  85259038
## 4  Bomber   B-1  1999  71848336 23306  79323816
## 5  Bomber   B-1  2000  58439777 25013  86230284
## 6  Bomber   B-1  2001  94946077 25059  86892432
## 7  Bomber   B-1  2002  96458536 26581  89198262
## 8  Bomber   B-1  2003  68650070 21491  74485788
## 9  Bomber   B-1  2004 101895634 28118 101397707
## 10 Bomber   B-1  2005 124816690 21859  78410415
## # ... with 47 more rows

Question 4

Here we import the bomber_prefix.rds dataset, convert to a tibble. Then we use the head() function to look at the structure of the rows and the column names. After the head() function we use unite() to bring prefix and number together, separated by a “-”.

#Question 4
#import, tibble, check the columns to combine, and combine them with a "-" in between
b4 <- import("bomber_prefix.rds")
b4 <- as_tibble(b4)
head(b4)
## # A tibble: 6 x 6
##     Type prefix number    FY Output Value
##    <chr>  <chr>  <chr> <int>  <chr> <int>
## 1 Bomber      B      1  1996     FH 26914
## 2 Bomber      B      1  1997     FH 25219
## 3 Bomber      B      1  1998     FH 24205
## 4 Bomber      B      1  1999     FH 23306
## 5 Bomber      B      1  2000     FH 25013
## 6 Bomber      B      1  2001     FH 25059
b4 <- b4 %>% unite(MD, prefix, number, sep = "-")
b4
## # A tibble: 171 x 5
##      Type    MD    FY Output Value
## *   <chr> <chr> <int>  <chr> <int>
## 1  Bomber   B-1  1996     FH 26914
## 2  Bomber   B-1  1997     FH 25219
## 3  Bomber   B-1  1998     FH 24205
## 4  Bomber   B-1  1999     FH 23306
## 5  Bomber   B-1  2000     FH 25013
## 6  Bomber   B-1  2001     FH 25059
## 7  Bomber   B-1  2002     FH 26581
## 8  Bomber   B-1  2003     FH 21491
## 9  Bomber   B-1  2004     FH 28118
## 10 Bomber   B-1  2005     FH 21859
## # ... with 161 more rows

Question 5

Here we’ll use the pipe function to import, convert to tibble, combine columns, separate other columns, spread two columns by Value one one command. We’ll then look at three different charts (Cost, Flying Hours or FH, Gallons) against the Fiscal Year or FY and compare the Mission Description, or MD. While a facet_grid() could compare MD, we’ll group by color and place on the same chart to better show the differences, potentially to the Professor’s chagrin.

The cost after 2004 has jumped for both B-1 and B-52, but B-1 had a shaper climb from 2005 to 2008 and 2009 to 2012. The cost fell off for B-1 from 2012.

Flying hours and gallons have trended down since 1996 for B-1 and B-52, strangely, as have gallons. A question out of this question that we would want to investigate is why cost is not related to Flying Hours or Gallons of gasoline.

#Question 5
#import, tibble, unite, separate, spread
b5 <- import("bomber_mess.rds") %>% as_tibble() %>% unite(MD, prefix, number, sep = "-") %>%
  separate(Metric, into = c("FY", "number"), sep = "_") %>% spread(key = "number", value = "Value")
b5
## # A tibble: 57 x 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <chr>     <int> <int>     <int>
## 1  Bomber   B-1  1996  72753781 26914  88594449
## 2  Bomber   B-1  1997  71297263 25219  85484074
## 3  Bomber   B-1  1998  84026805 24205  85259038
## 4  Bomber   B-1  1999  71848336 23306  79323816
## 5  Bomber   B-1  2000  58439777 25013  86230284
## 6  Bomber   B-1  2001  94946077 25059  86892432
## 7  Bomber   B-1  2002  96458536 26581  89198262
## 8  Bomber   B-1  2003  68650070 21491  74485788
## 9  Bomber   B-1  2004 101895634 28118 101397707
## 10 Bomber   B-1  2005 124816690 21859  78410415
## # ... with 47 more rows
ggplot(data = b5) + geom_line(mapping = aes(x = FY, y = Cost, group = MD, colour = MD))

ggplot(data = b5) + geom_line(mapping = aes(x = FY, y = FH, group = MD, colour = MD))

ggplot(data = b5) + geom_line(mapping = aes(x = FY, y = Gallons, group = MD, colour = MD))

Question 6

Here we will import a categorization file and dataset and convert them to tibbles.

We then join them together, filter by FY, base and system, group by system and sum the total cost and end strength for these groupings.

Missle and Aircraft missions at Minot AFB have similar end strength sums, but Aircraft missions are much more expensive.

#Question 6
prog <- as_tibble(import("ws_programmatics.rds"))
cate <- as_tibble(import("ws_categorizations.rds"))
head(prog)
## # A tibble: 6 x 18
##             Base    MD    FY Manpower_Ops Manpower_Mx
##            <chr> <chr> <int>        <dbl>       <dbl>
## 1 ALTUS AFB (OK)  A-10  2007           NA          NA
## 2 ALTUS AFB (OK)  A-10  2008           NA          NA
## 3 ALTUS AFB (OK) AT-38  1997           NA          NA
## 4 ALTUS AFB (OK) AT-38  1998           NA          NA
## 5 ALTUS AFB (OK) C-130  1998           NA          NA
## 6 ALTUS AFB (OK) C-130  1999           NA          NA
## # ... with 13 more variables: Manpower_Support_Staff <dbl>,
## #   Operating_Material <dbl>, Mx_Consumables <dbl>, Mx_DLR <dbl>,
## #   Mx_Depot_AC <dbl>, Mx_Depot_Missile <dbl>, Mx_Depot_Engine <dbl>,
## #   CLS <dbl>, Total_O.S <dbl>, Avg_Inv <dbl>, TAI <dbl>,
## #   End_Strength <dbl>, FH <dbl>
head(cate)
## # A tibble: 6 x 3
##             Base   System    MD
##            <chr>    <chr> <chr>
## 1 ALTUS AFB (OK) AIRCRAFT  A-10
## 2 ALTUS AFB (OK) AIRCRAFT AT-38
## 3 ALTUS AFB (OK) AIRCRAFT C-130
## 4 ALTUS AFB (OK) AIRCRAFT C-135
## 5 ALTUS AFB (OK) AIRCRAFT C-141
## 6 ALTUS AFB (OK) AIRCRAFT  C-17
#As the name describes, ws_categorizations provides additional context (system) for the programmatics
#filter by year, military base, system, and then sum two columns by system:
new <- left_join(prog, cate, by = c("Base", "MD")) %>%
  filter(FY == 2014, Base == "MINOT AFB (ND)", (System == "MISSILES" | System == "AIRCRAFT")) %>% 
  group_by(System) %>%
  summarise(SumTotOS = sum(Total_O.S, na.rm=TRUE), SumEndStr = sum(End_Strength, na.rm = TRUE))
new
## # A tibble: 2 x 3
##     System  SumTotOS SumEndStr
##      <chr>     <dbl>     <dbl>
## 1 AIRCRAFT 297398235      2023
## 2 MISSILES 112224909      1951

Question 7

We join the files form Question 6 again, but this time we’ll make a new column that is the sum of total costs/sum of flying hours to give a total cost per flying hour (CPFH) for each base in the year 2014. We’ll then look at the base with the highest CPFH and chart the bases with a top 10 CPFH and order the bar chart by CPFH.

Patterson AFB in FL has the highest total costs.

#Question 7
#code to show base with top cost per flying hour
new2 <- left_join(prog, cate, by = c("Base", "MD")) %>%
  filter(FY == 2014) %>%
  group_by(Base) %>% 
  summarise(SumTotOS = sum(Total_O.S, na.rm = TRUE), SumFH = sum(FH, na.rm = TRUE)) %>% 
  filter(SumFH != 0) %>% 
  mutate(CPFH = SumTotOS/SumFH) %>% 
  arrange(desc(CPFH)) %>% 
  top_n(10,CPFH)
  
top_n(new2, 1, CPFH)
## # A tibble: 1 x 4
##               Base  SumTotOS SumFH   CPFH
##              <chr>     <dbl> <dbl>  <dbl>
## 1 PATRICK AFB (FL) 456348004  2550 178960
head(new2)
## # A tibble: 6 x 4
##                Base   SumTotOS SumFH      CPFH
##               <chr>      <dbl> <dbl>     <dbl>
## 1  PATRICK AFB (FL)  456348004  2550 178960.00
## 2 PETERSON AFB (CO)  830493384  5334 155698.05
## 3   TINKER AFB (OK) 3220722796 21356 150811.14
## 4  LANGLEY AFB (VA) 1727342213 11484 150412.94
## 5    SCOTT AFB (IL) 1444012422  9895 145933.54
## 6     HILL AFB (UT) 1119979581 11496  97423.42
#code to show bar chart of bases with a top 10 CPFH in dataset
ggplot(data = new2, aes(x = reorder(Base, -CPFH), y = CPFH)) + geom_bar(stat = "identity") +
    theme(axis.text = element_text(size = 4)) +
    xlab("Base") + ylab("CPFH ($/hour)") + ggtitle("Bases with top Cost per Flying Hour")

Question 8

We use scatter plots in Q8 to visually assesss the relationship between End Strength and Total Costs.

First, let’s examine what’s happening over the Fiscal Years (FY). The relationship looks fairly similar across the years, although small multiples is probably not the best way to investigate this data.

Next we look at Wright-Patterson AFB as we have the most datapoints here to try and see what’s happening on a more granular level. Both the Others and Aircraft systems look interested to penetrate. There is an array of low ES/total cost Aircraft missions and a fairly linear relationship of higher ES/total cost ratios. It would be worth examining these in more detail. Additionally, the Other system has a pronounced cluster of expensive missions with high end strength. This cluster may be worth categorizing differently than the other Others which have lower ES or lower Cost.

Finally, we look at the Mission Details (MD). There are many types of missions, so we’ll look at the missions that have a form X-# where X is a letter. Later we can dig into the details on specfics behind the letter.

Missions of B/E/F have a lower ES to total cost ratio. U and B seem to have a non-linear relationship. As ES increases, total cost increases exponentially.

Missions of C type have a fairly linear relationship and maintain a high ES to Total cost ratio.

new3 <- left_join(prog, cate, by = c("Base", "MD"))

ggplot(data = new3, aes(x = End_Strength, y = Total_O.S)) + geom_point() + facet_wrap( ~ FY, ncol=4)

table(new3$Base) #Wright-Patterson AFB has the most datapoints. Let's look by system here.
## 
##            ALTUS AFB (OK)          ANDREWS AFB (MD) 
##                       234                      1078 
##        BARKSDALE AFB (LA)            BEALE AFB (CA) 
##                       805                       564 
##          BOLLING AFB (DC)          BUCKLEY AFB (CO) 
##                       677                       223 
##           CANNON AFB (NM)       CHARLESTON AFB (SC) 
##                       371                       345 
##         COLUMBUS AFB (MS)    DAVIS-MONTHAN AFB (AZ) 
##                       231                       840 
##            DOVER AFB (DE)            DYESS AFB (TX) 
##                       247                       576 
##          EDWARDS AFB (CA)            EGLIN AFB (FL) 
##                       899                      1144 
##          EIELSON AFB (AK)        ELLSWORTH AFB (SD) 
##                       377                       429 
##        ELMENDORF AFB (AK)        FAIRCHILD AFB (WA) 
##                       732                       384 
## FRANCIS E WARREN AFB (WY)       GOODFELLOW AFB (TX) 
##                       252                       190 
##      GRAND FORKS AFB (ND)          HANSCOM AFB (MA) 
##                       265                       538 
##           HICKAM AFB (HI)             HILL AFB (UT) 
##                      1045                      1289 
##         HOLLOMAN AFB (NM)          KEESLER AFB (MS) 
##                       515                       449 
##         KIRTLAND AFB (NM)         LACKLAND AFB (TX) 
##                       755                       630 
##          LANGLEY AFB (VA)         LAUGHLIN AFB (TX) 
##                      1595                       202 
##      LITTLE ROCK AFB (AR)             LUKE AFB (AZ) 
##                       483                       466 
##          MACDILL AFB (FL)        MALMSTROM AFB (MT) 
##                       590                       196 
##          MAXWELL AFB (AL)          MCCHORD AFB (WA) 
##                       701                       435 
##        MCCONNELL AFB (KS)          MCGUIRE AFB (NJ) 
##                       402                       511 
##            MINOT AFB (ND)            MOODY AFB (GA) 
##                       409                       453 
##    MOUNTAIN HOME AFB (ID)           NELLIS AFB (NV) 
##                       370                       824 
##           OFFUTT AFB (NE)          PATRICK AFB (FL) 
##                       974                       514 
##         PETERSON AFB (CO)                 POPE (NC) 
##                      1042                       481 
##         RANDOLPH AFB (TX)        SCHRIEVER AFB (CO) 
##                      1048                       394 
##            SCOTT AFB (IL)  SEYMOUR JOHNSON AFB (NC) 
##                      1300                       414 
##             SHAW AFB (SC)         SHEPPARD AFB (TX) 
##                       626                       270 
##           TINKER AFB (OK)           TRAVIS AFB (CA) 
##                      1359                       475 
##          TYNDALL AFB (FL)            VANCE AFB (OK) 
##                       896                       172 
##       VANDENBERG AFB (CA)         WHITEMAN AFB (MO) 
##                       514                       420 
## WRIGHT-PATTERSON AFB (OH) 
##                      1708
new4 <- new3 %>% filter(Base == "WRIGHT-PATTERSON AFB (OH)")
ggplot(data = new4, aes(x = End_Strength, y = Total_O.S, group = System, colour = System)) + geom_point(shape=1)

#Let's examine mission types that being with one letter across bases
new5 <- new3 %>% separate(MD, into = c("Mission", "Details"), sep = "-")
new6 <- new5 %>% filter(nchar(Mission) == 1)
ggplot(data = new6, aes(x = End_Strength, y = Total_O.S, group = Mission, colour = Mission)) + geom_point()