On this page, we’ll look through some fictional datasets looking at costs across years at US Airforce bases.
data source: https://www.dropbox.com/sh/qgv90e945updkzq/AADpwjjWq-7u3IArV0rYl3-9a?dl=1 data source description: http://uc-r.github.io/data_wrangling/week-5-assignment-data
To run the scripts below, you will need tidyverse, dplyr, rio, ggplot2 and magrittr.
library(tidyverse)
library(dplyr)
library(rio)
library(ggplot2)
library(magrittr)
Here we import the bomber_wide.rds dataset, convert to a tibble with as_tibble() and gather the columns to make a vertical, tidy dataset:
#import the bomber_wide.rds dataset, convert to tibble and gather the years
b <- import("bomber_wide.rds")
b <- as_tibble(b)
b <- gather(b, c(3:21), key = "Year", value = "FH")
b
## # A tibble: 57 x 4
## Type MD Year FH
## <chr> <chr> <chr> <int>
## 1 Bomber B-1 1996 26914
## 2 Bomber B-2 1996 2364
## 3 Bomber B-52 1996 28511
## 4 Bomber B-1 1997 25219
## 5 Bomber B-2 1997 2776
## 6 Bomber B-52 1997 26034
## 7 Bomber B-1 1998 24205
## 8 Bomber B-2 1998 2166
## 9 Bomber B-52 1998 25639
## 10 Bomber B-1 1999 23306
## # ... with 47 more rows
Here we import the bomber_long.rds dataset, convert to a tibble and spread a long column depending on the Value column:
#import, tibble, spread to split "Output" using "Value" as the rows
b2 <- import("bomber_long.rds")
b2 <- as_tibble(b2)
b2 <- spread(b2, key = "Output", value = "Value")
b2
## # A tibble: 57 x 6
## Type MD FY Cost FH Gallons
## * <chr> <chr> <int> <int> <int> <int>
## 1 Bomber B-1 1996 72753781 26914 88594449
## 2 Bomber B-1 1997 71297263 25219 85484074
## 3 Bomber B-1 1998 84026805 24205 85259038
## 4 Bomber B-1 1999 71848336 23306 79323816
## 5 Bomber B-1 2000 58439777 25013 86230284
## 6 Bomber B-1 2001 94946077 25059 86892432
## 7 Bomber B-1 2002 96458536 26581 89198262
## 8 Bomber B-1 2003 68650070 21491 74485788
## 9 Bomber B-1 2004 101895634 28118 101397707
## 10 Bomber B-1 2005 124816690 21859 78410415
## # ... with 47 more rows
Here we import the bomber_combined.rds dataset, convert to a tibble and separate the AC column into Type and MD
#Question 3
#import, tibble, and separate column AC into "Type" and "MD"
b3 <- import("bomber_combined.rds")
b3 <- as_tibble(b3)
b3 <- b3 %>% separate(AC, into = c("Type", "MD"), sep = " ")
b3
## # A tibble: 57 x 6
## Type MD FY Cost FH Gallons
## * <chr> <chr> <int> <int> <int> <int>
## 1 Bomber B-1 1996 72753781 26914 88594449
## 2 Bomber B-1 1997 71297263 25219 85484074
## 3 Bomber B-1 1998 84026805 24205 85259038
## 4 Bomber B-1 1999 71848336 23306 79323816
## 5 Bomber B-1 2000 58439777 25013 86230284
## 6 Bomber B-1 2001 94946077 25059 86892432
## 7 Bomber B-1 2002 96458536 26581 89198262
## 8 Bomber B-1 2003 68650070 21491 74485788
## 9 Bomber B-1 2004 101895634 28118 101397707
## 10 Bomber B-1 2005 124816690 21859 78410415
## # ... with 47 more rows
Here we import the bomber_prefix.rds dataset, convert to a tibble. Then we use the head() function to look at the structure of the rows and the column names. After the head() function we use unite() to bring prefix and number together, separated by a “-”.
#Question 4
#import, tibble, check the columns to combine, and combine them with a "-" in between
b4 <- import("bomber_prefix.rds")
b4 <- as_tibble(b4)
head(b4)
## # A tibble: 6 x 6
## Type prefix number FY Output Value
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Bomber B 1 1996 FH 26914
## 2 Bomber B 1 1997 FH 25219
## 3 Bomber B 1 1998 FH 24205
## 4 Bomber B 1 1999 FH 23306
## 5 Bomber B 1 2000 FH 25013
## 6 Bomber B 1 2001 FH 25059
b4 <- b4 %>% unite(MD, prefix, number, sep = "-")
b4
## # A tibble: 171 x 5
## Type MD FY Output Value
## * <chr> <chr> <int> <chr> <int>
## 1 Bomber B-1 1996 FH 26914
## 2 Bomber B-1 1997 FH 25219
## 3 Bomber B-1 1998 FH 24205
## 4 Bomber B-1 1999 FH 23306
## 5 Bomber B-1 2000 FH 25013
## 6 Bomber B-1 2001 FH 25059
## 7 Bomber B-1 2002 FH 26581
## 8 Bomber B-1 2003 FH 21491
## 9 Bomber B-1 2004 FH 28118
## 10 Bomber B-1 2005 FH 21859
## # ... with 161 more rows
Here we’ll use the pipe function to import, convert to tibble, combine columns, separate other columns, spread two columns by Value one one command. We’ll then look at three different charts (Cost, Flying Hours or FH, Gallons) against the Fiscal Year or FY and compare the Mission Description, or MD. While a facet_grid() could compare MD, we’ll group by color and place on the same chart to better show the differences, potentially to the Professor’s chagrin.
The cost after 2004 has jumped for both B-1 and B-52, but B-1 had a shaper climb from 2005 to 2008 and 2009 to 2012. The cost fell off for B-1 from 2012.
Flying hours and gallons have trended down since 1996 for B-1 and B-52, strangely, as have gallons. A question out of this question that we would want to investigate is why cost is not related to Flying Hours or Gallons of gasoline.
#Question 5
#import, tibble, unite, separate, spread
b5 <- import("bomber_mess.rds") %>% as_tibble() %>% unite(MD, prefix, number, sep = "-") %>%
separate(Metric, into = c("FY", "number"), sep = "_") %>% spread(key = "number", value = "Value")
b5
## # A tibble: 57 x 6
## Type MD FY Cost FH Gallons
## * <chr> <chr> <chr> <int> <int> <int>
## 1 Bomber B-1 1996 72753781 26914 88594449
## 2 Bomber B-1 1997 71297263 25219 85484074
## 3 Bomber B-1 1998 84026805 24205 85259038
## 4 Bomber B-1 1999 71848336 23306 79323816
## 5 Bomber B-1 2000 58439777 25013 86230284
## 6 Bomber B-1 2001 94946077 25059 86892432
## 7 Bomber B-1 2002 96458536 26581 89198262
## 8 Bomber B-1 2003 68650070 21491 74485788
## 9 Bomber B-1 2004 101895634 28118 101397707
## 10 Bomber B-1 2005 124816690 21859 78410415
## # ... with 47 more rows
ggplot(data = b5) + geom_line(mapping = aes(x = FY, y = Cost, group = MD, colour = MD))
ggplot(data = b5) + geom_line(mapping = aes(x = FY, y = FH, group = MD, colour = MD))
ggplot(data = b5) + geom_line(mapping = aes(x = FY, y = Gallons, group = MD, colour = MD))
Here we will import a categorization file and dataset and convert them to tibbles.
We then join them together, filter by FY, base and system, group by system and sum the total cost and end strength for these groupings.
Missle and Aircraft missions at Minot AFB have similar end strength sums, but Aircraft missions are much more expensive.
#Question 6
prog <- as_tibble(import("ws_programmatics.rds"))
cate <- as_tibble(import("ws_categorizations.rds"))
head(prog)
## # A tibble: 6 x 18
## Base MD FY Manpower_Ops Manpower_Mx
## <chr> <chr> <int> <dbl> <dbl>
## 1 ALTUS AFB (OK) A-10 2007 NA NA
## 2 ALTUS AFB (OK) A-10 2008 NA NA
## 3 ALTUS AFB (OK) AT-38 1997 NA NA
## 4 ALTUS AFB (OK) AT-38 1998 NA NA
## 5 ALTUS AFB (OK) C-130 1998 NA NA
## 6 ALTUS AFB (OK) C-130 1999 NA NA
## # ... with 13 more variables: Manpower_Support_Staff <dbl>,
## # Operating_Material <dbl>, Mx_Consumables <dbl>, Mx_DLR <dbl>,
## # Mx_Depot_AC <dbl>, Mx_Depot_Missile <dbl>, Mx_Depot_Engine <dbl>,
## # CLS <dbl>, Total_O.S <dbl>, Avg_Inv <dbl>, TAI <dbl>,
## # End_Strength <dbl>, FH <dbl>
head(cate)
## # A tibble: 6 x 3
## Base System MD
## <chr> <chr> <chr>
## 1 ALTUS AFB (OK) AIRCRAFT A-10
## 2 ALTUS AFB (OK) AIRCRAFT AT-38
## 3 ALTUS AFB (OK) AIRCRAFT C-130
## 4 ALTUS AFB (OK) AIRCRAFT C-135
## 5 ALTUS AFB (OK) AIRCRAFT C-141
## 6 ALTUS AFB (OK) AIRCRAFT C-17
#As the name describes, ws_categorizations provides additional context (system) for the programmatics
#filter by year, military base, system, and then sum two columns by system:
new <- left_join(prog, cate, by = c("Base", "MD")) %>%
filter(FY == 2014, Base == "MINOT AFB (ND)", (System == "MISSILES" | System == "AIRCRAFT")) %>%
group_by(System) %>%
summarise(SumTotOS = sum(Total_O.S, na.rm=TRUE), SumEndStr = sum(End_Strength, na.rm = TRUE))
new
## # A tibble: 2 x 3
## System SumTotOS SumEndStr
## <chr> <dbl> <dbl>
## 1 AIRCRAFT 297398235 2023
## 2 MISSILES 112224909 1951
We join the files form Question 6 again, but this time we’ll make a new column that is the sum of total costs/sum of flying hours to give a total cost per flying hour (CPFH) for each base in the year 2014. We’ll then look at the base with the highest CPFH and chart the bases with a top 10 CPFH and order the bar chart by CPFH.
Patterson AFB in FL has the highest total costs.
#Question 7
#code to show base with top cost per flying hour
new2 <- left_join(prog, cate, by = c("Base", "MD")) %>%
filter(FY == 2014) %>%
group_by(Base) %>%
summarise(SumTotOS = sum(Total_O.S, na.rm = TRUE), SumFH = sum(FH, na.rm = TRUE)) %>%
filter(SumFH != 0) %>%
mutate(CPFH = SumTotOS/SumFH) %>%
arrange(desc(CPFH)) %>%
top_n(10,CPFH)
top_n(new2, 1, CPFH)
## # A tibble: 1 x 4
## Base SumTotOS SumFH CPFH
## <chr> <dbl> <dbl> <dbl>
## 1 PATRICK AFB (FL) 456348004 2550 178960
head(new2)
## # A tibble: 6 x 4
## Base SumTotOS SumFH CPFH
## <chr> <dbl> <dbl> <dbl>
## 1 PATRICK AFB (FL) 456348004 2550 178960.00
## 2 PETERSON AFB (CO) 830493384 5334 155698.05
## 3 TINKER AFB (OK) 3220722796 21356 150811.14
## 4 LANGLEY AFB (VA) 1727342213 11484 150412.94
## 5 SCOTT AFB (IL) 1444012422 9895 145933.54
## 6 HILL AFB (UT) 1119979581 11496 97423.42
#code to show bar chart of bases with a top 10 CPFH in dataset
ggplot(data = new2, aes(x = reorder(Base, -CPFH), y = CPFH)) + geom_bar(stat = "identity") +
theme(axis.text = element_text(size = 4)) +
xlab("Base") + ylab("CPFH ($/hour)") + ggtitle("Bases with top Cost per Flying Hour")
We use scatter plots in Q8 to visually assesss the relationship between End Strength and Total Costs.
First, let’s examine what’s happening over the Fiscal Years (FY). The relationship looks fairly similar across the years, although small multiples is probably not the best way to investigate this data.
Next we look at Wright-Patterson AFB as we have the most datapoints here to try and see what’s happening on a more granular level. Both the Others and Aircraft systems look interested to penetrate. There is an array of low ES/total cost Aircraft missions and a fairly linear relationship of higher ES/total cost ratios. It would be worth examining these in more detail. Additionally, the Other system has a pronounced cluster of expensive missions with high end strength. This cluster may be worth categorizing differently than the other Others which have lower ES or lower Cost.
Finally, we look at the Mission Details (MD). There are many types of missions, so we’ll look at the missions that have a form X-# where X is a letter. Later we can dig into the details on specfics behind the letter.
Missions of B/E/F have a lower ES to total cost ratio. U and B seem to have a non-linear relationship. As ES increases, total cost increases exponentially.
Missions of C type have a fairly linear relationship and maintain a high ES to Total cost ratio.
new3 <- left_join(prog, cate, by = c("Base", "MD"))
ggplot(data = new3, aes(x = End_Strength, y = Total_O.S)) + geom_point() + facet_wrap( ~ FY, ncol=4)
table(new3$Base) #Wright-Patterson AFB has the most datapoints. Let's look by system here.
##
## ALTUS AFB (OK) ANDREWS AFB (MD)
## 234 1078
## BARKSDALE AFB (LA) BEALE AFB (CA)
## 805 564
## BOLLING AFB (DC) BUCKLEY AFB (CO)
## 677 223
## CANNON AFB (NM) CHARLESTON AFB (SC)
## 371 345
## COLUMBUS AFB (MS) DAVIS-MONTHAN AFB (AZ)
## 231 840
## DOVER AFB (DE) DYESS AFB (TX)
## 247 576
## EDWARDS AFB (CA) EGLIN AFB (FL)
## 899 1144
## EIELSON AFB (AK) ELLSWORTH AFB (SD)
## 377 429
## ELMENDORF AFB (AK) FAIRCHILD AFB (WA)
## 732 384
## FRANCIS E WARREN AFB (WY) GOODFELLOW AFB (TX)
## 252 190
## GRAND FORKS AFB (ND) HANSCOM AFB (MA)
## 265 538
## HICKAM AFB (HI) HILL AFB (UT)
## 1045 1289
## HOLLOMAN AFB (NM) KEESLER AFB (MS)
## 515 449
## KIRTLAND AFB (NM) LACKLAND AFB (TX)
## 755 630
## LANGLEY AFB (VA) LAUGHLIN AFB (TX)
## 1595 202
## LITTLE ROCK AFB (AR) LUKE AFB (AZ)
## 483 466
## MACDILL AFB (FL) MALMSTROM AFB (MT)
## 590 196
## MAXWELL AFB (AL) MCCHORD AFB (WA)
## 701 435
## MCCONNELL AFB (KS) MCGUIRE AFB (NJ)
## 402 511
## MINOT AFB (ND) MOODY AFB (GA)
## 409 453
## MOUNTAIN HOME AFB (ID) NELLIS AFB (NV)
## 370 824
## OFFUTT AFB (NE) PATRICK AFB (FL)
## 974 514
## PETERSON AFB (CO) POPE (NC)
## 1042 481
## RANDOLPH AFB (TX) SCHRIEVER AFB (CO)
## 1048 394
## SCOTT AFB (IL) SEYMOUR JOHNSON AFB (NC)
## 1300 414
## SHAW AFB (SC) SHEPPARD AFB (TX)
## 626 270
## TINKER AFB (OK) TRAVIS AFB (CA)
## 1359 475
## TYNDALL AFB (FL) VANCE AFB (OK)
## 896 172
## VANDENBERG AFB (CA) WHITEMAN AFB (MO)
## 514 420
## WRIGHT-PATTERSON AFB (OH)
## 1708
new4 <- new3 %>% filter(Base == "WRIGHT-PATTERSON AFB (OH)")
ggplot(data = new4, aes(x = End_Strength, y = Total_O.S, group = System, colour = System)) + geom_point(shape=1)
#Let's examine mission types that being with one letter across bases
new5 <- new3 %>% separate(MD, into = c("Mission", "Details"), sep = "-")
new6 <- new5 %>% filter(nchar(Mission) == 1)
ggplot(data = new6, aes(x = End_Strength, y = Total_O.S, group = Mission, colour = Mission)) + geom_point()