Synopsis

This is the assignment report for week-5. I have answered the eight questions in this assignment. By working on week-5 homework I have learned various concepts of data cleaning, manipulation and relational data that can be done by the “tidyverse” package in R.

Packages Required

To complete this assignment and run the codes I have used the following packages:

library(tidyverse) # To use ggplot and dplyr#

Q1.

Import the bomber_wide.rds file, which lists the flying hours for each aircraft by year. Convert this data to a tibble and tidy it by changing it from a wide format to a long format so that you have the following columns: Type, MD, Year, & FH.

bomberwide<-readRDS("/Users/sakshilohana/Downloads/data/bomber_wide.rds")
bomberwide_tibble<-bomberwide %>% 
  gather(c(`1996`:`2014`), key="Year", value = "FH")
as_tibble(bomberwide_tibble)
## # A tibble: 57 × 4
##      Type    MD  Year    FH
##     <chr> <chr> <chr> <int>
## 1  Bomber   B-1  1996 26914
## 2  Bomber   B-2  1996  2364
## 3  Bomber  B-52  1996 28511
## 4  Bomber   B-1  1997 25219
## 5  Bomber   B-2  1997  2776
## 6  Bomber  B-52  1997 26034
## 7  Bomber   B-1  1998 24205
## 8  Bomber   B-2  1998  2166
## 9  Bomber  B-52  1998 25639
## 10 Bomber   B-1  1999 23306
## # ... with 47 more rows

Q2.

Import the bomber_long.rds data, which provides the value for three different outputs for each aircraft by year. The output measures include cost, flying hours, and gallons of gas consumed but these variables are “stacked” in the Output variable. Change this data to a tibble and convert to a wider format so that you have the following columns: Type, MD, FY, Cost, FH, & Gallons.

bomberlong<-readRDS("/Users/sakshilohana/Downloads/data/bomber_long.rds")
bomberlong_tibble<-bomberlong %>% 
  spread(key="Output", value = "Value")
as_tibble(bomberlong_tibble)
## # A tibble: 57 × 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <int>     <int> <int>     <int>
## 1  Bomber   B-1  1996  72753781 26914  88594449
## 2  Bomber   B-1  1997  71297263 25219  85484074
## 3  Bomber   B-1  1998  84026805 24205  85259038
## 4  Bomber   B-1  1999  71848336 23306  79323816
## 5  Bomber   B-1  2000  58439777 25013  86230284
## 6  Bomber   B-1  2001  94946077 25059  86892432
## 7  Bomber   B-1  2002  96458536 26581  89198262
## 8  Bomber   B-1  2003  68650070 21491  74485788
## 9  Bomber   B-1  2004 101895634 28118 101397707
## 10 Bomber   B-1  2005 124816690 21859  78410415
## # ... with 47 more rows

Q3.

Import the bomber_combined.rds file. Note that the first variable in this data (AC) combines the aircraft type (Bomber) and aircraft designator (i.e. B-1). This variable should be split into two. Take this data and convert it to a tibble and separate the AC variable into “Type” and “MD”.

bombercombined<-readRDS("/Users/sakshilohana/Downloads/data/bomber_combined.rds")
bombercombined_tibble<-bombercombined %>% 
  separate(AC, into = c("Type", "MD"), sep = " ")
as_tibble(bombercombined_tibble)
## # A tibble: 57 × 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <int>     <int> <int>     <int>
## 1  Bomber   B-1  1996  72753781 26914  88594449
## 2  Bomber   B-1  1997  71297263 25219  85484074
## 3  Bomber   B-1  1998  84026805 24205  85259038
## 4  Bomber   B-1  1999  71848336 23306  79323816
## 5  Bomber   B-1  2000  58439777 25013  86230284
## 6  Bomber   B-1  2001  94946077 25059  86892432
## 7  Bomber   B-1  2002  96458536 26581  89198262
## 8  Bomber   B-1  2003  68650070 21491  74485788
## 9  Bomber   B-1  2004 101895634 28118 101397707
## 10 Bomber   B-1  2005 124816690 21859  78410415
## # ... with 47 more rows

Q4.

Import the bomber_prefix.rds data. Take this data and convert it to a tibble and unite the prefix and number variables into an “MD” variable.

bomberprefix<-readRDS("/Users/sakshilohana/Downloads/data/bomber_prefix.rds")
bomberprefix_tibble<-bomberprefix %>% 
  unite(MD, prefix, number, sep = "") %>% 
  spread(key="Output", value = "Value")
as_tibble(bomberprefix_tibble)
## # A tibble: 57 × 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <int>     <int> <int>     <int>
## 1  Bomber    B1  1996  72753781 26914  88594449
## 2  Bomber    B1  1997  71297263 25219  85484074
## 3  Bomber    B1  1998  84026805 24205  85259038
## 4  Bomber    B1  1999  71848336 23306  79323816
## 5  Bomber    B1  2000  58439777 25013  86230284
## 6  Bomber    B1  2001  94946077 25059  86892432
## 7  Bomber    B1  2002  96458536 26581  89198262
## 8  Bomber    B1  2003  68650070 21491  74485788
## 9  Bomber    B1  2004 101895634 28118 101397707
## 10 Bomber    B1  2005 124816690 21859  78410415
## # ... with 47 more rows

Q5.

Import the bomber_mess.rds file so that it is a tibble. Clean this data up by making it contain the variables: Type, MD, FY, Cost, FH and Gallons.

bombermess<-readRDS("/Users/sakshilohana/Downloads/data/bomber_mess.rds")
bombermess_tibble_plot<-bombermess %>%
  unite(MD, prefix, number, sep = "") %>%  
  separate(Metric, into = c("FY", "Output"), sep = "_") %>% 
  ggplot(mapping = aes(x = FY , y = Value, group = MD, color = MD)) + 
  geom_line() + geom_point() +
  facet_wrap(~ Output, nrow = 3, scales = "free")
bombermess_tibble_plot

bombermess_tibble<-bombermess %>%
  unite(MD, prefix, number, sep = "") %>%  
  separate(Metric, into = c("FY", "Output"), sep = "_") %>% 
  spread(key="Output", value = "Value")

as_tibble(bombermess_tibble)
## # A tibble: 57 × 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <chr>     <int> <int>     <int>
## 1  Bomber    B1  1996  72753781 26914  88594449
## 2  Bomber    B1  1997  71297263 25219  85484074
## 3  Bomber    B1  1998  84026805 24205  85259038
## 4  Bomber    B1  1999  71848336 23306  79323816
## 5  Bomber    B1  2000  58439777 25013  86230284
## 6  Bomber    B1  2001  94946077 25059  86892432
## 7  Bomber    B1  2002  96458536 26581  89198262
## 8  Bomber    B1  2003  68650070 21491  74485788
## 9  Bomber    B1  2004 101895634 28118 101397707
## 10 Bomber    B1  2005 124816690 21859  78410415
## # ... with 47 more rows

Q6.

Import the wsprogrammatics.rds & wscategorization.rds data so that they are tibbles and perform the following steps in sequence using the pipe operator (%>%). Join the wscategorization data to the wsprogrammatics data.Filter for only FY 2014 data at the following Base: Minot AFB (ND).Filter for only Systems classified as “AIRCRAFT” or “MISSILES”.Group the data by System level.Calculate the total sum of the TotalO.S and EndStrength variables.

wsprog<-readRDS("/Users/sakshilohana/Downloads/data/ws_programmatics.rds")
wscat<-readRDS("/Users/sakshilohana/Downloads/data/ws_categorizations.rds")
wsjoin<-full_join(wsprog,wscat) %>% 
  filter(FY==2014, Base == "MINOT AFB (ND)") %>% 
  filter(System %in% c("MISSILES","AIRCRAFT")) %>% 
  group_by(System) %>% 
  summarise(TotalOS = sum(Total_O.S, na.rm = TRUE), 
            TotalEnd = sum(End_Strength, na.rm = TRUE)) %>%
  mutate(TotalSum = TotalOS + TotalEnd )
## Joining, by = c("Base", "MD")
wsjoin
## # A tibble: 2 × 4
##     System   TotalOS TotalEnd  TotalSum
##      <chr>     <dbl>    <dbl>     <dbl>
## 1 AIRCRAFT 297398235     2023 297400258
## 2 MISSILES 112224909     1951 112226860

Q7.

Once again, join the wsprogrammatics.rds & wscategorization.rds data; however, this time identify which Base had the largest cost per flying hour (which requires you to create a new variable) in 2014. Using a bar chart in ggplot2, plot these values for the top 10 bases with the largest cost per flying hour.

wsprog<-readRDS("/Users/sakshilohana/Downloads/data/ws_programmatics.rds")
wscat<-readRDS("/Users/sakshilohana/Downloads/data/ws_categorizations.rds")
wsjoin_tibble<-full_join(wsprog,wscat) %>% 
  filter(FY==2014) %>% 
  group_by(Base) %>% 
  mutate(CPFH = sum(Total_O.S, na.rm = TRUE)/sum(FH, na.rm = TRUE)) %>% 
  filter(CPFH != Inf) %>%
  summarise(avg_CPFH = mean(CPFH, na.rm = TRUE)) %>%
  filter(rank(desc(avg_CPFH))<=10) %>%
  arrange(desc(avg_CPFH))
## Joining, by = c("Base", "MD")
as_tibble(wsjoin_tibble)
## # A tibble: 10 × 2
##                         Base  avg_CPFH
##                        <chr>     <dbl>
## 1           PATRICK AFB (FL) 178960.00
## 2          PETERSON AFB (CO) 155698.05
## 3            TINKER AFB (OK) 150811.14
## 4           LANGLEY AFB (VA) 150412.94
## 5             SCOTT AFB (IL) 145933.54
## 6              HILL AFB (UT)  97423.42
## 7  WRIGHT-PATTERSON AFB (OH)  71068.01
## 8          LACKLAND AFB (TX)  60352.56
## 9                  POPE (NC)  57593.87
## 10        BARKSDALE AFB (LA)  52585.62
wsjoin<-full_join(wsprog,wscat) %>% 
  filter(FY==2014) %>% 
  group_by(Base) %>% 
  mutate(CPFH = sum(Total_O.S, na.rm = TRUE)/sum(FH, na.rm = TRUE)) %>% 
  filter(CPFH != Inf) %>%
  summarise(avg_CPFH = mean(CPFH, na.rm = TRUE)) %>%
  filter(rank(desc(avg_CPFH))<=10) %>%
  arrange(desc(avg_CPFH)) %>%
  ggplot(mapping = aes(x = Base , y = avg_CPFH)) + 
  geom_bar(stat = "identity", color="blue", fill= "turquoise") +
  ggtitle("Top 10 Bases") + 
  labs(x= "Base" , y = "Largest cost per flying hour") + 
  theme(axis.text.x = element_text(colour = "black", size = 5))
## Joining, by = c("Base", "MD")
wsjoin

Q8.

Using scatter plots in ggplot2, assess the relationship between the end strength (End_Strength) variable and total costs (Total_O.S). Provide three scatter plots that visually assesses this replationship from different angles (by FY, System, etc).

wsprog<-readRDS("/Users/sakshilohana/Downloads/data/ws_programmatics.rds")
wscat<-readRDS("/Users/sakshilohana/Downloads/data/ws_categorizations.rds")
wsjoin_fy<-full_join(wsprog,wscat) %>%
  ggplot() + 
  geom_point(mapping = aes(x = End_Strength, y = Total_O.S, group = FY, color = FY))
## Joining, by = c("Base", "MD")
wsjoin_fy

wsjoin_system<-full_join(wsprog,wscat) %>%
  ggplot() + 
  geom_point(mapping = aes(x = End_Strength, y = Total_O.S), color= "blue") +
  facet_wrap(~System, nrow = 3)
## Joining, by = c("Base", "MD")
wsjoin_system

wsjoin_fh<-full_join(wsprog,wscat) %>%
  ggplot() + 
  geom_point(mapping = aes(x = End_Strength, y = Total_O.S, group = FH, color = FH))
## Joining, by = c("Base", "MD")
wsjoin_fh