Synopsis

This is my homework report for week 5, produced with R Markdown.

In this homework I’d be working on styling the data by following these steps:

  • Understand the basics of dealing with data frames

  • Comprehend a new concept known as “tibbles,” which makes working with data frames even easier

  • Learn how to organize data in a tidy way

  • Manage relational, but separate, data frames

Packages Required

For this homework assignment, I used the following packages:

library(ggplot2)        # for creating graphs
library(dplyr)          # for performing data transformation and manipulation tasks.
library(knitr)          # for kniting r code to html files
library(tidyr)          # for tidying the data set
library(magrittr)       # for chaining commands with pipe operator, %>%.
library(tidyverse)      # for tibbles, a modern re-imagining of data frames

Source Code

The data sets used are :

  • bomber_combined.rds

  • bomber_long.rds

  • bomber_mess.rds

  • bomber_prefix.rds

  • bomber_wide.rds

  • ws_categorizations.rds

  • ws_programmatics.rds

A short description of each data set is available at: (http://uc-r.github.io/data_wrangling/week-5-assignment-data)

Tibbles, Tidy Data and Relational Data

Problem 1

Import the bomber_wide.rds file, which lists the flying hours for each aircraft by year. Convert this data to a tibble and tidy it by changing it from a wide format to a long format so that you have the following columns: Type, MD, Year, & FH.

readRDS("C:/Users/Jasmine/Desktop/Study/DATA WRANGLING/Week 5/data/bomber_wide.rds")%>%
  as_tibble() %>%
  gather(Year, FH, `1996`:`2014`)
## # A tibble: 57 × 4
##      Type    MD  Year    FH
##     <chr> <chr> <chr> <int>
## 1  Bomber   B-1  1996 26914
## 2  Bomber   B-2  1996  2364
## 3  Bomber  B-52  1996 28511
## 4  Bomber   B-1  1997 25219
## 5  Bomber   B-2  1997  2776
## 6  Bomber  B-52  1997 26034
## 7  Bomber   B-1  1998 24205
## 8  Bomber   B-2  1998  2166
## 9  Bomber  B-52  1998 25639
## 10 Bomber   B-1  1999 23306
## # ... with 47 more rows

Problem 2

Import the bomber_long.rds data, which provides the value for three different outputs for each aircraft by year. The output measures include cost, flying hours, and gallons of gas consumed but these variables are “stacked” in the Output variable. Change this data to a tibble and convert to a wider format so that you have the following columns: Type, MD, FY, Cost, FH, & Gallons.

readRDS("C:/Users/Jasmine/Desktop/Study/DATA WRANGLING/Week 5/data/bomber_long.rds")%>%
  as_tibble() %>%
  spread(key = "Output", value = "Value")
## # A tibble: 57 × 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <int>     <int> <int>     <int>
## 1  Bomber   B-1  1996  72753781 26914  88594449
## 2  Bomber   B-1  1997  71297263 25219  85484074
## 3  Bomber   B-1  1998  84026805 24205  85259038
## 4  Bomber   B-1  1999  71848336 23306  79323816
## 5  Bomber   B-1  2000  58439777 25013  86230284
## 6  Bomber   B-1  2001  94946077 25059  86892432
## 7  Bomber   B-1  2002  96458536 26581  89198262
## 8  Bomber   B-1  2003  68650070 21491  74485788
## 9  Bomber   B-1  2004 101895634 28118 101397707
## 10 Bomber   B-1  2005 124816690 21859  78410415
## # ... with 47 more rows

Problem 3

Import the bomber_combined.rds file. Note that the first variable in this data (AC) combines the aircraft type (Bomber) and aircraft designator (i.e. B-1). This variable should be split into two. Take this data and convert it to a tibble and separate the AC variable into “Type” and “MD”

readRDS("C:/Users/Jasmine/Desktop/Study/DATA WRANGLING/Week 5/data/bomber_combined.rds")%>%
  as_tibble() %>%
  separate(AC, into = c("Type", "MD"), sep = " ")
## # A tibble: 57 × 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <int>     <int> <int>     <int>
## 1  Bomber   B-1  1996  72753781 26914  88594449
## 2  Bomber   B-1  1997  71297263 25219  85484074
## 3  Bomber   B-1  1998  84026805 24205  85259038
## 4  Bomber   B-1  1999  71848336 23306  79323816
## 5  Bomber   B-1  2000  58439777 25013  86230284
## 6  Bomber   B-1  2001  94946077 25059  86892432
## 7  Bomber   B-1  2002  96458536 26581  89198262
## 8  Bomber   B-1  2003  68650070 21491  74485788
## 9  Bomber   B-1  2004 101895634 28118 101397707
## 10 Bomber   B-1  2005 124816690 21859  78410415
## # ... with 47 more rows

Problem 4

Import the bomber_prefix.rds data. Take this data and convert it to a tibble and unite the prefix and number variables into an “MD” variable so that the data matches the tidy data sets you produced in problems #2 and #3.

readRDS("C:/Users/Jasmine/Desktop/Study/DATA WRANGLING/Week 5/data/bomber_prefix.rds") %>%
  as_tibble() %>%
  unite('MD', prefix, number, sep = '-')
## # A tibble: 171 × 5
##      Type    MD    FY Output Value
## *   <chr> <chr> <int>  <chr> <int>
## 1  Bomber   B-1  1996     FH 26914
## 2  Bomber   B-1  1997     FH 25219
## 3  Bomber   B-1  1998     FH 24205
## 4  Bomber   B-1  1999     FH 23306
## 5  Bomber   B-1  2000     FH 25013
## 6  Bomber   B-1  2001     FH 25059
## 7  Bomber   B-1  2002     FH 26581
## 8  Bomber   B-1  2003     FH 21491
## 9  Bomber   B-1  2004     FH 28118
## 10 Bomber   B-1  2005     FH 21859
## # ... with 161 more rows

Problem 5

Import the bomber_mess.rds file so that it is a tibble. Clean this data up. Plot the historical trends of this data in ggplot2 with a line chart such that the plot is facetted by the Cost, FH, and Gallons variables and each facet compares the different MDs (“B-1”, “B-2”, “B-52”).

readRDS("C:/Users/Jasmine/Desktop/Study/DATA WRANGLING/Week 5/data/bomber_mess.rds")%>%
  as_tibble() %>%
  unite('MD', prefix, number, sep = '-') %>%
  separate(col = Metric, into = c('FY', 'Metric'), sep = '_' ) %T>% 
  {print(ggplot(data = ., mapping = aes(x = as.Date(paste0(FY, '-01-01')),
                                        y = Value,
                                        group = MD,
                                        color = MD)) +
           geom_line() +
           scale_x_date(date_labels = '%Y' ) +
           facet_wrap(~ Metric, scale = "free", ncol = 1) +
           labs(x = "Year", y = "log(Value)") +
           ggtitle("Historical Trends") 
  )} %>%
  spread(key = 'Metric', value = "Value")

## # A tibble: 57 × 6
##      Type    MD    FY      Cost    FH   Gallons
## *   <chr> <chr> <chr>     <int> <int>     <int>
## 1  Bomber   B-1  1996  72753781 26914  88594449
## 2  Bomber   B-1  1997  71297263 25219  85484074
## 3  Bomber   B-1  1998  84026805 24205  85259038
## 4  Bomber   B-1  1999  71848336 23306  79323816
## 5  Bomber   B-1  2000  58439777 25013  86230284
## 6  Bomber   B-1  2001  94946077 25059  86892432
## 7  Bomber   B-1  2002  96458536 26581  89198262
## 8  Bomber   B-1  2003  68650070 21491  74485788
## 9  Bomber   B-1  2004 101895634 28118 101397707
## 10 Bomber   B-1  2005 124816690 21859  78410415
## # ... with 47 more rows

Problem 6

Import the ws_programmatics.rds & ws_categorization.rds data so that they are tibbles and perform the following steps in sequence using the pipe operator.

ws_programmatics <- readRDS("C:/Users/Jasmine/Desktop/Study/DATA WRANGLING/Week 5/data/ws_programmatics.rds")
ws_categorizations <- readRDS("C:/Users/Jasmine/Desktop/Study/DATA WRANGLING/Week 5/data/ws_categorizations.rds")

ws_programmatics %>%
  full_join(ws_categorizations,
            by = c("Base", "MD"),
            all.x = T,
            all.y = T) %>%
  filter(FY == '2014',
         Base == "MINOT AFB (ND)",
         System %in% c("AIRCRAFT", "MISSILES")) %>%
  group_by(System) %>%
  summarize(Sum_OS = sum(Total_O.S, na.rm = T),
            Sum_End_Strength = sum(End_Strength, na.rm = T))
## # A tibble: 2 × 3
##     System    Sum_OS Sum_End_Strength
##      <chr>     <dbl>            <dbl>
## 1 AIRCRAFT 297398235             2023
## 2 MISSILES 112224909             1951

Problem 7

Once again, join the ws_programmatics.rds & ws_categorization.rds data; however, this time identify which Base had the largest cost per flying hour which requires you to create a new variable) in 2014. Using a bar chart in ggplot2, plot these values for the top 10 bases with the largest cost per flying hour.

ws_programmatics %>%
  full_join(ws_categorizations, by = c("Base","MD")) %>%
  select(Base, Total_O.S, FH, FY, MD) %>%
  filter(FY=="2014")%>%
  na.omit() %>%
  group_by(Base) %>%
  summarize(Total_O.S = sum(Total_O.S, na.rm = T), FH = sum(FH)) %>%
  mutate(CPFH = Total_O.S/FH) %>%
  top_n(10, CPFH) %>%
  ggplot(mapping = aes(x=Base,y=CPFH))+
  geom_bar(stat = "identity")

ws_programmatics %>%
  full_join(ws_categorizations, by = c("Base","MD")) %>%
  select(Base, Total_O.S, FH, FY, MD) %>%
  filter(FY=="2014")%>%
  na.omit() %>%
  mutate(CPFH = Total_O.S/FH) %>%
  top_n(10, CPFH) %>%
  ggplot(mapping = aes(x=Base,y=CPFH))+
  geom_bar(stat = "identity")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Problem 8

Import the ws_programmatics.rds & ws_categorization.rds data so that they are tibbles and perform the following steps in sequence using the pipe operator.

ws_programmatics%>%
  left_join(ws_categorizations,by=c("Base","MD"))%>%
  ggplot(aes(End_Strength,Total_O.S))+
  geom_point(color="blue")+
  facet_wrap(~FY,scales="free",nrow=3)

ws_programmatics %>%
  full_join(ws_categorizations, by = c("Base","MD"))%>%
  select(Base, Total_O.S, FY, MD, System, End_Strength) %>%
  na.omit() %>%
ggplot(mapping = aes(x=End_Strength,y=Total_O.S))+
  geom_point(stat = "identity", color = "blue")  + 
  facet_wrap(~ System, scale = "free", ncol = 1)