Description of the Project

Choose any three of the “wide” datasets identified in the Week 6 Discussion items.

Dataset Used: Trains

Suggested analysis:

Steps

Attach libraries, load and show raw data

library(tidyr)
library(dplyr)
library(knitr)
rawdata = read.csv("https://raw.githubusercontent.com/L-Velasco/Fall16_IS607/master/Project%202_Train.csv")
kable(rawdata)
Trains Arrival Atlantis El.Dorado Hyperborea Narnia Valhalla
shinkansen on_time 497 221 212 503 1841
shinkansen delayed 62 12 20 102 305
tgv on_time 694 4840 383 320 201
tgv delayed 117 415 65 129 61

Tidy the data

Used gather() to make the city destination variables be part of observations and spread() to make the arrival type (on time, delayed) be variables. This step reshapes the data from “wide” to “tall” presentation.

tidydata <- rawdata %>%
  gather("destination","freq",3:7) %>%
  spread(Arrival,freq)
kable(tidydata)
Trains destination delayed on_time
shinkansen Atlantis 62 497
shinkansen El.Dorado 12 221
shinkansen Hyperborea 20 212
shinkansen Narnia 102 503
shinkansen Valhalla 305 1841
tgv Atlantis 117 694
tgv El.Dorado 415 4840
tgv Hyperborea 65 383
tgv Narnia 129 320
tgv Valhalla 61 201

Transform and Analyze the data

Used mutate() to add two new variables that sums the total service and on-time percentage, per train and destination. These new variables help in analysing the train’s arrival. performance across city destinations.

transformed_data <- tidydata %>% 
  mutate(total_service = delayed + on_time,
         pct_ontime = round(on_time / total_service,2) * 100)
kable(transformed_data)
Trains destination delayed on_time total_service pct_ontime
shinkansen Atlantis 62 497 559 89
shinkansen El.Dorado 12 221 233 95
shinkansen Hyperborea 20 212 232 91
shinkansen Narnia 102 503 605 83
shinkansen Valhalla 305 1841 2146 86
tgv Atlantis 117 694 811 86
tgv El.Dorado 415 4840 5255 92
tgv Hyperborea 65 383 448 85
tgv Narnia 129 320 449 71
tgv Valhalla 61 201 262 77

1. Which was the more reliable service for each city and by how much?

reliable_train_per_city <- transformed_data %>% 
  select(-delayed, -on_time, -total_service) %>% 
  spread(Trains, pct_ontime) %>% 
  mutate(pct_diff = shinkansen - tgv)

kable(reliable_train_per_city)
destination shinkansen tgv pct_diff
Atlantis 89 86 3
El.Dorado 95 92 3
Hyperborea 91 85 6
Narnia 83 71 12
Valhalla 86 77 9

Based on the table above, SHINKANSEN appears to have a more reliable service for each city by at least 3% on-time arrival.

#2. Which was the most reliable train service, overall?

reliable_train_overall <- transformed_data %>% 
  group_by(Trains) %>% 
  summarise(pct_ontime = round(sum(on_time)/sum(total_service),2) * 100)

kable(reliable_train_overall)
Trains pct_ontime
shinkansen 87
tgv 89

Based on the table above, TGV appears to have a more reliable service overall.

#3. In aggregate, which city had the most reliable on-time train service?

city_reliable_ontime_service <- transformed_data %>% 
  group_by(destination) %>% 
  summarise(pct_ontime = round(sum(on_time)/sum(total_service),2) * 100)

kable(city_reliable_ontime_service)
destination pct_ontime
Atlantis 87
El.Dorado 92
Hyperborea 88
Narnia 78
Valhalla 85

Based on the table above, the destination EL DORADO appears to have the most reliable on-time service from both trains.