Choose any three of the “wide” datasets identified in the Week 6 Discussion items.
Dataset Used: Trains
Suggested analysis:
library(tidyr)
library(dplyr)
library(knitr)
rawdata = read.csv("https://raw.githubusercontent.com/L-Velasco/Fall16_IS607/master/Project%202_Train.csv")
kable(rawdata)
| Trains | Arrival | Atlantis | El.Dorado | Hyperborea | Narnia | Valhalla |
|---|---|---|---|---|---|---|
| shinkansen | on_time | 497 | 221 | 212 | 503 | 1841 |
| shinkansen | delayed | 62 | 12 | 20 | 102 | 305 |
| tgv | on_time | 694 | 4840 | 383 | 320 | 201 |
| tgv | delayed | 117 | 415 | 65 | 129 | 61 |
Used gather() to make the city destination variables be part of observations and spread() to make the arrival type (on time, delayed) be variables. This step reshapes the data from “wide” to “tall” presentation.
tidydata <- rawdata %>%
gather("destination","freq",3:7) %>%
spread(Arrival,freq)
kable(tidydata)
| Trains | destination | delayed | on_time |
|---|---|---|---|
| shinkansen | Atlantis | 62 | 497 |
| shinkansen | El.Dorado | 12 | 221 |
| shinkansen | Hyperborea | 20 | 212 |
| shinkansen | Narnia | 102 | 503 |
| shinkansen | Valhalla | 305 | 1841 |
| tgv | Atlantis | 117 | 694 |
| tgv | El.Dorado | 415 | 4840 |
| tgv | Hyperborea | 65 | 383 |
| tgv | Narnia | 129 | 320 |
| tgv | Valhalla | 61 | 201 |
Used mutate() to add two new variables that sums the total service and on-time percentage, per train and destination. These new variables help in analysing the train’s arrival. performance across city destinations.
transformed_data <- tidydata %>%
mutate(total_service = delayed + on_time,
pct_ontime = round(on_time / total_service,2) * 100)
kable(transformed_data)
| Trains | destination | delayed | on_time | total_service | pct_ontime |
|---|---|---|---|---|---|
| shinkansen | Atlantis | 62 | 497 | 559 | 89 |
| shinkansen | El.Dorado | 12 | 221 | 233 | 95 |
| shinkansen | Hyperborea | 20 | 212 | 232 | 91 |
| shinkansen | Narnia | 102 | 503 | 605 | 83 |
| shinkansen | Valhalla | 305 | 1841 | 2146 | 86 |
| tgv | Atlantis | 117 | 694 | 811 | 86 |
| tgv | El.Dorado | 415 | 4840 | 5255 | 92 |
| tgv | Hyperborea | 65 | 383 | 448 | 85 |
| tgv | Narnia | 129 | 320 | 449 | 71 |
| tgv | Valhalla | 61 | 201 | 262 | 77 |
1. Which was the more reliable service for each city and by how much?
reliable_train_per_city <- transformed_data %>%
select(-delayed, -on_time, -total_service) %>%
spread(Trains, pct_ontime) %>%
mutate(pct_diff = shinkansen - tgv)
kable(reliable_train_per_city)
| destination | shinkansen | tgv | pct_diff |
|---|---|---|---|
| Atlantis | 89 | 86 | 3 |
| El.Dorado | 95 | 92 | 3 |
| Hyperborea | 91 | 85 | 6 |
| Narnia | 83 | 71 | 12 |
| Valhalla | 86 | 77 | 9 |
Based on the table above, SHINKANSEN appears to have a more reliable service for each city by at least 3% on-time arrival.
#2. Which was the most reliable train service, overall?
reliable_train_overall <- transformed_data %>%
group_by(Trains) %>%
summarise(pct_ontime = round(sum(on_time)/sum(total_service),2) * 100)
kable(reliable_train_overall)
| Trains | pct_ontime |
|---|---|
| shinkansen | 87 |
| tgv | 89 |
Based on the table above, TGV appears to have a more reliable service overall.
#3. In aggregate, which city had the most reliable on-time train service?
city_reliable_ontime_service <- transformed_data %>%
group_by(destination) %>%
summarise(pct_ontime = round(sum(on_time)/sum(total_service),2) * 100)
kable(city_reliable_ontime_service)
| destination | pct_ontime |
|---|---|
| Atlantis | 87 |
| El.Dorado | 92 |
| Hyperborea | 88 |
| Narnia | 78 |
| Valhalla | 85 |
Based on the table above, the destination EL DORADO appears to have the most reliable on-time service from both trains.