# Calling the required libraries
library(tidyverse) #helps wrangle data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate) #helps wrangle date attributes
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2) #helps visualize data
# Storing the GitHub URL of the data set
dataURL <- "https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/TravelMode.csv"
# Reading the CSV from that URL into a data frame
Travel_modes <- read.csv(dataURL)
# Displaying the Head of data frame
head(Travel_modes)
## X individual mode choice wait vcost travel gcost income size
## 1 1 1 air no 69 59 100 70 35 1
## 2 2 1 train no 34 31 372 71 35 1
## 3 3 1 bus no 35 25 417 70 35 1
## 4 4 1 car yes 0 10 180 30 35 1
## 5 5 2 air no 64 58 68 68 30 2
## 6 6 2 train no 44 31 354 84 30 2
# Let's check out the overall structure of the data frame
str(Travel_modes)
## 'data.frame': 840 obs. of 10 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ individual: int 1 1 1 1 2 2 2 2 3 3 ...
## $ mode : chr "air" "train" "bus" "car" ...
## $ choice : chr "no" "no" "no" "yes" ...
## $ wait : int 69 34 35 0 64 44 53 0 69 34 ...
## $ vcost : int 59 31 25 10 58 31 25 11 115 98 ...
## $ travel : int 100 372 417 180 68 354 399 255 125 892 ...
## $ gcost : int 70 71 70 30 68 84 85 50 129 195 ...
## $ income : int 35 35 35 35 30 30 30 30 40 40 ...
## $ size : int 1 1 1 1 2 2 2 2 1 1 ...
# Exploring the Data: checking means, medians and Quartiles for different columns
summary(Travel_modes)
## X individual mode choice
## Min. : 1.0 Min. : 1.0 Length:840 Length:840
## 1st Qu.:210.8 1st Qu.: 53.0 Class :character Class :character
## Median :420.5 Median :105.5 Mode :character Mode :character
## Mean :420.5 Mean :105.5
## 3rd Qu.:630.2 3rd Qu.:158.0
## Max. :840.0 Max. :210.0
## wait vcost travel gcost
## Min. : 0.00 Min. : 2.00 Min. : 63.0 Min. : 30.0
## 1st Qu.: 0.75 1st Qu.: 23.00 1st Qu.: 234.0 1st Qu.: 71.0
## Median :35.00 Median : 39.00 Median : 397.0 Median :101.5
## Mean :34.59 Mean : 47.76 Mean : 486.2 Mean :110.9
## 3rd Qu.:53.00 3rd Qu.: 66.25 3rd Qu.: 795.5 3rd Qu.:144.0
## Max. :99.00 Max. :180.00 Max. :1440.0 Max. :269.0
## income size
## Min. : 2.00 Min. :1.000
## 1st Qu.:20.00 1st Qu.:1.000
## Median :34.50 Median :1.000
## Mean :34.55 Mean :1.743
## 3rd Qu.:50.00 3rd Qu.:2.000
## Max. :72.00 Max. :6.000
From the str() output, the data frame contains 840 observations of 10 variables (including the row index X): one row for each of 4 travel modes for each of 210 individuals.
From the summary() output, the individual identifier runs from 1 to 210. The longest terminal wait is 99 minutes and the longest travel time is 1440 minutes. The vehicle cost ranges from $2 to $180, and the generalized cost ranges from $30 to $269.
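The counts quoted above can also be checked directly rather than read off the printed output. The following is a minimal sketch in base R, assuming only the first six rows shown by head(); `sample_rows` and `shape` are names introduced here for illustration. On the full `Travel_modes` data frame the same expressions should reproduce the figures reported by str().

```r
# First six rows of the data, copied from the head() output above.
sample_rows <- data.frame(
  X          = 1:6,
  individual = c(1, 1, 1, 1, 2, 2),
  mode       = c("air", "train", "bus", "car", "air", "train"),
  choice     = c("no", "no", "no", "yes", "no", "no"),
  wait       = c(69, 34, 35, 0, 64, 44),
  vcost      = c(59, 31, 25, 10, 58, 31),
  travel     = c(100, 372, 417, 180, 68, 354),
  gcost      = c(70, 71, 70, 30, 68, 84),
  income     = c(35, 35, 35, 35, 30, 30),
  size       = c(1, 1, 1, 1, 2, 2)
)

# Rows, columns, distinct individuals and distinct modes in one named vector.
shape <- c(
  rows        = nrow(sample_rows),
  columns     = ncol(sample_rows),
  individuals = length(unique(sample_rows$individual)),
  modes       = length(unique(sample_rows$mode))
)
print(shape)
```

Replacing `sample_rows` with `Travel_modes` runs the same check on the complete data.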
# Using more descriptive variable names (rename() takes new_name = old_name)
Travel_modes <- rename(Travel_modes,
                       terminal_wait = wait,
                       travel_time = travel,
                       vehicle_cost = vcost,
                       generalized_cost = gcost)
# checking the new column names
head(Travel_modes)
## X individual mode choice terminal_wait vehicle_cost travel_time
## 1 1 1 air no 69 59 100
## 2 2 1 train no 34 31 372
## 3 3 1 bus no 35 25 417
## 4 4 1 car yes 0 10 180
## 5 5 2 air no 64 58 68
## 6 6 2 train no 44 31 354
## generalized_cost income size
## 1 70 35 1
## 2 71 35 1
## 3 70 35 1
## 4 30 35 1
## 5 68 30 2
## 6 84 30 2
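For readers without dplyr installed, the same renaming can be sketched in base R. This is an illustrative equivalent, not the method used in this analysis; `sample_rows`, `name_map`, `renamed` and `positions` are hypothetical names, and the values are the first four rows from the head() output.

```r
# First four rows of the data with the original column names.
sample_rows <- data.frame(
  mode   = c("air", "train", "bus", "car"),
  wait   = c(69, 34, 35, 0),
  vcost  = c(59, 31, 25, 10),
  travel = c(100, 372, 417, 180),
  gcost  = c(70, 71, 70, 30)
)

# Lookup of old name -> new name (the reverse of rename()'s new = old order).
name_map <- c(
  wait   = "terminal_wait",
  vcost  = "vehicle_cost",
  travel = "travel_time",
  gcost  = "generalized_cost"
)

renamed <- sample_rows
positions <- match(names(name_map), names(renamed))
stopifnot(!anyNA(positions))  # fail early if an expected column is missing
names(renamed)[positions] <- unname(name_map)
print(names(renamed))
```

Columns that are not listed in `name_map`, such as `mode`, keep their original names.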
Let’s add three extra columns to the current data frame: the overall cost, the cumulative travel time, and a number_of_people column.
Travel_modes <- Travel_modes %>%
  mutate(overall_cost = vehicle_cost + generalized_cost,
         total_time = terminal_wait + travel_time,
         # individual is a respondent ID, so this product grows with the ID
         number_of_people = individual * size)
# Checking out the new columns
head(Travel_modes)
## X individual mode choice terminal_wait vehicle_cost travel_time
## 1 1 1 air no 69 59 100
## 2 2 1 train no 34 31 372
## 3 3 1 bus no 35 25 417
## 4 4 1 car yes 0 10 180
## 5 5 2 air no 64 58 68
## 6 6 2 train no 44 31 354
## generalized_cost income size overall_cost total_time number_of_people
## 1 70 35 1 129 169 1
## 2 71 35 1 102 406 1
## 3 70 35 1 95 452 1
## 4 30 35 1 40 180 1
## 5 68 30 2 126 132 4
## 6 84 30 2 115 398 4
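The two sums can be verified by hand against the printed rows: for row 1, 59 + 70 = 129 and 69 + 100 = 169. The sketch below repeats that arithmetic in base R on the six rows shown above; `sample_rows` is a name introduced here for illustration.

```r
# First six rows of the renamed data, copied from the head() output.
sample_rows <- data.frame(
  terminal_wait    = c(69, 34, 35, 0, 64, 44),
  vehicle_cost     = c(59, 31, 25, 10, 58, 31),
  travel_time      = c(100, 372, 417, 180, 68, 354),
  generalized_cost = c(70, 71, 70, 30, 68, 84)
)

# Same element-wise sums as the mutate() call, written with base R.
sample_rows$overall_cost <- sample_rows$vehicle_cost + sample_rows$generalized_cost
sample_rows$total_time   <- sample_rows$terminal_wait + sample_rows$travel_time

print(sample_rows[, c("overall_cost", "total_time")])
```

The printed columns should match the overall_cost and total_time values in the head() output above.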
ggplot(data = Travel_modes) +
  geom_line(mapping = aes(x = travel_time, y = overall_cost, color = choice)) +
  facet_wrap(~mode) +
  labs(title = "Travel time vs overall cost") +
  theme_dark()
The graph above displays the relation between travel time and overall cost of travel from Sydney to Melbourne for the different modes of travel. It addresses the first two questions of the study: which mode is the least time consuming, and which mode is the cheapest overall.
ggplot(data = Travel_modes, aes(x = choice, y = number_of_people, fill = choice)) +
  geom_bar(stat = "identity") +
  facet_wrap(~mode) +
  theme_dark() +
  labs(title = "Bar graph for best choice among people")
According to the bar graph above, the most popular choices are to travel by air or by car, followed by train, with bus the least favored.
Even though the previous graph shows that air is the most expensive mode and car has the longest travel time, these two remain the popular choices. This raises further questions and calls for more digging through the data.
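Because the bar heights depend on how number_of_people was built, it can help to cross-check popularity with a plain count of the rows where choice is "yes", and with the party sizes of those rows. The following is a minimal sketch in base R on the six rows printed earlier; `sample_rows`, `chosen`, `trips_by_mode` and `travellers_by_mode` are names introduced here, and the sample is far too small to support any conclusion on its own.

```r
# First six rows of the data; mode is a factor so that empty modes still appear.
sample_rows <- data.frame(
  individual = c(1, 1, 1, 1, 2, 2),
  mode       = factor(c("air", "train", "bus", "car", "air", "train"),
                      levels = c("air", "train", "bus", "car")),
  choice     = c("no", "no", "no", "yes", "no", "no"),
  size       = c(1, 1, 1, 1, 2, 2)
)

# Keep only the alternative each individual actually chose.
chosen <- sample_rows[sample_rows$choice == "yes", ]

# Number of individuals choosing each mode.
trips_by_mode <- table(chosen$mode)

# Number of travellers per chosen mode, weighting each choice by party size.
travellers_by_mode <- xtabs(size ~ mode, data = chosen)

print(trips_by_mode)
print(travellers_by_mode)
```

Running the same lines on the full `Travel_modes` data frame gives one count per mode that can be compared with the bar graph.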
Let’s plot the terminal wait time against the travel time
ggplot(data = Travel_modes) +
  geom_smooth(mapping = aes(x = travel_time, y = terminal_wait, color = choice),
              method = "loess") +
  facet_wrap(~mode) +
  theme_dark() +
  labs(title = "Terminal wait vs Travel time")
## `geom_smooth()` using formula = 'y ~ x'
The graph above suggests one possible reason why people opt for car or air: both train and bus have a higher terminal wait time than car, and a higher travel time than air.
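The comparison read off the smoothed curves can be backed by per-mode averages. Here is a minimal sketch using base R's aggregate() on the six rows printed earlier; `sample_rows` and `mean_by_mode` are names introduced for illustration, and with so few rows the numbers only demonstrate the mechanics.

```r
# First six rows of the renamed data, copied from the head() output.
sample_rows <- data.frame(
  mode          = c("air", "train", "bus", "car", "air", "train"),
  terminal_wait = c(69, 34, 35, 0, 64, 44),
  travel_time   = c(100, 372, 417, 180, 68, 354)
)

# Mean terminal wait and mean travel time for each mode.
mean_by_mode <- aggregate(cbind(terminal_wait, travel_time) ~ mode,
                          data = sample_rows, FUN = mean)
print(mean_by_mode)
```

Applied to the full `Travel_modes` data frame, the resulting table shows at a glance which modes combine a long wait with a long trip.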
Now let’s look at how the choice of mode varies with household income.
ggplot(data=Travel_modes)+
geom_boxplot(mapping = aes(x=mode, y=income, fill=choice))+
theme_dark()+labs(title = "Household income vs Mode")
Looking at the box plot above, people with higher incomes tend to travel by air, while people with lower incomes prefer train or bus.
Question 1: Which mode of travel is the least time consuming?
Answer: According to the graph, the least time consuming mode of travel is air, while car is the most time consuming overall, followed by bus and train, respectively.
Question 2: Which mode is better considering the overall cost?
Answer: Considering the overall cost, bus proves to be the cheapest way to travel, followed by car and train, respectively, while air is the most expensive.
Question 3: Which mode do people consider the most appropriate way to travel between Sydney and Melbourne, Australia?
Answer: According to the bar graph, the most popular choices are to travel by air or by car, followed by train, with bus the least favored.
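The answers to Questions 1 and 2 are read from a plot, so a numeric ranking is a useful cross-check. The sketch below is an assumption-laden illustration in base R: it uses only the six rows printed earlier, and `sample_rows`, `by_mode`, `fastest_first` and `cheapest_first` are names introduced here. The ordering it produces on six rows says nothing about the full data.

```r
# First six rows of the renamed data, copied from the head() output.
sample_rows <- data.frame(
  mode             = c("air", "train", "bus", "car", "air", "train"),
  terminal_wait    = c(69, 34, 35, 0, 64, 44),
  vehicle_cost     = c(59, 31, 25, 10, 58, 31),
  travel_time      = c(100, 372, 417, 180, 68, 354),
  generalized_cost = c(70, 71, 70, 30, 68, 84)
)

# Derived columns, defined the same way as in the analysis.
sample_rows <- transform(sample_rows,
                         total_time   = terminal_wait + travel_time,
                         overall_cost = vehicle_cost + generalized_cost)

# Mean total time and mean overall cost per mode.
by_mode <- aggregate(cbind(total_time, overall_cost) ~ mode,
                     data = sample_rows, FUN = mean)

# Rank the modes from fastest to slowest and from cheapest to most expensive.
fastest_first  <- by_mode[order(by_mode$total_time), ]
cheapest_first <- by_mode[order(by_mode$overall_cost), ]

print(fastest_first)
print(cheapest_first)
```

Swapping `sample_rows` for the full `Travel_modes` data frame (and skipping the transform() step, since the columns already exist there) yields rankings that can be compared with the answers above.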