This study presents an analysis of data on travel mode choice for travel between Sydney and Melbourne, Australia.

Questions arising from the given data:

Question 1: Which mode of travel is the least time consuming?
Question 2: Which mode is better considering the overall cost?
Question 3: Which is the most appropriate way to travel between Sydney and Melbourne, Australia, according to people?
Setting up the environment for analysis
# Calling the required libraries
library(tidyverse)  #helps wrangle data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)  #helps wrangle date attributes
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)  #helps visualize data
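The report does not show how Travel_modes was read in. One plausible explanation for the X column that appears in the str() output of this report is that the data came from a CSV file written with row names, which read.csv() imports as a column named X. The sketch below illustrates that behaviour on four rows copied from the output in this report. The temporary file and the names sample_rows and travel_demo are assumptions made for illustration, not the author's actual loading step.

```r
# Four rows copied from the head() output of this report (individual 1)
sample_rows <- data.frame(
  individual = c(1L, 1L, 1L, 1L),
  mode       = c("air", "train", "bus", "car"),
  choice     = c("no", "no", "no", "yes"),
  wait       = c(69L, 34L, 35L, 0L),
  vcost      = c(59L, 31L, 25L, 10L),
  travel     = c(100L, 372L, 417L, 180L),
  gcost      = c(70L, 71L, 70L, 30L),
  income     = c(35L, 35L, 35L, 35L),
  size       = c(1L, 1L, 1L, 1L)
)

# Hypothetical file: write.csv() stores row names by default
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path)

# read.csv() turns the unnamed row-name column into a column called X
travel_demo <- read.csv(csv_path)
names(travel_demo)
```

If the index is not wanted, writing the file with row.names = FALSE or dropping X after reading would leave the 9 substantive variables.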
1: Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
# Let's check out the overall structure of the data frame
str(Travel_modes) 
## 'data.frame':    840 obs. of  10 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ individual: int  1 1 1 1 2 2 2 2 3 3 ...
##  $ mode      : chr  "air" "train" "bus" "car" ...
##  $ choice    : chr  "no" "no" "no" "yes" ...
##  $ wait      : int  69 34 35 0 64 44 53 0 69 34 ...
##  $ vcost     : int  59 31 25 10 58 31 25 11 115 98 ...
##  $ travel    : int  100 372 417 180 68 354 399 255 125 892 ...
##  $ gcost     : int  70 71 70 30 68 84 85 50 129 195 ...
##  $ income    : int  35 35 35 35 30 30 30 30 40 40 ...
##  $ size      : int  1 1 1 1 2 2 2 2 1 1 ...
# Exploring the Data: checking means, medians and Quartiles for different columns
summary(Travel_modes)
##        X           individual        mode              choice         
##  Min.   :  1.0   Min.   :  1.0   Length:840         Length:840        
##  1st Qu.:210.8   1st Qu.: 53.0   Class :character   Class :character  
##  Median :420.5   Median :105.5   Mode  :character   Mode  :character  
##  Mean   :420.5   Mean   :105.5                                        
##  3rd Qu.:630.2   3rd Qu.:158.0                                        
##  Max.   :840.0   Max.   :210.0                                        
##       wait           vcost            travel           gcost      
##  Min.   : 0.00   Min.   :  2.00   Min.   :  63.0   Min.   : 30.0  
##  1st Qu.: 0.75   1st Qu.: 23.00   1st Qu.: 234.0   1st Qu.: 71.0  
##  Median :35.00   Median : 39.00   Median : 397.0   Median :101.5  
##  Mean   :34.59   Mean   : 47.76   Mean   : 486.2   Mean   :110.9  
##  3rd Qu.:53.00   3rd Qu.: 66.25   3rd Qu.: 795.5   3rd Qu.:144.0  
##  Max.   :99.00   Max.   :180.00   Max.   :1440.0   Max.   :269.0  
##      income           size      
##  Min.   : 2.00   Min.   :1.000  
##  1st Qu.:20.00   1st Qu.:1.000  
##  Median :34.50   Median :1.000  
##  Mean   :34.55   Mean   :1.743  
##  3rd Qu.:50.00   3rd Qu.:2.000  
##  Max.   :72.00   Max.   :6.000

From the str() output we can see that this data frame contains 840 observations of 10 variables: one row for each of the 4 modes for 210 individuals. The first column, X, is only a row index, which leaves 9 substantive variables.

The summary() output is consistent with a sample of 210 individuals, since the maximum of individual is 210. The longest terminal wait is 99 min and the longest travel time is 1440 min. The vehicle cost ranges from $2 to $180, and the generalized cost ranges from $30 to $269. The first quartile of wait is only 0.75 min, which likely reflects the zero terminal wait recorded for the car rows.
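The overall summary mixes all four modes together, so a per-mode breakdown can make the means, medians and quartiles easier to interpret. Below is a minimal sketch using dplyr on the first eight rows taken from the str() output above. The modes of rows 7 and 8 are assumed to follow the air, train, bus, car pattern, and the names travel_demo and mode_summary are hypothetical.

```r
library(dplyr)

# First eight rows from the str() output (individuals 1 and 2);
# the mode labels of rows 7 and 8 are assumed from the repeating pattern
travel_demo <- data.frame(
  individual = c(1, 1, 1, 1, 2, 2, 2, 2),
  mode       = rep(c("air", "train", "bus", "car"), times = 2),
  wait       = c(69, 34, 35, 0, 64, 44, 53, 0),
  vcost      = c(59, 31, 25, 10, 58, 31, 25, 11),
  travel     = c(100, 372, 417, 180, 68, 354, 399, 255),
  gcost      = c(70, 71, 70, 30, 68, 84, 85, 50)
)

# One row of summary statistics per travel mode
mode_summary <- travel_demo %>%
  group_by(mode) %>%
  summarise(
    mean_wait     = mean(wait),
    median_travel = median(travel),
    q1_gcost      = quantile(gcost, 0.25, names = FALSE),
    q3_gcost      = quantile(gcost, 0.75, names = FALSE),
    .groups       = "drop"
  )

mode_summary
```

Applied to the full Travel_modes data frame, the same pipeline would give one line per mode instead of a single pooled figure.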

2: Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)
# Use more descriptive variable names
Travel_modes <- rename(
  Travel_modes,
  terminal_wait = wait,
  travel_time = travel,
  vehicle_cost = vcost,
  generalized_cost = gcost
)

# Check the new column names
head(Travel_modes)
##   X individual  mode choice terminal_wait vehicle_cost travel_time
## 1 1          1   air     no            69           59         100
## 2 2          1 train     no            34           31         372
## 3 3          1   bus     no            35           25         417
## 4 4          1   car    yes             0           10         180
## 5 5          2   air     no            64           58          68
## 6 6          2 train     no            44           31         354
##   generalized_cost income size
## 1               70     35    1
## 2               71     35    1
## 3               70     35    1
## 4               30     35    1
## 5               68     30    2
## 6               84     30    2

Let’s add three derived columns to the current data frame: the overall cost, the cumulative time to travel, and a people count.

Travel_modes <- Travel_modes %>%
  mutate(
    overall_cost = vehicle_cost + generalized_cost,
    total_time = terminal_wait + travel_time,
    # individual is a respondent ID, so this product is a weight, not a true head count
    number_of_people = individual * size
  )

# Checking out the new columns
head(Travel_modes)
##   X individual  mode choice terminal_wait vehicle_cost travel_time
## 1 1          1   air     no            69           59         100
## 2 2          1 train     no            34           31         372
## 3 3          1   bus     no            35           25         417
## 4 4          1   car    yes             0           10         180
## 5 5          2   air     no            64           58          68
## 6 6          2 train     no            44           31         354
##   generalized_cost income size overall_cost total_time number_of_people
## 1               70     35    1          129        169                1
## 2               71     35    1          102        406                1
## 3               70     35    1           95        452                1
## 4               30     35    1           40        180                1
## 5               68     30    2          126        132                4
## 6               84     30    2          115        398                4
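Because individual is a respondent ID, multiplying it by size gives a value that grows with the row position rather than a head count (individual 2 with size 2 yields 4 in the output above). A minimal sketch of one alternative, assuming size is the travelling party size and is constant within an individual, is to keep one row per individual and sum size. The rows are copied from the output above, and party_sizes and total_travellers are hypothetical names.

```r
library(dplyr)

# Individuals 1 and 2 as they appear in the output above
travel_demo <- data.frame(
  individual = c(1, 1, 1, 1, 2, 2, 2, 2),
  mode       = rep(c("air", "train", "bus", "car"), times = 2),
  size       = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Each individual appears once per mode, so deduplicate before summing
party_sizes <- travel_demo %>%
  distinct(individual, size)

total_travellers <- sum(party_sizes$size)
total_travellers
```

This counts each party once, regardless of how many mode rows the individual has.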
3: Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2
# Travel time against overall cost, one panel per mode
ggplot(data = Travel_modes) +
  geom_line(mapping = aes(x = travel_time, y = overall_cost, color = choice)) +
  facet_wrap(~mode) +
  labs(title = "Travel time vs overall cost") +
  theme_dark()

The graph above displays the relationship between travel time and overall cost of travel from Sydney to Melbourne for the different modes of travel. It helps answer the first two questions asked at the beginning of the study: which mode takes the least time, and which is better in terms of overall cost.
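Since each row is a separate observation, the same relationship can also be drawn as a scatter plot, which avoids connecting unrelated trips with a line. The sketch below is one way to do that with geom_point(), using the six rows shown in the head() output above. The name scatter_plot and the axis labels are assumptions for illustration.

```r
library(ggplot2)

# Six rows copied from the head() output above
travel_demo <- data.frame(
  mode         = c("air", "train", "bus", "car", "air", "train"),
  choice       = c("no", "no", "no", "yes", "no", "no"),
  travel_time  = c(100, 372, 417, 180, 68, 354),
  overall_cost = c(129, 102, 95, 40, 126, 115)
)

# One point per observation, one panel per mode
scatter_plot <- ggplot(
  travel_demo,
  aes(x = travel_time, y = overall_cost, color = choice)
) +
  geom_point() +
  facet_wrap(~mode) +
  labs(
    title = "Travel time vs overall cost",
    x = "Travel time (min)",
    y = "Overall cost ($)"
  )
```

Printing scatter_plot renders the figure; swapping travel_demo for Travel_modes would plot the full data set.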

Now let’s look for the most appropriate means of travel according to people.

ggplot(data = Travel_modes, aes(x = choice, y = number_of_people, fill = choice)) +
  geom_bar(stat = "identity") +
  facet_wrap(~mode) +
  theme_dark() +
  labs(title = "Bar graph for best choice among people")

According to the bar graph above, which sums number_of_people within each panel, the popular choice among people is to travel either by air or by car, followed by train, with bus being the least favoured.

Air and car remain the popular choices even though, in the previous graph, travelling by air is the most expensive option and car has the longest travel time. This raises more questions and calls for more digging through the data.
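A more direct way to measure popularity is to count how many individuals chose each mode, since every individual has exactly one row per mode. The following is a minimal sketch with dplyr. The choice values for individual 1 come from the output above, while those of rows 7 and 8 are made up for illustration, and choice_counts is a hypothetical name.

```r
library(dplyr)

# Individual 1 is taken from the output above; the choice values of
# rows 7 and 8 (individual 2, bus and car) are assumed for illustration
travel_demo <- data.frame(
  individual = c(1, 1, 1, 1, 2, 2, 2, 2),
  mode       = rep(c("air", "train", "bus", "car"), times = 2),
  choice     = c("no", "no", "no", "yes", "no", "no", "no", "yes")
)

# Keep only the chosen rows and count them per mode
choice_counts <- travel_demo %>%
  filter(choice == "yes") %>%
  count(mode, name = "times_chosen") %>%
  arrange(desc(times_chosen))

choice_counts
```

Modes that nobody chose drop out of the result, so on the full data it is worth checking that all four modes appear.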

Let’s plot the terminal wait time against the travel time

ggplot(data=Travel_modes)+
  geom_smooth(mapping = aes(x=travel_time, y=terminal_wait, color= choice), method = "loess")+
  facet_wrap(~mode)+theme_dark()+
  labs(title = "Terminal wait vs Travel time")
## `geom_smooth()` using formula = 'y ~ x'

The graph above suggests one possible reason why people opt for car or air: both train and bus have a higher terminal wait time than car and a higher travel time than air.

Now let’s check how the choice of mode varies with people’s income.

ggplot(data=Travel_modes)+
  geom_boxplot(mapping = aes(x=mode, y=income, fill=choice))+
  theme_dark()+labs(title = "Household income vs Mode")

Looking at the box plot above, people with higher incomes tend to travel by air, while people with lower incomes prefer train or bus.
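The income distribution itself can be shown with a histogram. Income is repeated on every mode row of an individual, so the sketch below first keeps one row per individual. The incomes of the first three individuals come from the str() output earlier in this report and are assumed to be constant within an individual. The bin width of 5 and the names income_per_person and income_hist are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Incomes of individuals 1 to 3 from the str() output, one row per mode
travel_demo <- data.frame(
  individual = rep(1:3, each = 4),
  income     = rep(c(35, 30, 40), each = 4)
)

# Deduplicate so each individual contributes a single income value
income_per_person <- travel_demo %>%
  distinct(individual, income)

# Assumed bin width of 5 income units
income_hist <- ggplot(income_per_person, aes(x = income)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Distribution of household income", x = "Household income")
```

Printing income_hist renders the figure; without the distinct() step every income would be counted four times.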

4: Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end

Question 1: Which mode of travel is the least time consuming?

Answer: According to the graph, the least time-consuming mode of travel is air, while car is the most time consuming overall, followed by bus and train, respectively.

Question 2: Which mode is better considering the overall cost?

Answer: Considering the overall cost, bus proves to be the cheapest way to travel, followed by car and train, respectively, while air is the most expensive means of travel.

Question 3: Which is the most appropriate way to travel between Sydney and Melbourne, Australia, according to people?

Answer: According to the bar graph above, the popular choice among people is to travel either by air or by car, followed by train, with bus being the least favoured.

Conclusion

* According to the analysis carried out, most people prefer to travel either by air or by car. People who travel by air benefit from the shorter travel time and have, on average, a higher household income. A possible reason for travelling by car is the zero terminal wait and the low travel cost.
* According to the first graph, of overall cost against travel time, train would have been a more optimal choice. Further investigation, or data on a customer satisfaction index, is required to see why more people would drive rather than take the train. Nonetheless, train still stands out for people who have lower household incomes and manage
* Travelling by bus, with its long terminal wait and travel time, was the least favoured option and costs a smidge more than travelling by car, but again more data on customer satisfaction is required.