Before we perform discrete choice analysis with Apollo in R, we have to import the data and may need to clean and transform it into the format required by Apollo. Sometimes, we also want to explore the data for better understanding. In this document, we will cover some essentials: setting the working directory, writing R scripts, reading data, manipulating data with dplyr and reshaping data with tidyr.
The working directory is the path on the computer that is the default location for any file we read into or save out of R. We can run the command getwd() in the console to see the current working directory. We can also see the current working directory at the top of the console.
I recommend creating a separate subfolder for each assignment in the parent folder Discrete Choice Analysis. You can download the assignment dataset and store a copy of the data, i.e. the csv/spss file, in the corresponding assignment folder. Then you can change the current working directory to the assignment folder in one of two ways: through the RStudio menu (Session, then Set Working Directory, then Choose Directory) or with the command setwd().
For example, I created a subfolder ‘Assignment-0’ inside the folder ‘Discrete Choice Analysis’, which is on the desktop. I can then copy the location from the address bar of the computer and set the current working directory as setwd("C:/Desktop/Discrete Choice Analysis/Assignment-0"). Note that we use the forward slash /, not \, in the file location.
We will read data and save outputs in the same assignment folder. Later, we can learn to read files from a different location on the computer and save the outputs elsewhere.
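The getwd()/setwd() workflow described above can be sketched as follows. The folder here is created inside a temporary directory only so that the example runs on any computer; it is a hypothetical stand-in for your own assignment folder, not a path from this tutorial.

```r
# Remember the starting directory so we can return to it at the end
old_wd <- getwd()

# Hypothetical assignment folder inside a temporary directory
# (replace with the path to your own assignment folder)
assignment_dir <- file.path(tempdir(), "Assignment-0")
dir.create(assignment_dir, showWarnings = FALSE)

# Change the working directory and confirm the change
setwd(assignment_dir)
new_wd <- getwd()
print(new_wd)

# List the files R now sees by default (empty for a new folder)
print(list.files())

# Go back to the original working directory
setwd(old_wd)
```

Building the path with file.path() avoids typing the slashes by hand, which sidesteps the / versus \ issue.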
An R script is a simple text file containing multiple commands that can be executed at once, stored and reused later. We can start a new script by clicking on File %>% New File %>% R Script. Read %>% as then. The script pane is at the top left. Let’s save the script as trial.R (by clicking File %>% Save). We don’t need to write the extension .R, just like we don’t write the extension .png to save an image. We can save the script in the same working directory. Let’s copy the following code chunk and paste it in the trial.R script. We can run the code in mainly two ways: Run, which executes the current line or selection, and Source, which executes the whole script.
Try both Run and Source to run this code in trial.R. We can also see the variables in the Environment pane.
# Out of vehicle travel time
OVTT <- 5
# In vehicle travel time
IVTT <- 10
# total travel time
TT <- OVTT + IVTT
In the R script, we can write comments using # at the beginning of the line. We use the <- operator to assign values. It can be replaced by = for now, but this is not recommended. In RStudio, the keyboard shortcut for the assignment operator <- is Alt + - (in Windows) or Option + - (in Mac).
We have already created a subfolder Assignment-0 in Discrete Choice Analysis folder. We have to download the ‘Sample data.csv’ file and paste it in this subfolder. We can start a new R script file and save it as Tutorial.R. We have already set the working directory (crosscheck using command getwd() in console). In the subsequent sections, we will clean the workspace, load the tidyverse package and read data from ‘Sample data.csv’. Tidyverse is a collection of many packages designed for data science in R. Here, we will very briefly explore two packages - dplyr and tidyr.
We will quickly glance at the verbs of the dplyr package for data manipulation. We will then quickly explore the tidyr package to transform data between the longer and wider formats. Later, we will use some of these skills to cross tabulate data. For each step below, I have added a code chunk that performs a specific task; you can copy it from here, paste it in the Tutorial.R script and run it.
Before we write a set of commands, we should clear the workspace, i.e. clear all the variables created earlier. Then we have to load the packages into memory using the command library("insert package name here").
Copy the code from below and paste it in the Tutorial.R file and run/source it.
# -------------------------------------------------------------------------------
# STEP 1 - INITIAL CODE SETUP
# clear Workspace/Environment
rm(list = ls())
# load packages
library("tidyverse") # to perform data analysis
library("haven") # to read spss file
We will read the csv file using the function read_csv("insert file name here.csv"). It stores the data as a tibble, which is similar to a data frame. We can double click tbl in the Environment pane to view it. We can also read spss files using the function read_sav("insert file name here.sav").
Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.
# -------------------------------------------------------------------------------
# STEP 2 - READ DATA
# read csv file and store it as a variable of data type tibble
tbl <- read_csv("Sample Data.csv")
Description of tbl
If you are interested in more details on data type tibble, you can see chapter 10 in R for data science book.
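To see what read_csv() returns without needing the assignment file, here is a minimal sketch. It writes a tiny csv to a temporary folder and reads it back. The file name toy_data.csv and all values are made up for illustration; only the column names mirror those used later in this tutorial.

```r
library(readr)

# Hypothetical file with made-up values; the column names mirror
# those used later in this tutorial
csv_path <- file.path(tempdir(), "toy_data.csv")
writeLines(c("Mode,TTcar,TTtrans,Gender,HHSize",
             "1,20,35,1,3",
             "0,40,30,0,2",
             "1,15,25,0,4"),
           csv_path)

# read_csv() returns a tibble; col_types = cols() lets readr guess
# the column types without printing the column specification
toy_tbl <- read_csv(csv_path, col_types = cols())

print(toy_tbl)        # compact print that also shows the column types
print(dim(toy_tbl))   # number of rows and columns
print(class(toy_tbl)) # a tibble is also a data frame
```

Printing a tibble shows only the first rows and the type of each column, which is convenient for a quick check after reading data.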
dplyr is a package included in Tidyverse which provides a set of verbs for efficiently manipulating datasets. The main verbs are:
- select() to select specific variables based on their names
- filter() to select rows based on values of some variables
- mutate() to add new variables
- arrange() to reorder the rows
- summarise() to aggregate values of rows
You can read the details in chapter 5 of the book R for Data Science. You can also see the documentation on the dplyr website. We will quickly apply these verbs to perform a few tasks below. I have written down the steps of data manipulation and attached the corresponding code; you will realize that writing code in R is easier than writing the steps in English. We will use the pipe operator %>% (read as then) to feed the output of the previous step into the next step.
Steps -
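1. Take the data (tbl) [then]
2. Filter the rows where (Mode = 1) [then]
3. Select the variables Mode, TTcar, TTtrans, Gender [then]
4. Add a new variable TT which is equal to TTcar - TTtrans [then]
5. Arrange the rows by TT [then]
6. Store the result in tbl_car
Copy the code from below and paste it in the Tutorial.R script below the previous code and run/source it.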
# -------------------------------------------------------------------------------
# Task 1 - Manipulate data
tbl_car <- tbl %>%
  filter(Mode == 1) %>%
  select(Mode, TTcar, TTtrans, Gender) %>%
  mutate(TT = TTcar - TTtrans) %>%
  arrange(TT)
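To check what each verb does, the same pipeline can be run on a tiny table where the result is easy to verify by hand. This is only a sketch: toy_tbl and toy_car are hypothetical names, and the values are made up rather than taken from the sample data.

```r
library(dplyr)

# Made-up rows for illustration only (not the tutorial's sample data)
toy_tbl <- tibble(
  Mode    = c(1, 0, 1, 1),
  TTcar   = c(20, 40, 15, 30),
  TTtrans = c(35, 30, 25, 32),
  Gender  = c(1, 0, 0, 1)
)

toy_car <- toy_tbl %>%
  filter(Mode == 1) %>%                      # keeps 3 of the 4 rows
  select(Mode, TTcar, TTtrans, Gender) %>%   # keeps the named columns
  mutate(TT = TTcar - TTtrans) %>%           # adds TT: -15, -10, -2
  arrange(TT)                                # sorts with the smallest TT first

print(toy_car)
```

Running the pipeline one line at a time (select up to the end of a line, without the trailing %>%, and click Run) is a simple way to inspect the intermediate results.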
Steps -
1. Take the data (tbl) [then]
2. Select the variables Mode, TCcar, TCtrans, Gender [then]
3. Filter the rows where (Mode = 0) [then]
4. Add a new variable TC which is equal to TCcar - TCtrans [then]
5. Arrange the rows by TC (google it)
6. Store the result in tbl_Trans
Copy the code from below and paste it in the Tutorial.R script below the previous code. Uncomment it using Ctrl + Shift + C and run/source it. You will get an error because there are mistakes in the code chunk; please correct them.
# -------------------------------------------------------------------------------
# Task 2 - Debug the code chunk
# tbl_Trans <- tbl >%>
# select(Mode, TCcar, TCtrans, Gender) %>%
# filter(Mode = 0)
# arrange(TC) %>%
# mutate(TC = TCcar - TCtrans)
Steps -
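1. Take the data (tbl) [then]
2. Filter the rows where (Mode = 1) [then]
3. Summarise to get the number of rows, the number of Male respondents (the sum of Gender, which assumes Male is coded as 1) and the mean of HHSize
4. Store the result in tbl_summary_1
Copy the code from below and paste it in the Tutorial.R script below the previous code and run/source it.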
# -------------------------------------------------------------------------------
# Task 3 - Summarise
tbl_summary_1 <- tbl %>%
  filter(Mode == 1) %>%
  summarise(count = n(),
            total_male = sum(Gender),
            mean_HHSize = mean(HHSize))
Steps -
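1. Take the data (tbl) [then]
2. Group the rows by Mode and compute the same summary as in Task 3 for each group
3. Store the result in tbl_summary_2
Copy the code from below and paste it in the Tutorial.R script below the previous code and run/source it.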
# -------------------------------------------------------------------------------
# Task 4 - Summarise by group
tbl_summary_2 <- tbl %>%
  group_by(Mode) %>%
  summarise(count = n(),
            total_male = sum(Gender),
            mean_HHSize = mean(HHSize))
Steps -
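1. Take the data (tbl) [then]
2. Group the rows by Mode and Gender and count the rows in each group
3. Store the result in tbl_summary_3
Copy the code from below and paste it in the Tutorial.R script below the previous code and run/source it.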
# -------------------------------------------------------------------------------
# Task 5 - Summarise by multiple groups
tbl_summary_3 <- tbl %>%
  group_by(Mode, Gender) %>%
  summarise(count = n())
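When we group by more than one variable, recent versions of dplyr print a message saying that the output of summarise() is still grouped by the first variable. The sketch below, on made-up data, shows two ways to get an ungrouped result: the .groups argument of summarise() and the shortcut count(). The names toy_tbl, toy_summary and toy_count are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only
toy_tbl <- tibble(
  Mode   = c(1, 1, 0, 0, 1),
  Gender = c(1, 0, 0, 0, 1)
)

# Same pattern as Task 5, but .groups = "drop" returns an ungrouped tibble
toy_summary <- toy_tbl %>%
  group_by(Mode, Gender) %>%
  summarise(count = n(), .groups = "drop")

# count() combines group_by() and summarise(n()) in one step
toy_count <- toy_tbl %>%
  count(Mode, Gender, name = "count")

print(toy_summary)
print(toy_count)
```

Both tables have one row per combination of Mode and Gender that occurs in the data; combinations that never occur are simply absent.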
The last result looks very similar to a cross tab but is arranged in a different format, i.e. the longer format. We need to change it to the wider format. Tibbles can be easily transformed using the tidyr library. We will quickly glance through this library.
The longer and wider formats are illustrated with the help of the figure below.
Illustration of longer and wider format
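Here, we will look at only one function, pivot_wider(); its documentation is on the tidyr website. We have to specify a few inputs/arguments: names_from (the column whose values become the new column names), values_from (the column whose values fill the cells) and values_fill (the value to use where a combination is missing).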
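As a small illustration of the longer and wider formats and of the three arguments, here is a sketch on a made-up table of counts. The names toy_long and toy_wide and the numbers are hypothetical. Note that the combination Mode = 1, Gender = 0 is deliberately missing so that values_fill has something to do.

```r
library(dplyr)
library(tidyr)

# Longer format: one row per (Mode, Gender) combination; counts are made up
toy_long <- tibble(
  Mode   = c(0, 0, 1),
  Gender = c(0, 1, 1),
  count  = c(12, 8, 5)
)

# Wider format: one row per Gender and one column per value of Mode
toy_wide <- toy_long %>%
  pivot_wider(names_from = Mode,     # values 0 and 1 become column names
              values_from = count,   # counts fill the cells
              values_fill = 0)       # missing combinations become 0

print(toy_long)
print(toy_wide)
```

The new columns are named after the values of Mode, so they have to be referred to with backticks or quotes, for example toy_wide[["1"]].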
We will convert the previous output to wider format. Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.
# -------------------------------------------------------------------------------
# Task 6 - Convert previous output to wider format
cross_tab_1 <- tbl_summary_3 %>%
  pivot_wider(names_from = Mode,
              values_from = count,
              values_fill = 0)
Steps -
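1. Take the data (tbl) [then]
2. Group the rows by Mode and HHSize, count the rows in each group and convert the result to the wider format
3. Store the result in cross_tab_2
Copy the code from below and paste it in the Tutorial.R script below the previous code and run/source it.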
# -------------------------------------------------------------------------------
# Task 7 - Cross tabulate Mode and HHSize
cross_tab_2 <- tbl %>%
  group_by(Mode, HHSize) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from = Mode,
              values_from = count,
              values_fill = 0)
Here we will read a real dataset from an spss file and apply the previous knowledge to create a cross tabulation of Mode and Gender.
# -------------------------------------------------------------------------------
# Task 8 - Crosstab real data
# read data from spss file
crosstab_real <- read_sav("Dataset_assign1.sav") %>%
  # convert labelled spss variables to factors
  as_factor() %>%
  # group by Mode and Gender
  group_by(Mode, Gender) %>%
  # use summarise to count each group
  summarise(count = n()) %>%
  # convert to wider format
  pivot_wider(names_from = Mode,
              values_from = count,
              values_fill = 0)
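To see what as_factor() does without the assignment file, the sketch below builds a small labelled table by hand, mimicking the way read_sav() stores spss value labels. The labels (Transit/Car, Female/Male), the codes and the rows are assumptions made up for illustration; they are not taken from Dataset_assign1.sav. The sketch uses count() in place of group_by() and summarise().

```r
library(dplyr)
library(tidyr)
library(haven)

# Made-up labelled data mimicking how read_sav() stores spss value labels;
# the labels and codes here are assumptions for illustration
toy_sav <- tibble(
  Mode   = labelled(c(1, 0, 1, 1), labels = c(Transit = 0, Car = 1)),
  Gender = labelled(c(1, 1, 0, 1), labels = c(Female = 0, Male = 1))
)

toy_crosstab <- toy_sav %>%
  as_factor() %>%                          # replace numeric codes with their labels
  count(Mode, Gender, name = "count") %>%  # count rows per Mode and Gender
  pivot_wider(names_from = Mode,
              values_from = count,
              values_fill = 0)

print(toy_crosstab)
```

After as_factor(), the cross tab shows readable labels as row and column names instead of the numeric codes.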