Essentials for Data Analysis

Before we perform discrete choice analysis with Apollo in R, we have to import the data and may need to clean and transform it into the format required by Apollo. Sometimes we also want to explore the data to understand it better. In this document, we will cover some essentials -

Changing the working directory
Working with R script
Data analysis using Tidyverse package
- Initial code setup
- Read data from csv/spss file
- Basic verbs of dplyr
- Basics of tidyr for data transformation
- Crosstab

Changing the working directory

The working directory is the folder on the computer that R uses as the default location for any file we read or save. We can run the command getwd() in the console to see the current working directory. We can also see the current working directory at the top of the console.

I recommend creating a separate subfolder for each assignment inside the parent folder Discrete Choice Analysis. You can download the assignment dataset and store a copy of the data, i.e. the csv/spss file, in the corresponding assignment folder. Then you can change the current working directory to the assignment folder in one of two ways -

Using command setwd()

For example, I created a subfolder ‘Assignment-0’ inside the folder ‘Discrete Choice Analysis’, which is on the desktop. I can then copy the location from the address bar and set the current working directory as - setwd("C:/Desktop/Discrete Choice Analysis/Assignment-0"). Note that we use the forward slash /, not the backslash \, in the file location, because R treats a backslash inside a string as an escape character.

By clicking Session %>% Set Working Directory %>% Choose Directory…

We will read data and save outputs in the same assignment folder. Later, we can learn to read files from other locations on the computer and save the outputs elsewhere.
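The getwd()/setwd() sequence can be tried safely on a throwaway folder. The following is a minimal sketch, assuming a hypothetical assignment folder created under R's temporary directory instead of your real Desktop path:

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# Hypothetical assignment folder inside R's temporary directory
# (in practice, use your own path, e.g. the one copied from the address bar)
assignment_dir <- file.path(tempdir(), "Discrete Choice Analysis", "Assignment-0")
dir.create(assignment_dir, recursive = TRUE, showWarnings = FALSE)

# Change the working directory and confirm the change
setwd(assignment_dir)
new_wd <- getwd()
print(basename(new_wd))

# Go back to where we started
setwd(old_wd)
```

file.path() joins folder names with forward slashes, so the same code works on Windows and Mac.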

Working with R script

An R script is a plain text file containing multiple commands that can be executed at once, stored, and reused later. We can start a new script by clicking File %>% New File %>% R Script. Read %>% as "then". The script pane appears at the top left. Let’s save the script as trial.R (by clicking File %>% Save); we don’t need to type the extension .R, just as we don’t type the extension .png when saving an image. We can save the script in the same working directory. Let’s copy the following code chunk and paste it into the trial.R script. We can run the code in two main ways -

Run (top right of the script pane) runs the current line or the selected/highlighted portion
Source (top right of the script pane) runs the entire script

Try both Run and Source to run this code in trial.R. We can also see the variables in the Environment pane.

# Out of vehicle travel time
OVTT <- 5
# In vehicle travel time
IVTT <- 10

# total travel time
TT <- OVTT + IVTT

In an R script, we write comments using # at the beginning of the line. We use the <- operator to assign values; = also works for assignment here, but it is not recommended. In RStudio, the keyboard shortcut for the assignment operator <- is Alt + - (on Windows) or Option + - (on Mac).

Data analysis using Tidyverse package

We have already created a subfolder Assignment-0 in the Discrete Choice Analysis folder. We have to download the ‘Sample data.csv’ file and place it in this subfolder. We can then start a new R script and save it as Tutorial.R. We have already set the working directory (cross-check using the command getwd() in the console). In the subsequent sections, we will clear the workspace, load the tidyverse package and read data from ‘Sample data.csv’. Tidyverse is a collection of packages designed for data science in R. Here, we will very briefly explore two of them - dplyr and tidyr.

We will take a quick look at the verbs of the dplyr package for data manipulation, then explore the tidyr package to transform data between the longer and wider formats. Later, we will use some of these skills to cross-tabulate data. For each step below, I have added a code chunk that performs a specific task; you can copy it from here, paste it into the Tutorial.R script and run it.

Initial code setup

Before we write a set of commands, we should clear the workspace, i.e. remove all the variables created earlier. Then we load the packages into memory using the command library("insert package name here").

Copy the code below, paste it into the Tutorial.R file and run/source it.

# -------------------------------------------------------------------------------
# STEP 1 - INITIAL CODE SETUP

# clear Workspace/Environment
rm(list = ls())

# load packages
library("tidyverse") # to perform data analysis
library("haven") # to read spss file

Read data from csv/spss file

We will read the csv file using the function read_csv("insert file name here.csv"), which stores the data as a tibble, a data type similar to a data frame. We can double-click tbl in the Environment pane to view it. We can also read spss files using the function read_sav("insert file name here.sav").

Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.

# -------------------------------------------------------------------------------
# STEP 2 - READ DATA

# read csv file and store it as a variable of data type tibble
tbl <- read_csv("Sample Data.csv")

Description of tbl

ID is the identifier for individuals
In column Mode, 1 represents car and 0 represents transit
In column Gender, 1 represents male and 0 represents female
HHSize stands for Household size
TTcar and TTtrans represent travel time of car and transit
TCcar and TCtrans represent travel cost of car and transit
Transfer represents the number of transfers required in transit

If you are interested in more details on the tibble data type, see chapter 10 of the book R for Data Science.
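To see what read_csv() returns without depending on the course file, here is a minimal sketch. It assumes a made-up three-row CSV written to a temporary file; the column names follow the description above, but the values and the file name toy_sample.csv are hypothetical.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (hypothetical values,
# using the column names described above)
csv_path <- file.path(tempdir(), "toy_sample.csv")
writeLines(c(
  "ID,Mode,Gender,HHSize,TTcar,TTtrans,TCcar,TCtrans,Transfer",
  "1,1,0,3,20,35,4.5,2.0,1",
  "2,0,1,2,25,30,5.0,2.5,0",
  "3,1,1,4,15,40,3.5,3.0,2"
), csv_path)

# read_csv() returns a tibble; col_types = cols() lets readr guess the
# column types without printing the column specification message
toy_tbl <- read_csv(csv_path, col_types = cols())

# Quick checks on what was read
dim(toy_tbl)     # number of rows and columns
names(toy_tbl)   # column names
class(toy_tbl)   # includes "tbl_df", i.e. a tibble
```

Running dim(), names() and class() right after reading a file is a cheap way to confirm that the data arrived in the expected shape.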

Basic verbs of dplyr package

dplyr is a package included in Tidyverse which provides a set of verbs for efficiently manipulating datasets.

select() to select specific variables based on their names
filter() to select rows based on values of some variables
mutate() to add new variables
arrange() to reorder the rows
summarise() to aggregate values of rows

You can read the details in chapter 5 of the book R for Data Science. You can also see the documentation on the dplyr website. We will quickly apply these verbs to perform a few tasks below. I have written down the steps of data manipulation and attached the corresponding code; you will find that writing code in R is easier than writing the steps in English. We will use the pipe operator %>% (read as "then") to feed the output of the previous step into the next step.
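Before combining the verbs, it may help to see each one on its own. The sketch below assumes a small made-up tibble called toy (hypothetical values, with a few of the column names from tbl) and applies one verb per line.

```r
library(dplyr)

# Made-up toy data (hypothetical values, column names borrowed from tbl)
toy <- tibble(
  ID      = 1:4,
  Mode    = c(1, 0, 1, 0),
  TTcar   = c(20, 25, 15, 30),
  TTtrans = c(35, 30, 40, 28)
)

toy %>% select(ID, Mode)                     # keep only two columns
toy %>% filter(Mode == 1)                    # keep only the car rows
toy %>% mutate(TTdiff = TTcar - TTtrans)     # add a new column
toy %>% arrange(TTcar)                       # sort rows by TTcar
toy %>% summarise(mean_TTcar = mean(TTcar))  # collapse to a single row
```

None of these lines changes toy itself; each verb returns a new tibble, which is why the tasks below assign the result to a new name.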

Task 1 - Manipulate data

Steps -

Start with data (tbl) [then]
filter only cars (Mode = 1) [then]
select only variables/columns - Mode, TTcar, TTtrans, Gender [then]
add a new variable TT which is equal to TTcar - TTtrans [then]
arrange in increasing order of TT [then]
assign the result to tbl_car

Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.

# -------------------------------------------------------------------------------
# Task 1 - Manipulate data

tbl_car <- tbl %>%
  filter(Mode == 1) %>%
  select(Mode, TTcar, TTtrans, Gender) %>%
  mutate(TT = TTcar - TTtrans) %>%
  arrange(TT)
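To make the role of %>% concrete, the sketch below writes a similar chain twice on made-up data (hypothetical values): once with the pipe and once as nested function calls. Both versions should give the same result; the pipe only changes the reading order.

```r
library(dplyr)

# Made-up toy data (hypothetical values)
toy <- tibble(
  Mode    = c(1, 0, 1),
  TTcar   = c(20, 25, 15),
  TTtrans = c(35, 30, 40)
)

# With the pipe: read from top to bottom
piped <- toy %>%
  filter(Mode == 1) %>%
  mutate(TT = TTcar - TTtrans) %>%
  arrange(TT)

# Without the pipe: the same calls, nested from the inside out
nested <- arrange(mutate(filter(toy, Mode == 1), TT = TTcar - TTtrans), TT)

# Compare the two results
all.equal(piped, nested)
```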

Task 2 - Debug the code chunk

Steps -

Start with data (tbl) [then]
select only variables/columns - Mode, TCcar, TCtrans, Gender [then]
filter only transit (Mode = 0) [then]
add a new variable TC which is equal to TCcar - TCtrans [then]
arrange in decreasing order of TC (google it) [then]
assign the result to tbl_Trans

Copy the code below and paste it into the Tutorial.R script below the previous code. Uncomment it using Ctrl + Shift + C and run/source it. You will get an error because there are mistakes in the code chunk; please correct them.

# -------------------------------------------------------------------------------
# Task 2 - Debug the code chunk


# tbl_Trans <- tbl >%>
#   select(Mode, TCcar, TCtrans, Gender) %>%
#   filter(Mode = 0)
#   arrange(TC) %>%
#   mutate(TC = TCcar - TCtrans)

Task 3 - Summarise

Steps -

Start with data (tbl) [then]
filter only cars (Mode = 1) [then]
summarise the data
- count the observations
- count the number of males
- calculate mean household size
assign the result to tbl_summary_1

Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.

# -------------------------------------------------------------------------------
# Task 3 - Summarise

tbl_summary_1 <- tbl %>%
  filter(Mode == 1) %>%
  summarise(count = n(),
            total_male = sum(Gender),
            mean_HHSize = mean(HHSize))
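Note that sum(Gender) counts males only because Gender is coded as 1 for male and 0 for female. The sketch below uses made-up data (hypothetical values) to show one further point: if a column contains missing values, which is an assumption here and may not apply to the sample data, mean() returns NA unless na.rm = TRUE is supplied.

```r
library(dplyr)

# Made-up toy data with one missing household size (hypothetical values)
toy <- tibble(
  Mode   = c(1, 1, 1, 0),
  Gender = c(1, 0, 1, 1),
  HHSize = c(3, NA, 4, 2)
)

toy_summary <- toy %>%
  filter(Mode == 1) %>%
  summarise(count       = n(),                        # rows with Mode == 1
            total_male  = sum(Gender),                # works because male = 1
            mean_naive  = mean(HHSize),               # NA because of the missing value
            mean_HHSize = mean(HHSize, na.rm = TRUE)) # ignores the missing value
toy_summary
```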

Task 4 - Summarise by group

Steps -

Start with data (tbl) [then]
group by Mode [then]
summarise the data
- count the observations
- count Males
- calculate mean household size
assign the result to tbl_summary_2

Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.

# -------------------------------------------------------------------------------
# Task 4 - Summarise by group

tbl_summary_2 <- tbl %>%
  group_by(Mode) %>%
  summarise(count = n(),
            total_male = sum(Gender),
            mean_HHSize = mean(HHSize))

Task 5 - Summarise by multiple groups

Steps -

Start with data (tbl) [then]
group by Mode and Gender [then]
summarise the data
- count the observations
assign the result to tbl_summary_3

Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.

# -------------------------------------------------------------------------------
# Task 5 - Summarise by multiple groups

tbl_summary_3 <- tbl %>%
  group_by(Mode, Gender) %>%
  summarise(count = n())

The last result looks very similar to a crosstab but is arranged in a different format, i.e. the longer format. We need to change it to the wider format. Tibbles can be easily transformed using the tidyr package. We will take a quick look at this package.
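As an aside on Task 5, summarise() after grouping by two variables prints an informational message about how the result is grouped. The sketch below, on made-up data (hypothetical values), shows two ways that should give the same counts without that message: the .groups argument of summarise() and the dplyr shortcut count().

```r
library(dplyr)

# Made-up toy data (hypothetical values)
toy <- tibble(
  Mode   = c(1, 1, 0, 0, 1),
  Gender = c(1, 0, 1, 1, 1)
)

# group_by() + summarise(); .groups = "drop" returns an ungrouped tibble
by_hand <- toy %>%
  group_by(Mode, Gender) %>%
  summarise(count = n(), .groups = "drop")

# count() does the grouping and counting in one step
shortcut <- toy %>%
  count(Mode, Gender, name = "count")

# Compare the two results
all.equal(by_hand, shortcut)
```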

Basics of tidyr for data transformation

The longer and wider formats are illustrated with the help of the figure below.

Illustration of longer and wider format

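Here, we will look at only one function, pivot_wider(). Its documentation is on the tidyr website. We have to specify a few inputs/arguments - names_from (the column whose values become the new column names), values_from (the column whose values fill the cells) and values_fill (the value used for combinations that do not occur in the data).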
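The three arguments are easiest to see on a very small table. The sketch below assumes a made-up longer-format tibble (hypothetical values) in which one Gender/Mode combination is missing, so that the effect of values_fill is visible.

```r
library(dplyr)
library(tidyr)

# Made-up longer-format data: no row for Gender 0 with transit
long <- tibble(
  Gender = c(0, 1, 1),
  Mode   = c("car", "car", "transit"),
  count  = c(5, 7, 3)
)

wide <- long %>%
  pivot_wider(names_from  = Mode,    # values of Mode become column names
              values_from = count,   # values of count fill the cells
              values_fill = 0)       # missing combinations become 0
wide
```

The result has one row per Gender and one column per Mode. Without values_fill, the missing combination would appear as NA instead of 0.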

Task 1 - convert to wider format

We will convert the previous output to wider format. Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.

# -------------------------------------------------------------------------------
# Task 1 - Convert previous output to wider format
cross_tab_1 <- tbl_summary_3 %>%
  pivot_wider(names_from = Mode,
              values_from = count, 
              values_fill = 0)

Task 2 - create crosstab from scratch

Steps -

Start with data (tbl) [then]
group by Mode and HHSize [then]
summarise the data
- count the observations
convert to wider format [then]
assign the result to cross_tab_2

Copy the code from below and paste in the Tutorial.R script below the previous code and run/source it.

# -------------------------------------------------------------------------------
# Task 2 - Create crosstab from scratch

cross_tab_2 <- tbl %>%
  group_by(Mode, HHSize) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from = Mode,
              values_from = count,
              values_fill = 0)
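Because Mode is coded 0/1, the new columns of the crosstab are named 0 and 1, which are awkward to refer to later. The sketch below, on made-up data (hypothetical values), uses the names_prefix argument of pivot_wider() to produce more readable names. The prefix "Mode_" is only an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Made-up toy data (hypothetical values)
toy <- tibble(
  Mode   = c(1, 1, 0, 0, 1),
  HHSize = c(2, 3, 2, 2, 3)
)

toy_cross <- toy %>%
  group_by(Mode, HHSize) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from   = Mode,
              values_from  = count,
              names_prefix = "Mode_",  # columns become Mode_0 and Mode_1
              values_fill  = 0)
toy_cross
```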

Crosstab the real data

Here we will read a real dataset from an spss file and apply what we have learned to create a cross tabulation of Mode and Gender.

# -------------------------------------------------------------------------------
# Crosstab the real data

# read data from spss file
crosstab_real <- read_sav("Dataset_assign1.sav") %>%
  # convert labelled spss variables to factors
  as_factor() %>%
  # group by Mode and Gender
  group_by(Mode, Gender) %>%
  # use summarise to count each group
  summarise(count = n()) %>%
  # convert to wider format
  pivot_wider(names_from = Mode,
              values_from = count,
              values_fill = 0)
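The as_factor() step matters because read_sav() keeps spss value labels as labelled numbers rather than text. The sketch below illustrates this on a made-up file. The data values, the labels and the file name toy_data.sav are all hypothetical and are not taken from Dataset_assign1.sav.

```r
library(dplyr)
library(haven)

# Made-up data with spss-style value labels (hypothetical values and labels)
toy <- tibble(
  Mode   = labelled(c(1, 0, 1, 1), labels = c(transit = 0, car = 1)),
  Gender = labelled(c(1, 1, 0, 1), labels = c(female = 0, male = 1))
)

# Write a temporary .sav file and read it back
sav_path <- file.path(tempdir(), "toy_data.sav")
write_sav(toy, sav_path)

as_read      <- read_sav(sav_path)   # columns are labelled numbers
with_factors <- as_factor(as_read)   # columns become factors with text levels

class(as_read$Mode)
levels(with_factors$Mode)
```

After as_factor(), grouping by Mode and Gender should produce a crosstab whose rows and columns carry the text labels instead of the numeric codes.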

2/19/2021
