In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.
GitHub repository: https://github.com/acatlin/FALL2019TIDYVERSE
You have two tasks:
Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Extend an Existing Example. Using one of your classmate’s examples (as created above), extend his or her example with additional annotated code. (15 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. Minimally, you should be submitted .Rmd files; ideally, you should also submit an .md file and update the README.md file with your example.
After you’ve completed both parts of the assignment, please submit your GitHub handle name in the submission link provided in the week 1 folder! This will let your instructor know that your work is ready to be graded.
You should complete both parts of the assignment and make your submission no later than the end of day on Sunday, December 1st.
There are currently 8 packages in the tidyverse
package bundle including:
dplyr
: a set of tools for efficiently manipulating datasets;forcats
: a package for manipulating categorical variables / factors;ggplots2
: a classic package for data visualization;purrr
: another set of tools for manipulating datasets, specially vecters, a complement to dplyr
;readr
: a set of faster and more user friendly functions to read data than R default functions;stringr
: a package for common string operations;tibble
:a package for reimagining data.frames in a modern way;tidyr
: a package for reshaping data, a complement to dplyr
.In this assignment, I will use some handy functions in tidyverse
package to perform data cleaning.
library(tidyverse)
The dataset in this project is called ‘London bike sharing dataset’ from https://www.kaggle.com/datasets; The description of data is as below:
"timestamp" - timestamp field for grouping the data
"cnt" - the count of a new bike shares
"t1" - real temperature in C
"t2" - temperature in C "feels like"
"hum" - humidity in percentage
"wind_speed" - wind speed in km/h
"weather_code" - category of the weather
"is_holiday" - boolean field - 1 holiday / 0 non holiday
"is_weekend" - boolean field - 1 if the day is weekend
"season" - category field meteorological seasons: 0-spring ; 1-summer; 2-fall; 3-winter.
"weathe_code" category description:
1 = Clear ; mostly clear but have some values with haze/fog/patches of fog/ fog in vicinity
2 = scattered clouds / few clouds
3 = Broken clouds
4 = Cloudy
7 = Rain/ light Rain shower/ Light rain
10 = rain with thunderstorm
26 = snowfall
94 = Freezing Fog
I use read_csv
function from readr
to import csv file.
data_raw <- read_csv("https://raw.githubusercontent.com/oggyluky11/Data607-Tidyverse_Vignette/master/london_merged.csv")
## Parsed with column specification:
## cols(
## timestamp = col_datetime(format = ""),
## cnt = col_double(),
## t1 = col_double(),
## t2 = col_double(),
## hum = col_double(),
## wind_speed = col_double(),
## weather_code = col_double(),
## is_holiday = col_double(),
## is_weekend = col_double(),
## season = col_double()
## )
head(data_raw)
There is a default function read.csv
in R to import CSV files. However, the comment below shows that the readr::read_csv performs much faster than the default read.csv
.
#readr function
system.time(d<-read_csv("https://raw.githubusercontent.com/oggyluky11/Data607-Tidyverse_Vignette/master/london_merged.csv"))
## user system elapsed
## 0.05 0.02 0.11
#default function
system.time(d<-read.csv("https://raw.githubusercontent.com/oggyluky11/Data607-Tidyverse_Vignette/master/london_merged.csv"))
## user system elapsed
## 0.87 0.04 1.03
case_when
is a handly function when we want to modify values in a column according to a predefined logic. It the dataset above, all the factors are represented by numerical values. I use case_when
to assign text strings ‘spring’, ‘summer’, ‘fall’, ‘winter’ back to their corresponding numerical representations ‘0’, ‘1’, ‘2’, ‘3’. Note that this function is handly because the original data type in the data set is ‘double’ for value 0,1,2,3, use case_when
to rewrite numerical values by text values doesn’t trigger any errors.
season <- data_raw$season
season <- case_when(season == 0 ~ 'spring',
season == 1 ~ 'summer',
season == 2 ~ 'fall',
season == 3 ~ 'winter')
table(data_raw$season)
##
## 0 1 2 3
## 4394 4387 4303 4330
table(season)
## season
## fall spring summer winter
## 4303 4394 4387 4330
While case_When
is a general process to modify values according to predefined logics, another function recode
can do the task in this example in a more convenient way. case_when
is more generalized if there are more complicated logics applied, such as mathematical functions.
season <- recode(data_raw$season, '0'='spring' , '1'='summer', '2'='fall', '3'='winter')
table(season)
## season
## fall spring summer winter
## 4303 4394 4387 4330
An important data transformation is converting data between ‘long format’ and ‘wide format’. The gather
and spread
functions in tidyr
package do the job easily. Please note that I use these two functions here for demostration purpose, the use of these two functions doesn’t have actual meanings to this example.
gather
. Here I will reshape the last 4 columns in this dataset, - which are ‘is_holiday’, ‘is_weekend’, ‘season’, ‘weather_code’, into two columns. One column is called ‘atrribute’ which contains the name of the original columns, the other is called ‘value’ which contains the original column values. Doing this operation we converted the original data from ‘wide format’ into ‘long format’. This operation benefits when we want to consolidate values from columns of similar natures. For example. when we have location names as columns names in our dataset, we may want to consolidate values from all locations into one column for better analysis. In this example, gathering columns ‘is_holiday’, ‘is_weekend’, ‘season’, ‘weather_code’ is meaningless, so I will reverse the process in the following chunk.gather_data <- gather(data_raw,key = 'atrribute', value = 'value', 7:10)
gather_data
table(gather_data$atrribute)
##
## is_holiday is_weekend season weather_code
## 17414 17414 17414 17414
spread
. This function is used to convert data from ‘long format’ into ‘wide format’. In the example above, we have a column called ‘atrribute’, and its correspoding value column ‘values’ and I want to break this ‘long’ column into ‘shorter’ columns by the factors in atrribute
. There are four factors ‘is_holiday’, ‘is_weekend’, ‘season’ and ‘weather_code’, therefore when spread
is applied, 4 new columns are created, each for one of the four factors.We can see that gather
and spread
are reverse processes of each other. When we gather and again spread out the dataset, we get our original dataset again.
spread_data <- spread(gather_data,key = 'atrribute', value = 'value')
spread_data
all.equal(spread_data, data_raw)
## [1] TRUE
mutate
function adds new variables into the dataframe and presevers the existing ones. For example, if I want to calculate the Fahrenheit degrees of ‘t1’ and ‘t2’ and retain ‘t1’ and ‘t2’, I do the following:
data_f <- mutate(data_raw, f1 = t1*9/5+32, f2 = t2*9/5+32)
data_f
the select
function is straightforward, I can select the functions that I want. If I want to drop columns such as the ‘t1’ and ‘t2’ in the example above, the select
function is also handy.
data_f_selected <- select(data_f, -t1, -t2)
data_f_selected
str_extract
is used when I want to get content from a string according to a comment pattern denoted by regular expression. Regular expression is out of the scope of this assignment so I won’t go further into details. In this example I used str_extract
the first 4 digit combo in column ‘timestamp’ which is the year.
data_year <- mutate(data_f, year = str_extract(timestamp, '[0-9]{4}'))
data_year
Another way to get the year column is to convert the column into ‘Date’ format then use seperate
function to seperate ‘year’, ‘month’ and ‘date’ into three columns.
data_year <- mutate(data_f, timestamp = as.Date(timestamp))
data_year <- separate(data_year, timestamp, c('year','month','date'))
data_year
The unite
function is used to comcatnate values of two columns into a new column. The argument ‘remove’ is to indicate whether to remove the original columns that to be united.
data_unite <- unite(data_year,year_month, year,month, sep = '-', remove = FALSE)
data_unite
The group_by
function groups the dataframe into groups to allow the following operations to be performed by group. In this example I group the data by the column year_month
, then calculate the mean of each ‘year-month’ group.
data_group <- group_by(data_unite, year_month)
data_group_mean_cnt <- mutate(data_group, mean_cnt = mean(cnt))
data_group_mean_cnt
unique(data_group_mean_cnt$year_month)
## [1] "2015-01" "2015-02" "2015-03" "2015-04" "2015-05" "2015-06" "2015-07"
## [8] "2015-08" "2015-09" "2015-10" "2015-11" "2015-12" "2016-01" "2016-02"
## [15] "2016-03" "2016-04" "2016-05" "2016-06" "2016-07" "2016-08" "2016-09"
## [22] "2016-10" "2016-11" "2016-12" "2017-01"
unique(data_group_mean_cnt$mean_cnt)
## [1] 814.6632 810.1252 941.7240 1156.5814 1203.5121 1441.0767 1514.4419
## [8] 1389.7191 1255.2433 1175.3342 952.6470 814.6459 782.9543 861.7878
## [15] 900.5857 1069.3255 1346.6868 1324.6496 1572.9109 1536.9108 1462.1069
## [22] 1259.3620 978.9416 876.2204 523.3333
The pipe operator ‘%>%’,comes from the magrittr
package, are embeded in tidyverse
. It enables a handy coding by inserting the output of the preceding code into the following code by the operator ‘%>%’, which makes R coding cleaner, easier and more straightforward visually. I can combine all the codes in all previous example chunks into one piece of code as below:
data <- data_raw %>%
mutate(season = case_when(season == 0 ~ 'spring',
season == 1 ~ 'summer',
season == 2 ~ 'fall',
season == 3 ~ 'winter'),
weather_code = recode(weather_code,
'1' = 'Clear',
'2' = 'scattered clouds / few clouds',
'3' = 'Broken clouds',
'4' = 'Cloudy',
'7' = 'Rain/ light Rain shower/ Light rain',
'10' = 'rain with thunderstorm',
'26' = 'snowfall',
'94' = 'Freezing Fog'),
is_holiday = recode(is_holiday, '1' = 'Yes', '0' = 'No'),
is_weekend = recode(is_weekend,'1' = 'Yes', '0' = 'No'),
f1 = t1*9/5+32,
f2 = t2*9/5+32,
timestamp = as.Date(timestamp)) %>%
select(-t1, -t2) %>%
separate(timestamp, c('year','month', 'date')) %>%
unite(year_month, year, month, sep = '-', remove = FALSE) %>%
group_by(year_month) %>%
mutate(mean_cnt = mean(cnt)) %>%
ungroup()
data
ggplot2
is a classific package for ploting the data. In this example I plot a bar chart with column year_month
as x-axis and mean_cnt
as y-axis.
data %>%
group_by(year_month) %>%
summarise(mean_cnt = mean(cnt)) %>%
ungroup() %>%
ggplot(aes(x = year_month, y = mean_cnt, fill = mean_cnt, label = round(mean_cnt))) +
geom_col() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = 'Average New Bike Shares by Month',
subtitle = '2015-01 to 2017-01')+
xlab('Year-Month') +
ylab('Average Count') +
scale_fill_continuous(name = 'Average Count')+
geom_text(angle = 90, vjust = 0.4, hjust = 1.1, color = 'white', size = 3)