Data Wrangling and Visualisation

Before there was R, there was S!

R is a dialect of the S language, which was developed in 1976 by Rick Becker and John Chambers at Bell Laboratories.

Rick Becker gave an excellent keynote talk, “Forty Years of S”, at the useR! 2016 conference. In it he described the development of the S language, which explains many characteristics of R as we know it, including the “<-” assignment operator.

In 1993 Bell Labs gave StatSci (later Insightful Corp.) an exclusive license to develop and sell the S language. Insightful sold its implementation of the S language under the product name S-PLUS.

You can read more about the history of S, R, and S-PLUS

then, R was born

In the early nineties, R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland.

They released R under the GNU General Public License, making it free, open-source software.

Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996

Currently R is developed by the R Development Core Team, of which John Chambers is a member.

Write R Code

To start using R you need to:

Install R (and RStudio)
Launch it and set your working directory: letting R know where to find all of your files.
Start writing R code!

Tip: When you start working on new R code or a new R project in the RStudio IDE, use File -> New Project. This sets up your working directory when the project is created and saves all your files in it. The next time you open the project, its directory is set as the working directory automatically. Projects help with much more besides.

Before Tidyverse R, there is Base R!

When you download and install R for the first time, you are installing the Base R software. Base R contains most of the functions you’ll use on a daily basis: mean(), subset()…
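As a quick illustration of the two base functions named above, here is a minimal sketch that uses mtcars, a small data set shipped with R (it is not part of the original slides). The object names avg_mpg and efficient are made up for this example.

```r
# Base R only: no packages need to be installed or loaded.
# mtcars is a small data set that ships with R.
avg_mpg <- mean(mtcars$mpg)   # average miles per gallon
avg_mpg

# subset() keeps the rows matching a condition and the chosen columns
efficient <- subset(mtcars, mpg > 30, select = c(mpg, cyl, wt))
efficient
```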

To learn about R’s basic operations, data structures and base functions you could look at one of R-Ladies Manchester’s handouts: Introduction to base R.

If you want to access data and code written by other people, you’ll need to install it as a package. An R package is a bundle of functions (code), data, documentation, vignettes (examples), stored in one neat place.

“In R, the fundamental unit of shareable code is the package.” Hadley Wickham

The verse!

The tidyverse is an opinionated collection of R packages for data science.

install.packages("tidyverse")

library(tidyverse)

Have you tried learning data science by reading books?

R for Data Science by Garrett Grolemund & Hadley Wickham

Have you tried learning data science by posting your questions and discussing it with other people within the R community?

RStudio Community

The `dplyr` Package:

provides a “grammar” (the verbs) for data manipulation and for operating on data frames. The key operator and the essential verbs are:

%>%: the “pipe” operator used to connect multiple verb actions together into a pipeline.
select(): return a subset of the columns of a data frame.
mutate(): add new variables/columns or transform existing variables.
filter(): extract a subset of rows from a data frame based on logical conditions.
arrange(): reorder rows of a data frame according to single or multiple variables.
summarise() / summarize(): reduces each group to a single row by calculating aggregate measures.

Chicago Data

Description: Chicago daily air pollution and death rate data. A data frame with 7 columns and 5114 rows. Each row refers to one day. The columns are:

• death: total deaths (per day)
• pm10median: median particles in 2.5-10 per cubic m
• pm25median: median particles < 2.5 mg per cubic m (more dangerous)
• o3median: ozone in parts per billion
• so2median: median sulphur dioxide measurement
• time: time in days
• tmpd: temperature in Fahrenheit

1st look at the data: `dim()` & `head()`

# install.packages("gamair")
library(gamair)
data(chicago)
dim(chicago)

## [1] 5114    7

head(chicago)

##   death pm10median pm25median  o3median  so2median    time tmpd
## 1   130 -7.4335443         NA -19.59234  1.9280426 -2556.5 31.5
## 2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0
## 3   101 -0.8265306         NA -20.21734 -1.8914161 -2554.5 33.0
## 4   135  5.5664557         NA -19.67567  6.1393413 -2553.5 29.0
## 5   126         NA         NA -19.21734  2.2784649 -2552.5 32.0
## 6   130  6.5664557         NA -17.63400  9.8585839 -2551.5 40.0

Examine the structure of the data: `str()`

str(chicago)

## 'data.frame':    5114 obs. of  7 variables:
##  $ death     : int  130 150 101 135 126 130 129 109 125 153 ...
##  $ pm10median: num  -7.434 NA -0.827 5.566 NA ...
##  $ pm25median: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ o3median  : num  -19.6 -19 -20.2 -19.7 -19.2 ...
##  $ so2median : num  1.928 -0.986 -1.891 6.139 2.278 ...
##  $ time      : num  -2556 -2556 -2554 -2554 -2552 ...
##  $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...

The output could look messy and it might not fit the screen if you’re dealing with a big data set that has lots of variables!

Do it in a tidy way: `glimpse()`

suppressPackageStartupMessages(library(dplyr))
glimpse(chicago)

## Observations: 5,114
## Variables: 7
## $ death      <int> 130, 150, 101, 135, 126, 130, 129, 109, 125, 153, 1...
## $ pm10median <dbl> -7.4335443, NA, -0.8265306, 5.5664557, NA, 6.566455...
## $ pm25median <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ o3median   <dbl> -19.592338, -19.038614, -20.217338, -19.675671, -19...
## $ so2median  <dbl> 1.9280426, -0.9855631, -1.8914161, 6.1393413, 2.278...
## $ time       <dbl> -2556.5, -2555.5, -2554.5, -2553.5, -2552.5, -2551....
## $ tmpd       <dbl> 31.5, 33.0, 33.0, 29.0, 32.0, 40.0, 34.5, 29.0, 26....

The pipeline operator: `%>%`

Left Hand Side (LHS) %>% Right Hand Side (RHS)

x %>% f(y)
is equivalent to
f(x, y)

The “pipe” passes the result of the LHS as the first argument of the function on the RHS.

3 %>% sum(4) <==> sum(3, 4)

%>% is very practical for chaining together multiple dplyr functions in a sequence of operations.
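To see why chaining helps, the sketch below computes the same value twice: once with nested calls and once with the pipe. The vector x and the names nested and piped are made up for illustration.

```r
library(dplyr)  # attaches the %>% operator

x <- c(4, 9, 16)

# Nested calls are read inside-out ...
nested <- round(mean(sqrt(x)), 1)

# ... while the piped version reads left to right, one step at a time
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions give the same result. The piped one simply lists the steps in the order in which they happen.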

`select()`

starts_with("X"): every name that starts with “X”.
ends_with("X"): every name that ends with “X”.
contains("X"): every name that contains “X”.
matches("X"): every name that matches “X”, where “X” can be a regular expression.
num_range("x", 1:5): the variables named x1, x2, x3, x4, x5 (add width = 2 to match x01, x02, ...).
one_of(x): every name that appears in x, which should be a character vector (superseded by any_of() and all_of() in recent releases).

head(chicago, n = 2)

##   death pm10median pm25median  o3median  so2median    time tmpd
## 1   130  -7.433544         NA -19.59234  1.9280426 -2556.5 31.5
## 2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0

Select your variables

chicago_air_measurements <- select(chicago, ends_with("median"))
head(chicago_air_measurements, n = 1)

##   pm10median pm25median  o3median so2median
## 1  -7.433544         NA -19.59234  1.928043

chicago_air_pm <- chicago[c("pm10median", "pm25median")]
head(chicago_air_pm, n = 1)

##   pm10median pm25median
## 1  -7.433544         NA

chicago_air_pm2 <- select(chicago, starts_with("pm"))
head(chicago_air_pm2, n = 1)

##   pm10median pm25median
## 1  -7.433544         NA
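The remaining helpers work the same way as ends_with() and starts_with(). Here is a minimal sketch on a hypothetical toy data frame (toy and its column names are made up for illustration, they are not part of the chicago data):

```r
library(dplyr)

# Hypothetical toy data frame, made up for illustration
toy <- data.frame(id = 1:2, x1 = c(0.1, 0.2), x2 = c(1.5, 2.5),
                  temp_min = c(3, 4), temp_max = c(9, 12))

names(select(toy, contains("temp")))       # temp_min, temp_max
names(select(toy, matches("^x[0-9]$")))    # regular expression: x1, x2
names(select(toy, num_range("x", 1:2)))    # x1, x2
names(select(toy, -starts_with("temp")))   # a minus sign drops columns: id, x1, x2
```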

`mutate()`

mutate() adds new columns or transforms existing ones. For example, it lets you add to the data frame df a new column, z, which is the product of the columns x and y:

mutate(df, z = x * y)

Let us convert °F into °C: T(°C) = (T(°F) - 32) × 5/9

chicago2 <- mutate(chicago, tmpdc = round((tmpd - 32) / 1.8, digits = 1)) 
head(chicago2, n = 3)

##   death pm10median pm25median  o3median  so2median    time tmpd tmpdc
## 1   130 -7.4335443         NA -19.59234  1.9280426 -2556.5 31.5  -0.3
## 2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0   0.6
## 3   101 -0.8265306         NA -20.21734 -1.8914161 -2554.5 33.0   0.6
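A single mutate() call can create several columns, and later ones may use columns created earlier in the same call. The sketch below shows this on a hypothetical toy data frame. The names temps, temps2 and above_freezing are made up for illustration.

```r
library(dplyr)

# Hypothetical readings in Fahrenheit, made up for illustration
temps <- data.frame(day = 1:3, tmpd = c(32, 50, 86))

temps2 <- mutate(temps,
                 tmpdc = (tmpd - 32) * 5 / 9,   # Fahrenheit to Celsius
                 above_freezing = tmpdc > 0)    # uses the column just created
temps2
```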

`filter()`

R provides a set of comparison operators that you can use inside filter():

x < y: TRUE if x is less than y
x <= y: TRUE if x is less than or equal to y
x == y: TRUE if x equals y
x != y: TRUE if x does not equal y
x >= y: TRUE if x is greater than or equal to y
x > y: TRUE if x is greater than y
x %in% c(a, b, c): TRUE if x is in the vector c(a, b, c)

Filter your data

high_death <- filter(chicago2, death > 200) 
high_death

##   death pm10median pm25median  o3median  so2median  time tmpd tmpdc
## 1   226  20.941667         NA 29.703545  2.2685856 559.5 91.5  33.1
## 2   411  14.798103         NA 28.115091  0.6976599 560.5 86.0  30.0
## 3   287  -8.333333         NA 21.115009 -0.9330126 561.5 83.0  28.3
## 4   228  -3.232732         NA  5.649732 -2.3158882 562.5 78.5  25.8

high_temp_death <- filter(chicago2, death > 200 & tmpdc >= 30)
high_temp_death

##   death pm10median pm25median o3median so2median  time tmpd tmpdc
## 1   226   20.94167         NA 29.70355 2.2685856 559.5 91.5  33.1
## 2   411   14.79810         NA 28.11509 0.6976599 560.5 86.0  30.0
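Two details are worth trying out on a small example: conditions separated by commas are combined with "and", and rows where a condition evaluates to NA are dropped. The toy data frame obs below is hypothetical and made up for illustration.

```r
library(dplyr)

# Hypothetical toy data frame, made up for illustration
obs <- data.frame(city  = c("A", "B", "C", "D"),
                  death = c(120, 210, NA, 95),
                  tmpdc = c(31, 28, 35, 12))

# Conditions separated by commas are combined with "and"
hot_and_high <- filter(obs, death > 100, tmpdc >= 30)
hot_and_high

# %in% tests membership in a vector
two_cities <- filter(obs, city %in% c("A", "D"))
two_cities

# Rows where the condition is NA (city C) are dropped, not kept
over_100 <- filter(obs, death > 100)
over_100
```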

`arrange()`

is used to reorder rows of a data frame (df) according to one or more of its variables/columns.

If you pass arrange() a character variable, R will rearrange the rows in alphabetical order according to values of the variable.
If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

Arranging your data

low_2_high <- arrange(chicago, death)
head(low_2_high, n = 4)

##   death pm10median pm25median   o3median   so2median    time tmpd
## 1    69  -1.818182         NA  -8.029279  1.12452237  1313.5 64.5
## 2    73 -19.320548         NA  -5.869187  0.07297014  2052.5 66.0
## 3    77  -8.801262         NA -13.170360 -3.48994781 -2363.5 64.5
## 4    77 -19.165746  -10.14961   3.436157  3.60026234  1646.5 70.0

high_2_low <- arrange(chicago, desc(death))
head(high_2_low, n = 4)

##   death pm10median pm25median  o3median  so2median  time tmpd
## 1   411  14.798103         NA 28.115091  0.6976599 560.5 86.0
## 2   287  -8.333333         NA 21.115009 -0.9330126 561.5 83.0
## 3   228  -3.232732         NA  5.649732 -2.3158882 562.5 78.5
## 4   226  20.941667         NA 29.703545  2.2685856 559.5 91.5
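The difference between character and factor ordering, and the use of several sort keys, can be checked on a small example. The data frame df and its columns below are hypothetical and made up for illustration.

```r
library(dplyr)

# Hypothetical toy data frame, made up for illustration
df <- data.frame(
  size  = c("small", "large", "medium", "small"),
  price = c(3, 9, 6, 2),
  stringsAsFactors = FALSE
)

# Character column: alphabetical order (large, medium, small, small)
by_name <- arrange(df, size)
by_name

# Factor column: order of the levels (small, small, medium, large)
df$size_f <- factor(df$size, levels = c("small", "medium", "large"))
by_level <- arrange(df, size_f)
by_level

# Several keys: ties in the first key are broken by the second
by_both <- arrange(df, size_f, desc(price))
by_both
```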

`summarise()`

summarise() uses the same syntax as mutate(), but where mutate() adds an entire new column, summarise() collapses the data into a single row.
It builds a new dataset that contains only the summarising statistics.

Let us use summarise() to print out a summary of the chicago data containing two variables: max_death and max_tmpd

summarise(chicago, max_death = max(death), max_tmpd = max(tmpd))

##   max_death max_tmpd
## 1       411       92
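The pollutant columns of the chicago data contain missing values, and most summary functions return NA unless told to skip them. A minimal sketch on a hypothetical toy data frame (readings and the summary names are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data frame, made up for illustration
readings <- data.frame(pm10 = c(12, NA, 30, 18))

s <- summarise(readings,
               n_days    = n(),                       # number of rows
               n_missing = sum(is.na(pm10)),          # how many values are NA
               mean_pm10 = mean(pm10, na.rm = TRUE),  # without na.rm the result is NA
               max_pm10  = max(pm10, na.rm = TRUE))
s
```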

`%>%` all up!

chicago_pipe <- chicago %>%
  filter(!is.na(pm10median) & !is.na(so2median)) %>%
  mutate(tmpdC = round((tmpd - 32) / 1.8, digits = 1))
plot(chicago_pipe$tmpdC, chicago_pipe$death, cex = 0.5, col = "red")

grammar of graphics

Enables you to specify the building blocks of a plot and to combine them to create the graphical display you want. There are 8 building blocks:

data
aesthetic mapping
geometric object
statistical transformations
scales
coordinate system
position adjustments
faceting
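To make the list concrete, the sketch below touches each of the eight building blocks in one plot. It uses mtcars, a data set shipped with R, rather than the chicago data. The choice of variables and the object name p are assumptions made purely for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars,                                  # data
            aes(x = wt, y = mpg)) +                  # aesthetic mapping
  geom_point(position = "jitter") +                  # geometric object + position adjustment
  stat_smooth(method = "lm", formula = y ~ x) +      # statistical transformation
  scale_y_continuous(name = "miles per gallon") +    # scale
  coord_cartesian(xlim = c(1, 6)) +                  # coordinate system
  facet_wrap(~ cyl)                                  # faceting
p
```

In everyday use most of these blocks are left at their defaults, which is why the shortest useful plot needs only data, an aesthetic mapping and one geom.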

`ggplot()`

library(ggplot2)
ggplot(chicago_pipe, aes(x = tmpdC, y = death)) +
  geom_point(col ="red")

adding layers to your `ggplot()`

ggplot(chicago_pipe, aes(x = tmpdC, y = death)) +
  # set a fixed colour outside aes(); inside aes() "red" would be treated as data
  geom_point(colour = "red", alpha = 0.2) +
  geom_smooth(colour = "blue") +
  labs(title = "Death vs temperature",
       x = "°C", y = "death") +
  theme(panel.border = element_rect(fill = NA,
                                    colour = "black",
                                    linewidth = .75),  # use size = .75 on ggplot2 < 3.4.0
        plot.title = element_text(hjust = 0.5))

Voila

Your turn!

Load the Daily Mortality Weather and Pollution Data for Chicago: chicagoNMMAPS, available from the dlnm package.
Have a glance at the data.
What questions could you ask, and could you provide the answers to them?

There is a challenge:

dplyr’s group_by() function enables you to group your data. It returns a grouped copy of the original df, split by the values of a variable, so that subsequent verbs such as summarise() operate on each group separately.

Knowing about the group_by() function, could you compute the average pollutant level by month and visualise your result?
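If group_by() is new to you, it may help to try it on something tiny first. The sketch below uses a hypothetical toy data frame (sales, shop and amount are made up for illustration) and shows that summarise() returns one row per group.

```r
library(dplyr)

# Hypothetical toy data frame, made up for illustration
sales <- data.frame(shop   = c("north", "north", "south", "south", "south"),
                    amount = c(10, 20, 5, NA, 15))

by_shop <- sales %>%
  group_by(shop) %>%                                   # mark the groups
  summarise(mean_amount = mean(amount, na.rm = TRUE),  # one value per shop
            n_rows      = n())                         # rows in each group
by_shop
```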

Possible Solution: code

# install (once) and load the `dlnm` package, then access the data
# install.packages("dlnm")
library(dlnm)
library(dplyr)
library(ggplot2)
data("chicagoNMMAPS")

# group the data by month and calculate the average monthly pollution (pm10)
my_ch <- chicagoNMMAPS %>%
  group_by(month) %>%
  summarise(pm10 = mean(pm10, na.rm = TRUE))

# visualise the information
ggplot(my_ch, aes(x = month, y = pm10)) +
  geom_line() + geom_point(col = "red") +
  xlab("Month") + ylab("average pm10") +
  scale_x_continuous(breaks = 1:12, labels = 1:12)

DataTeka: Tatjana Kecojevic

13/02/2018

Avoid Chicago in spring and summer!
