Before there was R, there was S!

R is a dialect of the S language, which was developed in 1976 by Rick Becker and John Chambers at Bell Laboratories.

Rick Becker gave an excellent keynote talk, “Forty Years of S”, at the useR! 2016 conference. He talked about the development of the S language, which explains many characteristics of R as we know it, including the “<-” assignment operator.

In 1993, Bell Labs gave StatSci (later Insightful Corp.) an exclusive license to develop and sell the S language. Insightful sold its implementation of the S language under the product name S-PLUS.

You can read more about the history of S, R, and S-PLUS

Then, R was born

In the early nineties, R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland.

They released R under the GNU General Public License, making it free, open-source software.

Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996

Currently R is developed by the R Development Core Team, of which John Chambers is a member.

Write R Code

To start using R you need to:

  1. Install R (and RStudio)

  2. Launch it and set your working directory, which lets R know where to find all of your files.

  3. Start writing R code!
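Step 2 can also be done from the R console. The following is a minimal sketch, not the only way to do it: project_dir is a hypothetical folder name, and a temporary directory stands in for a real project path.

```r
# Remember the current working directory so we can return to it
old_wd <- getwd()

# Hypothetical project folder; tempdir() stands in for a real path
project_dir <- tempdir()

# Tell R where to find (and save) your files
setwd(project_dir)
getwd()

# List the files R can now reach without a full path
list.files()

# Go back to where we started
setwd(old_wd)
```

If you work with RStudio projects, as suggested in the tip below, you rarely need to call setwd() yourself.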

Tip: When you start working on new R code or a new R project in the RStudio IDE, use File -> New Project. Your working directory is then set up when you create the project, and all your files are saved in it. The next time you open the project, RStudio sets the project’s directory as the working directory. Projects help you with much more besides.

Before Tidyverse R, there is Base R!

When you download and install R for the first time, you are installing the Base R software. Base R contains most of the functions you’ll use on a daily basis, such as mean() and subset().
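As a quick illustration of these two base functions, here is a small sketch that uses mtcars, a data set that ships with R. The variable name small_cars is only illustrative.

```r
# mtcars is a data set that ships with base R
mean(mtcars$mpg)   # average miles per gallon across all cars

# Keep only the 4-cylinder cars and two of the columns
small_cars <- subset(mtcars, cyl == 4, select = c(mpg, hp))
head(small_cars, n = 3)
```

No packages need to be installed or loaded for this to run.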

To learn about R’s basic operations, data structures and base functions you could look at one of R-Ladies Manchester’s handouts: Introduction to base R.

If you want to access data and code written by other people, you’ll need to install it as a package. An R package is a bundle of functions (code), data, documentation and vignettes (examples), stored in one neat place.

“In R, the fundamental unit of shareable code is the package.” Hadley Wickham

The tidyverse!

An opinionated collection of R packages for data science.

install.packages("tidyverse")

library(tidyverse)

R for Data Science by Garrett Grolemund & Hadley Wickham

RStudio Community

The dplyr Package:

provides a “grammar” (the verbs) for data manipulation and for operating on data frames. The key operator is %>% and the essential verbs are select(), mutate(), filter(), arrange() and summarise().

Chicago Data

Description: Chicago daily air pollution and death rate data. A data frame with 7 columns and 5114 rows. Each row refers to one day. The columns are:

death: total deaths (per day)

pm10median: median particles in 2.5-10 per cubic m

pm25median: median particles < 2.5 mg per cubic m (more dangerous)

o3median: ozone in parts per billion

so2median: median sulphur dioxide measurement

time: time in days

tmpd: temperature in Fahrenheit

1st look at the data: dim() & head()

# install.packages("gamair")
library(gamair)
data(chicago)
dim(chicago)
## [1] 5114    7
head(chicago)
##   death pm10median pm25median  o3median  so2median    time tmpd
## 1   130 -7.4335443         NA -19.59234  1.9280426 -2556.5 31.5
## 2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0
## 3   101 -0.8265306         NA -20.21734 -1.8914161 -2554.5 33.0
## 4   135  5.5664557         NA -19.67567  6.1393413 -2553.5 29.0
## 5   126         NA         NA -19.21734  2.2784649 -2552.5 32.0
## 6   130  6.5664557         NA -17.63400  9.8585839 -2551.5 40.0

Examine the structure of the data: str()

str(chicago) 
## 'data.frame':    5114 obs. of  7 variables:
##  $ death     : int  130 150 101 135 126 130 129 109 125 153 ...
##  $ pm10median: num  -7.434 NA -0.827 5.566 NA ...
##  $ pm25median: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ o3median  : num  -19.6 -19 -20.2 -19.7 -19.2 ...
##  $ so2median : num  1.928 -0.986 -1.891 6.139 2.278 ...
##  $ time      : num  -2556 -2556 -2554 -2554 -2552 ...
##  $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...

The output could look messy and it might not fit the screen if you’re dealing with a big data set that has lots of variables!

Do it in a tidy way: glimpse()

suppressPackageStartupMessages(library(dplyr))
glimpse(chicago) 
## Observations: 5,114
## Variables: 7
## $ death      <int> 130, 150, 101, 135, 126, 130, 129, 109, 125, 153, 1...
## $ pm10median <dbl> -7.4335443, NA, -0.8265306, 5.5664557, NA, 6.566455...
## $ pm25median <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ o3median   <dbl> -19.592338, -19.038614, -20.217338, -19.675671, -19...
## $ so2median  <dbl> 1.9280426, -0.9855631, -1.8914161, 6.1393413, 2.278...
## $ time       <dbl> -2556.5, -2555.5, -2554.5, -2553.5, -2552.5, -2551....
## $ tmpd       <dbl> 31.5, 33.0, 33.0, 29.0, 32.0, 40.0, 34.5, 29.0, 26....

The pipe operator: %>%

Left Hand Side (LHS)    %>%    Right Hand Side (RHS)

x %>% f(y)     <==>     f(x, y)

The “pipe” passes the result of the LHS as the first argument of the function on the RHS.

3 %>% sum(4)     <==>     sum(3, 4)

%>%   is very practical for chaining together multiple dplyr functions in a sequence of operations.
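A minimal sketch of the equivalence and of chaining, using only base functions on the right-hand side. The names piped and nested are illustrative.

```r
library(dplyr)   # dplyr makes the %>% operator available

# These two calls are equivalent
piped  <- 3 %>% sum(4)
nested <- sum(3, 4)
identical(piped, nested)

# Chaining: read left to right instead of inside out
c(1, 4, 9) %>% sqrt() %>% sum()

# The same computation written as nested calls
sum(sqrt(c(1, 4, 9)))
```

The piped version reads in the order the steps happen, which is what makes longer dplyr pipelines easy to follow.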

select()

##   death pm10median pm25median  o3median  so2median    time tmpd
## 1   130  -7.433544         NA -19.59234  1.9280426 -2556.5 31.5
## 2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0

Select your variables

chicago_air_measurements <- select(chicago, ends_with("median"))
head(chicago_air_measurements, n = 1)
##   pm10median pm25median  o3median so2median
## 1  -7.433544         NA -19.59234  1.928043
# base R equivalent: pick the columns by name
chicago_air_pm <- chicago[c("pm10median", "pm25median")]
head(chicago_air_pm, n = 1)
##   pm10median pm25median
## 1  -7.433544         NA
chicago_air_pm2 <- select(chicago, starts_with("pm"))
head(chicago_air_pm2, n = 1)
##   pm10median pm25median
## 1  -7.433544         NA

mutate()

mutate() adds new columns that are computed from existing ones. For example, it lets you add to the data frame df a new column, z, which is the product of the columns x and y:

                 mutate(df, z = x * y)
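To make this concrete, here is a small runnable sketch. The data frame df and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical data frame with two numeric columns
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Add column z as the product of x and y
df2 <- mutate(df, z = x * y)
df2

# The original data frame is left unchanged
names(df)
```

Note that mutate() returns a new data frame, so the result has to be assigned if you want to keep it.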

Let us convert °F into °C:     T(°C) = (T(°F) - 32) × 5/9

chicago2 <- mutate(chicago, tmpdc = round((tmpd - 32) / 1.8, digits = 1)) 
head(chicago2, n = 3)
##   death pm10median pm25median  o3median  so2median    time tmpd tmpdc
## 1   130 -7.4335443         NA -19.59234  1.9280426 -2556.5 31.5  -0.3
## 2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0   0.6
## 3   101 -0.8265306         NA -20.21734 -1.8914161 -2554.5 33.0   0.6

filter()

R has a set of comparison operators (<, <=, >, >=, ==, !=) and logical operators (&, |, !) that you can use inside filter().
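A short sketch of some of these operators in action. The data frame toy and its values are made up, with column names chosen to resemble the Chicago data.

```r
library(dplyr)

# Hypothetical data: daily deaths and temperature in °C
toy <- data.frame(death = c(120, 210, 250, 90),
                  tmpdc = c(10, 31, 25, NA))

filter(toy, death > 200)                  # greater than
filter(toy, death > 200 & tmpdc >= 30)    # AND: both conditions hold
filter(toy, death > 200 | tmpdc < 15)     # OR: at least one condition holds
filter(toy, !is.na(tmpdc))                # NOT: drop rows with missing temperature
filter(toy, death %in% c(90, 120))        # value matching
```

Rows where the condition evaluates to NA are dropped by filter(), which is why the row with the missing temperature never appears in the OR example.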

Filter your data

high_death <- filter(chicago2, death > 200) 
high_death
##   death pm10median pm25median  o3median  so2median  time tmpd tmpdc
## 1   226  20.941667         NA 29.703545  2.2685856 559.5 91.5  33.1
## 2   411  14.798103         NA 28.115091  0.6976599 560.5 86.0  30.0
## 3   287  -8.333333         NA 21.115009 -0.9330126 561.5 83.0  28.3
## 4   228  -3.232732         NA  5.649732 -2.3158882 562.5 78.5  25.8
high_temp_death <- filter(chicago2, death > 200 & tmpdc >= 30)
high_temp_death
##   death pm10median pm25median o3median so2median  time tmpd tmpdc
## 1   226   20.94167         NA 29.70355 2.2685856 559.5 91.5  33.1
## 2   411   14.79810         NA 28.11509 0.6976599 560.5 86.0  30.0

arrange()

arrange() is used to reorder the rows of a data frame (df) according to one of its variables/columns.

Arranging your data

low_2_high <- arrange(chicago, death)
head(low_2_high, n = 4)
##   death pm10median pm25median   o3median   so2median    time tmpd
## 1    69  -1.818182         NA  -8.029279  1.12452237  1313.5 64.5
## 2    73 -19.320548         NA  -5.869187  0.07297014  2052.5 66.0
## 3    77  -8.801262         NA -13.170360 -3.48994781 -2363.5 64.5
## 4    77 -19.165746  -10.14961   3.436157  3.60026234  1646.5 70.0
high_2_low <- arrange(chicago, desc(death))
head(high_2_low, n = 4)
##   death pm10median pm25median  o3median  so2median  time tmpd
## 1   411  14.798103         NA 28.115091  0.6976599 560.5 86.0
## 2   287  -8.333333         NA 21.115009 -0.9330126 561.5 83.0
## 3   228  -3.232732         NA  5.649732 -2.3158882 562.5 78.5
## 4   226  20.941667         NA 29.703545  2.2685856 559.5 91.5

summarise()

Let us use summarise() to print out a summary of the chicago data containing two variables: max_death and max_tmpd

summarise(chicago, max_death = max(death), max_tmpd = max(tmpd))
##   max_death max_tmpd
## 1       411       92
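Columns with missing values, such as pm25median, need a little more care inside summarise(). The sketch below uses a made-up data frame, toy, to show the effect of na.rm = TRUE. The column and variable names are illustrative.

```r
library(dplyr)

# Hypothetical measurements with one missing value
toy <- data.frame(pm = c(5, NA, 11))

res <- summarise(toy,
                 max_pm    = max(pm),                # NA, because one value is missing
                 max_pm_ok = max(pm, na.rm = TRUE),  # ignores the missing value
                 n_missing = sum(is.na(pm)))         # how many values are missing
res
```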

%>% all up!

chicago_pipe <- chicago %>%
  filter(!is.na(pm10median) & !is.na(so2median)) %>%
  mutate(tmpdC = round((tmpd - 32) / 1.8, digits = 1))
plot(chicago_pipe$tmpdC, chicago_pipe$death, cex = 0.5, col = "red")

Grammar of graphics

The grammar of graphics enables you to specify the building blocks of a plot and to combine them to create the graphical display you want. There are 8 building blocks:

ggplot()

library(ggplot2)
ggplot(chicago_pipe, aes(x = tmpdC, y = death)) +
  geom_point(col = "red")

adding layers to your ggplot()

ggplot(chicago_pipe, aes(x = tmpdC, y = death)) +
  # a fixed colour belongs outside aes(); inside aes() "red" is treated as data
  geom_point(colour = "red", alpha = 0.2) +
  geom_smooth(col = "blue") +
  labs(title = "death vs temperature",
       x = "°C", y = "death") +
  theme(legend.position = "none",
        panel.border = element_rect(fill = NA,
                                    colour = "black",
                                    size = .75),  # named linewidth in ggplot2 >= 3.4.0
        plot.title = element_text(hjust = 0.5))

Voilà

Your turn!

  1. Load the Daily Mortality, Weather and Pollution Data for Chicago: chicagoNMMAPS, available from the dlnm package.
  2. Have a glance at the data.
  3. What questions could you ask, and could you provide the answers to them?

There is a challenge:

dplyr’s group_by() function enables you to group your data. It returns a grouped copy of the original df, split by the values of a variable, so that later verbs such as summarise() work on each group separately.
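A minimal sketch of grouping on a made-up data frame. The names toy, value and by_month are illustrative, and n() is dplyr’s helper for counting the rows in each group.

```r
library(dplyr)

# Hypothetical data: a grouping variable and a measurement
toy <- data.frame(month = c(1, 1, 2, 2, 2),
                  value = c(10, 20, 30, NA, 50))

# One summary row per month; na.rm = TRUE skips missing values
by_month <- toy %>%
  group_by(month) %>%
  summarise(avg = mean(value, na.rm = TRUE), n = n())
by_month
```

The result has one row per distinct value of the grouping variable.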

Knowing about the group_by() function, could you compute the average pollutant level by month and visualise your result?

Possible Solution: code

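# install (once) and load the dlnm package, then access the data
# install.packages("dlnm")
library(dlnm)
library(dplyr)    # for %>%, group_by() and summarise()
library(ggplot2)  # for the plot below
data("chicagoNMMAPS")

# group data by month and calculate average monthly pollution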
my_ch <- chicagoNMMAPS %>%
  group_by(month) %>%
  summarise(pm10 = mean(pm10, na.rm = TRUE))

# visualise the information
ggplot(my_ch, aes(x = month, y = pm10)) +
  geom_line() + geom_point(col = "red") +
  xlab("Month") + ylab("average pm10") +
  scale_x_continuous(breaks = seq(1, 12, 1), labels = seq(1, 12, 1))

Avoid Chicago in spring and summer!