R is a dialect of the S language, which was developed in 1976 by Rick Becker and John Chambers at Bell Laboratories.
Rick Becker gave an excellent keynote talk, “Forty Years of S”, at the UseR!2016 conference, where he described the development of the S language and explained many characteristics of R as we know it, including the “<-” assignment operator.
In 1993, Bell Labs gave StatSci (later Insightful Corp.) an exclusive license to develop and sell the S language. Insightful sold its implementation of the S language under the product name S-PLUS.
You can read more about the history of S, R, and S-PLUS
In the early nineties, R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland.
They used the GNU General Public License to make R free, open-source software.
Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996
Currently R is developed by the R Development Core Team, of which John Chambers is a member.
To start using R you need to:
Install R (and RStudio)
Launch it and set your working directory: letting R know where to find all of your files.
Start writing R code!
Tip: When you start working on new R code or a new R project in the RStudio IDE, use File -> New Project. This way, your working directory is set up when you start the project, and all your files are saved in it. The next time you open the project, RStudio sets the project’s directory as the working directory. It will help you with much more besides.
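As a minimal sketch of what setting the working directory means in code, the lines below use base R's getwd() and setwd(). The folder used here is R's temporary directory, which is only a stand-in (an assumption) for your own project folder.

```r
# Where does R currently look for files?
old_wd <- getwd()

# tempdir() is only a placeholder for your own project folder (assumption)
project_dir <- tempdir()
setwd(project_dir)

# Relative file names are now resolved inside project_dir
list.files()

# Go back to where we started
setwd(old_wd)
```

With an RStudio project you rarely need to call setwd() yourself, because opening the project does this step for you.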
When you download and install R for the first time, you are installing the Base R software. Base R contains most of the functions you’ll use on a daily basis: mean(), subset(), …
To learn about R’s basic operations, data structures and base functions you could look at one of R-Ladies Manchester’s handouts: Introduction to base R.
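To give a feel for the two base functions named above, here is a small sketch. The data frame scores and its values are made up purely for illustration.

```r
# A small made-up data frame (illustrative values only)
scores <- data.frame(
  name  = c("Ann", "Bob", "Cat"),
  score = c(90, 72, 85)
)

# mean() averages a numeric vector
avg_score <- mean(scores$score)
avg_score

# subset() keeps the rows that satisfy a condition
high_scores <- subset(scores, score > 80)
high_scores
```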
If you want to access data and code written by other people, you’ll need to install it as a package. An R package is a bundle of functions (code), data, documentation, vignettes (examples), stored in one neat place.
“In R, the fundamental unit of shareable code is the package.” Hadley Wickham
The tidyverse: an opinionated collection of R packages for data science.
R for Data Science by Garrett Grolemund & Hadley Wickham
dplyr
This package provides a “grammar” (the verbs) for data manipulation and for operating on data frames. The key operator and the essential verbs are:
• %>%: the “pipe” operator, used to connect multiple verb actions together into a pipeline.
• select(): returns a subset of the columns of a data frame.
• mutate(): adds new variables/columns or transforms existing variables.
• filter(): extracts a subset of rows from a data frame based on logical conditions.
• arrange(): reorders rows of a data frame according to single or multiple variables.
• summarise() / summarize(): reduces each group to a single row by calculating aggregate measures.
Description: Chicago daily air pollution and death rate data. A data frame with 7 columns and 5114 rows. Each row refers to one day. The columns are:
• death
total deaths (per day).
• pm10median
median particles in 2.5-10 per cubic m
• pm25median
median particles < 2.5 mg per cubic m (more dangerous).
• o3median
Ozone in parts per billion
• so2median
Median sulphur dioxide measurement
• time
time in days
• tmpd
temperature in Fahrenheit
dim() & head()
# install.packages("gamair")
library(gamair)
data(chicago)
dim(chicago)
## [1] 5114 7
head(chicago)
## death pm10median pm25median o3median so2median time tmpd
## 1 130 -7.4335443 NA -19.59234 1.9280426 -2556.5 31.5
## 2 150 NA NA -19.03861 -0.9855631 -2555.5 33.0
## 3 101 -0.8265306 NA -20.21734 -1.8914161 -2554.5 33.0
## 4 135 5.5664557 NA -19.67567 6.1393413 -2553.5 29.0
## 5 126 NA NA -19.21734 2.2784649 -2552.5 32.0
## 6 130 6.5664557 NA -17.63400 9.8585839 -2551.5 40.0
str()
str(chicago)
## 'data.frame': 5114 obs. of 7 variables:
## $ death : int 130 150 101 135 126 130 129 109 125 153 ...
## $ pm10median: num -7.434 NA -0.827 5.566 NA ...
## $ pm25median: num NA NA NA NA NA NA NA NA NA NA ...
## $ o3median : num -19.6 -19 -20.2 -19.7 -19.2 ...
## $ so2median : num 1.928 -0.986 -1.891 6.139 2.278 ...
## $ time : num -2556 -2556 -2554 -2554 -2552 ...
## $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
The output could look messy and it might not fit the screen if you’re dealing with a big data set that has lots of variables!
glimpse()
suppressPackageStartupMessages(library(dplyr))
glimpse(chicago)
## Observations: 5,114
## Variables: 7
## $ death <int> 130, 150, 101, 135, 126, 130, 129, 109, 125, 153, 1...
## $ pm10median <dbl> -7.4335443, NA, -0.8265306, 5.5664557, NA, 6.566455...
## $ pm25median <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ o3median <dbl> -19.592338, -19.038614, -20.217338, -19.675671, -19...
## $ so2median <dbl> 1.9280426, -0.9855631, -1.8914161, 6.1393413, 2.278...
## $ time <dbl> -2556.5, -2555.5, -2554.5, -2553.5, -2552.5, -2551....
## $ tmpd <dbl> 31.5, 33.0, 33.0, 29.0, 32.0, 40.0, 34.5, 29.0, 26....
%>%
Left Hand Side (LHS) %>% Right Hand Side (RHS)
x %>% f(y) <==> f(x, y)
The “pipe” passes the result of the LHS as the first argument of the function on the RHS:
3 %>% sum(4) <==> sum(3, 4)
%>% is very practical for chaining together multiple dplyr functions in a sequence of operations.
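The equivalence described above can be checked with a short sketch. The vector values is made up; the point is only that the nested call and the piped chain give the same result.

```r
library(dplyr)  # loading dplyr makes the %>% operator available

values <- c(1, 4, 9, 16)  # made-up numbers

# Nested calls are read from the inside out
nested <- round(sqrt(sum(values)), 1)

# The same steps with the pipe are read from left to right
piped <- values %>%
  sum() %>%
  sqrt() %>%
  round(1)

identical(nested, piped)
```

Each step receives the previous result as its first argument, which is why round(1) here means round(<previous result>, 1).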
select()
• starts_with("X"): every name that starts with “X”.
• ends_with("X"): every name that ends with “X”.
• contains("X"): every name that contains “X”.
• matches("X"): every name that matches “X”, where “X” can be a regular expression.
• num_range("x", 1:5): the variables named x1, x2, x3, x4, x5.
• one_of(x): every name that appears in x, which should be a character vector.
## death pm10median pm25median o3median so2median time tmpd
## 1 130 -7.433544 NA -19.59234 1.9280426 -2556.5 31.5
## 2 150 NA NA -19.03861 -0.9855631 -2555.5 33.0
chicago_air_measurements <- select(chicago, ends_with("median"))
head(chicago_air_measurements, n = 1)
## pm10median pm25median o3median so2median
## 1 -7.433544 NA -19.59234 1.928043
chicago_air_pm <- chicago[c("pm10median", "pm25median")]
head(chicago_air_pm, n = 1)
## pm10median pm25median
## 1 -7.433544 NA
chicago_air_pm2 <- select(chicago, starts_with("pm"))
head(chicago_air_pm2, n = 1)
## pm10median pm25median
## 1 -7.433544 NA
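The helper functions can be tried out on a tiny data frame before using them on real data. This is a sketch only: the data frame toy and its column names are invented to exercise the helpers.

```r
library(dplyr)

# Made-up data frame; the column names are chosen only to exercise the helpers
toy <- data.frame(
  x1 = 1:2, x2 = 3:4, x3 = 5:6,
  temp_median = c(10.5, 11.2),
  label = c("a", "b")
)

range_cols  <- select(toy, num_range("x", 1:2))  # x1 and x2
median_cols <- select(toy, ends_with("median"))  # temp_median
regex_cols  <- select(toy, matches("^x[0-9]$"))  # x1, x2 and x3
dropped     <- select(toy, -label)               # everything except label

names(range_cols)
names(regex_cols)
```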
mutate()
For example, it allows you to add to the data frame df a new column, z, which is the product of the columns x and y:
mutate(df, z = x * y)
Let us convert °F into °C: T(°C) = (T(°F) - 32) × 5/9
chicago2 <- mutate(chicago, tmpdc = round((tmpd - 32) / 1.8, digits = 1))
head(chicago2, n = 3)
## death pm10median pm25median o3median so2median time tmpd tmpdc
## 1 130 -7.4335443 NA -19.59234 1.9280426 -2556.5 31.5 -0.3
## 2 150 NA NA -19.03861 -0.9855631 -2555.5 33.0 0.6
## 3 101 -0.8265306 NA -20.21734 -1.8914161 -2554.5 33.0 0.6
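To make the generic mutate(df, z = x * y) pattern runnable, here is a sketch with a made-up df. It also shows that a column created in a mutate() call can be reused later in the same call.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = c(2, 4, 6))  # made-up values

df2 <- mutate(
  df,
  z      = x * y,  # new column built from two existing ones
  z_half = z / 2   # a column created earlier in the same call can be reused
)
df2
```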
filter()
There is a set of comparison operators in R that you can use inside filter():
• x < y: TRUE if x is less than y
• x <= y: TRUE if x is less than or equal to y
• x == y: TRUE if x equals y
• x != y: TRUE if x does not equal y
• x >= y: TRUE if x is greater than or equal to y
• x > y: TRUE if x is greater than y
• x %in% c(a, b, c): TRUE if x is in the vector c(a, b, c)
high_death <- filter(chicago2, death > 200)
high_death
## death pm10median pm25median o3median so2median time tmpd tmpdc
## 1 226 20.941667 NA 29.703545 2.2685856 559.5 91.5 33.1
## 2 411 14.798103 NA 28.115091 0.6976599 560.5 86.0 30.0
## 3 287 -8.333333 NA 21.115009 -0.9330126 561.5 83.0 28.3
## 4 228 -3.232732 NA 5.649732 -2.3158882 562.5 78.5 25.8
high_temp_death <- filter(chicago2, death > 200 & tmpdc >= 30)
high_temp_death
## death pm10median pm25median o3median so2median time tmpd tmpdc
## 1 226 20.94167 NA 29.70355 2.2685856 559.5 91.5 33.1
## 2 411 14.79810 NA 28.11509 0.6976599 560.5 86.0 30.0
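The sketch below combines several of these operators on a small made-up data frame (days and its values are illustrative, not taken from the chicago data). Conditions separated by commas inside filter() are combined with AND, just like &.

```r
library(dplyr)

# Made-up data frame for illustration
days <- data.frame(
  site  = c("A", "B", "C", "A"),
  death = c(130, 95, 210, 226),
  tmpdc = c(-0.3, 12.0, 31.0, 33.1)
)

# Conditions separated by commas must all be TRUE (same as &)
hot_and_high <- filter(days, death > 200, tmpdc >= 30)

# %in% keeps rows whose value appears in the given vector
selected <- filter(days, site %in% c("A", "B"))

# | keeps rows where at least one condition is TRUE
either <- filter(days, death > 200 | tmpdc < 0)

hot_and_high
```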
arrange()
arrange() is used to reorder rows of a data frame (df) according to one of the variables/columns.
If you arrange() a character variable, R will rearrange the rows in alphabetical order according to the values of the variable.
If you arrange() a factor variable, R will rearrange the rows according to the order of the factor’s levels (running levels() on the variable reveals this order).
low_2_high <- arrange(chicago, death)
head(low_2_high, n = 4)
## death pm10median pm25median o3median so2median time tmpd
## 1 69 -1.818182 NA -8.029279 1.12452237 1313.5 64.5
## 2 73 -19.320548 NA -5.869187 0.07297014 2052.5 66.0
## 3 77 -8.801262 NA -13.170360 -3.48994781 -2363.5 64.5
## 4 77 -19.165746 -10.14961 3.436157 3.60026234 1646.5 70.0
high_2_low <- arrange(chicago, desc(death))
head(high_2_low, n = 4)
## death pm10median pm25median o3median so2median time tmpd
## 1 411 14.798103 NA 28.115091 0.6976599 560.5 86.0
## 2 287 -8.333333 NA 21.115009 -0.9330126 561.5 83.0
## 3 228 -3.232732 NA 5.649732 -2.3158882 562.5 78.5
## 4 226 20.941667 NA 29.703545 2.2685856 559.5 91.5
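As a sketch of ordering by a factor and by more than one variable, the example below uses a made-up data frame. Rows are ordered first by the levels of size, then by count from high to low within each level.

```r
library(dplyr)

# Made-up data; the factor levels are deliberately not alphabetical
toy <- data.frame(
  size  = factor(c("small", "large", "medium", "small"),
                 levels = c("small", "medium", "large")),
  count = c(2, 5, 3, 9)
)

levels(toy$size)  # the order arrange() follows for a factor

# Order by size (level order), then by count in descending order
by_level <- arrange(toy, size, desc(count))
by_level
```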
summarise()
summarise() uses the same syntax as mutate(), but the resulting dataset consists of a single row, whereas mutate() adds an entire new column.
It builds a new dataset that contains only the summarising statistics.
Let us use summarise() to print out a summary of the chicago data containing two variables, max_death and max_tmpd:
summarise(chicago, max_death = max(death), max_tmpd = max(tmpd))
## max_death max_tmpd
## 1 411 92
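Several statistics can be computed in one summarise() call. The sketch below uses a made-up data frame, readings, and shows na.rm = TRUE, which is needed whenever a column contains missing values (as the pollutant columns in chicago do).

```r
library(dplyr)

# Made-up values for illustration
readings <- data.frame(
  death = c(130, 150, 101, 135),
  pm10  = c(-7.4, NA, -0.8, 5.6)
)

stats <- summarise(
  readings,
  n_days     = n(),                      # number of rows
  mean_death = mean(death),
  mean_pm10  = mean(pm10, na.rm = TRUE)  # ignore the missing value
)
stats
```

Without na.rm = TRUE, mean_pm10 would be NA because one reading is missing.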
%>% all up!
chicago_pipe <- chicago %>%
  filter(!is.na(pm10median) & !is.na(so2median)) %>%
  mutate(tmpdC = round((tmpd - 32) / 1.8, digits = 1))
plot(chicago_pipe$tmpdC, chicago_pipe$death, cex = 0.5, col = "red")
ggplot2 enables you to specify the building blocks of a plot and to combine them to create the graphical display you want. There are 8 building blocks.
ggplot()
library(ggplot2)
ggplot(chicago_pipe, aes(x = tmpdC, y = death)) +
geom_point(col ="red")
ggplot()
ggplot(chicago_pipe, aes(x = tmpdC, y = death)) +
  # a fixed colour belongs outside aes(), so no legend is created
  geom_point(col = "red", alpha = 0.2) +
  geom_smooth(col = "blue") +
  labs(title = "Death vs temperature",
       x = "°C", y = "death") +
  theme(panel.border = element_rect(fill = NA,
                                    colour = "black",
                                    # linewidth replaces size in ggplot2 >= 3.4.0
                                    linewidth = .75),
        plot.title = element_text(hjust = 0.5))
chicagoNMMAPS is available from the dlnm package.
There is a challenge:
dplyr’s group_by() function enables you to group your data. It allows you to create a separate df that splits the original df by a variable.
Knowing about the group_by() function, could you compute the average pollutant level by month and visualise your result?
# install and open `dlnm' package and access the data
install.packages("dlnm")
library(dlnm)
data("chicagoNMMAPS")
# group data by month and calculate average monthly pollution
my_ch <- chicagoNMMAPS %>%
group_by(month) %>%
summarise(pm10 = mean(pm10, na.rm = TRUE))
# visualise the information
ggplot(my_ch, aes(x=month, y = pm10)) +
geom_line() + geom_point(col = "red") +
xlab("Month") + ylab("average pm10") +
scale_x_continuous(breaks = seq(1, 12, 1), labels = seq(1, 12, 1))
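To see what group_by() followed by summarise() does without installing any extra data package, here is a sketch on a tiny made-up data frame. The result has one row per group rather than a single row overall.

```r
library(dplyr)

# Made-up daily readings for two months
toy <- data.frame(
  month = c(1, 1, 2, 2, 2),
  pm10  = c(20, 30, NA, 10, 14)
)

monthly <- toy %>%
  group_by(month) %>%
  summarise(
    n_days    = n(),                      # rows in each group
    mean_pm10 = mean(pm10, na.rm = TRUE)  # group mean, ignoring NA
  )
monthly
```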