To install and load the dplyr package:
install.packages("dplyr")
library(dplyr)
getwd()
[1] "/Users/sauce"
# 1. set WD
setwd("/Users/sauce/desktop")
# 2. make folder
if(!file.exists("./data")){
dir.create("./data")
}
# 3. make handle
fileURL <- "https://github.com/DataScienceSpecialization/courses/blob/master/03_GettingData/dplyr/chicago.rds?raw=true"
# 4. download data
download.file(fileURL, destfile = "./data/chicago.rds", method = "curl", extra='-L')
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 162 0 162 0 0 475 0 --:--:-- --:--:-- --:--:-- 476
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 173 100 173 0 0 364 0 --:--:-- --:--:-- --:--:-- 168k
100 127k 100 127k 0 0 168k 0 --:--:-- --:--:-- --:--:-- 168k
# 5. read data
chicago <- readRDS("./data/chicago.rds")
Warning in format.POSIXlt(as.POSIXlt(x), ...) :
unknown timezone 'zone/tz/2017c.1.0/zoneinfo/America/Denver'
dim(chicago)
[1] 6940 8
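The folder check in step 2 uses file.exists, which is also true for a regular file named "data". A minimal sketch of a stricter check with base R's dir.exists follows; the temporary directory is a hypothetical location chosen only so the example does not touch your working directory.

```r
# Hypothetical location for illustration; the tutorial itself uses "./data"
data_dir <- file.path(tempdir(), "data")

# dir.exists() is TRUE only for directories, not for regular files
if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}

dir.exists(data_dir)
```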
To view the details of the data set:
str(chicago)
'data.frame': 6940 obs. of 8 variables:
$ city : chr "chic" "chic" "chic" "chic" ...
$ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
$ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
$ date : Date, format: "1987-01-01" "1987-01-02" ...
$ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
$ pm10tmean2: num 34 NA 34.2 47 NA ...
$ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
$ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
names(chicago)[1:3]
[1] "city" "tmpd" "dptp"
To work with a relevant subset of the data:
subset <- select(chicago, city:dptp)
head(subset)
To drop a range of variables (here city through dptp) from the data frame:
subset2 <- select(chicago, -(city:dptp))
head(subset2)
Keep every variable that ends with a 2:
subset3 <- select(chicago, ends_with("2"))
str(subset3)
'data.frame': 6940 obs. of 4 variables:
$ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
$ pm10tmean2: num 34 NA 34.2 47 NA ...
$ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
$ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
Keep every variable that starts with a d:
subset <- select(chicago, starts_with("d"))
str(subset)
'data.frame': 6940 obs. of 2 variables:
$ dptp: num 31.5 29.9 27.4 28.6 28.9 ...
$ date: Date, format: "1987-01-01" "1987-01-02" ...
The “filter” function keeps only the rows that satisfy a logical condition. Let’s say we’re only interested in observations where the pm25tmean2 value is greater than 30. The “filter” function will return only the rows where this is true; rows where pm25tmean2 is NA are dropped as well.
chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)
'data.frame': 194 obs. of 8 variables:
$ city : chr "chic" "chic" "chic" "chic" ...
$ tmpd : num 23 28 55 59 57 57 75 61 73 78 ...
$ dptp : num 21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
$ date : Date, format: "1998-01-17" "1998-01-23" ...
$ pm25tmean2: num 38.1 34 39.4 35.4 33.3 ...
$ pm10tmean2: num 32.5 38.7 34 28.5 35 ...
$ o3tmean2 : num 3.18 1.75 10.79 14.3 20.66 ...
$ no2tmean2 : num 25.3 29.4 25.3 31.4 26.8 ...
summary(chic.f$pm25tmean2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.05 32.12 35.04 36.63 39.53 61.50
Furthermore, we can apply multiple conditions at once. In this case, we want all rows where pm25tmean2 > 30 and tmpd > 80.
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)
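As a small sketch of how combined conditions behave, the example below uses a hypothetical four-row data frame (not the chicago data). It shows that conditions separated by commas are combined with "and", just like &, and that a row with a missing pm25tmean2 value is dropped.

```r
library(dplyr)

# Hypothetical toy data; the third row has a missing pm25tmean2 value
toy <- data.frame(
  tmpd       = c(85, 70, 90, 82),
  pm25tmean2 = c(35, 40, NA, 20)
)

both_amp   <- filter(toy, pm25tmean2 > 30 & tmpd > 80)  # explicit &
both_comma <- filter(toy, pm25tmean2 > 30, tmpd > 80)   # comma-separated conditions

# Only the first row meets both conditions; the NA row is not kept
nrow(both_amp)
nrow(both_comma)
```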
The arrange function helps us rearrange observations in the data frame. We’ll first arrange the data based on date.
chicago <- arrange(chicago, date)
tail(select(chicago, date, pm25tmean2), 3)
And in descending order of date, using desc:
chicago <- arrange(chicago, desc(date))
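To check the effect of each ordering, it helps to look at the first row afterwards. The sketch below does this on a hypothetical three-row data frame rather than the full chicago data.

```r
library(dplyr)

# Hypothetical toy data with dates deliberately out of order
toy <- data.frame(
  date       = as.Date(c("1987-01-03", "1987-01-01", "1987-01-02")),
  pm25tmean2 = c(12, 10, 11)
)

oldest_first <- arrange(toy, date)        # ascending is the default
newest_first <- arrange(toy, desc(date))  # desc() reverses the order

head(oldest_first, 1)
head(newest_first, 1)
```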
The rename function helps us rename variables in our data frame.
chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
The “mutate” function helps us create a new variable based on existing variables in the data. For instance, with the Chicago data, one might want to compare each day's pm25 value to the overall mean pm25. For this we can create a detrended variable by subtracting the mean.
chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
summary(chicago)
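The arithmetic behind the detrended variable is easy to verify on a few values. This is a sketch with a hypothetical four-row data frame: the mean of the non-missing values is 20, so the detrended values are the deviations from 20, and the missing value stays missing.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(pm25 = c(10, 20, NA, 30))

# na.rm = TRUE makes mean() ignore the NA, giving (10 + 20 + 30) / 3 = 20
toy <- mutate(toy, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))

toy$pm25detrend                      # -10, 0, NA, 10
mean(toy$pm25detrend, na.rm = TRUE)  # deviations from the mean average to 0
```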
We can also compute summary statistics per group by using the group_by function together with summarize. Here we derive a year variable from date and use it as the grouping (strata) variable.
chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
Source: local data frame [19 x 4]
We now have summary stats for each year in the data frame.
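Since the printed table is not reproduced here, a small sketch of the same group_by and summarize pattern may help. The data frame toy is hypothetical, with two years of two observations each, so the per-year results can be checked by hand.

```r
library(dplyr)

# Hypothetical toy data: two observations in each of two years
toy <- data.frame(
  date     = as.Date(c("1998-03-01", "1998-07-01", "1999-03-01", "1999-07-01")),
  pm25     = c(10, 20, NA, 40),
  o3tmean2 = c(5, 9, 7, 3)
)

# POSIXlt stores the year as years since 1900, hence the + 1900
toy <- mutate(toy, year = as.POSIXlt(date)$year + 1900)

by_year <- group_by(toy, year)
yearly <- summarize(by_year,
                    pm25 = mean(pm25, na.rm = TRUE),
                    o3   = max(o3tmean2, na.rm = TRUE))

yearly  # one row per year
```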
Let’s now look at the average levels of ozone (o3tmean2) and nitrogen dioxide (no2tmean2) within quintiles of pm25.
qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
chicago <- mutate(chicago, pm25.quint = cut(pm25, qq))
quint <- group_by(chicago, pm25.quint)
summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
no2 = mean(no2tmean2, na.rm = TRUE))
Source: local data frame [6 x 3]
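One detail of cut is worth knowing when binning by quantiles: by default the intervals are open on the left, so the smallest value falls outside the first bin and becomes NA. The sketch below illustrates this on a hypothetical vector of ten values; include.lowest = TRUE is a standard argument of base R's cut, but whether to use it for the chicago data is a judgment call this post does not make.

```r
# Hypothetical values, chosen so the quantile breaks are all distinct
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
qq <- quantile(x, seq(0, 1, 0.2))

default_bins <- cut(x, qq)                         # first interval excludes the minimum
closed_bins  <- cut(x, qq, include.lowest = TRUE)  # first interval includes the minimum

sum(is.na(default_bins))  # 1: the minimum value is not assigned to any bin
sum(is.na(closed_bins))   # 0: every value is assigned to a bin
table(closed_bins)
```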
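Finally, the %>% operator is used to string together dplyr functions. It passes the result on its left as the first argument of the function on its right, so a sequence of steps reads in the order it is applied and needs no intermediate variables. The previous quintile summary can be rewritten this way with the %>% operator.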
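A minimal sketch of the piped version follows. It runs on a hypothetical ten-row data frame named toy rather than the chicago data, but the mutate, group_by and summarize steps mirror the quintile example above.

```r
library(dplyr)

# Hypothetical toy data using the renamed pm25 column
toy <- data.frame(
  pm25      = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50),
  o3tmean2  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  no2tmean2 = c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
)

qq <- quantile(toy$pm25, seq(0, 1, 0.2), na.rm = TRUE)

# Each step receives the previous result as its first argument
result <- toy %>%
  mutate(pm25.quint = cut(pm25, qq)) %>%
  group_by(pm25.quint) %>%
  summarize(o3  = mean(o3tmean2, na.rm = TRUE),
            no2 = mean(no2tmean2, na.rm = TRUE))

result  # five quintile rows plus an NA row for the minimum pm25 value
```

With the chicago data, the same chain would start from chicago instead of toy, and no intermediate quint object is needed.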