Carlos Zelada
October 2017
"80% of data analysis is spent on the process of cleaning and preparing the data." (Dasu and Johnson, 2003)
"Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis." (Forbes.com, 2016)
Package: tidyr
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.
In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset.
## Read the Pew religion/income survey data (wide, "messy" layout)
pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE,
  encoding = "UTF-8"
)
| religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don't know/refused |
|---|---|---|---|---|---|---|---|---|---|---|
| Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 | 96 |
| Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 | 76 |
| Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 | 54 |
| Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 | 1489 |
| Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 | 21 | 17 | 18 | 116 |
| Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 | 949 | 723 | 414 | 1529 |
library(tidyr)
library(knitr)

## Gather the income columns into key-value pairs: one row per (religion, income)
tidytable <- gather(pew, income, frequency, -religion)
kable(head(tidytable))
| religion | income | frequency |
|---|---|---|
| Agnostic | <$10k | 27 |
| Atheist | <$10k | 12 |
| Buddhist | <$10k | 27 |
| Catholic | <$10k | 418 |
| Don’t know/refused | <$10k | 15 |
| Evangelical Prot | <$10k | 575 |
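As an aside (not part of the original exercise), spread() is the inverse of gather() and would reconstruct the wide layout from the tidy table:

## One column per income bracket again, one row per religion
widetable <- spread(tidytable, income, frequency)
kable(head(widetable))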
Package: dplyr
The five key verbs: filter(), select(), arrange(), mutate() and summarise(), alongside group_by().
library(dplyr)
library(nycflights13)
## (1) Flights to SFO or OAK
filter(flights, dest %in% c("SFO", "OAK"))
filter(flights, dest == "SFO" | dest == "OAK")
filter(flights, dest == "SFO" | "OAK") ## Error: "OAK" alone is not a logical condition
## (2) Flights in January
filter(flights, month == 1)
## (3) Flights delayed by more than an hour on departure
filter(flights, dep_delay > 60)
## (4) Flights that departed between midnight and 5 am
filter(flights, hour >= 0, hour <= 5)
filter(flights, hour >= 0 & hour <= 5)
## (5) Flights whose arrival delay was more than twice the departure delay
filter(flights, arr_delay > 2 * dep_delay)
select(). What other ways can you select variables?
select(flights, arr_delay, dep_delay)
select(flights, arr_delay:dep_delay)
select(flights, ends_with("delay"))
select(flights, contains("delay"))
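A few more tidyselect helpers, added here for reference (these calls are not part of the original exercise):

select(flights, starts_with("dep"))  # names that begin with "dep"
select(flights, matches("_delay$"))  # names matching a regular expression
select(flights, -contains("delay"))  # drop columns instead of keeping them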
## (1) Order flights by date and scheduled departure time
arrange(flights, year, month, day, hour, minute)
## (2) Which flights were most delayed on departure? On arrival?
arrange(flights, desc(dep_delay))
arrange(flights, desc(arr_delay))
mutate(). How would you compute the hour and minute of departure from dep_time? (hint: you may need to use select() or View() to see your new variable)
## (1) Speed in mph from air_time (minutes) and distance (miles); which flight flew fastest?
fl <- mutate(flights, speed = distance / (air_time / 60))
arrange(fl, desc(speed))
## (2) How much time was made up (or lost) in flight?
mutate(fl, delta = dep_delay - arr_delay)
## (3) Extract hour and minute from dep_time, which is stored as HHMM
mutate(fl,
       hour = dep_time %/% 100,   # integer division
       minute = dep_time %% 100)  # remainder
by_month <- group_by(flights, month)
kable(summarise(by_month, Number_flights = n()))
| month | Number_flights |
|---|---|
| 1 | 27004 |
| 2 | 24951 |
| 3 | 28834 |
| 4 | 28330 |
| 5 | 28796 |
| 6 | 28243 |
| 7 | 29425 |
| 8 | 29327 |
| 9 | 27574 |
| 10 | 28889 |
| 11 | 27268 |
| 12 | 28135 |
by_month <- group_by(flights, month)
kable(summarise(by_month,
                mean_air_time = mean(air_time, na.rm = TRUE),
                sd_air_time = sd(air_time, na.rm = TRUE),
                planes = n()))
| month | mean_air_time | sd_air_time | planes |
|---|---|---|---|
| 1 | 154.1874 | 95.24514 | 27004 |
| 2 | 151.3464 | 92.26271 | 24951 |
| 3 | 149.0770 | 91.83097 | 28834 |
| 4 | 153.1011 | 94.53309 | 28330 |
| 5 | 145.7275 | 92.24838 | 28796 |
| 6 | 150.3252 | 93.95354 | 28243 |
| 7 | 146.7283 | 91.14821 | 29425 |
| 8 | 148.1604 | 92.88184 | 29327 |
| 9 | 143.4712 | 91.90547 | 27574 |
| 10 | 148.8861 | 93.58280 | 28889 |
| 11 | 155.4686 | 95.94203 | 27268 |
| 12 | 162.5914 | 97.13239 | 28135 |
How might you summarise dep_delay for each day?
hint: mean, sd, median, max, min, quantile, sum, n
by_date <- group_by(flights, month, day)
delays <- summarise(by_date,
mean = mean(dep_delay, na.rm = TRUE),
median = median(dep_delay, na.rm = TRUE),
q75 = quantile(dep_delay, 0.75, na.rm = TRUE),
over_15 = mean(dep_delay > 15, na.rm = TRUE),
over_30 = mean(dep_delay > 30, na.rm = TRUE),
over_60 = mean(dep_delay > 60, na.rm = TRUE)
)
hourly_delay <- filter(
  summarise(
    group_by(
      filter(flights, !is.na(dep_delay)),
      month, day, hour
    ),
    delay = mean(dep_delay),
    n = n()
  ),
  n > 10
)
\[ \LARGE x \; \text{\%>\%} \; f(y) \rightarrow f(x, y) \]
hourly_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(month, day, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)
Hint: pronounce %>% as "then".
Create data pipelines to answer the following questions:
1. Which destinations have the highest average arrival delay?
2. Which flights (carrier + flight number + dest) happen every day of the year?
3. On average, how do arrival delays vary over the course of a day? (Hint: hour + minute / 60)
## (1) Destinations ranked by average arrival delay
flights %>%
  group_by(dest) %>%
  summarise(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    n = n()) %>%
  arrange(desc(arr_delay))
## (2) Flights that operate all 365 days of the year
flights %>%
  group_by(carrier, flight, dest) %>%
  tally(sort = TRUE) %>%
  filter(n == 365)
OR
flights %>%
  group_by(carrier, flight, dest) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  filter(n == 365)
## (3) Average arrival delay by (decimal) time of day
per_hour <- flights %>%
  mutate(time = hour + minute / 60) %>%
  group_by(time) %>%
  summarise(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    n = n())
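per_hour is computed but never displayed; a quick plot makes the pattern visible (ggplot2 is an addition here, not loaded anywhere in the original):

library(ggplot2)

## Mean arrival delay across the day; point size shows how many flights back each mean
ggplot(per_hour, aes(x = time, y = arr_delay)) +
  geom_point(aes(size = n)) +
  geom_smooth()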
## Planes (identified by tail number) with more than 30 non-missing arrival delays
## Note: in nycflights13 the column is tailnum, not tail_num
planes <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  filter(n() > 30)
## Standardize each plane's arrival delays and keep extreme outliers (z > 5)
planes %>%
  mutate(z_delay =
           (arr_delay - mean(arr_delay)) / sd(arr_delay)) %>%
  filter(z_delay > 5)
## For each plane, the four flights with the smallest arrival delay
planes %>% filter(min_rank(arr_delay) < 5)
What’s the difference between min_rank(), row_number() and dense_rank()?
For each plane, find the two most delayed flights. Which of the three rank functions is most appropriate?
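To see how the three rank functions differ, here is a tiny made-up example (the vector x is hypothetical, not from flights):

x <- c(10, 10, 20, 30)  # hypothetical delays containing a tie

min_rank(x)    # 1 1 3 4 -- ties share the lowest rank, leaving a gap
row_number(x)  # 1 2 3 4 -- ties broken by position, no gaps
dense_rank(x)  # 1 1 2 3 -- ties share a rank, no gaps

For "the two most delayed flights per plane", row_number() is the safest choice: filter(row_number(desc(arr_delay)) <= 2) returns exactly two rows per plane even when delays tie, whereas min_rank() can return more.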
| Type | Action |
|---|---|
| inner | Include only rows in both x and y |
| left | Include all of x, and matching rows of y |
| semi | Include rows of x that match y |
| anti | Include rows of x that don’t match y |
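As a concrete illustration of the table above, here is a toy example (x and y are made-up data frames, not part of the flights data):

library(dplyr)

x <- data.frame(key = c(1, 2, 3), val_x = c("a", "b", "c"))
y <- data.frame(key = c(1, 2, 4), val_y = c("p", "q", "r"))

inner_join(x, y, by = "key")  # keys 1 and 2: rows present in both tables
left_join(x, y, by = "key")   # all of x; val_y is NA where key = 3
semi_join(x, y, by = "key")   # rows of x with a match in y, x's columns only
anti_join(x, y, by = "key")   # rows of x with no match in y (key = 3)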
hourly_delay <- flights %>%
  group_by(month, day, hour) %>%
  filter(!is.na(dep_delay)) %>%
  summarise(
    delay = mean(dep_delay),
    n = n()) %>%
  filter(n > 10)
## left_join() matches on the columns the two tables share: month, day and hour
delay_weather <- hourly_delay %>% left_join(weather)
Are older planes more likely to be delayed?
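One possible sketch of an answer, assuming the planes table from nycflights13 (its year column is the year the aircraft was built); plane_year and age are names introduced only for this illustration:

library(dplyr)
library(nycflights13)

## Average arrival delay by aircraft age (flight year minus build year)
age_delay <- flights %>%
  filter(!is.na(arr_delay)) %>%
  left_join(planes %>% select(tailnum, plane_year = year),
            by = "tailnum") %>%
  filter(!is.na(plane_year)) %>%
  mutate(age = year - plane_year) %>%
  group_by(age) %>%
  summarise(delay = mean(arr_delay), n = n())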
- Introduction to dplyr: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
- Data Processing with dplyr & tidyr: https://rpubs.com/bradleyboehmke/data_wrangling
- Data manipulation with tidyr: http://datascienceplus.com/data-manipulation-with-tidyr/