Last week we learned the five dplyr verbs and the group_by() command. Let’s review them briefly with examples taken from the DataCamp course.
library(dplyr)
library(hflights)
hflights <- tbl_df(hflights)
select() accepts the helper functions starts_with, ends_with, contains, matches, num_range and one_of, which can be combined in a single call. A minus sign (-) drops a column.
select(hflights,UniqueCarrier,ends_with("Num"),contains("Delay"))
## # A tibble: 227,496 x 5
## UniqueCarrier FlightNum TailNum ArrDelay DepDelay
## * <chr> <int> <chr> <int> <int>
## 1 AA 428 N576AA -10 0
## 2 AA 428 N557AA -9 1
## 3 AA 428 N541AA -8 -8
## 4 AA 428 N403AA 3 3
## 5 AA 428 N492AA -3 5
## 6 AA 428 N262AA -7 -1
## 7 AA 428 N493AA -1 -1
## 8 AA 428 N477AA -16 -5
## 9 AA 428 N476AA 44 43
## 10 AA 428 N504AA 43 43
## # ... with 227,486 more rows
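The helpers can be tried on any data frame. Below is a minimal sketch that assumes the built-in iris data instead of hflights; the object names are illustrative only.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)  # built-in data, used here only for illustration

# Helpers can be mixed with bare column names in a single call
sepal_cols <- select(iris_tbl, starts_with("Sepal"))
width_cols <- select(iris_tbl, Species, ends_with("Width"))

# A minus sign drops the column instead of keeping it
no_species <- select(iris_tbl, -Species)

names(sepal_cols)
names(width_cols)
names(no_species)
```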
summarise() generates a summary table.
summarise(hflights,
n_obs = n(),
n_carrier = n_distinct(UniqueCarrier),
n_dest = n_distinct(Dest))
## # A tibble: 1 x 3
## n_obs n_carrier n_dest
## <int> <int> <int>
## 1 227496 15 116
Both ordinary R functions (such as sum() and mean()) and dplyr-specific summary functions (such as n() and n_distinct()) can be used inside summarise().
hflights %>%
group_by(UniqueCarrier) %>%
summarise(p_canc = sum(Cancelled==1)/n()*100,
avg_delay = mean(ArrDelay,na.rm=TRUE)) %>%
arrange(avg_delay, p_canc)
## # A tibble: 15 x 3
## UniqueCarrier p_canc avg_delay
## <chr> <dbl> <dbl>
## 1 US 1.1268986 -0.6307692
## 2 AA 1.8495684 0.8917558
## 3 FL 0.9817672 1.8536239
## 4 AS 0.0000000 3.1923077
## 5 YV 1.2658228 4.0128205
## 6 DL 1.5903067 6.0841374
## 7 CO 0.6782614 6.0986983
## 8 MQ 2.9044750 7.1529751
## 9 EV 3.4482759 7.2569543
## 10 WN 1.5504047 7.5871430
## 11 F9 0.7159905 7.6682692
## 12 XE 1.5495599 8.1865242
## 13 OO 1.3946828 8.6934922
## 14 B6 2.5899281 9.8588410
## 15 UA 1.6409266 10.4628628
hflights %>%
filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
group_by(UniqueCarrier) %>%
summarise(avg=mean(ArrDelay)) %>%
mutate(rank=rank(avg)) %>%
arrange(rank)
## # A tibble: 15 x 3
## UniqueCarrier avg rank
## <chr> <dbl> <dbl>
## 1 YV 18.67568 1
## 2 F9 18.68683 2
## 3 US 20.70235 3
## 4 CO 22.13374 4
## 5 AS 22.91195 5
## 6 OO 24.14663 6
## 7 XE 24.19337 7
## 8 WN 25.27750 8
## 9 FL 27.85693 9
## 10 AA 28.49740 10
## 11 DL 32.12463 11
## 12 UA 32.48067 12
## 13 MQ 38.75135 13
## 14 EV 40.24231 14
## 15 B6 45.47744 15
Although summarise() and mutate() both work within groups, their output differs:
summarise() generates a new summary table that replaces the original data table.
mutate() adds a new column to the existing data table, with values calculated within each group.
The example below adds a group-average column; the average is repeated for every member of the group.
Let’s demonstrate this with sample data:
sample_data <- data_frame(group = sample(letters[1:3], 10, replace = TRUE),
value = rnorm(10))
sample_data %>%
group_by(group) %>%
mutate(group_average = mean(value)) %>%
arrange(group)
## # A tibble: 10 x 3
## # Groups: group [2]
## group value group_average
## <chr> <dbl> <dbl>
## 1 b -0.5876116 -0.06753188
## 2 b -0.6694330 -0.06753188
## 3 b 1.0544489 -0.06753188
## 4 c -1.1338465 -0.73575947
## 5 c -0.3340361 -0.73575947
## 6 c -1.2066329 -0.73575947
## 7 c 0.1223628 -0.73575947
## 8 c -1.9220008 -0.73575947
## 9 c -0.3822265 -0.73575947
## 10 c -0.2939364 -0.73575947
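For a side-by-side comparison of the two verbs, here is a small sketch on the built-in mtcars data (an assumption made for reproducibility, since the sample data above is random). Comparing the dimensions of the two results shows the difference.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# summarise(): one row per group; only the grouping and summary columns remain
summarised <- summarise(by_cyl, avg_mpg = mean(mpg))

# mutate(): every original row is kept; the group mean is repeated within each group
mutated <- mutate(by_cyl, avg_mpg = mean(mpg))

dim(summarised)  # 3 rows, 2 columns
dim(mutated)     # 32 rows, 12 columns
```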
And now with the actual data:
hflights %>%
filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
group_by(UniqueCarrier) %>%
mutate(avg=mean(ArrDelay))
## # A tibble: 106,920 x 22
## # Groups: UniqueCarrier [15]
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 4 2 1403 1513 AA
## 2 2011 1 9 7 1443 1554 AA
## 3 2011 1 10 1 1443 1553 AA
## 4 2011 1 11 2 1429 1539 AA
## 5 2011 1 12 3 1419 1515 AA
## 6 2011 1 17 1 1530 1634 AA
## 7 2011 1 20 4 1507 1622 AA
## 8 2011 1 24 1 1356 1513 AA
## 9 2011 1 31 1 1441 1553 AA
## 10 2011 1 1 6 728 840 AA
## # ... with 106,910 more rows, and 15 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, avg <dbl>
mtcars %>%
select(cyl, gear, carb) %>%
group_by(cyl, gear) %>%
summarise(count = n()) %>%
summarise(count1 = max(count))
## # A tibble: 3 x 2
## cyl count1
## <dbl> <dbl>
## 1 4 8
## 2 6 4
## 3 8 12
hflights %>%
group_by(Origin,Dest) %>%
summarise(count_d=n()) %>%
summarise(max_flights=max(count_d))
## # A tibble: 2 x 2
## Origin max_flights
## <chr> <dbl>
## 1 HOU 8243
## 2 IAH 5748
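The two examples above rely on the fact that, by default, each summarise() call peels off the last grouping variable, so the second summarise() works on the groups that remain. A short sketch that makes this visible with group_vars() (the intermediate object names are illustrative):

```r
library(dplyr)

by_two <- group_by(mtcars, cyl, gear)
once   <- summarise(by_two, count = n())         # still grouped by cyl
twice  <- summarise(once, count1 = max(count))   # no grouping left

group_vars(by_two)  # "cyl" "gear"
group_vars(once)    # "cyl"
group_vars(twice)   # character(0)
```

Recent dplyr versions print a message about the resulting grouping after the first summarise(); it is informational only.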
Which destination airport received the largest number of flights originating from HOU? (Please figure out the code.)
## # A tibble: 2 x 3
## # Groups: Origin [2]
## Origin Dest count_d
## <chr> <chr> <int>
## 1 HOU DAL 8243
## 2 IAH ORD 5748
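As a hint rather than a solution, here is one possible approach sketched on mtcars instead of hflights: after the first summarise() the table is still grouped by the first variable, so a filter() on the group maximum keeps the row that holds it. This is only one way to do it; the hflights version is left as the exercise.

```r
library(dplyr)

# For each number of cylinders, which gear count is the most common?
top_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(count = n()) %>%       # result is still grouped by cyl
  filter(count == max(count))      # keep the row with the largest count per cyl

top_gear
```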
%>% sends the output of the left-hand side (LHS) as the first argument of the function on the right-hand side (RHS).
sum(1:8) %>%
sqrt()
## [1] 6
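To see that a pipe is just another way of writing nested calls, the sketch below computes the same value both ways. The extra round() step is added only to show an argument being passed after the piped one.

```r
library(dplyr)  # makes %>% available

# Nested form: read from the inside out
nested <- round(sqrt(sum(1:8)), 1)

# Piped form: read from left to right; each result becomes the first argument
piped <- 1:8 %>%
  sum() %>%
  sqrt() %>%
  round(1)

identical(nested, piped)
```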
Please review the assignment scores below and see the instructor if there is a problem with your score. Here is some information about the table:
| StudentNo | GroupNo | R_Intro | Data_Import | Data_Manipulation |
|---|---|---|---|---|
| 12051035 | gr1 | 0.979 | 0.931 | 0.847 |
| 12051048 | gr1 | 0.963 | 0.733 | 0.747 |
| 13056014 | gr1 | 0.969 | 0.972 | 0.912 |
| 13056016 | gr1 | 0.984 | 0.992 | 0.912 |
| 13056022 | gr1 | 1.000 | 1.000 | 1.000 |
| 13056034 | gr1 | 0.963 | 0.792 | 1.000 |
| 13056043 | gr1 | 1.000 | 0.972 | 1.000 |
| 13056046 | gr1 | 1.000 | 0.778 | 1.000 |
| 13056050 | gr1 | 1.000 | 0.775 | 1.000 |
| 12056002 | gr2 | 0.958 | 0.525 | 0.272 |
| 1205A005 | gr2 | 1.000 | 0.333 | 0.084 |
| 1205A041 | gr2 | 1.000 | 0.778 | 1.000 |
| 1205A042 | gr2 | 1.000 | 0.778 | 1.000 |
| 13056004 | gr2 | 1.000 | 0.792 | 0.962 |
| 1305A002 | gr2 | 1.000 | 0.792 | 1.000 |
| 1305A005 | gr2 | 0.974 | 0.783 | 0.941 |
| 1305A006 | gr2 | 0.995 | 0.778 | 1.000 |
| 1305A011 | gr2 | 1.000 | 0.778 | 1.000 |
| 1305A014 | gr2 | 0.265 | 0.806 | 0.351 |
| 1305A015 | gr2 | 0.000 | 0.056 | 0.340 |
| 1305A016 | gr2 | 1.000 | 0.778 | 1.000 |
| 1305A029 | gr2 | 0.990 | 0.847 | 0.971 |
| 1305A032 | gr2 | 1.000 | 1.000 | 1.000 |
| 1305A034 | gr2 | 0.879 | 0.697 | 1.000 |
| 1305A035 | gr2 | 1.000 | 0.722 | 0.941 |
| 1305A042 | gr2 | 1.000 | 0.778 | 0.962 |
| 1305A043 | gr2 | 0.995 | 0.222 | 0.324 |
| 1305A044 | gr2 | 0.906 | 0.983 | 0.971 |
| 1305A045 | gr2 | 1.000 | 0.750 | 0.991 |
| 14056012 | gr2 | 0.968 | 0.567 | 0.500 |
| 14056903 | gr2 | 1.000 | 1.000 | 0.529 |
| 1405A015 | gr2 | 1.000 | 0.819 | 1.000 |
| 1405A024 | gr2 | 1.000 | 0.847 | 0.641 |
| 1405A044 | gr2 | 1.000 | 1.000 | 1.000 |
| 1405A902 | gr2 | 0.306 | 0.778 | 1.000 |
| 1405A903 | gr2 | 1.000 | 0.992 | 0.991 |
Will be discussed next week.
If you are interested in running the DataCamp code on your own computer (you should be!), you can download the datasets used in the courses. The home page of each course has a link for downloading its datasets. Your instructor cannot emphasize this enough: you will not digest what you have learned unless you practice writing code from scratch.
Datasets are available at DataCamp courses
The datasets are provided in .RData or .csv format. When loaded with the load() command, an .RData file restores the data frames and other objects exactly as they were saved.
For example, the iris.RData file contains the iris, iris.tidy, iris.wide and iris.wide2 data frames.
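If you are unsure what an .RData file contains, note that load() invisibly returns the names of the objects it restores. The sketch below illustrates this with a temporary file and two made-up objects (iris_small and cyl_counts are hypothetical names, not part of the DataCamp files).

```r
# Save two objects into a temporary .RData file
path <- tempfile(fileext = ".RData")
iris_small <- head(iris, 3)
cyl_counts <- table(mtcars$cyl)
save(iris_small, cyl_counts, file = path)

# Remove them, then restore them under their original names
rm(iris_small, cyl_counts)
restored <- load(path)

restored            # names of the objects that load() created
nrow(iris_small)    # the data frame is back, unchanged
```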
library(ggplot2)
# this is the iris dataset used in DataCamp ggplot course
# you need to change the folder for your local setup
load("~/Documents/YTU/DERSLER/2017GUZ/veri-analizi/lesson06/iris.RData")
ggplot(iris)
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width))
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
p
p + geom_point()
p + geom_jitter()
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
p
Use aes() to map an aesthetic to a variable. If an aesthetic should be constant, set it as an attribute outside aes().
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(col = "red")
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point()
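A common slip is to put a constant inside aes(). The sketch below, which assumes the built-in iris data, builds the three variants so that they can be printed and compared. In the last one "red" is treated as a data value, so the points get a default palette colour and a legend instead of being drawn in red.

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

# Attribute: a constant colour, set outside aes()
p_const <- base + geom_point(col = "red")

# Mapping: colour depends on a variable, set inside aes()
p_mapped <- base + geom_point(aes(col = Species))

# Pitfall: a constant inside aes() is mapped like a one-level variable
p_pitfall <- base + geom_point(aes(col = "red"))

p_const
p_mapped
p_pitfall
```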
Aesthetic mappings can be controlled for each layer independently.
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point()
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(col = Species))
A globally assigned aesthetic is inherited by the layers that follow.
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point()
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(col = Species))
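With a single layer the two plots above look the same; the difference shows up once a second layer is added. Here is a sketch assuming the built-in iris data, with a geom_smooth() layer added purely for illustration: the global mapping yields one fitted line per species, whereas the local mapping colours only the points and yields a single line.

```r
library(ggplot2)

# Global mapping: both layers inherit col = Species
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Local mapping: only the point layer knows about Species
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species)) +
  geom_smooth(method = "lm", se = FALSE)

p_global
p_local
```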
This week the following chapters are the assignment.
Please use GitHub issues if you are having problems with concepts or code.