Review of assignment topics

Last week we learned the five verbs of dplyr and the group_by command. Let’s review them briefly with examples taken from the DataCamp course.

Flight data

library(dplyr)
library(hflights)
# convert to a tibble for compact printing (tbl_df() is deprecated in current dplyr)
hflights <- as_tibble(hflights)

Advanced select and summarise

select() accepts

  • column numbers, including ranges such as 1:4
  • column names, including ranges such as Year:DayOfWeek
  • helper functions: starts_with, ends_with, contains, matches, num_range, one_of

These can be mixed within a single call. A minus sign (-) in front of a column drops it from the result.

select(hflights,UniqueCarrier,ends_with("Num"),contains("Delay"))
## # A tibble: 227,496 x 5
##    UniqueCarrier FlightNum TailNum ArrDelay DepDelay
##  *         <chr>     <int>   <chr>    <int>    <int>
##  1            AA       428  N576AA      -10        0
##  2            AA       428  N557AA       -9        1
##  3            AA       428  N541AA       -8       -8
##  4            AA       428  N403AA        3        3
##  5            AA       428  N492AA       -3        5
##  6            AA       428  N262AA       -7       -1
##  7            AA       428  N493AA       -1       -1
##  8            AA       428  N477AA      -16       -5
##  9            AA       428  N476AA       44       43
## 10            AA       428  N504AA       43       43
## # ... with 227,486 more rows
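
The range and minus-sign forms listed above are not shown in the hflights call, so here is a minimal sketch on the built-in mtcars data. The object names (cars, by_position and so on) are made up for illustration.

```r
library(dplyr)

cars <- as_tibble(mtcars)

# a range of column positions and a range of column names select the same columns
by_position <- select(cars, 1:3)
by_name     <- select(cars, mpg:disp)

# positions, name ranges and helper functions can be mixed in one call
mixed <- select(cars, 1, hp:wt, starts_with("c"))

# a minus sign drops columns instead of keeping them
dropped <- select(cars, -(mpg:disp))

names(by_position)
names(mixed)
names(dropped)
```

The columns come back in the order in which they were requested, which is why cyl and carb appear last in mixed.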

summarise generates a summary table

summarise(hflights,
          n_obs = n(),
          n_carrier = n_distinct(UniqueCarrier),
          n_dest = n_distinct(Dest))
## # A tibble: 1 x 3
##    n_obs n_carrier n_dest
##    <int>     <int>  <int>
## 1 227496        15    116

Besides summarise-specific helpers such as n() and n_distinct(), ordinary R functions such as sum() and mean() can be used inside summarise().

hflights %>%
  group_by(UniqueCarrier) %>%
  summarise(p_canc = sum(Cancelled==1)/n()*100,
            avg_delay = mean(ArrDelay,na.rm=TRUE)) %>%
  arrange(avg_delay, p_canc)
## # A tibble: 15 x 3
##    UniqueCarrier    p_canc  avg_delay
##            <chr>     <dbl>      <dbl>
##  1            US 1.1268986 -0.6307692
##  2            AA 1.8495684  0.8917558
##  3            FL 0.9817672  1.8536239
##  4            AS 0.0000000  3.1923077
##  5            YV 1.2658228  4.0128205
##  6            DL 1.5903067  6.0841374
##  7            CO 0.6782614  6.0986983
##  8            MQ 2.9044750  7.1529751
##  9            EV 3.4482759  7.2569543
## 10            WN 1.5504047  7.5871430
## 11            F9 0.7159905  7.6682692
## 12            XE 1.5495599  8.1865242
## 13            OO 1.3946828  8.6934922
## 14            B6 2.5899281  9.8588410
## 15            UA 1.6409266 10.4628628

Summarise and mutate with group_by

hflights %>%
  filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  summarise(avg=mean(ArrDelay)) %>%
  mutate(rank=rank(avg)) %>%
  arrange(rank)
## # A tibble: 15 x 3
##    UniqueCarrier      avg  rank
##            <chr>    <dbl> <dbl>
##  1            YV 18.67568     1
##  2            F9 18.68683     2
##  3            US 20.70235     3
##  4            CO 22.13374     4
##  5            AS 22.91195     5
##  6            OO 24.14663     6
##  7            XE 24.19337     7
##  8            WN 25.27750     8
##  9            FL 27.85693     9
## 10            AA 28.49740    10
## 11            DL 32.12463    11
## 12            UA 32.48067    12
## 13            MQ 38.75135    13
## 14            EV 40.24231    14
## 15            B6 45.47744    15

Although summarise and mutate work within groups, their output is different:

  • summarise generates a new summary table, replacing the original data table
  • mutate adds a new column to the existing data table, with its values calculated within each group

The example code below adds a group-average column; the average is repeated for every member of the group.

Let’s demonstrate this with some sample data:

# tibble() replaces the deprecated data_frame()
sample_data <- tibble(group = sample(letters[1:3], 10, replace = TRUE),
                      value = rnorm(10))
sample_data %>%
  group_by(group) %>%
  mutate(group_average = mean(value)) %>%
  arrange(group)
## # A tibble: 10 x 3
## # Groups:   group [2]
##    group      value group_average
##    <chr>      <dbl>         <dbl>
##  1     b -0.5876116   -0.06753188
##  2     b -0.6694330   -0.06753188
##  3     b  1.0544489   -0.06753188
##  4     c -1.1338465   -0.73575947
##  5     c -0.3340361   -0.73575947
##  6     c -1.2066329   -0.73575947
##  7     c  0.1223628   -0.73575947
##  8     c -1.9220008   -0.73575947
##  9     c -0.3822265   -0.73575947
## 10     c -0.2939364   -0.73575947
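
To put the two verbs side by side, here is a small sketch on a hand-made tibble. The scores data and its column names are hypothetical; they are fixed values rather than random ones so the result is easy to check by hand.

```r
library(dplyr)

# hypothetical data: two teams with two and three members
scores <- tibble(team   = c("a", "a", "b", "b", "b"),
                 points = c(1, 3, 2, 4, 6))

# summarise(): one row per group, the original rows are gone
per_team <- scores %>%
  group_by(team) %>%
  summarise(team_avg = mean(points))

# mutate(): all original rows are kept, the group average is repeated
per_row <- scores %>%
  group_by(team) %>%
  mutate(team_avg = mean(points)) %>%
  ungroup()

nrow(per_team)
nrow(per_row)
```

Here per_team has one row per team, while per_row still has the five original rows with the team average attached to each.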

And now with actual data

hflights %>%
  filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  mutate(avg=mean(ArrDelay))
## # A tibble: 106,920 x 22
## # Groups:   UniqueCarrier [15]
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
##  1  2011     1          4         2    1403    1513            AA
##  2  2011     1          9         7    1443    1554            AA
##  3  2011     1         10         1    1443    1553            AA
##  4  2011     1         11         2    1429    1539            AA
##  5  2011     1         12         3    1419    1515            AA
##  6  2011     1         17         1    1530    1634            AA
##  7  2011     1         20         4    1507    1622            AA
##  8  2011     1         24         1    1356    1513            AA
##  9  2011     1         31         1    1441    1553            AA
## 10  2011     1          1         6     728     840            AA
## # ... with 106,910 more rows, and 15 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, avg <dbl>

Peeling off layers of group_by with each summarise

mtcars %>%
  select(cyl, gear, carb) %>%
  group_by(cyl, gear) %>%
  summarise(count = n()) %>%
  summarise(count1 = max(count))
## # A tibble: 3 x 2
##     cyl count1
##   <dbl>  <dbl>
## 1     4      8
## 2     6      4
## 3     8     12

hflights %>%
  group_by(Origin,Dest)  %>%
  summarise(count_d=n()) %>% 
  summarise(max_flights=max(count_d))
## # A tibble: 2 x 2
##   Origin max_flights
##    <chr>       <dbl>
## 1    HOU        8243
## 2    IAH        5748
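
The peeling can be made visible with group_vars(), which reports the grouping variables of a table. This is a minimal sketch on mtcars; the intermediate names step1 and step2 are made up for illustration.

```r
library(dplyr)

by_cyl_gear <- mtcars %>%
  group_by(cyl, gear)

# the first summarise() removes the last grouping variable (gear)
step1 <- by_cyl_gear %>%
  summarise(count = n())

# the second summarise() removes the remaining one (cyl)
step2 <- step1 %>%
  summarise(count1 = max(count))

group_vars(by_cyl_gear)
group_vars(step1)
group_vars(step2)
```

Recent dplyr versions also print a message about this regrouping after the first summarise(); it is informational and does not change the result.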

Which destination airport receives the largest number of flights originating from HOU? (Please figure out the code.)

## # A tibble: 2 x 3
## # Groups:   Origin [2]
##   Origin  Dest count_d
##    <chr> <chr>   <int>
## 1    HOU   DAL    8243
## 2    IAH   ORD    5748

A reminder about the pipe %>%

The pipe sends the value of the left-hand side (LHS) expression to the first argument of the right-hand side (RHS) function.

sum(1:8) %>%
  sqrt()
## [1] 6
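
As a small sketch of the same rule, the lines below compare piped calls with their nested equivalents. The variable names are made up for illustration.

```r
library(dplyr)  # dplyr makes the %>% operator available

# x %>% f() is the same as f(x)
piped  <- sum(1:8) %>% sqrt()
nested <- sqrt(sum(1:8))

# x %>% f(y) is the same as f(x, y): the LHS fills the first argument
piped_arg  <- 8 %>% log(base = 2)
nested_arg <- log(8, base = 2)

# pipes can be chained and are read from left to right
chained <- c(1, 4, 9) %>% sqrt() %>% sum()

identical(piped, nested)
identical(piped_arg, nested_arg)
chained
```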

Let’s practice what we learned last week

Daily exercises at DataCamp

Please do daily exercises for:

Review of assignment scores

Please review your assignment scores below and see the instructor if there is a problem with your score. Some information about the table:

StudentNo  GroupNo  R_Intro  Data_Import  Data_Manipulation
12051035   gr1      0.979    0.931        0.847
12051048   gr1      0.963    0.733        0.747
13056014   gr1      0.969    0.972        0.912
13056016   gr1      0.984    0.992        0.912
13056022   gr1      1.000    1.000        1.000
13056034   gr1      0.963    0.792        1.000
13056043   gr1      1.000    0.972        1.000
13056046   gr1      1.000    0.778        1.000
13056050   gr1      1.000    0.775        1.000
12056002   gr2      0.958    0.525        0.272
1205A005   gr2      1.000    0.333        0.084
1205A041   gr2      1.000    0.778        1.000
1205A042   gr2      1.000    0.778        1.000
13056004   gr2      1.000    0.792        0.962
1305A002   gr2      1.000    0.792        1.000
1305A005   gr2      0.974    0.783        0.941
1305A006   gr2      0.995    0.778        1.000
1305A011   gr2      1.000    0.778        1.000
1305A014   gr2      0.265    0.806        0.351
1305A015   gr2      0.000    0.056        0.340
1305A016   gr2      1.000    0.778        1.000
1305A029   gr2      0.990    0.847        0.971
1305A032   gr2      1.000    1.000        1.000
1305A034   gr2      0.879    0.697        1.000
1305A035   gr2      1.000    0.722        0.941
1305A042   gr2      1.000    0.778        0.962
1305A043   gr2      0.995    0.222        0.324
1305A044   gr2      0.906    0.983        0.971
1305A045   gr2      1.000    0.750        0.991
14056012   gr2      0.968    0.567        0.500
14056903   gr2      1.000    1.000        0.529
1405A015   gr2      1.000    0.819        1.000
1405A024   gr2      1.000    0.847        0.641
1405A044   gr2      1.000    1.000        1.000
1405A902   gr2      0.306    0.778        1.000
1405A903   gr2      1.000    0.992        0.991

Markdown and R = RMarkdown

Will be discussed next week.

Important points before the assignment for next week

If you want to run the DataCamp code on your own computer (you should!), you can download the datasets used in each course. The home page of each course has a link for downloading its datasets. Your instructor cannot emphasize this enough: you will not digest what you have learned unless you practice writing code from scratch.

Datasets are available at DataCamp courses

The datasets are provided in .RData or .csv format. An .RData file stores data frames and other objects exactly as they were saved; the load() command restores them into the workspace under their original names.

For example, iris.RData file contains iris, iris.tidy, iris.wide and iris.wide2 data frames.

library(ggplot2)
# this is the iris dataset used in DataCamp ggplot course
# you need to change the folder for your local setup
load("~/Documents/YTU/DERSLER/2017GUZ/veri-analizi/lesson06/iris.RData")
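
Since the course files themselves are not reproduced here, the following sketch imitates both formats with temporary files. The objects measurements and labels are hypothetical stand-ins for whatever a DataCamp .RData file contains.

```r
# hypothetical objects standing in for the contents of a course file
measurements <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
labels <- c("low", "mid", "high")

# .RData: several objects are stored together under their own names
rdata_path <- tempfile(fileext = ".RData")
save(measurements, labels, file = rdata_path)
rm(measurements, labels)

# load() recreates the objects and returns their names as a character vector
restored <- load(rdata_path)
restored

# .csv: a single table, which has to be assigned to a name when it is read
csv_path <- tempfile(fileext = ".csv")
write.csv(measurements, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
from_csv
```

Keeping the return value of load() is a convenient way to see which objects a downloaded .RData file has added to the workspace.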

Data, Aesthetics and Geometry are separate layers

ggplot(iris)

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width))

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()

ggplot object

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
p
p + geom_point()
p + geom_jitter()
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
p
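
One way to see that a plot is an ordinary R object is to count its layers before and after adding a geom. This is a rough sketch using the built-in iris data; it assumes the layers of a plot can be inspected through p$layers, and the name p_points is made up.

```r
library(ggplot2)

# data and aesthetics only: nothing is drawn yet
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
length(p$layers)

# adding a geom returns a new object with one layer
p_points <- p + geom_point()
length(p_points$layers)

# the original object is unchanged until it is reassigned
length(p$layers)
```

This is why p + geom_point() and p + geom_jitter() can both be drawn from the same p: neither call modifies p itself.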

Attribute vs. Mapping

To map an aesthetic to a variable, define it inside aes(). To keep an aesthetic constant, set it as an attribute outside of aes().

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(col = "red")

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()
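
The difference becomes clear when a constant is put inside aes() by mistake. The sketch below uses layer_data() from ggplot2 to look at the colours that are actually drawn; the plot names are made up for illustration.

```r
library(ggplot2)

# attribute: the constant "red" is used as the colour itself
p_attribute <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(col = "red")

# mapping a constant: "red" is treated as a data value with a single level,
# so the points get a colour from the default palette, plus a legend
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = "red"))

attribute_colour <- unique(layer_data(p_attribute)$colour)
mapped_colour    <- unique(layer_data(p_mapped)$colour)

attribute_colour
mapped_colour
```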

Global vs. Local Aesthetics

Aesthetic mappings can be set globally in ggplot() or locally inside a geom, so the mappings of each layer can be controlled independently.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()   

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species))

Inheritance

A globally assigned aesthetic is inherited by all following layers.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()   

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species))
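
With a single layer the two plots above look the same; the difference only shows once a second layer is added. The sketch below adds geom_smooth(), which is not part of the original example and is used here only for illustration, and counts the groups in the smoothing layer with layer_data().

```r
library(ggplot2)

# global mapping: the smoothing layer inherits col = Species
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# local mapping: only the points are coloured, the smoothing layer sees no Species
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# number of separate lines drawn by the second layer of each plot
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))

n_lines_global
n_lines_local
```

The global version fits one line per species, the local version fits a single line through all points.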

Assignments for next week

The following chapters are the assignment for this week.

Please use GitHub issues if you are having problems with concepts or code.