Last week we learned the five verbs of dplyr and the group_by command. Let’s review them briefly with examples taken from the DataCamp course.
library(dplyr)
library(hflights)
hflights <- tbl_df(hflights)

select() accepts the helper functions starts_with, ends_with, contains, matches, num_range and one_of, which can be mixed in a single call. A minus sign (-) in front of a column or helper hides the matching columns.
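As a minimal sketch of mixing helpers and of the minus sign, the block below uses a small hand-made tibble. The name `toy` and its values are hypothetical; only the column names are borrowed from hflights.

```r
library(dplyr)

# Hypothetical toy data with hflights-style column names
toy <- tibble(UniqueCarrier = c("AA", "UA"),
              FlightNum     = c(428L, 1L),
              TailNum       = c("N576AA", "N12345"),
              ArrDelay      = c(-10L, 5L),
              DepDelay      = c(0L, 7L),
              Dest          = c("DFW", "ORD"))

# Mix a plain column name with two helpers
kept <- select(toy, UniqueCarrier, ends_with("Num"), contains("Delay"))
names(kept)

# A minus sign hides every column matched by the helper
hidden <- select(toy, -contains("Delay"))
names(hidden)
```

The first call keeps UniqueCarrier plus the two "Num" and two "Delay" columns; the second keeps everything except the "Delay" columns.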
select(hflights, UniqueCarrier, ends_with("Num"), contains("Delay"))
## # A tibble: 227,496 x 5
##    UniqueCarrier FlightNum TailNum ArrDelay DepDelay
##  *         <chr>     <int>   <chr>    <int>    <int>
##  1            AA       428  N576AA      -10        0
##  2            AA       428  N557AA       -9        1
##  3            AA       428  N541AA       -8       -8
##  4            AA       428  N403AA        3        3
##  5            AA       428  N492AA       -3        5
##  6            AA       428  N262AA       -7       -1
##  7            AA       428  N493AA       -1       -1
##  8            AA       428  N477AA      -16       -5
##  9            AA       428  N476AA       44       43
## 10            AA       428  N504AA       43       43
## # ... with 227,486 more rows

summarise generates a summary table:
summarise(hflights,
          n_obs = n(),
          n_carrier = n_distinct(UniqueCarrier),
          n_dest = n_distinct(Dest))
## # A tibble: 1 x 3
##    n_obs n_carrier n_dest
##    <int>     <int>  <int>
## 1 227496        15    116

Both regular R functions (such as sum() and mean()) and summarise-specific functions (such as n() and n_distinct()) can be used inside summarise.
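A small sketch of combining both kinds of functions in one summarise call. The tibble `toy` and its column names are hypothetical and chosen so the results are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data: five flights, one missing delay
toy <- tibble(carrier = c("AA", "AA", "UA", "UA", "UA"),
              delay   = c(5, NA, 10, 20, 30))

res <- summarise(toy,
                 n_obs     = n(),                       # dplyr: row count
                 n_carrier = n_distinct(carrier),       # dplyr: distinct values
                 first_car = first(carrier),            # dplyr: first value
                 avg_delay = mean(delay, na.rm = TRUE), # base R
                 max_delay = max(delay, na.rm = TRUE))  # base R
res
```

Note that n() takes no arguments and only works inside dplyr verbs, while mean() and max() need na.rm = TRUE here because of the missing delay.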
hflights %>%
  group_by(UniqueCarrier) %>%
  summarise(p_canc = sum(Cancelled==1)/n()*100,
            avg_delay = mean(ArrDelay,na.rm=TRUE)) %>%
  arrange(avg_delay, p_canc)
## # A tibble: 15 x 3
##    UniqueCarrier    p_canc  avg_delay
##            <chr>     <dbl>      <dbl>
##  1            US 1.1268986 -0.6307692
##  2            AA 1.8495684  0.8917558
##  3            FL 0.9817672  1.8536239
##  4            AS 0.0000000  3.1923077
##  5            YV 1.2658228  4.0128205
##  6            DL 1.5903067  6.0841374
##  7            CO 0.6782614  6.0986983
##  8            MQ 2.9044750  7.1529751
##  9            EV 3.4482759  7.2569543
## 10            WN 1.5504047  7.5871430
## 11            F9 0.7159905  7.6682692
## 12            XE 1.5495599  8.1865242
## 13            OO 1.3946828  8.6934922
## 14            B6 2.5899281  9.8588410
## 15            UA 1.6409266 10.4628628

hflights %>%
  filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  summarise(avg=mean(ArrDelay)) %>%
  mutate(rank=rank(avg)) %>%
  arrange(rank)
## # A tibble: 15 x 3
##    UniqueCarrier      avg  rank
##            <chr>    <dbl> <dbl>
##  1            YV 18.67568     1
##  2            F9 18.68683     2
##  3            US 20.70235     3
##  4            CO 22.13374     4
##  5            AS 22.91195     5
##  6            OO 24.14663     6
##  7            XE 24.19337     7
##  8            WN 25.27750     8
##  9            FL 27.85693     9
## 10            AA 28.49740    10
## 11            DL 32.12463    11
## 12            UA 32.48067    12
## 13            MQ 38.75135    13
## 14            EV 40.24231    14
## 15            B6 45.47744    15

Although summarise and mutate both work within groups, their output is different:

- summarise generates a new summary table, replacing the original data table
- mutate adds a new column to the existing data table, with the values calculated within each group

The example code below adds a group average column; the average value is repeated for each member of the group.
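The difference can be sketched side by side on fixed numbers, so the result is the same on every run. The tibble `toy` and the result names `by_group` and `with_avg` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two groups with known values
toy <- tibble(group = c("a", "a", "b", "b", "b"),
              value = c(1, 3, 10, 20, 30))

# summarise(): one row per group, the original rows are gone
by_group <- toy %>%
  group_by(group) %>%
  summarise(group_average = mean(value))

# mutate(): all five rows are kept, the group mean is repeated on each row
with_avg <- toy %>%
  group_by(group) %>%
  mutate(group_average = mean(value)) %>%
  ungroup()

nrow(by_group)            # 2
nrow(with_avg)            # 5
with_avg$group_average    # 2 2 20 20 20
```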
Let’s demonstrate this with sample data:
# tibble() replaces the deprecated data_frame(); the values are random, so your output will differ
sample_data <- tibble(group = sample(letters[1:3], 10, replace = TRUE),
                      value = rnorm(10))
sample_data %>%
  group_by(group) %>%
  mutate(group_average = mean(value)) %>%
  arrange(group)
## # A tibble: 10 x 3
## # Groups:   group [2]
##    group      value group_average
##    <chr>      <dbl>         <dbl>
##  1     b -0.5876116   -0.06753188
##  2     b -0.6694330   -0.06753188
##  3     b  1.0544489   -0.06753188
##  4     c -1.1338465   -0.73575947
##  5     c -0.3340361   -0.73575947
##  6     c -1.2066329   -0.73575947
##  7     c  0.1223628   -0.73575947
##  8     c -1.9220008   -0.73575947
##  9     c -0.3822265   -0.73575947
## 10     c -0.2939364   -0.73575947

And now with actual data:
hflights %>%
  filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  mutate(avg=mean(ArrDelay))
## # A tibble: 106,920 x 22
## # Groups:   UniqueCarrier [15]
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
##  1  2011     1          4         2    1403    1513            AA
##  2  2011     1          9         7    1443    1554            AA
##  3  2011     1         10         1    1443    1553            AA
##  4  2011     1         11         2    1429    1539            AA
##  5  2011     1         12         3    1419    1515            AA
##  6  2011     1         17         1    1530    1634            AA
##  7  2011     1         20         4    1507    1622            AA
##  8  2011     1         24         1    1356    1513            AA
##  9  2011     1         31         1    1441    1553            AA
## 10  2011     1          1         6     728     840            AA
## # ... with 106,910 more rows, and 15 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, avg <dbl>

mtcars %>%
  select(cyl, gear, carb) %>%
  group_by(cyl, gear) %>%
  summarise(count = n()) %>%
  summarise(count1 = max(count))
## # A tibble: 3 x 2
##     cyl count1
##   <dbl>  <dbl>
## 1     4      8
## 2     6      4
## 3     8     12

hflights %>%
  group_by(Origin,Dest)  %>%
  summarise(count_d=n()) %>% 
  summarise(max_flights=max(count_d))
## # A tibble: 2 x 2
##   Origin max_flights
##    <chr>       <dbl>
## 1    HOU        8243
## 2    IAH        5748

Which destination airport received the maximum number of flights originating from HOU? (Please figure out the code.)
## # A tibble: 2 x 3
## # Groups:   Origin [2]
##   Origin  Dest count_d
##    <chr> <chr>   <int>
## 1    HOU   DAL    8243
## 2    IAH   ORD    5748

The pipe operator %>% sends the output of the left-hand side (LHS) expression to the first argument of the right-hand side (RHS) function.
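A short sketch of what that means in practice: a piped call and the equivalent nested call give the same result, and any extra arguments on the RHS come after the piped value. The variable names are hypothetical.

```r
library(dplyr)   # dplyr re-exports %>% from magrittr

# Piped form and nested form are equivalent
piped  <- sum(1:8) %>% sqrt()
nested <- sqrt(sum(1:8))
identical(piped, nested)

# The LHS becomes the FIRST argument; other arguments follow it
piped_round  <- c(1.234, 5.678) %>% round(1)
nested_round <- round(c(1.234, 5.678), 1)
identical(piped_round, nested_round)
```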
sum(1:8) %>%
  sqrt()
## [1] 6

Please review the assignment scores below and see the instructor if there is a problem with your score. Here is some information about the table:
| StudentNo | GroupNo | R_Intro | Data_Import | Data_Manipulation | 
|---|---|---|---|---|
| 12051035 | gr1 | 0.979 | 0.931 | 0.847 | 
| 12051048 | gr1 | 0.963 | 0.733 | 0.747 | 
| 13056014 | gr1 | 0.969 | 0.972 | 0.912 | 
| 13056016 | gr1 | 0.984 | 0.992 | 0.912 | 
| 13056022 | gr1 | 1.000 | 1.000 | 1.000 | 
| 13056034 | gr1 | 0.963 | 0.792 | 1.000 | 
| 13056043 | gr1 | 1.000 | 0.972 | 1.000 | 
| 13056046 | gr1 | 1.000 | 0.778 | 1.000 | 
| 13056050 | gr1 | 1.000 | 0.775 | 1.000 | 
| 12056002 | gr2 | 0.958 | 0.525 | 0.272 | 
| 1205A005 | gr2 | 1.000 | 0.333 | 0.084 | 
| 1205A041 | gr2 | 1.000 | 0.778 | 1.000 | 
| 1205A042 | gr2 | 1.000 | 0.778 | 1.000 | 
| 13056004 | gr2 | 1.000 | 0.792 | 0.962 | 
| 1305A002 | gr2 | 1.000 | 0.792 | 1.000 | 
| 1305A005 | gr2 | 0.974 | 0.783 | 0.941 | 
| 1305A006 | gr2 | 0.995 | 0.778 | 1.000 | 
| 1305A011 | gr2 | 1.000 | 0.778 | 1.000 | 
| 1305A014 | gr2 | 0.265 | 0.806 | 0.351 | 
| 1305A015 | gr2 | 0.000 | 0.056 | 0.340 | 
| 1305A016 | gr2 | 1.000 | 0.778 | 1.000 | 
| 1305A029 | gr2 | 0.990 | 0.847 | 0.971 | 
| 1305A032 | gr2 | 1.000 | 1.000 | 1.000 | 
| 1305A034 | gr2 | 0.879 | 0.697 | 1.000 | 
| 1305A035 | gr2 | 1.000 | 0.722 | 0.941 | 
| 1305A042 | gr2 | 1.000 | 0.778 | 0.962 | 
| 1305A043 | gr2 | 0.995 | 0.222 | 0.324 | 
| 1305A044 | gr2 | 0.906 | 0.983 | 0.971 | 
| 1305A045 | gr2 | 1.000 | 0.750 | 0.991 | 
| 14056012 | gr2 | 0.968 | 0.567 | 0.500 | 
| 14056903 | gr2 | 1.000 | 1.000 | 0.529 | 
| 1405A015 | gr2 | 1.000 | 0.819 | 1.000 | 
| 1405A024 | gr2 | 1.000 | 0.847 | 0.641 | 
| 1405A044 | gr2 | 1.000 | 1.000 | 1.000 | 
| 1405A902 | gr2 | 0.306 | 0.778 | 1.000 | 
| 1405A903 | gr2 | 1.000 | 0.992 | 0.991 | 
Will be discussed next week.
If you are interested in running the DataCamp code on your own computer (you should be!), you can download the datasets used during the course. The home page of each course has a link for downloading the datasets used in that course. Your instructor cannot emphasize this enough: you won’t digest what you have learned unless you practice writing code from scratch.
Datasets are available on the DataCamp course pages.
The datasets are provided in .RData or .csv format. Loading an .RData file with the load() command restores the data frames and other objects exactly as they were saved.
For example, the iris.RData file contains the iris, iris.tidy, iris.wide and iris.wide2 data frames.
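To see how an .RData file carries several objects at once, here is a small round trip with save() and load(). The objects `df_a` and `df_b` and the temporary file are hypothetical stand-ins, not the course datasets.

```r
# Hypothetical objects standing in for a course dataset
df_a <- data.frame(x = 1:3)
df_b <- data.frame(y = c("u", "v"))

# Save both objects into one .RData file in a temporary location
path <- tempfile(fileext = ".RData")
save(df_a, df_b, file = path)

# Load into a fresh environment so nothing in the workspace is overwritten
env <- new.env()
loaded <- load(path, envir = env)

loaded                       # names of the restored objects
identical(env$df_a, df_a)    # the data frame comes back unchanged
```

load() returns the names of the objects it restored, which is a quick way to find out what a downloaded .RData file contains.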
library(ggplot2)
# this is the iris dataset used in DataCamp ggplot course
# you need to change the folder for your local setup
load("~/Documents/YTU/DERSLER/2017GUZ/veri-analizi/lesson06/iris.RData")

ggplot(iris)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
p
p + geom_point()
p + geom_jitter()
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
p

Mappings are done within aes() when we want to map an aesthetic to a variable. If an aesthetic is to be constant, it is set as an attribute outside aes().
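A common slip is to put a constant inside aes(). The sketch below contrasts the two forms; it assumes the built-in iris data frame that ships with R (same column names as the course file), and the plot names `p_attr` and `p_map` are hypothetical.

```r
library(ggplot2)

# Constant colour: set as an attribute, outside aes()
p_attr <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(col = "red")

# The same string inside aes() is treated as a one-level variable,
# so ggplot2 picks its own colour and draws a legend for it
p_map <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = "red"))

# Inspect where each setting ended up in the first layer
p_attr$layers[[1]]$aes_params      # fixed attribute: colour = "red"
names(p_map$layers[[1]]$mapping)   # mapped aesthetic: "colour"
```

Printing p_attr and p_map shows the difference visually: only p_attr draws the points in actual red.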
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(col = "red")
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()

Aesthetic mappings can be controlled for each layer independently:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()   
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species))

A globally assigned aesthetic is inherited by the following layers.
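Inheritance only becomes visible once a plot has more than one layer. As a sketch, the block below adds a geom_smooth() layer (an illustrative choice, not part of the original examples) and counts how many groups that second layer sees. It assumes the built-in iris data frame; the plot names are hypothetical.

```r
library(ggplot2)

# Global mapping: col = Species is inherited by both layers
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Local mapping: only geom_point() sees col = Species
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Number of groups seen by the smoothing layer (layer 2)
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local  <- length(unique(layer_data(p_local, 2)$group))
c(global = n_global, local = n_local)
```

With the global mapping the smoother is fitted once per species; with the local mapping it is fitted once for all points.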
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()   
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species))

The following chapters are this week’s assignment.
Please use GitHub issues if you’re having problems with concepts or code.