Review of assignment topics

Last week we learned about the five verbs of dplyr and the group_by command. Let’s review them briefly with examples taken from the DataCamp course.

Flight data

library(dplyr)
library(hflights)
hflights <- as_tibble(hflights)

Advanced select and summarise

select() accepts

column numbers (accepting range)
column names (accepting range)
special functions: starts_with, ends_with, contains, matches, num_range, one_of

These can be mixed in a single call. A minus sign (-) in front of a column drops that column from the result.

select(hflights,UniqueCarrier,ends_with("Num"),contains("Delay"))

## # A tibble: 227,496 x 5
##    UniqueCarrier FlightNum TailNum ArrDelay DepDelay
##  *         <chr>     <int>   <chr>    <int>    <int>
##  1            AA       428  N576AA      -10        0
##  2            AA       428  N557AA       -9        1
##  3            AA       428  N541AA       -8       -8
##  4            AA       428  N403AA        3        3
##  5            AA       428  N492AA       -3        5
##  6            AA       428  N262AA       -7       -1
##  7            AA       428  N493AA       -1       -1
##  8            AA       428  N477AA      -16       -5
##  9            AA       428  N476AA       44       43
## 10            AA       428  N504AA       43       43
## # ... with 227,486 more rows
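
As a small sketch of mixing these selection styles with the minus sign, the calls below use the built-in iris data (an assumption, chosen so the example runs without hflights). The object names are only illustrative.

```r
library(dplyr)

# keep every column except Species
no_species <- select(iris, -Species)
names(no_species)

# combine a helper with a minus sign:
# keep columns starting with "Sepal", then drop those ending with "Width"
sepal_no_width <- select(iris, starts_with("Sepal"), -ends_with("Width"))
names(sepal_no_width)

# a range of column names and a column number can be mixed as well
mixed <- select(iris, Sepal.Length:Petal.Length, 5)
names(mixed)
```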

summarise generates a summary table

summarise(hflights,
          n_obs = n(),
          n_carrier = n_distinct(UniqueCarrier),
          n_dest = n_distinct(Dest))

## # A tibble: 1 x 3
##    n_obs n_carrier n_dest
##    <int>     <int>  <int>
## 1 227496        15    116

Both ordinary R functions (such as sum() and mean()) and dplyr’s own summary functions (such as n() and n_distinct()) can be used inside summarise.

hflights %>%
  group_by(UniqueCarrier) %>%
  summarise(p_canc = sum(Cancelled==1)/n()*100,
            avg_delay = mean(ArrDelay,na.rm=TRUE)) %>%
  arrange(avg_delay, p_canc)

## # A tibble: 15 x 3
##    UniqueCarrier    p_canc  avg_delay
##            <chr>     <dbl>      <dbl>
##  1            US 1.1268986 -0.6307692
##  2            AA 1.8495684  0.8917558
##  3            FL 0.9817672  1.8536239
##  4            AS 0.0000000  3.1923077
##  5            YV 1.2658228  4.0128205
##  6            DL 1.5903067  6.0841374
##  7            CO 0.6782614  6.0986983
##  8            MQ 2.9044750  7.1529751
##  9            EV 3.4482759  7.2569543
## 10            WN 1.5504047  7.5871430
## 11            F9 0.7159905  7.6682692
## 12            XE 1.5495599  8.1865242
## 13            OO 1.3946828  8.6934922
## 14            B6 2.5899281  9.8588410
## 15            UA 1.6409266 10.4628628

Summarise and mutate with group_by

hflights %>%
  filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  summarise(avg=mean(ArrDelay)) %>%
  mutate(rank=rank(avg)) %>%
  arrange(rank)

## # A tibble: 15 x 3
##    UniqueCarrier      avg  rank
##            <chr>    <dbl> <dbl>
##  1            YV 18.67568     1
##  2            F9 18.68683     2
##  3            US 20.70235     3
##  4            CO 22.13374     4
##  5            AS 22.91195     5
##  6            OO 24.14663     6
##  7            XE 24.19337     7
##  8            WN 25.27750     8
##  9            FL 27.85693     9
## 10            AA 28.49740    10
## 11            DL 32.12463    11
## 12            UA 32.48067    12
## 13            MQ 38.75135    13
## 14            EV 40.24231    14
## 15            B6 45.47744    15

Although summarise and mutate both work within groups, their output is different:

summarise generates a new summary table with one row per group, replacing the original data table
mutate adds a new column to the existing data table, with the new values calculated within each group

The example code below adds a group average column; the average value is repeated for each member of the group.

Let’s demonstrate this with some sample data:

sample_data <- tibble(group = sample(letters[1:3], 10, replace = TRUE),
                      value = rnorm(10))
sample_data %>%
  group_by(group) %>%
  mutate(group_average = mean(value)) %>%
  arrange(group)

## # A tibble: 10 x 3
## # Groups:   group [2]
##    group      value group_average
##    <chr>      <dbl>         <dbl>
##  1     b -0.5876116   -0.06753188
##  2     b -0.6694330   -0.06753188
##  3     b  1.0544489   -0.06753188
##  4     c -1.1338465   -0.73575947
##  5     c -0.3340361   -0.73575947
##  6     c -1.2066329   -0.73575947
##  7     c  0.1223628   -0.73575947
##  8     c -1.9220008   -0.73575947
##  9     c -0.3822265   -0.73575947
## 10     c -0.2939364   -0.73575947
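
To put the two outputs side by side, here is a minimal sketch on the built-in mtcars data (an assumption, used only so the example is reproducible). The object names are illustrative.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# summarise(): one row per group, only grouping and summary columns remain
summarised <- summarise(by_cyl, avg_mpg = mean(mpg))

# mutate(): all 32 rows are kept, the group mean is repeated within each group
mutated <- mutate(by_cyl, avg_mpg = mean(mpg))

dim(summarised)  # 3 rows, 2 columns
dim(mutated)     # 32 rows, 12 columns
```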

And now with the actual data:

hflights %>%
  filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  mutate(avg=mean(ArrDelay))

## # A tibble: 106,920 x 22
## # Groups:   UniqueCarrier [15]
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
##  1  2011     1          4         2    1403    1513            AA
##  2  2011     1          9         7    1443    1554            AA
##  3  2011     1         10         1    1443    1553            AA
##  4  2011     1         11         2    1429    1539            AA
##  5  2011     1         12         3    1419    1515            AA
##  6  2011     1         17         1    1530    1634            AA
##  7  2011     1         20         4    1507    1622            AA
##  8  2011     1         24         1    1356    1513            AA
##  9  2011     1         31         1    1441    1553            AA
## 10  2011     1          1         6     728     840            AA
## # ... with 106,910 more rows, and 15 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, avg <dbl>

Peeling off layers of group_by at each summarise

mtcars %>%
  select(cyl, gear, carb) %>%
  group_by(cyl, gear) %>%
  summarise(count = n()) %>%
  summarise(count1 = max(count))

## # A tibble: 3 x 2
##     cyl count1
##   <dbl>  <dbl>
## 1     4      8
## 2     6      4
## 3     8     12

hflights %>%
  group_by(Origin,Dest)  %>%
  summarise(count_d=n()) %>% 
  summarise(max_flights=max(count_d))

## # A tibble: 2 x 2
##   Origin max_flights
##    <chr>       <dbl>
## 1    HOU        8243
## 2    IAH        5748
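
The peeling can be made visible by checking the grouping after each step. This is a sketch using dplyr’s group_vars() on mtcars; the step0, step1 and step2 names are illustrative. Recent dplyr versions also print a message about the grouping that remains after summarise, which can be ignored here.

```r
library(dplyr)

step0 <- group_by(mtcars, cyl, gear)
step1 <- summarise(step0, count = n())          # peels off "gear"
step2 <- summarise(step1, count1 = max(count))  # peels off "cyl"

group_vars(step0)  # both grouping variables
group_vars(step1)  # only cyl is left
group_vars(step2)  # no grouping left
```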

Which destination airport received the maximum number of flights originating from HOU? (Please figure out the code.)

## # A tibble: 2 x 3
## # Groups:   Origin [2]
##   Origin  Dest count_d
##    <chr> <chr>   <int>
## 1    HOU   DAL    8243
## 2    IAH   ORD    5748
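
As a hint rather than the solution, the same kind of question can be sketched on mtcars: which gear value is the most common within each cyl? One possible approach (an assumption, other solutions exist) is to keep the row holding the maximum count in each remaining group.

```r
library(dplyr)

busiest <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(count = n()) %>%     # result is still grouped by cyl
  filter(count == max(count))    # keep the largest count within each cyl

busiest
```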

A reminder about the pipe `%>%`

The pipe sends the output of the left-hand side (LHS) to the first argument of the function on the right-hand side (RHS).

sum(1:8) %>%
  sqrt()

## [1] 6
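
A short sketch of the same idea with an extra argument: the piped and the nested call below are equivalent. The object names are illustrative.

```r
library(dplyr)  # makes %>% available

piped  <- mtcars %>% head(3)   # LHS becomes the first argument of head()
nested <- head(mtcars, 3)      # the same call written without the pipe
identical(piped, nested)

# chains read left to right instead of inside out
c(1, 4, 9) %>% sqrt() %>% sum()
sum(sqrt(c(1, 4, 9)))
```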

Let’s exercise what we learned last week

Daily exercises at DataCamp

Please do daily exercises for:

Data Manipulation in R with dplyr

Review of assignment scores

Please review the assignment scores below and see the instructor if there is a problem with your score. Some notes about the table:

“R_Intro” is the first course and it won’t be graded.
The table is “course” based, not “chapter” based.
A score of “0.98” means you completed the course but lost some XP along the way, for example 30 XP for using a hint. The more XP you lose, the farther your score is from 1.0.
The instructor will publish the formula for how these scores will be graded later. As of now, the formula would be:
- Above a certain threshold (for example 0.80) the course will be accepted as complete
- Below the threshold the score will be used as a factor (if the score is 0.7 then you’ll get 0.7 * Total points)
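
The tentative rule above could be sketched as follows. The function name grade_course, the total of 10 points, and treating a score exactly at the threshold as complete are all assumptions made for illustration. The final formula may differ.

```r
# hypothetical helper: a sketch of the tentative grading rule
grade_course <- function(score, total_points, threshold = 0.80) {
  # at or above the threshold: full points (assumed)
  # below the threshold: the score acts as a factor
  ifelse(score >= threshold, total_points, score * total_points)
}

points <- grade_course(c(0.98, 0.80, 0.70), total_points = 10)
points
```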

StudentNo	GroupNo	R_Intro	Data_Import	Data_Manipulation
12051035	gr1	0.979	0.931	0.847
12051048	gr1	0.963	0.733	0.747
13056014	gr1	0.969	0.972	0.912
13056016	gr1	0.984	0.992	0.912
13056022	gr1	1.000	1.000	1.000
13056034	gr1	0.963	0.792	1.000
13056043	gr1	1.000	0.972	1.000
13056046	gr1	1.000	0.778	1.000
13056050	gr1	1.000	0.775	1.000
12056002	gr2	0.958	0.525	0.272
1205A005	gr2	1.000	0.333	0.084
1205A041	gr2	1.000	0.778	1.000
1205A042	gr2	1.000	0.778	1.000
13056004	gr2	1.000	0.792	0.962
1305A002	gr2	1.000	0.792	1.000
1305A005	gr2	0.974	0.783	0.941
1305A006	gr2	0.995	0.778	1.000
1305A011	gr2	1.000	0.778	1.000
1305A014	gr2	0.265	0.806	0.351
1305A015	gr2	0.000	0.056	0.340
1305A016	gr2	1.000	0.778	1.000
1305A029	gr2	0.990	0.847	0.971
1305A032	gr2	1.000	1.000	1.000
1305A034	gr2	0.879	0.697	1.000
1305A035	gr2	1.000	0.722	0.941
1305A042	gr2	1.000	0.778	0.962
1305A043	gr2	0.995	0.222	0.324
1305A044	gr2	0.906	0.983	0.971
1305A045	gr2	1.000	0.750	0.991
14056012	gr2	0.968	0.567	0.500
14056903	gr2	1.000	1.000	0.529
1405A015	gr2	1.000	0.819	1.000
1405A024	gr2	1.000	0.847	0.641
1405A044	gr2	1.000	1.000	1.000
1405A902	gr2	0.306	0.778	1.000
1405A903	gr2	1.000	0.992	0.991

Markdown and R = RMarkdown

Will be discussed next week.

Important points before the assignment for next week

If you want to run the DataCamp code on your own computer (you should!), you can download the datasets used in the course. The home page of each course has a link for downloading its datasets. Your instructor cannot emphasize this enough: you won’t digest what you learned unless you practice writing code from scratch.

Datasets are available at DataCamp courses

The datasets are provided in .RData or .csv format. When loaded with the load() command, an .RData file restores the data frames and other objects it contains exactly as they were saved.

For example, iris.RData file contains iris, iris.tidy, iris.wide and iris.wide2 data frames.

library(ggplot2)
# this is the iris dataset used in DataCamp ggplot course
# you need to change the folder for your local setup
load("~/Documents/YTU/DERSLER/2017GUZ/veri-analizi/lesson06/iris.RData")
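
To see what load() does without the DataCamp file, here is a small round trip in base R. The file name and the object names are hypothetical and everything is written to a temporary folder.

```r
# hypothetical file and object names, written to a temporary folder
rdata_file <- file.path(tempdir(), "example.RData")
iris_small <- head(iris, 5)
iris_means <- colMeans(iris[, 1:4])
save(iris_small, iris_means, file = rdata_file)

rm(iris_small, iris_means)   # remove them from the workspace

loaded <- load(rdata_file)   # restores the objects under their original names
loaded                       # load() returns the names of the restored objects
exists("iris_small")
```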

Data, Aesthetics and Geometry are separate layers

ggplot(iris)

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width))

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point()

ggplot object

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
p
p + geom_point()
p + geom_jitter()
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
p

Attribute vs. Mapping

To map an aesthetic to a variable, define the mapping inside aes(). To keep an aesthetic constant, set it as an attribute outside of aes().

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(col = "red")

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()
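
The difference can also be checked without looking at the plot. The sketch below uses ggplot2’s layer_data() on the built-in iris data (an assumption, in place of the DataCamp file) to inspect the colours each layer would draw. The p_attribute and p_mapping names are illustrative.

```r
library(ggplot2)

p_attribute <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(col = "red")
p_mapping <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()

# layer_data() returns the data a layer actually draws
unique(layer_data(p_attribute)$colour)  # a single constant colour
unique(layer_data(p_mapping)$colour)    # one colour per species
```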

Global vs. Local Aesthetics

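Aesthetics defined in ggplot() are global and apply to every layer. Aesthetics defined inside a geom are local to that layer, so the aesthetic mappings of each layer can be controlled independently.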

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()   

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species))

Inheritance

A globally assigned aesthetic is inherited by all following layers.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()   

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species))
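
Inheritance only becomes visible once a second layer is added. In the sketch below geom_smooth() is added purely for illustration (it is not part of the original examples) and the built-in iris data is assumed. With the global colour mapping the smoother inherits the grouping by species. With the local mapping it does not.

```r
library(ggplot2)

# colour mapped globally: geom_smooth() inherits it, one line per species
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# colour mapped locally in geom_point(): geom_smooth() draws a single line
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(col = Species)) +
  geom_smooth(method = "lm", se = FALSE)

# number of groups drawn by the second layer of each plot
length(unique(layer_data(p_global, 2)$group))
length(unique(layer_data(p_local, 2)$group))
```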

Assignments for next week

The following chapters are this week’s assignment.

Data Visualization with ggplot2 (Part 1) : Chapters 2,3 and 4

Please use GitHub issues if you’re having problems with concepts or code.

Data Analysis and Visualization - Lesson 5

alper yilmaz

October 17th, 2017
