Data Science with the tidyverse

`ggplot` examples

Display the first rows of the mpg data frame from the ggplot2 package, which contains observations collected by the US Environmental Protection Agency (EPA) on 38 car models.

library(tidyverse)  # loads ggplot2, dplyr, tidyr and purrr, all used below
head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
## 2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
## 3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
## 4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
## 5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
## 6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~

The following output shows the structure of the mpg data frame.

str(mpg)

## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

Among the variables in mpg are:
* Car’s engine size in litres: displ
* Car’s fuel efficiency on the highway, in miles per gallon: hwy

Do cars with big engines use more fuel than cars with small engines? A plot of fuel efficiency against engine size might reveal a relationship.

ggplot(data=mpg) +
  geom_point(aes(x=displ, y=hwy), shape=21) +
  geom_smooth(aes(x=displ, y=hwy), method="lm", formula=y~x, color="green3",se = F) +
  ggtitle("Relationship between engine size (displ) and fuel efficiency (hwy),\nand linear trend (OLS)") +
  theme(plot.title = element_text(color = "blue4", face = "bold", size = 12),
        axis.title = element_text(size=10))

The plot shows a negative relationship between engine size and fuel efficiency. In general, it seems that cars with big engines use more fuel, yet some cars fall outside the linear pattern. Let’s see if this behavior is related to the car’s class.

ggplot(data = mpg) + 
  geom_point(aes(x=displ, y=hwy, color=class=="2seater", shape=class=="2seater")) +
  geom_smooth(aes(x=displ, y=hwy), method="lm", formula=y~x, color="green3",se = F) +
  ggtitle("Relationship between engine size (displ) and fuel efficiency (hwy) \nby car's class") +
  scale_color_manual(name="2-Seat", values = c("black","red"), labels=c("F", "T")) +
  scale_shape_manual(name="2-Seat", values = c(21,16), labels=c("F", "T")) +
  theme(plot.title = element_text(color = "blue4", face = "bold", size = 12),
        axis.title = element_text(size=10))

This plot reveals that several of the points that do not follow the linear pattern are 2-seat cars.
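
The visual impression can also be checked numerically. The sketch below is an illustration rather than part of the original analysis: it refits the same OLS line with base R's lm() and averages the residuals by class. The object names fit and res_by_class are made up for this example.

```r
library(ggplot2)  # used here only for the mpg data frame

# Same model as the green trend line: hwy regressed on displ
fit <- lm(hwy ~ displ, data = mpg)

# Mean residual per class; positive values sit above the fitted line
res_by_class <- tapply(residuals(fit), mpg$class, mean)
sort(res_by_class, decreasing = TRUE)
```

A clearly positive mean residual for the 2seater class would be consistent with what the plot shows: those cars are more fuel efficient than the linear trend predicts for their engine size.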

Facets

A different way to explore the relationship is facet_wrap, which splits the plot into one panel per level of a categorical variable.

ggplot(data = mpg) + 
  geom_point(aes(x=displ, y=hwy)) +
  geom_line(stat="smooth", aes(x=displ, y=hwy), method="lm", formula=y~x, color="red", se = F, alpha = 0.5) +
  facet_wrap(.~class)+
  ggtitle("Relationship between engine size (displ) and fuel efficiency (hwy) \nby car's class") +
  theme(plot.title = element_text(color = "blue4", face = "bold", size = 12),
        axis.title = element_text(size=10))

***

This graph reveals that in the case of the 2-seat cars, there is little relationship between engine size and fuel efficiency. Let’s plot the data again, excluding the 2-seat cars from the estimation of the linear trend.

ggplot(data = mpg) + 
  geom_point(aes(x=displ, y=hwy, color=class=="2seater", shape=class=="2seater")) +
  geom_smooth(aes(x=displ, y=hwy), method="lm", formula=y~x, se = F, color="green3") +
  geom_smooth(data = filter(mpg, class != "2seater"), aes(x=displ, y=hwy), method="lm", formula=y~x, se = F, color="black") +
  ggtitle("Relationship between engine size (displ) and fuel efficiency (hwy) \nby car's class, linear trend with/without 2seater") +
  scale_color_manual(name="2-Seat", values = c("black","red"), labels=c("F", "T")) +
  scale_shape_manual(name="2-Seat", values = c(21,16), labels=c("F", "T")) +
  theme(plot.title = element_text(color = "blue4", face = "bold", size = 12),
        axis.title = element_text(size=10))
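
To put a number on the difference between the two trend lines, the slopes can be compared directly. This is a minimal sketch using base R's lm() and subset(); the names slope_all, no_2seat and slope_no_2seat are introduced only for illustration.

```r
library(ggplot2)  # used here only for the mpg data frame

# Slope of the trend fitted on all cars (green line)
slope_all <- coef(lm(hwy ~ displ, data = mpg))[["displ"]]

# Slope of the trend fitted without the 2-seat cars (black line)
no_2seat <- subset(mpg, class != "2seater")
slope_no_2seat <- coef(lm(hwy ~ displ, data = no_2seat))[["displ"]]

c(all = slope_all, without_2seater = slope_no_2seat)
```

Because the 2-seat cars combine large engines with relatively high highway mileage, leaving them out should make the estimated slope steeper (more negative).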

Bar charts and histograms

The mpg data frame comes with several categorical variables, class among them. Let’s draw a bar chart showing the proportion of cars in each class.

ggplot(data=mpg) +
  geom_bar(aes(x=class, y=..prop.., group=1), fill="seagreen")+
  ggtitle("Bar chart for mpg by class of car")+
  ylab("Proportion")+
  xlab("Car's class")
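
The bar heights are the shares of each class in the data. One quick way to see the same numbers, sketched here with base R rather than ggplot (class_share is a name chosen for this example), is:

```r
library(ggplot2)  # used here only for the mpg data frame

# Share of observations in each class; these are the bar heights
class_share <- prop.table(table(mpg$class))
round(class_share, 3)
```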

ggplot makes it possible to draw summary statistics of a continuous variable by the levels of a categorical one.

ggplot(data=mpg) + 
  geom_boxplot(aes(x=class, y=hwy)) +
  ggtitle("Boxplot for fuel efficiency (hwy) by class of car") +
  coord_flip()

Bar charts can also be split into subcategories. Let’s draw a bar chart for class of car split by the number of cylinders, cyl.

mpg$cyl <- as.factor(mpg$cyl) # Declare cyl to be a factor variable
ggplot(data=mpg) +
  geom_bar(aes(x=class, fill=cyl), position="dodge")

Finally, let’s draw a histogram for hwy together with a density plot for the same variable.

ggplot(data=mpg) +
  geom_histogram(aes(x=hwy, y=..density..), binwidth=1, fill="seagreen") +
  geom_density(aes(x=hwy)) +
  ylab("Density")
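
Mapping y to the density rescales the bars so that their total area is 1, which puts the histogram on the same scale as the density curve. The sketch below checks this with base R's hist(); the break points are an assumption chosen to mimic binwidth = 1 and need not match ggplot's bins exactly.

```r
library(ggplot2)  # used here only for the mpg data frame

# Bins of width 1 centred on the integer hwy values
breaks <- seq(min(mpg$hwy) - 0.5, max(mpg$hwy) + 0.5, by = 1)
h <- hist(mpg$hwy, breaks = breaks, plot = FALSE)

# Area of the bars = sum of height * width
sum(h$density * diff(h$breaks))
```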

`dplyr` basics

For this section, let’s load the nycflights13::flights data frame, which consists of 336,776 flights that departed from New York City in 2013.

library(nycflights13)  # provides the flights data frame
head(flights)

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

List all the columns of the flights data set, together with their types and first values.

glimpse(flights)

## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

Find the number of missing values by column:

flights %>% # sapply(flights, function(x) sum(is.na(x)))
  map_df(function(x) sum(is.na(x))) %>%
  gather(Feature, Nulls)

## # A tibble: 19 x 2
##    Feature        Nulls
##    <chr>          <int>
##  1 year               0
##  2 month              0
##  3 day                0
##  4 dep_time        8255
##  5 sched_dep_time     0
##  6 dep_delay       8255
##  7 arr_time        8713
##  8 sched_arr_time     0
##  9 arr_delay       9430
## 10 carrier            0
## 11 flight             0
## 12 tailnum         2512
## 13 origin             0
## 14 dest               0
## 15 air_time        9430
## 16 distance           0
## 17 hour               0
## 18 minute             0
## 19 time_hour          0
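
In recent tidyr releases gather() is superseded by pivot_longer(). The same count can be sketched as follows, assuming dplyr 1.0 or later (for across()) and tidyr 1.0 or later; the toy data frame and the name na_counts are made up so the example runs without nycflights13.

```r
library(dplyr)
library(tidyr)

# Toy data frame standing in for flights (values are made up)
toy <- tibble(
  dep_time  = c(517L, NA, 542L),
  arr_delay = c(11, NA, NA),
  carrier   = c("UA", "AA", "B6")
)

# Count the NAs in every column, then reshape to one row per column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "Feature", values_to = "Nulls")

na_counts
```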

Select all the flights that took place on February 12 with a departure time between 6 and 7 AM, and order the results by scheduled departure time. Additionally, show origin and destination.

flights %>%  # subset(flights, month==2 & day==12 & dep_time>600 & dep_time<700)
  filter(month==2, day==12, dep_time>600, dep_time<700) %>%
  select(origin, dest, dep_time, sched_dep_time) %>%
  arrange(sched_dep_time)

## # A tibble: 64 x 4
##    origin dest  dep_time sched_dep_time
##    <chr>  <chr>    <int>          <int>
##  1 LGA    FLL        602            600
##  2 JFK    IAD        604            600
##  3 LGA    ATL        628            600
##  4 EWR    IAD        631            600
##  5 EWR    DTW        640            600
##  6 LGA    IAD        653            600
##  7 JFK    LAX        601            601
##  8 EWR    BOS        604            608
##  9 EWR    CLE        606            608
## 10 LGA    MIA        601            610
## # ... with 54 more rows
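
The comparison works because dep_time stores clock times as integers in HHMM format, so 6:02 AM is 602. For arithmetic on times, such values usually need converting first. A minimal sketch, where to_minutes is a hypothetical helper and the sample values are made up:

```r
# Convert HHMM integers to minutes since midnight
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

dep_time <- c(517, 602, 659, 700)
to_minutes(dep_time)

# Strictly between 6:00 and 7:00, as in the filter above
in_window <- dep_time > 600 & dep_time < 700
in_window
```

Note that the strict inequalities leave out flights that departed at exactly 6:00 or 7:00.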

Compute the number of flights by carrier. Then sort the carriers by that count in descending order.

flights %>%
  group_by(carrier) %>%
  summarise(Carrier_count = n()) %>%
  arrange(desc(Carrier_count))

## # A tibble: 16 x 2
##    carrier Carrier_count
##    <chr>           <int>
##  1 UA              58665
##  2 B6              54635
##  3 EV              54173
##  4 DL              48110
##  5 AA              32729
##  6 MQ              26397
##  7 US              20536
##  8 9E              18460
##  9 WN              12275
## 10 VX               5162
## 11 FL               3260
## 12 AS                714
## 13 F9                685
## 14 YV                601
## 15 HA                342
## 16 OO                 32
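
dplyr also offers count() as a shorthand for this group, tally and sort pattern. A small sketch on made-up data (toy and carrier_counts are illustrative names); note that the count column is called n rather than Carrier_count:

```r
library(dplyr)

# Toy carrier column (values are made up)
toy <- tibble(carrier = c("UA", "B6", "UA", "DL", "UA", "B6"))

# count() groups, tallies and sorts in one call
carrier_counts <- toy %>% count(carrier, sort = TRUE)
carrier_counts
```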

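Rename tailnum as tail_num. Note that rename() returns a modified copy; flights itself is unchanged unless the result is assigned.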

rename(flights, tail_num=tailnum)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Get the average departure delay per date.

flights %>%
  group_by(year, month, day) %>%
  summarise(delay_mean=mean(dep_delay, na.rm=T))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day delay_mean
##    <int> <int> <int>      <dbl>
##  1  2013     1     1      11.5 
##  2  2013     1     2      13.9 
##  3  2013     1     3      11.0 
##  4  2013     1     4       8.95
##  5  2013     1     5       5.73
##  6  2013     1     6       7.15
##  7  2013     1     7       5.42
##  8  2013     1     8       2.55
##  9  2013     1     9       2.28
## 10  2013     1    10       2.84
## # ... with 355 more rows
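
The "Groups: year, month" line in the output shows that summarise() removes only the last grouping variable, so the result is still grouped. The following sketch, which assumes dplyr 1.0 or later for the .groups argument and uses made-up data, returns an ungrouped result instead:

```r
library(dplyr)

# Toy data standing in for flights (values are made up)
toy <- tibble(
  month     = c(1L, 1L, 1L, 2L),
  day       = c(1L, 1L, 2L, 1L),
  dep_delay = c(2, NA, 10, -4)
)

# Mean delay per day, dropping all grouping from the result
daily <- toy %>%
  group_by(month, day) %>%
  summarise(delay_mean = mean(dep_delay, na.rm = TRUE), .groups = "drop")

daily
```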

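Estimate the average speed of each flight in miles per hour (distance in miles divided by air time in minutes, times 60) and sort the data by it in descending order.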

flights %>% 
  select(year:day, dest, flight, tailnum, distance, air_time) %>%
  mutate(speed=distance/air_time*60) %>%
  arrange(desc(speed))

## # A tibble: 336,776 x 9
##     year month   day dest  flight tailnum distance air_time speed
##    <int> <int> <int> <chr>  <int> <chr>      <dbl>    <dbl> <dbl>
##  1  2013     5    25 ATL     1499 N666DN       762       65  703.
##  2  2013     7     2 MSP     4667 N17196      1008       93  650.
##  3  2013     5    13 GSP     4292 N14568       594       55  648 
##  4  2013     3    23 BNA     3805 N12567       748       70  641.
##  5  2013     1    12 PBI     1902 N956DL      1035      105  591.
##  6  2013    11    17 SJU      315 N3768       1598      170  564 
##  7  2013     2    21 SJU      707 N779JB      1598      172  557.
##  8  2013    11    17 STT      936 N5FFAA      1623      175  556.
##  9  2013    11    16 SJU      347 N3773D      1598      173  554.
## 10  2013    11    16 SJU     1503 N571JB      1598      173  554.
## # ... with 336,766 more rows
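
The first row of the output can be reproduced by hand. In flights, distance is recorded in miles and air_time in minutes, so dividing one by the other and multiplying by 60 gives miles per hour:

```r
# First row of the output: 762 miles flown in 65 minutes
distance <- 762   # miles
air_time <- 65    # minutes

speed <- distance / air_time * 60  # miles per hour
round(speed)
```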

Get the number of flights to Boston in each month, along with the mean distance and the mean arrival delay.

flights %>%
  filter(dest=="BOS") %>%
  group_by(year, month, dest) %>%
  summarise(count=n(), 
            dist=mean(distance, na.rm=T),
            del=mean(arr_delay, na.rm=T))

## # A tibble: 12 x 6
## # Groups:   year, month [?]
##     year month dest  count  dist     del
##    <int> <int> <chr> <int> <dbl>   <dbl>
##  1  2013     1 BOS    1245  191.  -2.54 
##  2  2013     2 BOS    1182  190.   0.457
##  3  2013     3 BOS    1324  191.   3.83 
##  4  2013     4 BOS    1305  190.   3.51 
##  5  2013     5 BOS    1327  191.   5.04 
##  6  2013     6 BOS    1312  191.   9.31 
##  7  2013     7 BOS    1378  191.  12.4  
##  8  2013     8 BOS    1377  191.   1.30 
##  9  2013     9 BOS    1307  191.  -3.34 
## 10  2013    10 BOS    1357  191.  -3.45 
## 11  2013    11 BOS    1235  191.  -3.39 
## 12  2013    12 BOS    1159  191.  12.5

Tabulate the number of flights by destination and origin.

flights %>%
  group_by(origin) %>%
  select(dest, origin) %>%
  table() %>%
  as.data.frame.matrix() %>%
  head()

##      EWR  JFK   LGA
## ABQ    0  254     0
## ACK    0  265     0
## ALB  439    0     0
## ANC    8    0     0
## ATL 5022 1930 10263
## AUS  968 1471     0
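
The group_by() step is not strictly needed for a cross-tabulation. As an alternative sketch, base R's xtabs() builds the same kind of table from a formula; the toy data below is made up so the example runs without nycflights13.

```r
# Toy data standing in for flights (values are made up)
toy <- data.frame(
  origin = c("EWR", "JFK", "JFK", "LGA", "EWR"),
  dest   = c("ATL", "ATL", "ABQ", "ATL", "ATL")
)

# Rows are destinations, columns are origins
tab <- xtabs(~ dest + origin, data = toy)
as.data.frame.matrix(tab)
```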

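Compute the mean and standard deviation of arrival delay and air time by destination.

flights %>%
  group_by(dest) %>%
  # a named list of lambdas replaces funs(), deprecated since dplyr 0.8.0
  summarise_at(vars(arr_delay, air_time),
               list(mean = ~mean(., na.rm=T), sd = ~sd(., na.rm=T)))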

## # A tibble: 105 x 5
##    dest  arr_delay_mean air_time_mean arr_delay_sd air_time_sd
##    <chr>          <dbl>         <dbl>        <dbl>       <dbl>
##  1 ABQ             4.38         249.          42.0       19.3 
##  2 ACK             4.85          42.1         30.0        8.13
##  3 ALB            14.4           31.8         50.5        3.08
##  4 ANC            -2.5          413.          26.4       14.7 
##  5 ATL            11.3          113.          47.0        9.81
##  6 AUS             6.02         213.          43.5       18.2 
##  7 AVL             8.00          89.9         33.6        7.38
##  8 BDL             7.05          25.5         42.1        3.29
##  9 BGR             8.03          54.1         46.4        3.33
## 10 BHM            16.9          123.          56.2       10.5 
## # ... with 95 more rows

Antonio Aguilar

October 17, 2018
