ggplot examples
Display the mpg data frame from the ggplot2 package, which contains observations collected by the US EPA on 38 car models.
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
The output below shows the structure of the mpg data frame.
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
Among the variables in mpg are:
* Engine size (displacement), in litres: displ
* Fuel efficiency on the highway, in miles per gallon: hwy
Do cars with big engines use more fuel than cars with small engines? A plot of fuel efficiency against engine size might reveal a relationship.
ggplot(data=mpg) +
geom_point(aes(x=displ, y=hwy), shape=21) +
geom_smooth(aes(x=displ, y=hwy), method="lm", formula=y~x, color="green3",se = F) +
ggtitle("Relationship between engine size (displ) and fuel efficiency (hwy),\nand linear trend (OLS)") +
theme(plot.title = element_text(color = "blue4", face = "bold", size = 12),
axis.title = element_text(size=10))
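The green line drawn by geom_smooth(method="lm") is an ordinary least squares fit of hwy on displ. As a minimal sketch, the same fit can be reproduced with base R's lm() to read off the coefficients; the object name fit is only illustrative.

```r
library(ggplot2)  # provides the mpg data frame

# Fit the same simple linear regression that geom_smooth(method = "lm") draws
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope; the slope is the expected change in highway mpg
# for each additional litre of engine displacement
coef(fit)
```

The sign of the displ coefficient gives the direction of the trend.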
The plot shows a negative relationship between engine size and fuel efficiency. In general, it seems that cars with big engines use more fuel, yet some cars fall outside the linear pattern. Let's see if this behavior is related to the car's class.
ggplot(data = mpg) +
geom_point(aes(x=displ, y=hwy, color=class=="2seater", shape=class=="2seater")) +
geom_smooth(aes(x=displ, y=hwy), method="lm", formula=y~x, color="green3",se = F) +
ggtitle("Relationship between engine size (displ) and fuel efficiency (hwy) \nby car's class") +
scale_color_manual(name="2-Seat", values = c("black","red"), labels=c("F", "T")) +
scale_shape_manual(name="2-Seat", values = c(21,16), labels=c("F", "T")) +
theme(plot.title = element_text(color = "blue4", face = "bold", size = 12),
axis.title = element_text(size=10))
This plot reveals that several of the points that do not follow the linear pattern are 2-seat cars.
A different way to explore the relationship is facet_wrap, which splits the plot into panels, one for each level of a categorical variable.
ggplot(data = mpg) +
geom_point(aes(x=displ, y=hwy)) +
geom_line(stat="smooth", aes(x=displ, y=hwy), method="lm", formula=y~x, color="red", se = F, alpha = 0.5) +
facet_wrap(.~class)+
ggtitle("Relationship between engine size (displ) and fuel efficiency (hwy) \nby car's class") +
theme(plot.title = element_text(color = "blue4", face = "bold", size = 12),
axis.title = element_text(size=10))
***
This graph reveals that, for the 2-seat cars, there is little relationship between engine size and fuel efficiency. Let's plot the data again, this time adding a second linear trend estimated without the 2-seat cars.
ggplot(data = mpg) +
geom_point(aes(x=displ, y=hwy, color=class=="2seater", shape=class=="2seater")) +
geom_smooth(aes(x=displ, y=hwy), method="lm", formula=y~x, se = F, color="green3") +
geom_smooth(data = dplyr::filter(mpg, class != "2seater"), aes(x=displ, y=hwy), method="lm", formula=y~x, se = F, color="black") + # dplyr::filter avoids picking up stats::filter
ggtitle("Relationship between engine size (displ) and fuel efficiency (hwy) \nby car's class, linear trend with/without 2seater") +
scale_color_manual(name="2-Seat", values = c("black","red"), labels=c("F", "T")) +
scale_shape_manual(name="2-Seat", values = c(21,16), labels=c("F", "T")) +
theme(plot.title = element_text(color = "blue4", face = "bold", size = 12),
axis.title = element_text(size=10))
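To put numbers on the two trend lines, their slopes can be compared directly. This is a sketch using base R's lm() and subset(); the names fit_all, fit_no2 and slopes are illustrative and not part of the original analysis.

```r
library(ggplot2)  # provides the mpg data frame

# Trend estimated on all cars (the green line)
fit_all <- lm(hwy ~ displ, data = mpg)

# Trend estimated without the 2-seat cars (the black line)
fit_no2 <- lm(hwy ~ displ, data = subset(mpg, class != "2seater"))

slopes <- c(all_cars = coef(fit_all)[["displ"]],
            without_2seater = coef(fit_no2)[["displ"]])
slopes
```

If the 2-seat cars pull the line upward at large engine sizes, the slope estimated without them should come out more negative.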
The mpg data frame comes with several categorical variables, class among them. Let's draw a bar chart of this variable.
ggplot(data=mpg) +
geom_bar(aes(x=class, y=..prop.., group=1), fill="seagreen")+
ggtitle("Bar chart for mpg by class of car")+
ylab("Proportion")+
xlab("Car's class")
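The proportions on the y axis can be checked without plotting. A minimal sketch with base R's table() and prop.table(); class_counts and class_props are illustrative names.

```r
library(ggplot2)  # provides the mpg data frame

# Count the cars in each class, then convert the counts to proportions
class_counts <- table(mpg$class)
class_props <- prop.table(class_counts)

round(class_props, 3)
```

Each value should match the height of the corresponding bar.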
ggplot also makes it possible to draw summary statistics of a continuous variable by the levels of a categorical one.
ggplot(data=mpg) +
geom_boxplot(aes(x=class, y=hwy)) +
ggtitle("Boxplot for fuel efficiency (hwy) by class of car") +
coord_flip()
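The median that the boxplot draws inside each box can also be computed directly. A sketch with dplyr, assuming it is installed; hwy_summary and its column names are illustrative.

```r
library(ggplot2)  # provides the mpg data frame
library(dplyr)

# One row per class with the minimum, median and maximum highway mpg
hwy_summary <- mpg %>%
  group_by(class) %>%
  summarise(min_hwy = min(hwy),
            median_hwy = median(hwy),
            max_hwy = max(hwy))

hwy_summary
```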
Bar charts can also be split into subcategories. Let's draw a bar chart for class of car split by the number of cylinders, cyl.
mpg$cyl <- as.factor(mpg$cyl) # Declare cyl to be a factor variable
ggplot(data=mpg) +
geom_bar(aes(x=class, fill=cyl), position="dodge")
Finally, let's draw a histogram for hwy together with a density plot for the same variable.
ggplot(data=mpg) +
geom_histogram(aes(x=hwy, y=..density..), binwidth=1, fill="seagreen") +
geom_density(aes(x=hwy)) +
ylab("Density")
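Mapping y to ..density.. rescales the histogram so that the bar areas sum to 1, which puts it on the same scale as the density curve. As a sketch of that property, the same rescaling can be checked with base R's hist(); the break points below are an assumption chosen to give bins of width 1 around the integer values of hwy, and may not coincide with the bins ggplot picks.

```r
library(ggplot2)  # provides the mpg data frame

# Bins of width 1 covering the full range of hwy
breaks <- seq(min(mpg$hwy) - 0.5, max(mpg$hwy) + 0.5, by = 1)
h <- hist(mpg$hwy, breaks = breaks, plot = FALSE)

# Area of each bar is density * bin width; the areas add up to 1
total_area <- sum(h$density * diff(h$breaks))
total_area
```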
Dplyr basics
For this section, let's load the nycflights13::flights data frame, which consists of roughly 0.3 million flights (336,776 rows) that departed from New York City in 2013.
head(flights)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Get all the column names from the flights dataset.
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
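glimpse() prints the column names together with their types and a few sample values. If only the names are needed, base R's names() returns them as a character vector. A minimal sketch; col_names is an illustrative name.

```r
library(nycflights13)  # provides the flights data frame

# Character vector with one entry per column
col_names <- names(flights)

col_names
length(col_names)
```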
Find the number of missing values by column:
flights %>% # sapply(flights, function(x) sum(is.na(x)))
map_df(function(x) sum(is.na(x))) %>%
gather(Feature, Nulls)
## # A tibble: 19 x 2
## Feature Nulls
## <chr> <int>
## 1 year 0
## 2 month 0
## 3 day 0
## 4 dep_time 8255
## 5 sched_dep_time 0
## 6 dep_delay 8255
## 7 arr_time 8713
## 8 sched_arr_time 0
## 9 arr_delay 9430
## 10 carrier 0
## 11 flight 0
## 12 tailnum 2512
## 13 origin 0
## 14 dest 0
## 15 air_time 9430
## 16 distance 0
## 17 hour 0
## 18 minute 0
## 19 time_hour 0
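The map_df() and gather() pipeline above still works, but gather() has been superseded in tidyr. As a sketch of an equivalent, assuming dplyr 1.0 or later and tidyr 1.0 or later are installed (na_counts is an illustrative name):

```r
library(nycflights13)
library(dplyr)
library(tidyr)

na_counts <- flights %>%
  # One row holding the number of missing values in every column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # Reshape to one row per column, matching the layout shown above
  pivot_longer(everything(), names_to = "Feature", values_to = "Nulls")

na_counts
```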
Select all the flights that took place on February 12 with a departure time between 6 and 7 AM, and order the result by scheduled departure time. Additionally, show origin and destination.
flights %>% # subset(flights, month==2 & day==12 & dep_time>600 & dep_time<700)
filter(month==2, day==12, dep_time>600, dep_time<700) %>%
select(origin, dest, dep_time, sched_dep_time) %>%
arrange(sched_dep_time)
## # A tibble: 64 x 4
## origin dest dep_time sched_dep_time
## <chr> <chr> <int> <int>
## 1 LGA FLL 602 600
## 2 JFK IAD 604 600
## 3 LGA ATL 628 600
## 4 EWR IAD 631 600
## 5 EWR DTW 640 600
## 6 LGA IAD 653 600
## 7 JFK LAX 601 601
## 8 EWR BOS 604 608
## 9 EWR CLE 606 608
## 10 LGA MIA 601 610
## # ... with 54 more rows
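dep_time is stored as an integer in HHMM form (517 means 5:17 AM), so the two strict comparisons keep values from 601 to 699. A sketch of the same filter written with dplyr's between(), which is inclusive at both ends; early_feb12 is an illustrative name.

```r
library(nycflights13)
library(dplyr)

early_feb12 <- flights %>%
  # between(x, 601, 699) is equivalent to x > 600 & x < 700 for integers
  filter(month == 2, day == 12, between(dep_time, 601, 699)) %>%
  select(origin, dest, dep_time, sched_dep_time) %>%
  arrange(sched_dep_time)

nrow(early_feb12)
```

Rows with a missing dep_time are dropped in both versions, because filter() keeps only rows where the condition is TRUE.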
Compute the number of flights by carrier. Then sort the data.
flights %>%
group_by(carrier) %>%
summarise(Carrier_count = n()) %>%
arrange(desc(Carrier_count))
## # A tibble: 16 x 2
## carrier Carrier_count
## <chr> <int>
## 1 UA 58665
## 2 B6 54635
## 3 EV 54173
## 4 DL 48110
## 5 AA 32729
## 6 MQ 26397
## 7 US 20536
## 8 9E 18460
## 9 WN 12275
## 10 VX 5162
## 11 FL 3260
## 12 AS 714
## 13 F9 685
## 14 YV 601
## 15 HA 342
## 16 OO 32
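dplyr also offers count() as a shortcut for the group_by(), summarise(n()) and arrange(desc()) sequence. A minimal sketch that should give the same table; carrier_counts is an illustrative name.

```r
library(nycflights13)
library(dplyr)

carrier_counts <- flights %>%
  # sort = TRUE orders by descending count; name sets the count column
  count(carrier, sort = TRUE, name = "Carrier_count")

carrier_counts
```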
Rename tailnum as tail_num.
rename(flights, tail_num=tailnum)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
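Note that rename() returns a new data frame and leaves flights itself untouched, so the result has to be assigned if it is needed later. A minimal sketch; flights_renamed is an illustrative name.

```r
library(nycflights13)
library(dplyr)

# Store the renamed copy under a new name
flights_renamed <- rename(flights, tail_num = tailnum)

"tail_num" %in% names(flights_renamed)  # the new name is present in the copy
"tailnum" %in% names(flights)           # the original still has the old name
```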
Get the average departure delay per date.
flights %>%
group_by(year, month, day) %>%
summarise(delay_mean=mean(dep_delay, na.rm=T))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day delay_mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
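The na.rm=T argument matters here because dep_delay is missing for some flights (8,255 rows according to the missing-value table earlier). Without it, a single missing value turns the mean of a whole day into NA. A small sketch with made-up values:

```r
# Made-up delays in minutes, one of them missing
delays <- c(2, 4, NA, -1)

mean_with_na <- mean(delays)                 # NA, because of the missing value
mean_without_na <- mean(delays, na.rm = TRUE)  # mean of 2, 4 and -1

mean_with_na
mean_without_na
```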
Estimate each flight's average speed in miles per hour (distance is in miles, air_time in minutes) and sort the data by it in descending order.
flights %>%
select(year:day, dest, flight, tailnum, distance, air_time) %>%
mutate(speed=distance/air_time*60) %>%
arrange(desc(speed))
## # A tibble: 336,776 x 9
## year month day dest flight tailnum distance air_time speed
## <int> <int> <int> <chr> <int> <chr> <dbl> <dbl> <dbl>
## 1 2013 5 25 ATL 1499 N666DN 762 65 703.
## 2 2013 7 2 MSP 4667 N17196 1008 93 650.
## 3 2013 5 13 GSP 4292 N14568 594 55 648
## 4 2013 3 23 BNA 3805 N12567 748 70 641.
## 5 2013 1 12 PBI 1902 N956DL 1035 105 591.
## 6 2013 11 17 SJU 315 N3768 1598 170 564
## 7 2013 2 21 SJU 707 N779JB 1598 172 557.
## 8 2013 11 17 STT 936 N5FFAA 1623 175 556.
## 9 2013 11 16 SJU 347 N3773D 1598 173 554.
## 10 2013 11 16 SJU 1503 N571JB 1598 173 554.
## # ... with 336,766 more rows
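Since distance is recorded in miles and air_time in minutes, their ratio is in miles per minute, and multiplying by 60 converts it to miles per hour. A worked check using the values from the first row of the output above; the variable names are illustrative.

```r
distance_miles <- 762  # distance of the first row in the output above
air_time_min <- 65     # air_time of the same row

# Miles per minute times 60 gives miles per hour
speed_mph <- distance_miles / air_time_min * 60
speed_mph
```

The result is about 703 miles per hour, which agrees with the speed column of that row.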
Get the number of flights to Boston in each month, together with the mean distance and the mean arrival delay.
flights %>%
filter(dest=="BOS") %>%
group_by(year, month, dest) %>%
summarise(count=n(),
dist=mean(distance, na.rm=T),
del=mean(arr_delay, na.rm=T))
## # A tibble: 12 x 6
## # Groups: year, month [?]
## year month dest count dist del
## <int> <int> <chr> <int> <dbl> <dbl>
## 1 2013 1 BOS 1245 191. -2.54
## 2 2013 2 BOS 1182 190. 0.457
## 3 2013 3 BOS 1324 191. 3.83
## 4 2013 4 BOS 1305 190. 3.51
## 5 2013 5 BOS 1327 191. 5.04
## 6 2013 6 BOS 1312 191. 9.31
## 7 2013 7 BOS 1378 191. 12.4
## 8 2013 8 BOS 1377 191. 1.30
## 9 2013 9 BOS 1307 191. -3.34
## 10 2013 10 BOS 1357 191. -3.45
## 11 2013 11 BOS 1235 191. -3.39
## 12 2013 12 BOS 1159 191. 12.5
Tabulate the number of flights by destination and origin.
flights %>%
group_by(origin) %>%
select(dest, origin) %>%
table() %>%
as.data.frame.matrix() %>%
head()
## EWR JFK LGA
## ABQ 0 254 0
## ACK 0 265 0
## ALB 439 0 0
## ANC 8 0 0
## ATL 5022 1930 10263
## AUS 968 1471 0
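The same cross-tabulation can be built without leaving the tidyverse. This is a sketch, assuming tidyr 1.1 or later for the scalar values_fill argument; dest_by_origin is an illustrative name, and the order of the origin columns may differ from the table() output.

```r
library(nycflights13)
library(dplyr)
library(tidyr)

dest_by_origin <- flights %>%
  # One row per (dest, origin) pair with the number of flights in column n
  count(dest, origin) %>%
  # Spread the origins into columns; missing pairs become 0
  pivot_wider(names_from = origin, values_from = n, values_fill = 0)

head(dest_by_origin)
```

Unlike table(), this keeps dest as a regular column instead of row names, which makes further dplyr steps easier.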
Compute the mean and standard deviation of arrival delay and air time by destination.
flights %>%
group_by(dest) %>%
summarise_at(vars(arr_delay, air_time), funs(mean(., na.rm=T), sd(., na.rm=T)))
## # A tibble: 105 x 5
## dest arr_delay_mean air_time_mean arr_delay_sd air_time_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ABQ 4.38 249. 42.0 19.3
## 2 ACK 4.85 42.1 30.0 8.13
## 3 ALB 14.4 31.8 50.5 3.08
## 4 ANC -2.5 413. 26.4 14.7
## 5 ATL 11.3 113. 47.0 9.81
## 6 AUS 6.02 213. 43.5 18.2
## 7 AVL 8.00 89.9 33.6 7.38
## 8 BDL 7.05 25.5 42.1 3.29
## 9 BGR 8.03 54.1 46.4 3.33
## 10 BHM 16.9 123. 56.2 10.5
## # ... with 95 more rows
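funs() has been deprecated since dplyr 0.8.0, and summarise_at() is superseded by across(). As a sketch of an equivalent summary, assuming dplyr 1.0 or later (dest_stats is an illustrative name):

```r
library(nycflights13)
library(dplyr)

dest_stats <- flights %>%
  group_by(dest) %>%
  # Apply both functions to both columns; new columns are named <column>_<function>
  summarise(across(c(arr_delay, air_time),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE))))

dest_stats
```

The column names match those of the summarise_at() output, although their order is grouped by variable rather than by function.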