class: center, middle, inverse, title-slide

.title[
# Statistical graphics with ggplot2
]
.subtitle[
## Programming for Statistical Science
]
.author[
### Dr. Zulfiqar Ali
]
.institute[
### College of Statistical Sciences, University of the Punjab, Lahore
]
.date[
### 15 June 2023
]

---

## Supplementary materials

Additional resources

- [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html), R for Data Science
- `ggplot2` [Reference](https://ggplot2.tidyverse.org/reference/index.html)
- `ggplot2` [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)
- [color brewer 2](http://colorbrewer2.org/)

---

## What is `ggplot2`?

- `ggplot2` is an R package for creating plots. It follows the grammar of graphics, which gives it a structured way of building visualizations. It combines the best features from the base and lattice plotting systems.

- **Base graphics:** Base graphics is the default plotting system in R. It offers a wide range of basic plotting functions, such as `plot()`, `hist()`, `boxplot()`, and more. These functions allow users to create various types of plots, customize colors, add labels, and modify axes. Base graphics are known for their simplicity and ease of use.

- **Lattice graphics:** Lattice graphics is a more advanced plotting system in R, built on the grid graphics engine rather than on base graphics. It provides additional features for creating more complex and structured plots, and it is particularly useful for visualizing multivariate data. A central function in the lattice package is `xyplot()`, which creates scatter plots and line plots; companion functions such as `barchart()` and `bwplot()` cover other plot types. Lattice graphics also support faceting (dividing plots into multiple panels) and conditioning (plotting subsets of data based on specific conditions).

---

## What is `ggplot2`?

- One of the advantages of `ggplot2` is that it handles many of the intricate aspects of plotting automatically.
For example, it takes care of tasks like creating legends and dividing plots into multiple panels (faceting). This saves users from dealing with these complex tasks themselves.

- When working with data that has multiple variables, `ggplot2` is especially useful. It simplifies the process of visualizing relationships between different variables, making it easier to understand complex datasets.

Package `ggplot2` is part of the `tidyverse`. Let's load that now.

```r
library(tidyverse)
```

---

## Tidyverse Package

The **tidyverse** is a collection of R packages designed for data manipulation, visualization, and analysis. It provides a cohesive and consistent set of functions that work together seamlessly. Some of the packages included in the tidyverse are:

- **ggplot2**: For creating elegant and customizable data visualizations.
- **dplyr**: For data manipulation tasks, such as filtering, summarizing, and joining datasets.
- **tidyr**: For tidying messy data by reshaping and organizing it.
- **readr**: For reading and writing rectangular text data, such as CSV files.
- **purrr**: For working with functional programming paradigms.

---

## The Grammar of Graphics

- The Grammar of Graphics is a visualization concept that was introduced by Leland Wilkinson in 1999.
- Its purpose is to define the fundamental elements of a statistical graphic. In 2009, Hadley Wickham adapted the concept for the R programming language.
- The adaptation provided a consistent and concise syntax for describing statistical graphics.
- It offers a highly modular approach, breaking graphs into semantic components.
- It's important to note that the Grammar of Graphics does not serve as a guide for selecting the most appropriate graph type or for conveying your data effectively.
- This aspect will be explored further in subsequent discussions.
- The Grammar of Graphics provides a powerful framework for creating visualizations in R, emphasizing the structure and principles behind statistical graphics.
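
---

## The Grammar of Graphics: a small sketch

The idea of breaking a graph into semantic components can be sketched in a few lines. This is only an illustration and not part of the lecture's own examples: it assumes the built-in `mtcars` data set (rather than today's MLB data), and the choice of variables and of a linear fit is arbitrary.

```r
library(ggplot2)

# Illustrative only: the built-in mtcars data stands in for a real data set.
# Each line contributes one component of the grammar.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +  # data and aesthetic mapping
  geom_point() +                                               # geometric object: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # statistical transformation: linear fit
  facet_wrap(~cyl) +                                           # facets: one panel per cylinder count
  coord_cartesian() +                                          # coordinate system (the default, written out)
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")        # labels

p  # printing the object draws the plot
```

Because each component is a separate term joined by `+`, any one of them can be swapped or dropped without rewriting the rest of the plot.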
---

## Today's data: MLB

```r
teams <- read_csv("http://www2.stat.duke.edu/~sms185/data/mlb/teams.csv")
```

Object `teams` is a data frame that contains yearly statistics and standings for [Major League Baseball (MLB)](https://www.mlb.com/stats/team) teams from 2009 to 2018.

The data has 300 rows and 56 variables.

---

.tiny[
```r
teams
```

```
#> # A tibble: 300 × 56
#>    yearID lgID  teamID franchID divID  Rank     G Ghome     W     L DivWin WCWin
#>     <dbl> <chr> <chr>  <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <chr>
#>  1   2009 NL    ARI    ARI      W         5   162    81    70    92 N      N
#>  2   2009 NL    ATL    ATL      E         3   162    81    86    76 N      N
#>  3   2009 AL    BAL    BAL      E         5   162    81    64    98 N      N
#>  4   2009 AL    BOS    BOS      E         2   162    81    95    67 N      Y
#>  5   2009 AL    CHA    CHW      C         3   162    81    79    83 N      N
#>  6   2009 NL    CHN    CHC      C         2   161    80    83    78 N      N
#>  7   2009 NL    CIN    CIN      C         4   162    81    78    84 N      N
#>  8   2009 AL    CLE    CLE      C         4   162    81    65    97 N      N
#>  9   2009 NL    COL    COL      W         2   162    81    92    70 N      Y
#> 10   2009 AL    DET    DET      C         2   163    81    86    77 N      N
#> # ℹ 290 more rows
#> # ℹ 44 more variables: LgWin <chr>, WSWin <chr>, R <dbl>, AB <dbl>, H <dbl>,
#> #   X2B <dbl>, X3B <dbl>, HR <dbl>, BB <dbl>, SO <dbl>, SB <dbl>, CS <dbl>,
#> #   HBP <dbl>, SF <dbl>, RA <dbl>, ER <dbl>, ERA <dbl>, CG <dbl>, SHO <dbl>,
#> #   SV <dbl>, IPouts <dbl>, HA <dbl>, HRA <dbl>, BBA <dbl>, SOA <dbl>, E <dbl>,
#> #   DP <dbl>, FP <dbl>, name <chr>, park <chr>, attendance <dbl>, BPF <dbl>,
#> #   PPF <dbl>, teamIDBR <chr>, teamIDlahman45 <chr>, teamIDretro <chr>, …
```
]

---

class: inverse, center, middle

# Plot comparison

---

## Scatter plot using `ggplot()`

<img src="lec_08_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Using simple `plot()`

<img src="lec_08_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Difference

The main difference between the scatter plot in ggplot2 and the simple plot command lies in the underlying philosophy and syntax of the two approaches.

**ggplot2 scatter plot:**

- ggplot2 is part of the tidyverse and follows the grammar of graphics framework.
- It provides a more structured and layered approach to creating plots.
- In ggplot2, a scatter plot is created using the `geom_point()` function.
- You can customize various aspects of the scatter plot, such as the color, size, and shape of the points, by mapping them to variables with `aes()` or by setting them to fixed values.
- Additional layers, such as trend lines or smoothing curves, can be added using separate `geom_*()` functions.

---

## Difference

**Simple plot command:**

- The simple plot command, such as `plot()` in base R, is a quick and straightforward way to create basic plots.
- It follows a more traditional and direct approach.
- A scatter plot can be created by passing the x and y variables as arguments to the `plot()` function.
- Basic customization options, like point color and size, can be adjusted using additional parameters.

Overall, ggplot2 offers a more flexible and customizable approach to creating scatter plots, with a focus on the grammar of graphics. On the other hand, the simple plot command provides a quick and easy way to generate basic scatter plots with fewer customization options. The choice between the two approaches depends on the level of complexity and customization required for your visualization.
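
---

## Difference: a side-by-side sketch

The contrast described above can be sketched with the same scatter plot drawn both ways. This is a minimal illustration under stated assumptions: it uses the built-in `mtcars` data set instead of the lecture data, and the colors, labels, and linear trend line are arbitrary choices made for the example.

```r
library(ggplot2)

# Base graphics: pass the x and y vectors directly and style the plot
# through extra arguments. The trend line needs a separate call.
plot(x = mtcars$wt, y = mtcars$mpg, col = "blue", pch = 16,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(lm(mpg ~ wt, data = mtcars), lwd = 2)

# ggplot2: declare the data and mappings once, then add layers with `+`.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p  # printing the object draws the plot
```

In the base version the plot is drawn immediately and then modified in place, while the ggplot2 version builds an object that can be stored, extended with further layers, and printed later.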

---

## More on ggplot2-style graphs

- Spatial graphs as an example

```
#>         coordinates Stations      Jan      Feb     Mar      Apr      May
#> 1 (74.8624, 35.357)    badin 1.178969 1.166179 1.10908 1.061619 1.012896
#>        Jun     Jul      Aug      Sep     Oct     Nov      Dec
#> 1 1.009722 1.01368 1.025683 1.038332 1.07704 1.11487 1.155086
#> [using ordinary kriging]
```

---

<img src="lec_08_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Scatter plot using `plot()`

```r
# Run differential: runs scored minus runs allowed
teams$RD <- teams$R - teams$RA

# Split the data into division winners and non-winners
teams_div <- teams[teams$DivWin == "Y", ]
teams_no_div <- teams[teams$DivWin == "N", ]

# Fit a separate linear model for each group
mod1 <- lm(WinPct ~ RD, data = teams_div)
mod2 <- lm(WinPct ~ RD, data = teams_no_div)

# Scatter plot colored by division-winner status (N = color 1, Y = color 2)
plot(x = (teams$R - teams$RA), y = teams$WinPct,
     col = adjustcolor(as.integer(factor(teams$DivWin))),
     pch = 16, xlab = "Run Differential", ylab = "Win Percentage")

# Add the fitted line for each group in the matching color
abline(mod1, col = 2, lwd = 2)
abline(mod2, col = 1, lwd = 2)
```

---

class: inverse, center, middle

# What's in a `ggplot()`?

---

## Terminology

A statistical graphic is a...

- mapping of **data**
- which may be **statistically transformed** (summarized, log-transformed, etc.)
- to **aesthetic attributes** (color, size, xy-position, etc.)
- using **geometric objects** (points, lines, bars, etc.)
- and mapped onto a specific **facet** and **coordinate system.**

---

## What do I "need"?

1) Some data (preferably in a data frame)

```r
*ggplot(data = teams)
```

<img src="lec_08_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

2) A set of variable mappings

```r
*ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W))
```

<img src="lec_08_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

3) A geom with arguments, or multiple geoms with arguments connected by `+`

```r
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) +
* geom_point(color = "blue")
```

<img src="lec_08_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

4) Some options on changing scales or adding facets

```r
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) +
  geom_point(color = "blue") +
* facet_wrap(~yearID, nrow = 2)
```

<img src="lec_08_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

5) Some labels

```r
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) +
  geom_point(color = "blue") +
  facet_wrap(~yearID, nrow = 2) +
* labs(x = "Attendance", y = "Wins", caption = "Attendance in thousands")
```

<img src="lec_08_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

6) Other options

```r
ggplot(data = teams, mapping = aes(x = attendance / 1000, y = W)) +
  geom_point(color = "blue") +
  facet_wrap(~yearID, nrow = 2) +
  labs(x = "Attendance", y = "Wins", caption = "Attendance in thousands") +
* theme_bw(base_size = 16) +
* theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

<img src="lec_08_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## Anatomy of a ggplot

```r
ggplot(
  data = [dataframe],
  aes(
    x = [var_x], y = [var_y],
    color = [var_for_color],
    fill = [var_for_fill],
    shape = [var_for_shape],
    size = [var_for_size],
    alpha = [var_for_alpha],
    ...#other aesthetics
  )
) +
  geom_<some_geom>([geom_arguments]) +
  ... # other geoms
  scale_<some_axis>_<some_scale>() +
  facet_<some_facet>([formula]) +
  ... # other options
```

To visualize multivariate relationships we can add variables to our visualization by specifying aesthetics: color, size, shape, linetype, alpha, or fill; we can also add facets based on variable levels.

---

class: inverse, center, middle

# Scatter plots

---

## Base plot

.tiny[
```r
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^ 2)), y = WinPct)) +
* geom_point()
```

<img src="lec_08_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

---

## Altering aesthetic color

.tiny[
```r
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^ 2)), y = WinPct)) +
* geom_point(color = "#E81828")
```

<img src="lec_08_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]

---

## Altering aesthetic color

.tiny[
```r
*ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^ 2)), y = WinPct, color = lgID)) +
  geom_point(show.legend = FALSE)
```

<img src="lec_08_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]

---

## Altering aesthetic color

.tiny[
```r
ggplot(data = teams, mapping = aes(x = (R ^ 2 / (R ^ 2 + RA ^ 2)), y = WinPct, color = lgID)) +
* geom_point()
```

<img src="lec_08_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]

---

## Base plot

.tiny[
```r
ggplot(data = teams[teams$yearID == 2018, ], mapping = aes(x = BB + H, y = SO)) +
  geom_point()
```

<img src="lec_08_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]

---

## Altering multiple aesthetics

.tiny[
```r
ggplot(data = teams[teams$yearID == 2018, ], mapping = aes(x = BB + H, y = SO)) +
* geom_point(size = 3, shape = 2, color = "#E81828")
```

<img src="lec_08_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />
]

---

## Altering multiple aesthetics

.tiny[
```r
ggplot(data = teams[teams$yearID == 2018, ],
       mapping = aes(x = BB + H, y = SO,
*                    color = factor(Rank), shape = factor(Rank))) +
  geom_point(size = 4, alpha = .8, show.legend = FALSE)
```

<img src="lec_08_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]

---

## Altering multiple aesthetics

.tiny[
```r
ggplot(data = teams[teams$yearID == 2018, ],
       mapping = aes(x = BB + H, y = SO,
*                    color = factor(Rank), shape = factor(Rank))) +
  geom_point(size = 4, alpha = .8)
```

<img src="lec_08_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]

---

## Inside or outside `aes()`?

When does an aesthetic go inside function `aes()`?

- If you want an aesthetic to reflect a variable's values, it must go inside `aes()`.
- If you want to set an aesthetic manually and not have it convey information about a variable, use the aesthetic's name outside of `aes()` and set it to your desired value.

Aesthetics for continuous and discrete variables are measured on continuous and discrete scales, respectively.

---

## Faceting

.tiny[
```r
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) +
  geom_point(alpha = .8) +
* facet_grid(lgID ~ .)
```

<img src="lec_08_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

---

## Faceting

.tiny[
```r
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) +
  geom_point(alpha = .8) +
* facet_grid(. ~ lgID)
```

<img src="lec_08_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

---

## Faceting

.tiny[
```r
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) +
  geom_point(alpha = .8) +
* facet_grid(divID ~ lgID)
```

<img src="lec_08_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

---

## Faceting

.tiny[
```r
ggplot(data = teams, mapping = aes(x = R, y = WinPct, color = DivWin)) +
  geom_point(alpha = .8) +
* facet_wrap(~yearID)
```

<img src="lec_08_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]

---

## Facet grid or wrap?

- Use `facet_wrap()` to wrap a one-dimensional sequence of panels into two dimensions.
- Use `facet_grid()` when you have two discrete variables and you want panels of plots to represent all possible combinations.

---

## Exercise

Let's explore the relationship between runs and strikeouts for division winners and non-division winners. Use tibble `teams` to re-create the plot below.

<img src="lec_08_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

<br/>

**How can we improve this visualization?**

???

.tiny[
```r
ggplot(data = teams, mapping = aes(x = SO, y = R, color = factor(DivWin))) +
  geom_point(size = 3, alpha = .8) +
  facet_wrap(~yearID, nrow = 2) +
  labs(x = "Strike outs", y = "Runs", color = "Division winner")
```
]

---

## A more effective visualization

<img src="lec_08_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />

???

```r
ggplot(data = teams, mapping = aes(x = SO, y = R, color = factor(DivWin))) +
  geom_point(size = 2, alpha = .8) +
  geom_hline(yintercept = 750, lty = 2, alpha = .5, color = "blue") +
  geom_vline(xintercept = 1250, lty = 2, alpha = .5, color = "blue") +
  facet_wrap(~yearID, nrow = 2) +
  labs(x = "Strike outs", y = "Runs", color = "Division winner",
       title = "Division winners generally score more runs",
       subtitle = "and have fewer strike outs") +
  scale_color_manual(values = c("grey", "red")) +
  scale_x_continuous(limits = c(750, 1750),
                     breaks = seq(900, 1700, 350),
                     labels = seq(900, 1700, 350)) +
  scale_y_continuous(limits = c(500, 1000),
                     breaks = seq(500, 1000, 100),
                     labels = seq(500, 1000, 100)) +
  theme_bw(base_size = 16) +
  theme(legend.position = "bottom")
```

---

class: inverse, center, middle

# Other geoms

---

## Caution

- The following plots are not well-polished. They are designed to demonstrate the various geoms and options that exist within `ggplot2`.
- You should always have a well-labelled and polished visualization if it will be seen by an outside audience.

---

## Box plots

.tiny[
```r
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg)) +
* geom_boxplot(color = "#E81828", fill = "#002D72", alpha = .7)
```

<img src="lec_08_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]

---

## Box plots: flipped coordinates

.tiny[
```r
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg)) +
  geom_boxplot(color = "#E81828", fill = "#002D72", alpha = .7) +
* coord_flip()
```

<img src="lec_08_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />
]

---

## Box plots: custom colors

.tiny[
```r
ggplot(teams, mapping = aes(x = factor(yearID), y = kpg, fill = lgID)) +
  geom_boxplot(color = "grey", alpha = .7) +
* scale_fill_manual(values = c("#E81828", "#002D72")) +
  coord_flip() +
* theme_bw()
```

<img src="lec_08_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots

.tiny[
```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = franchID)) +
  geom_bar(stat = "identity")
```

<img src="lec_08_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots: angled text

.tiny[
```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = franchID)) +
  geom_bar(stat = "identity") +
* theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

<img src="lec_08_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots: sorted

.tiny[
```r
*ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = reorder(franchID, W))) +
  geom_bar(stat = "identity", color = "#E81828", fill = "#002D72", alpha = .2) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

<img src="lec_08_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />
]

---

## Bar plots

.tiny[
```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(y = W, x = reorder(franchID, W))) +
  geom_bar(stat = "identity", color = "#E81828", fill = "#002D72", alpha = .3) +
* scale_y_continuous(breaks = seq(0, 120, 15), labels = seq(0, 120, 15), limits = c(0, 120)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

<img src="lec_08_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />
]

---

## Histograms

.tiny[
```r
ggplot(teams, mapping = aes(x = WinPct)) +
  geom_histogram(binwidth = .025, fill = "#E81828", color = "#002D72", alpha = .7)
```

<img src="lec_08_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />
]

---

## Density plots

.tiny[
```r
ggplot(teams, mapping = aes(x = WinPct)) +
  geom_density(fill = "#E81828", color = "#002D72", alpha = .7)
```

<img src="lec_08_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />
]

---

## Density plots: custom colors

.tiny[
```r
ggplot(teams, mapping = aes(x = WinPct, fill = lgID)) +
  geom_density(alpha = .5) +
* scale_fill_manual(values = c("#E81828", "#002D72"))
```

<img src="lec_08_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />
]

---

## Heat maps

.tiny[
```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) +
* geom_raster()
```

<img src="lec_08_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" />
]

---

## Heat maps: color palette

.tiny[
```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) +
  geom_raster() +
* scale_fill_gradientn(colours = terrain.colors(100))
```

<img src="lec_08_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" />
]

---

## Heat maps: color palette

.tiny[
```r
ggplot(teams[teams$yearID == 2018, ], mapping = aes(x = Rank, y = divID, fill = RD)) +
  geom_raster() +
* scale_fill_gradient(low = "#fef0d9", high = "#b30000")
```

<img src="lec_08_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" />
]

---

## Choosing colors

[Color Brewer 2](http://colorbrewer2.org/)

<img src="images/color_brewer.png">

---

## Effective visualization tips

- Provide a title that tells a story.
- Strive to make your visualization self-contained, so that it can be understood without outside context.
- Be mindful of color and scale choices.
- Generally, color is better than shape to make things "pop".
- Not everything has to have a color, shape, transparency, etc.
- Add labels and annotation.
- Use your visualization to support your story.
- Use chunk options `fig.width`, `fig.height`, `fig.align`, and `fig.show` to manipulate your plot's size and placement.

---

# Exercise

---

## Energy data

.tiny[
```r
energy <- read_csv("http://www2.stat.duke.edu/~sms185/data/energy/energy.csv")
```
]

.tiny[
```r
energy
```

```
#> # A tibble: 105 × 6
#>    MWhperDay name                                 type  location     note    boe
#>        <dbl> <chr>                                <chr> <chr>        <chr> <dbl>
#>  1         3 Chernobyl Solar                      Solar Ukraine      "On …     0
#>  2       637 Solarpark Meuro                      Solar Germany      <NA>     55
#>  3       920 Tesla's proposed virtual power plant Solar South Austr… "50,…    79
#>  4      1280 Quaid-e-Azam                         Solar Pakistan     "Nam…   110
#>  5      1760 Topaz                                Solar USA          <NA>    152
#>  6      2025 Agua Caliente                        Solar USA          "Ari…   175
#>  7      2466 Kamuthi                              Solar India        "\"1…   213
#>  8      2720 Longyangxia                          Solar China        <NA>    234
#>  9      3840 Kurnool                              Solar India        <NA>    331
#> 10      4950 Tengger Desert                       Solar China        "Cov…   427
#> # ℹ 95 more rows
```
]

---

## Data dictionary

Each row records the amount of energy a power source generates per day, measured in megawatt hours (MWh).

- `MWhperDay`: MWh of energy generated per day
- `name`: energy source name
- `type`: type of energy source
- `location`: country of energy source
- `note`: more details on energy source
- `boe`: barrel of oil equivalent

<br>

- **Daily megawatt hour (MWh)** is a measure of energy output.
- **1 MWh** is, on average, enough power for 28 people in the USA.

---

## Objective

Re-create the plot on the following slide.

A few notes:

- base font size is 18
- hex colors: `c("#9d8b7e", "#315a70", "#66344c", "#678b93", "#b5cfe1", "#ffcccc")`
- use function `order()` to help get the top 30

Starter code:

```r
# Keep the 30 rows with the largest daily energy output
energy_top_30 <- energy[order(energy$MWhperDay, decreasing = TRUE)[1:30], ]
```

---

<img src="lec_08_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" />

???

.tiny[
```r
ggplot(energy_top_30,
       mapping = aes(x = reorder(name, MWhperDay), y = MWhperDay / 1000, fill = type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#9d8b7e", "#315a70", "#66344c",
                               "#678b93", "#b5cfe1", "#ffcccc")) +
  theme_bw(base_size = 18) +
  labs(y = "Daily MWh (in thousands)", x = "Power Source",
       title = "Top 30 power source energy generators",
       fill = "Power Source",
       caption = "1 MWh is, on average, enough power for 28 people in the USA") +
  coord_flip()
```
]

---

## References

1. Grolemund, G., & Wickham, H. (2019). R for Data Science. https://r4ds.had.co.nz/data-visualisation.html
2. ggplot2 function reference. https://ggplot2.tidyverse.org/reference/
3. Lahman, S. (2019). Lahman's Baseball Database, 1871-2018. http://www.seanlahman.com/baseball-archive/statistics/
4. https://www.visualcapitalist.com/worlds-largest-energy-sources/