class: center, middle, inverse, title-slide

# Visualization in R
## How to know your data clearly
### Xiaokai Xia
### 2017/4/20

---

## Prologue

You can find these slides at https://github.com/dddd1007/VisulizationLesson

To follow this lesson, you will need the following packages:

- tidyverse
- vioplot
- yarrr
- superheat
- corrgram
- factoextra

---

## Why should we visualize our data?

```r
data(anscombe)
```

<table class='gmisc_table' style='border-collapse: collapse; margin-top: 1em; margin-bottom: 1em;' >
<thead>
<tr>
<th style='border-top: 2px solid grey;'></th>
<th colspan='2' style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Data 1</th><th style='border-top: 2px solid grey;; border-bottom: hidden;'> </th>
<th colspan='2' style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Data 2</th><th style='border-top: 2px solid grey;; border-bottom: hidden;'> </th>
<th colspan='2' style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Data 3</th><th style='border-top: 2px solid grey;; border-bottom: hidden;'> </th>
<th colspan='2' style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Data 4</th>
</tr>
<tr>
<th style='border-bottom: 1px solid grey;'> </th>
<th style='border-bottom: 1px solid grey; text-align: center;'>x-value</th>
<th style='border-bottom: 1px solid grey; text-align: center;'>y-value</th>
<th style='border-bottom: 1px solid grey;' colspan='1'> </th>
<th style='border-bottom: 1px solid grey; text-align: center;'>x-value</th>
<th style='border-bottom: 1px solid grey; text-align: center;'>y-value</th>
<th style='border-bottom: 1px solid grey;' colspan='1'> </th>
<th style='border-bottom: 1px solid grey; text-align: center;'>x-value</th>
<th style='border-bottom: 1px solid grey; text-align: center;'>y-value</th>
<th style='border-bottom: 1px solid grey;'
colspan='1'> </th> <th style='border-bottom: 1px solid grey; text-align: center;'>x-value</th> <th style='border-bottom: 1px solid grey; text-align: center;'>y-value</th> </tr> </thead> <tbody> <tr> <td style='text-align: left;'>1</td> <td style='text-align: center;'>10</td> <td style='text-align: center;'>8.04</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>10</td> <td style='text-align: center;'>9.14</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>10</td> <td style='text-align: center;'>7.46</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>6.58</td> </tr> <tr> <td style='text-align: left;'>2</td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>6.95</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>8.14</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>6.77</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>5.76</td> </tr> <tr> <td style='text-align: left;'>3</td> <td style='text-align: center;'>13</td> <td style='text-align: center;'>7.58</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>13</td> <td style='text-align: center;'>8.74</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>13</td> <td style='text-align: center;'>12.74</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>7.71</td> </tr> <tr> <td style='text-align: left;'>4</td> <td style='text-align: center;'>9</td> <td style='text-align: center;'>8.81</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>9</td> <td style='text-align: center;'>8.77</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>9</td> <td style='text-align: center;'>7.11</td> <td style='' colspan='1'> </td> <td 
style='text-align: center;'>8</td> <td style='text-align: center;'>8.84</td> </tr> <tr> <td style='text-align: left;'>5</td> <td style='text-align: center;'>11</td> <td style='text-align: center;'>8.33</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>11</td> <td style='text-align: center;'>9.26</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>11</td> <td style='text-align: center;'>7.81</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>8.47</td> </tr> <tr> <td style='text-align: left;'>6</td> <td style='text-align: center;'>14</td> <td style='text-align: center;'>9.96</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>14</td> <td style='text-align: center;'>8.1</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>14</td> <td style='text-align: center;'>8.84</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>7.04</td> </tr> <tr> <td style='text-align: left;'>7</td> <td style='text-align: center;'>6</td> <td style='text-align: center;'>7.24</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>6</td> <td style='text-align: center;'>6.13</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>6</td> <td style='text-align: center;'>6.08</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>8</td> <td style='text-align: center;'>5.25</td> </tr> <tr> <td style='text-align: left;'>8</td> <td style='text-align: center;'>4</td> <td style='text-align: center;'>4.26</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>4</td> <td style='text-align: center;'>3.1</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>4</td> <td style='text-align: center;'>5.39</td> <td style='' colspan='1'> </td> <td style='text-align: center;'>19</td> <td style='text-align: center;'>12.5</td> </tr> <tr> <td style='text-align: 
left;'>9</td>
<td style='text-align: center;'>12</td>
<td style='text-align: center;'>10.84</td>
<td style='' colspan='1'> </td>
<td style='text-align: center;'>12</td>
<td style='text-align: center;'>9.13</td>
<td style='' colspan='1'> </td>
<td style='text-align: center;'>12</td>
<td style='text-align: center;'>8.15</td>
<td style='' colspan='1'> </td>
<td style='text-align: center;'>8</td>
<td style='text-align: center;'>5.56</td>
</tr>
<tr>
<td style='text-align: left;'>10</td>
<td style='text-align: center;'>7</td>
<td style='text-align: center;'>4.82</td>
<td style='' colspan='1'> </td>
<td style='text-align: center;'>7</td>
<td style='text-align: center;'>7.26</td>
<td style='' colspan='1'> </td>
<td style='text-align: center;'>7</td>
<td style='text-align: center;'>6.42</td>
<td style='' colspan='1'> </td>
<td style='text-align: center;'>8</td>
<td style='text-align: center;'>7.91</td>
</tr>
<tr>
<td style='border-bottom: 2px solid grey; text-align: left;'>11</td>
<td style='border-bottom: 2px solid grey; text-align: center;'>5</td>
<td style='border-bottom: 2px solid grey; text-align: center;'>5.68</td>
<td style='border-bottom: 2px solid grey;' colspan='1'> </td>
<td style='border-bottom: 2px solid grey; text-align: center;'>5</td>
<td style='border-bottom: 2px solid grey; text-align: center;'>4.74</td>
<td style='border-bottom: 2px solid grey;' colspan='1'> </td>
<td style='border-bottom: 2px solid grey; text-align: center;'>5</td>
<td style='border-bottom: 2px solid grey; text-align: center;'>5.73</td>
<td style='border-bottom: 2px solid grey;' colspan='1'> </td>
<td style='border-bottom: 2px solid grey; text-align: center;'>8</td>
<td style='border-bottom: 2px solid grey; text-align: center;'>6.89</td>
</tr>
</tbody>
</table>

---

## Try to do a statistical analysis

### Like ... Correlation?

```r
cor(anscombe$x1, anscombe$y1)
```

```
## [1] 0.8164205
```

```r
cor(anscombe$x2, anscombe$y2)
```

```
## [1] 0.8162365
```

```r
cor(anscombe$x3, anscombe$y3)
```

```
## [1] 0.8162867
```

```r
cor(anscombe$x4, anscombe$y4)
```

```
## [1] 0.8165214
```

---

### Or a Linear Model?

```r
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for (i in 1:4) {
  # Swap in y1 ~ x1, y2 ~ x2, ... as response and predictor
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  mods[[i]] <- lm(ff, data = anscombe)
}
# Compare intercepts and slopes across the four fits
sapply(mods, coef)
```

```
##                   lm1      lm2       lm3       lm4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1          0.5000909 0.500000 0.4997273 0.4999091
```

---

### But what if we visualize the data first?

---

## Basic Plot

- Histogram
- Bar Plot
- Boxplot
- Pie Chart

--

```r
head(iris)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
```

---

### Histogram

```r
hist(iris$Sepal.Length)
```

---

### Bar Plot

```r
barplot(table(iris$Species))
```

---

#### Tip

**Notice the difference between histograms and bar charts.** A histogram bins a continuous variable; a bar chart draws one bar per category.

---

### Boxplot

```r
boxplot(Sepal.Length ~ Species, data = iris)
```

---

### Pie Chart

```r
pie(c(1, 1, 1), levels(iris$Species))
```

---

## Do I need to know any other charts?

---

## Some charts using raw data

- Scatter Plot
- Violin Plot
- Pirate Plot

---

### Scatter Plot

```r
plot(iris$Sepal.Length, iris$Sepal.Width)
```

---

### Violin Plot

---

### Pirate Plot

```r
yarrr::pirateplot(Sepal.Length ~ Species, data = iris)
```

---

## Sometimes you may find cool figures in papers.

Like this...

---

### And you can learn how to plot them by searching the Internet.

**Correlogram**

```r
corrgram::corrgram(iris[-5], upper.panel = NULL)
```

**Heatmap**

```r
superheat::superheat(X = iris[, -c(1, 5)],
                     yr = iris[, 1],
                     yr.axis.name = "Sepal.Length",
                     membership.rows = iris$Species)
```

**Cluster plot**

```r
data("USArrests")
df <- scale(USArrests)
km.res <- kmeans(df, 4, nstart = 25)
factoextra::fviz_cluster(km.res, data = df, ggtheme = ggplot2::theme_minimal())
```

---

background-image: url(hadly.jpg)

---

## ggplot2

--

### The Grammar of Graphics

---

## Three Basic Elements

- Layers
- Aesthetics
- Geometric objects (geoms)

---

## Scatter Plot

```r
ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width))
```

---

## Scatter Plot

```r
ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species))
```

---

## Scatter Plot

```r
ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  stat_smooth(aes(x = Sepal.Length, y = Sepal.Width))
```

---

## Scatter Plot

```r
ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  stat_smooth(aes(x = Sepal.Length, y = Sepal.Width)) +
  facet_grid(Species ~ .)
```

---

## Boxplot

--

Sepal.Length ~ Species

```r
ggplot(iris) +
  geom_boxplot(aes(x = Species, y = Sepal.Length))
```

---

## Boxplot

Change color

--

```r
ggplot(iris) +
  geom_boxplot(aes(x = Species, y = Sepal.Length, fill = Species))
```

---

## Boxplot

Change theme

--

```r
ggplot(iris) +
  geom_boxplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
  ggthemes::theme_economist()
```

---

## Boxplot

Change color scale

--

```r
ggplot(iris) +
  geom_boxplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
  ggthemes::theme_economist() +
  scale_fill_brewer()
```

---

## Q-Q Plot

--

```r
ggplot(iris) +
  geom_qq(aes(sample = Sepal.Length))
```

---

## Histogram

--

```r
ggplot(iris) +
  geom_histogram(aes(x = Sepal.Length))
```

---

## Barplot

--

```r
ggplot(iris) +
  geom_bar(aes(x = Species))
```

---

## Barplot

--

Change color by group

```r
ggplot(iris) +
  geom_bar(aes(x = Species, fill = Species))
```

---

## So why should we use ggplot2?

- A grammar with clear, understandable logic.
- Automatic statistical calculations.
- Automatic legends, colors, etc.
- Nice output.
- Easy to extend.
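---

To tie these points back to the opening example, here is a minimal sketch of how the `anscombe` data could be drawn with ggplot2, one panel per data set. It assumes ggplot2 is installed. The names `anscombe_long`, `set`, and `p` are illustrative and do not come from the original slides, and this is not necessarily how the figure in the lesson was produced.

```r
library(ggplot2)

data(anscombe)

# Stack the four x/y pairs into one long data frame (base R only).
# `anscombe_long` and `set` are hypothetical names chosen for this sketch.
anscombe_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    set = paste("Data", i),
    x   = anscombe[[paste0("x", i)]],
    y   = anscombe[[paste0("y", i)]]
  )
}))

# Layers: points plus a fitted straight line; facets: one panel per data set.
p <- ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ set)

print(p)
```

Each `+` adds one layer or one facet specification, and `geom_smooth()` fits the line for every panel without extra code, which is the kind of automatic statistical calculation listed above.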