We continue our journey toward effective communication with R graphics.
Today, we will cover some of the most important methods of graphing data and end with a few special methods of graphing with R.
We all know the standard bar chart and see it everywhere, but its drawback is that it shows only one indicator/variable at a time. What if another indicator factors into the decision, say the class of a car and the type of drive train it has? To look at this example, we will use the mpg dataset that ships with the ggplot2 package.
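library(ggplot2)      # provides the mpg dataset
library(RColorBrewer) # provides brewer.pal()
# cross-tabulate drive type (rows) against car class (columns)
Numbers <- table(mpg$drv, mpg$class)
barplot(Numbers,
main = 'Number of cars in each class, grouped by drive type',
col = brewer.pal(3, name = "Dark2"),
legend = rownames(Numbers),
xlab = 'Class of Car',
ylab = 'Number of cars in each class')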
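The same two-way-table pattern can be tried on base R's mtcars dataset without loading any packages. The following is a minimal sketch, assuming we cross-tabulate number of cylinders against number of gears; the variable choice and the hard-coded colors are ours for illustration.

```r
# Sketch: stacked bar chart from a two-way table, base R only.
# Rows = number of cylinders, columns = number of gears.
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)
print(cyl_by_gear)

# Three hex colors hard-coded here so the sketch needs no extra package.
bar_colors <- c("#1B9E77", "#D95F02", "#7570B3")

barplot(cyl_by_gear,
        main = "Cars by number of gears, grouped by cylinders",
        col = bar_colors,
        legend.text = rownames(cyl_by_gear),
        args.legend = list(title = "Cylinders"),
        xlab = "Number of gears",
        ylab = "Number of cars")
```

Each bar is one column of the table, and each colored segment is one row, so the height of a bar is the column total.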
ggplot(mpg, aes(class)) +
geom_bar(aes(fill = drv),
position = position_stack(reverse = TRUE)) +
coord_flip() +
theme(legend.position = "top") +
scale_fill_brewer(palette="Dark2") +
# xlab() labels the class aesthetic, which coord_flip() draws on the vertical axis
xlab("Class of Car") +
ylab("Number of Cars by Drive Type")
*** Notice the use of color palettes in both plots! ***
One of the most important aspects of effective communication is the correct use of color. If colors are not chosen wisely, your graphs will end up looking random, and readers will spend more time interpreting the colors than thinking about the business question. I use Canva.com often and I recommend it to you. Part of your grade for homework 1 will be based on your effort on color coordination. I do expect everyone to read about the RColorBrewer package.
To visualize the relationship between multiple variables, we can use color intensity. To that end, heat maps were brought in from GIS to data science, resulting in a 2-D visualization that is easy to interpret and graph.
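As a starting point for reading about RColorBrewer, here is a minimal sketch of how a palette can be inspected before it is used in a plot. It assumes the RColorBrewer package is installed; the variable names are ours.

```r
# Sketch: inspecting RColorBrewer palettes before using them.
library(RColorBrewer)

# The three "Dark2" colors used in the bar charts, as hex strings.
pal <- brewer.pal(n = 3, name = "Dark2")
print(pal)

# brewer.pal.info lists every palette with its size, category
# (sequential, diverging, qualitative) and a colorblind-safe flag.
print(head(brewer.pal.info))

# Draw only the palettes flagged as colorblind friendly.
display.brewer.all(colorblindFriendly = TRUE)
```

Qualitative palettes such as "Dark2" suit unordered categories, while sequential and diverging palettes suit ordered or signed values.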
For our example, we will generate 30 random points.
# simulate a dataset of 30 points
set.seed(143) # set the seed first so the simulated data and the shuffle are both reproducible
x<-rnorm(30,mean=rep(1:5,each=2),sd=0.7)
y<-rnorm(30,mean=rep(c(1,9),each=5),sd=0.1)
dataFrame<-data.frame(x=x,y=y)
dataMatrix<-as.matrix(dataFrame)[sample(1:30),] # convert to class 'matrix', then shuffle the rows of the matrix
heatmap(dataMatrix) # visualize hierarchical clustering via a heatmap
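The simulated data has only two columns, so it may help to see the same function on a wider table. The sketch below assumes we use mtcars and scale each column, since its variables are measured in very different units; that choice is ours and not part of the example above.

```r
# Sketch: heat map of a dataset whose columns are on different scales.
m <- as.matrix(mtcars)

# scale = "column" standardizes each variable before coloring, so that
# large-valued columns such as disp and hp do not dominate the color range.
hm <- heatmap(m, scale = "column", margins = c(5, 8))

# heatmap() invisibly returns the row and column order chosen by the
# hierarchical clustering; here are the first few cars in plotted order.
print(head(rownames(m)[hm$rowInd]))
```

Rows that end up next to each other are the cars the clustering considers most similar across all variables.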
Correlation between variables is best visualized with a correlogram, which the corrplot package draws. The 2-D format is similar to a heat map, but each cell shows the correlation coefficient between one pair of variables.
Correlograms are often used to show how strongly data from different points in time are correlated; comparing sales data between different months or years is a basic example.
library(corrplot)
# mtcars ships with base R, so calling data("mtcars") is optional
corr_matrix <- cor(mtcars)
# with circles
corrplot(corr_matrix)
# with numbers, lower triangle only
corrplot(corr_matrix,
method = 'number',
type = "lower")
library(ggcorrplot)
ggcorrplot(corr_matrix)
ggcorrplot(corr_matrix,
hc.order = TRUE,
type = "lower",
outline.col = "white",
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726"),
lab = TRUE)
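To make the numbers inside a correlogram less of a black box, the sketch below recomputes one cell of the matrix by hand using the Pearson formula and compares it with cor(). The pair of variables (mpg and wt) is our choice for illustration.

```r
# Sketch: one cell of the correlation matrix, computed two ways.
x <- mtcars$mpg
y <- mtcars$wt

# Pearson correlation: covariance of x and y divided by the product
# of their standard deviations (the n - 1 factors cancel out).
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The same value as produced by cor() for the full matrix.
r_matrix <- cor(mtcars)["mpg", "wt"]

print(c(manual = r_manual, from_matrix = r_matrix))
print(all.equal(r_manual, r_matrix))
```

The value is strongly negative (roughly -0.87): heavier cars tend to get fewer miles per gallon, which is why that cell is drawn in the color used for negative correlations.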
ggvis is a separate graphics package, with a grammar similar to that of ggplot2, that allows us to create interactive plots with HTML (website) integration. This feature is very important when we want to give stakeholders and decision makers greater access to the data, while giving the data scientist greater freedom to explore.
Let's start with a simple plot based on the stacked bar charts.
mpg %>%
ggvis::ggvis(~class, fill = ~drv) %>%
ggvis::layer_bars()
What about an interactive histogram?
mpg %>%
ggvis::ggvis(~hwy) %>%
ggvis::layer_histograms(width = ggvis::input_slider(1, 10, value = 2)) # 'width' replaces the deprecated 'binwidth'
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.
library(sunburstR)
# read in sample visit-sequences.csv data provided in source
# https://gist.github.com/kerryrodden/7090426#file-visit-sequences-csv
sequences <- read.csv(
system.file("examples/visit-sequences.csv", package = "sunburstR"),
header = FALSE,
stringsAsFactors = FALSE
)
sunburst(sequences)
IN LAB ASSIGNMENT: Now it is your turn! If you don't already have the Lahman package installed, install it before running the code below. Using the Batting dataset, create 2 plots using ggplot showing the relationship between some of the variables. Note that the dataset includes both categorical (factor) fields and numeric fields, so try grouping the data by some of the factors to create color groupings or facets. Note also that we will be using the filtered dataset Batting_recent so our plots don't take too much time to render. Finally, make sure your graphs are clearly labeled, titled, and have good color coordination!
library(Lahman)
data("Batting")
str(Batting)
## 'data.frame': 105861 obs. of 22 variables:
## $ playerID: chr "abercda01" "addybo01" "allisar01" "allisdo01" ...
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 136 111 39 142 111 56 111 24 56 24 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ G : int 1 25 29 27 25 12 1 31 1 18 ...
## $ AB : int 4 118 137 133 120 49 4 157 5 86 ...
## $ R : int 0 30 28 28 29 9 0 66 1 13 ...
## $ H : int 0 32 40 44 39 11 1 63 1 13 ...
## $ X2B : int 0 6 4 10 11 2 0 10 1 2 ...
## $ X3B : int 0 0 5 2 3 1 0 9 0 1 ...
## $ HR : int 0 0 0 2 0 0 0 0 0 0 ...
## $ RBI : int 0 13 19 27 16 5 2 34 1 11 ...
## $ SB : int 0 8 3 1 6 0 0 11 0 1 ...
## $ CS : int 0 1 1 1 2 1 0 6 0 0 ...
## $ BB : int 0 4 2 0 2 0 1 13 0 0 ...
## $ SO : int 0 0 5 2 1 1 0 1 0 0 ...
## $ IBB : int NA NA NA NA NA NA NA NA NA NA ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SH : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ GIDP : int 0 0 1 0 0 0 0 1 0 0 ...
Batting_recent <- Batting %>%
filter(yearID >= 2015) %>%
mutate(Power_Hitter = ifelse(HR >= 25, "Power Hitter", "Not Power Hitter"))
glimpse(Batting_recent)
## Observations: 5,998
## Variables: 23
## $ playerID <chr> "aardsda01", "abadfe01", "abreujo02", "achteaj01", "ackl…
## $ yearID <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ stint <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ teamID <fct> ATL, OAK, CHA, MIN, SEA, NYA, COL, CLE, SLN, CIN, SFN, S…
## $ lgID <fct> NL, AL, AL, AL, AL, AL, NL, AL, NL, NL, NL, NL, AL, NL, …
## $ G <int> 33, 62, 154, 11, 85, 23, 26, 28, 60, 13, 52, 52, 7, 134,…
## $ AB <int> 1, 0, 613, 0, 186, 52, 53, 1, 175, 0, 113, 2, 19, 421, 0…
## $ R <int> 0, 0, 88, 0, 22, 6, 4, 0, 14, 0, 11, 0, 0, 49, 0, 12, 0,…
## $ H <int> 0, 0, 178, 0, 40, 15, 13, 0, 42, 0, 21, 0, 6, 95, 0, 22,…
## $ X2B <int> 0, 0, 34, 0, 8, 3, 1, 0, 9, 0, 7, 0, 1, 17, 0, 2, 0, 0, …
## $ X3B <int> 0, 0, 3, 0, 1, 2, 1, 0, 0, 0, 1, 0, 0, 6, 0, 1, 0, 0, 0,…
## $ HR <int> 0, 0, 30, 0, 6, 4, 0, 0, 5, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0…
## $ RBI <int> 0, 0, 101, 0, 19, 11, 3, 0, 24, 0, 11, 0, 2, 34, 0, 4, 0…
## $ SB <int> 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 3, 0, 0, 4, 0, 1, 0, 0, 1,…
## $ CS <int> 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0, 5, 0, 0, 0, 0, 0,…
## $ BB <int> 0, 0, 39, 0, 14, 4, 3, 0, 10, 0, 15, 0, 0, 29, 0, 2, 0, …
## $ SO <int> 1, 0, 140, 0, 38, 7, 11, 0, 41, 0, 20, 2, 7, 81, 0, 17, …
## $ IBB <int> 0, 0, 11, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ HBP <int> 0, 0, 15, 0, 1, 0, 1, 0, 0, 0, 4, 0, 1, 1, 0, 0, 0, 0, 0…
## $ SH <int> 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 2, 0, 0, 5, 0, 3, 0, 0, 1,…
## $ SF <int> 0, 0, 1, 0, 3, 1, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0,…
## $ GIDP <int> 0, 0, 16, 0, 3, 0, 0, 1, 1, 0, 2, 0, 0, 4, 0, 2, 0, 0, 0…
## $ Power_Hitter <chr> "Not Power Hitter", "Not Power Hitter", "Power Hitter", …
#Plotting total number of home runs by team and League ID
ggplot(data = Batting_recent,
aes(x = teamID,
y = HR,
fill = lgID)) +
labs(fill = "League ID") +
geom_bar(stat = "identity") +
ggtitle("Total Home Runs by Team") +
theme(axis.text.x = element_text(size = 10, angle = 90)) +
xlab("Team ID") + ylab("Total Home Runs")
#Plot shows that the 4 teams with the most home runs are all in the American League (AL)
#Plotting correlation matrix to see which features are correlated with each other
corr_matrix <- cor(select_if(Batting_recent, is.numeric))
ggcorrplot(corr_matrix,
hc.order = TRUE,
type = "lower",
outline.col = "white",
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726"))+
ggtitle("Correlation Matrix")
#the correlation matrix shows a very high correlation between "At Bats" (AB) and "Hits" (H), which is expected since every hit is recorded during an at bat, so players with more at bats have more opportunities to get hits
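The assignment also mentions facets, which neither plot above uses. Here is a minimal sketch of the idea, run on the mpg data from earlier so that it is self-contained; the variables, labels, and title are our choices, and the same pattern could be applied to Batting_recent.

```r
# Sketch: color grouping plus facets in one ggplot.
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  facet_wrap(~ class) +                    # one panel per car class
  scale_colour_brewer(palette = "Dark2") + # same palette as the bar charts
  labs(title = "Highway mileage vs. engine displacement by car class",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       colour = "Drive type")

print(p)
```

Facets split the data into small panels by one factor, which leaves color free to encode a second factor.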