Business Intelligence

Agenda

Introduction to ggplot2 Package (with Scatter Plot)
Bar Chart
Tree Map
Correlation Plot

Introduction to ggplot2 Package

ggplot2 is a powerful package for more advanced visualization.

The functions in the ggplot2 package build up a graph in layers: start with a simple graph, then add elements or functions with +, one at a time.

Load data

We will first load a dataset from the mosaicData package using the following code, then load the ggplot2 library.

install.packages('mosaicData') # only needed the first time

library('mosaicData')
data(CPS85, package = "mosaicData")
# load library 
library(ggplot2)
head(CPS85,5)

##   wage educ race sex hispanic south married exper union age   sector
## 1  9.0   10    W   M       NH    NS Married    27   Not  43    const
## 2  5.5   12    W   M       NH    NS Married    20   Not  38    sales
## 3  3.8   12    W   F       NH    NS  Single     4   Not  22    sales
## 4 10.5   12    W   F       NH    NS Married    29   Not  47 clerical
## 5 15.0   12    W   M       NH    NS Married    40 Union  58    const

ggplot function

The first function in building a graph is the ggplot function. It specifies

the data frame containing the data to be plotted
the mapping of the variables to visual properties of the graph (where aes stands for aesthetics).

    # specify the dataset and the mapping
    ggplot(data = CPS85,
           mapping = aes(x = exper, y = wage)) # map exper to the x-axis and wage to the y-axis

geoms (geometric objects)

Geoms are the geometric objects (points, lines, bars, etc.) of a graph. They are added using functions that start with geom_. In this example, the geom_point function is used to create a scatterplot. In ggplot2 graphs, functions are chained together using the + sign to build a final plot.

    ggplot(data = CPS85,
           mapping = aes(x = exper, y = wage)) + 
    geom_point() #This is how we add datapoints to the plot
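
Because each + adds a layer, a plot can also be stored in an object and extended step by step. The following is a minimal sketch, assuming the CPS85 data loaded earlier; the object names p_base and p_points are illustrative and not part of the original material.

```r
library(ggplot2)
data(CPS85, package = "mosaicData")

# step 1: data and mapping only; printing this draws empty axes with no points
p_base <- ggplot(data = CPS85,
                 mapping = aes(x = exper, y = wage))

# step 2: add a point layer to the stored plot
p_points <- p_base + geom_point()

length(p_base$layers)   # number of layers before adding a geom
length(p_points$layers) # number of layers after adding geom_point()

p_points # printing the object draws the scatterplot
```

Storing intermediate plots this way makes it easy to try several geoms on the same data and mapping.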

Delete outlier

The graph indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case before continuing.

library(dplyr)
plotdata <- filter(CPS85, wage < 40) #filter out the outlier

# redraw scatterplot
ggplot(data = plotdata, # Then we used the updated dataset 
       mapping = aes(x = exper, y = wage)) +
  geom_point()
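
Before dropping a case, it can help to look at what the filter actually removes. The sketch below assumes the same cutoff of 40 used above; the name removed is illustrative.

```r
library(dplyr)
data(CPS85, package = "mosaicData")

# show the case(s) that the filter removes
removed <- filter(CPS85, wage >= 40)
removed

# keep the remaining cases, as in the code above
plotdata <- filter(CPS85, wage < 40)

# number of rows dropped by the filter
nrow(CPS85) - nrow(plotdata)
```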

Parameter/Options of geom_ Function

Options for the geom_point function include:

Point color: color
Point size: size
Transparency: alpha, ranging from 0 (completely transparent) to 1 (completely opaque)

# make points blue, larger, and semi-transparent
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", # color
             alpha = .7,               # transparency level
             size = 3)                 # point size

Add a Line of Best Fit

Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control

the type of line (linear, quadratic, nonparametric)
the thickness of the line
the line’s color
the presence or absence of a confidence interval.

Here we request a linear regression (method = lm) line (where lm stands for linear model).

    # add a line of best fit.
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage)) +
      geom_point(color = "cornflowerblue",alpha = .7,size = 3) +
      geom_smooth(method = "lm") #added another layer of smooth line of best fit and specified linear regression method

## `geom_smooth()` using formula = 'y ~ x'
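
The other geom_smooth options listed above can be sketched as follows. This is only an illustration, assuming the filtered plotdata from earlier; the particular color and line width are arbitrary choices, and the object names p, p_quad, and p_loess are hypothetical.

```r
library(ggplot2)
library(dplyr)
data(CPS85, package = "mosaicData")
plotdata <- filter(CPS85, wage < 40)

p <- ggplot(data = plotdata,
            mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7, size = 3)

# quadratic fit: thicker red line, no confidence band
p_quad <- p +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 2),
              color = "indianred3",
              linewidth = 1.5,
              se = FALSE)

# nonparametric (loess) fit with the default confidence band
p_loess <- p +
  geom_smooth(method = "loess", formula = y ~ x)

p_quad
p_loess
```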

Grouping

Variables can be mapped to color, shape, size, transparency, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph.

# indicate sex using color
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage, color = sex)) + # map the sex variable to color
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.5)

It appears that men tend to make more money than women. Additionally, there may be a stronger relationship between experience and wages for men than for women.
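
It is worth keeping mapping and setting apart: an aesthetic placed inside aes() is mapped to a variable, while the same argument placed outside aes() is set to a constant. A minimal sketch of the contrast, assuming the filtered plotdata from earlier (the object names p_mapped and p_set are illustrative):

```r
library(ggplot2)
library(dplyr)
data(CPS85, package = "mosaicData")
plotdata <- filter(CPS85, wage < 40)

# mapping: color depends on the sex variable, so a legend is drawn
p_mapped <- ggplot(plotdata, aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7, size = 3)

# setting: every point gets the same constant color, so no legend is needed
p_set <- ggplot(plotdata, aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7, size = 3)

# the aesthetics that each plot maps to variables
names(p_mapped$mapping)
names(p_set$mapping)
```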

Your turn

To explore the relationship between age and income, create a scatterplot with age on the x-axis and wage on the y-axis. Group the points by marital status (map the married variable to color).
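
One possible answer is sketched below; try the exercise first. It assumes the same outlier filter as before, and the alpha and size values are simply carried over from the earlier plots.

```r
library(ggplot2)
library(dplyr)
data(CPS85, package = "mosaicData")
plotdata <- filter(CPS85, wage < 40) # same outlier filter as before

# age on the x-axis, wage on the y-axis, color by marital status
p_age <- ggplot(data = plotdata,
                mapping = aes(x = age, y = wage, color = married)) +
  geom_point(alpha = .7, size = 3)

p_age
```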

Scales

Scales control how variable values are mapped to the visual characteristics of the plot. Scale functions (which start with scale_) allow you to modify this mapping. In the next plot, we’ll change the x- and y-axis scaling and the colors employed.

# modify the x and y axes and specify the colors to be used
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage,color = sex)) +
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
  scale_x_continuous(breaks = seq(0, 60, 10)) + #adjust the scales of axes
  scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", "cornflowerblue"))
# specify the colors used for the points and trend lines
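
The pieces passed to the scale functions can be inspected on their own. The sketch below assumes that the levels of sex are F and M, as the head() output suggests; the names y_breaks and sex_colors are illustrative. Naming the color vector makes the pairing of color and level explicit instead of relying on level order.

```r
library(ggplot2)
library(dplyr)
data(CPS85, package = "mosaicData")
plotdata <- filter(CPS85, wage < 40)

# breaks for the x-axis: every 10 years from 0 to 60
seq(0, 60, 10)

# scales::dollar turns numbers into currency labels
y_breaks <- seq(0, 30, 5)
scales::dollar(y_breaks)

# assumption: sex has the levels F and M
levels(CPS85$sex)
sex_colors <- c("F" = "indianred3", "M" = "cornflowerblue")

ggplot(plotdata, aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7, size = 3) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = y_breaks, labels = scales::dollar) +
  scale_color_manual(values = sex_colors)
```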

Facets

Facets reproduce a graph for each level of a given variable (or combination of variables). Facet functions start with facet_. Here, facets are defined by the eight levels of the sector variable.

ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage,color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
  scale_color_manual(values = c("indianred3","cornflowerblue")) +
  facet_wrap(~sector) #This is the facet_ function, how we produce a graph for each sector
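
Two variations on faceting are sketched below as an illustration: controlling the panel layout, and faceting by a combination of two variables with facet_grid. The choice of two rows and of sex and married as the facet variables is arbitrary, and the object names are hypothetical.

```r
library(ggplot2)
library(dplyr)
data(CPS85, package = "mosaicData")
plotdata <- filter(CPS85, wage < 40)

# one panel is drawn per distinct value of sector
length(unique(plotdata$sector))

p <- ggplot(plotdata, aes(x = exper, y = wage)) +
  geom_point(alpha = .7)

# control the panel layout: all sectors arranged in two rows
p_wrap <- p + facet_wrap(~sector, nrow = 2)

# combination of two variables: rows by sex, columns by married
p_grid <- p + facet_grid(sex ~ married)

p_wrap
p_grid
```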

Labels

The labs function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.

# add informative labels
ggplot(data = plotdata,mapping = aes(x = exper, y = wage,color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
  scale_color_manual(values = c("indianred3","cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender")

Bar Chart

The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart, or (less commonly) a tree map.

The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama. Below, a bar chart is used to display the distribution of wedding participants by race.

library(ggplot2)
data(Marriage, package = "mosaicData")

ggplot(Marriage, aes(x = race)) + 
  geom_bar(fill = "cornflowerblue", 
           color="black") +
  labs(x = "Race", 
       y = "Frequency", 
       title = "Participants by race")
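
geom_bar counts the rows in each category itself. As a rough sketch, the same chart can be built by computing the counts first and plotting them with geom_col; the name race_counts is illustrative.

```r
library(ggplot2)
library(dplyr)
data(Marriage, package = "mosaicData")

# compute the frequency of each race up front
race_counts <- count(Marriage, race)
race_counts

# geom_col() plots the precomputed n as the bar heights
ggplot(race_counts, aes(x = race, y = n)) +
  geom_col(fill = "cornflowerblue", color = "black") +
  labs(x = "Race",
       y = "Frequency",
       title = "Participants by race")
```

In short, geom_bar is the convenient choice when the heights are counts, and geom_col is the choice when the heights are already in the data.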

Bar Chart When the y-Axis Shows Another Statistic

library(ggplot2)
library(tidyverse)
df <- read_csv("HousePrices.csv") # read the house price data

ImageData <- df %>%
  group_by(air_cond) %>%
  summarise(n_house = n(), avg_price = mean(price))
ImageData

## # A tibble: 2 × 3
##   air_cond n_house avg_price
##   <chr>      <int>     <dbl>
## 1 No          1093   187022.
## 2 Yes          635   254904.

ggplot(ImageData, aes(air_cond, avg_price)) + 
  geom_col(colour = "cornflowerblue", fill = "cornflowerblue") +
  labs(x = "Whether house has AC", y = "Average Price", 
       title = "Price by AC") +
  coord_cartesian(ylim = c(150000, 270000)) # zoom in on the y-axis
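
The reason for using coord_cartesian rather than scale limits can be sketched as follows. HousePrices.csv is not available here, so this example assumes that SaratogaHouses from mosaicData can stand in for it, with centralAir playing the role of air_cond; the object names p_zoom and p_limits are illustrative.

```r
library(ggplot2)
library(dplyr)

# assumption: SaratogaHouses stands in for HousePrices.csv,
# with centralAir playing the role of air_cond
data(SaratogaHouses, package = "mosaicData")

ImageData <- SaratogaHouses %>%
  group_by(centralAir) %>%
  summarise(n_house = n(), avg_price = mean(price))

p <- ggplot(ImageData, aes(centralAir, avg_price)) +
  geom_col(fill = "cornflowerblue")

# coord_cartesian zooms the view; the bars still start at zero and are clipped
p_zoom <- p + coord_cartesian(ylim = c(150000, 270000))

# scale limits discard data outside the range; bars that start at zero
# fall outside the limits and are dropped with a warning
p_limits <- p + scale_y_continuous(limits = c(150000, 270000))

p_zoom
```

Keep in mind that a y-axis that does not start at zero makes the difference between bars look larger than it is.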

Tree Map

A tree map is an alternative to a pie chart, and it can also handle categorical variables with many levels.

#install.packages("treemapify")
library(treemapify)
data(Marriage, package = "mosaicData")

# create a treemap of marriage officials
plotdata <- Marriage %>%
  count(officialTitle)
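
The count function used here is shorthand for grouping and then summarising with n(). A small sketch of the equivalence, with illustrative object names:

```r
library(dplyr)
data(Marriage, package = "mosaicData")

# count() produces one row per official title with its frequency in column n
by_count <- Marriage %>%
  count(officialTitle)

# the same result written out with group_by() and summarise()
by_summarise <- Marriage %>%
  group_by(officialTitle) %>%
  summarise(n = n())

# compare the frequencies from the two approaches
identical(by_count$n, by_summarise$n)
```

The n column is what the treemap maps to area.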

ggplot(plotdata, aes(fill = officialTitle, 
           area = n)) +
  geom_treemap() + 
  labs(title = "Marriages by officiate")

Tree Map with Labels

We can also label each area with its category by mapping the variable to label and adding geom_treemap_text.

ggplot(plotdata, 
       aes(fill = officialTitle, area = n, label = officialTitle)) + 
  geom_treemap() + 
  geom_treemap_text(colour = "white", place = "centre") +
  labs(title = "Marriages by officiate")

Correlation Plot

Correlation plots help you to visualize the pairwise relationships between a set of quantitative variables by displaying their correlations using color or shading. We will first calculate the correlations.

#Load data
data(SaratogaHouses, package="mosaicData")
# select numeric variables
df <- dplyr::select_if(SaratogaHouses, is.numeric)
# calculate the correlations
r <- cor(df, use="complete.obs")
round(r,2)

##            price lotSize   age landValue livingArea pctCollege bedrooms
## price       1.00    0.16 -0.19      0.58       0.71       0.20     0.40
## lotSize     0.16    1.00 -0.02      0.06       0.16      -0.03     0.11
## age        -0.19   -0.02  1.00     -0.02      -0.17      -0.04     0.03
## landValue   0.58    0.06 -0.02      1.00       0.42       0.23     0.20
## livingArea  0.71    0.16 -0.17      0.42       1.00       0.21     0.66
## pctCollege  0.20   -0.03 -0.04      0.23       0.21       1.00     0.16
## bedrooms    0.40    0.11  0.03      0.20       0.66       0.16     1.00
## fireplaces  0.38    0.09 -0.17      0.21       0.47       0.25     0.28
## bathrooms   0.60    0.08 -0.36      0.30       0.72       0.18     0.46
## rooms       0.53    0.14 -0.08      0.30       0.73       0.16     0.67
##            fireplaces bathrooms rooms
## price            0.38      0.60  0.53
## lotSize          0.09      0.08  0.14
## age             -0.17     -0.36 -0.08
## landValue        0.21      0.30  0.30
## livingArea       0.47      0.72  0.73
## pctCollege       0.25      0.18  0.16
## bedrooms         0.28      0.46  0.67
## fireplaces       1.00      0.44  0.32
## bathrooms        0.44      1.00  0.52
## rooms            0.32      0.52  1.00
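
A large matrix is hard to scan by eye. As a small sketch, one row can be pulled out and sorted, here the correlations of every variable with price; the name price_cor is illustrative.

```r
data(SaratogaHouses, package = "mosaicData")
df <- dplyr::select_if(SaratogaHouses, is.numeric)
r <- cor(df, use = "complete.obs")

# correlations of every variable with price, strongest first
price_cor <- sort(r["price", ], decreasing = TRUE)
round(price_cor, 2)

# the first entry is price itself, so the second is the
# variable most strongly positively correlated with price
names(price_cor)[2]
```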

library(ggplot2)
library(ggcorrplot)
ggcorrplot(r, colors = c("blue", "white", "red")) # Plot the correlation

We can also customize the output.

hc.order = TRUE reorders the variables, placing variables with similar correlation patterns together.
type = "lower" plots the lower portion of the correlation matrix.
lab = TRUE overlays the correlation coefficients (as text) on the plot.

ggcorrplot(r, 
           hc.order = TRUE, 
           type = "lower",
           lab = TRUE)