Business Intelligence

Agenda

Histogram
Bar plot
Pie chart
Box Plot
Scatter plot
Introduction to ggplot2 Package (with Scatter Plot)
Bar Chart with ggplot
Tree Map with ggplot
Correlation Plot with ggplot

Load data

We will use the same House Price dataset. You can go to Canvas - Dataset Module to download the following data.

df<-read.table(file="HousePrices.csv", 
               sep=",", header=TRUE, stringsAsFactors=FALSE)

class(df) # R will convert the file to a data frame

## [1] "data.frame"

Histogram of a numeric variable

A histogram takes a numeric vector as input and groups its values into bins to show how the numbers are distributed.

hist(df$price)

hist(df$price, breaks=40) # breaks suggests the number of bins; R may adjust it to get round cut points
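
To see what the breaks argument actually did, you can ask hist() for the bin information without drawing anything. The sketch below uses made-up, right-skewed numbers in a vector called prices (an assumption for illustration, not the course data).

```r
# Synthetic, right-skewed "prices" (an assumption; not HousePrices.csv)
set.seed(1)
prices <- round(rlnorm(500, meanlog = 12, sdlog = 0.4))

# plot = FALSE returns the bin information instead of drawing the histogram
h <- hist(prices, breaks = 40, plot = FALSE)

length(h$counts)   # number of bins R actually used
head(h$breaks)     # the first few bin boundaries
sum(h$counts)      # every observation falls in exactly one bin
```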

# to learn more about these additional arguments, just type ?hist
hist(df$price, col="green", xlim=c(50000, 800000),
     xlab="Price of house", ylab="Count", 
     main="Distribution of house price")

Bigger font size:

hist(df$price, col="plum", xlim=c(50000, 800000),
     xlab="Price of house", ylab="Count", 
     main="Distribution of house price",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5)

Name of colors in R: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

More fancy colors: https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/

hist(df$price, col=cm.colors(10), xlim=c(50000, 800000),
     xlab="Price of house", ylab="Count", 
     main="Distribution of house price")

Save your plots
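
One way to save a plot from code is to open a file device, draw, and close the device. The sketch below is only an illustration: the file name price_hist.png and the random data are placeholders, and in RStudio you may prefer the Export button in the Plots pane.

```r
# Hypothetical output file in a temporary folder; replace with your own path
out_file <- file.path(tempdir(), "price_hist.png")

set.seed(1)
x <- rnorm(200)                            # placeholder data, not HousePrices.csv

png(out_file, width = 800, height = 600)   # open a PNG graphics device
hist(x, col = "plum", main = "Example histogram")
dev.off()                                  # close the device so the file is written

file.exists(out_file)                      # check that the file was created
```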

Your turn

Create the following graph with default number of bins

Do you think this is a good visualization? Re-generate the graph to show the distribution of the majority of houses. Add your favorite color to the chart.

Frequency of a categorical variable

table(df$air_cond)

## 
##   No  Yes 
## 1093  635

table(df$heat)

## 
##  Electric   Hot Air Hot Water 
##       305      1121       302

table(df$fuel)

## 
## Electric      Gas      Oil 
##      315     1197      216
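
Counts are often easier to compare as shares. A minimal sketch, assuming a made-up vector called fuel in place of df$fuel:

```r
# Toy vector standing in for a column such as df$fuel (values are made up)
fuel <- c("Gas", "Gas", "Oil", "Electric", "Gas", "Oil", "Gas", "Gas")

counts <- table(fuel)          # frequency of each category
shares <- prop.table(counts)   # the same table as proportions that sum to 1

counts
round(shares * 100, 1)         # percentages
```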

Bar plot

In a bar plot, the x axis is for categorical values while in a histogram, the x axis is for numerical values.

distribution <- table(df$heat)
distribution

## 
##  Electric   Hot Air Hot Water 
##       305      1121       302

barplot(distribution)

Pie chart for a string/factor variable

distribution <- table(df$fuel)
pie(distribution)

Customize your pie chart

distribution <- table(df$fuel)
pct <- round(distribution/nrow(df)*100)             # calculate percentages
lbls <- paste0(names(distribution), " ", pct, "%")  # add percents to labels
pie(distribution, col=c("steelblue4", "grey", "grey"),
    labels=lbls)

Your turn

Show a histogram of lot_size
Show a bar plot of air_cond
Show a pie chart of construction

Customize your plots to enhance the visualization.

Box Plot

boxplot(df$lot_size)

Hide outliers in Box Plot

boxplot(df$lot_size,outline=FALSE,
        ylab="Lot size of house", main="Box plot of house lot size")
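
Note that outline=FALSE only stops the outliers from being drawn; the data are unchanged. To see which values a box plot treats as outliers, boxplot.stats() returns the same statistics. The sketch below uses a small made-up vector x rather than lot_size.

```r
# Small made-up sample with one extreme value
x <- c(1:10, 100)

s <- boxplot.stats(x)   # statistics used by boxplot(), default coef = 1.5
s$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out                   # points that would be drawn individually as outliers

boxplot(x, outline = FALSE)  # outliers are hidden in the plot, but still in x
length(x)                    # the data still contain all 11 values
```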

Detour: formula

The template for a formula in R is as follows:

outcome ~ predictor_1 + predictor_2 + ...

A formula usually has two parts: 1) one outcome variable on the left and 2) a set of predictors on the right
The two parts are separated by a tilde sign (~)
Formulas are frequently used in regression analyses and data mining. But you can also use formulas in some basic graphics.
Essentially, you use a formula to tell R that you are interested in the relationship between the outcome variable and the predictors
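
A short sketch of the formula notation, assuming the built-in mtcars data set as a stand-in for the house price data:

```r
# mtcars ships with R; it stands in for the house price data here
f <- mpg ~ wt + hp      # outcome ~ predictor_1 + predictor_2

class(f)                # a formula is its own kind of object
all.vars(f)             # variable names, outcome first

# The same notation works in basic graphics:
boxplot(mpg ~ cyl, data = mtcars)   # mpg split by number of cylinders
plot(mpg ~ wt, data = mtcars)       # scatter plot, y ~ x
```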

Scatter plot

plot(price ~ lot_size + bathrooms + rooms + age, data=df)  # y_axis ~ x_axis; draws one scatter plot per predictor

ifelse()

ifelse(condition, do_this_if_true, do_this_if_false)

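ifelse() is a vectorized function: it checks the condition for every element and returns the first value where the condition is TRUE and the second value where it is FALSE.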

Example:

a <- 2 
ifelse(a > 0, "a is positive", "a is not positive")

## [1] "a is positive"

color <- ifelse(a > 0, "red", "blue")
print(color)

## [1] "red"
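
The examples above use a single number, but ifelse() is most useful on whole vectors, which is how it is used for plotting. A minimal sketch with a made-up vector x:

```r
x <- c(-2, 0, 3, 7)

# The condition is evaluated element by element
sign_label <- ifelse(x > 0, "positive", "not positive")
sign_label

# Typical plotting use: one color per observation
point_col <- ifelse(x > 0, "red", "blue")
point_col
```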

Make plots prettier & more informative

plot(price ~ lot_size, data=df, 
     pch=ifelse(df$air_cond=="Yes", 0, 1), # pch: symbols for the points
     col=ifelse(df$air_cond=="Yes", "red", "blue")) # col: colors of the points
legend("topleft", c("w/ aircon", "w/o aircon"), 
       pch=c(0, 1), col=c("red", "blue"))

References for pch

R plot pch symbol chart

Alternatively, you can just google “R pch symbols”

col : color (code or name) to use for the points
bg : the background (or fill) color for the open plot symbols. It can be used only when pch = 21:25.
cex : the size of pch symbols
lwd : the line width for the plotting symbols
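
The arguments listed above can be combined in one call. The sketch below is an illustration on the built-in mtcars data (a stand-in for the house data); the colors are arbitrary choices.

```r
# mtcars stands in for the house data; am is 0 (automatic) or 1 (manual)
fill <- ifelse(mtcars$am == 1, "tomato", "steelblue")

plot(mpg ~ wt, data = mtcars,
     pch = 21,        # filled circle: one of the symbols 21:25 that accept bg
     col = "black",   # border color of each symbol
     bg  = fill,      # fill color, one value per car
     cex = 1.5,       # symbols 1.5 times the default size
     lwd = 2)         # thicker symbol border
legend("topright", c("manual", "automatic"),
       pch = 21, pt.bg = c("tomato", "steelblue"), pt.cex = 1.5)
```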

Your turn

Create a scatter plot with living_area on the x axis and price on the y axis. Color the dots in the plot according to whether the house has a fireplace

Introduction to ggplot2 Package

ggplot2 is a powerful package for more advanced visualization.

The functions in the ggplot2 package build up a graph in layers: start with a simple graph, then add elements/functions with +, one at a time.

Load data

We will first load a dataset from the mosaicData package using the following code, then load the ggplot2 library.

install.packages('mosaicData')

library('mosaicData')
data(CPS85, package = "mosaicData")
# load library 
library(ggplot2)
head(CPS85,5)

##   wage educ race sex hispanic south married exper union age   sector
## 1  9.0   10    W   M       NH    NS Married    27   Not  43    const
## 2  5.5   12    W   M       NH    NS Married    20   Not  38    sales
## 3  3.8   12    W   F       NH    NS  Single     4   Not  22    sales
## 4 10.5   12    W   F       NH    NS Married    29   Not  47 clerical
## 5 15.0   12    W   M       NH    NS Married    40 Union  58    const

ggplot function

The first function in building a graph is the ggplot function. It specifies

the data frame containing the data to be plotted
the mapping of the variables to visual properties of the graph (given inside aes, where aes stands for aesthetics)

   # specify dataset and mapping, 
    ggplot(data = CPS85,
           mapping = aes(x = exper, y = wage)) # specify x-axis and y axis

geoms (geometric objects)

Geoms: geometric objects (points, lines, bars, etc.) of a graph. They are added using functions that start with geom_. In this example, geom_point function is used to create a scatterplot. In ggplot2 graphs, functions are chained together using the + sign to build a final plot.

    ggplot(data = CPS85,
           mapping = aes(x = exper, y = wage)) + 
    geom_point() #This is how we add datapoints to the plot
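
A plot built this way is an ordinary R object, so the layering can be inspected. The sketch below assumes the built-in mtcars data as a stand-in for CPS85; the names p and p2 are only for illustration.

```r
library(ggplot2)

# mtcars ships with R and stands in for CPS85 here
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))
length(p$layers)        # no geoms yet, so printing p shows only empty axes

p2 <- p + geom_point()  # + returns a new plot with one more layer
length(p2$layers)

print(p2)               # printing the object draws it
```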

Delete outlier

The graph indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case before continuing.

library(dplyr)
plotdata <- filter(CPS85, wage < 40) #filter out the outlier

# redraw scatterplot
ggplot(data = plotdata, # Then we used the updated dataset 
       mapping = aes(x = exper, y = wage)) +
  geom_point()

Parameter/Options of geom_ Function

Options for the geom_point function include: 1) Point color: color 2) Point size: size 3) Transparency: alpha, ranging from 0 (completely transparent) to 1 (completely opaque)

# make points blue, larger, and semi-transparent
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
geom_point(color = "cornflowerblue",  #Color
             alpha = .7, #Transparency level
             size = 3) #Point Size

Add a Line of Best Fit

Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control

the type of line (linear, quadratic, nonparametric)
the thickness of the line
the line’s color
the presence or absence of a confidence interval

Here we request a linear regression (method = lm) line (where lm stands for linear model).

    # add a line of best fit.
    ggplot(data = plotdata,
           mapping = aes(x = exper, y = wage)) +
      geom_point(color = "cornflowerblue",alpha = .7,size = 3) +
      geom_smooth(method = "lm") #added another layer of smooth line of best fit and specified linear regression method

## `geom_smooth()` using formula = 'y ~ x'
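
The straight line drawn by method = "lm" is the one lm() would fit to the same x and y. A minimal sketch, assuming the built-in mtcars data instead of the wage data:

```r
# mtcars stands in for the wage data; the idea is the same
fit <- lm(mpg ~ wt, data = mtcars)   # the model behind method = "lm"
coef(fit)                            # intercept and slope of the straight line

library(ggplot2)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # stating the formula avoids the message
print(p)
```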

Grouping

In addition to the x and y axes, variables can be mapped to the color, shape, size, transparency, and other visual properties of the points. This allows groups of observations to be superimposed in a single graph.

# indicate sex using color
ggplot(data = plotdata, mapping = aes(x = exper, y = wage, color = sex)) + 
  #using sex variable to differentiate colors
      geom_point(alpha = .7, size = 3) +
      geom_smooth(method = "lm", se = FALSE, linewidth = 1.5)

It appears that men tend to make more money than women. Additionally, there may be a stronger relationship between experience and wages for men than for women.
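
It helps to distinguish mapping a variable to color (inside aes) from setting a fixed color (outside aes). The sketch below assumes the built-in mtcars data, with factor(cyl) playing the role of the sex variable.

```r
library(ggplot2)

# mtcars stands in for the wage data; factor(cyl) plays the role of sex
# Mapping: color inside aes() varies with a variable and produces a legend
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3)

# Setting: color outside aes() is one fixed value for every point
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "cornflowerblue", size = 3)

print(p_mapped)
print(p_fixed)
```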

Your turn

To explore the relationship between age and wage, create the following graph, with age on the x axis and wage on the y axis. Group the observations by marital status (color the points using the married variable).

Scales

Scales control how variables are mapped to the visual properties of the plot. Scale functions (which start with scale_) allow you to modify this mapping. In the next plot, we’ll change the x and y axis scaling, and the colors employed.

# modify the x and y axes and specify the colors to be used
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage,color = sex)) +
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
  scale_x_continuous(breaks = seq(0, 60, 10)) + # tick marks on the x axis
  scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) + # dollar-formatted tick labels
  scale_color_manual(values = c("indianred3", "cornflowerblue")) 
# colors for the two sex groups (points and trend lines)
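
The arguments passed to the scale functions are ordinary R values, so you can look at them on their own. A small sketch (the names y_breaks and y_labels are only for illustration):

```r
# Break positions are ordinary numeric vectors
seq(0, 60, 10)   # 0 10 20 30 40 50 60
seq(0, 30, 5)    # 0  5 10 15 20 25 30

# scales::dollar turns numbers into currency strings for the tick labels
# (the scales package is installed together with ggplot2)
y_breaks <- seq(0, 30, 5)
y_labels <- scales::dollar(y_breaks)
y_labels
```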

Facets

Facets reproduce a graph for each level of a given variable (or combination of variables). Facet functions start with facet_. Here, facets are defined by the eight levels of the sector variable.

ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage,color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
  scale_color_manual(values = c("indianred3","cornflowerblue")) +
  facet_wrap(~sector) #This is the facet_ function, how we produce a graph for each sector
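
The number of panels equals the number of distinct values of the faceting variable. A minimal sketch, assuming the built-in mtcars data as a stand-in for CPS85:

```r
library(ggplot2)

# mtcars stands in for CPS85; cyl has three distinct values
length(unique(mtcars$cyl))       # number of panels facet_wrap(~cyl) will draw

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl, ncol = 3)     # ncol controls how the panels are arranged
print(p)
```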

Labels

The labs function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.

# add informative labels
ggplot(data = plotdata, mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
  scale_color_manual(values = c("indianred3","cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender")

Bar Chart

The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart, or (less commonly) a tree map.

The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama. Below, a bar chart is used to display the distribution of wedding participants by race.

library(ggplot2)
data(Marriage, package = "mosaicData")

ggplot(Marriage, aes(x = race)) + 
  geom_bar(fill = "cornflowerblue", 
           color="black") +
  labs(x = "Race", 
       y = "Frequency", 
       title = "Participants by race")

Bar Chart when the y axis shows another statistic

library(ggplot2)
library(tidyverse) # provides read_csv, %>%, group_by and summarise
df <- read_csv("HousePrices.csv")

ImageData <- df %>%
  group_by(air_cond) %>%
  summarise(n_house = n(), avg_price = mean(price))
ImageData

## # A tibble: 2 × 3
##   air_cond n_house avg_price
##   <chr>      <int>     <dbl>
## 1 No          1093   187022.
## 2 Yes          635   254904.

ggplot(ImageData, aes(air_cond, avg_price)) + 
  geom_col(colour = "cornflowerblue", fill = "cornflowerblue") +
  labs(x = "House has air conditioning", y = "Average Price", 
       title = "Price by AC") +
  coord_cartesian(ylim = c(150000, 270000)) # zoom the y axis; the bars no longer start at 0
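
The group-wise summary that feeds this chart can also be computed in base R. The sketch below uses a made-up data frame called houses (not HousePrices.csv) so that the arithmetic is easy to check by hand.

```r
# Made-up mini data set standing in for HousePrices.csv
houses <- data.frame(
  air_cond = c("No", "Yes", "No", "Yes", "No"),
  price    = c(100, 300, 200, 500, 150)
)

# Average price within each air_cond group (base R)
avg_price <- tapply(houses$price, houses$air_cond, mean)
avg_price

# Number of houses per group, the counterpart of n()
n_house <- table(houses$air_cond)
n_house

barplot(avg_price, ylab = "Average price")
```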

Tree Map

A tree map is an alternative to a pie chart, and it can also handle categorical variables with many levels.

#install.packages("treemapify")
library(treemapify)
data(Marriage, package = "mosaicData")

# create a treemap of marriage officials
plotdata <- Marriage %>%
  count(officialTitle)
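
count() returns one row per category with the frequency in a column named n, which is why the tree map uses area = n. A minimal sketch with a made-up data frame called toy (not the Marriage data):

```r
library(dplyr)

# Made-up titles standing in for Marriage$officialTitle
toy <- data.frame(title = c("PASTOR", "JUDGE", "PASTOR", "BISHOP", "PASTOR", "JUDGE"))

counts <- toy %>% count(title)   # one row per title, frequency in column n
counts
```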

Tree Map

ggplot(plotdata, aes(fill = officialTitle, 
           area = n)) +
  geom_treemap() + 
  labs(title = "Marriages by officiate")

Tree Map with Labels

We can also label each area with its category by adding geom_treemap_text.

ggplot(plotdata, 
       aes(fill = officialTitle, area = n, label = officialTitle)) + 
       geom_treemap() + 
       geom_treemap_text(colour = "white",place = "centre") +
       labs(title = "Marriages by officiate")

Correlation Plot

Correlation plots help you visualize the pairwise relationships between a set of quantitative variables by displaying their correlations using color or shading. We will first calculate the correlations.

#Load data
data(SaratogaHouses, package="mosaicData")
# select numeric variables
df <- dplyr::select_if(SaratogaHouses, is.numeric)
# calculate the correlations
r <- cor(df, use="complete.obs")
round(r,2)

##            price lotSize   age landValue livingArea pctCollege bedrooms
## price       1.00    0.16 -0.19      0.58       0.71       0.20     0.40
## lotSize     0.16    1.00 -0.02      0.06       0.16      -0.03     0.11
## age        -0.19   -0.02  1.00     -0.02      -0.17      -0.04     0.03
## landValue   0.58    0.06 -0.02      1.00       0.42       0.23     0.20
## livingArea  0.71    0.16 -0.17      0.42       1.00       0.21     0.66
## pctCollege  0.20   -0.03 -0.04      0.23       0.21       1.00     0.16
## bedrooms    0.40    0.11  0.03      0.20       0.66       0.16     1.00
## fireplaces  0.38    0.09 -0.17      0.21       0.47       0.25     0.28
## bathrooms   0.60    0.08 -0.36      0.30       0.72       0.18     0.46
## rooms       0.53    0.14 -0.08      0.30       0.73       0.16     0.67
##            fireplaces bathrooms rooms
## price            0.38      0.60  0.53
## lotSize          0.09      0.08  0.14
## age             -0.17     -0.36 -0.08
## landValue        0.21      0.30  0.30
## livingArea       0.47      0.72  0.73
## pctCollege       0.25      0.18  0.16
## bedrooms         0.28      0.46  0.67
## fireplaces       1.00      0.44  0.32
## bathrooms        0.44      1.00  0.52
## rooms            0.32      0.52  1.00
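
Two properties of the matrix above are worth noticing: it is symmetric, and its diagonal is all 1s. That is why plotting only the lower triangle loses no information. A small sketch, assuming three columns of the built-in mtcars data as a stand-in:

```r
# Three numeric columns from mtcars stand in for the housing variables
r_small <- cor(mtcars[, c("mpg", "wt", "hp")], use = "complete.obs")
round(r_small, 2)

isSymmetric(r_small)   # cor(x, y) equals cor(y, x)
diag(r_small)          # each variable correlates perfectly with itself

# The lower triangle holds every distinct pair exactly once
r_small[lower.tri(r_small)]
```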

Correlation Plot

library(ggplot2)
library(ggcorrplot)
ggcorrplot(r, colors = c("blue", "white", "red")) # Plot the correlation

We can also customize the output.

hc.order = TRUE reorders the variables, placing variables with similar correlation patterns together.
type = "lower" plots the lower portion of the correlation matrix.
lab = TRUE overlays the correlation coefficients (as text) on the plot.

ggcorrplot(r, 
           hc.order = TRUE, 
           type = "lower",
           lab = TRUE)