- Histogram
- Bar plot
- Pie chart
- Box Plot
- Scatter plot
- Introduction to ggplot2 Package (with Scatter Plot)
- Bar Chart with ggplot
- Tree Map with ggplot
- Correlation Plot with ggplot
We will use the same House Price dataset. You can go to Canvas - Dataset Module to download the following data.
df<-read.table(file="HousePrices.csv",
sep=",", header=TRUE, stringsAsFactors=FALSE)
class(df) # R will convert the file to a data frame
## [1] "data.frame"
A histogram takes a numeric vector as input and groups the values into bins to visualize their distribution.
hist(df$price)
hist(df$price, breaks=40) # adjust the size of each bin via breaks
# to learn more about these additional arguments, just type ?hist
hist(df$price, col="green", xlim=c(50000, 800000),
xlab="Price of house", ylab="Count",
main="Distribution of house price")
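To see what the binning does, here is a minimal sketch. It uses simulated prices rather than HousePrices.csv (the prices vector and its distribution are assumptions for illustration only), and plot = FALSE so that hist returns the bins instead of drawing them.

```r
# Simulated right-skewed "prices" (an assumption; not the HousePrices data)
set.seed(1)
prices <- round(rlnorm(500, meanlog = 12, sdlog = 0.4))

# plot = FALSE returns the binning without drawing a chart
h_default <- hist(prices, plot = FALSE)
h_fine    <- hist(prices, breaks = 40, plot = FALSE)

length(h_default$counts)  # number of bins chosen by default
length(h_fine$counts)     # more bins; breaks is treated as a suggestion
sum(h_fine$counts)        # every observation falls into exactly one bin
```

Note that a single number passed to breaks is only a suggestion: R rounds the bin edges to "pretty" values, so the actual number of bins may differ from 40.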
Bigger font size:
hist(df$price, col="plum", xlim=c(50000, 800000),
xlab="Price of house", ylab="Count",
main="Distribution of house price",
cex.lab=1.5, cex.axis=1.5, cex.main=1.5)
Name of colors in R: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
More fancy colors: https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/
hist(df$price, col=cm.colors(10), xlim=c(50000, 800000),
xlab="Price of house", ylab="Count",
main="Distribution of house price")
Create the following graph with the default number of bins.
Do you think this is a good visualization? Regenerate the graph to show the distribution of the majority of the data. Add your favorite color to the chart.
table(df$air_cond)
##
##   No  Yes
## 1093  635
table(df$heat)
##
##  Electric   Hot Air Hot Water
##       305      1121       302
table(df$fuel)
##
## Electric      Gas      Oil
##      315     1197      216
In a bar plot, the x axis is for categorical values while in a histogram, the x axis is for numerical values.
distribution <- table(df$heat)
distribution
##
##  Electric   Hot Air Hot Water
##       305      1121       302
barplot(distribution)
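As a small illustration of the table-then-barplot pattern, the sketch below uses a made-up vector of heating types (the heat vector is hypothetical, not the course data) and sorts the counts so that the tallest bar comes first.

```r
# Made-up categorical values (an assumption for illustration)
heat <- c("Hot Air", "Electric", "Hot Air", "Hot Water", "Hot Air", "Electric")

counts <- table(heat)                           # count each category
counts_sorted <- sort(counts, decreasing = TRUE)  # largest category first

barplot(counts_sorted, col = "steelblue",
        ylab = "Count", main = "Heating type (made-up data)")
```

Sorting is optional, but it often makes a bar plot easier to read when the categories have no natural order.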
distribution <- table(df$fuel)
pie(distribution)
distribution <- table(df$fuel)
pct <- round(distribution/nrow(df)*100) # calculate percentages
lbls <- paste(names(distribution), pct, "%") # add percentages to labels
pie(distribution, col=c("steelblue4", "grey", "grey"),
labels=lbls)
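The label-building step can be tried without the CSV file. The sketch below types in the fuel counts shown in the table output above and uses prop.table and paste0, which are one possible alternative to the division and paste calls, not the only way to do it.

```r
# Fuel counts copied from the table(df$fuel) output above
fuel_counts <- c(Electric = 315, Gas = 1197, Oil = 216)

share <- prop.table(fuel_counts)   # proportions that sum to 1
pct   <- round(share * 100, 1)     # percentages with one decimal place
lbls  <- paste0(names(fuel_counts), " ", pct, "%")  # e.g. "Oil 12.5%"

pie(fuel_counts, labels = lbls,
    col = c("steelblue4", "grey", "grey"))
```

Keeping one decimal place is a design choice here: percentages rounded to whole numbers do not always add up to exactly 100.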
Show a histogram of lot_size
Show a bar plot of air_cond
Show a pie chart of construction
Customize your plots to enhance the visualization.
boxplot(df$lot_size)
boxplot(df$lot_size,outline=FALSE,
ylab="Lot size of house", main="Box plot of house lot size")
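To see which points outline=FALSE hides, the following sketch uses boxplot.stats on a made-up vector (x is hypothetical). With the default settings, points more than 1.5 times the box length beyond the box are treated as outliers.

```r
# Made-up values with one extreme point (an assumption for illustration)
x <- c(1:10, 50)

stats <- boxplot.stats(x)  # the same calculation boxplot() relies on
stats$stats                # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                  # points beyond the whiskers

boxplot(x, outline = FALSE,
        main = "Box plot without outliers (made-up data)")
```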
The template for a formula in R is as follows:
outcome ~ predictor_1 + predictor_2 + ...
A formula usually has two parts: 1) one outcome variable on the left and 2) a set of predictors on the right
The two parts are separated by a tilde sign (~)
Formulas are frequently used in regression analyses and data mining. But you can also use formulas in some basic graphics.
Essentially, you use a formula to tell R that you are interested in the relationship between the outcome variable and the predictors
plot(price ~ lot_size + bathrooms + rooms + age, data=df) # y_axis ~ x_axis
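A formula is an ordinary R object, so it can be stored and inspected. The sketch below uses the built-in mtcars dataset instead of the house data (an assumption, so that it runs without the CSV file).

```r
# A formula can be saved in a variable like any other object
f <- mpg ~ wt + hp
class(f)      # "formula"
all.vars(f)   # the variable names used in the formula

# Numeric predictor: scatter plot of outcome against predictor
plot(mpg ~ wt, data = mtcars)

# Grouping variable on the right-hand side: one box per group
boxplot(mpg ~ cyl, data = mtcars)
```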
ifelse(condition, do_this_if_true, do_this_if_false)
ifelse() is a function. Example:
a <- 2
ifelse(a > 0, "a is positive", "a is negative")
## [1] "a is positive"
color <- ifelse(a > 0, "red", "blue")
print(color)
## [1] "red"
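ifelse() also works element by element on a whole vector, which is what makes it useful for choosing a color or symbol for each point. A minimal sketch with a made-up air_cond vector (hypothetical values, not the course data):

```r
# Made-up values standing in for a column such as df$air_cond
air_cond <- c("Yes", "No", "No", "Yes")

# One result per element: "red" where the condition is TRUE, "blue" otherwise
point_col <- ifelse(air_cond == "Yes", "red", "blue")
point_col
```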
plot(price ~ lot_size, data=df,
pch=ifelse(df$air_cond=="Yes", 0, 1), # pch: symbols for the points
col=ifelse(df$air_cond=="Yes", "red", "blue")) # col: colors of the points
legend("topleft", c("w/ aircon", "w/o aircon"),
pch=c(0, 1), col=c("red", "blue"))
To see the available point symbols, type ?points or search online for “R pch symbols”.
Create a scatter plot with living_area on the x axis and price on the y axis. Color the dots in the plot according to whether the house has a fireplace.
ggplot2 is a powerful package for more advanced visualization.
The functions in the ggplot2 package build up a graph in layers: start with a simple graph, then add elements with +, one at a time.
We will first load a dataset from the mosaicData package using the following code, then load the ggplot2 library.
install.packages('mosaicData')
library('mosaicData')
data(CPS85, package = "mosaicData")
# load library
library(ggplot2)
head(CPS85,5)
##   wage educ race sex hispanic south married exper union age   sector
## 1  9.0   10    W   M       NH    NS Married    27   Not  43    const
## 2  5.5   12    W   M       NH    NS Married    20   Not  38    sales
## 3  3.8   12    W   F       NH    NS  Single     4   Not  22    sales
## 4 10.5   12    W   F       NH    NS Married    29   Not  47 clerical
## 5 15.0   12    W   M       NH    NS Married    40 Union  58    const
The first function in building a graph is the ggplot function. It specifies:
- the data frame containing the data to be plotted
- the mapping of the variables to visual properties of the graph (given inside aes, which stands for aesthetics)
# specify dataset and mapping,
ggplot(data = CPS85,
mapping = aes(x = exper, y = wage)) # specify x-axis and y axis
Geoms: geometric objects (points, lines, bars, etc.) of a graph. They are added using functions that start with geom_. In this example, geom_point function is used to create a scatterplot. In ggplot2 graphs, functions are chained together using the + sign to build a final plot.
ggplot(data = CPS85,
mapping = aes(x = exper, y = wage)) +
geom_point() #This is how we add datapoints to the plot
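The layering idea can also be seen by storing the plot in a variable and adding one piece at a time. This is a sketch using the built-in mtcars dataset (an assumption, so that it runs without mosaicData); the variable names p and n_points are only for illustration.

```r
library(ggplot2)

# Step 1: data and mapping only (draws empty axes, no points yet)
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Step 2: add a layer of points
p <- p + geom_point()

# Step 3: add a second layer, a fitted straight line
p <- p + geom_smooth(method = "lm", formula = y ~ x)

print(p)

# layer_data() returns the data ggplot2 computed for a layer;
# the point layer has one row per observation
n_points <- nrow(layer_data(p, 1))
n_points
```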
The graph indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case before continuing.
library(dplyr)
plotdata <- filter(CPS85, wage < 40) # filter out the outlier
# redraw scatterplot
ggplot(data = plotdata, # Then we used the updated dataset
mapping = aes(x = exper, y = wage)) +
geom_point()
Options (parameters) for the geom_point function include:
- color: point color
- size: point size
- alpha: transparency, ranging from 0 (completely transparent) to 1 (completely opaque)
# make points blue, larger, and semi-transparent
ggplot(data = plotdata,
mapping = aes(x = exper, y = wage)) +
geom_point(color = "cornflowerblue", #Color
alpha = .7, #Transparency level
size = 3) #Point Size
Next, let’s add a line of best fit. We can do this with the geom_smooth function. Options control:
- the type of line (linear, quadratic, nonparametric)
- the thickness of the line
- the line’s color
- the presence or absence of a confidence interval
Here we request a linear regression (method = "lm") line (where lm stands for linear model).
# add a line of best fit.
ggplot(data = plotdata,
mapping = aes(x = exper, y = wage)) +
geom_point(color = "cornflowerblue",alpha = .7,size = 3) +
geom_smooth(method = "lm") #added another layer of smooth line of best fit and specified linear regression method
## `geom_smooth()` using formula = 'y ~ x'
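The message above shows that the line is fitted with the formula y ~ x. As a rough sketch of what that means, the same kind of straight line can be fitted directly with lm. The example uses the built-in mtcars dataset (an assumption for illustration) rather than CPS85.

```r
# Fit a straight line: outcome mpg, single predictor wt
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                   # intercept and slope of the fitted line
slope <- coef(fit)[["wt"]]  # change in mpg per unit increase in wt

# Base-graphics version of points plus line of best fit
plot(mpg ~ wt, data = mtcars)
abline(fit, col = "blue")
```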
Variables can be mapped to the color, shape, size, transparency, etc. This allows groups of observations to be superimposed in a single graph.
# indicate sex using color
ggplot(data = plotdata, mapping = aes(x = exper, y = wage, color = sex)) +
#using sex variable to differentiate colors
geom_point(alpha = .7, size = 3) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.5)
It appears that men tend to make more money than women. Additionally, there may be a stronger relationship between experience and wages for men than for women.
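It helps to separate two cases: setting a color as a fixed option of a geom, and mapping a variable to color inside aes. The sketch below contrasts them on the built-in mtcars dataset (an assumption for illustration); p_set and p_map are hypothetical names.

```r
library(ggplot2)

# Setting: color is a constant given outside aes(), so every point is the same
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "cornflowerblue")

# Mapping: color depends on a variable inside aes(), so each group gets its own
p_map <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Count the distinct colors actually used in each plot
n_set <- length(unique(layer_data(p_set)$colour))
n_map <- length(unique(layer_data(p_map)$colour))
c(n_set, n_map)
```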
To explore the relationship between age and income, create the following graph with age on the x axis and wage on the y axis. Group the observations by marital status (set the color of the points based on the married variable).
Scales control how variables are mapped to the visual characteristics of the plot. Scale functions (which start with scale_) allow you to modify this mapping. In the next plot, we’ll change the x and y axis scaling and the colors used.
# modify the x and y axes and specify the colors to be used
ggplot(data = plotdata,
mapping = aes(x = exper, y = wage,color = sex)) +
geom_point(alpha = .7, size = 3) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.5) +
scale_x_continuous(breaks = seq(0, 60, 10)) + #adjust the scales of axes
scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
scale_color_manual(values = c("indianred3", "cornflowerblue"))
# specify different colors of the trend lines
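The breaks and label arguments are ordinary R values, so they can be checked on their own before being used in a plot. A minimal sketch, assuming the scales package is installed (it is installed together with ggplot2); the variable names are only for illustration.

```r
# Tick positions: from 0 to 60 in steps of 10, and from 0 to 30 in steps of 5
x_breaks <- seq(0, 60, 10)
y_breaks <- seq(0, 30, 5)

# scales::dollar turns numbers into dollar-formatted text for axis labels
y_labels <- scales::dollar(y_breaks)

x_breaks
y_labels
```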
Facets reproduce a graph for each level of a given variable (or combination of variables). Facet functions start with facet_. Here, facets are defined by the eight levels of the sector variable.
ggplot(data = plotdata,
mapping = aes(x = exper, y = wage,color = sex)) +
geom_point(alpha = .7) +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(breaks = seq(0, 60, 10)) +
scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
scale_color_manual(values = c("indianred3","cornflowerblue")) +
facet_wrap(~sector) #This is the facet_ function, how we produce a graph for each sector
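As a quick check of the "one graph per level" idea, the sketch below facets the built-in mtcars dataset by cyl (an assumption for illustration) and compares the number of panels with the number of distinct values.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl)   # one panel per distinct value of cyl

print(p)

# Each row of the layer data records which panel it is drawn in
n_panels <- length(unique(layer_data(p)$PANEL))
n_levels <- length(unique(mtcars$cyl))
c(n_panels, n_levels)
```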
The labs function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.
# add informative labels
ggplot(data = plotdata,mapping = aes(x = exper, y = wage,color = sex)) +
geom_point(alpha = .7) +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(breaks = seq(0, 60, 10)) +
scale_y_continuous(breaks = seq(0, 30, 5), labels = scales::dollar) +
scale_color_manual(values = c("indianred3","cornflowerblue")) +
facet_wrap(~sector) +
labs(title = "Relationship between wages and experience",
subtitle = "Current Population Survey",
caption = "source: http://mosaic-web.org/",
x = "Years of Experience",
y = "Hourly Wage",
color = "Gender")
The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart, or (less commonly) a tree map.
The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama. Below, a bar chart is used to display the distribution of wedding participants by race.
library(ggplot2)
data(Marriage, package = "mosaicData")
ggplot(Marriage, aes(x = race)) +
geom_bar(fill = "cornflowerblue",
color="black") +
labs(x = "Race",
y = "Frequency",
title = "Participants by race")
library(ggplot2)
library(tidyverse)
df = read_csv("HousePrices.csv")
ImageData <- df %>%
  group_by(air_cond) %>%
  summarise(n_house = n(), avg_price = mean(price))
ImageData
## # A tibble: 2 × 3
##   air_cond n_house avg_price
##   <chr>      <int>     <dbl>
## 1 No          1093   187022.
## 2 Yes          635   254904.
ggplot(ImageData, aes(air_cond, avg_price)) +
geom_col(colour = "cornflowerblue",fill = "cornflowerblue") +
labs(x = "Whether house has AC", y = "Average Price",
title = "Price by AC")+
coord_cartesian(ylim=c(150000,270000))# set y limits
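Unlike geom_bar, which counts rows itself, geom_col plots values that are already in the data, so the group summary has to be computed first. The same kind of summary can be sketched in base R; the example below uses the built-in mtcars dataset with am as the grouping variable (an assumption for illustration), and summary_df is a hypothetical name.

```r
# Number of rows and average mpg for each value of am
n_cars  <- table(mtcars$am)
avg_mpg <- tapply(mtcars$mpg, mtcars$am, mean)

# Combine into one small data frame, one row per group
summary_df <- data.frame(
  am      = names(avg_mpg),
  n_cars  = as.vector(n_cars),
  avg_mpg = as.vector(avg_mpg)
)
summary_df
```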
A tree map is an alternative to a pie chart, and it can also handle categorical variables with many levels.
#install.packages("treemapify")
library(treemapify)
data(Marriage, package = "mosaicData")
# create a treemap of marriage officials
plotdata <- Marriage %>%
count(officialTitle)
ggplot(plotdata, aes(fill = officialTitle,
area = n)) +
geom_treemap() +
labs(title = "Marriages by officiate")
We can also label each area with its category by adding geom_treemap_text:
ggplot(plotdata,
aes(fill = officialTitle, area = n, label = officialTitle)) +
geom_treemap() +
geom_treemap_text(colour = "white",place = "centre") +
labs(title = "Marriages by officiate")
Correlation plots help you visualize the pairwise relationships between a set of quantitative variables by displaying their correlations using color or shading. We will first calculate the correlations.
# Load data
data(SaratogaHouses, package="mosaicData")
# select numeric variables
df <- dplyr::select_if(SaratogaHouses, is.numeric)
# calculate the correlations
r <- cor(df, use="complete.obs")
round(r,2)
##            price lotSize   age landValue livingArea pctCollege bedrooms
## price       1.00    0.16 -0.19      0.58       0.71       0.20     0.40
## lotSize     0.16    1.00 -0.02      0.06       0.16      -0.03     0.11
## age        -0.19   -0.02  1.00     -0.02      -0.17      -0.04     0.03
## landValue   0.58    0.06 -0.02      1.00       0.42       0.23     0.20
## livingArea  0.71    0.16 -0.17      0.42       1.00       0.21     0.66
## pctCollege  0.20   -0.03 -0.04      0.23       0.21       1.00     0.16
## bedrooms    0.40    0.11  0.03      0.20       0.66       0.16     1.00
## fireplaces  0.38    0.09 -0.17      0.21       0.47       0.25     0.28
## bathrooms   0.60    0.08 -0.36      0.30       0.72       0.18     0.46
## rooms       0.53    0.14 -0.08      0.30       0.73       0.16     0.67
##            fireplaces bathrooms rooms
## price            0.38      0.60  0.53
## lotSize          0.09      0.08  0.14
## age             -0.17     -0.36 -0.08
## landValue        0.21      0.30  0.30
## livingArea       0.47      0.72  0.73
## pctCollege       0.25      0.18  0.16
## bedrooms         0.28      0.46  0.67
## fireplaces       1.00      0.44  0.32
## bathrooms        0.44      1.00  0.52
## rooms            0.32      0.52  1.00
library(ggplot2)
library(ggcorrplot)
ggcorrplot(r, colors = c("blue", "white", "red")) # Plot the correlation
We can also customize the output:
- hc.order = TRUE reorders the variables, placing variables with similar correlation patterns together.
- type = "lower" plots the lower portion of the correlation matrix.
- lab = TRUE overlays the correlation coefficients (as text) on the plot.
ggcorrplot(r,
hc.order = TRUE,
type = "lower",
lab = TRUE)
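Showing only the lower portion loses no information, because a correlation matrix is symmetric with 1s on the diagonal. The sketch below checks this in base R on a few columns of the built-in mtcars dataset (an assumption for illustration); r_demo and lower are hypothetical names.

```r
# Correlations among four numeric columns of mtcars
num    <- mtcars[, c("mpg", "wt", "hp", "disp")]
r_demo <- cor(num, use = "complete.obs")

isSymmetric(r_demo)   # the upper triangle mirrors the lower triangle
diag(r_demo)          # each variable correlates perfectly with itself

# Keep only the lower triangle, blanking the diagonal and upper part
lower <- r_demo
lower[upper.tri(lower, diag = TRUE)] <- NA
round(lower, 2)
```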