Agenda

  1. Histogram
  2. Bar plot
  3. Pie chart
  4. Box plot
  5. Scatter plot

Load data

We will use the same House Price dataset. You can download it from Canvas (Sample Dataset module).

df<-read.table(file="HousePrices.csv", 
               sep=",", header=TRUE, stringsAsFactors=FALSE)

class(df) # read.table() returns a data frame
## [1] "data.frame"

Histogram of a numeric variable

A histogram takes a numeric vector as input and groups its values into bins to show how the numbers are distributed.

hist(df$price)

hist(df$price, breaks=20) # breaks suggests the number of bins; R may adjust it

# to learn more about these additional arguments, just type ?hist
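To see what breaks actually does, here is a minimal sketch on simulated prices (the vector toy_price is made up for illustration and is not the House Price data). With plot=FALSE, hist() returns the bins without drawing them. A single number is treated as a suggestion, so the bin count may differ from what you asked for, while an explicit vector of break points is used as given.

```r
set.seed(1)
toy_price <- runif(200, min = 50000, max = 800000)  # hypothetical prices

# Ask for about 20 bins, but do not draw the plot
h <- hist(toy_price, breaks = 20, plot = FALSE)
h$breaks          # the break points R actually chose
length(h$counts)  # the number of bins R actually used
sum(h$counts)     # every observation lands in exactly one bin

# A vector of break points is used exactly as given
h2 <- hist(toy_price, breaks = seq(0, 800000, by = 100000), plot = FALSE)
length(h2$counts) # 9 break points give 8 bins
```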

hist(df$price, col="green", xlim=c(50000, 800000),
     xlab="Price of house", ylab="Count", 
     main="Distribution of house price")

Bigger font size: Adjusting the cex argument

hist(df$price, col="plum", xlim=c(50000, 800000),
     xlab="Price of house", ylab="Count", 
     main="Distribution of house price",
     cex.lab=1.2, cex.axis=1.2, cex.main=1.5) #The default cex value = 1

Your turn

Create the following graph with a larger font size for the captions and axes (cex=2)

Frequency of a categorical variable

table(df$air_cond)
## 
##   No  Yes 
## 1093  635
table(df$heat)
## 
##  Electric   Hot Air Hot Water 
##       305      1121       302
table(df$fuel)
## 
## Electric      Gas      Oil 
##      315     1197      216

Bar plot

In a bar plot, the x axis is for categorical values while in a histogram, the x axis is for numerical values.

distribution <- table(df$heat)
distribution
## 
##  Electric   Hot Air Hot Water 
##       305      1121       302
barplot(distribution)

Pie chart for a string/factor variable

distribution <- table(df$fuel)
pie(distribution)

Customize your pie chart

distribution <- table(df$fuel)
pct <- round(distribution/nrow(df)*100)      # calculate percentages
lbls <- paste(names(distribution), pct, "%") # add percents to labels
pie(distribution, col=c("steelblue4", "grey", "grey"),
    labels=lbls)
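The label-building step can also be tried on its own. The sketch below uses a small made-up fuel vector (not the course data). It computes the shares with prop.table(), which divides each count by the table total, and uses paste0() so that no space appears before the percent sign.

```r
fuel <- c("Gas", "Gas", "Oil", "Electric", "Gas")  # hypothetical values

counts <- table(fuel)                          # counts per category
pct    <- round(100 * prop.table(counts))      # share of each category, in percent
lbls   <- paste0(names(counts), " ", pct, "%") # e.g. "Gas 60%"
lbls
```

The resulting lbls vector can be passed to the labels argument of pie() in the same way as above.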

Box plot

This is a simple example of generating a box plot in R. When you pass a vector directly, the data argument is not needed.

boxplot(df$price, main="House Price Box Plot")

Removing Outliers

We can set outline=FALSE so that the outliers are not drawn. The outliers are only hidden in the plot; they stay in the data.

boxplot(df$lot_size,outline=FALSE,
        ylab="Lot size of house", main="Box plot of house lot size")
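If you want to see which values count as outliers, boxplot.stats() applies the same 1.5 x IQR rule that boxplot() uses by default. A minimal sketch, assuming a made-up vector toy_lot with two extreme values added on purpose:

```r
set.seed(2)
toy_lot <- c(rnorm(50, mean = 0.5, sd = 0.1), 3, 4)  # hypothetical lot sizes

stats <- boxplot.stats(toy_lot)
stats$out         # values that boxplot() would draw as outlier points
length(toy_lot)   # the data itself is unchanged

# To actually drop them, subset the vector yourself
toy_lot_trimmed <- toy_lot[!toy_lot %in% stats$out]
length(toy_lot_trimmed)
```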

Box plot

This is an example of a box plot of one variable split by the values of another. For example, the distribution of price for houses with different numbers of rooms.

boxplot(price~rooms, data=df, main="House Prices by Number of Rooms",
   xlab="Number of rooms", ylab="House Prices ($)")

Your turn

  1. Show a histogram of lot_size

  2. Show a bar plot of air_cond

  3. Show a pie chart of construction

Customize your plots to enhance the visualization.

Detour: formula

The template for a formula in R is as follows:

outcome ~ predictor_1 + predictor_2 + ...
  • A formula usually has two parts: 1) one outcome variable on the left and 2) a set of predictors on the right

  • The two parts are separated by a tilde sign (~)

  • Formulas are frequently used in regression analyses and data mining. But you can also use formulas in some basic graphics.

  • Essentially, you use a formula to tell R that you are interested in the relationship between the outcome variable and the predictors
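A formula is an ordinary R object, so you can create one and inspect it before handing it to a plotting or modeling function. A small sketch (the variable names are only symbols here; no data is needed to build the formula):

```r
f <- price ~ lot_size + rooms  # build a formula; nothing is evaluated yet

class(f)      # "formula"
all.vars(f)   # variable names used in the formula
f[[2]]        # left-hand side (the outcome)
f[[3]]        # right-hand side (the predictors)
```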

Scatter plot

plot(price ~ lot_size + rooms + living_area, data=df)  # y_axis ~ x_axis; draws one plot per predictor
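With several predictors on the right-hand side, plot() draws a separate scatter plot for each one. One way to see them side by side is to split the plotting area with par(mfrow=...). The sketch below assumes a made-up data frame toy that reuses the column names of the course data.

```r
set.seed(3)
toy <- data.frame(lot_size    = runif(30),                       # hypothetical
                  rooms       = sample(3:8, 30, replace = TRUE), # hypothetical
                  living_area = runif(30, 800, 3000))            # hypothetical
toy$price <- 100 * toy$living_area + rnorm(30, sd = 20000)

old_par <- par(mfrow = c(1, 3))  # 1 row, 3 columns of panels
plot(price ~ lot_size + rooms + living_area, data = toy)
par(old_par)                     # restore the previous layout
```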

ifelse()

ifelse(condition, do_this_if_true, do_this_if_false)
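  • ifelse() is a vectorized function: it checks each element of condition and returns a vector of results. Do not confuse it with if( ){...}else{...}, which is a control-flow statement that tests a single condition.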

Example:

a <- 2; b <- 1 
ifelse(a > b, "a is greater than b", "a is not greater than b")
## [1] "a is greater than b"
color <- ifelse(a==b, "red", "blue")
print(color)
## [1] "blue"
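The example above uses single values, but ifelse() is most useful on whole vectors, where it returns one result per element. A short sketch with a made-up vector ac:

```r
ac <- c("Yes", "No", "No", "Yes")  # hypothetical values

# One color per element of ac
point_col <- ifelse(ac == "Yes", "blue", "red")
point_col

# if () works on a single TRUE or FALSE, so it handles one value at a time
if (ac[1] == "Yes") "blue" else "red"
```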

Make plots prettier & more informative

plot(price ~ living_area, data=df, 
     pch=ifelse(df$air_cond=="Yes", 0, 1), # pch: symbols for the points 
     col=ifelse(df$air_cond=="Yes", "blue", "red")) # col: colors of the points
legend("topleft", c("with AC", "without AC"), 
       pch=c(0, 1), col=c("blue","red"))

References for pch and col
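You can also draw a quick reference yourself. The sketch below plots the symbols for pch values 0 to 25 and lists some built-in color names with colors(); the layout choices (axis limits, label size) are arbitrary. See ?points for the full description of pch.

```r
pch_codes <- 0:25

# One point per pch value, all at the same height
plot(pch_codes, rep(1, length(pch_codes)), pch = pch_codes,
     ylim = c(0.5, 1.5), xlab = "pch value", ylab = "", yaxt = "n",
     main = "Point symbols for pch = 0 to 25")
text(pch_codes, 1.2, labels = pch_codes, cex = 0.7)  # print each code above its symbol

head(colors())                     # first few built-in color names
grep("blue", colors(), value = TRUE)[1:5]  # some color names containing "blue"
```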

Your turn

Create a scatter plot with living_area on the x axis and price on the y axis. Color the dots in the plot according to whether the house has a fireplace.