Agenda

  1. Histogram
  2. Bar plot
  3. Pie chart
  4. Scatter plot

Load data

We will use the same House Price dataset. Download the data from the Dataset module on Canvas.

df<-read.table(file="HousePrices.csv", 
               sep=",", header=TRUE, stringsAsFactors=FALSE)

class(df) # read.table() returns the file contents as a data frame
## [1] "data.frame"

Histogram of a numeric variable

A histogram takes a numeric vector as input and groups its values into bins to show how the numbers in the vector are distributed.

hist(df$price)

hist(df$price, breaks=40) # adjust the size of each bin via breaks
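When `breaks` is a single number, R treats it as a suggestion and picks "pretty" break points near that count, so the number of bins you get may differ from the number you ask for. The sketch below checks this without drawing anything; `sim_price` is simulated data made up for illustration, not the course dataset.

```r
# Simulated, right-skewed values standing in for house prices (hypothetical data)
set.seed(1)
sim_price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# plot = FALSE returns the histogram object instead of drawing it
h_default <- hist(sim_price, plot = FALSE)
h_fine    <- hist(sim_price, breaks = 40, plot = FALSE)

length(h_default$counts)  # number of bins R chose by default
length(h_fine$counts)     # number of bins when about 40 are requested
sum(h_fine$counts)        # every observation falls in exactly one bin
```

Inspecting `h_fine$breaks` shows the exact bin edges R settled on.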

# to learn more about these additional arguments, just type ?hist
hist(df$price, col="green", xlim=c(50000, 800000),
     xlab="Price of house", ylab="Count", 
     main="Distribution of house price")

Bigger font size:

hist(df$price, col="plum", xlim=c(50000, 800000),
     xlab="Price of house", ylab="Count", 
     main="Distribution of house price",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5)

Your turn

Create the following graph with the default number of bins.

Do you think this is a good visualization? Regenerate the graph so that it shows the range where most of the values lie, and add your favorite color to the chart.

Frequency of a categorical variable

table(df$air_cond)
## 
##   No  Yes 
## 1093  635
table(df$heat)
## 
##  Electric   Hot Air Hot Water 
##       305      1121       302
table(df$fuel)
## 
## Electric      Gas      Oil 
##      315     1197      216

Bar plot

In a bar plot, the x axis shows categorical values, whereas in a histogram the x axis shows numerical values.

distribution <- table(df$heat)
distribution
## 
##  Electric   Hot Air Hot Water 
##       305      1121       302
barplot(distribution)
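By default the bars appear in the order of the table, which is alphabetical for a character column. A minimal sketch of sorting the bars and labelling the axes follows; the vector `heat_type` is hypothetical data made up for illustration.

```r
# Hypothetical heating labels, used only for illustration
heat_type <- c("Hot Air", "Electric", "Hot Air", "Hot Water",
               "Hot Air", "Electric", "Hot Air")

counts <- table(heat_type)

# Sort from most to least frequent so the ranking is easy to read
sorted_counts <- sort(counts, decreasing = TRUE)

barplot(sorted_counts,
        col  = "steelblue4",
        xlab = "Heating type", ylab = "Count",
        main = "Houses by heating type")
```

Sorting is a design choice: it helps when the categories have no natural order, but keep the original order for ordered categories such as ratings.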

Pie chart for a string/factor variable

distribution <- table(df$fuel)
pie(distribution)

Customize your pie chart

distribution <- table(df$fuel)
pct <- round(distribution/nrow(df)*100)                # calculate percentages
lbls <- paste0(names(distribution), " ", pct, "%")     # add percents to labels
pie(distribution, col=c("steelblue4", "grey", "grey"),
    labels=lbls)
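The percentage labels can also be computed with `prop.table()`, which divides each count by the table total. This gives the same result as dividing by `nrow(df)` when the column has no missing values. The sketch below uses a hypothetical vector `fuel`, made up for illustration.

```r
# Hypothetical fuel labels, used only for illustration
fuel <- c("Gas", "Gas", "Gas", "Oil", "Electric",
          "Gas", "Oil", "Gas", "Oil", "Gas")

counts <- table(fuel)                      # categories in alphabetical order
pct    <- round(prop.table(counts) * 100)  # share of each category, in percent
lbls   <- paste0(names(counts), " ", pct, "%")

lbls  # one label per slice, e.g. "Gas 60%"
pie(counts, labels = lbls)
```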

Your turn

  1. Show a histogram of lot_size

  2. Show a bar plot of air_cond

  3. Show a pie chart of construction

Customize your plots to enhance the visualization.

Detour: formula

The template for a formula in R is as follows:

outcome ~ predictor_1 + predictor_2 + ...
  • A formula usually has two parts: 1) one outcome variable on the left and 2) a set of predictors on the right

  • The two parts are separated by a tilde sign (~)

  • Formulas are frequently used in regression analyses and data mining. But you can also use formulas in some basic graphics.

  • Essentially, you use a formula to tell R that you are interested in the relationship between the outcome variable and the predictors
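A formula is an ordinary R object, so it can be stored in a variable and inspected before it is passed to a plotting or modelling function. A minimal sketch, using column names from the House Price dataset purely as symbols (no data is needed to create the formula):

```r
# Creating a formula does not evaluate the variables it mentions
f <- price ~ lot_size + bathrooms

class(f)      # "formula"
all.vars(f)   # variable names: the outcome first, then the predictors
f[[2]]        # left-hand side (the outcome)
f[[3]]        # right-hand side (the predictors)
```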

Scatter plot

plot(price ~ lot_size + bathrooms + rooms + age, data=df)  # y_axis ~ x_axis; draws one plot per predictor

ifelse()

ifelse(condition, do_this_if_true, do_this_if_false)
  • ifelse() is a vectorized function: it tests each element of the condition and returns a vector of the same length.

Example:

a <- 2; b <- 1 
ifelse(a > b, "a is greater than b", "a is not greater than b")
## [1] "a is greater than b"
color <- ifelse(a==b, "red", "blue")
print(color)
## [1] "blue"
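Because `ifelse()` works element by element, it can turn a whole column into a vector of labels, colors, or plotting symbols in one call. A short sketch with a hypothetical vector `prices`, made up for illustration:

```r
# Hypothetical prices, used only for illustration
prices <- c(120000, 450000, 300000)

# The test is applied to every element, giving one label per price
label <- ifelse(prices > 250000, "expensive", "affordable")
label         # "affordable" "expensive" "expensive"
length(label) # same length as the input vector
```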

Make plots prettier & more informative

plot(price ~ lot_size, data=df, 
     pch=ifelse(df$air_cond=="Yes", 0, 1), # pch: symbols for the points
     col=ifelse(df$air_cond=="Yes", "red", "blue")) # col: colors of the points
legend("topleft", c("w/ aircon", "w/o aircon"), 
       pch=c(0, 1), col=c("red", "blue"))

References for pch
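Base R provides plotting symbols numbered 0 to 25. As a rough sketch, a reference chart can be drawn directly in R; the grid layout and colors below are arbitrary choices for illustration.

```r
# Draw symbols 0 to 25 in a grid, each labelled with its pch number
codes <- 0:25
plot(codes %% 9, codes %/% 9,
     pch = codes, cex = 2, col = "steelblue4", bg = "grey",
     xlim = c(-0.5, 8.5), ylim = c(2.8, -0.5),   # reversed so 0 starts at the top
     axes = FALSE, xlab = "", ylab = "", main = "pch symbols 0-25")
text(codes %% 9, codes %/% 9, labels = codes, pos = 1, offset = 1)
```

Symbols 21 to 25 have a separate fill color, set with `bg`, while `col` controls their border.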

Your turn

Create a scatter plot with living_area on the x axis and price on the y axis. Color the points according to whether the house has a fireplace.