
Introduction to data analysis using R, R Studio and R Markdown
Dee Chiluiza, PhD
Northeastern University
Boston, Massachusetts


Short manual series: Histograms

Introduction

Histograms are graphs that allow the observation and analysis of continuous data distribution and behavior.

A histogram gives insight into the shape of the data distribution in terms of normality, skewness, and kurtosis. The last two values can be confirmed numerically using the codes skewness(dataset$variable) and kurtosis(dataset$variable), from library(e1071).
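To see what skewness and kurtosis measure, here is a minimal sketch using plain moment-based formulas in base R. The helper names skew_simple and kurt_simple and the two small data vectors are hypothetical, created only for this illustration. The e1071 codes offer several estimator types, so their results may differ slightly from these simplified formulas.

```r
# Hypothetical helpers: plain moment-based formulas, for illustration only
skew_simple <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3/2)
}

kurt_simple <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3   # excess kurtosis: 0 for a normal distribution
}

symmetric_data    <- c(1, 2, 3, 4, 5)
right_skewed_data <- c(1, 1, 1, 2, 10)

skew_simple(symmetric_data)      # zero: both sides of the mean are balanced
skew_simple(right_skewed_data)   # positive: long tail on the right
kurt_simple(symmetric_data)      # negative: flatter than a normal distribution
```

A positive skewness indicates a longer tail on the right side of the histogram, a negative skewness a longer tail on the left.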

Let’s create two random samples; observe the R chunk library_data.
A first sample was created using code: sample1 = rnorm(1000, mean = 150, sd = 25); it produces 1000 random values from a normal distribution with mean 150 and standard deviation 25.
A second sample was created using code: sample2 = runif(1000, min = 10, max = 390); it produces 1000 random values uniformly distributed between a minimum of 10 and a maximum of 390. All values in both codes can be changed as desired.
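Because rnorm() and runif() draw new random values on every run, the resulting histograms change slightly each time the document is knitted. The sketch below assumes you want reproducible samples. set.seed() is not part of the original chunk, and the seed value 123 and the object names demo_norm and demo_unif are arbitrary choices for this illustration.

```r
# Fix the random number generator so the same values are produced on every run
set.seed(123)

demo_norm <- rnorm(1000, mean = 150, sd = 25)   # normal distribution
demo_unif <- runif(1000, min = 10, max = 390)   # uniform distribution

# Quick checks of what was generated
summary(demo_norm)
range(demo_unif)
```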

Libraries and data sets R Chunk

# {r library_data, message=FALSE, warning=FALSE}

# Libraries
library(dplyr) 
library(ggplot2)
library(readxl)
library(gridExtra)
library(RColorBrewer)
library(e1071)
library(lattice)

# Data sets
sample1 = rnorm(1000, mean = 150, sd = 25)
sample2 = runif(1000, min = 10, max =390)
data("faithful")
data("mpg")

# Import the carSales data set (readxl was already loaded above)
carSales <- read_excel("DataSets/carSales.xlsx")

Here’s a link to the data set carSales. Save it on your computer and use the same code to import it.
https://figshare.com/s/685c77fbec70f6fc7758

Let’s create the first histogram

The code hist() is from the basic library(graphics).

This is the simplest form of the code and its default outcome.

hist(sample1)

The four histograms displayed below were produced in sequential order. Observe the basic code in 1 and how some internal codes were added to change the title with main = "" and to remove the y-axis label with ylab = "".
(1) Basic code hist(sample1) with main = "1.1" to change the default title, (2) adding breaks to increase the number of bins and improve data resolution for better analysis, (3) adding colors, and (4) changing the y-axis values direction and limits. Notice the change in the y-axis scale between 1 and 2; this occurs because there are more bins, so the number of observations per bin is smaller.

# Par code to present 4 figures in a 2x2 matrix and to increase margin size
par(mfrow=c(2,2), mai = c(0.5,1,0.5,0.2))

# 1.Basic histogram
hist(sample1,
     main = "1.1")

# 2. Increase number of bins
hist(sample1, 
     main = "1.2",
     breaks = 50,
     ylab = "")

# 3. Add colors to improve visualization
hist(sample1, 
     main = "1.3",
     breaks = 50,
     col = brewer.pal(12, "Set3"))

# 4. Add colors, y-axis values orientation, y-axis limits
hist(sample1, 
     main = "1.4",
     breaks = 50,
     ylab = "",
     col = brewer.pal(12, "Set3"),
     las = 1,
     ylim = c(0,100))
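The change in the y-axis scale between histograms 1.1 and 1.2 can be checked numerically. This is a minimal sketch, assuming a fresh simulated sample (demo_sample and the seed are hypothetical, used only here). With plot = FALSE, hist() returns the bin counts instead of drawing the graph.

```r
set.seed(1)
demo_sample <- rnorm(1000, mean = 150, sd = 25)

# plot = FALSE returns the histogram object without drawing it
h_default <- hist(demo_sample, plot = FALSE)
h_fine    <- hist(demo_sample, breaks = 50, plot = FALSE)

# More bins ...
length(h_default$counts)
length(h_fine$counts)

# ... means fewer observations in the tallest bin
max(h_default$counts)
max(h_fine$counts)

# The total number of observations is the same in both cases
sum(h_default$counts)
sum(h_fine$counts)
```

Note that a single number in breaks is only a suggestion: R rounds it to convenient break points, so the final number of bins can differ from the value requested.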

Second histogram

In histogram 2.4, breaks are set using code seq(), which contains 3 elements: minimum, maximum, and bin size.

# Par code to present 4 figures in a 2x2 matrix and to increase margin size
par(mfrow=c(2,2), mai = c(0.5,1,0.5,0.2))

# 1.Basic histogram
hist(sample2,
     main = "2.1")

# 2. Increase number of bins
hist(sample2, 
     main = "2.2",
     breaks = 50,
     ylab = "")

# 3. Add colors to improve visualization
hist(sample2, 
     main = "2.3",
     breaks = 50,
     col = brewer.pal(12, "Set3"))

# 4. Y-axis orientation, breaks, x-y axes limits.
hist(sample2, 
     main = "2.4",
     breaks = seq(0,400,20),
     ylab = "",
     col = brewer.pal(12, "Set3"),
     las = 1,
     ylim = c(0,80),
     xlim = c(0,400))
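To see what seq(0,400,20) passes to breaks in histogram 2.4, the short sketch below prints the break points and counts the bins. The object names bin_breaks and demo_unif and the seed are assumptions made for this example.

```r
# seq(minimum, maximum, bin size) returns the exact break points
bin_breaks <- seq(0, 400, 20)
bin_breaks
length(bin_breaks)       # number of break points
length(bin_breaks) - 1   # number of bins

set.seed(2)
demo_unif <- runif(1000, min = 10, max = 390)

# Unlike breaks = 50, a vector of breaks is used exactly as given
h <- hist(demo_unif, breaks = bin_breaks, plot = FALSE)
h$counts
```

When breaks is a vector, it must cover the whole range of the data; otherwise hist() stops with an error.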

Using summarytools package to list basic descriptive statistics

We will use the data set carSales.

Obtain the basic descriptive statistics of its Price and Km variables using code summarytools::descr(), from library(summarytools).

priceStats = summarytools::descr(carSales$Price/1000)

kmStats = summarytools::descr(carSales$Km/1000)

Below, the two objects priceStats and kmStats are presented in a table created using HTML.

Statistic	Price Statistics (Price/1000)	Km Statistics (Km/1000)
Mean	28.44	92.19
Std.Dev	12.20	44.77
Min	0.80	5.54
Q1	19.76	63.33
Median	30.85	86.05
Q3	37.84	112.95
Max	63.50	452.80
MAD	11.57	36.11
IQR	18.06	49.62
CV	0.43	0.49
Skewness	-0.44	1.32
SE.Skewness	0.03	0.03
Kurtosis	-0.64	4.06
N.Valid	5844.00	5844.00
Pct.Valid	100.00	100.00
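Several of the statistics listed by descr() can also be obtained with base R codes. The sketch below is a rough equivalent, not the summarytools implementation. The vector price_k is simulated and only stands in for carSales$Price/1000, so the numbers it produces are not the ones reported for carSales.

```r
set.seed(42)
price_k <- rnorm(500, mean = 28, sd = 12)   # hypothetical stand-in for carSales$Price/1000

basic_stats <- c(
  Mean    = mean(price_k),
  Std.Dev = sd(price_k),
  Min     = min(price_k),
  Q1      = unname(quantile(price_k, 0.25)),
  Median  = median(price_k),
  Q3      = unname(quantile(price_k, 0.75)),
  Max     = max(price_k),
  MAD     = mad(price_k),
  IQR     = IQR(price_k),                   # Q3 minus Q1
  CV      = sd(price_k) / mean(price_k)     # coefficient of variation
)

round(basic_stats, 2)
```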

Use ?hist on the console for a complete list of options for the hist() code.

Use code glimpse() on the console to inspect the whole data set carSales. As you can see, two continuous numerical variables are the price of the cars and the kilometers read on each car.

Observe the distribution of both variables using histograms.

Observe the sequence of events:

1. Basic code

These graphs have several issues. Can you think of how many changes you can make to improve them?
- Main title and axes labels need to be changed.
- The display of the x- and y-axis values needs to be fixed.
- Default bins are too wide for proper analysis; increase the number of bins to a high value and then decrease it until the bins tell a story. Instead of the number of bins you might hear about the width of the bins; the two are directly related. Increasing the number of bins decreases their width, and increasing the width of the bins decreases their number.
- Graphs need color.
- For better analysis of data distribution, the mean and the median can be included in the graph.

par(mfcol=c(2,1), mai=c(0.8,1,0.2,1))

hist(carSales$Price)

hist(carSales$Km)

2. Change the titles and labels

par(mfcol=c(2,1),
    mai=c(0.8,1,0.2,1))

hist(carSales$Price,
     main = "Distribution of price values",
     ylab = "Frequency",
     xlab = "Price (in US$)")

hist(carSales$Km,
     main = "Distribution of kilometers read per car",
     ylab = "Frequency",
     xlab = "Kilometers")

3. Remove titles and increase bins

In the graphs above, you can observe that the titles are unnecessary; see below the difference when they are removed with main = "".
Increase the number of bins (breaks =) to a high value, then decrease it until the distribution tells a story about your data. Compare the bins with the graphs above: with smaller bins there is more information to analyze. Be aware, however, that too many bins is also not a good way to present a histogram unless there is a good reason. Decrease the number of bins until you find a good balance.

Notice the two changes in the codes below.

par(mfcol=c(2,1),
    mai=c(0.8,1,0.2,1))

hist(carSales$Price,
     main = "",
     ylab = "Frequency",
     xlab = "Price (in US$)",
     breaks = 70)

hist(carSales$Km,
     main = "",
     ylab = "Frequency",
     xlab = "Kilometers",
     breaks = 70)

4. Fix values on x-axis

Sometimes, big numbers such as 20,000 or 1e+05 are not a convenient way to present data. A good strategy is to divide all values by a factor; depending on the initial values, divide by 10, 1000, 10,000, 1 million, etc. You can also use log values if applicable.
- There are several ways to do this; in the case below, the division by 1000 was done directly on the variable inside the code.
- When this procedure is performed, it must be indicated in the labels.
- Observe how much easier it is to read the values on the x-axis.

par(mfcol=c(2,1),
    mai=c(0.8,1,0.2,1))

hist(carSales$Price/1000,
     main = "",
     ylab = "Frequency",
     xlab = "Price (in US$ x 1000)",
     breaks = 70)

hist(carSales$Km/1000,
     main = "",
     ylab = "Frequency",
     xlab = "Kilometers (x1,000)",
     breaks = 70)
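Dividing the values by a factor changes only the scale of the x-axis, not the shape of the distribution. The sketch below illustrates this with simulated values; demo_km is a hypothetical stand-in for carSales$Km, and explicit break vectors are used so that both histograms have matching bins.

```r
set.seed(7)
demo_km <- runif(200, min = 5000, max = 450000)   # hypothetical kilometer readings

# Same bins, expressed in the original units and in thousands
h_raw      <- hist(demo_km,        breaks = seq(0, 500000, 50000), plot = FALSE)
h_thousand <- hist(demo_km / 1000, breaks = seq(0, 500, 50),       plot = FALSE)

# The counts per bin are the same; only the axis values change
h_raw$counts
h_thousand$counts
identical(h_raw$counts, h_thousand$counts)
```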

5. Fix x- and y-axes presentations

Change the orientation of the y-axis values with las = 1. You can also use las = 2 to set the values perpendicular to both axes.
- Control the values presented on the x-axis with xaxp = c(min, max, intervals). In the first histogram below the code indicates xaxp = c(0,80,8): from 0 to 80, divided into 8 intervals.

par(mfcol=c(2,1),
    mai=c(0.8,1,0.2,1))

hist(carSales$Price/1000,
     main = "",
     ylab = "Frequency",
     xlab = "Price (in US$ x 1000)",
     breaks = 70,
     las = 1,
     xaxp = c(0,80,8),
     yaxp = c(0,300,3))

hist(carSales$Km/1000,
     main = "",
     ylab = "Frequency",
     xlab = "Kilometers (x1,000)",
     breaks = 70,
     las = 1,
     xaxp = c(0,500,10),
     yaxp = c(0,400,4))
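The tick positions produced by xaxp and yaxp can be previewed before drawing the graph. In the sketch below, axp_ticks is a hypothetical helper written for this example; it is not part of the graphics library. It relies on the third value being the number of intervals between the first and last tick marks.

```r
# Hypothetical helper: tick positions implied by c(first tick, last tick, intervals)
axp_ticks <- function(first, last, intervals) {
  seq(first, last, length.out = intervals + 1)
}

axp_ticks(0, 80, 8)     # x-axis of the price histogram: steps of 10
axp_ticks(0, 500, 10)   # x-axis of the kilometer histogram: steps of 50
axp_ticks(0, 300, 3)    # y-axis of the price histogram: steps of 100
```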

6. Add colors

Observe the two color codes used below.

par(mfcol=c(2,1),
    mai=c(0.8,1,0.2,1))

hist(carSales$Price/1000,
     main = "",
     ylab = "Frequency",
     xlab = "Price (in US$ x 1000)",
     breaks = 70,
     las = 1,
     xaxp = c(0,80,8),
     yaxp = c(0,300,3),
     col = terrain.colors(12))

hist(carSales$Km/1000,
     main = "",
     ylab = "Frequency",
     xlab = "Kilometers (x1,000)",
     breaks = 70,
     las = 1,
     xaxp = c(0,500,10),
     yaxp = c(0,400,4),
     col = brewer.pal(12, "Set3"))

7. Add the mean and median values as vertical lines

abline() code uses v for a vertical line and h for a horizontal line.
- Check more about the code using ?abline in the console.
- The abline() codes are added after the end of the hist() codes.
- Notice the addition of hashtags # to enter non-coding text; they can be used to organize codes and enter notes.
- Notice the increase of the y-axis limits with ylim = c(0,400) to improve visualization of the bins and the new lines.
- Notice the amount of data on the left versus the right of the mean line.
- Notice the position of the median line relative to the mean line: is it on the left or the right? This is a clue to the sign of the skewness value we will analyze below.

# par code to present graphs using a 2x1 matrix (mfcol) and to change margins (mai)
par(mfcol=c(2,1),
    mai=c(0.8,1,0.2,1))

# Histogram for price values 
hist(carSales$Price/1000,
     main = "",
     ylab = "Frequency",
     xlab = "Price (in US$ x 1000)",
     breaks = 70,
     las = 1,
     xaxp = c(0,80,8),
     yaxp = c(0,400,4),
     col = terrain.colors(12),
     ylim = c(0,400))

abline(v = mean(carSales$Price/1000),
       col = "red",
       lwd = 3)            # lwd for thickness

abline(v = median(carSales$Price/1000),
       col = "blue",
       lwd = 3)


# Histogram for kilometer values
hist(carSales$Km/1000,
     main = "",
     ylab = "Frequency",
     xlab = "Kilometers (x1,000)",
     breaks = 70,
     las = 1,
     xaxp = c(0,500,10),
     yaxp = c(0,400,4),
     col = brewer.pal(12, "Set3"),
     ylim = c(0,400))

abline(v = mean(carSales$Km/1000),
       col = "red",
       lwd = 3)

abline(v = median(carSales$Km/1000),
       col = "blue",
       lwd = 3)
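The relation between the mean and median lines and the sign of the skewness can be checked with the values already listed in the summarytools table. The sketch below simply retypes those numbers into a small data frame (stats_tbl is a hypothetical name). This relation is a rule of thumb that often holds for unimodal data, not a strict rule.

```r
# Values copied from the descriptive statistics table above
stats_tbl <- data.frame(
  variable = c("Price/1000", "Km/1000"),
  mean     = c(28.44, 92.19),
  median   = c(30.85, 86.05),
  skewness = c(-0.44, 1.32)
)

# Mean to the left of the median suggests negative skewness, and vice versa
stats_tbl$mean_minus_median <- stats_tbl$mean - stats_tbl$median
stats_tbl$same_sign <- sign(stats_tbl$mean_minus_median) == sign(stats_tbl$skewness)

stats_tbl
```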

8. Add text information inside the graphs

Use the code text() after the end of the histogram code.
- text() code can be added before or after the abline() code, the order does not affect them.
- text() needs indication of the x- and y-axis positions of the text to display.
- the text can be words, the outcome of formulas, or a combination of both.
- In the case below, we will add three text() codes to each histogram: one for the mean, one for the median, and one on the top-right to replace the title and x-axis label.
- X-axes labels are removed by adding quotations without text inside.

# Histogram for price values 
hist(carSales$Price/1000,
     main = "",
     ylab = "Frequency",
     xlab = "",
     breaks = 70,
     las = 1,
     xaxp = c(0,80,8),
     yaxp = c(0,400,4),
     col = terrain.colors(12),
     ylim = c(0,400))

abline(v = mean(carSales$Price/1000),
       col = "red",
       lwd = 3)

text(x=mean(carSales$Price/1000),
     y = 390,
     paste("Mean:", round(mean(carSales$Price)/1000,1),"K"),
     col = "red",
     cex = 0.8,
     pos = 4)

abline(v = median(carSales$Price/1000),
       col = "blue",
       lwd = 3)

text(x=median(carSales$Price/1000),
     y = 390,
     paste("Median:", round(median(carSales$Price)/1000,1),"K"),
     col = "blue",
     cex = 0.8,
     pos = 2)

text(x = 80,
     y = 380,
     paste("Price (in US$ x 1,000)"),
     cex = 1,
     pos = 2,
     col = "#003366")

Histogram for kilometer values

# Histogram for kilometer values
hist(carSales$Km/1000,
     main = "",
     ylab = "Frequency",
     xlab = "",
     breaks = 70,
     las = 1,
     xaxp = c(0,500,10),
     yaxp = c(0,400,4),
     col = brewer.pal(12, "Set3"),
     ylim = c(0,400))

abline(v = mean(carSales$Km/1000),
       col = "red",
       lwd = 3)

text(x=mean(carSales$Km/1000),
     y = 390,
     paste("Mean:", round(mean(carSales$Km)/1000,1),"K"),
     col = "red",
     cex = 0.8,
     pos = 4)

abline(v = median(carSales$Km/1000),
       col = "blue",
       lwd = 3)

text(x=median(carSales$Km/1000),
     y = 390,
     paste("Median:", round(median(carSales$Km)/1000,1),"K"),
     col = "blue",
     cex = 0.8,
     pos = 2)

text(x = 450,
     y = 380,
     paste("Km (x 1,000)"),
     cex = 1,
     pos = 2,
     col = "#003366")
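The labels passed to text() are ordinary character strings, so they can be built and checked on the console before they are placed on the graph. A minimal sketch, with a small made-up vector demo_price used only for illustration:

```r
demo_price <- c(19800, 30850, 37840, 25000)   # hypothetical prices in US$

# paste() joins its pieces with a space by default
label_mean <- paste("Mean:", round(mean(demo_price) / 1000, 1), "K")
label_mean

# paste0() joins the pieces without any separator
label_compact <- paste0("Mean: ", round(mean(demo_price) / 1000, 1), "K")
label_compact
```

In text(), pos sets where the label goes relative to the x-y position: 1 below, 2 to the left, 3 above, and 4 to the right. This is why the mean and median labels above use pos = 4 and pos = 2, so that they sit on opposite sides of their lines.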

Histogram example 1

Observe all the extras that were added to the histogram below.

par(mai = c(1,1.2,0.5,0.4))

hist(carSales$Price,
     breaks = seq(0,80000,1000),
     main = "",
     ylab = "Frequency",
     xlab = "",
     col.lab="red",               # Change color of the x- and y-axis labels
     ylim = c(0,400),
     col = brewer.pal(8, "Pastel2"),
     las = 2,
     xaxp = c(0,80000,16))

# Vertical line and text for the mean

abline(v = mean(carSales$Price),
       col = "blue",
       lwd = 2)

text(x=mean(carSales$Price),
     y = 390,
     paste("Mean:", round(mean(carSales$Price/1000),1),"K"),
     col = "blue",
     cex = 0.8,
     pos = 4)

# Vertical line and text for the median

abline(v = median(carSales$Price),
       col = "red",
       lwd = 2)

text(x=median(carSales$Price),
     y = 390,
     paste("Median:", round(median(carSales$Price/1000),1),"K"),
     col = "red",
     cex = 0.8,
     pos = 2)

# Text codes for skewness and kurtosis

text(x=80000,
     y = 185,
     paste("Skewness:", round(skewness(carSales$Price),3)),
     col = "#006633",
     cex = 1,
     pos = 2)

text(x=80000,
     y = 135,
     paste("Kurtosis:", round(kurtosis(carSales$Price),3)),
     col = "#006633",
     cex = 1,
     pos = 2)

mtext("Price distribution",
     side = 1,
     line = 3.7,
     cex=0.9,
     col = "red")

# Add horizontal dotted grey lines 

abline(h = seq(0,350,50), col = "grey", lty="dotted")

Adding data values to columns

Adding values to each bar is done using labels = TRUE (T is its abbreviation).
- This works better with wide bins. It does not work well with numerous narrow bins.

# Histogram for kilometer values
hist(carSales$Km/1000,
     main = "",
     ylab = "Frequency",
     xlab = "",
     breaks = 20,
     las = 1,
     xaxp = c(0,500,10),
     yaxp = c(0,1500,5),
     col = brewer.pal(12, "Set3"),
     ylim = c(0,1500),
     labels = T)
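Besides TRUE, labels also accepts a character vector, which allows other values to be displayed on top of the bars. The sketch below shows percentages instead of counts; demo_values and the seed are assumptions for this example, and the histogram is computed first with plot = FALSE so that the labels can be prepared.

```r
set.seed(3)
demo_values <- rnorm(200, mean = 100, sd = 15)

# Compute the histogram without drawing it
h <- hist(demo_values, breaks = 10, plot = FALSE)

# One label per bar: percentage of observations in each bin
pct_labels <- paste0(round(100 * h$counts / sum(h$counts)), "%")

# Draw the stored histogram, leaving room above the bars for the labels
plot(h,
     labels = pct_labels,
     ylim = c(0, max(h$counts) * 1.2),
     las = 1,
     main = "",
     xlab = "Demo values",
     col = "grey85")
```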

Lattice Library to observe continuous data per categories

The lattice library can separate and present continuous data in histograms according to the levels of a factor (the categories of a categorical variable).

# Note: lattice graphs are not arranged by par(mfcol); each histogram() code produces its own figure

histogram(~ Km/1000 | FuelType, data=carSales,
           las=1,
           main="Kilometers (x1,000) by fuel type",
           xlab = "",
           breaks = 20,
          col = brewer.pal(8,"Pastel1"))

histogram(~ Km/1000 | Transmission, data=carSales,
           las=1,
           main="Kilometers (x1,000) by transmission",
           xlab = "",
           breaks = 20,
          col = brewer.pal(8,"Pastel1"))

histogram(~ Km/1000 | Owner, data=carSales,
           las=1,
           main="Kilometers (x1,000) by owner",
           xlab = "",
           breaks = 20,
          col = brewer.pal(8,"Pastel1"))
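Lattice histograms are objects that can be stored and then printed, which is one way to place several of them on the same page. The sketch below uses the built-in mtcars data set as a stand-in for carSales, so the variables mpg, cyl and gear are not part of this manual's data; the object names p_cyl and p_gear are also just illustrative.

```r
library(lattice)

# Store each lattice histogram as an object (nothing is drawn yet)
p_cyl <- histogram(~ mpg | factor(cyl), data = mtcars,
                   nint = 8,               # number of bins
                   layout = c(1, 3),       # 1 column, 3 rows of panels
                   xlab = "Miles per gallon",
                   main = "By cylinders")

p_gear <- histogram(~ mpg | factor(gear), data = mtcars,
                    nint = 8,
                    layout = c(1, 3),
                    xlab = "Miles per gallon",
                    main = "By gears")

# split = c(column, row, total columns, total rows); more = TRUE keeps the page open
print(p_cyl,  split = c(1, 1, 2, 1), more = TRUE)
print(p_gear, split = c(2, 1, 2, 1))
```

Storing the plots also helps inside functions and loops, where lattice graphs are only displayed when print() is called.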

Add a normal distribution line

hist(sample1,
     breaks = 25,
     xlim = c(75, 225),
     prob = TRUE,
     las=1,
     col = brewer.pal(12, "Set3"),
     ylim=c(0,0.02),
     xaxp=c(75,225,10))

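# Overlay a kernel density estimate (a smoothed curve of the data; adjust = 2 increases the smoothing)
lines(density(sample1, adjust=2), col="red")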
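The density() line follows the data themselves. If the goal is to compare the histogram with a theoretical normal curve, one option is to draw dnorm() using the mean and standard deviation of the sample. This is a sketch under that assumption; demo_sample, demo_mean, demo_sd and the seed are names and values chosen only for this example.

```r
set.seed(10)
demo_sample <- rnorm(1000, mean = 150, sd = 25)

demo_mean <- mean(demo_sample)
demo_sd   <- sd(demo_sample)

# freq = FALSE puts densities on the y-axis, the scale needed for both curves
h <- hist(demo_sample,
          breaks = 25,
          freq = FALSE,
          las = 1,
          main = "",
          xlab = "Demo sample")

# Red: kernel density estimate of the data
lines(density(demo_sample, adjust = 2), col = "red")

# Blue: normal curve with the same mean and standard deviation as the sample
curve(dnorm(x, mean = demo_mean, sd = demo_sd),
      add = TRUE, col = "blue", lwd = 2)
```

The closer the two curves are to each other, the closer the sample is to a normal distribution.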

Histograms using GGPLOT

cInd_GasDies = carSales %>% 
        filter(FuelType %in% c('Gasoline', 'Diesel'))

hista1 = cInd_GasDies %>% 
        ggplot(aes(Km/1000)) +
        geom_histogram(fill="#A11515",
                       color="white",
                       binwidth=10)+
        theme_minimal()

hista2 = cInd_GasDies %>% 
        ggplot(aes(Km/1000, fill=Owner)) +
        geom_density(alpha=0.4)+
        theme_minimal()

grid.arrange(hista1, hista2, ncol=1, nrow=2)

Histogram example 2

# Par() code to change fonts, colors, background, and text settings.

par(font = 1, 
    font.lab = 3, 
    font.axis = 3, 
    font.main = 1, 
    fg = "#CC0000", 
    bg = "#D9F5F4", 
    cex = 0.8, 
    cex.lab = 0.4, 
    srt = 90, 
    tcl = -0.4, 
    ylbias = 0.6, 
    pty = "m")

# cex changes the size of all graph fonts
# srt changes the angle (in degrees) of plotted text
# tcl changes the length and direction of tick marks on both axes (negative values point outward)

# Histogram sample1
hist(sample1,
     breaks = 15,
     main = "Sample 1",
     las = 2,
     labels = TRUE,
     col = terrain.colors(22),
     cex.main = 0.8,
     cex.lab = 0.8,
     cex.axis = 0.8,
     cex = 0.1,
     border = "blue")

# Par() code to change fonts, colors, background, and text settings.

par(font = 1, 
    font.lab = 3, 
    font.axis = 3, 
    font.main = 1, 
    fg = "#CC0000", 
    bg = "#D9F5F4", 
    cex = 0.8, 
    cex.lab = 0.4, 
    srt = 90, 
    tcl = -0.4, 
    ylbias = 0.6, 
    pty = "m")

# Histogram sample1 Prob
hist(sample1,
     prob = T,
     breaks = 15,
     main = "Sample 1 Probability",
     las = 2,
     ylim = c(0,0.02), 
     labels = TRUE,
     col = terrain.colors(22),
     cex.main = 0.8,
     cex.lab = 0.8,
     cex.axis = 0.8,
     cex = 0.1,
     border = "blue")
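Settings changed with par() stay active for the following graphs drawn on the same graphics device, which can be surprising when working on the console or in a script. A minimal sketch of one way to handle this, assuming you want to return to the previous settings afterwards: par() returns the old values, which can be stored and restored. The object names old_par and cex_before are arbitrary.

```r
cex_before <- par("cex")   # current font size setting, kept for comparison

# par() returns the previous values of the settings it changes
old_par <- par(bg = "#D9F5F4", fg = "#CC0000", cex = 0.8)

set.seed(5)
hist(rnorm(200, mean = 150, sd = 25),
     main = "Styled histogram",
     las = 2)

# Restore the previous settings for the graphs that follow
par(old_par)
par("cex") == cex_before
```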

Disclaimer: This short manual series is a work in progress. Unless clearly stated otherwise, this material is considered a draft version.

Dee Chiluiza, PhD
June 2021
Last update: 13 October, 2021
Boston, Massachusetts, USA
Bruno Dog