Dee Chiluiza, PhD
Northeastern University
Boston, Massachusetts
Short manual series: Histograms
Introduction
Histograms are graphs used to observe and analyze the distribution and behavior of continuous data. A histogram gives insight into the shape of the data distribution in terms of normality, skewness, and kurtosis. Skewness and kurtosis can be confirmed numerically with skewness(dataset$variable) and kurtosis(dataset$variable), from library(e1071).
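To make clear what these two numbers measure, here is a minimal base R sketch of the plain moment-based formulas. The function names moment_skewness() and moment_kurtosis(), the seed, and the simulated data are assumptions for illustration. The e1071 functions offer several estimator types, so their output may differ slightly from these values.

```r
# Moment-based skewness: third central moment scaled by the variance^(3/2)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment scaled by variance^2,
# minus 3 so that a normal distribution scores about 0
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

set.seed(123)  # hypothetical seed, used only for reproducibility
normal_values  <- rnorm(1000, mean = 150, sd = 25)
uniform_values <- runif(1000, min = 10, max = 390)

skew_normal  <- moment_skewness(normal_values)   # expected to be close to 0
kurt_uniform <- moment_kurtosis(uniform_values)  # expected to be negative (flat shape)
print(c(skewness_normal = skew_normal, kurtosis_uniform = kurt_uniform))
```

A skewness near 0 suggests a symmetric distribution, and a negative excess kurtosis suggests a flatter shape with lighter tails than a normal distribution.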
Let’s create two random samples; see the R chunk library_data below.
The first sample is created with sample1 = rnorm(1000, mean = 150, sd = 25), which produces 1000 random values from a normal distribution with mean 150 and standard deviation 25.
The second sample is created with sample2 = runif(1000, min = 10, max = 390), which produces 1000 random values drawn uniformly between a minimum of 10 and a maximum of 390. All values in both codes can be changed as desired.
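Because rnorm() and runif() draw new random values on every run, the histograms change each time the document is knitted. A minimal sketch of how set.seed() makes the samples reproducible follows. The seed value and the object names are assumptions for illustration.

```r
# Setting the same seed before each draw gives the same random values
set.seed(2024)  # hypothetical seed; any integer works
sample_a <- rnorm(1000, mean = 150, sd = 25)
set.seed(2024)
sample_b <- rnorm(1000, mean = 150, sd = 25)
print(identical(sample_a, sample_b))

# Quick checks that the samples behave as requested
uniform_sample <- runif(1000, min = 10, max = 390)
print(c(mean = mean(sample_a), sd = sd(sample_a)))  # close to 150 and 25
print(range(uniform_sample))                        # inside 10 and 390
```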
# {r library_data, message=FALSE, warning=FALSE}
# Libraries
library(dplyr)
library(ggplot2)
library(readxl)
library(gridExtra)
library(RColorBrewer)
library(e1071)
library(lattice)
# Data sets
sample1 = rnorm(1000, mean = 150, sd = 25)
sample2 = runif(1000, min = 10, max =390)
data("faithful")
data("mpg")
carSales <- read_excel("DataSets/carSales.xlsx")
Here’s a link to the data set carSales. Save it on your computer and use the same code to import it.
https://figshare.com/s/685c77fbec70f6fc7758
Let’s create the first histogram
The function hist() comes from the base library(graphics).
Called with only the data, it produces the most basic form of the histogram.
The four histograms displayed below were produced in sequential order. Observe the basic code in 1.1 and how arguments were added to change the title with main = "" and to remove the y-axis label with ylab = "".
(1) Basic code hist(sample1), with main = "1.1" to change the default title; (2) breaks added to increase the number of bins and improve data resolution for better analysis; (3) colors added; and (4) y-axis value orientation and limits changed. Notice the change in the y-axis scale between 1.1 and 1.2: with more bins, the number of observations per bin is smaller.
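The effect of the number of bins on the counts per bin can also be checked numerically. The sketch below uses hist() with plot = FALSE to return the bin counts without drawing. The seed and object names are assumptions for illustration. Note that a single number passed to breaks is only a suggestion, since hist() rounds the break points to "pretty" values.

```r
set.seed(1)  # hypothetical seed
x <- rnorm(1000, mean = 150, sd = 25)

few_bins  <- hist(x, plot = FALSE)               # default number of bins
many_bins <- hist(x, breaks = 50, plot = FALSE)  # about 50 bins requested

# Number of bins actually used in each case
print(c(default = length(few_bins$counts), requested_50 = length(many_bins$counts)))

# Tallest bar in each case: more bins means fewer observations per bin
print(c(default = max(few_bins$counts), requested_50 = max(many_bins$counts)))

# The total number of observations is the same in both
print(c(sum(few_bins$counts), sum(many_bins$counts)))
```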
# Par code to present 4 figures in a 2x2 matrix and to increase margin size
par(mfrow=c(2,2), mai = c(0.5,1,0.5,0.2))
# 1.Basic histogram
hist(sample1,
main = "1.1")
# 2. Increase number of bins
hist(sample1,
main = "1.2",
breaks = 50,
ylab = "")
# 3. Add colors to improve visualization
hist(sample1,
main = "1.3",
breaks = 50,
col = brewer.pal(12, "Set3"))
# 4. Add colors, y-axis values orientation, y-axis limits
hist(sample1,
main = "1.4",
breaks = 50,
ylab = "",
col = brewer.pal(12, "Set3"),
las = 1,
ylim = c(0,100))
Second histogram
In histogram 2.4, breaks are set using code seq(), which contains 3 elements: minimum, maximum, and bin size.
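A minimal sketch of how seq() translates into bins, assuming simulated data similar to sample2. The seed and object names are assumptions. The break points must span the whole range of the data, otherwise hist() stops with an error.

```r
# seq(from, to, by): break points from 0 to 400 in steps of 20
bin_breaks <- seq(0, 400, 20)
print(length(bin_breaks))  # number of break points (one more than the number of bins)

set.seed(1)  # hypothetical seed
x <- runif(1000, min = 10, max = 390)
h <- hist(x, breaks = bin_breaks, plot = FALSE)
print(length(h$counts))    # number of bins

# Breaks that stop at 300 do not cover values up to 390, so hist() fails
too_narrow <- try(hist(x, breaks = seq(0, 300, 20), plot = FALSE), silent = TRUE)
print(inherits(too_narrow, "try-error"))
```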
# Par code to present 4 figures in a 2x2 matrix and to increase margin size
par(mfrow=c(2,2), mai = c(0.5,1,0.5,0.2))
# 1.Basic histogram
hist(sample2,
main = "2.1")
# 2. Increase number of bins
hist(sample2,
main = "2.2",
breaks = 50,
ylab = "")
# 3. Add colors to improve visualization
hist(sample2,
main = "2.3",
breaks = 50,
col = brewer.pal(12, "Set3"))
# 4. Y-axis orientation, breaks, x-y axes limits.
hist(sample2,
main = "2.4",
breaks = seq(0,400,20),
ylab = "",
col = brewer.pal(12, "Set3"),
las = 1,
ylim = c(0,80),
xlim = c(0,400))
Using the summarytools package to list basic descriptive statistics
We will use the data set carSales.
Obtain the basic descriptive statistics of the variables Price and Km using summarytools::descr(), from library(summarytools).
priceStats = summarytools::descr(carSales$Price/1000)
kmStats = summarytools::descr(carSales$Km/1000)
Below, the two objects priceStats and kmStats are presented in a table created with HTML.
| Price Statistics | Km Statistics |
Use ?hist on the console for a complete list of options for the hist() code.
Use glimpse() on the console to inspect the whole data set carSales. As you can see, two continuous numerical variables are the price of each car and the kilometers read on each car. Observe the distribution of both variables using histograms.
Observe the sequence of events:
These graphs have several issues. Can you think of how many changes you can make to improve them?
- Main title and axes labels need to be changed.
- The display of the x- and y-axis values needs to be fixed.
- Default bins are too wide for proper analysis. Increase the number of bins to a high value and then decrease it until the bins tell a story. Instead of the number of bins, you might hear about the width of the bins; these are two ways of describing the same choice. Increasing the number of bins decreases their width, and increasing the width of the bins decreases their number.
- Graphs need color.
- For better analysis of data distribution, the mean and the median can be included in the graph.
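To illustrate why the mean and the median help in reading a distribution, here is a minimal sketch with simulated right-skewed data (not the carSales data). The seed, object name, and distribution parameters are assumptions. In a right-skewed sample the long tail pulls the mean above the median, so the two lines separate.

```r
set.seed(42)  # hypothetical seed
skewed_values <- rlnorm(1000, meanlog = 3, sdlog = 0.6)  # simulated right-skewed data

hist(skewed_values,
     breaks = 40,
     main = "",
     xlab = "Simulated values",
     las = 1)
# Vertical lines for the mean (red) and the median (blue)
abline(v = mean(skewed_values), col = "red", lwd = 3)
abline(v = median(skewed_values), col = "blue", lwd = 3)
legend("topright",
       legend = c("Mean", "Median"),
       col = c("red", "blue"),
       lwd = 3,
       bty = "n")
```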
par(mfcol=c(2,1),
mai=c(0.8,1,0.2,1))
hist(carSales$Price,
main = "Distribution of price values",
ylab = "Frequency",
xlab = "Price (in US$)")
hist(carSales$Km,
main = "Distribution of kilometers read per car",
ylab = "Frequency",
xlab = "Kilometers")
In the graphs above, the titles are unnecessary; see below the difference when they are removed with main = "".
Increase the number of bins (breaks =) to a high value, then decrease it until the distribution tells a story about your data. Compare the bins with the graphs above: with smaller bins there is more information to analyze. Be aware that too many bins is also not a good way to present a histogram unless there is a good reason. Decrease the number of bins until you find a good balance.
Notice the two changes in the codes below.
par(mfcol=c(2,1),
mai=c(0.8,1,0.2,1))
hist(carSales$Price,
main = "",
ylab = "Frequency",
xlab = "Price (in US$)",
breaks = 70)
hist(carSales$Km,
main = "",
ylab = "Frequency",
xlab = "Kilometers",
breaks = 70)
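As a possible starting point for the trial-and-error choice of bins described above, base R includes three rules that suggest a number of bins from the data: nclass.Sturges(), nclass.scott(), and nclass.FD(). The sketch below uses simulated data, and the seed and object name are assumptions. The suggestions are only a first guess to be adjusted by eye.

```r
set.seed(7)  # hypothetical seed
x <- rnorm(1000, mean = 150, sd = 25)

# Number of bins suggested by each rule
suggested <- c(Sturges = nclass.Sturges(x),
               Scott   = nclass.scott(x),
               FD      = nclass.FD(x))
print(suggested)

# hist() also accepts the name of a rule in the breaks argument
hist(x,
     breaks = "FD",
     main = "",
     xlab = "Simulated values",
     las = 1)
```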
Sometimes, large numbers such as 20,000 or 1e+05 are not a convenient way to present data. A good strategy is to divide all values by a factor chosen according to the initial values: 10, 1,000, 10,000, 1 million, etc. You can also use log values if applicable.
- There are several ways to do this; in the case below, the division by 1000 is done directly on the variable inside the code.
- When this procedure is performed, it must be indicated in the labels.
- Observe how much easier it is to read the values on the x-axis.
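Dividing by a factor changes only the axis scale, not the shape of the histogram. A minimal sketch with simulated prices (not the carSales data) follows. The seed, object names, and price range are assumptions for illustration.

```r
set.seed(10)  # hypothetical seed
price <- runif(500, min = 2000, max = 78000)  # simulated prices in US$

# Same bins expressed in US$ and in thousands of US$
raw    <- hist(price, breaks = seq(0, 80000, 5000), plot = FALSE)
scaled <- hist(price / 1000, breaks = seq(0, 80, 5), plot = FALSE)

# The counts per bin should match; only the axis values differ
print(identical(raw$counts, scaled$counts))

hist(price / 1000,
     breaks = seq(0, 80, 5),
     main = "",
     xlab = "Price (in US$ x 1,000)",  # the scaling is stated in the label
     las = 1)
```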
par(mfcol=c(2,1),
mai=c(0.8,1,0.2,1))
hist(carSales$Price/1000,
main = "",
ylab = "Frequency",
xlab = "Price (in US$ x 1000)",
breaks = 70)
hist(carSales$Km/1000,
main = "",
ylab = "Frequency",
xlab = "Kilometers (x1,000)",
breaks = 70)
par(mfcol=c(2,1),
mai=c(0.8,1,0.2,1))
hist(carSales$Price/1000,
main = "",
ylab = "Frequency",
xlab = "Price (in US$ x 1000)",
breaks = 70,
las = 1,
xaxp = c(0,80,8),
yaxp = c(0,300,3))
hist(carSales$Km/1000,
main = "",
ylab = "Frequency",
xlab = "Kilometers (x1,000)",
breaks = 70,
las = 1,
xaxp = c(0,500,10),
yaxp = c(0,400,4))
par(mfcol=c(2,1),
mai=c(0.8,1,0.2,1))
hist(carSales$Price/1000,
main = "",
ylab = "Frequency",
xlab = "Price (in US$ x 1000)",
breaks = 70,
las = 1,
xaxp = c(0,80,8),
yaxp = c(0,300,3),
col = terrain.colors(12))
hist(carSales$Km/1000,
main = "",
ylab = "Frequency",
xlab = "Kilometers (x1,000)",
breaks = 70,
las = 1,
xaxp = c(0,500,10),
yaxp = c(0,400,4),
col = brewer.pal(12, "Set3"))
# par code to present graphs using a 2x1 matrix (mfcol) and to change margins (mai)
par(mfcol=c(2,1),
mai=c(0.8,1,0.2,1))
# Histogram for price values
hist(carSales$Price/1000,
main = "",
ylab = "Frequency",
xlab = "Price (in US$ x 1000)",
breaks = 70,
las = 1,
xaxp = c(0,80,8),
yaxp = c(0,400,4),
col = terrain.colors(12),
ylim = c(0,400))
abline(v = mean(carSales$Price/1000),
col = "red",
lwd = 3) # lwd for thickness
abline(v = median(carSales$Price/1000),
col = "blue",
lwd = 3)
# Histogram for kilometer values
hist(carSales$Km/1000,
main = "",
ylab = "Frequency",
xlab = "Kilometers (x1,000)",
breaks = 70,
las = 1,
xaxp = c(0,500,10),
yaxp = c(0,400,4),
col = brewer.pal(12, "Set3"),
ylim = c(0,400))
abline(v = mean(carSales$Km/1000),
col = "red",
lwd = 3)
abline(v = median(carSales$Km/1000),
col = "blue",
lwd = 3)
# Histogram for price values
hist(carSales$Price/1000,
main = "",
ylab = "Frequency",
xlab = "",
breaks = 70,
las = 1,
xaxp = c(0,80,8),
yaxp = c(0,400,4),
col = terrain.colors(12),
ylim = c(0,400))
abline(v = mean(carSales$Price/1000),
col = "red",
lwd = 3)
text(x=mean(carSales$Price/1000),
y = 390,
paste("Mean:", round(mean(carSales$Price)/1000,1),"K"),
col = "red",
cex = 0.8,
pos = 4)
abline(v = median(carSales$Price/1000),
col = "blue",
lwd = 3)
text(x=median(carSales$Price/1000),
y = 390,
paste("Median:", round(median(carSales$Price)/1000,1),"K"),
col = "blue",
cex = 0.8,
pos = 2)
text(x = 80,
y = 380,
paste("Price (in US$ x 1,000)"),
cex = 1,
pos = 2,
col = "#003366")
Histogram for kilometer values
# Histogram for kilometer values
hist(carSales$Km/1000,
main = "",
ylab = "Frequency",
xlab = "",
breaks = 70,
las = 1,
xaxp = c(0,500,10),
yaxp = c(0,400,4),
col = brewer.pal(12, "Set3"),
ylim = c(0,400))
abline(v = mean(carSales$Km/1000),
col = "red",
lwd = 3)
text(x=mean(carSales$Km/1000),
y = 390,
paste("Mean:", round(mean(carSales$Km)/1000,1),"K"),
col = "red",
cex = 0.8,
pos = 4)
abline(v = median(carSales$Km/1000),
col = "blue",
lwd = 3)
text(x=median(carSales$Km/1000),
y = 390,
paste("Median:", round(median(carSales$Km)/1000,1),"K"),
col = "blue",
cex = 0.8,
pos = 2)
text(x = 450,
y = 380,
paste("Km (x 1,000)"),
cex = 1,
pos = 2,
col = "#003366")
Histogram example 1
Observe all the extras that were added to the histogram below.
par(mai = c(1,1.2,0.5,0.4))
hist(carSales$Price,
breaks = seq(0,80000,1000),
main = "",
ylab = "Frequency",
xlab = "",
col.lab="red", # Change color of the x- and y-axis labels
ylim = c(0,400),
col = brewer.pal(8, "Pastel2"),
las = 2,
xaxp = c(0,80000,16))
# Vertical line and text for the mean
abline(v = mean(carSales$Price),
col = "blue",
lwd = 2)
text(x=mean(carSales$Price),
y = 390,
paste("Mean:", round(mean(carSales$Price/1000),1),"K"),
col = "blue",
cex = 0.8,
pos = 4)
# Vertical line and text for the median
abline(v = median(carSales$Price),
col = "red",
lwd = 2)
text(x=median(carSales$Price),
y = 390,
paste("Median:", round(median(carSales$Price/1000),1),"K"),
col = "red",
cex = 0.8,
pos = 2)
# Text codes for skewness and kurtosis
text(x=80000,
y = 185,
paste("Skewness:", round(skewness(carSales$Price),3)),
col = "#006633",
cex = 1,
pos = 2)
text(x=80000,
y = 135,
paste("Kurtosis:", round(kurtosis(carSales$Price),3)),
col = "#006633",
cex = 1,
pos = 2)
mtext("Price distribution",
side = 1,
line = 3.7,
cex=0.9,
col = "red")
# Add horizontal dotted grey lines
abline(h = seq(0,350,50), col = "grey", lty="dotted")
Adding data values to columns
# Histogram for kilometer values
hist(carSales$Km/1000,
main = "",
ylab = "Frequency",
xlab = "",
breaks = 20,
las = 1,
xaxp = c(0,500,10),
yaxp = c(0,1500,5),
col = brewer.pal(12, "Set3"),
ylim = c(0,1500),
labels = TRUE) # labels = TRUE prints the count on top of each bar
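When more control over the values on top of the bars is needed (size, position, color), one option is to draw them manually. The sketch below, on simulated data, stores the histogram object and places its counts at the bin midpoints with text(). The seed and object names are assumptions for illustration.

```r
set.seed(3)  # hypothetical seed
x <- rnorm(500, mean = 150, sd = 25)

# Compute the histogram first, without drawing it
h <- hist(x, breaks = 20, plot = FALSE)

# Draw it with extra room at the top for the labels
plot(h,
     main = "",
     xlab = "Simulated values",
     las = 1,
     ylim = c(0, max(h$counts) * 1.15))

# h$mids holds the bin midpoints and h$counts the bar heights
text(x = h$mids, y = h$counts, labels = h$counts, pos = 3, cex = 0.7)
```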
Lattice library to observe continuous data per category
Using library(lattice) to separate and present continuous data in histograms according to the levels of a factor (the categories of a categorical variable).
# Note: par() settings such as mfcol generally do not affect lattice plots; each call below draws its own plot
histogram(~ Km/1000 | FuelType, data=carSales,
las=1,
main="Km (x 1,000) by fuel type",
xlab = "",
breaks = 20,
col = brewer.pal(8,"Pastel1"))
histogram(~ Km/1000 | Transmission, data=carSales,
las=1,
main="Km (x 1,000) by transmission",
xlab = "",
breaks = 20,
col = brewer.pal(8,"Pastel1"))
histogram(~ Km/1000 | Owner, data=carSales,
las=1,
main="Km (x 1,000) by owner",
xlab = "",
breaks = 20,
col = brewer.pal(8,"Pastel1"))
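For readers without the carSales file, the same idea can be tried on the built-in faithful data set. This is a minimal sketch: the grouping variable wait_group, its cut points, and the object names are assumptions created only to have a factor to condition on.

```r
library(lattice)
data("faithful")

# Build a categorical variable from the waiting time (hypothetical grouping)
eruption_data <- faithful
eruption_data$wait_group <- cut(eruption_data$waiting,
                                breaks = c(40, 60, 80, 100),
                                labels = c("40-60 min", "60-80 min", "80-100 min"))

# One histogram panel per level of the factor, stacked in one column
p <- histogram(~ eruptions | wait_group,
               data = eruption_data,
               nint = 20,                        # number of bins
               layout = c(1, 3),                 # 1 column, 3 rows of panels
               xlab = "Eruption duration (min)",
               main = "Eruptions by waiting time")
print(p)  # lattice objects are drawn when printed
```

The layout argument arranges the panels inside a single lattice plot, which is the lattice counterpart of arranging base graphs with par().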
Add a normal distribution line
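The code in this section overlays a line computed with density(), which is a smoothed estimate that follows the data. As a complement, a theoretical normal curve based on the sample mean and standard deviation can be sketched with dnorm() and curve(). The data below are simulated, and the seed and object names are assumptions.

```r
set.seed(5)  # hypothetical seed
values <- rnorm(1000, mean = 150, sd = 25)
mean_v <- mean(values)
sd_v   <- sd(values)

# probability = TRUE puts densities on the y-axis, so curves can be overlaid
hist(values,
     breaks = 25,
     probability = TRUE,
     main = "",
     xlab = "Simulated values",
     las = 1)

# Theoretical normal curve (x is the plotting variable used by curve())
curve(dnorm(x, mean = mean_v, sd = sd_v), add = TRUE, col = "blue", lwd = 2)

# Smoothed density estimate of the data, for comparison
lines(density(values), col = "red", lty = 2)
```

If the data are close to normal, the two lines should nearly coincide; clear gaps between them point to skewness or heavy tails.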
hist(sample1,
breaks = 25,
xlim = c(75, 225),
prob = TRUE,
las=1,
col = brewer.pal(12, "Set3"),
ylim=c(0,0.02),
xaxp=c(75,225,10))
lines(density(sample1, adjust=2), col="red")
Histograms using GGPLOT
cInd_GasDies = carSales %>%
filter(FuelType %in% c('Gasoline', 'Diesel'))
hista1 = cInd_GasDies %>%
ggplot(aes(Km/1000)) +
geom_histogram(fill="#A11515",
color="white",
binwidth=10)+
theme_minimal()
hista2 = cInd_GasDies %>%
ggplot(aes(Km/1000, fill=Owner)) +
geom_density(alpha=0.4)+
theme_minimal()
grid.arrange(hista1, hista2, ncol=1, nrow=2)
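For readers without the carSales file, here is a minimal ggplot2 sketch on the built-in faithful data set. It combines a histogram with a vertical line for the mean, mirroring what abline() does in the base graphs. The choice of variable, bin width, and object name are assumptions for illustration.

```r
library(ggplot2)
data("faithful")

p <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5,        # bin width in minutes
                 fill = "#A11515",
                 color = "white") +
  geom_vline(xintercept = mean(faithful$waiting),  # vertical line at the mean
             color = "blue",
             linetype = "dashed") +
  labs(x = "Waiting time (min)", y = "Frequency") +
  theme_minimal()
print(p)
```

In geom_histogram(), binwidth sets the width of the bins directly; the bins argument can be used instead to request a number of bins.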
# par() code to change the background, foreground color, and fonts.
par(font = 1,
font.lab = 3,
font.axis = 3,
font.main = 1,
fg = "#CC0000",
bg = "#D9F5F4",
cex = 0.8,
cex.lab = 0.4,
srt = 90,
tcl = -0.4,
ylbias = 0.6,
pty = "m")
# cex changes the size of all graph fonts
# srt changes the rotation angle of plotted text and symbols, such as the values on top of the bars
# tcl changes the length and direction of the tick marks on the vertical and horizontal axes
# Histogram sample1
hist(sample1,
breaks = 15,
main = "Sample 1",
las = 2,
labels = TRUE,
col = terrain.colors(22),
cex.main = 0.8,
cex.lab = 0.8,
cex.axis = 0.8,
cex = 0.1,
border = "blue")
# par() code to change the background, foreground color, and fonts.
par(font = 1,
font.lab = 3,
font.axis = 3,
font.main = 1,
fg = "#CC0000",
bg = "#D9F5F4",
cex = 0.8,
cex.lab = 0.4,
srt = 90,
tcl = -0.4,
ylbias = 0.6,
pty = "m")
# Histogram sample1 Prob
hist(sample1,
prob = T,
breaks = 15,
main = "Sample 1 Probability",
las = 2,
ylim = c(0,0.02),
labels = TRUE,
col = terrain.colors(22),
cex.main = 0.8,
cex.lab = 0.8,
cex.axis = 0.8,
cex = 0.1,
border = "blue")
Disclaimer: This short manual series is a work in progress. Until otherwise clearly stated, this material is considered a draft version.