Dee Chiluiza, PhD
Northeastern University
Introduction to data analysis using R, R Studio and R Markdown

 
Short manual series: Histograms
# {r library_data, message=FALSE, warning=FALSE}

# Libraries
library(dplyr) 
library(ggplot2)
library(readxl)
library(gridExtra)
library(RColorBrewer)
library(lattice)

# Data sets
sample1 = rnorm(1000, mean = 150, sd = 25)
sample2 = runif(1000, min = 10, max =390)

data("faithful")
data("mpg")

# Data sets
carSales <- read_excel("DataSets/carSales.xlsx")

Introduction

Histograms are graphs that allow the observation and analysis of continuous data distribution and behavior.

A histogram allows to get insights of the shape of the data distribution in terms of normality, skewness and kurtosis. These last two values can be confirmed using the codes skewness(dataset$variable) and kurtosis(dataset$variable), from library(e1071).

Let’s create two random samples, observe the R chunk library_data.
A first sample was created using code: sample1 = rnorm(1000, mean = 300, sd = 25), it produces 1000 random values with mean 300 and standard deviation 25.
A second sample was created using code: sample2 = runif(1000, min = 10, max =390), it produces 1000 random values from a minimum value of 10 and a maximum value of 390. All values on both codes can be changed by choice.

Let’s create the first histogram.

The code hist() is from the basic library(graphics).

The four histograms are produced in order: (1) basic, (2) add breaks to increase number of bins and improve data resolution for better analysis, (3) adding colors, and (4) changing y-axis values direction and limits. Notice the change in the y-axis size between a and 2; this occurs since there are more bins and the number of observations per bin is therefore smaller.

# Par code to present 4 figures in a 4x4 matrix and to increase margin size
par(mfrow=c(2,2), mai = c(0.5,1,0.5,0.2))

# 1.Basic histogram
hist(sample1,
     main = "1")

# 2. Increase number of bins
hist(sample1, 
     main = "2",
     breaks = 50,
     ylab = "")

# 3. Add colors to improve visualization
hist(sample1, 
     main = "3",
     breaks = 50,
     col = brewer.pal(12, "Set3"))

# 4. Add colors to improve visualization
hist(sample1, 
     main = "4",
     breaks = 75,
     ylab = "",
     col = brewer.pal(12, "Set3"),
     las = 1,
     ylim = c(0,40))

Second histogram

In histogram 4, breaks are set using cose seq(), which contains 3 elements: minimum, maximum, and bin size.

# Par code to present 4 figures in a 4x4 matrix and to increase margin size
par(mfrow=c(2,2), mai = c(0.5,1,0.5,0.2))

# 1.Basic histogram
hist(sample2,
     main = "1")

# 2. Increase number of bins
hist(sample2, 
     main = "2",
     breaks = 50,
     ylab = "")

# 3. Add colors to improve visualization
hist(sample2, 
     main = "3",
     breaks = 50,
     col = brewer.pal(12, "Set3"))

# 4. Y-axis orientation, breaks, x-y axes limits.
hist(sample2, 
     main = "4",
     breaks = seq(0,400,20),
     ylab = "",
     col = brewer.pal(12, "Set3"),
     las = 1,
     ylim = c(0,80),
     xlim = c(0,400))

First observe the histogram, then add internal code: labels = TRUE, knit again and observe changes.

# Par() code to present graphs in a 2x1 format, change background, fonts.

par(mfcol = c(1,2), 
    font = 1, 
    font.lab = 3, 
    font.axis = 3, 
    font.main = 1, 
    fg = "#CC0000", 
    bg = "#D9F5F4", 
    cex = 0.8, 
    cex.lab = 0.4, 
    srt = 90, 
    tcl = -0.4, 
    ylbias = 0.6, 
    pty = "m")

#cex changes size of all graph fonts
#srt changes angle of values on top of bars (plotting text and symbols)
#tcl changes the side and direction of thick marks on vertical and horizontal axes 

# Histogram sample1
hist(sample1,
     breaks = 15,
     main = "Sample 1",
     las = 2,
     ylim = c(0,200), 
     labels = TRUE,
     col = terrain.colors(22),
     cex.main = 0.8,
     cex.lab = 0.8,
     cex.axis = 0.8,
     cex = 0.1,
     border = "blue")

# Histogram sample1 Prob
hist(sample1,
     prob = T,
     breaks = 15,
     main = "Sample 1 Probability",
     las = 2,
     ylim = c(0,0.02), 
     labels = TRUE,
     col = terrain.colors(22),
     cex.main = 0.8,
     cex.lab = 0.8,
     cex.axis = 0.8,
     cex = 0.1,
     border = "blue")

The shape of the histogram reflect the nature of the data, it is normally distributed.

Some data sets contain a lof of information and adding more breaks produce a more defined graphs. Compare the histogram when we use 25 breaks in the R chunk below, observe them on the HTML report.


Adding extras to histograms

For a complete list of options, remember to always use in the console (not in your R Markdown file) help “?” to obtain the full information of any code or package. In this case use ?hist.

Because sample 2 is random, I will ask for the x-axis limits to go from the minimum to the maximum value.

hist(sample1,
     breaks = 35,
     xlim = c(min(sample1),max(sample1)),
     ylim = c(0,100),
     freq = TRUE,
     ylab = "Frequency",
     xlab = "Kilometers register on cars",
     col = "coral"
     )

Third histogram

For the third histogram we will use an actual data set. We will use public data set: mpg

Important: when you use a public library you need to check the package that contains it. In this case, when you use the ?mpg code, at the beginning of the help section you will see mpg {ggplot2}, that means it needs that package and you need to activate that library. If needed, I activate packages from the console, not inside R Markdown, and I call libraries using the first {r, include=FALSE} R chunk on top of this document.

My personal preference is also to call any data set using the first {r, include=FALSE} R chunk, go there and see it.
First run some basic codes to get a quick understanding of the set. Start by checking this data set information using ?mpg

Question: Which codes would you use? Answer that question before reading the codes below.

Observe how I use the code strategy print(paste()) to present the values on the report.

print(paste("Data set Fuel economy 1999-2008 contains", ncol(mpg), "variables."))
## [1] "Data set Fuel economy 1999-2008 contains 11 variables."
print(paste("Data set Fuel economy 1999-2008 contains", nrow(mpg), "observations."))
## [1] "Data set Fuel economy 1999-2008 contains 234 observations."

Data set Fuel economy 1999-2008 variables are:

names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"

We will use the variable hwy: Highway efficiency in miles per gallon.

Notice how I will include several codes on the same R chunk. I will present the histogram on the HTML report but not the long codes.

efficiency = mpg$hwy

bins = seq(min(efficiency), max(efficiency),2)  

hist(efficiency,
     breaks = bins,
     xlim = c(min(efficiency),max(efficiency)),
     ylab = "Frequencies",
     xlab = "Car efficiency, highway miles per gallon",
     col = "#CCFFE5")

The breaks = bins code above uses a vector instead of a number. In this case, I customize the bins to the data in use. I basically ask: from this value to this value, create bins every 2 units. In this case, 2 is the width of each bin.

For example, from 2 to 10, create bins every 2 units. In this case, bins will be created to contain observations with values 2 to 4, 4 to 6, 6 to 8, 8 to 10.

For any data set, I can use codes min() and max() to set the limits, from the minimum value to the maximum value, create bins every 2 units. Practice changing to 3, 4, or more the size of each bin.

Add a normal distribution line

hist(sample1,
     breaks = 25,
     xlim = c(0, 400),
     ylim=c(0,0.02),
     prob = TRUE,
     las=1,
     col =terrain.colors(16))

lines(density(sample1, adjust=3), col="red")

Pay attention to the following code: rnorm(length(), mean(), sd()).

normline = rnorm(length(sample1), mean(sample1), sd(sample1))

hist(sample1,
     breaks = 25,
     xlim = c(200, 400),
     prob = TRUE,
     las=1,
     col =terrain.colors(16),
     ylim=c(0,0.02))
lines(density(normline, adjust=2), col="red")

Add vertical lines


Using lattice() library to separate continuous data in histograms according to the levels of a factor (categories of a categorical variable)

par(mfcol=c(3,1))

histogram(~ Km/1000 | FuelType, data=carSales,
           las=1,
           main="Price",
           xlab = "",
           breaks = 20,
          col = brewer.pal(8,"Pastel1"))

histogram(~ Km/1000 | Transmission, data=carSales,
           las=1,
           main="Price",
           xlab = "",
           breaks = 20,
          col = brewer.pal(8,"Pastel1"))

histogram(~ Km/1000 | Owner, data=carSales,
           las=1,
           main="Price",
           xlab = "",
           breaks = 20,
          col = brewer.pal(8,"Pastel1"))

Histograms using GGPLOT

carSales %>% summary()
##    Location              Year        FuelType         Transmission      
##  Length:5844        Min.   :2001   Length:5844        Length:5844       
##  Class :character   1st Qu.:2012   Class :character   Class :character  
##  Mode  :character   Median :2014   Mode  :character   Mode  :character  
##                     Mean   :2014                                        
##                     3rd Qu.:2016                                        
##                     Max.   :2020                                        
##     Owner             Efficiency      Engine_cc      Power_bhp    
##  Length:5844        Min.   : 8.00   Min.   :1000   Min.   : 50.0  
##  Class :character   1st Qu.:16.00   1st Qu.:1500   1st Qu.:100.0  
##  Mode  :character   Median :18.00   Median :1500   Median :100.0  
##                     Mean   :18.13   Mean   :1790   Mean   :139.3  
##                     3rd Qu.:22.00   3rd Qu.:2000   3rd Qu.:150.0  
##                     Max.   :30.00   Max.   :6000   Max.   :600.0  
##      Seats              Km             Price      
##  Min.   : 2.000   Min.   :  5539   Min.   :  800  
##  1st Qu.: 5.000   1st Qu.: 63335   1st Qu.:19770  
##  Median : 5.000   Median : 86053   Median :30854  
##  Mean   : 5.171   Mean   : 92188   Mean   :28440  
##  3rd Qu.: 5.000   3rd Qu.:112950   3rd Qu.:37835  
##  Max.   :10.000   Max.   :452805   Max.   :63496
carSales %>% count(Location)
## # A tibble: 11 x 2
##    Location        n
##    <chr>       <int>
##  1 Asuncion      519
##  2 Bogota        643
##  3 BuenosAires   401
##  4 Caracas       343
##  5 LaPaz         475
##  6 Lima          626
##  7 Montevideo    590
##  8 PanamaCity    773
##  9 Quito         713
## 10 SanJose       543
## 11 Santiago      218
carSales %>% count(Year)
## # A tibble: 20 x 2
##     Year     n
##    <dbl> <int>
##  1  2001     6
##  2  2002    18
##  3  2003    14
##  4  2004    25
##  5  2005    42
##  6  2006    66
##  7  2007   105
##  8  2008   159
##  9  2009   189
## 10  2010   330
## 11  2011   451
## 12  2012   555
## 13  2013   637
## 14  2014   792
## 15  2015   708
## 16  2016   685
## 17  2017   536
## 18  2018   280
## 19  2019   194
## 20  2020    52
carSales %>% count(FuelType)
## # A tibble: 3 x 2
##   FuelType     n
##   <chr>    <int>
## 1 CNG         65
## 2 Diesel    3134
## 3 Gasoline  2645
carSales %>% count(Transmission)
## # A tibble: 2 x 2
##   Transmission     n
##   <chr>        <int>
## 1 Automatic     1681
## 2 Manual        4163
carSales %>% count(Owner)
## # A tibble: 4 x 2
##   Owner      n
##   <chr>  <int>
## 1 First   4079
## 2 Fourth   114
## 3 Second   916
## 4 Third    735
carSales %>% count(Efficiency)
## # A tibble: 10 x 2
##    Efficiency     n
##         <dbl> <int>
##  1          8    18
##  2         10   116
##  3         12   822
##  4         16  1242
##  5         18  1796
##  6         20   320
##  7         22   910
##  8         26   601
##  9         28     5
## 10         30    14
carSales %>% count(Engine_cc)
## # A tibble: 7 x 2
##   Engine_cc     n
##       <dbl> <int>
## 1      1000   675
## 2      1500  2832
## 3      2000  1168
## 4      2500   712
## 5      3000   429
## 6      4000    16
## 7      6000    12
carSales %>% count(Power_bhp)
## # A tibble: 8 x 2
##   Power_bhp     n
##       <dbl> <int>
## 1        50   161
## 2       100  2952
## 3       150  1546
## 4       200   839
## 5       300   287
## 6       400    47
## 7       500     7
## 8       600     5
carSales %>% count(Seats)
## # A tibble: 6 x 2
##   Seats     n
##   <dbl> <int>
## 1     2    13
## 2     4    99
## 3     5  4891
## 4     6   701
## 5     8   133
## 6    10     7
carSales %>% count(Engine_cc)
## # A tibble: 7 x 2
##   Engine_cc     n
##       <dbl> <int>
## 1      1000   675
## 2      1500  2832
## 3      2000  1168
## 4      2500   712
## 5      3000   429
## 6      4000    16
## 7      6000    12
cInd_GasDies = carSales %>% 
        filter(FuelType %in% c('Gasoline', 'Diesel'))

hista1 = cInd_GasDies %>% 
        ggplot(aes(Km/1000)) +
        geom_histogram(fill="#A11515",
                       color="white",
                       binwidth=10)+
        theme_minimal()

hista2 = cInd_GasDies %>% 
        ggplot(aes(Km/1000, fill=Owner)) +
        geom_density(alpha=0.4)+
        theme_minimal()

grid.arrange(hista1, hista2, ncol=2, nrow=1)