|
Northeastern University Introduction to data analysis using R, R Studio and R Markdown |
| Short manual series: Histograms |
# {r library_data, message=FALSE, warning=FALSE}
# Libraries
library(dplyr)
library(ggplot2)
library(readxl)
library(gridExtra)
library(RColorBrewer)
library(lattice)
# Data sets
sample1 = rnorm(1000, mean = 150, sd = 25)
sample2 = runif(1000, min = 10, max =390)
data("faithful")
data("mpg")
# Data sets
carSales <- read_excel("DataSets/carSales.xlsx")Introduction
Histograms are graphs that allow the observation and analysis of continuous data distribution and behavior.A histogram allows to get insights of the shape of the data distribution in terms of normality, skewness and kurtosis. These last two values can be confirmed using the codes skewness(dataset$variable) and kurtosis(dataset$variable), from library(e1071).
Let’s create two random samples, observe the R chunk library_data.
A first sample was created using code: sample1 = rnorm(1000, mean = 300, sd = 25), it produces 1000 random values with mean 300 and standard deviation 25.
A second sample was created using code: sample2 = runif(1000, min = 10, max =390), it produces 1000 random values from a minimum value of 10 and a maximum value of 390. All values on both codes can be changed by choice.
Let’s create the first histogram.
The code hist() is from the basic library(graphics).
The four histograms are produced in order: (1) basic, (2) add breaks to increase number of bins and improve data resolution for better analysis, (3) adding colors, and (4) changing y-axis values direction and limits. Notice the change in the y-axis size between a and 2; this occurs since there are more bins and the number of observations per bin is therefore smaller.
# Par code to present 4 figures in a 4x4 matrix and to increase margin size
par(mfrow=c(2,2), mai = c(0.5,1,0.5,0.2))
# 1.Basic histogram
hist(sample1,
main = "1")
# 2. Increase number of bins
hist(sample1,
main = "2",
breaks = 50,
ylab = "")
# 3. Add colors to improve visualization
hist(sample1,
main = "3",
breaks = 50,
col = brewer.pal(12, "Set3"))
# 4. Add colors to improve visualization
hist(sample1,
main = "4",
breaks = 75,
ylab = "",
col = brewer.pal(12, "Set3"),
las = 1,
ylim = c(0,40))In histogram 4, breaks are set using cose seq(), which contains 3 elements: minimum, maximum, and bin size.
# Par code to present 4 figures in a 4x4 matrix and to increase margin size
par(mfrow=c(2,2), mai = c(0.5,1,0.5,0.2))
# 1.Basic histogram
hist(sample2,
main = "1")
# 2. Increase number of bins
hist(sample2,
main = "2",
breaks = 50,
ylab = "")
# 3. Add colors to improve visualization
hist(sample2,
main = "3",
breaks = 50,
col = brewer.pal(12, "Set3"))
# 4. Y-axis orientation, breaks, x-y axes limits.
hist(sample2,
main = "4",
breaks = seq(0,400,20),
ylab = "",
col = brewer.pal(12, "Set3"),
las = 1,
ylim = c(0,80),
xlim = c(0,400))First observe the histogram, then add internal code: labels = TRUE, knit again and observe changes.
# Par() code to present graphs in a 2x1 format, change background, fonts.
par(mfcol = c(1,2),
font = 1,
font.lab = 3,
font.axis = 3,
font.main = 1,
fg = "#CC0000",
bg = "#D9F5F4",
cex = 0.8,
cex.lab = 0.4,
srt = 90,
tcl = -0.4,
ylbias = 0.6,
pty = "m")
#cex changes size of all graph fonts
#srt changes angle of values on top of bars (plotting text and symbols)
#tcl changes the side and direction of thick marks on vertical and horizontal axes
# Histogram sample1
hist(sample1,
breaks = 15,
main = "Sample 1",
las = 2,
ylim = c(0,200),
labels = TRUE,
col = terrain.colors(22),
cex.main = 0.8,
cex.lab = 0.8,
cex.axis = 0.8,
cex = 0.1,
border = "blue")
# Histogram sample1 Prob
hist(sample1,
prob = T,
breaks = 15,
main = "Sample 1 Probability",
las = 2,
ylim = c(0,0.02),
labels = TRUE,
col = terrain.colors(22),
cex.main = 0.8,
cex.lab = 0.8,
cex.axis = 0.8,
cex = 0.1,
border = "blue")The shape of the histogram reflect the nature of the data, it is normally distributed.
Some data sets contain a lof of information and adding more breaks produce a more defined graphs. Compare the histogram when we use 25 breaks in the R chunk below, observe them on the HTML report.
Adding extras to histograms
For a complete list of options, remember to always use in the console (not in your R Markdown file) help “?” to obtain the full information of any code or package. In this case use ?hist.
Because sample 2 is random, I will ask for the x-axis limits to go from the minimum to the maximum value.
hist(sample1,
breaks = 35,
xlim = c(min(sample1),max(sample1)),
ylim = c(0,100),
freq = TRUE,
ylab = "Frequency",
xlab = "Kilometers register on cars",
col = "coral"
)Third histogram
For the third histogram we will use an actual data set. We will use public data set: mpg
Important: when you use a public library you need to check the package that contains it. In this case, when you use the ?mpg code, at the beginning of the help section you will see mpg {ggplot2}, that means it needs that package and you need to activate that library. If needed, I activate packages from the console, not inside R Markdown, and I call libraries using the first {r, include=FALSE} R chunk on top of this document.
My personal preference is also to call any data set using the first {r, include=FALSE} R chunk, go there and see it.
First run some basic codes to get a quick understanding of the set. Start by checking this data set information using ?mpg
Question: Which codes would you use? Answer that question before reading the codes below.
Observe how I use the code strategy print(paste()) to present the values on the report.
## [1] "Data set Fuel economy 1999-2008 contains 11 variables."
## [1] "Data set Fuel economy 1999-2008 contains 234 observations."
Data set Fuel economy 1999-2008 variables are:
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
We will use the variable hwy: Highway efficiency in miles per gallon.
Notice how I will include several codes on the same R chunk. I will present the histogram on the HTML report but not the long codes.
efficiency = mpg$hwy
bins = seq(min(efficiency), max(efficiency),2)
hist(efficiency,
breaks = bins,
xlim = c(min(efficiency),max(efficiency)),
ylab = "Frequencies",
xlab = "Car efficiency, highway miles per gallon",
col = "#CCFFE5")The breaks = bins code above uses a vector instead of a number. In this case, I customize the bins to the data in use. I basically ask: from this value to this value, create bins every 2 units. In this case, 2 is the width of each bin.
For example, from 2 to 10, create bins every 2 units. In this case, bins will be created to contain observations with values 2 to 4, 4 to 6, 6 to 8, 8 to 10.
For any data set, I can use codes min() and max() to set the limits, from the minimum value to the maximum value, create bins every 2 units. Practice changing to 3, 4, or more the size of each bin.
Add a normal distribution line
hist(sample1,
breaks = 25,
xlim = c(0, 400),
ylim=c(0,0.02),
prob = TRUE,
las=1,
col =terrain.colors(16))
lines(density(sample1, adjust=3), col="red")Pay attention to the following code: rnorm(length(), mean(), sd()).
normline = rnorm(length(sample1), mean(sample1), sd(sample1))
hist(sample1,
breaks = 25,
xlim = c(200, 400),
prob = TRUE,
las=1,
col =terrain.colors(16),
ylim=c(0,0.02))
lines(density(normline, adjust=2), col="red")Add vertical lines
Using lattice() library to separate continuous data in histograms according to the levels of a factor (categories of a categorical variable)
par(mfcol=c(3,1))
histogram(~ Km/1000 | FuelType, data=carSales,
las=1,
main="Price",
xlab = "",
breaks = 20,
col = brewer.pal(8,"Pastel1"))histogram(~ Km/1000 | Transmission, data=carSales,
las=1,
main="Price",
xlab = "",
breaks = 20,
col = brewer.pal(8,"Pastel1"))histogram(~ Km/1000 | Owner, data=carSales,
las=1,
main="Price",
xlab = "",
breaks = 20,
col = brewer.pal(8,"Pastel1"))
Histograms using GGPLOT
## Location Year FuelType Transmission
## Length:5844 Min. :2001 Length:5844 Length:5844
## Class :character 1st Qu.:2012 Class :character Class :character
## Mode :character Median :2014 Mode :character Mode :character
## Mean :2014
## 3rd Qu.:2016
## Max. :2020
## Owner Efficiency Engine_cc Power_bhp
## Length:5844 Min. : 8.00 Min. :1000 Min. : 50.0
## Class :character 1st Qu.:16.00 1st Qu.:1500 1st Qu.:100.0
## Mode :character Median :18.00 Median :1500 Median :100.0
## Mean :18.13 Mean :1790 Mean :139.3
## 3rd Qu.:22.00 3rd Qu.:2000 3rd Qu.:150.0
## Max. :30.00 Max. :6000 Max. :600.0
## Seats Km Price
## Min. : 2.000 Min. : 5539 Min. : 800
## 1st Qu.: 5.000 1st Qu.: 63335 1st Qu.:19770
## Median : 5.000 Median : 86053 Median :30854
## Mean : 5.171 Mean : 92188 Mean :28440
## 3rd Qu.: 5.000 3rd Qu.:112950 3rd Qu.:37835
## Max. :10.000 Max. :452805 Max. :63496
## # A tibble: 11 x 2
## Location n
## <chr> <int>
## 1 Asuncion 519
## 2 Bogota 643
## 3 BuenosAires 401
## 4 Caracas 343
## 5 LaPaz 475
## 6 Lima 626
## 7 Montevideo 590
## 8 PanamaCity 773
## 9 Quito 713
## 10 SanJose 543
## 11 Santiago 218
## # A tibble: 20 x 2
## Year n
## <dbl> <int>
## 1 2001 6
## 2 2002 18
## 3 2003 14
## 4 2004 25
## 5 2005 42
## 6 2006 66
## 7 2007 105
## 8 2008 159
## 9 2009 189
## 10 2010 330
## 11 2011 451
## 12 2012 555
## 13 2013 637
## 14 2014 792
## 15 2015 708
## 16 2016 685
## 17 2017 536
## 18 2018 280
## 19 2019 194
## 20 2020 52
## # A tibble: 3 x 2
## FuelType n
## <chr> <int>
## 1 CNG 65
## 2 Diesel 3134
## 3 Gasoline 2645
## # A tibble: 2 x 2
## Transmission n
## <chr> <int>
## 1 Automatic 1681
## 2 Manual 4163
## # A tibble: 4 x 2
## Owner n
## <chr> <int>
## 1 First 4079
## 2 Fourth 114
## 3 Second 916
## 4 Third 735
## # A tibble: 10 x 2
## Efficiency n
## <dbl> <int>
## 1 8 18
## 2 10 116
## 3 12 822
## 4 16 1242
## 5 18 1796
## 6 20 320
## 7 22 910
## 8 26 601
## 9 28 5
## 10 30 14
## # A tibble: 7 x 2
## Engine_cc n
## <dbl> <int>
## 1 1000 675
## 2 1500 2832
## 3 2000 1168
## 4 2500 712
## 5 3000 429
## 6 4000 16
## 7 6000 12
## # A tibble: 8 x 2
## Power_bhp n
## <dbl> <int>
## 1 50 161
## 2 100 2952
## 3 150 1546
## 4 200 839
## 5 300 287
## 6 400 47
## 7 500 7
## 8 600 5
## # A tibble: 6 x 2
## Seats n
## <dbl> <int>
## 1 2 13
## 2 4 99
## 3 5 4891
## 4 6 701
## 5 8 133
## 6 10 7
## # A tibble: 7 x 2
## Engine_cc n
## <dbl> <int>
## 1 1000 675
## 2 1500 2832
## 3 2000 1168
## 4 2500 712
## 5 3000 429
## 6 4000 16
## 7 6000 12
cInd_GasDies = carSales %>%
filter(FuelType %in% c('Gasoline', 'Diesel'))
hista1 = cInd_GasDies %>%
ggplot(aes(Km/1000)) +
geom_histogram(fill="#A11515",
color="white",
binwidth=10)+
theme_minimal()
hista2 = cInd_GasDies %>%
ggplot(aes(Km/1000, fill=Owner)) +
geom_density(alpha=0.4)+
theme_minimal()
grid.arrange(hista1, hista2, ncol=2, nrow=1)