Disclaimer
Please do not cite this document. Data and syntax can be reused freely
while keep attributing and refering to the original source (where
applicable) in the reference. This document is used for educational
purpose only.
Benefits of using Histogram:
1. To give immediate information about central tendency of the data (the
mean, median, mode) and the dispersion and variability of data
2. To recognize general pattern of data distribution of one’s
interest
3. To identify the outlier, extreme, and unusual observation
Aside from base, ggplot2, and other
packages, R package that also can produce Histogram is
graphics packages. Please install and activate the library
prior to graph execution.
#install.packages("graphics")
library(graphics)
We are going to use Low Birth Weight data from Hosmer & Lemeshow
(2013) that supplied from different packages aplore3
(stands for Applied Logistic Regression 3rd Ed.)
#install.packages("aplore3")
library(aplore3)
Let us inspect the low birth weight data:
head(lowbwt, n = 10)
## id low age lwt race smoke ptl ht ui ftv bwt
## 1 4 < 2500 g 28 120 Other Yes One No Yes None 709
## 2 10 < 2500 g 29 130 White No None No Yes Two, etc. 1021
## 3 11 < 2500 g 34 187 Black Yes None Yes No None 1135
## 4 13 < 2500 g 25 105 Other No One Yes No None 1330
## 5 15 < 2500 g 25 85 Other No None No Yes None 1474
## 6 16 < 2500 g 27 150 Other No None No No None 1588
## 7 17 < 2500 g 23 97 Other No None No Yes One 1588
## 8 18 < 2500 g 24 128 Black No One No No One 1701
## 9 19 < 2500 g 24 132 Other No None Yes No None 1729
## 10 20 < 2500 g 21 165 White Yes None Yes No One 1790
names(lowbwt)
## [1] "id" "low" "age" "lwt" "race" "smoke" "ptl" "ht" "ui"
## [10] "ftv" "bwt"
str(lowbwt)
## 'data.frame': 189 obs. of 11 variables:
## $ id : int 4 10 11 13 15 16 17 18 19 20 ...
## $ low : Factor w/ 2 levels ">= 2500 g","< 2500 g": 2 2 2 2 2 2 2 2 2 2 ...
## $ age : int 28 29 34 25 25 27 23 24 24 21 ...
## $ lwt : int 120 130 187 105 85 150 97 128 132 165 ...
## $ race : Factor w/ 3 levels "White","Black",..: 3 1 2 3 3 3 3 2 3 1 ...
## $ smoke: Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 2 ...
## $ ptl : Factor w/ 3 levels "None","One","Two, etc.": 2 1 1 2 1 1 1 2 1 1 ...
## $ ht : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 1 1 2 2 ...
## $ ui : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 1 1 1 ...
## $ ftv : Factor w/ 3 levels "None","One","Two, etc.": 1 3 1 1 1 1 2 2 1 2 ...
## $ bwt : int 709 1021 1135 1330 1474 1588 1588 1701 1729 1790 ...
The above example is worked when the data frame is embedded on the
package itself. What if we only had it externally in our working
directory?. As a dummy, let us export the lowbwt data and
store it in the same assigned working directory.
#write.csv(lowbwt,"/Users/sachnazdes/Documents/2022-IPB/MATA KULIAH/STA563 - Eksplorasi dan Visualisasi Data/lowbwt.csv", row.names = TRUE)
data.lahir <- read.csv("lowbwt.csv")
colnames(data.lahir)
## [1] "X" "id" "low" "age" "lwt" "race" "smoke" "ptl" "ht"
## [10] "ui" "ftv" "bwt"
#dim(data.lahir)
#View(data.lahir)
The data comprises 11 variables:
Be cautious on class of variables after importing. Specify the class
so that it satisfies the analysis tool requirement.
Coerce multiple variables in data frame into factor at once:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cols <- c("low","race","smoke","ptl", "ht", "ui", "ftv")
data.lahir[cols] <- lapply(data.lahir[cols], factor)
str(data.lahir)
## 'data.frame': 189 obs. of 12 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : int 4 10 11 13 15 16 17 18 19 20 ...
## $ low : Factor w/ 2 levels "< 2500 g",">= 2500 g": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 28 29 34 25 25 27 23 24 24 21 ...
## $ lwt : int 120 130 187 105 85 150 97 128 132 165 ...
## $ race : Factor w/ 3 levels "Black","Other",..: 2 3 1 2 2 2 2 1 2 3 ...
## $ smoke: Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 2 ...
## $ ptl : Factor w/ 3 levels "None","One","Two, etc.": 2 1 1 2 1 1 1 2 1 1 ...
## $ ht : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 1 1 2 2 ...
## $ ui : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 1 1 1 ...
## $ ftv : Factor w/ 3 levels "None","One","Two, etc.": 1 3 1 1 1 1 2 2 1 2 ...
## $ bwt : int 709 1021 1135 1330 1474 1588 1588 1701 1729 1790 ...
Prior to creating histogram, we can describe the summary of the data as follows:
summary(data.lahir)
## X id low age lwt
## Min. : 1 Min. : 4.0 < 2500 g : 59 Min. :14.00 Min. : 80.0
## 1st Qu.: 48 1st Qu.: 68.0 >= 2500 g:130 1st Qu.:19.00 1st Qu.:110.0
## Median : 95 Median :123.0 Median :23.00 Median :121.0
## Mean : 95 Mean :121.1 Mean :23.24 Mean :129.8
## 3rd Qu.:142 3rd Qu.:176.0 3rd Qu.:26.00 3rd Qu.:140.0
## Max. :189 Max. :226.0 Max. :45.00 Max. :250.0
## race smoke ptl ht ui ftv
## Black:26 No :115 None :159 No :177 No :161 None :100
## Other:67 Yes: 74 One : 24 Yes: 12 Yes: 28 One : 47
## White:96 Two, etc.: 6 Two, etc.: 42
##
##
##
## bwt
## Min. : 709
## 1st Qu.:2414
## Median :2977
## Mean :2945
## 3rd Qu.:3475
## Max. :4990
To create histogram, one might utilize hist()
function
hist(data.lahir$age, breaks = 25, col = "coral",
xlab = "Age of Mother (years)",
main = "Age Distribution of Mother when Giving Birth")
Let us learn about the argument of hist() function.
breaks= is option to indicate how much range or bar that
suit our interest. The more breaks we use, the more denser the histogram
will be. col= is option to denote color of the chart.
xlab = is used to add label on x axis, while mean is to
showcast the title of the histogram.
From this graph we can see that the distribution of data is somewhat skewed to the right. which means the mean of the data might exceed its median. The average age of mother when she is giving birth is mostly around 23 years old. Accordingly we can also guess the outlier of the data. Which age is it?
Challenge:
Construct the histogram of other variables, what can you explain from
it? What is distribution of each variable?, are there any outliers?
which one(s)?
boxplot(data.lahir$age, horizontal = TRUE, col = 'coral', xlab="Age of Mother (years)")
Naïve Estimator
weight <- data.lahir$bwt
hist(weight, breaks= 25, col = "coral",
xlab= "birth weight in (g)",
main = "",
freq = FALSE)
If we arbitrarily change the value of freq= TRUE, then the vertical axis will be its corresponding frequency, or else it would be its probability density.
weight <- data.lahir$bwt
hist(weight, breaks= 25, col = "coral",
xlab= "birth weight (g)",
main = "",
freq = FALSE)
h = 500
n = length(weight)
x <- seq(700, 5000, by=20)
fx <- NULL
for (i in 1:length(x)) {
fx[i] = sum(ifelse(abs(weight - x[i]) < h, 1, 0))/(2*h*n)
}
lines(x, fx, type="l", col="blue", lwd=2)
The program to calculate the estimated value of f(x) can be explained
as follows: The command
sum(ifelse(abs(weight - x[i]) < h, 1, 0)) is used to
calculate the number of observations that are in the interval (x-h, x+h
) by first converting the observations with values in the interval to 1
and those outside the interval to 0. The complete line of the
fx[i] command is to store the value of the density function
for each value of x in the range between 700 and 5000 g. Since it is
impossible to calculate f(x) at any value, in this program only x weight
values between 700 to 5000 with intervals of 20 g are calculated. The
smaller the interval, the longer the computation will be due to the
large number of x values that are estimated to be f(x) using the program
above.
Challenge:
Fine-tuning the chart above using different h values. What
can you digest from it?
Kernel Estimator
Unlike the naive estimator, the kernel estimator has a smoothing
parameter feature.
dense <- density(data.lahir$age, bw=1, kernel="epanechnikov")
hist(data.lahir$age, freq = FALSE, breaks = 15, col = "skyblue1", main = "", xlab = "Age of Mother (years)")
lines(dense, col="blue", lwd=2, main="", ylim=c(0, 0.09))
The main options available in the density() function are
bw and the kernel. The bw option
is used to specify the window width to use, while the
kernel option is used to select the kernel function to use.
The density() function provides several kernel options,
such as: gaussian, epanechnikov,
rectangular, triangular, and
biweight.
table(data.lahir$race)
##
## Black Other White
## 26 67 96
frekuensi <- table(data.lahir$race)
prop.table(frekuensi)
##
## Black Other White
## 0.1375661 0.3544974 0.5079365
barplot(frekuensi)
pie(frekuensi)
Pembandingan antar kelompok:
aggregate(bwt ~ smoke, data = data.lahir, mean)
## smoke bwt
## 1 No 3054.957
## 2 Yes 2773.243
Cross tabulation:
ras <- data.lahir$race
kategori_weight <- data.lahir$low
table(ras, kategori_weight)
## kategori_weight
## ras < 2500 g >= 2500 g
## Black 11 15
## Other 25 42
## White 23 73
99 / 5,000 Translation results Row percentage calculation can be done
using the prop.table() function with
margin = 1 option as follows:
crosstab<-table(ras, kategori_weight)
prop.table(crosstab, margin = 1)
## kategori_weight
## ras < 2500 g >= 2500 g
## Black 0.4230769 0.5769231
## Other 0.3731343 0.6268657
## White 0.2395833 0.7604167
The comparison view can also be expressed in the form of a stacked barplot:
persentase <- prop.table(crosstab, margin = 1)
barplot(t(persentase))
#install.packages("sm")
library(sm)
## Warning: package 'sm' was built under R version 4.1.2
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
sm.density.compare(data.lahir$age, data.lahir$race,
xlab="Age of Mother (year)",
lwd = 2)
colfill <-c(2:(2+length(levels(data.lahir$race))))
legend("topright", levels(data.lahir$race), fill = colfill)
hrbrthemes::import_roboto_condensed()
## You will likely need to install these fonts on your system as well.
##
## You can find them in [/Library/Frameworks/R.framework/Versions/4.1/Resources/library/hrbrthemes/fonts/roboto-condensed]
library(ggridges)
library(ggplot2)
ggplot(data.lahir, aes(x = age, y = race, fill=race)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 2.05
library(ggridges)
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
#plot
ggplot(data.lahir, aes( x = age, y = race, fill = ..x..))+
geom_density_ridges_gradient(scale=1, rel_min_height=0.01)+
scale_fill_viridis(name = "usia", option = "C") +
labs(title = "") +
theme_ipsum() +
theme(legend.position = "none",
panel.spacing = unit(0.1, "lines"),
strip.text.x = element_text(size=8))
## Picking joint bandwidth of 2.05
boxplot(data.lahir$age ~ data.lahir$race,
col=c("coral","green","skyblue"),
ylab="Mother Age (year)",
xlab = "Race")
The outlier identification technique on a single variable can be done
by paying attention to the difference between the observations and its
sample average. An observation is said to be an outlier if the
difference between the observed value and the mean (x bar) is greater
than 3s (s is the standard deviation). In other words, an observation
will be called an outlier if it is outside the range:
(xbar - 3s, xbar + 3s)
m <- mean(weight)
s <- sd(weight)
pencilan <- (weight > m + 3*s) | (weight < m - 3*s)
#to count how many outliers in our data
sum(pencilan)
## [1] 1
#to identify which index of observation belong to the outlier
which(pencilan)
## [1] 1
#value of outlier
weight[which(pencilan)]
## [1] 709
Another possible approach is to use a limit value. The limit used to
identify outliers is the lower limit and the upper limit with the
formula:
- Lower Limit = Q1 - 1.5*IQR - Upper Limit = Q3 + 1.5*IQR
#outlier detection
nilai.pencilan <- boxplot.stats(weight)$out
which(weight == nilai.pencilan)
## [1] 1
nilai.pencilan
## [1] 709
The existence of large extreme data will cause the sample average to be larger than it should be. Likewise, when there is small extreme data, the average sample will be smaller than it should be. So in this condition it can be said that our estimator is not robust because its value is easily disturbed by the presence of outliers, especially if the size of the data cluster is not large. The solution to this problem is a robust estimator, such as: Trimmed Mean, Winsorized Mean, and M-Estimators for the average.
example1 <- c(12, 12, 12, 13, 13, 14, 16, 17, 18, 18, 19, 19, 19, 19, 20, 20, 21, 22, 22, 67)
mean(example1)
## [1] 19.65
mean(example1, trim = 0.05)
## [1] 17.44444
The pysch package provides a function named
winsor.mean() which can be used to calculate the winsorized
mean of a data vector.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
winsor.mean(example1, trim=0.1)
## [1] 17.4
winsor.mean(example1, trim=0.05)
## [1] 17.5125
The scatter plot is a two-dimensional graph that displays the cross of the Cartesian axis and data points where each axis represents each variable and each point on the plot represents one observation in the data whose position coordinates are adjusted to the value of the variable. You can find harga rumah data through this link
rumah <- read.table("harga rumah.txt", header=TRUE)
attach(rumah)
plot(size, price, ylab="Housing Price (US$)",
xlab = "Size (sq feet)",
pch = 19, col = "coral", cex = 1.5)
plot(size, price, ylab="Housing Price (US$)",
xlab = "Size (sq feet)",
pch = 19, cex = 1.5,
col=ifelse(location == 1, "coral", "navyblue"))
Correlation matrix
pairs(rumah[,1:2], pch=19, col="coral", cex=1.5)
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(rumah, histogram = TRUE, pch= 19)
pairs(data.lahir[,c(4,5,12)], pch=19, col="coral", cex=1.5)
# Library
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.4 ✓ purrr 0.3.4
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x psych::%+%() masks ggplot2::%+%()
## x psych::alpha() masks ggplot2::alpha()
## x dplyr::filter() masks stats::filter()
## x xts::first() masks dplyr::first()
## x dplyr::lag() masks stats::lag()
## x xts::last() masks dplyr::last()
#Generate data
a <- data.frame( x=rnorm(20000, 10, 1.9), y=rnorm(20000, 10, 1.2) )
b <- data.frame( x=rnorm(20000, 14.5, 1.9), y=rnorm(20000, 14.5, 1.9) )
c <- data.frame( x=rnorm(20000, 9.5, 1.9), y=rnorm(20000, 15.5, 1.9) )
data <- rbind(a,b,c)
Create basic scatterplot:
ggplot(data, aes(x=x, y=y) ) +
geom_point()
Second histogram with default feature:
ggplot(data, aes(x=x, y=y) ) +
geom_bin2d() +
theme_bw()
2d plot is useful when we have abundance data/ observation. For the
2d histogram, the plot area is divided in a multitude of squares. (It is
a 2d version of the classic histogram). It is called using the
geom_bin2d() function.This function offers a bins argument
that controls the number of bins you want to display. Please execute the
following syntax:
# Bin size control + color palette
ggplot(data, aes(x=x, y=y) ) +
geom_bin2d(bins = 70) +
scale_fill_continuous(type = "viridis") +
theme_bw()
Another representative of 2d plot is hexagone histogram as an
alternative of square multitude. It is then called hexbin chart. It is
using function geom_hex() to produce this chart. This
function provides the bins argument as well, to control the number of
division per axis.
It is also need to activate dependency packages such as:
hexbin packages.
#install.packages("hexbin")
library(hexbin)
Let’s create hexbin chart with default option:
ggplot(data, aes(x=x, y=y) ) +
geom_hex() +
theme_bw()
We can control bin size + custom color pallette:
ggplot(data, aes(x=x, y=y) ) +
geom_hex(bins = 70) +
scale_fill_continuous(type = "viridis") +
theme_bw()
Another type of 2D density plot is contour 2d density plot. it is
supported with raster function.
Show contour only:
ggplot(data, aes(x=x, y=y) ) +
geom_density_2d()
Show the area only:
ggplot(data, aes(x=x, y=y) ) +
stat_density_2d(aes(fill = ..level..), geom = "polygon")
Show area and contour:
ggplot(data, aes(x=x, y=y) ) +
stat_density_2d(aes(fill = ..level..), geom = "polygon", colour="white")
Using Raster:
ggplot(data, aes(x=x, y=y) ) +
stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme(
legend.position='none'
)
It is used when we have at least 3 variables of interest. For
ilustration we take data of Hans
Rosling TED Talk - Gapminder. We need to install and activate
gapminder library.
#install.packages("gapminder")
Activate other supplementary libraries:
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggrepel)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Import the data:
library(gapminder)
data_bubble <- gapminder %>% filter(year=="2007") %>% dplyr::select(-year)
Display the bubble chart:
data_bubble %>%
mutate(pop=pop/1000000) %>%
arrange(desc(pop)) %>%
mutate(country = factor(country, country)) %>%
ggplot( aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
geom_point(alpha=0.7) +
scale_size(range = c(1.4, 19), name="Population (M)") +
scale_color_viridis(discrete=TRUE, guide=FALSE) +
theme_ipsum() +
theme(legend.position="bottom")
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
To make it more insightful, we can annotate the name of country on the bubble chart. Firstly, let us pre-proccessed the data.
tmp <- data_bubble %>%
mutate(
annotation = case_when(
gdpPercap > 5000 & lifeExp < 60 ~ "yes",
lifeExp < 30 ~ "yes",
gdpPercap > 40000 ~ "yes"
)
) %>%
mutate(pop=pop/1000000) %>%
arrange(desc(pop)) %>%
mutate(country = factor(country, country))
Create the plot:
ggplot( tmp, aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
geom_point(alpha=0.7) +
scale_size(range = c(1.4, 19), name="Population (M)") +
scale_color_viridis(discrete=TRUE) +
theme_ipsum() +
theme(legend.position="none") +
geom_text_repel(data=tmp %>% filter(annotation=="yes"), aes(label=country), size=4 )
Create interactive version:
p <- data_bubble %>%
mutate(gdpPercap=round(gdpPercap,0)) %>%
mutate(pop=round(pop/1000000,2)) %>%
mutate(lifeExp=round(lifeExp,1)) %>%
arrange(desc(pop)) %>%
mutate(country = factor(country, country)) %>%
mutate(text = paste("Country: ", country, "\nPopulation (M): ", pop, "\nLife Expectancy: ", lifeExp, "\nGdp per capita: ", gdpPercap, sep="")) %>%
ggplot( aes(x=gdpPercap, y=lifeExp, size = pop, color = continent, text=text)) +
geom_point(alpha=0.7) +
scale_size(range = c(1.4, 19), name="Population (M)") +
scale_color_viridis(discrete=TRUE, guide=FALSE) +
theme_ipsum() +
theme(legend.position="none")
ggplotly(p, tooltip="text")
Local Regression (loess) is a non-parametric approach that fits multiple regressions in local neighborhood. It is used to smoothen plot. The principle of local regression is to determine the regression line using only the closest few points, not using all the data in the, thus it is so called local regression.
data(economics, package="ggplot2") # load data
economics$index <- 1:nrow(economics) # create index variable
economics <- economics[1:80, ] # retail 80rows for better graphical understanding
loessMod10 <- loess(uempmed ~ index, data=economics, span=0.10) # 10% smoothing span
loessMod25 <- loess(uempmed ~ index, data=economics, span=0.25) # 25% smoothing span
loessMod50 <- loess(uempmed ~ index, data=economics, span=0.50) # 50% smoothing span
Predict Loess:
smoothed10 <- predict(loessMod10)
smoothed25 <- predict(loessMod25)
smoothed50 <- predict(loessMod50)
Create the plot:
plot(economics$uempmed, x=economics$date, type="l", main="Loess Smoothing and Prediction", xlab="Date", ylab="Unemployment (Median)")
lines(smoothed10, x=economics$date, col="red")
lines(smoothed25, x=economics$date, col="green")
lines(smoothed50, x=economics$date, col="blue")
legend('bottomright', legend=c('.1', '.25', '.5'),
col=c('red', 'green', 'blue'), pch=19, title='Smoothing Span')