# Loading relevant libraries
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(plotly))

Introduction

This section shows how to use histograms and density plots to summarize a single numeric variable. We will use the traffic volumes data set from the Ontario Open Data catalogue (Government of Ontario 2016). The volumes are counts, so the data are discrete.

Visualizing Traffic Volumes at a Station

Let’s start with the traffic volumes data. These data were collected under the commercial vehicle survey in Ontario, Canada in 2006. First, we read the data set:

# Download Data
suppressPackageStartupMessages(library(data.table) )
df_vol <- fread("https://files.ontario.ca/opendata/2006_commercial_vehicle_survey_-_traffic_volumes_at_survey_stations.csv")

#saveRDS(df_vol, file="Data/df_vol.rds")

# Filter the data for one station
suppressPackageStartupMessages(library(dplyr) )
df_vol <- df_vol %>% 
  filter(`Station ID`=="ON0016",
         `Day of Week Number`==2) %>% 
  select(`Highway or Road`, `Day of Week Number`, Hour, total_trucks)

As seen above, we have filtered the original data to keep only station ON0016 and, within it, day number 2. This station is located on Highway 401, between Bennett Rd (Exit 435) & Liberty St (Exit 432), about 1 km west of Bennet Rd (Km Marker 434), as described in the Location Description column of the original data set. The following map shows the location:

library(leaflet)
m <- leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=-78.66317, lat=43.90038)
m  # Print the map

Now, let’s see the data frame:

suppressPackageStartupMessages(library(DT))
datatable(df_vol, caption = "Traffic Volumes at a Survey Station", filter = "top", 
          options = list(
  pageLength = 5
))

This subset of the data shows the hourly volumes of total_trucks on day number 2 of the week (Monday). There are 24 rows in df_vol, one for each hour of the day. The numbers in the Hour column represent the start of each hour, e.g. 0 means 12 am to 1 am. For now, we will only look at the hourly volumes on this one day.

Histogram and Density Plots

A frequency histogram is constructed by plotting the unique values of a variable on the x axis and the observed count of each value on the y axis. In our volumes data, most of the unique values were observed only once, so plotting each unique value on the x axis is not informative. Instead, we can combine the values into groups (class intervals, or bins).
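Before binning the survey data, a toy example may help to make the point concrete. The sketch below uses a made-up vector of hourly counts (hypothetical values, not the survey data) and base R only. It first tabulates the unique values with table() and then groups the same values into bins of width 50 with cut(); the break points are an assumption chosen for this illustration.

```r
# Hypothetical hourly truck counts (illustrative values, not the survey data)
trucks <- c(112, 98, 105, 131, 187, 254, 301, 322, 298, 276, 265, 254)

# Frequency of each unique value: almost every value occurs only once
unique_counts <- table(trucks)
print(unique_counts)

# Group the values into class intervals of width 50 instead
breaks <- seq(50, 350, by = 50)   # assumed break points for this example
class_interval <- cut(trucks, breaks = breaks)
binned_counts <- table(class_interval)
print(binned_counts)
```

The code that follows applies the same idea to df_vol, using cut_width() from ggplot2 to create the bins.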

df_vol <- df_vol %>% 
  mutate(class_interval = cut_width(total_trucks, width=50))

df_vol_hist <- df_vol %>% 
  group_by(class_interval) %>% 
  summarize(frequency = n()) %>% 
  ungroup() %>% 
  mutate(rel_freq = frequency/sum(frequency), 
         Density = rel_freq/50)


datatable(df_vol_hist)

The table above shows the group or bin (class_interval) of the total_trucks variable and the number of hourly volumes (frequency) falling into each bin. The following diagram shows how rel_freq and Density were calculated; the class interval here is the bin width of 50:

library(DiagrammeR)
mermaid("
graph LR
A(Data)-->B(Frequency)
B-->C((Frequency / <br> Total Frequency))
C-->D(Relative <br> Frequency)
D-->E((Relative Frequency / <br> Class Interval))
E-->F(Density)

style A fill:#E5E25F
style B fill:#E5E25F
style C fill:white
style D fill:#E5E25F
style E fill:white
style F fill:#E5E25F
")
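The two steps in the diagram can be checked by hand with a few lines of base R. This is a minimal sketch with made-up bin counts (the frequency vector is hypothetical, not taken from df_vol_hist), assuming every bin has the same width of 50.

```r
# Hypothetical bin counts for class intervals of width 50 (not the survey data)
bin_width <- 50
frequency <- c(1, 3, 1, 0, 5, 2)

# Relative frequency: share of all observations that fall in each bin
rel_freq <- frequency / sum(frequency)

# Density: relative frequency per unit of the variable
dens <- rel_freq / bin_width

# Relative frequencies sum to 1, and so do the bar areas (density * width)
print(sum(rel_freq))
print(sum(dens * bin_width))
```

Dividing by the bin width is what makes the total area of the bars equal to 1, which is the property that lets a density histogram be compared with a density curve.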

In R, we don’t have to calculate frequencies or densities ourselves. We can plot a frequency histogram, a relative frequency histogram and a density histogram by just specifying the binwidth, i.e. the width of the class interval.

Frequency Histogram

ggplotly(ggplot(data = df_vol, aes(x = total_trucks)) + 
  geom_histogram(binwidth = 50, fill="skyblue", color = "black") +
  theme_bw())

Relative Frequency Histogram

# after_stat() replaces the deprecated ..count.. notation (needs ggplot2 >= 3.3.0)
ggplotly(ggplot(data = df_vol, aes(x = total_trucks, y = after_stat(count / sum(count)))) + 
  geom_histogram(binwidth = 50, fill="red", color = "black") +
  theme_bw() + labs(y = "Relative Frequency"))

Density Histogram

# after_stat() replaces the deprecated ..density.. notation (needs ggplot2 >= 3.3.0)
ggplotly(ggplot(data = df_vol, aes(x = total_trucks, y = after_stat(density))) + 
  geom_histogram(binwidth = 50, fill="green", color = "black") +
  theme_bw() + labs(y = "Density"))
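The numbers behind these three kinds of histogram can also be inspected without drawing anything. The sketch below uses base R's hist() with plot = FALSE on a made-up vector (hypothetical values, not the survey data). The break points are an assumption for this example, and they may be placed differently from the bins that geom_histogram() picks for binwidth = 50, so the counts are not meant to match the plots above.

```r
# Hypothetical hourly truck counts (illustrative values, not the survey data)
trucks <- c(112, 98, 105, 131, 187, 254, 301, 322, 298, 276, 265, 254)

# Compute the histogram without drawing it; breaks are assumed for this example
h <- hist(trucks, breaks = seq(50, 350, by = 50), plot = FALSE)

print(h$counts)                  # frequency per bin
print(h$counts / sum(h$counts))  # relative frequency per bin
print(h$density)                 # density per bin

# The bar areas of a density histogram add up to 1
total_area <- sum(h$density * diff(h$breaks))
print(total_area)
```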

A histogram tells us about the range of the data, the shape of the distribution and its modes (the most frequent values).
With a large number of observations and a small enough binwidth, the density histogram approaches a smooth curve. That curve describes the distribution as a function of frequency (or probability). For discrete data it is called a probability mass function, whereas for continuous data it is called a probability density function. Some well-defined functions describe distributions that are controlled by a few specific parameters; these are called parametric distributions. An example is the Normal distribution, which has two parameters: the mean and the standard deviation. Nonparametric methods are useful when no parametric distribution is appropriate for a given data set. Kernel density estimation is one such way to estimate and plot the density of a data set.
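As a rough illustration of the difference between the two approaches, the sketch below overlays a kernel density estimate and a fitted Normal curve on a density histogram. It uses base R graphics rather than ggplot2 to stay self-contained, and the data vector is made up (hypothetical values, not the survey data). The default Gaussian kernel and bandwidth of density() are used as they are, which is an assumption and not a recommendation for these data.

```r
# Hypothetical hourly truck counts (illustrative values, not the survey data)
trucks <- c(112, 98, 105, 131, 187, 254, 301, 322, 298, 276, 265, 254)

# Nonparametric: kernel density estimate (Gaussian kernel, default bandwidth)
kde <- density(trucks)

# Parametric: Normal density using the sample mean and standard deviation
normal_dens <- dnorm(kde$x, mean = mean(trucks), sd = sd(trucks))

# Overlay both curves on a density histogram (breaks assumed for this example)
hist(trucks, breaks = seq(50, 350, by = 50), freq = FALSE,
     main = "Density histogram with KDE and Normal fit", xlab = "trucks")
lines(kde, lwd = 2)                 # solid line: kernel density estimate
lines(kde$x, normal_dens, lty = 2)  # dashed line: fitted Normal density
```

The kernel estimate follows the shape of the data, while the Normal curve is fixed by two parameters only. In ggplot2, geom_density() draws a comparable kernel density curve.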

Resources

Transportation Statistics and Microsimulation book

References

Government of Ontario, 2016. Data catalogue. Available at: https://www.ontario.ca/search/data-catalogue?sort=asc [Accessed August 5, 2016].

Summarizing One Variable Graphically

Umair Durrani

August 7, 2016