Goals

This exploration examines numerical data.

The goal is to explore graphical and numerical summaries for numerical variables:

  • graphical summaries
    • histograms
    • boxplots
    • dotplots
    • density curves
  • numerical summary statistics
    • mean
    • median
    • Q1 and Q3
    • min and max

Data

We will look at two different data sets:

  • vehicle.csv containing EPA fuel economy for vehicles from 1984-2020

  • climate.csv containing August 2019 daily high and low temperatures for Fremont, Santa Cruz, Sacramento and Santa Barbara

Packages Needed

  • ggplot2 for graphing
  • dplyr for data wrangling
  • readr for reading csv files
  • purrr for quick summaries
  • kableExtra for producing better-looking tables (via kable)
  • plotly for creating interactive graphics

Load the needed packages

library(ggplot2)
library(dplyr)
library(readr)
library(purrr)
library(kableExtra)
library(plotly)

Exploring climate data

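Reading in data with read.csv

# stringsAsFactors = TRUE keeps City and Date as factors (the default before R 4.0), matching the output below
climate <- read.csv("climate.csv", stringsAsFactors = TRUE)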
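
Since readr is among the loaded packages, the same file could also be read with its read_csv() function. The sketch below is an alternative, not the call used in this exploration. So that it runs on its own, it writes three Sacramento rows (values taken from the glimpse output in the next step) to a temporary file; the name climate_demo and the chosen column types are assumptions made for illustration.

```r
library(readr)

# Stand-in file so the sketch is self-contained; in the exploration the
# path would simply be "climate.csv".
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "City,Date,High,Low",
  "Sacramento,8/1/2019,92,62",
  "Sacramento,8/2/2019,98,63",
  "Sacramento,8/3/2019,102,65"
), demo_path)

# Declare the column types explicitly instead of relying on guessing.
climate_demo <- read_csv(
  demo_path,
  col_types = cols(
    City = col_factor(),     # categorical variable
    Date = col_character(),  # converted to a Date just below
    High = col_double(),
    Low  = col_double()
  )
)

# Turn the month/day/year strings into real dates with base R.
climate_demo$Date <- as.Date(climate_demo$Date, format = "%m/%d/%Y")

climate_demo
```

Storing Date as a date rather than a factor would let plots and summaries treat the days in calendar order.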

Glimpse the data to assess variable types

glimpse(climate)
## Observations: 123
## Variables: 4
## $ City <fct> Sacramento, Sacramento, Sacramento, Sacramento, Sacramento,…
## $ Date <fct> 8/1/2019, 8/2/2019, 8/3/2019, 8/4/2019, 8/5/2019, 8/6/2019,…
## $ High <dbl> 92, 98, 102, 99, 98, 101, 94, 86, 91, 87, 92, 98, 101, 105,…
## $ Low  <dbl> 62, 63, 65, 64, 61, 67, 62, 60, 62, 63, 59, 63, 66, 67, 71,…

Visual Exploration

Let’s view a variety of visualizations of a single numerical variable:

  • Fremont’s Daily High Temperature for August 2019

Plots shown for this variable (figures not reproduced here):

  • Look at connection between graphs
  • Closer look at boxplot connection to histogram
  • Classic Dot Plot
  • Classic Histogram
  • Density Curve
  • Boxplot
  • Violin Plot
  • Boxplot with jittered data points
  • Dotplot with dots centered on axis
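
The code behind these single-variable plots is not shown in this exploration. A minimal sketch of how plots of this kind could be built with ggplot2 follows. It is not the original code: the Fremont values are not printed in this document, so the sketch uses the first 14 Sacramento highs visible in the glimpse output as stand-in data, and all object names (temps, dot_plot, and so on) are hypothetical.

```r
library(ggplot2)

# Stand-in data: the first 14 Sacramento highs from the glimpse output.
# For the Fremont plots, use dplyr::filter(climate, City == "Fremont").
temps <- data.frame(
  High = c(92, 98, 102, 99, 98, 101, 94, 86, 91, 87, 92, 98, 101, 105)
)

# Plots that need only the temperature on the x axis
base <- ggplot(temps, aes(x = High))

dot_plot     <- base + geom_dotplot(binwidth = 1)
histogram    <- base + geom_histogram(binwidth = 2, color = "white")
density_plot <- base + geom_density()
centered_dot <- base + geom_dotplot(binwidth = 1, stackdir = "center")

# Boxplots and violins take a (dummy) group on x; coord_flip() lays them
# out horizontally so temperature runs along the same axis as above.
grouped <- ggplot(temps, aes(x = "", y = High)) + coord_flip()

box_plot    <- grouped + geom_boxplot()
violin_plot <- grouped + geom_violin()
box_jitter  <- grouped +
  geom_boxplot(outlier.shape = NA) +     # outliers hidden; jitter shows every point
  geom_jitter(width = 0.1, height = 0)   # spread points across the group axis only

box_plot
```

Printing any of the other objects (for example histogram or box_jitter) draws that plot.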

Comparing groups with summary visualization

Let’s compare the daily high temperatures for four different cities in a dotplot

ggplot(climate, aes(High, fill=City)) +
  geom_dotplot(binwidth=1,stackratio=1.0, dotsize=1)+
  facet_wrap(~City, nrow=4)

View the temperatures for each city in a faceted histogram

p<-ggplot(climate, aes(High, fill=City)) +
  geom_histogram(binwidth = 2, color="white")+
  facet_wrap(~City, nrow=4)
ggplotly(p)

View the temperatures for each city in a faceted density curve

p<-ggplot(climate, aes(High, color=City)) +
  geom_density()+
  facet_wrap(~City, nrow=4)
ggplotly(p)

View the temperatures for each city in a single histogram (bars for the cities are stacked by default)

p<-ggplot(climate, aes(High, fill=City)) +
  geom_histogram(binwidth = 1, color="white")
ggplotly(p)

View the temperatures for each city as density curves in a single plot

p<-ggplot(climate, aes(High, color=City)) +
  geom_density()
ggplotly(p)

View the temperatures for each city as shaded density curves in a single plot

p<-ggplot(climate, aes(High, color=City, fill=City)) +
  geom_density(alpha=0.25)
ggplotly(p)

View the temperatures for each city in a boxplot

p<-ggplot(climate, aes(x= City, y=High, color=City)) +
  geom_boxplot()+
  coord_flip()
ggplotly(p)

View the temperatures for each city in a violin plot

p<-ggplot(climate, aes(x= City, y=High, fill=City)) +
  geom_violin()+
  coord_flip()
ggplotly(p)

Add quartile lines to the violin plot

p<-ggplot(climate, aes(x= City, y=High, fill=City)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))+
  coord_flip()
p

Classic Numerical Summary Statistics

Using the group_by and summarize functions

high_sum <- climate %>% 
  group_by(City) %>% 
  summarize(High_min = min(High), High_q1 = quantile(High, 0.25),
            High_median = median(High), High_q3 = quantile(High, 0.75),
            High_max = max(High), High_mean = mean(High),
            High_sd = sd(High), High_IQR = IQR(High))
high_sum
City           High_min  High_q1  High_median  High_q3  High_max  High_mean   High_sd  High_IQR
Fremont            73.0    78.00         82.0    86.75      98.0   82.80000  6.535843      8.75
Sacramento         86.0    91.00         98.0   101.00     107.0   96.48387  5.999283     10.00
Santa Barbara      65.0    70.00         71.0    74.00      77.0   71.83871  2.989947      4.00
Santa Cruz         65.6    72.85         78.1    82.45      86.3   77.30968  5.799963      9.60
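
In each row the IQR equals Q3 minus Q1; for Fremont, 86.75 - 78.00 = 8.75. The quartiles themselves come from quantile(), which by default interpolates linearly between sorted values. As a small check of how such numbers arise, the sketch below uses only the first 14 Sacramento highs visible in the glimpse output, so its results differ from the full-month Sacramento row above. The helper names (pos_q3, q3_by_hand) are illustrative.

```r
# First 14 Sacramento highs from the glimpse output (not the full month)
high <- c(92, 98, 102, 99, 98, 101, 94, 86, 91, 87, 92, 98, 101, 105)

sorted <- sort(high)
n <- length(sorted)

# Default (type 7) rule: the p-th quantile sits at position 1 + (n - 1) * p
pos_q3 <- 1 + (n - 1) * 0.75           # 10.75: between the 10th and 11th values
lower  <- sorted[floor(pos_q3)]         # 10th sorted value, 99
upper  <- sorted[ceiling(pos_q3)]       # 11th sorted value, 101
q3_by_hand <- lower + (pos_q3 - floor(pos_q3)) * (upper - lower)

q1 <- unname(quantile(high, 0.25))
q3 <- unname(quantile(high, 0.75))

c(q1 = q1, q3 = q3, q3_by_hand = q3_by_hand,
  iqr = IQR(high), mean = mean(high), median = median(high))
# q1 = 92, q3 = 100.5, q3_by_hand = 100.5, iqr = 8.5, mean = 96, median = 98
```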

Improving table format with kableExtra package

high_sum %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  add_header_above(c("Summary of Daily High Temperatures"=9))
Summary of Daily High Temperatures
City           High_min  High_q1  High_median  High_q3  High_max  High_mean   High_sd  High_IQR
Fremont            73.0    78.00         82.0    86.75      98.0   82.80000  6.535843      8.75
Sacramento         86.0    91.00         98.0   101.00     107.0   96.48387  5.999283     10.00
Santa Barbara      65.0    70.00         71.0    74.00      77.0   71.83871  2.989947      4.00
Santa Cruz         65.6    72.85         78.1    82.45      86.3   77.30968  5.799963      9.60
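
The mean and standard deviation columns carry more decimal places than the temperature readings justify. One possible refinement, sketched below, passes digits = 2 to kable(). To stay runnable on its own, the sketch rebuilds three columns of high_sum by hand from the values printed above (as high_sum_demo, a hypothetical name) and requests HTML output explicitly with format = "html". Both choices are assumptions for illustration, not part of the original code.

```r
library(dplyr)       # for the %>% pipe
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), add_header_above()

# Rebuilt from the printed summary so the sketch is self-contained;
# in the exploration, high_sum already exists.
high_sum_demo <- data.frame(
  City      = c("Fremont", "Sacramento", "Santa Barbara", "Santa Cruz"),
  High_mean = c(82.80000, 96.48387, 71.83871, 77.30968),
  High_sd   = c(6.535843, 5.999283, 2.989947, 5.799963)
)

rounded_table <- high_sum_demo %>%
  kable(format = "html", digits = 2) %>%   # round numeric columns to 2 places
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  add_header_above(c("Summary of Daily High Temperatures" = 3))  # spans 3 columns

rounded_table
```

With the full high_sum table, the header would span 9 columns, as in the original call.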

Summarize using the purrr package

climate %>%  
  split(.$City) %>% 
  map(summary)
## $Fremont
##             City           Date         High            Low       
##  Fremont      :30   8/1/2019 : 1   Min.   :73.00   Min.   :55.00  
##  Sacramento   : 0   8/10/2019: 1   1st Qu.:78.00   1st Qu.:57.00  
##  Santa Barbara: 0   8/11/2019: 1   Median :82.00   Median :59.00  
##  Santa Cruz   : 0   8/12/2019: 1   Mean   :82.80   Mean   :58.77  
##                     8/13/2019: 1   3rd Qu.:86.75   3rd Qu.:59.00  
##                     8/14/2019: 1   Max.   :98.00   Max.   :64.00  
##                     (Other)  :24                                  
## 
## $Sacramento
##             City           Date         High             Low       
##  Fremont      : 0   8/1/2019 : 1   Min.   : 86.00   Min.   :58.00  
##  Sacramento   :31   8/10/2019: 1   1st Qu.: 91.00   1st Qu.:62.00  
##  Santa Barbara: 0   8/11/2019: 1   Median : 98.00   Median :64.00  
##  Santa Cruz   : 0   8/12/2019: 1   Mean   : 96.48   Mean   :64.26  
##                     8/13/2019: 1   3rd Qu.:101.00   3rd Qu.:67.00  
##                     8/14/2019: 1   Max.   :107.00   Max.   :71.00  
##                     (Other)  :25                                   
## 
## $`Santa Barbara`
##             City           Date         High            Low       
##  Fremont      : 0   8/1/2019 : 1   Min.   :65.00   Min.   :55.00  
##  Sacramento   : 0   8/10/2019: 1   1st Qu.:70.00   1st Qu.:56.00  
##  Santa Barbara:31   8/11/2019: 1   Median :71.00   Median :57.00  
##  Santa Cruz   : 0   8/12/2019: 1   Mean   :71.84   Mean   :57.58  
##                     8/13/2019: 1   3rd Qu.:74.00   3rd Qu.:59.00  
##                     8/14/2019: 1   Max.   :77.00   Max.   :61.00  
##                     (Other)  :25                                  
## 
## $`Santa Cruz`
##             City           Date         High            Low       
##  Fremont      : 0   8/1/2019 : 1   Min.   :65.60   Min.   :53.10  
##  Sacramento   : 0   8/10/2019: 1   1st Qu.:72.85   1st Qu.:55.75  
##  Santa Barbara: 0   8/11/2019: 1   Median :78.10   Median :57.10  
##  Santa Cruz   :31   8/12/2019: 1   Mean   :77.31   Mean   :57.26  
##                     8/13/2019: 1   3rd Qu.:82.45   3rd Qu.:58.70  
##                     8/14/2019: 1   Max.   :86.30   Max.   :63.20  
##                     (Other)  :25
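
Because summary() is applied to every column, the output above also tabulates the City and Date factors, which adds little. A possible variation keeps only the numeric columns and maps a single statistic over them. The sketch below uses the first 14 Sacramento highs and lows from the glimpse output as a stand-in for climate, so it contains one city only. The names climate_demo and medians_by_city are hypothetical.

```r
library(dplyr)
library(purrr)

# Stand-in for climate: first 14 Sacramento rows visible in the glimpse output
climate_demo <- data.frame(
  City = "Sacramento",
  High = c(92, 98, 102, 99, 98, 101, 94, 86, 91, 87, 92, 98, 101, 105),
  Low  = c(62, 63, 65, 64, 61, 67, 62, 60, 62, 63, 59, 63, 66, 67)
)

medians_by_city <- climate_demo %>%
  split(.$City) %>%                                # one data frame per city
  map(~ map_dbl(select(.x, High, Low), median))    # median of each numeric column

medians_by_city
```

Run against the full climate data, the same pipeline would return one named vector of medians per city.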