This exploration examines numerical data.
The goal is to explore numerical and graphical summaries for numerical variables.
We will look at two different data sets:
vehicle.csv containing EPA fuel economy for vehicles from 1984-2020
climate.csv containing August 2019 daily high and low temperatures for Fremont, Santa Cruz, Sacramento and Santa Barbara
Since we are graphing, we will need the ggplot2 package, and since we are also wrangling data, we will need the dplyr package. For reading CSV files, we need the readr package. For quick summaries we'll use purrr. For better-looking tables, we'll use kable() with the kableExtra package. We will also create interactive graphics using the plotly package.
Load the needed packages
library(ggplot2)
library(dplyr)
library(readr)
library(purrr)
library(kableExtra)
library(plotly)
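climate data
Reading in the data (note that read.csv() is the base R reader; the readr equivalent is read_csv())
# stringsAsFactors = TRUE keeps City and Date as factors, matching the output below
# (since R 4.0.0, read.csv() no longer converts strings to factors by default)
climate <- read.csv("climate.csv", stringsAsFactors = TRUE)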
Glimpse the data to assess variable types
glimpse(climate)
## Observations: 123
## Variables: 4
## $ City <fct> Sacramento, Sacramento, Sacramento, Sacramento, Sacramento,…
## $ Date <fct> 8/1/2019, 8/2/2019, 8/3/2019, 8/4/2019, 8/5/2019, 8/6/2019,…
## $ High <dbl> 92, 98, 102, 99, 98, 101, 94, 86, 91, 87, 92, 98, 101, 105,…
## $ Low <dbl> 62, 63, 65, 64, 61, 67, 62, 60, 62, 63, 59, 63, 66, 67, 71,…
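The glimpse output shows Date stored as a factor rather than as a date. If date arithmetic or time-ordered plots are needed later, one option is to convert it. The following is a minimal sketch on a small stand-in vector (`dates_raw` is a hypothetical name, with values copied from the output above), assuming the month/day/year layout seen there:

```r
# Stand-in for climate$Date: a factor of month/day/year strings
dates_raw <- factor(c("8/1/2019", "8/2/2019", "8/10/2019"))

# Convert factor -> character -> Date.
# Going through as.character() avoids converting the factor's integer codes by mistake.
dates <- as.Date(as.character(dates_raw), format = "%m/%d/%Y")

class(dates)   # the vector is now of class Date
range(dates)   # earliest and latest date, in calendar order
```

On the real data, the same idea would read `climate$Date <- as.Date(as.character(climate$Date), format = "%m/%d/%Y")`. This step is optional for the plots that follow, since they only use High and City.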
Let’s view a variety of visualizations of a single numerical variable
A closer look at the boxplot's connection to the histogram
Let’s compare the daily high temperatures for four different cities in a dotplot
ggplot(climate, aes(High, fill=City)) +
geom_dotplot(binwidth=1,stackratio=1.0, dotsize=1)+
facet_wrap(~City, nrow=4)
View the temperatures for each city in a faceted histogram
p<-ggplot(climate, aes(High, fill=City)) +
geom_histogram(binwidth = 2, color="white")+
facet_wrap(~City, nrow=4)
ggplotly(p)
View the temperatures for each city in a faceted density curve
p<-ggplot(climate, aes(High, color=City)) +
geom_density()+
facet_wrap(~City, nrow=4)
ggplotly(p)
View the temperatures for each city in a single histogram
p<-ggplot(climate, aes(High, fill=City)) +
geom_histogram(binwidth = 1, color="white")
ggplotly(p)
View the temperatures for each city in a single density curve
p<-ggplot(climate, aes(High, color=City)) +
geom_density()
ggplotly(p)
View the temperatures for each city in a single density curve with shading
p<-ggplot(climate, aes(High, color=City, fill=City)) +
geom_density(alpha=0.25)
ggplotly(p)
View the temperatures for each city in a boxplot
p<-ggplot(climate, aes(x= City, y=High, color=City)) +
geom_boxplot()+
coord_flip()
ggplotly(p)
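To make the boxplot easier to read, it can help to compute by hand what it draws: the box spans the first and third quartiles, the line inside is the median, and each whisker reaches to the most extreme observation within 1.5 times the IQR of the box. The sketch below uses only the first 14 Sacramento highs visible in the glimpse output, so its numbers will not match the full-month summary; the variable names are illustrative.

```r
# First 14 Sacramento daily highs, copied from the glimpse() output
# (a partial sample, so results differ from the full-month values)
high <- c(92, 98, 102, 99, 98, 101, 94, 86, 91, 87, 92, 98, 101, 105)

# Box: first quartile, median, third quartile
q <- quantile(high, c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences sit 1.5 * IQR beyond the box; points outside them are drawn as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- high[high < lower_fence | high > upper_fence]

# Whiskers extend to the most extreme observations still inside the fences
whiskers <- range(high[high >= lower_fence & high <= upper_fence])

q
whiskers
outliers
```

For this sample no value falls outside the fences, so the whiskers simply end at the smallest and largest observations.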
View the temperatures for each city in a violin plot
p<-ggplot(climate, aes(x= City, y=High, fill=City)) +
geom_violin()+
coord_flip()
ggplotly(p)
Add quartile lines to the violin plot
p<-ggplot(climate, aes(x= City, y=High, fill=City)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))+
coord_flip()
p
group_by and summarize functions
high_sum <- climate %>%
  group_by(City) %>%
  summarize(
    High_min = min(High),
    High_q1 = quantile(High, 0.25),
    High_median = median(High),
    High_q3 = quantile(High, 0.75),
    High_max = max(High),
    High_mean = mean(High),
    High_sd = sd(High),
    High_IQR = IQR(High)
  )
high_sum
| City | High_min | High_q1 | High_median | High_q3 | High_max | High_mean | High_sd | High_IQR |
|---|---|---|---|---|---|---|---|---|
| Fremont | 73.0 | 78.00 | 82.0 | 86.75 | 98.0 | 82.80000 | 6.535843 | 8.75 |
| Sacramento | 86.0 | 91.00 | 98.0 | 101.00 | 107.0 | 96.48387 | 5.999283 | 10.00 |
| Santa Barbara | 65.0 | 70.00 | 71.0 | 74.00 | 77.0 | 71.83871 | 2.989947 | 4.00 |
| Santa Cruz | 65.6 | 72.85 | 78.1 | 82.45 | 86.3 | 77.30968 | 5.799963 | 9.60 |
kableExtra package
high_sum %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
add_header_above(c("Summary of Daily High Temperatures"=9))
| City | High_min | High_q1 | High_median | High_q3 | High_max | High_mean | High_sd | High_IQR |
|---|---|---|---|---|---|---|---|---|
| Fremont | 73.0 | 78.00 | 82.0 | 86.75 | 98.0 | 82.80000 | 6.535843 | 8.75 |
| Sacramento | 86.0 | 91.00 | 98.0 | 101.00 | 107.0 | 96.48387 | 5.999283 | 10.00 |
| Santa Barbara | 65.0 | 70.00 | 71.0 | 74.00 | 77.0 | 71.83871 | 2.989947 | 4.00 |
| Santa Cruz | 65.6 | 72.85 | 78.1 | 82.45 | 86.3 | 77.30968 | 5.799963 | 9.60 |
purrr package
climate %>%
split(.$City) %>%
map(summary)
## $Fremont
## City Date High Low
## Fremont :30 8/1/2019 : 1 Min. :73.00 Min. :55.00
## Sacramento : 0 8/10/2019: 1 1st Qu.:78.00 1st Qu.:57.00
## Santa Barbara: 0 8/11/2019: 1 Median :82.00 Median :59.00
## Santa Cruz : 0 8/12/2019: 1 Mean :82.80 Mean :58.77
## 8/13/2019: 1 3rd Qu.:86.75 3rd Qu.:59.00
## 8/14/2019: 1 Max. :98.00 Max. :64.00
## (Other) :24
##
## $Sacramento
## City Date High Low
## Fremont : 0 8/1/2019 : 1 Min. : 86.00 Min. :58.00
## Sacramento :31 8/10/2019: 1 1st Qu.: 91.00 1st Qu.:62.00
## Santa Barbara: 0 8/11/2019: 1 Median : 98.00 Median :64.00
## Santa Cruz : 0 8/12/2019: 1 Mean : 96.48 Mean :64.26
## 8/13/2019: 1 3rd Qu.:101.00 3rd Qu.:67.00
## 8/14/2019: 1 Max. :107.00 Max. :71.00
## (Other) :25
##
## $`Santa Barbara`
## City Date High Low
## Fremont : 0 8/1/2019 : 1 Min. :65.00 Min. :55.00
## Sacramento : 0 8/10/2019: 1 1st Qu.:70.00 1st Qu.:56.00
## Santa Barbara:31 8/11/2019: 1 Median :71.00 Median :57.00
## Santa Cruz : 0 8/12/2019: 1 Mean :71.84 Mean :57.58
## 8/13/2019: 1 3rd Qu.:74.00 3rd Qu.:59.00
## 8/14/2019: 1 Max. :77.00 Max. :61.00
## (Other) :25
##
## $`Santa Cruz`
## City Date High Low
## Fremont : 0 8/1/2019 : 1 Min. :65.60 Min. :53.10
## Sacramento : 0 8/10/2019: 1 1st Qu.:72.85 1st Qu.:55.75
## Santa Barbara: 0 8/11/2019: 1 Median :78.10 Median :57.10
## Santa Cruz :31 8/12/2019: 1 Mean :77.31 Mean :57.26
## 8/13/2019: 1 3rd Qu.:82.45 3rd Qu.:58.70
## 8/14/2019: 1 Max. :86.30 Max. :63.20
## (Other) :25
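The split-then-map pattern used above may be easier to follow on a tiny example. The sketch below uses made-up toy rows (not values from climate.csv), and the names `toy`, `parts`, and `city_means` are illustrative only.

```r
library(purrr)

# Made-up toy rows, for illustration only
toy <- data.frame(
  City = c("A", "A", "B", "B", "B"),
  High = c(80, 84, 70, 72, 74)
)

# split() returns a named list with one data frame per city
parts <- split(toy, toy$City)
names(parts)

# map() applies a function to each element and returns a list of the same length
map(parts, nrow)

# Typed variants such as map_dbl() return a plain vector instead of a list
city_means <- map_dbl(parts, ~ mean(.x$High))
city_means
```

In the climate output, each per-city summary still lists all four cities with zero counts for the others. That is because City is a factor, and each piece produced by split() keeps the full set of factor levels.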