Three’s Company!

Synopsis

This week the aim is to import the Cincinnati weather data set and describe what each variable in it measures. I will then characterize the data set as a whole by looking at summary statistics and checking for missing values. Lastly, I’ll draw three different visualizations in R and explain what each one shows.

Packages Required

The only package required for this analysis is “tidyverse”.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Source Code

This data set contains the 24-hour average daily temperature for Cincinnati from 1/1/1995 through 10/19/2016. The “month” variable holds the numbers 1-12 for the 12 months (Jan-Dec). The “day” variable is the day of that month. The “year” variable is the calendar year. The “temp” variable is the average temperature recorded for that day.

url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
weather <- read.table(url)
colnames(weather) <- c("month", "day", "year", "temp") 
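As a small variation, the column names can be supplied while reading through the `col.names` argument of `read.table()`, which removes the separate `colnames()` step. The sketch below assumes the file is whitespace-separated with four columns, and it uses a few made-up sample rows (not real observations) in place of the download so that it runs offline. The names `sample_text` and `weather_sample` are only for illustration.

```r
# Made-up sample rows in the same four-column layout (not real observations)
sample_text <- "
 1   1   1995   41.1
 1   2   1995   22.2
 1   3   1995   -99
"

# col.names labels the columns while reading, so no colnames() call is needed
weather_sample <- read.table(text = sample_text,
                             col.names = c("month", "day", "year", "temp"))

# Check the column names and types that were read in
str(weather_sample)
```

For the real file, the `text = sample_text` argument would be replaced by the `url` defined above.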

Data description

Number of observations

nrow(weather)
## [1] 7963
dim(weather)
## [1] 7963    4

Number of variables and their names (which I assigned in the “Source Code” section).

ncol(weather)
## [1] 4
colnames(weather)
## [1] "month" "day"   "year"  "temp"

Any missing values? The file codes them as -99, so I recode those to NA before counting.

weather$temp[weather$temp == -99] <- NA
sum(is.na(weather))
## [1] 14
weather[!complete.cases(weather),]
##      month day year temp
## 1454    12  24 1998   NA
## 1455    12  25 1998   NA
## 1460    12  30 1998   NA
## 1461    12  31 1998   NA
## 1471     1  10 1999   NA
## 2726     6  18 2002   NA
## 2727     6  19 2002   NA
## 2728     6  20 2002   NA
## 2729     6  21 2002   NA
## 2807     9   7 2002   NA
## 2982     3   1 2003   NA
## 4623     8  28 2007   NA
## 5016     9  24 2008   NA
## 5213     4   9 2009   NA
weathered <- na.omit(weather)
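To see where the gaps fall rather than just how many there are, the missing readings can be tabulated by year. This is a minimal sketch on a made-up toy data frame (the values are not real observations, and `toy` and `missing_by_year` are hypothetical names). It also shows `na.rm = TRUE` as one alternative to dropping rows with `na.omit()`.

```r
# Toy data in the same layout; -99 marks a missing reading (values are made up)
toy <- data.frame(
  month = c(12, 12, 1, 6),
  day   = c(24, 25, 10, 18),
  year  = c(1998, 1998, 1999, 2002),
  temp  = c(-99, 30.5, -99, 71.2)
)

# Recode the sentinel value to NA, as done for the full data set
toy$temp[toy$temp == -99] <- NA

# Count the missing readings per year
missing_by_year <- table(toy$year[is.na(toy$temp)])
missing_by_year

# Alternative to dropping rows: keep them and skip NAs inside the calculation
mean(toy$temp, na.rm = TRUE)
```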

Some summary statistics.

weathered$month <- factor(weathered$month)
weathered$day <- factor(weathered$day)
weathered$year <- factor(weathered$year)
summary(weathered)
##      month           day            year           temp      
##  5      : 682   2      : 262   1996   : 366   Min.   :-2.20  
##  7      : 682   3      : 262   2000   : 366   1st Qu.:40.20  
##  1      : 681   4      : 262   2004   : 366   Median :57.10  
##  3      : 681   5      : 262   2012   : 366   Mean   :54.73  
##  8      : 681   6      : 262   1995   : 365   3rd Qu.:70.70  
##  10     : 670   8      : 262   1997   : 365   Max.   :89.20  
##  (Other):3872   (Other):6377   (Other):5755
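Because month, day, and year are factors, `summary()` only reports how many rows fall in each level. A grouped summary is one way to see how temperature itself varies by month. The sketch below uses base R's `aggregate()` on a made-up toy data frame (the values and the names `toy` and `monthly` are assumptions for illustration).

```r
# Toy data: two made-up readings each for January (1) and July (7)
toy <- data.frame(
  month = factor(c(1, 1, 7, 7)),
  temp  = c(30, 34, 76, 80)
)

# Mean temperature for each month
monthly <- aggregate(temp ~ month, data = toy, FUN = mean)
monthly
```

On the full data, the same call with `data = weathered` should give one mean per month.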

Visualization #1

Here I show how often each temperature value occurs across all years, with one panel per month.

ggplot(data = weathered) + 
  geom_bar(mapping = aes(x = temp), show.legend = FALSE) + 
  facet_wrap(~ month)
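Since temp is continuous, `geom_bar()` draws one thin bar for every distinct value. If wider bins are easier to read, `geom_histogram()` with a `binwidth` is a possible alternative. The sketch below runs on simulated toy data, and the 5-degree bin width is an arbitrary choice, not something taken from the analysis above.

```r
library(ggplot2)  # already attached when tidyverse is loaded

# Simulated toy temperatures for two months (not real observations)
set.seed(1)
toy <- data.frame(
  month = factor(rep(c(1, 7), each = 50)),
  temp  = c(rnorm(50, mean = 30, sd = 8), rnorm(50, mean = 75, sd = 5))
)

# Group temperatures into 5-degree bins, one panel per month
p_hist <- ggplot(data = toy) +
  geom_histogram(mapping = aes(x = temp), binwidth = 5) +
  facet_wrap(~ month)

p_hist
```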

Visualization #2

This is a scatter plot of average daily temperature against year, with the points colored by month.

ggplot(data = weathered) + 
  geom_point(mapping = aes(x = year, y = temp, color = month ))
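With year stored as a factor, all of a year's points stack in a single column, so many of them overlap. One possible variation is a box plot per year, which summarizes each column's spread. This is only a sketch on simulated toy data; `toy` and `p_box` are hypothetical names.

```r
library(ggplot2)  # already attached when tidyverse is loaded

# Simulated toy temperatures for two years (not real observations)
set.seed(2)
toy <- data.frame(
  year = factor(rep(c(1995, 1996), each = 40)),
  temp = c(rnorm(40, mean = 54, sd = 18), rnorm(40, mean = 55, sd = 18))
)

# One box per year showing the median, quartiles, and outliers
p_box <- ggplot(data = toy) +
  geom_boxplot(mapping = aes(x = year, y = temp))

p_box
```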

Visualization #3

Here I plotted every day on a year-by-month grid, with a continuous color scale showing the temperature. Days in the same month and year share a position, so their points overlap.

ggplot(data = weathered) + 
  geom_point(mapping = aes(x = year, y = month, color = temp ))
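The code above relies on ggplot2's default continuous color scale, which is a single-hue gradient. For an explicitly diverging scheme, `scale_color_gradient2()` accepts `low`, `mid`, and `high` colors plus a `midpoint`. The sketch below uses a made-up year-by-month grid, and both the colors and the midpoint of 55 are assumptions chosen for illustration.

```r
library(ggplot2)  # already attached when tidyverse is loaded

# Made-up grid with one value per year/month cell (not real observations)
toy <- expand.grid(year = factor(1995:1997), month = factor(1:12))
# Smooth seasonal curve: coldest in month 1 (30), warmest in month 7 (80)
toy$temp <- 55 - 25 * cos((as.integer(toy$month) - 1) * pi / 6)

# Diverging scale: blue below the midpoint, white at it, red above it
p_div <- ggplot(data = toy) +
  geom_point(mapping = aes(x = year, y = month, color = temp), size = 4) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red",
                        midpoint = 55)

p_div
```

On the real data, a value such as the overall mean or median of temp could serve as the midpoint.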