Synopsis

This R Markdown file contains the solutions to assignment #3. In this file, I will examine daily average temperature data for Cincinnati, Ohio going back to 1995. I will give general summary statistics, followed by three different visualizations of the data.

Based on the initial findings, average daily temperatures in Cincinnati do appear to be rising slightly. The data also indicate a high level of seasonality in the temperatures, which, as I’m from Cincinnati, I could have confirmed from experience.

Packages Required

In this analysis, only the tidyverse package was required. Loading tidyverse attaches ggplot2, which provides the ggplot() function; I find it to be the best data visualization tool in R.

## Load required libraries.
library(tidyverse)  ## This package enables the data visualizations created further down in this script.

Source Code

In this exercise, I will be pulling data from http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt. The data contain four variables: the month, the day, and the year the sample was taken, as well as the 24-hour average temperature for that day. Below is the code used to import the data.

## Import the Data ##
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
mydata_temps <- read.table(url, header = FALSE, col.names = c("Month", "Day", "Year", "AvgTemp"))
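
As a side note on the import step, read.table() can also mark missing values while reading. The sketch below is only an illustration: it reads a few inline rows instead of the real URL, the rows are partly made up, and it assumes the missing-value code appears in the file exactly as the string -99.

```r
# Illustrative rows in the same four-column layout as the station file.
# The -99 row is made up to show how a missing reading would be handled.
sample_text <- "
 1  1  1995  41.1
 1  2  1995  22.2
 1  3  1995  -99
 1  4  1995  14.9
"

# na.strings lists the raw strings that read.table() should convert to NA.
temps_sample <- read.table(
  text = sample_text,
  header = FALSE,
  col.names = c("Month", "Day", "Year", "AvgTemp"),
  na.strings = "-99"
)

# Count the missing temperature readings in the sample.
sum(is.na(temps_sample$AvgTemp))
```

With this approach the -99 values never enter the data as real temperatures, so a later recoding step cannot be forgotten.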

Data Description

To investigate the structure of the data, I will use the str() function as demonstrated below.

## Investigate the structure of the data set.
str(mydata_temps)
## 'data.frame':    7963 obs. of  4 variables:
##  $ Month  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year   : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ AvgTemp: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...

As one can see from the results, the dataset has 7,963 observations of four variables. The first thing to notice is that Month, Day, and Year are all stored as integers rather than as categorical values. I will fix this using the code below.

## Change day, month, and year to categorical variables.
mydata_temps$Month <- factor(mydata_temps$Month)
mydata_temps$Day <- factor(mydata_temps$Day)
mydata_temps$Year <- factor(mydata_temps$Year)
str(mydata_temps)
## 'data.frame':    7963 obs. of  4 variables:
##  $ Month  : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Day    : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Year   : Factor w/ 22 levels "1995","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ AvgTemp: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...

Now, we can see that Month, Day, and Year have been recategorized as factors instead of integers. Every month and every day of the month is represented in the data, along with 22 years.
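
One thing to keep in mind after this conversion: a factor stores integer level codes internally, so turning it back into numbers needs an intermediate step. The toy factor below is made up for illustration and is not taken from the dataset.

```r
# Toy factor built the same way as Year above (values are illustrative).
year_factor <- factor(c(1995, 1995, 1996, 2016))

# as.integer() on a factor returns the internal level codes.
as.integer(year_factor)                 # 1 1 2 3

# Going through as.character() recovers the original values.
as.integer(as.character(year_factor))   # 1995 1995 1996 2016
```

This matters if Year is later needed as a number, for example on the x-axis of a trend line.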

Next, we will investigate whether or not there are any missing values. In this particular dataset, missing temperatures are coded as -99, so we will first need to replace -99 with NA, count these values, and then omit them from the dataset.

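## Replace the -99 missing-value code in AvgTemp with NA and count the NAs.
mydata_temps$AvgTemp[mydata_temps$AvgTemp == -99] <- NA
sum(is.na(mydata_temps))
## [1] 14
## Remove missing data. na.omit() returns a new data frame, so the result must be
## assigned; it is stored under a new name here, and mydata_temps keeps its NA rows.
mydata_complete <- na.omit(mydata_temps)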

As one can see, only 14 of the 7,963 daily observations are missing a temperature reading.
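
A detail worth checking when dropping rows is that na.omit() returns a new data frame rather than changing the one passed to it. The small example below uses a made-up data frame (toy and toy_complete are illustrative names) to show the difference between calling it and assigning its result.

```r
# Toy data frame with one missing temperature (values are illustrative).
toy <- data.frame(
  Day = 1:4,
  AvgTemp = c(41.1, NA, 22.8, 14.9)
)

na.omit(toy)   # prints the complete rows, but toy itself is unchanged
nrow(toy)      # still 4

# Assign the result to keep the filtered rows.
toy_complete <- na.omit(toy)
nrow(toy_complete)   # 3

# Alternative: keep the NA rows and skip them per calculation.
mean(toy$AvgTemp, na.rm = TRUE)
```

Either approach works; the second keeps the full calendar of days intact, which can be convenient when counting observations per month or year.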

Now, we will produce some general summary statistics for the dataset. In this case, AvgTemp is the only variable of interest, as the others are categorical and their summaries are simply counts. As one can see, the average daily temperature in Cincinnati ranges from -2.2 degrees Fahrenheit to 89.2 degrees Fahrenheit with a mean of 54.73 degrees; the 14 missing values are reported in the NA's row.

## Summary statistics
summary(mydata_temps)
##      Month           Day            Year         AvgTemp     
##  1      : 682   1      : 262   1996   : 366   Min.   :-2.20  
##  3      : 682   2      : 262   2000   : 366   1st Qu.:40.20  
##  5      : 682   3      : 262   2004   : 366   Median :57.10  
##  7      : 682   4      : 262   2008   : 366   Mean   :54.73  
##  8      : 682   5      : 262   2012   : 366   3rd Qu.:70.70  
##  10     : 670   6      : 262   1995   : 365   Max.   :89.20  
##  (Other):3883   (Other):6391   (Other):5768   NA's   :14
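
The overall summary can be broken down by year in a similar way. The following is a minimal sketch using base R's tapply() on a made-up data frame with the same column names; toy_temps and yearly_summary are illustrative names, and the numbers are not from the Cincinnati data.

```r
# Toy data in the same shape as mydata_temps (values are made up).
toy_temps <- data.frame(
  Year = factor(c(1995, 1995, 1995, 1996, 1996, 1996)),
  AvgTemp = c(40, 50, NA, 42, 52, 62)
)

# Mean temperature per year, skipping missing readings.
yearly_mean <- tapply(toy_temps$AvgTemp, toy_temps$Year, mean, na.rm = TRUE)

# Number of non-missing readings per year.
yearly_n <- tapply(!is.na(toy_temps$AvgTemp), toy_temps$Year, sum)

yearly_summary <- data.frame(
  Year = names(yearly_mean),
  MeanTemp = as.vector(yearly_mean),
  Readings = as.vector(yearly_n)
)
yearly_summary
```

Keeping the count of readings next to each mean makes it easy to spot years with incomplete data before comparing them.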

Data Visualization

First, we will start with a simple bar chart of the mean temperature for each year from 1995 through 2016. As one can see, the annual mean does appear to be increasing slightly. Please note that the 2016 bar is likely inflated, as we only have data for part of that year, which includes the typically warm months but not the end of the year. This is itself an effect of the seasonality examined in the next plots.

## Plot 1: Bar chart for the annual average temperature from 1995-2016.
ggplot(data = mydata_temps) + 
  geom_bar(mapping = aes(x = Year, y = AvgTemp), stat = "summary", fun = "mean")  ## fun replaces the deprecated fun.y (ggplot2 3.3.0 and later)
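
A bar chart only suggests a trend by eye. One way to put a rough number on it is a simple linear regression of the annual means on the year, leaving out any year with incomplete data. The sketch below shows the mechanics only: the annual means are made up, and annual, fit, slope, and slope_ci are illustrative names rather than part of the original analysis.

```r
# Made-up annual means (degrees Fahrenheit), purely to show the mechanics.
# In the real analysis these would be computed from mydata_temps.
annual <- data.frame(
  Year = 1995:2004,
  MeanTemp = c(53.1, 52.4, 53.0, 55.2, 54.1, 53.6, 54.4, 54.0, 53.2, 54.5)
)

# Fit MeanTemp = a + b * Year; b estimates the change in degrees per year.
fit <- lm(MeanTemp ~ Year, data = annual)
slope <- coef(fit)[["Year"]]

# 95% confidence interval for the slope.
slope_ci <- confint(fit)["Year", ]

slope
slope_ci
```

If the confidence interval for the slope includes zero, the apparent increase could be noise, so the interval is worth reporting alongside the chart.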

Next, I wanted to investigate the seasonal swings in temperature further. To do this, I plotted the average temperature by month for each year. One can see that every panel follows the same general pattern: temperatures rise through the spring, peak in the summer months, and taper off as winter approaches.

## Plot 2: Bar chart of Average Temperature by Month faceted by year.
ggplot(data = mydata_temps) + 
  geom_bar(mapping = aes(x = Month, y = AvgTemp), stat = "summary", fun = "mean") +  ## fun replaces the deprecated fun.y
  facet_wrap(~ Year, nrow = 4)

Finally, to confirm the seasonality of Cincinnati temperatures, I created a boxplot for each month using all 22 years' worth of data. This graph did confirm the seasonality, but it also revealed an interesting facet of Cincinnati weather. We typically have less variability in temperature during the summer months and more variability in the cooler months. This lends itself to the joke that in Cincinnati, you can experience all four seasons in one day, especially in the cooler months!

## Plot 3: Monthly boxplot.
ggplot(data = mydata_temps) + 
  geom_boxplot(mapping = aes(x = Month, y = AvgTemp))
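
The spread that the boxplot shows visually can also be summarized as a number per month. The following sketch computes the standard deviation and interquartile range by month on a made-up data frame; toy_monthly and its values are illustrative only and were chosen so that the "winter" month varies more than the "summer" month.

```r
# Toy data: month 1 varies widely, month 7 varies little (values are made up).
toy_monthly <- data.frame(
  Month = factor(c(1, 1, 1, 7, 7, 7)),
  AvgTemp = c(10, 30, 50, 74, 76, 78)
)

# Standard deviation of the daily average temperature within each month.
monthly_sd <- tapply(toy_monthly$AvgTemp, toy_monthly$Month, sd, na.rm = TRUE)

# Interquartile range, which corresponds to the height of each box.
monthly_iqr <- tapply(toy_monthly$AvgTemp, toy_monthly$Month, IQR, na.rm = TRUE)

monthly_sd
monthly_iqr
```

Applied to the real data, a larger value for the cooler months than for the summer months would back up what the boxplot suggests.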

Conclusion

Many Cincinnatians could speak to the seasonality of our weather here, so the results of this analysis come as no surprise. However, this same R Markdown file could be used to analyze the seasonality of any city listed on http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm by changing the URL in the import step. This could be especially handy when visiting places you have never been before.
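
To make that reuse easier, the import and missing-value steps could be wrapped in a function. The helper below, read_city_temps(), is hypothetical and not part of the original assignment; it assumes every city file uses the same four-column layout and the same -99 missing-value code as the Cincinnati file. It is demonstrated on a temporary local file with partly made-up rows rather than on a real URL.

```r
# Hypothetical helper: reads a file in the Month/Day/Year/AvgTemp layout
# from a URL or a local path and recodes the -99 marker as NA.
read_city_temps <- function(path_or_url) {
  temps <- read.table(
    path_or_url,
    header = FALSE,
    col.names = c("Month", "Day", "Year", "AvgTemp")
  )
  temps$AvgTemp[temps$AvgTemp == -99] <- NA
  temps
}

# Demonstration on a temporary local file (the -99 row is made up).
demo_file <- tempfile(fileext = ".txt")
writeLines(c(" 1  1  1995  41.1", " 1  2  1995  -99"), demo_file)

demo_temps <- read_city_temps(demo_file)
demo_temps
```

Passing a different city's URL to the function would then be the only change needed before rerunning the summaries and plots.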