This R Markdown file contains the solutions to assignment #3. In this file, I will examine daily average temperature data for Cincinnati, Ohio going back to 1995. General summary statistics will be given and followed by three different visualizations of the data.
Based on the initial findings, average daily temperatures in Cincinnati appear to be rising slightly. The data also shows a high level of seasonality, which, as someone from Cincinnati, I can confirm from experience.
In this analysis, only the tidyverse package was required. Loading tidyverse attaches ggplot2, which provides the ggplot() function used for all of the visualizations below.
## Load required libraries.
library(tidyverse) ## This package enables the data visualizations created further down in this script.
In this exercise, I will be pulling data from http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt. The data contains four variables: the month, the day, and the year the sample was taken, as well as the 24-hour average temperature for that day. Below is the code used to import the data.
## Import the Data ##
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
mydata_temps <- read.table(url, header = FALSE, col.names = c("Month", "Day", "Year", "AvgTemp"))
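To see how read.table() assigns the column names without depending on the remote file, here is a minimal sketch that parses three inline rows. The values are the first three observations shown in the str() output in this report. The exact spacing of the source file is an assumption (whitespace-separated columns), and the names `sample_lines` and `sample_temps` are illustrative only.

```r
# Three rows in the assumed whitespace-separated layout
# (values taken from the first observations in the str() output).
sample_lines <- c(
  " 1  1  1995  41.1",
  " 1  2  1995  22.2",
  " 1  3  1995  22.8"
)

# read.table() splits on whitespace by default; col.names labels the unnamed columns.
sample_temps <- read.table(
  text = sample_lines,
  header = FALSE,
  col.names = c("Month", "Day", "Year", "AvgTemp")
)

str(sample_temps)
```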
To investigate the structure of the data, I will use the str() function as demonstrated below.
## Investigate the structure of the data set.
str(mydata_temps)
## 'data.frame': 7963 obs. of 4 variables:
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ AvgTemp: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
As one can see from the results, the dataset contains 7,963 observations of four variables. Month, Day, and Year are all stored as integers rather than as categorical values. I will fix this using the code below.
## Change day, month, and year to categorical variables.
mydata_temps$Month <- factor(mydata_temps$Month)
mydata_temps$Day <- factor(mydata_temps$Day)
mydata_temps$Year <- factor(mydata_temps$Year)
str(mydata_temps)
## 'data.frame': 7963 obs. of 4 variables:
## $ Month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : Factor w/ 22 levels "1995","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ AvgTemp: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
Now, we can see that Month, Day, and Year have been recategorized as factors instead of integers. All 12 months and all 31 days of the month are represented in the data, along with 22 years.
Next, we will investigate whether or not there are any missing values. In this particular dataset, missing temperature readings are coded as -99, so we will first need to replace -99 with NA, count these values, and then omit them from the dataset.
## Replace the -99 sentinel in AvgTemp with NA and count the missing values.
mydata_temps$AvgTemp[mydata_temps$AvgTemp == -99] <- NA
sum(is.na(mydata_temps$AvgTemp))
## [1] 14
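## Remove missing data from dataset.
## Note: na.omit() returns a filtered copy; the result is not assigned here, so mydata_temps itself is unchanged.
na.omit(mydata_temps)
As one can see, only 14 of the 7,963 temperature readings are missing. Because the result of na.omit() is not assigned back to mydata_temps, these 14 rows remain in the data frame, which is why the summary further down still lists them as NA's.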
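na.omit() returns a new data frame rather than modifying its argument, so the rows are only dropped if the result is stored. The sketch below illustrates this on a made-up four-row data frame; `toy` and `toy_clean` are hypothetical names and the values are not from the Cincinnati data.

```r
# Toy data frame with one missing temperature (values are made up for illustration).
toy <- data.frame(Day = 1:4, AvgTemp = c(41.1, NA, 22.8, 14.9))

# Calling na.omit() without assignment leaves `toy` unchanged.
na.omit(toy)
nrow(toy)        # still 4 rows

# Assigning the result keeps only the complete rows.
toy_clean <- na.omit(toy)
nrow(toy_clean)  # 3 rows
```

Storing the result under a new name keeps the original data frame available for comparison; overwriting it in place would work equally well.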
Now, we will produce some general summary statistics for the dataset. In this case, AvgTemp is the only variable of interest, as the others are categorical. As one can see, the average daily temperature in Cincinnati ranges from -2.2 degrees to 89.2 degrees, with a mean of 54.73 degrees and a median of 57.1 degrees.
## Summary statistics
summary(mydata_temps)
##      Month           Day            Year         AvgTemp
##  1      : 682   1      : 262   1996   : 366   Min.   :-2.20
##  3      : 682   2      : 262   2000   : 366   1st Qu.:40.20
##  5      : 682   3      : 262   2004   : 366   Median :57.10
##  7      : 682   4      : 262   2008   : 366   Mean   :54.73
##  8      : 682   5      : 262   2012   : 366   3rd Qu.:70.70
##  10     : 670   6      : 262   1995   : 365   Max.   :89.20
##  (Other):3883   (Other):6391   (Other):5768   NA's   :14
First, we will start with a bar chart of the average temperature for each year from 1995 through 2016. As one can see, the annual average does appear to be increasing slightly. Please note that 2016 appears slightly inflated because the data covers only part of that year, including the typically warm months. This is itself an indication of seasonality.
## Plot 1: Bar chart for the annual average temperature from 1995-2016.
## fun replaces the fun.y argument, which is deprecated as of ggplot2 3.3.0.
ggplot(data = mydata_temps) +
  geom_bar(mapping = aes(x = Year, y = AvgTemp), stat = "summary", fun = "mean")
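The bar heights in this chart are per-year means of AvgTemp. As a rough sketch of the same calculation in tabular form, the example below uses base R's aggregate() on made-up values (not the Cincinnati data); `toy` and `annual_means` are hypothetical names.

```r
# Made-up temperatures for two years, including one missing value.
toy <- data.frame(
  Year = factor(c(1995, 1995, 1996, 1996)),
  AvgTemp = c(40, 60, 42, NA)
)

# The formula interface of aggregate() drops rows containing NA by default
# (na.action = na.omit), so the 1996 mean is based on one value only.
annual_means <- aggregate(AvgTemp ~ Year, data = toy, FUN = mean)

annual_means
```

Printing the means alongside the chart makes it easier to judge how large the year-to-year differences actually are.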
Next, I wanted to investigate the seasonal swings in temperature further. To do this, I plotted the average temperature by month for each year. Again, one can see that every panel follows the same general pattern: temperatures peak in the summer months, taper off as winter approaches, and rise again in the spring.
## Plot 2: Bar chart of Average Temperature by Month faceted by year.
ggplot(data = mydata_temps) +
  geom_bar(mapping = aes(x = Month, y = AvgTemp), stat = "summary", fun = "mean") +
  facet_wrap(~ Year, nrow = 4)
Finally, to confirm the seasonality of Cincinnati temperatures, I created a boxplot for each month using all 22 years' worth of data. This graph did confirm the seasonality, but it also revealed an interesting facet of Cincinnati weather: we typically have less variability in temperature during the summer months and more variability in the cooler months. This lends itself to the joke that in Cincinnati, you can experience all four seasons in one day, especially in the cooler months!
## Plot 3: Monthly boxplot.
ggplot(data = mydata_temps) +
  geom_boxplot(mapping = aes(x = Month, y = AvgTemp))
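The difference in spread that the boxplots show could also be expressed as a number, for example a per-month standard deviation. The sketch below is one way to do that with tapply(), using made-up values for a cool month and a warm month rather than the Cincinnati data; `toy` and `monthly_sd` are hypothetical names.

```r
# Made-up temperatures: month 1 varies widely, month 7 varies little.
toy <- data.frame(
  Month = factor(c(1, 1, 1, 7, 7, 7)),
  AvgTemp = c(10, 30, 50, 74, 76, 78)
)

# Standard deviation of AvgTemp within each month; na.rm = TRUE skips missing readings.
monthly_sd <- tapply(toy$AvgTemp, toy$Month, sd, na.rm = TRUE)

monthly_sd
```

A larger standard deviation for a month corresponds to a taller box and longer whiskers in the plot.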
Many Cincinnatians could speak to the seasonality of our weather here, so the results of this analysis come as no surprise. However, this same R Markdown file could be used to analyze the seasonality of any city listed on http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm. This could be especially handy when visiting places you have never been before.
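One way to reuse the analysis for another city would be to build the data URL from a station code. The helper below is a hypothetical sketch: `station_url()` is not part of any package, and it assumes that other cities follow the same path and file naming as the Cincinnati file (OHCINCIN.txt), which should be checked against the city list before use.

```r
# Hypothetical helper: builds the data URL for a station code such as "OHCINCIN".
# Assumes other cities use the same directory and .txt naming as the Cincinnati file.
station_url <- function(station_code) {
  paste0(
    "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/",
    station_code,
    ".txt"
  )
}

# Reproduces the URL used in this report.
station_url("OHCINCIN")
```

The returned string can be passed to read.table() in place of the hard-coded url variable.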