Week_3_Shah

Synopsis

This document explores the daily average temperature data hosted on the University of Dayton website, which is derived from National Climatic Data Center records. It describes the type of data hosted on the website, the variables included, and the missing data, and then explores the average temperature data for Cincinnati. First, we will examine the temperature drop as we Cincinnatians ease into the fall season. Then we will look at the November temperatures for every year on record to see what we can expect for the next month. Finally, we will look at the impact of global warming in Cincinnati over the last 20+ years by charting the average yearly temperature (2016 is excluded because the year is incomplete).

Source: http://academic.udayton.edu/kissock/http/Weather/

Packages Required

We will load dplyr and ggplot2. We will use plyr later to rename the columns, dplyr to average the temperature by year and ggplot2 for the graphs. We will not load plyr because several of its function names overlap with dplyr; instead, we will call it as plyr::rename() when it is needed.

library(dplyr)
library(ggplot2)

Source Code (describe variables)

The variables are the following:

Month: The month is an integer ranging from 1 to 12 (January through December). Originally called V1 and renamed in the next section.

Day: The day is an integer ranging from 1 to 31 (day of the month). Originally called V2 and renamed in the next section.

Year: The year ranges from 1995 to 2016. Originally called V3 and renamed in the next section.

Temperature(F): The daily average temperature in degrees Fahrenheit. The temperature is recorded every hour throughout the day by the National Climatic Data Center to form a daily average. Originally called V4 and renamed in the next section.

Full_Date: The full date is a concatenation of the year, month and day as YYYY-MM-DD. It is created in this script for graphing purposes and is not part of the data on the University of Dayton website.

Data Description (description with summary statistics)

The source contains daily average temperatures from 157 US cities and 167 international cities. Here we will examine the data from Cincinnati, OH, USA. The University of Dayton uses the hourly temperature, in degrees Fahrenheit, from the National Climatic Data Center to generate an average daily temperature.

The data is hosted and available for download for non-commercial and research purposes only.

Missing values are coded in the source data as “-99”, but we will convert this below to NA. Additionally, we will change the variable names to the names in the “Source Code” section.

The first 6 observations are displayed below. The structure shows 4 variables and their data types (three integer columns and one numeric column).

#pull data from website
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
df <- read.table(url)
#let's look at the first 6 rows and the structure
head(df) #the column names are V1-V4. We should change these to month/day/year/temperature.

##   V1 V2   V3   V4
## 1  1  1 1995 41.1
## 2  1  2 1995 22.2
## 3  1  3 1995 22.8
## 4  1  4 1995 14.9
## 5  1  5 1995  9.5
## 6  1  6 1995 23.8

str(df)

## 'data.frame':    7963 obs. of  4 variables:
##  $ V1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V2: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V3: int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ V4: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...

Using the summary function, we can see the range, 1st and 3rd quartiles, mean and median of each variable. At this point the variables are still named V1 to V4, and the minimum of V4 is -99, the missing-value code.

#let's summarize the data
summary(df)

##        V1               V2              V3             V4        
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-99.00  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.: 40.10  
##  Median : 6.000   Median :16.00   Median :2005   Median : 57.00  
##  Mean   : 6.479   Mean   :15.72   Mean   :2005   Mean   : 54.46  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.: 70.70  
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   : 89.20

#from the summary, we determine that V2 is the day, V1 is the month, etc.

Let’s rename the variables to be a bit more descriptive:

df <- plyr::rename(df, c("V1" = "Month", "V2" ="Day", "V3" = "Year", "V4" = "Temperature(F)"))
head(df)

##   Month Day Year Temperature(F)
## 1     1   1 1995           41.1
## 2     1   2 1995           22.2
## 3     1   3 1995           22.8
## 4     1   4 1995           14.9
## 5     1   5 1995            9.5
## 6     1   6 1995           23.8
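The same renaming can be done without plyr. The sketch below is a base R alternative, not the method used in this document. It uses three rows copied from the head() output above as inline text, and sample_txt, cols and sample_df are illustrative names.

```r
# Three sample rows copied from the head() output above, supplied as inline text
sample_txt <- "
1 1 1995 41.1
1 2 1995 22.2
1 3 1995 22.8
"

# read.table() names the columns V1-V4 by default
sample_df <- read.table(text = sample_txt)

# Assign descriptive names directly; no extra package is needed
cols <- c("Month", "Day", "Year", "Temperature(F)")
names(sample_df) <- cols

# Backticks are required because the name contains parentheses
sample_df$`Temperature(F)`
```

Assigning names() avoids the plyr/dplyr overlap entirely, at the cost of relying on column order rather than on the old names.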

We saw in the summary that we had values of “-99” in Temperature. From the data description we see that this is a missing value. Let’s convert this to NA so we can more easily look at the data (using mean() for example). We’ll see that many of the missing data points were in December 1998 and June 2002.

sum(df$`Temperature(F)` == -99) #compare against the number -99, not the string "-99"

## [1] 14

missing <- df[df$`Temperature(F)` == -99, ]
missing #1998 and 2002 account for 9 of the 14 missing data points, mostly in

##      Month Day Year Temperature(F)
## 1454    12  24 1998            -99
## 1455    12  25 1998            -99
## 1460    12  30 1998            -99
## 1461    12  31 1998            -99
## 1471     1  10 1999            -99
## 2726     6  18 2002            -99
## 2727     6  19 2002            -99
## 2728     6  20 2002            -99
## 2729     6  21 2002            -99
## 2807     9   7 2002            -99
## 2982     3   1 2003            -99
## 4623     8  28 2007            -99
## 5016     9  24 2008            -99
## 5213     4   9 2009            -99

#December 1998 and June 2002.
df$`Temperature(F)`[df$`Temperature(F)` == -99] <- NA
sum(is.na(df)) #check if -99 values were swapped to NA

## [1] 14
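To check where the missing values cluster, we can tabulate them by year and month. This is a small sketch that re-enters the Month and Year columns of the 14 missing rows printed above, so it runs on its own; missing_rows, by_year and by_year_month are illustrative names. On the full data, the same tables could be built from the rows where the temperature is NA.

```r
# Month and Year of the 14 missing observations, copied from the output above
missing_rows <- data.frame(
  Month = c(12, 12, 12, 12, 1, 6, 6, 6, 6, 9, 3, 8, 9, 4),
  Year  = c(1998, 1998, 1998, 1998, 1999, 2002, 2002, 2002, 2002, 2002,
            2003, 2007, 2008, 2009)
)

# Count of missing days per year
by_year <- table(missing_rows$Year)
by_year

# Cross-tabulation of year by month shows which months are affected
by_year_month <- table(Year = missing_rows$Year, Month = missing_rows$Month)
by_year_month

# Missing days that fall in 1998 or 2002
sum(by_year[c("1998", "2002")])
```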

Finally, for time series plots, let’s combine the year, month and day into a single column and convert it to the Date type with as.Date().

#combine year, month and day into a single Date column (YYYY-MM-DD)
df$Full_Date <- paste(df$Year, df$Month, df$Day, sep = "-")
df$Full_Date <- as.Date(df$Full_Date, format = "%Y-%m-%d")
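As a quick illustration of how the date conversion behaves, the sketch below builds dates from separate year, month and day values in a small hand-made data frame (parts is an illustrative name, and the rows are chosen only to show the behaviour). When an explicit format is given, as.Date() returns NA for a combination that is not a real calendar date, which makes bad rows easy to spot.

```r
# Hand-made example rows: a normal date, a leap day, and an impossible date
parts <- data.frame(
  Year  = c(1995L, 2016L, 2016L),
  Month = c(1L, 2L, 2L),
  Day   = c(1L, 29L, 30L)
)

# Zero-pad the pieces so the string matches YYYY-MM-DD exactly
date_strings <- sprintf("%04d-%02d-%02d", parts$Year, parts$Month, parts$Day)

# Parse with an explicit format; impossible dates become NA
full_date <- as.Date(date_strings, format = "%Y-%m-%d")

full_date
sum(is.na(full_date)) # number of rows that failed to parse
```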

Data Visualization

Let’s leverage the data to see how we can better prepare for future weather in Cincinnati.

In the first chart, we’ll look at a scatter plot of daily temperatures since the start of July 2016 to see how the weather has changed. The trend shows that October has been substantially colder (and more variable) than the previous months.

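#scatter plot with a smoothed trend for daily temperatures since July 1, 2016
df1 <- df[df$Full_Date > as.Date("2016-06-30"), ]
#refer to columns by name inside aes() instead of df1$...; backticks are needed for Temperature(F)
plot1 <- ggplot(data = df1, mapping = aes(x = Full_Date, y = `Temperature(F)`)) +
  geom_point(mapping = aes(color = factor(Month))) + geom_smooth()
plot1 + ggtitle("Daily temperature since July 1") +
  labs(x = "Day", y = "Daily temperature (F)", color = "Month")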
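A chart shows the pattern, but a per-month mean and standard deviation puts numbers on "colder" and "more variable". The sketch below uses a tiny made-up data frame (toy, with a temp column) purely to show the calculation; the values are not from the Cincinnati data. On the real data, the same tapply() calls would use the Month and Temperature(F) columns of df1.

```r
# Made-up temperatures for two months, for illustration only
toy <- data.frame(
  Month = c(7, 7, 7, 10, 10, 10),
  temp  = c(78, 80, 82, 50, 60, 70)
)

# Mean and standard deviation of temperature within each month
monthly_mean <- tapply(toy$temp, toy$Month, mean, na.rm = TRUE)
monthly_sd   <- tapply(toy$temp, toy$Month, sd, na.rm = TRUE)

# Collect both statistics in one table
monthly_stats <- data.frame(
  Month = names(monthly_mean),
  mean  = as.vector(monthly_mean),
  sd    = as.vector(monthly_sd)
)
monthly_stats
```

A lower mean together with a larger standard deviation for a month is what "colder and more variable" looks like in this table.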

In the second plot, we’ll look at what to expect in November using boxplots, one for each November in the data (since 1995). We should expect a temperature in the high 30s to low 40s on average, but it will likely range from 20 to 60 degrees.

#let's look at a boxplot of every November in this dataset over the years
df2 <- df[df$Month == 11, ]
#one box per year; columns are referenced by name inside aes()
plot2 <- ggplot(data = df2, mapping = aes(x = Year, y = `Temperature(F)`)) +
  geom_boxplot(aes(group = Year))
plot2 + ggtitle("Temperature (F) in November over Years") +
  labs(x = "Year", y = "Temperature Average (F)")

Finally, let’s see how global warming is impacting us and will impact us in the future by looking at the average temperature for each year and putting a trend line over the data. On average, the temperature has increased 0.7 degrees Fahrenheit per decade, leaving out the incomplete year 2016.

#finally, let's look at a scatter plot of the average yearly temperature with a linear trend line
#2016 is dropped because the year is incomplete; Year stays numeric so lm can fit against it
df3 <- df %>%
  filter(Year != 2016) %>%
  group_by(Year) %>%
  summarise(avg = mean(`Temperature(F)`, na.rm = TRUE))
plot3 <- ggplot(data = df3, mapping = aes(x = Year, y = avg)) + geom_point()
plot3 + geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Average temperature (F) per year") +
  labs(x = "Year", y = "Temperature Average (F)")
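A per-decade figure like the one quoted above can be read off a linear fit of yearly average against year. The sketch below shows the calculation on synthetic values that are exactly linear at 0.07 degrees per year, chosen so the result is predictable; they are not the Cincinnati averages and do not confirm the figure in the text. The names yearly, fit and slope_per_decade are illustrative.

```r
# Synthetic yearly averages: exactly linear, rising 0.07 degrees per year
yearly <- data.frame(Year = 1995:2015)
yearly$avg <- 53 + 0.07 * (yearly$Year - 1995)

# Fit a straight line of average temperature against year
fit <- lm(avg ~ Year, data = yearly)

# The Year coefficient is the change per year; multiply by 10 for per decade
slope_per_decade <- unname(coef(fit)["Year"]) * 10
slope_per_decade
```

On the real data, the same lm() call could be run on the yearly averages, provided the Year column is numeric rather than a factor.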