Synopsis

This R Markdown document explores daily weather data for Cincinnati from 1995 to 2016. It presents basic summary statistics, along with plots of temperature trends at the year, month, and day levels.

Packages Required

library(knitr) # Provides read_chunk() to pull code chunks from external R scripts
library(ggplot2) # To create visualizations
library(dplyr) # Package for data manipulation
read_chunk("Week_3/Week_3.R") # Use the script for the Week 3 assignment

Structure of the Input data

#Download the weather data from the University of Dayton website
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
weather <- read.table(url)
colnames(weather) <- c("month","day","year","temp")
str(weather)
## 'data.frame':    7963 obs. of  4 variables:
##  $ month: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ temp : num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
# The data codes missing values as -99; convert them to NA so they are easier to handle
weather[weather == -99] <- NA
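for (i in seq_len(nrow(weather))) {
  if (is.na(weather$temp[i])) {
    # Clamp the window to the data so rows near the start or end do not produce invalid indices
    j <- max(1, i - 2):min(nrow(weather), i + 2)
    weather$temp[i] <- mean(weather$temp[j], na.rm = TRUE)
  }
}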
# Missing values are imputed with the mean of the readings from the 2 days before and after
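
The imputation step above fills each missing temperature with the mean of the readings up to two days on either side. Here is a minimal sketch of the same idea as a reusable function, run on a made-up vector. The name `impute_window`, its `k` argument, and the toy values are illustrative assumptions, not part of the original script. Unlike the loop, this version reads only from the original vector, so an imputed value is never reused to impute a neighbour.

```r
# Hypothetical helper: replace each NA with the mean of the non-missing
# values within `k` positions on either side (taken from the original vector).
impute_window <- function(x, k = 2) {
  n <- length(x)
  filled <- x
  for (i in which(is.na(x))) {
    window <- max(1, i - k):min(n, i + k)  # clamp the window to the vector bounds
    filled[i] <- mean(x[window], na.rm = TRUE)
  }
  filled
}

# Toy data, not taken from the Cincinnati file
toy <- c(NA, 30, 32, NA, 36, 38)
impute_window(toy)
```

If every value in a window is missing, `mean(..., na.rm = TRUE)` returns `NaN`, so long gaps would still need separate handling.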

# Let's look at the temperature variable at the year, month, and day-of-month levels
per_year <- weather %>% group_by(year) %>% summarise(avg_temp = round(mean(temp, na.rm = TRUE), 2))
per_month <- weather %>% group_by(month) %>% summarise(avg_temp = round(mean(temp, na.rm = TRUE), 2))
per_day <- weather %>% group_by(day) %>% summarise(avg_temp = round(mean(temp, na.rm = TRUE), 2))

Details about the Data

Data Description

Missing Value Analysis

## Number of missing values in each variable
sapply(weather, function(x) sum(is.na(x)))
## month   day  year  temp 
##     0     0     0     0
## Summary statistics for each variable
summary(weather)
##      month             day             year           temp      
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-2.20  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.:40.20  
##  Median : 6.000   Median :16.00   Median :2005   Median :57.10  
##  Mean   : 6.479   Mean   :15.72   Mean   :2005   Mean   :54.72  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.:70.70  
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   :89.20

Summary by Year

The data show 2016 as the hottest year, but only because the 2016 records do not yet include the colder months at the end of the year. The trend line, though, shows an upward trend in temperature over the years.

p1 <- ggplot(data = per_year, aes(year, avg_temp)) + geom_line() + 
  geom_smooth(method = "lm", color = "blue", se = F)
p1 + ggtitle("Temperatures by Year") + xlab("Year") + ylab("Temperature")
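
The partial-year caveat can be checked by counting the daily records in each year. The sketch below does this on synthetic dates rather than the Cincinnati file; the names `toy`, `obs_per_year`, and `complete_years`, and the 365-row threshold, are assumptions made for illustration.

```r
# Synthetic daily records: two full years plus a partial third year
dates <- seq(as.Date("2014-01-01"), as.Date("2016-06-30"), by = "day")
toy <- data.frame(year = as.integer(format(dates, "%Y")))

# Count daily records per year; a full year has 365 or 366 rows
obs_per_year <- table(toy$year)
obs_per_year

# Keep only the years with a full set of daily records
complete_years <- as.integer(names(obs_per_year)[as.vector(obs_per_year) >= 365])
complete_years
```

Applied to the real data, a filter like this would let the yearly averages and the trend line be computed on complete years only, which is one way to keep a partial year from skewing the comparison.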

Summary by Month

As expected, the monthly data show nothing out of the ordinary: July is the hottest month and January the coldest.

p2 <- ggplot(data = weather, aes(factor(month), temp)) + geom_boxplot() + stat_boxplot(geom = "errorbar")
p2 + ggtitle("Temperature by Month") + xlab("Month") + ylab("Temperature")

Summary by Day of the Month

On average, the last day of the month is the hottest day of the month over the 22 years covered by the data.

p3 <- ggplot(data = per_day, aes(day, avg_temp)) + geom_line()
p3 +  ggtitle("Temperature by Day of Month") + xlab("Day of Month") + ylab("Temperature") + 
      scale_y_continuous(breaks = seq(53.5,57,0.5))
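
When reading the day-of-month averages, it may help to remember that days 29 to 31 do not occur in every month, so their averages are taken over a different mix of months than days 1 to 28. This small sketch counts how many months contain each day of the month in a non-leap year; 2015 and the variable names are chosen only for illustration.

```r
# All dates in one non-leap year (2015 is an arbitrary example)
dates <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
day   <- as.integer(format(dates, "%d"))

# Each day-of-month appears once per month that contains it,
# so the counts are the number of months contributing to that day
months_per_day <- table(day)
months_per_day[c("28", "29", "30", "31")]
```

Whether this uneven coverage explains the pattern in the plot is not something the original analysis tests; the count only shows that the days are not directly comparable.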