Synopsis

This is the homework for Week 3 of Data Wrangling. The objective is to read the Cincinnati weather data from the University of Dayton archive (the URL appears in the loading code below) and create visualizations.

Packages required

The following packages are required for this homework:

library(gdata)   # provides various R programming tools for data manipulation
library(ggplot2) # used for data visualization in R

# Read the raw text file (it has no header row), then assign column names
weather <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt", header = FALSE)
names(weather) <- c("month", "day", "year", "temperature")

Source Code

The data set contains the following variables, from left to right: month, day, year, and temperature.

Data Description

The following output gives an overview of the data: the total number of observations, the number of variables, and the type of each variable.

str(weather)
'data.frame':   7963 obs. of  4 variables:
 $ month      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ year       : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
 $ temperature: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...

The following table summarizes the data set:

summary(weather)
     month             day             year       temperature    
 Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-99.00  
 1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.: 40.10  
 Median : 6.000   Median :16.00   Median :2005   Median : 57.00  
 Mean   : 6.479   Mean   :15.72   Mean   :2005   Mean   : 54.46  
 3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.: 70.70  
 Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   : 89.20  

Missing temperature values are coded as -99, which is why the minimum temperature in the summary above is -99.00. We replace them with NA.
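The replacement code itself is not shown in this write-up. A minimal sketch of one way to do it, using a small hypothetical data frame `toy` (made-up values, same column names) in place of the full `weather` data:

```r
# Toy stand-in for `weather` (hypothetical values, same column names)
toy <- data.frame(
  month = c(1, 1, 1),
  day = c(1, 2, 3),
  year = c(1995, 1995, 1995),
  temperature = c(41.1, -99, 22.8)
)

# Recode the -99 sentinel as NA so that R treats it as missing
toy$temperature[toy$temperature == -99] <- NA

toy$temperature
```

On the real data the same assignment would presumably read `weather$temperature[weather$temperature == -99] <- NA`.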

There are 14 rows with missing values in the data set.
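The count above could be obtained along these lines. This is a sketch on a hypothetical data frame `toy`, not the original code; `n_missing` is an illustrative name:

```r
# Toy stand-in for `weather` after -99 has been recoded to NA (hypothetical values)
toy <- data.frame(
  month = c(1, 1, 1, 1),
  day = 1:4,
  year = rep(1995, 4),
  temperature = c(41.1, NA, 22.8, NA)
)

# complete.cases() is TRUE for rows with no NA in any column,
# so negating it and summing counts the rows with at least one NA
n_missing <- sum(!complete.cases(toy))
n_missing
```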

Remove rows that have NA values:

#Omitting rows with missing values
weather_final <- na.omit(weather)

#Displaying summary statistics of final dataframe
summary(weather_final)
     month             day             year       temperature   
 Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-2.20  
 1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.:40.20  
 Median : 6.000   Median :16.00   Median :2005   Median :57.10  
 Mean   : 6.477   Mean   :15.71   Mean   :2005   Mean   :54.73  
 3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.:70.70  
 Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   :89.20  

Data Visualization

#Viz 1
ggplot(data = weather_final) +
  geom_smooth(mapping = aes(x = year, y = temperature), color = "blue", se = FALSE) +
  facet_wrap(~ month, nrow = 5) +
  ggtitle("Monthly Average Temperature") +
  ylab("Avg Temperature(F)")
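Note that `geom_smooth()` fits a smoothed trend through the daily observations and does not compute averages first. If explicit monthly means were wanted, one option is base R's `aggregate()`. The sketch below uses a hypothetical data frame `toy`, and `monthly_avg` is an illustrative name:

```r
# Toy stand-in for `weather_final` (hypothetical values)
toy <- data.frame(
  month = c(1, 1, 1, 1, 2, 2),
  year = c(1995, 1995, 1996, 1996, 1995, 1995),
  temperature = c(40, 20, 30, 34, 50, 54)
)

# Mean temperature for each year/month combination
monthly_avg <- aggregate(temperature ~ year + month, data = toy, FUN = mean)
monthly_avg
```

The resulting data frame has one row per year and month, and could be passed to `ggplot()` in place of the daily data.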

#Viz 2
ggplot(data = weather_final) +
  geom_smooth(mapping = aes(x = year, y = temperature), color = "blue") +
  ggtitle("Yearly Avg Temperature") +
  ylab("Avg Temperature(F)")

#Viz 3
ggplot(data = weather_final) + 
stat_summary(mapping = aes(x = year, y = temperature),
               color="blue",
               geom = "pointrange",
               # fun.ymin, fun.ymax and fun.y were renamed in ggplot2 3.3.0
               fun.min = min,
               fun.max = max,
               fun = median
) + 
ggtitle("Summarised Average Temperature") +
ylab("Avg Temperature(F)")
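For reference, the point ranges in Viz 3 show the minimum, median, and maximum temperature within each year. The same values can be tabulated in base R. This is a sketch on a hypothetical data frame `toy`, and the column names `ymin`, `y`, and `ymax` are illustrative:

```r
# Toy stand-in for `weather_final` (hypothetical values)
toy <- data.frame(
  year = c(1995, 1995, 1995, 1996, 1996, 1996),
  temperature = c(10, 50, 90, 20, 40, 80)
)

# tapply() returns one value per year, ordered by year,
# which matches sort(unique(toy$year))
yearly_range <- data.frame(
  year = sort(unique(toy$year)),
  ymin = as.vector(tapply(toy$temperature, toy$year, min)),
  y = as.vector(tapply(toy$temperature, toy$year, median)),
  ymax = as.vector(tapply(toy$temperature, toy$year, max))
)
yearly_range
```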