Synopsis

This is Homework 3 for the BANA 8090 Data Wrangling with R course. In this report, we download daily temperature data for Cincinnati from the URL given in the Source Code section, clean it, and analyze it. Several plots are included to give a better understanding of the data.

Packages required

This section loads and describes the various packages required for analysis of this data set.

library(gdata)
# The gdata package provides various R programming tools for data manipulation.

library(ggplot2)
# The ggplot2 package is used to plot multi-layered graphs in R.

Source Code

The data comes from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The dataset contains average daily temperatures from January 1, 1995 to the present, each computed from 24 hourly temperature readings. The data set contains the following variables, from left to right: month, day, year, and average temperature in degrees Fahrenheit. Missing temperatures are coded as -99.

#Reading data from the given url
library(gdata)
t <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt", header = FALSE)

#Assigning headers to the given data
names(t) <- c("month", "day", "year", "temperature")

#Replacing missing values (coded as -99) with NA
t$temperature[t$temperature == -99] <- NA

#Counting missing values
sum(is.na(t$temperature))

#Deleting rows with missing values
t_final <- na.omit(t)

The original dataset contains 14 missing values.
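
As an alternative to recoding after import, the sentinel can be mapped to NA while the file is read. The following is a minimal sketch, assuming the missing-value code appears in the file as the literal text -99. The inline `sample_text` stands in for the downloaded file, and its values and the names `weather`, `n_missing`, and `weather_complete` are made up for illustration.

```r
# Illustrative stand-in for the downloaded file (values are invented).
sample_text <- "
1 1 1995 41.1
1 2 1995 -99
1 3 1995 22.8
"

# na.strings tells read.table which raw strings to treat as NA on import.
weather <- read.table(text = sample_text, header = FALSE,
                      col.names = c("month", "day", "year", "temperature"),
                      na.strings = "-99")

n_missing <- sum(is.na(weather$temperature))  # count of missing readings
weather_complete <- na.omit(weather)          # drop rows containing NA
```

With the real file, the `text` argument would be replaced by the URL used above. If the file writes the sentinel in another form (for example -99.0), that string would need to be listed in `na.strings` as well.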

Data Description

The following output gives an overview of the data: the total number of observations, the number of variables, and the type of each variable.

str(t_final)
## 'data.frame':    7949 obs. of  4 variables:
##  $ month      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year       : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ temperature: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:14] 1454 1455 1460 1461 1471 2726 2727 2728 2729 2807 ...
##   .. ..- attr(*, "names")= chr [1:14] "1454" "1455" "1460" "1461" ...

The following table summarizes the dataset.

summary(t_final)
##      month             day             year       temperature   
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-2.20  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.:40.20  
##  Median : 6.000   Median :16.00   Median :2005   Median :57.10  
##  Mean   : 6.477   Mean   :15.71   Mean   :2005   Mean   :54.73  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.:70.70  
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   :89.20

Data visualization

#Graph 1
hist(t_final$temperature, main="Histogram of temperature", xlab = "Temperature (F)", ylab = "Frequency")

#Graph 2
ggplot(data = t_final) + 
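#fun.min, fun.max and fun replace fun.ymin, fun.ymax and fun.y, which are deprecated since ggplot2 3.3.0
stat_summary(mapping = aes(x = year, y = temperature),
               geom = "pointrange",
               fun.min = min,
               fun.max = max,
               fun = median
) + 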
ggtitle("Average Temperature Summary") +
ylab("Average Temperature(F)")
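
The pointrange in Graph 2 draws, for each year, the median temperature with a range running from the minimum to the maximum. The same three statistics can be tabulated directly. The following is a sketch using base R's `aggregate` on a small made-up data frame; `demo`, `yearly`, and the values are illustrative and not taken from the dataset.

```r
# Invented example data: three readings for each of two years.
demo <- data.frame(
  year = c(1995, 1995, 1995, 1996, 1996, 1996),
  temperature = c(10, 30, 50, 20, 40, 90)
)

# One row per year holding the minimum, median, and maximum temperature.
yearly <- aggregate(temperature ~ year, data = demo,
                    FUN = function(x) c(min = min(x),
                                        median = median(x),
                                        max = max(x)))

# aggregate returns the three statistics as a matrix column;
# flatten it into ordinary columns named temperature.min, etc.
yearly <- do.call(data.frame, yearly)
```

Applying the same call to `t_final` instead of `demo` would give one row per year with the numbers behind each point and range in the plot.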

#Graph 3
ggplot(data=t_final) +
geom_smooth(mapping = aes(x = year, y = temperature), se = FALSE) +
facet_wrap(~ month, nrow = 5) +
ggtitle("Monthwise Average Temperature") +
ylab("Average Temperature(F)")
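
The facet panels in Graph 3 are titled with the month numbers 1 to 12. One way to show month names instead, sketched here on a made-up vector, is to convert the numbers to a factor using R's built-in `month.abb` constant. The names `months` and `month_label` are hypothetical.

```r
months <- c(1, 2, 12, 1)  # illustrative month numbers

# Explicit levels 1:12 keep the months in calendar order
# rather than alphabetical order.
month_label <- factor(months, levels = 1:12, labels = month.abb)
```

With the dataset, the same conversion could be stored as a new column of `t_final` and used in `facet_wrap()` in place of `month`.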

#Graph 4
ggplot(data=t_final) +
geom_smooth(mapping = aes(x = year, y = temperature)) +
ggtitle("Yearwise Average Temperature") +
ylab("Average Temperature(F)")