This is Homework-3 of the BANA 8090 Data Wrangling with R course. In this file, we read the Cincinnati daily temperature data from the University of Dayton weather archive (academic.udayton.edu) and analyze it. Several visualizations are plotted to give a better understanding of the data.
This section loads and describes the various packages required for analysis of this data set.
library(gdata)
"The gdata package provides various R programming tools for data manipulation"
library(ggplot2)
"The ggplot2 package is used to plot multi-layered graphs in R"
The data come from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The data set contains average daily temperatures from January 1, 1995 to the present, each computed from 24 hourly temperature readings. It contains the following variables, from left to right: month, day, year, and average daily temperature (in degrees Fahrenheit, with missing values coded as -99).
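As a small illustration of the four-column layout described above, the sketch below parses three rows written in the same format and combines the month, day, and year columns into a single date. The inline sample (taken from the first three temperatures shown later in the `str()` output), the object name `w`, and the added `date` column are assumptions made for illustration only; they are not part of the original analysis.

```r
# Three rows in the same layout as the source file: month day year temperature
sample_text <- "1 1 1995 41.1
1 2 1995 22.2
1 3 1995 22.8"

# Parse the text exactly as a file would be parsed, naming the columns on read
w <- read.table(text = sample_text, header = FALSE,
                col.names = c("month", "day", "year", "temperature"))

# Build an ISO-formatted string (YYYY-MM-DD) and convert it to a Date
w$date <- as.Date(sprintf("%d-%02d-%02d", w$year, w$month, w$day))

print(w)
```

A single `Date` column can be convenient for time-based plots, although the analysis in this report works directly with the separate month and year columns.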
#Reading data from the given URL
t <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt", header = FALSE)
#Assigning headers to the given data
names(t) <- c("month", "day", "year", "temperature")
#Replacing missing values with NA
t$temperature[t$temperature==-99] <-NA
#Counting missing values
sum(is.na(t$temperature))
#Deleting rows with missing values
t_final <- na.omit(t)
The number of missing values in the original dataset is 14.
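The missing-value handling above can be checked on a small example. The sketch below uses a made-up four-row data frame (the name `toy` and its values are hypothetical, not taken from the weather file) to show the same three steps: recoding the -99 sentinel as NA, counting the NAs, and dropping incomplete rows.

```r
# Toy data frame standing in for the weather table (values are made up)
toy <- data.frame(
  month = c(1, 1, 1, 1),
  day = 1:4,
  year = 1995,
  temperature = c(41.1, -99, 22.8, -99)
)

# Recode the -99 sentinel as NA
toy$temperature[toy$temperature == -99] <- NA

# Count the missing values
n_missing <- sum(is.na(toy$temperature))

# Drop every row that contains an NA
toy_clean <- na.omit(toy)

print(n_missing)
print(nrow(toy_clean))
```

Recoding before counting matters here: while the sentinel is still -99, `is.na()` finds nothing, and the -99 values would also distort any summary statistics.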
The following output gives an overview of the data: the total number of observations, the number of variables, and the type of each variable.
str(t_final)
## 'data.frame': 7949 obs. of 4 variables:
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ temperature: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:14] 1454 1455 1460 1461 1471 2726 2727 2728 2729 2807 ...
## .. ..- attr(*, "names")= chr [1:14] "1454" "1455" "1460" "1461" ...
The following table provides a summary of the dataset.
summary(t_final)
## month day year temperature
## Min. : 1.000 Min. : 1.00 Min. :1995 Min. :-2.20
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2000 1st Qu.:40.20
## Median : 6.000 Median :16.00 Median :2005 Median :57.10
## Mean : 6.477 Mean :15.71 Mean :2005 Mean :54.73
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:2011 3rd Qu.:70.70
## Max. :12.000 Max. :31.00 Max. :2016 Max. :89.20
#Graph 1
hist(t_final$temperature, main="Histogram of temperature", xlab = "Temperature (F)", ylab = "Frequency")
#Graph 2
#For each year, draw the range from the minimum to the maximum temperature, with a point at the median
ggplot(data = t_final) +
  stat_summary(mapping = aes(x = year, y = temperature),
               geom = "pointrange",
               fun.min = min,
               fun.max = max,
               fun = median
  ) +
  ggtitle("Average Temperature Summary") +
  ylab("Average Temperature(F)")
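To make explicit what the point ranges in Graph 2 represent, the sketch below computes the same three per-year statistics (minimum, median, maximum) in base R. It is only an illustration on a made-up data frame; the names `toy` and `yearly` and all values are hypothetical.

```r
# Toy data: three made-up temperature readings for each of two years
toy <- data.frame(
  year = c(1995, 1995, 1995, 1996, 1996, 1996),
  temperature = c(10, 50, 80, 20, 55, 90)
)

# tapply() applies a function to the temperatures within each year;
# results come back ordered by the sorted unique years
yearly <- data.frame(
  year = sort(unique(toy$year)),
  min = as.vector(tapply(toy$temperature, toy$year, min)),
  median = as.vector(tapply(toy$temperature, toy$year, median)),
  max = as.vector(tapply(toy$temperature, toy$year, max))
)

print(yearly)
```

Each row of `yearly` corresponds to one vertical line in the plot: the line runs from `min` to `max`, and the point sits at `median`.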
#Graph 3
ggplot(data = t_final) +
  geom_smooth(mapping = aes(x = year, y = temperature), se = FALSE) +
  facet_wrap(~ month, nrow = 5) +
  ggtitle("Monthwise Average Temperature") +
  ylab("Average Temperature(F)")
#Graph 4
ggplot(data = t_final) +
  geom_smooth(mapping = aes(x = year, y = temperature)) +
  ggtitle("Yearwise Average Temperature") +
  ylab("Average Temperature(F)")