This is the homework for Week 3 Data Wrangling. The objective is to scrape the Cincinnati weather data located here and create visualizations.
The following packages are required for this homework:
library(gdata) # "The gdata package provides various R programming tools for data manipulation"
library(ggplot2) # ggplot is used for data visualization in R
weather <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt", header = FALSE)
names(weather) <- c("month", "day", "year", "temperature") # add header row
The data set contains the following variables, from left to right: month, day, year, and temperature.
The following output gives an overview of the data: the total number of observations, the number of variables, and the type of each variable.
str(weather)
'data.frame': 7963 obs. of 4 variables:
$ month : int 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
$ temperature: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
The following table summarizes the data set:
summary(weather)
     month             day             year       temperature
 Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-99.00
 1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.: 40.10
 Median : 6.000   Median :16.00   Median :2005   Median : 57.00
 Mean   : 6.479   Mean   :15.72   Mean   :2005   Mean   : 54.46
 3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.: 70.70
 Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   : 89.20
Missing temperature values are coded as -99. We replace them with NA.
There are 14 rows with missing values in the data set.
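The code for the recoding and counting steps is not shown. A minimal base-R sketch of how they might be done follows; it uses a small made-up data frame (`toy`, a hypothetical stand-in for `weather`) so that it runs without downloading the data. The gdata package loaded earlier also provides `unknownToNA()` for this kind of recoding, which may be what was used here.

```r
# Hypothetical stand-in for the weather data frame (values are made up)
toy <- data.frame(
  month       = c(1, 1, 1, 1),
  day         = c(1, 2, 3, 4),
  year        = c(1995, 1995, 1995, 1995),
  temperature = c(41.1, -99, 22.8, -99)
)

# Recode the -99 sentinel as NA in the temperature column
toy$temperature[toy$temperature == -99] <- NA

# Count the rows that contain at least one missing value
n_missing <- sum(!complete.cases(toy))
print(n_missing)
```

Applied to `weather` instead of `toy`, the same two steps should reproduce the count of rows with missing values reported above.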
Remove the rows that have NA values:
# Omit rows with missing values
weather_final <- na.omit(weather)
# Display summary statistics of the final data frame
summary(weather_final)
     month             day             year       temperature
 Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-2.20
 1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.:40.20
 Median : 6.000   Median :16.00   Median :2005   Median :57.10
 Mean   : 6.477   Mean   :15.71   Mean   :2005   Mean   :54.73
 3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.:70.70
 Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   :89.20
# Viz 1: smoothed temperature trend across years, one panel per month
ggplot(data = weather_final) +
  geom_smooth(mapping = aes(x = year, y = temperature), color = "blue", se = FALSE) +
  facet_wrap(~ month, nrow = 5) +
  ggtitle("Monthly Average Temperature") +
  ylab("Avg Temperature (F)")
# Viz 2: smoothed temperature trend across all years, with confidence band
ggplot(data = weather_final) +
  geom_smooth(mapping = aes(x = year, y = temperature), color = "blue") +
  ggtitle("Yearly Avg Temperature") +
  ylab("Avg Temperature (F)")
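# Viz 3: per-year median (point) with min-to-max range (line)
# fun, fun.min and fun.max replace the deprecated fun.y, fun.ymin and fun.ymax (ggplot2 >= 3.3.0)
ggplot(data = weather_final) +
  stat_summary(mapping = aes(x = year, y = temperature),
               color = "blue",
               geom = "pointrange",
               fun.min = min,
               fun.max = max,
               fun = median
  ) +
  ggtitle("Summarised Average Temperature") +
  ylab("Avg Temperature (F)")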
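The per-year minimum, median, and maximum that Viz 3 draws can also be tabulated directly. The following is only a sketch in base R, again on a small made-up data frame (`toy`, hypothetical) rather than the real `weather_final`:

```r
# Hypothetical stand-in for weather_final (values are made up)
toy <- data.frame(
  year        = c(1995, 1995, 1995, 1996, 1996, 1996),
  temperature = c(10, 20, 60, 30, 50, 40)
)

# tapply() groups temperature by year; groups come back in sorted year order
yearly <- data.frame(
  year   = sort(unique(toy$year)),
  min    = as.vector(tapply(toy$temperature, toy$year, min)),
  median = as.vector(tapply(toy$temperature, toy$year, median)),
  max    = as.vector(tapply(toy$temperature, toy$year, max))
)
print(yearly)
```

Replacing `toy` with `weather_final` should give the values behind each point and range in the plot.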