Synopsis

This HTML document was generated from the associated R Markdown file. For the week 3 homework assignment, I use Cincinnati weather data to perform a basic data analysis, supported by three graphs built with R visualization functions. This R Markdown file covers three broad steps:

  1. Review the codebook

  2. Learn about the data

  3. Visualize the data

Packages Required

The package used to execute the R code in this assignment is listed below:

library(ggplot2)  ##Package to produce complex multi-layered graphs in R

Source Code

The source data come from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The data set has four fields: month, day, year, and average daily temperature (F). Each average daily temperature is computed from 24 hourly temperature readings. The data set starts on January 1, 1995 and runs to the present; the copy read in this report ends on October 19, 2016.

The value “-99” is used as the missing-value flag for any day on which data are not available.
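As a minimal sketch of how the missing-value flag could be handled at read time, the example below passes `na.strings` to `read.table()` so that the flag becomes `NA` during import. The three-row `sample_text` is a hypothetical stand-in for the real file, and it assumes the flag is written exactly as `-99`; if the file writes it differently (for example `-99.0`), the string would need to match that form.

```r
# Hypothetical three-row sample in the same four-column layout as the source file
sample_text <- "
 1  1  1995  41.1
 1  2  1995  -99
 1  3  1995  22.8
"

sample_weather <- read.table(
  text = sample_text,
  header = FALSE,
  col.names = c("Month", "Day", "Year", "AverageTemp"),
  na.strings = "-99"  # treat the missing-value flag as NA while reading
)

# Number of missing temperatures in the sample
sum(is.na(sample_weather$AverageTemp))
```

This is an alternative to replacing the flag after the data frame is created; both approaches should leave the same `NA` values in `AverageTemp`.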

Data Description

#Creating data frame after extracting data from given URL

url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
cincy_weather <- read.table(url,header=FALSE,sep="",col.names=c("Month","Day","Year","AverageTemp"))

#Number of rows and variables
dim(cincy_weather)
## [1] 7963    4
#Names of variables
names(cincy_weather)
## [1] "Month"       "Day"         "Year"        "AverageTemp"
#Checking top and bottom values
head(cincy_weather)
##   Month Day Year AverageTemp
## 1     1   1 1995        41.1
## 2     1   2 1995        22.2
## 3     1   3 1995        22.8
## 4     1   4 1995        14.9
## 5     1   5 1995         9.5
## 6     1   6 1995        23.8
tail(cincy_weather)
##      Month Day Year AverageTemp
## 7958    10  14 2016        54.4
## 7959    10  15 2016        63.2
## 7960    10  16 2016        68.7
## 7961    10  17 2016        71.1
## 7962    10  18 2016        74.4
## 7963    10  19 2016        75.3
#Replace missing values equal to -99 with NA

cincy_weather$AverageTemp[cincy_weather$AverageTemp==-99] <-NA

#Counting missing values (is.na() already returns TRUE/FALSE, so no comparison with TRUE is needed)
sum(is.na(cincy_weather$AverageTemp))
## [1] 14
#Show which rows have incomplete data
cincy_weather[!complete.cases(cincy_weather),]
##      Month Day Year AverageTemp
## 1454    12  24 1998          NA
## 1455    12  25 1998          NA
## 1460    12  30 1998          NA
## 1461    12  31 1998          NA
## 1471     1  10 1999          NA
## 2726     6  18 2002          NA
## 2727     6  19 2002          NA
## 2728     6  20 2002          NA
## 2729     6  21 2002          NA
## 2807     9   7 2002          NA
## 2982     3   1 2003          NA
## 4623     8  28 2007          NA
## 5016     9  24 2008          NA
## 5213     4   9 2009          NA
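The listing above shows the missing days clustered in a few years. One way to summarise that, sketched here on a hypothetical five-row data frame (`toy_missing` is not part of the assignment data), is to tabulate the `Year` of every row whose temperature is `NA`:

```r
# Hypothetical miniature data frame with some missing temperatures
toy_missing <- data.frame(
  Year = c(1998, 1998, 1998, 1999, 2002),
  AverageTemp = c(NA, NA, 41.0, NA, 55.3)
)

# Count missing temperature readings per year
missing_by_year <- table(toy_missing$Year[is.na(toy_missing$AverageTemp)])
missing_by_year
```

Years with no missing readings do not appear in the table at all, which keeps the output short for a long series.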
#Omitting rows with missing values
cincy_weather_final <- na.omit(cincy_weather)

#Displaying summary statistics of final dataframe
summary(cincy_weather_final)
##      Month             Day             Year       AverageTemp   
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-2.20  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.:40.20  
##  Median : 6.000   Median :16.00   Median :2005   Median :57.10  
##  Mean   : 6.477   Mean   :15.71   Mean   :2005   Mean   :54.73  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.:70.70  
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   :89.20
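Because the date is stored in three separate columns, it can help to combine them into a single `Date` column and to look at yearly averages alongside the overall summary. The sketch below does this on a hypothetical four-row data frame (`toy_weather` stands in for `cincy_weather_final`; the values are made up for illustration):

```r
# Hypothetical miniature stand-in for cincy_weather_final
toy_weather <- data.frame(
  Month = c(1, 7, 1, 7),
  Day = c(15, 15, 15, 15),
  Year = c(1995, 1995, 1996, 1996),
  AverageTemp = c(30, 80, 20, 70)
)

# Combine Year, Month and Day into a single Date column
toy_weather$Date <- as.Date(
  paste(toy_weather$Year, toy_weather$Month, toy_weather$Day, sep = "-"),
  format = "%Y-%m-%d"
)

# Mean temperature for each year
yearly_mean <- aggregate(AverageTemp ~ Year, data = toy_weather, FUN = mean)
yearly_mean
```

A `Date` column also makes it possible to plot the series on a continuous time axis instead of by `Year` alone.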

Data Visualization

#Graph 1
ggplot(data=cincy_weather_final) +
geom_smooth(mapping = aes(x = Year, y = AverageTemp),color="tomato", se = FALSE) +
facet_wrap(~ Month, nrow = 5) +
ggtitle("Monthwise Average Temperature") +
ylab("Average Temperature(F)")

#Graph 2
ggplot(data=cincy_weather_final) +
geom_point(mapping = aes(x = Year,y = AverageTemp),color="tomato") +
geom_smooth(mapping = aes(x = Year, y = AverageTemp),color="tomato") +
ggtitle("Yearwise Average Temperature") +
ylab("Average Temperature(F)")

#Graph 3
ggplot(data = cincy_weather_final) + 
stat_summary(mapping = aes(x = Year, y = AverageTemp),
               color="tomato",
               geom = "pointrange",
               #fun.min, fun.max and fun replace fun.ymin, fun.ymax and fun.y (deprecated since ggplot2 3.3.0)
               fun.min = min,
               fun.max = max,
               fun = median
) + 
ggtitle("Summarised Average Temperature") +
ylab("Average Temperature(F)")
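Graph 3 draws, for each year, a point at the median and a line from the minimum to the maximum. As a rough sketch, the same three numbers can be tabulated with base R's `aggregate()`. The data frame `toy_temps` below is hypothetical and only illustrates the shape of the result:

```r
# Hypothetical miniature data set: three readings per year
toy_temps <- data.frame(
  Year = c(1995, 1995, 1995, 1996, 1996, 1996),
  AverageTemp = c(10, 50, 90, 20, 40, 80)
)

# Minimum, median and maximum temperature for each year
yearly_range <- aggregate(
  AverageTemp ~ Year,
  data = toy_temps,
  FUN = function(x) c(min = min(x), median = median(x), max = max(x))
)

# aggregate() stores the three statistics as one matrix column;
# flatten it into ordinary columns
yearly_range <- do.call(data.frame, yearly_range)
yearly_range
```

Reading the table next to the plot makes it easier to check the values behind each pointrange.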