Synopsis:

This document illustrates the steps taken to understand the structure of the data before starting the analysis. These steps broadly include understanding the codebook, computing descriptive measures of the data, and visualization.

Initial findings:
1. There were 14 missing values of average temperature
2. The visualizations show that February is the coldest month and July is the hottest month in Cincinnati
3. In the data provided, July 5th to 8th, 2012 was the hottest stretch

Packages Required:

ggplot2 - For the visualizations
readr - For importing the .txt file from the web
dplyr - For the filter function used in the visualizations

Source Code:

According to the codebook, there are four variables in the weather data:
1. Month
2. Day
3. Year
4. Average daily temperature
The average daily temperature is computed from 24 hourly temperature readings in the Global Summary of the Day (GSOD) data. “-99” is used when the data is missing.

Data Description:

Importing and Missing values

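library(ggplot2)  # plotting functions used in the visualizations
library(dplyr)    # filter() used in the visualizations

# whitespace-separated text file with no header row, so column names are supplied
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
cincy_weather <- read.table(url, col.names = c("Month", "Day", "Year", "Temp"))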
head(cincy_weather)
##   Month Day Year Temp
## 1     1   1 1995 41.1
## 2     1   2 1995 22.2
## 3     1   3 1995 22.8
## 4     1   4 1995 14.9
## 5     1   5 1995  9.5
## 6     1   6 1995 23.8
# recode the -99 missing-value code from the codebook as NA
cincy_weather$Temp[cincy_weather$Temp == -99] <- NA
# count the NA values in each column
sapply(cincy_weather, function(x) sum(is.na(x)))
## Month   Day  Year  Temp 
##     0     0     0    14
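
To see which dates the missing readings fall on, the NA rows can be listed directly. The following is a minimal sketch on a small made-up data frame; `toy_weather` and its values are hypothetical and are not taken from the Cincinnati file. The same indexing should carry over to cincy_weather.

```r
# Hypothetical sample in the same four-column layout as the weather file;
# the values are made up for illustration.
toy_weather <- data.frame(
  Month = c(1, 1, 1, 2, 2),
  Day   = c(1, 2, 3, 1, 2),
  Year  = c(1995, 1995, 1995, 1995, 1995),
  Temp  = c(41.1, -99, 22.8, -99, 30.5)
)

# Recode the -99 missing-value code as NA.
toy_weather$Temp[toy_weather$Temp == -99] <- NA

# Keep only the date columns of the rows whose temperature is missing.
missing_rows <- toy_weather[is.na(toy_weather$Temp), c("Month", "Day", "Year")]
print(missing_rows)

# Count of missing readings per year.
missing_per_year <- table(toy_weather$Year[is.na(toy_weather$Temp)])
print(missing_per_year)
```

Listing the dates before dropping them helps to check whether the gaps are scattered or clustered in one period.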

Summary Statistics

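# treat the date parts as categories so summary() reports counts per level
cincy_weather$Month <- factor(cincy_weather$Month)
cincy_weather$Day <- factor(cincy_weather$Day)
cincy_weather$Year <- factor(cincy_weather$Year)
# summary() reports the number of NA values on its own
sapply(cincy_weather, summary)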
## $Month
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 682 622 682 660 682 660 682 682 660 670 630 651 
## 
## $Day
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 262 262 262 262 262 262 262 262 262 262 262 262 262 262 262 262 262 262 
##  19  20  21  22  23  24  25  26  27  28  29  30  31 
## 262 261 261 261 261 261 261 261 261 261 245 239 152 
## 
## $Year
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
##  365  366  365  365  365  366  365  365  365  366  365  365  365  366  365 
## 2010 2011 2012 2013 2014 2015 2016 
##  365  365  366  365  365  365  293 
## 
## $Temp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -2.20   40.20   57.10   54.73   70.70   89.20      14
cincy_weather <- na.omit(cincy_weather)  # remove rows with NA values

Data Visualization

# distribution of daily average temperature in 5-degree bins
ggplot(data = cincy_weather) +
  geom_histogram(mapping = aes(x = Temp), binwidth = 5, colour = "red")

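# point at the monthly mean, line spanning the monthly minimum to maximum
# (fun, fun.min and fun.max are the argument names in ggplot2 3.3.0 and later)
ggplot(data = cincy_weather) +
  stat_summary(mapping = aes(x = Month, y = Temp), size = 1,
               fun.min = min, fun.max = max, fun = mean, colour = "blue")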

This chart illustrates the range of temperature in Cincinnati in each month. We can clearly see that January and February are the coldest months, whereas July is the hottest.
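
The same comparison can be made numerically by computing the minimum, mean and maximum per month. The following is a minimal base R sketch on made-up values; `toy_weather`, `monthly_stats`, `coldest_month` and `hottest_month` are hypothetical names introduced here for illustration.

```r
# Hypothetical sample: two readings for each of three months (values are made up).
toy_weather <- data.frame(
  Month = factor(c(1, 1, 2, 2, 7, 7)),
  Temp  = c(30.0, 20.0, 25.0, 15.0, 80.0, 70.0)
)

# Minimum, mean and maximum temperature per month. The result is a matrix
# with one column per month and one row per statistic.
monthly_stats <- sapply(
  split(toy_weather$Temp, toy_weather$Month),
  function(x) c(min = min(x), mean = mean(x), max = max(x))
)
print(monthly_stats)

# Months with the lowest and the highest mean temperature.
coldest_month <- colnames(monthly_stats)[which.min(monthly_stats["mean", ])]
hottest_month <- colnames(monthly_stats)[which.max(monthly_stats["mean", ])]
print(c(coldest = coldest_month, hottest = hottest_month))
```

Ranking months by their mean rather than by a single extreme day keeps one unusually cold or hot reading from deciding the result.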

# daily temperatures in July, coloured by year
ggplot(data = cincy_weather) +
  geom_point(mapping = aes(x = Day, y = Temp, colour = Year),
             data = filter(cincy_weather, Month == 7))

# daily temperatures in February, coloured by year
ggplot(data = cincy_weather) +
  geom_point(mapping = aes(x = Day, y = Temp, colour = Year),
             data = filter(cincy_weather, Month == 2))

These visualizations show the hottest days in July and the coldest days in February, along with the years in which they occurred.
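
Reading exact dates off a scatter plot can be difficult, so the extreme days can also be looked up directly. The following is a minimal sketch on made-up values; `toy_weather`, `hottest_july_day` and `coldest_feb_day` are hypothetical names, and Month is kept numeric here, although the same comparison also works when it is stored as a factor.

```r
# Hypothetical sample of February and July readings (values are made up).
toy_weather <- data.frame(
  Month = c(2, 2, 2, 7, 7, 7),
  Day   = c(3, 4, 3, 5, 6, 5),
  Year  = c(2011, 2011, 2012, 2011, 2012, 2012),
  Temp  = c(10.5, 18.0, 25.2, 78.3, 88.1, 84.0)
)

# Row with the highest temperature among the July readings.
july <- toy_weather[toy_weather$Month == 7, ]
hottest_july_day <- july[which.max(july$Temp), ]
print(hottest_july_day)

# Row with the lowest temperature among the February readings.
feb <- toy_weather[toy_weather$Month == 2, ]
coldest_feb_day <- feb[which.min(feb$Temp), ]
print(coldest_feb_day)
```

Note that which.max() and which.min() return only the first match, so tied days would need a comparison against max() or min() instead.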