Synopsis

This is my homework report for week 3, produced with R Markdown. In this homework I imported the Cincinnati weather data, studied its code book, and explored different aspects of the data. In the third step I visualized the data to gain a better understanding of it.

Summary:

The data set contains weather data for Cincinnati over a period of 22 years, from 1995 to 2016. The data suggest that July to September are the hottest months of the year, whereas November to February are the coolest. The overall yearly variation in temperature appears roughly constant; however, there is a slight increase in average temperature for 2016.

There is not much variation in temperature between days within a particular month. The daily temperature remains high in summer and low in winter.

Packages Required

library(tidyverse) ## loads ggplot2, which is used for the plots below

Source Code

The data description can be found at:

http://academic.udayton.edu/kissock/http/Weather/source.htm

Note on missing data for the weather data set:

http://academic.udayton.edu/kissock/http/Weather/missingdata.htm

Data description

setwd("C:/tauseef/data_wrangling/Data Wrangling with R (BANA 8090)")


filename <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
col_names <- c("month", "day", "year",
               "avg_daily_temp")

ohcincin <- read.table(filename, header = FALSE, sep = "", col.names = col_names, strip.white = TRUE)
str(ohcincin)
## 'data.frame':    7963 obs. of  4 variables:
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year          : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ avg_daily_temp: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
nrow(ohcincin) ## number of observations in data
## [1] 7963
ncol(ohcincin) ## number of variables in data
## [1] 4
### need to change the data type of month, day and year from integer to factor ###

ohcincin$month<-as.factor(ohcincin$month)
ohcincin$day<-as.factor(ohcincin$day)
ohcincin$year<-as.factor(ohcincin$year)
### replace the missing temperatures, which are coded as -99, with NA ###
ohcincin$avg_daily_temp[ohcincin$avg_daily_temp == -99] <- NA
ohcincin_n<-na.omit(ohcincin)
### summary stats after removing missing values ###
head(ohcincin_n)
##   month day year avg_daily_temp
## 1     1   1 1995           41.1
## 2     1   2 1995           22.2
## 3     1   3 1995           22.8
## 4     1   4 1995           14.9
## 5     1   5 1995            9.5
## 6     1   6 1995           23.8
str(ohcincin_n)
## 'data.frame':    7949 obs. of  4 variables:
##  $ month         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year          : Factor w/ 22 levels "1995","1996",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ avg_daily_temp: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:14] 1454 1455 1460 1461 1471 2726 2727 2728 2729 2807 ...
##   .. ..- attr(*, "names")= chr [1:14] "1454" "1455" "1460" "1461" ...
nrow(ohcincin_n) ## number of observations in data
## [1] 7949
ncol(ohcincin_n) ## number of variables in data
## [1] 4
sum(is.na(ohcincin_n)) ## number of missing values left in the data
## [1] 0
summary(ohcincin_n) 
##      month           day            year      avg_daily_temp 
##  5      : 682   2      : 262   1996   : 366   Min.   :-2.20  
##  7      : 682   3      : 262   2000   : 366   1st Qu.:40.20  
##  1      : 681   4      : 262   2004   : 366   Median :57.10  
##  3      : 681   5      : 262   2012   : 366   Mean   :54.73  
##  8      : 681   6      : 262   1995   : 365   3rd Qu.:70.70  
##  10     : 670   8      : 262   1997   : 365   Max.   :89.20  
##  (Other):3872   (Other):6377   (Other):5755
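
The -99 sentinel could also be converted to NA at import time instead of afterwards. The following is a minimal sketch, assuming the raw file writes a missing temperature exactly as the string -99; the rows in `raw_text` are invented for illustration and only stand in for the real file.

```r
# Toy stand-in for the raw text file; these rows are made up for illustration.
raw_text <- "
 1  1  1995  41.1
 1  2  1995  -99
 1  3  1995  22.8
"

col_names <- c("month", "day", "year", "avg_daily_temp")

# na.strings tells read.table() which field values should be read as NA.
toy <- read.table(text = raw_text, header = FALSE,
                  col.names = col_names, na.strings = "-99")

sum(is.na(toy$avg_daily_temp))  # number of missing temperatures

toy_clean <- na.omit(toy)       # drop the rows with a missing temperature
nrow(toy_clean)
```

If the real file spells the sentinel differently (for example with a decimal part), that spelling would have to be listed in `na.strings` as well.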

Data Visualization

ggplot(data = ohcincin_n, mapping = aes(x = year, y = avg_daily_temp)) + 
  geom_boxplot() 

The graph above shows the yearly minimum, maximum, and median temperature in Cincinnati. 2016 appears to be the hottest year in a decade.
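
Years can be compared fairly only if each one covers the same part of the calendar. The raw data has 7963 rows, fewer than the 8036 days from 1995 through 2016, which suggests that at least one year is incomplete. One way to check is to count the observations per year. The sketch below does this on a made-up data frame; the `toy` object and its values are hypothetical, not the Cincinnati data.

```r
# Hypothetical toy data: 2015 has four observations, 2016 only two.
toy <- data.frame(
  year = factor(c(2015, 2015, 2015, 2015, 2016, 2016)),
  avg_daily_temp = c(30.1, 55.4, 78.2, 41.0, 33.5, 60.2)
)

# Number of observations recorded in each year.
obs_per_year <- table(toy$year)
obs_per_year

# Years with fewer observations than the best-covered year.
short_years <- names(obs_per_year)[as.vector(obs_per_year) < max(obs_per_year)]
short_years
```

Applied to ohcincin_n, a year with noticeably fewer rows than the others would be a sign that its median is not directly comparable to theirs.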

ggplot(data = ohcincin_n) + 
  geom_point(mapping = aes(x = month, y = avg_daily_temp))

The graph above shows the monthly variation of average daily temperature in Cincinnati. July to September appear to be the hottest months and November to February the coolest.
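
A scatter plot makes it hard to rank months that overlap, so a per-month average can back up the visual impression. The following is a small sketch in base R; the `toy` data frame and its values are hypothetical and only show the shape of the calculation.

```r
# Hypothetical toy data, not the Cincinnati measurements.
toy <- data.frame(
  month = factor(c(1, 1, 7, 7, 10, 10)),
  avg_daily_temp = c(30, 34, 76, 80, 55, 59)
)

# Mean temperature for each month.
monthly_mean <- tapply(toy$avg_daily_temp, toy$month, mean)

# Months ordered from warmest to coolest.
sort(monthly_mean, decreasing = TRUE)

# Label of the warmest month.
warmest <- names(which.max(monthly_mean))
warmest
```

The same two calls on ohcincin_n$avg_daily_temp and ohcincin_n$month would give the ranking for the real data.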

## fun.min, fun.max and fun replace fun.ymin, fun.ymax and fun.y,
## which are deprecated since ggplot2 3.3.0
ggplot(data = ohcincin_n, mapping = aes(x = day, y = avg_daily_temp)) + 
  stat_summary(fun.min = min,
               fun.max = max,
               fun = median) +
  facet_wrap(~ month, nrow = 6)

Within a particular month, the daily temperature remains roughly constant from day to day.
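
How constant the temperature is within a month can be put into a number, for example the spread of the per-day medians inside each month. This is only a sketch of one possible measure; the `toy` data frame below is hypothetical.

```r
# Hypothetical toy data: two months, two calendar days, two years each.
toy <- data.frame(
  month = factor(c(1, 1, 1, 1, 7, 7, 7, 7)),
  day   = factor(c(1, 1, 2, 2, 1, 1, 2, 2)),
  avg_daily_temp = c(28, 32, 30, 36, 75, 79, 77, 81)
)

# Median temperature for each calendar day of each month.
daily_median <- aggregate(avg_daily_temp ~ month + day, data = toy, FUN = median)

# Spread (max minus min) of those daily medians within each month.
spread <- tapply(daily_median$avg_daily_temp, daily_median$month,
                 function(x) max(x) - min(x))
spread
```

A small spread relative to the gap between months would support the reading of the faceted plot.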