Tabular Representation

Synopsis

This HTML document is generated from the R Markdown file used to complete the week 3 assignment. The R Markdown file, which I wrote, lists the packages I used for the homework and documents the code together with its results, as summarized below. Cincinnati weather data is analysed with the aid of three graphics. I have reviewed the codebook and learnt about the data and its graphical representation.

Packages Required

Only 1 package was installed and used:

library(ggplot2)  ##Package to produce complex multi-layered graphs in R

Source Code

Source data for the data set are taken from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The data fields in the data set are:

1. Month
2. Day
3. Year
4. Temperature

The dataset contains data from January 1, 1995 to the present (October 19, 2016 in the copy used here). The analysis needs both categorical and numerical variables, so Month, Date (the day of the month) and Year are converted to factor variables, while Temperature stays numeric. Missing temperature readings are coded as -99 in the source file; they are replaced with NA and the affected rows are dropped.
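The cleaning steps described above can be tried on a small example before touching the real file. This is only a sketch: the `toy` data frame and its values are made up for illustration, and `calendar_cols` is a hypothetical helper name, not something from the assignment.

```r
# Hypothetical toy rows mimicking the four fields; the values are made up.
toy <- data.frame(
  Month = c(1, 1, 2, 2),
  Date = c(1, 2, 1, 2),
  Year = c(1995, 1995, 1995, 1995),
  Temperature = c(41.1, -99, 30.5, -99)
)

# Recode the -99 sentinel as NA so it cannot distort any statistics.
toy$Temperature[toy$Temperature == -99] <- NA

# Convert the calendar fields to factors in one pass.
calendar_cols <- c("Month", "Date", "Year")
toy[calendar_cols] <- lapply(toy[calendar_cols], factor)

# Count the missing readings, then drop the rows that contain them.
n_missing <- sum(is.na(toy$Temperature))
toy_clean <- na.omit(toy)
```

Recoding the sentinel first matters because -99 is a valid number as far as R is concerned and would otherwise pull the minimum and the mean down.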

Data Description

#Creating data frame from given URL
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
cincinnati_data <- read.table(url, stringsAsFactors = FALSE)

#Giving column names
colnames(cincinnati_data) <- c("Month","Date","Year","Temperature")

#Checking the structure
str(cincinnati_data)
## 'data.frame':    7963 obs. of  4 variables:
##  $ Month      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year       : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ Temperature: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
#Converting into factor variables
cincinnati_data$Month<-factor(cincinnati_data$Month)
cincinnati_data$Date<-factor(cincinnati_data$Date)
cincinnati_data$Year<-factor(cincinnati_data$Year)

#Number of rows and variables
dim(cincinnati_data)
## [1] 7963    4
#Names of variables
names(cincinnati_data)
## [1] "Month"       "Date"        "Year"        "Temperature"
#Checking top and bottom values
head(cincinnati_data)
##   Month Date Year Temperature
## 1     1    1 1995        41.1
## 2     1    2 1995        22.2
## 3     1    3 1995        22.8
## 4     1    4 1995        14.9
## 5     1    5 1995         9.5
## 6     1    6 1995        23.8
tail(cincinnati_data)
##      Month Date Year Temperature
## 7958    10   14 2016        54.4
## 7959    10   15 2016        63.2
## 7960    10   16 2016        68.7
## 7961    10   17 2016        71.1
## 7962    10   18 2016        74.4
## 7963    10   19 2016        75.3
#Replace the missing values (coded as -99) with NA
cincinnati_data$Temperature[cincinnati_data$Temperature == -99] <- NA

#Counting the missing values
sum(is.na(cincinnati_data$Temperature))
## [1] 14
#Omitting rows with missing values
cincinnati_data <- na.omit(cincinnati_data)

#Displaying summary statistics of final dataset
summary(cincinnati_data)
##      Month           Date           Year       Temperature   
##  5      : 682   2      : 262   1996   : 366   Min.   :-2.20  
##  7      : 682   3      : 262   2000   : 366   1st Qu.:40.20  
##  1      : 681   4      : 262   2004   : 366   Median :57.10  
##  3      : 681   5      : 262   2012   : 366   Mean   :54.73  
##  8      : 681   6      : 262   1995   : 365   3rd Qu.:70.70  
##  10     : 670   8      : 262   1997   : 365   Max.   :89.20  
##  (Other):3872   (Other):6377   (Other):5755

Graphical Representation - 1

ggplot(data = cincinnati_data) + 
  geom_point(mapping = aes(x = Month, y = Temperature, color = Month)) +
  ggtitle("Monthly Avg Temperature") +
  ylab("Temperature(F)")
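The plot above shows every daily reading grouped by month. If a single average per month is wanted, it could be computed along these lines. This is a sketch on made-up numbers; `toy` and `monthly` are hypothetical names, and the same call should apply to cincinnati_data.

```r
# Hypothetical daily readings for two months; the values are made up.
toy <- data.frame(
  Month = factor(c(1, 1, 2, 2)),
  Temperature = c(40, 20, 50, 70)
)

# One mean temperature per month, returned as a data frame.
monthly <- aggregate(Temperature ~ Month, data = toy, FUN = mean)
print(monthly)
```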

Graphical Representation - 2

ggplot(data = cincinnati_data) +
  geom_point(mapping = aes(x = Year, y = Temperature), color = "blue") +
  # Year is a factor, so group = 1 is needed to fit one smooth across all years
  geom_smooth(mapping = aes(x = Year, y = Temperature, group = 1), color = "blue") +
  ggtitle("Yearwise Temperature") +
  ylab("Temperature(F)")

Graphical Representation - 3

ggplot(data = cincinnati_data) + 
  # fun, fun.min and fun.max replace fun.y, fun.ymin and fun.ymax (ggplot2 3.3.0 onward)
  stat_summary(mapping = aes(x = Year, y = Temperature),
               color = "blue",
               geom = "pointrange",
               fun.min = min,
               fun.max = max,
               fun = median
  ) + 
  ggtitle("Summary statistics of Temperature") +
  ylab("Temperature(F)")
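The point ranges in the plot above show the minimum, median and maximum temperature for each year. The same three statistics can be tabulated in base R, which is a handy check on what the plot draws. This is a sketch on made-up numbers; `toy` and `year_stats` are hypothetical names.

```r
# Hypothetical readings for two years; the values are made up.
toy <- data.frame(
  Year = factor(c(1995, 1995, 1995, 1996, 1996, 1996)),
  Temperature = c(10, 20, 30, 5, 15, 40)
)

# Split the temperatures by year, then summarise each group.
by_year <- split(toy$Temperature, toy$Year)
year_stats <- t(sapply(by_year, function(v) {
  c(min = min(v), median = median(v), max = max(v))
}))
print(year_stats)
```

Each row of year_stats corresponds to one point range: the row's min and max are the ends of the vertical line and the median is the point.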