This HTML document accompanies the R Markdown file I used to complete the week 3 assignment. The R Markdown file lists the packages I used for the homework. R Markdown is a helpful tool for this kind of task because the code and its results are documented neatly in one place, as summarized below. Cincinnati weather data is analyzed with the aid of three graphics. I have reviewed the codebook and learned about the data and its graphical representation.
Only 1 package was installed and used:
library(ggplot2) ##Package to produce complex multi-layered graphs in R
Source data for the data set are taken from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The data fields in the data set are:
1. Month
2. Day
3. Year
4. Temperature
The dataset contains daily records starting on January 1, 1995; the copy read for this report runs through October 19, 2016. We need categorical and numerical variables to support our analysis, so Month, Date and Year are converted into factor variables, while Temperature stays numeric. Missing temperature readings, which are coded as -99, are replaced with NA and the affected rows are removed.
#Creating data frame from given URL
url<-"http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
cincinnati_data <- read.table(url,stringsAsFactors = FALSE)
#Giving column names
colnames(cincinnati_data) <- c("Month","Date","Year","Temperature")
#Checking the structure
str(cincinnati_data)
## 'data.frame': 7963 obs. of 4 variables:
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ Temperature: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
#Converting into factor variables
cincinnati_data$Month<-factor(cincinnati_data$Month)
cincinnati_data$Date<-factor(cincinnati_data$Date)
cincinnati_data$Year<-factor(cincinnati_data$Year)
#Number of rows and variables
dim(cincinnati_data)
## [1] 7963 4
#Names of variables
names(cincinnati_data)
## [1] "Month" "Date" "Year" "Temperature"
#Checking top and bottom values
head(cincinnati_data)
## Month Date Year Temperature
## 1 1 1 1995 41.1
## 2 1 2 1995 22.2
## 3 1 3 1995 22.8
## 4 1 4 1995 14.9
## 5 1 5 1995 9.5
## 6 1 6 1995 23.8
tail(cincinnati_data)
## Month Date Year Temperature
## 7958 10 14 2016 54.4
## 7959 10 15 2016 63.2
## 7960 10 16 2016 68.7
## 7961 10 17 2016 71.1
## 7962 10 18 2016 74.4
## 7963 10 19 2016 75.3
#Replace the missing values with NA
cincinnati_data$Temperature[cincinnati_data$Temperature==-99] <-NA
#Counting the missing values
sum(is.na(cincinnati_data$Temperature))
## [1] 14
#Omitting rows with missing values
cincinnati_data <- na.omit(cincinnati_data)
#Displaying summary statistics of final dataset
summary(cincinnati_data)
## Month Date Year Temperature
## 5 : 682 2 : 262 1996 : 366 Min. :-2.20
## 7 : 682 3 : 262 2000 : 366 1st Qu.:40.20
## 1 : 681 4 : 262 2004 : 366 Median :57.10
## 3 : 681 5 : 262 2012 : 366 Mean :54.73
## 8 : 681 6 : 262 1995 : 365 3rd Qu.:70.70
## 10 : 670 8 : 262 1997 : 365 Max. :89.20
## (Other):3872 (Other):6377 (Other):5755
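The missing-value steps above can be checked on a small example. The following is a minimal sketch, assuming a made-up data frame named toy whose values are invented for illustration; only the -99 sentinel and the function calls come from the report itself.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  Month = c(1, 1, 1, 2),
  Date = c(1, 2, 3, 1),
  Year = c(1995, 1995, 1995, 1995),
  Temperature = c(41.1, -99, 22.8, -99)
)

# Recode the sentinel value -99 to NA, as done for the Cincinnati data.
toy$Temperature[toy$Temperature == -99] <- NA

# Count the missing readings before dropping them.
n_missing <- sum(is.na(toy$Temperature))

# Keep only the rows without missing values.
toy_clean <- na.omit(toy)

print(n_missing)
print(nrow(toy_clean))
```

Counting the NA values before calling na.omit() makes it easy to confirm that the number of dropped rows matches the number of missing readings.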
#Plotting every daily temperature reading, grouped by month
ggplot(data = cincinnati_data) +
  geom_point(mapping = aes(x = Month, y = Temperature, color = Month)) +
  ggtitle("Daily Temperature by Month") +
  ylab("Temperature (F)")
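The plot above draws one point per daily reading. If a single average per month is wanted, it could be computed first with aggregate(). This is only a sketch under assumptions: the toy data frame and its temperatures are invented, and the month.abb labels are an optional choice that the report does not use.

```r
# Hypothetical toy data; the temperatures are invented for illustration only.
toy <- data.frame(
  Month = factor(c(1, 1, 2, 2, 3, 3), levels = 1:12, labels = month.abb),
  Temperature = c(30, 34, 36, 40, 45, 51)
)

# Mean temperature per month; months without any rows do not appear.
monthly_mean <- aggregate(Temperature ~ Month, data = toy, FUN = mean)

print(monthly_mean)
```

The resulting data frame has one row per month and could be passed to ggplot() in place of the daily data.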
#Year is a factor, so group = 1 is needed for one smooth curve across all years
ggplot(data = cincinnati_data) +
  geom_point(mapping = aes(x = Year, y = Temperature), color = "blue") +
  geom_smooth(mapping = aes(x = Year, y = Temperature, group = 1), color = "blue") +
  ggtitle("Yearwise Temperature") +
  ylab("Temperature (F)")
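Because Year was converted to a factor earlier, any trend calculation that needs the year as a number has to convert it back with care. The sketch below uses a made-up factor named year_factor to show the usual pitfall.

```r
# Hypothetical factor of years, invented for illustration only.
year_factor <- factor(c(1995, 1996, 2016))

# as.numeric() on a factor returns the internal level codes, not the years.
codes <- as.numeric(year_factor)

# Going through as.character() first recovers the original year values.
years <- as.numeric(as.character(year_factor))

print(codes)
print(years)
```

A numeric year column built this way could be used on the x axis when a continuous trend line is preferred over a discrete axis.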
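#Median with min-max range per year; fun, fun.min and fun.max need ggplot2 3.3.0
#or later (older versions use fun.y, fun.ymin and fun.ymax)
ggplot(data = cincinnati_data) +
  stat_summary(mapping = aes(x = Year, y = Temperature),
    color = "blue",
    geom = "pointrange",
    fun.min = min,
    fun.max = max,
    fun = median
  ) +
  ggtitle("Summary statistics of Temperature") +
  ylab("Temperature (F)")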
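The numbers behind the point ranges in the last plot can also be tabulated directly. The following is a minimal sketch in base R, assuming a made-up data frame named toy with invented temperatures; it is one possible way to get the same minimum, median and maximum per year, not the method the plot uses internally.

```r
# Hypothetical toy data; the temperatures are invented for illustration only.
toy <- data.frame(
  Year = factor(c(1995, 1995, 1995, 1996, 1996, 1996)),
  Temperature = c(10, 50, 80, 20, 55, 90)
)

# tapply() applies each function to the temperatures of every year,
# returning the results in the order of the factor levels.
year_summary <- data.frame(
  Year = levels(toy$Year),
  Min = as.vector(tapply(toy$Temperature, toy$Year, min)),
  Median = as.vector(tapply(toy$Temperature, toy$Year, median)),
  Max = as.vector(tapply(toy$Temperature, toy$Year, max))
)

print(year_summary)
```

Each row of year_summary corresponds to one point range in the plot: the point is the median and the line runs from the minimum to the maximum.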