The purpose of this markdown file is to read the data and get to know it: the number of observations, the missing values, and a basic summary. The data are daily average temperatures for Cincinnati, and the file covers January 1, 1995 to the present.
The source data comes from the National Climatic Data Center and is available for research and non-commercial purposes only.
library("gdata") ## to pull text file from the web
library("ggplot2") ## for data visualization
Details about the columns are given below:
library("gdata")
library("readr")
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
OHCINCIN <- read.table(url, col.names = c("Month", "Day", "Year", "Temperature"))
str(OHCINCIN)
## 'data.frame': 7963 obs. of 4 variables:
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ Temperature: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
ncol(OHCINCIN) # gives the number of columns in the data
## [1] 4
names(OHCINCIN) # gives names of columns in the data
## [1] "Month" "Day" "Year" "Temperature"
nrow(OHCINCIN) # gives number of rows in the data
## [1] 7963
dim(OHCINCIN) # gives number of rows and columns in the data
## [1] 7963 4
str(OHCINCIN) # gives the structure of the data
## 'data.frame': 7963 obs. of 4 variables:
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ Temperature: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
head(OHCINCIN) # gives the first 6 rows in the data
tail(OHCINCIN) # gives the last 6 rows in the data
sum(is.na(OHCINCIN)) # gives the number of NAs present in the data
## [1] 0
OHCINCIN$Temperature[is.na(OHCINCIN$Temperature)] <- mean(OHCINCIN$Temperature, na.rm = TRUE) # replaces any NA values in the Temperature column with the column mean (no NAs are present here, so nothing changes)
summary(OHCINCIN) # gives a basic summary of each column (min, quartiles, median, mean, max)
## Month Day Year Temperature
## Min. : 1.000 Min. : 1.00 Min. :1995 Min. :-99.00
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2000 1st Qu.: 40.10
## Median : 6.000 Median :16.00 Median :2005 Median : 57.00
## Mean : 6.479 Mean :15.72 Mean :2005 Mean : 54.46
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:2011 3rd Qu.: 70.70
## Max. :12.000 Max. :31.00 Max. :2016 Max. : 89.20
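The minimum of -99 in the Temperature column is far outside the range of the other values, which suggests that missing readings may be stored as the placeholder -99 rather than as NA; that would also explain why `is.na()` finds nothing. The sketch below assumes -99 is such a placeholder and uses a small made-up data frame `toy` (not the real file) to show one way to recode it before summarising.

```r
# Toy stand-in for OHCINCIN (values are made up for illustration).
toy <- data.frame(
  Year        = c(1995, 1995, 1995, 1996, 1996),
  Temperature = c(41.1, -99, 22.8, -99, 31.5)
)

# Assumption: -99 marks a missing reading rather than a real temperature.
toy$Temperature[toy$Temperature == -99] <- NA

n_missing  <- sum(is.na(toy$Temperature))          # readings recoded to NA
mean_valid <- mean(toy$Temperature, na.rm = TRUE)  # mean of the valid readings

print(n_missing)
print(mean_valid)
```

If the same recoding were applied to `OHCINCIN`, the NA count and the mean-imputation step would have actual values to act on, and the summary minimum would reflect a real reading.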
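# Build a day/month/year string and parse it into a Date column
OHCINCIN$date_variable <-
  paste(OHCINCIN$Day, OHCINCIN$Month, OHCINCIN$Year, sep = "/")
OHCINCIN$Date <- as.Date(OHCINCIN$date_variable, format = "%d/%m/%Y")
# Drop 2016, which is an incomplete year
OHCINCIN <- OHCINCIN[OHCINCIN$Year != 2016, ]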
OHCINCIN$Month <- as.factor(OHCINCIN$Month)
OHCINCIN
avg_temp <- mean(OHCINCIN$Temperature)
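The rows for 2016 are dropped above because that year is incomplete. One way to check how complete each year is would be to count the daily observations per year. The sketch below does this on a hypothetical date range (all of 2015 plus part of 2016; the end date is chosen only for illustration) rather than on the real data.

```r
# Hypothetical date range standing in for the real data: all of 2015
# plus part of 2016 (the end date is an assumption for illustration).
dates <- seq(as.Date("2015-01-01"), as.Date("2016-10-19"), by = "day")
toy <- data.frame(Year = as.integer(format(dates, "%Y")))

# Number of daily observations in each year
obs_per_year <- table(toy$Year)
print(obs_per_year)

# Years with fewer than 365 observations are incomplete
incomplete <- names(obs_per_year)[obs_per_year < 365]
print(incomplete)
```

On the real data, the same count could be run as `table(OHCINCIN$Year)` before the 2016 rows are removed.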
This visualization shows the year-on-year average temperature (orange line), with the blue line showing the overall average. The year 2016 is excluded because it has data for only 10 months, and including it would give a misleading average temperature.
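ggplot(data = OHCINCIN) +
  # yearly mean temperature; `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
  geom_line(
    mapping = aes(x = Year, y = Temperature),
    stat = "summary",
    fun = "mean",
    color = "orange"
  ) +
  # overall mean temperature as a horizontal reference line
  geom_hline(yintercept = avg_temp, color = "blue")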
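The yearly line in this plot comes from averaging the daily temperatures within each year. As a minimal sketch of that calculation, assuming a made-up data frame `toy` in place of `OHCINCIN`, the same numbers can be computed directly with base R:

```r
# Toy stand-in for OHCINCIN (values are made up for illustration).
toy <- data.frame(
  Year        = c(1995, 1995, 1996, 1996),
  Temperature = c(40, 60, 50, 54)
)

# Mean temperature per year: the values the yearly line would be drawn from
yearly_mean <- aggregate(Temperature ~ Year, data = toy, FUN = mean)

# Mean over all daily rows: the value of the horizontal reference line
overall_mean <- mean(toy$Temperature)

print(yearly_mean)
print(overall_mean)
```

Note that the overall mean is taken over all daily rows, so it matches the mean of the yearly means only when every year has the same number of rows.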
This graph shows a month-wise bar chart of average temperature. Temperatures are highest in the 6th, 7th, and 8th months and lowest in the 1st, 2nd, and 12th months of the year.
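ggplot(data = OHCINCIN) +
  # one bar per month, with bar height equal to the mean temperature
  geom_bar(
    mapping = aes(x = Month, y = Temperature),
    stat = "summary",
    fun = "mean", # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
    color = "blue"
  )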
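The warmest and coldest months can also be read off numerically instead of from the bar heights. The sketch below uses a made-up data frame `toy` with only three months (an assumption for brevity, not the real data) and picks out the months with the highest and lowest mean.

```r
# Toy stand-in for OHCINCIN with three months (values are made up).
toy <- data.frame(
  Month       = factor(c(1, 1, 7, 7, 12, 12)),
  Temperature = c(30, 34, 76, 78, 35, 37)
)

# Mean temperature per month, named by month number
monthly_mean <- tapply(toy$Temperature, toy$Month, mean)
print(monthly_mean)

# Month labels with the highest and lowest mean temperature
warmest <- names(which.max(monthly_mean))
coldest <- names(which.min(monthly_mean))
print(c(warmest = warmest, coldest = coldest))
```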
This visualization plots year against temperature for each month separately. It shows the range of temperatures in each month and each year, so it captures not only how temperature differs between months but also how it varies year on year within each month.
ggplot(data = OHCINCIN) +
  geom_point(mapping = aes(x = Year, y = Temperature)) +
  facet_wrap(~ Month, nrow = 4)