This is my report for the week 3 assignment on scraping data, exploring it, and visualising it. The dataset being analysed contains the daily average temperatures for Cincinnati from January 1, 1995 to October 19, 2016.
The tidyverse package bundles several packages that can be used for data representation.
library(tidyverse)
The data fields are as follows: V1 is the month, V2 the day, V3 the year, and V4 the average daily temperature (°F).
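The generic names V1 to V4 can be swapped for descriptive ones. The following is a minimal sketch on a three-row stand-in for the file; the names Month, Day, Year, Temp_F and Date are hypothetical choices for illustration and do not appear in the raw data.

```r
# Stand-in for the first three rows of the file (month, day, year, temperature)
sample_rows <- data.frame(
  V1 = c(1, 1, 1),
  V2 = c(1, 2, 3),
  V3 = c(1995, 1995, 1995),
  V4 = c(41.1, 22.2, 22.8)
)

# Hypothetical descriptive names; the raw file has no header row
names(sample_rows) <- c("Month", "Day", "Year", "Temp_F")

# Combine the three date parts into a single Date column
sample_rows$Date <- as.Date(ISOdate(sample_rows$Year, sample_rows$Month, sample_rows$Day))

print(sample_rows)
```

A single Date column makes it easier to sort, filter by range, or plot the full daily series later on.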
The number of rows and columns
cincinnati <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
ncol(cincinnati)
## [1] 4
nrow(cincinnati)
## [1] 7963
Variable names and their data types
cincinnati <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
names(cincinnati)
## [1] "V1" "V2" "V3" "V4"
str(cincinnati)
## 'data.frame': 7963 obs. of 4 variables:
## $ V1: int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2: int 1 2 3 4 5 6 7 8 9 10 ...
## $ V3: int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ V4: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
A sneak peek into the dataset
cincinnati <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
head(cincinnati)
## V1 V2 V3 V4
## 1 1 1 1995 41.1
## 2 1 2 1995 22.2
## 3 1 3 1995 22.8
## 4 1 4 1995 14.9
## 5 1 5 1995 9.5
## 6 1 6 1995 23.8
tail(cincinnati)
## V1 V2 V3 V4
## 7958 10 14 2016 54.4
## 7959 10 15 2016 63.2
## 7960 10 16 2016 68.7
## 7961 10 17 2016 71.1
## 7962 10 18 2016 74.4
## 7963 10 19 2016 75.3
Checking for missing values (the file codes them as -99)
cincinnati <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
cincinnati[cincinnati==-99] <- NA
sum(is.na(cincinnati))
## [1] 14
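The total above does not say where the missing values are. As a rough sketch, the count can be broken down by column and the affected rows listed. The data frame below is a made-up four-row stand-in, and the positions of the -99 entries are illustrative only.

```r
# Made-up stand-in data with the -99 placeholder in the temperature column
temps <- data.frame(
  V1 = c(1, 1, 1, 1),
  V2 = c(1, 2, 3, 4),
  V3 = c(1995, 1995, 1995, 1995),
  V4 = c(41.1, -99, 22.8, -99)
)

# Recode the placeholder in the temperature column only
temps$V4[temps$V4 == -99] <- NA

# Number of missing values in each column
na_per_column <- colSums(is.na(temps))
print(na_per_column)

# Rows that contain at least one missing value
print(temps[!complete.cases(temps), ])
```

Restricting the recode to V4 avoids touching the date columns, where -99 would not be a valid value anyway.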
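Basic Statistics
The summary is computed on the freshly read file, so the V4 minimum of -99 is the missing-value placeholder rather than a real temperature, and it pulls the V4 mean down.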
cincinnati <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
summary(cincinnati)
## V1 V2 V3 V4
## Min. : 1.000 Min. : 1.00 Min. :1995 Min. :-99.00
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2000 1st Qu.: 40.10
## Median : 6.000 Median :16.00 Median :2005 Median : 57.00
## Mean : 6.479 Mean :15.72 Mean :2005 Mean : 54.46
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:2011 3rd Qu.: 70.70
## Max. :12.000 Max. :31.00 Max. :2016 Max. : 89.20
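A minimum of -99 in V4 is the missing-value placeholder, not a measured temperature. The sketch below, on five illustrative values, shows how much a single placeholder can shift a mean and how recoding it to NA avoids that. The vector and the variable names are assumptions for this example.

```r
# Five illustrative temperatures, one of which is the -99 placeholder
v4 <- c(41.1, 22.2, -99, 14.9, 9.5)

# Mean with the placeholder treated as a real reading
raw_mean <- mean(v4)

# Recode the placeholder to NA and drop it from the calculation
v4_clean <- replace(v4, v4 == -99, NA)
clean_mean <- mean(v4_clean, na.rm = TRUE)

print(c(raw = raw_mean, clean = clean_mean))
```

The same idea applies to summary(): calling it after the -99 values have been set to NA reports them in a separate NA count instead of folding them into the minimum and the mean.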
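Visualization 1: Mean temperature by year. In this plot the highest mean temperature is in 2016 and the lowest is in 1996. The 2016 data end on October 19, so that year's mean leaves out the colder months of November and December and is not directly comparable with the full years.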
library(tidyverse)
cincinnati <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
# Recode the -99 placeholder as NA so it does not distort the yearly means
cincinnati[cincinnati == -99] <- NA
# Mean temperature per year, ignoring missing days
cin2 <- with(cincinnati, aggregate(x = V4, by = list(V3), FUN = mean, na.rm = TRUE))
names(cin2)[names(cin2) == 'x'] <- 'Average.Temperature'
names(cin2)[names(cin2) == 'Group.1'] <- 'Year'
ggplot(cin2, aes(x = Year, y = Average.Temperature)) + geom_point() + geom_line(color = "blue")
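Because the last year in the file stops on October 19, it can help to report how many days each yearly mean is based on. The sketch below uses a toy series with made-up temperatures (a smooth seasonal curve) covering all of 2015 and 2016 up to October 19. The column names and the 365-day threshold are assumptions for illustration.

```r
# Toy daily series: full year 2015 plus 2016 up to October 19
dates <- seq(as.Date("2015-01-01"), as.Date("2016-10-19"), by = "day")
day_of_year <- as.integer(format(dates, "%j"))
toy <- data.frame(
  Year = as.integer(format(dates, "%Y")),
  # Made-up seasonal curve: coldest in mid-January, warmest in mid-July
  Temp_F = 55 - 25 * cos(2 * pi * (day_of_year - 15) / 365)
)

# Mean temperature and number of observed days per year
yearly <- data.frame(
  Year = sort(unique(toy$Year)),
  Mean_Temp_F = as.numeric(tapply(toy$Temp_F, toy$Year, mean)),
  Days = as.integer(tapply(toy$Temp_F, toy$Year, length))
)

# Flag years that do not cover a full calendar year
yearly$Complete <- yearly$Days >= 365
print(yearly)
```

In this toy series the partial year has the higher mean only because its coldest weeks are missing, which is the kind of effect to check for before comparing years.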
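Visualization 2: Mean temperature by day of the month, averaged over all months and years. In this plot the highest mean temperature falls on the 30th and the lowest on the 5th. Days 29 to 31 do not occur in every month, so their means are based on fewer observations than those of the other days.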
library(tidyverse)
cincinnati <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
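# Recode the -99 placeholder as NA so it does not distort the daily means
cincinnati[cincinnati == -99] <- NA
# Mean temperature per day of the month, ignoring missing days
cin2 <- with(cincinnati, aggregate(x = V4, by = list(V2), FUN = mean, na.rm = TRUE))
names(cin2)[names(cin2) == 'x'] <- 'Average.Temperature'
names(cin2)[names(cin2) == 'Group.1'] <- 'Day'
ggplot(cin2, aes(x = Day, y = Average.Temperature)) + geom_point() + geom_line(color = "pink")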
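Not every day of the month occurs equally often, which matters when comparing their means. As a sketch, the calendar covered by the file (January 1, 1995 to October 19, 2016) can be rebuilt with seq() and the occurrences of each day of the month counted. No temperature data is needed for this check.

```r
# Every calendar date covered by the dataset
dates <- seq(as.Date("1995-01-01"), as.Date("2016-10-19"), by = "day")

# Day of the month (1 to 31) for each date
day_of_month <- as.integer(format(dates, "%d"))

# How many times each day of the month occurs in the period
days_observed <- table(day_of_month)
print(days_observed[c("1", "28", "29", "30", "31")])
```

The 31st occurs far less often than the 1st, and the 29th and 30th are missing from most or all Februaries, so the means for those days are drawn from a different mix of months.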
Visualization 3: Mean temperature by month, averaged over all years. In this plot the highest mean temperature is in July and the lowest is in January.
library(tidyverse)
cincinnati <- read.table("http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt")
# Recode the -99 placeholder as NA so it does not distort the monthly means
cincinnati[cincinnati == -99] <- NA
# Mean temperature per month, ignoring missing days
cin2 <- with(cincinnati, aggregate(x = V4, by = list(V1), FUN = mean, na.rm = TRUE))
names(cin2)[names(cin2) == 'x'] <- 'Average.Temperature'
names(cin2)[names(cin2) == 'Group.1'] <- 'Month'
ggplot(cin2, aes(x = Month, y = Average.Temperature)) + geom_point() + geom_line(color = "green")
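The month axis in this plot is numeric. One possible refinement is to convert the month numbers to labelled, ordered factor levels using R's built-in month.abb constant. The data frame and the Month_Label column below are hypothetical names used only for this sketch.

```r
# Month numbers as they appear in the grouped data
monthly <- data.frame(Month = 1:12)

# Map 1..12 to "Jan".."Dec", keeping calendar order rather than alphabetical order
monthly$Month_Label <- factor(month.abb[monthly$Month], levels = month.abb)

print(monthly)
```

If such a column is used as the x aesthetic in ggplot, geom_line() needs group = 1 inside aes() so that the points on the discrete axis are joined into a single line.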