As part of the week 3 assignment, I scraped the daily average weather data and did some basic exploratory analysis. The findings follow.
I used the following packages for this assignment:
library(ggplot2)
library(scales)
library(grid)
library(RColorBrewer)
All of the above packages are used for plotting the data.
Data was imported using the read.table function in R.
cincy_url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
cincy_data <- read.table(cincy_url)
str(cincy_data)
## 'data.frame': 7963 obs. of 4 variables:
## $ V1: int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2: int 1 2 3 4 5 6 7 8 9 10 ...
## $ V3: int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ V4: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
summary(cincy_data)
## V1 V2 V3 V4
## Min. : 1.000 Min. : 1.00 Min. :1995 Min. :-99.00
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2000 1st Qu.: 40.10
## Median : 6.000 Median :16.00 Median :2005 Median : 57.00
## Mean : 6.479 Mean :15.72 Mean :2005 Mean : 54.46
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:2011 3rd Qu.: 70.70
## Max. :12.000 Max. :31.00 Max. :2016 Max. : 89.20
The str and summary functions, together with the linked source, helped me figure out the contents of the data. It contains the month, day and year, along with daily average temperatures from 1995 to 2016. I therefore renamed the columns accordingly.
colnames(cincy_data) <- c("month","day","year","avg_temp")
Below are the dimensions of the data, a count of NA values, and summary statistics for each variable.
dim(cincy_data)
## [1] 7963 4
print(paste(dim(cincy_data)[1]," Rows and ", dim(cincy_data)[2]," columns."))
## [1] "7963 Rows and 4 columns."
table(is.na(cincy_data))
##
## FALSE
## 31852
summary(cincy_data)
## month day year avg_temp
## Min. : 1.000 Min. : 1.00 Min. :1995 Min. :-99.00
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2000 1st Qu.: 40.10
## Median : 6.000 Median :16.00 Median :2005 Median : 57.00
## Mean : 6.479 Mean :15.72 Mean :2005 Mean : 54.46
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:2011 3rd Qu.: 70.70
## Max. :12.000 Max. :31.00 Max. :2016 Max. : 89.20
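The minimum avg_temp of -99 in the summary is not a plausible temperature for Cincinnati. It is most likely a placeholder for missing readings, which would also explain why is.na() finds no missing values. The following is a minimal sketch, assuming -99 is the missing-value code in this file; the helper name recode_missing is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: treat a sentinel value (assumed to be -99) as missing.
recode_missing <- function(df, column = "avg_temp", sentinel = -99) {
  values <- df[[column]]
  # Guard against existing NAs so the comparison never yields an NA index.
  is_sentinel <- !is.na(values) & values == sentinel
  values[is_sentinel] <- NA
  df[[column]] <- values
  df
}

# Apply to the data frame from this report when it is available.
if (exists("cincy_data")) {
  cincy_data <- recode_missing(cincy_data)
  # Number of readings now marked as missing.
  print(sum(is.na(cincy_data$avg_temp)))
}
```

Recoding before plotting keeps the placeholder from distorting means and histograms, and lets ggplot2 drop those rows with a warning instead of drawing them.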
For the first graph, I wanted to see how the temperatures are distributed, so I plotted a histogram using the ggplot2, scales, grid and RColorBrewer packages.
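The plotting code is not shown in this report, so the following is only a sketch of how such a histogram could be drawn. It uses ggplot2 alone, and the function name plot_temp_hist, the bin width of 5, the colours, and the filter on -99 (assumed to be the file's missing-value code) are all assumptions.

```r
library(ggplot2)

# Sketch of a temperature histogram; bin width and the -99 filter are assumptions.
plot_temp_hist <- function(df, binwidth = 5) {
  # Drop missing readings, assuming -99 is the file's missing-value code.
  valid <- df[!is.na(df$avg_temp) & df$avg_temp != -99, ]
  ggplot(valid, aes(x = avg_temp)) +
    geom_histogram(binwidth = binwidth, fill = "steelblue", colour = "white") +
    labs(x = "Daily average temperature (degrees F)",
         y = "Number of days",
         title = "Distribution of daily average temperatures")
}

# Draw the plot when the data frame from this report is available.
if (exists("cincy_data")) print(plot_temp_hist(cincy_data))
```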
With an idea of the distribution in hand, I wanted to see how the temperature has varied over the years, so I plotted the average temperature for each year along with a regression trend line. The trend line slopes upward, showing temperatures increasing over the years.
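A yearly trend plot of this kind could be sketched as below. The function names yearly_means and plot_yearly_trend are hypothetical, and -99 is again assumed to mark missing readings. Note that 7963 rows is fewer than 22 complete years would give, so the final year is probably partial and its mean may not be directly comparable with the others.

```r
library(ggplot2)

# Mean temperature per year; -99 is assumed to mark missing readings.
yearly_means <- function(df) {
  valid <- df[!is.na(df$avg_temp) & df$avg_temp != -99, ]
  aggregate(avg_temp ~ year, data = valid, FUN = mean)
}

# Yearly means with a least-squares (lm) trend line.
plot_yearly_trend <- function(df) {
  ggplot(yearly_means(df), aes(x = year, y = avg_temp)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = "Year", y = "Mean daily average temperature (degrees F)")
}

if (exists("cincy_data")) {
  print(plot_yearly_trend(cincy_data))
  # Slope of the fitted line, in degrees per year.
  print(coef(lm(avg_temp ~ year, data = yearly_means(cincy_data)))["year"])
}
```

Printing the slope alongside the plot gives a number to back the visual impression of the trend line.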
Finally, I wanted to observe how temperatures vary across the months of the year, so I plotted the monthly temperatures. The graph shows much more variation in the colder months, while the summer months typically vary less.
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
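The warning above mentions stat_boxplot, which suggests the monthly plot was a set of boxplots with the missing readings dropped. A minimal sketch of such a plot follows; the function name plot_monthly_box is hypothetical, and -99 is assumed to be the missing-value code.

```r
library(ggplot2)

# Sketch of per-month boxplots of the daily temperatures.
plot_monthly_box <- function(df) {
  # Recode the assumed missing-value code (-99) to NA; ggplot2 then drops
  # those rows with a "non-finite values" warning when the plot is drawn.
  df$avg_temp[!is.na(df$avg_temp) & df$avg_temp == -99] <- NA
  # month.abb is base R's vector of abbreviated month names ("Jan" to "Dec").
  df$month_name <- factor(month.abb[df$month], levels = month.abb)
  ggplot(df, aes(x = month_name, y = avg_temp)) +
    geom_boxplot() +
    labs(x = "Month", y = "Daily average temperature (degrees F)")
}

if (exists("cincy_data")) print(plot_monthly_box(cincy_data))
```

The height of each box and the length of its whiskers show the spread within a month, which is what makes the difference between winter and summer variation visible.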