Synopsis

As part of the week 3 assignment, I scraped the daily average weather data and did some basic exploratory analysis. The findings are:

  1. The temperatures have been increasing over the years
  2. Colder months show a lot more variation in temperature than the warmer months
  3. Temperatures typically vary between 0 and 90, with a very few values recorded as -99 (possibly a data error or a placeholder for missing readings)

Packages Required:

I used the following packages for this assignment:

  library(ggplot2)
  library(scales)
  library(grid)
  library(RColorBrewer)

All of the above packages are used for plotting the data.

Source Code

Data was imported using the read.table function in R.

cincy_url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
cincy_data <- read.table(cincy_url)
str(cincy_data)
## 'data.frame':    7963 obs. of  4 variables:
##  $ V1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V2: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V3: int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ V4: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
summary(cincy_data)
##        V1               V2              V3             V4        
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-99.00  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.: 40.10  
##  Median : 6.000   Median :16.00   Median :2005   Median : 57.00  
##  Mean   : 6.479   Mean   :15.72   Mean   :2005   Mean   : 54.46  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.: 70.70  
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   : 89.20

The str and summary functions, together with the source link, helped me figure out the contents of the data. The data contains the month, day and year, along with the daily average temperatures from 1995 to 2016. Hence we rename the columns accordingly.

colnames(cincy_data) <- c("month","day","year","avg_temp")
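
As a small variation, the column names can also be supplied at import time through the col.names argument of read.table. The sketch below is only an illustration: it reads an inline text sample (the first three rows shown in the str output above) instead of the URL so that it runs offline, and the name sample_data is a placeholder.

```r
# Inline sample: the first three rows from the str() output above
sample_text <- "
1 1 1995 41.1
1 2 1995 22.2
1 3 1995 22.8
"

# Name the columns while reading instead of renaming afterwards
sample_data <- read.table(
  text = sample_text,
  col.names = c("month", "day", "year", "avg_temp")
)

str(sample_data)
```

With the real file, the text argument would be replaced by cincy_url as in the import step above.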

Data Description

Below are the dimensions of the data, the count of missing values, and summary statistics for each variable.

dim(cincy_data)
## [1] 7963    4
print(paste(nrow(cincy_data), "rows and", ncol(cincy_data), "columns."))
## [1] "7963 rows and 4 columns."
table(is.na(cincy_data))
## 
## FALSE 
## 31852
summary(cincy_data)
##      month             day             year         avg_temp     
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-99.00  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.: 40.10  
##  Median : 6.000   Median :16.00   Median :2005   Median : 57.00  
##  Mean   : 6.479   Mean   :15.72   Mean   :2005   Mean   : 54.46  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.: 70.70  
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   : 89.20
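
The summary reports a minimum of -99 for avg_temp even though is.na finds no missing values, which suggests that missing readings may be stored as -99 rather than NA. The following is a minimal sketch of how such values could be counted and recoded, assuming -99 really is a placeholder (the source does not confirm this). The data frame is a small stand-in: the first three rows come from the str output above and the fourth row is a hypothetical -99 reading.

```r
# Small stand-in for cincy_data; the last row is a hypothetical -99 reading
sample_data <- data.frame(
  month    = c(1, 1, 1, 1),
  day      = c(1, 2, 3, 4),
  year     = c(1995, 1995, 1995, 1995),
  avg_temp = c(41.1, 22.2, 22.8, -99)
)

# Count the readings equal to the assumed placeholder value
n_placeholder <- sum(sample_data$avg_temp == -99)

# Recode the placeholder as NA so it no longer distorts summaries and plots
sample_data$avg_temp[sample_data$avg_temp == -99] <- NA

n_placeholder
mean(sample_data$avg_temp, na.rm = TRUE)
```

Once the values are NA, functions such as mean need na.rm = TRUE, and plotting functions drop those rows instead of drawing them at -99.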

Data Visualization

In the first graph, I wanted to observe how the temperatures are distributed, so I plotted a histogram using the ggplot2, scales, grid and RColorBrewer packages.
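
The plotting code is not shown in this report, so here is a rough sketch of what such a histogram could look like. It uses only ggplot2; the helper name plot_temp_histogram, the bin width and the colours are my own assumptions, and the toy input is the ten avg_temp values visible in the str output above.

```r
library(ggplot2)

# Hypothetical helper: histogram of daily average temperatures.
# Assumes `data` has a numeric column `avg_temp`, with missing values as NA.
plot_temp_histogram <- function(data, binwidth = 5) {
  ggplot(data, aes(x = avg_temp)) +
    geom_histogram(binwidth = binwidth, fill = "steelblue",
                   colour = "white", na.rm = TRUE) +
    labs(title = "Distribution of daily average temperature",
         x = "Average temperature", y = "Number of days") +
    theme_minimal()
}

# Toy input: the ten values shown in the str() output above
toy <- data.frame(
  avg_temp = c(41.1, 22.2, 22.8, 14.9, 9.5, 23.8, 31.1, 26.9, 31.3, 31.5)
)
p_hist <- plot_temp_histogram(toy)
print(p_hist)
```

On the full data set the call would be plot_temp_histogram(cincy_data), ideally after any -99 values have been recoded to NA.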

After getting an idea of the temperature distribution, I wanted to see how the temperature has varied over the years, so I plotted the average temperature for each year along with a regression trend line. The trend line shows the temperatures increasing over the years.
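
A possible way to build the yearly averages and the trend line is sketched below. This is not necessarily the code behind the original graph: the helper yearly_means is hypothetical, and the toy values are made up purely so the example runs on its own.

```r
library(ggplot2)

# Hypothetical helper: mean temperature per year.
# The formula interface of aggregate() drops rows with NA by default.
yearly_means <- function(data) {
  aggregate(avg_temp ~ year, data = data, FUN = mean)
}

# Made-up toy values for illustration only (not real readings)
toy <- data.frame(
  year     = c(1995, 1995, 1996, 1996, 1997, 1997),
  avg_temp = c(50, 54, 51, 55, NA, 56)
)
yearly <- yearly_means(toy)

# Linear trend: the coefficient on `year` is the estimated change per year
trend <- lm(avg_temp ~ year, data = yearly)
coef(trend)[["year"]]

p_trend <- ggplot(yearly, aes(x = year, y = avg_temp)) +
  geom_point() +
  geom_line() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Year", y = "Average temperature")
print(p_trend)
```

Looking at summary(trend) as well as the plot gives a sense of how strong the trend is relative to the year-to-year noise.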

Finally, I wanted to observe how the temperatures vary across the months of the year, so I plotted the temperatures by month. The graph shows much more variation in the colder months, while the summer months typically vary less.

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
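
The monthly comparison above can be sketched as a boxplot per month, with the standard deviation per month as a numeric check on the spread. The helper name plot_monthly_boxplot and the toy values are assumptions made for illustration; the toy numbers are not real readings.

```r
library(ggplot2)

# Hypothetical helper: one box per calendar month.
# Assumes `data` has columns `month` (1-12) and `avg_temp`.
plot_monthly_boxplot <- function(data) {
  ggplot(data, aes(x = factor(month, levels = 1:12, labels = month.abb),
                   y = avg_temp)) +
    geom_boxplot() +
    labs(x = "Month", y = "Average temperature")
}

# Made-up toy values for illustration only (not real readings)
toy <- data.frame(
  month    = c(1, 1, 1, 7, 7, 7),
  avg_temp = c(20, 30, 40, 74, 75, 76)
)

# Spread per month: a larger standard deviation means more variation
monthly_sd <- aggregate(avg_temp ~ month, data = toy, FUN = sd)
monthly_sd

p_box <- plot_monthly_boxplot(toy)
print(p_box)
```

If avg_temp contains NA values, geom_boxplot drops those rows and emits a warning of the same kind as the one shown above.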