Synopsis

This page walks through what to do when you first obtain a data set. For this assignment, we use the Cincinnati weather data set found at http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt and walk through the following steps:

  1. Review the codebook
  2. Learn about the data
  3. Visualize the data

Packages Required

The following packages are used in this assignment:

library(ggplot2) # graphing package for data visualization
library(dplyr) # used to filter and subset data frame
library(RColorBrewer) # used to set color palette for graph

Source Code

The weather data come from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The average daily temperatures are computed from 24 hourly temperature readings in the GSOD data.

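The weather data file is updated regularly and contains data from January 1, 1995 to the present. The following variables are included in the data set:

  1. Month: month of the observation (1 to 12)
  2. Day: day of the month (1 to 31)
  3. Year: year of the observation (1995 to present)
  4. Temperature: average daily temperature in degrees Fahrenheit; missing values are coded as -99 in the source file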

Data Description

To produce summary statistics for the Cincinnati weather data, we first read the data directly from the weather data site, as shown below:

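  # Read the whitespace-delimited file; missing temperatures are coded as -99
  WeatherURL <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
  WeatherImport <- read.table(WeatherURL, na.strings = "-99")
  # The file has no header row, so assign the column names manually
  names(WeatherImport) <- c("Month", "Day", "Year", "Temperature")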
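
To see how the -99 code is turned into a missing value on import, the same call can be tried on a few lines of text. This is a minimal sketch: the three rows in `sample_text` are made-up values in the same four-column layout, and `SampleImport` is a hypothetical name used only for this illustration.

```r
# Made-up rows in the same layout as the weather file (not real observations)
sample_text <- "
 1  1  1995  41.1
 1  2  1995  -99
 1  3  1995  22.2
"

# na.strings = "-99" tells read.table to treat the -99 code as NA
SampleImport <- read.table(text = sample_text, na.strings = "-99",
                           col.names = c("Month", "Day", "Year", "Temperature"))

str(SampleImport)                    # column types after import
is.na(SampleImport$Temperature)      # TRUE marks the row that held -99
```

Because the -99 code is converted at import time, Temperature stays numeric and later summaries do not treat -99 as a real reading.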

Then, we found the total number of observations and the variable names in the data set:

  nrow(WeatherImport)
## [1] 7963
  names(WeatherImport)
## [1] "Month"       "Day"         "Year"        "Temperature"

We also obtained the total count of missing values in the data set:

  sum(is.na(WeatherImport))
## [1] 14
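
The total above does not say which columns or rows hold the missing values. One possible way to break it down is sketched below on a small made-up data frame (`toy` is a hypothetical stand-in for `WeatherImport`; the same calls should apply to the real data).

```r
# Small made-up data frame standing in for WeatherImport
toy <- data.frame(Month = c(1, 1, 2),
                  Day = c(1, 2, 1),
                  Year = c(1995, 1995, 1995),
                  Temperature = c(41.1, NA, 30.5))

# Count of missing values in each column
na_by_column <- colSums(is.na(toy))
na_by_column

# Rows that contain at least one missing value
missing_rows <- toy[!complete.cases(toy), ]
missing_rows
```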

Lastly, we computed summary statistics for each variable, including the minimum, quartiles, median, mean, and maximum:

  summary(WeatherImport)
##      Month             Day             Year       Temperature   
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-2.20  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.:40.20  
##  Median : 6.000   Median :16.00   Median :2005   Median :57.10  
##  Mean   : 6.479   Mean   :15.72   Mean   :2005   Mean   :54.73  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.:70.70  
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   :89.20  
##                                                  NA's   :14
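
The overall summary pools every month together. A grouped summary is one way to see how temperature varies by month; the sketch below uses dplyr on a made-up data frame (`toy` and the column names `MeanTemp` and `MissingDays` are hypothetical, chosen only for this illustration).

```r
library(dplyr)

# Made-up readings for two months, including one missing value
toy <- data.frame(Month = c(1, 1, 7, 7),
                  Temperature = c(30, NA, 76, 80))

# Mean temperature and number of missing days per month
monthly <- toy %>%
  group_by(Month) %>%
  summarise(MeanTemp = mean(Temperature, na.rm = TRUE),
            MissingDays = sum(is.na(Temperature)))

monthly
```

Passing `na.rm = TRUE` matters here: without it, any month containing a missing reading would report a mean of NA.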

Data Visualization

The following plot shows the daily temperatures for each year from 1995 to the present, colored by month.

  library(RColorBrewer)
  # Year stays numeric for the x-axis; Month becomes a factor so each month gets its own color
  WeatherImport$Year <- as.numeric(WeatherImport$Year)
  WeatherImport$Month <- as.factor(WeatherImport$Month)
  ggplot(WeatherImport, aes(x = Year, y = Temperature, colour = Month)) +
    geom_point(alpha = 0.3, position = position_jitter()) +
    scale_color_brewer(palette = "Paired") +
    ggtitle("Temperature Breakdown by Month")

The following shows the temperatures for the most recent full year (2015):

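  library(dplyr)
  WeatherImport %>%
    filter(Year == 2015) %>%
    ggplot(aes(x = Month, y = Temperature, colour = Year)) +
    geom_point(alpha = 0.3, position = position_jitter()) +
    # Month is a factor, so each month is its own group by default;
    # group = 1 fits a single linear trend across all months
    stat_smooth(aes(group = 1), method = "lm") +
    ggtitle("Temperature in 2015")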

The following shows a box plot for each year of temperature data:

  # Treat Year as a categorical variable so each year gets its own box
  WeatherImport$Year <- as.factor(WeatherImport$Year)
  ggplot(subset(WeatherImport, !is.na(Temperature)), 
         aes(x = Year, y = Temperature)) +
    geom_boxplot() + ggtitle("Summary of Weather each Year") +
    geom_point(alpha=0.3, color="lightslateblue", position = "jitter")