Synopsis
This page walks through what you should do when you first obtain a data set. For this assignment, we use the Cincinnati weather data set found at http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt and walk through the following steps:
- Review the codebook
- Learn about the data
- Visualize the data
Packages Required
The following packages are used in this assignment:
library(ggplot2) # graphing package for data visualization
library(dplyr) # used to filter and subset data frame
library(RColorBrewer) # used to set color palette for graph
Source Code
The weather data come from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). Each average daily temperature is computed from 24 hourly temperature readings in the GSOD data.
The weather data file is updated on a regular basis and contains weather data from January 1, 1995 to present. The following variables are included in the data set:
- Month: the month associated with the particular record of temperature
- Day: the day of the month associated with the particular record of temperature
- Year: the year associated with the particular record of temperature
- Temperature: average daily temperature in Fahrenheit
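Because the date is split across three variables, it can help to combine them into a single date value before plotting or sorting. The following is a minimal sketch, assuming a small hypothetical data frame (`weather_sample`) laid out like the data set; the `Date` column is an addition and is not part of the original file.

```r
# Hypothetical sample rows with the same four variables as the data set.
weather_sample <- data.frame(
  Month = c(1, 1, 12),
  Day = c(1, 2, 31),
  Year = c(1995, 1995, 2015),
  Temperature = c(41.1, 22.2, 45.7)
)

# Combine Year, Month, and Day into one Date column (illustrative addition).
weather_sample$Date <- as.Date(
  ISOdate(weather_sample$Year, weather_sample$Month, weather_sample$Day)
)

weather_sample
```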
Data Description
To show the summary statistics for the Cincinnati weather data, we first imported the data from the weather data site, treating the missing-value code -99 as NA, as shown below:
WeatherURL <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
WeatherImport <- read.table(WeatherURL, na.strings = "-99")
names(WeatherImport) <- c("Month", "Day", "Year", "Temperature")
Then, we found the total number of observations and the variable names in the data set:
nrow(WeatherImport)
## [1] 7963
names(WeatherImport)
## [1] "Month" "Day" "Year" "Temperature"
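The import step can also be tried without a network connection. The sketch below assumes a few hypothetical lines (`sample_text`) that mimic the four-column layout of the file, and reads them with `read.table(text = ...)` using the same missing-value code.

```r
# Hypothetical lines mimicking the file layout: month, day, year, temperature.
sample_text <- "
 1  1  1995  41.1
 1  2  1995  22.2
 1  3  1995  -99
"

# Read the text, name the columns, and treat -99 as a missing value.
weather_sample <- read.table(
  text = sample_text,
  na.strings = "-99",
  col.names = c("Month", "Day", "Year", "Temperature")
)

nrow(weather_sample)
names(weather_sample)
```

Passing `col.names` at read time is one alternative to assigning `names()` afterwards; both give the same column names.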
We also obtained the total count of missing values in the data set:
sum(is.na(WeatherImport))
## [1] 14
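The total above does not say which variable the missing values belong to. As a minimal sketch on a hypothetical data frame (`weather_sample`), `colSums()` over `is.na()` gives the count for each column separately.

```r
# Hypothetical sample with two missing temperatures.
weather_sample <- data.frame(
  Month = c(1, 1, 2),
  Day = c(1, 2, 1),
  Year = c(1995, 1995, 1995),
  Temperature = c(41.1, NA, NA)
)

# Count missing values per column instead of across the whole data frame.
missing_by_column <- colSums(is.na(weather_sample))
missing_by_column
```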
Lastly, we found the summary statistics for each variable, including the mean, median, min, and max:
summary(WeatherImport)
##      Month             Day             Year       Temperature
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-2.20
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.:40.20
##  Median : 6.000   Median :16.00   Median :2005   Median :57.10
##  Mean   : 6.479   Mean   :15.72   Mean   :2005   Mean   :54.73
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.:70.70
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   :89.20
##                                                  NA's   :14
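The overall summary pools every month together, which hides the seasonal pattern. A grouped summary is one way to look at it; the sketch below assumes a small hypothetical data frame (`weather_sample`) and uses base R's `aggregate()` to compute the mean temperature per month.

```r
# Hypothetical sample with two months and one missing temperature.
weather_sample <- data.frame(
  Month = c(1, 1, 7, 7, 7),
  Temperature = c(30, 40, 70, 80, NA)
)

# Mean temperature per month; the formula interface drops rows with NA by default.
monthly_mean <- aggregate(Temperature ~ Month, data = weather_sample, FUN = mean)
monthly_mean
```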
Data Visualization
The following shows a breakdown of the temperatures by month, from 1995 to present.
library(RColorBrewer)
WeatherImport$Year <- as.numeric(WeatherImport$Year)
WeatherImport$Month <- as.factor(WeatherImport$Month)
ggplot(WeatherImport, aes(x = Year, y = Temperature, colour = Month)) +
  geom_point(alpha = 0.3, position = position_jitter()) +
  scale_color_brewer(palette = "Paired") +
  ggtitle("Temperature Breakdown by Month")
The following shows the temperatures for the last year (2015):
library(dplyr)
WeatherImport %>%
  filter(Year == 2015) %>%
  ggplot(aes(x = Month, y = Temperature, colour = Year)) +
  geom_point(alpha = 0.3, position = position_jitter()) +
  stat_smooth(method = "lm") +
  ggtitle("Temperature in 2015")
The following shows a box plot for each year of temperature data:
WeatherImport$Year <- as.factor(WeatherImport$Year)
ggplot(subset(WeatherImport, !is.na(Temperature)),
       aes(x = Year, y = Temperature)) +
  geom_boxplot() +
  ggtitle("Summary of Weather each Year") +
  geom_point(alpha = 0.3, color = "lightslateblue", position = "jitter")
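The box in each box plot spans the first and third quartiles, with the median marked inside. To read those numbers directly, the quartiles can be computed per year. This is a minimal sketch on a hypothetical data frame (`weather_sample`), using base R's `split()` and `quantile()` rather than anything from the original analysis.

```r
# Hypothetical sample with two years and one missing temperature.
weather_sample <- data.frame(
  Year = c(1995, 1995, 1995, 2015, 2015, 2015),
  Temperature = c(10, 20, 30, 40, 50, NA)
)

# Quartiles of temperature for each year; rows are quantiles, columns are years.
quartiles_by_year <- sapply(
  split(weather_sample$Temperature, weather_sample$Year),
  quantile,
  probs = c(0.25, 0.5, 0.75),
  na.rm = TRUE
)
quartiles_by_year
```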