This week the aim is to import and explore the Cincinnati weather data set. I will describe what each variable measures, then characterize the data set as a whole by looking at summary statistics and checking for missing values. Lastly, I’ll draw three visualizations in R and explain what each one shows.
The only package required for this analysis is “tidyverse”.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
This data set contains the 24-hour average daily temperature for Cincinnati from 1/1/1995 through 10/19/2016. The “month” variable holds the numbers 1-12 for the twelve months (Jan-Dec). The “day” variable is the day of that month. The “year” variable is the calendar year, and “temp” is the average temperature for that day in degrees Fahrenheit.
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
weather <- read.table(url)
colnames(weather) <- c("month", "day", "year", "temp")
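The column names can also be supplied while reading, through the `col.names` argument of `read.table()`. A minimal sketch, using three made-up lines of text in place of the real file so that it runs offline (`sample_lines` and its values are illustrative, not taken from the data set):

```r
# Three made-up rows standing in for the real file (illustrative values only)
sample_lines <- "
1 1 1995 41.1
1 2 1995 22.2
1 3 1995 -99
"

# col.names assigns the names at read time, so no separate colnames() call is needed
sample_weather <- read.table(text = sample_lines,
                             col.names = c("month", "day", "year", "temp"))

str(sample_weather)
```

The same `col.names` argument should work when `read.table()` is given the URL instead of `text`.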
Number of observations
nrow(weather)
## [1] 7963
dim(weather)
## [1] 7963 4
Number of variables and their names (which I assigned when reading in the data).
ncol(weather)
## [1] 4
colnames(weather)
## [1] "month" "day" "year" "temp"
Any missing values? Missing temperatures are coded as -99, so I recode those to NA before counting.
weather$temp[weather$temp == -99] <- NA
sum(is.na(weather))
## [1] 14
weather[!complete.cases(weather),]
## month day year temp
## 1454 12 24 1998 NA
## 1455 12 25 1998 NA
## 1460 12 30 1998 NA
## 1461 12 31 1998 NA
## 1471 1 10 1999 NA
## 2726 6 18 2002 NA
## 2727 6 19 2002 NA
## 2728 6 20 2002 NA
## 2729 6 21 2002 NA
## 2807 9 7 2002 NA
## 2982 3 1 2003 NA
## 4623 8 28 2007 NA
## 5016 9 24 2008 NA
## 5213 4 9 2009 NA
weathered <- na.omit(weather)
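The recoding step can also be written with dplyr's `na_if()`, and `colSums(is.na(...))` shows which column the missing values sit in. A minimal sketch on a small made-up data frame (`toy` and its values are hypothetical stand-ins for the weather data):

```r
library(dplyr)

# Hypothetical stand-in for the weather data; -99 marks a missing temperature
toy <- data.frame(
  month = c(12, 12, 12, 1),
  day   = c(23, 24, 25, 10),
  year  = c(1998, 1998, 1998, 1999),
  temp  = c(35.4, -99, -99, 28.0)
)

# Replace the -99 sentinel with NA in the temp column only
toy <- toy %>% mutate(temp = na_if(temp, -99))

# Count missing values per column
colSums(is.na(toy))

# Drop the rows whose temperature is missing
toy_complete <- toy %>% filter(!is.na(temp))
nrow(toy_complete)
```

Restricting the replacement to `temp` avoids accidentally touching another column that might legitimately contain the sentinel value.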
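Some summary statistics. The date fields are converted to factors first so that summary() reports counts for them rather than numeric quartiles.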
weathered$month <- factor(weathered$month)
weathered$day <- factor(weathered$day)
weathered$year <- factor(weathered$year)
summary(weathered)
## month day year temp
## 5 : 682 2 : 262 1996 : 366 Min. :-2.20
## 7 : 682 3 : 262 2000 : 366 1st Qu.:40.20
## 1 : 681 4 : 262 2004 : 366 Median :57.10
## 3 : 681 5 : 262 2012 : 366 Mean :54.73
## 8 : 681 6 : 262 1995 : 365 3rd Qu.:70.70
## 10 : 670 8 : 262 1997 : 365 Max. :89.20
## (Other):3872 (Other):6377 (Other):5755
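The overall summary does not break the temperatures down by month. One way to get per-month figures is `group_by()` with `summarise()`; a minimal sketch on made-up values (`toy` and the column names `n_days`, `mean_temp`, `min_temp`, and `max_temp` are hypothetical):

```r
library(dplyr)

# Made-up stand-in: two days each for January (1) and July (7)
toy <- data.frame(
  month = factor(c(1, 1, 7, 7)),
  temp  = c(30, 34, 75, 79)
)

# One row per month with the day count and basic temperature statistics
monthly <- toy %>%
  group_by(month) %>%
  summarise(
    n_days    = n(),
    mean_temp = mean(temp),
    min_temp  = min(temp),
    max_temp  = max(temp)
  )

monthly
```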
Here we show, for each month across all years, how many days recorded each temperature value.
ggplot(data = weathered) +
  geom_bar(mapping = aes(x = temp), show.legend = FALSE) +
  facet_wrap(~ month)
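Because `temp` is continuous, `geom_bar()` draws one bar per distinct temperature value. Binning with `geom_histogram()` is a common alternative. A sketch on simulated data, where the 5-degree bin width and the simulated values are assumptions chosen only for illustration:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in: 200 made-up temperatures for each of two months
sim <- data.frame(
  month = factor(rep(c(1, 7), each = 200)),
  temp  = c(rnorm(200, mean = 30, sd = 10), rnorm(200, mean = 76, sd = 5))
)

# One histogram panel per month, with 5-degree bins
p_hist <- ggplot(data = sim) +
  geom_histogram(mapping = aes(x = temp), binwidth = 5) +
  facet_wrap(~ month)

p_hist
```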
This plots the daily average temperature against year, with the points colored by month.
ggplot(data = weathered) +
  geom_point(mapping = aes(x = year, y = temp, color = month))
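Here I plotted month against year, with the points colored by temperature. Every day in a given month and year lands on the same point, so the daily points overlap, and the colors follow ggplot2's default sequential gradient rather than a diverging scheme.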
ggplot(data = weathered) +
  geom_point(mapping = aes(x = year, y = month, color = temp))
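ggplot2 maps a continuous `color` aesthetic to a sequential gradient unless told otherwise. A diverging scale can be requested explicitly with `scale_color_gradient2()`. A minimal sketch on made-up data, where the midpoint of 55 (close to the overall mean reported in the summary) and the blue/white/red colors are assumptions, not choices from the original analysis:

```r
library(ggplot2)

# Made-up stand-in with the same column names as the weather data:
# one smooth seasonal value per month and year
toy <- expand.grid(month = factor(1:12), year = factor(1995:1997))
toy$temp <- 55 - 25 * cos(2 * pi * (as.integer(toy$month) - 1) / 12)

# scale_color_gradient2() diverges around `midpoint`;
# 55 is an assumed midpoint and the three colors are arbitrary
p_div <- ggplot(data = toy) +
  geom_point(mapping = aes(x = year, y = month, color = temp), size = 3) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 55)

p_div
```

With a diverging scale, values below the midpoint shade toward one color and values above it toward the other, which makes cold and warm months easier to tell apart.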