Load libraries

library(stringr)
library(dplyr)
library(tidyr)
library(zoo)
library(ggplot2)
library(knitr)
library(rvest)
library(tibble)

Get the data

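We use rvest to scrape a climate table from Wikipedia. The raw output below shows that the table needs to be cleaned up: the caption is repeated as the name of every column, the real header sits in the first row, and each cell combines a Fahrenheit and a Celsius value.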

url <- "https://en.wikipedia.org/wiki/Climate_of_New_York"
url_html <- url %>% read_html()
node_list <- url_html %>% html_nodes("table")
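# Table 6 was the Central Park climate table when this was written;
# the index may shift as the Wikipedia article is edited.
raw_df <- node_list %>% .[[6]] %>% html_table(fill = TRUE)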
raw_df %>% head() %>% kable()
Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] (caption repeated as the name of all 14 columns)
Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Year
Record high °F (°C) 72(22) 78(26) 86(30) 96(36) 99(37) 101(38) 106(41) 104(40) 102(39) 94(34) 84(29) 75(24) 106(41)
Mean maximum °F (°C) 59.6(15.3) 60.7(15.9) 71.5(21.9) 83.0(28.3) 88.0(31.1) 92.3(33.5) 95.4(35.2) 93.7(34.3) 88.5(31.4) 78.8(26.0) 71.3(21.8) 62.2(16.8) 97.0(36.1)
Average high °F (°C) 38.3(3.5) 41.6(5.3) 49.7(9.8) 61.2(16.2) 70.8(21.6) 79.3(26.3) 84.1(28.9) 82.6(28.1) 75.2(24.0) 63.8(17.7) 53.8(12.1) 43.0(6.1) 62.0(16.7)
Average low °F (°C) 26.9(−2.8) 28.9(−1.7) 35.2(1.8) 44.8(7.1) 54.0(12.2) 63.6(17.6) 68.8(20.4) 67.8(19.9) 60.8(16.0) 50.0(10.0) 41.6(5.3) 32.0(0.0) 48.0(8.9)
Mean minimum °F (°C) 9.2(−12.7) 12.8(−10.7) 18.5(−7.5) 32.3(0.2) 43.5(6.4) 52.9(11.6) 60.3(15.7) 58.8(14.9) 48.6(9.2) 38.0(3.3) 27.7(−2.4) 15.6(−9.1) 7.0(−13.9)
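
Picking the table by position (`.[[6]]`) is brittle, because the index changes whenever the article gains or loses a table. As a rough sketch, one alternative is to select the table by text it contains. The inline HTML below is a made-up stand-in for the page so that the example runs offline, and matching on "Central Park" is only an assumption about what identifies the right table.

```r
library(rvest)

# Made-up two-table page standing in for the Wikipedia article (assumption).
page <- minimal_html('
  <table><tr><th>City</th><th>State</th></tr>
         <tr><td>Albany</td><td>NY</td></tr></table>
  <table><tr><th>Month</th><th>Jan</th><th>Feb</th></tr>
         <tr><td>Record high, Central Park</td><td>72</td><td>78</td></tr></table>
')

tables <- html_elements(page, "table")

# Find the tables whose text mentions the station name.
idx <- which(grepl("Central Park", html_text2(tables), fixed = TRUE))

# Fail loudly if the match is missing or ambiguous.
stopifnot(length(idx) == 1)

climate_tbl <- html_table(tables[[idx]])
climate_tbl
```

On the real page the same pattern would replace the hard-coded index, with `tables` taken from the downloaded document instead of the toy HTML.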

Tidy the data

We’ll replace the column names with the first row, drop the first and last rows, and transpose the data frame so that each month becomes a row. Then we’ll use a regular expression to remove everything inside parentheses. (We do not need the Celsius values; Fahrenheit is enough.) Finally, we’ll turn the row names into a month column and convert the remaining columns to numeric. The resulting output looks much better.

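nyc_df <- raw_df
# Promote the first row (Month, Jan, ..., Year) to column names
names(nyc_df) <- nyc_df[1,]
# Drop the last row and the old header row, then transpose so months become rows
nyc_df <- nyc_df %>% slice(-n()) %>% slice(-1) %>% t()
# Strip the parenthesized values, e.g. "72(22)" becomes "72"
nyc_df <- apply(nyc_df, 2, function(x) gsub("\\([^)]*\\)", "", x))
# Build snake_case column names from the first row; the degree sign becomes "deg_"
colnames(nyc_df) <- nyc_df[1,] %>% 
  gsub("[[:punct:]]", "deg_", .) %>% 
  trimws() %>% tolower() %>% 
  gsub(" ", "_", .)
months <- rownames(nyc_df)
# Move the month labels into a column and drop the annual "Year" row
nyc_df <- nyc_df %>% as_tibble() %>% slice(-1) %>% 
  mutate(month=months[-1]) %>% 
  filter(month!="Year") %>% 
  select(month, everything())
# Transliterate to ASCII: the Unicode minus sign has no ASCII equivalent and
# comes out as "?", which is swapped for "-" before converting to numeric
nyc_df[, 2:ncol(nyc_df)] <- nyc_df[, 2:ncol(nyc_df)] %>% unlist() %>% 
  iconv(., from = "UTF-8", to = "ASCII//TRANSLIT") %>% 
  gsub("\\?", "-", .) %>% as.numeric()
nyc_df %>% kable()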
month record_high_deg_f mean_maximum_deg_f average_high_deg_f average_low_deg_f mean_minimum_deg_f record_low_deg_f average_precipitation_inches average_snowfall_inches average_precipitation_days average_snowy_days average_relative_humidity mean_monthly_sunshine_hours percent_possible_sunshine
Jan 72 59.6 38.3 26.9 9.2 -6 3.65 7.0 10.4 4.0 61.5 162.7 54
Feb 78 60.7 41.6 28.9 12.8 -15 3.09 9.2 9.2 2.8 60.2 163.1 55
Mar 86 71.5 49.7 35.2 18.5 3 4.36 3.9 10.9 1.8 58.5 212.5 57
Apr 96 83.0 61.2 44.8 32.3 12 4.50 0.6 11.5 0.3 55.3 225.6 57
May 99 88.0 70.8 54.0 43.5 32 4.19 0.0 11.1 0.0 62.7 256.6 57
Jun 101 92.3 79.3 63.6 52.9 44 4.41 0.0 11.2 0.0 65.2 257.3 57
Jul 106 95.4 84.1 68.8 60.3 52 4.60 0.0 10.4 0.0 64.2 268.2 59
Aug 104 93.7 82.6 67.8 58.8 50 4.44 0.0 9.5 0.0 66.0 268.2 63
Sep 102 88.5 75.2 60.8 48.6 39 4.28 0.0 8.7 0.0 67.8 219.3 59
Oct 94 78.8 63.8 50.0 38.0 28 4.40 0.0 8.9 0.0 65.6 211.2 61
Nov 84 71.3 53.8 41.6 27.7 5 4.02 0.3 9.6 0.2 64.6 151.0 51
Dec 75 62.2 43.0 32.0 15.6 -13 4.00 4.8 10.6 2.3 64.1 139.0 48
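
To make the cell clean-up concrete, here is a minimal base-R sketch of the two string steps applied to a few values. The input strings are illustrative, written in the style of the scraped cells, and replacing the Unicode minus sign (U+2212) directly is a swapped-in alternative to the `iconv()` round trip used above, not the method used in this post.

```r
# Illustrative cells: Fahrenheit value followed by Celsius in parentheses.
# "\u2212" is the Unicode minus sign that Wikipedia uses for negatives.
raw_cells <- c("72(22)", "26.9(\u22122.8)", "\u22126(\u221221)")

# Step 1: drop everything inside parentheses (the Celsius values).
fahrenheit_text <- gsub("\\([^)]*\\)", "", raw_cells)

# Step 2: swap the Unicode minus for an ASCII hyphen so as.numeric() can parse it.
fahrenheit_text <- gsub("\u2212", "-", fahrenheit_text, fixed = TRUE)

fahrenheit <- as.numeric(fahrenheit_text)
fahrenheit
```

Without the second step, `as.numeric()` would return `NA` for the negative value, because the Unicode minus sign is not recognized as a sign character.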

Analysis

There is a lot of information here, so for the purposes of this exercise we will look at only a few key things.


Average high and low temperatures by month

nyc_df %>% ggplot(aes(x=month)) + scale_x_discrete(limits=month.abb) + 
  geom_point(aes(y=average_high_deg_f, group=1)) + geom_line(aes(y=average_high_deg_f, group=1, col="Average High")) + 
  geom_point(aes(y=average_low_deg_f, group=1)) + geom_line(aes(y=average_low_deg_f, group=1, col="Average Low")) + 
  labs(title="Temperature in Central Park, 1981-2010", x="Month", y="Degrees Fahrenheit", colour="") + 
  scale_colour_manual(values = c("red", "blue"))

A few notable tidbits from examining the above graph:

  • July is the hottest month - a bit hotter than August
  • January is the coldest month
  • March is colder than November, both by Average High and Low

The slope of the graph between months is interesting. Between which two months do we see the biggest change in temperature?

diff(c(nyc_df$average_high_deg_f, nyc_df$average_high_deg_f[1])); diff(c(nyc_df$average_low_deg_f, nyc_df$average_low_deg_f[1]))
##  [1]   3.3   8.1  11.5   9.6   8.5   4.8  -1.5  -7.4 -11.4 -10.0 -10.8
## [12]  -4.7
##  [1]   2.0   6.3   9.6   9.2   9.6   5.2  -1.0  -7.0 -10.8  -8.4  -9.6
## [12]  -5.1

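The numbers above are the month-to-month changes in calendar order (January to February first, December back to January last), for the average high and then the average low. The biggest increase in the average high is between March and April (11.5 degrees); for the average low, March to April ties with May to June at 9.6 degrees. The biggest decrease for both is between September and October.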
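
Reading the largest change off a printed vector is error-prone, so the following sketch labels each difference with its month pair and reports the extremes, ties included. The temperatures are copied from the tidied table, and the helper names (`monthly_change`, `largest_rise`, `largest_drop`) are hypothetical, introduced only for this illustration.

```r
# Monthly normals copied from the tidied table above (degrees Fahrenheit).
avg_high <- c(38.3, 41.6, 49.7, 61.2, 70.8, 79.3, 84.1, 82.6, 75.2, 63.8, 53.8, 43.0)
avg_low  <- c(26.9, 28.9, 35.2, 44.8, 54.0, 63.6, 68.8, 67.8, 60.8, 50.0, 41.6, 32.0)

# Change from each month to the next, wrapping December back to January.
# Values are held as whole tenths of a degree so that ties compare exactly.
monthly_change <- function(x) {
  tenths <- as.integer(round(diff(c(x, x[1])) * 10))
  names(tenths) <- paste(month.abb, c(month.abb[-1], month.abb[1]), sep = "->")
  tenths
}

# Names of every transition that attains the extreme (keeps ties).
largest_rise <- function(x) { d <- monthly_change(x); names(d)[d == max(d)] }
largest_drop <- function(x) { d <- monthly_change(x); names(d)[d == min(d)] }

largest_rise(avg_high)
largest_rise(avg_low)
largest_drop(avg_high)
largest_drop(avg_low)
```

Comparing integer tenths rather than the raw differences is a deliberate choice: two differences that both print as 9.6 are not guaranteed to be bit-identical as floating-point numbers.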


Record High and Low Temperatures

nyc_df %>% select(month, record_high_deg_f, record_low_deg_f) %>% kable()
month record_high_deg_f record_low_deg_f
Jan 72 -6
Feb 78 -15
Mar 86 3
Apr 96 12
May 99 32
Jun 101 44
Jul 106 52
Aug 104 50
Sep 102 39
Oct 94 28
Nov 84 5
Dec 75 -13
max(nyc_df$record_high_deg_f)
## [1] 106
min(nyc_df$record_low_deg_f)
## [1] -15

The record high of 106 °F (July) and the record low of -15 °F (February) are not quite as extreme as I might have thought.
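
`max()` and `min()` return the values but not the months they belong to. A small sketch using `which.max()` and `which.min()` recovers the months as well; the data frame `records` is a hypothetical stand-in rebuilt from the table above so that the example runs on its own.

```r
# Record temperatures copied from the table above (degrees Fahrenheit).
records <- data.frame(
  month = month.abb,
  record_high_deg_f = c(72, 78, 86, 96, 99, 101, 106, 104, 102, 94, 84, 75),
  record_low_deg_f  = c(-6, -15, 3, 12, 32, 44, 52, 50, 39, 28, 5, -13)
)

# which.max()/which.min() give the row position of the first extreme value.
hottest_month <- records$month[which.max(records$record_high_deg_f)]
coldest_month <- records$month[which.min(records$record_low_deg_f)]

hottest_month
coldest_month
```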


Average Precipitation

Finally, let’s take a look at average precipitation by month.

nyc_df %>% ggplot(aes(x=month, y=average_precipitation_inches, fill=month)) + 
  scale_x_discrete(limits=month.abb) +
  geom_bar(stat="identity") + 
  labs(title = "Average Precipitation in Central Park, 1981-2010", x="Month", y="Inches") + 
  theme(legend.position = "none")

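February (3.09 inches) has noticeably less precipitation than any other month, and January (3.65 inches) is also on the low side. The other ten months all fall between 4.0 and 4.6 inches, so for the most part the distribution is fairly flat.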


Summary

We used rvest to scrape an interesting table of New York climate data. We needed to tidy the table quite a bit to get it into a format suitable for analysis. We then did some simple analyses on average temperatures, record temperatures, and average precipitation.