library(stringr)
library(dplyr)
library(tidyr)
library(zoo)
library(ggplot2)
library(knitr)
library(rvest)
library(tibble)
We use rvest to scrape a climate table from Wikipedia. As the output below shows, the raw table needs to be cleaned up before it can be analyzed.
url <- "https://en.wikipedia.org/wiki/Climate_of_New_York"
url_html <- url %>% read_html()
node_list <- url_html %>% html_nodes("table")
# The Central Park table was the sixth table on the page when this was written
raw_df <- node_list %>% .[[6]] %>% html_table(fill = TRUE)
raw_df %>% head() %>% kable()
| Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] | Climate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[g] extremes 1869–present[h] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Year |
| Record high °F (°C) | 72(22) | 78(26) | 86(30) | 96(36) | 99(37) | 101(38) | 106(41) | 104(40) | 102(39) | 94(34) | 84(29) | 75(24) | 106(41) |
| Mean maximum °F (°C) | 59.6(15.3) | 60.7(15.9) | 71.5(21.9) | 83.0(28.3) | 88.0(31.1) | 92.3(33.5) | 95.4(35.2) | 93.7(34.3) | 88.5(31.4) | 78.8(26.0) | 71.3(21.8) | 62.2(16.8) | 97.0(36.1) |
| Average high °F (°C) | 38.3(3.5) | 41.6(5.3) | 49.7(9.8) | 61.2(16.2) | 70.8(21.6) | 79.3(26.3) | 84.1(28.9) | 82.6(28.1) | 75.2(24.0) | 63.8(17.7) | 53.8(12.1) | 43.0(6.1) | 62.0(16.7) |
| Average low °F (°C) | 26.9(−2.8) | 28.9(−1.7) | 35.2(1.8) | 44.8(7.1) | 54.0(12.2) | 63.6(17.6) | 68.8(20.4) | 67.8(19.9) | 60.8(16.0) | 50.0(10.0) | 41.6(5.3) | 32.0(0.0) | 48.0(8.9) |
| Mean minimum °F (°C) | 9.2(−12.7) | 12.8(−10.7) | 18.5(−7.5) | 32.3(0.2) | 43.5(6.4) | 52.9(11.6) | 60.3(15.7) | 58.8(14.9) | 48.6(9.2) | 38.0(3.3) | 27.7(−2.4) | 15.6(−9.1) | 7.0(−13.9) |
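The `.[[6]]` index used above depends on the layout of the page at the time of writing. As a hedged alternative, a table can be picked by the text it contains rather than by position. The sketch below runs on a small inline HTML fragment made up for illustration (not the real page) and assumes rvest 1.0 or later, which provides `minimal_html()`, `html_elements()`, and `html_text2()`.

```r
library(rvest)

# Two toy tables standing in for the real page (illustrative only)
page <- minimal_html('
  <table><tr><th>Other table</th></tr><tr><td>1</td></tr></table>
  <table>
    <tr><th colspan="2">Climate data for New York (Belvedere Castle, Central Park)</th></tr>
    <tr><td>Month</td><td>Jan</td></tr>
  </table>')

tables <- html_elements(page, "table")

# Flag the tables whose text mentions the station we want
is_target <- grepl("Central Park", html_text2(tables), fixed = TRUE)

# Parse the first matching table into a data frame
target_df <- html_table(tables[is_target][[1]])
```

On the real page the same `grepl()` filter could replace the hard-coded index, provided exactly one table mentions the search text.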
We’ll replace the column names with the first row, drop the first and last rows, and transpose the data frame so that each month becomes a row. Then we’ll use a regular expression to remove everything inside parentheses. (We do not need the Celsius values; Fahrenheit is enough.) Finally we’ll turn the row names into a column and convert the remaining columns to numeric. The resulting table looks much better.
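nyc_df <- raw_df
# Use the first row (Month, Jan, ..., Year) as the column names
names(nyc_df) <- nyc_df[1,]
# Drop the last row and the header row, then transpose so that months become rows
nyc_df <- nyc_df %>% slice(-n()) %>% slice(-1) %>% t()
# Remove everything inside parentheses, such as the Celsius values
nyc_df <- apply(nyc_df, 2, function(x) gsub("\\([^)]*\\)", "", x))
# Build snake_case column names from the first row of the transposed matrix;
# the degree sign is matched by [[:punct:]] here and becomes "deg_"
colnames(nyc_df) <- nyc_df[1,] %>%
  gsub("[[:punct:]]", "deg_", .) %>%
  trimws() %>% tolower() %>%
  gsub(" ", "_", .)
months <- rownames(nyc_df)
# Turn the row names into a month column and drop the annual summary row
nyc_df <- nyc_df %>% as_tibble() %>% slice(-1) %>%
  mutate(month=months[-1]) %>%
  filter(month!="Year") %>%
  select(month, everything())
# Transliterate to ASCII (the Unicode minus sign comes out as "?"),
# restore it as "-", and convert every value to numeric
nyc_df[, 2:ncol(nyc_df)] <- nyc_df[, 2:ncol(nyc_df)] %>% unlist() %>%
  iconv(., from = "UTF-8", to = "ASCII//TRANSLIT") %>%
  gsub("\\?", "-", .) %>% as.numeric()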
nyc_df %>% kable()
| month | record_high_deg_f | mean_maximum_deg_f | average_high_deg_f | average_low_deg_f | mean_minimum_deg_f | record_low_deg_f | average_precipitation_inches | average_snowfall_inches | average_precipitation_days | average_snowy_days | average_relative_humidity | mean_monthly_sunshine_hours | percent_possible_sunshine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 72 | 59.6 | 38.3 | 26.9 | 9.2 | -6 | 3.65 | 7.0 | 10.4 | 4.0 | 61.5 | 162.7 | 54 |
| Feb | 78 | 60.7 | 41.6 | 28.9 | 12.8 | -15 | 3.09 | 9.2 | 9.2 | 2.8 | 60.2 | 163.1 | 55 |
| Mar | 86 | 71.5 | 49.7 | 35.2 | 18.5 | 3 | 4.36 | 3.9 | 10.9 | 1.8 | 58.5 | 212.5 | 57 |
| Apr | 96 | 83.0 | 61.2 | 44.8 | 32.3 | 12 | 4.50 | 0.6 | 11.5 | 0.3 | 55.3 | 225.6 | 57 |
| May | 99 | 88.0 | 70.8 | 54.0 | 43.5 | 32 | 4.19 | 0.0 | 11.1 | 0.0 | 62.7 | 256.6 | 57 |
| Jun | 101 | 92.3 | 79.3 | 63.6 | 52.9 | 44 | 4.41 | 0.0 | 11.2 | 0.0 | 65.2 | 257.3 | 57 |
| Jul | 106 | 95.4 | 84.1 | 68.8 | 60.3 | 52 | 4.60 | 0.0 | 10.4 | 0.0 | 64.2 | 268.2 | 59 |
| Aug | 104 | 93.7 | 82.6 | 67.8 | 58.8 | 50 | 4.44 | 0.0 | 9.5 | 0.0 | 66.0 | 268.2 | 63 |
| Sep | 102 | 88.5 | 75.2 | 60.8 | 48.6 | 39 | 4.28 | 0.0 | 8.7 | 0.0 | 67.8 | 219.3 | 59 |
| Oct | 94 | 78.8 | 63.8 | 50.0 | 38.0 | 28 | 4.40 | 0.0 | 8.9 | 0.0 | 65.6 | 211.2 | 61 |
| Nov | 84 | 71.3 | 53.8 | 41.6 | 27.7 | 5 | 4.02 | 0.3 | 9.6 | 0.2 | 64.6 | 151.0 | 51 |
| Dec | 75 | 62.2 | 43.0 | 32.0 | 15.6 | -13 | 4.00 | 4.8 | 10.6 | 2.3 | 64.1 | 139.0 | 48 |
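To make the cell-cleaning step concrete, here is a minimal sketch on three sample cells. The first two are copied from the raw table above; the third is a made-up cell in the same format, included only to show a leading minus sign. It assumes the minus sign in the scraped text is U+2212 and replaces it directly, which is a different route from the `iconv()` transliteration used in the code above.

```r
# "\u2212" is the Unicode minus sign; the third cell is illustrative
cells <- c("72(22)", "26.9(\u22122.8)", "\u22126(\u221221)")

# Drop everything inside parentheses (the Celsius part)
fahrenheit <- gsub("\\([^)]*\\)", "", cells)

# Swap the Unicode minus for an ASCII hyphen so as.numeric() can parse it
fahrenheit <- gsub("\u2212", "-", fahrenheit, fixed = TRUE)

values <- as.numeric(fahrenheit)
print(values)
```

Replacing the character explicitly avoids depending on how `iconv()` transliterates on a given platform, at the cost of handling only this one character.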
There is a lot of information here, so for the purposes of this exercise we will look at only a few key things.
Average High and Low Temperatures by Month
nyc_df %>% ggplot(aes(x=month)) + scale_x_discrete(limits=month.abb) +
  geom_point(aes(y=average_high_deg_f, group=1)) + geom_line(aes(y=average_high_deg_f, group=1, col="Average High")) +
  geom_point(aes(y=average_low_deg_f, group=1)) + geom_line(aes(y=average_low_deg_f, group=1, col="Average Low")) +
  labs(title="Temperature in Central Park, 1981-2010", x="Month", y="Degrees Fahrenheit", colour="") +
  scale_colour_manual(values = c("red", "blue"))
One thing stands out in the graph above: the slope between months varies quite a bit. Between which two months do we see the biggest change in temperature?
diff(c(nyc_df$average_high_deg_f, nyc_df$average_high_deg_f[1]))
diff(c(nyc_df$average_low_deg_f, nyc_df$average_low_deg_f[1]))
## [1] 3.3 8.1 11.5 9.6 8.5 4.8 -1.5 -7.4 -11.4 -10.0 -10.8
## [12] -4.7
## [1] 2.0 6.3 9.6 9.2 9.6 5.2 -1.0 -7.0 -10.8 -8.4 -9.6
## [12] -5.1
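The numbers above are the month-to-month differences in calendar order (January to February first, December to January last), for the average high and then the average low. For the average high, the biggest increase is between March and April (11.5 degrees). For the average low, March to April and May to June tie for the biggest increase (9.6 degrees each). For both series, the biggest decrease is between September and October.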
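The differences are easier to read with labels attached. The following is a minimal sketch, not part of the original analysis: `month_changes()` is a hypothetical helper, and the two vectors are copied from the cleaned table so that the block runs on its own. The differences are rounded to one decimal place so that floating-point noise does not hide ties.

```r
# Average highs and lows (degrees F), copied from the cleaned table
avg_high <- c(38.3, 41.6, 49.7, 61.2, 70.8, 79.3, 84.1, 82.6, 75.2, 63.8, 53.8, 43.0)
avg_low  <- c(26.9, 28.9, 35.2, 44.8, 54.0, 63.6, 68.8, 67.8, 60.8, 50.0, 41.6, 32.0)

# Hypothetical helper: labelled month-to-month changes, wrapping Dec back to Jan
month_changes <- function(x) {
  change <- round(diff(c(x, x[1])), 1)
  names(change) <- paste(month.abb, c(month.abb[-1], month.abb[1]), sep = " to ")
  change
}

high_change <- month_changes(avg_high)
low_change  <- month_changes(avg_low)

# Transitions with the largest increase and the largest decrease
biggest_rise_high <- names(high_change)[high_change == max(high_change)]
biggest_rise_low  <- names(low_change)[low_change == max(low_change)]
biggest_drop_high <- names(high_change)[high_change == min(high_change)]
biggest_drop_low  <- names(low_change)[low_change == min(low_change)]
```

Comparing against `max()` and `min()` rather than using `which.max()` returns every tied transition instead of only the first one.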
Record High and Low Temperatures
nyc_df %>% select(month, record_high_deg_f, record_low_deg_f) %>% kable()
| month | record_high_deg_f | record_low_deg_f |
|---|---|---|
| Jan | 72 | -6 |
| Feb | 78 | -15 |
| Mar | 86 | 3 |
| Apr | 96 | 12 |
| May | 99 | 32 |
| Jun | 101 | 44 |
| Jul | 106 | 52 |
| Aug | 104 | 50 |
| Sep | 102 | 39 |
| Oct | 94 | 28 |
| Nov | 84 | 5 |
| Dec | 75 | -13 |
max(nyc_df$record_high_deg_f)
## [1] 106
min(nyc_df$record_low_deg_f)
## [1] -15
The record high of 106 °F (July) and the record low of -15 °F (February) are not quite as extreme as I might have thought.
Average Precipitation
Finally, let’s take a look at average precipitation by month.
nyc_df %>% ggplot(aes(x=month, y=average_precipitation_inches, fill=month)) +
  scale_x_discrete(limits=month.abb) +
  geom_bar(stat="identity") +
  labs(title = "Average Precipitation in Central Park, 1981-2010", x="Month", y="Inches") +
  theme(legend.position = "none")
February is the driest month at 3.09 inches, followed by January at 3.65 inches. The remaining months all fall between 4.00 and 4.60 inches, so for the most part the distribution is fairly flat.
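The same reading can be checked numerically. This is a small sketch using the precipitation column copied from the cleaned table; the variable names are illustrative and not part of the original analysis.

```r
# Average precipitation in inches, copied from the cleaned table
precip <- c(Jan = 3.65, Feb = 3.09, Mar = 4.36, Apr = 4.50, May = 4.19, Jun = 4.41,
            Jul = 4.60, Aug = 4.44, Sep = 4.28, Oct = 4.40, Nov = 4.02, Dec = 4.00)

driest  <- names(precip)[which.min(precip)]   # month with the least precipitation
wettest <- names(precip)[which.max(precip)]   # month with the most precipitation

# Spread of the ten months other than January and February
rest_range <- range(precip[!(names(precip) %in% c("Jan", "Feb"))])

# Sum of the monthly averages
annual_total <- sum(precip)
```

Excluding the two driest months before calling `range()` shows how narrow the band is for the rest of the year.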
We used rvest to scrape an interesting table of New York climate data. We needed to tidy the table quite a bit to get it into a format suitable for analysis. We then did some simple analyses on average temperatures, record temperatures, and average precipitation.