This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
Here we will develop an R program to scrape data from the Weather Underground website for Buffalo, NY in 2016. Note that 2016 was a leap year, so there will be 366 data values to scrape from 366 webpages.
First we try to read one data value into R. We will scrape the Mean Temperature on Friday, January 1, 2016. The value was 29.
The webpage we will start with is Weather Underground, Buffalo, NY January 1, 2016. Note that the date is in the web address.
We will be using the rvest R package. It is useful to read over the examples on Hadley’s github page. It is also useful to watch the RStudio videos about extracting data from the web.
To get started, we need to examine the HTML of the webpage we are trying to read, to find nearby markup that we can give to rvest to locate the value we need. To save the webpage, open it in Firefox (or another browser), then right-click on the page and click Save Page As. Browse to where you saved the file and open it in a text editor. Search for the value we are looking for, here 29. You may need to click Next a few times to find it. The value should have Mean Temperature just above it. When you find it, you will see something like:
</div>
<table id="historyTable" class="responsive airport-history-summary-table" cellspacing="0" cellpadding="0">
<thead>
<tr>
<th> </th>
<th>Actual</th>
<th>Average </th>
<th>Record </th>
</tr>
</thead>
<tbody>
<tr>
<td class="history-table-grey-header">Temperature</td>
<td colspan="3" class="history-table-grey-header"> </td>
</tr>
<tr>
<td class="indent"><span>Mean Temperature</span></td>
<td>
<span class="wx-data"><span class="wx-value">29</span><span class="wx-unit"> °F</span></span>
</td>
<td>
The start of this block of code has div, and the value is in a span line. After trying a few other things, these two words, div and span, turn out to find what we are looking for when using rvest.
Here is some code modified from the rvest website. It reads all of the values from the webpage that match div and span. It is the 13th value that we need.
mean.temp
[1] NA NA NA NA NA NA NA NA NA NA NA NA 29.00 NA NA 26.00 NA NA NA 32.00 NA NA 32.00
[24] NA NA 63.00 NA NA NA 26.00 NA NA 20.00 NA NA 0.00 NA NA NA NA NA NA NA NA NA 20.00
[47] NA NA NA NA NA NA 0.03 NA NA 0.12 NA NA 1.40 NA NA NA NA NA 1.90 NA NA 0.90 NA
[70] NA 8.10 NA NA NA NA NA NA NA NA NA 30.03 NA NA NA 15.00 NA NA NA 26.00 NA NA NA
[93] 33.00 NA NA NA 7.00 NA NA NA NA NA NA NA NA NA NA NA NA NA 30.90 NA NA 20.70 NA
[116] NA 23.00 NA NA 30.09 NA NA 9.00 NA NA 13.80 NA NA 0.00 NA NA 32.00 NA NA 23.00 NA NA 21.90
[139] NA NA 30.07 NA NA 9.00 NA NA 11.50 NA NA 19.60 NA NA 0.00 NA NA 32.00 NA NA 21.20 NA NA
[162] 23.00 NA NA 30.06 NA NA 10.00 NA NA 16.10 NA NA 0.00 NA NA 32.00 NA NA 21.60 NA NA 24.10 NA
[185] NA 30.05 NA NA 10.00 NA NA 15.00 NA NA 0.00 NA NA 30.90 NA NA 21.20 NA NA 24.10 NA NA 30.04
[208] NA NA 10.00 NA NA 12.70 NA NA 0.00 NA NA 30.90 NA NA 19.80 NA NA 23.00 NA NA 30.02 NA NA
[231] 10.00 NA NA 16.10 NA NA 30.90 NA NA 21.20 NA NA 21.90 NA NA 30.02 NA NA 8.00 NA NA 12.70 NA
[254] NA 0.00 NA NA 30.00 NA NA 20.00 NA NA 23.00 NA NA 30.02 NA NA 7.00 NA NA 12.70 NA NA 0.00
[277] NA NA 28.90 NA NA 19.20 NA NA 21.90 NA NA 29.99 NA NA 7.00 NA NA 11.50 NA NA 21.90 NA NA
[300] 0.00 NA NA 28.90 NA NA 17.30 NA NA 21.90 NA NA 29.99 NA NA 2.00 NA NA 16.10 NA NA 20.70 NA
[323] NA 0.00 NA NA 28.90 NA NA 16.90 NA NA 21.90 NA NA 29.99 NA NA 1.00 NA NA 17.30 NA NA 23.00
[346] NA NA 0.00 NA NA 28.90 NA NA 19.20 NA NA 21.90 NA NA 30.02 NA NA 3.00 NA NA 11.50 NA NA
[369] 0.00 NA NA 28.90 NA NA 17.30 NA NA 21.00 NA NA 29.99 NA NA 9.00 NA NA 16.10 NA NA 24.20 NA
[392] NA 0.00 NA NA 28.90 NA NA 19.20 NA NA 21.00 NA NA 29.99 NA NA 2.50 NA NA 11.50 NA NA 0.00
[415] NA NA 28.00 NA NA 18.10 NA NA 23.00 NA NA 29.99 NA NA 0.50 NA NA 11.50 NA NA 0.00 NA NA
[438] 28.00 NA NA 17.50 NA NA 24.10 NA NA 30.03 NA NA 0.20 NA NA 12.70 NA NA 0.00 NA NA 28.00 NA
[461] NA 18.60 NA NA 25.00 NA NA 30.00 NA NA 0.20 NA NA 10.40 NA NA 0.02 NA NA 28.00 NA NA 18.10
[484] NA NA 25.00 NA NA 30.00 NA NA 0.80 NA NA 11.50 NA NA 0.02 NA NA 28.00 NA NA 18.10 NA NA
[507] 25.00 NA NA 29.99 NA NA 2.00 NA NA 11.50 NA NA 0.03 NA NA 28.90 NA NA 19.20 NA NA 24.10 NA
[530] NA 30.02 NA NA 5.00 NA NA 11.50 NA NA 0.03 NA NA 30.00 NA NA 20.50 NA NA 21.90 NA NA 29.99
[553] NA NA 9.00 NA NA 11.50 NA NA 24.20 NA NA 0.00 NA NA 30.00 NA NA 17.90 NA NA 19.00 NA NA
[576] 30.01 NA NA 5.00 NA NA 18.40 NA NA 32.20 NA NA 0.00 NA NA 30.00 NA NA 16.90 NA NA 17.10 NA
[599] NA 29.99 NA NA 10.00 NA NA 21.90 NA NA 27.60 NA NA 0.00 NA NA 30.00 NA NA 18.70 NA NA 16.00
[622] NA NA 29.98 NA NA 10.00 NA NA 16.10 NA NA 26.50 NA NA 30.90 NA NA 19.40 NA NA 15.10 NA NA
[645] 29.98 NA NA 10.00 NA NA 17.30 NA NA 31.10 NA NA 30.00 NA NA 17.20 NA NA 16.00 NA NA 30.00 NA
[668] NA 10.00 NA NA 20.70 NA NA 31.10 NA NA 0.00 NA NA 28.00 NA NA 14.30 NA NA 17.10 NA NA 30.01
[691] NA NA 10.00 NA NA 21.90 NA NA 32.20 NA NA 0.00 NA NA 27.00 NA NA 15.20 NA NA 15.10 NA NA
[714] 30.05 NA NA 10.00 NA NA 15.00 NA NA 27.00 NA NA 15.60 NA NA 14.00 NA NA 30.06 NA NA 10.00 NA
[737] NA 13.80 NA NA 24.20 NA NA 27.00 NA NA 15.60 NA NA 15.10 NA NA 30.06 NA NA 10.00 NA NA 13.80
[760] NA NA 21.90 NA NA 28.90 NA NA 18.20 NA NA 15.10 NA NA 30.05 NA NA 10.00 NA NA 13.80 NA NA
[783] 20.70 NA NA 28.00 NA NA 15.70 NA NA 19.00 NA NA 30.06 NA NA 10.00 NA NA 17.30 NA NA 25.30 NA
[806] NA 28.00 NA NA 16.50 NA NA 19.90 NA NA 30.07 NA NA 10.00 NA NA 15.00 NA NA 20.70 NA NA 0.00
[829] NA NA 26.10 NA NA 13.60 NA NA 17.10 NA NA 30.09 NA NA 9.00 NA NA 16.10 NA NA 0.00 NA
mean.temp[13]
[1] 29
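To see why most entries are NA and why the value is picked out by position, here is a minimal sketch that runs the same div span selection on a small inline page. The `snippet` string is a hypothetical, cut-down stand-in for the saved page (wrapped in a div so the selector matches), and the names `snippet`, `page`, `span.text`, and `span.values` are illustrative only.

```r
library(rvest)

# Hypothetical, cut-down stand-in for the saved page: one table row inside a div.
snippet <- '
<div>
  <table id="historyTable">
    <tr>
      <td class="indent"><span>Mean Temperature</span></td>
      <td><span class="wx-data"><span class="wx-value">29</span><span class="wx-unit"> &deg;F</span></span></td>
    </tr>
  </table>
</div>'

page <- read_html(snippet)

# Text of every span nested anywhere inside a div, in document order.
span.text <- html_text(html_nodes(page, "div span"))

# Non-numeric text becomes NA; the coercion warning is expected, so it is silenced here.
span.values <- suppressWarnings(as.numeric(span.text))
span.values

# Position of the only numeric entry in this toy page.
which(!is.na(span.values))
```

On the real page many more spans match, which is why the vector above is so long and the Mean Temperature lands at position 13 rather than near the start. Picking a value by position is fragile: if the page layout changes, the position may change too.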
library(rvest)
# storage for the first three days of January
buffalo.data <- numeric(3)
buffalo <- read_html("https://www.wunderground.com/history/airport/KBUF/2016/01/01/DailyHistory.html")
mean.temp <- buffalo %>%
html_nodes("div span") %>%
html_text() %>%
as.numeric()
NAs introduced by coercion
#mean.temp
mean.temp[13]
[1] 29
buffalo.data[1] <- mean.temp[13]
buffalo <- read_html("https://www.wunderground.com/history/airport/KBUF/2016/01/02/DailyHistory.html")
mean.temp <- buffalo %>%
html_nodes("div span") %>%
html_text() %>%
as.numeric()
NAs introduced by coercion
#mean.temp
mean.temp[13]
[1] 30
buffalo.data[2] <- mean.temp[13]
buffalo <- read_html("https://www.wunderground.com/history/airport/KBUF/2016/01/03/DailyHistory.html")
mean.temp <- buffalo %>%
html_nodes("div span") %>%
html_text() %>%
as.numeric()
NAs introduced by coercion
#mean.temp
mean.temp[13]
[1] 28
buffalo.data[3] <- mean.temp[13]
buffalo.data
[1] 29 30 28
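The three chunks above differ only in the day in the web address, so they could be folded into one small function. The sketch below is one possible way to do that; `nth_span_value` is a hypothetical helper name, not something from rvest, and it assumes the value of interest stays at the same position on every page.

```r
library(rvest)

# Hypothetical helper (not part of rvest): the n-th "div span" value of a parsed page.
nth_span_value <- function(page, n = 13) {
  span.text <- html_text(html_nodes(page, "div span"))
  # Non-numeric text becomes NA; the coercion warning is expected and silenced.
  suppressWarnings(as.numeric(span.text))[n]
}

# Offline check on a toy page, so the sketch runs without the website.
toy <- read_html("<div><span>a</span><span>7</span><span>b</span></div>")
nth_span_value(toy, n = 2)

# Against the live pages it might be used like this (not run here):
# urls <- sprintf("https://www.wunderground.com/history/airport/KBUF/2016/01/%02d/DailyHistory.html", 1:3)
# buffalo.data <- sapply(urls, function(u) nth_span_value(read_html(u)), USE.NAMES = FALSE)
```

Keeping the parsing in one place means a change to the selector or the position only has to be made once.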
2016 was a leap year. So February had 29 days. Also, April, June, September, and November have 30 days. So we need to do a bit of checking to make sure we only collect data for the days of the year.
buffalo.data <- numeric()
M <- 12   # months in the year
B <- 31   # largest number of days in any month
for (m in 1:M){
  for (d in 1:B){
    # February 2016 had 29 days
    if (m == 2 && d > 29) break
    # April, June, September, and November have 30 days
    if ( ( m %in% c(4,6,9,11) ) && d > 30 ) break
    buffalo <- read_html(paste0("https://www.wunderground.com/history/airport/KBUF/2016/",m,"/",d,"/DailyHistory.html"))
    mean.temp <- buffalo %>%
      html_nodes("div span") %>%
      html_text() %>%
      as.numeric()
    #mean.temp
    #mean.temp[13]
    # the Mean Temperature is the 13th value on each page
    buffalo.data <- c(buffalo.data,mean.temp[13])
  }
}
buffalo.data
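As an alternative to checking month lengths by hand, the web addresses can be generated from a sequence of dates, which already knows about the leap day. This is a sketch using base R only; the names `days` and `urls` are illustrative, and the zero-padded month and day follow the format of the single-day addresses used earlier. It only builds the addresses and does not contact the website.

```r
# Every calendar day of 2016; seq() on dates includes February 29.
days <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")

# Zero-padded year/month/day path segment, e.g. "2016/01/01".
urls <- paste0(
  "https://www.wunderground.com/history/airport/KBUF/",
  format(days, "%Y/%m/%d"),
  "/DailyHistory.html"
)

length(urls)  # one address per day of the year
urls[60]      # the 60th day of a leap year is February 29
```

Looping over `urls` instead of over month and day numbers removes the need for the two `break` checks.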
Next we will create the dates for each day of 2016. R has a function seq() that works with dates. The as.Date() tells R that values are dates.
buffalo.date1 <- data.frame(seq(as.Date("2016/1/1"), as.Date("2016/12/31"), "days"))
Finally, we will put the dates and mean temperatures together into a data frame. We will use another of Hadley’s packages, dplyr.
library(dplyr)
buffalo.data1 <- as.data.frame(buffalo.data)
buffalo.final <- bind_cols(buffalo.date1, buffalo.data1)
colnames(buffalo.final) <- c("date","temp")
buffalo.final
summary(buffalo.final$temp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 37.00 51.00 51.44 69.00 84.00
We start by making a scatterplot using base R.
Next, using the qplot() function from the ggplot2 package.
And finally, using the ggplot() function.
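The plotting chunks themselves are not shown here, so the following is only a sketch of what the base R plot and the `p` object used below might look like. It assumes a data frame with `date` and `temp` columns like `buffalo.final`. So that it runs without the scrape, it uses a placeholder data frame called `demo` whose temperatures are an invented smooth seasonal curve, not the scraped values. The `qplot()` version is left out.

```r
library(ggplot2)

# Placeholder data for illustration only: NOT the scraped Buffalo temperatures.
dates <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "day")
day.of.year <- as.numeric(format(dates, "%j"))
demo <- data.frame(
  date = dates,
  temp = 50 - 25 * cos(2 * pi * (day.of.year - 20) / 366)
)

# Base R scatterplot of temperature against date.
plot(demo$date, demo$temp, xlab = "Date", ylab = "Mean temperature (deg F)")

# ggplot() version: build the plot object once, then add layers to it.
p <- ggplot(demo, aes(x = date, y = temp))
p + geom_point()
```

With the real data, `demo` would be replaced by `buffalo.final`.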
p + geom_point() + stat_smooth(method=lm, level=0.95)
p + geom_point() + stat_smooth()
p + geom_point() + stat_smooth(method=loess)
Question: What do you see in your plot? Is it warmer in the Summer months and colder in the Winter months?