STAT 451, Day 5

Data Scraping

Chapter 2

Introduction

Chapter 2 gives an overview of the different ways data can be stored.

There are examples beyond .csv. It also gives some ideas about how to get data from websites.

  • websites
  • scraping - R, python
  • data formats - .xlsx, .txt, .json, .xml

This is related to ETL (Extract, Transform, Load).

Data Provided by Others

Most of the time data is not collected by you!

You need to check the data file to make sure there are no obvious errors.

You should understand the context of the data:

  • where it came from
  • how it was collected
  • what it's about

page 22

Data Sources

Genres

More Genres

R and RStudio

We will not use Python in this class unless time allows at the end of the semester.

Join the RStudio Community

R and RStudio

The Google for R

Ask questions on Stack Exchange

Website for learning R

Data Mining with R

R and RStudio

An alternative to Python's BeautifulSoup is the rvest package, which is part of the tidyverse.
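As a quick illustration, here is a minimal rvest sketch. The URL is hypothetical and the page is assumed to contain an HTML table, so treat this as the shape of the workflow rather than working scraper code.

library(rvest)

page <- read_html("https://example.com/weather")  # download and parse the HTML (hypothetical URL)
tbls <- html_table(page)                          # extract every <table> as a data frame
head(tbls[[1]])                                   # inspect the first table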

R blogs

R and RStudio

Online book for doing time series analysis in R.

Slide with R Code

Look at the mtcars data.

attach(mtcars)       # make the mtcars columns available by name
summary(mtcars$mpg)  # five-number summary plus mean for miles per gallon
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 

Simple Linear Regression

Add it to the plot

# wt and mpg are available by name because mtcars was attached above
plot(wt, mpg)                         # scatterplot of mpg against weight
abline(lm(mpg ~ wt))                  # overlay the least-squares regression line
title("Regression of MPG on Weight")

[Figure: scatterplot of mpg versus wt with the fitted regression line, titled "Regression of MPG on Weight"]
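Beyond drawing the line, it is often useful to keep the fitted model object itself. A minimal sketch using the same mtcars variables:

fit <- lm(mpg ~ wt, data = mtcars)  # fit the simple linear regression
summary(fit)                        # coefficients, R-squared, residual standard error
coef(fit)                           # just the intercept and slope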

Data Scraping

Often you can find exactly the data you need, with one problem: it is not all in one place or in one file. Instead it is spread across many HTML pages or multiple websites.

What should you do?

Scrape the data

page 27
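A common pattern, sketched here with rvest and hypothetical URLs (and assuming each page holds an HTML table with the same columns), is to loop over the pages and stack the results:

library(rvest)

urls <- c("https://example.com/page1",   # hypothetical page URLs
          "https://example.com/page2")

pages <- lapply(urls, function(u) {
  html_table(read_html(u))[[1]]          # grab the first table on each page
})
all_data <- do.call(rbind, pages)        # stack the per-page tables into one data frame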

Data Scraping

The author discusses the use of Python and BeautifulSoup.

He uses the Weather Underground website to collect maximum temperature data for Buffalo, NY.

I encourage you to read this section in the book. We will return to this later in the class when we introduce Python.

Data Scraping

To download the code and data files for the book, go to the book's website.

Download the code for Chapter 2.

Examine these files:

  • wunderdata.txt
  • wunderdata.xml
  • wunderdata.json

These are all plain-text files that can be opened in a text editor.

wunderdata.txt

# read the raw text file; it has no header row, so supply column names
wunder <- read.csv('wunder-data.txt', header=FALSE, col.names=c('Date', 'Temp'))
# replace the Date column with proper Date objects for 2009
wunder$Date <- seq(as.Date("2009/1/1"), as.Date("2009/12/31"), "days")
head(wunder)
        Date Temp
1 2009-01-01   26
2 2009-01-02   34
3 2009-01-03   27
4 2009-01-04   34
5 2009-01-05   34
6 2009-01-06   31

Open the other file formats.
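A minimal sketch for the XML and JSON versions, using the file names listed above and assuming they hold the same date/temperature records; the jsonlite and xml2 packages are one common choice, not the book's own code.

library(jsonlite)
library(xml2)

wunder_json <- fromJSON("wunderdata.json")  # parse JSON (often comes back as a data frame)
str(wunder_json)

wunder_xml <- read_xml("wunderdata.xml")    # parse the XML document
xml_children(wunder_xml)                    # inspect the top-level nodes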