
Chapter 2 surveys the different ways data can be stored, with examples beyond .csv files.
It also gives some ideas about how to get data from websites.
Most of the time data is not collected by you!
You need to check the data file to make sure it contains no obvious errors.
You should understand the context of the data.
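A few base R calls cover a first round of checks on a data file. This is a minimal sketch that uses the built-in mtcars data as a stand-in for a file someone else collected; the specific checks (missing values, a positive range for mpg) are illustrative assumptions, not a complete audit.

```r
# Stand-in for data collected by someone else (built-in dataset)
dat <- mtcars

# Structure: column names, types, and number of rows
str(dat)

# Count missing values in each column
missing_per_column <- colSums(is.na(dat))
print(missing_per_column)

# Range check: fuel economy should be positive (illustrative rule)
mpg_range <- range(dat$mpg)
print(mpg_range)
all_positive <- all(dat$mpg > 0)
print(all_positive)
```

Which checks make sense depends on the context of the data, for example the units and the plausible limits of each variable.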
Search Engines:
Universities:
We will not use Python in this class unless time allows at the end of the semester.
Join the RStudio Community
The Google for R
Ask questions on StackExchange
Website for learning R
Data mining with R
Good blogs to follow.
Online book for doing time series analysis in R.
Look at the mtcars data.
attach(mtcars)
summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
Plot MPG against weight and add the regression line to the plot.
plot(wt, mpg)
abline(lm(mpg~wt))
title("Regression of MPG on Weight")
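The line drawn by abline() comes from a fitted model, so it can help to look at the numbers behind it. The sketch below fits the same regression but passes data = mtcars instead of relying on attach(); the object names fit, coefs, and slope are chosen here for illustration.

```r
# Fit the same model that abline() draws, but keep the fitted object
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted line
coefs <- coef(fit)
print(coefs)

# The slope on wt is negative: heavier cars get fewer miles per gallon
slope <- unname(coefs["wt"])
print(slope)
```

Passing the data frame through the data argument keeps the example independent of whatever is currently attached in the session.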
Often you can find the exact data that you need, except there's one problem. It's not all in one place or in one file. Instead it's in a bunch of HTML pages or on multiple websites.
What should you do?
Scrape the data
page 27
The author discusses the use of Python and Beautiful Soup.
He uses the Weather Underground website to collect maximum temperature data for Buffalo, NY.
I encourage you to read this section in the book. We will return to this later in the class when we introduce Python.
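The idea behind scraping can also be sketched in base R. The example below is not the book's code: the HTML fragment, its class name, and the temperature values are made up for illustration, and a real page would be laid out differently.

```r
# Hypothetical HTML fragment standing in for a downloaded page.
# On a real site you could fetch the page text with readLines() on its URL.
html <- c(
  "<table>",
  "<tr><td>Max Temperature</td><td><span class=\"temp\">26</span></td></tr>",
  "<tr><td>Min Temperature</td><td><span class=\"temp\">14</span></td></tr>",
  "</table>"
)

# Keep only the row that mentions the maximum temperature
max_row <- grep("Max Temperature", html, value = TRUE)

# Pull the digits that follow the (made-up) class="temp" marker
temp_text <- regmatches(
  max_row,
  regexpr("(?<=class=\"temp\">)[0-9]+", max_row, perl = TRUE)
)
max_temp <- as.numeric(temp_text)
print(max_temp)
```

Regular expressions are fragile against changes in page layout. A dedicated HTML parser, such as Beautiful Soup in Python or the rvest package in R, is usually the more robust choice.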
To download the code and data files for the book, go to the book's website.
Download the code for Chapter 2.
Examine
wunderdata.txt, wunderdata.xml, and wunderdata.json. These are all plain-text files.
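# Read the temperature file; it has no header row, so supply column names
wunder <- read.csv('wunder-data.txt', header=FALSE, col.names=c('Date', 'Temp'))
# Replace the Date column with a daily sequence covering 2009
wunder$Date <- seq(as.Date("2009/1/1"), as.Date("2009/12/31"), "days")
# Show the first six rows
head(wunder)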
        Date Temp
1 2009-01-01   26
2 2009-01-02   34
3 2009-01-03   27
4 2009-01-04   34
5 2009-01-05   34
6 2009-01-06   31
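Assigning a generated date sequence to a column only works when the sequence has exactly one date per row. The sketch below checks that assumption; the helper assign_dates() and the three-row toy data frame are hypothetical and are not part of the book's code.

```r
# Daily dates for 2009, built with the same seq() call used for wunder$Date
dates_2009 <- seq(as.Date("2009/1/1"), as.Date("2009/12/31"), "days")

# 2009 is not a leap year, so there should be 365 dates
n_dates <- length(dates_2009)
print(n_dates)

# Hypothetical guard: only assign the dates if the row count matches
assign_dates <- function(df, dates) {
  stopifnot(nrow(df) == length(dates))
  df$Date <- dates
  df
}

# Tiny made-up data frame to show the guard in action
toy <- data.frame(Temp = c(26, 34, 27))
toy <- assign_dates(toy, dates_2009[1:3])
print(toy)
```

If the file were missing a day, the guard would stop with an error instead of silently shifting every later date.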
Open the other file formats.
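Because these formats are plain text, one low-effort way to open them in R is to read the first few lines as raw text. The sketch below writes a made-up JSON snippet to a temporary file so that it runs on its own; the field names and values are assumptions and do not reflect the layout of the book's files.

```r
# Made-up JSON content written to a temporary file, standing in for
# one of the downloaded files
path <- tempfile(fileext = ".json")
writeLines(c("{", "  \"date\": \"2009-01-01\",", "  \"temp\": 26", "}"), path)

# Read the first few lines as plain text to see how the file is laid out
first_lines <- readLines(path, n = 3)
print(first_lines)
```

Turning JSON or XML into R data structures usually calls for an add-on package such as jsonlite or xml2, but readLines() is enough for a first look at the layout.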