Loading Data

Albert Y. Kim
Wednesday 2016/2/16

RIP Moodle

We are not going to work off moodle anymore, rather

The Middlebury Data Science organization homepage.
Homeworks will be posted on above page.
All lecture material/instructions will be posted at github.com/Middlebury-Data-Science/Topics

Preparation for Today

Download the relevant files for today's lecture at https://github.com/Middlebury-Data-Science/Topics
Open UCBAdmissions.xlsx in Excel
Open the HW schedule in Google Docs: http://bit.ly/1oJKpVS
Install the rvest package in RStudio

Statistics Pedagogy

From the American Statistician (Cobb 2015): Mere Renovation is Too Little Too Late: We Need to Rethink our Undergraduate Curriculum from the Ground Up

Cobb advocates “minimizing prerequisites to research”:

In the Humanities, students in a first course engage with original sources. You do not just prepare your students to read Austen; they read Austen. You do not just prepare students to hear Bach; they hear Bach.

Statistics Pedagogy

In the context of this class: you need to be doing research QUICK!

Research here doesn't not mean necessary publishing in journals, but more simply looking at real data that's relevant to the student.

Today's Data

The University of California Berkeley was sued in 1973 for bias against women who had applied for admission to graduate schools. We consider the \( n=4526 \) people who applied to the 6 largest departments.

     Admit Gender Dept Freq
1 Admitted   Male    A  512
2 Rejected   Male    A  313
3 Admitted Female    A   89
4 Rejected Female    A   19
5 Admitted   Male    B  353
6 Rejected   Male    B  207

Today's Data

We also consider data on tuition and financial aid from various institutions across the United States from the Washington Post.

Loading Data

We will explore two methods for loading data:

Via a Comma Separated Values (CSV) file, from
- Excel files
- Google docs
Via the web using the rvest package in RStudio package

CSV Files

An Excel .xlsx file not only contains data, but also lots of metadata, i.e. data about data, that we don't need.
A CSV file is just data, where values for a given row are separated by commas. It is the most flexible way to share a table of data.

CSV = No fluff, just stuff.

Excel Files

To load data in an Excel spreadsheet in RStudio, we need the values to be in tidy format:

Data are in a table: i.e. perfect rectangular shpe
Columns are variables
Rows are observations, with the exception of the first row which are the variable names AKA a header row.

Excel Files

We convert the UCBAdmissions.xlsx file to CSV format to eliminate the metadata:

From Excel's menu bar -> File -> Save As… -> Format -> Windows Comma Separated (.csv)
Add .csv to the filename to Save As. i.e. your filename should be UCBAdmissions.csv
Click Continue

Excel Files

We load the CSV file into R.

From RStudio -> Environment Panel -> Import Dataset -> From Text File… -> Click on UCBAdmissions.csv
Make sure Heading is set to Yes. This indicates that the first row is a header row.
Click Import

Note in your console panel R spits out the command to do this automatically. You can copy this line into your R scripts.

Google Docs

From the Google Docs menu bar (not your browser's menu bar):

File -> Download As -> Comma-separated values -> Save the file to your computer.
Repeat the steps from the last slide to load into R.

Basic Webscraping

The other way we'll load is by basic web-scraping via the rvest package in RStudio package. Run the following code in RStudio:

library(rvest)
webpage <- "http://apps.washingtonpost.com/g/page/local/college-grants-for-the-affluent/1526/"
wp_data <- webpage %>%
  read_html() %>% 
  html_nodes("table") %>% 
  .[[1]] %>% 
  html_table()
View(wp_data)

Basic Webscraping

The rvest package works by scraping HTML code used to make webpages. To view a webpage's raw HTML code:

In Google Chrome's menu bar -> View -> Developer -> View Source
In Firefox's menu bar -> Tools -> Web Developer -> Page Source

The html_nodes() function looks for HTML tags.