Learning R, Part 1

Nathan Byers
April 21, 2014

Topics

  • R basics
  • Regression
  • Web scraping
  • Using R with other tools

R basics

What is R?

Starting Up R

We'll be opening up an R session through IUanyWARE:

  1. Go to iuanyware.iu.edu and sign in
  2. Go to the Analysis & Modeling folder
  3. Open RStudio (say no to the update request)

Doing math

  • Open up a script by going to File -> New File -> R Script
  • It will open in the top left panel in RStudio
  • Try some math
10 + 5
10 - 5
10 * 5
10 / 5
  • All the code in this presentation can be copied and pasted into your script
  • Highlight your code and click the “Run” button on the toolbar of the script panel
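
Beyond the four operations above, a few other arithmetic tools are worth trying. This is a small sketch using only base R; the numbers are arbitrary and chosen just for illustration.

```r
# Exponents use ^
10^2          # 100

# Parentheses control the order of operations
(10 + 5) * 2  # 30
10 + 5 * 2    # 20, because multiplication happens first

# %% gives the remainder after division
10 %% 3       # 1

# sqrt() is a built-in function
sqrt(25)      # 5
```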

Creating a variable

The assignment operator in R is <-

x <- 10
y <- 5
x + y
[1] 15

(the top panel shows what you should run in your script and the bottom panel shows the output)

Vectors

  • Vectors are variables with an ordered set of values
  • We use the c() function to combine values into a vector
x <- c(1, 2, 3, 4, 5)
x
[1] 1 2 3 4 5
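
Once a vector exists, most operations apply to every element at once. The lines below are a minimal sketch of a few common things to try with the same x; all functions used are part of base R.

```r
x <- c(1, 2, 3, 4, 5)

x * 2       # arithmetic is applied to every element: 2 4 6 8 10
x[2]        # square brackets pick out an element (R counts from 1): 2
length(x)   # number of elements: 5
mean(x)     # average of the values: 3
```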

Data frames

  • Data frames are spreadsheet-like tables in R
  • We use the data.frame(variable1, variable2, ...) function
price <- c(1000, 4000, 2000, 5000, 500)
carat <- c(0.4, 0.55, 0.45, 0.65, 0.2)
color <- c("G", "H", "D", "E", "G")
diamonds <- data.frame(price, carat, color)
diamonds
  price carat color
1  1000  0.40     G
2  4000  0.55     H
3  2000  0.45     D
4  5000  0.65     E
5   500  0.20     G
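
A few ways to look inside the data frame, sketched with base R only. The 0.5 cutoff in the row filter is an arbitrary value chosen for illustration.

```r
price <- c(1000, 4000, 2000, 5000, 500)
carat <- c(0.4, 0.55, 0.45, 0.65, 0.2)
color <- c("G", "H", "D", "E", "G")
diamonds <- data.frame(price, carat, color)

diamonds$price                     # one column, returned as a vector
nrow(diamonds)                     # number of rows
diamonds[diamonds$carat > 0.5, ]   # only the rows where carat is above 0.5
str(diamonds)                      # compact summary of each column's type
```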

Plotting

plot(x = carat, y = price)

(plot: scatterplot of price against carat)
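
The plot() function accepts extra arguments for labels and point style. The following is a sketch of the same scatterplot with axis labels and a title; the label text is made up for this example.

```r
price <- c(1000, 4000, 2000, 5000, 500)
carat <- c(0.4, 0.55, 0.45, 0.65, 0.2)

plot(x = carat, y = price,
     xlab = "Carat",                    # x-axis label
     ylab = "Price",                    # y-axis label
     main = "Diamond price by carat",   # plot title
     pch = 19)                          # pch = 19 draws solid circles
```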

Regression

We use the lm(y ~ x) function in R to fit a linear regression model and the summary() function to see the results.

fit <- lm(price ~ carat)
summary(fit)

Regression (output)


Call:
lm(formula = price ~ carat)

Residuals:
   1    2    3    4    5 
-967  435 -500  370  663 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -2294       1129   -2.03    0.135  
carat          10652       2378    4.48    0.021 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 806 on 3 degrees of freedom
Multiple R-squared:  0.87,  Adjusted R-squared:  0.827 
F-statistic: 20.1 on 1 and 3 DF,  p-value: 0.0207
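
Individual numbers from this output can also be pulled out directly, which is handy when they are needed in later calculations. A minimal sketch using base R functions; the 0.5-carat value is just an illustrative input.

```r
price <- c(1000, 4000, 2000, 5000, 500)
carat <- c(0.4, 0.55, 0.45, 0.65, 0.2)
fit <- lm(price ~ carat)

coef(fit)                # named vector holding (Intercept) and carat
summary(fit)$r.squared   # the "Multiple R-squared" value from the output

# The fitted line is: price = intercept + slope * carat
# Here it is evaluated by hand for a 0.5-carat diamond
coef(fit)[["(Intercept)"]] + coef(fit)[["carat"]] * 0.5
```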

Regression

Let's see what this regression looks like as a line on our plot

abline(fit)

(plot: the scatterplot with the fitted regression line added)

Regression

  • Looking at the graph, we might think a quadratic model would work better
  • This just means that we add squared values of the carat variable to our regression equation
carat2 <- carat^2
fit2 <- lm(price ~ carat + carat2)
summary(fit2)

Regression (quadratic output)


Call:
lm(formula = price ~ carat + carat2)

Residuals:
     1      2      3      4      5 
-437.4  596.0   20.2 -280.4  101.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     1169       1877    0.62     0.60
carat          -8375       9502   -0.88     0.47
carat2         22616      11120    2.03     0.18

Residual standard error: 564 on 2 degrees of freedom
Multiple R-squared:  0.958, Adjusted R-squared:  0.915 
F-statistic: 22.6 on 2 and 2 DF,  p-value: 0.0424
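
As an alternative to creating carat2 by hand, the squared term can be written inside the formula with I(). This is a sketch of that approach, not the method used in the slides; fit2_alt is a name introduced here for comparison.

```r
price <- c(1000, 4000, 2000, 5000, 500)
carat <- c(0.4, 0.55, 0.45, 0.65, 0.2)

# The approach from the slides: a separate squared variable
carat2 <- carat^2
fit2 <- lm(price ~ carat + carat2)

# Alternative: I() tells R to treat ^ as arithmetic inside the formula
fit2_alt <- lm(price ~ carat + I(carat^2))

# Both models give the same estimates (only the coefficient names differ)
coef(fit2)
coef(fit2_alt)
```

With the I() form, predict() only needs a value for carat, because R computes the squared term itself.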

Regression (quadratic plot)

  • Unfortunately, plotting the quadratic curve on the plot isn't as simple as adding the straight line with abline()
  • Here's the code, but don't worry if you don't understand what's going on
carat.values <- seq(0.2, 0.7, 0.001)
curve.price <- predict(fit2, list(carat = carat.values, carat2 = carat.values^2))
plot(x = carat, y = price)
lines(carat.values, curve.price)

Regression (quadratic plot)

(plot: the scatterplot with the fitted quadratic curve)

Regression (prediction)

  • Based on this small data set and the quadratic model, how much would we expect to spend if we wanted a half-carat diamond?
  • To find out, we use the predict() function
prediction <- predict(fit2, list(carat = 0.5, carat2 = 0.5^2))
prediction
   1 
2635 
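
To see what predict() is doing, the same number can be computed by hand from the model coefficients. A minimal sketch; by_hand and by_predict are names introduced here for illustration.

```r
price <- c(1000, 4000, 2000, 5000, 500)
carat <- c(0.4, 0.55, 0.45, 0.65, 0.2)
carat2 <- carat^2
fit2 <- lm(price ~ carat + carat2)

# price = intercept + b1 * carat + b2 * carat^2
b <- coef(fit2)
by_hand <- b[["(Intercept)"]] + b[["carat"]] * 0.5 + b[["carat2"]] * 0.5^2

by_predict <- predict(fit2, list(carat = 0.5, carat2 = 0.5^2))

# The two results agree
all.equal(by_hand, unname(by_predict))
```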

Regression (prediction plot)

  • Let's plot the prediction on the quadratic graph
points(x = 0.5, y = prediction, col = "red")

(plot: the quadratic curve with the half-carat prediction marked in red)

Web scraping

  • R is not just a language for statistics
  • It's an environment for retrieving and manipulating data
  • For example, scraping a table from a web page takes only a few lines of code

Web scraping

For this example we use the XML package

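install.packages("XML")   # install once per machine
library(XML)              # load in every new R session
url <- "http://en.wikipedia.org/wiki/Healthcare_system"
# readHTMLTable() returns a list of the tables on the page; [[1]] keeps the first
table <- readHTMLTable(url)[[1]]
View(table)               # open the table in RStudio's data viewer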
    Country Life expectancy Infant mortality
1 Australia            81.4             4.49
2    Canada            81.4             4.78
3    France            81.0             3.34
4   Germany            79.8             3.48
5     Italy            80.5             3.33

R packages

  • A quick word on what an “R package” is
  • R comes with core, or base, functionality
  • However, anyone can write functions for R and make them available to other R users in a package
  • A package must be installed once (the install.packages() function) and then loaded in each session before use (the library() function)
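
Before loading a package it can be useful to check whether it is already installed. The sketch below uses base R's requireNamespace(); is_installed() is a hypothetical helper name defined here, not a built-in function.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")   # TRUE, since stats ships with every R installation

# Print a reminder instead of failing when a package is missing
if (!is_installed("XML")) {
  message("XML is not installed yet; run install.packages(\"XML\") first")
}
```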

Using R with other tools

  • R is a high-level language, so it's easier to use but sometimes slow
  • The Rcpp package enables you to write code in C++ for speedy processing and integrate that code with R

Resources for learning R

Parts 2 and 3