Lecture 1 Notes

Harold Nelson

March 17, 2016

R can be a pocket calculator
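
As a minimal illustration, typing an arithmetic expression at the console prints its value. The expressions below are illustrative choices, not taken from the lecture.

```r
# Arithmetic typed at the console is evaluated immediately.
2 + 3 * 4      # multiplication binds tighter than addition: 14
(2 + 3) * 4    # parentheses change the order: 20
2^10           # exponentiation: 1024
sqrt(16)       # built-in functions work like calculator keys: 4
7 %/% 2        # integer division: 3
7 %% 2         # remainder: 1
```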

R is based on vectors

x <- c(1,2,3)
y <- c(4,5,6)
x
## [1] 1 2 3
y
## [1] 4 5 6
z <- c(x,y)
z
## [1] 1 2 3 4 5 6
## [1] 1 2 3 4 5 6
w <- x + z
w
## [1] 2 4 6 5 7 9
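
The last result may look surprising, because x has 3 elements and z has 6. R recycles the shorter vector until it matches the length of the longer one. The sketch below makes the recycling explicit; the names `recycled` and `msg` are introduced here only for illustration.

```r
x <- c(1, 2, 3)
z <- c(1, 2, 3, 4, 5, 6)

# x + z behaves as if x were first stretched to c(1, 2, 3, 1, 2, 3).
w <- x + z
recycled <- rep(x, length.out = length(z)) + z
identical(w, recycled)   # TRUE

# When the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning. Capture the warning text here.
msg <- tryCatch(
  c(1, 2, 3) + c(1, 2, 3, 4),
  warning = function(cond) conditionMessage(cond)
)
msg
```

Recycling is convenient for operations such as `x * 2`, where the length-1 vector `2` is reused for every element, but a mismatch in lengths can also hide a mistake.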

R Contrasted with Excel and a Typical Language

Consider the problem of computing the standard deviation of n numbers using the formula for the population standard deviation.

\[\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{n}}\]

Let’s use the set of integers from 1 to 10 as an example. Here’s the code in R.

x <- 1:10
mux <- mean(x)
Devs <- x - mux
DevsSq <- Devs^2
MeanDevsSq <- mean(DevsSq)
RMSE <- sqrt(MeanDevsSq)
RMSE # Display Result
## [1] 2.872281
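
Note that this is not what R's built-in sd() returns: sd() divides by n - 1 (the sample standard deviation), while the formula above divides by n. A minimal sketch of the relationship, using the illustrative names `pop_sd` and `samp_sd`:

```r
x <- 1:10
n <- length(x)

pop_sd  <- sqrt(mean((x - mean(x))^2))  # divides by n, as in the formula above
samp_sd <- sd(x)                        # divides by n - 1

pop_sd
samp_sd

# The two differ only by the factor sqrt((n - 1) / n).
all.equal(pop_sd, samp_sd * sqrt((n - 1) / n))   # TRUE
```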

Exercise: Do this in Excel.

Exercise: Do the “guts” of this in your favorite language.

R the wrong way (partially)

x <- 1:10
mux <- mean(x)

SumxDevsSQ <- 0
for (i in 1:10){
  SumxDevsSQ <- SumxDevsSQ + (x[i] - mux)^2
}

RMSE <- sqrt(SumxDevsSQ/10)
RMSE
## [1] 2.872281

Beginners in R often reach for the looping constructs they know from other languages instead of the proper vectorized functions.
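
One practical advantage of the vectorized form is that it does not hard-code the length of the data, as the loop does with 1:10 and the division by 10. A sketch, where `pop_sd` is a hypothetical helper name rather than a built-in function:

```r
# Population standard deviation for a numeric vector of any length.
pop_sd <- function(v) {
  sqrt(mean((v - mean(v))^2))
}

pop_sd(1:10)                        # same result as the step-by-step code
pop_sd(c(2, 4, 4, 4, 5, 5, 7, 9))   # mean is 5, mean squared deviation is 4
```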

Some useful vector tricks

x <- 1:10
# How many values in x are greater than 6?
sum(x > 6)
## [1] 4

Why does it work?

# Create gt6 as a vector
gt6 <- x > 6
gt6
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# Sum it
sum(gt6)
## [1] 4
# R treated the logical values as numeric. TRUE = 1 and FALSE = 0

What proportion of values in x are greater than 6?

mean(x > 6)
## [1] 0.4
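
The same logical vector can also pick out the values themselves or their positions. A short sketch using a small made-up vector `y` (not from the lecture), chosen so that values and positions differ:

```r
y <- c(3, 9, 7, 1, 8)

big_values    <- y[y > 6]       # the values that pass the test: 9 7 8
big_positions <- which(y > 6)   # where they sit in y: 2 3 5

big_values
big_positions

any(y > 6)   # is at least one value greater than 6? TRUE
all(y > 6)   # are all values greater than 6? FALSE
```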

Packages

R starts as just “base R.” You can extend it with thousands of packages. Review the process for installing packages.
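
The usual pattern is to install a package once and then load it in each session that uses it. In the sketch below, ggplot2 is only an example package name, and the install line is commented out because it needs an internet connection. The variable `has_datasets` is an illustrative name.

```r
# Install from CRAN once per machine (uncomment to run).
# install.packages("ggplot2")

# Load an installed package in every session that needs it.
# The datasets package ships with R, so it is always available.
library(datasets)

# Check whether a package is installed without attaching it.
has_datasets <- requireNamespace("datasets", quietly = TRUE)
has_datasets
```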

Dataframes

Most of the data we will be working with will be in dataframes, the most important data structure. A dataframe is a rectangular array of data. Its columns are vectors and can be of different types.
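
To make the definition concrete, here is a small dataframe built by hand from three vectors of different types. The data and the name `students` are made up for illustration.

```r
# Each argument becomes one column; all columns must have the same length.
students <- data.frame(
  name     = c("Ann", "Bob", "Cal"),   # character
  age      = c(21, 25, 19),            # numeric
  enrolled = c(TRUE, FALSE, TRUE),     # logical
  stringsAsFactors = FALSE             # keep name as character in older R versions
)

str(students)
dim(students)             # rows, then columns
sapply(students, class)   # the type of each column
```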

There is a dataframe called mtcars in the datasets package.

Examine it with the str(), head(), tail() and summary() commands.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Basic data analysis

We need to refer to vectors in a dataframe with a two-part name: the name of the dataframe, a “$”, and the name of the vector, as in mtcars$mpg. The following commands produce the usual basic descriptive statistics for a quantitative variable.

# Histogram
hist(mtcars$mpg)

# Boxplot
boxplot(mtcars$mpg, horizontal = TRUE)

# Basic Numerical Statistics
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
sd(mtcars$mpg)
## [1] 6.026948
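
The two-part name with $ is the most common way to pull a column out of a dataframe, but it is not the only one. A brief sketch of equivalent forms, with `mpg1`, `mpg2` and `mpg3` as illustrative names:

```r
mpg1 <- mtcars$mpg         # two-part name with $
mpg2 <- mtcars[["mpg"]]    # double brackets with the column name as a string
mpg3 <- mtcars[, "mpg"]    # matrix-style indexing: all rows, one column

identical(mpg1, mpg2)   # TRUE
identical(mpg1, mpg3)   # TRUE
length(mpg1)            # one value per row of mtcars: 32
```

The bracket forms are handy when the column name is stored in a variable, which $ cannot use directly.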

For a categorical variable, our options are more limited.

table(mtcars$am)
## 
##  0  1 
## 19 13
table(mtcars$am)/NROW(mtcars)
## 
##       0       1 
## 0.59375 0.40625
barplot(table(mtcars$am))
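
As an alternative to dividing by the number of rows, base R's prop.table() converts a table of counts to proportions. The sketch below also attaches readable labels; according to the mtcars help page, am is coded 0 = automatic and 1 = manual. The names `counts`, `props` and `am_f` are illustrative.

```r
counts <- table(mtcars$am)
props  <- prop.table(counts)   # same values as counts / sum(counts)
props

# Turn the 0/1 codes into a labeled factor before tabulating.
am_f <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
table(am_f)
```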