Harold Nelson
March 17, 2016
x <- c(1,2,3)
y <- c(4,5,6)
x
## [1] 1 2 3
y
## [1] 4 5 6
z <- c(x,y)
z
## [1] 1 2 3 4 5 6
w <- x + z
w
## [1] 2 4 6 5 7 9
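The sum x + z works even though x has three elements and z has six, because R recycles the shorter vector until it matches the length of the longer one. The sketch below spells that recycling out with rep(); the name recycled is introduced here only for illustration.

```r
x <- c(1, 2, 3)
z <- c(x, c(4, 5, 6))

# R reuses x as 1 2 3 1 2 3 so that it lines up with the six elements of z.
w <- x + z

# Writing the recycling out by hand gives the same vector.
recycled <- rep(x, times = 2) + z
identical(w, recycled)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.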
Consider the problem of computing the standard deviation of n numbers using the formula for the population standard deviation.
\[\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{n}}\]
Let’s use the set of integers from 1 to 10 as an example. Here’s the code in R.
x <- 1:10
mux <- mean(x)
Devs <- x - mux
DevsSq <- Devs^2
MeanDevsSq <- mean(DevsSq)
RMSE <- sqrt(MeanDevsSq)
RMSE # Display Result
## [1] 2.872281
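Note that R's built-in sd() divides by n - 1 (the sample standard deviation), so it does not return the population value computed above. As a quick check, the two can be reconciled by rescaling; the names pop_sd, sample_sd and rescaled are made up for this sketch.

```r
x <- 1:10
n <- length(x)

# Population standard deviation: divide the sum of squared deviations by n.
pop_sd <- sqrt(mean((x - mean(x))^2))

# sd() divides by n - 1, so it is slightly larger.
sample_sd <- sd(x)

# Rescaling by sqrt((n - 1) / n) converts the sample value to the population value.
rescaled <- sample_sd * sqrt((n - 1) / n)
all.equal(pop_sd, rescaled)
```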
Exercise: Do this in Excel.
Exercise: Do the “guts” of this in your favorite language.
x <- 1:10
mux <- mean(x)
SumxDevsSQ <- 0
for (i in seq_along(x)) {
  # Accumulate the squared deviation of each element from the mean
  SumxDevsSQ <- SumxDevsSQ + (x[i] - mux)^2
}
RMSE <- sqrt(SumxDevsSQ/length(x))
RMSE
## [1] 2.872281
Beginners in R often reach for familiar looping constructs like the one above instead of R's vectorized functions, which are usually shorter and easier to read.
x <- 1:10
# How many values in x are greater than 6?
sum(x > 6)
## [1] 4
Why does it work?
# Create gt6 as a vector
gt6 <- x > 6
gt6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
# Sum it
sum(gt6)
## [1] 4
# R treated the logical values as numeric. TRUE = 1 and FALSE = 0
What proportion of values in x are greater than 6?
mean(x>6)
## [1] 0.4
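The same logical vector supports a few other common idioms. The sketch below uses base R functions only; the names positions and values are made up for this example.

```r
x <- 1:10

# Positions of the elements that satisfy the condition.
positions <- which(x > 6)

# The elements themselves, selected by logical indexing.
values <- x[x > 6]

# Is the condition true for at least one element? For every element?
any(x > 9)
all(x > 0)
```

Because x holds the integers 1 to 10, the positions and the values happen to coincide here; for other data they generally differ.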
R starts as just “base R.” You can extend it with thousands of packages. Review the process for getting packages: install a package once with install.packages(), then load it in each session with library().
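A minimal sketch of that workflow follows. The package name ggplot2 is only an example chosen for illustration, and the install line is commented out because it needs an internet connection; the datasets package is used for the loading step because it ships with R.

```r
# Step 1 (once per machine): download and install a package from CRAN.
# install.packages("ggplot2")

# Step 2 (every session): attach an installed package with library().
# datasets ships with R, so this line runs without installing anything.
library(datasets)

# Check whether a package is installed without attaching it.
has_datasets <- requireNamespace("datasets", quietly = TRUE)
has_datasets
```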
Most of the data we will be working with will be in dataframes, the most important data structure. A dataframe is a rectangular array of data. Its columns are vectors and can be of different types.
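To make the definition concrete, here is a small dataframe built by hand. The name students and its contents are invented for this sketch; the point is only that each column is a vector and that the columns have different types.

```r
# Three columns of equal length: character, numeric and logical.
students <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  score  = c(90.5, 82, 77.25),
  passed = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE  # keep name as character in older versions of R
)

# One line per column, showing its type and first values.
str(students)

# The class of each column.
sapply(students, class)
```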
There is a dataframe called mtcars in the datasets package.
Examine it with the str(), head(), tail() and summary() commands.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
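A few other base R functions answer narrower questions about a dataframe's shape and column names. This is a supplementary sketch, not part of the four commands listed above.

```r
# Number of rows and columns, in that order.
dim(mtcars)

# The same two numbers individually.
nrow(mtcars)
ncol(mtcars)

# Column (variable) names.
names(mtcars)
```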
We need to refer to vectors in a dataframe with a two-part name: the name of the dataframe, a “$”, and the name of the vector. The following commands, illustrated with mtcars$mpg, are the usual basic descriptive statistics for a quantitative variable.
# Histogram
hist(mtcars$mpg)
# Boxplot
boxplot(mtcars$mpg,horizontal=TRUE)
# Basic Numerical Statistics
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.42 19.20 20.09 22.80 33.90
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
sd(mtcars$mpg)
## [1] 6.026948
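The two-part name with $ is not the only way to pull a column out of a dataframe. The sketch below shows two equivalent base R forms; the names by_dollar, by_name and by_column are made up for this comparison.

```r
# Two-part name with $.
by_dollar <- mtcars$mpg

# Double brackets with the column name as a string.
by_name <- mtcars[["mpg"]]

# Matrix-style indexing: all rows, the column named "mpg".
by_column <- mtcars[, "mpg"]

# All three return the same numeric vector.
identical(by_dollar, by_name)
identical(by_dollar, by_column)
```

The bracket forms are handy when the column name is stored in a variable, which $ cannot use.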
For a categorical variable, we are more limited: counts, proportions, and a bar plot.
table(mtcars$am)
##
## 0 1
## 19 13
table(mtcars$am)/NROW(mtcars)
##
## 0 1
## 0.59375 0.40625
barplot(table(mtcars$am))
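As an alternative to dividing the table by the row count, base R's prop.table() computes the proportions directly. The sketch below also attaches readable labels to the 0/1 codes; per the mtcars help page, am is 0 for automatic and 1 for manual. The names am_counts, am_props and am_labeled are introduced only for this example.

```r
# Counts, then proportions that sum to 1.
am_counts <- table(mtcars$am)
am_props <- prop.table(am_counts)
am_props

# Replace the 0/1 codes with descriptive labels before tabulating.
am_labeled <- factor(mtcars$am, levels = c(0, 1),
                     labels = c("automatic", "manual"))
table(am_labeled)
```

Passing table(am_labeled) to barplot() would then label the bars with the transmission type instead of 0 and 1.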