R Tutorial

R

R is an open source software package designed for statistical analysis. RStudio is a front end for R that “makes R easier to use”. You can download it from this site: https://www.rstudio.com/.

There is a wide variety of packages that are freely available for R. Most of them can be downloaded and installed from within R with install.packages(). Once a package has been installed, you can open it with library() or requires().

Data

There are many data sets that are automatically available within R, and there are more that are available as part of a package. R can also read external files (eg .txt with read.table() and .csv with read.csv()) and download data from the internet.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Submitting Your Work

In DAT 315, you will submit all assigned work by emailing the instructor (McDevittT@etown.edu) a link to an R Markdown document that you have uploaded to https://rpubs.com/. Accounts are free.

Using R Markdown guarantees that your work is reproducible and easy to read. You are expected to write reports at a professional level.

Data Frames

The most important data structure in R is the data frame. mtcars is a built-in data frame, and you can readily see the names of each variable at the top of each column.

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Other data frames are too large to look at in their entirety, so the head and tail commands can be useful.

head(iris, 10)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

tail(iris,5)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

The str() command tells us what kinds of variables we have and summary() gives a basic statistical summary of each column.

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

A particular column can be extracted by using the $ notation and names() lists all of the variables.

mtcars$mpg

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

names(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Individual elements of a data frame can be accesssed by identifying the row and column.

mtcars[2,6]

## [1] 2.875

Rows can be accessed by leaving the column unspecified.

mtcars[2,]

##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Columns can be accessed by leaving the row unspecified or by using the $ notation.

You can create your own data frame using data.frame(). Note that variables are assigned with <- or =.

df <- data.frame(x=c(1,2,3),
                 y=c(10,11,12),
                 color=c("red","green","blue"))
df

##   x  y color
## 1 1 10   red
## 2 2 11 green
## 3 3 12  blue

Specific rows or columns can be extracted by searching for certain conditions. The following code identifies all of the 6-cylinder cars in the mtcars dataset.

mtcars6 <- mtcars[mtcars$cyl==6,]

Vectors

Vectors can be constructed with c(), :, or seq().

4:10

## [1]  4  5  6  7  8  9 10

c(seq(2,20,by=2),seq(1,20,by=2))

##  [1]  2  4  6  8 10 12 14 16 18 20  1  3  5  7  9 11 13 15 17 19

Vectors may be useful for extracting certain rows or columns from a data frame.

mtcars[c(1,3,5),]

##                    mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.32 18.61  1  1    4    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2

Exploratory Data Analysis

R has standard elementary statistical functions like the following.

mean(mtcars$mpg)

## [1] 20.09062

sd(mtcars$mpg)

## [1] 6.026948

It also makes standard elementary graphs.

hist(mtcars$mpg)

boxplot(Petal.Width~Species, data=iris)

plot(mtcars$wt, mtcars$mpg)

Random Samples

Whenever you do a calculation in R that involves randomness, you should always set the seed to ensure reproducibility. The sample() command allows you to sample either with or without replacement.

set.seed(1234)
sample(1:10,5, replace=FALSE)

## [1] 2 6 5 8 9

sample(10,5, replace=TRUE)

## [1] 7 1 3 7 6

 mtcars[sample(nrow(mtcars), 5, replace=FALSE), ]

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## AMC Javelin       15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

Missing Data

The mtcars and iris datasets have no missing values, but real data sets often have a lot of missing values. The following code shows that there are 44 NAs in the airquality dataset.

head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

sum(is.na(airquality))

## [1] 44

R functions often have optional arguments that help us deal with missing data. For example, na.rm=TRUE allows us to compute the mean ozone value.

mean(airquality$Ozone)

## [1] NA

mean(airquality$Ozone, na.rm=TRUE)

## [1] 42.12931

Random Numbers

R has different utilities for producing random numbers. Two helpful functions are rnorm() and runif().

rnorm(6, mean=100, sd=15)

## [1] 114.75106  90.66315  89.02696  92.24995  73.73900 113.20156

runif(3, min=10, max=20)

## [1] 19.14658 18.31345 10.45770

Standard Statistical Tests

The following code implements a test of $H_0: \mu=20$ versus $H_a: \mu>20$. (Greek symbols are available by typing the appropriate $\LaTeX$ code in the R Markdown file.) We see that there is no evidence that $\mu>20$ miles per gallon.

t.test(mtcars$mpg, mu=20, alternative="greater")

## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 0.08506, df = 31, p-value = 0.4664
## alternative hypothesis: true mean is greater than 20
## 95 percent confidence interval:
##  18.28418      Inf
## sample estimates:
## mean of x 
##  20.09062

A one-way ANOVA suggests that the mean iris petal widths are different for the 3 species (setosa, versicolor, and virginica). However, the standard deviation for virginica is more than twice that for setosa, so a one-way ANOVA is not appropriate here.

results <- aov(Petal.Width~Species, data=iris)
summary(results)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  80.41   40.21     960 <2e-16 ***
## Residuals   147   6.16    0.04                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

aggregate(iris$Petal.Width, by=list(iris$Species), FUN=sd)

##      Group.1         x
## 1     setosa 0.1053856
## 2 versicolor 0.1977527
## 3  virginica 0.2746501

Linear regression is simple with the lm() command.

model <- lm(mpg~wt, data=mtcars)
summary(model)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

plot(mtcars$wt, mtcars$mpg, pch=19, col="red", xlab="Weight", ylab="Miles per Gallon")
abline(model$coefficients, lty=2, col="blue")

Writing Your Own Functions

At times, it is either convenient or necessary to write your own R functions. Here is a simple function that squares a number.

square <- function(x){
  return(x*x)
}
square(6)

## [1] 36

Here is a more complicated example that does a linear fit to bivariate data and then graphs the data, the line of best fit, and the prediction bands.

lmgraph <- function(xdata, ydata){
    library(ggplot2)
    model <- lm(ydata~xdata)
    newx <- data.frame(xdata=seq(min(xdata), max(xdata), length=10))
    pred <- data.frame(predict(model, newdata=newx, interval=("prediction")))
    pred$xvals <- newx$x
    g = ggplot(pred, aes(x=xvals, y=fit))
    g = g + geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.2)
    g = g + geom_line()
    g = g + geom_point(data=data.frame(x=xdata, y=ydata), aes(x=xdata, y=ydata), size=2)
    g = g + labs(x="x", y="y")
    return(g)
}
lmgraph(mtcars$wt, mtcars$mpg)

Loops and Conditionals

R has loops and conditionals just like any computer language. Here’s a simple example. Make a vector of 10 zeros. Then write a loop that replaces the even entries with their squares.

vec <- rep(0,10)
for (k in 1:10)
{
  if (k/2==floor(k/2))
  {
    vec[k] = k*k
  }
}
vec

##  [1]   0   4   0  16   0  36   0  64   0 100

Apply

In some cases, the apply() function can help you avoid explicitly writing loops. For example, we can get the row and column means of the numeric data in df with the following commands.

apply(df[,c(1,2)],1,mean)

## [1] 5.5 6.5 7.5

apply(df[,c(1,2)],2,mean)

##  x  y 
##  2 11

Help

For more information about an R command or an R dataset like sample(), type “?sample” in the Console. But the internet is probably your best source of information and examples. One caveat…because there are a lot of R packages, when you do a Google search you will often find many different solutions to your problem that rely on different packages. You are welcome to use whatever packages you like, but I will try to minimize the number of packages that you are required to use for the sake of simplicity.