R is an open source software package designed for statistical analysis. RStudio is a front end for R that “makes R easier to use”. You can download it from this site: https://www.rstudio.com/.
There is a wide variety of packages that are freely available for R. Most of them can be downloaded and installed from within R with install.packages(). Once a package has been installed, you can open it with library() or requires().
There are many data sets that are automatically available within R, and there are more that are available as part of a package. R can also read external files (eg .txt with read.table() and .csv with read.csv()) and download data from the internet.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
In DAT 315, you will submit all assigned work by emailing the instructor (McDevittT@etown.edu) a link to an R Markdown document that you have uploaded to https://rpubs.com/. Accounts are free.
Using R Markdown guarantees that your work is reproducible and easy to read. You are expected to write reports at a professional level.
The most important data structure in R is the data frame. mtcars is a built-in data frame, and you can readily see the names of each variable at the top of each column.
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Other data frames are too large to look at in their entirety, so the head and tail commands can be useful.
head(iris, 10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
tail(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
The str() command tells us what kinds of variables we have and summary() gives a basic statistical summary of each column.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
A particular column can be extracted by using the $ notation and names() lists all of the variables.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
Individual elements of a data frame can be accesssed by identifying the row and column.
mtcars[2,6]
## [1] 2.875
Rows can be accessed by leaving the column unspecified.
mtcars[2,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Columns can be accessed by leaving the row unspecified or by using the $ notation.
You can create your own data frame using data.frame(). Note that variables are assigned with <- or =.
df <- data.frame(x=c(1,2,3),
y=c(10,11,12),
color=c("red","green","blue"))
df
## x y color
## 1 1 10 red
## 2 2 11 green
## 3 3 12 blue
Specific rows or columns can be extracted by searching for certain conditions. The following code identifies all of the 6-cylinder cars in the mtcars dataset.
mtcars6 <- mtcars[mtcars$cyl==6,]
Vectors can be constructed with c(), :, or seq().
4:10
## [1] 4 5 6 7 8 9 10
c(seq(2,20,by=2),seq(1,20,by=2))
## [1] 2 4 6 8 10 12 14 16 18 20 1 3 5 7 9 11 13 15 17 19
Vectors may be useful for extracting certain rows or columns from a data frame.
mtcars[c(1,3,5),]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
R has standard elementary statistical functions like the following.
mean(mtcars$mpg)
## [1] 20.09062
sd(mtcars$mpg)
## [1] 6.026948
It also makes standard elementary graphs.
hist(mtcars$mpg)
boxplot(Petal.Width~Species, data=iris)
plot(mtcars$wt, mtcars$mpg)
Whenever you do a calculation in R that involves randomness, you should always set the seed to ensure reproducibility. The sample() command allows you to sample either with or without replacement.
set.seed(1234)
sample(1:10,5, replace=FALSE)
## [1] 2 6 5 8 9
sample(10,5, replace=TRUE)
## [1] 7 1 3 7 6
mtcars[sample(nrow(mtcars), 5, replace=FALSE), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
The mtcars and iris datasets have no missing values, but real data sets often have a lot of missing values. The following code shows that there are 44 NAs in the airquality dataset.
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
sum(is.na(airquality))
## [1] 44
R functions often have optional arguments that help us deal with missing data. For example, na.rm=TRUE allows us to compute the mean ozone value.
mean(airquality$Ozone)
## [1] NA
mean(airquality$Ozone, na.rm=TRUE)
## [1] 42.12931
R has different utilities for producing random numbers. Two helpful functions are rnorm() and runif().
rnorm(6, mean=100, sd=15)
## [1] 114.75106 90.66315 89.02696 92.24995 73.73900 113.20156
runif(3, min=10, max=20)
## [1] 19.14658 18.31345 10.45770
The following code implements a test of \(H_0: \mu=20\) versus \(H_a: \mu>20\). (Greek symbols are available by typing the appropriate \(\LaTeX\) code in the R Markdown file.) We see that there is no evidence that \(\mu>20\) miles per gallon.
t.test(mtcars$mpg, mu=20, alternative="greater")
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 0.08506, df = 31, p-value = 0.4664
## alternative hypothesis: true mean is greater than 20
## 95 percent confidence interval:
## 18.28418 Inf
## sample estimates:
## mean of x
## 20.09062
A one-way ANOVA suggests that the mean iris petal widths are different for the 3 species (setosa, versicolor, and virginica). However, the standard deviation for virginica is more than twice that for setosa, so a one-way ANOVA is not appropriate here.
results <- aov(Petal.Width~Species, data=iris)
summary(results)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 80.41 40.21 960 <2e-16 ***
## Residuals 147 6.16 0.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aggregate(iris$Petal.Width, by=list(iris$Species), FUN=sd)
## Group.1 x
## 1 setosa 0.1053856
## 2 versicolor 0.1977527
## 3 virginica 0.2746501
Linear regression is simple with the lm() command.
model <- lm(mpg~wt, data=mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
plot(mtcars$wt, mtcars$mpg, pch=19, col="red", xlab="Weight", ylab="Miles per Gallon")
abline(model$coefficients, lty=2, col="blue")
At times, it is either convenient or necessary to write your own R functions. Here is a simple function that squares a number.
square <- function(x){
return(x*x)
}
square(6)
## [1] 36
Here is a more complicated example that does a linear fit to bivariate data and then graphs the data, the line of best fit, and the prediction bands.
lmgraph <- function(xdata, ydata){
library(ggplot2)
model <- lm(ydata~xdata)
newx <- data.frame(xdata=seq(min(xdata), max(xdata), length=10))
pred <- data.frame(predict(model, newdata=newx, interval=("prediction")))
pred$xvals <- newx$x
g = ggplot(pred, aes(x=xvals, y=fit))
g = g + geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.2)
g = g + geom_line()
g = g + geom_point(data=data.frame(x=xdata, y=ydata), aes(x=xdata, y=ydata), size=2)
g = g + labs(x="x", y="y")
return(g)
}
lmgraph(mtcars$wt, mtcars$mpg)
R has loops and conditionals just like any computer language. Here’s a simple example. Make a vector of 10 zeros. Then write a loop that replaces the even entries with their squares.
vec <- rep(0,10)
for (k in 1:10)
{
if (k/2==floor(k/2))
{
vec[k] = k*k
}
}
vec
## [1] 0 4 0 16 0 36 0 64 0 100
In some cases, the apply() function can help you avoid explicitly writing loops. For example, we can get the row and column means of the numeric data in df with the following commands.
apply(df[,c(1,2)],1,mean)
## [1] 5.5 6.5 7.5
apply(df[,c(1,2)],2,mean)
## x y
## 2 11
For more information about an R command or an R dataset like sample(), type “?sample” in the Console. But the internet is probably your best source of information and examples. One caveat…because there are a lot of R packages, when you do a Google search you will often find many different solutions to your problem that rely on different packages. You are welcome to use whatever packages you like, but I will try to minimize the number of packages that you are required to use for the sake of simplicity.