In the R Studio interface there are 4 key regions. There is the console which displays output. There is an environment which displays data objects. There is a region for both graphic plots and the help tab. Lastly, I recommend for every project that you open a new R Script that can be saved and used later.
After you have opened a clean scripting page you may begin with some of the basics of data processing and analysis.
You must import/load data into the R environment.
Once you have imported your data, you will often “tidy” it. An example of this is changing the names of columns.
Transformation of data may involve creating new variables from existing variables, trimming outliers, etc.
You will usually visualize your data for publication. These visualizations must be interpretable to people outside of your field of study.
R is a great program not just because it is open source but because you can wrangle your data AND run models on your data.
In this example we will use the mpg dataset which contains fuel economy data in 1999 and 2008 for 38 popular models of cars
IMPORT and EXPLORE DATA
#Specify data directory (main folder)
dataDir <- "E:/SEM - TA/Introduction to R"
setwd(dataDir)
#Read/load/import dataset
dat <- read.csv("mpg_dataset.csv")
#Investigate structure of data
str(dat)
## 'data.frame': 234 obs. of 12 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
View(dat)
During your data investigation stage you’ll want to know more about your variables. This will involve locating the mean or range of the variables in your dataset. We used the “function”, setwd() to set the working directory, str() to look at the structure of the data and View() to view the entire dataset. However, you will often use functions that are not preloaded into the R software. These functions come loaded into packages available online R repository.
Below, I give an example of installing the package psych() and then loading its library into R. We have done this to use the function describe().
#Explore mean, standard deviation, etc of variables
#install.packages("psych") # run this line if package does not exist
library(psych)
describe(dat)
If you would ever like to know more about any function, simply type a ? before the function into the console and the help window will appear.
?describe()
## starting httpd help server ... done
Once we have explored the data we can now move onto visualizing the data.
For our first plot we will use the “displ” variable that represents a car’s engine size, in litres. We will also use “hwy” which is a car’s fuel efficiency on the highway, in miles per gallon (mpg).
What might you conclude from this plot?
plot(dat$displ,dat$hwy) #x-axis, y-axis
For our next plot we will use the “cyl” variable that represents the car’s driving power. The more cylinders, the more driving power. We will also “hwy” which is a car’s fuel efficiency on the highway, in miles per gallon (mpg).
What might you conclude from this plot?
plot(dat$cyl,dat$hwy)
What does this histogram tell us?
hist(dat$cyl)
Now, you may decide to add more features to your plot.
We can change the type of plot we use by changing the argument “type”. It accepts the following strings and has the given effect.
“p” - points “l” - lines “b” - both points and lines “c” - empty points joined by lines “o” - overplotted points and lines “s” and “S” - stair steps “h” - histogram-like vertical lines “n” - does not produce any points or lines
Similarly, we can define the color using col. Lets look at engine displacement (engine size) and city miles per gallon.
#?plot #use to find arguments for plot function
plot(dat$displ, dat$cty,
main="Engine Size and City Mileage",
ylab="City Miles per Gallon",
xlab = "Engine Size",
type="p", #change this to see
col="green") #change this one
What if I would like to see a table of car brands? And then I would like to see entries for Nissans only?
head(dat,10) #View first 10 entries of dataset
table(dat$manufacturer)
##
## audi chevrolet dodge ford honda hyundai jeep
## 18 19 37 25 9 14 8
## land rover lincoln mercury nissan pontiac subaru toyota
## 4 3 4 13 5 14 34
## volkswagen
## 27
#to use the function filter() I must library the dplyr package
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
filter(dat, manufacturer == "nissan")
Now I would like to only see entries for Nissan and only Nissans from 2008
filter(dat, manufacturer == "nissan", year == "2008")
Sometimes you will want to make a smaller dataset from the larger dataset.
#?select() #you can explore the function this way
new.dat <- select(dat, model, class, manufacturer)
new.dat
In this next example we will use data from farmers about their crops. We will run models on this data. Below is an example of several ANOVA tests.
First, we load package libraries, for the functions we will use, into the R environment.
#install.packages(c("ggplot2", "ggpubr", "tidyverse", #"broom", "AICcmodavg"))
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(ggpubr)
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.3 v purrr 0.3.4
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%() masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(broom)
library(AICcmodavg)
Import Data
crop.data <- read.csv("crop.data.csv", header = TRUE, colClasses = c("factor", "factor","factor", "numeric"))
str(crop.data)
## 'data.frame': 96 obs. of 4 variables:
## $ density : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
## $ block : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
## $ fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ yield : num 177 178 176 178 177 ...
describe(crop.data)
Run a one-way ANOVA
one.way <- aov(yield ~ fertilizer, data = crop.data)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 6.07 3.0340 7.863 7e-04 ***
## Residuals 93 35.89 0.3859
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Run a two-way ANOVA
two.way <- aov(yield ~ fertilizer + density, data = crop.data)
summary(two.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 6.068 3.034 9.073 0.000253 ***
## density 1 5.122 5.122 15.316 0.000174 ***
## Residuals 92 30.765 0.334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lets look at an ANOVA with an Interaction Term
interaction <- aov(yield ~ fertilizer*density, data = crop.data)
summary(interaction)
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 6.068 3.034 9.001 0.000273 ***
## density 1 5.122 5.122 15.195 0.000186 ***
## fertilizer:density 2 0.428 0.214 0.635 0.532500
## Residuals 90 30.337 0.337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lets add another predictor
blocking <- aov(yield ~ fertilizer + density + block, data = crop.data)
summary(blocking)
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 6.068 3.034 9.018 0.000269 ***
## density 1 5.122 5.122 15.224 0.000184 ***
## block 2 0.486 0.243 0.723 0.488329
## Residuals 90 30.278 0.336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lets search for the best model. We will look for the lowest AIC as an indication for the best fitting model.
library(AICcmodavg)
model.set <- list(one.way, two.way, interaction, blocking)
model.names <- c("one.way", "two.way", "interaction", "blocking")
aictab(model.set, modnames = model.names)
We can plot our results into a graph. We will look at the groupwise differences. We want to find out which group means are statistically different from one another so we can add this information to the graph.
tukey.plot.aov<-aov(yield ~ fertilizer:density, data=crop.data)
tukey.plot.test<-TukeyHSD(tukey.plot.aov)
plot(tukey.plot.test, las = 1)
From this graph, we can see that the fertilizer + planting density combinations which are significantly different from one another are 3:1-1:1 (read as “fertilizer type three + planting density 1 contrasted with fertilizer type 1 + planting density type 1”), 1:2-1:1, 2:2-1:1, 3:2-1:1, and 3:2-2:1.
A Final Note
*When you close your R Studio Session, do not save the workspace. This will eventually cause extreme memory problems for your computer.