Introduction to Statistical Computing with R

Based on notes by Paul Thibodeau (2009) and revisions by the Psych 252 instructors in 2010 and 2011

Expanded in 2012 by Mike Frank, Benoit Monin and Ewart Thomas

Converted to R Markdown format and further expanded in 2013 by Michael Waskom.

2013 TAs: Stephanie Gagnon, Lauren Howe, Michael Waskom, Alyssa Fu, Kevin Mickey, Eric Miller

Adapted in summer 2018 for short RA tutorial by Camilla Griffiths & Juan Ospina.

Adapted, revised, and added upon by Chayce Baldwin fall 2018 for RSC R workshops

Adapted, revised, and modified by Benjamin Lira fall 2020 for 399 R workshop

If you haven't already installed R, it is available here. Then, download RStudio.

1 Brief notes about learning R

R is a programming language that is specifically designed for statistical computation. It has many appealing features: it is powerful, flexible, and widely used in the statistical community. The aspects that make R so powerful and flexible, however, contribute to a learning curve that is steeper than what you might find in point-and-click packages like SPSS. The aim of this tutorial is to provide a general introduction to interacting with R that will reduce the feelings of frustration and helplessness that can emerge early in one's relationship with it. Although this tutorial doesn't assume any preexisting knowledge, if you've had no experience with computer programming there may be some parts of it that are confusing or lack a particularly deep meaning. Try your best to understand them now, but you will also likely benefit from returning to the tutorial periodically as you become more comfortable.

The most important skill to cultivate up front is the ability to help yourself when you are stuck. Fortunately, R is pretty helpful in this regard. R is what's known as an interpreted language, which basically means that there is a console you type commands into and get immediate feedback on them. Do this liberally as you work through the tutorial and the early class exercises, making small modifications to the examples we provide until you feel like you really understand what's going on.

The console is also your gateway to the built-in help functionality. Almost all R functions (more later on what those are) have help files built in that will provide you with useful information about what those functions do and how to use them. You find this by typing help(function) (or ?function), where I am using "function" as a stand-in for the name you actually want to know about. It's important to read these files closely the first time you encounter a function, but it's also (possibly more) important to refer back to them frequently. I read once that a prominent distinction between an experienced programmer and a novice is the longer latency for the novice to look up the help for something confusing (but the direction of causality is not clear!). If you have a sense for what you want to do, but don't know or can't remember the exact function that will do it, you can search through the help files for a term with two question marks (e.g. ??regression).
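As a small illustration of the help tools described above, the lines below can be typed into the console one at a time. The functions mean(), sd(), and median() are used only as example topics; any function name would work in their place.

```r
# Open the help page for a function whose name you know
help(mean)
?sd

# Show a function's arguments without opening the full help page
args(sd)

# List objects whose names contain a term, for when you only
# remember part of a function name
matches <- apropos("median")
matches

# Run the examples found at the bottom of a help page
example(mean)
```

Running the examples from a help page is often the fastest way to see what a function expects and what it returns.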

Of course, the internet is also a useful resource. Because of its name, it can be a little annoying to google for help with R. A useful resource, in my opinion, is the Stack Overflow website. Because this is a general-purpose resource for programming help, it will be useful to use the R tag ([R]) in your queries. A related resource is the statistics Stack Exchange, which is like Stack Overflow but focused more on the underlying statistical issues.

Finally, a note on errors. Novice and expert programmers alike will frequently run into errors in their R code. When this happens, processing will halt and an error message will be printed to the console. This is usually more frustrating for the novice, as the error often occurs deep within some function and the message bears no direct correspondence to what they were trying to do. A common beginner mistake is to conclude that the error message is gibberish and resign oneself to woe and dismay. It's important to resist this urge; even if the error message is not immediately informative, it is intended to precisely convey some piece of information, and usually understanding what this information is will be the key to solving the problem.
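To make this concrete, here is a minimal sketch of reading an error message rather than dismissing it. It deliberately passes text to log(), which expects a number. The use of tryCatch() is only a convenience so that the script keeps running and the message can be inspected; at the console you would simply see the error printed.

```r
# Taking the log of a character string is an error, because log() expects a number
result <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)  # keep the message text instead of halting
)

# The message describes what went wrong with the argument
result

# Supplying a number, as the message suggests, resolves the problem
fixed <- log(10)
fixed
```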

1.1 Getting started: Using RStudio and R Markdowns

How are R and RStudio different? RStudio is a more user-friendly interface that sits atop the base program R. R, on its own, is only a console where you can type and run code. RStudio has this Console pane, in addition to an Environment pane where you can see all the variables and datasets you've saved in your R session, a Viewer pane where you can access your files, view your plots, access your packages, and get help, and a Source pane where you can create R Markdown documents.

What is an R Markdown document? This is the main kind of document that I use in RStudio, and it's the primary advantage of RStudio over the base R console. R Markdown allows you to create a file with a mix of R code and regular text, which is useful if you want to have explanations of your code alongside the code itself. This document, for example, is an R Markdown document. It is also useful because you can export your R Markdown file to an html page or a pdf, which comes in handy when you want to share your code or a report of your analyses with someone who doesn't have R. If you're interested in learning more about the functionality of R Markdown, you can visit this webpage. Also, out of the goodness of their hearts, the team at RStudio made a cheat sheet for R Markdown. Check it out. Not only that, they have made cheat sheets for a lot of other things as well... We'll get into those later.

2 Basic interaction with the R console

#install.packages(c("psych", "doBy", "tidyverse")) #Package names must be combined with c(). Once installed, comment out

At its least useful, you can treat R like a calculator for basic computations. Just type some mathematical expression into your code chunk, and the result will be displayed in the console. A chunk starts with ```{r} and ends with ```; this is where you will write your code. A new chunk can be created by pressing COMMAND + ALT + I on Mac, or CONTROL + ALT + I on PC.

You can run lines of code from your script by highlighting them, and pressing COMMAND + ENTER on Mac, or CONTROL + ENTER on PC. If you're in an R Markdown document and want to run a whole chunk of code, you can press COMMAND + ALT + C on Mac, or CONTROL + ALT + C on PC.

1 + 2
## [1] 3
13 / 2
## [1] 6.5
2 ^ 6
## [1] 64
5 * (2 + 3)
## [1] 25

PRACTICE: Use the chunk below to practice using R as a calculator.

# Write a mathematical expression with at least three numbers

# Write a mathematical expression with different operations

2.1 Variable Assignment

Of course, R is a programming language, so it is much more powerful than a basic calculator. A major aspect of computing with R involves the assignment of values to variables. There are two (almost) equivalent ways to do this:

x = 4
x <- 4

In both cases, x will represent 4 for all lines of code below these here, unless you reassign x.

x
## [1] 4
x + 2
## [1] 6
x = 8
x
## [1] 8

It is important not to confuse variable assignment with a statement about equality. In your head, you should say set x to 4 or x gets 4, but not x is equal to 4. Don't worry now about the subtle differences between the two assignment styles. Although using = is more consistent with the norm in other programming languages, I prefer <- as it makes the action that is being performed more obvious. Whichever you choose, it's best to be consistent throughout your code.

In case you're wondering, you test for equality with two equal signs (==), which does something completely different:

2 == 2
## [1] TRUE
2 == 3
## [1] FALSE

It's fine to use variable names like x for simple math examples like the ones above. But, when writing code to perform analysis, you should be careful to use descriptive names. Code where things are named subject_id, condition, and rt will be a bit more verbose than if you had used x, y, and z, but it will also make much more sense when you read it again 4 months later as you write up the paper.

With that said, there are a few rules for variable names. You can use any alphanumeric character, although the name must start with a letter (or with a . that is not followed by a digit). You can't use spaces, because the computer doesn't know that you're trying to write a phrase and interprets that as two (or more) separate terms. When you want something like a phrase, the _ and . characters can be employed (this can be a bit confusing as . is usually meaningful in programming languages, but not in R).
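The naming rules can be tried out directly. The variable names below are hypothetical and chosen only for illustration. The invalid assignments are left commented out because they would stop the script with an error.

```r
# Valid names: start with a letter, then letters, digits, _ or .
subject_id <- 101
reaction.time <- 0.532
trial2 <- "practice"

# Invalid names (commented out because they would produce errors):
# 2nd_trial <- 1      # starts with a digit
# subject id <- 101   # contains a space

# make.names() shows how R itself would repair such names
repaired <- make.names(c("2nd_trial", "subject id"))
repaired
```

You will sometimes see repaired names like these after importing a data file whose column headers contain spaces or start with digits.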

Here's a simple example that novice coders often find confusing. Walk yourself through the code and make sure you understand what operations lead to the final return value:

a <-  10
b <-  20
a <-  b
print(a)
## [1] 20

That's right, an object will only contain its most recent assignment. Even though a was originally assigned the value 10, it was then reassigned the value of b.
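A related point that often trips people up: assignment copies the value at that moment, so later changes to b do not flow back into a. The short sketch below extends the example above by one line.

```r
a <- 10
b <- 20
a <- b    # a now holds a copy of b's current value, 20
b <- 30   # reassigning b afterwards does not change a

a
b
```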

So far we've only been dealing with numbers, but there are other data types as well.

For instance, we could assign character (aka string) values, with quotation marks:

group_size <- "average"
group_size
## [1] "average"
rsc_name <- "Research Support Center"
rsc_phone <- "8014225114" 

Here, even though we used letters for one object and digits for the other, R recognizes both objects as character values. Using quotation marks makes all values into characters.

Do you see how rsc_phone changes when we don't use quotation marks?

rsc_phone <- 8014225114
rsc_phone
## [1] 8014225114

Without quotation marks, R recognizes the digits as a number that can be operated on--it can be added, subtracted, etc.
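If you are ever unsure how R is treating a value, class() will tell you, and as.numeric() and as.character() convert between the two. The variable names in this sketch are hypothetical stand-ins for rsc_phone.

```r
phone_text <- "8014225114"
phone_number <- 8014225114

class(phone_text)    # "character"
class(phone_number)  # "numeric"

# phone_text + 1 would fail: arithmetic is not defined for character values.
# Convert explicitly when a number has been stored as text:
as.numeric(phone_text) + 1

# ...or turn a number into text:
as.character(phone_number)
```

For something like a phone number, which is never added or averaged, storing it as a character value is usually the safer choice.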

We can also assign logical values, TRUE and FALSE:

alive <- TRUE
asleep <- FALSE

While we're at it, you can also compare numbers with >, <, !=, <=, and >=, which return TRUE or FALSE as well.

2 < 3
## [1] TRUE
3 <= 3
## [1] TRUE
3 != 4 #-> note: the sign "!=" means "not equal to"
## [1] TRUE

Once you've assigned values to variables, you can use them in calculations:

psych <- 0
sociology <- 1
econ <- 1
poli.sci <- 2


student_n = psych + sociology + econ + poli.sci 
student_n # print the result for the total number of FHSS students
## [1] 4

2.2 Using functions

Another core concept involves using functions to perform more complex operations. Functions in R work like they do in mathematics: they take one or more inputs (called arguments or parameters), perform a certain operation, and then produce one or more outputs (or return values). You call a function by writing its name followed by parentheses, with any arguments going inside the parentheses. We already saw one example of this with the print() function above. The cat() function is similar, but it converts its arguments into characters first. There are also some basic mathematical functions built into R that operate on numbers:

abs(-4)
## [1] 4
sqrt(64)
## [1] 8
log(1.75)
## [1] 0.5596158
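Many functions take more than one argument. Arguments can be matched by position or by name, and calls can be nested inside one another. The sketch below uses round() and log() from base R as examples.

```r
# Arguments can be matched by position...
round(3.14159, 2)

# ...or by name, which is usually easier to read
rounded <- round(3.14159, digits = 2)

# log() computes the natural log unless a base is supplied
log_base10 <- log(100, base = 10)

# Function calls can be nested; the innermost call runs first
nested <- sqrt(abs(-16))

rounded
log_base10
nested
```

The help page for each function (for example ?round) lists the argument names it accepts and their default values.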

A frequently-used function is c(). This takes a sequence of arguments and sticks them together into a vector, which we'll explain a little bit more about below. All you need to know now is that most of the built in functions for descriptive statistics (and there are many of these!) expect to receive a vector or something like it.

a <- c(1.5, 4, 3)
cat(a)
## 1.5 4 3
sum(a)
## [1] 8.5
mean(a)
## [1] 2.833333
sd(a) #Reports standard deviation
## [1] 1.258306

Applied example:

# Create vectors for the forecasted high temperatures for San Francisco and Palo Alto over the next week:
sf_temps <- c(83, 91, 96, 94, 89, 85, 84)
pa_temps <- c(80, 85, 95, 89, 82, 77, 81)

pa_temps
## [1] 80 85 95 89 82 77 81

Now we can do things like find the mean temperatures (mean()) and the standard deviations of the temperatures (sd()). We could also look at each vector's length(), its max() or min() values, or its median() or range().

mean_sf <- mean(sf_temps) #creates a new variable called mean_sf
sd(sf_temps) #prints the sd of the sf_temps vector in the console
## [1] 5.080307
mean_pa <- mean(pa_temps)
sd(pa_temps)
## [1] 6.12178
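The other summary functions mentioned above work the same way. So as not to give away the exercise that follows, this sketch uses a made-up vector of quiz scores rather than the temperature data.

```r
# Hypothetical data: quiz scores for five students
quiz_scores <- c(7, 9, 4, 10, 8)

length(quiz_scores)   # 5
max(quiz_scores)      # 10
min(quiz_scores)      # 4
median(quiz_scores)   # 8
range(quiz_scores)    # smallest and largest value: 4 10

# Arithmetic on a vector is applied to every element
quiz_percent <- quiz_scores / 10 * 100
quiz_percent
```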

APPLY YOUR KNOWLEDGE:

  1. Find the median and range for these vectors.
  2. Find the length of the vectors. What does the length refer to?
  3. On average, how much warmer will San Francisco be than Palo Alto over the next week?
#median & range

#length

#mean difference between San Francisco & Palo Alto over the next week

These built-in functions are useful for many reasons, but often we're dealing with more data than a single vector, and we want to do more with it than calculate summary statistics. Next, we'll cover how to import data into R and how to work with it once it's in here. If possible, you want to stay away from manipulating anything in your data outside of R (in Excel, for example) because it's difficult to keep track of changes that you make. In R, your code acts as a trace of every little thing you've done to your data, so you can always undo something if needed or revert back to the data's original form.

3 Importing Data

When we run a study and collect data, we'll usually end up with a .csv file that has one row per participant and one column per survey question or variable. We want to import this into R so we can do things like rename variables, calculate summary statistics for many of the survey questions (e.g. how old, on average, were the participants?), and visualize and analyze some of the trends in the data.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# write csv
write_csv(mtcars, "mtcars.csv")

#file.choose() #Use this function to specify file path of data file

# load a csv file
d <- read_csv("mtcars.csv")
## Parsed with column specification:
## cols(
##   mpg = col_double(),
##   cyl = col_double(),
##   disp = col_double(),
##   hp = col_double(),
##   drat = col_double(),
##   wt = col_double(),
##   qsec = col_double(),
##   vs = col_double(),
##   am = col_double(),
##   gear = col_double(),
##   carb = col_double()
## )
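If the tidyverse is not installed, base R offers write.csv() and read.csv(), which do a similar job. The sketch below is one way to use them; it writes to a temporary file (an assumption made here so that nothing is left behind in your working directory), and the name d_base is only illustrative.

```r
# Write the built-in mtcars data to a temporary file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

d_base <- read.csv(csv_path)

# Quick checks that the import worked
names(d_base)     # column names
head(d_base, 3)   # first three rows
```

Note that read.csv() returns a plain data frame rather than a tibble, so it prints a little differently from the output of read_csv().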

You can also import data from other programs like Stata or SPSS using the package haven:

library(haven)

haven::write_dta(mtcars, "mtcars.dta")
d2 <- haven::read_dta("mtcars.dta")

haven::write_sav(mtcars, "mtcars.sav")
d3 <- haven::read_sav("mtcars.sav")

Once you've imported your datafile, you'll want to take a look at your data set to make sure you have imported it correctly:

#look at the names of all the variables in your dataset
names(d)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
#get summary information about all the variables in your dataset (e.g. minimum, quartiles, mean, and maximum, plus the number of missing values if there are any)
summary(d)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
#look at the first 6 observations (the default; head(d, 10) would show the first 10)
head(d)
## # A tibble: 6 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
#look at the last 6 observations (the default; tail(d, 10) would show the last 10)
tail(d)
## # A tibble: 6 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
## 2  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
## 3  15.8     8 351     264  4.22  3.17  14.5     0     1     5     4
## 4  19.7     6 145     175  3.62  2.77  15.5     0     1     5     6
## 5  15       8 301     335  3.54  3.57  14.6     0     1     5     8
## 6  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2
#look at all the variable types of all the variables in your dataset
str(d)
## tibble [32 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num [1:32] 160 160 108 258 360 ...
##  $ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
##  $ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   mpg = col_double(),
##   ..   cyl = col_double(),
##   ..   disp = col_double(),
##   ..   hp = col_double(),
##   ..   drat = col_double(),
##   ..   wt = col_double(),
##   ..   qsec = col_double(),
##   ..   vs = col_double(),
##   ..   am = col_double(),
##   ..   gear = col_double(),
##   ..   carb = col_double()
##   .. )
#pull up the whole dataset in a new tab in RStudio. When knitting the markdown, you can't use this function since the markdown doesn't know how to show the dataset in the final report (the html document).
View(d)

For preloaded datasets in R, you can find out more about your data and what the variables mean by typing ?[datasetname] into the console. mtcars is a preloaded dataset, so take a look at the help page for this dataset to learn about the variables and what they mean.

3.1 Data types

There are four main data types that you'll run into in R. It's important to be familiar with them because it will help to understand and debug some of the errors you'll run into. The four data types are:

  1. Numeric (numbers, integers, doubles)
  2. Character (strings)
  3. Logical (true/false)
  4. Factor (discrete levels; e.g., categories as would be used in ANOVA, such as male and female)
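The four types listed above can each be created and checked with class(). The variable names and values here are hypothetical, chosen only to illustrate each type.

```r
age <- 24                                        # numeric
major <- "psychology"                            # character
enrolled <- TRUE                                 # logical
condition <- factor(c("control", "treatment"))   # factor

class(age)
class(major)
class(enrolled)
class(condition)

# A factor stores its distinct categories as levels
levels(condition)
```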

If you re-run the str(d) code from above, you'll notice that the output displays the data type for each variable. If you want to change the type of a variable to a factor, for example, you can use the function as.factor(). You'll see an example of this below.

str(d)
## tibble [32 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num [1:32] 160 160 108 258 360 ...
##  $ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
##  $ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   mpg = col_double(),
##   ..   cyl = col_double(),
##   ..   disp = col_double(),
##   ..   hp = col_double(),
##   ..   drat = col_double(),
##   ..   wt = col_double(),
##   ..   qsec = col_double(),
##   ..   vs = col_double(),
##   ..   am = col_double(),
##   ..   gear = col_double(),
##   ..   carb = col_double()
##   .. )
d$am <- as.factor(d$am)
str(d$am)
##  Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...

The variable am tells us whether the cars have automatic transmission or not (automatic = 0, manual = 1). Because these are really categories represented by 0 and 1, rather than meaningfully continuous numbers, we used the function as.factor() to tell R that am is a factor. You can check that the transformation was successful using str(variable_name).
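Because levels named 0 and 1 are easy to mix up, it can help to attach readable labels when creating the factor. This is a sketch of one way to do that with factor(); it works on the built-in mtcars data rather than on d so that it can be run on its own, and am_labeled is an illustrative name.

```r
# Label the two transmission codes when creating the factor
am_labeled <- factor(mtcars$am,
                     levels = c(0, 1),
                     labels = c("automatic", "manual"))

levels(am_labeled)
table(am_labeled)  # number of cars in each category
```

Tables and plots made from am_labeled will then show the labels instead of the bare codes.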

3.2 Looking at your data

Though it can be useful to look at our data as a whole, we are usually interested in looking at specific variables in our dataset. We can access those variables with this code: dataset$var_name--the dataset, a dollar sign, and then the variable name. For example, d$cyl would be understood as "the variable cyl within the dataset d". Because we may have multiple data frames in the environment at once (e.g., full datasets, subsets, etc.), it is important that we specify which data frame the variable is in so that R knows where to find it.
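Here is a brief sketch of the dollar-sign syntax in action, again using the built-in mtcars data so that it runs on its own. Double square brackets are an equivalent way of pulling out a single column.

```r
# Two equivalent ways to pull one column out of a data frame
cyl_dollar <- mtcars$cyl
cyl_bracket <- mtcars[["cyl"]]
identical(cyl_dollar, cyl_bracket)

# The result is an ordinary vector, so vector functions work on it
unique(cyl_dollar)
length(cyl_dollar)
```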

3.2.1 Checking out your variables

Now that we can access the variables, we can explore them using a number of functions. Along with the basic statistical functions we described above (e.g., mean(), sd(), range() etc.), there are some more comprehensive functions we can use to look at summary statistics. You may notice below that some of these useful functions come from the psych R package.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
d %>% 
  group_by(gear) %>% 
  summarise(Mean = mean(mpg))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##    gear  Mean
##   <dbl> <dbl>
## 1     3  16.1
## 2     4  24.5
## 3     5  21.4
  #mean mpg for cars with each possible number of gears--3, 4, or 5

d %>% 
  summarise_all(mean)
## Warning in mean.default(am): argument is not numeric or logical: returning NA
## # A tibble: 1 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438    NA  3.69  2.81
  # means for each variable in the dataset

summary(d$mpg) #basic summary statistics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
describe(d) #"describe" function from "psych" package; more in-depth summary statistics
##      vars  n   mean     sd median trimmed    mad   min    max  range  skew
## mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
## cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
## disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
## hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
## drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
## wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
## qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
## vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
## am*     9 32   1.41   0.50   1.00    1.38   0.00  1.00   2.00   1.00  0.36
## gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
## carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
##      kurtosis    se
## mpg     -0.37  1.07
## cyl     -1.76  0.32
## disp    -1.21 21.91
## hp      -0.14 12.12
## drat    -0.71  0.09
## wt      -0.02  0.17
## qsec     0.34  0.32
## vs      -2.00  0.09
## am*     -1.92  0.09
## gear    -1.07  0.13
## carb     1.26  0.29
table(d$carb, d$am) #table of frequencies
##    
##     0 1
##   1 3 4
##   2 6 4
##   3 3 0
##   4 7 3
##   6 0 1
##   8 0 1
addmargins(table(d$carb, d$am, dnn=c('# of Carburetors','Transmission'))) #adding more info to table
##                 Transmission
## # of Carburetors  0  1 Sum
##              1    3  4   7
##              2    6  4  10
##              3    3  0   3
##              4    7  3  10
##              6    0  1   1
##              8    0  1   1
##              Sum 19 13  32
# Taking a look at the distribution
hist(d$wt, col = 'orange', xlab = NA, main = 'Distribution of Weight')

boxplot(d$wt, ylab = 'Weight (1000 lbs)')

Note that for many of these functions, the only specification required within the parentheses is the variable name. Can you guess what the options col =, xlab =, main =, and ylab = mean?
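Returning to the warning that summarise_all(mean) printed above: it arises because am is now a factor, and mean() is not defined for factors, so NA is returned for that column. One way around it, sketched here in base R on a copy of the built-in mtcars data (cars_copy is an illustrative name), is to average only the numeric columns. In dplyr 1.0.0 and later, across(where(is.numeric), mean) inside summarise() serves a similar purpose.

```r
cars_copy <- mtcars
cars_copy$am <- as.factor(cars_copy$am)

# Flag which columns are numeric; the factor column am is not
is_num <- sapply(cars_copy, is.numeric)

# Average only the numeric columns, so no warning is raised
numeric_means <- colMeans(cars_copy[is_num])
round(numeric_means, 2)
```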

You can also look at values for certain groups in your data. The function below mirrors the output from above, but gives us the summary statistics separately for each group. Here, we use describeBy(), from the "psych" package. It gives us summary statistics of every variable in d for each group of am.

describeBy(d, d$am) # from the "psych" package
## 
##  Descriptive statistics by group 
## group: 0
##      vars  n   mean     sd median trimmed    mad    min    max  range  skew
## mpg     1 19  17.15   3.83  17.30   17.12   3.11  10.40  24.40  14.00  0.01
## cyl     2 19   6.95   1.54   8.00    7.06   0.00   4.00   8.00   4.00 -0.95
## disp    3 19 290.38 110.17 275.80  289.71 124.83 120.10 472.00 351.90  0.05
## hp      4 19 160.26  53.91 175.00  161.06  77.10  62.00 245.00 183.00 -0.01
## drat    5 19   3.29   0.39   3.15    3.28   0.22   2.76   3.92   1.16  0.50
## wt      6 19   3.77   0.78   3.52    3.75   0.45   2.46   5.42   2.96  0.98
## qsec    7 19  18.18   1.75  17.82   18.07   1.19  15.41  22.90   7.49  0.85
## vs      8 19   0.37   0.50   0.00    0.35   0.00   0.00   1.00   1.00  0.50
## am*     9 19   1.00   0.00   1.00    1.00   0.00   1.00   1.00   0.00   NaN
## gear   10 19   3.21   0.42   3.00    3.18   0.00   3.00   4.00   1.00  1.31
## carb   11 19   2.74   1.15   3.00    2.76   1.48   1.00   4.00   3.00 -0.14
##      kurtosis    se
## mpg     -0.80  0.88
## cyl     -0.74  0.35
## disp    -1.26 25.28
## hp      -1.21 12.37
## drat    -1.30  0.09
## wt       0.14  0.18
## qsec     0.55  0.40
## vs      -1.84  0.11
## am*       NaN  0.00
## gear    -0.29  0.10
## carb    -1.57  0.26
## ------------------------------------------------------------ 
## group: 1
##      vars  n   mean    sd median trimmed   mad   min    max  range  skew
## mpg     1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90  18.90  0.05
## cyl     2 13   5.08  1.55   4.00    4.91  0.00  4.00   8.00   4.00  0.87
## disp    3 13 143.53 87.20 120.30  131.25 58.86 71.10 351.00 279.90  1.33
## hp      4 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00 283.00  1.36
## drat    5 13   4.05  0.36   4.08    4.02  0.27  3.54   4.93   1.39  0.79
## wt      6 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57   2.06  0.21
## qsec    7 13  17.36  1.79  17.02   17.39  2.34 14.50  19.90   5.40 -0.23
## vs      8 13   0.54  0.52   1.00    0.55  0.00  0.00   1.00   1.00 -0.14
## am*     9 13   2.00  0.00   2.00    2.00  0.00  2.00   2.00   0.00   NaN
## gear   10 13   4.38  0.51   4.00    4.36  0.00  4.00   5.00   1.00  0.42
## carb   11 13   2.92  2.18   2.00    2.64  1.48  1.00   8.00   7.00  0.98
##      kurtosis    se
## mpg     -1.46  1.71
## cyl     -0.90  0.43
## disp     0.40 24.19
## hp       0.56 23.31
## drat     0.21  0.10
## wt      -1.17  0.17
## qsec    -1.42  0.50
## vs      -2.13  0.14
## am*       NaN  0.00
## gear    -1.96  0.14
## carb    -0.21  0.60
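When you only need one statistic per group, base R's tapply() and aggregate() are lighter alternatives to describeBy(). This sketch computes the mean mpg for each transmission type on the built-in mtcars data; the values should agree with the mpg rows in the describeBy() output above.

```r
# Mean mpg for each transmission type (0 = automatic, 1 = manual)
mpg_by_am <- tapply(mtcars$mpg, mtcars$am, mean)
round(mpg_by_am, 2)

# The same idea with a formula interface, returning a data frame
aggregate(mpg ~ am, data = mtcars, FUN = mean)
```

Swapping mean for another function such as sd or median gives the corresponding statistic by group.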

APPLY YOUR KNOWLEDGE:

  1. Find the mean and standard deviation of mpg (miles per gallon) and wt (weight).
  2. How many variables and observations are in the dataset?
  3. What is the maximum weight by transmission (am) and gears (gear)?

Remember: If you are not sure how to do something, try searching the help files with ??subject, replacing "subject" with a term for what you want to do.

#mean & sd of mpg and wt

#Dimensions of data set

#Maximum weight by transmission (am) and gears (gear)

4 Recap

4.1 What we learned in this workshop

Today, we learned a few things to get you started in using R:

  • How to download R and RStudio
  • How to begin using an R Markdown document
  • Basic functions in R
  • How to import data into R and view that data
  • How to look at summary statistics for individual variables