myList <- list(num = 1:5, name = c('Joe', 'Bob', 'Mary'))
myList
$num
[1] 1 2 3 4 5
$name
[1] "Joe" "Bob" "Mary"
myList[[1]] # first vector in the list
[1] 1 2 3 4 5
myList[[2]][1] # first element of the second vector in the list
[1] "Joe"
You can also use the $ operator to access list elements by name
myList$num # first vector in the list
[1] 1 2 3 4 5
myList$name[1] # first element of the second vector in the list
[1] "Joe"
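The difference between single and double brackets is easy to miss, so here is a minimal sketch. It reuses the myList object from above. The comparison with single brackets is an addition that the original slides do not show.

```r
myList <- list(num = 1:5, name = c('Joe', 'Bob', 'Mary'))

# Single brackets keep the list wrapper: the result is a list of length 1
sub_list <- myList["num"]
class(sub_list)    # "list"

# Double brackets (or $) extract the element itself
num_vec <- myList[["num"]]
class(num_vec)     # "integer"

# Position, name, and $ all reach the same element
identical(myList[[1]], myList[["num"]])   # TRUE
identical(myList[["num"]], myList$num)    # TRUE
```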
Installing Packages
Packages can be installed in two ways: from the console or a script, or through the RStudio interface
Console/Script: install.packages("package_name")
To install multiple packages at once, pass a character vector: install.packages(c("package1", "package2"))
RStudio: Click the “Packages” tab in the bottom-right window pane. Then click “Install” and search for the package you want to install.
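Installing a package again every time a script runs is slow, so one common pattern is to install only what is missing. The sketch below is one way to do that. find_missing() is a hypothetical helper written for this example, and the package names are placeholders that ship with R so that nothing is actually downloaded.

```r
# Return the packages in `pkgs` that are not yet installed.
# find_missing() is a hypothetical helper, not part of base R.
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be missing here
to_install <- find_missing(c("stats", "utils"))
to_install

# Install only what is missing (skipped when the vector is empty)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Swapping in the names of the packages you need, for example c("tidyverse", "psych"), should work the same way.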
Loading Packages
Once the packages are installed we need to load them into our R session with the library() function
# Use the '#' symbol to talk to ourselves by 'commenting out' lines.
# R will not run any line with a '#' in front of it.
library(package1)
library(package2)
Notice too that you don’t need quotes around the package names anymore.
library() accepts the bare package name, although the quoted form also works
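A short sketch of the loading options, using the stats package as a stand-in because it ships with every R installation. The pkg::fun form in the last step is an addition not covered in the slides.

```r
# Both forms load the same package; "stats" ships with R
library(stats)
library("stats")

# A function can also be called without loading its package, using pkg::fun
spread <- stats::sd(c(1, 2, 3))
spread   # 1
```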
Setting Your Working Directory
Your working directory is where all your files live
R does not know where your files are unless you tell it
If you want to use any data that does not come with a package, you need to tell R where it lives
# The working directory where R will look for files
getwd()
# Set the working directory to the folder where your files are
setwd("C:/Users/YourName/Documents/YourProject")
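To see what the working directory changes in practice, here is a minimal sketch. It assumes a throwaway folder created inside R's temporary directory, so the folder and file names are hypothetical and nothing on your machine is touched permanently.

```r
old_wd <- getwd()                       # remember where we started

# Hypothetical project folder, created inside R's temporary directory
project_dir <- file.path(tempdir(), "YourProject")
dir.create(project_dir, showWarnings = FALSE)

setwd(project_dir)                      # R now looks for files here
writeLines(c("a,b", "1,2"), "example.csv")
file.exists("example.csv")              # relative paths resolve against the working directory
dat <- read.csv("example.csv")

setwd(old_wd)                           # restore the original working directory
```

Building the path with file.path() avoids hard-coding a slash style, which helps when a script moves between Windows and other systems.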
Your Turn
Install and load the packages tidyverse, psych, and palmerpenguins using the install.packages() and library() functions.
Make a vector of 50 random values with a mean and standard deviation of your choice using the rnorm() function and assign it to a variable named vec (hint: use ?rnorm to read the argument descriptions).
Get summary statistics for the vector using the describe() function from the psych package.
Use [] indexing to return the first 10 values of the vector.
Plot your data using the plot() function.
Starting a New Project
Create New Project
Follow these steps to create a new project:
Go to File in the upper left corner of RStudio
Select New Project
Select New Directory
Select New Project
Set the Directory name: to penguins
Click Browse... and set the directory to your desktop or other location
Click Create Project
Create New File
Once the new project session has opened make a new file to write code in:
Go to File in the upper left corner of RStudio
Select New File
Select R Script
Save the file as penguins.R
You can also use keyboard shortcuts to make a new file (Ctrl + Shift + N) and save it (Ctrl + S)
Data We Will Use
Artwork by @allison_horst
Our Data
library(palmerpenguins)
head(penguins) # view the first 6 rows
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
str(penguins) # view the structure of the data
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
View(penguins) # view the data in a new window
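A few other base R functions are handy for a first look at a dataset. The sketch below assumes a small stand-in data frame named peng, built from four of the columns in the six rows printed above, so it runs without palmerpenguins installed.

```r
# A small stand-in built from the six rows printed above
peng <- data.frame(
  species = rep("Adelie", 6),
  island = rep("Torgersen", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

head(peng, n = 3)           # n controls how many rows are shown (default is 6)
dim(peng)                   # number of rows, then number of columns
names(peng)                 # column names
summary(peng$body_mass_g)   # summary statistics, including the NA count
```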
Indexing Data Frames
We can use column position to index objects.
There are two slots we can use to extract data: rows and columns
object_name[row, column]
We can also subset our data by column position using : or c(column_1, column_2)
penguins[1,1]
# A tibble: 1 × 1
species
<fct>
1 Adelie
penguins[1, 1:2]
# A tibble: 1 × 2
species island
<fct> <fct>
1 Adelie Torgersen
penguins[1, c(1, 4)]
# A tibble: 1 × 2
species bill_depth_mm
<fct> <dbl>
1 Adelie 18.7
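The same [row, column] slots also accept column names, and either slot can be left empty. The sketch below assumes a plain data.frame stand-in named peng, built from the first three rows shown earlier. One difference worth knowing is that a plain data.frame simplifies a single column to a vector, whereas a tibble keeps it as a table.

```r
peng <- data.frame(
  species = rep("Adelie", 3),
  island = rep("Torgersen", 3),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0)
)

peng[1, c("species", "bill_depth_mm")]  # columns by name instead of position
peng[1:2, ]                             # empty column slot: all columns
peng[, 4]                               # empty row slot: all rows

# drop = FALSE keeps a one-column table instead of simplifying to a vector
one_col <- peng[, "bill_depth_mm", drop = FALSE]
```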
Negative Indexing
We can also exclude various elements using - and/or logical tests
penguins[,-1]
# A tibble: 344 × 7
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Torge… 39.1 18.7 181 3750 male 2007
2 Torge… 39.5 17.4 186 3800 fema… 2007
3 Torge… 40.3 18 195 3250 fema… 2007
4 Torge… NA NA NA NA <NA> 2007
5 Torge… 36.7 19.3 193 3450 fema… 2007
6 Torge… 39.3 20.6 190 3650 male 2007
7 Torge… 38.9 17.8 181 3625 fema… 2007
8 Torge… 39.2 19.6 195 4675 male 2007
9 Torge… 34.1 18.1 193 3475 <NA> 2007
10 Torge… 42 20.2 190 4250 <NA> 2007
# ℹ 334 more rows
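A logical test goes in the row slot in the same way. The sketch below assumes a small stand-in data frame named peng with values taken from the rows printed earlier. The use of which() to handle the missing value is one option among several.

```r
peng <- data.frame(
  species = rep("Adelie", 6),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

peng[, -1]            # drop the first column
peng[-c(1, 2), ]      # drop the first two rows

# A logical test in the row slot keeps rows where the test is TRUE.
# which() drops the NA produced by the row with a missing mass.
heavy <- peng[which(peng$body_mass_g > 3600), ]
heavy

# Exclude rows with a missing body mass
complete <- peng[!is.na(peng$body_mass_g), ]
```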
penguins |> # start with the dataset
  filter(species == 'Adelie', island == 'Torgersen') |> # filter the data
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE)) |> # calculate the mean
  pull(mean_mass) |> # pull out the mean mass
  log() |> # notice no arguments inside parentheses
  round(0) # round the result
[1] 8
Although the piping adds lines, it makes the code much more readable
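For comparison, here is the same kind of calculation written both ways in base R. This is a minimal sketch that assumes R 4.1 or later (for the native |> pipe) and uses the five non-missing body masses from the rows shown earlier.

```r
mass <- c(3750, 3800, 3250, 3450, 3650)  # body masses from the rows shown earlier

# Nested calls are read from the inside out
nested <- round(log(mean(mass)), 0)

# Piped calls are read top to bottom, one step per line
piped <- mass |>
  mean() |>
  log() |>
  round(0)

identical(nested, piped)   # TRUE
```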
Filter the data by selecting any two columns using the column position or $ operator
Assign steps 1-3 to individual variables, then combine them into a list
Create a new data object with 2 columns: body mass and flipper length for all penguins, then plot them against each other using the plot() function. (hint: for tidyverse, check the documentation for select())
Tidy Data
What is Tidy Data?
A consistent way of organizing data to make it easier to work with.
80% of data analysis is spent cleaning and preparing the data (Dasu and Johnson 2003).
It’s better to make a dataset tidy from the start rather than trying to fix it later.
The tidyverse is built around the idea of tidy data.
Rules of Tidy Data
Tabular data should be rectangular and flat
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
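The rules can be illustrated with a small made-up table. In the sketch below, the wide layout hides the variable "year" in the column names, and the long layout gives every variable its own column. All values are invented for illustration, and the reshaping is done by hand in base R rather than with a tidyverse function.

```r
# Hypothetical untidy table: one mass column per year
wide <- data.frame(
  species = c("Adelie", "Gentoo"),
  mass_2007 = c(3700, 5000),
  mass_2008 = c(3750, 5100)
)

# Tidy version: one row per species-year observation
long <- data.frame(
  species = rep(wide$species, times = 2),
  year = rep(c(2007, 2008), each = nrow(wide)),
  mass_g = c(wide$mass_2007, wide$mass_2008)
)
long
```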
Excel Abuse
Excel is a great tool but can be dangerous since you can abuse your data easily.
Excel can be used to store data but use of formulas should be limited.
Workflows can be spread across sheets and become difficult to follow.
Tidy Penguins
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Example
Notice anything wrong?
Unnecessary info at the top can be moved to a different sheet
Inconsistent date formats
Adding extra info to cells (units next to values)
Coloring cells and leaving stand-alone cells (ok if color key in different sheet)
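Two of these problems can be repaired in R after import, although fixing them at the source is better. The sketch below uses made-up values. parse_date() is a hypothetical helper, and it assumes that slash-separated dates are written month first.

```r
# Units typed next to the values: strip the trailing "g" and convert
raw_mass <- c("3750 g", "3800g", "3250 g")
mass_g <- as.numeric(sub("\\s*g$", "", raw_mass))
mass_g

# Inconsistent date formats: try each known format in turn.
# parse_date() is a hypothetical helper; month-first order is an assumption.
raw_dates <- c("2007-11-11", "11/12/2007")
parse_date <- function(x) {
  for (fmt in c("%Y-%m-%d", "%m/%d/%Y")) {
    d <- as.Date(x, format = fmt)
    if (!is.na(d)) return(d)
  }
  as.Date(NA)
}
dates <- do.call(c, lapply(raw_dates, parse_date))
format(dates)
```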