In this section, you will find information about which software is necessary to download, where to find it, and how to install it.
From the R Project, download the latest version of R and follow their instructions to install it.
From RStudio, download the latest version of RStudio Desktop, and follow the given directions to install it. RStudio will serve as your integrated development environment (IDE) for running R code. RStudio is a powerful tool to write, edit, and debug R code and build R scripts, documents, and applications.
If you plan to use R on a computer with a Windows operating system, you will also need to download the latest version of Rtools from The R Project and follow their installation instructions.
Note: All of the software listed above, which is everything you need to use R, is free.
In this section, you will find steps to write your first lines of R code.
After installing all of the software in the previous section, open RStudio. Upon opening RStudio for the first time, you should notice three panels: one large panel on the left and two smaller panels stacked on top of each other on the right. In the left panel, you should also see R and its version displayed (e.g., R version X.Y.Z ...).
Each of these three panels serves a specific purpose in writing and running R code:
The left panel consists of the R console, terminal, and jobs tabs. These will all assist you in running R commands.
The top right panel has information about your programming environment, history, and connections. In recent versions of RStudio, it also has a tutorial tab to walk you through RStudio tutorials.
The bottom right panel displays information related to viewing files, plots, packages, help documentation, and documents and presentations generated by R.
Please note that a fourth panel appears in the upper left when you start a new R script. However, for this tutorial, we will focus on the original left panel (now the bottom left panel), where we will type and run R code.
The simplest commands in R use the language as a calculator. For example, we can calculate 10 + 10:
10 + 10
[1] 20
Congratulations! You’ve just written your first line of R code. Cheers to many more!
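A few more calculator-style commands, as a minimal sketch that uses only operators and functions built into R:

```r
# Basic arithmetic typed directly into the R console
7 - 2      # subtraction
3 * 4      # multiplication
10 / 4     # division
2^3        # exponentiation
10 %% 3    # remainder after division (modulo)
10 %/% 3   # integer division
sqrt(16)   # square root, using a built-in function
```

Each line prints its result in the console in the same `[1] ...` style shown above.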
To do more complicated calculations and other tasks in R, packages are often required. Packages are bundles of R functions that are alike in structure and/or purpose. The following code will install the gapminder package and load it so that we can use it in this tutorial:
install.packages("gapminder")
library(gapminder)
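Running install.packages every time a script runs downloads the package again. One possible way to avoid that is sketched below. The helper name install_if_missing is hypothetical (it is not part of R or of this tutorial), and it is demonstrated on the stats package only because stats ships with R and so needs no download:

```r
# Hypothetical helper: install a package only when it is not already available
install_if_missing <- function(pkg) {
  # requireNamespace() checks whether the package can be loaded
  already_installed <- isTRUE(requireNamespace(pkg, quietly = TRUE))
  if (!already_installed) {
    install.packages(pkg)
  }
  invisible(already_installed)
}

# "stats" ships with R, so nothing is installed here
was_installed <- install_if_missing("stats")
was_installed
```

Under the same assumption, calling install_if_missing("gapminder") before library(gapminder) would install gapminder only on the first run.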
Functions are similar to what we saw above in calculating 10 + 10. They take in values and output results based on those values. To use a function from a package, use the package::function syntax. The same syntax also works for datasets stored in a package. Below, we use it to load gapminder_unfiltered, the unfiltered Gapminder data stored in the gapminder package:
gapminder::gapminder_unfiltered
# A tibble: 3,313 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 3,303 more rows
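The package::function syntax can be tried without installing anything, because the stats and utils packages ship with R. A minimal sketch, where the values vector is made up purely for illustration:

```r
# Call functions by their full package::function name.
# stats and utils ship with R, so no installation is needed.
values <- c(4, 8, 15, 16, 23, 42)   # illustrative numbers only

middle_value <- stats::median(values)    # median of the values
first_values <- utils::head(values, 3)   # first three values

middle_value
first_values
```

Writing the package name explicitly makes it clear where a function comes from, which helps when two packages define functions with the same name.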
The <- operator assigns a value to a variable. For example, the variable r_variable below is being assigned the value of 5.
r_variable <- 5
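Once a value is stored, the variable can be printed, used in calculations, and overwritten. A short sketch, where doubled is a hypothetical variable name introduced only for this example:

```r
r_variable <- 5                # assign 5 to r_variable
r_variable                     # typing the name prints the stored value

doubled <- r_variable * 2      # use the variable in a calculation
doubled

r_variable <- r_variable + 1   # overwrite the variable with a new value
r_variable
```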
In addition, comments can be written in R to annotate code.
# This is a comment!
1 + 1 # This is another comment. 1 + 1 should equal 2.
[1] 2
In this section, you will find techniques to import data.
The data that will be used in this tutorial comes from the #TidyTuesday GitHub repository and is focused on properties of chocolate! It is originally from Flavors of Cacao.
This data can be loaded into R in two ways:
# 1. With the tidytuesdayR package
tidy_tuesday_data <- tidytuesdayR::tt_load('2022-01-18')
Downloading file 1 of 1: `chocolate.csv`
# 2. By reading in the CSV file from GitHub
chocolate_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
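Data saved on your own computer can be read in a similar way. The sketch below is an assumption-laden illustration, not part of the original tutorial: it writes three rows (copied from the chocolate data output shown in this tutorial, with only four of the ten columns) to a temporary file, then reads that file with base R's read.csv instead of readr. The names csv_path and local_chocolate are hypothetical.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "ref,company_manufacturer,country_of_bean_origin,rating",
    "2454,5150,Tanzania,3.25",
    "2458,5150,Dominican Republic,3.5",
    "2454,5150,Madagascar,3.75"
  ),
  csv_path
)

# Read the local file with base R.
# colClasses keeps company names such as "5150" as text rather than numbers.
local_chocolate <- utils::read.csv(
  csv_path,
  colClasses = c(company_manufacturer = "character")
)

local_chocolate
```

Replacing csv_path with the path to a CSV file on your computer reads that file instead.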
In this section, you will find techniques to summarize data.
The first thing we should do with our data is check out what it looks like. This can be done with the str function.
str(chocolate_data)
spec_tbl_df [2,530 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ref : num [1:2530] 2454 2458 2454 2542 2546 ...
$ company_manufacturer : chr [1:2530] "5150" "5150" "5150" "5150" ...
$ company_location : chr [1:2530] "U.S.A." "U.S.A." "U.S.A." "U.S.A." ...
$ review_date : num [1:2530] 2019 2019 2019 2021 2021 ...
$ country_of_bean_origin : chr [1:2530] "Tanzania" "Dominican Republic" "Madagascar" "Fiji" ...
$ specific_bean_origin_or_bar_name: chr [1:2530] "Kokoa Kamili, batch 1" "Zorzal, batch 1" "Bejofo Estate, batch 1" "Matasawalevu, batch 1" ...
$ cocoa_percent : chr [1:2530] "76%" "76%" "76%" "68%" ...
$ ingredients : chr [1:2530] "3- B,S,C" "3- B,S,C" "3- B,S,C" "3- B,S,C" ...
$ most_memorable_characteristics : chr [1:2530] "rich cocoa, fatty, bready" "cocoa, vegetal, savory" "cocoa, blackberry, full body" "chewy, off, rubbery" ...
$ rating : num [1:2530] 3.25 3.5 3.75 3 3 3.25 3.5 3.5 3.75 2.75 ...
- attr(*, "spec")=
.. cols(
.. ref = col_double(),
.. company_manufacturer = col_character(),
.. company_location = col_character(),
.. review_date = col_double(),
.. country_of_bean_origin = col_character(),
.. specific_bean_origin_or_bar_name = col_character(),
.. cocoa_percent = col_character(),
.. ingredients = col_character(),
.. most_memorable_characteristics = col_character(),
.. rating = col_double()
.. )
- attr(*, "problems")=<externalptr>
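The str output shows that cocoa_percent is stored as text (chr), with values such as "76%", so it cannot be averaged directly. One possible way to turn such values into numbers is sketched below, using the four values visible in the output above. The variable name cocoa_numeric is hypothetical.

```r
# The first four cocoa_percent values shown in the str() output
cocoa_percent <- c("76%", "76%", "76%", "68%")

# Remove the percent sign, then convert the remaining text to numbers
cocoa_numeric <- as.numeric(sub("%", "", cocoa_percent, fixed = TRUE))

cocoa_numeric
mean(cocoa_numeric)   # numeric summaries now work
```

The same two steps could be applied to chocolate_data$cocoa_percent, assuming every value follows the same "number followed by %" pattern.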
Another way to look at the structure of the chocolate data is with the summary function, which gives a summary of each column of the data.
summary(chocolate_data)
ref company_manufacturer company_location review_date
Min. : 5 Length:2530 Length:2530 Min. :2006
1st Qu.: 802 Class :character Class :character 1st Qu.:2012
Median :1454 Mode :character Mode :character Median :2015
Mean :1430 Mean :2014
3rd Qu.:2079 3rd Qu.:2018
Max. :2712 Max. :2021
country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent
Length:2530 Length:2530 Length:2530
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
ingredients most_memorable_characteristics rating
Length:2530 Length:2530 Min. :1.000
Class :character Class :character 1st Qu.:3.000
Mode :character Mode :character Median :3.250
Mean :3.196
3rd Qu.:3.500
Max. :4.000
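The summary function also works on a single column, and each of its statistics has its own function. A minimal sketch using the first ten rating values shown in the str output earlier, so the results differ from the full-data summary above:

```r
# First ten rating values from the str() output
ratings <- c(3.25, 3.5, 3.75, 3, 3, 3.25, 3.5, 3.5, 3.75, 2.75)

summary(ratings)   # minimum, quartiles, median, mean, and maximum at once

mean(ratings)      # the individual statistics are also available on their own
median(ratings)
range(ratings)     # minimum and maximum together
```

With the full data loaded, summary(chocolate_data$rating) would summarize just the rating column.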
To find just the number of rows or the number of columns in the data, use nrow and ncol, respectively.
nrow(chocolate_data)
[1] 2530
ncol(chocolate_data)
[1] 10
To find both at the same time, use dim.
dim(chocolate_data)
[1] 2530 10
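The relationship between these three functions can be checked on any data frame. The sketch below uses iris, a dataset that ships with R, so it runs without downloading the chocolate data:

```r
# iris is a built-in data frame, available in every R session
iris_dims <- dim(iris)   # rows first, then columns
iris_dims

iris_dims[1] == nrow(iris)   # the first element is the number of rows
iris_dims[2] == ncol(iris)   # the second element is the number of columns
```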
To view the first six rows of the data, we can use the head function.
head(chocolate_data)
# A tibble: 6 × 10
ref company_manufacturer company_location review_date country_of_bean_origin
<dbl> <chr> <chr> <dbl> <chr>
1 2454 5150 U.S.A. 2019 Tanzania
2 2458 5150 U.S.A. 2019 Dominican Republic
3 2454 5150 U.S.A. 2019 Madagascar
4 2542 5150 U.S.A. 2021 Fiji
5 2546 5150 U.S.A. 2021 Venezuela
6 2546 5150 U.S.A. 2021 Uganda
# … with 5 more variables: specific_bean_origin_or_bar_name <chr>,
# cocoa_percent <chr>, ingredients <chr>,
# most_memorable_characteristics <chr>, rating <dbl>
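The head function accepts a second argument for the number of rows to show, and tail does the same for the end of the data. A short sketch using mtcars, a data frame that ships with R; the variable names are hypothetical:

```r
# mtcars is a built-in data frame, available in every R session
first_three <- head(mtcars, 3)   # first three rows instead of the default six
last_two <- tail(mtcars, 2)      # last two rows

first_three
last_two
```

Under the same pattern, head(chocolate_data, 3) would show the first three rows of the chocolate data.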
Another effective way to summarize data is by using the skim function in the skimr package.
skimr::skim(chocolate_data)
Name | chocolate_data |
Number of rows | 2530 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company_manufacturer | 0 | 1.00 | 2 | 39 | 0 | 580 | 0 |
company_location | 0 | 1.00 | 4 | 21 | 0 | 67 | 0 |
country_of_bean_origin | 0 | 1.00 | 4 | 21 | 0 | 62 | 0 |
specific_bean_origin_or_bar_name | 0 | 1.00 | 3 | 51 | 0 | 1605 | 0 |
cocoa_percent | 0 | 1.00 | 3 | 6 | 0 | 46 | 0 |
ingredients | 87 | 0.97 | 4 | 14 | 0 | 21 | 0 |
most_memorable_characteristics | 0 | 1.00 | 3 | 37 | 0 | 2487 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ref | 0 | 1 | 1429.80 | 757.65 | 5 | 802 | 1454.00 | 2079.0 | 2712 | ▆▇▇▇▇ |
review_date | 0 | 1 | 2014.37 | 3.97 | 2006 | 2012 | 2015.00 | 2018.0 | 2021 | ▃▅▇▆▅ |
rating | 0 | 1 | 3.20 | 0.45 | 1 | 3 | 3.25 | 3.5 | 4 | ▁▁▅▇▇ |
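The n_missing and complete_rate columns can be approximated with base R when skimr is not installed. This is a sketch of an alternative, not how skimr computes them internally. The toy data frame below is made up for illustration, including its two missing values:

```r
# Toy data frame with two deliberately missing ingredients values
toy_data <- data.frame(
  ingredients = c("3- B,S,C", NA, "3- B,S,C", NA),
  rating = c(3.25, 3.5, 3.75, 3)
)

# is.na() marks each missing cell; column sums and means then summarize them
missing_counts <- colSums(is.na(toy_data))    # missing values per column
complete_rates <- colMeans(!is.na(toy_data))  # share of non-missing values per column

missing_counts
complete_rates
```

Applied to chocolate_data, the same two lines would report the missing values per column, such as the 87 missing ingredients entries listed above.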
Note 1: If you need help with a function, run either ?function or help(function) to view its help page documentation and learn more about how it works.
Note 2: If you need help with a function, but can’t exactly remember its full name, run ??fun to do a fuzzy search with the part of the name you remember (e.g., fun). This R command will yield a page of search results, where you can find the function that you were looking for and eventually view its help page documentation.
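A few related base R commands can be tried on nrow, a function used earlier in this tutorial. The args and apropos functions are not mentioned in the notes above; they are included here as additional options under the assumption that a quick console answer is sometimes enough:

```r
args(nrow)                          # show the arguments that nrow accepts

matching_names <- apropos("^nrow")  # object names starting with "nrow" (ignores case by default)
matching_names

# In an interactive session, these open the help page itself:
# ?nrow
# help("nrow")
```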
In this section, you will find approaches to learn more about R in the future.
This tutorial gave a simple introduction to working with R and touched on just a small portion of R’s capabilities for data analysis. By learning more about R, you can more effectively learn from your data. The following are my recommendations of the best ways to learn more about R.
Have fun working through more tutorials like this one. Some great tutorials have been written by RStudio, The Carpentries, and the Department of Biostatistics at Johns Hopkins University.
Read and do exercises in one of the many free R books out there. The Big Book of R lists a collection of R books that you can learn from, and the free R for Data Science book by Hadley Wickham and Garrett Grolemund has an active online learning community.
Work on projects in R with real data. #TidyTuesday offers weekly opportunities to improve your R skills this way.
Learn with a community. I am a proponent of learning new things, especially technical concepts, with a supportive and diverse community. Join groups like R-Ladies Irvine to grow and refine your skills.
Keep up-to-date with new developments in R by reading R-bloggers.
Last but not least, RSeek.org is like Google for R users. It is a search engine that specifically filters for results that are relevant to R programming.
At first blush, this information may seem a bit overwhelming. However, as you progress in your journey of learning R and determine which resources work best for you, you’ll become a better R programmer and data scientist.
Much of this tutorial was derived from the “R for Reproducible Scientific Analysis” Software Carpentry lesson. As noted earlier in this document, the chocolate data was downloaded from the #TidyTuesday GitHub repository and originally came from Flavors of Cacao.