How to Get Started with R

Author

Alyssa Columbus

Published

May 31, 2022

Abstract

This interactive, hands-on tutorial was written with Quarto for the R-Ladies Irvine May 2022 Meeting. It was designed to be approximately one hour in length for beginners who are new to R programming. If you have any questions on any of the information presented in this document, please feel free to contact me.

Setup

In this section, you will find information about which software is necessary to download, where to find it, and how to install it.

From the R Project, download the latest version of R and follow their instructions to install it.

From RStudio, download the latest version of RStudio Desktop, and follow the given directions to install it. RStudio will serve as your integrated development environment (IDE) for running R code. RStudio is a powerful tool to write, edit, and debug R code and build R scripts, documents, and applications.

If you plan to use R on a computer with a Windows operating system, you will also need to download the latest version of Rtools from The R Project and follow their installation instructions.

Note: All of the software listed above, which is everything you need to use R, is free.

Writing Your First Lines of R Code

In this section, you will find steps to write your first lines of R code.

After installing all of the software in the previous section, open RStudio. Upon opening RStudio for the first time, you should notice three panels: one large panel on the left and two smaller panels on top of each other on the right. Also, in the left panel, you should see that R, along with its version, is displayed (e.g., R version X.Y.Z ...).

Each of these three panels serves a specific purpose in writing and running R code:

The left panel consists of the R console, terminal, and jobs tabs. These will all assist you in running R commands.
The top right panel has information about your programming environment, history, and connections. In recent versions of RStudio, it also has a tutorial tab to walk you through RStudio tutorials.
The bottom right panel displays information related to viewing files, plots, packages, help documentation, and documents and presentations generated by R.

Please note that a fourth panel appears in the upper left when you start a new R script, which moves the original left panel to the bottom left. For this tutorial, we will focus our efforts on the original left panel (i.e., now the bottom left panel). Here we will type and run R code.

The simplest commands in R use the language as a calculator. For example, we can calculate 10 + 10:

10 + 10

[1] 20

Congratulations! You’ve just written your first line of R code. Cheers to many more!
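The same calculator idea extends to R's other arithmetic operators. The following is a small sketch; the numbers are arbitrary and chosen only for illustration.

```r
# Basic arithmetic operators in R (values chosen only for illustration)
10 - 4         # subtraction
10 * 4         # multiplication
10 / 4         # division
10 ^ 2         # exponentiation
10 %/% 4       # integer division
10 %% 4        # remainder (modulo)
(10 + 10) / 4  # parentheses control the order of operations
```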

To do more complicated calculations and other tasks in R, packages are often required. Packages are bundles of R functions that are alike in structure and/or purpose. The following code will install the gapminder package and load it so that we can use it in this tutorial:

install.packages("gapminder")
library(gapminder)
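Running install.packages every time a script runs re-downloads the package, so one common pattern is to install only when the package is missing. The sketch below assumes a helper named ensure_package, which is hypothetical and not part of the original tutorial; requireNamespace and install.packages are standard R functions.

```r
# Hypothetical helper (not part of the original tutorial): install a package
# only when it is not already available, then report whether it can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call does not trigger an installation
ensure_package("stats")
```

In this tutorial's setting, the same helper could be called as ensure_package("gapminder") before library(gapminder).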

Functions are similar to what we saw above in calculating 10 + 10. They take in values and output results based on these values. To use a function or another object, such as a dataset, from a package, use the package::name syntax. Below, we use this syntax to access gapminder_unfiltered, the unfiltered Gapminder dataset stored in the gapminder package:

gapminder::gapminder_unfiltered

# A tibble: 3,313 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 3,303 more rows
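The same package::name syntax also works for functions in packages that ship with R. As a minimal sketch, the example below calls one function from stats and one from utils; the values vector is made up for illustration.

```r
# Call functions with the package::function syntax, without library().
# stats and utils ship with R, so nothing needs to be installed.
values <- c(3, 1, 4, 1, 5, 9)  # made-up numbers

stats::median(values)   # median of the values
utils::head(values, 3)  # first three values
```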

The <- operator assigns a value to a variable. For example, the variable r_variable below is being assigned the value of 5.

r_variable <- 5
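Once a variable has a value, it can be printed, used in calculations, and reassigned. A short sketch, where r_double is a name invented for this example:

```r
r_variable <- 5

# Typing a variable's name prints its value
r_variable

# Variables can be used in calculations and reassigned
r_double <- r_variable * 2    # r_double is a made-up name for this example
r_variable <- r_variable + 1  # r_variable now holds the new value
```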

In addition, comments can be written in R to annotate code.

# This is a comment!
1 + 1 # This is another comment. 1 + 1 should equal 2.

[1] 2

Importing Data

In this section, you will find techniques to import data.

The data that will be used in this tutorial comes from the #TidyTuesday GitHub repository and is focused on properties of chocolate! It is originally from Flavors of Cacao.

This data can be loaded into R in two ways:

# 1. With the tidytuesdayR package
tidy_tuesday_data <- tidytuesdayR::tt_load('2022-01-18')


    Downloading file 1 of 1: `chocolate.csv`

# 2. By reading in the CSV file from GitHub
chocolate_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
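Reading a CSV file stored on your own computer follows the same pattern as reading one from a URL. The sketch below writes a tiny made-up CSV to a temporary file and reads it back with base R's read.csv, used here instead of readr::read_csv so that the example needs no extra packages. The file contents, column names, and the toy_data name are all invented for illustration.

```r
# Write a small, made-up CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("bar_name,rating", "Example A,3.25", "Example B,3.50"), csv_path)

# Read it back; read.csv() is base R's counterpart to readr::read_csv()
toy_data <- read.csv(csv_path)
toy_data
```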

Summarizing Data

In this section, you will find techniques to summarize data.

The first thing we should do with our data is check out what it looks like. This can be done with the str function.

str(chocolate_data)

spec_tbl_df [2,530 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ref                             : num [1:2530] 2454 2458 2454 2542 2546 ...
 $ company_manufacturer            : chr [1:2530] "5150" "5150" "5150" "5150" ...
 $ company_location                : chr [1:2530] "U.S.A." "U.S.A." "U.S.A." "U.S.A." ...
 $ review_date                     : num [1:2530] 2019 2019 2019 2021 2021 ...
 $ country_of_bean_origin          : chr [1:2530] "Tanzania" "Dominican Republic" "Madagascar" "Fiji" ...
 $ specific_bean_origin_or_bar_name: chr [1:2530] "Kokoa Kamili, batch 1" "Zorzal, batch 1" "Bejofo Estate, batch 1" "Matasawalevu, batch 1" ...
 $ cocoa_percent                   : chr [1:2530] "76%" "76%" "76%" "68%" ...
 $ ingredients                     : chr [1:2530] "3- B,S,C" "3- B,S,C" "3- B,S,C" "3- B,S,C" ...
 $ most_memorable_characteristics  : chr [1:2530] "rich cocoa, fatty, bready" "cocoa, vegetal, savory" "cocoa, blackberry, full body" "chewy, off, rubbery" ...
 $ rating                          : num [1:2530] 3.25 3.5 3.75 3 3 3.25 3.5 3.5 3.75 2.75 ...
 - attr(*, "spec")=
  .. cols(
  ..   ref = col_double(),
  ..   company_manufacturer = col_character(),
  ..   company_location = col_character(),
  ..   review_date = col_double(),
  ..   country_of_bean_origin = col_character(),
  ..   specific_bean_origin_or_bar_name = col_character(),
  ..   cocoa_percent = col_character(),
  ..   ingredients = col_character(),
  ..   most_memorable_characteristics = col_character(),
  ..   rating = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
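The str output shows that cocoa_percent is stored as text (chr) because each value ends in a percent sign. If you later want to treat it as a number, one possible approach is sketched below on a few values in the same format; the cocoa_numeric name is an assumption made for this example.

```r
# A few values in the same format as the cocoa_percent column
cocoa_percent <- c("76%", "76%", "68%")

# Remove the percent sign, then convert the text to numbers
cocoa_numeric <- as.numeric(sub("%", "", cocoa_percent, fixed = TRUE))
cocoa_numeric
```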

Another way to look at the structure of the chocolate data is with the summary function, which gives a summary of each column of the data.

summary(chocolate_data)

      ref       company_manufacturer company_location    review_date  
 Min.   :   5   Length:2530          Length:2530        Min.   :2006  
 1st Qu.: 802   Class :character     Class :character   1st Qu.:2012  
 Median :1454   Mode  :character     Mode  :character   Median :2015  
 Mean   :1430                                           Mean   :2014  
 3rd Qu.:2079                                           3rd Qu.:2018  
 Max.   :2712                                           Max.   :2021  
 country_of_bean_origin specific_bean_origin_or_bar_name cocoa_percent     
 Length:2530            Length:2530                      Length:2530       
 Class :character       Class :character                 Class :character  
 Mode  :character       Mode  :character                 Mode  :character  
                                                                           
                                                                           
                                                                           
 ingredients        most_memorable_characteristics     rating     
 Length:2530        Length:2530                    Min.   :1.000  
 Class :character   Class :character               1st Qu.:3.000  
 Mode  :character   Mode  :character               Median :3.250  
                                                   Mean   :3.196  
                                                   3rd Qu.:3.500  
                                                   Max.   :4.000
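To summarize a single column rather than the whole dataset, select it with the $ operator first. The sketch below uses a small made-up data frame standing in for chocolate_data; the ratings are invented.

```r
# Made-up ratings standing in for chocolate_data$rating
toy_data <- data.frame(rating = c(3.25, 3.5, 3.75, 3, 3))

summary(toy_data$rating)  # minimum, quartiles, median, mean, and maximum
mean(toy_data$rating)     # a single statistic for the same column
```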

To find just the number of rows or the number of columns in the data, use nrow and ncol, respectively.

nrow(chocolate_data)

[1] 2530

ncol(chocolate_data)

[1] 10

To find both at the same time, use dim.

dim(chocolate_data)

[1] 2530   10
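Because dim returns a vector with the row count first and the column count second, each value can be picked out by position. A minimal sketch on a made-up data frame:

```r
# A made-up data frame with 4 rows and 2 columns
toy_data <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))

dims <- dim(toy_data)
dims[1]  # number of rows, the same value as nrow(toy_data)
dims[2]  # number of columns, the same value as ncol(toy_data)
```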

To view the first six rows of the data, we can use the head function.

head(chocolate_data)

# A tibble: 6 × 10
    ref company_manufacturer company_location review_date country_of_bean_origin
  <dbl> <chr>                <chr>                  <dbl> <chr>                 
1  2454 5150                 U.S.A.                  2019 Tanzania              
2  2458 5150                 U.S.A.                  2019 Dominican Republic    
3  2454 5150                 U.S.A.                  2019 Madagascar            
4  2542 5150                 U.S.A.                  2021 Fiji                  
5  2546 5150                 U.S.A.                  2021 Venezuela             
6  2546 5150                 U.S.A.                  2021 Uganda                
# … with 5 more variables: specific_bean_origin_or_bar_name <chr>,
#   cocoa_percent <chr>, ingredients <chr>,
#   most_memorable_characteristics <chr>, rating <dbl>
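The head function shows six rows by default, but its n argument changes that number, and tail does the same for the last rows. A short sketch on a made-up data frame:

```r
# A made-up data frame with 10 rows
toy_data <- data.frame(id = 1:10, value = (1:10)^2)

head(toy_data, n = 3)  # first three rows
tail(toy_data, n = 2)  # last two rows
```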

Another effective way to summarize data is by using the skim function in the skimr package.

skimr::skim(chocolate_data)

Data summary
Name	chocolate_data
Number of rows	2530
Number of columns	10
_______________________
Column type frequency:
character	7
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
company_manufacturer	0	1.00	2	39	580
company_location	0	1.00	4	21	67
country_of_bean_origin	0	1.00	4	21	62
specific_bean_origin_or_bar_name	0	1.00	3	51	1605
cocoa_percent	0	1.00	3	6	46
ingredients	87	0.97	4	14	21
most_memorable_characteristics	0	1.00	3	37	2487

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ref	1	1429.80	757.65	5	802	1454.00	2079.0	2712	▆▇▇▇▇
review_date	1	2014.37	3.97	2006	2012	2015.00	2018.0	2021	▃▅▇▆▅
rating	1	3.20	0.45	1	3	3.25	3.5	4	▁▁▅▇▇

Note 1: If you need help with a function, run either ?function or help(function) to view its help page documentation and learn more about how it works.

Note 2: If you need help with a function, but can’t exactly remember its full name, run ??fun to do a fuzzy search with the part of the name you remember (e.g., fun). This R command will yield a page of search results, where you can find the function that you were looking for and eventually view its help page documentation.
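Two other base R helpers can complement the help commands above. They are not covered in the original tutorial: apropos lists objects whose names match a pattern, and args shows a function's arguments without opening its help page.

```r
# List available objects whose names contain "nrow"
apropos("nrow")

# Show a function's arguments without opening its help page
args(head)
```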

Learning More About R

In this section, you will find approaches to learn more about R in the future.

This tutorial gave a simple introduction to working with R and touched on just a small portion of R’s capabilities for data analysis. By learning more about R, you can more effectively learn from your data. The following are my recommendations of the best ways to learn more about R.

Have fun working through more tutorials like this one. Some great tutorials have been written by RStudio, The Carpentries, and the Department of Biostatistics at Johns Hopkins University.
Read and do exercises in one of the many free R books out there. The Big Book of R lists a collection of R books that you can learn from, and the free R for Data Science book by Hadley Wickham and Garrett Grolemund has an active online learning community.
Work on projects in R with real data. #TidyTuesday offers weekly opportunities to improve your R skills this way.
Learn with a community. I am a proponent of learning new things, especially technical concepts, with a supportive and diverse community. Join groups like R-Ladies Irvine to grow and refine your skills.
Keep up-to-date with new developments in R by reading R-bloggers.
Last but not least, RSeek.org is like Google for R users. It is a search engine that specifically filters for results that are relevant to R programming.

At first blush, this information may seem a bit overwhelming. However, as you progress in your journey of learning R and determine which resources work best for you, you’ll find that you’ll become a better R programmer and data scientist.

References

Much of this tutorial was derived from the “R for Reproducible Scientific Analysis” Software Carpentry lesson. As noted earlier in this document, the chocolate data was downloaded from the #TidyTuesday GitHub repository and originally came from Flavors of Cacao.