Getting Started with R

Author

HEOR DS 510

Published

November 19, 2025

1 The R Project Homepage

The official home of the R language is The R Project for Statistical Computing: https://www.r-project.org/

From this site you can access:

  • Links to CRAN (the Comprehensive R Archive Network) for downloads
  • Manuals, FAQs, and contributed documentation
  • Source code and news about new versions of R

This site is the “home base” for R itself and is separate from RStudio/Posit (which provides the IDE).

2 How to Download R and RStudio, and Check Your Versions

2.1 Installing R and RStudio

  • R is the programming language and engine.
  • RStudio (by Posit) is a popular Integrated Development Environment (IDE) for R.
    An IDE is a software environment that integrates coding, running, and debugging tools in one place, making programming easier and more efficient.

R and RStudio

Install R first, then RStudio.

  1. Download R (CRAN; The Comprehensive R Archive Network)
    • Go to https://cran.r-project.org/
    • Click “Download R for Windows”, “macOS”, or “Linux”
    • Choose the latest release and run the installer
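  2. Download RStudio Desktop (free)
    • Go to https://posit.co/download/rstudio-desktop/
    • Choose the installer for your operating system and run it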

2.2 Checking Your Versions

2.2.1 R version

# To run the code you have 3 options:
# (1) highlight the line and click Run,
# (2) place your cursor on the line and press Cmd/Ctrl + Enter, or
# (3) to run the entire chunk, click the green button to the right.

R.version.string
[1] "R version 4.4.3 (2025-02-28)"

2.2.2 RStudio version

# If running inside RStudio, this will show RStudio version:
 
if (exists("RStudio.Version")) 
  paste("RStudio version:", RStudio.Version()$version)

Explanation:

RStudio.Version() is a function provided by RStudio. When it is run inside RStudio, it returns a list of metadata that includes a “version” element. The $ operator extracts only that element.

paste() simply prefixes the version with “RStudio version:” so that it prints nicely. Outside RStudio the function does not exist, so the exists() check skips the call and nothing is printed.
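
A script can also compare the running R version against a minimum it expects. The sketch below uses base R's getRversion(); the threshold min_version is a hypothetical value chosen only for illustration.

```r
# Hypothetical minimum version, chosen only for illustration
min_version <- "4.0.0"

# getRversion() returns the running R version as a comparable version object
current_version <- getRversion()
meets_minimum <- current_version >= min_version

message("Running R ", as.character(current_version),
        "; meets minimum ", min_version, ": ", meets_minimum)
```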

3 What Is a CRAN Mirror and How Do You Set It?

CRAN (The Comprehensive R Archive Network) hosts R itself and thousands of R packages.
Because CRAN is mirrored (copied) around the world, you typically choose a CRAN mirror close to you geographically so downloads are faster and more reliable.

You can set your CRAN mirror interactively:

# This opens a menu in R to choose a CRAN mirror:
# chooseCRANmirror()   # uncomment and run once in a session

Or you can set it programmatically in your script:

# Example: set CRAN mirror via options() for this session
options(repos = c(CRAN = "https://cran.rstudio.com/"))

# You can check current repos with:
getOption("repos")
                       CRAN 
"https://cran.rstudio.com/" 

For reproducible research, it is often helpful to explicitly set the repository in your script or within your RStudio Project options.
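
One way to make this explicit without overriding a mirror that is already configured is sketched below. It assumes the CRAN cloud mirror (https://cloud.r-project.org/) as the fallback; the variable names are illustrative.

```r
# Assumption: fall back to the CRAN cloud mirror when no mirror is set
default_mirror <- "https://cloud.r-project.org/"

repos <- getOption("repos")
cran <- if ("CRAN" %in% names(repos)) repos[["CRAN"]] else NA_character_

# "@CRAN@" is the placeholder R uses before a mirror has been selected
if (is.na(cran) || cran == "@CRAN@") {
  repos["CRAN"] <- default_mirror
  options(repos = repos)
}

getOption("repos")[["CRAN"]]
```

Updating only the CRAN entry leaves any other repositories in the repos option untouched.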

4 R Is Open Source – Anyone Can Create an R Function

R is an open-source language, which means:

  • The source code is publicly available.
  • Anyone can write and share R functions and packages.
  • Most R packages are maintained by individuals or teams and distributed through CRAN, Bioconductor, GitHub, or other platforms.

You can define your own functions easily:

# A simple custom function
add_two <- function(x) {
  x + 2
}

add_two(5)
[1] 7

# Another example: compute a z-score
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

z_score(c(1, 2, 3, 4, 5))
[1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

This is the same mechanism that package authors use—just organized and distributed as packages.
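
Functions can also take default arguments and check their inputs. The following is a minimal sketch that extends the z-score idea; the name z_score_safe and its behavior for constant input are illustrative choices, not part of any package.

```r
# A z-score function with a default argument and basic input checks
z_score_safe <- function(x, na.rm = TRUE) {
  stopifnot(is.numeric(x))            # fail early on non-numeric input
  s <- sd(x, na.rm = na.rm)
  if (is.na(s) || s == 0) {
    # z-scores are undefined when the standard deviation is missing or zero
    return(rep(NA_real_, length(x)))
  }
  (x - mean(x, na.rm = na.rm)) / s
}

z_score_safe(c(2, 4, 6))   # -1 0 1
z_score_safe(c(5, 5, 5))   # NA NA NA (no variation)
```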

5 Exploring the RStudio Environment (Panes and Toolbars)

Once you have installed R and RStudio, open RStudio. By default, you will see four main panes:

  • Source (top-left)
    • Your editor for .R scripts, .Rmd, .qmd files
    • Run code lines or chunks into the Console
  • Console (bottom-left)
    • Where R commands execute
    • Shows outputs, errors, warnings
  • Environment/History (top-right)
    • Environment: data frames and objects currently in memory
    • History: previously run commands
  • Files/Plots/Packages/Help/Viewer (bottom-right)
    • Files: browse project files
    • Plots: view figures
    • Packages: installed packages and load/unload controls
    • Help: documentation for functions and packages
    • Viewer: renders HTML content (e.g., Quarto documents)

Across the top toolbar you will find buttons for:

  • Running code chunks or single lines
  • Saving files
  • Creating new scripts and Quarto / R Markdown files
  • Knitting (for .Rmd) or Rendering (for .qmd)
  • Managing Projects and version control (Git)

5.1 Quick Demonstrations

# Create a few objects (watch the Environment pane update)
nums <- rnorm(10)
df <- data.frame(id = 1:5, value = c(10, 20, 15, 30, NA))

# Use Help pane: open documentation
help("mean")   # or ?mean

RStudio window
# Show a basic plot (appears in Plots tab)
plot(nums, type = "b", main = "Demo plot", xlab = "Index", ylab = "Value")

6 Setting a Working Directory and Using R Projects

R needs to know where your files live. This is your working directory.

A very efficient workflow is to:

  1. Create a work folder for your project.
  2. Create an R Project inside that folder.
  3. Keep all your data, code, and documents in that folder.

6.1 Creating an .Rproj and Setting the Working Directory

Create a .Rproj file, name it, and save it in your new work folder:

  • File → New Project → Existing Directory (or “New Directory” to create a new folder)

To create and save your R Script file:

  • File → New File → R Script
  • File → Save (choose .R extension)
  • For R Markdown: File → New File → R Markdown, then Save as .Rmd

Once you are in your R Project, the working directory will automatically be the project folder.

Check your working directory:

getwd()
[1] "/Users/carlospineda/Documents/GitHub/SoftwareDev/HEOR_DS"

You can change the working directory manually (less recommended than using Projects):

# Example only; adjust to your own path:
# setwd("/path/to/your/project/folder")

Using R Projects is more robust than calling setwd() in every script, and it keeps your projects self-contained.
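
Inside a project, files can be referenced with paths relative to the project folder. A small sketch, assuming hypothetical file names (my_data.csv, 01_clean.R) that you would replace with your own:

```r
# Hypothetical file names, for illustration only
data_file   <- file.path("data", "my_data.csv")
script_file <- file.path("scripts", "01_clean.R")

data_file                 # "data/my_data.csv"
file.exists(data_file)    # TRUE only if the file is present in your project

# Full path that the relative path resolves to from the working directory
file.path(getwd(), data_file)
```

Because file.path() builds the path from its parts, the same script works on Windows, macOS, and Linux.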

7 Installing and Loading Libraries

R’s functionality is extended through packages (libraries). You typically:

  1. Install a package once (per machine or environment).
  2. Load it in each session where you want to use it.

7.1 Installing Packages

7.1.1 Install a single package

# Example install (run once; set eval: false to avoid automatic install)

# install.packages("dplyr") 
# install.packages("ggplot2")

7.1.2 Install multiple packages at once

# Example install of multiple packages (run once; set eval: false to avoid automatic install)

# install.packages(c("tidyverse", "readr", "dplyr", "ggplot2", "readxl", "data.table"))

7.2 Loading Packages

7.2.1 Load a single package

# Load dplyr package
library(dplyr)

7.2.2 Load multiple packages safely

# Load each package if it is installed; skip it gracefully if not
loaded_pkgs <- c()                             # empty vector to track loaded packages
for (pkg in c("dplyr", "ggplot2")) {           # loop through the package names
  if (requireNamespace(pkg, quietly = TRUE)) { # check whether the package is installed
    library(pkg, character.only = TRUE)        # character.only = TRUE: pkg holds the name as text
    loaded_pkgs <- c(loaded_pkgs, pkg)         # record the package that was loaded
  }
}
# Packages that are not installed are skipped, so they never appear in loaded_pkgs.
loaded_pkgs
[1] "dplyr"   "ggplot2"

Notes:

  • Use install.packages("packagename") once per machine or project.
  • Use library(packagename) in each session/script where needed.
  • For reproducibility, consider project environments such as renv.
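
To see which packages still need installing before a script runs, a check along these lines may help. It is a sketch: the package list is an example, and the install call is left commented out so nothing is installed automatically.

```r
# Example list of packages a script depends on
needed <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install
} else {
  message("All required packages are installed.")
}
```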

8 Formatting Flat Files for Loading

Good practices for CSV/TSV flat files:

  • Use a header row with short, clear, alphanumeric column names
    (avoid spaces; use underscores if needed)
  • Use UTF-8 encoding
  • Use a consistent delimiter (comma for CSV, tab for TSV)
  • Represent missing values consistently (e.g., empty cell or NA; avoid “-”, “N/A”, “null”)
  • Use ISO 8601 for dates (YYYY-MM-DD) and include time zones if timestamps are present
  • Avoid embedded line breaks in cells; if present, ensure proper quoting
  • Keep one “tidy” table per file: each row is one observation, each column is one variable
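
Column names that do not follow these conventions can be cleaned in base R. The sketch below uses made-up column names to show one possible approach.

```r
# Made-up column names that break the conventions above
raw_names <- c("Subject ID", "Age (years)", "visit-date", "score")

clean_names <- tolower(raw_names)                    # lower case
clean_names <- gsub("[^a-z0-9]+", "_", clean_names)  # runs of other characters become "_"
clean_names <- gsub("^_+|_+$", "", clean_names)      # trim leading/trailing underscores

clean_names
# "subject_id" "age_years" "visit_date" "score"
```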

8.1 Create and Save a Well-Formatted CSV

# Example tidy dataset
tidy_example <- data.frame(
  subject_id = 1:6,
  group = c("control", "control", "control", "treatment", "treatment", "treatment"),
  age_years = c(34, 45, 51, 29, 40, NA),
  visit_date = as.Date(c("2025-01-10", "2025-01-12", "2025-01-13", "2025-01-11", "2025-01-12", "2025-01-14")),
  score = c(87, 90, 85, 92, 88, 91)
)

# Create a data folder, then save CSV
dir.create("data", showWarnings = FALSE)
csv_path <- file.path("data", "tidy_example.csv")
write.csv(tidy_example, csv_path, row.names = FALSE, na = "")
csv_path
[1] "data/tidy_example.csv"

9 Loading a Dataset (Flat File and Other Resources)

9.1 Loading the CSV with Base R

loaded_base <- read.csv(csv_path, stringsAsFactors = FALSE)
str(loaded_base)  # structure of the dataframe
'data.frame':   6 obs. of  5 variables:
 $ subject_id: int  1 2 3 4 5 6
 $ group     : chr  "control" "control" "control" "treatment" ...
 $ age_years : int  34 45 51 29 40 NA
 $ visit_date: chr  "2025-01-10" "2025-01-12" "2025-01-13" "2025-01-11" ...
 $ score     : int  87 90 85 92 88 91
head(loaded_base) # first six rows (by default)

# Load tidy_example data back in, add a variable, and resave.

tidy_example <- read.csv(csv_path)
tidy_example$score_dichot <- ifelse(tidy_example$score > 90, 1, 0)
write.csv(tidy_example, csv_path, row.names = FALSE, na = "")
head(tidy_example)
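
Note that read.csv() returned visit_date as character rather than Date. One way to control this in base R is the colClasses argument, sketched here on a small inline example (the CSV text is made up so the block runs on its own).

```r
# Made-up CSV content so the example is self-contained
csv_text <- "subject_id,group,visit_date
1,control,2025-01-10
2,treatment,2025-01-12"

demo <- read.csv(
  text = csv_text,
  colClasses = c(subject_id = "integer", group = "character", visit_date = "Date")
)

# Convert group to a factor with an explicit level order
demo$group <- factor(demo$group, levels = c("control", "treatment"))

str(demo)
```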

9.2 Loading the CSV with readr (tidyverse)

# Install readr if needed (run once; eval is FALSE so it won't execute automatically)
# install.packages("readr")
# If readr is available, demonstrate its use safely
if (requireNamespace("readr", quietly = TRUE)) {
  loaded_readr <- readr::read_csv(csv_path, show_col_types = FALSE)
  head(loaded_readr)
}

9.2.1 Handling Column Types and Missing Values Explicitly with readr

if (requireNamespace("readr", quietly = TRUE)) {
  loaded_typed <- readr::read_csv(
    csv_path,
    col_types = readr::cols(
      subject_id = readr::col_integer(),
      group = readr::col_factor(levels = c("control", "treatment")),
      age_years = readr::col_double(),
      visit_date = readr::col_date(),
      score = readr::col_double()
    ),
    show_col_types = FALSE
  )
  str(loaded_typed)
}
spc_tbl_ [6 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ subject_id  : int [1:6] 1 2 3 4 5 6
 $ group       : Factor w/ 2 levels "control","treatment": 1 1 1 2 2 2
 $ age_years   : num [1:6] 34 45 51 29 40 NA
 $ visit_date  : Date[1:6], format: "2025-01-10" "2025-01-12" ...
 $ score       : num [1:6] 87 90 85 92 88 91
 $ score_dichot: num [1:6] 0 0 0 1 0 1
 - attr(*, "spec")=
  .. cols(
  ..   subject_id = col_integer(),
  ..   group = col_factor(levels = c("control", "treatment"), ordered = FALSE, include_na = FALSE),
  ..   age_years = col_double(),
  ..   visit_date = col_date(format = ""),
  ..   score = col_double(),
  ..   score_dichot = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

9.2.2 Reading Excel Files

# install.packages("readxl")  # run once
if (requireNamespace("readxl", quietly = TRUE)) {
  # Example: readxl::read_excel("data/example.xlsx", sheet = 1)
}
NULL

10 Types of Objects and Variables in R

R works with several fundamental object types:

  • Scalars: single values (e.g., x <- 5)
  • Vectors: ordered collections of values of the same type
  • Factors: categorical variables stored with a fixed set of levels
  • Matrices: 2D arrays (rows × columns) of a single type
  • Arrays: multi-dimensional generalization of matrices
  • Data frames: tabular structures (columns can have different types)
  • Lists: ordered collections of elements that can each be a different type or structure

10.1 Creating Objects in Practice

10.1.1 Scalars

scalar_example <- 42
scalar_example
[1] 42

10.1.2 Numeric and character vectors (scalars are simply vectors of length 1)

a <- c(10, 20, 30)

b <- c("alpha", "beta", "gamma")
a
[1] 10 20 30
b
[1] "alpha" "beta"  "gamma"
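
Because a vector holds values of a single type, R silently converts mixed input to a common type. A short illustration:

```r
# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "a", TRUE)
mixed          # "1" "a" "TRUE"
class(mixed)   # "character"

# Logical values become numbers when combined with numerics
nums_lgl <- c(1, TRUE, FALSE)
nums_lgl         # 1 1 0
class(nums_lgl)  # "numeric"
```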

10.1.3 Factors

treatment <- c("control", "treatment", "control", "treatment", "treatment", "control")
grp <- factor(treatment, levels = c("control", "treatment"))
grp
[1] control   treatment control   treatment treatment control  
Levels: control treatment

10.1.4 Matrices

# Matrices
m <- matrix(1:9, nrow = 3)
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

10.1.5 Arrays

# Arrays (3-dimensional example)
arr <- array(1:24, dim = c(3, 4, 2))
arr
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

10.1.6 Data frames

# Data frames (tabular)
df2 <- data.frame(id = 1:6, group = grp, score = c(88, 92, 85, 91, 87, 90))
df2

10.1.7 Lists

# Lists
lst <- list(scalar = scalar_example, vector = a, matrix = m, array = arr, dataframe = df2)
lst
$scalar
[1] 42

$vector
[1] 10 20 30

$matrix
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

$array
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24


$dataframe
  id     group score
1  1   control    88
2  2 treatment    92
3  3   control    85
4  4 treatment    91
5  5 treatment    87
6  6   control    90
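
Elements of a list can be pulled out by name. The sketch below builds a smaller list (lst_demo, an illustrative name) so it runs on its own.

```r
lst_demo <- list(
  scalar    = 42,
  vector    = c(10, 20, 30),
  dataframe = data.frame(id = 1:3, score = c(88, 92, 85))
)

lst_demo$scalar            # 42
lst_demo[["vector"]][2]    # 20 (second value of the vector element)
lst_demo$dataframe$score   # 88 92 85

# [ ] keeps the list wrapper; [[ ]] returns the element itself
class(lst_demo["vector"])    # "list"
class(lst_demo[["vector"]])  # "numeric"
```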

10.2 How Data Frames Differ from Datasets (Stata/SAS)

If you have used Stata or SAS, you may be used to “datasets” as files on disk.

In R:

  • A data frame is an in-memory object, not a file.
  • You typically read a file (CSV, Stata, SAS, Excel) into a data frame, work with it, then write it back out.
  • R data frames are flexible: columns can have different types.

Conceptually, think of “dataset on disk” (Stata/SAS) versus “data frame in memory” (R), even though they represent similar rectangular data.
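
The distinction can be seen in a short sketch that uses a temporary file: changing the data frame in memory leaves the file on disk untouched until it is written back.

```r
# Write a small table to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(88, 92, 85)), tmp, row.names = FALSE)

# Read it into memory and modify the in-memory copy
scores <- read.csv(tmp)
scores$score <- scores$score + 1

# The file on disk still holds the original values
on_disk <- read.csv(tmp)
on_disk$score   # 88 92 85
scores$score    # 89 93 86
```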

11 Coding in Base R vs RStudio

  • Base R
    • The language + interpreter/engine.
    • Can be used from a terminal or the R GUI.
  • RStudio
    • An IDE that wraps around R.
    • Provides a script editor, console, plots, help, history, projects, Git integration, and more.

You can write exactly the same R code in both environments; RStudio simply makes development more convenient.

12 Snippets: Small Templates or Code Skeletons

Snippets are short templates of code you can insert quickly in RStudio.

Example snippet definition:

snippet fun
  ${1:fname} <- function(${2:x}) {
    ${0}
  }

To see how it works, in your R script type fun and press Tab to expand this into a function template.

To explore snippets in RStudio:

  • Tools → Global Options → Code → Edit Snippets…

Window to edit Snippets

12.1 Create your own snippet

  1. Open the Snippets editor as above.
  2. Add a new snippet definition (e.g., for a histogram):
snippet gg_hist
    ${1:plot_name} <- ggplot(data = ${2:data_name}, mapping = aes(x = ${3:x_var_name})) +
      geom_histogram(fill = "lightblue", color = "lightblue") +
      labs(title = "${4:title_name}", x = "${5:x_axis_name}", y = "${6:y_axis_name}") +
      theme_minimal()
  3. Save and close the editor.
  4. In your script, type gg_hist and press Tab to insert the template.

13 Conducting Analyses

13.1 Descriptive Statistics (Base R)

x <- rnorm(100, mean = 50, sd = 10)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  27.97   42.27   49.51   50.15   57.71   78.19 
mean(x); median(x); sd(x); quantile(x, probs = c(0.25, 0.5, 0.75))
[1] 50.15221
[1] 49.50954
[1] 10.35397
     25%      50%      75% 
42.27453 49.50954 57.70895 
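
Because rnorm() draws random numbers, these values will differ each time the code is run. Setting a seed first makes the draws repeatable; the seed value below is arbitrary.

```r
set.seed(510)                          # arbitrary seed value
x1 <- rnorm(100, mean = 50, sd = 10)

set.seed(510)                          # same seed, same draws
x2 <- rnorm(100, mean = 50, sd = 10)

identical(x1, x2)   # TRUE
```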

13.2 Group-wise Summaries (dplyr, if available)

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  loaded_base %>%
    group_by(group) %>%
    summarise(
      n = n(),
      mean_score = mean(score, na.rm = TRUE),
      mean_age = mean(age_years, na.rm = TRUE)
    )
}
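
If dplyr is not available, a similar group-wise mean can be computed in base R with aggregate(). The sketch uses a small inline data frame with the same scores as the tidy example so it runs on its own.

```r
scores_demo <- data.frame(
  group = c("control", "control", "control", "treatment", "treatment", "treatment"),
  score = c(87, 90, 85, 92, 88, 91)
)

# Mean score within each group
group_means <- aggregate(score ~ group, data = scores_demo, FUN = mean)
group_means
```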

13.3 Visualization (ggplot2, if available)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(loaded_base, aes(x = group, y = score, fill = group)) +
    geom_boxplot() +
    geom_jitter(width = 0.1, alpha = 0.6) +
    labs(title = "Scores by Group", x = "Group", y = "Score") +
    theme_minimal()
}

ggplot() expects a data frame, usually in long (tidy) format with one row per observation. In this case, loaded_base is already in a format that ggplot can use.

ggplot2 has a consistent syntax that is easy to learn. Here are two ggplot2 resources:

  • https://www.data-to-viz.com/
  • https://r-graph-gallery.com/line-chart-ggplot2.html

13.4 Linear Regression

fit <- lm(mpg ~ wt + cyl, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2893 -1.5512 -0.4684  1.5743  6.1004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
wt           -3.1910     0.7569  -4.216 0.000222 ***
cyl          -1.5078     0.4147  -3.636 0.001064 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185 
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

# <- is the assignment operator
# lm() fits a linear regression model
# mpg is the outcome (y variable)
# ~ is the formula operator; it separates the outcome from the predictors
# wt and cyl are the predictors (x variables)
# mtcars is the data frame (a dataset built into R)
# General form: lm(y ~ x + z, data = dataframe)
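
Once a model is fitted, its pieces can be extracted with standard accessor functions. A brief sketch follows; the new_car values are made up for illustration.

```r
fit <- lm(mpg ~ wt + cyl, data = mtcars)

coef(fit)      # estimated coefficients
confint(fit)   # 95% confidence intervals for the coefficients

# Predict mpg for a hypothetical car: weight 3 (1000 lbs), 6 cylinders
new_car <- data.frame(wt = 3, cyl = 6)
predicted_mpg <- predict(fit, newdata = new_car)
predicted_mpg  # roughly 21 mpg, consistent with the coefficients above
```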

13.5 T-test (Group Comparison)

t.test(mpg ~ am, data = mtcars)

    Welch Two Sample t-test

data:  mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group 0 mean in group 1 
       17.14737        24.39231 
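
The test result is an object whose components can be used in later code rather than read off the printout. For example:

```r
res <- t.test(mpg ~ am, data = mtcars)

res$p.value    # p-value of the test
res$conf.int   # confidence interval for the difference in means
res$estimate   # the two group means
```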

13.6 Contingency Table and Chi-Squared Test

tbl <- table(mtcars$cyl, mtcars$gear)
tbl
   
     3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2

chisq.test(tbl)

    Pearson's Chi-squared test

data:  tbl
X-squared = 18.036, df = 4, p-value = 0.001214
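
With a table this sparse, R typically warns that the chi-squared approximation may be incorrect. The sketch below inspects the expected counts and, as one possible alternative, runs Fisher's exact test; whether that alternative suits a given analysis is a judgment call.

```r
tbl <- table(mtcars$cyl, mtcars$gear)

# suppressWarnings() hides the approximation warning while we inspect the result
res <- suppressWarnings(chisq.test(tbl))
res$expected            # expected counts under independence
any(res$expected < 5)   # TRUE here: several cells have small expected counts

# Fisher's exact test does not rely on the large-sample approximation
fisher.test(tbl)$p.value
```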

14 Saving Datasets and Objects

14.1 Save to CSV (Portable)

out_csv <- file.path("data", "mtcars_export.csv")
dir.create("data", showWarnings = FALSE)
write.csv(mtcars, out_csv, row.names = FALSE)
out_csv
[1] "data/mtcars_export.csv"

14.2 Save to RDS (Single Object, Preserves R Types)

out_rds <- file.path("data", "mtcars.rds")
saveRDS(mtcars, out_rds)
mtcars_loaded <- readRDS(out_rds)
identical(mtcars, mtcars_loaded)
[1] TRUE
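
To illustrate what “preserves R types” means, the following sketch saves a small data frame with a factor column both ways, using temporary files. The CSV round trip returns the column as plain text (in R 4.0 and later), while RDS keeps it as a factor.

```r
df_types <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

csv_tmp <- tempfile(fileext = ".csv")
rds_tmp <- tempfile(fileext = ".rds")
write.csv(df_types, csv_tmp, row.names = FALSE)
saveRDS(df_types, rds_tmp)

class(read.csv(csv_tmp)$group)   # "character": the factor was not preserved
class(readRDS(rds_tmp)$group)    # "factor"
```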

14.3 Save Multiple Objects to .RData (Workspace-like)

out_rdata <- file.path("data", "analysis_objects.RData")
obj1 <- 123
obj2 <- data.frame(x = 1:3, y = c("a", "b", "c"))
save(obj1, obj2, file = out_rdata)
rm(obj1, obj2)
load(out_rdata)
obj1; obj2
[1] 123
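
Note that load() restores objects under their original names and overwrites any objects with the same names in the current workspace. One way to avoid surprises is to load into a separate environment first, as in this sketch (object and file names are illustrative).

```r
tmp_rdata <- tempfile(fileext = ".RData")
a <- 1
b <- "two"
save(a, b, file = tmp_rdata)

# Load into a fresh environment instead of the global workspace
e <- new.env()
loaded_names <- load(tmp_rdata, envir = e)

loaded_names   # names of the restored objects
e$a            # 1
```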

15 Saving and Rendering Quarto Files

15.1 What is Quarto?

Quarto is a modern open-source scientific and technical publishing system built on Pandoc. It allows you to create dynamic documents, reports, presentations, and websites that combine text, code, and output.

15.2 How to create Quarto documents

  • In RStudio, go to File → New File → Quarto Document.
  • Choose an output format (e.g., HTML, PDF, Word) and click OK.
  • Write your content using Markdown and embed R code chunks using triple backticks with {r}.
  • Save the file with a .qmd extension.
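
To show what such a file contains, the sketch below writes a minimal .qmd to a temporary folder from R. The title and file name are hypothetical, and the three-backtick fence is built with strrep() only so that it can appear inside this example.

```r
# Build the chunk fence (three backticks) programmatically
fence <- strrep("`", 3)

qmd_lines <- c(
  "---",
  "title: \"My first report\"",   # hypothetical title
  "format: html",
  "---",
  "",
  "## Summary of mpg",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars$mpg)",
  fence
)

qmd_path <- file.path(tempdir(), "my_first_report.qmd")  # hypothetical file name
writeLines(qmd_lines, qmd_path)
cat(readLines(qmd_path), sep = "\n")
```

The block between the two --- lines is the YAML header; the fenced section is an R code chunk that runs when the document is rendered.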

15.3 Rendering Quarto from R

# Quarto render (requires Quarto installed as a separate tool from https://quarto.org/)
if (requireNamespace("quarto", quietly = TRUE)) {
  # quarto::quarto_render("your_document.qmd")
}
NULL

16 Where to Find Help

16.1 Help files in R

?mean
help("lm")
help.search("linear model")
vignette()
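
A few other built-in helpers are useful when you only remember part of a function name or want to see its arguments. A short sketch:

```r
# Find objects whose names match a pattern (regular expression)
hits <- apropos("^read\\.csv")
hits

# Show a function's arguments and defaults
args(read.csv)

# Run the examples from a help page (uncomment to try)
# example(mean)
```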

16.2 Posit Cheat Sheets

16.3 R Community Resources

16.4 Your favorite AI tools

  • Use AI tools like ChatGPT to get coding help, explanations, and examples.
  • GitHub Copilot can assist with code completion and suggestions.
  • Always verify AI-generated code for accuracy and best practices.
  • Use AI as a supplement, not a replacement for learning R fundamentals.

17 Sources and Further Reading