R Programming - Class 2: Packages, Data Import, and Tidyverse Basics

Author

Akash Mitra

🎯 Objectives

By the end of this class, you will be able to:

  • Install and load R packages.
  • Import .csv and .xlsx files into R.
  • Understand what the tidyverse is and how to use it.
  • Perform basic data manipulation using dplyr.

🗂️ Files associated with this class can be downloaded from the links:


🧭 Class Outline


1. 📦 Packages in R

  • Packages extend R's functionality.
  • Use install.packages() to install.
  • Use library() to load into your session.
# Installing a package (only needed once; remove the leading # to run)
#install.packages("tidyverse")

# Loading a package
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  • A package is installed only once, but it must be loaded with library() each time you start a new R session.
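The install-once, load-every-session rule can be sketched as a small helper that installs only the packages that are missing. The function name install_if_missing is an illustrative assumption, not part of base R or the tidyverse.

```r
# Hypothetical helper: install any packages that are not yet available
install_if_missing <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install.packages(to_install)  # needs an internet connection
  }
  invisible(to_install)
}

# "stats" and "utils" ship with R, so nothing is installed here
not_found <- install_if_missing(c("stats", "utils"))

# For this class you might call:
# install_if_missing(c("tidyverse", "readxl"))
```

After the install step, library() is still needed in every new session.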

2. 📁 Importing Data

🔹 Import CSV

# Using base R
data <- read.csv("data.csv")

# Using readr (part of tidyverse)
library(readr)
data <- read_csv("data.csv")
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): name, department
dbl (2): age, score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
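As the message suggests, the column types can be stated explicitly instead of being guessed. The sketch below writes a temporary CSV whose contents are an assumption, reconstructed from the output shown in this class, and reads it back with a col_types specification.

```r
library(readr)

# Write the four-row example table to a temporary CSV file
# (contents assumed from the output shown in this class)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,age,score,department",
  "John,25,80,Sales",
  "Sara,30,95,HR",
  "Alex,22,85,Sales",
  "Maya,28,90,Marketing"
), csv_path)

# Declare the column types explicitly instead of letting readr guess;
# this also suppresses the column specification message
data <- read_csv(
  csv_path,
  col_types = cols(
    name       = col_character(),
    age        = col_double(),
    score      = col_double(),
    department = col_character()
  )
)

spec(data)  # shows the column specification that was used
```

Stating the types up front makes the import fail loudly, rather than silently, if a column later arrives in an unexpected format.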

🔸 Import Excel

# Requires the readxl package (installed with the tidyverse, but loaded separately)
#install.packages("readxl")
library(readxl)

data <- read_excel("data.xlsx")
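Excel workbooks often contain several sheets. The sketch below uses the example workbook that ships with readxl as a stand-in for data.xlsx, lists its sheets, and reads part of one sheet by name.

```r
library(readxl)

# Example workbook bundled with readxl (a stand-in for data.xlsx)
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheet names in the workbook
sheet_names <- excel_sheets(xlsx_path)
sheet_names

# Read one sheet by name and keep only the first 5 data rows
iris_head <- read_excel(xlsx_path, sheet = "iris", n_max = 5)
iris_head
```

When sheet is omitted, read_excel() reads the first sheet of the workbook.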

3. 🌍 Introduction to Tidyverse

  • A collection of packages for data science. The core packages attached by library(tidyverse) are:
    • ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, lubridate
library(tidyverse)
  • The tidyverse promotes a consistent and readable syntax: most functions take the data as their first argument, so steps can be chained together.
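To illustrate the point about readable syntax, here is a minimal comparison of the same task in base R and in dplyr. The tibble is a stand-in for the class dataset, built by hand so the example runs without any files.

```r
library(dplyr)

# Stand-in for the class dataset
data <- tibble::tibble(
  name       = c("John", "Sara", "Alex", "Maya"),
  age        = c(25, 30, 22, 28),
  score      = c(80, 95, 85, 90),
  department = c("Sales", "HR", "Sales", "Marketing")
)

# Base R: rows and columns are selected with [ , ] and $
base_result <- data[data$age > 25, c("name", "score")]

# dplyr: the same steps written as a sequence of named verbs
dplyr_result <- data |>
  filter(age > 25) |>
  select(name, score)

base_result
dplyr_result
```

Both approaches return the rows for Sara and Maya; the dplyr version names each step, which tends to help as the number of steps grows.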

4. 🧹 Basic Data Manipulation with dplyr

🔍 View and inspect your data

head(data)
# A tibble: 4 × 4
  name    age score department
  <chr> <dbl> <dbl> <chr>     
1 John     25    80 Sales     
2 Sara     30    95 HR        
3 Alex     22    85 Sales     
4 Maya     28    90 Marketing 
glimpse(data)
Rows: 4
Columns: 4
$ name       <chr> "John", "Sara", "Alex", "Maya"
$ age        <dbl> 25, 30, 22, 28
$ score      <dbl> 80, 95, 85, 90
$ department <chr> "Sales", "HR", "Sales", "Marketing"

✨ Key dplyr functions

Function      Description
select()      Choose columns
filter()      Subset rows
mutate()      Create new variables
arrange()     Sort rows
summarise()   Aggregate values
group_by()    Group data

✅ Chaining operations with the pipe:

# The pipe operator, '|>' (base R 4.1+) or '%>%' (magrittr), passes the result on its left to the function on its right
df <- read_excel("data.xlsx")

df |> 
  arrange(desc(age)) |> 
  summarise(mean = mean(age))
# A tibble: 1 × 1
   mean
  <dbl>
1  26.2
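The pipe places the value on its left into the first argument of the function on its right, so x |> f(y) means f(x, y). A minimal sketch with a plain numeric vector (the ages from the class dataset) shows the two forms side by side.

```r
ages <- c(25, 30, 22, 28)

# Nested calls are read from the inside out
nested <- round(mean(ages), 1)

# The same computation with the native pipe (R 4.1 or later) reads left to right
piped <- ages |> mean() |> round(1)

# Both forms give the same value
identical(nested, piped)
```

For simple chains like this one, the magrittr pipe %>% can be used in the same way.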
# Selecting columns
data %>% select(name, age)
# A tibble: 4 × 2
  name    age
  <chr> <dbl>
1 John     25
2 Sara     30
3 Alex     22
4 Maya     28
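Beyond naming columns one by one, select() can also drop columns, pick them by type, or rename them. The sketch below uses a hand-built stand-in for the class dataset; the new column name employee is a hypothetical choice.

```r
library(dplyr)

# Stand-in for the class dataset
data <- tibble::tibble(
  name       = c("John", "Sara", "Alex", "Maya"),
  age        = c(25, 30, 22, 28),
  score      = c(80, 95, 85, 90),
  department = c("Sales", "HR", "Sales", "Marketing")
)

# Drop a column with a minus sign
no_dept <- data |> select(-department)

# Keep every numeric column
numeric_cols <- data |> select(where(is.numeric))

# Rename while selecting: new_name = old_name
renamed <- data |> select(employee = name, score)

names(no_dept)
names(numeric_cols)
names(renamed)
```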
# Filtering rows
data %>% filter(age > 25)
# A tibble: 2 × 4
  name    age score department
  <chr> <dbl> <dbl> <chr>     
1 Sara     30    95 HR        
2 Maya     28    90 Marketing 
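filter() also accepts several conditions at once. The following sketch, again on a hand-built stand-in for the class dataset, shows AND, OR, and membership tests.

```r
library(dplyr)

# Stand-in for the class dataset
data <- tibble::tibble(
  name       = c("John", "Sara", "Alex", "Maya"),
  age        = c(25, 30, 22, 28),
  score      = c(80, 95, 85, 90),
  department = c("Sales", "HR", "Sales", "Marketing")
)

# Conditions separated by commas are combined with AND
sales_over_22 <- data |> filter(department == "Sales", age > 22)

# Use | for OR
hr_or_high <- data |> filter(department == "HR" | score >= 90)

# Use %in% to match any value in a set
sales_or_marketing <- data |> filter(department %in% c("Sales", "Marketing"))

sales_over_22$name        # John
hr_or_high$name           # Sara, Maya
sales_or_marketing$name   # John, Alex, Maya
```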
# Creating a new column (the score as a proportion of 100, so 80 becomes 0.8)
data %>% mutate(score_percent = score / 100)
# A tibble: 4 × 5
  name    age score department score_percent
  <chr> <dbl> <dbl> <chr>              <dbl>
1 John     25    80 Sales               0.8 
2 Sara     30    95 HR                  0.95
3 Alex     22    85 Sales               0.85
4 Maya     28    90 Marketing           0.9 
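A single mutate() call can create several columns, including logical flags and categories. In the sketch below, the pass mark of 85 and the band labels are hypothetical values chosen only for illustration.

```r
library(dplyr)

# Stand-in for the class dataset
data <- tibble::tibble(
  name       = c("John", "Sara", "Alex", "Maya"),
  age        = c(25, 30, 22, 28),
  score      = c(80, 95, 85, 90),
  department = c("Sales", "HR", "Sales", "Marketing")
)

graded <- data |>
  mutate(
    score_prop = score / 100,   # proportion between 0 and 1
    passed     = score >= 85,   # hypothetical pass mark of 85
    band       = case_when(     # conditions are checked in order
      score >= 90 ~ "high",
      score >= 85 ~ "medium",
      TRUE        ~ "low"
    )
  )

graded
```

case_when() returns the value for the first condition that is TRUE, so the final TRUE line acts as a catch-all.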
# Sorting
data %>% arrange(desc(score))
# A tibble: 4 × 4
  name    age score department
  <chr> <dbl> <dbl> <chr>     
1 Sara     30    95 HR        
2 Maya     28    90 Marketing 
3 Alex     22    85 Sales     
4 John     25    80 Sales     
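arrange() can sort by more than one column; later columns break ties in earlier ones. A minimal sketch on a stand-in for the class dataset:

```r
library(dplyr)

# Stand-in for the class dataset
data <- tibble::tibble(
  name       = c("John", "Sara", "Alex", "Maya"),
  age        = c(25, 30, 22, 28),
  score      = c(80, 95, 85, 90),
  department = c("Sales", "HR", "Sales", "Marketing")
)

# Sort by department (Z to A), then by score (highest first) within each department
sorted <- data |> arrange(desc(department), desc(score))
sorted$name   # Alex, John, Maya, Sara

# Sort, then keep the two highest scores
top_two <- data |> arrange(desc(score)) |> head(2)
top_two$name  # Sara, Maya
```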
# Grouping and summarizing
data %>%
  group_by(department) %>%
  summarise(avg_score = mean(score, na.rm = TRUE))
# A tibble: 3 × 2
  department avg_score
  <chr>          <dbl>
1 HR              95  
2 Marketing       90  
3 Sales           82.5
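summarise() can compute several statistics per group in one call. The sketch below, on a stand-in for the class dataset, adds a row count and a maximum next to the average; the output column names are illustrative choices.

```r
library(dplyr)

# Stand-in for the class dataset
data <- tibble::tibble(
  name       = c("John", "Sara", "Alex", "Maya"),
  age        = c(25, 30, 22, 28),
  score      = c(80, 95, 85, 90),
  department = c("Sales", "HR", "Sales", "Marketing")
)

dept_summary <- data |>
  group_by(department) |>
  summarise(
    n_people  = n(),                        # rows in each group
    avg_score = mean(score, na.rm = TRUE),  # average score per department
    max_age   = max(age),                   # oldest person per department
    .groups   = "drop"                      # return an ungrouped tibble
  )

dept_summary

# count() is a shortcut for group_by() followed by summarise(n = n())
dept_counts <- data |> count(department)
dept_counts
```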