Acknowledgment

Some of the text and exercises included in this assignment are from R for Data Science (2e) by Hadley Wickham, Mine Cetinkaya-Rundel, and Garrett Grolemund (2nd edition, 2023), available free online at https://r4ds.hadley.nz/.

Lab 1

Welcome to Lab 1! Today we will familiarize ourselves with R and learn how to load data. We will also learn about variable classes, how data are stored, and start learning how to manipulate and export data.

Objectives

Instructions

Work through the exercises below. You should be able to run all code chunks, and there will be instructions for when you need to add/modify the code. Some sections ask for a written response, as indicated by ✏️ Your response here:

Bullet points and brief responses are okay. All lab assignments are graded on completion only, not accuracy.


Familiarizing Yourself with RStudio

Explore the following areas in the RStudio interface:

Question: Which of these four panes will you do most of your code writing/editing in?

✏️ Your response here: Script Editor


R Scripts vs Quarto Documents

Activity: Open a new R script in RStudio: File → New File → R Script

Question: What’s one difference between .R and .qmd files?

✏️ Your response here: Can render as HTML, PDF, and Word.


Organization in Quarto

In the bottom left of the script editor, you should see a hashtag (#). Click on it, and you’ll see all the headings in this document.

Activity: Click on one of the other headings and see what happens.


Source vs Visual Mode in Quarto

Quarto documents can be edited in two modes: Source and Visual.

Switch between modes using the buttons in the top-left corner of the editor.

Activity: Practice switching between Source and Visual mode.

Question: Which mode do you prefer and why?

✏️ Your response here: I prefer Source mode, because it gives me more control over what I am working on.


Running R code

In Quarto documents, there are two ways to run code. We will try them both.

First, in Source mode only, you can run code directly in the script editor by highlighting it and pressing Command + Return on a Mac or Control + Enter on a PC. Try that below. You should see the result (3) appear in your console.

1 + 2

Second, you can run code in a code chunk (in Source and Visual Mode). Clicking the green forward arrow will run all the code in the chunk. Try that below. You should see the result (3) appear directly below the chunk.

1 + 2
## [1] 3

Activity: Create a new code chunk below by finding the “insert a new code chunk” button (green c with a plus sign, near the top right of the script editor). Play around with running code within this chunk.

Insert code chunk here

print("Hello World")
## [1] "Hello World"

Creating Vectors and Data Frames

Data can be stored in R in several different formats. Today we’ll focus on vectors and data frames.

Vectors are one-dimensional sets of values of the same type (e.g., all numbers or all characters). Run the code below to create a few vectors:

#create a vector equal to 12
x <- 3 * 4
x
## [1] 12
#create a vector of odd numbers
odd_nums <- c(1,3,5,7,9)
odd_nums
## [1] 1 3 5 7 9
#create a vector of years
years <- c(2019, 2020, 2021, 2022, 2023, 2024, 2025)
years
## [1] 2019 2020 2021 2022 2023 2024 2025
#create a vector of fruits
fruits <- c("bananas", "apples", "tangerines", "watermelon", "grapes")
fruits
## [1] "bananas"    "apples"     "tangerines" "watermelon" "grapes"
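
Because a vector holds only one type, R quietly converts mixed values to a common type. The short sketch below illustrates this with made-up vectors (nums and mixed are names chosen for this example, not part of the lab):

```r
# A numeric vector: every element is a number
nums <- c(1, 3, 5)
class(nums)    # "numeric"

# Mixing types in c() coerces every element to character
mixed <- c(1, "apple", TRUE)
class(mixed)   # "character"
mixed          # "1" "apple" "TRUE"

# length() reports how many elements a vector has
length(mixed)  # 3
```

If a column you expected to be numeric shows up as character, a stray text value somewhere in it is a likely cause.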

Data frames are two-dimensional, table-like structures (like a spreadsheet). Each column is a vector, and different columns can hold different data types. We will most often work with data frames in this course. Run the code below to create a data frame:

#create a data frame called "people" with 3 variables called "name" "age" and "student"
people <- data.frame(
  name = c("Alice", "Bob", "Chris", "Dana"),
  age = c(25, 30, 22, 28),
  student = c(TRUE, FALSE, TRUE, TRUE)
)

To look at a variable within a data frame, use the convention dataframe$variable:

people$age
## [1] 25 30 22 28
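
The $ convention also works inside other functions. The sketch below re-creates the people data frame so that it runs on its own, then tries a few base R helpers on it:

```r
# Re-create the data frame from above so this chunk is self-contained
people <- data.frame(
  name = c("Alice", "Bob", "Chris", "Dana"),
  age = c(25, 30, 22, 28),
  student = c(TRUE, FALSE, TRUE, TRUE)
)

nrow(people)   # 4 rows (observations)
ncol(people)   # 3 columns (variables)
names(people)  # "name" "age" "student"

mean(people$age)     # 26.25
sum(people$student)  # 3, because each TRUE counts as 1

# Use one column to pick values from another
people$name[people$age > 24]  # "Alice" "Bob" "Dana"
```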

Question: Look at your environment (upper right). Where are data frames saved and where are vectors saved? What happens if you click on the data frame called “people”?

✏️ Your response here: Data frames are saved under “Data,” while vectors are saved under “Values.” If you click on “people” it opens the data frame as a table in the Script Editor.

Activity: In the code chunk below, create a simple vector and a data frame. Use # to add comments explaining what each line does.

# Create a vector (e.g., a number or a word, or a list of numbers or words)
y <- c("Hello", "Howdy", "Hi")      #list of words to greet someone

# Create a simple data frame with at least 2 columns and 3 rows
family <- data.frame(
  name = c("Santiago", "Claudia", "Jacalyn"), #names of people in my family
  height = c(72, 61, 64), #their heights in inches
  profession = c("Data Project Manager", "Dog Sitter", "Retiree") #their jobs
)

What Are Packages?

Packages are collections of functions that extend R’s capabilities. The one we are going to load today is tidyverse, which includes tools for data cleaning, visualization, and more. Run the code below to install and load the tidyverse package.

# Install the tidyverse package - only need to do this once (un-comment the line below)
#install.packages("tidyverse")

#Load the tidyverse package - will need to do this every time you start a new R session
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.0
## v ggplot2   3.4.2     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
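
The Conflicts lines in the output above mean that two packages define a function with the same name, and the most recently loaded one wins. If you ever need a specific version, you can name the package explicitly with package::function(). A small sketch, using a made-up data frame called toy:

```r
library(dplyr)

toy <- data.frame(x = c(1, 5, 10))  # made-up data for illustration

# dplyr's filter() keeps the rows that match a condition
big <- dplyr::filter(toy, x > 3)
nrow(big)  # 2

# stats' filter() is a different function: it applies a linear filter to a series
smoothed <- stats::filter(c(1, 2, 3, 4, 5), rep(1/3, 3))
length(smoothed)  # 5
```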

Other packages

There are many other excellent packages that are not part of the tidyverse because they solve problems in a different domain or are designed with a different set of underlying principles. As you tackle more data science projects with R, you’ll learn new packages and new ways of thinking about data. Anytime you encounter a new package you’d like to use, the steps are the same:

1. Install the new package by running install.packages("packagename"). You only need to do this once; after that, the package stays installed.
2. Load the package by running library(packagename). You need to do this every time you start a new R session and want to use that package.
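
Before installing, you can check whether a package is already present. This is a minimal sketch that only prints a status message and installs nothing; requireNamespace() is a base R function, and pkg and installed are names chosen for this example:

```r
pkg <- "tidyverse"

# TRUE if the package is installed, FALSE otherwise (nothing is attached)
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  message(pkg, " is already installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") once")
}
```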


Loading a CSV File

To begin, we’ll focus on a common data file type: CSV (comma-separated values). The dataset we’ll use today contains a comprehensive list of the most famous songs of 2023 as listed on Spotify and is found here: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023

We can read this file into R using read_csv(). The first argument is the most important: the path to the file. You can think about the path as the address of the file: the file is called spotify-2023.csv and it lives in the Data folder. There are two types of file paths, relative and absolute.

Once you’re inside a project, you should try to use relative paths and not absolute paths. What’s the difference? A relative path is relative to the working directory, i.e. the project’s home. The path in my code below, Data/spotify-2023.csv, is a relative path; it is a shortcut for the absolute path /Users/esherwin/UCBExtension/Summer2025/CourseFiles/x462/Data/spotify-2023.csv.

Absolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g., C:) or two backslashes (e.g., \\servername) and on Mac/Linux they start with a slash “/” (e.g., /users/hadley). You should almost never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
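
To see how a relative path resolves on your own computer, you can try the base R helpers below. This is a sketch; the values printed by getwd() and file.exists() depend on where your project lives, so none are shown:

```r
# The working directory: relative paths are resolved from here
getwd()

# Build the relative path used in this lab; file.path() joins the parts with "/"
rel_path <- file.path("Data", "spotify-2023.csv")
rel_path  # "Data/spotify-2023.csv"

# TRUE only if the file exists relative to the current working directory
file.exists(rel_path)

# The absolute path that the relative path resolves to on this computer
file.path(getwd(), rel_path)
```

If file.exists() returns FALSE, the working directory is probably not the project folder.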

#relative path: the same line should work on both Mac and PC, because R accepts forward slashes on Windows too
spotify_data <- read_csv("Data/spotify-2023.csv")
## Rows: 953 Columns: 24
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): track_name, artist(s)_name, streams, key, mode
## dbl (17): artist_count, released_year, released_month, released_day, in_spot...
## num  (2): in_deezer_playlists, in_shazam_charts
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#if the relative path above does not work for you, you can try using the absolute path (copy the pathname of the csv file on your computer)

Other common file types include Excel files (.xlsx), TSV (.tsv), and .rds. The course textbook is a good reference for learning how to load these file types.
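
As a rough illustration of two of these formats, the sketch below writes a tiny made-up data frame (toy) to temporary .tsv and .rds files and reads each one back. It assumes the readr package is installed (it comes with the tidyverse). For Excel files, the readxl package provides read_excel(), shown only as a comment because it needs a real file:

```r
library(readr)

toy <- data.frame(song = c("A", "B"), bpm = c(120, 95))  # made-up data

# TSV: tab-separated values, read with read_tsv()
tsv_path <- tempfile(fileext = ".tsv")
write_tsv(toy, tsv_path)
toy_tsv <- read_tsv(tsv_path, show_col_types = FALSE)

# RDS: R's own format, which stores a single R object exactly as it is
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
toy_rds <- readRDS(rds_path)
identical(toy, toy_rds)  # TRUE

# Excel (hypothetical path; requires the readxl package):
# readxl::read_excel("path/to/file.xlsx")
```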


Exploring Data Types

The abbreviations <chr>, <dbl>, etc. tell you the variable class. Here are some of the more common variable classes:

| Class | Description | Example |
|---|---|---|
| numeric | Numbers with or without decimals | 3, 3.14 |
| double | Numeric values with decimals (a subtype of numeric) | 3.14 |
| character | Text strings | "hello" |
| logical | TRUE/FALSE values | TRUE, FALSE |

# Explore your data frame
glimpse(spotify_data)
## Rows: 953
## Columns: 24
## $ track_name           <chr> "Seven (feat. Latto) (Explicit Ver.)", "LALA", "v~
## $ `artist(s)_name`     <chr> "Latto, Jung Kook", "Myke Towers", "Olivia Rodrig~
## $ artist_count         <dbl> 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1~
## $ released_year        <dbl> 2023, 2023, 2023, 2019, 2023, 2023, 2023, 2023, 2~
## $ released_month       <dbl> 7, 3, 6, 8, 5, 6, 3, 7, 5, 3, 4, 7, 1, 4, 3, 12, ~
## $ released_day         <dbl> 14, 23, 30, 23, 18, 1, 16, 7, 15, 17, 17, 7, 12, ~
## $ in_spotify_playlists <dbl> 553, 1474, 1397, 7858, 3133, 2186, 3090, 714, 109~
## $ in_spotify_charts    <dbl> 147, 48, 113, 100, 50, 91, 50, 43, 83, 44, 40, 55~
## $ streams              <chr> "141381703", "133716286", "140003974", "800840817~
## $ in_apple_playlists   <dbl> 43, 48, 94, 116, 84, 67, 34, 25, 60, 49, 41, 37, ~
## $ in_apple_charts      <dbl> 263, 126, 207, 207, 133, 213, 222, 89, 210, 110, ~
## $ in_deezer_playlists  <dbl> 45, 58, 91, 125, 87, 88, 43, 30, 48, 66, 54, 21, ~
## $ in_deezer_charts     <dbl> 10, 14, 14, 12, 15, 17, 13, 13, 11, 13, 12, 5, 58~
## $ in_shazam_charts     <dbl> 826, 382, 949, 548, 425, 946, 418, 194, 953, 339,~
## $ bpm                  <dbl> 125, 92, 138, 170, 144, 141, 148, 100, 130, 170, ~
## $ key                  <chr> "B", "C#", "F", "A", "A", "C#", "F", "F", "C#", "~
## $ mode                 <chr> "Major", "Major", "Major", "Major", "Minor", "Maj~
## $ `danceability_%`     <dbl> 80, 71, 51, 55, 65, 92, 67, 67, 85, 81, 57, 78, 7~
## $ `valence_%`          <dbl> 89, 61, 32, 58, 23, 66, 83, 26, 22, 56, 56, 52, 6~
## $ `energy_%`           <dbl> 83, 74, 53, 72, 80, 58, 76, 71, 62, 48, 72, 82, 6~
## $ `acousticness_%`     <dbl> 31, 7, 17, 11, 14, 19, 48, 37, 12, 21, 23, 18, 6,~
## $ `instrumentalness_%` <dbl> 0, 0, 0, 0, 63, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17,~
## $ `liveness_%`         <dbl> 8, 10, 31, 11, 11, 8, 8, 11, 28, 8, 27, 15, 3, 9,~
## $ `speechiness_%`      <dbl> 4, 4, 6, 15, 6, 24, 3, 4, 9, 33, 5, 7, 7, 3, 6, 4~
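
You can also check the class of a single value or column directly with class() and typeof(). A small sketch with made-up values; streams_text is a name chosen for this example, and the integer class is shown only for contrast:

```r
class(3.14)     # "numeric"
typeof(3.14)    # "double"
class(3L)       # "integer" (the L suffix asks for a whole number)
class("hello")  # "character"
class(TRUE)     # "logical"

# A number stored as text (like the streams column) is still character...
streams_text <- "141381703"
class(streams_text)              # "character"

# ...until it is converted
class(as.numeric(streams_text))  # "numeric"
```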

Question: How many rows (observations) and columns (variables) does this data frame contain? Which variable classes are represented?

✏️ Your response here: It contains 953 observations and 24 variables, representing the character and double datatypes.

Now let’s look at some descriptive statistics for a few of the variables:

#summary is for numeric/double variables
summary(spotify_data$released_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1930    2020    2022    2018    2022    2023
#table is for character/categorical variables
table(spotify_data$key, useNA = "ifany")  #you can also do useNA = "always"
## 
##    A   A#    B   C#    D   D#    E    F   F#    G   G# <NA> 
##   75   57   81  120   81   33   62   89   73   96   91   95

Question: Remove the useNA = "ifany" part of the table command above. What do you notice? Why might it be important to always include useNA = "ifany" or useNA = "always"?

✏️ Your response here: If I remove that, songs that don’t have a value specified for key disappear. It’s important to include those useNA values so you can get a sense of the full scope of the dataset.
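
The same behavior can be seen on a tiny made-up vector (keys is a name chosen for this example), where the missing values are easy to count by eye:

```r
keys <- c("A", "C#", NA, "A", NA)

# By default, table() silently leaves out missing values
default_counts <- table(keys)
sum(default_counts)  # 3

# useNA = "ifany" adds an <NA> column when missing values are present
full_counts <- table(keys, useNA = "ifany")
sum(full_counts)     # 5

# Count the missing values directly
sum(is.na(keys))     # 2
```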


The Pipe Operator (%>% or |>)

Pipes allow you to chain commands together in a readable way. The %>% pipe comes from the magrittr package and becomes available when you load dplyr, a core member of the tidyverse (dplyr should automatically load when you load the tidyverse package, or you can load it separately if you prefer). The |> pipe is built into base R (version 4.1 and later) and needs no package.

You will often see people code in base R or using dplyr (with the pipe). There is no wrong or right way. I was trained using dplyr, so that’s what you’ll often see in this course, but not always.

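To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. To have the shortcut insert |> instead of %>%, you’ll need to change one RStudio option: check “Use native pipe operator, |>” under Tools → Global Options → Code (shown in Figure 3.1 of R for Data Science).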
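
A pipe passes the result on its left as the first argument of the function on its right. The sketch below shows the same calculation written as nested calls and as a pipeline. It uses the base R pipe, so it assumes R 4.1 or later; nums, nested, and piped are names chosen for this example:

```r
nums <- c(4, 9, 16)

# Nested calls: read from the inside out
nested <- sum(sqrt(nums))

# Piped calls: read from left to right
piped <- nums |> sqrt() |> sum()

nested  # 9
piped   # 9
identical(nested, piped)  # TRUE
```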

The code in the chunk below is written in dplyr:

#make a new dataframe of only songs released in 2023
songs2023 <- spotify_data %>% filter(released_year == 2023)

#keep only the variables we're interested in
songs2023 <- songs2023 %>% select(track_name, `artist(s)_name`, released_year, streams, bpm, `danceability_%`, `acousticness_%`, `instrumentalness_%`, `speechiness_%`)
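
The two steps above can also be chained into a single statement. Here is a sketch of that pattern on a small made-up data frame (toy), assuming dplyr is installed:

```r
library(dplyr)

toy <- data.frame(
  track = c("x", "y", "z"),
  released_year = c(2022, 2023, 2023),
  bpm = c(100, 120, 140)
)

# Filter the rows, then select the columns, in one pipeline
toy_2023 <- toy %>%
  filter(released_year == 2023) %>%
  select(track, bpm)

dim(toy_2023)  # 2 2 (rows, then columns)
```

Whether you chain the steps or keep them separate is a matter of readability; the result is the same.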

Question: How many rows and columns does your new data frame (songs2023) have?

✏️ Your answer here: It has 9 columns and 175 rows

The code in the chunk below does exactly the same thing as the code above, but it’s written in base R.

#make a new dataframe of only songs released in 2023
songs2023 <- spotify_data[spotify_data$released_year == 2023, ]

#keep only the variables we're interested in
songs2023 <- songs2023[, c("track_name", "artist(s)_name", "released_year", "streams", 
                             "bpm", "danceability_%", "acousticness_%", 
                             "instrumentalness_%", "speechiness_%")]

Question: Look at the differences between the code in dplyr and base R. Does one seem more intuitive to you?

✏️ Your answer here: The dplyr version seems more intuitive because it associates the name of the function more closely with the items the function is being applied to.
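
One caveat, offered as a side note: the two approaches treat missing values differently. dplyr's filter() drops rows where the condition is NA, whereas base R bracket subsetting returns a row of NAs for each of them. The summary() output for released_year earlier in this lab reports no missing values, so both versions give the same result here. The sketch below uses a made-up data frame (toy) to show the base R behavior and one way to avoid it:

```r
toy <- data.frame(
  track = c("x", "y", "z"),
  released_year = c(2022, NA, 2023)
)

# The NA in released_year produces an extra row filled with NAs
with_na <- toy[toy$released_year == 2023, ]
nrow(with_na)  # 2: one row of NAs plus track "z"

# which() keeps only the positions where the condition is TRUE
no_na <- toy[which(toy$released_year == 2023), ]
nrow(no_na)    # 1
```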


Exporting Data

Let’s export our new data frame, songs2023, and save it to the folder called “Output” (you’ll need to create that folder within x462 if you haven’t already).

#export CSV file
write_csv(songs2023, "Output/songs2023.csv")

#export RDS file
saveRDS(songs2023, "Output/songs2023.rds")

#export parquet file (uncomment the three lines below to install and load the arrow package, then write the file)
#install.packages("arrow")
#library(arrow)
#write_parquet(songs2023, "Output/songs2023.parquet")
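
If the Output folder does not exist yet, you can also create it from R and then confirm that a file was written. The sketch below works in a temporary directory so it does not touch your course folder; out_dir, toy, and rds_path are names chosen for this example:

```r
# Use a temporary location so this sketch does not touch the course folder
out_dir <- file.path(tempdir(), "Output")

# Create the folder only if it is missing
if (!dir.exists(out_dir)) {
  dir.create(out_dir)
}

toy <- data.frame(track = c("x", "y"), bpm = c(120, 95))  # made-up data
rds_path <- file.path(out_dir, "toy.rds")
saveRDS(toy, rds_path)

# Confirm the file was written and reads back unchanged
file.exists(rds_path)              # TRUE
identical(readRDS(rds_path), toy)  # TRUE
```

In the lab itself, the equivalent would be dir.create("Output") run from the x462 project folder.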

Rendering Your Document

Click the Render button in the toolbar. This should output an HTML file.

You can also render to other formats, such as PDF and Word.


(Optional) Comments or Questions

✏️ Please share any comments or questions related to this lab: Thanks for putting this together!