Some of the text and exercises included in this assignment are from R for Data Science (2e) by Hadley Wickham, Mine Cetinkaya-Rundel, Garrett Grolemund. 2nd Edition, 2023. Available free online at https://r4ds.hadley.nz/.
Welcome to Lab 1! Today we will familiarize ourselves with R and learn how to load data. We will also learn about variable classes, how data are stored, and start learning how to manipulate and export data.
Get comfortable using RStudio
Understand the difference between R scripts and Quarto documents
Run basic code and use comments (#) effectively
Install and load packages
Load data
Save variables into vectors and data frames
Explore variable classes (numeric, integer, factor, character)
Export data
Render Quarto documents
Work through the exercises below. You should be able to run all code chunks, and there will be instructions for when you need to add/modify the code. Some sections ask for a written response, as indicated by ✏️ Your response here:
Bullet points and brief responses are okay. All lab assignments are graded on completion only, not accuracy.
Explore the following areas in the RStudio interface:
Script Editor: Top-left pane for writing and saving R code
Console: Bottom-left pane where R code runs interactively
Environment/History: Top-right pane showing objects in memory
Files/Plots/Packages/Help/Viewer: Bottom-right for file navigation, plots, installed packages, etc.
Question: Which of these four panes will you do most of your code writing/editing in?
✏️ Your response here: Script Editor
R script (.R): Plain code, not formatted
Quarto file (.qmd): Combines code and narrative, can render as HTML, PDF, Word. Also integrates with other coding languages not covered in this course (Python, Julia).
You can use either file type to run code. You may also see people using R Markdown files (.rmd) which are just an older version of Quarto files. In this course we will use Quarto files for labs and assignments. However, you are welcome to try out using R Scripts on your own. Regardless of the file format you use, the coding and documentation best practices you learn in this course will be the same.
Activity: Open a new R script in RStudio: File → New File → R Script
Question: What’s one difference between .R and .qmd files?
✏️ Your response here: Can render as HTML, PDF, and Word.
In the bottom left of the script editor, you should see a hashtag (#). Click on it, and you’ll see all the headings in this document.
Activity: Click on one of the other headings and see what happens.
Quarto documents can be edited in:
Source mode: Shows raw code and markdown (for full control)
Visual mode: WYSIWYM style (easier formatting, great for beginners)
Switch between modes using the buttons in the top-left corner of the editor.
Activity: Practice switching between Source and Visual mode.
Question: Which mode do you prefer and why?
✏️ Your response here: I prefer Source mode, because it gives me more control over what I am working on.
In Quarto documents, there are two ways to run code. We will try them both.
First, in Source mode only, you can run code directly in the script editor by highlighting it and pressing Command + Return on a Mac or Ctrl + Enter on a PC. Try that below. You should see the result (3) appear in your console.
1 + 2
Second, you can run code in a code chunk (in Source and Visual Mode). Clicking the green forward arrow will run all the code in the chunk. Try that below. You should see the result (3) appear directly below the chunk.
1 + 2
## [1] 3
Activity: Create a new code chunk below by finding the “insert a new code chunk” button (green c with a plus sign, near the top right of the script editor). Play around with running code within this chunk.
Insert code chunk here
print("Hello World")
## [1] "Hello World"
Data can be stored in R in several different formats. Today we’ll focus on vectors and data frames.
Vectors are a one-dimensional set of values of the same type (e.g., all numbers or all characters). Run the code below to create a few vectors:
#create a vector equal to 12
x <- 3 * 4
x
## [1] 12
#create a vector of odd numbers
odd_nums <- c(1,3,5,7,9)
odd_nums
## [1] 1 3 5 7 9
#create a vector of years
years <- c(2019, 2020, 2021, 2022, 2023, 2024, 2025)
years
## [1] 2019 2020 2021 2022 2023 2024 2025
#create a vector of fruits
fruits <- c("bananas", "apples", "tangerines", "watermelon", "grapes")
fruits
## [1] "bananas" "apples" "tangerines" "watermelon" "grapes"
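Because a vector holds values of a single type, R converts mixed inputs to one common type. The short sketch below illustrates this; the object names `nums` and `mixed` are made up for this example.

```r
# A vector of numbers has class "numeric"
nums <- c(1, 3, 5)
class(nums)    # "numeric"
length(nums)   # 3

# Mixing numbers and text forces every value to become character
mixed <- c(1, "apples", 3)
class(mixed)   # "character"
mixed          # "1" "apples" "3"
```

If a vector you expected to be numeric shows up as character, a stray text value is a likely cause.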
Data frames are two-dimensional, table-like structures (like a spreadsheet). Each column is a vector, and columns can be different data types. We will most often work with data frames in this course. Run the code below to create a data frame:
#create a data frame called "people" with 3 variables called "name" "age" and "student"
people <- data.frame(
name = c("Alice", "Bob", "Chris", "Dana"),
age = c(25, 30, 22, 28),
student = c(TRUE, FALSE, TRUE, TRUE)
)
To look at variables within a data frame, follow the convention of dataframe$variable
people$age
## [1] 25 30 22 28
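A few base R functions give a quick overview of a data frame and its columns. This sketch recreates the `people` data frame from above so it can run on its own.

```r
# Recreate the example data frame
people <- data.frame(
  name = c("Alice", "Bob", "Chris", "Dana"),
  age = c(25, 30, 22, 28),
  student = c(TRUE, FALSE, TRUE, TRUE)
)

nrow(people)           # 4 rows (observations)
ncol(people)           # 3 columns (variables)
names(people)          # "name" "age" "student"

# Each column is a vector with its own class
class(people$age)      # "numeric"
class(people$student)  # "logical"

# Because a column is a vector, vector functions work on it
mean(people$age)       # 26.25
```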
Question: Look at your environment (upper right). Where are data frames saved and where are vectors saved? What happens if you click on the data frame called “people”?
✏️ Your response here: Data frames are saved under “Data,” while vectors are saved under “Values.” If you click on “people” it opens the data frame as a table in the Script Editor.
Activity: In the code chunk below, create a simple vector and a data frame. Use # to add comments explaining what each line does.
# Create a vector (e.g., a number or a word, or a list of numbers or words)
y <- c("Hello", "Howdy", "Hi") #list of words to greet someone
# Create a simple data frame with at least 2 columns and 3 rows
family <- data.frame(
name = c("Santiago", "Claudia", "Jacalyn"), #names of people in my family
height = c(72, 61, 64), #their heights in inches
profession = c("Data Project Manager", "Dog Sitter", "Retiree") #their jobs
)
Packages are collections of functions that extend R’s capabilities. The one we are going to load today is tidyverse, which includes tools for data cleaning, visualization, and more. Run the code below to install and load the tidyverse package.
# Install the tidyverse package - only need to do this once (un-comment the line below)
#install.packages("tidyverse")
#Load the tidyverse package - will need to do this every time you start a new R session
library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr 1.1.2 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.0
## v ggplot2 3.4.2 v tibble 3.2.1
## v lubridate 1.9.2 v tidyr 1.3.0
## v purrr 1.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
There are many other excellent packages that are not part of the tidyverse because they solve problems in a different domain or are designed with a different set of underlying principles. As you tackle more data science projects with R, you’ll learn new packages and new ways of thinking about data. Anytime you encounter a new package you’d like to use, the steps are the same: (1) install the package by running install.packages("packagename") (you only need to do this once, and the package stays installed); (2) load the package by running library(packagename) (you need to do this every time you start a new R session and want to use that package).
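One common way to avoid reinstalling a package every time a script runs is to check whether it is already installed first. The sketch below is one possible pattern, not a required one; `load_or_install` is a hypothetical helper name invented for this example.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call loads it without installing anything.
# Swap in "tidyverse" (or any other package name) in your own work.
load_or_install("stats")
```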
To begin, we’ll focus on a common data file type: CSV (comma-separated values). The dataset we’ll use today contains a comprehensive list of the most famous songs of 2023 as listed on Spotify and is found here: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
We can read this file into R using read_csv(). The first argument is the most important: the path to the file. You can think about the path as the address of the file: the file is called spotify-2023.csv and it lives in the Data folder. There are two types of file paths, relative and absolute.
Once you’re inside a project, you should try to use relative paths and not absolute paths. What’s the difference? A relative path is relative to the working directory, i.e. the project’s home. In my code below, Data/spotify-2023.csv is a relative path, which is a shortcut for the absolute path /Users/esherwin/UCBExtension/Summer2025/CourseFiles/x462/Data/spotify-2023.csv.
Absolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g., C:) or two backslashes (e.g., \\servername) and on Mac/Linux they start with a slash “/” (e.g., /users/hadley). You should almost never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
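If reading a file fails, it often helps to check where R is looking. The sketch below uses only base R functions; the variable name `csv_path` is made up for this example, and the path assumes a Data folder inside your working directory.

```r
# Where is R currently looking for files?
getwd()

# Build a relative path; file.path() joins the pieces with "/"
csv_path <- file.path("Data", "spotify-2023.csv")
csv_path   # "Data/spotify-2023.csv"

# TRUE if the file is where the relative path says it is, FALSE otherwise
file.exists(csv_path)

# Show the absolute path that the relative path points to
normalizePath(csv_path, mustWork = FALSE)
```

If file.exists() returns FALSE, either the working directory is not the project folder or the file is not in the Data folder.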
#if you have a MAC
#spotify_data <- read_csv("Data/spotify-2023.csv")
#if you have a PC (although I haven't tested this out, un-comment the line below to try)
spotify_data <- read_csv("Data/spotify-2023.csv")
## Rows: 953 Columns: 24
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): track_name, artist(s)_name, streams, key, mode
## dbl (17): artist_count, released_year, released_month, released_day, in_spot...
## num (2): in_deezer_playlists, in_shazam_charts
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#if neither of the above works for you, you can try using the absolute path (copy the pathname of the csv file on your computer)
Other common file types include Excel files (.xlsx), TSV (.tsv), and .rds. The course textbook is a good reference for learning how to load these file types.
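The abbreviations in the output above (such as chr and dbl) refer to variable classes. Some of the most common classes are: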
| Class | Description | Example |
|---|---|---|
| numeric | Numbers with or without decimals | 3, 3.14 |
| double | Numeric values with decimals (a subtype of numeric) | 3.14 |
| character | Text strings | "hello" |
| logical | TRUE/FALSE values | TRUE, FALSE |
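You can check the class of any value or variable with class(). The sketch below covers the classes in the table, plus integer and factor from the lab objectives; the example values are arbitrary.

```r
class(3.14)      # "numeric"
typeof(3.14)     # "double" (how a numeric value is stored)
class(3L)        # "integer" (the L suffix marks a whole number)
class("hello")   # "character"
class(TRUE)      # "logical"

# A factor stores categories with a fixed set of levels
class(factor(c("Major", "Minor")))   # "factor"
```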
# Explore your data frame
glimpse(spotify_data)
## Rows: 953
## Columns: 24
## $ track_name <chr> "Seven (feat. Latto) (Explicit Ver.)", "LALA", "v~
## $ `artist(s)_name` <chr> "Latto, Jung Kook", "Myke Towers", "Olivia Rodrig~
## $ artist_count <dbl> 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1~
## $ released_year <dbl> 2023, 2023, 2023, 2019, 2023, 2023, 2023, 2023, 2~
## $ released_month <dbl> 7, 3, 6, 8, 5, 6, 3, 7, 5, 3, 4, 7, 1, 4, 3, 12, ~
## $ released_day <dbl> 14, 23, 30, 23, 18, 1, 16, 7, 15, 17, 17, 7, 12, ~
## $ in_spotify_playlists <dbl> 553, 1474, 1397, 7858, 3133, 2186, 3090, 714, 109~
## $ in_spotify_charts <dbl> 147, 48, 113, 100, 50, 91, 50, 43, 83, 44, 40, 55~
## $ streams <chr> "141381703", "133716286", "140003974", "800840817~
## $ in_apple_playlists <dbl> 43, 48, 94, 116, 84, 67, 34, 25, 60, 49, 41, 37, ~
## $ in_apple_charts <dbl> 263, 126, 207, 207, 133, 213, 222, 89, 210, 110, ~
## $ in_deezer_playlists <dbl> 45, 58, 91, 125, 87, 88, 43, 30, 48, 66, 54, 21, ~
## $ in_deezer_charts <dbl> 10, 14, 14, 12, 15, 17, 13, 13, 11, 13, 12, 5, 58~
## $ in_shazam_charts <dbl> 826, 382, 949, 548, 425, 946, 418, 194, 953, 339,~
## $ bpm <dbl> 125, 92, 138, 170, 144, 141, 148, 100, 130, 170, ~
## $ key <chr> "B", "C#", "F", "A", "A", "C#", "F", "F", "C#", "~
## $ mode <chr> "Major", "Major", "Major", "Major", "Minor", "Maj~
## $ `danceability_%` <dbl> 80, 71, 51, 55, 65, 92, 67, 67, 85, 81, 57, 78, 7~
## $ `valence_%` <dbl> 89, 61, 32, 58, 23, 66, 83, 26, 22, 56, 56, 52, 6~
## $ `energy_%` <dbl> 83, 74, 53, 72, 80, 58, 76, 71, 62, 48, 72, 82, 6~
## $ `acousticness_%` <dbl> 31, 7, 17, 11, 14, 19, 48, 37, 12, 21, 23, 18, 6,~
## $ `instrumentalness_%` <dbl> 0, 0, 0, 0, 63, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17,~
## $ `liveness_%` <dbl> 8, 10, 31, 11, 11, 8, 8, 11, 28, 8, 27, 15, 3, 9,~
## $ `speechiness_%` <dbl> 4, 4, 6, 15, 6, 24, 3, 4, 9, 33, 5, 7, 7, 3, 6, 4~
Question: How many rows (observations) and columns (variables) does this data frame contain? Which variable classes are represented?
✏️ Your response here: It contains 953 observations and 24 variables, representing the character and double datatypes.
Now let’s look at some descriptive stats of a few of the variables
#summary is for numeric/double variables
summary(spotify_data$released_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1930 2020 2022 2018 2022 2023
#table is for character/categorical variables
table(spotify_data$key, useNA = "ifany") #you can also do useNA = "always"
##
## A A# B C# D D# E F F# G G# <NA>
## 75 57 81 120 81 33 62 89 73 96 91 95
Question: Remove the useNA = "ifany" part of the table command above. What do you notice? Why might it be important to always include useNA = "ifany" or useNA = "always"?
✏️ Your response here: If I remove that, songs that don’t have a value specified for key disappear. It’s important to include those useNA values so you can get a sense of the full scope of the dataset.
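The effect of useNA is easy to see on a small vector. In this sketch, `keys` is a made-up vector with two missing values.

```r
# A made-up vector with two missing values
keys <- c("A", "C#", NA, "A", NA)

# By default, table() silently drops missing values
table(keys)                    # A: 2, C#: 1

# useNA = "ifany" adds a count of missing values when there are any
table(keys, useNA = "ifany")   # A: 2, C#: 1, <NA>: 2

# Another way to count missing values directly
sum(is.na(keys))               # 2
```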
Pipes (%>% or |>)
Pipes allow you to chain commands together in a readable way. The %>% pipe comes from the magrittr package and is available whenever you load dplyr, a core member of the tidyverse (dplyr should automatically load when you load the tidyverse package, or you can load it separately if you prefer). The |> pipe is built into base R (version 4.1 and later).
You will often see people code in Base R or using dplyr (with the pipe). There is no wrong or right way. I was trained using dplyr, so that’s what you’ll often see in this course, but not always.
To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use |> instead of %>% as shown in Figure 3.1.
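To see what a pipe does, compare a nested call with the same steps written as a pipeline. This sketch uses the native pipe, so it assumes R 4.1 or later; the object names are made up for this example.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# A pipeline is read left to right: take values, then sqrt, then sum
piped <- values |> sqrt() |> sum()

nested   # 6
piped    # 6
```

The pipe passes the result on its left as the first argument of the function on its right.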
The code in the chunk below is written in dplyr:
#make a new dataframe of only songs released in 2023
songs2023 <- spotify_data %>% filter(released_year==2023)
#keep only the variables we're interested in
songs2023 <- songs2023 %>% select(track_name, `artist(s)_name`, released_year, streams, bpm, `danceability_%`, `acousticness_%`, `instrumentalness_%`, `speechiness_%`)
Question: How many rows and columns does your new data frame (songs2023) have?
✏️ Your answer here: It has 9 columns and 175 rows
The code in the chunk below does exactly the same thing as the code above, but it’s written in base R.
#make a new dataframe of only songs released in 2023
songs2023 <- spotify_data[spotify_data$released_year == 2023, ]
#keep only the variables we're interested in
songs2023 <- songs2023[, c("track_name", "artist(s)_name", "released_year", "streams",
"bpm", "danceability_%", "acousticness_%",
"instrumentalness_%", "speechiness_%")]
Question: Look at the differences between the code in dplyr and base R. Does one seem more intuitive to you?
✏️ Your answer here: The dplyr version seems more intuitive because it associates the name of the function more closely with the items the function is being applied to.
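With a pipe, the filter and select steps can also be chained into one pipeline instead of saved in two steps. The sketch below uses a small made-up data frame (`songs`) in place of spotify_data and assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data standing in for spotify_data (values are made up)
songs <- data.frame(
  track_name = c("Song A", "Song B", "Song C"),
  released_year = c(2023, 2019, 2023),
  bpm = c(120, 95, 140)
)

# dplyr: filter rows, then select columns, in a single pipeline
songs_dplyr <- songs %>%
  filter(released_year == 2023) %>%
  select(track_name, bpm)

# base R: the same rows and columns with bracket indexing
songs_base <- songs[songs$released_year == 2023, c("track_name", "bpm")]

songs_dplyr$track_name   # "Song A" "Song C"
songs_base$track_name    # "Song A" "Song C"
```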
.csv: Common, readable format. This is a great way to export data to share with collaborators.
.rds: R’s native format (preserves types and structure). This can be very useful if you plan to use your data later in R: you won’t need to rerun your previous code, and all variable formatting will be saved.
.parquet: The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages.
Anytime I finish a project or create a visualization, I typically save my cleaned-up data as both a .csv file and an .rds file. This can save so much time later on (for example: a collaborator comes back to me 6 months later and asks for a minor edit to a figure. Instead of re-running all code needed to clean and format the data, I can just load the figure data and then I only need to run the final code for the figure).
Let’s export our new data frame, songs2023, and save it to the folder called “Output” (you’ll need to create that folder within x462 if you haven’t already)
#export CSV file
write_csv(songs2023, "Output/songs2023.csv")
#export RDS file
saveRDS(songs2023, "Output/songs2023.rds")
#export parquet file (uncomment the three lines below; the first installs the arrow package, which you only need to do once)
#install.packages("arrow")
#library(arrow)
#write_parquet(songs2023, "Output/songs2023.parquet")
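To see what "preserves types and structure" means in practice, you can write a small data frame to both formats and read it back. This sketch uses base R's write.csv() and read.csv() in place of write_csv() and read_csv() so it runs without extra packages, writes to temporary files rather than the Output folder, and assumes R 4.0 or later. The data frame `demo` is made up for this example.

```r
# Toy data with a factor column (values are made up)
demo <- data.frame(
  track = c("Song A", "Song B"),
  mode = factor(c("Major", "Minor"))
)

# Temporary files, so nothing is written into your project
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

write.csv(demo, csv_file, row.names = FALSE)
saveRDS(demo, rds_file)

from_csv <- read.csv(csv_file)
from_rds <- readRDS(rds_file)

class(from_csv$mode)   # "character": the CSV stores only text
class(from_rds$mode)   # "factor": the RDS file keeps the original class
```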
Click the Render button in the toolbar. This should output an html file.
You can also render to:
pdf (requires LaTeX)
docx (Microsoft Word)
✏️ Please share any comments or questions related to this lab: Thanks for putting this together!