a <- 10 # Creates a numeric variable "a" and assigns the value 10
name <- "John" # creates a string variable
prices <- c(1.99, 2.29, 3.49, 5.99) # Creates a vector with multiple values, separated by commas
fun <- TRUE # a logical (boolean) variable
DATA 420 - Predictive Analytics with R
Class Notes - Week 1 & 2
R is a powerful, open-source programming language designed specifically for statistical computing and data analysis. It excels at handling large datasets, performing complex statistical operations, and visualizing data.
As a data analyst, you can use R for:
Data Manipulation: Cleaning, filtering, and transforming raw data into usable formats using dplyr and other tidyverse packages.
Statistical Analysis: Running regression models, hypothesis tests, and predictive analytics.
Data Visualization: Creating insightful graphs and charts with ggplot2, shiny, and plotly.
Automation & Reporting: Automating tasks and generating dynamic reports with R Markdown and Quarto.
File Types:
Throughout this course we will be working with three main file types:
R Script (.R extension)
plain text file containing R code
any text that is not code must be written as a comment (after a # symbol)
not ideal for reports
R Markdown (.Rmd extension)
a dynamic document containing code, text and visuals
can generate HTML, PDF and Word documents
great for reports and documentation
Quarto (.qmd extension)
the next generation of R Markdown, supporting R, Python, Julia, and other languages
enhanced outputs for web pages, books, and blogs
ideal for advanced publishing
In RStudio, click File > New File and choose the file type you wish to create. In an R Script you can begin writing code immediately. In R Markdown and Quarto files, create a new code chunk with the keyboard shortcut Ctrl + Alt + I.
Each week I will make my class notes available as a Quarto document. You can run a single line of code by placing your cursor on that line and pressing Ctrl + Enter, or run the whole code block (called a “chunk”) by clicking the green arrow in its top right corner.
Data Types:
Variables:
In R, variables are used to store values such as numbers, strings, vectors, or even complex data structures. They act as containers that hold information, making it easier to manipulate and analyze data.
You can create a variable using either the <- or = assignment operator
Variable names are case-sensitive and should be descriptive
Once you have run a line of code that creates a variable, the variable becomes visible in the Environment pane of RStudio (top right pane). It can then be used throughout your project.
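As a minimal sketch of the points above, the lines below create variables with both assignment operators and show that names differing only in case are separate variables. The variable names are illustrative only and are not used elsewhere in these notes.

```r
# Both operators assign a value; <- is the more common style in R code
unit_price <- 2.50
quantity = 4

# Variable names are case-sensitive: total and Total are separate variables
total <- unit_price * quantity
Total <- "not the same variable"

total         # the numeric result of 2.50 * 4
class(Total)  # a character variable, unrelated to total
```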
Working with Variables:
# this is a comment, it will not run
# variable math
a * 2
[1] 20
# display variable type
class(a)
[1] "numeric"
class(name)
[1] "character"
# subsetting with []
# R indexing starts at 1 (Python starts at 0)
prices[1]
[1] 1.99
# R subsetting includes the last item in a range, Python does not
prices[1:3] # returns the 1st, 2nd, and 3rd items in the prices vector
[1] 1.99 2.29 3.49
# logical
a == 7 # the == operator asks "is a equal to 7?" and returns TRUE or FALSE
[1] FALSE
a != 7 # is a not equal to 7? returns TRUE or FALSE
[1] TRUE
a %% 2 == 0 # modulo operator: is the remainder of a / 2 zero, i.e. is a even?
[1] TRUE
# TRUE has a value of 1, FALSE 0
TRUE + FALSE
[1] 1
class(TRUE)
[1] "logical"
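Square brackets accept more than a single position or range. The sketch below uses a small made-up vector (box_weights is a hypothetical name) to show a negative index, which drops an element, and a logical condition, which keeps only the elements where the test is TRUE.

```r
box_weights <- c(12, 45, 7, 30)

first_two   <- box_weights[1:2]               # positions 1 and 2, both ends included
without_1st <- box_weights[-1]                # a negative index drops that position
heavy       <- box_weights[box_weights > 10]  # keep elements where the condition is TRUE

length(box_weights)  # number of elements in the vector
```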
Vectors:
A vector in R holds a single data type. If you mix data types, R will coerce (force) the contents to one common type. For example, the number 2 can be coerced to the string “2”, but the word “fish” cannot be coerced to a number.
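A short sketch of coercion in practice, using the same "2" and "fish" values mentioned above. The variable names mixed and fish_num are illustrative only; suppressWarnings() is used here just to keep the console quiet, since R normally warns when a conversion produces NA.

```r
# Mixing types: numbers and logicals are coerced to character strings
mixed <- c(2, "fish", TRUE)
class(mixed)

# The string "2" converts cleanly to the number 2
as.numeric("2")

# "fish" has no numeric meaning, so the result is NA (with a warning)
fish_num <- suppressWarnings(as.numeric("fish"))
is.na(fish_num)
```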
# Vectors
temp <- c(2, 7, 12, -3, 7, -15)
temp2 <- c(4, -7, NA, 6, 12, 23)
class(temp)
[1] "numeric"
names <- c("Jason", "Dan", "Steve")
class(names)
[1] "character"
# check data types - returns TRUE or FALSE
is.numeric(temp)
[1] TRUE
is.character(names)
[1] TRUE
# force a data type to another
a <- as.character(a)
# math and logic on vectors
mean(temp)
[1] 1.666667
median(temp)
[1] 4.5
# prices is the vector c(1.99, 2.29, 3.49, 5.99) created earlier
tax_rate <- 1.07 # multiplier for a 7% tax rate
total_price <- tax_rate * prices
total_price
[1] 2.1293 2.4503 3.7343 6.4093
# rounding numeric data
round(total_price, 2)
[1] 2.13 2.45 3.73 6.41
# missing data
# temp2 contains an NA value, so mean() returns NA
mean(temp2)
[1] NA
# we can remove NA values with na.omit
mean(na.omit(temp2))
[1] 7.6
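Before removing missing values it often helps to see how many there are and where they sit. A small sketch, repeating the temp2 vector so the block runs on its own (n_missing and kept are illustrative names):

```r
temp2 <- c(4, -7, NA, 6, 12, 23)

is.na(temp2)                    # TRUE marks each missing position
n_missing <- sum(is.na(temp2))  # TRUE counts as 1, so the sum counts the NAs
which(is.na(temp2))             # position(s) of the missing values

kept <- temp2[!is.na(temp2)]    # keep only the non-missing values
mean(kept)
```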
Data Frames:
A data frame is a fundamental data structure in R that organizes data into rows and columns, like a spreadsheet. You can store different data types (numeric, string, logical) in different columns. Each column is a separate vector with its own data type. Each row is an observation.
df <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(90, 85, 88)
)
df
  ID    Name Score
1  1   Alice    90
2  2     Bob    85
3  3 Charlie    88
# we can extract individual columns from df by using the "$" symbol
df$ID
[1] 1 2 3
df[2, ] # returns the second row
  ID Name Score
2  2  Bob    85
df[2, 3] # access row 2, column 3 (Bob’s score: 85)
[1] 85
# adding new columns
df$Passed <- df$Score > 80 # adds a logical column
# modify a value
df[1, "Score"] <- 95 # change Alice's score
# filter data frame values
high_scores <- df[df$Score > 85, ]
high_scores
  ID    Name Score Passed
1  1   Alice    95   TRUE
3  3 Charlie    88   TRUE
# summary statistics from a dataframe
summary(df)
       ID          Name               Score         Passed
 Min.   :1.0   Length:3           Min.   :85.00   Mode:logical
 1st Qu.:1.5   Class :character   1st Qu.:86.50   TRUE:3
 Median :2.0   Mode  :character   Median :88.00
 Mean   :2.0                      Mean   :89.33
 3rd Qu.:2.5                      3rd Qu.:91.50
 Max.   :3.0                      Max.   :95.00
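Because each column of a data frame is its own vector, each column can have its own data type. The sketch below checks this on a small made-up data frame (scores and col_types are hypothetical names). It assumes R 4.0 or later, where text columns stay as character rather than being converted to factors.

```r
scores <- data.frame(
  ID     = c(1, 2, 3),
  Name   = c("Alice", "Bob", "Charlie"),
  Passed = c(TRUE, TRUE, FALSE)
)

# one class per column
col_types <- sapply(scores, class)
col_types

nrow(scores)  # number of observations (rows)
ncol(scores)  # number of variables (columns)
```

str(scores) prints a similar overview, with the first few values of each column alongside its type.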
Examples using built-in data from the starwars and mpg datasets
R libraries are collections of functions, data, and documentation that extend the capabilities of the R programming language. They’re like toolkits or plugins that help you perform specific tasks more efficiently - whether you’re analyzing data, creating visualizations, building machine learning models, or even scraping websites.
By loading the tidyverse library, we gain access to the datasets bundled with its packages, such as starwars (from dplyr) and mpg (from ggplot2).
Libraries must be installed before they can be loaded.
In RStudio, use Tools > Install Packages to search the CRAN repository and install the libraries you need.
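Installation can also be done from code with install.packages(). The sketch below wraps the check in a small helper (is_installed is a hypothetical name, not a built-in function) so that a package is only installed when it is missing. The install line is left commented out so the block has no side effects when run.

```r
# Hypothetical helper: TRUE if the package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # the stats package ships with R

# Install once per machine, then load with library() in each session
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```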
library(tidyverse)
# Star Wars dataframe
class(starwars)
[1] "tbl_df"     "tbl"        "data.frame"
head(starwars)
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# The height column contains NA values, so we remove them before calculating the average height
mean(na.omit(starwars$height))
[1] 174.6049
# filter by hair colour
# na.omit() then drops every row that still contains an NA in any column
brown_hair <- na.omit(starwars[starwars$hair_color == "brown", ])
brown_hair
# A tibble: 7 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Beru Whi…    165    75 brown      light      blue              47 fema… femin…
3 Chewbacca    228   112 brown      unknown    blue             200 male  mascu…
4 Han Solo     180    80 brown      fair       brown             29 male  mascu…
5 Wedge An…    170    77 brown      fair       hazel             21 male  mascu…
6 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
7 Padmé Am…    185    45 brown      light      brown             46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
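The two comparison styles treat missing values differently: == returns NA when it meets an NA, while %in% returns FALSE. A minimal sketch on a made-up vector (species, eq_rows and in_rows are illustrative names, not columns taken from starwars):

```r
species <- c("Human", NA, "Droid", "Human")

species == "Human"    # the NA position stays NA
species %in% "Human"  # the NA position becomes FALSE

eq_rows <- species[species == "Human"]    # an NA entry is carried through
in_rows <- species[species %in% "Human"]  # only the matching entries remain
```

This is why a filter built with == on a column containing NA produces extra NA rows unless they are removed afterwards.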
# another way to filter using %in% - filter by species
human_species <- starwars[starwars$species %in% "Human", ]
human_species <- na.omit(human_species)
human_species
# A tibble: 18 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Luke Sk…    172  77   blond      fair       blue            19   male  mascu…
 2 Darth V…    202 136   none       white      yellow          41.9 male  mascu…
 3 Leia Or…    150  49   brown      light      brown           19   fema… femin…
 4 Owen La…    178 120   brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165  75   brown      light      blue            47   fema… femin…
 6 Biggs D…    183  84   black      light      brown           24   male  mascu…
 7 Obi-Wan…    182  77   auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188  84   blond      fair       blue            41.9 male  mascu…
 9 Han Solo    180  80   brown      fair       brown           29   male  mascu…
10 Wedge A…    170  77   brown      fair       hazel           21   male  mascu…
11 Palpati…    170  75   grey       pale       yellow          82   male  mascu…
12 Boba Fe…    183  78.2 black      fair       brown           31.5 male  mascu…
13 Lando C…    177  79   black      dark       brown           31   male  mascu…
14 Lobot       175  79   none       light      blue            37   male  mascu…
15 Padmé A…    185  45   brown      light      brown           46   fema… femin…
16 Mace Wi…    188  84   none       dark       brown           72   male  mascu…
17 Dooku       193  80   white      fair       brown          102   male  mascu…
18 Jango F…    183  79   black      tan        brown           66   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# Filtering with %in% returns every row of the starwars dataframe
# where "Human" is found in the species column; rows with a missing
# species are skipped instead of coming back as NA rows
class(mpg)
[1] "tbl_df"     "tbl"        "data.frame"
head(mpg)
# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
# return the average highway mileage of all vehicles in the dataframe
mean(mpg$hwy)
[1] 23.44017
# filter for front-wheel-drive vehicles ("f" in the drv column)
fwd_cars <- mpg[mpg$drv == "f", ]
# calculate the mean highway mileage
# instead of wrapping the vector in na.omit(), we can use the na.rm = TRUE argument of mean()
mean_highway_mpg_fwd <- mean(fwd_cars$hwy, na.rm = TRUE)
mean_highway_mpg_fwd
[1] 28.16038
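The filter-then-average pattern above handles one group at a time. As a rough sketch of one way to get the mean for every group in a single call, base R's tapply() can be used. The toy_cars data below is made up for illustration and is not taken from mpg.

```r
# Made-up data: drive type and highway mileage for five vehicles
toy_cars <- data.frame(
  drv = c("f", "f", "r", "4", "4"),
  hwy = c(29, 31, 24, 19, 17)
)

# mean of hwy within each value of drv
hwy_by_drv <- tapply(toy_cars$hwy, toy_cars$drv, mean)
hwy_by_drv

# look up a single group by name
hwy_by_drv[["f"]]
```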