DATA 420 - Predictive Analytics with R

Author

Jason Pemberton

Class Notes - Week 1 & 2

R is a powerful, open-source programming language specifically designed for statistical computing and data analysis. It excels in handling large datasets, performing complex statistical operations, and visualizing data.

As a data analyst, R can help you with:

Data Manipulation: Cleaning, filtering, and transforming raw data into usable formats using tidyverse packages like dplyr and tidyr.
Statistical Analysis: Running regression models, hypothesis tests, and predictive analytics.
Data Visualization: Creating insightful graphs and charts with ggplot2 and plotly, and interactive apps with shiny.
Automation & Reporting: Automating tasks and generating dynamic reports with R Markdown and Quarto.

File Types:

Throughout this course we will be working with three main file types:

R Script (.R extension)
- plain text file containing R code
- any explanatory text must be written as a comment (following a # symbol)
- not ideal for reports
R Markdown (.Rmd extension)
- a dynamic document containing code, text and visuals
- can generate HTML, PDF and Word documents
- great for reports and documentation
Quarto (.qmd extension)
- next generation of R Markdown supporting R, Python, Julia etc.
- enhanced outputs for web pages, books, and blogs
- ideal for advanced publishing

In RStudio, click File > New File and choose the file type you wish to create. In an R Script you can begin writing code immediately. In R Markdown and Quarto files, create a new code chunk with the keyboard shortcut Ctrl + Alt + I.

Each week I will make my class notes available as a Quarto document. You can run a single line of code by placing your cursor anywhere on that line and pressing Ctrl + Enter, or run an entire code block (called a “chunk”) by clicking the green arrow in its top right corner.

Data Types:

Variables:

In R, variables are used to store values such as numbers, strings, vectors, or even complex data structures. They act as containers that hold information, making it easier to manipulate and analyze data.

You can create a variable using either the <- or = assignment operator
Variable names are case-sensitive and should be descriptive

a <- 10 # Creates a numeric variable "a" and assigns the value 10

name <- "John" # creates a string variable

prices <- c(1.99, 2.29, 3.49, 5, 99) # Creates a vector with five values, separated by commas; note that "5, 99" is two elements (5 and 99), whereas 5.99 would be a single price

fun <- TRUE # a logical (boolean) variable

Once you have run a line of code, the variable becomes visible in the Environment pane of RStudio (top right pane). The variable can then be used for the rest of your session.
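The two points above (either assignment operator works, and names are case-sensitive) can be sketched as follows. The variable names total and Total are hypothetical and chosen only to show that R treats them as two different variables.

```r
total <- 10      # "<-" is the conventional assignment operator in R
Total = 25       # "=" also assigns; Total is NOT the same variable as total

total + Total    # 35
total == Total   # FALSE, the two names hold different values
```

Most R style guides prefer <- for assignment, which is why the rest of these notes use it.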

Working with Variables:

# this is a comment, it will not run

# variable math
a * 2

[1] 20

# display variable type
class(a)

[1] "numeric"

class(name)

[1] "character"

# subsetting with []
# R indexing starts at 1 (Python starts at 0)
prices[1]

[1] 1.99

# R subsetting includes the last item in range, Python does not
prices[1:3] # returns the 1st, 2nd, and 3rd items in the prices vector

[1] 1.99 2.29 3.49

# logical
a == 7 # the == operator asks the question "is a equal to 7?" and returns TRUE or FALSE

[1] FALSE

a != 7 # is a not equal to 7? returns TRUE or FALSE

[1] TRUE

a %% 2 == 0 # modulo operator: is a even? (TRUE when the remainder of a / 2 is 0)

[1] TRUE

# TRUE has a value of 1, FALSE has a value of 0
TRUE + FALSE

[1] 1

class(TRUE)

[1] "logical"
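Subsetting and logical tests can be combined. The short sketch below uses a hypothetical vector named scores (not part of the class data) to show negative indexing, picking specific positions, and filtering with a condition.

```r
# Hypothetical example vector
scores <- c(72, 88, 95, 60)

scores[-1]                # drop the first element: 88 95 60
scores[c(1, 4)]           # pick the 1st and 4th elements: 72 60
scores[scores > 80]       # keep elements where the condition is TRUE: 88 95
sum(scores > 80)          # TRUE counts as 1, so this counts the matches: 2
scores[scores %% 2 == 0]  # even values only: 72 88 60
```

Note that a negative index in R removes that position, which differs from Python, where a negative index counts from the end.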

Vectors:

R vectors contain one data type. If you mix data types, R will force (coerce) the vector contents to a single type that can represent every element. For example, the number 2 can be coerced to the string “2”, but the word “fish” cannot be coerced to a number: converting it with as.numeric() returns NA with a warning.
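As a minimal sketch of coercion, the example below mixes types in hypothetical vectors named mixed and nums, then attempts a conversion that cannot succeed. The suppressWarnings() call is only there to keep the output quiet.

```r
mixed <- c(2, "fish", TRUE)   # numbers and logicals are coerced to character
class(mixed)                  # "character"
mixed                         # "2" "fish" "TRUE"

nums <- c(1.5, TRUE, FALSE)   # logicals are coerced to numeric
nums                          # 1.5 1.0 0.0

# Text that does not look like a number becomes NA (normally with a warning)
converted <- suppressWarnings(as.numeric(c("2", "fish")))
converted                     # 2 NA
```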

# Vectors
temp <- c(2, 7, 12, -3, 7, -15)
temp2 <- c(4, -7, NA, 6, 12, 23)
class(temp)

[1] "numeric"

names <- c("Jason", "Dan", "Steve")
class(names)

[1] "character"

# check data types - each is.*() function returns TRUE or FALSE
is.numeric(temp)

[1] TRUE

is.character(names)

[1] TRUE

# convert (coerce) a value to another data type; a is now the string "10"
a <- as.character(a)

# math and logic on vectors
mean(temp)

[1] 1.666667

median(temp)

[1] 4.5

tax_rate <- 1.07 # multiplier for a 7% tax rate (1 + 0.07)
total_price <- tax_rate * prices
total_price

[1]   2.1293   2.4503   3.7343   5.3500 105.9300

# rounding numeric data
round(total_price, 2)

[1]   2.13   2.45   3.73   5.35 105.93

# missing data
# temp2 contains an NA value, so mean() returns NA
# we can remove NA values with na.omit() before calculating
mean(temp2)

[1] NA

mean(na.omit(temp2))

[1] 7.6
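Before removing missing values it often helps to check how many there are and where they sit. The sketch below uses a hypothetical vector named readings and the base R function is.na(). It also shows the na.rm argument of mean(), which gives the same result as wrapping the vector in na.omit().

```r
readings <- c(4, -7, NA, 6, 12, 23)  # hypothetical vector with one missing value

is.na(readings)               # FALSE FALSE TRUE FALSE FALSE FALSE
sum(is.na(readings))          # number of missing values: 1
which(is.na(readings))        # position of the missing value: 3
mean(readings, na.rm = TRUE)  # 7.6, same as mean(na.omit(readings))
```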

Data Frames:

A data frame is a fundamental data structure in R that organizes data into rows and columns, like a spreadsheet. You can store different data types (numeric, character, logical) in different columns. Each column is a separate vector with its own data type, and each row is an observation.

df <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(90, 85, 88)
)

df

  ID    Name Score
1  1   Alice    90
2  2     Bob    85
3  3 Charlie    88

# we can extract individual columns from df by using the "$" symbol
df$ID

[1] 1 2 3

df[2, ]  # Returns the second row

  ID Name Score
2  2  Bob    85

df[2, 3]  # Access row 2, column 3 (Bob’s score: 85)

[1] 85

# adding new columns
df$Passed <- df$Score > 80  # Adds a logical column

# modify a value
df[1, "Score"] <- 95  # Change Alice's score

# filter data frame values
high_scores <- df[df$Score > 85, ]
high_scores

  ID    Name Score Passed
1  1   Alice    95   TRUE
3  3 Charlie    88   TRUE

# Summary statistics from a dataframe
summary(df)

       ID          Name               Score        Passed       
 Min.   :1.0   Length:3           Min.   :85.00   Mode:logical  
 1st Qu.:1.5   Class :character   1st Qu.:86.50   TRUE:3        
 Median :2.0   Mode  :character   Median :88.00                 
 Mean   :2.0                      Mean   :89.33                 
 3rd Qu.:2.5                      3rd Qu.:91.50                 
 Max.   :3.0                      Max.   :95.00
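Beyond summary(), a few base R functions give a quick look at a data frame's size and column types. The sketch below builds a hypothetical data frame named grades with the same layout as the example above, so it can be run on its own.

```r
# Hypothetical data frame with the same layout as the example above
grades <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(95, 85, 88)
)

nrow(grades)           # number of rows (observations): 3
ncol(grades)           # number of columns (variables): 3
sapply(grades, class)  # data type of each column
str(grades)            # compact overview of the structure

grades[order(-grades$Score), ]     # sort rows by Score, highest first
grades[grades$Score > 85, "Name"]  # names of students scoring above 85
```

The order() call returns row positions, which is why it is used inside the square brackets in the row slot.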

Examples using built-in data from the starwars and mpg datasets

R packages (often called libraries) are collections of functions, data, and documentation that extend the capabilities of the R programming language. They’re like toolkits or plugins that help you perform specific tasks more efficiently - whether you’re analyzing data, creating visualizations, building machine learning models, or even scraping websites.

By loading the tidyverse library, we gain access to the example datasets that ship with its packages, including starwars (from dplyr) and mpg (from ggplot2).

Libraries must be installed before they can be loaded.

Use: Tools > Install Packages to search the CRAN repository for necessary libraries
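Packages can also be installed from code with install.packages(). The sketch below wraps that call in a hypothetical helper named ensure_installed (not part of any package) so that a download only happens when the package is missing.

```r
# Hypothetical helper: install a package only when it is not already available
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs an internet connection
  }
  requireNamespace(pkg, quietly = TRUE)  # TRUE once the package is available
}

ensure_installed("stats")        # "stats" ships with R, so nothing is downloaded
# ensure_installed("tidyverse")  # uncomment to install the tidyverse if needed
```

Installing is a one-time step per computer, while library() must be called in every new R session.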

library(tidyverse)

# Star Wars dataframe
class(starwars)

[1] "tbl_df"     "tbl"        "data.frame"

head(starwars)

# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

# The height column contains NA values; remove them before calculating the average height
mean(na.omit(starwars$height))

[1] 174.6049

# filter by hair colour; na.omit() then drops every row that has an NA in any column
brown_hair <- na.omit(starwars[starwars$hair_color == "brown", ])
brown_hair

# A tibble: 7 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Beru Whi…    165    75 brown      light      blue              47 fema… femin…
3 Chewbacca    228   112 brown      unknown    blue             200 male  mascu…
4 Han Solo     180    80 brown      fair       brown             29 male  mascu…
5 Wedge An…    170    77 brown      fair       hazel             21 male  mascu…
6 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
7 Padmé Am…    185    45 brown      light      brown             46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

# another way to filter using %in% - filter by species
human_species <- starwars[starwars$species %in% "Human", ]
human_species <- na.omit(human_species)
human_species

# A tibble: 18 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172  77   blond      fair       blue            19   male  mascu…
 2 Darth V…    202 136   none       white      yellow          41.9 male  mascu…
 3 Leia Or…    150  49   brown      light      brown           19   fema… femin…
 4 Owen La…    178 120   brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165  75   brown      light      blue            47   fema… femin…
 6 Biggs D…    183  84   black      light      brown           24   male  mascu…
 7 Obi-Wan…    182  77   auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188  84   blond      fair       blue            41.9 male  mascu…
 9 Han Solo    180  80   brown      fair       brown           29   male  mascu…
10 Wedge A…    170  77   brown      fair       hazel           21   male  mascu…
11 Palpati…    170  75   grey       pale       yellow          82   male  mascu…
12 Boba Fe…    183  78.2 black      fair       brown           31.5 male  mascu…
13 Lando C…    177  79   black      dark       brown           31   male  mascu…
14 Lobot       175  79   none       light      blue            37   male  mascu…
15 Padmé A…    185  45   brown      light      brown           46   fema… femin…
16 Mace Wi…    188  84   none       dark       brown           72   male  mascu…
17 Dooku       193  80   white      fair       brown          102   male  mascu…
18 Jango F…    183  79   black      tan        brown           66   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

# %in% keeps every row of the starwars dataframe whose species is "Human"
# and treats NA as "no match"; na.omit() then removes rows with an NA
# in any column, leaving the 18 rows shown above
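The difference between filtering with == and with %in% shows up when the column contains missing values. The sketch below uses a small hypothetical vector named hair, standing in for a column such as hair_color, so it runs without any packages.

```r
# Hypothetical vector with a missing value
hair <- c("brown", NA, "blond", "brown")

hair == "brown"    # TRUE NA FALSE TRUE    (comparing with NA gives NA)
hair %in% "brown"  # TRUE FALSE FALSE TRUE (NA is treated as no match)

hair[hair == "brown"]    # "brown" NA "brown"  (the NA slips through)
hair[hair %in% "brown"]  # "brown" "brown"

# %in% also accepts several values at once
hair[hair %in% c("brown", "blond")]  # "brown" "blond" "brown"
```

This is why filtering with == tends to need an na.omit() afterwards, while %in% already skips the missing values in the column being tested.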

class(mpg)

[1] "tbl_df"     "tbl"        "data.frame"

head(mpg)

# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

# return the average highway mileage of all vehicles in the dataframe
mean(mpg$hwy)

[1] 23.44017

# Filter for front-wheel-drive vehicles ("f" in the drv column)
fwd_cars <- mpg[mpg$drv == "f", ]

# Calculate the mean highway mileage
# instead of wrapping the vector in na.omit(), we can use the na.rm = TRUE argument of mean()
mean_highway_mpg_fwd <- mean(fwd_cars$hwy, na.rm = TRUE)

mean_highway_mpg_fwd

[1] 28.16038
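The filter-then-average pattern above handles one group at a time. As a rough sketch of how the same calculation could be done for every group in one step, the example below uses base R's tapply() and aggregate(). It assumes the mtcars dataset, which ships with base R, as a stand-in for mpg so that no packages need to be loaded.

```r
# mtcars ships with base R, so no extra packages are needed
# Mean miles per gallon for each cylinder count (4, 6, 8)
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(mpg_by_cyl, 2)
#     4     6     8
# 26.66 19.74 15.10

# The same idea with aggregate(), which returns a data frame
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```

With the tidyverse loaded, the equivalent call on the mpg data would be tapply(mpg$hwy, mpg$drv, mean), which returns the mean highway mileage for each drive type.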