Introduction to Data Science in R

In this book, we take a well-studied public dataset called mtcars and work through it following a data science workflow. By doing this, we will develop an understanding of the R programming language, the RStudio integrated development environment (IDE), descriptive and inferential statistics, hypothesis testing, and applied predictive modeling techniques with a focus on linear regression.

This book assumes no real familiarity with the R programming language, or even much in the way of math or statistics. The ideal audience is someone who is interested in learning all of these things. In fact, we operate under the belief that the best way to commit statistics to memory is by coding, so there is great advantage in learning both at the same time. The goal of this book is to take someone from beginner to expert. To that end, we first go into some detail in the opening sections on how to install the R programming language and RStudio, as well as the syntax of the language. Every time a new function, method, or process is introduced, we explain its syntax and walk through what is happening in the code, and the code becomes more advanced throughout the book as the reader grows familiar with our particular style.

We recognize that comprehensive statistics textbooks are often written after a lifetime of experience by their authors. This book was created through the opposite process, as a learning tool. In writing it, we came to believe that, as younger authors in the field, we had a unique perspective and a unique ability to make inherently difficult concepts tangible. Data science requires expertise across multiple fields, and we approach it from the perspective of someone in need of validated resources: well-studied datasets whose results we can check against a variety of other sources. We chose the mtcars dataset because it has been analyzed many times and much of that analysis is freely available online. Indeed, much of the published work available through RPubs, Stack Overflow, Medium, Twitter, and other such sources has informed what is in this book, which began primarily as a personal learning resource before transitioning into a tutoring document and eventually becoming a full-length book.

We encourage anyone to take the methods or ideas in this book and check them against other work using the mtcars dataset, or to reproduce the methods in other fields. mtcars is a clean dataset; real-life datasets will be messier and require a different level of skill and subtlety. This book should therefore be considered a compilation of much of the work that has already been done to help understand mtcars. We provide specific attributions where they apply, in addition to a more general attribution to all of the authors on various websites whom we have not been able to thank explicitly.

1 Getting Started

Our first project is to install both the R programming language and the RStudio integrated development environment (IDE). Instructions for installing R are found on the R project website at the following url: https://www.r-project.org. Instructions for installing RStudio are found on the RStudio website, linked below.

  • R website
  • RStudio website

1.1 RStudio

1.2 The R Programming Language

R is a language and environment for statistical computing and graphics. R is similar to the S language and environment and can be considered as a different implementation of S. R provides a wide variety of statistical and graphical techniques and is highly extensible. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as Free Software. It compiles and runs on a wide variety of platforms and systems including Linux, Windows and MacOS.

R is a flexible language that is object-oriented and thus allows the manipulation of complex data structures in a condensed and efficient manner.

Data Types

R uses a variety of data types:

  • scalars
  • vectors
  • lists
  • matrices
  • dataframes

We will look at each of these data types in turn.

scalars
vectors

We can refer to elements of a vector using subscripts.
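For example (a small sketch with values borrowed from mtcars), we can build a vector with c() and pull out elements by position with square brackets:

# a numeric vector and a few ways of subscripting it
mpg_sample <- c(21.0, 22.8, 18.7)

mpg_sample[1] # the first element, 21.0
mpg_sample[c(2, 3)] # the second and third elements
mpg_sample[-1] # everything except the first element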

lists
matrices
arrays

Arrays are similar to matrices but can have more than two dimensions. See help(array) for details.
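As a quick sketch with toy values (not from mtcars), we can build a matrix with matrix(), build an array with array(), and check their dimensions with dim():

# a 2 x 3 matrix and a 2 x 3 x 2 array built from toy values
m <- matrix(1:6, nrow = 2, ncol = 3)
a <- array(1:12, dim = c(2, 3, 2))

dim(m) # 2 3
dim(a) # 2 3 2
m[1, 2] # row 1, column 2 (matrices fill column by column, so this is 3)
a[1, 2, 1] # row 1, column 2, first layer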

dataframes
functions

Functions in R follow this form.

function_name <- function(arg1, arg2, ..., argn) {
  # put code here
}
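As a concrete (hypothetical) example following this form, the small function below converts miles per gallon into kilometres per litre:

# a hypothetical helper written in the form shown above
mpg_to_kpl <- function(mpg) {
  mpg * 0.425144 # one mile per US gallon is roughly 0.425 kilometres per litre
}

mpg_to_kpl(21.0) # roughly 8.93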

1.3 Our Libraries

After we have successfully installed RStudio, our next project is to import the libraries that we will be using. For this project, we will lean heavily on the functions from the tidyverse and tidymodels libraries. We will follow a tidy workflow.

There are three rules which make a dataset tidy:

  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell
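As a small illustration of these three rules, here is a toy tibble built by hand (using the tibble package that we install just below, and two cars from mtcars): each variable has its own column, each observation (one car) has its own row, and each value has its own cell.

# a toy tidy dataset: one variable per column, one observation per row, one value per cell
tibble::tibble(
  car = c("Mazda RX4", "Datsun 710"),
  mpg = c(21.0, 22.8),
  cyl = c(6, 4)
)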

Tidyverse and Tidymodels

Some libraries are larger than others and contain multiple packages. The tidyverse and tidymodels libraries contain many packages, each with its own functions. The packages are grouped together because they have common developers and work with the same philosophy and methodology.

In the following lines of code, placing a hash # in front of a line is a way of commenting out the code. This allows us to leave the code in the script while making sure it doesn't run when we compile our project. For example, we don't need to install the tidyverse package every time we compile our project, but we have left the code in for the sake of learning; once a package is installed, there is no need to run these lines again. After the packages are installed, we can bring them into our local environment by calling library() with the package name.

library(tidyverse) includes the following libraries that we will be using:

  • library(tibble)
  • library(ggplot2)
  • library(dplyr)
  • library(tidyr)
  • library(forcats)
  • library(broom)
  • library(purrr)

library(tidymodels) includes the following libraries that we will be using:

  • library(infer)
  • library(rsample)
  • library(parsnip)
# install.packages('tidyverse')
# install.packages('tidymodels')

library(tidyverse)
library(infer)
library(broom)
library(rsample)
#library(tidymodels)
library(gridExtra)
library(car)
library(ggpubr)
library(rmdformats)
library(ggfortify)
library(ggridges)
library(moments)
library(MetBrewer)
library(modeest)
library(GGally)
library(ggforce)
library(MASS)
library(quantreg)
library(mblm)
library(parsnip)

library(reticulate) # python
library(dbplyr) # SQL
library(RSQLite) # SQL
library(DBI) # SQL
library(sqldf) # SQL
# custom color variables (hex codes matching the MetBrewer "Hiroshige" palette used later)
hiro_red <- "#e76254"
hiro_orange <- "#ef8a47"
hiro_light_orange <- "#f7aa58"
hiro_yellow <- "#ffd06f"
hiro_light_yellow <- "#ffe6b7"
hiro_light_teal <- "#aadce0"
hiro_teal <- "#72bcd5"
hiro_blue <- "#528fad" 
hiro_other_blue <- "#376795"
hiro_dark <- "#1e466e"

# helper: returns a breaks function that asks for roughly n axis ticks,
# e.g. scale_x_continuous(breaks = number_ticks(5))
number_ticks <- function(n) {function(limits) pretty(limits, n)}
# py_install("pandas")
# py_install("matplotlib")
# py_install("numpy")
# py_install("sklearn.linear_model")
# py_install("statsmodels")
# py_install("seaborn")

1.4 The Grammar of Graphics

Applied to visualizations, a grammar of graphics is a grammar used to describe and create a wide range of statistical graphics. The layered grammar of graphics approach is implemented in ggplot2, a widely used graphics library for R.

2 Data Munging

Part of the aim of this book is to better familiarize ourselves with the data science workflow. We can think about our data science workflow as following these steps. However, the lines are blurry. Different data scientists may have a preference for a different order for better or worse reasons, and working through a project often involves revisiting previous steps.

  • Load libraries
  • Import the data
  • Explore the data
  • Clean the data
  • Model the data
  • Build reports
    • Visualizations
    • Statistics

For this book, we decide to explore the well-studied public dataset called mtcars. We have chosen this dataset because it is accessible to audiences of all skill levels. We will show how we can get a deep understanding of different statistical methods from a small dataset; our plan, therefore, is to get a lot from just a little. Our goal is to build a comprehensive repository of code and techniques that can serve as a useful guide for future projects.

2.1 Investigating Our Data

To find mtcars, we use the data() function in our console to pull up a list of the datasets that come built in with R and its loaded packages. We can bring mtcars into our local environment by saving it as a variable: mtcars <- mtcars. The left and right assignment arrows <- and -> are very common in R. We can also type either help(mtcars) or ?mtcars to pull up the documentation for the dataset in RStudio. When we type help(mtcars) we can read, among other things, the following description of our dataset:

“The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).”

# data()
# ?mtcars 
# help(mtcars)

mtcars <- mtcars 
mtcars
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
mtcars

After we have saved mtcars in our local environment, we can explore our dataset with common functions that give us a high-level picture of what the data is like. These are built-in functions of the R programming language, which means they are always available even if no additional packages are installed or loaded. Using the language without the help of additional packages is often referred to as using “base R”.

  • class()
  • head() / tail()
  • str() / glimpse()
  • length() / nrow() / dim()
  • summary()
  • names() / colnames() / rownames()
class

Our first function is class() from base R. Looking at the class of mtcars is a good place to start: it is important to understand the class of the object we are working with in order to understand how it will behave. We see that mtcars is a data frame. We choose to convert it to a tibble, which we prefer.

Tibbles are preferred for a couple of reasons:

  • Tibbles have a refined print method that shows only the first 10 rows as well as all the columns that fit on screen, which makes them neater in the console.
  • Tibbles are stricter than data frames: they don’t automatically change data types, which prevents surprises.

In our case, the as_tibble() function has an additional useful argument. We use the rownames argument to turn the row names, which hold the car names, into an ordinary column that can be used in visualizations. After converting mtcars from a data frame to a tibble and setting rownames = 'car', we can see that mtcars has 12 variables or columns instead of 11.

In the following example, we call a function from a specific package by making the package name explicit in the function call with two colons, as in tibble::as_tibble, which means the as_tibble function from the tibble package. We might choose to explicitly call a function from a package if we don’t want to load the package into our global environment, or if we are worried that two different packages may have a function with the same name. This is particularly useful when two loaded packages have functions with the same name but different behavior. A common example is the select() function, which exists in both the MASS and dplyr packages, both of which we load in our libraries section. For this reason, it’s not uncommon to see dplyr::select written out in full, as in many examples throughout this book.
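As a small sketch of why this matters (both MASS and dplyr are attached in our libraries section, and both export a select() function):

# whichever package is attached last masks the other's select(), so unqualified
# calls to select() can be ambiguous; the :: prefix removes the ambiguity
dplyr::select(mtcars, mpg, wt) # explicitly dplyr's column-selection verb
# MASS::select() is an unrelated function (it works on lm.ridge objects)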

# class(mtcars) 
remove(mtcars)
mtcars <- tibble::as_tibble(mtcars, rownames = 'car')  
head / tail

We next use the head() and tail() functions to look at the first and last rows of our dataset, respectively. The default number of rows that will be printed to our console is six but we can change this. In our case, we decide instead to print three lines to our console by specifying n = 3. We can see that the very first car in our dataframe is Mazda RX4 and the very last car in our dataframe is Volvo 142E.

head(mtcars, n = 3)
tail(mtcars, n = 3)
glimpse

To inspect the structure of the data frame, we can either use the str() function from base R or the glimpse() function from the tibble package. We prefer glimpse() because it prints the output in a neater way. Both of these functions let us see the data types of all of our columns.

# View(mtcars) 

# str(mtcars)
tibble::glimpse(mtcars) 
## Rows: 32
## Columns: 12
## $ car  <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Ho…
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
dimension

We use the length(), nrow(), and dim() functions in order to inspect the mtcars dataset by looking at the number of columns/variables and the number of rows/observations in the dataset. We can see that we have 12 variables and 32 observations.

We can call length(), nrow() or dim() on the dataframe class and we can also call it on specific columns as well. We refer to specific columns in a dataset by using the $ operator. For example, mtcars refers to the entire dataframe and mtcars$mpg will refer to the mpg column in the mtcars dataframe or tibble.

We also see the following text above our table: A tibble: 32 x 12. This means that we are looking at a 32 x 12 table, which has 32 observations and 12 variables. Users of Excel might think of this in terms of rows and columns: a table of 32 rows and 12 columns. The corresponding language in R, instead of rows and columns, is to talk about observations and variables.

# length(mtcars) # 12 variables
# nrow(mtcars) # 32 observations
# dim(mtcars) 
# unique(mtcars$car) # there are 32 unique car values
# length(mtcars$car) 
# length(unique(mtcars$car)) 
summary

We use the summary() function to report summary statistics on each of the columns and get a better understanding of the distribution of each variable. It’s important to know that summary() produces different output for vectors of different data types. The summary for the car variable, which is a character vector, shows the number of values and the class, while the summaries for the other variables, which are numeric, print the minimum, first quartile, median, mean, third quartile, and maximum.

summary(mtcars) 
##      car                 mpg             cyl             disp      
##  Length:32          Min.   :10.40   Min.   :4.000   Min.   : 71.1  
##  Class :character   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
##  Mode  :character   Median :19.20   Median :6.000   Median :196.3  
##                     Mean   :20.09   Mean   :6.188   Mean   :230.7  
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0  
##        hp             drat             wt             qsec      
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50  
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89  
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71  
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85  
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90  
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90  
##        vs               am              gear            carb      
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000
colnames / rownames

We can use two additional base R functions, colnames() and rownames(), to look at the column and row names in our dataset. names(mtcars) is equivalent to colnames(mtcars).

# names(mtcars)
colnames(mtcars) 
##  [1] "car"  "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
## [11] "gear" "carb"
rownames(mtcars) 
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32"
tibbles and tribbles

We can also create data tables from scratch. For example, we may research the country of origin of each car in our dataset. One option would be to create another table in a tool like Excel; alternatively, we can create our own table with code in RStudio.

One common method is to create two vectors and put them together using either the data.frame() or tibble() function. In this case we create a new tibble, which we call cars_origin_tbl.

car <- c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", "Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", "Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", "Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", "Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2", "Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", "Volvo 142E")

origin <- c("Japan", "Japan", "Japan", "United States", "United States", "United States", "United States", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany", "United States", "United States", "United States", "Italy", "Japan", "Japan", "Japan", "United States", "United States", "United States", "United States", "Italy", "Germany", "British", "United States", "Italy", "Italy", "Sweden")

# cars_origin_df <- data.frame(car, origin)
cars_origin_tbl <- tibble(car, origin)
head(cars_origin_tbl, n = 3)

Another method of creating a tibble is to build it row-wise with tribble(), a “transposed tibble”. The resulting tribble is the same as the tibble that we created above; the only difference is the way we created it. Because it is redundant, we remove it from our environment using the remove() function.

cars_origin_trbl <- tribble(
  ~car, ~origin,
  "Mazda RX4", "Japan",
  "Mazda RX4 Wag", "Japan",
  "Datsun 710", "Japan", 
  "Hornet 4 Drive", "United States", 
  "Hornet Sportabout", "United States", 
  "Valiant", "United States", 
  "Duster 360", "United States", 
  "Merc 240D", "Germany", 
  "Merc 230", "Germany", 
  "Merc 280", "Germany", 
  "Merc 280C", "Germany", 
  "Merc 450SE", "Germany", 
  "Merc 450SL", "Germany", 
  "Merc 450SLC", "Germany", 
  "Cadillac Fleetwood", "United States", 
  "Lincoln Continental", "United States", 
  "Chrysler Imperial", "United States", 
  "Fiat 128", "Italy", 
  "Honda Civic", "Japan", 
  "Toyota Corolla", "Japan", 
  "Toyota Corona", "Japan", 
  "Dodge Challenger", "United States", 
  "AMC Javelin", "United States", 
  "Camaro Z28", "United States", 
  "Pontiac Firebird", "United States", 
  "Fiat X1-9", "Italy", 
  "Porsche 914-2", "Germany", 
  "Lotus Europa", "British", 
  "Ford Pantera L", "United States", 
  "Ferrari Dino", "Italy", 
  "Maserati Bora", "Italy", 
  "Volvo 142E", "Sweden"
)

remove(cars_origin_trbl)
indexes

We can also add an index column to our tibble by using the rowid_to_column() function from the tibble package.

mtcars <- tibble::rowid_to_column(mtcars, "index")
mtcars
sql_mtcars <- mtcars

# create data base connection
con <- RSQLite::dbConnect(SQLite(), ":memory:")

# copy tables to data base
dplyr::copy_to(con, sql_mtcars)
dplyr::copy_to(con, cars_origin_tbl)

# check contents of data base
dbListTables(con)
## [1] "cars_origin_tbl" "sql_mtcars"      "sqlite_stat1"    "sqlite_stat4"
# create local pointer to remote table
sql_mtcars_db <- tbl(con, "sql_mtcars")
cars_origin_tbl_db <- tbl(con, "cars_origin_tbl")

# preview the SQL that dplyr generates for the following pipelines

tbl(con, "sql_mtcars") %>%
  dplyr::select(car, mpg, wt) %>%
  dplyr::show_query()
## <SQL>
## SELECT `car`, `mpg`, `wt`
## FROM `sql_mtcars`
tbl(con, "sql_mtcars") %>%
  dplyr::mutate(car2 = car) %>%
  dplyr::show_query()
## <SQL>
## SELECT *, `car` AS `car2`
## FROM `sql_mtcars`
sqldf("SELECT *,
      row_number() over (partition by cyl) AS rn   
      FROM mtcars")
tbl(con, "sql_mtcars") %>%
  dplyr::select(mpg, cyl, disp) %>%
  # dplyr::mutate(car2 = car) %>%
  dplyr::show_query()
## <SQL>
## SELECT `mpg`, `cyl`, `disp`
## FROM `sql_mtcars`

SELECT `car`, `mpg`, `wt`
FROM `sql_mtcars`
Displaying records 1 - 10
car mpg wt
Mazda RX4 21.0 2.620
Mazda RX4 Wag 21.0 2.875
Datsun 710 22.8 2.320
Hornet 4 Drive 21.4 3.215
Hornet Sportabout 18.7 3.440
Valiant 18.1 3.460
Duster 360 14.3 3.570
Merc 240D 24.4 3.190
Merc 230 22.8 3.150
Merc 280 19.2 3.440

2.2 Joining New Datasets

For the sake of being thorough, we will practice every different kind of two-table join. Two-table joins fall into two categories:

  • mutating joins
  • filtering joins

A mutating join combines variables from two tables. It does this by matching observations on their common variables, called keys, and copying variables from one table to the other. A filtering join, on the other hand, filters the rows of one table based on whether they match observations in the other table.

In order to practice the different kinds of two-table joins, we first arbitrarily blank out one of the observations in our new cars_origin_tbl. In this case, we replace both the car name and the origin for the Datsun 710 row with NA, pretending we do not know that the Datsun is Japanese. Removing these values will make it clear what is happening with the different joins.

cars_origin_tbl$origin[cars_origin_tbl$car == "Datsun 710"] <- NA
cars_origin_tbl$car[cars_origin_tbl$car == "Datsun 710"] <- NA

head(cars_origin_tbl)

Mutating Joins

full_join

Our first try is to use a full_join from dplyr.

If we look at the head of our new dataframe, we see that the Datsun 710 row from mtcars was kept and its origin was filled in with NA, since there is no matching car name in cars_origin_tbl.

However, if we look at the tail of our new dataframe, we see that there is a new row that is NA all the way through. This is because a full join keeps every row from the left-hand table and also every row from the right-hand table, including the right-hand row where both car and origin are NA.

Looking at the dimensions of our new dataframe, it has one additional column (the origin column) and 33 rows (the original 32 plus the new all-NA row joined at the bottom).

mtcars_full_join <- dplyr::full_join(mtcars, cars_origin_tbl, by = 'car') 

head(mtcars_full_join, n = 3)
tail(mtcars_full_join, n = 2)
# length(mtcars_full_join) # 13
# nrow(mtcars_full_join) # 33

remove(mtcars_full_join)
left_join

Our next try is to use a left_join from dplyr.

left_join() includes all rows from the left table and only the matched rows from the right table.

If we look at the head of our new dataframe, we again see that Datsun 710 was kept from the original dataframe and matched to an NA origin.

Unlike full_join, there is no additional row added for the NA value of the car variable in the second dataframe.

Looking at the dimensions of our new dataframe, it again has one additional column (origin) but only the 32 rows from the original dataframe.

mtcars_left_join <- left_join(mtcars, cars_origin_tbl, by = 'car') 

# an alternate way to write the code
mtcars %>% dplyr::left_join(cars_origin_tbl, by = 'car') -> mtcars_left_join

head(mtcars_left_join, n = 3)
tail(mtcars_left_join, n = 2)
# length(mtcars_left_join) # 13
# nrow(mtcars_left_join) # 32
right_join

Our next try is to use a right_join from dplyr.

right_join() is different from left_join() because of which unmatched rows are kept.

Whereas left_join() includes all rows from the left table and only the matched rows from the right table, right_join() returns all rows from the right table and only the matched rows from the left table.

In the result, Datsun 710 is missing because its name was replaced with NA in the right-hand table, so it has no match. However, the all-NA row from the right-hand table is included.

mtcars_right_join <- right_join(mtcars, cars_origin_tbl, by = 'car') 

# an alternate way to write the code
mtcars %>% dplyr::right_join(cars_origin_tbl, by = 'car') -> mtcars_right_join

head(mtcars_right_join, n = 3)
tail(mtcars_right_join, n = 2)
# length(mtcars_right_join) # 13
# nrow(mtcars_right_join) # 32

remove(mtcars_right_join)

inner_join

An inner_join is the last of the mutating joins: it merges two data frames and returns only the rows that match in both tables, along with the columns from both. You need to specify a common key for R to use to match the observations.

mtcars_inner_join <- inner_join(mtcars, cars_origin_tbl, by = 'car') 

# an alternate way to write the code
mtcars %>% dplyr::inner_join(cars_origin_tbl, by = 'car') -> mtcars_inner_join

# nrow(mtcars_inner_join) # only 31
# length(mtcars_inner_join) # 13

head(mtcars_inner_join, n = 3)
tail(mtcars_inner_join, n = 2)
remove(mtcars_inner_join)

Filtering Joins

semi_join

A semi_join returns the rows of the first table where it can find a match in the second table.

mtcars_semi_join <- semi_join(mtcars, cars_origin_tbl, by = 'car') 

# an alternate way to write the code
mtcars %>% dplyr::semi_join(cars_origin_tbl, by = 'car') -> mtcars_semi_join

# nrow(mtcars_semi_join) # only 31
# length(mtcars_semi_join) # only 12


# another method using value matching

mtcars_semi_join <- mtcars %>%
  filter(car %in% cars_origin_tbl$car)

# nrow(mtcars_semi_join) # only 31
# length(mtcars_semi_join) # only 12

head(mtcars_semi_join, n = 3)
tail(mtcars_semi_join, n = 2)
remove(mtcars_semi_join)
anti_join

An anti_join returns the rows of the first table for which it cannot find a match in the second table. Anti joins are a type of filtering join, since they return the contents of the first table with its rows filtered depending upon the match conditions.

The rows returned by semi_join(mtcars, cars_origin_tbl) plus the rows returned by anti_join(mtcars, cars_origin_tbl) together make up all of mtcars.
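We can check this claim directly (a quick sketch): every row of mtcars either has a matching car in cars_origin_tbl or it does not, so the two row counts must add up to nrow(mtcars).

# filtering joins partition the rows of the left-hand table, so this should return TRUE
nrow(semi_join(mtcars, cars_origin_tbl, by = 'car')) +
  nrow(anti_join(mtcars, cars_origin_tbl, by = 'car')) == nrow(mtcars)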

mtcars_anti_join <- anti_join(mtcars, cars_origin_tbl, by = 'car') 

# an alternate way to write the code
mtcars %>% dplyr::anti_join(cars_origin_tbl, by = 'car') -> mtcars_anti_join
# length(mtcars_anti_join) # 12
# nrow(mtcars_anti_join) # only 1

remove(mtcars_anti_join)

# what about joining a table on more than one category?

-- semi join
SELECT * FROM `sql_mtcars`
WHERE EXISTS (
  SELECT 1 FROM `cars_origin_tbl`
  WHERE (sql_mtcars.car = cars_origin_tbl.car)
)
Displaying records 1 - 10
index car mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

-- left join
SELECT `index`, `sql_mtcars`.`car` AS `car`, `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `origin`
FROM `sql_mtcars` 
LEFT JOIN `cars_origin_tbl` 
ON (`sql_mtcars`.`car` = `cars_origin_tbl`.`car`)

/*
SELECT `index`, `sql_mtcars`.`car` AS `car`, `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `origin`
FROM `sql_mtcars` 
LEFT JOIN `cars_origin_tbl` USING (`car`)
*/
Displaying records 1 - 10
index car mpg cyl disp hp drat wt qsec vs am gear carb origin
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Japan
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Japan
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Japan
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 United States
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 United States
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 United States
7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 United States
8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Germany
9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Germany
10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Germany

2.3 Formatting Our Data

Data Types

After we investigate our data in order to understand the data types for each of our column vectors, we also want to pay special attention to the data types as they relate to our research questions. In particular, we can think about our variables as belonging to two distinct categories: numeric and categorical.

  • Numeric variables are measurements where the numbers themselves carry quantitative meaning (they can be meaningfully averaged or added).
  • Categorical variables take values that label groups or categories, even when those values happen to be stored as numbers.

It’s important to understand this distinction because R may read a column vector as numeric when its everyday meaning is better understood as a category. For example, the cyl column in mtcars has three unique values: 4, 6, and 8. R stores this column as a numeric data type because it recognizes 4, 6, and 8 as numbers. However, the true meaning here is the type of engine: a 4-, 6-, or 8-cylinder engine. For this reason, it is best to convert the cyl column to a factor data type.

The mtcars documentation (help(mtcars)) helps us decide which variables should be treated as factors.

We first fill the Datsun 710’s origin back in, then change the name of mtcars_left_join back to mtcars, just to keep things simple.

mtcars_left_join$origin[mtcars_left_join$car == "Datsun 710"] <- "Japan"

mtcars <- mtcars_left_join  

mtcars <- mtcars %>% 
  mutate(am = as.factor(am), cyl = as.factor(cyl), gear = as.factor(gear), vs = as.factor(vs),  carb = as.factor(carb))

# split the data frame into two: one dataframe for the factor (categorical) variables and one for the numeric variables

mtcars_factor <- mtcars %>%
  dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) 

mtcars_numeric <- mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) 

Data Structure

Wide-format data is easier to read and interpret, but long-format data is better suited to database structures. The tidyr package has useful functions to transform data between the two. To convert from wide to long, sometimes referred to as ‘lengthening’ our data, we use the pivot_longer() function, which increases the number of rows and decreases the number of columns. To convert from long to wide we use pivot_wider(). These functions replace tidyr’s older gather() and spread(), which were in turn inspired by melt() and cast() from the reshape/reshape2 packages.

mtcars_long_numeric <- pivot_longer(mtcars_numeric, names_to = 'names', values_to = 'values', 4:8)

mtcars_long_numeric_with_mpg <- pivot_longer(mtcars_numeric, names_to = 'names', values_to = 'values', 3:8)
# nrow(mtcars_long_numeric) # 160 observations

mtcars_numeric <- pivot_wider(mtcars_long_numeric, names_from = 'names', values_from = 'values')

# spread vs gather
# melt vs 

# mtcars_long <- mtcars %>%
#   tidyr::pivot_longer(everything(), names_to = 'names', values_to = 'values')
 
remove(mtcars_factor)
remove(mtcars_numeric)
remove(mtcars_left_join)

mtcars %>%
  dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
  tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  head(n = 3)
mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  head(n = 3)
mtcars
# write the cleaned data to a csv; write.csv() always writes column names, so col.names is not needed
write.csv(mtcars, '/Users/josefwaples/Desktop/mtcars.csv', row.names = FALSE)

2.4 Missing Values

Types of Missingness

We check to see if there are any NA values in our dataset.

We keep in mind that values can be missing for different reasons.

  • Missing at random (MAR)
  • Missing completely at random (MCAR)
  • Missing not at random (MNAR)

MAR (missing at random): the probability of being missing is the same only within groups defined by the observed data.

MCAR (missing completely at random): the probability of being missing is the same for all cases. This implies that the causes of the missing data are unrelated to the data.

MNAR (missing not at random): the probability of being missing varies, for reasons that are unknown to us.

Imputation

Determining the cause of missingness is important in making decisions about imputation. We may consider simple approaches such as mean, median, or mode imputation.

# Let's check for NA values 
any(is.na(mtcars)) # FALSE
## [1] FALSE
# is.na(mtcars) # this might help us check the exact location of the NA values. 
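Although mtcars has no missing values, a minimal sketch of median imputation (purely hypothetical here, shown only to illustrate the idea) could look like this:

# hypothetical median imputation: replace any NA in mpg with the median of the observed values
mtcars_imputed <- mtcars %>%
  dplyr::mutate(mpg = ifelse(is.na(mpg), median(mpg, na.rm = TRUE), mpg))

remove(mtcars_imputed)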

3 Exploratory Data Analysis

Exploratory Data Analysis is the process of doing first investigations in order to check assumptions, uncover patterns, and discover anomalies. Exploratory Data Analysis can take the form of summary statistics and visualizations. It’s often useful to start exploring the characteristics of each individual variable with univariate visualizations and statistics before exploring the relationships between variables with bivariate visualizations and statistics.

3.1 Common Visualizations

When examining a dataset, it may be challenging to figure out the most appropriate visualization for any particular story, and it’s easy to get stuck considering many options. It’s important to remember that there is no perfect answer when visualizing data. Every visualization is part of a larger story and always has to be understood in context; we introduce bias at every step of the process, so our lodestar is our hypothesis, and we keep that goal in mind throughout. The following points can serve as a good heuristic for deciding which visualization to use. There are many different kinds of visualizations, so this is not a comprehensive list, just a guide.

  • Histograms, density plots, and boxplots are useful in showing the distribution or spread of a variable.
  • Boxplots, bar plots and dot plots are useful in comparing values across groups.
  • Pie charts and donut charts are useful in showing composition.
  • Scatterplots are useful in showing relationships between continuous variables.

We can create these visualizations by using the ggplot2 package from the tidyverse set of packages. There is a particular philosophy that accompanies ggplot2. With the ggplot() system, we supply a dataset, choose aesthetic mappings, add layers and customize scales, coordinates, facets, graphical primitives, and many other things. In addition to being highly extensible, ggplot() also can be paired easily with companion packages.

In the following three-paneled visualization, we show the successive process of adding layers with ggplot2. In the first case, we supply only a dataset to the ggplot() function; without any other arguments, we see a blank canvas. In the second case, we add an x-axis aesthetic mapping. In the third case, we also add a geom layer in the form of a histogram. We can link these three separate visualizations together by using a library called patchwork, which is one of many ggplot2 extensions. In order to link the visualizations together, we have to save each visualization as its own variable.

In the following visualization, we also include a few extra lines of code in order to make our graphs look nice and to standardize our style. We add a title and subtitle to our visual by using the title and subtitle arguments within the labs layer of ggplot(). We also use the theme layer to center the title and subtitle with hjust = 0.5 and to make the title bold with face = "bold".

one <- ggplot2::ggplot(data = mtcars) + 
  labs(title = "Mtcars", subtitle = "Blank Canvas") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

two <- ggplot2::ggplot(data = mtcars, aes(x = mpg)) + 
  labs(title = "Mtcars", subtitle = "+ Aesthetic Mapping") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

three <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) + 
  geom_histogram() +
  labs(title = "Mtcars", subtitle = "+ Geom Layer") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 
  
library(patchwork)
one + two + three

# ggtitle() is a useful shortcut for setting the title.

# ?after_stat(count)

Distribution

Histograms

Univariate visualizations such as histograms are often a good place to start because they give us an opportunity to explore the distributions of the variables in our dataset. Structurally, ggplot2 requires two position scales for the x and y aesthetics. Because histograms are a univariate visualization, only our x aesthetic is found as a variable in our data; our y aesthetic will be a computed variable. We can set up our histogram by mapping our x aesthetic to our variable and our y aesthetic to the computed count: y = after_stat(count).

What is after_stat(count) really doing in our code? after_stat() tells ggplot() to delay evaluating the mapping until after the statistical transformation has run, so count here refers to the number of observations that fall into each bin. The height of the histogram bars is therefore mapped to that count, and the number of bars is controlled by the bins argument of geom_histogram(). By default, the bins argument is set to 30. In many cases, the histogram will work well without having to adjust the default number of bins. In our case, however, because we know that our mtcars dataset has 32 observations, we can specify bins = 32 as an argument of geom_histogram() and override the default setting. Doing this may make our visualization a little easier to interpret.

The binwidth argument is another option: instead of specifying how many bins to use, we specify how wide each bin should be, in the units of the x-axis variable.
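For example (a sketch reusing the aesthetics from above), we could ask for bins that are 2 mpg wide rather than specifying how many bins to draw:

# histogram specified by binwidth (2 mpg per bar) instead of a bin count
ggplot2::ggplot(data = mtcars, aes(x = mpg)) + 
  geom_histogram(binwidth = 2, color = 'black', fill = '#ffd06f') +
  labs(title = "Mtcars", subtitle = "Histogram with binwidth = 2")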

In the following example, we also set the color and fill arguments of geom_histogram() to change the border color of the histogram columns and the color inside the columns, respectively. We specify the colors using hexadecimal color codes; there are other ways to specify colors as well.

one <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) + 
  geom_histogram(bins = 32, color = 'black', fill = '#ffe6b7') +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

two <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) + 
  geom_histogram(bins = 32, color = 'black', fill = '#e76254') +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_x_reverse()

three <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) + 
  geom_histogram(bins = 32, color = 'black', fill = '#ef8a47') +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_y_reverse()

# The second and third graph would likely never be used. We use these graphs only to show how the `scale_x_reverse()` and `scale_y_reverse()` functions are working within `ggplot()`, and also to show how to customize different colors.
library(patchwork)
one

We can also pivot our data frame into a long data frame and use a technique called faceting to create histograms for multiple variables side by side. In the first case, we use facet_grid() and in the second case we use facet_wrap(). The main difference is that facet_grid() shares the y-axis scale across panels, which obviates the need for y-axis ticks on every panel; most of the aesthetic difference between facet_grid() and facet_wrap() is a consequence of this design choice. Notice that in order to create histograms side by side, we first have to convert our wide dataframe into a long dataframe using the pivot_longer() function from the tidyr library. Finally, we add another layer to our ggplot2 object by adding a caption to our graph.

In the following code, we continue to create histograms by using geom_histogram(). This time, however, we don’t specify y = after_stat(count) in the aesthetic mapping. In fact, we don’t specify any y aesthetic explicitly and ggplot() knows how to handle this and our code will compile all the same.

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, fill = names)) +
  facet_grid(~names, scales = 'free') + # facet_grid automatically has to define a y scale
  geom_histogram(bins = 32, color = 'black', show.legend = FALSE) +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = 'facet_grid') +
  scale_fill_manual(values=met.brewer("Hiroshige", 6))

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, color = names)) +
  facet_wrap(~names, scales = 'free_x') + # facet_wrap lets each panel keep its own x scale
  geom_histogram(bins = 32, fill = 'black', show.legend = FALSE) +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) + 
  labs(caption = 'facet_wrap') +
  scale_color_manual(values=met.brewer("Hiroshige", 6))

# ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
#   geom_histogram(bins = 1, color = 'black', fill = '#ffe6b7') +
#   labs(title = "Mtcars", subtitle = "Histogram") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5))
# 
# ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
#   geom_histogram(bins = 10, color = 'black', fill = '#ffe6b7') +
#   labs(title = "Mtcars", subtitle = "Histogram") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5))
# 
# ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
#   geom_histogram(bins = 5000, color = 'black', fill = '#ffe6b7') +
#   labs(title = "Mtcars", subtitle = "Histogram") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5))

Density Plots

In addition to creating histograms, we can also create density plots. Density plots are similar to histograms except with smoothed edges. The amount of smoothing is controlled by the bandwidth; by default, geom_density() chooses the bandwidth with a rule of thumb (bw = "nrd0") based on the spread of the data. Within geom_density(), we can override the default through the bw argument; here we set it to the standard deviation of the mpg vector, sd(mtcars$mpg), which is 6.026948 and gives a noticeably smoother curve than the default. In our facet_wrap() version, we specify an alpha value of 0.25, which is a way of adding transparency to our visualization. Alpha here refers to opacity; values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.
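We can inspect both bandwidths directly in the console (a quick check; bw.nrd0() is the rule-of-thumb function that geom_density() uses by default):

bw.nrd0(mtcars$mpg) # the default rule-of-thumb bandwidth, roughly 2.5
sd(mtcars$mpg) # 6.026948, the larger bandwidth we pass in for a smoother curve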

ggplot(data = mtcars, aes(x = mpg)) + 
  geom_density(color = 'black', fill = '#ffe6b7', bw = sd(mtcars$mpg)) +
  labs(title = "Mtcars", subtitle = "Density Plot") +
  theme(plot.title = element_text(face = "bold")) 

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, color = names)) +
  facet_grid(~names, scales = 'free') + # facet_grid automatically has to define a y scale
  geom_density(fill = 'black', show.legend = FALSE)  +
  labs(title = "Mtcars", subtitle = "Density Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_color_manual(values=met.brewer("Hiroshige", 6))

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, fill = names)) +
  facet_wrap(~names, scales = 'free') +
  geom_density(color = 'black', alpha = 0.25, show.legend = FALSE) +
  labs(title = "Mtcars", subtitle = "Density Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 6))

Notice that in the above visualizations, facet_grid() has a disadvantage with our density plots: the density for the disp and hp variables at any given point is almost zero, because those variables are spread over a much wider range. Since facet_grid() shares a common y-axis scale across the facets, the disp and hp panels appear flat in our visualization, almost as though we forgot to chart them. For this reason, facet_grid() should be used with care when the variables naturally sit on very different y-axis scales.

We have another option if we lean on a library called ggridges. Its geom_density_ridges() function layers density plots on top of each other. However, when the variables sit on very different x-axis ranges, some curves may look artificially smooth, and if we zoom the x-axis in far enough to see the variation in one group, other groups may fall outside the plot. For this reason, we include two versions below with different x-axis limits to illustrate the limitations of each, so care should be used.

one <- mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, y = names)) + 
  geom_density()

two <- mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, y = names)) + 
  ggridges::geom_density_ridges()

three <- mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, y = names)) + 
  ggridges::geom_density_ridges() + xlim(0, 40)

library(patchwork)
one + two + three

Comparison

Dotplots

A dot plot is a way of representing each observation as a point along an axis. In our case, we create one dot for each car to show the mpg associated with that car along a number line. We do this in order to get a sense of the trend, grouping, and dispersion in the data.

We can also use facet_wrap() to create a dot plot for each numeric variable in our dataframe. We combine facet_wrap() with the reorder_within() function from the tidytext library in order to reorder the cars within each facet. A downside of this technique is that the font becomes hard to read when we include all the variables.

ggplot(mtcars, aes(x = mpg, y = reorder(car, mpg))) +
  geom_point(color = '#e76254') +
  labs(title = "Mtcars", subtitle = "Dot Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

mtcars %>%
  dplyr::select(car, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 2:7) %>%
  mutate(car = tidytext::reorder_within(car, values, names)) %>%
  ggplot(aes(x = values, y = reorder(car, values), color = names)) +
  geom_point() +
  labs(title = "Mtcars", subtitle = "Dot Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  tidytext::scale_y_reordered() +
  facet_wrap(~names, scales = 'free')+
  theme(text = element_text(size=6)) +
  scale_color_manual(values=met.brewer("Hiroshige", 6)) +
  theme(legend.position = 'none')

A lollipop plot consists of a marker and a stem; we can turn our dot plot into one by adding a geom_segment() layer that draws a stem from each point back to zero.

ggplot(mtcars, aes(x = mpg, y = reorder(car, mpg))) +
  geom_point(color = '#ef8a47') +
  labs(title = "Mtcars", subtitle = "Dot Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  geom_segment( aes(x=mpg, xend=0, y=car, yend=car), color="#ef8a47") 

mtcars %>%
  dplyr::select(car, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 2:7) %>%
  mutate(car = tidytext::reorder_within(car, values, names)) %>%
  ggplot(aes(x = values, y = reorder(car, values), color = names)) +
  geom_point() +
  labs(title = "Mtcars", subtitle = "Dot Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  tidytext::scale_y_reordered() +
  facet_wrap(~names, scales = 'free')+
  theme(text = element_text(size=6)) +
  scale_color_manual(values=met.brewer("Hiroshige", 6)) +
  theme(legend.position = 'none') +
  geom_segment( aes(x=values, xend=0, y=car, yend=car), color='black')


plt.close()
plt.cla()
plt.clf()

mtcars['car'] = mtcars['mpg'].rank(ascending=True)

seaborn.scatterplot(data = mtcars, x="mpg", y="car", color = '#ef8a47')
plt.title("dot plot", loc = 'left')
plt.xlabel("mpg")
plt.ylabel("car")
plt.style.use('ggplot')
plt.tight_layout()
plt.hlines(y = mtcars['car'], xmin=0, xmax=mtcars['mpg'], alpha =0.8, color = '#ef8a47')
plt.show()

Strip Plots

In addition to creating a visualization that shows a dot for each car, we can also create a visualization with a larger grouping. We can use a strip plot to create a string of dots for cars from each country of origin.

The following three panels show different versions of strip plots. The panel on the left shows a series of dots for cars from each country of origin. In this first panel, however, some dots overlap, as in the case of two Japanese cars where mpg = 21 for both. The middle panel makes use of an alpha argument in the geom_point() layer by setting alpha = 0.5. Whereas in our previous visualization we set the alpha value below 1 for aesthetic reasons, here the alpha argument is useful in allowing us to see overlapping dots. Similar to the panel in the middle, the panel on the right also distinguishes overlapping dots, except this time by adding a jitter, which is random noise. We could specify jitter in the following way: geom_point(position = position_jitter()) or else we could use geom_jitter() as a shortcut. Finally, we remove the y-axis title, ticks, and labels by setting them to element_blank().

one <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg))) +
  geom_point() +
  labs(title = "Mtcars", subtitle = "Strip Plots") +
  theme(plot.title = element_text(face = "bold")) +
  theme(axis.title.y = element_blank())

two <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg))) +
  geom_point(alpha = 0.5, color = '#D55E00') +
  theme(axis.title.y = element_blank()) +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank())

three <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg))) +
  geom_jitter(width = 0.1, height = 0.1, color = '#0072B2') +
  theme(plot.title = element_text(face = "bold")) +
  theme(axis.title.y = element_blank()) +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank())

library(patchwork)
one + two + three

Boxplots

The order of our geom layers matters for the visual output. In this case, we decide to put geom_boxplot() before geom_jitter(); as a result, the points are overlaid on top of the boxplot.

We can change the x or y axis labels by using the xlab() or ylab() layer. In the following example, we change the y-axis label.

Boxplots are a great way of seeing the interquartile range. Japanese, German, and U.S. cars all show a right skew, while the Italian cars do not.

ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), color = origin)) +
  geom_boxplot() + geom_jitter(width=0.1, height=0.1) +
  theme(legend.position = "none") +
  labs(title = "Mtcars", subtitle = "Box Plot") +
  theme(plot.title = element_text(face = "bold")) +
  ylab("country") +
  scale_color_manual(values=met.brewer("Hiroshige", 6))


plt.close()
plt.cla()
plt.clf()

order = mtcars.groupby('origin')['mpg'].mean().sort_values(ascending=False)
# sort mpg
seaborn.boxplot(x="mpg", y="origin", data=mtcars, order=order.index)
plt.title("boxplot", loc = 'left')
plt.show()

We can also create a boxplot for each of the numeric variables in our dataset. Notice that some of the variables have dots floating outside the whiskers of the boxplot. An outlier is defined as a point more than 1.5 times the interquartile range above the upper quartile or below the lower quartile. Here we also adjust the axis limits; because of coord_flip() this is done with ylim(), and we remove the axis text while leaving the axis ticks.

\[ \begin{aligned} \text{lower fence} &= Q_1 - 1.5 \times IQR \\ \text{upper fence} &= Q_3 + 1.5 \times IQR \end{aligned} \]
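As a quick sketch, we can compute these fences for the mpg variable with base R; any mpg value outside the fences is the kind of point geom_boxplot() draws as an individual dot:

q1 <- quantile(mtcars$mpg, 0.25)
q3 <- quantile(mtcars$mpg, 0.75)
iqr <- IQR(mtcars$mpg)

# lower and upper outlier fences for mpg
c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)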

mtcars %>%
  dplyr::select(3:13) %>%
  dplyr::select_if(is.numeric) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 2:6) %>%
  ggplot(aes(x = values, color = names)) +
  geom_boxplot() + coord_flip() +
  facet_wrap(~names, scales = 'free_y') +
  theme(legend.position = "none") +
  labs(title = "Mtcars", subtitle = "Box Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_color_manual(values=met.brewer("Hiroshige", 5)) +
  ylim(-0.8, 0.8) +
  theme(axis.text.x=element_blank()) 

Violin Plots

In this case, we have added the geom_jitter before the geom_violin. As a result, the violin is layered over the points.

# 
# ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
#   geom_jitter(width=0.1, height=0.1) + geom_violin() +
#   theme(legend.position = "none") +
#   labs(title = "Mtcars", subtitle = "Violin Plot") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5)) +
#   ylab("country") +
#   scale_fill_manual(values=met.brewer("Hiroshige", 6))
# 
# ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
#   geom_jitter(width=0.1, height=0.1) + geom_boxplot(alpha = 0.5) + 
#   theme(legend.position = "none") +
#   labs(title = "Mtcars", subtitle = "Violin Plot") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5)) +
#   ylab("country") +
#   scale_fill_manual(values=met.brewer("Hiroshige", 6))

ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
  geom_jitter(width=0.1, height=0.1) + geom_violin(alpha = 0.5) + 
  theme(legend.position = "none") +
  labs(title = "Mtcars", subtitle = "Violin Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  ylab("country") +
  scale_fill_manual(values=met.brewer("Hiroshige", 6))


plt.close()
plt.cla()
plt.clf()

order = mtcars.groupby('origin')['mpg'].mean().sort_values(ascending=False)
# sort mpg
seaborn.violinplot(x="mpg", y="origin", data=mtcars, order=order.index)
plt.title("violin plot", loc = 'left')
plt.show()

But how does the violin plot get its shape? The “violin” shape of a violin plot comes from the data’s density plot: you turn that density plot sideways and put it on both sides of a central axis, mirroring each other.

one <- mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  # pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
  ggridges::geom_density_ridges() + xlim(0, 40) +
  ylab("country") +
  scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
  theme(legend.position = 'none') +
    labs(title = "Mtcars", subtitle = "Density Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

two <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
  geom_boxplot() + geom_jitter(width=0.1, height=0.1) +
  theme(legend.position = "none") +
  labs(title = "Mtcars", subtitle = "Box Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
  theme(axis.text.y=element_blank()) +
  ylab("")

three <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
  geom_jitter(width=0.1, height=0.1) + geom_violin() +
  theme(legend.position = "none") +
  labs(title = "Mtcars", subtitle = "Violin Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
  theme(axis.text.y=element_blank()) +
  ylab("")

library(patchwork)
one + two + three

The ggplot2 package’s facet options were not designed for varying axis labels and scales across facets. Placing panels next to each other with very different scales is something that visual design experts tend to avoid because it can be misleading. As a result, facet_wrap() may not be the best technique for boxplots and violin plots when the groupings have different scales. Instead, we can lean on the patchwork package to link separate boxplots together.

one <- mtcars %>%
  dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
  tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  mutate(names = factor(names)) %>%
  mutate(values = as_factor(values)) %>%
  filter(names == 'am') %>%
  ggplot(aes(x = mpg, y = values)) +
  geom_boxplot(fill = '#e76254') + coord_flip() + facet_wrap(~names)
#
two <- mtcars %>%
  dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
  tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  mutate(names = factor(names)) %>%
  mutate(values = as_factor(values)) %>%
  filter(names == 'carb') %>%
  ggplot(aes(x = mpg, y = values)) +
  labs(title = "Mtcars", subtitle = "Boxplot Panels") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  geom_boxplot(fill = '#ef8a47') + coord_flip() + facet_wrap(~names)

three <- mtcars %>%
  dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
  tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  mutate(names = factor(names)) %>%
  mutate(values = as_factor(values)) %>%
  filter(names == 'cyl') %>%
  ggplot(aes(x = mpg, y = values)) +
  geom_boxplot(fill = '#f7aa58') + coord_flip() + facet_wrap(~names)

four <- mtcars %>%
  dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
  tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  mutate(names = factor(names)) %>%
  mutate(values = as_factor(values)) %>%
  filter(names == 'gear') %>%
  ggplot(aes(x = mpg, y = values)) +
  geom_boxplot(fill = '#ffd06f') + coord_flip() + facet_wrap(~names)

five <- mtcars %>%
  dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
  tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  mutate(names = factor(names)) %>%
  mutate(values = as_factor(values)) %>%
  filter(names == 'vs') %>%
  ggplot(aes(x = mpg, y = values)) +
  geom_boxplot(fill = '#ffe6b7') + coord_flip() + facet_wrap(~names)


library(patchwork)
one + two + three + four + five


Bar Graphs

Stat = Count

It’s important for bar graphs to have a zero baseline.

If you don’t supply the stat argument, the default is stat = 'count', which counts the number of rows in each group.

Stat = Identity

In this case, we add a forty-five degree tilt to the x-axis labels for readability.

Note that we use fct_infreq() from the forcats package to order the bars by frequency.

one <- ggplot(mtcars, aes(x = forcats::fct_infreq(origin), fill = origin)) + 
  geom_bar(stat = 'count', alpha = 0.3, color = 'black') + 
  labs(title = "Mtcars", subtitle = "Bar Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 6)) + 
  theme(legend.position = 'none') +
  labs(caption = "Stat = Count") +
  theme(axis.text.x = element_text(angle = 45))

two <- mtcars %>%
  group_by(origin) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = reorder(origin, mean_mpg), y = mean_mpg, fill = origin)) + 
  geom_bar(stat = 'identity', color = 'black', alpha = 0.3) +
  labs(title = "Mtcars", subtitle = "Bar Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
  theme(legend.position = 'none') +
  labs(caption = "Stat = Identity") +
  theme(axis.text.x = element_text(angle = 45))

one + two

US cars are the heaviest, with the biggest engines, the most horsepower, and the lowest mpg.

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  group_by(origin, names) %>%
  summarise(mean_value = mean(values)) %>%
  ggplot(aes(x = reorder(origin, mean_value), y = mean_value)) + 
  geom_bar(stat = 'identity', color = 'black', fill = 'lightblue') +
  facet_wrap(~names, scales = 'free') +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title = "Mtcars", subtitle = "Bar Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

  # scale_fill_manual(values=met.brewer("Hiroshige", 5))

Composition

Donut Chart

Pie charts may be a little harder to interpret because some people read them by angle and some by area. Donut charts may ease this problem by removing the center, so the comparison rests on arc length.

donut_data <- mtcars %>%
  group_by(origin) %>%
  tally()

donut_data$fraction = donut_data$n / sum(donut_data$n)
donut_data = donut_data[order(donut_data$fraction), ]
donut_data$ymax = cumsum(donut_data$fraction)
donut_data$ymin = c(0, head(donut_data$ymax, n=-1))

ggplot(donut_data, aes(fill = origin, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(0, 4)) +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank()) +
annotate("text", x = 0, y = 0, label = "") +
labs(title="Mtcars", subtitle = "Origin") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "donut") +
scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
theme(axis.title.y = element_blank()) +
theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank()) +
theme(axis.title.x = element_blank()) +
theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

Relationship

Scatterplot

We can create a scatterplot in order to examine the relationship between two variables. Our method is straightforward using ggplot2 and the grammar of graphics workflow. First, we feed our dataframe as an argument into the ggplot() call. Next, we add an aesthetic mapping and decide which variables will be represented along the x and y axes. Finally, we add the geom_point() layer to create a scatterplot.

We can also put more than one variable against each other with facet_wrap. This is useful if we are interested in looking at multiple relationships at the same time. We have arranged our data to have a common y-axis, which is the mpg variable. We choose to hold the y-axis scales the same for each variable by specifying scales = 'free_x' within the facet_wrap layer.

We can see at first glance that two of our variables have a positive relationship with mpg and three of our variables have a negative relationship with mpg. We can also see that the relationship seems like a closer relationship with some variables than others. For example, the relationship of mpg with wt seems to be stronger than the relationship of mpg with qsec.

ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point(color = "#f7aa58") +
  labs(title = "Mtcars", subtitle = "Scatter Plot") +
  theme(plot.title = element_text(face = "bold")) +
  theme(legend.position = 'none') 

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  ggplot(aes(x = values, y = mpg, color = names)) +
  geom_point() +
  facet_wrap(~names, scales = 'free_x') +
  labs(title = "Mtcars", subtitle = "Scatter Plot") +
  theme(plot.title = element_text(face = "bold")) +
  scale_color_manual(values=met.brewer("Hiroshige", 5))

2d Density Plots

We can also add geom_rug(), which draws the marginal distribution of each variable as tick marks along the axes.

one <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_density_2d() + geom_point() +
  labs(title = "Mtcars", subtitle = "2D Density Plot") +
  theme(plot.title = element_text(face = "bold")) 
  # theme(legend.position = 'none')

two <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_density_2d_filled() +
  geom_rug() +
  theme(legend.position = 'none')

library(patchwork)
one + two


# ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
#   geom_bin2d() + geom_point() +
#   labs(title = "Mtcars", subtitle = "2D Density Plot") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5)) 
# 
# ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
#   stat_bin_hex(geom = "hex", position = "identity") + geom_point() +
#   labs(title = "Mtcars", subtitle = "2D Density Plot") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5)) 

plt.close()
plt.cla()
plt.clf()

seaborn.kdeplot(x="wt", y="mpg", data=mtcars)
plt.title("2D density plot", loc = 'left')
plt.show()

3.2 Summary Tables

4 Theoretical Distributions

A probability distribution shows us the probabilities of different outcomes occurring. This distribution is useful because it describes the full range of possible outcomes of a statistic and the likelihood of each outcome. We can visualize a probability distribution with a histogram or density plot. To read the visualization, we choose a value on the x-axis and find the corresponding height on the y-axis; for a continuous variable this height is a density, and probabilities correspond to areas under the curve.

We can illustrate this process by taking IQ as an example. IQ is commonly modeled as a normal distribution with a mean of 100 and a standard deviation of 15. When we visualize the distribution of IQ scores in a population, we see that about 95% of values fall within two standard deviations of the mean. Or, in the language of probability, there is about a 95% probability that a person sampled at random from the population will have an IQ between 70 and 130. This is the main use case of probability distributions: describing the likelihood of outcomes under given parameters.
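As a rough check of this claim, we can compute the probability directly with pnorm(), which we cover in more detail later in this chapter; the exact area within two standard deviations is closer to 95.45%.

# probability that a randomly sampled person has an IQ between 70 and 130,
# assuming IQ follows a normal distribution with mean = 100 and sd = 15
pnorm(130, mean = 100, sd = 15) - pnorm(70, mean = 100, sd = 15)
## [1] 0.9544997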

Central limit theorem

We start by reviewing what is called the central limit theorem. The central limit theorem states that if we take sufficiently large random samples from a population with mean μ (pronounced mu) and standard deviation σ (sigma), then the distribution of the sample means will be approximately normal, regardless of the shape of the population’s own distribution.
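Here is a minimal sketch of that idea using only base R and ggplot2, with a deliberately skewed exponential population rather than the mtcars data: even though the population is far from normal, the sample means pile up into a bell shape.

set.seed(0125)

# 10,000 sample means, each computed from a random sample of size 50
# drawn from an exponential population, which is strongly skewed
sample_means <- replicate(10000, mean(rexp(n = 50, rate = 1)))

ggplot(data.frame(x = sample_means), aes(x = x)) +
  geom_histogram(bins = 50, color = 'black', fill = '#aadce0') +
  labs(title = "Central Limit Theorem", subtitle = "Distribution of 10,000 sample means")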

4.1 Sampling Distributions

Researchers sometimes talk about sampling distributions. But what is a sampling distribution, exactly, and how does it relate to the general idea of a probability distribution? Essentially, a sampling distribution is a probability distribution created by a particular method: it is obtained through repeated sampling from a population, computing a statistic from each sample, and then building the probability distribution of that statistic. There are different kinds of sampling distributions: we can build a sampling distribution for statistics such as the mean or variance, or we could use a sampling distribution for a proportion to describe the proportion of successes for a given outcome.

Consider a process of sampling a population and then taking a statistic from the sample, such as the mean, and then sampling from the population again in order to find another value of this same statistic. In the end, we will have a distribution of mean values. As the sample size increases, this distribution conforms itself to a normal distribution; we know this to be true because of the central limit theorem.

Why do we create a sampling distribution, and why do we care whether a resampling technique for a parameter conforms to a normal distribution? Sampling distributions come in handy because they simplify the process of statistical inference. A sampling distribution that has taken the shape of a normal distribution in accordance with the central limit theorem is easy to work with because a normal distribution can be described by well-understood equations that fit a wide variety of problems. Furthermore, we can transform a normal distribution into other distributions with different properties, such as the chi-squared or F-distributions.

We can illustrate the workings of the central limit theorem by looking at the hp variable in the mtcars dataset and then comparing it to a resampling of the mean hp. This is, essentially, an illustration of the central limit theorem: we take samples from our population and find the mean value for each sample; then we create a distribution of these mean values and discover that this distribution looks like a normal distribution. The code for creating a resampling method such as this is reviewed in detail later in the chapter.

Resampling Methods

There are four main types of resampling methods. The reason we use a resampling method is because it isn’t practical to keep generating new real samples from our population of interest. Resampling is a way of mimicking the sampling process. For example, when we want to understand something about our population of interest, we may consider running a survey and hoping for one thousand responses. However, no one will administer the same survey into a target population one thousand times. It is because of this practical limitation that we use a resampling method. Here are the four main resampling methods:

  • Bootstrap
  • Jackknife
  • Randomization
  • Monte Carlo

These methods can be used to build the distribution of a statistic based on our data. Once we have created a sample distribution using a resampling method, we can then generate confidence intervals, among other things.

Before diving into specific resampling methods, it’s useful to get familiar with how the base R sample() function works. sample() takes two main arguments: x, the vector or range of values from which we are sampling, and size, the size of our sample. There is an additional argument that is often explicitly specified: replace. Unless otherwise specified, replace = FALSE is the default.

What is the difference between sampling with and without replacement? When we sample with replacement, we replace the value after every draw. Each draw is therefore independent of the values that came before it. When we sample without replacement, we aren’t replacing values in our vector: once a value is selected, it cannot be selected again. In other words, when we sample without replacement, the draws are not independent, so whatever value we get on one draw affects the possible values of the next draw. A consequence is that we cannot choose a sample size larger than the size of the input vector unless we specify replace = TRUE. A fourth commonly used argument, prob, is also interesting: we can specify a vector of probabilities that increases the likelihood of certain values being drawn.

sample_without_replacement <- sample(x = 1:10, size = 10) # replace = FALSE as default
sample_with_replacement <- sample(x = 1:10, size = 10, replace = TRUE)

sample_with_replacement_and_weighted_probability <- sample(x = c(1, 0), size = 10, replace = TRUE, prob = c(0.8, 0.2))
Bootstrap

What are we doing when we bootstrap? When we bootstrap, we resample our data randomly with replacement and recompute a test statistic or metric on each resample. Bootstrapping is especially effective as a way of assigning measures of accuracy, such as bias, variance, confidence intervals, and prediction error, to sample estimates. One way to bootstrap in R is by using the infer package from the tidymodels ecosystem.

mtcars %>%
  dplyr::select(hp) %>%
  specify(response = hp) %>%
  generate(reps = 10000, type = 'bootstrap') %>%
  calculate(stat = 'mean')  -> mtcars_bootstrapped_ci_df

mtcars_bootstrapped_ci_df
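From here we can summarize the bootstrap distribution. A simple, hedged sketch of a 95% percentile confidence interval uses base R's quantile() on the stat column that calculate() returns (infer also provides helpers such as get_confidence_interval() for this):

# 95% percentile interval for the mean of hp from the bootstrap distribution
quantile(mtcars_bootstrapped_ci_df$stat, probs = c(0.025, 0.975))

# point estimate from the original data, for comparison
mean(mtcars$hp)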
Jackknife

The jackknife is a precursor to bootstrapping; bootstrapping was introduced as an extension of and improvement on the jackknife, which was developed when computers had very little memory. Bootstrapping is indeed more versatile and flexible. In the interest of thoroughness we will review jackknifing as well, although the occasions when the jackknife should be used in place of bootstrapping are rare.

The jackknife is a leave-one-out resampling method: it recalculates the statistic of interest successively, each time with one observation removed, until every observation has been left out once. This means the number of resamples is limited to the number of observations, and largely for this reason the jackknife performs somewhat poorly with small sample sizes. The jackknife is also somewhat limited in the kinds of statistics it handles well. On the other hand, unlike the bootstrap, jackknifing is reproducible every time because it involves no random sampling.

In the following example, we run a jackknife on our mtcars dataset. There are 32 leave-one-out samples, one for each observation we could leave out. As we can observe, the distribution of jackknife means is very narrow, so taking confidence intervals directly from it (as we might with a bootstrap distribution) would give intervals that are far too narrow, because the most shift we can get is the shift from leaving a single observation out.

When we look at the jackknife statistics, we see that some but not all of them match the original data. The mean is the same, but the standard deviation is different, so sd(mtcars$mpg) and sd(mtcars$jackknife_mean_mpg) evaluate to different values.

In the following code we write the jackknife ourselves with a simple for loop.

for (i in 1:nrow(mtcars)) {
  mtcars_minus_one <- mtcars[-i,]
  mean_value_minus_one = mean(mtcars_minus_one$mpg)
  mtcars$jackknife_mean_mpg[i] <- mean_value_minus_one
}

ggplot(mtcars, aes(x = jackknife_mean_mpg)) + 
  geom_histogram(bins = 32, color = 'black', fill = '#ef8a47', alpha = 0.75) +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

n <- 32
standard_error <- sd(mtcars$mpg)/sqrt(n)

jacknife_mean_mpg <- mean(mtcars$jackknife_mean_mpg) 

jacknife_standard_error <- sqrt(((n - 1) / n) * sum((mtcars$jackknife_mean_mpg-jacknife_mean_mpg) ^ 2))

t.star <- qt(p = 0.975, df = n - 1) # qt() gives the quantile of the t-distribution: here, the 97.5th percentile with n - 1 degrees of freedom

confidence_intervals <- mean(mtcars$mpg) + c(-1,1) * t.star * standard_error

For the mean, the jackknife estimate is equal to the mean of the original data values, and the jackknifed standard error is equal to the standard error of the mean obtained using all of the data and the familiar formula. So for the mean, the jackknife is just a more convoluted way to compute quantities we already have formulas for; its real value appears with statistics that lack such simple formulas.
Randomization / Permutation

A randomization test, also called a permutation test, builds the distribution of a statistic under the null hypothesis by repeatedly shuffling the labels in our data and recomputing the statistic, as shown in the sketch below. If the observed statistic would be unusual among the shuffled versions, we have evidence against the null hypothesis.
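Here is a minimal sketch of a permutation test using only base R's sample(), asking whether mean mpg differs between automatic and manual transmissions; it assumes the am column still carries its original 0/1 coding and is purely illustrative, not a full treatment of permutation testing.

set.seed(0125)

observed_diff <- mean(mtcars$mpg[mtcars$am == 1]) - mean(mtcars$mpg[mtcars$am == 0])

permuted_diffs <- replicate(10000, {
  shuffled_am <- sample(mtcars$am)   # shuffle the labels, breaking any real association
  mean(mtcars$mpg[shuffled_am == 1]) - mean(mtcars$mpg[shuffled_am == 0])
})

# proportion of shuffled differences at least as extreme as the observed one
mean(abs(permuted_diffs) >= abs(observed_diff))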
Monte Carlo

Monte Carlo simulation uses repeated random sampling to approximate a quantity of interest. In a Monte Carlo simulation, we assign variables random values, run many simulations, and then aggregate the results, often as an average, to produce an estimate.

The following example, with the method taken from Wikipedia, shows how a Monte Carlo simulation works. We first draw a square whose side length is two, so the area inside the square is four (the side length squared). We then inscribe a circle of radius one inside this square. We know that the area of the circle is π times the radius squared. But we can take another approach and use a Monte Carlo simulation to estimate this area: we uniformly scatter a given number of points over the square and count the number of points that fall inside the circle. A point falls within the circle if its distance from the center is less than one. The ratio of points inside the circle to the total number of points approximates the ratio of the two areas, π / 4, so multiplying that ratio by four gives an estimate of π (approximately 3.14159). Something interesting happens in our example below: by chance, we guess a closer value for π with n = 4 than we do with n = 40. This is a good illustration of the random error that can occur in a Monte Carlo simulation.

library(ggforce)

set.seed(0125)

hundredths <- seq(from = -1, to = 1, by = .01)
xvalues <- sample(hundredths, size=4, replace=TRUE)
yvalues <- sample(hundredths, size=4, replace=TRUE)

xy_df <- data.frame(xvalues, yvalues)

xy_df <- xy_df %>% 
  mutate(x_squared = xvalues ^ 2) %>%
  mutate(y_squared = yvalues ^ 2) %>%
  mutate(z_squared = x_squared + y_squared) %>%
  mutate(zvalues = sqrt(z_squared)) %>%
  mutate(inside_or_outside = ifelse(zvalues < 1, "inside", "outside"))

counted <- xy_df %>%
  group_by(inside_or_outside) %>%
  count()

one <- ggplot(xy_df, aes(color = inside_or_outside)) + 
  geom_circle(aes(x0 = 0, y0 = 0, r = 1), color = 'black') +
  geom_point(aes(x = xvalues, y = yvalues), alpha = 0.5) +
  # coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(subtitle = "n = 4") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "estimate for pi: 3.00") +
  theme(legend.position = 'none') +
  theme(axis.title.y = element_blank()) +
  scale_color_manual(values = c("#376795", "#D55E00")) +
  coord_fixed() 

set.seed(0125)

hundredths <- seq(from = -1, to = 1, by = .01)
xvalues <- sample(hundredths, size = 40, replace=TRUE)
yvalues <- sample(hundredths, size = 40, replace=TRUE)

xy_df <- data.frame(xvalues, yvalues)

xy_df <- xy_df %>% 
  mutate(x_squared = xvalues ^ 2) %>%
  mutate(y_squared = yvalues ^ 2) %>%
  mutate(z_squared = x_squared + y_squared) %>%
  mutate(zvalues = sqrt(z_squared)) %>%
  mutate(inside_or_outside = ifelse(zvalues < 1, "inside", "outside"))

counted <- xy_df %>%
  group_by(inside_or_outside) %>%
  count()

two <- ggplot(xy_df, aes(color = inside_or_outside)) + 
  geom_circle(aes(x0 = 0, y0 = 0, r = 1), color = 'black') +
  geom_point(aes(x = xvalues, y = yvalues), alpha = 0.5) +
  # coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(title = "Monte Carlo Simulations", subtitle = "n = 40") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "estimate for pi: 2.80") +
  theme(legend.position = 'none') +
  theme(axis.title.y = element_blank()) +
  scale_color_manual(values = c("#376795", "#D55E00")) +
    coord_fixed() 

set.seed(0125)

hundredths <- seq(from = -1, to = 1, by = .01)
xvalues <- sample(hundredths, size = 400, replace=TRUE)
yvalues <- sample(hundredths, size = 400, replace=TRUE)

xy_df <- data.frame(xvalues, yvalues)

xy_df <- xy_df %>% 
  mutate(x_squared = xvalues ^ 2) %>%
  mutate(y_squared = yvalues ^ 2) %>%
  mutate(z_squared = x_squared + y_squared) %>%
  mutate(zvalues = sqrt(z_squared)) %>%
  mutate(inside_or_outside = ifelse(zvalues < 1, "inside", "outside"))

counted <- xy_df %>%
  group_by(inside_or_outside) %>%
  count()

three <- ggplot(xy_df, aes(color = inside_or_outside)) + 
  geom_circle(aes(x0 = 0, y0 = 0, r = 1), color = 'black') +
  geom_point(aes(x = xvalues, y = yvalues), alpha = 0.5) +
  labs(subtitle = "n = 400") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) + 
  labs(caption = "estimate for pi: 3.11") +
  theme(legend.position = 'none') +
  theme(axis.title.y = element_blank()) +
  scale_color_manual(values = c("#376795", "#D55E00")) +
  coord_fixed() 

library(patchwork)
one + two + three

Discrete Distributions

There are many ways to think about theoretical distributions in statistics. We can think about distributions as belonging to one of two categories: discrete distributions and continuous distributions. A sampling distribution can either belong to a discrete or continuous distribution category.

A discrete distribution is a probability distribution that depicts the occurrence of discrete outcomes, that is, outcomes that are countable as whole numbers.

Uniform distribution

The first discrete distribution that we can review is called a uniform distribution.

These functions provide information about the uniform distribution on the interval from min to max: dunif() gives the density, punif() gives the distribution function, qunif() gives the quantile function, and runif() generates random deviates.

dunif() punif() qunif() runif()

We use dunif() below with stat_function() to draw the density of a uniform distribution on the interval from 1 to 10; its arguments are shown in the commented chunk that follows.

ggplot(data.frame(x=c(0:10)), aes(x)) +
  stat_function(geom = "area", fun = dunif, args = list(1, 10), fill = '#e76254', alpha = 0.5, color = 'black') + xlim(0, 11) +
  labs(subtitle = "Uniform Distribution") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

# dunif(x,           # X-axis values (grid of values)
#       min = 0,     # Lower limit of the distribution (a)
#       max = 1,     # Upper limit of the distribution (b)
#       log = FALSE) # If TRUE, probabilities are given as log
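A few quick examples of these four functions for the uniform distribution on the interval from 1 to 10 plotted above (a small sketch; the commented values show roughly what each call returns):

dunif(5, min = 1, max = 10)    # density: 1 / (10 - 1), about 0.111
punif(5, min = 1, max = 10)    # P(X <= 5): (5 - 1) / (10 - 1), about 0.444
qunif(0.5, min = 1, max = 10)  # the median of the distribution: 5.5
runif(3, min = 1, max = 10)    # three random deviates between 1 and 10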

Binomial distribution

The binomial distribution is a probability distribution which models the probabilities of having a certain number of successes among n identical trials, each having p as the probability of success. The variables n and p are thus the two parameters of a binomial distribution.

dbinom(x, size, prob, log = FALSE)

dbinom() pbinom() qbinom() rbinom()

Put simply, dbinom() finds the probability of getting a certain number of successes (x) in a certain number of trials (size) where the probability of success on each trial is fixed (prob).
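For example, here is a small sketch of the probability of exactly two successes in ten trials with a 20% success rate, along with the pbinom() and rbinom() counterparts that match the simulation below:

dbinom(2, size = 10, prob = 0.2)   # P(X = 2): about 0.302
pbinom(2, size = 10, prob = 0.2)   # P(X <= 2): about 0.678
rbinom(5, size = 10, prob = 0.2)   # five simulated counts of successes out of 10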

data4 <- data_frame(X = rbinom(n = 3000, size = 10, prob = 0.2))

ggplot(data = data4, mapping = aes(x = X)) +
    ## Default stat is "count".
  geom_bar(fill = '#e76254', alpha = 0.5, color = 'black') +
    ## Force the limits to include 0,...,10.
  scale_x_continuous(breaks = 0:10, limits = c(-1,11)) +
  labs(subtitle = "Binomial Distribution") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 


Bernoulli distribution

The Bernoulli distribution takes a value of 1 with probability p and a value of 0 with probability 1 − p. It represents a “yes or no” type of experiment: if the experiment succeeds, the outcome is given the value 1; if it does not succeed, it is given the value 0. This can be used, for example, to model tossing a coin, where “1” means it lands on “heads” and “0” means it lands on “tails” (or the other way around). A binomial distribution with size = 1 is a Bernoulli distribution.

## Create a vector to hold sample means.
Xbars30 <- numeric(length = 10^4)
## Repeat sample mean generation 10^4 times.
for (i in 1:10^4) {
    ## Assign the generated value to the i-th element of the vector.
    Xbars30[i] <- mean(rbinom(n = 30, size = 1, prob = 0.2))
}
## Check the moments.
mean(Xbars30)
## [1] 0.20028
ggplot(data = data_frame(Xbar = Xbars30), mapping = aes(x = Xbar)) +
  geom_bar(fill = '#e76254', alpha = 0.5, color = 'black') +
  scale_x_continuous(limit = c(0,1)) +
  labs(subtitle = "Bernoulli Distribution") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

Poisson Distribution

The Poisson probability mass function is

\[ \begin{aligned} P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \end{aligned} \]

where:

  • x: the number of successes (events observed)
  • λ (lambda): the average rate of success

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

Assumptions

The Poisson distribution is an appropriate model if the following assumptions are true:

  • k is the number of times an event occurs in an interval, and k can take the values 0, 1, 2, ….
  • The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.
  • The average rate at which events occur is independent of any occurrences. For simplicity, this is usually assumed to be constant, but may in practice vary with time.
  • Two events cannot occur at exactly the same instant; instead, at each very small sub-interval, either exactly one event occurs or no event occurs.

If these conditions are true, then k is a Poisson random variable, and the distribution of k is a Poisson distribution.

In short, the Poisson distribution measures the probability that a certain number of events occur within a fixed period of time, where the events are unrelated to each other and occur with a known average rate, represented by λ (lambda).
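As a quick sketch, suppose events arrive at an average rate of λ = 2 per interval; dpois() and its companions then give the probabilities directly (the commented values show roughly what each call returns):

dpois(0, lambda = 2)   # P(X = 0): about 0.135
dpois(2, lambda = 2)   # P(X = 2): about 0.271
ppois(4, lambda = 2)   # P(X <= 4): about 0.947
rpois(5, lambda = 2)   # five simulated event counts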

# ggplot(transform(data.frame(x=c(0:10)), y=dpois(x, 1)), aes(x, y)) + 
#   geom_bar(stat="identity", fill = '#72bcd5', alpha = 0.5, color = 'black') +
#   labs(title = "Poisson Distribution") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) 
# 
# ggplot(transform(data.frame(x=c(0:10)), y=dpois(x, 1)), aes(x, y)) + 
#   geom_density(stat="identity", fill = '#72bcd5', alpha = 0.5, color = 'black') +
#   labs(title = "Poisson Distribution") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) 
# 
# ggplot(transform(data.frame(x=c(0:10)), y=dpois(x, 2)), aes(x, y)) + 
#   geom_bar(stat="identity", fill = '#72bcd5', alpha = 0.5, color = 'black') +
#   labs(title = "Poisson Distribution") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) 



Continuous Distributions

These distributions are based on the normal distribution. Chi-squared distribution: if you have a set of independent, normally distributed variables with mean 0 and variance 1, then the sum of their squares has a chi-squared distribution. t-distribution: if you have a normally distributed variable with mean 0 and variance 1 and you divide it by the square root of an independent chi-squared variable divided by its degrees of freedom, the result has a t-distribution. F-distribution: if you take the ratio of two independent chi-squared variables, each divided by its degrees of freedom, the result has an F-distribution.

Why do we have all of these different combinations of the normal? To give our tests the properties we want. For example, if we want to see whether average heights are the same in two groups, we would like to take into account how confident we are in each sample average. So if we are checking whether the heights differ in two samples of sizes five and six, we take the difference in average height, which is approximately normally distributed, and divide it by a measure of how uncertain those sample averages are, which is built from a chi-squared distributed quantity, and we arrive at a t-distribution. This is the t-test. The specific reason a particular test divides one chi-squared quantity by another, and thus uses an F-distribution, depends on the task, but this gives a starting point for tying these distributions to the tests that use them.

The standardized sample mean, (x̄ − μ) / (s / √n), follows a standard normal distribution as n becomes infinitely large. But in reality we can’t have infinite data. If n is small, this quantity, the location of the true mean relative to the sample mean, divided by the sample standard deviation and multiplied by the normalizing term √n, follows a t-distribution with n − 1 degrees of freedom. As n becomes very large, this t-distribution approaches the standard normal. That is why we see the t-distribution when computing confidence intervals. For large n you may notice the use of the z-distribution instead of t; this implies the use of the inverse of a standard normal, which in turn implies an assumption of effectively infinite (large) data.

Now we have seen a way to find the sample mean with a confidence interval. What about the sample variance? Just as the mean of a sufficiently large number of independent and identically distributed random variables follows a normal distribution, the (suitably scaled) sample variance of normally distributed data follows a chi-squared distribution. This follows from the fact that the sum of squares of standard normal random variables follows the chi-squared distribution.

The F-distribution arises from a ratio: the ratio of two independent chi-squared variables, each divided by its degrees of freedom, follows an F-distribution. Hence you will find this distribution in the F-test, which is used for hypothesis testing. One example is testing the hypothesis that the means of several normally distributed populations, all having the same standard deviation, are equal. This is perhaps the best-known F-test, and it plays an important role in the analysis of variance (ANOVA). You take ratios of variances and then check the plausibility of your hypothesis. To understand all of this better, see the material on hypothesis testing, or Khan Academy’s video “Hypothesis Test for Difference of Means”.

To recap: the other distributions are built from the normal distribution. The sum of squares of variables with mean 0 and standard deviation 1 is a chi-squared distribution; a standard normal divided by the square root of a scaled chi-squared is a t-distribution; and a scaled chi-squared divided by another scaled chi-squared is an F-distribution.
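A minimal simulation sketch of these relationships, assuming only base R: we build a chi-squared variable from squared standard normals and a t variable from a standard normal divided by the square root of a scaled chi-squared, then compare a few quantiles with the named distributions.

set.seed(0125)

k <- 5                                   # degrees of freedom
n <- 100000                              # number of simulated values

# sum of squares of k standard normals follows a chi-squared with k degrees of freedom
z <- matrix(rnorm(n * k), ncol = k)
chi_sq_built <- rowSums(z^2)

# standard normal divided by sqrt(chi-squared / k) follows a t with k degrees of freedom
t_built <- rnorm(n) / sqrt(chi_sq_built / k)

# compare quantiles of the built variables with the named distributions
quantile(chi_sq_built, c(0.5, 0.95)); qchisq(c(0.5, 0.95), df = k)
quantile(t_built, c(0.5, 0.95));      qt(c(0.5, 0.95), df = k)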

The Normal Distribution

The normal distribution, also called a Gaussian distribution or a bell curve, is a foundational probability distribution in statistics. The most common visualization for a normal distribution is the histogram or density plot, but normality can also be studied through other visualizations such as box plots and q-q plots. In the following four-panelled visualization, we can see a normal distribution visualized in four different ways. In this chapter, we will review in detail how to create and understand these visualizations.

The Standard Normal Distribution

The standard normal distribution is simply the normal distribution where the mean is zero and the standard deviation is one.

The standard normal distribution makes it clear what percentage of values fall above or below a given point in terms of standard deviations. Trained statisticians can easily convert z-scores into percentages.

The 68-95-99.7 Rule

We also know that 99.7% of values fall within 3 standard deviations of the mean in a normal probability distribution.

The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).

In particular, the empirical rule predicts that 68% of observations falls within the first standard deviation (µ ± σ), 95% within the first two standard deviations (µ ± 2σ), and 99.7% within the first three standard deviations (µ ± 3σ).
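We can verify the empirical rule directly with pnorm() on the standard normal distribution; these are exact areas, so the rule's numbers are rounded versions of them:

pnorm(1) - pnorm(-1)   # about 0.683: within one standard deviation
pnorm(2) - pnorm(-2)   # about 0.954: within two standard deviations
pnorm(3) - pnorm(-3)   # about 0.997: within three standard deviations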

Central Limit Theorem

The key point of the central limit theorem is that it lets us take an outcome whose distribution we don’t know much about and, through the distribution of sample means, arrive at a well-known distribution: the normal distribution.

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "area", fill = '#ffe6b7', color = 'black', alpha = 0.5, 
                args = list(
                  mean = 0,
                  sd = 1)) +
  labs(title = "Standard Normal Distribution", subtitle = "Standard Deviations and Z-Scores") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  geom_vline(xintercept = 1, linetype = 'dotdash', color = "black") +
  geom_vline(xintercept = -1, linetype = 'dotdash', color = "black") +
  geom_vline(xintercept = 2, linetype = 'dotdash', color = "black") +
  geom_vline(xintercept = -2, linetype = 'dotdash', color = "black") +
  geom_vline(xintercept = 3, linetype = 'dotdash', color = "black") +
  geom_vline(xintercept = -3, linetype = 'dotdash', color = "black") +
  geom_text(aes(x=1, label="\n 1 σ", y=0.125), colour="#376795", angle=90, size=4) +
  geom_text(aes(x=2, label="\n 2 σ", y=0.125), colour="#376795", angle=90, size=4) +
  geom_text(aes(x=3, label="\n 3 σ", y=0.125), colour="#376795", angle=90, size=4) +
  geom_text(aes(x=-1, label="\n-1 σ", y=0.125), colour="#376795", angle=90, size=4) +
  geom_text(aes(x=-2, label="\n-2 σ", y=0.125), colour="#376795", angle=90, size=4) +
  geom_text(aes(x=-3, label="\n-3 σ", y=0.125), colour="#376795", angle=90, size=4) 

Z-Scores

A z-score, also called a standard score, tells us, for any given value, how many standard deviations above or below the population mean that value is. For example, a z-score of 2 tells us that our value is two standard deviations above the mean. In practice, most z-scores fall between −3 and 3, although more extreme values are possible. Z-scores are a way to compare results to a “normal” population. Here is how we calculate the z-score for a value from a normal distribution.

\[ \begin{aligned} z = \frac{x - \mu}{\sigma} \end{aligned} \]

This equation requires us to know the population mean μ and the population standard deviation σ.

When you have multiple samples and want to describe the standard deviation of those sample means (the standard error), you would use this z score formula:

\[ \begin{aligned} z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \end{aligned} \]

This z-score tells us how many standard errors there are between the sample mean and the population mean. It is used for a different kind of problem: here we are dealing with a sampling distribution of means, so we know we have to include the standard error in the formula.


Functions

Before deep-diving into the types of distributions, it is important to review the fundamental concepts of the probability density function (PDF), the probability mass function (PMF), and the cumulative distribution function (CDF).

Z-scores in code: (x - mean(x)) / sd(x)
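For example, here is a quick sketch of z-scores for the mpg column; base R's scale() gives the same result:

z_mpg <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)

head(z_mpg)
head(as.numeric(scale(mtcars$mpg)))   # same values via scale()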

q-q plots

In the bottom-right panel we see what is called a quantile-quantile plot, or q-q plot, which we have not yet reviewed. Specifically, in this example we are looking at a normal quantile-quantile plot because we are assessing normality, but q-q plots can be used for other kinds of distributions. A q-q plot is used both for understanding the distribution of a given variable and for doing diagnostics on a given model. The line in a q-q plot is a theoretical line that sits on the diagonal; it shows how the scatter would lie if the distribution were perfectly normal, as is the case in the above example. In real-world examples, the scatter in a q-q plot may or may not lie close to this theoretical line, and it may or may not show a pattern pulling away from the line.

A q–q plot is a probability plot. Specifically, it compares two probability distributions by plotting their quantiles against each other: here, the distribution of our actual data and a theoretical normal distribution. A q-q plot takes our sample data, sorts it in ascending order, and then plots the data points against the quantiles calculated from the theoretical distribution; the number of quantiles is selected to match the size of our sample. If both sets of quantiles came from a normal distribution, we should see the points forming a straight diagonal line.

Quantiles, which are closely related to percentiles, are points in our data below which a certain proportion of our data falls. For example, in a classic bell-curve standard normal distribution with a mean of 0, the 0.5 quantile, or 50th percentile, is 0: half the data lie below 0, the peak of the hump in the curve. The 0.95 quantile, or 95th percentile, is about 1.64: 95 percent of the data lie below 1.64. The following R code generates the quantiles of a standard normal distribution from 0.01 to 0.99 in increments of 0.01: qnorm(seq(0.01, 0.99, 0.01))

One advantage of a q-q plot is that it can be easier to use with small sample sizes. To get a feel for how to read a q-q plot, we can use our sample dataset from before. We create a q-q plot by using the stat_qq() and stat_qq_line() functions inside the ggplot() call.
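Here is a minimal sketch of such a q-q plot for the mpg variable, following the same ggplot2 conventions used elsewhere in this chapter:

ggplot(mtcars, aes(sample = mpg)) +
  stat_qq(color = '#376795') +
  stat_qq_line() +
  labs(title = "Mtcars", subtitle = "Normal Q-Q Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))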

R comes with predefined functions that are native to base R that are designed for examining the characteristics of a normal distribution.

  • dnorm()
  • pnorm()
  • qnorm()
  • rnorm()

We first start with the dnorm() function, which is the density function of the normal distribution. The dnorm() function takes the arguments x, mean, sd and log. We can use dnorm() to find the density at any point of x. In the following example, we use dnorm() to find the density of a normal distribution at the point x = 125, where the distribution has mean = 100 and sd = 15. The function evaluates to 0.006631809, meaning the y-value of the density plot is 0.006631809 at the point where x is 125.

The qnorm() function gives the quantile function of the normal distribution. In other words, given an area (a probability), we find the boundary value that cuts off this area. For example, suppose we want to find the 95th percentile of a normal distribution whose mean is 100 and whose standard deviation is 15. This is a common example because the 95th percentile is a critical value when thinking about p-values with an alpha level of 0.05. In this example, qnorm() evaluates to 124.6728.

The pnorm() function, on the other hand, gives the cumulative distribution function of the normal distribution. We can think about pnorm() as arriving at the same numbers as qnorm() but through the opposite workflow: given x = 124.6728, pnorm() returns the area to the left of 124.6728, which we can also read as the probability of seeing a value less than or equal to 124.6728. For a mean of 100 and a standard deviation of 15, pnorm(124.6728) evaluates to 0.95. In this example, we only know to input an x value of 124.6728 because of our evaluation of qnorm().

# dnorm(125, mean = 100, sd = 15)     # 0.006631809
# pnorm(125, mean = 100, sd = 15)     # 0.9522096
# qnorm(0.95, mean = 100, sd = 15)    # 124.6728

distributions_functions_table <- tribble(
  ~mean, ~standard_deviation, ~function_, ~input, ~evaluation,
  "100", "15", "dnorm", "125", "0.006631809",
  "100", "15", "pnorm", "125", "0.9522096",
  "100", "15", "qnorm", "0.95", "124.6728",
)
distributions_functions_table

In addition to creating histograms, boxplots, scatterplots, and other visualizations with common geometries, as in the four-paneled visualization above, ggplot2 can also be used to visualize functions. We do this by using stat_function() within the ggplot() call. stat_function() takes an argument called fun, which stands for function; we can supply a predefined function such as dnorm() to it. To visualize a normal distribution with a given mean and standard deviation, we pass those values to the function separately as a list through the args argument. Note that each function may take different arguments. By putting dnorm() inside of ggplot() this way, we are able to visualize the normal distribution.

Previously, we used ggplot() on the existing mtcars dataframe. In the following code, we try a new method: creating a dataframe inline within the ggplot() call. This dataframe is not saved locally as an independent object; as before, it defines the limits of the visualization. In the following example, we also add a vertical line to illustrate where x = 125 and y = 0.006631809, and in the second panel we shade the area to the left of x = 125, which is 0.9522096 of the total area (about 95.22%).

one <- ggplot(data.frame(x = c(0, 200)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line", fill = '#1e466e', alpha = 0.5, 
                args = list(
                  mean = 100,
                  sd = 15), xlim = c(0, 125)) +
    stat_function(fun = dnorm,
                geom = "line", fill = '#1e466e', alpha = 0.5, 
                args = list(
                  mean = 100,
                  sd = 15), xlim = c(125, 200)) +
  geom_hline(yintercept = 0.006631809, linetype = 'dotdash') +
  labs(title = "Normal Distribution", subtitle = "Theoretical") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  geom_vline(xintercept = 125, linetype = 'dotdash') +
  labs(caption = "dnorm: 0.006631809")

two <- ggplot(data.frame(x = c(0, 200)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "area", fill = '#1e466e', alpha = 0.5, 
                args = list(
                  mean = 100,
                  sd = 15), xlim = c(0, 125)) +
    stat_function(fun = dnorm,
                geom = "line", fill = '#1e466e', alpha = 0.5, 
                args = list(
                  mean = 100,
                  sd = 15), xlim = c(125, 200)) +
  labs(title = "Normal Distribution", subtitle = "Theoretical") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  geom_vline(xintercept = 125, linetype = 'dotdash') +
  labs(caption = "pnorm: 0.9522096; qnorm: 125")

library(patchwork)
one + two

# ggplot(data.frame(x = c(0, 5)), aes(x)) +
#   stat_function(fun = df,
#                 geom = "line",
#                 args = list(
#                   df1 = 3,
#                   df2 = 49
#                 )) 

The rnorm() function is random sampling from the normal distribution. The rnorm() function is designed to generate a vector of normally distributed random numbers, or random deviates. rnorm() takes three arguments: The first argument n is the number of numbers we want to generate and, similar to our other distribution functions, we also have to specify the mean and standard deviation.

Whereas in our previous example we used dnorm() within stat_function() to create a theoretical normal distribution with perfect characteristics, rnorm() gives us an opportunity to create a normal distribution through simulation. In the following example, we create three simulations of different sizes, each one with ten times more numbers than the previous one. As the number of simulated values grows very large, the shape of the simulated distribution approaches the theoretical, perfectly normal one; this convergence is closely related to the law of large numbers. We also accompany our distributions with q-q plots. Because our simulated data is not perfectly normal, we can see points on the q-q plots that deviate from the perfectly straight stat_qq_line().

set.seed(0125)

nnormal <- rnorm(10, mean = 100, sd = 15)
nnormal <- as.data.frame(nnormal)
names(nnormal) <- c('x')

one <- ggplot(nnormal, aes(x = x)) +
  geom_density(fill = '#aadce0') + 
  labs(subtitle = "n = 10") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  xlim(0, 200)

two <- ggplot(nnormal, aes(sample = x)) +
  stat_qq(color = '#aadce0') + 
  labs(subtitle = "n = 10") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

nnormal <- rnorm(100, mean = 100, sd = 15)
nnormal <- as.data.frame(nnormal)
names(nnormal) <- c('x')

three <- ggplot(nnormal, aes(x = x)) +
  geom_density(fill = '#aadce0') + 
  labs(title = "Normal Distributions", subtitle = "n = 100") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  xlim(0, 200)

four <- ggplot(nnormal, aes(sample = x)) +
  stat_qq(color = '#aadce0') + 
  labs(subtitle = "n = 100") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

nnormal <- rnorm(1000, mean = 100, sd = 15)
nnormal <- as.data.frame(nnormal)
names(nnormal) <- c('x')

five <- ggplot(nnormal, aes(x = x)) +
  geom_density(fill = '#aadce0') + 
  labs(subtitle = "n = 1000") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  xlim(0, 200)

six <- ggplot(nnormal, aes(sample = x)) +
  stat_qq(color = '#aadce0') + 
  labs(subtitle = "n = 1000") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Simulations")

library(patchwork)
one + three + five + two + four + six

  # scale_color_manual(values= list(c("#e76254", "#ef8a47", "#f7aa58", "#ffd06f", "#ffe6b7", "#aadce0", "#72bcd5", "#528fad", "#376795", "#1e466e"), c(6, 2, 9, 3, 7, 5, 1, 10, 4, 8), colorblind=TRUE)) 

With a large enough sample, the simulation will approach the normal distribution. In the following example, we draw 100,000 values from a random normal distribution and overlay the result with the theoretical normal density. Not surprisingly, we see an almost perfect overlap.

set.seed(0125)

nnormal <- rnorm(100000, mean = 100, sd = 15)
nnormal <- as.data.frame(nnormal)
names(nnormal) <- c('x')

ggplot(nnormal, aes(x = x)) +
  geom_density(color = 'darkgrey', alpha = 0.5) +
  labs(title = "Normal Distributions", subtitle = "Theoretical and Simulated") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  xlim(0, 200) +
    stat_function(fun = dnorm,
                geom = "line",
                color = "steelblue",
                alpha = 0.5,
                args = list(
                  mean = 100,
                  sd = 15
                )) 

In addition to using a predefined function within stat_function(), we can also create our own functions within the fun argument in stat_function(). In this example, we create functions for first, second, third, and fourth degree polynomials.

one <- ggplot(data.frame(x = c(-20, 20)), aes(x)) +
  stat_function(fun = function(x) { x**1 },
                geom = "line", color = '#1e466e') +
  labs(subtitle = "First Degree Polynomial") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

two <- ggplot(data.frame(x = c(-20, 20)), aes(x)) +
  stat_function(fun = function(x) { x**2 },
                geom = "line", color = '#1e466e') +
  labs(subtitle = "Second First Degree Polynomial") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

three <- ggplot(data.frame(x = c(-20, 20)), aes(x)) +
  stat_function(fun = function(x) { x**3},
                geom = "line", color = '#1e466e') +
  labs(subtitle = "Third First Degree Polynomial") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

four <- ggplot(data.frame(x = c(-20, 20)), aes(x)) +
  stat_function(fun = function(x) { x**4},
                geom = "line", color = '#1e466e') +
  labs(subtitle = "Fourth First Degree Polynomial") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

one + two + three + four

  # scale_color_manual(values= list(c("#e76254", "#ef8a47", "#f7aa58", "#ffd06f", "#ffe6b7", "#aadce0", "#72bcd5", "#528fad", "#376795", "#1e466e"), c(6, 2, 9, 3, 7, 5, 1, 10, 4, 8), colorblind=TRUE)) 

Student’s T Distribution

Compared with the standard normal distribution, the t-distribution is not as tall at the center and has fatter tails.

The t-distribution is a sampling distribution used when the sample is small or little is known about the population. It is used to estimate the mean of the population and related quantities such as confidence intervals, statistical differences, and linear regression coefficients. The t-distribution uses a t-score in situations where a z-score based on the normal distribution would not be appropriate.

The t-distribution describes the standardized distances of sample means to the population mean when the population standard deviation is not known, and the observations come from a normally distributed population.
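The same d/p/q/r family exists for the t-distribution (dt(), pt(), qt(), rt()). A small sketch comparing t critical values with the normal critical value shows how the fatter tails shrink toward the normal as the degrees of freedom grow; the commented values are approximate:

qt(0.975, df = 5)     # about 2.57
qt(0.975, df = 30)    # about 2.04
qt(0.975, df = 1000)  # about 1.96
qnorm(0.975)          # 1.959964: the normal critical value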


ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, color = 'black', linetype = 'dotted') +
  stat_function(fun = dt, args = list(df = 1), color = '#e76254') +
  stat_function(fun = dt, args = list(df = 2), color = '#ef8a47') +
  stat_function(fun = dt, args = list(df = 3), color = '#f7aa58') +
  stat_function(fun = dt, args = list(df = 10), color = '#ffd06f') +
  stat_function(fun = dt, args = list(df = 4), color = '#ffe6b7') +
  stat_function(fun = dt, args = list(df = 5), color = '#aadce0') +
  stat_function(fun = dt, args = list(df = 6), color = '#72bcd5') +
  stat_function(fun = dt, args = list(df = 7), color = '#528fad') +
  stat_function(fun = dt, args = list(df = 8), color = '#376795') +
  stat_function(fun = dt, args = list(df = 9), color = '#1e466e') +
  labs(title = "T Distribution", subtitle = "Degrees of Freedom 1 - 10") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

Chi-Squared Distribution

\[ \begin{aligned} Q\ \sim \ \chi _{k}^{2} \end{aligned} \]

The chi-squared distribution is used primarily in hypothesis testing. The chi-squared distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. If Z1, …, Zk are independent, standard normal random variables, then the sum of their squares is distributed according to the chi-squared distribution with k degrees of freedom. The chi-squared distribution has one parameter: a positive integer k that specifies the number of degrees of freedom (the number of random variables being summed, Zi s).

The chi-squared distribution is the distribution that emerges from taking the sums of the squared errors.

\[ \begin{aligned} Q\ =\sum _{i=1}^{k}Z_{i}^{2} \end{aligned} \]

We can also think about the chi-square distribution as the distribution of the chi-square statistic. The chi-square distribution is defined by the following probability density function.

\( P_r(x) = \frac{x^{r/2-1} e^{-x/2}}{\Gamma(r/2)\, 2^{r/2}} \)

The chi-square distribution has several interesting properties. The mean of the distribution is equal to the number of degrees of freedom: μ = k. The variance is equal to two times the number of degrees of freedom: σ² = 2k. Also, as the degrees of freedom increase, the chi-square curve approaches a normal distribution.
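We can check the first two of these properties with a quick simulation; this sketch assumes k = 5 degrees of freedom and 100,000 random draws, so the simulated mean should land near 5 and the simulated variance near 10.

set.seed(1)
chisq_draws <- rchisq(100000, df = 5)
mean(chisq_draws) # close to 5, the degrees of freedom
var(chisq_draws)  # close to 10, twice the degrees of freedom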

A chi-square (χ2) statistic is a measure of the difference between the observed and expected frequencies of the outcomes of a set of events or variables. Chi-square is useful for analyzing such differences in categorical variables, especially those nominal in nature. χ2 depends on the size of the difference between actual and observed values, the degrees of freedom, and the sample size. χ2 can be used to test whether two variables are related or independent from one another. It can also be used to test the goodness-of-fit between an observed distribution and a theoretical distribution of frequencies.

There are two main kinds of chi-square tests: the test of independence, which asks a question of relationship, and the goodness-of-fit test, which asks how well the data matches a theoretical value.

If \(Y_1, \ldots, Y_r\) are independent, normally distributed random variables with mean 0 and variance 1, then

\[ \begin{aligned} \chi^{2} = \sum_{i=1}^{r} Y_{i}^{2} \end{aligned} \]

is distributed as \(\chi^2\) with r degrees of freedom. This makes the \(\chi^2\) distribution a gamma distribution with \(\theta = 2\) and \(\alpha = r/2\), where r is the number of degrees of freedom.

In addition to these, some references include a third test, the test of a single variance. To summarize: the goodness-of-fit test determines whether data fit a particular distribution, the test of independence determines whether two events or variables are independent, and the test of a single variance tests a claim about a population variance.

There is a different chi-square curve for each value of df.

When df > 90, the chi-square curve approximates the normal distribution.

Because the center of the chi-square distribution grows with the degrees of freedom, a chi-square value that would be extreme with few degrees of freedom is not that uncommon with many; the observed counts need to show sufficient deviation away from the expected counts before the test flags them. In the goodness-of-fit statistic below, \(O_i\) is the observed number of observations of type i and \(E_i\) is the expected number of observations of type i.

\[ \begin{aligned} \chi^{2} = \sum_{i=1}^{k} \frac{(O_{i} - E_{i})^{2}}{E_{i}} \end{aligned} \]
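As a minimal sketch of the goodness-of-fit and independence tests described above, we can use the chisq.test() function, assuming the cyl and am columns from the base mtcars data are still present in our dataframe. With only 32 cars the expected counts in some cells are small, so R may warn that the approximation could be inaccurate.

chisq.test(table(mtcars$cyl))            # goodness of fit against equal proportions of 4, 6 and 8 cylinders
chisq.test(table(mtcars$cyl, mtcars$am)) # test of independence between cylinders and transmission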

The key to the chi-squared test is that the chi-squared statistic is well-approximated by a chi-squared distribution (via a multivariate normal approximation) with a properly chosen number of degrees of freedom.

Because of this approximation, a number of conditions need to hold in order for the test to be valid; most commonly, the expected count in each category should not be too small. When these conditions hold, the chi-squared test proceeds as described below.

Intuitively, the test relies on the fact that if the expected distribution is indeed correct, the difference between the observed and expected distributions should approximate a multivariate normal distribution, which is approximated by a chi-squared distribution by the central limit theorem. If the chi-squared statistic is larger than the critical value, then it is unlikely to have occurred under this assumption, and thus the assumption is likely to be false.
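The critical value mentioned above comes from the chi-squared distribution itself. For example, for a test at an alpha of 0.05 with 3 degrees of freedom, we can look it up with qchisq().

qchisq(0.95, df = 3) # approximately 7.81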

The simplest chi-squared distribution is the square of a standard normal distribution. So wherever a normal distribution could be used for a hypothesis test, a chi-squared distribution could be used.

The difference between the left and right panel is that every number in the standard normal distribution has been squared. We can see that every number is greater than zero.

normal_numbers <- rnorm(10000, mean = 0, sd = 1)
normal_numbers <- as.data.frame(normal_numbers)

normal_numbers2 <- normal_numbers %>%
  mutate(square_of_normal_numbers = normal_numbers ^ 2)

one <- ggplot(normal_numbers, aes(x = normal_numbers)) + geom_density() +
  labs(title = "Normal Distribution", subtitle = "") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

two <- ggplot(normal_numbers2, aes(x = square_of_normal_numbers)) +
  geom_density() + xlim(0, 4) +
  labs(title = "Chi-Square Distribution", subtitle = "DF: 1") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))
  
library(patchwork)
one + two


ggplot(data.frame(x = c(0, 20)), aes(x)) +
    stat_function(fun = dchisq,
                geom = "area", color = '#e76254', fill = '#1e466e', alpha = 0.5, 
                args = list(
                  # from = 0, to = 40,
                  df = 3)) +
    stat_function(fun = dchisq,
                geom = "area", color = '#ef8a47', fill = '#1e466e', alpha = 0.5, 
                args = list(
                  # from = 0, to = 40,
                  df = 4)) +
    stat_function(fun = dchisq,
                geom = "area", color = '#f7aa58', fill = '#1e466e', alpha = 0.5, 
                args = list(
                  # from = 0, to = 40,
                  df = 5)) +
    stat_function(fun = dchisq,
                geom = "area", color = '#ffd06f', fill = '#1e466e', alpha = 0.5, 
                args = list(
                  # from = 0, to = 40,
                  df = 6)) +
    stat_function(fun = dchisq,
                geom = "area", color = '#aadce0', fill = '#1e466e', alpha = 0.5, 
                args = list(
                  # from = 0, to = 40,
                  df = 7)) +
  stat_function(fun = dchisq,
                geom = "area", color = '#72bcd5', fill = '#1e466e', alpha = 0.5, 
                args = list(
                  # from = 0, to = 40,
                  df = 8)) +
    stat_function(fun = dchisq,
                geom = "area", color = '#528fad', fill = '#1e466e', alpha = 0.5, 
                args = list(
                  # from = 0, to = 40,
                  df = 9)) +
      stat_function(fun = dchisq,
                geom = "area", color = '#376795', fill = '#1e466e', alpha = 0.5, 
                args = list(
                  # from = 0, to = 40,
                  df = 10)) +
  labs(title = "Chi-Square Distribution", subtitle = "Density Plot with degrees of freedom 3 - 10") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))


F Distribution

\[ \begin{aligned} F = \frac{V_{1} / m_{1}}{V_{2} / m_{2}} \sim F(m_{1}, m_{2}) \end{aligned} \]

Here \(V_1\) and \(V_2\) are independent chi-squared random variables with \(m_1\) and \(m_2\) degrees of freedom.

Let's work through another example with another function to extend what we have learned. The function df() computes the density of an F-distribution. It requires at least three arguments: x, df1 and df2:



ggplot(data.frame(x = c(0, 3)), aes(x)) +
      stat_function(fun = df,
                geom = "line",
                color = "#e76254",
                alpha = .7,
                args = list(
                  df1 = 1,
                  df2 = 50
                )) +
      stat_function(fun = df,
                geom = "line",
                color = "#ef8a47",
                alpha = .7,
                args = list(
                  df1 = 2,
                  df2 = 50
                )) +
    stat_function(fun = df,
                geom = "line",
                color = "#f7aa58",
                alpha = .7,
                args = list(
                  df1 = 3,
                  df2 = 50
                )) +
    stat_function(fun = df,
                geom = "line",
                color = "#ffd06f",
                alpha = .7,
                args = list(
                  df1 = 4,
                  df2 = 50
                )) +
    stat_function(fun = df,
                geom = "line",
                color = "#ffe6b7",
                alpha = .7,
                args = list(
                  df1 = 5,
                  df2 = 50
                )) +
      stat_function(fun = df,
                geom = "line",
                color = "#aadce0",
                alpha = .7,
                args = list(
                  df1 = 6,
                  df2 = 50
                )) +
    stat_function(fun = df,
                geom = "line",
                color = "#72bcd5",
                alpha = .7,
                args = list(
                  df1 = 7,
                  df2 = 50
                )) +
        stat_function(fun = df,
                geom = "line",
                color = "#528fad",
                alpha = .7,
                args = list(
                  df1 = 8,
                  df2 = 50
                )) +
              stat_function(fun = df,
                geom = "line",
                color = "#376795",
                alpha = .7,
                args = list(
                  df1 = 9,
                  df2 = 50
                )) +
              stat_function(fun = df,
                geom = "line",
                color = "#1e466e",
                alpha = .7,
                args = list(
                  df1 = 10,
                  df2 = 50
                )) +
    labs(title = "F Distribution", subtitle = "Density Plot with degrees of freedom 1: 1 - 10; degrees of freedom 2: 50") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))


5 Frequentist Statistics

5.1 Descriptive Statistics

Descriptive statistics summarize the characteristics of a dataset. Using descriptive statistics, we try to give the most accurate picture of what the data are and what they show. This includes basic information about each variable individually as well as highlights of the relationships between variables. We try to distill the information into an easily understood form that makes sense, is not misleading, and prevents misunderstanding. Along the way, we have to make a lot of decisions about what information is presented and what story is conveyed. After all, converting data into a summary statistic is an act of reduction, and some information is lost. Despite obvious differences in the distribution of points in each panel, the nine panels in the following visualization share seven summary statistics, including the x-mean, the y-mean and the correlation. The implications for understanding data through descriptive statistics and visualizations will be discussed in detail in this chapter.

Summary Tables

In many cases, the best way to convey information about a dataset is through a table. We can use dplyr to efficiently create summary tables of our data. dplyr is a powerful package in the tidyverse collection that contains many functions inspired by the SQL programming language. Here are the most common dplyr verbs.

  • select()
  • filter()
  • arrange()
  • mutate()
  • group_by()
  • summarize()

The following code shows us how to create a summary table of different variables in our dataframe using the dplyr workflow. We also show an alternate way of writing the code that reads more like SQL syntax because it works from the inside out. The difference is that the piped version makes use of the pipe operator %>%, which we read as 'and then'. The pipe operator first appeared in the magrittr package, whose name is a nod to René Magritte's famous painting with the words, in French, "Ceci n'est pas une pipe", meaning "This is not a pipe". The pipe operator has since been incorporated into dplyr as well, so there is no practical need to attach the magrittr package separately if we intend to use dplyr verbs. The pipe operator explains a lot of why R code appears in two different styles: base R and tidyverse.

In the following code chunks, both styles produce a similar summary; note that the nested version also filters out cars from the United States, so its output differs slightly from the piped version. Finally, in the piped version we use the round() function from base R to round the values to one decimal place.

dplyr::select(
  arrange(
    summarise(
      group_by(
        filter(
          mutate(mtcars,
          l_per_100_km = 235 / mpg),
        origin != "United States"),
      origin),
    avg_l_per_100_km = mean(l_per_100_km), avg_mpg = mean(mpg)),
  desc(avg_mpg)),
 origin, avg_l_per_100_km, avg_mpg) -> sql_like_method

# mtcars |> 
#   transform(avg = mpg / wt) |>
#   subset(avg > 5) |>
#   subset(, c(wt, mpg, avg)) |> 
#   head()

mtcars %>%
  mutate(l_per_100_km = 235 / mpg) %>%
  group_by(origin) %>%
  summarize(avg_l_per_100_km = round(mean(l_per_100_km), 1), avg_mpg = round(mean(mpg), 1)) %>%
  arrange(desc(avg_mpg)) %>%
  dplyr::select(origin, avg_l_per_100_km, avg_mpg)

Univariate Statistics

Moments

In each dataset, the observations for each variable will belong to a certain distribution that can be described in terms of central tendency and variability or dispersion, as well as the shape of this variability or dispersion. This can be described in terms of four moments, which is a concept shared in other fields such as physics. Whereas in physics, moments are related to concepts of mass, rotational inertia and other things, the four moments of a distribution in statistics refer specifically to the following measures:

  • center
  • spread
  • skewness
  • kurtosis

In the table above, we show the average miles per gallon as well as the average liters per one hundred kilometers for cars from each country. This table is useful, of course. But averages don't always suffice to summarize data. We need to understand the distribution, its spread, and its variability. If we don't understand all the characteristics of a distribution, the average may mislead by hiding the other statistics. We will look at each of these moments in turn.
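The following sketch gives a compact preview, computing all four moments for the mpg variable using the moments package (which we also rely on later in this chapter) for the third and fourth moments.

library(moments)

tibble(center   = mean(mtcars$mpg),
       spread   = sd(mtcars$mpg),
       skewness = skewness(mtcars$mpg),
       kurtosis = kurtosis(mtcars$mpg))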

center

The first moment in a distribution is the center. There are three common ways to report this statistic:

  • mean
  • median
  • mode

The mean is calculated as the sum of all the values divided by the total number and is given by this equation:

\[ \begin{aligned} \bar{X} = \frac{\sum_{i=1}^{N} X_{i}}{N} \end{aligned} \]

The median, on the other hand, is the middle number after sorting in either ascending or descending order. If there is an even number of observations in the distribution, the median is the average of the middle two numbers.

Where X is the ordered list of values in the data set, the median is given by the following equations. The first equation applies when N is odd; the second applies when N is even.

\[ \begin{aligned} \operatorname{median}(X) = X_{\frac{N+1}{2}} \end{aligned} \] \[ \begin{aligned} \operatorname{median}(X) = \frac{X_{\frac{N}{2}} + X_{\frac{N}{2}+1}}{2} \end{aligned} \]

Base R provides two useful functions for calculating these statistics: the mean() and median().
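For example, applied to the mpg variable:

mean(mtcars$mpg)   # 20.09062
median(mtcars$mpg) # 19.2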

Unlike the mean and median, the mode can be found for both numeric and character data. R does not have a standard built-in function for the statistical mode, so one option is to write a small user function that takes a vector as input and returns the modal value as output; another is to use a package, which is what we do below. The equation further below shows how to find the mode for grouped data.

The built-in mode() function tells us the internal storage mode of the object, not the mathematical or statistical mode, which is the value that occurs most often. We can instead use the modeest package to return what we are looking for.

In the following example, the mode for the origin vector is United States, but the mode for the mpg variable is shared by these values: 10.4, 15.2, 19.2, 21.0, 21.4, 22.8 and 30.4, which all appear twice. The mlv() function from modeest returns all of the tied modal values, while the table()-based approach below returns only the first of them.

The median is often a better measure of center in skewed distributions.

modeest::mlv(mtcars$mpg, method = "mfv") -> one
modeest::mlv(mtcars$origin, method = "mfv") -> two

names(sort(-table(mtcars$origin)))[1] -> three
names(sort(-table(mtcars$mpg)))[1] -> four

\[ \begin{aligned} \text{mode} = L + \frac{f_{m} - f_{1}}{(f_{m} - f_{1}) + (f_{m} - f_{2})} \times h \end{aligned} \]

Here L is the lower boundary of the modal class, \(f_m\) is the frequency of the modal class, \(f_1\) and \(f_2\) are the frequencies of the classes before and after it, and h is the class width.


In the visualization below, we draw vertical lines for the mean and the median in each histogram. We do this by adding the geom_vline() layer in ggplot. We can see that the mean is larger than the median for the hp variable. On the other hand, the mean and the median are approximately equal for the qsec variable.

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec)  %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  group_by(names) %>%
  summarize(median = median(values), mean = mean(values)) -> mtcars_center_df

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec)  %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, fill = names)) +
  facet_wrap(~names, scales = 'free') +
  geom_histogram(bins = 32) +
  geom_vline(mapping = aes(xintercept = median), mtcars_center_df, linetype = 'dashed', color = 'red') +
  geom_vline(mapping = aes(xintercept = mean), mtcars_center_df, linetype = 'dashed', color = 'darkgreen') +
  labs(caption = "red line: median \n green line: mean") +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  theme(legend.position = 'none') +
  scale_fill_manual(values=met.brewer("Hiroshige", 6))

spread

The second moment of a distribution is spread. We can think of this in terms of standard deviation or variance and interquartile range.

  • standard deviation
  • interquartile range

We can start by talking about the standard deviation or the variance. But what do we mean by the standard deviation or variance and why do we talk about these two measures together? We talk about these two measures together because the standard deviation is the square root of the variance and these measures are therefore two sides of the same coin.

\[ \begin{aligned} \text{sd} = \sqrt{\text{var}} \end{aligned} \]

The above equation shows the mathematical relationship between the standard deviation and the variance. This equation, however, doesn't increase our understanding very much. In order to increase our understanding, it's useful to review how the variance is calculated. The variance is essentially the average of the squared deviations from the mean; it is calculated by taking each value in the distribution, subtracting the mean, squaring the difference, summing these squared differences, and dividing by the number of observations (for the sample variance, by N - 1). The equation below shows this process.

\[ \begin{aligned} s^{2} = \frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}{N-1} \end{aligned} \]

The R programming language, of course, has shortcuts for calculating these measures: we use the var() and sd() functions to calculate the variance and standard deviation in one step. But we can practice calculating the variance and standard deviation by hand in order to understand what is happening under the hood of these functions.

n <- length(mtcars$wt) # this value is 32
  
sum_of_wt <- sum(mtcars$wt) # add up all the values 102.952

mean_wt <- mean(mtcars$wt) # this value is 3.21725

sq_difference_from_mean_for_each_value <- (mtcars$wt - mean(mtcars$wt)) ^ 2 # this is the squared difference from the mean for each data value

sum_sq_difference_from_mean <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) # add them all up 29.67875

variance_wt <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) / (length(mtcars$wt) - 1) # this number is 0.957379
variance_wt <- var(mtcars$wt) # this number is also 0.957379

sq_root_variance_wt <- sqrt(var(mtcars$wt)) # this number is 0.9784574
sd_wt <- sd(mtcars$wt) # this number is also 0.9784574

one_sd_below <- mean(mtcars$wt) - sd(mtcars$wt)
one_sd_above <- mean(mtcars$wt) + sd(mtcars$wt)

The following table shows the calculated standard deviation for each of the variables in our dataset. We can do this efficiently by using the %>% operator, the dplyr group_by and summarize functions and the sd function from base R.

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  group_by(names) %>%
  summarize(sd = sd(values))

The interquartile range, often abbreviated as IQR and accessed by the IQR() function, describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range, we first find the median of the lower and upper half of the data. These values are the first quartile, abbreviated as Q1, and the third quartile, abbreviated as Q3. The interquartile range is the difference between Q3 and Q1.
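For example, for the wt variable, we can look at the two quartiles and their difference:

quantile(mtcars$wt, c(0.25, 0.75)) # the first and third quartiles
IQR(mtcars$wt)                     # the difference between them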

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  group_by(names) %>%
  summarize(mean(values), sd(values), median(values), IQR(values))

In addition to reporting the standard deviation and the interquartile range, it’s also useful to think about the range and minimum and maximum values of a distribution, which can be expressed by the range(), min() and max() functions, respectively.

range(mtcars$mpg) # the minimum and maximum values
min(mtcars$mpg)
max(mtcars$mpg)
skewness

The third moment of a distribution is skewness. Skewness is given by the following equation, a version of the Fisher-Pearson standardized moment coefficient.

\[ \begin{aligned} \tilde{\mu}_{3} = \frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{3}}{(N-1) \cdot \sigma^{3}} \end{aligned} \]

Skewness is a measure of symmetry. Negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed. Positive skewness would indicate that the mean of the data values is larger than the median, and the data distribution is right-skewed. The formula given in most textbooks is

\( \text{Skew} = \frac{3\,(\text{Mean} - \text{Median})}{\text{Standard Deviation}} \)

This is known as Pearson's second skewness coefficient, an alternative to the Pearson mode skewness.

fisher_pearson_skew = 3 * ( ( mean(mtcars$mpg) - median(mtcars$mpg) )  / sd(mtcars$mpg) ) # Pearson's second (median-based) skewness coefficient

beb <- moments::skewness(mtcars$mpg) # the moment-based sample skewness from the moments package

# Compute the skewness (or standardized third central moment) of a variable.

 # https://brownmath.com/stat/shape.htm#KurtosisVisualize

# The reference standard is a normal distribution, which has a kurtosis of 3.
# the moment coefficient of skewness

skewness <- sum((mtcars$mpg - mean(mtcars$mpg)) ^ 3) / 32 / (var(mtcars$mpg) ^ (3/2))

sample_skewness <- (sqrt(32*(32-1)) / (32 -2) ) * skewness

Skewness is important to understand because it helps us identify the most appropriate descriptive statistics to report on a given dataset. The mean tends to be a better measure of center when the distribution is normal or almost normal. However, if there are outliers in the distribution, the median tends to be a better measure of the center. In a similar way, standard deviation tends to be a good measure of the spread of a distribution when the distribution is normal or almost normal and the interquartile range tends to be a good measure of the spread of a distribution when the distribution is left-skewed or right-skewed.

kurtosis

The fourth moment of a distribution is kurtosis. Kurtosis measures how heavy the tails of a distribution are relative to a normal distribution; a normal distribution has a kurtosis of 3, which is why the code below subtracts 3 to obtain the excess kurtosis.

m <- moments::kurtosis(mtcars$mpg) 

excess_kurtosis <- m - 3

kurtosis <- sum((mtcars$mpg - mean(mtcars$mpg)) ^ 4) / 32 / (var(mtcars$mpg) ^ (2))

excess_kurtosis <- kurtosis - 3

sample_kurtosis <- ((32-1)/((32-2)*(32-3))) * ((32+1)*(excess_kurtosis)+6)

After reviewing all the moments of a distribution, we can now create a better summary table.

We can immediately see that the British and Swedish cars have only one observation each, so their spread and shape statistics are not meaningful.

mtcars %>%
  mutate(l_per_100_km = 235 / mpg) %>%
  group_by(origin) %>%
  summarize(mean = round(mean(mpg), 1), 
            sd = round(sd(mpg),1),
            skew = round(3 * ( ( mean(mpg) - median(mpg) )  / sd(mpg) ),1),
            kurtosis = round(sum((mpg - mean(mpg)) ^ 4) / n() / (var(mpg) ^ (2)),1)) %>%
  arrange(desc(mean))

Bivariate Statistics

The following datasets all share nearly identical summary statistics, including the x and y means, the x and y variances and standard deviations, the correlation, and the r-squared of a simple linear regression.

The point here is to show that graphing the data is essential for understanding it, not just reading summary statistics. This collection of four datasets is known as Anscombe's quartet.

anscomb <- tribble(
  ~id, ~dataset, ~x, ~y,
"0","I","10.0","8.04",
"1","I","8.0","6.95",
"2","I","13.0","7.58",
"3","I","9.0","8.81",
"4","I","11.0","8.33",
"5","I","14.0","9.96",
"6","I","6.0","7.24",
"7","I","4.0","4.26",
"8","I","12.0","10.84",
"9","I","7.0","4.82",
"10","I","5.0","5.68",
"11","II","10.0","9.14",
"12","II","8.0","8.14",
"13","II","13.0","8.74",
"14","II","9.0","8.77",
"15","II","11.0","9.26",
"16","II","14.0","8.1",
"17","II","6.0","6.13",
"18","II","4.0","3.1",
"19","II","12.0","9.13",
"20","II","7.0","7.26",
"21","II","5.0","4.74",
"22","III","10.0","7.46",
"23","III","8.0","6.77",
"24","III","13.0","12.74",
"25","III","9.0","7.11",
"26","III","11.0","7.81",
"27","III","14.0","8.84",
"28","III","6.0","6.08",
"29","III","4.0","5.39",
"30","III","12.0","8.15",
"31","III","7.0","6.42",
"32","III","5.0","5.73",
"33","IV","8.0","6.58",
"34","IV","8.0","5.76",
"35","IV","8.0","7.71",
"36","IV","8.0","8.84",
"37","IV","8.0","8.47",
"38","IV","8.0","7.04",
"39","IV","8.0","5.25",
"40","IV","19.0","12.5",
"41","IV","8.0","5.56",
"42","IV","8.0","7.91",
"43","IV","8.0","6.89")

anscomb$x <- as.numeric(anscomb$x)
anscomb$y <- as.numeric(anscomb$y)

ggplot(anscomb, aes(x = x, y = y, color = dataset)) + geom_point() +
  facet_wrap(~dataset) + geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "examples", subtitle = "examples") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_color_manual(values=met.brewer("Hiroshige", 4))

Pearson Correlation

When we fit a linear model we can investigate goodness-of-fit statistics like r-squared. Keep in mind, however, that r-squared is another way of interpreting another statistic, the Pearson correlation, denoted as r. It's often good practice, then, to look at what the Pearson correlation is telling us before fitting a linear regression and examining the r-squared. The Pearson correlation is calculated using the following formula:

\[ \begin{aligned} r = \frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2} \sum\left(y_{i}-\bar{y}\right)^{2}}} \end{aligned} \]

We can compute the Pearson correlation ourselves by building our own correlation table. This table gives us the values that we need for the general equation. The Pearson correlation is a measure of the linear correlation between two variables. It is the ratio between the covariance of the two variables and the product of their standard deviations; it is therefore essentially a normalized measurement of the covariance, and the result always has a value between -1 and 1. Whenever you square a number between -1 and 1, you end up with a number between 0 and 1. Thus r-squared is always between 0 and 1.
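Before working through the computational formula in the code below, we can verify the covariance-based definition directly; both lines return the same value, approximately -0.87 for wt and mpg.

cov(mtcars$wt, mtcars$mpg) / (sd(mtcars$wt) * sd(mtcars$mpg)) # covariance divided by the product of the standard deviations
cor(mtcars$wt, mtcars$mpg)                                    # the same value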

mtcars %>%
  dplyr::select(wt, mpg) %>%
  mutate(wt_times_mpg = wt * mpg) %>%
  mutate(wt_squared = wt ^ 2) %>%
  mutate(mpg_squared = mpg ^ 2) %>%
  janitor::adorn_totals("row") %>%
  tail(n = 1)
# This is our Pearson correlation equation using the language of R
pearson_correlation <- 
  ((length(mtcars$mpg) * sum(mtcars$wt * mtcars$mpg)) - 
     (sum(mtcars$wt) * sum(mtcars$mpg))) / 
  sqrt(((length(mtcars$mpg) * sum(mtcars$wt ^ 2)) - (sum(mtcars$wt) ^ 2)) * 
         ((length(mtcars$mpg) * sum(mtcars$mpg ^ 2)) - (sum(mtcars$mpg) ^ 2))) 

# We see that R calculates the Pearson correlation for us using the cor function

pearson_correlation <- cor(mtcars$wt, mtcars$mpg)

# Keep in mind that order is not important
# cor(mtcars$wt, mtcars$mpg) = cor(mtcars$mpg, mtcars$wt)

# We square r in order to find r-squared
r_squared <- cor(mtcars$wt, mtcars$mpg) ^ 2

tibble(pearson_correlation, r_squared, rownames = 'wt_and_mpg') %>% relocate(rownames)
mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  group_by(names) %>%
  summarize(correlation_with_mpg = cor(mpg, values)) %>%
  mutate(r_squared = correlation_with_mpg ^ 2) %>%
  arrange(correlation_with_mpg)
mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  nest(data = -names) -> nested

n <- nested %>% 
  mutate(test = map(data, ~ cor.test(.x$mpg, .x$values)))

n2 <- nested %>% 
  mutate(
    test = map(data, ~ cor.test(.x$mpg, .x$values)), # S3 list-col
    tidied = map(test, broom::tidy)
  ) 

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  nest(data = -names) %>% 
  mutate(
    test = map(data, ~ cor.test(.x$mpg, .x$values)), # S3 list-col
    tidied = map(test, broom::tidy)
  ) %>% 
  unnest(tidied) -> n3
Scatterplot Matrix

Scatterplot matrices are useful because they plot each variable against each other variable. The GGally package makes this method easy.

For each panel, the variable on the vertical axis is given by the variable name in the row, and the variable on the horizontal axis is given by the variable name in the column. Correlations are shown in the upper triangle, scatterplots in the lower triangle, and density plots appear on the diagonal, where each variable meets itself.

Our scatterplot matrix tells us that, among the predictors, wt has the strongest correlation with mpg.

library(GGally)

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  ggpairs(3:8) + 
  labs(title = "Mtcars", subtitle = "Scatterplot Matrix") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 6))

Correlation Test

We can also test whether a correlation is significantly different from zero with cor.test(). Here we use the Spearman rank correlation, a non-parametric alternative to the Pearson correlation; because the data contain ties, R may warn that it cannot compute an exact p-value.

correlation_test <- cor.test(mtcars$wt, mtcars$mpg,
                        method = "spearman")

broom::tidy(correlation_test)
R-Squared

\[ \begin{aligned} {R ^ 2} = cor(x, y) ^ 2 \end{aligned} \]
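As a quick check, squaring the Pearson correlation gives the same value as the r-squared reported by a simple linear model, approximately 0.75 for mpg and wt.

cor(mtcars$wt, mtcars$mpg) ^ 2                 # approximately 0.75
summary(lm(mpg ~ wt, data = mtcars))$r.squared # the same value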

RMSE

\[ \begin{aligned} {RMSD} = \sqrt{\frac{\sum_{i=1}^{N}\left(x_{i}-\hat{x}_{i}\right)^{2}}{N}} \end{aligned} \]

Root Mean Squared Error (RMSE): as the name suggests, it is the square root of the average squared difference between the actual values and the values predicted by the model for the target variable. It summarizes the typical prediction error made by the model, so a lower RMSE indicates a more accurate model.

RMSE is a measure of the differences between values predicted by a model and the values observed.

Root Mean Square Error (RMSE) is the standard deviation of the residuals which are the prediction errors.

Residuals are a measure of how far from the regression line data points are. RMSE is a measure of how spread out the residuals are.

RMSE tells us how concentrated the data is around the line of best fit.

Whichever formulation you use, they do the same thing. If you don't like formulas, you can find the RMSE by squaring the residuals, finding the average of those squared residuals, and taking the square root of the result.
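Here is a minimal sketch of those three steps using the residuals of a simple model. Note that this version divides by n; the code later in this section divides by the residual degrees of freedom (n - 2), which gives a slightly larger value.

rmse_sketch_model <- lm(mpg ~ wt, data = mtcars)

squared_residuals <- residuals(rmse_sketch_model) ^ 2 # square the residuals
mean_squared_residuals <- mean(squared_residuals)     # average the squared residuals
sqrt(mean_squared_residuals)                          # take the square root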

RMSE has the same unit as the dependent variable (DV)

These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors (or prediction errors) when computed out-of-sample

The RMSD serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power. RMSD is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent

RMSD is always non-negative, and a value of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a lower RMSD is better than a higher one. However, comparisons across different types of data would be invalid because the measure is dependent on the scale of the numbers used.

RMSD is the square root of the average of squared errors. The effect of each error on RMSD is proportional to the size of the squared error; thus larger errors have a disproportionately large effect on RMSD. Consequently, RMSD is sensitive to outliers.

mtcars_linear_model <- lm(mpg ~ wt, mtcars) 

# summary(mtcars_linear_model)
# 
# ?model.matrix

stats::model.matrix(mtcars_linear_model)
##    (Intercept)    wt
## 1            1 2.620
## 2            1 2.875
## 3            1 2.320
## 4            1 3.215
## 5            1 3.440
## 6            1 3.460
## 7            1 3.570
## 8            1 3.190
## 9            1 3.150
## 10           1 3.440
## 11           1 3.440
## 12           1 4.070
## 13           1 3.730
## 14           1 3.780
## 15           1 5.250
## 16           1 5.424
## 17           1 5.345
## 18           1 2.200
## 19           1 1.615
## 20           1 1.835
## 21           1 2.465
## 22           1 3.520
## 23           1 3.435
## 24           1 3.840
## 25           1 3.845
## 26           1 1.935
## 27           1 2.140
## 28           1 1.513
## 29           1 3.170
## 30           1 2.770
## 31           1 3.570
## 32           1 2.780
## attr(,"assign")
## [1] 0 1
mse <- sum(residuals(mtcars_linear_model) ^ 2) / 30 # 30 = 32 observations minus 2 estimated coefficients, the residual degrees of freedom


# mtcars_linear_model <- lm(mpg ~ wt, x = TRUE, mtcars) 
# mtcars_linear_model$x

# str(mtcars_linear_model)
# lm(mpg ~ wt, qr = TRUE, mtcars) 
# summary(mtcars_linear_model)
mtcars_linear_model <- lm(mpg ~ wt, mtcars) 

rmse = sqrt(sum(residuals(mtcars_linear_model) ^ 2) / 30) # the square root of the residual sum of squares divided by the residual degrees of freedom
# sqrt(sum(mtcars_linear_model$residuals ^ 2) / 30)


# sd(residuals(mtcars_linear_model)) # similar to standard deviation

Some researchers have recommended the use of the Mean Absolute Error (MAE) instead of the Root Mean Square Deviation. MAE possesses advantages in interpretability over RMSD. MAE is the average of the absolute values of the errors. MAE is fundamentally easier to understand than the square root of the average of squared errors. Furthermore, each error influences MAE in direct proportion to the absolute value of the error, which is not the case for RMSD.[2]

Mean Absolute Error (MAE): this metric gives the average absolute difference between the actual values and the values predicted by the model for the target variable. If outliers should not weigh heavily in evaluating the model, MAE can be used instead of RMSE. A lower value indicates a better model.

mae <- mean(abs(residuals(mtcars_linear_model)))
r <- cor(mtcars$mpg, mtcars$wt)

variance_wt <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) / (length(mtcars$wt) - 1) 
variance_mpg <- sum((mtcars$mpg - mean(mtcars$mpg)) ^ 2) / (length(mtcars$mpg) - 1) 

mean_wt <- (sum(mtcars$wt) / length(mtcars$wt))
mean_mpg <- (sum(mtcars$mpg) / length(mtcars$mpg))

sd_wt <- sqrt(variance_wt)
sd_mpg <- sqrt(variance_mpg)

pearson_correlation <- ((length(mtcars$mpg) * sum(mtcars$wt * mtcars$mpg)) - (sum(mtcars$wt) * sum(mtcars$mpg))) / sqrt(((length(mtcars$mpg) * sum(mtcars$wt ^ 2)) - (sum(mtcars$wt) ^ 2)) * ((length(mtcars$mpg) * sum(mtcars$mpg ^ 2)) - (sum(mtcars$mpg) ^ 2))) 

5.2 Inferential Statistics

Gelman, Hill, and Vehtari, in their book Regression and Other Stories, outline three different challenges of inferential statistics. In their words:

  • Generalizing from sample to population
  • Generalizing from treatment to control group
  • Generalizing from observed measurements to the underlying constructs of interest

While descriptive statistics summarize the characteristics of a dataset, inferential statistics allow us to make inferences about the more general population, including through hypothesis testing.

sample to population

Our first step towards understanding confidence intervals is to understand the standard deviation of the sample.

We first practice calculating the variance and standard deviation by hand in order to understand what is happening under the hood of the var() and sd() functions.

n <- length(mtcars$wt) # this value is 32
  
sum_of_wt <- sum(mtcars$wt) # add up all the values 102.952

mean_wt <- mean(mtcars$wt) # this value is 3.21725
# this is same as sum(mtcars$wt) / length(mtcars$wt) 

sq_difference_from_mean_for_each_value <- (mtcars$wt - mean(mtcars$wt)) ^ 2 # this is the squared difference from the mean for each data value

sum_sq_difference_from_mean <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) # add them all up 29.67875

# another method to sum squared difference from mean
# mtcars %>%
#   mutate(squared_difference_from_mean = (mtcars$wt - mean(mtcars$wt)) ^ 2) %>%
#   summarize(sum_squared_difference_from_mean = sum(squared_difference_from_mean)) 

variance_wt <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) / (length(mtcars$wt) - 1) # this number is 0.957379

variance_wt <- var(mtcars$wt) # this number is also 0.957379

sq_root_variance_wt <- sqrt(var(mtcars$wt)) # this number is 0.9784574

sd_wt <- sd(mtcars$wt) # this number is also 0.9784574

one_sd_below <- mean(mtcars$wt) - sd(mtcars$wt)
one_sd_above <- mean(mtcars$wt) + sd(mtcars$wt)

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  group_by(names) %>%
  summarize(sd = sd(values))

Standard Error

The standard deviation (SD) measures the amount of variability, or dispersion, of the individual data values around the mean.

The standard error of the mean (SEM) measures how far the sample mean (average) of the data is likely to be from the true population mean. The SEM is always smaller than the SD.

We calculate the lower and upper bounds of the confidence interval for the population mean by first calculating the standard error of the mean and then multiplying it by the critical value to get the margin of error.

z_star = 1.96 # the critical value for a 95% confidence interval

n <- length(mtcars$wt) # this value is 32
  
standard_error = sd(mtcars$wt) / sqrt(n) # this value is about 0.173

margin_of_error = z_star * standard_error # this value is 0.3390182

mean_wt <- mean(mtcars$wt) # 3.21725

lower_bound_ci <- mean_wt - margin_of_error # lower bound confidence interval
upper_bound_ci <- mean_wt + margin_of_error # upper bound confidence interval

tibble(lower_bound_ci, upper_bound_ci, Mtcars_variable = "wt")

Confidence Interval

Theory
mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  group_by(names) %>%
  summarize(margin_of_error = 1.96 * sd(values) / sqrt(length(values)), 
            mean_value = mean(values)) %>%
  mutate(lower_ci = mean_value - margin_of_error) %>%
  mutate(upper_ci = mean_value + margin_of_error) -> mtcars_theoretical_ci_df

mtcars_theoretical_ci_df
Bootstrapping
mtcars %>% # use this to get a single value
  infer::specify(response = wt) %>%
  infer::generate(reps = 1000, type = 'bootstrap') %>%
  infer::calculate(stat = 'mean') %>%
  infer::get_ci(level = 0.95) -> one_value

library(infer) # so that specify(), generate(), calculate() and get_ci() are available without the infer:: prefix

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  split(.$names) %>% # split(mtcars_long_numeric$names) %>%
  map_df(~.x %>% 
  specify(response = values) %>%
  generate(reps = 1000, type = 'bootstrap') %>%
  calculate(stat = 'mean') %>%
  get_ci(level = 0.95), .id = 'names') -> mtcars_bootstrapped_ci_df
mtcars_theoretical_ci_df <- mtcars_theoretical_ci_df %>% dplyr::select(names, lower_ci, upper_ci)

names(mtcars_theoretical_ci_df) <- c("names", "lower_ci_theoretical", "upper_ci_theoretical")
names(mtcars_bootstrapped_ci_df) <- c("names", "lower_ci_bootstrapped", "upper_ci_bootstrapped")

mtcars_ci_df <- left_join(mtcars_theoretical_ci_df, mtcars_bootstrapped_ci_df, by = 'names')
mtcars_ci_df
mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, fill = names)) +
  facet_wrap(~names, scales = 'free') +
  geom_histogram(bins = 32) +
  geom_vline(mapping = aes(xintercept = lower_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
  geom_vline(mapping = aes(xintercept = upper_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
  geom_vline(mapping = aes(xintercept = lower_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed') +
  geom_vline(mapping = aes(xintercept = upper_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed') +
  labs(caption = "dashed line equals 95% confidence interval for mean") +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 6))


mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, fill = names)) +
  facet_wrap(~names, scales = 'free') +
  geom_density() +
  geom_vline(mapping = aes(xintercept = lower_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
  geom_vline(mapping = aes(xintercept = upper_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
  geom_vline(mapping = aes(xintercept = lower_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed') +
  geom_vline(mapping = aes(xintercept = upper_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed') +
  labs(caption = "dashed line equals 95% confidence interval for mean") +
  labs(title = "Mtcars", subtitle = "Density Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 6))

The Jackknife

There are various "solutions" for heteroskedasticity, which are less true solutions than robustness checks, since the right remedy may hinge on the kind of heteroskedasticity and whether you know it or not. In practice, these solutions are robustness tests to compare with the standard error estimates derived from a simple OLS model. One of those solutions is "bootstrapping", which was first introduced by Efron (1979) as an extension of the "jackknife" approach. Briefly, the "jackknife" approach, first introduced by Quenouille (1949) and given its name by Tukey (1958), is a "leave-one-out" resampling method that recalculates a statistic of interest iteratively until each observation has been removed once. One limitation, beyond how tricky it is to adapt to more complex data structures, is that jackknifing struggles with smaller datasets and the number of resamples is capped at the number of observations. Bootstrapping, a resampling-with-replacement approach to calculating statistics of interest (e.g. standard errors from a regression), is far more versatile and flexible.

With 32 observations in mtcars, there are 32 different leave-one-out samples, one for each observation that can be left out. The most any single jackknife sample can shift the statistic is whatever shift leaving that one observation out produces. Unlike the bootstrap, jackknifing involves no random sampling, so it is reproducible every time.

for (i in 1:nrow(mtcars)) {
  mtcars_minus_one <- mtcars[-i,]                      # drop row i from the data
  mean_value_minus_one = mean(mtcars_minus_one$mpg)    # the mean of mpg with that row left out
  mtcars$jackknife_mean_mpg[i] <- mean_value_minus_one # store the leave-one-out mean
}

# mtcars
# mtcars_long_numeric_with_mpg

ggplot(mtcars, aes(x = jackknife_mean_mpg)) + 
  geom_histogram(bins = 32, color = 'black', fill = 'lightblue') +
      labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

# mtcars$mpg # the 32 mpg values
# mean(mtcars$mpg) # the mean mpg # sometimes called x.bar or x-bar
n <- 32
standard_error <- sd(mtcars$mpg)/sqrt(n)

# the mean is the same but not all the statistics are the same
# sd(mtcars$mpg) # these two values are different
# sd(mtcars$jackknife_mean_mpg) # these two values are different

# mtcars$jackknife_mean_mpg # the 32 jackknifed means
jackknife_mean_mpg <- mean(mtcars$jackknife_mean_mpg) # the jackknife mean

jackknife_standard_error <- sqrt(((n - 1) / n) * sum((mtcars$jackknife_mean_mpg - jackknife_mean_mpg) ^ 2))


t.star <- qt(p = 0.975, df = n - 1) # qt() returns a quantile of the t-distribution: here the 97.5th percentile with n - 1 degrees of freedom
confidence_intervals <- mean(mtcars$mpg) + c(-1,1) * t.star * standard_error
confidence_intervals
## [1] 17.91768 22.26357
You will notice that the mean of all the jackknife statistics is equal to the statistic computed using all n data values, and similarly that the jackknifed standard error is equal to the standard error of the mean obtained using all of the data and the familiar formula. For the mean, then, the jackknife is just a more convoluted way to compute quantities we already have formulas for; its real value comes with statistics that do not have simple standard error formulas.
mtcars %>%
  dplyr::select(car, mpg, disp, hp, drat, wt, qsec) %>%
  mutate(across(c(mpg, disp, hp, drat, wt, qsec), function(x) 
       map_dbl(seq_along(x), function(rn) mean(x[-rn])), 
          .names = "{.col}_jackknife_mean")) %>%
  dplyr::select(-c(mpg:qsec)) -> saved

names(saved) <- c("car", "mpg", "disp", "hp", "drat", "wt", "qsec")

saved <- pivot_longer(saved, names_to = 'names', values_to = 'values', 2:7)

saved_ci_df <- saved %>%
  group_by(names) %>%
  summarize(standard_error = ((n() - 1) / sqrt(n())) * sd(values), # the jackknife standard error: sqrt(((n - 1) / n) * sum((values - mean(values)) ^ 2))
            mean_value = mean(values),
            t_star = qt(0.975, n() - 1)) %>%
  mutate(lower_ci_jackknife = mean_value - t_star * standard_error) %>%
  mutate(upper_ci_jackknife = mean_value + t_star * standard_error)

mtcars_theoretical_ci_df
mtcars_ci_df
saved_ci_df
saved_ci_df[1, 1] = "disp"
saved_ci_df[2, 1] = "drat"
saved_ci_df[3, 1] = "hp"
saved_ci_df[4, 1] = "mpg"
saved_ci_df[5, 1] = "qsec"
saved_ci_df[6, 1] = "wt"

ggplot(saved, aes(x = values)) +
  geom_histogram(bins = 32, color = 'black', fill = 'lightblue') +
  facet_wrap(~names, scales = 'free') +
  labs(title = "Mtcars", subtitle = "Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  geom_vline(mapping = aes(xintercept = lower_ci_jackknife), saved_ci_df, linetype = 'dashed', color = 'red') +
  geom_vline(mapping = aes(xintercept = upper_ci_jackknife), saved_ci_df, linetype = 'dashed', color = 'red') 

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  ggplot(aes(x = values, fill = names)) +
  facet_wrap(~names, scales = 'free') +
  geom_density() +
  # geom_vline(mapping = aes(xintercept = lower_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
  # geom_vline(mapping = aes(xintercept = upper_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
  # geom_vline(mapping = aes(xintercept = lower_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed', color = 'blue') +
  # geom_vline(mapping = aes(xintercept = upper_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed', color = 'blue') +
  geom_vline(mapping = aes(xintercept = lower_ci_jackknife), saved_ci_df, linetype = 'dashed', color = 'red') +
  geom_vline(mapping = aes(xintercept = upper_ci_jackknife), saved_ci_df, linetype = 'dashed', color = 'red') +
  labs(caption = "dashed line equals 95% confidence interval for mean") +
  labs(title = "Mtcars", subtitle = "Density Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

treatment to control

observed measurements to underlying constructs

6 Hypothesis Testing

As a first step in understanding hypothesis testing, we first want to understand some of the characteristics of different sampling distributions, including understanding how to plot the area under the curve, the critical area and type 1 and type 2 errors.

Before we dive deep into a dataset, we want to create one or more hypotheses for testing. As a general process, we first state a hypothesis, then we set the criteria for a decision and, as a final step, we compute the test statistic. When we do hypothesis testing, we actually refer to two distinct hypotheses and choose between them: Our two distinct hypotheses are the null hypothesis and the alternate hypothesis. We have to create a criterion to choose between our two hypotheses. We do this by choosing a threshold for our p-value. This threshold is called our alpha value.

P-Values

We usually have more than one test statistic. The p-value is a good first place to start because it is a normalized criterion across different kinds of tests. The p-value is the level of marginal significance within our hypothesis test, and the 'p' refers to probability because it quantifies uncertainty. Specifically, the p-value is the probability of obtaining results at least as extreme as our observed results by random chance alone, assuming that our null hypothesis is correct. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis. For example, a p-value of 0.05 means that results at least as extreme as ours would happen by random chance alone only 5% of the time if the null hypothesis were true. Most of the time, the criterion for decision is set at an alpha level of 0.05.

Whenever we report statistical significance, it's extremely important that we have integrity around our hypothesis testing. A p-value of 0.05 is the typical threshold for statistical significance. We should never go fishing for statistical significance. The following is an illustration of why. Let's say we generate our own dataframe of random values from a normal distribution, and we do this 100 times. We can use the rnorm() function to create the random normal vectors. At an alpha of 0.05, we would expect about five of these vectors to test as 'significant'. But that doesn't mean anything, because these are just randomly generated vectors. The workflow here is important. It would be dishonest to write conclusions about statistical significance after picking a variable by scanning p-values. This is sometimes referred to as 'p-hacking'. It would be analogous to shooting bullets at the side of a barn and then drawing bullseyes around some of the holes.
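Here is a minimal sketch of that illustration. It assumes we correlate each of 100 random vectors with mpg and count how many of the tests come out 'significant' purely by chance.

set.seed(1)

p_values <- replicate(100, {
  random_vector <- rnorm(nrow(mtcars))        # a random vector unrelated to anything
  cor.test(mtcars$mpg, random_vector)$p.value # p-value of its correlation with mpg
})

sum(p_values < 0.05) # on average, about 5 of the 100 tests are 'significant' by chance alone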

Type I and Type II Errors

We now want to tackle an even more difficult problem. In statistics courses, teachers often have a tough time explaining the concepts of type I and type II errors to their students. A type I error is the false rejection of the null hypothesis when the null hypothesis is true, while a type II error is the failure to reject the null hypothesis when the alternative hypothesis is true. The most difficult thing is to mentally focus on one particular hypothesis at a time: the underlying assumption of the type I error is that the null hypothesis is true, while the underlying assumption of the type II error is that the alternative hypothesis is true. We can visualize this by shading an area under the true hypothesis and using a more subtle line for the other hypothesis. First, consider the type I error under a standard normal null distribution.
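As a minimal sketch, assuming a one-sided test at an alpha of 0.05, we can shade the rejection region of a standard normal null distribution; the shaded area is the probability of falsely rejecting a true null hypothesis, which is the type I error rate.

critical_z <- qnorm(0.95) # the one-sided critical value for alpha = 0.05

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm, color = '#1e466e') +
  stat_function(fun = dnorm, geom = "area", fill = '#e76254', alpha = 0.5,
                xlim = c(critical_z, 4)) +
  labs(title = "Type I Error", subtitle = "Shaded area: rejection region under a true null hypothesis (alpha = 0.05, one-sided)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))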

Parametric and Non-Parametric Equivalents

Parametric tests are those that make assumptions about the parameters of the population distribution from which the sample is drawn, most often the assumption that the population data are normally distributed. Non-parametric tests are "distribution-free" and, as such, can be used for non-normal variables. Most common parametric tests have a non-parametric equivalent.
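For example, assuming the am column from the base mtcars data is still present, we can compare mpg across transmission types with a parametric test and its non-parametric equivalent; because mpg contains ties, wilcox.test() may warn that it cannot compute an exact p-value.

t.test(mpg ~ am, data = mtcars)      # parametric: two-sample t-test
wilcox.test(mpg ~ am, data = mtcars) # non-parametric equivalent: Wilcoxon rank-sum test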

6.1 Normality Tests

There are two ways to know if our data approximately follows a normal distribution or is sampled from a distribution that is close to normal. The first way is to look at frequency distributions like histograms, density plots, and boxplots, as well as assessing q-q plots, and the second way is to perform one or more normality tests. These methods can be used most effectively in conjunction with each other.

Hypothesis testing is often the way in which we evaluate our ultimate research question. However, hypothesis testing can also be used more locally, as a way to better understand our data on the road towards answering our research questions. For this reason, it's not uncommon to see a project that reports many p-values for different statistics on the way to a final set of conclusions. For example, we may have as our research question the idea that mpg can be predicted accurately by wt in the mtcars dataset using an ordinary least squares regression. The inferential statistics from ordinary least squares rest on an assumption of normality (strictly speaking, normality of the residuals), so it is common to check whether the variables involved look approximately normal. For this reason, we may choose to use a normality test, with a null and alternate hypothesis, in order to gain confidence about the distribution of the wt variable before we move on to the project of creating a linear model and assessing linear model statistics.

There are a few common normality tests, each with different strengths and weaknesses. Here are some of the most common:

  • Shapiro-Wilk Test
  • D’Agostino-Pearson omnibus test
  • Kolmogorov-Smirnov Test

The following visualization is the same four-paneled visualization as before, except we are applying our mtcars data instead of looking at a normal distribution from theory. We have made a change to the panel involving our density plot, however: We can compare the mpg variable in mtcars to a normal distribution with the same mean and standard deviation in order to see how much the mpg variable might deviate from a perfect normal distribution. Overall, across the four panels, we can see that the mpg variable is at least a little bit right-skewed. But how much is it right-skewed? Normality tests are one way to quantify this.

Shapiro-Wilk Test

The first normality test that we will look at is the Shapiro-Wilk test. The null-hypothesis of this test is that the population is normally distributed. Therefore, if the p-value is less than the chosen alpha level, typically 0.05, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. On the other hand, if the p-value is greater than the chosen alpha level, then the null hypothesis, that our data comes from a normally distributed population, can not be rejected.

It's important to know that the Shapiro-Wilk test is sensitive to outliers and is also influenced by sample size. For smaller samples, non-normality is less likely to be detected, but the Shapiro-Wilk test should still be preferred as it is generally the most sensitive: it has the best power for a given significance level. For larger samples (i.e. more than one hundred), normality tests become overly sensitive and the assumption of normality might be rejected too easily. This is what we mean when we say the Shapiro-Wilk test is sample-size biased: small samples have little power to reject the null hypothesis, while large samples have so much power that they can detect minor, unimportant deviations from normality. The mathematical equation for the Shapiro-Wilk test is given below; in practice we can rely on software or published tables. The test requires at least three data points.

\[ \begin{aligned} {\displaystyle W={\left(\sum _{i=1}^{n}a_{i}x_{(i)}\right)^{2} \over \sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}},} \end{aligned} \]

Calculating the Shapiro-Wilk statistic by hand is a little bit involved. The Shapiro-Wilk test calculates a test statistic called W; below we include some of the math to show how this might be done. Let’s think about the equation in two parts, the numerator and the denominator. The denominator is simply the sum of squared errors, meaning the sum of the squared differences between each data point and the mean; we performed a similar calculation when looking at the Pearson correlation coefficient. The numerator contains a set of constants called the ai values, the Shapiro-Wilk coefficients; each ai value is different and also depends on the sample size. Generally, we never do this by hand: we can look up critical values in a table, or we can let R sort the data and do the math for us. Using R, we can run a Shapiro-Wilk test with one simple line of code by using the shapiro.test() function. This one line of code calculates the sum of squared errors for the denominator and also calculates the numerator, sorting the data and using the correct ai values for our sample size. The output shows both W and the p-value. W is always between 0 and 1, and the closer W is to 1, the more the data resemble a normal distribution. We can see in the following table that disp and hp have low p-values, indicating that we can reject the null hypothesis that these variables follow a normal distribution or, more simply, that they are not normally distributed.

# https://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-test/

## calculating shapiro-wilk by hand

# vector <- function(myvector) {
#   first_value <- sort(myvector)[32] - sort(myvector)[1]
#   second_value <- sort(myvector)[31] - sort(myvector)[2]
#   third_value <- sort(myvector)[30] - sort(myvector)[3]
#   fourth_value <- sort(myvector)[29] - sort(myvector)[4]
#   fifth_value <- sort(myvector)[28] - sort(myvector)[5]
#   sixth_value <- sort(myvector)[27] - sort(myvector)[6]
#   seventh_value <- sort(myvector)[26] - sort(myvector)[7]
#   eighth_value <- sort(myvector)[25] - sort(myvector)[8]
#   ninth_value <- sort(myvector)[24] - sort(myvector)[9]
#   tenth_value <- sort(myvector)[23] - sort(myvector)[10]
#   eleventh_value <- sort(myvector)[22] - sort(myvector)[11]
#   twelfth_value <- sort(myvector)[21] - sort(myvector)[12]
#   thirteenth_value <- sort(myvector)[20] - sort(myvector)[13]
#   fourteenth_value <- sort(myvector)[19] - sort(myvector)[14]
#   fifteenth_value <- sort(myvector)[18] - sort(myvector)[15]
#   sixteenth_value <- sort(myvector)[17] - sort(myvector)[16]
#   return(c(first_value, second_value, third_value, fourth_value, fifth_value, sixth_value, seventh_value, eighth_value, ninth_value, tenth_value, eleventh_value, twelfth_value, thirteenth_value, fourteenth_value, fifteenth_value, sixteenth_value))
#   }

# maxtomin <- function(vec, N = NULL) {
#     if(is.null(N)) {N <- floor(length(vec)/2)}
#     vec <- sort(vec)
#     vecfin <- rev(vec) - vec
#     return(head(vecfin, N))
# }
# maxtomin(mtcars$mpg)

# If n is even, let m =  n/2, while if n is odd let m = (n–1)/2
# Calculate b as follows, taking the ai weights from Table 1 (based on the value of n) in the Shapiro-Wilk Tables. Note that if n is odd, the median data value is not used in the calculation of b.

# mtcars %>% 
#   select(mpg) %>%
#   arrange(mpg) %>%
#   mutate(mpg_rev = rev(mpg) - mpg) %>%
#   slice_head(n = length(mtcars$mpg) / 2)
shapiro.test(mtcars$mpg) # normal
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229
# p-value = 0.1229

# shapiro.test(mtcars$hp) # skewed
# p-value = 0.04881

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
  group_by(names) %>%
  do(broom::tidy(shapiro.test(.$values))) 
# 
# mtcars_long_numeric_with_mpg %>%
#   nest(data = -names) %>%
#   mutate(shapiro = map(data, ~ shapiro.test(.x$values)),
#          glanced = map(shapiro, broom::glance)) %>%
#   unnest(glanced)

How do we interpret the result of our Shapiro-Wilk normality test?

It’s important to think carefully about the null and alternate hypotheses. The null hypothesis here is that the data comes from a normally distributed population. Therefore, a p-value greater than 0.05 means we fail to reject the null hypothesis that the data is normally distributed. This is not the same as a guarantee that our data is normally distributed. Keep in mind that sample size has a big influence on the test: small samples have little power to reject the null hypothesis, while large samples have so much power that they can detect even minor deviations from normality.

Another thing to keep in mind with this test is that even if the data is found to not be normally distributed, the test itself does not tell us how much the data differs from normal.

Here we add the p-values from the test to our qq plots.


one <- ggplot2::ggplot(data = mtcars, aes(sample = mpg)) + 
  stat_qq() + stat_qq_line(color = '#e76254', linetype = 'dotdash', size = 1) +
  labs(subtitle = "mpg") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = 'shapiro-wilk p-value: 0.12')

three <- ggplot2::ggplot(data = mtcars, aes(sample = disp)) + 
  stat_qq() + stat_qq_line(color = '#ef8a47', linetype = 'dotdash', size = 1) +
  labs(title = "Mtcars", subtitle = "disp") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = 'shapiro-wilk p-value: 0.02')

four <- ggplot2::ggplot(data = mtcars, aes(sample = hp)) + 
  stat_qq() + stat_qq_line(color = '#f7aa58', linetype = 'dotdash', size = 1) +
  labs(subtitle = "hp") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = 'shapiro-wilk p-value: 0.05')

five <- ggplot2::ggplot(data = mtcars, aes(sample = drat)) + 
  stat_qq() + stat_qq_line(color = '#ffd06f', linetype = 'dotdash', size = 1) +
  labs(subtitle = "drat") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = 'shapiro-wilk p-value: 0.11')

seven <- ggplot2::ggplot(data = mtcars, aes(sample = wt)) + 
  stat_qq() + stat_qq_line(color = '#ffe6b7', linetype = 'dotdash', size = 1) +
  labs(subtitle = "wt") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = 'shapiro-wilk p-value: 0.09')

eight <- ggplot2::ggplot(data = mtcars, aes(sample = qsec)) + 
  stat_qq() + stat_qq_line(color = '#aadce0', linetype = 'dotdash', size = 1) +
  labs(subtitle = "qsec") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = 'shapiro-wilk p-value: 0.59')


library(patchwork)
one  + three + four + five + seven + eight

What about next steps? What if the data is not normally distributed? If the checks suggest that the data is not normally distributed, we can transform the dependent variable and repeat the normality checks on the transformed data. Common transformations include taking the log or the square root of the dependent variable.

Alternatively, we can use a non-parametric test. Non-parametric tests are often called distribution-free tests and can be used in place of their parametric equivalents.

Now we can transform the variables that failed our Shapiro-Wilk test into log versions and redo the test.

# 
# 
# mtcars_numeric_diff <- mtcars %>%
#   dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
#   mutate(log_hp = log(hp), log_disp = log(disp)) %>%
#   select(-hp, -disp)
# 
# mtcars_long_numeric_with_mpg_log <- mtcars_numeric_diff %>% 
#   pivot_longer(names_to = 'names', values_to = 'values', 3:8)
# 
# ggplot(mtcars_long_numeric_with_mpg_log, aes(x = values, fill = names)) +
#   facet_wrap(~names, scales = 'free') +
#   geom_density(color = 'black', alpha = 0.25, show.legend = FALSE) +
#       labs(title = "Mtcars", subtitle = "Density Plot") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5))
# 
# mtcars_long_numeric_with_mpg_log %>%
#   group_by(names) %>%
#   do(tidy(shapiro.test(.$values))) 
#   
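As a minimal, runnable version of the idea sketched in the commented-out block above, we can simply re-run the test on log-transformed copies of the two variables that failed it:

shapiro.test(log(mtcars$hp))     # hp re-checked on the log scale
shapiro.test(log(mtcars$disp))   # disp re-checked on the log scale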
Another normality test we sometimes encounter is the Jarque–Bera test, which checks whether the sample skewness and kurtosis are consistent with a normal distribution.
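The Jarque–Bera test is not included in base R; one common implementation (an assumption on our part about which package to use) is jarque.bera.test() in the tseries package. A minimal sketch:

# install.packages('tseries')
library(tseries)

jarque.bera.test(mtcars$mpg)   # null hypothesis: skewness and kurtosis match a normal distribution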

6.2 Theoretical Tests

Parametric Tests

Student T-Test

A one-sample t-test is used to test whether the mean of a population is equal to some specified value, and a two-sample t-test is used to test whether the means of two populations are equal. These tests make the following assumptions about the data.

  • Independence
  • Normality
  • Homogeneity of Variances
  • Random Sampling

Independence means that the observations in one sample are independent of the observations in the other sample. In other words, the independence assumption is held up if no two observations appear in both samples. After all, it wouldn’t make sense to try and draw conclusions about the differences between the samples when observations appear in both samples. Before we perform a two-sample t-test, we should always check to make sure that each observation only appears in each sample once. We should not try to perform a two sample t-test if this assumption is violated.

Normality means that both samples are approximately normally distributed. This is a crucial assumption because if the samples are not normally distributed then it isn’t valid to use the p-values from the test to draw conclusions about the differences between the samples. If the sample sizes are small (n < 50), then we can use a Shapiro-Wilk test to determine if each sample is normally distributed. If the p-value of the test is less than a certain significance level, then the data is likely not normally distributed. If the sample sizes are large, then it’s better to use a Q-Q plot to visually check if the data is normally distributed. If the data points roughly fall along a straight diagonal line in a Q-Q plot, then the dataset likely follows a normal distribution. If this assumption is violated then we can perform a Mann-Whitney U test, which is considered the non-parametric equivalent to the two sample t-test and does not make the assumption that the two samples are normally distributed.
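For example, if we plan to compare mpg between transmission types (am), a quick per-group check might look like this; the use of split() here is our own choice:

# Shapiro-Wilk test of mpg within each transmission group (am = 0 and am = 1)
lapply(split(mtcars$mpg, mtcars$am), shapiro.test)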

Homogeneity of variances means that both samples have roughly equal variance. We use the following heuristic to determine whether the variances are more or less equal: we divide the larger variance by the smaller variance, and if the result is a ratio less than four, we can assume the variances are approximately equal. In the event that the ratio of the variances is greater than four, we can instead use Welch’s t-test, a version of the two-sample t-test that does not assume the two samples have equal variances.
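Continuing the same example, here is a small sketch of that heuristic for mpg by transmission type (the object name is ours):

# ratio of the larger group variance to the smaller one for mpg by am
group_vars <- tapply(mtcars$mpg, mtcars$am, var)
max(group_vars) / min(group_vars)   # a ratio under about four suggests roughly equal variances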

Random Sampling means that both samples were obtained using a random sampling method; in other words, each observation had an equal probability of being included in either sample. There is no formal statistical test we can use to check this assumption. If this assumption is violated, it is hard to reliably generalize the findings from the two sample t-test to the more general population.

All of these tests evaluate the equality of means. What distinguishes them from one another is their assumptions, and those assumptions are the criteria we use to decide which test to run. Later in this chapter, we rank the tests by generality, from most general to least general.

One Sample T-Test

The test statistic for a One Sample t Test is denoted t, which is calculated using the following formula.

\[ \begin{aligned} t = \frac{\bar{x} - \mu}{se} \end{aligned} \]

Another way to think about the formula is in this way.

\[ \begin{aligned} t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} \end{aligned} \]

t <- ( mean(mtcars$mpg) - 20 ) / ( sd(mtcars$mpg) / sqrt( length(mtcars$mpg) ) )
t
## [1] 0.08506004
t.test(mtcars$mpg, mu = 20, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 0.08506, df = 31, p-value = 0.9328
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
##  17.91768 22.26357
## sample estimates:
## mean of x 
##  20.09062

The calculated t value is then compared to the critical t value from the t distribution table, with degrees of freedom df = n - 1 and the chosen confidence level. If the absolute value of the calculated t value is greater than the critical t value, then we reject the null hypothesis.
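For example, the critical value for our one-sample test above (df = 31, two-sided, alpha = 0.05) can be looked up with qt() instead of a printed table:

qt(0.975, df = 31)   # the critical t value; our calculated t of 0.085 is far below it, so we fail to reject the null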

Two Sample T-Test

Non-Parametric Tests

Mann Whitney Test

Welch Test

But you might be more interested in what makes these tests different from one another and which criteria should be met to select each test. The tests can be ranked in generality, from most general to least general.

The Mann-Whitney test is the most general, with the fewest assumptions. It is a nonparametric test, meaning that we do not need any assumptions about the probability distribution that the data comes from.

The Welch test adds the assumption that the two groups come from normal distributions, but the variance in the two groups can be different.

The Student t-test adds yet another assumption, that the variance should be equal in the two groups.

t.test(mpg ~ am, data = mtcars, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mpg by am
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.84837  -3.64151
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
t.test(mpg ~ am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231
wilcox.test(mpg ~ am, data=mtcars) # the Wilcoxon rank-sum test is equivalent to the Mann-Whitney U test
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0
t.test(mpg ~ vs, data = mtcars, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mpg by vs
## t = -4.8644, df = 30, p-value = 3.416e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.274221  -4.606732
## sample estimates:
## mean in group 0 mean in group 1 
##        16.61667        24.55714
t.test(mpg ~ vs, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by vs
## t = -4.6671, df = 22.716, p-value = 0.0001098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.462508  -4.418445
## sample estimates:
## mean in group 0 mean in group 1 
##        16.61667        24.55714
wilcox.test(mpg ~ vs, data=mtcars) # again, the Wilcoxon rank-sum test is equivalent to the Mann-Whitney U test
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mpg by vs
## W = 22.5, p-value = 9.034e-05
## alternative hypothesis: true location shift is not equal to 0
options(scipen = 999)

# mtcars %>%
#   dplyr::select(mpg, vs, am) %>%
#   pivot_longer(names_to = 'names', values_to = 'values', 2:3) %>%
#   nest(data = -names) %>% 
#   mutate(
#     test = map(data, ~ t.test(.x$mpg ~ .x$values)), # S3 list-col
#     tidied = map(test, tidy)
#   ) %>% 
#   unnest(tidied) -> n3
# n3

t.test(mpg ~ vs, data = mtcars) # p-value = 0.0001098
## 
##  Welch Two Sample t-test
## 
## data:  mpg by vs
## t = -4.6671, df = 22.716, p-value = 0.0001098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.462508  -4.418445
## sample estimates:
## mean in group 0 mean in group 1 
##        16.61667        24.55714
t.test(mpg ~ am, data = mtcars) # p-value = 0.001374
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

So when we relax our assumptions, we become less certain of the outcome in some sense. If our alpha level were 0.1% (rarely used, except when doing multiple testing or comparisons, or perhaps in pharmaceutical experiments), we would reject the null hypothesis using the Student t-test but not with the other tests.

If you know that the data comes from a normal distribution and the two groups have equal variance, using the Student t-test is the correct thing to do. People often tend to assume that the data comes from a normal distribution, but if you want to be conservative, the Mann-Whitney test is more appropriate.

6.3 Simulation Based Tests

Student T-Test

One Sample T-Test

In this section, we’ll walk through conducting t-tests and their randomization-based analogues using the infer package. We’ll start out with a 1-sample t-test, which compares a sample mean to a hypothesized true mean value.

The 1-sample t-test can be used to test whether a sample of continuous data could have plausibly come from a population with a specified mean.

As an example, we’ll test whether the mpg values in mtcars could plausibly have come from a population with a true mean of 21 miles per gallon. (The infer documentation walks through the same workflow using the hours variable from its gss dataset.)

We have already looked at the distribution of mpg earlier in the book: most cars fall somewhere around 20 miles per gallon, with quite a bit of variability. Let’s test whether we have evidence that the true mean mpg differs from 21.

infer’s randomization-based analogue to the 1-sample t-test is a 1-sample mean test. We’ll start off showcasing that test before demonstrating how to carry out a theory-based t-test with the package.

First, to calculate the observed statistic, we can use specify() and calculate().

# calculate the observed statistic
observed_statistic <- mtcars %>%
  specify(response = mpg) %>%
  calculate(stat = "mean")
# this is the same as mean(mtcars$mpg)

The observed statistic is about 20.09. Now, we want to compare this statistic to a null distribution, generated under the assumption that the mean is actually 21, to get a sense of how likely it would be for us to see this observed mean if the true mean mpg in the population were really 21.

We can generate the null distribution using the bootstrap. In the bootstrap, for each replicate, a sample of size equal to the input sample size is drawn (with replacement) from the input sample data. This allows us to get a sense of how much variability we’d expect to see in the entire population so that we can then understand how unlikely our sample mean would be.

# generate the null distribution
null_dist_1_sample <- mtcars %>%
  specify(response = mpg) %>%
  hypothesize(null = "point", mu = 21) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
null_dist_1_sample

To get a sense for what these distributions look like, and where our observed statistic falls, we can use visualize():

  # visualize the null distribution and test statistic!
null_dist_1_sample %>%
  visualize() + 
  shade_p_value(observed_statistic,
                direction = "two-sided")

The visualization shows how far our observed mean of about 20.09 falls from the hypothesized mean of 21 relative to the spread of the null distribution. More exactly, we can calculate the p-value:

# calculate the p value from the test statistic and null distribution
p_value_1_sample <- null_dist_1_sample %>%
  get_p_value(obs_stat = observed_statistic,
              direction = "two-sided")

p_value_1_sample

Thus, the p-value approximates the probability that we would see a sample mean at least as far from 21 as our observed mean of about 20.09 if the true mean mpg were really 21. Because the null distribution is generated by resampling, the exact value will vary a little from run to run, but here it comes out well above the usual 0.05 cutoff, so we do not reject the null hypothesis.

Analogously to the steps shown above, the package supplies a wrapper function, t_test, to carry out 1-sample t-tests on tidy data. Rather than using randomization, the wrappers carry out the theory-based t-test. The syntax looks like this:

t_test(mtcars, response = mpg, mu = 21)
Two Sample T-Test

Mann Whitney Test

Welch Test

To compare outcomes in experiments, we often use Student’s t-test. It assumes that the data are randomly selected from the population and either arrive in large samples (more than about 30) or are normally distributed with equal variances between groups. If we do not happen to meet these assumptions, we may use a simulation-based test instead. Here, we introduce the permutation test. Rather than assuming an underlying distribution, the permutation test builds its own distribution by breaking up the associations between or among groups. Often we are interested in the difference in means or medians between the groups, and the null hypothesis is that there is no difference. We can then ask: out of all the possible permutations, how extreme does our observed data look? All possible permutations together represent a theoretical distribution. In practice, there is no need to perform every permutation to build that distribution; instead we run a reasonable number of simulations, usually 10,000 or 100,000, to take a sample from it.

6.4 Permutation Tests

Student T-Test

One Sample T-Test
Two Sample T-Test

Two-sample t-tests evaluate the difference in the mean values of two populations, using data randomly sampled from populations that approximately follow a normal distribution. As an example, we’ll test whether cars with automatic transmissions and cars with manual transmissions have the same average fuel efficiency, using the mpg and am variables in mtcars.

The boxplots below suggest that manual-transmission cars (am = 1) tend to have noticeably higher mpg than automatic-transmission cars (am = 0).

If we were working with survey data such as the gss, we would also want to check for missing values and think about whether missingness is related to the values themselves; mtcars has no missing values, so we can proceed.

infer’s randomization-based analogue to the 2-sample t-test is a difference in means test. We’ll start off showcasing that test before demonstrating how to carry out a theory-based t-test with the package.

As with the one-sample test, to calculate the observed difference in means, we can use specify() and calculate().

ggplot(mtcars, aes(x = as_factor(am), y = mpg, fill = as_factor(am))) + 
  geom_boxplot() +
  labs(title = "Mtcars", subtitle = "Boxplot of mpg by transmission (am)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_manual(values=met.brewer("Hiroshige", 2)) +
  theme(legend.position = 'none')

mtcars <- mtcars %>%
  mutate(am = as_factor(am))

observed_statistic <- mtcars %>%
  specify(mpg ~ am) %>%
  calculate(stat = "diff in means", order = c("1", "0"))

observed_statistic

Note that, in the line specify(mpg ~ am), we could have swapped this out for the syntax specify(response = mpg, explanatory = am)!

The order argument in that calculate line gives the order in which to subtract the mean values: in our case, we’re taking the mean mpg of the manual-transmission cars (am = 1) minus the mean mpg of the automatic-transmission cars (am = 0); a positive difference, then, would mean that manual cars get better mileage than automatic cars.

Now, we want to compare this difference in means to a null distribution, generated under the assumption that mpg has no relationship with transmission type, to get a sense of how likely it would be for us to see this observed difference in means if there were really no relationship between these two variables.

We can generate the null distribution using permutation, where, for each replicate, each transmission value will be randomly reassigned (without replacement) to a new mpg value in the sample in order to break any association between the two.

null_dist_2_sample <- mtcars %>%
  specify(mpg ~ am) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("1", "0"))
null_dist_2_sample

To get a sense for what these distributions look like, and where our observed statistic falls, we can use visualize().

# visualize the randomization-based null distribution and test statistic!
null_dist_2_sample %>%
  visualize() + 
  shade_p_value(observed_statistic,
                direction = "two-sided")

It looks like our observed difference in means of about 7.24 would be very unlikely if there were truly no relationship between transmission type and mpg. More exactly, we can calculate the p-value, using the randomization-based null distribution.

# calculate the p value from the randomization-based null 
# distribution and the observed statistic
p_value_2_sample <- null_dist_2_sample %>%
  get_p_value(obs_stat = observed_statistic,
              direction = "two-sided")

p_value_2_sample

Thus, if there were really no relationship between transmission type and mpg, the probability that we would see a difference in means as or more extreme than about 7.24 is very small; with 1,000 permutations, the estimated p-value typically comes out at or near zero.

Note that, similarly to the steps shown above, the package supplies a wrapper function, t_test, to carry out 2-sample t-tests on tidy data. The syntax looks like this:

t_test(x = mtcars, 
       formula = mpg ~ am, 
       order = c("1", "0"),
       alternative = "two-sided")

In the above example, we specified the relationship with the syntax formula = mpg ~ am; we could have also written response = mpg, explanatory = am.

An alternative approach to the t_test() wrapper is to calculate the observed statistic with an infer pipeline and then supply it to the pt() function from base R. The infer documentation illustrates this with its gss dataset, switching out the stat = “diff in means” argument for stat = “t”.

# calculate the observed statistic
observed_statistic <- gss %>%
  specify(hours ~ college) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t", order = c("degree", "no degree")) %>%
  dplyr::pull()

Note that this pipeline to calculate an observed statistic includes hypothesize() since the t statistic requires a hypothesized mean value.
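As a sketch of that last conversion step, we can take the Welch statistic and degrees of freedom reported earlier for mpg by am (t = -3.7671, df = 18.332) and turn them into a two-sided p-value with pt():

2 * pt(abs(-3.7671), df = 18.332, lower.tail = FALSE)   # approximately 0.0014, matching the t.test() output above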

Mann Whitney Test

Welch Test

A permutation test (also called a re-randomization test) is an exact test, a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under all possible rearrangements of the observed data points. Permutation tests are, therefore, a form of resampling. In other words, the method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved from the works of Ronald Fisher and E. J. G. Pitman in the 1930s.
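To make the idea concrete, here is a minimal hand-rolled permutation test for the difference in mean mpg between transmission types, written without infer; the object names and the seed are ours.

set.seed(2023)                                       # seed chosen only for reproducibility
am_numeric <- as.numeric(as.character(mtcars$am))    # works whether am is numeric or a factor
observed_diff <- mean(mtcars$mpg[am_numeric == 1]) - mean(mtcars$mpg[am_numeric == 0])
perm_diffs <- replicate(10000, {
  shuffled <- sample(am_numeric)                     # shuffle the group labels to break the association
  mean(mtcars$mpg[shuffled == 1]) - mean(mtcars$mpg[shuffled == 0])
})
mean(abs(perm_diffs) >= abs(observed_diff))          # two-sided permutation p-value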

What is a standard normal distribution? The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one. If a random variable X follows the standard normal distribution, its expected value is 0, and its square, \(Q = X^2\), follows a chi-squared distribution with one degree of freedom.

\(V = X_1^2 + X_2^2 + \cdots + X_m^2 \sim \chi^2(m)\)

The p-value here is the probability of getting a value of the chi-square statistic larger than the one we observed.

The distribution of the chi-square statistic is called the chi-square distribution. The chi-square distribution is defined by the following probability density function:

\(Y = Y_0 \cdot (\chi^2)^{(v/2 - 1)} \cdot e^{-\chi^2/2}\)

where \(Y_0\) is a constant that depends on the number of degrees of freedom, \(\chi^2\) is the chi-square statistic, \(v = n - 1\) is the number of degrees of freedom, and \(e\) is the base of the natural logarithm (approximately 2.71828). \(Y_0\) is defined so that the area under the chi-square curve is equal to one.

The chi-square distribution has the following properties:

  • The mean of the distribution is equal to the number of degrees of freedom: \(\mu = v\).
  • The variance is equal to two times the number of degrees of freedom: \(\sigma^2 = 2v\).
  • When the degrees of freedom are greater than or equal to 2, the maximum value of \(Y\) occurs when \(\chi^2 = v - 2\).
  • As the degrees of freedom increase, the chi-square curve approaches a normal distribution.

What is a chi-squared distribution? In probability theory and statistics, the chi-squared distribution (also chi-square or χ2-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
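We can sketch this definition with a quick simulation: sums of squares of k independent standard normal draws should have a mean close to k and a variance close to 2k. The object names and the seed below are ours.

set.seed(2023)                            # seed chosen only for reproducibility
z <- matrix(rnorm(10000 * 4), ncol = 4)   # 10,000 sets of 4 standard normal draws
v <- rowSums(z^2)                         # each row sum follows a chi-squared distribution with 4 df
c(mean(v), var(v))                        # should come out close to 4 and 8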

# 1 No significant association between engine type (vs) and transmission (am)
# 2 No significant association between gear and carburetors (carb) at the 0.05 level
# 3 Significant association between cylinders (cyl) and gear

attach(mtcars)

vsAm <- table(vs, am)  
vsAm
##    am
## vs   0  1
##   0 12  6
##   1  7  7
cylGear <- table(cyl,gear)
cylGear
##    gear
## cyl  3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
gearCarb <- table(gear,carb)
gearCarb
##     carb
## gear 1 2 3 4 6 8
##    3 3 4 3 5 0 0
##    4 4 4 0 4 0 0
##    5 0 2 0 1 1 1
chisq.test(vsAm) 
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  vsAm
## X-squared = 0.34754, df = 1, p-value = 0.5555
# The p-value of the above test is greater than 0.05, which is consistent with vs and am being independent, so we do not reject the null hypothesis.

chisq.test(gearCarb) 
## 
##  Pearson's Chi-squared test
## 
## data:  gearCarb
## X-squared = 16.518, df = 10, p-value = 0.08573
# The p-value of the above test is also greater than 0.05, so we do not reject the null hypothesis; gear and carb appear to be independent of each other.

chisq.test(cylGear) 
## 
##  Pearson's Chi-squared test
## 
## data:  cylGear
## X-squared = 18.036, df = 4, p-value = 0.001214
# The p-value of the above test is less than 0.05, so we reject the null hypothesis and conclude that cyl and gear are not independent of each other.



########################
# note: as.factor() (base R) sorts the levels, while as_factor() (forcats) keeps
# them in the order they first appear; see the short illustration below

mtcars$vs <- as_factor(mtcars$vs)
mtcars$am <- as_factor(mtcars$am)

mtcars$cyl <- as_factor(mtcars$cyl)
mtcars$gear <- as_factor(mtcars$gear)
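A brief illustration of the comment above about as.factor() versus as_factor(): base R’s as.factor() sorts character levels alphabetically, while forcats::as_factor() keeps them in their order of first appearance. The example vector here is our own.

x <- c("8", "4", "6")
levels(as.factor(x))            # "4" "6" "8", sorted alphabetically
levels(forcats::as_factor(x))   # "8" "4" "6", order of first appearance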

# calculate the observed statistic

observed_indep_statistic <- mtcars %>%
  specify(response = cyl, explanatory = gear) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")
observed_indep_statistic
# generate the null distribution using randomization

null_dist_sim <- mtcars %>%
  specify(cyl ~ gear) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")

# generate the null distribution by theoretical approximation

# null_dist_theory <- mtcars %>%
#   specify(cyl ~ gear) %>%
#   assume(distribution = "Chisq")

# visualize the null distribution and test statistic!

null_dist_sim %>%
  visualize() + 
  shade_p_value(observed_indep_statistic,
                direction = "greater")

# visualize the theoretical null distribution and test statistic!
# mtcars %>%
#   specify(cyl ~ gear) %>%
#   assume(distribution = "Chisq") %>%
#   visualize() + 
#   shade_p_value(observed_indep_statistic,
#                 direction = "greater")

# visualize both null distributions and the test statistic!
null_dist_sim %>%
  visualize(method = "both") + 
  shade_p_value(observed_indep_statistic,
                direction = "greater")

# calculate the p value from the observed statistic and null distribution
p_value_independence <- null_dist_sim %>%
  get_p_value(obs_stat = observed_indep_statistic,
              direction = "greater")

p_value_independence
# the degrees of freedom for a 3 x 3 table are (3 - 1) * (3 - 1) = 4
pchisq(observed_indep_statistic$stat, df = 4, lower.tail = FALSE)
# this matches the p-value reported by chisq.test(cylGear) above, roughly 0.0012
# https://infer.netlify.app/reference/assume.html

7 Probability

8 Linear Algebra

Before considering a section on linear regression, it’s useful to first review concepts from linear algebra. Remember that the heart of linear algebra is solving a system of linear equations. We will see in the section on linear regression how linear algebra is important: the lm function in R uses linear algebra in order to find the slope and intercept of the line of best fit.

First, we will start with building a straight line between two points. Every straight line is entirely defined by exactly two parameters: the slope and the intercept. We may remember the following equation from geometry or algebra:

\[ \begin{aligned} y = mx + c \end{aligned} \]

This is sometimes called the slope-intercept form of a straight line, and it represents a general equation for a straight line where m is the steepness or gradient of the line, and c is the y-intercept, meaning the point where the line crosses the y-axis.

In the following visualization, we have kept only two observations from our dataset: We have kept the information for the Ford Pantera L and the Volvo 142E. For this exercise, we are keeping only two observations in order to illustrate how to draw a line between only two and exactly two points.

mtcars %>%
  filter(car == "Ford Pantera L" | car == "Volvo 142E") %>%
  ggplot(aes(x = wt, y = mpg)) + 
  labs(title = "Mtcars", subtitle = "y = mx + c") +
  geom_point(alpha = 0.3) +
  theme(plot.title = element_text(face = "bold")) +
  geom_text(aes(label = car), size = 2.5, vjust = 1.5) +
  geom_abline(intercept = 61.32, slope = -14.36, linetype = 'dashed', color = '#376795') 

How do we find this line? To find the slope of the line we use this equation, where, on a plane, each point has an x and y coordinate. We look at the two points and subtract the y values and divide by the difference of the x values. Rise over run.

\[ \begin{aligned} m = \frac{(y2 - y1)}{(x2 - x1)} \end{aligned} \]

slope <- (21.4 - 15.8) / (2.78 - 3.17)

Now, to find the intercept, we take either one of the original points, plug it into y = mx + c along with the slope we just found, and we are left with an equation with only one unknown variable.

\[ \begin{aligned} y = \frac{(y2 - y1)}{(x2 - x1)} x + c \end{aligned} \]

# plug one of the two points into y = mx + c and solve for c
intercept <- 21.4 - slope * 2.78   # using the Volvo 142E point (wt = 2.78, mpg = 21.4)
intercept

Geometry of Linear Equations

Let’s try something else. Let’s say, for example, that we only had two cars in our dataset: the Volvo 142E and the Ferrari Dino. This time, however, we run an experiment by adding weight to each car and measuring the mpg at a second point in time. We are asking the question: at what point might the Volvo and the Ferrari have the same wt and the same mpg? We can use linear algebra to answer this question by creating a system of linear equations. What we have are two lines, and we want to see the point at which they intersect.

By using the above equations we are able to find the slope and intercept of both lines. Now we have two separate lines. Our goal here is to solve these two linear equations by finding the x and y value that works for both equations. This is a good place to start because linear combination is the most fundamental operation of linear algebra.

  • y = -14.36x + 61.32
  • y = -0.4651x + 20.9884

Our first step is to solve the first equation for x.

  • x = 61.32 / 14.36 - y / 14.36

Our next step is to plug our new x into the second equation.

  • 0.4651 * (4.270195 - y/14.36) + y = 20.9884

This gives us an answer for y.

  • y = 19.638283

Our next step is to plug that y back into our first equation.

  • 19.638283 = -14.36x + 61.32
  • 14.36x = 61.32 - 19.638283
  • x = 2.902691 (approximately; small differences come from rounding the slope and intercept)

Let’s look at our original equations and see if this works (the small discrepancies come from rounding; a solve() check follows below).

  • -14.36 * 2.902691 + 61.32 ≈ 19.64
  • -0.4651 * 2.902691 + 20.9884 ≈ 19.64
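We can check this arithmetic with base R’s solve() by writing both equations in the form Ax = b; the object names here are ours.

# rewrite y = -14.36x + 61.32 as 14.36x + y = 61.32, and likewise for the second line
A <- rbind(c(14.36,  1),
           c(0.4651, 1))
b <- c(61.32, 20.9884)
solve(A, b)   # x (wt) is about 2.90 and y (mpg) is about 19.64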

This time we can draw the lines with the stat_function layer. stat_function is not used to solve equations; it simply draws a function.

# (-14.36 * 2.902691) + 61.32     # approximately 19.64
# (-0.4651 * 2.902691) + 20.9884  # approximately 19.64

ggplot(mtcars44, aes(x = wt, y = mpg)) + 
  labs(title = "Mtcars", subtitle = "Linear Algebra") +
  geom_point(alpha = 0.3) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  # geom_abline(intercept = 36.72, slope = -6, linetype = 'dashed', color = '#376795') + 
  geom_text(aes(label = car), size = 2.5, vjust = 1.5) +
    geom_abline(intercept = 61.32, slope = -14.36, linetype = 'dashed', color = '#376795') +
  geom_abline(intercept = 20.9884, slope = -0.4651, linetype = 'dashed', color = '#376795') +
  geom_point(aes(x = 2.902691, y = 19.638283)) +
  stat_function(fun = function(x) { (61.32) - 14.36 * x }, color = 'orange') +
  stat_function(fun = function(x) { (20.9884) - 0.4651 * x }, color = 'orange')

We can also use an R package that will solve these linear equations for us.

# install.packages('PlaneGeometry')
library(PlaneGeometry)

line1 <- Line$new(A = c(3.845,19.2), B = c(2.770,19.7))
line2 <- Line$new(A = c(3.170,15.8), B = c(2.780,21.4))
intersectionLineLine(line1, line2)
## [1]  2.902691 19.638283

What does this tell us? This tells us the point at which both cars might have the same wt and same mpg.

Now let’s try to write these equations in matrix notation. We look at these two equations but rearrange the terms so that we have a left-hand side and a right-hand side.

However, there’s another way of looking at this. Let’s think about the geometry of linear equations.

Let’s consider these two linear equations and write them in matrix form like this:

  • 2x - y = 0
  • -x + 2y = 3

$$

\[\begin{aligned} \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix} \end{aligned}\]

$$

Our linear equation is solved at the point that these two lines cross.

ggplot(mtcars, aes()) + 
  labs(title = "Mtcars", subtitle = "Linear Algebra") +
  theme(plot.title = element_text(face = "bold")) +
  # geom_abline(intercept = 36.72, slope = -6, linetype = 'dashed', color = '#376795') + 
  stat_function(fun = function(x) { 2 * x }, color = 'orange') +
  stat_function(fun = function(x) { (3/2) + (x / 2)}, color = 'orange') +
  xlim(-3, 3) + ylim(-6, 6)

We can also consider the equation like this:

$$

\[\begin{aligned} x \begin{bmatrix} 2 \\ -1 \end{bmatrix} y \begin{bmatrix} -1 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix} \end{aligned}\]

$$
When we consider these as two vectors from the origin, there is a way of combining these vectors to get to that point of intersection we saw before. One of the first vector and two of the second will get us to our answer of 0, 3.

ggplot(mtcars, aes()) + 
  labs(title = "Mtcars", subtitle = "Linear Algebra") +
  theme(plot.title = element_text(face = "bold")) +
  # geom_abline(intercept = 36.72, slope = -6, linetype = 'dashed', color = '#376795') + 
  # geom_segment(aes(x = 0, xend = -1, y = 0, yend = 2), alpha = .2) +
  geom_segment(aes(x = 0, xend = 2, y = 0, yend = -1), alpha = .2) +
  geom_segment(aes(x = 2, xend = 1, y = -1, yend = 1, color = 'blue'), alpha = .2) +
  geom_segment(aes(x = 1, xend = 0, y = 1, yend = 3), alpha = .2, color = 'blue') +
  xlim(-3, 3) + ylim(-6, 6)

$$

\[\begin{aligned} Ax = b \end{aligned}\]

$$

Elimination with Matrices

What if we are solving three linear equations? Linear algebra is helpful here.

  • x + 2y + z = 2
  • 3x + 8y + z = 12
  • 4y + z = 2

Instead of plotting two lines and finding the intersection, we could create a graph of three planes. Three planes would intersect in a point. (Two planes would intersect in a line.)

Let’s write our equations as a matrix.

Let’s start with a matrix containing only the left-hand side and perform elimination. The first nonzero entry used in each row during elimination is called the pivot.

We can label the starting matrix A and the finished upper triangular matrix U; the elimination steps, which we will later write as matrices E, take us from A to U.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 2 & 1 \\ 3 & 8 & 1 \\ 0 & 4 & 1 \end{bmatrix} -> \begin{bmatrix} 1 & 2 & 1 \\ 0 & 2 & -2 \\ 0 & 4 & 1 \end{bmatrix} -> \begin{bmatrix} 1 & 2 & 1 \\ 0 & 2 & -2 \\ 0 & 0 & 5 \end{bmatrix} \end{aligned}\]

$$

Let’s include the right-hand side. We will call this an augmented matrix.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 2 & 1 & 2\\ 3 & 8 & 1 & 12 \\ 0 & 4 & 1 & 2 \end{bmatrix} -> \begin{bmatrix} 1 & 2 & 1 & 2\\ 0 & 2 & -2 & 6\\ 0 & 4 & 1 & 2 \end{bmatrix} -> \begin{bmatrix} 1 & 2 & 1 & 2 \\ 0 & 2 & -2 & 6\\ 0 & 0 & 5 & -10 \end{bmatrix} \end{aligned}\]

$$
Now we have a different set of equations that we can solve.

  • x + 2y + z = 2
  • 2y - 2z = 6
  • 5z = -10

We can see that z = -2 and, moving upwards, through a process called back substitution, we find that y = 1 and finally, x = 2.
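We can confirm the back substitution with solve() on the original system; the object names are ours.

A <- rbind(c(1, 2, 1),
           c(3, 8, 1),
           c(0, 4, 1))
b <- c(2, 12, 2)
solve(A, b)   # x = 2, y = 1, z = -2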

Multiplication and Inverse Matrices

Not all matrices have an inverse.

If A^-1 (A inverse) exists, how do we find it? It is also central to know whether the inverse exists at all.

$$

\[\begin{aligned} \begin{bmatrix} & & \\ - & - & - \\ & & \end{bmatrix} \begin{bmatrix} & - & \\ & - & \\ & - & \end{bmatrix} = \begin{bmatrix} & & \\ & - & \\ & & \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} & & \\ - & - & -\\ & & \end{bmatrix} \begin{bmatrix} - & - & - \\ - & - & - \\ - & - & - \end{bmatrix} = \begin{bmatrix} & & \\ - & - & - \\ & & \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} - & & \\ - & & \\ - & & \end{bmatrix} \begin{bmatrix} - & - & - \\ - & - & - \\ - & - & - \end{bmatrix} = \begin{bmatrix} - & & \\ - & & \\ - & & \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix} \begin{bmatrix} 1 & 6 \\ \end{bmatrix} = \begin{bmatrix} 2 & 12 & \\ 3 & 18 & \\ 4 & 24 \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} 2 & 7 \\ 3 & 8 \\ 4 & 9 \end{bmatrix} \begin{bmatrix} 1 & 6 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix} \begin{bmatrix} 1 & 6 \end{bmatrix} + \begin{bmatrix} 7 \\ 8 \\ 9 \end{bmatrix} \begin{bmatrix} 0 & 0 \end{bmatrix} \end{aligned}\]

$$

Gauss-Jordan

The Gauss-Jordan method allows us to solve two systems of equations at once.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 3 \\ 2 & 7 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} =\begin{bmatrix} 1 \\ 0 \end{bmatrix} \end{aligned}\]

$$

Let’s deal with the right-hand sides by sticking them on as extra columns. Taken together, the right-hand sides of the two systems form an identity matrix augmented onto A.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 3 & 1 & 0 \\ 2 & 7 & 0 & 1 \end{bmatrix} \end{aligned}\]

$$
Now when you do elimination steps to turn the left side into the identity the right side will turn into the inverse.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 3 & 1 & 0 \\ 2 & 7 & 0 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 3 & 1 & 0 \\ 0 & 1 & -2 & 1 \end{bmatrix} \end{aligned}\]

$$

Now we keep going and do elimination upwards: subtract three times the second row from the first row to make the left side the identity; the right side then becomes the inverse.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 3 & 1 & 0 \\ 2 & 7 & 0 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 3 & 1 & 0 \\ 0 & 1 & -2 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 7 & -3 \\ 0 & 1 & -2 & 1 \end{bmatrix} \end{aligned}\]

$$

In this process, elimination took us from the augmented matrix [ A | I ] to [ I | A^-1 ]; in other words, it produced the inverse.

$$

\[\begin{aligned} \begin{bmatrix} A & I \end{bmatrix} \rightarrow \begin{bmatrix} I & A^{-1} \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} 1 & 3 \\ 2 & 7 \end{bmatrix} \begin{bmatrix} c \\ d \end{bmatrix} =\begin{bmatrix} 0 \\ 1 \end{bmatrix} \end{aligned}\]

$$
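We can confirm the inverse we just found by Gauss-Jordan elimination using base R’s solve(); the object name is ours.

A <- rbind(c(1, 3),
           c(2, 7))
solve(A)         # the inverse: first column (7, -2), second column (-3, 1)
A %*% solve(A)   # multiplying the two recovers the identity matrix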

Factorization into A = LU

Let’s look at this simple 2x2 matrix.

$$

\[\begin{aligned} \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix} \end{aligned}\]

$$
Now how do we get the upper triangular / upper echelon form of this matrix? We can create our upper triangular matrix in this way.

$$

\[\begin{aligned} \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix} \rightarrow \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \end{aligned}\]

$$

However, this is not yet a complete equation: we have to make the left and right sides equal. We do that by multiplying by an elimination matrix. Here is the correct elimination matrix.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 0 \\ -4 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \end{aligned}\]

$$

We label our elimination matrix E21 because it creates a zero in the second row and first column of the resulting upper triangular matrix.

$$

\[\begin{aligned} E_{21}A = U \end{aligned}\]

$$

For LU decomposition, we rearrange this relationship into the following form.

$$

\[\begin{aligned} A = LU \end{aligned}\]

$$

To know what L is in this case we have to know how L is related to E. L is the inverse of E. It has a 4 and not a -4 because it adds back what the E removed. U stands for upper triangular and L stands for lower triangular. The U has the pivots on the diagonal and the L has 1s on the diagonal.

$$

\[\begin{aligned} \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \end{aligned}\]

$$
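A quick check in R that multiplying L and U recovers the original matrix; the object names are ours.

L <- rbind(c(1, 0),
           c(4, 1))
U <- rbind(c(2, 1),
           c(0, 3))
L %*% U   # gives back the matrix with rows (2, 1) and (8, 7)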

Now we move up to a 3x3 matrix. With a 3x3 matrix the elimination steps are a little more involved.

$$

\[\begin{aligned} E_{32}E_{31}E_{21}A = U \end{aligned}\]

$$

To undo these elimination matrices, we take the inverse of each one and apply them in the opposite order.

$$

\[\begin{aligned} A = E_{21}^{-1}E_{31}^{-1}E_{32}^{-1}U \end{aligned}\]

$$

This is equivalent to LU because L is the product of those inverses.

$$

\[\begin{aligned} A = LU \end{aligned}\]

$$

The entries of L below the diagonal are the multipliers used during elimination.

Transposes, Permutations, Spaces R^n

Column Space and Nullspace

Solving Ax = 0: Pivot Variables, Special Solutions

Solving Ax = b: Row Reduced Form R

Independence, Basis, and Dimension

Vectors v1 through vl span a space when the space consists of all linear combinations of those vectors. If we let S be the space they span, then S contains all of their combinations, and S is the smallest space containing those vectors, because any space that contains those vectors must also contain all of their combinations. In other words, to span a space is to take all the linear combinations of the vectors and collect them into a space.

Given all of the columns of a matrix, we can ask whether they are independent; maybe yes, maybe no. We are especially interested in finding a set of vectors that both spans the space and is independent.

The columns of a matrix span the column space of that matrix.

A space can have many bases.

There are many bases, but they all have the same number of vectors. If we are talking about R3, that number of vectors is 3; we call this number the dimension of the space.

The Four Fundamental Subspaces

Matrix Spaces; Rank 1; Small World Graphs

Graphs, Networks, Incidence Matrices

Orthogonal Vectors and Subspaces

Projections onto Subspaces

Projection Matrices and Least Squares

Orthogonal Matrices and Gram-Schmidt

QR Decomposition

$$

\[\begin{aligned} y = C + Dt \end{aligned}\]

$$

Suppose we have three data points, (1, 1), (2, 2), and (3, 2), and we want to fit the line y = C + Dt through them. If we wrote the equation of the line passing through just the first point, it would be C + D = 1. Here are the equations for all three points:

  • C + D = 1
  • C + 2D = 2
  • C + 3D = 2

These three equations don’t have a solution. There’s no line that goes through all these points.

$$

\[\begin{aligned} Ax = b \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} C \\ D \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} A^TA\hat{x} = A^Tb \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 & |1 \\ 1 & 2 & |2 \\ 1 & 3 & |2 \end{bmatrix} = \begin{bmatrix} 3 & 6 & | 5 \\ 6 & 14 & | 11 \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} 3C + 6D = 5 \end{aligned}\]

$$

$$

\[\begin{aligned} 6C + 14D = 11 \end{aligned}\]

$$

$$

\[\begin{aligned} 2D = 1 \end{aligned}\]

$$

Solving by elimination and back substitution gives:

$$

\[\begin{aligned} D = 1/2 \end{aligned}\]

$$

$$

\[\begin{aligned} C = 2/3 \end{aligned}\]

$$


$$

\[\begin{aligned} y = C + Dt \end{aligned}\]

$$

Now let’s repeat the process with three cars from mtcars, using the wt and mpg values for the Ford Pantera L, the Ferrari Dino, and the Volvo 142E. If we wrote the equation of the line passing through just the first point, it would be C + 3.17D = 15.8. Here are the equations for all three points:

  • C + 3.17D = 15.8
  • C + 2.77D = 19.7
  • C + 2.78D = 21.4

These three equations don’t have a solution. There’s no line that goes through all these points.

$$

\[\begin{aligned} Ax = b \end{aligned}\]

$$

$$

\[\begin{aligned} \begin{bmatrix} 1 & 3.17 \\ 1 & 2.77 \\ 1 & 2.78 \end{bmatrix} \begin{bmatrix} C \\ D \end{bmatrix} = \begin{bmatrix} 15.8 \\ 19.7 \\ 21.4 \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} A^TA\hat{x} = A^Tb \end{aligned}\]

$$

To form \(A^{\top}A\), we need the sum of the weights, 3.17 + 2.77 + 2.78 = 8.72, and the sum of the squared weights, 10.0489 + 7.6729 + 7.7284 = 25.4502.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 3.17 & 2.77 & 2.78 \end{bmatrix} \begin{bmatrix} 1 & 3.17 \\ 1 & 2.77 \\ 1 & 2.78 \end{bmatrix} = \begin{bmatrix} 3 & 8.72 \\ 8.72 & 25.4502 \end{bmatrix} \end{aligned}\]

$$

To form \(A^{\top}b\), we need the sum of the mpg values, 15.8 + 19.7 + 21.4 = 56.9, and the sum of the products of weight and mpg, 50.086 + 54.569 + 59.492 = 164.147.

$$

\[\begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 3.17 & 2.77 & 2.78 \end{bmatrix} \begin{bmatrix} 1 & 3.17 & | 15.8\\ 1 & 2.77 & | 19.7 \\ 1 & 2.78 & | 21.4 \end{bmatrix} = \begin{bmatrix} 3 & 8.72 & | 56.9 \\ 8.72 & 25.4502 & | 164.147 \end{bmatrix} \end{aligned}\]

$$

$$

\[\begin{aligned} 3C + 8.72D = 56.9 \end{aligned}\]

$$

$$

\[\begin{aligned} 8.72C + 25.4502D = 164.147 \end{aligned}\]

$$
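We can finish the calculation in R by solving the normal equations directly and checking the answer against lm(); the three (wt, mpg) pairs come from the worked example above, and the object names are ours.

wt_three  <- c(3.17, 2.77, 2.78)
mpg_three <- c(15.8, 19.7, 21.4)

A <- cbind(1, wt_three)            # design matrix: a column of ones for C, the weights for D
b <- mpg_three

solve(t(A) %*% A, t(A) %*% b)      # solves (A^T A) x = A^T b for C and D
coef(lm(mpg_three ~ wt_three))     # lm() finds the same intercept (C) and slope (D)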

QR Decomposition

What is really going on, in mathematical terms, when we create an ordinary least squares regression estimate? The ordinary least squares regression line can be calculated exactly with matrix algebra, or it can be approximated using an iterative technique. The following formula is an equation for finding the least squares estimator using notation familiar to someone versed in linear algebra.

\[ \begin{aligned} \hat{{\beta}}=\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{y} \end{aligned} \]

This is equal to

\[ \begin{aligned} X^{\top}X\hat{\beta} = X^{\top}y \end{aligned} \]

which, after substituting the QR decomposition X = QR, is equal to

\[ \begin{aligned} R^{\top}Q^{\top}QR\hat{\beta} = R^{\top}Q^{\top}y \end{aligned} \]

Since \(Q^{\top}Q = I\) and \(R\) is invertible, this reduces to

\[ \begin{aligned} R\hat{\beta} = Q^{\top}y \end{aligned} \]

which can be solved quickly by back substitution because R is upper triangular.
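As a small sketch (with our own object names), we can compute the closed-form estimate, the QR-based estimate, and the lm() coefficients for mpg ~ wt and see that they agree:

X <- cbind(1, mtcars$wt)              # design matrix: an intercept column plus wt
y <- mtcars$mpg

solve(t(X) %*% X) %*% t(X) %*% y      # closed-form normal equations solution
qr.solve(X, y)                        # least squares solution via a QR decomposition
coef(lm(mpg ~ wt, data = mtcars))     # lm() returns the same intercept and slope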

Understanding the math behind a linear regression in a general form requires some understanding of linear algebra.

9 Simple Linear Regression

We start our exploration of linear regression by looking at simple linear regression. We define simple linear regression as regression that takes exactly one independent variable, also called the explanatory variable, and exactly one dependent variable, also called the response variable. In simple linear regression, this dependent variable is continuous. Following convention, on a Cartesian coordinate plane we represent our independent variable on the x axis and our dependent variable on the y axis.

Linear regression can, of course, take more than one independent variable but we start with the simple case of using just one independent variable and one dependent variable in order to learn and build our intuition.

Before fitting any linear regression model, it’s often useful to first create a scatterplot of our variables without including the line of best fit. Creating a scatterplot without the line of best fit gives us an opportunity to test our intuition. We can see in the below visualizations that wt and mpg have an inverse relationship, meaning that as the weight (wt) of the car increases, the mpg tends to decrease, and vice versa. Maybe this conclusion is not surprising. But this does open the door to a variety of more nuanced questions. Some examples include:

  • How well would a linear model explain the variability that we would see in miles per gallon?
  • Is a linear model even a good fit, given the curvature, especially at the far end of the x-axis with heavier cars and lower miles per gallon?
  • Are there confounding variables, such as the size of the engine, referred to as displacement, independent of weight, that better predict the miles per gallon?

For experienced data analysts, it’s hard to see a scatterplot and not imagine a line going through it. By creating a scatterplot, we already start to imply a relationship. But which line or estimator should we use? The most common way to do simple linear regression is through ordinary least squares estimation, but other options exist: the following visualization shows four different estimators for the relationship of wt to mpg in the mtcars dataset. This is not an exhaustive list. Notice that the slope and intercept are different for each estimator.

  • Ordinary Least Squares (OLS)
  • Median Absolute Deviation (MAD)
  • Least Median Squares (LMS)
  • Theil-Sen

Slope and Intercept

We reviewed in the last chapter that a straight line is defined by exactly two parameters: the slope and intercept. The following equation shows the slope-intercept form of a straight line where m is the gradient of the line and c is the y-intercept.

\[ \begin{aligned} y = mx + c \end{aligned} \]

When we have more than two points, however, we need to come up with a different way of calculating our line of best fit. It is theoretically possible that the line of best fit will go through all the points but most likely it won’t.

As we can see in the above visualization, there are different kinds of estimators that we could use in order to find the line of best fit. Each estimator has its own strengths and weaknesses. Making an argument for any particular estimator will require looking at the particular features of a data set, and also incorporating domain knowledge.

Regardless of which estimator we end up using, our equation for the estimator is often written as the linear equation y' = b0 + b1x, where b0 is the y-intercept and b1 is the slope. The equation is written this way to distinguish it from the slope-intercept form of a straight line, which is more commonly used in geometry or algebra. In this regression equation, the left-hand side, y', means the predicted value of y, and b0 and b1 are beta terms referring to parameter estimates; b0 and b1 are often called the model coefficients.

The following visualization shows the conceptual difference between the slope-intercept form of the line, on the left, and the regression equation, on the right.

9.1 Building a Model

Ordinary Least Squares

We start with ordinary least squares regression, or OLS regression for short. We start with this estimator because it’s by far the most common estimator and, as we will see, the OLS estimator in particular has some nice properties, including the fact that the quantity being minimized, the sum of squared errors, is differentiable, and that there is a well-studied relationship with maximum likelihood estimation: when the errors are normally distributed, the OLS estimate is the maximum likelihood estimate.

When we create a visualization for a linear regression, we often include elements from descriptive or inferential statistics, such as standard error or labeling more extreme points. In the following four-paneled visualization, we can see the same relationship mpg ~ wt visualized in four different ways.

Closed Form Equations for Slope and Intercept

How do we find the slope and intercept of our simple ordinary least squares regression line? Under the hood, R calculates the slope and intercept by calling compiled FORTRAN linear algebra routines (LINPACK) that perform a QR decomposition, the approach previewed in the linear algebra chapter.

Before reviewing any of the linear algebra derivation, however, we can first look at shortcut equations to find the slope and intercept. As it turns out, in the case of simple linear regression for an ordinary least squares estimate, meaning regression with only one independent and one dependent variable, the slope and intercept follow neat closed-form equations: the slope can be calculated by multiplying the correlation r by the quotient of standard deviation of y over standard deviation of x. In the below equation, a refers to the slope and sy and sx refer to the standard deviation of y and the standard deviation of x, respectively.

\[ \begin{aligned} a = r * \frac{sy}{sx} \end{aligned} \]

The intercept of the line of best fit for ordinary least squares simple linear regression can be calculated easily after we calculate the slope of the line of best fit. We do this by multiplying the slope by the mean of x, then subtracting the result from the mean of y. In the equation below, i refers to the y-intercept and the straight line over the x and y values is a way of referring to the mean of x and y respectively; we refer to these terms as x-bar and y-bar.

\[ \begin{aligned} i = \bar{y} - r* \frac{sy}{sx} * \bar{x} \end{aligned} \]

The following code chunk shows how to calculate the slope and intercept using our mtcars dataset with 32 observations.

# slope: the correlation times the ratio of the standard deviations
r_slope <- ( sd(mtcars$mpg) / sd(mtcars$wt) ) * cor(mtcars$mpg, mtcars$wt)

# intercept: the mean of y minus the slope times the mean of x
r_intercept <- mean(mtcars$mpg) - r_slope * mean(mtcars$wt) 

We can see that the r-intercept, by which we mean the y-intercept for when x is zero, is 37.285. We also see that the r-slope is -5.344, meaning that the line moves downward 5.344 units for every one unit along the x-axis.
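
As a quick check on our arithmetic, we can compare these hand-calculated values against the coefficients reported by R itself; this is a minimal sketch, assuming the r_slope and r_intercept objects from the chunk above.

# the coefficients reported by lm(), in the order intercept then slope
coef(lm(mpg ~ wt, data = mtcars))

# the differences from our hand calculations should be effectively zero
coef(lm(mpg ~ wt, data = mtcars)) - c(r_intercept, r_slope)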

In the interest of being thorough, we can explore alternative ways of writing these equations. Remember that the standard deviation is the same as the square root of the variance, so instead of referring to the standard deviation of y and standard deviation of x, we could also refer to the square root of the variance of y and the square root of the variance of x. The variance, we remember, is the average squared deviation from the mean; this is calculated by taking each value in the distribution, subtracting the mean, squaring that difference, summing the squared differences, and dividing by the number of observations (or by n - 1 for the sample variance, as we discuss below).

In the above equation for the slope, a, we could also write the sy and sx in terms of the standard deviation and we could also write out the longer form equation for the Pearson correlation. We could then cross-multiply and simplify the equations by removing common terms and end up with the following set of equations.

\[ \begin{aligned} a = \frac{\sum{(x_i - \bar{x})(y_i-\bar{y})}}{\sum{(x_i-\bar{x})^2}} \end{aligned} \]

\[ \begin{aligned} i = \bar{y} - \frac{\sum{(x_i - \bar{x})(y_i-\bar{y})}}{\sum{(x_i-\bar{x})^2}} * \bar{x} \end{aligned} \]

The following code chunk shows the R code version of these longer equations for the r-slope and the r-intercept without relying much on the in-built functions in R. There is no practical reason in a situation like this for calculating either the variance or the correlation coefficient by hand. Using the in-built functions for standard deviation, variance, and correlation, sd(), var(), and cor(), respectively, is preferred in order to save time and prevent mistakes; being able to lean on functions like these is a lot of the reason why we use a programming language such as R in the first place. Otherwise, our code is unwieldy. The longer forms are included below simply to clarify the mathematical process and see what is happening under the hood.

r_slope <- 
  (sqrt(sum((mtcars$mpg - sum((mtcars$mpg) / length(mtcars$mpg))) ^ 2) 
  / (length(mtcars$mpg) - 1)) / sqrt(sum((mtcars$wt - mean(mtcars$wt)) ^ 2) / 
  (length(mtcars$wt) - 1)) ) * ((length(mtcars$mpg) * sum(mtcars$wt * mtcars$mpg)) 
  - (sum(mtcars$wt) * sum(mtcars$mpg))) / sqrt(((length(mtcars$mpg) * 
  sum(mtcars$wt ^ 2)) - (sum(mtcars$wt) ^ 2)) * 
    ((length(mtcars$mpg) * sum(mtcars$mpg ^ 2)) - (sum(mtcars$mpg) ^ 2))) 

r_intercept <- 
  (sum(mtcars$mpg) / length(mtcars$mpg)) - (sqrt(sum((mtcars$mpg - sum((mtcars$mpg) 
  / length(mtcars$mpg))) ^ 2) / (length(mtcars$mpg) - 1)) / 
    sqrt(sum((mtcars$wt - sum((mtcars$wt) / length(mtcars$wt))) ^ 2) / 
  (length(mtcars$wt) - 1)) ) * ((length(mtcars$mpg) * 
  sum(mtcars$wt * mtcars$mpg)) - (sum(mtcars$wt) * sum(mtcars$mpg))) / 
  sqrt(((length(mtcars$mpg) * sum(mtcars$wt ^ 2)) - 
  (sum(mtcars$wt) ^ 2)) * ((length(mtcars$mpg) * sum(mtcars$mpg ^ 2)) - 
  (sum(mtcars$mpg) ^ 2)))  * (sum(mtcars$wt) / length(mtcars$wt))

Notice in the above code that we use (length(mtcars$mpg) - 1) when calculating both the r_slope and the r_intercept. This is known as Bessel’s correction: dividing by n - 1 rather than n so that the sample variance is an unbiased estimate of the population variance. The sd() function in R uses Bessel’s correction by default, but the default standard deviation in Python’s NumPy does not. By deconstructing what is happening under the hood of the function we have learned an important difference between R and Python!

Now that we have calculated the slope and the intercept of our simple regression line using the ordinary least squares estimator, we can use the geom_abline() layer of ggplot() to add a line with the r-slope and r-intercept. These are the numbers we calculated by hand in our equations for r-slope and r-intercept. Later, we will see how we can obtain these numbers by looking at the summary of our linear model object, which comes in the form of a specialized list. We choose to specify the linetype argument as dashed. The commented-out part of the code is an alternate way of writing the geom_abline() layer.
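
The following is a minimal sketch of the plot described above, assuming the r_slope and r_intercept objects calculated earlier; the exact styling choices are illustrative.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = r_intercept, slope = r_slope,
              linetype = 'dashed', color = '#376795', size = 1) +
  # geom_abline(intercept = 37.285, slope = -5.344, linetype = 'dashed') + # alternate, hard-coded values
  labs(title = "Mtcars", subtitle = "Linear Regression: mpg ~ wt") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))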

Center of Mass

The ordinary least squares regression line will always pass through the center of mass of the points. In other words, if we were to draw a point that represented the mean of x and the mean of y, our ordinary least squares regression line is guaranteed to go through this point. We can illustrate this by drawing a horizontal line connecting each point to the mean of x and a vertical line connecting each point to the mean of y. If we add up the total distances in each case and divide by the total number of points, we find ourselves at the mean of x and the mean of y.

The Normal Equation from Linear Algebra

In the case of simple linear regression, we can calculate the slope and intercept using neat closed-form equations involving the standard deviations and the Pearson correlation. However, in the case of more than one independent variable, the coefficients have to be derived using linear algebra. In fact, even in the case of simple linear regression, R will calculate the slope and intercept using linear algebra, because it is faster and more numerically precise.

In this section, we will first review what is called the normal equation. The normal equation allows us to find our regression coefficients analytically. R will not solve the OLS minimization problem using the normal equation; R will instead calculate the OLS regression using QR decomposition because QR decomposition avoids working with an inverse matrix, which can be slow. Even though R will solve the problem using QR decomposition, it’s useful first to understand the normal equation.

We start with a basic formula from linear algebra, which is the equation for multiplying a matrix by a vector. In this equation, A refers to a matrix, x refers to an unknown vector, and b is a known vector; in our regression setting, the unknown vector will hold the coefficients of our linear model.

Our model matrix is also called a design matrix.

\[ \begin{aligned} Ax = b \end{aligned} \]

We can use Ax = b to solve a system of linear equations. Let’s look at an example first. Take the following three linear equations.

  • x + 2y + z = 2
  • 3x + 8y + z = 12
  • 4y + z = 2

We can write these three linear equations and put them into a matrix. We will start first with the left-hand side.

\[\begin{aligned} \begin{bmatrix} 1 & 2 & 1 \\ 3 & 8 & 1 \\ 0 & 4 & 1 \end{bmatrix} \end{aligned}\]

Now we can solve this matrix using elimination, specifically by creating an elimination matrix and then an upper triangular matrix. The entries on the diagonal of our upper triangular matrix are called the pivots. These matrices might be labeled A -> E -> U.

\[\begin{aligned} \begin{bmatrix} 1 & 2 & 1 \\ 3 & 8 & 1 \\ 0 & 4 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 2 & -2 \\ 0 & 4 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 2 & -2 \\ 0 & 0 & 5 \end{bmatrix} \end{aligned}\]

As a final step, we can now include the right-hand side. We will call this an augmented matrix.

\[\begin{aligned} \begin{bmatrix} 1 & 2 & 1 & 2 \\ 3 & 8 & 1 & 12 \\ 0 & 4 & 1 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 & 2 \\ 0 & 2 & -2 & 6 \\ 0 & 4 & 1 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 & 2 \\ 0 & 2 & -2 & 6 \\ 0 & 0 & 5 & -10 \end{bmatrix} \end{aligned}\]

We finally have an equivalent set of equations that we can solve by back substitution.

  • x + 2y + z = 2
  • 2y - 2z = 6
  • 5z = -10

Starting with the bottom equation and moving upwards, a process called back substitution, we obtain the solutions below; we can also check them in R, as shown in the sketch after this list.

  • z = -2
  • y = 1
  • x = 2
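
As a quick check, we can hand this same system to R’s solve() function; a minimal sketch:

A <- matrix(c(1, 2, 1,
              3, 8, 1,
              0, 4, 1), nrow = 3, byrow = TRUE)
b <- c(2, 12, 2)

solve(A, b)   # x = 2, y = 1, z = -2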

But what do we do when we are looking at a scatterplot where obviously no line goes through all the points? Thinking back to our scatterplot using mtcars, we have wt on the x-axis and mpg on the y-axis. We can plainly see that no single line will go through all the points, so the problem Ax = b has no exact solution. What we need instead is to add an error vector.

\[ Ax + e = b \]

It’s helpful to write our linear equation using matrix notation, which highlights our vector of errors, lets us refer to our variables in the familiar y = b0 + b1x structure, and puts our y variable on the left-hand side.

\[ \begin{aligned} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix} \end{aligned} \]

Now, instead of finding a line that goes through all the points, we want to find the coefficients that minimize the total size of the error vector. We remember from matrix-vector notation that the dot product of a vector with itself is the norm squared, where the norm is the magnitude, or length, of the vector.

\[ x \cdot x = \|x\|^2 \]
Another way of computing the norm squared is to multiply a vector by its own transpose. For example, we can look at the error vector premultiplied by its transpose.

\[ \begin{aligned} e^Te = \begin{bmatrix} e_1 & e_2 & e_3 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix} = \sum{e_i^2} \end{aligned} \]

Now we can rearrange the terms in Ax + e = b so that we have our error vector by itself: e = b - Ax. We know the magnitude of each side should be equal.

\[ ||Ax - b||^2 = ||e||^2 \]

\[ \|x\| =\sqrt{\sum{x_i^2}} = \sqrt{x_1^2 + x_2^2 + x_3^2} \]

The least squares problem is therefore reduced to minimizing this squared norm. The following formula, known as the normal equation, shows how the ordinary least squares estimator is derived from linear algebra.

\[ \begin{aligned} A^TA\hat{x} = A^Tb \end{aligned} \]

Here A refers to our matrix, A^T is the transposed version of the matrix, x-hat refers to the vector of estimated beta coefficients, and b is the vector of observed values. This formula may also be written in the following way, after rearranging the terms in order to solve for the coefficients. Here A is a matrix of full rank and, by extension, A^TA is invertible.

\[ \begin{aligned} \hat{x} = (A^TA)^{-1}A^Tb \end{aligned} \]

What we are doing here is projecting b onto the column space of A, denoted C(A).

Here we look at the transposed version of our matrix. We use the model.matrix() function in order to see the model matrix that goes into the linear model.

##                1    2    3
## (Intercept) 1.00 1.00 1.00
## wt          3.17 2.77 2.78
## attr(,"assign")
## [1] 0 1

Here we look at a section of our augmented matrix, with the column of ones, the wt column, and the observed y vector (mpg) attached.

##      ones    wt  mpg
## [1,]    1 2.620 21.0
## [2,]    1 2.875 21.0
## [3,]    1 2.320 22.8
## [4,]    1 3.215 21.4
## [5,]    1 3.440 18.7

Here is our equation following matrix notation. We are just showing the first three values instead of all 32 in order to show the equation in action. Multiplying a matrix by its transpose gives us a square matrix.

\[\begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 3.17 & 2.77 & 2.78 \end{bmatrix} \begin{bmatrix} 1 & 3.17 & |1 \\ 1 & 2.77 & |2 \\ 1 & 2.78 & |2 \end{bmatrix} = \begin{bmatrix} 3 & 8.72 & | 5 \\ 8.72 & 25.4502 & | 14.27 \end{bmatrix} \end{aligned}\]

Here is the full result of the matrix multiplication for all 32 observations: the transpose of A multiplied by A, with the observed y vector (mpg) attached.

##         ones       wt      mpg
## ones  32.000 102.9520  642.900
## wt   102.952 360.9011 1909.753

This result means that we are solving the following two linear equations, where b0 is the intercept and b1 is the slope:

  • 32.000 * b0 + 102.9520 * b1 = 642.900
  • 102.952 * b0 + 360.9011 * b1 = 1909.753

Solving this pair of equations gives us the beta coefficients: the intercept and the slope of our line.

## [1] 37.285087 -5.344459
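
We can reproduce this result directly in R with matrix algebra; the following is a minimal sketch of the normal equations, where A is the design matrix with a column of ones plus the wt column, and b is the observed mpg vector.

A <- cbind(ones = 1, wt = mtcars$wt)   # design matrix: a column of ones and the wt column
b <- mtcars$mpg                        # observed values of mpg

# the normal equations: (A'A) x = A'b
solve(t(A) %*% A, t(A) %*% b)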

Now, solving for the coefficients through the inverse of A^TA is slow, as it turns out. How do we speed up this computation? QR factorization decomposes A into an orthogonal matrix Q and an upper triangular matrix R, with A = QR. This simplifies our equation above:

\[ \begin{aligned} \hat{x} = (A^TA)^{-1}A^Tb \end{aligned} \]

into a simpler equation to solve:

\[ \begin{aligned} \hat{x} = R^{-1}Q^Tb \end{aligned} \]

Because Q is orthogonal, Q^TQ = I, so the cross-product matrix simplifies as follows.

\[ \begin{aligned} A^TA = R^TQ^TQR = R^TR \end{aligned} \]
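
The following is a minimal sketch of the QR route using base R’s qr() functions; note that R in this code refers to the upper triangular factor, not the correlation.

A <- cbind(1, mtcars$wt)          # design matrix
y <- mtcars$mpg

QR <- qr(A)                       # QR decomposition of the design matrix
Q  <- qr.Q(QR)                    # orthogonal factor (thin form)
R  <- qr.R(QR)                    # upper triangular factor

backsolve(R, t(Q) %*% y)          # solves R b = Q'y for the coefficients
# qr.coef(QR, y) returns the same result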

Ordinary Least Squares vs Total Least Squares

The OLS regression technique works by minimizing the sum of the squares in the difference between the observed and predicted values of the dependent variable and it takes the form of a straight line, which we call the line of best fit.

Notice that with ordinary least squares estimation we minimize the difference between the observed and predicted values in the y direction only. Contrast this with total least squares, where we minimize the total distance including both x and y, which is to say that with total least squares we minimize the orthogonal, or perpendicular, distance. Studies have shown that, when given a scatterplot of points and asked to draw a regression line, people tend to draw something closer to the total least squares line than the ordinary least squares regression line.

Creating Our Model Object

We can create a linear model object in R using the lm() function from base R.

mtcars_lm_mpg_wt <- lm(mpg ~ wt, data = mtcars) 

But how does the lm() function calculate the ordinary least squares regression line? When we type ?lm into our console, we can read the documentation, which tells us that the only fitting method currently supported is qr, meaning QR decomposition.

While the interface for the R user is the lm() function, R is actually calling into the C programming language and, beneath C, Fortran routines from the LINPACK linear algebra library. Within LINPACK, the QR decomposition is computed using Householder reflections. The derivation of the Householder reflection algorithm is complicated and won’t be explored here, nor will we explore the surrounding C and Fortran implementation details, such as error handling and caching.

It is important to know that R will always use matrix algebra to calculate the slope of the ordinary least squares regression line, even though simple closed-form equations exist to find the slope. This is because the QR approach is more numerically stable: the closed-form equations involve squaring and summing values, which can magnify floating-point rounding error.

So far, we have calculated the slope and the intercept by hand by looking at the correlation coefficient and the standard deviations of the y and x variables. As we will see, calculating the r-slope and the r-intercept is not the same as creating a linear model object within the R programming language. We can create a model object using the lm() function from base R. The lm() function takes two main arguments: formula and data. The output of the model object and the summary of the model object are printed in the console from specialized list objects. This kind of specialized list is commonly used in R, so it’s worth becoming familiar with how to read this output. We choose to give our linear model a detailed, explicit name, recognizing that it may be more common as a convention to use a simpler name.

mtcars_lm_mpg_wt <- lm(mpg ~ wt, data = mtcars) 

We can also use the tidymodels workflow. With tidymodels, the parsnip package supplies a wrapper around the lm() function. For this particular use case, it seems a little needlessly wordy, but the workflow is worth getting used to. We specify the engine and the mode with set_engine() and set_mode().

mtcars_lm_mpg_wt2 <- linear_reg() %>% 
            set_engine('lm') %>% # adds lm implementation of linear regression
            set_mode('regression')

# mtcars_lm_mpg_wt object properties
mtcars_lm_mpg_wt2
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
# We can do more things with our linear model object.

lm_fit <- mtcars_lm_mpg_wt2 %>% 
          fit(mpg ~ wt, data = mtcars)

# View lm_fit properties
lm_fit
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = mpg ~ wt, data = data)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344
tidy_mtcars_lm_mpg_wt <- broom::tidy(mtcars_lm_mpg_wt)

tidy_mtcars_lm_mpg_wt_estimate <- tidy_mtcars_lm_mpg_wt$estimate

mtcars_lm_mpg_am <- lm(mpg ~ am, mtcars) 

tidy_mtcars_lm_mpg_am<- tidy(mtcars_lm_mpg_am)

tidy_mtcars_lm_mpg_am_estimate <- tidy_mtcars_lm_mpg_am$estimate

regression_estimates <- rbind(tidy_mtcars_lm_mpg_wt_estimate, tidy_mtcars_lm_mpg_am_estimate)

We can use the options() function and set the scipen argument to 999. Doing this turns off scientific notation, which makes the output more readable. By printing our linear model in the console, we see the call and the coefficients. The coefficients have two components: the intercept and the slope. We recognize the r-intercept and the r-slope from our previous work calculating these values by hand: the y-intercept for when x is zero is 37.285 and the slope of our line is -5.344.

options(scipen = 999)

mtcars_lm_mpg_wt
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344
summary(mtcars_lm_mpg_wt)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858 < 0.0000000000000002 ***
## wt           -5.3445     0.5591  -9.559       0.000000000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 0.0000000001294

Compare that console output to what we get when we pass the model to tidy(). The estimate column holds the coefficients calculated from the least squares optimization: the intercept and the slope.

The std.error column shows us the variability of the intercept and of the slope. These standard errors are in the same units as the associated estimate.

The statistic column combines the estimate with its standard error: it is a standardized estimate measuring how many standard errors the estimate lies above 0. The p.value column tests whether the slope or intercept could plausibly be 0; the default test is two-sided. Here we reject 0 as a plausible value for either the slope or the intercept, since it would be extremely unlikely for this data to come from a population with a slope or intercept of 0. If the research question is instead whether the association is positive or negative, we use a one-sided test, and the one-sided p-value is half of the two-sided p-value.

mtcars_lm_mpg_wt %>% tidy()

We can also perform a simple linear regression with a discrete variable on the x axis. Take the following example.

Performing a regression to predict a continuous variable from a discrete variable is equivalent to looking at the difference in means between two groups. In the first case, we can use geom_boxplot() and use the am variable as a factor. In the second case, we can use a linear model and use the am variable as a numeric in the model. To add geom_smooth() to the second plot, we have to convert the factor to a numeric datatype.

mean(mtcars$mpg[mtcars$am == 1])
## [1] 24.39231
mean(mtcars$mpg[mtcars$am == 0])
## [1] 17.14737
one <- ggplot(mtcars, aes(x = as.factor(am), y = mpg)) +
  geom_boxplot() +
  labs(title = "Mtcars", subtitle = "Boxplot: mpg ~ am") +
  theme(plot.title = element_text(face = "bold")) 

two <-ggplot(mtcars, aes(x = as.numeric(am), y = mpg)) +
  geom_point() + geom_smooth(method = 'lm', se = FALSE, color = '#e76254') +
  labs(title = "Mtcars", subtitle = "Regression: mpg ~ am") +
  theme(plot.title = element_text(face = "bold")) 

library(patchwork)
one + two

Why do we use the term coefficient in these models? The coefficients are the estimated parameters of the model equation: in simple regression, the intercept and the slope. With a categorical explanatory variable, as we see next, the coefficients take on a slightly different interpretation.

We create our model. This time we have two coefficients: an intercept and a coefficient for am == 1. There is no separate coefficient for am == 0; instead, the intercept is the mean value of mpg for the cars where am == 0, which we recognize from our earlier calculation.

But what is the meaning of the coefficient for where am == 1? The value is 7.245. As it turns out, the coefficient for am == 1 is calculated relative to the intercept. 24.39231 - 17.14737 evaluates to 7.24494, which is the coefficient of am == 1 after rounding.

moderu <- lm(mpg ~ as.factor(am), mtcars)
moderu
## 
## Call:
## lm(formula = mpg ~ as.factor(am), data = mtcars)
## 
## Coefficients:
##    (Intercept)  as.factor(am)1  
##         17.147           7.245

Within the lm() call, we can add + 0 after the explanatory variable, which tells R to fit the model without an intercept term; the coefficients are then reported relative to zero. Fitting without an intercept is generally not a good idea, but it is instructive here.

When we do this, we can see very clearly that the coefficients are the mean values for where am is equal to 0 and where am is equal to 1. With only one categorical explanatory variable, the linear regression coefficients are the means of each category.

moderu <- lm(mpg ~ as.factor(am) + 0, mtcars)
moderu
## 
## Call:
## lm(formula = mpg ~ as.factor(am) + 0, data = mtcars)
## 
## Coefficients:
## as.factor(am)0  as.factor(am)1  
##          17.15           24.39

We can repeat this exercise with gear, a categorical variable that has three levels rather than just two.

ggplot(
  data = mtcars,
  aes(x = gear, y = mpg)) +
  geom_point() + geom_smooth(method = 'lm', se = FALSE)

# mtcars %>%
#   select(gear, mpg) %>%
#   group_by(gear) %>%
#   summarize_all(mean) # summarize(mpg = mean(mpg))

# mtcars %>%
#   select(gear, mpg) %>%
#   pivot_wider(names_from = gear, values_from = mpg)
# 
# mtcars <- pivot_wider(mtcars, names_from = 'names', values_from = 'values', 2)

# We have to turn this into a factor. But what happens if we don't? 

mtcars$gear <- as.factor(mtcars$gear)

modell11 <- lm(mpg ~ gear, mtcars)
modell11
## 
## Call:
## lm(formula = mpg ~ gear, data = mtcars)
## 
## Coefficients:
## (Intercept)        gear4        gear5  
##      16.107        8.427        5.273
modell22 <- lm(mpg ~ gear + 0, mtcars)
modell22
## 
## Call:
## lm(formula = mpg ~ gear + 0, data = mtcars)
## 
## Coefficients:
## gear3  gear4  gear5  
## 16.11  24.53  21.38
summary(modell11)
## 
## Call:
## lm(formula = mpg ~ gear, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7333 -3.2333 -0.9067  2.8483  9.3667 
## 
## Coefficients:
##             Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)   16.107      1.216  13.250 0.0000000000000787 ***
## gear4          8.427      1.823   4.621 0.0000725738200575 ***
## gear5          5.273      2.431   2.169             0.0384 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.708 on 29 degrees of freedom
## Multiple R-squared:  0.4292, Adjusted R-squared:  0.3898 
## F-statistic:  10.9 on 2 and 29 DF,  p-value: 0.0002948
summary(modell22)
## 
## Call:
## lm(formula = mpg ~ gear + 0, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7333 -3.2333 -0.9067  2.8483  9.3667 
## 
## Coefficients:
##       Estimate Std. Error t value             Pr(>|t|)    
## gear3   16.107      1.216   13.25   0.0000000000000787 ***
## gear4   24.533      1.359   18.05 < 0.0000000000000002 ***
## gear5   21.380      2.106   10.15   0.0000000000465706 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.708 on 29 degrees of freedom
## Multiple R-squared:  0.9542, Adjusted R-squared:  0.9495 
## F-statistic: 201.5 on 3 and 29 DF,  p-value: < 0.00000000000000022

We can also model a discrete variable on the y axis, but this is a different kind of problem: when the response is a category rather than a number, we use logistic regression instead of linear regression.

9.2 Evaluating our Model

“All models are wrong, but some are useful.” ~ George Box

Of course, we can always draw a line through any set of points on a scatterplot. But how will we know if that line is justified, that our linear model is a good fit and can be relied on to make accurate predictions? There are several ways to test our model and see how it’s working. One of the following methods may show a stronger signal than another; taken together, these methods will give us the most accurate picture.

  • Create diagnostic plots
  • Look at model statistics
  • Perform goodness of fit tests
  • Validate the model with test/train workflow

Model Statistics

SST, SSR, and SSE

One of the most commonly used statistics in evaluating a model is called r-squared. We will first explore three related statistics, SST, SSR, and SSE, in order to understand r-squared.

Without having access to any information about the x variable, our best guess for the value of y is the mean of y. This is the guess that would minimize error under the circumstance of knowing nothing about x. The statistic that captures this is the sum of squares total, denoted SST: the sum of the squared differences between the observed dependent variable and its mean. In other words, SST is the dispersion of the observed values around the mean. In the below visualization on the left, the dashed orange line is the mean of mpg; SST is the sum of the squared distances from every point to the orange line, which evaluates to 1126.047.

\[ \begin{aligned} SST = \sum{(y_i - \bar{y}) ^ 2} \end{aligned} \]

Having information about an x value, however, will now change our guess about y. This is because we can now draw a line of best fit for the relationship of x and y. Now, given a value of x, our best guess for y is no longer the mean of y but rather the predicted value of y based on the line of best fit.

We can first calculate the sum of squares due to regression, or SSR, which is the sum of the squared differences between the predicted values and the mean of the dependent variable. The middle panel of our visualization is an illustration of SSR: it is calculated from the distance between the blue and orange lines.

\[ \begin{aligned} SSR = \sum{(\hat{y}_i - \bar{y}) ^ 2} \end{aligned} \]

The final term, which may be the one with which we are most familiar, is the sum of squares error, or SSE. The right panel of our visualization illustrates this. The SSE is the sum of the squared differences between the observed values and the predicted values; in other words, it measures the distance between our actual data points and our line of best fit. The rationale is the following: the total variability of the dataset is equal to the variability explained by the regression line plus the unexplained variability, known as error.

The residual sum of squares (RSS) is also called the sum of squared errors (SSE). It is given by the following equation. Essentially, our linear model creates a prediction based on the x values, and then we subtract the predicted value from y and square the difference.

\[ \begin{aligned} SSE = \sum(y_i- \hat{y}) ^ 2 \end{aligned} \]

\[ \begin{aligned} SSE = \sum(y_i- (\hat{a} + \hat{b}x)) ^ 2 \end{aligned} \]

model <- lm(mpg ~ wt, mtcars)

sse <- sum((fitted(model) - mtcars$mpg)^2)

We can see in the above visualization that the SSE is equal to the difference between SST and SSR; we verify this numerically in the sketch below the equation.

\[ \begin{aligned} SSE = SST - SSR \end{aligned} \]
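
The following minimal sketch computes all three quantities for the model object created in the chunk above and confirms the identity.

sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares, 1126.047
ssr <- sum((fitted(model) - mean(mtcars$mpg))^2) # sum of squares due to regression
sse <- sum((mtcars$mpg - fitted(model))^2)       # sum of squared errors

all.equal(sse, sst - ssr)   # TRUE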

MSE

\[ \begin{aligned} MSE = \frac{1}{n}\sum({y_i} -\hat{y_i}) ^ 2 \end{aligned} \]

“To find the MSE, take the observed value, subtract the predicted value, and square that difference. Repeat that for all observations. Then, sum all of those squared values and divide by the number of observations.”

The mean squared error is simply the SSE divided by the number of observations.
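
Continuing the sketch above, and assuming the sse object it created, the MSE and RMSE follow directly.

n    <- nrow(mtcars)
mse  <- sse / n       # mean squared error
rmse <- sqrt(mse)     # root mean squared error, in the same units as mpg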

R-Squared

\[ \begin{aligned} R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{\sum{(y_i- \hat{y}_i) ^ 2}}{\sum{(y_i - \bar{y}) ^ 2}} \end{aligned} \]

The r-squared can be thought of as SSR / SST, or equivalently (SST - SSE) / SST. The r-squared statistic represents the proportion of the variance of the dependent variable, the y variable, that is explained by the independent variable, the x variable. In the case of mpg explained by wt in the mtcars dataset, we see a multiple r-squared of 0.7528 (and an adjusted r-squared of 0.7446), which is to say that our model accounts for about 75% of the variability of mpg according to wt.

For the sake of calculating r-squared, the order of the x and y variables is not important. We will calculate an r-squared of 0.7528 whether we are modeling mpg as a function of wt or wt as a function of mpg. The tilde can be read as ‘as a function of’ in a formula such as mpg ~ wt. We can place two graphs side-by-side and switch the x and y values; in each case, the same proportion of the variability on the y axis can be explained by the values of x.

This is what is meant when we talk about how r-squared is the percent of variance explained.

model1 <- lm(mpg ~ wt, mtcars)
model2 <- lm(wt ~ mpg, mtcars)

one <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "mtcars", subtitle = "mpg ~ wt") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) + 
  geom_abline(intercept = 37.2851, slope = -5.3445, linetype = 'dashed', color = '#376795', size = 1) +
  labs(caption = "r-squared: 0.7446")
  
two <- ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(title = "mtcars", subtitle = "wt ~ mpg") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  geom_abline(intercept = 6.0473, slope = -0.1409, linetype = 'dashed', color = '#376795', size = 1) +
  labs(caption = "r-squared: 0.7446")

library(patchwork)
one + two
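
We can confirm the symmetry numerically; this is a minimal sketch using the model1 and model2 objects created above.

summary(model1)$r.squared   # mpg ~ wt
summary(model2)$r.squared   # wt ~ mpg: the same value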

There are other ways to arrive at the r-squared statistic. R-squared can also be understood as one minus the ratio of the mean squared error to the variance of y, as seen in this formula:

\[ \begin{aligned} R ^ 2 = 1 - \frac{\frac{1}{n}\sum{({y_i} -\hat{y_i}) ^ 2}}{\frac{1}{n}\sum{(y_i - \bar{y}) ^ 2}} = 1 - \frac{MSE}{\sigma_y^2} \end{aligned} \]

Written more compactly, where σ^2 is the mean squared error of the residuals and s_y^2 is the variance of y (both divided by n, so the factors cancel):

\[ \begin{aligned} R ^ 2 = 1 - \frac{σ^2}{s_y^2} \end{aligned} \]

R-squared is also the square of the correlation between y and y-hat.

\[ \begin{aligned} R ^ 2 = ρ(y, \hat{y}) ^ 2 \end{aligned} \]
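
A minimal sketch verifying this identity with our model object from earlier:

cor(mtcars$mpg, fitted(model))^2   # the squared correlation of y and y-hat
summary(model)$r.squared           # matches the multiple r-squared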

Finally, the F statistic reported in the model summary can be derived from r-squared and the degrees of freedom, where df_1 is the model degrees of freedom and df_2 is the residual degrees of freedom:

\[ \begin{aligned} F = \frac{R ^ 2}{1 - R ^ 2} * \frac{df_2}{df_1} \end{aligned} \]

Standard Deviation Line

In addition to calculating our line of best fit, we can also calculate the standard deviation line. We find the slope of the standard deviation line by taking the quotient of the standard deviation of y over the standard deviation of x and giving it the sign of the correlation; here the correlation is negative, so we multiply by -1. The intercept of the standard deviation line is then calculated the same way as before: we multiply the slope by the mean of x and subtract the result from the mean of y. The standard deviation line doesn’t elucidate much about our inferences, but it is useful for visually inspecting r-squared: essentially, the standard deviation line compares the two standard deviations without any consideration of the correlation between the two variables.

\[ \begin{aligned} a = -1 * \frac{sy}{sx} \end{aligned} \]

The standard deviation line will always be at least as steep (in absolute value) as the regression line, because the regression slope is the standard deviation line’s slope multiplied by the correlation. Sometimes the standard deviation line is close to the line of best fit; sometimes the lines are further apart. A standard deviation line close to the regression line indicates a high r-squared. For this reason, overlaying the standard deviation line on top of our scatterplot and alongside our linear regression line is useful for visually inspecting r-squared. We can also add a point that represents the mean of the x and y variables: the linear regression line and the standard deviation line will intersect at this point.

mean_wt <- mean(mtcars$wt)
mean_mpg <- mean(mtcars$mpg)

sd_wt <- sd(mtcars$wt)
sd_mpg <- sd(mtcars$mpg)

model <- lm(mpg ~ wt, data = mtcars)

sd_slope <- -1 * (sd_mpg / sd_wt)

sd_intercept <- mean_mpg - sd_slope * mean_wt

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(aes(intercept = sd_intercept, slope = sd_slope), linetype = 'dashed', color = '#D55E00', size = 1) +
  geom_abline(intercept = r_intercept, slope = r_slope, color = '#376795', size = 1) +
  labs(title = "Mtcars", subtitle = "Linear Regression + Standard Deviation Line") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  geom_point(aes(x = mean(wt), y = mean(mpg)), color = 'orange', size = 4)

Standard Error

We sometimes speak loosely of the standard error of the regression, but in fact there are several kinds of standard errors. For example, we can think about the single standard error of the estimate as well as standard errors that vary along the line. Here are some examples of standard errors; we will review each of these in turn.

  • standard error of a coefficient estimate
  • standard error of the mean at X
  • standard error of the forecast

The standard error of the regression is the standard deviation of the residuals, which is why it’s also called the standard error of estimate, denoted se. If the residuals are approximately normally distributed, roughly one-half of the data points fall within two-thirds of a standard error above or below the regression line, roughly two-thirds fall within one standard error of the regression line, and roughly 95% fall within two standard errors above or below the regression line.

The standard error is useful because it tells us about how large the residuals are for our dataset. The residuals, we remember, are the prediction errors. The residuals are in the same unit as our y variable. The standard error tells us how well we can predict the y value. A smaller standard error indicates more confidence in our predictions.

\[ \begin{aligned} S_{est} = \sqrt{\frac{(1-r^2) \sum{(y_i - \bar{y}) ^ 2} }{n - 2}} \end{aligned} \]

r <- cor(mtcars$mpg, mtcars$wt)

# single standard error of the estimate (the residual standard error)
standard_error_value  <- sqrt ( ( ( 1 - (r ^ 2) ) * 
sum(((mtcars$mpg) - mean(mtcars$mpg)) ^ 2) ) / 
(length(mtcars$mpg) - 2) )

# half-width of an approximate 95% band: the t critical value times the standard error of estimate

standard_error_of_line <- qt(0.975, 30) * sqrt ( ( ( 1 - (r ^ 2) ) * 
sum(((mtcars$mpg) - mean(mtcars$mpg)) ^ 2) ) / (length(mtcars$mpg) - 2) )

We can also think about the standard error of the mean of y as it changes in relation to x.

\[ \begin{aligned} (s.e.)_y = \sqrt{\frac{\sum _{i=1}^{n}(y - \hat{y}) ^ 2 }{( n - 2)}} * \sqrt{\frac{1}{n} + \frac{(x - \bar{x}) ^ 2}{\sum _{i=1}^{n} (x - \bar{x}) ^ 2}} \end{aligned} \]

adfaf <- sqrt(
  ( 1 / (length(mtcars$wt)) +
  ((mtcars$wt - mean(mtcars$wt)) ^ 2 ) /
  (sum((mtcars$wt - mean(mtcars$wt)) ^ 2))
  ))
adfaf
##  [1] 0.2080119 0.1876080 0.2416107 0.1767772 0.1814437 0.1823061 0.1882622
##  [8] 0.1768474 0.1772072 0.1814437 0.1814437 0.2361182 0.2002714 0.2047450
## [15] 0.4128882 0.4419637 0.4287122 0.2571315 0.3431469 0.3092354 0.2243142
## [22] 0.1853060 0.1812391 0.2105165 0.2110162 0.2943616 0.2652376 0.3593238
## [29] 0.1769893 0.1949100 0.1882622 0.1941440
# the first factor in the equation is a single number: the standard error of the
# regression, the square root of the sum of squared residuals divided by (n - 2)
se_regression <- sqrt(sum(residuals(model)^2) / (length(mtcars$mpg) - 2))

# the second factor varies with x; we computed it above as adfaf
# multiplying the two gives the standard error of the mean of y at each value of x
se_mean_at_x <- se_regression * adfaf

We can look at the summary() function with our model object in order to understand if our model is a good fit. We first look at the five-number summary of our residuals. We want to see a median close to 0. (We already know that in OLS regression the mean of the residuals is 0.) We also want to see that the absolute value of 1Q, the first quartile, is about equal to the absolute value of 3Q, the third quartile.

model <- lm(mpg ~ wt, mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858 < 0.0000000000000002 ***
## wt           -5.3445     0.5591  -9.559       0.000000000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 0.0000000001294
model_summ <-summary(model)

mean(model_summ$residuals^2)
## [1] 8.697561
sqrt(mean(model_summ$residuals^2))
## [1] 2.949163
data <- data.frame(pred = predict(model), actual = mtcars$mpg)
data
mean((data$actual - data$pred)^2)
## [1] 8.697561
sqrt(mean((data$actual - data$pred)^2))
## [1] 2.949163
"Mean square error is a useful way to determine the extent to which a regression model is capable of integrating a dataset.

The larger the difference indicates a larger gap between the predicted and observed values, which means poor regression model fit. In the same way, the smaller RMSE that indicates the better the model.

Based on RMSE we can compare the two different models with each other and be able to identify which model fits the data better."
## [1] "Mean square error is a useful way to determine the extent to which a regression model is capable of integrating a dataset.\n\nThe larger the difference indicates a larger gap between the predicted and observed values, which means poor regression model fit. In the same way, the smaller RMSE that indicates the better the model.\n\nBased on RMSE we can compare the two different models with each other and be able to identify which model fits the data better."
Prediction Interval

A prediction interval is an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed.

We can see that in our dataset of 32 observations, two observations fall outside the prediction interval lines. In a large dataset, we would expect 95% of the observations to fall within the prediction interval lines and 5% to fall outside them.

In the visualization below we approximate the prediction interval with two lines that have the same slope as our line of best fit and run parallel to it, offset above and below by the margin of error we calculated earlier (the t critical value times the standard error of estimate). Strictly speaking, the exact prediction interval widens slightly as we move away from the mean of x, but the parallel approximation is close over most of the range; R can also compute the exact interval for us, as shown below.
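
The following is a minimal sketch using predict() with interval = "prediction"; the example weight of wt = 3 is a hypothetical new observation chosen for illustration.

# 95% prediction interval for a hypothetical new car weighing 3,000 lbs (wt = 3)
predict(model, newdata = data.frame(wt = 3), interval = "prediction", level = 0.95)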

ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + geom_smooth(method = 'lm', color = '#376795', size = 1, se = FALSE) +
  geom_abline(intercept = 37.2851 + standard_error_of_line, slope = -5.3445, linetype = 'dashed', color = '#376795', size = 0.5) +
  geom_abline(intercept = 37.2851 - standard_error_of_line, slope = -5.3445, linetype = 'dashed', color = '#376795', size = 0.5) +
  labs(title = "Mtcars", subtitle = "Linear Regression") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

Confidence Interval

By default, geom_smooth(method = 'lm', se = TRUE) shades a 95% confidence interval band around the fitted regression line.
fourth <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + geom_smooth(method = 'lm', color = '#376795', size = 1, se = TRUE) +
  labs(title = "Mtcars", subtitle = "Linear Regression") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

fourth

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  ggplot(aes(x = values, y = mpg)) +
  geom_point() +
  facet_wrap(~names, scales = 'free') + geom_smooth(method = 'lm', color = '#376795', size = 1, se = TRUE) +
  labs(title = "Mtcars", subtitle = "Linear Regression") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

Ribbons are one aesthetic way to show a confidence interval without showing the regression line itself.

mtcars_linear_regression_geom_ribbon <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + stat_smooth(method = "lm", color = '#376795', size = 1, fill = NA, geom = "ribbon") +
  labs(title = "Mtcars", subtitle = "Linear Regressions") +
  theme(plot.title = element_text(face = "bold")) 
  # theme(plot.subtitle = element_text(hjust = 0.5))

mtcars_linear_regression_geom_ribbon_filled <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + stat_smooth(method = "lm", color = '#376795', fill = '#ffe6b7', size = 1, geom = "ribbon") +
  # labs(title = "Mtcars", subtitle = "Linear Regression") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

library(patchwork)
mtcars_linear_regression_geom_ribbon + mtcars_linear_regression_geom_ribbon_filled

# 
# mtcars %>%
#   dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
#   pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
#   ggplot(aes(x = values, y = mpg)) +
#   geom_point() + stat_smooth(method = "lm", color = '#376795', size = 1, geom = "ribbon") +
#   facet_wrap(~names, scales = 'free') +
#   labs(title = "Mtcars", subtitle = "Linear Regression") +
#   theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
#   theme(plot.subtitle = element_text(hjust = 0.5))
# 
# # what about  geom_area()
# 
Residual Distance

We can also visualize the size of each residual directly: the following plot draws a segment from every observation to its fitted value on the regression line, colors each point by the absolute size of its residual, and labels the cars with the largest positive residuals.
mtcars_linear_model <- lm(mpg ~ wt, mtcars)

mtcars$predicted <- predict(mtcars_linear_model)   # Save the predicted values
mtcars$residuals <- residuals(mtcars_linear_model) # Save the residual values

single_line_plus_residual_distance <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = '#376795', size = 1) +     # regression line  
  geom_segment(aes(xend = wt, yend = predicted), alpha = .2) +      # draw line from point to line
  geom_point(aes(color = abs(residuals))) +  # size of the points
  # scale_color_continuous(low = "green", high = "red") +             # colour of the points mapped to residual size - green smaller, red larger
  guides(color = FALSE, size = FALSE) +                             # Size legend removed
  # geom_point(aes(y = predicted), shape = 1) 
  geom_text(aes(label = car), size = 2.5, hjust = 1.1, data = subset(mtcars, residuals > 6)) +
  labs(title = "Mtcars", subtitle = "Linear Regression") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

single_line_plus_residual_distance

mtcars %>%
  dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
  pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
  nest_by(names) %>% 
  mutate(model = list(lm(mpg ~ values, data = data))) %>% 
  summarise(augment(model)) -> predicted_values

mtcars_long_numeric_ij <- inner_join(mtcars_long_numeric, predicted_values, by = c("names", "mpg", "values"))

ggplot(mtcars_long_numeric_ij, aes(x = values, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, color = '#376795', size = 1) +
  facet_wrap(~names, scales = 'free_x') +
  geom_segment(mapping = aes(xend = values, yend = .fitted), alpha = .2, data = mtcars_long_numeric_ij) +
  geom_point(aes(color = abs(.resid))) + 
  guides(color = FALSE, size = FALSE) +                             
  geom_text(aes(label = car), size = 2.5, hjust = 1.1, data = subset(mtcars_long_numeric_ij, .resid > 6)) +
  labs(title = "Mtcars", subtitle = "Linear Regression") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

s <- mtcars %>%
  nest(data = -am) %>% 
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),  # S3 list-col
    tidied = map(fit, tidy)
  ) %>% 
  unnest(tidied)

regressions <- mtcars %>%
  nest(data = -am) %>% 
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),
    tidied = map(fit, tidy),
    glanced = map(fit, glance),
    augmented = map(fit, augment)
  )

# regressions %>% 
#   unnest(tidied)
# 
# regressions %>% 
#   unnest(glanced)
# 
# regressions %>% 
#   unnest(augmented)
Bootstrapping

Bootstrapping is a statistical procedure that resamples a single dataset many times in order to create many simulated datasets. By doing this, we are able to calculate confidence intervals and assess the variability of our estimates without relying on distributional formulas. In this example, we use bootstrapping on our regression model in order to test the stability of our model coefficients.

boots <- rsample::bootstraps(mtcars, times = 250, apparent = TRUE)

fit_lm_on_bootstrap <- function(split) {
    lm(mpg ~ wt, analysis(split))
}

boot_models <-
  boots %>% 
  dplyr::mutate(model = map(splits, fit_lm_on_bootstrap),
         coef_info = map(model, tidy))

boot_coefs <- 
  boot_models %>% 
  unnest(coef_info)

percentile_intervals <- int_pctl(boot_models, coef_info)
percentile_intervals
one <- ggplot(boot_coefs, aes(estimate)) +
  geom_histogram(bins = 30, alpha = .3, col = "#376795") +
  facet_wrap( ~ term, scales = "free") +
  labs(title="Mtcars", subtitle = "Stability of Coeffiecients") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "95% Confidence Interval Parameter Estimates, Intercept + Estimate") +
  geom_vline(aes(xintercept = .lower), data = percentile_intervals, col = "black", linetype = 'dotdash') +
  geom_vline(aes(xintercept = .upper), data = percentile_intervals, col = "black", linetype = 'dotdash')

boot_aug <- 
  boot_models %>% 
  sample_n(50) %>% 
  mutate(augmented = map(model, augment)) %>% 
  unnest(augmented)

# boot_aug <- 
#   boot_models %>% 
#   sample_n(200) %>% 
#   mutate(augmented = map(model, augment)) %>% 
#   unnest(augmented)

glimpse(boot_aug)
## Rows: 1,600
## Columns: 12
## $ splits     <list> [<boot_split[32 x 10 x 32 x 20]>], [<boot_split[32 x 10 x …
## $ id         <chr> "Bootstrap113", "Bootstrap113", "Bootstrap113", "Bootstrap1…
## $ model      <list> [36.575074, -5.311426, -0.09753868, 1.70991444, 7.07139355…
## $ coef_info  <list> [<tbl_df[2 x 5]>], [<tbl_df[2 x 5]>], [<tbl_df[2 x 5]>], […
## $ mpg        <dbl> 18.1, 10.4, 33.9, 17.8, 30.4, 18.1, 18.7, 19.2, 15.0, 15.8,…
## $ wt         <dbl> 3.460, 5.250, 1.835, 3.440, 1.513, 3.460, 3.440, 3.440, 3.5…
## $ .fitted    <dbl> 18.197539, 8.690086, 26.828606, 18.303767, 28.538886, 18.19…
## $ .resid     <dbl> -0.09753868, 1.70991444, 7.07139355, -0.50376720, 1.8611142…
## $ .hat       <dbl> 0.03359755, 0.19295709, 0.10557370, 0.03323023, 0.14431139,…
## $ .sigma     <dbl> 2.733326, 2.710440, 2.354482, 2.731732, 2.707735, 2.733326,…
## $ .cooksd    <dbl> 0.00002369387, 0.05996625680, 0.45684339934, 0.00062465140,…
## $ .std.resid <dbl> -0.0369197, 0.7082497, 2.7822303, -0.1906464, 0.7486443, -0…
# compare this to other versions

lm_bootstrapped <- ggplot(boot_aug, aes(wt, mpg)) +
  geom_line(aes(y = .fitted, group = id), alpha = .3, col = "#376795") +
  geom_point(alpha = 0.005) +
  # ylim(5,25) +
    # labs(title="", subtitle = "mpg ~ wt \n Linear Regression Bootstrap Resampling") +
  # labs(caption = "coefficient stability testing") 
  labs(title = "Mtcars", subtitle = "Linear Regression Bootstrap Resampling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

two <- lm_bootstrapped

library(patchwork)
one + two

lm_coeffs = function(x, y) {
  coeffs = coefficients(lm(y~x))
  tibble(C = coeffs[1], m=coeffs[2])
}

nboot = 100

mtboot = lapply (seq_len(nboot), function(i) 
  mtcars_long_numeric %>%
    group_by(names) %>%
    slice_sample(prop=1, replace=TRUE) %>% 
    summarise(tibble(lm_coeffs(values, mpg))))
mtboot = do.call(rbind, mtboot)
#, color='#D55E00'
ggplot(mtcars_long_numeric, aes(values, mpg, color = names)) +
  geom_abline(aes(intercept=C, slope=m), data = mtboot, size=0.3, alpha=0.3) +
  geom_point() +
  facet_wrap(~names, scales = 'free_x') +
  labs(title = "Mtcars", subtitle = "Linear Regression") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_color_manual(values=met.brewer("Hiroshige", 5))

Making Predictions

Once we have a model object, we can pass new data to the predict() function to generate predictions. Note that weights far beyond the observed range of wt (such as 7 or 8) extrapolate the line and produce impossible negative values of mpg, a reminder to be careful with out-of-sample estimates.
new_data <- data.frame(wt = c(1, 2, 3, 4, 5, 6, 7, 8))

predict(mtcars_linear_model, new_data)
##          1          2          3          4          5          6          7 
## 31.9406546 26.5961830 21.2517114 15.9072399 10.5627683  5.2182967 -0.1261748 
##          8 
## -5.4706464
predictions <- predict(mtcars_linear_model, new_data)

predictions <- as.data.frame(predictions)

predictions <- round(predictions, 0)
predictions
summary(mtcars_linear_model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858 < 0.0000000000000002 ***
## wt           -5.3445     0.5591  -9.559       0.000000000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 0.0000000001294
mtcars_linear_model$coefficients
## (Intercept)          wt 
##   37.285126   -5.344472
# exploring out of sample estimates

Model Diagnostics

There are four model assumptions associated with linear regression. If a linear regression is created in violation of these assumptions, and no remedial measures are taken, the model may have errors and poor accuracy. The four assumptions for linear regression are:

  • Linearity
  • Homoscedasticity
  • Independence
  • Normality

In addition to verifying these linear model assumptions, we should also pay close attention to any problem data points which may have an inordinately large effect on the model.

  • Outliers
  • Leverage points
  • Influential observations

“Is it realistic to assume that all of these assumptions will hold? Probably not! We want the assumptions to hold just enough for the model to be useful.”

Linearity

In order to verify the assumption of linearity, we have to check a few separate parts, including the functional form of the model and whether or not the errors have zero population mean. Some of these pieces will never be violated, because the least squares fit forces them to hold by construction, but other pieces need to be checked after constructing our model, as they might reveal problems.

As a first thing, linearity means that the regression model is linear in the coefficients and the error term. In other words, we want to make sure that we are looking at a model that is correctly specified with a certain functional form, including that it is linear with an additive error term. Here is the general form of our linear model equation. The following equation may look familiar: In this formula, the beta values, labeled as β, are the parameters that ordinary least squares regression will estimate while epsilon, labeled as ε, represents the random error inherent in any model.

\[ \begin{aligned} Y =\beta _{0} + \beta _{1}X_{1} + \beta _{2}X_{2} + \cdots + \beta _{k}X_{k} + \epsilon \end{aligned} \]

The defining characteristic of linear regression is this functional form of the parameters. Despite its name, linear regression is still able to model curvature. Linear models can model curvature by including nonlinear variables such as polynomials and transforming exponential functions. The following three panels contain two examples of linear regressions and one example of a non-linear regression. We can see that a linear regression can still model curvature if the consideration is built into the model.

In the second panel, we might ask ourselves what I() is doing. We write I() with parentheses because I() is a function, and it does something very particular within the language of R: it isolates, or insulates, the contents of the function from the formula-parsing code. It allows the standard R operators to work as they would if we used them outside of a formula, rather than being treated as special formula operators.

Take the following example: lm(mpg ~ wt + disp, mtcars). This means we are looking to predict mpg as a function of two variables, wt and disp. Compare this to lm(mpg ~ I(wt + disp), mtcars), which predicts mpg as a function of a single new predictor, the numeric sum of wt and disp. The same logic applies to the ^ operator: ^ has a special meaning inside a formula, so when we need its arithmetic meaning we wrap the operation in I(). In our example, fit <- lm(mpg ~ wt + I(wt^2), data = mtcars), we create a model with two predictors, wt and wt squared; this is different from asking for a main effect and a second-order interaction.
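
The following is a minimal sketch of this comparison, using the quadratic model named in the paragraph above; the fit_linear and fit_quadratic object names are just for illustration.

fit_linear    <- lm(mpg ~ wt, data = mtcars)             # straight line
fit_quadratic <- lm(mpg ~ wt + I(wt^2), data = mtcars)   # wt and wt squared as two predictors

summary(fit_linear)$r.squared
summary(fit_quadratic)$r.squared   # the extra term lets the fitted curve bend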

A second thing to check within the linearity assumption is that the error term has a zero population mean.

A third thing to check is that the explanatory variables are uncorrelated with the error term.

A fourth thing to check is that the observations of the error term are uncorrelated with each other; correlation among them is called serial correlation, or autocorrelation.

A fifth assumption is that no explanatory variable is a perfect linear function of any other explanatory variables (no perfect multicollinearity). If two or more independent variables have an exact linear relationship between them, we have perfect multicollinearity. Examples include entering the same information twice (weight in pounds and weight in kilograms) or not using dummy variables correctly (falling into the dummy variable trap). Because one variable moves exactly when the other moves, OLS cannot estimate the marginal effect of one variable on Y while holding the other constant, so it cannot produce separate coefficient estimates for both; the simple solution is to drop one of the variables. Here is an example of perfect multicollinearity in a model with two explanatory variables.
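The sketch below adds a second weight column that is an exact linear function of wt (the pounds-to-kilograms conversion is just for illustration). Note that rather than stopping with an error message, R's lm() quietly drops one of the aliased variables and reports NA for its coefficient:

# wt is weight in 1000s of lbs; wt_kg carries exactly the same information
mtcars_collinear <- mtcars %>%
  mutate(wt_kg = wt * 453.6)

collinear_model <- lm(mpg ~ wt + wt_kg, data = mtcars_collinear)
coef(collinear_model)
# wt_kg comes back as NA: OLS cannot separate the effect of wt from wt_kg,
# so R drops the redundant column. The fix is simply to remove one of them.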

A sixth assumption is that the error term has a constant variance (no heteroskedasticity).

A seventh assumption is that the error term is normally distributed (this one is not strictly required for estimating the coefficients, though it matters for inference).

What are residuals? A residual is the difference between the observed value and the fitted value: residual = observed value − fitted value.
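As a quick sketch, we can compute the residuals of our simple mpg ~ wt model by hand and confirm that they match what R stores in the model object, and that for an OLS model with an intercept they sum to zero:

mtcars_linear_model <- lm(mpg ~ wt, data = mtcars)

# residual = observed mpg minus fitted mpg
manual_residuals <- mtcars$mpg - fitted(mtcars_linear_model)

all.equal(manual_residuals, resid(mtcars_linear_model))  # should be TRUE
sum(resid(mtcars_linear_model))   # essentially zero
mean(resid(mtcars_linear_model))  # essentially zero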

mtcars2 <-  mtcars %>%
  mutate(l_per_100_km = 235 / mpg) 

mtcars2 
ggplot(mtcars2, aes(x = mpg, y = wt, size = l_per_100_km)) +
  geom_point()

moddell <- lm(mpg ~ wt + l_per_100_km, mtcars2)
moddell
## 
## Call:
## lm(formula = mpg ~ wt + l_per_100_km, data = mtcars2)
## 
## Coefficients:
##  (Intercept)            wt  l_per_100_km  
##       39.034        -1.110        -1.206
summary(moddell)
## 
## Call:
## lm(formula = mpg ~ wt + l_per_100_km, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8695 -1.3470 -0.9679  0.5930  5.2650 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   39.0339     1.3853  28.178 < 0.0000000000000002 ***
## wt            -1.1098     0.8793  -1.262                0.217    
## l_per_100_km  -1.2063     0.2229  -5.412           0.00000808 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.185 on 29 degrees of freedom
## Multiple R-squared:  0.877,  Adjusted R-squared:  0.8685 
## F-statistic: 103.4 on 2 and 29 DF,  p-value: 0.00000000000006342
Homoscedasticity

Homoscedasticity is the idea of even variance, meaning that the variance of the residuals is the same for any value of x. (Normality of the error terms is a separate assumption, covered below.)
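One way to check this assumption numerically is the Breusch-Pagan test. The sketch below uses the lmtest package, an extra dependency not used elsewhere in this book:

library(lmtest)  # provides bptest(); install.packages("lmtest") if needed

mtcars_linear_model <- lm(mpg ~ wt, data = mtcars)
bptest(mtcars_linear_model)
# A small p-value is evidence against constant variance (heteroscedasticity);
# a large p-value is consistent with homoscedasticity.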

Independence

Independence refers to the idea that our observations are independent of each other, that is, they do not influence one another.

Normality

Normality is the idea that for any fixed value of x, y is normally distributed.
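A quick numerical companion to the visual checks later in this chapter is the Shapiro-Wilk test applied to the residuals; this uses only base R:

mtcars_linear_model <- lm(mpg ~ wt, data = mtcars)
shapiro.test(resid(mtcars_linear_model))
# A small p-value suggests the residuals depart from normality;
# the histogram and QQ plot below are the visual versions of this check.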

Outliers

An outlier is an observation that has a large residual, meaning the observed value is very different from that predicted by our model.

Leverage Points

A point of high leverage is an observation that has a value of x that is far away from the mean of x.

Here is an example of leverage

ggplot(
  mtcars,
  aes(wt, mpg)
  ) +
    geom_point() +
  geom_point(aes(x = 1.55, y = 30.5), colour="blue", size = 30, shape = 21) +
  geom_point(aes(x = 5.35, y = 12.5), colour="blue", size = 30, shape = 21) 

Influential Observations

An influential observation is an observation that noticeably changes the slope of the fitted line. One method to find influential points is to compare the fit of the model with and without each observation, as sketched below.
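Here is a small sketch of that leave-one-out idea: refit mpg ~ wt without each observation in turn and record how much the slope changes. This uses base R only; dfbeta() and cooks.distance() package up the same idea more formally:

mtcars_linear_model <- lm(mpg ~ wt, data = mtcars)
full_slope <- coef(mtcars_linear_model)[["wt"]]

# change in the wt slope when each car is left out of the fit
slope_change <- sapply(seq_len(nrow(mtcars)), function(i) {
  coef(lm(mpg ~ wt, data = mtcars[-i, ]))[["wt"]] - full_slope
})
names(slope_change) <- rownames(mtcars)

head(sort(abs(slope_change), decreasing = TRUE))  # most influential observations first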

Two practical conditions are also worth keeping in mind: a sample size of roughly 20 cases per independent variable is a common rule of thumb, and the model should be free of multicollinearity and autocorrelation.
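Both of those conditions can be checked in code. The sketch below uses the car package for variance inflation factors and the lmtest package for a Durbin-Watson test; the three-predictor model is only there for illustration, since VIFs need more than one predictor:

library(car)     # vif()
library(lmtest)  # dwtest()

multi_model <- lm(mpg ~ wt + disp + hp, data = mtcars)

vif(multi_model)     # values far above roughly 5-10 point to problematic multicollinearity
dwtest(multi_model)  # tests for autocorrelation in the residuals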

Diagnostic Plots

Regression diagnostics are used to evaluate the model assumptions and to investigate whether there are observations with a large, undue influence on the analysis. It is possible to have a linear model with impressive goodness-of-fit statistics, such as a high r-squared and a low RMSE, that still fails the linear model assumptions. Goodness-of-fit statistics are only meaningful if the model also has good diagnostics.

We will go over the following diagnostic plots in detail in the sections that follow. All of these diagnostic plots come from our linear model mpg ~ wt and can be used in conjunction with our scatterplot and linear model statistics.

We can use a library such as ggfortify to create diagnostic plots, or we can construct our own from the linear model object using ggplot2, which gives us more flexibility.

Fitted Values vs Residuals

When conducting a residual analysis, a “residuals versus fits plot” is the most frequently created plot. It is a scatter plot of residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers.

This plot is used to check the linearity assumption: a roughly horizontal line without distinct patterns indicates a linear relationship, which is what we want.

Because an OLS model with an intercept forces the residuals to sum to zero, their mean is also zero, which is why the dashed reference line sits at zero.

# x11()
# par(mfrow=c(2,2))
# plot(mtcars_linear_model)

library(ggfortify) # autoplot() methods for lm objects come from ggfortify
autoplot(mtcars_linear_model)

mtcars_graph <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  geom_smooth(method = 'lm', se = FALSE, color = '#376795') +
  labs(title = "Mtcars", subtitle = "Linear Regression") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 
  # scale_color_manual(values=met.brewer("Hiroshige", 6))


fitted_values_vs_residuals <- ggplot(data = mtcars_linear_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = '#ef8a47', size = 1) +
  xlab("fitted values") +
  ylab("residuals") +
  labs(title = "Mtcars") +
  labs(subtitle = "Fitted Values vs Residuals") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

library(patchwork)

mtcars_graph + fitted_values_vs_residuals

x_values_vs_residuals <- ggplot(data = mtcars_linear_model, aes(x = wt, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = '#ef8a47', size = 1) +
  xlab("x values") +
  ylab("residuals") +
  labs(title = "Mtcars") +
  labs(subtitle = "X Values vs Residuals") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

library(patchwork)

mtcars_graph + x_values_vs_residuals

Histogram of Residuals

The histogram of the residuals can be used to check whether the residuals are normally distributed. A symmetric, bell-shaped histogram that is evenly distributed around zero indicates that the normality assumption is likely to hold.

histogram_of_residuals <- ggplot(data = mtcars_linear_model, aes(x = .resid)) +
  geom_histogram(color = '#ef8a47') +
  xlab("residuals") +
  labs(title = "Mtcars") +
  labs(subtitle = "Histogram of Residuals") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

library(patchwork)

mtcars_graph + histogram_of_residuals

QQ Plot

The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.

In statistics, a Q–Q (quantile–quantile) plot is a probability plot: a graphical method for comparing two probability distributions by plotting their quantiles against each other.

A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.

Now what are “quantiles”? These are often referred to as “percentiles”. These are points in your data below which a certain proportion of your data fall. For example, imagine the classic bell-curve standard Normal distribution with a mean of 0. The 0.5 quantile, or 50th percentile, is 0. Half the data lie below 0. That’s the peak of the hump in the curve. The 0.95 quantile, or 95th percentile, is about 1.64. 95 percent of the data lie below 1.64. The following R code generates the quantiles for a standard Normal distribution from 0.01 to 0.99 by increments of 0.01:
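# quantiles of the standard normal distribution, from 0.01 to 0.99 in steps of 0.01
qnorm(seq(0.01, 0.99, by = 0.01))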

https://data.library.virginia.edu/understanding-q-q-plots/

Normal Q-Q plot: used to examine whether the residuals are normally distributed. It is good if the residual points follow the straight dashed line.

qq_plot <- ggplot(data = mtcars_linear_model, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line(linetype = 'dashed', color = '#ef8a47', size = 1) +
  labs(title = "Mtcars") +
  labs(subtitle = "Residual QQ Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))

library(patchwork)

mtcars_graph + qq_plot

Scale Location

A scale-location plot is a type of plot that displays the fitted values of a regression model along the x-axis and the square root of the standardized residuals along the y-axis.

This plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal line with equally spread points.

If the variability (variance) of the residual points increases with the value of the fitted outcome variable, that suggests non-constant variance in the residual errors (heteroscedasticity).

A possible way to reduce heteroscedasticity is to use a log or square root transformation of the outcome variable (y).

In base R's sequence of diagnostic plots, the third plot is the scale-location plot (square root of the standardized residuals vs. fitted values). It is useful for checking the assumption of homoscedasticity; in this plot we are checking whether there is a pattern in the residuals.

The Scale-Location (or Spread-Location) plot is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. That is not quite what we see in our example, which shows some signs of heteroscedasticity.

scale_location <- ggplot(mtcars_linear_model, aes(.fitted, sqrt(abs(.stdresid)))) +
  geom_point(na.rm=TRUE) +
  stat_smooth(method="loess", na.rm = TRUE, color = '#ef8a47', size = 1, se = FALSE) +
  xlab("Fitted Value") +
  ylab(expression(sqrt("|Standardized residuals|"))) +
  labs(title = "Mtcars") +
  labs(subtitle = "Scale-Location") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 

library(patchwork)

mtcars_graph + scale_location

Leverage vs Standardized Residuals

The residuals vs. leverage plot helps to identify influential data points in the model. Outliers can be influential, though they do not have to be, and some points within the normal range of the data can be very influential. Recall that an outlier is defined as an observation with a large residual.

The residuals vs. leverage plot is used to identify influential cases, that is, extreme values that might change the regression results when included in or excluded from the analysis. This plot is described further in the next sections.

leverage_vs_standardized_residuals <- ggplot(data = mtcars_linear_model, aes(.hat, .stdresid)) +
  geom_point(aes(size = .cooksd)) + 
  stat_smooth(method="loess", na.rm=TRUE, color = '#ef8a47', size = 1) +
  xlab("Leverage") +
  ylab("Standardized Residuals") +
  labs(title = "Mtcars") +
  labs(subtitle = "Residuals vs Leverage") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  scale_size_continuous("Cook's Distance", range=c(1,5)) +
  theme(legend.title = element_blank()) +
  theme(legend.position= "none")

library(patchwork)

mtcars_graph + leverage_vs_standardized_residuals

Observation Number vs Cook’s Distance

The fourth plot is of “Cook’s distance”, which is a measure of the influence of each observation on the regression coefficients. The Cook’s distance statistic is a measure, for each observation in turn, of the extent of change in model estimates when that particular observation is omitted. Any observation for which the Cook’s distance is close to 1 or more, or that is substantially larger than other Cook’s distances (highly influential data points), requires investigation.
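Before plotting, we can look at the Cook's distance values directly. This sketch uses base R's cooks.distance(); the 4/n cutoff in the last line is just one common rule of thumb rather than a hard threshold:

mtcars_linear_model <- lm(mpg ~ wt, data = mtcars)
cooks <- cooks.distance(mtcars_linear_model)

head(sort(cooks, decreasing = TRUE))  # the observations with the most influence
which(cooks > 4 / nrow(mtcars))       # a common, but not universal, cutoff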

observation_number_vs_cooks_distance <- ggplot(mtcars_linear_model, aes(seq_along(.cooksd), .cooksd)) +
  geom_bar(stat="identity", position="identity", color = '#ef8a47', size = 1) + 
  xlab("Obs. Number") + 
  ylab("Cook's distance") +
  ggtitle("Mtcars") +
  labs(subtitle = "Cook's distance") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 
    
library(patchwork)

mtcars_graph + observation_number_vs_cooks_distance

Leverage vs Cook’s Distance
leverage_vs_cooks_distance <- ggplot(mtcars_linear_model, aes(.hat, .cooksd))+geom_point(na.rm=TRUE) +
  stat_smooth(method="loess", na.rm=TRUE, color = '#ef8a47', size = 1) +
  xlab("Leverage hii")+
  ylab("Cook's Distance") + 
  ggtitle("Mtcars") +
  labs(subtitle = "Cook's dist vs Leverage hii/(1-hii)") +
  geom_abline(slope=seq(0,3,0.5), color = "gray", linetype = "dotdash") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) 
    
library(patchwork)
mtcars_graph + leverage_vs_cooks_distance

Summaries
mtcars_linear_model <- lm(mpg ~ wt, mtcars)

# (?summary.lm)
# names(summary(mtcars_linear_model))
# str(summary(mtcars_linear_model))

# r-squared of mpg ~ wt for the four-cylinder cars only
one <- mtcars %>% 
  filter(cyl == 4) %>%
  lm(mpg ~ wt, data = .) %>% 
  summary() %>% 
  .$"r.squared"

# r-squared of mpg ~ wt fitted separately within each cylinder group
two <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")

# r-squared of mpg against each variable in mtcars_long_numeric,
# arranged from the strongest to the weakest relationship
three <- mtcars_long_numeric %>%
  split(.$names) %>%
  map(~ lm(mpg ~ values, data = .)) %>%
  map(summary) %>%
  map_df("r.squared") %>%
  pivot_longer(., names_to = 'explanatory_variable_to_mpg', values_to = 'r_squared', 1:5) %>%
  arrange(desc(r_squared))

mtcars_long_numeric %>% 
  nest_by(names) %>% 
  mutate(model = list(lm(mpg ~ values, data = data))) %>% 
  summarise(glance(model))

Goodness-of-Fit Tests

Test / Train Workflow

“The reason for this test is simple, imagine we used the full dataset to train the model and then use the same data to predict the model’s accuracy. Naturally, we expect the model to perform well, given that all data points are already “known”. This is exactly what the term overfit refers to. What we are interested in is, if the model is also capable of handling data that has just been introduced — the test data.”

We use the initial_split() function from the rsample package, which is part of the tidymodels ecosystem.

set.seed(335)

# Create a split object
mtcars_split <- initial_split(mtcars, prop = 0.75, 
                                   strata = mpg)

# Build training dataset
mtcars_training <- mtcars_split %>% 
                        training()

# Build testing dataset
mtcars_test <- mtcars_split %>% 
                    testing()
lm_model <- linear_reg() %>% 
            set_engine('lm') %>% # adds lm implementation of linear regression
            set_mode('regression')

# View object properties
lm_model
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
# We can do more things with our linear model object.

lm_fit <- lm_model %>% 
          fit(mpg ~ wt, data = mtcars_training)

# View lm_fit properties
lm_fit
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = mpg ~ wt, data = data)
## 
## Coefficients:
## (Intercept)           wt  
##      35.861       -4.829
mtcars_training <- mtcars_training %>%
  mutate(dataset = "training")

mtcars_test <- mtcars_test %>%
  mutate(dataset = "test")

mtcars_training_and_test <- bind_rows(mtcars_training, mtcars_test)
mtcars_training_and_test
ggplot(mtcars_training_and_test, aes(x = wt, y = mpg, color = factor(dataset))) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  scale_color_manual(values = met.brewer("Hiroshige", 2)) +
  geom_abline(intercept = 37.2851, slope = -5.3445, linetype = 'dashed') +
  labs(title = "Mtcars", subtitle = "training vs testing") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "black line: regression from full dataset") +
  xlab("Displacement") +
  ylab("MPG") 

R-squared plot

We can examine the test set accuracy by making an r-squared plot. An r-squared plot is a scatterplot that shows the actual values versus the model predictions. We also include a diagonal line through the origin where the x and y values are equal, which serves as a representation of a perfect model in which all the predicted values would equal the true values in the test set. In an r-squared plot, the farther the points are from this line, the worse the model fit.

The reason this plot is called an r-squared plot is because the r-squared is the square of the correlation between y and y-hat, or the squared correlation between the true and predicted values, which are plotted as paired in the plot.

\[ \begin{aligned} R^{2} = \operatorname{cor}(y, \hat{y})^{2} \end{aligned} \] Let’s take a look at how we might check that the squared correlation between y and y-hat matches the reported r-squared.

mtcars_linear_model <- lm(mpg ~ wt, mtcars)

mtcars$predicted <- predict(mtcars_linear_model)  

summary(mtcars_linear_model) # the multiple r-squared is 0.7528328
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858 < 0.0000000000000002 ***
## wt           -5.3445     0.5591  -9.559       0.000000000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 0.0000000001294
cor(mtcars$mpg, mtcars$predicted) ^ 2 # the squared correlation between y and y-hat is  0.7528328
## [1] 0.7528328

In the code below, we use geom_point() and geom_abline() to make this plot. The geom_abline() function will plot a line with the provided slope and intercept arguments.

mtcars_test_results <- predict(lm_fit, new_data = mtcars_test) %>% 
                            bind_cols(mtcars_test)

# View results
mtcars_test_results
ggplot(data = mtcars_test_results,
       mapping = aes(x = .pred, y = mpg)) +
  geom_point(color = '#006EA1') +
  geom_abline(intercept = 0, slope = 1, color = '#D55E00') +
  labs(title = 'Mtcars Test Set Data', subtitle = 'Linear Regression Results',
       x = 'Predicted MPG',
       y = 'Actual MPG') +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "mpg ~ wt") 

Cross-Validation

“During the process of partitioning the complete dataset into the training set and the validation set, there are chances of losing some important and crucial data points for the training purpose. Since those data are not included in the training set, the model has not got the chance to detect some patterns. This situation can lead to overfitting or under fitting of the model. To avoid this, there are different types of cross-validation techniques that guarantees the random sampling of training and validation data set and maximizes the accuracy of the model. Some of the most popular cross-validation techniques are

  • Validation set approach
  • Leave-one-out cross-validation (LOOCV)
  • K-fold cross-validation
  • Repeated K-fold cross-validation”

K Fold Cross-Validation
# https://www.geeksforgeeks.org/cross-validation-in-r-programming/
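Since we already have rsample loaded for the test/train split, here is a minimal k-fold sketch: split mtcars into 5 folds (a small number because the dataset has only 32 rows), fit mpg ~ wt on the analysis portion of each fold, and average the out-of-fold RMSE. The fold count and the RMSE summary are our own choices rather than anything prescribed:

library(rsample)
library(purrr)

set.seed(335)
mtcars_folds <- vfold_cv(mtcars, v = 5)

fold_rmse <- map_dbl(mtcars_folds$splits, function(split) {
  fold_fit <- lm(mpg ~ wt, data = analysis(split))
  held_out <- assessment(split)
  sqrt(mean((held_out$mpg - predict(fold_fit, held_out))^2))
})

fold_rmse        # one RMSE per fold
mean(fold_rmse)  # the cross-validated estimate of prediction error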
Monte Carlo Cross-Validation

“The basis of a Monte Carlo simulation is that the probability of varying outcomes cannot be determined because of random variable interference. Therefore, a Monte Carlo simulation focuses on constantly repeating random samples to achieve certain results.”

“A Monte Carlo simulation takes the variable that has uncertainty and assigns it a random value. The model is then run and a result is provided. This process is repeated again and again while assigning the variable in question with many different values. Once the simulation is complete, the results are averaged together to provide an estimate.”

“By analyzing historical price data, you can determine the drift, standard deviation, variance, and average price movement of a security. These are the building blocks of a Monte Carlo simulation.”

https://rsample.tidymodels.org/reference/mc_cv.html

One resample of Monte Carlo cross-validation takes a random sample (without replacement) of the original data set to be used for analysis. All other data points are added to the assessment set.
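As a minimal sketch (the 25 resamples and the 0.75 proportion are arbitrary choices), we can build Monte Carlo resamples of the training data created earlier with mc_cv() and print one split to see the row counts described below:

set.seed(335)
mtcars_mc <- mc_cv(mtcars_training, prop = 0.75, times = 25)

mtcars_mc               # 25 random 75/25 resamples of the training data
mtcars_mc$splits[[1]]   # printed as <analysis rows / assessment rows / total rows>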

# When a split is printed, the first value is the number of rows in the analysis set, the second is the number of rows in the assessment set, and the third is the total number of rows in the training data before splitting again

9.3 Transformations

Natural logarithm transformations. If one or more of your variables are not normally distributed, you can transform them using the natural logarithm. Only after this transformation should you standardize all the variables, or center those that you need to center; in general, any transformation of a variable has to happen before standardizing or centering.

# standardization 

# what is a standard score: a standard score (z-score) of 2 is an observation that falls 2 standard deviations above the mean for that variable

# lasso and ridge regression are one way to deal with multicollinearity

# https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

mtcars_p <- mtcars %>%
  mutate(mpg_centered = (mpg - mean(mpg))) %>%
  mutate(mpg_scaled = scale(mpg)) %>% #  (mpg - mean(mpg)) / sd(mpg)
  dplyr::select(index, car, mpg, mpg_centered, mpg_scaled)

mtcars_p
modelone <- lm(mpg ~ wt, mtcars)
modeltwo <- lm(mpg ~ log(wt), mtcars)
modelthree <- lm(log(mpg) ~ wt, mtcars)
modelfour <- lm(log(mpg) ~ log(wt), mtcars)

# summary(modelone)$r.squared
# summary(modeltwo)$r.squared
# summary(modelthree)$r.squared
# summary(modelfour)$r.squared

# equivalent to this: 

# ggplot(mtcars_p, aes(x = log(wt), y = mpg)) + geom_point() + geom_smooth(method = 'lm')

# mtcars_p <- mtcars %>%
#   mutate(log_wt = log(wt))
#
# ggplot(mtcars_p, aes(x = log_wt, y = mpg)) + geom_point() + geom_smooth(method = 'lm')

# model is better but harder to interpret

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = 'lm') +
  labs(caption = "r-squared: 0.7528328")

q <- ggplot(mtcars, aes(x = log(wt), y = mpg)) + geom_point() + geom_smooth(method = 'lm') +
  labs(caption ="r-squared: 0.810146")

r <- ggplot(mtcars, aes(x = wt, y = log(mpg))) + geom_point() + geom_smooth(method = 'lm') +
  labs(caption ="r-squared: 0.7975582")

s <- ggplot(mtcars, aes(x = log(wt), y = log(mpg))) + geom_point() + geom_smooth(method = 'lm') +
  labs(caption ="r-squared: 0.8056487")

ggarrange(p, q, r, s)

# mtcars_p <- mtcars %>%
#   mutate(mpg_centered = (mpg - mean(mpg))) %>%
#   mutate(mpg_scaled = scale(mpg)) %>% 
#   mutate(wt_centered = (wt - mean(wt))) %>% 
#   mutate(wt_scaled = scale(wt)) 
# 
# mtcars %>%
#   recipes::step_center(all_numeric())
#   
# simple_recipe <- function(dataset){
#     recipe(price ~ ., data = dataset) %>%
#         recipes::step_center(all_numeric()) %>%
#         recipes::step_scale(all_numeric()) %>%
#         recipes::step_dummy(all_nominal())
# }


Subsetting and Shrinkage

Shrinkage and selection aim at improving upon the simple linear regression. There are two main reasons why it could need improvement:

  • Prediction accuracy: linear regression estimates tend to have low bias but can have high variance. Reducing model complexity (the number of parameters that need to be estimated) reduces the variance at the cost of introducing more bias. If we can find the sweet spot where the total error, the error from bias plus the error from variance, is minimized, we can improve the model's predictions.
  • Model interpretability: with too many predictors it is hard for a human to grasp all the relations between the variables. In some cases we would be willing to determine a small subset of variables with the strongest impact, sacrificing some detail in order to get the big picture.
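As a brief, hedged sketch of shrinkage in practice (using the glmnet package, an extra dependency and only one of several options), the lasso below fits mpg against several predictors and shrinks some coefficients exactly to zero, which is the variance-for-bias trade described above:

library(glmnet)

x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")])
y <- mtcars$mpg

set.seed(335)
lasso_cv <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 is the lasso; alpha = 0 is ridge

coef(lasso_cv, s = "lambda.min")
# coefficients shrunk to exactly zero are predictors the lasso has dropped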

Normalization

Centering

Centered independent variables are obtained just by subtracting the mean of the variable.

We center variables when we want a meaningful interpretation of the estimated constant. In this case, you can center as many of the variables as you want; you do not need to center all the independent variables in the model.

Interpretation of the coefficients of centered variables: “An increase in X1 by -number- from its mean will increase (or decrease) Y by -number-.”

Constant: “It represents the expected value of Y when the non-centered variables are zero and when the centered variables are at their mean.”

Standardizing

Standardized variables are obtained by first subtracting the mean of the variable and then by dividing by the standard deviation of the variable.

We standardize variables to facilitate the interpretation of the estimated coefficients when the variables in your regression have different units of measurement.

When standardizing, we have to standardize all the variables in the regression. This implies we won’t get an estimate of the constant (i.e., the B0 or intercept).

Interpretation coefficients standardized variables. “An increase by 1 standard deviation in X1 will increase (or decrease) Y by -number-.”

Notice that for centered and standardized variables, the r-squared stays the same but the interpretation of the coefficients is different.
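A quick check of that claim with the simple mpg ~ wt model (base R only; scale() does the standardizing):

raw_model          <- lm(mpg ~ wt, data = mtcars)
standardized_model <- lm(scale(mpg) ~ scale(wt), data = mtcars)

summary(raw_model)$r.squared           # same r-squared ...
summary(standardized_model)$r.squared  # ... as the standardized fit

coef(raw_model)           # slope: change in mpg per additional 1000 lbs
coef(standardized_model)  # slope: change in SDs of mpg per 1 SD increase in wt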