Introduction to Data Science in R
In this book, we take a well-studied public dataset called mtcars and work through it following a data science workflow. In doing so, we will develop an understanding of the R programming language, the RStudio integrated development environment (IDE), descriptive and inferential statistics, hypothesis testing, and applied predictive modeling techniques with a focus on linear regression.
This book assumes no real familiarity with the R programming language, or even much in the way of math or statistics. The ideal reader is someone who is interested in learning all of these things at once. In fact, we operate under the belief that the best way to commit statistics to memory is by coding, so there is great advantage in learning both at the same time. The goal of this book is to take someone from beginner to expert. To that end, the opening sections go into some detail on how to install the R programming language and RStudio and explain the syntax of the language. Every time a new function or process is introduced, we explain in detail what is happening in the code, and the language becomes more advanced throughout the book as the reader becomes more familiar with our particular style.
We recognize that comprehensive statistics textbooks are often written after a lifetime of experience from their authors. This book was created through the opposite process, as a learning tool. In writing it, we came to believe that, as younger authors in the field, we had a unique ability to make tangible concepts that are inherently difficult. Data science requires expertise across multiple fields, and we approach it from the perspective of someone in need of validated resources: we use well-studied datasets and check our results against a variety of other resources.
We chose the mtcars dataset because it is well studied, has been analyzed many times, and much of that analysis is freely available online. Indeed, much of the already published work available through RPubs, Stack Overflow, Medium, Twitter, and other such sources has informed what is in this book, which began primarily as a personal learning resource before transitioning into a tutoring document and eventually becoming a full-length book.
We encourage anyone to take the methods or ideas in this book and check them against other work using the mtcars dataset, or to reproduce the methods in other fields. This is a clean dataset; real-life datasets will be messier and require a different level of skill and subtlety. This book should therefore be considered a sort of compilation of much of the work that has already been done to help understand mtcars. We provide specific attributions where they apply, in addition to a more general attribution to all of the authors on various websites whom we have not been able to thank explicitly.
1 Getting Started
Our first project is to install both the R programming language and the RStudio integrated development environment (IDE). Instructions for how to install both of these are found on the R programming language website at the following url: https://www.r-project.org
- R website
- RStudio website
1.1 RStudio
1.2 The R Programming Language
R is a language and environment for statistical computing and graphics. R is similar to the S language and environment and can be considered as a different implementation of S. R provides a wide variety of statistical and graphical techniques and is highly extensible. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as Free Software. It compiles and runs on a wide variety of platforms and systems including Linux, Windows and MacOS.
R is a flexible language that is object-oriented and thus allows the manipulation of complex data structures in a condensed and efficient manner.
Data Types
R uses a variety of data types:
- scalars
- vectors
- lists
- matrices
- dataframes
We will look at each of these data types in turn.
scalars
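As a quick sketch (the value here is arbitrary), a scalar in R is simply a vector of length one:
x <- 42
length(x)   # 1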
vectors
We can refer to elements of a vector using subscripts.
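For example (a minimal sketch with made-up values), we can build a vector with c() and pull out elements with square-bracket subscripts:
v <- c(21.0, 22.8, 18.7)   # a numeric vector
v[2]         # the second element, 22.8
v[c(1, 3)]   # the first and third elements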
lists
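A minimal sketch of a list, which can hold elements of different types and lengths, accessed with $ or double brackets:
l <- list(car = "Mazda RX4", mpg = 21.0, gears = c(3, 4, 5))
l$mpg        # 21
l[["car"]]   # "Mazda RX4"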
matrices
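A minimal sketch of a matrix, a two-dimensional structure that holds a single data type and is indexed by [row, column]:
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
m[2, 3]   # 6
dim(m)    # 2 3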
arrays
Arrays are similar to matrices but can have more than two dimensions. See help(array) for details.
dataframes
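A minimal sketch of a dataframe, whose columns can hold different data types (the values come from mtcars):
df <- data.frame(car = c("Mazda RX4", "Datsun 710"),
                 mpg = c(21.0, 22.8),
                 cyl = c(6, 4))
df$mpg   # a single column of a dataframe is a vector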
functions
Functions in R follow this form.
function_name <- function(arg1, arg2, ..., argn) { put code here }
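As a concrete sketch (convert_mpg_to_kpl is a hypothetical helper defined only for illustration):
convert_mpg_to_kpl <- function(mpg) {
  # one mile per US gallon is roughly 0.425 kilometres per litre
  mpg * 0.425144
}
convert_mpg_to_kpl(21.0)   # about 8.93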
1.3 Our Libraries
After we have successfully installed RStudio, our next project is to import the libraries that we will be using. For this project, we will lean heavily on functions from the tidyverse and tidymodels libraries. We will follow a tidy workflow.
There are three rules which make a dataset tidy (a small example follows the list):
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell
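As a small sketch of what a tidy table looks like (the values come from mtcars), each column is a variable, each row is an observation, and each cell holds a single value:
tidy_example <- tibble::tibble(
  car = c("Mazda RX4", "Datsun 710"),
  mpg = c(21.0, 22.8),
  cyl = c(6, 4)
)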
Tidyverse and Tidymodels
Some libraries are larger than others and contain multiple packages. The tidyverse and tidymodels libraries each contain many packages, each with its own functions. The packages are grouped together because they have common developers and work with the same philosophy and methodology.
In the following lines of code, a hash symbol # before a line of code is a way of commenting it out. This allows us to leave the code in place while making sure it doesn't run when we compile our project. For example, we don't need to install the tidyverse package every time we compile our project, but we have left the code in for the sake of learning. Once a package is installed, there is no need to run these lines again. After the packages are installed, we can bring them into our local environment by calling library() with the package name.
library(tidyverse) includes the following libraries that
we will be using:
library(tibble)
library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)
library(broom)
library(purrr)
library(tidymodels) includes the following libraries
that we will be using:
library(infer)
library(rsample)
library(parsnip)
# install.packages('tidyverse')
# install.packages('tidymodels')
library(tidyverse)
library(infer)
library(broom)
library(rsample)
#library(tidymodels)
library(gridExtra)
library(car)
library(ggpubr)
library(rmdformats)
library(ggfortify)
library(ggridges)
library(moments)
library(MetBrewer)
library(modeest)
library(GGally)
library(ggforce)
library(MASS)
library(quantreg)
library(mblm)
library(parsnip)
library(reticulate) # python
library(dbplyr) # SQL
library(RSQLite) # SQL
library(DBI) # SQL
library(sqldf) # SQL

hiro_red <- "#e76254"
hiro_orange <- "#ef8a47"
hiro_light_orange <- "#f7aa58"
hiro_yellow <- "#ffd06f"
hiro_light_yellow <- "#ffe6b7"
hiro_light_teal <- "#aadce0"
hiro_teal <- "#72bcd5"
hiro_blue <- "#528fad"
hiro_other_blue <- "#376795"
hiro_dark <- "#1e466e"
number_ticks <- function(n) {function(limits) pretty(limits, n)}

# py_install("pandas")
# py_install("matplotlib")
# py_install("numpy")
# py_install("sklearn.linear_model")
# py_install("statsmodels")
# py_install("seaborn")1.4 The Grammar of Graphics
Applied to visualizations, a grammar of graphics is a grammar used to describe and create a wide range of statistical graphics. The layered grammar of graphics approach is implemented in ggplot2, a widely used graphics library for R.
2 Data Munging
Part of the aim of this book is to become more familiar with the data science workflow. We can think of our workflow as following the steps below, though the lines are blurry: different data scientists may prefer a different order, for better or worse reasons, and working through a project often involves revisiting previous steps.
- Load libraries
- Import the data
- Explore the data
- Clean the data
- Model the data
- Build reports
- Visualizations
- Statistics
For this book, we have decided to explore a well-studied public dataset called mtcars. We have chosen this dataset because it is accessible to audiences of all skill levels. We will show in our project how we can get a deep understanding of different statistical methods from a small dataset. Our plan, therefore, is to get a lot from just a little. Our goal is to build a comprehensive repository of code and techniques that can become a useful guide for future projects.
2.1 Investigating Our Data
To find mtcars, we use the data() function
in our console to pull up a comprehensive list of datasets that exist in
the background of the RStudio environment. We can bring
mtcars into our local environment by declaring and saving
it as a variable in the following way: mtcars <- mtcars.
The left and right assignment arrows <- and -> are very common in R. We can also type either help(mtcars) or
?mtcars in order to pull up documentation about the dataset
in our RStudio environment. When we type help(mtcars) we
can read, among other things, the following description of our
dataset:
“The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).”
# data()
# ?mtcars
# help(mtcars)
mtcars <- mtcars
mtcars

colnames(mtcars)

## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
mtcars

After we have saved mtcars in our local environment, we can explore our dataset with common functions that give us a high-level picture of what the data is like. These functions are built into the R programming language, which means that they are always available even if no additional packages are installed or loaded. Using the syntax of the R language without the help of additional packages is referred to as using "base R".
class()
head() / tail()
str() / glimpse()
length() / nrow() / dim()
summary()
names() / colnames() / rownames()
class
Our first function is the class() function from base R.
Looking at the class of mtcars is a good place to start: It
is important for us to understand the class of the object that we are
working with in order to understand how it will behave. We see that
mtcars is a dataframe type. We choose to
convert it to a tibble, which is preferred.
Tibbles are preferred for a couple of reasons:
- Tibbles have a refined print method that shows only the first 10 rows, as well as all the columns that fit on screen, which makes them neater in the console.
- Tibbles behave more predictably with functions: they don't automatically change data types, which prevents surprises.
In our case, the as_tibble() function has an additional useful argument. We use the rownames argument to turn the index column of car names into a regular column that can be used in visualizations. After converting mtcars from a dataframe to a tibble and setting rownames = 'car', we can see that mtcars has 12 variables or columns instead of 11.
In the following example, we call a function from a specific package by making the package name explicit with two colons, as in tibble::as_tibble, which is taken to mean the as_tibble function from the tibble package. We might choose to explicitly call a
function from a package if we don’t want to load the package into our
more general environment, or else if we are worried that two different
packages may have a function with the same name. This is particularly
useful if two packages that are loaded have functions that have the same
name but the behavior of each function is different in the different
packages. A common example is the select function which
exists in both the plyr and dplyr packages.
For this reason, it’s not uncommon to see dplyr::select, as
in the example below.
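As a brief sketch of this pattern (the column choice here is arbitrary), calling select() with an explicit namespace removes any ambiguity about which package's version runs:
# use dplyr's select() even if another attached package also defines select()
dplyr::select(mtcars, mpg, cyl, wt)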
# class(mtcars)
remove(mtcars)
mtcars <- tibble::as_tibble(mtcars, rownames = 'car')

head / tail
We next use the head() and tail() functions
to look at the first and last rows of our dataset, respectively. The
default number of rows that will be printed to our console is six but we
can change this. In our case, we decide instead to print three lines to
our console by specifying n = 3. We can see that the very
first car in our dataframe is Mazda RX4 and the very last car in our
dataframe is Volvo 142E.
head(mtcars, n = 3)
tail(mtcars, n = 3)

glimpse
We next want to inspect the structure of the data frame. To do this we can either use the str() function in base R, or else we can use glimpse() from the tibble package. The glimpse function is preferable because it prints the output in a neater way. Both of these functions let us see the data types of all of our columns.
# View(mtcars)
# str(mtcars)
tibble::glimpse(mtcars)

## Rows: 32
## Columns: 12
## $ car <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Ho…
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
dimension
We use the length(), nrow(), and dim() functions to inspect the mtcars dataset by looking at the number of columns/variables and the number of rows/observations. We can see that we have 12 variables and 32 observations.
We can call length(), nrow(), or dim() on the dataframe itself, and we can also call length() on a specific column. We refer to specific columns in a
dataset by using the $ operator. For example,
mtcars refers to the entire dataframe and
mtcars$mpg will refer to the mpg column in the
mtcars dataframe or tibble.
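As a minimal illustration of the $ operator (the values shown come from the glimpse() output above):
head(mtcars$mpg, n = 3)   # 21.0 21.0 22.8
length(mtcars$mpg)        # 32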
We also see the following text above our table: A tibble: 32 x 12. This means that we are looking at a 32 x 12 table, which has 32 observations and 12 variables. Users of Excel might think of this in terms of rows and columns: a table of 32 rows and 12 columns. The corresponding language in R, instead of rows and columns, is to talk about observations and variables.
# length(mtcars) # 12 variables
# nrow(mtcars) # 32 observations
# dim(mtcars)
# unique(mtcars$car) # there are 32 unique car values
# length(mtcars$car)
# length(unique(mtcars$car))

summary
We use the summary() function to report summary statistics on each of the columns to get a better understanding of the distribution of variables. It's important to know that the summary() function produces a different output depending on the data type of each column. We see that the summary for the car variable shows only the number of observations and the class, whereas the summary for the other variables, which are numeric, prints the min, first quartile, median, mean, third quartile, and max value.
summary(mtcars)

## car mpg cyl disp
## Length:32 Min. :10.40 Min. :4.000 Min. : 71.1
## Class :character 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8
## Mode :character Median :19.20 Median :6.000 Median :196.3
## Mean :20.09 Mean :6.188 Mean :230.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0
## Max. :33.90 Max. :8.000 Max. :472.0
## hp drat wt qsec
## Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
## 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89
## Median :123.0 Median :3.695 Median :3.325 Median :17.71
## Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85
## 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
## Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
## vs am gear carb
## Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
colnames / rownames
We can use two additional base R functions, colnames()
and rownames(), to look at the column and row names in our
dataset. names(mtcars) is equivalent to
colnames(mtcars).
# names(mtcars)
colnames(mtcars)

## [1] "car" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
## [11] "gear" "carb"
rownames(mtcars)

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32"
tibbles and tribbles
We can also create data tables from scratch. For example, we might research the country of origin of each of the cars in our dataset. One option would be to create another table in a tool like Excel; another is to create our own table with code in RStudio.
One common method is to create two vectors and put them together using either the as.data.frame() or tibble() functions. In this case we create a new tibble which we call cars_origin_tbl.
car <- c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", "Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", "Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", "Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", "Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2", "Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", "Volvo 142E")
origin <- c("Japan", "Japan", "Japan", "United States", "United States", "United States", "United States", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany", "Germany", "United States", "United States", "United States", "Italy", "Japan", "Japan", "Japan", "United States", "United States", "United States", "United States", "Italy", "Germany", "British", "United States", "Italy", "Italy", "Sweden")
# cars_origin_df <- data.frame(car, origin)
cars_origin_tbl <- tibble(car, origin)
head(cars_origin_tbl, n = 3)

Another method of creating a tibble is to create a row-wise tibble called a tribble. This row-wise tribble is the same as the tibble that we created; the only difference is the method in which we created it. Because it is a redundant dataframe, we can remove it from our environment using the remove() function.
cars_origin_trbl <- tribble(
~car, ~origin,
"Mazda RX4", "Japan",
"Mazda RX4 Wag", "Japan",
"Datsun 710", "Japan",
"Hornet 4 Drive", "United States",
"Hornet Sportabout", "United States",
"Valiant", "United States",
"Duster 360", "United States",
"Merc 240D", "Germany",
"Merc 230", "Germany",
"Merc 280", "Germany",
"Merc 280C", "Germany",
"Merc 450SE", "Germany",
"Merc 450SL", "Germany",
"Merc 450SLC", "Germany",
"Cadillac Fleetwood", "United States",
"Lincoln Continental", "United States",
"Chrysler Imperial", "United States",
"Fiat 128", "Italy",
"Honda Civic", "Japan",
"Toyota Corolla", "Japan",
"Toyota Corona", "Japan",
"Dodge Challenger", "United States",
"AMC Javelin", "United States",
"Camaro Z28", "United States",
"Pontiac Firebird", "United States",
"Fiat X1-9", "Italy",
"Porsche 914-2", "Germany",
"Lotus Europa", "British",
"Ford Pantera L", "United States",
"Ferrari Dino", "Italy",
"Maserati Bora", "Italy",
"Volvo 142E", "Sweden"
)
remove(cars_origin_trbl)

indexes
We can also add an index column to our tibble by using the rowid_to_column() function from the tibble package.
mtcars <- tibble::rowid_to_column(mtcars, "index")
mtcars

sql_mtcars <- mtcars
# create data base connection
con <- RSQLite::dbConnect(SQLite(), ":memory:")
# copy tables to data base
dplyr::copy_to(con, sql_mtcars)
dplyr::copy_to(con, cars_origin_tbl)
# check contents of data base
dbListTables(con)

## [1] "cars_origin_tbl" "sql_mtcars" "sqlite_stat1" "sqlite_stat4"
# create local pointer to remote table
sql_mtcars_db <- tbl(con, "sql_mtcars")
cars_origin_tbl_db <- tbl(con, "cars_origin_tbl")
# show the SQL that dplyr generates for a simple column selection
tbl(con, "sql_mtcars") %>%
dplyr::select(car, mpg, wt) %>%
dplyr::show_query()

## <SQL>
## SELECT `car`, `mpg`, `wt`
## FROM `sql_mtcars`
tbl(con, "sql_mtcars") %>%
dplyr::mutate(car2 = car) %>%
dplyr::show_query()

## <SQL>
## SELECT *, `car` AS `car2`
## FROM `sql_mtcars`
sqldf("SELECT *,
row_number() over (partition by cyl) AS rn
FROM mtcars")tbl(con, "sql_mtcars") %>%
dplyr::select(mpg, cyl, disp) %>%
# dplyr::mutate(car2 = car) %>%
dplyr::show_query()

## <SQL>
## SELECT `mpg`, `cyl`, `disp`
## FROM `sql_mtcars`
SELECT `car`, `mpg`, `wt`
FROM `sql_mtcars`

| car | mpg | wt |
|---|---|---|
| Mazda RX4 | 21.0 | 2.620 |
| Mazda RX4 Wag | 21.0 | 2.875 |
| Datsun 710 | 22.8 | 2.320 |
| Hornet 4 Drive | 21.4 | 3.215 |
| Hornet Sportabout | 18.7 | 3.440 |
| Valiant | 18.1 | 3.460 |
| Duster 360 | 14.3 | 3.570 |
| Merc 240D | 24.4 | 3.190 |
| Merc 230 | 22.8 | 3.150 |
| Merc 280 | 19.2 | 3.440 |
2.2 Joining New Datasets
For the sake of being thorough, we will practice every different kind of two-table join. Two-table joins fall into two categories:
- mutating joins
- filtering joins
A mutating join combines variables from two tables. It does this by matching observations by their common variables, called keys, and copying variables from one table to the other. A filtering join, on the other hand, filters the rows of one table based on whether or not they match an observation in the other table.
In order to practice the different kinds of two-table joins, we first arbitrarily remove one of the observations in our new cars_origin_tbl. In this case, we set both the car name and the origin for the Datsun 710 row to NA, pretending we do not know that the Datsun is Japanese. Removing these values will make it clear what is happening with the different joins.
cars_origin_tbl$origin[cars_origin_tbl$car == "Datsun 710"] <- NA
cars_origin_tbl$car[cars_origin_tbl$car == "Datsun 710"] <- NA
head(cars_origin_tbl)

Mutating Joins
full_join
Our first try is to use a full_join from
dplyr.
If we look at the head of our new dataframe we see that Datsun was
kept in the original dataframe and then matched to an NA
value from the new dataframe.
However, if we look at the tail of our new dataframe we see that there is a new row that is NA all the way through. This is because full_join keeps everything from the left-hand side and also keeps everything from the right-hand side, including the right-hand row that is NA all the way through (where car == NA and origin == NA). We look at the dimensions of our new dataframe: it has 13 columns (the original 12 plus the origin column) and 33 rows (the original 32 plus the new NA observation joined at the bottom).
mtcars_full_join <- dplyr::full_join(mtcars, cars_origin_tbl, by = 'car')
head(mtcars_full_join, n = 3)
tail(mtcars_full_join, n = 2)
# length(mtcars_full_join) # 13
# nrow(mtcars_full_join) # 33
remove(mtcars_full_join)

left_join
Our next try is to use a left_join from
dplyr.
left_join() includes all records from the left table and matched rows from the right table.
If we look at the head of our new dataframe we see that “Datsun” was kept in the original dataframe and then matched to an NA value from the new dataframe.
Unlike full_join, there is no additional row added for
the NA value from the ‘car’ variable in the second dataframe.
We look at the dimensions of our new dataframe: it has 13 columns (the original 12 plus the origin column) and only the 32 rows from the original dataframe.
mtcars_left_join <- left_join(mtcars, cars_origin_tbl, by = 'car')
# an alternate way to write the code
mtcars %>% dplyr::left_join(cars_origin_tbl, by = 'car') -> mtcars_left_join
head(mtcars_left_join, n = 3)
tail(mtcars_left_join, n = 2)
# length(mtcars_left_join) # 13
# nrow(mtcars_left_join) # 32

right_join
Our next try is to use a right_join from dplyr.
right_join differs from left_join in which unmatched rows are kept: whereas left_join includes all records from the left table and matched rows from the right table, right_join returns all rows from the right table and matched rows from the left table.
In the result, “Datsun” is missing because it was not in the
right-hand table. However, a new NA row was included.
mtcars_right_join <- right_join(mtcars, cars_origin_tbl, by = 'car')
# an alternate way to write the code
mtcars %>% dplyr::right_join(cars_origin_tbl, by = 'car') -> mtcars_right_join
head(mtcars_right_join, n = 3)
tail(mtcars_right_join, n = 2)
# length(mtcars_right_join) # 13
# nrow(mtcars_right_join) # 32
remove(mtcars_right_join)
inner_join
An inner_join in R is a merge operation between two data frames that returns all of the rows that match in both tables. You need to specify a common key for R to use to match the data elements.
mtcars_inner_join <- inner_join(mtcars, cars_origin_tbl, by = 'car')
# an alternate way to write the code
mtcars %>% dplyr::inner_join(cars_origin_tbl, by = 'car') -> mtcars_inner_join
# nrow(mtcars_inner_join) # only 31
# length(mtcars_inner_join) # 13
head(mtcars_inner_join, n = 3)
tail(mtcars_inner_join, n = 2)
remove(mtcars_inner_join)

Filtering Joins

semi_join
A semi_join returns the rows of the first table where it
can find a match in the second table.
mtcars_semi_join <- semi_join(mtcars, cars_origin_tbl, by = 'car')
# an alternate way to write the code
mtcars %>% dplyr::semi_join(cars_origin_tbl, by = 'car') -> mtcars_semi_join
# nrow(mtcars_semi_join) # only 31
# length(mtcars_semi_join) # only 12
# another method using value matching
mtcars_semi_join <- mtcars %>%
filter(car %in% cars_origin_tbl$car)
# nrow(mtcars_semi_join) # only 31
# length(mtcars_semi_join) # only 12
head(mtcars_semi_join, n = 3)
tail(mtcars_semi_join, n = 2)
remove(mtcars_semi_join)

anti_join
An anti_join returns the rows of the first table where it cannot find a match in the second table. Anti joins are a type of filtering join, since they return the contents of the first table, but with their rows filtered depending upon the match conditions. The rows returned by semi_join(mtcars, cars_origin_tbl) plus the rows returned by anti_join(mtcars, cars_origin_tbl) together make up all of mtcars.
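As a quick check of that claim (a sketch; the row counts match the commented output below):
# 31 semi_join rows plus 1 anti_join row account for all 32 rows of mtcars
nrow(semi_join(mtcars, cars_origin_tbl, by = 'car')) +
  nrow(anti_join(mtcars, cars_origin_tbl, by = 'car')) == nrow(mtcars)   # TRUE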
mtcars_anti_join <- anti_join(mtcars, cars_origin_tbl, by = 'car') # this
# an alternate way to write the code
mtcars %>% dplyr::anti_join(cars_origin_tbl)

# length(mtcars_anti_join) # 12
# nrow(mtcars_anti_join) # only 1
remove(mtcars_anti_join)
# what about joining a table on more than one category?
-- semi join
SELECT * FROM `sql_mtcars`
WHERE EXISTS (
SELECT 1 FROM `cars_origin_tbl`
WHERE (sql_mtcars.car = cars_origin_tbl.car)
)

| index | car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| 2 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 3 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| 4 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| 6 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
| 7 | Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
| 8 | Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
| 9 | Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
| 10 | Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
-- left join
SELECT `index`, `sql_mtcars`.`car` AS `car`, `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `origin`
FROM `sql_mtcars`
LEFT JOIN `cars_origin_tbl`
ON (`sql_mtcars`.`car` = `cars_origin_tbl`.`car`)
/*
SELECT `index`, `sql_mtcars`.`car` AS `car`, `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `origin`
FROM `sql_mtcars`
LEFT JOIN `cars_origin_tbl` USING `car`
*/

| index | car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | origin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | Japan |
| 2 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | Japan |
| 3 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | Japan |
| 4 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | United States |
| 5 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 | United States |
| 6 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 | United States |
| 7 | Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 | United States |
| 8 | Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 | Germany |
| 9 | Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 | Germany |
| 10 | Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 | Germany |
2.3 Formatting Our Data
Data Types
After we investigate our data to understand the data types of each of our column vectors, we also want to pay special attention to how those data types relate to our research questions. In particular, we can think of our variables as belonging to two distinct categories: numeric and categorical.
- Numeric variables have a measurement with a numerical meaning.
- Categorical variables have a measurement with a categorical meaning.
It’s important to understand this distinction because R may read a
column vector as numerical when the everyday meaning may be better
understood as a category. For example, the cyl column in
mtcars has three unique values: 4, 6, and 8.
Mtcars stores this column vector as a numeric data type
because R is recognizing 4, 6, and 8 as numbers. However, the true
meaning here is the type of engine: 4, 6, or 8 cylinders. For this
reason, it is best to convert the cyl column vector to a
factor data type.
Mtcars documentation helps us decide.
We rename mtcars_left_join back to mtcars, just to keep things simple.
mtcars_left_join$origin[mtcars_left_join$car == "Datsun 710"] <- "Japan"
mtcars <- mtcars_left_join
mtcars <- mtcars %>%
mutate(am = as.factor(am), cyl = as.factor(cyl), gear = as.factor(gear), vs = as.factor(vs), carb = as.factor(carb))
# split the data frame into two: one dataframe for factor (categorical) variables and one for numeric variables
mtcars_factor <- mtcars %>%
dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb)
mtcars_numeric <- mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec)

Data Structure
Wide-format data is easier to read and interpret, but long-format data is better for database structures. The tidyr package has useful functions to transform data in this way. To convert from wide to long, sometimes referred to as 'lengthening' our data, we use the pivot_longer function, which increases the number of rows and decreases the number of columns. To convert from long to wide we use pivot_wider. pivot_longer and pivot_wider were previously called gather and spread, and before that the equivalent functions were melt and cast.
mtcars_long_numeric <- pivot_longer(mtcars_numeric, names_to = 'names', values_to = 'values', 4:8)
mtcars_long_numeric_with_mpg <- pivot_longer(mtcars_numeric, names_to = 'names', values_to = 'values', 3:8)
# nrow(mtcars_long_numeric) # 160 observations
mtcars_numeric <- pivot_wider(mtcars_long_numeric, names_from = 'names', values_from = 'values')
# spread vs gather
# melt vs
# mtcars_long <- mtcars %>%
# tidyr::pivot_longer(everything(), names_to = 'names', values_to = 'values')
remove(mtcars_factor)
remove(mtcars_numeric)
remove(mtcars_left_join)
mtcars %>%
dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
head(n = 3)

mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
head(n = 3)

mtcars

write.csv(mtcars, '/Users/josefwaples/Desktop/mtcars.csv', row.names = FALSE)

2.4 Missing Values
Types of Missingness
We check to see if there are any NA values in our dataset.
We keep in mind that values can be missing for different reasons.
- Missing at random (MAR)
- Missing completely at random (MCAR)
- Missing not at random (MNAR)
With MAR, the probability of being missing is the same only within groups defined by the observed data. With MCAR, the probability of being missing is the same for all cases; this implies that the causes of the missing data are unrelated to the data. With MNAR, the probability of being missing varies for reasons that are unknown to us.
Imputation
Determining the cause of missingness is important in making decisions about imputation. We may consider, for example, median or mode imputation.
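A minimal sketch of median imputation (impute_median is a hypothetical helper defined only for illustration; as we see below, mtcars has no NA values, so it is not needed here):
impute_median <- function(x) {
  # replace NA values with the median of the observed values
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
impute_median(c(1, NA, 3))   # returns 1 2 3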
# Let's check for NA values
any(is.na(mtcars)) # FALSE

## [1] FALSE
# is.na(mtcars) # this might help us check the exact location of the NA values.

3 Exploratory Data Analysis
Exploratory Data Analysis is the process of doing first investigations in order to check assumptions, uncover patterns, and discover anomalies. Exploratory Data Analysis can take the form of summary statistics and visualizations. It’s often useful to start exploring the characteristics of each individual variable with univariate visualizations and statistics before exploring the relationships between variables with bivariate visualizations and statistics.
3.1 Common Visualizations
When examining a dataset, it can be challenging to figure out the most appropriate visualization for any particular story, and it's easy to get stuck considering many options. It's important to remember that there is no perfect answer when visualizing data. Every visualization is part of a larger story and always has to be understood in context, and we bias our data at every step of the process. Our lodestar is our hypothesis: we always keep our hypothesis or goal in mind. The following points can serve as a good heuristic for deciding which visualization to use. There are many kinds of visualizations, so this is not a comprehensive list, just a guide.
- Histograms, density plots, and boxplots are useful in showing the distribution or spread of a variable.
- Boxplots, bar plots and dot plots are useful in comparing values across groups.
- Pie charts and donut charts are useful in showing composition.
- Scatterplots are useful in showing relationships between continuous variables.
We can create these visualizations by using the ggplot2
package from the tidyverse set of packages. There is a
particular philosophy that accompanies ggplot2. With the
ggplot() system, we supply a dataset, choose aesthetic
mappings, add layers and customize scales, coordinates, facets,
graphical primitives, and many other things. In addition to being highly
extensible, ggplot() also can be paired easily with
companion packages.
In the following three-paneled visualization, we show the successive
process of adding layers while using ggplot2. In the first
case, we supply only a dataset to the ggplot() function.
Without supplying any other arguments, we see a blank canvas. In the
second case, we add an x-axis aesthetic mapping. In the third case, we
also add a geom layer in the form of a histogram. We can link these
three separate visualizations together by using a library called
patchwork which is one of many ggplot2
extensions. In order to link these visualizations together, we have to save each visualization as its own variable.
In the following visualization, we also include a few extra lines of
code in order to make our graphs look nice and to standardize our style.
We add a title and subtitle to our visual by using the
title and subtitle arguments within the
labs layer of ggplot(). We also use the theme layer to center the title and subtitle and make the title bold by specifying face = "bold" and hjust = 0.5.
one <- ggplot2::ggplot(data = mtcars) +
labs(title = "Mtcars", subtitle = "Blank Canvas") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
two <- ggplot2::ggplot(data = mtcars, aes(x = mpg)) +
labs(title = "Mtcars", subtitle = "+ Aesthetic Mapping") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
three <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
geom_histogram() +
labs(title = "Mtcars", subtitle = "+ Geom Layer") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
one + two + three

# `ggtitle()` is a useful shortcut.
?after_stat(count)

Distribution
Histograms
Univariate visualizations such as histograms are often a good place
to start because they give us an opportunity to explore the
distributions of the variables in our dataset. Structurally,
ggplot2 requires two position scales for the x and y
aesthetics. Because histograms are a univariate visualization, this
means that only our x aesthetic is found as a variable in our data. Our
y aesthetic will be a computed variable. We can set up our histogram by setting our x aesthetic as our variable and our y aesthetic by referencing a computed statistic: y = after_stat(count).
What is after_stat(count) really doing in our code? after_stat is telling ggplot() to delay the mapping until later in the rendering process: the height of the histogram bars is mapped to the count that geom_histogram() computes when it bins the data, controlled by the bins argument. As a default, the bins argument is set to 30. In many cases, the histogram will work well without having to adjust the default number of bins. In our case, however, because we know that our mtcars dataset is 32 observations, we can specify bins = 32 as an argument of geom_histogram() and override the default setting. Doing this may make our visualization a little bit easier to interpret.
binwidth is another option: instead of specifying the number of bins, we can specify the width of each bin directly.
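As a small sketch, the following histogram uses binwidth instead of bins (a binwidth of 1 mpg is an arbitrary choice for illustration):
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1)   # one bar per 1-mpg interval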
In the following example, we also set the color and
fill arguments with geom_histogram() to change
the border color of the histogram columns and the color inside the
columns, respectively. We specify the color by using a hexadecimal color code; there are other ways to specify colors.
one <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#ffe6b7') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
two <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#e76254') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_x_reverse()
three <- ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#ef8a47') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_y_reverse()
# The second and third graphs would likely never be used. We use them only to show how the `scale_x_reverse()` and `scale_y_reverse()` functions work within `ggplot()`, and also to show how to customize different colors.
library(patchwork)
one

We can also pivot our data frame into a long data frame and use a
technique called faceting to create histograms for multiple variables
side-by-side. In the first case, we use facet_grid() and in
the second case we use facet_wrap(). The main difference is
that facet_grid normalizes the y-axis scale which obviates
the need for having y-axis ticks for every panel. Mostly, the aesthetic
difference between the facet_grid() and
facet_wrap() is a consequence of this design choice. Notice
that in order to create histograms side-by-side, we first have to
convert our wide dataframe into a long dataframe using the
pivot_longer() function from the tidyr
library. Finally, we choose to add another layer to our
ggplot2 object by creating a caption on our graph.
In the following code, we continue to create histograms by using
geom_histogram(). This time, however, we don’t specify
y = after_stat(count) in the aesthetic mapping. In fact, we
don’t specify any y aesthetic explicitly and ggplot() knows
how to handle this and our code will compile all the same.
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, fill = names)) +
facet_grid(~names, scales = 'free') + # facet_grid automatically has to define a y scale
geom_histogram(bins = 32, color = 'black', show.legend = FALSE) +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = 'facet_grid') +
scale_fill_manual(values=met.brewer("Hiroshige", 6))

mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, color = names)) +
facet_wrap(~names, scales = 'free_x') + # facet_wrap with free_x gives each panel its own x scale
geom_histogram(bins = 32, fill = 'black', show.legend = FALSE) +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = 'facet_wrap') +
scale_color_manual(values=met.brewer("Hiroshige", 6))

# ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
# geom_histogram(bins = 1, color = 'black', fill = '#ffe6b7') +
# labs(title = "Mtcars", subtitle = "Histogram") +
# theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
# theme(plot.subtitle = element_text(hjust = 0.5))
#
# ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
# geom_histogram(bins = 10, color = 'black', fill = '#ffe6b7') +
# labs(title = "Mtcars", subtitle = "Histogram") +
# theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
# theme(plot.subtitle = element_text(hjust = 0.5))
#
# ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
# geom_histogram(bins = 5000, color = 'black', fill = '#ffe6b7') +
# labs(title = "Mtcars", subtitle = "Histogram") +
# theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
# theme(plot.subtitle = element_text(hjust = 0.5))

Density Plots
In addition to creating histograms, we can also create density plots. Density plots are similar to histograms except with smoothed edges. The amount of smoothing is controlled by the bandwidth: by default, geom_density() chooses a bandwidth automatically from the data, but we can set it ourselves through the bw argument. Here we set the bandwidth to the standard deviation of the mpg column, sd(mtcars$mpg), which is 6.026948; a larger bandwidth produces a smoother curve. In our facet_wrap() version, we specify an alpha value of 0.25, which is a way of adding transparency to our visualization. Alpha here refers to opacity; values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.
ggplot(data = mtcars, aes(x = mpg)) +
geom_density(color = 'black', fill = '#ffe6b7', bw = sd(mtcars$mpg)) +
labs(title = "Mtcars", subtitle = "Density Plot") +
theme(plot.title = element_text(face = "bold"))

mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, color = names)) +
facet_grid(~names, scales = 'free') + # facet_grid automatically has to define a y scale
geom_density(fill = 'black', show.legend = FALSE) +
labs(title = "Mtcars", subtitle = "Density Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_color_manual(values=met.brewer("Hiroshige", 6))

mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, fill = names)) +
facet_wrap(~names, scales = 'free') +
geom_density(color = 'black', alpha = 0.25, show.legend = FALSE) +
labs(title = "Mtcars", subtitle = "Density Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 6))

Notice that in the above visualizations, our
facet_grid() has a disadvantage with our density plots: the relative density for the disp and hp variables at any given point is almost zero, because those variables sit on much wider x-axis scales. Since facet_grid() shares a common y-axis scale across the facets in a row, the disp and hp panels appear flat in our visualization, almost as though we forgot to chart them. For this reason, facet_grid should be used with care when the variables would need different y-axis scales.
We have another option if we lean on a library called ggridges. The geom_density_ridges() function in the ggridges library is another way to layer density plots on top of each other. However, if the x-axis values are very different between groups, some ridges may look artificially flat or smooth, and if we zoom the x-axis in far enough to see the variation in one group, other groups may be cut off. For this reason, we include versions below with different x-axis scales to illustrate the limitations of each, so care should be used.
one <- mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, y = names)) +
geom_density()
two <- mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, y = names)) +
ggridges::geom_density_ridges()
three <- mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, y = names)) +
ggridges::geom_density_ridges() + xlim(0, 40)
library(patchwork)
one + two + three

Comparison
Dotplots
A dotplot is a way of representing each observation as a point along an axis. In our case, we create one dot for each car to show the mpg associated with that car along a number line. We do this in order to get a sense of the trend, grouping, and dispersion in the data.
We can also use facet_wrap() to create a dotplot for each numeric variable in our dataframe. We combine facet_wrap() with the reorder_within() function from the tidytext library in order to reorder the cars within each facet. A downside of this technique is that the font becomes hard to read when we include all the variables.
ggplot(mtcars, aes(x = mpg, y = reorder(car, mpg))) +
geom_point(color = '#e76254') +
labs(title = "Mtcars", subtitle = "Dot Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))

mtcars %>%
dplyr::select(car, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 2:7) %>%
mutate(car = tidytext::reorder_within(car, values, names)) %>%
ggplot(aes(x = values, y = reorder(car, values), color = names)) +
geom_point() +
labs(title = "Mtcars", subtitle = "Dot Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
tidytext::scale_y_reordered() +
facet_wrap(~names, scales = 'free')+
theme(text = element_text(size=6)) +
scale_color_manual(values=met.brewer("Hiroshige", 6)) +
theme(legend.position = 'none')

A lollipop plot consists of a marker and a stem.
ggplot(mtcars, aes(x = mpg, y = reorder(car, mpg))) +
geom_point(color = '#ef8a47') +
labs(title = "Mtcars", subtitle = "Dot Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_segment( aes(x=mpg, xend=0, y=car, yend=car), color="#ef8a47")

mtcars %>%
dplyr::select(car, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 2:7) %>%
mutate(car = tidytext::reorder_within(car, values, names)) %>%
ggplot(aes(x = values, y = reorder(car, values), color = names)) +
geom_point() +
labs(title = "Mtcars", subtitle = "Dot Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
tidytext::scale_y_reordered() +
facet_wrap(~names, scales = 'free')+
theme(text = element_text(size=6)) +
scale_color_manual(values=met.brewer("Hiroshige", 6)) +
theme(legend.position = 'none') +
geom_segment( aes(x=values, xend=0, y=car, yend=car), color='black')
plt.close()
plt.cla()
plt.clf()
mtcars['car'] = mtcars['mpg'].rank(ascending=True)
seaborn.scatterplot(data = mtcars, x="mpg", y="car", color = '#ef8a47')
plt.title("dot plot", loc = 'left')
plt.xlabel("mpg")
plt.ylabel("car")
plt.style.use('ggplot')
plt.tight_layout()
plt.hlines(y = mtcars['car'], xmin=0, xmax=mtcars['mpg'], alpha =0.8, color = '#ef8a47')
plt.show()

Strip Plots
In addition to creating a visualization that shows a dot for each car, we can also create a visualization with a larger grouping. We can use a strip plot to create a string of dots for cars from each country of origin.
The following three panels show different versions of strip plots.
The panel on the left shows a series of dots for cars from each country
of origin. In this first panel, however, some dots overlap, as in the
case of two Japanese cars where mpg = 21 for both. The
middle panel makes use of an alpha argument in the
geom_point() layer by setting alpha = 0.5.
Whereas in our previous visualization we set the alpha value below 1 for
aesthetic reasons, here the alpha argument is useful in allowing us to
see overlapping dots. Similar to the panel in the middle, the panel on
the right also distinguishes overlapping dots, except this time by
adding a jitter, which is random noise. We could specify jitter in the
following way: geom_point(position = position_jitter()) or
else we could use geom_jitter() as a shortcut. Finally, we
remove the y-axis title, ticks, and labels by setting them to
element_blank().
one <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg))) +
geom_point() +
labs(title = "Mtcars", subtitle = "Strip Plots") +
theme(plot.title = element_text(face = "bold")) +
theme(axis.title.y = element_blank())
two <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg))) +
geom_point(alpha = 0.5, color = '#D55E00') +
theme(axis.title.y = element_blank()) +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank())
three <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg))) +
geom_jitter(width = 0.1, height = 0.1, color = '#0072B2') +
theme(plot.title = element_text(face = "bold")) +
theme(axis.title.y = element_blank()) +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank())
library(patchwork)
one + two + three

Boxplots
The order of our geom layers matters for the visual output. In this case, we decide to put geom_boxplot() before geom_jitter(). As a result, the points are overlaid on top of the boxplot.
We can change the x or y axis labels by using the xlab() or ylab() layer. In the following example, we change the y axis label.
Boxplots are a great way of seeing the interquartile range. Japanese, German, and U.S. cars all show a right skew, but Italian cars don't.
ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), color = origin)) +
geom_boxplot() + geom_jitter(width=0.1, height=0.1) +
theme(legend.position = "none") +
labs(title = "Mtcars", subtitle = "Box Plot") +
theme(plot.title = element_text(face = "bold")) +
ylab("country") +
scale_color_manual(values=met.brewer("Hiroshige", 6))
plt.close()
plt.cla()
plt.clf()
order = mtcars.groupby('origin')['mpg'].mean().sort_values(ascending=False)
# sort mpg
seaborn.boxplot(x="mpg", y="origin", data=mtcars, order=order.index)
plt.title("boxplot", loc = 'left')
plt.show()

We can also create a boxplot for each of the numeric variables in our
dataset. Notice that some of the variables have dots floating outside the whiskers of the boxplot. An outlier is defined as a point more than 1.5 times the interquartile range above the upper quartile or below the lower quartile. Here we also adjust the axis limits; because of coord_flip(), these are the y-axis limits. We then remove the axis text but leave the axis ticks.
\[ \begin{aligned} \text{lower bound} &= Q_1 - 1.5 \times IQR \\ \text{upper bound} &= Q_3 + 1.5 \times IQR \end{aligned} \]
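As a small sketch of applying this rule to a single variable (wt is an arbitrary choice for illustration):
q <- quantile(mtcars$wt, c(0.25, 0.75))   # lower and upper quartiles of weight
iqr <- q[2] - q[1]
mtcars$wt[mtcars$wt < q[1] - 1.5 * iqr | mtcars$wt > q[2] + 1.5 * iqr]   # flagged weights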
mtcars %>%
dplyr::select(3:13) %>%
dplyr::select_if(is.numeric) %>%
pivot_longer(names_to = 'names', values_to = 'values', 2:6) %>%
ggplot(aes(x = values, color = names)) +
geom_boxplot() + coord_flip() +
facet_wrap(~names, scales = 'free_y') +
theme(legend.position = "none") +
labs(title = "Mtcars", subtitle = "Box Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_color_manual(values=met.brewer("Hiroshige", 5)) +
ylim(-0.8, 0.8) +
theme(axis.text.x=element_blank())

Violin Plots
In this case, we have added the geom_jitter before the
geom_violin. As a result, the violin is layered over the
points.
#
# ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
# geom_jitter(width=0.1, height=0.1) + geom_violin() +
# theme(legend.position = "none") +
# labs(title = "Mtcars", subtitle = "Violin Plot") +
# theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
# theme(plot.subtitle = element_text(hjust = 0.5)) +
# ylab("country") +
# scale_fill_manual(values=met.brewer("Hiroshige", 6))
#
# ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
# geom_jitter(width=0.1, height=0.1) + geom_boxplot(alpha = 0.5) +
# theme(legend.position = "none") +
# labs(title = "Mtcars", subtitle = "Violin Plot") +
# theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
# theme(plot.subtitle = element_text(hjust = 0.5)) +
# ylab("country") +
# scale_fill_manual(values=met.brewer("Hiroshige", 6))
ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
geom_jitter(width=0.1, height=0.1) + geom_violin(alpha = 0.5) +
theme(legend.position = "none") +
labs(title = "Mtcars", subtitle = "Violin Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
ylab("country") +
scale_fill_manual(values=met.brewer("Hiroshige", 6))
plt.close()
plt.cla()
plt.clf()
order = mtcars.groupby('origin')['mpg'].mean().sort_values(ascending=False)
# sort mpg
seaborn.violinplot(x="mpg", y="origin", data=mtcars, order=order.index)
plt.title("violin plot", loc = 'left')
plt.show()

But how does the violin plot get its shape? The "violin" shape comes from the data's density plot: you turn the density plot sideways and mirror it on both sides of the box plot.
one <- mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
# pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
ggridges::geom_density_ridges() + xlim(0, 40) +
ylab("country") +
scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
theme(legend.position = 'none') +
labs(title = "Mtcars", subtitle = "Density Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
two <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
geom_boxplot() + geom_jitter(width=0.1, height=0.1) +
theme(legend.position = "none") +
labs(title = "Mtcars", subtitle = "Box Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
theme(axis.text.y=element_blank()) +
ylab("")
three <- ggplot(mtcars, aes(x = mpg, y = reorder(origin, mpg), fill = origin)) +
geom_jitter(width=0.1, height=0.1) + geom_violin() +
theme(legend.position = "none") +
labs(title = "Mtcars", subtitle = "Violin Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
theme(axis.text.y=element_blank()) +
ylab("")
library(patchwork)
one + two + three
The ggplot2 package's facet options were not designed
for varying axis labels and scales across facets. Placing panels next to
each other with very different scales is something that visual design
experts tend to avoid because it can be misleading. As a
result, facet_wrap() may not be the best technique for
boxplots and violin plots when the groupings have different scales.
Instead, we can lean on the patchwork package to link separate
boxplots together.
one <- mtcars %>%
dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
mutate(names = factor(names)) %>%
mutate(values = as_factor(values)) %>%
filter(names == 'am') %>%
ggplot(aes(x = mpg, y = values)) +
geom_boxplot(fill = '#e76254') + coord_flip() + facet_wrap(~names)
#
two <- mtcars %>%
dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
mutate(names = factor(names)) %>%
mutate(values = as_factor(values)) %>%
filter(names == 'carb') %>%
ggplot(aes(x = mpg, y = values)) +
labs(title = "Mtcars", subtitle = "Boxplot Panels") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_boxplot(fill = '#ef8a47') + coord_flip() + facet_wrap(~names)
three <- mtcars %>%
dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
mutate(names = factor(names)) %>%
mutate(values = as_factor(values)) %>%
filter(names == 'cyl') %>%
ggplot(aes(x = mpg, y = values)) +
geom_boxplot(fill = '#f7aa58') + coord_flip() + facet_wrap(~names)
four <- mtcars %>%
dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
mutate(names = factor(names)) %>%
mutate(values = as_factor(values)) %>%
filter(names == 'gear') %>%
ggplot(aes(x = mpg, y = values)) +
geom_boxplot(fill = '#ffd06f') + coord_flip() + facet_wrap(~names)
five <- mtcars %>%
dplyr::select(car, origin, mpg, cyl, vs, am, gear, carb) %>%
tidyr::pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
mutate(names = factor(names)) %>%
mutate(values = as_factor(values)) %>%
filter(names == 'vs') %>%
ggplot(aes(x = mpg, y = values)) +
geom_boxplot(fill = '#ffe6b7') + coord_flip() + facet_wrap(~names)
library(patchwork)
one + two + three + four + five
Bar Graphs
Stat = Count
It's important for bar graphs to have a zero baseline. If you do not specify the stat argument, geom_bar() defaults to stat = 'count'.
Stat = Identity
In this case, we also add a forty-five degree tilt to our x-axis labels for readability.
# forcats::fct_infreq() orders the bars by frequency of origin
one <- ggplot(mtcars, aes(x = forcats::fct_infreq(origin), fill = origin)) +
geom_bar(stat = 'count', alpha = 0.3, color = 'black') +
labs(title = "Mtcars", subtitle = "Bar Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
theme(legend.position = 'none') +
labs(caption = "Stat = Count") +
theme(axis.text.x = element_text(angle = 45))
two <- mtcars %>%
group_by(origin) %>%
summarise(mean_mpg = mean(mpg)) %>%
ggplot(aes(x = reorder(origin, mean_mpg), y = mean_mpg, fill = origin)) +
geom_bar(stat = 'identity', color = 'black', alpha = 0.3) +
labs(title = "Mtcars", subtitle = "Bar Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
theme(legend.position = 'none') +
labs(caption = "Stat = Identity") +
theme(axis.text.x = element_text(angle = 45))
one + two
US cars are the heaviest, with the biggest engines, the most horsepower, and the lowest mpg.
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
group_by(origin, names) %>%
summarise(mean_value = mean(values)) %>%
ggplot(aes(x = reorder(origin, mean_value), y = mean_value)) +
geom_bar(stat = 'identity', color = 'black', fill = 'lightblue') +
facet_wrap(~names, scales = 'free') +
theme(axis.text.x = element_text(angle = 45)) +
labs(title = "Mtcars", subtitle = "Bar Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
Composition
Donut Chart
Pie charts may be a little harder to interpret because some people may interpret based on angle and some may interpret based on area. Donut charts may solve this problem.
donut_data <- mtcars %>%
group_by(origin) %>%
tally()
donut_data$fraction = donut_data$n / sum(donut_data$n)
donut_data = donut_data[order(donut_data$fraction), ]
donut_data$ymax = cumsum(donut_data$fraction)
donut_data$ymin = c(0, head(donut_data$ymax, n=-1))
ggplot(donut_data, aes(fill = origin, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(0, 4)) +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank()) +
annotate("text", x = 0, y = 0, label = "") +
labs(title="Mtcars", subtitle = "Origin") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "donut") +
scale_fill_manual(values=met.brewer("Hiroshige", 6)) +
theme(axis.title.y = element_blank()) +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank()) +
theme(axis.title.x = element_blank()) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank())
Relationship
Scatterplot
We can create a scatterplot in order to examine the relationship
between two variables. Our method is straightforward using
ggplot2 and following the grammar of graphics workflow.
First, we feed our dataframe as an argument into the
ggplot() call. Next, we add an aesthetic mapping and
decide which variables will be represented along the x
and y axes. Finally, we add the geom_point
layer to create the scatterplot.
We can also put more than one variable against each other with
facet_wrap. This is useful if we are interested in looking
at multiple relationships at the same time. We have arranged our data to
have a common y-axis, which is the mpg variable. We choose
to hold the y-axis scales the same for each variable by specifying
scales = 'free_x' within the facet_wrap
layer.
We can see at first glance that two of our variables have a positive
relationship with mpg and three of our variables have a negative
relationship with mpg. We can also see that the
relationship is closer for some variables than for
others. For example, the relationship of mpg with
wt seems to be stronger than the relationship of
mpg with qsec.
ggplot(data = mtcars, aes(x = wt, y = mpg, color = "#f7aa58")) +
geom_point() +
labs(title = "Mtcars", subtitle = "Scatter Plot") +
theme(plot.title = element_text(face = "bold")) +
theme(legend.position = 'none')
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
ggplot(aes(x = values, y = mpg, color = names)) +
geom_point() +
facet_wrap(~names, scales = 'free_x') +
labs(title = "Mtcars", subtitle = "Scatter Plot") +
theme(plot.title = element_text(face = "bold")) +
scale_color_manual(values=met.brewer("Hiroshige", 5))
2D Density Plots
We can also add geom_rug() to display the marginal distribution of each variable as tick marks along the axes.
one <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_density_2d() + geom_point() +
labs(title = "Mtcars", subtitle = "2D Density Plot") +
theme(plot.title = element_text(face = "bold"))
# theme(legend.position = 'none')
two <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_density_2d_filled() +
geom_rug() +
theme(legend.position = 'none')
library(patchwork)
one + two
plt.close()
plt.cla()
plt.clf()
seaborn.kdeplot(x="wt", y="mpg", data=mtcars, order=order.index)
plt.title("2D density plot", loc = 'left')
plt.show()
3.2 Summary Tables
4 Theoretical Distributions
A probability distribution shows us the probabilities of different outcomes occurring. This distribution is useful because it describes the full range of possible outcomes of a statistic and the likelihood of each outcome. We can visualize a probability distribution with a histogram or density plot. To read the visualization and find the probability at a given point, we choose a value on the x-axis and then find the corresponding height on the y-axis; this height represents the density at that point.
We can illustrate this process by taking IQ as an example. We know that IQ follows a normal distribution with a mean of 100 and a standard deviation of 15. When we visualize the distribution of IQ scores in a population, we can see that 95% of values fall within two standard deviations of the mean. Or, to use the language of probability, we can say that there is a 95% probability that a person sampled randomly from the population will have an IQ between 70 and 130. This is the main use case of probability distributions: charting the likelihood of outcomes under a given set of parameters.
Central limit theorem
We start by reviewing what is called the central limit theorem. The central limit theorem states that if you have a population with mean μ (pronounced "mu") and standard deviation σ (pronounced "sigma"), and you take a sufficiently large number of random samples from the population, then the distribution of the sample means will be approximately normally distributed.
4.1 Sampling Distributions
Researchers sometimes talk about sampling distributions. But what is a sampling distribution, exactly, and how does it relate to the general idea of a probability distribution? Essentially, a sampling distribution is a probability distribution that has been created in a particular way: it is obtained through a process of repeated sampling from a population, computing a test statistic from each sample, and then creating a probability distribution of that statistic. There are different kinds of sampling distributions: we can use a sampling distribution for statistics such as the mean or variance, or we could use a sampling distribution for a proportion to find the proportion of successes for a given outcome.
Consider a process of sampling a population and then taking a statistic from the sample, such as the mean, and then sampling from the population again in order to find another value of this same statistic. In the end, we will have a distribution of mean values. As the sample size increases, this distribution conforms itself to a normal distribution; we know this to be true because of the central limit theorem.
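As a minimal sketch of this idea (ahead of the more detailed resampling code later in the chapter), we can simulate a sampling distribution for the mean of hp with base R; the number of samples and the sample size here are arbitrary choices of our own.
set.seed(0125)
# Draw 1,000 samples of size 10 (with replacement) and record each sample mean.
sample_means <- replicate(1000, mean(sample(mtcars$hp, size = 10, replace = TRUE)))
hist(sample_means, main = "Sampling distribution of the mean of hp")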
Why do we create a sampling distribution, and why do we care whether a resampled statistic conforms to a normal distribution? Sampling distributions come in handy because they simplify the process of statistical inference. A sampling distribution that has taken the shape of a normal distribution in accordance with the central limit theorem is easy to work with because a normal distribution can be described by well-understood equations that fit a wide variety of problems. Furthermore, we can transform a normal distribution into other distributions with different properties, such as the chi-squared or F-distributions.
We can illustrate how the central limit theorem works by
looking at the hp variable in the mtcars
dataset and then comparing that to a resampling for the mean of
hp. This is, essentially, an illustration of the central
limit theorem: We take samples from our population and find the mean
value for each sample; then we create a distribution of these mean
values and discover that this distribution looks like a normal
distribution. The code chunks for how to create a resampling method such
as this will be reviewed in detail later in the chapter.
Resampling Methods
There are four main types of resampling methods. We use resampling methods because it isn't practical to keep generating new real samples from our population of interest. Resampling is a way of mimicking the sampling process. For example, when we want to understand something about our population of interest, we may consider running a survey and hoping for one thousand responses. However, no one will administer the same survey to a target population one thousand times. It is because of this practical limitation that we use resampling methods. Here are the four main resampling methods:
- Bootstrap
- Jackknife
- Randomization
- Monte Carlo
These methods can be used to build the distribution of a statistic based on our data. Once we have created a sample distribution using a resampling method, we can then generate confidence intervals, among other things.
Before diving into specific resampling methods, it’s useful to first
get familiar with how the base R sample() function works.
sample() takes at least two arguments that need to be
explicitly decided in the function call: x, and
size. x decides the list or range of values
from which we are choosing to make a sample and size decides the size of
our sample. There is an additional argument that is often explicitly
specified, which is the replace argument. Unless otherwise
decided, replace = FALSE is set as the default.
What is the difference between sampling with and without replacement?
When we sample with replacement, we replace each value after it is
drawn. Each draw is therefore independent of the values that came
before it. When we sample without replacement, we are not replacing
values in our vector. Once a value is selected, it cannot be selected
again. In other words, when we sample without replacement, the draws
are not independent, so whatever value we get on one draw
affects the possible values on the next draw. A consequence
here is that we cannot choose a sample size larger than the size of the
input vector unless we specify replace = TRUE. A fourth
commonly used argument, prob, is also worth knowing: we can
supply a vector of probabilities that makes certain values more likely
to be drawn than others.
sample_without_replacement <- sample(x = 1:10, size = 10) # replace = FALSE as default
sample_with_replacement <- sample(x = 1:10, size = 10, replace = TRUE)
sample_with_replacement_and_weighted_probability <- sample(x = c(1, 0), size = 10, replace = TRUE, prob = c(0.8, 0.2))
Bootstrap
What are we doing when we bootstrap? When we bootstrap, we use a
test or metric that relies on random sampling with replacement. Bootstrapping
is especially effective as a way of assigning measures of accuracy, such
as bias, variance, confidence intervals, and prediction error,
to sample estimates. One way to bootstrap in R is by
using the infer package from the tidymodels
ecosystem.
mtcars %>%
dplyr::select(hp) %>%
specify(response = hp) %>%
generate(reps = 10000, type = 'bootstrap') %>%
calculate(stat = 'mean') -> mtcars_bootstrapped_ci_df
mtcars_bootstrapped_ci_df
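As a hedged sketch of what we can do with this object, infer's get_confidence_interval() turns the bootstrapped means into a percentile confidence interval; the 95% level is our own choice here.
mtcars_bootstrapped_ci_df %>%
  get_confidence_interval(level = 0.95, type = "percentile")  # percentile-based 95% CI for the mean of hp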
Jackknife
The jackknife is a precursor to bootstrapping; bootstrapping was introduced as an extension of and improvement on the jackknife, which was developed when computers had about one kilobyte of memory. Bootstrapping is indeed more versatile and flexible. In the interest of thoroughness we will review the jackknife as well, although the occasions on which the jackknife should be used in place of bootstrapping are rare.
The jackknife is a leave-one-out resampling method that recalculates a statistic of interest after removing each observation in turn, until every observation has been left out once. This means that the number of resamples is limited to the number of observations, and largely for this reason the jackknife performs poorly with small sample sizes. The jackknife is also somewhat limited in terms of the kinds of data that can be used. On the other hand, unlike the bootstrap, the jackknife is reproducible every time.
In the following example, we try the jackknife on our
mtcars dataset. There are 32 resamples, one for each
observation left out. As we can observe, confidence intervals
based on the jackknife will be too narrow because the most shift we can
get is the shift from leaving a single observation out.
When we look at the jackknife statistics, we see that some but not all
of them match the originals. The mean is the same but the standard
deviation is different, so sd(mtcars$mpg) and
sd(mtcars$jackknife_mean_mpg) will evaluate to different
values.
In the following code we implement the jackknife ourselves with a simple for loop.
for (i in 1:nrow(mtcars)) {
mtcars_minus_one <- mtcars[-i,]
mean_value_minus_one = mean(mtcars_minus_one$mpg)
mtcars$jackknife_mean_mpg[i] <- mean_value_minus_one
}
ggplot(mtcars, aes(x = jackknife_mean_mpg)) +
geom_histogram(bins = 32, color = 'black', fill = '#ef8a47', alpha = 0.75) +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
n <- 32
standard_error <- sd(mtcars$mpg)/sqrt(n)
jacknife_mean_mpg <- mean(mtcars$jackknife_mean_mpg)
jacknife_standard_error <- sqrt(((n - 1) / n) * sum((mtcars$jackknife_mean_mpg-jacknife_mean_mpg) ^ 2))
t.star <- qt(p = 0.975, df = n - 1) # qt() gives the quantile of the t-distribution
confidence_intervals <- mean(mtcars$mpg) + c(-1,1) * t.star * standard_error
# Note that the jackknifed mean is equal to the mean of the original data values, and similarly that the jackknifed standard error is equal to the standard error of the mean obtained using all of the data and the familiar formula, so for the mean the jackknife is a more convoluted way to compute quantities we already have formulas for.
Randomization / Permutation
Monte Carlo
Monte Carlo simulation uses repeated random sampling to estimate a quantity of interest. In a Monte Carlo simulation, we assign variables random values, run many simulations, and then summarize the results, typically as an average, which gives us an estimate.
The following example, with the method taken from Wikipedia, shows how a Monte Carlo simulation works. We first draw a square whose side length is equal to two, so the area inside the square is equal to four. We then inscribe a circle inside this square. We know that the area of the circle is pi times the radius squared, but we can take another approach: we can use a Monte Carlo simulation to estimate the area of the circle. We do this by uniformly scattering a given number of points over the square and counting the number of points that fall inside the circle. A point falls within the circle if its distance from the origin (the center) is less than one. The ratio of the points that fall inside the circle to the total number of points approximates the ratio of the two areas, π/4, so multiplying this ratio by four gives an approximation of π (approximately 3.14159). Something interesting is happening in our example below: by chance, we get a closer estimate of π with n = 4 than we do with n = 40. This is a good illustration of the random error that can occur in a Monte Carlo simulation.
library(ggforce)
set.seed(0125)
hundredths <- seq(from = -1, to = 1, by = .01)
xvalues <- sample(hundredths, size=4, replace=TRUE)
yvalues <- sample(hundredths, size=4, replace=TRUE)
xy_df <- data.frame(xvalues, yvalues)
xy_df <- xy_df %>%
mutate(x_squared = xvalues ^ 2) %>%
mutate(y_squared = yvalues ^ 2) %>%
mutate(z_squared = x_squared + y_squared) %>%
mutate(zvalues = sqrt(z_squared)) %>%
mutate(inside_or_outside = ifelse(zvalues < 1, "inside", "outside"))
counted <- xy_df %>%
group_by(inside_or_outside) %>%
count()
one <- ggplot(xy_df, aes(color = inside_or_outside)) +
geom_circle(aes(x0 = 0, y0 = 0, r = 1), color = 'black') +
geom_point(aes(x = xvalues, y = yvalues), alpha = 0.5) +
# coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
labs(subtitle = "n = 4") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "estimate for pi: 3.00") +
theme(legend.position = 'none') +
theme(axis.title.y = element_blank()) +
scale_color_manual(values = c("#376795", "#D55E00")) +
coord_fixed()
set.seed(0125)
hundredths <- seq(from = -1, to = 1, by = .01)
xvalues <- sample(hundredths, size = 40, replace=TRUE)
yvalues <- sample(hundredths, size = 40, replace=TRUE)
xy_df <- data.frame(xvalues, yvalues)
xy_df <- xy_df %>%
mutate(x_squared = xvalues ^ 2) %>%
mutate(y_squared = yvalues ^ 2) %>%
mutate(z_squared = x_squared + y_squared) %>%
mutate(zvalues = sqrt(z_squared)) %>%
mutate(inside_or_outside = ifelse(zvalues < 1, "inside", "outside"))
counted <- xy_df %>%
group_by(inside_or_outside) %>%
count()
two <- ggplot(xy_df, aes(color = inside_or_outside)) +
geom_circle(aes(x0 = 0, y0 = 0, r = 1), color = 'black') +
geom_point(aes(x = xvalues, y = yvalues), alpha = 0.5) +
# coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
labs(title = "Monte Carlo Simulations", subtitle = "n = 40") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "estimate for pi: 2.80") +
theme(legend.position = 'none') +
theme(axis.title.y = element_blank()) +
scale_color_manual(values = c("#376795", "#D55E00")) +
coord_fixed()
set.seed(0125)
hundredths <- seq(from = -1, to = 1, by = .01)
xvalues <- sample(hundredths, size = 400, replace=TRUE)
yvalues <- sample(hundredths, size = 400, replace=TRUE)
xy_df <- data.frame(xvalues, yvalues)
xy_df <- xy_df %>%
mutate(x_squared = xvalues ^ 2) %>%
mutate(y_squared = yvalues ^ 2) %>%
mutate(z_squared = x_squared + y_squared) %>%
mutate(zvalues = sqrt(z_squared)) %>%
mutate(inside_or_outside = ifelse(zvalues < 1, "inside", "outside"))
counted <- xy_df %>%
group_by(inside_or_outside) %>%
count()
three <- ggplot(xy_df, aes(color = inside_or_outside)) +
geom_circle(aes(x0 = 0, y0 = 0, r = 1), color = 'black') +
geom_point(aes(x = xvalues, y = yvalues), alpha = 0.5) +
labs(subtitle = "n = 400") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "estimate for pi: 3.11") +
theme(legend.position = 'none') +
theme(axis.title.y = element_blank()) +
scale_color_manual(values = c("#376795", "#D55E00")) +
coord_fixed()
library(patchwork)
one + two + three
Discrete Distributions
There are many ways to think about theoretical distributions in statistics. We can think about distributions as belonging to one of two categories: discrete distributions and continuous distributions. A sampling distribution can either belong to a discrete or continuous distribution category.
A discrete distribution is a probability distribution that depicts the occurrence of discrete outcomes, that is, outcomes that are countable as whole numbers.
Uniform distribution
The first discrete distribution that we can review is called a uniform distribution.
These functions provide information about the uniform distribution on the interval from min to max: dunif() gives the density, punif() gives the distribution function, qunif() gives the quantile function, and runif() generates random deviates.
dunif() punif() qunif() runif()
dunif() takes the following arguments: x, the values at which to evaluate the density; min, the lower limit of the distribution; max, the upper limit of the distribution; and log, which, if TRUE, returns densities on the log scale.
ggplot(data.frame(x=c(0:10)), aes(x)) +
stat_function(geom = "area", fun = dunif, args = list(1, 10), fill = '#e76254', alpha = 0.5, color = 'black') + xlim(0, 11) +
labs(subtitle = "Uniform Distribution") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
Binomial distribution
The binomial distribution is a probability distribution which models
the probabilities of having a certain number of successes among n
identical trials, each having p as the probability of success. The
variables n and p are thus the two parameters
of a binomial distribution.
dbinom(x, size, prob, log = FALSE)
dbinom() pbinom() qbinom() rbinom()
Put simply, dbinom() finds the probability of getting a
certain number of successes (x) in a certain number of trials (size)
where the probability of success on each trial is fixed (prob).
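For example, the probability of exactly two successes in ten trials, each with a success probability of 0.2, can be computed directly:
dbinom(x = 2, size = 10, prob = 0.2)  # P(X = 2) when X ~ Binomial(10, 0.2)
## [1] 0.3019899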
data4 <- tibble(X = rbinom(n = 3000, size = 10, prob = 0.2))
ggplot(data = data4, mapping = aes(x = X)) +
## Default stat is "count".
geom_bar(fill = '#e76254', alpha = 0.5, color = 'black') +
## Force the limits to include 0,...,10.
scale_x_continuous(breaks = 0:10, limits = c(-1,11)) +
labs(subtitle = "Binomial Distribution") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
Bernoulli distribution
The Bernoulli distribution takes a value of 1 with probability p and a value of 0 with probability 1 - p. A Bernoulli distribution is useful because it models the outcome of a "yes or no" experiment, such as tossing a coin. If the experiment succeeds, the outcome is given the value 1; if it does not succeed, it is given the value 0. In a coin toss, for example, "1" can mean the coin lands on heads and "0" that it lands on tails (or the other way around).
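A Bernoulli trial is simply a binomial trial with size = 1, so, as a quick sketch, we can simulate ten coin flips with rbinom():
set.seed(0125)
rbinom(n = 10, size = 1, prob = 0.5)  # 1 = heads, 0 = tails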
## Create a vector to hold sample means.
Xbars30 <- numeric(length = 10^4)
## Repeat sample mean generation 10^4 times.
for (i in 1:10^4) {
## Assign the generated value to the i-th element of the vector.
Xbars30[i] <- mean(rbinom(n = 30, size = 1, prob = 0.2))
}
## Check the moments.
mean(Xbars30)
## [1] 0.20028
ggplot(data = data_frame(Xbar = Xbars30), mapping = aes(x = Xbar)) +
geom_bar(fill = '#e76254', alpha = 0.5, color = 'black') +
scale_x_continuous(limit = c(0,1)) +
labs(subtitle = "Bernoulli Distribution") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
Poisson Distribution
The probability of observing x events is given by the Poisson probability mass function:
\[ \begin{aligned} P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \end{aligned} \]
where x is the number of successes and lambda (λ) is the average rate of success.
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.
The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.
Assumptions
The Poisson distribution is an appropriate model if the following assumptions are true:
- k is the number of times an event occurs in an interval and k can take values 0, 1, 2, ….
- The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.
- The average rate at which events occur is independent of any occurrences. For simplicity, this is usually assumed to be constant, but may in practice vary with time.
- Two events cannot occur at exactly the same instant; instead, at each very small sub-interval, either exactly one event occurs, or no event occurs.
- If these conditions are true, then k is a Poisson random variable, and the distribution of k is a Poisson distribution.
A Poisson distribution is a probability distribution. It measures the probability that a certain number of events occur within a certain period of time.
The events need to be unrelated to each other. They also need to occur with a known average rate, represented by λ, called lambda.
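As a small example, the probability of seeing exactly three events in an interval when the average rate is two can be computed with dpois():
dpois(x = 3, lambda = 2)  # P(X = 3) when X ~ Poisson(2)
## [1] 0.180447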
ggplot(transform(data.frame(x = c(0:10)), y = dpois(x, 2)), aes(x, y)) +
geom_bar(stat = "identity", fill = '#72bcd5', alpha = 0.5, color = 'black') +
scale_x_continuous(breaks = 0:10) +
labs(title = "Poisson Distribution", subtitle = "lambda = 2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
Continuous Distributions
These distributions are based on the normal distribution. Chi-squared distribution: if you have a set of independent, normally distributed variables with mean 0 and variance 1, then the sum of their squares has a chi-squared distribution. T-distribution: if you have a normally distributed variable with mean 0 and variance 1 and you divide it by the square root of a scaled chi-squared variable, the result has a t-distribution. F-distribution: if you divide one chi-squared variable by another (each scaled by its degrees of freedom), the result has an F-distribution.
Why do we have all of these different combinations of the normal? To give our tests the properties we want. For example, if we want to see whether average heights are the same in two groups, we would like to take into account how confident we are in each sample average. So if we are comparing heights in two samples of sizes five and six, we take the difference in average height, which is normally distributed, and divide it by a measure of how confident we are in sample averages from samples of those sizes, which involves a chi-squared distribution, and we arrive at a t-distribution. This is the t-test. The specific reason a given test divides one chi-squared variable by another, and thus uses an F-distribution, depends on the task, but this gives a starting point for tying these distributions to the tests that use them.
The standardized sample mean is supposed to follow a standard normal distribution as n becomes infinitely large. But in reality, we cannot have infinite data. If n is small, the distribution of the location of the true mean, relative to the sample mean and divided by the sample standard deviation (after multiplying by the normalizing term), is a t-distribution with n - 1 degrees of freedom. As n becomes infinitely large, this becomes the standard normal. That is why we see the t-distribution when computing confidence intervals. For large n you may notice the use of the z-distribution instead of t; this amounts to using the inverse of a standard normal, which implicitly assumes a very large amount of data.
Now we have seen a way to find the sample mean with a confidence interval. What about the sample variance? Just as the mean of a sufficiently large number of independent and identically distributed random variables follows a normal distribution, the scaled sample variance of normally distributed variables follows a chi-squared distribution. This follows from the fact that the sum of squares of standard normal random variables follows the chi-squared distribution.
The F-distribution arises as a ratio of two scaled chi-squared variables, which is why it appears in the F-test used for hypothesis testing. One example is testing the hypothesis that the means of several normally distributed populations, all having the same standard deviation, are equal. This is perhaps the best-known F-test and plays an important role in the analysis of variance (ANOVA): we take ratios of variances and then check whether they are consistent with our hypothesis. To understand this better, see material on hypothesis testing, for example Khan Academy's video on the hypothesis test for a difference of means.
In short: the other distributions are built on the normal distribution. The sum of squares of variables with mean 0 and sd 1 is a chi-squared distribution; dividing a normal by the square root of a scaled chi-squared gives a t-distribution; and a chi-squared divided by a chi-squared gives an F-distribution.
The Normal Distribution
The normal distribution, also called a Gaussian distribution or a bell curve, is a foundational probability distribution in statistics. The most common visualizations for a normal distribution are the histogram and the density plot, but normality can also be examined through other visualizations such as box plots and q-q plots. In the following four-panelled visualization, we can see a normal distribution visualized in four different ways. In this chapter, we will review in detail how to create and understand these visualizations.
The Standard Normal Distribution
The standard normal distribution is simply the normal distribution where the mean is zero and the standard deviation is one.
The standard normal distribution makes it clear what percentage of values fall above or below a given point, expressed in standard deviations. Trained statisticians can easily convert z-scores into percentages.
The 68-95-99.7 Rule
For a normal probability distribution, roughly 99.7% of values fall within 3 standard deviations of the mean.
The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).
In particular, the empirical rule predicts that 68% of observations falls within the first standard deviation (µ ± σ), 95% within the first two standard deviations (µ ± 2σ), and 99.7% within the first three standard deviations (µ ± 3σ).
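We can verify these percentages directly with pnorm(), which we cover in more detail later in this chapter:
pnorm(1) - pnorm(-1)  # about 0.6827
pnorm(2) - pnorm(-2)  # about 0.9545
pnorm(3) - pnorm(-3)  # about 0.9973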
Central Limit Theorem
A key point about the central limit theorem: it lets us take an outcome whose distribution we don't know much about and arrive at a well-known distribution, the normal distribution.
ggplot(data.frame(x = c(-4, 4)), aes(x)) +
stat_function(fun = dnorm,
geom = "area", fill = '#ffe6b7', color = 'black', alpha = 0.5,
args = list(
mean = 0,
sd = 1)) +
labs(title = "Standard Normal Distribution", subtitle = "Standard Deviations and Z-Scores") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_vline(xintercept = 1, linetype = 'dotdash', color = "black") +
geom_vline(xintercept = -1, linetype = 'dotdash', color = "black") +
geom_vline(xintercept = 2, linetype = 'dotdash', color = "black") +
geom_vline(xintercept = -2, linetype = 'dotdash', color = "black") +
geom_vline(xintercept = 3, linetype = 'dotdash', color = "black") +
geom_vline(xintercept = -3, linetype = 'dotdash', color = "black") +
geom_text(aes(x=1, label="\n 1 σ", y=0.125), colour="#376795", angle=90, text=element_text(size=11)) +
geom_text(aes(x=2, label="\n 2 σ", y=0.125), colour="#376795", angle=90, text=element_text(size=11)) +
geom_text(aes(x=3, label="\n 3 σ", y=0.125), colour="#376795", angle=90, text=element_text(size=11)) +
geom_text(aes(x=-1, label="\n-1 σ", y=0.125), colour="#376795", angle=90, text=element_text(size=11)) +
geom_text(aes(x=-2, label="\n-2 σ", y=0.125), colour="#376795", angle=90, text=element_text(size=11)) +
geom_text(aes(x=-3, label="\n-3 σ", y=0.125), colour="#376795", angle=90, text=element_text(size=11))
Z-Scores
A z-score, also called a standard score, tells us, for any given value, how many standard deviations above or below the population mean that value is. For example, a z-score of 2 tells us that our value is two standard deviations above the mean. Almost all z-scores fall between -3 and 3. Z-scores are a way to compare results to a "normal" population. Here is how we calculate the z-score for a value from a normal distribution.
\[ \begin{aligned} z = \frac{x - \mu}{\sigma} \end{aligned} \]
This equation requires us to know the population mean μ and the population standard deviation σ.
When you have multiple samples and want to describe the standard deviation of those sample means (the standard error), you would use this z-score formula:
\[ \begin{aligned} z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \end{aligned} \]
This z-score tells you how many standard errors there are between the sample mean and the population mean. It is used for a different kind of problem. The key here is that we are dealing with a sampling distribution of means, so we know we have to include the standard error in the formula.
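As a hypothetical example (the numbers are ours, not from the dataset), suppose we draw a sample of 25 people with a mean IQ of 105 from a population with μ = 100 and σ = 15:
(105 - 100) / (15 / sqrt(25))  # hypothetical sample mean of 105, n = 25
## [1] 1.666667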
Functions
Before deep-diving into the types of distributions, it is important to review fundamental concepts like the Probability Density Function (PDF), the Probability Mass Function (PMF), and the Cumulative Distribution Function (CDF).
Z-scores: (x - mean(x)) / sd(x)
q-q plots
In the bottom-right panel we see what is called a quantile-quantile plot, or q-q plot, which we have not yet reviewed. Specifically, in this example we are looking at a normal quantile-quantile plot because we are making an assessment of normality, but q-q plots can be used for other kinds of distributions. A q-q plot is used both for understanding the distribution of a given variable and for running diagnostics on a given model. The reference line in a q-q plot is a theoretical line that sits exactly on the diagonal; it shows how the scatter would lie if the distribution were perfectly normal, as is the case in the above example. In real-world examples, the scatter in a q-q plot may or may not appear close to this theoretical line and may or may not show a pattern pulling away from it.
A q–q plot is a probability plot. Specifically, it compares two
probability distributions by plotting their quantiles against each
other. The two probability distributions are, first, the distribution of
our actual data and, second, a theoretical normal
distribution. A q-q plot takes our sample data, sorts it in ascending
order, and then plots the data points against the quantiles calculated
from a theoretical distribution. The number of quantiles is selected to
match the size of our sample data. If both sets of quantiles come from a
normal distribution, we should see the points forming a straight
diagonal line. Quantiles, which are closely related to percentiles, are points
in our data below which a certain proportion of our data falls. For
example, in a classic bell-curve standard normal distribution with a
mean of 0, the 0.5 quantile, or 50th percentile, is 0: half the data lie
below 0, at the peak of the hump in the curve. The 0.95 quantile, or
95th percentile, is about 1.64, so 95 percent of the data lie below 1.64.
The following R code generates the quantiles of a standard normal
distribution from 0.01 to 0.99 in increments of 0.01:
qnorm(seq(0.01,0.99,0.01))
One advantage of a q-q plot is that it can be easier to read when
sample sizes are small. In order to get a feel for how to read a q-q plot,
we can use our sample dataset from before. We create a q-q plot by using
the stat_qq() and stat_qq_line() functions
inside the ggplot() call.
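Here is a minimal sketch of that call using mpg as the sample; the styling choices are our own.
ggplot(mtcars, aes(sample = mpg)) +
  stat_qq(color = '#376795') +
  stat_qq_line() +
  labs(title = "Mtcars", subtitle = "Normal Q-Q Plot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5))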
R comes with predefined functions that are native to base R that are designed for examining the characteristics of a normal distribution.
dnorm() pnorm() qnorm() rnorm()
We first start with the dnorm() function, which is the
density function of the normal distribution. The dnorm()
function takes the arguments x, mean,
sd and log. We can use dnorm() to
find the density at any point of x. In the following example, we use
dnorm() to find the density of a normal distribution at the
point x = 125, where the distribution has
mean = 100 and sd = 15. The function evaluates
to 0.006631809, meaning the y-value of the density plot is 0.006631809
at the point where x is 125.
The qnorm() function gives the quantile function of the
normal distribution. In other words, given an area, we find the boundary
value that determines this area. For example, suppose we want to find
the 95th percentile of a normal distribution whose mean is 100 and
whose standard deviation is 15. This is a common example because the
95th percentile is a critical value when thinking about p-values with an
alpha level of 0.05. In this example, qnorm() evaluates to
124.6728.
The pnorm() function, on the other hand, gives the
cumulative distribution function of the normal distribution. We
can think about pnorm() as arriving at the same numbers as
qnorm() but through the opposite workflow. The value
of pnorm(), given x = 124.6728, is
the area to the left of where x is 124.6728. We can also
think about this as the probability of seeing a value less than or equal
to 124.6728. pnorm() for an x value of 124.6728, at a mean of
100 and a standard deviation of 15, evaluates to 0.95. In this example, we
only know to input an x value of 124.6728 because of our evaluation from
qnorm().
# dnorm(x = 125, mean = 100, sd = 15)
# pnorm(124.6728, mean = 100, sd = 15)
# qnorm(0.95, mean = 100, sd = 15)
distributions_functions_table <- tribble(
  ~mean, ~standard_deviation, ~function_, ~input, ~evaluation,
  "100", "15", "dnorm", "125", "0.006631809",
  "100", "15", "pnorm", "124.6728", "0.95",
  "100", "15", "qnorm", "0.95", "124.6728"
)
distributions_functions_table
In addition to creating histograms, boxplots, scatterplots and other
visualizations with common geometry, as in the above example of the
four-paneled visualization, ggplot2 can also be used to
visualize functions. We do this by using stat_function()
within the ggplot() call. stat_function()
takes an argument called fun, which stands for function. We
can supply a predefined function such as dnorm() to the
fun argument in stat_function(). Another
method of visualizing a normal distribution is by using
stat_function() in combination with dnorm().
This will allow us to visualize a normal distribution with a given mean
and standard deviation. In order to add these arguments to
stat_function(), we can put the arguments of the function
in separately as a list through the args argument. Note
that there may be different arguments for each function. By putting
dnorm() inside of ggplot(), we are able to
visualize the normal distribution.
Previously, we have used ggplot() on the existing
mtcars dataframe. In the following code, we try a
new method by creating a dataframe inline within the
ggplot() call. This dataframe is not saved locally as an
independent object. As before, this dataframe defines the limits of the
visualization. In the following example, we also add a vertical line to
illustrate where x = 125 and y = 0.006631809, and we shade the critical area
to the left of x = 125, which is 0.9522096 of the total area.
one <- ggplot(data.frame(x = c(0, 200)), aes(x)) +
stat_function(fun = dnorm,
geom = "line", fill = '#1e466e', alpha = 0.5,
args = list(
mean = 100,
sd = 15), xlim = c(0, 125)) +
stat_function(fun = dnorm,
geom = "line", fill = '#1e466e', alpha = 0.5,
args = list(
mean = 100,
sd = 15), xlim = c(125, 200)) +
geom_hline(yintercept = 0.006631809, linetype = 'dotdash') +
labs(title = "Normal Distribution", subtitle = "Theoretical") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_vline(xintercept = 125, linetype = 'dotdash') +
labs(caption = "dnorm: 0.006631809")
two <- ggplot(data.frame(x = c(0, 200)), aes(x)) +
stat_function(fun = dnorm,
geom = "area", fill = '#1e466e', alpha = 0.5,
args = list(
mean = 100,
sd = 15), xlim = c(0, 125)) +
stat_function(fun = dnorm,
geom = "line", fill = '#1e466e', alpha = 0.5,
args = list(
mean = 100,
sd = 15), xlim = c(125, 200)) +
labs(title = "Normal Distribution", subtitle = "Theoretical") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_vline(xintercept = 125, linetype = 'dotdash') +
labs(caption = "pnorm: 0.9522096; qnorm: 125")
library(patchwork)
one + two
The rnorm() function performs random sampling from the normal
distribution: it is designed to generate a vector of normally
distributed random numbers, or random deviates. rnorm() takes three
arguments: the first argument, n, is how many numbers we want to
generate and, similar to our other distribution functions, we also
have to specify the mean and standard deviation.
Whereas in our previous example we used dnorm()
within stat_function() to create a theoretical normal
distribution with perfect characteristics, rnorm() gives us
an opportunity to create a normal distribution through simulation. In
the following example, we create three simulations of different sizes,
each one with ten times more numbers than the previous one. We know
that, as our simulation size grows toward infinity, the shape
of the distribution will approach the theoretical example of a perfect
normal distribution; this is related to the law of large numbers. We
also accompany our distributions with q-q plots. Because our data are not
perfectly normal, we see scatter on our q-q plots that deviates from
a perfectly straight reference line.
set.seed(0125)
nnormal <- rnorm(10, mean = 100, sd = 15)
nnormal <- as.data.frame(nnormal)
names(nnormal) <- c('x')
one <- ggplot(nnormal, aes(x = x)) +
geom_density(fill = '#aadce0') +
labs(subtitle = "n = 10") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
xlim(0, 200)
two <- ggplot(nnormal, aes(sample = x)) +
stat_qq(color = '#aadce0') +
labs(subtitle = "n = 10") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
nnormal <- rnorm(100, mean = 100, sd = 15)
nnormal <- as.data.frame(nnormal)
names(nnormal) <- c('x')
three <- ggplot(nnormal, aes(x = x)) +
geom_density(fill = '#aadce0') +
labs(title = "Normal Distributions", subtitle = "n = 100") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
xlim(0, 200)
four <- ggplot(nnormal, aes(sample = x)) +
stat_qq(color = '#aadce0') +
labs(subtitle = "n = 100") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
nnormal <- rnorm(1000, mean = 100, sd = 15)
nnormal <- as.data.frame(nnormal)
names(nnormal) <- c('x')
five <- ggplot(nnormal, aes(x = x)) +
geom_density(fill = '#aadce0') +
labs(subtitle = "n = 1000") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
xlim(0, 200)
six <- ggplot(nnormal, aes(sample = x)) +
stat_qq(color = '#aadce0') +
labs(subtitle = "n = 1000") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Simulations")
library(patchwork)
one + three + five + two + four + six
With a large enough sample, the simulation will approach the normal distribution. In the following example, we look at 100,000 values drawn from a random normal distribution and overlay them with the theoretical normal distribution. Not surprisingly, we see an almost perfect overlap.
set.seed(0125)
nnormal <- rnorm(100000, mean = 100, sd = 15)
nnormal <- as.data.frame(nnormal)
names(nnormal) <- c('x')
ggplot(nnormal, aes(x = x)) +
geom_density(color = 'darkgrey', alpha = 0.5) +
labs(title = "Normal Distributions", subtitle = "Theoretical and Simulated") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
xlim(0, 200) +
stat_function(fun = dnorm,
geom = "line",
color = "steelblue",
alpha = 0.5,
args = list(
mean = 100,
sd = 15
))
In addition to using a predefined function within
stat_function(), we can also create our own functions
within the fun argument in stat_function(). In
this example, we create functions for first, second, third, and fourth
degree polynomials.
one <- ggplot(data.frame(x = c(-20, 20)), aes(x)) +
stat_function(fun = function(x) { x**1 },
geom = "line", color = '#1e466e') +
labs(subtitle = "First Degree Polynomial") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
two <- ggplot(data.frame(x = c(-20, 20)), aes(x)) +
stat_function(fun = function(x) { x**2 },
geom = "line", color = '#1e466e') +
labs(subtitle = "Second First Degree Polynomial") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
three <- ggplot(data.frame(x = c(-20, 20)), aes(x)) +
stat_function(fun = function(x) { x**3},
geom = "line", color = '#1e466e') +
labs(subtitle = "Third First Degree Polynomial") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
four <- ggplot(data.frame(x = c(-20, 20)), aes(x)) +
stat_function(fun = function(x) { x**4},
geom = "line", color = '#1e466e') +
labs(subtitle = "Fourth First Degree Polynomial") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
one + two + three + four
Student's T Distribution
Compared with the standard normal distribution, the t-distribution is not as tall at its peak and has fatter tails.
A t-distribution is a sampling distribution used when we have a small sample or when little is known about the population. It is used to estimate the population mean and related quantities such as confidence intervals, statistical differences, and linear regression coefficients. The t-distribution uses a t-score to evaluate data that wouldn't be appropriate for the normal distribution.
The t-distribution describes the standardized distances of sample means to the population mean when the population standard deviation is not known and the observations come from a normally distributed population.
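One way to see the fatter tails, as a quick sketch, is to compare critical values: for the same tail probability, the t-distribution requires a larger cutoff than the normal, and the gap shrinks as the degrees of freedom grow.
qnorm(p = 0.975)        # about 1.96
qt(p = 0.975, df = 10)  # about 2.23
qt(p = 0.975, df = 100) # about 1.98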
ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
stat_function(fun = dnorm, color = 'black', linetype = 'dotted') +
stat_function(fun = dt, args = list(df = 1), color = '#e76254') +
stat_function(fun = dt, args = list(df = 2), color = '#ef8a47') +
stat_function(fun = dt, args = list(df = 3), color = '#f7aa58') +
stat_function(fun = dt, args = list(df = 10), color = '#ffd06f') +
stat_function(fun = dt, args = list(df = 4), color = '#ffe6b7') +
stat_function(fun = dt, args = list(df = 5), color = '#aadce0') +
stat_function(fun = dt, args = list(df = 6), color = '#72bcd5') +
stat_function(fun = dt, args = list(df = 7), color = '#528fad') +
stat_function(fun = dt, args = list(df = 8), color = '#376795') +
stat_function(fun = dt, args = list(df = 9), color = '#1e466e') +
labs(title = "T Distribution", subtitle = "Degrees of Freedom 1 - 10") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
Chi-Squared Distribution
\[ \begin{aligned} Q \sim \chi_{k}^{2} \end{aligned} \]
The chi-squared distribution is used primarily in hypothesis testing. The chi-squared distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. If Z1, …, Zk are independent, standard normal random variables, then the sum of their squares is distributed according to the chi-squared distribution with k degrees of freedom. The chi-squared distribution has one parameter: a positive integer k that specifies the number of degrees of freedom (the number of random variables being summed, Zi s).
The chi-squared distribution is the distribution that emerges from taking the sums of the squared errors.
\[ \begin{aligned} Q = \sum_{i=1}^{k} Z_{i}^{2} \end{aligned} \]
We can also think about the chi-square distribution as the distribution of the chi-square statistic. The chi-square distribution is defined by the following probability density function.
\[ \begin{aligned} P_r(x) = \frac{x^{r/2 - 1}\, e^{-x/2}}{\Gamma\left(r/2\right)\, 2^{r/2}} \end{aligned} \]
The chi-square distribution has these interesting properties: the mean of the distribution is equal to the number of degrees of freedom, μ = v; the variance is equal to two times the number of degrees of freedom, σ² = 2v; and, as the degrees of freedom increase, the chi-square curve approaches a normal distribution.
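As a quick simulation check of these properties (the number of draws is our own arbitrary choice), we can compare the sample mean and variance of chi-squared draws to the degrees of freedom:
set.seed(0125)
draws <- rchisq(100000, df = 5)
mean(draws)  # close to 5, the degrees of freedom
var(draws)   # close to 10, twice the degrees of freedom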
A chi-square (χ²) statistic is a measure of the difference between the observed and expected frequencies of the outcomes of a set of events or variables. Chi-square is useful for analyzing such differences in categorical variables, especially those nominal in nature. χ² depends on the size of the difference between expected and observed values, the degrees of freedom, and the sample size. χ² can be used to test whether two variables are related or independent from one another. It can also be used to test the goodness-of-fit between an observed distribution and a theoretical distribution of frequencies.
There are two main kinds of chi-square tests: the test of independence, which asks a question of relationship, and the goodness-of-fit test, which asks how well the data matches a theoretical value.
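As a small illustration with hypothetical counts (not from mtcars), base R's chisq.test() runs the goodness-of-fit version of the test when we supply expected proportions:
rolls <- c(10, 12, 8, 11, 9, 10)        # hypothetical counts from 60 rolls of a die
chisq.test(x = rolls, p = rep(1/6, 6))  # tests whether the die looks fair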
If Y_i have normal independent distributions with mean 0 and variance 1, then
\(x^2=sum_(i=1)^rY_i^2\)
is distributed as chi^2 with r degrees of freedom. This makes a chi^2 distribution a gamma distribution with theta=2 and alpha=r/2, where r is the number of degrees of freedom.
the goodness-of-fit test, which determines if data fit a particular distribution, such as in the lottery example the test of independence, which determines if events are independent, such as in the movie example the test of a single variance, which tests variability, such as in the coffee example
There is a different chi-square curve for each value of df. When df > 90, the chi-square curve approximates the normal distribution. Because the shape of the curve depends so heavily on the degrees of freedom, the observed counts need to deviate sufficiently from the expected counts before the test statistic becomes large enough to be unusual. The test statistic itself is
\[ \begin{aligned} \chi^{2} = \sum_{i} \frac{(O_{i} - E_{i})^{2}}{E_{i}} \end{aligned} \]
where \(O_{i}\) is the number of observations of type i and \(E_{i}\) is the expected number of observations of type i under the null hypothesis.
- The key to the chi-squared test is that the chi-squared statistic is well-approximated by a chi-squared distribution (which is itself an approximation to the multivariate normal distribution) with a properly chosen number of degrees of freedom.
Because of this approximation, a number of conditions (detailed in the next section) need to hold in order for the test to be valid. Should they hold, the chi-squared test proceeds by comparing the chi-squared statistic to a critical value from the chi-squared distribution with the appropriate degrees of freedom.
Intuitively, the test relies on the fact that if the expected distribution is indeed correct, the difference between the observed and expected distributions should approximate a multivariate normal distribution, which is approximated by a chi-squared distribution by the central limit theorem. If the chi-squared statistic is larger than the critical value, then it is unlikely to have occurred under this assumption, and thus the assumption is likely to be false.
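As a minimal sketch of how such a test looks in R (assuming the cyl and am columns from the standard mtcars data are still present in our version), base R's chisq.test() covers both the test of independence and the goodness-of-fit test:
# test of independence: are the number of cylinders and the transmission type related?
observed <- table(mtcars$cyl, mtcars$am)
chisq.test(observed) # may warn if expected counts are small, one of the conditions mentioned above
# goodness-of-fit: do the cylinder counts match a hypothesized equal split?
chisq.test(table(mtcars$cyl), p = c(1/3, 1/3, 1/3))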
The simplest chi-squared distribution is the square of a standard normal distribution. So wherever a normal distribution could be used for a hypothesis test, a chi-squared distribution could be used.
The difference between the left and right panel is that every number in the standard normal distribution has been squared. We can see that every number is greater than zero.
normal_numbers <- rnorm(10000, mean = 0, sd = 1)
normal_numbers <- as.data.frame(normal_numbers)
normal_numbers2 <- normal_numbers %>%
mutate(square_of_normal_numbers = normal_numbers ^ 2)
one <- ggplot(normal_numbers, aes(x = normal_numbers)) + geom_density() +
labs(title = "Normal Distribution", subtitle = "") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
two <- ggplot(normal_numbers2, aes(x = square_of_normal_numbers)) +
geom_density() + xlim(0, 4) +
labs(title = "Chi-Square Distribution", subtitle = "DF: 1") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
one + two
ggplot(data.frame(x = c(0, 20)), aes(x)) +
stat_function(fun = dchisq, geom = "area", color = '#e76254', fill = '#1e466e', alpha = 0.5, args = list(df = 3)) +
stat_function(fun = dchisq, geom = "area", color = '#ef8a47', fill = '#1e466e', alpha = 0.5, args = list(df = 4)) +
stat_function(fun = dchisq, geom = "area", color = '#f7aa58', fill = '#1e466e', alpha = 0.5, args = list(df = 5)) +
stat_function(fun = dchisq, geom = "area", color = '#ffd06f', fill = '#1e466e', alpha = 0.5, args = list(df = 6)) +
stat_function(fun = dchisq, geom = "area", color = '#aadce0', fill = '#1e466e', alpha = 0.5, args = list(df = 7)) +
stat_function(fun = dchisq, geom = "area", color = '#72bcd5', fill = '#1e466e', alpha = 0.5, args = list(df = 8)) +
stat_function(fun = dchisq, geom = "area", color = '#528fad', fill = '#1e466e', alpha = 0.5, args = list(df = 9)) +
stat_function(fun = dchisq, geom = "area", color = '#376795', fill = '#1e466e', alpha = 0.5, args = list(df = 10)) +
labs(title = "Chi-Square Distribution", subtitle = "Density Plot with degrees of freedom 3 - 10") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
F Distribution
\[ \begin{aligned} F = \frac{V_{1} / m_{1}}{V_{2} / m_{2}} \sim F(m_{1}, m_{2}) \end{aligned} \]
Let's create another example with another function to extend what we have learned. With the density function df() we can plot an F-distribution. Each call requires at least three arguments: x, df1, and df2.
ggplot(data.frame(x = c(0, 3)), aes(x)) +
stat_function(fun = df, geom = "line", color = "#e76254", alpha = .7, args = list(df1 = 1, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#ef8a47", alpha = .7, args = list(df1 = 2, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#f7aa58", alpha = .7, args = list(df1 = 3, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#ffd06f", alpha = .7, args = list(df1 = 4, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#ffe6b7", alpha = .7, args = list(df1 = 5, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#aadce0", alpha = .7, args = list(df1 = 6, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#72bcd5", alpha = .7, args = list(df1 = 7, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#528fad", alpha = .7, args = list(df1 = 8, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#376795", alpha = .7, args = list(df1 = 9, df2 = 50)) +
stat_function(fun = df, geom = "line", color = "#1e466e", alpha = .7, args = list(df1 = 10, df2 = 50)) +
labs(title = "F Distribution", subtitle = "Density Plot with degrees of freedom 1: 1 - 10; degrees of freedom 2: 50") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
5 Frequentist Statistics
5.1 Descriptive Statistics
Descriptive statistics summarize the characteristics of a dataset. Using descriptive statistics, we try to give the most accurate picture of what the data are and what they show. This includes basic information about each variable individually as well as highlights of the relationships between variables. We try to distill the information into an easily understood form that makes sense and is not misleading. Along the way, we have to make a lot of decisions about what information is presented and what story is conveyed. After all, converting data into a summary statistic is an act of reduction, and some information is lost. Despite obvious differences in the distribution of points in each panel, the nine panels in the following visualization share seven summary statistics, including the x-mean, y-mean, and correlation. The implications for understanding data through descriptive statistics and visualizations will be discussed in detail in this chapter.
Summary Tables
In many cases, the best way to convey information about a dataset is through the use of a table. We can use the dplyr package to efficiently create summary tables of our data. dplyr is a powerful package in the tidyverse collection that contains many functions inspired by the SQL programming language. Here are the most common dplyr verbs:
- select()
- filter()
- arrange()
- mutate()
- group_by()
- summarize()
The following code shows how to create a summary table of different variables in our dataframe using the dplyr workflow. We also show an alternate way of writing the code that reads more like SQL syntax because it works from the inside out. The difference is that one of the code chunks makes use of the pipe operator %>%, which we read as 'and then'. The pipe operator first appeared in the magrittr package as a sort of joke on René Magritte's famous painting with the words, in French, "Ceci n'est pas une pipe", meaning "This is not a pipe". The pipe operator has since been incorporated into dplyr as well, so there is no practical need to attach the magrittr package separately if we intend to use dplyr verbs. The pipe operator explains a lot of why R code can appear in two different styles: base R and tidyverse methods.
In the following code chunks, both approaches build a summary table of average fuel consumption by origin; the first is written in the nested, SQL-like style and the second uses the pipe. Finally, we use the round() function from base R to round the values to one decimal place.
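As a minimal illustration of the difference in reading order (separate from the summary-table workflow itself), the nested call reads from the inside out while the piped version reads top to bottom:
# these two expressions return the same value
round(mean(mtcars$mpg), 1)
mtcars$mpg %>%
mean() %>%
round(1)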
dplyr::select(
arrange(
summarise(
group_by(
filter(
mutate(mtcars,
l_per_100_km = 235 / mpg),
origin != "United States"),
origin),
avg_l_per_100_km = mean(l_per_100_km), avg_mpg = mean(mpg)),
desc(avg_mpg)),
origin, avg_l_per_100_km, avg_mpg) -> sql_like_method
# mtcars |>
# transform(avg = mpg / wt) |>
# subset(avg > 5) |>
# subset(, c(wt, mpg, avg)) |>
# head()
mtcars %>%
mutate(l_per_100_km = 235 / mpg) %>%
group_by(origin) %>%
summarize(avg_l_per_100_km = round(mean(l_per_100_km), 1), avg_mpg = round(mean(mpg), 1)) %>%
arrange(desc(avg_mpg)) %>%
dplyr::select(origin, avg_l_per_100_km, avg_mpg)
Univariate Statistics
Moments
In each dataset, the observations for each variable will belong to a certain distribution that can be described in terms of central tendency and variability or dispersion, as well as the shape of this variability or dispersion. This can be described in terms of four moments, which is a concept shared in other fields such as physics. Whereas in physics, moments are related to concepts of mass, rotational inertia and other things, the four moments of a distribution in statistics refer specifically to the following measures:
- center
- spread
- skewness
- kurtosis
In the table above, we show the average miles per gallon as well as the average liter per one hundred kilometers for cars from each country. This table is useful, of course. But averages don’t always work to summarize data. We need to understand the distribution, spread, and variability. If we don’t understand all the characteristics of a distribution, the average may mislead by hiding the other statistics. We will look at each of these moments in turn.
center
The first moment in a distribution is the center. There are three common ways to report this statistic:
- mean
- median
- mode
The mean is calculated as the sum of all the values divided by the total number and is given by this equation:
\[ \begin{aligned} \bar{X} = \frac{\sum_{i=1}^{N} X_{i}}{N} \end{aligned} \]
The median, on the other hand, is the middle number after sorting in either ascending or descending order. If there is an even number of observations in the distribution, the median is the average of the middle two numbers.
Where X is the ordered list of values in the data set, the median is given by the following equations. The first equation is used when n is odd; the second is used when n is even.
\[ \begin{aligned} median(X) = X_{\frac{n+1}{2}} \end{aligned} \] \[ \begin{aligned} median(X) = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2} \end{aligned} \]
Base R provides two useful functions for calculating these
statistics: the mean() and median().
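For example (a quick sketch using the mpg variable):
mean(mtcars$mpg)   # the arithmetic mean
median(mtcars$mpg) # the middle value after sorting
median(c(1, 3, 5))    # odd n: the single middle value, 3
median(c(1, 3, 5, 7)) # even n: the average of the two middle values, 4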
Unlike the mean and median, the mode can be reported for both numeric and character data. R does not have a standard built-in function to calculate the statistical mode, so we can create a user-defined function that takes a vector as input and returns the mode as output; a sketch of such a function appears below. The equation shown after the following code gives the mode for grouped (binned) data.
The in-built mode() function tells us the internal
storage mode of the object, not the mathematical or statistical mode,
which is the value that occurs the most. We can instead use a different
package modeest to return what we are looking for.
In the following example, the mode for the origin vector is United States, but the mode for the mpg variable is shared by seven values: 10.4, 15.2, 19.2, 21.0, 21.4, 22.8, and 30.4, each of which appears twice. Note that modeest::mlv() is different from the base mode() function described above.
The median is often a better measure of center in skewed distributions.
modeest::mlv(mtcars$mpg, method = "mfv") -> one
modeest::mlv(mtcars$origin, method = "mfv") -> two
names(sort(-table(mtcars$origin)))[1] -> three
names(sort(-table(mtcars$mpg)))[1] -> four
\[ \begin{aligned} mode = L + \frac{(f_{m} - f_{1})}{(f_{m} - f_{1}) + (f_{m} - f_{2})} \times h \end{aligned} \]
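Here is a minimal sketch of such a user-defined function; the name get_mode is our own and not part of the original code:
get_mode <- function(v) {
  # count how often each unique value appears and return the most frequent value(s)
  unique_values <- unique(v)
  counts <- tabulate(match(v, unique_values))
  unique_values[counts == max(counts)] # returns all tied values
}
get_mode(mtcars$origin) # expected: "United States"
get_mode(mtcars$mpg)    # expected: the seven tied values listed above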
In the below visualization, we can draw vertical lines for the mean
and median value in each histogram. We do these by adding the
geom_vline() layer of ggplot. We can see that
the mean statistic is larger than the median for the hp
variable. On the other hand, the mean and median statistic are
approximately equal for the qsec variable.
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
group_by(names) %>%
summarize(median = median(values), mean = mean(values)) -> mtcars_bootstrapped_ci_df
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, fill = names)) +
facet_wrap(~names, scales = 'free') +
geom_histogram(bins = 32) +
geom_vline(mapping = aes(xintercept = median), mtcars_bootstrapped_ci_df, linetype = 'dashed', color = 'red') +
geom_vline(mapping = aes(xintercept = mean), mtcars_bootstrapped_ci_df, linetype = 'dashed', color = 'darkgreen') +
labs(caption = "red line: median \n green line: mean") +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
theme(legend.position = 'none') +
scale_fill_manual(values=met.brewer("Hiroshige", 6))
spread
The second moment of a distribution is spread. We can think of this in terms of standard deviation or variance and interquartile range.
- standard deviation
- interquartile range
We can start by talking about the standard deviation or the variance. But what do we mean by the standard deviation or variance and why do we talk about these two measures together? We talk about these two measures together because the standard deviation is the square root of the variance and these measures are therefore two sides of the same coin.
\[ \begin{aligned} sd = sqrt(var) \end{aligned} \]
The above equation shows the relationship between the standard deviation and the variance. This equation alone, however, doesn't increase our understanding very much. In order to increase our understanding, it's useful to review how the variance is calculated. The variance is the average of the squared deviations from the mean: we take each value in the distribution, subtract the mean, square the result, add up these squared differences, and divide by the number of observations minus one (the sample variance). The equation below shows this process.
\[ \begin{aligned} s^{2} = \frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}{N - 1} \end{aligned} \]
The R programming language, of course, has shortcuts for calculating
these measures: We use the var() and sd()
functions to calculate the variance and standard deviation in one step.
But we can practice calculating the variance and standard deviation by
hand in order to increase our understanding of what is happening under
the hood of these functions.
n <- length(mtcars$wt) # this value is 32
sum_of_wt <- sum(mtcars$wt) # add up all the values 102.952
mean_wt <- mean(mtcars$wt) # this value is 3.21725
sq_difference_from_mean_for_each_value <- (mtcars$wt - mean(mtcars$wt)) ^ 2 # this is the squared difference from the mean for each data value
sum_sq_difference_from_mean <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) # add them all up 29.67875
variance_wt <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) / (length(mtcars$wt) - 1) # this number is 0.957379
variance_wt <- var(mtcars$wt) # this number is also 0.957379
sq_root_variance_wt <- sqrt(var(mtcars$wt)) # this number is 0.9784574
sd_wt <- sd(mtcars$wt) # this number is also 0.9784574
one_sd_below <- mean(mtcars$wt) - sd(mtcars$wt)
one_sd_above <- mean(mtcars$wt) + sd(mtcars$wt)
The following table shows the calculated standard deviation for each
of the variables in our dataset. We can do this efficiently by using the
%>% operator, the dplyr
group_by and summarize functions and the
sd function from base R.
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
group_by(names) %>%
summarize(sd = sd(values))
The interquartile range, often abbreviated as IQR and accessed by the
IQR() function, describes the middle 50% of values when
ordered from lowest to highest. To find the interquartile range, we
first find the median of the lower and upper half of the data. These
values are the first quartile, abbreviated as Q1, and the third
quartile, abbreviated as Q3. The interquartile range is the difference
between Q3 and Q1.
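As a quick check (not part of the original code), we can compute the interquartile range by hand with quantile():
q1 <- quantile(mtcars$wt, 0.25) # the first quartile
q3 <- quantile(mtcars$wt, 0.75) # the third quartile
unname(q3 - q1) # should match the next line
IQR(mtcars$wt)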
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
group_by(names) %>%
summarize(mean(values), sd(values), median(values), IQR(values))
In addition to reporting the standard deviation and the interquartile
range, it’s also useful to think about the range and minimum and maximum
values of a distribution, which can be expressed by the
range(), min() and max()
functions, respectively.
range(mtcars$wt) # the minimum and maximum values together
min(mtcars$wt)
max(mtcars$wt)
skewness
The third moment of a distribution is skewness. Skewness is given by the following equation. This is known as the adjusted Fisher-Pearson standardized moment coefficient to find skew.
\[ \begin{aligned} \tilde{\mu}_{3} = \frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{3}}{(N-1) \cdot \sigma^{3}} \end{aligned} \]
Skewness is a measure of symmetry. Negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed. Positive skewness would indicate that the mean of the data values is larger than the median, and the data distribution is right-skewed. The formula given in most textbooks is
\( Skew = \frac{3 \times (\text{Mean} - \text{Median})}{\text{Standard Deviation}} \)
This is known as an alternative Pearson Mode Skewness.
fisher_pearson_skew = 3 * ( ( mean(mtcars$mpg) - median(mtcars$mpg) ) / sd(mtcars$mpg) ) # the median-based formula given above
beb <- moments::skewness(mtcars$mpg) # the moment coefficient of skewness from the moments package
# Compute the skewness (or standardized third central moment) of a variable.
# https://brownmath.com/stat/shape.htm#KurtosisVisualize
# The reference standard is a normal distribution, which has a kurtosis of 3.
# the moment coefficient of skewness
skewness <- sum((mtcars$mpg - mean(mtcars$mpg)) ^ 3) / 32 / (var(mtcars$mpg) ^ (3/2))
sample_skewness <- (sqrt(32*(32-1)) / (32 -2) ) * skewness
Skewness is important to understand because it helps us identify the most appropriate descriptive statistics to report on a given dataset. The mean tends to be a better measure of center when the distribution is normal or almost normal. However, if there are outliers in the distribution, the median tends to be a better measure of the center. In a similar way, standard deviation tends to be a good measure of the spread of a distribution when the distribution is normal or almost normal and the interquartile range tends to be a good measure of the spread of a distribution when the distribution is left-skewed or right-skewed.
kurtosis
The fourth moment of a distribution is kurtosis. Kurtosis measures how heavy the tails of a distribution are relative to a normal distribution; a normal distribution has a kurtosis of 3, so the excess kurtosis subtracts 3 from the computed value.
m <- moments::kurtosis(mtcars$mpg)
excess_kurtosis <- m - 3
kurtosis <- sum((mtcars$mpg - mean(mtcars$mpg)) ^ 4) / 32 / (var(mtcars$mpg) ^ (2))
excess_kurtosis <- kurtosis - 3
sample_kurtosis <- ((32-1)/((32-2)*(32-3))) * ((32+1)*(excess_kurtosis)+6)
After reviewing all the moments of a distribution, we now create a better table.
We can immediately see that the British and Swedish cars have only one observation each, so statistics such as the standard deviation are undefined (NA) for those groups.
mtcars %>%
mutate(l_per_100_km = 235 / mpg) %>%
group_by(origin) %>%
summarize(mean = round(mean(mpg), 1),
sd = round(sd(mpg),1),
skew = round(3 * ( ( mean(mpg) - median(mpg) ) / sd(mpg) ),1),
kurtosis = round(sum((mpg - mean(mpg)) ^ 4) / n() / (var(mpg) ^ (2)), 1)) %>% # n() gives the group size rather than the full 32
arrange(desc(mean))
Bivariate Statistics
The following datasets all share much of the same summary statistics, including x-bar, y-bar, the x and y variance and standard deviation, and the correlation and r-squared.
The point here is to show that graphing the data is important for understanding it, not just reading summary statistics. These are the datasets of Anscombe's quartet.
anscomb <- tribble(
~id, ~dataset, ~x, ~y,
"0","I","10.0","8.04",
"1","I","8.0","6.95",
"2","I","13.0","7.58",
"3","I","9.0","8.81",
"4","I","11.0","8.33",
"5","I","14.0","9.96",
"6","I","6.0","7.24",
"7","I","4.0","4.26",
"8","I","12.0","10.84",
"9","I","7.0","4.82",
"10","I","5.0","5.68",
"11","II","10.0","9.14",
"12","II","8.0","8.14",
"13","II","13.0","8.74",
"14","II","9.0","8.77",
"15","II","11.0","9.26",
"16","II","14.0","8.1",
"17","II","6.0","6.13",
"18","II","4.0","3.1",
"19","II","12.0","9.13",
"20","II","7.0","7.26",
"21","II","5.0","4.74",
"22","III","10.0","7.46",
"23","III","8.0","6.77",
"24","III","13.0","12.74",
"25","III","9.0","7.11",
"26","III","11.0","7.81",
"27","III","14.0","8.84",
"28","III","6.0","6.08",
"29","III","4.0","5.39",
"30","III","12.0","8.15",
"31","III","7.0","6.42",
"32","III","5.0","5.73",
"33","IV","8.0","6.58",
"34","IV","8.0","5.76",
"35","IV","8.0","7.71",
"36","IV","8.0","8.84",
"37","IV","8.0","8.47",
"38","IV","8.0","7.04",
"39","IV","8.0","5.25",
"40","IV","19.0","12.5",
"41","IV","8.0","5.56",
"42","IV","8.0","7.91",
"43","IV","8.0","6.89")
anscomb$x <- as.numeric(anscomb$x)
anscomb$y <- as.numeric(anscomb$y)
ggplot(anscomb, aes(x = x, y = y, color = dataset)) + geom_point() +
facet_wrap(~dataset) + geom_smooth(method = 'lm', se = FALSE) +
labs(title = "examples", subtitle = "examples") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_color_manual(values=met.brewer("Hiroshige", 4))
Pearson Correlation
When we fit a linear model we can investigate goodness-of-fit statistics like r-squared. We keep in mind, however, that r-squared is another way of interpreting another statistic called the Pearson correlation, denoted as r. So it is often good practice to look at what the Pearson correlation is telling us before fitting a linear regression and examining the r-squared. The Pearson correlation is calculated using the following formula:
\[ \begin{aligned} \frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2} \sum\left(y_{i}-\bar{y}\right)^{2}}} \end{aligned} \]
We can create the pearson correlation value ourselves by creating our own pearson correlation table. This table will tell us the values that we need to use for our general equation. The pearson correlation is a measure of the linear correlation between two variables. It is the ratio between the covariance of two variables and the product of their standard deviations; thus it is essentially a normalised measurement of the covariance, such that the result always has a value between −1 and 1. Whenever you square a number between -1 and 1, you end up with a number between 0 and 1. Thus R-squared is always between 0 and 1.
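As a quick check of this definition (not part of the original workflow), the covariance divided by the product of the standard deviations matches cor():
cov(mtcars$wt, mtcars$mpg) / (sd(mtcars$wt) * sd(mtcars$mpg))
cor(mtcars$wt, mtcars$mpg) # both are approximately -0.87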
mtcars %>%
dplyr::select(wt, mpg) %>%
mutate(wt_times_mpg = wt * mpg) %>%
mutate(wt_squared = wt ^ 2) %>%
mutate(mpg_squared = mpg ^ 2) %>%
janitor::adorn_totals("row") %>%
tail(n = 1)
# This is our Pearson correlation equation using the language of R
pearson_correlation <-
((length(mtcars$mpg) * sum(mtcars$wt * mtcars$mpg)) -
(sum(mtcars$wt) * sum(mtcars$mpg))) /
sqrt(((length(mtcars$mpg) * sum(mtcars$wt ^ 2)) - (sum(mtcars$wt) ^ 2)) *
((length(mtcars$mpg) * sum(mtcars$mpg ^ 2)) - (sum(mtcars$mpg) ^ 2)))
# We see that R calculates the Pearson correlation for us using the cor function
pearson_correlation <- cor(mtcars$wt, mtcars$mpg)
# Keep in mind that order is not important
# cor(mtcars$wt, mtcars$mpg) = cor(mtcars$mpg, mtcars$wt)
# We square r in order to find r-squared
r_squared <- cor(mtcars$wt, mtcars$mpg) ^ 2
tibble(pearson_correlation, r_squared, rownames = 'wt_and_mpg') %>% relocate(rownames)
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
group_by(names) %>%
summarize(correlation_with_mpg = cor(mpg, values)) %>%
mutate(r_squared = correlation_with_mpg ^ 2) %>%
arrange(correlation_with_mpg)
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
nest(data = -names) -> nested
n <- nested %>%
mutate(test = map(data, ~ cor.test(.x$mpg, .x$values)))
n2 <- nested %>%
mutate(
test = map(data, ~ cor.test(.x$mpg, .x$values)), # S3 list-col
tidied = map(test, broom::tidy)
)
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
nest(data = -names) %>%
mutate(
test = map(data, ~ cor.test(.x$mpg, .x$values)), # S3 list-col
tidied = map(test, broom::tidy)
) %>%
unnest(tidied) -> n3
Scatterplot Matrix
Scatterplot matrices are useful because they plot each variable against each other variable. The GGally package makes this method easy.
For each panel, the variable on the vertical axis is given by the variable name in the row, and the variable on the horizontal axis is given by the variable name in the column. Correlations are shown in the upper triangle and scatterplots in the lower triangle, with density plots on the diagonal where each variable meets itself.
Our scatterplot matrix shows strong correlations among several variable pairs; in particular, mpg is most strongly correlated with wt.
library(GGally)
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
ggpairs(3:8) +
labs(title = "Mtcars", subtitle = "Scatterplot Matrix") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 6))
Correlation Test
# cor.test() defaults to the Pearson correlation; here we request the Spearman rank correlation instead
correlation_test <- cor.test(mtcars$wt, mtcars$mpg,
method = "spearman")
broom::tidy(correlation_test)
R-Squared
\[ \begin{aligned} {R ^ 2} = cor(x, y) ^ 2 \end{aligned} \]
RMSE
\[ \begin{aligned} {RMSD} = \sqrt{\frac{\sum_{i=1}^{N}\left(x_{i}-\hat{x}_{i}\right)^{2}}{N}} \end{aligned} \]
Root Mean Squared Error (RMSE): as the name suggests, it is the square root of the average squared difference between the actual and predicted values of the target variable. It gives the typical size of the prediction error made by the model, so a lower RMSE indicates a more accurate model.
RMSE is a measure of the differences between values predicted by a model and the values observed.
Root Mean Square Error (RMSE) is the standard deviation of the residuals which are the prediction errors.
Residuals are a measure of how far from the regression line data points are. RMSE is a measure of how spread out the residuals are.
RMSE tells us how concentrated the data is around the line of best fit.
Whichever name you use, the calculation comes to the same thing. If you don't like formulas, you can find the RMSE by squaring the residuals, finding the average of the squared residuals, and taking the square root of the result.
RMSE has the same unit as the dependent variable (DV)
These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors (or prediction errors) when computed out-of-sample
The RMSD serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power. RMSD is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent
RMSD is always non-negative, and a value of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a lower RMSD is better than a higher one. However, comparisons across different types of data would be invalid because the measure is dependent on the scale of the numbers used.
RMSD is the square root of the average of squared errors. The effect of each error on RMSD is proportional to the size of the squared error; thus larger errors have a disproportionately large effect on RMSD. Consequently, RMSD is sensitive to outliers.
mtcars_linear_model <- lm(mpg ~ wt, mtcars)
# summary(mtcars_linear_model)
#
# ?model.matrix
stats::model.matrix(mtcars_linear_model)
## (Intercept) wt
## 1 1 2.620
## 2 1 2.875
## 3 1 2.320
## 4 1 3.215
## 5 1 3.440
## 6 1 3.460
## 7 1 3.570
## 8 1 3.190
## 9 1 3.150
## 10 1 3.440
## 11 1 3.440
## 12 1 4.070
## 13 1 3.730
## 14 1 3.780
## 15 1 5.250
## 16 1 5.424
## 17 1 5.345
## 18 1 2.200
## 19 1 1.615
## 20 1 1.835
## 21 1 2.465
## 22 1 3.520
## 23 1 3.435
## 24 1 3.840
## 25 1 3.845
## 26 1 1.935
## 27 1 2.140
## 28 1 1.513
## 29 1 3.170
## 30 1 2.770
## 31 1 3.570
## 32 1 2.780
## attr(,"assign")
## [1] 0 1
mse <- sum(residuals(mtcars_linear_model) ^ 2) / 30 # divide by the residual degrees of freedom, n - 2 = 30
# mtcars_linear_model <- lm(mpg ~ wt, x = TRUE, mtcars)
# mtcars_linear_model$x
# str(mtcars_linear_model)
# lm(mpg ~ wt, qr = TRUE, mtcars)
# summary(mtcars_linear_model)
mtcars_linear_model <- lm(mpg ~ wt, mtcars)
rmse = sqrt(sum(residuals(mtcars_linear_model) ^ 2) / 30)
# sqrt(sum(mtcars_linear_model$residuals ^ 2) / 30)
# sd(residuals(mtcars_linear_model)) # similar to the standard deviation of the residuals
Some researchers have recommended the use of the Mean Absolute Error (MAE) instead of the Root Mean Square Deviation. MAE possesses advantages in interpretability over RMSD. MAE is the average of the absolute values of the errors. MAE is fundamentally easier to understand than the square root of the average of squared errors. Furthermore, each error influences MAE in direct proportion to the absolute value of the error, which is not the case for RMSD.[2]
Mean Absolute Error (MAE): this metric gives the average absolute difference between the actual values and the values predicted by the model for the target variable. If outliers should not have an outsized influence on the evaluation of the model, MAE is a reasonable metric to use. A lower MAE indicates a better model.
mae <- mean(abs(residuals(mtcars_linear_model)))
r <- cor(mtcars$mpg, mtcars$wt)
variance_wt <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) / (length(mtcars$wt) - 1)
variance_mpg <- sum((mtcars$mpg - mean(mtcars$mpg)) ^ 2) / (length(mtcars$mpg) - 1)
mean_wt <- (sum(mtcars$wt) / length(mtcars$wt))
mean_mpg <- (sum(mtcars$mpg) / length(mtcars$mpg))
sd_wt <- sqrt(variance_wt)
sd_mpg <- sqrt(variance_mpg)
pearson_correlation <- ((length(mtcars$mpg) * sum(mtcars$wt * mtcars$mpg)) - (sum(mtcars$wt) * sum(mtcars$mpg))) / sqrt(((length(mtcars$mpg) * sum(mtcars$wt ^ 2)) - (sum(mtcars$wt) ^ 2)) * ((length(mtcars$mpg) * sum(mtcars$mpg ^ 2)) - (sum(mtcars$mpg) ^ 2)))
5.2 Inferential Statistics
Gelman, Hill, and Vehtari, in their book Regression and Other Stories, outline three different challenges of inferential statistics. In their words:
- Generalizing from sample to population
- Generalizing from treatment to control group
- Generalizing from observed measurements to the underlying constructs of interest
While descriptive statistics summarize the characteristics of a dataset, inferential statistics allow us to make inferences about the more general population, including hypothesis testing.
sample to population
Our first step towards understanding confidence intervals is to understand the standard deviation of the sample.
We first practice calculating the variance and standard deviation by hand in order to understand what is happening under the hood of the var() and sd() functions.
n <- length(mtcars$wt) # this value is 32
sum_of_wt <- sum(mtcars$wt) # add up all the values 102.952
mean_wt <- mean(mtcars$wt) # this value is 3.21725
# this is same as sum(mtcars$wt) / length(mtcars$wt)
# mtcars %>%
# mutate(squared_difference_from_mean = (mtcars$wt - mean(mtcars$wt)) ^ 2) %>%
# summarize(sum_squared_difference_from_mean = sum(squared_difference_from_mean))
sq_difference_from_mean_for_each_value <- (mtcars$wt - mean(mtcars$wt)) ^ 2 # this is the squared difference from the mean for each data value
sum_sq_difference_from_mean <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) # add them all up 29.67875
# another method to sum squared difference from mean
# mtcars %>%
# mutate(squared_difference_from_mean = (mtcars$wt - mean(mtcars$wt)) ^ 2) %>%
# summarize(sum_squared_difference_from_mean = sum(squared_difference_from_mean))
variance_wt <- sum((mtcars$wt - mean(mtcars$wt)) ^ 2) / (length(mtcars$wt) - 1) # this number is 0.957379
variance_wt <- var(mtcars$wt) # this number is also 0.957379
sq_root_variance_wt <- sqrt(var(mtcars$wt)) # this number is 0.9784574
sd_wt <- sd(mtcars$wt) # this number is also 0.9784574
one_sd_below <- mean(mtcars$wt) - sd(mtcars$wt)
one_sd_above <- mean(mtcars$wt) + sd(mtcars$wt)
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
group_by(names) %>%
summarize(sd = sd(values))
Standard Error
The standard deviation (SD) measures the amount of variability, or dispersion, from the individual data values to the mean
The standard error of the mean (SEM) measures how far the sample mean (average) of the data is likely to be from the true population mean. The SEM is always smaller than the SD.
We calculate the lower and upper bounds of a confidence interval for the population mean by first calculating the standard error of the sample mean, sd/sqrt(n), and then multiplying it by the critical z value.
sigma = 1.96 # the critical z value (z*) for a 95% confidence interval
n <- length(mtcars$wt) # this value is 32
standard_error = (sigma * sd(mtcars$wt)) / sqrt(n) # strictly speaking this is the margin of error: z* times sd/sqrt(n)
# this value is 0.3390182
mean_wt <- mean(mtcars$wt) # 3.21725
lower_bound_ci <- mean_wt - standard_error # lower bound of the confidence interval
upper_bound_ci <- mean_wt + standard_error # upper bound of the confidence interval
tibble(lower_bound_ci, upper_bound_ci, Mtcars_variable = "wt")
Confidence Interval
Theory
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
group_by(names) %>%
summarize(standard_error = 1.96 * sd(values) / sqrt(length(values)),
mean_value = mean(values)) %>%
mutate(lower_ci = mean_value - standard_error) %>%
mutate(upper_ci = mean_value + standard_error) -> mtcars_theoretical_ci_df
mtcars_theoretical_ci_df
Bootstrapping
mtcars %>% # use this to get a single value
infer::specify(response = wt) %>%
infer::generate(reps = 1000, type = 'bootstrap') %>%
infer::calculate(stat = 'mean') %>%
infer::get_ci(level = 0.95) -> one_value
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
split(.$names) %>% # split(mtcars_long_numeric$names) %>%
map_df(~.x %>%
specify(response = values) %>%
generate(reps = 1000, type = 'bootstrap') %>%
calculate(stat = 'mean') %>%
get_ci(level = 0.95), .id = 'names') -> mtcars_bootstrapped_ci_df
mtcars_theoretical_ci_df <- mtcars_theoretical_ci_df %>% dplyr::select(names, lower_ci, upper_ci)
names(mtcars_theoretical_ci_df) <- c("names", "lower_ci_theoretical", "upper_ci_theoretical")
names(mtcars_bootstrapped_ci_df) <- c("names", "lower_ci_bootstrapped", "upper_ci_bootstrapped")
mtcars_ci_df <- left_join(mtcars_theoretical_ci_df, mtcars_bootstrapped_ci_df, by = 'names')
mtcars_ci_df
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, fill = names)) +
facet_wrap(~names, scales = 'free') +
geom_histogram(bins = 32) +
geom_vline(mapping = aes(xintercept = lower_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
geom_vline(mapping = aes(xintercept = upper_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
geom_vline(mapping = aes(xintercept = lower_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed') +
geom_vline(mapping = aes(xintercept = upper_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed') +
labs(caption = "dashed line equals 95% confidence interval for mean") +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 6))
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, fill = names)) +
facet_wrap(~names, scales = 'free') +
geom_density() +
geom_vline(mapping = aes(xintercept = lower_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
geom_vline(mapping = aes(xintercept = upper_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
geom_vline(mapping = aes(xintercept = lower_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed') +
geom_vline(mapping = aes(xintercept = upper_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed') +
labs(caption = "dashed line equals 95% confidence interval for mean") +
labs(title = "Mtcars", subtitle = "Density Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 6))
The Jackknife
There are various "solutions" for heteroskedasticity, though they are less true solutions than robustness checks (since the right approach may hinge on the kind of heteroskedasticity and whether you know about it or not). In practice, these solutions are compared against the standard error estimates derived from a simple OLS model. One of those solutions is "bootstrapping", which was first introduced by Efron (1979) as an extension of the "jackknife" approach. Briefly, the "jackknife" approach, first introduced by Quenouille (1949) and given its name by Tukey (1958), is a "leave-one-out" resampling method that recalculates a statistic of interest iteratively until each observation has been removed once. One limitation, beyond how tricky it is to adapt to more complex data structures, is that jackknifing struggles with smaller datasets and the number of resamples is capped at the number of observations. Bootstrapping, a resampling-with-replacement approach to calculating statistics of interest (e.g. standard errors from a regression), is far more versatile and flexible.
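Before turning to the jackknife code, here is a minimal base R sketch of bootstrapping the mean of wt by resampling with replacement (the infer-based workflow earlier in this chapter does the same thing with more structure):
set.seed(1) # unlike the jackknife, the bootstrap involves random resampling
boot_means <- replicate(1000, mean(sample(mtcars$wt, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975)) # a 95% bootstrap confidence interval for the mean of wt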
With 32 observations there are exactly 32 jackknife resamples, one for each observation that can be left out, so the largest shift the statistic can show is whatever comes from removing a single observation.
# unlike the bootstrap, jackknifing is reproducible every time (there is no random resampling)
for (i in 1:nrow(mtcars)) {
mtcars_minus_one <- mtcars[-i,]
mean_value_minus_one = mean(mtcars_minus_one$mpg)
mtcars$jackknife_mean_mpg[i] <- mean_value_minus_one
}
# mtcars
# mtcars_long_numeric_with_mpg
ggplot(mtcars, aes(x = jackknife_mean_mpg)) +
geom_histogram(bins = 32, color = 'black', fill = 'lightblue') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
# mtcars$mpg # the 32 mpg values
# mean(mtcars$mpg) # the mean mpg # sometimes called x.bar or x-bar
n <- 32
standard_error <- sd(mtcars$mpg)/sqrt(n)
# the mean is the same but not all the statistics are the same
# sd(mtcars$mpg) # these two values are different
# sd(mtcars$jackknife_mean_mpg) # these two values are different
# mtcars$jackknife_mean_mpg # the 32 jackknifed means
jacknife_mean_mpg <- mean(mtcars$jackknife_mean_mpg) # the jacknife mean
jacknife_standard_error <- sqrt(((n - 1) / n) * sum((mtcars$jackknife_mean_mpg-jacknife_mean_mpg) ^ 2))
t.star <- qt(p = 0.975, df = n - 1) # qt() returns the quantile (critical value) of the t distribution
# t.star
confidence_intervals <- mean(mtcars$mpg) + c(-1,1) * t.star * standard_error
confidence_intervals
## [1] 17.91768 22.26357
# You will notice that the mean of all the jackknife statistics is equal to the statistic computed using
# all n data values, and similarly that the jackknifed standard error is equal to the standard error of
# the mean obtained using all of the data and the familiar formula, so for the mean this seems to be just
# a more convoluted way to compute quantities we already have formulas for and a waste of time.
mtcars %>%
dplyr::select(car, mpg, disp, hp, drat, wt, qsec) %>%
mutate(across(c(mpg, disp, hp, drat, wt, qsec), function(x)
map_dbl(seq_along(x), function(rn) mean(x[-rn])),
.names = "{.col}_jackknife_mean")) %>%
dplyr::select(-c(mpg:qsec)) -> saved
names(saved) <- c("car", "mpg", "disp", "hp", "drat", "wt", "qsec")
saved <- pivot_longer(saved, names_to = 'names', values_to = 'values', 2:7)
saved_ci_df <- saved %>%
group_by(names) %>%
summarize(standard_error = sd(values),
mean_value = mean(values)) %>%
mutate(lower_ci_jackknife = mean_value - standard_error) %>%
mutate(upper_ci_jackknife = mean_value + standard_error)
mtcars_theoretical_ci_df
mtcars_ci_df
saved_ci_df
saved_ci_df[1, 1] = "disp"
saved_ci_df[2, 1] = "drat"
saved_ci_df[3, 1] = "hp"
saved_ci_df[4, 1] = "mpg"
saved_ci_df[5, 1] = "qsec"
saved_ci_df[6, 1] = "wt"
ggplot(saved, aes(x = values)) +
geom_histogram(bins = 32, color = 'black', fill = 'lightblue') +
facet_wrap(~names, scales = 'free') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_vline(mapping = aes(xintercept = lower_ci_jackknife), saved_ci_df, linetype = 'dashed', color = 'red') +
geom_vline(mapping = aes(xintercept = upper_ci_jackknife), saved_ci_df, linetype = 'dashed', color = 'red')
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
ggplot(aes(x = values, fill = names)) +
facet_wrap(~names, scales = 'free') +
geom_density() +
# geom_vline(mapping = aes(xintercept = lower_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
# geom_vline(mapping = aes(xintercept = upper_ci_theoretical), mtcars_theoretical_ci_df, linetype = 'dashed') +
# geom_vline(mapping = aes(xintercept = lower_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed', color = 'blue') +
# geom_vline(mapping = aes(xintercept = upper_ci_bootstrapped), mtcars_ci_df, linetype = 'dashed', color = 'blue') +
geom_vline(mapping = aes(xintercept = lower_ci_jackknife), saved_ci_df, linetype = 'dashed', color = 'red') +
geom_vline(mapping = aes(xintercept = upper_ci_jackknife), saved_ci_df, linetype = 'dashed', color = 'red') +
labs(caption = "dashed line equals 95% confidence interval for mean") +
labs(title = "Mtcars", subtitle = "Density Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
treatment to control
observed measurements to underlying constructs
6 Hypothesis Testing
As a first step in understanding hypothesis testing, we first want to understand some of the characteristics of different sampling distributions, including understanding how to plot the area under the curve, the critical area and type 1 and type 2 errors.
Before we dive deep into a dataset, we want to create one or more hypotheses for testing. As a general process, we first state a hypothesis, then we set the criteria for a decision and, as a final step, we compute the test statistic. When we do hypothesis testing, we actually refer to two distinct hypotheses and choose between them: Our two distinct hypotheses are the null hypothesis and the alternate hypothesis. We have to create a criterion to choose between our two hypotheses. We do this by choosing a threshold for our p-value. This threshold is called our alpha value.
P-Values
We usually have more than one test statistic. The p-value is a good first place to start because it is a normalized criterion across different kinds of tests. The p-value is the level of marginal significance within our hypothesis test, and it represents the probability of the occurrence of a given event. The 'p' in p-value, therefore, refers to probability because it quantifies uncertainty. Specifically, the p-value is the probability of obtaining results at least as extreme as our observed results by random chance alone, assuming that our null hypothesis is correct. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis. For example, a p-value of 0.05 means that our results would only happen by random chance alone 5% of the time. Most of the time, the criterion for a decision is set at an alpha level of 0.05.
Whenever we report statistical significance, it's extremely
important that we have integrity around our hypothesis testing. A
p-value of 0.05 is the typical threshold for statistical significance.
We should never go fishing for statistical significance. The following is
an illustration why. Let’s say we generate our own dataframe with random
values from a normal distribution and we do this 100 times. We can use
the rnorm() function to create a dataframe with random
normal distributions. We would expect five of these vectors to have a
p-value of less than 0.05. But that doesn’t mean anything because these
are just randomly generated vectors. The workflow here is important. It
would be dishonest to write conclusions about statistical significance
after picking a variable after assessing p-values. This is sometimes
referred to as ‘p-hacking’. It would be analogous to shooting bullets at
the side of a barn and then drawing bullseyes around some of the
holes.
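Here is a sketch of that thought experiment (the object names are our own): one hundred columns of pure noise, each tested for correlation against a noise response, will produce roughly five 'significant' p-values by chance alone.
set.seed(42)
response <- rnorm(100)
noise <- as.data.frame(replicate(100, rnorm(100))) # 100 columns of random normal values
p_values <- sapply(noise, function(column) cor.test(response, column)$p.value)
sum(p_values < 0.05) # typically somewhere around 5 of the 100 tests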
Type I and Type II Errors
At the end of this tutorial we want to tackle an even more difficult problem. In statistics courses, teachers usually have a tough time explaining the concepts of type I and type II errors to their students. A type I error is the false rejection of the null hypothesis when the null hypothesis is in fact true, while a type II error is the failure to reject the null hypothesis when the alternative hypothesis is in fact true. The most difficult thing is to mentally focus on one particular hypothesis at a time: the underlying assumption of the type I error is that the null hypothesis is true, while the underlying assumption of the type II error is that the alternative hypothesis is true. We can visualize this by shading an area under the true hypothesis and using a more subtle line for the other hypothesis. First, we can visualize the type I error in a standard normal distribution.
Parametric and Non-Parametric Equivalents
Parametric tests are those that make assumptions about the parameters of the population distribution from which the sample is drawn, most often the assumption that the population data are normally distributed. Non-parametric tests are "distribution-free" and, as such, can be used for non-normal variables. Most parametric tests have a non-parametric equivalent, as illustrated in the quick sketch below.
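One common pairing, as a minimal sketch (assuming the am column from the standard mtcars data is still present in our version): the two-sample t-test and its non-parametric equivalent, the Wilcoxon rank-sum test.
t.test(mpg ~ am, data = mtcars)      # parametric
wilcox.test(mpg ~ am, data = mtcars) # non-parametric equivalent (may warn about ties)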
6.1 Normality Tests
There are two ways to know if our data approximately follows a normal distribution or is sampled from a distribution that is close to normal. The first way is to look at frequency distributions like histograms, density plots, and boxplots, as well as assessing q-q plots, and the second way is to perform one or more normality tests. These methods can be used most effectively in conjunction with each other.
Hypothesis testing is often the way in which we evaluate our ultimate
research question. However, hypothesis testing can also be used more
locally as a way to better understand our data on the road towards
answering our research questions. For this reason, it’s not uncommon to
see a project that reports many p-values for different statistics on the
way to a final set of conclusions. For example, we may have as our
research question the idea that mpg can be predicted
accurately by wt in the mtcars dataset by
using an ordinary least squares regression. However, ordinary least
squares regression requires an approximately normally distributed
independent variable. For this reason, we may choose to use a normality
test with a null and alternate hypothesis in order to gain confidence
that the wt variable is normally distributed before we move
on to the projecting of creating a linear model and assessing linear
model statistics.
There are a few common normality tests, each with different strengths and weaknesses. Here are some of the most common:
- Shapiro-Wilk Test
- D'Agostino-Pearson Omnibus Test
- Kolmogorov-Smirnov Test
The following visualization is the same four-paneled visualization as
before, except we are applying our mtcars data instead of
looking at a normal distribution from theory. We have made a change to
the panel involving our density plot, however: We can compare the
mpg variable in mtcars to a normal
distribution with the same mean and standard deviation in order to see
how much the mpg variable might deviate from a perfect
normal distribution. Overall, across the four panels, we can see that
the mpg variable is at least a little bit right-skewed. But
how much is it right-skewed? Normality tests are one way to quantify
this.
Shapiro-Wilk Test
The first normality test that we will look at is the Shapiro-Wilk test. The null-hypothesis of this test is that the population is normally distributed. Therefore, if the p-value is less than the chosen alpha level, typically 0.05, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. On the other hand, if the p-value is greater than the chosen alpha level, then the null hypothesis, that our data comes from a normally distributed population, can not be rejected.
It's important to know that the Shapiro-Wilk test is sensitive to outliers and is also influenced by sample size. For smaller samples, non-normality is less likely to be detected, but the Shapiro-Wilk test should still be preferred as it is generally the most sensitive: it has the best power at a given significance level among the common normality tests. For larger samples (i.e. more than one hundred), normality tests become overly sensitive and the assumption of normality might be rejected too easily. This is what we mean when we say the Shapiro-Wilk test is sample-size biased: small samples have little power to reject the null hypothesis, while large samples have so much power that they can detect minor deviations from the null distribution. The mathematical equation for the Shapiro-Wilk test statistic is shown below; in practice we can rely on tables of critical values or on software. Note that the test requires at least three data points.
\[ \begin{aligned} {\displaystyle W={\left(\sum _{i=1}^{n}a_{i}x_{(i)}\right)^{2} \over \sum _{i=1}^{n}(x_{i}-{\overline {x}})^{2}},} \end{aligned} \]
Calculating the Shapiro Wilk constant is a little bit involved. The
Shapiro-Wilk test calculates a test statistic called W. Below we include
some of the math to show how this might be done. Let’s think about this
equation in two parts, by dividing the numerator and the denominator of
the equation. The denominator in the formula is simply the sum of the
squared errors, meaning the sum of the differences of the data from its
mean. We had performed a similar process when looking at the Pearson
correlation coefficient. The numerator of this equation has a constant
called ai, which is the Shapiro-Wilk constant. The ai constant value is
different for each number and also depends on sample size. Generally, we
never do this stuff by hand. We can either use a table to show the
critical values or else R software will sort the data for us. Using R,
we can do all the math for a Shapiro-Wilk test with one simple line of
code by using the shapiro.test() function. This one line of
code will not only calculate the sum of squared errors for the
denominator in our equation, but it will also calculate the numerator
using the correct ai value by first sorting the data and taking sample size into account. The output will show both the W statistic and the p-value. W is always between 0 and 1, and values closer to 1 are more consistent with a normal distribution. We can see in the following table that disp and hp have low p-values, indicating that we can reject the null hypothesis that these variables follow a normal distribution or, more simply, that they do not appear to be normally
distributed.
# https://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-test/
## calculating shapiro-wilk by hand
# vector <- function(myvector) {
# first_value <- sort(myvector)[32] - sort(myvector)[1]
# second_value <- sort(myvector)[31] - sort(myvector)[2]
# third_value <- sort(myvector)[30] - sort(myvector)[3]
# fourth_value <- sort(myvector)[29] - sort(myvector)[4]
# fifth_value <- sort(myvector)[28] - sort(myvector)[5]
# sixth_value <- sort(myvector)[27] - sort(myvector)[6]
# seventh_value <- sort(myvector)[26] - sort(myvector)[7]
# eighth_value <- sort(myvector)[25] - sort(myvector)[8]
# ninth_value <- sort(myvector)[24] - sort(myvector)[9]
# tenth_value <- sort(myvector)[23] - sort(myvector)[10]
# eleventh_value <- sort(myvector)[22] - sort(myvector)[11]
# twelfth_value <- sort(myvector)[21] - sort(myvector)[12]
# thirteenth_value <- sort(myvector)[20] - sort(myvector)[13]
# fourteenth_value <- sort(myvector)[19] - sort(myvector)[14]
# fifteenth_value <- sort(myvector)[18] - sort(myvector)[15]
# sixteenth_value <- sort(myvector)[17] - sort(myvector)[16]
# return(c(first_value, second_value, third_value, fourth_value, fifth_value, sixth_value, seventh_value, eighth_value, ninth_value, tenth_value, eleventh_value, twelfth_value, thirteenth_value, fourteenth_value, fifteenth_value, sixteenth_value))
# }
# maxtomin <- function(vec, N = NULL) {
# if(is.null(N)) {N <- floor(length(vec)/2)}
# vec <- sort(vec)
# vecfin <- rev(vec) - vec
# return(head(vecfin, N))
# }
# maxtomin(mtcars$mpg)
# If n is even, let m = n/2, while if n is odd let m = (n–1)/2
# Calculate b as follows, taking the ai weights from Table 1 (based on the value of n) in the Shapiro-Wilk Tables. Note that if n is odd, the median data value is not used in the calculation of b.
# mtcars %>%
# select(mpg) %>%
# arrange(mpg) %>%
# mutate(mpg_rev = rev(mpg) - mpg) %>%
# slice_head(n = length(mtcars$mpg) / 2)
shapiro.test(mtcars$mpg) # normal
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
# p-value = 0.1229
# shapiro.test(mtcars$hp) # skewed
# p-value = 0.04881
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 3:8) %>%
group_by(names) %>%
do(broom::tidy(shapiro.test(.$values)))
# an alternative nested workflow (not run):
# mtcars_long_numeric_with_mpg %>%
#   nest(data = -names) %>%
#   mutate(shapiro = map(data, ~ shapiro.test(.x$values)),
#          glanced = map(shapiro, broom::glance))
How do we interpret the result of our Shapiro-Wilk normality test?
It's important to think about the characteristics of the null and alternate hypotheses. The null hypothesis here is that the data are normally distributed. Therefore, a p-value greater than 0.05 means we fail to reject the null hypothesis that the data are normally distributed; this is not the same as a guarantee that our data are normally distributed. Keep in mind that sample size can have a big influence on the test: small sample sizes have little power to reject the null hypothesis, while large sample sizes have so much power that they can detect minor deviations from the null distribution.
Another thing to keep in mind with this test is that even if the data is found to not be normally distributed, the test itself does not tell us how much the data differs from normal.
Here we add the p-values from the test to our qq plots.
one <- ggplot2::ggplot(data = mtcars, aes(sample = mpg)) +
stat_qq() + stat_qq_line(color = '#e76254', linetype = 'dotdash', size = 1) +
labs(subtitle = "mpg") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = 'shapiro-wilk p-value: 0.12')
three <- ggplot2::ggplot(data = mtcars, aes(sample = disp)) +
stat_qq() + stat_qq_line(color = '#ef8a47', linetype = 'dotdash', size = 1) +
labs(title = "Mtcars", subtitle = "disp") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = 'shapiro-wilk p-value: 0.02')
four <- ggplot2::ggplot(data = mtcars, aes(sample = hp)) +
stat_qq() + stat_qq_line(color = '#f7aa58', linetype = 'dotdash', size = 1) +
labs(subtitle = "hp") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = 'shapiro-wilk p-value: 0.05')
five <- ggplot2::ggplot(data = mtcars, aes(sample = drat)) +
stat_qq() + stat_qq_line(color = '#ffd06f', linetype = 'dotdash', size = 1) +
labs(subtitle = "drat") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = 'shapiro-wilk p-value: 0.11')
seven <- ggplot2::ggplot(data = mtcars, aes(sample = wt)) +
stat_qq() + stat_qq_line(color = '#ffe6b7', linetype = 'dotdash', size = 1) +
labs(subtitle = "wt") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = 'shapiro-wilk p-value: 0.09')
eight <- ggplot2::ggplot(data = mtcars, aes(sample = qsec)) +
stat_qq() + stat_qq_line(color = '#aadce0', linetype = 'dotdash', size = 1) +
labs(subtitle = "qsec") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = 'shapiro-wilk p-value: 0.59')
library(patchwork)
one + three + four + five + seven + eight
What about next steps? What if the data is not normally distributed? If the checks suggest that the data is not normally distributed, we can transform the dependent variable (repeating the normality checks on the transformed data); common transformations include taking the log or square root of the dependent variable.
Alternatively, we can use a non-parametric test: non-parametric tests are often called distribution-free tests and can be used instead of their parametric equivalents.
Now we can log-transform the variables that failed our Shapiro-Wilk test and rerun the test.
#
#
# mtcars_numeric_diff <- mtcars %>%
# dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
# mutate(log_hp = log(hp), log_disp = log(disp)) %>%
# select(-hp, -disp)
#
# mtcars_long_numeric_with_mpg_log <- mtcars %>%
# dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
# pivot_longer(names_to = 'names', values_to = 'values', 2:7)
#
# ggplot(mtcars_long_numeric_with_mpg_log, aes(x = values, fill = names)) +
# facet_wrap(~names, scales = 'free') +
# geom_density(color = 'black', alpha = 0.25, show.legend = FALSE) +
# labs(title = "Mtcars", subtitle = "Density Plot") +
# theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
# theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
# theme(plot.subtitle = element_text(hjust = 0.5))
#
# mtcars_long_numeric_with_mpg_log %>%
# group_by(names) %>%
# do(tidy(shapiro.test(.$values)))
# The Jarque–Bera test
6.2 Theoretical Tests
Parametric Tests
Student T-Test
A one-sample t-test is used to test whether a population mean equals a hypothesized value, while a two-sample t-test is used to test whether the means of two populations are equal. These tests make the following assumptions about the data.
- Independence
- Normality
- Homogeneity of Variances
- Random Sampling
Independence means that the observations in one sample are independent of the observations in the other sample. In other words, the independence assumption is held up if no two observations appear in both samples. After all, it wouldn’t make sense to try and draw conclusions about the differences between the samples when observations appear in both samples. Before we perform a two-sample t-test, we should always check to make sure that each observation only appears in each sample once. We should not try to perform a two sample t-test if this assumption is violated.
Normality means that both samples are approximately normally distributed. This is a crucial assumption because if the samples are not normally distributed then it isn’t valid to use the p-values from the test to draw conclusions about the differences between the samples. If the sample sizes are small (n < 50), then we can use a Shapiro-Wilk test to determine whether each sample is normally distributed. If the p-value of the test is less than a certain significance level, then the data is likely not normally distributed. If the sample sizes are large, then it’s better to use a Q-Q plot to visually check if the data is normally distributed. If the data points roughly fall along a straight diagonal line in a Q-Q plot, then the dataset likely follows a normal distribution. If this assumption is violated then we can perform a Mann-Whitney U test, which is considered the non-parametric equivalent to the two sample t-test and does not make the assumption that the two samples are normally distributed.
Homogeneity of variances means that both samples have roughly equal variance. We use the following heuristic to determine whether the ratio of the variances is more or less equal: we divide the larger variance by the smaller variance, and if the result is a ratio that is less than four, then we can assume the variances are approximately equal. In the event that the ratio of the variances is greater than four, we can instead use Welch’s t-test, a version of the two-sample t-test that does not make the assumption that the two samples have equal variances.
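As a small sketch of this heuristic in R (the object names below are ours, not from the text above), we can compare the mpg variances across the two transmission groups (am) in mtcars:
variances <- tapply(mtcars$mpg, mtcars$am, var)   # variance of mpg within each am group
ratio <- max(variances) / min(variances)
ratio        # compare against the rule-of-thumb cutoff of four
ratio < 4    # TRUE suggests the equal-variance t-test is reasonable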
Random Sampling means that both samples were obtained using a random sampling method. A two sample t-test makes the assumption that both samples were obtained using a random sampling method; in other words, each observation had an equal probability of being included in its sample. There is no formal statistical test we can use to check this assumption. If this assumption is violated, then it would be hard to reliably generalize the findings from the two sample t-test to the broader population.
One Sample T-Test
The test statistic for a One Sample t Test is denoted t, which is calculated using the following formula.
\[ \begin{aligned} t = \frac{\bar{x} - \mu}{SE} \end{aligned} \]
Another way to think about the formula is in this way.
\[ \begin{aligned} t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \end{aligned} \]
t <- ( mean(mtcars$mpg) - 20 ) / ( sd(mtcars$mpg) / sqrt( length(mtcars$mpg) ))
t
## [1] 0.08506004
t.test(mtcars$mpg, mu = 20, alternative = "two.sided")
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 0.08506, df = 31, p-value = 0.9328
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
## 17.91768 22.26357
## sample estimates:
## mean of x
## 20.09062
The calculated t value is then compared to the critical t value from the t distribution table with degrees of freedom df = n - 1 and chosen confidence level. If the calculated t value > critical t value, then we reject the null hypothesis.
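As a minimal sketch of this critical-value comparison (alpha = 0.05, two-sided; the variable names here are ours), using the same hypothesized mean of 20:
n <- length(mtcars$mpg)
t_calculated <- (mean(mtcars$mpg) - 20) / (sd(mtcars$mpg) / sqrt(n))
t_critical <- qt(0.975, df = n - 1)   # two-sided critical value with df = n - 1
abs(t_calculated) > t_critical        # FALSE here, so we fail to reject the null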
Two Sample T-Test
Non-Parametric Tests
Mann Whitney Test
Welch Test
But you might be more interested in what makes them different from one another and what criteria should be used to select which test. The tests can be ranked in generality, from most general to least general.
The Mann-Whitney test is the most general, with the fewest assumptions: it is a nonparametric test, meaning that we do not need any distributional assumptions about the probability distribution that the data come from.
The Welch test adds the assumption that the two groups come from a normal distribution, but the variance in the two groups can be different.
The Student t-test adds yet another assumption, that the variance should be equal in the two groups.
t.test(mpg ~ am, data = mtcars, var.equal=TRUE)
##
## Two Sample t-test
##
## data: mpg by am
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.84837 -3.64151
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
wilcox.test(mpg ~ am, data=mtcars) # the two-sample wilcox.test() is the Mann-Whitney U test
##
## Wilcoxon rank sum test with continuity correction
##
## data: mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0
t.test(mpg ~ vs, data = mtcars, var.equal=TRUE)
##
## Two Sample t-test
##
## data: mpg by vs
## t = -4.8644, df = 30, p-value = 3.416e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.274221 -4.606732
## sample estimates:
## mean in group 0 mean in group 1
## 16.61667 24.55714
t.test(mpg ~ vs, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by vs
## t = -4.6671, df = 22.716, p-value = 0.0001098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.462508 -4.418445
## sample estimates:
## mean in group 0 mean in group 1
## 16.61667 24.55714
wilcox.test(mpg ~ vs, data=mtcars) # the two-sample wilcox.test() is the Mann-Whitney U test
##
## Wilcoxon rank sum test with continuity correction
##
## data: mpg by vs
## W = 22.5, p-value = 9.034e-05
## alternative hypothesis: true location shift is not equal to 0
options(scipen = 999)
# mtcars %>%
# dplyr::select(mpg, vs, am) %>%
# pivot_longer(names_to = 'names', values_to = 'values', 2:3) %>%
# nest(data = -names) %>%
# mutate(
# test = map(data, ~ t.test(.x$mpg, .x$values)), # S3 list-col
# tidied = map(test, tidy)
# ) %>%
# unnest(tidied) -> n3
# n3
t.test(mpg ~ vs, data = mtcars) # p-value = 0.0001098
##
## Welch Two Sample t-test
##
## data: mpg by vs
## t = -4.6671, df = 22.716, p-value = 0.0001098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.462508 -4.418445
## sample estimates:
## mean in group 0 mean in group 1
## 16.61667 24.55714
t.test(mpg ~ am, data = mtcars) # p-value = 0.001374
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
So when you relax your assumptions, you become less certain of the outcome in some sense. If the α level of the test were 0.1% (which is rarely used except when doing multiple testing/comparisons or perhaps in pharmaceutical experiments), then you would reject the null hypothesis using the t-test but not with the other tests.
If you know that the data comes from a normal distribution and the two groups have equal variance, this is the correct thing to do. People often tend to assume that the data comes from a normal distribution, but if you want to be conservative the Mann Whitney test is more appropriate.
6.3 Simulation Based Tests
Student T-Test
One Sample T-Test
In this section, we’ll walk through conducting t-tests and their randomization-based analogue using infer. We’ll start out with a 1-sample t-test, which compares a sample mean to a hypothesized true mean value.
The 1-sample t-test can be used to test whether a sample of continuous data could have plausibly come from a population with a specified mean.
As an example, we’ll test whether the average car in mtcars gets 21 miles per gallon, using the mpg variable. (The infer vignette that this section follows asks the analogous question of the gss data: whether the average American adult works 40 hours a week.)
The mean mpg in our sample is about 20.1, but there is quite a bit of variability from car to car. Let’s test whether we have evidence that the true mean mpg differs from 21.
infer’s randomization-based analogue to the 1-sample t-test is a 1-sample mean test. We’ll start off showcasing that test before demonstrating how to carry out a theory-based t-test with the package.
First, to calculate the observed statistic, we can use specify() and calculate().
# calculate the observed statistic
observed_statistic <- mtcars %>%
specify(response = mpg) %>%
calculate(stat = "mean")
# this is the same as mean(mtcars$mpg)
The observed statistic is about 20.09, the mean mpg in our sample. Now, we want to compare this statistic to a null distribution, generated under the assumption that the true mean was actually 21, to get a sense of how likely it would be for us to see this observed mean if the true mean mpg in the population were really 21.
We can generate the null distribution using the bootstrap. In the bootstrap, for each replicate, a sample of size equal to the input sample size is drawn (with replacement) from the input sample data. This allows us to get a sense of how much variability we’d expect to see in the entire population so that we can then understand how unlikely our sample mean would be.
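As an illustration of the resampling mechanism (a base-R sketch of the idea, not infer’s exact internals; the object names are ours), a bootstrap of the sample mean looks like this:
set.seed(1)
boot_means <- replicate(1000, mean(sample(mtcars$mpg, replace = TRUE)))  # resample with replacement
quantile(boot_means, c(0.025, 0.975))   # a rough bootstrap interval for the mean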
# generate the null distribution
null_dist_1_sample <- mtcars %>%
specify(response = mpg) %>%
hypothesize(null = "point", mu = 21) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
null_dist_1_sample
To get a sense for what these distributions look like, and where our observed statistic falls, we can use visualize():
# visualize the null distribution and test statistic!
null_dist_1_sample %>%
visualize() +
shade_p_value(observed_statistic,
direction = "two-sided")It looks like our observed mean of 41.382 would be relatively unlikely if the true mean was actually 40 hours a week. More exactly, we can calculate the p-value:
# calculate the p value from the test statistic and null distribution
p_value_1_sample <- null_dist_1_sample %>%
get_p_value(obs_stat = observed_statistic,
direction = "two-sided")
p_value_1_sample
This p-value approximates the probability that, if the true mean mpg were really 21, we would see a sample mean as or more extreme than our observed value of about 20.09.
Analogously to the steps shown above, the package supplies a wrapper function, t_test, to carry out 1-sample t-tests on tidy data. Rather than using randomization, the wrappers carry out the theory-based t-test. The syntax looks like this:
t_test(mtcars, response = mpg, mu = 21)
Two Sample T-Test
Mann Whitney Test
Welch Test
To compare outcomes in experiments, we often use Student’s t-test. It assumes that data are randomly selected from the population, arrive in large samples (>30), or are normally distributed with equal variances between groups. If we do not happen to meet these assumptions, we may use one of the simulation-based tests. Here we introduce the permutation test. Rather than assuming an underlying distribution, the permutation test builds its own distribution by breaking up the associations between or among groups. Often we are interested in the difference in means or medians between the groups, and the null hypothesis is that there is no difference. We can ask: out of all the possible permutations, how extreme would our data look? All possible permutations would represent a theoretical distribution. In practice, there is no need to perform ALL permutations to build the theoretical distribution; we instead run a reasonable number of simulations to take a sample from that distribution, usually 10,000 or 100,000.
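Before moving on to infer’s implementation in the next section, here is a minimal base-R sketch of this idea for the difference in mean mpg between transmission types (the object names are ours, and we use 10,000 shuffles):
set.seed(1)
observed_diff <- diff(tapply(mtcars$mpg, mtcars$am, mean))   # mean(am = 1) - mean(am = 0)
perm_diffs <- replicate(10000, {
  shuffled_am <- sample(mtcars$am)                           # shuffle the group labels
  diff(tapply(mtcars$mpg, shuffled_am, mean))
})
mean(abs(perm_diffs) >= abs(observed_diff))                  # two-sided permutation p-value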
6.4 Permutation Tests
Student T-Test
One Sample T-Test
Two Sample T-Test
2-sample t-tests evaluate the difference in mean values of two populations using data randomly sampled from populations that approximately follow a normal distribution. As an example, we’ll test whether cars with automatic and manual transmissions get the same miles per gallon, using the mpg and am variables from mtcars. (The infer vignette runs the analogous test on the gss data, comparing hours worked by degree status.)
In the boxplot below, the two groups look quite different: cars with manual transmissions (am = 1) tend to have higher mpg than cars with automatic transmissions (am = 0).
Note that mtcars has no missing values, but with many real datasets we would see a warning about missing values here. In that case we would want to look further into how the data were collected; it’s possible that whether or not a value is missing is related to what that value would have been.
infer’s randomization-based analogue to the 2-sample t-test is a difference in means test. We’ll start off showcasing that test before demonstrating how to carry out a theory-based t-test with the package.
As with the one-sample test, to calculate the observed difference in means, we can use specify() and calculate().
ggplot(mtcars, aes(x = am, y = mpg, fill = am)) +
geom_boxplot() +
labs(title = "Mtcars", subtitle = "Density Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_manual(values=met.brewer("Hiroshige", 2)) +
theme(legend.position = 'none')
mtcars <- mtcars %>%
mutate(am = as_factor(am))
observed_statistic <- mtcars %>%
specify(mpg ~ am) %>%
calculate(stat = "diff in means", order = c("1", "0"))
observed_statistic
Note that, in the line specify(mpg ~ am), we could have swapped this out with the syntax specify(response = mpg, explanatory = am)!
The order argument in that calculate line gives the order to subtract the mean values in: in our case, we’re taking the mean mpg of cars with manual transmissions (am = 1) minus the mean mpg of cars with automatic transmissions (am = 0); a positive difference, then, would mean that manual cars get more miles per gallon than automatics.
Now, we want to compare this difference in means to a null distribution, generated under the assumption that mpg has no relationship with transmission type, to get a sense of how likely it would be for us to see this observed difference in means if there were really no relationship between these two variables.
We can generate the null distribution using permutation, where, for each replicate, each value of am will be randomly reassigned (without replacement) to a new mpg value in the sample in order to break any association between the two.
null_dist_2_sample <- mtcars %>%
specify(mpg ~ am) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("1", "0"))
null_dist_2_sample
To get a sense for what these distributions look like, and where our observed statistic falls, we can use visualize().
# visualize the randomization-based null distribution and test statistic!
null_dist_2_sample %>%
visualize() +
shade_p_value(observed_statistic,
direction = "two-sided")It looks like our observed statistic of 1.5384 would be unlikely if there was truly no relationship between degree status and number of hours worked. More exactly, we can calculate the p-value; theoretical p-values are not yet supported, so we’ll use the randomization-based null distribution to do calculate the p-value.
# calculate the p value from the randomization-based null
# distribution and the observed statistic
p_value_2_sample <- null_dist_2_sample %>%
get_p_value(obs_stat = observed_statistic,
direction = "two-sided")
p_value_2_sample
Thus, if there were really no relationship between mpg and transmission type, the probability that we would see a difference in means as or more extreme than about 7.24 is very small, close to zero in our 1,000 permutations.
Note that, similarly to the steps shown above, the package supplies a wrapper function, t_test, to carry out 2-sample t-tests on tidy data. The syntax looks like this:
t_test(x = mtcars,
formula = mpg ~ am,
order = c("1", "0"),
alternative = "two-sided")In the above example, we specified the relationship with the syntax formula = hours ~ college; we could have also written response = hours, explanatory = college.
An alternative approach to the t_test() wrapper is to calculate the observed statistic with an infer pipeline and then supply it to the pt function from base R. We can calculate the statistic as before, switching out the stat = “diff in means” argument with stat = “t”. (The example below follows the infer vignette and uses its gss dataset.)
# calculate the observed statistic
observed_statistic <- gss %>%
specify(hours ~ college) %>%
hypothesize(null = "point", mu = 40) %>%
calculate(stat = "t", order = c("degree", "no degree")) %>%
dplyr::pull()
Note that this pipeline to calculate an observed statistic includes hypothesize() since the t statistic requires a hypothesized mean value.
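To finish the theory-based route sketched above, the t statistic can be converted to a two-sided p-value with pt(). The degrees of freedom below use the pooled n - 2 convention, which is our own assumption rather than infer’s documented choice.
2 * pt(abs(observed_statistic), df = nrow(gss) - 2, lower.tail = FALSE)  # two-sided p-value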
Mann Whitney Test
Welch Test
A permutation test (also called a re-randomization test) is an exact test, a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under all possible rearrangements of the observed data points. Permutation tests are, therefore, a form of resampling. In other words, the method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved from the works of Ronald Fisher and E. J. G. Pitman in the 1930s.
What is a standard normal distribution? The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one. If X follows a standard normal distribution, its expected value is 0, and the squared value Q = X2 follows a chi-squared distribution with one degree of freedom.
\(V = X_1^2 + X_2^2 + \cdots + X_m^2 \sim \chi^2(m)\)
The p-value here is the probability of getting a value larger than a given amount under this distribution.
The distribution of the chi-square statistic is called the chi-square distribution. The chi-square distribution is defined by the following probability density function:
\(Y = Y_0 \, (\chi^2)^{(v/2) - 1} \, e^{-\chi^2/2}\)
where Y0 is a constant that depends on the number of degrees of freedom, χ2 is the chi-square statistic, v = n - 1 is the number of degrees of freedom, and e is a constant equal to the base of the natural logarithm system (approximately 2.71828). Y0 is defined so that the area under the chi-square curve is equal to one.
The chi-square distribution has the following properties:
- The mean of the distribution is equal to the number of degrees of freedom: μ = v.
- The variance is equal to two times the number of degrees of freedom: σ2 = 2v.
- When the degrees of freedom are greater than or equal to 2, the maximum value for Y occurs when χ2 = v - 2.
- As the degrees of freedom increase, the chi-square curve approaches a normal distribution.
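We can check these properties numerically with simulated draws (v = 4 here is an arbitrary choice of ours):
set.seed(1)
v <- 4
draws <- rchisq(100000, df = v)
mean(draws)   # close to v
var(draws)    # close to 2 * v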
What is a chi-squared distribution? In probability theory and statistics, the chi-squared distribution (also chi-square or χ2-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
# 1 No significant difference on Engine type i.e. vs and Transmission i.e. am
# 2 Significant difference on gear and carburetors
# 3 Significant difference cylinders and gear
attach(mtcars)
vsAm <- table(vs, am)
vsAm
## am
## vs 0 1
## 0 12 6
## 1 7 7
cylGear <- table(cyl,gear)
cylGear
## gear
## cyl 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
gearCarb <- table(gear,carb)
gearCarb
## carb
## gear 1 2 3 4 6 8
## 3 3 4 3 5 0 0
## 4 4 4 0 4 0 0
## 5 0 2 0 1 1 1
chisq.test(vsAm)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: vsAm
## X-squared = 0.34754, df = 1, p-value = 0.5555
# The p-value of the above test is greater than 0.05. This indicates that vs and am are independent, so we fail to reject the null hypothesis of independence.
chisq.test(gearCarb)
##
## Pearson's Chi-squared test
##
## data: gearCarb
## X-squared = 16.518, df = 10, p-value = 0.08573
# The p-value of the above test is greater than 0.05, so we fail to reject the null hypothesis: gear and carb appear to be independent of each other.
chisq.test(cylGear)
##
## Pearson's Chi-squared test
##
## data: cylGear
## X-squared = 18.036, df = 4, p-value = 0.001214
# The p-value of the above test is less than 0.05, so we reject the null hypothesis and conclude that cyl and gear are not independent of each other.
########################
# difference between as.factor and as_factor
mtcars$vs <- as_factor(mtcars$vs)
mtcars$am <- as_factor(mtcars$am)
mtcars$cyl <- as_factor(mtcars$cyl)
mtcars$gear <- as_factor(mtcars$gear)
# calculate the observed statistic
observed_indep_statistic <- mtcars %>%
specify(response = cyl, explanatory = gear) %>%
hypothesize(null = "independence") %>%
calculate(stat = "Chisq")
observed_indep_statistic
# generate the null distribution using randomization
null_dist_sim <- mtcars %>%
specify(cyl ~ gear) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "Chisq")
# generate the null distribution by theoretical approximation
# null_dist_theory <- mtcars %>%
# specify(cyl ~ gear) %>%
# assume(distribution = "Chisq")
# visualize the null distribution and test statistic!
null_dist_sim %>%
visualize() +
shade_p_value(observed_indep_statistic,
direction = "greater")# visualize the theoretical null distribution and test statistic!
# mtcars %>%
# specify(cyl ~ gear) %>%
# assume(distribution = "Chisq") %>%
# visualize() +
# shade_p_value(observed_indep_statistic,
# direction = "greater")
# visualize both null distributions and the test statistic!
null_dist_sim %>%
visualize(method = "both") +
shade_p_value(observed_indep_statistic,
direction = "greater")# calculate the p value from the observed statistic and null distribution
p_value_independence <- null_dist_sim %>%
get_p_value(obs_stat = observed_indep_statistic,
direction = "greater")
p_value_independence
pchisq(observed_indep_statistic$stat, 5, lower.tail = FALSE)
## X-squared
## 0.002901174
# https://infer.netlify.app/reference/assume.html
7 Probability
8 Linear Algebra
Before considering a section on linear regression, it’s useful to
first review concepts from linear algebra. Remember that the heart of
linear algebra is solving a system of linear equations. We will see in
the section on linear regression how linear algebra is important: The
lm function in R uses linear algebra in order to find the
slope and intercept of the line of best fit.
First, we will start by building a straight line between two points. Every straight line is entirely defined by exactly two parameters: the slope and the intercept. We may remember the following equation from geometry or algebra:
\[ \begin{aligned} y = mx + c \end{aligned} \]
This is sometimes called the slope-intercept form of a straight line and it represents a general equation for a straight line where m is the steepness or gradient of the line, and c is the y-intercept, meaning the point where the line crosses the y-axis.
In the following visualization, we have kept only two observations from our dataset: We have kept the information for the Ford Pantera L and the Volvo 142E. For this exercise, we are keeping only two observations in order to illustrate how to draw a line between only two and exactly two points.
mtcars %>%
filter(car == "Ford Pantera L" | car == "Volvo 142E") %>%
ggplot(aes(x = wt, y = mpg)) +
labs(title = "Mtcars", subtitle = "y = mx + c") +
geom_point(alpha = 0.3) +
theme(plot.title = element_text(face = "bold")) +
geom_text(aes(label = car), size = 2.5, vjust = 1.5) +
geom_abline(intercept = 61.32, slope = -14.36, linetype = 'dashed', color = '#376795')
How do we find this line? To find the slope of the line we use the following equation, where, on a plane, each point has an x and a y coordinate. We take the two points, subtract the y values, and divide by the difference of the x values: rise over run.
\[ \begin{aligned} m = \frac{y_2 - y_1}{x_2 - x_1} \end{aligned} \]
slope <- (21.4 - 15.8) / (2.78 - 3.17)
Now to find the intercept we plug the slope and either one of the original points into the slope-intercept equation. Now we have an equation with only one unknown variable.
\[ \begin{aligned} y = \frac{y_2 - y_1}{x_2 - x_1} x + c \end{aligned} \]
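For instance, a minimal numeric sketch of this two-point calculation (the variable names are our own): once the slope is known, plug either point back into y = mx + c and solve for c.
slope <- (21.4 - 15.8) / (2.78 - 3.17)    # approximately -14.36
intercept <- 21.4 - slope * 2.78          # c = y - m * x, approximately 61.32
c(slope = slope, intercept = intercept)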
# intercept <- ( (21.4 - 15.8) / (2.78 - 3.17)) * x + c
Geometry of Linear Equations
Let’s try something else. Let’s say, for example, that we only had two cars in our dataset: the Volvo 142E and the Ferrari Dino. This time, however, we run an experiment by adding weight to each car and measuring the mpg at a new point in time. We are asking the question: when might the Volvo and Ferrari have the same wt and the same mpg? We can use linear algebra to answer this question by creating a system of linear equations. We now have two lines, one per car, and we want to find the point at which they intersect.
By using the above equations we are able to find the slope and
intercept of both lines. Now we have two separate lines. Our goal here
is to solve these two linear equations by finding the x and
y value that works for both equations. This is a good place
to start because linear combination is the most fundamental operation of
linear algebra.
y = -14.36x + 61.32
y = -0.4651x + 20.9884
Our first step is to solve the first equation for x.
x = 61.32 / 14.36 - y / 14.36
Our next step is to plug our new x into the second
equation.
y + 0.4651 * (4.270195 - y/14.36) = 20.9884
This gives us an answer for y.
y = 19.638283
Our next step is to plug that y back into our first
equation.
19.638283 = -14.36x + 61.32
-14.36x = 19.638283 - 61.32
x ≈ 2.90
Let’s look at our original equations and see if this works.
-14.36(2.902691) + 61.32 ≈ 19.64
-0.4651(2.902691) + 20.9884 ≈ 19.64
This time we can draw a line with the stat_function
argument. stat_function is not used to solve equations but
to draw a function.
# (-14.36 * 2.902691) + 61.32
# (-0.4651 * 2.902691) + 20.9884
ggplot(mtcars44, aes(x = wt, y = mpg)) +
labs(title = "Mtcars", subtitle = "Linear Algebra") +
geom_point(alpha = 0.3) +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
# geom_abline(intercept = 36.72, slope = -6, linetype = 'dashed', color = '#376795') +
geom_text(aes(label = car), size = 2.5, vjust = 1.5) +
geom_abline(intercept = 61.32, slope = -14.36, linetype = 'dashed', color = '#376795') +
geom_abline(intercept = 20.9884, slope = -0.4651, linetype = 'dashed', color = '#376795') +
geom_point(aes(x = 2.902691, y = 19.638283)) +
stat_function(fun = function(x) { (61.32) - 14.36 * x }, color = 'orange') +
stat_function(fun = function(x) { (20.9884) - 0.4651 * x }, color = 'orange')
We can also use an R package that will solve these linear equations for us.
# install.packages('PlaneGeometry')
library(PlaneGeometry)
line1 <- Line$new(A = c(3.845,19.2), B = c(2.770,19.7))
line2 <- Line$new(A = c(3.170,15.8), B = c(2.780,21.4))
intersectionLineLine(line1, line2)
## [1] 2.902691 19.638283
What does this tell us? This tells us the point at which both cars might have the same wt and same mpg.
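Base R can solve the same system directly with solve(); here is a small sketch (the names A and b are ours), rewriting each line with the x and y terms on the left.
A <- matrix(c(14.36, 0.4651, 1, 1), nrow = 2)   # coefficients of x and y in each equation
b <- c(61.32, 20.9884)
solve(A, b)                                     # approximately x = 2.90, y = 19.64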
Now let’s try to write these equations in matrix notation. We look at these two equations but rearrange the terms so that we have a left-hand side and a right-hand side.
However, there’s another way of looking at this. Let’s think about the geometry of linear equations.
Let’s consider these two linear equations and write them in matrix form like this:
2x - y = 0
-x + 2y = 3
\[ \begin{aligned} \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix} \end{aligned} \]
Our linear equation is solved at the point that these two lines cross.
ggplot(mtcars, aes()) +
labs(title = "Mtcars", subtitle = "Linear Algebra") +
theme(plot.title = element_text(face = "bold")) +
# geom_abline(intercept = 36.72, slope = -6, linetype = 'dashed', color = '#376795') +
stat_function(fun = function(x) { 2 * x }, color = 'orange') +
stat_function(fun = function(x) { (3/2) + (x / 2)}, color = 'orange') +
xlim(-3, 3) + ylim(-6, 6)
We can also consider the equation like this:
\[ \begin{aligned} x \begin{bmatrix} 2 \\ -1 \end{bmatrix} + y \begin{bmatrix} -1 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix} \end{aligned} \]
When we consider these as two vectors from the origin, there is a way of
combining these vectors to get to that point of intersection we saw
before. One of the first vector and two of the second will get us to our
answer of 0, 3.
ggplot(mtcars, aes()) +
labs(title = "Mtcars", subtitle = "Linear Algebra") +
theme(plot.title = element_text(face = "bold")) +
# geom_abline(intercept = 36.72, slope = -6, linetype = 'dashed', color = '#376795') +
# geom_segment(aes(x = 0, xend = -1, y = 0, yend = 2), alpha = .2) +
geom_segment(aes(x = 0, xend = 2, y = 0, yend = -1), alpha = .2) +
geom_segment(aes(x = 2, xend = 1, y = -1, yend = 1), alpha = .2, color = 'blue') +
geom_segment(aes(x = 1, xend = 0, y = 1, yend = 3), alpha = .2, color = 'blue') +
xlim(-3, 3) + ylim(-6, 6)
\[ \begin{aligned} Ax = b \end{aligned} \]
Elimination with Matrices
What if we are solving three linear equations? Linear algebra is helpful here.
x + 2y + z = 2
3x + 8y + z = 12
4y + z = 2
Instead of plotting two lines and finding the intersection, we could create a graph of three planes. Three planes would intersect in a point. (Two planes would intersect in a line.)
Let’s write our equations as a matrix.
Let’s start with a matrix containing only the left-hand side, and then create an elimination matrix. The diagonal entries of the resulting upper triangular matrix are called the pivots.
These matrices might be labeled A -> E -> U
\[ \begin{aligned} \begin{bmatrix} 1 & 2 & 1 \\ 3 & 8 & 1 \\ 0 & 4 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 2 & -2 \\ 0 & 4 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 2 & -2 \\ 0 & 0 & 5 \end{bmatrix} \end{aligned} \]
Let’s include the right-hand side. We will call this an augmented matrix.
\[ \begin{aligned} \begin{bmatrix} 1 & 2 & 1 & 2 \\ 3 & 8 & 1 & 12 \\ 0 & 4 & 1 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 & 2 \\ 0 & 2 & -2 & 6 \\ 0 & 4 & 1 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 & 2 \\ 0 & 2 & -2 & 6 \\ 0 & 0 & 5 & -10 \end{bmatrix} \end{aligned} \]
Now we have a different set of equations that we can solve.
x + 2y + z = 2
2y - 2z = 6
5z = -10
We can see that z = -2 and, moving upwards, through a process called back substitution, we find that y = 1 and finally, x = 2.
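We can check this back-substitution result with R’s solve(), which performs the elimination for us; A and b below are our own names for the coefficient matrix and right-hand side.
A <- matrix(c(1, 3, 0,
              2, 8, 4,
              1, 1, 1), nrow = 3)   # matrix() fills by column
b <- c(2, 12, 2)
solve(A, b)                          # returns x = 2, y = 1, z = -2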
Multiplication and Inverse Matrices
Not all matrices have an inverse.
If A^-1 (A inverse) exists, how do we find it? It is really central to know whether it exists.
\[ \begin{aligned} \begin{bmatrix} & & \\ - & - & - \\ & & \end{bmatrix} \begin{bmatrix} & - & \\ & - & \\ & - & \end{bmatrix} = \begin{bmatrix} & & \\ & - & \\ & & \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} \begin{bmatrix} & & \\ - & - & - \\ & & \end{bmatrix} \begin{bmatrix} - & - & - \\ - & - & - \\ - & - & - \end{bmatrix} = \begin{bmatrix} & & \\ - & - & - \\ & & \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} \begin{bmatrix} - & & \\ - & & \\ - & & \end{bmatrix} \begin{bmatrix} - & - & - \\ - & - & - \\ - & - & - \end{bmatrix} = \begin{bmatrix} - & & \\ - & & \\ - & & \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix} \begin{bmatrix} 1 & 6 \end{bmatrix} = \begin{bmatrix} 2 & 12 \\ 3 & 18 \\ 4 & 24 \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} \begin{bmatrix} 2 & 7 \\ 3 & 8 \\ 4 & 9 \end{bmatrix} \begin{bmatrix} 1 & 6 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix} \begin{bmatrix} 1 & 6 \end{bmatrix} + \begin{bmatrix} 7 \\ 8 \\ 9 \end{bmatrix} \begin{bmatrix} 0 & 0 \end{bmatrix} \end{aligned} \]
Gauss-Jordan
Gauss-Jordan elimination allows us to solve two such systems of equations at once.
\[ \begin{aligned} \begin{bmatrix} 1 & 3 \\ 2 & 7 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \end{aligned} \]
Let’s deal with the right-hand side by sticking it on as an extra column. This makes for an identity matrix augmented on.
\[ \begin{aligned} \begin{bmatrix} 1 & 3 & 1 & 0 \\ 2 & 7 & 0 & 1 \end{bmatrix} \end{aligned} \]
Now when you do elimination steps to turn the left side into the
identity the right side will turn into the inverse.
\[ \begin{aligned} \begin{bmatrix} 1 & 3 & 1 & 0 \\ 2 & 7 & 0 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 3 & 1 & 0 \\ 0 & 1 & -2 & 1 \end{bmatrix} \end{aligned} \]
Now we keep going and do elimination upwards: subtract three times the second row from the first row to make the left side the identity and the right side the inverse.
\[ \begin{aligned} \begin{bmatrix} 1 & 3 & 1 & 0 \\ 2 & 7 & 0 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 3 & 1 & 0 \\ 0 & 1 & -2 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 7 & -3 \\ 0 & 1 & -2 & 1 \end{bmatrix} \end{aligned} \]
In this process, elimination took the augmented matrix [A | I] to [I | A^-1].
\[ \begin{aligned} E \, [\, A \mid I \,] = [\, I \mid A^{-1} \,] \end{aligned} \]
The second column of the inverse comes from solving the companion system:
\[ \begin{aligned} \begin{bmatrix} 1 & 3 \\ 2 & 7 \end{bmatrix} \begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \end{aligned} \]
Factorization into A = LU
Let’s look at this simple 2x2 matrix.
\[ \begin{aligned} \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix} \end{aligned} \]
Now how do we get the upper triangular / upper echelon form of this
matrix? We can create our upper triangular matrix in this way.
\[ \begin{aligned} \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix} \rightarrow \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \end{aligned} \]
However, this doesn’t complete the equation. We have to make the left and right sides equal. We have to multiply by an elimination matrix in order to get there. Here is the correct elimination matrix.
\[ \begin{aligned} \begin{bmatrix} 1 & 0 \\ -4 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \end{aligned} \]
We label our elimination matrix E21 because it creates a zero in the second row, first column of the upper triangular matrix.
\[ \begin{aligned} E_{21} A = U \end{aligned} \]
We have to rearrange the order for LU decomposition, which takes this form.
\[ \begin{aligned} A = LU \end{aligned} \]
To know what L is in this case we have to know how L is related to E. L is the inverse of E. It has a 4 and not a -4 because it adds back what the E removed. U stands for upper triangular and L stands for lower triangular. The U has the pivots on the diagonal and the L has 1s on the diagonal.
\[ \begin{aligned} \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} \end{aligned} \]
Now we move up to a 3x3 matrix. With a 3x3 matrix the elimination steps are a little more involved.
\[ \begin{aligned} E_{32} E_{31} E_{21} A = U \end{aligned} \]
To invert these elimination matrices, we take the inverse of each one and put them in the opposite order.
\[ \begin{aligned} A = E_{21}^{-1} E_{31}^{-1} E_{32}^{-1} U \end{aligned} \]
This is equivalent to LU because L is the product of those inverses.
\[ \begin{aligned} A = LU \end{aligned} \]
The entries of L below the diagonal are the multipliers used during elimination.
Transposes, Permutations, Spaces R^n
Column Space and Nullspace
Solving Ax = 0: Pivot Variables, Special Solutions
Solving Ax = b: Row Reduced Form R
Independence, Basis, and Dimension
Vectors v1 through vl span a space when the space consists of all linear combinations of those vectors. If we let S be the space they span, in other words the space containing all of their combinations, then S is the smallest space with those vectors in it, because any space containing those vectors must contain all of their combinations. To span a space is to take all linear combinations of the vectors and collect them into a space.
Are all of these columns independent? Maybe yes, maybe no. We are especially interested in a set of vectors that both spans the space and is independent.
The columns of a matrix span the column space.
A basis is a set of independent vectors that spans the space, and a space has many bases.
There are many bases, but they all have the same number of vectors. If we are talking about R3, that number of vectors is 3, the dimension of the space.
The Four Fundamental Subspaces
Matrix Spaces; Rank 1; Small World Graphs
Graphs, Networks, Incidence Matrices
Orthogonal Vectors and Subspaces
Projections onto Subspaces
Projection Matrices and Least Squares
Orthogonal Matrices and Gram-Schmidt
QR Decomposition
\[ \begin{aligned} y = C + Dt \end{aligned} \]
If we wrote the equation for the line through the first point, it would be C + D = 1. Here are the equations for all three points:
C + D = 1
C + 2D = 2
C + 3D = 2
These three equations don’t have a solution. There’s no line that goes through all these points.
\[ \begin{aligned} Ax = b \end{aligned} \]
\[ \begin{aligned} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} C \\ D \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} A^{\top} A \hat{x} = A^{\top} b \end{aligned} \]
\[ \begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \left[ \begin{array}{cc|c} 1 & 1 & 1 \\ 1 & 2 & 2 \\ 1 & 3 & 2 \end{array} \right] = \left[ \begin{array}{cc|c} 3 & 6 & 5 \\ 6 & 14 & 11 \end{array} \right] \end{aligned} \]
\[ \begin{aligned} 3C + 6D = 5 \end{aligned} \]
\[ \begin{aligned} 6C + 14D = 11 \end{aligned} \]
Now we solve by elimination: subtracting twice the first equation from the second leaves
\[ \begin{aligned} 2D = 1 \end{aligned} \]
so that
\[ \begin{aligned} D = 1/2 \end{aligned} \]
\[ \begin{aligned} C = 2/3 \end{aligned} \]
\[ \begin{aligned} y = C + Dt \end{aligned} \]
Now let’s repeat the same idea with three cars from our dataset, using wt in the role of t and mpg in the role of y. Here are the equations for the three points:
C + 3.17D = 15.8
C + 2.77D = 19.7
C + 2.78D = 21.4
These three equations don’t have a solution. There’s no line that goes through all these points.
\[ \begin{aligned} Ax = b \end{aligned} \]
\[ \begin{aligned} \begin{bmatrix} 1 & 3.17 \\ 1 & 2.77 \\ 1 & 2.78 \end{bmatrix} \begin{bmatrix} C \\ D \end{bmatrix} = \begin{bmatrix} 15.8 \\ 19.7 \\ 21.4 \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} A^{\top} A \hat{x} = A^{\top} b \end{aligned} \]
The entries of \(A^{\top}A\) are sums over the data: for example, the (2, 2) entry is \(3.17^2 + 2.77^2 + 2.78^2 = 10.0489 + 7.6729 + 7.7284 = 25.4502\).
\[ \begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 3.17 & 2.77 & 2.78 \end{bmatrix} \begin{bmatrix} 1 & 3.17 \\ 1 & 2.77 \\ 1 & 2.78 \end{bmatrix} = \begin{bmatrix} 3 & 8.72 \\ 8.72 & 25.4502 \end{bmatrix} \end{aligned} \]
The right-hand side entries are \(15.8 + 19.7 + 21.4 = 56.9\) and \(15.8(3.17) + 19.7(2.77) + 21.4(2.78) = 50.086 + 54.569 + 59.492 = 164.147\).
\[ \begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 3.17 & 2.77 & 2.78 \end{bmatrix} \left[ \begin{array}{cc|c} 1 & 3.17 & 15.8 \\ 1 & 2.77 & 19.7 \\ 1 & 2.78 & 21.4 \end{array} \right] = \left[ \begin{array}{cc|c} 3 & 8.72 & 56.9 \\ 8.72 & 25.4502 & 164.147 \end{array} \right] \end{aligned} \]
\[ \begin{aligned} 3C + 8.72D = 56.9 \end{aligned} \]
\[ \begin{aligned} 8.72C + 25.4502D = 164.147 \end{aligned} \]
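As a sketch, we can solve these normal equations numerically and compare the result with lm() fit to the same three points; the object names here are our own, and the two answers should agree up to rounding.
AtA <- matrix(c(3, 8.72, 8.72, 25.4502), nrow = 2)
Atb <- c(56.9, 164.147)
solve(AtA, Atb)                                   # intercept C and slope D
three_cars <- data.frame(wt = c(3.17, 2.77, 2.78), mpg = c(15.8, 19.7, 21.4))
coef(lm(mpg ~ wt, data = three_cars))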
QR Decomposition
What is really going on, in mathematical terms, when we create an ordinary least squares regression estimate? The ordinary least squares regression line can be calculated exactly using matrix algebra, or else it can be approximated using an iterative technique. The following formula is an equation for finding the least squares estimator using notation familiar to someone versed in linear algebra.
\[ \begin{aligned} \hat{{\beta}}=\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{y} \end{aligned} \]
This is equal to
\[ \begin{aligned} X^{\top} X \hat{\beta} = X^{\top} y \end{aligned} \]
Which is equal to
\[ \begin{aligned} R^{\top} Q^{\top} Q R \hat{\beta} = R^{\top} Q^{\top} y \end{aligned} \]
Since Q is orthogonal, \(Q^{\top} Q = I\), so this reduces to
\[ \begin{aligned} R \hat{\beta} = Q^{\top} y \end{aligned} \]
Understanding the math behind a linear regression in a general form requires some understanding of linear algebra.
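As a sketch of these formulas in R, we can compute the coefficients for mpg ~ wt three ways: the normal equation, a QR-based solve, and lm() itself; all three should agree. The names X and y below are ours.
X <- cbind(1, mtcars$wt)                 # design matrix with an intercept column
y <- mtcars$mpg
solve(t(X) %*% X) %*% t(X) %*% y         # normal equation: (X'X)^(-1) X'y
qr.solve(X, y)                           # least-squares solution via QR
coef(lm(mpg ~ wt, data = mtcars))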
9 Simple Linear Regression
We start our exploration of linear regression by looking at simple linear regression. We define simple linear regression as regression that takes exactly one independent variable, also called the explanatory variable, and exactly one dependent variable, also called the response variable. In simple linear regression, this dependent variable is continuous. Following convention, on a Cartesian coordinate plane we represent our independent variable on the x axis and our dependent variable on the y axis.
Linear regression can, of course, take more than one independent variable but we start with the simple case of using just one independent variable and one dependent variable in order to learn and build our intuition.
Before fitting any linear regression model, it’s often useful to
first create a scatterplot of our variables without including the line
of best fit. Creating a scatterplot without the line of best fit gives
us an opportunity to test our intuition. We can see in the below
visualizations that wt and mpg have an inverse
relationship, meaning that as the wt of the car
increases, the mpg tends to decrease, and vice versa. Maybe
this conclusion is not surprising. But this does open the door to a
variety of more nuanced questions. Some examples include:
- How well would a linear model explain the variability that we would see in miles per gallon?
- Is a linear model even a good fit, given the curvature, especially at the far end of the x-axis with heavier engines and lower miles per gallon?
- Are there confounding variables, such as the size of the engine, referred to as displacement, independent of weight, that better predict the miles per gallon?
For experienced data analysts, it’s hard to see a scatterplot and not
imagine a line going through it. By creating a scatterplot, we already
start to imply a relationship. But which line or estimator should we
use? The most common way to do simple linear regression is through
ordinary least squares estimation, but other options exist: The
following visualization shows four
different estimators for the relationship of wt to
mpg in the mtcars dataset. This is not an
exhaustive list. Notice that the slope and intercept are different for
each estimator.
- Ordinary Least Squares (OLS)
- Median Absolute Deviation (MAD)
- Least Median Squares (LMS)
- Theil-Sen
slope and intercept
We reviewed in the last chapter that a straight line is defined by exactly two parameters: the slope and intercept. The following equation shows the slope-intercept form of a straight line where m is the gradient of the line and c is the y-intercept.
\[ \begin{aligned} y = mx + c \end{aligned} \]
When we have more than two points, however, we need to come up with a different way of calculating our line of best fit. It is theoretically possible that the line of best fit will go through all the points but most likely it won’t.
As we can see in the above visualization, there are different kinds of estimators that we could use in order to find the line of best fit. Each estimator has its own strengths and weaknesses. Making an argument for any particular estimator will require looking at the particular features of a data set, and also incorporating domain knowledge.
Regardless of which estimator we end up using, our equation for our
estimator is often written as this linear equation:
y' = b0 + b1x where b0 is the y-intercept and
b1x is the slope. The equation is written in this way to
distinguish it from the slope-interecept form of a straight line which
is more commonly used in geometry or algebra. This regression equation
shows, on the left-hand side of the equation y' meaning the
predicted value of y, and the b0 and
b1 terms are beta terms referring to parameter estimates;
b0 and b1 are often referred to as model
coefficients.
The following visualization shows the conceptual difference between the slope-intercept form of the line, on the left, and the regression equation, on the right.
9.1 Building a Model
Ordinary Least Squares
We start with ordinary least squares regression, or OLS regression for short. We start with this estimator because it’s by far the most common estimator and, as we will see, the OLS estimator in particular has some nice properties, including the fact that the least squares objective is differentiable and that there is a well-studied relationship with maximum likelihood estimation, in that, when the errors are normally distributed, the OLS estimate is the maximum likelihood estimate.
When we create a visualization for a linear regression, we often
include elements from descriptive or inferential statistics, such as
standard error or labeling more extreme points. In the following
four-paneled visualization, we can see the same relationship
mpg ~ wt visualized in four different ways.
Closed Form Equations for Slope and Intercept
How do we find the slope and intercept of our simple ordinary least squares regression line? R will calculate the slope and intercept by calling compiled FORTRAN linear algebra routines (LINPACK) and performing what is called a QR decomposition, a factorization often explained via the Gram-Schmidt process.
Before reviewing any of the linear algebra derivation, however, we can first look at shortcut equations to find the slope and intercept. As it turns out, in the case of simple linear regression for an ordinary least squares estimate, meaning regression with only one independent and one dependent variable, the slope and intercept follow neat closed-form equations: the slope can be calculated by multiplying the correlation r by the quotient of standard deviation of y over standard deviation of x. In the below equation, a refers to the slope and sy and sx refer to the standard deviation of y and the standard deviation of x, respectively.
\[ \begin{aligned} a = r * \frac{sy}{sx} \end{aligned} \]
The intercept of the line of best fit for ordinary least squares simple linear regression can be calculated easily after we calculate the slope of the line of best fit. We do this by multiplying the slope of the line of best fit by the mean of x and subtracting the result from the mean of y. In the equation below, i refers to the y-intercept and the straight line over the x and y values is a way of referring to the mean of x and y respectively; we refer to these terms as x-bar and y-bar.
\[ \begin{aligned} i = \bar{y} - r* \frac{sy}{sx} * \bar{x} \end{aligned} \]
The following code chunk shows how to calculate the slope and intercept using our mtcars dataset with 32 observations.
r_slope <- ( sd(mtcars$mpg) / sd(mtcars$wt) ) * cor(mtcars$mpg, mtcars$wt)
r_intercept <- mean(mtcars$mpg) - r_slope * mean(mtcars$wt)
We can see that the r_intercept, by which we mean the y-intercept for when x is zero, is 37.285. We also see that the r_slope is -5.344, meaning that the line moves downward 5.344 units for every one unit along the x-axis.
In the interest of being thorough, we can explore alternative ways of writing these equations. Remember that the standard deviation is the same as the square root of the variance, so instead of referring to the standard deviation of y and the standard deviation of x, we could also refer to the square root of the variance of y and the square root of the variance of x. The variance, we remember, is the average of the squared deviations from the mean; this is calculated by taking each value in the distribution, subtracting the mean, squaring the result, summing those squares, and dividing by the number of observations (or by n - 1 for the sample variance).
In the above equation for the slope, a, we could also write the sy and sx in terms of the standard deviation and we could also write out the longer form equation for the Pearson correlation. We could then cross-multiply and simplify the equations by removing common terms and end up with the following set of equations.
\[ \begin{aligned} a = \frac{\sum{(x_i - \bar{x})(y_i-\bar{y})}}{\sum{(x_i-\bar{x}})^2} \end{aligned} \]
\[ \begin{aligned} i = \bar{y} - \frac{\sum{(x_i - \bar{x})(y_i-\bar{y})}}{\sum{(x_i-\bar{x}})^2} * \bar{x} \end{aligned} \]
The following code chunk shows the R code version of these longer
equations for the r-slope and the r-intercept without relying much on
the in-built functions in R. There is no practical reason in a situation
like this for calculating either the variance or the correlation
coefficient by hand. Using in-built functions for standard deviation,
variance, and correlation, sd(), var(), and
cor(), respectively, is preferred in order to save time and
prevent mistakes; using variables in this way is a lot of the reason why
we lean on programming languages such as R. Otherwise, our code is
unwieldy. The longer forms are included below simply to clarify the
mathematical process and see what is happening under the hood.
r_slope <-
(sqrt(sum((mtcars$mpg - sum((mtcars$mpg) / length(mtcars$mpg))) ^ 2)
/ (length(mtcars$mpg) - 1)) / sqrt(sum((mtcars$wt - mean(mtcars$wt)) ^ 2) /
(length(mtcars$wt) - 1)) ) * ((length(mtcars$mpg) * sum(mtcars$wt * mtcars$mpg))
- (sum(mtcars$wt) * sum(mtcars$mpg))) / sqrt(((length(mtcars$mpg) *
sum(mtcars$wt ^ 2)) - (sum(mtcars$wt) ^ 2)) *
((length(mtcars$mpg) * sum(mtcars$mpg ^ 2)) - (sum(mtcars$mpg) ^ 2)))
r_intercept <-
(sum(mtcars$mpg) / length(mtcars$mpg)) - (sqrt(sum((mtcars$mpg - sum((mtcars$mpg)
/ length(mtcars$mpg))) ^ 2) / (length(mtcars$mpg) - 1)) /
sqrt(sum((mtcars$wt - sum((mtcars$wt) / length(mtcars$wt))) ^ 2) /
(length(mtcars$wt) - 1)) ) * ((length(mtcars$mpg) *
sum(mtcars$wt * mtcars$mpg)) - (sum(mtcars$wt) * sum(mtcars$mpg))) /
sqrt(((length(mtcars$mpg) * sum(mtcars$wt ^ 2)) -
(sum(mtcars$wt) ^ 2)) * ((length(mtcars$mpg) * sum(mtcars$mpg ^ 2)) -
(sum(mtcars$mpg) ^ 2))) * (sum(mtcars$wt) / length(mtcars$wt))
Notice in the preceding code that we use (length(mtcars$mpg) - 1) in our
code for calculating both the r_slope and the r_intercept. This is known
as Bessel’s correction: we divide by n - 1 rather than n so that the
sample variance is an unbiased estimate of the population variance. The
calculation for standard deviation in R’s sd() uses Bessel’s correction,
but the default standard deviation in Python’s NumPy does not. By
deconstructing what is happening under the hood of the function, we have
learned an important difference between R and Python!
Now that we have calculated the slope and the intercept of our simple
regression line using the ordinary least squares estimator, we can now
use the geom_abline() layer of ggplot() to add
a line with the r-slope and r-intercept. These are the numbers we
calculated by hand in our equation for r-slope and r-intercept. Later,
we will see how we can obtain these numbers by looking at the summary of
our linear model object which comes in the form of a specialized list.
We choose to specify the linetype argument as
dashed. The commented-out part of the code is an alternate
way of writing the geom_abline() layer.
Center of Mass
The ordinary least squares regression line will always pass through the center of mass of the points. In other words, if we were to draw a point that represented the mean of x and the mean of y, our ordinary least squares regression line is guaranteed to go through this point. We can illustrate this by drawing a horizontal line connecting each point to the mean of x and a vertical line connecting each point to the mean of y. If we add up the total distances in each case and divide by the total number of points, we find ourselves at the mean of x and the mean of y.
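A quick check of this center-of-mass property, assuming only that lm() has been fit to mpg ~ wt (the object name fit is ours):
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = mean(mtcars$wt)))   # equals mean(mtcars$mpg)
mean(mtcars$mpg)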
The Normal Equation from Linear Algebra
In the case of simple linear regression, we can calculate the slope and intercept using neat closed-form equations involving the standard deviations and the Pearson correlation. However, in the case of using more than one independent variable, the slope and intercepts have to be derived using linear algebra. For technical reasons, even in the case of simple linear regression, R will calculate the slope and intercept using linear algebra as well, because it gives a quicker, more numerically stable answer.
In this section, we will first review what is called the normal equation. The normal equation allows us to find our regression coefficients analytically. R will not solve the OLS minimization problem using the normal equation; R will instead calculate the OLS regression using QR decomposition, because QR decomposition avoids explicitly forming and inverting a matrix, which can be slow and numerically unstable. Even though R solves the problem using QR decomposition, it's useful to understand the normal equation first.
We start with a basic formula from linear algebra, the equation for multiplying a matrix by a vector. In this equation, A refers to a matrix, x refers to a vector of unknowns, and b is a vector; in our regression setting, the unknowns will be the coefficients of our linear model.
Our model matrix is also called a design matrix.
\[ \begin{aligned} Ax = b \end{aligned} \]
We can use Ax = b to solve a system of linear equations. Let's look at an example first. Take the following three linear equations.
x + 2y + z = 2
3x + 8y + z = 12
4y + z = 2
We can write these three linear equations and put them into a matrix. We will start first with the left-hand side.
\[ \begin{aligned} \begin{bmatrix} 1 & 2 & 1 \\ 3 & 8 & 1 \\ 0 & 4 & 1 \end{bmatrix} \end{aligned} \]
Now we can solve this system using elimination, specifically by applying an elimination matrix to produce an upper triangular matrix. The entries on the diagonal of the upper triangular matrix are called the pivots. The matrices in this sequence might be labeled A → E → U.
\[ \begin{aligned} \begin{bmatrix} 1 & 2 & 1 \\ 3 & 8 & 1 \\ 0 & 4 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 2 & -2 \\ 0 & 4 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 2 & -2 \\ 0 & 0 & 5 \end{bmatrix} \end{aligned} \]
As a final step, we can now include the right-hand side. We will call this an augmented matrix.
\[ \begin{aligned} \begin{bmatrix} 1 & 2 & 1 & 2 \\ 3 & 8 & 1 & 12 \\ 0 & 4 & 1 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 & 2 \\ 0 & 2 & -2 & 6 \\ 0 & 4 & 1 & 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 1 & 2 \\ 0 & 2 & -2 & 6 \\ 0 & 0 & 5 & -10 \end{bmatrix} \end{aligned} \]
We finally have an equivalent, triangular set of equations that we can solve.
x + 2y + z = 2
2y - 2z = 6
5z = -10
We can now read off the solutions, starting with the bottom equation and moving upwards through a process called back substitution.
z = -2
y = 1
x = 2
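We can check this result in R, where solve() solves Ax = b directly; a small sketch.
A <- matrix(c(1, 2, 1,
              3, 8, 1,
              0, 4, 1), nrow = 3, byrow = TRUE)
b <- c(2, 12, 2)
solve(A, b) # returns 2, 1, -2, i.e. x = 2, y = 1, z = -2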
But what do we do when we are looking at a scatter plot where
obviously no line goes through all the points? Thinking now to our
scatterplot using mtcars, we have wt on the
x-axis and mpg on the y-axis. We can plainly see that no
line will go through all the points. So the problem Ax = b
has no exact solution: b does not lie in the column space of A. What we need instead is to add an
error vector.
\[ Ax + e = b \]
It's helpful to write our linear equation in matrix notation. Doing so highlights the vector of errors, keeps our variables in the familiar y = b0 + b1x structure, and puts the y variable on the left-hand side. Writing out the first three observations:
\[ \begin{aligned} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix} \end{aligned} \]
Now, instead of finding a line that goes through all the points, we want to find the line that minimizes the total error, which means minimizing a function of the errors. We remember from matrix-vector notation that the dot product of a vector with itself is the squared norm, where the norm is the magnitude, or length, of the vector.
\[
x \cdot x = x^Tx = \|x\|^2
\]
Another way of viewing the squared norm is to multiply a vector by its own transpose. For example, we can look at the error vector multiplied by its transpose.
\[ \begin{aligned} (e^Te) = \begin{bmatrix} e & e & e \end{bmatrix} \begin{bmatrix} e \\ e \\ e \end{bmatrix} = \sum(e^2) \end{aligned} \]
Now we can rearrange the terms in Ax + e = b so that the error vector sits alone on one side: e = b - Ax. The magnitudes of the two sides must be equal.
\[ \|Ax - b\|^2 = \|e\|^2 \]
\[ \|x\| = \sqrt{\sum{x_i^2}} = \sqrt{x_1^2 + x_2^2 + x_3^2} \]
The solution to the problem is reduced to minimizing this quantity. The following formula, known as the normal equation, shows how the ordinary least squares estimator is derived from linear algebra.
\[ \begin{aligned} A^TA\hat{x} = A^Tb \end{aligned} \]
Here A refers to our design matrix, A^T is the transpose of that matrix, \hat{x} is the vector of estimated beta coefficients, and b is the vector of observed values of our response variable. After rearranging the terms to solve for \hat{x}, the formula may also be written in the following way. Here A is a matrix of full column rank, so A^TA is invertible.
\[ \begin{aligned} \hat{x} = (A^TA)^{-1}A^Tb \end{aligned} \]
What we are doing here is projecting b onto the column space of A, denoted C(A).
Here we look at the transposed version of our matrix. We use the
model.matrix() function in order to see the model matrix
that goes into the linear model.
## 1 2 3
## (Intercept) 1.00 1.00 1.00
## wt 3.17 2.77 2.78
## attr(,"assign")
## [1] 0 1
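As a sketch, a call along the following lines produces this kind of output; t() transposes the model matrix, and the column selection is only there to keep the display small.
t(model.matrix(mpg ~ wt, data = mtcars))[, 1:3]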
Here we look at the first few rows of our design matrix with the response vector, mpg, attached as an additional column.
## ones wt mpg
## [1,] 1 2.620 21.0
## [2,] 1 2.875 21.0
## [3,] 1 2.320 22.8
## [4,] 1 3.215 21.4
## [5,] 1 3.440 18.7
Here is our equation following matrix notation. To keep the arithmetic small, we show only three rows, with rounded wt values and small placeholder values in the attached response column, rather than all 32 observations. Multiplying a matrix by its transpose gives us a square matrix.
\[ \begin{aligned} \begin{bmatrix} 1 & 1 & 1 \\ 3.17 & 2.77 & 2.78 \end{bmatrix} \left[\begin{array}{cc|c} 1 & 3.17 & 1 \\ 1 & 2.77 & 2 \\ 1 & 2.78 & 2 \end{array}\right] = \left[\begin{array}{cc|c} 3 & 8.72 & 5 \\ 8.72 & 25.45 & 14.27 \end{array}\right] \end{aligned} \]
Here is the full result of the matrix multiplication over all 32 observations: the transpose of A multiplied by A, with the response vector attached.
## ones wt mpg
## ones 32.000 102.9520 642.900
## wt 102.952 360.9011 1909.753
This result means that we are solving these linear equations:
32.000x + 102.9520y = 642.900
102.952x + 360.9011y = 1909.753
Solving this pair of linear equations gives us the beta coefficients: the intercept and the slope of our line.
## [1] 37.285087 -5.344459
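As a sketch of the normal equation in action, we can build the design matrix ourselves and let solve() handle the system; the object names here are ours.
A <- cbind(ones = 1, wt = mtcars$wt) # design matrix: a column of ones and the wt column
b <- mtcars$mpg                      # response vector
solve(t(A) %*% A, t(A) %*% b)        # solves (A^T A) x = A^T b for the coefficients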
Now, solving for the coefficients through the inverse, as in (A^TA)^-1 A^Tb, turns out to be slow and numerically delicate. How do we speed up this computation? QR factorization decomposes the design matrix as A = QR, where Q has orthonormal columns (so Q^TQ = I) and R is upper triangular. This simplifies our equation above:
\[ \begin{aligned} \hat{x} = (A^TA)^{-1}A^Tb \end{aligned} \]
into a simpler equation to solve:
\[ \begin{aligned} \hat{x} = R^{-1}Q^Tb \end{aligned} \]
because
\[ A^TA = (QR)^T(QR) = R^TQ^TQR = R^TR \]
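R exposes the QR route directly through qr(), qr.coef(), and qr.solve(); a minimal sketch, reusing the A and b objects from the previous sketch.
qr_A <- qr(A)    # QR decomposition of the design matrix
qr.coef(qr_A, b) # coefficients computed as R^(-1) Q^T b
qr.solve(A, b)   # the same thing in a single call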
Ordinary Least Squares vs Total Least Squares
The OLS regression technique works by minimizing the sum of the squared differences between the observed and predicted values of the dependent variable, and it takes the form of a straight line, which we call the line of best fit.
Notice that with ordinary least squares estimation we minimize the difference between the observed and predicted values in the y direction only. Contrast this with total least squares, where we minimize the total distance including both x and y, which is to say that with total least squares we minimize the orthogonal, or perpendicular, distance. Studies have shown that, when given a scatterplot of points and asked to draw a regression line, people tend to draw something closer to the total least squares line than the ordinary least squares line.
Creating Our Model Object
We can create a linear model object in R using the lm()
function from base R.
mtcars_lm_mpg_wt <- lm(mpg ~ wt, data = mtcars)
But how does the lm() function calculate the ordinary least squares regression line? When we type ?lm into our console, we can read the documentation, which tells us that lm() fits the model using QR decomposition.
While the interface for the R user is the lm() function, R is actually calling into the C programming language and, beneath the C, into Fortran and its linear algebra package called LINPACK. Within LINPACK, the QR decomposition is calculated by Householder reflections. The derivation of the Householder reflection algorithm in the LINPACK routines is complicated and won't be explored here, nor will we explore the programming techniques from C or Fortran, which involve error catching, caching, and other such things.
It is important to know that R will always use matrix algebra to calculate the slope of the ordinary least squares regression line, even though simple closed-form equations exist to find the slope. The linear algebra route is more numerically accurate than the closed-form equations, because the closed forms involve sums of squared terms that can lose precision in floating point arithmetic.
So far, we have calculated the slope and intercept by hand by looking
at the correlation coefficient and the standard deviations of the
y and x variables. As we will see, calculating the
r-slope and the r-intercept is not the same as creating a linear model
object within the R programming language. We can create a model object
using the lm() function from base R. The lm()
function takes two arguments: formula and
data. The output of the model object and the summary of the
model object are printed in the console as specialized lists. This
specialized list is commonly used in R so it’s worth becoming familiar
with how to read this output. We choose to give our linear model a
detailed, explicit name, recognizing that it may be more common as a
convention to use a simpler name for different objects.
mtcars_lm_mpg_wt <- lm(mpg ~ wt, data = mtcars)
We can also use the tidymodels workflow. With tidymodels, we work through a wrapper around the lm() function. For this particular use case, it seems a little needlessly wordy, but the workflow is worth getting used to. We specify the engine and the mode with set_engine() and set_mode().
mtcars_lm_mpg_wt2 <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
# mtcars_lm_mpg_wt object properties
mtcars_lm_mpg_wt2
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
# We can do more things with our linear model object.
lm_fit <- mtcars_lm_mpg_wt2 %>%
fit(mpg ~ wt, data = mtcars)
# View lm_fit properties
lm_fit
## parsnip model object
##
##
## Call:
## stats::lm(formula = mpg ~ wt, data = data)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
tidy_mtcars_lm_mpg_wt <- broom::tidy(mtcars_lm_mpg_wt)
tidy_mtcars_lm_mpg_wt_estimate <- tidy_mtcars_lm_mpg_wt$estimate
mtcars_lm_mpg_am <- lm(mpg ~ am, mtcars)
tidy_mtcars_lm_mpg_am<- tidy(mtcars_lm_mpg_am)
tidy_mtcars_lm_mpg_am_estimate <- tidy_mtcars_lm_mpg_am$estimate
regression_estimates <- rbind(tidy_mtcars_lm_mpg_wt_estimate, tidy_mtcars_lm_mpg_am_estimate)
We can use the options() function in our R session and set the scipen argument to
999. Doing this turns off the scientific notation which
makes the output more readable. By printing our linear model in the
console, we see the call and the coefficients. The coefficients have two
components: the intercept and the slope. We recognize the r-intercept
and the r-slope from our previous work calculating these values by hand:
The y-intercept for when x is zero is 37.285 and the slope of our line
is -5.344.
options(scipen = 999)
mtcars_lm_mpg_wt
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
summary(mtcars_lm_mpg_wt)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 0.0000000000000002 ***
## wt -5.3445 0.5591 -9.559 0.000000000129 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 0.0000000001294
Compare that output to the tidy() version of the same model below. The tidy() output has one row per coefficient and four numeric columns. The estimate column holds the coefficients themselves, calculated by least squares optimization. The std.error column shows the variability of the intercept and slope estimates, expressed in the same units as the corresponding coefficient. The statistic column combines the two: it is a standardized estimate measuring how many standard errors the estimate sits above zero. The p.value column tests whether the slope or intercept could plausibly be zero; the default test is two-sided. Here the p-values are tiny, so we reject zero as a plausible value for either the slope or the intercept; it is extremely unlikely that this data came from a population whose true slope or intercept is zero. If the research question is one-sided, asking only whether the association is positive or only whether it is negative, we divide the two-sided p-value by two.
mtcars_lm_mpg_wt %>% tidy()
We can also perform a simple linear regression with a discrete variable on the x axis. Take the following example.
Performing a regression to predict a continuous variable from a
discrete variable is equivalent to looking at the difference in means
between two groups. In the first case, we can use
geom_boxplot() and use the am variable as a
factor. In the second case, we can use a linear model and use the
am variable as a numeric in the model. To add
geom_smooth() to the second plot, we have to convert the
factor to a numeric datatype.
mean(mtcars$mpg[mtcars$am == 1])
## [1] 24.39231
mean(mtcars$mpg[mtcars$am == 0])
## [1] 17.14737
one <- ggplot(mtcars, aes(x = as.factor(am), y = mpg)) +
geom_boxplot() +
labs(title = "Mtcars", subtitle = "Boxplot: mpg ~ am") +
theme(plot.title = element_text(face = "bold"))
two <-ggplot(mtcars, aes(x = as.numeric(am), y = mpg)) +
geom_point() + geom_smooth(method = 'lm', se = FALSE, color = '#e76254') +
labs(title = "Mtcars", subtitle = "Regression: mpg ~ am") +
theme(plot.title = element_text(face = "bold"))
library(patchwork)
one + two
Why do we use the term coefficient for these values? In a linear model, each coefficient is the number that multiplies (or, in the case of the intercept, shifts) a term in the model equation.
We create our model. This time the output reports two coefficients: an intercept and a coefficient for one of the two values of am. The coefficient for am == 0 appears to be missing, but the number reported for the intercept is one we recognize: it is the mean value of mpg where am == 0. In other words, the intercept is the mean mpg of the am == 0 group.
But what is the meaning of the coefficient for where
am == 1? The value is 7.245. As it turns out,
the coefficient for am == 1 is calculated relative to the
intercept. 24.39231 - 17.14737 evaluates to
7.24494, which is the coefficient of am == 1
after rounding.
moderu <- lm(mpg ~ as.factor(am), mtcars)
moderu
##
## Call:
## lm(formula = mpg ~ as.factor(am), data = mtcars)
##
## Coefficients:
## (Intercept) as.factor(am)1
## 17.147 7.245
Within the lm() call, we can add + 0 after the explanatory variable, and this tells R to report the coefficients relative to zero. In other words, the linear regression model is being fit without an intercept term, which is generally not a good idea.
When we do this, we can see very clearly that the coefficients are the mean values for where am is equal to 0 and where am is equal to 1. With only one categorical explanatory variable, the linear regression coefficients are the means of each category.
moderu <- lm(mpg ~ as.factor(am) + 0, mtcars)
moderu
##
## Call:
## lm(formula = mpg ~ as.factor(am) + 0, data = mtcars)
##
## Coefficients:
## as.factor(am)0 as.factor(am)1
## 17.15 24.39
For this exercise we switch to gear, a variable with three levels rather than two.
ggplot(
data = mtcars,
aes(x = gear, y = mpg)) +
geom_point() + geom_smooth(method = 'lm', se = FALSE)
# mtcars %>%
# select(gear, mpg) %>%
# group_by(gear) %>%
# summarize_all(mean) # summarize(mpg = mean(mpg))
# mtcars %>%
# select(gear, mpg) %>%
# pivot_wider(names_from = gear, values_from = mpg)
#
# mtcars <- pivot_wider(mtcars, names_from = 'names', values_from = 'values', 2)
# We have to turn this into a factor. But what happens if we don't?
mtcars$gear <- as.factor(mtcars$gear)
modell11 <- lm(mpg ~ gear, mtcars)
modell11
##
## Call:
## lm(formula = mpg ~ gear, data = mtcars)
##
## Coefficients:
## (Intercept) gear4 gear5
## 16.107 8.427 5.273
modell22 <- lm(mpg ~ gear + 0, mtcars)
modell22
##
## Call:
## lm(formula = mpg ~ gear + 0, data = mtcars)
##
## Coefficients:
## gear3 gear4 gear5
## 16.11 24.53 21.38
summary(modell11)
##
## Call:
## lm(formula = mpg ~ gear, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7333 -3.2333 -0.9067 2.8483 9.3667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.107 1.216 13.250 0.0000000000000787 ***
## gear4 8.427 1.823 4.621 0.0000725738200575 ***
## gear5 5.273 2.431 2.169 0.0384 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.708 on 29 degrees of freedom
## Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
## F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
summary(modell22)
##
## Call:
## lm(formula = mpg ~ gear + 0, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7333 -3.2333 -0.9067 2.8483 9.3667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## gear3 16.107 1.216 13.25 0.0000000000000787 ***
## gear4 24.533 1.359 18.05 < 0.0000000000000002 ***
## gear5 21.380 2.106 10.15 0.0000000000465706 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.708 on 29 degrees of freedom
## Multiple R-squared: 0.9542, Adjusted R-squared: 0.9495
## F-statistic: 201.5 on 3 and 29 DF, p-value: < 0.00000000000000022
We can also model a discrete variable on the y axis. Take the following example. This is a little different: when the response variable is discrete, we leave linear regression behind and use a technique called logistic regression.
9.2 Evaluating our Model
“All models are wrong, but some are useful. ~George Box”
Of course, we can always draw a line through any set of points on a scatterplot. But how will we know if that line is justified, that our linear model is a good fit and can be relied on to make accurate predictions? There are a couple of ways to test out our model and see how it’s working. One of the following methods may show a stronger signal than another method; taken together, these methods will give us the most accurate picture.
- Create diagnostic plots
- Look at model statistics
- Perform goodness of fit tests
- Validate the model with test/train workflow
Model Statistics
SST, SSR, and SSE
One of the most commonly used statistics in evaluating a model is called r-squared. We will first explore three related statistics, SST, SSR, and SSE, in order to understand r-squared.
Without having access to any information about the x
variable, our best guess for the value of y is the mean of
y. This is the guess that would minimize error under the
circumstance of knowing nothing about x. The statistic to
define this is the sum of squares total, denoted as SST. The sum of
squares total is the squared differences between the observed dependent
variable and its mean. In other words, SST is the dispersion of the
observed variables around the mean. In the below visualization on the
left, the dashed orange line is the mean of mpg; SST is the
squared sum of the distance from every point to the orange line, which
evaluates to 1126.047.
\[ \begin{aligned} SST = \sum{(y_i - \bar{y}) ^ 2} \end{aligned} \]
Having information about an x value, however, will now change our guess about y. This is because we can now draw a line of best fit for the relationship of x and y. Now, given a value of x, our best guess for y is no longer the mean of y but rather the predicted value of y based on the line of best fit.
We can first calculate the sum of squares due to regression, or SSR, which is the sum of the differences between the predicted value and the mean of the dependent variable. The middle panel of our visualization is an illustration of SSR. SSR is calculated from the difference of the blue and orange lines.
\[ \begin{aligned} SSR = \sum{(\hat{y}_i - \bar{y}) ^ 2} \end{aligned} \]
The final term, which may be the one with which we are most familiar, is the sum of squares error, or SSE. The right panel on our visualization illustrates this. The SSE is the difference between the observed value and the predicted value; in other words, it’s the differences between our line of best fit and our actual data points. The rationale is the following: the total variability of the dataset is equal to the variability explained by the regression line plus the unexplained variability, known as error.
The residual sum of squares (RSS) is also called the sum of squared errors (SSE). It is given by the following equation. Essentially, our linear model creates a prediction based on the x values, and then we subtract each observed y from its predicted value and square the difference.
\[ \begin{aligned} SSE = \sum(y_i- \hat{y}) ^ 2 \end{aligned} \]
\[ \begin{aligned} SSE = \sum(y_i- (\hat{a} + \hat{b}x)) ^ 2 \end{aligned} \]
model <- lm(mpg ~ wt, mtcars)
sse <- sum((fitted(model) - mtcars$mpg)^2)
We can see in the above graph that the SSE is equal to the difference between SST and SSR.
\[ \begin{aligned} SSE = SST - SSR \end{aligned} \]
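A short sketch confirming the identity, reusing the model and sse objects defined just above; the other object names are ours.
sst <- sum((mtcars$mpg - mean(mtcars$mpg)) ^ 2)    # total sum of squares
ssr <- sum((fitted(model) - mean(mtcars$mpg)) ^ 2) # sum of squares due to regression
all.equal(sse, sst - ssr)                          # TRUE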
MSE
\[ \begin{aligned} MSE = \frac{1}{n}\sum({y_i} -\hat{y_i}) ^ 2 \end{aligned} \]
“To find the MSE, take the observed value, subtract the predicted value, and square that difference. Repeat that for all observations. Then, sum all of those squared values and divide by the number of observations.”
The mean squared error is simply the SSE divided by the number of observations.
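Using the sse object from the sketch above, the calculation is a one-liner; a minimal sketch.
mse <- sse / nrow(mtcars) # mean squared error
sqrt(mse)                 # root mean squared error (RMSE)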
R-Squared
\[ \begin{aligned} R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = \frac{\sum{(\hat{y}_i - \bar{y})^2}}{\sum{(y_i - \hat{y}_i)^2} + \sum{(\hat{y}_i - \bar{y})^2}} \end{aligned} \]
The r-squared can be thought of as SSR / SST, or equivalently (SST - SSE) / SST. The r-squared statistic represents the
proportion of the variance for the dependent variable - the y
variable - that is explained by the independent variable - the
x variable. In the case of mpg explained by
wt in the mtcars dataset, we see an r-squared
of 0.7446, which is to say that our model accounts for 74% of the
variability of mpg according to wt.
For the sake of calculating r-squared, the order of the x
and y variables is not important. We will calculate an
r-squared of 0.7446 if we are finding mpg as a function of
wt or if we are finding wt as a function of
mpg. The tilde can be read like ‘as a function of’ in the
following context mpg ~ wt. We can place two graphs
side-by-side and switch the x and y values. In
each case, the same amount of variability on the y axis can
be explained by the values of x.
This is what is meant when we talk about how r-squared is the percent of variance explained.
model1 <- lm(mpg ~ wt, mtcars)
model2 <- lm(wt ~ mpg, mtcars)
one <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
labs(title = "mtcars", subtitle = "mpg ~ wt") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_abline(intercept = 37.2851, slope = -5.3445, linetype = 'dashed', color = '#376795', size = 1) +
labs(caption = "r-squared: 0.7446")
two <- ggplot(mtcars, aes(x = mpg, y = wt)) +
geom_point() +
labs(title = "mtcars", subtitle = "wt ~ mpg") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_abline(intercept = 6.0473, slope = -0.1409, linetype = 'dashed', color = '#376795', size = 1) +
labs(caption = "r-squared: 0.7446")
library(patchwork)
one + two
There are other ways to arrive at the r-squared statistic. R-squared can also be understood as one minus the ratio of the mean squared error to the variance of y, provided both are computed with the same n denominator, as seen in these formulas:
\[ \begin{aligned} R^2 = 1 - \frac{\frac{1}{n}\sum{(y_i - \hat{y}_i)^2}}{s_y^2} = 1 - \frac{MSE}{s_y^2} \end{aligned} \]
\[ \begin{aligned} R^2 = 1 - \frac{σ^2}{s_y^2} \end{aligned} \]
Here σ² denotes the residual variance (the MSE) and s_y² denotes the variance of y.
R-squared is also the square of the correlation between y and y-hat.
\[ \begin{aligned} R ^ 2 = ρ(y, \hat{y}) ^ 2 \end{aligned} \]
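Both of these claims are easy to check; a minimal sketch, reusing the model object from earlier.
summary(model)$r.squared           # the multiple r-squared reported by summary()
cor(mtcars$mpg, fitted(model)) ^ 2 # the squared correlation of y and y-hat gives the same value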
\[ \begin{aligned} F = \frac{R^2}{1 - R^2} \times \frac{df_2}{df_1} \end{aligned} \]
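The F-statistic reported by summary() can be recovered from r-squared in the same way; a sketch, where df1 is the number of predictors and df2 is the residual degrees of freedom.
r2 <- summary(model)$r.squared
df1 <- 1   # one predictor, wt
df2 <- 30  # 32 observations minus 2 estimated coefficients
(r2 / (1 - r2)) * (df2 / df1) # roughly 91.4, matching the F-statistic in our summary output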
Standard Deviation Line
In addition to calculating our line of best fit, we can also calculate the standard deviation line. The slope of the standard deviation line is the standard deviation of y divided by the standard deviation of x, given the sign of the correlation; because mpg and wt are negatively correlated, we multiply by -1 here. The intercept of the standard deviation line is then the mean of y minus the slope times the mean of x. The standard deviation line doesn't elucidate much about our inferences, but it is useful for visually inspecting r-squared. Essentially, the standard deviation line compares the two standard deviations without taking the strength of the correlation into account.
\[ \begin{aligned} a = -1 \times \frac{s_y}{s_x} \end{aligned} \]
The standard deviation line will always be at least as steep as the regression line (in absolute value), because the regression slope is the standard deviation ratio scaled down by the correlation. Sometimes the standard deviation line is close to the
line of best fit; sometimes the lines are further apart. A standard
deviation line close to the regression line indicates a high r-squared.
For this reason, overlaying the standard deviation line on top of our
scatterplot and alongside our linear regression line is useful for
visually inspecting r-squared. We can also add a point that represents
the mean of the x and y variables. The linear
regression line and the standard deviation line will intersect at this
point.
mean_wt <- mean(mtcars$wt)
mean_mpg <- mean(mtcars$mpg)
sd_wt <- sd(mtcars$wt)
sd_mpg <- sd(mtcars$mpg)
model <- lm(mpg ~ wt, data = mtcars)
sd_slope = -1 * (sd_mpg / sd_wt)
sd_intercept <- mean_mpg - sd_slope * mean_wt
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_abline(aes(intercept = sd_intercept, slope = sd_slope), linetype = 'dashed', color = '#D55E00', size = 1) +
geom_abline(intercept = r_intercept, slope = r_slope, color = '#376795', size = 1) +
labs(title = "Mtcars", subtitle = "Linear Regression + Standard Deviation Line") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
geom_point(aes(x = mean(wt), y = mean(mpg)), color = 'orange', size = 4)
Standard Error
We sometimes think about the standard error of the regression line but in fact there are many kinds of standard errors. For example, we can think about standard error in terms of the single standard error value as well as the standard error of the line. Here are some examples of standard errors. We will review each of these in turn.
- standard error of a coefficient estimate
- standard error of the mean at X
- standard error of the forecast
The standard error is the standard deviation of the estimate, which is why it's also called the standard error of estimate, denoted se. If we were to look at the distribution of the residuals and it were (or were resampled to be) a normal distribution, then roughly one-half of the data points would fall within two-thirds of a standard deviation above or below the regression line, roughly two-thirds would fall within one standard deviation of the regression line, and about 95% would fall within two standard deviations above or below the regression line.
The standard error is useful because it tells us about how large the
residuals are for our dataset. The residuals, we remember, are the
prediction errors. The residuals are in the same unit as our
y variable. The standard error tells us how well we can
predict the y value. A smaller standard error indicates
more confidence in our predictions.
\[ \begin{aligned} S_{est} = \sqrt{\frac{(1-r^2)\sum{(y_i - \bar{y})^2}}{n - 2}} \end{aligned} \]
r <- cor(mtcars$mpg, mtcars$wt)
# equation for single standard error value
standard_error_value <- sqrt ( ( ( 1 - (r ^ 2) ) *
sum(((mtcars$mpg) - mean(mtcars$mpg)) ^ 2) ) /
(length(mtcars$mpg) - 2) )
# equation for all y values relative to x
standard_error_of_line <- qt(0.975, 30) * sqrt ( ( ( 1 - (r ^ 2) ) *
sum(((mtcars$mpg) - mean(mtcars$mpg)) ^ 2) ) / (length(mtcars$mpg) - 2) )
We can also think about the standard error of y as it changes in relation to x.
\[ \begin{aligned} (s.e.)_y = \sqrt{\frac{\sum _{i=1}^{n}(y - \hat{y}) ^ 2 }{( n - 2)}} * \sqrt{\frac{1}{n} + \frac{(x - \bar{x}) ^ 2}{\sum _{i=1}^{n} (x - \bar{x}) ^ 2}} \end{aligned} \]
adfaf <- sqrt(
( 1 / (length(mtcars$wt)) +
((mtcars$wt - mean(mtcars$wt)) ^ 2 ) /
(sum((mtcars$wt - mean(mtcars$wt)) ^ 2))
))
adfaf## [1] 0.2080119 0.1876080 0.2416107 0.1767772 0.1814437 0.1823061 0.1882622
## [8] 0.1768474 0.1772072 0.1814437 0.1814437 0.2361182 0.2002714 0.2047450
## [15] 0.4128882 0.4419637 0.4287122 0.2571315 0.3431469 0.3092354 0.2243142
## [22] 0.1853060 0.1812391 0.2105165 0.2110162 0.2943616 0.2652376 0.3593238
## [29] 0.1769893 0.1949100 0.1882622 0.1941440
adfaf2 <- sqrt(
((mtcars$mpg - mean(mtcars$mpg)) ^ 2 ) /
(length(mtcars$mpg) - 2))
adfaf2## [1] 0.16602840 0.16602840 0.49466193 0.23905807 0.25389223 0.36343674
## [7] 1.05721864 0.78678063 0.49466193 0.16260513 0.41820899 0.67381285
## [13] 0.50949609 0.89290188 1.76925797 1.76925797 0.98418897 2.24737412
## [19] 1.88222575 2.52123540 0.25731549 0.83812962 0.89290188 1.23979283
## [25] 0.16260513 1.31624577 1.07889933 1.88222575 0.78335737 0.07131804
## [31] 0.92941671 0.23905807
sqrt(
((mtcars$mpg - mean(mtcars$mpg)) ^ 2 ) /
(length(mtcars$mpg) - 2)) *
sqrt(
( 1 / (length(mtcars$wt)) +
((mtcars$wt - mean(mtcars$wt)) ^ 2 ) /
(sum((mtcars$wt - mean(mtcars$wt)) ^ 2))
))## [1] 0.03453589 0.03114826 0.11951561 0.04226001 0.04606715 0.06625673
## [7] 0.19903429 0.13914015 0.08765765 0.02950368 0.07588139 0.15909946
## [13] 0.10203747 0.18281720 0.73050570 0.78194779 0.42193384 0.57787072
## [19] 0.64587988 0.77965526 0.05771951 0.15531047 0.16182873 0.26099680
## [25] 0.03431232 0.38745217 0.28616462 0.67632843 0.13864590 0.01390060
## [31] 0.17497402 0.04641170
We can look at the summary() function with our model object in order to understand whether our model is a good fit. We look at the five-number summary of our residuals. We want to see a median close to 0. (We already know that in OLS regression the mean of the residuals is 0.) We also want the absolute value of 1Q, the first quartile, to be about equal to the absolute value of 3Q, the third quartile.
model <- lm(mpg ~ wt, mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 0.0000000000000002 ***
## wt -5.3445 0.5591 -9.559 0.000000000129 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 0.0000000001294
model_summ <- summary(model)
mean(model_summ$residuals^2)
## [1] 8.697561
sqrt(mean(model_summ$residuals^2))
## [1] 2.949163
data <- data.frame(pred = predict(model), actual = mtcars$mpg)
data
mean((data$actual - data$pred)^2)
## [1] 8.697561
sqrt(mean((data$actual - data$pred)^2))
## [1] 2.949163
"Mean square error is a useful way to determine the extent to which a regression model is capable of integrating a dataset.
The larger the difference indicates a larger gap between the predicted and observed values, which means poor regression model fit. In the same way, the smaller RMSE that indicates the better the model.
Based on RMSE we can compare the two different models with each other and be able to identify which model fits the data better."## [1] "Mean square error is a useful way to determine the extent to which a regression model is capable of integrating a dataset.\n\nThe larger the difference indicates a larger gap between the predicted and observed values, which means poor regression model fit. In the same way, the smaller RMSE that indicates the better the model.\n\nBased on RMSE we can compare the two different models with each other and be able to identify which model fits the data better."
Prediction Interval
A prediction interval is an estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed.
We can see that in our dataset of 32 observations, two observations fall outside the prediction interval line. In a large dataset, we would expect 95% of the observations to fall within the prediction interval lines and 5% to fall outside the prediction interval lines.
The prediction interval lines shown here have the same slope as our line of best fit and run parallel to it, offset above and below by the standard error of the line. This is a simplification: an exact prediction interval bows slightly outward as x moves away from its mean, and R can compute it directly, as shown below.
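A minimal sketch of the exact interval, reusing the model object from earlier; predict() accepts an interval argument.
predict(model, newdata = data.frame(wt = 3), interval = "prediction", level = 0.95)
# returns the fitted value at wt = 3 along with the lower and upper prediction bounds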
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() + geom_smooth(method = 'lm', color = '#376795', size = 1, se = FALSE) +
geom_abline(intercept = 37.2851 + standard_error_of_line, slope = -5.3445, linetype = 'dashed', color = '#376795', size = 0.5) +
geom_abline(intercept = 37.2851 - standard_error_of_line, slope = -5.3445, linetype = 'dashed', color = '#376795', size = 0.5) +
labs(title = "Mtcars", subtitle = "Linear Regression") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
Confidence Interval
fourth <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() + geom_smooth(method = 'lm', color = '#376795', size = 1, se = TRUE) +
labs(title = "Mtcars", subtitle = "Linear Regression") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
fourth
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
ggplot(aes(x = values, y = mpg)) +
geom_point() +
facet_wrap(~names, scales = 'free') + geom_smooth(method = 'lm', color = '#376795', size = 1, se = TRUE) +
labs(title = "Mtcars", subtitle = "Linear Regression") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
Ribbons are one aesthetic way to show a confidence interval without showing the regression line itself.
mtcars_linear_regression_geom_ribbon <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() + stat_smooth(method = "lm", color = '#376795', size = 1, fill = NA, geom = "ribbon") +
labs(title = "Mtcars", subtitle = "Linear Regressions") +
theme(plot.title = element_text(face = "bold"))
# theme(plot.subtitle = element_text(hjust = 0.5))
mtcars_linear_regression_geom_ribbon_filled <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() + stat_smooth(method = "lm", color = '#376795', fill = '#ffe6b7', size = 1, geom = "ribbon") +
# labs(title = "Mtcars", subtitle = "Linear Regression") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
mtcars_linear_regression_geom_ribbon + mtcars_linear_regression_geom_ribbon_filled
#
# mtcars %>%
# dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
# pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
# ggplot(aes(x = values, y = mpg)) +
# geom_point() + stat_smooth(method = "lm", color = '#376795', size = 1, geom = "ribbon") +
# facet_wrap(~names, scales = 'free') +
# labs(title = "Mtcars", subtitle = "Linear Regression") +
# theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
# theme(plot.subtitle = element_text(hjust = 0.5))
#
# # what about geom_area()
Residual Distance
mtcars_linear_model <- lm(mpg ~ wt, mtcars)
mtcars$predicted <- predict(mtcars_linear_model) # Save the predicted values
mtcars$residuals <- residuals(mtcars_linear_model) # Save the residual values
single_line_plus_residual_distance <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_smooth(method = "lm", se = FALSE, color = '#376795', size = 1) + # regression line
geom_segment(aes(xend = wt, yend = predicted), alpha = .2) + # draw line from point to line
geom_point(aes(color = abs(residuals))) + # size of the points
# scale_color_continuous(low = "green", high = "red") + # colour of the points mapped to residual size - green smaller, red larger
guides(color = FALSE, size = FALSE) + # Size legend removed
# geom_point(aes(y = predicted), shape = 1)
geom_text(aes(label = car), size = 2.5, hjust = 1.1, data = subset(mtcars, residuals > 6)) +
labs(title = "Mtcars", subtitle = "Linear Regression") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
single_line_plus_residual_distance
mtcars %>%
dplyr::select(car, origin, mpg, disp, hp, drat, wt, qsec) %>%
pivot_longer(names_to = 'names', values_to = 'values', 4:8) %>%
nest_by(names) %>%
mutate(model = list(lm(mpg ~ values, data = data))) %>%
summarise(augment(model)) -> predicted_values
mtcars_long_numeric_ij <- inner_join(mtcars_long_numeric, predicted_values, by = c("names", "mpg", "values"))
ggplot(mtcars_long_numeric_ij, aes(x = values, y = mpg)) +
geom_smooth(method = "lm", se = FALSE, color = '#376795', size = 1) +
facet_wrap(~names, scales = 'free_x') +
geom_segment(mapping = aes(xend = values, yend = .fitted), alpha = .2, data = mtcars_long_numeric_ij) +
geom_point(aes(color = abs(.resid))) +
guides(color = FALSE, size = FALSE) +
geom_text(aes(label = car), size = 2.5, hjust = 1.1, data = subset(mtcars_long_numeric_ij, .resid > 6)) +
labs(title = "Mtcars", subtitle = "Linear Regression") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
s <- mtcars %>%
nest(data = -am) %>%
mutate(
fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)), # S3 list-col
tidied = map(fit, tidy)
) %>%
unnest(tidied)
regressions <- mtcars %>%
nest(data = -am) %>%
mutate(
fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment)
)
# regressions %>%
# unnest(tidied)
#
# regressions %>%
# unnest(glanced)
#
# regressions %>%
# unnest(augmented)
Bootstrapping
Bootstrapping is a statistical procedure that resamples a single dataset many times in order to get many simulated datasets. By doing this, we are able to calculate confidence intervals and other things. In this example, we use bootstrapping on our regression model in order to test the stability of our model coefficients.
boots <- rsample::bootstraps(mtcars, times = 250, apparent = TRUE)
fit_nls_on_bootstrap <- function(split) {
lm(mpg ~ wt, analysis(split))
}
boot_models <-
boots %>%
dplyr::mutate(model = map(splits, fit_nls_on_bootstrap),
coef_info = map(model, tidy))
boot_coefs <-
boot_models %>%
unnest(coef_info)
percentile_intervals <- int_pctl(boot_models, coef_info)
percentile_intervals
one <- ggplot(boot_coefs, aes(estimate)) +
geom_histogram(bins = 30, alpha = .3, col = "#376795") +
facet_wrap( ~ term, scales = "free") +
labs(title="Mtcars", subtitle = "Stability of Coeffiecients") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "95% Confidence Interval Parameter Estimates, Intercept + Estimate") +
geom_vline(aes(xintercept = .lower), data = percentile_intervals, col = "black", linetype = 'dotdash') +
geom_vline(aes(xintercept = .upper), data = percentile_intervals, col = "black", linetype = 'dotdash')
boot_aug <-
boot_models %>%
sample_n(50) %>%
mutate(augmented = map(model, augment)) %>%
unnest(augmented)
# boot_aug <-
# boot_models %>%
# sample_n(200) %>%
# mutate(augmented = map(model, augment)) %>%
# unnest(augmented)
glimpse(boot_aug)
## Rows: 1,600
## Columns: 12
## $ splits <list> [<boot_split[32 x 10 x 32 x 20]>], [<boot_split[32 x 10 x …
## $ id <chr> "Bootstrap113", "Bootstrap113", "Bootstrap113", "Bootstrap1…
## $ model <list> [36.575074, -5.311426, -0.09753868, 1.70991444, 7.07139355…
## $ coef_info <list> [<tbl_df[2 x 5]>], [<tbl_df[2 x 5]>], [<tbl_df[2 x 5]>], […
## $ mpg <dbl> 18.1, 10.4, 33.9, 17.8, 30.4, 18.1, 18.7, 19.2, 15.0, 15.8,…
## $ wt <dbl> 3.460, 5.250, 1.835, 3.440, 1.513, 3.460, 3.440, 3.440, 3.5…
## $ .fitted <dbl> 18.197539, 8.690086, 26.828606, 18.303767, 28.538886, 18.19…
## $ .resid <dbl> -0.09753868, 1.70991444, 7.07139355, -0.50376720, 1.8611142…
## $ .hat <dbl> 0.03359755, 0.19295709, 0.10557370, 0.03323023, 0.14431139,…
## $ .sigma <dbl> 2.733326, 2.710440, 2.354482, 2.731732, 2.707735, 2.733326,…
## $ .cooksd <dbl> 0.00002369387, 0.05996625680, 0.45684339934, 0.00062465140,…
## $ .std.resid <dbl> -0.0369197, 0.7082497, 2.7822303, -0.1906464, 0.7486443, -0…
# compare this to other versions
lm_bootstrapped <- ggplot(boot_aug, aes(wt, mpg)) +
geom_line(aes(y = .fitted, group = id), alpha = .3, col = "#376795") +
geom_point(alpha = 0.005) +
# ylim(5,25) +
# labs(title="", subtitle = "mpg ~ wt \n Linear Regression Bootstrap Resampling") +
# labs(caption = "coefficient stability testing")
labs(title = "Mtcars", subtitle = "Linear Regression Bootstrap Resampling") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
two <- lm_bootstrapped
library(patchwork)
one + two
lm_coeffs = function(x, y) {
coeffs = coefficients(lm(y~x))
tibble(C = coeffs[1], m=coeffs[2])
}
nboot = 100
mtboot = lapply (seq_len(nboot), function(i)
mtcars_long_numeric %>%
group_by(names) %>%
slice_sample(prop=1, replace=TRUE) %>%
summarise(tibble(lm_coeffs(values, mpg))))
mtboot = do.call(rbind, mtboot)
#, color='#D55E00'
ggplot(mtcars_long_numeric, aes(values, mpg, color = names)) +
geom_abline(aes(intercept=C, slope=m), data = mtboot, size=0.3, alpha=0.3) +
geom_point() +
facet_wrap(~names, scales = 'free_x') +
labs(title = "Mtcars", subtitle = "Linear Regression") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_color_manual(values=met.brewer("Hiroshige", 5))
Making Predictions
new_data <- data.frame(wt = c(1, 2, 3, 4, 5, 6, 7, 8))
predict(mtcars_linear_model, new_data)
## 1 2 3 4 5 6 7
## 31.9406546 26.5961830 21.2517114 15.9072399 10.5627683 5.2182967 -0.1261748
## 8
## -5.4706464
predictions <- predict(mtcars_linear_model, new_data)
predictions <- as.data.frame(predictions)
predictions <- round(predictions, 0)
predictions
summary(mtcars_linear_model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 0.0000000000000002 ***
## wt -5.3445 0.5591 -9.559 0.000000000129 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 0.0000000001294
mtcars_linear_model$coefficients
## (Intercept) wt
## 37.285126 -5.344472
# exploring out of sample estimates
Model Diagnostics
There are four model assumptions associated with linear regression. If a linear regression is created in violation of these assumptions, and no remedial measures are taken, the model may have errors and poor accuracy. The four assumptions for linear regression are:
- Linearity
- Homoscedasticity
- Independence
- Normality
In addition to verifying these linear model assumptions, we should also pay close attention to any problem data points which may have an inordinately large effect on the model.
- Outliers
- Leverage points
- Influential observations
“Is it realistic to assume that all of these assumptions will hold? Probably not! We want the assumptions to hold just enough for the model to be useful.”
Linearity
In order to verify the assumption of linearity, we have to check a few separate parts, including the functional form of the model and whether or not the errors have zero population mean. Some of these pieces are guaranteed by the way ordinary least squares constructs the model, but others need to be checked after fitting, as they might reveal problems.
As a first thing, linearity means that the regression model is linear in the coefficients and the error term. In other words, we want to make sure that we are looking at a model that is correctly specified with a certain functional form, including that it is linear with an additive error term. Here is the general form of our linear model equation. The following equation may look familiar: In this formula, the beta values, labeled as β, are the parameters that ordinary least squares regression will estimate while epsilon, labeled as ε, represents the random error inherent in any model.
\[ \begin{aligned} Y =\beta _{0} + \beta _{1}X_{1} + \beta _{2}X_{2} + \cdots + \beta _{k}X_{k} + \epsilon \end{aligned} \]
The defining characteristic of linear regression is this functional form of the parameters. Despite its name, linear regression is still able to model curvature. Linear models can model curvature by including nonlinear variables such as polynomials and transforming exponential functions. The following three panels contain two examples of linear regressions and one example of a non-linear regression. We can see that a linear regression can still model curvature if the consideration is built into the model.
In the second panel, we might ask ourselves what I() is doing. We refer to I() with parentheses because I() is a function. I() does something very particular within the language of R: It isolates, or insulates, its contents from the formula-parsing code. It allows the standard arithmetic operators to work as they would outside of a formula, rather than being treated as special formula operators.
Take the following example: lm(mpg ~ wt + disp, mtcars) predicts mpg as a function of two separate variables, wt and disp. Compare this to lm(mpg ~ I(wt + disp), mtcars), which predicts mpg as a function of a single new predictor, the numeric sum of wt and disp. The same issue arises with the ^ operator: it has a special meaning inside a formula, so when we need its ordinary arithmetic meaning of exponentiation, we wrap that part of the expression in I(). In the model fit <- lm(mpg ~ wt + I(wt^2), data = mtcars), we create a model with two predictors, wt and wt squared, which is different from asking for the main effect and the second-order interaction of wt; a sketch contrasting these formulas follows below.
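A small sketch contrasting these formulas; all three calls run on mtcars as-is.
lm(mpg ~ wt + disp, data = mtcars)    # two separate predictors: wt and disp
lm(mpg ~ I(wt + disp), data = mtcars) # one predictor: the arithmetic sum of wt and disp
lm(mpg ~ wt + I(wt^2), data = mtcars) # wt plus wt squared, a quadratic fit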
A second thing to check within the linearity assumption is that the error term has a zero population mean.
A third thing to check is that the explanatory variables are uncorrelated with the error term.
A fourth thing to check is that the observations of the error term are uncorrelated with each other; correlation among the errors is called serial correlation.
A fifth thing to check is that no explanatory variable is a perfect linear function of any other explanatory variables, a situation known as perfect multicollinearity. If two or more independent variables have an exact linear relationship between them, then we have perfect multicollinearity. Examples include accidentally entering the same information twice (weight in pounds and weight in kilograms) or not using dummy variables correctly (falling into the dummy variable trap). The consequence is that OLS cannot generate estimates of the regression coefficients at all, because it cannot estimate the marginal effect of one variable while holding the other constant; the solution is simply to drop one of the variables.
Two further conditions round out the usual list: the error term has a constant variance (no heteroskedasticity), and the error term is normally distributed (this last one is not strictly required).
What are residuals? A residual is the observed value minus the fitted value.
In the following example, we create a new variable, l_per_100_km, which is a direct transformation of mpg, and then fit a model that includes it alongside wt.
mtcars2 <- mtcars %>%
mutate(l_per_100_km = 235 / mpg)
mtcars2
ggplot(mtcars2, aes(x = mpg, y = wt, size = l_per_100_km)) +
  geom_point()
moddell <- lm(mpg ~ wt + l_per_100_km, mtcars2)
moddell
##
## Call:
## lm(formula = mpg ~ wt + l_per_100_km, data = mtcars2)
##
## Coefficients:
## (Intercept) wt l_per_100_km
## 39.034 -1.110 -1.206
summary(moddell)
##
## Call:
## lm(formula = mpg ~ wt + l_per_100_km, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8695 -1.3470 -0.9679 0.5930 5.2650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.0339 1.3853 28.178 < 0.0000000000000002 ***
## wt -1.1098 0.8793 -1.262 0.217
## l_per_100_km -1.2063 0.2229 -5.412 0.00000808 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.185 on 29 degrees of freedom
## Multiple R-squared: 0.877, Adjusted R-squared: 0.8685
## F-statistic: 103.4 on 2 and 29 DF, p-value: 0.00000000000006342
Homoscedasticity
Homoscedasticity is the idea of even variance, meaning that the variance of the residuals is the same for any value of x, and that error terms are normally distributed.
Independence
Independence refers to the idea that our observations are independent of each other; they do not influence each other.
Normality
Normality is the idea that for any fixed value of x, y is normally distributed.
Outliers
An outlier is an observation that has a large residual, meaning the observed value is very different from that predicted by our model.
Leverage Points
A point of high leverage is an observation that has a value of x that is far away from the mean of x.
Here is an example of leverage: the circled regions sit at values of wt far from the mean of wt.
ggplot(
mtcars,
aes(wt, mpg)
) +
geom_point() +
geom_point(aes(x = 1.55, y = 30.5), colour="blue", size = 30, shape = 21) +
geom_point(aes(x = 5.35, y = 12.5), colour="blue", size = 30, shape = 21)
Influential Observations
An influential observation is an observation that changes the slope of the line. One method to find influential points is to compare the fit of the model with and without each observation.
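Base R provides helpers for these diagnostics; a minimal sketch using the mtcars_linear_model object from earlier, where the 4/n cutoff is only a common rule of thumb, not a fixed rule.
head(hatvalues(mtcars_linear_model))                          # leverage of each observation
head(cooks.distance(mtcars_linear_model))                     # influence of each observation on the fit
which(cooks.distance(mtcars_linear_model) > 4 / nrow(mtcars)) # observations flagged as unusually influential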
Two practical rules of thumb are also worth keeping in mind: a sample size of roughly 20 cases per independent variable, and no multicollinearity or autocorrelation among the predictors and errors.
Diagnostic Plots
Regression diagnostics are used to evaluate the model assumptions and also to investigate whether or not there are observations with a large, undue influence on the analysis. It’s possible to have a linear model that has impressive goodness-of-fit statistics such as high r-squared and low RMSE but fails the linear model assumptions. It’s important to know that the goodness-of-fit statistics are only important if the model has good diagnostics.
We will go over the following diagnostic plots in detail in the following sections. All of these diagnostic plots come from our linear model mpg ~ wt and can be used in conjunction with our scatterplot and linear model statistics.
We can use a library such as the ggfortify library to
create diagnostic plots or else we can construct our own diagnostic
plots using our linear model object and ggplot2 methods
which gives us more flexibility.
Fitted Values vs Residuals
When conducting a residual analysis, a “residuals versus fits plot” is the most frequently created plot. It is a scatter plot of residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers.
This plot is used to check the linear relationship assumption. A roughly horizontal band of points without distinct patterns is an indication of a linear relationship, which is what we want. Because the sum of the residuals is 0, their mean is also 0.
# x11()
# par(mfrow=c(2,2))
# plot(mtcars_linear_model)
autoplot(mtcars_linear_model)
mtcars_graph <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE, color = '#376795') +
labs(title = "Mtcars", subtitle = "Linear Regression") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
# scale_color_manual(values=met.brewer("Hiroshige", 6))
# scale_color_manual(values= list(c("#e76254", "#ef8a47", "#f7aa58", "#ffd06f", "#ffe6b7", "#aadce0", "#72bcd5", "#528fad", "#376795", "#1e466e"), c(6, 2, 9, 3, 7, 5, 1, 10, 4, 8), colorblind=TRUE))
fitted_values_vs_residuals <- ggplot(data = mtcars_linear_model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = '#ef8a47', size = 1) +
xlab("fitted values") +
ylab("residuals") +
labs(title = "Mtcars") +
labs(subtitle = "Fitted Values vs Residuals") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
mtcars_graph + fitted_values_vs_residuals
x_values_vs_residuals <- ggplot(data = mtcars_linear_model, aes(x = wt, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = '#ef8a47', size = 1) +
xlab("x values") +
ylab("residuals") +
labs(title = "Mtcars") +
labs(subtitle = "X Values vs Residuals") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
mtcars_graph + x_values_vs_residuals
Histogram of Residuals
The histogram of the residuals can be used to check whether the residuals are normally distributed. A symmetric, bell-shaped histogram evenly distributed around zero indicates that the normality assumption is likely to hold.
histogram_of_residuals <- ggplot(data = mtcars_linear_model, aes(x = .resid)) +
geom_histogram(color = '#ef8a47') +
xlab("residuals") +
labs(title = "Mtcars") +
labs(subtitle = "Histogram of Residuals") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
mtcars_graph + histogram_of_residuals
QQ Plot
In statistics, a Q-Q plot is a probability plot: a graphical method for comparing two probability distributions by plotting their quantiles against each other. The Q-Q plot of residuals can be used to visually check the normality assumption; if both sets of quantiles come from the same distribution, we should see the points forming a line that is roughly straight.
Now, what are quantiles? These are often referred to as percentiles. They are points in your data below which a certain proportion of your data fall. For example, imagine the classic bell-curve standard Normal distribution with a mean of 0. The 0.5 quantile, or 50th percentile, is 0: half the data lie below 0, at the peak of the hump in the curve. The 0.95 quantile, or 95th percentile, is about 1.64: 95 percent of the data lie below 1.64. The following R code generates the quantiles for a standard Normal distribution from 0.01 to 0.99 by increments of 0.01:
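Here is a one-line sketch of such code:
qnorm(seq(0.01, 0.99, by = 0.01))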
https://data.library.virginia.edu/understanding-q-q-plots/
In the Normal Q-Q panel below, we examine whether the residuals are normally distributed; it is a good sign if the residual points follow the straight dashed line.
qq_plot <- ggplot(data = mtcars_linear_model, aes(sample = .resid)) +
stat_qq() +
stat_qq_line(linetype = 'dashed', color = '#ef8a47', size = 1) +
labs(title = "Mtcars") +
labs(subtitle = "Residual QQ Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
mtcars_graph + qq_plot
Scale Location
A scale-location plot is a type of plot that displays the fitted values of a regression model along the x-axis and the the square root of the standardized residuals along the y-axis.
This plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal line with equally spread points.
If the variability (variance) of the residual points increases with the value of the fitted outcome variable, this suggests non-constant variance in the residual errors (heteroscedasticity).
A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y).
The third plot is a scale-location plot (square rooted standardized residual vs. predicted value). This is useful for checking the assumption of homoscedasticity. In this particular plot we are checking to see if there is a pattern in the residuals.
Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). Horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
scale_location <- ggplot(mtcars_linear_model, aes(.fitted, sqrt(abs(.stdresid)))) +
geom_point(na.rm=TRUE) +
stat_smooth(method="loess", na.rm = TRUE, color = '#ef8a47', size = 1, se = FALSE) +
xlab("Fitted Value") +
ylab(expression(sqrt("|Standardized residuals|"))) +
labs(title = "Mtcars") +
labs(subtitle = "Scale-Location") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
mtcars_graph + scale_location
Leverage vs Standardized Residuals
The residuals vs. leverage plot helps to identify influential data points in the model. Outliers can be influential, though they don’t necessarily have to be, and some points within a normal range of the model could be very influential. An outlier is defined as an observation that has a large residual.
Residuals vs Leverage. Used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis. This plot will be described further in the next sections.
leverage_vs_standardized_residuals <- ggplot(data = mtcars_linear_model, aes(.hat, .stdresid)) +
geom_point(aes(size = .cooksd)) +
stat_smooth(method="loess", na.rm=TRUE, color = '#ef8a47', size = 1) +
xlab("Leverage") +
ylab("Standardized Residuals") +
labs(title = "Mtcars") +
labs(subtitle = "Residuals vs Leverage") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
scale_size_continuous("Cook's Distance", range=c(1,5)) +
theme(legend.title = element_blank()) +
theme(legend.position= "none")
library(patchwork)
mtcars_graph + leverage_vs_standardized_residuals
Observation Number vs Cook’s Distance
The fourth plot is of “Cook’s distance”, which is a measure of the influence of each observation on the regression coefficients. The Cook’s distance statistic is a measure, for each observation in turn, of the extent of change in model estimates when that particular observation is omitted. Any observation for which the Cook’s distance is close to 1 or more, or that is substantially larger than other Cook’s distances (highly influential data points), requires investigation.
observation_number_vs_cooks_distance <- ggplot(mtcars_linear_model, aes(seq_along(.cooksd), .cooksd)) +
geom_bar(stat="identity", position="identity", color = '#ef8a47', size = 1) +
xlab("Obs. Number") +
ylab("Cook's distance") +
ggtitle("Mtcars") +
labs(subtitle = "Cook's distance") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
mtcars_graph + observation_number_vs_cooks_distance
Leverage vs Cook’s Distance
leverage_vs_cooks_distance <- ggplot(mtcars_linear_model, aes(.hat, .cooksd))+geom_point(na.rm=TRUE) +
stat_smooth(method="loess", na.rm=TRUE, color = '#ef8a47', size = 1) +
xlab("Leverage hii")+
ylab("Cook's Distance") +
ggtitle("Mtcars") +
labs(subtitle = "Cook's dist vs Leverage hii/(1-hii)") +
geom_abline(slope=seq(0,3,0.5), color = "gray", linetype = "dotdash") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5))
library(patchwork)
mtcars_graph + leverage_vs_cooks_distance
Summaries
mtcars_linear_model <- lm(mpg ~ wt, mtcars)
# (?summary.lm)
# names(summary(mtcars_linear_model))
# str(summary(mtcars_linear_model))
one <- mtcars %>%
filter(cyl == 4) %>%
lm(mpg ~ wt, data = .) %>%
summary() %>%
.$"r.squared"
two <- mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_dbl("r.squared")
three <- mtcars_long_numeric %>%
split(.$names) %>%
map(~ lm(mpg ~ values, data = .)) %>%
map(summary) %>%
map_df("r.squared") %>%
pivot_longer(., names_to = 'explanatory_variable_to_mpg', values_to = 'r_squared', 1:5) %>%
arrange(desc(r_squared))
mtcars_long_numeric %>%
nest_by(names) %>%
mutate(model = list(lm(mpg ~ values, data = data))) %>%
summarise(glance(model))
Goodness-of-Fit Tests
Test / Train Workflow
“The reason for this test is simple, imagine we used the full dataset to train the model and then use the same data to predict the model’s accuracy. Naturally, we expect the model to perform well, given that all data points are already “known”. This is exactly what the term overfit refers to. What we are interested in is, if the model is also capable of handling data that has just been introduced — the test data.”
We use the initial_split() function from the rsample package, which is part of the tidymodels collection of packages.
set.seed(335)
# Create a split object
mtcars_split <- initial_split(mtcars, prop = 0.75,
strata = mpg)
# Build training dataset
mtcars_training <- mtcars_split %>%
training()
# Build testing dataset
mtcars_test <- mtcars_split %>%
testing()
lm_model <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
# View object properties
lm_model
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
# We can do more things with our linear model object.
lm_fit <- lm_model %>%
fit(mpg ~ wt, data = mtcars_training)
# View lm_fit properties
lm_fit
## parsnip model object
##
##
## Call:
## stats::lm(formula = mpg ~ wt, data = data)
##
## Coefficients:
## (Intercept) wt
## 35.861 -4.829
mtcars_training <- mtcars_training %>%
mutate(dataset = "training")
mtcars_test <- mtcars_test %>%
mutate(dataset = "test")
mtcars_training_and_test <- bind_rows(mtcars_training, mtcars_test)
mtcars_training_and_test
ggplot(mtcars_training_and_test, aes(x = wt, y = mpg, color = factor(dataset))) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
scale_color_manual(values = met.brewer("Hiroshige", 2)) +
geom_abline(intercept = 37.2851, slope = -5.3445, linetype = 'dashed') +
labs(title = "Mtcars", subtitle = "training vs testing") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "black line: regression from full dataset") +
xlab("Displacement") +
ylab("MPG") R-squared plot
We can examine the test set accuracy by making an r-squared plot. An r-squared plot is a scatterplot that shows the actual values versus the model predictions. We also include a diagonal line through the origin where the x and y values are equal, which serves as a representation of a perfect model, where all the predicted values would be equal to the true values in the test set. In an r-squared plot, the farther the points are from this line, the worse the model fit.
The reason this plot is called an r-squared plot is because the r-squared is the square of the correlation between y and y-hat, or the squared correlation between the true and predicted values, which are plotted as paired in the plot.
\[ R^2 = \operatorname{cor}(y, \hat{y})^2 \]
Let’s take a look at how we might compute the squared correlation between y and y-hat.
mtcars_linear_model <- lm(mpg ~ wt, mtcars)
mtcars$predicted <- predict(mtcars_linear_model)
summary(mtcars_linear_model) # the multiple r-squared is 0.7528328
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 0.0000000000000002 ***
## wt -5.3445 0.5591 -9.559 0.000000000129 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 0.0000000001294
cor(mtcars$mpg, mtcars$predicted) ^ 2 # the squared correlation between y and y-hat is 0.7528328
## [1] 0.7528328
In the code below, we use geom_point() and
geom_abline() to make this plot. The
geom_abline() function will plot a line with the provided
slope and intercept arguments.
mtcars_test_results <- predict(lm_fit, new_data = mtcars_test) %>%
bind_cols(mtcars_test)
# View results
mtcars_test_results
ggplot(data = mtcars_test_results,
mapping = aes(x = .pred, y = mpg)) +
geom_point(color = '#006EA1') +
geom_abline(intercept = 0, slope = 1, color = '#D55E00') +
labs(title = 'Mtcars Test Set Data', subtitle = 'Linear Regression Results',
x = 'Predicted MPG',
y = 'Actual MPG') +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "mpg ~ wt") Cross-Validation
“During the process of partitioning the complete dataset into the training set and the validation set, there are chances of losing some important and crucial data points for the training purpose. Since those data are not included in the training set, the model has not got the chance to detect some patterns. This situation can lead to overfitting or underfitting of the model. To avoid this, there are different types of cross-validation techniques that guarantee the random sampling of the training and validation data sets and maximize the accuracy of the model. Some of the most popular cross-validation techniques are:
- Validation Set Approach
- Leave-one-out cross-validation (LOOCV)
- K-fold cross-validation
- Repeated K-fold cross-validation”
K Fold Cross-Validation
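As a hedged sketch (this assumes the vfold_cv() function from rsample and fit_resamples() from the tune package, both part of tidymodels, which the book does not otherwise walk through), k-fold cross-validation splits the data into k folds and refits the model k times, holding out one fold at a time for assessment:
# a minimal k-fold cross-validation sketch; the number of folds is an illustrative choice
library(tidymodels)
set.seed(335)
mtcars_folds <- vfold_cv(mtcars, v = 5)           # 5 folds
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
lm_cv_results <- fit_resamples(lm_model, mpg ~ wt, resamples = mtcars_folds)
collect_metrics(lm_cv_results)                    # average RMSE and r-squared across the folds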
# https://www.geeksforgeeks.org/cross-validation-in-r-programming/
Monte Carlo Cross-Validation
“The basis of a Monte Carlo simulation is that the probability of varying outcomes cannot be determined because of random variable interference. Therefore, a Monte Carlo simulation focuses on constantly repeating random samples to achieve certain results.”
“A Monte Carlo simulation takes the variable that has uncertainty and assigns it a random value. The model is then run and a result is provided. This process is repeated again and again while assigning the variable in question with many different values. Once the simulation is complete, the results are averaged together to provide an estimate.”
“By analyzing historical price data, you can determine the drift, standard deviation, variance, and average price movement of a security. These are the building blocks of a Monte Carlo simulation.”
https://rsample.tidymodels.org/reference/mc_cv.html
One resample of Monte Carlo cross-validation takes a random sample (without replacement) of the original data set to be used for analysis. All other data points are added to the assessment set.
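A minimal sketch using the mc_cv() function from rsample (the prop and times values below are illustrative assumptions):
# each resample draws 75% of the training rows (without replacement) for analysis
# and assigns the remaining rows to the assessment set
library(rsample)
set.seed(335)
mtcars_mc <- mc_cv(mtcars_training, prop = 0.75, times = 25)
dim(analysis(mtcars_mc$splits[[1]]))     # rows and columns of the first analysis set
dim(assessment(mtcars_mc$splits[[1]]))   # rows and columns of the first assessment set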
# The first value is the number of rows of the first (analysis) set, the second value is the number of rows of the second (assessment) set, and the third is the original number of rows in the training data, before splitting again
9.3 Transformations
Natural logarithm transformations. If one or more of your variables are not normally distributed, you can transform them using the natural logarithm. Only after this transformation should you standardize all the variables or center those that you need to center. In general, any transformation of a variable has to happen before standardizing or centering.
# standardization
# what is a standard score === a standard score of 2 is an observation that falls 2 standard deviations above the mean for that variable
# lasso and ridge to test for multicollinearity
# https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
mtcars_p <- mtcars %>%
mutate(mpg_centered = (mpg - mean(mpg))) %>%
mutate(mpg_scaled = scale(mpg)) %>% # (mpg - mean(mpg)) / sd(mpg)
dplyr::select(index, car, mpg, mpg_centered, mpg_scaled)
mtcars_p
modelone <- lm(mpg ~ wt, mtcars)
modeltwo <- lm(mpg ~ log(wt), mtcars)
modelthree <- lm(log(mpg) ~ wt, mtcars)
modelfour <- lm(log(mpg) ~ log(wt), mtcars)
# summary(modelone)$r.squared
# summary(modeltwo)$r.squared
# summary(modelthree)$r.squared
# summary(modelfour)$r.squared
# equivalent to this:
# ggplot(mtcars_p, aes(x = log(wt), y = mpg)) + geom_point() + geom_smooth(method = 'lm')
# mtcars_p <- mtcars %>%
# mutate(log_wt = log(wt))
#
# ggplot(mtcars_p, aes(x = log_wt, y = mpg)) + geom_point() + geom_smooth(method = 'lm')
# model is better but harder to interpret
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + geom_smooth(method = 'lm') +
labs(caption = "r-squared: 0.7528328")
q <- ggplot(mtcars, aes(x = log(wt), y = mpg)) + geom_point() + geom_smooth(method = 'lm') +
labs(caption ="r-squared: 0.810146")
r <- ggplot(mtcars, aes(x = wt, y = log(mpg))) + geom_point() + geom_smooth(method = 'lm') +
labs(caption ="r-squared: 0.7975582")
s <- ggplot(mtcars, aes(x = log(wt), y = log(mpg))) + geom_point() + geom_smooth(method = 'lm') +
labs(caption ="r-squared: 0.8056487")
ggarrange(p, q, r, s)
# mtcars_p <- mtcars %>%
# mutate(mpg_centered = (mpg - mean(mpg))) %>%
# mutate(mpg_scaled = scale(mpg)) %>%
# mutate(wt_centered = (wt - mean(wt))) %>%
# mutate(wt_scaled = scale(wt))
#
# mtcars %>%
# recipes::step_center(all_numeric())
#
# simple_recipe <- function(dataset){
# recipe(price ~ ., data = dataset) %>%
# recipes::step_center(all_numeric()) %>%
# recipes::step_scale(all_numeric()) %>%
# recipes::step_dummy(all_nominal())
# }
#%>% # (mpg - mean(mpg)) / sd(mpg)
# dplyr::select(index, car, mpg, mpg_centered, mpg_scaled)
Subsetting and Shrinkage
Shrinkage and selection aim at improving upon the simple linear regression. There are two main reasons why it could need improvement:
- Prediction accuracy: Linear regression estimates tend to have low bias and high variance. Reducing model complexity (the number of parameters that need to be estimated) reduces the variance at the cost of introducing more bias. If we can find the sweet spot where the total error, that is, the error from bias plus the error from variance, is minimized, we can improve the model’s predictions.
- Model interpretability: With too many predictors it is hard for a human to grasp all the relations between the variables. In some cases we are willing to determine a small subset of variables with the strongest impact, sacrificing some detail in order to see the big picture.
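As a brief sketch of what shrinkage looks like in practice (this assumes the glmnet package, which is not otherwise introduced in this book, and an illustrative set of predictors), ridge and lasso regression shrink coefficient estimates toward zero by penalizing their size:
# a minimal shrinkage sketch; alpha = 0 gives ridge regression, alpha = 1 gives the lasso
library(glmnet)
x <- model.matrix(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)[, -1]
y <- mtcars$mpg
set.seed(335)
ridge_cv <- cv.glmnet(x, y, alpha = 0)   # cross-validated choice of the penalty lambda
coef(ridge_cv, s = "lambda.min")         # shrunken coefficients at the best lambda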
Normalization
Centering
Centered independent variables are obtained just by subtracting the mean of the variable.
We center variables if we want to gain a meaningful interpretation of the estimated constant. In this case, you can center the amount of variables you want to; you do not need to center all the independent variables in the model.
Interpretation of coefficients for centered variables. Coefficients of the centered variables: “An increase in X1 by -number- from its mean will increase (or decrease) Y by -number-.”
Constant: “It represents the expected value of Y when the non-centered variables are zero and when the centered variables are at their mean.”
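A minimal sketch of this interpretation, assuming the simple mpg ~ wt model used throughout this chapter; centering wt leaves the slope unchanged, but the intercept becomes the expected mpg at the average car weight:
# center the predictor by subtracting its mean; the slope matches the uncentered model
centered_model <- lm(mpg ~ I(wt - mean(wt)), data = mtcars)
coef(centered_model)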
Standardizing
Standardized variables are obtained by first subtracting the mean of the variable and then by dividing by the standard deviation of the variable.
We standardize variables to facilitate the interpretation of the estimated coefficients when the variables in your regression have different units of measurement.
When standardizing, we have to standardize all the variables in the regression. This implies we won’t get an estimate of the constant (i.e., the B0 or intercept).
Interpretation of coefficients for standardized variables: “An increase by 1 standard deviation in X1 will increase (or decrease) Y by -number-.”
Notice that for centered and standardized variables, the r-squared stays the same but the interpretation of coefficients is different.
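A minimal sketch of that point, again assuming the mpg ~ wt model; scale() standardizes each variable, the r-squared matches the unstandardized model, and the slope is now in standard-deviation units:
# standardize both outcome and predictor with scale(); the intercept is effectively zero
standardized_model <- lm(scale(mpg) ~ scale(wt), data = mtcars)
summary(standardized_model)$r.squared   # same r-squared as lm(mpg ~ wt, mtcars)
coef(standardized_model)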