The Tidyverse library is a useful tool that enables us to read various datasets into a data frame.
First, load the tidyverse library.
library(tidyverse)
── Attaching core tidyverse packages ───────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2 ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
The original Airline dataset is hosted on the IBM Data Asset eXchange (DAX). We will be using a subset of the original dataset, which contains just LAX to JFK flights, throughout this notebook.
Now, using the subset dataset link, you can load it and store it as a dataframe sub_airline:
url <- "https://dax-cdn.cdn.appdomain.cloud/dax-airline/1.0.1/lax_to_jfk.tar.gz"
# Download the file
download.file(url, destfile = "lax_to_jfk.tar.gz")
trying URL 'https://dax-cdn.cdn.appdomain.cloud/dax-airline/1.0.1/lax_to_jfk.tar.gz'
Content type 'application/x-gzip' length 58424 bytes (57 KB)
downloaded 57 KB
# Untar the file so we can get the csv only
untar("lax_to_jfk.tar.gz", tar = "internal")
Warning: using pax extended headers
# read_csv only
sub_airline <- read_csv("lax_to_jfk/lax_to_jfk.csv", col_types = cols(
'DivDistance' = col_number(),
'DivArrDelay' = col_number()
))
Now that we have the data loaded, let’s first take a look at the data using the method head().
head(sub_airline)
In R, there are some special symbols to represent special cases in data:
NA: missing values are represented by the symbol NA (not available); it is a special symbol in R. Note that "NA" (a string) is not the same as NA.
NaN: impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number).
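A quick base R illustration of these special values (a minimal sketch, not part of the airline dataset):
0 / 0                # NaN: an impossible arithmetic result
is.nan(0 / 0)        # TRUE
is.na(NaN)           # TRUE: is.na() also flags NaN
identical(NA, "NA")  # FALSE: the string "NA" is not the missing symbol NA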
The missing values in the airline dataset are already represented with R’s NA symbol. We use R’s built-in (also called base R) functions to identify these missing values.
There are two methods to detect missing data:
is.na(x): x can be a vector or list; this method returns a vector of TRUE or FALSE depending on whether the corresponding element in x is NA or not. For example, is.na(c(1, NA)) returns FALSE TRUE.
anyNA(x, recursive = FALSE): x can be a vector or list; this method returns TRUE if x contains any NAs and FALSE otherwise. For example, anyNA(c(1, NA)) returns TRUE.
is.na(c(1, NA))
[1] FALSE TRUE
is.na(paste(c(1, NA)))
[1] FALSE FALSE
Again, the output for is.na() is a vector of logical values where TRUE stands for a missing value and FALSE stands for a non-missing value.
anyNA(c(1, NA))
[1] TRUE
The output for anyNA() is a single logical value: TRUE means there is at least one missing value in the vector, while FALSE means there are no missing values in the vector.
We can quickly figure out the number of missing values in each column. As mentioned above, when using the function is.na(), TRUE represents a missing value while FALSE is otherwise. The function sum() counts the number of TRUE values.
Let’s check how many missing values are in the CarrierDelay column, and also how many missing values are in each column:
# counting missing values
sub_airline %>%
summarise(count = sum(is.na(CarrierDelay)))
We can use purrr::map() to count missing values in each of the columns.
map() essentially maps (applies) a function or formula to each given element. In the code below, it is mapping a formula, ~sum(is.na(.)), that sums the NAs, to every column in sub_airline. Since it is using a formula, you will also notice two special operators, dot . and tilde ~:
The tilde ~ separates the left side of a formula from the right side. Normally formulas are two-sided like y ~ x; in this case in map(), the formula gets converted to a function, so it only needs the right side.
The dot . refers to each column in the dataset. If you view the documentation with ?map, you can see that you use . when the function takes in just one parameter (a column in this case).
See ?formula and ?map for more information.
map(sub_airline, ~sum(is.na(.)))
$Month
[1] 0
$DayOfWeek
[1] 0
$FlightDate
[1] 0
$Reporting_Airline
[1] 0
$Origin
[1] 0
$Dest
[1] 0
$CRSDepTime
[1] 0
$CRSArrTime
[1] 0
$DepTime
[1] 0
$ArrTime
[1] 0
$ArrDelay
[1] 0
$ArrDelayMinutes
[1] 0
$CarrierDelay
[1] 2486
$WeatherDelay
[1] 2486
$NASDelay
[1] 2486
$SecurityDelay
[1] 2486
$LateAircraftDelay
[1] 2486
$DepDelay
[1] 0
$DepDelayMinutes
[1] 0
$DivDistance
[1] 2855
$DivArrDelay
[1] 2855
dim(sub_airline)
[1] 2855 21
Based on the summary above, “CarrierDelay”, “WeatherDelay”, “NASDelay”, “SecurityDelay” and “LateAircraftDelay” columns have 2486 rows of missing data, while “DivDistance” and “DivArrDelay” columns have 2855 rows of missing data. All other columns do not have missing data.
“CarrierDelay”: 2486 missing data
“WeatherDelay”: 2486 missing data
“NASDelay”: 2486 missing data
“SecurityDelay” : 2486 missing data
“LateAircraftDelay”: 2486 missing data
“DivDistance”: 2855 missing data
“DivArrDelay”: 2855 missing data
How to deal with missing data?
Generally, you should not blindly drop NAs. However, if an entire column or almost the entire column contains NAs, then it may be a good idea to leave it out. In our dataset, columns DivDistance and DivArrDelay are nearly all empty, so we will drop them entirely.
Drop the whole column:
“DivDistance”: 2855 missing data
“DivArrDelay”: 2855 missing data
drop_na_cols <- sub_airline %>%
select(-DivDistance, -DivArrDelay)
dim(drop_na_cols)
[1] 2855 19
head(drop_na_cols, n = 10)
Drop the whole row:
“CarrierDelay”: 2486 missing data
“WeatherDelay”: 2486 missing data
“NASDelay”: 2486 missing data
“SecurityDelay” : 2486 missing data
“LateAircraftDelay”: 2486 missing data
From the summary, we see that CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, and LateAircraftDelay have the same number of missing values. Dropping the missing values in one column will also solve the missing value issue in the others.
# Drop the missing values
drop_na_rows <- drop_na_cols %>%
drop_na(CarrierDelay)
dim(drop_na_rows)
[1] 369 19
head(drop_na_rows)
We have some freedom in choosing a method to replace missing data; however, some methods may seem more reasonable than others. In this scenario, we would like to replace missing values with 0.
Convert NA to 0
In the airline dataset, missing data for the different types of delay corresponds to no delay. So, we can replace these NAs with 0 in this case. To do this we use the function:
tidyr::replace_na(data, replace, ...)
The columns that correspond with types of delays are CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, and LateAircraftDelay. For example, if CarrierDelay = NA, then this means there is no carrier delay, so the delay in minutes can be changed to 0, that is, CarrierDelay = 0. Let’s transform these columns and see the result:
# Replace the missing values in five columns
replace_nan <- drop_na_rows %>%
replace_na(list(CarrierDelay = 0, WeatherDelay = 0, NASDelay = 0, SecurityDelay = 0, LateAircraftDelay = 0))
head(replace_nan)
According to the example above, let’s try to replace NA in “CarrierDelay” column by the mean value.
carrier_mean <- mean(drop_na_rows$CarrierDelay)
# Assign the result (name chosen here) so the replacement is kept
sub_airline_replace_mean <- sub_airline %>%
replace_na(list(CarrierDelay = carrier_mean))
head(sub_airline_replace_mean)
We are almost there!
The last step in data cleaning is checking and making sure that all data is in the correct format.
In dplyr, we can add _all and _if to the end of its main functions like mutate, filter, group_by, and summarize.
These are called scoped dplyr verbs. The _all variant applies an operation on all variables, and the _if variant applies an operation to a variable if the given function is TRUE.
Now that we understand these different versions of dplyr functions, let’s list the data types for each column by using summarize_all() and gather():
sub_airline %>%
summarise_all(class) %>%
gather(variable, class)
date_airline <- replace_nan %>%
separate(FlightDate, sep = "-", into = c("year", "month", "day"))
head(date_airline)
For a number of reasons, including when we import a dataset into R or process a variable, the data type may be incorrectly established. For example, here we notice that the data type assigned to year, month, and day is “character”, although the expected data type should really be numeric. You can change the type using mutate_all() and mutate_if().
In the code below, it is mutating year, month, and day values to numeric only if they are character. Note that in mutate_if(), the first parameter is the function is.character, which checks a condition, while the second parameter is the function as.numeric, which modifies the data if the condition is true.
date_airline %>%
select(year, month, day) %>%
mutate_all(type.convert, as.is = TRUE) %>%
mutate_if(is.character, as.numeric)
Wonderful!
Now, we finally obtain the cleaned dataset with no missing values and all data in its proper format.
Why normalization?
Normalization is the process of transforming values of several features (variables) into a similar range. An example of why this could be important: if you have a variable for income and another for age, income likely has much bigger values than age, so in a model income would naturally influence the model more. Thus, normalization helps make comparisons between different variables fairer.
Simple scaling divides each value by the maximum value in a feature. The new range is between 0 and 1.
Example
To demonstrate simple scaling, let’s say we want to scale the column “ArrDelay”.
Target: we would like to normalize this variable so its values range from 0 to 1.
Approach: Replace the original value by (original value) / (maximum value)
simple_scale <- sub_airline$ArrDelay / max(sub_airline$ArrDelay)
head(simple_scale, n=10)
[1] 0.002932551 -0.030791789 -0.020527859 -0.004398827 -0.019061584 -0.029325513 0.001466276 -0.058651026 0.036656891 0.010263930
According to the example above, normalize the column “DepDelay” using the simple scaling technique.
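A possible solution, following the same simple-scaling approach as above (the object name simple_scale_depdly is taken from the head() call below):
simple_scale_depdly <- sub_airline$DepDelay / max(sub_airline$DepDelay)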
head(simple_scale_depdly, n=10)
[1] -0.001373626 0.005494505 -0.010989011 0.009615385 -0.005494505 0.002747253 -0.002747253 -0.002747253 0.002747253 0.020604396
Min-max subtracts the minimum value from the original and divides by the maximum minus the minimum. The minimum becomes 0 and the maximum becomes 1.
Example
Using “ArrDelay” as an example again, you can transform this column using the min-max technique:
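A sketch of the min-max transformation that produces the minmax_scale values shown below:
minmax_scale <- (sub_airline$ArrDelay - min(sub_airline$ArrDelay)) /
(max(sub_airline$ArrDelay) - min(sub_airline$ArrDelay))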
head(minmax_scale, n=10)
[1] 0.09933775 0.06887417 0.07814570 0.09271523 0.07947020 0.07019868 0.09801325 0.04370861 0.12980132 0.10596026
Standardization (Z-score) subtracts the mean (𝜇) of the feature and divides by the standard deviation (𝜎).
Example
Let’s use “ArrDelay” again. We can use mean() to find the mean of the feature, and we can use sd() to find the standard deviation of the feature.
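A sketch of the z-score transformation used to produce z_scale below:
z_scale <- (sub_airline$ArrDelay - mean(sub_airline$ArrDelay)) / sd(sub_airline$ArrDelay)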
head(z_scale, n=10)
[1] -0.04815631 -0.60922513 -0.43846505 -0.17012779 -0.41407075 -0.58483083 -0.07255060 -1.07271676 0.51291251 0.07381518
According to the example above, standardize the “DepDelay” column.
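A possible solution (the object name z_scale_DepDly is taken from the head() call below):
z_scale_DepDly <- (sub_airline$DepDelay - mean(sub_airline$DepDelay)) / sd(sub_airline$DepDelay)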
head(z_scale_DepDly, n=10)
[1] -0.28063352 -0.14031185 -0.47708387 -0.05611884 -0.36482653 -0.19644052 -0.30869786 -0.30869786 -0.19644052 0.16839584
Binning is a process of transforming continuous numerical variables into discrete categorical ‘bins’, for grouped analysis.
Example:
Using binning, we categorize arrival delays into four bins by quartiles.
Quartiles refer to the boundaries that divide observations into four defined intervals. They are often determined based on the values of data points and how they compare with the rest of the dataset. In the actual airline dataset, “ArrDelay” is a numerical variable ranging from -73 to 682, with 2855 values. Can we rearrange them into four ‘bins’ to simplify analysis?
We will use the tidyverse method mutate(quantile_rank = ntile()) to segment the “ArrDelay” column into 4 bins.
Let’s plot the histogram of flight arrival delays to see what the distribution of “ArrDelay” looks like.
ggplot(data = sub_airline, mapping = aes(x= ArrDelay)) +
geom_histogram(bins = 100, color = "white", fill = "red") +
coord_cartesian(xlim = c(-73, 682))
First we use the dplyr function ntile to break “ArrDelay” into 4 buckets, which have equal amounts of observations of flight arrival delays. We then create a new column “quantile_rank” that contains the 4 bins, respectively labeled “1”, “2”, “3”, “4”. So bin 1 contains the first 25% of the data, bin 2 the next 25%, and so on.
binning <- sub_airline %>%
mutate(quantile_rank = ntile(ArrDelay, 4))
head(binning)
The observations are put into different bins based on the flights’ delay minutes. The larger the bin label is, the longer the flight was delayed.
Now if we look at a histogram of the bins, you can see that all bins contain an equal number of observations.
ggplot(data = binning, mapping = aes(x = quantile_rank)) +
geom_histogram(bins = 4, color = "white", fill = "red")
An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called ‘dummies’ because the numbers themselves don’t have inherent meaning.
Why do we use indicator variables?
Regression models need numerical variables; however, categorical variables in their original form are strings. Using indicator variables allows us to use these categorical variables in regression models.
Example
In the airline dataset, the “Reporting_Airline” feature is a categorical variable that has nine values, “AA”, “AS”, “B6”, “DL”, “HP”, “PA (1)”, “TW”, “UA” or “VX”, which are in character type. For further analysis, we need to convert these variables into some form of numeric format.
We will use tidyverse’s spread() method to convert categorical variables to dummy variables. The parameters to use in the function are:
key: the column to convert into categorical values
value: the value you want to set the key to
fill: fills the missing values with this value
Also keep in mind that the method will drop the “Reporting_Airline” column by default. So if you used “ArrDelay” as the key value instead, that column would get dropped.
The below code does the following:
mutate - creates a column dummy with all 1’s, then
spread - creates a new column for every Reporting_Airline and sets the value to 1 from dummy, but replaces the 1 with a 0 if the corresponding value is NA in Reporting_Airline, then
slice - looks at the specified rows
sub_airline %>%
mutate(dummy = 1) %>% #column with single value
spread(key = Reporting_Airline, value = dummy, fill = 0) %>%
slice(1:10)
When a value occurs in the original feature, we set the corresponding value to one in the new feature; the rest of the features are set to zero.
So in the output above, for the first row, the reporting airline is “UA”. Therefore, the feature “UA” is set to one and the other features to zero. Similarly, for the second row, the reporting airline value is “AS”. Therefore, the feature “AS” is set to one and the other features to zero.
Alternatively, instead of assigning dummy values 0 or 1, we can
assign flight delay values to each feature. So this will also create new
columns, one for each Reporting_Airline
. Taking the first
row for example, the column “UA” is now set to 2 because that is the
value from ArrDelay.
sub_airline %>%
spread(Reporting_Airline, ArrDelay) %>%
slice(1:10)
view(sub_airline)
Visualize Airline Category
Let’s visualize how many data points are in each airline category.
sub_airline %>%
# labels must follow the sorted order of the factor levels
mutate(Reporting_Airline = factor(Reporting_Airline, labels = c("AA", "AS", "B6", "DL", "HP", "PA (1)", "TW", "UA", "VX"))) %>%
ggplot(aes(Reporting_Airline)) +
stat_count(width = 0.5) +
labs(x = "Reporting Airline", y = "Number of data points")
As above, create an indicator variable for the “Month” column.
sub_airline %>%
mutate(Month = factor(Month, labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))) %>%
ggplot(aes(Month)) +
stat_count(width = 0.5) +
labs(x = "Month", y = "Number of data points")
Now, create indicator variables for the “Month” column by applying departure delay values:
sub_airline %>%
spread(Month, DepDelay) %>%
slice(1:10)
summary_airline_delays <- sub_airline %>%
group_by(Reporting_Airline) %>%
summarise(mean = mean(ArrDelayMinutes, na.rm = TRUE), std_dev = sd(ArrDelayMinutes, na.rm = TRUE)) %>%
arrange(mean)
summary_airline_delays
Create a simple average across Reporting_Airline and DayOfWeek
avg_delays <- sub_airline %>%
group_by(Reporting_Airline, DayOfWeek) %>%
summarise(mean_delays = mean(ArrDelayMinutes))
`summarise()` has grouped output by 'Reporting_Airline'. You can override using the `.groups` argument.
avg_delays %>% arrange(desc(mean_delays))
Plot target variable over multiple variables
avg_delays %>%
ggplot(aes(x = Reporting_Airline, y = DayOfWeek, fill = mean_delays)) +
geom_tile(color = "white", linewidth = 0.2) +
scale_fill_gradient(low = "skyblue", high = "blue")
Boxplots are a great way to visualize numeric (or quantitative) data, since you can visualize the various distributions of the data. They are similar to histograms but can show more information, such as:
The median of the data, which represents the middle datapoint
The Upper Quartile, which is the 75th percentile
The Lower Quartile, which is the 25th percentile
The Interquartile Range, which is the data between the Upper and Lower Quartile
Boxplots also display outliers as individual dots that occur outside the upper and lower extremes. With boxplots, you can easily spot outliers and also see the distribution and skewness of the data.
We will use the ggplot library to create plots, which is already loaded when the tidyverse library is loaded.
To create boxplots, the main function to use is geom_boxplot(). In the below code we also use additional functions, primarily to customize the aesthetics of the plot. You can check out the Data Visualization Course in R if you want to know more about customizing plots with ggplot.
geom_boxplot(): The boxplot compactly displays the distribution of a continuous variable. It visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.
geom_jitter(): The jitter geom is a convenient shortcut for geom_point(position = “jitter”). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.
ggtitle(): Modifies plot titles (main title, axis labels and legend titles).
guides(): Guides for each scale can be set scale-by-scale with the guide argument.
theme_minimal(): A minimalistic theme with no background annotations.
coord_cartesian(): The Cartesian coordinate system is the most familiar, and common, type of coordinate system. Setting limits on the coordinate system will zoom the plot (like you’re looking at it with a magnifying glass), and will not change the underlying data like setting limits on a scale will.
Below shows the distribution of arrival delays for each reporting airline. So here, ArrDelay is the numeric data.
ggplot(data = sub_airline, mapping = aes(x = Reporting_Airline, y = ArrDelay)) +
geom_boxplot(fill = "bisque", color = "black", alpha = 0.3) +
geom_jitter(aes(color = "blue"), alpha = 0.2) +
labs(x = "Airline") +
ggtitle("Arrival Delays by Airline") +
guides(color = "none") +
theme_minimal() +
coord_cartesian(ylim = quantile(sub_airline$ArrDelay, c(0, 0.99)))
Oftentimes we see continuous variables in our data. These data points are numbers contained in some range.
For example, in our dataset, departure delays and arrival delays are continuous numeric variables. What if we want to understand the relationship between “DepDelay” and “ArrDelay”? Could departure delay possibly predict arrival delay?
One good way to visualize two continuous variables is to use a scatter plot. Each observation in a scatter plot is represented as a point.
Using ggplot, you can input the two continuous variables to compare and add geom_point() to show the points in a scatter plot.
alaska_flights <- sub_airline %>%
filter(Reporting_Airline == "AS") %>%
filter(!is.na(DepDelay) & !is.na(ArrDelay)) %>%
filter(DepDelay < 40)
ggplot(data = alaska_flights, mapping = aes(x = DepDelay, y = ArrDelay)) +
geom_point() +
ggtitle("Alaska flight Departure Delays vs. Arrival Delays")
When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable. Below is a way to show each variable and their type in a dataframe.
str(sub_airline)
spc_tbl_ [2,855 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Month : num [1:2855] 3 11 8 4 11 4 12 12 2 3 ...
$ DayOfWeek : num [1:2855] 5 4 5 7 3 1 1 3 4 4 ...
$ FlightDate : Date[1:2855], format: "2003-03-28" "2018-11-29" "2015-08-28" "2003-04-20" ...
$ Reporting_Airline: chr [1:2855] "UA" "AS" "UA" "DL" ...
$ Origin : chr [1:2855] "LAX" "LAX" "LAX" "LAX" ...
$ Dest : chr [1:2855] "JFK" "JFK" "JFK" "JFK" ...
$ CRSDepTime : chr [1:2855] "2210" "1045" "0805" "2205" ...
$ CRSArrTime : chr [1:2855] "0615" "1912" "1634" "0619" ...
$ DepTime : chr [1:2855] "2209" "1049" "0757" "2212" ...
$ ArrTime : chr [1:2855] "0617" "1851" "1620" "0616" ...
$ ArrDelay : num [1:2855] 2 -21 -14 -3 -13 -20 1 -40 25 7 ...
$ ArrDelayMinutes : num [1:2855] 2 0 0 0 0 0 1 0 25 7 ...
$ CarrierDelay : num [1:2855] NA NA NA NA NA NA NA NA NA NA ...
$ WeatherDelay : num [1:2855] NA NA NA NA NA NA NA NA NA NA ...
$ NASDelay : num [1:2855] NA NA NA NA NA NA NA NA NA NA ...
$ SecurityDelay : num [1:2855] NA NA NA NA NA NA NA NA NA NA ...
$ LateAircraftDelay: num [1:2855] NA NA NA NA NA NA NA NA NA NA ...
$ DepDelay : num [1:2855] -1 4 -8 7 -4 2 -2 -2 2 15 ...
$ DepDelayMinutes : num [1:2855] 0 4 0 7 0 2 0 0 2 15 ...
$ DivDistance : num [1:2855] NA NA NA NA NA NA NA NA NA NA ...
$ DivArrDelay : num [1:2855] NA NA NA NA NA NA NA NA NA NA ...
- attr(*, "spec")=
.. cols(
.. Month = col_double(),
.. DayOfWeek = col_double(),
.. FlightDate = col_date(format = ""),
.. Reporting_Airline = col_character(),
.. Origin = col_character(),
.. Dest = col_character(),
.. CRSDepTime = col_character(),
.. CRSArrTime = col_character(),
.. DepTime = col_character(),
.. ArrTime = col_character(),
.. ArrDelay = col_double(),
.. ArrDelayMinutes = col_double(),
.. CarrierDelay = col_double(),
.. WeatherDelay = col_double(),
.. NASDelay = col_double(),
.. SecurityDelay = col_double(),
.. LateAircraftDelay = col_double(),
.. DepDelay = col_double(),
.. DepDelayMinutes = col_double(),
.. DivDistance = col_number(),
.. DivArrDelay = col_number()
.. )
- attr(*, "problems")=<externalptr>
Now, we turn our focus to “ArrDelayMinutes”, as we want to create models to predict this variable.
What is the data type of the column “ArrDelayMinutes”?
class(sub_airline$ArrDelayMinutes)
[1] "numeric"
Next, let’s take a look at other variables, like “DepDelayMinutes” that could potentially help predict “ArrDelayMinutes”.
Find the correlation between the following columns: DepDelayMinutes and ArrDelayMinutes.
Hint: if you would like to select those columns, use the dollar sign ($)
cor(sub_airline$DepDelayMinutes, sub_airline$ArrDelayMinutes)
[1] 0.9213328
Continuous numerical variables are variables that may contain any value within some range. In R, continuous numerical variables can have the type “integer” or “numeric” (these are real numbers and sometimes are called “float” in other programming languages). A great way to visualize these variables is by using scatterplots with fitted lines.
With ggplot, we can visualize this by using geom_point() to plot the data points and geom_smooth() to plot a fitted linear regression line (by default the model uses formula = y ~ x).
Let’s see several examples of different linear relationships:
Let’s find the scatterplot of “DepDelayMinutes” and “ArrDelayMinutes” of all airlines.
ggplot(sub_airline, aes(DepDelayMinutes, ArrDelayMinutes)) +
geom_point() +
geom_smooth(method = "lm")
From the plot, as the departure delay (“DepDelayMinutes”) increases, the arrival delay (“ArrDelayMinutes”) increases. This indicates a positive direct correlation between these two variables. “DepDelayMinutes” may be a decent predictor of “ArrDelayMinutes” since the regression line is increasing and generally matches the data points.
Next, we can examine the correlation between “DepDelayMinutes” and “ArrDelayMinutes” and see that it’s approximately 0.92.
sub_airline %>%
select(DepDelayMinutes, ArrDelayMinutes) %>%
cor(method = "pearson")
DepDelayMinutes ArrDelayMinutes
DepDelayMinutes 1.0000000 0.9213328
ArrDelayMinutes 0.9213328 1.0000000
cor(sub_airline$DepDelayMinutes, sub_airline$ArrDelayMinutes)
[1] 0.9213328
Let’s now look at if “WeatherDelay” is a good predictor variable of “ArrDelayMinutes”.
ggplot(sub_airline, aes(WeatherDelay, ArrDelayMinutes)) +
geom_point(na.rm = TRUE) +
geom_smooth(method = "lm", na.rm = TRUE)
Weather delay does not seem like a good predictor of arrival delay minutes since the regression line is close to horizontal. Also, for small values of “WeatherDelay”, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore it is not a reliable variable.
In the cor() function, you can add use = "complete.obs" in order to only use complete observations, that is, to exclude NAs. You can examine the correlation between “WeatherDelay” and “ArrDelayMinutes” and see that it’s a very weak relationship, since the value is close to 0.
sub_airline %>%
select(WeatherDelay, ArrDelayMinutes) %>%
cor(method = "pearson", use = "complete.obs")
WeatherDelay ArrDelayMinutes
WeatherDelay 1.00000000 0.04019104
ArrDelayMinutes 0.04019104 1.00000000
cor(sub_airline$WeatherDelay, sub_airline$ArrDelayMinutes, use = "complete.obs")
[1] 0.04019104
Find the correlation between x=“CarrierDelay”, y=“ArrDelayMinutes”.
Hint: if you would like to select those columns, use the dollar sign ($)
cor(sub_airline$CarrierDelay, sub_airline$ArrDelayMinutes, use = "complete.obs")
[1] 0.7287601
Given the correlation results between x=“CarrierDelay”, y=“ArrDelayMinutes”, do you expect a linear relationship?
Verify your results using the function of ggplot.
ggplot(data = sub_airline, mapping = aes(x = CarrierDelay, y = ArrDelayMinutes)) +
geom_point(na.rm = TRUE) +
geom_smooth(method = "lm", na.rm = TRUE)
ggplot(sub_airline, aes(CarrierDelay, LateAircraftDelay)) +
geom_point(na.rm = TRUE) +
geom_smooth(method = "lm", na.rm = TRUE)
sub_airline %>%
select(CarrierDelay, LateAircraftDelay) %>%
cor(method = "pearson", use = "complete.obs")
CarrierDelay LateAircraftDelay
CarrierDelay 1.00000000 -0.04173491
LateAircraftDelay -0.04173491 1.00000000
sub_airline %>%
select(DepDelay, ArrDelay) %>%
cor(method = "pearson")
DepDelay ArrDelay
DepDelay 1.0000000 0.8826109
ArrDelay 0.8826109 1.0000000
sub_airline %>%
cor.test(~DepDelay + ArrDelay, data = .)
Pearson's product-moment correlation
data: DepDelay and ArrDelay
t = 100.28, df = 2853, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8742325 0.8904638
sample estimates:
cor
0.8826109
There is a strong correlation that is statistically significant.
library(Hmisc)
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:dplyr’:
src, summarize
The following objects are masked from ‘package:base’:
format.pval, units
numerics_airline <- sub_airline %>%
select(ArrDelayMinutes, DepDelayMinutes, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
airlines_cor <- rcorr(as.matrix(numerics_airline), type = "pearson")
airlines_cor
ArrDelayMinutes DepDelayMinutes CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
ArrDelayMinutes 1.00 0.92 0.73 0.04 0.26 0.08 0.42
DepDelayMinutes 0.92 1.00 0.75 0.04 0.07 0.11 0.45
CarrierDelay 0.73 0.75 1.00 -0.04 -0.19 -0.02 -0.04
WeatherDelay 0.04 0.04 -0.04 1.00 -0.04 -0.01 -0.03
NASDelay 0.26 0.07 -0.19 -0.04 1.00 -0.05 -0.16
SecurityDelay 0.08 0.11 -0.02 -0.01 -0.05 1.00 -0.03
LateAircraftDelay 0.42 0.45 -0.04 -0.03 -0.16 -0.03 1.00
n
ArrDelayMinutes DepDelayMinutes CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
ArrDelayMinutes 2855 2855 369 369 369 369 369
DepDelayMinutes 2855 2855 369 369 369 369 369
CarrierDelay 369 369 369 369 369 369 369
WeatherDelay 369 369 369 369 369 369 369
NASDelay 369 369 369 369 369 369 369
SecurityDelay 369 369 369 369 369 369 369
LateAircraftDelay 369 369 369 369 369 369 369
P
ArrDelayMinutes DepDelayMinutes CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
ArrDelayMinutes 0.0000 0.0000 0.4415 0.0000 0.1398 0.0000
DepDelayMinutes 0.0000 0.0000 0.4096 0.1730 0.0402 0.0000
CarrierDelay 0.0000 0.0000 0.4501 0.0003 0.6464 0.4241
WeatherDelay 0.4415 0.4096 0.4501 0.4441 0.8444 0.5930
NASDelay 0.0000 0.1730 0.0003 0.4441 0.3121 0.0024
SecurityDelay 0.1398 0.0402 0.6464 0.8444 0.3121 0.6093
LateAircraftDelay 0.0000 0.0000 0.4241 0.5930 0.0024 0.6093
When we begin to analyze data, it’s important to first explore the data before spending time building complicated models. One easy way to do so is to calculate some descriptive statistics for your data.
Descriptive statistical analysis helps to describe basic features of a dataset and generates a short summary about the sample and measures of the data.
Let’s take a look at a couple of different useful methods. One way in which we can do this is by using the summarize() function from dplyr (part of the tidyverse). We introduced this method in previous modules. The method group_by() is often used together with summarize(); it summarizes each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.
This will show:
the count of that variable
the mean
the standard deviation (std)
the minimum value
the median (50th percentile or quartile 2)
the IQR (Interquartile Range: quartile 3 minus quartile 1)
the maximum value
We can apply the method “summarize” and “group_by” as follows:
summary_airline_delays <- sub_airline %>%
group_by(Reporting_Airline) %>%
summarise(count = n(), avg = mean(ArrDelayMinutes, na.rm = TRUE), std_dev = sd(ArrDelayMinutes, na.rm = TRUE), min = min(ArrDelayMinutes, na.rm = TRUE), median = median(ArrDelayMinutes, na.rm = TRUE), iqr = IQR(ArrDelayMinutes, na.rm = TRUE), max = max(ArrDelayMinutes, na.rm = TRUE)) %>%
arrange(desc(avg))
summary_airline_delays
To identify the data type of each column in a dataframe, we can use sapply() with typeof, which finds the type of something. So here, you apply typeof to every column in sub_airline.
sapply(sub_airline, typeof)
Month DayOfWeek FlightDate Reporting_Airline Origin Dest CRSDepTime CRSArrTime
"double" "double" "double" "character" "character" "character" "character" "character"
DepTime ArrTime ArrDelay ArrDelayMinutes CarrierDelay WeatherDelay NASDelay SecurityDelay
"character" "character" "double" "double" "double" "double" "double" "double"
LateAircraftDelay DepDelay DepDelayMinutes DivDistance DivArrDelay
"double" "double" "double" "double" "double"
One way you can summarize categorical data is by using the function count(). As an example, you can get the count of each reporting airline in this dataset. We see that we have 1096 flights from AA (American Airlines), 45 flights from AS (Alaska Airlines), 258 from B6 (JetBlue), etc.
sub_airline %>%
count(Reporting_Airline) %>%
arrange(desc(n))
We often ask questions like: Is there any relationship between the reporting airline and the flight delays? If so, which day of the week do flights have relatively longer delay times? For example, people take flights most frequently on Monday and Friday for business trips. Would that impact how long the flight is delayed? It would be nice if we could group the data by the different reporting airline, and compare the results of these different day of week against each other.
In tidyverse this can be done using the group_by method. The group_by method is used on categorical variables; it groups the data into subsets according to the different categories of that variable. You can group by a single variable, or you can group by multiple variables by passing in multiple variable names.
Let’s say we are interested in finding the average delay minutes of flights and observing how they differ between different “Reporting_Airline” and “DayOfWeek” values.
To do this, you can use group_by() to group by “Reporting_Airline” and “DayOfWeek”, then use summarize() to calculate the mean delay minutes. Setting .groups = 'keep' keeps the grouping structure; if you instead were to set it to "drop", then the output would be ungrouped and essentially be a regular tibble (dataframe).
avg_delays <- sub_airline %>%
group_by(Reporting_Airline, DayOfWeek) %>%
summarise(mean_delays = mean(ArrDelayMinutes), .groups = "keep")
avg_delays
The function arrange() will reorder the rows in ascending order. However, using arrange() with desc() sorts the data in descending order.
So now, we sort the mean_delays column in descending order using arrange(desc()) and print the output table to see the top airline and day of the week pairs that have the highest average arrival delays.
sorted <- avg_delays %>%
arrange(desc(mean_delays))
sorted
To make it easier to understand, we can transform this table into a heatmap.
A heatmap is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions, with one variable displayed along the x-axis and the other displayed along the y-axis.
The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space. It is a great way to plot the target variable over multiple variables and, through this, get visual clues about the relationship between these variables and the target.
With ggplot, you can use geom_tile() to create heatmaps and use scale_fill_gradient() to change the coloring of the heatmap. To interpret the heatmap below: the closer a tile is to the color red, the higher the mean arrival delay for that particular day of the week and airline pair.
avg_delays %>%
ggplot(aes(x = Reporting_Airline, y = DayOfWeek, fill = mean_delays)) +
# Set the tile's borders to be white with size 0.2
geom_tile(color = "white", linewidth = 0.2) +
# Define gradient color scales
scale_fill_gradient(low = "yellow", high = "red")
BONUS Heatmap
For something more sophisticated, you can always add more blocks in your code.
# This visualization will use the lubridate package
library(lubridate)
# Let's take a simple average across Reporting_Airline and DayOfWeek
avg_delays <- sub_airline %>%
group_by(Reporting_Airline, DayOfWeek) %>%
summarise(mean_delays = mean(ArrDelayMinutes), .groups = "keep") %>%
# create a new variable from mean_delays
# make the first range -0.1 to 0.1 to include zero values
mutate(bins = cut(mean_delays, breaks = c(-0.1, 0.1, 10, 20, 30, 50))) %>%
mutate(bins = factor(as.character(bins), levels = rev(levels(bins))))
ggplot(avg_delays, aes(x = Reporting_Airline, y = lubridate::wday(DayOfWeek, label = TRUE), fill = bins)) +
geom_tile(colour = "white", linewidth = 0.2) +
geom_text(aes(label = round(mean_delays, 2))) +
guides(fill = guide_legend(title = "Delays Time Scale")) +
labs(x = "Reporting Airline", y = "Days of Week", title = "Average Arrival Delays") +
scale_fill_manual(values = c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#e6f598", "#abdda4"))
Use the group_by() and summarize() functions to find the average “ArrDelayMinutes” of each flight based on “Reporting_Airline”.
sub_airline %>%
group_by(Reporting_Airline) %>%
summarise(mean_delays = mean(ArrDelayMinutes)) %>%
arrange(mean_delays)
The main question we are trying to answer in this module is: “What causes flight delays?”. To get a better measure of the important characteristics, we look at the correlation of these variables with the arrival delay, AKA, “ArrDelayMinutes”, in other words: how is the arrival delay minutes dependent on this variable?
First, let’s begin with some important definitions:
Correlation: a measure of the extent of interdependence between variables.
Causation: the relationship between cause and effect between two variables.
It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler than determining causation as causation may require independent experimentation.
The Pearson Correlation measures the linear dependence between two variables X and Y.
The resulting coefficient is a value between -1 and 1 inclusive, where:
1: Total positive linear correlation.
0: No linear correlation, the two variables most likely do not affect each other.
-1: Total negative linear correlation
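As a tiny illustration of these extremes with toy vectors (not from the airline dataset):
cor(c(1, 2, 3), c(2, 4, 6))   # 1: a perfect positive linear relationship
cor(c(1, 2, 3), c(6, 4, 2))   # -1: a perfect negative linear relationship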
Pearson Correlation is the default method of the function cor(), but other methods can be specified by setting method. Like before, we can calculate the Pearson Correlation of the “integer” or “numeric” variables.
sub_airline %>%
select(DepDelayMinutes, ArrDelayMinutes) %>%
cor(method = "pearson")
DepDelayMinutes ArrDelayMinutes
DepDelayMinutes 1.0000000 0.9213328
ArrDelayMinutes 0.9213328 1.0000000
Sometimes, we would like to know the significance of the correlation estimate.
P-value:
What is this P-value? The P-value is the probability value that the correlation between these two variables is statistically significant. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.
By convention, when the:
p-value is < 0.001: we say there is strong evidence that the correlation is significant.
p-value is < 0.05: there is moderate evidence that the correlation is significant.
p-value is < 0.1: there is weak evidence that the correlation is significant.
p-value is > 0.1: there is no evidence that the correlation is significant.
In R, to conduct a significance test and get the p-values, you can use cor.test().
sub_airline %>%
cor.test(~DepDelayMinutes + ArrDelayMinutes, data = .)
Pearson's product-moment correlation
data: DepDelayMinutes and ArrDelayMinutes
t = 126.58, df = 2853, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9155942 0.9266962
sample estimates:
cor
0.9213328
Conclusion:
We see that the p-value is very small, much smaller than 0.001. And so we can conclude that we are certain about the strong positive correlation.
To calculate correlations on multiple variables, you can select all the variables and then use the cor() function with use = "pairwise.complete.obs" set, so that the correlation of every pair of variables is calculated. This computes a matrix of Pearson’s r correlation coefficients for all possible pairs of columns.
correlation <- sub_airline %>%
select(ArrDelayMinutes, DepDelayMinutes, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay) %>%
cor(use = "pairwise.complete.obs", method = "pearson")
correlation
ArrDelayMinutes DepDelayMinutes CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
ArrDelayMinutes 1.00000000 0.92133281 0.72876012 0.04019104 0.26170778 0.07702208 0.41691413
DepDelayMinutes 0.92133281 1.00000000 0.75399335 0.04304843 0.07109447 0.10683841 0.44701937
CarrierDelay 0.72876012 0.75399335 1.00000000 -0.03943409 -0.18634695 -0.02396245 -0.04173491
WeatherDelay 0.04019104 0.04304843 -0.03943409 1.00000000 -0.03995573 -0.01024942 -0.02791712
NASDelay 0.26170778 0.07109447 -0.18634695 -0.03995573 1.00000000 -0.05276635 -0.15735914
SecurityDelay 0.07702208 0.10683841 -0.02396245 -0.01024942 -0.05276635 1.00000000 -0.02668880
LateAircraftDelay 0.41691413 0.44701937 -0.04173491 -0.02791712 -0.15735914 -0.02668880 1.00000000
Taking all variables into account, we can now create a heatmap that visualizes the correlation between each of the variables with one another.
Let’s use the corrplot() function to plot an elegant graph of a correlation matrix; you may need to install the corrplot package first before loading it. The color scheme indicates the Pearson correlation coefficient, indicating the strength of the correlation between two variables. We can see a diagonal line with a dark blue color, indicating that all the values on this diagonal are highly correlated. This makes sense because, when you look closer, the values on the diagonal are the correlations of all variables with themselves, which will always be 1.
This correlation heatmap gives us a good overview of how the different variables are related to one another and, most importantly, how these variables are related to arrival delays.
library(corrplot)
corrplot 0.92 loaded
numerics_airline <- sub_airline %>%
select(ArrDelayMinutes, DepDelayMinutes, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay)
airlines_cor <- cor(numerics_airline, method = "pearson", use = "pairwise.complete.obs")
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(airlines_cor, method = "color", col = col(200), type = "upper", order = "hclust", addCoef.col = "black", tl.col = "black", tl.srt = 45)
From the above correlation plot, you can see that of the features we used, “CarrierDelay”, “DepDelayMinutes”, and “LateAircraftDelay” have the highest correlations with “ArrDelayMinutes”. The correlation between “CarrierDelay” and “ArrDelayMinutes” is 0.73, “DepDelayMinutes” and “ArrDelayMinutes” is 0.92, and so on.
Let’s say that we want to analyze a categorical variable and see the correlation among different categories. For example, consider the airline dataset, the question we may ask is, how do different categories of the reporting airline feature (as a categorical variable) impact flight delays?
The bar graph below shows the average flight delays of different airlines. The code first groups the data by airline, then finds the average arrival delay for each airline, then it plots this as a bar chart.
In geom_bar(), you should set stat = "identity" since you are passing in the y values (“Average_Delays”). If you left out this parameter, then by default stat = "count", which would instead count the frequency of each airline.
summary_airline_delays <- sub_airline %>%
group_by(Reporting_Airline) %>%
summarise(Average_Delays = mean(ArrDelayMinutes, na.rm = TRUE))
summary_airline_delays %>%
ggplot(aes(x = Reporting_Airline, y = Average_Delays)) +
geom_bar(stat = "identity") +
ggtitle("Average Arrival Delays by Airline")
To analyze categorical variables such as the “Reporting_Airline” variable, we can use a method such as the ANOVA method.
The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:
F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.
P-value: The p-value tells you how statistically significant the calculated score value is.
If our ArrDelay variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.
American Airlines (AA) and Alaska Airlines (AS)
The ANOVA test can be performed with base R’s stats package using the aov() function. You can pass in the arrival delay data of the two airline groups that we want to compare, and it calculates the ANOVA results.
In this first example, you can compare American Airlines and Alaska Airlines. The results confirm what we guessed at first. The flight delays between “AA” and “AS” are not significantly different, as the F-score (0.13) is less than 1 and the p-value is larger than 0.05.
aa_as_subset <- sub_airline %>%
select(ArrDelay, Reporting_Airline) %>%
filter(Reporting_Airline == "AA" | Reporting_Airline == "AS")
ad_aov <- aov(ArrDelay ~ Reporting_Airline, data = aa_as_subset)
summary(ad_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Reporting_Airline 1 126 125.7 0.13 0.718
Residuals 1139 1097707 963.7
American Airlines (AA) and Pan Am (PA (1))
As another example, you can compare American Airlines and Pan Am. From the below output, the arrival delays between “AA” and “PA (1)” are significantly different, since the F-score is very large (F = 17.95) and the p-value is 0.0000245, which is smaller than 0.05. All in all, we can say that there is a strong correlation between a categorical variable and other variables if the ANOVA test gives us a large F-test value and a small p-value.
aa_pa_subset <- sub_airline %>%
select(ArrDelay, Reporting_Airline) %>%
filter(Reporting_Airline == "AA" | Reporting_Airline == "PA (1)")
ad_aov <- aov(ArrDelay ~ Reporting_Airline, data = aa_pa_subset)
summary(ad_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Reporting_Airline 1 24008 24008 17.95 2.45e-05 ***
Residuals 1127 1507339 1337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We now have a better idea of what our data looks like and which variables are important to take into account when predicting the arrival delay (ArrDelay). We have narrowed it down to the following variables:
Continuous numerical variables:
DepDelayMinutes
CarrierDelay
LateAircraftDelay
Categorical variables:
Reporting_Airline
As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model’s prediction performance.
_______________________________________________________________________________________________________________
Simple Linear Regression is a method to help us understand the relationship between two variables:
X: Predictor variable - DepDelayMinutes
Y: Target variable - ArrDelayMinutes
The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.
\[ \hat{Y} = b_{0} + b_{1}X \]
\(b_{0}\) refers to the intercept of the regression line, the value of Y when X is 0
\(b_{1}\) refers to the slope of the regression line, the value with which Y changes when X increases by 1 unit
\(\hat{Y}\) (y-hat) is the predicted value from the linear model
Fit the data into a linear regression model
First, let’s look at just American Airlines (AA) data, so filter the data first. We also filter out the NAs in CarrierDelay because you will use that variable later.
aa_delays <- sub_airline %>%
filter(CarrierDelay != "NA", Reporting_Airline == "AA")
head(aa_delays)
For this example, we want to look at how departure delay (DepDelayMinutes) can help us predict arrival delay (ArrDelayMinutes). Using simple linear regression, we will create a linear function with “DepDelayMinutes” as the predictor variable and “ArrDelayMinutes” as the response variable. You can use base R’s function lm() to create a linear model.
Fit the data into a linear regression model:
linear_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = aa_delays)
Summarize the regression model using summary(). The output displays the learned coefficients (“Estimate” in the output) of the model, \(b_0\) and \(b_1\), as well as other information about the fitted model.
summary(linear_model)
Call:
lm(formula = ArrDelayMinutes ~ DepDelayMinutes, data = aa_delays)
Residuals:
Min 1Q Median 3Q Max
-37.234 -12.716 -1.354 7.747 93.646
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.3544 2.5084 6.919 2.9e-10 ***
DepDelayMinutes 0.7523 0.0399 18.855 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.03 on 113 degrees of freedom
Multiple R-squared: 0.7588, Adjusted R-squared: 0.7567
F-statistic: 355.5 on 1 and 113 DF, p-value: < 2.2e-16
SLR - Estimated Linear Model
The intercept (b0) is 17.35 and the slope (b1) is 0.7523.
The relationship between arrival and departure delay is given by: ArrDelayMinutes = 17.35 + 0.7523 * DepDelayMinutes
Create new, never-seen data; then we can output a prediction for three new data points:
new_depdelay <- data.frame(DepDelayMinutes = c(12, 19, 24))
Predict the regression model
pred <- predict(linear_model, newdata = new_depdelay, interval = "confidence")
pred
fit lwr upr
1 26.38175 21.98838 30.77512
2 31.64769 27.52630 35.76908
3 35.40907 31.44593 39.37222
When we print the pred object, we can see that there are 3 columns: fit, lwr, and upr. The “fit” column contains the prediction results for the inputs, and “lwr” and “upr” are the lower and upper bounds of the 95% confidence intervals of the prediction results. The confidence interval reflects the uncertainty around the mean predictions.
For example, given that the DepDelayMinutes is 12, then the model predicts the ArrDelayMinutes to be 26.38, and we are 95% confident that the interval (21.98, 30.77) captures the true mean arrival delay for this instance.
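As a quick sanity check, plugging DepDelayMinutes = 12 into the estimated equation with the rounded coefficients from the summary above reproduces the first fitted value:
# Manual prediction: b0 + b1 * x
17.3544 + 0.7523 * 12 # approximately 26.38, matching the first "fit" value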
What is the value of the intercept (\(b_0\)) and the slope (\(b_1\))?
As we saw above, we should get a final linear model with the structure:
\[ \hat{Y} = b_0 + b_1 X \]
Remember that we are predicting ArrDelayMinutes using DepDelayMinutes. So, plugging in the actual values we get:
\[ ArrDelayMinutes = 17.35 + 0.7523 * DepDelayMinutes \]
Create a linear function with “CarrierDelay” as the predictor variable and the “ArrDelayMinutes” as the response variable.
linear_model2 <- lm(ArrDelayMinutes ~ CarrierDelay, data = aa_delays)
Find the coefficients (intercept and slope) of the model.
linear_model2$coefficients
(Intercept) CarrierDelay
35.1176108 0.7031761
What is the equation of the predicted line? You can use x and yhat, or ‘CarrierDelay’ and ‘ArrDelayMinutes’.
\[ ArrDelayMinutes = 35.12 + 0.703 * CarrierDelay \]
Simply visualize your data with a regression line:
ggplot(aa_delays, aes(x = DepDelayMinutes, y = ArrDelayMinutes)) +
geom_point() +
stat_smooth(method = "lm", col = "red", se = FALSE)
What if we want to predict arrival delay minutes using more than one variable?
If we want to use more variables in our model to predict arrival delay minutes, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables. Most real-world regression models involve multiple predictors. We will illustrate the structure by using two predictor variables, but these results can generalize to any number of predictor variables:
\(Y\): Response Variable
\(X_1\): Predictor Variable 1
\(X_2\): Predictor Variable 2
The equation is given by:
\[ \hat{Y} = b_0 + b_1X_1 + b_2X_2 \]
Where,
\(b_0\): intercept
\(b_1\): coefficient of Variable 1
\(b_2\): coefficient of Variable 2
From previous cells we know that other good predictors of ArrDelayMinutes could be:
DepDelayMinutes
LateAircraftDelay
Let’s develop a model using these variables as the predictor variables by fitting the data.
Define the dataset, the predictor variables, and the target variable.
Use the lm() function to fit the model and find the parameters b0, b1, b2.
mlr <- lm(ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay, data = aa_delays)
mlr
Call:
lm(formula = ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay,
data = aa_delays)
Coefficients:
(Intercept) DepDelayMinutes LateAircraftDelay
17.31707 0.75556 -0.01028
Use summary() to output the model results
summary(mlr)
Call:
lm(formula = ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay,
data = aa_delays)
Residuals:
Min 1Q Median 3Q Max
-37.188 -12.545 -1.317 7.791 93.683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.31707 2.53786 6.823 4.78e-10 ***
DepDelayMinutes 0.75556 0.04822 15.668 < 2e-16 ***
LateAircraftDelay -0.01028 0.08407 -0.122 0.903
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.11 on 112 degrees of freedom
Multiple R-squared: 0.7588, Adjusted R-squared: 0.7545
F-statistic: 176.2 on 2 and 112 DF, p-value: < 2.2e-16
What is the value of the intercept (b0) and the coefficients (b1, b2)?
mlr$coefficients
(Intercept) DepDelayMinutes LateAircraftDelay
17.31706644 0.75555543 -0.01027513
What is the final estimated linear model that we get?
As we saw above, we should get a final linear function with the structure:
\[ \hat{Y} = b_0 + b_1X_1 + b_2X_2 \]
What is the linear function we get in this example?
\[ ArrDelayMinutes = 17.32 + 0.7556 * DepDelayMinutes - 0.0103 * LateAircraftDelay \]
Create and train a Multiple Linear Regression model “mlr2” where the response variable is ArrDelayMinutes, and the predictor variable is ‘DepDelayMinutes’, ‘LateAircraftDelay’ and ‘CarrierDelay’.
mlr2 = lm(ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay + CarrierDelay, data = aa_delays)
summary(mlr2)
Call:
lm(formula = ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay +
CarrierDelay, data = aa_delays)
Residuals:
Min 1Q Median 3Q Max
-33.711 -12.875 -0.265 6.105 92.735
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.2650 2.5820 7.074 1.43e-10 ***
DepDelayMinutes 0.6107 0.0995 6.137 1.33e-08 ***
LateAircraftDelay 0.1536 0.1292 1.189 0.2370
CarrierDelay 0.1799 0.1083 1.661 0.0996 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.96 on 111 degrees of freedom
Multiple R-squared: 0.7647, Adjusted R-squared: 0.7583
F-statistic: 120.2 on 3 and 111 DF, p-value: < 2.2e-16
Find the coefficients of the model.
mlr2$coefficients
(Intercept) DepDelayMinutes LateAircraftDelay CarrierDelay
18.2649917 0.6106613 0.1536269 0.1799444
Using the fitted model mlr, what are the predicted values for the following new data points?
DepDelayMinutes <- c(10, 20, 30)
LateAircraftDelay <- c(20, 60, 30)
new_multidelay <- data.frame(DepDelayMinutes, LateAircraftDelay)
pred <- predict(mlr, newdata = new_multidelay, interval = "confidence")
pred
fit lwr upr
1 24.66712 19.02312 30.31112
2 31.81167 21.42849 42.19485
3 39.67548 34.12732 45.22364
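As before, a quick manual check of the first prediction using the mlr coefficients shown earlier:
# b0 + b1 * x1 + b2 * x2 for the first new data point
17.31707 + 0.75556 * 10 - 0.01028 * 20 # approximately 24.67, matching the first "fit" value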
Now that we’ve developed some models, how do we evaluate our models and how do we choose the best one? One way to do this is by using visualization.
When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.
Regression plots are a good estimate of:
The relationship of two variables,
The strength of the correlation, and
The direction of the relationship (positive or negative).
There are several ways to plot a regression plot; a simple way is to use “ggplot” from the tidyverse library.
This plot will show a combination of scattered data points (a scatter plot), as well as the fitted linear regression line going through the data. This will give us a reasonable estimate of the relationship between the two variables, the strength of the correlation, as well as the direction (positive or negative correlation).
ggplot(aa_delays, aes(x = DepDelayMinutes, y = ArrDelayMinutes)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
We can see from this plot that Arrival Delay Minutes (ArrDelayMinutes) is positively correlated to Departure Delay Minutes (DepDelayMinutes), since the regression slope is positive.
One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data, and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data.
Create a regression plot of “CarrierDelay” and “ArrDelayMinutes” using the “aa_delays” dataset.
ggplot(aa_delays, aes(x = CarrierDelay, y = ArrDelayMinutes)) +
geom_point() +
geom_smooth(method = "lm", col = "red")
Given the regression plots above, is “DepDelayMinutes” or “CarrierDelay” more strongly correlated with “ArrDelayMinutes”? Use the method “cor()” to verify your answer.
cor(aa_delays$DepDelayMinutes, aa_delays$ArrDelayMinutes)
[1] 0.8710917
cor(aa_delays$CarrierDelay, aa_delays$ArrDelayMinutes)
[1] 0.624374
The variable “DepDelayMinutes” has a stronger correlation with “ArrDelayMinutes”: approximately 0.871, compared to approximately 0.624 for “CarrierDelay”.
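If you want to compare several candidate predictors at once, cor() also accepts a data frame of numeric columns and returns a correlation matrix. A minimal sketch (use = "complete.obs" guards against any remaining missing values):
# Correlation matrix across the delay columns
aa_delays %>%
  select(ArrDelayMinutes, DepDelayMinutes, CarrierDelay) %>%
  cor(use = "complete.obs")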
A good way to visualize the variance of the data is to use a residual plot. Before we start creating residual plots let’s first answer the following questions:
What is a residual?
What is a residual plot?
What do we pay attention to when looking at a residual plot?
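One way to ground these questions before plotting: a residual is simply the observed value minus the model's fitted value. A minimal sketch that verifies this definition directly (the name slr is our own):
# A residual is the observed value minus the predicted value
slr <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = aa_delays)
head(aa_delays$ArrDelayMinutes - predict(slr))  # computed by hand
head(residuals(slr))                            # extracted from the fitted model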
Now, let’s look again at the regression plot of ArrDelayMinutes as the response and DepDelayMinutes as the predictor. This time, let’s visualize the residuals on this plot.
The red line is the regression line
The black dots represent the observed values of ArrDelayMinutes
The hollow blue dots are the predicted values from the linear regression model
The light gray lines are the residuals, or errors. They show how far away the observed values are from the predicted values, so a longer line means a larger error.
# Fit the simple linear regression model
score_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = aa_delays)
# Create a new column "predicted" with the model's fitted values
aa_delays$predicted <- predict(score_model)
ggplot(aa_delays, aes(x = DepDelayMinutes, y = ArrDelayMinutes)) +
# Plot the actual points
geom_point() +
# Plot regression line
geom_smooth(method = "lm", se = FALSE, color = "red") +
# Add the predicted values
geom_point(aes(y = predicted), shape = 1, color = "blue") +
# Connect the actual data points with their corresponding predicted value
geom_segment(aes(xend = DepDelayMinutes, yend = predicted), alpha = 0.1)
Next, we can create a residual plot, which graphs the residuals (the light gray lines in the previous graph) against the observed DepDelayMinutes. The code to do this is similar to a normal scatterplot, but you pass the linear model lm(ArrDelayMinutes ~ DepDelayMinutes) into ggplot() and, when setting the y axis, use .resid, which refers to the residuals of the model you passed in.
We can see from this residual plot that the residuals are not randomly spread around the x-axis, which leads us to believe that maybe a non-linear model is more appropriate for this data.
ggplot(lm(ArrDelayMinutes ~ DepDelayMinutes, data = aa_delays)) +
geom_point(aes(x = DepDelayMinutes, y = .resid))
In addition to residual plots, there are other useful plots. A simple way to view diagnostic plots is to first create the linear model using lm(), then call base R’s plot() function on the model.
The below code will output four graphs:
Residual plot: Identical to the graph we made with ggplot, here it again shows that the residuals are not randomly spread around the x-axis.
Q-Q plot: The dotted diagonal line represents what normally distributed error (residual) values would follow. In this case, the residuals do not look normally distributed, since many observations fall above the line on the right side.
Scale-location plot: This plot helps check the homoscedasticity assumption. Here, the red line is not flat, which indicates that the homoscedasticity assumption is not satisfied.
Residuals vs leverage plot: This helps identify influential points. Any point falling outside the dotted lines (Cook’s distance) would be influential. Here, none of the points cross the lines, although several come close and could be removed or analyzed further.
linear_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = aa_delays)
plot(linear_model)
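By default, plot() on an lm object draws these four graphs one at a time, prompting between them. To view them together in a single 2 x 2 panel, you can set the graphics layout first, a common base R idiom:
# Arrange the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(linear_model)
# Reset the layout afterwards
par(mfrow = c(1, 1))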
Polynomial regression is a particular case of the general linear regression model or multiple linear regression models. That is, although the data is nonlinear in polynomial regression (the predictor variables have higher order terms), the model itself is linear in all cases, because it is linear in the coefficients (\(b_0, b_1, \dots\)), which are always of order 1.
There are different orders of polynomial regression:
Quadratic - 2nd order
\[ \hat{Y} = b_0 + b_1X + b_2X^2 \]
Cubic - 3rd order
\[ \hat{Y} = b_0 + b_1X + b_2X^2 + b_3X^3 \]
Higher (\(n^{th}\)) order
\[ \hat{Y} = b_0 + b_1X + b_2X^2 + b_3X^3 + \dots + b_nX^n \]
Let’s look at the below example. Here, we create a predictor variable x and a response variable y that follows a 3rd order polynomial, then add some random noise to it to get noisy.y. We set the seed so that this result can be reproduced.
set.seed(20)
x <- seq(from=0, to=20, by=0.1)
# value to predict (y):
y <- 500 + 0.4 * (x - 10)^3
# some noise is generated and added to the real signal (y):
noise <- rnorm(length(x), mean=10, sd=80)
noisy.y <- y + noise
In the graph below, we fit a first order linear model to this example dataset and can see that the model does not fit the data very well.
# fit linear model
ggplot(data = NULL, aes(x, noisy.y)) +
geom_point() +
geom_smooth(method = "lm")
Instead, we can use a polynomial model. It is similar to the first order linear model, except that you include poly() within geom_smooth() to indicate what order polynomial to use. For example, using poly(x, 5) equates to fitting \(b_0 + b_1X + b_2X^2 + b_3X^3 + b_4X^4 + b_5X^5\).
ggplot(data=NULL, aes(x, noisy.y)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 5))
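Note that geom_smooth() fits this model only for the plot. To inspect the fit numerically, you can build the same model with lm(); a minimal sketch (model5 is our own name):
# Fit the 5th order polynomial directly so we can inspect the fit
model5 <- lm(noisy.y ~ poly(x, 5))
summary(model5)$r.squared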
We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function “hits” more of the data points.
Now let’s look at another example, this time using a 2nd order polynomial. Again, we use a toy dataset where time is the predictor and temp is the response.
time <- 6:19
temp <- c(4, 6, 7, 9, 10, 11, 11.5, 12, 12, 11.5, 11, 10, 9, 8)
ggplot(data = NULL, aes(time, temp)) +
geom_point()
We can create a model like we saw before using lm(), and to include higher order terms, you can use poly().
For this dataset, we try a 2nd order polynomial model to see how it fits. The equation the model follows is:
\[ temp = b_0 + b_1 * time + b_2 * time^2 \]
polyfit2 <- lm(temp ~ poly(time, 2, raw = TRUE))
summary(polyfit2)
Call:
lm(formula = temp ~ poly(time, 2, raw = TRUE))
Residuals:
Min 1Q Median 3Q Max
-0.52005 -0.06387 0.03970 0.15543 0.21250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.710165 0.601247 -22.80 1.30e-10 ***
poly(time, 2, raw = TRUE)1 3.760920 0.102822 36.58 7.69e-13 ***
poly(time, 2, raw = TRUE)2 -0.138393 0.004071 -33.99 1.71e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2197 on 11 degrees of freedom
Multiple R-squared: 0.9931, Adjusted R-squared: 0.9918
F-statistic: 791.3 on 2 and 11 DF, p-value: 1.301e-12
From the summary output of the model, you can find the coefficients, so to predict temp, you could use:
\[ temp = -13.7 + 3.76 * time - 0.138 * time^2 \]
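You can also let R evaluate this equation for you with predict(). A minimal sketch estimating the temperature at a few new times (the times chosen here are arbitrary):
# Predict temperatures at new times using the fitted 2nd order model
new_time <- data.frame(time = c(8.5, 12.5, 16.5))
predict(polyfit2, newdata = new_time)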
Like for the first order linear models, you can use ggplot to graph the model.
ggplot(data = NULL, aes(time, temp)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2))
Create a 4th order polynomial model with the variables time and temp from above and display the summary of the model.
polyfit4 <- lm(temp ~ poly(time, 4, raw = TRUE))
summary(polyfit4)
Call:
lm(formula = temp ~ poly(time, 4, raw = TRUE))
Residuals:
Min 1Q Median 3Q Max
-0.33403 -0.05810 -0.01222 0.07299 0.26814
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9580273 5.3098909 0.180 0.8608
poly(time, 4, raw = TRUE)1 -1.6827915 1.9381916 -0.868 0.4078
poly(time, 4, raw = TRUE)2 0.5770452 0.2523955 2.286 0.0481 *
poly(time, 4, raw = TRUE)3 -0.0397085 0.0139698 -2.842 0.0193 *
poly(time, 4, raw = TRUE)4 0.0007906 0.0002788 2.836 0.0195 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1763 on 9 degrees of freedom
Multiple R-squared: 0.9964, Adjusted R-squared: 0.9947
F-statistic: 616.4 on 4 and 9 DF, p-value: 5.787e-11
polyfit4$coefficients
(Intercept) poly(time, 4, raw = TRUE)1 poly(time, 4, raw = TRUE)2 poly(time, 4, raw = TRUE)3 poly(time, 4, raw = TRUE)4
0.9580272669 -1.6827914732 0.5770451974 -0.0397084533 0.0007905697
Using the predicted coefficients from the summary output for the 4th order model, write down the model equation.
\[ temp = 0.958 - 1.683 * time + 0.577 * time^2 - 0.0397 * time^3 + 0.00079 * time^4 \]
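To check whether the extra 3rd and 4th order terms give a significantly better fit than the quadratic, one option is base R's anova(), which performs an F-test on the two nested models fit above. A minimal sketch:
# F-test comparing the nested 2nd order and 4th order fits
anova(polyfit2, polyfit4)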
When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.
Two very important measures that are often used in Statistics to determine the accuracy of a model are:
\(R^2\) / R-squared
Mean Squared Error (MSE)
R-squared
R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line. The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.
Mean Squared Error (MSE)
\[ MSE = average((\hat{Y} - Y)^2) \]
\[ RMSE = \sqrt{MSE} \]
The Mean Squared Error measures the average of the squares of errors, that is, the difference between actual value (\(Y\)) and the estimated value (\(\hat{Y}\)). Another metric that is related to MSE is root mean squared error (RMSE) and is simply the square root of MSE.
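Since we will compute these metrics for several models below, a small helper keeps the calculation consistent. A minimal sketch (model_mse is our own name, not part of any package):
# Compute MSE and RMSE from a fitted lm model's residuals,
# e.g. model_mse(linear_model)
model_mse <- function(model) {
  mse <- mean(model$residuals^2)
  c(MSE = mse, RMSE = sqrt(mse))
}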
Let’s use the simple linear regression model we created previously.
Using this model, you can calculate the MSE and RMSE. As shown below, the MSE is about 394.06 and the RMSE is about 19.85.
mse <- mean(linear_model$residuals^2)
mse
[1] 394.0639
rmse <- sqrt(mse)
rmse
[1] 19.85104
\(R^2\) can be obtained from the summary of the model. From the output below, we can say that approximately 75.9% of the variation of arrival delay minutes is explained by this simple linear model.
summary(linear_model)$r.squared
[1] 0.7588008
Next, let’s use the multiple linear regression model we created in section 3.
mlr <- lm(ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay, data = aa_delays)
Let’s calculate the MSE and RMSE. As shown below, the MSE is about 394.01 and the RMSE is about 19.85.
mse_mlr <- mean(mlr$residuals^2)
mse_mlr
[1] 394.0113
rmse_mlr <- sqrt(mse_mlr)
rmse_mlr
[1] 19.84972
From the r-squared value below, we can say that approximately 75.9% of the variation of Arrival Delay Minutes is explained by this multiple linear regression model “mlr”.
summary(mlr)$r.squared
[1] 0.7588329
Finally, we can use a polynomial regression model using the skills from section 5.
poly_reg <- lm(ArrDelayMinutes ~ poly(DepDelayMinutes, 3), data = aa_delays)
Similar to models 1 and 2, you can find the MSE, RMSE, and R^2. Here the MSE is 328.97, the RMSE is 18.14, and R^2 is 0.799.
mse_poly <- mean(poly_reg$residuals^2)
mse_poly
[1] 328.9701
rmse_poly <- sqrt(mse_poly)
rmse_poly
[1] 18.13753
summary(poly_reg)$r.squared
[1] 0.7986434
Previously, we trained the model using the method lm(), and we used the method predict() to produce a prediction.
head(predict(score_model))
1 2 3 4 5 6
17.35443 83.55480 38.41818 73.77520 62.49104 67.75698
Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?
What is a good R-squared value?
What is a good MSE?
Simple Linear Regression: Using DepDelayMinutes as a Predictor Variable of ArrDelayMinutes.
R-squared: 0.7588
MSE: 394.06
Multiple Linear Regression: Using DepDelayMinutes and LateAircraftDelay as Predictor Variables of ArrDelayMinutes.
R-squared: 0.75883
MSE: 394.011
Polynomial Fit: Using a 3rd Order Polynomial of DepDelayMinutes as a Predictor Variable of ArrDelayMinutes.
R-squared: 0.7986
MSE: 328.970
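To make this side-by-side comparison easier to read, you could collect the metrics into one small table. A sketch, assuming linear_model, mlr, and poly_reg from the sections above and the tidyverse already loaded:
# Gather R-squared and MSE for the three models into one tibble
tibble(
  model = c("SLR", "MLR", "Polynomial (3rd order)"),
  r_squared = c(summary(linear_model)$r.squared,
                summary(mlr)$r.squared,
                summary(poly_reg)$r.squared),
  mse = c(mean(linear_model$residuals^2),
          mean(mlr$residuals^2),
          mean(poly_reg$residuals^2))
)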
Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful or may even act as noise. As a result, you should always check the MSE and R^2.
So, to be able to compare the results of the MLR vs SLR models, we look at a combination of both the R-squared and MSE to make the best conclusion about the fit of the model.
MSE: The MSE of the SLR model is 394.064, while the MLR model has an MSE of 394.011. The MSE of the MLR model is only slightly smaller.
R-squared: In this case, we can see that the R-squared for the SLR is a little lower than the R-squared for the MLR model.
This R-squared, in combination with the MSE, shows that MLR seems like a slightly better model fit in this case, compared to SLR. However, you could try adding more predictor variables to the MLR model to see whether that makes a bigger improvement, since our example used only two.
MSE: We can see that the polynomial model brought down the MSE, since its MSE is smaller than the SLR's.
R-squared: The R-squared for the polynomial fit is larger than the R-squared for the SLR, so the polynomial fit also raised the R-squared quite a bit.
Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that this was a better fit model than the simple linear regression for predicting ArrDelayMinutes.
MSE: The MSE for the polynomial model is smaller than the MSE for the MLR model.
R-squared: The R-squared for the polynomial model is also larger than the MLR model’s.
Comparing these three models, the MLR model performs slightly better than the SLR model. Perhaps if we tried adding some more predictor variables, the MLR model could do even better. Of the three models, we conclude that the polynomial model of order 3 seems to be the best fit, as it has the highest R^2 and the lowest MSE.