Data Science Math - Final Project

Outline

Load hflights Package
Probability
Descriptive and Inferential Statistics
Linear Algebra and Correlation
Calculus Based Probability & Statistics

Load hflights Package

install.packages("hflights", repos='https://mirrors.nics.utk.edu/cran/')
library(hflights)

Citation information and documentation for the ‘hflights’ package can be found here: https://cran.r-project.org/web/packages/hflights/hflights.pdf
The package consists of a single dataset called ‘hflights’

Probability

I chose DepDelay (departure delay) as random variable X and ArrDelay (arrival delay) as random variable Y.

Both are positively skewed:

x <- hflights$DepDelay
hist(x, breaks = 150, freq = FALSE, xlim = c(-40,150), main = "Histogram of Departure Delay",xlab = "Departure Delay in Minutes")

y <- hflights$ArrDelay
hist(y, breaks = 150, freq = FALSE, xlim = c(-40,150), main = "Histogram of Arrival Delay",xlab = "Arrival Delay in Minutes")

After an analysis of NA values, I decided to eliminate any rows which had NA in any of the columns and reset X and Y (please see the “Preparatory Data Profiling” section of https://rpubs.com/LelandoSupreme/200997 for that analysis):

hf_excl_na <- hflights[rowSums(is.na(hflights))==0,]
x <- hf_excl_na$DepDelay
y <- hf_excl_na$ArrDelay

Then I used the quantile function to compute x, which is the 3rd quartile of the X variable, and y, which is the 2nd quartile of Y variable:

quantile(x, probs = 0.75)

## 75% 
##   9

quantile(y, probs = 0.5)

## 50% 
##   0

Compute P(X > x | Y > y)

# Is the P(X > x and Y > y) divided by P(Y > y)
# Probability P(X > x and Y > y)
p1 <- nrow(subset(hf_excl_na, hf_excl_na$DepDelay > quantile(hf_excl_na$DepDelay, probs = 0.75) & hf_excl_na$ArrDelay > quantile(hf_excl_na$ArrDelay, probs = 0.5))) / nrow(hf_excl_na)
# Probability P(Y > y)
p2 <- nrow(subset(hf_excl_na, hf_excl_na$ArrDelay > quantile(hf_excl_na$ArrDelay, probs = 0.5))) / nrow(hf_excl_na)
p1 / p2

## [1] 0.4804153

Compute P(X > x, Y > y)

nrow(subset(hf_excl_na, hf_excl_na$DepDelay > quantile(hf_excl_na$DepDelay, probs = 0.75) & hf_excl_na$ArrDelay > quantile(hf_excl_na$ArrDelay, probs = 0.5))) / nrow(hf_excl_na)

## [1] 0.2294416

Compute P(X < x | Y > y)

# Is the P(X < x and Y > y) divided by P(Y > y)
# Probability P(X < x and Y > y)
p1 <- nrow(subset(hf_excl_na, hf_excl_na$DepDelay <= quantile(hf_excl_na$DepDelay, probs = 0.75) & hf_excl_na$ArrDelay > quantile(hf_excl_na$ArrDelay, probs = 0.5))) / nrow(hf_excl_na)
# Probability P(Y > y)
p2 <- nrow(subset(hf_excl_na, hf_excl_na$ArrDelay > quantile(hf_excl_na$ArrDelay, probs = 0.5))) / nrow(hf_excl_na)
p1 / p2

## [1] 0.5195847

Table of counts: You can see below the key to map my R code (displayed below the table) to the counts grid. For example, the cell with (a) in it was populated using the code labeled “a” in my R scripts. You can see all totals columns were computed by adding some or all of the (a), (b), (c), and (d) values.

Mapping of Cells to Formulas

# Compute value for (a)
nrow(subset(hf_excl_na, hf_excl_na$DepDelay <= quantile(hf_excl_na$DepDelay, probs = 0.75) & hf_excl_na$ArrDelay <= quantile(hf_excl_na$ArrDelay, probs = 0.5)))

## [1] 113141

# Compute value for (b)
nrow(subset(hf_excl_na, hf_excl_na$DepDelay <= quantile(hf_excl_na$DepDelay, probs = 0.75) & hf_excl_na$ArrDelay > quantile(hf_excl_na$ArrDelay, probs = 0.5)))

## [1] 55554

# Compute value for (c)
nrow(subset(hf_excl_na, hf_excl_na$DepDelay > quantile(hf_excl_na$DepDelay, probs = 0.75) & hf_excl_na$ArrDelay <= quantile(hf_excl_na$ArrDelay, probs = 0.5)))

## [1] 3813

# Compute value for (d)
nrow(subset(hf_excl_na, hf_excl_na$DepDelay > quantile(hf_excl_na$DepDelay, probs = 0.75) & hf_excl_na$ArrDelay > quantile(hf_excl_na$ArrDelay, probs = 0.5)))

## [1] 51366

Counts

Does splitting the data in this fashion make them independent? No. The fact that we can take observations and subset them doesn’t make them independent or dependent.

Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations for the 2nd quartile for Y. Does P(A|B) = P(A) * P(B)? Check mathematically. No - see below.

A <- nrow(subset(hf_excl_na, hf_excl_na$DepDelay > quantile(hf_excl_na$DepDelay, probs = 0.75)))
B <- nrow(subset(hf_excl_na, hf_excl_na$ArrDelay <= quantile(hf_excl_na$ArrDelay, probs = 0.5)))
# P(A)
pA <- A / nrow(hf_excl_na)
# P(B)
pB <- B / nrow(hf_excl_na)
# P(A|B)
pAB <- nrow(subset(hf_excl_na, hf_excl_na$DepDelay > quantile(hf_excl_na$DepDelay, probs = 0.75) & hf_excl_na$ArrDelay <= quantile(hf_excl_na$ArrDelay, probs = 0.5))) / nrow(hf_excl_na)
pA * pB

## [1] 0.1287602

pAB

## [1] 0.0170319

Evaluate by running a Chi Square test for association.

chisqtbl <- table(hf_excl_na$DepDelay, hf_excl_na$ArrDelay)
chisq.test(chisqtbl)

## Warning in chisq.test(chisqtbl): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  chisqtbl
## X-squared = 24824000, df = 197270, p-value < 2.2e-16

Descriptive and Inferential Statistics

Here are summary statistics for DepDelay and ArrDelay to supplment the Histograms in the prior section:

summary(hf_excl_na$DepDelay)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -33.000  -3.000   0.000   9.415   9.000 981.000

summary(hf_excl_na$ArrDelay)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -70.000  -8.000   0.000   7.094  11.000 978.000

To create a scatterplot, I loaded Hadley Wickham’s ggplot2 package:

# Load ggplot2
install.packages("ggplot2", repos='https://mirrors.nics.utk.edu/cran/')
library(ggplot2)

Then I created a scatterplot of DepDelay against ArrDelay:

ggplot(hf_excl_na, aes(x=hf_excl_na$ArrDelay, y=hf_excl_na$DepDelay)) + geom_point() + labs(title="Depature Delay vs. Arrival Delay",x="Arrival Delay in Minutes", y = "Departure Delay in Minutes")

It certainly looks like the variables are correlated.

Provide a 95% confidence interval for the difference in the mean of the variables.

# Difference between the means
dm <- mean(hf_excl_na$DepDelay) - mean(hf_excl_na$ArrDelay)
# Standard error of the difference between means
se <- sqrt(((sd(hf_excl_na$DepDelay)/nrow(hf_excl_na))+(sd(hf_excl_na$ArrDelay)/nrow(hf_excl_na))))
# 95% confidence interval 
c(dm - se*qnorm(0.975),dm + se*qnorm(0.975))

## [1] 2.288710 2.352588

Derive a correlation matrix for two of the quantitative variables you selected.

cm <- cor(hf_excl_na[c("DepDelay","ArrDelay")])
cm

##           DepDelay  ArrDelay
## DepDelay 1.0000000 0.9292181
## ArrDelay 0.9292181 1.0000000

Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval.

cor.test(hf_excl_na$DepDelay,hf_excl_na$ArrDelay,conf.level = 0.99)

## 
##  Pearson's product-moment correlation
## 
## data:  hf_excl_na$DepDelay and hf_excl_na$ArrDelay
## t = 1189.8, df = 223870, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.9284710 0.9299578
## sample estimates:
##       cor 
## 0.9292181

Given the p-value, the likelihood that the hypothesis of a zero correlation is very low.

Linear Algebra and Correlation

Invert your correlation matrix. (This is known as the precison matrix).

im <- solve(cm)
im

##           DepDelay  ArrDelay
## DepDelay  7.323130 -6.804785
## ArrDelay -6.804785  7.323130

** Multiply the correlation matrix by the precision matrix, and then mutiply the precision matrix by the correlation matrix.**

cm %*% im

##          DepDelay ArrDelay
## DepDelay        1        0
## ArrDelay        0        1

im %*% cm

##          DepDelay ArrDelay
## DepDelay        1        0
## ArrDelay        0        1

The result in both cases is the identity matrix.

Calculus Based Probability and Statistics

For your variable which is skewed to the right, shift it so that the minimum value is above zero.

#Create new DF
hf_min_val <- hf_excl_na
# Check range for DepDelay
c(hf_min_val$DepDelay[which.min(hf_min_val$DepDelay)],hf_min_val$DepDelay[which.max(hf_min_val$DepDelay)])

## [1] -33 981

# Add 34 to all values
hf_min_val$DepDelay <- hf_min_val$DepDelay + 34
# Check range for ArrDelay
c(hf_min_val$ArrDelay[which.min(hf_min_val$ArrDelay)],hf_min_val$ArrDelay[which.max(hf_min_val$ArrDelay)])

## [1] -70 978

# Add 71 to all values
hf_min_val$ArrDelay <- hf_min_val$ArrDelay + 71

Though both variabiles are skewed to the right, for this exercise, I will use the ArrDelay variable.

Then load the MASS package and run fitdistr to fit an exponential probability density function. Documentation for MASS::fitdistr is here: https::/stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html

install.packages("MASS", repos='https://mirrors.nics.utk.edu/cran/')
library(MASS)

fd <- fitdistr(hf_min_val$ArrDelay, "exponential")

Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value. Plot a histogram and compare it with a histogram of your original variable.

# Optimal value of lambda
fd$estimate

##       rate 
## 0.01280503

# 100 samples from this distribution
r <- rexp(1000,fd$estimate)
# Plot a histogram using 1000 samples
hist(r, freq = FALSE, main = "Histogram of 1000 Samples")

# Plot a histogram using original variable
hist(hf_excl_na$ArrDelay, freq = FALSE, xlim = c(0, 500), main = "Histogram of Arrival Delay",xlab="Arrival Delay in Minutes")

The two histograms are similar in shape except that one starts below zero and other only contains positive values.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

# 5th percentile
pexp(quantile(r, probs = 0.05), rate=fd$estimate, lower.tail = TRUE)

##         5% 
## 0.04637791

# 95th percentile
pexp(quantile(r, probs = 0.95), rate=fd$estimate, lower.tail = TRUE)

##       95% 
## 0.9402936

Also generate a 95% confidence interval from the empirical data, assuming normality.

# y was set to hf_excl_na$ArrDelay earlier
# Calculate mean and standard deviation
m <- mean(y)
se <- sd(y)
# 95% confidence interval
c(m - (se*qnorm(0.975)), m + (se*qnorm(0.975)))

## [1] -53.09325  67.28192

Finally, provide the empirical 5th percentile and 95th percentile of the data.

quantile(y, probs = 0.05)

##  5% 
## -18

quantile(y, probs = 0.95)

## 95% 
##  57

Discuss.

There is a large divergence between the 5% and 95% percentiles (assuming normality) and the 5th and 95th percentiles determined empirically because the distribution is not normal, it is exponetial.