Final Data Science Math Presentation

My first Random Variable is Distance

library(hflights)  # data set used throughout
#Random variable X = Distance
dist <-hflights$Distance

Distance appears to be skewed left, since the median (809.0) is greater than the mean (787.8).

summary(dist)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   376.0   809.0   787.8  1042.0  3904.0

My second variable is Arrival Delay. This Y variable is skewed to the right, as seen in summary(aDel): the mean (7.094) is greater than the median (0.000).

# Random variable Y = Arrival Delay 
aDel <-hflights$ArrDelay
summary(aDel)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -70.000  -8.000   0.000   7.094  11.000 978.000    3622

# Skewed right

Using the code below, I calculate x as the 3rd quartile of X (Distance) and y as the 2nd quartile, i.e. the median, of Y (Arrival Delay).

# x <- distance, 75 percentile 
x<-quantile(dist)["75%"]
# y <- arr delay, 50 percentile
y<-quantile(aDel,na.rm = TRUE)["50%"]

1. Part a) Probability P(X>x | Y>y)

total<-nrow(hflights)

#a.P(X>x|Y>y)
xg_yg<-nrow(subset(hflights,dist>x & aDel>y))
yg<-nrow(subset(hflights,aDel>y))
p1<-xg_yg/total
p2<-yg/total
a<-p1/p2
a

## [1] 0.2502806

Analysis: The probability that a flight's distance is above the third quartile, given that its arrival delay is greater than the second quartile (the median), is 0.2502806.
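The same kind of conditional probability can be computed without subset() by averaging a logical vector. This is a minimal sketch on made-up toy vectors (dist_toy and delay_toy are hypothetical and not taken from hflights), so its value is unrelated to the result above.

```r
# Toy data, invented for illustration only
dist_toy  <- c(100, 200, 300, 400, 500, 600, 700, 800)
delay_toy <- c( -5,  10,  NA,   3,  -2,  20,   0,  15)

x_toy <- quantile(dist_toy)["75%"]                  # 3rd quartile of distance
y_toy <- quantile(delay_toy, na.rm = TRUE)["50%"]   # median of delay

# Keep only flights with a known delay above the median
late <- !is.na(delay_toy) & delay_toy > y_toy

# P(X > x | Y > y): share of the late flights whose distance exceeds x
p_cond <- mean(dist_toy[late] > x_toy)
p_cond
```

Conditioning by indexing (dist_toy[late]) makes the denominator explicit: only flights in the event Y > y are counted.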

2. Part b) Probability P(X>x, Y>y)

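#b. P(X>x,Y>y)
xg<-nrow(subset(hflights,dist>x))
yg<-nrow(subset(hflights,aDel>y))
p3 <-xg/total   # P(X>x)
p4<-yg/total    # P(Y>y)
# Joint probability counted directly; p3*p4 would equal it only under independence
b<-xg_yg/total
b

## [1] 0.1176284

Analysis: The probability that a flight's distance is greater than the third quartile and its arrival delay is greater than the second quartile is 0.1176284. The product of the marginal probabilities, p3*p4 = 0.1171846, is close to this value, but it is the joint probability only if X and Y are independent.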

3. Part c) Probability P(X<x | Y>y)

xl_yl<-nrow(subset(hflights,dist<x & aDel>y))
p5<-xl_yl/total
c<-p5/p4
c

## [1] 0.7417321

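Analysis: The probability that a flight's distance is below the third quartile, given that its arrival delay is greater than the second quartile, is 0.7417321. Because the inequality is strict, flights whose distance is exactly equal to x are excluded, which is why this value and the result of part a) do not add up to 1.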

4. Create and Fill Table

#Fill the table values
c11<-nrow(subset(hflights,dist<=x & aDel<=y))
c12<-nrow(subset(hflights,dist<=x & aDel>y))
c13<-c11+c12
c21<-nrow(subset(hflights,dist>x & aDel<=y))
c22<-nrow(subset(hflights,dist>x & aDel>y))
c23<-c21+c22
c31<-c11+c21
c32<-c12+c22
c33<-c13+c23

# byrow = TRUE so that the rows are Distance and the columns are Arrival Delay
tabValues<- matrix(c(c11,c12,c13,c21,c22,c23,c31,c32,c33),3,3,byrow = TRUE)
colnames(tabValues) <- c("<=2d quartile",">2d quartile","Total")
rownames(tabValues) <- c('<=3d quartile', '>3d quartile','Total')
tabValues

##               <=2d quartile >2d quartile  Total
## <=3d quartile         87620        80160 167780
## >3d quartile          29334        26760  56094
## Total                116954       106920 223874
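As an alternative to filling each cell by hand, base R's table() and addmargins() can build the same kind of table in one step. The sketch below uses small invented vectors (dist_toy, delay_toy, and both cut-offs are hypothetical) rather than hflights.

```r
# Toy data, invented for illustration only
dist_toy  <- c(100, 200, 300, 400, 500, 600, 700, 800)
delay_toy <- c( -5,  10,  NA,   3,  -2,  20,   0,  15)
x_toy <- 625   # assumed distance cut-off
y_toy <- 3     # assumed delay cut-off

# Cross-tabulate the two indicators; pairs with an NA delay are dropped by table()
tab <- table(Distance = dist_toy > x_toy, ArrDelay = delay_toy > y_toy)
dimnames(tab) <- list(Distance = c("<=x", ">x"), ArrDelay = c("<=y", ">y"))

# Append row and column totals (labelled "Sum")
tab_tot <- addmargins(tab)
tab_tot
```

Letting table() do the counting keeps the row and column orientation tied to the named dimensions, which avoids mixing up rows and columns when the matrix is filled manually.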

Question: Does splitting the data in this fashion make them independent?

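Analysis: Splitting the data at the quartiles does not by itself make the two variables independent, and it is difficult to judge independence just by looking at the counts, so it is checked numerically and with a chi-square test in the next steps.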

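5. Check P(A and B) = P(A).P(B)

# Check independence: P(A and B) = P(A).P(B), which is equivalent to P(A|B) = P(A)
# A is the event X>x, B is the event Y>y
p6<-p1        # P(A and B)
p7<-p3*p4     # P(A).P(B)
# all.equal() compares with a numerical tolerance instead of exact equality
check<-isTRUE(all.equal(p6,p7))
check

## [1] FALSE

Analysis: Mathematically P(A and B) NE P(A).P(B) (0.1176284 versus 0.1171846), so the two events are not exactly independent, although the difference is small.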
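One way to make this check concrete is to compare P(A|B) with P(A) using the cell counts computed in step 4 (c21 = 29334 and c22 = 26760 for Distance above the third quartile, c12 = 80160 for the remaining flights with a delay above the median). This sketch restricts itself to the 223874 flights with a recorded arrival delay, which is an assumption that differs slightly from dividing by nrow(hflights).

```r
# Counts taken from step 4 (flights with a recorded arrival delay only)
n_AB  <- 26760              # X > x and Y > y   (c22)
n_A   <- 29334 + 26760      # X > x             (c21 + c22)
n_B   <- 80160 + 26760      # Y > y             (c12 + c22)
n_all <- 223874             # all flights with a recorded arrival delay

p_A         <- n_A / n_all  # P(A)
p_A_given_B <- n_AB / n_B   # P(A | B)

# Under independence the two values would agree
c(p_A = p_A, p_A_given_B = p_A_given_B, difference = p_A_given_B - p_A)
```

The closer the two probabilities are, the closer the events are to independence. Whether the remaining gap is more than sampling noise is a question for a formal test.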

6. Chi-square test for independence

tbl = table(hflights$Distance, hflights$ArrDelay)
chisq.test(tbl)

## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 99671, df = 72996, p-value < 2.2e-16

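Analysis: The p-value (< 2.2e-16) is very small, so the null hypothesis of independence is rejected, meaning there is a relationship between distance and arrival delay. Note the warning, however: the table of raw values contains cells with expected counts below 5, so the chi-squared approximation may be unreliable.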
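Because cross-tabulating the raw Distance and ArrDelay values produces a very large, sparse table (hence the warning), a hedged alternative is to run the test on the 2x2 quartile table from step 4. The sketch below rebuilds that table from the counts computed there. Testing the quartile split instead of the raw values is my assumption, not part of the original test.

```r
# 2x2 counts from step 4: rows = Distance (<= x, > x), columns = ArrDelay (<= y, > y)
counts <- matrix(c(87620, 80160,
                   29334, 26760),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Distance = c("<=3d quartile", ">3d quartile"),
                                 ArrDelay = c("<=2d quartile", ">2d quartile")))

# Pearson chi-square test of independence
# (Yates' continuity correction is applied to 2x2 tables by default)
res <- chisq.test(counts)
res$expected   # expected counts under independence
res
```

With only four cells, every expected count is large, so the approximation warning does not arise for this version of the test.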

Part 2 DESCRIPTIVE AND INFERENTIAL STATISTICS

Univariate descriptive statistics (summary)

summary(dist)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   376.0   809.0   787.8  1042.0  3904.0

summary(aDel)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -70.000  -8.000   0.000   7.094  11.000 978.000    3622

# Histogram
hist(hflights$Distance,xlab= "Distance", main = "Distance of flights")

hist(hflights$ArrDelay,xlab= "Arrival Delay", main = "Delay in Arrival")

# Density plot
d1 <- density(dist) 
plot(d1)

d2 <-density(aDel, na.rm = TRUE)
plot(d2)

# Scatter plot
plot(dist,aDel, xlab="Distance", ylab="Arrival Delay")

# Q-Q plot
qqplot(dist,aDel, xlab="Distance", ylab="Arrival Delay")

95% CI for the difference of the two means

t.test(dist, aDel)

## 
##  Welch Two Sample t-test
## 
## data:  dist and aDel
## t = 818.85, df = 229610, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  778.8203 782.5575
## sample estimates:
##  mean of x  mean of y 
## 787.783245   7.094334

Analysis: With 95% confidence, the difference between the mean of the random variable distance and the mean of arrival delay lies between 778.8203 and 782.5575. The values 787.783245 and 7.094334 are the two sample means, not the bounds of the interval.

Correlation Matrix

# Correlation matrix
Dist_aDelay<- hflights[,c("Distance","ArrDelay")]
Dist_aDelay <- Dist_aDelay[complete.cases(Dist_aDelay), ]
mat<-cor(Dist_aDelay)
mat

##              Distance     ArrDelay
## Distance  1.000000000 -0.004434254
## ArrDelay -0.004434254  1.000000000

Analysis: The correlation matrix shows there is a very slight negative correlation (-0.004434254) between distance and arrival delay.

Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval.

cor.test(dist,aDel, method = "pearson" , conf.level = 0.99)

## 
##  Pearson's product-moment correlation
## 
## data:  dist and aDel
## t = -2.0981, df = 223870, p-value = 0.0359
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  -0.009877962  0.001009717
## sample estimates:
##          cor 
## -0.004434254

Analysis:

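1. The p-value (0.0359) is below 0.05 but above 0.01, so at the 0.01 significance level that matches the 99% confidence interval there is not enough evidence to reject the null hypothesis.

2. The 99% confidence interval (-0.009877962, 0.001009717) contains 0, so at alpha level 0.01 we cannot conclude that the correlation between distance and arrival delay differs from 0. At alpha level 0.05 the null hypothesis would be rejected, but the estimated correlation (-0.004434254) is negligible in size.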
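The confidence interval that cor.test() reports for Pearson's correlation is based on Fisher's z-transformation. As a rough sketch, it can be reproduced by hand from the estimate r = -0.004434254 and n = 223874 complete pairs (the number of flights with a recorded arrival delay; using that count here is an assumption).

```r
r <- -0.004434254   # sample correlation from cor.test()
n <- 223874         # assumed number of complete (Distance, ArrDelay) pairs
conf_level <- 0.99

z      <- atanh(r)                          # Fisher z-transform of r
se     <- 1 / sqrt(n - 3)                   # approximate standard error of z
z_crit <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value

# Back-transform the interval for z to the correlation scale
ci <- tanh(z + c(-1, 1) * z_crit * se)
ci

# TRUE when the interval contains 0, i.e. rho = 0 is not rejected at the 0.01 level
ci[1] < 0 && ci[2] > 0
```

Checking whether the interval contains 0 is equivalent to the two-sided test at alpha = 1 - conf_level, which is why the interval and the p-value have to be read at the same level.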

Part 3 LINEAR ALGEBRA AND CORRELATION

Inverse correlation matrix

inv<-solve(mat)
inv

##             Distance    ArrDelay
## Distance 1.000019663 0.004434341
## ArrDelay 0.004434341 1.000019663

Multiply correlation matrix with precision matrix

matrix1 <-mat  %*% inv
matrix2<- inv %*% mat
matrix1

##               Distance ArrDelay
## Distance  1.000000e+00        0
## ArrDelay -8.673617e-19        1

matrix2

##               Distance ArrDelay
## Distance  1.000000e+00        0
## ArrDelay -8.673617e-19        1

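Analysis: Multiplying the correlation matrix by its inverse (the precision matrix) in either order gives the identity matrix, up to a floating-point error of about -8.67e-19 in one off-diagonal entry. The correlation matrix shows a negative correlation between distance and arrival delay, meaning that when distance increases, arrival delay tends to decrease and vice versa, but this correlation is very small (-0.004434254).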
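For a 2x2 correlation matrix with off-diagonal entry r, the inverse has the closed form 1/(1 - r^2) times the matrix with 1 on the diagonal and -r off the diagonal. The sketch below hard-codes r from the output above, compares that formula with solve(), and checks that the product is the identity up to floating-point error.

```r
r   <- -0.004434254
mat <- matrix(c(1, r, r, 1), nrow = 2,
              dimnames = list(c("Distance", "ArrDelay"), c("Distance", "ArrDelay")))

inv_solve  <- solve(mat)                                     # numerical inverse
inv_closed <- matrix(c(1, -r, -r, 1), nrow = 2) / (1 - r^2)  # closed-form inverse

# Compare values only (the dimnames differ), then check the product against the identity
all.equal(unname(inv_solve), inv_closed)
all.equal(unname(mat %*% inv_solve), diag(2))
```

Using all.equal() instead of == matters here, because entries such as -8.673617e-19 are not exactly 0 even though the product is the identity for all practical purposes.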

Part 4. CALCULUS BASED PROBABILITY AND STATISTICS

For a variable skewed to the right, shift it so that the minimum value is above 0. Arrival Delay is skewed right.

library(MASS)  # provides fitdistr()
# For variable skewed to right, shift so that the min value is above 0.
# Arr Delay is skewed right
min_aDel <- min(aDel,na.rm = TRUE)
# The min value is -70, so the shift is 71 and the smallest shifted value is 1
aDel_new<-na.omit(aDel - min_aDel + 1)

# Fit exponential probability function and calculate lambda
expdist<-fitdistr(aDel_new,"exponential")
l<-expdist$estimate
# Draw 1000 samples from the fitted exponential distribution
samp<-rexp(1000, l)

#Plot Histogram
hist(samp,xlab= "Arrival Delay (shifted)", main = "Simulated Exponential Data")

hist(aDel,xlab= "Arrival Delay", main = "Arrival Delay Data")

Analysis: Comparing the histograms, the simulated data are positively skewed like the original data, but the values drawn from the fitted exponential distribution are more spread out.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(0.05,rate = l) #  4.005716

## [1] 4.005716

qexp(0.95,rate = l) # 233.9497

## [1] 233.9497
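qexp() inverts the exponential CDF F(q) = 1 - exp(-lambda * q), which gives q_p = -log(1 - p) / lambda. The sketch below checks that identity for a rate of 0.012805, an assumed approximation of the fitted lambda (the exact estimate is stored in l), so the results match the output above only to a few digits.

```r
rate <- 0.012805        # assumed approximation of the fitted lambda
p    <- c(0.05, 0.95)

# Solve F(q) = 1 - exp(-rate * q) = p for q
q_manual  <- -log(1 - p) / rate
q_builtin <- qexp(p, rate = rate)

q_manual
all.equal(q_manual, q_builtin)

# Undo the shift of 71 to put the model percentiles on the original ArrDelay scale
q_manual - 71
```

Subtracting the shift is needed before the model percentiles can be compared with percentiles of the unshifted arrival delays.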

# 95% confidence interval for lambda, using the normal approximation.
# expdist$sd is already the standard error of the estimate, so it is not divided by sqrt(n) again.
error<-qnorm(0.975)*expdist$sd
left<- l-error  # approx. 0.012752
right<-l+error  # approx. 0.012858

#5th percentile and 95th percentile of the data
quantile(aDel, c(.05, .95),na.rm = TRUE)

##  5% 95% 
## -18  57


Dhananjay Kumar

August 11, 2016

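Analysis: There is a considerable difference between the percentiles of the fitted exponential model (4.005716 and 233.9497 on the shifted scale, i.e. about -67 and 163 after subtracting the shift of 71) and the percentiles of the empirical data (-18 and 57), so the exponential distribution is a poor fit for arrival delay.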