Normal Distribution Codes

Lets say X is normally distributed with a known mean= 75 and sigma= 5

Lets find the P(X<=70)

Use pnorm(q,mean,sigma)

pnorm(q=70,mean=75,sd=5,lower.tail = TRUE)

## [1] 0.1586553

pnorm(70,75,5)

## [1] 0.1586553

Lets find the P(X>=85)

pnorm(q=85,mean=75,sd=5,lower.tail = FALSE)

## [1] 0.02275013

pnorm(85,75,5,lower.tail = FALSE)

## [1] 0.02275013

qnorm function can be use to calculate Quantiles or Percentiles
Lets find the first quartile(Q1) for this variable

qnorm(p=0.25,mean=75,sd=5,lower.tail = TRUE)

## [1] 71.62755

qnorm(0.25,75,5,lower.tail = TRUE)

## [1] 71.62755

Find or plot the probabilty density function

Use seq(from,to, by) function to create a sequence
Use dnorm to create density and plot(x, density)
To have the points attached in plot use plot(x,density,type=“l”)
You can add labels and titles by :
plot(x,density,type=“l”, main=“X~Normal: Mean= 75, sd=5”,xlab=“x”,ylab = “Probability Density”)
You can add a vertical line to show the mean by using: abline(v=mean)

x<-seq(from=55, to=95, by=0.25)
x

##   [1] 55.00 55.25 55.50 55.75 56.00 56.25 56.50 56.75 57.00 57.25 57.50 57.75
##  [13] 58.00 58.25 58.50 58.75 59.00 59.25 59.50 59.75 60.00 60.25 60.50 60.75
##  [25] 61.00 61.25 61.50 61.75 62.00 62.25 62.50 62.75 63.00 63.25 63.50 63.75
##  [37] 64.00 64.25 64.50 64.75 65.00 65.25 65.50 65.75 66.00 66.25 66.50 66.75
##  [49] 67.00 67.25 67.50 67.75 68.00 68.25 68.50 68.75 69.00 69.25 69.50 69.75
##  [61] 70.00 70.25 70.50 70.75 71.00 71.25 71.50 71.75 72.00 72.25 72.50 72.75
##  [73] 73.00 73.25 73.50 73.75 74.00 74.25 74.50 74.75 75.00 75.25 75.50 75.75
##  [85] 76.00 76.25 76.50 76.75 77.00 77.25 77.50 77.75 78.00 78.25 78.50 78.75
##  [97] 79.00 79.25 79.50 79.75 80.00 80.25 80.50 80.75 81.00 81.25 81.50 81.75
## [109] 82.00 82.25 82.50 82.75 83.00 83.25 83.50 83.75 84.00 84.25 84.50 84.75
## [121] 85.00 85.25 85.50 85.75 86.00 86.25 86.50 86.75 87.00 87.25 87.50 87.75
## [133] 88.00 88.25 88.50 88.75 89.00 89.25 89.50 89.75 90.00 90.25 90.50 90.75
## [145] 91.00 91.25 91.50 91.75 92.00 92.25 92.50 92.75 93.00 93.25 93.50 93.75
## [157] 94.00 94.25 94.50 94.75 95.00

density<-dnorm(x,mean=75,sd=5)

plot(x,density)

plot(x,density,type="l")

plot(x,density,type="l", main="X~Normal: Mean= 75, sd=5",xlab="x",ylab = "Probability Density")

abline(v=75)

To draw a random sample from a normally distributed population
Use rnorm(siz, mean,sd)

rnorm(n=40,75,5)

##  [1] 76.20173 68.23328 77.76532 76.12049 72.56559 68.57126 85.41460 72.15540
##  [9] 71.66060 76.80773 78.39429 75.23548 76.04482 85.11366 70.06132 75.98531
## [17] 76.31632 52.34074 76.70232 69.28436 79.61299 74.69519 78.98842 74.55272
## [25] 81.14561 72.69119 81.61999 76.47443 80.40794 76.58560 73.55685 71.08607
## [33] 64.97134 76.54887 73.81189 74.13570 64.98633 77.04111 75.23523 81.80966

Z Standard Normal Codes

Lets find the P(Z>=1)

pnorm(q=1,mean=0,sd=1,lower.tail = FALSE)

## [1] 0.1586553

pnorm(1,0,1,lower.tail = FALSE)

## [1] 0.1586553

Binomial Distribution Codes

Lets say X is Binomially distributed with n=20 trials and p=1/6 probability of success

X~Bin(n=20,p=1/6)
Lets find the probability X is exactly equal to 3 X=3;

P(X=3)

Use dbinom(x,size,prob) P(X=3)

dbinom(x=3,size=20,prob = 1/6)

## [1] 0.2378866

dbinom(3,20,1/6)

## [1] 0.2378866

Use dbinom(x,size,prob) to find P(X=1) & P(X=2) & P(X=3)

dbinom(x=0:3,size=20,prob=1/6)

## [1] 0.02608405 0.10433621 0.19823881 0.23788657

dbinom(0:3,20,1/6)

## [1] 0.02608405 0.10433621 0.19823881 0.23788657

Use dbinom(x,size,prob) to find P(X<=3)

sum(dbinom(x=0:3,size=20,prob=1/6))

## [1] 0.5665456

Use pbinom to find the probability distribution function
Lets find P(X<=3) by using pbinom(q,size, probability)

pbinom(q=3,size=20,prob = 1/6)

## [1] 0.5665456

pbinom(3,20,1/6)

## [1] 0.5665456

-Use rbinom to take a random sample from a binomial distribution

rbinom(n=1,size=10,prob=1/2)

## [1] 6

rbinom(n=10,size=10,prob = 1/2)

##  [1] 5 5 4 5 1 6 6 3 5 4

rbinom(10,10,0.5)

##  [1] 5 3 6 5 5 4 7 4 6 6

-Use qbinom to find the quantiles for a binomial distribution

qbinom(p=0.25,10,prob=1/2,lower.tail = TRUE)

## [1] 4

qbinom(p=0.25,100,prob=1/2,lower.tail = TRUE)

## [1] 47

T-distribution, finding probabilities and Percenties for T-distribution

t-stat=2.3, df=25 find one sided p value P(t>2.3)
One sided p-value

pt(q=2.3,df=25, lower.tail=FALSE)

## [1] 0.01503675

pt(2.3,25,lower.tail = FALSE)

## [1] 0.01503675

two sided p-value

pt(q=2.3,df=25, lower.tail=FALSE)+ pt(q=-2.3,df=25, lower.tail=TRUE)

## [1] 0.03007351

pt(q=2.3,df=25, lower.tail=FALSE)*2

## [1] 0.03007351

Constructing confidence interval
find t for 95% confidence interval df=25, value of t with 2.5% in each tail

qt(p=0.025,df=25,lower.tail = TRUE)

## [1] -2.059539

Deciding the distribution of the data (Nomal or Not)

Visualization of the Dataset

Viewing dataset

View(mtcars)

## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/Resources/
## modules/R_de.so'' had status 1

Short example from the dataset

head(mtcars)

Use summary and see if median and mean close to each other

summary(mtcars$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

Install (install.packages(psych)) and use describe function, which gives you; mean sd se skew kutosis

#install.packages("psych")
library(psych)

describe(mtcars)

Use boxplot(Data) see if there is a skewness or not

boxplot(mtcars$mpg,horizontal = TRUE)

Use qqnorm and qqline see if the line is about 45 degree

qqnorm(mtcars$mpg)
qqline(mtcars$mpg,col=3)

Use histopgram and see if there is a skewness or not

hist(mtcars$mpg,breaks = 4)

When σ is unknown

Use t distribution to estimate population mean and standard deviation
if the t value is 2.7 and n=10, use pt(quantile, df) to find the area under t value

pt(2.7,9)

## [1] 0.9878032

Margin of Error

m= t* s/sqrt(n) or m= z*sigma/sqrt(n)

Standard Error of the Mean

SEM=s/sqrt(n)

Simple Linear Regression

Liner Relation Visualization

Plot the data. Does the trend in lean over time appear to be linear?

par(mfrow=c(2,2))
  
plot(mtcars$hp,mtcars$mpg)

model=lm(mtcars$mpg~mtcars$cyl+mtcars$disp+mtcars$hp)

abline(model,col=4)

## Warning in abline(model, col = 4): only using the first two of 4 regression
## coefficients

summary(model)

## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0889 -2.0845 -0.7745  1.3972  6.9183 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.18492    2.59078  13.195 1.54e-13 ***
## mtcars$cyl  -1.22742    0.79728  -1.540   0.1349    
## mtcars$disp -0.01884    0.01040  -1.811   0.0809 .  
## mtcars$hp   -0.01468    0.01465  -1.002   0.3250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.055 on 28 degrees of freedom
## Multiple R-squared:  0.7679, Adjusted R-squared:  0.743 
## F-statistic: 30.88 on 3 and 28 DF,  p-value: 5.054e-09

plot(model$residuals)

abline(0,0)

hist(model$residuals)

qqnorm(model$residuals)

qqline(model$residuals)

Multiple Linear Regression

Anova table for the model(above)

anova(model)

See the correlation between variables and round the values into two digits
if the correlation between variables are big, this may cause the multicollinearity

round(cor(mtcars),2)

##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

Testing multicollinearity in the model VIF(Variance inflation Factor)
If the value is bigger than 10 we can say there might be a multicollinearity(linear dependency). In this case cyl can be eliminated

VIF=diag(solve(cor(mtcars)))
VIF

##       mpg       cyl      disp        hp      drat        wt      qsec        vs 
##  7.634507 15.382159 22.194360 10.287988  3.411846 17.942416  7.980373  4.971265 
##        am      gear      carb 
##  4.980880  5.406599  7.930553

Finding the realtionships that cause Variance Inflation

eigen.stuff=eigen(cor(mtcars))

eigen.stuff$values

##  [1] 6.60840025 2.65046789 0.62719727 0.26959744 0.22345110 0.21159612
##  [7] 0.13526199 0.12290143 0.07704665 0.05203544 0.02204441

Small eigen values indicate linear dependence between variables

#eigen.stuff$vectors

round(eigen.stuff$vectors,2)

##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
##  [1,]  0.36 -0.02 -0.23 -0.02 -0.10 -0.11  0.37  0.75  0.24  0.14 -0.12
##  [2,] -0.37 -0.04 -0.18  0.00 -0.06  0.17  0.06  0.23  0.05 -0.85 -0.14
##  [3,] -0.37  0.05 -0.06  0.26 -0.39 -0.34  0.21  0.00  0.20  0.05  0.66
##  [4,] -0.33 -0.25  0.14 -0.07 -0.54  0.07  0.00  0.22 -0.58  0.25 -0.26
##  [5,]  0.29 -0.27  0.16  0.85 -0.08  0.24  0.02 -0.03 -0.05 -0.10 -0.04
##  [6,] -0.35  0.14  0.34  0.25  0.08 -0.46 -0.02  0.01  0.36  0.09 -0.57
##  [7,]  0.20  0.46  0.40  0.07  0.16 -0.33  0.05  0.23 -0.53 -0.27  0.18
##  [8,]  0.31  0.23  0.43 -0.21 -0.60  0.19 -0.27 -0.03  0.36 -0.16  0.01
##  [9,]  0.23 -0.43 -0.21 -0.03 -0.09 -0.57 -0.59  0.06 -0.05 -0.18  0.03
## [10,]  0.21 -0.46  0.29 -0.26 -0.05 -0.24  0.61 -0.34  0.00 -0.21 -0.05
## [11,] -0.21 -0.41  0.53 -0.13  0.36  0.18 -0.17  0.40  0.17  0.07  0.32

Two-Sample t test

We use two sample t test to compare two different sample and means of the samples

-H o: μ1=μ2 Ha= μ1 not equal μ2 (Here we are trying to test different samples’ mean)

Code example(dataset does not show up, comparing two preference)

# t.test(df_primed$Preference,df_nonprimed$Preference)

According to two sample t- test our t = 3.6322 and p-value = 0.0007998. Since p value is very small we reject null hypothesis. Hence, these two groups do not show the same preference for this product.

Manual two sample t test for mean difference

Is there a significant evidence at the 5% level that conductors consume more calories per day than do drivers? Use the two-sample t method to give a P-value, and then assess significance.
Ho: μdrivers=μ_conductor Ha: μ_drivers<μconductor
Example code(below)
t= (x_conductor - x_drivers)/SE where SE= sqrt(((s_conductor)^2)/83)+ (s_drivers)^2)/98)
x_conductor= 2844 x_drivers= 2821 s_conductor=437.3008 s_driver= 435.5778 SE= 65.11528
Hence, t value(df=82) is 0.3532197 and corresponding p value is 0.3624151. Since p value is bigger 0.05 we can not reject null hypothesis.

SE_cal=sqrt(((437.3008^2)/83)+ ((435.5778)^2)/98)
SE_cal

## [1] 65.11528

t_col= (2844-2821)/ SE_cal
t_col

## [1] 0.3532197

pt(0.3532197,82,lower.tail = FALSE)

## [1] 0.3624151

Confidence Interval for two sample mean difference

-Give a 99% confidence interval for the difference in mean daily alcohol consumption between drivers and conductors.

x_driver= 0.24 x_conductor= 0.39 where s_driver= 0.5939697 s_conductor= 1.002148

Confidence Interval= (x_driver-x_conductor)+- (t*SE)

where x_driver-x_conductor= -.15 t= 2.637123 , SE= 0.1252997

Hence, CI (-0.4804306 0.1804306)

qt(0.005,df=82)

## [1] -2.637123

SE_alc=sqrt(((1.002148^2)/83)+ ((0.5939697)^2)/98)
SE_alc

## [1] 0.1252997

UpperCI_dif= -0.15+ 2.637123*SE_alc

UpperCI_dif

## [1] 0.1804306

LowerCI_dif= -0.15- 2.637123*SE_alc

LowerCI_dif

## [1] -0.4804306

Applied Stat with Applications in Finance-1