Normal Distribution Codes

Lets say X is normally distributed with a known mean= 75 and sigma= 5

  • Lets find the P(X<=70)

Use pnorm(q,mean,sigma)

pnorm(q=70,mean=75,sd=5,lower.tail = TRUE)
## [1] 0.1586553
pnorm(70,75,5)
## [1] 0.1586553
  • Lets find the P(X>=85)
pnorm(q=85,mean=75,sd=5,lower.tail = FALSE)
## [1] 0.02275013
pnorm(85,75,5,lower.tail = FALSE)
## [1] 0.02275013
  • qnorm function can be use to calculate Quantiles or Percentiles

  • Lets find the first quartile(Q1) for this variable

qnorm(p=0.25,mean=75,sd=5,lower.tail = TRUE)
## [1] 71.62755
qnorm(0.25,75,5,lower.tail = TRUE)
## [1] 71.62755

Find or plot the probabilty density function

x<-seq(from=55, to=95, by=0.25)
x
##   [1] 55.00 55.25 55.50 55.75 56.00 56.25 56.50 56.75 57.00 57.25 57.50 57.75
##  [13] 58.00 58.25 58.50 58.75 59.00 59.25 59.50 59.75 60.00 60.25 60.50 60.75
##  [25] 61.00 61.25 61.50 61.75 62.00 62.25 62.50 62.75 63.00 63.25 63.50 63.75
##  [37] 64.00 64.25 64.50 64.75 65.00 65.25 65.50 65.75 66.00 66.25 66.50 66.75
##  [49] 67.00 67.25 67.50 67.75 68.00 68.25 68.50 68.75 69.00 69.25 69.50 69.75
##  [61] 70.00 70.25 70.50 70.75 71.00 71.25 71.50 71.75 72.00 72.25 72.50 72.75
##  [73] 73.00 73.25 73.50 73.75 74.00 74.25 74.50 74.75 75.00 75.25 75.50 75.75
##  [85] 76.00 76.25 76.50 76.75 77.00 77.25 77.50 77.75 78.00 78.25 78.50 78.75
##  [97] 79.00 79.25 79.50 79.75 80.00 80.25 80.50 80.75 81.00 81.25 81.50 81.75
## [109] 82.00 82.25 82.50 82.75 83.00 83.25 83.50 83.75 84.00 84.25 84.50 84.75
## [121] 85.00 85.25 85.50 85.75 86.00 86.25 86.50 86.75 87.00 87.25 87.50 87.75
## [133] 88.00 88.25 88.50 88.75 89.00 89.25 89.50 89.75 90.00 90.25 90.50 90.75
## [145] 91.00 91.25 91.50 91.75 92.00 92.25 92.50 92.75 93.00 93.25 93.50 93.75
## [157] 94.00 94.25 94.50 94.75 95.00
density<-dnorm(x,mean=75,sd=5)

plot(x,density)

plot(x,density,type="l")

plot(x,density,type="l", main="X~Normal: Mean= 75, sd=5",xlab="x",ylab = "Probability Density")

abline(v=75)

rnorm(n=40,75,5)
##  [1] 76.20173 68.23328 77.76532 76.12049 72.56559 68.57126 85.41460 72.15540
##  [9] 71.66060 76.80773 78.39429 75.23548 76.04482 85.11366 70.06132 75.98531
## [17] 76.31632 52.34074 76.70232 69.28436 79.61299 74.69519 78.98842 74.55272
## [25] 81.14561 72.69119 81.61999 76.47443 80.40794 76.58560 73.55685 71.08607
## [33] 64.97134 76.54887 73.81189 74.13570 64.98633 77.04111 75.23523 81.80966

Z Standard Normal Codes

  • Lets find the P(Z>=1)
pnorm(q=1,mean=0,sd=1,lower.tail = FALSE)
## [1] 0.1586553
pnorm(1,0,1,lower.tail = FALSE)
## [1] 0.1586553

Binomial Distribution Codes

Lets say X is Binomially distributed with n=20 trials and p=1/6 probability of success

  • X~Bin(n=20,p=1/6)

  • Lets find the probability X is exactly equal to 3 X=3;

P(X=3)

  • Use dbinom(x,size,prob) P(X=3)
dbinom(x=3,size=20,prob = 1/6)
## [1] 0.2378866
dbinom(3,20,1/6)
## [1] 0.2378866
  • Use dbinom(x,size,prob) to find P(X=1) & P(X=2) & P(X=3)
dbinom(x=0:3,size=20,prob=1/6)
## [1] 0.02608405 0.10433621 0.19823881 0.23788657
dbinom(0:3,20,1/6)
## [1] 0.02608405 0.10433621 0.19823881 0.23788657
  • Use dbinom(x,size,prob) to find P(X<=3)
sum(dbinom(x=0:3,size=20,prob=1/6))
## [1] 0.5665456
  • Use pbinom to find the probability distribution function

  • Lets find P(X<=3) by using pbinom(q,size, probability)

pbinom(q=3,size=20,prob = 1/6)
## [1] 0.5665456
pbinom(3,20,1/6)
## [1] 0.5665456

-Use rbinom to take a random sample from a binomial distribution

rbinom(n=1,size=10,prob=1/2)
## [1] 6
rbinom(n=10,size=10,prob = 1/2)
##  [1] 5 5 4 5 1 6 6 3 5 4
rbinom(10,10,0.5)
##  [1] 5 3 6 5 5 4 7 4 6 6

-Use qbinom to find the quantiles for a binomial distribution

qbinom(p=0.25,10,prob=1/2,lower.tail = TRUE)
## [1] 4
qbinom(p=0.25,100,prob=1/2,lower.tail = TRUE)
## [1] 47

T-distribution, finding probabilities and Percenties for T-distribution

pt(q=2.3,df=25, lower.tail=FALSE)
## [1] 0.01503675
pt(2.3,25,lower.tail = FALSE)
## [1] 0.01503675
pt(q=2.3,df=25, lower.tail=FALSE)+ pt(q=-2.3,df=25, lower.tail=TRUE)
## [1] 0.03007351
pt(q=2.3,df=25, lower.tail=FALSE)*2
## [1] 0.03007351
qt(p=0.025,df=25,lower.tail = TRUE)
## [1] -2.059539

Deciding the distribution of the data (Nomal or Not)

Visualization of the Dataset

  • Viewing dataset
View(mtcars)
## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/Resources/
## modules/R_de.so'' had status 1
  • Short example from the dataset
head(mtcars)
  • Use summary and see if median and mean close to each other
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
  • Install (install.packages(psych)) and use describe function, which gives you; mean sd se skew kutosis
#install.packages("psych")
library(psych)

describe(mtcars)
  • Use boxplot(Data) see if there is a skewness or not
boxplot(mtcars$mpg,horizontal = TRUE)

  • Use qqnorm and qqline see if the line is about 45 degree
qqnorm(mtcars$mpg)
qqline(mtcars$mpg,col=3)

  • Use histopgram and see if there is a skewness or not
hist(mtcars$mpg,breaks = 4)

When σ is unknown

pt(2.7,9)
## [1] 0.9878032

m= t* s/sqrt(n) or m= z*sigma/sqrt(n)

SEM=s/sqrt(n)

Simple Linear Regression

Liner Relation Visualization

Plot the data. Does the trend in lean over time appear to be linear?

par(mfrow=c(2,2))
  
plot(mtcars$hp,mtcars$mpg)

model=lm(mtcars$mpg~mtcars$cyl+mtcars$disp+mtcars$hp)

abline(model,col=4)
## Warning in abline(model, col = 4): only using the first two of 4 regression
## coefficients
summary(model)
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$hp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0889 -2.0845 -0.7745  1.3972  6.9183 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.18492    2.59078  13.195 1.54e-13 ***
## mtcars$cyl  -1.22742    0.79728  -1.540   0.1349    
## mtcars$disp -0.01884    0.01040  -1.811   0.0809 .  
## mtcars$hp   -0.01468    0.01465  -1.002   0.3250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.055 on 28 degrees of freedom
## Multiple R-squared:  0.7679, Adjusted R-squared:  0.743 
## F-statistic: 30.88 on 3 and 28 DF,  p-value: 5.054e-09
plot(model$residuals)

abline(0,0)

hist(model$residuals)

qqnorm(model$residuals)

qqline(model$residuals)

Multiple Linear Regression

Anova table for the model(above)

anova(model)
  • See the correlation between variables and round the values into two digits

  • if the correlation between variables are big, this may cause the multicollinearity

round(cor(mtcars),2)
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
  • Testing multicollinearity in the model VIF(Variance inflation Factor)
  • If the value is bigger than 10 we can say there might be a multicollinearity(linear dependency). In this case cyl can be eliminated
VIF=diag(solve(cor(mtcars)))
VIF
##       mpg       cyl      disp        hp      drat        wt      qsec        vs 
##  7.634507 15.382159 22.194360 10.287988  3.411846 17.942416  7.980373  4.971265 
##        am      gear      carb 
##  4.980880  5.406599  7.930553
  • Finding the realtionships that cause Variance Inflation
eigen.stuff=eigen(cor(mtcars))

eigen.stuff$values
##  [1] 6.60840025 2.65046789 0.62719727 0.26959744 0.22345110 0.21159612
##  [7] 0.13526199 0.12290143 0.07704665 0.05203544 0.02204441
  • Small eigen values indicate linear dependence between variables
#eigen.stuff$vectors

round(eigen.stuff$vectors,2)
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
##  [1,]  0.36 -0.02 -0.23 -0.02 -0.10 -0.11  0.37  0.75  0.24  0.14 -0.12
##  [2,] -0.37 -0.04 -0.18  0.00 -0.06  0.17  0.06  0.23  0.05 -0.85 -0.14
##  [3,] -0.37  0.05 -0.06  0.26 -0.39 -0.34  0.21  0.00  0.20  0.05  0.66
##  [4,] -0.33 -0.25  0.14 -0.07 -0.54  0.07  0.00  0.22 -0.58  0.25 -0.26
##  [5,]  0.29 -0.27  0.16  0.85 -0.08  0.24  0.02 -0.03 -0.05 -0.10 -0.04
##  [6,] -0.35  0.14  0.34  0.25  0.08 -0.46 -0.02  0.01  0.36  0.09 -0.57
##  [7,]  0.20  0.46  0.40  0.07  0.16 -0.33  0.05  0.23 -0.53 -0.27  0.18
##  [8,]  0.31  0.23  0.43 -0.21 -0.60  0.19 -0.27 -0.03  0.36 -0.16  0.01
##  [9,]  0.23 -0.43 -0.21 -0.03 -0.09 -0.57 -0.59  0.06 -0.05 -0.18  0.03
## [10,]  0.21 -0.46  0.29 -0.26 -0.05 -0.24  0.61 -0.34  0.00 -0.21 -0.05
## [11,] -0.21 -0.41  0.53 -0.13  0.36  0.18 -0.17  0.40  0.17  0.07  0.32

Two-Sample t test

-H o: μ1=μ2 Ha= μ1 not equal μ2 (Here we are trying to test different samples’ mean)

# t.test(df_primed$Preference,df_nonprimed$Preference)

Manual two sample t test for mean difference

  • Is there a significant evidence at the 5% level that conductors consume more calories per day than do drivers? Use the two-sample t method to give a P-value, and then assess significance.

  • Ho: μdrivers=μ_conductor Ha: μ_drivers<μconductor

  • Example code(below)

  • t= (x_conductor - x_drivers)/SE where SE= sqrt(((s_conductor)^2)/83)+ (s_drivers)^2)/98)

  • x_conductor= 2844 x_drivers= 2821 s_conductor=437.3008 s_driver= 435.5778 SE= 65.11528

  • Hence, t value(df=82) is 0.3532197 and corresponding p value is 0.3624151. Since p value is bigger 0.05 we can not reject null hypothesis.

SE_cal=sqrt(((437.3008^2)/83)+ ((435.5778)^2)/98)
SE_cal
## [1] 65.11528
t_col= (2844-2821)/ SE_cal
t_col
## [1] 0.3532197
pt(0.3532197,82,lower.tail = FALSE)
## [1] 0.3624151

Confidence Interval for two sample mean difference

-Give a 99% confidence interval for the difference in mean daily alcohol consumption between drivers and conductors.

x_driver= 0.24 x_conductor= 0.39 where s_driver= 0.5939697 s_conductor= 1.002148

Confidence Interval= (x_driver-x_conductor)+- (t*SE)

where x_driver-x_conductor= -.15 t= 2.637123 , SE= 0.1252997

Hence, CI (-0.4804306 0.1804306)

qt(0.005,df=82)
## [1] -2.637123
SE_alc=sqrt(((1.002148^2)/83)+ ((0.5939697)^2)/98)
SE_alc
## [1] 0.1252997
UpperCI_dif= -0.15+ 2.637123*SE_alc

UpperCI_dif
## [1] 0.1804306
LowerCI_dif= -0.15- 2.637123*SE_alc

LowerCI_dif
## [1] -0.4804306