Data File1: LungCapacityDataSet
Data File2: BloodPressure

Loading Data

LungCapData2 <- read.csv(file.choose(), stringsAsFactors = TRUE)
# and attach the data
attach(LungCapData2)
# ask for a summary of the data
summary(LungCapData2)
##     LungCap            Age            Height      Smoke        Gender   
##  Min.   : 0.507   Min.   : 3.00   Min.   :45.30   no :648   female:358  
##  1st Qu.: 6.150   1st Qu.: 9.00   1st Qu.:59.90   yes: 77   male  :367  
##  Median : 8.000   Median :13.00   Median :65.40                         
##  Mean   : 7.863   Mean   :12.33   Mean   :64.84                         
##  3rd Qu.: 9.800   3rd Qu.:15.00   3rd Qu.:70.30                         
##  Max.   :14.675   Max.   :19.00   Max.   :81.80                         
##  Caesarean
##  no :561  
##  yes:164  
##           
##           
##           
## 
str(LungCapData2)
## 'data.frame':    725 obs. of  6 variables:
##  $ LungCap  : num  6.47 10.12 9.55 11.12 4.8 ...
##  $ Age      : int  6 18 16 14 5 11 8 11 15 11 ...
##  $ Height   : num  62.1 74.7 69.7 71 56.9 58.7 63.3 70.4 70.5 59.2 ...
##  $ Smoke    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Gender   : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 2 2 2 2 ...
##  $ Caesarean: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...

1: Checking normality using a histogram and a box plot

We are going to examine the LungCap variable. First we check the assumption of normality by plotting a histogram and a box plot.

hist(LungCap, col = 2)

Looking at the histogram, the data appear slightly negatively skewed.

boxplot(LungCap, horizontal = T, col = 3)

Looking at the box plot, the data seem to be approximately normal.

2. Checking normality by Quantile-Quantile (QQ) plot

qqnorm(LungCap, main='Normal')
qqline(LungCap)

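If the variable were exactly normal, the points would fall on the straight reference line. Here the points depart slightly from the line, so the variable looks slightly away from normality.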

3. Checking normality by Shapiro-Wilk test

shapiro.test(LungCap)
## 
##  Shapiro-Wilk normality test
## 
## data:  LungCap
## W = 0.99305, p-value = 0.001886

H0: LungCap is normally distributed
H1: LungCap is not normal

The p-value is 0.001886, which is smaller than 0.05, so we reject H0 and conclude that the data are not normal.

4. Checking normality by Kolmogorov-Smirnov test

ks.test(LungCap,"pnorm")
## Warning in ks.test.default(LungCap, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  LungCap
## D = 0.96575, p-value < 2.2e-16
## alternative hypothesis: two-sided

H0: LungCap is normally distributed
H1: LungCap is not normal

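The p-value is smaller than 2.2e-16, so we reject H0 and conclude that the data are not normal. Note that ks.test(LungCap, "pnorm") compares the data with the standard normal distribution (mean 0, sd 1), so the very large D mainly reflects the difference in mean and spread. The warning also shows that the data contain ties.

However, the sample size is large (n = 725), so we can still use the t-test.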
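Called with no further arguments, ks.test(x, "pnorm") uses the standard normal (mean 0, sd 1) as the reference distribution. The sketch below shows how the statistic changes when a mean and sd are supplied. It uses simulated data (x is a made-up stand-in, not the LungCap variable), and because the mean and sd are estimated from the same sample, the second p-value is only approximate.

```r
set.seed(1)
# simulated stand-in with mean near 8 (an assumption, not the real data)
x <- rnorm(200, mean = 8, sd = 2.5)

# reference distribution: standard normal N(0, 1)
ks_std <- ks.test(x, "pnorm")

# reference distribution: normal with mean and sd estimated from x
ks_fit <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

c(standard = unname(ks_std$statistic), fitted = unname(ks_fit$statistic))
```

With the standard normal as reference, D is close to 1 simply because the data are centred far from 0.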

5. One sample t-test (Left tailed)

The one-sample t-test is a parametric method for examining a single numeric variable.
A numeric variable is also referred to as a scaled variable.
help("t.test")

H0: mu = 8
H1: mu < 8

t.test(LungCap, mu=8, alternative="less", conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  LungCap
## t = -1.3842, df = 724, p-value = 0.08336
## alternative hypothesis: true mean is less than 8
## 95 percent confidence interval:
##      -Inf 8.025974
## sample estimates:
## mean of x 
##  7.863148

The p-value is 0.0833, which is larger than 0.05, so we fail to reject H0: the data do not provide sufficient evidence to conclude that the population mean is less than 8.

6. One sample t-test (Two tailed)

H0: mu = 8
H1: mu != 8

t.test(LungCap, mu=8, alternative="two.sided", conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  LungCap
## t = -1.3842, df = 724, p-value = 0.1667
## alternative hypothesis: true mean is not equal to 8
## 95 percent confidence interval:
##  7.669052 8.057243
## sample estimates:
## mean of x 
##  7.863148

The p-value is 0.1667, which is larger than 0.05, so we fail to reject H0: the data do not provide sufficient evidence to conclude that the population mean differs from 8.
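As a check on what t.test computes, the statistic and both p-values can be reproduced by hand from the sample mean, sd and size. This is a minimal sketch on simulated data (x and mu0 are illustrative names, not the LungCap data).

```r
set.seed(42)
x   <- rnorm(50, mean = 7.8, sd = 2.6)  # simulated sample (assumption)
mu0 <- 8                                # hypothesised mean
n   <- length(x)

# t = (sample mean - mu0) / standard error
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# left-tailed and two-tailed p-values from the t distribution with n - 1 df
p_less <- pt(t_stat, df = n - 1)
p_two  <- 2 * pt(-abs(t_stat), df = n - 1)

res_less <- t.test(x, mu = mu0, alternative = "less")
res_two  <- t.test(x, mu = mu0)

c(by_hand = p_less, t.test = res_less$p.value)
c(by_hand = p_two,  t.test = res_two$p.value)
```

When t is negative, the left-tailed p-value is half the two-tailed one, which matches the outputs above (0.08336 and 0.1667).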

The two-sided test is the default in R, therefore we can write

t.test(LungCap, mu=8, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  LungCap
## t = -1.3842, df = 724, p-value = 0.1667
## alternative hypothesis: true mean is not equal to 8
## 95 percent confidence interval:
##  7.669052 8.057243
## sample estimates:
## mean of x 
##  7.863148

we can store the result in a variable

tst <- t.test(LungCap, mu=8, conf.level = 0.95)
attributes(tst)
## $names
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  
## 
## $class
## [1] "htest"

If we want only the p-value, then

tst$p.value
## [1] 0.1667108

7. Two Independent sample t-test (Pooled)

H0: mean lungcap of smokers = mean lung cap non smokers
H1: they are unequal

By default t.test uses a two-sided alternative and assumes unequal population variances. Here we set var.equal = TRUE to get the pooled test.

t.test(LungCap ~ Smoke, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  LungCap by Smoke
## t = -2.7399, df = 723, p-value = 0.006297
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -1.5024262 -0.2481063
## sample estimates:
##  mean in group no mean in group yes 
##          7.770188          8.645455

The p-value = 0.006297 is much smaller than 0.05, therefore the mean LungCap of smokers and non-smokers differs significantly.

Is it justified to use the pooled t-test without knowing whether the variances are equal?
Ans: No, first check the equality of variances.

This test examines the relationship between a scaled variable and a factor with two levels: we want to compare the LungCap of two groups, smokers and non-smokers. There are two versions of this test. One is called the pooled t-test (equal variances); the other is known as the non-pooled t-test (unequal variances).

8. How to decide pooled or non-pooled

There are several ways to check the equality of variances:
a. the box plots show a similar spread in both groups
b. the ratio of the two sample variances is less than two
c. Levene's test has p-value > 0.05

8a: Box Plot: how to decide var.eq=T or F

boxplot(LungCap ~ Smoke, horizontal = T, col=c(2,3))

The IQR of the yes group is small and that of the no group is large, indicating unequal variances, which means non-pooled.

8b: Ratio of the two sample variances (>= 2 or not)

Keep the larger sample variance in the numerator.

var(LungCap[Smoke=="yes"])
## [1] 3.545292
var(LungCap[Smoke=="no"])
## [1] 7.431694
# ratio 
var(LungCap[Smoke=="no"])/var(LungCap[Smoke=="yes"])
## [1] 2.096215

If the ratio of the two sample variances is more than two, use the non-pooled test (unequal variances), otherwise the pooled test. Here the ratio is 2.096, so non-pooled.
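The rule can be wrapped in a small function so that the larger variance always ends up in the numerator. This is only a sketch: variance_ratio is a hypothetical helper name, and the toy data are made up for illustration.

```r
# hypothetical helper: ratio of the largest to the smallest group variance
variance_ratio <- function(y, g) {
  v <- tapply(y, g, var)   # sample variance within each group
  max(v) / min(v)          # larger variance in the numerator
}

# toy data (made up): group "b" is far more spread out than group "a"
y <- c(1, 2, 3, 10, 20, 30)
g <- factor(rep(c("a", "b"), each = 3))
variance_ratio(y, g)

# the same ratio from the two variances reported above
7.431694 / 3.545292
```

With the attached data the call would be variance_ratio(LungCap, Smoke).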

8c: Levene's test for equality of variances

Levene's test is available as leveneTest in the car package

H0: Popu variance are equal for smokers and non smokers
H1: Popu variance are un-equal for smokers and non smokers

library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
leveneTest(LungCap ~ Smoke)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   1  12.955 0.0003408 ***
##       723                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The very small p-value indicates that the variances are unequal, or we can say that the difference between the variances is significant.
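Levene's test with center = median amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The base-R sketch below illustrates this on simulated data (y and g are made-up stand-ins for LungCap and Smoke); it should give the same F and p-value as leveneTest(y ~ g), but treat that as something to verify rather than a guarantee.

```r
set.seed(7)
# simulated groups with different spread (assumption, not the real data)
y <- c(rnorm(60, mean = 8, sd = 1.5), rnorm(40, mean = 8, sd = 3))
g <- factor(rep(c("no", "yes"), times = c(60, 40)))

# absolute deviation of each value from its own group median
dev <- abs(y - ave(y, g, FUN = median))

# one-way ANOVA on the deviations: the F test here is the Levene-type test
lev <- anova(lm(dev ~ g))
lev
```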

9. Two Independent sample t-test (Non-Pooled)

H0: mean lungcap of smokers = mean lung cap non smokers
H1: they are unequal
The alternative is two-sided by default.
We assume the population variances are not equal.

t.test(LungCap ~ Smoke, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  LungCap by Smoke
## t = -3.6498, df = 117.72, p-value = 0.0003927
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -1.3501778 -0.4003548
## sample estimates:
##  mean in group no mean in group yes 
##          7.770188          8.645455

All of these arguments are the defaults, therefore the call is the same as
t.test(LungCap ~ Smoke)

The p-value = 0.0003927 is smaller than 0.05, indicating that the two population means are unequal: the average lung capacity of smokers is significantly larger than that of non-smokers.
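The difference between the pooled and the non-pooled test can be seen by recomputing both from the summary numbers reported above (group means, group variances, and the group sizes 648 and 77 from the data summary). This is a sketch of the textbook formulas, with variable names chosen here for illustration.

```r
# summary statistics copied from the outputs above
m_no <- 7.770188; m_yes <- 8.645455   # group means
v_no <- 7.431694; v_yes <- 3.545292   # group variances
n_no <- 648;      n_yes <- 77         # group sizes

# pooled: one common variance, df = n_no + n_yes - 2
v_pool   <- ((n_no - 1) * v_no + (n_yes - 1) * v_yes) / (n_no + n_yes - 2)
t_pooled <- (m_no - m_yes) / sqrt(v_pool * (1 / n_no + 1 / n_yes))

# Welch (non-pooled): separate variances, Welch-Satterthwaite df
se2     <- v_no / n_no + v_yes / n_yes
t_welch <- (m_no - m_yes) / sqrt(se2)
df_welch <- se2^2 /
  ((v_no / n_no)^2 / (n_no - 1) + (v_yes / n_yes)^2 / (n_yes - 1))

round(c(t_pooled = t_pooled, t_welch = t_welch, df_welch = df_welch), 4)
```

The results agree with the two outputs: t = -2.7399 for the pooled test, and t = -3.6498 on 117.72 df for the Welch test.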

10. Paired sample t-test

Also known as intervention analysis, pre-post analysis, or repeated measures

data file for paired t-test
file name: BloodPressure.txt

BP <- read.table(file.choose(), header = T, sep = "\t")
attach(BP)

help("t.test")
Making a box plot

boxplot(Before,After, col=c(3,2))

The box plot indicates that BP is lower After.

Now we make a scatter plot

plot(Before,After, col=c(2,3))
# add a reference line over this scatterplot
# intercept = 0, slope = 1
abline(a=0,b=1)

If there is no change in BP between Before and After, then
the points are equally scattered above and below this line.
If there is a decrease in BP After, then more points are below the line.
If there is an increase in BP After, then more points are above the line.

H0: Mean difference in BP is zero
H1: Mean difference in BP is not equal to zero

t.test(Before, After, mu=0, alt="two.sided", paired = T, conf.level = 0.99)
## 
##  Paired t-test
## 
## data:  Before and After
## t = 3.8882, df = 24, p-value = 0.0006986
## alternative hypothesis: true mean difference is not equal to 0
## 99 percent confidence interval:
##   2.245279 13.754721
## sample estimates:
## mean difference 
##               8

If the p-value is very small, then we reject H0 and the difference is significant. Here the p-value is 0.0006986 and the mean difference (Before - After) is positive, which means BP After is significantly lower than Before.

sample mean diff = +8

If we put After first and Before second, then
t.test(After,Before, mu=0, alt="two.sided", paired = T, conf.level = 0.99)

the mean diff is negative, again indicating that BP After is lower than Before: sample mean diff = -8
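A paired t-test is the same as a one-sample t-test on the differences, and swapping the two arguments only flips the sign. The sketch below checks both facts on simulated data (before and after are made-up vectors, not the BloodPressure file).

```r
set.seed(3)
# simulated pre/post measurements for 25 subjects (assumption)
before <- rnorm(25, mean = 130, sd = 10)
after  <- before - rnorm(25, mean = 8, sd = 10)

paired   <- t.test(before, after, paired = TRUE)
one_samp <- t.test(before - after, mu = 0)     # same test on the differences
reversed <- t.test(after, before, paired = TRUE)

c(paired   = unname(paired$statistic),
  one_samp = unname(one_samp$statistic),
  reversed = unname(reversed$statistic))
```

The first two statistics are identical and the third has the opposite sign; the p-value is the same in all three.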

11: One-Way Analysis of Variance (ANOVA)

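H0: Mean lung capacity is the same for all height categories
H1: At least 2 height categories have significantly different mean LungCap

Note that the call below uses Height, not LungCap, as the response, so the output compares mean Height across the height categories. To test the hypotheses as stated, the model would be aov(LungCap ~ CatHeight).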

CatHeight <- cut(Height, breaks=4, labels = c("A","B","C","D"))
ANOVA1 <-  aov(Height ~ CatHeight)
# to see simple output
ANOVA1
## Call:
##    aov(formula = Height ~ CatHeight)
## 
## Terms:
##                 CatHeight Residuals
## Sum of Squares   33149.56   4404.95
## Deg. of Freedom         3       721
## 
## Residual standard error: 2.471741
## Estimated effects may be unbalanced
# to see the extended output
summary(ANOVA1)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## CatHeight     3  33150   11050    1809 <2e-16 ***
## Residuals   721   4405       6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# to see all objects in ANOVA1 variable
attributes(ANOVA1)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "aov" "lm"

As the p-value of the ANOVA is very small, at least two group means differ significantly, and we have to apply multiple comparisons using post hoc analysis (TukeyHSD).
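The F value in the summary is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. A short sketch using the numbers printed in the simple output above (variable names are chosen here for illustration):

```r
# sums of squares and degrees of freedom copied from the aov output above
ss_group <- 33149.56; df_group <- 3
ss_resid <- 4404.95;  df_resid <- 721

ms_group <- ss_group / df_group   # mean square between groups
ms_resid <- ss_resid / df_resid   # mean square within groups (residual)

f_value <- ms_group / ms_resid
p_value <- pf(f_value, df_group, df_resid, lower.tail = FALSE)

c(F = f_value, residual_se = sqrt(ms_resid))
```

F comes out at about 1809 and the residual standard error at about 2.4717, as in the output.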

12: PostHoc Analysis by TukeyHSD

If H0 of anova is rejected then we conduct pairwise multiple comparisons

TukeyHSD(ANOVA1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Height ~ CatHeight)
## 
## $CatHeight
##          diff       lwr       upr p adj
## B-A  8.440986  7.535386  9.346587     0
## C-A 16.747614 15.862368 17.632861     0
## D-A 24.151184 23.138731 25.163636     0
## C-B  8.306628  7.761677  8.851579     0
## D-B 15.710197 14.976460 16.443935     0
## D-C  7.403569  6.695107  8.112032     0
# las = 1 means labels are horizontal
plot(TukeyHSD(ANOVA1),las=1, col=CatHeight)

13: Kruskal-Wallis test as an alternative to one-way ANOVA

It is a non-parametric equivalent of ANOVA

kruskal.test(Height ~ CatHeight)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Height by CatHeight
## Kruskal-Wallis chi-squared = 636.7, df = 3, p-value < 2.2e-16

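H0: Lung capacity is the same (in location) for all height categories
H1: At least 2 height categories have significantly different LungCap

The p-value of the Kruskal-Wallis test is very small, therefore at least two groups are significantly different. As with the aov call, the response used here is Height; to test the stated hypotheses the call would be kruskal.test(LungCap ~ CatHeight).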

14: Chi-Square test of independence in R

help(chisq.test)
Display the contingency table

contble <- table(Gender, Smoke)
barplot(contble, beside = T, legend=T, col = c(2,3))

H0: Gender and Smoke are independent
H1: Gender and Smoke are associated

chisq.test(contble, correct = T)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contble
## X-squared = 1.7443, df = 1, p-value = 0.1866
# store the output in a variable
chi <- chisq.test(contble, correct = T)
chi
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contble
## X-squared = 1.7443, df = 1, p-value = 0.1866
attributes(chi)
## $names
## [1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
## [7] "expected"  "residuals" "stdres"   
## 
## $class
## [1] "htest"
chi$observed
##         Smoke
## Gender    no yes
##   female 314  44
##   male   334  33
chi$expected
##         Smoke
## Gender         no      yes
##   female 319.9779 38.02207
##   male   328.0221 38.97793
chi$p.value
## [1] 0.1865893

The p-value of the chi-square test is 0.1866, which is larger than 0.05, therefore we fail to reject H0: there is no evidence that Gender and Smoke are associated.
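The expected counts and the test statistic can be reproduced by hand from the observed table: each expected count is row total times column total divided by the grand total, and Yates' correction subtracts 0.5 from each absolute difference. A sketch using the observed counts printed above (obs, expected, x2 and p are names chosen here):

```r
# observed counts copied from chi$observed above
obs <- matrix(c(314, 334, 44, 33), nrow = 2,
              dimnames = list(Gender = c("female", "male"),
                              Smoke  = c("no", "yes")))

# expected count = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# chi-square statistic with Yates' continuity correction, df = (2-1)*(2-1) = 1
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)
p  <- pchisq(x2, df = 1, lower.tail = FALSE)

res <- chisq.test(obs, correct = TRUE)
c(by_hand = x2, chisq.test = unname(res$statistic))
```

Both give X-squared = 1.7443 with p-value 0.1866.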

15: Fisher's exact test, an alternative to the chi-square test

fisher.test(contble, conf.int = T, conf.level = 0.99)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contble
## p-value = 0.1845
## alternative hypothesis: true odds ratio is not equal to 1
## 99 percent confidence interval:
##  0.3625381 1.3521266
## sample estimates:
## odds ratio 
##  0.7054345

H0: Gender and Smoke are independent
H1: Gender and Smoke are associated

The p-value of Fisher's test is 0.1845, which is larger than 0.05, therefore we fail to reject H0 and conclude that there is no evidence of an association between Gender and smoking habit.

** Thank You**