Data File1: LungCapacityDataSet
Data File2: BloodPressure
# read the data
LungCapData2 <- read.csv(file.choose(), stringsAsFactors = TRUE)
# and attach the data
attach(LungCapData2)
# ask for a summary of the data
summary(LungCapData2)
##     LungCap            Age            Height       Smoke        Gender
##  Min.   : 0.507   Min.   : 3.00   Min.   :45.30   no :648   female:358
##  1st Qu.: 6.150   1st Qu.: 9.00   1st Qu.:59.90   yes: 77   male  :367
##  Median : 8.000   Median :13.00   Median :65.40
##  Mean   : 7.863   Mean   :12.33   Mean   :64.84
##  3rd Qu.: 9.800   3rd Qu.:15.00   3rd Qu.:70.30
##  Max.   :14.675   Max.   :19.00   Max.   :81.80
##  Caesarean
##  no :561
##  yes:164
str(LungCapData2)
## 'data.frame': 725 obs. of 6 variables:
## $ LungCap : num 6.47 10.12 9.55 11.12 4.8 ...
## $ Age : int 6 18 16 14 5 11 8 11 15 11 ...
## $ Height : num 62.1 74.7 69.7 71 56.9 58.7 63.3 70.4 70.5 59.2 ...
## $ Smoke : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 2 2 2 2 ...
## $ Caesarean: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
We are going to examine the LungCap variable. First we check the assumption of normality by plotting a histogram and a box plot.
hist(LungCap, col = 2)
The histogram looks roughly bell-shaped, with a slight negative skew.
boxplot(LungCap, horizontal = T, col = 3)
The box plot is roughly symmetric, so the data seem approximately normal.
qqnorm(LungCap, main='Normal')
qqline(LungCap)
If the points follow the straight reference line closely the variable is close to normal; here the variable looks only slightly away from normality.
shapiro.test(LungCap)
##
## Shapiro-Wilk normality test
##
## data: LungCap
## W = 0.99305, p-value = 0.001886
H0: LungCap is normally distributed
H1: LungCap is not normally distributed
The p-value is 0.001886, which is smaller than 0.05, so we reject H0 and conclude that the data are not normally distributed.
ks.test(LungCap,"pnorm")
## Warning in ks.test.default(LungCap, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: LungCap
## D = 0.96575, p-value < 2.2e-16
## alternative hypothesis: two-sided
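H0: LungCap is normally distributed
H1: LungCap is not normally distributed
The p-value is < 2.2e-16, so we reject H0. Note that this call compares LungCap with the standard normal distribution (mean 0, sd 1) because no mean or sd is supplied, so the large D mainly reflects the difference in location and scale rather than the shape of the distribution.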
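To check the shape rather than the location and scale, the reference normal needs the sample mean and standard deviation passed to `ks.test()`. The following is a minimal sketch on simulated data; the vector `x` and its parameters are made up for illustration and are not the LungCap data. Because the parameters are estimated from the same sample, the reported p-value is only approximate.

```r
# Simulated stand-in for a numeric variable (hypothetical values, not LungCap)
set.seed(1)
x <- rnorm(500, mean = 8, sd = 2.7)

# Default call: compares x with N(0, 1), so D is large mainly because
# the location and scale of x are far from 0 and 1
ks_std <- ks.test(x, "pnorm")

# Reference normal with the sample mean and sd: this checks the shape
ks_fit <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

ks_std$statistic
ks_fit$statistic
```

The second statistic is the one that is informative about normality.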
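Although the normality tests reject H0, the sample size is large (n = 725), so by the central limit theorem the sample mean is approximately normal and we can still use the t-test.
The one-sample t-test is a parametric method for examining a single numeric variable.
A numeric variable is also referred to as a scale variable.
help("t.test")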
H0: mu = 8
H1: mu < 8
t.test(LungCap, mu=8, alternative="less", conf.level = 0.95)
##
## One Sample t-test
##
## data: LungCap
## t = -1.3842, df = 724, p-value = 0.08336
## alternative hypothesis: true mean is less than 8
## 95 percent confidence interval:
## -Inf 8.025974
## sample estimates:
## mean of x
## 7.863148
The p-value is 0.0834, which is larger than 0.05, so we fail to reject H0: the data do not provide sufficient evidence to conclude that the population mean is less than 8.
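The p-value reported above can be reproduced from the t statistic and its degrees of freedom with `pt()`. This is a small sketch using the rounded values printed in the output, so the results agree only to the printed precision; the variable names are chosen here for illustration.

```r
# Values copied from the t.test() output above
t_stat <- -1.3842
df     <- 724

# Lower-tail area: p-value for H1: mu < 8
p_less <- pt(t_stat, df = df)

# Two-sided p-value: twice the tail area beyond |t|
p_two_sided <- 2 * pt(-abs(t_stat), df = df)

round(c(less = p_less, two.sided = p_two_sided), 4)
```

The two-sided p-value is exactly twice the one-sided one here because the t distribution is symmetric.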
H0: mu = 8
H1: mu != 8
t.test(LungCap, mu=8, alternative="two.sided", conf.level = 0.95)
##
## One Sample t-test
##
## data: LungCap
## t = -1.3842, df = 724, p-value = 0.1667
## alternative hypothesis: true mean is not equal to 8
## 95 percent confidence interval:
## 7.669052 8.057243
## sample estimates:
## mean of x
## 7.863148
The p-value is 0.1667, which is larger than 0.05, so we fail to reject H0: the data do not provide sufficient evidence to conclude that the population mean differs from 8.
The two-sided test is the default in R, therefore we can write
t.test(LungCap, mu=8, conf.level = 0.95)
##
## One Sample t-test
##
## data: LungCap
## t = -1.3842, df = 724, p-value = 0.1667
## alternative hypothesis: true mean is not equal to 8
## 95 percent confidence interval:
## 7.669052 8.057243
## sample estimates:
## mean of x
## 7.863148
we can store the result in a variable
tst <- t.test(LungCap, mu=8, conf.level = 0.95)
attributes(tst)
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
if we want to find the p.value only then
tst$p.value
## [1] 0.1667108
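The other components listed by `attributes()` can be extracted in the same way. A minimal sketch on a simulated sample (the vector `x` is hypothetical and stands in for LungCap):

```r
# Hypothetical sample, used only to have something to test
set.seed(2)
x <- rnorm(50, mean = 8, sd = 2.5)

tst <- t.test(x, mu = 8, conf.level = 0.95)

tst$statistic                     # t value
tst$parameter                     # degrees of freedom
tst$estimate                      # sample mean
tst$conf.int                      # lower and upper confidence limits
attr(tst$conf.int, "conf.level")  # confidence level stored with the interval

# A decision at the 5% level as a logical value
reject_h0 <- tst$p.value < 0.05
reject_h0
```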
H0: mean LungCap of smokers = mean LungCap of non-smokers
H1: the two means are unequal
Two-sided alternative; with var.equal = TRUE the population variances are assumed equal (pooled t-test)
t.test(LungCap ~ Smoke, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = TRUE)
##
## Two Sample t-test
##
## data: LungCap by Smoke
## t = -2.7399, df = 723, p-value = 0.006297
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -1.5024262 -0.2481063
## sample estimates:
## mean in group no mean in group yes
## 7.770188 8.645455
The p-value = 0.006297 is much smaller than 0.05, therefore the mean LungCap of smokers and non-smokers differ significantly.
Is it justified to use the pooled t-test without checking the equality of variances?
Ans: No. First check the equality of variances.
The two-sample t-test examines the relationship between a scale variable and a two-level factor: here we want to compare the LungCap of two groups, smokers and non-smokers. There are two versions of this test: the pooled t-test (equal variances) and the non-pooled or Welch t-test (unequal variances).
Several ways to check the equality of variances:
a. the box plots show similar spread in the two groups
b. the ratio of the two sample variances is less than two
c. Levene's test p-value > 0.05
boxplot(LungCap ~ Smoke, horizontal = T, col=c(2,3))
The IQR of the yes group is much smaller than that of the no group, indicating unequal variances, so the non-pooled test is appropriate.
Keep the larger sample variance in the numerator.
var(LungCap[Smoke=="yes"])
## [1] 3.545292
var(LungCap[Smoke=="no"])
## [1] 7.431694
# ratio
var(LungCap[Smoke=="no"])/var(LungCap[Smoke=="yes"])
## [1] 2.096215
If the ratio of the two sample variances is more than two, use the non-pooled test (unequal variances); otherwise use the pooled test. Here the ratio is 2.096, so the non-pooled test applies.
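The rule of thumb can be written so that the larger variance always ends up in the numerator, whichever group it belongs to. A small sketch using the two sample variances printed above; the names `var_yes`, `var_no` and `use_pooled` are chosen here for illustration.

```r
# Sample variances copied from the output above
var_yes <- 3.545292
var_no  <- 7.431694

# Larger variance in the numerator, so the ratio is always >= 1
ratio <- max(var_yes, var_no) / min(var_yes, var_no)
ratio

# Rule of thumb from the text: a ratio above 2 suggests unequal variances
use_pooled <- ratio <= 2
use_pooled
```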
Levene's test from the car package
H0: Population variances are equal for smokers and non-smokers
H1: Population variances are unequal for smokers and non-smokers
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
leveneTest(LungCap ~ Smoke)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 12.955 0.0003408 ***
## 723
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The very small p-value indicates that the variances are unequal, i.e. the two population variances differ significantly.
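With `center = median`, as shown in the header of the output above, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R on simulated data; the variables `g` and `y` are hypothetical and are not the lung capacity data.

```r
set.seed(3)
# Hypothetical two-group data with clearly different spreads
g <- factor(rep(c("no", "yes"), each = 200))
y <- c(rnorm(200, mean = 8, sd = 3), rnorm(200, mean = 8, sd = 1))

# Absolute deviation of each observation from its own group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the absolute deviations: its F test is the Levene statistic
lev <- anova(lm(abs_dev ~ g))
lev

p_lev <- lev[["Pr(>F)"]][1]
p_lev
```

A small `p_lev` points to unequal variances, in which case the non-pooled t-test is the safer choice.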
H0: mean LungCap of smokers = mean LungCap of non-smokers
H1: the two means are unequal
Two-sided alternative by default
Assume the population variances are not equal
t.test(LungCap ~ Smoke, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: LungCap by Smoke
## t = -3.6498, df = 117.72, p-value = 0.0003927
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -1.3501778 -0.4003548
## sample estimates:
## mean in group no mean in group yes
## 7.770188 8.645455
All these arguments are the defaults, therefore the call is the same as
t.test(LungCap ~ Smoke)
As the p-value = 0.0003927 is smaller than 0.05, the two population means are unequal: in this sample the average lung capacity of smokers is significantly larger than that of non-smokers.
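The Welch statistic and its fractional degrees of freedom can be reproduced from the group means, variances and sizes printed earlier (sizes 648 and 77 come from the summary of Smoke). This is a sketch from rounded values, so it matches the output only to the printed precision; the variable names are chosen here for illustration.

```r
# Summary values copied from the outputs above
m_no  <- 7.770188; v_no  <- 7.431694; n_no  <- 648
m_yes <- 8.645455; v_yes <- 3.545292; n_yes <- 77

# Squared standard errors of the two group means (no pooling)
se2_no  <- v_no / n_no
se2_yes <- v_yes / n_yes

# Welch t statistic
t_welch <- (m_no - m_yes) / sqrt(se2_no + se2_yes)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_no + se2_yes)^2 /
  (se2_no^2 / (n_no - 1) + se2_yes^2 / (n_yes - 1))

# Two-sided p-value
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)

round(c(t = t_welch, df = df_welch), 2)
p_welch
```

The degrees of freedom are far below the pooled value of 723 because the smaller group has both fewer observations and a different variance.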
The paired t-test is also known as intervention analysis, pre-post analysis, or repeated-measures analysis.
Data file for the paired t-test
File name: BloodPressure.txt
BP <- read.table(file.choose(), header = T, sep = "\t")
attach(BP)
help("t.test")
Make a box plot
boxplot(Before, After, names = c("Before", "After"), col=c(3,2))
The box plot indicates that BP is lower After than Before.
plot(Before,After, col=c(2,3))
# add a reference line over this scatterplot
# intercept = 0, slope = 1
abline(0,1)
# the same line, with the arguments named
abline(a=0,b=1)
If there is no change in BP between Before and After, then the points are equally scattered above and below this line.
If there is a decrease in BP After, then more points are below the line.
If there is an increase in BP After, then more points are above the line.
H0: Mean difference in BP is zero
H1: Mean difference in BP is not equal to zero
t.test(Before, After, mu=0, alt="two.sided", paired = T, conf.level = 0.99)
##
## Paired t-test
##
## data: Before and After
## t = 3.8882, df = 24, p-value = 0.0006986
## alternative hypothesis: true mean difference is not equal to 0
## 99 percent confidence interval:
## 2.245279 13.754721
## sample estimates:
## mean difference
## 8
If the p-value is very small, then we reject H0 and the difference is significant. Here the mean difference (Before - After) is positive, meaning that BP After is significantly lower than Before.
Sample mean difference = +8
If we put After first and Before second, then
t.test(After, Before, mu=0, alt="two.sided", paired = T, conf.level = 0.99)
the mean difference is negative, again indicating that BP After is lower than Before. Sample mean difference = -8
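A paired t-test is the same as a one-sample t-test on the within-subject differences, and swapping the order of the two samples only flips the sign of the statistic. The sketch below checks both points on simulated readings; `before` and `after` are hypothetical vectors, not the BloodPressure file.

```r
set.seed(4)
# Hypothetical before/after readings for 25 subjects
before <- rnorm(25, mean = 140, sd = 12)
after  <- before - rnorm(25, mean = 8, sd = 10)

paired   <- t.test(before, after, paired = TRUE, conf.level = 0.99)
one_samp <- t.test(before - after, mu = 0, conf.level = 0.99)
swapped  <- t.test(after, before, paired = TRUE, conf.level = 0.99)

# Same statistic for the first two, opposite sign for the third
c(paired     = unname(paired$statistic),
  one_sample = unname(one_samp$statistic),
  swapped    = unname(swapped$statistic))

# The p-value does not depend on the order
c(paired = paired$p.value, swapped = swapped$p.value)
```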
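H0: Mean lung capacity is the same for all height categories
H1: At least two height categories have significantly different mean LungCap
CatHeight <- cut(Height, breaks=4, labels = c("A","B","C","D"))
# Note: the call below uses Height as the response, so its output compares mean Height
# across the height categories; to test the hypotheses stated above the response
# should be LungCap, i.e. aov(LungCap ~ CatHeight)
ANOVA1 <- aov(Height ~ CatHeight)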
# to see simple output
ANOVA1
## Call:
## aov(formula = Height ~ CatHeight)
##
## Terms:
## CatHeight Residuals
## Sum of Squares 33149.56 4404.95
## Deg. of Freedom 3 721
##
## Residual standard error: 2.471741
## Estimated effects may be unbalanced
# to see the extended output
summary(ANOVA1)
## Df Sum Sq Mean Sq F value Pr(>F)
## CatHeight 3 33150 11050 1809 <2e-16 ***
## Residuals 721 4405 6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# to see all objects in ANOVA1 variable
attributes(ANOVA1)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
##
## $class
## [1] "aov" "lm"
As the p-value of the ANOVA is very small, at least two group means differ significantly, so we apply multiple comparisons using post hoc analysis (TukeyHSD).
If H0 of the ANOVA is rejected, then we conduct pairwise multiple comparisons.
TukeyHSD(ANOVA1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Height ~ CatHeight)
##
## $CatHeight
##          diff       lwr       upr p adj
## B-A  8.440986  7.535386  9.346587     0
## C-A 16.747614 15.862368 17.632861     0
## D-A 24.151184 23.138731 25.163636     0
## C-B  8.306628  7.761677  8.851579     0
## D-B 15.710197 14.976460 16.443935     0
## D-C  7.403569  6.695107  8.112032     0
# las = 1 means labels are horizontal
plot(TukeyHSD(ANOVA1),las=1, col=CatHeight)
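For the hypotheses about lung capacity, the response in the model is LungCap and the grouping factor is the height category. The following is a minimal sketch of that workflow on simulated data; the `Height` and `LungCap` vectors below are generated for illustration and are not the real data set, so the numbers it prints say nothing about the actual study.

```r
set.seed(5)
# Hypothetical data: lung capacity increases with height
Height  <- runif(300, min = 45, max = 82)
LungCap <- 0.3 * Height - 12 + rnorm(300, sd = 1.5)

# Four equal-width height categories
CatHeight <- cut(Height, breaks = 4, labels = c("A", "B", "C", "D"))

# LungCap is the response, CatHeight the grouping factor
fit <- aov(LungCap ~ CatHeight)
summary(fit)

# Overall p-value of the F test
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]

# Pairwise comparisons, of interest when the overall F test is significant
tuk <- TukeyHSD(fit)
tuk$CatHeight
```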
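The Kruskal-Wallis test is a nonparametric equivalent of one-way ANOVA.
# As in the aov() call, the response here is Height; for the hypotheses about
# lung capacity it should be LungCap
kruskal.test(Height ~ CatHeight)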
##
## Kruskal-Wallis rank sum test
##
## data: Height by CatHeight
## Kruskal-Wallis chi-squared = 636.7, df = 3, p-value < 2.2e-16
H0: Lung capacity has the same distribution in all height categories
H1: At least two height categories differ in LungCap
The p-value of the Kruskal-Wallis test is very small, therefore at least two groups differ significantly. Note that Kruskal-Wallis compares ranks rather than means.
help(chisq.test)
Display the contingency table
contble <- table(Gender, Smoke )
barplot(contble, beside = T, legend=T, col = c(2,3))
H0: Gender and Smoke are independent
H1: Gender and Smoke are associated
chisq.test(contble, correct = T)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contble
## X-squared = 1.7443, df = 1, p-value = 0.1866
# store the output in a variable
chi <- chisq.test(contble, correct = T)
chi
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contble
## X-squared = 1.7443, df = 1, p-value = 0.1866
attributes(chi)
## $names
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
##
## $class
## [1] "htest"
chi$observed
## Smoke
## Gender no yes
## female 314 44
## male 334 33
chi$expected
## Smoke
## Gender no yes
## female 319.9779 38.02207
## male 328.0221 38.97793
chi$p.value
## [1] 0.1865893
The p-value of the chi-square test is 0.1866, which is larger than 0.05, therefore we fail to reject H0: there is no significant evidence that Gender and Smoke are associated.
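The expected counts and the corrected statistic can be rebuilt from the observed table alone, which shows where the numbers in `chi$expected` come from. This is a sketch using the observed counts printed above; the object names `obs`, `expected`, `x2` and `p` are chosen here for illustration.

```r
# Observed counts copied from chi$observed above (filled column by column)
obs <- matrix(c(314, 334, 44, 33), nrow = 2,
              dimnames = list(Gender = c("female", "male"),
                              Smoke  = c("no", "yes")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# Chi-squared statistic with Yates' continuity correction (2 x 2 table)
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)

# Upper-tail p-value with (2 - 1) * (2 - 1) = 1 degree of freedom
p <- pchisq(x2, df = 1, lower.tail = FALSE)

round(c(X2 = x2, p = p), 4)
```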
fisher.test(contble, conf.int = T, conf.level = 0.99)
##
## Fisher's Exact Test for Count Data
##
## data: contble
## p-value = 0.1845
## alternative hypothesis: true odds ratio is not equal to 1
## 99 percent confidence interval:
## 0.3625381 1.3521266
## sample estimates:
## odds ratio
## 0.7054345
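H0: Gender and Smoke are independent
H1: Gender and Smoke are associated
The p-value of Fisher's test is 0.1845, which is larger than 0.05, so we fail to reject H0: there is no significant evidence that Gender and smoking habit are associated. The 99% confidence interval for the odds ratio (0.36 to 1.35) contains 1, which agrees with this conclusion.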
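The odds ratio reported by `fisher.test()` is a conditional maximum likelihood estimate, so it is close to, but not identical with, the simple cross-product ratio of the table. A short sketch comparing the two on the observed counts; the names `or_sample` and `or_fisher` are chosen here for illustration.

```r
# Observed counts from the Gender by Smoke table (filled column by column)
obs <- matrix(c(314, 334, 44, 33), nrow = 2,
              dimnames = list(Gender = c("female", "male"),
                              Smoke  = c("no", "yes")))

# Cross-product odds ratio: odds of "no" for females divided by
# odds of "no" for males
or_sample <- (obs[1, 1] * obs[2, 2]) / (obs[1, 2] * obs[2, 1])

# Conditional maximum likelihood estimate used by fisher.test()
or_fisher <- unname(fisher.test(obs)$estimate)

round(c(sample = or_sample, fisher = or_fisher), 4)
```

An odds ratio of 1 corresponds to independence, which is why the test's alternative hypothesis is stated as "true odds ratio is not equal to 1".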