#A hypothesis is a proposition to be tested ie a testable assumption
# In statistics, we can define the corresponding null hypothesis (H0) as follows:
#
# H0:sample based population mean=hypothesized mean (mu=mu0)
# H0:sample based population mean<=hypothesized mean (mu<=mu0)
# H0:sample based population mean>=hypothesized mean (mu>=mu0)
#
# The corresponding alternative hypotheses (Ha) are as follows:
#
# Ha:sample based population mean!=hypothesized mean (mu!=mu0) (two sided: different) two-tailed tests
# Ha:sample based population mean>hypothesized mean (mu>mu0) (greater) one-tailed tests
# Ha:sample based population mean<hypothesized mean (mu<mu0) (less) one-tailed tests
#
#
# Formula of one-sample t-test
#
# The t-statistic can be calculated as follows:
#
# t=(xbar-mu)/(s/sqrt(n))
#
# where,
#
# xbar is the sample mean
# n is the sample size
# s is the sample standard deviation with n-1 degrees of freedom
# mu is the theoretical value
#
# We can compute the p-value corresponding to
# the absolute value of the t-test statistics (|t|)
# for the degrees of freedom (df): df=n-1.
#
# How to interpret the results?
#
# If the p-value is less than or equal to the significance level 0.05,
# we can reject the null hypothesis and accept the alternative hypothesis.
# We conclude that the sample mean is significantly different from the theoretical mean.
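# As a quick illustration (toy data, made up for this note), the t statistic and
# p-value can be computed by hand and checked against t.test():
xs<-c(5.1,4.8,5.3,5.0,4.9)#hypothetical sample
mu0<-5#hypothesized mean
tman<-(mean(xs)-mu0)/(sd(xs)/sqrt(length(xs)))#t=(xbar-mu)/(s/sqrt(n))
pman<-2*pt(-abs(tman),df=length(xs)-1)#two-sided p-value with df=n-1
tman;pman
t.test(xs,mu=mu0)$statistic;t.test(xs,mu=mu0)$p.value#should match tman and pman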
#--------------------------------------------------------------------------------------------------------
#One Sample t test : One tail - left tail #Donot Reject H0
#--------------------------------------------------------------------------------------------------------
# A consumer report has charged that Sweettooth Candy Company is underfilling its bags of candy.
# The industry average is 6.5 ounces of candy per bag. Sweettooth is confident it is not underfilling
# its bags and wants to publish a statistical rebuttal. (Note: One ounce is equal to approximately 28.35 grams.)
# The company draws 30 sample bags of candy randomly from its production line and weighs each bag.
# The weight in ounces is then entered into a table for analysis.
ounces<-c(7.1,5.8,6.4,6.4,6.3,7.1,6.9,6,6.5,6.6,7.1,6.5,5.7,6.6,6.4,6.7,6.4,6.6,7.1,6.6,6.4,6.5,7.1,6.8,7,6.7,7.1,6.7,7,6.5)
#Because the charge is underfilling, the alternative is that the mean is less than 6.5 ie a one tail left tail test
# Null hypothesis is that the mean is greater than or equal to 6.5
#Run the test
t.test(ounces,mu=6.5,alternative="less", conf.level=0.95)
##
## One Sample t-test
##
## data: ounces
## t = 1.7463, df = 29, p-value = 0.9543
## alternative hypothesis: true mean is less than 6.5
## 95 percent confidence interval:
## -Inf 6.736757
## sample estimates:
## mean of x
## 6.62
t.test(ounces,m=6.5,a="l", c=0.95)#Shorter version via partial argument matching; conf.level defaults to 0.95, so c= can be omitted
##
## One Sample t-test
##
## data: ounces
## t = 1.7463, df = 29, p-value = 0.9543
## alternative hypothesis: true mean is less than 6.5
## 95 percent confidence interval:
## -Inf 6.736757
## sample estimates:
## mean of x
## 6.62
htr<-t.test(ounces,m=6.5,a="l")#we can store the results in a variable
attributes(htr)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc < t crit Reject H0
tcalc<-htr$statistic
tcrit<-qt(0.05,29)#qt(p, df, lower.tail=TRUE); df = n-1 = 29
if(tcalc<tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
# If the pleft value is less than alpha (0.05) then reject the null hypothesis, else do not reject it
pleft<-htr$p.value
alpha<-0.05
#p = 0.9543 is not less than alpha = 0.05 #Donot Reject Null Hypothesis
if(pleft<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#---------------------------------------------------------------------------------------
#One sample t test :one tail-left tail #Reject H0
#----------------------------------------------------------------------------------------
# A biologist was interested in determining whether sunflower seedlings treated with an extract
# from Vinca minor roots resulted in a lower average height of sunflower seedlings
# than the standard height of 15.7 cm. The biologist treated a random sample of n = 33
# seedlings with the extract and subsequently obtained the following heights:
hcm<-c(11.5,11.8,15.7,16.1,14.1,10.5,9.3,15,11.1,15.2,19,12.8,12.4,19.2,13.5,12.2,13.3,16.5,13.5,14.4,16.7,10.9,13,10.3,15.8,15.1,17.1,13.3,12.4,8.5,14.3,12.9,13.5)
# The biologist's hypotheses are:
#
# H0 : mu = 15.7
# HA : mu < 15.7
htr2<-t.test(hcm,m=15.7,a="l")
print(htr2)
##
## One Sample t-test
##
## data: hcm
## t = -4.599, df = 32, p-value = 3.174e-05
## alternative hypothesis: true mean is less than 15.7
## 95 percent confidence interval:
## -Inf 14.41366
## sample estimates:
## mean of x
## 13.66364
attributes(htr2)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc < t crit Reject H0
tcalc<-htr2$statistic
tcrit<-qt(0.05,32)#qt(p, df, lower.tail=TRUE); df = n-1 = 32
if(tcalc<tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Reject H0"
#If pleft < alpha Reject H0
pleft<-htr2$p.value
alpha<-0.05
if(pleft<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Reject H0"
#--------------------------------------------------------------------------------------------
#One sample t test : one tail-Right tail #Donot Reject H0
#--------------------------------------------------------------------------------------------
# SKS TMT claims that their iron hardness is a minimum of 170.
# An engineer collected 25 pieces and measured their hardness as given below.
# Does the data support the claim of the company?
ih<-c(170,167,174,179,179,187,179,183,179,156,163,156,187,156,167,156,174,170,183,179,174,179,170,159,187)
# He was interested in testing the hypotheses:
#
# H0 : mu = 170
# HA : mu > 170
htr3<-t.test(ih,m=170,a="g")
print(htr3)
##
## One Sample t-test
##
## data: ih
## t = 1.2218, df = 24, p-value = 0.1168
## alternative hypothesis: true mean is greater than 170
## 95 percent confidence interval:
## 168.9914 Inf
## sample estimates:
## mean of x
## 172.52
attributes(htr3)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc > t crit Reject H0
tcalc<-htr3$statistic
tcrit<-qt(0.05,24,lower.tail = FALSE)#qt(p, df, lower.tail=TRUE); df = n-1 = 24
if(tcalc>tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#If pright < alpha Reject H0
pright<-htr3$p.value
alpha<-0.05
if(pright<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#-----------------------------------------------------------------------------------------------
#One sample t test : one tail-Right tail #Donot Reject H0 with Normality test
#-----------------------------------------------------------------------------------------------
set.seed(1234)
my_data <- data.frame(
name = paste0(rep("M_", 10), 1:10),
weight = round(rnorm(10, 20, 2), 1)
)
# Print the first 10 rows of the data
head(my_data, 10)
## name weight
## 1 M_1 17.6
## 2 M_2 20.6
## 3 M_3 22.2
## 4 M_4 15.3
## 5 M_5 20.9
## 6 M_6 21.0
## 7 M_7 18.9
## 8 M_8 18.9
## 9 M_9 18.9
## 10 M_10 18.2
# Statistical summaries of weight
summary(my_data$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.30 18.38 18.90 19.25 20.82 22.20
# Preliminary test to check one-sample t-test assumptions
# Is this a large sample? - No, because n < 30.
# Since the sample size is not large enough (less than 30, central limit theorem),
# we need to check whether the data follow a normal distribution.
# It is possible to use the Shapiro-Wilk normality test and to look at the normality plot.
#
# Shapiro-Wilk test:
# Null hypothesis: the data are normally distributed
# Alternative hypothesis: the data are not normally distributed
shapiro.test(my_data$weight) # => p-value = 0.6993
##
## Shapiro-Wilk normality test
##
## data: my_data$weight
## W = 0.9526, p-value = 0.6993
#Visual test qq from base package
qqnorm(my_data$weight)
qqline(my_data$weight)

#Visual test qq from car package
library(car)
## Loading required package: carData
qqPlot(my_data$weight)

## [1] 4 3
#Thus we can assume normality.
# H0 : mu = 25
# HA : mu > 25
# One-sample t-test
htr4 <- t.test(my_data$weight, m = 25,a="g")
# Printing the results
print(htr4)
##
## One Sample t-test
##
## data: my_data$weight
## t = -9.0783, df = 9, p-value = 1
## alternative hypothesis: true mean is greater than 25
## 95 percent confidence interval:
## 18.08895 Inf
## sample estimates:
## mean of x
## 19.25
attributes(htr4)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc > t crit Reject H0
tcalc<-htr4$statistic
tcrit<-qt(0.05,9,lower.tail = FALSE)#qt(p, df, lower.tail=TRUE); df = n-1 = 9
if(tcalc>tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#If pright < alpha Reject H0
pright<-htr4$p.value
alpha<-0.05
if(pright<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#--------------------------------------------------------------------------------------------------------
#One sample t test : one tail-Right tail #Reject H0
#--------------------------------------------------------------------------------------------------------
#A rehabilitation center for drug abuse wanted to know if there is a decrease in
#drug abuse in the young population (aged 20 and below) in the community.
#The most recent batch had 18 participants, whose ages are given below.
#Test at alpha=0.05 whether the population mean age is significantly greater than 20.
age<-c(36,28,21,25,31,17,22,18,18,29,21,26,17,18,30,19,19,28)
# H0 : mu = 20
# HA : mu > 20
htr5<-t.test(age,m=20,a="g")
print(htr5)
##
## One Sample t-test
##
## data: age
## t = 2.5769, df = 17, p-value = 0.009797
## alternative hypothesis: true mean is greater than 20
## 95 percent confidence interval:
## 21.13723 Inf
## sample estimates:
## mean of x
## 23.5
attributes(htr5)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc > t crit Reject H0
tcalc<-htr5$statistic
tcrit<-qt(0.05,17,lower.tail = FALSE)#qt(p, df, lower.tail=TRUE); df = n-1 = 17
if(tcalc>tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Reject H0"
#If pright < alpha Reject H0
pright<-htr5$p.value
alpha<-0.05
if(pright<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Reject H0"
#--------------------------------------------------------------------------------------------------------
#One sample t test : two tail #Donot Reject H0
#--------------------------------------------------------------------------------------------------------
# A song is written in such a way that it should take, on average, 20 minutes to sing.
# The piece was rehearsed 12 times by 12 different artists and
# the time recorded is given below. Can we conclude that the average time taken
# to sing the song is 20 minutes?
time<-c(21.5, 24.5, 18.5, 17.2, 14.5, 23.2, 22.1, 20.5, 19.4, 18.1, 24.1, 18.5)
# H0 : mu = 20
# HA : mu!= 20
htr6<-t.test(time,m=20,a="t")
print(htr6)
##
## One Sample t-test
##
## data: time
## t = 0.20066, df = 11, p-value = 0.8446
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
## 18.25544 22.09456
## sample estimates:
## mean of x
## 20.175
attributes(htr6)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If |tcalc| > |tcrit| Reject H0
tcalc<-htr6$statistic
tcrit<-qt(0.025,11)#qt(p, df, lower.tail=TRUE); df = n-1 = 11
if(abs(tcalc)>abs(tcrit)){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#For a two-tailed test, compare the two-sided p-value with alpha
p<-htr6$p.value
alpha<-0.05
if(p<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#--------------------------------------------------------------------------------------------------------
#One sample t test : two tail #Donot Reject H0
#--------------------------------------------------------------------------------------------------------
# The diameter of a bar manufactured by stroung TMT is supposed to be 8mm
# on an average as per the specification.
# A sample of 15 bars indicated the sizes given below
# Can we conclude that the company is delivering 8mm bars as per the specification?
dia<-c(8,7.89,8,8.26,8,8.07,8,8,7.95,8.04,8,8,7.99,8,8.03)
# H0 : mu = 8
# HA : mu!= 8
htr7<-t.test(dia,m=8,a="t")
print(htr7)
##
## One Sample t-test
##
## data: dia
## t = 0.75694, df = 14, p-value = 0.4616
## alternative hypothesis: true mean is not equal to 8
## 95 percent confidence interval:
## 7.971886 8.058780
## sample estimates:
## mean of x
## 8.015333
attributes(htr7)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If |tcalc| > |tcrit| Reject H0
tcalc<-htr7$statistic
tcrit<-qt(0.025,14)#qt(p, df, lower.tail=TRUE); df = n-1 = 14
if(abs(tcalc)>abs(tcrit)){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#For a two-tailed test, compare the two-sided p-value with alpha
p<-htr7$p.value
alpha<-0.05
if(p<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#-------------------------------------------------------------------------------------------------
# S3 method for default
# t.test(x, y = NULL,
# alternative = c("two.sided", "less", "greater"),
# mu = 0, paired = FALSE, var.equal = FALSE,
# conf.level = 0.95, ...)
#
# S3 method for formula
# t.test(formula, data, subset, na.action, ...)
# To conduct a hypothesis test on two independent sample means,
# we use the function: t.test(quantitative_variable_1, quantitative_variable_2, ...)
# Additional arguments for t.test( ) may include the following:
# alternative = "two.sided", "less", or "greater".
# If nothing is indicated, the argument defaults to two-sided.
# var.equal = TRUE or FALSE. TRUE means the variances are equal (pooled t-test) and FALSE means the variances are unequal (Welch t-test). If nothing is indicated, the argument defaults to FALSE.
# conf.level = confidence level desired. If nothing is indicated, the argument defaults to a 95% confidence level.
#Check if group a and b means are equal ie belong to same population
# t test in case of two numeric vectors
a<-c(1.1,2.3,1.5,3.4,5.4)
b<-c(1.4,2.5,1.7,2.4,4.1)
t.test(a,b)#quickest way 1
##
## Welch Two Sample t-test
##
## data: a and b
## t = 0.35425, df = 6.5915, p-value = 0.7342
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.843158 2.483158
## sample estimates:
## mean of x mean of y
## 2.74 2.42
x<-c("a","a","a","a","a","b","b","b","b","b")
y<-c(1.1,2.3,1.5,3.4,5.4,1.4,2.5,1.7,2.4,4.1)
t.test(y~x)#quickest way 2 formula method
##
## Welch Two Sample t-test
##
## data: y by x
## t = 0.35425, df = 6.5915, p-value = 0.7342
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.843158 2.483158
## sample estimates:
## mean in group a mean in group b
## 2.74 2.42
# The variables used in this test are known as:
# Dependent variable, or test variable (y)
# Independent variable, or grouping variable (x)
# *Assumptions and defaults for the above*
# alternative hypothesis is not equal to ie two sided
# the mean difference mu=0
# paired is false ie unpaired test or independent sample test
# the default test assumes unequal variance and
# applies the Welch degrees-of-freedom modification
# You can add a var.equal=TRUE
# option to specify equal variances and a pooled variance estimate
# confidence level is 95% and alpha is 0.05
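# For comparison, a minimal sketch of the pooled-variance (Student) version,
# reusing the a and b vectors from above; var.equal=TRUE switches off the
# Welch correction and uses df = n1+n2-2:
t.test(a,b,var.equal=TRUE)#Two Sample t-test with a pooled variance estimate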
library("MASS")
#------------------------------------------------------------------------------
#Two sample means test - Independent sample t test - One sided greater
#------------------------------------------------------------------------------
# We will use the dataset built into R called ToothGrowth which looks at the effect of Vitamin C on tooth growth in guinea pigs. The dataset shows the lengths of teeth after each guinea pig receives doses of vitamin C.
# The variables of the dataset are: len which is the numeric tooth length measurement (unspecified units)
# supp which is the delivery method of Vitamin C - either by orange juice (OJ) or by ascorbic acid (VC)
# dose which is the dosage of the supplement - 0.5, 1.0 or 2.0 mg/day
#Suppose we want to see the effectiveness of the delivery method disregarding dosage.
#
# First, let us check assumptions before doing the hypothesis test:
#1 the two samples are independent and randomly selected from the population of interest and
#2 the sample sizes are large enough (30 per group)
#Let us check if there are any outliers by drawing the boxplot of
#the tooth length separated by the delivery method.
boxplot(ToothGrowth$len ~ ToothGrowth$supp,
main = "Tooth Growth in Guinea Pig",
xlab = "Delivery Method",
ylab = "Tooth Length")
#The boxplots do not show any visible outliers.
# For this example, we want to see if delivering vitamin C via orange juice enhances
# more tooth growth in guinea pigs than ascorbic acid.
#A one-sided hypothesis test will be conducted.
#H0: mean tooth growth via orange juice <= mean tooth growth via ascorbic acid
#Ha: mean tooth growth via orange juice > mean tooth growth via ascorbic acid
#Since we are not told whether the variances are equal or not,
#we will assume unequal variances to be on the conservative side
#We will use the default, 95% confidence level.
#Explore the dataset
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth)#Its a stacked form
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
# As it is stacked, the formula method is easier: numeric_y ~ binary_categorical_x
t.test(ToothGrowth$len~ToothGrowth$supp) #Quick method, but note it defaults to a two-sided test
##
## Welch Two Sample t-test
##
## data: ToothGrowth$len by ToothGrowth$supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
#If you want the two-vector method, the data set should be split or reshaped
oj<-subset(ToothGrowth,supp=="OJ")
vc<-subset(ToothGrowth,supp=="VC")
alpha=0.05
t.test(oj$len,vc$len)
##
## Welch Two Sample t-test
##
## data: oj$len and vc$len
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
t.test(ToothGrowth$len~ToothGrowth$supp, alternative = "greater", var.equal = FALSE)#formula
##
## Welch Two Sample t-test
##
## data: ToothGrowth$len by ToothGrowth$supp
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
t.test(oj$len, vc$len, alternative = "greater", var.equal = FALSE)#default
##
## Welch Two Sample t-test
##
## data: oj$len and vc$len
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
#store t test results into a variable
tf<-t.test(ToothGrowth$len~ToothGrowth$supp, alternative = "greater", var.equal = FALSE)
td<-t.test(oj$len, vc$len, alternative = "greater", var.equal = FALSE)
#If tcalc > tcrit reject H0
tcalc<-td$statistic
df<-td$parameter
tcrit<-qt(0.05,df,lower.tail=FALSE)#upper-tail critical value for the right-tailed test
if(tcalc>tcrit){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#If pright < alpha Reject H0
pright<-td$p.value
if (pright<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#--------------------------------------------------------------------------------------------------
#Two sample means test - Independent sample t test - Two sided equal ie no difference -Donot Reject
#--------------------------------------------------------------------------------------------------
#Introduction to the dataset UScrime from {MASS} package
#The Effect of Punishment Regimes on Crime Rates
#This has been studied using aggregate data on 47 states of the USA for 1960
#given in this data frame.
# M percentage of males aged 14-24.
# So indicator variable for a Southern state.
# Ed mean years of schooling.
# Po1 police expenditure in 1960.
# Po2 police expenditure in 1959.
# LF labour force participation rate.
# M.F number of males per 1000 females.
# Pop state population.
# NW number of non-whites per 1000 people.
# U1 unemployment rate of urban males 14-24.
# U2 unemployment rate of urban males 35-39.
# GDP gross domestic product per head.
# Ineq income inequality.
# Prob probability of imprisonment.
# Time average time served in state prisons.
# y rate of crimes in a particular category per head of population.
#Find out if there is any difference in crime rate across southern and non southern states
#Explore the dataset
str(UScrime)
## 'data.frame': 47 obs. of 16 variables:
## $ M : int 151 143 142 136 141 121 127 131 157 140 ...
## $ So : int 1 0 1 0 0 0 1 1 1 0 ...
## $ Ed : int 91 113 89 121 121 110 111 109 90 118 ...
## $ Po1 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ Po2 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ LF : int 510 583 533 577 591 547 519 542 553 632 ...
## $ M.F : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ Pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ NW : int 301 102 219 80 30 44 139 179 286 15 ...
## $ U1 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ U2 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ GDP : int 394 557 318 673 578 689 620 472 421 526 ...
## $ Ineq: int 261 194 250 167 174 126 168 206 239 174 ...
## $ Prob: num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ Time: num 26.2 25.3 24.3 29.9 21.3 ...
## $ y : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
table(UScrime$So)
##
## 0 1
## 31 16
#The data contains 31 records related to non southern states and 16 records related to southern states
#The dataset is stacked and formula method should be used
t.test(UScrime$y~UScrime$So)
##
## Welch Two Sample t-test
##
## data: UScrime$y by UScrime$So
## t = 0.70677, df = 43.344, p-value = 0.4835
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -135.5961 281.9711
## sample estimates:
## mean in group 0 mean in group 1
## 930.0000 856.8125
#Assumptions: mu=0,two tailed test, unequal variance, confidence level is 95%, paired is false
t.test(UScrime$y~UScrime$So)#syntax type1
##
## Welch Two Sample t-test
##
## data: UScrime$y by UScrime$So
## t = 0.70677, df = 43.344, p-value = 0.4835
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -135.5961 281.9711
## sample estimates:
## mean in group 0 mean in group 1
## 930.0000 856.8125
t.test(y~So,data=UScrime)#syntax type2
##
## Welch Two Sample t-test
##
## data: y by So
## t = 0.70677, df = 43.344, p-value = 0.4835
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -135.5961 281.9711
## sample estimates:
## mean in group 0 mean in group 1
## 930.0000 856.8125
tt<-t.test(UScrime$y~UScrime$So)
#For a two-tailed test, compare the two-sided p-value with alpha
p<-tt$p.value
alpha<-0.05
if(p<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#Conclusion: There is no significant difference in crime rate between southern and non southern states
#xtra note:
#If abs(tcalc)>abs(tcrit) Reject H0
tcalc<-tt$statistic
dof<-tt$parameter
tcrit<-qt(alpha/2,dof)#two-tailed critical value
if(abs(tcalc)>abs(tcrit)){"Reject H0"} else {"Donot Reject H0"}
## [1] "Donot Reject H0"
#--------------------------------------------------------------------------------------------------
#Two sample means test - Independent sample t test - Two sided equal ie no difference - Reject
#--------------------------------------------------------------------------------------------------
#Find out if there is any difference in the probability of imprisonment, given a crime was committed, across southern and non southern states
#H0:There is no difference in the average probability of imprisonment across the southern and non southern states
#Ha:There is a difference in the average probability of imprisonment across the southern and non southern states
t.test(Prob~So,data=UScrime)
##
## Welch Two Sample t-test
##
## data: Prob by So
## t = -3.8954, df = 24.925, p-value = 0.0006506
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03852569 -0.01187439
## sample estimates:
## mean in group 0 mean in group 1
## 0.03851265 0.06371269
t.test(UScrime$Prob,UScrime$So)#Wrong usage: this treats the 0/1 indicator So as a second sample; use the formula method above
##
## Welch Two Sample t-test
##
## data: UScrime$Prob and UScrime$So
## t = -4.1938, df = 46.207, p-value = 0.0001229
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4341078 -0.1525605
## sample estimates:
## mean of x mean of y
## 0.04709138 0.34042553
tt<-t.test(Prob~So,data=UScrime)
alpha<-0.05
p<-tt$p.value#use the p-value, not the t statistic
if (p<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#The probability of getting imprisoned for a crime is not the same across southern and nonsouthern states
#xtra note:
#If abs(tcalc)>abs(tcrit) Reject H0
tcalc<-tt$statistic
dof<-tt$parameter
tcrit<-qt(alpha/2,dof)#two-tailed critical value
if(abs(tcalc)>abs(tcrit)){"Reject H0"} else {"Donot Reject H0"}
## [1] "Reject H0"
#--------------------------------------------------------------------------------------------------
#Two sample means test - Paired sample t test -Dependent t test - matched pair- one tail left
#--------------------------------------------------------------------------------------------------
#Introduce anorexia from {MASS} package
#Anorexia Data on Weight Change
#The anorexia data frame has 72 rows and 3 columns.
#Weight change data for young female anorexia patients.
#This data frame contains the following columns:
#Treat Factor of three levels: "Cont" (control),
#"CBT" (Cognitive Behavioural treatment) and "FT" (family treatment).
#Prewt Weight of patient before study period, in lbs.
#Postwt Weight of patient after study period, in lbs.
#Syntax t.test(quantitative_variable_1, quantitative_variable_2, paired = TRUE, ...)
# Additional arguments for t.test( ) may include the following:
# alternative = "two.sided", "less", or "greater". If none is indicated, the argument defaults to two-sided.
# conf.level = confidence level desired. If none is indicated, the argument defaults to 95% confidence level.
# In this example, we want to assess if the study participants have a
# statistically significant weight gain after treatment.
# A one-sided alternative test will be conducted.
# If the treatments were successful, then the difference between the variable,
# Prewt, which is the weight before the study, and the variable,
# Postwt, which is the weight after the study, will be less than 0.
# Be sure to include the argument, alternative = "less" in the t.test( ) function.
#H0: The difference in paired population means is greater than or equal to 0
#Ha: The difference in paired population means is less than 0
t.test(anorexia$Prewt, anorexia$Postwt,
alternative = "less",
paired = TRUE)
##
## Paired t-test
##
## data: anorexia$Prewt and anorexia$Postwt
## t = -2.9376, df = 71, p-value = 0.002229
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -1.195825
## sample estimates:
## mean of the differences
## -2.763889
tt<-t.test(anorexia$Prewt, anorexia$Postwt, alternative = "less",paired = TRUE)
#If pleft<alpha Reject H0
pleft<-tt$p.value
alpha=0.05
if(pleft<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#Difference in paired population means is less than 0
#--------------------------------------------------------------------------------------------------
#Two sample means test - Paired sample t test -Dependent t test - two tail
#--------------------------------------------------------------------------------------------------
# From UScrime dataset of MASS package
# Find if the unemployment rate for younger males (14-24)
# differs from that for older males (35-39).
# In this case, the two groups aren't independent.
# You wouldn't expect the unemployment rate
# for younger and older males in Alabama to be unrelated
#H0: true difference in means of U1 and U2 is equal to 0
#Ha: true difference in means of U1 and U2 is not equal to 0
t.test(UScrime$U1,UScrime$U2,paired=TRUE)
##
## Paired t-test
##
## data: UScrime$U1 and UScrime$U2
## t = 32.407, df = 46, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 57.67003 65.30870
## sample estimates:
## mean of the differences
## 61.48936
tt<-t.test(UScrime$U1,UScrime$U2,paired=TRUE)
tt
##
## Paired t-test
##
## data: UScrime$U1 and UScrime$U2
## t = 32.407, df = 46, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 57.67003 65.30870
## sample estimates:
## mean of the differences
## 61.48936
#For a two-tailed test, compare the two-sided p-value with alpha
p<-tt$p.value
alpha<-0.05
if(p<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#The unemployment rates for younger and older males are significantly different
#--------------------------------------------------------------------------------------------------
#Three and more sample means test - ANOVA
#--------------------------------------------------------------------------------------------------
#Introduce warpbreaks of {datasets} package
#The Number of Breaks in Yarn during Weaving
# A data frame with 54 observations on 3 variables.
#
# [,1] breaks numeric The number of breaks
# [,2] wool factor The type of wool (A or B)
# [,3] tension factor The level of tension (L, M, H)
#Find out if the average number of breaks is the same across Low, Medium and High tension settings
#If there are two groups use a t test; for more than 2 groups use ANOVA
#syntax: aov(formula,datasetname)
#H0: The mean breaks are the same across all tensions
#Ha: The mean breaks are not all the same across tensions
aov(breaks~tension,data=warpbreaks)
## Call:
## aov(formula = breaks ~ tension, data = warpbreaks)
##
## Terms:
## tension Residuals
## Sum of Squares 2034.259 7198.556
## Deg. of Freedom 2 51
##
## Residual standard error: 11.88058
## Estimated effects may be unbalanced
anova<-aov(breaks~tension,data=warpbreaks)
summary(anova)#see the model information
## Df Sum Sq Mean Sq F value Pr(>F)
## tension 2 2034 1017.1 7.206 0.00175 **
## Residuals 51 7199 141.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#If p is less than alpha Reject H0
p<-0.00175
alpha<-0.05
#p (0.00175) is less than alpha (0.05), therefore reject the null hypothesis
#and conclude that different tensions give different mean breaks
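#The p-value can also be pulled out programmatically instead of being typed in
#(summary(anova) returns a list whose first element is the ANOVA table):
p<-summary(anova)[[1]][["Pr(>F)"]][1]#first row is the tension effect
if(p<alpha){"Reject H0"}else{"Donot Reject H0"}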
#-------------------------------------------------------------------------------------------------------------------
#Test of variance -two sample variance using F test
#------------------------------------------------------------------------------------------------------------------
# When do you use the F-test?
# Comparing two variances is useful in several cases, including:
# When you want to perform a two samples t-test to check the equality of the variances of the two samples
# When you want to compare the variability of a new measurement method to an old one.
# Does the new method reduce the variability of the measure?
# Typical research questions are:
#
# whether the variance of group A is equal to the variance of group B ?
# whether the variance of group A is less than the variance of group B ?
# whether the variance of group A is greater than the variance of group B ?
#
# Preliminary test to check F-test assumptions
# The F-test is very sensitive to departures from the normality assumption.
# You need to check whether the data are normally distributed before using the F-test.
#
# Shapiro-Wilk test can be used to test whether the normal assumption holds.
# It’s also possible to use Q-Q plot (quantile-quantile plot) to graphically evaluate the normality of a variable.
# Q-Q plot draws the correlation between a given sample and the normal distribution.
#
# If there is doubt about normality, the better choice is to use Levene’s test
# or Fligner-Killeen test, which are less sensitive to departure from normal assumption.
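# A minimal sketch of those more robust alternatives, using the ToothGrowth
# data revisited later in this section (leveneTest() is from the car package
# loaded earlier; fligner.test() is in base stats):
fligner.test(len ~ supp, data = ToothGrowth)
leveneTest(len ~ supp, data = ToothGrowth)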
#Example 1:
#H0: Variance of group a is equal to variance of group b
#Ha: Variance of group a is not equal to variance of group b
# rnorm default values are mean = 0 and sd = 1
a<-rnorm(40,mean=0)
a
## [1] -0.47719270 -0.99838644 -0.77625389 0.06445882 0.95949406 -0.11028549
## [7] -0.51100951 -0.91119542 -0.83717168 2.41583518 0.13408822 -0.49068590
## [13] -0.44054787 0.45958944 -0.69372025 -1.44820491 0.57475572 -1.02365572
## [19] -0.01513830 -0.93594860 1.10229755 -0.47559308 -0.70944004 -0.50125806
## [25] -1.62909347 -1.16761926 -2.18003965 -1.34099319 -0.29429386 -0.46589754
## [31] 1.44949627 -1.06864272 -0.85536463 -0.28062300 -0.99434008 -0.96851432
## [37] -1.10731819 -1.25198589 -0.52382812 -0.49684996
b<-rnorm(40,mean=0)
b
## [1] -1.806031257 -0.582075925 -1.108889624 -1.014962009 -0.162309524
## [6] 0.563055819 1.647817473 -0.773353424 1.605909629 -1.157808548
## [11] 0.656588464 2.548991071 -0.034760390 -0.669633580 -0.007604756
## [16] 1.777084448 -1.138607737 1.367827179 1.329564791 0.336472797
## [21] 0.006892838 -0.455468738 -0.366523933 0.648286568 2.070270861
## [26] -0.153398412 -1.390700947 -0.723581777 0.258261762 -0.317059115
## [31] -0.177789958 -0.169994077 -1.372301886 -0.173787170 0.850232257
## [36] 0.697608712 0.549997351 -0.402731975 -0.191593770 -1.194527880
var.test(a,b)
##
## F test to compare two variances
##
## data: a and b
## F = 0.70356, num df = 39, denom df = 39, p-value = 0.2766
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3721114 1.3302306
## sample estimates:
## ratio of variances
## 0.7035581
#Since p is not less than alpha 0.05, we do not reject H0
# Example2 : Tooth Growth dataset
#H0: The variances of tooth growth across the two supplements are equal
#Ha: The variances of tooth growth across the two supplements are not equal
# F-test
#compares the variance of any two samples from normal distribution
#compares the ratio of the two variances; if these are equal, the ratio will be 1
ft <- var.test(len ~ supp, data = ToothGrowth)
ft
##
## F test to compare two variances
##
## data: len by supp
## F = 0.6386, num df = 29, denom df = 29, p-value = 0.2331
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3039488 1.3416857
## sample estimates:
## ratio of variances
## 0.6385951
#If p is less than alpha Reject H0
p<-ft$p.value
alpha<-0.05
if(p<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#Therefore the variances of tooth growth across the two supplements can be taken as equal
######################Non Parametric Tests#############################
# A non parametric test (sometimes called a distribution free test) does not assume anything about the underlying distribution
# (for example, that the data comes from a normal distribution).
# That’s compared to a parametric test, which makes assumptions about a population’s parameters
# (for example, the mean or standard deviation);
# When the word “non parametric” is used in stats,
# it doesn’t quite mean that you know nothing about the population.
# It usually means that you know the population data does not have a normal distribution.
# The only non parametric test you are likely to come across in elementary stats is the chi-square test.
#-----------------------------------------------------------------------------------------------------------------
# Mann Whitney U Test (Wilcoxon Rank Sum Test) #Equivalent to t test for two independent samples
#-----------------------------------------------------------------------------------------------------------------
# testing the equality of means in two independent samples
# A popular nonparametric test to compare outcomes between two independent groups is the Mann Whitney U test
# The Mann Whitney U test, sometimes called the Mann Whitney Wilcoxon Test or the Wilcoxon Rank Sum Test,
# is used to test whether two samples are likely to derive from the same population
# Some investigators interpret this test as comparing the medians between the two populations
# This test is often performed as a two-sided test and, thus, the research hypothesis indicates that the populations are not equal
# as opposed to specifying directionality. A one-sided research hypothesis is used if interest lies in
# detecting a positive or negative shift in one population as compared to the other.
#Example:
#H0: Probability of Imprisonment is the same across southern and non southern states
#H1: Probability of Imprisonment is not the same across southern and non southern states
library(MASS)
mwu<-wilcox.test(Prob~So,data=UScrime)
mwu
##
## Wilcoxon rank sum exact test
##
## data: Prob by So
## W = 81, p-value = 8.488e-05
## alternative hypothesis: true location shift is not equal to 0
#p is less than alpha 0.05, therefore Reject H0
#--------------------------------------------------------------------------------------
# dependent 2-group Wilcoxon Signed Rank Test
#--------------------------------------------------------------------------------------
# wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
#Let's use the paired unemployment rates U1 and U2 and do a Wilcoxon paired test
#H0 : Unemployment rates among young and adult populations are the same
#Ha : Unemployment rates among young and adult populations are not the same
# Wilcoxon signed rank test with continuity correction
wilcox.test(UScrime$U1, UScrime$U2, paired=TRUE)#syntax 1
## Warning in wilcox.test.default(UScrime$U1, UScrime$U2, paired = TRUE): cannot
## compute exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: UScrime$U1 and UScrime$U2
## V = 1128, p-value = 2.464e-09
## alternative hypothesis: true location shift is not equal to 0
with(UScrime, wilcox.test(U1, U2, paired=TRUE))#alternative syntax 2
## Warning in wilcox.test.default(U1, U2, paired = TRUE): cannot compute exact p-
## value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: U1 and U2
## V = 1128, p-value = 2.464e-09
## alternative hypothesis: true location shift is not equal to 0
# Again, you reach the same conclusion reached with the paired t-test
# Note: For the wilcox.test you can use the alternative="less" or alternative="greater"
# option to specify a one tailed test.
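# For instance, a one-tailed version of the paired test above, asking whether
# U1 tends to be larger than U2:
wilcox.test(UScrime$U1, UScrime$U2, paired=TRUE, alternative="greater")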
#--------------------------------------------------------------------------------------
# Kruskal Wallis Test One Way Anova by Ranks
#--------------------------------------------------------------------------------------
#kruskal.test(y~A) # where y is numeric and A is a factor
# kruskal.test(y ~ A, data)
# where y is a numeric outcome variable and A is a grouping variable with two or more
# levels (if there are two levels, it’s equivalent to the Mann–Whitney U test).
states <- data.frame(state.region, state.x77)
#state.x77 of datasets package
#matrix with 50 rows and 8 columns giving the following statistics
#in the respective columns.
# Population: population estimate as of July 1, 1975
# Income: per capita income (1974)
# Illiteracy: illiteracy (1970, percent of population)
# Life Exp: life expectancy in years (1969–71)
# Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)
# HS Grad: percent high-school graduates (1970)
# Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
# Area: land area in square miles
kruskal.test(Illiteracy ~ state.region, data=states)
##
## Kruskal-Wallis rank sum test
##
## data: Illiteracy by state.region
## Kruskal-Wallis chi-squared = 22.672, df = 3, p-value = 4.726e-05
#p is less than 0.05, therefore reject H0 of equal illiteracy across regions
#What is a contingency table?
# A contingency table is a type of table in a matrix format
# that displays the (multivariate) frequency distribution of the variables
#-----------------------------------------------
#Univariate frequency table or one way table
#-----------------------------------------------
# table(var1, var2, ..., varN)
table(Puromycin$state)#two levels
##
## treated untreated
## 12 11
# xtabs(formula, data)
xtabs(~state,Puromycin)
## state
## treated untreated
## 12 11
#Exercise xtabs for table(PlantGrowth$group)#three levels
#Proportions using prop.table
table(state.region)#four levels
## state.region
## Northeast South North Central West
## 9 16 12 13
ft<-table(state.region)#four levels frequency table
ft
## state.region
## Northeast South North Central West
## 9 16 12 13
# prop.table(table, margins)
prop.table(ft)#proportions table
## state.region
## Northeast South North Central West
## 0.18 0.32 0.24 0.26
prop.table(ft)*100#percentage table
## state.region
## Northeast South North Central West
## 18 32 24 26
#Exercise table(chickwts$feed)# six levels
#--------------------------------------------------------------
#Bivariate frequency table or two way table or contingency table
#--------------------------------------------------------------
table(warpbreaks$wool,warpbreaks$tension)
##
## L M H
## A 9 9 9
## B 9 9 9
xtabs(~wool+tension,warpbreaks)
## tension
## wool L M H
## A 9 9 9
## B 9 9 9
#Exercise xtabs table(mtcars$gear,mtcars$cyl)
#Proportion table
# prop.table(table, margins)
m <- matrix(1:4, 2)
m
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
proportions(m, 1)#row wise proportions
## [,1] [,2]
## [1,] 0.2500000 0.7500000
## [2,] 0.3333333 0.6666667
prop.table(m,1)# same
## [,1] [,2]
## [1,] 0.2500000 0.7500000
## [2,] 0.3333333 0.6666667
x<-prop.table(m,1)
x
## [,1] [,2]
## [1,] 0.2500000 0.7500000
## [2,] 0.3333333 0.6666667
apply(x,1,sum)#row wise sum
## [1] 1 1
apply(x,2,sum)#column wise sum
## [1] 0.5833333 1.4166667
proportions(m,2)#column wise proportions
## [,1] [,2]
## [1,] 0.3333333 0.4285714
## [2,] 0.6666667 0.5714286
prop.table(m,2)
## [,1] [,2]
## [1,] 0.3333333 0.4285714
## [2,] 0.6666667 0.5714286
x<-prop.table(m,2)
x
## [,1] [,2]
## [1,] 0.3333333 0.4285714
## [2,] 0.6666667 0.5714286
apply(x,1,sum)#row wise sum
## [1] 0.7619048 1.2380952
apply(x,2,sum)#column wise sum
## [1] 1 1
#Let's use a real dataset
xt<-table(mtcars$gear,mtcars$cyl)
xt
##
## 4 6 8
## 3 1 2 12
## 4 8 4 0
## 5 2 1 2
prop.table(xt)
##
## 4 6 8
## 3 0.03125 0.06250 0.37500
## 4 0.25000 0.12500 0.00000
## 5 0.06250 0.03125 0.06250
prop.table(xt)*100
##
## 4 6 8
## 3 3.125 6.250 37.500
## 4 25.000 12.500 0.000
## 5 6.250 3.125 6.250
m1<-prop.table(xt)*100
m1
##
## 4 6 8
## 3 3.125 6.250 37.500
## 4 25.000 12.500 0.000
## 5 6.250 3.125 6.250
sum(m1)
## [1] 100
prop.table(xt,1)
##
## 4 6 8
## 3 0.06666667 0.13333333 0.80000000
## 4 0.66666667 0.33333333 0.00000000
## 5 0.40000000 0.20000000 0.40000000
prop.table(xt,1)*100
##
## 4 6 8
## 3 6.666667 13.333333 80.000000
## 4 66.666667 33.333333 0.000000
## 5 40.000000 20.000000 40.000000
x<-prop.table(xt,1)*100
x
##
## 4 6 8
## 3 6.666667 13.333333 80.000000
## 4 66.666667 33.333333 0.000000
## 5 40.000000 20.000000 40.000000
apply(x,1,sum)#row wise sum
## 3 4 5
## 100 100 100
apply(x,2,sum)#column wise sum
## 4 6 8
## 113.33333 66.66667 120.00000
prop.table(xt,2)
##
## 4 6 8
## 3 0.09090909 0.28571429 0.85714286
## 4 0.72727273 0.57142857 0.00000000
## 5 0.18181818 0.14285714 0.14285714
prop.table(xt,2)*100
##
## 4 6 8
## 3 9.090909 28.571429 85.714286
## 4 72.727273 57.142857 0.000000
## 5 18.181818 14.285714 14.285714
x<-prop.table(xt,2)*100
x
##
## 4 6 8
## 3 9.090909 28.571429 85.714286
## 4 72.727273 57.142857 0.000000
## 5 18.181818 14.285714 14.285714
apply(x,1,sum)#row wise sum
## 3 4 5
## 123.37662 129.87013 46.75325
margin.table(x,1)
##
## 3 4 5
## 123.37662 129.87013 46.75325
marginSums(x,1)
##
## 3 4 5
## 123.37662 129.87013 46.75325
apply(x,2,sum)#column wise sum
## 4 6 8
## 100 100 100
margin.table(x,2)
##
## 4 6 8
## 100 100 100
marginSums(x,2)
##
## 4 6 8
## 100 100 100
#add margins to the table
#addmargins(table, margins)
addmargins(x,1)
##
## 4 6 8
## 3 9.090909 28.571429 85.714286
## 4 72.727273 57.142857 0.000000
## 5 18.181818 14.285714 14.285714
## Sum 100.000000 100.000000 100.000000
addmargins(x,2)
##
## 4 6 8 Sum
## 3 9.090909 28.571429 85.714286 123.376623
## 4 72.727273 57.142857 0.000000 129.870130
## 5 18.181818 14.285714 14.285714 46.753247
addmargins(x,c(1,2))
##
## 4 6 8 Sum
## 3 9.090909 28.571429 85.714286 123.376623
## 4 72.727273 57.142857 0.000000 129.870130
## 5 18.181818 14.285714 14.285714 46.753247
## Sum 100.000000 100.000000 100.000000 300.000000
y<-addmargins(x,c(1,2))
class(y)
## [1] "table" "matrix" "array"
#-----------------------------------------------------------------
#Multivariate frequency table or multi-way contingency table
#-----------------------------------------------------------------
#Three-way table
table(mtcars$cyl,mtcars$gear,mtcars$am)#using table
## , , = 0
##
##
## 3 4 5
## 4 1 2 0
## 6 2 2 0
## 8 12 0 0
##
## , , = 1
##
##
## 3 4 5
## 4 0 6 2
## 6 0 2 1
## 8 0 0 2
xtabs(~cyl+gear+am,mtcars)#using xtabs
## , , am = 0
##
## gear
## cyl 3 4 5
## 4 1 2 0
## 6 2 2 0
## 8 12 0 0
##
## , , am = 1
##
## gear
## cyl 3 4 5
## 4 0 6 2
## 6 0 2 1
## 8 0 0 2
x<-table(mtcars$cyl,mtcars$gear,mtcars$am)
ftable(x)#Compact form using ftable
## 0 1
##
## 4 3 1 0
## 4 2 6
## 5 0 2
## 6 3 2 0
## 4 2 2
## 5 0 1
## 8 3 12 0
## 4 0 0
## 5 0 2
#Extra note: Two way table with more details
library(gmodels)
library(vcd)
## Loading required package: grid

table(Arthritis$Treatment,Arthritis$Improved)
##
## None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
CrossTable(Arthritis$Treatment, Arthritis$Improved)#Output is a list
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 84
##
##
## | Arthritis$Improved
## Arthritis$Treatment | None | Some | Marked | Row Total |
## --------------------|-----------|-----------|-----------|-----------|
## Placebo | 29 | 7 | 7 | 43 |
## | 2.616 | 0.004 | 3.752 | |
## | 0.674 | 0.163 | 0.163 | 0.512 |
## | 0.690 | 0.500 | 0.250 | |
## | 0.345 | 0.083 | 0.083 | |
## --------------------|-----------|-----------|-----------|-----------|
## Treated | 13 | 7 | 21 | 41 |
## | 2.744 | 0.004 | 3.935 | |
## | 0.317 | 0.171 | 0.512 | 0.488 |
## | 0.310 | 0.500 | 0.750 | |
## | 0.155 | 0.083 | 0.250 | |
## --------------------|-----------|-----------|-----------|-----------|
## Column Total | 42 | 14 | 28 | 84 |
## | 0.500 | 0.167 | 0.333 | |
## --------------------|-----------|-----------|-----------|-----------|
##
##
#CrossTable is more powerful than this; please check the help
#--------------------------------------------------------------------------------------
# Test of Independence - Correlation for categorical variables
#--------------------------------------------------------------------------------------
#Chisquare and Fisher exact test are two tests for independence
# Fisher's Exact test is a way to test the association between two categorical variables
# when you have small cell sizes (expected values less than 5)
#Chisquare test of independence
# Introduction to dataset Arthritis of {vcd} package
# clinical trial investigating a new treatment for rheumatoid arthritis
# A data frame with 84 observations and 5 variables.
#
# ID patient ID.
# Treatment factor indicating treatment (Placebo, Treated).
# Sex factor indicating sex (Female, Male).
# Age age of patient.
# Improved ordered factor indicating treatment outcome (None, Some, Marked)
# You can apply the function chisq.test() to a two-way table in order to produce
# a chi-square test of independence of the row and column variables.
#Find out if there is any relationship (dependence) between
#Type of treatment and Results of the treatment
# H0: the variables are independent, there is no relationship between
# the two categorical variables. Knowing the value of one variable does not
# help to predict the value of the other variable
#H0: Treatment type and Improved status are not related/Independent to each other
#Ha: Treatment type and Improved status are related/dependent to each other
#Step 1 : two way table ie contingency table
table1 <- xtabs(~Treatment+Improved, data=Arthritis)
table1
## Improved
## Treatment None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
#Step 2: use the table as argument in chisquare test function
chisq.test(table1)
##
## Pearson's Chi-squared test
##
## data: table1
## X-squared = 13.055, df = 2, p-value = 0.001463
#Since p is less than alpha 0.05 Reject H0
#Conclude that treatment type is related to the outcomes
#Extra note:
cst1<-chisq.test(table1)
attributes(cst1)
## $names
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
##
## $class
## [1] "htest"
cst1$expected<5 #are any expected cell counts below 5?
## Improved
## Treatment None Some Marked
## Placebo FALSE FALSE FALSE
## Treated FALSE FALSE FALSE
#Example2: Find if gender has any relation to the treatment outcomes
# The assumption of the Chi-square test is not that the observed value in each cell is greater than 5.
# It is that the expected value in each cell is greater than 5.
# (The expected value for each cell is row total*column total/overall total).
# Often when the observed values are low, the totals are too, so they overlap a lot, but not always.
# If you still find that your expected values are too low, use a Fisher’s Exact test.
#H0: Gender and Improved status are not related/Independent to each other
#Ha: Gender and Improved status are related/dependent to each other
#Step 1 : two way table ie contingency table
table2 <- xtabs(~Sex+Improved, data=Arthritis)
table2
## Improved
## Sex None Some Marked
## Female 25 12 22
## Male 17 2 6
#Step 2: use the table as argument in chisquare test function
chisq.test(table2)
## Warning in chisq.test(table2): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table2
## X-squared = 4.8407, df = 2, p-value = 0.08889
#Since p is not less than alpha 0.05 Donot Reject H0
#Conclude that gender is not related to the treatment outcomes
#Observe the warning message; let's explore why it appeared
cst2<-chisq.test(table2)
## Warning in chisq.test(table2): Chi-squared approximation may be incorrect
attributes(cst2)
## $names
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
##
## $class
## [1] "htest"
cst2$expected<5 #which expected cell counts are below 5?
## Improved
## Sex None Some Marked
## Female FALSE FALSE FALSE
## Male FALSE TRUE FALSE
#Thus we have to use fisher exact test for the above example
# one of the six cells in the table (male-some improvement)
# has an expected value less than five, which may
# invalidate the chi-square approximation
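# As a cross-check, the expected counts are the outer product of the margins
# divided by the grand total; this reproduces cst2$expected:
outer(rowSums(table2), colSums(table2))/sum(table2)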
# In contrast to many statistical packages, the fisher.test()
# function can be applied to any two-way table
# with two or more rows and columns, not just a 2 × 2 table.
fisher.test(table2)#fisher test
##
## Fisher's Exact Test for Count Data
##
## data: table2
## p-value = 0.1094
## alternative hypothesis: two.sided
# p is not less than 0.05 Donot Reject H0
# gender and treatment outcomes are not related
#--------------------------------------------------------------------------------------
# Correlation for numeric variables - Test of Independence
#--------------------------------------------------------------------------------------
# Correlation coefficients are used to describe relationships among quantitative variables.
# The sign ± indicates the direction of the relationship (positive or inverse), and
# the magnitude indicates the strength of the relationship
# (ranging from 0 for no relationship to 1 for a perfectly predictable relationship).
# Types of correlations: R can produce a variety of correlation coefficients,
# including Pearson, Spearman,Kendall, partial, polychoric, and polyserial
# PEARSON, SPEARMAN, AND KENDALL CORRELATIONS
# The Pearson product-moment correlation assesses the degree of linear relationship between two quantitative variables.
# Spearman’s rank-order correlation coefficient assesses the degree of relationship between two rank-ordered variables.
# Kendall’s tau is also a nonparametric measure of rank correlation.
# The cor() function produces all three correlation coefficients,
# whereas the cov() function provides covariances.
# cor(x, use= , method= )
# x Matrix or data frame.
# use Specifies the handling of missing data.
# The options are all.obs (assumes no missing data—missing data will produce an error),
# everything (any correlation involving a case with missing values will be set to missing),
# complete.obs (listwise deletion),
# and pairwise.complete.obs (pairwise deletion).
#
# method Specifies the type of correlation. The options are pearson, spearman, and kendall
# The default options are use="everything" and method="pearson"
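# A small sketch (toy vectors, made up for this note) of how the use= argument
# changes the handling of missing values:
u<-c(1,2,3,NA,5)
v<-c(2,4,6,8,10)
cor(u,v)#default use="everything" propagates the NA
cor(u,v,use="complete.obs")#drops the pair containing the NA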
states<- state.x77[,1:6]
str(states)
## num [1:50, 1:6] 3615 365 2212 2110 21198 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## ..$ : chr [1:6] "Population" "Income" "Illiteracy" "Life Exp" ...
head(states)
## Population Income Illiteracy Life Exp Murder HS Grad
## Alabama 3615 3624 2.1 69.05 15.1 41.3
## Alaska 365 6315 1.5 69.31 11.3 66.7
## Arizona 2212 4530 1.8 70.55 7.8 58.1
## Arkansas 2110 3378 1.9 70.66 10.1 39.9
## California 21198 5114 1.1 71.71 10.3 62.6
## Colorado 2541 4884 0.7 72.06 6.8 63.9
cov(states)
## Population Income Illiteracy Life Exp Murder
## Population 19931683.7588 571229.7796 292.8679592 -407.8424612 5663.523714
## Income 571229.7796 377573.3061 -163.7020408 280.6631837 -521.894286
## Illiteracy 292.8680 -163.7020 0.3715306 -0.4815122 1.581776
## Life Exp -407.8425 280.6632 -0.4815122 1.8020204 -3.869480
## Murder 5663.5237 -521.8943 1.5817755 -3.8694804 13.627465
## HS Grad -3551.5096 3076.7690 -3.2354694 6.3126849 -14.549616
## HS Grad
## Population -3551.509551
## Income 3076.768980
## Illiteracy -3.235469
## Life Exp 6.312685
## Murder -14.549616
## HS Grad 65.237894
cor(states)
## Population Income Illiteracy Life Exp Murder HS Grad
## Population 1.00000000 0.2082276 0.1076224 -0.06805195 0.3436428 -0.09848975
## Income 0.20822756 1.0000000 -0.4370752 0.34025534 -0.2300776 0.61993232
## Illiteracy 0.10762237 -0.4370752 1.0000000 -0.58847793 0.7029752 -0.65718861
## Life Exp -0.06805195 0.3402553 -0.5884779 1.00000000 -0.7808458 0.58221620
## Murder 0.34364275 -0.2300776 0.7029752 -0.78084575 1.0000000 -0.48797102
## HS Grad -0.09848975 0.6199323 -0.6571886 0.58221620 -0.4879710 1.00000000
cor(states, method="spearman")
## Population Income Illiteracy Life Exp Murder HS Grad
## Population 1.0000000 0.1246098 0.3130496 -0.1040171 0.3457401 -0.3833649
## Income 0.1246098 1.0000000 -0.3145948 0.3241050 -0.2174623 0.5104809
## Illiteracy 0.3130496 -0.3145948 1.0000000 -0.5553735 0.6723592 -0.6545396
## Life Exp -0.1040171 0.3241050 -0.5553735 1.0000000 -0.7802406 0.5239410
## Murder 0.3457401 -0.2174623 0.6723592 -0.7802406 1.0000000 -0.4367330
## HS Grad -0.3833649 0.5104809 -0.6545396 0.5239410 -0.4367330 1.0000000
# you can see, for example, that a strong positive correlation exists
# between income and high school graduation rate and
# that a strong negative correlation exists
# between illiteracy rates and life expectancy.
corstate<-cor(states)
class(corstate)
## [1] "matrix" "array"
#Better way of presenting correlation
x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[,c("Life Exp", "Murder")]
cor(x,y)
## Life Exp Murder
## Population -0.06805195 0.3436428
## Income 0.34025534 -0.2300776
## Illiteracy -0.58847793 0.7029752
## HS Grad 0.58221620 -0.4879710
#H0: There is no correlation between Illiteracy and murder
#Ha: There is correlation between Illiteracy and murder
cor.test(states[,3], states[,5])#Illiteracy vs Murder
##
## Pearson's product-moment correlation
##
## data: states[, 3] and states[, 5]
## t = 6.8479, df = 48, p-value = 1.258e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5279280 0.8207295
## sample estimates:
## cor
## 0.7029752
# p is less than alpha 0.05 Reject H0
# There is correlation between Illiteracy and murder
print("########################## THE ### END ###############################################")
## [1] "########################## THE ### END ###############################################"
# Independent two sample t test
# Equal sample sizes and equal variance : degrees of freedom = 2n-2
# Unequal sample sizes and equal variance: degrees of freedom = n1+n2-2
# Equal or unequal sample sizes, unequal variances: this test, also known as Welch's t-test,
# is used when the two population variances are not assumed to be equal, and the Welch-Satterthwaite degrees-of-freedom formula should be used (see the sketch below)
# Dependent two sample t test degrees of freedom is n-1 where n is number of pairs
#
# we can calculate a t-statistic that will follow a t-distribution with n1 + n2 - 2 degrees of freedom.
# There is also a widely used modification of the t-test, known as Welch's t-test that adjusts the number of degrees of freedom when the variances are thought not to be equal to each other.
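# A minimal sketch (reusing the oj and vc subsets created earlier) of the
# Welch-Satterthwaite degrees-of-freedom formula t.test() applies when
# var.equal=FALSE:
welch_df<-function(x,y){
  vx<-var(x)/length(x); vy<-var(y)/length(y)#variance of each sample mean
  (vx+vy)^2/(vx^2/(length(x)-1)+vy^2/(length(y)-1))
}
welch_df(oj$len,vc$len)#should match the df reported by t.test (about 55.3)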
#Randomized Block Design - Friedman Test
#For the Friedman test, the format is
# friedman.test(y ~ A | B, data)
# where y is the numeric outcome variable, A is a grouping factor, and B is a blocking
# factor that identifies matched observations. data is an optional argument
# specifying a matrix or data frame containing the variables
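# A minimal sketch of the formula interface on made-up long-format data
# (4 judges each rating 3 wines; scores invented for illustration):
wines<-data.frame(score=c(5,7,6, 4,6,5, 7,8,8, 3,5,4),
                  wine=factor(rep(c("W1","W2","W3"),4)),
                  judge=factor(rep(paste0("J",1:4),each=3)))
friedman.test(score ~ wine | judge, data=wines)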