#A hypothesis is a proposition to be tested ie a testable assumption
# In statistics, we can define the corresponding null hypothesis (H0) as follows:
#
# H0:sample based population mean=hypothesized mean (mu=mu0)
# H0:sample based population mean<=hypothesized mean (mu<=mu0)
# H0:sample based population mean>=hypothesized mean (mu>=mu0)
#
# The corresponding alternative hypotheses (Ha) are as follows:
#
# Ha:sample based population mean!=hypothesized mean (mu!=mu0) (two sided: different) two-tailed tests
# Ha:sample based population mean>hypothesized mean (mu>mu0) (greater) one-tailed tests
# Ha:sample based population mean<hypothesized mean (mu<mu0) (less) one-tailed tests
#
#
# Formula of one-sample t-test
#
# The t-statistic can be calculated as follows:
#
# t=(xbar-mu)/(s/sqrt(n))
#
# where,
#
# xbar is the sample mean
# n is the sample size
# s is the sample standard deviation with n-1 degrees of freedom
# mu is the theoretical value
#
# We can compute the p-value corresponding to
# the absolute value of the t-test statistics (|t|)
# for the degrees of freedom (df): df=n-1.
#
# How to interpret the results?
#
# If the p-value is less than or equal to the significance level 0.05,
# we can reject the null hypothesis and accept the alternative hypothesis.
# We conclude that the sample mean is significantly different from the theoretical mean.
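# As a quick illustration (toy data, made up for this note), the t statistic and
# p-value can be computed by hand and checked against t.test():
xs<-c(5.1,4.8,5.3,5.0,4.9)#hypothetical sample
mu0<-5#hypothesized mean
tman<-(mean(xs)-mu0)/(sd(xs)/sqrt(length(xs)))#t=(xbar-mu)/(s/sqrt(n))
pman<-2*pt(-abs(tman),df=length(xs)-1)#two-sided p-value with df=n-1
tman;pman
t.test(xs,mu=mu0)$statistic;t.test(xs,mu=mu0)$p.value#should match tman and pman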
#--------------------------------------------------------------------------------------------------------
#One Sample t test : One tail - left tail #Donot Reject H0
#--------------------------------------------------------------------------------------------------------
# A consumer report has charged that Sweettooth Candy Company is underfilling its bags of candy.
# The industry average is 6.5 ounces of candy per bag. Sweettooth is confident it is not underfilling
# its bags and wants to publish a statistical rebuttal. (Note: One ounce is equal to approximately 28.35 grams.)
# The company draws 30 sample bags of candy randomly from its production line and weighs each bag.
# The weight in ounces is then entered into a table for analysis.
ounces<-c(7.1,5.8,6.4,6.4,6.3,7.1,6.9,6,6.5,6.6,7.1,6.5,5.7,6.6,6.4,6.7,6.4,6.6,7.1,6.6,6.4,6.5,7.1,6.8,7,6.7,7.1,6.7,7,6.5)
#Because the charge is underfilling, the alternative is that the mean is less than 6.5 ie a one tail left tail test
# Null hypothesis is that the mean is greater than or equal to 6.5
#Run the test
t.test(ounces,mu=6.5,alternative="less", conf.level=0.95)
##
## One Sample t-test
##
## data: ounces
## t = 1.7463, df = 29, p-value = 0.9543
## alternative hypothesis: true mean is less than 6.5
## 95 percent confidence interval:
## -Inf 6.736757
## sample estimates:
## mean of x
## 6.62
t.test(ounces,m=6.5,a="l", c=0.95)#Shorter version via partial argument matching; conf.level defaults to 0.95, so c= can be omitted
##
## One Sample t-test
##
## data: ounces
## t = 1.7463, df = 29, p-value = 0.9543
## alternative hypothesis: true mean is less than 6.5
## 95 percent confidence interval:
## -Inf 6.736757
## sample estimates:
## mean of x
## 6.62
htr<-t.test(ounces,m=6.5,a="l")#we can store the results in a variable
attributes(htr)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc < t crit Reject H0
tcalc<-htr$statistic
tcrit<-qt(0.05,29)#qt(p, df, lower.tail=TRUE); df = n-1 = 29
if(tcalc<tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
# If the pleft value is less than alpha (0.05) then reject the null hypothesis, else do not reject it
pleft<-htr$p.value
alpha<-0.05
#p = 0.9543 is not less than alpha = 0.05 #Donot Reject Null Hypothesis
if(pleft<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#---------------------------------------------------------------------------------------
#One sample t test :one tail-left tail #Reject H0
#----------------------------------------------------------------------------------------
# A biologist was interested in determining whether sunflower seedlings treated with an extract
# from Vinca minor roots resulted in a lower average height of sunflower seedlings
# than the standard height of 15.7 cm. The biologist treated a random sample of n = 33
# seedlings with the extract and subsequently obtained the following heights:
hcm<-c(11.5,11.8,15.7,16.1,14.1,10.5,9.3,15,11.1,15.2,19,12.8,12.4,19.2,13.5,12.2,13.3,16.5,13.5,14.4,16.7,10.9,13,10.3,15.8,15.1,17.1,13.3,12.4,8.5,14.3,12.9,13.5)
# The biologist's hypotheses are:
#
# H0 : mu = 15.7
# HA : mu < 15.7
htr2<-t.test(hcm,m=15.7,a="l")
print(htr2)
##
## One Sample t-test
##
## data: hcm
## t = -4.599, df = 32, p-value = 3.174e-05
## alternative hypothesis: true mean is less than 15.7
## 95 percent confidence interval:
## -Inf 14.41366
## sample estimates:
## mean of x
## 13.66364
attributes(htr2)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc < t crit Reject H0
tcalc<-htr2$statistic
tcrit<-qt(0.05,32)#qt(p, df, lower.tail=TRUE); df = n-1 = 32
if(tcalc<tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Reject H0"
#If pleft < alpha Reject H0
pleft<-htr2$p.value
alpha<-0.05
if(pleft<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Reject H0"
#--------------------------------------------------------------------------------------------
#One sample t test : one tail-Right tail #Donot Reject H0
#--------------------------------------------------------------------------------------------
# SKS TMT claims that their iron hardness is a minimum of 170.
# An engineer collected 25 pieces and measured their hardness as given below.
# Does the data support the claim of the company?
ih<-c(170,167,174,179,179,187,179,183,179,156,163,156,187,156,167,156,174,170,183,179,174,179,170,159,187)
# He was interested in testing the hypotheses:
#
# H0 : mu = 170
# HA : mu > 170
htr3<-t.test(ih,m=170,a="g")
print(htr3)
##
## One Sample t-test
##
## data: ih
## t = 1.2218, df = 24, p-value = 0.1168
## alternative hypothesis: true mean is greater than 170
## 95 percent confidence interval:
## 168.9914 Inf
## sample estimates:
## mean of x
## 172.52
attributes(htr3)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc > t crit Reject H0
tcalc<-htr3$statistic
tcrit<-qt(0.05,24,lower.tail = FALSE)#qt(p, df, lower.tail=TRUE); df = n-1 = 24
if(tcalc>tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#If pright < alpha Reject H0
pright<-htr3$p.value
alpha<-0.05
if(pright<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#-----------------------------------------------------------------------------------------------
#One sample t test : one tail-Right tail #Donot Reject H0 with Normality test
#-----------------------------------------------------------------------------------------------
set.seed(1234)
my_data <- data.frame(
name = paste0(rep("M_", 10), 1:10),
weight = round(rnorm(10, 20, 2), 1)
)
# Print the first 10 rows of the data
head(my_data, 10)
## name weight
## 1 M_1 17.6
## 2 M_2 20.6
## 3 M_3 22.2
## 4 M_4 15.3
## 5 M_5 20.9
## 6 M_6 21.0
## 7 M_7 18.9
## 8 M_8 18.9
## 9 M_9 18.9
## 10 M_10 18.2
# Statistical summaries of weight
summary(my_data$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.30 18.38 18.90 19.25 20.82 22.20
# Preliminary test to check one-sample t-test assumptions
# Is this a large sample? - No, because n < 30.
# Since the sample size is not large enough (less than 30, central limit theorem),
# we need to check whether the data follow a normal distribution.
# It is possible to use the Shapiro-Wilk normality test and to look at the normality plot.
#
# Shapiro-Wilk test:
# Null hypothesis: the data are normally distributed
# Alternative hypothesis: the data are not normally distributed
shapiro.test(my_data$weight) # => p-value = 0.6993
##
## Shapiro-Wilk normality test
##
## data: my_data$weight
## W = 0.9526, p-value = 0.6993
#Visual test qq from base package
qqnorm(my_data$weight)
qqline(my_data$weight)

#Visual test qq from car package
library(car)
## Loading required package: carData
qqPlot(my_data$weight)

## [1] 4 3
#Thus we can assume normality.
# H0 : mu = 25
# HA : mu > 25
# One-sample t-test
htr4 <- t.test(my_data$weight, m = 25,a="g")
# Printing the results
print(htr4)
##
## One Sample t-test
##
## data: my_data$weight
## t = -9.0783, df = 9, p-value = 1
## alternative hypothesis: true mean is greater than 25
## 95 percent confidence interval:
## 18.08895 Inf
## sample estimates:
## mean of x
## 19.25
attributes(htr4)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc > t crit Reject H0
tcalc<-htr4$statistic
tcrit<-qt(0.05,9,lower.tail = FALSE)#qt(p, df, lower.tail=TRUE); df = n-1 = 9
if(tcalc>tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#If pright < alpha Reject H0
pright<-htr4$p.value
alpha<-0.05
if(pright<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#--------------------------------------------------------------------------------------------------------
#One sample t test : one tail-Right tail #Reject H0
#--------------------------------------------------------------------------------------------------------
#A rehabilitation center for drug abuse wanted to know if there is a decrease in
#drug abuse in the young population (aged 20 and below) in the community.
#The most recent batch had 18 participants, whose ages are given below.
#Test at alpha=0.05 whether the population mean age is significantly greater than 20.
age<-c(36,28,21,25,31,17,22,18,18,29,21,26,17,18,30,19,19,28)
# H0 : mu = 20
# HA : mu > 20
htr5<-t.test(age,m=20,a="g")
print(htr5)
##
## One Sample t-test
##
## data: age
## t = 2.5769, df = 17, p-value = 0.009797
## alternative hypothesis: true mean is greater than 20
## 95 percent confidence interval:
## 21.13723 Inf
## sample estimates:
## mean of x
## 23.5
attributes(htr5)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If t calc > t crit Reject H0
tcalc<-htr5$statistic
tcrit<-qt(0.05,17,lower.tail = FALSE)#qt(p, df, lower.tail=TRUE); df = n-1 = 17
if(tcalc>tcrit){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Reject H0"
#If pright < alpha Reject H0
pright<-htr5$p.value
alpha<-0.05
if(pright<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Reject H0"
#--------------------------------------------------------------------------------------------------------
#One sample t test : two tail #Donot Reject H0
#--------------------------------------------------------------------------------------------------------
# A song is written in such a way that it should take, on average, 20 minutes to sing.
# The piece was rehearsed 12 times by 12 different artists and
# the time recorded is given below. Can we conclude that the average time taken
# to sing the song is 20 minutes?
time<-c(21.5, 24.5, 18.5, 17.2, 14.5, 23.2, 22.1, 20.5, 19.4, 18.1, 24.1, 18.5)
# H0 : mu = 20
# HA : mu!= 20
htr6<-t.test(time,m=20,a="t")
print(htr6)
##
## One Sample t-test
##
## data: time
## t = 0.20066, df = 11, p-value = 0.8446
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
## 18.25544 22.09456
## sample estimates:
## mean of x
## 20.175
attributes(htr6)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If |tcalc| > |tcrit| Reject H0
tcalc<-htr6$statistic
tcrit<-qt(0.025,11)#qt(p, df, lower.tail=TRUE); df = n-1 = 11
if(abs(tcalc)>abs(tcrit)){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#For a two-tailed test, compare the two-sided p-value with alpha
p<-htr6$p.value
alpha<-0.05
if(p<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#--------------------------------------------------------------------------------------------------------
#One sample t test : two tail #Donot Reject H0
#--------------------------------------------------------------------------------------------------------
# The diameter of a bar manufactured by stroung TMT is supposed to be 8mm
# on an average as per the specification.
# A sample of 15 bars indicated the sizes given below
# Can we conclude that the company is delivering 8mm bars as per the specification?
dia<-c(8,7.89,8,8.26,8,8.07,8,8,7.95,8.04,8,8,7.99,8,8.03)
# H0 : mu = 8
# HA : mu!= 8
htr7<-t.test(dia,m=8,a="t")
print(htr7)
##
## One Sample t-test
##
## data: dia
## t = 0.75694, df = 14, p-value = 0.4616
## alternative hypothesis: true mean is not equal to 8
## 95 percent confidence interval:
## 7.971886 8.058780
## sample estimates:
## mean of x
## 8.015333
attributes(htr7)#check the attributes
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
##
## $class
## [1] "htest"
#If |tcalc| > |tcrit| Reject H0
tcalc<-htr7$statistic
tcrit<-qt(0.025,14)#qt(p, df, lower.tail=TRUE); df = n-1 = 14
if(abs(tcalc)>abs(tcrit)){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#For a two-tailed test, compare the two-sided p-value with alpha
p<-htr7$p.value
alpha<-0.05
if(p<alpha){print("Reject H0")}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#-------------------------------------------------------------------------------------------------
# S3 method for default
# t.test(x, y = NULL,
# alternative = c("two.sided", "less", "greater"),
# mu = 0, paired = FALSE, var.equal = FALSE,
# conf.level = 0.95, ...)
#
# S3 method for formula
# t.test(formula, data, subset, na.action, ...)
# To conduct a hypothesis test on two independent sample means,
# we use the function: t.test(quantitative_variable_1, quantitative_variable_2, ...)
# Additional arguments for t.test( ) may include the following:
# alternative = "two.sided", "less", or "greater".
# If nothing is indicated, the argument defaults to two-sided.
# var.equal = TRUE or FALSE. TRUE means the variances are equal (pooled t-test) and FALSE means the variances are unequal (Welch t-test). If nothing is indicated, the argument defaults to FALSE.
# conf.level = confidence level desired. If nothing is indicated, the argument defaults to a 95% confidence level.
#Check if group a and b means are equal ie belong to same population
# t test in case of two numeric vectors
a<-c(1.1,2.3,1.5,3.4,5.4)
b<-c(1.4,2.5,1.7,2.4,4.1)
t.test(a,b)#quickest way 1
##
## Welch Two Sample t-test
##
## data: a and b
## t = 0.35425, df = 6.5915, p-value = 0.7342
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.843158 2.483158
## sample estimates:
## mean of x mean of y
## 2.74 2.42
x<-c("a","a","a","a","a","b","b","b","b","b")
y<-c(1.1,2.3,1.5,3.4,5.4,1.4,2.5,1.7,2.4,4.1)
t.test(y~x)#quickest way 2 formula method
##
## Welch Two Sample t-test
##
## data: y by x
## t = 0.35425, df = 6.5915, p-value = 0.7342
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.843158 2.483158
## sample estimates:
## mean in group a mean in group b
## 2.74 2.42
# The variables used in this test are known as:
# Dependent variable, or test variable (y)
# Independent variable, or grouping variable (x)
# *Assumptions and defaults for the above*
# alternative hypothesis is not equal to ie two sided
# the mean difference mu=0
# paired is false ie unpaired test or independent sample test
# the default test assumes unequal variance and
# applies the Welch degrees-of-freedom modification
# You can add a var.equal=TRUE
# option to specify equal variances and a pooled variance estimate
# confidence level is 95% and alpha is 0.05
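# For comparison, a minimal sketch of the pooled-variance (Student) version,
# reusing the a and b vectors from above; var.equal=TRUE switches off the
# Welch correction and uses df = n1+n2-2:
t.test(a,b,var.equal=TRUE)#Two Sample t-test with a pooled variance estimate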
library("MASS")
#------------------------------------------------------------------------------
#Two sample means test - Independent sample t test - One sided greater
#------------------------------------------------------------------------------
# We will use the dataset built into R called ToothGrowth which looks at the effect of Vitamin C on tooth growth in guinea pigs. The dataset shows the lengths of teeth after each guinea pig receives doses of vitamin C.
# The variables of the dataset are: len which is the numeric tooth length measurement (unspecified units)
# supp which is the delivery method of Vitamin C - either by orange juice (OJ) or by ascorbic acid (VC)
# dose which is the dosage of the supplement - 0.5, 1.0 or 2.0 mg/day
#Suppose we want to see the effectiveness of the delivery method disregarding dosage.
#
# First, let us check assumptions before doing the hypothesis test:
#1 the two samples are independent and randomly selected from the population of interest and
#2 the sample sizes are large enough (30 per group)
#Let us check if there are any outliers by drawing the boxplot of
#the tooth length separated by the delivery method.
boxplot(ToothGrowth$len ~ ToothGrowth$supp,
main = "Tooth Growth in Guinea Pig",
xlab = "Delivery Method",
ylab = "Tooth Length")
#The boxplots do not show any visible outliers.
# For this example, we want to see if delivering vitamin C via orange juice enhances
# more tooth growth in guinea pigs than ascorbic acid.
#A one-sided hypothesis test will be conducted.
#H0: mean tooth growth via orange juice <= mean tooth growth via ascorbic acid
#Ha: mean tooth growth via orange juice > mean tooth growth via ascorbic acid
#Since we are not told whether the variances are equal or not,
#we will assume unequal variances to be on the conservative side
#We will use the default, 95% confidence level.
#Explore the dataset
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth)#Its a stacked form
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
# As it is stacked, the formula method is easier: numeric_y ~ binary_categorical_x
t.test(ToothGrowth$len~ToothGrowth$supp) #Quick method, but note it defaults to a two-sided test
##
## Welch Two Sample t-test
##
## data: ToothGrowth$len by ToothGrowth$supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
#If you want the two-vector method, the data set should be split or reshaped
oj<-subset(ToothGrowth,supp=="OJ")
vc<-subset(ToothGrowth,supp=="VC")
alpha=0.05
t.test(oj$len,vc$len)
##
## Welch Two Sample t-test
##
## data: oj$len and vc$len
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
t.test(ToothGrowth$len~ToothGrowth$supp, alternative = "greater", var.equal = FALSE)#formula
##
## Welch Two Sample t-test
##
## data: ToothGrowth$len by ToothGrowth$supp
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
t.test(oj$len, vc$len, alternative = "greater", var.equal = FALSE)#default
##
## Welch Two Sample t-test
##
## data: oj$len and vc$len
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
#store t test results into a variable
tf<-t.test(ToothGrowth$len~ToothGrowth$supp, alternative = "greater", var.equal = FALSE)
td<-t.test(oj$len, vc$len, alternative = "greater", var.equal = FALSE)
#If tcalc > tcrit reject H0
tcalc<-td$statistic
df<-td$parameter
tcrit<-qt(0.05,df,lower.tail=FALSE)#upper-tail critical value for the right-tailed test
if(tcalc>tcrit){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#If pright < alpha Reject H0
pright<-td$p.value
if (pright<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#--------------------------------------------------------------------------------------------------
#Two sample means test - Independent sample t test - Two sided equal ie no difference -Donot Reject
#--------------------------------------------------------------------------------------------------
#Introduction to the dataset UScrime from {MASS} package
#The Effect of Punishment Regimes on Crime Rates
#This has been studied using aggregate data on 47 states of the USA for 1960
#given in this data frame.
# M percentage of males aged 14-24.
# So indicator variable for a Southern state.
# Ed mean years of schooling.
# Po1 police expenditure in 1960.
# Po2 police expenditure in 1959.
# LF labour force participation rate.
# M.F number of males per 1000 females.
# Pop state population.
# NW number of non-whites per 1000 people.
# U1 unemployment rate of urban males 14-24.
# U2 unemployment rate of urban males 35-39.
# GDP gross domestic product per head.
# Ineq income inequality.
# Prob probability of imprisonment.
# Time average time served in state prisons.
# y rate of crimes in a particular category per head of population.
#Find out if there is any difference in crime rate across southern and non southern states
#Explore the dataset
str(UScrime)
## 'data.frame': 47 obs. of 16 variables:
## $ M : int 151 143 142 136 141 121 127 131 157 140 ...
## $ So : int 1 0 1 0 0 0 1 1 1 0 ...
## $ Ed : int 91 113 89 121 121 110 111 109 90 118 ...
## $ Po1 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ Po2 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ LF : int 510 583 533 577 591 547 519 542 553 632 ...
## $ M.F : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ Pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ NW : int 301 102 219 80 30 44 139 179 286 15 ...
## $ U1 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ U2 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ GDP : int 394 557 318 673 578 689 620 472 421 526 ...
## $ Ineq: int 261 194 250 167 174 126 168 206 239 174 ...
## $ Prob: num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ Time: num 26.2 25.3 24.3 29.9 21.3 ...
## $ y : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
table(UScrime$So)
##
## 0 1
## 31 16
#The data contains 31 records related to non southern states and 16 records related to southern states
#The dataset is stacked and formula method should be used
t.test(UScrime$y~UScrime$So)
##
## Welch Two Sample t-test
##
## data: UScrime$y by UScrime$So
## t = 0.70677, df = 43.344, p-value = 0.4835
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -135.5961 281.9711
## sample estimates:
## mean in group 0 mean in group 1
## 930.0000 856.8125
#Assumptions: mu=0,two tailed test, unequal variance, confidence level is 95%, paired is false
t.test(UScrime$y~UScrime$So)#syntax type1
##
## Welch Two Sample t-test
##
## data: UScrime$y by UScrime$So
## t = 0.70677, df = 43.344, p-value = 0.4835
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -135.5961 281.9711
## sample estimates:
## mean in group 0 mean in group 1
## 930.0000 856.8125
t.test(y~So,data=UScrime)#syntax type2
##
## Welch Two Sample t-test
##
## data: y by So
## t = 0.70677, df = 43.344, p-value = 0.4835
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -135.5961 281.9711
## sample estimates:
## mean in group 0 mean in group 1
## 930.0000 856.8125
tt<-t.test(UScrime$y~UScrime$So)
#For a two-tailed test, compare the two-sided p-value with alpha
p<-tt$p.value
alpha<-0.05
if(p<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#Conclusion: There is no significant difference in crime rate between southern and non southern states
#xtra note:
#If abs(tcalc)>abs(tcrit) Reject H0
tcalc<-tt$statistic
dof<-tt$parameter
tcrit<-qt(alpha/2,dof)#two-tailed critical value
if(abs(tcalc)>abs(tcrit)){"Reject H0"} else {"Donot Reject H0"}
## [1] "Donot Reject H0"
#--------------------------------------------------------------------------------------------------
#Two sample means test - Independent sample t test - Two sided equal ie no difference - Reject
#--------------------------------------------------------------------------------------------------
#Find out if there is any difference in the probability of imprisonment, given a crime was committed, across southern and non southern states
#H0:There is no difference in the average probability of imprisonment across the southern and non southern states
#Ha:There is a difference in the average probability of imprisonment across the southern and non southern states
t.test(Prob~So,data=UScrime)
##
## Welch Two Sample t-test
##
## data: Prob by So
## t = -3.8954, df = 24.925, p-value = 0.0006506
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03852569 -0.01187439
## sample estimates:
## mean in group 0 mean in group 1
## 0.03851265 0.06371269
t.test(UScrime$Prob,UScrime$So)#Wrong usage: this treats the 0/1 indicator So as a second sample; use the formula method above
##
## Welch Two Sample t-test
##
## data: UScrime$Prob and UScrime$So
## t = -4.1938, df = 46.207, p-value = 0.0001229
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4341078 -0.1525605
## sample estimates:
## mean of x mean of y
## 0.04709138 0.34042553
tt<-t.test(Prob~So,data=UScrime)
alpha<-0.05
p<-tt$p.value#use the p-value, not the t statistic
if (p<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#The probability of getting imprisoned for a crime is not the same across southern and nonsouthern states
#xtra note:
#If abs(tcalc)>abs(tcrit) Reject H0
tcalc<-tt$statistic
dof<-tt$parameter
tcrit<-qt(alpha/2,dof)#two-tailed critical value
if(abs(tcalc)>abs(tcrit)){"Reject H0"} else {"Donot Reject H0"}
## [1] "Reject H0"
#--------------------------------------------------------------------------------------------------
#Two sample means test - Paired sample t test -Dependent t test - matched pair- one tail left
#--------------------------------------------------------------------------------------------------
#Introduce anorexia from {MASS} package
#Anorexia Data on Weight Change
#The anorexia data frame has 72 rows and 3 columns.
#Weight change data for young female anorexia patients.
#This data frame contains the following columns:
#Treat Factor of three levels: "Cont" (control),
#"CBT" (Cognitive Behavioural treatment) and "FT" (family treatment).
#Prewt Weight of patient before study period, in lbs.
#Postwt Weight of patient after study period, in lbs.
#Syntax t.test(quantitative_variable_1, quantitative_variable_2, paired = TRUE, ...)
# Additional arguments for t.test( ) may include the following:
# alternative = "two.sided", "less", or "greater". If none is indicated, the argument defaults to two-sided.
# conf.level = confidence level desired. If none is indicated, the argument defaults to 95% confidence level.
# In this example, we want to assess if the study participants have a
# statistically significant weight gain after treatment.
# A one-sided alternative test will be conducted.
# If the treatments were successful, then the difference between the variable,
# Prewt, which is the weight before the study, and the variable,
# Postwt, which is the weight after the study, will be less than 0.
# Be sure to include the argument, alternative = "less" in the t.test( ) function.
#H0: The difference in paired population means is greater than or equal to 0
#Ha: The difference in paired population means is less than 0
t.test(anorexia$Prewt, anorexia$Postwt,
alternative = "less",
paired = TRUE)
##
## Paired t-test
##
## data: anorexia$Prewt and anorexia$Postwt
## t = -2.9376, df = 71, p-value = 0.002229
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -1.195825
## sample estimates:
## mean of the differences
## -2.763889
tt<-t.test(anorexia$Prewt, anorexia$Postwt, alternative = "less",paired = TRUE)
#If pleft<alpha Reject H0
pleft<-tt$p.value
alpha=0.05
if(pleft<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#Difference in paired population means is less than 0
#--------------------------------------------------------------------------------------------------
#Two sample means test - Paired sample t test -Dependent t test - two tail
#--------------------------------------------------------------------------------------------------
# From UScrime dataset of MASS package
# Find if the unemployment rate for younger males (14-24)
# differs from that for older males (35-39).
# In this case, the two groups aren't independent.
# You wouldn't expect the unemployment rate
# for younger and older males in Alabama to be unrelated
#H0: true difference in means of U1 and U2 is equal to 0
#Ha: true difference in means of U1 and U2 is not equal to 0
t.test(UScrime$U1,UScrime$U2,paired=TRUE)
##
## Paired t-test
##
## data: UScrime$U1 and UScrime$U2
## t = 32.407, df = 46, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 57.67003 65.30870
## sample estimates:
## mean of the differences
## 61.48936
tt<-t.test(UScrime$U1,UScrime$U2,paired=TRUE)
tt
##
## Paired t-test
##
## data: UScrime$U1 and UScrime$U2
## t = 32.407, df = 46, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 57.67003 65.30870
## sample estimates:
## mean of the differences
## 61.48936
#For a two-tailed test, compare the two-sided p-value with alpha
p<-tt$p.value
alpha<-0.05
if(p<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Reject H0"
#The unemployment rates for younger and older males are significantly different
#--------------------------------------------------------------------------------------------------
#Three and more sample means test - ANOVA
#--------------------------------------------------------------------------------------------------
#Introduce warpbreaks of {datasets} package
#The Number of Breaks in Yarn during Weaving
# A data frame with 54 observations on 3 variables.
#
# [,1] breaks numeric The number of breaks
# [,2] wool factor The type of wool (A or B)
# [,3] tension factor The level of tension (L, M, H)
#Find out if the average number of breaks is the same across Low, Medium and High tension settings
#If there are two groups use a t test; for more than 2 groups use ANOVA
#syntax: aov(formula,datasetname)
#H0: The mean breaks are the same across all tensions
#Ha: The mean breaks are not all the same across tensions
aov(breaks~tension,data=warpbreaks)
## Call:
## aov(formula = breaks ~ tension, data = warpbreaks)
##
## Terms:
## tension Residuals
## Sum of Squares 2034.259 7198.556
## Deg. of Freedom 2 51
##
## Residual standard error: 11.88058
## Estimated effects may be unbalanced
anova<-aov(breaks~tension,data=warpbreaks)
summary(anova)#see the model information
## Df Sum Sq Mean Sq F value Pr(>F)
## tension 2 2034 1017.1 7.206 0.00175 **
## Residuals 51 7199 141.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#If p is less than alpha Reject H0
p<-0.00175
alpha<-0.05
#p (0.00175) is less than alpha (0.05), therefore reject the null hypothesis
#and conclude that different tensions give different mean breaks
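#The p-value can also be pulled out programmatically instead of being typed in
#(summary(anova) returns a list whose first element is the ANOVA table):
p<-summary(anova)[[1]][["Pr(>F)"]][1]#first row is the tension effect
if(p<alpha){"Reject H0"}else{"Donot Reject H0"}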
#-------------------------------------------------------------------------------------------------------------------
#Test of variance -two sample variance using F test
#------------------------------------------------------------------------------------------------------------------
# When do you use the F-test?
# Comparing two variances is useful in several cases, including:
# When you want to perform a two samples t-test to check the equality of the variances of the two samples
# When you want to compare the variability of a new measurement method to an old one.
# Does the new method reduce the variability of the measure?
# Typical research questions are:
#
# whether the variance of group A is equal to the variance of group B ?
# whether the variance of group A is less than the variance of group B ?
# whether the variance of group A is greater than the variance of group B ?
#
# Preliminary test to check F-test assumptions
# The F-test is very sensitive to departures from the normality assumption.
# You need to check whether the data are normally distributed before using the F-test.
#
# Shapiro-Wilk test can be used to test whether the normal assumption holds.
# It’s also possible to use Q-Q plot (quantile-quantile plot) to graphically evaluate the normality of a variable.
# Q-Q plot draws the correlation between a given sample and the normal distribution.
#
# If there is doubt about normality, the better choice is to use Levene’s test
# or Fligner-Killeen test, which are less sensitive to departure from normal assumption.
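# A minimal sketch of those more robust alternatives, using the ToothGrowth
# data revisited later in this section (leveneTest() is from the car package
# loaded earlier; fligner.test() is in base stats):
fligner.test(len ~ supp, data = ToothGrowth)
leveneTest(len ~ supp, data = ToothGrowth)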
#Example 1:
#H0: Variance of group a is equal to variance of group b
#Ha: Variance of group a is not equal to variance of group b
# rnorm default values are mean = 0 and sd = 1
a<-rnorm(40,mean=0)
a
## [1] -0.47719270 -0.99838644 -0.77625389 0.06445882 0.95949406 -0.11028549
## [7] -0.51100951 -0.91119542 -0.83717168 2.41583518 0.13408822 -0.49068590
## [13] -0.44054787 0.45958944 -0.69372025 -1.44820491 0.57475572 -1.02365572
## [19] -0.01513830 -0.93594860 1.10229755 -0.47559308 -0.70944004 -0.50125806
## [25] -1.62909347 -1.16761926 -2.18003965 -1.34099319 -0.29429386 -0.46589754
## [31] 1.44949627 -1.06864272 -0.85536463 -0.28062300 -0.99434008 -0.96851432
## [37] -1.10731819 -1.25198589 -0.52382812 -0.49684996
b<-rnorm(40,mean=0)
b
## [1] -1.806031257 -0.582075925 -1.108889624 -1.014962009 -0.162309524
## [6] 0.563055819 1.647817473 -0.773353424 1.605909629 -1.157808548
## [11] 0.656588464 2.548991071 -0.034760390 -0.669633580 -0.007604756
## [16] 1.777084448 -1.138607737 1.367827179 1.329564791 0.336472797
## [21] 0.006892838 -0.455468738 -0.366523933 0.648286568 2.070270861
## [26] -0.153398412 -1.390700947 -0.723581777 0.258261762 -0.317059115
## [31] -0.177789958 -0.169994077 -1.372301886 -0.173787170 0.850232257
## [36] 0.697608712 0.549997351 -0.402731975 -0.191593770 -1.194527880
var.test(a,b)
##
## F test to compare two variances
##
## data: a and b
## F = 0.70356, num df = 39, denom df = 39, p-value = 0.2766
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3721114 1.3302306
## sample estimates:
## ratio of variances
## 0.7035581
#Since p is not less than alpha 0.05, we do not reject H0
# Example2 : Tooth Growth dataset
#H0: The variances of tooth growth across the two supplements are equal
#Ha: The variances of tooth growth across the two supplements are not equal
# F-test
#compares the variance of any two samples from normal distribution
#compares the ratio of the two variances; if these are equal, the ratio will be 1
ft <- var.test(len ~ supp, data = ToothGrowth)
ft
##
## F test to compare two variances
##
## data: len by supp
## F = 0.6386, num df = 29, denom df = 29, p-value = 0.2331
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3039488 1.3416857
## sample estimates:
## ratio of variances
## 0.6385951
#If p is less than alpha Reject H0
p<-ft$p.value
alpha<-0.05
if(p<alpha){"Reject H0"}else{"Donot Reject H0"}
## [1] "Donot Reject H0"
#Therefore the variances of tooth growth across the two supplements can be taken as equal
######################Non Parametric Tests#############################
# A non parametric test (sometimes called a distribution free test) does not assume anything about the underlying distribution
# (for example, that the data comes from a normal distribution).
# That’s compared to a parametric test, which makes assumptions about a population’s parameters
# (for example, the mean or standard deviation);
# When the word “non parametric” is used in stats,
# it doesn’t quite mean that you know nothing about the population.
# It usually means that you know the population data does not have a normal distribution.
# The only non parametric test you are likely to come across in elementary stats is the chi-square test.
#-----------------------------------------------------------------------------------------------------------------
# Mann Whitney U Test (Wilcoxon Rank Sum Test) #Equivalent to t test for two independent samples
#-----------------------------------------------------------------------------------------------------------------
# testing the equality of means in two independent samples
# A popular nonparametric test to compare outcomes between two independent groups is the Mann Whitney U test
# The Mann Whitney U test, sometimes called the Mann Whitney Wilcoxon Test or the Wilcoxon Rank Sum Test,
# is used to test whether two samples are likely to derive from the same population
# Some investigators interpret this test as comparing the medians between the two populations
# This test is often performed as a two-sided test and, thus, the research hypothesis indicates that the populations are not equal
# as opposed to specifying directionality. A one-sided research hypothesis is used if interest lies in
# detecting a positive or negative shift in one population as compared to the other.
#Example:
#H0: Probability of Imprisonment is the same across southern and non southern states
#H1: Probability of Imprisonment is not the same across southern and non southern states
library(MASS)
mwu<-wilcox.test(Prob~So,data=UScrime)
mwu
##
## Wilcoxon rank sum exact test
##
## data: Prob by So
## W = 81, p-value = 8.488e-05
## alternative hypothesis: true location shift is not equal to 0
#p is less than alpha 0.05, therefore Reject H0
#--------------------------------------------------------------------------------------
# dependent 2-group Wilcoxon Signed Rank Test
#--------------------------------------------------------------------------------------
# wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
#Let's use the paired unemployment rates U1 and U2 and do a Wilcoxon paired test
#H0 : Unemployment rates among young and adult populations are the same
#Ha : Unemployment rates among young and adult populations are not the same
# Wilcoxon signed rank test with continuity correction
wilcox.test(UScrime$U1, UScrime$U2, paired=TRUE)#syntax 1
## Warning in wilcox.test.default(UScrime$U1, UScrime$U2, paired = TRUE): cannot
## compute exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: UScrime$U1 and UScrime$U2
## V = 1128, p-value = 2.464e-09
## alternative hypothesis: true location shift is not equal to 0
with(UScrime, wilcox.test(U1, U2, paired=TRUE))#alternative syntax 2
## Warning in wilcox.test.default(U1, U2, paired = TRUE): cannot compute exact p-
## value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: U1 and U2
## V = 1128, p-value = 2.464e-09
## alternative hypothesis: true location shift is not equal to 0
# Again, you reach the same conclusion reached with the paired t-test
# Note: For the wilcox.test you can use the alternative="less" or alternative="greater"
# option to specify a one tailed test.
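# For instance, a one-tailed version of the paired test above, asking whether
# U1 tends to be larger than U2:
wilcox.test(UScrime$U1, UScrime$U2, paired=TRUE, alternative="greater")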
#--------------------------------------------------------------------------------------
# Kruskal Wallis Test One Way Anova by Ranks
#--------------------------------------------------------------------------------------
#kruskal.test(y~A) # where y is numeric and A is a factor
# kruskal.test(y ~ A, data)
# where y is a numeric outcome variable and A is a grouping variable with two or more
# levels (if there are two levels, it’s equivalent to the Mann–Whitney U test).
states <- data.frame(state.region, state.x77)
#state.x77 of datasets package
#matrix with 50 rows and 8 columns giving the following statistics
#in the respective columns.
# Population: population estimate as of July 1, 1975
# Income: per capita income (1974)
# Illiteracy: illiteracy (1970, percent of population)
# Life Exp: life expectancy in years (1969–71)
# Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)
# HS Grad: percent high-school graduates (1970)
# Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
# Area: land area in square miles
kruskal.test(Illiteracy ~ state.region, data=states)
##
## Kruskal-Wallis rank sum test
##
## data: Illiteracy by state.region
## Kruskal-Wallis chi-squared = 22.672, df = 3, p-value = 4.726e-05
#p is less than 0.05, therefore reject H0 of equal illiteracy across regions
#What is a contingency table?
# A contingency table is a type of table in a matrix format
# that displays the (multivariate) frequency distribution of the variables
#-----------------------------------------------
#Univariate frequency table or one way table
#-----------------------------------------------
# table(var1, var2, ..., varN)
table(Puromycin$state)#two levels
##
## treated untreated
## 12 11
# xtabs(formula, data)
xtabs(~state,Puromycin)
## state
## treated untreated
## 12 11
#Exercise xtabs for table(PlantGrowth$group)#three levels
#Proportions using prop.table
table(state.region)#four levels
## state.region
## Northeast South North Central West
## 9 16 12 13
ft<-table(state.region)#four levels frequency table
ft
## state.region
## Northeast South North Central West
## 9 16 12 13
# prop.table(table, margins)
prop.table(ft)#proportions table
## state.region
## Northeast South North Central West
## 0.18 0.32 0.24 0.26
prop.table(ft)*100#percentage table
## state.region
## Northeast South North Central West
## 18 32 24 26
#Exercise table(chickwts$feed)# six levels
#--------------------------------------------------------------
#Bivariate frequency table or two way table or contingency table
#--------------------------------------------------------------
table(warpbreaks$wool,warpbreaks$tension)
##
## L M H
## A 9 9 9
## B 9 9 9
xtabs(~wool+tension,warpbreaks)
## tension
## wool L M H
## A 9 9 9
## B 9 9 9
#Exercise xtabs table(mtcars$gear,mtcars$cyl)
#Proportion table
# prop.table(table, margins)
m <- matrix(1:4, 2)
m
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
proportions(m, 1)#row wise proportions
## [,1] [,2]
## [1,] 0.2500000 0.7500000
## [2,] 0.3333333 0.6666667
prop.table(m,1)# same
## [,1] [,2]
## [1,] 0.2500000 0.7500000
## [2,] 0.3333333 0.6666667
x<-prop.table(m,1)
x
## [,1] [,2]
## [1,] 0.2500000 0.7500000
## [2,] 0.3333333 0.6666667
apply(x,1,sum)#row wise sum
## [1] 1 1
apply(x,2,sum)#column wise sum
## [1] 0.5833333 1.4166667
proportions(m,2)#column wise proportions
## [,1] [,2]
## [1,] 0.3333333 0.4285714
## [2,] 0.6666667 0.5714286
prop.table(m,2)
## [,1] [,2]
## [1,] 0.3333333 0.4285714
## [2,] 0.6666667 0.5714286
x<-prop.table(m,2)
x
## [,1] [,2]
## [1,] 0.3333333 0.4285714
## [2,] 0.6666667 0.5714286
apply(x,1,sum)#row wise sum
## [1] 0.7619048 1.2380952
apply(x,2,sum)#column wise sum
## [1] 1 1
#Let's use a real dataset
xt<-table(mtcars$gear,mtcars$cyl)
xt
##
## 4 6 8
## 3 1 2 12
## 4 8 4 0
## 5 2 1 2
prop.table(xt)
##
## 4 6 8
## 3 0.03125 0.06250 0.37500
## 4 0.25000 0.12500 0.00000
## 5 0.06250 0.03125 0.06250
prop.table(xt)*100
##
## 4 6 8
## 3 3.125 6.250 37.500
## 4 25.000 12.500 0.000
## 5 6.250 3.125 6.250
m1<-prop.table(xt)*100
m1
##
## 4 6 8
## 3 3.125 6.250 37.500
## 4 25.000 12.500 0.000
## 5 6.250 3.125 6.250
sum(m1)
## [1] 100
prop.table(xt,1)
##
## 4 6 8
## 3 0.06666667 0.13333333 0.80000000
## 4 0.66666667 0.33333333 0.00000000
## 5 0.40000000 0.20000000 0.40000000
prop.table(xt,1)*100
##
## 4 6 8
## 3 6.666667 13.333333 80.000000
## 4 66.666667 33.333333 0.000000
## 5 40.000000 20.000000 40.000000
x<-prop.table(xt,1)*100
x
##
## 4 6 8
## 3 6.666667 13.333333 80.000000
## 4 66.666667 33.333333 0.000000
## 5 40.000000 20.000000 40.000000
apply(x,1,sum)#row wise sum
## 3 4 5
## 100 100 100
apply(x,2,sum)#column wise sum
## 4 6 8
## 113.33333 66.66667 120.00000
prop.table(xt,2)
##
## 4 6 8
## 3 0.09090909 0.28571429 0.85714286
## 4 0.72727273 0.57142857 0.00000000
## 5 0.18181818 0.14285714 0.14285714
prop.table(xt,2)*100
##
## 4 6 8
## 3 9.090909 28.571429 85.714286
## 4 72.727273 57.142857 0.000000
## 5 18.181818 14.285714 14.285714
x<-prop.table(xt,2)*100
x
##
## 4 6 8
## 3 9.090909 28.571429 85.714286
## 4 72.727273 57.142857 0.000000
## 5 18.181818 14.285714 14.285714
apply(x,1,sum)#row wise sum
## 3 4 5
## 123.37662 129.87013 46.75325
margin.table(x,1)
##
## 3 4 5
## 123.37662 129.87013 46.75325
marginSums(x,1)
##
## 3 4 5
## 123.37662 129.87013 46.75325
apply(x,2,sum)#column wise sum
## 4 6 8
## 100 100 100
margin.table(x,2)
##
## 4 6 8
## 100 100 100
marginSums(x,2)
##
## 4 6 8
## 100 100 100
#add margins to the table
#addmargins(table, margins)
addmargins(x,1)
##
## 4 6 8
## 3 9.090909 28.571429 85.714286
## 4 72.727273 57.142857 0.000000
## 5 18.181818 14.285714 14.285714
## Sum 100.000000 100.000000 100.000000
addmargins(x,2)
##
## 4 6 8 Sum
## 3 9.090909 28.571429 85.714286 123.376623
## 4 72.727273 57.142857 0.000000 129.870130
## 5 18.181818 14.285714 14.285714 46.753247
addmargins(x,c(1,2))
##
## 4 6 8 Sum
## 3 9.090909 28.571429 85.714286 123.376623
## 4 72.727273 57.142857 0.000000 129.870130
## 5 18.181818 14.285714 14.285714 46.753247
## Sum 100.000000 100.000000 100.000000 300.000000
y<-addmargins(x,c(1,2))
class(y)
## [1] "table" "matrix" "array"
#-----------------------------------------------------------------
#Multivariate frequency table or multi-way contingency table
#-----------------------------------------------------------------
#Three-way table
table(mtcars$cyl,mtcars$gear,mtcars$am)#using table
## , , = 0
##
##
## 3 4 5
## 4 1 2 0
## 6 2 2 0
## 8 12 0 0
##
## , , = 1
##
##
## 3 4 5
## 4 0 6 2
## 6 0 2 1
## 8 0 0 2
xtabs(~cyl+gear+am,mtcars)#using xtabs
## , , am = 0
##
## gear
## cyl 3 4 5
## 4 1 2 0
## 6 2 2 0
## 8 12 0 0
##
## , , am = 1
##
## gear
## cyl 3 4 5
## 4 0 6 2
## 6 0 2 1
## 8 0 0 2
x<-table(mtcars$cyl,mtcars$gear,mtcars$am)
ftable(x)#Compact form using ftable
## 0 1
##
## 4 3 1 0
## 4 2 6
## 5 0 2
## 6 3 2 0
## 4 2 2
## 5 0 1
## 8 3 12 0
## 4 0 0
## 5 0 2
#Extra note: Two way table with more details
library(gmodels)
library(vcd)
## Loading required package: grid

table(Arthritis$Treatment,Arthritis$Improved)
##
## None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
CrossTable(Arthritis$Treatment, Arthritis$Improved)#Output is a list
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 84
##
##
## | Arthritis$Improved
## Arthritis$Treatment | None | Some | Marked | Row Total |
## --------------------|-----------|-----------|-----------|-----------|
## Placebo | 29 | 7 | 7 | 43 |
## | 2.616 | 0.004 | 3.752 | |
## | 0.674 | 0.163 | 0.163 | 0.512 |
## | 0.690 | 0.500 | 0.250 | |
## | 0.345 | 0.083 | 0.083 | |
## --------------------|-----------|-----------|-----------|-----------|
## Treated | 13 | 7 | 21 | 41 |
## | 2.744 | 0.004 | 3.935 | |
## | 0.317 | 0.171 | 0.512 | 0.488 |
## | 0.310 | 0.500 | 0.750 | |
## | 0.155 | 0.083 | 0.250 | |
## --------------------|-----------|-----------|-----------|-----------|
## Column Total | 42 | 14 | 28 | 84 |
## | 0.500 | 0.167 | 0.333 | |
## --------------------|-----------|-----------|-----------|-----------|
##
##
#CrossTable is more powerful than this; please check the help
#--------------------------------------------------------------------------------------
# Test of Independence - Correlation for categorical variables
#--------------------------------------------------------------------------------------
#Chisquare and Fisher exact test are two tests for independence
# Fisher's Exact test is a way to test the association between two categorical variables
# when you have small cell sizes (expected values less than 5)
#Chisquare test of independence
# Introduction to dataset Arthritis of {vcd} package
# clinical trial investigating a new treatment for rheumatoid arthritis
# A data frame with 84 observations and 5 variables.
#
# ID patient ID.
# Treatment factor indicating treatment (Placebo, Treated).
# Sex factor indicating sex (Female, Male).
# Age age of patient.
# Improved ordered factor indicating treatment outcome (None, Some, Marked)
# You can apply the function chisq.test() to a two-way table in order to produce
# a chi-square test of independence of the row and column variables.
#Find out if there is any relationship (dependence) between
#Type of treatment and Results of the treatment
# H0: the variables are independent, there is no relationship between
# the two categorical variables. Knowing the value of one variable does not
# help to predict the value of the other variable
#H0: Treatment type and Improved status are not related/Independent to each other
#Ha: Treatment type and Improved status are related/dependent to each other
#Step 1 : two way table ie contingency table
table1 <- xtabs(~Treatment+Improved, data=Arthritis)
table1
## Improved
## Treatment None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
#Step 2: use the table as argument in chisquare test function
chisq.test(table1)
##
## Pearson's Chi-squared test
##
## data: table1
## X-squared = 13.055, df = 2, p-value = 0.001463
#Since p is less than alpha 0.05 Reject H0
#Conclude that treatment type is related to the outcomes
#Extra note:
cst1<-chisq.test(table1)
attributes(cst1)
## $names
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
##
## $class
## [1] "htest"
cst1$expected<5 #are any expected cell counts below 5?
## Improved
## Treatment None Some Marked
## Placebo FALSE FALSE FALSE
## Treated FALSE FALSE FALSE
#Example2: Find if gender has any relation to the treatment outcomes
# The assumption of the Chi-square test is not that the observed value in each cell is greater than 5.
# It is that the expected value in each cell is greater than 5.
# (The expected value for each cell is row total*column total/overall total).
# Often when the observed values are low, the totals are too, so they overlap a lot, but not always.
# If you still find that your expected values are too low, use a Fisher’s Exact test.
#H0: Gender and Improved status are not related/Independent to each other
#Ha: Gender and Improved status are related/dependent to each other
#Step 1 : two way table ie contingency table
table2 <- xtabs(~Sex+Improved, data=Arthritis)
table2
## Improved
## Sex None Some Marked
## Female 25 12 22
## Male 17 2 6
#Step 2: use the table as argument in chisquare test function
chisq.test(table2)
## Warning in chisq.test(table2): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table2
## X-squared = 4.8407, df = 2, p-value = 0.08889
#Since p is not less than alpha 0.05 Donot Reject H0
#Conclude that gender is not related to the treatment outcomes
#Observe the warning message; let's explore why it appeared
cst2<-chisq.test(table2)
## Warning in chisq.test(table2): Chi-squared approximation may be incorrect
attributes(cst2)
## $names
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
##
## $class
## [1] "htest"
cst2$expected<5 #which expected cell counts are below 5?
## Improved
## Sex None Some Marked
## Female FALSE FALSE FALSE
## Male FALSE TRUE FALSE
#Thus we have to use fisher exact test for the above example
# one of the six cells in the table (male-some improvement)
# has an expected value less than five, which may
# invalidate the chi-square approximation
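# As a cross-check, the expected counts are the outer product of the margins
# divided by the grand total; this reproduces cst2$expected:
outer(rowSums(table2), colSums(table2))/sum(table2)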
# In contrast to many statistical packages, the fisher.test()
# function can be applied to any two-way table
# with two or more rows and columns, not just a 2 × 2 table.
fisher.test(table2)#fisher test
##
## Fisher's Exact Test for Count Data
##
## data: table2
## p-value = 0.1094
## alternative hypothesis: two.sided
# p is not less than 0.05 Donot Reject H0
# gender and treatment outcomes are not related
#--------------------------------------------------------------------------------------
# Correlation for numeric variables - Test of Independence
#--------------------------------------------------------------------------------------
# Correlation coefficients are used to describe relationships among quantitative variables.
# The sign ± indicates the direction of the relationship (positive or inverse), and
# the magnitude indicates the strength of the relationship
# (ranging from 0 for no relationship to 1 for a perfectly predictable relationship).
# Types of correlations: R can produce a variety of correlation coefficients,
# including Pearson, Spearman,Kendall, partial, polychoric, and polyserial
# PEARSON, SPEARMAN, AND KENDALL CORRELATIONS
# The Pearson product-moment correlation assesses the degree of linear relationship between two quantitative variables.
# Spearman’s rank-order correlation coefficient assesses the degree of relationship between two rank-ordered variables.
# Kendall’s tau is also a nonparametric measure of rank correlation.
# The cor() function produces all three correlation coefficients,
# whereas the cov() function provides covariances.
# cor(x, use= , method= )
# x Matrix or data frame.
# use Specifies the handling of missing data.
# The options are all.obs (assumes no missing data—missing data will produce an error),
# everything (any correlation involving a case with missing values will be set to missing),
# complete.obs (listwise deletion),
# and pairwise.complete.obs (pairwise deletion).
#
# method Specifies the type of correlation. The options are pearson, spearman, and kendall
# The default options are use="everything" and method="pearson"
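# A small sketch (toy vectors, made up for this note) of how the use= argument
# changes the handling of missing values:
u<-c(1,2,3,NA,5)
v<-c(2,4,6,8,10)
cor(u,v)#default use="everything" propagates the NA
cor(u,v,use="complete.obs")#drops the pair containing the NA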
states<- state.x77[,1:6]
str(states)
## num [1:50, 1:6] 3615 365 2212 2110 21198 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## ..$ : chr [1:6] "Population" "Income" "Illiteracy" "Life Exp" ...
head(states)
## Population Income Illiteracy Life Exp Murder HS Grad
## Alabama 3615 3624 2.1 69.05 15.1 41.3
## Alaska 365 6315 1.5 69.31 11.3 66.7
## Arizona 2212 4530 1.8 70.55 7.8 58.1
## Arkansas 2110 3378 1.9 70.66 10.1 39.9
## California 21198 5114 1.1 71.71 10.3 62.6
## Colorado 2541 4884 0.7 72.06 6.8 63.9
cov(states)
## Population Income Illiteracy Life Exp Murder
## Population 19931683.7588 571229.7796 292.8679592 -407.8424612 5663.523714
## Income 571229.7796 377573.3061 -163.7020408 280.6631837 -521.894286
## Illiteracy 292.8680 -163.7020 0.3715306 -0.4815122 1.581776
## Life Exp -407.8425 280.6632 -0.4815122 1.8020204 -3.869480
## Murder 5663.5237 -521.8943 1.5817755 -3.8694804 13.627465
## HS Grad -3551.5096 3076.7690 -3.2354694 6.3126849 -14.549616
## HS Grad
## Population -3551.509551
## Income 3076.768980
## Illiteracy -3.235469
## Life Exp 6.312685
## Murder -14.549616
## HS Grad 65.237894
cor(states)
## Population Income Illiteracy Life Exp Murder HS Grad
## Population 1.00000000 0.2082276 0.1076224 -0.06805195 0.3436428 -0.09848975
## Income 0.20822756 1.0000000 -0.4370752 0.34025534 -0.2300776 0.61993232
## Illiteracy 0.10762237 -0.4370752 1.0000000 -0.58847793 0.7029752 -0.65718861
## Life Exp -0.06805195 0.3402553 -0.5884779 1.00000000 -0.7808458 0.58221620
## Murder 0.34364275 -0.2300776 0.7029752 -0.78084575 1.0000000 -0.48797102
## HS Grad -0.09848975 0.6199323 -0.6571886 0.58221620 -0.4879710 1.00000000
cor(states, method="spearman")
## Population Income Illiteracy Life Exp Murder HS Grad
## Population 1.0000000 0.1246098 0.3130496 -0.1040171 0.3457401 -0.3833649
## Income 0.1246098 1.0000000 -0.3145948 0.3241050 -0.2174623 0.5104809
## Illiteracy 0.3130496 -0.3145948 1.0000000 -0.5553735 0.6723592 -0.6545396
## Life Exp -0.1040171 0.3241050 -0.5553735 1.0000000 -0.7802406 0.5239410
## Murder 0.3457401 -0.2174623 0.6723592 -0.7802406 1.0000000 -0.4367330
## HS Grad -0.3833649 0.5104809 -0.6545396 0.5239410 -0.4367330 1.0000000
# you can see, for example, that a strong positive correlation exists
# between income and high school graduation rate and
# that a strong negative correlation exists
# between illiteracy rates and life expectancy.
corstate<-cor(states)
class(corstate)
## [1] "matrix" "array"
#Better way of presenting correlation
x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")]
y <- states[,c("Life Exp", "Murder")]
cor(x,y)
## Life Exp Murder
## Population -0.06805195 0.3436428
## Income 0.34025534 -0.2300776
## Illiteracy -0.58847793 0.7029752
## HS Grad 0.58221620 -0.4879710
#H0: There is no correlation between Illiteracy and murder
#Ha: There is correlation between Illiteracy and murder
cor.test(states[,3], states[,5])#Illiteracy vs Murder
##
## Pearson's product-moment correlation
##
## data: states[, 3] and states[, 5]
## t = 6.8479, df = 48, p-value = 1.258e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5279280 0.8207295
## sample estimates:
## cor
## 0.7029752
# p is less than alpha 0.05 Reject H0
# There is correlation between Illiteracy and murder
print("########################## THE ### END ###############################################")
## [1] "########################## THE ### END ###############################################"
# Independent two sample t test
# Equal sample sizes and equal variance : degrees of freedom = 2n-2
# Unequal sample sizes and equal variance: degrees of freedom = n1+n2-2
# Equal or unequal sample sizes, unequal variances: this test, also known as Welch's t-test,
# is used when the two population variances are not assumed to be equal, and the Welch-Satterthwaite degrees-of-freedom formula should be used (see the sketch below)
# Dependent two sample t test degrees of freedom is n-1 where n is number of pairs
#
# we can calculate a t-statistic that will follow a t-distribution with n1 + n2 - 2 degrees of freedom.
# There is also a widely used modification of the t-test, known as Welch's t-test that adjusts the number of degrees of freedom when the variances are thought not to be equal to each other.
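# A minimal sketch (reusing the oj and vc subsets created earlier) of the
# Welch-Satterthwaite degrees-of-freedom formula t.test() applies when
# var.equal=FALSE:
welch_df<-function(x,y){
  vx<-var(x)/length(x); vy<-var(y)/length(y)#variance of each sample mean
  (vx+vy)^2/(vx^2/(length(x)-1)+vy^2/(length(y)-1))
}
welch_df(oj$len,vc$len)#should match the df reported by t.test (about 55.3)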
#Randomized Block Design - Friedman Test
#For the Friedman test, the format is
# friedman.test(y ~ A | B, data)
# where y is the numeric outcome variable, A is a grouping factor, and B is a blocking
# factor that identifies matched observations. data is an optional argument
# specifying a matrix or data frame containing the variables
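# A minimal sketch of the formula interface on made-up long-format data
# (4 judges each rating 3 wines; scores invented for illustration):
wines<-data.frame(score=c(5,7,6, 4,6,5, 7,8,8, 3,5,4),
                  wine=factor(rep(c("W1","W2","W3"),4)),
                  judge=factor(rep(paste0("J",1:4),each=3)))
friedman.test(score ~ wine | judge, data=wines)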