Sampling Distribution (Practical): List 2

Real datasets in R

iris
This data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

data("iris")
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# ?iris   (uncomment to open the help page for this data set)
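Before running any test, it can help to confirm the structure described above. The following is a minimal sketch using only base R; the object names species_counts and species_means are illustrative and not part of the original handout.

```r
data("iris")

# Number of flowers per species (the description above says 50 each)
species_counts <- table(iris$Species)
species_counts

# Mean of each measurement (in cm) within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```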

ToothGrowth
The ToothGrowth data set contains the results from an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).

len: Tooth length

supp: Supplement type (VC or OJ).

dose: Dose in milligrams/day (numeric)

data("ToothGrowth")
  
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
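The two-supplement by three-dose design described above can be checked directly. This is a small sketch in base R; design and cell_means are illustrative names.

```r
data("ToothGrowth")

# Animals per supplement/dose combination (60 animals, 2 x 3 design)
design <- table(ToothGrowth$supp, ToothGrowth$dose)
design

# Mean tooth length in each cell of the design
cell_means <- tapply(ToothGrowth$len,
                     list(ToothGrowth$supp, ToothGrowth$dose),
                     mean)
round(cell_means, 2)
```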

PlantGrowth
Results obtained from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

data("PlantGrowth")
  
head(PlantGrowth)

##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl

Many More Real Datasets in R
The R Datasets Package

This package contains a variety of datasets. For a complete list, use library(help = "datasets").

Author(s): R Core Team and contributors worldwide

Reference:

https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

# install.packages() is not needed here: the datasets package ships with base R and is attached by default
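The list of available data sets can also be obtained programmatically. The sketch below relies on data() returning an object whose results component is a character matrix with "Item" and "Title" columns; ds is an illustrative name.

```r
# One row per data set: "Item" is the name, "Title" a short description
ds <- data(package = "datasets")$results
head(ds[, c("Item", "Title")])

# Check that the three data sets used above are available
c("iris", "ToothGrowth", "PlantGrowth") %in% ds[, "Item"]
```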

==================================================================

One Sample T-test
The one-sample t-test is used to test whether a sample mean differs significantly from a known or hypothesized value of the population mean.
To perform a one-sample t-test in R, use the syntax t.test(y, mu = 0), where y is the name of the variable of interest and mu is set equal to the mean specified by the null hypothesis.

sweetSold <- c(rnorm(50, mean = 10, sd = 3))
t.test(sweetSold, mu = 15) # H0: mu = 15

## 
##  One Sample t-test
## 
## data:  sweetSold
## t = -11.362, df = 49, p-value = 2.458e-15
## alternative hypothesis: true mean is not equal to 15
## 95 percent confidence interval:
##   9.217681 10.955745
## sample estimates:
## mean of x 
##  10.08671
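The statistic that t.test() reports is t = (sample mean - mu0) / (s / sqrt(n)) with n - 1 degrees of freedom. As a rough check, it can be reproduced by hand. The seed and the names x, mu0, t_stat, p_val and fit are assumptions made for this sketch, so the numbers will differ from the output above.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x   <- rnorm(50, mean = 10, sd = 3)
mu0 <- 15    # value of the mean under H0

n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))  # test statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)     # two-sided p-value

fit <- t.test(x, mu = mu0)
all.equal(unname(fit$statistic), t_stat)  # same statistic
all.equal(fit$p.value, p_val)             # same p-value
```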

Two sample T-test

The two-sample t-test helps us judge whether the difference between two means is real or could have arisen by chance. The general form of the test is t.test(y1, y2, paired = FALSE). By default, R assumes that the variances of y1 and y2 are unequal and therefore runs Welch’s test. To use the pooled-variance test instead, set var.equal = TRUE.

shopOne <- rnorm(50, mean = 140, sd = 4.5)
shopTwo <- rnorm(50, mean = 150, sd = 4)
 
t.test(shopOne, shopTwo, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  shopOne and shopTwo
## t = -13.206, df = 98, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.258761  -8.317024
## sample estimates:
## mean of x mean of y 
##  140.4616  150.2495
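To see what var.equal changes, the Welch and pooled-variance versions can be run side by side. This is an illustrative sketch with an arbitrary seed; y1, y2, welch and pooled are names chosen here, not part of the original example.

```r
set.seed(2)  # arbitrary seed for reproducibility
y1 <- rnorm(50, mean = 140, sd = 4.5)
y2 <- rnorm(50, mean = 150, sd = 4)

welch  <- t.test(y1, y2)                    # default: var.equal = FALSE
pooled <- t.test(y1, y2, var.equal = TRUE)  # pooled-variance test

# The pooled test uses n1 + n2 - 2 degrees of freedom;
# Welch's approximation gives a value no larger than that
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))

# With equal group sizes the two t statistics coincide; only the df differ
c(welch = unname(welch$statistic), pooled = unname(pooled$statistic))
```

When the group sizes or variances differ markedly, the two versions can give noticeably different results, which is one reason Welch's test is the default.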

Paired Sample T-test

This is a statistical procedure that is used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject is measured two times, resulting in pairs of observations.

The test is run using the syntax t.test(y1, y2, paired=TRUE)

sweetOne <- c(rnorm(100, mean = 14, sd = 0.3))
sweetTwo <- c(rnorm(100, mean = 13, sd = 0.2))
 
t.test(sweetOne, sweetTwo, paired = TRUE)

## 
##  Paired t-test
## 
## data:  sweetOne and sweetTwo
## t = 28.778, df = 99, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9656321 1.1086519
## sample estimates:
## mean of the differences 
##                1.037142
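A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below illustrates this with an arbitrary seed; the names y1, y2, paired_fit and diff_fit are assumptions for the example.

```r
set.seed(3)  # arbitrary seed for reproducibility
y1 <- rnorm(100, mean = 14, sd = 0.3)
y2 <- rnorm(100, mean = 13, sd = 0.2)

paired_fit <- t.test(y1, y2, paired = TRUE)  # paired test
diff_fit   <- t.test(y1 - y2, mu = 0)        # one-sample test on differences

all.equal(unname(paired_fit$statistic), unname(diff_fit$statistic))
all.equal(paired_fit$p.value, diff_fit$p.value)
```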

Exercise 1

set.seed(101)

my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)

summary(my_data$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.70   19.65   20.50   20.48   21.18   22.30

Visualization of data using box plots

library(ggpubr)

## Loading required package: ggplot2

ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())

Preliminary test to check one-sample t-test assumptions:

Use the Shapiro-Wilk normality test and look at the normality plot.

Shapiro-Wilk test:
Null hypothesis: the data are normally distributed
Alternative hypothesis: the data are not normally distributed

shapiro.test(my_data$weight)

## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$weight
## W = 0.98074, p-value = 0.969

Q-Q plot

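Normality can also be inspected visually using a Q-Q plot (quantile-quantile plot). A Q-Q plot draws the sample values against the corresponding quantiles of the normal distribution; points lying close to the reference line suggest that the data are approximately normal.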

library("ggpubr")
ggqqplot(my_data$weight, ylab = "Weight (g)",
         ggtheme = theme_minimal())
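If ggpubr is not installed, the same check can be sketched with base R. Here w is an illustrative name for the weights, regenerated with the same seed as my_data above; qqnorm() with plot.it = FALSE returns the coordinates instead of drawing them.

```r
set.seed(101)
w <- round(rnorm(10, 20, 2), 1)  # regenerated with the same seed as my_data

# x = theoretical normal quantiles, y = the matching sample values
qq <- qqnorm(w, plot.it = FALSE)

# Correlation of the Q-Q points; values near 1 mean the points
# lie close to a straight line
cor(qq$x, qq$y)
```

Dropping plot.it = FALSE and then calling qqline(w) draws the plot with a reference line.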

Compute one-sample t-test

# One-sample t-test
res <- t.test(my_data$weight, mu = 25)
# Printing the results
res

## 
##  One Sample t-test
## 
## data:  my_data$weight
## t = -12.496, df = 9, p-value = 5.45e-07
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
##  19.66172 21.29828
## sample estimates:
## mean of x 
##     20.48
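The object returned by t.test() is a list of class "htest", so individual results can be extracted rather than read off the printout. A brief sketch follows; w and alpha are illustrative names, and the data are regenerated with the same seed as above.

```r
set.seed(101)
w   <- round(rnorm(10, 20, 2), 1)
res <- t.test(w, mu = 25)

res$statistic  # t value
res$parameter  # degrees of freedom
res$p.value    # p-value
res$conf.int   # confidence interval for the mean
res$estimate   # sample mean

# Decision at the 5% level: TRUE means reject H0
alpha <- 0.05
res$p.value < alpha
```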

Remarks:

If you want to test whether the mean weight of mice is less than 25g (one-tailed test), type this:

t.test(my_data$weight, mu = 25,
              alternative = "less")

## 
##  One Sample t-test
## 
## data:  my_data$weight
## t = -12.496, df = 9, p-value = 2.725e-07
## alternative hypothesis: true mean is less than 25
## 95 percent confidence interval:
##      -Inf 21.14308
## sample estimates:
## mean of x 
##     20.48

Or, if you want to test whether the mean weight of mice is greater than 25g (one-tailed test), type this:

t.test(my_data$weight, mu = 25,
              alternative = "greater")

## 
##  One Sample t-test
## 
## data:  my_data$weight
## t = -12.496, df = 9, p-value = 1
## alternative hypothesis: true mean is greater than 25
## 95 percent confidence interval:
##  19.81692      Inf
## sample estimates:
## mean of x 
##     20.48

Interpretation of the result:

The p-value of the two-sided test (5.45e-07) is less than the significance level alpha = 0.05. We can conclude that the mean weight of the mice is significantly different from 25g.
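The three p-values reported above are related: the two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them (2 x 2.725e-07 = 5.45e-07). A small sketch of this check, with illustrative variable names:

```r
set.seed(101)
w <- round(rnorm(10, 20, 2), 1)

p_two     <- t.test(w, mu = 25)$p.value
p_less    <- t.test(w, mu = 25, alternative = "less")$p.value
p_greater <- t.test(w, mu = 25, alternative = "greater")$p.value

all.equal(p_less + p_greater, 1)              # the two tails sum to 1
all.equal(p_two, 2 * min(p_less, p_greater))  # twice the smaller tail
```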

========================================================

Simulated data Example: Paired T Test

Suppose a training program was conducted to improve the participants’ knowledge of statistics. Data were collected from a selected sample of 10 individuals before and after the statistics training program. Test the hypothesis that the training is effective in improving the participants’ knowledge of statistics at the 5% level of significance (95% confidence level).

H0: there is no difference in participants’ knowledge before and after the statistics training

H1: the statistics training improved the participants’ knowledge

We shall test the null hypothesis against this alternative hypothesis.

Statistics training data

Let’s create this data set in R. First, create before and after as objects containing the scores of statistics training.

before <- c(11.2, 14.5, 12.4, 12.2, 12.7, 10.4, 15.8, 13.8, 8.5, 14.1)
after  <- c(12.5, 15.3, 13.6, 12.7, 13.9, 11.3, 16.6, 14.5, 8.2, 14.7)

stdata <- data.frame(subject = rep(c(1:10), 2), 
                   time = rep(c("before", "after"), each = 10),
                   score = c(before, after))
print(stdata)

##    subject   time score
## 1        1 before  11.2
## 2        2 before  14.5
## 3        3 before  12.4
## 4        4 before  12.2
## 5        5 before  12.7
## 6        6 before  10.4
## 7        7 before  15.8
## 8        8 before  13.8
## 9        9 before   8.5
## 10      10 before  14.1
## 11       1  after  12.5
## 12       2  after  15.3
## 13       3  after  13.6
## 14       4  after  12.7
## 15       5  after  13.9
## 16       6  after  11.3
## 17       7  after  16.6
## 18       8  after  14.5
## 19       9  after   8.2
## 20      10  after  14.7

Visualization of the Data:

par(pty = "s")
boxplot(stdata$score ~ stdata$time)

Association or correlation test:

Now let’s look at the association or correlation between the paired samples. Use the cor.test() function to test this association. Pass the before and after scores as the x and y arguments. For the method argument we use Pearson, which is the most commonly used method. Let’s test this relationship at the 0.95 confidence level.

library(stats)
cor.test(x = before, y = after, 
         method = c("pearson"), 
         conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  before and after
## t = 15.187, df = 8, p-value = 3.5e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9277019 0.9961329
## sample estimates:
##       cor 
## 0.9830963

Interpretation of the Result:

The results show that there is a significant, strong relationship between the before and after training scores. The correlation coefficient is 0.983, which is very close to one. It reflects a strong positive relationship or association between the before and after statistics training scores.
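The t statistic that cor.test() reports for a Pearson correlation is t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. As a rough check it can be recomputed by hand from the same scores; r, n, t_stat and p_val are illustrative names.

```r
before <- c(11.2, 14.5, 12.4, 12.2, 12.7, 10.4, 15.8, 13.8, 8.5, 14.1)
after  <- c(12.5, 15.3, 13.6, 12.7, 13.9, 11.3, 16.6, 14.5, 8.2, 14.7)

r <- cor(before, after)                     # Pearson correlation
n <- length(before)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t statistic with n - 2 df
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

round(c(r = r, t = t_stat), 4)
p_val
```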

========================================================

Paired sample t-test

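# Paired t-test on the after - before differences.
# The two vectors are passed directly: recent R versions no longer accept
# paired = TRUE with the formula interface, and var.equal has no effect
# in a paired test, so it is dropped.
t.test(after, before,
       alternative = "greater",
       mu = 0,
       paired = TRUE,
       conf.level = 0.95)

## 
##  Paired t-test
## 
## data:  after and before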
## t = 5.2705, df = 9, p-value = 0.0002568
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.502187      Inf
## sample estimates:
## mean of the differences 
##                    0.77
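The paired test above works on the differences d = after - before. The sketch below recomputes the mean difference, the t statistic, the one-sided p-value and the lower confidence bound by hand. The names d, se, t_stat, p_val and lower are illustrative.

```r
before <- c(11.2, 14.5, 12.4, 12.2, 12.7, 10.4, 15.8, 13.8, 8.5, 14.1)
after  <- c(12.5, 15.3, 13.6, 12.7, 13.9, 11.3, 16.6, 14.5, 8.2, 14.7)

d  <- after - before   # within-subject differences
n  <- length(d)
se <- sd(d) / sqrt(n)  # standard error of the mean difference

t_stat <- mean(d) / se
p_val  <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # H1: mean difference > 0
lower  <- mean(d) - qt(0.95, df = n - 1) * se         # one-sided 95% bound

round(c(mean_diff = mean(d), t = t_stat, lower = lower), 4)
p_val
```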

Interpretation of the result:

The results show that the p-value is lower than 0.05. The lower the p-value, the stronger the evidence against the null hypothesis. Based on this result, we reject the null hypothesis of no difference. It means the statistics training significantly improved the participants’ knowledge.

      EXERCISES / H.W
     -----------------
     
      For the above real-life datasets, perform hypothesis tests of significance and compute confidence intervals for a single mean, for the difference of two means, and for paired samples in R.
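As a possible starting point, the sketch below shows one way to set up each kind of test on built-in data. Several choices here are assumptions rather than part of the exercise: the null value 5 is arbitrary, and the sleep data set (extra hours of sleep for the same 10 patients under two drugs) is used for the paired case because the three data sets above contain no paired measurements.

```r
# One-sample: is the mean control-group weight in PlantGrowth equal to 5?
# (5 is an arbitrary illustrative null value)
ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]
one_sample <- t.test(ctrl, mu = 5)

# Two-sample: tooth length under OJ versus VC (Welch test by default)
two_sample <- t.test(len ~ supp, data = ToothGrowth)

# Paired: rows of sleep are ordered by patient ID within each group,
# so the two vectors line up patient by patient
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]
paired <- t.test(g1, g2, paired = TRUE)

# 95% confidence intervals from each test
one_sample$conf.int
two_sample$conf.int
paired$conf.int
```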

References:

http://www.sthda.com/english/wiki/one-sample-t-test-in-r

https://www.r-bloggers.com/2021/10/paired-sample-t-test-using-r/

http://www.sthda.com/english/wiki/f-test-compare-two-variances-in-r

https://www.geeksforgeeks.org/t-test-approach-in-r-programming/

http://www.sthda.com/english/wiki/r-built-in-data-sets

https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html