Profile Setup

knitr::opts_chunk$set(echo = TRUE)


Simulation Exercise

Show the sample mean and compare it to the theoretical mean of the distribution.

set.seed(122)
lambda = 0.2
OrigExpD_0.2 = rexp(n = 40,rate = 0.2)
samp_mean = mean(OrigExpD_0.2)
theo_mean = 1/lambda
theo_mean 
## [1] 5
samp_mean
## [1] 5.825981

Compared to the theretical mean and the sample mean, the sample mean is a little bit larger than the theoretical mean. The sample mean is 5.8259814, and the theoretical mean is 5.

Show the sample varance and compare it to the theoretical variance of the distribution.

samp_var = var(OrigExpD_0.2)
theo_var = (1/lambda)^2
samp_var
## [1] 58.96815
theo_var
## [1] 25

Compared to the theretical variance and the sample varaince, the sample variance is larger than the theoretical variance. The sample variance is 58.9681518, and the theoretical mean is 25.

Show that the sampling distribution of mean is approximately normal..

First we generate 1000 exponential data to see the distribution.

set.seed(11)
hist(rexp(1000,rate = 0.2),breaks = 50,main = "Histogram of 1000 Exp Data",xlab = "Value",col = "sky blue")

Then, we set the sample size is 40, and the times of simulation is 1000 to create the sampling distribution of the mean of exponential data.

set.seed(123)
row = 1000
col = 40
da = matrix(rexp(40*1000,rate = 0.2),row,col)
daf = apply(da,1,mean)
hist(daf,breaks = 50,main = "Sampling distribution of the Exp Data",xlab = "Value",col = "pink")

par(mfrow=c(1,2),mar = c(4,4,4,5))
set.seed(11)
hist(rexp(1000,rate = 0.2),breaks = 50,main = "Histogram of 1000 Exp Data",xlab = "Value",col = "sky blue")
set.seed(123)
row = 1000
col = 40
da = matrix(rexp(40*1000,rate = 0.2),row,col)
daf = apply(da,1,mean)
hist(daf,breaks = 50,main = "Sampling distribution of the Exp Data",xlab = "Value",col = "pink")
abline(v = mean(daf))

mean(daf)
## [1] 5.011911

We can compared the distribution from the original 1000 data, the original distribution is extremely right skewness, however, the sampling distribution is approximately normal distributed. I label the the mean value of the sampling distribution, which is 5.011911, which is really close to the theoretical value of the exponential data. Moreover, this reasonable. According to the Central limit Theorem, if the sample size is large eough, whatever the distribution of the original data, we can always get the approximately normal distribtuion. You can see from this figure, the right plot is right skewed, but the left plot is approximately nomal.

Basic Inferential Data Analysis

Load Data and Basic Data Analysis

library(datasets)
data("ToothGrowth")
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
hist(ToothGrowth$len,col = "sky blue")

boxplot(len~supp+dose,data = ToothGrowth,col = "pink")

We can see the basic information of the len variable in the dataset. Moreover, we can compare the len according to the factor variable of supp and dose.

Basic summary of the data

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Confidence Interval/Hypothesis tests to compare tooth growth by supp and dose

In these tests, because there are 6 different groups, as a result, the sample size is relative small, we shall use the t.test function to find the confidence interval or to use for the hypothesis test.

SplitData = split(ToothGrowth,f = interaction(ToothGrowth$supp,ToothGrowth$dose))
names(SplitData)
## [1] "OJ.0.5" "VC.0.5" "OJ.1"   "VC.1"   "OJ.2"   "VC.2"

Compared OJ.0.5 and VC.0.5. We set the null hyphothesis is H0: There is not difference between these two groups, then the alternating hyphothsis is there exist some difference. The we set the siginificant level as 5%.

t.test(SplitData[[1]][,1],SplitData[[2]][,1],alternative = "two.sided",mu = 0,paired = FALSE,conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  SplitData[[1]][, 1] and SplitData[[2]][, 1]
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean of x mean of y 
##     13.23      7.98

As a result, for the OJ.0.5 and VC.0.5 group, the p-value is 0.006359, which is less than the significant level 0.05, as a result, we should reject the null hypothesis and accept the alternating hypothesis, and state these is a difference between OJ.0.5 and VC.0.5.

Also, we can observe the confidence interval, which is [1.719057, 8.780943], which does not contain 0.

Also, we can do the same process for the rest of the groups.

Compared OJ.0.5 and OJ.1. We set the null hyphothesis is H0: There is not difference between these two groups, then the alternating hyphothsis is there exist some difference. The we set the siginificant level as 5%.

t.test(SplitData[[1]][,1],SplitData[[3]][,1],alternative = "two.sided",mu = 0,paired = FALSE,conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  SplitData[[1]][, 1] and SplitData[[3]][, 1]
## t = -5.0486, df = 17.698, p-value = 8.785e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.415634  -5.524366
## sample estimates:
## mean of x mean of y 
##     13.23     22.70

Here is another the example to compare OJ.0.5 and OJ.1. We can see the p-value is close to 0, as a result, we should reject the null hypothesis and accept the alternating hypothesis, and state these is a difference between OJ.0.5 and OJ.1.

Also, we can observe the confidence interval, which is [-13.415634, -5.524366], which does not contain 0.

We can use the same method to the rest of groups, or we can use the ANOVA test to directly compare all these 6 groups.

library(statsr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ToothGrowth1 = ToothGrowth %>%
        mutate(index = interaction(supp,dose))
inference(y = len, x = index,data = ToothGrowth1,statistic = "mean",method = "theoretical",
          type = "ht",null = 0,conf_level = 0.95,alternative = "greater")
## Warning: Ignoring null value since it's undefined for ANOVA
## Response variable: numerical
## Explanatory variable: categorical (6 levels) 
## n_OJ.0.5 = 10, y_bar_OJ.0.5 = 13.23, s_OJ.0.5 = 4.4597
## n_VC.0.5 = 10, y_bar_VC.0.5 = 7.98, s_VC.0.5 = 2.7466
## n_OJ.1 = 10, y_bar_OJ.1 = 22.7, s_OJ.1 = 3.911
## n_VC.1 = 10, y_bar_VC.1 = 16.77, s_VC.1 = 2.5153
## n_OJ.2 = 10, y_bar_OJ.2 = 26.06, s_OJ.2 = 2.6551
## n_VC.2 = 10, y_bar_VC.2 = 26.14, s_VC.2 = 4.7977
## 
## ANOVA:
##           df    Sum_Sq  Mean_Sq       F  p_value
## index      5 2740.1033 548.0207 41.5572 < 0.0001
## Residuals 54   712.106  13.1871                 
## Total     59 3452.2093                          
## 
## Pairwise tests - t tests with pooled SD:
## # A tibble: 15 x 3
##    group1 group2  p.value
##    <chr>  <chr>     <dbl>
##  1 VC.0.5 OJ.0.5 2.09e- 3
##  2 OJ.1   OJ.0.5 3.18e- 7
##  3 VC.1   OJ.0.5 3.37e- 2
##  4 OJ.2   OJ.0.5 1.43e-10
##  5 VC.2   OJ.0.5 1.19e-10
##  6 OJ.1   VC.0.5 1.97e-12
##  7 VC.1   VC.0.5 1.46e- 6
##  8 OJ.2   VC.0.5 1.34e-15
##  9 VC.2   VC.0.5 1.13e-15
## 10 VC.1   OJ.1   5.90e- 4
## 11 OJ.2   OJ.1   4.34e- 2
## 12 VC.2   OJ.1   3.88e- 2
## 13 OJ.2   VC.1   4.77e- 7
## 14 VC.2   VC.1   3.98e- 7
## 15 VC.2   OJ.2   9.61e- 1

As a result, we can see the 15 comparision from this inference function.