In this lab we will use t.test() and BinomCI() to compute confidence intervals.
NOTE: As we discussed in lectures, the calculation of a confidence interval depends on the TYPE of variable you have (categorical vs. numerical). Below is some introductory/sample code for determining CIs for each variable type:
For either data type, “desired confidence” is a value between 0 and 1. If I ask for 90% confidence, you need to enter 0.9.
FOR CATEGORICAL (this means you are essentially creating proportional data): Install and load the package in the code chunk below
#install.packages("DescTools")
library(DescTools)
FOR ALL THE CODE BELOW: things have been commented out. If you copy and paste them, please make sure the # is removed.
### Confidence Interval for Proportions
You will need the number of cases that are positive (x) from your data set as well as the total number of observations (n)
Access number of positive cases
#table(dataframe_name$categorical_column)
Access Total number of cases
#length(dataframe_name$categorical_column)
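As a small illustration of the two commands above, the sketch below builds a made-up data frame and pulls x and n out of it. The names survey_df and answer, and the level "yes", are assumptions invented for this example; substitute your own data frame, column, and whichever level counts as positive.

```r
# Hypothetical data, invented for illustration only
survey_df <- data.frame(
  answer = c("yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "yes")
)

# Counts of each level of the categorical column
counts <- table(survey_df$answer)
counts

# x: number of positive cases, taken from the table by level name
x <- counts[["yes"]]

# n: total number of observations
n <- length(survey_df$answer)

# Sample proportion
x / n
```

Keep in mind that length() counts every entry in the column, including any missing values (NA), so check for those before using it as n.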
NOTE: There are two common methods for calculating CIs for proportions: the Wald method and the Agresti-Coull method. They differ only by a slight change in the calculation. The Wald method is a good default; Agresti-Coull is better for smaller sample sizes.
Confidence Interval - Wald Method
#BinomCI(x = number of positive cases, n = total cases, conf.level = desired confidence, method = "wald")
Confidence Interval - Agresti-Coull Method
#BinomCI(x = number of positive cases, n = total cases, conf.level = desired confidence, method = "agresti-coull")
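To make the "slight change in the calculation" concrete, here is a base-R sketch of the textbook formulas for both intervals. The helper names wald_ci and agresti_coull_ci are hypothetical and written only for this illustration; they are not part of DescTools, and BinomCI() may differ in details such as clipping the limits to the range 0 to 1.

```r
# Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
wald_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p_hat <- x / n
  margin <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - margin, upper = p_hat + margin)
}

# Agresti-Coull interval: same form, but built on adjusted counts
agresti_coull_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  n_adj <- n + z^2
  p_adj <- (x + z^2 / 2) / n_adj
  margin <- z * sqrt(p_adj * (1 - p_adj) / n_adj)
  c(lower = p_adj - margin, upper = p_adj + margin)
}

# Hypothetical counts: 12 positive cases out of 40, at 90% confidence
wald_ci(12, 40, conf.level = 0.9)
agresti_coull_ci(12, 40, conf.level = 0.9)
```

The only difference between the two functions is that Agresti-Coull adds z^2 / 2 to the positive count and z^2 to the total before applying the same formula, which pulls the centre of the interval toward 0.5.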
FOR NUMERICAL:
Confidence Interval for a Mean
#t.test(dataframe_name$quantitative_column, conf.level = desired confidence)
Or directly access the Confidence Interval
#t.test(dataframe_name$quantitative_column, conf.level = desired confidence)$conf.int
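For reference, the interval that t.test() reports can be reproduced by hand as the sample mean plus or minus a t critical value times the standard error. The sketch below does this on a made-up vector; heights and conf_level are hypothetical names and values chosen only for illustration.

```r
# Hypothetical measurements, invented for illustration only
heights <- c(12.1, 9.8, 11.4, 10.6, 13.0, 10.9, 12.5, 11.7)
conf_level <- 0.9

n <- length(heights)
x_bar <- mean(heights)
std_err <- sd(heights) / sqrt(n)

# Critical value from the t distribution with n - 1 degrees of freedom
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)

# Manual confidence interval: mean +/- critical value * standard error
manual_ci <- c(x_bar - t_crit * std_err, x_bar + t_crit * std_err)
manual_ci

# Should agree with the interval reported by t.test()
t.test(heights, conf.level = conf_level)$conf.int
```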
In a study involving how rice grows across different nutrient treatments, researchers randomly selected plots where there was a mix of wild-type and gmo rice growing. Please bring in the “rice.csv” file and call it “rice_df”
rice_df <- read.table('rice.csv', sep = ',', header = TRUE)
rice_df
Using the code chunk below, write R commands to list the variables in rice_df and display their types:
variables<-colnames(rice_df)
variables
## [1] "PlantNo" "Block" "RootDryMass" "ShootDryMass" "trt"
## [6] "fert" "variety"
str(rice_df)
## 'data.frame': 72 obs. of 7 variables:
## $ PlantNo : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Block : int 1 1 1 1 1 1 2 2 2 2 ...
## $ RootDryMass : int 56 66 40 43 55 66 41 67 40 35 ...
## $ ShootDryMass: int 132 120 108 134 119 125 98 122 114 82 ...
## $ trt : chr "F10" "F10" "F10" "F10" ...
## $ fert : chr "F10" "F10" "F10" "F10" ...
## $ variety : chr "wt" "wt" "wt" "wt" ...
For the variety variable, which confidence interval would be appropriate to use: C.I. for a mean or a C.I. for a proportion?
1. A CI for a proportion is most appropriate because variety is a categorical variable.
2. (See the code and output below.)
3. The est portion of the code output tells us that a proportion of 0.5 (or 50%) of the plants in the “variety” column are of the “wt” variety. The lwr.ci and upr.ci portions give the lower and upper limits of the 97% confidence interval: 0.3721262 and 0.6278738 (or 37.2% and 62.8%), respectively.
x <- sum(rice_df$variety == "wt")
n <- length(rice_df$variety)
print(x)
## [1] 36
print(n)
## [1] 72
BinomCI(x, n, conf.level = 0.97, method = "wald")
## est lwr.ci upr.ci
## [1,] 0.5 0.3721262 0.6278738
For the ShootDryMass variable, which confidence interval would be appropriate to use: C.I. for a mean or a C.I. for a proportion?
1. A CI for a mean is appropriate because ShootDryMass is a numerical variable.
2. (See the code and output below.)
3. The output from the code below provides us with a lot of statistical information. The t-value of 13.971 measures how far the sample mean is from the null hypothesis mean of 0, in units of standard error. The degrees of freedom are 71, which is one less than the sample size of 72. The p-value of < 2.2e-16 is very small, telling us that the mean differs significantly from 0. The output also tells us that the sample mean of the column is 59.56 and that the lower and upper limits of the 97% confidence interval are 50.11550 and 68.99561, respectively.
t.test(rice_df$ShootDryMass, conf.level = 0.97)
##
## One Sample t-test
##
## data: rice_df$ShootDryMass
## t = 13.971, df = 71, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 97 percent confidence interval:
## 50.11550 68.99561
## sample estimates:
## mean of x
## 59.55556
In this final section, we will continue to work with the
rice_df. We are now interested in whether the variety of
rice (wt or gmo) influences the growth of the rice plants.
Recall that the summary() function will compute the
five-number summary for a quantitative data set. Using
tapply() compute the “summary” of the ShootDryMass for each
plant broken up by variety:
tapply(rice_df$ShootDryMass, rice_df$variety, summary)
## $gmo
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.50 40.00 41.81 62.75 100.00
##
## $wt
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.00 47.50 76.00 77.31 108.25 134.00
Using the code chunk below, slice your rice_df into two
dataframes: wt and gmo. The wt
dataframe should include the ShootDryMass data for only the wild-type in
the sample, and the gmo dataframe should do the same for
gmo rice.
The basic way to do this uses the following two steps. The first step flags the rows in your dataframe that contain the value you are looking to isolate; you need to look for the exact phrase! The second step takes the flagged rows and creates a new dataframe (which you control the name of).
# create a variable for desired rows
#desired_rows <- dataframe_name$categorical_column_name == "desired_level"
### use the desired rows to create a new dataframe
#new_dataframe_name <- dataframe_name[desired_rows,]
### I have created a dataframe for wild-type rice below; you need to do the same for the gmo rice
wt_rows <- rice_df$variety == "wt"
wt_df <- rice_df[wt_rows,]
wt_df
gmo_rows <- rice_df$variety == "gmo"
gmo_df <- rice_df[gmo_rows,]
gmo_df
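The two-step logical indexing above can also be written with base R's split(), which creates one data frame per level in a single call. This is only a sketch on invented values: toy_df, wt_toy, and gmo_toy are hypothetical names, and the numbers are not taken from rice.csv.

```r
# Hypothetical data frame using the same column names as rice_df
toy_df <- data.frame(
  ShootDryMass = c(132, 120, 108, 40, 9, 62),
  variety = c("wt", "wt", "wt", "gmo", "gmo", "gmo")
)

# split() returns a named list with one data frame per level of variety
by_variety <- split(toy_df, toy_df$variety)

# Pull each data frame out of the list by its level name
wt_toy <- by_variety[["wt"]]
gmo_toy <- by_variety[["gmo"]]

nrow(wt_toy)
nrow(gmo_toy)
```

The logical-indexing approach shown in the lab is easier to read when you only need one level; split() is convenient when there are many levels.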
Lastly, using the code chunk below, compute a confidence interval for the mean ShootDryMass of wt, and then compute another for the mean ShootDryMass of gmo. Use a confidence level of 96% in both cases.
t.test(wt_df$ShootDryMass, conf.level = 0.96)
##
## One Sample t-test
##
## data: wt_df$ShootDryMass
## t = 14.094, df = 35, p-value = 5.391e-16
## alternative hypothesis: true mean is not equal to 0
## 96 percent confidence interval:
## 65.60515 89.00596
## sample estimates:
## mean of x
## 77.30556
t.test(gmo_df$ShootDryMass, conf.level = 0.96)
##
## One Sample t-test
##
## data: gmo_df$ShootDryMass
## t = 8.2575, df = 35, p-value = 9.857e-10
## alternative hypothesis: true mean is not equal to 0
## 96 percent confidence interval:
## 31.00591 52.60520
## sample estimates:
## mean of x
## 41.80556
Answer the following:
1. The CI for wt only has a lower limit of 65.60515 and an upper limit of 89.00596, with a sample mean of 77.30556.
2. The CI for gmo only has a lower limit of 31.00591 and an upper limit of 52.60520, with a sample mean of 41.80556.
3. No, the CIs do not overlap.
4. Yes, the data suggest that the mean ShootDryMass is different for wt and gmo. Each interval gives the range of plausible values for that variety's true mean at 96% confidence. Because the two intervals do not overlap, no single value is plausible for both means.
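The overlap check in answer 3 can also be done in code rather than by eye. The sketch below copies the two 96% intervals from the t.test() output above; intervals_overlap is a hypothetical helper written for this illustration, not a function from any package.

```r
# Intervals copied from the t.test() output above: c(lower, upper)
wt_ci <- c(65.60515, 89.00596)
gmo_ci <- c(31.00591, 52.60520)

# Two intervals overlap when each one starts before the other ends
intervals_overlap <- function(a, b) {
  a[1] <= b[2] && b[1] <= a[2]
}

intervals_overlap(wt_ci, gmo_ci)
```

Here the wt interval starts at 65.60515, which is above the gmo upper limit of 52.60520, so the function returns FALSE.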