Introduction:

In this lab we will:

Intro to code needed

NOTE: as we discussed in lectures, depending on the TYPE of variable you have (categorical vs. numerical) there are different calculations for confidence intervals. Below is some introductory/sample code for determining CIs for different variable types:

For either data type, the “desired confidence” is a value between 0 and 1. If I ask for 90% confidence, you need to enter 0.9.

FOR CATEGORICAL (this means you are essentially creating proportional data): install and load the package in the code chunk below.

#install.packages("DescTools")
library(DescTools)

FOR ALL THE CODE BELOW, things have been commented out. If you copy and paste them, please make sure the # is removed.

Confidence Interval for Proportions

You will need the number of cases that are positive (x) from your data set as well as the total number of observations (n)

Access number of positive cases

#table(dataframe_name$categorical_column)

Access Total number of cases

#length(dataframe_name$categorical_column)
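As a small illustration of these two steps, the sketch below uses a made-up data frame (toy_df and its status column are hypothetical names, not part of the lab data) to show one way the number of positive cases and the total number of cases could be pulled out and stored:

```r
# Hypothetical toy data; 'toy_df' and 'status' are illustration names only
toy_df <- data.frame(status = c("yes", "no", "yes", "yes", "no"))

# table() counts how many times each category appears
counts <- table(toy_df$status)
counts

# x: number of "positive" cases (here we assume "yes" is the positive level)
x <- sum(toy_df$status == "yes")

# n: total number of observations in the column
n <- length(toy_df$status)

# Sample proportion of positive cases
x / n
```

Keep in mind that length() counts every entry in the column, including missing values (NA), so it is worth checking for those in real data.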

NOTE: there are two common methods for calculating CIs for proportions: the Wald method and the Agresti-Coull method. The only difference is a slight change in the calculation. The Wald method is a good default; Agresti-Coull is better for smaller sample sizes.

Confidence Interval - Wald Method

#BinomCI(x = number of positive cases, n = total cases, conf.level = desired confidence, method = "wald")

Confidence Interval - Agresti-Coull Method

#BinomCI(x = number of positive cases, n = total cases, conf.level = desired confidence, method = "agresti-coull")
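To make the "slight change in the calculation" concrete, here is a minimal base-R sketch of the textbook formulas for both methods. The counts (36 positive cases out of 72) and the 97% level are just example inputs, and the variable names are my own. This is not the internal code of BinomCI(), only the standard formulas written out by hand:

```r
# Example inputs (assumed for illustration)
x <- 36       # number of positive cases
n <- 72       # total number of cases
conf <- 0.97  # desired confidence, entered as a value between 0 and 1

# Critical value from the standard normal distribution
z <- qnorm(1 - (1 - conf) / 2)

# Wald: centre the interval on the raw sample proportion
p_hat <- x / n
wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Agresti-Coull: adjust the counts first, then apply the same formula
n_adj <- n + z^2
p_adj <- (x + z^2 / 2) / n_adj
agresti_coull <- p_adj + c(-1, 1) * z * sqrt(p_adj * (1 - p_adj) / n_adj)

wald
agresti_coull
```

The only difference is that Agresti-Coull adds a small correction to the counts before computing the interval, which matters most when n is small or the proportion is close to 0 or 1.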

Confidence Interval for Mean of a Quantitative Variable (NUMERICAL)

#t.test(dataframe_name$quantitative_column, conf.level = desired confidence)

Or directly access the Confidence Interval

#t.test(dataframe_name$quantitative_column, conf.level = desired confidence)$conf.int
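For anyone curious what t.test() is doing to get the interval, the sketch below builds the same confidence interval by hand and compares it with $conf.int. The values vector is made-up data used only for illustration:

```r
# Hypothetical measurements; replace with your own numeric column
values <- c(12.1, 9.8, 11.4, 10.7, 13.2, 10.1, 11.9, 12.5)
conf <- 0.95

n <- length(values)
se <- sd(values) / sqrt(n)                    # standard error of the mean
t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)  # critical t value, n - 1 df

# Interval: sample mean plus/minus critical value times standard error
manual_ci <- mean(values) + c(-1, 1) * t_crit * se
manual_ci

# Same interval from t.test(); as.numeric() drops the attached attributes
builtin_ci <- as.numeric(t.test(values, conf.level = conf)$conf.int)
builtin_ci
```

The two results should agree, which shows that raising conf.level only changes the critical value and therefore the width of the interval.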

Exercise 1: Creating confidence intervals

In a study of how rice grows across different nutrient treatments, researchers randomly selected plots where there was a mix of wild-type and gmo rice growing. Please read in the “rice.csv” file and call it “rice_df”.

rice_df <- read.table('rice.csv', sep = ',', header = TRUE)
rice_df

Exercise 2:

Using the code chunk below, write R commands to

  • list the names of the variables for the dataframe
  • get the data type for each variable in the list above.

Answer:

variables<-colnames(rice_df)
variables
## [1] "PlantNo"      "Block"        "RootDryMass"  "ShootDryMass" "trt"         
## [6] "fert"         "variety"
str(rice_df)
## 'data.frame':    72 obs. of  7 variables:
##  $ PlantNo     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Block       : int  1 1 1 1 1 1 2 2 2 2 ...
##  $ RootDryMass : int  56 66 40 43 55 66 41 67 40 35 ...
##  $ ShootDryMass: int  132 120 108 134 119 125 98 122 114 82 ...
##  $ trt         : chr  "F10" "F10" "F10" "F10" ...
##  $ fert        : chr  "F10" "F10" "F10" "F10" ...
##  $ variety     : chr  "wt" "wt" "wt" "wt" ...
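str() is one option for checking data types. As an alternative sketch, sapply() with class() returns just the type of each column as a named vector. The toy_df data frame below is hypothetical and used only to keep the example self-contained:

```r
# Hypothetical toy data frame for illustration only
toy_df <- data.frame(
  id = 1:4,
  mass = c(10.5, 12.0, 9.8, 11.1),
  group = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Apply class() to every column; the result is a named character vector
col_types <- sapply(toy_df, class)
col_types
```

Columns reported as integer or numeric are candidates for a C.I. for a mean, while character (or factor) columns point toward a C.I. for a proportion.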

Exercise 3:

  1. Based on the data type for the variety variable, which confidence interval would be appropriate to use: C.I. for a mean or a C.I. for a proportion?
  2. Use the code chunk below to get the appropriate C.I. with a confidence level of 97%; if you decided to work with proportions, use the Wald method.
  3. In your own words, what does the output of step 2 tell us?

Answer:

  1. A C.I. for a proportion is most appropriate because variety is a categorical variable.
  2. See the code and output below.
  3. The est value in the output tells us that 0.5 (50%) of the plants in the “variety” column are of the “wt” variety. The lwr.ci and upr.ci values give the lower and upper limits of the 97% confidence interval: 0.3721262 and 0.6278738 (37.2% and 62.8%). In other words, we are 97% confident that the true proportion of wt plants lies between these limits.

x <- sum(rice_df$variety == "wt")  # count "wt" in the variety column only
n <- length(rice_df$variety)
print(x)
## [1] 36
print(n)
## [1] 72
BinomCI(x, n, conf.level = 0.97, method = "wald")
##      est    lwr.ci    upr.ci
## [1,] 0.5 0.3721262 0.6278738

Exercise 4:

  1. Based on the data type for the ShootDryMass variable, which confidence interval would be appropriate to use: C.I. for a mean or a C.I. for a proportion?
  2. Use the code chunk below to get the appropriate C.I. with a confidence level of 97%; if you decided to work with proportions, use the Agresti-Coull method.
  3. In your own words, what does the output of step 2 tell us?

Answer:

  1. A C.I. for a mean is appropriate because ShootDryMass is a numerical variable.
  2. See the code and output below.
  3. The output provides a lot of statistical information. The t-value of 13.971 measures how far the sample mean is from the null hypothesis mean of 0, in units of standard error. The degrees of freedom are 71, which is one less than the sample size of 72. The p-value of < 2.2e-16 is very low, but it only tells us that the mean differs from 0, which is not in question for a dry mass. The part that matters here is the sample mean of 59.56 and the 97% confidence interval, whose lower and upper limits are 50.11550 and 68.99561.

t.test(rice_df$ShootDryMass, conf.level = 0.97)
## 
##  One Sample t-test
## 
## data:  rice_df$ShootDryMass
## t = 13.971, df = 71, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 97 percent confidence interval:
##  50.11550 68.99561
## sample estimates:
## mean of x 
##  59.55556

Section 2: Comparing Confidence Intervals

In this final section, we will continue to work with the rice_df. We are now interested in whether the variety of rice (wt or gmo) influences the growth of the rice plants.

Exercise 5:

Recall that the summary() function will compute the five-number summary (plus the mean) for a quantitative data set. Using tapply(), compute the “summary” of the ShootDryMass broken up by variety:


Answer:

tapply(rice_df$ShootDryMass, rice_df$variety, summary)
## $gmo
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.50   40.00   41.81   62.75  100.00 
## 
## $wt
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.00   47.50   76.00   77.31  108.25  134.00
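The general pattern is tapply(values, groups, function). The sketch below uses made-up vectors (mass and group are hypothetical names) to show that the third argument can be any function, including one that returns a confidence interval:

```r
# Hypothetical toy data; 'mass' and 'group' are illustration names only
mass  <- c(10, 12, 11, 13, 20, 22, 21, 23)
group <- c("a", "a", "a", "a", "b", "b", "b", "b")

# Split 'mass' by 'group' and apply mean() to each piece
group_means <- tapply(mass, group, mean)
group_means

# A custom function works too: here, a 96% C.I. for each group's mean
group_cis <- tapply(mass, group, function(v) t.test(v, conf.level = 0.96)$conf.int)
group_cis
```

This is an optional shortcut. Slicing the data frame by hand gives the same intervals and makes each step easier to see.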

Exercise 8: Confidence Intervals by variety

Using the code chunk below, slice your rice_df into two dataframes: wt and gmo. The wt dataframe should include the ShootDryMass data for only the wild-type in the sample, and the gmo dataframe should do the same for gmo rice.

The basic way to do this is to use the following two steps. The first step flags the rows in your dataframe that contain the value you are looking to isolate; you need to look for the exact phrase! The second step takes the flagged rows and creates a new dataframe (which you control the name of).

# Step 1: create a variable for the desired rows
#desired_rows <- dataframe_name$categorical_column_name == "desired_level"
# Step 2: use the desired rows to create a new dataframe
#new_dataframe_name <- dataframe_name[desired_rows,]
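Here is a runnable sketch of those two steps on a small made-up data frame (toy_df, mass, and group are hypothetical names). It also shows why the phrase has to match exactly:

```r
# Hypothetical toy data frame for illustration only
toy_df <- data.frame(
  mass = c(10, 12, 20, 22, 11),
  group = c("a", "a", "b", "b", "a"),
  stringsAsFactors = FALSE
)

# Step 1: logical vector, TRUE where group is exactly "a"
a_rows <- toy_df$group == "a"
a_rows

# Step 2: keep only the TRUE rows and all columns (note the trailing comma)
a_df <- toy_df[a_rows, ]
a_df

# The match is exact and case sensitive: "A" selects nothing here
sum(toy_df$group == "A")
```

If your new dataframe comes back with zero rows, the usual cause is a mismatch in spelling, capitalization, or extra spaces in the level name.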

Answer

# I have created a dataframe for wild-type rice below; you need to do the same for the gmo rice

wt_rows <- rice_df$variety == "wt"
wt_df <- rice_df[wt_rows,]
wt_df
gmo_rows <- rice_df$variety == "gmo"
gmo_df <- rice_df[gmo_rows,]
gmo_df

Lastly, using the code chunk below, compute a confidence interval for the ShootDryMass of wt, and then compute another for the ShootDryMass of gmo. Use a confidence level of 96% in both cases.

t.test(wt_df$ShootDryMass, conf.level = 0.96)
## 
##  One Sample t-test
## 
## data:  wt_df$ShootDryMass
## t = 14.094, df = 35, p-value = 5.391e-16
## alternative hypothesis: true mean is not equal to 0
## 96 percent confidence interval:
##  65.60515 89.00596
## sample estimates:
## mean of x 
##  77.30556
t.test(gmo_df$ShootDryMass, conf.level = 0.96)
## 
##  One Sample t-test
## 
## data:  gmo_df$ShootDryMass
## t = 8.2575, df = 35, p-value = 9.857e-10
## alternative hypothesis: true mean is not equal to 0
## 96 percent confidence interval:
##  31.00591 52.60520
## sample estimates:
## mean of x 
##  41.80556

Answer the following:

  1. What is the C.I. for only the wt rice?
  2. What is the C.I. for only the gmo rice?
  3. Do these C.I.’s overlap at all?
  4. In light of your answer to question 3, do you suspect the mean ShootDryMass is different for wt vs gmo? Explain your answer.

Answers:

  1. The C.I. for only wt has a lower limit of 65.60515 and an upper limit of 89.00596, around a sample mean of 77.30556.
  2. The C.I. for only gmo has a lower limit of 31.00591 and an upper limit of 52.60520, around a sample mean of 41.80556.
  3. No, the C.I.s do not overlap.
  4. Yes. Each interval gives a range of plausible values for that variety's true mean ShootDryMass. Because the two ranges do not overlap (the highest plausible gmo mean, 52.60520, is well below the lowest plausible wt mean, 65.60515), we suspect the true means are different, with wt higher than gmo.
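The overlap check can also be done in code rather than by eye. The sketch below types in the 96% limits reported by t.test() above; the variable names are my own:

```r
# 96% confidence limits copied from the t.test() output above
wt_ci  <- c(65.60515, 89.00596)
gmo_ci <- c(31.00591, 52.60520)

# Two intervals overlap when each one starts before the other ends
overlap <- wt_ci[1] <= gmo_ci[2] && gmo_ci[1] <= wt_ci[2]
overlap

# Gap between the top of the gmo interval and the bottom of the wt interval
gap <- wt_ci[1] - gmo_ci[2]
gap
```

A positive gap means the intervals are fully separated, which supports the conclusion that the two varieties have different mean ShootDryMass.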