Problem 4.3

a. Using R, load the data, check the variables, and find those that you will need to create a contingency table showing top choice of water versus usual water source. Label your rows and columns. Include as many additional code chunks and explanation areas as needed.

df <- read.csv("WaterTaste.csv")
ls(df)
##  [1] "Age"            "Class"          "FavBotWatBrand" "First"         
##  [5] "Fourth"         "Gender"         "Preference"     "Second"        
##  [9] "Third"          "UsuallyDrink"

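I used ls() to list the variable names in the data frame. The two needed for the contingency table are First (the top choice of water) and UsuallyDrink (the usual water source).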

df$First <- as.factor(df$First)
df$UsuallyDrink <- as.factor(df$UsuallyDrink)

This converts the two variables being tested from character vectors to factors.

contTable <- table(df$First, df$UsuallyDrink)
(contTable)
##             
##              Bottled Filtered Tap
##   Aquafina        14        4   7
##   Fiji            15       10  16
##   SamsChoice       8        9   7
##   Tap              4        3   3

This puts the two variables being tested, the first choice of water (rows) and the water usually drunk (columns), into a contingency table.
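The table above prints without dimension names, so the rows and columns are not labeled. A minimal sketch of one way to label them follows. It rebuilds the table from the counts shown above so that it runs without the CSV file; the label `FirstChoice` and the object name `labeledTable` are illustrative choices, not part of the original analysis.

```r
# Counts copied from the contingency table printed above
counts <- matrix(
  c(14,  4,  7,
    15, 10, 16,
     8,  9,  7,
     4,  3,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    FirstChoice  = c("Aquafina", "Fiji", "SamsChoice", "Tap"),
    UsuallyDrink = c("Bottled", "Filtered", "Tap")
  )
)

# Convert to a table so it prints with the dimension labels
labeledTable <- as.table(counts)
labeledTable

# Row and column totals are handy for checking the counts
addmargins(labeledTable)
```

With the original data frame, the same labels can be attached directly through the `dnn` argument: `table(df$First, df$UsuallyDrink, dnn = c("FirstChoice", "UsuallyDrink"))`.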

b. State valid null and alternative hypotheses for the chi-square test of independence.

H0: First choice of water and usual water source are independent of each other.

Ha: First choice of water and usual water source are not independent (they are associated).

c. Using R, conduct a chi-square test of independence with this data.

chisq.test(contTable)
## Warning in chisq.test(contTable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  contTable
## X-squared = 4.9725, df = 6, p-value = 0.5473
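As a check on the output above, the statistic can be reproduced by hand. This is a sketch that re-enters the observed counts from the contingency table rather than reading the CSV; the variable names (`observed`, `expected`, `statistic`, `dof`, `p_value`) are illustrative.

```r
# Observed counts from the contingency table (rows: first choice, columns: usual source)
observed <- matrix(
  c(14,  4,  7,
    15, 10, 16,
     8,  9,  7,
     4,  3,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Aquafina", "Fiji", "SamsChoice", "Tap"),
    c("Bottled", "Filtered", "Tap")
  )
)

# Expected count under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (O - E)^2 / E
statistic <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(statistic, df = dof, lower.tail = FALSE)

round(c(statistic = statistic, df = dof, p_value = p_value), 4)
```

The rounded values should agree with the `chisq.test()` output: X-squared = 4.9725, df = 6, p-value = 0.5473.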

The output carries the warning "Chi-squared approximation may be incorrect", so the reported p-value should be treated with caution. The warning arises because the chi-square approximation assumes expected counts of at least 5 in each cell, and that assumption fails here: only 10 respondents chose Tap first, so all three expected counts in the Tap row (4.1, 2.6, and 3.3) fall below 5.
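One way to look into the warning is to inspect the expected counts and then rerun the test with a Monte Carlo p-value, which does not rely on the large-sample approximation. This is a hedged sketch, not part of the original analysis: the counts are re-entered from the table above, and the seed and number of replicates (`B = 10000`) are arbitrary choices.

```r
# Observed counts from the contingency table above
observed <- matrix(
  c(14,  4,  7,
    15, 10, 16,
     8,  9,  7,
     4,  3,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Aquafina", "Fiji", "SamsChoice", "Tap"),
    c("Bottled", "Filtered", "Tap")
  )
)

# Expected counts under independence (warning suppressed, since it is the thing being examined)
test <- suppressWarnings(chisq.test(observed))
round(test$expected, 2)

# Number of cells whose expected count is below the usual threshold of 5
sum(test$expected < 5)

# Monte Carlo p-value; the seed is arbitrary and only makes the run repeatable
set.seed(1)
simTest <- chisq.test(observed, simulate.p.value = TRUE, B = 10000)
simTest$p.value
```

If the simulated p-value is close to the one from the standard test, the low expected counts are not changing the conclusion.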

d. Based on your analysis, is there evidence that the top choices for taste preference are associated with whether or not people usually drink bottled water? If there is a significant association between these two variables, describe how they are related.

Since the p-value (0.5473) is greater than 0.05, we fail to reject the null hypothesis. There is not enough evidence to conclude that the first choice of water is associated with the water usually drunk, so there is no significant association to describe.