Problem 4.3
a. Using R, load the data, check the variables, and find those that you will need to create a contingency table showing top choice of water versus usual water source. Label your rows and columns. Include as many additional code chunks and explanation areas as needed.
df <- read.csv("WaterTaste.csv")
ls(df)
## [1] "Age" "Class" "FavBotWatBrand" "First"
## [5] "Fourth" "Gender" "Preference" "Second"
## [9] "Third" "UsuallyDrink"
I used ls() first to list the variables in the data frame. Note that ls() returns the names in alphabetical order, not in the order they appear in the file.
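As a possible alternative check, names() and str() show the columns in file order along with their types. The sketch below uses a small made-up data frame called toy (its values and column order are assumptions, not the real WaterTaste.csv contents) so that it runs on its own.

```r
# Toy data frame standing in for WaterTaste.csv (values are made up for illustration)
toy <- data.frame(
  Gender = c("F", "M", "F"),
  Age = c(19, 21, 20),
  UsuallyDrink = c("Tap", "Bottled", "Filtered"),
  First = c("Fiji", "Aquafina", "Tap"),
  stringsAsFactors = FALSE
)

ls(toy)     # column names sorted alphabetically
names(toy)  # column names in their original order
str(toy)    # names, types, and the first few values of each column
```

Running str() on the real data frame would also confirm that First and UsuallyDrink are read in as character columns before they are converted.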
df$First <- as.factor(df$First)
df$UsuallyDrink <- as.factor(df$UsuallyDrink)
I converted the two variables we are testing from character to factor.
contTable <- table(df$First, df$UsuallyDrink)
(contTable)
##
## Bottled Filtered Tap
## Aquafina 14 4 7
## Fiji 15 10 16
## SamsChoice 8 9 7
## Tap 4 3 3
This puts the two variables we are testing into a contingency table: the rows are the first choice of water and the columns are the water that is usually drunk.
H0: First choice of water and usual water source are independent of each other.
Ha: First choice of water and usual water source are not independent (they are associated).
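The problem asks for labelled rows and columns, and the table above prints without dimension names. One way to add them is sketched here, rebuilding the table from the printed counts so the code runs without the CSV. The label names FirstChoice and UsuallyDrink and the object names counts and labelledTable are my own choices.

```r
# Counts copied from the contingency table printed above
counts <- matrix(
  c(14,  4,  7,
    15, 10, 16,
     8,  9,  7,
     4,  3,  3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    FirstChoice  = c("Aquafina", "Fiji", "SamsChoice", "Tap"),
    UsuallyDrink = c("Bottled", "Filtered", "Tap")
  )
)
labelledTable <- as.table(counts)
labelledTable

# With the raw data frame, the same labels could be set directly (not run here):
# table(df$First, df$UsuallyDrink, dnn = c("FirstChoice", "UsuallyDrink"))

addmargins(labelledTable)  # adds row and column totals
```

The totals from addmargins() are also handy later for checking expected counts by hand.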
chisq.test(contTable)
## Warning in chisq.test(contTable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: contTable
## X-squared = 4.9725, df = 6, p-value = 0.5473
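One limitation is the warning that R printed above the test output, "Chi-squared approximation may be incorrect". R gives this warning when some expected cell counts are small, which means the chi-squared distribution may be a poor approximation and the p-value may be unreliable. The usual guideline is that every expected count (not observed count) should be at least 5. Here the expected counts in the Tap row are 4.1, 2.6, and 3.3 (the row total of 10 times the column totals of 41, 26, and 33, divided by the grand total of 100), so three of the twelve cells fall below the guideline.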
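To look into the warning, the expected counts can be pulled out of the test object, and a Monte Carlo p-value can be requested as a rough cross-check that does not rely on the chi-squared approximation. This is only a sketch: the table is rebuilt from the printed counts, and the seed and B = 10000 replicates are arbitrary choices of mine.

```r
# Counts copied from the contingency table printed above
counts <- matrix(
  c(14, 4, 7, 15, 10, 16, 8, 9, 7, 4, 3, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Aquafina", "Fiji", "SamsChoice", "Tap"),
    c("Bottled", "Filtered", "Tap")
  )
)

# Same Pearson test as before; the warning is suppressed because we inspect its cause below
res <- suppressWarnings(chisq.test(counts))
round(res$expected, 2)   # expected counts under independence
sum(res$expected < 5)    # number of cells with an expected count below 5

# Monte Carlo p-value, which avoids the large-sample approximation
set.seed(1)
sim <- chisq.test(counts, simulate.p.value = TRUE, B = 10000)
sim$p.value
```

If the simulated p-value is close to the one from the standard test, the small expected counts are probably not changing the conclusion.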
Since the p-value (0.5473) is greater than 0.05, we fail to reject the null hypothesis. There is not enough evidence to conclude that the first choice of water and the water usually drunk are associated. This does not prove that the two are independent; it only means that this sample shows no significant association between them.
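The decision rule for a test at the 5% level can be written out explicitly. In this small sketch the p-value is copied from the chisq.test output above, and alpha, p_value, and decision are illustrative names.

```r
alpha <- 0.05
p_value <- 0.5473  # p-value reported by chisq.test(contTable) above

# Reject H0 only when the p-value falls below the significance level
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
decision
```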