Importing data into R and having our first look:
ampark <- read.csv("rintro-chapter7.csv")
attach(ampark)
str(ampark)
## 'data.frame': 500 obs. of 8 variables:
## $ weekend : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 1 1 2 ...
## $ num.child: int 0 2 1 0 4 5 1 0 0 3 ...
## $ distance : num 114.6 27 63.3 25.9 54.7 ...
## $ rides : int 87 87 85 88 84 81 77 82 90 88 ...
## $ games : int 73 78 80 72 87 79 73 70 88 86 ...
## $ wait : int 60 76 70 66 74 48 58 70 79 55 ...
## $ clean : int 89 87 88 89 87 79 85 83 95 88 ...
## $ overall : int 47 65 61 37 68 27 40 30 58 36 ...
… plotting satisfaction data:
par(mfrow = c(2, 3))
hist(rides); hist(games) ; hist(wait); hist(clean); hist(overall)
It’s too early, though, to conclude from the histograms alone whether the data are normally distributed. Let’s check with the Shapiro-Wilk test:
sat_data <- ampark[ ,c("rides", "games", "wait", "clean", "overall")]
lapply(sat_data, shapiro.test)
## $rides
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.99154, p-value = 0.005945
##
##
## $games
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.99417, p-value = 0.05255
##
##
## $wait
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.99553, p-value = 0.1629
##
##
## $clean
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.99182, p-value = 0.007461
##
##
## $overall
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.99472, p-value = 0.08358
At the 5% level the test does not reject normality for “wait” (p = 0.163), “overall” (p = 0.084) and “games”, although the p-value for “games” (0.053) is very close to 0.05. For “rides” (p = 0.006) and “clean” (p = 0.007) normality is rejected.
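To compare the five results at a glance, the p-values can be pulled out of the test objects into one named vector. A minimal sketch, assuming a data frame of numeric columns; the helper name `shapiro_p` and the simulated `demo` data are made up for illustration and are not the park data:

```r
# Hypothetical helper: Shapiro-Wilk p-value for every column of a data frame
shapiro_p <- function(df) {
  sapply(df, function(x) shapiro.test(x)$p.value)
}

# Simulated stand-in data (NOT the park data): one normal, one skewed column
set.seed(42)
demo <- data.frame(normal_like = rnorm(300), skewed = rexp(300))

p_values <- shapiro_p(demo)
round(p_values, 4)

# Columns for which normality is not rejected at the 5% level
names(p_values)[p_values > 0.05]
```

With the real data the call would be `shapiro_p(sat_data)`.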
Let’s plot correlations of satisfaction data:
correlations <- cor(sat_data)
library(corrplot)
corrplot(correlations)
It seems that there is a strong correlation between $clean and $rides. Why? Perhaps while riding guests notice unclean park areas they hadn’t seen before or, on the contrary, are impressed by how clean the park is. We can also notice that overall satisfaction is related more strongly to the cleanliness of the park than to waiting time.
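The plot shows the pattern; the numbers behind it can be read off the correlation matrix itself. A small sketch on simulated data (the columns `a`, `b`, `c` are invented for the example) that prints the rounded matrix and picks out the most strongly correlated pair:

```r
# Simulated stand-in data (NOT the park data): b is built to track a closely
set.seed(1)
a <- rnorm(200)
demo <- data.frame(a = a, b = a + rnorm(200, sd = 0.3), c = rnorm(200))

cors <- cor(demo)
round(cors, 2)

# Blank out the diagonal and the lower triangle, then find the largest |r|
off_diag <- cors
off_diag[lower.tri(off_diag, diag = TRUE)] <- NA
idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cors)[idx[1, "row"]], colnames(cors)[idx[1, "col"]])
strongest
```

With the park data, `cor(sat_data)` would take the place of `cor(demo)`.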
Let’s take a look at another part of data: weekend, number of children and distance.
table(weekend); table(num.child)
## weekend
## no yes
## 259 241
## num.child
## 0 1 2 3 4 5
## 151 66 143 68 47 25
children <- subset(ampark, num.child >= 1)
nrow(children)
## [1] 349
sum(children$num.child)
## [1] 869
The “weekend or not” answers are split almost evenly (259 vs. 241). Moreover, 349 guests came with at least one child against 151 without any. In total the guests brought 869 children.
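These totals can be cross-checked directly from the frequency table printed above. A quick sketch that retypes those counts by hand, so it runs without the CSV (the variable names are our own):

```r
# Counts copied from the table(num.child) output above
child_counts <- c("0" = 151, "1" = 66, "2" = 143, "3" = 68, "4" = 47, "5" = 25)
n_children <- as.integer(names(child_counts))

sum(child_counts)                   # all guests: 500
sum(child_counts[n_children >= 1])  # guests with at least one child: 349
sum(n_children * child_counts)      # total number of children: 869
```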
What about distance?
plot(overall, distance)
Distance seems to have no impact on guests’ overall satisfaction: the points are scattered without any clear pattern.
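A scatterplot can hide a weak trend, so a correlation test is a cheap way to back up the visual impression. A sketch on simulated data (the vectors `dist_demo` and `sat_demo` are invented and independent by construction). Spearman’s rank correlation is our own choice here, on the assumption that travel distances are right-skewed:

```r
# Simulated stand-in data (NOT the park data)
set.seed(7)
dist_demo <- rexp(500, rate = 1 / 30)               # skewed, like travel distances
sat_demo  <- round(rnorm(500, mean = 50, sd = 15))  # satisfaction-like scores

res <- cor.test(dist_demo, sat_demo, method = "spearman", exact = FALSE)
res$estimate   # rho near 0 suggests no monotone relationship
res$p.value
```

With the real data this would be `cor.test(ampark$distance, ampark$overall, method = "spearman", exact = FALSE)`.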
…subsetting for further analysis:
children_sat <- children$overall
childless <- subset(ampark, num.child == 0)
childless_sat <- childless$overall
Let’s plot two distributions:
boxplot(children_sat, childless_sat, main = "Boxplot", xlab = "With or without children", ylab = "Overall satisfaction", col = c("red", "blue"))
It is important to note that the two groups differ in size: there are more than twice as many guests with children (349) as childless guests (151).
By running an independent t-test we assume that the guests who came with a child or children are NOT the same people as the ones who came without any. Before running it we also need to check the assumption of equal group variances; otherwise the pooled-variance test will not be valid. Let’s run Levene’s test to check this equality.
library(car)
leveneTest(ampark$overall ~ as.factor(ampark$num.child))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 5.2054 0.0001145 ***
## 494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
… which gives a very low p-value of 0.000115, so the Null Hypothesis of homogeneity of variances is rejected. There IS a significant difference among the six group variances. But let’s run the test with two groups: with child/children and without any.
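For intuition about what the test measures: with `center = median`, Levene’s test boils down to a one-way ANOVA on the absolute deviations of each observation from its group median. A base-R sketch of that idea on simulated groups (the function name `levene_by_hand` and the data are hypothetical):

```r
# One-way ANOVA on absolute deviations from the group medians
levene_by_hand <- function(y, g) {
  g <- factor(g)
  abs_dev <- abs(y - ave(y, g, FUN = median))
  anova(lm(abs_dev ~ g))
}

# Simulated groups (NOT the park data) with clearly unequal spread
set.seed(3)
g <- rep(c("narrow", "wide"), each = 200)
y <- c(rnorm(200, sd = 1), rnorm(200, sd = 5))

lev <- levene_by_hand(y, g)
lev
lev[1, "Pr(>F)"]   # small p-value: the group variances differ
```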
Before that we need to recode our condition values into a factor: with or without children as “yes” and “no”.
# Build the factor in ampark's own row order so it lines up with ampark$overall
f_all <- factor(ampark$num.child); levels(f_all) #actual levels
## [1] "0" "1" "2" "3" "4" "5"
levels(f_all) <- c("no", "yes", "yes", "yes", "yes", "yes")
levels(f_all) #new levels
## [1] "no" "yes"
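The same yes/no recoding can also be written in one step from the raw counts, which avoids listing a label for every level. A small sketch on a made-up vector (`kids` and `has_child` are illustrative names):

```r
kids <- c(0, 2, 1, 0, 4, 5)   # made-up example values

# "yes" for one or more children, "no" otherwise, with a fixed level order
has_child <- factor(ifelse(kids > 0, "yes", "no"), levels = c("no", "yes"))
has_child
table(has_child)
```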
…running Levene’s test:
leveneTest(ampark$overall ~ f_all)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.9307 0.04796 *
## 498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
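The p-value rose to 0.048, which is just below the 5% significance level. Strictly speaking, homogeneity of variance is still rejected at that level, but only marginally. We treat the result as borderline and carry on with the pooled-variance t-test, keeping in mind that Welch’s test (var.equal = FALSE, the default in t.test) is the safer choice when variances may differ.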
t.test(children_sat, childless_sat, var.equal = TRUE)
##
## Two Sample t-test
##
## data: children_sat and childless_sat
## t = 10.923, df = 498, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12.45753 17.92175
## sample estimates:
## mean of x mean of y
## 55.84527 40.65563
The observed t-value is large (t = 10.92, df = 498) and the p-value is correspondingly tiny (p < 2.2e-16), so there is a significant difference between the two groups. The Null Hypothesis, that there’s no difference in mean overall satisfaction between these two groups, is rejected.
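Since the homogeneity check was borderline, it is reasonable to see whether the conclusion survives without the equal-variance assumption. In R that is Welch’s t-test, which `t.test` runs when `var.equal` is left at its default of `FALSE`. A sketch on simulated groups (the group sizes mirror ours, the values are invented):

```r
# Simulated stand-in data (NOT the park data)
set.seed(11)
with_kids    <- rnorm(349, mean = 56, sd = 15)
without_kids <- rnorm(151, mean = 41, sd = 12)

pooled <- t.test(with_kids, without_kids, var.equal = TRUE)
welch  <- t.test(with_kids, without_kids)   # var.equal = FALSE by default

pooled$parameter   # df = n1 + n2 - 2 = 498
welch$parameter    # Welch-Satterthwaite df, usually not an integer
c(pooled = pooled$p.value, welch = welch$p.value)
```

With the park data the call would be `t.test(children_sat, childless_sat)`.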
In addition to the t-value it is worth computing Cohen’s d, an effect size that, unlike t, does not grow with the sample size, to check the magnitude of the “with children effect”:
library(lsr)
cohensD(children_sat, childless_sat, method = "pooled")
## [1] 1.063992
Guests who visited the park with children rate their overall satisfaction 1.064 pooled standard deviations higher than childless guests, which is a large effect (Cohen’s conventional threshold for “large” is 0.8).
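For reference, the pooled version of Cohen’s d is the difference in means divided by the pooled standard deviation, and for two independent groups it is tied to the pooled t statistic by d = t * sqrt(1/n1 + 1/n2). A sketch with a hand-written helper (`cohens_d_pooled` is our own name, not a library function) plus a consistency check against the numbers reported above:

```r
# Hypothetical helper: |mean difference| / pooled standard deviation
cohens_d_pooled <- function(x, y) {
  nx <- length(x); ny <- length(y)
  s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  abs(mean(x) - mean(y)) / s_pooled
}

toy_d <- cohens_d_pooled(c(2, 4, 6), c(1, 2, 3))   # toy vectors, not park data
toy_d

# Consistency check with the reported output: t = 10.923, n1 = 349, n2 = 151
d_from_t <- 10.923 * sqrt(1 / 349 + 1 / 151)
round(d_from_t, 3)   # 1.064
```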
Conclusion: when you go to an amusement park, take your children with you :)