Discovering data

Importing data into R and having our first look:

ampark <- read.csv("rintro-chapter7.csv", stringsAsFactors = TRUE)
attach(ampark)
str(ampark)
## 'data.frame':    500 obs. of  8 variables:
##  $ weekend  : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 1 1 2 ...
##  $ num.child: int  0 2 1 0 4 5 1 0 0 3 ...
##  $ distance : num  114.6 27 63.3 25.9 54.7 ...
##  $ rides    : int  87 87 85 88 84 81 77 82 90 88 ...
##  $ games    : int  73 78 80 72 87 79 73 70 88 86 ...
##  $ wait     : int  60 76 70 66 74 48 58 70 79 55 ...
##  $ clean    : int  89 87 88 89 87 79 85 83 95 88 ...
##  $ overall  : int  47 65 61 37 68 27 40 30 58 36 ...

… plotting satisfaction data:

par(mfrow = c(2, 3))
hist(rides); hist(games); hist(wait); hist(clean); hist(overall)

From the histograms alone it is too early to conclude whether these variables are normally distributed. Let’s check their p-values by running the Shapiro-Wilk test:

sat_data <- ampark[ ,c("rides", "games", "wait", "clean", "overall")]
lapply(sat_data, shapiro.test)
## $rides
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.99154, p-value = 0.005945
## 
## 
## $games
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.99417, p-value = 0.05255
## 
## 
## $wait
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.99553, p-value = 0.1629
## 
## 
## $clean
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.99182, p-value = 0.007461
## 
## 
## $overall
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.99472, p-value = 0.08358

At the 5% level, normality is rejected for “rides” and “clean” (both p < 0.01). For “wait”, “overall” and “games” the test gives no evidence against normality, although the p-value for “games” (0.053) is only just above 0.05.
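To compare the p-values at a glance instead of reading five printouts, they can be pulled out of the test objects. This is only a minimal sketch on simulated data: `demo_sat` and its columns are made-up stand-ins for `sat_data`, not the park ratings.

```r
# Hypothetical data standing in for sat_data (not the real ratings).
set.seed(1)
demo_sat <- data.frame(
  normal_like = rnorm(200, mean = 80, sd = 5),  # drawn from a normal distribution
  skewed      = rexp(200, rate = 0.1)           # clearly non-normal
)

# Keep only the p-value from each shapiro.test() result.
p_values <- sapply(demo_sat, function(x) shapiro.test(x)$p.value)

# TRUE marks columns where normality is rejected at the 5% level.
reject_normality <- p_values < 0.05

print(round(p_values, 4))
print(reject_normality)
```

With the real data, the same `sapply()` call on `sat_data` should return one p-value per satisfaction variable.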

Further into the woods

Let’s plot correlations of satisfaction data:

correlations <- cor(sat_data)
library(corrplot)
corrplot(correlations)

There seems to be a noticeable correlation between $clean and $rides. Why? Perhaps while on the rides guests notice parts of the park they had not seen before, whether unclean or, on the contrary, impressively tidy. We can also see that overall satisfaction is correlated most strongly with the cleanliness of the park rather than with waiting time.
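The plot can be backed up with numbers by ranking the correlations with “overall”. The sketch below uses simulated data in which `overall` is deliberately built to depend mostly on `clean`; all names and values are assumptions for illustration only.

```r
# Hypothetical ratings: overall is constructed to follow clean strongly
# and wait only weakly.
set.seed(2)
n <- 300
clean_demo   <- rnorm(n, mean = 85, sd = 5)
wait_demo    <- rnorm(n, mean = 65, sd = 10)
overall_demo <- 0.9 * clean_demo + 0.1 * wait_demo + rnorm(n, mean = 0, sd = 3)
demo <- data.frame(clean = clean_demo, wait = wait_demo, overall = overall_demo)

# Correlation matrix, then the row for "overall" without its self-correlation.
cors <- cor(demo)
with_overall <- cors["overall", colnames(cors) != "overall"]

# Strongest absolute correlation first.
sorted <- sort(abs(with_overall), decreasing = TRUE)
print(round(sorted, 2))
```

Applied to `correlations` from above, the same indexing would list which rating moves most closely with overall satisfaction.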

Let’s take a look at another part of data: weekend, number of children and distance.

table(weekend); table(num.child)
## weekend
##  no yes 
## 259 241
## num.child
##   0   1   2   3   4   5 
## 151  66 143  68  47  25
children <- subset(ampark, num.child >= 1)
nrow(children)
## [1] 349
sum(children$num.child)
## [1] 869

The “weekend” answers are split almost evenly (259 “no” against 241 “yes”). Moreover, 349 guests came with at least one child against 151 without, which adds up to 869 children in total.
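These totals can be double-checked directly from the frequency table. The counts below are copied from the `table(num.child)` output; the variable names are just illustrative.

```r
# Counts copied from the table(num.child) output above.
child_counts <- c("0" = 151, "1" = 66, "2" = 143, "3" = 68, "4" = 47, "5" = 25)
n_children <- as.integer(names(child_counts))

guests_total         <- sum(child_counts)                   # all respondents
guests_with_children <- sum(child_counts[n_children >= 1])  # at least one child
total_children       <- sum(n_children * child_counts)      # children overall

print(c(guests = guests_total,
        with_children = guests_with_children,
        children = total_children))
```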

What about distance?

plot(overall, distance)

Distance seems to have no impact on guests’ overall satisfaction: the points are scattered without any clear pattern.

Statistical magic

…subsetting for further analysis:

children_sat <- children$overall
childless <- subset(ampark, num.child == 0) 
childless_sat <- childless$overall

Let’s plot two distributions:

boxplot(children_sat, childless_sat, names = c("With children", "Without children"), main = "Overall satisfaction by group", xlab = "With or without children", ylab = "Overall satisfaction", col = c("red", "blue"))

Note that the two groups differ in size: there are more than twice as many guests with children (349) as childless guests (151).

An independent-samples t-test assumes that the two groups are independent, i.e. that guests who came with children are NOT the same people as those who came without any. The classic version also assumes equal group variances, so we need to check that before running it; otherwise our results may not be valid. Let’s run Levene’s test to check this equality.

library(car) 
leveneTest(ampark$overall ~ as.factor(ampark$num.child))
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   5  5.2054 0.0001145 ***
##       494                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

… which gives a very low p-value of 0.000115, so the null hypothesis of homogeneous variances is rejected. There IS a significant difference among the six group variances. But let’s rerun the test with two groups: with children and without any.
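For the curious, the median-centred Levene test is essentially a one-way ANOVA on each observation’s absolute deviation from its group median. Here is a minimal base-R sketch of that idea on simulated data; `score` and `group` are made-up names, and this is an illustration of the technique rather than a copy of the car package’s code.

```r
# Hypothetical scores for three groups with clearly different spreads.
set.seed(5)
score <- c(rnorm(60, mean = 50, sd = 5),
           rnorm(60, mean = 50, sd = 10),
           rnorm(60, mean = 50, sd = 20))
group <- factor(rep(c("a", "b", "c"), each = 60))

# Absolute deviation of each score from the median of its own group.
abs_dev <- abs(score - ave(score, group, FUN = median))

# One-way ANOVA on those deviations: a large F means unequal spreads.
levene_by_hand <- anova(lm(abs_dev ~ group))
print(levene_by_hand)
```

The degrees of freedom follow the same pattern as in the output above: number of groups minus one, and number of observations minus number of groups.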

First we need to recode our grouping variable as a factor with two levels: “yes” (with children) and “no” (without).

all <- c(children$num.child, childless$num.child); 
f_all <- factor(all); levels(f_all) #actual levels
## [1] "0" "1" "2" "3" "4" "5"
levels(f_all) <- c("no", "yes", "yes", "yes", "yes", "yes") 
levels(f_all) #new levels
## [1] "no"  "yes"
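One thing worth checking at this point: the vector above lists the guests with children first and the childless guests after them, while ampark$overall stays in the original row order, so the labels may not line up with the scores row by row. A minimal sketch of a recoding that keeps the row order, using the first five rows shown in the str() output; `toy` and `has_child` are hypothetical names.

```r
# First five rows of num.child and overall, as shown in the str() output.
toy <- data.frame(
  num.child = c(0, 2, 1, 0, 4),
  overall   = c(47, 65, 61, 37, 68)
)

# Concatenating the two subsets puts all "with children" rows first,
# so the values no longer follow the row order of toy.
with_kids <- subset(toy, num.child >= 1)
no_kids   <- subset(toy, num.child == 0)
reordered <- c(with_kids$num.child, no_kids$num.child)

# Recoding the original column gives one label per row, in row order.
has_child <- factor(ifelse(toy$num.child > 0, "yes", "no"),
                    levels = c("no", "yes"))

print(reordered)                 # 2 1 4 0 0
print(as.character(has_child))   # "no" "yes" "yes" "no" "yes"
```

With the real data, a factor built this way from ampark$num.child could be passed to leveneTest() in place of f_all; the result may then differ from the one printed in this post.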

…running Levene’s test:

leveneTest(ampark$overall ~ f_all)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  3.9307 0.04796 *
##       498                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

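As we see, the p-value rose to 0.048, which is still just below the 5% significance level, so strictly speaking homogeneity of variance is rejected, if only marginally; rounding the value up to 0.05 does not change that. We will nevertheless treat the variances as approximately equal and run the pooled-variance t-test, keeping in mind that Welch’s version (var.equal = FALSE, the default in t.test) does not rely on this assumption.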

t.test(children_sat, childless_sat, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  children_sat and childless_sat
## t = 10.923, df = 498, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12.45753 17.92175
## sample estimates:
## mean of x mean of y 
##  55.84527  40.65563

The observed t-value is very large and the p-value is correspondingly tiny (below 2.2e-16), so we can say that there is a significant difference between the two groups. The null hypothesis that mean overall satisfaction is the same in both groups is rejected.
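Since the equal-variance assumption is borderline here, it can be reassuring to run the pooled test and Welch’s test side by side. The sketch below does that on simulated groups of the same sizes as ours; the means and standard deviations are invented for illustration and are not estimates from the park data.

```r
# Hypothetical groups with unequal sizes and unequal spread.
set.seed(3)
group_a <- rnorm(349, mean = 56, sd = 14)
group_b <- rnorm(151, mean = 41, sd = 17)

pooled <- t.test(group_a, group_b, var.equal = TRUE)   # Student's t-test
welch  <- t.test(group_a, group_b, var.equal = FALSE)  # Welch's t-test (R's default)

# Student's test has n1 + n2 - 2 degrees of freedom;
# Welch's approximation is typically lower when the variances differ.
print(c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter)))
print(c(pooled_p = pooled$p.value, welch_p = welch$p.value))
```

If both versions lead to the same conclusion, the choice between them matters little for the result.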

In addition to the t-value, it is worth computing Cohen’s d, an effect size that, unlike t, does not grow with the sample size, to gauge the magnitude of the “with children” effect:

library(lsr)
cohensD(children_sat, childless_sat, method = "pooled")
## [1] 1.063992

Guests who came with children rate their overall satisfaction about 1.06 pooled standard deviations higher than childless guests, which is a large effect (values of 0.8 and above are conventionally considered large).
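To see where this number comes from, Cohen’s d with a pooled standard deviation can be written out by hand, and for the pooled-variance t-test it is linked to t by t = d / sqrt(1/n1 + 1/n2). A minimal sketch: `cohens_d_pooled` is a hypothetical helper, the two demo groups are simulated, and the last step plugs in the t-value and group sizes reported in this post.

```r
# Cohen's d using the pooled standard deviation of the two groups.
cohens_d_pooled <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / s_pooled
}

# Hypothetical groups, only to check the identity between d and t.
set.seed(4)
x_demo <- rnorm(120, mean = 55, sd = 15)
y_demo <- rnorm(80,  mean = 45, sd = 15)
d_demo <- cohens_d_pooled(x_demo, y_demo)

# For the pooled-variance test: d = t * sqrt(1/n1 + 1/n2).
t_demo   <- unname(t.test(x_demo, y_demo, var.equal = TRUE)$statistic)
d_from_t <- t_demo * sqrt(1 / 120 + 1 / 80)

# Same identity with the values reported above (t = 10.923, n = 349 and 151).
d_reported <- 10.923 * sqrt(1 / 349 + 1 / 151)

print(c(by_hand = d_demo, from_t = d_from_t, reported = d_reported))
```

The last value lands at roughly 1.064, in line with the cohensD() output above.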

Conclusion: when you go to an amusement park, take your children with you :)

Thanks for reading!