Loading the necessary packages:
selectie_pirates <- pirates %>%
filter(tattoos >= 5 | parrots >= 5) %>%
filter(favorite.pirate == "Jack Sparrow") %>%
filter(headband == "yes") %>%
filter(beard.length >= 15) %>%
arrange(id)
selectie_pirates %>%
head(10)
## id sex age height weight headband college tattoos tchests parrots
## 1 1 male 28 173.11 70.5 yes JSSFP 9 0 0
## 2 2 male 31 209.25 105.6 yes JSSFP 9 11 0
## 3 3 male 26 169.95 77.1 yes CCCC 10 10 1
## 4 6 male 26 190.20 85.4 yes CCCC 7 19 0
## 5 10 male 30 183.52 84.7 yes JSSFP 12 69 4
## 6 12 male 20 163.65 70.0 yes CCCC 14 5 3
## 7 14 male 26 177.99 78.2 yes CCCC 9 12 3
## 8 17 male 21 177.52 72.7 yes CCCC 11 3 0
## 9 20 male 26 175.45 69.0 yes CCCC 14 30 3
## 10 24 male 12 175.29 83.0 yes CCCC 10 3 2
## favorite.pirate sword.type eyepatch sword.time beard.length
## 1 Jack Sparrow cutlass 1 0.58 16
## 2 Jack Sparrow cutlass 0 1.11 21
## 3 Jack Sparrow cutlass 1 1.44 19
## 4 Jack Sparrow cutlass 1 0.59 17
## 5 Jack Sparrow cutlass 1 0.71 25
## 6 Jack Sparrow cutlass 0 0.47 27
## 7 Jack Sparrow cutlass 1 1.33 19
## 8 Jack Sparrow cutlass 0 0.33 20
## 9 Jack Sparrow cutlass 0 0.40 27
## 10 Jack Sparrow cutlass 1 0.14 24
## fav.pixar grogg
## 1 Monsters, Inc. 11
## 2 WALL-E 9
## 3 Inside Out 7
## 4 Monsters University 7
## 5 Monsters University 9
## 6 Monsters, Inc. 8
## 7 Inside Out 7
## 8 Finding Nemo 15
## 9 WALL-E 10
## 10 Up 8
class(pirates)
## [1] "data.frame"
str(pirates)
## 'data.frame': 1000 obs. of 17 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : chr "male" "male" "male" "female" ...
## $ age : num 28 31 26 31 41 26 31 31 28 30 ...
## $ height : num 173 209 170 144 158 ...
## $ weight : num 70.5 105.6 77.1 58.5 58.4 ...
## $ headband : chr "yes" "yes" "yes" "no" ...
## $ college : chr "JSSFP" "JSSFP" "CCCC" "JSSFP" ...
## $ tattoos : num 9 9 10 2 9 7 9 5 12 12 ...
## $ tchests : num 0 11 10 0 6 19 1 13 37 69 ...
## $ parrots : num 0 0 1 2 4 0 7 7 2 4 ...
## $ favorite.pirate: chr "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" ...
## $ sword.type : chr "cutlass" "cutlass" "cutlass" "scimitar" ...
## $ eyepatch : num 1 0 1 1 1 1 0 1 0 1 ...
## $ sword.time : num 0.58 1.11 1.44 36.11 0.11 ...
## $ beard.length : num 16 21 19 2 0 17 1 1 1 25 ...
## $ fav.pixar : chr "Monsters, Inc." "WALL-E" "Inside Out" "Inside Out" ...
## $ grogg : num 11 9 7 9 14 7 9 12 16 9 ...
the variables age and weight are numeric, so we will use a scatter plot for representation
the name of the variables
names(pirates)
## [1] "id" "sex" "age" "height"
## [5] "weight" "headband" "college" "tattoos"
## [9] "tchests" "parrots" "favorite.pirate" "sword.type"
## [13] "eyepatch" "sword.time" "beard.length" "fav.pixar"
## [17] "grogg"
plot(height ~ weight, pirates) # the scatter plot
abline(coef(lm(height ~ weight, pirates)), col = 'blue', lwd = 3) # the regression line
as we can see the relationship between weight and heights variables is linear
first we will test if the variance of the pirates with eye patch and that of the pirates without eye patch is equal or not
checking the type of eye patch variable
levels(pirates$eyepatch)
## NULL
pirates$eyepatch <- as.factor(pirates$eyepatch)
setting the first group: pirates without eye patch
levels(pirates$eyepatch) <- c('fara_petec', 'cu_petec')
testing the variance with Bartlett test
bartlett.test(beard.length~eyepatch, pirates, eyepatch %in% c('fara_petec', 'cu_petec'))
##
## Bartlett test of homogeneity of variances
##
## data: beard.length by eyepatch
## Bartlett's K-squared = 0.34336, df = 1, p-value = 0.5579
or
bartlett.test(beard.length~eyepatch, pirates)
##
## Bartlett test of homogeneity of variances
##
## data: beard.length by eyepatch
## Bartlett's K-squared = 0.34336, df = 1, p-value = 0.5579
Hypothesis:
\[ H_0: \text{The variances of the groups are equal} \]
\[ H_1: \text{The variance of the groups is significantly different} \]
Decision: \[ p\text{-value} < 0.05 \Rightarrow \text{we reject } H_0 \text{ with a 5\% risk of Type I error} \] \[ p\text{-value} \ge 0.05 \Rightarrow \text{fail to reject } H_0 \text{ at the 5\% significance level} \] Results \[ p\text{-value} = 0.5579 > 0.05 \Rightarrow \text{fail to reject } H_0 \]
The variance between the groups is not statistically significant.
Second we will test if there is a difference between pirates with eye patch and those without eye patch we use the t test for independent samples
t.test(beard.length~eyepatch, data = pirates, alternative = 'greater', var.equal=T)
##
## Two Sample t-test
##
## data: beard.length by eyepatch
## t = 1.453, df = 998, p-value = 0.07326
## alternative hypothesis: true difference in means between group fara_petec and group cu_petec is greater than 0
## 95 percent confidence interval:
## -0.1328453 Inf
## sample estimates:
## mean in group fara_petec mean in group cu_petec
## 11.04094 10.04255
Hypothesis
\[ H_0: \text{There is no difference between the mean beard length (≥ 15) of pirates with an eye patch and those without an eye patch} \]
\[H_1:\mu_1 < \mu_2 {\text{ the differences are significant }} \]
Used test: t test
Results:
\[ {\text{ p-value = 0.07326 > 0,05 => accepting H_0. There is no significant differences between the analyzed groups }} \]
Can we compare the models?
ms1<-lm(pirates$beard.length~pirates$height)
ms2<-lm(pirates$beard.length~pirates$weight+I(pirates$weight^2))
The models can not be compare, because the independent variables are not the same
anova(ms1, ms2)
## Analysis of Variance Table
##
## Model 1: pirates$beard.length ~ pirates$height
## Model 2: pirates$beard.length ~ pirates$weight + I(pirates$weight^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 998 81169
## 2 997 83450 1 -2281.5
the value of the p-value is missing
ms1<-lm(pirates$beard.length~pirates$height)
ms2<-lm(pirates$beard.length~pirates$height+I(pirates$height^2))
to be able to compere the two models the weight variable must be replace with height variable
anova(ms1, ms2)
## Analysis of Variance Table
##
## Model 1: pirates$beard.length ~ pirates$height
## Model 2: pirates$beard.length ~ pirates$height + I(pirates$height^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 998 81169
## 2 997 80995 1 173.89 2.1405 0.1438
\[ H_0: \text{There is no difference between the models} \]
\[ H_1: \text{The complex model is better than the simple regression model } \] \[ p\text{-value} = 0.1438 > 0.05 \Rightarrow \text{fail to reject } H_0. \]
There is no statistically significant difference between the models.
str(pirates)
## 'data.frame': 1000 obs. of 17 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : chr "male" "male" "male" "female" ...
## $ age : num 28 31 26 31 41 26 31 31 28 30 ...
## $ height : num 173 209 170 144 158 ...
## $ weight : num 70.5 105.6 77.1 58.5 58.4 ...
## $ headband : chr "yes" "yes" "yes" "no" ...
## $ college : chr "JSSFP" "JSSFP" "CCCC" "JSSFP" ...
## $ tattoos : num 9 9 10 2 9 7 9 5 12 12 ...
## $ tchests : num 0 11 10 0 6 19 1 13 37 69 ...
## $ parrots : num 0 0 1 2 4 0 7 7 2 4 ...
## $ favorite.pirate: chr "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" ...
## $ sword.type : chr "cutlass" "cutlass" "cutlass" "scimitar" ...
## $ eyepatch : Factor w/ 2 levels "fara_petec","cu_petec": 2 1 2 2 2 2 1 2 1 2 ...
## $ sword.time : num 0.58 1.11 1.44 36.11 0.11 ...
## $ beard.length : num 16 21 19 2 0 17 1 1 1 25 ...
## $ fav.pixar : chr "Monsters, Inc." "WALL-E" "Inside Out" "Inside Out" ...
## $ grogg : num 11 9 7 9 14 7 9 12 16 9 ...
pirates$inaltime_cat<-cut(pirates$height, 3, c("Short", "Medium", "Tall"))
To see if the tchests is correlated with age we will do a scatter plot
plot(tchests ~ age, pirates)
abline(coef(lm(tchests ~ age, pirates)), col = 'blue', lwd = 3)
estimate the regression model
reg_m <- lm(tchests ~ age, pirates)
the statistics of the estimated model
summary(reg_m)
##
## Call:
## lm(formula = tchests ~ age, data = pirates)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.253 -15.885 -6.750 8.225 122.142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2083 3.6704 0.057 0.955
## age 0.8217 0.1313 6.260 0.00000000057 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.01 on 998 degrees of freedom
## Multiple R-squared: 0.03778, Adjusted R-squared: 0.03682
## F-statistic: 39.19 on 1 and 998 DF, p-value: 0.0000000005703
formula(reg_m)
## tchests ~ age
tchests = 0.2083 + 0.8217 * age
b0 = 0.2083
The average number of chests discovered by a pirate is 0.2083 when age
is equal to zero.
b1 = 0.8217
When age increases by one year, the average number of discovered chests
increases by 0.8217 chests.
H0: β1 = β2 = 0
H1: At least one of the independent variable coefficients is different
from zero
p-value = 0.0000000004082 < α = 0.05
⇒ We reject the null hypothesis (H0) at the 5% significance level.
Interpretation:
We can conclude that the regression model is statistically significant.
Age has a statistically significant influence on the dependent variable
(number of discovered chests).
Coefficient of determination: R² = 0.03778
→ 3.778% of the variation in the dependent variable (number of
discovered chests) is explained by the variation in the independent
variable (pirate’s age).
The remaining variation is due to random factors not included in the
model.
Adjusted R² = 0.03682
→ 3.682% of the variation in the number of discovered chests is
explained by age, after adjusting for the number of predictors in the
model.
We want to examine whether the number of treasure chests discovered by a pirate varies depending on: - the pirate’s age - the number of grogg mugs consumed
The goal is to determine whether age and grogg consumption significantly influence the number of discovered chests.
\[ \widehat{tchests} = 5.4313 + 0.8328 \cdot age - 0.5449 \cdot grogg \]
\[ \beta_0 = 5.4313 \]
The average number of discovered chests is 5.4313 when both age and the number of grogg mugs are equal to zero.
\[ \beta_1 = 0.8328 \]
When age increases by one year, the average number of discovered chests increases by 0.8328, holding grogg consumption constant.
\[ \beta_2 = -0.5449 \]
When the number of grogg mugs increases by one unit, the average number of discovered chests decreases by 0.5449, holding age constant.
For each parameter:
\[ H_0: \beta_i = 0 \]
\[ H_1: \beta_i \ne 0 \]
For \(\beta_0\):
\[
p\text{-value} = 0.2136 > 0.05 \Rightarrow \text{fail to reject } H_0
\]
The intercept is not statistically significant.
For \(\beta_1\) (age):
\[
p\text{-value} = 3.21 \times 10^{-10} < 0.05 \Rightarrow \text{reject
} H_0
\]
Age has a statistically significant effect on the number of discovered
chests.
For \(\beta_2\) (grogg):
\[
p\text{-value} = 0.0279 < 0.05 \Rightarrow \text{reject } H_0
\]
Grogg consumption has a statistically significant effect on the number
of discovered chests.
At the 5% significance level:
\[ CI_{95\%}(\beta_0) = [-2.6027,\; 46.0481] \]
With 95% confidence, we can state that the true value of \(\beta_0\) lies within this interval.
Since the confidence interval contains 0, the intercept is not
statistically significant at the 5% significance level.
\[ CI_{95\%}(\beta_1) = [0.4831,\; 1.0177] \]
Because the confidence interval does not contain 0, there is a statistically significant linear relationship between the number of discovered chests and the pirate’s age.
\[ CI_{95\%}(\beta_2) = [-1.0385,\; -0.0664] \]
Since the confidence interval does not contain 0, there is a statistically significant linear relationship between the number of discovered chests and grogg consumption.
\[ H_0: \beta_1 = \beta_2 = 0 \]
\[ H_1: \text{At least one independent variable coefficient is different from zero} \]
Given:
\[ p\text{-value} = 4.082 \times 10^{-10} < 0.05 \]
We reject \(H_0\) at the 5% significance level.
Conclusion:
The regression model is statistically significant.
\[ R^2 = 0.04244 \]
4.244% of the variation in the number of discovered chests is
explained jointly by age and grogg consumption.
The remaining variation is due to other random factors not included in
the model.
\[ R^2_{adj} = 0.04052 \]
After adjusting for the number of predictors, 4.052% of the variation in the dependent variable is explained by the independent variables.
We test whether favorite pirate preferences differ by sex.
\[ H_0: \text{Favorite pirate and sex are independent} \]
\[ H_1: \text{There is a significant association between favorite pirate and sex} \]
Given:
\[ p\text{-value} = 1.181 \times 10^{-76} < 0.05 \]
We reject \(H_0\) at the 5% significance level.
Conclusion:
There is a statistically significant association between favorite pirate
preference and the pirate’s sex.