Test - Introduction to R

Analyzed data set: pirates

Loading the necessary packages:

Obiectivul analizei

Select pirates that have at least 5 tattoos or at least 5 parrots, have ‘Jack Sparrow’ as their favorite pirate, wear a headband, have a beard length of at least 15, and what are their records ordered by ID?

selectie_pirates <- pirates %>%
  filter(tattoos >= 5 | parrots >= 5) %>%
  filter(favorite.pirate == "Jack Sparrow") %>%
  filter(headband == "yes") %>%
  filter(beard.length >= 15) %>%
  arrange(id)
selectie_pirates %>%
  head(10)
##    id  sex age height weight headband college tattoos tchests parrots
## 1   1 male  28 173.11   70.5      yes   JSSFP       9       0       0
## 2   2 male  31 209.25  105.6      yes   JSSFP       9      11       0
## 3   3 male  26 169.95   77.1      yes    CCCC      10      10       1
## 4   6 male  26 190.20   85.4      yes    CCCC       7      19       0
## 5  10 male  30 183.52   84.7      yes   JSSFP      12      69       4
## 6  12 male  20 163.65   70.0      yes    CCCC      14       5       3
## 7  14 male  26 177.99   78.2      yes    CCCC       9      12       3
## 8  17 male  21 177.52   72.7      yes    CCCC      11       3       0
## 9  20 male  26 175.45   69.0      yes    CCCC      14      30       3
## 10 24 male  12 175.29   83.0      yes    CCCC      10       3       2
##    favorite.pirate sword.type eyepatch sword.time beard.length
## 1     Jack Sparrow    cutlass        1       0.58           16
## 2     Jack Sparrow    cutlass        0       1.11           21
## 3     Jack Sparrow    cutlass        1       1.44           19
## 4     Jack Sparrow    cutlass        1       0.59           17
## 5     Jack Sparrow    cutlass        1       0.71           25
## 6     Jack Sparrow    cutlass        0       0.47           27
## 7     Jack Sparrow    cutlass        1       1.33           19
## 8     Jack Sparrow    cutlass        0       0.33           20
## 9     Jack Sparrow    cutlass        0       0.40           27
## 10    Jack Sparrow    cutlass        1       0.14           24
##              fav.pixar grogg
## 1       Monsters, Inc.    11
## 2               WALL-E     9
## 3           Inside Out     7
## 4  Monsters University     7
## 5  Monsters University     9
## 6       Monsters, Inc.     8
## 7           Inside Out     7
## 8         Finding Nemo    15
## 9               WALL-E    10
## 10                  Up     8

Plot the heights and the weight of the pirates

class(pirates)
## [1] "data.frame"
str(pirates)
## 'data.frame':    1000 obs. of  17 variables:
##  $ id             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex            : chr  "male" "male" "male" "female" ...
##  $ age            : num  28 31 26 31 41 26 31 31 28 30 ...
##  $ height         : num  173 209 170 144 158 ...
##  $ weight         : num  70.5 105.6 77.1 58.5 58.4 ...
##  $ headband       : chr  "yes" "yes" "yes" "no" ...
##  $ college        : chr  "JSSFP" "JSSFP" "CCCC" "JSSFP" ...
##  $ tattoos        : num  9 9 10 2 9 7 9 5 12 12 ...
##  $ tchests        : num  0 11 10 0 6 19 1 13 37 69 ...
##  $ parrots        : num  0 0 1 2 4 0 7 7 2 4 ...
##  $ favorite.pirate: chr  "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" ...
##  $ sword.type     : chr  "cutlass" "cutlass" "cutlass" "scimitar" ...
##  $ eyepatch       : num  1 0 1 1 1 1 0 1 0 1 ...
##  $ sword.time     : num  0.58 1.11 1.44 36.11 0.11 ...
##  $ beard.length   : num  16 21 19 2 0 17 1 1 1 25 ...
##  $ fav.pixar      : chr  "Monsters, Inc." "WALL-E" "Inside Out" "Inside Out" ...
##  $ grogg          : num  11 9 7 9 14 7 9 12 16 9 ...

the variables age and weight are numeric, so we will use a scatter plot for representation

the name of the variables

names(pirates)
##  [1] "id"              "sex"             "age"             "height"         
##  [5] "weight"          "headband"        "college"         "tattoos"        
##  [9] "tchests"         "parrots"         "favorite.pirate" "sword.type"     
## [13] "eyepatch"        "sword.time"      "beard.length"    "fav.pixar"      
## [17] "grogg"
plot(height ~ weight,  pirates) # the scatter plot
abline(coef(lm(height ~ weight, pirates)), col = 'blue', lwd = 3) # the regression line

as we can see the relationship between weight and heights variables is linear

Test if there is a difference between pirates with eye patch and those without eye patch

first we will test if the variance of the pirates with eye patch and that of the pirates without eye patch is equal or not

checking the type of eye patch variable

levels(pirates$eyepatch)
## NULL
pirates$eyepatch <- as.factor(pirates$eyepatch)

setting the first group: pirates without eye patch

levels(pirates$eyepatch) <- c('fara_petec', 'cu_petec')

testing the variance with Bartlett test

bartlett.test(beard.length~eyepatch, pirates, eyepatch %in%  c('fara_petec', 'cu_petec'))
## 
##  Bartlett test of homogeneity of variances
## 
## data:  beard.length by eyepatch
## Bartlett's K-squared = 0.34336, df = 1, p-value = 0.5579

or

bartlett.test(beard.length~eyepatch, pirates)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  beard.length by eyepatch
## Bartlett's K-squared = 0.34336, df = 1, p-value = 0.5579

Hypothesis:

\[ H_0: \text{The variances of the groups are equal} \]

\[ H_1: \text{The variance of the groups is significantly different} \]

Decision: \[ p\text{-value} < 0.05 \Rightarrow \text{we reject } H_0 \text{ with a 5\% risk of Type I error} \] \[ p\text{-value} \ge 0.05 \Rightarrow \text{fail to reject } H_0 \text{ at the 5\% significance level} \] Results \[ p\text{-value} = 0.5579 > 0.05 \Rightarrow \text{fail to reject } H_0 \]

The variance between the groups is not statistically significant.

Second we will test if there is a difference between pirates with eye patch and those without eye patch we use the t test for independent samples

t.test(beard.length~eyepatch, data = pirates, alternative = 'greater', var.equal=T)
## 
##  Two Sample t-test
## 
## data:  beard.length by eyepatch
## t = 1.453, df = 998, p-value = 0.07326
## alternative hypothesis: true difference in means between group fara_petec and group cu_petec is greater than 0
## 95 percent confidence interval:
##  -0.1328453        Inf
## sample estimates:
## mean in group fara_petec   mean in group cu_petec 
##                 11.04094                 10.04255

Hypothesis

\[ H_0: \text{There is no difference between the mean beard length (≥ 15) of pirates with an eye patch and those without an eye patch} \]

\[H_1:\mu_1 < \mu_2 {\text{ the differences are significant }} \]

Used test: t test

Results:

\[ {\text{ p-value = 0.07326 > 0,05 => accepting H_0. There is no significant differences between the analyzed groups }} \]

Checking the code for errors

Can we compare the models?

ms1<-lm(pirates$beard.length~pirates$height)
ms2<-lm(pirates$beard.length~pirates$weight+I(pirates$weight^2))

The models can not be compare, because the independent variables are not the same

anova(ms1, ms2)
## Analysis of Variance Table
## 
## Model 1: pirates$beard.length ~ pirates$height
## Model 2: pirates$beard.length ~ pirates$weight + I(pirates$weight^2)
##   Res.Df   RSS Df Sum of Sq F Pr(>F)
## 1    998 81169                      
## 2    997 83450  1   -2281.5

the value of the p-value is missing

ms1<-lm(pirates$beard.length~pirates$height)
ms2<-lm(pirates$beard.length~pirates$height+I(pirates$height^2))

to be able to compere the two models the weight variable must be replace with height variable

anova(ms1, ms2)
## Analysis of Variance Table
## 
## Model 1: pirates$beard.length ~ pirates$height
## Model 2: pirates$beard.length ~ pirates$height + I(pirates$height^2)
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    998 81169                           
## 2    997 80995  1    173.89 2.1405 0.1438

\[ H_0: \text{There is no difference between the models} \]

\[ H_1: \text{The complex model is better than the simple regression model } \] \[ p\text{-value} = 0.1438 > 0.05 \Rightarrow \text{fail to reject } H_0. \]

There is no statistically significant difference between the models.

Transform a numerical variable into a variable with 3 classes

str(pirates)
## 'data.frame':    1000 obs. of  17 variables:
##  $ id             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex            : chr  "male" "male" "male" "female" ...
##  $ age            : num  28 31 26 31 41 26 31 31 28 30 ...
##  $ height         : num  173 209 170 144 158 ...
##  $ weight         : num  70.5 105.6 77.1 58.5 58.4 ...
##  $ headband       : chr  "yes" "yes" "yes" "no" ...
##  $ college        : chr  "JSSFP" "JSSFP" "CCCC" "JSSFP" ...
##  $ tattoos        : num  9 9 10 2 9 7 9 5 12 12 ...
##  $ tchests        : num  0 11 10 0 6 19 1 13 37 69 ...
##  $ parrots        : num  0 0 1 2 4 0 7 7 2 4 ...
##  $ favorite.pirate: chr  "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" ...
##  $ sword.type     : chr  "cutlass" "cutlass" "cutlass" "scimitar" ...
##  $ eyepatch       : Factor w/ 2 levels "fara_petec","cu_petec": 2 1 2 2 2 2 1 2 1 2 ...
##  $ sword.time     : num  0.58 1.11 1.44 36.11 0.11 ...
##  $ beard.length   : num  16 21 19 2 0 17 1 1 1 25 ...
##  $ fav.pixar      : chr  "Monsters, Inc." "WALL-E" "Inside Out" "Inside Out" ...
##  $ grogg          : num  11 9 7 9 14 7 9 12 16 9 ...
pirates$inaltime_cat<-cut(pirates$height, 3, c("Short", "Medium",  "Tall"))

Estimate the regression model between tchests, age and pirates variables

To see if the tchests is correlated with age we will do a scatter plot

plot(tchests ~ age,  pirates) 
abline(coef(lm(tchests ~ age, pirates)), col = 'blue', lwd = 3)

estimate the regression model

reg_m <- lm(tchests ~ age, pirates)

the statistics of the estimated model

summary(reg_m)
## 
## Call:
## lm(formula = tchests ~ age, data = pirates)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.253 -15.885  -6.750   8.225 122.142 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)   0.2083     3.6704   0.057         0.955    
## age           0.8217     0.1313   6.260 0.00000000057 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.01 on 998 degrees of freedom
## Multiple R-squared:  0.03778,    Adjusted R-squared:  0.03682 
## F-statistic: 39.19 on 1 and 998 DF,  p-value: 0.0000000005703
formula(reg_m)
## tchests ~ age

Estimated equation

tchests = 0.2083 + 0.8217 * age

Interpretation of the parameters

  1. b0 = 0.2083
    The average number of chests discovered by a pirate is 0.2083 when age is equal to zero.

  2. b1 = 0.8217
    When age increases by one year, the average number of discovered chests increases by 0.8217 chests.

Testing the regression model (Multiple Linear Regression Model – MLRM)

H0: β1 = β2 = 0
H1: At least one of the independent variable coefficients is different from zero

p-value = 0.0000000004082 < α = 0.05
⇒ We reject the null hypothesis (H0) at the 5% significance level.

Interpretation:
We can conclude that the regression model is statistically significant. Age has a statistically significant influence on the dependent variable (number of discovered chests).

Model evaluation

Coefficient of determination: R² = 0.03778
→ 3.778% of the variation in the dependent variable (number of discovered chests) is explained by the variation in the independent variable (pirate’s age).
The remaining variation is due to random factors not included in the model.

Adjusted R² = 0.03682
→ 3.682% of the variation in the number of discovered chests is explained by age, after adjusting for the number of predictors in the model.

Multiple Linear Regression Model

Research objective

We want to examine whether the number of treasure chests discovered by a pirate varies depending on: - the pirate’s age - the number of grogg mugs consumed

The goal is to determine whether age and grogg consumption significantly influence the number of discovered chests.


Estimated regression equation

\[ \widehat{tchests} = 5.4313 + 0.8328 \cdot age - 0.5449 \cdot grogg \]


Interpretation of the parameters

  1. Intercept (\(\beta_0 = 5.4313\))

\[ \beta_0 = 5.4313 \]

The average number of discovered chests is 5.4313 when both age and the number of grogg mugs are equal to zero.


  1. Age coefficient (\(\beta_1 = 0.8328\))

\[ \beta_1 = 0.8328 \]

When age increases by one year, the average number of discovered chests increases by 0.8328, holding grogg consumption constant.


  1. Grogg coefficient (\(\beta_2 = -0.5449\))

\[ \beta_2 = -0.5449 \]

When the number of grogg mugs increases by one unit, the average number of discovered chests decreases by 0.5449, holding age constant.


Testing the regression coefficients

For each parameter:

\[ H_0: \beta_i = 0 \]

\[ H_1: \beta_i \ne 0 \]


Statistical decisions

  • For \(\beta_0\):
    \[ p\text{-value} = 0.2136 > 0.05 \Rightarrow \text{fail to reject } H_0 \]
    The intercept is not statistically significant.

  • For \(\beta_1\) (age):
    \[ p\text{-value} = 3.21 \times 10^{-10} < 0.05 \Rightarrow \text{reject } H_0 \]
    Age has a statistically significant effect on the number of discovered chests.

  • For \(\beta_2\) (grogg):
    \[ p\text{-value} = 0.0279 < 0.05 \Rightarrow \text{reject } H_0 \]
    Grogg consumption has a statistically significant effect on the number of discovered chests.


Final interpretation

At the 5% significance level:

  • There is a statistically significant linear relationship between age and the number of discovered chests.
  • There is a statistically significant linear relationship between grogg consumption and the number of discovered chests.
  • The intercept is not statistically significant.

Confidence Intervals for the MLRM Parameters

Confidence interval estimation (95%)

Intercept (\(\beta_0\))

\[ CI_{95\%}(\beta_0) = [-2.6027,\; 46.0481] \]

With 95% confidence, we can state that the true value of \(\beta_0\) lies within this interval.
Since the confidence interval contains 0, the intercept is not statistically significant at the 5% significance level.


Age coefficient (\(\beta_1\))

\[ CI_{95\%}(\beta_1) = [0.4831,\; 1.0177] \]

Because the confidence interval does not contain 0, there is a statistically significant linear relationship between the number of discovered chests and the pirate’s age.


Grogg coefficient (\(\beta_2\))

\[ CI_{95\%}(\beta_2) = [-1.0385,\; -0.0664] \]

Since the confidence interval does not contain 0, there is a statistically significant linear relationship between the number of discovered chests and grogg consumption.


Overall Significance of the Regression Model

Hypotheses

\[ H_0: \beta_1 = \beta_2 = 0 \]

\[ H_1: \text{At least one independent variable coefficient is different from zero} \]

Given:

\[ p\text{-value} = 4.082 \times 10^{-10} < 0.05 \]

We reject \(H_0\) at the 5% significance level.

Conclusion:
The regression model is statistically significant.


Model Evaluation

Coefficient of Determination

\[ R^2 = 0.04244 \]

4.244% of the variation in the number of discovered chests is explained jointly by age and grogg consumption.
The remaining variation is due to other random factors not included in the model.


Adjusted Coefficient of Determination

\[ R^2_{adj} = 0.04052 \]

After adjusting for the number of predictors, 4.052% of the variation in the dependent variable is explained by the independent variables.


7. Association Between Favorite Pirate and Sex

We test whether favorite pirate preferences differ by sex.

Chi-Square Test of Independence

Hypotheses

\[ H_0: \text{Favorite pirate and sex are independent} \]

\[ H_1: \text{There is a significant association between favorite pirate and sex} \]

Given:

\[ p\text{-value} = 1.181 \times 10^{-76} < 0.05 \]

We reject \(H_0\) at the 5% significance level.

Conclusion:
There is a statistically significant association between favorite pirate preference and the pirate’s sex.