Stat

Import Dataset

library(dplyr)
library(tidyr)
library(tidyverse)
library(psych)
library(ggplot2)

df <- read.csv("Islander_data.csv")
head(df)

  first_name last_name age Happy_Sad_group Dosage Drug Mem_Score_Before
1    Bastian  Carrasco  25               H      1    A             63.5
2       Evan  Carrasco  52               S      1    A             41.6
3  Florencia  Carrasco  29               H      1    A             59.7
4      Holly  Carrasco  50               S      1    A             51.7
5     Justin  Carrasco  52               H      1    A             47.0
6       Liam  Carrasco  37               S      1    A             66.4
  Mem_Score_After Diff
1            61.2 -2.3
2            40.7 -0.9
3            55.1 -4.6
4            51.2 -0.5
5            47.1  0.1
6            58.1 -8.3

library(dplyr)
library(tidyr)
library(tidyverse)
df_long <- df %>% 
  pivot_longer(
    cols = starts_with('Mem_Score_'),
    names_to = 'time',
    values_to = 'Mem_Score'
  )


df_long$time <- str_replace(df_long$time, 'Mem_Score_', '')

df_long <- df_long %>% 
  mutate(
    time = fct_relevel(time, 'Before', 'After')
  )

head(df_long)

# A tibble: 6 × 9
  first_name last_name   age Happy_Sad_group Dosage Drug   Diff time   Mem_Score
  <chr>      <chr>     <int> <chr>            <int> <chr> <dbl> <fct>      <dbl>
1 Bastian    Carrasco     25 H                    1 A      -2.3 Before      63.5
2 Bastian    Carrasco     25 H                    1 A      -2.3 After       61.2
3 Evan       Carrasco     52 S                    1 A      -0.9 Before      41.6
4 Evan       Carrasco     52 S                    1 A      -0.9 After       40.7
5 Florencia  Carrasco     29 H                    1 A      -4.6 Before      59.7
6 Florencia  Carrasco     29 H                    1 A      -4.6 After       55.1

Descriptive

library(psych)
describe(df_long[, sapply(df_long, is.numeric)])

          vars   n  mean    sd median trimmed   mad   min max range skew
age          1 396 39.53 12.01  37.00   38.10 11.86  24.0  83  59.0 1.06
Dosage       2 396  1.99  0.82   2.00    1.99  1.48   1.0   3   2.0 0.02
Diff         3 396  2.95 10.74   1.70    2.12  6.89 -40.4  49  89.4 0.76
Mem_Score    4 396 59.44 17.03  55.95   58.10 15.64  27.1 120  92.9 0.74
          kurtosis   se
age           0.82 0.60
Dosage       -1.51 0.04
Diff          3.23 0.54
Mem_Score     0.18 0.86

#describe(df[, sapply(df, is.numeric)])

Descriptive statistics were calculated for the numeric variables. Because df_long holds two rows per participant (n = 396), age, Dosage and Diff are each counted twice; their means are unaffected, but their standard errors are understated.

Age ranges from 24 to 83, with a mean of 39.53. The median is 37, so half of the ages are at or below this value. The skewness of 1.06 indicates that the age distribution is right-skewed. The kurtosis of 0.82 (describe() reports excess kurtosis, where 0 corresponds to a normal distribution) indicates tails that are slightly heavier than those of a normal distribution.

The difference (Diff) ranges from -40.4 to 49, with a mean of 2.95. The median is 1.70, so half of the differences are at or below this value. The skewness of 0.76 indicates a right-skewed distribution. The kurtosis of 3.23 indicates a more peaked distribution with heavier tails, that is, more extreme values than a normal distribution would produce.

The memory score (before and after measurements pooled) ranges from 27.1 to 120, with a mean of 59.44. The median is 55.95, so half of the memory scores are at or below this value. The skewness of 0.74 indicates a right-skewed distribution. The kurtosis of 0.18 indicates tails very close to those of a normal distribution.

In short, the distributions of age, difference and memory score are all right-skewed. Memory score has the tail behaviour closest to a normal distribution, while the difference data contain the most extreme values.
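To make the skewness and kurtosis figures above concrete, the sketch below computes moment-based versions in base R. The helper names moment_skew and moment_excess_kurtosis are illustrative, not part of psych, and describe() may apply a slightly different small-sample adjustment, so values can differ a little from its output.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second.
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second, minus 3,
# so that a normal distribution scores 0.
moment_excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric <- c(1, 2, 3, 4, 5)        # symmetric toy data: skewness is 0
right_tail <- c(1, 1, 2, 2, 3, 10)   # one large value: positive skewness

skew_symmetric <- moment_skew(symmetric)
kurt_symmetric <- moment_excess_kurtosis(symmetric)
skew_right <- moment_skew(right_tail)
```

A positive skewness, as for right_tail, corresponds to the right-skewed shape reported for age, Diff and Mem_Score.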

library(ggplot2)

numeric_df <- df[, sapply(df, is.numeric)]

# Density-scaled histograms with an overlaid kernel density curve.
# after_stat(density) and linewidth replace the ..density.. notation and the
# size argument, both deprecated in ggplot2 3.4.0.
ggplot(numeric_df, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "lightblue", color = "black", alpha = 0.7) +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Age Distribution", x = "Age", y = "Density") +
  theme_minimal()

ggplot(numeric_df, aes(x = Diff)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "lightcoral", color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Difference Distribution", x = "Difference", y = "Density") +
  theme_minimal()

ggplot(numeric_df, aes(x = Mem_Score_Before)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "lightpink", color = "black", alpha = 0.7) +
  geom_density(color = "pink", linewidth = 1) +
  labs(title = "Memory Score Before Medication", x = "Memory Score Before", y = "Density") +
  theme_minimal()

ggplot(numeric_df, aes(x = Mem_Score_After)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "lightyellow", color = "black", alpha = 0.7) +
  geom_density(color = "yellow", linewidth = 1) +
  labs(title = "Memory Score After Medication", x = "Memory Score After", y = "Density") +
  theme_minimal()

Missing Data Control

sum(is.na(df))

[1] 0

The count of NA values is 0, so the dataset contains no missing data.
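A single total hides where missing values sit when there are any. As a minimal sketch on a made-up data frame (toy and its values are invented for illustration, not taken from Islander_data.csv), the counts can be broken down per column and per row:

```r
# Toy data with deliberately missing entries (hypothetical values).
toy <- data.frame(
  age  = c(25, NA, 29),
  Drug = c("A", "S", NA),
  Diff = c(-2.3, -0.9, NA)
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(toy))

na_per_column
complete_rows
```

Applied to df, colSums(is.na(df)) would return 0 for every column, consistent with the total above.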

More on Z-scores

Let’s use the “df_long” dataset created above to calculate Z-scores for one of its variables and answer some statistical questions. We’ll calculate Z-scores for the “Mem_Score” variable.

# Calculate the mean and standard deviation of Mem_Score
mean_mem_s <- mean(df_long$Mem_Score)
std_dev_mem_s <- sd(df_long$Mem_Score)

# Calculate Z-scores for each observation in Mem_Score
z_scores_mem_s <- (df_long$Mem_Score - mean_mem_s) / std_dev_mem_s

# Print the first few Z-scores
head(z_scores_mem_s)

[1]  0.23805562  0.10303192 -1.04760485 -1.10044021  0.01497298 -0.25507442

This code calculates Z-scores for the “Mem_Score” variable in the “df_long” dataset. Now, let’s answer some statistical questions:

1. What is the average (mean) Z-score for Mem_Score in the dataset?

mean_z_score <- mean(z_scores_mem_s)
mean_z_score

[1] -1.530762e-16

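2. How many observations in the dataset have an above-average Mem_Score (Z-score > 0)? Note that df_long holds a Before and an After row for each participant, so the counts below are observations, not people.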

above_average_mem_s <- sum(z_scores_mem_s > 0)
above_average_mem_s

[1] 167

3. How many observations in the dataset have a below-average Mem_Score (Z-score < 0)?

below_average_mem_s <- sum(z_scores_mem_s < 0)
below_average_mem_s

[1] 229

4. How many observations in the dataset have a very high Mem_Score (Z-score > 2)?

very_high_mem_s <- sum(z_scores_mem_s > 2)
very_high_mem_s

[1] 14

These examples demonstrate how you can calculate Z-scores for a variable in R, use them to answer statistical questions, and identify data points that fall within certain Z-score ranges.
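The manual calculation can be cross-checked against base R's scale(), which centers and scales in one call. The sketch below uses only the six Mem_Score values shown by head(df_long), so its Z-scores are relative to those six values and will not match the full-data output above.

```r
# First six Mem_Score values from head(df_long).
scores <- c(63.5, 61.2, 41.6, 40.7, 59.7, 55.1)

# Manual Z-scores: subtract the mean, divide by the sample standard deviation.
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() returns a one-column matrix; as.numeric() drops it to a plain vector.
z_scale <- as.numeric(scale(scores))

same_result <- isTRUE(all.equal(z_manual, z_scale))
z_mean <- mean(z_manual)   # zero up to floating-point error
z_sd <- sd(z_manual)       # exactly one by construction
```

This also explains the mean Z-score of -1.530762e-16 reported earlier: it is zero up to floating-point rounding.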

Corr

cor(df_long$Mem_Score, df_long$age, method = "spearman")

[1] 0.02573972

cor(numeric_df)

                          age     Dosage Mem_Score_Before Mem_Score_After
age               1.000000000 0.03510690       0.06601027      0.05187934
Dosage            0.035106905 1.00000000       0.04414903      0.17121922
Mem_Score_Before  0.066010267 0.04414903       1.00000000      0.80752816
Mem_Score_After   0.051879342 0.17121922       0.80752816      1.00000000
Diff             -0.009293328 0.22397945      -0.10436567      0.50232974
                         Diff
age              -0.009293328
Dosage            0.223979452
Mem_Score_Before -0.104365667
Mem_Score_After   0.502329741
Diff              1.000000000

Difference in Mem Scores (Before and After)

ggplot(df_long, aes(time, Mem_Score, fill = time))+
  geom_boxplot()+
  labs(title = "Difference in Mem Scores (Before and After)")+
  facet_wrap(~ Drug)

Paired T Test

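# Paired measurements, one pair per participant
before <- df$Mem_Score_Before
after <- df$Mem_Score_After

# Raw change in score (named score_diff to avoid masking base::diff)
score_diff <- after - before

# Log-transform the right-skewed scores before testing normality
before_l <- log(before)
after_l <- log(after)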

shapiro_test_before <- shapiro.test(before_l)
shapiro_test_after <- shapiro.test(after_l)

print(shapiro_test_before)


    Shapiro-Wilk normality test

data:  before_l
W = 0.98901, p-value = 0.1321

print(shapiro_test_after)


    Shapiro-Wilk normality test

data:  after_l
W = 0.9908, p-value = 0.2394

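According to the Shapiro-Wilk tests, both p-values (0.1321 and 0.2394) are greater than 0.05, so we fail to reject the null hypothesis of normality. The log-transformed “before” and “after” scores can therefore be treated as approximately normally distributed. Strictly speaking, the paired t-test assumes normality of the paired differences rather than of each measurement separately.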
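Since the paired t-test works on within-person differences, the same Shapiro-Wilk check can be run on those differences directly. The sketch below uses simulated data (before_sim, after_sim and all parameter values are invented for illustration); with the real data the vector to test would be after_l - before_l.

```r
set.seed(1)
n <- 50

# Simulated right-skewed "before" scores and a noisy multiplicative change.
before_sim <- rlnorm(n, meanlog = 4, sdlog = 0.3)
after_sim <- before_sim * rlnorm(n, meanlog = 0.04, sdlog = 0.15)

# Paired differences on the log scale, one per simulated participant.
log_diff <- log(after_sim) - log(before_sim)

# Shapiro-Wilk test on the differences; a p-value above 0.05 means
# normality is not rejected at the 5% level.
diff_normality <- shapiro.test(log_diff)
diff_normality$p.value
```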

Null Hypothesis (H0): Medication use has no effect on memory scores.

Alternative Hypothesis (H1): Medication use has an effect on memory scores.

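# Use the paired t-test if normality is not rejected for both samples,
# otherwise fall back to the Wilcoxon signed-rank test.
# && is the scalar logical operator intended for if() conditions.
if (shapiro_test_before$p.value > 0.05 && shapiro_test_after$p.value > 0.05) {
  test_result <- t.test(before_l, after_l, paired = TRUE)
} else {
  test_result <- wilcox.test(before_l, after_l, paired = TRUE)
}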

print(test_result)


    Paired t-test

data:  before_l and after_l
t = -3.4574, df = 197, p-value = 0.0006681
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.06792631 -0.01858276
sample estimates:
mean difference 
    -0.04325453

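According to the paired t-test, the p-value is 0.0006681 (p < 0.05), so we reject the null hypothesis: memory scores changed significantly between the two measurements. The test computes before_l minus after_l, so the negative mean difference (-0.0433 on the log scale) means that scores were higher after medication than before. Pooled across all three drug groups, medication use is therefore associated with an increase in memory scores, of roughly 4.4% in the geometric mean (exp(0.0433) ≈ 1.044).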
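Because the test was run on log-transformed scores, the mean difference is easier to read after back-transforming it into an after/before ratio. The sketch below uses only the numbers printed in the t-test output; the variable names are illustrative.

```r
# Values copied from the paired t-test output (before_l minus after_l, log scale).
mean_log_diff <- -0.04325453
ci_log_diff <- c(-0.06792631, -0.01858276)

# Negate to get after minus before, then exponentiate to get the
# ratio of geometric means (after / before).
ratio <- exp(-mean_log_diff)
ratio_ci <- sort(exp(-ci_log_diff))

# Express the ratio as a percentage change.
percent_change <- (ratio - 1) * 100
percent_change_ci <- (ratio_ci - 1) * 100
```

A ratio above 1 with a confidence interval that excludes 1 corresponds to a significant increase from before to after.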

Anova

Step-by-Step ANOVA Analysis and Interpretation

Step 1: Formulate Hypotheses

Null Hypothesis (H0): The mean of the log-transformed after score (after_l) is the same across all Drug levels.
Alternative Hypothesis (H1): At least one Drug level has a different mean log-transformed after score than the others.

Step 2: Choose a Suitable Test

Given that the data involves comparing the means of after values across more than two groups, and assuming the data meets necessary assumptions (normality, independence, and homogeneity of variances), ANOVA (Analysis of Variance) is appropriate. ANOVA allows comparing the means across three or more groups to determine if there is any statistically significant difference.
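The assumptions named above can be checked rather than assumed. As a minimal sketch on simulated data (sim and its parameter values are invented for illustration), Bartlett's test from base R checks homogeneity of variances and a Shapiro-Wilk test on the model residuals checks normality; with the real data the formula would be after_l ~ Drug.

```r
set.seed(42)

# Simulated log-scale scores for three drug groups with equal spread.
sim <- data.frame(
  Drug = factor(rep(c("A", "S", "T"), each = 30)),
  after_l = c(rnorm(30, 4.20, 0.3), rnorm(30, 4.05, 0.3), rnorm(30, 4.00, 0.3))
)

# Homogeneity of variances: H0 is that all group variances are equal.
variance_check <- bartlett.test(after_l ~ Drug, data = sim)

# Normality of residuals from the one-way ANOVA fit.
fit <- aov(after_l ~ Drug, data = sim)
residual_check <- shapiro.test(residuals(fit))

variance_check$p.value
residual_check$p.value
```

Bartlett's test is itself sensitive to non-normality, so Levene's test (available in add-on packages) is a common alternative.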

Step 3: Compute the Test Statistic

Using the dataset loaded earlier, fit the one-way ANOVA on the log-transformed after scores:

anova_result <- aov(after_l ~ Drug, data = df)

summary(anova_result)

             Df Sum Sq Mean Sq F value   Pr(>F)    
Drug          2  1.143  0.5717   7.167 0.000992 ***
Residuals   195 15.555  0.0798                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4: Obtain and Interpret the p-value

The output from the ANOVA (aov) function provides key statistics about the effects of Drug on after. Here’s how to interpret the critical elements:

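F value: 7.167, the ratio of the between-group mean square (Drug, 0.5717) to the within-group mean square (Residuals, 0.0798). A value well above 1 means that the group means vary more than within-group noise alone would explain; it is not a measure of model fit.
p-value: 0.000992, the probability of an F value at least this large if the null hypothesis were true. It is well below 0.05 and provides strong evidence against the null hypothesis.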
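Both quantities can be reproduced from the ANOVA table itself. The sketch below uses only the printed values; because the mean squares are rounded in the output, the recomputed F value matches 7.167 only approximately.

```r
# Mean squares and degrees of freedom copied from summary(anova_result).
ms_drug <- 0.5717
ms_resid <- 0.0798
df_drug <- 2
df_resid <- 195

# F is the ratio of between-group to within-group mean squares.
f_from_table <- ms_drug / ms_resid

# The p-value is the upper tail of the F distribution at the reported F value.
p_value <- pf(7.167, df1 = df_drug, df2 = df_resid, lower.tail = FALSE)

f_from_table
p_value
```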

Interpretation

Given the very low p-value, there is strong statistical evidence to reject the null hypothesis. This result suggests that the mean log-transformed after score differs between Drug levels.

The low p-value indicates that the differences in the after values across the three Drug levels are statistically significant.
The mean after value is not the same for all Drug levels. The ANOVA alone shows only that at least one level differs; it does not show which levels differ from each other.

Step 5: Make a Decision Based on the p-value and Alpha (α)

Alpha (α) Level: Typically set at 0.05.
Decision: Since the p-value (0.000992) is far less than 0.05, we reject the null hypothesis.

Conclusion

There is statistically significant evidence that the mean log-transformed after score differs across the Drug levels.

Step 6: Report the Results

Summarize the Analysis: The ANOVA revealed significant differences in the after variable across the different Drug levels.
Detailed Findings: The mean log-transformed after score differs significantly between Drug levels, indicating a group-wise difference in outcomes.
Post-hoc Analysis: Given the significant ANOVA result, perform a post-hoc analysis to identify which specific groups differ significantly. This can be done using Tukey’s Honestly Significant Difference (HSD) test.

tukey_result <- TukeyHSD(anova_result)
print(tukey_result)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = after_l ~ Drug, data = df)

$Drug
           diff        lwr         upr     p adj
S-A -0.14486239 -0.2605438 -0.02918094 0.0097155
T-A -0.17292006 -0.2890489 -0.05679119 0.0015695
T-S -0.02805767 -0.1446189  0.08850361 0.8370252

Post-hoc Analysis Interpretation

All differences below are on the log scale, computed as the first drug's mean minus the second's.

S-A Comparison: The mean difference between Drug S and Drug A is -0.1449. The 95% confidence interval for the difference is (-0.2605, -0.0292), and the adjusted p-value is 0.0097. Since the p-value is less than 0.05, this indicates a statistically significant difference: after scores are lower for Drug S than for Drug A.

T-A Comparison: The mean difference between Drug T and Drug A is -0.1729. The 95% confidence interval for the difference is (-0.2890, -0.0568), and the adjusted p-value is 0.0016. This indicates a statistically significant difference: after scores are lower for Drug T than for Drug A.

T-S Comparison: The mean difference between Drug T and Drug S is -0.0281. The 95% confidence interval for the difference is (-0.1446, 0.0885), which includes 0, and the adjusted p-value is 0.8370. This indicates no statistically significant difference between Drug T and Drug S.
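Since the ANOVA was fitted to log-transformed scores, exponentiating the Tukey differences turns them into ratios of geometric means, which are easier to interpret. The sketch below re-enters the printed values by hand; tukey_log and tukey_ratio are illustrative names.

```r
# Differences and 95% limits copied from the TukeyHSD output (log scale).
tukey_log <- data.frame(
  comparison = c("S-A", "T-A", "T-S"),
  diff = c(-0.14486239, -0.17292006, -0.02805767),
  lwr  = c(-0.2605438, -0.2890489, -0.1446189),
  upr  = c(-0.02918094, -0.05679119, 0.08850361)
)

# exp() of a difference in logs is a ratio; a ratio below 1 means the
# first drug in the comparison has the lower geometric-mean after score.
tukey_ratio <- transform(
  tukey_log,
  ratio = exp(diff),
  ratio_lwr = exp(lwr),
  ratio_upr = exp(upr)
)

tukey_ratio[, c("comparison", "ratio", "ratio_lwr", "ratio_upr")]
```

A confidence interval for the ratio that includes 1, as for T-S, corresponds to a non-significant comparison.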

Regression

Null Hypothesis (H0): Age, drug type and dosage do not have a significant effect on memory score.

Alternative Hypothesis (H1): Age, drug type and dosage have a significant effect on memory score.

model <- lm(Mem_Score ~ age+ Drug+ Dosage, data = df_long)
summary(model)


Call:
lm(formula = Mem_Score ~ age + Drug + Dosage, data = df_long)

Residuals:
   Min     1Q Median     3Q    Max 
-37.89 -12.25  -3.48  10.86  55.74 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 55.30869    3.74106  14.784  < 2e-16 ***
age          0.07720    0.07099   1.088  0.27748    
DrugS       -4.37501    2.06778  -2.116  0.03499 *  
DrugT       -6.12411    2.06801  -2.961  0.00325 ** 
Dosage       2.28822    1.03412   2.213  0.02749 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.79 on 391 degrees of freedom
Multiple R-squared:  0.03837,   Adjusted R-squared:  0.02853 
F-statistic:   3.9 on 4 and 391 DF,  p-value: 0.004054

Age had no significant effect on memory score (p = 0.277). Drug type had a significant effect: relative to Drug A, the reference level, memory scores were on average 4.38 points lower for Drug S (p = 0.035) and 6.12 points lower for Drug T (p = 0.003), holding age and dosage constant. Dosage had a positive and significant effect (p = 0.027): each one-level increase in dosage is associated with a 2.29-point higher memory score. Although the overall model is significant (F = 3.9, p = 0.004), it explains only about 3.8% of the variance in memory scores (multiple R-squared = 0.038).
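To see how the coefficients combine, the sketch below rebuilds the fitted equation by hand from the printed estimates. The helper predict_mem_score is hypothetical and only mirrors what predict(model, newdata) would do; the example participant (age 40, dosage 2) is made up.

```r
# Coefficient estimates copied from summary(model).
coefs <- c(
  "(Intercept)" = 55.30869,
  age = 0.07720,
  DrugS = -4.37501,
  DrugT = -6.12411,
  Dosage = 2.28822
)

# Predicted memory score; Drug A is the reference level, so it adds nothing.
predict_mem_score <- function(age, drug, dosage) {
  stopifnot(drug %in% c("A", "S", "T"))
  drug_shift <- switch(drug, A = 0, S = coefs[["DrugS"]], T = coefs[["DrugT"]])
  coefs[["(Intercept)"]] + coefs[["age"]] * age + drug_shift + coefs[["Dosage"]] * dosage
}

pred_a <- predict_mem_score(age = 40, drug = "A", dosage = 2)
pred_t <- predict_mem_score(age = 40, drug = "T", dosage = 2)

# The gap between the two predictions equals the DrugT coefficient.
pred_a - pred_t
```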

# Scatter plot of memory score against age with a fitted linear trend.
# The y axis shows Mem_Score, so the labels name the memory score rather than the score difference.
ggplot(df_long, aes(x = age, y = Mem_Score)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Relationship between Age and Memory Score",
       x = "Age",
       y = "Memory Score")

`geom_smooth()` using formula = 'y ~ x'

As seen in the graph, the blue line slopes slightly upwards, so memory scores rise slightly with age. However, the slope is very shallow and the points show no clear trend, which agrees with the regression result that age has no significant effect on memory score. Note that the plot shows the memory score itself (Mem_Score), not the before-after difference, so it does not show whether the effect of medication changes with age.

# Box plot of memory score by drug type.
# The y axis shows Mem_Score, so the labels name the memory score rather than the score difference.
ggplot(df_long, aes(x = Drug, y = Mem_Score)) +
  geom_boxplot() +
  labs(title = "Drug Type vs Memory Score",
       x = "Drug Type",
       y = "Memory Score")

Drug A: The overall range of memory scores is wider and the median is higher, so people on Drug A generally have higher memory scores. The box (interquartile range) is shorter than for the other drugs, so the middle half of the scores is more tightly clustered.

Drug S: The distribution of memory scores is wider. People taking Drug S show more variability in memory scores, meaning that scores differ more from person to person.

Drug T: The distribution of memory scores is also wider, but the median is lower. People on Drug T generally have lower memory scores, and these scores vary considerably.

# Memory score against age, with a separate linear trend per dosage level.
# The y axis shows Mem_Score, so the labels name the memory score rather than the score difference.
ggplot(df_long, aes(x = age, y = Mem_Score, color = as.factor(Dosage))) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Interaction between Age, Dosage and Memory Score",
       x = "Age",
       y = "Memory Score",
       color = "Dosage")

`geom_smooth()` using formula = 'y ~ x'

The graph visualizes how the relationship between age and memory score varies by dosage. The dosage levels (1, 2, 3) are represented by points and lines of different colors (red, green, blue).

Dosage 1 (Red): The red dots and line represent people at dosage level 1. This line is nearly flat, indicating no noticeable change in memory score as age increases.

Dosage 2 (Green): The green dots and line represent people at dosage level 2. This line has a slight downward trend, indicating that memory score decreases slightly as age increases.

Dosage 3 (Blue): The blue dots and line represent people at dosage level 3. This line has a slight upward trend, indicating that memory score increases slightly as age increases.
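Whether the three slopes really differ is a question a plot cannot settle on its own. One way to test it, sketched here on simulated data (sim and its parameter values are invented for illustration), is to compare a model with a common age slope against one with an age-by-dosage interaction:

```r
set.seed(7)
n <- 120

# Simulated data in which the age slope is the same at every dosage level.
sim <- data.frame(
  age = round(runif(n, 24, 83)),
  Dosage = factor(sample(1:3, n, replace = TRUE))
)
sim$Mem_Score <- 55 + 0.05 * sim$age + rnorm(n, sd = 15)

# Additive model: one shared age slope, separate intercepts per dosage.
m_additive <- lm(Mem_Score ~ age + Dosage, data = sim)

# Interaction model: a separate age slope for each dosage level.
m_interact <- lm(Mem_Score ~ age * Dosage, data = sim)

# F-test comparing the two nested models; a small p-value would indicate
# that the slopes differ between dosage levels.
slope_test <- anova(m_additive, m_interact)
slope_test
```

With the real data, the same comparison would use df_long with as.factor(Dosage), keeping in mind that df_long has two rows per participant.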