library(dplyr)
library(tidyr)
library(tidyverse)
library(psych)
library(ggplot2)
Stat
Import Dataset
df <- read.csv("Islander_data.csv")
head(df)
  first_name last_name age Happy_Sad_group Dosage Drug Mem_Score_Before Mem_Score_After Diff
1    Bastian  Carrasco  25               H      1    A             63.5            61.2 -2.3
2       Evan  Carrasco  52               S      1    A             41.6            40.7 -0.9
3  Florencia  Carrasco  29               H      1    A             59.7            55.1 -4.6
4      Holly  Carrasco  50               S      1    A             51.7            51.2 -0.5
5     Justin  Carrasco  52               H      1    A             47.0            47.1  0.1
6       Liam  Carrasco  37               S      1    A             66.4            58.1 -8.3
df_long <- df %>%
  pivot_longer(
    cols = starts_with('Mem_Score_'),
    names_to = 'time',
    values_to = 'Mem_Score'
  )
df_long$time <- str_replace(df_long$time, 'Mem_Score_', '')
df_long <- df_long %>%
  mutate(
    time = fct_relevel(time, 'Before', 'After')
  )
head(df_long)
# A tibble: 6 × 9
first_name last_name age Happy_Sad_group Dosage Drug Diff time Mem_Score
<chr> <chr> <int> <chr> <int> <chr> <dbl> <fct> <dbl>
1 Bastian Carrasco 25 H 1 A -2.3 Before 63.5
2 Bastian Carrasco 25 H 1 A -2.3 After 61.2
3 Evan Carrasco 52 S 1 A -0.9 Before 41.6
4 Evan Carrasco 52 S 1 A -0.9 After 40.7
5 Florencia Carrasco 29 H 1 A -4.6 Before 59.7
6 Florencia Carrasco 29 H 1 A -4.6 After 55.1
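The wide-to-long reshaping above can be tried on a tiny example. This is only a sketch: the data frame `toy` and its `id` column are hypothetical, and `names_prefix` is used here as an alternative to the separate `str_replace()` step, not as the author's method.

```r
library(tidyr)

# Hypothetical two-person data set in the same wide layout as df
toy <- data.frame(
  id = c(1, 2),
  Mem_Score_Before = c(63.5, 41.6),
  Mem_Score_After  = c(61.2, 40.7)
)

# One row per person and time point; names_prefix strips "Mem_Score_"
toy_long <- pivot_longer(
  toy,
  cols = starts_with("Mem_Score_"),
  names_to = "time",
  names_prefix = "Mem_Score_",
  values_to = "Mem_Score"
)

# Order the levels so that "Before" comes first in plots and models
toy_long$time <- factor(toy_long$time, levels = c("Before", "After"))

print(toy_long)
```

Each person now contributes two rows, which is why every later summary of df_long has twice as many rows as df.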
Descriptive
library(psych)
describe(df_long[, sapply(df_long, is.numeric)])
          vars   n  mean    sd median trimmed   mad   min max range skew kurtosis   se
age          1 396 39.53 12.01  37.00   38.10 11.86  24.0  83  59.0 1.06     0.82 0.60
Dosage       2 396  1.99  0.82   2.00    1.99  1.48   1.0   3   2.0 0.02    -1.51 0.04
Diff         3 396  2.95 10.74   1.70    2.12  6.89 -40.4  49  89.4 0.76     3.23 0.54
Mem_Score    4 396 59.44 17.03  55.95   58.10 15.64  27.1 120  92.9 0.74     0.18 0.86
#describe(df[, sapply(df, is.numeric)])
Descriptive statistics were calculated for the numeric variables. Because df_long holds two rows per participant (Before and After), n is 396 rather than 198. Age, Dosage and Diff are therefore counted twice, which leaves their means and medians unchanged but understates their standard errors.
Age ranges from 24 to 83, with a mean of 39.53. The median is 37, so about half of the participants are 37 or younger. The skewness of 1.06 indicates that the age distribution is skewed to the right. The kurtosis of 0.82 (describe() reports excess kurtosis, where a normal distribution scores 0) indicates tails somewhat heavier than those of a normal distribution.
The difference (Diff) ranges from -40.4 to 49, with a mean of 2.95. The median is 1.70, so about half of the differences are at or below this value. The skewness of 0.76 indicates a right-skewed distribution. The kurtosis of 3.23 indicates a sharply peaked distribution with heavy tails, that is, more extreme values than a normal distribution would produce.
The memory score (before and after pooled) ranges from 27.1 to 120, with a mean of 59.44. The median is 55.95, so about half of the scores are at or below this value. The skewness of 0.74 indicates a right-skewed distribution. The kurtosis of 0.18 is close to the normal value of 0.
In short, the distributions of age, difference and memory score are all skewed to the right. Memory score has the kurtosis closest to that of a normal distribution, and the difference scores have the heaviest tails, which points to more extreme values there.
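To make the skewness and kurtosis figures above more concrete, here is a minimal base-R sketch of one common moment-based definition. The helper `moment_shape()` and the simulated vectors are hypothetical; psych::describe() may apply a slightly different finite-sample scaling, so its numbers need not match this function exactly.

```r
# Moment-based skewness and excess kurtosis (one common definition)
moment_shape <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)  # second central moment
  m3 <- mean(d^3)  # third central moment
  m4 <- mean(d^4)  # fourth central moment
  c(skew = m3 / m2^1.5, excess_kurtosis = m4 / m2^2 - 3)
}

set.seed(1)
symmetric    <- rnorm(1000)  # simulated symmetric data
right_skewed <- rexp(1000)   # simulated right-skewed data

moment_shape(symmetric)
moment_shape(right_skewed)
```

A positive skew means the long tail is on the right; a positive excess kurtosis means heavier tails than a normal distribution.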
library(ggplot2)
numeric_df <- df[, sapply(df, is.numeric)]
# after_stat(density) and linewidth replace ..density.. and size,
# which were deprecated in ggplot2 3.4.0
ggplot(numeric_df, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "lightblue", color = "black", alpha = 0.7) +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Age Distribution", x = "Age", y = "Density") +
  theme_minimal()
ggplot(numeric_df, aes(x = Diff)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "lightcoral", color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Difference Distribution", x = "Difference", y = "Density") +
  theme_minimal()
ggplot(numeric_df, aes(x = Mem_Score_Before)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "lightpink", color = "black", alpha = 0.7) +
  geom_density(color = "pink", linewidth = 1) +
  labs(title = "Memory Score Before Medication", x = "Memory Score Before", y = "Density") +
  theme_minimal()
ggplot(numeric_df, aes(x = Mem_Score_After)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "lightyellow", color = "black", alpha = 0.7) +
  geom_density(color = "yellow", linewidth = 1) +
  labs(title = "Memory Score After Medication", x = "Memory Score After", y = "Density") +
  theme_minimal()
Missing Data Control
sum(is.na(df))
[1] 0
The data set contains no missing (NA) values.
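A single total can hide where missing values sit, so a per-column count is often more useful. The sketch below uses a hypothetical data frame `toy` with deliberately missing entries, since the real data set has none.

```r
# Hypothetical data with two missing entries
toy <- data.frame(
  age   = c(25, NA, 29),
  Drug  = c("A", "S", NA),
  score = c(63.5, 41.6, 59.7)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy))

print(na_per_column)
print(complete_rows)
```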
More on Z-scores
Let’s use the “df_long” data frame created above to calculate Z-scores for one of its variables and answer some statistical questions. We’ll calculate Z-scores for the “Mem_Score” variable.
# Calculate the mean and standard deviation of Mem_Score
mean_mem_s <- mean(df_long$Mem_Score)
std_dev_mem_s <- sd(df_long$Mem_Score)
# Calculate Z-scores for each observation in Mem_Score
z_scores_mem_s <- (df_long$Mem_Score - mean_mem_s) / std_dev_mem_s
# Print the first few Z-scores
head(z_scores_mem_s)
[1]  0.23805562  0.10303192 -1.04760485 -1.10044021  0.01497298 -0.25507442
This code calculates Z-scores for the “Mem_Score” variable in the “df_long” dataset. Now, let’s answer some statistical questions:
Because df_long has two rows per participant, the counts below refer to observations (396 in total), not to people.
1. What is the average (mean) Z-score for Mem_Score in the dataset?
mean_z_score <- mean(z_scores_mem_s)
mean_z_score
[1] -1.530762e-16
This is zero up to floating-point rounding, as expected for standardized values.
2. How many observations in the dataset have an above-average Mem_Score (Z-score > 0)?
above_average_mem_s <- sum(z_scores_mem_s > 0)
above_average_mem_s
[1] 167
3. How many observations in the dataset have a below-average Mem_Score (Z-score < 0)?
below_average_mem_s <- sum(z_scores_mem_s < 0)
below_average_mem_s
[1] 229
4. How many observations in the dataset have a very high Mem_Score (Z-score > 2)?
very_high_mem_s <- sum(z_scores_mem_s > 2)
very_high_mem_s
[1] 14
These examples demonstrate how you can calculate Z-scores for a variable in R, use them to answer statistical questions, and identify data points that fall within certain Z-score ranges.
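The manual formula above should agree with base R's scale(). The following sketch checks this on six Mem_Score values copied from head(df_long); the resulting Z-scores differ from those printed earlier because they are standardized against these six values only.

```r
# Six Mem_Score values taken from head(df_long)
x <- c(63.5, 61.2, 41.6, 40.7, 59.7, 55.1)

# Manual Z-scores: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() centers and scales in one step; as.numeric() drops its matrix attributes
z_scale <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))
print(mean(z_manual))  # zero up to floating-point rounding
print(sd(z_manual))    # one up to floating-point rounding
```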
Corr
cor(df_long$Mem_Score, df_long$age, method = "spearman")
[1] 0.02573972
cor(numeric_df)
                          age     Dosage Mem_Score_Before Mem_Score_After         Diff
age               1.000000000 0.03510690       0.06601027      0.05187934 -0.009293328
Dosage            0.035106905 1.00000000       0.04414903      0.17121922  0.223979452
Mem_Score_Before  0.066010267 0.04414903       1.00000000      0.80752816 -0.104365667
Mem_Score_After   0.051879342 0.17121922       0.80752816      1.00000000  0.502329741
Diff             -0.009293328 0.22397945      -0.10436567      0.50232974  1.000000000
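A correlation coefficient alone does not say whether it differs from zero. As a rough sketch on simulated data (the vectors `age` and `score` below are hypothetical, not the Islander data), Spearman's coefficient can be reproduced as a Pearson correlation of ranks and tested with cor.test():

```r
set.seed(42)
age   <- round(runif(50, min = 24, max = 83))  # simulated ages
score <- rnorm(50, mean = 60, sd = 17)         # simulated memory scores

# Spearman's rho is Pearson's r computed on the ranks
rho          <- cor(age, score, method = "spearman")
rho_by_ranks <- cor(rank(age), rank(score))

# exact = FALSE requests the asymptotic p-value, which tolerates tied ranks
rho_test <- cor.test(age, score, method = "spearman", exact = FALSE)

print(c(rho = rho, rho_by_ranks = rho_by_ranks))
print(rho_test$p.value)
```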
Difference in Mem Scores (Before and After)
ggplot(df_long, aes(time, Mem_Score, fill = time)) +
  geom_boxplot() +
  labs(title = "Difference in Mem Scores (Before and After)") +
  facet_wrap(~ Drug)
Paired T Test
before <- df$Mem_Score_Before
after <- df$Mem_Score_After
diff <- after - before
before_l <- log(before)
after_l <- log(after)
shapiro_test_before <- shapiro.test(before_l)
shapiro_test_after <- shapiro.test(after_l)
print(shapiro_test_before)
Shapiro-Wilk normality test
data: before_l
W = 0.98901, p-value = 0.1321
print(shapiro_test_after)
Shapiro-Wilk normality test
data: after_l
W = 0.9908, p-value = 0.2394
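According to the Shapiro-Wilk tests, both p-values are greater than 0.05 (0.1321 for the log-transformed “before” scores and 0.2394 for the log-transformed “after” scores), so we fail to reject the hypothesis of normality. The log-transformed scores can therefore be treated as approximately normally distributed. Strictly speaking, the paired t-test assumes that the paired differences are normally distributed, so that quantity is worth checking as well.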
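A check on the paired differences might look like the sketch below. The data are simulated (198 hypothetical pairs, matching the sample size implied by the t-test output), so the p-value says nothing about the real data.

```r
set.seed(7)
# Simulated paired scores; parameters are illustrative assumptions
before <- rlnorm(198, meanlog = 4, sdlog = 0.25)
after  <- before * rlnorm(198, meanlog = 0.04, sdlog = 0.15)

# The paired t-test works on these differences, so test them directly
log_diff <- log(after) - log(before)
sw_diff  <- shapiro.test(log_diff)

print(sw_diff$p.value)
```

A p-value above 0.05 would mean no evidence against normality of the differences.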
Null Hypothesis (H0): Medication use has no effect on memory scores.
Alternative Hypothesis (H1): Medication use has an effect on memory scores.
# && is the scalar logical operator intended for if() conditions
if (shapiro_test_before$p.value > 0.05 && shapiro_test_after$p.value > 0.05) {
  test_result <- t.test(before_l, after_l, paired = TRUE)
} else {
  test_result <- wilcox.test(before_l, after_l, paired = TRUE)
}
print(test_result)
Paired t-test
data: before_l and after_l
t = -3.4574, df = 197, p-value = 0.0006681
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.06792631 -0.01858276
sample estimates:
mean difference
-0.04325453
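According to the paired t-test, the p-value is 0.0006681 (p < 0.05), so we reject the null hypothesis: memory scores changed significantly between the two measurements. The estimated mean difference (before minus after, on the log scale) is -0.043, with a 95% confidence interval of (-0.068, -0.019), so scores were on average higher after medication than before. This agrees with the positive mean of Diff (2.95) in the descriptive statistics. Note that this test pools all three drug groups.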
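Two points about this test can be illustrated with a short sketch on simulated data (all vectors below are hypothetical). First, a paired t-test is the same as a one-sample t-test on the differences. Second, a mean difference on the log scale back-transforms to a ratio: for the result above, exp(0.04325453) is about 1.044, that is, scores roughly 4% higher after than before in terms of geometric means.

```r
set.seed(11)
# Simulated log scores for 30 hypothetical participants
before_l <- rnorm(30, mean = 4.0, sd = 0.25)
after_l  <- before_l + rnorm(30, mean = 0.04, sd = 0.10)

paired     <- t.test(before_l, after_l, paired = TRUE)
one_sample <- t.test(before_l - after_l)  # same test, written on the differences

# The estimate is mean(before_l - after_l); negate it to get after relative to before
ratio_after_to_before <- exp(-unname(paired$estimate))

print(unname(paired$statistic))
print(unname(one_sample$statistic))
print(ratio_after_to_before)
```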
Anova
Step-by-Step ANOVA Analysis and Interpretation
Step 1: Formulate Hypotheses
- Null Hypothesis (H0): There is no difference in the mean values of the after variable across the different Drug levels.
- Alternative Hypothesis (H1): At least one group level has a different mean value of the after variable compared to the others.
Step 2: Choose a Suitable Test
Given that the data involves comparing the means of after values across more than two groups, and assuming the data meets necessary assumptions (normality, independence, and homogeneity of variances), ANOVA (Analysis of Variance) is appropriate. ANOVA allows comparing the means across three or more groups to determine if there is any statistically significant difference.
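The assumptions named above can be checked rather than assumed. The sketch below runs on simulated data (the data frame `sim` and its parameters are hypothetical) and uses Bartlett's test for homogeneity of variances and a Shapiro-Wilk test on the model residuals; these are one possible choice of checks, not the only one.

```r
set.seed(3)
# Simulated log scores for three hypothetical drug groups of 66 each
sim <- data.frame(
  Drug    = factor(rep(c("A", "S", "T"), each = 66)),
  after_l = c(rnorm(66, 4.15, 0.28), rnorm(66, 4.00, 0.28), rnorm(66, 3.98, 0.28))
)

fit <- aov(after_l ~ Drug, data = sim)

# Homogeneity of variances across groups
homogeneity <- bartlett.test(after_l ~ Drug, data = sim)

# Normality of the residuals
resid_normality <- shapiro.test(residuals(fit))

print(homogeneity$p.value)
print(resid_normality$p.value)
```

In both tests, a p-value above 0.05 means no evidence against the corresponding assumption.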
Step 3: Compute the Test Statistic
Load and check the dataset, and then compute the ANOVA:
anova_result <- aov(after_l ~ Drug, data = df)
summary(anova_result)
             Df Sum Sq Mean Sq F value   Pr(>F)
Drug          2  1.143  0.5717   7.167 0.000992 ***
Residuals   195 15.555  0.0798
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step 4: Obtain and Interpret the p-value
The output from the ANOVA (aov) function provides key statistics about the relationship between Drug and after (analysed on the log scale as after_l). Here’s how to interpret the critical elements:
- F value: 7.167, the ratio of the between-group mean square (0.5717) to the residual mean square (0.0798). A value well above 1 means the group means vary more than within-group variation alone would explain. It is not a measure of model fit.
- p-value: 0.000992, well below 0.05, providing strong evidence against the null hypothesis.
Interpretation
Given the very low p-value, there is strong statistical evidence to reject the null hypothesis. This result suggests that the mean of the after variable differs between at least two Drug levels.
- The low p-value indicates that the differences in the after values across the three Drug levels are statistically significant.
- The mean after value is not the same for all Drug levels. The ANOVA alone does not show which levels differ; that requires a post-hoc test.
Step 5: Make a Decision Based on the p-value and Alpha (α)
- Alpha (α) Level: Typically set at 0.05.
- Decision: Since the p-value (0.000992) is far less than 0.05, we reject the null hypothesis.
Conclusion
There is statistically significant evidence that the mean of the after variable differs across Drug levels.
Step 6: Report the Results
- Summarize the Analysis: The ANOVA revealed significant differences in the after variable across the different Drug levels.
- Detailed Findings: The results suggest that the mean after value changes significantly from one group to another, indicating a group-wise difference in outcomes.
- Post-hoc Analysis: Given the significant ANOVA result, perform a post-hoc analysis to identify which specific groups differ significantly. This can be done using Tukey’s Honestly Significant Difference (HSD) test.
tukey_result <- TukeyHSD(anova_result)
print(tukey_result)
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = after_l ~ Drug, data = df)
$Drug
           diff        lwr         upr     p adj
S-A -0.14486239 -0.2605438 -0.02918094 0.0097155
T-A -0.17292006 -0.2890489 -0.05679119 0.0015695
T-S -0.02805767 -0.1446189  0.08850361 0.8370252
Post-hoc Analysis Interpretation
S-A Comparison: The mean difference between Drug S and Drug A is -0.1449. The 95% confidence interval for the difference is (-0.2605, -0.0292), and the p-value is 0.0097. Since the p-value is less than 0.05, this indicates a statistically significant difference between Drug S and Drug A.
T-A Comparison: The mean difference between Drug T and Drug A is -0.1729. The 95% confidence interval for the difference is (-0.2890, -0.0568), and the p-value is 0.0016. This indicates a statistically significant difference between Drug T and Drug A.
T-S Comparison: The mean difference between Drug T and Drug S is -0.0281. The 95% confidence interval for the difference is (-0.1446, 0.0885), and the p-value is 0.8370. This indicates no statistically significant difference between Drug T and Drug S.
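Because the ANOVA was fitted to log-transformed scores, each Tukey difference is a difference of log means. Exponentiating turns it into a ratio of geometric means, which may be easier to read. The sketch below copies the numbers from the Tukey output into a hypothetical data frame `tukey_log`; the column names for the ratios are illustrative.

```r
# Values copied from the TukeyHSD output above (log scale)
tukey_log <- data.frame(
  comparison = c("S-A", "T-A", "T-S"),
  diff = c(-0.14486239, -0.17292006, -0.02805767),
  lwr  = c(-0.2605438, -0.2890489, -0.1446189),
  upr  = c(-0.02918094, -0.05679119, 0.08850361)
)

# Back-transform: a ratio below 1 means the first drug has the lower geometric mean
tukey_ratio <- transform(
  tukey_log,
  ratio     = exp(diff),
  ratio_lwr = exp(lwr),
  ratio_upr = exp(upr)
)

print(tukey_ratio[, c("comparison", "ratio", "ratio_lwr", "ratio_upr")])
```

On this scale a confidence interval that contains 1 corresponds to a non-significant comparison, as is the case for T-S.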
Regression
Null Hypothesis (H0): Age, drug type and dosage do not have a significant effect on memory score.
Alternative Hypothesis (H1): Age, drug type and dosage have a significant effect on memory score.
model <- lm(Mem_Score ~ age+ Drug+ Dosage, data = df_long)
summary(model)
Call:
lm(formula = Mem_Score ~ age + Drug + Dosage, data = df_long)
Residuals:
Min 1Q Median 3Q Max
-37.89 -12.25 -3.48 10.86 55.74
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.30869 3.74106 14.784 < 2e-16 ***
age 0.07720 0.07099 1.088 0.27748
DrugS -4.37501 2.06778 -2.116 0.03499 *
DrugT -6.12411 2.06801 -2.961 0.00325 **
Dosage 2.28822 1.03412 2.213 0.02749 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.79 on 391 degrees of freedom
Multiple R-squared: 0.03837, Adjusted R-squared: 0.02853
F-statistic: 3.9 on 4 and 391 DF, p-value: 0.004054
Age had no significant effect on the memory score (p = 0.277). Drug type had a significant effect: relative to Drug A, the reference level, the expected memory score is 4.38 points lower for Drug S (p = 0.035) and 6.12 points lower for Drug T (p = 0.003), holding age and dosage constant. Dosage had a positive and significant effect (p = 0.027): each one-level increase in dosage is associated with a 2.29-point increase in memory score. The model as a whole is significant (F = 3.9, p = 0.004), but it explains under 4% of the variance in memory scores (R-squared = 0.038), so these predictors account for little of the variation between observations.
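The DrugS and DrugT coefficients are contrasts against the reference level, which R picks alphabetically (Drug A here). The following sketch on simulated data shows how to obtain confidence intervals and how changing the reference level changes what the coefficients mean. The data frame `sim`, its effect sizes and the column `Drug_refS` are all hypothetical.

```r
set.seed(5)
# Simulated data with the same predictors as the model above
sim <- data.frame(
  age    = round(runif(120, min = 24, max = 83)),
  Drug   = factor(rep(c("A", "S", "T"), times = 40)),
  Dosage = rep(1:3, each = 40)
)
sim$Mem_Score <- 55 + 0.05 * sim$age - 4 * (sim$Drug == "S") -
  6 * (sim$Drug == "T") + 2 * sim$Dosage + rnorm(120, sd = 16)

fit <- lm(Mem_Score ~ age + Drug + Dosage, data = sim)
ci  <- confint(fit)  # 95% confidence intervals for every coefficient

# Make Drug S the reference level instead of Drug A
sim$Drug_refS <- relevel(sim$Drug, ref = "S")
fit_refS <- lm(Mem_Score ~ age + Drug_refS + Dosage, data = sim)

print(ci)
print(coef(fit_refS))
```

With S as the reference, the coefficient for A is the negative of the original DrugS coefficient; the fitted values are unchanged.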
ggplot(df_long, aes(x = age, y = Mem_Score)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Relationship between Age and Memory Score",
       x = "Age",
       y = "Memory Score")
`geom_smooth()` using formula = 'y ~ x'
As seen in the graph, the blue line slopes slightly upwards, so memory scores rise slightly with age. However, the slope is small and the points show no clear trend, which is consistent with the non-significant age coefficient in the regression. Note that the plot shows Mem_Score (before and after pooled), not the score difference, so it does not show directly whether age changes the effect of medication use.
ggplot(df_long, aes(x = Drug, y = Mem_Score)) +
  geom_boxplot() +
  labs(title = "Drug Type vs Memory Score",
       x = "Drug Type",
       y = "Memory Score")
The plot shows Mem_Score (before and after pooled) for each drug, not the score difference.
Drug A: The scores span a wide range and the median is higher than for the other drugs, so people on Drug A generally have higher memory scores. The box is shorter than for the other drugs, which suggests that the middle half of the scores is more tightly clustered.
Drug S: The distribution of scores is wider. This indicates more variability in memory scores among people taking Drug S, meaning that scores differ more from person to person.
Drug T: The distribution of scores is also wider, but the median is lower. This indicates that people on Drug T generally have lower memory scores and that their scores vary more.
ggplot(df_long, aes(x = age, y = Mem_Score, color = as.factor(Dosage))) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Interaction between Age, Dosage and Memory Score",
       x = "Age",
       y = "Memory Score",
       color = "Dosage")
`geom_smooth()` using formula = 'y ~ x'
The graph visualizes the interaction between age and dosage and its relationship with the memory score. Different dosage levels (1, 2, 3) are represented by lines of different colors (red, green, blue).
Dosage 1 (Red): The red dots and line represent people at dosage level 1. This line is roughly horizontal, indicating no clear change in the memory score as age increases.
Dosage 2 (Green): The green dots and line represent people at dosage level 2. This line has a slight downward trend, indicating that the memory score decreases slightly as age increases.
Dosage 3 (Blue): The blue dots and line represent people at dosage level 3. This line has a slight upward trend, indicating that the memory score increases slightly as age increases.
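Differences in slope between the three lines can be tested formally by comparing a model with an age-by-dosage interaction against one without it. This is a sketch on simulated data with no built-in interaction; the data frame `sim` and the model names are hypothetical.

```r
set.seed(9)
# Simulated data: 50 hypothetical observations per dosage level
sim <- data.frame(
  age    = round(runif(150, min = 24, max = 83)),
  Dosage = factor(rep(1:3, each = 50))
)
sim$Mem_Score <- 58 + 0.05 * sim$age + rnorm(150, sd = 16)

# Same slope for every dosage level
additive <- lm(Mem_Score ~ age + Dosage, data = sim)

# A separate age slope for each dosage level
with_interaction <- lm(Mem_Score ~ age * Dosage, data = sim)

# F-test of whether the extra slope terms improve the fit
comparison <- anova(additive, with_interaction)
print(comparison)
```

A large p-value in the comparison would mean the apparent differences in slope are compatible with chance.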