This vignette demonstrates how to use functions from the
ssstats
package to carry out common statistical tasks in R.
To follow along, load the required packages:
When working with continuous data, it is important to examine
measures of central tendency (mean, median, mode), distribution, and the
correlation between variables. We will use the penguins dataset from the
palmerpenguins
package.
## # A tibble: 3 × 3
## variable mean_sd median_iqr
## <chr> <chr> <chr>
## 1 bill_depth_mm 17.2 (2.0) 17.3 (3.1)
## 2 bill_length_mm 43.9 (5.5) 44.5 (9.3)
## 3 flipper_length_mm 200.9 (14.1) 197.0 (23.0)
## $`Normality Test`
## Variable Test Statistic p-value
## 1 bill_length_mm 0.97485 1e-05
## 2 bill_depth_mm 0.97258 0e+00
## 3 flipper_length_mm 0.95155 0e+00
## 4 body_mass_g 0.95921 0e+00
## 5 year 0.79228 0e+00
##
## $`Correlation Matrix`
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1 NA NA NA
## bill_depth_mm NA 1 NA NA
## flipper_length_mm NA NA 1 NA
## body_mass_g NA NA NA 1
## year NA NA NA NA
## year
## bill_length_mm NA
## bill_depth_mm NA
## flipper_length_mm NA
## body_mass_g NA
## year 1
##
## $`QQ Plots`
The correlation matrix includes NA
values due to missing
data. To avoid this, and to exclude numerically coded categorical
variables, clean the data before applying the
normality_correlation()
function:
## $`Normality Test`
## Variable Test Statistic p-value
## 1 bill_length_mm 0.97434 1e-05
## 2 bill_depth_mm 0.97329 1e-05
## 3 flipper_length_mm 0.95171 0e+00
## 4 body_mass_g 0.95801 0e+00
## 5 year 0.79395 0e+00
##
## $`Correlation Matrix`
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.00000 -0.21386 0.66975 0.57648
## bill_depth_mm -0.21386 1.00000 -0.51732 -0.42928
## flipper_length_mm 0.66975 -0.51732 1.00000 0.84039
## body_mass_g 0.57648 -0.42928 0.84039 1.00000
## year 0.04290 -0.05706 0.15580 0.01845
## year
## bill_length_mm 0.04290
## bill_depth_mm -0.05706
## flipper_length_mm 0.15580
## body_mass_g 0.01845
## year 1.00000
##
## $`QQ Plots`
For categorical variables, use n_pct() to generate one-way or two-way frequency and percentage tables. For example, to examine the distribution of penguins by sex:
## sex n (pct)
## female 165 (49.5%)
## male 168 (50.5%)
You can also summarize by two variables. For instance, to see the distribution of species across years:
## # A tibble: 3 × 4
## year Adelie Chinstrap Gentoo
## <int> <chr> <chr> <chr>
## 1 2007 44 (30.1%) 26 (38.2%) 33 (27.7%)
## 2 2008 50 (34.2%) 18 (26.5%) 45 (37.8%)
## 3 2009 52 (35.6%) 24 (35.3%) 41 (34.5%)
By default, percentages are column-based. You can also specify
type = "row"
or type = "all"
to change the
denominator.
Depending on our research goals, we might be interested in comparing
an observed mean or proportion to a specified value to assess
statistical significance. The ssstats
package provides the
one_mean_*
and one_prop_*
functions to perform
hypothesis tests and compute confidence intervals for means and
proportions, respectively.
For demonstration purposes, let us test whether the mean bill length is greater than 40 mm (assuming 40 mm is a relevant benchmark for penguin bill length):
one_mean_HT(penguins_clean, continuous = bill_length_mm, mu = 40, alternative = "greater", alpha = 0.05)
## One-sample t-test for the population mean:
## Null: H0: μ = 40
## Alternative: H1: μ > 40
## Test statistic: t(332) = 13.323
## p-value: p < 0.001
## Conclusion: Reject the null hypothesis (p = < 0.001 < α = 0.05)
As shown in the output, we obtain a very small p-value and reject the null hypothesis that \(\mu = 40\) at the \(\alpha = 0.05\) level. Since this is a one-sided test (alternative = “greater”), we conclude that there is sufficient evidence to suggest the true population mean bill length exceeds 40 mm.
To compute the corresponding confidence interval and point estimate:
## # A tibble: 1 × 5
## mean sd conf_level ci_lower ci_upper
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 44.0 5.47 95% 43.4 44.6
We can conduct a similar analysis for proportions. For example, let us test the null hypothesis that 50% of the penguins in our research population are of the Adelie species:
one_prop_HT(data = penguins_clean, grouping = species, success = "Adelie", p = 0.5, alternative = "two.sided", alpha = 0.05)
## One-sample z-test for the population proportion:
## Null: H0: π = 0.5
## Alternative: H1: π ≠ 0.5
## Test statistic: z = 2.25
## p-value: p = 0.025
## Conclusion: Reject the null hypothesis (p = 0.0247 < α = 0.05)
To obtain the confidence interval for this proportion:
## # A tibble: 1 × 6
## proportion successes sample_size conf_level ci_lower ci_upper
## <dbl> <int> <int> <chr> <dbl> <dbl>
## 1 0.438 146 333 95% 0.386 0.492
Suppose we want to compare bill length between male and female penguins. First, check the assumption of within-group normality using QQ plots:
The QQ plots suggest deviations from normality, but with a large enough sample size, the Central Limit Theorem justifies using a t-test. Please note that this example is for illustration only. In practice, a non-parametric test may be more appropriate.
Now perform a two-sample t-test assuming equal variance:
independent_mean_HT(
penguins_clean,
continuous = "bill_length_mm",
grouping = "sex",
alternative = "two.sided",
alpha = 0.05,
variance = "equal"
)
## Two-sample t-test for two independent means and equal variance:
## Null: H₀: μ₁ − μ₂ = 0
## Alternative: H₁: μ₁ − μ₂ ≠ 0
## Test statistic: t(331) = -6.667
## p-value: p < 0.001
## Conclusion: Reject the null hypothesis (p = < 0.001 < α = 0.05)
The result suggests a statistically significant difference in bill length between sexes. To estimate the magnitude of this difference, we compute a 95% confidence interval:
independent_mean_CI(
penguins_clean,
grouping = "sex",
continuous = "bill_length_mm",
confidence = 0.95
)
## The point estimate for the difference in means is x̄₁ − x̄₂ = -3.7578
## The 95% confidence interval for μ₁ − μ₂ is (-4.8666, -2.649)
With 95% confidence, females have bills that are, on average, 2.65 to 4.87 mm shorter than males, which is a more informative conclusion than simply noting a difference!
In the previous example, we performed a two-sample t-test to see if there was a statistically significant different between male and female penguins with respect to bill length. However, we did not account for differences in bill length by species. We can resort to a regression model to determine if the same difference in bill length persists even after adjusting for species.
Now, we use the anova_check
function to assess the
assumptions of a linear model.
There is no clear evidence of either heteroscedasticity or non-normality of residuals; however, there are some points with large residuals (~10). Let us further examine these points.
In linear regression, outliers are points with large residuals, while influential points are those that significantly affect the model’s estimates. We examine residuals to assess model fit and Cook’s distance to identify observations that unduly influence the regression.
We begin with outlier detection. While definitions vary, we define
outliers as observations with standardized residuals greater than 2.5 or
less than -2.5. The function returns a ggplot
object,
allowing for further customization using +
. In this
example, point shapes are varied by penguin species to aid
interpretation.
## # A tibble: 2 × 2
## outlier count
## <chr> <int>
## 1 Not Suspected 328
## 2 Suspected 5
outlier_graph(df = penguins_clean, model = model,
x_var = bill_depth_mm, y_var = bill_length_mm) +
ggplot2::aes(shape = species) +
ggplot2::scale_shape_manual(values = c(16, 17, 15))
There is no indication of data entry errors (such as negative lengths) or other anomalies that would justify excluding these observations. Therefore, we retain them in the analytic sample and proceed to examine influential points. As with outliers, the definition of an influential observation varies. In this package, users can choose from three commonly used thresholds for identifying influential points based on Cook’s distance: (1) the rule, which flags values exceeding three times the mean Cook’s distance; (2) the convention, which highlights points with Cook’s distance greater than 0.5 or 1; and (3) the rule, which flags values above \(4/n\), where \(n\) is the sample size. The rule is used by default.
Observations 179, 283, and 296 appear to be influential. While alternative modeling approaches, such as robust regression, quantile regression, or generalized linear models with a gamma distribution, may be appropriate, we proceed under the assumption that retaining these points does not unduly compromise our analysis. We now turn to the regression coefficients to connect our findings with the earlier t-test results.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.977019 0.2307447 160.25076 2.727713e-314
## sexmale 3.693907 0.2548031 14.49710 3.597005e-37
## speciesChinstrap 10.009851 0.3412900 29.32946 8.148878e-94
## speciesGentoo 8.697533 0.2871096 30.29342 3.431234e-97
Even after adjusting for species, there is a significant difference in mean bill length between males and females. Specifically, holding species constant, males are expected to have bills that are 3.694 mm longer than females (95% CI: 3.193, 4.195). These results are consistent with the earlier t-test, though the estimated difference is slightly smaller.
The penguins
dataset does not contain variables suitable
for demonstrating dependent t-test procedures using
ssstats
. To that end, we will generate a synthetic dataset.
Let us simulate standardized pre- and post-test scores for 500
individuals. This type of data is commonly referred to as “wide” format,
where each row corresponds to an individual and each column to a
measurement occasion. The dependent t-test functions in
ssstats
require data in this format. If your data is
currently in “long” format (where each row is a single measurement),
consider reshaping it using tools from packages such as
tidyr
set.seed(123)
# Number of observations
n <- 100
# Mean and standard deviation for pre and post
mu_pre <- 0
mu_post <- 1
sd_pre <- 1
sd_post <- 1
# correlation between pre and post
rho <- 0.6
# Define variance-covariance matrix for bivariate normal
Sigma <- matrix(c(sd_pre^2, rho * sd_pre * sd_post,
rho * sd_pre * sd_post, sd_post^2),
nrow = 2)
bivn <- MASS::mvrnorm(n = n, mu = c(mu_pre, mu_post), Sigma = Sigma)
pre <- bivn[, 1]
post <- bivn[, 2]
id <- 1:n
data <- data.frame(id = id, pre = pre, post = post)
Let us begin by examining the distribution of the difference between the post- and pre-test scores. Specifically, we will inspect the QQ plot of the differences to assess the normality assumption required for the paired t-test. We will also compute the mean and median of the difference for a quick descriptive summary.
# Compare means and medians of pre and post scores
dependent_mean_median(data, col1 = pre, col2 = post, accuracy = 3)
## # A tibble: 3 × 3
## Variable `Mean (SD)` `Median (IQR)`
## <chr> <chr> <chr>
## 1 post – pre 0.904 (0.865) 0.798 (1.135)
## 2 pre 0.129 (0.943) 0.072 (1.095)
## 3 post 1.033 (0.905) 1.092 (1.161)
# Examine QQ plot of the difference (post - pre)
dependent_qq_plot(data, var1 = "pre", var2 = "post")
Assuming that all appropriate criteria for a t-test are met, we can proceed with the dependent t-test.
dependent_mean_HT(data = data, col1 = pre, col2 = post, alternative = "two.sided", alpha = 0.05, mu = 0)
## Paired t-test for the mean of differences:
## Null: H₀: μ_d = 0
## Alternative: H₁: μ_d ≠ 0
## Test statistic: t(99) = -10.45
## p-value: p < 0.001
## Conclusion: Reject the null hypothesis (p = < 0.001 < α = 0.05)
We complete this example by examining the 95% confidence interval for this difference.
## The point estimate for the mean difference is x̄ = -0.9038.
## The point estimate for the standard deviation of differences is s = 0.8649.
## The 95% confidence interval for the mean difference μ_d is (-1.0754, -0.7322).