classpackage

normality_and_correlation

In statistics, a routine task often carried out for continuous variables involves testing for normality and correlation, which can be performed using the normality_and_correlation function.

#> For demonstration sake, we will use Kendall rank correlation
output <- normality_correlation(data, "kendall")

The output variable is a list containing three items:

names(output)
#> [1] "Normality Test"     "Correlation Matrix" "QQ Plots"

Which can be subsetted for further analysis. For demonstration sake, let’s pull the correlation matrix.

correlation_matrix <- output[["Correlation Matrix"]]

correlation_matrix
#>                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g     year
#> bill_length_mm           1.00000      -0.11824           0.48199     0.42776  0.03272
#> bill_depth_mm           -0.11824       1.00000          -0.27746    -0.19237 -0.04278
#> flipper_length_mm        0.48199      -0.27746           1.00000     0.66146  0.12141
#> body_mass_g              0.42776      -0.19237           0.66146     1.00000  0.01493
#> year                     0.03272      -0.04278           0.12141     0.01493  1.00000

QQ Plot Functions

QQ plots, or quantile-quantile plots, are graphical tools used to compare two probability distributions. Typically, they are used to compare the distribution of a continuous random variable with the normal distribution to assess how closely the variable of interest follows the normal distribution. This section will examine three functions: one_qq_plot, independent_qq_plot, and dependent_qq_plot.

one_qq_plot

The one_qq_plot function is used to construct a QQ plot for a single continuous random variable. This function is also used to construct the QQ plots that are part of the normality_and_correlation function.

Let’s create a QQ plot for the bill_length_mm variable from the penguins dataset.

one_qq_plot(data, "bill_length_mm")

independent_qq_plot

QQ plots are useful for evaluating normality before conducting t-tests, whether the tests are dependent or independent. When using independent QQ plots, the normality assumption should be checked for each group being compared. In this case, we will compare the bill_length_mm between the sexes of the penguins.

The variable argument accepts the column names that hold the values being compared between the two groups. On the other hand, the grouping_variable argument takes the name of the column that contains the values used to divide the measure variable into two distinct groups.

independent_qq_plot(data,
  variable = "bill_length_mm",
  grouping_variable = "sex"
)

If the dataset is in wide format, where each group’s measurements are stored in separate columns instead of using a variable to split the measure variable, you need to provide a character vector containing the names of those columns with the observations. In this scenario, the grouping_variable and variable arguments serve a more nominal purpose as they are only used to create titles for the QQ plots. For this purpose, let’s simulate an arbitrary wide dataset.

wide_dataset <- data.frame(
  group_1 = rnorm(500, 0, 1),
  group_2 = rnorm(500, 1, 1)
)

independent_qq_plot(wide_dataset,
  variable = "xyz",
  grouping_variable = "abc",
  c("group_1", "group_2")
)

dependent_qq_plot

In dependent t-tests, the normality assumption pertains to the differences between the paired observations. The order in which the subtraction is performed doesn’t affect the test itself, but it does alter the interpretation of the results. Consequently, when using the dependent_qq_plot function, we need to specify the order of subtraction to ensure accurate interpretation of the QQ plots.

Since the penguins dataset lacks suitable data for a dependent t-test, we will create two datasets: one in wide format and the other in long format through simulation. We’ll then compare how the arguments for the dependent_qq_plot function differ between these two formats.

wide_dataset <- data.frame(
  first_measurement = rnorm(500, 0, 1),
  second_measurement = rnorm(500, 1, 1)
)

long_dataset <- data.frame(
  measurements = c(
    rnorm(500, 0, 1),
    rnorm(500, 1, 1)
  ),
  which_measurement = rep(c("first", "second"), each = 500)
)

In the long dataset, we use the variable argument, which represents the column containing the measurements we want to test. Similar to the previous function, we also use the grouping_variable argument, which holds the column name used to split the measurements into groups. Additionally, we have the first_group argument, where we specify the level of the grouping_variable from which the other level will be subtracted. The second_group argument is used to specify the level associated with the observations that will be subtracted from the first group.

dependent_qq_plot(long_dataset,
  variable = "measurements",
  grouping_variable = "which_measurement",
  first_group = "first",
  second_group = "second"
)

In the context of the dependent_qq_plot function with wide datasets, there is a slight difference in the purpose of the arguments compared to the dependent_qq_plot function with long datasets, which is similar to what was observed in the independent_qq_plot function. Here, the grouping_variable and variable arguments become irrelevant and are used solely to modify the title of the resulting QQ plot. Instead, the first_group argument takes the name of the column that stores the first set of observations, while the second_group argument takes the name of the column that stores the second set of observations.

dependent_qq_plot(wide_dataset,
  variable = "xyz",
  grouping_variable = "abc",
  first_group = "first_measurement",
  second_group = "second_measurement"
)


data
#> # A tibble: 333 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
#>  1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
#>  2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
#>  3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
#>  4 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
#>  5 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
#>  6 Adelie  Torgersen           38.9          17.8               181        3625 female  2007
#>  7 Adelie  Torgersen           39.2          19.6               195        4675 male    2007
#>  8 Adelie  Torgersen           41.1          17.6               182        3200 female  2007
#>  9 Adelie  Torgersen           38.6          21.2               191        3800 male    2007
#> 10 Adelie  Torgersen           34.6          21.1               198        4400 male    2007
#> # ℹ 323 more rows

anova_check

Lastly, let us review the anova_check() function, which allows us to assess the assumptions of a linear model. Consider modeling bill_length_mm using flipper_length_mm.

model_lm <- 
  lm(bill_length_mm ~ flipper_length_mm, data)

model_glm <- 
  glm(bill_length_mm ~ flipper_length_mm, data = data, family = "gaussian")

model_anova <- 
  aov(bill_length_mm ~ flipper_length_mm, data)

R considers all of the previous models to be identical and suitable as an input for the anova_check() function.

#> anova_check(model_lm)
#> anova_check(model_glm)
anova_check(model_anova)

The function anova_check() generates two graphs: a residuals vs. fitted graph and a QQ plot of the residuals. The residuals vs. fitted graph helps evaluate the assumption of homoskedasticity, while the QQ plot assesses the normality of residuals.