In this engagement, we will practice how to perform the paired samples, independent samples t-test, ANOVA, and the Chi Square tests in R.
A loco medical anthropologist professor, Dr. Bunny, has been struggling with their own aura points. After watching several TikToks of people admitting how they embarrassingly lost 100 aura pts in the most insane way possible, Dr. Bunny knew that they needed to find a medicine that could instantly restore aura. They hypothesized that this new medicine will greater improve peoples’ aura.
Dr. Bunny invited 50 of their students to try out the new pill called
yuppie
. After some initial “my professor is giving me lsd
or something” ahh type reaction, they decided to “fck it” and give it a
go. Dr. Bunny measured their aura pts before the medicine and after. The
vector aura_pts
are the student’s mean aura before
consuming yuppie and aura_pts_yuppie
are their aura points
after consuming yuppie.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggpubr)
library(corrplot)
## corrplot 0.92 loaded
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
aura_pts = rnorm(50,0,10)
aura_pts_yuppie = aura_pts + rnorm(50, 3)
tibble(Mean_before = mean(aura_pts), sd_before = sd(aura_pts),
Mean_after = mean(aura_pts_yuppie), sd_after = sd(aura_pts_yuppie))
## # A tibble: 1 × 4
## Mean_before sd_before Mean_after sd_after
## <dbl> <dbl> <dbl> <dbl>
## 1 -3.40 8.48 -0.656 8.73
H0: There is no difference in the students’ aura points before and after taking yuppie.
H1: Students’ aura points greatly improved after taking yuppie.
# remember we are dealing with the right tail so we actually want a positive t-value. The critical value of a right tail should be 1 - alpha
qt(0.95, 49)
## [1] 1.676551
```{r} ggplot(data = data.frame(x = c(-4, 4)), aes(x)) + stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1))+ geom_area(stat = ‘function’, fun = dnorm, args = list(mean = 0, sd = 1), fill = “red”, alpha = 0.4, xlim = c(4, —))+ # modify the xlim argument HERE YES HERE scale_x_continuous(breaks = seq(-4,4,1))+ labs(x = ““, y =”proportion”)+ ggpubr::theme_pubclean()+ ggtitle(“Red = region for rejection of null”)
#### Q1.7. Calculate the test statistic
``` r
differences = aura_pts_yuppie - aura_pts
t_value = mean(differences) / (sd(differences) / sqrt(length(differences)))
t_value # ~ 110.01
## [1] 20.38598
```{r} ggplot(data = data.frame(x = c(-4, 4)), aes(x)) + stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1))+ geom_area(stat = ‘function’, fun = dnorm, args = list(mean = 0, sd = 1), fill = “red”, alpha = 0.4, xlim = c(4, —))+ # add the critical value here! geom_vline(xintercept = —) + # add the TEST value here! scale_x_continuous(breaks = seq(-4,4,1))+ labs(x = ““, y =”proportion”) + ggpubr::theme_pubclean()+ ggtitle(“Red = region for rejection of null”)
#### State your conclusions on the null hypothesis:
Well, Idk how it was possible, but yuppie seemed to work. What Dr. Bunny didn't know is they literally created the chemical compound for caffeine soo...that's what it was.
#### Scenario 2: Independent Samples t-test
The data stored in the `law_crime.csv` file contains information on violent crime rate (incidents per 100,000 people), murder rate, robery rate, incarceration rate, and some demographic information for each of the 50 US states. The observations were made in 1999 (quite a bit ago), and they also include the variable `law`, which is coded `yes` if the state had a shall-issue law in effect that year, and `no` if it did not. Shall-issue laws are laws that give residents of a state the right to carry guns - essentially, these force the state to give residents access to permits as long as the applicant meets the basic requirements set out by the law.
Some people argue that shall-issue laws affect crime because residents can stay strapped or get capped. We are interested in testing whether this is true, using violent crime rates given in `guns$violent`.
#### Read in the dataset as well as the necessary packages
``` r
library(tidyverse)
library(ggpubr)
library(ggthemes)
library(car)
library(corrplot)
guns <- read.csv("law_crime.csv")
There are two separate samples for this test: one from states without these specific gun laws and states with these specific gun laws.
\(H_0\): There is no difference in violent crime rates between states without shall-issue gun laws than states with these gun laws.
\(H_1\): The violent crime rates of states with shall-issue gun laws is significantly lower than states without these gun laws.
This is difficult to know just from the dataset alone. However, this dataset does not repeat states (the unit of observation), so we know that each observation is independent of each other.
Try replacing ---
with each group. remember we are
interested in the variables law
and violent
.
I will do the first one for you then you do the next one, okies?
qqPlot(x = guns$violent[guns$law == "yes"])
## [1] 4 20
qqPlot(x = guns$violent[guns$law == "no"])
## [1] 17 10
As you can see, the observations are within the error range of the
normal distribution. This tells us that both variables are normally
distributed. Let’s use another test just in case. I will do the first
one and you will do the next by replacing the ---
.
shapiro.test(guns$violent[guns$law == "no"])
##
## Shapiro-Wilk normality test
##
## data: guns$violent[guns$law == "no"]
## W = 0.92884, p-value = 0.1305
shapiro.test(guns$violent[guns$law == "yes"])
##
## Shapiro-Wilk normality test
##
## data: guns$violent[guns$law == "yes"]
## W = 0.96435, p-value = 0.4183
So we met the requirements.
leveneTest(guns$violent ~ guns$law)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.2908 0.5922
## 48
With a non-significant p-value at the 0.05 alpha, we can be assured that both groups have equal variance.
test = t.test(guns$violent[guns$law == "yes"],
guns$violent[guns$law == "no"],
alternative = "less")
test
##
## Welch Two Sample t-test
##
## data: guns$violent[guns$law == "yes"] and guns$violent[guns$law == "no"]
## t = -1.1308, df = 46.719, p-value = 0.132
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 31.33241
## sample estimates:
## mean of x mean of y
## 407.9207 472.6476
With a p-value of 0.868, we fail to reject the null hypothesis.
THEY ARE BACK! Yes, you already know which data to load in :D
library(palmerpenguins)
data(penguins)
Let’s say we are interested in understanding whether flipper length differs between penguin species. Since there are three penguin species, we cannot simply do a t-test so we need to perform an ANOVA!
First, we should visualize this somehow. Perhaps a boxplot. I know you know how to do this already so I will let you handle this.
library(ggplot2)
ggplot(penguins, aes(x = species, y = flipper_length_mm)) +
geom_boxplot() +
labs(title = "Flipper Length by Penguin Species")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
So yeah there are some differences. But are they significant?
Let’s make our hypotheses
\(H_0\): There is no difference in the means of flipper lengths between penguin species.
\(H_1\): There is A difference in the means of flipper lengths between penguin species.
Okay, we need to be sure we are meeting the basic assumptions of the test
Since we dealt with these penguins for a long time, we know that each penguin is only recorded once and can only have 1 species name so we are goooood.
Remember that shapiro.wilks test is only good for up to 50 observations, then it tends to be a little bit biased (only a tiny bit). It would probably be okay for our purposes, but maybe in this case we should do qq (quantile) plots instead.
I also believe that you know how to do this too. Let’s see if you can make three qqplots for each penguin species.
library(car)
qqPlot(penguins$flipper_length_mm[penguins$species == "Adelie"])
## [1] 130 96
qqPlot(penguins$flipper_length_mm[penguins$species == "Chinstrap"])
## [1] 7 48
qqPlot(penguins$flipper_length_mm[penguins$species == "Gentoo"])
## [1] 99 64
Well let’s plug in that levene test and find out! How about I help
you here. Just replace my ---
part.
{r} leveneTest(flipper_length_mm ~ species, data = penguins)
If you look at the p-value you will notice that we are in the clear!
Let the ANOVA commence! Just don’t forget to replace my ---
with something that is appropriate.
anova = aov(flipper_length_mm ~ species, data = penguins)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 52473 26237 594.8 <2e-16 ***
## Residuals 339 14953 44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
Since we met all of the requirements and can clearly see that 0.0000000000000002 is much less than 0.025, we should reject that null hypothesis.
But what if we wanted to know which groups are significant in this case? Let’s use the Tukey test!
TukeyHSD(anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = flipper_length_mm ~ species, data = penguins)
##
## $species
## diff lwr upr p adj
## Chinstrap-Adelie 5.869887 3.586583 8.153191 0
## Gentoo-Adelie 27.233349 25.334376 29.132323 0
## Gentoo-Chinstrap 21.363462 19.000841 23.726084 0
Tukey has spoken. All of them are!
Anthropologists interested in communication skills between couples in managing household chores have surveyed 1744 couples to learn about who manages certain tasks in the household. The couples were asked about a specific task, and they could answer that in their household, the task is typically done by one partner, typically done by the other, typically done jointly, or that it was done sometimes by one and sometimes by the other (alternating).
The data is available online – I’ve written the code here for you:
housetasks <- read.csv("housetasks.csv")
Its important to see what a frequency (contingency) table looks like. These are counts for each column and row, not a data table.
head(housetasks)
## Partner1 Alternating Partner2 Jointly
## 1 156 14 2 4
## 2 124 20 5 4
## 3 77 11 7 13
## 4 82 36 15 7
## 5 53 11 1 57
## 6 32 24 4 53
\(H_0\): The attribution of tasks does not differ by group.
\(H_1\): The attribution of tasks does differ by groups.
```{r}chi_test <- chisq.test(housetasks) chi_test — # print it out by replacing the — and writing chi_test
Now, let's take a look at some posthoc analysis to see which groups and tasks were relevant to the chi square test. You should calculate the contribution scores for each group by task. Do that by replacing the `---` with something appropriate.
```{r}contribut_chisq = 100 * chi_test$residuals ^ 2 / chi_test$statistic
corrplot(contribut_chisq, is.corr = FALSE)
{r}chi_test$expected[chi_test$expected < 6]
We reject the null hypothesis.
In the following section, you will be presented with three scenarios to test your amazing statistical skills that you have been practicing! Each of them will be focusing on a different type of statistical test we have discussed in this unit.
Have you ever wondered what happens to your bones as you habitually work out or perform the same daily tasks? Well, one prominent hypothesis is that your bone reacts to the stress you place on it and,overtime, will produce more bone to reinforce the regions you use more frequently. The actual process is a bit more nuanced and complex depending on the bone we are looking at, the this is the general idea.
Biological anthropologists were interested to know if living an active lifestyle increases your bone density. Therefore, they collected data on two groups of women: 35 who were not very active and 35 who were very active. They calculated the bone density for each person and now want to see if the group means are different.
They hypothesize that women who are more active will have more bone density, on average, than women who are more inactive.
Question1
What type of statistical test should these
anthropologists use to help answer their research question?
Question 2
Based on the information you were presented
with, please select on Brightspace which hypotheses make the most sense.
Here, fill in the hypotheses.
\(H_0\): There is no difference in bone density between active and inactive women.
\(H_1\): Active women have higher bone density than inactive women.
For this example, let’s choose a 0.05 alpha.
library(tidyverse)
library(ggthemes)
library(ggpubr)
library(BSDA)
## Loading required package: lattice
##
## Attaching package: 'BSDA'
## The following objects are masked from 'package:carData':
##
## Vocab, Wool
## The following object is masked from 'package:datasets':
##
## Orange
data(Bones)
# Make sure to run this next code. I didn't like the values so I changed them.
Bones = Bones %>% mutate(group = ifelse(group == "active", "Active", "Inactive"))
Question3
What type of plot will you use to visualize
the distribution of data by group. Select the closest option on
Brightspace.
Make your visualization in the code space below. Remember that the
facet_wrap
function can make two plots and split them up by
group.
ggplot(Bones, aes(x = group, y = density)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Bone Density by Activity Level", x = "Group", y = "Bone Density") +
theme_minimal()
Question4
You should accompany this plot by summary
statistics. What summary statistics would be relevant based on the plot
you chose? Select the best statistic pairs on Brightspace.
In the code below, make a table using the summarise
function from the dplyr
package that outputs the relevant
summary statistics.
Bones %>%
group_by(group) %>%
summarise(
mean_density = mean(density),
sd_density = sd(density),
count = n()
)
## # A tibble: 2 × 4
## group mean_density sd_density count
## <chr> <dbl> <dbl> <int>
## 1 Active 212. 18.7 35
## 2 Inactive 208. 21.4 35
Question5
Based on this initial summary statistics and
data visualization, what would you hypothesize about bone density by
active and inactive women? Select the most appropriate option in
Brightspace.
Question6
Does this data, as it has been described to
you, meet this requirement? Input TRUE if your answer is yes, and input
FALSE if your answer is no on Brightspace.
In the following code block, conduct your very own normality test!
library(car)
qqPlot(Bones$density[Bones$group == "Active"])
## [1] 5 20
qqPlot(Bones$density[Bones$group == "Inactive"])
## [1] 21 6
Question7
Did your data meet the assumptions for
normality? Input TRUE if your answer is yes, and input FALSE if your
answer is no on Brightspace.
In the following code chunk, conduct your very own homogeneity of variances test!
leveneTest(density ~ group, data = Bones)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.1954 0.6599
## 68
Question8
Did your data meet the assumptions for equal
variances? Input TRUE if your answer is yes, and input FALSE if your
answer is no on Brightspace.
Question9
Based on all of this information, what test
will you conduct if you were to meet all of the assumptions?
In the following code chunk, conduct the amazing t.test
function. Don’t forget the argument var.equal =
and
indicate if that is true or false. Additionally, make sure to change the
argument alternative =
to one of these options
“lesser” - lower tail test
“greater” - upper tail test
“two-sided” - non-directional test
t.test(density ~ group, data = Bones, alternative = "greater", var.equal = TRUE )
##
## Two Sample t-test
##
## data: density by group
## t = 0.83732, df = 68, p-value = 0.2027
## alternative hypothesis: true difference in means between group Active and group Inactive is greater than 0
## 95 percent confidence interval:
## -3.99458 Inf
## sample estimates:
## mean in group Active mean in group Inactive
## 211.6286 207.6000
Question10
How would you conclude based on the results
of this statistical test. Select the best option on Brightspace.
Archaeologists could visually assess that beakers, bowls, and flasks were all different than each other (on average) at a Neolithic site in Denmark. However, they knew they needed to test this hypothesis first.
To assess this, they took several 2D coordinate points to draw the shape outline of these ceramic vessels.
To conduct their test, they took the center point of all 2D points called the geometric mean which is more of a mean measurement of all of the variables within a dataset by summing all of the values, taking the log of that, and then taking the exponent of that value.
Are the Neolithic Pottery Types the Same or Different?
\(H_0\): The geometric mean of the ceramic vessel types have the same mean.
\(H_1\): The geometric mean of the ceramic vessel types do not have the same mean.
library(archdata)
library(tidyverse)
library(ggpubr)
library(ggthemes)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
data(TRBPottery)
geometric_mean <- function(x, zero_replace = 1e-10) {
x[x == 0] <- zero_replace
# Replace zero values with a small positive number
return(exp(mean(log(x))))
}
Pottery = data.frame(Type = TRBPottery$Form,
GM = apply(TRBPottery[,-1],1, geometric_mean))
Let’s Outline the Relationships Between Groups
Question11
Does this data, as it has been described to
you, meet this requirement? Input TRUE if your answer is yes, and input
FALSE if your answer is no on Brightspace.
In the following code block, conduct both a shapiro.wilks test and a qqPlot test. You will see why….
Question12
Did your data meet the assumptions for
normality? Input TRUE if your answer is yes, and input FALSE if your
answer is no on Brightspace.
In the following code chunk, conduct your very own homogeneity of variances test!
Pottery %>%
levene_test(GM ~ Type)
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 2 115 1.37 0.258
Question13
Did your data meet the assumptions for equal
variances? Input TRUE if your answer is yes, and input FALSE if your
answer is no on Brightspace.
Question14
Based on all of this information, what test
will you conduct? Select the best option on Brightspace.
Note: Despite one of the groups having a p-value of 0.04846, the qqPlot shows that it is due to 1 single individual. If this happened while doing a data analysis, you might select to remove the individual. However, another option is to do two tests and compare the results: a regular one and one that can handle non-parametric (not normally distributed) groups. If both tests show the same result, then you have good evidence to show that your conclusions were not affected by the violation of this assumption.
First, let’s conduct our typical anova test and view the results.
# Your first test
anova_result <- aov(GM ~ Type, data = Pottery)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Type 2 24.4 12.18 1.235 0.295
## Residuals 115 1133.9 9.86
Second, let’s compare that to an ANOVA that can handle non-normal data: the Kruskal-Wallis test.
# your second test
kruskal.test(GM ~ Type, data = Pottery)
##
## Kruskal-Wallis rank sum test
##
## data: GM by Type
## Kruskal-Wallis chi-squared = 2.7143, df = 2, p-value = 0.2574
Question15
How would you describe the results of both of
these tests? Select the best option on Brightspace.
Primatologists Lim and colleagues (2021) published a large
study on primate ecology and diet and shared their datasets, one of
them being the primate_diet
dataset. One of them gives us
information on primate species such as apes (gorillas, us), monkeys
(macaques), and much more. Additionally, they provided data on what
types of plants they ate.
Primatologists want to focus in on one family, the ape family, and test whether or not different apes species have the same frequency of plant types that they consume (i.e., fruits, leaves, stems, nuts). They hypothesize that different ape species do have different frequencies of consuming various plant types.
\(H_0\): The frequencies of the parts of plants consumed is not different across ape species.
\(H_1\): The frequencies of the parts of plants consumed is not different across ape species.
To answer this question, we will use a subset of their dataset called
the primate_diet
dataset. Let’s load it in as well as the
packages we will use:
library(tidyverse)
library(ggthemes)
library(ggpubr)
library(corrplot)
primate_diet = read_csv("primate_diet.csv")
## Rows: 8981 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): SpecName, PrimateSpecies, Food, PlantFamily, PlantGenus, PlantSpec...
## dbl (4): RecordID, Study_ID, Percentage_Paper, Percentages_Recalculated
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s take a look at our dataset. I believe you already know how to examine a dataset :D
As you can see, there are a lot of columns that aren’t of much use to us for this analysis. Let’s trim this dataset. We need to know our criteria first:
PrimateSpecies
and plant type PartPlant.
There is a value called “Unknown”
in PartPlant. We really shouldn’t include something that is…unknown
because its……unknown.Filter the dataset so that only the above named ape species are
included in a new dataset called ape_diet
only includes the
relevant variables for this analysis.
```{r} ape_diet <- primate_diet %>% filter(
PrimateSpecies %in% c(“Pan_troglodytes”, “Gorilla_beringei”,
“Pongo_pygmaeus”), PartPlant != “Unknown” ) %>%
select(PrimateSpecies, PartPlant)
Did you know we could visualize the frequencies of categorical data with a Bar Plot? We only need to change the y-axis to represent frequency counts. Here is how we can do it:
{r}ggplot(ape_diet, aes(x = PrimateSpecies, fill = PartPlant)) + geom_bar(position = "dodge") + labs(x = "Ape Species", y = "Frequency") + ggpubr::theme_pubclean()
Question16
Given that the y-axis of this plot indicates
the frequency, what summary statistic does the tallest bar for each
species represent? Select the best answer in Brightspace.
Now, transform the trimmed ape species dataset into a
frequency table. Call it ape_freq
. Make this in
the following code chunk:
{r}apape_freq <- ape_diet %>% count(PrimateSpecies, PartPlant) %>% pivot_wider(names_from = PartPlant, values_from = n, values_fill = 0)
Question17
How many columns are in your frequency table?
Input your answer on Brightspace.
Question18
Based on the variable types in our hypothesis
test, what kind of test should we carry out? Select the best option on
Brightspace.
Conduct the necessary test in the following code block:
```{r}test <- chisq.test(table(ape_diet\(PrimateSpecies, ape_diet\)PartPlant))
Note: did you notice that when you ran the test that it told you that the approximation may be incorrect? Can you figure out why that might be?
I will give you the first assumption and inform you that it did meet this assumption of independence of observations.
Another major assumption is that each cell in the expected frequencies are greater than 5. Check this assumption in the following code chunk:
Question19
If there were any,
which assumptions were violated in this test? Select all that apply on
Brightspace.
Let’s examine which variables in PlantPart was most important in the
Chi Square statistic. Replace the ---
with the appropriate
code.
{r}contribution = 100 * test$residuals^2/test$statistic corrplot::corrplot(contribution, is.cor = FALSE)
Question20
Which variable value had the highest
contribution to the Chi Square Statistic? Select the most appropriate
option on Brightspace.
Extra Credit
Knit your Course Engagement and Submit it
to Brightspace! It will be saved as a .html
this time
because CE 2 had a weird issue with PDF.