This guide can help you decide which statistical procedure to use when your analysis involves one dependent variable and one independent variable, as long as you know which variable is which, and whether each variable is measured categorically or continuously. Once you have settled on a procedure, click the procedure to jump to some code that you can copy into R and run after modifying it to specify the names of the data file and variables involved. You’ll also find links to worked examples for each procedure.
To begin, use this flow chart to decide which bivariate statistical procedure to use. Click a procedure to jump to copyable R code for that procedure.
START HERE | ||||||||
---|---|---|---|---|---|---|---|---|
↓ | Categorical | → | Chi-Square test | |||||
DV level? | → | Categorical | → | IV level? | → | ↕ | ||
↓ | Continuous | → | Logistic regression | |||||
Continuous | ||||||||
↓ | ||||||||
IV level? | → | Continuous | → | Bivariate regression | ||||
↓ | ||||||||
Categorical | ||||||||
↓ | ||||||||
#IV categories? | → | 3 or more | → | One-way ANOVA | ||||
↓ | ||||||||
Only 2 | ||||||||
↓ | ||||||||
Independent, or paired? |
→ | Independent | → | Independent-samples t-test | ||||
↓ | ||||||||
Paired | → | Paired-samples t-test |
Use a chi-square test when your dependent and independent variables
are both categorical. #Edit
notes indicate which parts of
the code must be customized. See the Chi-Square test worked
example for help interpreting the script’s output.
#Installing required packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("gmodels")) install.packages("gmodels")
library(gmodels)
library(ggplot2)
#Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
#Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
#Look at the DV and IV
ggplot(mydata, aes(x = DV, fill = IV)) +
geom_bar() +
scale_fill_manual(values = c("grey","dodgerblue"))
#Make the crosstab table
CrossTable(mydata$DV, mydata$IV,
prop.chisq = FALSE,
prop.t = FALSE,
prop.r = FALSE)
#Run the chi-squared test
options(scipen = 999)
chitestresults <- chisq.test(mydata$DV, mydata$IV)
chitestresults
Use logistic regression when the DV is dichotomous and coded as
either 1 or 0, and the IV is continuous. #Edit
notes
indicate which parts of the code must be customized. See the Logistic regression worked
example for help interpreting the script’s output.
#Installing required packages
if (!require("tidyverse")) install.packages("tidyverse")
library(ggplot2)
#Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
#Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
#Look at the DV and IV
ggplot(mydata, aes(x = DV))+geom_bar(color="black",fill="dodgerblue")
ggplot(mydata, aes(x = IV))+geom_histogram(color="black",fill="dodgerblue")
#Run the logistic regression and view the summary results
options(scipen = 999)
log.ed <- glm(DV ~ IV, data=mydata, family="binomial")
summary(log.ed)
Use bivariate regression when both the DV and IV are continuous.
Bivariate regression works best when no major outliers are evident in
one variable, the other, or the two variables combined.
#Edit
notes indicate which parts of the code must be
customized. See the bivariate regression worked
example for help interpreting the script’s output.
#Start here
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
##Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
#Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
#Look at the DV and IV
ggplot(mydata, aes(x = DV))+geom_histogram(color="black",fill="dodgerblue")
ggplot(mydata, aes(x = IV))+geom_histogram(color="black",fill="dodgerblue")
# Creating and summarizing an initial regression model called myreg, and checking for bivariate outliers.
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
plot(mydata$IV,mydata$DV)
abline(lm(mydata$DV~mydata$IV))
If you see what looks like an outlier or two, run the code below,
look for hatvalue
figures that are notably high compared to
the others. The number to the left of each hatvalue
figure
is the row number in the mydata
data frame of the case that
produced the hatvalue
.
leverage <- as.data.frame(hatvalues(myreg))
view(leverage)
A case with a notably high “hatvalues” score, relative to the other
cases, may be an influential outlier and could be deleted, if you judge
doing so necessary. You can delete a row from the data frame with the
code below, where 2
is the row number of the case you want
to delete.
mydata <- mydata[-c(2), ]
If you need to delete two or more rows, you may delete all of them at once with the code below, where 2, 4, and 6 - and as many more digits as you need - are the row numbers of the cases you want to delete.
mydata <- mydata[-c(2, 4, 6), ]
With the suspected outliers deleted, you can re-run the analysis and make a final decision about whether to keep or delete the suspected outliers.
Use one-way analysis of variance (one-way ANOVA) when both the DV is
continuous, the IV is categorical, and the IV has three or more
categories. One-way ANOVA works best when no major outliers are evident
in the distributions of the groups being compared. #Edit
notes indicate which parts of the code must be customized. See the One-way ANOVA worked example
for help interpreting the script’s output.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
#Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
#Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME
mydata$IV <- mydata$YOURIVNAME
#Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill="dodgerblue")+
geom_vline(data = averages, aes(xintercept=mean,))
#Calculate and show the group counts, means, standard
#deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV,na.rm = TRUE))
#ANOVA
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
# If the ANOVA detects significant difference, run this post-hoc procedure
# to learn which group pairs differed significantly
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
#See annotated example output
#Using the TNBBAccess.csv dataset and:
#mydata$DV <- mydata$PctBB
# mydata$IV <- mydata$MedIncome_3
Use an independent-samples t-test when the DV is continuous, and the
IV is categorical and has two categories that represent independent,
mutually exclusive groups. See the Not sure …
section for explanations of what “independent” and “mutually exclusive”
mean. #Edit
notes indicate which parts of the code must be
customized. See the Independent-samples
t-test worked example for help interpreting the script’s output.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
#Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
#Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME
mydata$IV <- mydata$YOURIVNAME
#Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill="dodgerblue")+
geom_vline(data = averages, aes(xintercept=mean,))
#Calculate and show the group counts, means, standard
#deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV,na.rm = TRUE))
# If you see evidence of non-normality in the distributions of one or both groups, use a Shapiro-Wilk test.
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
# A significant Shapiro-Wilk test for one or both groups means you should use a Wilcoxon signed rank test rather thana t-test.
options(scipen = 999)
wilcox.test(mydata$DV ~ mydata$IV)
# If both groups are normally distributed, though, you may run a t-test.
options(scipen = 999)
t.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
Use a paired-samples t-test when the DV is continuous, and the
categorical IV distinguishes between two separate, but connected, sets
of DV measures. See the Not sure … section for
explanations of what “separate, but connectd” means. #Edit
notes indicate which parts of the code must be customized. See the Paired-samples t-test
worked example for help interpreting the script’s output.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
#Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
#Specify the two variables involved
mydata$V1 <- mydata$YOURV1NAME
mydata$V2 <- mydata$YOURV2NAME
#Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1
ggplot(mydata, aes(x = PairDifferences)) +
geom_histogram(color = "black", fill = "dodgerblue") +
geom_vline(aes(xintercept = mean(PairDifferences)))
#Get descriptive statistics for pair differences
mydata %>%
select(PairDifferences) %>%
summarise(
count = n(),
mean = mean(PairDifferences, na.rm = TRUE),
sd = sd(PairDifferences, na.rm = TRUE),
min = min(PairDifferences, na.rm = TRUE),
max = max(PairDifferences, na.rm = TRUE))
#If pair differences look non-normal, you can use a Shapiro-Wilk test to check
#whether their distribution differs significantly from normal. If the
#Shapiro-Wilk test p-value is less than 0.05, #use a Wilcoxon signed rank test
#instead of a paired-samples t-test.
#Shapiro-Wilk test (Checks whether pair differences distribution is non-normal)
options(scipen = 999)
shapiro.test(mydata$PairDifferences)
#Wilcoxon signed rank test (Use if pair differeces distribution is non-normal)
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean,SD = sd))
wilcox.test(mydata$V1, mydata$V2, paired = TRUE)
#t-test (Use if pair differences distribution is normal)
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean,SD = sd))
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
paired = TRUE)
In the TNBBAccessData.csv
data file, the PctBB_2
variable indicates whether each
county in Tennessee has a “High” level of broadband internet access or a
“Low” level of broadband internet access. Meanwhile, the
MedIncome_2
variable indicates whether each county has a
“High” household income level or a “Low” household income level. See the
Example dataset codebook for
details about both variables. The code in this example uses a
Chi-Squared test to look for evidence of a relationship between the two
variables.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and gmodels
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
if (!require("gmodels")) install.packages("gmodels")
## Loading required package: gmodels
library(gmodels)
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
The DV and IV variable names have been edited to specify
PctBB_2
as the dependent variable, and
MedIncome_2
as the independent variable. Again, see the Example dataset codebook for
details about both variables. Setting the analysis up in this way
assumes that broadband internet access depends upon household
income.
#Specify the DV and IV
mydata$DV <- mydata$PctBB_2 #Edit YOURDVNAME
mydata$IV <- mydata$MedIncome_2 #Edit YOURIVNAME
This code will produce a stacked column chart that gives you a look at whether, and how, the DV and IV might be related.
#Look at the DV and IV
ggplot(mydata, aes(x = IV, fill = DV)) +
geom_bar() +
scale_fill_manual(values = c("grey","dodgerblue"))
The chart shows that counties with low levels of broadband access appear much more common among low-income counties than among high-income counties. It also shows there are no outliers or oddly-coded cases that need to be examined.
It would be informative, though, to get the specific counts and percentages depicted in the chart. This code produces a table showing those numbers.
#Make the crosstab table
CrossTable(mydata$DV, mydata$IV,
prop.chisq = FALSE,
prop.t = FALSE,
prop.r = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 95
##
##
## | mydata$IV
## mydata$DV | 1 Low income | 2 High income | Row Total |
## -----------------|---------------|---------------|---------------|
## 1 Low BB access | 38 | 11 | 49 |
## | 0.792 | 0.234 | |
## -----------------|---------------|---------------|---------------|
## 2 High BB access | 10 | 36 | 46 |
## | 0.208 | 0.766 | |
## -----------------|---------------|---------------|---------------|
## Column Total | 48 | 47 | 95 |
## | 0.505 | 0.495 | |
## -----------------|---------------|---------------|---------------|
##
##
The table shows that 79 percent of the low-income counties (38 of the 48) are also in the low-broadband-access category, compared to only 23 percent (11) of the 47 counties in the high-income category are in the low-broadband-access category. Put another way, low-broadband-access counties are much more common among low-income counties than among high-income counties. This pattern doesn’t prove that low broadband access is a result of low income. But it does support the assertion that low broadband access is at least associated with low income.
But random chance could produce the same pattern, at least sometimes. A chi-squared test can help you decide whether to rule out random chance as an explanation for the pattern. Here’s the code:
#Run the chi-squared test
options(scipen = 999)
chitestresults <- chisq.test(mydata$DV, mydata$IV)
chitestresults
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$DV and mydata$IV
## X-squared = 27.375, df = 1, p-value = 0.0000001676
… and the results indicate that the chi-quare value computed (27.375), with 1 degree of freedom (df) has a p-value of 0000001676. The p-value is not greater than 0.05. In fact, it’s quite a bit less. As a result, you can reasonably rule out randomness as an explanation for the pattern of association between income and broadband access. To summarize the analysis results more succinctly, and in a more scholarly format:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that the percentage of households with high broadband access would be significantly greater among high-income counties than among low-income counties. It found that about 77 percent of the high-income counties examined had high broadband access, compared to only about 21 percent of low-income counties. A chi-squared test found the difference to be statistically significant, X2 (1, N = 95) = 27.38, p < .05. The results support the hypothesis.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB_2_NUM
and MedIncome
variables. PctBB_2_NUM
distinguishes between Tennessee
counties with “Low” and “High” levels of broadband internet access by
assigning “Low” counties a 0
and “High” counties a
1
. Thus, PctBB_2_NUM
is a categorical
variable, even though its two categories are represented with numeric
digits. Meanwhile, MedIncome
is a continuous variable
containing each Tennessee county’s median household income. See the Example dataset codebook for
details about both variables.
The analysis assumes that whether a county has a “Low” or “High”
level of broadband internet access depends, at least partly, on the
county’s median household income. Thus, PctBB_2_NUM
is the
dependent variable, and MedIncome
is the independent
variable. Logistic regression is suitable for the analysis because the
dependent variable is categorical and has only two categories, one
represented by a 0
, and the other represented by a
1
.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("tidyverse")) install.packages("tidyverse")
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
For the reasons described above, the DV is specified as
PctBB_2_NUM
, while the IV is specified as
MedIncome
, and each variable’s distribution is examined for
outliers, data entry errors, and other anomalies.
#Specify the DV and IV
mydata$DV <- mydata$PctBB_2_NUM
mydata$IV <- mydata$MedIncome
#Look at the DV and IV
ggplot(mydata, aes(x = DV))+geom_bar(color="black",fill="dodgerblue")
ggplot(mydata, aes(x = IV))+geom_histogram(color="black",fill="dodgerblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A clear outlier is evident in the MedIncome
variable’s
distribution. Examining the data reveals that the outlier is Williamson
County, which has an unusually high median household income for
Tennessee’s counties. But the outlier does not stem from a data entry
error; Williamson County’s median household income really is that high.
Furthermore, outliers aren’t much of a problem for logistic regression.
For these reasons, the Williamson County value was allowed to remain in
the dataset. If the outlier raised a concern about the validity of the
analysis’ conclusion, the outlier could be deleted so that the results
with and without the outlier could be compared.
This code runs the logistic regression and displays the results:
#Run the logistic regression and view the summary results
options(scipen = 999)
log.ed <- glm(DV ~ IV, data=mydata, family="binomial")
summary(log.ed)
##
## Call:
## glm(formula = DV ~ IV, family = "binomial", data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2138 -0.8051 -0.3234 0.8440 1.7984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.58002397 2.07299037 -4.621 0.00000381 ***
## IV 0.00020691 0.00004551 4.547 0.00000545 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 131.603 on 94 degrees of freedom
## Residual deviance: 94.102 on 93 degrees of freedom
## AIC: 98.102
##
## Number of Fisher Scoring iterations: 5
Two figures in the output are particularly important: The
z value
, which you will need for the write-up, and the
probability value under Pr(>|z|)
, which must be 0.05 or
smaller to indicate statistical significance. Here, the rounded
z value
is 4.55, and the Pr(>|z|)
value,
0.00000545, is far less than 0.05, indicating statistical
significance.
The analysis could be summarized in this way:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that county-level median household income would significantly predict whether a county had high or low broadband access. A logistic regression procedure found that higher median household income levels significantly predicted high broadband access (z = 4.55, p < .05). The results support the hypothesis.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome
, also a
continuous variable, contains each Tennessee county’s median household
income. See the Example dataset
codebook for details about both variables.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome
is the independent variable.
Bivariate regression is suitable for the analysis because both the
dependent variables are continuous.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
mydata <- read.csv("TNBBAccessData.csv")
For the reasons described above, the DV is specified as
PctBB
, while the IV is specified as MedIncome
.
Next, each variable’s distribution is examined for outliers, data entry
errors, and other anomalies.
#Specify the DV and IV
mydata$DV <- mydata$PctBB #Edit YOURDVNAME
mydata$IV <- mydata$MedIncome #Edit YOURIVNAME
#Look at the DV and IV
ggplot(mydata, aes(x = DV))+geom_histogram(color="black",fill="dodgerblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mydata, aes(x = IV))+geom_histogram(color="black",fill="dodgerblue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the dependent variable, PctBB
, shows
what appear to be two outliers. One county has an unusually low
percentage of households with broadband access (somewhere around 50
percent), and another has an unusually high level of broadband access
(well above 90 percent). The distribution of the independent variable,
MedIncome
, also shows at least one, and maybe as many as
three, counties with unusually high median household incomes.
Because linear regression is being used, these outliers could lead to incorrect conclusions about the relationship between the dependent and independent variables. Creating the regression model and plotting its results will yield further insights.
Let’s begin with the regression results, which this code will produce and display:
# Creating and summarizing an initial regression model called myreg
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
##
## Call:
## lm(formula = DV ~ IV, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5892 -3.1303 0.5912 3.0848 9.0294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.85438413 2.11083922 23.14 <0.0000000000000002 ***
## IV 0.00049558 0.00004363 11.36 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.584 on 93 degrees of freedom
## Multiple R-squared: 0.5811, Adjusted R-squared: 0.5766
## F-statistic: 129 on 1 and 93 DF, p-value: < 0.00000000000000022
Looks like broadband access is significantly related to median
income. The p-value
for the F-statistic
is
<0.00000000000000022, with is far below the 0.05 threshold for
statistical significance. The model’s R-squared
value is
0.5811, which indicates that county median household income explains
about 58 percent of the variation in broadband access. That’s pretty
good. The table under Coefficients:
indicates that the
coefficient for the independent variable, median household income, is
0.00049558. It’s positive, which means that as median household income
increases, so does broadband internet access.
The same table provides the model’s Intercept
value,
48.85438413. Plugging it and the IV’s coefficient into the equation for
a line:
y = mx + b
… means that you can estimate any Tennessee county’s percentage of households with broadband (during this time period, at least) using this equation:
Pct. broadband = (0.00049558*median household income) + 48.85438413.
Put another way: At the time these day were collected, every 0.00049558 increase in median household income was associated with a 1-percentage-point increase in the percentage of households with broadband internet access.
But what about those outliers? Could this wonderfully concise regression model be spectacularly significant chiefly because of a few outliers? Let’s look at a plot of the relationship between household income and broadband access:
# Checking the initial regression model for outliers
plot(mydata$IV,mydata$DV)
abline(lm(mydata$DV~mydata$IV))
The plot shows a positive, reasonably linear relationship between the broadband access and household income measures. The outliers noted in the distributions of each variable are readily apparent, though. Both appear as cases far off the trend line, one at the low end, and the other at the high end. However, it seems unlikely that the two outliers account for all, or even most, of the linearity. Even the non-outlier cases show an upward relationship between household income and broadband access. Just in case, though, let’s check the leverages.
leverage <- as.data.frame(hatvalues(myreg))
view(leverage)
Case 94 (Williamson County) and case 95 (Wilson County) have
unusually large hatvalues
. Just to be safe, let’s delete
them, re-run the analysis, and compare the regression results with and
without these two outliers.
mydata <- mydata[-c(94,95), ]
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
##
## Call:
## lm(formula = DV ~ IV, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1885 -2.3811 -0.0759 3.2838 8.5539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.60572890 2.66814020 15.97 <0.0000000000000002 ***
## IV 0.00063393 0.00005699 11.12 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.334 on 91 degrees of freedom
## Multiple R-squared: 0.5762, Adjusted R-squared: 0.5716
## F-statistic: 123.7 on 1 and 91 DF, p-value: < 0.00000000000000022
plot(mydata$IV,mydata$DV)
abline(lm(mydata$DV~mydata$IV))
Comparing the output of this regression analysis to the output of the
earlier regression analysis (before the outliers were deleted) shows
little change. The model is still significantly positive. The
R-squared
figure, 0.5762, dropped a tad, but not much. All
in all, deleting the outliers didn’t make much difference.
In this particular case, I think the benefits of retaining the outliers outweigh the benefits of dropping them. The analysis could be summarized like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that county-level median household income would correlate positively and significantly with county-level percentages of households with broadband access. Regression supported the hypothesis (F (1, 93) = 129, p < .05, R2 = .58). Two outliers were detected: Williamson County and Wilson County. Both had unusually high levels of both household income and broadband internet access. But repeating the analysis with both outliers omitted produced no substantial change in the results.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome_3
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome_3
, a categorical
variable, categorizes each Tennessee county’s median household income as
“1 Low income,” “2 Moderate income,” or “3 High income.” See the Example dataset codebook for
details about both variables.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome_3
is the independent variable.
One-way ANOVA is suitable for the analysis because the dependent
variable is continuous, and the independent variable is categorical and
has more than two categories. Technically, the “1,” “2,” and “3” digits
at the start of each category label are not needed. But including them
is a shortcut for telling R the order of the categories, from lowest to
highest. Without them, the categories would appear in alphabetical
order, and additional code would be needed to tell R that they are
ordinal-level variables and what order they should be put in.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
For the reasons noted above, these lines define PctBB
as
the dependent variable and MedIncome_3
as the independent
variable.
#Specify the DV and IV
mydata$DV <- mydata$PctBB
mydata$IV <- mydata$MedIncome_3
ANOVA compares averages, and an average can mislead you if the data
your averaged included outliers or had some weirdly non-normal
distribution. This code graphs the distribution of PctBB
within each of the three MedIncome_3
groups and adds
vertical lines showing each group’s average.
#Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill="dodgerblue")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph suggests that “low income” counties have a lower average broadband access level than “moderate income” counties, and that “low” and “moderate” income counties both have lower average broadband access than “high income” counties.
Some possible outliers are evident, too. The “low income”
distribution shows one county in which broadband access looks a lot
lower than the other “low” counties. It is no doubt pulling the group
average down - maybe enough to be mostly or entirely responsible for the
difference between the “low” and “moderate” group averages. In fact,
only about seven of the “low” counties fall below the range of the
“moderate” group. Similarly, only about eight of the “high” counties
fall above the “moderate” group’s range. As noted in the Example dataset codebook, the
MedIncome_3
groups were formed by sorting the county
broadband access figures from smallest to greatest, then dividing the
counties into three roughly equal groups. It might be that this wasn’t
the most accurate way to group the counties by income, at least for the
purpose of looking at the connection between income and broadband
access.
Anyway, it’s difficult to tell from the graph what the precise average values are. The graph indicates, for example, that the lowest county in the “low” group has a broadband access figure of around 50 percent. But the graph doesn’t indicated exactly what the figure is. This code shows the exact value of each group distribution’s average, along with each distribution’s exact standard deviation, minimum value, and maximum value.
#Calculate and show the group counts, means, standard
#deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE))
## # A tibble: 3 × 4
## IV count mean sd
## <chr> <int> <dbl> <dbl>
## 1 1 Low income 31 66.2 5.23
## 2 2 Moderate income 31 72.4 4.11
## 3 3 High income 33 77.7 6.18
It looks like the average broadband access figures within the “low,”
“moderate” and “high” income groups are 66.2, 72.4, and 77.7,
respectively. Recall that MedIncome_3
happens to represent
percentages in this case. So, among the counties in the “low” group, the
average percentage of households with broadband access is 66.2 percent.
Among the “moderate” group’s counties, it’s 7.4 percent, and it’s 77.7
percent among the “high” group’s counties. Also, that low outlier in the
“low” group must be 51.2 percent, the “low” group’s minimum value. A
quick check of the data frame indicates that the 51.2 percent figure is
from Hancock County, in Row 34 of the data frame. Meanwhile, the high
outlier in the “high” group must be 93.4 percent - the figure belonging
to Williamson County, in Row 94 of the data frame.
Fundamentally, one-way ANOVA uses an F test to examine the differences among the group averages and indicate whether at least one group’s average differs significantly from at least one other group’s average. This code runs the ANOVA:
#ANOVA
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: mydata$DV and mydata$IV
## F = 32.824, num df = 2.000, denom df = 60.157, p-value =
## 0.0000000002304
The resulting F statistic, 32.824, has an associated
p-value
of 0.0000000002304. That value is far less than
0.05, so the ANOVA results indicate that at least one group average
differs significantly from at least one other group average.
Had the F statistic been nonsignificant, the analysis could stop here and conclude that the data show no evidence of a relationship between income and broadband access.
But knowing that at least one group average differs significantly from at least one other group average produces a bit of a cliffhanger: Which averages differ significatly from each other? In the graph, the “moderate income” average looks a little closer to the “high income” average than the “low income” average. Are the “moderate” and “high” income averages close enough to be essentially the same? Are both the “moderate” and “high” income averages far enough from the “low income” average to be significantly greater? Or does each group average differs significantly from each other group average?
Tukey’s HSD (“honestly significant difference”) is a post hoc (“after this”) test designed to detect differences between group averages after an ANOVA has found that at least one such difference exists. This code will run the test and display the results:
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mydata$DV ~ mydata$IV)
##
## $`mydata$IV`
## diff lwr upr p adj
## 2 Moderate income-1 Low income 6.222581 3.037414 9.407748 0.0000323
## 3 High income-1 Low income 11.502835 8.366299 14.639371 0.0000000
## 3 High income-2 Moderate income 5.280254 2.143718 8.416790 0.0003598
The table shows all possible pairs of group averages, the difference
between each pair (see the diff
column), and the p value
(see the p adj
column) for a significance test of the
difference between each pair. In this case, all three pairs differed
significantly.
Here’s one way to summarize the results of the analysis:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that household broadband internet access would average significantly differently across at least two of the three income levels measured. The analysis found that broadband access averaged 66.2 among low-income counties, 72.4 among moderate-income counties, and 77.7 among high-income counties. A one-way analysis of variance test indicated at least one significant difference, F (2, 60.16) = 32.82, p < .05. A post-hoc procedure determined that all three means differed significantly from one another.
Note, too, that it might be wise to “stress test” the above conclusion by deleting the most prominent outliers and rerunning the analysis to see whether the outcome changes as a result. In this case, little changed. Running this code deleted the data for Hancock County (the low outlier in the “Low income” group) and for Williamson Conty (the high outlier in the “High income” group).
mydata <- mydata[-c(94, 34), ]
As expected, the group averages moved closer together as a result:
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill="dodgerblue")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, the ANOVA still produced a significant F:
#ANOVA
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: mydata$DV and mydata$IV
## F = 34.149, num df = 2.000, denom df = 59.398, p-value =
## 0.0000000001343
… and the post hoc test still showed found significant differences between all group average pairs:
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mydata$DV ~ mydata$IV)
##
## $`mydata$IV`
## diff lwr upr p adj
## 2 Moderate income-1 Low income 5.722473 2.800611 8.644335 0.0000313
## 3 High income-1 Low income 10.512292 7.612969 13.411614 0.0000000
## 3 High income-2 Moderate income 4.789819 1.914734 7.664904 0.0004208
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome_2
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome_2
, a categorical
variable, categorizes each Tennessee county’s median household income as
“1 Low income” or “2 High income.” See the Example dataset codebook for
details about both variables. Technically, the “1” and “2” digits at the
start of each category label are not needed. But including them is a
shortcut for telling R the order of the categories, from lowest to
highest. Without them, the categories would appear in alphabetical
order, and additional code would be needed to tell R that they are
ordinal-level variables and what order they should be put in.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome_2
is the independent variable. An
independent-samples t-test is suitable for the analysis because the
dependent variable is continuous, and the independent variable is
categorical and has two categories that are independent and mutually
exclusive.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv")
For the reasons noted above, these lines define PctBB
as
the dependent variable and MedIncome_2
as the independent
variable.
#Specify the DV and IV
mydata$DV <- mydata$PctBB
mydata$IV <- mydata$MedIncome_2
A t-test compares averages, and an average can mislead you if the
data you averaged included outliers or had some weird, non-normal
distribution. This code graphs the distribution of PctBB
within each of the MedIncome_2
groups and adds vertical
lines showing each group’s average.
#Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill="dodgerblue")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
PctBB
’s distribution seems to have a low outlier within
the “1 Low income” group and a high outlier within the “2 High income”
group. Otherwise, though, both distributions look somewhat normal. It’s
hard to tell from the graphs precisely what the group averages and the
values of those outliers are. This code gives some exact numbers.
#Calculate and show the group counts, means, standard
#deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV, na.rm = TRUE))
## # A tibble: 2 × 6
## IV count mean sd min max
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 Low income 48 68.0 5.13 51.2 75.5
## 2 2 High income 47 76.5 6.11 65.7 93.4
So, PctBB
, the percent of households with broadband
access, averages 68.0 percent among the low-income counties and 76.5
percent among the high-income counties. And those outliers are, well,
pretty far out. The low-income outlier, 51.2 percent, is almost 17
percentage points below the average. That’s more than three times the
distribution’s standard deviation of 5.13 percentage points. Similarly,
the high-income outlier, 93.4 percent, is about 17 percentage points
above the distribution’s average - again, nearly three times the
distribution’s standard deviation of 6.11 percentage points. Recall that
one standard deviation is the typical difference between the
distribution’s scores and its average. So these outliers are about three
times more unusual than the typical score. A quick check of the data
frame shows that the low outlier belongs to Hancock County, which
occupies Row 34, and the high outlier belongs to Williamson County,
which occupies Row 94. It might be a good idea to run a Shapiro-Wilk
test on the normality of the group distributions. Here’s the code for
doing it:
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
## # A tibble: 2 × 3
## IV `W Statistic` `p-value`
## <chr> <dbl> <dbl>
## 1 1 Low income 0.931 0.00744
## 2 2 High income 0.983 0.731
The high-income group distribution appears normal; the W
statistic’s p-value
, 0.731, is above 0.05. But the the
low-income group distribution appears non-normal. Its W
statistic has a p-value
of 0.007, which is below 0.05.
Because both groups are not normally distributed, it would be wise to
use a Wilcoxon rank sum test instead. Here’s the code:
options(scipen = 999)
wilcox.test(mydata$DV ~ mydata$IV)
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: mydata$DV by mydata$IV
## W = 313, p-value = 0.000000001333
## alternative hypothesis: true location shift is not equal to 0
The test’s W
statistics, 313, has a p-value
of around 0.000000001333. That’s far less than 0.05, so the test
indicates that the groups differ significantly. You may have noticed the
warning noting that R could not compute an exact p-value for the test
because the data happen to include “ties” - that is, counties with
identical broadband access percentages. The Wilcoxon rank sum test works
by ranking the cases from highest to lowest, and things get fuzzy with
the test has to try to rank cases with identical scores. But the p-value
R came up with is nowhere near exceeding 0.05, so tie problem is
tolerable.
The results could be written up something like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that average household broadband internet access would differ significantly among high-income counties compared to low-income counties. The analysis found that broadband access averaged 68.8 among “Low income” counties and 76.5 among “High income” counties. A Shapiro-Wilk test indicated the low-income group values were not normally distributed, rendering an independent-samples t-test unsuitable (W = 0.93, p < .05). However, a Wilcoxon rank sum test found the difference to be statistically significant (W = 313, p < .05). The results support the hypothesis.
Had the data been suitable for an independent-samples t-test, we could have run this code:
# Running a t-test
options(scipen = 999)
t.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: mydata$DV by mydata$IV
## t = -7.3327, df = 89.614, p-value = 0.00000000009635
## alternative hypothesis: true difference in means between group 1 Low income and group 2 High income is not equal to 0
## 95 percent confidence interval:
## -10.795676 -6.192711
## sample estimates:
## mean in group 1 Low income mean in group 2 High income
## 68.02708 76.52128
… which would have produced a t
value, -7.3327, with an
associated p-value
of 0.00000000009635. That’s far less
than 0.05, so the difference between the low- and high-income group
averages would be statistically significant. By the way, the
t
value is negative because the average of the first group
considered, the low-income group, is lower than the average of the
second group considered, the high-income group. If we had given R the
groups in the opposite order, the t
value would have been
positive.
The results of the t-test could be written up like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that average household broadband internet access would differ significantly among high-income counties compared to low-income counties. The analysis found that broadband access averaged 68.8 among “Low income” counties and 76.5 among “High income” counties. An independent samples t-test found the difference to be statistically significant, t (89.61) = -7.33, p < .05. The results support the hypothesis.
This code uses a paired-samples t-test to evaluate whether each
Tennessee county’s percentage of households with broadband internet
access, as estimated in 2019 by PctBB
in the TNBBAccessData.csv
data file, increased significantly compared to each county’s estimated
percentage from two years earlier, as recorded in the TNBBAccessData.csv
data file’s PctBB2017
variable. A paired-samples t-test is
suitable for the analysis, because each PctBB
county
estimate from 2019 is paired with the same county’s
PctBB2017
estimate from 2017. In short, the county
observations are paired sequentially. See Not sure
… for more explanation. See the Example dataset codebook for
details about both variables.
The analysis assumes that each county’s percentage of homes with broadband internet access depends, at least partly, on the difference in time between 2017 and 2019. Thus, the percentage of homes with broadband is the continuious dependent variable, and the year the percentage was measured (either 2017 or 2019) is the categorical independent variable. These variables are implied, rather than represented directly, in the dataset, though, because the dataset has been arranged in the format that a paired-samples t-test requires.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
The 2017 broadband access levels in PctBB2017
are
designated as V1
for purposes of the analysis, because they
came first in the time sequence being examined. The broadband access
levels in PctBB
, which are from 2019, are designated as
V2
, because they came second in the time sequence.
#Specify the two variables involved
mydata$V1 <- mydata$PctBB2017
mydata$V2 <- mydata$PctBB
A paired-samples t-test assumes that differences between each pair of scores are approximately normally distributed, with no substantial outliers. A histogram of the pair differences can tell you whether these assumptions are true:
#Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1
ggplot(mydata, aes(x = PairDifferences)) +
geom_histogram(color = "black", fill = "dodgerblue") +
geom_vline(aes(xintercept = mean(PairDifferences)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this case, the histogram suggests that both assumptions may be false. The data show at least hints of the bell-shape characteristic of a normal distribution. But the right side of the distribution seems a lot lower than the left side does - suggesting that counties with a lot of difference are more common than counties with relatively less difference. Also, there’s a pretty clear outlier on the left - that is, a county with an unusually small change in broadband access between 2017 and 2019.
The histogram doesn’t show precise values for the average and the outlier, so some descriptive statistics may be enlightening:
#Get descriptive statistics for pair differences
mydata %>%
select(PairDifferences) %>%
summarise(
count = n(),
mean = mean(PairDifferences, na.rm = TRUE),
sd = sd(PairDifferences, na.rm = TRUE),
min = min(PairDifferences, na.rm = TRUE),
max = max(PairDifferences, na.rm = TRUE))
## count mean sd min max
## 1 95 6.827368 2.578571 0.8 12.2
So, the low outlier is a county - Weakly County, it turns out - in which broadband access grew by just eight-tenths of a percentage point, more than two standard deviations less than the 6.8 percent average. And the maximum - Lauderdale County’s 12.2 percent surge in broadband access - is more than two standard deviations above the average. It might, indeed, be wise to at least check whether the normality assumption has been met:
#Shapiro-Wilk test
options(scipen = 999)
shapiro.test(mydata$PairDifferences)
##
## Shapiro-Wilk normality test
##
## data: mydata$PairDifferences
## W = 0.97041, p-value = 0.02993
The p-value
for the Shapiro-Wilk test produced a
W
value of 0.97, which has a p-value
of about
0.03. That’s less than 0.05, so the distribution of the pair differences
is not normal, and a t-test isn’t suitable in this case. No problem,
though. The Wilcoxon signed rank test will do just fine:
#Wilcoxon signed rank test
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean,SD = sd))
## V1_Mean V2_Mean V1_SD V2_SD
## 1 65.40211 72.22947 8.005702 7.045458
wilcox.test(mydata$V1, mydata$V2, paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: mydata$V1 and mydata$V2
## V = 0, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0
The average of V1 (the 2017 percentages) is 65.4 with a standard
deviation of 8.0, while the average of V2 (the 2019 percentages) is
72.2, with a standard deviation of 7.0. The test produced a
V
value of 0 and a p-value of 0.00000000000000022. The
latter is much less than 0.05. It looks like the average difference
between the 2019 and 2017 scores, 6.8 percentage points (shown in the
summary statistics from the pair differences distribution, above), is
statistically significant. The results could be written up like
this:
This analysis hypothesized that county-level percentages of households with broadband access averaged significantly higher in 2019 than in 2017. The analysis found that access averaged about 65 percent in 2017 and 72 percent in 2019. The differences between the pairs of percentages averaged 6.8. A Shapiro-Wilk test indicated these differences were not normally distributed, rendering a paired-samples t-test unsuitable (W = 0.97, p < .05). However, a Wilcoxon signed rank test found the difference to be statistically signficant (V = 0, p < .05).
Had a t-test been suitable, however, we could have run this code:
#t-test (Use if pair differences distribution is normal)
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean,SD = sd))
## V1_Mean V2_Mean V1_SD V2_SD
## 1 65.40211 72.22947 8.005702 7.045458
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
paired = TRUE)
##
## Paired t-test
##
## data: mydata$V2 and mydata$V1
## t = 25.807, df = 94, p-value < 0.00000000000000022
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 6.302087 7.352650
## sample estimates:
## mean difference
## 6.827368
… the results of which would indicate that the t
value,
25.81, had a p-value
of 0.00000000000000022. That’s much
less than 0.05, so the test would have indicated a statistically
significant difference. The results could have been written up this
way:
This analysis hypothesized that county-level percentages of households with broadband access averaged significantly higher in 2019 than in 2017. The analysis found that access averaged about 65 percent in 2017 and 72 percent in 2019. The differences between the pairs of percentages averaged 6.8. A paired-samples t-test found the difference to be statistically significant (t(94) = 25.8, p < .05).
… whether you need an Independent-samples t-test or a Paired-samples t-test?
Similarities between the two tests can make telling them apart confusing. For example, both kinds of t-tests involve comparing the average of one set of numbers with the average of a second set of numbers. In both instances, the dependent variable is continuous, and the categorical independent variable tells you which set a given number belongs to: the first set, or the second set.
The difference, though, involves whether each number in the first set is somehow connected, or paired, with a number in the second set. When no such connection exists, you use an independent-samples t-test. When such a connection does exist, you use a paired-samples t-test.
Suppose you wanted to learn whether college freshmen and college seniors score significantly differently on a 10-question test of knowledge about politics. College supposedly broadens one’s understanding of the world, so there is reason to think seniors would outperform freshmen. But college’s demands also keep students busy. Too busy, perhaps, to keep up with politics, especially during senior-level courses. So, maybe freshmen would outscore seniors.
A five-student sample of your dataset might look like this:
ID | Class | Knowledge |
1 | Freshman | 4 |
2 | Senior | 5 |
3 | Senior | 7 |
4 | Freshman | 3 |
5 | Freshman | 4 |
Each row represents a person. “Knowledge” is a continuous variable indicating each person’s score on the 10-question political knowledge test. “Class” is a categorical variable indicating whether each person is a freshman or a senior - that is, belongs to the “freshman” set of scores or to the “senior” set of scores.
The key thing to see is that no one person is included in both sets of scores. Nobody in the study is both a freshman and a senior. Everybody in the study is in only one set or only the other set. In technical terms, the two sets of scores are “mutually exclusive,” meaning they have no members in common. Furthermore, at least as far as we can tell, the two sets are “independent.” That means nobody in the freshman set has any sort of relationship with someone in the senior set that might. For example, suppose one of the freshmen and one of the seniors happened to be siblings whose mother happened to be a U.S. senator. Both probably would have higher, and more similar, levels of political knowledge than some random freshman and senior with no such connection between them. Ignoring that possibility could lower the validity of your research’s conclusions.
As long as the two sets of scores are mutually exclusive and independent, as described above, an independent-samples t-test would be the right procedure to use.
There could be advantages, however, to intentionally studying political knowledge among pairs of siblings like the pair mentioned above.
All sorts of factors that have nothing to do with a college student’s class rank nonetheless could boost or diminish the college student’s level of political knowledge. Some students may come from families that talked about politics all the time. Others may come from families that rarely talked about politics. Some may come from wealthier families that could afford access to elite political information sources. Others may come from poorer families that could not.
But what if you could manage to measure political knowledge among three dozen or so sibling pairs in which one sibling was a freshman and the other was a senior? If you could do that, you would be able to investigate whether time spend in college seems to make a difference in political knowledge regardless of household circumstances. That’s because you would be comparing freshmen from politically engaged families with seniors from politically engaged families, freshmen from politically disengaged families with seniors from politically disengaged families, freshmen from richer families with seniors from richer families, freshmen from poorer families with seniors from poorer families, and so on, perhaps even for factors you hadn’t thought of or imagined.
You’d have to arrange your data a little differently than the data in Example 1 were arranged, though. A five-sibling-pair sample of your data might look like this:
Sibling_pair_ID | Knowledge_Freshman_sibling | Knowledge_Senior_sibling |
1 | 3 | 5 |
2 | 5 | 7 |
3 | 4 | 6 |
4 | 3 | 8 |
5 | 4 | 7 |
In this arrangement, each horizontal row is a sibling pair, not an
individual person, as was the case in Example 1. The
Sibling_pair_ID
variable shows a unique ID code for each
sibling pair. Within each row, the
Knowledge_freshman_sibling
variable shows the political
knowledge test score for the freshman member of each sibling pair. The
Knowledge_senior_sibling
variable shows the political
knowledge test score for the senior member of each sibling pair.
If you are going to analyze paired data, R will expect you to arrange the data in this way so that R can tell which cases make up each pair. Even though the dataset’s values have been arranged differently, though, the dependent variable is still “political knowledge,” measured continuously as a zero-to-10 score on a political knowledge test, and the independent variable is still “class rank,” measured categorically as either “freshman” or “senior.”
Alternatively, sets in a t-test data file can be linked sequentially. For example, suppose you investigated the relationship between political knowledge and college class rank by recruiting some college freshmen, measuring their political knowledge, then tracking them down four years later, when they are seniors, and measure their political knowledge a second time. A five-person sample from your data set might look like this:
Student | Knowledge_when_a_freshman | Knowledge_when_a_senior |
1 | 3 | 5 |
2 | 5 | 7 |
3 | 4 | 6 |
4 | 3 | 8 |
5 | 4 | 7 |
The data are arranged the same way they were in Example 2. The key differences are that:
Each horizontal row now holds data from one person, rather than from each of two siblings, and …
The last two vertical columns hold data about the person collected during two time periods. In this case, the two time periods are when the person was a freshman, and four years later, when the person was a senior.
As in Example 2, the dependent variable is still “political knowledge,” measured continuously as a zero-to-10 score on a political knowledge test, and the independent variable is still “class rank,” measured categorically as either “freshman” or “senior.” The difference is that the “freshman” and “senior” in each horizontal row are the same person. The “freshman” score was measured when the row’s person was a freshman, and the “senior” score was measured when the row’s person was a senior.
To recap: If you know you need to conduct a t-test, but you’re unsure whether to use and independent-samples or a paired-samples t-test, consider the two sets of scores the t-test will be comparing. If the sets are mutually exclusive and independent, use an independent-samples t-test. But if each score in the first set is somehow paired with a score in the second set - for example, through a circumstantial or sequential connection - then use a paired-samples t-test.
The dataset, TNBBAccessData.csv
,
contains data extracted from the 2019 five-year American
Community Survey for all counties in Tennessee.
GEOID: A unique code the Census Bureau assigns to each county in the United States. It’s here because I got these data from data.census.gov., the Census Bureau’s main data dissemination site.
NAME: The name of each county.
PctBB: The percentage of households with broadband internet access in 2019.
MedIncome: Median household income.
PctBB2017: The percentage of households with broadband internet access, according to an estimate released in 2017, two years before the release of the other variables in this dataset.
PctBB_2: PctBB, dichotomized at the median, so that values at or below the median are coded as “1 Low BB access,” and percentages above the median are coded as “2 High BB access.”
PctBB_2_NUM: Same as PctBB_2, but with a numeric value of 0 used to indicate “1 Low BB access,” and a numeric value of 1 used to indicate “2 High BB access.” Such numeric codes are required when a two-value categorical value is used as the dependent variable in a logistic regression procedure.
MedIncome_2: MedIncome, dichotomized at the median, so that values at or below the median are coded as “1 Low income,” and percentages above the median are coded as “2 High income.”
PctBB_3: PctBB, but recoded into three roughly equal-sized groups: “1. Low BB access,” indicating counties with the lowest percentages of broadband internet access; “2. Moderate BB access,” indicating counties with mid-level percentages of broadband internet access; and “3. High BB access,” indicating countries with high percentages of broadband internet access. The groups were formed by sorting the county broadband access figures from smallest to greatest, then dividing the counties into three roughly equal groups.
MedIncome_3: MedIncome, but recoded into three roughly equal-sized groups: “1. Low income,” indicating counties with the lowest median household income values; “2. Moderate income” indicating counties with mid-level median household income values; and “3. High income,” indicating countries with high median household income values.