This guide can help you decide which of six statistical procedures to use when your analysis involves one dependent variable and one independent variable, as long as you know which variable is which, and whether each variable is measured categorically or continuously. Once you have settled on a procedure, click the procedure to jump to some code that you can copy into R and run after modifying it to specify the names of the data file and variables involved. You’ll also find links to worked examples for each procedure using variables from an example dataset, and some code that will install the R Markdown package on your computer.
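For reference, installing the R Markdown package works the same way as installing any other package in this guide; a minimal version of that installation code looks like this:
# Install the R Markdown package if it isn't installed already
if (!require("rmarkdown"))
  install.packages("rmarkdown")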
To begin, use this flow chart to decide which bivariate statistical procedure to use. Click a procedure to jump to copyable R code for that procedure.
START HERE
↓
DV level?
 → Categorical → IV level?
   → Categorical → Chi-Square test
   → Continuous → Logistic regression
 → Continuous → IV level?
   → Continuous → Bivariate regression
   → Categorical → How many IV categories?
     → 3 or more → One-way ANOVA
     → Only 2 → Independent, or paired?
       → Independent → Independent-samples t-test
       → Paired → Paired-samples t-test
Use a chi-square test when your dependent and independent variables
are both categorical. #Edit
notes indicate which parts of
the code must be customized. See the Chi-Square test worked
example for help interpreting the script’s output.
# Installing required packages
if (!require("tidyverse"))
install.packages("tidyverse")
if (!require("gmodels"))
install.packages("gmodels")
library(gmodels)
library(ggplot2)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
# Look at the DV and IV
ggplot(mydata, aes(x = IV, fill = DV)) +
geom_bar(colour = "black") +
scale_fill_brewer(palette = "Paired")
# Make the crosstab table
CrossTable(
mydata$DV,
mydata$IV,
prop.chisq = FALSE,
prop.t = FALSE,
prop.r = FALSE
)
# Run the chi-squared test
options(scipen = 999)
chitestresults <- chisq.test(mydata$DV, mydata$IV)
chitestresults
Use logistic regression when the DV is dichotomous and coded as
either 1 or 0, and the IV is continuous. #Edit
notes
indicate which parts of the code must be customized. See the Logistic regression worked
example for help interpreting the script’s output.
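If your dependent variable is stored as text labels rather than as 1s and 0s, you can recode it after reading the data with a line along these lines; the “Yes” value shown here is only a hypothetical example of a label your data might use:
# Recode a hypothetical "Yes"/"No" variable into the 1/0 coding that
# logistic regression expects
mydata$DV <- ifelse(mydata$YOURDVNAME == "Yes", 1, 0) #Edit YOURDVNAME and "Yes"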
# Installing required packages
if (!require("tidyverse"))
install.packages("tidyverse")
library(ggplot2)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
# Look at the DV and IV
ggplot(mydata, aes(x = DV)) + geom_bar(color = "black", fill = "#1f78b4")
ggplot(mydata, aes(x = IV)) + geom_histogram(color = "black", fill = "#1f78b4")
# Logistic regression plot
ggplot(mydata, aes(x = IV,
y = DV))+
geom_jitter(height = .03,
alpha = .5) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE,
color = "#1f78b4")
# Run the logistic regression and view the summary results
options(scipen = 999)
log.ed <- glm(DV ~ IV, data = mydata, family = "binomial")
summary(log.ed)
p <- .50
Inflection_point <- (log(p/(1-p))-coef(log.ed)[1])/coef(log.ed)[2]
Inflection_point
Use bivariate regression when both the DV and IV are continuous.
Bivariate regression works best when no major outliers are evident in
one variable, the other, or the two variables combined.
#Edit
notes indicate which parts of the code must be
customized. See the bivariate regression worked
example for help interpreting the script’s output.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
# Look at the DV and IV
ggplot(mydata, aes(x = DV)) + geom_histogram(color = "black", fill = "#1f78b4")
ggplot(mydata, aes(x = IV)) + geom_histogram(color = "black", fill = "#1f78b4")
# Creating and summarizing an initial regression model called myreg, and checking for bivariate outliers.
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
plot(mydata$IV, mydata$DV)
abline(lm(mydata$DV ~ mydata$IV))
If you see what looks like an outlier or two, run the code below and
look for hatvalues figures that are notably high compared to
the others. The number to the left of each hatvalues figure
is the row number in the mydata data frame of the case that
produced the hatvalues figure.
leverage <- as.data.frame(hatvalues(myreg))
view(leverage)
A case with a notably high “hatvalues” score, relative to the other
cases, may be an influential outlier and could be deleted, if you judge
doing so necessary. You can delete a row from the data frame with the
code below, where 2
is the row number of the case you want
to delete.
mydata <- mydata[-c(2), ]
If you need to delete two or more rows, you may delete all of them at once with the code below, where 2, 4, and 6 - and as many more digits as you need - are the row numbers of the cases you want to delete.
mydata <- mydata[-c(2, 4, 6), ]
With the suspected outliers deleted, you can re-run the analysis and make a final decision about whether to keep or delete the suspected outliers.
Use one-way analysis of variance (one-way ANOVA) when the DV is
continuous, the IV is categorical, and the IV has three or more
categories. One-way ANOVA works best when no major outliers are evident
in the distributions of the groups being compared. #Edit
notes indicate which parts of the code must be customized. See the One-way ANOVA worked example
for help interpreting the script’s output.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME
mydata$IV <- mydata$YOURIVNAME
# Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV)) +
geom_histogram() +
facet_grid(IV ~ .) +
geom_histogram(color = "black", fill = "#1f78b4") +
geom_vline(data = averages, aes(xintercept = mean, ))
# Calculate and show the group counts, means, standard
# deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV, na.rm = TRUE)
)
# If the group distributions look non-normal, consider checking for non-normality
# by running a Shapiro-Wilk test for each group. Significant results indicate
# non-normality.
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
# If one or more of the group distributions are non-normal, consider using a
# Kruskal-Wallis test instead of an ANOVA and, if the result is significant, a
# Dunn's Test instead of a Tukey's HSD as a post-hoc test.
if (!require("FSA"))
install.packages("FSA")
library(FSA)
kruskal.test(DV ~ IV, data = mydata)
dunnTest(DV ~ IV, data = mydata)
# If all of the group distributions are normal,
# though, you may use ANOVA. ANOVA is also
# robust enough to work as long as any
# non-normality isn't severe.
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
# If the ANOVA detects a significant difference, run
# this post-hoc procedure to learn which
# group pairs differed significantly.
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
Use an independent-samples t-test when the DV is continuous, and the
IV is categorical and has two categories that represent independent,
mutually exclusive groups. See the Not sure …
section for explanations of what “independent” and “mutually exclusive”
mean. #Edit
notes indicate which parts of the code must be
customized. See the Independent-samples
t-test worked example for help interpreting the script’s output.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME
mydata$IV <- mydata$YOURIVNAME
# Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV)) +
geom_histogram() +
facet_grid(IV ~ .) +
geom_histogram(color = "black", fill = "#1f78b4") +
geom_vline(data = averages, aes(xintercept = mean, ))
# Calculate and show the group counts, means, standard
# deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV, na.rm = TRUE)
)
# If you see evidence of non-normality in the distributions of one or both groups, use a Shapiro-Wilk test.
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
# A significant Shapiro-Wilk test for one or both groups means you should use a Wilcoxon rank sum test rather than a t-test.
options(scipen = 999)
wilcox.test(mydata$DV ~ mydata$IV)
# If both groups are normally distributed, though, you may run a t-test.
options(scipen = 999)
t.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
Use a paired-samples t-test when the DV is continuous, and the
categorical IV distinguishes between two separate, but connected, sets
of DV measures. See the Not sure … section for
explanations of what “separate, but connected” means. #Edit
notes indicate which parts of the code must be customized. See the Paired-samples t-test
worked example for help interpreting the script’s output.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the two variables involved
mydata$V1 <- mydata$YOURV1NAME
mydata$V2 <- mydata$YOURV2NAME
# Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1
ggplot(mydata, aes(x = PairDifferences)) +
geom_histogram(color = "black", fill = "#1f78b4") +
geom_vline(aes(xintercept = mean(PairDifferences)))
# Get descriptive statistics for pair differences
mydata %>%
select(PairDifferences) %>%
summarise(
count = n(),
mean = mean(PairDifferences, na.rm = TRUE),
sd = sd(PairDifferences, na.rm = TRUE),
min = min(PairDifferences, na.rm = TRUE),
max = max(PairDifferences, na.rm = TRUE)
)
# If pair differences look non-normal, you can use a Shapiro-Wilk test to check
# whether their distribution differs significantly from normal. If the
# Shapiro-Wilk test p-value is less than 0.05, use a Wilcoxon signed rank test
# instead of a paired-samples t-test.
# Shapiro-Wilk test
options(scipen = 999)
shapiro.test(mydata$PairDifferences)
# If the distribution of the pair differences is non-normal, consider
# using a Wilcoxon signed rank test instead of a
# paired-samples t-test.
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean, SD = sd))
wilcox.test(mydata$V1, mydata$V2, paired = TRUE)
# If the pair differences are normally distributed,
# though, you may use a paired-samples t-test.
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean, SD = sd))
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
paired = TRUE)
In the TNBBAccessData.csv
data file, the PctBB_2
variable indicates whether each
county in Tennessee has a “High” level of broadband internet access or a
“Low” level of broadband internet access. Meanwhile, the
MedIncome_2
variable indicates whether each county has a
“High” household income level or a “Low” household income level. See the
Example dataset codebook for
details about both variables. The code in this example uses a
Chi-Squared test to look for evidence of a relationship between the two
variables.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and gmodels
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("gmodels")) install.packages("gmodels")
library(gmodels)
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
The DV and IV variable names have been edited to specify
PctBB_2
as the dependent variable, and
MedIncome_2
as the independent variable. Again, see the Example dataset codebook for
details about both variables. Setting the analysis up in this way
assumes that broadband internet access depends upon household
income.
#Specify the DV and IV
mydata$DV <- mydata$PctBB_2 #Edit YOURDVNAME
mydata$IV <- mydata$MedIncome_2 #Edit YOURIVNAME
This code will produce a stacked column chart that gives you a look at whether, and how, the DV and IV might be related.
#Look at the DV and IV
ggplot(mydata, aes(x = IV, fill = DV)) +
geom_bar(colour = "black") +
scale_fill_brewer(palette = "Paired")
The chart shows that counties with low levels of broadband access appear much more common among low-income counties than among high-income counties. It also shows there are no outliers or oddly-coded cases that need to be examined.
It would be informative, though, to get the specific counts and percentages depicted in the chart. This code produces a table showing those numbers.
#Make the crosstab table
CrossTable(mydata$DV, mydata$IV,
prop.chisq = FALSE,
prop.t = FALSE,
prop.r = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 95
##
##
## | mydata$IV
## mydata$DV | 1 Low income | 2 High income | Row Total |
## -----------------|---------------|---------------|---------------|
## 1 Low BB access | 38 | 11 | 49 |
## | 0.792 | 0.234 | |
## -----------------|---------------|---------------|---------------|
## 2 High BB access | 10 | 36 | 46 |
## | 0.208 | 0.766 | |
## -----------------|---------------|---------------|---------------|
## Column Total | 48 | 47 | 95 |
## | 0.505 | 0.495 | |
## -----------------|---------------|---------------|---------------|
##
##
The table shows that 79 percent of the low-income counties (38 of 48) fall in the low-broadband-access category, compared to only 23 percent (11 of 47) of the high-income counties. Put another way, low-broadband-access counties are much more common among low-income counties than among high-income counties. This pattern doesn’t prove that low broadband access is a result of low income. But it does support the assertion that low broadband access is at least associated with low income.
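As a quick check, you can reproduce those two column percentages with simple division:
# Column percentages compared in the text
38 / 48 # low-income counties with low broadband access, about 0.79
11 / 47 # high-income counties with low broadband access, about 0.23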
Correctly decoding a crosstabulation table can be difficult for beginners. Here’s a trick that might help: Compare percentages that are horizontally adjacent. That is, compare percentages that are in the same horizontal row. In the example table, you would have two options: the 0.792 and 0.234 pair, or below them, the 0.208 and 0.766 pair. The first pair tells you that:
“Seventy-nine percent of low-income counties have low levels of broadband access, while only 23 percent of high-income counties have low levels of broadband access.”
See how that phrasing makes the pattern clear? Low broadband access is much more common among low-income counties than among high-income counties. A summary based on the second pair of horizontally adjacent percentages is equally clear:
“Seventy-seven percent of high-income counties have high levels of broadband access, while only 21 percent of low-income counties have high levels of broadband access.”
Again, the phrasing makes the relationship easier to understand. High broadband access is much more common among high-income counties than among low-income counties.
In both approaches, it often helps to refer to the positively-related percentages first. For example, “low-income” and “low access,” rather than “low-income” and “high access.” It’s one less complication for your audience to keep track of.
But random chance could produce the same pattern shown in the table, at least sometimes. A chi-squared test can help you decide whether to rule out random chance as an explanation for the pattern. Here’s the code:
#Run the chi-squared test
options(scipen = 999)
chitestresults <- chisq.test(mydata$DV, mydata$IV)
chitestresults
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$DV and mydata$IV
## X-squared = 27.375, df = 1, p-value = 0.0000001676
… and the results indicate that the chi-square value computed (27.375), with 1 degree of freedom (df), has a p-value of 0.0000001676. The p-value is not greater than 0.05. In fact, it’s quite a bit less. As a result, you can reasonably rule out randomness as an explanation for the pattern of association between income and broadband access. To summarize the analysis results more succinctly, and in a more scholarly format:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that the percentage of households with high broadband access would be significantly greater among high-income counties than among low-income counties. It found that about 77 percent of the high-income counties examined had high broadband access, compared to only about 21 percent of low-income counties. A chi-squared test found the difference to be statistically significant, X2 (1, N = 95) = 27.38, p < .05. The results support the hypothesis.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB_2_NUM
and MedIncome
variables. PctBB_2_NUM
distinguishes between Tennessee
counties with “Low” and “High” levels of broadband internet access by
assigning “Low” counties a 0
and “High” counties a
1
. Thus, PctBB_2_NUM
is a categorical
variable, even though its two categories are represented with numeric
digits. Meanwhile, MedIncome
is a continuous variable
containing each Tennessee county’s median household income. See the Example dataset codebook for
details about both variables.
The analysis assumes that whether a county has a “Low” or “High”
level of broadband internet access depends, at least partly, on the
county’s median household income. Thus, PctBB_2_NUM
is the
dependent variable, and MedIncome
is the independent
variable. Logistic regression is suitable for the analysis because the
dependent variable is categorical and has only two categories, one
represented by a 0
, and the other represented by a
1
.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot2
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("tidyverse")) install.packages("tidyverse")
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
For the reasons described above, the DV is specified as
PctBB_2_NUM
, while the IV is specified as
MedIncome
, and each variable’s distribution is examined for
outliers, data entry errors, and other anomalies.
#Specify the DV and IV
mydata$DV <- mydata$PctBB_2_NUM
mydata$IV <- mydata$MedIncome
#Look at the DV and IV
ggplot(mydata, aes(x = DV))+geom_bar(color="black",fill = "#1f78b4")
ggplot(mydata, aes(x = IV))+geom_histogram(color="black",fill = "#1f78b4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A clear outlier is evident in the MedIncome
variable’s
distribution. Examining the data reveals that the outlier is Williamson
County, which has an unusually high median household income for
Tennessee’s counties. But the outlier does not stem from a data entry
error; Williamson County’s median household income really is that high.
Furthermore, outliers aren’t much of a problem for logistic regression.
For these reasons, the Williamson County value was allowed to remain in
the dataset. If the outlier raised a concern about the validity of the
analysis’ conclusion, the outlier could be deleted so that the results
with and without the outlier could be compared.
This code shows how values for the IV, median household income, range in the “Low” and “High” categories of the DV, broadband internet access.
# Logistic regression plot
ggplot(mydata, aes(x = IV,
y = DV))+
geom_jitter(height = .03,
alpha = .5) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE,
color = "#1f78b4")
## `geom_smooth()` using formula = 'y ~ x'
Among the counties that are “Low” on broadband internet access, median household income ranges from just over $30,000 a year to just under $60,000 a year, with most of the counties falling in the $40,000 range. Among the counties that are “High” on broadband internet access, by contrast, median household incomes start at just under $40,000 and range all the way up to well over $100,000. The $100k-plus county (Williamson County) is an obvious outlier, and not an unexpected one, given what we saw on the univariate plot of median income. But even if Williamson County were excluded, it’s obvious that median income tends to be greater among the “High” broadband access counties.
By the way, all of the “Low” counties actually line up exactly on the
vertical axis’s “0” point, while all the “High” counties actually line
up exactly on the vertical axis’s “1” point. They don’t “bounce around”
0 and 1 the way the plot suggests they do. This is a “jitter” plot, so
named because it adds some random “jitter” to the locations of the points
on the resulting plot. Doing so allows you to better see points that
occupy the same real estate on the plot. If you change
geom_jitter(height = .03
to
geom_jitter(height = 0
, you’ll see the actual location of
each point on the vertical axis. You’ll also see that it’s tougher to
spot clusters of points with the same, or similar, values on the
horizontal axis.
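For reference, here is the same plot with the vertical jitter turned off, so every point sits exactly on 0 or 1:
# Logistic regression plot without vertical jitter
ggplot(mydata, aes(x = IV,
                   y = DV))+
  geom_jitter(height = 0,
              alpha = .5) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE,
              color = "#1f78b4")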
But is this pattern stark enough to suggest nonrandomness - that is, suggest that some systematic relationship exists between median income and broadband access? That’s what logistic regression output can tell you. This code runs the logistic regression and displays the results:
#Run the logistic regression and view the summary results
options(scipen = 999)
log.ed <- glm(DV ~ IV, data=mydata, family="binomial")
summary(log.ed)
##
## Call:
## glm(formula = DV ~ IV, family = "binomial", data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2138 -0.8051 -0.3234 0.8440 1.7984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.58002397 2.07299037 -4.621 0.00000381 ***
## IV 0.00020691 0.00004551 4.547 0.00000545 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 131.603 on 94 degrees of freedom
## Residual deviance: 94.102 on 93 degrees of freedom
## AIC: 98.102
##
## Number of Fisher Scoring iterations: 5
p <- .50
Inflection_point <- (log(p/(1-p))-coef(log.ed)[1])/coef(log.ed)[2]
Inflection_point
## (Intercept)
## 46300.82
Two figures in the output are particularly important: The
z value
, which you will need for the write-up, and the
probability value under Pr(>|z|)
, which must be 0.05 or
smaller to indicate statistical significance. Here, the rounded
z value
is 4.55, and the Pr(>|z|)
value,
0.00000545, is far less than 0.05, indicating statistical
significance.
The “Inflection point” ($46,300.82 in this example) is the point on the x axis at which the odds of being in the DV’s “1” group are 50/50. Cases with x axis values greater than this point have better than 50/50 odds of landing in the “1” group. Cases with x axis values less than this point have less than 50/50 odds of landing in the 1 group.
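To see how the model turns specific income values into probabilities, you can ask it for predicted probabilities directly. The income values below are illustrative only; values below the inflection point should return probabilities under .50, and values above it should return probabilities over .50.
# Predicted probability of being in the "High" broadband access group
# at a few illustrative median household income values
predict(log.ed,
        newdata = data.frame(IV = c(35000, 46301, 60000)),
        type = "response")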
The analysis could be summarized in this way:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that county-level median household income would significantly predict whether a county had high or low broadband access. A logistic regression procedure found that higher median household income levels significantly predicted high broadband access (z = 4.55, p < .05). The results support the hypothesis.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome
, also a
continuous variable, contains each Tennessee county’s median household
income. See the Example dataset
codebook for details about both variables.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome
is the independent variable.
Bivariate regression is suitable for the analysis because both the
dependent and independent variables are continuous.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot2
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
mydata <- read.csv("TNBBAccessData.csv")
For the reasons described above, the DV is specified as
PctBB
, while the IV is specified as MedIncome
.
Next, each variable’s distribution is examined for outliers, data entry
errors, and other anomalies.
#Specify the DV and IV
mydata$DV <- mydata$PctBB #Edit YOURDVNAME
mydata$IV <- mydata$MedIncome #Edit YOURIVNAME
#Look at the DV and IV
ggplot(mydata, aes(x = DV))+geom_histogram(color="black",fill = "#1f78b4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mydata, aes(x = IV))+geom_histogram(color="black",fill = "#1f78b4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the dependent variable, PctBB
, shows
what appear to be two outliers. One county has an unusually low
percentage of households with broadband access (somewhere around 50
percent), and another has an unusually high level of broadband access
(well above 90 percent). The distribution of the independent variable,
MedIncome
, also shows at least one, and maybe as many as
three, counties with unusually high median household incomes.
Because linear regression is being used, these outliers could lead to incorrect conclusions about the relationship between the dependent and independent variables. Creating the regression model and plotting its results will yield further insights.
Let’s begin with the regression results, which this code will produce and display:
# Creating and summarizing an initial regression model called myreg
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
##
## Call:
## lm(formula = DV ~ IV, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5892 -3.1303 0.5912 3.0848 9.0294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.85438413 2.11083922 23.14 <0.0000000000000002 ***
## IV 0.00049558 0.00004363 11.36 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.584 on 93 degrees of freedom
## Multiple R-squared: 0.5811, Adjusted R-squared: 0.5766
## F-statistic: 129 on 1 and 93 DF, p-value: < 0.00000000000000022
Looks like broadband access is significantly related to median
income. The p-value
for the F-statistic
is
<0.00000000000000022, which is far below the 0.05 threshold for
statistical significance. The model’s R-squared
value is
0.5811, which indicates that county median household income explains
about 58 percent of the variation in broadband access. That’s pretty
good.
The table under Coefficients:
indicates that the
coefficient for the independent variable, median household income, is
0.00049558. It tells you a couple of things.
First, it’s positive, which means that as median household income increases, so does broadband internet access. If it were negative, it would indicate the opposite - that as median household income increases, broadband internet access decreases. Of course, you can also discern a positive or negative relationship by looking at the scatterplot, as described above. Looking at the coefficient is simply an additional way to tell whether the relationship is positive or negative.
Second, the coefficient tells you how much the DV changes, on average, with each one-unit change in the IV. Put another way: At the time these data were collected, each additional dollar of median household income (the IV) was associated with a 0.00049558-percentage-point increase in the percentage of households with broadband internet access (the DV). Equivalently, it took roughly a $2,000 increase in median household income to be associated with a 1-percentage-point increase in broadband access.
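A quick bit of arithmetic confirms that reading of the coefficient:
# Income increase associated with a one-percentage-point rise in the DV,
# implied by the IV coefficient
1 / 0.00049558 # roughly $2,018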
The same table provides the model’s Intercept
value,
48.85438413, which is what the regression model predicts the DV’s value
will be if this IV’s value is zero. The intercept’s value doesn’t always
make a lot of sense. Here, for example, there are some pretty poor
counties in Tennessee, but it’s hard to imagine a county in which the
median household income is zero, and it’s even harder to imagine that,
despite being so desperately poor, about 48.9 percent of the households
in such a county would have broadband internet access. The intercept is
valuable, though, because it can help you figure out the trend line that
best represents the relationship between the DV and IV. If you poke
around in your hazy memories of middle school or high school math, you
might remember dealing with “x/y coordinates” and the “equation of a
line,” which looks like this:
y = mx + b
If you plug the regression output’s coefficient into the equation as “m,” the intercept value into the equation as “b,” and provide some “x” values (in this case, values of the IV), you’ll get a bunch of “y” values. Plot those values, and you’ll get the trend line shown in the scatterplot that the script will produce for you in the next step.
This trend line formula turns even more useful when you realize that you can estimate any Tennessee county’s percentage of households with broadband (during this time period, at least) by plugging in the county’s median household income, like this:
Pct. broadband = (0.00049558*median household income) + 48.85438413.
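As an illustration, here is that calculation carried out in R for a hypothetical county with a median household income of $50,000 (a value chosen only for demonstration), first by hand and then with R’s predict() function:
# Predicting PctBB from the regression equation for a hypothetical
# median household income of $50,000
b <- coef(myreg)[1] # intercept
m <- coef(myreg)[2] # coefficient for IV (median household income)
m * 50000 + b # predicted percent of households with broadband
# The same prediction using predict()
predict(myreg, newdata = data.frame(IV = 50000))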
But what about those outliers? Could this wonderfully concise regression model be spectacularly significant chiefly because of a few outliers? Let’s look at a plot of the relationship between household income and broadband access:
# Checking the initial regression model for outliers
plot(mydata$IV,mydata$DV)
abline(lm(mydata$DV~mydata$IV))
The plot shows a positive, reasonably linear relationship between the broadband access and household income measures. The outliers noted in the distributions of each variable are readily apparent, though. Both appear as cases far off the trend line, one at the low end, and the other at the high end. However, it seems unlikely that the two outliers account for all, or even most, of the linearity. Even the non-outlier cases show an upward relationship between household income and broadband access. Just in case, though, let’s check the leverages.
leverage <- as.data.frame(hatvalues(myreg))
view(leverage)
Case 94 (Williamson County) and case 95 (Wilson County) have
unusually large hatvalues
. Just to be safe, let’s delete
them, re-run the analysis, and compare the regression results with and
without these two outliers.
mydata <- mydata[-c(94,95), ]
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
##
## Call:
## lm(formula = DV ~ IV, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1885 -2.3811 -0.0759 3.2838 8.5539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.60572890 2.66814020 15.97 <0.0000000000000002 ***
## IV 0.00063393 0.00005699 11.12 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.334 on 91 degrees of freedom
## Multiple R-squared: 0.5762, Adjusted R-squared: 0.5716
## F-statistic: 123.7 on 1 and 91 DF, p-value: < 0.00000000000000022
plot(mydata$IV,mydata$DV)
abline(lm(mydata$DV~mydata$IV))
Comparing the output of this regression analysis to the output of the
earlier regression analysis (before the outliers were deleted) shows
little change. The model is still significantly positive. The
R-squared
figure, 0.5762, dropped a tad, but not much. All
in all, deleting the outliers didn’t make much difference.
Incidentally, in regression, a positive relationship means that the dependent variable increases as the independent variable increases. In other words, the regression line slopes markedly upward from left to right. A negative relationship means the opposite. The dependent variable decreases as the independent variable increases, and the regression line slopes markedly downward from left to right. No relationship means that the dependent variable has no evident relationship, positive or negative, with the independent variable. As the independent variable increases, the dependent variable sometimes increases, sometimes stays the same, and sometimes decreases, but never in a systematic way. The regression line, in such cases, is level, or at least nearly level, from left to right.
In this particular case, I think the benefits of retaining the outliers outweigh the benefits of dropping them. The analysis could be summarized like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that county-level median household income would correlate positively and significantly with county-level percentages of households with broadband access. Regression supported the hypothesis (F (1, 93) = 129, p < .05, R2 = .58). Two outliers were detected: Williamson County and Wilson County. Both had unusually high levels of both household income and broadband internet access. But repeating the analysis with both outliers omitted produced no substantial change in the results.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome_3
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome_3
, a categorical
variable, categorizes each Tennessee county’s median household income as
“1 Low income,” “2 Moderate income,” or “3 High income.” See the Example dataset codebook for
details about both variables.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome_3
is the independent variable.
One-way ANOVA is suitable for the analysis because the dependent
variable is continuous, and the independent variable is categorical and
has more than two categories. Technically, the “1,” “2,” and “3” digits
at the start of each category label are not needed. But including them
is a shortcut for telling R the order of the categories, from lowest to
highest. Without them, the categories would appear in alphabetical
order, and additional code would be needed to tell R that they are
ordinal-level variables and what order they should be put in.
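If your own category labels don’t carry those leading digits, a sketch of that additional code (applied after the IV has been defined, with hypothetical label values) might look like this:
# Hypothetical example: putting unnumbered category labels in order
# from lowest to highest
mydata$IV <- factor(mydata$IV,
                    levels = c("Low income", "Moderate income", "High income"))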
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot2
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
For the reasons noted above, these lines define PctBB
as
the dependent variable and MedIncome_3
as the independent
variable.
#Specify the DV and IV
mydata$DV <- mydata$PctBB
mydata$IV <- mydata$MedIncome_3
ANOVA requires that the groups being compared have normal distributions. That’s because ANOVA works by comparing averages, and averages can’t do a good job of representing distributions that are non-normal - for example, distributions that have large outliers, or two or more peaks, or something like that.
This code graphs the distribution of PctBB
within each
of the three MedIncome_3
groups and adds vertical lines
showing each group’s average.
#Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill = "#1f78b4")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph suggests that “low income” counties have a lower average broadband access level than “moderate income” counties, and that “low” and “moderate” income counties both have lower average broadband access than “high income” counties.
Would you call the group distributions normal, though? It’s kind of hard to tell. They’re all at least roughly symmetrical, meaning they are high in the center and taper off toward the left and right ends. But the “low-income” group looks like it might have two peaks, separated by a valley that drops, at one point, all the way to zero.
Anyway, it’s hard to tell from the graphs what the average values are. This code shows the exact value of each group distribution’s average, along with each distribution’s exact standard deviation, minimum value, and maximum value.
#Calculate and show the group counts, means, standard
#deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE))
## # A tibble: 3 × 4
## IV count mean sd
## <chr> <int> <dbl> <dbl>
## 1 1 Low income 31 66.2 5.23
## 2 2 Moderate income 31 72.4 4.11
## 3 3 High income 33 77.7 6.18
It looks like the average broadband access figures within the “low,”
“moderate” and “high” income groups are 66.2, 72.4, and 77.7,
respectively. Recall that PctBB happens to represent
percentages in this case. So, among the counties in the “low” group, the
average percentage of households with broadband access is 66.2 percent.
Among the “moderate” group’s counties, it’s 72.4 percent, and it’s 77.7
percent among the “high” group’s counties. Also, that low outlier in the
percent among the “high” group’s counties. Also, that low outlier in the
“low” group must be 51.2 percent, the “low” group’s minimum value. A
quick check of the data frame indicates that the 51.2 percent figure is
from Hancock County, in Row 34 of the data frame. Meanwhile, the high
outlier in the “high” group must be 93.4 percent - the figure belonging
to Williamson County, in Row 94 of the data frame.
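One quick way to make that check is to ask R which rows hold the lowest and highest DV values:
# Find the row numbers of the lowest and highest PctBB values,
# then view those rows
which.min(mydata$DV)
which.max(mydata$DV)
mydata[c(which.min(mydata$DV), which.max(mydata$DV)), ]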
Given these outliers and the group distributions that don’t look convincingly normal, let’s use a Shapiro-Wilk test to check each group for non-normality:
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
## # A tibble: 3 × 3
## IV `W Statistic` `p-value`
## <chr> <dbl> <dbl>
## 1 1 Low income 0.961 0.310
## 2 2 Moderate income 0.988 0.978
## 3 3 High income 0.985 0.916
Surprisingly, perhaps, all three groups have reasonably normal
distributions. You can tell because all three have W
statistics
with p values that are above 0.05, meaning that their distributions
don’t differ significantly from a normal distribution. That means it’s
OK to go ahead with an ANOVA test, as planned.
Had one or more of the groups shown a non-normal distribution, however, you could have used a Kruskal-Wallis test, instead of ANOVA, and a Dunn’s Test instead of a Tukey’s HSD. Neither requires normal group distributions. Below is the code for both. The FSA package is required to run the Dunn’s Test, so the first code block installs and loads it.
if (!require("FSA")) install.packages("FSA")
library(FSA)
… and this code runs the two tests:
kruskal.test(DV ~ IV, data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: DV by IV
## Kruskal-Wallis chi-squared = 43.697, df = 2, p-value = 0.0000000003245
dunnTest(DV~IV,data=mydata)
## Warning: IV was coerced to a factor.
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Holm method.
## Comparison Z P.unadj
## 1 1 Low income - 2 Moderate income -3.685733 0.00022804569358878
## 2 1 Low income - 3 High income -6.598951 0.00000000004140767
## 3 2 Moderate income - 3 High income -2.856072 0.00428917872629139
## P.adj
## 1 0.000456091387178
## 2 0.000000000124223
## 3 0.004289178726291
The Kruskal-Wallis test produced a chi-squared statistic
of about 43.7, with a p-value of 0.0000000003245. Because
the p-value is below 0.05, the test indicates that at least
one of the groups is significantly different from at least one of the
others. The post-hoc Dunn’s Test produced adjusted p-values that
were below 0.05 for every one of the group comparisons, meaning that
each group differs significantly from each other group.
Had relying on the Kruskal-Wallis and Dunn’s Test results been necessary, they could have been summarized like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that household broadband internet access would average significantly differently across at least two of the three income levels measured. A Shapiro-Wilk test detected significant nonnormality in the group distributions, so a Kruskal-Wallis test was used instead, with a Dunn’s Test as the post-hoc procedure. The Kruskal-Wallis test found that at least one of the groups differed significantly from at least one other group (H(2) = 43.7, p < 0.05). The Dunn’s Test showed statistically significant differences (p < 0.05) for all group comparisons.
But one-way ANOVA is, in fact, suitable for the data, so let’s use it instead. It relies on an F test to examine the differences among the group averages and indicate whether at least one group’s average differs significantly from at least one other group’s average. This code runs the ANOVA:
#ANOVA
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: mydata$DV and mydata$IV
## F = 32.824, num df = 2.000, denom df = 60.157, p-value =
## 0.0000000002304
The resulting F statistic, 32.824, has an associated
p-value
of 0.0000000002304. That value is far less than
0.05, so the ANOVA results indicate that at least one group average
differs significantly from at least one other group average.
Had the F statistic been nonsignificant, the analysis could stop here and conclude that the data show no evidence of a relationship between income and broadband access.
But knowing that at least one group average differs significantly from at least one other group average produces a bit of a cliffhanger: Which averages differ significantly from each other? In the graph, the “moderate income” average looks a little closer to the “high income” average than to the “low income” average. Are the “moderate” and “high” income averages close enough to be essentially the same? Are both the “moderate” and “high” income averages far enough from the “low income” average to be significantly greater? Or does each group average differ significantly from each other group average?
Tukey’s HSD (“honestly significant difference”) is a post hoc (“after this”) test designed to detect differences between group averages after an ANOVA has found that at least one such difference exists. This code will run the test and display the results:
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mydata$DV ~ mydata$IV)
##
## $`mydata$IV`
## diff lwr upr p adj
## 2 Moderate income-1 Low income 6.222581 3.037414 9.407748 0.0000323
## 3 High income-1 Low income 11.502835 8.366299 14.639371 0.0000000
## 3 High income-2 Moderate income 5.280254 2.143718 8.416790 0.0003598
The table shows all possible pairs of group averages, the difference
between each pair (see the diff
column), and the p value
(see the p adj
column) for a significance test of the
difference between each pair. In this case, all three pairs differed
significantly.
Here’s one way to summarize the results of the analysis:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that household broadband internet access would average significantly differently across at least two of the three income levels measured. The analysis found that broadband access averaged 66.2 among low-income counties, 72.4 among moderate-income counties, and 77.7 among high-income counties. A one-way analysis of variance test indicated at least one significant difference, F (2, 60.16) = 32.82, p < .05. A post-hoc procedure determined that all three means differed significantly from one another.
Note, too, that it might be wise to “stress test” the above conclusion by deleting the most prominent outliers and rerunning the analysis to see whether the outcome changes as a result. In this case, little changed. Running this code deleted the data for Hancock County (the low outlier in the “Low income” group) and for Williamson County (the high outlier in the “High income” group).
mydata <- mydata[-c(94, 34), ]
As expected, the group averages moved closer together as a result:
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill = "#1f78b4")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, the ANOVA still produced a significant F:
#ANOVA
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: mydata$DV and mydata$IV
## F = 34.149, num df = 2.000, denom df = 59.398, p-value =
## 0.0000000001343
… and the post hoc test still found significant differences between all group average pairs:
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mydata$DV ~ mydata$IV)
##
## $`mydata$IV`
## diff lwr upr p adj
## 2 Moderate income-1 Low income 5.722473 2.800611 8.644335 0.0000313
## 3 High income-1 Low income 10.512292 7.612969 13.411614 0.0000000
## 3 High income-2 Moderate income 4.789819 1.914734 7.664904 0.0004208
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome_2
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome_2
, a categorical
variable, categorizes each Tennessee county’s median household income as
“1 Low income” or “2 High income.” See the Example dataset codebook for
details about both variables. Technically, the “1” and “2” digits at the
start of each category label are not needed. But including them is a
shortcut for telling R the order of the categories, from lowest to
highest. Without them, the categories would appear in alphabetical
order, and additional code would be needed to tell R that they are
ordinal-level variables and what order they should be put in.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome_2
is the independent variable. An
independent-samples t-test is suitable for the analysis because the
dependent variable is continuous, and the independent variable is
categorical and has two categories that are independent and mutually
exclusive.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot2
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
#Read the data
mydata <- read.csv("TNBBAccessData.csv")
For the reasons noted above, these lines define PctBB
as
the dependent variable and MedIncome_2
as the independent
variable.
#Specify the DV and IV
mydata$DV <- mydata$PctBB
mydata$IV <- mydata$MedIncome_2
A t-test compares averages, and an average can mislead you if the
data you averaged included outliers or had some weird, non-normal
distribution. This code graphs the distribution of PctBB
within each of the MedIncome_2
groups and adds vertical
lines showing each group’s average.
#Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill = "#1f78b4")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
PctBB
’s distribution seems to have a low outlier within
the “1 Low income” group and a high outlier within the “2 High income”
group. Otherwise, though, both distributions look somewhat normal. It’s
hard to tell from the graphs precisely what the group averages and the
values of those outliers are. This code gives some exact numbers.
#Calculate and show the group counts, means, standard
#deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV, na.rm = TRUE))
## # A tibble: 2 × 6
## IV count mean sd min max
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 Low income 48 68.0 5.13 51.2 75.5
## 2 2 High income 47 76.5 6.11 65.7 93.4
So, PctBB
, the percent of households with broadband
access, averages 68.0 percent among the low-income counties and 76.5
percent among the high-income counties. And those outliers are, well,
pretty far out. The low-income outlier, 51.2 percent, is almost 17
percentage points below the average. That’s more than three times the
distribution’s standard deviation of 5.13 percentage points. Similarly,
the high-income outlier, 93.4 percent, is about 17 percentage points
above the distribution’s average - again, nearly three times the
distribution’s standard deviation of 6.11 percentage points. Recall that
one standard deviation is the typical difference between the
distribution’s scores and its average. So these outliers are about three
times as far from the average as a typical score. A quick check of the data
frame shows that the low outlier belongs to Hancock County, which
occupies Row 34, and the high outlier belongs to Williamson County,
which occupies Row 94. It might be a good idea to run a Shapiro-Wilk
test on the normality of the group distributions. Here’s the code for
doing it:
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
## # A tibble: 2 × 3
## IV `W Statistic` `p-value`
## <chr> <dbl> <dbl>
## 1 1 Low income 0.931 0.00744
## 2 2 High income 0.983 0.731
The high-income group distribution appears normal; the W
statistic’s p-value
, 0.731, is above 0.05. But the
low-income group distribution appears non-normal. Its W
statistic has a p-value of 0.007, which is below 0.05.
Because the groups are not both normally distributed, it would be wise to
use a Wilcoxon rank sum test instead. Here’s the code:
options(scipen = 999)
wilcox.test(mydata$DV ~ mydata$IV)
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: mydata$DV by mydata$IV
## W = 313, p-value = 0.000000001333
## alternative hypothesis: true location shift is not equal to 0
The test’s W statistic, 313, has a p-value
of around 0.000000001333. That’s far less than 0.05, so the test
indicates that the groups differ significantly. You may have noticed the
warning noting that R could not compute an exact p-value for the test
because the data happen to include “ties” - that is, counties with
identical broadband access percentages. The Wilcoxon rank sum test works
by ranking the cases from highest to lowest, and things get fuzzy when
the test has to try to rank cases with identical scores. But the p-value
R came up with is nowhere near exceeding 0.05, so the tie problem is
tolerable.
The results could be written up something like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that average household broadband internet access would differ significantly among high-income counties compared to low-income counties. The analysis found that broadband access averaged 68.0 among “Low income” counties and 76.5 among “High income” counties. A Shapiro-Wilk test indicated the low-income group values were not normally distributed, rendering an independent-samples t-test unsuitable (W = 0.93, p < .05). However, a Wilcoxon rank sum test found the difference to be statistically significant (W = 313, p < .05). The results support the hypothesis.
Had the data been suitable for an independent-samples t-test, we could have run this code:
# Running a t-test
options(scipen = 999)
t.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: mydata$DV by mydata$IV
## t = -7.3327, df = 89.614, p-value = 0.00000000009635
## alternative hypothesis: true difference in means between group 1 Low income and group 2 High income is not equal to 0
## 95 percent confidence interval:
## -10.795676 -6.192711
## sample estimates:
## mean in group 1 Low income mean in group 2 High income
## 68.02708 76.52128
… which would have produced a t
value, -7.3327, with an
associated p-value
of 0.00000000009635. That’s far less
than 0.05, so the difference between the low- and high-income group
averages would be statistically significant. By the way, the
t
value is negative because the average of the first group
considered, the low-income group, is lower than the average of the
second group considered, the high-income group. If we had given R the
groups in the opposite order, the t
value would have been
positive.
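If you want to see that for yourself, a sketch of reversing the group order looks like this (it assumes the IV’s category labels are “1 Low income” and “2 High income,” as shown in the output above):
# Re-run the t-test with the group order reversed; the t value's sign flips
mydata$IV_reversed <- factor(mydata$IV,
                             levels = c("2 High income", "1 Low income"))
t.test(mydata$DV ~ mydata$IV_reversed,
       var.equal = FALSE)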
The results of the t-test could be written up like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that average household broadband internet access would differ significantly among high-income counties compared to low-income counties. The analysis found that broadband access averaged 68.0 among “Low income” counties and 76.5 among “High income” counties. An independent samples t-test found the difference to be statistically significant, t (89.61) = -7.33, p < .05. The results support the hypothesis.
This code uses a paired-samples t-test to evaluate whether each Tennessee county’s percentage of households with broadband internet access, as estimated in 2019 by PctBB in the TNBBAccessData.csv data file, increased significantly compared to each county’s estimated percentage from two years earlier, recorded in the same file’s PctBB2017 variable. A paired-samples t-test is suitable for the analysis because each county’s PctBB estimate from 2019 is paired with the same county’s PctBB2017 estimate from 2017. In short, the county observations are paired sequentially. See Not sure whether you need an independent-samples t-test or a paired-samples t-test? for more explanation. See the Example dataset codebook for details about both variables.
The analysis assumes that each county’s percentage of homes with broadband internet access depends, at least partly, on the passage of time between 2017 and 2019. Thus, the percentage of homes with broadband is the continuous dependent variable, and the year the percentage was measured (either 2017 or 2019) is the categorical independent variable. These variables are implied, rather than represented directly, in the dataset, though, because the dataset has been arranged in the format that a paired-samples t-test requires.
The script’s opening lines of code install (if not installed already) the dplyr and tidyverse packages, activate the dplyr and ggplot2 packages, then read the data from the TNBBAccessData.csv file. The code assumes the data file has been stored in the same subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
The 2017 broadband access levels in PctBB2017
are
designated as V1
for purposes of the analysis, because they
came first in the time sequence being examined. The broadband access
levels in PctBB
, which are from 2019, are designated as
V2
, because they came second in the time sequence.
#Specify the two variables involved
mydata$V1 <- mydata$PctBB2017
mydata$V2 <- mydata$PctBB
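As noted above, the year is only implied by this wide arrangement. If you ever want the DV and IV as explicit columns, a quick pivot can create them in a long-format copy of the data. This is an optional sketch; it assumes the tidyr package, which installs along with tidyverse, and the Year and PctBroadband column names are arbitrary.
#Optional: make the implied DV and IV explicit in a long-format copy of the data
library(tidyr)
longdata <- mydata %>%
  pivot_longer(cols = c(PctBB2017, PctBB),
               names_to = "Year",
               values_to = "PctBroadband") %>%
  mutate(Year = ifelse(Year == "PctBB2017", "2017", "2019"))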
A paired-samples t-test assumes that differences between each pair of scores are approximately normally distributed, with no substantial outliers. A histogram of the pair differences can tell you whether these assumptions are true:
#Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1
ggplot(mydata, aes(x = PairDifferences)) +
geom_histogram(color = "black", fill = "#1f78b4") +
geom_vline(aes(xintercept = mean(PairDifferences)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this case, the histogram suggests that both assumptions may have been violated. The data show at least hints of the bell shape characteristic of a normal distribution. But the right side of the distribution seems a lot lower than the left side does - suggesting that counties with relatively large differences are less common than counties with relatively small differences. Also, there’s a pretty clear outlier on the left - that is, a county with an unusually small change in broadband access between 2017 and 2019.
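A boxplot offers a second look at potential outliers by plotting them as individual points. This is an optional sketch; it uses the PairDifferences column created above.
#Optional: boxplot of the pair differences, with outliers plotted as points
ggplot(mydata, aes(y = PairDifferences)) +
  geom_boxplot(fill = "#1f78b4")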
The histogram doesn’t show precise values for the average and the outlier, so some descriptive statistics may be enlightening:
#Get descriptive statistics for pair differences
mydata %>%
select(PairDifferences) %>%
summarise(
count = n(),
mean = mean(PairDifferences, na.rm = TRUE),
sd = sd(PairDifferences, na.rm = TRUE),
min = min(PairDifferences, na.rm = TRUE),
max = max(PairDifferences, na.rm = TRUE))
## count mean sd min max
## 1 95 6.827368 2.578571 0.8 12.2
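The summary doesn’t say which counties sit at those extremes. If you’re curious, a quick filter can name them. This is an optional sketch; it assumes the NAME variable described in the Example dataset codebook.
#Optional: identify the counties with the smallest and largest pair differences
mydata %>%
  filter(PairDifferences == min(PairDifferences) |
           PairDifferences == max(PairDifferences)) %>%
  select(NAME, PairDifferences)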
So, the low outlier is a county - Weakley County, it turns out - in which broadband access grew by just eight-tenths of a percentage point, more than two standard deviations below the 6.8-percentage-point average. And the maximum - Lauderdale County’s 12.2-percentage-point surge in broadband access - is more than two standard deviations above the average. It might, indeed, be wise to at least check whether the normality assumption has been met:
#Shapiro-Wilk test
options(scipen = 999)
shapiro.test(mydata$PairDifferences)
##
## Shapiro-Wilk normality test
##
## data: mydata$PairDifferences
## W = 0.97041, p-value = 0.02993
The Shapiro-Wilk test produced a W value of 0.97, which has a p-value of about 0.03. That’s less than 0.05, so the distribution of the pair differences is not normal, and a t-test isn’t suitable in this case. No problem, though. The Wilcoxon signed rank test has some drawbacks. But one thing it has going for it is that it doesn’t require normally distributed pair differences. So, you can use a Wilcoxon signed rank test instead of a paired-samples t-test when your data don’t meet the assumptions needed for a paired-samples t-test. Here’s the code:
#Wilcoxon signed rank test
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean,SD = sd))
## V1_Mean V2_Mean V1_SD V2_SD
## 1 65.40211 72.22947 8.005702 7.045458
wilcox.test(mydata$V1, mydata$V2, paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: mydata$V1 and mydata$V2
## V = 0, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0
The average of V1 (the 2017 percentages) is 65.4 with a standard deviation of 8.0, while the average of V2 (the 2019 percentages) is 72.2, with a standard deviation of 7.0. The test produced a V value of 0 and a p-value of less than 0.00000000000000022. The latter is far less than 0.05. It looks like the average difference between the 2019 and 2017 scores, 6.8 percentage points (shown in the summary statistics from the pair differences distribution, above), is statistically significant. The results could be written up like this:
This analysis hypothesized that county-level percentages of households with broadband access averaged significantly higher in 2019 than in 2017. The analysis found that access averaged about 65 percent in 2017 and 72 percent in 2019. The differences between the pairs of percentages averaged 6.8 percentage points. A Shapiro-Wilk test indicated these differences were not normally distributed, rendering a paired-samples t-test unsuitable (W = 0.97, p < .05). However, a Wilcoxon signed rank test found the difference to be statistically significant (V = 0, p < .05).
Had a t-test been suitable, however, we could have run this code:
#t-test (Use if pair differences distribution is normal)
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean,SD = sd))
## V1_Mean V2_Mean V1_SD V2_SD
## 1 65.40211 72.22947 8.005702 7.045458
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
paired = TRUE)
##
## Paired t-test
##
## data: mydata$V2 and mydata$V1
## t = 25.807, df = 94, p-value < 0.00000000000000022
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 6.302087 7.352650
## sample estimates:
## mean difference
## 6.827368
… the results of which would indicate that the t value, 25.81, had a p-value of less than 0.00000000000000022. That’s much less than 0.05, so the test would have indicated a statistically significant difference. The results could have been written up this way:
This analysis hypothesized that county-level percentages of households with broadband access averaged significantly higher in 2019 than in 2017. The analysis found that access averaged about 65 percent in 2017 and 72 percent in 2019. The differences between the pairs of percentages averaged 6.8 percentage points. A paired-samples t-test found the difference to be statistically significant (t(94) = 25.8, p < .05).
Not sure whether you need an Independent-samples t-test or a Paired-samples t-test?
Similarities between the two tests can make telling them apart confusing. For example, both kinds of t-tests involve comparing the average of one set of numbers with the average of a second set of numbers. In both instances, the dependent variable is continuous, and the categorical independent variable tells you which set a given number belongs to: the first set, or the second set.
The difference, though, involves whether each number in the first set is somehow connected, or paired, with a number in the second set. When no such connection exists, you use an independent-samples t-test. When such a connection does exist, you use a paired-samples t-test.
Suppose you wanted to learn whether college freshmen and college seniors score significantly differently on a 10-question test of knowledge about politics. College supposedly broadens one’s understanding of the world, so there is reason to think seniors would outperform freshmen. But college’s demands also keep students busy. Too busy, perhaps, to keep up with politics, especially during senior-level courses. So, maybe freshmen would outscore seniors.
A five-student sample of your dataset might look like this (Example 1):
ID | Class | Knowledge
---|---|---
1 | Freshman | 4
2 | Senior | 5
3 | Senior | 7
4 | Freshman | 3
5 | Freshman | 4
Each row represents a person. “Knowledge” is a continuous variable indicating each person’s score on the 10-question political knowledge test. “Class” is a categorical variable indicating whether each person is a freshman or a senior - that is, belongs to the “freshman” set of scores or to the “senior” set of scores.
The key thing to see is that no one person is included in both sets of scores. Nobody in the study is both a freshman and a senior. Everybody in the study is in only one set or the other. In technical terms, the two sets of scores are “mutually exclusive,” meaning they have no members in common. Furthermore, at least as far as we can tell, the two sets are “independent.” That means nobody in the freshman set has any sort of relationship with someone in the senior set that might link their scores. For example, suppose one of the freshmen and one of the seniors happened to be siblings whose mother happened to be a U.S. senator. Both probably would have higher, and more similar, levels of political knowledge than some random freshman and senior with no such connection between them. Ignoring that possibility could lower the validity of your research’s conclusions.
As long as the two sets of scores are mutually exclusive and independent, as described above, an independent-samples t-test would be the right procedure to use.
There could be advantages, however, to intentionally studying political knowledge among pairs of siblings like the pair mentioned above.
All sorts of factors that have nothing to do with a college student’s class rank nonetheless could boost or diminish the college student’s level of political knowledge. Some students may come from families that talked about politics all the time. Others may come from families that rarely talked about politics. Some may come from wealthier families that could afford access to elite political information sources. Others may come from poorer families that could not.
But what if you could manage to measure political knowledge among three dozen or so sibling pairs in which one sibling was a freshman and the other was a senior? If you could do that, you would be able to investigate whether time spent in college seems to make a difference in political knowledge regardless of household circumstances. That’s because you would be comparing freshmen from politically engaged families with seniors from politically engaged families, freshmen from politically disengaged families with seniors from politically disengaged families, freshmen from richer families with seniors from richer families, freshmen from poorer families with seniors from poorer families, and so on, perhaps even for factors you hadn’t thought of or imagined.
You’d have to arrange your data a little differently than the data in Example 1 were arranged, though. A five-sibling-pair sample of your data might look like this (Example 2):
Sibling_pair_ID | Knowledge_Freshman_sibling | Knowledge_Senior_sibling
---|---|---
1 | 3 | 5
2 | 5 | 7
3 | 4 | 6
4 | 3 | 8
5 | 4 | 7
In this arrangement, each horizontal row is a sibling pair, not an individual person, as was the case in Example 1. The Sibling_pair_ID variable shows a unique ID code for each sibling pair. Within each row, the Knowledge_Freshman_sibling variable shows the political knowledge test score for the freshman member of the pair, and the Knowledge_Senior_sibling variable shows the political knowledge test score for the senior member of the pair.
If you are going to analyze paired data, R will expect you to arrange the data in this way so that R can tell which cases make up each pair. Even though the dataset’s values have been arranged differently, though, the dependent variable is still “political knowledge,” measured continuously as a zero-to-10 score on a political knowledge test, and the independent variable is still “class rank,” measured categorically as either “freshman” or “senior.”
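To make the connection to code concrete, here is a minimal sketch of how such paired data could be analyzed, assuming the sibling-pair values were saved in a file named siblings.csv with the column names shown in Example 2 (both the file name and the data are hypothetical).
#Hypothetical example: paired-samples t-test on sibling-pair data like Example 2
siblings <- read.csv("siblings.csv")
t.test(siblings$Knowledge_Senior_sibling,
       siblings$Knowledge_Freshman_sibling,
       paired = TRUE)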
Alternatively, sets in a t-test data file can be linked sequentially. For example, suppose you investigated the relationship between political knowledge and college class rank by recruiting some college freshmen, measuring their political knowledge, then tracking them down four years later, when they are seniors, and measuring their political knowledge a second time. A five-person sample from your dataset might look like this (Example 3):
Student | Knowledge_when_a_freshman | Knowledge_when_a_senior
---|---|---
1 | 3 | 5
2 | 5 | 7
3 | 4 | 6
4 | 3 | 8
5 | 4 | 7
The data are arranged the same way they were in Example 2. The key differences are that:
Each horizontal row now holds data from one person, rather than from each of two siblings, and …
The last two vertical columns hold data about the person collected during two time periods. In this case, the two time periods are when the person was a freshman, and four years later, when the person was a senior.
As in Example 2, the dependent variable is still “political knowledge,” measured continuously as a zero-to-10 score on a political knowledge test, and the independent variable is still “class rank,” measured categorically as either “freshman” or “senior.” The difference is that the “freshman” and “senior” in each horizontal row are the same person. The “freshman” score was measured when the row’s person was a freshman, and the “senior” score was measured when the row’s person was a senior.
To recap: If you know you need to conduct a t-test, but you’re unsure whether to use an independent-samples or a paired-samples t-test, consider the two sets of scores the t-test will be comparing. If the sets are mutually exclusive and independent, use an independent-samples t-test. But if each score in the first set is somehow paired with a score in the second set - for example, through a circumstantial or sequential connection - then use a paired-samples t-test.
The example dataset, TNBBAccessData.csv
,
contains data extracted from the 2019 five-year American Community
Survey for all counties in Tennessee. See below for the Example dataset codebook.
One way to get the example data onto your computer is to run this R code:
# Read the data from the web
TNBBAccessData <- read.csv("https://drkblake.com/wp-content/uploads/2023/08/TNBBAccessData.csv")
# Save the data on your computer
write.csv(TNBBAccessData, "TNBBAccessData.csv", row.names=FALSE)
# remove the data from the environment
rm (TNBBAccessData)
Most variables in the example dataset are county-level measures of either broadband internet access or household income. Some of the variables are expressed as continuous variables. Others are expressed as categorical variables, with two or three categories. Some represent different time periods. The differing measurement levels allow demonstration of different statistical techniques, depending on the measurement levels of the variables involved.
Definitions for specific variables are as follows:
GEOID: A unique code the Census Bureau assigns to each county in the United States. It’s here because I got these data from data.census.gov, the Census Bureau’s main data dissemination site.
NAME: The name of each county.
PctBB: The percentage of households with broadband internet access in 2019.
MedIncome: Median household income.
PctBB2017: The percentage of households with broadband internet access, according to an estimate released in 2017, two years before the release of the other variables in this dataset.
PctBB_2: PctBB, dichotomized at the median, so that values at or below the median are coded as “1 Low BB access,” and percentages above the median are coded as “2 High BB access.”
PctBB_2_NUM: Same as PctBB_2, but with a numeric value of 0 used to indicate “1 Low BB access,” and a numeric value of 1 used to indicate “2 High BB access.” Such numeric codes are required when a two-value categorical variable is used as the dependent variable in a logistic regression procedure.
MedIncome_2: MedIncome, dichotomized at the median, so that values at or below the median are coded as “1 Low income,” and values above the median are coded as “2 High income.”
PctBB_3: PctBB, but recoded into three roughly equal-sized groups: “1. Low BB access,” indicating counties with the lowest percentages of broadband internet access; “2. Moderate BB access,” indicating counties with mid-level percentages of broadband internet access; and “3. High BB access,” indicating counties with high percentages of broadband internet access. The groups were formed by sorting the county broadband access figures from smallest to greatest, then dividing the counties into three roughly equal groups.
MedIncome_3: MedIncome, but recoded into three roughly equal-sized groups: “1. Low income,” indicating counties with the lowest median household income values; “2. Moderate income,” indicating counties with mid-level median household income values; and “3. High income,” indicating counties with high median household income values.
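If you ever need to build similar recodes for your own data, the sketch below shows one way to do it with dplyr. It is only an illustration - the cut points it produces may not exactly match the ones used to build the example dataset’s recoded variables, and the _check suffixes on the new column names are arbitrary.
#One way to reproduce recodes like PctBB_2_NUM and MedIncome_3 (illustrative only)
library(dplyr)
TNBBAccessData <- read.csv("TNBBAccessData.csv")
TNBBAccessData <- TNBBAccessData %>%
  mutate(
    #0/1 numeric version of a median split, as logistic regression requires
    PctBB_2_NUM_check = ifelse(PctBB <= median(PctBB), 0, 1),
    #Three roughly equal-sized income groups
    MedIncome_3_check = case_when(
      ntile(MedIncome, 3) == 1 ~ "1. Low income",
      ntile(MedIncome, 3) == 2 ~ "2. Moderate income",
      ntile(MedIncome, 3) == 3 ~ "3. High income"))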
The R Markdown package adds an authoring framework you can use to easily combine R code, R output, and text into a nicely formatted document that you can publish on the web or elsewhere. This code will install R Markdown for you:
if (!require("rmarkdown")) install.packages("rmarkdown")
After installing R Markdown, you can create a new R Markdown file by clicking File / New File / R Markdown …
You can work in “Source” or “Visual” mode in R Markdown. You’ll probably prefer working in the “Visual” mode.
It is best to write and test your R code in a regular R script window. Then, when you are sure the code will work as expected, copy and paste the code into a code block in R Markdown. Click the green “+C” button to add a code block, and paste your code into the block. Type your headings and explanatory material around your code blocks. When you are ready to see the result, click the blue “Knit” icon.
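For reference, a pasted-in code block might look something like the sketch below once it is inside an R Markdown document. The chunk label (broadband-histogram) and the code inside the chunk are just examples; any R code you have already tested can go between the delimiters.
```{r broadband-histogram}
#Example chunk: read the example data and plot the 2019 broadband percentages
mydata <- read.csv("TNBBAccessData.csv")
hist(mydata$PctBB)
```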