This guide can help you decide which of six statistical procedures to use when your analysis involves one dependent variable and one independent variable, as long as you know which variable is which, and whether each variable is measured categorically or continuously. Once you have settled on a procedure, click the procedure to jump to some code that you can copy into R and run after modifying it to specify the names of the data file and variables involved. You’ll also find links to worked examples for each procedure using variables from an example dataset, and some code that will install the R Markdown package on your computer.
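For reference, installing the R Markdown package works the same way as installing any other package in this guide; a minimal version of that installation code looks like this:
# Install the R Markdown package if it isn't installed already
if (!require("rmarkdown"))
  install.packages("rmarkdown")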
To begin, use this flow chart to decide which bivariate statistical procedure to use. Click a procedure to jump to copyable R code for that procedure.
START HERE
↓
DV level?
 → Categorical → IV level?
   → Categorical → Chi-Square test
   → Continuous → Logistic regression
 → Continuous → IV level?
   → Continuous → Bivariate regression
   → Categorical → How many IV categories?
     → 3 or more → One-way ANOVA
     → Only 2 → Independent, or paired?
       → Independent → Independent-samples t-test
       → Paired → Paired-samples t-test
Use a chi-square test when your dependent and independent variables
are both categorical. #Edit
notes indicate which parts of
the code must be customized. See the Chi-Square test worked
example for help interpreting the script’s output.
# Installing required packages
if (!require("tidyverse"))
install.packages("tidyverse")
if (!require("gmodels"))
install.packages("gmodels")
library(gmodels)
library(ggplot2)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
# Look at the DV and IV
ggplot(mydata, aes(x = IV, fill = DV)) +
geom_bar(colour = "black") +
scale_fill_brewer(palette = "Paired")
# Make the crosstab table
CrossTable(
mydata$DV,
mydata$IV,
prop.chisq = FALSE,
prop.t = FALSE,
prop.r = FALSE
)
# Run the chi-squared test
options(scipen = 999)
chitestresults <- chisq.test(mydata$DV, mydata$IV)
chitestresults
Use logistic regression when the DV is dichotomous and coded as
either 1 or 0, and the IV is continuous. #Edit
notes
indicate which parts of the code must be customized. See the Logistic regression worked
example for help interpreting the script’s output.
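If your dependent variable is stored as text labels rather than as 1s and 0s, you can recode it after reading the data with a line along these lines; the “Yes” value shown here is only a hypothetical example of a label your data might use:
# Recode a hypothetical "Yes"/"No" variable into the 1/0 coding that
# logistic regression expects
mydata$DV <- ifelse(mydata$YOURDVNAME == "Yes", 1, 0) #Edit YOURDVNAME and "Yes"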
# Installing required packages
if (!require("tidyverse"))
install.packages("tidyverse")
library(ggplot2)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
# Look at the DV and IV
ggplot(mydata, aes(x = DV)) + geom_bar(color = "black", fill = "#1f78b4")
ggplot(mydata, aes(x = IV)) + geom_histogram(color = "black", fill = "#1f78b4")
# Logistic regression plot
ggplot(mydata, aes(x = IV,
y = DV))+
geom_jitter(height = .03,
alpha = .5) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE,
color = "#1f78b4")
# Run the logistic regression and view the summary results
options(scipen = 999)
log.ed <- glm(DV ~ IV, data = mydata, family = "binomial")
summary(log.ed)
p <- .50
Inflection_point <- (log(p/(1-p))-coef(log.ed)[1])/coef(log.ed)[2]
Inflection_point
Use bivariate regression when both the DV and IV are continuous.
Bivariate regression works best when no major outliers are evident in
one variable, the other, or the two variables combined.
#Edit
notes indicate which parts of the code must be
customized. See the bivariate regression worked
example for help interpreting the script’s output.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME #Edit YOURDVNAME
mydata$IV <- mydata$YOURIVNAME #Edit YOURIVNAME
# Look at the DV and IV
ggplot(mydata, aes(x = DV)) + geom_histogram(color = "black", fill = "#1f78b4")
ggplot(mydata, aes(x = IV)) + geom_histogram(color = "black", fill = "#1f78b4")
# Creating and summarizing an initial regression model called myreg, and checking for bivariate outliers.
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
plot(mydata$IV, mydata$DV)
abline(lm(mydata$DV ~ mydata$IV))
If you see what looks like an outlier or two, run the code below and
look for hatvalues figures that are notably high compared to
the others. The number to the left of each hatvalues figure
is the row number in the mydata data frame of the case that
produced the hatvalues figure.
leverage <- as.data.frame(hatvalues(myreg))
view(leverage)
A case with a notably high “hatvalues” score, relative to the other
cases, may be an influential outlier and could be deleted, if you judge
doing so necessary. You can delete a row from the data frame with the
code below, where 2
is the row number of the case you want
to delete.
mydata <- mydata[-c(2), ]
If you need to delete two or more rows, you may delete all of them at once with the code below, where 2, 4, and 6 - and as many more digits as you need - are the row numbers of the cases you want to delete.
mydata <- mydata[-c(2, 4, 6), ]
With the suspected outliers deleted, you can re-run the analysis and make a final decision about whether to keep or delete the suspected outliers.
Use one-way analysis of variance (one-way ANOVA) when the DV is
continuous, the IV is categorical, and the IV has three or more
categories. One-way ANOVA works best when no major outliers are evident
in the distributions of the groups being compared. #Edit
notes indicate which parts of the code must be customized. See the One-way ANOVA worked example
for help interpreting the script’s output.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME
mydata$IV <- mydata$YOURIVNAME
# Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV)) +
geom_histogram() +
facet_grid(IV ~ .) +
geom_histogram(color = "black", fill = "#1f78b4") +
geom_vline(data = averages, aes(xintercept = mean, ))
# Calculate and show the group counts, means, standard
# deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV, na.rm = TRUE)
)
# If the group distributions look non-normal, consider checking for non-normality
# by running a Shapiro-Wilk test for each group. Significant results indicate
# non-normality.
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
# If one or more of the group distributions are non-normal, consider using a
# Kruskal-Wallis test instead of an ANOVA and, if the result is significant, a
# Dunn's Test instead of a Tukey's HSD as a post-hoc test.
if (!require("FSA"))
install.packages("FSA")
library(FSA)
kruskal.test(DV ~ IV, data = mydata)
dunnTest(DV ~ IV, data = mydata)
# If all of the group distributions are normal,
# though, you may use ANOVA. ANOVA is also
# robust enough to work as long as any
# non-normality isn't severe.
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
# If the ANOVA detects a significant difference, run
# this post-hoc procedure to learn which
# group pairs differed significantly.
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
Use an independent-samples t-test when the DV is continuous, and the
IV is categorical and has two categories that represent independent,
mutually exclusive groups. See the Not sure …
section for explanations of what “independent” and “mutually exclusive”
mean. #Edit
notes indicate which parts of the code must be
customized. See the Independent-samples
t-test worked example for help interpreting the script’s output.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the DV and IV
mydata$DV <- mydata$YOURDVNAME
mydata$IV <- mydata$YOURIVNAME
# Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV)) +
geom_histogram() +
facet_grid(IV ~ .) +
geom_histogram(color = "black", fill = "#1f78b4") +
geom_vline(data = averages, aes(xintercept = mean, ))
# Calculate and show the group counts, means, standard
# deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV, na.rm = TRUE)
)
# If you see evidence of non-normality in the distributions of one or both groups, use a Shapiro-Wilk test.
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
# A significant Shapiro-Wilk test for one or both groups means you should use a Wilcoxon rank sum test rather than a t-test.
options(scipen = 999)
wilcox.test(mydata$DV ~ mydata$IV)
# If both groups are normally distributed, though, you may run a t-test.
options(scipen = 999)
t.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
Use a paired-samples t-test when the DV is continuous, and the
categorical IV distinguishes between two separate, but connected, sets
of DV measures. See the Not sure … section for
explanations of what “separate, but connected” means. #Edit
notes indicate which parts of the code must be customized. See the Paired-samples t-test
worked example for help interpreting the script’s output.
# Installing required packages
if (!require("dplyr"))
install.packages("dplyr")
if (!require("tidyverse"))
install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
# Read the data
mydata <- read.csv("YOURFILENAME.csv") #Edit YOURFILENAME.csv
# Specify the two variables involved
mydata$V1 <- mydata$YOURV1NAME
mydata$V2 <- mydata$YOURV2NAME
# Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1
ggplot(mydata, aes(x = PairDifferences)) +
geom_histogram(color = "black", fill = "#1f78b4") +
geom_vline(aes(xintercept = mean(PairDifferences)))
# Get descriptive statistics for pair differences
mydata %>%
select(PairDifferences) %>%
summarise(
count = n(),
mean = mean(PairDifferences, na.rm = TRUE),
sd = sd(PairDifferences, na.rm = TRUE),
min = min(PairDifferences, na.rm = TRUE),
max = max(PairDifferences, na.rm = TRUE)
)
# If pair differences look non-normal, you can use a Shapiro-Wilk test to check
# whether their distribution differs significantly from normal. If the
# Shapiro-Wilk test p-value is less than 0.05, use a Wilcoxon signed rank test
# instead of a paired-samples t-test.
# Shapiro-Wilk test
options(scipen = 999)
shapiro.test(mydata$PairDifferences)
# If the distribution of the pair differences is non-normal, consider
# using a Wilcoxon signed rank test instead of a
# paired-samples t-test.
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean, SD = sd))
wilcox.test(mydata$V1, mydata$V2, paired = TRUE)
# If the pair differences are normally distributed,
# though, you may use a paired-samples t-test.
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean, SD = sd))
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
paired = TRUE)
In the TNBBAccessData.csv
data file, the PctBB_2
variable indicates whether each
county in Tennessee has a “High” level of broadband internet access or a
“Low” level of broadband internet access. Meanwhile, the
MedIncome_2
variable indicates whether each county has a
“High” household income level or a “Low” household income level. See the
Example dataset codebook for
details about both variables. The code in this example uses a
Chi-Squared test to look for evidence of a relationship between the two
variables.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and gmodels
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("gmodels")) install.packages("gmodels")
library(gmodels)
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
The DV and IV variable names have been edited to specify
PctBB_2
as the dependent variable, and
MedIncome_2
as the independent variable. Again, see the Example dataset codebook for
details about both variables. Setting the analysis up in this way
assumes that broadband internet access depends upon household
income.
#Specify the DV and IV
mydata$DV <- mydata$PctBB_2 #Edit YOURDVNAME
mydata$IV <- mydata$MedIncome_2 #Edit YOURIVNAME
This code will produce a stacked column chart that gives you a look at whether, and how, the DV and IV might be related.
#Look at the DV and IV
ggplot(mydata, aes(x = IV, fill = DV)) +
geom_bar(colour = "black") +
scale_fill_brewer(palette = "Paired")
The chart shows that counties with low levels of broadband access appear much more common among low-income counties than among high-income counties. It also shows there are no outliers or oddly-coded cases that need to be examined.
It would be informative, though, to get the specific counts and percentages depicted in the chart. This code produces a table showing those numbers.
#Make the crosstab table
CrossTable(mydata$DV, mydata$IV,
prop.chisq = FALSE,
prop.t = FALSE,
prop.r = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 95
##
##
## | mydata$IV
## mydata$DV | 1 Low income | 2 High income | Row Total |
## -----------------|---------------|---------------|---------------|
## 1 Low BB access | 38 | 11 | 49 |
## | 0.792 | 0.234 | |
## -----------------|---------------|---------------|---------------|
## 2 High BB access | 10 | 36 | 46 |
## | 0.208 | 0.766 | |
## -----------------|---------------|---------------|---------------|
## Column Total | 48 | 47 | 95 |
## | 0.505 | 0.495 | |
## -----------------|---------------|---------------|---------------|
##
##
The table shows that 79 percent of the low-income counties (38 of 48) fall in the low-broadband-access category, compared to only 23 percent (11 of 47) of the high-income counties. Put another way, low-broadband-access counties are much more common among low-income counties than among high-income counties. This pattern doesn’t prove that low broadband access is a result of low income. But it does support the assertion that low broadband access is at least associated with low income.
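As a quick check, you can reproduce those two column percentages with simple division:
# Column percentages compared in the text
38 / 48 # low-income counties with low broadband access, about 0.79
11 / 47 # high-income counties with low broadband access, about 0.23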
Correctly decoding a crosstabulation table can be difficult for beginners. Here’s a trick that might help: Compare percentages that are horizontally adjacent. That is, compare percentages that are in the same horizontal row. In the example table, you would have two options: the 0.792 and 0.234 pair, or below them, the 0.208 and 0.766 pair. The first pair tells you that:
“Seventy-nine percent of low-income counties have low levels of broadband access, while only 23 percent of high-income counties have low levels of broadband access.”
See how that phrasing makes the pattern clear? Low broadband access is much more common among low-income counties than among high-income counties. A summary based on the second pair of horizontally adjacent percentages is equally clear:
“Seventy-seven percent of high-income counties have high levels of broadband access, while only 21 percent of low-income counties have high levels of broadband access.”
Again, the phrasing makes the relationship easier to understand. High broadband access is much more common among high-income counties than among low-income counties.
In both approaches, it often helps to refer to the positively-related percentages first. For example, “low-income” and “low access,” rather than “low-income” and “high access.” It’s one less complication for your audience to keep track of.
But random chance could produce the same pattern shown in the table, at least sometimes. A chi-squared test can help you decide whether to rule out random chance as an explanation for the pattern. Here’s the code:
#Run the chi-squared test
options(scipen = 999)
chitestresults <- chisq.test(mydata$DV, mydata$IV)
chitestresults
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$DV and mydata$IV
## X-squared = 27.375, df = 1, p-value = 0.0000001676
… and the results indicate that the chi-square value computed (27.375), with 1 degree of freedom (df), has a p-value of 0.0000001676. The p-value is not greater than 0.05. In fact, it’s quite a bit less. As a result, you can reasonably rule out randomness as an explanation for the pattern of association between income and broadband access. To summarize the analysis results more succinctly, and in a more scholarly format:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that the percentage of households with high broadband access would be significantly greater among high-income counties than among low-income counties. It found that about 77 percent of the high-income counties examined had high broadband access, compared to only about 21 percent of low-income counties. A chi-squared test found the difference to be statistically significant, X2 (1, N = 95) = 27.38, p < .05. The results support the hypothesis.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB_2_NUM
and MedIncome
variables. PctBB_2_NUM
distinguishes between Tennessee
counties with “Low” and “High” levels of broadband internet access by
assigning “Low” counties a 0
and “High” counties a
1
. Thus, PctBB_2_NUM
is a categorical
variable, even though its two categories are represented with numeric
digits. Meanwhile, MedIncome
is a continuous variable
containing each Tennessee county’s median household income. See the Example dataset codebook for
details about both variables.
The analysis assumes that whether a county has a “Low” or “High”
level of broadband internet access depends, at least partly, on the
county’s median household income. Thus, PctBB_2_NUM
is the
dependent variable, and MedIncome
is the independent
variable. Logistic regression is suitable for the analysis because the
dependent variable is categorical and has only two categories, one
represented by a 0
, and the other represented by a
1
.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot2
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("tidyverse")) install.packages("tidyverse")
library(ggplot2)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
For the reasons described above, the DV is specified as
PctBB_2_NUM
, while the IV is specified as
MedIncome
, and each variable’s distribution is examined for
outliers, data entry errors, and other anomalies.
#Specify the DV and IV
mydata$DV <- mydata$PctBB_2_NUM
mydata$IV <- mydata$MedIncome
#Look at the DV and IV
ggplot(mydata, aes(x = DV))+geom_bar(color="black",fill = "#1f78b4")
ggplot(mydata, aes(x = IV))+geom_histogram(color="black",fill = "#1f78b4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A clear outlier is evident in the MedIncome
variable’s
distribution. Examining the data reveals that the outlier is Williamson
County, which has an unusually high median household income for
Tennessee’s counties. But the outlier does not stem from a data entry
error; Williamson County’s median household income really is that high.
Furthermore, outliers aren’t much of a problem for logistic regression.
For these reasons, the Williamson County value was allowed to remain in
the dataset. If the outlier raised a concern about the validity of the
analysis’ conclusion, the outlier could be deleted so that the results
with and without the outlier could be compared.
This code shows how values for the IV, median household income, range in the “Low” and “High” categories of the DV, broadband internet access.
# Logistic regression plot
ggplot(mydata, aes(x = IV,
y = DV))+
geom_jitter(height = .03,
alpha = .5) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = FALSE,
color = "#1f78b4")
## `geom_smooth()` using formula = 'y ~ x'
Among the counties that are “Low” on broadband internet access, median household income ranges from just over $30,000 a year to just under $60,000 a year, with most of the counties falling in the $40,000 range. Among the counties that are “High” on broadband internet access, by contrast, median household incomes start at just under $40,000 and range all the way up to well over $100,000. The $100k-plus county (Williamson County) is an obvious outlier, and not an unexpected one, given what we saw on the univariate plot of median income. But even if Williamson County were excluded, it’s obvious that median income tends to be greater among the “High” broadband access counties.
By the way, all of the “Low” counties actually line up exactly on the
vertical axis’s “0” point, while all the “High” counties actually line
up exactly on the vertical axis’s “1” point. They don’t “bounce around”
0 and 1 the way the plot suggests they do. This is a “jitter” plot, so
named because it adds some random “jitter” to the locations of the points
on the resulting plot. Doing so allows you to better see points that
occupy the same real estate on the plot. If you change
geom_jitter(height = .03
to
geom_jitter(height = 0
, you’ll see the actual location of
each point on the vertical axis. You’ll also see that it’s tougher to
spot clusters of points with the same, or similar, values on the
horizontal axis.
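For reference, here is the same plot with the vertical jitter turned off, so every point sits exactly on 0 or 1:
# Logistic regression plot without vertical jitter
ggplot(mydata, aes(x = IV,
                   y = DV))+
  geom_jitter(height = 0,
              alpha = .5) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE,
              color = "#1f78b4")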
But is this pattern stark enough to suggest nonrandomness - that is, suggest that some systematic relationship exists between median income and broadband access? That’s what logistic regression output can tell you. This code runs the logistic regression and displays the results:
#Run the logistic regression and view the summary results
options(scipen = 999)
log.ed <- glm(DV ~ IV, data=mydata, family="binomial")
summary(log.ed)
##
## Call:
## glm(formula = DV ~ IV, family = "binomial", data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2138 -0.8051 -0.3234 0.8440 1.7984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.58002397 2.07299037 -4.621 0.00000381 ***
## IV 0.00020691 0.00004551 4.547 0.00000545 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 131.603 on 94 degrees of freedom
## Residual deviance: 94.102 on 93 degrees of freedom
## AIC: 98.102
##
## Number of Fisher Scoring iterations: 5
p <- .50
Inflection_point <- (log(p/(1-p))-coef(log.ed)[1])/coef(log.ed)[2]
Inflection_point
## (Intercept)
## 46300.82
Two figures in the output are particularly important: The
z value
, which you will need for the write-up, and the
probability value under Pr(>|z|)
, which must be 0.05 or
smaller to indicate statistical significance. Here, the rounded
z value
is 4.55, and the Pr(>|z|)
value,
0.00000545, is far less than 0.05, indicating statistical
significance.
The “Inflection point” ($46,300.82 in this example) is the point on the x axis at which the odds of being in the DV’s “1” group are 50/50. Cases with x axis values greater than this point have better than 50/50 odds of landing in the “1” group. Cases with x axis values less than this point have less than 50/50 odds of landing in the 1 group.
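To see how the model turns specific income values into probabilities, you can ask it for predicted probabilities directly. The income values below are illustrative only; values below the inflection point should return probabilities under .50, and values above it should return probabilities over .50.
# Predicted probability of being in the "High" broadband access group
# at a few illustrative median household income values
predict(log.ed,
        newdata = data.frame(IV = c(35000, 46301, 60000)),
        type = "response")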
The analysis could be summarized in this way:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that county-level median household income would significantly predict whether a county had high or low broadband access. A logistic regression procedure found that higher median household income levels significantly predicted high broadband access (z = 4.55, p < .05). The results support the hypothesis.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome
, also a
continuous variable, contains each Tennessee county’s median household
income. See the Example dataset
codebook for details about both variables.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome
is the independent variable.
Bivariate regression is suitable for the analysis because both the
dependent and independent variables are continuous.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot2
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
mydata <- read.csv("TNBBAccessData.csv")
For the reasons described above, the DV is specified as
PctBB
, while the IV is specified as MedIncome
.
Next, each variable’s distribution is examined for outliers, data entry
errors, and other anomalies.
#Specify the DV and IV
mydata$DV <- mydata$PctBB #Edit YOURDVNAME
mydata$IV <- mydata$MedIncome #Edit YOURIVNAME
#Look at the DV and IV
ggplot(mydata, aes(x = DV))+geom_histogram(color="black",fill = "#1f78b4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mydata, aes(x = IV))+geom_histogram(color="black",fill = "#1f78b4")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the dependent variable, PctBB
, shows
what appear to be two outliers. One county has an unusually low
percentage of households with broadband access (somewhere around 50
percent), and another has an unusually high level of broadband access
(well above 90 percent). The distribution of the independent variable,
MedIncome
, also shows at least one, and maybe as many as
three, counties with unusually high median household incomes.
Because linear regression is being used, these outliers could lead to incorrect conclusions about the relationship between the dependent and independent variables. Creating the regression model and plotting its results will yield further insights.
Let’s begin with the regression results, which this code will produce and display:
# Creating and summarizing an initial regression model called myreg
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
##
## Call:
## lm(formula = DV ~ IV, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5892 -3.1303 0.5912 3.0848 9.0294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.85438413 2.11083922 23.14 <0.0000000000000002 ***
## IV 0.00049558 0.00004363 11.36 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.584 on 93 degrees of freedom
## Multiple R-squared: 0.5811, Adjusted R-squared: 0.5766
## F-statistic: 129 on 1 and 93 DF, p-value: < 0.00000000000000022
Looks like broadband access is significantly related to median
income. The p-value
for the F-statistic
is
<0.00000000000000022, which is far below the 0.05 threshold for
statistical significance. The model’s R-squared
value is
0.5811, which indicates that county median household income explains
about 58 percent of the variation in broadband access. That’s pretty
good.
The table under Coefficients:
indicates that the
coefficient for the independent variable, median household income, is
0.00049558. It tells you a couple of things.
First, it’s positive, which means that as median household income increases, so does broadband internet access. If it were negative, it would indicate the opposite - that as median household income increases, broadband internet access decreases. Of course, you can also discern a positive or negative relationship by looking at the scatterplot, as described above. Looking at the coefficient is simply an additional way to tell whether the relationship is positive or negative.
Second, the coefficient tells you how much the DV changes, on average, with each one-unit change in the IV. Put another way: At the time these data were collected, each additional dollar of median household income (the IV) was associated with a 0.00049558-percentage-point increase in the percentage of households with broadband internet access (the DV). Equivalently, it took roughly a $2,000 increase in median household income to be associated with a 1-percentage-point increase in broadband access.
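A quick bit of arithmetic confirms that reading of the coefficient:
# Income increase associated with a one-percentage-point rise in the DV,
# implied by the IV coefficient
1 / 0.00049558 # roughly $2,018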
The same table provides the model’s Intercept
value,
48.85438413, which is what the regression model predicts the DV’s value
will be if this IV’s value is zero. The intercept’s value doesn’t always
make a lot of sense. Here, for example, there are some pretty poor
counties in Tennessee, but it’s hard to imagine a county in which the
median household income is zero, and it’s even harder to imagine that,
despite being so desperately poor, about 48.9 percent of the households
in such a county would have broadband internet access. The intercept is
valuable, though, because it can help you figure out the trend line that
best represents the relationship between the DV and IV. If you poke
around in your hazy memories of middle school or high school math, you
might remember dealing with “x/y coordinates” and the “equation of a
line,” which looks like this:
y = mx + b
If you plug the regression output’s coefficient into the equation as “m,” the intercept value into the equation as “b,” and provide some “x” values (in this case, values of the IV), you’ll get a bunch of “y” values. Plot those values, and you’ll get the trend line shown in the scatterplot that the script will produce for you in the next step.
This trend line formula turns even more useful when you realize that you can estimate any Tennessee county’s percentage of households with broadband (during this time period, at least) by plugging in the county’s median household income, like this:
Pct. broadband = (0.00049558*median household income) + 48.85438413.
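As an illustration, here is that calculation carried out in R for a hypothetical county with a median household income of $50,000 (a value chosen only for demonstration), first by hand and then with R’s predict() function:
# Predicting PctBB from the regression equation for a hypothetical
# median household income of $50,000
b <- coef(myreg)[1] # intercept
m <- coef(myreg)[2] # coefficient for IV (median household income)
m * 50000 + b # predicted percent of households with broadband
# The same prediction using predict()
predict(myreg, newdata = data.frame(IV = 50000))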
But what about those outliers? Could this wonderfully concise regression model be spectacularly significant chiefly because of a few outliers? Let’s look at a plot of the relationship between household income and broadband access:
# Checking the initial regression model for outliers
plot(mydata$IV,mydata$DV)
abline(lm(mydata$DV~mydata$IV))
The plot shows a positive, reasonably linear relationship between the broadband access and household income measures. The outliers noted in the distributions of each variable are readily apparent, though. Both appear as cases far off the trend line, one at the low end, and the other at the high end. However, it seems unlikely that the two outliers account for all, or even most, of the linearity. Even the non-outlier cases show an upward relationship between household income and broadband access. Just in case, though, let’s check the leverages.
leverage <- as.data.frame(hatvalues(myreg))
view(leverage)
Case 94 (Williamson County) and case 95 (Wilson County) have
unusually large hatvalues
. Just to be safe, let’s delete
them, re-run the analysis, and compare the regression results with and
without these two outliers.
mydata <- mydata[-c(94,95), ]
options(scipen = 999)
myreg <- lm(DV ~ IV,
data = mydata)
summary(myreg)
##
## Call:
## lm(formula = DV ~ IV, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1885 -2.3811 -0.0759 3.2838 8.5539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.60572890 2.66814020 15.97 <0.0000000000000002 ***
## IV 0.00063393 0.00005699 11.12 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.334 on 91 degrees of freedom
## Multiple R-squared: 0.5762, Adjusted R-squared: 0.5716
## F-statistic: 123.7 on 1 and 91 DF, p-value: < 0.00000000000000022
plot(mydata$IV,mydata$DV)
abline(lm(mydata$DV~mydata$IV))
Comparing the output of this regression analysis to the output of the
earlier regression analysis (before the outliers were deleted) shows
little change. The model is still significantly positive. The
R-squared
figure, 0.5762, dropped a tad, but not much. All
in all, deleting the outliers didn’t make much difference.
Incidentally, in regression, a positive relationship means that the dependent variable increases as the independent variable increases. In other words, the regression line slopes markedly upward from left to right. A negative relationship means the opposite. The dependent variable decreases as the independent variable increases, and the regression line slopes markedly downward from left to right. No relationship means that the dependent variable has no evident relationship, positive or negative, with the independent variable. As the independent variable increases, the dependent variable sometimes increases, sometimes stays the same, and sometimes decreases, but never in a systematic way. The regression line, in such cases, is level, or at least nearly level, from left to right.
In this particular case, I think the benefits of retaining the outliers outweigh the benefits of dropping them. The analysis could be summarized like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that county-level median household income would correlate positively and significantly with county-level percentages of households with broadband access. Regression supported the hypothesis (F (1, 93) = 129, p < .05, R2 = .58). Two outliers were detected: Williamson County and Wilson County. Both had unusually high levels of both household income and broadband internet access. But repeating the analysis with both outliers omitted produced no substantial change in the results.
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome_3
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome_3
, a categorical
variable, categorizes each Tennessee county’s median household income as
“1 Low income,” “2 Moderate income,” or “3 High income.” See the Example dataset codebook for
details about both variables.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome_3
is the independent variable.
One-way ANOVA is suitable for the analysis because the dependent
variable is continuous, and the independent variable is categorical and
has more than two categories. Technically, the “1,” “2,” and “3” digits
at the start of each category label are not needed. But including them
is a shortcut for telling R the order of the categories, from lowest to
highest. Without them, the categories would appear in alphabetical
order, and additional code would be needed to tell R that they are
ordinal-level variables and what order they should be put in.
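If your own category labels don’t carry those leading digits, a sketch of that additional code (applied after the IV has been defined, with hypothetical label values) might look like this:
# Hypothetical example: putting unnumbered category labels in order
# from lowest to highest
mydata$IV <- factor(mydata$IV,
                    levels = c("Low income", "Moderate income", "High income"))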
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot2
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
For the reasons noted above, these lines define PctBB
as
the dependent variable and MedIncome_3
as the independent
variable.
#Specify the DV and IV
mydata$DV <- mydata$PctBB
mydata$IV <- mydata$MedIncome_3
ANOVA requires that the groups being compared have normal distributions. That’s because ANOVA works by comparing averages, and averages can’t do a good job of representing distributions that are non-normal - for example, distributions that have large outliers, or two or more peaks, or something like that.
This code graphs the distribution of PctBB
within each
of the three MedIncome_3
groups and adds vertical lines
showing each group’s average.
#Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill = "#1f78b4")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph suggests that “low income” counties have a lower average broadband access level than “moderate income” counties, and that “low” and “moderate” income counties both have lower average broadband access than “high income” counties.
Would you call the group distributions normal, though? It’s kind of hard to tell. They’re all at least roughly symmetrical, meaning they are high in the center and taper off toward the left and right ends. But the “low-income” group looks like it might have two peaks, separated by a valley that drops, at one point, all the way to zero.
Anyway, it’s hard to tell from the graphs what the average values are. This code shows the exact value of each group distribution’s average, along with each distribution’s exact standard deviation, minimum value, and maximum value.
#Calculate and show the group counts, means, standard
#deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE))
## # A tibble: 3 × 4
## IV count mean sd
## <chr> <int> <dbl> <dbl>
## 1 1 Low income 31 66.2 5.23
## 2 2 Moderate income 31 72.4 4.11
## 3 3 High income 33 77.7 6.18
It looks like the average broadband access figures within the “low,”
“moderate” and “high” income groups are 66.2, 72.4, and 77.7,
respectively. Recall that PctBB happens to represent
percentages in this case. So, among the counties in the “low” group, the
average percentage of households with broadband access is 66.2 percent.
Among the “moderate” group’s counties, it’s 72.4 percent, and it’s 77.7
percent among the “high” group’s counties. Also, that low outlier in the
percent among the “high” group’s counties. Also, that low outlier in the
“low” group must be 51.2 percent, the “low” group’s minimum value. A
quick check of the data frame indicates that the 51.2 percent figure is
from Hancock County, in Row 34 of the data frame. Meanwhile, the high
outlier in the “high” group must be 93.4 percent - the figure belonging
to Williamson County, in Row 94 of the data frame.
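One quick way to make that check is to ask R which rows hold the lowest and highest DV values:
# Find the row numbers of the lowest and highest PctBB values,
# then view those rows
which.min(mydata$DV)
which.max(mydata$DV)
mydata[c(which.min(mydata$DV), which.max(mydata$DV)), ]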
Given these outliers and the group distributions that don’t look convincingly normal, let’s use a Shapiro-Wilk test to check each group for non-normality:
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
## # A tibble: 3 × 3
## IV `W Statistic` `p-value`
## <chr> <dbl> <dbl>
## 1 1 Low income 0.961 0.310
## 2 2 Moderate income 0.988 0.978
## 3 3 High income 0.985 0.916
Surprisingly, perhaps, all three groups have reasonably normal
distributions. You can tell because all three have W
statistics
with p values that are above 0.05, meaning that their distributions
don’t differ significantly from a normal distribution. That means it’s
OK to go ahead with an ANOVA test, as planned.
Had one or more of the groups shown a non-normal distribution, however, you could have used a Kruskal-Wallis test, instead of ANOVA, and a Dunn’s Test instead of a Tukey’s HSD. Neither requires normal group distributions. Below is the code for both. The FSA package is required to run the Dunn’s Test, so the first code block installs and loads it.
if (!require("FSA")) install.packages("FSA")
library(FSA)
… and this code runs the two tests:
kruskal.test(DV ~ IV, data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: DV by IV
## Kruskal-Wallis chi-squared = 43.697, df = 2, p-value = 0.0000000003245
dunnTest(DV~IV,data=mydata)
## Warning: IV was coerced to a factor.
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Holm method.
## Comparison Z P.unadj
## 1 1 Low income - 2 Moderate income -3.685733 0.00022804569358878
## 2 1 Low income - 3 High income -6.598951 0.00000000004140767
## 3 2 Moderate income - 3 High income -2.856072 0.00428917872629139
## P.adj
## 1 0.000456091387178
## 2 0.000000000124223
## 3 0.004289178726291
The Kruskal-Wallis test produced a chi-squared statistic
of about 43.7, with a p-value of 0.0000000003245. Because
the p-value is below 0.05, the test indicates that at least
one of the groups is significantly different from at least one of the
others. The post-hoc Dunn’s Test produced adjusted p-values that
were below 0.05 for every one of the group comparisons, meaning that
each group differs significantly from each other group.
Had relying on the Kruskal-Wallis and Dunn’s Test results been necessary, they could have been summarized like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that household broadband internet access would average significantly differently across at least two of the three income levels measured. A Shapiro-Wilk test detected significant nonnormality in the group distributions, so a Kruskal-Wallis test was used instead, with a Dunn’s Test as the post-hoc procedure. The Kruskal-Wallis test found that at least one of the groups differed significantly from at least one other group (H(2) = 43.7, p < 0.05). The Dunn’s Test showed statistically significant differences (p < 0.05) for all group comparisons.
But one-way ANOVA is, in fact, suitable for the data, so let’s use it instead. It relies on an F test to examine the differences among the group averages and indicate whether at least one group’s average differs significantly from at least one other group’s average. This code runs the ANOVA:
#ANOVA
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: mydata$DV and mydata$IV
## F = 32.824, num df = 2.000, denom df = 60.157, p-value =
## 0.0000000002304
The resulting F statistic, 32.824, has an associated
p-value
of 0.0000000002304. That value is far less than
0.05, so the ANOVA results indicate that at least one group average
differs significantly from at least one other group average.
Had the F statistic been nonsignificant, the analysis could stop here and conclude that the data show no evidence of a relationship between income and broadband access.
But knowing that at least one group average differs significantly from at least one other group average produces a bit of a cliffhanger: Which averages differ significantly from each other? In the graph, the “moderate income” average looks a little closer to the “high income” average than to the “low income” average. Are the “moderate” and “high” income averages close enough to be essentially the same? Are both the “moderate” and “high” income averages far enough from the “low income” average to be significantly greater? Or does each group average differ significantly from each other group average?
Tukey’s HSD (“honestly significant difference”) is a post hoc (“after this”) test designed to detect differences between group averages after an ANOVA has found that at least one such difference exists. This code will run the test and display the results:
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mydata$DV ~ mydata$IV)
##
## $`mydata$IV`
## diff lwr upr p adj
## 2 Moderate income-1 Low income 6.222581 3.037414 9.407748 0.0000323
## 3 High income-1 Low income 11.502835 8.366299 14.639371 0.0000000
## 3 High income-2 Moderate income 5.280254 2.143718 8.416790 0.0003598
The table shows all possible pairs of group averages, the difference
between each pair (see the diff
column), and the p value
(see the p adj
column) for a significance test of the
difference between each pair. In this case, all three pairs differed
significantly.
Here’s one way to summarize the results of the analysis:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that household broadband internet access would average significantly differently across at least two of the three income levels measured. The analysis found that broadband access averaged 66.2 among low-income counties, 72.4 among moderate-income counties, and 77.7 among high-income counties. A one-way analysis of variance test indicated at least one significant difference, F (2, 60.16) = 32.82, p < .05. A post-hoc procedure determined that all three means differed significantly from one another.
Note, too, that it might be wise to “stress test” the above conclusion by deleting the most prominent outliers and rerunning the analysis to see whether the outcome changes as a result. In this case, little changed. Running this code deleted the data for Hancock County (the low outlier in the “Low income” group) and for Williamson County (the high outlier in the “High income” group).
mydata <- mydata[-c(94, 34), ]
As expected, the group averages moved closer together as a result:
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill = "#1f78b4")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, the ANOVA still produced a significant F:
#ANOVA
options(scipen = 999)
oneway.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: mydata$DV and mydata$IV
## F = 34.149, num df = 2.000, denom df = 59.398, p-value =
## 0.0000000001343
… and the post hoc test still found significant differences between all group average pairs:
anova_1 <- aov(mydata$DV ~ mydata$IV)
TukeyHSD(anova_1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mydata$DV ~ mydata$IV)
##
## $`mydata$IV`
## diff lwr upr p adj
## 2 Moderate income-1 Low income 5.722473 2.800611 8.644335 0.0000313
## 3 High income-1 Low income 10.512292 7.612969 13.411614 0.0000000
## 3 High income-2 Moderate income 4.789819 1.914734 7.664904 0.0004208
This code tests the association between the TNBBAccessData.csv
data file’s PctBB
and MedIncome_2
variables.
PctBB
measures, for each county in Tennessee, the
percentage of households that have access to broadband internet. It is a
continuous variable. Meanwhile, MedIncome_2
, a categorical
variable, categorizes each Tennessee county’s median household income as
“1 Low income” or “2 High income.” See the Example dataset codebook for
details about both variables. Technically, the “1” and “2” digits at the
start of each category label are not needed. But including them is a
shortcut for telling R the order of the categories, from lowest to
highest. Without them, the categories would appear in alphabetical
order, and additional code would be needed to tell R that they are
ordinal-level variables and what order they should be put in.
The analysis assumes that a county’s percentage of homes with
broadband internet access depends, at least partly, on the county’s
median household income. Thus, PctBB
is the dependent
variable, and MedIncome_2
is the independent variable. An
independent-samples t-test is suitable for the analysis because the
dependent variable is continuous, and the independent variable is
categorical and has two categories that are independent and mutually
exclusive.
The script’s opening lines of code install (if not installed already)
and activate the tidyverse
and ggplot2
packages, then read the data from the TNBBAccessData.csv
file. The code assumes the data file has been stored in the same
subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
#Read the data
mydata <- read.csv("TNBBAccessData.csv")
For the reasons noted above, these lines define PctBB
as
the dependent variable and MedIncome_2
as the independent
variable.
#Specify the DV and IV
mydata$DV <- mydata$PctBB
mydata$IV <- mydata$MedIncome_2
A t-test compares averages, and an average can mislead you if the
data you averaged included outliers or had some weird, non-normal
distribution. This code graphs the distribution of PctBB
within each of the MedIncome_2
groups and adds vertical
lines showing each group’s average.
#Graph the group distributions and averages
averages <- group_by(mydata, IV) %>%
summarise(mean = mean(DV, na.rm = TRUE))
ggplot(mydata, aes(x = DV))+
geom_histogram()+
facet_grid(IV ~ .)+
geom_histogram(color="black",fill = "#1f78b4")+
geom_vline(data = averages, aes(xintercept=mean,))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
PctBB
’s distribution seems to have a low outlier within
the “1 Low income” group and a high outlier within the “2 High income”
group. Otherwise, though, both distributions look somewhat normal. It’s
hard to tell from the graphs precisely what the group averages and the
values of those outliers are. This code gives some exact numbers.
#Calculate and show the group counts, means, standard
#deviations, minimums, and maximums
group_by(mydata, IV) %>%
summarise(
count = n(),
mean = mean(DV, na.rm = TRUE),
sd = sd(DV, na.rm = TRUE),
min = min(DV, na.rm = TRUE),
max = max(DV, na.rm = TRUE))
## # A tibble: 2 × 6
## IV count mean sd min max
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 Low income 48 68.0 5.13 51.2 75.5
## 2 2 High income 47 76.5 6.11 65.7 93.4
So, PctBB
, the percent of households with broadband
access, averages 68.0 percent among the low-income counties and 76.5
percent among the high-income counties. And those outliers are, well,
pretty far out. The low-income outlier, 51.2 percent, is almost 17
percentage points below the average. That’s more than three times the
distribution’s standard deviation of 5.13 percentage points. Similarly,
the high-income outlier, 93.4 percent, is about 17 percentage points
above the distribution’s average - again, nearly three times the
distribution’s standard deviation of 6.11 percentage points. Recall that
one standard deviation is the typical difference between the
distribution’s scores and its average. So these outliers are about three
times as far from the average as a typical score. A quick check of the data
frame shows that the low outlier belongs to Hancock County, which
occupies Row 34, and the high outlier belongs to Williamson County,
which occupies Row 94. It might be a good idea to run a Shapiro-Wilk
test on the normality of the group distributions. Here’s the code for
doing it:
mydata %>%
group_by(IV) %>%
summarise(`W Statistic` = shapiro.test(DV)$statistic,
`p-value` = shapiro.test(DV)$p.value)
## # A tibble: 2 × 3
## IV `W Statistic` `p-value`
## <chr> <dbl> <dbl>
## 1 1 Low income 0.931 0.00744
## 2 2 High income 0.983 0.731
The high-income group distribution appears normal; the W
statistic’s p-value
, 0.731, is above 0.05. But the
low-income group distribution appears non-normal. Its W
statistic has a p-value of 0.007, which is below 0.05.
Because the groups are not both normally distributed, it would be wise to
use a Wilcoxon rank sum test instead. Here’s the code:
options(scipen = 999)
wilcox.test(mydata$DV ~ mydata$IV)
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: mydata$DV by mydata$IV
## W = 313, p-value = 0.000000001333
## alternative hypothesis: true location shift is not equal to 0
The test’s W statistic, 313, has a p-value
of around 0.000000001333. That’s far less than 0.05, so the test
indicates that the groups differ significantly. You may have noticed the
warning noting that R could not compute an exact p-value for the test
because the data happen to include “ties” - that is, counties with
identical broadband access percentages. The Wilcoxon rank sum test works
by ranking the cases from highest to lowest, and things get fuzzy when
the test has to try to rank cases with identical scores. But the p-value
R came up with is nowhere near exceeding 0.05, so the tie problem is
tolerable.
The results could be written up something like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that average household broadband internet access would differ significantly among high-income counties compared to low-income counties. The analysis found that broadband access averaged 68.0 among “Low income” counties and 76.5 among “High income” counties. A Shapiro-Wilk test indicated the low-income group values were not normally distributed, rendering an independent-samples t-test unsuitable (W = 0.93, p < .05). However, a Wilcoxon rank sum test found the difference to be statistically significant (W = 313, p < .05). The results support the hypothesis.
Had the data been suitable for an independent-samples t-test, we could have run this code:
# Running a t-test
options(scipen = 999)
t.test(mydata$DV ~ mydata$IV,
var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: mydata$DV by mydata$IV
## t = -7.3327, df = 89.614, p-value = 0.00000000009635
## alternative hypothesis: true difference in means between group 1 Low income and group 2 High income is not equal to 0
## 95 percent confidence interval:
## -10.795676 -6.192711
## sample estimates:
## mean in group 1 Low income mean in group 2 High income
## 68.02708 76.52128
… which would have produced a t
value, -7.3327, with an
associated p-value
of 0.00000000009635. That’s far less
than 0.05, so the difference between the low- and high-income group
averages would be statistically significant. By the way, the
t
value is negative because the average of the first group
considered, the low-income group, is lower than the average of the
second group considered, the high-income group. If we had given R the
groups in the opposite order, the t
value would have been
positive.
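If you want to see that for yourself, a sketch of reversing the group order looks like this (it assumes the IV’s category labels are “1 Low income” and “2 High income,” as shown in the output above):
# Re-run the t-test with the group order reversed; the t value's sign flips
mydata$IV_reversed <- factor(mydata$IV,
                             levels = c("2 High income", "1 Low income"))
t.test(mydata$DV ~ mydata$IV_reversed,
       var.equal = FALSE)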
The results of the t-test could be written up like this:
Prior research suggests that wealthier areas tend to have higher levels of broadband internet access. This analysis hypothesized that average household broadband internet access would differ significantly among high-income counties compared to low-income counties. The analysis found that broadband access averaged 68.0 among “Low income” counties and 76.5 among “High income” counties. An independent samples t-test found the difference to be statistically significant, t (89.61) = -7.33, p < .05. The results support the hypothesis.
This code uses a paired-samples t-test to evaluate whether each Tennessee county’s percentage of households with broadband internet access, as estimated in 2019 by PctBB in the TNBBAccessData.csv data file, increased significantly compared to each county’s estimated percentage from two years earlier, recorded in the same file’s PctBB2017 variable. A paired-samples t-test is suitable for the analysis because each county’s PctBB estimate from 2019 is paired with the same county’s PctBB2017 estimate from 2017. In short, the county observations are paired sequentially. See Not sure whether you need an independent-samples t-test or a paired-samples t-test? for more explanation. See the Example dataset codebook for details about both variables.
The analysis assumes that each county’s percentage of homes with broadband internet access depends, at least partly, on the passage of time between 2017 and 2019. Thus, the percentage of homes with broadband is the continuous dependent variable, and the year the percentage was measured (either 2017 or 2019) is the categorical independent variable. These variables are implied, rather than represented directly, in the dataset, though, because the dataset has been arranged in the format that a paired-samples t-test requires.
The script’s opening lines of code install (if not installed already) the dplyr and tidyverse packages, activate the dplyr and ggplot2 packages, then read the data from the TNBBAccessData.csv file. The code assumes the data file has been stored in the same subdirectory as the script.
#Installing required packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyverse")) install.packages("tidyverse")
library(dplyr)
library(ggplot2)
options(scipen = 999)
#Read the data
mydata <- read.csv("TNBBAccessData.csv") #Edit YOURFILENAME.csv
The 2017 broadband access levels in PctBB2017
are
designated as V1
for purposes of the analysis, because they
came first in the time sequence being examined. The broadband access
levels in PctBB
, which are from 2019, are designated as
V2
, because they came second in the time sequence.
#Specify the two variables involved
mydata$V1 <- mydata$PctBB2017
mydata$V2 <- mydata$PctBB
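As noted above, the year is only implied by this wide arrangement. If you ever want the DV and IV as explicit columns, a quick pivot can create them in a long-format copy of the data. This is an optional sketch; it assumes the tidyr package, which installs along with tidyverse, and the Year and PctBroadband column names are arbitrary.
#Optional: make the implied DV and IV explicit in a long-format copy of the data
library(tidyr)
longdata <- mydata %>%
  pivot_longer(cols = c(PctBB2017, PctBB),
               names_to = "Year",
               values_to = "PctBroadband") %>%
  mutate(Year = ifelse(Year == "PctBB2017", "2017", "2019"))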
A paired-samples t-test assumes that differences between each pair of scores are approximately normally distributed, with no substantial outliers. A histogram of the pair differences can tell you whether these assumptions are true:
#Look at the distribution of the pair differences
mydata$PairDifferences <- mydata$V2 - mydata$V1
ggplot(mydata, aes(x = PairDifferences)) +
geom_histogram(color = "black", fill = "#1f78b4") +
geom_vline(aes(xintercept = mean(PairDifferences)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this case, the histogram suggests that both assumptions may have been violated. The data show at least hints of the bell shape characteristic of a normal distribution. But the right side of the distribution seems a lot lower than the left side does - suggesting that counties with relatively large differences are less common than counties with relatively small differences. Also, there’s a pretty clear outlier on the left - that is, a county with an unusually small change in broadband access between 2017 and 2019.
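A boxplot offers a second look at potential outliers by plotting them as individual points. This is an optional sketch; it uses the PairDifferences column created above.
#Optional: boxplot of the pair differences, with outliers plotted as points
ggplot(mydata, aes(y = PairDifferences)) +
  geom_boxplot(fill = "#1f78b4")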
The histogram doesn’t show precise values for the average and the outlier, so some descriptive statistics may be enlightening:
#Get descriptive statistics for pair differences
mydata %>%
select(PairDifferences) %>%
summarise(
count = n(),
mean = mean(PairDifferences, na.rm = TRUE),
sd = sd(PairDifferences, na.rm = TRUE),
min = min(PairDifferences, na.rm = TRUE),
max = max(PairDifferences, na.rm = TRUE))
## count mean sd min max
## 1 95 6.827368 2.578571 0.8 12.2
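The summary doesn’t say which counties sit at those extremes. If you’re curious, a quick filter can name them. This is an optional sketch; it assumes the NAME variable described in the Example dataset codebook.
#Optional: identify the counties with the smallest and largest pair differences
mydata %>%
  filter(PairDifferences == min(PairDifferences) |
           PairDifferences == max(PairDifferences)) %>%
  select(NAME, PairDifferences)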
So, the low outlier is a county - Weakley County, it turns out - in which broadband access grew by just eight-tenths of a percentage point, more than two standard deviations below the 6.8-percentage-point average. And the maximum - Lauderdale County’s 12.2-percentage-point surge in broadband access - is more than two standard deviations above the average. It might, indeed, be wise to at least check whether the normality assumption has been met:
#Shapiro-Wilk test
options(scipen = 999)
shapiro.test(mydata$PairDifferences)
##
## Shapiro-Wilk normality test
##
## data: mydata$PairDifferences
## W = 0.97041, p-value = 0.02993
The Shapiro-Wilk test produced a W value of 0.97, which has a p-value of about 0.03. That’s less than 0.05, so the distribution of the pair differences is not normal, and a t-test isn’t suitable in this case. No problem, though. The Wilcoxon signed rank test has some drawbacks. But one thing it has going for it is that it doesn’t require normally distributed pair differences. So, you can use a Wilcoxon signed rank test instead of a paired-samples t-test when your data don’t meet the assumptions needed for a paired-samples t-test. Here’s the code:
#Wilcoxon signed rank test
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean,SD = sd))
## V1_Mean V2_Mean V1_SD V2_SD
## 1 65.40211 72.22947 8.005702 7.045458
wilcox.test(mydata$V1, mydata$V2, paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: mydata$V1 and mydata$V2
## V = 0, p-value < 0.00000000000000022
## alternative hypothesis: true location shift is not equal to 0
The average of V1 (the 2017 percentages) is 65.4 with a standard deviation of 8.0, while the average of V2 (the 2019 percentages) is 72.2, with a standard deviation of 7.0. The test produced a V value of 0 and a p-value of less than 0.00000000000000022. The latter is far less than 0.05. It looks like the average difference between the 2019 and 2017 scores, 6.8 percentage points (shown in the summary statistics from the pair differences distribution, above), is statistically significant. The results could be written up like this:
This analysis hypothesized that county-level percentages of households with broadband access averaged significantly higher in 2019 than in 2017. The analysis found that access averaged about 65 percent in 2017 and 72 percent in 2019. The differences between the pairs of percentages averaged 6.8 percentage points. A Shapiro-Wilk test indicated these differences were not normally distributed, rendering a paired-samples t-test unsuitable (W = 0.97, p < .05). However, a Wilcoxon signed rank test found the difference to be statistically significant (V = 0, p < .05).
Had a t-test been suitable, however, we could have run this code:
#t-test (Use if pair differences distribution is normal)
mydata %>%
select(V1, V2) %>%
summarise_all(list(Mean = mean,SD = sd))
## V1_Mean V2_Mean V1_SD V2_SD
## 1 65.40211 72.22947 8.005702 7.045458
options(scipen = 999)
t.test(mydata$V2, mydata$V1,
paired = TRUE)
##
## Paired t-test
##
## data: mydata$V2 and mydata$V1
## t = 25.807, df = 94, p-value < 0.00000000000000022
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 6.302087 7.352650
## sample estimates:
## mean difference
## 6.827368
… the results of which would indicate that the t value, 25.81, had a p-value of less than 0.00000000000000022. That’s much less than 0.05, so the test would have indicated a statistically significant difference. The results could have been written up this way:
This analysis hypothesized that county-level percentages of households with broadband access averaged significantly higher in 2019 than in 2017. The analysis found that access averaged about 65 percent in 2017 and 72 percent in 2019. The differences between the pairs of percentages averaged 6.8 percentage points. A paired-samples t-test found the difference to be statistically significant (t(94) = 25.8, p < .05).
Not sure whether you need an Independent-samples t-test or a Paired-samples t-test?
Similarities between the two tests can make telling them apart confusing. For example, both kinds of t-tests involve comparing the average of one set of numbers with the average of a second set of numbers. In both instances, the dependent variable is continuous, and the categorical independent variable tells you which set a given number belongs to: the first set, or the second set.
The difference, though, involves whether each number in the first set is somehow connected, or paired, with a number in the second set. When no such connection exists, you use an independent-samples t-test. When such a connection does exist, you use a paired-samples t-test.
Suppose you wanted to learn whether college freshmen and college seniors score significantly differently on a 10-question test of knowledge about politics. College supposedly broadens one’s understanding of the world, so there is reason to think seniors would outperform freshmen. But college’s demands also keep students busy. Too busy, perhaps, to keep up with politics, especially during senior-level courses. So, maybe freshmen would outscore seniors.
A five-student sample of your dataset might look like this (Example 1):
ID | Class | Knowledge
---|---|---
1 | Freshman | 4
2 | Senior | 5
3 | Senior | 7
4 | Freshman | 3
5 | Freshman | 4
Each row represents a person. “Knowledge” is a continuous variable indicating each person’s score on the 10-question political knowledge test. “Class” is a categorical variable indicating whether each person is a freshman or a senior - that is, belongs to the “freshman” set of scores or to the “senior” set of scores.
The key thing to see is that no one person is included in both sets of scores. Nobody in the study is both a freshman and a senior. Everybody in the study is in only one set or the other. In technical terms, the two sets of scores are “mutually exclusive,” meaning they have no members in common. Furthermore, at least as far as we can tell, the two sets are “independent.” That means nobody in the freshman set has any sort of relationship with someone in the senior set that might link their scores. For example, suppose one of the freshmen and one of the seniors happened to be siblings whose mother happened to be a U.S. senator. Both probably would have higher, and more similar, levels of political knowledge than some random freshman and senior with no such connection between them. Ignoring that possibility could lower the validity of your research’s conclusions.
As long as the two sets of scores are mutually exclusive and independent, as described above, an independent-samples t-test would be the right procedure to use.
There could be advantages, however, to intentionally studying political knowledge among pairs of siblings like the pair mentioned above.
All sorts of factors that have nothing to do with a college student’s class rank nonetheless could boost or diminish the college student’s level of political knowledge. Some students may come from families that talked about politics all the time. Others may come from families that rarely talked about politics. Some may come from wealthier families that could afford access to elite political information sources. Others may come from poorer families that could not.
But what if you could manage to measure political knowledge among three dozen or so sibling pairs in which one sibling was a freshman and the other was a senior? If you could do that, you would be able to investigate whether time spent in college seems to make a difference in political knowledge regardless of household circumstances. That’s because you would be comparing freshmen from politically engaged families with seniors from politically engaged families, freshmen from politically disengaged families with seniors from politically disengaged families, freshmen from richer families with seniors from richer families, freshmen from poorer families with seniors from poorer families, and so on, perhaps even for factors you hadn’t thought of or imagined.
You’d have to arrange your data a little differently than the data in Example 1 were arranged, though. A five-sibling-pair sample of your data might look like this (Example 2):
Sibling_pair_ID | Knowledge_Freshman_sibling | Knowledge_Senior_sibling
---|---|---
1 | 3 | 5
2 | 5 | 7
3 | 4 | 6
4 | 3 | 8
5 | 4 | 7
In this arrangement, each horizontal row is a sibling pair, not an individual person, as was the case in Example 1. The Sibling_pair_ID variable shows a unique ID code for each sibling pair. Within each row, the Knowledge_Freshman_sibling variable shows the political knowledge test score for the freshman member of the pair, and the Knowledge_Senior_sibling variable shows the political knowledge test score for the senior member of the pair.
If you are going to analyze paired data, R will expect you to arrange the data in this way so that R can tell which cases make up each pair. Even though the dataset’s values have been arranged differently, though, the dependent variable is still “political knowledge,” measured continuously as a zero-to-10 score on a political knowledge test, and the independent variable is still “class rank,” measured categorically as either “freshman” or “senior.”
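To make the connection to code concrete, here is a minimal sketch of how such paired data could be analyzed, assuming the sibling-pair values were saved in a file named siblings.csv with the column names shown in Example 2 (both the file name and the data are hypothetical).
#Hypothetical example: paired-samples t-test on sibling-pair data like Example 2
siblings <- read.csv("siblings.csv")
t.test(siblings$Knowledge_Senior_sibling,
       siblings$Knowledge_Freshman_sibling,
       paired = TRUE)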
Alternatively, sets in a t-test data file can be linked sequentially. For example, suppose you investigated the relationship between political knowledge and college class rank by recruiting some college freshmen, measuring their political knowledge, then tracking them down four years later, when they are seniors, and measuring their political knowledge a second time. A five-person sample from your dataset might look like this (Example 3):
Student | Knowledge_when_a_freshman | Knowledge_when_a_senior
---|---|---
1 | 3 | 5
2 | 5 | 7
3 | 4 | 6
4 | 3 | 8
5 | 4 | 7
The data are arranged the same way they were in Example 2. The key differences are that:
Each horizontal row now holds data from one person, rather than from each of two siblings, and …
The last two vertical columns hold data about the person collected during two time periods. In this case, the two time periods are when the person was a freshman, and four years later, when the person was a senior.
As in Example 2, the dependent variable is still “political knowledge,” measured continuously as a zero-to-10 score on a political knowledge test, and the independent variable is still “class rank,” measured categorically as either “freshman” or “senior.” The difference is that the “freshman” and “senior” in each horizontal row are the same person. The “freshman” score was measured when the row’s person was a freshman, and the “senior” score was measured when the row’s person was a senior.
To recap: If you know you need to conduct a t-test, but you’re unsure whether to use an independent-samples or a paired-samples t-test, consider the two sets of scores the t-test will be comparing. If the sets are mutually exclusive and independent, use an independent-samples t-test. But if each score in the first set is somehow paired with a score in the second set - for example, through a circumstantial or sequential connection - then use a paired-samples t-test.
The example dataset, TNBBAccessData.csv
,
contains data extracted from the 2019 five-year American Community
Survey for all counties in Tennessee. See below for the Example dataset codebook.
One way to get the example data onto your computer is to run this R code:
# Read the data from the web
TNBBAccessData <- read.csv("https://drkblake.com/wp-content/uploads/2023/08/TNBBAccessData.csv")
# Save the data on your computer
write.csv(TNBBAccessData, "TNBBAccessData.csv", row.names=FALSE)
# remove the data from the environment
rm (TNBBAccessData)
Most variables in the example dataset are county-level measures of either broadband internet access or household income. Some of the variables are expressed as continuous variables. Others are expressed as categorical variables, with two or three categories. Some represent different time periods. The differing measurement levels allow demonstration of different statistical techniques, depending on the measurement levels of the variables involved.
Definitions for specific variables are as follows:
GEOID: A unique code the Census Bureau assigns to each county in the United States. It’s here because I got these data from data.census.gov, the Census Bureau’s main data dissemination site.
NAME: The name of each county.
PctBB: The percentage of households with broadband internet access in 2019.
MedIncome: Median household income.
PctBB2017: The percentage of households with broadband internet access, according to an estimate released in 2017, two years before the release of the other variables in this dataset.
PctBB_2: PctBB, dichotomized at the median, so that values at or below the median are coded as “1 Low BB access,” and percentages above the median are coded as “2 High BB access.”
PctBB_2_NUM: Same as PctBB_2, but with a numeric value of 0 used to indicate “1 Low BB access,” and a numeric value of 1 used to indicate “2 High BB access.” Such numeric codes are required when a two-value categorical variable is used as the dependent variable in a logistic regression procedure.
MedIncome_2: MedIncome, dichotomized at the median, so that values at or below the median are coded as “1 Low income,” and values above the median are coded as “2 High income.”
PctBB_3: PctBB, but recoded into three roughly equal-sized groups: “1. Low BB access,” indicating counties with the lowest percentages of broadband internet access; “2. Moderate BB access,” indicating counties with mid-level percentages of broadband internet access; and “3. High BB access,” indicating counties with high percentages of broadband internet access. The groups were formed by sorting the county broadband access figures from smallest to greatest, then dividing the counties into three roughly equal groups.
MedIncome_3: MedIncome, but recoded into three roughly equal-sized groups: “1. Low income,” indicating counties with the lowest median household income values; “2. Moderate income,” indicating counties with mid-level median household income values; and “3. High income,” indicating counties with high median household income values.
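If you ever need to build similar recodes for your own data, the sketch below shows one way to do it with dplyr. It is only an illustration - the cut points it produces may not exactly match the ones used to build the example dataset’s recoded variables, and the _check suffixes on the new column names are arbitrary.
#One way to reproduce recodes like PctBB_2_NUM and MedIncome_3 (illustrative only)
library(dplyr)
TNBBAccessData <- read.csv("TNBBAccessData.csv")
TNBBAccessData <- TNBBAccessData %>%
  mutate(
    #0/1 numeric version of a median split, as logistic regression requires
    PctBB_2_NUM_check = ifelse(PctBB <= median(PctBB), 0, 1),
    #Three roughly equal-sized income groups
    MedIncome_3_check = case_when(
      ntile(MedIncome, 3) == 1 ~ "1. Low income",
      ntile(MedIncome, 3) == 2 ~ "2. Moderate income",
      ntile(MedIncome, 3) == 3 ~ "3. High income"))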
The R Markdown package adds an authoring framework you can use to easily combine R code, R output, and text into a nicely formatted document that you can publish on the web or elsewhere. This code will install R Markdown for you:
if (!require("rmarkdown")) install.packages("rmarkdown")
After installing R Markdown, you can create a new R Markdown file by clicking File / New File / R Markdown …
You can work in “Source” or “Visual” mode in R Markdown. You’ll probably prefer working in the “Visual” mode.
It is best to write and test your R code in a regular R script window. Then, when you are sure the code will work as expected, copy and paste the code into a code block in R Markdown. Click the green “+C” button to add a code block, and paste your code into the block. Type your headings and explanatory material around your code blocks. When you are ready to see the result, click the blue “Knit” icon.
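For reference, a pasted-in code block might look something like the sketch below once it is inside an R Markdown document. The chunk label (broadband-histogram) and the code inside the chunk are just examples; any R code you have already tested can go between the delimiters.
```{r broadband-histogram}
#Example chunk: read the example data and plot the 2019 broadband percentages
mydata <- read.csv("TNBBAccessData.csv")
hist(mydata$PctBB)
```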