This code-through explores how to compute descriptive statistics for qualitative (categorical) data in R. The main packages used are dplyr, gtsummary, ggplot2, scales, pander, tidyr, and kableExtra.
Specifically, it focuses on how to compute frequency distributions and proportions (including within-group percentages) and how to explore relationships between variables through visualization.
Statistics and visualizations are useful for understanding relationships between or among variables. For example, a correlation coefficient summarizes the strength and direction of an association between two variables, while a regression coefficient estimates how much an outcome changes with a predictor; neither establishes causation on its own. These relationships can also be shown graphically, for example with bar charts, histograms, and pie charts.
Equipped with this knowledge, you will be able to explore and understand patterns of association between and among the variables in your dataset.
In this piece, you'll learn how to:
* compute frequency distributions and simple percentages;
* use bar charts to visualize categorical variables by counts and by in-group proportions;
* produce presentation-ready tables with the gtsummary package.
We begin with presentation-ready tables and then turn to in-group proportions. First, we load the packages and the sample dataset.
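# LOAD PACKAGES
library(dplyr)      # data manipulation
library(gtsummary)  # presentation-ready summary tables
library(magrittr)
library(backports)
library(pander)
library(tidyr)
library(reshape2)
library(scales)     # percent labels for plot axes
library(ggplot2)    # plotting
library(readxl)     # read_excel()
# LOAD DATA (adjust the path to where the workbook is stored on your machine)
CodeData <- read_excel("C:/Users/seyin/OneDrive - Georgia State University/R Summer Class/Assignments/Code_through_Data.xlsx")
View(CodeData)  # open the data in the viewer (interactive use)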
This code-through draws on the sources in the reference list, which cover presentation-ready summary tables, visualization with ggplot2, and data manipulation with dplyr.
A basic example shows how to compute a frequency table with percentages and present the result in a publishable format.
In base R, without the gtsummary package, the table looks like this:
##
## Female Male
## 12 14
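The call that produced this output is not shown. As a minimal sketch, base R's `table()` gives counts of this form and `prop.table()` turns them into percentages. Here `demo_data` is a hypothetical stand-in for `CodeData`, built only to match the counts shown above.

```r
# Hypothetical stand-in for CodeData: 12 Female and 14 Male respondents,
# matching the counts in the output above (an assumption for illustration).
demo_data <- data.frame(
  Gender = rep(c("Female", "Male"), times = c(12, 14))
)

# Frequency distribution of Gender
gender_counts <- table(demo_data$Gender)
gender_counts

# Simple percentages, rounded to whole numbers
gender_pct <- round(100 * prop.table(gender_counts))
gender_pct
```

The rounded percentages (46 and 54) agree with the n (%) figures in the gtsummary tables in this code-through.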
With the gtsummary package, the same table looks cleaner and is close to publishable:
Characteristic | N = 26¹ |
---|---|
Gender | |
Female | 12 (46%) |
Male | 14 (54%) |

¹ Statistics presented: n (%)
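The code behind this table is not shown in the source. A call along the following lines would produce a table of this shape; it is a sketch, with `demo_data` as a hypothetical stand-in for `CodeData`.

```r
library(dplyr)
library(gtsummary)

# Hypothetical stand-in for CodeData (counts assumed from the table above)
demo_data <- data.frame(
  Gender = rep(c("Female", "Male"), times = c(12, 14))
)

# tbl_summary() reports n (%) for each level of a categorical variable
gender_tbl <- demo_data %>%
  select(Gender) %>%
  tbl_summary()

gender_tbl
```

The footnote wording may differ from the one shown above depending on the installed gtsummary version.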
With a little more code, we can add a column for the total sample size (N) and bold the variable labels.
Characteristic | N | N = 26¹ |
---|---|---|
**Gender** | 26 | |
Female | | 12 (46%) |
Male | | 14 (54%) |

¹ Statistics presented: n (%)
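One way to get the extra N column and bold labels, sketched under the same assumptions (`demo_data` is a hypothetical stand-in for `CodeData`), is to chain gtsummary's `add_n()` and `bold_labels()` onto the summary table.

```r
library(dplyr)
library(gtsummary)

# Hypothetical stand-in for CodeData (counts assumed from the table above)
demo_data <- data.frame(
  Gender = rep(c("Female", "Male"), times = c(12, 14))
)

gender_tbl_n <- demo_data %>%
  select(Gender) %>%
  tbl_summary() %>%
  add_n() %>%        # adds a column with the number of non-missing observations
  bold_labels()      # bolds the variable labels (here, "Gender")

gender_tbl_n
```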
The same approach works with more than one variable. Here, two variables are used: Gender and Food choice.
Characteristic | N | N = 26¹ |
---|---|---|
**Gender** | 26 | |
Female | | 12 (46%) |
Male | | 14 (54%) |
**Food choice** | 26 | |
Amala | | 3 (12%) |
Beans | | 8 (31%) |
Eba | | 9 (35%) |
Rice | | 6 (23%) |

¹ Statistics presented: n (%)
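A two-variable table like this one can be sketched by selecting both columns before calling `tbl_summary()`. In the stand-in `demo_data` below, the marginal counts match the table, but how food choices are paired with gender is an assumption made purely for illustration.

```r
library(dplyr)
library(gtsummary)

# Hypothetical stand-in for CodeData. Marginal counts match the table above;
# the Gender-by-food pairing is assumed, not taken from the real data.
demo_data <- data.frame(
  Gender = rep(c("Female", "Male"), times = c(12, 14)),
  `Food choice` = c(
    rep(c("Amala", "Beans", "Eba", "Rice"), times = c(1, 4, 4, 3)),  # 12 Female rows
    rep(c("Amala", "Beans", "Eba", "Rice"), times = c(2, 4, 5, 3))   # 14 Male rows
  ),
  check.names = FALSE  # keep the space in the column name "Food choice"
)

two_var_tbl <- demo_data %>%
  select(Gender, `Food choice`) %>%
  tbl_summary() %>%
  add_n() %>%
  bold_labels()

two_var_tbl
```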
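Compare the base R output with the gtsummary tables: the latter are ready for presentation. The same frequencies can also be shown as bar charts with ggplot2, first as counts (vertical, then horizontal bars) and then as proportions.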
# Bar chart of food choice by counts
CodeData %>%
  ggplot(mapping = aes(x = `Food choice`)) +
  geom_bar() +
  labs(title = "Food preference by counts",
       caption = "Source: Adedotun's Code through")
# Same chart with horizontal bars
CodeData %>%
  ggplot(mapping = aes(x = `Food choice`)) +
  geom_bar() +
  labs(title = "Food preference by counts",
       caption = "Source: Adedotun's Code through") +
  coord_flip()
# Bar chart of food choice by proportion of all respondents
# (after_stat() replaces the deprecated ..count.. notation)
ggplot(CodeData, aes(x = `Food choice`)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  xlab("Food choice") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
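The proportions drawn in the last chart can also be computed as a table with dplyr, which makes it easy to check the bar heights. This is a sketch on `demo_data`, a hypothetical stand-in for `CodeData` whose food-choice counts are taken from the summary table earlier in this code-through.

```r
library(dplyr)

# Hypothetical stand-in for CodeData (food-choice counts assumed from the
# summary table: Amala 3, Beans 8, Eba 9, Rice 6)
demo_data <- data.frame(
  `Food choice` = rep(c("Amala", "Beans", "Eba", "Rice"), times = c(3, 8, 9, 6)),
  check.names = FALSE
)

# count() gives the frequency of each category; dividing by the total
# gives each category's share of all respondents
food_props <- demo_data %>%
  count(`Food choice`) %>%
  mutate(prop = n / sum(n))

food_props
```

The `prop` column sums to 1 across the four food choices, which is what the proportion axis of the chart represents.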
ggplot2 can also plot in-group proportions, which help explore a pattern by a specific demographic feature, for example food choice by gender.
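# Count each Gender x Food choice combination, then compute proportions
# within each gender (summarise() drops the last grouping level)
CD <- CodeData %>%
  group_by(Gender, `Food choice`) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(prop = n / sum(n))
# Side-by-side bars: within-gender proportion of each food choice
ggplot(CD, aes(`Food choice`, prop, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Food preference by Gender (Percentages)",
       caption = "Source: Adedotun's Code through") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()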
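The grouped pipeline above works because the proportions are computed within each gender rather than over the whole sample. A quick way to confirm this is to check that `prop` sums to 1 inside every gender group. The sketch below uses `demo_data`, a hypothetical stand-in for `CodeData`; its Gender-by-food pairing is an assumption, so the individual proportions are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for CodeData. Marginal counts match the summary
# tables; the Gender-by-food pairing is assumed for illustration.
demo_data <- data.frame(
  Gender = rep(c("Female", "Male"), times = c(12, 14)),
  `Food choice` = c(
    rep(c("Amala", "Beans", "Eba", "Rice"), times = c(1, 4, 4, 3)),  # 12 Female rows
    rep(c("Amala", "Beans", "Eba", "Rice"), times = c(2, 4, 5, 3))   # 14 Male rows
  ),
  check.names = FALSE
)

# Count each combination, then divide by the total within each gender
props_by_gender <- demo_data %>%
  count(Gender, `Food choice`) %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Each gender's proportions should add up to 1
group_totals <- props_by_gender %>%
  group_by(Gender) %>%
  summarise(total = sum(prop))

group_totals
```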
The same data can also be drawn as a component (stacked) bar plot.
# Within-gender proportions of each food choice
CD <- CodeData %>%
  group_by(Gender, `Food choice`) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(prop = n / sum(n))
# Stacked bars: the gender segments are placed on top of each other
ggplot(CD, aes(`Food choice`, prop, fill = Gender)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Food preference by Gender (Percentages)",
       caption = "Source: Adedotun's Code through") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()
Learn more about gtsummary, dplyr, and ggplot2 with the following:
Resource I ddsjoberg / gtsummary
Resource II Relative frequencies / proportions with dplyr
Resource III ggplot2 - Multi-group histogram with in-group proportions rather than frequency
This code-through references and cites the following sources:
Boehmke, B. (2018). Categorical Data Descriptive Statistics.
Sjoberg, D., Hannum, M., Whiting, K. (2020). Presentation-Ready Summary Tables with gtsummary.