####Deliverable
Choose variables to analyze and submit at least one visualization and description on Canvas. Use the step-by-step example on page 31 of your textbook as a model. You should include a table and either a bar plot or a pie chart with an accompanying description.
The following document will walk through an example analysis of the variables gender and years worked in education, but you are encouraged to explore different variables for your submission.
####Learning Objectives
Here are some of the skills you should be familiar with in analyzing categorical data. This file will provide you with the technical resources to make these visualizations. While not a focus of this document, make sure you are able to interpret the meaning of your graphical displays as well. Use Chapter 3 as a resource.
Chapter 3 of your textbook
R markdown help
We will also be using the package ggplot2 to visualize our data. Here are resources for barplots (Resource 1, Resource 2) and piecharts (Resource 1).
How is the data stored?
str(data_resilience)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 299 obs. of 34 variables:
## $ Timestamp : chr "8/24/19 9:56" "8/24/19 14:09" "8/29/19 13:57" "9/4/19 17:35" ...
## $ Please indicate your age range. : chr "45 - 49" "35 - 39" "50 - 54" "50 - 54" ...
## $ Please indicate your gender. : chr "Female" "Female" "Female" "Female" ...
## $ Please indicate your race. : chr "White" "White" "White" "White" ...
## $ How many years have you been working in education? : chr "25 - 29 years" "5 - 9 years" "25 - 29 years" "20 - 24 years" ...
## $ Which option best describes your role in your school?: chr "Administrator" "Administrator" "Administrator" "Administrator" ...
## $ In what division do you work? : chr "Upper School" "Upper School" "Upper School" "Upper School" ...
## $ In what state is your school located? : chr "DE" "AZ" "MS" "NY" ...
## $ How would you describe your school? : num 2 3 3 3 4 4 3 4 4 3 ...
## $ How much autonomy do you have in your job? : num 3 4 4 4 4 4 4 5 4 3 ...
## $ Know Yourself : num 5 4 4 4 5 5 4 4 5 5 ...
## $ Understand Emotions : num 5 4 4 4 5 4 4 3 5 4 ...
## $ Tell Empowering Stories : num 4 3 3 3 4 4 4 4 5 4 ...
## $ Build Community : num 4 3 5 4 3 5 3 4 5 3 ...
## $ Be Here Now : num 3 2 4 3 3 5 4 3 5 4 ...
## $ Take Care of Yourself : num 3 3 2 5 4 2 3 3 4 5 ...
## $ Focus on the Bright Spots : num 3 3 4 3 2 3 4 4 4 5 ...
## $ Cultivate Compassion : num 3 4 4 4 2 5 3 4 5 4 ...
## $ Be a Learner : num 3 4 4 3 3 5 4 4 4 4 ...
## $ Play and Create : num 3 3 5 3 3 5 4 4 5 5 ...
## $ Ride the Waves of Change : num 4 3 4 4 4 5 4 5 4 4 ...
## $ Celebrate and Appreciate : num 4 3 5 5 4 4 4 5 5 4 ...
## $ Purposefulness : num 5 4 4 4 4 4 4 4 5 4 ...
## $ Acceptance : num 4 3 3 4 5 3 4 5 5 4 ...
## $ Optimism : num 4 3 4 4 3 4 4 4 5 4 ...
## $ Empathy : num 3 5 4 4 5 5 4 4 5 5 ...
## $ Humor : num 4 5 4 2 4 5 4 5 4 5 ...
## $ Positive Self-Perception : num 2 3 4 4 5 3 3 4 4 3 ...
## $ Empowerment : num 3 4 5 4 5 5 4 4 4 4 ...
## $ Perspective : num 3 5 4 5 5 5 4 4 4 4 ...
## $ Curiosity : num 4 4 4 3 4 5 4 5 4 4 ...
## $ Courage : num 4 4 4 4 5 4 4 5 4 3 ...
## $ Perseverance : num 4 5 4 4 5 5 3 5 4 4 ...
## $ Trust : num 3 3 4 4 5 5 3 4 5 3 ...
## - attr(*, "spec")=
## .. cols(
## .. Timestamp = col_character(),
## .. `Please indicate your age range.` = col_character(),
## .. `Please indicate your gender.` = col_character(),
## .. `Please indicate your race.` = col_character(),
## .. `How many years have you been working in education?` = col_character(),
## .. `Which option best describes your role in your school?` = col_character(),
## .. `In what division do you work?` = col_character(),
## .. `In what state is your school located?` = col_character(),
## .. `How would you describe your school?` = col_double(),
## .. `How much autonomy do you have in your job?` = col_double(),
## .. `Know Yourself` = col_double(),
## .. `Understand Emotions` = col_double(),
## .. `Tell Empowering Stories` = col_double(),
## .. `Build Community` = col_double(),
## .. `Be Here Now` = col_double(),
## .. `Take Care of Yourself` = col_double(),
## .. `Focus on the Bright Spots` = col_double(),
## .. `Cultivate Compassion` = col_double(),
## .. `Be a Learner` = col_double(),
## .. `Play and Create` = col_double(),
## .. `Ride the Waves of Change` = col_double(),
## .. `Celebrate and Appreciate` = col_double(),
## .. Purposefulness = col_double(),
## .. Acceptance = col_double(),
## .. Optimism = col_double(),
## .. Empathy = col_double(),
## .. Humor = col_double(),
## .. `Positive Self-Perception` = col_double(),
## .. Empowerment = col_double(),
## .. Perspective = col_double(),
## .. Curiosity = col_double(),
## .. Courage = col_double(),
## .. Perseverance = col_double(),
## .. Trust = col_double()
## .. )
Because all of our data is categorical (even the variables recorded as numbers), we need to convert every variable in our dataset to a “factor”.
data_resilience[]<-lapply(data_resilience, factor)
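To see what this conversion does, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are hypothetical and are not taken from the survey. It suggests that a numeric rating column becomes a factor whose levels are the distinct ratings, while the object stays a data frame.

```r
# Hypothetical toy data: one text column and one numeric rating column
toy <- data.frame(
  role   = c("Teacher", "Administrator", "Teacher", "Teacher"),
  rating = c(4, 5, 4, 3),
  stringsAsFactors = FALSE
)

# Before conversion: one character column and one numeric column
classes_before <- sapply(toy, class)

# Convert every column to a factor; assigning into toy[] keeps the data frame structure
toy[] <- lapply(toy, factor)

# After conversion: every column is a factor
classes_after <- sapply(toy, class)
rating_levels <- levels(toy$rating)

print(classes_before)
print(classes_after)
print(rating_levels)
```

Once the ratings are factors, functions such as `table()` treat each rating as a category rather than as a quantity.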
####1. Tables
Let’s start by looking at how our variables are distributed across different categories. We can organize these counts into tables, which record the totals or percentages alongside the category names.
Here we will look at how the variable of gender is distributed. We will look at both a table of counts (frequencies) and a table of proportions (relative frequencies).
#frequency table (counts)
table_gender<-table(data_resilience$`Please indicate your gender.`)
table_gender
##
## Female Male Non-binary
## 221 64 4
## Prefer not to answer
## 9
#relative frequency table (proportions)
table_gender_rel<-prop.table(table_gender)
table_gender_rel
##
## Female Male Non-binary
## 0.74161074 0.21476510 0.01342282
## Prefer not to answer
## 0.03020134
#marginal distribution
addmargins(table_gender)
##
## Female Male Non-binary
## 221 64 4
## Prefer not to answer Sum
## 9 298
What are the differences between these tables?
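One way to explore how the three kinds of table relate is to rebuild them on a tiny example. The sketch below uses a made-up vector `answers` (hypothetical, not survey data) and checks that each proportion is simply a count divided by the total.

```r
# Hypothetical responses, not taken from the survey
answers <- factor(c("Yes", "No", "Yes", "Yes", "No", "Maybe", "Yes", "No"))

counts      <- table(answers)       # frequency table
proportions <- prop.table(counts)   # relative frequency table
with_total  <- addmargins(counts)   # counts plus a "Sum" entry

# Each proportion is the count divided by the total number of responses
by_hand <- counts / sum(counts)

print(counts)
print(round(proportions, 3))
print(with_total)
```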
Next we will create a contingency table comparing gender and years in education.
#contingency table
table_gender_years<-table(data_resilience$`Please indicate your gender.`, data_resilience$`How many years have you been working in education?`)
addmargins(table_gender_years)
##
## 0 - 4 years 10 - 14 years 15 - 19 years
## Female 19 32 37
## Male 9 16 9
## Non-binary 0 2 0
## Prefer not to answer 3 2 0
## Sum 31 52 46
##
## 20 - 24 years 25 - 29 years 30 or more years
## Female 28 22 32
## Male 8 6 6
## Non-binary 0 1 0
## Prefer not to answer 1 2 1
## Sum 37 31 39
##
## 5 - 9 years Sum
## Female 49 219
## Male 9 63
## Non-binary 1 4
## Prefer not to answer 0 9
## Sum 59 295
What is the difference between the following two tables?
addmargins(prop.table(table_gender_years, margin=1))
##
## 0 - 4 years 10 - 14 years 15 - 19 years
## Female 0.08675799 0.14611872 0.16894977
## Male 0.14285714 0.25396825 0.14285714
## Non-binary 0.00000000 0.50000000 0.00000000
## Prefer not to answer 0.33333333 0.22222222 0.00000000
## Sum 0.56294847 1.12230920 0.31180691
##
## 20 - 24 years 25 - 29 years 30 or more years
## Female 0.12785388 0.10045662 0.14611872
## Male 0.12698413 0.09523810 0.09523810
## Non-binary 0.00000000 0.25000000 0.00000000
## Prefer not to answer 0.11111111 0.22222222 0.11111111
## Sum 0.36594912 0.66791694 0.35246793
##
## 5 - 9 years Sum
## Female 0.22374429 1.00000000
## Male 0.14285714 1.00000000
## Non-binary 0.25000000 1.00000000
## Prefer not to answer 0.00000000 1.00000000
## Sum 0.61660144 4.00000000
addmargins(prop.table(table_gender_years, margin=2))
##
## 0 - 4 years 10 - 14 years 15 - 19 years
## Female 0.61290323 0.61538462 0.80434783
## Male 0.29032258 0.30769231 0.19565217
## Non-binary 0.00000000 0.03846154 0.00000000
## Prefer not to answer 0.09677419 0.03846154 0.00000000
## Sum 1.00000000 1.00000000 1.00000000
##
## 20 - 24 years 25 - 29 years 30 or more years
## Female 0.75675676 0.70967742 0.82051282
## Male 0.21621622 0.19354839 0.15384615
## Non-binary 0.00000000 0.03225806 0.00000000
## Prefer not to answer 0.02702703 0.06451613 0.02564103
## Sum 1.00000000 1.00000000 1.00000000
##
## 5 - 9 years Sum
## Female 0.83050847 5.15009114
## Male 0.15254237 1.50982019
## Non-binary 0.01694915 0.08766876
## Prefer not to answer 0.00000000 0.25241991
## Sum 1.00000000 7.00000000
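The effect of the `margin` argument can be checked on a small made-up table. The object `toy_counts` and its values are hypothetical and are not survey data. In this sketch, `margin=1` rescales each row to sum to 1, and `margin=2` rescales each column to sum to 1.

```r
# Hypothetical 2 x 3 table of counts, not taken from the survey
# (matrix() fills column by column: Low = 10, 30; Mid = 20, 20; High = 10, 10)
toy_counts <- as.table(matrix(
  c(10, 30,
    20, 20,
    10, 10),
  nrow = 2,
  dimnames = list(group = c("A", "B"), answer = c("Low", "Mid", "High"))
))

# Row proportions: the distribution of answers within each group
row_props <- prop.table(toy_counts, margin = 1)

# Column proportions: the distribution of groups within each answer
col_props <- prop.table(toy_counts, margin = 2)

print(round(row_props, 3))
print(rowSums(row_props))
print(round(col_props, 3))
print(colSums(col_props))
```

This is why, in the output above, only the Sum column is meaningful when `margin=1` and only the Sum row is meaningful when `margin=2`.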
####2. Barplots
A bar chart displays the distribution of a categorical variable, showing the counts or proportions for each category next to each other for easy comparison.
Bar charts should have small spaces between the bars to indicate that these are freestanding bars that could be rearranged into any order. The bars should also be the same width, so their heights determine their areas, and the areas are proportional to the counts in each class. This convention will help you satisfy the “area principle”, which says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.
Don’t violate the area principle. This is probably the most common mistake in a graphical display.
# Basic barplot
library(ggplot2)
# Refer to columns by name inside aes(); ggplot() looks them up in data_resilience
g <- ggplot(data_resilience, aes(x = `How many years have you been working in education?`)) + geom_bar()
g
# Horizontal bar plot
g + coord_flip()
#stacked bar plot (notice the fill argument that was added)
gy <- ggplot(data_resilience, aes(x = `How many years have you been working in education?`)) +
geom_bar(aes(fill = `Please indicate your gender.`)) +
theme(legend.position = "top")
gy
#Side-by-side bar chart (notice the position argument that was added)
gy_s <- ggplot(data_resilience, aes(x = `How many years have you been working in education?`)) +
geom_bar(aes(fill = `Please indicate your gender.`), position = position_dodge()) +
theme(legend.position = "top")
gy_s
#relative frequency bar chart (notice the y= argument; after_stat() needs ggplot2 3.3.0 or later)
gy_r <- ggplot(data_resilience, aes(x = `How many years have you been working in education?`)) +
geom_bar(aes(y = after_stat(count) / sum(after_stat(count)), fill = `Please indicate your gender.`)) +
theme(legend.position = "top") + ylab("Proportion of Respondents")
gy_r
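A related display, sketched here on the assumption that ggplot2 is installed, uses `position = "fill"` so that every bar is rescaled to height 1 and shows the distribution within its own category. The data frame `toy` and its columns are made up for illustration and are not survey data.

```r
# Hypothetical toy data, not taken from the survey
toy <- data.frame(
  years  = c("0 - 4", "0 - 4", "0 - 4", "5 - 9", "5 - 9", "5 - 9", "5 - 9"),
  gender = c("Female", "Male", "Female", "Female", "Female", "Male", "Female")
)

# The proportions that the filled bars display (each row sums to 1)
toy_props <- prop.table(table(toy$years, toy$gender), margin = 1)
print(round(toy_props, 3))

# Draw the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_fill <- ggplot(toy, aes(x = years, fill = gender)) +
    geom_bar(position = "fill") +   # rescale each bar to height 1
    ylab("Proportion within category") +
    theme(legend.position = "top")
  print(p_fill)
}
```

Because every bar has the same height, this form makes it easier to compare the shares across categories, at the cost of hiding how many respondents each bar represents.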
####3. Piecharts
Before you make a bar chart or a pie chart, always check the Categorical Data Condition: The data are counts or percentages of individuals in categories.
If you want to make a relative frequency bar chart or a pie chart, you’ll need to also make sure that the categories don’t overlap so that no individual is counted twice. If the categories do overlap, you can still make a bar chart, but the percentages won’t add up to 100%.
To make a pie chart, we will first store the frequency or contingency table as a data frame, and then build the pie chart from that table rather than from the raw data.
#store table as dataframe
dftg<-data.frame(table_gender_rel)
#pie chart of gender
ggplot(dftg, aes(x="", y=Freq, fill=Var1)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) +theme_void()
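It may help to look at what the data-frame conversion produces, since the plotting code refers to columns named `Var1` and `Freq`. The sketch below uses a made-up data frame `toy` (hypothetical, not survey data). When the table is built from an expression such as `toy$answers`, the category column is expected to get the default name `Var1`.

```r
# Hypothetical responses, not taken from the survey
toy <- data.frame(answers = c("Yes", "No", "Yes", "Yes"))

# A relative frequency table and its data-frame form
rel_table <- prop.table(table(toy$answers))
rel_df    <- data.frame(rel_table)

# One row per category: the label in Var1 and the proportion in Freq
print(rel_df)
print(names(rel_df))
```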
#store table as dataframe
dftgy<-data.frame(prop.table(table_gender_years, margin=1))
#pie charts of years worked in education, one panel per gender
ggplot(dftgy, aes(x="", y=Freq, fill=Var2)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) +
facet_grid(. ~ Var1) +
theme_void()
Below is the contingency table for years in education v. self-reported “positive self-perception” levels:
table(data_resilience$`How many years have you been working in education?`, data_resilience$`Positive Self-Perception`, useNA="no")
##
## 1 2 3 4 5
## 0 - 4 years 1 5 6 14 5
## 10 - 14 years 2 3 14 18 15
## 15 - 19 years 1 4 13 19 9
## 20 - 24 years 1 1 8 17 10
## 25 - 29 years 0 3 3 16 9
## 30 or more years 0 0 7 19 13
## 5 - 9 years 1 11 12 21 13
Note the segmented bar chart representation of the above table.
y_emp_table <- table(data_resilience$`How many years have you been working in education?`, data_resilience$`Positive Self-Perception`)
y_init<-ggplot(as.data.frame(y_emp_table)) +
geom_bar(aes(y = Freq, fill=Var2, x=Var1), stat="identity") +
scale_x_discrete(limits=c("0 - 4 years", "5 - 9 years", "10 - 14 years", "15 - 19 years", "20 - 24 years", "25 - 29 years", "30 or more years", NA)) +
theme(legend.position = "top")
y_emp_tablep <- as.data.frame(prop.table(y_emp_table, margin=1))
y_emp_tablep
## Var1 Var2 Freq
## 1 0 - 4 years 1 0.03225806
## 2 10 - 14 years 1 0.03846154
## 3 15 - 19 years 1 0.02173913
## 4 20 - 24 years 1 0.02702703
## 5 25 - 29 years 1 0.00000000
## 6 30 or more years 1 0.00000000
## 7 5 - 9 years 1 0.01724138
## 8 0 - 4 years 2 0.16129032
## 9 10 - 14 years 2 0.05769231
## 10 15 - 19 years 2 0.08695652
## 11 20 - 24 years 2 0.02702703
## 12 25 - 29 years 2 0.09677419
## 13 30 or more years 2 0.00000000
## 14 5 - 9 years 2 0.18965517
## 15 0 - 4 years 3 0.19354839
## 16 10 - 14 years 3 0.26923077
## 17 15 - 19 years 3 0.28260870
## 18 20 - 24 years 3 0.21621622
## 19 25 - 29 years 3 0.09677419
## 20 30 or more years 3 0.17948718
## 21 5 - 9 years 3 0.20689655
## 22 0 - 4 years 4 0.45161290
## 23 10 - 14 years 4 0.34615385
## 24 15 - 19 years 4 0.41304348
## 25 20 - 24 years 4 0.45945946
## 26 25 - 29 years 4 0.51612903
## 27 30 or more years 4 0.48717949
## 28 5 - 9 years 4 0.36206897
## 29 0 - 4 years 5 0.16129032
## 30 10 - 14 years 5 0.28846154
## 31 15 - 19 years 5 0.19565217
## 32 20 - 24 years 5 0.27027027
## 33 25 - 29 years 5 0.29032258
## 34 30 or more years 5 0.33333333
## 35 5 - 9 years 5 0.22413793
y_emp<-ggplot(y_emp_tablep) +
geom_bar(aes(y = Freq, fill=Var2, x=Var1), stat="identity") +
scale_x_discrete(limits=c("0 - 4 years", "5 - 9 years", "10 - 14 years", "15 - 19 years", "20 - 24 years", "25 - 29 years", "30 or more years", NA)) +
theme(legend.position = "top")
y_init
y_emp
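An alternative to listing the category order in `scale_x_discrete()` for each plot is to fix the order of the factor levels once. The sketch below shows the idea on a short made-up vector `years`. The labels mirror the survey categories, but the values themselves are hypothetical.

```r
# Hypothetical responses that reuse three of the survey's category labels
years <- c("5 - 9 years", "0 - 4 years", "10 - 14 years", "5 - 9 years")

# Default factor levels are sorted as text, so "10 - 14" comes before "5 - 9"
default_levels <- levels(factor(years))

# Supplying levels fixes the natural order once, for tables and plots alike
year_order    <- c("0 - 4 years", "5 - 9 years", "10 - 14 years")
years_ordered <- factor(years, levels = year_order)

print(default_levels)
print(table(years_ordered))
```

Tables follow the level order of a factor, and ggplot2 discrete axes do the same, so a column reordered this way would not need `limits` on every plot.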
In every subgroup, the respondents who rate themselves “4” outnumber those in any other single category. Furthermore, in every subgroup more than 50% of respondents rate themselves 4 or higher, indicating possible recall bias, although respondents with fewer years in education (0 - 4 and 5 - 9 years) show a disproportionately large share of low (1-2) ratings. Visually, the relative-frequency segmented bar chart seems to indicate that “positive self-perception” increases with the number of years worked in education. We aim to test this hypothesis more formally.
For this section, we plot the distribution of “positive self-perception” responses within each category of the “years in education” question. We want to determine whether there is an association between positive self-perception responses and the amount of time in education.
In order to do this, we can use a chi-squared test after grouping the responses between 1 and 3 together. We do this to ensure that the expected count in each cell is large enough (at least 5). Thus, we build the following table, in which the column lw holds the combined 1-3 responses:
e_psp <- table(data_resilience$`How many years have you been working in education?`, data_resilience$`Positive Self-Perception`, useNA="no")
lw <- e_psp[,1]+e_psp[,2]+e_psp[,3]
e_psp2 <- cbind(e_psp,lw)
e_psp3 <- e_psp2[,4:6]
e_psp3
## 4 5 lw
## 0 - 4 years 14 5 12
## 10 - 14 years 18 15 19
## 15 - 19 years 19 9 18
## 20 - 24 years 17 10 10
## 25 - 29 years 16 9 6
## 30 or more years 19 13 7
## 5 - 9 years 21 13 24
We can find the expected counts for the table above. Under the null hypothesis of independence, the expected count in each cell is the row total multiplied by the column total, divided by the grand total, so every row has the same expected distribution across the columns.
chisq <- chisq.test(e_psp3)
chisq$expected
## 4 5 lw
## 0 - 4 years 13.07483 7.802721 10.12245
## 10 - 14 years 21.93197 13.088435 16.97959
## 15 - 19 years 19.40136 11.578231 15.02041
## 20 - 24 years 15.60544 9.312925 12.08163
## 25 - 29 years 13.07483 7.802721 10.12245
## 30 or more years 16.44898 9.816327 12.73469
## 5 - 9 years 24.46259 14.598639 18.93878
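The expected counts above can be reproduced by hand. The sketch below retypes the observed counts from the grouped table into a matrix (the names `observed` and `expected_by_hand` are introduced here for illustration) and applies the row total times column total over grand total rule.

```r
# Observed counts copied from the grouped table above (columns: 4, 5, lw)
observed <- matrix(
  c(14,  5, 12,
    18, 15, 19,
    19,  9, 18,
    17, 10, 10,
    16,  9,  6,
    19, 13,  7,
    21, 13, 24),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("0 - 4 years", "10 - 14 years", "15 - 19 years", "20 - 24 years",
      "25 - 29 years", "30 or more years", "5 - 9 years"),
    c("4", "5", "lw")
  )
)

# Expected count = (row total * column total) / grand total
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(round(expected_by_hand, 3))

# Check the usual condition that every expected count is at least 5
print(all(expected_by_hand >= 5))
```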
#contribution of each cell to the chi-squared statistic
(chisq$observed - chisq$expected)^2 / chisq$expected
## 4 5 lw
## 0 - 4 years 0.065464687 1.00673155 0.3482554
## 10 - 14 years 0.704925643 0.27918381 0.2404091
## 15 - 19 years 0.008303041 0.57411848 0.5910603
## 20 - 24 years 0.124622648 0.05068996 0.3586597
## 25 - 29 years 0.654434511 0.18371499 1.6789006
## 30 or more years 0.395629716 1.03254275 2.5824503
## 5 - 9 years 0.490115624 0.17506070 1.3525686
chisq
##
## Pearson's Chi-squared test
##
## data: e_psp3
## X-squared = 12.898, df = 12, p-value = 0.3765
Note that the p-value returned by the chi-squared test is 0.3765, so the result is not statistically significant at the conventional 0.05 level. This means that we fail to reject the null hypothesis (that the distribution of responses for positive self-perception is independent of the number of years worked in education).
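As a check on the test output, the statistic, degrees of freedom, and p-value can be recomputed step by step. This is a minimal sketch that retypes the observed counts from the grouped table. The names `observed`, `expected`, `stat`, `dof`, and `p_value` are introduced here for illustration.

```r
# Observed counts copied from the grouped table (columns: 4, 5, lw)
observed <- matrix(
  c(14,  5, 12,
    18, 15, 19,
    19,  9, 18,
    17, 10, 10,
    16,  9,  6,
    19, 13,  7,
    21, 13, 24),
  ncol = 3, byrow = TRUE
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells
stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(stat, df = dof, lower.tail = FALSE)

print(round(c(statistic = stat, df = dof, p_value = p_value), 4))
```

The results should agree with the statistic, degrees of freedom, and p-value reported by `chisq.test()` above.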