####Table of Contents

Deliverable, Learning Objectives, Resources
The Data

####Deliverable

Choose variables to analyze and submit at least one visualization and description on Canvas. Use the step-by-step example on page 31 of your textbook as a model. You should include a table and either a bar plot or a pie chart with an accompanying description.

The following document will walk through an example analysis of the variables gender and years worked in education, but you are encouraged to explore different variables for your submission.

####Learning Objectives

Here are some of the skills you should be familiar with in analyzing categorical data. This file will provide you with the technical resources to make these visualizations. While interpretation is not a focus of this document, make sure you are able to interpret the meaning of your graphical displays as well. Use Chapter 3 as a resource.

Be able to recognize when a variable is categorical and choose an appropriate display for it.
Understand how to examine the association between categorical variables by comparing conditional and marginal percentages.
Be able to summarize the distribution of a categorical variable with a frequency table.
Be able to display the distribution of a categorical variable with a bar chart or pie chart.
Know how to make and examine a contingency table.
Know how to make and examine displays of the conditional distributions of one variable for two or more groups.
Be able to describe and discuss patterns found in a contingency table and associated displays of conditional distributions.

Resources

Chapter 3 of your textbook

R markdown help

We will also be using the package ggplot2 to visualize our data. Here are resources for bar plots (Resource 1, Resource 2) and pie charts (Resource 1).

The data

How is the data stored?

str(data_resilience)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 299 obs. of  34 variables:
##  $ Timestamp                                            : chr  "8/24/19 9:56" "8/24/19 14:09" "8/29/19 13:57" "9/4/19 17:35" ...
##  $ Please indicate your age range.                      : chr  "45 - 49" "35 - 39" "50 - 54" "50 - 54" ...
##  $ Please indicate your gender.                         : chr  "Female" "Female" "Female" "Female" ...
##  $ Please indicate your race.                           : chr  "White" "White" "White" "White" ...
##  $ How many years have you been working in education?   : chr  "25 - 29 years" "5 - 9 years" "25 - 29 years" "20 - 24 years" ...
##  $ Which option best describes your role in your school?: chr  "Administrator" "Administrator" "Administrator" "Administrator" ...
##  $ In what division do you work?                        : chr  "Upper School" "Upper School" "Upper School" "Upper School" ...
##  $ In what state is your school located?                : chr  "DE" "AZ" "MS" "NY" ...
##  $ How would you describe your school?                  : num  2 3 3 3 4 4 3 4 4 3 ...
##  $ How much autonomy do you have in your job?           : num  3 4 4 4 4 4 4 5 4 3 ...
##  $ Know Yourself                                        : num  5 4 4 4 5 5 4 4 5 5 ...
##  $ Understand Emotions                                  : num  5 4 4 4 5 4 4 3 5 4 ...
##  $ Tell Empowering Stories                              : num  4 3 3 3 4 4 4 4 5 4 ...
##  $ Build Community                                      : num  4 3 5 4 3 5 3 4 5 3 ...
##  $ Be Here Now                                          : num  3 2 4 3 3 5 4 3 5 4 ...
##  $ Take Care of Yourself                                : num  3 3 2 5 4 2 3 3 4 5 ...
##  $ Focus on the Bright Spots                            : num  3 3 4 3 2 3 4 4 4 5 ...
##  $ Cultivate Compassion                                 : num  3 4 4 4 2 5 3 4 5 4 ...
##  $ Be a Learner                                         : num  3 4 4 3 3 5 4 4 4 4 ...
##  $ Play and Create                                      : num  3 3 5 3 3 5 4 4 5 5 ...
##  $ Ride the Waves of Change                             : num  4 3 4 4 4 5 4 5 4 4 ...
##  $ Celebrate and Appreciate                             : num  4 3 5 5 4 4 4 5 5 4 ...
##  $ Purposefulness                                       : num  5 4 4 4 4 4 4 4 5 4 ...
##  $ Acceptance                                           : num  4 3 3 4 5 3 4 5 5 4 ...
##  $ Optimism                                             : num  4 3 4 4 3 4 4 4 5 4 ...
##  $ Empathy                                              : num  3 5 4 4 5 5 4 4 5 5 ...
##  $ Humor                                                : num  4 5 4 2 4 5 4 5 4 5 ...
##  $ Positive Self-Perception                             : num  2 3 4 4 5 3 3 4 4 3 ...
##  $ Empowerment                                          : num  3 4 5 4 5 5 4 4 4 4 ...
##  $ Perspective                                          : num  3 5 4 5 5 5 4 4 4 4 ...
##  $ Curiosity                                            : num  4 4 4 3 4 5 4 5 4 4 ...
##  $ Courage                                              : num  4 4 4 4 5 4 4 5 4 3 ...
##  $ Perseverance                                         : num  4 5 4 4 5 5 3 5 4 4 ...
##  $ Trust                                                : num  3 3 4 4 5 5 3 4 5 3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Timestamp = col_character(),
##   ..   `Please indicate your age range.` = col_character(),
##   ..   `Please indicate your gender.` = col_character(),
##   ..   `Please indicate your race.` = col_character(),
##   ..   `How many years have you been working in education?` = col_character(),
##   ..   `Which option best describes your role in your school?` = col_character(),
##   ..   `In what division do you work?` = col_character(),
##   ..   `In what state is your school located?` = col_character(),
##   ..   `How would you describe your school?` = col_double(),
##   ..   `How much autonomy do you have in your job?` = col_double(),
##   ..   `Know Yourself` = col_double(),
##   ..   `Understand Emotions` = col_double(),
##   ..   `Tell Empowering Stories` = col_double(),
##   ..   `Build Community` = col_double(),
##   ..   `Be Here Now` = col_double(),
##   ..   `Take Care of Yourself` = col_double(),
##   ..   `Focus on the Bright Spots` = col_double(),
##   ..   `Cultivate Compassion` = col_double(),
##   ..   `Be a Learner` = col_double(),
##   ..   `Play and Create` = col_double(),
##   ..   `Ride the Waves of Change` = col_double(),
##   ..   `Celebrate and Appreciate` = col_double(),
##   ..   Purposefulness = col_double(),
##   ..   Acceptance = col_double(),
##   ..   Optimism = col_double(),
##   ..   Empathy = col_double(),
##   ..   Humor = col_double(),
##   ..   `Positive Self-Perception` = col_double(),
##   ..   Empowerment = col_double(),
##   ..   Perspective = col_double(),
##   ..   Curiosity = col_double(),
##   ..   Courage = col_double(),
##   ..   Perseverance = col_double(),
##   ..   Trust = col_double()
##   .. )

Because all of our data is categorical (even those variables that have numbers for each observation), we need to change all of the variables to “factors” in our dataset.

data_resilience[]<-lapply(data_resilience, factor)
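To see what this conversion does, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the survey data.

```r
# Toy data frame (hypothetical values, not the survey data)
toy <- data.frame(
  role  = c("Teacher", "Administrator", "Teacher"),
  trust = c(3, 5, 3),
  stringsAsFactors = FALSE
)

# Assigning into toy[] replaces every column but keeps the data frame structure
toy[] <- lapply(toy, factor)

# Every column is now a factor; the numeric ratings become category labels
str(toy)
levels(toy$trust)
```

The empty brackets matter: without them, `toy <- lapply(toy, factor)` would return a plain list rather than a data frame.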

####1. Tables

Let’s start by looking at how our variables are distributed across different categories. We can organize these counts into tables, which record the totals or percentages alongside the category names.

Here we will look at how the variable of gender is distributed. We will look at both a table of counts (frequencies) and a table of proportions (relative frequencies).

#frequency table (counts)
table_gender<-table(data_resilience$`Please indicate your gender.`)
table_gender

## 
##               Female                 Male           Non-binary 
##                  221                   64                    4 
## Prefer not to answer 
##                    9

#relative frequency table (proportions)
table_gender_rel<-prop.table(table_gender)
table_gender_rel

## 
##               Female                 Male           Non-binary 
##           0.74161074           0.21476510           0.01342282 
## Prefer not to answer 
##           0.03020134

#marginal distribution
addmargins(table_gender)

## 
##               Female                 Male           Non-binary 
##                  221                   64                    4 
## Prefer not to answer                  Sum 
##                    9                  298

What are the differences between these tables?
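One way to check how these tables relate is to rebuild them from the printed counts. The sketch below copies the four counts from the output above into a small table (the object names are chosen here for illustration) and converts the proportions into percentages.

```r
# Counts copied from the frequency table printed above
gender_counts <- as.table(c("Female" = 221, "Male" = 64,
                            "Non-binary" = 4, "Prefer not to answer" = 9))

# Relative frequencies: each count divided by the total
gender_props <- prop.table(gender_counts)

# The same information expressed as percentages, rounded to one decimal
gender_pct <- round(100 * gender_props, 1)
gender_pct

# The total is the extra "Sum" entry that addmargins() appends
total <- sum(gender_counts)
total
```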

Next we will create a contingency table comparing gender and years in education.

#contingency table
table_gender_years<-table(data_resilience$`Please indicate your gender.`, data_resilience$`How many years have you been working in education?`)
addmargins(table_gender_years)

##                       
##                        0 - 4 years 10 - 14 years 15 - 19 years
##   Female                        19            32            37
##   Male                           9            16             9
##   Non-binary                     0             2             0
##   Prefer not to answer           3             2             0
##   Sum                           31            52            46
##                       
##                        20 - 24 years 25 - 29 years 30 or more years
##   Female                          28            22               32
##   Male                             8             6                6
##   Non-binary                       0             1                0
##   Prefer not to answer             1             2                1
##   Sum                             37            31               39
##                       
##                        5 - 9 years Sum
##   Female                        49 219
##   Male                           9  63
##   Non-binary                     1   4
##   Prefer not to answer           0   9
##   Sum                           59 295

What is the difference between the following two tables?

addmargins(prop.table(table_gender_years, margin=1))

##                       
##                        0 - 4 years 10 - 14 years 15 - 19 years
##   Female                0.08675799    0.14611872    0.16894977
##   Male                  0.14285714    0.25396825    0.14285714
##   Non-binary            0.00000000    0.50000000    0.00000000
##   Prefer not to answer  0.33333333    0.22222222    0.00000000
##   Sum                   0.56294847    1.12230920    0.31180691
##                       
##                        20 - 24 years 25 - 29 years 30 or more years
##   Female                  0.12785388    0.10045662       0.14611872
##   Male                    0.12698413    0.09523810       0.09523810
##   Non-binary              0.00000000    0.25000000       0.00000000
##   Prefer not to answer    0.11111111    0.22222222       0.11111111
##   Sum                     0.36594912    0.66791694       0.35246793
##                       
##                        5 - 9 years        Sum
##   Female                0.22374429 1.00000000
##   Male                  0.14285714 1.00000000
##   Non-binary            0.25000000 1.00000000
##   Prefer not to answer  0.00000000 1.00000000
##   Sum                   0.61660144 4.00000000

addmargins(prop.table(table_gender_years, margin=2))

##                       
##                        0 - 4 years 10 - 14 years 15 - 19 years
##   Female                0.61290323    0.61538462    0.80434783
##   Male                  0.29032258    0.30769231    0.19565217
##   Non-binary            0.00000000    0.03846154    0.00000000
##   Prefer not to answer  0.09677419    0.03846154    0.00000000
##   Sum                   1.00000000    1.00000000    1.00000000
##                       
##                        20 - 24 years 25 - 29 years 30 or more years
##   Female                  0.75675676    0.70967742       0.82051282
##   Male                    0.21621622    0.19354839       0.15384615
##   Non-binary              0.00000000    0.03225806       0.00000000
##   Prefer not to answer    0.02702703    0.06451613       0.02564103
##   Sum                     1.00000000    1.00000000       1.00000000
##                       
##                        5 - 9 years        Sum
##   Female                0.83050847 5.15009114
##   Male                  0.15254237 1.50982019
##   Non-binary            0.01694915 0.08766876
##   Prefer not to answer  0.00000000 0.25241991
##   Sum                   1.00000000 7.00000000
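In the two tables above, `addmargins()` adds sums in both directions, so one of the "Sum" margins in each table is not a meaningful total. As a rough sketch, the `margin` argument of `addmargins()` can restrict the sums to the direction in which the proportions add to 1. The small table below uses only the Female and Male rows and two of the year columns from the contingency table, so its proportions differ from the full-table values.

```r
# A 2 x 2 subset of the counts printed in the contingency table above
sub <- as.table(matrix(c(19, 49,
                          9,  9),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(gender = c("Female", "Male"),
                                       years  = c("0 - 4 years", "5 - 9 years"))))

# Row-conditional proportions: each row sums to 1, so add only a Sum column
row_cond <- addmargins(prop.table(sub, margin = 1), margin = 2)
row_cond

# Column-conditional proportions: each column sums to 1, so add only a Sum row
col_cond <- addmargins(prop.table(sub, margin = 2), margin = 1)
col_cond
```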

####2. Barplots

A bar chart displays the distribution of a categorical variable, showing the counts or proportions for each category next to each other for easy comparison.

Bar charts should have small spaces between the bars to indicate that these are freestanding bars that could be rearranged into any order. The bars should also be the same width, so their heights determine their areas, and the areas are proportional to the counts in each class. This convention will help you satisfy the “area principle”, which says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.

Don’t violate the area principle. This is probably the most common mistake in a graphical display.

# Basic barplot (inside aes(), refer to columns by name rather than with data_resilience$)
library(ggplot2)
g <- ggplot(data_resilience, aes(x = `How many years have you been working in education?`)) + geom_bar()
g

# Horizontal bar plot
g + coord_flip()

#stacked bar plot (notice the fill argument that was added)
gy <- ggplot(data_resilience, aes(x = `How many years have you been working in education?`)) + geom_bar(aes(fill = `Please indicate your gender.`)) + theme(legend.position = "top")
gy

#Side-by-side bar chart (notice the position argument that was added)
gy_s <- ggplot(data_resilience, aes(x = `How many years have you been working in education?`)) + geom_bar(aes(fill = `Please indicate your gender.`), position = position_dodge()) + theme(legend.position = "top")
gy_s

#relative frequency bar chart (notice the y= argument; after_stat() replaces the older ..count.. notation and needs ggplot2 3.3.0 or later)
gy_r <- ggplot(data_resilience, aes(x = `How many years have you been working in education?`)) + geom_bar(aes(y = after_stat(count / sum(count)), fill = `Please indicate your gender.`)) + theme(legend.position = "top") + ylab("Proportion of Respondents")
gy_r
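By default a factor sorts its levels as text, which is why "5 - 9 years" appears after "30 or more years" in the tables and would also appear last on the bar chart axis. A possible fix, sketched here on a few made-up responses, is to set the level order explicitly. The vector `resp` is hypothetical; the level names are taken from the output above.

```r
# Intended order of the categories (names copied from the tables above)
years_levels <- c("0 - 4 years", "5 - 9 years", "10 - 14 years",
                  "15 - 19 years", "20 - 24 years", "25 - 29 years",
                  "30 or more years")

# A few made-up responses, only to illustrate the ordering
resp <- c("5 - 9 years", "10 - 14 years", "0 - 4 years", "5 - 9 years")

# Default levels are sorted as text, so "10 - 14 years" precedes "5 - 9 years"
default_order <- levels(factor(resp))
default_order

# Explicit levels give the natural order in tables and on plot axes
ordered_years <- factor(resp, levels = years_levels)
table(ordered_years)
```

Applying the same `factor(..., levels = years_levels)` call to the years column of `data_resilience` should reorder the bars in the charts above without changing any counts.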

####3. Piecharts

Before you make a bar chart or a pie chart, always check the Categorical Data Condition: The data are counts or percentages of individuals in categories.

If you want to make a relative frequency bar chart or a pie chart, you’ll need to also make sure that the categories don’t overlap so that no individual is counted twice. If the categories do overlap, you can still make a bar chart, but the percentages won’t add up to 100%.

To make a pie chart, we will first store the frequency or contingency table as a data frame, and then build the pie chart from that table instead of from the raw data.

#store table as dataframe
dftg<-data.frame(table_gender_rel)

#pie chart of gender
ggplot(dftg, aes(x="", y=Freq, fill=Var1)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) +theme_void()

#store table as dataframe
dftgy<-data.frame(prop.table(table_gender_years, margin=1))

#pie chart of gender and years worked in education
ggplot(dftgy, aes(x="", y=Freq, fill=Var2)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0) + facet_grid(. ~ Var1) + theme_void()

Extra analysis for in-class assignment

Below is the contingency table for years in education vs. self-reported “positive self-perception” levels:

table(data_resilience$`How many years have you been working in education?`, data_resilience$`Positive Self-Perception`, useNA="no")

##                   
##                     1  2  3  4  5
##   0 - 4 years       1  5  6 14  5
##   10 - 14 years     2  3 14 18 15
##   15 - 19 years     1  4 13 19  9
##   20 - 24 years     1  1  8 17 10
##   25 - 29 years     0  3  3 16  9
##   30 or more years  0  0  7 19 13
##   5 - 9 years       1 11 12 21 13

Note the segmented bar chart representation of the above table.

y_emp_table <- table(data_resilience$`How many years have you been working in education?`, data_resilience$`Positive Self-Perception`)

y_init<-ggplot(as.data.frame(y_emp_table)) + 
  geom_bar(aes(y = Freq, fill=Var2, x=Var1), stat="identity") + 
  scale_x_discrete(limits=c("0 - 4 years", "5 - 9 years", "10 - 14 years", "15 - 19 years", "20 - 24 years", "25 - 29 years", "30 or more years", NA)) +
  theme(legend.position = "top")

y_emp_tablep <- as.data.frame(prop.table(y_emp_table, margin=1))
y_emp_tablep

##                Var1 Var2       Freq
## 1       0 - 4 years    1 0.03225806
## 2     10 - 14 years    1 0.03846154
## 3     15 - 19 years    1 0.02173913
## 4     20 - 24 years    1 0.02702703
## 5     25 - 29 years    1 0.00000000
## 6  30 or more years    1 0.00000000
## 7       5 - 9 years    1 0.01724138
## 8       0 - 4 years    2 0.16129032
## 9     10 - 14 years    2 0.05769231
## 10    15 - 19 years    2 0.08695652
## 11    20 - 24 years    2 0.02702703
## 12    25 - 29 years    2 0.09677419
## 13 30 or more years    2 0.00000000
## 14      5 - 9 years    2 0.18965517
## 15      0 - 4 years    3 0.19354839
## 16    10 - 14 years    3 0.26923077
## 17    15 - 19 years    3 0.28260870
## 18    20 - 24 years    3 0.21621622
## 19    25 - 29 years    3 0.09677419
## 20 30 or more years    3 0.17948718
## 21      5 - 9 years    3 0.20689655
## 22      0 - 4 years    4 0.45161290
## 23    10 - 14 years    4 0.34615385
## 24    15 - 19 years    4 0.41304348
## 25    20 - 24 years    4 0.45945946
## 26    25 - 29 years    4 0.51612903
## 27 30 or more years    4 0.48717949
## 28      5 - 9 years    4 0.36206897
## 29      0 - 4 years    5 0.16129032
## 30    10 - 14 years    5 0.28846154
## 31    15 - 19 years    5 0.19565217
## 32    20 - 24 years    5 0.27027027
## 33    25 - 29 years    5 0.29032258
## 34 30 or more years    5 0.33333333
## 35      5 - 9 years    5 0.22413793

y_emp<-ggplot(y_emp_tablep) + 
  geom_bar(aes(y = Freq, fill=Var2, x=Var1), stat="identity") + 
  scale_x_discrete(limits=c("0 - 4 years", "5 - 9 years", "10 - 14 years", "15 - 19 years", "20 - 24 years", "25 - 29 years", "30 or more years", NA)) +
  theme(legend.position = "top")
y_init

y_emp

In every years-in-education group, more people rate themselves as “4” than as any other category. Furthermore, in every group more than 50% of respondents rate themselves as four or higher, indicating possible recall bias, although newer teachers give a disproportionately large number of lower (1-2) ratings of self-perception. Visually, the frequency-based segmented bar chart seems to indicate that as teachers teach for more years, their “level of positive self-perception” increases. We aim to test this hypothesis in a more comprehensive way.

For this section, we plot the distribution of “positive self-perception” responses for each category of the “years in education” question. We want to determine whether there is an association between positive self-perception responses and the amount of time in education.

In order to do this, we can use a chi-squared test after grouping the responses between 1 and 3 together. We do this to ensure that the expected count in each cell is high enough (the usual rule of thumb is an expected count of at least 5). Thus, we develop the following table:

e_psp <- table(data_resilience$`How many years have you been working in education?`, data_resilience$`Positive Self-Perception`, useNA="no")
lw <- e_psp[,1]+e_psp[,2]+e_psp[,3]
e_psp2 <- cbind(e_psp,lw)
e_psp3 <- e_psp2[,4:6]
e_psp3

##                   4  5 lw
## 0 - 4 years      14  5 12
## 10 - 14 years    18 15 19
## 15 - 19 years    19  9 18
## 20 - 24 years    17 10 10
## 25 - 29 years    16  9  6
## 30 or more years 19 13  7
## 5 - 9 years      21 13 24
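The grouping step can also be written with `rowSums()`. The sketch below rebuilds the 7 x 5 table of counts from the contingency table printed earlier (so it runs without the survey file) and collapses ratings 1 to 3 into one column. The object names `psp` and `grouped` and the column label "1-3" are chosen here for illustration.

```r
# Counts copied from the years vs. positive self-perception table above
years <- c("0 - 4 years", "10 - 14 years", "15 - 19 years", "20 - 24 years",
           "25 - 29 years", "30 or more years", "5 - 9 years")
psp <- matrix(c(1,  5,  6, 14,  5,
                2,  3, 14, 18, 15,
                1,  4, 13, 19,  9,
                1,  1,  8, 17, 10,
                0,  3,  3, 16,  9,
                0,  0,  7, 19, 13,
                1, 11, 12, 21, 13),
              nrow = 7, byrow = TRUE,
              dimnames = list(years, as.character(1:5)))

# Collapse ratings 1, 2 and 3 into a single low-rating column
grouped <- cbind("1-3" = rowSums(psp[, 1:3]), psp[, 4:5])
grouped
```

The "1-3" column should match the `lw` column above; only the column order differs, which does not affect the chi-squared test.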

Next we find the expected counts for the table above. Because the null hypothesis is independence, the expected count in each cell is its row total multiplied by its column total, divided by the grand total, so every row is expected to follow the same overall distribution of responses.

chisq <- chisq.test(e_psp3)
chisq$expected

##                         4         5       lw
## 0 - 4 years      13.07483  7.802721 10.12245
## 10 - 14 years    21.93197 13.088435 16.97959
## 15 - 19 years    19.40136 11.578231 15.02041
## 20 - 24 years    15.60544  9.312925 12.08163
## 25 - 29 years    13.07483  7.802721 10.12245
## 30 or more years 16.44898  9.816327 12.73469
## 5 - 9 years      24.46259 14.598639 18.93878
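The expected counts above can be reproduced by hand. This sketch copies the observed counts from the grouped table and applies the row total times column total over grand total rule; `obs`, `expected`, `stat` and `df` are names introduced here, not objects from the original analysis.

```r
# Observed counts copied from the grouped table (columns 4, 5, lw)
years <- c("0 - 4 years", "10 - 14 years", "15 - 19 years", "20 - 24 years",
           "25 - 29 years", "30 or more years", "5 - 9 years")
obs <- matrix(c(14,  5, 12,
                18, 15, 19,
                19,  9, 18,
                17, 10, 10,
                16,  9,  6,
                19, 13,  7,
                21, 13, 24),
              nrow = 7, byrow = TRUE,
              dimnames = list(years, c("4", "5", "lw")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 5)

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
stat <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
c(statistic = stat, df = df)
```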

#each cell's contribution to the chi-squared statistic
(chisq$observed - chisq$expected)^2 / chisq$expected

##                            4          5        lw
## 0 - 4 years      0.065464687 1.00673155 0.3482554
## 10 - 14 years    0.704925643 0.27918381 0.2404091
## 15 - 19 years    0.008303041 0.57411848 0.5910603
## 20 - 24 years    0.124622648 0.05068996 0.3586597
## 25 - 29 years    0.654434511 0.18371499 1.6789006
## 30 or more years 0.395629716 1.03254275 2.5824503
## 5 - 9 years      0.490115624 0.17506070 1.3525686

chisq

## 
##  Pearson's Chi-squared test
## 
## data:  e_psp3
## X-squared = 12.898, df = 12, p-value = 0.3765

Note that the p-value returned by the chi-squared test is 0.3765, well above conventional significance levels such as 0.05, so the result is not statistically significant. This means that we fail to reject the null hypothesis (that the distribution of responses for positive self-perception is independent of the number of years worked in education).
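As a quick check on where this p-value comes from, the sketch below plugs the reported statistic and degrees of freedom into the chi-squared distribution. The 0.05 significance level used for the critical value is an assumption made here for illustration; the analysis above does not state one.

```r
# Reported test statistic and degrees of freedom from the output above
x_squared <- 12.898
df <- 12

# p-value: probability of a statistic at least this large under independence
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
p_value

# Critical value at an assumed 0.05 significance level
crit <- qchisq(0.95, df = df)
crit

# The statistic falls below the critical value, so we fail to reject
x_squared < crit
```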

Categorical Data Visualization - Math 540
