2.1 Categorical Variables Activity

In this activity, you’ll be investigating the data from the StudentSurvey data set provided by your book. We will be using this data set to learn how to summarize and visualize categorical variables. The R guides on Moodle under Section 2.1 will be VERY helpful.

Getting Started

There are two lines of code below. The first line loads all the data sets from the book. The second line loads the package that we will be using to make our graphs. Usually you will need to write these lines yourself, but since we are just starting to use RStudio I have provided the code for you. Look at the code below so you know how to do it on your own in the next activity.

library(Lock5Data) # Loads data sets from the book
library(ggplot2) # Loads the graphing package

The console operates separately from your document, so you will need to load the data sets and graphing package into your console. Remember that the console is located at the bottom of RStudio and is where you test your code. In the console, type library(Lock5Data) and press Enter. The cursor should advance to the next line with NO output. The StudentSurvey data set should now be loaded in both your document and your console. Similarly, in the console, type library(ggplot2) and press Enter. The cursor should advance to the next line with NO output. The graphing package should now also be loaded into your console.

View the Data

Once the data is loaded into the console, you may view it in a spreadsheet by typing View(StudentSurvey) in the console and pressing Enter. If you did this correctly, a tab containing all the cases and variables should open automatically. DO NOT use View() in your document, as your document will not render correctly.

You can learn more about this dataset by typing ?StudentSurvey into the console. This will bring up a help page in the lower right window of RStudio. This particular help page lists all the variables and their types, and gives a brief explanation of the variables and the dataset.

1. (5 pts.) In general, the variables in this data set have descriptive names. Use the information provided to you from the View(StudentSurvey) output and the ?StudentSurvey help page to list the variables in this dataset. Complete the statements below. I have done the first one for you.

ANSWER:
Gender is a \(\underline{categorical}\) variable.
Height is a \(\underline{quantitative}\) variable.
VerbalSAT is a \(\underline{quantitative}\) variable.
GPA is a \(\underline{quantitative}\) variable.
Exercise is a \(\underline{quantitative}\) variable.
Award is a \(\underline{categorical}\) variable.

One Categorical Variable:

2. (3 pts.) Notice in the spreadsheet there is a column titled Gender. Gender is a categorical variable. Suppose we wanted to know how many cases are in each category. One way to display the responses of a single categorical variable is to use a frequency table. The code below constructs and displays a frequency table for Gender. The first line of code constructs a table labeled “Gender Table” using the Gender variable from the StudentSurvey data set. The table is stored in a variable called table_1. The second line of code displays the table.

table_1 <- table(StudentSurvey$Gender, dnn="Gender Table") # Creates a table
table_1 # Displays the table

## Gender Table
##   F   M 
## 169 193
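As a quick check on what table() is doing, here is a minimal sketch that does not need the Lock5Data package. The vector `gender` is a stand-in rebuilt from the counts printed above, not the real StudentSurvey$Gender column, and the name `gender_table` is only for illustration.

```r
# Stand-in for StudentSurvey$Gender, rebuilt from the counts printed above
gender <- rep(c("F", "M"), times = c(169, 193))

# table() counts how many cases fall in each category
gender_table <- table(gender, dnn = "Gender Table")
gender_table

# The category counts add up to the number of cases
sum(gender_table)
```

The total of the counts should match the number of rows you see in View(StudentSurvey).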

Now it's your turn. The Award variable contains responses to the question “Which award would you prefer to win? Academy, Nobel, or Olympic?” Create a frequency table to display the frequency of Award. You should save your table as table_2 and give it a label of “Award Table”. Construct and display your table in the code chunk below.

table_2 <- table(StudentSurvey$Award, dnn="Award Preference")
table_2

## Award Preference
## Academy   Nobel Olympic 
##      31     149     182

3. (1 pt.) Add the total to table_2. The RStudio help page for frequency tables has an example of how to do this. Remember that you have already created a table called table_2, so you do not need to create a new table.

addmargins(table_2) # Adds a Sum entry to the existing table_2

## Award Preference
## Academy   Nobel Olympic     Sum 
##      31     149     182     362
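The sketch below shows the same idea on a table typed in by hand, so it runs without the data set. The counts are copied from the output above; the names `award_counts` and `award_with_total` are made up for this example.

```r
# Award counts copied from the frequency table above
award_counts <- as.table(c(Academy = 31, Nobel = 149, Olympic = 182))

# addmargins() appends a "Sum" entry holding the total of all categories
award_with_total <- addmargins(award_counts)
award_with_total
```

addmargins() returns a new table and leaves the original unchanged, so store the result if you want to reuse it.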

4. (1 pt.) Suppose you wanted a relative frequency table instead of a frequency table. Again, use the RStudio help page for frequency tables and create a relative frequency table using table_2.

prop.table(table_2) # Divides each count in table_2 by the total

## Award Preference
##    Academy      Nobel    Olympic 
## 0.08563536 0.41160221 0.50276243
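A relative frequency is just a count divided by the total. The sketch below does that arithmetic by hand, using the Award counts from earlier in this activity, and compares it with prop.table(). The variable names are illustrative only.

```r
# Award counts copied from the frequency table earlier in this activity
award_counts <- c(Academy = 31, Nobel = 149, Olympic = 182)

# Relative frequency = count / total number of cases
award_props <- award_counts / sum(award_counts)
round(award_props, 3)

# prop.table() performs the same division
same_as_prop_table <- all.equal(
  as.vector(prop.table(as.table(award_counts))),
  as.vector(award_props)
)
same_as_prop_table

# Relative frequencies always add up to 1
sum(award_props)
```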

5. (2 pts.) Why might you choose to use a relative frequency table over a frequency table?

ANSWER: A relative frequency table shows the proportion of responses in each category rather than the raw count. This makes it easier to see how common each answer is relative to the whole and to compare groups of different sizes.
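To see why group size matters, here is a small sketch using counts that appear in the two-way table later in this activity: 76 of the 169 females and 73 of the 193 males prefer the Nobel Prize. The variable names are hypothetical.

```r
# Number of students preferring the Nobel Prize, by gender
nobel_counts <- c(F = 76, M = 73)

# Total number of students of each gender in the survey
group_sizes <- c(F = 169, M = 193)

# The raw counts look nearly equal...
nobel_counts

# ...but the proportions within each group are noticeably different
nobel_props <- nobel_counts / group_sizes
round(nobel_props, 3)
```

Because there are more males than females in the survey, similar counts correspond to different shares of each group.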

Below I have provided code to create and display a bar chart. This particular bar chart displays the distribution of Gender. Read through the comments and code to get an idea of how plots are constructed. You can find additional examples on the RStudio help page for bar charts.

ggplot(StudentSurvey, aes(x = Gender)) + # Create a plot using the Gender variable from the StudentSurvey dataset
  geom_bar(fill = "midnightblue") + # Create a bar chart with blue bars
  ggtitle("Distribution of Gender") + # Create a title
  labs(caption = "Data from the StudentSurvey dataset") + # adds a caption to the plot
  ylab("") # remove y axis label

6. (4 pts.) In the chunk below, make a bar chart to display the distribution of Award. Make sure your bar chart has a useful title and labels.

ggplot(StudentSurvey, aes(x = Award)) +
  geom_bar(fill = "midnightblue") +
  ggtitle("Distribution of Award Preference") +
  labs(caption = "Data from the StudentSurvey dataset") +
  ylab("")

7. (4 pts.) In the chunk below, make a pie chart to display the distribution of Award. Use the RStudio help page for pie charts as a guide. Make sure your pie chart has a useful title and labels.

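award_colors <- c("seagreen", "seagreen1", "seagreen3") # One color per Award category, shared by the pie and the legend
pie(table(StudentSurvey$Award), col = award_colors, main = "Distribution of Award Preference", labels = c("", "", "")) # Slice labels are blank because the legend identifies the slices

legend("right", legend = levels(as.factor(StudentSurvey$Award)), fill = award_colors, title = "Award", box.lty = 0)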

Two Categorical Variables:

Two-way tables, or contingency tables as they are sometimes called, show the relationship between two categorical variables. The categories for one variable are listed down the side (rows) and the categories for the second variable are listed across the top (columns). Each cell of the table contains the count of the number of cases that are in both the row and column categories.

8. (5 pts.) In the chunk below, make and display a two-way table of Gender and Award. Use the RStudio help page for Contingency Tables as a guide. Your table should have the following characteristics:

Your table should be saved in a variable called twoway_tab.
The Gender variable should be displayed across the top of your table, horizontally, and the Award variable should be displayed along the side of your table, vertically.
Your table should be labeled.

twoway_tab <- table(StudentSurvey$Award, StudentSurvey$Gender, dnn = c("Award Preference", "Gender")) # Rows are Award, columns are Gender
addmargins(twoway_tab) # Displays the table with row and column totals

##                 Gender
## Award Preference   F   M Sum
##          Academy  20  11  31
##          Nobel    76  73 149
##          Olympic  73 109 182
##          Sum     169 193 362
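If you want to experiment with a two-way table without loading the data set, the cell counts above can be typed in directly. This is only a sketch: the name `award_by_gender` is made up, and the counts are copied from the output above.

```r
# Cell counts copied from the two-way table above, entered row by row
counts <- matrix(
  c(20,  11,
    76,  73,
    73, 109),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    "Award Preference" = c("Academy", "Nobel", "Olympic"),
    Gender = c("F", "M")
  )
)

# Convert the matrix to a table so table functions such as addmargins() apply
award_by_gender <- as.table(counts)
addmargins(award_by_gender)

# Row totals give the one-variable Award table; column totals give Gender
rowSums(award_by_gender)
colSums(award_by_gender)
```

The row and column totals reproduce the one-variable frequency tables from earlier in the activity.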

9. (2 pts.) Use your contingency table and the prop.table function to display a two-way table of Gender and Award where the values in the table are the proportions of Award within each level of Gender. If you do this correctly each column in the table should sum to 1.

prop.table(twoway_tab, margin = 2) # margin = 2 divides each cell by its column (Gender) total

##                 Gender
## Award Preference          F          M
##          Academy 0.11834320 0.05699482
##          Nobel   0.44970414 0.37823834
##          Olympic 0.43195266 0.56476684

10. (2 pts.) Use your contingency table and the prop.table function to display a two-way table of Gender and Award where the values in the table are the proportions of Gender within each level of Award. If you do this correctly each row in the table should sum to 1.

prop.table(twoway_tab, margin = 1) # margin = 1 divides each cell by its row (Award) total

##                 Gender
## Award Preference         F         M
##          Academy 0.6451613 0.3548387
##          Nobel   0.5100671 0.4899329
##          Olympic 0.4010989 0.5989011
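The margin argument is easy to mix up, so a quick way to check which one you used is to add up the rows and columns of the result. The sketch below assumes the same cell counts as the two-way table in this activity; the object names are illustrative.

```r
# Cell counts from the two-way table, entered column by column (F, then M)
award_by_gender <- as.table(matrix(
  c(20, 76, 73, 11, 73, 109),
  nrow = 3,
  dimnames = list(Award = c("Academy", "Nobel", "Olympic"),
                  Gender = c("F", "M"))
))

# margin = 2: proportions of Award within each Gender (columns sum to 1)
col_props <- prop.table(award_by_gender, margin = 2)
colSums(col_props)

# margin = 1: proportions of Gender within each Award (rows sum to 1)
row_props <- prop.table(award_by_gender, margin = 1)
rowSums(row_props)
```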

The tables you created in questions 9 and 10 are conditional tables. These tables assume that you know the specific value of one of the variables. For instance, the table you created in question 10 assumes you know an individual's Award preference and allows you to see whether the distribution of Gender differs across Award preferences. The tables you created in questions 8, 9, and 10 are very similar, but they are used to answer very different questions.

Use the tables from questions 8, 9, and 10 to answer the following questions:

11. (1 pt.) Are there more males or females in the survey?

ANSWER: Males

12. (1 pt.) How many individuals prefer to win the Nobel Prize?

ANSWER: 149

13. (1 pt.) What proportion of females prefer the Nobel Prize?

ANSWER: 0.4497041 (76 of the 169 females)

14. (1 pt.) What proportion of students that prefer the Nobel Prize are females?

ANSWER: 0.5100671 (76 of the 149 students who prefer the Nobel Prize)

15. (1 pt.) Is an Olympic medal preference more common among males or females?

ANSWER: Males
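The two Nobel Prize questions above use the same cell of the table but different denominators. The sketch below works them out by indexing into a hand-typed copy of the two-way table; the object names are hypothetical.

```r
# Cell counts from the two-way table, entered column by column (F, then M)
award_by_gender <- as.table(matrix(
  c(20, 76, 73, 11, 73, 109),
  nrow = 3,
  dimnames = list(Award = c("Academy", "Nobel", "Olympic"),
                  Gender = c("F", "M"))
))

# Females who prefer the Nobel Prize
nobel_f <- award_by_gender["Nobel", "F"]

# Proportion of females who prefer the Nobel Prize: divide by all females
p_nobel_given_f <- nobel_f / sum(award_by_gender[, "F"])

# Proportion of Nobel Prize fans who are female: divide by all Nobel fans
p_f_given_nobel <- nobel_f / sum(award_by_gender["Nobel", ])

round(c(nobel_among_females = p_nobel_given_f,
        females_among_nobel = p_f_given_nobel), 4)
```

Reading the question for the phrase that names the group ("of females", "of students that prefer the Nobel Prize") tells you which total goes in the denominator.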

16. (4 pts.) Just as tables can be manipulated to address different questions, so can graphs. In the chunk below, make a stacked bar chart to visualize the relationship between Gender and Award. You can find examples of bar charts on the RStudio help page for bar charts. Your graph should have the following characteristics:

The Gender variable should be the explanatory variable and should be displayed along the horizontal axis.
The Award variable should be treated as the response variable and should be displayed on the vertical axis of your graph.
The graph should display frequencies.
Your graph should be labeled.

library(ggplot2)
StudentSurvey$Award <- factor(StudentSurvey$Award)  # Converts the Award variable into a factor
StudentSurvey$Gender <- factor(StudentSurvey$Gender)  # Converts the Gender variable into a factor

stacked.bar2 <- ggplot(data = StudentSurvey, aes(x = Gender, fill = Award)) +
  geom_bar(position = "stack") + # Stacked bars whose heights are frequencies (position = "fill" would show proportions instead)
  xlab("Gender") +
  ylab("Frequency") +
  ggtitle("Award Preference by Gender") +
  scale_fill_discrete(name = "Awards")
stacked.bar2

17. (4 pts.) Using the same criteria from question 16, create a side-by-side bar chart to display the relationship between Gender and Award.

library(ggplot2) #load ggplot2 library
StudentSurvey$Gender <- factor(StudentSurvey$Gender) # Create a categorical variable
StudentSurvey$Award <- factor(StudentSurvey$Award) # Create categorical variable

sbs <- ggplot(data = StudentSurvey, aes(x = Gender, fill = Award) ) + 
  geom_bar( position="dodge") + 
  xlab("Gender") + 
  ggtitle("Award Preference by Gender")
sbs
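For comparison only, the sketch below draws stacked and side-by-side bar charts with base R's barplot() from a hand-typed copy of the two-way table. This is not the ggplot2 approach the activity asks for; it is just a way to see the two layouts without loading the data set, and the object names are made up.

```r
# Cell counts from the two-way table, entered column by column (F, then M)
award_by_gender <- matrix(
  c(20, 76, 73, 11, 73, 109),
  nrow = 3,
  dimnames = list(Award = c("Academy", "Nobel", "Olympic"),
                  Gender = c("F", "M"))
)

# Stacked bars: one bar per Gender, split into Award segments
stacked_mid <- barplot(award_by_gender, beside = FALSE, legend.text = TRUE,
                       xlab = "Gender", ylab = "Frequency",
                       main = "Award Preference by Gender (stacked)")

# Side-by-side bars: one bar per Award within each Gender
dodged_mid <- barplot(award_by_gender, beside = TRUE, legend.text = TRUE,
                      xlab = "Gender", ylab = "Frequency",
                      main = "Award Preference by Gender (side by side)")
```

The beside argument plays the same role here that position = "stack" and position = "dodge" play in geom_bar().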

Extra:

Try making the tables in questions 2, 4, and 8 using count() and group_by() in the dplyr package and pivot_wider() in the tidyr package. Look at the dplyr sections on the RStudio help page for frequency tables and the RStudio help page for Contingency Tables.

Below is the code used to answer question 2 using dplyr.

library(dplyr)
library(tidyr)

# Question 2
StudentSurvey %>% 
  count(Award)

##     Award   n
## 1 Academy  31
## 2   Nobel 149
## 3 Olympic 182
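If dplyr is not available, a similar data-frame layout can be produced in base R by converting a table with as.data.frame(). The sketch below assumes a stand-in vector `award` rebuilt from the counts above rather than the real StudentSurvey$Award column.

```r
# Stand-in for StudentSurvey$Award, rebuilt from the counts printed above
award <- rep(c("Academy", "Nobel", "Olympic"), times = c(31, 149, 182))

# as.data.frame() turns the table into one row per category;
# responseName sets the name of the count column (dplyr's count() calls it n)
award_df <- as.data.frame(table(Award = award), responseName = "n")
award_df
```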

# Two-way table example using the built-in mtcars data set
library(dplyr) # Loads the dplyr package to use %>%, group_by(), and count()
library(tidyr) # Loads the tidyr package to use pivot_wider()
mtcars %>% 
  group_by(cyl, gear) %>% 
  count(cyl, gear) %>%
  pivot_wider( values_from = n, names_from =  cyl)

## # A tibble: 3 x 4
## # Groups:   gear [3]
##    gear   `4`   `6`   `8`
##   <dbl> <int> <int> <int>
## 1     3     1     2    12
## 2     4     8     4    NA
## 3     5     2     1     2

Faith Johnson

Fall 2020
