In this activity, you’ll be investigating the data from the StudentSurvey data set provided by your book. We will be using this data set to learn how to summarize and visualize categorical variables. The R guides on Moodle under Section 2.1 will be VERY helpful.
There are two lines of code below. The first line of code loads all the data sets from the book. The second line of code loads the package that we will be using to make our graphs. Usually you will need to write these lines yourself, but since we are just starting to use RStudio I have provided the code for you. Look at the code below so you know how to do it on your own in the next activity.
library(Lock5Data) # Loads data sets from the book
library(ggplot2) # Loads the graphing package
The console operates separately from your document, so you will also need to load the data sets and graphing package into your Console. Remember that the console is located at the bottom of RStudio and is where you test your code. In the console, type library(Lock5Data) and press enter. The cursor should advance to the next line with NO output. The StudentSurvey data set should now be loaded in both your document and your console. Similarly, in the console, type library(ggplot2) and press enter. The cursor should advance to the next line with NO output. The graphing package should now also be loaded into your console.
Once the data is loaded into the Console you may view it in a spreadsheet by typing View(StudentSurvey) in the Console and pressing enter. If you did this correctly, a tab containing all the cases and variables should open automatically. DO NOT use View() in your document, as your document will not render correctly.
You can learn more about this dataset by typing ?StudentSurvey into the Console. This will bring up a help page in the lower right window of RStudio. This particular help page lists all the variables and their types, and gives a brief explanation of the variables and the dataset.
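Variable types can also be checked from the Console. The sketch below is only an illustration: it uses the built-in iris data set as a stand-in so that it runs without any extra packages, and the names var_types, categorical, and quantitative are made up for this example. The same calls should work on StudentSurvey once Lock5Data is loaded.

```r
# iris is a built-in data set, used here as a stand-in for StudentSurvey
var_types <- sapply(iris, class)  # storage class of each column
var_types

# Columns stored as factor or character are categorical;
# columns stored as numeric or integer are usually quantitative
categorical  <- names(var_types)[var_types %in% c("factor", "character")]
quantitative <- names(var_types)[var_types %in% c("numeric", "integer")]
categorical
quantitative
```

Treat the result as a first guess only. A variable stored as a number can still be categorical (for example, a numeric code for a group), so check the help page description as well.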
Use the View(StudentSurvey) output and the ?StudentSurvey help page to list the variables in this dataset. Complete the statements below. I have done the first one for you.
ANSWER:
Gender is a \(\underline{categorical}\) variable.
Height is a \(\underline{quantitative}\) variable.
VerbalSAT is a \(\underline{quantitative}\) variable.
GPA is a \(\underline{quantitative}\) variable.
Exercise is a \(\underline{quantitative}\) variable.
Award is a \(\underline{categorical}\) variable.
The first line of code below creates a frequency table of Gender and saves it as table_1. The second line of code displays the table.
table_1 <- table(StudentSurvey$Gender, dnn="Gender Table") # Creates a table
table_1 # Displays the table
## Gender Table
##   F   M 
## 169 193 
Now it's your turn. The Award variable contains responses to the question “Which award would you prefer to win? Academy, Nobel, or Olympic?” Create a frequency table to display the frequency of Award. You should save your table as table_2 and give it a label of “Award Table”. Construct and display your table in the code chunk below.
table_2 <- table(StudentSurvey$Award, dnn="Award Preference")
table_2
## Award Preference
## Academy   Nobel Olympic 
##      31     149     182 
Add a total to table_2. The R Studio help page for frequency tables has an example on how to do this. Remember that you have already created a table called table_2, so you do not need to create a new table.
addmargins(table_2) # Adds the total (Sum) to the existing table
## Award Preference
## Academy   Nobel Olympic     Sum 
##      31     149     182     362 
Create a relative frequency table from table_2.
prop.table(table_2) # Converts the counts in table_2 to proportions
## Award Preference
##    Academy      Nobel    Olympic 
## 0.08563536 0.41160221 0.50276243 
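A relative frequency is a count divided by the total number of cases. The short sketch below re-enters the Award counts shown above by hand (the vector name award_counts is made up for this example) to show that dividing by the total gives the same values as prop.table().

```r
# Award counts copied from the frequency table above
award_counts <- c(Academy = 31, Nobel = 149, Olympic = 182)

# Relative frequency = count / total number of responses (362)
rel_freq <- award_counts / sum(award_counts)
round(rel_freq, 3)  # 0.086 0.412 0.503

# prop.table() performs the same division
all.equal(rel_freq, prop.table(award_counts))

# Relative frequencies always sum to 1
sum(rel_freq)
```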
ANSWER: You might choose to use a relative frequency table in order to more clearly see how frequently each answer was given in a survey.
Below I have provided code to create and display a bar chart. This particular bar chart displays the distribution of Gender. Read through the comments and code to get an idea of how plots are constructed. You can find additional examples on the R Studio help page for bar charts.
ggplot(StudentSurvey, aes(x = Gender)) + # Create a plot using the Gender variable from the StudentSurvey dataset
  geom_bar(fill = "midnightblue") + # Create a bar chart with blue bars
  ggtitle("Distribution of Gender") + # Create a title
  labs(caption = "Data from the StudentSurvey dataset") + # Add a caption to the plot
  ylab("") # Remove the y axis label
ggplot(StudentSurvey, aes(x = Award)) +
  geom_bar(fill = "midnightblue") +
  ggtitle("Distribution of Award Preference") +
  labs(caption = "Data from the StudentSurvey dataset") +
  ylab("")
7. (4 pts.) In the chunk below, make a pie chart to display the distribution of Award. Use the R Studio help page for pie charts as a guide. Make sure your pie chart has a useful title and labels.
pie(table(StudentSurvey$Award), col=c("seagreen", "seagreen1", "seagreen3"), main="Distribution of Award Preference", labels=c("","","")) # Slice labels are left blank because the legend names each slice
legend("right", legend=levels(as.factor(StudentSurvey$Award)), fill=c("seagreen", "seagreen1", "seagreen3"), title="Award", box.lty=0) # Legend colors match the slice colors
Two-way tables, or contingency tables as they are sometimes called, show the relationship between two categorical variables. The categories for one variable are listed down the side (rows) and the categories for the second variable are listed across the top (columns). Each cell of the table contains the count of the number of cases that are in both the row and column categories.
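As a small illustration of this layout, the sketch below builds a two-way table from the built-in mtcars data set, treating gear and cyl as two categorical variables. The name gear_by_cyl and the labels "Gears" and "Cylinders" are made up for this example.

```r
# mtcars ships with R; gear and cyl are used here as two categorical variables
gear_by_cyl <- table(mtcars$gear, mtcars$cyl, dnn = c("Gears", "Cylinders"))

# Rows are the gear categories, columns are the cyl categories
gear_by_cyl

# The same table with row and column totals added
addmargins(gear_by_cyl)
```

Each cell counts the cars that fall in both the row and the column category. For example, the cell in row 3 and column 8 counts the cars with 3 gears and 8 cylinders.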
Your table should be saved in a variable called twoway_tab.
The Gender variable should be displayed across the top of your table, horizontally, and the Award variable should be displayed along the side of your table, vertically.
Your table should be labeled.
twoway_tab <- table(StudentSurvey$Award, StudentSurvey$Gender, dnn=c("Award Preference", "Gender")) # Award in rows, Gender in columns
addmargins(twoway_tab) # Displays the table with row and column totals
##                 Gender
## Award Preference   F   M Sum
##          Academy  20  11  31
##          Nobel    76  73 149
##          Olympic  73 109 182
##          Sum     169 193 362
Use the prop.table function to display a two-way table of Gender and Award where the values in the table are the proportions of Award within each level of Gender. If you do this correctly, each column in the table should sum to 1.
ct <- table(StudentSurvey$Award, StudentSurvey$Gender, dnn=c("Award Preference", "Gender"))
prop.table(ct, margin=2) # margin=2 divides each count by its column (Gender) total
##                 Gender
## Award Preference          F          M
##          Academy 0.11834320 0.05699482
##          Nobel   0.44970414 0.37823834
##          Olympic 0.43195266 0.56476684
Use the prop.table function to display a two-way table of Gender and Award where the values in the table are the proportions of Gender within each level of Award. If you do this correctly, each row in the table should sum to 1.
ct <- table(StudentSurvey$Award, StudentSurvey$Gender, dnn=c("Award Preference", "Gender"))
prop.table(ct, margin=1) # margin=1 divides each count by its row (Award) total
##                 Gender
## Award Preference         F         M
##          Academy 0.6451613 0.3548387
##          Nobel   0.5100671 0.4899329
##          Olympic 0.4010989 0.5989011
The tables you created in questions 9 and 10 are conditional tables. These tables assume that you know the specific value of one of the variables. For instance, the table you created in question 10 assumes you know an individual's Award preference and allows you to see whether the mix of Gender differs from one Award preference to another. The tables you created in questions 8, 9, and 10 are very similar, but they are used to answer very different questions.
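To see how the same count leads to two different conditional proportions, the sketch below re-enters the Award by Gender counts from the two-way table by hand. The object names counts, within_gender, and within_award are made up for this example.

```r
# Counts copied from the Award by Gender two-way table
counts <- matrix(c(20, 76, 73, 11, 73, 109), nrow = 3,
                 dimnames = list(Award = c("Academy", "Nobel", "Olympic"),
                                 Gender = c("F", "M")))

within_gender <- prop.table(counts, margin = 2)  # conditions on Gender (columns)
within_award  <- prop.table(counts, margin = 1)  # conditions on Award (rows)

colSums(within_gender)  # each column sums to 1
rowSums(within_award)   # each row sums to 1

# The same cell (76 females who prefer Nobel) gives two different answers
counts["Nobel", "F"] / sum(counts[, "F"])      # 76 / 169: share of females who prefer Nobel
counts["Nobel", "F"] / sum(counts["Nobel", ])  # 76 / 149: share of Nobel choosers who are female
```

Which version to read depends on which variable the question treats as known.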
Use the tables from questions 8, 9, and 10 to answer the following questions:
ANSWER: Males
ANSWER: 149
ANSWER: 0.5100671
ANSWER: 0.44970414
ANSWER: Males
The Gender variable should be the explanatory variable and should be displayed along the horizontal axis.
The Award variable should be treated as the response variable and should be displayed on the vertical axis of your graph.
The graph should display frequencies.
Your graph should be labeled.
library(ggplot2)
StudentSurvey$Award <- factor(StudentSurvey$Award) # Converts the Award variable into a factor
StudentSurvey$Gender <- factor(StudentSurvey$Gender) # Converts the Gender variable into a factor
stacked.bar2 <- ggplot(data = StudentSurvey, aes(x = Gender, fill = Award)) +
  geom_bar(position = "stack") + # Stacks the counts so the bars display frequencies
  xlab("Gender") +
  ylab("Frequency") +
  ggtitle("Award Preference by Gender") +
  scale_fill_discrete(name="Awards")
stacked.bar2
library(ggplot2) # Load the ggplot2 library
StudentSurvey$Gender <- factor(StudentSurvey$Gender) # Create a categorical variable
StudentSurvey$Award <- factor(StudentSurvey$Award) # Create a categorical variable
sbs <- ggplot(data = StudentSurvey, aes(x = Gender, fill = Award)) +
  geom_bar(position = "dodge") +
  xlab("Gender") +
  ggtitle("Award Preference by Gender")
sbs
Extra:
Try making the tables in questions 2, 4, and 8 using count() and group_by() in the dplyr package and pivot_wider() in the tidyr package. Look at the dplyr sections on the R Studio help page for frequency tables and the R Studio help page for Contingency Tables.
Below is the code used to answer question 2 using dplyr.
library(dplyr)
library(tidyr)
# Question 2
StudentSurvey %>%
  count(Award)
##     Award   n
## 1 Academy  31
## 2   Nobel 149
## 3 Olympic 182
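One possible way to extend this to a relative frequency table is to add a proportion column with mutate(). The sketch below is only an illustration: to stay self-contained it rebuilds an Award column from the counts printed above instead of loading StudentSurvey, and the names award_df, award_props, and prop are made up for this example.

```r
library(dplyr)

# Stand-in data: one row per response, rebuilt from the counts printed above
award_df <- data.frame(
  Award = rep(c("Academy", "Nobel", "Olympic"), times = c(31, 149, 182))
)

award_props <- award_df %>%
  count(Award) %>%             # frequency of each Award category (column n)
  mutate(prop = n / sum(n))    # relative frequency = count / total

award_props
```

With the real data, replacing award_df with StudentSurvey should give the same proportions as prop.table().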
library(dplyr) # load the dplyr package to use %>%
library(tidyr) # loads the tidyr package to use pivot_
mtcars %>%
  group_by(cyl, gear) %>%
  count(cyl, gear) %>%
  pivot_wider(values_from = n, names_from = cyl)
## # A tibble: 3 x 4
## # Groups:   gear [3]
##    gear   `4`   `6`   `8`
##   <dbl> <int> <int> <int>
## 1     3     1     2    12
## 2     4     8     4    NA
## 3     5     2     1     2