In this computer lab, we will work through some exercises relating to the content in Topic 1, and then apply our knowledge to new data.
After working through the questions in this computer lab, you will be ready to complete Quiz 2. If you have time during today's lab, you may like to work on the quiz.
Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. Each emoji has a particular meaning and will sometimes be associated with additional instructions:
Prompts for you
💬 Write your answer in the chat.
Modes at different times during the lab
Main room. All together in the main room: your computer lab demonstrator will be presenting information or facilitating class discussion.
Breakout rooms. The person with the birthday closest to a random date (picked by your computer lab demonstrator) shares their screen or whiteboard. Here you will discuss a question together and bring your group's answer back to the main room.
💻 Focus mode. You will still be in the main room, but working independently. All students will be sharing their screens during this time so that your computer lab demonstrator (but not other students) can see them.
If you are not studying online, you can ignore the emojis and collapsible sections, as they contain information relevant to students who are studying online.
Open up RStudio and create a new script file.
As you work through the questions, you can copy and paste the code provided below into your script file.
💻 Recall the survey data set introduced in the Topic 1 material.
This data set is contained within the R package MASS, and consists of the responses of 237 Statistics students to a set of questions (Venables 1999).
Follow the instructions below to familiarise yourself with the survey data set.
💻 Run the following code to (1) load the MASS package into the current R environment, and (2) view the list of variables in the survey data set.
library(MASS)
names(survey)
💻 To view the full survey data set in a separate tab, run the command:
View(survey)
This should immediately open a tab titled "survey". Take some time to look over the different types of data recorded in this data set.
💻 Navigate back to your R script file, and run the following code to view the R Documentation on the survey data set. This documentation should include a short explanation of each variable.
help(survey)
💻 Using the help file for the survey data set where needed, write down the name of each variable, and what kind of variable it is. For example:
Sex: Categorical, Nominal
Wr.Hnd: Numerical, Continuous
Write your answers for Clap and Smoke in the chat.
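As a rough cross-check on your answers, the sketch below shows how R itself stores each variable. It assumes only that the MASS package is installed, and the object name var.classes is an illustrative choice. R's storage class is a starting point rather than the final answer: a factor may be nominal or ordinal, and a numeric column may be discrete or continuous, so the help file is still needed.

```r
library(MASS)

# Storage class of every variable in the survey data set
var.classes <- sapply(survey, class)
var.classes

# Compact overview: class, first few values and factor levels
str(survey)
```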
💻 Frequency tables allow us to display how many people (or units) fall into each possible category of a variable. They can be created in R with the table function.
Over the next few questions (up to 4.3), we will conduct various calculations relating to frequency tables. We suggest you try these questions yourself first, but for a video walkthrough of these steps, you can refer to this video:
💻 Consider the variable W.Hnd, which tells us the writing hand of each student (Left or Right). Run the following code to create a simple frequency table:
freq.w.hnd <- table(survey$W.Hnd) # Store frequency table as freq.w.hnd
freq.w.hnd # Display the frequency table in the console
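By default, table leaves out missing values, so the counts may not add up to the 237 students in the data set. The sketch below uses the useNA argument of table to show any missing responses as their own category; the object name freq.w.hnd.na is just an illustrative choice.

```r
library(MASS)

# Include missing values (NA) as a separate category, if there are any
freq.w.hnd.na <- table(survey$W.Hnd, useNA = "ifany")
freq.w.hnd.na

# With NA included, the counts add up to the number of students surveyed
sum(freq.w.hnd.na) == nrow(survey)
```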
💻 We can convert a frequency table into a relative frequency table by using the prop.table function as follows:
rel.freq.w.hnd <- prop.table(freq.w.hnd)
rel.freq.w.hnd
Do the proportions of left-handed and right-handed students align with your expectations?
💻 In 2.2, you will have noticed that the proportions displayed have many decimal places.
If we would like to display our results rounded to a specific number of decimal places in R, we can use the round function. Run the code below to round your results for 2.2 to two decimal places.
rel.freq.w.hnd <- round(rel.freq.w.hnd, 2)
If you now inspect rel.freq.w.hnd, what do you observe?
💻 Use R to convert your rel.freq.w.hnd table values to percentages, using the code below:
rel.freq.w.hnd <- rel.freq.w.hnd * 100
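Note that the order of the steps matters: rounding a proportion to two decimal places and then multiplying by 100 effectively leaves a whole-number percentage. A minimal sketch of one alternative, which multiplies first and rounds second (pct.w.hnd is a hypothetical name, not part of the lab code):

```r
library(MASS)

# Multiply the proportions by 100 first, then round the percentages to 2 decimal places
pct.w.hnd <- round(prop.table(table(survey$W.Hnd)) * 100, 2)
pct.w.hnd
```

This ordering is useful whenever percentages rounded to two decimal places are required.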
💻 Now that you are familiar with frequency tables in R, create a frequency table of the variable Sex, using the table function. How many males and females are there in the class?
💻 Finally, create a relative frequency table to determine the relative frequency of males and females in the class. Display your output in percentages rounded to two decimal places.
What percentage of the class are female and what percentage are male?
💻 In 1.4, we determined that the variable survey$Smoke is a categorical, ordinal variable. This is a useful fact to know when displaying frequency tables in R.
When working with categorical ordinal variables in R, we should ensure that the variable is stored as a factor and that its categories are in the correct order.
💻 First of all, we will make sure that R knows to treat survey$Smoke as a "factor variable". Doing so is good practice because it makes categorical variables much easier to work with in R. We can use the as.factor function for this purpose.
Then, to check if the variable survey$Smoke is recorded as an ordinal variable, we use the R function is.ordered:
survey$Smoke <- as.factor(survey$Smoke) # set Smoke to be a factor
is.ordered(survey$Smoke) # check if survey$Smoke is ordinal
## [1] FALSE
The output FALSE tells us that the categories of survey$Smoke have not been ordered. To confirm this, use the function levels to see the current order of the categories.
Hint: Check the code below if you are not sure how to proceed.
levels(survey$Smoke)
💻 As we can see, the Smoke categories are listed alphabetically rather than from lowest to highest. Run the code below to order the categories in survey$Smoke from lowest to highest using the ordered function:
survey$Smoke <- ordered(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"))
💻 We can then run the following two lines of code to see what difference this has made:
is.ordered(survey$Smoke)
levels(survey$Smoke)
Note that, in contrast to 3.1, the variable is now ordered, as indicated by the output TRUE. We can also see that the levels, or categories, are now ordered from lowest to highest.
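One practical benefit of an ordered factor is that R can compare its categories. The sketch below works on a separate copy (smoke.ord, a name chosen here for illustration) so that it runs on its own, and counts the students who smoke regularly or more.

```r
library(MASS)

# Ordered copy of the Smoke variable, from lowest to highest
smoke.ord <- ordered(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"))

# Comparisons such as >= are only meaningful for ordered factors
n.regular.plus <- sum(smoke.ord >= "Regul", na.rm = TRUE)
n.regular.plus
```

Running the same comparison on an unordered factor returns NA values with a warning.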
💻 Having set up the variable survey$Smoke as an ordinal variable, we can now create a frequency table and a relative frequency table as follows:
💻 Adapt the code from Question 2 (specifically 2.1 to 2.4) to create a frequency table called freq.smoke, and a relative frequency table called rel.freq.smoke.
💻 Since the variable survey$Smoke is ordinal, it can be useful to convert our frequency and relative frequency tables to show cumulative results. Such tables are called cumulative frequency tables and cumulative relative frequency tables respectively, and can be created using the function cumsum as follows:
cum.freq.smoke <- cumsum(freq.smoke)
cum.freq.smoke
cum.rel.freq.smoke <- round(cumsum(prop.table(freq.smoke)) * 100, 2)
cum.rel.freq.smoke
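To see what cumsum does, it can help to try it on a small made-up vector first. The counts below are hypothetical and are not taken from the survey data.

```r
# Hypothetical counts for three ordered categories
toy.freq <- c(Low = 2, Medium = 3, High = 5)

# Each entry is the running total of the counts up to and including that category
toy.cum.freq <- cumsum(toy.freq)
toy.cum.freq
# Low: 2, Medium: 5, High: 10
```

The last entry of a cumulative frequency table always equals the total count, which is a handy check on your own tables.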
💻 Suppose we want to create a table that summarises the variable survey$Smoke by displaying the levels of the variable as rows, and the frequencies, relative frequencies and their cumulative counterparts as columns. To create such a table, we can use the cbind function, which combines vectors and displays them as columns side by side.
Using the code below as a guide, create a table that summarises the variable survey$Smoke.
Note: In the code below, the parts in quotation marks are the names of the columns; these can be whatever you like.
cbind("Freq" = freq.smoke,
      "Cum Freq" = cum.freq.smoke,
      "Rel Freq" = rel.freq.smoke,
      "Cum Rel Freq" = cum.rel.freq.smoke)
💻 Now that you have familiarised yourself with the survey data set, we will consider different options for graphically presenting this data.
Recall that when producing graphics in RStudio, it can be handy to display them in a separate graphics device window. To do so, run the command windows() (for macOS users, run the command quartz()) before producing your plot. Keep this in mind for this question and the subsequent questions.
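If you would rather not remember a different command for each operating system, base R also provides dev.new(), which opens a new graphics device of the default type for your platform. The sketch below is an alternative to the commands above rather than part of the lab code. It uses the noRStudioGD argument so that the plot does not go to the RStudio plot pane, and it is guarded so that a window is only opened in an interactive session.

```r
# Open a new graphics window using the platform's default device
# (only when running interactively, e.g. in RStudio or the R console)
if (interactive()) {
  dev.new(noRStudioGD = TRUE)
}

# Any plot produced next is drawn in the most recently opened device
barplot(c(A = 3, B = 5), main = "Test plot")
```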
For the remaining lab questions, we will focus on producing various data visualisations. We suggest you try these questions yourself first, but for a video walkthrough of these steps, you can refer to this video:
💻 Bar charts are a useful way to graphically present categorical data. We can use the barplot function to create frequency and relative frequency bar charts in R.
A table detailing some of the most commonly used arguments (options) for the barplot function can be found here.
Using these options, we can create a frequency chart of the smoking levels of the students in the survey as follows:
smoke.names <- c("Never", "Occasional", "Regular", "Heavy")
barplot(height = freq.smoke,
        ylim = c(0, 200),
        col = c("chartreuse4", "yellow", "orange", "red"),
        names.arg = smoke.names,
        main = "Frequency Distribution Chart of Smoking Levels",
        axis.lty = 1,
        xlab = "Smoking levels",
        ylab = "Frequency",
        legend.text = smoke.names)
💻 Inspect the frequency chart this code produces, and note the following:
We have used the freq.smoke object for our height argument, since it was already created.
The highest frequency value was 189 (the number of students who never smoke), so we have set the y-axis range to run from 0 to 200, to ensure all frequencies are displayed. Try modifying the ylim values, and observe the changes to the bar chart.
Before producing the chart, we have stored a variable called smoke.names, which contains the names of the categories as we wish to use them in the chart. We have then used smoke.names twice in the barplot function. Defining and using smoke.names in this manner is convenient, because it means that if we decide to change the names later, we only need to do it once.
💻 We can create a relative frequency distribution chart of the smoking levels of students in a similar fashion, using the following code:
Note: Some sections, denoted by ..., have been removed and you will have to fill these sections in.
barplot(height = rel.freq.smoke,
        ylim = c(0, 100),
        col = ...,
        names.arg = smoke.names,
        main = ...,
        axis.lty = 1,
        xlab = ...,
        ylab = ...,
        legend.text = smoke.names)
R has many colour palettes available to choose from if you would like to create a personalised plot. There is a wealth of information on the web on R colours. For a simple reference guide, take a look at this pdf.
💻 Inspect the chart this code produces, and note the following:
We have used rel.freq.smoke instead of freq.smoke as our height object here.
The y axis now represents percentages rather than frequencies. As such, we have set the range of the y axis to go from 0 to 100.
💻 Pie charts are an alternative way to visually present information and can be created in R using the pie function. To create a pie chart of the frequency of smoking levels of students, run the following code:
pie(x = freq.smoke,
    labels = smoke.names,
    main = "Smoking Levels of Students")
💻 Inspect the pie chart this code produces, and note the following:
We have specified the frequencies using the argument x instead of height, which is used with the barplot function.
If not specified, R automatically assigns a colour palette to the pie chart.
If we wish to specify our own colours, we can use the col argument in the same way as we used it in the barplot function. If you are not sure what colours to use, remember you can refer to this pdf.
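Because it is hard to judge exact values from the slices of a pie chart, it can help to put the percentages in the labels. The sketch below is one way to do this. It rebuilds the frequency table so that it runs on its own, and the names smoke.freq, smoke.pct and smoke.labels are illustrative rather than part of the lab code.

```r
library(MASS)

# Frequency table of smoking levels, in order from lowest to highest
smoke.freq <- table(ordered(survey$Smoke,
                            levels = c("Never", "Occas", "Regul", "Heavy")))

# Percentages rounded to one decimal place
smoke.pct <- round(prop.table(smoke.freq) * 100, 1)

# Labels that combine each category name with its percentage
smoke.labels <- paste0(names(smoke.freq), " (", smoke.pct, "%)")

pie(x = smoke.freq,
    labels = smoke.labels,
    main = "Smoking Levels of Students")
```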
💻 So far, we have been considering categorical data. In this section, we will be looking at frequency tables and histograms to make sense of numerical data. While the variables we will look at in this section will mostly be continuous, the material presented can also be applied to discrete variables.
💻 In order to create frequency tables for numerical data, there is one preliminary step we need to take that was not necessary for categorical data. Consider the variable Height, which contains the height of each student in centimetres. Using the same method we have learned for categorical data, we could create a frequency table as follows:
freq.height <- table(survey$Height)
freq.height
However, as we can see from the output, this frequency table is not very informative.
💻 We will get a more informative result if we break up the range of heights into equal intervals. To do this, we first need to see the range of heights in the survey$Height variable by using the range function:
range(survey$Height, na.rm = TRUE)
# Note we use the na.rm = TRUE argument to ignore missing values
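To see why na.rm = TRUE is needed, compare the two calls below on a small made-up vector (the values are hypothetical and not taken from the survey):

```r
# A toy vector of heights with one missing value
toy.heights <- c(160, NA, 175)

range(toy.heights)                # NA NA: the missing value propagates
range(toy.heights, na.rm = TRUE)  # 160 175: the missing value is ignored
```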
💻 As we can see, the heights range from 150 cm to 200 cm. Knowing this, we can now break this range into equal, non-overlapping intervals. Since the range is 150 to 200 inclusive, let's use intervals of 5, starting from 150 and going up to 205. To do this, we will use the seq function to create an object called intervals that contains a sequence of numbers from 150 to 205 in steps of 5:
intervals <- seq(from = 150, to = 205, by = 5)
intervals
💻 Having done this, we can now create a modified version of the survey$Height variable, which we will call height.intervals, that assigns each student to one of the intervals based on their height. We can do this using the cut function as follows:
height.intervals <- cut(x = survey$Height, breaks = intervals, right = FALSE)
height.intervals
Note that:
The final argument in the above command, right = FALSE, tells R that each interval should include the lower point and go up to but not include the upper point.
We can see that the first student's height is between 170 cm and 175 cm, the second student's height is between 175 cm and 180 cm, the third student's height is missing, the fourth student's height is between 160 cm and 165 cm, and so on.
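The effect of right = FALSE is easiest to see on values that sit exactly on a break. The sketch below uses a few made-up heights (not survey data) and compares the two settings; the names toy.heights, toy.breaks, left.closed and right.closed are illustrative.

```r
# Made-up heights that fall exactly on the breaks
toy.heights <- c(150, 155, 200)
toy.breaks  <- seq(from = 150, to = 205, by = 5)

# right = FALSE: intervals are closed on the left, e.g. [150,155)
left.closed <- cut(toy.heights, breaks = toy.breaks, right = FALSE)
as.character(left.closed)   # "[150,155)" "[155,160)" "[200,205)"

# right = TRUE (the default): intervals are closed on the right, e.g. (150,155]
right.closed <- cut(toy.heights, breaks = toy.breaks, right = TRUE)
as.character(right.closed)  # NA "(150,155]" "(195,200]"
```

With the default setting, a value equal to the lowest break is not assigned to any interval unless include.lowest = TRUE is also supplied.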
💻 We are now ready to create a frequency table of the heights and display it vertically using cbind as follows:
freq.height <- table(height.intervals)
cbind(freq = freq.height)
💻 Using what we have learnt earlier, we can extend the above table to also include relative and cumulative frequencies. Use the code below as a base, and fill in the missing sections (denoted by ...):
# Relative Frequency Table
rel.freq.height <- round(prop.table(freq.height) * 100, 2)
# Cumulative frequency
cum.freq.height <- cumsum(...)
# Cumulative relative frequency
cum.rel.freq.height <- round(...(...(...)) * 100, 2)
# Use the cbind function to display all frequencies vertically
cbind("Freq" = freq.height,
      "..." = cum.freq.height,
      "Rel Freq" = ...,
      "Cum Rel Freq" = ...)
💻 To answer the following questions, use the code provided in Question 6 as a guide.
For each of the following parts of this question, consider the variable Age.
Create a variable called age.intervals which assigns each student to an appropriate interval within the range of ages. The functions range, seq and cut may be of assistance when answering this question.
💻 Create a frequency table for Age.
💻 Create a relative frequency table for Age.
💻 Create a cumulative frequency table for Age.
💻 Create a cumulative relative frequency table for Age.
💻 Using cbind, display a table in the console which displays the frequency, relative frequency, cumulative frequency and cumulative relative frequency of Age.
💻 Recall that a histogram is a chart that depicts the frequency of a numerical variable in non-overlapping intervals, called "bins", that span the entire range of the data. We can think of a histogram as a pictorial representation of a frequency table. While we have used bar charts for categorical variables, a histogram would be the equivalent kind of chart for numerical data.
When creating histograms in R, the bins are chosen automatically for us. However, as we will see, this can easily be overridden. We use the function hist to create a histogram in R. A table detailing some of the most commonly used arguments for the hist function can be found here.
Suppose we wanted to create a histogram that matched the intervals for survey$Height we defined earlier and stored as intervals. We could do this using the code below:
hist(x = survey$Height,
     breaks = intervals,
     right = FALSE,
     main = "Height of Students",
     xlab = "Height (cm)")
The following exercises will provide us with an opportunity to practise using more of the hist function arguments to see how they work. For each of the following parts of this question, consider the variable survey$Age.
💻 Create a histogram of the ages of students using the command hist(survey$Age). Does the data appear symmetrical or skewed? Can you see any outliers?
💻 Adding both the breaks and right arguments to the code you used in the previous question, create another histogram that has bins matching the intervals you used to create your frequency table for survey$Age in 7.2.
💻 Now add the labels argument to your code to display the frequencies above each bin. Do they match the frequencies in the frequency table you previously created for survey$Age?
💻 Instead of passing a vector of break points to the breaks argument, we can also pass a single number to specify the number of bins.
Have a go at passing in some small and large numbers and see how it affects the look of your histogram. Which do you prefer?
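As a rough illustration of how the number passed to breaks changes the result, the sketch below compares a small and a large value without drawing anything. R treats a single number as a suggestion only, so the number of bins it reports may differ from the number requested. The object names few.bins and many.bins are illustrative.

```r
library(MASS)

# Compute (but do not draw) histograms with a small and a large suggested number of bins
few.bins  <- hist(survey$Age, breaks = 5,  plot = FALSE)
many.bins <- hist(survey$Age, breaks = 50, plot = FALSE)

# Number of bins actually used in each case
length(few.bins$counts)
length(many.bins$counts)
```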
💻 Using the arguments xlab, main and col, modify your code again to choose an appropriate x-axis label, a title for your histogram and a colour of your choice.
💻 Run the following code to display the breaks, counts, density and midpoints of each bin in the console:
hist(survey$Age, plot = FALSE)
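Because hist returns its results as a list, the individual components can also be stored and accessed by name. A minimal sketch (age.hist is a hypothetical object name):

```r
library(MASS)

# Store the histogram calculations without drawing the plot
age.hist <- hist(survey$Age, plot = FALSE)

age.hist$breaks   # bin boundaries
age.hist$counts   # number of students in each bin
age.hist$mids     # midpoint of each bin

# The counts add up to the number of non-missing ages
sum(age.hist$counts) == sum(!is.na(survey$Age))
```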
These notes have been prepared by Rupert Kuveke and Amanda Shaker. They are adapted from notes originally written by Amanda Shaker as a supplement to a workshop hosted by the Statistics Consultancy Platform, entitled Basic Statistics with R first held at La Trobe University in February 2018. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.