Topic 1: Introduction to Statistics and presenting data


In this computer lab, we will work through some exercises relating to the content in Topic 1, and then apply our knowledge to new data.


After working through the questions in this computer lab, you will be ready to complete Quiz 2. If you have time during today’s lab, you may like to work on the quiz.

🎧 Online students

Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. Each emoji has a particular meaning and will sometimes be associated with additional instructions:

Prompts for you

💬 Write your answer in the chat.

Modes at different times during the lab

🏑 Main room. All together in the main room – your computer lab demonstrator will be presenting information or facilitating class discussion

💑 Breakout rooms. The person whose birthday is closest to a date picked at random by your computer lab demonstrator shares their screen or whiteboard. Here you will discuss a question together and bring your group’s answer back to the main room.

💻 Focus mode. You will still be in the main room, but working independently. All students will share their screens during this time so that your computer lab demonstrator (but not other students) can see your screen.


🏫 Face-to-face (blended) students

Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. You can ignore the emojis and collapsible sections, as they contain information relevant to students who are studying online.


Preparation

🏑 Open up RStudio and create a new script file.

As you work through the questions, you can copy and paste the code provided below into your script file.

1 Inspecting our Data

💻 Recall the survey data set introduced in the Topic 1 material. This data set is contained within the R package MASS, and consists of the responses of 237 Statistics students to a set of questions (Venables and Ripley 1999).

Follow the instructions below to familiarise yourself with the survey data set.

1.1

💻 Run the following code to (1) load the MASS package into the current R environment, and (2) view the list of variables in the survey data set.

library(MASS)
names(survey)

1.2

💻 To view the full survey data set in a separate tab, run the command

View(survey)

This should immediately open a tab titled “survey”. Take some time to look over the different types of data recorded in this data set.

1.3

💻 Navigate back to your R script file, and run the following code to view the R Documentation on the survey data set. This documentation should include a short explanation of each variable.

help(survey)
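
As a complement to the help file, the sketch below shows how R itself stores each variable, which can help when classifying the variables in the next question. It uses the base R functions str, class and dim. Note that R's storage type (for example, factor or numeric) is a guide only: it does not tell us whether a categorical variable is nominal or ordinal.

```r
library(MASS)  # provides the survey data set

# Compact overview: one line per variable with its storage type
str(survey)

# Check individual variables
class(survey$Sex)      # a factor, i.e. stored as categories
class(survey$Wr.Hnd)   # numeric

# Number of rows (students) and columns (variables)
dim(survey)
```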

1.4

💻 Using the help file for the survey data set where needed, write down the name of each variable, and what kind of variable it is. For example:

  • Sex: Categorical, Nominal
  • Wr.Hnd: Numerical, Continuous


🎧 Online students 💬 Enter the variable types for Clap and Smoke in the chat.


2 Frequency Tables

💻 Frequency tables allow us to display how many people (or units) fall into each possible category of a variable. They can be created in R with the table function.

Over the next few questions (up to 4.3), we will conduct various calculations relating to frequency tables. We suggest you try these questions yourself first; a video walkthrough of these steps is also available.

2.1

💻 Consider the variable W.Hnd, which tells us the writing hand of each student (Left or Right). Run the following code to create a simple frequency table:

freq.w.hnd <- table(survey$W.Hnd)  # Store frequency table as freq.w.hnd
freq.w.hnd  # Display the frequency table in the console
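
Note that table leaves out missing values (NA) by default, so the counts may not add up to the number of students surveyed. The sketch below uses a small made-up vector (hand.example is a hypothetical name, not part of the survey data) to show the useNA argument of table, which adds a count of missing values.

```r
# Made-up responses for illustration only; NA marks a missing answer
hand.example <- c("Right", "Left", "Right", NA, "Right")

# Default behaviour: the missing value is silently dropped
table(hand.example)

# useNA = "ifany" adds an NA column when missing values are present
freq.with.na <- table(hand.example, useNA = "ifany")
freq.with.na
```

The same argument can be added to the survey$W.Hnd table above if you would like to check for missing responses.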

2.2

💻 We can convert a frequency table into a relative frequency table by using the prop.table function as follows:

rel.freq.w.hnd <- prop.table(freq.w.hnd)
rel.freq.w.hnd

Do the proportions of left-handed and right-handed students align with your expectations?

2.3

💻 In 2.2, you will have noticed that the proportions are displayed with many decimal places.

If we would like to display our results rounded to a specific number of decimal places in R, we can use the round function. Run the code below to round your results for 2.2 to two decimal places of accuracy.

rel.freq.w.hnd <- round(rel.freq.w.hnd, 2)

If you now display rel.freq.w.hnd, what do you observe?


🎧 Online students 💬 Enter the rounded results in the chat.


2.4

💻 Use R to convert your rel.freq.w.hnd table values to percentages, using the code below:

rel.freq.w.hnd <- rel.freq.w.hnd * 100
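
The order in which we round and multiply matters. In 2.3 and 2.4 we rounded the proportions first and then multiplied by 100, which gives whole-number percentages. If we want percentages rounded to two decimal places, we can multiply first and round second. A minimal sketch with a made-up proportion (p.example is a hypothetical name):

```r
p.example <- 1 / 3   # made-up proportion for illustration

# Round first, then convert: only whole percentages survive
round(p.example, 2) * 100

# Convert first, then round: two decimal places are kept
round(p.example * 100, 2)
```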

2.5

💻 Now that you are familiar with frequency tables in R, create a frequency table of the variable Sex, using the table function. How many males and females are there in the class?

2.6

💻 Finally, create a relative frequency table to determine the relative frequency of males and females in the class. Display your output in percentages rounded to two decimal places.

What percentage of the class are female and what percentage are male?


🎧 Online students 💬 Enter the results in the chat.


3 Types of Variables in R

💻 In 1.4, we determined that the variable survey$Smoke is a categorical, ordinal variable. This is a useful fact to know when displaying frequency tables in R.

When working with categorical ordinal variables in R, we should ensure that:

  • R knows they are ‘ordered’
  • R has the correct order of the categories of the variable, and
  • R automatically displays any output accordingly
    • E.g. small < medium < large is ordered correctly, while small < large < medium is not

3.1

💻 First of all, we will make sure that R knows to treat survey$Smoke as a “factor variable”. Doing so is good practice because it makes categorical variables much easier to work with in R. We can use the as.factor function for this purpose. Then, to check if the variable survey$Smoke is recorded as an ordinal variable, we use the R function is.ordered:

survey$Smoke <- as.factor(survey$Smoke) # set Smoke to be a factor
is.ordered(survey$Smoke) # check if survey$Smoke is ordinal
## [1] FALSE

The output FALSE tells us that the categories of survey$Smoke have not been ordered. To confirm this, use the function levels to see the current order of the categories.


❓Hint

Hint: Check the code below if you are not sure how to proceed.

levels(survey$Smoke)


3.2

💻 As we can see, the Smoke categories are listed alphabetically rather than by how much each student smokes. Run the code below to order the categories in survey$Smoke from lowest to highest using the ordered function:

survey$Smoke <- ordered(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"))

3.3

💻 We can then run the following two lines of code to see what difference this has made:

is.ordered(survey$Smoke)
levels(survey$Smoke)

Note that in contrast to 3.1, our data is now ordered, as denoted by the output TRUE you should obtain. We can also see that the levels, or categories, are now ordered from lowest to highest.
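
The same steps can be tried on a small made-up variable, here the small/medium/large example from the start of this question (sizes is a hypothetical vector, not part of the survey data). The sketch also shows that frequency tables follow the level order, and that comparisons such as < only work on ordered factors.

```r
sizes <- c("small", "large", "medium", "small", "large")  # made-up data

# A plain factor sorts its levels alphabetically
size.factor <- factor(sizes)
levels(size.factor)       # "large" "medium" "small"
is.ordered(size.factor)   # FALSE

# Supply the levels from lowest to highest to create an ordered factor
size.ordered <- ordered(sizes, levels = c("small", "medium", "large"))
is.ordered(size.ordered)  # TRUE

# Frequency tables now follow the level order
table(size.ordered)

# Ordered factors can be compared: is the first size below the second?
size.ordered[1] < size.ordered[2]   # TRUE, since small < large
```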


🏑 Reconvene in main room to discuss results


4 Relative and Cumulative Relative Frequency Tables

💻 Having set up the variable survey$Smoke as an ordinal variable, we can now create a frequency table and a relative frequency table as follows:

4.1

💻 Adapt the code from Question 2 (specifically 2.1 to 2.4) to create a frequency table called freq.smoke, and a relative frequency table called rel.freq.smoke.


🎧 Online students 💬 Enter the results in the chat.


4.2

💻 Since the variable survey$Smoke is ordinal, it can be useful to convert our frequency and relative frequency tables to show cumulative results. Such tables are called cumulative frequency tables and cumulative relative frequency tables respectively, and can be created using the function cumsum as follows:

cum.freq.smoke <- cumsum(freq.smoke)
cum.freq.smoke
cum.rel.freq.smoke <- round(cumsum(prop.table(freq.smoke)) * 100, 2)
cum.rel.freq.smoke
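
To see what cumsum is doing, it can help to follow the arithmetic on a small made-up table (counts.example is hypothetical and does not come from the survey data). Each cumulative value is the running total of the values up to and including that category, so the last cumulative frequency equals the total count and the last cumulative relative frequency is 100.

```r
# Made-up counts for an ordered variable with three levels
counts.example <- c(Low = 5, Medium = 3, High = 2)

cumsum(counts.example)       # 5, 5 + 3 = 8, 8 + 2 = 10

# Relative frequencies as percentages: 50, 30, 20
rel.example <- counts.example / sum(counts.example) * 100
rel.example

cumsum(rel.example)          # 50, 80, 100
```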


🎧 Online students 💬 Enter the results in the chat.

4.3 Tabulating Results

💻 Suppose we want to create a table that summarises the variable survey$Smoke by displaying the levels of the variable as rows, and the frequencies, relative frequencies and their cumulative counterparts as columns. To create such a table, we can use the cbind function, which combines vectors and displays them as columns side by side.

Using the code below as a guide, create a table that summarises the variable survey$Smoke.

Note: In the code below, the parts in quotation marks are the names of the columns - these can be whatever you like.

cbind("Freq" = freq.smoke, 
      "Cum Freq" = cum.freq.smoke, 
      "Rel Freq" = rel.freq.smoke, 
      "Cum Rel Freq" = cum.rel.freq.smoke)

5 Visualising our Data

💻 Now that you have familiarised yourself with the survey data set, we will consider different options for graphically presenting this data.

Recall that when producing graphics in RStudio, it can be handy to display them in a separate graphics device window. To do so, run the command windows() (for Mac OS users, run the command quartz()) before producing your plot. Keep this in mind for this question and the subsequent questions.
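
If you would prefer one command that works on any operating system, base R also provides dev.new. The sketch below wraps it in a small helper (new.graphics.window is a hypothetical name of our own). The noRStudioGD = TRUE argument asks R not to use the RStudio Plots pane. Exactly which kind of window opens depends on your system, so treat this as a sketch rather than a guaranteed replacement for windows() or quartz().

```r
# Hypothetical helper: open a separate graphics window on any platform
new.graphics.window <- function() {
  if (interactive()) {
    # noRStudioGD = TRUE: do not use the RStudio Plots pane
    dev.new(noRStudioGD = TRUE)
  }
  invisible(NULL)
}

new.graphics.window()   # run this before producing your plot
```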

For the remaining lab questions, we will focus on producing various data visualisations. We suggest you try these questions yourself first; a video walkthrough of these steps is also available.

5.1 Bar Charts

💻 Bar charts are a useful way to graphically present categorical data. We can use the barplot function to create frequency and relative frequency bar charts in R.

A table detailing some of the most commonly used arguments (options) for the barplot function can be found here.

Using these options, we can create a frequency chart of the smoking levels of the students in the survey as follows:

smoke.names <- c("Never", "Occasional", "Regular", "Heavy")
barplot(height = freq.smoke, 
        ylim = c(0, 200), 
        col = c("chartreuse4", "yellow", "orange", "red"),
        names.arg = smoke.names,
        main = "Frequency Distribution Chart of Smoking Levels",
        axis.lty = 1, 
        xlab = "Smoking levels",
        ylab = "Frequency",
        legend.text = smoke.names)

5.1.1

💻 Inspect the frequency chart this code produces, and note the following:

  • We have used the freq.smoke object for our height argument, since it was already created.

  • The highest frequency value was 189 (the number of students who never smoke), so we have set our y range accordingly, to ensure all frequencies are displayed. Try modifying the ylim values, and observe the changes to the bar chart.

  • Before producing the chart, we have stored a variable called smoke.names, which contains the names of the categories as we wish to use them in the chart. We have then used smoke.names twice in the barplot function. Defining and using smoke.names in this manner is convenient, because it means that if we decide to change the names later, we only need to do it once.
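
Rather than typing the upper limit of the y axis by hand, we can calculate it from the data. The sketch below uses made-up counts (fruit.counts is hypothetical) and leaves about 20% of headroom above the tallest bar; the 20% is an arbitrary choice. It also shows that barplot returns the x positions of the bar centres, which we can pass to the text function to print each frequency above its bar.

```r
fruit.counts <- c(Apple = 12, Banana = 30, Cherry = 7)  # made-up data

# Upper limit: tallest bar plus roughly 20% headroom
y.max <- max(fruit.counts) * 1.2

# barplot returns the x coordinates of the bar midpoints
bar.mids <- barplot(height = fruit.counts,
                    ylim = c(0, y.max),
                    main = "Made-up Fruit Counts",
                    ylab = "Frequency")

# Print each frequency just above its bar (pos = 3 means "above")
text(x = bar.mids, y = fruit.counts, labels = fruit.counts, pos = 3)
```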

5.1.2

💻 We can create a relative frequency distribution chart of the smoking levels of students in a similar fashion, using the following code:

Note: Some sections, denoted by ..., have been removed and you will have to fill these sections in.

barplot(height = rel.freq.smoke, 
        ylim = c(0,100), 
        col = ...,
        names.arg = smoke.names,
        main = ...,
        axis.lty = 1, xlab = ..., ylab = ...,
        legend.text = smoke.names)

R has many colour palettes available to choose from if you would like to create a personalised plot. There is a wealth of information on the web on R colours. For a simple reference guide, take a look at this pdf.

5.1.3

💻 Inspect the chart this code produces, and note the following:

  • We have used rel.freq.smoke instead of freq.smoke as our height object here.

  • The y axis now represents percentages rather than frequencies. As such, we have set the range of the y axis to go from 0 to 100.

5.2 Pie Charts

💻 Pie charts are an alternative way to visually present information and can be created in R using the pie function. To create a pie chart of the frequency of smoking levels of students, run the following code:

pie(x = freq.smoke, 
    labels = smoke.names,
    main = "Smoking Levels of Students")

5.2.1

💻 Inspect the pie chart this code produces, and note the following:

  • We have specified the frequencies using the argument x instead of height, which is used with the barplot function.

  • If not specified, R automatically assigns a colour palette to the pie chart. If we wish to specify our own colours, we can use the col argument in the same way as we used it in the barplot function. If you are not sure what colours to use, remember you can refer to this pdf.
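
Pie charts do not show the frequencies themselves, so it is often helpful to add percentages to the labels. A minimal sketch with made-up counts (pet.counts is hypothetical), using paste0 to build the labels:

```r
pet.counts <- c(Cat = 10, Dog = 25, Fish = 15)  # made-up data

# Percentages rounded to one decimal place
pet.pct <- round(pet.counts / sum(pet.counts) * 100, 1)

# Labels such as "Cat (20%)"
pet.labels <- paste0(names(pet.counts), " (", pet.pct, "%)")
pet.labels

pie(x = pet.counts,
    labels = pet.labels,
    col = c("orange", "skyblue", "chartreuse4"),
    main = "Made-up Pet Counts")
```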


🏑 Reconvene in main room to discuss results


6 Assessing Numerical Data

💻 So far, we have been considering categorical data. In this section, we will be looking at frequency tables and histograms to make sense of numerical data. While the variables we will look at in this section will mostly be continuous, the material presented can also be applied to discrete variables.

6.1

💻 In order to create frequency tables for numerical data, there is one preliminary step we need to take that was not necessary for categorical data. Consider the variable Height, which contains the height of each student in centimetres. Using the same method we have learned for categorical data, we could create a frequency table as follows:

freq.height <- table(survey$Height)
freq.height

However, as we can see from the output, this frequency table is not very informative.

6.2

💻 We will get a more informative result if we break up the range of heights into equal intervals. To do this, we first need to see the range of heights in the survey$Height variable by using the range function:

range(survey$Height, na.rm = TRUE) 
# Note we use the na.rm = TRUE argument to ignore missing values


🎧 Online students 💬 Enter the results in the chat.


6.3

💻 As we can see, the heights range from 150cm to 200cm. Knowing this, we can now break this range into equal, non-overlapping intervals. Since the range is 150 to 200 inclusive, let’s use intervals of 5, starting from 150 and going up to 205 (the upper limit is 205 rather than 200 so that a height of exactly 200cm still falls inside the final interval). To do this, we will use the seq function to create an object called intervals that contains a sequence of numbers from 150 to 205 in steps of 5:

intervals <- seq(from = 150, to = 205, by = 5) 
intervals
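
If you would rather not pick the end points by eye, they can be calculated from the data. The helper below is a sketch of our own (make.breaks is a hypothetical function, not part of base R or MASS). It rounds the minimum down to a multiple of the interval width and pushes the upper end strictly above the maximum, so that the largest value still falls inside an interval when the intervals exclude their upper end point.

```r
# Hypothetical helper: equally spaced break points covering all values of x
make.breaks <- function(x, width) {
  lower <- floor(min(x, na.rm = TRUE) / width) * width
  # The "+ 1" keeps the top break strictly above the maximum
  upper <- (floor(max(x, na.rm = TRUE) / width) + 1) * width
  seq(from = lower, to = upper, by = width)
}

# Made-up heights with the same minimum and maximum as in 6.3
make.breaks(c(150, 172.5, NA, 200), width = 5)
# 150 155 160 ... 200 205
```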

6.4

💻 Having done this, we can now create a modified version of the survey$Height variable, which we will call height.intervals, that assigns each student to one of the intervals based on their height. We can do this using the cut function as follows:

height.intervals <- cut(x = survey$Height, breaks = intervals, right = FALSE)
height.intervals

Note that:

  • The final argument in the above command, right = FALSE, tells R that each interval should include the lower point and go up to but not include the upper point.

  • We can see that the first student’s height is between 170cm and 175cm, the second student’s height is between 175cm and 180cm, the third student’s height is missing, the fourth student’s height is between 160cm and 165cm, and so on.
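
The effect of right = FALSE is easiest to see for values that sit exactly on a break point. The sketch below uses three made-up heights (edge.heights is hypothetical) and compares right = FALSE with the default, right = TRUE.

```r
breaks.example <- seq(from = 150, to = 205, by = 5)
edge.heights   <- c(150, 155, 199.9)   # made-up values for illustration

# right = FALSE: intervals look like [150,155), so 155 starts a new interval
closed.left <- cut(edge.heights, breaks = breaks.example, right = FALSE)
closed.left
# [150,155) [155,160) [195,200)

# Default (right = TRUE): intervals look like (150,155], so 155 belongs to
# the first interval and 150 falls outside every interval (NA)
closed.right <- cut(edge.heights, breaks = breaks.example)
closed.right
# <NA> (150,155] (195,200]
```

With the default setting, a student who is exactly 150cm tall would be dropped from the table unless include.lowest = TRUE is also supplied.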

6.5

💻 We are now ready to create a frequency table of the heights and display it vertically using cbind as follows:

freq.height <- table(height.intervals)
cbind(freq = freq.height)
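
cbind returns a matrix, and the names of the vectors we combine become its row names, which is why the intervals appear down the left-hand side. A minimal sketch with made-up counts (the object names here are hypothetical):

```r
# Made-up frequencies for three intervals
freq.example <- c("[0,10)" = 4, "[10,20)" = 6, "[20,30)" = 10)
cum.example  <- cumsum(freq.example)

summary.example <- cbind("Freq" = freq.example, "Cum Freq" = cum.example)
summary.example

dim(summary.example)        # 3 rows (intervals), 2 columns
rownames(summary.example)   # the interval labels
colnames(summary.example)   # the column names we chose
```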

6.6

💻 Using what we have learnt earlier, we can extend the above table to also include relative and cumulative frequencies. Use the code below as a base, and fill in the missing sections (denoted by ...):

# Relative Frequency Table
rel.freq.height <- round(prop.table(freq.height) * 100, 2)

# Cumulative frequency
cum.freq.height <- cumsum(...)

# Cumulative relative frequency
cum.rel.freq.height <- round(...(...(...)) * 100, 2)

# Use the cbind function to display all frequencies vertically
cbind("Freq" = freq.height, "..." = cum.freq.height, 
      "Rel Freq" = ..., "Cum Rel Freq" = ...)

7 Analysing a Variable

💻 To answer the following questions, use the code provided in Question 6 as a guide. For each of the following parts of this question, consider the variable Age.

7.1

Create a variable called age.intervals which assigns each student to an appropriate interval within the range of ages. The functions range, seq and cut may be of assistance when answering this question.

7.2

💻 Create a frequency table for Age.


🎧 Online students 💬 Enter the results in the chat.


7.3

💻 Create a relative frequency table for Age.

7.4

💻 Create a cumulative frequency table for Age.

7.5

💻 Create a cumulative relative frequency table for Age.

7.6

💻 Using cbind, display a table in the console which displays the frequency, relative frequency, cumulative frequency and cumulative relative frequency of Age.

8 Creating Histograms

💻 Recall that a histogram is a chart that depicts the frequency of a numerical variable in non-overlapping intervals, called ‘bins’, that span the entire range of the data. We can think of a histogram as a pictorial representation of a frequency table. While we have used bar charts for categorical variables, a histogram would be the equivalent kind of chart for numerical data.

When creating histograms in R, the bins are chosen automatically for us. However, as we will see, this can easily be overridden. We use the function hist to create a histogram in R. A table detailing some of the most commonly used arguments for the hist function can be found here.

Suppose we wanted to create a histogram that matched the intervals for survey$Height we defined earlier and stored as intervals. We could do this using the code below:

hist(x = survey$Height, 
     breaks = intervals, 
     right = FALSE, 
     main = "Height of Students", 
     xlab = "Height (cm)")
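
Because a histogram is a picture of a frequency table, the bin counts used by hist should agree with the table built from cut when both use the same breaks and the same right setting. A minimal check with made-up values (x.example is hypothetical) is sketched below; plot = FALSE asks hist to return its calculations without drawing anything.

```r
x.example      <- c(151, 154, 155, 158.5, 162, 162, 169.9)  # made-up data
breaks.example <- seq(from = 150, to = 170, by = 5)

# Counts from the frequency table approach
table.counts <- as.vector(table(cut(x.example, breaks = breaks.example,
                                    right = FALSE)))

# Counts that hist would draw
hist.counts <- hist(x.example, breaks = breaks.example, right = FALSE,
                    plot = FALSE)$counts

table.counts   # 2 2 2 1
hist.counts    # 2 2 2 1
```

One difference to be aware of: hist uses include.lowest = TRUE by default, so a value exactly equal to the largest break is counted in the last bin by hist but becomes NA with cut. Choosing the largest break to be above the maximum, as we did for Height, avoids this.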

The following exercises will provide us with an opportunity to practise using more of the hist function arguments to see how they work. For each of the following parts of this question, consider the variable survey$Age.

8.1

💻 Create a histogram of the ages of students using the command hist(survey$Age). Does the data appear symmetrical or skewed? Can you see any outliers?

8.2

💻 Adding both the breaks and right arguments to the code you used in the previous question, create another histogram that has bins matching the intervals you used to create your frequency table for survey$Age in 7.2.

8.3

💻 Now add the labels argument to your code to display the frequencies above each bin. Do they match the frequencies in the frequency table you previously created for survey$Age?

8.4

💻 Instead of passing a vector of break points to the breaks argument, we can give a single number for the number of bins. R treats this number as a suggestion only, so the histogram may not have exactly that many bins.

Have a go at passing in some small and large numbers and see how it affects the look of your histogram. Which do you prefer?

8.5

💻 Using the arguments xlab, main and col, modify your code again to now choose an appropriate x axis label, title for your histogram and colour of your choice.

8.6

💻 Run the following code to display the breaks, counts, density and midpoints of each bin in the console:

hist(survey$Age, plot = FALSE)
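
The object returned by hist is a list, so its components can be stored and extracted with $. A minimal sketch with made-up values (y.example and h.example are hypothetical names):

```r
y.example <- c(1, 2, 2, 3, 3, 3, 4)   # made-up data

h.example <- hist(y.example, breaks = 0:4, plot = FALSE)

h.example$breaks   # 0 1 2 3 4
h.example$counts   # 1 2 3 1
h.example$mids     # 0.5 1.5 2.5 3.5

# density is scaled so that the total area of the bars is 1
sum(h.example$density * diff(h.example$breaks))
```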


🏑 Reconvene in main room to discuss results


Great work! That was a long lab, but you should now feel much more comfortable assessing and summarising simple statistics and variables in R.


References

Venables, W. N., & Ripley, B. D. 1999. Modern Applied Statistics with S-PLUS. 3rd ed. New York: Springer.


These notes have been prepared by Rupert Kuveke and Amanda Shaker. They are adapted from notes originally written by Amanda Shaker as a supplement to a workshop hosted by the Statistics Consultancy Platform, entitled Basic Statistics with R first held at La Trobe University in February 2018. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.