Topic 2: Descriptive Statistics

In Topic 2, we focused on descriptive statistics. In this computer lab, we will practise describing data using numerical and graphical measures.

After working through the questions in this computer lab, you will be ready to complete Quiz 3. If you have time during today’s lab, you may like to work on the quiz.

🎧 Online students

Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. Each emoji has a particular meaning and will sometimes be associated with additional instructions:

Prompts for you

💬 Write your answer in the chat.

Modes at different times during the lab

🏡 Main room. All together in the main room – your computer lab demonstrator will be presenting information or facilitating class discussion

💡 Breakout rooms. The person whose birthday is closest to a date chosen at random by your computer lab demonstrator shares their screen or whiteboard. Here you will discuss a question together and bring your group’s answer back to the main room.

💻 Focus mode. You will still be in the main room, but working independently. All students will share their screens during this time so that your computer lab demonstrator (but not other students) can see them.

🏫 Face-to-face (blended) students

Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. You can ignore the emojis and collapsible sections, as they contain information relevant to students who are studying online.

1 Descriptive statistics and plots for a single variable

🏡 For this question, we will be assessing happiness data published in the World Happiness Report (WHR) of the Sustainable Development Solutions Network (Gapminder 2021a).

This data has been collected between the years 2005 and 2019 for 163 countries. Each year, a happiness score from 0 to 10 is assigned to each country, based on the national average response to the Cantril life ladder question, summarised here:

Imagine a ladder with rungs from 0, being the worst possible life for you, up to 10, representing the best possible life for you.
On which rung do you feel you stand at this point in time?

For convenience, this score has been rescaled to range from 0 to 100 in our data set. Let’s take a look at the results for Australia over the years:

##     country happiness_2005 happiness_2006 happiness_2007 happiness_2008
## 6 Australia           73.4             NA           72.9           72.5
##   happiness_2009 happiness_2010 happiness_2011 happiness_2012 happiness_2013
## 6             NA           74.5           74.1             72           73.6
##   happiness_2014 happiness_2015 happiness_2016 happiness_2017 happiness_2018
## 6           72.9           73.1           72.5           72.6           71.8
##   happiness_2019
## 6           72.2

We can see that in 2019, Australia’s happiness score was 72.2. You’ll notice that the happiness score seems fairly consistent across the years, although there are some years with missing data, denoted by NA.

Suppose that we want to assess the 2019 happiness scores for countries around the world, and compare them to Australia’s score.

1.1

🏡 Open up RStudio and create a new script file. As you work through the questions, you can copy and paste the code provided below into your script file.

1.2

🏡 The happiness_income_2019.csv file in the Week 3 tile in LMS contains data on the 163 surveyed countries for 2019. Download this file now, and save it in a relevant location on your PC.

Set your R working directory to the folder in which you have saved the happiness_income_2019.csv file.

Hint: If you need a refresher on how to set working directories and import data into R, watch the video “How do I import data into R?” which is available on LMS, just below the happiness_income_2015.csv file.

1.3

🏡 Once you have set your R working directory, load in your data to RStudio using the code below.

world.data <- read.csv("happiness_income_2019.csv")

1.4

🏡 If we take a quick look at the data

head(world.data)

##   X     country income_2019 happiness_2019
## 1 1 Afghanistan        1760           25.7
## 2 2     Albania       12700           48.8
## 3 3     Algeria       14000           50.1
## 4 4      Angola        5540             NA
## 5 5   Argentina       17500           59.7
## 6 6     Armenia        9730           46.8

we notice that Angola has a data entry of NA. As you may recall, this stands for “Not Available”, and denotes a missing value. Twelve countries in our data set have 2019 happiness scores of NA. Don’t worry too much about this just yet though: there are several ways to get around this in R, and we will cover one such way shortly.

1.5

🏡 Recall from Computer Lab 2 that we can use the $ symbol to select a specific column of data. Use the R mean function to compute the average 2019 happiness score for the 151 countries with a recorded score, as shown in the code below.

mean(world.data$happiness_2019)

1.6

🏡 Wait…why are we getting a result of NA? Well, since the data we are assessing contains some NA entries, R cannot know what those missing values would have been, so it returns NA for the whole calculation. To avoid this problem, we can use the argument na.rm = TRUE within many R summary functions (such as mean), to tell R to remove any NA values before calculating.

Let’s retry 1.5, using the modified code below:

mean(world.data$happiness_2019, na.rm = TRUE)

Great! We now have a numeric result. How does this mean score compare to Australia’s happiness score?
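To see the effect of na.rm in isolation, here is a minimal sketch using a small made-up vector (toy.scores is hypothetical and is not part of the lab data):

```r
# Hypothetical toy scores, made up for illustration only
toy.scores <- c(60, 70, NA, 80)

mean(toy.scores)                # NA: the missing value propagates
mean(toy.scores, na.rm = TRUE)  # 70: the mean of 60, 70 and 80
sum(is.na(toy.scores))          # 1: the number of missing values
```

Counting the missing values with sum(is.na(...)) is a handy way to check how many observations are being left out of a calculation.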

🎧 Online students

💬 Enter your interpretation of what the comparison between Australia’s happiness score and the mean score means, in the chat.

1.7

💡 In addition to the mean, we can consider other measures of location, such as the median. In R, we can compute the median of a data set using the median function.

Use the median function to compute the median happiness score for countries in 2019. Remember to include the argument na.rm = TRUE.

What do you notice? Can you explain what your result means?

1.8

💡 Based on our calculations so far, it would appear that Australia has a higher happiness score than many countries. To obtain a better understanding of where Australia falls in terms of happiness ranking, use the R quantile function to find the minimum and maximum happiness scores in 2019, along with the $25\%$, $50\%$ and $75\%$ quantiles.

What do these values tell you about the spread of the data? Which quartile does Australia lie in?

1.9

💡 We can also compute the minimum and maximum happiness scores in 2019 using the functions min and max respectively.

Use these functions now, and confirm that your results match those obtained in 1.8.

Hint: Remember, since there are some observations of NA in our data, we will need to use the na.rm = TRUE argument when performing these calculations.

1.10

💡 Using either the IQR function, or your results from 1.8 above, compute the IQR for the 2019 happiness scores. What can you infer from this value?
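As a quick illustration of how the quantile output and the IQR fit together, the sketch below uses a small made-up vector (toy.scores is hypothetical, not the lab data). With its default settings, quantile returns the minimum, the three quartiles and the maximum in one call:

```r
# Hypothetical toy scores, made up for illustration only
toy.scores <- c(10, 20, 30, 40, 50, NA)

# Default call: 0%, 25%, 50%, 75% and 100% quantiles
q <- quantile(toy.scores, na.rm = TRUE)
q

# The IQR is the distance between the 75% and 25% quantiles
q[["75%"]] - q[["25%"]]        # 20
IQR(toy.scores, na.rm = TRUE)  # 20
```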

1.11

💡 The quantile function can also be used to find the value corresponding to a specific percentile, as shown in the code below.

# This finds the happiness score corresponding to the 20th percentile
quantile(world.data$happiness_2019, probs = 0.2, na.rm = TRUE)

Using this code as a guide, substitute in some values to try and find a percentile which corresponds to Australia’s happiness score in 2019. Are you surprised by the result?

Note: Don’t spend too long on this - a rough percentile match is sufficient.
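If you would like to check your trial-and-error answer, one possible shortcut (an aside, not the approach this question asks for) is to compute the proportion of observations at or below a given score. The sketch below uses a made-up vector and a made-up target value. Because quantile interpolates between observations, the two approaches will only agree roughly.

```r
# Hypothetical toy scores and target value, made up for illustration only
toy.scores <- c(10, 20, 30, 40, 50, NA)
toy.target <- 40

# Proportion of non-missing scores that are at or below the target
prop.at.or.below <- mean(toy.scores <= toy.target, na.rm = TRUE)
prop.at.or.below  # 4 of the 5 non-missing scores
```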

🏡 Reconvene in main room to discuss results

1.12

💻 Compute the variance and standard deviation of the 2019 happiness scores by using the var and sd functions respectively.
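As a reminder of how these two measures are related, the sketch below computes the sample variance by hand for a small made-up vector (toy.scores is hypothetical) and compares it with the output of var and sd. Both functions use the sample (n - 1) denominator.

```r
# Hypothetical toy scores, made up for illustration only
toy.scores <- c(2, 4, 4, 6, NA)

# Sample variance by hand: squared deviations divided by (n - 1)
x <- toy.scores[!is.na(toy.scores)]
manual.var <- sum((x - mean(x))^2) / (length(x) - 1)

manual.var
var(toy.scores, na.rm = TRUE)  # matches manual.var
sd(toy.scores, na.rm = TRUE)   # the square root of the variance
```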

1.13

💻 Now that we have considered the quartiles, IQR and variance of our data, we have a clearer understanding of the spread of the data around the mean and median. To visualise this spread, create the following plots using the happiness_2019 data:

box plot
histogram

Make sure to label your plots clearly.

If you are not sure how to begin, use the code in the code chunk below as a guide.

boxplot(world.data$..., 
        col = ...,
        main = ...,
        xlab = ..., ylab = ...,
        horizontal = TRUE)

hist(...)
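If the template above feels abstract, here is one way it might be filled in, using a small made-up vector rather than the lab data. The colour, titles and axis labels are arbitrary choices for illustration only, and you should pick labels that describe the happiness data when making your own plots.

```r
# Hypothetical toy scores, made up for illustration only
toy.scores <- c(2, 4, 4, 5, 7, 9, NA)

# Horizontal box plot; missing values are dropped automatically
bp <- boxplot(toy.scores,
              col = "lightblue",
              main = "Box plot of toy scores",
              xlab = "Score",
              horizontal = TRUE)

# Histogram of the same values
h <- hist(toy.scores,
          col = "lightblue",
          main = "Histogram of toy scores",
          xlab = "Score")
```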

1.14

💻 Given the appearance of your plots from 1.13, would you describe the distribution of data for the happiness_2019 variable as symmetric or skewed?

🎧 Online students

💬 Comment in the chat whether you believe the distribution is symmetric or skewed.

1.15

💻 To obtain further evidence regarding the symmetry (or asymmetry) of our data, we can compute the skewness of the data. To calculate skewness, we will need to use a new package, called psych.

Run the R code below to install and load this package now:

install.packages("psych")
library(psych)

With the psych package loaded, we can use the describe function to compute the skewness value for the average happiness scores for countries in 2019. (Note: the describe function will provide a number of summary statistics, including skew, which is the skewness measure we are after here.)

A skewness value close to 0 suggests that the data is symmetric, while a skewness value far from 0 suggests that the data is skewed.

The describe function provides three slightly different ways for the skewness to be calculated. These can be chosen using the type argument. In this subject, we will be using type = 2. Also note that for the describe function, we do not need to include the na.rm = TRUE argument.

Use the R code below as a starting point, and replace the ...s with the appropriate code:

describe(..., type = 2)

After running the code, you may have noticed that the output was only provided to two decimal places. To increase the number of decimal places, use the as.data.frame function as follows:

as.data.frame(describe(..., type = 2))
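For those curious about what the skew value represents, the sketch below computes an adjusted sample skewness by hand in base R on a made-up vector. It assumes that type = 2 corresponds to the formula G1 = g1 * sqrt(n(n - 1)) / (n - 2) given in the psych documentation for its skew function, so treat it as an illustration rather than a replacement for describe.

```r
# Hypothetical toy scores with a long right tail, made up for illustration
toy.scores <- c(1, 2, 3, 4, 10, NA)

x <- toy.scores[!is.na(toy.scores)]  # drop missing values
n <- length(x)

m2 <- mean((x - mean(x))^2)  # second central moment
m3 <- mean((x - mean(x))^3)  # third central moment

g1 <- m3 / m2^(3 / 2)                   # unadjusted skewness
G1 <- g1 * sqrt(n * (n - 1)) / (n - 2)  # adjusted skewness (assumed type 2)
G1  # positive, reflecting the long right tail

# A perfectly symmetric vector has a third central moment of 0
sym <- c(1, 2, 3, 4, 5)
sym.m3 <- mean((sym - mean(sym))^3)
sym.m3
```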

1.16

💻 One aspect of the data which is hard to grasp using solely a box plot or histogram is the composition, or density, of the data.

Fortunately, a violin plot can provide us with this information, all in one graph. Run the code below to create a violin plot for the average happiness scores in 2019.

install.packages("vioplot") # install the package
library(vioplot) # load the package
vioplot(world.data$happiness_2019, horizontal = TRUE, col = "cyan")
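The outline of a violin plot is based on a smoothed estimate of the data’s density. To get a feel for what such an estimate looks like on its own, the sketch below uses the base R density function on a made-up vector. This is only an illustration, and the smoothing settings used by vioplot may differ.

```r
# Hypothetical toy scores, made up for illustration only
toy.scores <- c(2, 4, 4, 5, 7, 9, NA)

# Kernel density estimate, ignoring the missing value
d <- density(toy.scores, na.rm = TRUE)

# Taller regions of the curve indicate where the values are concentrated
plot(d, main = "Kernel density of toy scores", xlab = "Score")
```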

1.17

💻 Based on your results in 1.8, 1.12, 1.13, 1.15 and 1.16, what measure of location and what measure of spread do you think would be best to use, when discussing the 2019 happiness scores data? Make sure to provide a justification for your choices.

🎧 Online students

💬 Post your chosen measure of location and measure of spread in the chat.

🏡 Reconvene in main room to discuss results

1.18

💻 In addition to the happiness scores, the happiness_income_2019.csv data set also contains the average income per person for each country - i.e. the GDP (gross domestic product) per person, adjusted for purchasing power differences (Gapminder 2021b).

Create the following plots for the income_2019 variable, to practise your new skills:

box plot
histogram
violin plot

Comment on your results.

Hint: You can use your work from 1.13 and 1.16 as a guide.

🎧 Online students

💬 Leave a comment about your results in the chat.

2 Descriptive statistics and plots to assess the relationship between two numeric variables

💻 In this question, we will broaden our assessment of the data set introduced in question 1.

So far, we have found that happiness scores and average income scores differ quite markedly between countries, with Australia enjoying both a higher average level of happiness and average income than most other countries in 2019. Suppose that we are now interested in determining whether or not there is a relationship between a country’s GDP per person and its citizens’ average happiness.

2.1

💻 The R cov function can be used to find the covariance between two variables. Run the code below to calculate the covariance between the GDP per capita and the happiness score for countries in 2019.

Note that in order to ignore the NA values, here we use the argument use = "complete.obs". Make sure to include the quotation marks.

cov(world.data$income_2019, world.data$happiness_2019, use = "complete.obs")

What does this result tell us about the relationship between these two variables?
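To see what use = "complete.obs" is doing, the sketch below uses two short made-up vectors (toy.x and toy.y are hypothetical). Only the pairs in which both values are present contribute to the calculation, which we can confirm by computing the sample covariance by hand.

```r
# Hypothetical paired values, made up for illustration only
toy.x <- c(1, 2, 3, 4, NA)
toy.y <- c(2, 4, 7, NA, 10)

cov(toy.x, toy.y)                        # NA: missing values propagate
cov(toy.x, toy.y, use = "complete.obs")  # uses only the complete pairs

# The same calculation by hand, keeping pairs with no missing values
keep <- complete.cases(toy.x, toy.y)
x <- toy.x[keep]
y <- toy.y[keep]
manual.cov <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
manual.cov
```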

2.2

💻 While the covariance value is helpful, it is hard to interpret. Typically, it is more beneficial to calculate the correlation coefficient for two variables. Use the R cor function, and the code above in 2.1 as a guide, to calculate the correlation coefficient for the GDP per capita and the happiness score for countries in 2019.

Does the result seem reasonable? How would you describe the correlation in terms of strength?

Note: There is actually more than one way to calculate the correlation, depending on what type of data we are assessing. We will discuss this later on in the semester, but for now it is sufficient to use the default method in the cor function.
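One reason the covariance is hard to interpret is that its size depends on the units of measurement, whereas the correlation does not. The sketch below illustrates this with two made-up vectors (toy.x and toy.y are hypothetical): rescaling toy.x by 1000 changes the covariance but leaves the correlation untouched.

```r
# Hypothetical paired values, made up for illustration only
toy.x <- c(1, 2, 3, 4, 5)
toy.y <- c(2, 4, 5, 4, 6)

cov(toy.x, toy.y)
cov(toy.x * 1000, toy.y)  # 1000 times larger

cor(toy.x, toy.y)
cor(toy.x * 1000, toy.y)  # unchanged

# The correlation is the covariance divided by both standard deviations
manual.cor <- cov(toy.x, toy.y) / (sd(toy.x) * sd(toy.y))
manual.cor
```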

🎧 Online students

💬 Post your interpretation of the correlation value in the chat.

2.3

💻 To help visualise the data, create a scatter plot of the 2019 variables using the R plot function. Make sure to consider which axis each variable should be plotted on, and label your axes and plot clearly. You can use the code below to get started.

plot(x = ..., y = ...)

Are you surprised by anything shown in the graph?

Hints: If you are not sure which variable should be used for each axis, remember that the variable listed on the y-axis is (generally speaking) reliant to some extent upon the variable listed on the x-axis.

Since we are assessing income and happiness, do you think it would be more reasonable to say income is reliant on happiness, or that happiness is reliant on income? (Of course it is more nuanced than this, but for the purposes of this question, these are the only two variables under consideration).

If you are still not quite sure, or would like to check if you are on the right track, you can also refer back to the Topic 2 section on scatter plots.

3 Saving Plots in RStudio

💻 Once you have produced an image in R, such as your scatter plot from 2.3, you might like to save it.

There are several ways to save images in R. We will take a look at three different approaches.

3.1 Saving images in the RStudio Plots Panel

💻 If your image is within the Plots Panel of the RStudio interface, then simply navigate to this panel, click the Export button, and then select Save as Image... to save your image as a .png or .jpg. See the screenshot below for an example:

Figure 3.1: Saving an image in the RStudio Plots panel.

Once you have selected this, there will be more options from which to choose, such as the dimensions and the save location. Once you have decided on these options, simply click Save to save your image.

Note that there is also the option to save your image as a pdf, but for the purposes of this subject, we recommend you avoid using this option.

3.2 Saving images in a separate Graphics device

💻 If you have plotted your image in a separate Graphics device (for example you may have used either the windows() or quartz() command directly prior to producing the image), then the way in which you save your image is slightly different to the method presented in 3.1.

When saving images in a separate Graphics device, navigate to the top panel of the Graphics device, and select File -> Save as.... You will then be presented with different file types. Select the file type of your choice. A window will then appear allowing you to specify the save location of your image.

See the screenshot below for an example:

Figure 3.2: Saving an image from a Graphics device.

Note that saving images in this way offers a wider variety of file types compared to the method outlined in 3.1.

3.3 Saving images using code

💻 Finally, we can save images using R code! As long as the image has been displayed within a separate Graphics device that supports it (the windows() device, or a cairo-based X11 device), we can save it using the savePlot function. This can be a more practical option when many plots are being produced and need to be saved.

The savePlot function can be used with the arguments filename (to name the file), and type (to specify the file type).

Take a look at the R code chunk below for an example:

savePlot(filename = "Scatter Plot", type = "png")
# This will save the image currently shown in the graphics device

Note that we have specified the file type png here, but there are many other file type options available.
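Since savePlot depends on the type of graphics device in use, it may not work on every computer. A commonly used alternative, sketched below under the assumption that your R installation supports png output, is to open a png file device, draw the plot, and then close the device. The data, file name and save location here are made up for illustration.

```r
# Hypothetical toy data, made up for illustration only
toy.x <- c(1, 2, 3, 4, 5)
toy.y <- c(2, 4, 5, 4, 6)

# Hypothetical save location: a temporary folder (change this to suit)
out.file <- file.path(tempdir(), "toy_scatter_plot.png")

png(filename = out.file, width = 800, height = 600)  # open the file device
plot(x = toy.x, y = toy.y,
     main = "Toy scatter plot",
     xlab = "Toy x", ylab = "Toy y")
dev.off()                                            # close it to write the file

file.exists(out.file)  # check that the image was written
```

With this approach the plot is drawn straight into the file, so it will not appear on screen while the png device is open.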

💻 Try saving the scatter plot you produced in 2.3 using the three methods discussed above.

🎧 Online students

💬 Once you have saved your image, put it into the chat.

🏡 Reconvene in main room to discuss results

4 Extension: Gapminder 2015 Data

💻 If you have completed all the questions above, and would like some extra practice, try the following questions:

4.1

💻 Download the happiness_income_2015.csv file in the Week 3 Module in LMS, save it in a relevant location on your PC, and load it into R using the code below:

world.data.2015 <- read.csv("happiness_income_2015.csv", header = TRUE)
world.data.2015 <- na.omit(world.data.2015)
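Note that na.omit removes every row that contains an NA in any column, so after this step the na.rm = TRUE argument should no longer be needed. It also means a country is dropped even if only one of its values is missing. The sketch below shows the difference on a small made-up data frame (toy.df and its values are hypothetical).

```r
# Hypothetical toy data frame, made up for illustration only
toy.df <- data.frame(country   = c("A", "B", "C"),
                     income    = c(100, 200, NA),
                     happiness = c(50, NA, 70))

# Keep only the rows with no missing values in any column
toy.clean <- na.omit(toy.df)
nrow(toy.clean)  # only country A remains

# Column-wise removal keeps more observations than row-wise removal
mean(toy.df$happiness, na.rm = TRUE)  # uses countries A and C
mean(toy.clean$happiness)             # uses country A only
```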

4.2

💻 Repeat steps 1.5 to 1.18 in question 1, using the 2015 data. Are your results similar to those for 2019?

4.3

💻 Repeat question 2 using the 2015 data. Are your findings similar or very different to those for 2019?

References

Gapminder. 2021a. “Happiness Score (WHR) [.csv File].” 2021. http://gapm.io/dhapiscore_whr.

———. 2021b. “Income Per Person [.csv File].” 2021. http://gapm.io/dgdppc.

These notes have been prepared by Rupert Kuveke and Amanda Shaker. The notes on saving plots in R are adapted from notes originally written by Amanda Shaker as a supplement to a workshop hosted by the Statistics Consultancy Platform, entitled Basic Statistics with R, first held at La Trobe University in February 2018. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

STM1001: Computer Lab 3
