In Topic 2, we focused on descriptive statistics. In this computer lab, we will practise describing data using numerical and graphical measures.
After working through the questions in this computer lab, you will be ready to complete Quiz 3. If you have time during today’s lab, you may like to work on the quiz.
Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. Each emoji has a particular meaning and will sometimes be associated with additional instructions:
Prompts for you
💬 Write your answer in the chat.
Modes at different times during the lab
🏡 Main room. All together in the main room – your computer lab demonstrator will be presenting information or facilitating class discussion.
💡 Breakout rooms. The person whose birthday is closest to a random date picked by your computer lab demonstrator shares their screen or whiteboard. Here you will discuss a question together and bring your group’s answer back to the main room.
💻 Focus mode. You will still be in the main room, but working independently. All students will be sharing their screens during this time so that your computer lab demonstrator (but not other students) can see them.
Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. You can ignore the emojis and collapsible sections, as they contain information relevant to students who are studying online.
🏡 For this question, we will be assessing happiness data published in the World Happiness Report (WHR) of the Sustainable Development Solutions Network (Gapminder 2021a).
This data has been collected between the years 2005 and 2019 for 163 countries. Each year, a happiness score from 0 to 10 is assigned to each country, based on the national average response to the Cantril life ladder question, summarised here:
For convenience, this score has been rescaled to range from 0 to 100 in our data set. Let’s take a look at the results for Australia over the years:
## country happiness_2005 happiness_2006 happiness_2007 happiness_2008
## 6 Australia 73.4 NA 72.9 72.5
## happiness_2009 happiness_2010 happiness_2011 happiness_2012 happiness_2013
## 6 NA 74.5 74.1 72 73.6
## happiness_2014 happiness_2015 happiness_2016 happiness_2017 happiness_2018
## 6 72.9 73.1 72.5 72.6 71.8
## happiness_2019
## 6 72.2
We can see that in 2019, Australia’s happiness score was 72.2. You’ll notice that the happiness score seems fairly consistent across the years, although there are some years with missing data, denoted by NA.
Suppose that we want to assess the 2019 happiness scores for countries around the world, and compare them to Australia’s score.
🏡 Open up RStudio and create a new script file. As you work through the questions, you can copy and paste the code provided below into your script file.
🏡 The happiness_income_2019.csv file in the Week 3 tile in LMS contains data on the 163 surveyed countries for 2019. Download this file now, and save it in a relevant location on your PC.
Set your R working directory to the folder in which you have saved the happiness_income_2019.csv file.
Hint: If you need a refresher on how to set working directories and import data into R, watch the video “How do I import data into R?”, which is available on LMS, just below the happiness_income_2015.csv file.
🏡 Once you have set your R working directory, load in your data to RStudio using the code below.
world.data <- read.csv("happiness_income_2019.csv")
🏡 If we take a quick look at the data
head(world.data)
## X country income_2019 happiness_2019
## 1 1 Afghanistan 1760 25.7
## 2 2 Albania 12700 48.8
## 3 3 Algeria 14000 50.1
## 4 4 Angola 5540 NA
## 5 5 Argentina 17500 59.7
## 6 6 Armenia 9730 46.8
we notice that Angola has a data entry of NA. As you may recall, this stands for “Not Available”, and denotes a missing value. In total, 12 countries in our data set have 2019 happiness scores of NA.
Don’t worry too much about this just yet though - there are several ways to get around this in R, and we will cover one such way shortly.
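Before working around missing values, it can help to count them. The sketch below uses a small made-up data frame (the name toy.data and all of its values are purely illustrative and are not taken from the lab file) to show how is.na can be combined with sum:

```r
# Made-up illustrative data, NOT the real happiness data
toy.data <- data.frame(
  country = c("A", "B", "C", "D"),
  score   = c(25.7, NA, 50.1, NA)
)

# is.na() returns TRUE for each missing entry; sum() counts the TRUEs
n.missing <- sum(is.na(toy.data$score))
n.missing

# Number of entries that are NOT missing
n.available <- sum(!is.na(toy.data$score))
n.available
```

The same idea should carry over to a column of your own data frame, selected with the $ symbol.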
🏡 Recall from Computer Lab 2 that we can use the $ symbol to select a specific column of data. Use the R mean function to compute the average happiness score for all 151 countries in 2019, as shown in the code below.
mean(world.data$happiness_2019)
🏡 Wait… why are we getting a result of NA?
Well, since the data we are assessing contains some NA entries, R does not know what values these entries should take, so by default it returns NA for the whole calculation. To avoid this problem, we can include the argument na.rm = TRUE in the function call, which tells R to remove any NA values before calculating. (Many, though not all, R functions accept this argument.)
Let’s retry 1.5, using the modified code below:
mean(world.data$happiness_2019, na.rm = TRUE)
Great! We now have a numeric result. How does this mean score compare to Australia’s happiness score?
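To see the effect of na.rm in isolation, here is a minimal sketch on a made-up vector (toy.scores and its values are illustrative only):

```r
# Made-up values with one missing entry
toy.scores <- c(70, NA, 50, 60)

# Without na.rm, the single NA makes the whole result NA
mean.with.na <- mean(toy.scores)
mean.with.na

# With na.rm = TRUE, the NA is dropped and the mean of 70, 50 and 60 is returned
mean.without.na <- mean(toy.scores, na.rm = TRUE)
mean.without.na
```

Note that the mean is taken over the three available values only, so the missing entry is not treated as zero.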
💡 In addition to the mean, we can consider other measures of location, such as the median. In R, we can compute the median of a data set using the median function.
Use the median function to compute the median happiness score for countries in 2019.
Remember to include the argument na.rm = TRUE.
What do you notice? Can you explain what your result means?
💡 Based on our calculations so far, it would appear that Australia has a higher happiness score than many countries. To obtain a better understanding of where Australia falls in terms of happiness ranking, use the R quantile function to find the minimum and maximum happiness scores in 2019, along with the \(25\%\), \(50\%\) and \(75\%\) quantiles.
What do these values tell you about the spread of the data? Which quartile does Australia lie in?
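As a rough illustration of what quantile returns by default, the sketch below uses a made-up vector (toy.scores and the value 42 are illustrative assumptions, not lab data). The findInterval function is one possible way to check which quarter of the data a given value falls in:

```r
# Made-up values with one missing entry
toy.scores <- c(10, 20, NA, 30, 40, 50)

# With no probs argument, quantile() returns the 0%, 25%, 50%, 75% and 100% quantiles
toy.quantiles <- quantile(toy.scores, na.rm = TRUE)
toy.quantiles

# findInterval() reports which interval between consecutive quantiles a value falls in;
# here 42 lies between the 75% quantile (40) and the maximum (50), i.e. interval 4
toy.quarter <- findInterval(42, toy.quantiles)
toy.quarter
```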
💡 We can also compute the minimum and maximum happiness scores in 2019 using the functions min and max respectively.
Use these functions now, and confirm that your results match those obtained in 1.8.
Hint: Remember, since there are some observations of NA in our data, we will need to use the na.rm = TRUE argument when performing these calculations.
💡 Using either the IQR function, or your results from 1.8 above, compute the IQR for the 2019 happiness scores. What can you infer from this value?
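The two routes to the IQR should agree, since the IQR is the 75% quantile minus the 25% quantile. A minimal sketch on made-up values (toy.scores is illustrative only):

```r
# Made-up values with one missing entry
toy.scores <- c(10, 20, NA, 30, 40, 50)

# Route 1: subtract the 25% quantile from the 75% quantile
q1 <- quantile(toy.scores, probs = 0.25, na.rm = TRUE)
q3 <- quantile(toy.scores, probs = 0.75, na.rm = TRUE)
iqr.by.hand <- unname(q3 - q1)  # unname() drops the "75%" label
iqr.by.hand

# Route 2: use the built-in IQR() function
iqr.builtin <- IQR(toy.scores, na.rm = TRUE)
iqr.builtin
```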
💡 The quantile function can also be used to find the value corresponding to a specific percentile, as described in the code below.
# This finds the happiness score corresponding to the 20th percentile
quantile(world.data$happiness_2019, probs = 0.2, na.rm = TRUE)
Using this code as a guide, substitute in some values to try and find a percentile which corresponds to Australia’s happiness score in 2019. Are you surprised by the result?
Note: Don’t spend too long on this - a rough percentile match is sufficient.
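If you would like to check your trial-and-error answer afterwards, one alternative (not the approach this question asks for) is to compute the proportion of observations at or below a given value. The sketch below uses made-up values; toy.scores and target are illustrative names only:

```r
# Made-up values with one missing entry
toy.scores <- c(10, 20, NA, 30, 40, 50)
target <- 40

# toy.scores <= target gives TRUE/FALSE (or NA) for each entry;
# the mean of these is the proportion of available values at or below the target
prop.at.or.below <- mean(toy.scores <= target, na.rm = TRUE)
prop.at.or.below
```

Because quantile interpolates between observations by default, this proportion will generally only be a rough match to the percentile found by trial and error.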
💻 Compute the variance and standard deviation of the 2019 happiness scores by using the var and sd functions respectively.
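As a reminder of how the two measures relate, the standard deviation is the square root of the variance. A short sketch on made-up values (toy.scores is illustrative only):

```r
# Made-up values with one missing entry
toy.scores <- c(2, 4, 4, NA, 4, 5, 5, 7, 9)

# Sample variance and standard deviation, ignoring the NA
toy.var <- var(toy.scores, na.rm = TRUE)
toy.sd  <- sd(toy.scores, na.rm = TRUE)
toy.var
toy.sd

# The standard deviation equals the square root of the variance
all.equal(toy.sd, sqrt(toy.var))
```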
💻 Now that we have considered the quartiles, IQR and variance of our data, we have a clearer understanding of the spread of the data around the mean and median. To visualise this spread, create the following plots using the happiness_2019 data:
Make sure to label your plots clearly.
If you are not sure how to begin, use the code in the code chunk below as a guide.
boxplot(world.data$...,
col = ...,
main = ...,
xlab =..., ylab =...,
horizontal = TRUE)
hist(...)
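To show how the template above might be filled in, here is a sketch using simulated values rather than the lab data. The vector toy.scores, the colours and all labels are assumptions chosen for illustration, so adapt them to your own variable:

```r
# Simulated values for illustration only (NOT the happiness data)
set.seed(1)
toy.scores <- round(rnorm(50, mean = 55, sd = 10), 1)

# Horizontal box plot with a title and axis label
toy.box <- boxplot(toy.scores,
                   col = "lightblue",
                   main = "Box plot of simulated scores",
                   xlab = "Score (simulated)",
                   horizontal = TRUE)

# Histogram of the same values
toy.hist <- hist(toy.scores,
                 col = "lightgreen",
                 main = "Histogram of simulated scores",
                 xlab = "Score (simulated)",
                 ylab = "Frequency")
```

Both functions also return their underlying numbers: toy.box$stats holds the five values drawn in the box plot, and toy.hist$counts holds the bar heights.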
💻 Given the appearance of your plots from 1.13, would you describe the distribution of data for the happiness_2019 variable as symmetric or skewed?
💻 To obtain further evidence regarding the symmetry (or asymmetry) of our data, we can compute the skewness of the data.
To calculate skewness, we will need to use a new package, called psych.
Run the R code below to install and load this package now:
install.packages("psych")
library(psych)
With the psych package loaded, we can use the describe function to compute the skewness value for the average happiness scores for countries in 2019. (Note - the describe function will provide a number of summary statistics, including skew, which is the skewness measure we are after here.)
A skewness value close to 0 suggests that the data is symmetric, while a skewness value far from 0 suggests that the data is skewed.
The describe function provides three slightly different ways for the skewness to be calculated. These can be chosen using the type argument. In this subject, we will be using type = 2. Also note that for the describe function, we do not need to include the na.rm = TRUE argument.
Use the R code below as a starting point, and replace the ...s with the appropriate code:
describe(..., type = 2)
After running the code, you may have noticed that the output was only provided to two decimal places. To increase the number of decimal places, use the as.data.frame function as follows:
as.data.frame(describe(..., type = 2))
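For those curious about what a skewness value measures, the sketch below computes a sample-size-adjusted skewness coefficient in base R. It assumes that type = 2 corresponds to the adjusted coefficient often written \(G_1\); that correspondence, the helper name skew.type2 and the toy vectors are assumptions made for illustration, so treat this as a rough guide rather than a replacement for describe:

```r
# Hypothetical helper: adjusted skewness coefficient (often written G1).
# Assumption: this is the formula selected by type = 2.
skew.type2 <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  n <- length(x)
  dev <- x - mean(x)         # deviations from the mean
  n * sqrt(n - 1) * sum(dev^3) / ((n - 2) * sum(dev^2)^(3 / 2))
}

# Made-up symmetric data: skewness is 0
skew.symmetric <- skew.type2(c(1, 2, 3, 4, 5))
skew.symmetric

# Made-up data with a long right tail: skewness is positive
skew.right <- skew.type2(c(1, 1, 2, 2, 3, 10, NA))
skew.right
```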
💻 One aspect of the data which is hard to grasp using solely a box plot or histogram is the composition, or density, of the data.
Fortunately, a violin plot can provide us with this information, all in the one graph. Run the code below to create a violin plot for the average happiness scores in 2019.
install.packages("vioplot") # install the package
library(vioplot) # load the package
vioplot(world.data$happiness_2019, horizontal = TRUE, col = "cyan")
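The outline of a violin plot is essentially a smoothed density estimate drawn on both sides of a central axis. As a sketch of that idea using only base R, the code below plots a density estimate for simulated values (toy.scores and the labels are illustrative assumptions, not lab data):

```r
# Simulated values for illustration only, with one missing entry
set.seed(2)
toy.scores <- c(rnorm(40, mean = 55, sd = 10), NA)

# density() estimates a smooth curve describing where the values concentrate
toy.density <- density(toy.scores, na.rm = TRUE)

# Plot the estimated density curve
plot(toy.density,
     main = "Density of simulated scores",
     xlab = "Score (simulated)")
```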
💻 Based on your results in 1.8, 1.12, 1.13, 1.15 and 1.16, what measure of location and what measure of spread do you think would be best to use, when discussing the 2019 happiness scores data? Make sure to provide a justification for your choices.
💻 In addition to the happiness scores, the happiness_income_2019.csv data set also contains the average income per person for each country - i.e. the GDP (gross domestic product) per person, adjusted for purchasing power differences (Gapminder 2021b).
Create the following plots for the income_2019 variable, to practise your new skills:
Comment on your results.
Hint: You can use your work from 1.13 and 1.16 as a guide.
💻 In this question, we will broaden our assessment of the data set introduced in question 1.
So far, we have found that happiness scores and average income scores differ quite markedly between countries, with Australia enjoying both a higher average level of happiness and average income than most other countries in 2019. Suppose that we are now interested in determining whether or not there is a relationship between a country’s GDP per person and its citizens’ average happiness.
💻 The R cov function can be used to find the covariance between two variables. Run the code below to calculate the covariance between the GDP per capita and the happiness score for countries in 2019.
Note that in order to ignore the NA values, here we use the argument use = "complete.obs". Make sure to include the quotation marks.
cov(world.data$income_2019, world.data$happiness_2019, use = "complete.obs")
What does this result tell us about the relationship between these two variables?
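To see what use = "complete.obs" does, the sketch below uses two made-up vectors (toy.x and toy.y are illustrative only). Only the pairs in which both values are available are used in the calculation:

```r
# Made-up paired values; the 4th and 5th pairs each have a missing entry
toy.x <- c(1, 2, 3, 4, NA)
toy.y <- c(2, 4, 6, NA, 10)

# By default, any missing value makes the result NA
cov.default <- cov(toy.x, toy.y)
cov.default

# With use = "complete.obs", only the complete pairs (1,2), (2,4), (3,6) are used
cov.complete <- cov(toy.x, toy.y, use = "complete.obs")
cov.complete
```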
💻 While the covariance value is helpful, it is hard to interpret. Typically, it is more beneficial to calculate the correlation coefficient for two variables. Use the R cor function, and the code above in 2.1 as a guide, to calculate the correlation coefficient for the GDP per capita and the happiness score for countries in 2019.
Does the result seem reasonable? How would you describe the correlation in terms of strength?
Note: There is actually more than one way to calculate the correlation, depending on what type of data we are assessing. We will discuss this later on in the semester, but for now it is sufficient to use the default method in the cor function.
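One reason the covariance is hard to interpret is that it depends on the units of measurement, whereas the correlation does not. The sketch below illustrates this on made-up values (toy.x and toy.y are illustrative only) by rescaling one variable by a factor of 1000:

```r
# Made-up paired values
toy.x <- c(1, 2, 3, 4, 5)
toy.y <- c(2, 1, 4, 3, 5)

# Covariance changes when toy.x is expressed in different units
cov.original <- cov(toy.x, toy.y)
cov.rescaled <- cov(toy.x * 1000, toy.y)
c(cov.original, cov.rescaled)

# Correlation is unchanged by the rescaling
cor.original <- cor(toy.x, toy.y)
cor.rescaled <- cor(toy.x * 1000, toy.y)
c(cor.original, cor.rescaled)

# Correlation is the covariance divided by the product of the standard deviations
cor.by.hand <- cov.original / (sd(toy.x) * sd(toy.y))
cor.by.hand
```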
💻 To help visualise the data, create a scatter plot of the 2019 variables using the R plot function. Make sure to consider which axis each variable should be plotted on, and label your axes and plot clearly. You can use the code below to get started.
plot(x = ..., y = ...)
Are you surprised by anything shown in the graph?
Hints: If you are not sure which variable should be used for each axis, remember that the variable listed on the y-axis is (generally speaking) reliant to some extent upon the variable listed on the x-axis.
Since we are assessing income and happiness, do you think it would be more reasonable to say income is reliant on happiness, or that happiness is reliant on income? (Of course it is more nuanced than this, but for the purposes of this question, these are the only two variables under consideration).
If you are still not quite sure, or would like to check if you are on the right track, you can also refer back to the Topic 2 section on scatter plots.
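As an example of a fully labelled scatter plot, the sketch below uses R's built-in cars data set (car speed and stopping distance), which is unrelated to the lab data. Stopping distance depends on speed, so speed goes on the x-axis; the title and labels are illustrative choices:

```r
# Built-in example data: speed (mph) and stopping distance (ft) for 50 cars
head(cars)

# The variable thought to depend on the other goes on the y-axis
plot(x = cars$speed, y = cars$dist,
     main = "Stopping distance against speed",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)")
```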
💻 Once you have produced an image in R, such as your scatter plot from 2.3, you might like to save it.
There are several ways to save images in R. We will take a look at three different approaches.
Plots Panel
💻 If your image is within the Plots Panel of the RStudio interface, then simply navigate to this panel, click the Export button, and then select Save as Image... to save your image as a .png or .jpg. See the screenshot below for an example:
Figure 3.1: Saving an image in the RStudio Plots panel.
Once you have selected this, there will be more options from which to choose, such as the dimensions and the save location. Once you have decided on these options, simply click Save to save your image.
Note that there is also the option to save your image as a pdf, but for the purposes of this subject, we recommend you avoid using this option.
💻 If you have plotted your image in a separate Graphics device (for example you may have used either the windows() or quartz() command directly prior to producing the image), then the way in which you save your image is slightly different to the method presented in 3.1.
When saving images in a separate Graphics device, navigate to the top panel of the Graphics device, and select File -> Save as... You will then be presented with different file types. Select the file type of your choice. A window will then appear allowing you to specify the save location of your image.
See the screenshot below for an example:
Figure 3.2: Saving an image from a Graphics device.
Note that saving images in this way offers a wider variety of file types compared to the method outlined in 3.1.
💻 Finally, we can save images using R code! As long as the image has been displayed within a separate Graphics device, we can save it using the savePlot function.
This can be a more practical option when many plots are being produced and need to be saved.
The savePlot function can be used with the arguments filename (to name the file), and type (to specify the file type).
Take a look at the R code chunk below for an example:
savePlot(filename = "Scatter Plot", type = "png")
# This will save the image currently shown in the graphics device
Note that we have specified the file type png here, but there are many other file type options available.
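The savePlot function is only available for certain graphics devices (for example the windows() device), so it may not work on every computer. A commonly used alternative, sketched below, is to open a png file device, draw the plot, and then close the device with dev.off. The file name, image size and use of the built-in cars data are illustrative assumptions, and the file is written to a temporary folder here purely so that the example is self-contained:

```r
# Illustrative output location: a temporary folder (replace with your own path)
out.file <- file.path(tempdir(), "toy_scatter_plot.png")

# Open a png file device; everything plotted next is drawn into the file
png(filename = out.file, width = 800, height = 600)

plot(x = cars$speed, y = cars$dist,
     main = "Stopping distance against speed",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)")

# Close the device to finish writing the file
dev.off()

# Confirm that the file was created
file.exists(out.file)
```

With this approach, the plot is not shown on screen; it is written straight to the file.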
💻 Try saving the scatter plot you produced in 2.3 using the three methods discussed above.
💻 If you have completed all the questions above, and would like some extra practice, try the following questions:
💻 Download the happiness_income_2015.csv file in the Week 3 Module in LMS, save it in a relevant location on your PC, and load it into R using the code below:
world.data.2015 <- read.csv("happiness_income_2015.csv", header = TRUE)
world.data.2015 <- na.omit(world.data.2015)
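The na.omit function used above removes every row that contains at least one NA, which is why the na.rm and use arguments are no longer needed afterwards. A minimal sketch on a made-up data frame (toy.data and its values are illustrative only):

```r
# Made-up illustrative data, NOT the real 2015 data
toy.data <- data.frame(
  country   = c("A", "B", "C", "D"),
  income    = c(1760, 12700, NA, 17500),
  happiness = c(25.7, NA, 50.1, 59.7)
)

# Rows B and C each contain an NA, so both are dropped
toy.complete <- na.omit(toy.data)
toy.complete

# With no NA values left, mean() works without na.rm = TRUE
toy.mean <- mean(toy.complete$happiness)
toy.mean
```

Keep in mind that a row is dropped even if only one of its variables is missing, so summaries of a single variable may be based on fewer observations than when na.rm = TRUE is used.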
💻 Repeat steps 1.5 to 1.18 in question 1, using the 2015 data. Are your results similar to those for 2019?
💻 Repeat question 2 using the 2015 data. Are your findings similar or very different to those for 2019?
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The notes on saving plots in R are adapted from notes originally written by Amanda Shaker as a supplement to a workshop hosted by the Statistics Consultancy Platform, entitled Basic Statistics with R, first held at La Trobe University in February 2018. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.