Topic 2: Descriptive Statistics

These are the solutions for Computer Lab 3, in which we assessed happiness and income data from Gapminder (see Gapminder 2021a and Gapminder 2021b).

1 Descriptive statistics and plots for a single variable

1.1

No solution needed.

1.2

No solution needed.

1.3

world.data <- read.csv("happiness_income_2019.csv", header = TRUE)

1.4

head(world.data)

##   X     country income_2019 happiness_2019
## 1 1 Afghanistan        1760           25.7
## 2 2     Albania       12700           48.8
## 3 3     Algeria       14000           50.1
## 4 4      Angola        5540             NA
## 5 5   Argentina       17500           59.7
## 6 6     Armenia        9730           46.8

1.5

mean(world.data$happiness_2019)

## [1] NA
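The `NA` result arises because `mean()` returns `NA` whenever its input contains a missing value, and at least one country (Angola, in the output of 1.4) has no recorded happiness score. As a quick check, the sketch below uses only the six rows printed in 1.4. The vector name `happiness` is introduced here purely for illustration; on the full data, `world.data$happiness_2019` would take its place.

```r
# Happiness scores from the six rows shown by head(world.data) in 1.4
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Count how many values are missing
n_missing <- sum(is.na(happiness))
n_missing

# The mean is NA as long as any value is missing
mean(happiness)
```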

1.6

mean(world.data$happiness_2019, na.rm = TRUE)

## [1] 54.67086

The average happiness score for countries in 2019 was approximately 54.671. This is much lower than Australia’s average happiness score of 72.2.

1.7

median(world.data$happiness_2019, na.rm = TRUE)

## [1] 55.1

The median happiness score in 2019 was 55.1. This means that roughly half of the countries had a happiness score at or below 55.1 in 2019, while roughly half had a happiness score at or above 55.1.
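To see this interpretation in action, we can compute the proportion of values strictly below and strictly above the median. The sketch below uses only the six rows printed in 1.4 (the names `happiness`, `med`, `prop_below` and `prop_above` are introduced here for illustration). With an odd number of non-missing values the median is itself a data point, so each proportion comes out slightly under one half.

```r
# Happiness scores from the six rows shown by head(world.data) in 1.4
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Median of the five non-missing values
med <- median(happiness, na.rm = TRUE)
med

# Proportion of non-missing values strictly below / above the median
prop_below <- mean(happiness < med, na.rm = TRUE)
prop_above <- mean(happiness > med, na.rm = TRUE)
c(below = prop_below, above = prop_above)
```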

1.8

quantile(world.data$happiness_2019, na.rm = TRUE)

##   0%  25%  50%  75% 100% 
## 25.7 47.0 55.1 62.3 78.1

The minimum and maximum average happiness scores in 2019 were 25.7 and 78.1, respectively.

The \(25\%\), \(50\%\) and \(75\%\) quantiles were 47.0, 55.1, and 62.3 respectively. Note that the \(50\%\) quantile is equivalent to the median value.

Based on these results, we can see that Australia falls in the top quartile, i.e. Australia’s average happiness score in 2019 is greater than that of (at least) \(75\%\) of the countries surveyed. We can also see that the scores are bunched up around the median value.
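The equivalence between the \(50\%\) quantile and the median can be checked directly. The following is a minimal sketch on the six rows printed in 1.4 (the names `happiness` and `q` are introduced here for illustration); the same comparison should hold for `world.data$happiness_2019`.

```r
# Happiness scores from the six rows shown by head(world.data) in 1.4
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Default quantiles: 0%, 25%, 50%, 75% and 100%
q <- quantile(happiness, na.rm = TRUE)
q

# The 50% quantile agrees with the median
all.equal(q[["50%"]], median(happiness, na.rm = TRUE))
```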

1.9

We can compute the minimum and maximum average happiness scores in 2019 using the following code.

min(world.data$happiness_2019, na.rm = TRUE)

## [1] 25.7

max(world.data$happiness_2019, na.rm = TRUE)

## [1] 78.1

So the minimum average happiness score in 2019 was 25.7, while the maximum average happiness score in 2019 was 78.1. Note that these values match those obtained in 1.8.
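As an aside, `range()` returns the minimum and maximum in a single call, and `diff()` of that result gives the width of the range. A small sketch on the six rows printed in 1.4 (the names `happiness` and `happiness_range` are introduced here for illustration):

```r
# Happiness scores from the six rows shown by head(world.data) in 1.4
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Minimum and maximum in one call
happiness_range <- range(happiness, na.rm = TRUE)
happiness_range

# Width of the range (maximum minus minimum)
diff(happiness_range)
```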

1.10

We can calculate the IQR by hand, where \(Q_1\) and \(Q_3\) are the \(25\%\) and \(75\%\) quantiles found in 1.8, noting that \[\begin{aligned} \text{IQR} &= Q_3 - Q_1 \\ &= 62.3 - 47.0 = 15.3. \end{aligned}\] Alternatively, in R, we can run the following code:

IQR(world.data$happiness_2019, na.rm = TRUE)

## [1] 15.3

Just as we noticed in 1.8 above, the IQR is rather narrow, with \(50\%\) of the values bunched up around the median value.
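The by-hand calculation and the `IQR()` function can be compared in code. The sketch below does this for the six rows printed in 1.4 (the names `happiness`, `q1`, `q3`, `iqr_by_hand` and `iqr_builtin` are introduced here for illustration).

```r
# Happiness scores from the six rows shown by head(world.data) in 1.4
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# First and third quartiles
q1 <- quantile(happiness, probs = 0.25, na.rm = TRUE)
q3 <- quantile(happiness, probs = 0.75, na.rm = TRUE)

# IQR by hand (unname() drops the "75%" label) and via IQR()
iqr_by_hand <- unname(q3 - q1)
iqr_builtin <- IQR(happiness, na.rm = TRUE)

all.equal(iqr_by_hand, iqr_builtin)
```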

1.11

For this question, it is unlikely you will find the exact percentile for Australia (let us know if you think you have!). Example R code for different guesses is displayed below:

quantile(world.data$happiness_2019, probs = 0.775, na.rm = TRUE)

##  77.5% 
## 63.125

quantile(world.data$happiness_2019, probs = 0.8, na.rm = TRUE)

##  80% 
## 63.8

quantile(world.data$happiness_2019, probs = 0.85, na.rm = TRUE)

##   85% 
## 65.65

quantile(world.data$happiness_2019, probs = 0.9, na.rm = TRUE)

##  90% 
## 70.9

quantile(world.data$happiness_2019, probs = 0.95, na.rm = TRUE)

##   95% 
## 72.95

Remembering that Australia’s 2019 score was 72.2, we can see that this falls somewhere between the \(90^{\text{th}}\) and \(95^{\text{th}}\) percentiles. It may be somewhat surprising to find that the average happiness level in Australia in 2019 was higher than in over \(90\%\) of the countries in the data set.
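Rather than guessing percentiles, one possible shortcut is to compute the proportion of countries whose score is at or below a given value. The sketch below illustrates the idea on the six rows printed in 1.4, using Algeria's score of 50.1 as the reference value (the names `happiness`, `score` and `prop_at_or_below` are introduced here for illustration). On the full data we would use `world.data$happiness_2019` and 72.2 instead. Note that `quantile()` interpolates between data points by default, so this proportion need not match a `quantile()` guess exactly.

```r
# Happiness scores from the six rows shown by head(world.data) in 1.4
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Reference value (Algeria's score in the six-row extract)
score <- 50.1

# Proportion of non-missing scores at or below the reference value
prop_at_or_below <- mean(happiness <= score, na.rm = TRUE)
prop_at_or_below
```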

1.12

Example R code is provided below:

var.happy <- var(world.data$happiness_2019, na.rm = TRUE)
var.happy

## [1] 124.8317

sd.happy <- sqrt(var.happy)
sd.happy

## [1] 11.17281

The variance of the 2019 happiness scores is 124.8317, and thus the standard deviation is 11.17281.
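For reference, `var()` computes the sample variance, which divides the sum of squared deviations by \(n - 1\), and `sd()` returns its square root directly. The sketch below checks both on the six rows printed in 1.4 (the names `happiness`, `x`, `n` and `var_by_hand` are introduced here for illustration).

```r
# Happiness scores from the six rows shown by head(world.data) in 1.4
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Drop the missing value before computing by hand
x <- happiness[!is.na(happiness)]
n <- length(x)

# Sample variance: sum of squared deviations divided by (n - 1)
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Both comparisons should return TRUE
all.equal(var_by_hand, var(happiness, na.rm = TRUE))
all.equal(sqrt(var_by_hand), sd(happiness, na.rm = TRUE))
```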

1.13

boxplot(world.data$happiness_2019, horizontal = TRUE, col = "lightblue",
        xlab = "Happiness Score", main = "Happiness Score Box plot for 2019")

hist(world.data$happiness_2019, main = "Average Happiness Score 2019",
     xlab = "Happiness Score", col = "lightblue")

1.14

Based on the box plot and histogram, the data appears to be roughly symmetric.

1.15

install.packages("psych") # install the package (only needed once)
library(psych) # load the package

With the psych package loaded, we can compute the skewness value for the average happiness scores for countries in 2019 as follows:

as.data.frame(describe(world.data$happiness_2019, type = 2))

##    vars   n     mean       sd median  trimmed      mad  min  max range
## X1    1 151 54.67086 11.17281   55.1 54.80248 11.41602 25.7 78.1  52.4
##          skew   kurtosis        se
## X1 -0.0913619 -0.4633164 0.9092304

As the skewness is approximately -0.091, which is close to 0, we can conclude that the data is approximately symmetric.
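To make the skewness value less of a black box, here is a minimal base R sketch of one common sample skewness formula. It assumes that `type = 2` corresponds to the adjusted Fisher-Pearson coefficient \(G_1 = g_1 \sqrt{n(n-1)}/(n-2)\), which is our reading of the `psych` documentation rather than something verified here. The function name `skew_type2` and the example vectors are hypothetical.

```r
# Adjusted Fisher-Pearson skewness (assumed to match type = 2)
skew_type2 <- function(x) {
  x <- x[!is.na(x)]            # drop missing values
  n <- length(x)
  d <- x - mean(x)             # deviations from the mean
  m2 <- sum(d^2) / n           # second central moment
  m3 <- sum(d^3) / n           # third central moment
  g1 <- m3 / m2^(3 / 2)        # unadjusted skewness
  g1 * sqrt(n * (n - 1)) / (n - 2)
}

# A perfectly symmetric (made-up) sample has skewness 0
skew_sym <- skew_type2(c(1, 2, 3, 4, 5))
skew_sym

# A made-up sample with one large value is positively skewed
skew_pos <- skew_type2(c(1, 2, 2, 3, 50))
skew_pos
```

Values close to 0 point to approximate symmetry, while clearly positive or negative values point to right or left skew respectively.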

1.16

install.packages("vioplot") # install the package
library(vioplot) # load the package
vioplot(world.data$happiness_2019, horizontal = TRUE,
        main = "Violin Plot for 2019 Happiness Score",
        xlab = "Happiness Score", col = "lightblue")

1.17

The violin plot of the happiness scores suggests that the data may in fact be roughly symmetrical, despite the mean and median not being equal.

Since the distribution of the data appears to be (roughly) symmetric, the mean and standard deviation are appropriate measures of location and spread respectively, and are preferable here to the median and IQR.

1.18

Example results and code for the income_2019 data are shown below.

boxplot(world.data$income_2019, horizontal = TRUE, col = "orange",
        xlab = "Average Income", main = "GDP per Capita Box plot for 2019")

The GDP per capita data is clearly positively skewed, with some large positive outliers. In comparison, the happiness score data looks almost symmetrical.

hist(world.data$income_2019, main = "GDP per Capita 2019",
     xlab = "Average Income", col = "orange", breaks = 20)

This histogram clearly supports our previous conclusion that the GDP per capita data is positively skewed.

vioplot(world.data$income_2019, horizontal = TRUE,
        main = "Violin Plot for 2019 GDP per capita",
        xlab = "Average income per person", col = "orange")

The violin plot of the GDP per capita supports our previous conclusions.
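For positively skewed data such as this, a few large values pull the mean above the median, which is one reason the median and IQR are usually preferred as summaries in this situation. The sketch below illustrates the effect on a small made-up vector (`toy_income` is hypothetical and is not taken from the Gapminder data).

```r
# Hypothetical, made-up incomes with one very large value
toy_income <- c(1000, 2000, 2000, 3000, 50000)

# The large value drags the mean well above the median
toy_mean <- mean(toy_income)
toy_median <- median(toy_income)
c(mean = toy_mean, median = toy_median)
```

On the lab data, the same comparison could be made with `mean(world.data$income_2019, na.rm = TRUE)` and `median(world.data$income_2019, na.rm = TRUE)`.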

2 Descriptive statistics and plots to assess the relationship between two numeric variables

2.1

cov(world.data$income_2019, world.data$happiness_2019, use = "complete.obs")

## [1] 157188

The covariance between average income and average happiness for 2019 is 157188. While the magnitude of this number is not too informative, because it depends on the units of both variables, its positive sign does tell us that the relationship between income and happiness score is positive (which is not too surprising!).
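One way to see why the size of a covariance is hard to interpret is to change the units of one variable. In the sketch below, which uses only the six rows printed in 1.4 (the names `income`, `happiness`, `cov_original` and `cov_thousands` are introduced here for illustration), measuring income in thousands shrinks the covariance by a factor of 1000, although the underlying relationship is unchanged.

```r
# Income and happiness from the six rows shown by head(world.data) in 1.4
income <- c(1760, 12700, 14000, 5540, 17500, 9730)
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Covariance with income in its original units
cov_original <- cov(income, happiness, use = "complete.obs")

# Covariance with income measured in thousands
cov_thousands <- cov(income / 1000, happiness, use = "complete.obs")

# The covariance scales with the units; only its sign is stable
all.equal(cov_thousands, cov_original / 1000)
```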

2.2

cor(world.data$income_2019, world.data$happiness_2019, use = "complete.obs")

## [1] 0.7389331

The correlation coefficient between average income and average happiness for 2019 is roughly 0.7389. We could describe this as a moderate to strong positive correlation, which intuitively makes sense: countries with a higher average income tend to have a higher average happiness score.
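The correlation coefficient is the covariance divided by the product of the two standard deviations, which removes the dependence on units. The sketch below verifies this on the six rows printed in 1.4 (the names `income`, `happiness`, `ok`, `r_by_hand` and `r_builtin` are introduced here for illustration). Only rows with both values present are used, mirroring `use = "complete.obs"`.

```r
# Income and happiness from the six rows shown by head(world.data) in 1.4
income <- c(1760, 12700, 14000, 5540, 17500, 9730)
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Keep only the rows where both values are present
ok <- complete.cases(income, happiness)

# Correlation = covariance / (sd of x * sd of y)
r_by_hand <- cov(income[ok], happiness[ok]) /
  (sd(income[ok]) * sd(happiness[ok]))
r_builtin <- cor(income, happiness, use = "complete.obs")

all.equal(r_by_hand, r_builtin)
```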

2.3

plot(world.data$income_2019, world.data$happiness_2019, pch = 21, bg = "chartreuse3", 
     cex = 1.2, main = "Happiness Score versus GDP per capita",
     xlab = "Average income per person", ylab = "Average happiness score")

One thing that may be surprising is that, at quite low values of average income per person, the average happiness scores vary greatly, with some values even being above the mean and median happiness scores.

It is also worth noting that an increase in income appears to offer diminishing returns with respect to happiness once a certain income level is reached. For example, some of the points in the top right of the graph lie below points to their left.

3 Saving Plots in RStudio

3.1 Saving Images in the RStudio `Plots` Panel

No answer required.

3.2 Saving Images in a separate Graphics device

No answer required.

3.3 Saving images using code

No answer required.

3.4

Example R code for this question could be

savePlot(filename = "Scatter Plot", type = "jpg")
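As far as we are aware, `savePlot()` copies the plot currently shown on screen and is only available for certain screen devices, so it may not work on every system. A possible alternative, sketched below, is to open a JPEG graphics device with `jpeg()`, draw the plot, and then close the device with `dev.off()`. The sketch uses the six rows printed in 1.4 and writes to a temporary folder; the file name `scatter_plot.jpg`, the image size and the name `out_file` are assumptions made for illustration.

```r
# Income and happiness from the six rows shown by head(world.data) in 1.4
income <- c(1760, 12700, 14000, 5540, 17500, 9730)
happiness <- c(25.7, 48.8, 50.1, NA, 59.7, 46.8)

# Hypothetical output location inside R's temporary folder
out_file <- file.path(tempdir(), "scatter_plot.jpg")

# Open a JPEG device, draw the plot, then close the device
jpeg(filename = out_file, width = 800, height = 600)
plot(income, happiness, pch = 21, bg = "chartreuse3", cex = 1.2,
     main = "Happiness Score versus GDP per capita",
     xlab = "Average income per person", ylab = "Average happiness score")
dev.off()

# Confirm that the file was written
file.exists(out_file)
```

The file is only written once `dev.off()` has been called, so that line should not be left out.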

4 Extension: Gapminder 2015 Data

Check with your lab demonstrator if you would like to discuss your results for this question.

If there were any parts you were unsure about, take a look back over the relevant sections of the Topic 2 material.

References

Gapminder. 2021a. “Happiness Score (WHR) [.csv File].” 2021. http://gapm.io/dhapiscore_whr.

Gapminder. 2021b. “Income Per Person [.csv File].” 2021. http://gapm.io/dgdppc.

These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

STM1001: Computer Lab 3 Solutions
