Exercise 3 presentation

2024-01-23

Exercise 3.1.

Create a new program to study this data. Download the dataset and save it with a short meaningful name.
Are there any issues with the data which we need to take into account?
Plot the data in a similar style to Figure 3.1, but don’t worry about the flags or the statistics in the upper left-hand corner.
What sort of relationship does there appear to be between the two variables?
Calculate the Pearson correlation coefficient and the Spearman rank correlation coefficient between the two variables.
What conclusion can we draw from those coefficients?

Download the dataset

cnd <- read.csv("Chocolate_Nobel_data.csv")
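To answer the question about issues with the data, a few quick checks help. The sketch below runs on a small stand-in data frame (cnd_demo, a hypothetical name): four rows are copied from the table shown later in these slides, and the row "Exampleland" is made up to show what a problem row looks like. The same calls could be applied to cnd.

```r
# Stand-in for cnd: four real rows from the table later in the slides,
# plus one invented incomplete row ("Exampleland") for illustration only.
cnd_demo <- data.frame(
  country   = c("Germany", "Switzerland", "Norway", "UK", "Exampleland"),
  choc_kg   = c(12.2, 11.7, 9.6, 8.9, NA),
  year      = c(2013, 2014, 2013, 2013, 2010),
  Nobel_no  = c(105, 25, 13, 125, 0),
  Nobel_10m = c(13.013, 30.125, 24.947, 19.315, 0)
)

str(cnd_demo)       # column types
summary(cnd_demo)   # ranges of the numeric columns, including NA counts

# Missing values per column
n_missing <- colSums(is.na(cnd_demo))
n_missing

# Do all rows refer to the same year?
years <- table(cnd_demo$year)
years

# Zeros matter later: log(0) is -Inf
n_zero <- sum(cnd_demo$Nobel_10m == 0, na.rm = TRUE)
n_zero
```

Mixed years, missing values and zeros are the kind of issues to look for before plotting or taking logs.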

Plot it
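A minimal sketch of such a plot, using country names as labels instead of flags. It runs on six rows copied from the table shown later in these slides; cnd_demo is a hypothetical name, and the axis labels are my own wording.

```r
# Six rows copied from the table later in the slides (stand-in for cnd)
cnd_demo <- data.frame(
  country   = c("Germany", "Switzerland", "Norway", "UK", "Italy", "Spain"),
  choc_kg   = c(12.2, 11.7, 9.6, 8.9, 3.9, 3.4),
  Nobel_10m = c(13.013, 30.125, 24.947, 19.315, 3.345, 1.735)
)

# Set up empty axes first, then write the country names at the data points
plot(cnd_demo$choc_kg, cnd_demo$Nobel_10m,
     type = "n",
     xlab = "Chocolate consumption (choc_kg)",
     ylab = "Nobel laureates per 10 million population")
text(cnd_demo$choc_kg, cnd_demo$Nobel_10m,
     labels = cnd_demo$country, cex = 0.8)
```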

Calculate the correlation coefficients

## The Pearson correlation coefficient is: 0.7575217

## The Spearman correlation coefficient is: 0.8403233

Conclusion:

There is a strong positive monotonic relationship between Nobel laureates per 10 million population and chocolate consumption.
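Why "monotonic"? The Spearman coefficient is the Pearson coefficient computed on the ranks of the data, so it measures how consistently one variable increases with the other, whether or not the relationship is a straight line. A Spearman value above the Pearson value (0.84 against 0.76 here) is consistent with a relationship that is monotonic but not exactly linear. The sketch below checks the rank property on six rows copied from the table shown later in these slides; the variable names are my own.

```r
# Six rows copied from the table later in the slides
choc  <- c(12.2, 11.7, 9.6, 8.9, 3.9, 3.4)
nobel <- c(13.013, 30.125, 24.947, 19.315, 3.345, 1.735)

pearson_raw   <- cor(choc, nobel, method = "pearson")
spearman_raw  <- cor(choc, nobel, method = "spearman")

# Spearman is Pearson applied to the ranks
pearson_ranks <- cor(rank(choc), rank(nobel), method = "pearson")

# A monotonic transformation such as log() keeps the ranks,
# so it leaves Spearman unchanged but can change Pearson
spearman_log <- cor(log(choc), log(nobel), method = "spearman")
pearson_log  <- cor(log(choc), log(nobel), method = "pearson")

c(pearson_raw = pearson_raw, pearson_log = pearson_log,
  spearman_raw = spearman_raw, spearman_log = spearman_log)
```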

Exercise 3.2.

Create a plot containing four subplots (use par(mfrow = …)):

the plot from Exercise 3.1 plus versions where log() transformations have been taken of the two variables,

individually and
jointly.
Which plot appears most linear? Calculate the two correlation coefficients from Exercise 3.1 for each of the four plots and insert them into the bodies of the plots.

Plot them!
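One possible way to build the four panels, sketched on six rows copied from the table shown later in these slides. The helper plot_with_cor() is a hypothetical name of my own, not part of any package; it draws one scatter plot and writes both coefficients into the plot with legend(). Note that log() needs strictly positive values, so zeros in the full data set would have to be handled first.

```r
# Six rows copied from the table later in the slides (stand-in for cnd)
cnd_demo <- data.frame(
  choc_kg   = c(12.2, 11.7, 9.6, 8.9, 3.9, 3.4),
  Nobel_10m = c(13.013, 30.125, 24.947, 19.315, 3.345, 1.735)
)

# Hypothetical helper: scatter plot with both coefficients in the top-left corner
plot_with_cor <- function(x, y, xlab, ylab) {
  plot(x, y, xlab = xlab, ylab = ylab, pch = 19)
  r_p <- cor(x, y, method = "pearson")
  r_s <- cor(x, y, method = "spearman")
  legend("topleft", bty = "n",
         legend = c(sprintf("Pearson: %.3f", r_p),
                    sprintf("Spearman: %.3f", r_s)))
  invisible(c(pearson = r_p, spearman = r_s))
}

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of plots
r_raw  <- with(cnd_demo, plot_with_cor(choc_kg, Nobel_10m,
                                       "choc_kg", "Nobel_10m"))
r_logx <- with(cnd_demo, plot_with_cor(log(choc_kg), Nobel_10m,
                                       "log(choc_kg)", "Nobel_10m"))
r_logy <- with(cnd_demo, plot_with_cor(choc_kg, log(Nobel_10m),
                                       "choc_kg", "log(Nobel_10m)"))
r_both <- with(cnd_demo, plot_with_cor(log(choc_kg), log(Nobel_10m),
                                       "log(choc_kg)", "log(Nobel_10m)"))
par(old_par)                      # restore the previous layout
```

The Spearman value is the same in all four panels because log() does not change the ranks; only the Pearson value tells the panels apart.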

Conclusion drawn:

There is a strong positive association between chocolate consumption and the number of Nobel laureates per 10 million population across countries.

Does chocolate consumption improve intellectual output?

A correlation between X and Y does not prove causation but indicates that:

either X influences Y,
Y influences X,
or X and Y are influenced by a common underlying mechanism.

Exercise 3.3.

Apply your favorite transformation from Exercise 3.2 to the data,

firstly excluding Brazil and China, and
secondly excluding all dates before 2013.
Produce a pair of plots similar to those in Exercise 3.2.
What conclusions can we draw from our investigations?
Does chocolate consumption improve intellectual output (for which Nobel Prizes might be considered a proxy)?
If so, why? If not, what can we conclude?

Exclude Brazil and China

class(cnd)

## [1] "data.frame"

cnd_ex <- cnd[!(cnd$country %in% c("Brazil", "China")), ]

Exclude all dates before 2013

cnd2013 <- cnd_ex[cnd_ex$year >= 2013, ]

##        country choc_kg year Nobel_no Nobel_10m
## 1      Germany    12.2 2013      105    13.013
## 2  Switzerland    11.7 2014       25    30.125
## 3       Norway     9.6 2013       13    24.947
## 4           UK     8.9 2013      125    19.315
## 5      Austria     8.8 2013       21    24.577
## 6      Denmark     7.6 2013       14    24.695
## 7      Finland     7.2 2013        4     7.268
## 8      Belgium     6.9 2013       10     8.850
## 9       France     6.7 2013       61     9.473
## 10      Sweden     6.2 2013       30    30.677
## 11   Lithuania     5.8 2013        1     3.474
## 12       Italy     3.9 2013       20     3.345
## 13       Czech     3.6 2013        5     4.742
## 14       Spain     3.4 2013        8     1.735
## 15    Portugal     2.9 2013        2     1.932
## 16     Hungary     2.7 2013        9     9.132

Plot without any log

Log on choc_kg

Log on Nobel_10m

Log on both

Conclusion drawn:

There is a strong positive association between chocolate consumption and the number of Nobel laureates per 10 million population across countries.

Does chocolate consumption improve intellectual output?

A correlation between X and Y does not prove causation but indicates that:

either X influences Y,
Y influences X,
or X and Y are influenced by a common underlying mechanism.

Exercise 3.4.

Expand the program that you created in Exercise 3.1 to display the two correlation coefficients to the screen.
Send these outputs to a file which has an appropriate heading together with your name and the date and time at the bottom.

sink("output_exercise_3.1.txt")

# Pearson correlation coefficient
pearson <- cor(cnd$choc_kg, cnd$Nobel_10m,
    method = "pearson")
cat("The Pearson correlation coefficient is:", pearson, "\n")

## The Pearson correlation coefficient is: 0.7575217

# Spearman rank correlation coefficient
spearman <- cor(cnd$choc_kg, cnd$Nobel_10m,
    method = "spearman")
cat("The Spearman correlation coefficient is:", spearman, "\n")

## The Spearman correlation coefficient is: 0.8403233

sink()
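The exercise also asks for a heading at the top and a name with date and time at the bottom. A minimal sketch of one way to do this follows. It uses six rows copied from the table shown earlier instead of the full data, writes to a temporary file with a file name of my own choosing, and "Your Name" is a placeholder. With split = TRUE, sink() sends the output to the screen as well as to the file.

```r
# Six rows copied from the table earlier in the slides (stand-in for cnd)
choc  <- c(12.2, 11.7, 9.6, 8.9, 3.9, 3.4)
nobel <- c(13.013, 30.125, 24.947, 19.315, 3.345, 1.735)

pearson  <- cor(choc, nobel, method = "pearson")
spearman <- cor(choc, nobel, method = "spearman")

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "exercise_3_4_output.txt")

sink(out_file, split = TRUE)   # write to the file and show on screen
cat("Exercise 3.4: chocolate consumption and Nobel laureates\n")
cat("=======================================================\n\n")
cat("The Pearson correlation coefficient is:", pearson, "\n")
cat("The Spearman correlation coefficient is:", spearman, "\n\n")
cat("Author: Your Name\n")     # placeholder, replace with your own name
cat("Created:", format(Sys.time(), "%Y-%m-%d %H:%M:%S"), "\n")
sink()                         # stop diverting output

# Read the file back to check what was written
out_lines <- readLines(out_file)
out_lines
```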

Exercise 3.5.

For the country data:

Produce a matrix of scatter plots comparing the three numerical columns of data.

You could either use:

the pairs() function
or explore the car package for a more sophisticated representation.
Are any of the relationships linear? Look for best linear relationships using log() transformations and correlations.

Round the data first to get a general overview in the matrix of plots

cd_pop <- round(cd$population, digits = -5)
cd_gdp <- round(cd$gdp_head, digits = -2)
cd_age <- round(cd$age_median / 5) * 5
cd_cal <- round(cd$kcals_day, digits = -2)

Build the data frame to pass to the pairs() function:

df_cd <- data.frame(
  population = cd_pop,
  gdp = cd_gdp,
  age = cd_age,
  kcal = cd_cal
)

The plot with original data
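A sketch of the pairs() call on a stand-in data frame. The values below are simulated with rnorm() and are not the country data; only the column names and the rounding follow the slides. panel.smooth is a panel function that ships with base graphics and adds a smooth curve to each panel.

```r
set.seed(1)
n <- 50

# Simulated stand-in for the country data (invented values)
gdp_head   <- exp(rnorm(n, mean = 9, sd = 1))
population <- exp(rnorm(n, mean = 16, sd = 1.5))
age_median <- 30 + 5 * (log(gdp_head) - 9) + rnorm(n, sd = 3)
kcals_day  <- 2800 + 300 * (log(gdp_head) - 9) + rnorm(n, sd = 150)

# Same rounding as in the slides
df_demo <- data.frame(
  population = round(population, digits = -5),
  gdp        = round(gdp_head, digits = -2),
  age        = round(age_median / 5) * 5,
  kcal       = round(kcals_day, digits = -2)
)

# Matrix of scatter plots, one panel per pair of columns
pairs(df_demo, panel = panel.smooth)
```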

Now I take logs of the rounded values:

df_cd_log <- data.frame(
  population = log(cd_pop),
  gdp = log(cd_gdp),
  age = log(cd_age),
  kcal = log(cd_cal)
)

The plot with log data
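To look for the best linear relationships with numbers as well as by eye, the correlation matrices before and after the log() transformation can be compared. The sketch below again uses simulated values, not the country data, and the name cd_demo is hypothetical. One thing to watch: log() needs strictly positive input, and rounding a small population to the nearest 100,000 can give 0, so the check before the transformation is worth keeping.

```r
set.seed(1)
n <- 50

# Simulated stand-in for the country data (invented values)
gdp_head <- exp(rnorm(n, mean = 9, sd = 1))
cd_demo <- data.frame(
  population = exp(rnorm(n, mean = 16, sd = 1.5)),
  gdp_head   = gdp_head,
  age_median = 30 + 5 * (log(gdp_head) - 9) + rnorm(n, sd = 3),
  kcals_day  = 2800 + 300 * (log(gdp_head) - 9) + rnorm(n, sd = 150)
)

# log(0) is -Inf and log of a negative number is NaN, so check first
stopifnot(all(cd_demo > 0))

cor_raw <- cor(cd_demo)        # Pearson correlations, original scale
cor_log <- cor(log(cd_demo))   # Pearson correlations, log scale

round(cor_raw, 2)
round(cor_log, 2)
```

A pair whose Pearson coefficient rises clearly after the transformation is a candidate for a closer look.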

Closer look: age and gdp data:

Here I use the unrounded data, on a log scale

Closer look: age and kcal data:

Closer look: gdp and kcal data:

Combine them

Play with the car package

Step 0: preparatory step

library(car)

## Loading required package: carData

Plot the median age against GDP:

Note:

How can I find the point with the largest residual?

First, I use the max() function to find the value of the largest residual.
Then I look that value up in the vector of residuals (model_age_gdp$residuals) to find which point it belongs to (number 19).
This works because the residuals are stored in the same order as the rows of the data frame (df_cd_log).
I can then plot the 19th row on its own (x-value log(gdp_head), y-value log(age_median)).
For this I use the original data, not the rounded data in the data frame I created!
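The two steps in the note can be combined with which.max(). The sketch below uses simulated data with an outlier planted in row 19, so the values are invented; only the names model_age_gdp, gdp_head and age_median follow the slides. I take abs() of the residuals so that a large negative residual would be found as well, which is an extension of the max() approach in the note. If rows had been dropped because of missing values, positions would no longer match row numbers, and names(res) would be the safer lookup.

```r
set.seed(42)
n <- 30

# Simulated stand-in for the country data (invented values)
cd_demo <- data.frame(gdp_head = exp(rnorm(n, mean = 9, sd = 1)))
cd_demo$age_median <- exp(1.5 + 0.2 * log(cd_demo$gdp_head) +
                            rnorm(n, sd = 0.05))
cd_demo$age_median[19] <- cd_demo$age_median[19] * 2  # planted outlier

model_age_gdp <- lm(log(age_median) ~ log(gdp_head), data = cd_demo)

# Position of the residual with the largest absolute value
res   <- residuals(model_age_gdp)
i_max <- which.max(abs(res))
i_max

# Scatter plot with the fitted line and the extreme point highlighted
plot(log(cd_demo$gdp_head), log(cd_demo$age_median),
     xlab = "log(gdp_head)", ylab = "log(age_median)")
abline(model_age_gdp)
points(log(cd_demo$gdp_head[i_max]), log(cd_demo$age_median[i_max]),
       col = "red", pch = 19)
```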

Using car, we can also draw a residual plot

Interpretation of the residual plot:

Solid Line:

A solid horizontal line at y = 0 is often used as a reference line to indicate the expected value of residuals when the model is accurate.

In an ideal situation, residuals should be randomly distributed around this line with no discernible pattern.

Dashed Lines:

Dashed lines typically show a smoothed trend through the residuals. A clear curve or slope in such a line suggests structure that the model has missed.

Shaded area:

In a plot produced by the scatterplot() function from the car package, the shaded area belongs to the smoothed line. As I understand the car documentation, it shows the estimated spread of the points around the smooth rather than a formal confidence band for the residuals, so it gives a visual indication of how far points typically lie from the fitted trend.
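The solid and dashed lines described above can also be drawn by hand, which makes it clearer what each one is. This is a base R sketch and not the car function itself; the data are simulated from a truly linear model, so the dashed smooth should stay close to the solid line at zero.

```r
set.seed(7)

# Simulated data from a linear model (invented values)
x <- runif(60, min = 1, max = 10)
y <- 2 + 0.5 * x + rnorm(60, sd = 0.5)
fit <- lm(y ~ x)

# Residuals against fitted values
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)                                        # solid reference line
lines(lowess(fitted(fit), residuals(fit)), lty = 2)  # dashed smoothed trend
```

Replacing the linear model for y with a curved one, for example adding a term in x^2 to the simulation but not to lm(), makes the dashed line bend away from zero.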