Exercise 3.1.

Download the dataset and read it into R

cnd <- read.csv ("Chocolate_Nobel_data.csv")

Plot it

plot (cnd$choc_kg, cnd$Nobel_10m, 
      yaxt = "n", ylim = c (-5, 35),
      xlim = c (0, 15),
      xlab = "Chocolate Consumption (kg/yr/capita)",
      ylab = "Nobel Laureates per 10 Million Population")

axis (2, 
      at = c (-5, 0, 5, 10, 15, 20, 25, 30, 35), 
      labels = c ("", "0", "5", "10", "15", "20", "25", "30", "35")
      )

abline (h = 0, col = "red", lty = 2)

Calculate the correlation coefficients

## The Pearson correlation coefficient is: 0.7575217
## The Spearman correlation coefficient is: 0.8403233

There is a strong positive monotonic relationship between Nobel laureates per 10 million population and chocolate consumption.
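The Spearman coefficient being larger than the Pearson coefficient fits a relationship that is monotonic but not strictly linear. A minimal sketch of why, on toy data that I made up for illustration (x and y are not from the chocolate dataset): Spearman's coefficient is simply Pearson's coefficient computed on the ranks.

```r
# Toy data (hypothetical, not from the chocolate dataset):
# y rises with x at every step, but far from linearly.
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman's coefficient is Pearson's coefficient applied to the ranks.
spearman_by_hand <- cor(rank(x), rank(y), method = "pearson")

cat("Pearson:         ", pearson, "\n")
cat("Spearman:        ", spearman, "\n")
cat("Pearson on ranks:", spearman_by_hand, "\n")
```

Because the ranks of x and y agree perfectly here, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1.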

Exercise 3.2.

The plot from Exercise 3.1, plus versions in which log() transformations have been applied to one or both of the variables:

Conclusion drawn:

There is a strong positive association between chocolate consumption and the number of Nobel laureates per 10 million population across countries.

Does chocolate consumption improve intellectual output?

A correlation between X and Y does not prove causation. It is consistent with several explanations:

  • X influences Y,

  • Y influences X,

  • or X and Y are both influenced by a common underlying mechanism.
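The third case can be illustrated with a small simulation. This is only a sketch on invented data: z is a hypothetical common cause, and x and y depend on z but not on each other.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 1000

# z is a hypothetical common cause;
# x and y each depend on z but not on each other.
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)
y <- z + rnorm(n, sd = 0.5)

# The raw correlation between x and y is high ...
r_xy <- cor(x, y)

# ... but it largely disappears once z is regressed out of both.
r_xy_given_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

cat("cor(x, y):                  ", r_xy, "\n")
cat("cor(x, y) after removing z: ", r_xy_given_z, "\n")
```

A strong correlation therefore appears even though neither variable influences the other in the simulation.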

Exercise 3.3

Apply your favorite transformation from Exercise 3.2 to the data.

Exclude Brazil and China:

class(cnd)
## [1] "data.frame"
cnd_ex <- cnd[!(cnd$country %in% c("Brazil", "China")), ]

Exclude all observations from before 2013:

cnd2013 <- cnd_ex[cnd_ex$year >= 2013, ]

cnd2013
##        country choc_kg year Nobel_no Nobel_10m
## 1      Germany    12.2 2013      105    13.013
## 2  Switzerland    11.7 2014       25    30.125
## 3       Norway     9.6 2013       13    24.947
## 4           UK     8.9 2013      125    19.315
## 5      Austria     8.8 2013       21    24.577
## 6      Denmark     7.6 2013       14    24.695
## 7      Finland     7.2 2013        4     7.268
## 8      Belgium     6.9 2013       10     8.850
## 9       France     6.7 2013       61     9.473
## 10      Sweden     6.2 2013       30    30.677
## 11   Lithuania     5.8 2013        1     3.474
## 12       Italy     3.9 2013       20     3.345
## 13       Czech     3.6 2013        5     4.742
## 14       Spain     3.4 2013        8     1.735
## 15    Portugal     2.9 2013        2     1.932
## 16     Hungary     2.7 2013        9     9.132

Plot without any log

Log on choc_kg

Log on Nobel_10m

Log on both
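The four plots can be drawn in one 2 x 2 grid. A minimal sketch, assuming only the two numeric columns are needed; the values are typed in from the cnd2013 printout above so that the block runs on its own, and the panel titles are my own choice.

```r
# Values copied from the cnd2013 printout above.
cnd2013 <- data.frame(
  choc_kg   = c(12.2, 11.7, 9.6, 8.9, 8.8, 7.6, 7.2, 6.9,
                6.7, 6.2, 5.8, 3.9, 3.6, 3.4, 2.9, 2.7),
  Nobel_10m = c(13.013, 30.125, 24.947, 19.315, 24.577, 24.695, 7.268, 8.850,
                9.473, 30.677, 3.474, 3.345, 4.742, 1.735, 1.932, 9.132)
)

# log() needs strictly positive values, so check before transforming.
stopifnot(all(cnd2013$choc_kg > 0), all(cnd2013$Nobel_10m > 0))

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels

with(cnd2013, plot(choc_kg, Nobel_10m, main = "No log"))
with(cnd2013, plot(log(choc_kg), Nobel_10m, main = "Log on choc_kg"))
with(cnd2013, plot(choc_kg, log(Nobel_10m), main = "Log on Nobel_10m"))
with(cnd2013, plot(log(choc_kg), log(Nobel_10m), main = "Log on both"))

par(old_par)                      # restore the previous layout
```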

Conclusion drawn:

There is a strong positive association between chocolate consumption and the number of Nobel laureates per 10 million population across countries.

Does chocolate consumption improve intellectual output?

A correlation between X and Y does not prove causation. It is consistent with several explanations:

  • X influences Y,

  • Y influences X,

  • or X and Y are both influenced by a common underlying mechanism.

Exercise 3.4

sink("output_exercise_3.1.txt")

cat ("title: Exercise 3.1\n")
## title: Exercise 3.1
cat ("author: zzzzzzzzzzzz\n\n")
## author: zzzzzzzzzzzz
# Pearson correlation coefficient

pearson <- cor(cnd$choc_kg, cnd$Nobel_10m,
    method = "pearson")

cat ("The Pearson correlation coefficient is:", pearson, "\n")
## The Pearson correlation coefficient is: 0.7575217
# Spearman rank correlation coefficient

spearman <- cor(cnd$choc_kg, cnd$Nobel_10m,
    method = "spearman")

cat ("The Spearman correlation coefficient is:", spearman,"\n")
## The Spearman correlation coefficient is: 0.8403233
sink()
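To check what sink() actually wrote, the file can be read back afterwards. A minimal sketch, assuming a temporary file instead of the exercise's output file (out_file is a name I introduce here) and using the Pearson value from above as a literal:

```r
out_file <- tempfile(fileext = ".txt")   # hypothetical output location

sink(out_file)                           # start diverting console output
cat("title: Exercise 3.1\n")
cat("The Pearson correlation coefficient is:", 0.7575217, "\n")
sink()                                   # stop diverting; pairs with sink(out_file)

# Read the file back to confirm what was written.
written <- readLines(out_file)
print(written)
```

Every sink(file) needs a matching sink(); otherwise later output keeps going to the file instead of the console.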

Exercise 3.5.

For the country data:

One option is to round the data first, to get a general overview in the matrix of plots:

cd_pop <- round (cd$population, digits = -5)   # nearest 100,000
cd_gdp <- round (cd$gdp_head, digits = -2)     # nearest 100
cd_age <- round (cd$age_median / 5) * 5        # nearest 5 years
cd_cal <- round (cd$kcals_day, digits = -2)    # nearest 100 kcal

Build the data frame to pass to the pairs() function:

df_cd <- data.frame (
  population = cd_pop,
  gdp = cd_gdp,
  age = cd_age,
  kcal = cd_cal
)

The plot with original data

Now I take the log of each variable:

df_cd_log <- data.frame (
  population = log(cd_pop),
  gdp = log(cd_gdp),
  age = log(cd_age),
  kcal = log(cd_cal)
)

The plot with log data
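One thing to watch when taking logs of rounded data: rounding a population to the nearest 100,000 turns small populations into 0, and log(0) is -Inf. The sketch below uses a made-up stand-in for cd (all values are invented, and whether the real country data contains such small populations is not something I checked) to show the check and the two pairs() calls.

```r
# Hypothetical stand-in for the country data `cd` (values invented).
cd <- data.frame(
  population = c(3.0e4, 5.4e6, 8.3e7, 1.3e9, 2.1e8),
  gdp_head   = c(41000, 52000, 46000, 9800, 8900),
  age_median = c(37.2, 39.5, 45.7, 37.4, 32.6),
  kcals_day  = c(3100, 3450, 3540, 3110, 3260)
)

# Same rounding as in the exercise.
df_cd <- data.frame(
  population = round(cd$population, digits = -5),
  gdp        = round(cd$gdp_head, digits = -2),
  age        = round(cd$age_median / 5) * 5,
  kcal       = round(cd$kcals_day, digits = -2)
)

# Rows whose population was rounded down to 0 would give log(0) = -Inf.
zero_pop <- df_cd$population == 0

# Drop those rows before taking logs (one possible choice among several).
df_cd_log <- log(df_cd[!zero_pop, ])

pairs(df_cd)       # matrix of plots, rounded data
pairs(df_cd_log)   # matrix of plots, log of rounded data
```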

Closer look: age and gdp data:

Here I work with the non-rounded data and put them on a log scale.

Closer look: age and kcal data:

Closer look: gdp and kcal data:

Combine them

Play with the car package

  • step 0: preparatory step
library(car)
## Loading required package: carData

Plot the median age against gdp:

Note:

How can I find the largest residual point?

  • First, I use the max() function to find the value of the largest residual.

  • Then I look that value up in the residual vector (model_age_gdp$residuals) to find which point it belongs to (point 19).

  • This works because the residuals are stored in the same order as the rows of the original data frame (df_cd_log).

  • I can then plot the 19th row on its own (x-value is log(gdp_head), y-value is log(age_median)).

  • For this point I use the original data, not the rounded data in the data frame I created!
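The steps above can be shortened with which.max(), which returns the position of the largest residual directly. A minimal sketch on simulated data: log_gdp and log_age are hypothetical stand-ins for log(gdp_head) and log(age_median), and I plant an outlier in row 19 so that the result is predictable.

```r
set.seed(42)   # arbitrary seed, only for reproducibility

# Hypothetical data standing in for log(gdp_head) and log(age_median).
log_gdp <- runif(30, min = 6, max = 11)
log_age <- 2.5 + 0.1 * log_gdp + rnorm(30, sd = 0.02)
log_age[19] <- log_age[19] + 0.5   # plant an obvious outlier in row 19

model_age_gdp <- lm(log_age ~ log_gdp)
res <- residuals(model_age_gdp)

# Position of the largest (most positive) residual.
# Use which.max(abs(res)) for the largest residual in absolute value.
i_max <- which.max(res)

# Scatter plot with the fitted line; the flagged point is drawn in red.
plot(log_gdp, log_age)
abline(model_age_gdp)
points(log_gdp[i_max], log_age[i_max], col = "red", pch = 19)
```

The lookup by position relies on the residuals having the same order as the input rows, which holds as long as lm() did not drop any rows with missing values.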

Using car, we can also draw the residual plot:

Interpretation of the residual plot:

  • Solid Line:

A solid horizontal line at y = 0 is often used as a reference line to indicate the expected value of residuals when the model is accurate.

In an ideal situation, residuals should be randomly distributed around this line with no discernible pattern.

  • Dashed Lines:

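Dashed lines usually mark a smoothed trend or the spread of the residuals. A clear curve or a widening gap between them suggests that the model is missing some pattern in the data.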

  • Shaded area

In a plot produced by the scatterplot() function from the car package, the shaded area is, as far as I understand it, the variability band drawn around the smoother: it shows the estimated spread of the points around the smooth trend rather than a formal confidence interval. It gives a visual indication of the range in which points are expected to fall if the model is correctly specified.
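To see what each element of a residual plot does, it can help to build one by hand. The sketch below uses base R graphics rather than car, on simulated data (x, y and fit are names I made up), so the line styles are chosen by me and do not reproduce car's defaults.

```r
set.seed(7)   # arbitrary seed, only for reproducibility

# Simulated data that really does follow a straight line plus noise.
x <- runif(50, min = 0, max = 10)
y <- 1 + 0.5 * x + rnorm(50)

fit <- lm(y ~ x)
fitted_vals <- fitted(fit)
res <- residuals(fit)

plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 1)                     # solid reference line at zero
lines(lowess(fitted_vals, res), lty = 2)   # dashed smooth trend of the residuals
```

Since the simulated relationship is linear, the dashed smooth should stay close to the solid zero line; a systematic bend would point to a misspecified model.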