Create a new program to study this data. Download the dataset and save it with a short meaningful name.
Are there any issues with the data which we need to take into account?
Plot the data in a similar style to Figure 3.1, but don’t worry about the flags or the statistics in the upper left-hand corner.
What sort of relationship does there appear to be between the two variables?
Calculate the Pearson correlation coefficient and the Spearman rank correlation coefficient between the two variables.
What conclusion can we draw from those coefficients?
Download the dataset
cnd <- read.csv ("Chocolate_Nobel_data.csv")
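Before plotting, a few quick checks help answer the question about data issues. The sketch below is only an illustration: `cnd_demo` is a hypothetical stand-in holding five rows copied from the listing later in this document, so that the code runs without the CSV file. With the real data you would run the same checks on `cnd`.

```r
# Stand-in for part of Chocolate_Nobel_data.csv (assumption: five rows copied
# from the printed listing in this document; the real file has more rows).
cnd_demo <- data.frame(
  country   = c("Germany", "Switzerland", "Norway", "UK", "Austria"),
  choc_kg   = c(12.2, 11.7, 9.6, 8.9, 8.8),
  year      = c(2013, 2014, 2013, 2013, 2013),
  Nobel_no  = c(105, 25, 13, 125, 21),
  Nobel_10m = c(13.013, 30.125, 24.947, 19.315, 24.577)
)

# Were the numeric columns read as numbers rather than text?
str(cnd_demo)

# How many missing values are there in each column?
na_per_column <- colSums(is.na(cnd_demo))
print(na_per_column)

# Were all consumption figures recorded in the same year?
year_counts <- table(cnd_demo$year)
print(year_counts)

# Does any country appear more than once?
n_duplicated <- sum(duplicated(cnd_demo$country))
print(n_duplicated)
```

Mixed measurement years, missing values and duplicated countries are the kinds of issue that would need to be taken into account before interpreting a correlation.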
Plot it
# Scatterplot with the default y-axis suppressed (yaxt = "n")
plot (cnd$choc_kg, cnd$Nobel_10m,
      yaxt = "n", ylim = c (-5, 35),
      xlim = c (0, 15),
      xlab = "Chocolate Consumption (kg/yr/capita)",
      ylab = "Nobel Laureates per 10 Million Population")
# Custom y-axis: keep the tick at -5 but leave it unlabelled
axis (2,
      at = c (-5, 0, 5, 10, 15, 20, 25, 30, 35),
      labels = c ("", "0", "5", "10", "15", "20", "25", "30", "35")
)
# Dashed red reference line at zero
abline (h = 0, col = "red", lty = 2)
Calculate the correlation coefficients
## The Pearson correlation coefficient is: 0.7575217
## The Spearman correlation coefficient is: 0.8403233
There is a strong positive monotonic relationship between Nobel laureates per 10 million population and chocolate consumption.
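The word "monotonic" matters here because the Spearman coefficient is simply the Pearson coefficient computed on the ranks of the data. A minimal sketch of that equivalence follows. It uses the 16 rows printed later in this document, not the full dataset, so the values will differ from those reported above; the vector names are chosen for illustration.

```r
# choc_kg and Nobel_10m for the 16 countries listed later in this document
# (a subset of the full data, used here only to make the example runnable).
choc_kg <- c(12.2, 11.7, 9.6, 8.9, 8.8, 7.6, 7.2, 6.9,
             6.7, 6.2, 5.8, 3.9, 3.6, 3.4, 2.9, 2.7)
Nobel_10m <- c(13.013, 30.125, 24.947, 19.315, 24.577, 24.695, 7.268, 8.850,
               9.473, 30.677, 3.474, 3.345, 4.742, 1.735, 1.932, 9.132)

# Spearman as computed by cor()
spearman <- cor(choc_kg, Nobel_10m, method = "spearman")

# Pearson applied to the ranks gives the same number
pearson_on_ranks <- cor(rank(choc_kg), rank(Nobel_10m), method = "pearson")

print(c(spearman = spearman, pearson_on_ranks = pearson_on_ranks))
print(all.equal(spearman, pearson_on_ranks))
```

Because only the ordering of the values enters the calculation, Spearman measures how consistently one variable increases with the other, whether or not the increase follows a straight line.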
Produce four plots: the plot from Exercise 3.1 plus versions where log() transformations have been taken of the two variables,
individually and
jointly.
Which plot appears most linear? Calculate the two correlation coefficients from Exercise 3.1 for each of the four plots and insert them into the bodies of the plots.
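One possible way to lay out the four plots and write the coefficients into each panel is sketched below. It is an illustration rather than the intended solution: it uses the 16 rows printed later in this document instead of the full file, the names `choc`, `nobel`, `panels` and `coefs` are made up for the example, and `legend()` is just one of several ways to place text inside a plot.

```r
# Subset of the data (16 rows copied from the listing in this document).
choc  <- c(12.2, 11.7, 9.6, 8.9, 8.8, 7.6, 7.2, 6.9,
           6.7, 6.2, 5.8, 3.9, 3.6, 3.4, 2.9, 2.7)
nobel <- c(13.013, 30.125, 24.947, 19.315, 24.577, 24.695, 7.268, 8.850,
           9.473, 30.677, 3.474, 3.345, 4.742, 1.735, 1.932, 9.132)

# The four versions: untransformed, log of x, log of y, log of both.
# Note that log() needs strictly positive values.
panels <- list(
  "Untransformed" = list(x = choc,      y = nobel),
  "log(x)"        = list(x = log(choc), y = nobel),
  "log(y)"        = list(x = choc,      y = log(nobel)),
  "log(x), log(y)" = list(x = log(choc), y = log(nobel))
)

# Pearson and Spearman coefficients for each panel (2 x 4 matrix).
coefs <- sapply(panels, function(p) {
  c(pearson  = cor(p$x, p$y, method = "pearson"),
    spearman = cor(p$x, p$y, method = "spearman"))
})
print(round(coefs, 3))

# Draw a 2 x 2 grid and write the coefficients inside each plot.
old_par <- par(mfrow = c(2, 2))
for (name in names(panels)) {
  p <- panels[[name]]
  plot(p$x, p$y, main = name,
       xlab = "Chocolate consumption", ylab = "Nobel laureates per 10 million")
  legend("topleft", bty = "n",
         legend = sprintf(c("Pearson: %.3f", "Spearman: %.3f"), coefs[, name]))
}
par(old_par)
```

The Spearman coefficient is the same in all four panels because log() is strictly increasing and so leaves the ranks unchanged. Only the Pearson coefficient responds to the transformation, which makes it the one to compare when judging linearity.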
There is a powerful relationship between chocolate consumption and the number of Nobel laureates in various countries.
A correlation between X and Y does not prove causation but indicates that:
either X influences Y,
Y influences X,
or X and Y are influenced by a common underlying mechanism.
Apply your favorite transformation from Exercise 3.2 to the data,
firstly excluding Brazil and China, and
secondly excluding all dates before 2013.
Produce a pair of plots similar to those in Exercise 3.2.
What conclusions can we draw from our investigations?
Does chocolate consumption improve intellectual output (for which Nobel Prizes might be considered a proxy)?
If so, why? If not, what can we conclude?
class(cnd)
## [1] "data.frame"
cnd_ex <- cnd[!(cnd$country %in% c("Brazil", "China")), ]
cnd2013 <- cnd_ex[cnd_ex$year >= 2013, ]
cnd2013
##        country choc_kg year Nobel_no Nobel_10m
## 1      Germany    12.2 2013      105    13.013
## 2  Switzerland    11.7 2014       25    30.125
## 3       Norway     9.6 2013       13    24.947
## 4           UK     8.9 2013      125    19.315
## 5      Austria     8.8 2013       21    24.577
## 6      Denmark     7.6 2013       14    24.695
## 7      Finland     7.2 2013        4     7.268
## 8      Belgium     6.9 2013       10     8.850
## 9       France     6.7 2013       61     9.473
## 10      Sweden     6.2 2013       30    30.677
## 11   Lithuania     5.8 2013        1     3.474
## 12       Italy     3.9 2013       20     3.345
## 13       Czech     3.6 2013        5     4.742
## 14       Spain     3.4 2013        8     1.735
## 15    Portugal     2.9 2013        2     1.932
## 16     Hungary     2.7 2013        9     9.132
There is a powerful relationship between chocolate consumption and the number of Nobel laureates in various countries.
A correlation between X and Y does not prove causation but indicates that:
either X influences Y,
Y influences X,
or X and Y are influenced by a common underlying mechanism.
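The third possibility, a common underlying mechanism, can be illustrated with a small simulation. Everything in the sketch below is synthetic and made up for the purpose: `z` plays the role of a hidden common cause, and `x` and `y` each depend on `z` but not on each other. It is not an analysis of the chocolate data.

```r
set.seed(1)                      # make the simulation reproducible
n <- 500

# Synthetic data: z is a hidden common cause of both x and y.
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)      # x depends on z only
y <- z + rnorm(n, sd = 0.5)      # y depends on z only

# x and y are strongly correlated although neither influences the other.
cor_xy <- cor(x, y)

# Remove the part of each variable explained by z, then correlate what is left
# (a partial correlation). It should be close to zero.
cor_xy_given_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

print(c(cor_xy = cor_xy, cor_xy_given_z = cor_xy_given_z))
```

A strong raw correlation that disappears once a third variable is accounted for is the signature of a common cause. Whether something similar applies to chocolate and Nobel prizes cannot be decided from the two columns alone.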
Expand the program that you created in Exercise 3.1 to display the two correlation coefficients to the screen.
Send these outputs to a file which has an appropriate heading together with your name and the date and time at the bottom.
sink("output_exercise_3.1.txt")
cat ("title: Exercise 3.1\n")
## title: Exercise 3.1
cat ("author: zzzzzzzzzzzz\n\n")
## author: zzzzzzzzzzzz
# Pearson correlation coefficient
pearson <- cor(cnd$choc_kg, cnd$Nobel_10m,
               method = "pearson")
cat ("The Pearson correlation coefficient is:", pearson, "\n")
## The Pearson correlation coefficient is: 0.7575217
# Spearman rank correlation coefficient
spearman <- cor(cnd$choc_kg, cnd$Nobel_10m,
                method = "spearman")
cat ("The Spearman correlation coefficient is:", spearman, "\n")
## The Spearman correlation coefficient is: 0.8403233
sink()
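The exercise asks for a heading at the top and the name, date and time at the bottom of the file. One way to arrange that is sketched below, under some assumptions: the output goes to a temporary file, the author name is a placeholder, and the data are six rows copied from the listing in this document, so the coefficients will not match those of the full dataset. `split = TRUE` makes `sink()` send the output to the screen as well as the file.

```r
# Six rows copied from the listing in this document (illustration only).
choc  <- c(12.2, 11.7, 9.6, 8.9, 8.8, 7.6)
nobel <- c(13.013, 30.125, 24.947, 19.315, 24.577, 24.695)

pearson  <- cor(choc, nobel, method = "pearson")
spearman <- cor(choc, nobel, method = "spearman")

# Hypothetical output location; replace with a file name of your choice.
out_file <- tempfile(fileext = ".txt")

sink(out_file, split = TRUE)     # write to the file and to the screen
cat("Exercise 3.1: Chocolate consumption and Nobel laureates\n\n")   # heading
cat("The Pearson correlation coefficient is:", pearson, "\n")
cat("The Spearman correlation coefficient is:", spearman, "\n\n")
cat("Author: Your Name\n")                                           # placeholder
cat("Created:", format(Sys.time(), "%Y-%m-%d %H:%M:%S"), "\n")       # date and time
sink()                           # stop diverting output

# Read the file back to confirm what was written.
written <- readLines(out_file)
print(written)
```

Calling `sink()` with no arguments at the end is essential. Otherwise all later output continues to be diverted into the file.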
For the country data:
You could either use:
the pairs() function
or explore the car package for a more sophisticated representation.
Are any of the relationships linear? Look for best linear relationships using log() transformations and correlations.
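A sketch of the `pairs()` route is given below. Because the country data file is not reproduced here, the example uses synthetic data generated on the spot: the column names follow those used in this document, but every value is simulated and says nothing about real countries. The aim is only to show the mechanics of comparing correlations before and after a log() transformation.

```r
set.seed(42)
n <- 60

# Synthetic stand-in for the country data (all values simulated).
gdp_head <- exp(rnorm(n, mean = 9, sd = 1))
demo <- data.frame(
  population = exp(rnorm(n, mean = 16, sd = 1.5)),
  gdp_head   = gdp_head,
  age_median = 10 * gdp_head^0.12 * exp(rnorm(n, sd = 0.08)),
  kcals_day  = 1200 * gdp_head^0.09 * exp(rnorm(n, sd = 0.05))
)

# Scatterplot matrix of every pair of variables, on the log scale.
pairs(log(demo), main = "Synthetic country data, log scale")

# Pearson correlations before and after the transformation.
cor_raw <- cor(demo)
cor_log <- cor(log(demo))
print(round(cor_raw, 2))
print(round(cor_log, 2))
```

Pairs whose Pearson coefficient moves closer to 1 or -1 after the transformation are candidates for a linear relationship on the log scale. The plots should still be inspected, since a single coefficient can hide curvature or outliers.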
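# cd is assumed to hold the country data already
# Round each variable to a convenient precision
cd_pop <- round (cd$population, digits = -5)   # nearest 100,000
cd_gdp <- round (cd$gdp_head, -2)              # nearest 100
cd_age <- round (cd$age_median / 5) * 5        # nearest 5 years
cd_cal <- round (cd$kcals_day, -2)             # nearest 100 kcal
# Rounded data
df_cd <- data.frame (
  population = cd_pop,
  gdp = cd_gdp,
  age = cd_age,
  kcal = cd_cal
)
# The same variables on the log scale
df_cd_log <- data.frame (
  population = log(cd_pop),
  gdp = log(cd_gdp),
  age = log(cd_age),
  kcal = log(cd_cal)
)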
I work with the non-rounded data, taken on a log scale.
library(car)
## Loading required package: carData
How can I find the point with the largest residual?
First, I use the max() function to find the value of the largest residual.
Then, I look that value up in the vector of residuals (model_age_gdp$residuals) to find which point it belongs to (point 19).
This works because the residuals are stored in the same order as the rows of the original data frame (df_cd_log).
I can then plot the 19th row on its own, with log(gdp_head) as the x-value and log(age_median) as the y-value.
I use the original data here, not the rounded data in the data frame I created!
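The lookup can also be done in one step with `which.max()`, which returns the position of the largest value directly. The sketch below runs on synthetic data with an outlier planted at row 19, so that the expected answer is known. The variable names are illustrative, and it takes the absolute value of the residuals, which is a slight variation on the max() approach described above because it also catches large negative residuals.

```r
set.seed(7)
n <- 30

# Synthetic log-scale data with a linear relationship (values simulated).
log_gdp <- rnorm(n, mean = 9, sd = 1)
log_age <- 1 + 0.25 * log_gdp + rnorm(n, sd = 0.05)
log_age[19] <- log_age[19] + 0.6      # planted outlier at row 19
demo <- data.frame(log_gdp, log_age)

model <- lm(log_age ~ log_gdp, data = demo)
res   <- residuals(model)             # same order as the rows of demo

# Position of the residual that is largest in absolute value.
worst <- unname(which.max(abs(res)))
print(worst)

# Plot the data, the fitted line, and highlight the point found.
plot(demo$log_gdp, demo$log_age, xlab = "log(gdp_head)", ylab = "log(age_median)")
abline(model)
points(demo$log_gdp[worst], demo$log_age[worst], col = "red", pch = 19)
text(demo$log_gdp[worst], demo$log_age[worst], labels = worst, pos = 1)
```

This relies on the same fact as the manual lookup: the residuals are stored in the same order as the rows used to fit the model.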
A solid horizontal line at y = 0 is often used as a reference line to indicate the expected value of residuals when the model is accurate.
In an ideal situation, residuals should be randomly distributed around this line with no discernible pattern.
Dashed lines, where present, usually mark a smoothed trend or its spread; a systematic curve in them points to a pattern in the residuals.
In a plot produced by the scatterplot() function from the car package, the shaded area, with the default smoother settings, shows the estimated spread of the points around the smoothed line. It is a variability envelope rather than a formal confidence band for the residuals, and it gives a visual indication of how widely the points scatter about the trend across the range of x.
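A basic residual plot with these elements can also be drawn in base R, which makes it explicit what each line represents. The sketch below uses synthetic data from a model that really is linear, and `lowess()` as the smoother. That choice is an assumption made for the illustration, not the method car uses internally.

```r
set.seed(3)

# Synthetic data from a genuinely linear model (values simulated).
x <- runif(80, min = 1, max = 10)
y <- 2 + 0.5 * x + rnorm(80, sd = 0.4)
fit <- lm(y ~ x)

# Residuals against fitted values.
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")

# Solid reference line at zero: the expected residual if the model is adequate.
abline(h = 0, lty = 1)

# Dashed smoothed trend through the residuals; it should stay close to zero.
lines(lowess(fitted(fit), residuals(fit)), lty = 2, col = "red")
```

If the dashed curve bends away from the zero line, or the scatter fans out as the fitted values increase, the linear model is missing some structure in the data.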