General information

Maximum number of points:   8 points 
Turnaround:                 1 week
Submission Deadlines:       Group: WED 16-18 digital on OLAT: TUE, 31.03.26 23:59
                            -----------------------------------------------------
                            Group: THU 12-14 digital on OLAT: WED, 01.04.26 23:59
                            ----------------------------------------------------- 
                            Group: THU 14-16 digital on OLAT: WED, 01.04.26 23:59

General: Plots must be labelled, and statistical tests always need a null hypothesis (H0)!

Introduction

Correlation analysis is used to investigate the direction (positive or negative) and the degree (very weak to very strong) of the relationship between two variables. As mentioned in the Lecture Notes, correlation analysis does not provide any indication about the causality between the variables. Hence, a statistically established correlation is not yet proof of an actual connection between variables. Always keep this in mind when performing correlation analyses.

Dataset

In this week’s exercise, we will work with student enrollment data at Swiss Institutes of Higher Education, during the period 1990-2024\(^1\). For each of these Academic Years (starting with the fall semester of the Academic Year 1990-1991), the number of students enrolled was monitored for 42 different education fields:

1) Education science 2) Teacher training without subject specialization 3) Teacher training with subject specialization 4) Fine arts 5) Music and performing arts 6) Religion and theology 7) History and archaeology 8) Philosophy and ethics 9) Language acquisition 10) Literature and linguistics 11) Economics 12) Political sciences and civics 13) Psychology 14) Sociology and cultural studies 15) Journalism and reporting 16) Management and administration 17) Law 18) Biology 19) Environmental sciences 20) Chemistry 21) Earth sciences 22) Physics 23) Mathematics 24) Information and Communication Technologies (ICTs) 25) Chemical engineering and processes 26) Electricity and energy 27) Electronics and automation 28) Mechanics and metal trades 29) Food processing 30) Materials (glass, paper, plastic and wood) 31) Architecture and town planning 32) Building and civil engineering 33) Crop and livestock production 34) Forestry 35) Veterinary studies 36) Dental studies 37) Medicine 38) Nursing and midwifery 39) Pharmacy 40) Social work and counselling 41) Military and defence 42) Education field unknown

Note that the first column in the dataset indicates the Academic Year.

Q1

Sadly for us, the education field ‘Geography’ is not listed separately in the dataset. However, two related fields, Environmental Science and Earth Science, are listed. Before we start the quantitative analysis, answer these two questions first:

  a) Given the similarity of these two education fields, provide a possible reason why the correlation in student enrollment between these two fields might be negative?

The existence of both fields splits the pool of students interested in the broader subject area. If one field (B) is offered alongside the other (A), students who would otherwise have chosen field A may now choose field B because it is more tailored to their interests, so growth in one field can come at the expense of the other.

  b) Given the similarity of these two education fields, provide a possible reason why the correlation in student enrollment between these two fields might be positive?

Both fields, Earth Science and Environmental Science, have received a lot of attention in this generation, so a positive correlation could stem from heightened awareness of these topics. Overall, both fields could gain enrollments simultaneously.

We will now proceed with the quantitative analyses.

  c) Create a line plot, in which you show the student enrollment numbers of the Earth Science and Environmental Science programs over the Academic Years. Then, make a scatter plot showing the student enrollment numbers of Earth Science against those of Environmental Science. Add a linear trend line to this scatter plot. Describe the two plots briefly and use the scatter plot to comment on the direction (positive or negative) and strength (very weak, weak, moderate, strong or very strong) of the correlation.

## `geom_smooth()` using formula = 'y ~ x'

The line plot shows that both fields grow in a similar way, with the Earth Science program having consistently more enrolled students over the observed time frame. The scatter plot shows a strong positive correlation between the two study fields: the higher the Earth Science enrollment, the higher the Environmental Science enrollment, and vice versa.
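
The two plots described above could be produced along the following lines. This is only a minimal sketch: the data frame `toy` and all its values are made up for illustration and are not the real enrollment numbers, and the column name `Year` is an assumption. `EarthSci` and `EnvironmentalSci` follow the column names visible in the R output further down.

```r
library(ggplot2)

# Hypothetical toy data standing in for the enrollment dataset (NOT real values)
set.seed(1)
toy <- data.frame(Year = 1990:2024)
toy$EarthSci         <- 800 + 25 * (toy$Year - 1990) + rnorm(nrow(toy), sd = 40)
toy$EnvironmentalSci <- 500 + 20 * (toy$Year - 1990) + rnorm(nrow(toy), sd = 40)

# Line plot: both programs over the Academic Years, with labelled axes and legend
line_plot <- ggplot(toy, aes(x = Year)) +
  geom_line(aes(y = EarthSci, colour = "Earth Science")) +
  geom_line(aes(y = EnvironmentalSci, colour = "Environmental Science")) +
  labs(title = "Student enrollment over time (toy data)",
       x = "Academic Year (start)", y = "Enrolled students", colour = "Field")

# Scatter plot with a linear trend line; stating the formula explicitly
# avoids the "`geom_smooth()` using formula = 'y ~ x'" message
scatter_plot <- ggplot(toy, aes(x = EarthSci, y = EnvironmentalSci)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Earth Science vs. Environmental Science (toy data)",
       x = "Earth Science enrollment", y = "Environmental Science enrollment")

print(line_plot)
print(scatter_plot)
```

Swapping `toy` for the real data frame and adjusting the column names should be all that is needed to reuse the sketch.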

  d) Calculate the appropriate correlation coefficient (i.e. either the parametric Pearson coefficient, or the non-parametric Spearman coefficient) between Environmental Science student numbers and Earth Science student numbers, for the whole time period from 1990 until 2024. Does the calculated correlation coefficient reflect the direction (positive or negative) and strength (very weak to very strong) you visually assessed for Q1c)? Also, explain what the observed correlation implies.
## 
##  Pearson's product-moment correlation
## 
## data:  StudentEnrollmentCH$EarthSci and StudentEnrollmentCH$EnvironmentalSci
## t = 32.296, df = 33, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9693355 0.9922417
## sample estimates:
##       cor 
## 0.9845463

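H₀: There is no linear correlation between Earth Science and Environmental Science enrollment \((\rho = 0)\).

The Pearson coefficient is r = 0.98, which indicates a very strong positive correlation. The p-value (< 2.2e-16) is far below 0.05, so we reject H₀: the coefficient is statistically significant and the two variables are linearly related. This matches the visual interpretation of the scatter plot in Q1c). As stated in the introduction, this does not prove a causal connection between the two fields.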
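
The task asks for the "appropriate" coefficient, so the choice between Pearson and Spearman deserves a short check. The sketch below shows one common rule of thumb: test each variable for normality with the Shapiro-Wilk test and fall back to Spearman if either test rejects. The vectors `earth` and `env` are made-up toy values, not the real enrollment numbers, and the 0.05 threshold is an assumption rather than a course requirement.

```r
# Hypothetical toy data standing in for the two enrollment columns (NOT real values)
set.seed(1)
year  <- 1990:2024
earth <- 800 + 25 * (year - 1990) + rnorm(length(year), sd = 40)
env   <- 500 + 20 * (year - 1990) + rnorm(length(year), sd = 40)

# Shapiro-Wilk test per variable; H0: the sample comes from a normal distribution
normal_earth <- shapiro.test(earth)$p.value > 0.05
normal_env   <- shapiro.test(env)$p.value > 0.05

# Pearson if neither test rejects normality, otherwise Spearman
method <- if (normal_earth && normal_env) "pearson" else "spearman"

# Correlation test; H0: no correlation between the two variables
result <- cor.test(earth, env, method = method)
print(method)
print(result)
```

Other checks (histograms, Q-Q plots, or inspecting the scatter plot for outliers) can lead to a different choice, so the rule above is only one possible convention.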


Q2

In this Exercise we will look at the correlation between student enrollment into two other programs, Fine Arts and History.

  a) Create a line plot, in which you show the student enrollment numbers of the Fine Arts and History programs over the Academic Years. Then, make a scatter plot showing the student enrollment numbers of Fine Arts against those of History. Add a linear trend line to this scatter plot. Describe both briefly and use the scatter plot to comment on the direction (positive or negative) and strength (very weak, weak, moderate, strong or very strong) of the correlation. Also, explain what the observed correlation implies.

## `geom_smooth()` using formula = 'y ~ x'

The line plot shows much higher enrollment in History than in Fine Arts. History experienced a boom until ca. 2004 and has been slowly declining ever since. Fine Arts enrollment is very low overall, with a slight declining trend. The scatter plot shows widely spread data points with no apparent trend; the trend line suggests a weak positive correlation. This implies that the slight positive association may simply reflect the fact that both fields have been declining for a while.

  b) Calculate the appropriate correlation coefficient (i.e. either the parametric Pearson coefficient, or the non-parametric Spearman coefficient) between Fine Arts student numbers and History student numbers, for the whole time period from 1990 until 2024. Does the calculated correlation coefficient reflect the direction (positive or negative) and strength (very weak to very strong) you visually assessed for Q2a)?
## [1] 0.08123236

The coefficient is 0.08, which implies a very weak positive correlation and matches the visual impression from the line and scatter plots.


Q3

In this last Exercise we will look at the correlation in student enrollment numbers for two other programs, Economics and Social Work.

  a) Create a line plot, in which you show the student enrollment numbers of the Economics and Social Work programs over the Academic Years. Then, make a scatter plot showing the student enrollment numbers of Economics against those of Social Work. Add a linear trend line to this scatter plot. Describe both plots briefly and use the scatter plot to comment on the direction (positive or negative) and strength (very weak, weak, moderate, strong or very strong) of the correlation.

## `geom_smooth()` using formula = 'y ~ x'

The line plot shows drastic growth in Economics enrollments after the initial years with zero recorded enrollments, while Social Work has been slowly declining after a small boom at the start of its recorded enrollments. The scatter plot shows a cluster of points that would suggest a weak negative correlation, but the linear trend line says otherwise (strong positive correlation) because of the points at (0,0).

  b) The trendline shown in Question 3a) might be driven by an unusual phenomenon in the dataset. Inspect the data of both variables, and describe this phenomenon. Describe how this phenomenon may drive the observed trendline. What approach could you take to circumvent this effect? (Only a description in words is needed here; you will execute the approach in Question 3d).)

The unusual phenomenon is a cluster of points at (0,0). Looking at the entries for the two variables, we find 0 values in 7 rows. These correspond to the first seven years of the dataset (1990-1996); maybe these fields did not exist yet back then. These points lie far from the rest of the data and pull the trend line towards the origin, which produces a positive slope. For this task, I would remove these first 7 rows, since zero enrollments most likely mean that enrollment was not possible, which makes these years irrelevant for our analysis.

  c) Calculate the appropriate correlation coefficient (i.e. either the parametric Pearson coefficient, or the non-parametric Spearman coefficient) between Economics student numbers and Social Work student numbers, for the whole time period from 1990 until 2024. Does the calculated correlation coefficient reflect the direction (positive or negative) and strength (very weak to very strong) you visually assessed for Q3a)?
## 
##  Pearson's product-moment correlation
## 
## data:  StudentEnrollmentCH$Economics and StudentEnrollmentCH$SocialWork
## t = 4.1073, df = 33, p-value = 0.0002479
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3080830 0.7663313
## sample estimates:
##       cor 
## 0.5816164

H₀: There is no linear correlation between Economics and Social Work \((\rho = 0)\).
H₁: There is a linear correlation between Economics and Social Work \((\rho \ne 0)\).

The Pearson correlation coefficient is r = 0.58, which indicates a moderate positive correlation. This roughly reflects the visual impression from Q3a, although the result may be influenced by unusual values in the dataset.


  d) Following the approach you suggested in Question 3b), perform an alternative calculation of the appropriate correlation coefficient (i.e. either the parametric Pearson coefficient, or the non-parametric Spearman coefficient) between Economics student numbers and Social Work student numbers. How do the results of the correlation analysis change when you use this alternative method? Explain why we see this change in outcome between the analysis performed in question 3c) and this question.
## 
##  Pearson's product-moment correlation
## 
## data:  StudentEnrollmentCH_fixed$Economics and StudentEnrollmentCH_fixed$SocialWork
## t = -7.8513, df = 26, p-value = 2.51e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9229552 -0.6775820
## sample estimates:
##        cor 
## -0.8386551

I removed the years where both variables are 0 and recalculated the correlation.

H₀: There is no correlation between Economics and Social Work \((\rho = 0)\).
H₁: There is a correlation between Economics and Social Work \((\rho \ne 0)\).

The new correlation is r = -0.84, which is a strong negative correlation and statistically significant \((p < 0.05)\).

Compared to Q3c \((r = 0.58)\), the result changes from a positive to a negative correlation.

This happens because the \((0,0)\) values made the correlation look more positive than it actually is. After removing them, the real relationship becomes visible.
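
The sign flip can be reproduced on a small made-up example. The sketch below is an illustration under stated assumptions, not the analysis itself: `toy` contains invented numbers (7 rows of zeros followed by 28 rows in which one variable rises while the other falls), and only the column names `Economics` and `SocialWork` are taken from the output above.

```r
# Hypothetical toy data (NOT the real enrollment numbers): 7 years of zeros,
# followed by 28 years in which Economics rises while SocialWork falls
set.seed(42)
econ_real   <- seq(5000, 9000, length.out = 28) + rnorm(28, sd = 100)
social_real <- seq(3000, 2000, length.out = 28) + rnorm(28, sd = 50)
toy <- data.frame(
  Economics  = c(rep(0, 7), econ_real),
  SocialWork = c(rep(0, 7), social_real)
)

# Correlation over all rows: the (0,0) points sit far from the other data
# and dominate the covariance, so the coefficient comes out positive
r_all <- cor(toy$Economics, toy$SocialWork)

# Drop the rows where both variables are 0, then recompute
toy_fixed <- subset(toy, !(Economics == 0 & SocialWork == 0))
r_fixed <- cor(toy_fixed$Economics, toy_fixed$SocialWork)

print(c(all_rows = r_all, zeros_removed = r_fixed))
```

With the zero rows included the coefficient is positive; once they are removed it becomes strongly negative, which mirrors the change between Q3c and Q3d.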


  e) Provide a visualization of the data analyzed in question 3d) that includes the trend in the data as obtained through the correlation analysis performed in that question. How do the visualizations made in this question and question 3a) confirm the outcomes of the correlation analyses performed in questions 3c) and 3d)?
## `geom_smooth()` using formula = 'y ~ x'

The new scatter plot shows a clear negative trend.

In Q3a, the trendline looked positive because of the \((0,0)\) values. After removing them, the data points form a downward pattern and the trendline shows a strong negative relationship.

So the visualizations confirm the results:
- Q3c → positive correlation (misleading due to 0-values)
- Q3d → strong negative correlation (more accurate)

  f) In our correlation analysis, we can either: 1) Start with calculating the strength of the association between the variables (i.e. calculate the Spearman or the Pearson correlation), and subsequently visualize this correlation with a scatter plot that includes a trendline; 2) Show the association between variables in a scatter plot that includes a trendline, and subsequently calculate the strength of this association between the variables (i.e. calculate the Spearman or the Pearson correlation). Which of these two approaches do you think is preferable? Explain your answer.

It is better to first look at the scatter plot and then calculate the correlation.

The plot helps to detect patterns or problems in the data, like the 0-values in Q3. If you calculate the correlation first, the result can be misleading.

So plotting first makes the analysis more reliable.
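
A well-known illustration of this point is Anscombe's quartet, which ships with base R as the data frame `anscombe`. The sketch below is an added example and not part of the exercise data: it computes the Pearson coefficient for the four x/y pairs and then plots them side by side.

```r
# Anscombe's quartet is part of base R (datasets::anscombe):
# four x/y pairs (x1..x4, y1..y4) with very different shapes
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(r, 2))

# Plot the four pairs in a 2 x 2 grid, each with its linear trend line
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Anscombe pair", i), xlab = "x", ylab = "y")
  abline(lm(y ~ x))
}
par(op)
```

All four coefficients come out at roughly 0.82, yet the plots show a linear cloud, a curve, and two patterns dominated by a single point. Only the plots reveal that the same coefficient describes very different data.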

Footnotes

\(^1\) Data is available from the Swiss Federal Statistical Office: https://www.pxweb.bfs.admin.ch/pxweb/en/px-x-1502040100_131/px-x-1502040100_131/px-x-1502040100_131.px