Maximum number of points: 8 points
Turnaround: 1 week
Submission Deadlines (digital on OLAT):
Group WED 16-18: TUE, 31.03.26 23:59
Group THU 12-14: WED, 01.04.26 23:59
Group THU 14-16: WED, 01.04.26 23:59
Correlation analysis is used to investigate the direction (positive or negative) and the degree (very weak to very strong) of the relationship between two variables. As mentioned in the Lecture Notes, correlation analysis does not provide any indication about the causality between the variables. Hence, a statistically established correlation is not yet proof of an actual connection between variables. Always keep this in mind when performing correlation analyses.
In this week’s exercise, we will work with student enrollment data at Swiss Institutes of Higher Education, during the period 1990-2024\(^1\). For each of these Academic Years (starting with the fall semester of the Academic Year 1990-1991), the number of students enrolled was monitored for 42 different education fields:
1) Education science 2) Teacher training without subject specialization 3) Teacher training with subject specialization 4) Fine arts 5) Music and performing arts 6) Religion and theology 7) History and archaeology 8) Philosophy and ethics 9) Language acquisition 10) Literature and linguistics 11) Economics 12) Political sciences and civics 13) Psychology 14) Sociology and cultural studies 15) Journalism and reporting 16) Management and administration 17) Law 18) Biology 19) Environmental sciences 20) Chemistry 21) Earth sciences 22) Physics 23) Mathematics 24) Information and Communication Technologies (ICTs) 25) Chemical engineering and processes 26) Electricity and energy 27) Electronics and automation 28) Mechanics and metal trades 29) Food processing 30) Materials (glass, paper, plastic and wood) 31) Architecture and town planning 32) Building and civil engineering 33) Crop and livestock production 34) Forestry 35) Veterinary studies 36) Dental studies 37) Medicine 38) Nursing and midwifery 39) Pharmacy 40) Social work and counselling 41) Military and defence 42) Education field unknown
Note that the first column in the dataset indicates the Academic Year.
Sadly for us, the education field ‘Geography’ is not separately listed in the dataset. However, two related fields, Environmental Science and Earth Science, are listed. Before we start the quantitative analysis, answer these two questions first:
The existence of both fields splits up the group of people interested in the broader field. Adding one field (B) alongside the other (A) means that students who would otherwise have chosen field A may now choose field B because it is more tailored to their interests.
Given that geography-related fields such as earth science and environmental science have been getting a lot of attention in this generation, a positive correlation in enrollment could stem from heightened awareness of these fields. Overall, both fields could therefore gain enrollments simultaneously.
We will now proceed with the quantitative analyses.
## `geom_smooth()` using formula = 'y ~ x'
The line plot shows that both fields grow similarly, with Earth Science
having more enrolled students than Environmental Science almost
throughout the observed timeframe. The scatterplot shows a strong
positive correlation between the two study fields: the more Earth
Science enrollments, the more Environmental Science enrollments, and
vice versa.
##
## Pearson's product-moment correlation
##
## data: StudentEnrollmentCH$EarthSci and StudentEnrollmentCH$EnvironmentalSci
## t = 32.296, df = 33, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9693355 0.9922417
## sample estimates:
## cor
## 0.9845463
The Pearson coefficient is r = 0.98, which indicates a very strong positive correlation. The p-value (< 2.2e-16) is far below 0.05, so the coefficient is statistically significant and we reject the null hypothesis that there is no linear relationship between the two variables. This matches the visual interpretation of the scatterplot in 1d).
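As a sanity check, the test statistic and confidence interval in the output above can be reproduced from the reported r and df alone. This is a minimal sketch, assuming the standard t statistic for H₀: ρ = 0 and a Fisher z-transformation for the interval; the variable names are illustrative and not part of the original script.

```r
# Values copied from the cor.test() output above
r  <- 0.9845463   # sample correlation
df <- 33          # degrees of freedom
n  <- df + 2      # number of observations (35 academic years)

# t statistic for H0: rho = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(c(t = t_stat, p = p_value))
print(ci)
```

The printed t value and interval should agree with the console output above up to rounding.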
In this Exercise we will look at the correlation between student enrollment into two other programs, Fine Arts and History.
## `geom_smooth()` using formula = 'y ~ x'
The line plot shows much higher enrollment in History than in Fine
Arts. History experienced a boom until ca. 2004 and has been slowly
declining ever since. Fine Arts enrollment is very low overall, with a
slight declining trend. The scatter plot shows a widely spread
distribution of data points with no apparent trend, while the trend
line suggests a weak positive correlation. This weak positive
correlation may arise because both fields have been declining for a while.
## [1] 0.08123236
The coefficient is r = 0.08, which implies a very weak positive correlation and matches the visual impression from the line plot and the scatter plot.
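A rough hand calculation suggests how far this coefficient is from statistical significance. It assumes n = 35 observations (the same number of academic years implied by df = 33 in the other test outputs of this report), since only `cor()` was run here and no test output is available.

\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{0.0812 \cdot \sqrt{33}}{\sqrt{1-0.0812^{2}}} \approx \frac{0.467}{0.997} \approx 0.47
\]

Under that assumption, \(|t| \approx 0.47\) is well below the two-sided 5% critical value of about 2.03 for 33 degrees of freedom, so the correlation would not be distinguishable from zero.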
In this last Exercise we will look at the correlation in student enrollment numbers for two other programs, Economics and Social Work.
## `geom_smooth()` using formula = 'y ~ x'
The line plot shows a drastic growth in Economics enrollments from the
mid-1990s onward, while Social Work has been slowly declining after a
small boom in the mid-1990s. The scatter plot shows a cluster of points
that would suggest a negative correlation, but the linear trend line
says otherwise (positive correlation) because of the points at (0,0).
The unusual phenomenon here is probably the group of points at (0,0). Looking at the entries for the two observed variables, we detect “0” values for 7 rows. These are the first 7 rows of the dataset (1990-1996); maybe these fields did not exist yet back then. For this specific task, I would eliminate these first 7 rows, since zero enrollments most likely mean that enrollment was not possible, which is why they are not relevant for our analysis.
##
## Pearson's product-moment correlation
##
## data: StudentEnrollmentCH$Economics and StudentEnrollmentCH$SocialWork
## t = 4.1073, df = 33, p-value = 0.0002479
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3080830 0.7663313
## sample estimates:
## cor
## 0.5816164
H₀: There is no linear correlation between Economics
and Social Work \((\rho = 0)\).
H₁: There is a linear correlation between Economics and
Social Work \((\rho \ne 0)\).
The Pearson correlation coefficient is r = 0.58, which indicates a moderate positive correlation. This roughly reflects the visual impression from Q3a, although the result may be influenced by unusual values in the dataset.
##
## Pearson's product-moment correlation
##
## data: StudentEnrollmentCH_fixed$Economics and StudentEnrollmentCH_fixed$SocialWork
## t = -7.8513, df = 26, p-value = 2.51e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9229552 -0.6775820
## sample estimates:
## cor
## -0.8386551
I removed the years where both variables are 0 and recalculated the correlation.
H₀: There is no correlation between Economics and
Social Work \((\rho = 0)\).
H₁: There is a correlation between Economics and Social
Work \((\rho \ne 0)\).
The new correlation is r = -0.84, which is a strong negative correlation and statistically significant \((p < 0.05)\).
Compared to Q3c \((r = 0.58)\), the result changes from a positive to a negative correlation.
This happens because the \((0,0)\) values made the correlation look more positive than it actually is. After removing them, the real relationship becomes visible.
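The sign flip can be illustrated with a small toy example. The sketch below uses invented numbers (not the real enrollment data) in which both variables are 0 for the first 7 years, after which one rises and the other falls. The data frame names `toy` and `toy_fixed` are hypothetical; only the column names follow the test output above.

```r
# Synthetic, illustrative data only (NOT the real enrollment numbers)
toy <- data.frame(
  Year       = 1990:2024,
  Economics  = c(rep(0, 7), seq(10000, 23500, by = 500)),  # rises after the zeros
  SocialWork = c(rep(0, 7), seq(8000, 5300, by = -100))    # falls after the zeros
)

# Correlation including the (0,0) rows
r_all <- cor(toy$Economics, toy$SocialWork)

# Drop rows where both variables are 0, then recompute
both_zero <- toy$Economics == 0 & toy$SocialWork == 0
toy_fixed <- toy[!both_zero, ]
r_fixed   <- cor(toy_fixed$Economics, toy_fixed$SocialWork)

print(c(with_zeros = r_all, without_zeros = r_fixed))
```

In this toy case the (0,0) rows pull the coefficient to a positive value, while the filtered data give a negative one, which mirrors the pattern seen between Q3c and Q3d.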
## `geom_smooth()` using formula = 'y ~ x'
The new scatter plot shows a clear negative trend.
In Q3a, the trendline looked positive because of the \((0,0)\) values. After removing them, the data points form a downward pattern and the trendline shows a strong negative relationship.
So the visualizations confirm the results:
- Q3c → positive correlation (misleading due to 0-values)
- Q3d → strong negative correlation (more accurate)
It is better to first look at the scatter plot and then calculate the correlation.
The plot helps to detect patterns or problems in the data, like the 0-values in Q3. If you calculate the correlation first, the result can be misleading.
So plotting first makes the analysis more reliable.
\(^1\) Data is available from the Swiss Federal Statistical Office: https://www.pxweb.bfs.admin.ch/pxweb/en/px-x-1502040100_131/px-x-1502040100_131/px-x-1502040100_131.px