MATH3030: Coursework Spring 2025

Author

Joseph Beddoe

Question 1

Exploratory Analysis

The following table outlines the number of countries within each continent.

Continent	Number of Countries
Africa	52
Americas	25
Asia	32
Europe	30
Oceania	2

Africa accounts for the largest number of countries (52), while the Americas, Asia, and Europe have a comparable number of countries (25, 32, and 30 respectively). Notably, Oceania stands out as an outlier with only 2 countries represented in the dataset. This significant difference in the number of countries per continent should be taken into consideration during subsequent analysis to avoid potential biases or misinterpretations.

GDP per capita has increased across all continents, though the rate and scale of growth vary. Oceania consistently has the highest average GDP per capita, though this is based on only two countries. Europe shows a similar trajectory with sustained growth, a slight dip around 1992, and a rapid increase afterward, reaching over $25,000 by 2007.

Asia and the Americas had more modest growth initially, but Asia experienced accelerated gains post-1992, eventually overtaking the Americas by 2007. Africa had the slowest and most limited growth, with average GDP per capita remaining under $5,000 in 2007.

Life expectancy increased across all continents during the same period. Europe and Oceania maintained high levels throughout, rising from around 64 years in 1952 to above 75 and 80 years respectively in 2007. Asia and the Americas started lower but made substantial gains. By 2007, both had converged toward the higher levels seen in Europe and Oceania, despite GDP per capita remaining considerably lower.

Africa followed a more complex path. Life expectancy improved from a low base until the late 1980s, stagnated or declined during the 1990s, and began to rise again after 2002. Despite recent improvements, Africa remained well behind other continents by 2007.

Together, the plots indicate a correlation between GDP per capita and life expectancy. Both metrics increase over the same time period, and continents with higher average GDP per capita have higher life expectancies.

Principal Component Analysis

GDP per Capita

The first technique used to explore the dataset is Principal Component Analysis, which identifies linear combinations of variables that explain the greatest variance. The correlation matrix (scale invariant) was used due to large differences in scale across GDP values, ensuring the analysis captures relative patterns rather than absolute magnitudes.
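As an illustration of PCA on the correlation matrix, the following is a minimal sketch on synthetic data. The function name `pca_correlation` and the simulated 50 x 12 matrix are assumptions made for illustration; they are not the coursework dataset or the code used to produce the results reported here.

```python
import numpy as np

def pca_correlation(X):
    """PCA via eigendecomposition of the correlation matrix (rows = countries)."""
    # Standardise each column so every year has mean 0 and variance 1
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()       # proportion of variance per PC
    scores = Z @ eigvecs                      # PC scores for each row
    return eigvecs, explained, scores

# Synthetic stand-in: 50 "countries", 12 "years" sharing a common level
rng = np.random.default_rng(0)
level = rng.normal(size=(50, 1))
X = 1000 * np.exp(level + 0.1 * rng.normal(size=(50, 12)))

loadings, explained, scores = pca_correlation(X)
print(explained[:2].round(4))
```

Multiplying any column of `X` by a positive constant leaves `explained` unchanged, which is the scale invariance referred to above. The sign of each loading vector is arbitrary, so a component may appear with all signs reversed in a different implementation.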

Principal Components for GDP per Capita
Year	PC1	PC2
1952	0.2788176	0.3524255
1957	0.2838975	0.3445207
1962	0.2910906	0.2972751
1967	0.2908360	0.2687599
1972	0.2932835	0.1687753
1977	0.2897775	0.1285789
1982	0.2947620	0.0171118
1987	0.2963891	-0.1364208
1992	0.2921932	-0.2733420
1997	0.2872139	-0.3614618
2002	0.2843569	-0.3906319
2007	0.2808892	-0.4178666

The first principal component (PC1) captures the overall level of economic development. It explains a large proportion of the variance (91.85%). Countries with large positive PC1 scores consistently had higher GDP per capita across the entire period (1952–2007).

The second principal component (PC2) reflects trends in GDP per capita over time and explains a much smaller proportion of variance (5.29%). A large positive PC2 score suggests that a country hasn’t experienced economic growth since the 1950s and 1960s or has grown more slowly relative to others. In contrast, countries with large negative PC2 scores experienced faster growth, increasing their GDP per capita relative to the global average.

The scree plot compares the proportion of variance that each component accounts for. The first two principal components account for 97.14% of the variance, indicating that retaining just these two is sufficient. The first component likely explains such a high proportion of the variance because of the high absolute values of GDP per capita of some highly developed countries.

In this analysis, a selection of countries was chosen to reflect a broad geographic, economic and cultural spectrum. The goal was to include countries from each continent and with varying GDP per capita and Life Expectancy trends between 1952 and 2007.

Switzerland has a high PC1 and positive PC2 score, indicating consistently high GDP and slower relative growth. Singapore, by contrast, combines a high PC1 with a strongly negative PC2 score, reflecting rapid economic growth from a lower starting point.

Countries like Rwanda, India, and Guatemala cluster on the left side of the plot, with low PC1 scores indicating lower GDP per capita. Their PC2 scores are closer to zero, suggesting growth broadly in line with the global average. Wealthier nations—such as the US, UK, and New Zealand—appear to the right with high PC1 scores and modest PC2 values, reflecting steady growth.

Life Expectancy

As in the previous analysis, PCA was applied to the life expectancy data using the correlation matrix, for the same reason.

Principal Components for Life Expectancy
Year	PC1	PC2
1952	0.2838333	0.3432747
1957	0.2881767	0.3182007
1962	0.2898994	0.2956790
1967	0.2937496	0.2410211
1972	0.2956403	0.1794977
1977	0.2949837	0.1140951
1982	0.2969461	0.0261266
1987	0.2954553	-0.0739168
1992	0.2877486	-0.2423253
1997	0.2848872	-0.3709769
2002	0.2773908	-0.4346325
2007	0.2743489	-0.4458244

PC1 captures the overall level of life expectancy across the time period and explains a large proportion of variance (92.3%). Countries with high PC1 scores tend to have consistently high life expectancy from 1952 to 2007.

PC2 represents the rate of change in life expectancy. A large positive PC2 score indicates that a country’s life expectancy either declined or improved more slowly than average since 1952. Conversely, a strongly negative PC2 score points to substantial improvement - countries that started with lower life expectancy in 1952 and made significant progress by 2007.

As with GDP per capita, the first two principal components account for a considerable proportion of the variance (97.94%), so retaining just these two is sufficient.

Rwanda is an outlier with low PC1 and high PC2, indicating persistently low life expectancy and a period of decline or stagnation. Oman, near the bottom of the plot, has an average life expectancy close to the global mean but an extremely negative PC2 score, suggesting one of the sharpest improvements in life expectancy across the time span.

Most African countries lie to the left, but are dispersed along PC2, highlighting varied experiences - some showing strong gains, others stagnation. Countries with consistently high life expectancy (e.g. European countries, New Zealand, US) appear on the right with moderate PC2 values, indicating steady but less dramatic improvement.

Comparing GDP per Capita PCA with Life Expectancy PCA

The next plot compares the GDP per capita and life expectancy PC1 scores for different countries.

Overall, there is a clear positive association: countries with higher GDP per capita PC1 scores tend to have higher life expectancy PC1 scores. However, this relationship appears to be nonlinear, potentially logarithmic, with rapid gains in life expectancy at lower GDP levels that gradually taper off as income increases. Further analysis would be necessary to confirm this.

A few notable outliers are present. Saudi Arabia, for example, has a relatively high GDP per capita PC1 score but a much lower life expectancy PC1 score than other countries with similar economic performance, such as the United Kingdom.

As expected, many African countries cluster in the bottom-left quadrant. In contrast, European nations dominate the upper-middle and top-right regions of the plot.

Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a statistical technique used to examine the relationships between two sets of variables. It finds linear combinations, called canonical variables, from each set that are maximally correlated with each other. In this analysis, the x-variables relate to the logarithm of GDP per capita, while the y-variables consist of life expectancy indicators.

The logarithmic transformation is applied to GDP per capita to reduce skewness and stabilize variance, improving the suitability of the variables for correlation-based methods.
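The effect of the transformation on skewness can be checked with a short sketch. The data below are simulated from a log-normal distribution as a stand-in for GDP per capita; the `skewness` helper and all parameter values are assumptions for illustration only.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed standardised values."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# Synthetic right-skewed "GDP per capita" values (not the coursework data)
rng = np.random.default_rng(1)
gdp = np.exp(rng.normal(loc=8.0, scale=1.0, size=2000))

raw_skew = skewness(gdp)          # strongly positive for right-skewed data
log_skew = skewness(np.log(gdp))  # close to zero after the transformation
print(round(raw_skew, 2), round(log_skew, 2))
```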

The canonical variables (η₁ for GDP per capita and ψ₁ for life expectancy) are computed for each country by projecting their centered values onto the respective canonical vectors. This gives us a pair of values per country that reflect the dominant relationship captured by CCA.
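The projection described above can be sketched as follows, using the standard construction of the first canonical pair from the singular value decomposition of the whitened cross-covariance matrix. The function names and the simulated data are assumptions; with twelve highly correlated yearly columns the covariance matrices may be close to singular, so a real analysis may need more care than this sketch takes.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def first_canonical_pair(X, Y):
    """Return (eta, psi, rho): first canonical variables and their correlation."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)   # centre both sets
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, d, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    a = Sxx_is @ U[:, 0]      # canonical vector for the x-variables
    b = Syy_is @ Vt[0]        # canonical vector for the y-variables
    return Xc @ a, Yc @ b, d[0]

# Synthetic stand-in for the two variable sets (not the coursework data)
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
Y = X @ rng.normal(size=(3, 3)) + 0.5 * rng.normal(size=(200, 3))

eta, psi, rho = first_canonical_pair(X, Y)
print(round(float(rho), 3))
```

The canonical vectors are only defined up to a joint change of sign, so the direction of the axes in a CCA scatter plot carries no meaning on its own.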

The CCA scatter plot confirms a strong positive correlation between η₁ and ψ₁, as expected. Countries with higher life expectancy and GDP per capita, such as those in Europe, Oceania, and the United States, tend to cluster in the bottom-left of the plot. In contrast, countries with lower development metrics, including many from Africa and parts of Asia, are found in the top-right.

This pattern suggests that the first pair of canonical variables captures a development gradient: countries with better socio-economic and health indicators score low on both η₁ and ψ₁, while those with poorer conditions score high.

Using log(GDP) in the CCA, rather than the raw GDP used in the PCA, changes the scale of economic variation, placing greater emphasis on relative differences among lower and middle income countries. While PCA on GDP was dominated by high income nations due to large absolute values, the log transformation in CCA allows for a more balanced correlation structure, making patterns across all income levels more interpretable. This likely contributes to the strong and consistent relationship observed between η₁ and ψ₁.

Multi-Dimensional Scaling

Multi-Dimensional Scaling (MDS) is a dimension reduction technique that aims to represent high-dimensional data in a lower-dimensional space while preserving the dissimilarities between data points as closely as possible. In this context, it allows us to visually explore similarities and differences between countries across various socio-economic indicators, based on a distance matrix calculated from the transformed dataset.
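A minimal sketch of classical (metric) MDS is given below, assuming Euclidean distances. Whether the analysis here used classical MDS or another variant is not stated, so this is illustrative only; the function name and the simulated points are hypothetical.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed n points in k dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]        # k largest eigenvalues
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Ten synthetic points in the plane and their pairwise Euclidean distances
rng = np.random.default_rng(3)
P = rng.normal(size=(10, 2))
D = np.linalg.norm(P[:, None] - P[None, :], axis=2)

Y = classical_mds(D, k=2)
D_hat = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)
print(np.allclose(D, D_hat))
```

The recovered coordinates reproduce the distances but are only determined up to rotation and reflection, so an MDS configuration can appear flipped relative to a PCA plot of the same data.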

The MDS plot closely resembles the PCA life expectancy plot, although it appears flipped along the x-axis, so the vertical direction is reversed. This similarity suggests that the first MDS dimension (x-axis) captures variation closely related to overall life expectancy, much like the first principal component in the PCA.

The second dimension (y-axis) captures something similar to the rate of change in life expectancy. Because the plot is flipped relative to the PCA, a large positive y value indicates that a country has increased its life expectancy significantly. As in the PCA, Oman is the most notable example; Saudi Arabia also scores highly on the y-axis, indicating particularly rapid improvement relative to its peers.

Interestingly, GDP per capita is not directly separable in this MDS representation, implying that much of the information contained in GDP per capita is indirectly reflected through life expectancy. This reinforces the strong correlation between GDP per capita and life expectancy observed in earlier analyses.

In summary, MDS supports the patterns observed in PCA and CCA, providing an alternative yet consistent view of global development variation across countries.

Question 2

A number of classification techniques are explored in this second question. Both supervised and unsupervised learning techniques are utilized.

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a supervised learning technique used for classification tasks. In this analysis, LDA is used to build a classifier that predicts a country’s continent based on its GDP, life expectancy, and population from 1952 to 2007.

To obtain a reliable estimate of the model’s predictive performance, 10-fold cross-validation was applied. Each country is used in the test set exactly once, so every prediction is made by a model that did not see that country during training.
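The fold structure can be sketched in a few lines. The total of 141 countries comes from the continent counts in Question 1; the helper name, the seed, and the way the folds are formed are assumptions for illustration.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(141)   # 52 + 25 + 32 + 30 + 2 countries
print([len(f) for f in folds])
```

Each index appears in exactly one fold, so each country is predicted once by a model fitted to the other nine folds.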

The accuracy of the LDA model is 0.66.

LDA Confusion Matrix
Actual \ Predicted	Africa	Americas	Asia	Europe	Oceania
Africa	43	3	6	0	0
Americas	1	15	5	2	2
Asia	8	7	14	3	0
Europe	0	2	3	21	4
Oceania	0	1	0	1	0

Africa and Europe are classified with relatively high accuracy, suggesting these continents have more distinct combinations of GDP, life expectancy, and population. Meanwhile the classifier struggles to distinguish between Asia and the Americas, suggesting overlapping patterns in the input variables.
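The reported accuracy can be reproduced directly from the confusion matrix. The sketch below assumes that rows are the actual continents and columns the predicted ones, which is consistent with the row totals matching the continent counts in Question 1.

```python
continents = ["Africa", "Americas", "Asia", "Europe", "Oceania"]

# LDA confusion matrix, rows assumed to be actual, columns predicted
confusion = [
    [43, 3, 6, 0, 0],
    [1, 15, 5, 2, 2],
    [8, 7, 14, 3, 0],
    [0, 2, 3, 21, 4],
    [0, 1, 0, 1, 0],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(continents)))
accuracy = correct / total

# Proportion of each continent's countries that were classified correctly
recall = {c: confusion[i][i] / sum(confusion[i]) for i, c in enumerate(continents)}

print(correct, total, round(accuracy, 2))
print({c: round(r, 2) for c, r in recall.items()})
```

Under this reading, 93 of 141 countries are classified correctly. Africa and Europe have the highest per-continent rates, Asia the lowest among the larger continents, and neither Oceanian country is classified correctly.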

Clustering

Clustering is a form of unsupervised learning, where the data is grouped based on its inherent structure without using predefined labels - in this case, continent information is not used in training. Several clustering techniques are explored below. Note that all the data was scaled before these techniques were applied.

K-Means

K-means partitions the data into k groups by minimizing within-cluster variance. While five continents are present in the data, this does not necessarily imply that k = 5 is optimal, particularly given the size and feature overlap of certain continents such as Oceania.

Using the elbow method, the optimal number of clusters appears to be k = 3, as the within-cluster sum of squares drops sharply from k = 2 to k = 3, with diminishing returns thereafter.
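The elbow method can be illustrated with a small sketch of Lloyd's algorithm on synthetic data with three well-separated groups. The deterministic farthest-point initialisation, the function name, and the simulated data are assumptions chosen to keep the example reproducible; they are not the settings used for the results above.

```python
import numpy as np

def kmeans_wcss(X, k, n_iter=50):
    """Lloyd's algorithm; returns labels and within-cluster sum of squares."""
    # Farthest-point initialisation: start at X[0], then repeatedly add the
    # point farthest from all centres chosen so far
    centres = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                 # assign to nearest centre
        for j in range(k):
            if np.any(labels == j):               # keep old centre if empty
                centres[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    return labels, float(d[np.arange(len(X)), labels].sum())

# Three synthetic groups of 30 points each (not the coursework data)
rng = np.random.default_rng(4)
true_centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in true_centres])

wcss = {k: kmeans_wcss(X, k)[1] for k in range(1, 7)}
for k, value in wcss.items():
    print(k, round(value, 1))
```

In this example the within-cluster sum of squares falls sharply up to k = 3 and changes little afterwards, which is the pattern the elbow method looks for.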

To visualize the clustering in two dimensions, PCA was applied to project the data onto the first two principal components. The resulting plot shows three distinct clusters.

K-Means Confusion Matrix (K=3)
Continent	Cluster 1	Cluster 2	Cluster 3
Africa	42	10	0
Americas	2	21	2
Asia	10	16	6
Europe	0	12	18
Oceania	0	0	2

Africa: A large majority (42 of 52) were assigned to Cluster 1, indicating strong within-region similarity in feature space.

Americas: Most (21 of 25) were grouped into Cluster 2, indicating strong within-region similarity, although Cluster 2 also contains many countries from Asia (16), Europe (12), and Africa (10).

Asia: Countries are spread across all clusters, reflecting high variability.

Europe: Mostly split between Clusters 2 and 3, indicating internal diversity.

Oceania: Both countries are placed in Cluster 3, implying similarity with countries in that group.

K-means clustering produced relatively well-separated groups that are clearly associated with certain regions (e.g., Africa and the Americas), but struggled with more heterogeneous regions like Asia and Europe.

Multivariate Gaussian Clustering

Multivariate Gaussian clustering is a type of model-based clustering in which each sub-population is assumed to be multivariate Gaussian. Three clusters are assumed as before.

PCA was again used for visualization. Compared to k-means, the clusters appear less well-separated.

Multivariate Gaussian Clustering Confusion Matrix
Continent	Cluster 1	Cluster 2	Cluster 3
Africa	6	44	2
Americas	14	2	9
Asia	12	8	12
Europe	5	0	25
Oceania	0	0	2

Africa: Cluster 2 captures most countries (44 of 52), indicating high internal consistency.

Americas: More spread than under k-means, with 14 in Cluster 1 and 9 in Cluster 3.

Asia: Again distributed across all clusters, reinforcing the region’s heterogeneity.

Europe: More tightly grouped (25 of 30) in Cluster 3.

Oceania: Both countries remain in Cluster 3, consistent with the previous method.

Multivariate Gaussian clustering showed a stronger grouping for Europe and Africa but resulted in greater dispersion for the Americas.
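The comparison between the two methods can be made concrete by computing, for each continent, the share of its countries that fall in its most common cluster. The counts below are copied from the two confusion matrices above; the helper name `dominant_share` is introduced here for illustration.

```python
# Cluster counts per continent, copied from the two tables above
kmeans = {
    "Africa": [42, 10, 0],
    "Americas": [2, 21, 2],
    "Asia": [10, 16, 6],
    "Europe": [0, 12, 18],
    "Oceania": [0, 0, 2],
}
gaussian = {
    "Africa": [6, 44, 2],
    "Americas": [14, 2, 9],
    "Asia": [12, 8, 12],
    "Europe": [5, 0, 25],
    "Oceania": [0, 0, 2],
}

def dominant_share(table):
    """Share of each continent's countries in its most common cluster."""
    return {c: max(counts) / sum(counts) for c, counts in table.items()}

km_share = dominant_share(kmeans)
gm_share = dominant_share(gaussian)
for c in kmeans:
    print(c, round(km_share[c], 3), round(gm_share[c], 3))
```

The shares rise for Africa and Europe under the Gaussian model and fall for the Americas, in line with the summary above.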

Group Average Clustering

Group average clustering is an agglomerative hierarchical method that begins with each data point as its own cluster and successively merges the closest clusters. It uses the average distance between all points in two clusters to decide which ones to merge. This approach strikes a balance between single linkage (minimum distance) and complete linkage (maximum distance). While both of those methods were tested, group average produced more effective results.
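The merging rule can be sketched on a toy one-dimensional example. The function below is a deliberately simple, unoptimised implementation written for illustration, and the eight values are made up so that one point sits apart from two tight groups.

```python
def group_average_clusters(points, n_clusters):
    """Agglomerative clustering with group-average linkage on 1-D points."""
    clusters = [[i] for i in range(len(points))]   # start with singletons

    def avg_dist(a, b):
        # Average of all pairwise distances between two clusters
        return sum(abs(points[i] - points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [
            (avg_dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)                       # closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(points[i] for i in c) for c in clusters]

toy = [0, 1, 2, 3, 8, 30, 31, 32]
three = group_average_clusters(toy, 3)
two = group_average_clusters(toy, 2)
print(three)
print(two)
```

With three clusters the isolated value forms a cluster of its own, and it is only absorbed into the nearer group when the tree is cut at two clusters. This is the kind of behaviour that can make a cut with more clusters less informative when outliers are present.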

Due to the large number of data points, the resulting dendrogram is difficult to interpret. However, it shows that at three clusters Saudi Arabia, already highlighted as an outlier in the PCA, separates out on its own. As such, using two clusters provides a more meaningful grouping in this context.

Group Average Confusion Matrix
Continent	Cluster 1	Cluster 2
Africa	52	0
Americas	23	2
Asia	26	6
Europe	12	18
Oceania	0	2

Cluster 1 comprises all 52 African countries, along with the majority of countries from the Americas (23 of 25) and Asia (26 of 32). It also includes a subset of European countries (12 of 30). Based on prior analyses, this cluster represents countries with comparatively lower life expectancies and GDP per capita.

Cluster 2 contains the remaining countries and is primarily composed of European nations, as well as both Oceanian countries. Inspection of the dendrogram confirms that this cluster includes high-income countries such as the United States, Switzerland, New Zealand, and Singapore. These countries are characterized by higher GDP per capita and longer life expectancies.

Group average clustering successfully identified two broad clusters based on socio-economic disparities. Nevertheless, applying this method to derive a larger number of clusters led to highly fragmented solutions, often isolating individual countries into their own clusters. As a result, group average linkage was less successful in providing a meaningful multi-cluster structure compared to the alternative techniques investigated.

Linear Modelling

This section investigates whether a country’s life expectancy in 2007 can be predicted from its GDP per capita over the previous 55 years, using linear modelling techniques.

Principal Component Regression

Principal Component Regression (PCR) is a dimensionality reduction technique in which the response variable is regressed on the principal components of the predictors, rather than on the original variables. This is particularly useful when the predictors are highly correlated, as is the case with GDP per capita across time.
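The idea can be sketched as follows on synthetic, highly correlated predictors. Standardising the predictors before extracting components is an assumption of this sketch, as are the function names and the simulated data. The errors printed are training errors, not the cross-validated RMSEP used in the analysis.

```python
import numpy as np

def pcr_predict(X_train, y_train, X_new, n_components):
    """Regress y on the first n_components principal component scores."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    Z = (X_train - mu) / sd                        # standardised predictors
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T                        # leading PC directions
    scores = Z @ V
    y_mean = y_train.mean()
    gamma, *_ = np.linalg.lstsq(scores, y_train - y_mean, rcond=None)
    return y_mean + ((X_new - mu) / sd) @ V @ gamma

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Six correlated synthetic predictors driven by one latent level
rng = np.random.default_rng(5)
level = rng.normal(size=(80, 1))
X = level + 0.3 * rng.normal(size=(80, 6))
y = 60 + 5 * level[:, 0] + rng.normal(size=80)

errors = {k: rmse(y, pcr_predict(X, y, X, k)) for k in range(1, 7)}
for k, e in errors.items():
    print(k, round(e, 3))
```

When all components are retained, PCR reproduces the ordinary least squares fit, so any benefit comes from discarding the low-variance components. A log-transformed model can be sketched by passing `np.log(X)` for strictly positive predictors.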

The first model is trained using the raw GDP per capita values. To evaluate the model’s performance, cross-validation is used. The figure below shows the RMSEP (Root Mean Squared Error of Prediction) across different numbers of components.

From the plot, retaining two components minimises prediction error before it begins to increase again. The corresponding adjusted RMSEP is 8.9436. This indicates that the first two principal components capture the majority of the predictive information available in the raw GDP data.

To account for the non-linear relationship between income and life expectancy, the second model uses the logarithm of GDP per capita as predictors.

The RMSEP plot for this model suggests that three components yield the lowest prediction error. The adjusted RMSEP in this case is 7.33, which is notably lower than the previous model. This suggests that log-transforming GDP per capita improves the model’s predictive accuracy, likely because it better reflects the nonlinear nature of the GDP-life expectancy relationship.

Ridge Regression

Ridge regression is a shrinkage method used to address multicollinearity and overfitting in linear models. It does this by introducing a penalty term to the least squares loss function, which shrinks the regression coefficients towards zero. Specifically, ridge regression minimizes the residual sum of squares plus a penalty proportional to the squared Euclidean norm of the coefficients.
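For standardised predictors and a centred response, the ridge estimate has a closed form, which the following sketch computes directly. The function name and the simulated data are assumptions. Software packages may scale the loss differently, so penalty values are not necessarily comparable across implementations.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge estimate on standardised X and centred y."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yc = y - y.mean()
    p = Z.shape[1]
    # Solve (Z'Z + lam * I) beta = Z'y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)

# Six correlated synthetic predictors (not the coursework data)
rng = np.random.default_rng(6)
level = rng.normal(size=(80, 1))
X = level + 0.3 * rng.normal(size=(80, 6))
y = 60 + 5 * level[:, 0] + rng.normal(size=80)

penalties = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_coefficients(X, y, lam))) for lam in penalties]
print([round(v, 3) for v in norms])
```

With a penalty of zero the estimate coincides with ordinary least squares, and the norm of the coefficient vector decreases as the penalty grows.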

This regularization helps reduce variance in the predictions, especially when predictor variables are highly correlated - as is the case with historical GDP per capita.

To select an appropriate value of the tuning parameter λ, which controls the strength of the penalty, cross-validation is applied.

The cross-validation plot shows that prediction error is lowest around log(λ) ≈ -0.5. The value of λ that minimizes the error is λ = 0.63, while the largest value within one standard error of the minimum is λ = 19.95. This range provides flexibility in balancing bias and variance, depending on the goal of the analysis.
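The selection of the two values can be sketched as follows. This assumes the common convention of reporting the penalty with the lowest mean cross-validated error together with the largest penalty whose error lies within one standard error of that minimum. The grid, the number of folds, the function names, and the simulated data are all assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Fit ridge on standardised X; return what is needed to predict."""
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ (y - y.mean()))
    return mu, sd, y.mean(), beta

def cv_ridge(X, y, lambdas, k=10, seed=0):
    """k-fold CV over a grid; returns (lam_min, lam_1se, mean CV error)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    mse = np.zeros((k, len(lambdas)))
    for f, test in enumerate(folds):
        train = np.setdiff1d(idx, test)            # all indices not in the fold
        for j, lam in enumerate(lambdas):
            mu, sd, y0, beta = ridge_fit(X[train], y[train], lam)
            pred = y0 + ((X[test] - mu) / sd) @ beta
            mse[f, j] = np.mean((y[test] - pred) ** 2)
    mean = mse.mean(axis=0)
    se = mse.std(axis=0, ddof=1) / np.sqrt(k)      # standard error across folds
    best = int(np.argmin(mean))
    within = mean <= mean[best] + se[best]
    lam_min = float(lambdas[best])
    lam_1se = float(max(lam for lam, ok in zip(lambdas, within) if ok))
    return lam_min, lam_1se, mean

# Synthetic correlated predictors and a hypothetical penalty grid
rng = np.random.default_rng(7)
level = rng.normal(size=(80, 1))
X = level + 0.3 * rng.normal(size=(80, 6))
y = 60 + 5 * level[:, 0] + rng.normal(size=80)
lambdas = 10.0 ** np.arange(-2, 3.5, 0.5)

lam_min, lam_1se, cv_error = cv_ridge(X, y, lambdas)
print(lam_min, lam_1se)
```

By construction the second value is never smaller than the first, so choosing it trades a small increase in estimated error for a more heavily regularised model.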