Homework 4

Sara Bračun Duhovnik

About the data:

I decided to analyse data on socio-economic and health factors that determine the overall development of the countries. I selected the dataset as a source for my analysis on Kaggle.com (https://www.kaggle.com/datasets/vipulgohel/clustering-pca-assignment). I cleaned the data and made a factor variable.

data(package = .packages(all.available = TRUE))
library(readxl)
mydata <- read_excel("~/Desktop/country_data.xlsx")
colnames(mydata) <- c("ID", "Country", "ChildMort", "Exports", "Imports", "Income", "Inflation", "GDP")
head(mydata, 15)
## # A tibble: 15 × 8
##       ID Country      ChildMort Exports Imports Income Inflation   GDP
##    <dbl> <chr>            <dbl>   <dbl>   <dbl>  <dbl>     <dbl> <dbl>
##  1     1 Afghanistan       90.2    10      44.9   1610     9.44    553
##  2     2 Albania           16.6    28      48.6   9930     4.49   4090
##  3     3 Algeria           27.3    38.4    31.4  12900    16.1    4460
##  4     4 Angola           119      62.3    42.9   5900    22.4    3530
##  5     5 Antigua and…      10.3    45.5    58.9  19100     1.44  12200
##  6     6 Argentina         14.5    18.9    16    18700    20.9   10300
##  7     7 Armenia           18.1    20.8    45.3   6700     7.77   3220
##  8     8 Australia          4.8    19.8    20.9  41400     1.16  51900
##  9     9 Austria            4.3    51.3    47.8  43200     0.873 46900
## 10    10 Azerbaijan        39.2    54.3    20.7  16000    13.8    5840
## 11    11 Bahamas           13.8    35      43.7  22900    -0.393 28000
## 12    12 Bahrain            8.6    69.5    50.9  41100     7.44  20700
## 13    13 Bangladesh        49.4    16      21.8   2440     7.14    758
## 14    14 Barbados          14.2    39.5    48.7  15300     0.321 16000
## 15    15 Belarus            5.5    51.4    64.5  16200    15.1    6030

Description:

  • Unit of observation: one country
  • Sample size: 167 observations
  • Number of variables: 6

Definitions of all variables:

  • ChildMort: death of children under five years of age (per 1000 live births)
  • Exports: exports of goods and services (% of the total GDP)
  • Imports: imports of goods and services (% of the total GDP)
  • Income: net income per person (in $)
  • Inflation: inflation rates (in %)
  • GDP: the measurement of the annual growth rate of the total GDP (in %)
library(psych)
describe(mydata[ , c(-1, -2)])
##           vars   n     mean       sd  median  trimmed      mad    min
## ChildMort    1 167    38.27    40.33   19.30    31.59    21.94   2.60
## Exports      2 167    41.76    27.72   35.40    38.22    20.90   2.20
## Imports      3 167    46.89    24.21   43.30    44.34    21.05   0.07
## Income       4 167 17144.69 19278.07 9960.00 13807.63 11638.41 609.00
## Inflation    5 167     7.16     7.48    5.14     6.15     5.66  -4.21
## GDP          6 167 12964.16 18328.70 4660.00  9146.43  5814.76 231.00
##                max     range skew kurtosis      se
## ChildMort    208.0    205.40 1.42     1.62    3.12
## Exports      200.0    197.80 2.36     9.05    2.15
## Imports      174.0    173.93 1.87     6.41    1.87
## Income    125000.0 124391.00 2.19     6.67 1491.78
## Inflation     45.9     50.11 1.78     4.91    0.58
## GDP       105000.0 104769.00 2.18     5.23 1418.32
Interpretation:

Exports have the largest positive skew therefore it is skewed to right. The average annual income in observed countries is 17.144,69 $. The lowest number of imports in the sample data was 0.07% of total GDP.

Research question 1: How the annual GDP growth rate, children mortality, exports, imports, income, inflation and life expectancy interract and influence overall economic development?

To make the PCA, I have to make some checks:

  • there has to be a sufficiently large correlation between the variables - the higher the correlation, the more successful the grouping (should be at least 0.3)
  • data has to be suitable: KMO or MSA statistic should be at least 0.5 - higher it is, more information you will retain when joining variables in a PCA (if lower than 0.5, data can be still used in PCA, because it is justa guideline for PCA)
  • Bartlett’s test of sphericity
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
fit <- lm (GDP ~ ChildMort + Exports + Imports + Income + Inflation,
          data = mydata)
vif(fit)
## ChildMort   Exports   Imports    Income Inflation 
##  1.418941  2.889246  2.340056  1.903555  1.159029
mean(vif(fit))
## [1] 1.942166

With VIF statistics I checked for multicoliniarity. Even if it is below 5 for all, the mean is quite high and shows high correlation between variables. I decided to join them with PCA.

mydata_PCA <- mydata[ , c("ChildMort", "Exports", "Imports", "Income")]
R <- cor(mydata_PCA)
round(R, 2)
##           ChildMort Exports Imports Income
## ChildMort      1.00   -0.30   -0.13  -0.52
## Exports       -0.30    1.00    0.68   0.49
## Imports       -0.13    0.68    1.00   0.12
## Income        -0.52    0.49    0.12   1.00
library(psych)
corPlot(R)

The table and the plot both show correlation between variables. From them I can see that only correlations between Imports and child mortality and income are below 0.3 (the higher the better). But because it has a high correlation with Exports, I will leave it in the dataset for further analysis.

mydata_PCA <- mydata[ , c("ChildMort", "Exports", "Imports", "Income")]
cortest.bartlett(R, n = nrow(mydata_PCA))
## $chisq
## [1] 222.6528
## 
## $p.value
## [1] 2.828364e-45
## 
## $df
## [1] 6
  • H0: P = I
  • H1: P ≠ I
  • p-value < 0.001

Based on the sample data, I can reject H0 at p-value < 0.001, which means that population correlation matrix and identity matrix are not the same. In other words, at least one of the correlation coefficients differs from 0. This means that we can do PCA.

det(R)
## [1] 0.2569125

This is checked just for FA, not for PCA.

R <- cor(mydata_PCA)
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA =  0.51
## MSA for each item = 
## ChildMort   Exports   Imports    Income 
##      0.65      0.51      0.44      0.49

MSA is above 0.5 so the requirement is met. Imports have the least common information with others and cause a lower MSA, because it is the least correlated with other variables. In other words, if we join all five variables in one or two PCs, the most information will be lost for Imports. For Imports and Income it is not above 0.5 which means that if I join all five in one or two PCs, the most information will be lost for Imports and Income. Based on this, data is not the most appropriate for PCA but I will still continue with the analysis.

#install.packages("FactoMineR")
library(FactoMineR)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
components <- PCA(mydata_PCA,
                  scale.unit = TRUE,
                  graph = FALSE)

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue(components)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.1528976        53.822440                    53.82244
## Dim.2  1.1291555        28.228887                    82.05133
## Dim.3  0.5112175        12.780437                    94.83176
## Dim.4  0.2067294         5.168236                   100.00000

This is a PCA on standardized variables. Eigenvalues show the amount of variability explained by each PC and indicate the proportion of the total variance in the dataset that is determined by that PC. If we sum all eigenvalues we get 4 (same as the number of variables), because they are standardised and their variance is 1, therefore all together represent total information.

Explanations:

  • variance of the second PC is 1.13
  • 53.82% of information of measured variables is transfered to PC1
fviz_eig(components,
         choice = "eigenvalue",
         main = "Scree plot",
         ylab = "Eigenvalue",
         xlab = "Principal component",
         addlabels = TRUE)

library(psych)
fa.parallel(mydata_PCA,
            sim = FALSE,
            fa = "pc")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs =
## np.obs, : The estimated weights for the factor scores are probably
## incorrect.  Try a different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : An ultra-Heywood case was detected.  Examine the results
## carefully

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

Based on Parallel analysis and Scree plot above I can see that I should include 2 PCs. From Scree plot, I should look at where is the most evident break and then subtract 1 (most evident break is at third component). In Parallel analysis you should take as many components as long as the empirical eigenvalues are higher than theoretical ones. From the eigenvalues above I can also see that I should include 2 PCs, because based on Kaiser’s rule, you should make as many PCs as there is eigenvalues above 1 (Dim1 and Dim2).

Now I will perform a PCA again and include first two PCs.

components <- PCA(mydata_PCA,
                  scale.unit = TRUE,
                  graph = FALSE,
                  ncp = 2)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 167 individuals, described by 4 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
print(components$var$cor)
##                Dim.1      Dim.2
## ChildMort -0.6347827  0.5808986
## Exports    0.8750028  0.3223674
## Imports    0.6666630  0.6694831
## Income     0.7347646 -0.4894731

This is a matrix of rescaled coefficients.

Explanations:

  • 74.6% of information was transfered from Child mortality to PC1 and PC2.
  • 25.4% of information from Child mortality was lost
  • Variance of PC2 can be calculated also here by summing the squares of rescaled coefficients for Dim2 (1.13)
print(components$var$contrib)
##              Dim.1     Dim.2
## ChildMort 18.71659 29.884567
## Exports   35.56277  9.203402
## Imports   20.64378 39.694055
## Income    25.07686 21.217975

This matrix shows how different variables contributed to PCs.

Explanations:

  • Income contributed 25.08% when forming PC1
  • Exports contributed 9.2% when forming PC2
library(factoextra)
fviz_pca_var(components,
             repel = TRUE)

fviz_pca_biplot(components)

Above we can see that rescaled coefficient for Child mortality and PC1 is negative, which indicates that a lower Child mortality has a positive effect on PC1 and makes it higher. On the other hand, if Income, Exports and Imports get higher, the PC1 also gets higher. Based on that I would say that PC1 measures general development of a country. PC2 on the other hand measures contrast between trade (Import and Exports) and social welfare (Income and Child mortality).

For example, country ID 132 (Seychelles) has above average in general development and has more developed trade. Country ID 67 (Haiti) is below average in development and has also more developed trade. Country ID 78 (Japan) is below average in development and has more developed social welfare.