Homework 4

Sara Bračun Duhovnik

About the data:

I decided to analyse data on socio-economic and health factors that determine the overall development of countries. The dataset comes from Kaggle.com (https://www.kaggle.com/datasets/vipulgohel/clustering-pca-assignment). I cleaned the data and made a factor variable.

data(package = .packages(all.available = TRUE))
library(readxl)
mydata <- read_excel("~/Desktop/country_data.xlsx")
colnames(mydata) <- c("ID", "Country", "ChildMort", "Exports", "Imports", "Income", "Inflation", "GDP")
head(mydata, 15)
## # A tibble: 15 × 8
##       ID Country      ChildMort Exports Imports Income Inflation   GDP
##    <dbl> <chr>            <dbl>   <dbl>   <dbl>  <dbl>     <dbl> <dbl>
##  1     1 Afghanistan       90.2    10      44.9   1610     9.44    553
##  2     2 Albania           16.6    28      48.6   9930     4.49   4090
##  3     3 Algeria           27.3    38.4    31.4  12900    16.1    4460
##  4     4 Angola           119      62.3    42.9   5900    22.4    3530
##  5     5 Antigua and…      10.3    45.5    58.9  19100     1.44  12200
##  6     6 Argentina         14.5    18.9    16    18700    20.9   10300
##  7     7 Armenia           18.1    20.8    45.3   6700     7.77   3220
##  8     8 Australia          4.8    19.8    20.9  41400     1.16  51900
##  9     9 Austria            4.3    51.3    47.8  43200     0.873 46900
## 10    10 Azerbaijan        39.2    54.3    20.7  16000    13.8    5840
## 11    11 Bahamas           13.8    35      43.7  22900    -0.393 28000
## 12    12 Bahrain            8.6    69.5    50.9  41100     7.44  20700
## 13    13 Bangladesh        49.4    16      21.8   2440     7.14    758
## 14    14 Barbados          14.2    39.5    48.7  15300     0.321 16000
## 15    15 Belarus            5.5    51.4    64.5  16200    15.1    6030

Description:

  • Unit of observation: one country
  • Sample size: 167 observations
  • Number of variables: 6 numeric variables (plus ID and Country)

Definitions of all variables:

  • ChildMort: death of children under five years of age (per 1000 live births)
  • Exports: exports of goods and services (% of the total GDP)
  • Imports: imports of goods and services (% of the total GDP)
  • Income: net income per person (in $)
  • Inflation: inflation rates (in %)
  • GDP: GDP per capita (in $); the values in the data (for example 553 for Afghanistan and 51900 for Australia) are levels, not growth rates
library(psych)
describe(mydata[ , c(-1, -2)])
##           vars   n     mean       sd  median  trimmed      mad    min
## ChildMort    1 167    38.27    40.33   19.30    31.59    21.94   2.60
## Exports      2 167    41.76    27.72   35.40    38.22    20.90   2.20
## Imports      3 167    46.89    24.21   43.30    44.34    21.05   0.07
## Income       4 167 17144.69 19278.07 9960.00 13807.63 11638.41 609.00
## Inflation    5 167     7.16     7.48    5.14     6.15     5.66  -4.21
## GDP          6 167 12964.16 18328.70 4660.00  9146.43  5814.76 231.00
##                max     range skew kurtosis      se
## ChildMort    208.0    205.40 1.42     1.62    3.12
## Exports      200.0    197.80 2.36     9.05    2.15
## Imports      174.0    173.93 1.87     6.41    1.87
## Income    125000.0 124391.00 2.19     6.67 1491.78
## Inflation     45.9     50.11 1.78     4.91    0.58
## GDP       105000.0 104769.00 2.18     5.23 1418.32
Interpretation:

All six variables have a positive skew, so they are all skewed to the right. Exports has the largest skew (2.36) and child mortality the smallest (1.42). The average annual income in the observed countries is 17,144.69 $. In the observed sample, 50% of the countries have a child mortality of at most 19.3 deaths per 1000 live births, while the other 50% have a higher one.

Research question 1: How can GDP be explained by child mortality, exports, imports, income and inflation?

Before running the PCA, I have to make some checks:

  • there has to be a sufficiently large correlation between the variables: the higher the correlation, the more successful the grouping (it should be at least 0.3 in absolute value)
  • the data has to be suitable: the KMO or MSA statistic should be at least 0.5; the higher it is, the more information is retained when joining variables in a PCA
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
fit <- lm (GDP ~ ChildMort + Exports + Imports + Income + Inflation,
          data = mydata)
vif(fit)
## ChildMort   Exports   Imports    Income Inflation 
##  1.418941  2.889246  2.340056  1.903555  1.159029
mean(vif(fit))
## [1] 1.942166

With VIF statistics I checked for multicollinearity. Even though the VIF is below 5 for every variable, the mean VIF (1.94) is clearly above 1, which indicates some correlation between the explanatory variables. I therefore decided to join the variables with the highest VIFs (ChildMort, Exports, Imports and Income) with PCA.
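
To make the VIF figures less of a black box, here is a minimal sketch of how a VIF can be computed by hand: each explanatory variable is regressed on the others, and VIF = 1 / (1 - R²) of that auxiliary regression. The data below are simulated and the names `x1`, `x2`, `x3` are hypothetical; this is not the country dataset.

```r
# Sketch: manual VIF on simulated (hypothetical) data, base R only
set.seed(42)
n  <- 167
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
X  <- data.frame(x1, x2, x3)

# Regress each variable on all the others and turn R^2 into a VIF
vif_manual <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))
print(mean(vif_manual))
```

A VIF of exactly 1 means the variable is uncorrelated with the other regressors, which is why a mean VIF well above 1 is read as a sign of shared information.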

mydata_PCA <- mydata[ , c("ChildMort", "Exports", "Imports", "Income")]
R <- cor(mydata_PCA)
round(R, 2)
##           ChildMort Exports Imports Income
## ChildMort      1.00   -0.30   -0.13  -0.52
## Exports       -0.30    1.00    0.68   0.49
## Imports       -0.13    0.68    1.00   0.12
## Income        -0.52    0.49    0.12   1.00
library(psych)
corPlot(R)

The table and the plot both show the correlations between the variables. Only the correlations of Imports with ChildMort (-0.13) and with Income (0.12) are below 0.3 in absolute value (the higher the better). But because Imports has a high correlation with Exports (0.68), I will leave it in the dataset for further analysis.
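
The 0.3 check can also be done programmatically. The sketch below retypes the rounded correlation matrix printed above (so it is an approximation of `R`) and lists the pairs whose absolute correlation falls below the threshold.

```r
# Sketch: flag weakly correlated pairs in the rounded correlation matrix
vars <- c("ChildMort", "Exports", "Imports", "Income")
R_rounded <- matrix(c( 1.00, -0.30, -0.13, -0.52,
                      -0.30,  1.00,  0.68,  0.49,
                      -0.13,  0.68,  1.00,  0.12,
                      -0.52,  0.49,  0.12,  1.00),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(vars, vars))

# Look only at the upper triangle so each pair is counted once
idx  <- which(upper.tri(R_rounded) & abs(R_rounded) < 0.3, arr.ind = TRUE)
weak <- data.frame(var1 = vars[idx[, "row"]],
                   var2 = vars[idx[, "col"]],
                   r    = R_rounded[idx])
print(weak)
```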

mydata_PCA <- mydata[ , c("ChildMort", "Exports", "Imports", "Income")]
cortest.bartlett(R, n = nrow(mydata_PCA))
## $chisq
## [1] 222.6528
## 
## $p.value
## [1] 2.828364e-45
## 
## $df
## [1] 6
  • H0: P = I
  • H1: P ≠ I
  • p-value < 0.001

Based on the sample data, I can reject H0 at p-value < 0.001, which means that the population correlation matrix is not an identity matrix. In other words, at least one of the correlation coefficients differs from 0. This means that the data is suitable and I can do a PCA.
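
As a rough check of where the test statistic comes from, Bartlett's test of sphericity can be sketched in base R as chi² = -(n - 1 - (2p + 5)/6) · ln|R| with p(p - 1)/2 degrees of freedom. The matrix below is the rounded one printed earlier, so the result should only be close to, not identical with, the `cortest.bartlett()` output.

```r
# Sketch: Bartlett's test of sphericity by hand (rounded correlations)
vars <- c("ChildMort", "Exports", "Imports", "Income")
R_rounded <- matrix(c( 1.00, -0.30, -0.13, -0.52,
                      -0.30,  1.00,  0.68,  0.49,
                      -0.13,  0.68,  1.00,  0.12,
                      -0.52,  0.49,  0.12,  1.00),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(vars, vars))
n <- 167             # sample size reported in the description
p <- ncol(R_rounded) # number of variables

chisq   <- -(n - 1 - (2 * p + 5) / 6) * log(det(R_rounded))
df      <- p * (p - 1) / 2
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

print(c(chisq = chisq, df = df, p.value = p_value))
```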

R <- cor(mydata_PCA)
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA =  0.51
## MSA for each item = 
## ChildMort   Exports   Imports    Income 
##      0.65      0.51      0.44      0.49

Imports has the least information in common with the other variables (MSA = 0.44). In other words, if I join all four variables in the PCA, the most information will be lost for Imports. For Imports and Income (0.49) the MSA is below 0.5, so these two variables will lose the most information. The overall MSA (0.51) is just above the 0.5 threshold.
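
The MSA values compare ordinary correlations with partial correlations. The following is a sketch of that calculation under the usual KMO definition (sum of squared correlations divided by the sum of squared correlations plus squared partial correlations); it again uses the rounded correlation matrix, so small differences from the `KMO()` output are to be expected.

```r
# Sketch: KMO / MSA by hand from the rounded correlation matrix
vars <- c("ChildMort", "Exports", "Imports", "Income")
R_rounded <- matrix(c( 1.00, -0.30, -0.13, -0.52,
                      -0.30,  1.00,  0.68,  0.49,
                      -0.13,  0.68,  1.00,  0.12,
                      -0.52,  0.49,  0.12,  1.00),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(vars, vars))

# Scaled inverse: off-diagonal entries are (minus) the partial correlations
Q <- cov2cor(solve(R_rounded))
diag(Q) <- 0
R0 <- R_rounded
diag(R0) <- 0

msa_overall <- sum(R0^2) / (sum(R0^2) + sum(Q^2))
msa_item    <- colSums(R0^2) / (colSums(R0^2) + colSums(Q^2))

print(round(msa_overall, 2))
print(round(msa_item, 2))
```

A variable whose partial correlations are large relative to its ordinary correlations gets a low MSA, which is the sense in which it shares little information with the rest.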

#install.packages("FactoMineR")
library(FactoMineR)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
components <- PCA(mydata_PCA,
                  scale.unit = TRUE,
                  graph = FALSE)

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue(components)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.1528976        53.822440                    53.82244
## Dim.2  1.1291555        28.228887                    82.05133
## Dim.3  0.5112175        12.780437                    94.83176
## Dim.4  0.2067294         5.168236                   100.00000

This is a PCA on standardized variables. Each eigenvalue is the variance of one PC; dividing it by the total variance (4, the number of standardized variables) gives the proportion of the total variance in the dataset that is explained by that PC.

Explanations:

  • variance of the second PC is 1.13
  • 53.82% of the information in the measured variables is transferred to PC1
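
The percentages in the table follow directly from the eigenvalues. A small sketch, retyping the eigenvalues printed above:

```r
# Sketch: variance percentages from the printed eigenvalues
eig <- c(Dim.1 = 2.1528976, Dim.2 = 1.1291555,
         Dim.3 = 0.5112175, Dim.4 = 0.2067294)

total_var <- sum(eig)            # equals the number of standardized variables
pct       <- 100 * eig / total_var
cum_pct   <- cumsum(pct)

print(round(cbind(eigenvalue = eig, pct = pct, cum_pct = cum_pct), 2))
```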
fviz_eig(components,
         choice = "eigenvalue",
         main = "Scree plot",
         ylab = "Eigenvalue",
         xlab = "Principal component",
         addlabels = TRUE)

library(psych)
fa.parallel(mydata_PCA,
            sim = FALSE,
            fa = "pc")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs =
## np.obs, : The estimated weights for the factor scores are probably
## incorrect.  Try a different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : An ultra-Heywood case was detected.  Examine the results
## carefully

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

Based on the parallel analysis and the scree plot above, I should include 2 PCs. The eigenvalues lead to the same conclusion: based on Kaiser's rule, you should keep as many PCs as there are eigenvalues above 1 (Dim.1 and Dim.2).
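
Both retention rules can be sketched in a few lines. Kaiser's rule just counts eigenvalues above 1. For parallel analysis, the sketch below compares the observed eigenvalues with the 95th percentile of eigenvalues from simulated normal data of the same size; note that this is an assumption of the sketch, since `fa.parallel(sim = FALSE)` resamples the actual data instead, so the two need not agree exactly.

```r
# Sketch: Kaiser's rule and a simplified parallel analysis
observed <- c(2.1528976, 1.1291555, 0.5112175, 0.2067294)
n <- 167  # countries
p <- 4    # variables in the PCA

# Kaiser's rule: keep components with eigenvalue > 1
n_kaiser <- sum(observed > 1)

# Parallel analysis on simulated uncorrelated normal data (assumption)
set.seed(1)
random_eigs <- replicate(200, {
  eigen(cor(matrix(rnorm(n * p), nrow = n, ncol = p)),
        only.values = TRUE)$values
})
threshold <- apply(random_eigs, 1, quantile, probs = 0.95)

# Keep leading components until one falls below its random threshold
above      <- observed > threshold
n_parallel <- if (all(above)) p else which(!above)[1] - 1

print(n_kaiser)
print(round(threshold, 2))
print(n_parallel)
```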

Now I will perform a PCA again and include first two PCs.

components <- PCA(mydata_PCA,
                  scale.unit = TRUE,
                  graph = FALSE,
                  ncp = 2)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 167 individuals, described by 4 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
print(components$var$cor)
##                Dim.1      Dim.2
## ChildMort -0.6347827  0.5808986
## Exports    0.8750028  0.3223674
## Imports    0.6666630  0.6694831
## Income     0.7347646 -0.4894731

This is a matrix of rescaled coefficients, i.e. the correlations between the original variables and the PCs.

Explanations:

  • 74.0% of the information was transferred from child mortality to PC1 and PC2 ((-0.635)² + 0.581² = 0.740)
  • 26.0% of the information from child mortality was lost
  • Variance of PC2 can be calculated also here by summing the squares of rescaled coefficients for Dim2 (1.13)
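
These calculations can be reproduced from the printed matrix. In the sketch below the loadings are retyped from the output above: row sums of the squared loadings give the share of each variable retained by the two PCs, and column sums give the variance of each PC.

```r
# Sketch: communalities and PC variances from the printed loadings
loadings <- matrix(c(-0.6347827,  0.5808986,
                      0.8750028,  0.3223674,
                      0.6666630,  0.6694831,
                      0.7347646, -0.4894731),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("ChildMort", "Exports", "Imports", "Income"),
                                   c("Dim.1", "Dim.2")))

retained <- rowSums(loadings^2)  # share of each variable kept by PC1 + PC2
lost     <- 1 - retained         # share of each variable that is lost
pc_var   <- colSums(loadings^2)  # variance (eigenvalue) of each PC

print(round(cbind(retained, lost), 3))  # ChildMort: about 0.74 retained
print(round(pc_var, 2))                 # Dim.2: about 1.13
```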
print(components$var$contrib)
##              Dim.1     Dim.2
## ChildMort 18.71659 29.884567
## Exports   35.56277  9.203402
## Imports   20.64378 39.694055
## Income    25.07686 21.217975

This matrix shows how different variables contributed to PCs.

Explanations:

  • Income contributed 25.08% when forming PC1
  • Exports contributed 9.2% when forming PC2
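
The contributions are tied to the rescaled coefficients: a variable's contribution to a PC is its squared coefficient divided by that PC's variance, times 100. A sketch using the coefficients retyped from the `components$var$cor` output:

```r
# Sketch: contributions (%) derived from the printed loadings
loadings <- matrix(c(-0.6347827,  0.5808986,
                      0.8750028,  0.3223674,
                      0.6666630,  0.6694831,
                      0.7347646, -0.4894731),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("ChildMort", "Exports", "Imports", "Income"),
                                   c("Dim.1", "Dim.2")))

pc_var  <- colSums(loadings^2)                          # variance of each PC
contrib <- 100 * sweep(loadings^2, 2, pc_var, FUN = "/")  # column-wise shares

print(round(contrib, 2))
print(colSums(contrib))  # each column sums to 100
```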
library(factoextra)
fviz_pca_var(components,
             repel = TRUE)

fviz_pca_biplot(components)

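The plots above show how child mortality, imports, exports and income relate to the two PCs. Income, exports and imports correlate positively with PC1, while child mortality correlates negatively with it, so PC1 could be read as a general development dimension. Income and child mortality point in opposite directions: the higher the income, the lower the child mortality. GDP is not part of the PCA, so these plots alone do not show how GDP depends on the four variables.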

For example, country ID 132 has below-average child mortality and above-average income, imports and exports. It performs better on trade than on income.