Introduction:

The data set I am using is happiness 2019 wich shows the happiness score for different countries. the dataset also includes a large number of variables such as GDP which could contribute to the happiness of a country. THe reason to use PCA is because it allows me to use many varibles and ultimatly show just the important trends in the data.

Data prep

library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
setwd("~/Data 110/Datasets")
happy <- read_csv("happiness2019(1).csv")
## Rows: 156 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country or region
## dbl (8): Overall rank, Score, GDP per capita, Social support, Healthy life e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sum(is.na(happy))
## [1] 0
head(happy)
## # A tibble: 6 × 9
##   `Overall rank` `Country or region` Score `GDP per capita` `Social support`
##            <dbl> <chr>               <dbl>            <dbl>            <dbl>
## 1              1 Finland              7.77             1.34             1.59
## 2              2 Denmark              7.6              1.38             1.57
## 3              3 Norway               7.55             1.49             1.58
## 4              4 Iceland              7.49             1.38             1.62
## 5              5 Netherlands          7.49             1.40             1.52
## 6              6 Switzerland          7.48             1.45             1.53
## # ℹ 4 more variables: `Healthy life expectancy` <dbl>,
## #   `Freedom to make life choices` <dbl>, Generosity <dbl>,
## #   `Perceptions of corruption` <dbl>

PCA:

PCA <- happy |>
  select(where(is.numeric)) |>
  scale() |>
  prcomp() 
PCA
## Standard deviations (1, .., p=8):
## [1] 2.1652862 1.1963505 0.7853395 0.7458613 0.5281342 0.5114892 0.3963805
## [8] 0.0977207
## 
## Rotation (n x k) = (8 x 8):
##                                      PC1         PC2         PC3          PC4
## Overall rank                  0.43835312  0.04819581  0.08093045 -0.001803683
## Score                        -0.44080367 -0.01615719 -0.07274773  0.007992977
## GDP per capita               -0.40793270 -0.19514753  0.07763114 -0.243462032
## Social support               -0.39151201 -0.18985948 -0.22758821  0.058092253
## Healthy life expectancy      -0.40303361 -0.15977162  0.02820855 -0.278434425
## Freedom to make life choices -0.29257895  0.37414889 -0.09030278  0.807294464
## Generosity                   -0.03767076  0.69293710 -0.57626348 -0.423325884
## Perceptions of corruption    -0.21069008  0.52688779  0.76766212 -0.169792530
##                                       PC5           PC6         PC7
## Overall rank                 -0.552021508 -0.0884729120  0.01650277
## Score                         0.535814601  0.0244819984 -0.03613972
## GDP per capita               -0.306498966  0.1540425044  0.78242466
## Social support               -0.284077249 -0.8042927073 -0.16286260
## Healthy life expectancy      -0.406109725  0.4683510116 -0.59100403
## Freedom to make life choices -0.253828884  0.2158453649  0.05911195
## Generosity                   -0.064222526 -0.0004731803  0.05382862
## Perceptions of corruption    -0.007397664 -0.2345532091 -0.06349404
##                                       PC8
## Overall rank                 -0.697239684
## Score                        -0.714893152
## GDP per capita               -0.018679955
## Social support                0.037279359
## Healthy life expectancy      -0.012326465
## Freedom to make life choices  0.004321629
## Generosity                    0.010602166
## Perceptions of corruption     0.027620442

Visualize PCA

PCA |>
  augment(happy) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point() +
  coord_fixed() +
  geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0),arrow = arrow(type = "closed", length =          unit(0.1,"inches")), color = "black") +
  geom_segment(aes(x = 0, y = -3, xend = 0, yend = 4),arrow = arrow(type = "closed", length = unit(0.1, "inches")),color = "black") +
  geom_text(aes(x = 5, y = 0, label = "PC1"), vjust = -0.5, color = "black") +  
  geom_text(aes(x = 0, y = 4, label = "PC2"), hjust = -0.5, color = "black")
## Warning in geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 156 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_segment(aes(x = 0, y = -3, xend = 0, yend = 4), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 156 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_text(aes(x = 5, y = 0, label = "PC1"), vjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 156 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_text(aes(x = 0, y = 4, label = "PC2"), hjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 156 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)

PCA |>
  tidy(matrix = "rotation") |> 
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  , aes(color = column)) +
  geom_text(aes(label = column))

PCA Analysis:

PC1 is how happy a country is countries with higher ranks are less happy which is why the rank arrow points left and countries with higher scores are happier which is why the arrow points straight right.

PC2 is the diffrence between GDP per capita and Generosity the lowest and highest stretching arrows respectivly (on the PC2 scale).

Variance plot:

PCA |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col() +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(labels = scales::label_percent())

To summarize the above plot about 60% of the variance in the dataset and PC2 explains another 19% for a total of 79% of the variance being explained with PC1 and PC2