library(tidyverse)

Introduction

In this assignment (and others to follow) we will become more familiar with RStudio and R. More importantly, through this series of assignments, we will address the “how much” and “why” of happiness, around the world. Really.

Let’s begin by answering (silently) the following question, considering our situation over the last month:

Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?

Seriously, we should give ourselves time to answer this question, known as the Cantril Life Ladder question. Think about your answer, and convert it to a percent (multiply the step number by 10). There is only one person who needs to know your answer.

How happy were the people all over the world in 2019?

This question is asked yearly of a random sample of people in many countries, including our country of origin. The results appear in the file hapiscore*.csv, which comes from <https://www.gapminder.org/data/>. If this file is in the same folder as this *.Rmd, the following chunk reads in (most) results of that survey:

happy_entities <- read_csv("hapiscore_whrGapMinder.csv")
## Rows: 164 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): country
## dbl (19): 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(happy_entities,5)
## # A tibble: 5 × 20
##   country  `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Afghani…     NA   NA     NA     37.2   44     47.6   38.3   37.8   35.7   31.3
## 2 Angola       NA   NA     NA     NA     NA     NA     55.9   43.6   39.4   38  
## 3 Albania      NA   NA     46.3   NA     54.9   52.7   58.7   55.1   45.5   48.1
## 4 UAE          NA   67.3   NA     NA     68.7   71     71.2   72.2   66.2   65.4
## 5 Argenti…     NA   63.1   60.7   59.6   64.2   64.4   67.8   64.7   65.8   66.7
## # ℹ 9 more variables: `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>,
## #   `2019` <dbl>, `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>
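The message printed above names its own remedy, `show_col_types = FALSE`. As a minimal sketch, assuming a hypothetical two-row toy file written to a temporary path (it stands in for the Gapminder CSV and is not the real data set), the same call can be made quietly like this:

```r
library(tidyverse)

# Hypothetical toy file standing in for the Gapminder CSV;
# the two rows are copied from the 2019 column shown in this report.
toy_path <- tempfile(fileext = ".csv")
write_csv(tibble(country = c("Albania", "UAE"), `2019` = c(50, 67.1)), toy_path)

# show_col_types = FALSE suppresses the "Column specification" message.
toy <- read_csv(toy_path, show_col_types = FALSE)
names(toy)
```

The column names come back unchanged, including the non-syntactic year name, which is why the year columns need quotes or backticks in later chunks.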

(We can turn off these messages with the message=FALSE chunk option, as in the libraries chunk.) This file contains, for each country, a happiness percentage for various years. Let’s focus only on the years 2019 through 2023. The following chunk does this:

affected_5_years <-  select(happy_entities,
                     country="country",
                     Year_2019="2019",
                     Year_2020="2020",
                     Year_2021="2021",
                     Year_2022="2022",
                     Year_2023="2023")
                     
head(affected_5_years,5)
## # A tibble: 5 × 6
##   country     Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Afghanistan      23.8      NA        24.4      12.8      14.5
## 2 Angola           NA        NA        NA        NA        NA  
## 3 Albania          50        53.6      52.5      52.1      54.5
## 4 UAE              67.1      64.6      67.3      67.4      67.3
## 5 Argentina        60.9      59        59.1      62.6      63.9
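Typing one rename per year works, but it grows with the number of years. A more compact alternative, sketched here on a hypothetical toy tibble (the names `toy` and `recent` are illustrative and not part of the assignment), selects a range of year columns and prefixes them in one step:

```r
library(tidyverse)

# Hypothetical toy data with the same shape as happy_entities.
toy <- tibble(
  country = c("A", "B"),
  `2018`  = c(50, 60),
  `2019`  = c(51, 61),
  `2020`  = c(52, NA)
)

# Keep country plus the 2019-2020 columns, then prefix every year column.
recent <- rename_with(
  select(toy, country, `2019`:`2020`),
  ~ paste0("Year_", .x),
  .cols = -country
)
names(recent)
```

The explicit version in the chunk above is easier to read for five columns; the range form mostly pays off when the set of years changes often.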

Let’s arrange the table with the happiest countries in 2019 first. The following chunk does this:

hapest_2019 <- arrange(affected_5_years,desc(Year_2019))
head(hapest_2019,5)
## # A tibble: 5 × 6
##   country     Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Finland          77.8      78.9      77.9      77.3      77  
## 2 Switzerland      76.9      75.1      73.3      68.8      69.7
## 3 Denmark          76.9      75.2      77        75.5      75  
## 4 Iceland          75.3      75.8      75.7      74.5      75.6
## 5 Norway           74.4      72.9      73.6      73        72.5
tail(hapest_2019,5)
## # A tibble: 5 × 6
##   country             Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Somalia                    NA        NA        NA        NA        NA
## 2 South Sudan                NA        NA        NA        NA        NA
## 3 Suriname                   NA        NA        NA        NA        NA
## 4 Syria                      NA        NA        NA        NA        NA
## 5 Trinidad and Tobago        NA        NA        NA        NA        NA

Note that we can peek at the top of the data frame using head() and at the bottom using tail(). Oh no, NAs go last! Let’s remove them from the output by amending the above chunk:

hapest_2019 <- arrange(affected_5_years,desc(Year_2019))
tail(drop_na(hapest_2019,Year_2019),5)
## # A tibble: 5 × 6
##   country     Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Zambia           33.1      48.4      30.8      37.3      36.9
## 2 Rwanda           32.7      NA        NA        NA        NA  
## 3 India            32.5      42.2      35.6      39.3      46.8
## 4 Zimbabwe         26.9      31.6      31.6      33        35.7
## 5 Afghanistan      23.8      NA        24.4      12.8      14.5
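The interaction between sorting and missing values is easier to see on a tiny example. This sketch uses a hypothetical four-row tibble (not the survey data) to show that arrange() places NA at the end even under desc(), and that drop_na() on the sorted column then exposes the true lowest score:

```r
library(tidyverse)

# Hypothetical toy data: one country has no score.
toy <- tibble(
  country = c("A", "B", "C", "D"),
  score   = c(40, NA, 70, 55)
)

# Descending sort: the NA row still ends up last.
sorted <- arrange(toy, desc(score))
sorted$country

# Dropping rows with a missing score leaves the real minimum at the bottom.
cleaned <- drop_na(sorted, score)
tail(cleaned, 1)$country
```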

Let’s investigate the happiness score for the year 2019 in more detail. Let’s consider a boxplot, which summarizes a continuous measurement with 5 numbers. The following chunk produces the boxplot:

ggplot(hapest_2019) +
  geom_boxplot(aes(x=Year_2019)) +
  xlab("happiness, 2019")

Notice that I turned off warnings with warning=FALSE. See the course notes for interpreting boxplots. In the plot, we can see the box, which has a horizontal span of about [49,63] and covers half the (non-NA) countries, and at least 2 “left” outliers. The median is “about” centered in the box, and the left whisker is “about” the same length as the right whisker.

The following chunk provides a more numerical form of a Boxplot

summary(hapest_2019$Year_2019)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   23.80   49.20   56.30   55.71   62.90   77.80      21
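The quartiles in this summary can be turned into the usual 1.5 × IQR outlier fences. The sketch below assumes that rule (the common boxplot convention) and uses the rounded quartiles printed above together with the five lowest 2019 scores from the earlier table; the quartiles a plotting function computes internally may differ slightly.

```r
# Quartiles as printed by summary() for 2019 (rounded values).
q1 <- 49.20
q3 <- 62.90
iqr <- q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# The five lowest 2019 scores, copied from the table above.
bottom_2019 <- c(Zambia = 33.1, Rwanda = 32.7, India = 32.5,
                 Zimbabwe = 26.9, Afghanistan = 23.8)
low_outliers <- names(bottom_2019)[bottom_2019 < lower_fence]
low_outliers
```

Under these assumptions the lower fence sits near 28.65, so Zimbabwe and Afghanistan fall outside it, and the upper fence lies above the maximum of 77.80, so no high outliers are expected.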

The summary suggests a symmetric distribution, but is it? Let’s find out with a histogram, produced by the following chunk:

ggplot(hapest_2019) +
  geom_histogram(aes(x=Year_2019),bins=10) +
  xlab("happiness, 2019")

This histogram indicates a slight left skew (the left whisker was a little longer than the right, plus those left outliers). Still, there seems to be a bell-shaped curve in the histogram. Is the distribution close to a Gaussian? Let’s find out with a quantile-quantile plot, produced by the following chunk:

p <- ggplot(hapest_2019, aes(sample = Year_2019))
p + stat_qq() + stat_qq_line() + ylab("happiness_in_2019")
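To make the quantile-quantile plot less of a black box, the sketch below rebuilds its ingredients in base R on a hypothetical toy sample. It reflects my understanding of the defaults (sorted data against standard normal quantiles at `ppoints()` positions, and a reference line through the first and third quartiles); treat it as an approximation of what stat_qq() and stat_qq_line() draw, not as their exact implementation.

```r
# Hypothetical toy scores (not taken from the Gapminder file).
x <- c(31, 42, 48, 52, 55, 57, 60, 63, 68, 77)

# Points of the plot: sorted sample against standard normal quantiles.
sample_q    <- sort(x)
theoretical <- qnorm(ppoints(length(x)))

# Reference line through the first and third quartiles of both axes.
probs     <- c(0.25, 0.75)
slope     <- unname(diff(quantile(x, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(x, probs[1]) - slope * qnorm(probs[1]))

# Gap between each point and the line; large gaps in the tails hint at skew.
qq <- data.frame(theoretical, sample_q,
                 on_line = intercept + slope * theoretical)
qq$gap <- qq$sample_q - qq$on_line
print(qq)
```

Points that hug the line support a Gaussian reading; points that fall below the line at the left end indicate a heavier left tail.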

Let’s arrange the table with the happiest countries in 2020 first. The following chunk does this:

hapest_2020 <- arrange(affected_5_years,desc(Year_2020))
head(hapest_2020,10)
## # A tibble: 10 × 6
##    country     Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
##    <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Finland          77.8      78.9      77.9      77.3      77  
##  2 Iceland          75.3      75.8      75.7      74.5      75.6
##  3 Denmark          76.9      75.2      77        75.5      75  
##  4 Switzerland      76.9      75.1      73.3      68.8      69.7
##  5 Netherlands      74.3      75        73.1      73.9      72.5
##  6 Germany          70.3      73.1      67.5      66.1      67.9
##  7 Sweden           74        73.1      74.4      74.3      71.6
##  8 Norway           74.4      72.9      73.6      73        72.5
##  9 New Zealand      72        72.6      71.4      69.8      69.8
## 10 Austria          72        72.1      70.8      70        66.4
tail(hapest_2020,10)
## # A tibble: 10 × 6
##    country             Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
##    <chr>                   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Somalia                  NA          NA      NA        NA        NA  
##  2 South Sudan              NA          NA      NA        NA        NA  
##  3 Suriname                 NA          NA      NA        NA        NA  
##  4 Eswatini                 44          NA      NA        35        NA  
##  5 Syria                    NA          NA      NA        NA        NA  
##  6 Chad                     42.5        NA      NA        44        45.4
##  7 Togo                     41.8        NA      40.4      42.4      43.6
##  8 Turkmenistan             54.7        NA      NA        NA        NA  
##  9 Trinidad and Tobago      NA          NA      NA        NA        NA  
## 10 Yemen                    42          NA      NA        35.9      35.3

We have stored the data set in the hapest_2020 variable, sorted in descending order of the Year_2020 column. The head function returns the first 10 rows of the whole sorted data set and, similarly, tail returns the last 10 rows.

Now we remove the NA values for the year 2020 and show a tidier, more concise version of the table.

hapest_2020 <- arrange(affected_5_years,desc(Year_2020))
tail(drop_na(hapest_2020,Year_2020),5)
## # A tibble: 5 × 6
##   country  Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
##   <chr>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 India         32.5      42.2      35.6      39.3      46.8
## 2 Jordan        44.5      40.9      39.1      43.6      42.9
## 3 Tanzania      36.4      37.9      36.8      36.2      40.4
## 4 Zimbabwe      26.9      31.6      31.6      33        35.7
## 5 Lebanon       40.2      26.3      21.8      23.5      35.9

Let’s investigate the happiness score for the year 2020 in more detail. Let’s consider a boxplot, which summarizes a continuous measurement with 5 numbers. The following chunk produces the boxplot:

ggplot(hapest_2020) +
  geom_boxplot(aes(x=Year_2020)) +
  xlab("happiness, 2020")

From the boxplot it is clear that the middle half of the values lies between roughly 50 and 65. However, there appears to be an outlier at around 26 on the happiness scale. This could mean that people in most countries were relatively happy, while people in a few countries suffered.

arrange_unhappy_2020 <- arrange(hapest_2020, desc(Year_2020))
unhappy_2020 <- filter(arrange_unhappy_2020, Year_2020 < 49.00)
tail(drop_na(unhappy_2020,Year_2020),5)
## # A tibble: 5 × 6
##   country  Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
##   <chr>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 India         32.5      42.2      35.6      39.3      46.8
## 2 Jordan        44.5      40.9      39.1      43.6      42.9
## 3 Tanzania      36.4      37.9      36.8      36.2      40.4
## 4 Zimbabwe      26.9      31.6      31.6      33        35.7
## 5 Lebanon       40.2      26.3      21.8      23.5      35.9

We can see that between 2019 and 2020 the happiness score of Lebanon took a massive hit: it went from about 40 to about 26. As mentioned earlier, some countries were adversely affected, whereas in countries such as India the score increased (from 32.5 to 42.2).
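The year-over-year change can be computed rather than read off the table. This sketch retypes the five rows shown above into a small tibble (the names `low_2020` and `changes` are illustrative) and sorts by the 2020 minus 2019 difference:

```r
library(tidyverse)

# The five lowest 2020 scores, copied from the table above.
low_2020 <- tribble(
  ~country,   ~Year_2019, ~Year_2020,
  "India",    32.5,       42.2,
  "Jordan",   44.5,       40.9,
  "Tanzania", 36.4,       37.9,
  "Zimbabwe", 26.9,       31.6,
  "Lebanon",  40.2,       26.3
)

# Add the change column and put the largest drop first.
changes <- arrange(mutate(low_2020, change = Year_2020 - Year_2019), change)
changes
```

On these rows Lebanon has the largest drop (about 13.9 points) and India the largest gain (about 9.7 points).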

The following chunk provides a more numerical form of the boxplot.

summary(hapest_2020$Year_2020)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   26.30   49.85   57.70   57.27   64.05   78.90      48

So, from the summary table, the key statistics for the happiness score are as follows:

1.  Minimum - 26.30

2.  Median - 57.70

3.  Maximum - 78.90

These results suggest a somewhat symmetric distribution: the median lies 57.70 - 26.30 = 31.4 above the minimum and 78.90 - 57.70 = 21.2 below the maximum, distances of similar size, although the lower one is longer. But is that the case? Let us find out with a histogram, produced by the following chunk:

ggplot(hapest_2020) +
  geom_histogram(aes(x=Year_2020),bins=10) +
  xlab("happiness, 2020")

This histogram indicates a slight left skew (the left whisker of the boxplot is longer than the right, and there appears to be at least one left outlier).

The right side is denser and declines in a more bell-shaped way, whereas on the left side the counts fall off steeply at a happiness score of about 46.
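As a rough numeric cross-check on the direction of skew, two simple quantities can be computed from the 2020 summary values alone: the difference between mean and median, and the quartile (Bowley) skewness. Both are informal indicators, and the inputs are the rounded numbers printed by summary(), so this is only a sketch.

```r
# 2020 summary values as printed by summary() (rounded).
s2020 <- c(min = 26.30, q1 = 49.85, median = 57.70,
           mean = 57.27, q3 = 64.05, max = 78.90)

# A mean below the median hints at a longer left tail.
mean_minus_median <- s2020[["mean"]] - s2020[["median"]]

# Quartile skewness: negative when the median sits closer to Q3 than to Q1.
bowley <- (s2020[["q3"]] + s2020[["q1"]] - 2 * s2020[["median"]]) /
  (s2020[["q3"]] - s2020[["q1"]])

c(mean_minus_median = mean_minus_median, bowley = bowley)
```

Both quantities come out negative here, which points toward a longer left tail; the values are small, so the asymmetry is mild.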

The histogram seems roughly bell-shaped. Is the distribution close to a Gaussian? Let’s find out with a quantile-quantile plot, produced by the following chunk:

q <- ggplot(hapest_2020, aes(sample = Year_2020))
q + stat_qq() + stat_qq_line() + ylab("happiness_in_2020")

Note, for a Gaussian distribution:

1. About 68% of the data falls within 1 standard deviation of the mean, which corresponds to the range from -1 to 1 on the theoretical axis of the quantile-quantile plot.

2. About 95% of the data falls within 2 standard deviations of the mean, which corresponds to the range from -2 to 2.

3. About 99.7% of the data falls within 3 standard deviations of the mean, which corresponds to the range from -3 to 3.
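The three percentages in this note can be reproduced from the standard normal distribution function. A minimal check in base R:

```r
# Probability mass of a standard normal within k standard deviations of the mean.
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 3)
```

The rounded values are 0.683, 0.954 and 0.997, matching the 68%, 95% and 99.7% figures.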

summary(hapest_2020$Year_2020)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   26.30   49.85   57.70   57.27   64.05   78.90      48
summary(hapest_2019$Year_2019)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   23.80   49.20   56.30   55.71   62.90   77.80      21

Effect of COVID-19 for years 2019 and 2020

  • It is apparent from the histograms that in 2019 the happiness scores were more evenly balanced between the left and right sides, whereas in 2020 more of the countries sat at the higher scores.

  • In 2019 the happiness score was lower for most countries; after one year of lockdown the overall score appears to have risen, with the mean moving from 55.71 to 57.27, as the histogram also suggests.

  • In 2019 the minimum happiness score was around 24, and in 2020 it was about 2.5 points higher (26.30). The median rose slightly, from 56.30 to 57.70, and the maximum increased by about 1 (77.80 to 78.90), which may not be significant. Note also that 48 countries have no 2020 score, against 21 in 2019, so the two summaries do not cover the same countries. Still, the histograms suggest that after one year of the pandemic overall happiness increased to some extent around the world.

  • From these differences we could conclude that, even though the pandemic’s initial effects were thought to be quite detrimental, it coincided with higher happiness scores in most countries, perhaps because bringing families together and keeping them at home made them happier.
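Because the two yearly summaries have different numbers of missing values, a like-for-like comparison would keep only countries that report both years. The sketch below shows the mechanics on seven rows retyped from the tables in this report; the name `sample_rows` is illustrative, and this handful of rows is not representative, so its result says nothing about the world as a whole.

```r
library(tidyverse)

# A few rows copied from the tables above; two have no 2020 score.
sample_rows <- tribble(
  ~country,      ~Year_2019, ~Year_2020,
  "Afghanistan", 23.8,       NA,
  "Albania",     50.0,       53.6,
  "UAE",         67.1,       64.6,
  "Argentina",   60.9,       59.0,
  "Rwanda",      32.7,       NA,
  "Finland",     77.8,       78.9,
  "Lebanon",     40.2,       26.3
)

# Keep only countries with a score in both years, then summarise the change.
both_years <- drop_na(sample_rows, Year_2019, Year_2020)
paired <- summarise(both_years,
                    n_countries = n(),
                    mean_change = mean(Year_2020 - Year_2019))
paired
```

Running the same two steps on affected_5_years would give the mean change over all countries observed in both years, which is the quantity the bullets above are trying to describe.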