library(tidyverse)
In this assignment (and others to follow) we will become more familiar with RStudio and R. More importantly, through this series of assignments, we will address the “how much” and “why” of happiness, around the world. Really.
Let’s begin by silently answering the following question, considering our situation over the last month:
Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?
Seriously, we should give ourselves time to answer this question, known as the Cantril Life Ladder question. Think about your answer, and convert your answer to a percent. There is only one person who needs to know your answer.
This question is asked yearly of a random sample of people in
many countries, including our country of origin. The results appear in
the file hapiscore*.csv, which comes from <https://www.gapminder.org/data/>. The following chunk reads in
(most) results of that survey. Here the file is read from a local Downloads folder; if it sits in
the same folder as this *.Rmd, the bare file name is enough:
happy_entities <- read_csv("/Users/swapnanilbala/Downloads/hapiscore_whrGapMinder.csv")
## Rows: 164 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (19): 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(happy_entities,5)
## # A tibble: 5 × 20
## country `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani… NA NA NA 37.2 44 47.6 38.3 37.8 35.7 31.3
## 2 Angola NA NA NA NA NA NA 55.9 43.6 39.4 38
## 3 Albania NA NA 46.3 NA 54.9 52.7 58.7 55.1 45.5 48.1
## 4 UAE NA 67.3 NA NA 68.7 71 71.2 72.2 66.2 65.4
## 5 Argenti… NA 63.1 60.7 59.6 64.2 64.4 67.8 64.7 65.8 66.7
## # ℹ 9 more variables: `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, `2018` <dbl>,
## # `2019` <dbl>, `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>
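The column-specification message printed above can also be silenced in the call itself, as the message suggests. The following is a minimal sketch on a made-up three-line file written to a temporary location; `toy_path` and `toy_scores` are hypothetical names, and the values are copied from rows of the survey file only for illustration.

```r
library(readr)

# Hypothetical miniature of the survey file, written to a temporary location
toy_path <- tempfile(fileext = ".csv")
writeLines(c("country,2019,2020",
             "Finland,77.8,78.9",
             "Rwanda,32.7,NA"), toy_path)

# show_col_types = FALSE suppresses the column-specification message
toy_scores <- read_csv(toy_path, show_col_types = FALSE)
toy_scores
```

With the real file stored next to the *.Rmd, the same call would take just the file name instead of `toy_path`.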
(We can turn off these messages with the message=FALSE
chunk option, as in the libraries chunk.) This file contains,
for each country, a happiness percentage for various years. Let’s focus
only on the years 2019 through 2023. The following chunk does this:
affected_5_years <- select(happy_entities,
country="country",
Year_2019="2019",
Year_2020="2020",
Year_2021="2021",
Year_2022="2022",
Year_2023="2023")
head(affected_5_years,5)
## # A tibble: 5 × 6
## country Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 23.8 NA 24.4 12.8 14.5
## 2 Angola NA NA NA NA NA
## 3 Albania 50 53.6 52.5 52.1 54.5
## 4 UAE 67.1 64.6 67.3 67.4 67.3
## 5 Argentina 60.9 59 59.1 62.6 63.9
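The renaming above lists each year by hand. As a possible alternative, the same selection can be sketched with `all_of()` and `rename_with()` from dplyr. The tibble `toy` and the vector `years` below are made up for illustration and are not part of the assignment.

```r
library(dplyr)

# Made-up miniature of the wide table, with year columns named by number
toy <- tibble(
  country = c("A", "B"),
  `2018`  = c(50, 60),
  `2019`  = c(51, NA),
  `2020`  = c(52, 61)
)

years <- as.character(2019:2020)   # the year columns we want to keep

recent <- toy %>%
  select(country, all_of(years)) %>%                          # keep only those years
  rename_with(~ paste0("Year_", .x), .cols = all_of(years))   # add the Year_ prefix

names(recent)
```

Prefixing the names avoids having to wrap numeric column names in backticks in later chunks.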
Let’s arrange the table so that the happiest countries in 2019 come first. The following chunk does this:
hapest_2019 <- arrange(affected_5_years,desc(Year_2019))
head(hapest_2019,5)
## # A tibble: 5 × 6
## country Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Finland 77.8 78.9 77.9 77.3 77
## 2 Switzerland 76.9 75.1 73.3 68.8 69.7
## 3 Denmark 76.9 75.2 77 75.5 75
## 4 Iceland 75.3 75.8 75.7 74.5 75.6
## 5 Norway 74.4 72.9 73.6 73 72.5
tail(hapest_2019,5)
## # A tibble: 5 × 6
## country Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Somalia NA NA NA NA NA
## 2 South Sudan NA NA NA NA NA
## 3 Suriname NA NA NA NA NA
## 4 Syria NA NA NA NA NA
## 5 Trinidad and Tobago NA NA NA NA NA
Note we can peek at the top of the data frame using
head() and at the bottom using tail(). Oh no,
NAs go last! Let’s remove them from the output, by amending
the above chunk:
hapest_2019 <- arrange(affected_5_years,desc(Year_2019))
tail(drop_na(hapest_2019,Year_2019),5)
## # A tibble: 5 × 6
## country Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Zambia 33.1 48.4 30.8 37.3 36.9
## 2 Rwanda 32.7 NA NA NA NA
## 3 India 32.5 42.2 35.6 39.3 46.8
## 4 Zimbabwe 26.9 31.6 31.6 33 35.7
## 5 Afghanistan 23.8 NA 24.4 12.8 14.5
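The behaviour seen above (missing values sorted to the end, then removed with `drop_na()`) can be reproduced on a tiny made-up table. The countries A to D and their scores are hypothetical.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  country   = c("A", "B", "C", "D"),
  Year_2019 = c(50.0, NA, 77.8, 23.8)
)

# arrange() places NA rows last, even when sorting in descending order
sorted <- arrange(toy, desc(Year_2019))
sorted$country                      # "C" "A" "D" "B"

# Dropping rows with NA in Year_2019 leaves the lowest observed score at the bottom
sorted_complete <- drop_na(sorted, Year_2019)
tail(sorted_complete, 1)$country    # "D"
```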
Let’s investigate the happiness score for year 2019, in more detail. Let’s consider a boxplot, which summarizes a continuous measurement with 5 numbers. The following chunk produces the boxplot
ggplot(hapest_2019) +
geom_boxplot(aes(x=Year_2019)) +
xlab("happiness, 2019")
Notice that I turned off warnings with the warning=FALSE chunk option
(ggplot2 would otherwise warn about the rows with NA that it drops).
See the course notes for interpreting boxplots. In the plot, we can see
the box, which has a horizontal span [49,63] covering half the (non-NA)
countries, and at least 2 “left” outliers. We see the median “about”
centered in the box, and the left whisker “about” the same length as the
right whisker.
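To see why two points are drawn as outliers, recall that geom_boxplot() by default flags values more than 1.5 times the interquartile range beyond the box. The sketch below applies that rule by hand, using the quartiles that summary() reports for Year_2019 (49.20 and 62.90) and the five lowest 2019 scores from the table printed earlier. The variable names are made up for illustration.

```r
# Quartiles of Year_2019 as reported by summary()
q1 <- 49.20
q3 <- 62.90
iqr <- q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Five lowest 2019 scores, copied from the sorted table
bottom_scores <- c(Zambia = 33.1, Rwanda = 32.7, India = 32.5,
                   Zimbabwe = 26.9, Afghanistan = 23.8)

# Scores below the lower fence would be drawn as separate points
left_outliers <- bottom_scores[bottom_scores < lower_fence]
left_outliers
```

Under these assumptions the lower fence is about 28.65, so only Zimbabwe and Afghanistan fall below it, which agrees with the two left outliers seen in the plot.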
The following chunk provides a more numerical form of a Boxplot
summary(hapest_2019$Year_2019)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 23.80 49.20 56.30 55.71 62.90 77.80 21
So, these results suggest a symmetric distribution, but is it symmetric? Let’s find out with a histogram, produced by the following chunk
ggplot(hapest_2019) +
geom_histogram(aes(x=Year_2019),bins=10) +
xlab("happiness, 2019")
This histogram indicates a slight left skew (in the boxplot, the left whisker was a little longer than the right, plus those left outliers). Still, the histogram looks roughly bell-shaped. Is the distribution close to a Gaussian? Let’s find out with a quantile-quantile plot, produced by the following chunk
p <- ggplot(hapest_2019,aes(sample=Year_2019))
p + stat_qq() + stat_qq_line() + ylab("happiness_in_2019")
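A quantile-quantile plot pairs each sorted sample value with the matching quantile of a standard normal distribution; points close to a straight line suggest a roughly Gaussian sample. The following sketch uses made-up data (not the survey scores) and base R's qqnorm() with plot.it = FALSE to obtain the plotted coordinates. The sample size and distribution parameters are arbitrary assumptions.

```r
n <- 143                      # hypothetical sample size
probs <- ppoints(n)           # plotting positions strictly between 0 and 1

gaussian_like <- qnorm(probs, mean = 56, sd = 10)   # exactly normal-shaped values
right_skewed  <- qexp(probs, rate = 1 / 10)         # values with a long right tail

# x = theoretical normal quantiles, y = sample values
qq_gauss <- qqnorm(gaussian_like, plot.it = FALSE)
qq_skew  <- qqnorm(right_skewed,  plot.it = FALSE)

# Correlation between theoretical and sample quantiles:
# values near 1 mean the points hug a straight line
cor_gauss <- cor(qq_gauss$x, qq_gauss$y)
cor_skew  <- cor(qq_skew$x,  qq_skew$y)
c(gaussian = cor_gauss, skewed = cor_skew)
```

The correlation is only a rough numeric companion to the plot; curvature at the ends of the point cloud is usually easier to judge by eye.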
hapest_2020 <- arrange(affected_5_years,desc(Year_2020))
head(hapest_2020,10)
## # A tibble: 10 × 6
## country Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Finland 77.8 78.9 77.9 77.3 77
## 2 Iceland 75.3 75.8 75.7 74.5 75.6
## 3 Denmark 76.9 75.2 77 75.5 75
## 4 Switzerland 76.9 75.1 73.3 68.8 69.7
## 5 Netherlands 74.3 75 73.1 73.9 72.5
## 6 Germany 70.3 73.1 67.5 66.1 67.9
## 7 Sweden 74 73.1 74.4 74.3 71.6
## 8 Norway 74.4 72.9 73.6 73 72.5
## 9 New Zealand 72 72.6 71.4 69.8 69.8
## 10 Austria 72 72.1 70.8 70 66.4
tail(hapest_2020,10)
## # A tibble: 10 × 6
## country Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Somalia NA NA NA NA NA
## 2 South Sudan NA NA NA NA NA
## 3 Suriname NA NA NA NA NA
## 4 Eswatini 44 NA NA 35 NA
## 5 Syria NA NA NA NA NA
## 6 Chad 42.5 NA NA 44 45.4
## 7 Togo 41.8 NA 40.4 42.4 43.6
## 8 Turkmenistan 54.7 NA NA NA NA
## 9 Trinidad and Tobago NA NA NA NA NA
## 10 Yemen 42 NA NA 35.9 35.3
We keep the data set in the hapest_2020 variable,
sorted in descending order of the Year_2020 column.
The head() function returns the first 10 rows of the sorted data set, and tail()
returns the last 10 rows.
Next we remove the NA values for the year 2020 to
show a cleaner, more concise version of the table.
hapest_2020 <- arrange(affected_5_years,desc(Year_2020))
tail(drop_na(hapest_2020,Year_2020),5)
## # A tibble: 5 × 6
## country Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 India 32.5 42.2 35.6 39.3 46.8
## 2 Jordan 44.5 40.9 39.1 43.6 42.9
## 3 Tanzania 36.4 37.9 36.8 36.2 40.4
## 4 Zimbabwe 26.9 31.6 31.6 33 35.7
## 5 Lebanon 40.2 26.3 21.8 23.5 35.9
Let’s investigate the happiness score for year 2020, in more detail.
Let’s consider a Boxplot, which summarizes a continuous
measurement with 5 numbers. The following chunk produces the
Boxplot for it
ggplot(hapest_2020) +
geom_boxplot(aes(x=Year_2020)) +
xlab("happiness, 2020")
From the boxplot it is clear that the middle half of the values lies within
roughly 50 - 64. However, there appears to be an outlier with
a value of around 26 for the happiness score (the 2020 minimum). This could
mean that people in most countries were relatively happy, while in a few
countries people suffered.
# Sort by the 2020 score; desc(2020) would sort by the constant 2020 and do nothing
arrange_unhappy_2020 <- arrange(hapest_2020, desc(Year_2020))
unhappy_2020 <- filter(arrange_unhappy_2020, Year_2020 < 49.00)
tail(drop_na(unhappy_2020,Year_2020),5)
## # A tibble: 5 × 6
## country Year_2019 Year_2020 Year_2021 Year_2022 Year_2023
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 India 32.5 42.2 35.6 39.3 46.8
## 2 Jordan 44.5 40.9 39.1 43.6 42.9
## 3 Tanzania 36.4 37.9 36.8 36.2 40.4
## 4 Zimbabwe 26.9 31.6 31.6 33 35.7
## 5 Lebanon 40.2 26.3 21.8 23.5 35.9
We can see that between 2019 and 2020 the happiness score of Lebanon took a massive hit: it fell from about 40 to 26. As mentioned earlier, some countries were adversely affected, whereas in countries such as India the score increased.
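The year-over-year change for the five rows shown above can be computed directly. This is a small sketch on a tibble typed in by hand from that output; `bottom_2020` and `changes` are made-up names.

```r
library(dplyr)

# The five lowest 2020 rows, copied from the table above
bottom_2020 <- tibble(
  country   = c("India", "Jordan", "Tanzania", "Zimbabwe", "Lebanon"),
  Year_2019 = c(32.5, 44.5, 36.4, 26.9, 40.2),
  Year_2020 = c(42.2, 40.9, 37.9, 31.6, 26.3)
)

# Change from 2019 to 2020, largest drop first
changes <- bottom_2020 %>%
  mutate(change = Year_2020 - Year_2019) %>%
  arrange(change)
changes
```

Sorting by the change puts Lebanon (a drop of about 13.9) at the top and India (a rise of about 9.7) at the bottom.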
The following chunk provides a more numerical form of a Box-plot.
summary(hapest_2020$Year_2020)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 26.30 49.85 57.70 57.27 64.05 78.90 48
So, from the summary table, the key statistics for the happiness score are as follows:
1. Minimum - 26.30
2. Median - 57.70
3. Maximum - 78.90
So, these results suggest a somewhat symmetric distribution: the median
lies 57.70 - 26.30 = 31.4 above the minimum and
78.90 - 57.70 = 21.2 below the maximum, which are of similar size (though the lower gap is larger). But is that
the case? Let us find out with a histogram, produced by the following
chunk
ggplot(hapest_2020) +
geom_histogram(aes(x=Year_2020),bins=10) +
xlab("happiness, 2020")
This histogram indicates a slight left skew (the left tail reaches farther from the median than the right, the mean of 57.27 sits just below the median of 57.70, and there appear to be some left outliers).
The right side is denser and declines in a more bell-shaped way, whereas on the left side the shape falls off steeply around a happiness score of 46.
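One rough, numeric way to judge the direction of skew is to compare the two halves of the box, using the quartiles from the summary() output above. The sketch below computes the quartile (Bowley) skewness coefficient. The variable names are made up for illustration, and because this measure ignores the tails it is only a hint.

```r
# Quartiles and mean copied from the summary() output for 2020
q1 <- 49.85
med <- 57.70
q3 <- 64.05
mean_2020 <- 57.27

# Quartile (Bowley) skewness: 0 for a symmetric box, negative when the
# lower half of the box is wider than the upper half
bowley <- ((q3 - med) - (med - q1)) / (q3 - q1)
bowley

# Second hint: a mean below the median also points toward a longer left tail
mean_minus_median <- mean_2020 - med
mean_minus_median
```

Both numbers come out slightly negative here, which would be consistent with a mild left skew rather than a right skew.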
The histogram looks somewhat bell-shaped. Is the distribution close to a Gaussian? Let’s find out with a quantile-quantile plot, produced by the following chunk
q <- ggplot(hapest_2020,aes(sample=Year_2020))
q + stat_qq() + stat_qq_line() + ylab("happiness_in_2020")
Note that, for a Gaussian distribution:
1. About 68% of the data falls within 1 standard deviation of the mean, which corresponds to the range from -1 to 1 on the theoretical-quantile axis of the quantile-quantile plot.
2. About 95% of the data falls within 2 standard deviations of the mean, which corresponds to the range from -2 to 2.
3. About 99.7% of the data falls within 3 standard deviations of the mean.
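These three percentages can be checked against the standard normal distribution with pnorm(). This is a short sketch; `k` and `coverage` are made-up names.

```r
k <- 1:3                              # number of standard deviations
coverage <- pnorm(k) - pnorm(-k)      # probability of falling within k SDs of the mean
round(100 * coverage, 1)              # 68.3 95.4 99.7
```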
summary(hapest_2020$Year_2020)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 26.30 49.85 57.70 57.27 64.05 78.90 48
summary(hapest_2019$Year_2019)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 23.80 49.20 56.30 55.71 62.90 77.80 21
Comparing the histograms, the 2019 happiness scores look more balanced on both sides (in density, left and right), whereas in 2020 the scores are somewhat higher for most countries.
In 2019 the happiness score was lower for most countries; after one year of lockdown the overall score appears to have risen, though modestly: the mean went from 55.71 to 57.27.
The minimum rose by about 2.5 (from 23.8 to 26.3), the median rose from 56.3 to 57.7, and the maximum rose by about 1 (from 77.8 to 78.9). None of these shifts is large, and the number of countries with missing scores grew from 21 to 48, so the two summaries do not cover exactly the same countries. Still, the histograms suggest that after one year of the pandemic overall happiness increased to some extent around the world.
From these differences we could conclude that, even though the pandemic’s initial effects were thought to be quite detrimental, happiness scores rose in many countries. One possible explanation is that keeping families together at home made them happier, although these data alone cannot confirm it.
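Because the 2019 and 2020 summaries have different numbers of missing values, one way to make the comparison fairer is to keep only countries with a score in both years and look at their changes. The sketch below does this for a handful of rows typed in by hand from the tables above. These rows are hand-picked and not representative, so the result illustrates the method only; `sample_rows` and `paired` are made-up names.

```r
library(dplyr)
library(tidyr)

# A few rows copied from the tables above (Rwanda has no 2020 score)
sample_rows <- tibble(
  country   = c("Finland", "Iceland", "Denmark", "Rwanda", "India", "Lebanon"),
  Year_2019 = c(77.8, 75.3, 76.9, 32.7, 32.5, 40.2),
  Year_2020 = c(78.9, 75.8, 75.2, NA, 42.2, 26.3)
)

# Keep only countries observed in both years, then compute each change
paired <- sample_rows %>%
  drop_na(Year_2019, Year_2020) %>%
  mutate(change = Year_2020 - Year_2019)

# Number of paired countries, average change, and share of countries that rose
summarise(paired,
          n_countries = n(),
          mean_change = mean(change),
          share_up    = mean(change > 0))
```

Applying the same steps to affected_5_years would show how many countries actually rose between 2019 and 2020, rather than relying on summaries of two different sets of countries.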