Objectives:

For this lab you should…

Part 1: Getting ready to use data

The following packages will need to be loaded into R before you begin the lab. To run a chunk of R code, click the green arrow symbol in the upper right of the gray chunk. If you would like to run the code line by line, press Ctrl-Enter with your cursor on the desired line.
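The package-loading chunk is not reproduced in this copy of the lab. Judging from the functions used later (favstats(), tally(), gf_histogram(), glimpse()), it probably resembles the sketch below; the exact package list is an assumption.

```r
# Assumed package list, inferred from the functions used later in this lab
library(dplyr)      # glimpse()
library(ggformula)  # gf_histogram(), gf_boxplot(), gf_bar(), gf_point()
library(mosaic)     # formula versions of mean(), sd(), favstats(), tally()
```

mosaic is loaded last here so that its formula-aware tally() is not masked by dplyr's tally().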

The Happiness Data

The happiness data set is from the 2021 World Happiness Report (https://Worldhappiness.report/ed/2021). Each observation in this dataset represents a country. The dataset has 15 columns: the country name plus 14 variables covering citizens' opinions of their countries, characteristics of the countries, and Covid-19-related summaries. A multitude of interesting questions can be posed (and answered!) using these data.

TASK 1.0 Download the dataset WorldHappiness.csv from the Lab 1.1 assignment page. Once you’ve downloaded it, use the following code to read it into R:
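The reading chunk is missing from this copy. A minimal sketch, assuming the file was saved to the current working directory and that the data frame is called WorldHappiness (both are assumptions; the lab's own chunk may use a different path, name, or reading function):

```r
# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

# One row per country; dim() returns c(rows, columns)
dim(WorldHappiness)
```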

TASK 1.1 To ensure that you have successfully loaded the dataset, you can check using the function head(DataSetName), which displays the first six rows of the data, and glimpse(DataSetName), which provides a summary of the data frame.
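The check described in this task might look like the sketch below. It assumes the data frame is named WorldHappiness and the CSV is in the working directory (neither is confirmed by this copy of the lab); glimpse() comes from the dplyr package.

```r
library(dplyr)  # provides glimpse()

# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

head(WorldHappiness)     # first six rows
glimpse(WorldHappiness)  # one line per column: name, type, first values
```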

Rows: 148
Columns: 15
$ Name                    <chr> "United States", "Egypt", "Morocco", "Lebanon", "Saudi Arabia", "…
$ Region                  <chr> "North America and ANZ", "Middle East and North Africa", "Middle …
$ HappinessScore          <dbl> 6.951, 4.283, 4.918, 4.584, 6.494, 4.395, 4.948, 4.934, 5.345, 5.…
$ Population2019          <dbl> 328.239523, 100.388073, 36.471769, 6.855713, 34.268528, 10.101694…
$ CovidDeaths2020         <dbl> 104.451, 7.457, 20.016, 21.508, 17.875, 37.577, 24.758, 4.607, 8.…
$ MedianAge               <dbl> 38.3, 25.3, 29.6, 31.1, 31.9, 23.2, 31.6, 23.5, 29.3, 27.5, 40.8,…
$ FemaleGovBoss           <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No"…
$ InstitutionalTrust      <dbl> 0.250, 0.446, 0.397, 0.107, 0.651, 0.465, 0.295, 0.277, 0.561, 0.…
$ ExcessDeaths            <dbl> 179.220, NA, NA, NA, NA, NA, NA, NA, NA, NA, 133.313, 106.597, 73…
$ LogGDP                  <dbl> 11.023, 9.367, 8.903, 9.626, 10.743, 9.182, 10.240, 8.458, 9.365,…
$ SocialSupport           <dbl> 0.920, 0.750, 0.560, 0.848, 0.891, 0.767, 0.822, 0.651, 0.811, 0.…
$ HealthyLifeExpectancy   <dbl> 68.200, 61.998, 66.208, 67.355, 66.603, 67.000, 67.199, 58.709, 6…
$ Freedom                 <dbl> 0.837, 0.749, 0.774, 0.525, 0.877, 0.755, 0.576, 0.726, 0.873, 0.…
$ PerceptionsOfCorruption <dbl> 0.698, 0.795, 0.801, 0.898, 0.684, 0.705, 0.776, 0.787, 0.867, 0.…
$ Island                  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes…

TASK 1.2 Use your results from Task 1.1 to determine which columns are categorical data, and which are numeric:

**Response**

Categorical: Name, Region, FemaleGovBoss, Island

Numeric: HappinessScore, Population2019, CovidDeaths2020, MedianAge, InstitutionalTrust, ExcessDeaths, LogGDP, SocialSupport, HealthyLifeExpectancy, Freedom, PerceptionsOfCorruption

Model-centric approach to R

For the purposes of our class, it’s useful to learn a model-centric approach to R. The pseudo-code below is going to be our foundation for the rest of the class:

function( Y ~ X, data = DataSetName )

Here’s a short description of each part in the pseudo-code above:

  • function is an R function that dictates something you want to do with your data
    • for example, mean calculates the mean
  • Y is the outcome of interest (response variable)
  • X is some explanatory variable or you can use “1” as a placeholder if there is no explanatory variable
  • DataSetName is the name of a data set loaded into the R environment
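As a small illustration of the template, the sketch below applies mean() to the first nine countries visible in the glimpse() output above. The data frame name first_rows is made up for this example, and the code assumes the mosaic package, which supplies the formula version of mean().

```r
library(mosaic)  # assumed: supplies the formula interface for mean()

# First nine rows of two columns, copied from the glimpse() output above
first_rows <- data.frame(
  HappinessScore = c(6.951, 4.283, 4.918, 4.584, 6.494,
                     4.395, 4.948, 4.934, 5.345),
  Island = c("No", "No", "No", "No", "No", "No", "No", "No", "Yes")
)

# No explanatory variable: one overall mean
overall_mean <- mean(~ HappinessScore, data = first_rows)

# With an explanatory variable: one mean per value of Island
mean_by_island <- mean(HappinessScore ~ Island, data = first_rows)

overall_mean
mean_by_island
```

The same pattern carries over to the full data set by swapping first_rows for the name of the loaded data frame.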

A summary of the most commonly used functions for this class is provided at the bottom of this document. In the work below you will be introduced to the functions and their uses one at a time in the context of answering research questions.

Research Question A: Are different regions of the world happier?

To use the data to answer the research question, we could compute data summaries, such as average happiness, for each of the regions. In this situation, HappinessScore is our response variable and Region is our explanatory variable.

TASK 2.1 Modify and run the code below to calculate the mean HappinessScore for each region.
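The chunk to modify is not reproduced in this copy. A version that fits the task could look like the sketch below, assuming the mosaic package and a data frame named WorldHappiness read from the working directory (the names WorldHappiness and region_means are assumptions).

```r
library(mosaic)

# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

# Response on the left of ~, grouping variable on the right
region_means <- mean(HappinessScore ~ Region, data = WorldHappiness)
region_means
```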

        Central and Eastern Europe Commonwealth of Independent States 
                          5.984765                           5.467000 
                         East Asia        Latin America and Caribbean 
                          5.810333                           5.908050 
      Middle East and North Africa              North America and ANZ 
                          5.219765                           7.128500 
                        South Asia                     Southeast Asia 
                          4.441857                           5.407556 
                Sub-Saharan Africa                     Western Europe 
                          4.499800                           6.914905 

TASK 2.2 Use the function ‘favstats()’, and model-centric syntax, to determine which region has the country with the highest happiness score.
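One possible way to do this, assuming the mosaic package and a data frame named WorldHappiness (the object name region_stats is made up for this sketch):

```r
library(mosaic)

# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

# One row of summary statistics per region
region_stats <- favstats(HappinessScore ~ Region, data = WorldHappiness)
region_stats

# Region whose maximum score is the largest
region_stats$Region[which.max(region_stats$max)]
```

Reading the max column of the printed table by eye gives the same answer; the last line just automates that lookup.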

**Response: Western Europe**

TASK 2.3 Which summary statistics are calculated by the favstats() function?

**Response: min, Q1, median, Q3, max, mean, sd, n, missing**

Now it’s time to visualize the distribution of Happiness Score for the different regions.

TASK 2.4 Modify the code below to create a histogram of Happiness Score, without accounting for other variables. What is the shape of the distribution: left-skewed, right-skewed, or roughly symmetric?
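A sketch of the histogram call, assuming the ggformula package and a data frame named WorldHappiness in the working directory (both assumptions, since the original chunk is not shown):

```r
library(ggformula)

# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

# One-sided formula: a single variable, no grouping
gf_histogram(~ HappinessScore, data = WorldHappiness)
```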

**Response: Roughly Symmetric**

TASK 2.5 Generate side-by-side boxplots of the HappinessScore, separated by Region. Write a sentence about what you observe in the figure.
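The boxplots follow the same Y ~ X pattern as the numerical summaries. A sketch, again assuming ggformula and a data frame named WorldHappiness:

```r
library(ggformula)

# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

# One box per region, happiness score on the vertical axis
gf_boxplot(HappinessScore ~ Region, data = WorldHappiness)
```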

**Response: Regions differ noticeably. North America and ANZ and Western Europe have the highest Happiness Scores, while South Asia and Sub-Saharan Africa have the lowest, which matches the regional means from Task 2.1.**

TASK 2.6 Based on everything you’ve learned, do you think that happiness scores vary by region?

**Response: Based on the box plot above, happiness scores appear to vary by region because many of the boxes do not overlap or have little overlap.**

Research Question B: Are countries run by female leaders happier?

TASK 3.1 Write and run code to determine whether the mean Happiness Score is higher or lower for countries with female leaders. Hint: the variable ‘FemaleGovBoss’ indicates whether a country is run by a female leader.
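One way to approach this is with favstats(), which reports the mean alongside the quartiles for each group. The sketch assumes the mosaic package and a data frame named WorldHappiness; leader_stats is a made-up name.

```r
library(mosaic)

# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

# Summary statistics of HappinessScore for FemaleGovBoss = "No" and "Yes"
leader_stats <- favstats(HappinessScore ~ FemaleGovBoss, data = WorldHappiness)
leader_stats
```

Using mean(HappinessScore ~ FemaleGovBoss, data = WorldHappiness) instead would return only the two group means.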

**Response: There appears to be an association between having a female leader and reporting a higher happiness score, as the mean happiness score is higher for countries with female leaders than for countries without them. However, the Q1 values for the two groups are very similar, and the Q3 values are not far apart either, which leads me to think there may not be direct causation between the two variables.**

TASK 3.2 How many countries are governed by female leaders? You can tabulate observations of a categorical variable using the ‘tally’ command. Modify the code below to answer the question.
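A sketch of the tally call, assuming the mosaic package and a data frame named WorldHappiness (leader_counts is a made-up name):

```r
library(mosaic)

# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

# Count the countries in each category of FemaleGovBoss
leader_counts <- tally(~ FemaleGovBoss, data = WorldHappiness)
leader_counts
```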

**Response: 22**
FemaleGovBoss
 No Yes 
126  22 

TASK 3.3 Create an appropriate visual summary to investigate research question B.

TASK 3.4 Based on what you learned in tasks 3.1 and 3.3, what do you think the answer to research question B would be?

**Response: I think the answer is that a higher happiness score is associated with having a female leader, though these observational data alone cannot show that one causes the other. The lower whisker of the boxplot above for countries without a female leader extends much lower than that for countries with female leaders, and the median happiness score for countries with female leaders is noticeably higher, both of which point towards female leadership being positively related to happiness.**

Research Question C: Are Island nations more likely to have female leaders?

TASK 4.1 Use R to tally up how many countries are islands with female leaders, islands with male leaders, non-islands with female leaders, and non-islands with male leaders. Modify the code below to extract the numbers.
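With two categorical variables, both go on the right-hand side of the formula. A sketch, assuming the mosaic package and a data frame named WorldHappiness (island_by_leader is a made-up name):

```r
library(mosaic)

# Assumption: WorldHappiness.csv is in the current working directory
WorldHappiness <- read.csv("WorldHappiness.csv")

# Two-way table: Island in the rows, FemaleGovBoss in the columns
island_by_leader <- tally(~ Island + FemaleGovBoss, data = WorldHappiness)
island_by_leader
```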

      FemaleGovBoss
Island  No Yes
   No  109  19
   Yes  17   3

For a research question like this one that involves two binary variables, it is useful to calculate and compare two proportions:

TASK 4.2 What proportion of Island Nations have female leaders? You should calculate (# island nations with female leaders)/(# island nations)

**Response: 15%**

TASK 4.3 What proportion of non-Island Nations have female leaders? You should calculate (# non-island nations with female leaders)/(# non-island nations)

**Response: 14.8% (19 of the 128 non-island nations)**
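Both proportions can be computed directly from the Task 4.1 table. The sketch below re-enters those four counts by hand (the object names counts and row_props are made up) and divides each row by its row total, using only base R.

```r
# Counts copied from the Task 4.1 table (rows: Island, columns: FemaleGovBoss)
counts <- matrix(
  c(109, 19,
     17,  3),
  nrow = 2, byrow = TRUE,
  dimnames = list(Island = c("No", "Yes"), FemaleGovBoss = c("No", "Yes"))
)

# Divide each row by its row total, so every row sums to 1
row_props <- prop.table(counts, margin = 1)

# Share of female-led countries among non-islands and among islands
round(row_props[, "Yes"], 3)
```

The denominators are the row totals (128 non-island nations and 20 island nations), which gives 19/128 ≈ 0.148 and 3/20 = 0.150.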

To visualize the data as it relates to this question, you should make side-by-side or stacked barcharts.

TASK 4.4 Modify the code below to create a side by side barchart. Notice the syntax is a little different; you should try different options for X and Y until you have Islands on the bottom and different colors for FemaleGovBoss.
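A sketch of one arrangement that puts Island on the horizontal axis and colors the bars by FemaleGovBoss. So that it runs without the CSV, it rebuilds the two columns from the Task 4.1 counts; the data frame name island_leader is made up, and with the lab data you would pass the loaded data frame instead.

```r
library(ggformula)

# Rebuild the two columns from the Task 4.1 counts (one row per country)
island_leader <- data.frame(
  Island        = rep(c("No", "No", "Yes", "Yes"), times = c(109, 19, 17, 3)),
  FemaleGovBoss = rep(c("No", "Yes", "No", "Yes"), times = c(109, 19, 17, 3))
)

# Island on the x-axis, bar color by FemaleGovBoss, bars placed side by side
gf_bar(~ Island, fill = ~ FemaleGovBoss, data = island_leader,
       position = position_dodge())
```

Dropping the position argument stacks the two colors in a single bar per Island value instead.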

TASK 4.5 Based on your answers to Tasks 4.1–4.4, what do you think the answer to research question C would be?

**Response: Based on the bar graph above, whether or not a country is an island does not appear to affect the likelihood of it having a female leader, as the proportion of male to female leaders for both island and non-island countries appears very similar.**

Epilogue

Below are some summaries you can reference at any time.

Numerical summaries

The primary numerical summaries used in this course include the mean, the five-number summary, and the standard deviation. The functions below calculate these summaries for a single variable, or for that variable based on the value of a second categorical variable.

  • mean(~ Y, data = DataSetName) # mean of Y without considering other variables
  • mean(Y ~ X, data = DataSetName) # mean of Y for different values of X
  • sd(~ Y, data = DataSetName) # standard deviation of Y
  • sd(Y ~ X, data = DataSetName) # standard deviation of Y for different values of X
  • favstats(~ Y, data = DataSetName) # summary statistics of Y
  • favstats(Y ~ X, data = DataSetName) # summary statistics of Y for different values of X

Visual summaries

The primary graphical representations used in this course will be bar charts, boxplots, histograms, and scatterplots. Each plot helps to visualize different aspects of a dataset.

The functions that will be used to generate these plots are:

  • gf_boxplot(Y ~ X, data = DataSetName) # boxplot of Y for each value of X, if specified
  • gf_histogram(~Y, data = DataSetName) # histogram of Y
  • gf_point(Y ~ X, data = DataSetName) # scatterplot of Y versus X
  • gf_bar( ~Y, data = DataSetName) # Bar chart for a single variable
  • gf_bar( ~ X, fill = ~ Y, data = DataSetName, position = position_dodge()) # side-by-side bar chart
  • gf_bar( ~ X, fill = ~ Y, data = DataSetName) # stacked bar chart
