Objectives:

For this lab you should…

Part 1: Getting ready to use data

The following packages will need to be loaded into R before you begin the lab. To run a chunk of R code, click the green arrow symbol in the upper right of the gray chunk. If you would like to run line by line, hit Ctrl-Enter with your cursor next to the desired line.
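The chunk that loads the packages is not shown in this copy of the lab. A minimal sketch of what it might contain is below. The package list is an assumption, inferred from the functions used later in the lab (formula-style mean(), favstats(), tally(), the gf_ plotting functions, and glimpse()).

```r
# Assumed package list, inferred from the functions used later in this lab.
# Run install.packages(c("mosaic", "ggformula", "dplyr")) once if any are missing.
library(mosaic)     # formula versions of mean(), sd(), favstats(), tally()
library(ggformula)  # gf_histogram(), gf_boxplot(), gf_bar(), gf_point()
library(dplyr)      # glimpse()
```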

The Happiness Data

The happiness data set is from the 2021 World Happiness Report (https://Worldhappiness.report/ed/2021). Each observation in this dataset represents a country. In addition to the country name, there are 14 variables covering citizens' opinions of their countries, characteristics of those countries, and Covid-19 related summaries. A multitude of interesting questions can be posed (and answered!) using these data.

TASK 1.0 Download the dataset WorldHappiness.csv from the Lab 1.1 assignment page. Once you’ve downloaded it, use the following code to read it into R:
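The reading chunk is not shown in this copy. One way to do it is sketched below, assuming the CSV sits in the current working directory. The data frame name WorldHappiness is a placeholder chosen for illustration, and base R's read.csv() is used here although the original chunk may have used a different reader.

```r
# The data frame name WorldHappiness is an assumption; any valid name works.
# The file is assumed to be in the current working directory (check with getwd()).
csv_path <- "WorldHappiness.csv"

if (file.exists(csv_path)) {
  WorldHappiness <- read.csv(csv_path)
} else {
  message("File not found in ", getwd(), "; move the CSV there or adjust csv_path.")
}
```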

TASK 1.1 To ensure that you have successfully loaded the dataset, you can check using the function head(DataSetName), which displays the first six rows of the data, and glimpse(DataSetName), which provides a summary of the data frame.

Rows: 148
Columns: 15
$ Name                    <chr> "United States", "Egypt", "Morocco", "Lebanon", "Saudi Arabia", "Jordan", "Turkey", "Pakistan", "Indonesia", "Bangladesh", "Uni…
$ Region                  <chr> "North America and ANZ", "Middle East and North Africa", "Middle East and North Africa", "Middle East and North Africa", "Middl…
$ HappinessScore          <dbl> 6.951, 4.283, 4.918, 4.584, 6.494, 4.395, 4.948, 4.934, 5.345, 5.025, 7.064, 6.690, 7.155, 7.464, 6.834, 6.491, 6.483, 6.166, 5…
$ Population2019          <dbl> 328.239523, 100.388073, 36.471769, 6.855713, 34.268528, 10.101694, 83.429615, 216.565318, 270.625568, 163.046161, 66.834405, 67…
$ CovidDeaths2020         <dbl> 104.451, 7.457, 20.016, 21.508, 17.875, 37.577, 24.758, 4.607, 8.094, 4.590, 108.450, 99.212, 40.331, 67.260, 168.496, 108.731,…
$ MedianAge               <dbl> 38.3, 25.3, 29.6, 31.1, 31.9, 23.2, 31.6, 23.5, 29.3, 27.5, 40.8, 42.0, 46.6, 43.2, 41.8, 45.5, 47.9, 41.8, 43.4, 43.3, 43.0, 4…
$ FemaleGovBoss           <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No",…
$ InstitutionalTrust      <dbl> 0.250, 0.446, 0.397, 0.107, 0.651, 0.465, 0.295, 0.277, 0.561, 0.577, 0.268, 0.234, 0.435, 0.522, 0.303, 0.143, 0.076, 0.304, 0…
$ ExcessDeaths            <dbl> 179.220, NA, NA, NA, NA, NA, NA, NA, NA, NA, 133.313, 106.597, 73.548, 114.468, 169.894, 173.233, 182.298, 206.508, 128.939, 18…
$ LogGDP                  <dbl> 11.023, 9.367, 8.903, 9.626, 10.743, 9.182, 10.240, 8.458, 9.365, 8.454, 10.707, 10.704, 10.873, 10.932, 10.823, 10.571, 10.623…
$ SocialSupport           <dbl> 0.920, 0.750, 0.560, 0.848, 0.891, 0.767, 0.822, 0.651, 0.811, 0.693, 0.934, 0.942, 0.903, 0.942, 0.906, 0.932, 0.880, 0.898, 0…
$ HealthyLifeExpectancy   <dbl> 68.200, 61.998, 66.208, 67.355, 66.603, 67.000, 67.199, 58.709, 62.236, 64.800, 72.500, 74.000, 72.500, 72.400, 72.199, 74.700,…
$ Freedom                 <dbl> 0.837, 0.749, 0.774, 0.525, 0.877, 0.755, 0.576, 0.726, 0.873, 0.877, 0.859, 0.822, 0.875, 0.913, 0.783, 0.761, 0.693, 0.841, 0…
$ PerceptionsOfCorruption <dbl> 0.698, 0.795, 0.801, 0.898, 0.684, 0.705, 0.776, 0.787, 0.867, 0.682, 0.459, 0.571, 0.460, 0.338, 0.646, 0.745, 0.866, 0.735, 0…
$ Island                  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…

TASK 1.2 Use your results from Task 1.1 to determine which columns are categorical data, and which are numeric:

**Response**

Categorical: 1, 2, 7, 15

Numeric: 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14

Model-centric approach to R

For the purposes of our class, it’s useful to learn a model-centric approach to R. The pseudo-code below is going to be our foundation for the rest of the class:

function( Y ~ X, data = DataSetName )

Here’s a short description of each part in the pseudo-code above:

  • function is an R function that specifies what you want to do with your data
    • for example, mean calculates the mean
  • Y is the outcome of interest (response variable)
  • X is some explanatory variable or you can use “1” as a placeholder if there is no explanatory variable
  • DataSetName is the name of a data set loaded into the R environment
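The pattern described in the list above can be tried on a small made-up data frame. The names toy, Score, and Group are hypothetical and stand in for DataSetName, Y, and X. The sketch assumes the mosaic package, which supplies the formula interface for mean().

```r
library(mosaic)  # assumed: provides the formula interface for mean()

# Hypothetical toy data, not the lab data set
toy <- data.frame(
  Score = c(6.0, 7.0, 4.0, 5.0, 6.0),
  Group = c("A", "A", "B", "B", "B")
)

# function( Y ~ X, data = DataSetName ): mean of Score within each Group
group_means <- mean(Score ~ Group, data = toy)
group_means

# No explanatory variable: leave the left side of the formula empty
overall_mean <- mean(~ Score, data = toy)
overall_mean
```

The same three pieces (function, formula, data) carry over to sd(), favstats(), and the plotting functions used later.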

A summary of the most commonly used functions for this class is provided at the bottom of this document. In the work below you will be introduced to the functions and their uses one at a time in the context of answering research questions.

Research Question A: Are different regions of the world happier?

To use the data to answer the research question, we could compute summaries, such as average happiness, for each region. In this situation, HappinessScore is our response variable and Region is our explanatory variable.

TASK 2.1 Modify and run the code below to calculate the mean HappinessScore for each region.

        Central and Eastern Europe Commonwealth of Independent States                          East Asia        Latin America and Caribbean 
                          5.984765                           5.467000                           5.810333                           5.908050 
      Middle East and North Africa              North America and ANZ                         South Asia                     Southeast Asia 
                          5.219765                           7.128500                           4.441857                           5.407556 
                Sub-Saharan Africa                     Western Europe 
                          4.499800                           6.914905 

TASK 2.2 Use the function ‘favstats()’, and model-centric syntax, to determine which region has the country with the highest happiness score.
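As a sketch of how favstats() output can be read for a question like this one, the example below uses a hypothetical toy data frame (toy, Score, Group are made-up names) and assumes the mosaic package. The group holding the single largest observation is found in the max column, which need not be the group with the largest mean.

```r
library(mosaic)  # assumed: provides favstats()

# Hypothetical toy data: group A has the higher mean, group B the higher maximum
toy <- data.frame(
  Score = c(6.0, 7.0, 4.0, 5.0, 9.0),
  Group = c("A", "A", "B", "B", "B")
)

# One row of summary statistics per group
stats_by_group <- favstats(Score ~ Group, data = toy)
stats_by_group

# Group whose largest single value is the biggest overall
top_group <- as.character(stats_by_group$Group[which.max(stats_by_group$max)])
top_group
```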

**Response:**

North America and ANZ has the highest mean and median happiness score

TASK 2.3 Which summary statistics are calculated by the favstats() function?

**Response:**

The minimum, first quartile, median (second quartile), third quartile, maximum, mean, standard deviation, number of observations, and number of missing values.

Now it’s time to visualize the distribution of Happiness Score for the different regions.

TASK 2.4 Modify the code below to create a histogram of Happiness Score, without accounting for other variables. What is the shape of the distribution: left-skewed, right-skewed, or roughly symmetric?
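The histogram chunk is not shown in this copy. A minimal sketch of the call is below, using a hypothetical toy data frame and assuming the ggformula package. The bins value is an illustrative choice for such a small sample, not a setting taken from the lab.

```r
library(ggformula)  # assumed: provides gf_histogram()

# Hypothetical toy data, not the lab data set
toy <- data.frame(Score = c(4.2, 4.9, 5.1, 5.4, 5.6, 5.8, 6.1, 6.3, 6.9, 7.4))

# One-sided formula: a histogram describes a single variable
score_hist <- gf_histogram(~ Score, data = toy, bins = 5)
score_hist
```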

**Response:**

Roughly symmetric

TASK 2.5 Generate side-by-side boxplots of the HappinessScore, separated by Region. Write a sentence about what you observe in the figure.
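Side-by-side boxplots follow the same Y ~ X pattern. The sketch below uses a hypothetical toy data frame and assumes the ggformula package. With the lab data, the response would go on the left of the formula and the grouping variable on the right.

```r
library(ggformula)  # assumed: provides gf_boxplot()

# Hypothetical toy data, not the lab data set
toy <- data.frame(
  Score = c(5.9, 6.4, 6.8, 7.1, 4.1, 4.6, 5.0, 5.3),
  Group = rep(c("A", "B"), each = 4)
)

# One box per level of Group
score_box <- gf_boxplot(Score ~ Group, data = toy)
score_box
```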

**Response:**

A total of 6 outliers can be observed in the side-by-side boxplots.

TASK 2.6 Based on everything you’ve learned, do you think that happiness scores vary by region?

**Response:**

The data suggest that there is a relationship between happiness scores and region.

Research Question B: Are countries run by female leaders happier?

TASK 3.1 Write and run code to determine whether the mean Happiness Score is higher or lower for countries with female leaders. Hint: the variable ‘FemaleGovBoss’ indicates whether a country is run by a female leader.

**Response:**

TASK 3.2 How many countries are governed by female leaders? You can tabulate observations of a categorical variable using the ‘tally’ command. Modify the code below to answer the question.
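The tally chunk is not shown in this copy. The sketch below assumes the mosaic package and, so that it runs without the CSV, rebuilds a stand-in data frame from the counts reported in this lab (126 No, 22 Yes). The name leaders is hypothetical.

```r
library(mosaic)  # assumed: provides the formula interface for tally()

# Stand-in data frame rebuilt from the counts reported in this lab
leaders <- data.frame(FemaleGovBoss = rep(c("No", "Yes"), times = c(126, 22)))

# One-sided formula: count the observations in each category
leader_counts <- tally(~ FemaleGovBoss, data = leaders)
leader_counts
```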

**Response:**

22 countries are governed by female leaders. Categorical variables can be tabulated using "tally."
FemaleGovBoss
 No Yes 
126  22 

TASK 3.3 Create an appropriate visual summary to investigate research question B.

TASK 3.4 Based on what you learned in tasks 3.1 and 3.3, what do you think the answer to research question B would be?

**Response:**

Countries run by female leaders are happier.

Research Question C: Are Island nations more likely to have female leaders?

TASK 4.1 Use R to tally up how many countries are islands with female leaders, islands with male leaders, not islands with female leaders, and not islands with male leaders. Modify the code below to extract the numbers.
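The two-way tally chunk is not shown in this copy. A sketch of one possible call is below, assuming the mosaic package. To run without the CSV, it expands the four cell counts reported in this lab into a stand-in data frame with one row per country. The names cells and countries are hypothetical.

```r
library(mosaic)  # assumed: provides the formula interface for tally()

# Cell counts as reported in this lab's two-way table
cells <- data.frame(
  FemaleGovBoss = c("No", "Yes", "No", "Yes"),
  Island        = c("No", "No", "Yes", "Yes"),
  n             = c(109, 19, 17, 3)
)

# Expand to one row per country (148 rows in total)
countries <- cells[rep(seq_len(nrow(cells)), times = cells$n),
                   c("FemaleGovBoss", "Island")]

# Rows are FemaleGovBoss, columns are Island
two_way <- tally(FemaleGovBoss ~ Island, data = countries, format = "count")
two_way
```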

             Island
FemaleGovBoss  No Yes
          No  109  17
          Yes  19   3

For a research question like this one that involves two binary variables, it is useful to calculate and compare two proportions:

TASK 4.2 What proportion of Island Nations have female leaders? You should calculate (# island nations with female leaders)/(# island nations)

**Response:**

3/20 = 15%
[1] 0.15

TASK 4.3 What proportion of non-Island Nations have female leaders? You should calculate (# non-island nations with female leaders)/(# non-island nations)

**Response:**

19/128 ~= 14.8%
[1] 0.1484375
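Both proportions can be computed directly from the two-way table in Task 4.1. The sketch below uses only base R and the counts from that table. Each denominator is the column total, that is, all island nations or all non-island nations regardless of the leader's gender. The object names are hypothetical.

```r
# Counts copied from the two-way table in Task 4.1
# (matrix() fills column by column: Island = No first, then Island = Yes)
counts <- matrix(
  c(109, 19, 17, 3),
  nrow = 2,
  dimnames = list(FemaleGovBoss = c("No", "Yes"), Island = c("No", "Yes"))
)

# Total number of non-island and island nations
island_totals <- colSums(counts)
island_totals

# Proportion with a female leader within each Island category
prop_female <- counts["Yes", ] / island_totals
prop_female
```

The same column-wise proportions can also be obtained with prop.table(counts, margin = 2).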

To visualize the data as it relates to this question, you should make side-by-side or stacked barcharts.

TASK 4.4 Modify the code below to create a side by side barchart. Notice the syntax is a little different; you should try different options for X and Y until you have Islands on the bottom and different colors for FemaleGovBoss.
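The barchart chunk is not shown in this copy. One arrangement that puts Island on the horizontal axis and colors the bars by FemaleGovBoss is sketched below, assuming the ggformula and ggplot2 packages. The stand-in data frame countries is hypothetical and is rebuilt from the cell counts in Task 4.1 so the code runs without the CSV.

```r
library(ggformula)  # assumed: provides gf_bar()
library(ggplot2)    # provides position_dodge()

# Stand-in data frame rebuilt from the Task 4.1 cell counts
cells <- data.frame(
  FemaleGovBoss = c("No", "Yes", "No", "Yes"),
  Island        = c("No", "No", "Yes", "Yes"),
  n             = c(109, 19, 17, 3)
)
countries <- cells[rep(seq_len(nrow(cells)), times = cells$n),
                   c("FemaleGovBoss", "Island")]

# Island on the x-axis, one bar per FemaleGovBoss level placed side by side
island_bars <- gf_bar(~ Island, fill = ~ FemaleGovBoss, data = countries,
                      position = position_dodge())
island_bars
```

Dropping the position argument gives the stacked version of the same chart.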

TASK 4.5 Based on your answers to Tasks 4.1–4.4, what do you think the answer to research question C would be?

**Response:**

Islands are not significantly more likely to have female leaders.

Epilogue

Below are some summaries you can reference at any time.

Numerical summaries

The primary numerical summaries used in this course include the mean, the five-number summary, and the standard deviation. The functions below calculate these summaries for a single variable, or for that variable based on the value of a second categorical variable.

  • mean(~ Y, data = DataSetName) # mean of Y without considering other variables
  • mean(Y ~ X, data = DataSetName) # mean of Y for different values of X
  • sd(~ Y, data = DataSetName) # standard deviation of Y
  • sd(Y ~ X, data = DataSetName) # standard deviation of Y for different values of X
  • favstats(~ Y, data = DataSetName) # summary statistics of Y
  • favstats(Y ~ X, data = DataSetName) # summary statistics of Y for different values of X

Visual summaries

The primary graphical representations used in this course will be barcharts, boxplots, histograms, and scatterplots. Each plot helps to visualize different aspects of a dataset.

The functions that will be used to generate these plots are:

  • gf_boxplot(Y ~ X, data = DataSetName) # boxplot of Y for each value of X, if specified
  • gf_histogram(~Y, data = DataSetName) # histogram of Y
  • gf_point(Y ~ X, data = DataSetName) # scatterplot of Y versus X
  • gf_bar( ~Y, data = DataSetName) # Bar chart for a single variable
  • gf_bar( ~ X, fill = ~ Y, data = DataSetName, position = position_dodge()) # side-by-side bar chart
  • gf_bar( ~ X, fill = ~ Y, data = DataSetName) # stacked bar chart
