For this lab you should…
The following packages will need to be loaded into R before you begin the lab. To run a chunk of R code, click the green arrow symbol in the upper right of the gray chunk. If you would like to run line by line, hit Ctrl-Enter with your cursor next to the desired line.
The happiness data set is from the 2021 World Happiness Report https://Worldhappiness.report/ed/2021. Each observation in this dataset represents a country in the world. There are 14 variables observed about citizens' opinions on their countries, characteristics of their countries, and Covid-19 related summaries. A multitude of interesting questions can be posed (and answered!) using these data.
TASK 1.0 Download the dataset WorldHappiness.csv from the Lab 1.1 assignment page. Once you’ve downloaded it, use the following code to read it into R:
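The reading step can be sketched as follows. This is a minimal example using base R's read.csv(); the object name happy is a hypothetical choice, and a two-row stand-in file with made-up values is written first so the example runs even without the download.

```r
# Stand-in for the downloaded file: a two-row CSV with made-up values,
# written to a temporary location so the example runs anywhere.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Region,HappinessScore",
             "Country A,Region 1,5.1",
             "Country B,Region 2,6.3"),
           csv_path)

# For the lab itself, replace csv_path with "WorldHappiness.csv"
# (assuming the file sits in your working directory).
happy <- read.csv(csv_path)  # 'happy' is a hypothetical object name

dim(happy)  # number of rows, then number of columns
```

If R reports that the file cannot be found, the file is probably not in the working directory; getwd() shows where R is looking.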
TASK 1.1 To ensure that you have successfully loaded the dataset, you can check using the function head(DataSetName), which displays the first six rows of the data, and glimpse(DataSetName), which provides a summary of the data frame.
Rows: 148
Columns: 15
$ Name <chr> "United States", "Egypt", "Morocco", "Lebanon", "Saudi Arabia", "Jordan", "Turkey", "Pakistan", "Indonesia", "Bangladesh", "Uni…
$ Region <chr> "North America and ANZ", "Middle East and North Africa", "Middle East and North Africa", "Middle East and North Africa", "Middl…
$ HappinessScore <dbl> 6.951, 4.283, 4.918, 4.584, 6.494, 4.395, 4.948, 4.934, 5.345, 5.025, 7.064, 6.690, 7.155, 7.464, 6.834, 6.491, 6.483, 6.166, 5…
$ Population2019 <dbl> 328.239523, 100.388073, 36.471769, 6.855713, 34.268528, 10.101694, 83.429615, 216.565318, 270.625568, 163.046161, 66.834405, 67…
$ CovidDeaths2020 <dbl> 104.451, 7.457, 20.016, 21.508, 17.875, 37.577, 24.758, 4.607, 8.094, 4.590, 108.450, 99.212, 40.331, 67.260, 168.496, 108.731,…
$ MedianAge <dbl> 38.3, 25.3, 29.6, 31.1, 31.9, 23.2, 31.6, 23.5, 29.3, 27.5, 40.8, 42.0, 46.6, 43.2, 41.8, 45.5, 47.9, 41.8, 43.4, 43.3, 43.0, 4…
$ FemaleGovBoss <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No",…
$ InstitutionalTrust <dbl> 0.250, 0.446, 0.397, 0.107, 0.651, 0.465, 0.295, 0.277, 0.561, 0.577, 0.268, 0.234, 0.435, 0.522, 0.303, 0.143, 0.076, 0.304, 0…
$ ExcessDeaths <dbl> 179.220, NA, NA, NA, NA, NA, NA, NA, NA, NA, 133.313, 106.597, 73.548, 114.468, 169.894, 173.233, 182.298, 206.508, 128.939, 18…
$ LogGDP <dbl> 11.023, 9.367, 8.903, 9.626, 10.743, 9.182, 10.240, 8.458, 9.365, 8.454, 10.707, 10.704, 10.873, 10.932, 10.823, 10.571, 10.623…
$ SocialSupport <dbl> 0.920, 0.750, 0.560, 0.848, 0.891, 0.767, 0.822, 0.651, 0.811, 0.693, 0.934, 0.942, 0.903, 0.942, 0.906, 0.932, 0.880, 0.898, 0…
$ HealthyLifeExpectancy <dbl> 68.200, 61.998, 66.208, 67.355, 66.603, 67.000, 67.199, 58.709, 62.236, 64.800, 72.500, 74.000, 72.500, 72.400, 72.199, 74.700,…
$ Freedom <dbl> 0.837, 0.749, 0.774, 0.525, 0.877, 0.755, 0.576, 0.726, 0.873, 0.877, 0.859, 0.822, 0.875, 0.913, 0.783, 0.761, 0.693, 0.841, 0…
$ PerceptionsOfCorruption <dbl> 0.698, 0.795, 0.801, 0.898, 0.684, 0.705, 0.776, 0.787, 0.867, 0.682, 0.459, 0.571, 0.460, 0.338, 0.646, 0.745, 0.866, 0.735, 0…
$ Island <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…
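The check described in Task 1.1 might look like the sketch below. It assumes glimpse() comes from the dplyr package and uses a tiny made-up data frame (toy) in place of the lab's dataset.

```r
library(dplyr)  # assumed source of glimpse()

# Tiny made-up data frame standing in for the lab's dataset.
toy <- data.frame(
  Name = c("Country A", "Country B", "Country C"),
  HappinessScore = c(5.1, 6.3, 4.8),
  Island = c("No", "Yes", "No"),
  stringsAsFactors = FALSE
)

head(toy)     # first rows (up to six), printed as a table
glimpse(toy)  # one line per column: name, type (<chr>, <dbl>), first values

# Column types can also be pulled out directly.
col_types <- sapply(toy, class)
col_types
```

In the glimpse() output, <chr> marks text (categorical) columns and <dbl> marks numeric columns.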
TASK 1.2 Use your results from Task 1.1 to determine which columns are categorical data, and which are numeric:
**Response**
Categorical: 1, 2, 7, 15
Numeric: 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14
For the purposes of our class, it’s useful to learn a model-centric approach to R. The pseudo-code below is going to be our foundation for the rest of the class:
function( Y ~ X, data = DataSetName )
Here’s a short description of each part in the pseudo-code above:
- function is an R function that dictates something you want to do with your data, e.g. mean calculates the mean
- Y is the outcome of interest (response variable)
- X is some explanatory variable, or you can use “1” as a placeholder if there is no explanatory variable
- DataSetName is the name of a data set loaded into the R environment

A summary of the most commonly used functions for this class is provided at the bottom of this document. In the work below you will be introduced to the functions and their uses one at a time in the context of answering research questions.
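As a concrete instance of the template, the sketch below assumes the mosaic package, which supplies formula-aware versions of mean() and sd(). The data frame toy and its columns score and group are made up for illustration.

```r
library(mosaic)  # assumed source of the formula-aware mean() and sd()

# Made-up data: five scores split across two groups.
toy <- data.frame(
  score = c(4, 6, 5, 9, 7),
  group = c("a", "a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# function( Y ~ X, data = DataSetName ) with function = mean
group_means <- mean(score ~ group, data = toy)
group_means

# Same formula, different function: only the first word changes.
group_sds <- sd(score ~ group, data = toy)
group_sds
```

The formula and the data argument stay fixed while the function name changes, which is the pattern the rest of the lab relies on.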
To use the data to answer the research question, we could find data summaries, such as the average happiness score, for each of the regions. In this situation, HappinessScore is our response variable and Region is our explanatory variable.
TASK 2.1 Modify and run the code below to calculate the mean HappinessScore for each region.
Central and Eastern Europe            5.984765
Commonwealth of Independent States    5.467000
East Asia                             5.810333
Latin America and Caribbean           5.908050
Middle East and North Africa          5.219765
North America and ANZ                 7.128500
South Asia                            4.441857
Southeast Asia                        5.407556
Sub-Saharan Africa                    4.499800
Western Europe                        6.914905
TASK 2.2 Use the function ‘favstats()’, and model-centric syntax, to determine which region has the country with the highest happiness score.
**Response:**
North America and ANZ has the highest mean and median happiness score
TASK 2.3 Which summary statistics are calculated by the favstats() function?
**Response:**
The minimum, first quartile, median (second quartile), third quartile, maximum, mean, standard deviation, number of observations (n), and number of missing values for the variable.
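The favstats() call in Tasks 2.2 and 2.3 can be sketched as below, assuming favstats() comes from the mosaic package. The scores and the region names East and West are made up, chosen so that the region with the highest single score differs from the region with the highest mean.

```r
library(mosaic)  # assumed source of favstats()

# Made-up scores for two hypothetical regions.
toy <- data.frame(
  HappinessScore = c(4.0, 5.0, 9.5, 7.0, 7.5),
  Region = c("East", "East", "East", "West", "West"),
  stringsAsFactors = FALSE
)

stats <- favstats(HappinessScore ~ Region, data = toy)
stats  # one row per region: min, Q1, median, Q3, max, mean, sd, n, missing

# The region containing the single highest score is the row with the largest max.
top_by_max <- as.character(stats$Region[which.max(stats$max)])

# The region with the highest average is the row with the largest mean.
top_by_mean <- as.character(stats$Region[which.max(stats$mean)])

c(max = top_by_max, mean = top_by_mean)
```

Here the two answers differ, so a question about "the country with the highest score" should be read off the max column rather than the mean or median.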
Now it’s time to visualize the distribution of Happiness Score for the different regions.
TASK 2.4 Modify the code below to create a histogram of Happiness Score, without accounting for other variables. What is the shape of the distribution: left-skewed, right-skewed, or roughly symmetric?
**Response:**
Roughly symmetric
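A histogram of a single variable might be produced as in the sketch below. It assumes the ggformula package and its gf_histogram() function, which may differ from the function used in the original chunk; the scores are made up.

```r
library(ggformula)  # assumed plotting package; provides gf_histogram()

# Made-up scores standing in for the HappinessScore column.
toy <- data.frame(
  HappinessScore = c(4.2, 4.9, 5.1, 5.4, 5.6, 5.8, 6.1, 6.4, 7.0)
)

# With no explanatory variable, the variable goes on the right of the tilde.
p_hist <- gf_histogram(~ HappinessScore, data = toy, bins = 5)
p_hist
```

The bins argument controls how many bars are drawn; trying a few values helps confirm that the apparent shape is not an artifact of the binning.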
TASK 2.5 Generate side-by-side boxplots of the HappinessScore, separated by Region. Write a sentence about what you observe in the figure.
**Response:**
A total of 6 outliers can be observed in the side-by-side boxplots.
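Side-by-side boxplots follow the same Y ~ X pattern as the numerical summaries. The sketch below assumes ggformula's gf_boxplot() and uses made-up scores for two hypothetical regions.

```r
library(ggformula)  # assumed plotting package; provides gf_boxplot()

# Made-up scores for two hypothetical regions, four countries each.
toy <- data.frame(
  HappinessScore = c(4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.2, 7.5),
  Region = rep(c("East", "West"), each = 4),
  stringsAsFactors = FALSE
)

# Response on the left, grouping variable on the right: one box per region.
p_box <- gf_boxplot(HappinessScore ~ Region, data = toy)
p_box
```

When describing such a figure, comparing the centers (medians) and spreads of the boxes across groups usually says more about the research question than counting outliers.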
TASK 2.6 Based on everything you’ve learned, do you think that happiness scores vary by region?
**Response:**
The data suggest that there is a relationship between happiness scores and region.
TASK 3.1 Write and run code to determine whether the mean Happiness Score is higher or lower for countries with female leaders. Hint: the variable ‘FemaleGovBoss’ indicates whether a country is run by a female leader.
**Response:**
TASK 3.2 How many countries are governed by female leaders? You can tabulate observations of a categorical variable using the ‘tally’ command. Modify the code below to answer the question.
**Response:**
22 countries are governed by female leaders. Categorical variables can be tabulated using "tally."
FemaleGovBoss
No Yes
126 22
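A one-variable tally like the one above can be sketched as follows, assuming tally() comes from the mosaic package. The six Yes/No values are made up.

```r
library(mosaic)  # assumed source of tally()

# Made-up values standing in for the FemaleGovBoss column.
toy <- data.frame(
  FemaleGovBoss = c("No", "No", "Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Counts of each category; the variable goes on the right of the tilde.
counts <- tally(~ FemaleGovBoss, data = toy)
counts

# Same call, reported as proportions instead of counts.
props <- tally(~ FemaleGovBoss, data = toy, format = "proportion")
props
```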
TASK 3.3 Create an appropriate visual summary to investigate research question B.
TASK 3.4 Based on what you learned in tasks 3.1 and 3.3, what do you think the answer to research question B would be?
**Response:**
Countries run by female leaders are happier.
TASK 4.1 Use R to tally up how many countries are islands with female leaders, islands with male leaders, not islands with female leaders, and not islands with male leaders. Modify the code below to extract the numbers.
             Island
FemaleGovBoss  No Yes
          No  109  17
          Yes  19   3
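A two-way table of this kind can be sketched with the same formula syntax, again assuming mosaic's tally(). The six rows of data are made up.

```r
library(mosaic)  # assumed source of tally()

# Made-up values for six hypothetical countries.
toy <- data.frame(
  FemaleGovBoss = c("No", "No", "No", "Yes", "Yes", "No"),
  Island        = c("No", "Yes", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Rows come from the left of the tilde, columns from the right.
two_way <- tally(FemaleGovBoss ~ Island, data = toy)
two_way
```

Reading the table: each column is one Island category, and the entries within a column split those countries by FemaleGovBoss.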
For a research question like this one that involves two binary variables, it is useful to calculate and compare two proportions:
TASK 4.2 What proportion of Island Nations have female leaders? You should calculate (# island nations with female leaders)/(# island nations)
**Response:**
3/20 = 15% (3 island nations with female leaders out of 17 + 3 = 20 island nations)
[1] 0.15
TASK 4.3 What proportion of non-Island Nations have female leaders? You should calculate (# non-island nations with female leaders)/(# non-island nations)
**Response:**
19/128 ~= 14.8% (19 non-island nations with female leaders out of 109 + 19 = 128 non-island nations)
[1] 0.1484375
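The two proportions can be worked out directly from the Task 4.1 counts. The sketch below uses only base R; the object names counts, totals, and prop_female are hypothetical. The key point is that each denominator is a column total, which includes nations with and without female leaders.

```r
# Counts copied from the Task 4.1 table (rows: FemaleGovBoss, columns: Island).
counts <- matrix(c(109, 19, 17, 3), nrow = 2,
                 dimnames = list(FemaleGovBoss = c("No", "Yes"),
                                 Island = c("No", "Yes")))

# Column totals are the denominators: all non-island and all island nations.
totals <- colSums(counts)  # 109 + 19 non-islands, 17 + 3 islands

# Share of each column that has a female leader.
prop_female <- counts["Yes", ] / totals
prop_female
```

Dividing by only the "No" row (for example 3/17) gives the odds of a female leader rather than the proportion.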
To visualize the data as it relates to this question, you should make side-by-side or stacked barcharts.
TASK 4.4 Modify the code below to create a side-by-side barchart. Notice the syntax is a little different; you should try different options for X and Y until you have Island on the bottom axis and different colors for FemaleGovBoss.
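One way to get that layout is sketched below. It assumes ggformula's gf_bar() with a fill mapping and position = "dodge", which may not match the function in the original chunk; the data are made up.

```r
library(ggformula)  # assumed plotting package; provides gf_bar()

# Made-up values for six hypothetical countries.
toy <- data.frame(
  FemaleGovBoss = c("No", "No", "No", "Yes", "Yes", "No"),
  Island        = c("No", "Yes", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Island on the horizontal axis, one colored bar per FemaleGovBoss category.
# position = "dodge" places the bars side by side instead of stacking them.
p_bar <- gf_bar(~ Island, fill = ~ FemaleGovBoss, data = toy,
                position = "dodge")
p_bar
```

Dropping the position argument gives the stacked version of the same chart.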
TASK 4.5 Based on your answers to Tasks 4.1–4.4, what do you think the answer to research question C would be?
**Response:**
Islands are not significantly more likely to have female leaders.
Below are some summaries you can reference at any time.
The primary numerical summaries used in this course include the mean, the five-number summary, and the standard deviation. The functions below calculate these summaries for a single variable, or for that variable based on the value of a second categorical variable.
The primary graphical representations used in this course will be barcharts, boxplots, histograms, and scatterplots. Each plot helps to visualize different aspects of a dataset.
The functions that will be used to generate these plots are:
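Scatterplots are the one plot type not used in the tasks above. A minimal sketch is given here, assuming ggformula's gf_point(); the column names come from the dataset summary in Task 1.1, but the values are made up.

```r
library(ggformula)  # assumed plotting package; provides gf_point()

# Made-up values for two numeric columns.
toy <- data.frame(
  LogGDP         = c(8.5, 9.2, 9.8, 10.4, 10.9),
  HappinessScore = c(4.6, 5.1, 5.7, 6.5, 7.0)
)

# Response on the left (vertical axis), explanatory variable on the right.
p_scatter <- gf_point(HappinessScore ~ LogGDP, data = toy)
p_scatter
```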