Objectives:

For this lab you should…

Part 1: Getting ready to use data

The following packages will need to be loaded into R before you begin the lab. To run a chunk of R code, click the green arrow symbol in the upper right of the gray chunk. If you would like to run line by line, hit Ctrl-Enter with your cursor next to the desired line.

The Happiness Data

The happiness data set is from the 2021 World Happiness Report https://Worldhappiness.report/ed/2021. Each observation in this dataset represents a country in the world. There are 14 variables observed about citizens' opinions on their countries, characteristics of their countries, and Covid-19 related summaries. A multitude of interesting questions can be posed (and answered!) using these data.

TASK 1.0 Download the dataset WorldHappiness.csv from the Lab 1.1 assignment page. Once you’ve downloaded it, use the following code to read it into R:

TASK 1.1 To ensure that you have successfully loaded the dataset, you can check using the function head(DataSetName), which displays the first six rows of the data, and glimpse(DataSetName), which provides a summary of the data frame.

Rows: 148
Columns: 15
$ Name                    <chr> "United States", "Egypt", "Morocco", "Lebanon", "Saudi Arabia", "Jordan", "Turkey", "Pakistan", "Indonesia", "Bangladesh", "Uni…
$ Region                  <chr> "North America and ANZ", "Middle East and North Africa", "Middle East and North Africa", "Middle East and North Africa", "Middl…
$ HappinessScore          <dbl> 6.951, 4.283, 4.918, 4.584, 6.494, 4.395, 4.948, 4.934, 5.345, 5.025, 7.064, 6.690, 7.155, 7.464, 6.834, 6.491, 6.483, 6.166, 5…
$ Population2019          <dbl> 328.239523, 100.388073, 36.471769, 6.855713, 34.268528, 10.101694, 83.429615, 216.565318, 270.625568, 163.046161, 66.834405, 67…
$ CovidDeaths2020         <dbl> 104.451, 7.457, 20.016, 21.508, 17.875, 37.577, 24.758, 4.607, 8.094, 4.590, 108.450, 99.212, 40.331, 67.260, 168.496, 108.731,…
$ MedianAge               <dbl> 38.3, 25.3, 29.6, 31.1, 31.9, 23.2, 31.6, 23.5, 29.3, 27.5, 40.8, 42.0, 46.6, 43.2, 41.8, 45.5, 47.9, 41.8, 43.4, 43.3, 43.0, 4…
$ FemaleGovBoss           <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No",…
$ InstitutionalTrust      <dbl> 0.250, 0.446, 0.397, 0.107, 0.651, 0.465, 0.295, 0.277, 0.561, 0.577, 0.268, 0.234, 0.435, 0.522, 0.303, 0.143, 0.076, 0.304, 0…
$ ExcessDeaths            <dbl> 179.220, NA, NA, NA, NA, NA, NA, NA, NA, NA, 133.313, 106.597, 73.548, 114.468, 169.894, 173.233, 182.298, 206.508, 128.939, 18…
$ LogGDP                  <dbl> 11.023, 9.367, 8.903, 9.626, 10.743, 9.182, 10.240, 8.458, 9.365, 8.454, 10.707, 10.704, 10.873, 10.932, 10.823, 10.571, 10.623…
$ SocialSupport           <dbl> 0.920, 0.750, 0.560, 0.848, 0.891, 0.767, 0.822, 0.651, 0.811, 0.693, 0.934, 0.942, 0.903, 0.942, 0.906, 0.932, 0.880, 0.898, 0…
$ HealthyLifeExpectancy   <dbl> 68.200, 61.998, 66.208, 67.355, 66.603, 67.000, 67.199, 58.709, 62.236, 64.800, 72.500, 74.000, 72.500, 72.400, 72.199, 74.700,…
$ Freedom                 <dbl> 0.837, 0.749, 0.774, 0.525, 0.877, 0.755, 0.576, 0.726, 0.873, 0.877, 0.859, 0.822, 0.875, 0.913, 0.783, 0.761, 0.693, 0.841, 0…
$ PerceptionsOfCorruption <dbl> 0.698, 0.795, 0.801, 0.898, 0.684, 0.705, 0.776, 0.787, 0.867, 0.682, 0.459, 0.571, 0.460, 0.338, 0.646, 0.745, 0.866, 0.735, 0…
$ Island                  <chr> "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",…

TASK 1.2 Use your results from Task 1.1 to determine which columns are categorical data, and which are numeric:

**Response**

Categorical: 1, 2, 7, 15

Numeric: 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14

Model-centric approach to R

For the purposes of our class, it’s useful to learn a model-centric approach to R. The pseudo-code below is going to be our foundation for the rest of the class:

function( Y ~ X, data = DataSetName )

Here’s a short description of each part in the pseudo-code above:

  • function is an R function that specifies what you want to do with your data
    • for example, mean calculates the mean
  • Y is the outcome of interest (response variable)
  • X is some explanatory variable or you can use “1” as a placeholder if there is no explanatory variable
  • DataSetName is the name of a data set loaded into the R environment

A summary of the most commonly used functions for this class is provided at the bottom of this document. In the work below you will be introduced to the functions and their uses one at a time in the context of answering research questions.

Research Question A: Are different regions of the world happier?

To use the data to answer the research question, we could find data summaries, such as average happiness, for each of the regions. In this situation, HappinessScore is our response variable and Region is our explanatory variable.

TASK 2.1 Modify and run the code below to calculate the mean HappinessScore for each region.

        Central and Eastern Europe Commonwealth of Independent States                          East Asia        Latin America and Caribbean 
                          5.984765                           5.467000                           5.810333                           5.908050 
      Middle East and North Africa              North America and ANZ                         South Asia                     Southeast Asia 
                          5.219765                           7.128500                           4.441857                           5.407556 
                Sub-Saharan Africa                     Western Europe 
                          4.499800                           6.914905 

TASK 2.2 Use the function ‘favstats()’, and model-centric syntax, to determine which region has the country with the highest happiness score.

**Response:**

North America and ANZ has the highest mean and median happiness score

TASK 2.3 Which summary statistics are calculated by the favstats() function?

**Response:**

The minimum value, first quartile, median (second quartile), third quartile, maximum value, mean, standard deviation, number of observations, and number of missing values.

Now it’s time to visualize the distribution of Happiness Score for the different regions.

TASK 2.4 Modify the code below to create a histogram of Happiness Score, without accounting for other variables. What is the shape of the distribution: left-skewed, right-skewed, or roughly symmetric?

**Response:**

Roughly symmetric

TASK 2.5 Generate side-by-side boxplots of the HappinessScore, separated by Region. Write a sentence about what you observe in the figure.

**Response:**

A total of 6 outliers can be observed in the side-by-side boxplots.

TASK 2.6 Based on everything you’ve learned, do you think that happiness scores vary by region?

**Response:**

The data suggests that there is a relationship between happiness scores and region.

Research Question B: Are countries run by female leaders happier?

TASK 3.1 Write and run code to determine whether the mean Happiness Score is higher or lower for countries with female leaders. Hint: the variable ‘FemaleGovBoss’ indicates whether a country is run by a female leader.

**Response:**

TASK 3.2 How many countries are governed by female leaders? You can tabulate observations of a categorical variable using the ‘tally’ command. Modify the code below to answer the question.

**Response:**

22 countries are governed by female leaders. Categorical variables can be tabulated using "tally."
FemaleGovBoss
 No Yes 
126  22 

TASK 3.3 Create an appropriate visual summary to investigate research question B.

TASK 3.4 Based on what you learned in tasks 3.1 and 3.3, what do you think the answer to research question B would be?

**Response:**

Countries run by female leaders are happier.

Research Question C: Are Island nations more likely to have female leaders?

TASK 4.1 Use R to tally up how many countries are islands with female leaders, islands with male leaders, not islands with female leaders, and not islands with male leaders. Modify the code below to extract the numbers.

             Island
FemaleGovBoss  No Yes
          No  109  17
          Yes  19   3

For a research question like this one that involves two binary variables, it is useful to calculate and compare two proportions:

TASK 4.2 What proportion of Island Nations have female leaders? You should calculate (# island nations with female leaders)/(# island nations)

**Response:**

3/20 = 15%
[1] 0.15

TASK 4.3 What proportion of non-Island Nations have female leaders? You should calculate (# non-island nations with female leaders)/(# non-island nations)

**Response:**

19/128 ~= 14.8%
[1] 0.1484375

To visualize the data as it relates to this question, you should make side-by-side or stacked barcharts.

TASK 4.4 Modify the code below to create a side by side barchart. Notice the syntax is a little different; you should try different options for X and Y until you have Islands on the bottom and different colors for FemaleGovBoss.

TASK 4.5 Based on your answers to Tasks 4.1–4.4, what do you think the answer to research question C would be?

**Response:**

Islands are not significantly more likely to have female leaders.

Epilogue

Below are some summaries you can reference at any time.

Numerical summaries

The primary numerical summaries used in this course include the mean, the five-number summary, and the standard deviation. The functions below calculate these summaries for a single variable, or for that variable based on the value of a second categorical variable.

  • mean(~ Y, data = DataSetName) # mean of Y without considering other variables
  • mean(Y ~ X, data = DataSetName) # mean of Y for different values of X
  • sd(~ Y, data = DataSetName) # standard deviation of Y
  • sd(Y ~ X, data = DataSetName) # standard deviation of Y for different values of X
  • favstats(~ Y, data = DataSetName) # summary statistics of Y
  • favstats(Y ~ X, data = DataSetName) # summary statistics of Y for different values of X

Visual summaries

The primary graphical representations used in this course will be bar charts, boxplots, histograms and scatterplots. Each plot helps to visualize different aspects of a dataset.

The functions that will be used to generate these plots are:

  • gf_boxplot(Y ~ X, data = DataSetName) # boxplot of Y for each value of X, if specified
  • gf_histogram(~Y, data = DataSetName) # histogram of Y
  • gf_point(Y ~ X, data = DataSetName) # scatterplot of Y versus X
  • gf_bar( ~Y, data = DataSetName) # Bar chart for a single variable
  • gf_bar( ~ X, fill = ~ Y, data = DataSetName, position = position_dodge()) # side-by-side bar chart
  • gf_bar( ~ X, fill = ~ Y, data = DataSetName) # stacked bar chart
---
title: 'Lab 1-1: Using R Markdown to explore data'
output: html_notebook
---


### Objectives:

For this lab you should...

  - Read datasets into R
  - Familiarize yourself with commonly used functions and model-centric syntax in R
  - Provide summary statistics of a dataset
  - Visualize the dataset using histograms and boxplots
 

# Part 1: Getting ready to use data

The following packages will need to be loaded into R before you begin the lab. To run a chunk of R code, click the green arrow symbol in the upper right of the gray chunk. If you would like to run line by line, hit Ctrl-Enter with your cursor next to the desired line.

```{r, echo = F, message = F}
##  NOTE - in all future documents/assignments this code will be included for you and you are expected to run it without prompting.


# Clear Workspace
rm(list = ls()) 
knitr::opts_chunk$set(echo=FALSE)

# load packages we typically use for this class.
library(mosaic, warn.conflicts = FALSE) 
library(ggformula, warn.conflicts = FALSE)


```

### The Happiness Data

The happiness data set is from the 2021 World Happiness Report <https://Worldhappiness.report/ed/2021>.  Each observation in this dataset represents a country in the world. There are 14 variables observed about citizens' opinions on their countries, characteristics of their countries, and Covid-19 related summaries.  A multitude of interesting questions can be posed (and answered!) using these data.

**TASK 1.0** Download the dataset WorldHappiness.csv from the Lab 1.1 assignment page. Once you've downloaded it, use the following code to read it into R:

```{r,echo = F}

Hap <- read.csv(file.choose()) #select your file in its saved location when the prompt opens

```

**TASK 1.1** To ensure that you have successfully loaded the dataset, you can check using the function head(DataSetName), which displays the first six rows of the data, and glimpse(DataSetName), which provides a summary of the data frame.

```{r,echo = F}
# modify the code below

head(Hap) # displays the first 6 rows of the dataset, including headers 
glimpse(Hap) # provides a summary of the data frame

```


**TASK 1.2** Use your results from Task 1.1 to determine which columns are categorical data, and which are numeric:

    **Response**

    Categorical: 1, 2, 7, 15
    
    Numeric: 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14
 

### Model-centric approach to R

For the purposes of our class, it's useful to learn a model-centric approach to R.  The pseudo-code below is going to be our foundation for the rest of the class: 

`function( Y ~ X, data = DataSetName )`

Here's a short description of each part in the pseudo-code above:

- `function` is an R function that specifies what you want to do with your data
    - for example, `mean` calculates the mean
- `Y` is the outcome of interest (response variable)  
- `X` is some explanatory variable or you can use "1" as a placeholder if there is no explanatory variable  
- `DataSetName` is the name of a data set loaded into the R environment

A summary of the most commonly used functions for this class is provided at the bottom of this document.  In the work below you will be introduced to the functions and their uses one at a time in the context of answering research questions.
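
As a rough illustration of the `Y ~ X` pattern, here is a minimal sketch in base R. The data frame `toy` and its columns `score` and `group` are invented for this example and are not part of the lab's dataset. Base R's `aggregate()` accepts the same kind of formula, so it can show what a call such as `mean(Y ~ X, data = DataSetName)` is computing.

```r
# Toy data, invented for illustration only (not the lab's dataset)
toy <- data.frame(
  score = c(4, 6, 5, 7, 9, 8),
  group = c("A", "A", "A", "B", "B", "B")
)

# Y ~ X: mean of score within each level of group
# (mosaic-style equivalent: mean(score ~ group, data = toy))
group_means <- aggregate(score ~ group, data = toy, FUN = mean)
print(group_means)

# No explanatory variable: a single overall mean
# (mosaic-style equivalent: mean(~ score, data = toy))
overall_mean <- mean(toy$score)
print(overall_mean)
```

The formula keeps the response on the left of `~` and the grouping variable on the right, which is the same reading used for every function in this lab.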

## Research Question A: Are different regions of the world happier?

To use the data to answer the research question, we could find data summaries, such as average happiness, for each of the regions.  In this situation, HappinessScore is our response variable and Region is our explanatory variable.

**TASK 2.1** Modify and run the code below to calculate the mean HappinessScore for each region.

```{r,echo = F}

mean(HappinessScore ~ Region, data = Hap)

```

**TASK 2.2**  Use the function 'favstats()', and model-centric syntax, to determine which region has the country with the highest happiness score.

    **Response:**
    
    North America and ANZ has the highest mean and median happiness score
  
```{r, echo = F}

favstats(HappinessScore ~ Region, data = Hap)

```

**TASK 2.3** Which summary statistics are calculated by the favstats() function?

    **Response:**
    
    The minimum value, first quartile, median (second quartile), third quartile, maximum value, mean, standard deviation, number of observations, and number of missing values.
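
To make those nine quantities concrete, the sketch below computes each one by hand in base R. The vector `x` is invented for illustration, and the quartiles use R's default `quantile()` method, which is assumed (not confirmed here) to match what `favstats()` reports.

```r
# Toy vector, invented for illustration; NA marks a missing value
x <- c(2, 4, 4, 5, 7, 9, NA)

# Quartiles of the non-missing values (R's default quantile method)
q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)

stats <- data.frame(
  min     = min(x, na.rm = TRUE),
  Q1      = q[1],
  median  = q[2],
  Q3      = q[3],
  max     = max(x, na.rm = TRUE),
  mean    = mean(x, na.rm = TRUE),
  sd      = sd(x, na.rm = TRUE),
  n       = sum(!is.na(x)),   # number of non-missing observations
  missing = sum(is.na(x))     # number of missing values
)
print(stats)
```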

Now it's time to visualize the distribution of Happiness Score for the different regions.  

**TASK 2.4** Modify the code below to create a histogram of Happiness Score, without accounting for other variables.  What is the shape of the distribution: left-skewed, right-skewed, or roughly symmetric?

    **Response:**
    
    Roughly symmetric
    
```{r,echo = F}

gf_histogram(~ HappinessScore, data = Hap)

```

**TASK 2.5** Generate side-by-side boxplots of the HappinessScore, separated by Region. Write a sentence about what you observe in the figure.

    **Response:**
    
    A total of 6 outliers can be observed in the side-by-side boxplots.

```{r,echo = F}
#gf_boxplot(HappinessScore ~ Region, data = Hap)

# if the axes labels aren't readable, try modifying and running the code below!
gf_boxplot(HappinessScore ~ Region, data = Hap) + coord_flip()

```
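
Points drawn individually on a boxplot are usually those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule by hand in base R to a made-up vector `scores` (not the lab's data). The plotting function may compute its hinges slightly differently, so treat this as an approximation of what the figure flags.

```r
# Toy scores, invented for illustration only
scores <- c(4.8, 5.0, 5.1, 5.3, 5.4, 5.6, 5.7, 7.9)

q1  <- quantile(scores, 0.25, names = FALSE)
q3  <- quantile(scores, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: values outside these limits are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- scores[scores < lower_fence | scores > upper_fence]
print(outliers)
```

To count outliers per region, the same calculation would be repeated separately on each region's scores.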

**TASK 2.6** Based on everything you've learned, do you think that happiness scores vary by region?

    **Response:**
    
    The data suggests that there is a relationship between happiness scores and region.
    
    
## Research Question B: Are countries run by female leaders happier?

**TASK 3.1** Write and run code to determine whether the mean Happiness Score is higher or lower for countries with female leaders.  Hint:  the variable 'FemaleGovBoss' indicates whether a country is run by a female leader.

    **Response:**

```{r,echo = F}

favstats(HappinessScore ~ FemaleGovBoss, data = Hap)

```
**TASK 3.2** How many countries are governed by female leaders? You can tabulate observations of a categorical variable using the 'tally' command.  Modify the code below to answer the question.

    **Response:**
    
    22 countries are governed by female leaders. Categorical variables can be tabulated using "tally."
    
```{r,echo = F}
tally( ~ FemaleGovBoss, data = Hap)
```


**TASK 3.3** Create an appropriate visual summary to investigate research question B.

```{r,echo = F}

gf_boxplot(HappinessScore ~ FemaleGovBoss, data = Hap) + coord_flip()

```

**TASK 3.4** Based on what you learned in tasks 3.1 and 3.3, what do you think the answer to research question B would be?

    **Response:**
    
    Countries run by female leaders are happier.
    

## Research Question C: Are Island nations more likely to have female leaders?


**TASK 4.1** Use R to tally up how many countries are islands with female leaders, islands with male leaders, not islands with female leaders, and not islands with male leaders.  Modify the code below to extract the numbers.

```{r, echo = F}
tally(FemaleGovBoss ~ Island, data = Hap)
```
For a research question like this one that involves two binary variables, it is useful to calculate and compare two proportions:

**TASK 4.2**  What proportion of Island Nations have female leaders?  You should calculate (# island nations with female leaders)/(# island nations)

    **Response:**
    
    3/20 = 15%
    
```{r}
# you can use this R chunk as a calculator if needed
# 3 island nations with female leaders out of 17 + 3 = 20 island nations

3/20

```

  
**TASK 4.3**  What proportion of non-Island Nations have female leaders? You should calculate (# non-island nations with female leaders)/(# non-island nations)

    **Response:**
    
    19/128 ~= 14.8%
    
```{r}
# you can use this R chunk as a calculator if needed
# 19 non-island nations with female leaders out of 109 + 19 = 128 non-island nations

19/128


```
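
The two proportions can also be read off in one step by dividing each column of the two-way table by its column total. The sketch below is a base R illustration that re-enters the four counts from the Task 4.1 tally output by hand (109, 17, 19, 3); the object names `counts` and `col_props` are chosen for this example.

```r
# Counts from the Task 4.1 tally output, entered column by column
# (rows: FemaleGovBoss, columns: Island)
counts <- matrix(
  c(109, 19,   # Island = "No":  109 male-led, 19 female-led
    17,  3),   # Island = "Yes":  17 male-led,  3 female-led
  nrow = 2,
  dimnames = list(FemaleGovBoss = c("No", "Yes"), Island = c("No", "Yes"))
)

# margin = 2 divides each column by its total, giving the
# proportion of female-led countries within each Island group
col_props <- prop.table(counts, margin = 2)
print(round(col_props, 3))
```

The "Yes" row of the result holds the two proportions to compare: female-led countries among non-island nations and among island nations.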

To visualize the data as it relates to this question, you should make side-by-side or stacked barcharts.
  
**TASK 4.4**  Modify the code below to create a side by side barchart.  Notice the syntax is a little different; you should try different options for X and Y until you have Islands on the bottom and different colors for FemaleGovBoss.

```{r, echo = F}
gf_bar(~Island, fill =  ~ FemaleGovBoss, data = Hap, position = position_dodge())
```

**TASK 4.5** Based on your answers to Tasks 4.1--4.4, what do you think the answer to research question C would be?

    **Response:**
    
    Islands are not significantly more likely to have female leaders.


## Research Question D: Which of the quantitative variables are related to Happiness Score?

To answer this question, we consider plots that visualize the relationship between two quantitative variables. A scatterplot is simply an (x,y) plot in the Cartesian plane. Using our Y ~ X format, the Y variable will be along the Y axis. 

**TASK 5.1** Modify the code below to generate a scatterplot of the HappinessScore versus the Institutional Trust.


```{r,echo = F}
gf_point(HappinessScore ~ InstitutionalTrust, data = Hap)

```
**TASK 5.2** Write code and produce scatterplots of the HappinessScore versus: SocialSupport, HealthyLifeExpectancy, and Freedom. 

```{r,echo = F}

gf_point(HappinessScore ~ SocialSupport, data = Hap) 

gf_point(HappinessScore ~ HealthyLifeExpectancy, data = Hap)

gf_point(HappinessScore ~ Freedom, data = Hap)

``` 

**TASK 5.3**  Comment on which of the four variables that have been plotted suggest a strong linear relationship with the Happiness Score. 

    **Response** 
    
    Social support, life expectancy, and freedom have clear, linear relationships with happiness score.
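
A visual impression of linearity can be backed up with a correlation coefficient, which is close to 1 or -1 when points fall near a straight line. The sketch below uses base R's `cor()` on a small data frame whose values are invented for illustration; the column names `support`, `happy`, and `noise` are hypothetical and not taken from the lab's dataset.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  support = c(0.5, 0.6, 0.7, 0.8, 0.9),
  happy   = c(4.1, 4.9, 5.6, 6.2, 7.0),
  noise   = c(3, 1, 4, 1, 5)
)

# Nearly straight-line relationship: correlation close to 1
r_strong <- cor(toy$happy, toy$support)

# Scattered relationship: correlation much closer to 0
r_weak <- cor(toy$happy, toy$noise)

print(c(strong = r_strong, weak = r_weak))
```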



## Research Question E: Which variables are related to Covid Deaths 2020?

A strength of R is its ability to quickly and easily produce sophisticated data visualizations.  You will use  curiosity and tools in R to create visualizations that explore this research question.

Use the code below to remind yourself which variables are available in this dataset.  Choose a quantitative variable that you think will be related to CovidDeaths2020.

**TASK 6.1** Make a scatterplot to examine the relationship between Covid Deaths in 2020 and the variable you chose. 
```{r,echo = F}


gf_point(CovidDeaths2020 ~ MedianAge, data = Hap)


``` 

**TASK 6.2** Now, color the points in your plot based on the values of a third variable such as FemaleGovBoss, Island, or Region. You could also choose a quantitative variable, if you desired.

```{r,echo = F}

gf_point(CovidDeaths2020 ~ MedianAge, col = ~ LogGDP, data = Hap)


``` 
**TASK 6.3** Now we can add one more layer, based on a categorical variable.  Choose one of the binary variables FemaleGovBoss or Island that you haven't already included and modify the code below:

```{r, echo = F}
gf_point(CovidDeaths2020 ~ MedianAge | Island, col = ~LogGDP, data = Hap)
```


**TASK 6.4**  Make two interesting comments about what you learned about the variables you chose and covid deaths in 2020.

    **Response:**
    
    Countries with higher median ages tended to have more Covid deaths in 2020, and countries with lower GDP tended to have fewer; however, the latter is not a proportional representation of Covid deaths.
    

# Epilogue

Below are some summaries you can reference at any time.

#### Numerical summaries

The primary numerical summaries used in this course include the mean, the five-number summary, and the standard deviation.  The functions below calculate these summaries for a single variable, or for that variable based on the value of a second categorical variable.

  - mean(~ Y, data = DataSetName)  # mean of Y without considering other variables
  - mean(Y ~ X, data = DataSetName)  # mean of Y for different values of X
  - sd(~ Y, data = DataSetName)  # standard deviation of Y 
  - sd(Y ~ X, data = DataSetName)  # standard deviation of Y for different values of X
  - favstats(~ Y, data = DataSetName)  # summary statistics of Y 
  - favstats(Y ~ X, data = DataSetName)  # summary statistics of Y for different values of X

#### Visual summaries

The primary graphical representations used in this course will be bar charts, boxplots, histograms and scatterplots. Each plot helps to visualize different aspects of a dataset. 

The functions that will be used to generate these plots are:

  - gf_boxplot(Y ~ X, data = DataSetName) # boxplot of Y for each value of X, if specified
  - gf_histogram(~Y, data = DataSetName) # histogram of Y
  - gf_point(Y ~ X, data = DataSetName) # scatterplot of Y versus X
  - gf_bar( ~Y, data = DataSetName) # Bar chart for a single variable
  - gf_bar( ~ X, fill = ~ Y, data = DataSetName, position = position_dodge()) # side-by-side bar chart
  - gf_bar( ~ X, fill = ~ Y, data = DataSetName) # stacked bar chart

