Save this .Rmd file with the title
Lastname_Firstname_Tidyverse-Lab.Rmd into your RStudio
folder.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
What economic, social, or political factor has the greatest impact on a country’s Human Development Index score?
The Human Development Index Factors Data Set includes four explanatory variables whose relationship to a country’s Human Development Index score can be examined: dominant religion, whether the country is a democracy, exports as a percentage of GDP, and the percentage of land used for agriculture.
Name and load your data set below.
hdi_data = read_csv("~/AY26-1 MA206 Statistics/RStudio/hdi_data.csv")
## Rows: 158 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, Dominant Religion, Democracy / Not a Democracy
## dbl (3): Human Development Index Score: 2017, Exports % of GDP as of 2017, A...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hdi_data)
## # A tibble: 6 × 6
## Country Human Development In…¹ Exports % of GDP as …² Agricultural Land Pe…³
## <chr> <dbl> <dbl> <dbl>
## 1 Angola 0.597 29 36.7
## 2 Albania 0.802 31.6 42.9
## 3 UAE 0.897 97.9 5.4
## 4 Argentina 0.851 11.3 43
## 5 Armenia 0.768 38.2 58.9
## 6 Australia 0.937 21.3 48.3
## # ℹ abbreviated names: ¹​`Human Development Index Score: 2017`,
## # ²​`Exports % of GDP as of 2017`, ³​`Agricultural Land Percent of Area 2017`
## # ℹ 2 more variables: `Dominant Religion` <chr>,
## # `Democracy / Not a Democracy` <chr>
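The column names printed above contain spaces, `%`, `:`, and `/`, so every later chunk has to wrap them in backticks. The janitor package is already loaded, and its `clean_names()` function is one optional way to get snake_case names. The sketch below uses a hypothetical two-row stand-in (`toy`) built from values in the `head()` output, not the real `hdi_data` file.

```r
library(tibble)
library(janitor)

# Hypothetical stand-in for hdi_data: two rows copied from the head() output
toy <- tibble(
  Country = c("Angola", "Albania"),
  `Human Development Index Score: 2017` = c(0.597, 0.802),
  `Exports % of GDP as of 2017` = c(29, 31.6),
  `Agricultural Land Percent of Area 2017` = c(36.7, 42.9)
)

# clean_names() lower-cases the names and replaces spaces and symbols,
# so the columns can be typed without backticks afterwards
toy_clean <- clean_names(toy)
names(toy_clean)
```

If this were applied to the real data, the later chunks in this lab would need the new names instead of the backticked originals.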
The output of the above code, in conjunction with any provided data dictionary, should enable you to complete the table below. Remove the information from the wage data set and use your own.
| Variable | Column Name | Units | Variable Type |
|---|---|---|---|
| Country | `Country` | N/A | Categorical |
| HDI | `Human Development Index Score: 2017` | Index (0 to 1, unitless) | Quantitative |
| Exports % of GDP | `Exports % of GDP as of 2017` | % of GDP | Quantitative |
| Ag. Land % | `Agricultural Land Percent of Area 2017` | % of land area | Quantitative |
| Dom. Religion | `Dominant Religion` | N/A | Categorical |
| Democracy | `Democracy / Not a Democracy` | N/A | Categorical |
Use the below space to practice calling, selecting, filtering, summarizing, grouping by, and mutating variables.
hdi_data$Country
## [1] "Angola" "Albania"
## [3] "UAE" "Argentina"
## [5] "Armenia" "Australia"
## [7] "Austria" "Azerbaijan"
## [9] "Burundi" "Belgium"
## [11] "Benin" "Burkina Faso"
## [13] "Bangladesh" "Bulgaria"
## [15] "Bahrain" "Bahamas"
## [17] "Bosnia and Herzegovina" "Belarus"
## [19] "Belize" "Bolivia"
## [21] "Brazil" "Brunei"
## [23] "Bhutan" "Botswana"
## [25] "Central African Republic" "Canada"
## [27] "Switzerland" "Chile"
## [29] "China" "Cameroon"
## [31] "Colombia" "Comoros"
## [33] "Cape Verde" "Costa Rica"
## [35] "Cuba" "Cyprus"
## [37] "Czech Republic" "Germany"
## [39] "Djibouti" "Denmark"
## [41] "Dominican Republic" "Algeria"
## [43] "Ecuador" "Egypt"
## [45] "Spain" "Estonia"
## [47] "Ethiopia" "Finland"
## [49] "Fiji" "France"
## [51] "Gabon" "UK"
## [53] "Georgia" "Ghana"
## [55] "Guinea" "Gambia"
## [57] "Guinea-Bissau" "Equatorial Guinea"
## [59] "Greece" "Guatemala"
## [61] "Hong Kong, China" "Honduras"
## [63] "Croatia" "Haiti"
## [65] "Hungary" "Indonesia"
## [67] "India" "Ireland"
## [69] "Iran" "Iraq"
## [71] "Iceland" "Israel"
## [73] "Italy" "Jamaica"
## [75] "Jordan" "Japan"
## [77] "Kazakhstan" "Kenya"
## [79] "Kyrgyz Republic" "Cambodia"
## [81] "Kiribati" "South Korea"
## [83] "Kuwait" "Lebanon"
## [85] "Libya" "Sri Lanka"
## [87] "Lesotho" "Lithuania"
## [89] "Luxembourg" "Latvia"
## [91] "Morocco" "Madagascar"
## [93] "Maldives" "Mexico"
## [95] "Mali" "Malta"
## [97] "Montenegro" "Mongolia"
## [99] "Mozambique" "Mauritania"
## [101] "Mauritius" "Malaysia"
## [103] "Namibia" "Niger"
## [105] "Nigeria" "Nicaragua"
## [107] "Netherlands" "Norway"
## [109] "Nepal" "New Zealand"
## [111] "Oman" "Pakistan"
## [113] "Panama" "Peru"
## [115] "Philippines" "Papua New Guinea"
## [117] "Poland" "Portugal"
## [119] "Paraguay" "Palestine"
## [121] "Qatar" "Romania"
## [123] "Russia" "Rwanda"
## [125] "Saudi Arabia" "Sudan"
## [127] "Senegal" "Singapore"
## [129] "Solomon Islands" "Sierra Leone"
## [131] "El Salvador" "Serbia"
## [133] "Sweden" "Seychelles"
## [135] "Syria" "Chad"
## [137] "Togo" "Thailand"
## [139] "Tajikistan" "Turkmenistan"
## [141] "Timor-Leste" "Tonga"
## [143] "Tunisia" "Turkey"
## [145] "Tanzania" "Uganda"
## [147] "Ukraine" "Uruguay"
## [149] "USA" "Uzbekistan"
## [151] "Vietnam" "Vanuatu"
## [153] "Samoa" "Yemen"
## [155] "South Africa" "Zambia"
## [157] "Zimbabwe" NA
hdi_data %>%
select("Exports % of GDP as of 2017")
## # A tibble: 158 × 1
## `Exports % of GDP as of 2017`
## <dbl>
## 1 29
## 2 31.6
## 3 97.9
## 4 11.3
## 5 38.2
## 6 21.3
## 7 54.3
## 8 48.5
## 9 6.03
## 10 83.4
## # ℹ 148 more rows
hdi_data %>%
filter(`Exports % of GDP as of 2017` > 30.0)
## # A tibble: 89 × 6
## Country Human Development In…¹ Exports % of GDP as …² Agricultural Land Pe…³
## <chr> <dbl> <dbl> <dbl>
## 1 Albania 0.802 31.6 42.9
## 2 UAE 0.897 97.9 5.4
## 3 Armenia 0.768 38.2 58.9
## 4 Austria 0.916 54.3 32.2
## 5 Azerbai… 0.753 48.5 57.8
## 6 Belgium 0.931 83.4 43.7
## 7 Bulgaria 0.808 67 46.3
## 8 Bahrain 0.869 71.9 10.4
## 9 Bahamas 0.825 32.3 1.3
## 10 Bosnia … 0.772 40.3 43.5
## # ℹ 79 more rows
## # ℹ abbreviated names: ¹​`Human Development Index Score: 2017`,
## # ²​`Exports % of GDP as of 2017`, ³​`Agricultural Land Percent of Area 2017`
## # ℹ 2 more variables: `Dominant Religion` <chr>,
## # `Democracy / Not a Democracy` <chr>
#(ChatGPT, 2025)
hdi_data %>%
summarize(Mean_Exports = mean(`Exports % of GDP as of 2017`, na.rm= TRUE))
## # A tibble: 1 × 1
## Mean_Exports
## <dbl>
## 1 40.4
#(ChatGPT, 2025)
hdi_data %>%
group_by(`Dominant Religion`,`Democracy / Not a Democracy`) %>%
arrange(`Dominant Religion`)
## # A tibble: 158 × 6
## # Groups: Dominant Religion, Democracy / Not a Democracy [11]
## Country Human Development In…¹ Exports % of GDP as …² Agricultural Land Pe…³
## <chr> <dbl> <dbl> <dbl>
## 1 Bhutan 0.647 28.4 13.4
## 2 Singapo… 0.935 172 0.927
## 3 Thailand 0.79 66.7 70.2
## 4 Angola 0.597 29 36.7
## 5 Argenti… 0.851 11.3 43
## 6 Armenia 0.768 38.2 58.9
## 7 Austral… 0.937 21.3 48.3
## 8 Austria 0.916 54.3 32.2
## 9 Burundi 0.428 6.03 77.7
## 10 Belgium 0.931 83.4 43.7
## # ℹ 148 more rows
## # ℹ abbreviated names: ¹​`Human Development Index Score: 2017`,
## # ²​`Exports % of GDP as of 2017`, ³​`Agricultural Land Percent of Area 2017`
## # ℹ 2 more variables: `Dominant Religion` <chr>,
## # `Democracy / Not a Democracy` <chr>
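On its own, `group_by()` only tags the rows with their groups, which is why the output above still has 158 rows. Grouping usually pays off when it is followed by `summarize()`. A minimal sketch, using a hypothetical five-row tibble (`toy`) with made-up values rather than the real data:

```r
library(dplyr)
library(tibble)

# Hypothetical data: made-up values, real column names
toy <- tibble(
  `Dominant Religion` = c("A", "A", "B", "B", "B"),
  `Human Development Index Score: 2017` = c(0.40, 0.80, 0.50, 0.70, 0.90)
)

# One row per group, with the group mean and group size
by_religion <- toy %>%
  group_by(`Dominant Religion`) %>%
  summarize(
    mean_hdi = mean(`Human Development Index Score: 2017`, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

by_religion
```

The result has one row for each religion label instead of one row per country.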
# mutate a variable, using existing variables to create and name a new one
hdi_data %>%
mutate(Exports_to_Ag_Land_Ratio = `Exports % of GDP as of 2017` / `Agricultural Land Percent of Area 2017`) %>%
select(`Exports % of GDP as of 2017`, `Agricultural Land Percent of Area 2017`, `Exports_to_Ag_Land_Ratio`)
## # A tibble: 158 × 3
## `Exports % of GDP as of 2017` Agricultural Land Perc…¹ Exports_to_Ag_Land_R…²
## <dbl> <dbl> <dbl>
## 1 29 36.7 0.790
## 2 31.6 42.9 0.737
## 3 97.9 5.4 18.1
## 4 11.3 43 0.263
## 5 38.2 58.9 0.649
## 6 21.3 48.3 0.441
## 7 54.3 32.2 1.69
## 8 48.5 57.8 0.839
## 9 6.03 77.7 0.0776
## 10 83.4 43.7 1.91
## # ℹ 148 more rows
## # ℹ abbreviated names: ¹​`Agricultural Land Percent of Area 2017`,
## # ²​Exports_to_Ag_Land_Ratio
Human Development Index (HDI). This index measures a nation’s level of development based on life expectancy at birth, education, and standard of living. It is a unitless index value ranging from 0 to 1.
# using quantitative variable techniques, create a summary table here
hdi_data %>%
filter(!is.na(`Human Development Index Score: 2017`)) %>%
#(ChatGPT)
summarize(mean = mean(`Human Development Index Score: 2017`),
s = sd(`Human Development Index Score: 2017`),
n = n())
## # A tibble: 1 × 3
## mean s n
## <dbl> <dbl> <int>
## 1 0.727 0.153 157
# using ggplot, create a visualization that shows the behavior of this variable here
hdi_data %>%
ggplot(aes(x = `Human Development Index Score: 2017`))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
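The two messages above come from the default of 30 bins and from the one missing HDI value. One way to address both is to drop the missing row first and choose a bin width by hand. The sketch below uses a hypothetical data frame (`toy`) with made-up scores; the width of 0.05 is only an assumption that suits a 0 to 1 index.

```r
library(ggplot2)

# Hypothetical HDI-like scores, including one missing value
toy <- data.frame(hdi = c(0.43, 0.55, 0.60, 0.72, 0.77, 0.80, 0.85, 0.92, 0.94, NA))

# Drop the NA before plotting, then set the bin width explicitly
p <- ggplot(subset(toy, !is.na(hdi)), aes(x = hdi)) +
  geom_histogram(binwidth = 0.05, boundary = 0, color = "white") +
  labs(x = "Human Development Index Score", y = "Count")

p
```

Setting `boundary = 0` lines the bin edges up with multiples of 0.05, which makes the bars easier to read against the axis.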
# using ggplot, create a second visualization that shows a different aspect of this variable
hdi_data %>%
filter(!is.na(`Human Development Index Score: 2017`),
!is.na(`Democracy / Not a Democracy`)) %>%
#(ChatGPT, 2025)
ggplot(aes(x = `Human Development Index Score: 2017`, fill = `Democracy / Not a Democracy`)) +
geom_histogram() +
facet_grid(`Democracy / Not a Democracy` ~ .) +
labs(x = "Human Development Index Score: 2017", y = "Count", title = "Histogram of Human Development Index Score: 2017 by Regime Type")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In 1-2 sentences, describe this variable’s data. Which visualization is better, and why? Are there any questions that you have after exploring? Add code chunks below if you’d like to do some more exploration.
The Human Development Index data is higher on average for countries that have a democratic regime type compared to those that don’t. I am curious to see which of the quantitative explanatory variables this trend might align with as well.
Type the name, description, and units of your quantitative variable here.
This variable is Exports % of GDP as of 2017, which measures the percentage of a country’s GDP that comes from exports.
# using quantitative variable techniques, create a summary table here. Replace
# the 'dataframe' and 'variable' with your dataframe name and variable name
hdi_data %>%
filter(!is.na(`Exports % of GDP as of 2017`)) %>%
#(ChatGPT, 2025)
summarize(mean = mean(`Exports % of GDP as of 2017`),
s = sd(`Exports % of GDP as of 2017`),
n = n())
## # A tibble: 1 × 3
## mean s n
## <dbl> <dbl> <int>
## 1 40.4 31.2 155
# using ggplot, create a visualization that shows the behavior of this variable here
hdi_data %>%
ggplot(aes(x = `Exports % of GDP as of 2017`))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
# using ggplot, create a second visualization that shows a different aspect of this variable
hdi_data %>%
filter(!is.na(`Exports % of GDP as of 2017`),
!is.na(`Democracy / Not a Democracy`)) %>%
#(ChatGPT, 2025)
ggplot(aes(x = `Exports % of GDP as of 2017`, fill = `Democracy / Not a Democracy`)) +
geom_histogram() +
facet_grid(`Democracy / Not a Democracy` ~ .) +
labs(x = "Exports % of GDP as of 2017", y = "Count", title = "Histogram of Exports % of GDP as of 2017 by Regime Type")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In 1-2 sentences, describe this variable’s data. Which visualization is better, and why? Are there any questions that you have after exploring? Add code chunks below if you’d like to do some more exploration.
Exports in democratic countries tend to represent a greater portion of overall GDP than in non-democratic countries. The second visualization is better because it allows us to compare democratic and non-democratic values directly, which supports more meaningful conclusions. I am curious whether this trend is caused by less freedom of trade in non-democratic countries.
Type the name, description, and units of your quantitative variable here.
# using quantitative variable techniques, create a summary table here. Replace
# the 'dataframe' and 'variable' with your dataframe name and variable name
hdi_data %>%
summarize(mean = mean(`Agricultural Land Percent of Area 2017`),
s = sd(`Agricultural Land Percent of Area 2017`),
n = n())
## # A tibble: 1 × 3
## mean s n
## <dbl> <dbl> <int>
## 1 39.7 21.9 158
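The chunk above returns numbers because this column appears to have no missing values (n = 158). If a column did contain an `NA`, `mean()` and `sd()` would return `NA`, and `n()` would still count the missing row. A hedged sketch of an NA-safe version, on a hypothetical four-row tibble (`toy`):

```r
library(dplyr)
library(tibble)

# Hypothetical data: three values from the head() output plus one NA
toy <- tibble(`Agricultural Land Percent of Area 2017` = c(36.7, 42.9, 5.4, NA))

out <- toy %>%
  summarize(
    mean = mean(`Agricultural Land Percent of Area 2017`, na.rm = TRUE),
    s    = sd(`Agricultural Land Percent of Area 2017`, na.rm = TRUE),
    # count only the non-missing observations
    n    = sum(!is.na(`Agricultural Land Percent of Area 2017`))
  )

out
```

Here `n` is 3 rather than 4, so it matches the number of values that went into the mean and standard deviation.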
# using ggplot, create a visualization that shows the behavior of this variable here
hdi_data %>%
ggplot(aes(x = `Agricultural Land Percent of Area 2017`))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# using ggplot, create a second visualization that shows a different aspect of this variable
hdi_data %>%
filter(!is.na(`Agricultural Land Percent of Area 2017`),
!is.na(`Democracy / Not a Democracy`)) %>%
#(ChatGPT, 2025)
ggplot(aes(x = `Agricultural Land Percent of Area 2017`, fill = `Democracy / Not a Democracy`)) +
geom_histogram() +
facet_grid(`Democracy / Not a Democracy` ~ .) +
labs(x = "Agricultural Land Percent of Area 2017", y = "Count", title = "Agricultural Land Percent of Area 2017 by Regime Type")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In 1-2 sentences, describe this variable’s data. Which visualization is better, and why? Are there any questions that you have after exploring? Add code chunks below if you’d like to do some more exploration.
The variable’s data is better represented in the second visualization, as it separates the country data by democratic and non-democratic regime type. Agricultural Land Percent of Area appears roughly evenly distributed across non-democratic countries and concentrated around 50 percent for democratic countries.
Type the name, description, and units of your categorical variable here. Remember that this will require different code than your quantitative variables.
Dominant Religion records which religion has the most adherents in each nation. Because it is categorical, it has no units; its values are category labels.
# using categorical variable techniques, create a summary table here
hdi_data %>%
group_by(`Dominant Religion`) %>%
summarize(
Count = n(),
Proportion = n()/ nrow(hdi_data))
## # A tibble: 7 × 3
## `Dominant Religion` Count Proportion
## <chr> <int> <dbl>
## 1 Buddhist 3 0.0190
## 2 Christian 92 0.582
## 3 Hindu 2 0.0127
## 4 Jewish 1 0.00633
## 5 Muslim 49 0.310
## 6 Unaffiliated 10 0.0633
## 7 <NA> 1 0.00633
#(ChatGPT, 2025)
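The table above counts the one `<NA>` row as its own category and divides by all 158 rows. Whether that is the right denominator is a judgment call. As an alternative sketch, `count()` followed by `mutate()` computes proportions among the non-missing rows only. The tibble `toy` is hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data with one missing category label
toy <- tibble(`Dominant Religion` = c("Christian", "Christian", "Muslim", NA))

religion_table <- toy %>%
  filter(!is.na(`Dominant Religion`)) %>%
  count(`Dominant Religion`, name = "Count") %>%
  # proportions are relative to the non-missing rows, so they sum to 1
  mutate(Proportion = Count / sum(Count))

religion_table
```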
# using ggplot, create a visualization that shows the behavior of this variable here
hdi_data %>%
ggplot(aes(x = `Dominant Religion`, y = `Human Development Index Score: 2017`)) + geom_boxplot() +
labs(title = "Comparative Boxplot of Dominant Religion and Human Development Index Score", x = "Dominant Religion")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
# using ggplot, create a second visualization that shows a different aspect of this variable
hdi_data %>%
ggplot(aes(x = `Dominant Religion`, y = `Human Development Index Score: 2017`, color = `Democracy / Not a Democracy`)) + geom_boxplot() +
labs(title = "Comparative Boxplot of Dominant Religion and Human Development Index Score", x = "Dominant Religion")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
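Both boxplots above describe HDI within each religion rather than the categorical variable itself. A bar chart of counts is a common way to show how a categorical variable is distributed. The sketch below uses a hypothetical data frame (`toy`) with made-up labels and short column names chosen for illustration.

```r
library(ggplot2)

# Hypothetical category labels for six countries
toy <- data.frame(
  religion = c("Christian", "Christian", "Christian", "Muslim", "Muslim", "Buddhist"),
  regime   = c("Democracy", "Democracy", "Not a Democracy",
               "Not a Democracy", "Democracy", "Democracy")
)

# geom_bar() counts the rows in each category; dodge places regimes side by side
p <- ggplot(toy, aes(x = religion, fill = regime)) +
  geom_bar(position = "dodge") +
  labs(x = "Dominant Religion", y = "Count", fill = "Regime Type")

p
```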
In 1-2 sentences, describe this variable’s data. Which visualization is better, and why? Are there any questions that you have after exploring? Add code chunks below if you’d like to do some more exploration.
This variable shows the distribution of HDI values across countries with different dominant religions, which I also compared across regime type. The second visualization is better because it gives insight into which variables might have more of an effect on HDI.
Type the name, description, and units of your categorical variable here. Remember that this will require different code than your quantitative variables.
Regime Type, labeled as “Democracy / Not a Democracy”, shows which countries have a democratic government versus an authoritarian regime. It is a binary (dummy) variable, so it has only two possible values.
# using categorical variable techniques, create a summary table here
hdi_data %>%
group_by(`Democracy / Not a Democracy`) %>%
summarize(
Count = n(),
Proportion = n()/ nrow(hdi_data))
## # A tibble: 3 × 3
## `Democracy / Not a Democracy` Count Proportion
## <chr> <int> <dbl>
## 1 Democracy 74 0.468
## 2 Not a Democracy 83 0.525
## 3 <NA> 1 0.00633
#(ChatGPT, 2025)
# using ggplot, create a visualization that shows the behavior of this variable here
hdi_data %>%
group_by(`Exports % of GDP as of 2017`, `Democracy / Not a Democracy`) %>%
summarize(Median = median(`Human Development Index Score: 2017`)) %>%
ggplot(aes(x = `Exports % of GDP as of 2017`, y = Median, color = `Democracy / Not a Democracy`)) + geom_line() +
labs(x = "Exports % of GDP as of 2017", y = "Human Development Index Score: 2017", color = "Democracy / Not a Democracy", title = "Exports % of GDP as of 2017 vs. Human Development Index Score: 2017 by Regime Type") +
scale_x_continuous(limits=c(0,100), breaks=c(seq(from = 0, to = 100, by = 10))) +
scale_y_continuous(limits=c(0,1), breaks = c(0.25, 0.50, 0.75))
## `summarise()` has grouped output by 'Exports % of GDP as of 2017'. You can
## override using the `.groups` argument.
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_line()`).
# using ggplot, create a second visualization that shows a different aspect of this variable
hdi_data %>%
group_by(`Exports % of GDP as of 2017`, `Democracy / Not a Democracy`) %>%
summarize(Median = median(`Human Development Index Score: 2017`)) %>%
ggplot(aes(x = `Exports % of GDP as of 2017`, y = Median, color = `Democracy / Not a Democracy`)) + geom_point() +
labs(x = "Exports % of GDP as of 2017", y = "Human Development Index Score: 2017", color = "Democracy / Not a Democracy", title = "Exports % of GDP as of 2017 vs. Human Development Index Score: 2017 by Regime Type")
## `summarise()` has grouped output by 'Exports % of GDP as of 2017'. You can
## override using the `.groups` argument.
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
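Grouping by a continuous variable such as export share likely leaves most groups with a single country, in which case the median is just that country's own score. A scatter plot of the raw values shows the same relationship without the `summarize()` step. This is a sketch on a hypothetical tibble (`toy`) with made-up values and short illustrative column names.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical data: made-up values, including one missing HDI score
toy <- tibble(
  exports = c(29, 31.6, 97.9, 11.3, 38.2),
  hdi     = c(0.60, 0.80, 0.90, 0.85, NA),
  regime  = c("Not a Democracy", "Democracy", "Not a Democracy",
              "Democracy", "Democracy")
)

# Drop incomplete rows explicitly, then plot one point per country
p <- toy %>%
  filter(!is.na(exports), !is.na(hdi), !is.na(regime)) %>%
  ggplot(aes(x = exports, y = hdi, color = regime)) +
  geom_point() +
  labs(x = "Exports % of GDP", y = "HDI Score", color = "Regime Type")

p
```

Filtering first also avoids the "Removed rows" warning, since the missing values never reach ggplot.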
In 1-2 sentences, describe this variable’s data. Which visualization is better, and why? Are there any questions that you have after exploring? Add code chunks below if you’d like to do some more exploration.
Because this was a dummy variable, I included other variable data in order to gain a better visualization of how Regime Type data was distributed. The second visualization is better because it looks cleaner and shows the Regime Type of different countries across the different HDI values.
Using ggplot, create visualizations that show relationships between your variables below. Since you have five variables, you will need at minimum four plots so that each variable is visualized at least once. It is possible to display relationships between 3+ variables in one plot; at least one of your plots should demonstrate mastery of this skill. Create more code chunks as needed.
hdi_data %>%
group_by(`Exports % of GDP as of 2017`, `Democracy / Not a Democracy`) %>%
summarize(Median = median(`Human Development Index Score: 2017`)) %>%
ggplot(aes(x = `Exports % of GDP as of 2017`, y = Median, color = `Democracy / Not a Democracy`)) + geom_point() +
labs(x = "Exports % of GDP as of 2017", y = "Human Development Index Score: 2017", color = "Democracy / Not a Democracy", title = "Exports % of GDP as of 2017 vs. Human Development Index Score: 2017 by Regime Type")
## `summarise()` has grouped output by 'Exports % of GDP as of 2017'. You can
## override using the `.groups` argument.
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
hdi_data %>%
filter(!is.na(`Agricultural Land Percent of Area 2017`),
!is.na(`Democracy / Not a Democracy`)) %>%
#(ChatGPT, 2025)
ggplot(aes(x = `Agricultural Land Percent of Area 2017`, fill = `Democracy / Not a Democracy`)) +
geom_histogram() +
facet_grid(`Democracy / Not a Democracy` ~ .) +
labs(x = "Agricultural Land Percent of Area 2017", y = "Count", title = "Agricultural Land Percent of Area 2017 by Regime Type")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
hdi_data %>%
ggplot(aes(x = `Dominant Religion`, y = `Human Development Index Score: 2017`, color = `Democracy / Not a Democracy`)) + geom_boxplot() +
labs(title = "Comparative Boxplot of Dominant Religion and Human Development Index Score", x = "Dominant Religion")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
hdi_data %>%
filter(!is.na(`Human Development Index Score: 2017`),
!is.na(`Democracy / Not a Democracy`)) %>%
#(ChatGPT, 2025)
ggplot(aes(x = `Human Development Index Score: 2017`, fill = `Democracy / Not a Democracy`)) +
geom_histogram() +
facet_grid(`Democracy / Not a Democracy` ~ .) +
labs(x = "Human Development Index Score: 2017", y = "Count", title = "Histogram of Human Development Index Score: 2017 by Regime Type")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Test your skills by working through the code after the
ggplot section of the Tutorial. These examples will help
you gain a basic understanding of what is happening with specific
commands or data structures within R, which will be useful to you over
the course of the semester. Create more code chunks as needed.
Now that you’re done, you need to save this file (if the title is
red, it has unsaved changes). RStudio does NOT autosave while you work,
so CTRL+S early and often. Next, press the Knit button up
top with the yarn icon. This will create an HTML file, because that was
specified in the header. Save your HTML file with the name
Lastname_Firstname_Tidyverse-Lab.html. Then, open the HTML
file and print, using the
"Microsoft Print to PDF" option to save as Lastname_Firstname_Tidyverse-Lab.pdf.
This PDF file is what you will submit on Canvas for Milestone 2.