The Economic Question

How have GDP per capita and life expectancy evolved across different continents since 1952? Which continents have seen the fastest growth, and which countries are outliers?

Part 1: Setup and Data Loading (5 points)

# Load the tidyverse package
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tibble' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import the wide Gapminder dataset
gapminder_wide <- read_csv("gapminder_wide(1).csv") 
## Rows: 142 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): country, continent
## dbl (24): gdpPercap_1952, gdpPercap_1957, gdpPercap_1962, gdpPercap_1967, gd...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Task 1.1: Use glimpse() to examine the structure of gapminder_wide. In your own words, describe what you see. How many rows and columns are there? What do the column names tell you about the data format?

glimpse(gapminder_wide)
## Rows: 142
## Columns: 26
## $ country        <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
## $ continent      <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
## $ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
## $ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
## $ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
## $ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
## $ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
## $ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
## $ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
## $ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
## $ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
## $ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
## $ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
## $ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
## $ lifeExp_1952   <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
## $ lifeExp_1957   <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
## $ lifeExp_1962   <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
## $ lifeExp_1967   <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
## $ lifeExp_1972   <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
## $ lifeExp_1977   <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
## $ lifeExp_1982   <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
## $ lifeExp_1987   <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
## $ lifeExp_1992   <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
## $ lifeExp_1997   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
## $ lifeExp_2002   <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
## $ lifeExp_2007   <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…

Your answer:

I can see that the dataset contains 142 rows and 26 columns. When I run this function, a table appears showing the columns and their values, but the structure seems somewhat messy. The first columns are country and continent, followed by variables such as gdpPercap_1952 through gdpPercap_2007, and lifeExp_1952 through lifeExp_2007. This makes the dataset appear very wide because the information is spread across many years in separate columns. The country and continent columns are labeled as <chr>, meaning they contain character data, while the remaining columns are labeled as <dbl>, indicating that they store numeric values.


Part 2: Data Tidying with .value (20 points)

In the lab, you learned how to use pivot_longer() with the .value sentinel to reshape wide data into tidy format.

Task 2.1: Write code to transform gapminder_wide into a tidy dataset with columns: country, continent, year, gdpPercap, and lifeExp. Show the first 10 rows of your tidy dataset.

gap_tidy <- gapminder_wide %>% 
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_",
    values_drop_na = FALSE
  ) %>% 
  mutate(year = as.numeric(year))

glimpse(gap_tidy)
## Rows: 1,704
## Columns: 5
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
## $ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
head(gap_tidy, 10)
## # A tibble: 10 × 5
##    country     continent  year gdpPercap lifeExp
##    <chr>       <chr>     <dbl>     <dbl>   <dbl>
##  1 Afghanistan Asia       1952      779.    28.8
##  2 Afghanistan Asia       1957      821.    30.3
##  3 Afghanistan Asia       1962      853.    32.0
##  4 Afghanistan Asia       1967      836.    34.0
##  5 Afghanistan Asia       1972      740.    36.1
##  6 Afghanistan Asia       1977      786.    38.4
##  7 Afghanistan Asia       1982      978.    39.9
##  8 Afghanistan Asia       1987      852.    40.8
##  9 Afghanistan Asia       1992      649.    41.7
## 10 Afghanistan Asia       1997      635.    41.8
# The pipe operator (%>% or |>) takes the dataset (gapminder_wide) 
# and passes it to the function on the right.

# The pivot_longer() function helps reorganize the dataset because 
# the original data contains many columns such as gdpPercap_1952, 
# gdpPercap_1957, gdpPercap_1962 and lifeExp_1987, lifeExp_1992, etc.

# This function converts the dataset from wide format to long format.
# As a result, the number of columns decreases while the number of rows increases.

# cols = -c(country, continent) means these two columns will not be included 
# in the pivoting process. The minus (-) sign excludes them because they 
# do not contain information about years.

# names_to splits the column names into two parts.
# For example: lifeExp_1987 becomes lifeExp and 1987.

# Normally these parts would go into new columns, but using ".value"
# makes the first part become the new column name.
# Therefore, lifeExp and gdpPercap will become column names.

# names_to = c(".value", "year") means the first part becomes the variable name
# and the second part is stored in a new column called year.

# names_sep = "_" indicates that the column names are separated 
# at the underscore (_) character.

# values_drop_na = FALSE means missing values will be kept 
# and not removed from the dataset.

# mutate(year = as.numeric(year)) converts the year column 
# into numeric format so the years can be treated as numbers.

Task 2.2: Explain in 2-3 sentences what the .value sentinel does in your code. Why is it the right tool for this dataset?

Your answer:

Using .value allows the first part of the original column name to become the new column name. In other words, it keeps the original variable names as column names. For example, lifeExp and gdpPercap become the names of columns in the new dataset.

When we use names_to = c(".value", "year"), the first part of the column name becomes the variable name, while the second part is stored in a new column called year. As a result, the dataset will contain three main columns: gdpPercap, lifeExp, and year.

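The .value sentinel is used inside the names_to argument of pivot_longer(). It is the right tool here because every column name combines a variable name and a year, so a single pivot yields one row per country and year, with gdpPercap and lifeExp as separate columns. This makes the dataset more organized, easier to read, and simpler to analyze.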
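To make the effect of .value concrete, here is a minimal sketch on a made-up two-country table. The object names (toy_wide, long_plain, long_value) and all numbers are invented for illustration; only the column naming pattern follows gapminder_wide.

```r
library(tidyr)
library(tibble)

# Hypothetical toy table in the same wide layout as gapminder_wide
toy_wide <- tribble(
  ~country, ~gdpPercap_1952, ~gdpPercap_1957, ~lifeExp_1952, ~lifeExp_1957,
  "A",      100,             110,             50,            52,
  "B",      200,             230,             60,            61
)

# Without .value: the variable names become values in a "variable" column
long_plain <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = c("variable", "year"),
  names_sep = "_"
)

# With .value: the first part of each name becomes an output column
long_value <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = c(".value", "year"),
  names_sep = "_"
)

dim(long_plain)    # one row per country, variable, and year
dim(long_value)    # one row per country and year
names(long_value)
```

Without .value, a second step (for example pivot_wider()) would be needed to get gdpPercap and lifeExp side by side; with .value the pivot does it in one call.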

Task 2.3: From your tidy dataset, filter to keep only observations from 1970 onwards for the following countries: "Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China". Save this filtered dataset as gap_filtered.

gap_filtered <- gap_tidy %>% 
  filter(country %in%  c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China"),
         year >= 1970)

gap_filtered
## # A tibble: 48 × 5
##    country continent  year gdpPercap lifeExp
##    <chr>   <chr>     <dbl>     <dbl>   <dbl>
##  1 Brazil  Americas   1972     4986.    59.5
##  2 Brazil  Americas   1977     6660.    61.5
##  3 Brazil  Americas   1982     7031.    63.3
##  4 Brazil  Americas   1987     7807.    65.2
##  5 Brazil  Americas   1992     6950.    67.1
##  6 Brazil  Americas   1997     7958.    69.4
##  7 Brazil  Americas   2002     8131.    71.0
##  8 Brazil  Americas   2007     9066.    72.4
##  9 China   Asia       1972      677.    63.1
## 10 China   Asia       1977      741.    64.0
## # ℹ 38 more rows
# filter() keeps only the rows that meet specific conditions.

# country %in% c(...) is used to filter countries.
# c() creates a vector of values, and %in% checks, row by row, whether
# the value in the country column is one of those values.

# This means that only the selected 6 countries
# will remain in the dataset.
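As a small illustration of how %in% and the second condition work together, the sketch below uses an invented four-row table (toy, keep, and toy_filtered are hypothetical names, not part of the assignment data).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  country = c("Turkey", "France", "China", "Turkey"),
  year    = c(1967, 1972, 1972, 1977)
)

keep <- c("Turkey", "China")

# %in% returns one TRUE/FALSE per row of the country column
toy$country %in% keep

# Conditions separated by commas inside filter() are combined with AND,
# so a row must match a listed country and also have year >= 1970
toy_filtered <- toy %>%
  filter(country %in% keep, year >= 1970)

toy_filtered
```

The 1967 row for Turkey passes the country test but fails the year test, so it is dropped.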

Part 3: Grouped Summaries (25 points)

Now you will use group_by() and summarize() to answer questions about continents and countries.

Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).

average_gdp_and_lifeExp <- gap_tidy %>% 
  group_by(continent) %>% 
  summarize(
    average_gdp = mean(gdpPercap, na.rm = TRUE),
    average_lifeExp = mean(lifeExp, na.rm = TRUE),
    .groups = "drop"
  )

average_gdp_and_lifeExp
## # A tibble: 5 × 3
##   continent average_gdp average_lifeExp
##   <chr>           <dbl>           <dbl>
## 1 Africa          2194.            48.9
## 2 Americas        7136.            64.7
## 3 Asia            7902.            60.1
## 4 Europe         14469.            71.9
## 5 Oceania        18622.            74.3
# group_by(continent) groups the dataset by the continent variable,
# meaning the data will be processed separately for each continent.

# summarize() reduces the data into a smaller table by calculating
# summary statistics. Each continent will have its own row in the result.

# The mean() function calculates the average value, and
# na.rm = TRUE removes missing values before calculating the mean.

# .groups = "drop" removes the grouping after the summarization
# so the result is returned as a regular table.

Questions to answer:
- Which continent has the highest average GDP per capita?
- Which continent has the highest average life expectancy?
- Are these the same continent? Why might that be?

Your answer:

Oceania has the highest average GDP per capita and the highest average life expectancy, which surprised me. One possible reason is that the Oceania countries in this dataset, Australia and New Zealand, are island nations that can carry out a great deal of sea trade, and they are the most developed countries in the region. When I researched this, I found that Australia is one of the world's largest exporters of ore and energy. Combined with its small population, this raises GDP per capita. Australia also has strict immigration rules that favor people expected to benefit the country. A country with a higher GDP per capita tends to have a higher life expectancy, which makes sense if the government spends on the public. These countries have developed, largely publicly funded healthcare systems. I also learned that they devote large budgets to early diagnosis and healthy-living campaigns. Government spending on public health increases average life expectancy.

Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.

highest_avg_gdp_top_5 <- gap_tidy %>% 
  group_by(country) %>% 
  summarise(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) %>% 
  slice_max(avg_gdp, n=5)

highest_avg_gdp_top_5
## # A tibble: 5 × 2
##   country       avg_gdp
##   <chr>           <dbl>
## 1 Kuwait         65333.
## 2 Switzerland    27074.
## 3 Norway         26747.
## 4 United States  26261.
## 5 Canada         22411.
# slice_max() keeps rows with the highest value

Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?

Your answer:

When I look at the table, I notice that Kuwait, Norway, and Switzerland have higher GDP per capita than the United States and Canada. Although the United States and Canada have very large economies, GDP per capita is calculated by dividing total GDP by the population. Since Kuwait, Norway, and Switzerland have smaller populations than the United States and Canada, their GDP per capita can appear higher. Kuwait being at the top did not surprise me because it is well known for its large oil reserves and high living standards. The country is also known for benefits such as low or no personal income tax and various government payments to citizens. When I looked into Norway, I found that a large share of its exports comes from natural gas and crude oil. These natural resources generate significant national income, which increases its GDP per capita. Switzerland, on the other hand, does not have many natural resources. However, it has a very strong economy based on high value-added services such as banking, finance, and other advanced industries. In contrast, the United States and Canada still have very high total GDP, but their larger populations reduce their GDP per capita compared with countries that have smaller populations and strong income sources.

Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.

continent_info <- gap_tidy %>% 
  select(country, continent) %>% 
  distinct()

corelation_by_continent <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs = n(),
    .groups = "drop"
  )


corelation_by_continent
## # A tibble: 5 × 3
##   continent correlation n_obs
##   <chr>           <dbl> <int>
## 1 Africa          0.426   624
## 2 Americas        0.558   300
## 3 Asia            0.382   396
## 4 Europe          0.781   360
## 5 Oceania         0.956    24
# distinct() removes duplicate rows from the dataset.

# cor() calculates the correlation between two variables.
# The correlation value ranges from -1 to 1:
#   - Close to 1  → strong positive correlation
#   - Close to 0  → little or no correlation
#   - Close to -1 → strong negative correlation

# use = "complete.obs" tells R to only use rows where both
# variables have non-missing values.

# n_obs = n() counts the number of rows for each continent.

Questions to answer:
- In which continent is the relationship strongest (highest correlation)?
- In which continent is it weakest?
- What might explain the differences between continents?

Your answer: Oceania shows the highest correlation between GDP per capita and life expectancy, while Asia has the lowest. Generally, as GDP per capita increases, living standards improve, which tends to increase life expectancy. Money itself does not buy more years of life, but it provides conditions that support longer life, such as better healthcare, sanitation, and infrastructure. A strong correlation indicates that economic growth is being used effectively for public benefits, like clean water, safe food, and hospitals. On the other hand, a weak correlation suggests that wealth may not be distributed to improve public welfare and may reflect higher income inequality. In Asia, we can see that some countries, like Japan and South Korea, have a strong correlation between GDP per capita and life expectancy. However, in countries like India, increases in GDP per capita do not necessarily lead to higher life expectancy. This pattern suggests that income inequality may weaken the correlation.

Population size is another factor. Asia has the largest population in the world, which can affect averages and correlations. Additionally, Asia ranks fourth among continents in terms of income inequality. For Oceania, there are only 24 observations, which corresponds to just two countries (Australia and New Zealand) observed in 12 survey years, so those two countries alone determine the result for that continent.
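One further, purely mechanical factor may also matter: a continent-level correlation pools countries at very different income levels, which can weaken it even when the relationship inside each country is strong. The sketch below illustrates this with invented numbers (toy, within_country, and pooled are hypothetical; this is not the assignment data and not a claim about any real country).

```r
library(dplyr)

# Hypothetical data: two countries, each with a perfectly linear
# relationship between GDP per capita and life expectancy
toy <- tibble(
  country   = rep(c("A", "B"), each = 3),
  gdpPercap = c(1000, 2000, 3000, 20000, 40000, 60000),
  lifeExp   = c(70, 75, 80, 72, 74, 76)
)

# Correlation computed separately inside each country
within_country <- toy %>%
  group_by(country) %>%
  summarize(correlation = cor(gdpPercap, lifeExp), .groups = "drop")

# Correlation computed on all rows at once, as in the continent summary
pooled <- cor(toy$gdpPercap, toy$lifeExp)

within_country
pooled
```

In this toy case the pooled value is far below the within-country values, so a low continental correlation does not by itself show that income fails to translate into longer lives inside individual countries.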


Part 4: Data Integration (20 points)

Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.

Task 4.1: Import gap_life.csv and gap_gdp.csv. Use glimpse() to examine each one.

gap_life <- read.csv("gap_life.csv")
glimpse(gap_life)
## Rows: 1,618
## Columns: 3
## $ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
## $ year    <int> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
## $ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
getwd()
## [1] "C:/Users/KLAB/Downloads"
gap_gdp <- read.csv("gap_gdp.csv")
glimpse(gap_gdp)
## Rows: 1,618
## Columns: 3
## $ country   <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
## $ year      <int> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
## $ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…

Task 4.2: Use inner_join() to combine them into a dataset called gap_joined. Join by the columns they have in common.

gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))

# Combines two tables, keeping matching rows

Task 4.3: Answer the following:
- How many rows are in gap_joined?
- How many unique countries are in gap_joined?
- Compare this to the original number of rows in gap_life.csv and gap_gdp.csv. Why might the joined dataset have fewer rows?

nrow(gap_joined) #how many rows - 1535
## [1] 1535
nrow(gap_life) # - 1618 - original dataset
## [1] 1618
nrow(gap_gdp) # - 1618 - original dataset
## [1] 1618
unique(gap_joined$country) #unique countries
##   [1] "Mali"                     "Malaysia"                
##   [3] "Zambia"                   "Greece"                  
##   [5] "Swaziland"                "Iran"                    
##   [7] "Venezuela"                "Portugal"                
##   [9] "Sweden"                   "Brazil"                  
##  [11] "Pakistan"                 "Algeria"                 
##  [13] "Equatorial Guinea"        "Botswana"                
##  [15] "Haiti"                    "Saudi Arabia"            
##  [17] "Korea, Dem. Rep."         "Niger"                   
##  [19] "Congo, Dem. Rep."         "United States"           
##  [21] "Eritrea"                  "Trinidad and Tobago"     
##  [23] "Colombia"                 "Panama"                  
##  [25] "Comoros"                  "Italy"                   
##  [27] "Nicaragua"                "Gambia"                  
##  [29] "Iceland"                  "Bosnia and Herzegovina"  
##  [31] "Hong Kong, China"         "El Salvador"             
##  [33] "Myanmar"                  "Croatia"                 
##  [35] "Finland"                  "South Africa"            
##  [37] "Ireland"                  "United Kingdom"          
##  [39] "Liberia"                  "Libya"                   
##  [41] "Malawi"                   "Norway"                  
##  [43] "India"                    "Guatemala"               
##  [45] "Netherlands"              "Japan"                   
##  [47] "Mauritania"               "Ghana"                   
##  [49] "Taiwan"                   "Paraguay"                
##  [51] "Morocco"                  "Cuba"                    
##  [53] "Guinea"                   "Denmark"                 
##  [55] "Chad"                     "Zimbabwe"                
##  [57] "Yemen, Rep."              "Austria"                 
##  [59] "Bahrain"                  "Egypt"                   
##  [61] "Angola"                   "Reunion"                 
##  [63] "Senegal"                  "Gabon"                   
##  [65] "Albania"                  "Serbia"                  
##  [67] "Lebanon"                  "Germany"                 
##  [69] "Jamaica"                  "Canada"                  
##  [71] "Montenegro"               "Rwanda"                  
##  [73] "New Zealand"              "Syria"                   
##  [75] "Spain"                    "Slovak Republic"         
##  [77] "Kenya"                    "Guinea-Bissau"           
##  [79] "Cote d'Ivoire"            "Sri Lanka"               
##  [81] "Switzerland"              "Afghanistan"             
##  [83] "Mozambique"               "Togo"                    
##  [85] "Namibia"                  "Tunisia"                 
##  [87] "Uganda"                   "Mongolia"                
##  [89] "Bulgaria"                 "Sao Tome and Principe"   
##  [91] "Uruguay"                  "Nepal"                   
##  [93] "West Bank and Gaza"       "Iraq"                    
##  [95] "Oman"                     "Burkina Faso"            
##  [97] "Cameroon"                 "Philippines"             
##  [99] "Kuwait"                   "Vietnam"                 
## [101] "Benin"                    "Dominican Republic"      
## [103] "Turkey"                   "Somalia"                 
## [105] "Tanzania"                 "Puerto Rico"             
## [107] "Jordan"                   "Peru"                    
## [109] "Cambodia"                 "Chile"                   
## [111] "Burundi"                  "China"                   
## [113] "Israel"                   "Australia"               
## [115] "Mexico"                   "Lesotho"                 
## [117] "Madagascar"               "Sierra Leone"            
## [119] "Korea, Rep."              "Ecuador"                 
## [121] "Slovenia"                 "Honduras"                
## [123] "France"                   "Belgium"                 
## [125] "Indonesia"                "Romania"                 
## [127] "Hungary"                  "Thailand"                
## [129] "Central African Republic" "Argentina"               
## [131] "Congo, Rep."              "Poland"                  
## [133] "Singapore"                "Bangladesh"              
## [135] "Bolivia"                  "Sudan"                   
## [137] "Mauritius"                "Nigeria"                 
## [139] "Djibouti"                 "Costa Rica"              
## [141] "Ethiopia"                 "Czech Republic"

Your answer: The gap_joined dataset contains 1,535 rows and 142 unique countries. In the original datasets, gap_life and gap_gdp each had 1,618 rows. The difference occurs because inner_join() only keeps observations that exist in both datasets. If a country-year row is missing from either dataset, it is excluded. For a row to appear in the joined dataset, both the country and the year must match between the two datasets.
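A quick way to count the countries and to see which rows were dropped is sketched below on two invented tables (toy_life, toy_gdp, and the other object names are hypothetical). It uses n_distinct() for the count and anti_join() to list rows that have no partner in the other table. If each country-year pair appears only once per file, the row counts above imply that 1,618 - 1,535 = 83 rows of each file went unmatched.

```r
library(dplyr)

# Hypothetical toy tables with partly overlapping country-year keys
toy_life <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(50, 52, 60)
)
toy_gdp <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952, 1952, 1957),
  gdpPercap = c(100, 200, 230)
)

# Keeps only country-year pairs present in both tables
toy_joined <- inner_join(toy_life, toy_gdp, by = c("country", "year"))

# Number of unique countries, as a single number instead of a printed list
n_distinct(toy_joined$country)

# Rows that inner_join() dropped from each side
only_in_life <- anti_join(toy_life, toy_gdp, by = c("country", "year"))
only_in_gdp  <- anti_join(toy_gdp, toy_life, by = c("country", "year"))

only_in_life
only_in_gdp
```

Applying the same two anti_join() calls to gap_life and gap_gdp would show exactly which observations are missing from gap_joined.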

Task 4.4: Check for missing values in gap_joined. Are there any rows where lifeExp or gdpPercap is NA? If so, list them.

gap_joined %>% 
  filter(is.na(lifeExp) | is.na(gdpPercap))  # there is no NA data
## [1] country   year      lifeExp   gdpPercap
## <0 rows> (or 0-length row.names)

Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?

Your answer:

I researched this issue and found that there are several ways to handle missing data. Each method has its advantages, but they also come with potential drawbacks. The first step is to ask why the data is missing. If the data is missing at random, it may be acceptable to remove those rows. However, this has a cost: the sample gets smaller, and the missing year might coincide with a major event, like a pandemic, so dropping it could bias the analysis. If the data is not missing at random, a better approach is to use a proxy method, where another dataset that is highly correlated with the original one is used to estimate the missing values. This approach helps preserve important patterns in the data while filling the gaps, but the estimates are only as good as the proxy.
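The two broad options can be sketched on an invented three-row table (toy, toy_dropped, and toy_filled are hypothetical names). The first option is the row removal described above. The second uses linear interpolation between neighboring years, which is a simpler stand-in for the proxy idea and is an assumption of this sketch, not a method from the lab.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing life expectancy value
toy <- tibble(
  country = "A",
  year    = c(1952, 1957, 1962),
  lifeExp = c(40, NA, 44)
)

# Option 1: drop rows where lifeExp is missing (loses the 1957 row)
toy_dropped <- drop_na(toy, lifeExp)

# Option 2: linear interpolation inside each country.
# approx() needs at least two non-missing points per country.
toy_filled <- toy %>%
  group_by(country) %>%
  mutate(
    lifeExp = approx(
      x    = year[!is.na(lifeExp)],
      y    = lifeExp[!is.na(lifeExp)],
      xout = year
    )$y
  ) %>%
  ungroup()

toy_dropped
toy_filled
```

Interpolation keeps the panel complete but assumes a smooth change between survey years, so it would hide exactly the kind of shock (war, epidemic) that might have caused the gap.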


Part 5: Economic Interpretation (15 points)

Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.

Your paragraph:

growth_rate <- gap_tidy %>% 
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>% 
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE),
            .groups = "drop") %>% 
  pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") %>% 
  mutate(growth = round((gdp_2007 - gdp_1952) / gdp_1952, 1))
  
growth_rate
## # A tibble: 5 × 4
##   continent gdp_1952 gdp_2007 growth
##   <chr>        <dbl>    <dbl>  <dbl>
## 1 Africa       1253.    3089.    1.5
## 2 Americas     4079.   11003.    1.7
## 3 Asia         5195.   12473.    1.4
## 4 Europe       5661.   25054.    3.4
## 5 Oceania     10298.   29810.    1.9
# filter(year %in% c(1952, 2007)) keeps only the years 1952 and 2007
# group_by(continent, year) groups the data by continent and year
# summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE)) creates a new column called avg_gdp
# with the average GDP per capita for each continent and year
# na.rm = TRUE removes missing values
# .groups = "drop" ends grouping
# pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") makes the table wide
# names_from = year takes the new column names from the values of year
# values_from = avg_gdp fills those new columns with the avg_gdp values
# names_prefix = "gdp_" adds gdp_ to the front of each new column name
# mutate(growth = round((gdp_2007 - gdp_1952) / gdp_1952, 1)) creates a column named growth:
# the change relative to the 1952 level, rounded to one decimal (1.5 means +150%).
# The parentheses are needed because division is evaluated before subtraction;
# without them the expression reduces to gdp_2007 - 1.

I calculated the growth rate with the help of ChatGPT. I wrote some of the code, and AI added the pipe operator and redesigned the pivot_wider() part. Comparing the 1952 and 2007 averages, Oceania has the largest absolute gain in average GDP per capita (from about 10,298 to 29,810), while Europe has the highest growth rate (from about 5,661 to 25,054, more than a fourfold increase). Between 1952 and 2007, many Oceania countries gained independence, experienced high migration, exported natural resources, and developed their service sectors. Australia, in particular, boosted its population after World War II through its “populate or perish” policy. During this period, Asia focused on industrialization and often traded for natural resources from Australia and New Zealand, while agriculture, health technologies, and tourism developed in the Pacific Islands.
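Because the growth calculation depends on operator precedence, the following sketch works through it for one continent. It uses the rounded Europe averages from the summary table above; the variable names (no_parens, relative_growth, annual_growth) and the annualized rate are additions for illustration, not part of the original analysis.

```r
# Rounded continent averages for Europe, copied from the summary table
gdp_1952 <- 5661
gdp_2007 <- 25054

# Division binds tighter than subtraction, so this equals gdp_2007 - 1
no_parens <- gdp_2007 - gdp_1952 / gdp_1952

# Parentheses give the change relative to the 1952 level
relative_growth <- (gdp_2007 - gdp_1952) / gdp_1952

# Average annual growth rate over the 55 years from 1952 to 2007
annual_growth <- (gdp_2007 / gdp_1952)^(1 / 55) - 1

no_parens
relative_growth
annual_growth
```

With these rounded inputs the relative growth is roughly 3.4 (about +340%), which corresponds to an average of roughly 2.7% per year. The annualized figure is often easier to compare across periods of different length.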

This is the correlation I calculated previously in Task 3.3:

continent_info <- gap_tidy %>% 
  select(country, continent) %>% 
  distinct()

corelation_by_continent <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs = n(),
    .groups = "drop"
  )


corelation_by_continent
## # A tibble: 5 × 3
##   continent correlation n_obs
##   <chr>           <dbl> <int>
## 1 Africa          0.426   624
## 2 Americas        0.558   300
## 3 Asia            0.382   396
## 4 Europe          0.781   360
## 5 Oceania         0.956    24

Oceania has the highest correlation between GDP per capita and life expectancy, largely due to Australia and New Zealand. High GDP combined with government spending on public health, hygiene, and infrastructure helps explain this strong relationship.

The main limitations are that the data only includes GDP per capita and life expectancy, ignoring factors like war, migration, health issues, and inequalities. It covers a limited period (1952–2007), and using inner_join() dropped some observations. Also, since the data is recorded every five years, short-term trends cannot be observed, and correlation shows association but not causation.


Part 6: Reproducibility (5 points)

Before submitting, check that your document meets these requirements:


Academic Integrity Reminder

You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:

Tool Used; Prompt Given; How You Verified or Modified the Output

Tool Used: Gemini
Prompt Given: what happened in oceania between 1952 and 2007

Tool Used: ChatGPT

Prompt Given: Which continent has seen the most dramatic economic growth since 1952? (Look at the numbers – don’t just guess.), how can I calculate the growth rate

How You Verified or Modified the Output:
growth_rate <- gap_tidy %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") %>%
  mutate(
    absolute_growth = gdp_2007 - gdp_1952,
    growth_rate = ((gdp_2007 - gdp_1952))
  )

Submission Checklist


Glossary of Functions Used

Function What it does
select() Keeps only specified columns
filter() Keeps rows that meet conditions
mutate() Adds or modifies columns
pivot_longer() Reshapes wide to long
group_by() Groups data for subsequent operations
summarize() Reduces grouped data to summary stats
inner_join() Combines two tables, keeping matching rows
distinct() Keeps unique rows
slice_max() Keeps rows with highest values
arrange() Sorts rows
contains() Helper for selecting columns with a pattern