# Load the tidyverse package
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import the wide Gapminder dataset
gapminder_wide <- read_csv("data/gapminder_wide.csv") # The assignment operator (<-) stores the imported data in gapminder_wide. I also renamed the file from gapminder_wide(1).csv to gapminder_wide.csv.
## Rows: 142 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, continent
## dbl (24): gdpPercap_1952, gdpPercap_1957, gdpPercap_1962, gdpPercap_1967, gd...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Task 1.1: Use glimpse() to examine the
structure of gapminder_wide. In your own words, describe
what you see. How many rows and columns are there? What do the column
names tell you about the data format?
glimpse(gapminder_wide)
## Rows: 142
## Columns: 26
## $ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argenti…
## $ continent <chr> "Asia", "Europe", "Africa", "Africa", "Americas", "Ocea…
## $ gdpPercap_1952 <dbl> 779.4453, 1601.0561, 2449.0082, 3520.6103, 5911.3151, 1…
## $ gdpPercap_1957 <dbl> 820.8530, 1942.2842, 3013.9760, 3827.9405, 6856.8562, 1…
## $ gdpPercap_1962 <dbl> 853.1007, 2312.8890, 2550.8169, 4269.2767, 7133.1660, 1…
## $ gdpPercap_1967 <dbl> 836.1971, 2760.1969, 3246.9918, 5522.7764, 8052.9530, 1…
## $ gdpPercap_1972 <dbl> 739.9811, 3313.4222, 4182.6638, 5473.2880, 9443.0385, 1…
## $ gdpPercap_1977 <dbl> 786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267, …
## $ gdpPercap_1982 <dbl> 978.0114, 3630.8807, 5745.1602, 2756.9537, 8997.8974, 1…
## $ gdpPercap_1987 <dbl> 852.3959, 3738.9327, 5681.3585, 2430.2083, 9139.6714, 2…
## $ gdpPercap_1992 <dbl> 649.3414, 2497.4379, 5023.2166, 2627.8457, 9308.4187, 2…
## $ gdpPercap_1997 <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, …
## $ gdpPercap_2002 <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 3…
## $ gdpPercap_2007 <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, …
## $ lifeExp_1952 <dbl> 28.801, 55.230, 43.077, 30.015, 62.485, 69.120, 66.800,…
## $ lifeExp_1957 <dbl> 30.33200, 59.28000, 45.68500, 31.99900, 64.39900, 70.33…
## $ lifeExp_1962 <dbl> 31.99700, 64.82000, 48.30300, 34.00000, 65.14200, 70.93…
## $ lifeExp_1967 <dbl> 34.02000, 66.22000, 51.40700, 35.98500, 65.63400, 71.10…
## $ lifeExp_1972 <dbl> 36.08800, 67.69000, 54.51800, 37.92800, 67.06500, 71.93…
## $ lifeExp_1977 <dbl> 38.43800, 68.93000, 58.01400, 39.48300, 68.48100, 73.49…
## $ lifeExp_1982 <dbl> 39.854, 70.420, 61.368, 39.942, 69.942, 74.740, 73.180,…
## $ lifeExp_1987 <dbl> 40.822, 72.000, 65.799, 39.906, 70.774, 76.320, 74.940,…
## $ lifeExp_1992 <dbl> 41.674, 71.581, 67.744, 40.647, 71.868, 77.560, 76.040,…
## $ lifeExp_1997 <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 78.830, 77.510,…
## $ lifeExp_2002 <dbl> 42.129, 75.651, 70.994, 41.003, 74.340, 80.370, 78.980,…
## $ lifeExp_2007 <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829,…
Your answer:
There are 142 rows and 26 columns. glimpse() prints every column with its
type and its first few values. The layout looks unorganized, though: after
country and continent come gdpPercap_1952 to gdpPercap_2007 and then
lifeExp_1952 to lifeExp_2007. The table is very wide because each year is
stored in its own column instead of in a single year variable. And while the
rows which start with country and continent are
.value (20 points)
In the lab, you learned how to use pivot_longer() with
the .value sentinel to reshape wide data into tidy
format.
Task 2.1: Write code to transform
gapminder_wide into a tidy dataset with columns:
country, continent, year,
gdpPercap, and lifeExp. Show the first 10 rows
of your tidy dataset.
gap_tidy <- gapminder_wide %>%
  pivot_longer(
    cols = -c(country, continent),
    names_to = c(".value", "year"),
    names_sep = "_",
    values_drop_na = FALSE
  ) %>%
  mutate(year = as.numeric(year))
glimpse(gap_tidy)
## Rows: 1,704
## Columns: 5
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
## $ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
head(gap_tidy, 10)
## # A tibble: 10 × 5
## country continent year gdpPercap lifeExp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 779. 28.8
## 2 Afghanistan Asia 1957 821. 30.3
## 3 Afghanistan Asia 1962 853. 32.0
## 4 Afghanistan Asia 1967 836. 34.0
## 5 Afghanistan Asia 1972 740. 36.1
## 6 Afghanistan Asia 1977 786. 38.4
## 7 Afghanistan Asia 1982 978. 39.9
## 8 Afghanistan Asia 1987 852. 40.8
## 9 Afghanistan Asia 1992 649. 41.7
## 10 Afghanistan Asia 1997 635. 41.8
# The pipe operator (%>% or |>) takes the data on its left (gapminder_wide) and passes it to the function on its right.
# pivot_longer() reorganizes the data. The wide table had many columns,
# such as gdpPercap_1952, gdpPercap_1957, gdpPercap_1962 ... lifeExp_1987, lifeExp_1992 ...
# The function turns it into a table with fewer columns and more rows, which makes the data longer.
# cols = -c(country, continent) says not to touch the country and continent columns, because they hold no year information. The - sign excludes them, so they are not pivoted.
# names_to splits each column name into 2 parts, e.g. lifeExp_1987 -> lifeExp and 1987.
# Normally the parts of the name become values in new columns, but with ".value" the first part becomes a column name instead. So lifeExp and gdpPercap are the new column names.
# names_to = c(".value", "year"): the first part becomes the column name and the second part goes into a column named year.
# The result has 3 new columns: gdpPercap, lifeExp and year.
# names_sep = "_" says where to split the column name: at the _ character.
# values_drop_na = FALSE keeps rows with missing values instead of deleting them.
# mutate(year = as.numeric(year)) converts year from text to numbers.
Task 2.2: Explain in 2-3 sentences what the
.value sentinel does in your code. Why is it the right tool
for this dataset?
Your answer:
With .value, the first part of each old column name becomes the name of a new column instead of being stored as a value. In names_to = c(".value", "year"), the first part (gdpPercap or lifeExp) names the column and the second part goes into a column called year, so the result has three new columns: gdpPercap, lifeExp and year. It is the right tool here because every wide column name combines a variable and a year, and .value lets pivot_longer() separate them in one step. The result is easier to read and to analyze.
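The effect of the .value sentinel is easier to see on a very small table. The sketch below uses a made-up two-country table (toy_wide and its numbers are hypothetical, not taken from the Gapminder file) with the same column naming pattern. It also uses names_transform to convert year while pivoting, which is one alternative to the separate mutate() step, assuming a tidyr version that supports that argument.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table with the same naming pattern: variable_year
toy_wide <- tribble(
  ~country, ~gdpPercap_1952, ~gdpPercap_1957, ~lifeExp_1952, ~lifeExp_1957,
  "A",      100,             110,             40,            42,
  "B",      200,             230,             50,            53
)

# ".value" turns the first part of each name into its own column;
# the second part is stored in the year column
toy_tidy <- pivot_longer(
  toy_wide,
  cols            = -country,
  names_to        = c(".value", "year"),
  names_sep       = "_",
  names_transform = list(year = as.integer)
)

toy_tidy  # one row per country and year, with gdpPercap and lifeExp side by side
```

Without ".value" (for example names_to = c("variable", "year")), the same call would stack both measurements into a single value column, which would then need a pivot_wider() step to separate them again.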
Task 2.3: From your tidy dataset, filter to keep
only observations from 1970 onwards for the following
countries: "Turkey", "Brazil",
"Korea, Rep.", "Germany",
"United States", "China". Save this filtered
dataset as gap_filtered.
gap_filtered <- gap_tidy %>%
  filter(country %in% c("Turkey", "Brazil", "Korea, Rep.", "Germany", "United States", "China"),
         year >= 1970)
gap_filtered
## # A tibble: 48 × 5
## country continent year gdpPercap lifeExp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Brazil Americas 1972 4986. 59.5
## 2 Brazil Americas 1977 6660. 61.5
## 3 Brazil Americas 1982 7031. 63.3
## 4 Brazil Americas 1987 7807. 65.2
## 5 Brazil Americas 1992 6950. 67.1
## 6 Brazil Americas 1997 7958. 69.4
## 7 Brazil Americas 2002 8131. 71.0
## 8 Brazil Americas 2007 9066. 72.4
## 9 China Asia 1972 677. 63.1
## 10 China Asia 1977 741. 64.0
## # ℹ 38 more rows
# filter() keeps the rows that meet all of the conditions.
# country %in% c(...) is the country filter: c() creates a vector and %in% checks whether each country is in it,
# so only these 6 countries stay. year >= 1970 keeps observations from 1970 onwards.
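As a small illustration of how %in% and the comma-separated conditions work together, here is a sketch on a made-up four-row table (toy is hypothetical):

```r
library(dplyr)

# Hypothetical miniature dataset
toy <- tibble(
  country = c("Turkey", "Turkey", "Peru", "China"),
  year    = c(1967, 1972, 1972, 2007)
)

# %in% returns one TRUE/FALSE per row
toy$country %in% c("Turkey", "China")

# Conditions separated by a comma must all hold (logical AND)
kept <- filter(toy, country %in% c("Turkey", "China"), year >= 1970)
kept  # Turkey 1972 and China 2007 remain
```

Turkey 1967 is dropped by the year condition and Peru 1972 by the country condition, so a row has to pass both tests to stay.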
Now you will use group_by() and summarize()
to answer questions about continents and countries.
Task 3.1: Calculate the average GDP per capita and average life expectancy for each continent across all years (use the full tidy dataset, not the filtered one).
average_gdp_and_lifeExp <- gap_tidy %>%
  group_by(continent) %>%
  summarize(
    average_gdp = mean(gdpPercap, na.rm = TRUE),
    average_lifeExp = mean(lifeExp, na.rm = TRUE),
    .groups = "drop"
  )
average_gdp_and_lifeExp
## # A tibble: 5 × 3
## continent average_gdp average_lifeExp
## <chr> <dbl> <dbl>
## 1 Africa 2194. 48.9
## 2 Americas 7136. 64.7
## 3 Asia 7902. 60.1
## 4 Europe 14469. 71.9
## 5 Oceania 18622. 74.3
# group_by(continent) splits the data into groups based on the continent column.
# summarize() collapses the many rows of each group into a shorter table, with one row per continent.
# mean() calculates the average, and na.rm = TRUE ignores missing values.
# .groups = "drop" removes the grouping, so the result is an ordinary ungrouped tibble.
Questions to answer: - Which continent has the highest average GDP per capita? - Which continent has the highest average life expectancy? - Are these the same continent? Why might that be?
Your answer:
Oceania has both the highest average GDP per capita and the highest average life expectancy. This actually surprised me. One reason may be that the countries here are island countries, which could support a lot of sea trade. Australia and New Zealand are the most developed countries in this region. When I researched it, I saw that Australia is one of the world's biggest ore and energy exporters; combined with its small population, this raises GDP per capita. Australia also has strict migration rules and mainly accepts people who will benefit the country. A country with a higher GDP per capita tends to have a higher life expectancy, which makes sense if the government spends on the public. These countries have developed, free healthcare systems. I also learned that they spend heavily on early diagnosis and healthy-living campaigns. Government spending on public health increases average life expectancy.
Task 3.2: Find the 5 countries with the highest average GDP per capita across all years. Show the country name and its average GDP per capita.
highest_avg_gdp_top_5 <- gap_tidy %>%
  group_by(country) %>%
  summarise(
    avg_gdp = mean(gdpPercap, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  slice_max(avg_gdp, n = 5)
highest_avg_gdp_top_5
## # A tibble: 5 × 2
## country avg_gdp
## <chr> <dbl>
## 1 Kuwait 65333.
## 2 Switzerland 27074.
## 3 Norway 26747.
## 4 United States 26261.
## 5 Canada 22411.
# slice_max() keeps rows with the highest value
Look at your result: Do any of these countries surprise you? Why might small, wealthy countries appear at the top?
Your answer:
Looking at the table, Kuwait, Switzerland and Norway have a higher average GDP per capita than the United States and Canada. Canada and the US have huge economies, but GDP per capita divides GDP by population, and Kuwait, Norway and Switzerland have much smaller populations. Kuwait is at the top, which did not surprise me: I knew that it has huge oil reserves and is known for its living standards, such as low taxes and government payments to citizens. When I researched, I found that Norway's main exports are natural gas and crude oil, which bring in a lot of money. Switzerland does not have such natural resources, but it provides high value-added services like banking and finance. The US and Canada have a high GDP, but their large populations bring GDP per capita down.
Task 3.3: Calculate the correlation between GDP per capita and life expectancy for each continent. Use the full tidy dataset.
continent_info <- gap_tidy %>%
  select(country, continent) %>%
  distinct()
corelation_by_continent <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs = n(),
    .groups = "drop"
  )
corelation_by_continent
## # A tibble: 5 × 3
## continent correlation n_obs
## <chr> <dbl> <int>
## 1 Africa 0.426 624
## 2 Americas 0.558 300
## 3 Asia 0.382 396
## 4 Europe 0.781 360
## 5 Oceania 0.956 24
# distinct() removes repeated rows.
# cor() calculates the correlation between 2 variables. Correlation lies between -1 and 1. Close to 1 means a strong positive
# correlation, close to 0 means no linear correlation, and close to -1 means a strong negative correlation.
# use = "complete.obs" handles missing values: only observations where both variables are filled in are used.
# n_obs = n() counts the rows for each continent.
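To see what use = "complete.obs" changes, here is a minimal sketch with two made-up vectors (x and y are hypothetical), where one pair is incomplete:

```r
# Hypothetical vectors; the third pair is incomplete because y is missing
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)

cor(x, y)                        # NA: a missing value propagates by default
cor(x, y, use = "complete.obs")  # computed from the four complete pairs only

# Number of pairs that cor() actually used with "complete.obs"
sum(complete.cases(x, y))
```

Note that n() counts every row in a group, including rows with a missing value, so it can be larger than the number of pairs behind the correlation. In this dataset there are no missing values, so the two counts agree.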
Questions to answer: - In which continent is the relationship strongest (highest correlation)? - In which continent is it weakest? - What might explain the differences between continents?
Your answer: Oceania has the highest correlation and Asia has the lowest. As a general rule, living standards rise as GDP per capita increases, so we also see an increase in life expectancy. Money does not buy years of life directly, but it buys conditions that lengthen life, such as health care, hygiene and infrastructure. A strong correlation suggests that economic growth is used for the benefit of the public, for example clean water, safe food and hospitals. A weak correlation suggests that the money is not reaching the public, and it may also point to income inequality. In Asia, countries like Japan and South Korea show a strong link between GDP per capita and life expectancy, but in countries like India life expectancy does not rise as much as GDP does. This pattern suggests that income inequality may weaken the correlation. Another factor could be population: Asia is the most populous continent in the world. Asia also ranks 4th among the continents for income inequality. Finally, Oceania has only 24 observations, so I can infer that Australia and New Zealand dominate that continent.
Now you will practice joining two separate datasets: one containing only life expectancy, and one containing only GDP per capita.
Task 4.1: Import gap_life.csv and
gap_gdp.csv. Use glimpse() to examine each
one.
gap_life <- read.csv("data/gap_life.csv")
glimpse(gap_life)
## Rows: 1,618
## Columns: 3
## $ country <chr> "Mali", "Malaysia", "Zambia", "Greece", "Swaziland", "Iran", "…
## $ year <int> 1992, 1967, 1987, 2002, 1967, 1997, 2007, 2007, 1957, 2002, 19…
## $ lifeExp <dbl> 48.388, 59.371, 50.821, 78.256, 46.633, 68.042, 73.747, 78.098…
gap_gdp <- read.csv("data/gap_gdp.csv")
glimpse(gap_gdp)
## Rows: 1,618
## Columns: 3
## $ country <chr> "Bangladesh", "Mongolia", "Taiwan", "Burkina Faso", "Angola"…
## $ year <int> 1987, 1997, 2002, 1962, 1962, 1977, 2007, 1962, 1992, 1972, …
## $ gdpPercap <dbl> 751.9794, 1902.2521, 23235.4233, 722.5120, 4269.2767, 2785.4…
Task 4.2: Use inner_join() to combine
them into a dataset called gap_joined. Join by the columns
they have in common.
gap_joined <- inner_join(gap_life, gap_gdp, by = c("country", "year"))
# Combines two tables, keeping matching rows
Task 4.3: Answer the following: - How many rows are
in gap_joined? - How many unique countries are in
gap_joined? - Compare this to the original number of rows
in gap_life.csv and gap_gdp.csv. Why might the
joined dataset have fewer rows?
nrow(gap_joined) #how many rows - 1535
## [1] 1535
nrow(gap_life) # - 1618 - original dataset
## [1] 1618
nrow(gap_gdp) # - 1618 - original dataset
## [1] 1618
unique(gap_joined$country) #unique countries
## [1] "Mali" "Malaysia"
## [3] "Zambia" "Greece"
## [5] "Swaziland" "Iran"
## [7] "Venezuela" "Portugal"
## [9] "Sweden" "Brazil"
## [11] "Pakistan" "Algeria"
## [13] "Equatorial Guinea" "Botswana"
## [15] "Haiti" "Saudi Arabia"
## [17] "Korea, Dem. Rep." "Niger"
## [19] "Congo, Dem. Rep." "United States"
## [21] "Eritrea" "Trinidad and Tobago"
## [23] "Colombia" "Panama"
## [25] "Comoros" "Italy"
## [27] "Nicaragua" "Gambia"
## [29] "Iceland" "Bosnia and Herzegovina"
## [31] "Hong Kong, China" "El Salvador"
## [33] "Myanmar" "Croatia"
## [35] "Finland" "South Africa"
## [37] "Ireland" "United Kingdom"
## [39] "Liberia" "Libya"
## [41] "Malawi" "Norway"
## [43] "India" "Guatemala"
## [45] "Netherlands" "Japan"
## [47] "Mauritania" "Ghana"
## [49] "Taiwan" "Paraguay"
## [51] "Morocco" "Cuba"
## [53] "Guinea" "Denmark"
## [55] "Chad" "Zimbabwe"
## [57] "Yemen, Rep." "Austria"
## [59] "Bahrain" "Egypt"
## [61] "Angola" "Reunion"
## [63] "Senegal" "Gabon"
## [65] "Albania" "Serbia"
## [67] "Lebanon" "Germany"
## [69] "Jamaica" "Canada"
## [71] "Montenegro" "Rwanda"
## [73] "New Zealand" "Syria"
## [75] "Spain" "Slovak Republic"
## [77] "Kenya" "Guinea-Bissau"
## [79] "Cote d'Ivoire" "Sri Lanka"
## [81] "Switzerland" "Afghanistan"
## [83] "Mozambique" "Togo"
## [85] "Namibia" "Tunisia"
## [87] "Uganda" "Mongolia"
## [89] "Bulgaria" "Sao Tome and Principe"
## [91] "Uruguay" "Nepal"
## [93] "West Bank and Gaza" "Iraq"
## [95] "Oman" "Burkina Faso"
## [97] "Cameroon" "Philippines"
## [99] "Kuwait" "Vietnam"
## [101] "Benin" "Dominican Republic"
## [103] "Turkey" "Somalia"
## [105] "Tanzania" "Puerto Rico"
## [107] "Jordan" "Peru"
## [109] "Cambodia" "Chile"
## [111] "Burundi" "China"
## [113] "Israel" "Australia"
## [115] "Mexico" "Lesotho"
## [117] "Madagascar" "Sierra Leone"
## [119] "Korea, Rep." "Ecuador"
## [121] "Slovenia" "Honduras"
## [123] "France" "Belgium"
## [125] "Indonesia" "Romania"
## [127] "Hungary" "Thailand"
## [129] "Central African Republic" "Argentina"
## [131] "Congo, Rep." "Poland"
## [133] "Singapore" "Bangladesh"
## [135] "Bolivia" "Sudan"
## [137] "Mauritius" "Nigeria"
## [139] "Djibouti" "Costa Rica"
## [141] "Ethiopia" "Czech Republic"
Your answer: There are 1535 rows in gap_joined and 142 unique countries. The original gap_life and gap_gdp datasets each have 1618 rows. The joined dataset is smaller because inner_join() only keeps observations that appear in both tables. A row is included only when the same country and year combination exists in both gap_life and gap_gdp; if it is missing from either one, the row is dropped.
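One way to check which rows an inner join drops is anti_join(), which returns the rows of the first table that have no match in the second. The sketch below uses made-up miniature tables (life_toy and gdp_toy are hypothetical stand-ins for gap_life and gap_gdp), and n_distinct() as a shorter way to count unique countries than printing them all.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
life_toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952L, 1957L, 1952L),
  lifeExp = c(40, 42, 50)
)
gdp_toy <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952L, 1952L, 1957L),
  gdpPercap = c(100, 200, 230)
)

joined_toy <- inner_join(life_toy, gdp_toy, by = c("country", "year"))
nrow(joined_toy)                # fewer rows than either input table
n_distinct(joined_toy$country)  # number of unique countries

# Rows present in one table but missing from the other
only_in_life <- anti_join(life_toy, gdp_toy, by = c("country", "year"))
only_in_gdp  <- anti_join(gdp_toy, life_toy, by = c("country", "year"))
only_in_life  # A, 1957
only_in_gdp   # B, 1957
```

Each input has three rows but only two country and year combinations exist in both, which mirrors how 1618 rows in each file can shrink to 1535 after the join.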
Task 4.4: Check for missing values in
gap_joined. Are there any rows where lifeExp
or gdpPercap is NA? If so, list them.
gap_joined %>%
  filter(is.na(lifeExp) | is.na(gdpPercap)) # there is no NA data
## [1] country year lifeExp gdpPercap
## <0 rows> (or 0-length row.names)
Task 4.5: Propose one way an economist could handle these missing values. What are the trade-offs of your proposed method?
Your answer:
I researched this and found many methods. Every method has a benefit but also a drawback. The first question I would ask is: why is the data missing? If it is missing at random, I would delete the incomplete rows, but that has a cost: the sample gets smaller, and a missing year might coincide with a major event, like a pandemic. If it is not missing at random, I would use a proxy method, which fills the gap using another dataset that is highly correlated with the original one.
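The deletion option can be sketched as follows. Because inner_join() silently drops unmatched rows, the missing values only become visible with a join that keeps everything, such as full_join(). The tables below are made up (life_toy and gdp_toy are hypothetical), and drop_na() from tidyr is used as one possible way to delete the incomplete rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature versions of the two tables
life_toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(1952L, 1957L, 1952L),
  lifeExp = c(40, 42, 50)
)
gdp_toy <- tibble(
  country   = c("A", "B", "B"),
  year      = c(1952L, 1952L, 1957L),
  gdpPercap = c(100, 200, 230)
)

# full_join() keeps every country-year pair and fills the gaps with NA
full_toy <- full_join(life_toy, gdp_toy, by = c("country", "year"))

# List the incomplete rows before deciding what to do with them
incomplete <- filter(full_toy, is.na(lifeExp) | is.na(gdpPercap))
incomplete

# Deletion: keep only rows where both variables are observed
complete <- drop_na(full_toy, lifeExp, gdpPercap)
nrow(complete)
```

Deleting the incomplete rows gives the same rows an inner join would, so the trade-off described above (a smaller sample, and possible bias if the gaps are not random) applies to inner_join() as well.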
Write a short paragraph (5‑8 sentences) addressing the following questions. Use evidence from your analysis in Parts 3 and 4 to support your claims.
Your paragraph:
growth_rate <- gap_tidy %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE),
            .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") %>%
  # Parentheses are required: without them only gdp_1952 / gdp_1952 (= 1) is subtracted
  mutate(growth = (gdp_2007 - gdp_1952) / gdp_1952)
growth_rate
## # A tibble: 5 × 4
## continent gdp_1952 gdp_2007 growth
## <chr> <dbl> <dbl> <dbl>
## 1 Africa 1253. 3089. 1.47
## 2 Americas 4079. 11003. 1.70
## 3 Asia 5195. 12473. 1.40
## 4 Europe 5661. 25054. 3.43
## 5 Oceania 10298. 29810. 1.89
# filter(year %in% c(1952, 2007)) keeps only the years 1952 and 2007
# group_by(continent, year) groups by continent and year
# summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE)) calculates average GDP per capita for each continent and year
# and stores it in a new column called avg_gdp
# na.rm = TRUE ignores missing values
# .groups = "drop" ends the grouping
# pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") makes the table wide
# names_from = year takes the new column names from the year values
# values_from = avg_gdp fills those new columns with the avg_gdp values
# names_prefix = "gdp_" adds gdp_ to the front of each new column name
# mutate(growth = (gdp_2007 - gdp_1952) / gdp_1952) creates a column named growth: the change relative to the 1952 level
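In R, division is evaluated before subtraction, so a growth-rate formula needs parentheses around the difference. A short worked example with made-up numbers (start and end are hypothetical) shows how much the result changes:

```r
start <- 50   # hypothetical starting value
end   <- 150  # hypothetical ending value

# Without parentheses: start / start is evaluated first, so this is end - 1
no_parens <- end - start / start

# With parentheses: the change relative to the starting value
growth <- (end - start) / start

no_parens  # 149
growth     # 2, i.e. the value grew by 200% of its starting level
```

A quick sanity check for any growth column is that start * (1 + growth) should reproduce the end value.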
I calculated the growth rate with the help of ChatGPT: I wrote part of the code myself, and the AI added the pipe operator and redesigned the pivot_wider() call. Oceania has the largest absolute increase in average GDP per capita (from about 10,298 in 1952 to about 29,810 in 2007), although relative to its 1952 level Europe grew faster (from about 5,661 to about 25,054). During this period many countries in Oceania gained independence. The region also received high migration, exported natural resources, and developed its service sector. Australia in particular gained population after the Second World War with its "populate or perish" policy. At that time Asia was industrializing, so Oceania (especially Australia and New Zealand) exported natural resources to Asia. The region also focused on agriculture and developed health technologies, and tourism in the Pacific Islands gained attention.
From the correlation I calculated earlier,
continent_info <- gap_tidy %>%
  select(country, continent) %>%
  distinct()
corelation_by_continent <- gap_tidy |>
  group_by(continent) |>
  summarize(
    correlation = cor(gdpPercap, lifeExp, use = "complete.obs"),
    n_obs = n(),
    .groups = "drop"
  )
corelation_by_continent
## # A tibble: 5 × 3
## continent correlation n_obs
## <chr> <dbl> <int>
## 1 Africa 0.426 624
## 2 Americas 0.558 300
## 3 Asia 0.382 396
## 4 Europe 0.781 360
## 5 Oceania 0.956 24
I can see that Oceania has the highest correlation between GDP per capita and life expectancy. High GDP per capita in Oceania is expected because of Australia and New Zealand. If a government spends on public health, hygiene and similar services, we can expect a high correlation between the two.
The limitation is that the data does not show all the factors at work. It only shows GDP per capita and life expectancy, not factors like war, migration, health crises or inequality. It also covers a limited period, 1952 to 2007, and World War II had ended only 7 years before it starts. The inner_join() function dropped some observations. We can see a correlation, but we cannot see the causes behind it. The data quality is not bad, but the observations come only at 5-year intervals, so we cannot see short-term trends.
Before submitting, check that your document meets these requirements:
You are encouraged to discuss concepts with classmates, but your submitted work must be your own. If you use AI assistants (ChatGPT, Copilot, etc.), you must include an AI Use Log at the end of your document documenting:
| Tool Used | Prompt Given | How You Verified or Modified the Output |
|---|---|---|
| Gemini | what happened in oceania between 1952 and 2007 | |
| ChatGPT | | growth_rate <- gap_tidy %>% filter(year %in% c(1952, 2007)) %>% group_by(continent, year) %>% summarize(avg_gdp = mean(gdpPercap, na.rm = TRUE), .groups = "drop") %>% pivot_wider(names_from = year, values_from = avg_gdp, names_prefix = "gdp_") %>% mutate( absolute_growth = gdp_2007 - gdp_1952, growth_rate = ((gdp_2007 - gdp_1952)) ) |
| Function | What it does |
|---|---|
| select() | Keeps only specified columns |
| filter() | Keeps rows that meet conditions |
| mutate() | Adds or modifies columns |
| pivot_longer() | Reshapes wide to long |
| group_by() | Groups data for subsequent operations |
| summarize() | Reduces grouped data to summary stats |
| inner_join() | Combines two tables, keeping matching rows |
| distinct() | Keeps unique rows |
| slice_max() | Keeps rows with highest values |
| arrange() | Sorts rows |
| contains() | Helper for selecting columns with a pattern |