#This is a code in R for importing, cleaning, and transforming data on Gini indices per country. The data is read from a CSV file downloaded from a Kaggle dataset. The code is commented, which helps to understand each step. Here is a brief summary of each step:
# My objective was to analyze the global Gini index and observe how it has changed from 2000 to 2020. I wanted to identify which countries are more unequal and which ones are more egalitarian. One of my main goals was to create a map to visualize the geographical distribution of countries and to identify patterns and outliers. I also wanted to make it possible to filter the data using a button by year, to easily observe changes over time. Let's get started!
#the purpose of thus project is for entertainment and only to satisfy the viewers curiosity about the world and equality.
# tghis is the kaggle link: https://www.kaggle.com/datasets/ulrikthygepedersen/gini-index-per-country
# First, we load the tidyverse package, which contains a suite of packages for data manipulation and visualization.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
#I defined the file path to the input CSV file, which is stored in the filepath variable.
#Then, I used the read_csv() function from the readr package to read the CSV file into a data frame called gini_import.
filepath <-"~/Downloads/GinIndex/gini_by_country.csv"
gini_import <- read_csv(filepath)
## Rows: 1896 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country_code, country_name
## dbl (2): year, value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# I then counted the number of missing values in the data frame using the is.na() function to identify missing values and the sum() function to count them.
sum(is.na(gini_import))
## [1] 0
#After checking for missing values (there were none :)), I viewed a summary of the data frame using the summary() function.
summary(gini_import)
## country_code country_name year value
## Length:1896 Length:1896 Min. :1967 Min. :20.20
## Class :character Class :character 1st Qu.:2001 1st Qu.:31.20
## Mode :character Mode :character Median :2008 Median :36.10
## Mean :2006 Mean :38.18
## 3rd Qu.:2014 3rd Qu.:44.10
## Max. :2021 Max. :65.80
# country_code country_name year value
# Length:1896 Length:1896 Min. :1967 Min. :20.20
# Class :character Class :character 1st Qu.:2001 1st Qu.:31.20
# Mode :character Mode :character Median :2008 Median :36.10
# Mean :2006 Mean :38.18
# 3rd Qu.:2014 3rd Qu.:44.10
# Max. :2021 Max. :65.80
# I cleaned data that was not necesarry
gini_import <- select(gini_import, -country_code)
#I didnt liked the way the data was structured, because there could be some N/A values hidden so then I reshaped the data from a long format to a wide format using the pivot_wider() function from the tidyr package. This function takes the existing columns in the data frame and spreads them out into multiple columns based on a specified key variable. In this case, I was using country_name as the key variable and value as the value variable.
gini_import <- pivot_wider(gini_import, names_from = country_name, values_from = value)
gini_import
## # A tibble: 50 × 169
## year Angola Albania United…¹ Argen…² Armenia Austr…³ Austria Azerb…⁴ Burundi
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000 52 NA NA 51.1 NA NA 28.8 NA NA
## 2 2008 42.7 30 NA 44.9 29.2 35.4 30.4 NA NA
## 3 2018 51.3 30.1 26 41.3 34.4 34.3 30.8 NA NA
## 4 1996 NA 27 NA 49.5 NA NA NA NA NA
## 5 2002 NA 31.7 NA 53.8 34.8 NA NA 25.3 NA
## 6 2005 NA 30.6 NA 47.7 36 NA 28.7 26.6 NA
## 7 2012 NA 29 NA 41.3 29.6 NA 30.5 NA NA
## 8 2014 NA 34.6 32.5 41.6 31.5 34.4 30.5 NA NA
## 9 2015 NA 32.8 NA NA 32.4 NA 30.5 NA NA
## 10 2016 NA 33.7 NA 42 32.5 33.7 30.8 NA NA
## # … with 40 more rows, 159 more variables: Belgium <dbl>, Benin <dbl>,
## # `Burkina Faso` <dbl>, Bangladesh <dbl>, Bulgaria <dbl>,
## # `Bosnia and Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
## # Brazil <dbl>, Bhutan <dbl>, Botswana <dbl>,
## # `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
## # Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>,
## # `Congo, Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, …
#problem lots of N/A, countries with no info in some years , so I am going to put an average of their own country to fill in the N/A
# I sorted the data frame by year in descending order, remove the first row, subset the data to include only years >= 2000, and replace the missing values with the mean of each variable using the arrange(), [ ], and apply() functions.
gini_import$year
## [1] 2000 2008 2018 1996 2002 2005 2012 2014 2015 2016 2017 2019 2013 1980 1986
## [16] 1987 1991 1992 1993 1994 1995 1997 1998 1999 2001 2003 2004 2006 2007 2009
## [31] 2010 2011 2020 1981 1985 1989 1988 1983 1982 1984 1990 1971 1975 1978 1969
## [46] 1974 1979 2021 1977 1967
gini_import <- arrange(gini_import, desc(year))
gini_import
## # A tibble: 50 × 169
## year Angola Albania United…¹ Argen…² Armenia Austr…³ Austria Azerb…⁴ Burundi
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021 NA NA NA NA NA NA NA NA NA
## 2 2020 NA NA NA 42.3 25.2 NA NA NA NA
## 3 2019 NA 30.8 NA 42.9 29.9 NA 30.2 NA NA
## 4 2018 51.3 30.1 26 41.3 34.4 34.3 30.8 NA NA
## 5 2017 NA 33.1 NA 41.1 33.6 NA 29.7 NA NA
## 6 2016 NA 33.7 NA 42 32.5 33.7 30.8 NA NA
## 7 2015 NA 32.8 NA NA 32.4 NA 30.5 NA NA
## 8 2014 NA 34.6 32.5 41.6 31.5 34.4 30.5 NA NA
## 9 2013 NA NA 32.5 40.9 30.6 NA 30.8 NA 38.6
## 10 2012 NA 29 NA 41.3 29.6 NA 30.5 NA NA
## # … with 40 more rows, 159 more variables: Belgium <dbl>, Benin <dbl>,
## # `Burkina Faso` <dbl>, Bangladesh <dbl>, Bulgaria <dbl>,
## # `Bosnia and Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
## # Brazil <dbl>, Bhutan <dbl>, Botswana <dbl>,
## # `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
## # Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>,
## # `Congo, Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, …
gini_import <- gini_import[-1, ]
gini_import <- gini_import [gini_import$year >= 2000, ]
gini_import
## # A tibble: 21 × 169
## year Angola Albania United…¹ Argen…² Armenia Austr…³ Austria Azerb…⁴ Burundi
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 NA NA NA 42.3 25.2 NA NA NA NA
## 2 2019 NA 30.8 NA 42.9 29.9 NA 30.2 NA NA
## 3 2018 51.3 30.1 26 41.3 34.4 34.3 30.8 NA NA
## 4 2017 NA 33.1 NA 41.1 33.6 NA 29.7 NA NA
## 5 2016 NA 33.7 NA 42 32.5 33.7 30.8 NA NA
## 6 2015 NA 32.8 NA NA 32.4 NA 30.5 NA NA
## 7 2014 NA 34.6 32.5 41.6 31.5 34.4 30.5 NA NA
## 8 2013 NA NA 32.5 40.9 30.6 NA 30.8 NA 38.6
## 9 2012 NA 29 NA 41.3 29.6 NA 30.5 NA NA
## 10 2011 NA NA NA 42.6 29.4 NA 30.8 NA NA
## # … with 11 more rows, 159 more variables: Belgium <dbl>, Benin <dbl>,
## # `Burkina Faso` <dbl>, Bangladesh <dbl>, Bulgaria <dbl>,
## # `Bosnia and Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
## # Brazil <dbl>, Bhutan <dbl>, Botswana <dbl>,
## # `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
## # Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>,
## # `Congo, Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, …
gini_import[] <- apply(gini_import, 2, function(x) ifelse(is.na(x), mean (x, na.rm = TRUE), x))
gini_import
## # A tibble: 21 × 169
## year Angola Albania United…¹ Argen…² Armenia Austr…³ Austria Azerb…⁴ Burundi
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 48.7 31.6 30.3 42.3 25.2 34.1 30.2 28.4 36
## 2 2019 48.7 30.8 30.3 42.9 29.9 34.1 30.2 28.4 36
## 3 2018 51.3 30.1 26 41.3 34.4 34.3 30.8 28.4 36
## 4 2017 48.7 33.1 30.3 41.1 33.6 34.1 29.7 28.4 36
## 5 2016 48.7 33.7 30.3 42 32.5 33.7 30.8 28.4 36
## 6 2015 48.7 32.8 30.3 45.3 32.4 34.1 30.5 28.4 36
## 7 2014 48.7 34.6 32.5 41.6 31.5 34.4 30.5 28.4 36
## 8 2013 48.7 31.6 32.5 40.9 30.6 34.1 30.8 28.4 38.6
## 9 2012 48.7 29 30.3 41.3 29.6 34.1 30.5 28.4 36
## 10 2011 48.7 31.6 30.3 42.6 29.4 34.1 30.8 28.4 36
## # … with 11 more rows, 159 more variables: Belgium <dbl>, Benin <dbl>,
## # `Burkina Faso` <dbl>, Bangladesh <dbl>, Bulgaria <dbl>,
## # `Bosnia and Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
## # Brazil <dbl>, Bhutan <dbl>, Botswana <dbl>,
## # `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
## # Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>,
## # `Congo, Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, …
#Finally, I reshaped the data back to long format using the pivot_longer() function, and export the final data to a new CSV file called gini_code_final.csv.
gini_vi <- pivot_longer(gini_import, cols = -year, names_to = "country", values_to = "gini_index")
gini_vi
## # A tibble: 3,528 × 3
## year country gini_index
## <dbl> <chr> <dbl>
## 1 2020 Angola 48.7
## 2 2020 Albania 31.6
## 3 2020 United Arab Emirates 30.3
## 4 2020 Argentina 42.3
## 5 2020 Armenia 25.2
## 6 2020 Australia 34.1
## 7 2020 Austria 30.2
## 8 2020 Azerbaijan 28.4
## 9 2020 Burundi 36
## 10 2020 Belgium 28.5
## # … with 3,518 more rows
#I converted gini_vi to a .csv file to work on the data visualization on tableau
library(tibble)
filepath <- "~/Downloads/GinIndex/gini_code_final.csv"
write_csv(gini_vi, filepath)
# the dashboard that goes along this chart is in this Tableau link: