#This is a code in R for importing, cleaning, and transforming data on Gini indices per country. The data is read from a CSV file downloaded from a Kaggle dataset. The code is commented, which helps to understand each step. Here is a brief summary of each step:
# My objective was to analyze the global Gini index and observe how it has changed from 2000 to 2020. I wanted to identify which countries are more unequal and which ones are more egalitarian. One of my main goals was to create a map to visualize the geographical distribution of countries and to identify patterns and outliers. I also wanted to make it possible to filter the data using a button by year, to easily observe changes over time. Let's get started!
#the purpose of thus project is for entertainment and only to satisfy the viewers curiosity about the world and equality.
# tghis is the kaggle link: https://www.kaggle.com/datasets/ulrikthygepedersen/gini-index-per-country
# First, we load the tidyverse package, which contains a suite of packages for data manipulation and visualization.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
#I defined the file path to the input CSV file, which is stored in the filepath variable.
#Then, I used the read_csv() function from the readr package to read the CSV file into a data frame called gini_import.

filepath <-"~/Downloads/GinIndex/gini_by_country.csv"
gini_import <- read_csv(filepath)
## Rows: 1896 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country_code, country_name
## dbl (2): year, value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# I then counted the number of missing values in the data frame using the is.na() function to identify missing values and the sum() function to count them.
sum(is.na(gini_import))
## [1] 0
#After checking for missing values (there were none :)), I viewed a summary of the data frame using the summary() function.


summary(gini_import)
##  country_code       country_name            year          value      
##  Length:1896        Length:1896        Min.   :1967   Min.   :20.20  
##  Class :character   Class :character   1st Qu.:2001   1st Qu.:31.20  
##  Mode  :character   Mode  :character   Median :2008   Median :36.10  
##                                        Mean   :2006   Mean   :38.18  
##                                        3rd Qu.:2014   3rd Qu.:44.10  
##                                        Max.   :2021   Max.   :65.80
# country_code       country_name            year          value      
# Length:1896        Length:1896        Min.   :1967   Min.   :20.20  
# Class :character   Class :character   1st Qu.:2001   1st Qu.:31.20  
# Mode  :character   Mode  :character   Median :2008   Median :36.10  
#                                      Mean   :2006   Mean   :38.18  
#                                     3rd Qu.:2014   3rd Qu.:44.10  
#                                    Max.   :2021   Max.   :65.80  
# I cleaned data that was not necesarry
gini_import <- select(gini_import, -country_code)
#I didnt liked the way the data was structured, because there could be some N/A values hidden so then I reshaped the data from a long format to a wide format using the pivot_wider() function from the tidyr package. This function takes the existing columns in the data frame and spreads them out into multiple columns based on a specified key variable. In this case, I was using country_name as the key variable and value as the value variable.
gini_import <- pivot_wider(gini_import, names_from = country_name, values_from = value) 
gini_import
## # A tibble: 50 × 169
##     year Angola Albania United…¹ Argen…² Armenia Austr…³ Austria Azerb…⁴ Burundi
##    <dbl>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  2000   52      NA       NA      51.1    NA      NA      28.8    NA        NA
##  2  2008   42.7    30       NA      44.9    29.2    35.4    30.4    NA        NA
##  3  2018   51.3    30.1     26      41.3    34.4    34.3    30.8    NA        NA
##  4  1996   NA      27       NA      49.5    NA      NA      NA      NA        NA
##  5  2002   NA      31.7     NA      53.8    34.8    NA      NA      25.3      NA
##  6  2005   NA      30.6     NA      47.7    36      NA      28.7    26.6      NA
##  7  2012   NA      29       NA      41.3    29.6    NA      30.5    NA        NA
##  8  2014   NA      34.6     32.5    41.6    31.5    34.4    30.5    NA        NA
##  9  2015   NA      32.8     NA      NA      32.4    NA      30.5    NA        NA
## 10  2016   NA      33.7     NA      42      32.5    33.7    30.8    NA        NA
## # … with 40 more rows, 159 more variables: Belgium <dbl>, Benin <dbl>,
## #   `Burkina Faso` <dbl>, Bangladesh <dbl>, Bulgaria <dbl>,
## #   `Bosnia and Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
## #   Brazil <dbl>, Bhutan <dbl>, Botswana <dbl>,
## #   `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
## #   Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>,
## #   `Congo, Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, …
#problem lots of N/A, countries with no info in some years , so I am going to put an average of their own country to fill in the N/A
# I sorted the data frame by year in descending order, remove the first row, subset the data to include only years >= 2000, and replace the missing values with the mean of each variable using the arrange(), [ ], and apply() functions.
gini_import$year
##  [1] 2000 2008 2018 1996 2002 2005 2012 2014 2015 2016 2017 2019 2013 1980 1986
## [16] 1987 1991 1992 1993 1994 1995 1997 1998 1999 2001 2003 2004 2006 2007 2009
## [31] 2010 2011 2020 1981 1985 1989 1988 1983 1982 1984 1990 1971 1975 1978 1969
## [46] 1974 1979 2021 1977 1967
gini_import <- arrange(gini_import, desc(year))
gini_import
## # A tibble: 50 × 169
##     year Angola Albania United…¹ Argen…² Armenia Austr…³ Austria Azerb…⁴ Burundi
##    <dbl>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  2021   NA      NA       NA      NA      NA      NA      NA        NA    NA  
##  2  2020   NA      NA       NA      42.3    25.2    NA      NA        NA    NA  
##  3  2019   NA      30.8     NA      42.9    29.9    NA      30.2      NA    NA  
##  4  2018   51.3    30.1     26      41.3    34.4    34.3    30.8      NA    NA  
##  5  2017   NA      33.1     NA      41.1    33.6    NA      29.7      NA    NA  
##  6  2016   NA      33.7     NA      42      32.5    33.7    30.8      NA    NA  
##  7  2015   NA      32.8     NA      NA      32.4    NA      30.5      NA    NA  
##  8  2014   NA      34.6     32.5    41.6    31.5    34.4    30.5      NA    NA  
##  9  2013   NA      NA       32.5    40.9    30.6    NA      30.8      NA    38.6
## 10  2012   NA      29       NA      41.3    29.6    NA      30.5      NA    NA  
## # … with 40 more rows, 159 more variables: Belgium <dbl>, Benin <dbl>,
## #   `Burkina Faso` <dbl>, Bangladesh <dbl>, Bulgaria <dbl>,
## #   `Bosnia and Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
## #   Brazil <dbl>, Bhutan <dbl>, Botswana <dbl>,
## #   `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
## #   Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>,
## #   `Congo, Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, …
gini_import <- gini_import[-1, ]
gini_import <- gini_import [gini_import$year >= 2000, ]
gini_import
## # A tibble: 21 × 169
##     year Angola Albania United…¹ Argen…² Armenia Austr…³ Austria Azerb…⁴ Burundi
##    <dbl>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  2020   NA      NA       NA      42.3    25.2    NA      NA        NA    NA  
##  2  2019   NA      30.8     NA      42.9    29.9    NA      30.2      NA    NA  
##  3  2018   51.3    30.1     26      41.3    34.4    34.3    30.8      NA    NA  
##  4  2017   NA      33.1     NA      41.1    33.6    NA      29.7      NA    NA  
##  5  2016   NA      33.7     NA      42      32.5    33.7    30.8      NA    NA  
##  6  2015   NA      32.8     NA      NA      32.4    NA      30.5      NA    NA  
##  7  2014   NA      34.6     32.5    41.6    31.5    34.4    30.5      NA    NA  
##  8  2013   NA      NA       32.5    40.9    30.6    NA      30.8      NA    38.6
##  9  2012   NA      29       NA      41.3    29.6    NA      30.5      NA    NA  
## 10  2011   NA      NA       NA      42.6    29.4    NA      30.8      NA    NA  
## # … with 11 more rows, 159 more variables: Belgium <dbl>, Benin <dbl>,
## #   `Burkina Faso` <dbl>, Bangladesh <dbl>, Bulgaria <dbl>,
## #   `Bosnia and Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
## #   Brazil <dbl>, Bhutan <dbl>, Botswana <dbl>,
## #   `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
## #   Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>,
## #   `Congo, Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, …
gini_import[] <- apply(gini_import, 2, function(x) ifelse(is.na(x), mean (x, na.rm = TRUE), x))
gini_import
## # A tibble: 21 × 169
##     year Angola Albania United…¹ Argen…² Armenia Austr…³ Austria Azerb…⁴ Burundi
##    <dbl>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  2020   48.7    31.6     30.3    42.3    25.2    34.1    30.2    28.4    36  
##  2  2019   48.7    30.8     30.3    42.9    29.9    34.1    30.2    28.4    36  
##  3  2018   51.3    30.1     26      41.3    34.4    34.3    30.8    28.4    36  
##  4  2017   48.7    33.1     30.3    41.1    33.6    34.1    29.7    28.4    36  
##  5  2016   48.7    33.7     30.3    42      32.5    33.7    30.8    28.4    36  
##  6  2015   48.7    32.8     30.3    45.3    32.4    34.1    30.5    28.4    36  
##  7  2014   48.7    34.6     32.5    41.6    31.5    34.4    30.5    28.4    36  
##  8  2013   48.7    31.6     32.5    40.9    30.6    34.1    30.8    28.4    38.6
##  9  2012   48.7    29       30.3    41.3    29.6    34.1    30.5    28.4    36  
## 10  2011   48.7    31.6     30.3    42.6    29.4    34.1    30.8    28.4    36  
## # … with 11 more rows, 159 more variables: Belgium <dbl>, Benin <dbl>,
## #   `Burkina Faso` <dbl>, Bangladesh <dbl>, Bulgaria <dbl>,
## #   `Bosnia and Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
## #   Brazil <dbl>, Bhutan <dbl>, Botswana <dbl>,
## #   `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
## #   Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>,
## #   `Congo, Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, …
#Finally, I reshaped the data back to long format using the pivot_longer() function, and export the final data to a new CSV file called gini_code_final.csv.
gini_vi <- pivot_longer(gini_import, cols = -year, names_to = "country", values_to = "gini_index")
gini_vi
## # A tibble: 3,528 × 3
##     year country              gini_index
##    <dbl> <chr>                     <dbl>
##  1  2020 Angola                     48.7
##  2  2020 Albania                    31.6
##  3  2020 United Arab Emirates       30.3
##  4  2020 Argentina                  42.3
##  5  2020 Armenia                    25.2
##  6  2020 Australia                  34.1
##  7  2020 Austria                    30.2
##  8  2020 Azerbaijan                 28.4
##  9  2020 Burundi                    36  
## 10  2020 Belgium                    28.5
## # … with 3,518 more rows
#I converted gini_vi to a .csv file to work on the data visualization on tableau
library(tibble)
filepath <- "~/Downloads/GinIndex/gini_code_final.csv"
write_csv(gini_vi, filepath)
# the dashboard that goes along this chart is in this Tableau link: