Creating a GitHub- Gist Acceptable Document

Introduction:

This is a simple template to create a reproducible, csv document that can be uploaded to GitHub Gist without using GistR and create.gist. I was inspired to wite this because I was taking a d3.js class from FreeCodeCamp. In the Data Visualization course the data is uploaded to Gist. While I understand why Carran did not use any specific language to augment the data. I wanted to be able to manipulate the data from the language I knew best, so that it was a reproducible method. I did find that there were some tricks to getting this to upload correctly to Gist. This code can be easily modified to suit any R user’s needs. The audience for this document is meant to those with only elementary knowledge of R needing this upload for d3.js. As I am sure there is probably a more intuitive way.

Selecting data

(The assumption right now is that the working directory is already set to the desired location and that the file does not already exist.) We are interested in the Countries and their populations from 1950-2020; convert the population to billions, and sort by most populous. First, we will begin by downloading data from the United Nations Population Prospectives website.

require(tidyverse)

## Loading required package: tidyverse

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

require(readxl)

## Loading required package: readxl

UNUrl <-"https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/EXCEL_FILES/1_Population/WPP2019_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.xlsx"
temp <-tempfile()
download.file(UNUrl, destfile = temp, mode="wb")
World_population <- read_excel(temp)

## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...

At this point I usually inspect the data to ensure that it uploaded in a workable form.

View(World_population)

I see that the first seven rows are basically useless to my needs. So, I will use tidyverse’s call slice to remove these.

World_population_2 <-World_population%>%
  slice(-c(1:11))
head(World_population_2)

## # A tibble: 6 x 78
##   `United Nations` ...2   ...3   ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11
##   <chr>            <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Index            Varia~ Regio~ Notes Coun~ Type  Pare~ 1950  1951  1952  1953 
## 2 1                Estim~ WORLD  <NA>  900   World 0     2536~ 2584~ 2630~ 2677~
## 3 2                Estim~ UN de~ a     1803  Labe~ 900   ...   ...   ...   ...  
## 4 3                Estim~ More ~ b     901   Deve~ 1803  8148~ 8240~ 8337~ 8437~
## 5 4                Estim~ Less ~ c     902   Deve~ 1803  1721~ 1760~ 1797~ 1833~
## 6 5                Estim~ Least~ d     941   Deve~ 902   1954~ 1991~ 2030~ 2069~
## # ... with 67 more variables: ...12 <chr>, ...13 <chr>, ...14 <chr>,
## #   ...15 <chr>, ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>,
## #   ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, ...24 <chr>,
## #   ...25 <chr>, ...26 <chr>, ...27 <chr>, ...28 <chr>, ...29 <chr>,
## #   ...30 <chr>, ...31 <chr>, ...32 <chr>, ...33 <chr>, ...34 <chr>,
## #   ...35 <chr>, ...36 <chr>, ...37 <chr>, ...38 <chr>, ...39 <chr>,
## #   ...40 <chr>, ...41 <chr>, ...42 <chr>, ...43 <chr>, ...44 <chr>,
## #   ...45 <chr>, ...46 <chr>, ...47 <chr>, ...48 <chr>, ...49 <chr>,
## #   ...50 <chr>, ...51 <chr>, ...52 <chr>, ...53 <chr>, ...54 <chr>,
## #   ...55 <chr>, ...56 <chr>, ...57 <chr>, ...58 <chr>, ...59 <chr>,
## #   ...60 <chr>, ...61 <chr>, ...62 <chr>, ...63 <chr>, ...64 <chr>,
## #   ...65 <chr>, ...66 <chr>, ...67 <chr>, ...68 <chr>, ...69 <chr>,
## #   ...70 <chr>, ...71 <chr>, ...72 <chr>, ...73 <chr>, ...74 <chr>,
## #   ...75 <chr>, ...76 <chr>, ...77 <chr>, ...78 <chr>

As we can see row 1 contains all the column names so lets make that happen; then, we will drop that row.

names(World_population_2) <- World_population_2 %>% slice(1) %>% unlist()
World_population_3 <- World_population_2 %>% slice(-1)
head(World_population_2)

## # A tibble: 6 x 78
##   Index Variant `Region, subreg~ Notes `Country code` Type  `Parent code` `1950`
##   <chr> <chr>   <chr>            <chr> <chr>          <chr> <chr>         <chr> 
## 1 Index Variant Region, subregi~ Notes Country code   Type  Parent code   1950  
## 2 1     Estima~ WORLD            <NA>  900            World 0             25364~
## 3 2     Estima~ UN development ~ a     1803           Labe~ 900           ...   
## 4 3     Estima~ More developed ~ b     901            Deve~ 1803          81481~
## 5 4     Estima~ Less developed ~ c     902            Deve~ 1803          17216~
## 6 5     Estima~ Least developed~ d     941            Deve~ 902           19542~
## # ... with 70 more variables: 1951 <chr>, 1952 <chr>, 1953 <chr>, 1954 <chr>,
## #   1955 <chr>, 1956 <chr>, 1957 <chr>, 1958 <chr>, 1959 <chr>, 1960 <chr>,
## #   1961 <chr>, 1962 <chr>, 1963 <chr>, 1964 <chr>, 1965 <chr>, 1966 <chr>,
## #   1967 <chr>, 1968 <chr>, 1969 <chr>, 1970 <chr>, 1971 <chr>, 1972 <chr>,
## #   1973 <chr>, 1974 <chr>, 1975 <chr>, 1976 <chr>, 1977 <chr>, 1978 <chr>,
## #   1979 <chr>, 1980 <chr>, 1981 <chr>, 1982 <chr>, 1983 <chr>, 1984 <chr>,
## #   1985 <chr>, 1986 <chr>, 1987 <chr>, 1988 <chr>, 1989 <chr>, 1990 <chr>,
## #   1991 <chr>, 1992 <chr>, 1993 <chr>, 1994 <chr>, 1995 <chr>, 1996 <chr>,
## #   1997 <chr>, 1998 <chr>, 1999 <chr>, 2000 <chr>, 2001 <chr>, 2002 <chr>,
## #   2003 <chr>, 2004 <chr>, 2005 <chr>, 2006 <chr>, 2007 <chr>, 2008 <chr>,
## #   2009 <chr>, 2010 <chr>, 2011 <chr>, 2012 <chr>, 2013 <chr>, 2014 <chr>,
## #   2015 <chr>, 2016 <chr>, 2017 <chr>, 2018 <chr>, 2019 <chr>, 2020 <chr>

Well, this is looking better, but there is alot of superfulous information and renaming of columns that need to be done.

Selecting only the information we desire:

 From these lines we will deselect any information we don't want and filter by just Country using `filter`.

Country_population <- World_population_3%>%
  filter (Type=="Country/Area")%>%
  select (-Index, -Variant, -Notes, -`Country code`, -`Parent code`)

and we will change the name of the ’ Region, subregion, country or Area *’, to a name that is more meaningful:

colnames(Country_population)[1] <-"Country"

Convert to Billions

During the import R assign as.character to the populations; we will need to change all 76 of them to as.numeric. Rather than calling all of those variables individually we will use mutate_at. This is another function in tidyverse.

Country_population_2<-Country_population %>%
  mutate_at(vars(-"Country", -"Type"), as.numeric)

We will write our own function to multiple all the populations by 1000.

convert_to_billions <-function(x){
  y=x*1000
  
  return(y)
}

Country_population_3 <-Country_population_2%>%
  mutate_at(vars(-"Country", -"Type"), convert_to_billions)

Now we will sort from greatest to least from the year 2020:

Country_population_3<-Country_population_3 %>% 
  #group_by(Country)%>%
  arrange(desc(Country_population_3$'2020'))

Getting it into an uploadable form

Most people would at this point would think that “Oh, now we just need to put it into a CSV and viola copy paste to Gist.” But isn’t QUITE that simple because Gist does not like quotes. Or atleast I have found in my case, it does not like quotes.

So without further delay, the money shots.

Country_population_3<-Country_population_3%>%
  mutate_all( as.character)

You will notice the addition of quote=FALSE in the write file. This is very important and cannot be left out when make a Gist-friendly file in R.

write.csv(Country_population_3,file= "World_Population_3.csv", row.names=FALSE, quote=FALSE)

Now, all that is left is uploading it your Gist account, which I will leave to the user to do.