HW 7

Sterling Hall HCS867

For all questions, include the R code that you used to find your answer (show R chunk and outputs). Answers without supporting code will not receive credit (unless no code was required). Outputs without comments will not receive credit either: Write full sentences to describe your findings.

Question 1: (0.5 pt)

The dataset `world_bank_pop` is a built-in dataset in `tidyverse`. It contains information about total population, population growth, and urban population, for countries around the world.

Save the dataset in your environment as `myworld`. Take a look at it with `head()`. Is the data tidy? Why or why not?

# your code goes below this line
library(tidyr)
data("world_bank_pop")
myworld <- world_bank_pop
head(myworld)

## # A tibble: 6 x 20
##   country indicator `2000` `2001` `2002` `2003`  `2004`  `2005`   `2006`
##   <chr>   <chr>      <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 ABW     SP.URB.T… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4  4.49e+4
## 2 ABW     SP.URB.G… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2
## 3 ABW     SP.POP.T… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5  1.01e+5
## 4 ABW     SP.POP.G… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0  7.98e-1
## 5 AFG     SP.URB.T… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6  5.93e+6
## 6 AFG     SP.URB.G… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0  4.12e+0
## # … with 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## #   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>

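The data is not tidy. In tidy data, each variable has its own column, each observation has its own row, and each value has its own cell. Here, the years 2000 through 2017 are spread across 18 separate columns even though year is a single variable, and the `indicator` column stores the names of four different variables rather than values.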

Question 2: (0.5 pt)

Using `dplyr` functions, count how many distinct countries there are in the dataset.

library(dplyr)

myworld %>% distinct(country)

## # A tibble: 264 x 1
##    country
##    <chr>  
##  1 ABW    
##  2 AFG    
##  3 AGO    
##  4 ALB    
##  5 AND    
##  6 ARB    
##  7 ARE    
##  8 ARG    
##  9 ARM    
## 10 ASM    
## # … with 254 more rows

There are 264 distinct country codes in the dataset. Some of them, such as `WLD` (the entire world), are aggregates rather than individual countries.
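To get the count as a single number rather than reading it off the tibble header, `n_distinct()` can be used inside `summarize()`. The sketch below uses a small made-up data frame (`toy` is hypothetical and not part of the assignment data).

```r
library(dplyr)

# Hypothetical toy data: 3 country codes, 2 rows each
toy <- data.frame(
  country   = c("ABW", "ABW", "AFG", "AFG", "WLD", "WLD"),
  indicator = rep(c("SP.POP.TOTL", "SP.POP.GROW"), times = 3)
)

# summarize() with n_distinct() returns the count itself as a one-row result
n_codes <- toy %>% summarize(n = n_distinct(country))
n_codes
```

The same pattern applied to `myworld` would be `myworld %>% summarize(n = n_distinct(country))`.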

Question 3: (1 pts)

Use one of the `pivot` functions to create a new dataset, `myworld2`, with the years 2000 to 2017 appearing as a numeric variable `year`, and with the different values for the indicator variable displayed in a variable called `value`. In this new dataset, how many lines are there per country? Why does it make sense?

myworld2 <- myworld %>% pivot_longer(!country & !indicator, names_to = "year", values_to = "value") %>% 
  mutate(year = as.numeric(year))
myworld2

## # A tibble: 19,008 x 4
##    country indicator    year value
##    <chr>   <chr>       <dbl> <dbl>
##  1 ABW     SP.URB.TOTL  2000 42444
##  2 ABW     SP.URB.TOTL  2001 43048
##  3 ABW     SP.URB.TOTL  2002 43670
##  4 ABW     SP.URB.TOTL  2003 44246
##  5 ABW     SP.URB.TOTL  2004 44669
##  6 ABW     SP.URB.TOTL  2005 44889
##  7 ABW     SP.URB.TOTL  2006 44881
##  8 ABW     SP.URB.TOTL  2007 44686
##  9 ABW     SP.URB.TOTL  2008 44375
## 10 ABW     SP.URB.TOTL  2009 44052
## # … with 18,998 more rows

There are 4 indicators per country and 18 years from 2000 to 2017, so each country has 4 x 18 = 72 rows. This makes sense because pivoting turns every year column into its own row for each country and indicator pair, and it agrees with the total row count: 264 x 72 = 19,008.
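The rows-per-country arithmetic can be checked directly with `count()`. The following is a minimal sketch on a made-up wide table (`wide`, its country codes, and its values are all hypothetical) with 2 indicators and 3 year columns, so each country should end up with 2 x 3 = 6 rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: 2 countries x 2 indicators, 3 year columns
wide <- data.frame(
  country   = c("AAA", "AAA", "BBB", "BBB"),
  indicator = c("pop", "grow", "pop", "grow"),
  `2000` = c(10, 1.0, 20, 2.0),
  `2001` = c(11, 1.1, 21, 2.1),
  `2002` = c(12, 1.2, 22, 2.2),
  check.names = FALSE  # keep the year column names as "2000", "2001", "2002"
)

# Stack the year columns into a single numeric `year` variable
long <- wide %>%
  pivot_longer(-c(country, indicator), names_to = "year", values_to = "value") %>%
  mutate(year = as.numeric(year))

# Count the rows for each country
rows_per_country <- long %>% count(country)
rows_per_country
```

On the assignment data, `myworld2 %>% count(country)` should report 72 rows for every country code if the reasoning above holds.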

Question 4: (1 pts)

Use another `pivot` function on `myworld2` to create a new dataset, `myworld3`, with the different categories for the indicator variable appearing as their own variables. Use `dplyr` functions to rename `SP.POP.GROW` and `SP.URB.GROW`, as `pop_growth` and `pop_urb_growth` respectively (for example, you might use `rename`). On this new dataset, use `dplyr` functions to find which country code had the highest population growth in 2017.

myworld3 <- myworld2 %>%
  pivot_wider(names_from = 'indicator', values_from = 'value') %>%
  rename(pop_growth = SP.POP.GROW, pop_urb_growth = SP.URB.GROW)



myworld3 %>%  filter(year == 2017) %>% slice(which.max(pop_growth))

## # A tibble: 1 x 6
##   country  year SP.URB.TOTL pop_urb_growth SP.POP.TOTL pop_growth
##   <chr>   <dbl>       <dbl>          <dbl>       <dbl>      <dbl>
## 1 OMN      2017     3874061           5.95     4636262       4.67

The country code with the highest population growth in 2017 was `OMN`, with a growth rate of about 4.67%.

Question 5: (1 pts)

Using `dplyr` functions, find the ratio of urban growth compared to the population growth in the world for each year (Hint: the country code `WLD` represents the entire world). Using a visualization, describe how the percentage of urban population growth has changed over the years. Why does your graph not contradict the fact that the urban population worldwide is increasing over the years?

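# ggplot2 is needed for the plot below
library(ggplot2)

# Keep the world aggregate and compute urban growth relative to total growth
rat <- myworld3 %>%
  filter(country == 'WLD') %>%
  mutate(ratio = pop_urb_growth / pop_growth)

ggplot(data = rat) +
  geom_smooth(mapping = aes(x = year, y = ratio))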

The rate at which people are moving to urban environments may fluctuate, but as long as the urban growth rate stays positive, the total number of people in urban areas continues to increase year by year.
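A small worked example may make this clearer. The numbers below are invented for illustration (they are not World Bank figures): the growth rate falls every year, yet the population it applies to keeps rising because the rate never drops below zero.

```r
# Hypothetical yearly urban growth rates (%), shrinking but still positive
growth_pct <- c(3.0, 2.5, 2.0, 1.5, 1.0)

# Hypothetical starting urban population
start_pop <- 1000

# Population after each year: apply (1 + rate) cumulatively
urban_pop <- start_pop * cumprod(1 + growth_pct / 100)

data.frame(growth_pct, urban_pop = round(urban_pop, 1))

# Check whether the population rose every year despite the falling rate
all(diff(urban_pop) > 0)
```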

Question 6: (1 pts)

When we answered question 4, we had no idea what actual country was represented by the three-letter code. We will now use a package that has information about the coding system used by the World Bank.

In the package `countrycode`, we will use a built-in dataset called `codelist`. Call the library and save this dataset as `mycodes`.

# Paste and run the following into your console (NOT HERE): install.packages("countrycode")

# Call the countrycode package
library(countrycode)
# your code goes below this line
mycodes <- countrycode::codelist

Using `dplyr` functions, modify `mycodes` to: 1. select only the variables continent, `wb` (World Bank code), and `country.name.en` (country name in English); 2. filter to keep countries in Europe only; 3. remove countries with missing `wb` code. On this new dataset, use `dplyr` functions to count how many countries there are in Europe with a World Bank code.

mycodes %>% select(continent, wb, country.name.en) %>% filter(continent == 'Europe', !is.na(wb)) %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1    46

There are 46 countries in Europe with a World Bank (`wb`) code.
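When removing rows with a missing `wb` code, note that `na.omit()` and `filter(!is.na(wb))` are not equivalent in general: `na.omit()` drops a row if any column is missing, while the `filter()` form looks at `wb` only. The sketch below shows the difference on a made-up table (`codes` and its `note` column are hypothetical).

```r
library(dplyr)

# Hypothetical lookup table with missing values in two different columns
codes <- data.frame(
  continent = c("Europe", "Europe", "Europe"),
  wb        = c("ALB", NA, "AND"),
  note      = c("a", "b", NA)
)

# na.omit() drops every row that has an NA in any column
kept_na_omit <- na.omit(codes)

# filter(!is.na(wb)) drops only the rows where wb itself is missing
kept_filter <- codes %>% filter(!is.na(wb))

nrow(kept_na_omit)
nrow(kept_filter)
```

In this assignment both approaches give the same count after the `select()` and `filter()` steps, but the targeted form is safer when other columns may contain missing values.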

Question 7: (1 pt)

Use a `left_join()` function to create a new dataset, `myeurope`, to add data to the countries in `mycodes` dataset from `myworld3` dataset. Match the two datasets based on the World Bank code. Using `dplyr` functions, change the name of the variable containing the World Bank code to `country`. How many rows are there in this new dataset? Why does it make sense?

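# Keep European countries with a World Bank code and rename the code column
mycodes <- mycodes %>%
  select(continent, wb, country.name.en) %>%
  filter(continent == 'Europe', !is.na(wb)) %>%
  rename(country = wb)

# Add the yearly indicators from myworld3 to each European country
myeurope <- mycodes %>% left_join(myworld3, by = 'country')
myeurope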

## # A tibble: 828 x 8
##    continent country country.name.en  year SP.URB.TOTL pop_urb_growth
##    <chr>     <chr>   <chr>           <dbl>       <dbl>          <dbl>
##  1 Europe    ALB     Albania          2000     1289391          0.742
##  2 Europe    ALB     Albania          2001     1298584          0.710
##  3 Europe    ALB     Albania          2002     1327220          2.18 
##  4 Europe    ALB     Albania          2003     1354848          2.06 
##  5 Europe    ALB     Albania          2004     1381828          1.97 
##  6 Europe    ALB     Albania          2005     1407298          1.83 
##  7 Europe    ALB     Albania          2006     1430886          1.66 
##  8 Europe    ALB     Albania          2007     1452398          1.49 
##  9 Europe    ALB     Albania          2008     1473392          1.44 
## 10 Europe    ALB     Albania          2009     1495260          1.47 
## # … with 818 more rows, and 2 more variables: SP.POP.TOTL <dbl>,
## #   pop_growth <dbl>

#mycodes <- countrycode::codelist

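There are a total of 828 rows after left joining the tables by `country`. This makes sense because a left join keeps every country in `mycodes` and repeats it once for each matching row in `myworld3`: each of the 46 European countries matches 18 yearly rows (2000 to 2017), and 46 x 18 = 828.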
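The way a left join multiplies rows can be seen on a small example. This is a sketch with invented tables (`lookup` and `yearly` are hypothetical): 2 countries in the left table, each matching 3 yearly rows, should give 2 x 3 = 6 rows, and a country that appears only in the right table should be dropped.

```r
library(dplyr)

# Hypothetical lookup: one row per country
lookup <- data.frame(country = c("AAA", "BBB"), name = c("Aland", "Bland"))

# Hypothetical yearly data: 3 rows per country, plus a country not in the lookup
yearly <- data.frame(
  country = rep(c("AAA", "BBB", "CCC"), each = 3),
  year    = rep(2000:2002, times = 3)
)

# Every lookup row is kept and repeated once per matching yearly row
joined <- lookup %>% left_join(yearly, by = "country")
nrow(joined)
```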

Question 8: (1 pt)

Using `dplyr` functions, only keep information for the population growth in 2017 then compare the population growth per country with `ggplot` using `geom_bar()`. Make sure to order countries in order of population growth. Which country in Europe had the highest population growth in 2017?

PG <- myeurope %>% filter(year == 2017) %>% arrange(pop_growth)

ggplot(data = PG, aes(x = reorder(country, pop_growth), y = pop_growth)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1)) +
  xlab("Country") +
  ylab("Population Growth")

Luxembourg had the highest population growth in Europe in 2017.

Question 9: (0.5 pt)

When dealing with location data, we can actually visualize information on a map if we have geographic information such as latitude and longitude.

Let’s use a built-in function called `map_data()` to get geographic coordinates about countries in the world (see below). Take a look at the dataset `mapWorld` with `glimpse()`. What variable could we use to join this dataset with `myeurope` dataset?

# Geographic coordinates about countries in the world
mapWorld <- map_data("world")

# your code goes below this line
glimpse(mapWorld)

## Rows: 99,338
## Columns: 6
## $ long      <dbl> -69.89912, -69.89571, -69.94219, -70.00415, -70.06612, -70.…
## $ lat       <dbl> 12.45200, 12.42300, 12.43853, 12.50049, 12.54697, 12.59707,…
## $ group     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ order     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, …
## $ region    <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Arub…
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

We can join the datasets using `region`. It contains country names, so it corresponds to the `country.name.en` variable in the `myeurope` dataset.

Question 10: (1 pts)

Only keep the year 2017 in the dataset `myeurope`. Then use a `left_join()` to add data to the countries in `myeurope` dataset from `mapWorld` dataset, matching the two datasets based on the country name. If we then use `dplyr` functions, we can identify some missing values for `lat` and `long` in the new dataset. Indeed, some countries such as United Kingdom did not have a match. Why do you think this happened?

MapWorldRenamed <- mapWorld %>% rename(country = region)
myeurope2017 <- myeurope %>% filter(year == 2017) %>% rename(wb = country, country = country.name.en)



myeuro <- myeurope2017 %>% left_join(MapWorldRenamed, by='country')

The two datasets spell some country names differently, and a join only matches values that are exactly equal. The `mapWorld` dataset uses ‘UK’, while `myeurope` uses ‘United Kingdom’, so no match is found and `lat` and `long` are left missing.

Question 11: (0.5 pt)

To identify all countries in 2017 that did not have an exact match, do an `anti_join()` instead of `left_join()` in the previous question. How many countries did not have an exact match? Note: `anti_join()` is a very useful function for identifying differences between datasets.

# Store the result under a new name so that `myeurope` is not overwritten
unmatched <- myeurope2017 %>% anti_join(MapWorldRenamed, by='country')
unmatched

## # A tibble: 5 x 8
##   continent wb    country  year SP.URB.TOTL pop_urb_growth SP.POP.TOTL
##   <chr>     <chr> <chr>   <dbl>       <dbl>          <dbl>       <dbl>
## 1 Europe    BIH   Bosnia…  2017     1679019          0.472     3507017
## 2 Europe    CZE   Czechia  2017     7803157          0.379    10591323
## 3 Europe    GIB   Gibral…  2017       34571          0.473       34571
## 4 Europe    MKD   North …  2017     1202983          0.415     2083160
## 5 Europe    GBR   United…  2017    54892898          0.958    66022273
## # … with 1 more variable: pop_growth <dbl>

Five countries did not have an exact match. Four of them (Bosnia & Herzegovina, Czechia, North Macedonia, and the United Kingdom) are spelled differently in `mapWorld`, while Gibraltar did not have a match at all.
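How `anti_join()` surfaces spelling mismatches can be illustrated with a toy example. The two tables below are invented for the sketch (`europe_names`, `map_names`, and the `long` values are hypothetical): only rows of the left table whose key has no exact counterpart in the right table are returned.

```r
library(dplyr)

# Hypothetical name lists with two spelling differences
europe_names <- data.frame(country = c("France", "United Kingdom", "Czechia"))
map_names    <- data.frame(
  country = c("France", "UK", "Czech Republic"),
  long    = c(2, -1, 15)  # made-up coordinates, for illustration only
)

# Keep the rows of europe_names with no exact match in map_names
unmatched_toy <- europe_names %>% anti_join(map_names, by = "country")
unmatched_toy
```

Only columns from the left table are returned, which is why the output above shows the `myeurope2017` variables and nothing from `mapWorld`.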

Question 12: (0.5 pt)

Joining datasets by variables containing names often leads to a mismatch because spelling can vary from one dataset to another. Sometimes we need to manually fix spelling in order to be able to match values. Consider the code given below. Replace the name of United Kingdom so that its name in `myeurope` dataset corresponds to the name given in `mapWorld` dataset. Following this code, add a pipe and use a `left_join()` function to create the new dataset, `mymap`, adding data to the countries in `myeurope` dataset from `mapWorld` dataset.

# Remove `eval = FALSE` to run the code
# `myeurope2017` (from Question 10) is already filtered to 2017 and its `country` column holds the English name
mymap <- myeurope2017 %>%
  mutate(country_clean = recode(country,
                                'United Kingdom' = 'UK',
                                'Bosnia & Herzegovina' = 'Bosnia and Herzegovina',
                                'Czechia' = 'Czech Republic',
                                'North Macedonia' = 'Macedonia')) %>%
  # Match the cleaned names against the `region` column of `mapWorld`
  left_join(mapWorld, by = c('country_clean' = 'region'))

Question 13: (0.5 pt)

Let’s visualize how population growth varies across European countries in 2017 with a map. With the package `ggmap`, use the R code provided below. Add a comment after each # to explain what each component of this code does. Note: it would be a good idea to run the code piece by piece to see what each layer adds to the plot. See if you can spot Luxembourg!

#Paste and run the following into your console (NOT HERE): install.packages("ggmap")
#When you are ready to run the code, remove `eval = FALSE` in the markdown
#Call the ggmap package
library(ggmap)
# Start from the joined dataset, which has one row per polygon vertex
mymap %>%
  # Put longitude and latitude on the axes, draw one shape per `group`, and fill by population growth
  ggplot(aes(x = long, y = lat, group = group, fill = pop_growth)) +
  # Draw each country as a filled polygon with a black border
  geom_polygon(colour = "black") +
  # Shade from white (low growth) to blue (high growth) and show a color bar as the legend
  scale_fill_gradient(low = "white", high = "blue",
                      guide = "colorbar") +
  # Set the legend title, the plot title, and the axis labels
  labs(fill = "Growth",
       title = "Population Growth in 2017",
       x = "Longitude", y = "Latitude") +
  # Restrict the axes so the map shows only the area around Europe
  xlim(-25, 50) + ylim(35, 70)

