For all questions, include the R code that you used to find your answer (show R chunk and outputs). Answers without supporting code will not receive credit (unless no code was required). Outputs without comments will not receive credit either: Write full sentences to describe your findings.
world_bank_pop is a built-in dataset in tidyverse. It contains information about total population, population growth, and urban population for countries around the world. Save it in your environment as myworld. Take a look at it with head(). Is the data tidy? Why or why not?
# your code goes below this line
library(tidyr)
data("world_bank_pop")
myworld <- world_bank_pop
head(myworld)
## # A tibble: 6 x 20
## country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW SP.URB.T… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4
## 2 ABW SP.URB.G… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2
## 3 ABW SP.POP.T… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5
## 4 ABW SP.POP.G… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1
## 5 AFG SP.URB.T… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6
## 6 AFG SP.URB.G… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0
## # … with 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## # `2015` <dbl>, `2016` <dbl>, `2017` <dbl>
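The data is not tidy. In tidy data, each variable has its own column, each observation has its own row, and each value has its own cell. Here the years 2000 to 2017 are spread across 18 columns even though year is a single variable, and the indicator column stores the names of four different variables instead of giving each one its own column.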
Using dplyr functions, count how many distinct countries there are in the dataset.
library(dplyr)
myworld %>% distinct(country)
## # A tibble: 264 x 1
## country
## <chr>
## 1 ABW
## 2 AFG
## 3 AGO
## 4 ALB
## 5 AND
## 6 ARB
## 7 ARE
## 8 ARG
## 9 ARM
## 10 ASM
## # … with 254 more rows
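There are 264 distinct country codes in the dataset. Some of them, such as ARB and WLD, stand for regional aggregates rather than individual countries.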
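distinct() lists the codes, and the count is then read off the tibble header. To return the count itself as a single number, summarize() with n_distinct() is one option. A minimal sketch on a small made-up table (the toy tibble is hypothetical and not part of world_bank_pop):

```r
library(dplyr)

# Hypothetical toy data: three country codes, two indicators each
toy <- tibble(
  country   = c("ABW", "ABW", "AFG", "AFG", "AGO", "AGO"),
  indicator = rep(c("SP.POP.TOTL", "SP.POP.GROW"), times = 3)
)

# One row, one column: the number of distinct country codes
n_codes <- toy %>% summarize(n = n_distinct(country))
n_codes
```

Applied to myworld, the same call would be expected to return 264, in line with the output above.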
Use one of the pivot functions to create a new dataset, myworld2, with the years 2000 to 2017 appearing as a numeric variable year, and with the different values for the indicator variable displayed in a variable called value. In this new dataset, how many lines are there per country? Why does it make sense?
myworld2 <- myworld %>%
  pivot_longer(!country & !indicator, names_to = "year", values_to = "value") %>%
  mutate(year = as.numeric(year))
myworld2
## # A tibble: 19,008 x 4
## country indicator year value
## <chr> <chr> <dbl> <dbl>
## 1 ABW SP.URB.TOTL 2000 42444
## 2 ABW SP.URB.TOTL 2001 43048
## 3 ABW SP.URB.TOTL 2002 43670
## 4 ABW SP.URB.TOTL 2003 44246
## 5 ABW SP.URB.TOTL 2004 44669
## 6 ABW SP.URB.TOTL 2005 44889
## 7 ABW SP.URB.TOTL 2006 44881
## 8 ABW SP.URB.TOTL 2007 44686
## 9 ABW SP.URB.TOTL 2008 44375
## 10 ABW SP.URB.TOTL 2009 44052
## # … with 18,998 more rows
Each country has 4 indicators, and each indicator now has one row per year from 2000 to 2017 (18 years), so there are 4 x 18 = 72 lines per country. This is consistent with the size of myworld2: 264 country codes x 72 = 19,008 rows.
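The rows-per-country arithmetic can also be checked directly with count(). The sketch below uses a hypothetical miniature of the wide table (2 countries, 2 indicators, 3 years), so the expected figure is 2 x 3 = 6 rows per country rather than 72:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: 2 countries x 2 indicators, 3 year columns
toy_wide <- tribble(
  ~country, ~indicator,    ~`2000`, ~`2001`, ~`2002`,
  "AAA",    "SP.POP.TOTL", 100,     110,     120,
  "AAA",    "SP.POP.GROW", 1.0,     1.1,     1.2,
  "BBB",    "SP.POP.TOTL", 200,     210,     220,
  "BBB",    "SP.POP.GROW", 2.0,     2.1,     2.2
)

# Same reshaping as for myworld2: year columns become rows
toy_long <- toy_wide %>%
  pivot_longer(-c(country, indicator), names_to = "year", values_to = "value") %>%
  mutate(year = as.numeric(year))

# Rows per country = indicators x years
rows_per_country <- toy_long %>% count(country)
rows_per_country
```

Running `myworld2 %>% count(country)` would give the same kind of table for the real data.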
Use a pivot function on myworld2 to create a new dataset, myworld3, with the different categories for the indicator variable appearing as their own variables. Use dplyr functions to rename SP.POP.GROW and SP.URB.GROW as pop_growth and pop_urb_growth respectively (for example, you might use rename). On this new dataset, use dplyr functions to find which country code had the highest population growth in 2017.
myworld3 <- myworld2 %>%
  pivot_wider(names_from = 'indicator', values_from = 'value') %>%
  rename(pop_growth = SP.POP.GROW, pop_urb_growth = SP.URB.GROW)
myworld3 %>% filter(year == 2017) %>% slice(which.max(pop_growth))
## # A tibble: 1 x 6
## country year SP.URB.TOTL pop_urb_growth SP.POP.TOTL pop_growth
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 OMN 2017 3874061 5.95 4636262 4.67
The country code with the highest population growth in 2017 was OMN (Oman), with a growth rate of about 4.67%.
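To return only the country code rather than the whole row, slice_max() followed by pull() is a possible alternative to slice(which.max(...)). A minimal sketch with made-up values (the toy tibble and its codes are hypothetical):

```r
library(dplyr)

# Hypothetical growth figures; the 2016 row is there to show the year filter matters
toy <- tibble(
  country    = c("AAA", "BBB", "CCC", "AAA"),
  year       = c(2017, 2017, 2017, 2016),
  pop_growth = c(1.2, 4.7, 0.3, 9.9)
)

top_code <- toy %>%
  filter(year == 2017) %>%        # keep 2017 only
  slice_max(pop_growth, n = 1) %>% # row with the largest growth
  pull(country)                    # extract the code as a character vector
top_code
```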
Using dplyr functions, find the ratio of urban growth compared to the population growth in the world for each year (Hint: the country code WLD represents the entire world). Using a visualization, describe how the percentage of urban population growth has changed over the years. Why does your graph not contradict the fact that the urban population worldwide is increasing over the years?
library(ggplot2)  # provides ggplot() and geom_smooth()
rat <- myworld3 %>%
  filter(country == 'WLD') %>%
  mutate(ratio = pop_urb_growth / pop_growth)
ggplot(data = rat) +
  geom_smooth(mapping = aes(x = year, y = ratio))
The rate at which people are moving to urban environments may fluctuate, but as long as urban growth stays positive, the total number of people in urban areas continues to increase year by year.
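The difference between a falling growth rate and a falling total can be illustrated with a few lines of arithmetic. The numbers below are hypothetical and chosen only to show the mechanism, not taken from the World Bank data:

```r
# Hypothetical numbers: the growth rate (%) falls each year but stays positive
growth_pct <- c(2.5, 2.3, 2.1, 1.9)
start_pop  <- 1000

# Population after each year of compounding growth
pop <- start_pop * cumprod(1 + growth_pct / 100)

all(diff(growth_pct) < 0)  # TRUE: the rate declines every year
all(diff(pop) > 0)         # TRUE: the total still rises every year
```

A declining curve for a growth rate therefore means slower growth, not a shrinking population.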
From the package countrycode, we will use a built-in dataset called codelist. Call the library and save this dataset as mycodes.
# Paste and run the following into your console (NOT HERE): install.packages("countrycode")
# Call the countrycode package
library(countrycode)
# your code goes below this line
mycodes <- countrycode::codelist
Using dplyr functions, modify mycodes to: 1. select only the variables continent, wb (World Bank code), and country.name.en (country name in English); 2. filter to keep countries in Europe only; 3. remove countries with missing wb code. On this new dataset, use dplyr functions to count how many countries there are in Europe with a World Bank code.
mycodes %>%
  select(continent, wb, country.name.en) %>%
  filter(continent == 'Europe') %>%
  filter(!is.na(wb)) %>%  # drop rows with a missing World Bank code
  count()
## # A tibble: 1 x 1
## n
## <int>
## 1 46
There are 46 countries in Europe with a World Bank code (wb).
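Rows with a missing wb code can be removed with filter(!is.na(wb)), which looks at that one column only. By contrast, base R's na.omit() drops a row when any column is missing and does not take a column name. A sketch of the difference on hypothetical data (the codes and names are made up):

```r
library(dplyr)

# Hypothetical codes: one row lacks wb, another lacks the English name
toy_codes <- tibble(
  continent       = c("Europe", "Europe", "Europe"),
  wb              = c("AAA", NA, "CCC"),
  country.name.en = c("Alpha", "Beta", NA)
)

only_wb  <- toy_codes %>% filter(!is.na(wb))  # drops only the row with missing wb
any_miss <- toy_codes %>% na.omit()           # drops every row containing an NA

nrow(only_wb)
nrow(any_miss)
```

The two approaches agree only when wb is the sole column with missing values.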
Use a left_join() function to create a new dataset, myeurope, to add data to the countries in mycodes dataset from myworld3 dataset. Match the two datasets based on the World Bank code. Using dplyr functions, change the name of the variable containing the World Bank code to country. How many rows are there in this new dataset? Why does it make sense?
mycodes <- mycodes %>%
  select(continent, wb, country.name.en) %>%
  filter(continent == 'Europe') %>%
  filter(!is.na(wb)) %>%  # drop rows with a missing World Bank code
  rename(country = wb)
myeurope <- mycodes %>% left_join(myworld3, by = 'country')
myeurope
## # A tibble: 828 x 8
## continent country country.name.en year SP.URB.TOTL pop_urb_growth
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Europe ALB Albania 2000 1289391 0.742
## 2 Europe ALB Albania 2001 1298584 0.710
## 3 Europe ALB Albania 2002 1327220 2.18
## 4 Europe ALB Albania 2003 1354848 2.06
## 5 Europe ALB Albania 2004 1381828 1.97
## 6 Europe ALB Albania 2005 1407298 1.83
## 7 Europe ALB Albania 2006 1430886 1.66
## 8 Europe ALB Albania 2007 1452398 1.49
## 9 Europe ALB Albania 2008 1473392 1.44
## 10 Europe ALB Albania 2009 1495260 1.47
## # … with 818 more rows, and 2 more variables: SP.POP.TOTL <dbl>,
## # pop_growth <dbl>
#mycodes <- countrycode::codelist
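There are 828 rows in total. Each of the 46 European country codes in mycodes matches 18 rows in myworld3 (one per year from 2000 to 2017), and 46 x 18 = 828. A left join keeps every row of mycodes and repeats it once for each matching row in myworld3.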
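The row count of a left join can be reasoned about on a small example. In the sketch below, the two tibbles are hypothetical stand-ins for mycodes and myworld3: each code on the left matches three yearly rows on the right, and a code that exists only on the right is not carried over.

```r
library(dplyr)

# Hypothetical lookup table: one row per country code
codes <- tibble(country = c("AAA", "BBB"), name = c("Alpha", "Beta"))

# Hypothetical yearly data: three codes, three years each
yearly <- tibble(
  country    = rep(c("AAA", "BBB", "ZZZ"), each = 3),
  year       = rep(2000:2002, times = 3),
  pop_growth = c(1.0, 1.1, 1.2, 2.0, 2.1, 2.2, 3.0, 3.1, 3.2)
)

joined <- codes %>% left_join(yearly, by = "country")
nrow(joined)  # 2 codes x 3 years = 6; "ZZZ" is not in codes, so it is dropped
```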
Using dplyr functions, only keep information for the population growth in 2017, then compare the population growth per country with ggplot using geom_bar(). Make sure to order countries in order of population growth. Which country in Europe had the highest population growth in 2017?
PG <- myeurope %>% filter(year == 2017) %>% arrange(pop_growth)
ggplot(data = PG, aes(x = reorder(country, pop_growth), y = pop_growth)) +
geom_bar(stat="identity")+
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1))+
xlab("Country")+
ylab("Population Growth")
Luxembourg had the highest population growth among European countries in 2017.
Use map_data() to get geographic coordinates about countries in the world (see below). Take a look at the dataset mapWorld with glimpse(). What variable could we use to join this dataset with myeurope dataset?
# Geographic coordinates about countries in the world
mapWorld <- map_data("world")
# your code goes below this line
glimpse(mapWorld)
## Rows: 99,338
## Columns: 6
## $ long <dbl> -69.89912, -69.89571, -69.94219, -70.00415, -70.06612, -70.…
## $ lat <dbl> 12.45200, 12.42300, 12.43853, 12.50049, 12.54697, 12.59707,…
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, …
## $ region <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Arub…
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
We can join the datasets on region, which holds the country names in mapWorld and corresponds to country.name.en in myeurope. The match depends on both datasets spelling each country name the same way.
myeurope. Then use a left_join() to add data to the countries in myeurope dataset from mapWorld dataset, matching the two datasets based on the country name. If we then use dplyr functions, we can identify some missing values for lat and long in the new dataset. Indeed, some countries such as United Kingdom did not have a match. Why do you think this happened?
MapWorldRenamed <- mapWorld %>% rename(country = region)
myeurope2017 <- myeurope %>% filter(year == 2017) %>% rename(wb = country, country = country.name.en)
myeuro <- myeurope2017 %>% left_join(MapWorldRenamed, by='country')
The two datasets do not always spell country names the same way. mapWorld uses 'UK', while myeurope uses 'United Kingdom', so the join finds no match for that row and fills lat and long with missing values.
Use anti_join() instead of left_join() in the previous question. How many countries did not have an exact match? Note: anti_join() is a very useful function to identify differences between datasets.
# Stored under a new name so that myeurope is not overwritten
unmatched <- myeurope2017 %>% anti_join(MapWorldRenamed, by = 'country')
unmatched
## # A tibble: 5 x 8
## continent wb country year SP.URB.TOTL pop_urb_growth SP.POP.TOTL
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Europe BIH Bosnia… 2017 1679019 0.472 3507017
## 2 Europe CZE Czechia 2017 7803157 0.379 10591323
## 3 Europe GIB Gibral… 2017 34571 0.473 34571
## 4 Europe MKD North … 2017 1202983 0.415 2083160
## 5 Europe GBR United… 2017 54892898 0.958 66022273
## # … with 1 more variable: pop_growth <dbl>
Five countries did not have an exact match: Bosnia & Herzegovina, Czechia, Gibraltar, North Macedonia, and United Kingdom. Four of them are in mapWorld under a different name, while Gibraltar does not seem to have a match at all.
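The workflow of finding mismatched names with anti_join() and then repairing them can be sketched on a small example. The two tibbles below are hypothetical stand-ins for the growth data and the map data, with made-up growth values; the name pairs follow the spellings seen in this report.

```r
library(dplyr)

# Hypothetical stand-ins: growth data on the left, map names on the right
growth <- tibble(
  country    = c("France", "United Kingdom", "Czechia"),
  pop_growth = c(0.3, 0.6, 0.2)  # made-up values
)
shapes <- tibble(country = c("France", "UK", "Czech Republic"))

# Rows of growth with no partner in shapes
unmatched_names <- growth %>%
  anti_join(shapes, by = "country") %>%
  pull(country)
unmatched_names

# Recode the left-hand names to the spelling used on the right, then re-check
fixed <- growth %>%
  mutate(country = recode(country,
                          "United Kingdom" = "UK",
                          "Czechia" = "Czech Republic"))
still_unmatched <- fixed %>% anti_join(shapes, by = "country")
nrow(still_unmatched)
```

An empty result from the second anti_join() is a quick way to confirm that every name now has a partner.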
myeurope dataset corresponds to the name given in mapWorld dataset. Following this code, add a pipe and use a left_join() function to create the new dataset, mymap, adding data to the countries in myeurope dataset from mapWorld dataset.
# Remove `eval = FALSE` to run the code
# Starts from myeurope2017, which already holds the 2017 rows with the English name stored in `country`
mymap <- myeurope2017 %>%
  mutate(country_clean = recode(country,
    'United Kingdom' = 'UK',
    'Bosnia & Herzegovina' = 'Bosnia and Herzegovina',
    'Czechia' = 'Czech Republic',
    'North Macedonia' = 'Macedonia')) %>%
  # Match the cleaned country name against the region column of mapWorld
  left_join(mapWorld, by = c('country_clean' = 'region'))
With ggmap, use the R code provided below. Add a comment after each # to explain what each component of this code does. Note: it would be a good idea to run the code piece by piece to see what each layer adds to the plot. See if you can spot Luxembourg!
#Paste and run the following into your console (NOT HERE): install.packages("ggmap")
#When you are ready to run the code, remove `eval = FALSE` in the markdown
#Call the ggmap package
library(ggmap)
# Send the joined map data into ggplot
mymap %>%
  # Put longitude and latitude on the axes, draw one shape per group, and color each shape by population growth
  ggplot(aes(x = long, y = lat, group = group, fill = pop_growth)) +
  # Draw each country as a polygon with a black border
  geom_polygon(colour = "black") +
  # Fill from white (low growth) to blue (high growth) and show a color bar as the legend
  scale_fill_gradient(low = "white", high = "blue", guide = "colorbar") +
  # Set the legend title, the plot title, and the axis labels
  labs(fill = "Growth",
       title = "Population Growth in 2017",
       x = "Longitude", y = "Latitude") +
  # Restrict the axes to the longitude and latitude range that covers Europe
  xlim(-25, 50) + ylim(35, 70)
## sysname
## "Linux"
## release
## "4.15.0-171-generic"
## version
## "#180-Ubuntu SMP Wed Mar 2 17:25:05 UTC 2022"
## nodename
## "educcomp02.ccbb.utexas.edu"
## machine
## "x86_64"
## login
## "unknown"
## user
## "hcs867"
## effective_user
## "hcs867"