In this project, I showcase preparing datasets for downstream analysis work; this specifically deals with wide datasets and getting them ready by making them follow the rules of a tidy dataset. For this project, I chose my own discussion item, world population data and projects.
Get table from MySQL server and get head of new DataFrame.
azuredb = dbConnect(MySQL(), user=params$dbuser, password=params$dbpass, dbname=params$dbname, host=params$dbhost)
pop <- dbGetQuery(azuredb,
"SELECT
*
FROM
world_population_data;")
head(pop)## MyUnknownColumn Rank country country_code 1980 2000
## 1 0 1 China CHN 982372466 1264099069
## 2 1 2 India IND 696828385 1059633675
## 3 2 3 United States USA 223140018 282398554
## 4 3 4 Indonesia IDN 148177096 214072421
## 5 4 5 Pakistan PAK 80624057 154369924
## 6 5 6 Nigeria NGA 72951439 122851984
## 2010 2021 2022 2030 2050 area landAreaKm
## 1 1348191368 1425893465 1425887337 1415605906 1312636325 9706961 9424703
## 2 1240613620 1407563842 1417173173 1514994080 1670490596 3287590 2973190
## 3 311182845 336997624 338289857 352162301 375391963 9372610 9147420
## 4 244016173 273753191 275501339 292150100 317225213 1904569 1877519
## 5 194454498 231402117 235824862 274029836 367808468 881912 770880
## 6 160952853 213401323 218541212 262580425 377459883 923768 910770
## growthRate worldPercentage density
## 1 0.0000 0.1788 151.2926
## 2 0.0068 0.1777 476.6507
## 3 0.0038 0.0424 36.9820
## 4 0.0064 0.0345 146.7369
## 5 0.0191 0.0296 305.9164
## 6 0.0241 0.0274 239.9521
As we can see, this is a wide df showing information for a single country over multiple years with a few unneeded columns. We’d like each row to be a single observation of population for a country. We should gather up these years that are columns, as they are values for ‘year’, not variable names. Then we can just select the variables we need for analysis.
pop <- pop %>%
pivot_longer(cols = `1980`:`2050`, names_to = "year", values_to = "population")
pop <- pop[,c("country","year","population","landAreaKm","growthRate")]
head(pop)## # A tibble: 6 × 5
## country year population landAreaKm growthRate
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 China 1980 982372466 9424703. 0
## 2 China 2000 1264099069 9424703. 0
## 3 China 2010 1348191368 9424703. 0
## 4 China 2021 1425893465 9424703. 0
## 5 China 2022 1425887337 9424703. 0
## 6 China 2030 1415605906 9424703. 0
pop_top <- head(pop,30)
ggplot(data = pop_top, mapping = aes(x = landAreaKm, y = population, color = country)) +
geom_point()
It looks like our big 5 nations are actually slowing down over time
regarding their pop growth.
Now, I’m curious if density has any correlation with growth…
pop_density <- mutate(pop, density = population/landAreaKm)
cor(pop_density$density, pop_density$growthRate)## [1] -0.0761573
ggplot(data = pop_density, mapping = aes(x = density, y = growthRate, color = country)) +
geom_point(show.legend = FALSE)It does appear that there is a decent enough negative correlation between population density and growth rate.