In this Data Visualization assignment we use the well-known Gapminder dataset, which tracks ‘Total Population’, ‘Life Expectancy’ & ‘GDP/Person’ for different countries over the years. Through interactive visualizations, we want to uncover insights and observe how life expectancy, population & GDP have changed over time.
The core purpose of the visualizations is to generate insights & spot trends over the years in population, GDP and life expectancy across countries & regions.
We will use interactivity to make the visualizations intuitive and to give users the flexibility to deep-dive into different aspects of each visualization.
Title & Caption: The final visualization has a title that states the story the visualization tells.
Legend & Markers: The legend & markers explain the colors, patterns, etc. used in the visualization.
Tooltips: In the interactive visualization, tooltips show the values at a given point when the user hovers over it.
Trends: Trends in the data can be brought out by faceting the original visualization and by using interactivity and/or animation to highlight specific changes.
Layered grammar of graphics: The components are embedded into tmaps through different layers: data; aesthetics, in the form of a map layout; geometries, as shapes and symbols; facets, which divide the visualization into panels; statistics, which scale the symbols by the variable of focus; and a theme, which makes the overall visualization impactful and self-explanatory.
Data Processing for generating the Visualization: In the raw format, the data isn’t in a structure for creating the visualization, and hence we need to process and convert it into a format that can be used for creating the visualization.
Interactivity: Incorporating interactivity in the static visualizations is a challenge. Unlike Tableau, where filters, tooltips & legends can be added easily to the visualization, it isn’t the same case with ggplot. We will have to explore other alternatives or build on top of it.
Customization of the view of charts: The default look of a ggplot visualization, with its grey background & basic theme, is not very appealing or professional and needs to be customized.
Limited re-usability: Separate dataframes need to be created for different visualizations, as the logic for creating each visualization differs from one to the next.
For data preparation challenges, dplyr from tidyverse has been used to perform filtering and aggregation of the fields. The process of data preparation is performed separately for each of the visualizations.
Although adding interactivity in R is not as intuitive as it is in Tableau, there is a lot of scope to add it through the ggplot and plotly packages.
Basic formatting such as background color, themes, hiding axes (if required), etc. can be defined once and standardized for all the visualizations, without repeating the process for each one.
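As a minimal sketch of that last point, a shared theme can be defined once and set as the session default, so every later ggplot picks it up. The name `theme_report` and the specific settings chosen here are illustrative assumptions, not the formatting used in the original report.

```r
library(ggplot2)
library(gapminder)

# Hypothetical shared theme: white background, no minor grid, legend at the bottom.
theme_report <- theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    legend.position  = "bottom",
    plot.title       = element_text(face = "bold")
  )

# Make it the default for every ggplot created afterwards in this session.
theme_set(theme_report)

# Any plot built after theme_set() uses the shared formatting automatically.
p_demo <- ggplot(subset(gapminder, year == 2007),
                 aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  labs(title = "Life expectancy vs GDP per capita, 2007",
       x = "GDP per capita (log scale)", y = "Life expectancy (years)")
```

Printing `p_demo` draws the chart; individual plots can still override the default by adding their own `theme()` call.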
This section describes the step-by-step process for creating the visualizations. It is divided into 3 sections - Installation of R packages, Data Preparation and Creating Visualizations.
This code chunk installs (if missing) and loads all the packages needed for the visualizations - tictoc, pryr, sf, maps, gganimate, magick, tidyverse, plotly, kableExtra, viridis, ggrepel & gapminder - in a single loop, instead of loading them one by one in RStudio.
packages = c('tictoc', 'pryr', 'sf', 'maps', 'gganimate', 'magick', 'tidyverse', 'plotly', 'kableExtra', 'viridis', 'ggrepel', 'gapminder')
for (p in packages){
  # install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The data comes from the well-known source ‘Gapminder’. The gapminder R package provides life expectancy, population and GDP per capita (Gross Domestic Product per person) for countries across the globe, recorded every five years from 1952 to 2007.
Following are the variables in the dataset:
1. country : Name of the country
2. continent : Name of the continent the country belongs to
3. year : Year for which the observation was collected
4. lifeExp : Life expectancy (in years) for people in that country
5. pop : Population of that country in that year
6. gdpPercap : GDP per capita (gross domestic product divided by the population)
library(gapminder)
# Complete dataset for all countries & years
data <- gapminder
# Keep only the latest year, 2007 (year is stored as an integer)
data_2007 <- gapminder %>% filter(year == 2007) %>% select(-year)
The data looks like -
# Round the values for display; population is shown in millions
data_2007 %>% head(6) %>%
  mutate(gdpPercap = round(gdpPercap, 0)) %>%
  mutate(pop = round(pop / 1000000, 2)) %>%
  mutate(lifeExp = round(lifeExp, 1)) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
| country | continent | lifeExp | pop | gdpPercap |
|---|---|---|---|---|
| Afghanistan | Asia | 43.8 | 31.89 | 975 |
| Albania | Europe | 76.4 | 3.60 | 5937 |
| Algeria | Africa | 72.3 | 33.33 | 6223 |
| Angola | Africa | 42.7 | 12.42 | 4797 |
| Argentina | Americas | 75.3 | 40.30 | 12779 |
| Australia | Oceania | 81.2 | 20.43 | 34435 |
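The filtering and aggregation with dplyr mentioned earlier can be sketched as follows. This is an illustrative summary by continent, not one of the report's own data preparation steps; the column names `pop_millions`, `lifeExp_wtd` and `median_gdpPercap` are assumptions made for the example. Note that `pop` is stored as an integer, so it is converted to numeric before summing to avoid integer overflow.

```r
library(dplyr)
library(gapminder)

continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  # pop is an integer column; continent totals exceed the integer range
  mutate(pop = as.numeric(pop)) %>%
  group_by(continent) %>%
  summarise(
    countries        = n(),
    pop_millions     = round(sum(pop) / 1e6, 1),
    # population-weighted mean, so large countries count for more
    lifeExp_wtd      = round(weighted.mean(lifeExp, w = pop), 1),
    median_gdpPercap = round(median(gdpPercap), 0),
    .groups = "drop"
  ) %>%
  arrange(desc(lifeExp_wtd))

continent_2007
```

The result has one row per continent and can be passed to `kable()` in the same way as the table above.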
The following map is an intuitive way to see “life expectancy” across different countries of the world. The shading on the map helps easily identify regions where people have a low life expectancy or the other way round.
# gapminder_data is assumed to be the gapminder data with two extra columns
# prepared beforehand: mapname (display name) and code (ISO-3 country code).
gapminder_lifeExp_07 <- gapminder_data %>%
  filter(year == 2007) %>%
  select(mapname, code, lifeExp)

# Choropleth: shade each country by its life expectancy
lifeExp_07 <- plot_geo(gapminder_lifeExp_07) %>%
  add_trace(
    z = ~lifeExp, color = ~lifeExp, colors = 'Oranges',
    text = ~mapname, locations = ~code) %>%
  colorbar(title = 'Years') %>%
  layout(
    # geo expects a list of map options, not the plot object, so it is left at its defaults
    title = 'Life Expectancy in 2007<br>Source:<a href="https://www.gapminder.org"> gapminder.org</a>')
lifeExp_07
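The map code relies on a `gapminder_data` object with `mapname` and `code` columns, which is not defined in this section. A minimal sketch of one way to build it before running the map code, assuming the `country_codes` table that ships with the gapminder package is an acceptable source of ISO-3 codes (the original report may have prepared these columns differently):

```r
library(dplyr)
library(gapminder)

gapminder_data <- gapminder %>%
  # join on plain character names rather than on a factor
  mutate(country = as.character(country)) %>%
  left_join(
    country_codes %>% mutate(country = as.character(country)),
    by = "country"
  ) %>%
  # mapname is the label shown on hover; code is the ISO-3 code used by plot_geo
  mutate(mapname = country) %>%
  rename(code = iso_alpha)
```

plotly's choropleth trace matches `locations` against ISO-3 codes by default, which is why the three-letter `iso_alpha` column is used here.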
This visualization helps explore the relationship between Life Expectancy & Gross Domestic Product per capita. It is usually believed that countries with a high GDP per capita have a high life expectancy for their people because of better resource availability in terms of healthcare, etc. Using this visualization we can look at individual countries simply by hovering over the bubbles.
# Interactive version
p <- data_2007 %>%
  mutate(gdpPercap = round(gdpPercap, 0)) %>%
  mutate(pop = round(pop / 1000000, 2)) %>%
  mutate(lifeExp = round(lifeExp, 1)) %>%
  # draw the largest bubbles first so that smaller ones stay visible on top
  arrange(desc(pop)) %>%
  mutate(country = factor(country)) %>%
  # tooltip text shown on hover
  mutate(text = paste("Country: ", country, "\nPopulation (M): ", pop, "\nLife Expectancy: ", lifeExp, "\nGdp per capita: ", gdpPercap, sep="")) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, size = pop, color = continent, text = text)) +
  geom_point(alpha = 0.7) +
  scale_size(range = c(1.4, 19), name = "Population (M)") +
  # guide = "none" replaces the deprecated guide = FALSE
  scale_color_viridis(discrete = TRUE, guide = "none") +
  theme(legend.position = "none")
ggplotly(p, tooltip = "text")
This visualization makes it easier to compare two years, 1962 and 2007. We can observe whether there have been any drastic shifts or whether things have remained much the same. In this visualization we also color the points by continent so that we can notice any similarities within a continent.
# facet by year only
p2 <- filter(gapminder, year %in% c(1962, 2007)) %>%
  # round the values so that the tooltip matches the bubble chart
  mutate(text = paste("Country: ", country, "\nPopulation (M): ", round(pop / 1000000, 2), "\nLife Expectancy: ", round(lifeExp, 1), "\nGdp per capita: ", round(gdpPercap, 0), sep="")) %>%
  ggplot(aes(lifeExp, gdpPercap, col = continent, text = text)) +
  geom_point() +
  facet_grid(. ~ year) + ggtitle("Comparison between 1962 & 2007\n")
ggplotly(p2, tooltip = "text")
From the life expectancy map, it is clear that there is a wide difference in life expectancy across different parts of the world. North America & the Australia-New Zealand region seem to have among the highest life expectancy values, whereas some parts of Africa have a life expectancy of less than 50 years.
(By hovering on the map you can see the life expectancy for each country)
Going by the bubble chart, there seems to be a relation between GDP per capita & life expectancy. As GDP per capita increases, life expectancy also tends to be higher. This could be attributed to greater resource availability in richer areas. Most countries in the top right region are developed ones - US, Singapore & Norway.
(By hovering on the bubble chart you can view country specific information)
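The relation seen in the bubble chart can also be checked numerically. The sketch below computes the Pearson correlation between GDP per capita and life expectancy for 2007, once on the raw values and once on a log10 scale; the log transform is an assumption added here because GDP per capita is strongly right-skewed, and it is not part of the original charts.

```r
library(dplyr)
library(gapminder)

gdp_life_2007 <- gapminder %>% filter(year == 2007)

# Pearson correlation on raw GDP per capita
cor_raw <- cor(gdp_life_2007$gdpPercap, gdp_life_2007$lifeExp)

# Same correlation with GDP per capita on a log10 scale
cor_log <- cor(log10(gdp_life_2007$gdpPercap), gdp_life_2007$lifeExp)

round(c(raw = cor_raw, log10 = cor_log), 2)
```

A positive value supports the pattern in the chart, although correlation alone does not show that higher GDP causes longer lives.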
There is a clear drift observable in the 1962 vs 2007 comparison plot. Most countries have a better life expectancy in 2007, as the points have shifted towards the right side (higher lifeExp). In terms of GDP per capita too, countries are doing better, as many of them have higher values in 2007 than in 1962.
The African continent seems to be the one still lagging behind the rest of the world. Asia, Europe & the Americas seem to have improved considerably, with higher GDP per capita & life expectancy values.
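The shift between the two years can be summarized per continent. This is a minimal sketch that compares continent medians for 1962 and 2007; the use of medians and the column names `lifeExp_gain` and `gdp_ratio` are choices made for this illustration, not figures taken from the original analysis.

```r
library(dplyr)
library(tidyr)
library(gapminder)

change_by_continent <- gapminder %>%
  filter(year %in% c(1962, 2007)) %>%
  group_by(continent, year) %>%
  summarise(
    median_lifeExp   = median(lifeExp),
    median_gdpPercap = median(gdpPercap),
    .groups = "drop"
  ) %>%
  # one row per continent, with a column per measure and year
  pivot_wider(names_from = year,
              values_from = c(median_lifeExp, median_gdpPercap)) %>%
  mutate(
    lifeExp_gain = median_lifeExp_2007 - median_lifeExp_1962,
    gdp_ratio    = median_gdpPercap_2007 / median_gdpPercap_1962
  )

change_by_continent
```

A positive `lifeExp_gain` means the median country in that continent gained years of life expectancy; a `gdp_ratio` above 1 means its median GDP per capita grew.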
Flexible data analysis through view manipulation enables users to explore large datasets and gives them the flexibility to adjust the views of interest by filtering/subsetting through user controls. It also allows users to switch from high-level analysis to isolated, focused analysis by looking at the data at a finer granularity.
Depiction of patterns makes it easier for users to analyze the data and identify trends. Users are unlikely to recognize patterns when presented with all the data points at once; visualizations make those patterns identifiable at a glance.
Interactivity with hover text and tooltips provides detail-on-demand without cluttering one screen with too much information. This makes for a more powerful visualization tool, as users can explore the data they are interested in instead of having to plough through a bunch of static charts to find the information they need.