In this R Markdown file, I explore the gapminder dataset and build visualizations to better understand the data.

Loading tidyverse and gapminder packages.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gapminder)

Using dim() to find the dimensions of the gapminder data set.

dim(gapminder)
## [1] 1704    6

The data has 1704 rows and 6 columns.

We will use the glimpse() function to look at different variables in our data set and their types.

glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~

Now we will take a look at the statistical summary of the data using the summary() function.

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 


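Q1. Do all variables have the right data type? A. Yes. country and continent are factors, year and pop are integers, and lifeExp and gdpPercap are doubles.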
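
As a quick cross-check of that answer, the class of each column can also be listed programmatically. This is a minimal sketch; the name `col_types` is an illustrative assumption and not part of the original analysis.

```r
library(gapminder)

# First class of each column; col_types is a hypothetical name used for illustration.
col_types <- vapply(gapminder, function(x) class(x)[1], character(1))
print(col_types)
```

Note that R reports double columns such as lifeExp as "numeric" here, while glimpse() abbreviates them as <dbl>.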

Creating a new data frame, country_data, that contains only the rows for Pakistan.

country_data <- gapminder %>% filter(country == "Pakistan")

Creating a scatterplot of year and lifeExp for Pakistan.

ggplot2::ggplot(country_data, aes(x = year, y = lifeExp)) +
  geom_point() +
  labs(title = "Pakistan", x = "Year", y = "Life Expectancy in Years")


We can see that life expectancy in Pakistan increased over the years.
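
The visual impression can be checked numerically. The sketch below assumes a fresh data frame named `pakistan` (a hypothetical name, equivalent to country_data above) and looks at the change in life expectancy between consecutive survey years.

```r
library(dplyr)
library(gapminder)

# Same filter as country_data; pakistan is a hypothetical name for this sketch.
pakistan <- gapminder %>% filter(country == "Pakistan")

# Change in life expectancy between consecutive observations (5 years apart).
pakistan_change <- diff(pakistan$lifeExp)
print(pakistan_change)

# TRUE only if every step from one observation to the next is an increase.
all(pakistan_change > 0)
```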

Repeating the same by filtering the gapminder data set for Zambia.

zambia_data <- gapminder %>% filter(country == "Zambia")

ggplot2::ggplot(zambia_data, aes(x = year, y = lifeExp)) +
  geom_point() +
  labs(title = "Zambia", x = "Year", y = "Life Expectancy in Years")

No, Zambia’s life expectancy is not better than Pakistan’s. It starts at about 42 years in 1952, peaks at about 52 years in 1982, and declines sharply after that. Pakistan, by contrast, shows a steady rise from the low 40s to about 65 years.
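
The comparison is easier to read when both countries share one set of axes. The following is a sketch under that assumption; `two_countries`, `comparison_plot`, and `zambia_peak` are illustrative names, and the added geom_line() is a presentation choice that the original plots do not use.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Both countries in one data frame so they are drawn on the same axes.
two_countries <- gapminder %>%
  filter(country %in% c("Pakistan", "Zambia"))

comparison_plot <- ggplot(two_countries, aes(x = year, y = lifeExp, color = country)) +
  geom_line() +
  geom_point() +
  labs(title = "Pakistan vs. Zambia", x = "Year", y = "Life Expectancy in Years")

print(comparison_plot)

# Row with Zambia's highest recorded life expectancy.
zambia_peak <- two_countries %>%
  filter(country == "Zambia") %>%
  slice_max(lifeExp, n = 1) %>%
  select(country, year, lifeExp)

zambia_peak
```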

Creating a new data frame, data_2007, by subsetting the gapminder data for the year 2007, and making a scatter plot with GDP per capita on the x axis and life expectancy on the y axis.

data_2007 <- gapminder %>% filter(year == 2007)

ggplot(data_2007, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  labs(title = "Year 2007", x = "GDP per capita (US$)", y = "Life Expectancy in Years")


From this plot, we can tell that in 2007 life expectancy generally increases with GDP per capita, although there are some outliers. African countries sit mostly at the low end of GDP per capita, where life expectancy varies widely and rises steeply across a narrow range of GDP per capita, while European countries are concentrated at the high end of GDP per capita. The largest points, and therefore the largest populations, belong mostly to Asian and American countries.
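
These observations can be backed up with a per-continent summary. This is a minimal sketch; `continent_2007` and its column names are assumptions chosen for illustration. Because pop is stored as an integer, it is converted to double before summing, since an integer sum of this size would overflow and return NA.

```r
library(dplyr)
library(gapminder)

continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(
    countries = n(),
    median_lifeExp = median(lifeExp),
    median_gdpPercap = median(gdpPercap),
    # Convert to double first so the total cannot overflow the integer range.
    total_pop = sum(as.numeric(pop))
  ) %>%
  arrange(desc(total_pop))

continent_2007
```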

Loading plotly package.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Remaking the 2007 plot using plotly.

plot_2007 <- ggplot(data_2007, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop, label = country)) +
  geom_point() +
  labs(title = "Year 2007", x = "GDP per capita (US$)", y = "Life Expectancy in Years")

ggplotly(plot_2007)

By hovering over the biggest points, we can see from the tooltip that the countries with the largest populations in 2007 are China and India.
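
The same answer can be read directly from the data without hovering. The sketch below uses dplyr's slice_max(); the name `top_pop_2007` is an illustrative assumption.

```r
library(dplyr)
library(gapminder)

# Two most populous countries in the 2007 subset.
top_pop_2007 <- gapminder %>%
  filter(year == 2007) %>%
  slice_max(pop, n = 2) %>%
  select(country, continent, pop)

top_pop_2007
```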


The End