Synopsis

This R Markdown file is the fourth assignment in the course Data Wrangling with R. It loads several packages and creates data visualizations to answer the questions asked about the Gapminder data. Data transformation is done with dplyr, a core package of the tidyverse meta-package, and the assignment experiments with several of the functions dplyr offers.

Packages Required

Tidyverse- A set of packages that work in harmony because they share common data representations and API design. Installing and loading the tidyverse meta-package installs and attaches its core packages, including dplyr and ggplot2, with a single command.

ggplot2- A plotting system for R, used here for data visualization.

gapminder- Provides the Gapminder data sets, which contain life expectancy, population, and GDP per capita by country and continent. This assignment uses the gapminder_unfiltered data set.

Source Code

The following are the variables that are present in the gapminder_unfiltered data set.

library(gapminder)
library(dplyr)
library(ggplot2)
library(knitr)
head(gapminder_unfiltered)
## # A tibble: 6 × 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134
names(gapminder_unfiltered)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

The command ?gapminder opens the package documentation, which describes the variables in the data set.

Gapminder data- "Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country" (as quoted from the R documentation). The gapminder data has 6 variables.

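The variables, as documented for the filtered gapminder data set, are listed below. The gapminder_unfiltered data set differs in places: the summary output under Data Description shows six continent levels (including FSU) and years starting in 1950.

country- factor with 142 levels
continent- factor with 5 levels
year- ranges from 1952 to 2007 (increments of 5 years)
lifeExp- life expectancy at birth, in years
pop- population
gdpPercap- GDP per capita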

Data Description

NROW gives the total number of rows in the data set. The is.na command tests for missing (NA) values, and summing its result counts them. The summary command gives summary statistics: the most frequent countries and continents with their row counts, and the minimum, quartiles, median, mean, and maximum of year, life expectancy, population, and GDP per capita.

NROW(gapminder_unfiltered)
## [1] 3313
sum(is.na(gapminder_unfiltered))
## [1] 0
summary(gapminder_unfiltered)
##            country        continent         year         lifeExp     
##  Czech Republic:  58   Africa  : 637   Min.   :1950   Min.   :23.60  
##  Denmark       :  58   Americas: 470   1st Qu.:1967   1st Qu.:58.33  
##  Finland       :  58   Asia    : 578   Median :1982   Median :69.61  
##  Iceland       :  58   Europe  :1302   Mean   :1980   Mean   :65.24  
##  Japan         :  58   FSU     : 139   3rd Qu.:1996   3rd Qu.:73.66  
##  Netherlands   :  58   Oceania : 187   Max.   :2007   Max.   :82.67  
##  (Other)       :2965                                                 
##       pop              gdpPercap       
##  Min.   :5.941e+04   Min.   :   241.2  
##  1st Qu.:2.680e+06   1st Qu.:  2505.3  
##  Median :7.560e+06   Median :  7825.8  
##  Mean   :3.177e+07   Mean   : 11313.8  
##  3rd Qu.:1.961e+07   3rd Qu.: 17355.8  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
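The missing-value check above returns a single total. As a minimal sketch of how the same is.na result can be broken down per column, the example below uses a small made-up data frame (the name toy and its values are hypothetical, not taken from gapminder_unfiltered) so that it runs without any package.

```r
# Toy data frame with hypothetical values and two missing entries
toy <- data.frame(
  year      = c(2002, 2007, 2007),
  lifeExp   = c(62.9, NA, 80.2),
  gdpPercap = c(1746.8, 2452.2, NA)
)

# Total number of missing values, the same pattern as sum(is.na(gapminder_unfiltered))
total_missing <- sum(is.na(toy))

# Missing values per column, which shows where any gaps are
missing_by_column <- colSums(is.na(toy))

print(total_missing)
print(missing_by_column)
```

Applied to gapminder_unfiltered, colSums(is.na(gapminder_unfiltered)) would be expected to report zero for every column, since the total above is 0.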

Exploratory Data Analysis

Problem 1-

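# The variable plotted is GDP per capita, so the title and axis label say so
hist(gapminder_unfiltered$gdpPercap, breaks = 100, main = "Distribution of GDP per capita",
     xlab = "GDP per capita", border = "darkgreen")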

Problem 2-

forcontinent <- filter(gapminder_unfiltered, year == 2007)
# A color argument passed to ggplot() has no effect; set it on the geom instead
ggplot(data = forcontinent, mapping = aes(x = continent, y = gdpPercap)) +
  geom_boxplot(color = "darkgreen")

Problem 3-

forcontinent <- filter(gapminder_unfiltered, year == 2007)
# Sort by GDP per capita, highest first, and keep the first ten rows
top10 <- forcontinent %>%
  arrange(desc(gdpPercap)) %>%
  head(10)
kable(top10)
| country          | continent | year | lifeExp |       pop | gdpPercap |
|:-----------------|:----------|-----:|--------:|----------:|----------:|
| Qatar            | Asia      | 2007 |  75.588 |    907229 |  82010.98 |
| Macao, China     | Asia      | 2007 |  80.718 |    456989 |  54589.82 |
| Norway           | Europe    | 2007 |  80.196 |   4627926 |  49357.19 |
| Brunei           | Asia      | 2007 |  77.118 |    386511 |  48014.59 |
| Kuwait           | Asia      | 2007 |  77.588 |   2505559 |  47306.99 |
| Singapore        | Asia      | 2007 |  79.972 |   4553009 |  47143.18 |
| United States    | Americas  | 2007 |  78.242 | 301139947 |  42951.65 |
| Ireland          | Europe    | 2007 |  78.885 |   4109086 |  40676.00 |
| Hong Kong, China | Asia      | 2007 |  82.208 |   6980412 |  39724.98 |
| Switzerland      | Europe    | 2007 |  81.701 |   7554661 |  37506.42 |

Problem 4-

indianGdp<-filter(gapminder_unfiltered,country=="India")
kable(indianGdp)
| country | continent | year | lifeExp |        pop | gdpPercap |
|:--------|:----------|-----:|--------:|-----------:|----------:|
| India   | Asia      | 1952 |  37.373 |  372000000 |  546.5657 |
| India   | Asia      | 1957 |  40.249 |  409000000 |  590.0620 |
| India   | Asia      | 1962 |  43.605 |  454000000 |  658.3472 |
| India   | Asia      | 1967 |  47.193 |  506000000 |  700.7706 |
| India   | Asia      | 1972 |  50.651 |  567000000 |  724.0325 |
| India   | Asia      | 1977 |  54.208 |  634000000 |  813.3373 |
| India   | Asia      | 1982 |  56.596 |  708000000 |  855.7235 |
| India   | Asia      | 1987 |  58.553 |  788000000 |  976.5127 |
| India   | Asia      | 1992 |  60.223 |  872000000 | 1164.4068 |
| India   | Asia      | 1997 |  61.765 |  959000000 | 1458.8174 |
| India   | Asia      | 2002 |  62.879 | 1034172547 | 1746.7695 |
| India   | Asia      | 2007 |  64.698 | 1110396331 | 2452.2104 |
ggplot(data = indianGdp, mapping = aes(x = year, y = gdpPercap)) +
  geom_point(color = "blue") +
  geom_line(color = "darkgreen")

Problem 5-

additional<- mutate(indianGdp,lag=lag(gdpPercap),percentDiff=((gdpPercap-lag)/lag)*100)
percentGrowth<-filter(additional,year==2007)
kable(percentGrowth)
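| country | continent | year | lifeExp |        pop | gdpPercap |      lag | percentDiff |
|:--------|:----------|-----:|--------:|-----------:|----------:|---------:|------------:|
| India   | Asia      | 2007 |  64.698 | 1110396331 |   2452.21 | 1746.769 |    40.38546 |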
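The percent growth above is each period's GDP per capita minus the previous period's, divided by the previous period's and multiplied by 100. The sketch below repeats that calculation on three India rows copied from the Problem 4 table, so it runs without the gapminder package. The names india, growth, and prevGdp are illustrative choices rather than part of the assignment; the column is called prevGdp instead of lag only to avoid sharing a name with the lag() function.

```r
library(dplyr)

# GDP per capita for India in 1997, 2002 and 2007, copied from the Problem 4 table
india <- data.frame(
  year      = c(1997, 2002, 2007),
  gdpPercap = c(1458.8174, 1746.7695, 2452.2104)
)

growth <- india %>%
  arrange(year) %>%                 # lag() depends on row order, so sort by year first
  mutate(
    prevGdp     = lag(gdpPercap),   # previous period's value; NA for the first row
    percentDiff = (gdpPercap - prevGdp) / prevGdp * 100
  )

print(growth)
```

The 2007 row should agree with the 40.38546 reported in the table, and the first row has NA because there is no earlier period to compare against.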

Problem 6-

additional<- mutate(indianGdp,lag=lag(gdpPercap),difference=((gdpPercap-lag)/lag)*100)
kable(additional)
| country | continent | year | lifeExp |        pop | gdpPercap |       lag | difference |
|:--------|:----------|-----:|--------:|-----------:|----------:|----------:|-----------:|
| India   | Asia      | 1952 |  37.373 |  372000000 |  546.5657 |        NA |         NA |
| India   | Asia      | 1957 |  40.249 |  409000000 |  590.0620 |  546.5657 |   7.958100 |
| India   | Asia      | 1962 |  43.605 |  454000000 |  658.3472 |  590.0620 |  11.572539 |
| India   | Asia      | 1967 |  47.193 |  506000000 |  700.7706 |  658.3472 |   6.443935 |
| India   | Asia      | 1972 |  50.651 |  567000000 |  724.0325 |  700.7706 |   3.319477 |
| India   | Asia      | 1977 |  54.208 |  634000000 |  813.3373 |  724.0325 |  12.334362 |
| India   | Asia      | 1982 |  56.596 |  708000000 |  855.7235 |  813.3373 |   5.211394 |
| India   | Asia      | 1987 |  58.553 |  788000000 |  976.5127 |  855.7235 |  14.115440 |
| India   | Asia      | 1992 |  60.223 |  872000000 | 1164.4068 |  976.5127 |  19.241341 |
| India   | Asia      | 1997 |  61.765 |  959000000 | 1458.8174 | 1164.4068 |  25.284173 |
| India   | Asia      | 2002 |  62.879 | 1034172547 | 1746.7695 | 1458.8174 |  19.738728 |
| India   | Asia      | 2007 |  64.698 | 1110396331 | 2452.2104 | 1746.7695 |  40.385464 |
# A constant color belongs in the geom, not inside aes(), where it would be treated as a data mapping
ggplot(data = additional, mapping = aes(x = year, y = difference)) +
  geom_point(color = "blue") +
  geom_line(color = "darkgreen")