Synopsis
This R Markdown file is the fourth assignment in the course Data Wrangling with R. Several packages are loaded, and data visualizations are created to answer the questions asked about the gapminder data. Data transformation is done with the dplyr package, a core package in the tidyverse meta-package. The assignment experiments with several of the functions that dplyr offers.
Packages Required
Tidyverse- The tidyverse is a set of packages that work in harmony because they share common data representations and API design. Its core packages, including dplyr and ggplot2, can be installed and loaded with a single command.
ggplot2- The ggplot2 package is a plotting system for R and is used here for data visualization.
knitr- The knitr package provides kable, which is used here to print tables.
gapminder- The gapminder package provides the data set, which contains life expectancy, population, and GDP per capita by country and continent. The gapminder_unfiltered data set is used in this assignment.
Source Code
The following are the variables that are present in the gapminder_unfiltered data set.
library(gapminder)
library(dplyr)
library(ggplot2)
library(knitr)
head(gapminder_unfiltered)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
names(gapminder_unfiltered)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
The command ?gapminder gives information about the variables in the data set.
Gapminder data- "Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country" (as quoted by the R documentation). The gapminder data has 6 variables.
The following are the variables, as documented for the main gapminder data. The summary output in the next section shows that gapminder_unfiltered has the same variables but a sixth continent level (FSU) and years starting in 1950-
country- factor with 142 levels
continent- factor with 5 levels
year- ranges from 1952 to 2007 (increments of 5 years)
lifeExp- life expectancy at birth, in years
pop- population
gdpPercap- GDP per capita
Data Description
The total number of rows in the data set is given by NROW. The is.na command tests for missing (NA) values in the data set; summing its result gives the count of missing entries, which is 0 here. The summary command gives summary statistics: the most frequent countries, the number of rows per continent, and the quantiles, mean, and median of year, life expectancy, population, and GDP per capita.
NROW(gapminder_unfiltered)
## [1] 3313
sum(is.na(gapminder_unfiltered))
## [1] 0
summary(gapminder_unfiltered)
## country continent year lifeExp
## Czech Republic: 58 Africa : 637 Min. :1950 Min. :23.60
## Denmark : 58 Americas: 470 1st Qu.:1967 1st Qu.:58.33
## Finland : 58 Asia : 578 Median :1982 Median :69.61
## Iceland : 58 Europe :1302 Mean :1980 Mean :65.24
## Japan : 58 FSU : 139 3rd Qu.:1996 3rd Qu.:73.66
## Netherlands : 58 Oceania : 187 Max. :2007 Max. :82.67
## (Other) :2965
## pop gdpPercap
## Min. :5.941e+04 Min. : 241.2
## 1st Qu.:2.680e+06 1st Qu.: 2505.3
## Median :7.560e+06 Median : 7825.8
## Mean :3.177e+07 Mean : 11313.8
## 3rd Qu.:1.961e+07 3rd Qu.: 17355.8
## Max. :1.319e+09 Max. :113523.1
##
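The missing-value check above returns a single total for the whole table. As a minimal sketch of how that total can be broken down per column, the example below uses a small made-up data frame (`toy` is hypothetical and is not part of the gapminder data) with two NA values planted in it.

```r
# Hypothetical toy data frame (not the gapminder data) with two NAs planted.
toy <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(70.1, NA, 65.4),
  pop     = c(1000L, 2000L, NA),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix of the same shape.
total_missing <- sum(is.na(toy))          # one count for the whole table
missing_by_column <- colSums(is.na(toy))  # one count per column

print(total_missing)
print(missing_by_column)
```

The per-column form is useful when the total is not zero, because it shows which variable needs attention.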
Exploratory Data Analysis
Problem 1 -
hist(gapminder_unfiltered$gdpPercap, breaks = 100, main = "Distribution of GDP by countries", border = "darkgreen")
Problem 2-
forcontinent <- filter(gapminder_unfiltered, year == 2007)
# The color is set on the boxplot layer; ggplot() itself does not use a color argument.
ggplot(data = forcontinent, mapping = aes(x = continent, y = gdpPercap)) +
  geom_boxplot(color = "darkgreen")
Problem 3-
forcontinent <- filter(gapminder_unfiltered, year == 2007)
top10 <- forcontinent %>%
  arrange(desc(gdpPercap)) %>%
  head(10)
kable(top10)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Qatar | Asia | 2007 | 75.588 | 907229 | 82010.98 |
| Macao, China | Asia | 2007 | 80.718 | 456989 | 54589.82 |
| Norway | Europe | 2007 | 80.196 | 4627926 | 49357.19 |
| Brunei | Asia | 2007 | 77.118 | 386511 | 48014.59 |
| Kuwait | Asia | 2007 | 77.588 | 2505559 | 47306.99 |
| Singapore | Asia | 2007 | 79.972 | 4553009 | 47143.18 |
| United States | Americas | 2007 | 78.242 | 301139947 | 42951.65 |
| Ireland | Europe | 2007 | 78.885 | 4109086 | 40676.00 |
| Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.98 |
| Switzerland | Europe | 2007 | 81.701 | 7554661 | 37506.42 |
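The arrange, desc, and head steps above can also be written as a single call. The sketch below assumes dplyr 1.0.0 or later, where `slice_max` is available, and runs on four rows copied from the table above rather than on the full data set; `sample2007` and `top2` are names chosen for illustration.

```r
library(dplyr)

# Four 2007 rows copied from the table above; a subset for illustration only.
sample2007 <- data.frame(
  country   = c("Ireland", "Qatar", "Switzerland", "Norway"),
  gdpPercap = c(40676.00, 82010.98, 37506.42, 49357.19),
  stringsAsFactors = FALSE
)

# Keep the n rows with the largest gdpPercap; the result is in descending order.
top2 <- slice_max(sample2007, gdpPercap, n = 2)
print(top2)
```

Note that `slice_max` keeps tied rows by default, so it can return more than n rows when values tie, whereas `head(10)` always cuts at ten.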
Problem 4-
indianGdp <- filter(gapminder_unfiltered, country == "India")
kable(indianGdp)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| India | Asia | 1952 | 37.373 | 372000000 | 546.5657 |
| India | Asia | 1957 | 40.249 | 409000000 | 590.0620 |
| India | Asia | 1962 | 43.605 | 454000000 | 658.3472 |
| India | Asia | 1967 | 47.193 | 506000000 | 700.7706 |
| India | Asia | 1972 | 50.651 | 567000000 | 724.0325 |
| India | Asia | 1977 | 54.208 | 634000000 | 813.3373 |
| India | Asia | 1982 | 56.596 | 708000000 | 855.7235 |
| India | Asia | 1987 | 58.553 | 788000000 | 976.5127 |
| India | Asia | 1992 | 60.223 | 872000000 | 1164.4068 |
| India | Asia | 1997 | 61.765 | 959000000 | 1458.8174 |
| India | Asia | 2002 | 62.879 | 1034172547 | 1746.7695 |
| India | Asia | 2007 | 64.698 | 1110396331 | 2452.2104 |
ggplot(data = indianGdp, mapping = aes(x = year, y = gdpPercap)) +
  geom_point(color = "blue") +
  geom_line(color = "darkgreen")
Problem 5-
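# lag() gives the previous row's gdpPercap, so percentDiff is the percent change
# from the prior observation (rows are already ordered by year).
additional <- mutate(indianGdp, lag = lag(gdpPercap), percentDiff = ((gdpPercap - lag) / lag) * 100)
percentGrowth <- filter(additional, year == 2007)
kable(percentGrowth)
| country | continent | year | lifeExp | pop | gdpPercap | lag | percentDiff |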
|---|---|---|---|---|---|---|---|
| India | Asia | 2007 | 64.698 | 1110396331 | 2452.21 | 1746.769 | 40.38546 |
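To make the percent-change arithmetic explicit, the sketch below reproduces it in base R on the last three gdpPercap values for India (1997, 2002, 2007) from the Problem 4 table. The shift with `c(NA, head(gdp, -1))` is a base-R stand-in for what `lag()` does in the code above, not the method used in the assignment; `gdp`, `previous`, and `percent_change` are illustrative names.

```r
# gdpPercap for India in 1997, 2002, and 2007, copied from the Problem 4 table.
gdp <- c(1458.8174, 1746.7695, 2452.2104)

# Shift the series down by one position; the first value has no predecessor.
previous <- c(NA, head(gdp, -1))

# Percent change relative to the previous observation.
percent_change <- (gdp - previous) / previous * 100
print(round(percent_change, 2))
# prints: NA 19.74 40.39
```

The last value matches the 40.38546 reported for 2007, and the first is NA because 1997 has no earlier value in this shortened series.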
Problem 6-
additional <- mutate(indianGdp, lag = lag(gdpPercap), difference = ((gdpPercap - lag) / lag) * 100)
kable(additional)
| country | continent | year | lifeExp | pop | gdpPercap | lag | difference |
|---|---|---|---|---|---|---|---|
| India | Asia | 1952 | 37.373 | 372000000 | 546.5657 | NA | NA |
| India | Asia | 1957 | 40.249 | 409000000 | 590.0620 | 546.5657 | 7.958100 |
| India | Asia | 1962 | 43.605 | 454000000 | 658.3472 | 590.0620 | 11.572539 |
| India | Asia | 1967 | 47.193 | 506000000 | 700.7706 | 658.3472 | 6.443935 |
| India | Asia | 1972 | 50.651 | 567000000 | 724.0325 | 700.7706 | 3.319477 |
| India | Asia | 1977 | 54.208 | 634000000 | 813.3373 | 724.0325 | 12.334362 |
| India | Asia | 1982 | 56.596 | 708000000 | 855.7235 | 813.3373 | 5.211394 |
| India | Asia | 1987 | 58.553 | 788000000 | 976.5127 | 855.7235 | 14.115440 |
| India | Asia | 1992 | 60.223 | 872000000 | 1164.4068 | 976.5127 | 19.241341 |
| India | Asia | 1997 | 61.765 | 959000000 | 1458.8174 | 1164.4068 | 25.284173 |
| India | Asia | 2002 | 62.879 | 1034172547 | 1746.7695 | 1458.8174 | 19.738728 |
| India | Asia | 2007 | 64.698 | 1110396331 | 2452.2104 | 1746.7695 | 40.385464 |
# The 1952 row has no previous value (NA), so ggplot2 drops it from the plot.
ggplot(data = additional, mapping = aes(x = year, y = difference)) +
  geom_point(color = "blue") +
  geom_line(color = "darkgreen")