This report explores transforming the data and combining it with visualization for exploratory data analysis.
ggplot2 for the visualizations
tidyverse for using some dplyr functions
gapminder for using the gapminder data
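The chunk that attaches these packages is not shown in this report. A minimal sketch of what it presumably looked like follows; note that tidyverse already attaches ggplot2 and dplyr, so the first call is redundant but harmless.

```r
# Attach the packages used in this report (assumed setup chunk).
library(ggplot2)    # plotting
library(tidyverse)  # dplyr verbs and the %>% pipe (also attaches ggplot2)
library(gapminder)  # provides gapminder and gapminder_unfiltered
```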
The gapminder data contains 6 variables. Country and continent are factors. Life expectancy is measured at birth. The years range from 1950 to 2007; most countries are recorded every five years (1952, 1957, ..., 2007), while some have yearly observations.
str(gapminder_unfiltered)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3313 obs. of 6 variables:
## $ country : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
sapply(gapminder_unfiltered,function(x) (sum(is.na(x))))
## country continent year lifeExp pop gdpPercap
## 0 0 0 0 0 0
sapply(gapminder_unfiltered,summary)
## $country
## Czech Republic Denmark Finland
## 58 58 58
## Iceland Japan Netherlands
## 58 58 58
## Norway Portugal Slovak Republic
## 58 58 58
## Spain Sweden Switzerland
## 58 58 58
## Taiwan Austria Belgium
## 58 57 57
## Bulgaria Canada France
## 57 57 57
## Hungary United States Australia
## 57 57 56
## Italy New Zealand Poland
## 56 55 52
## Luxembourg Latvia China
## 49 42 36
## Slovenia Germany Russia
## 32 26 20
## Ukraine Belarus Estonia
## 20 18 18
## Lithuania Costa Rica Cuba
## 18 13 13
## Greece Ireland Libya
## 13 13 13
## Mexico Puerto Rico Sri Lanka
## 13 13 13
## Thailand Uganda United Kingdom
## 13 13 13
## Afghanistan Albania Algeria
## 12 12 12
## Angola Argentina Bahrain
## 12 12 12
## Bangladesh Benin Bolivia
## 12 12 12
## Bosnia and Herzegovina Botswana Brazil
## 12 12 12
## Burkina Faso Burundi Cambodia
## 12 12 12
## Cameroon Central African Republic Chad
## 12 12 12
## Chile Colombia Comoros
## 12 12 12
## Congo, Dem. Rep. Congo, Rep. Cote d'Ivoire
## 12 12 12
## Croatia Djibouti Dominican Republic
## 12 12 12
## Ecuador Egypt El Salvador
## 12 12 12
## Equatorial Guinea Eritrea Ethiopia
## 12 12 12
## Gabon Gambia Ghana
## 12 12 12
## Guatemala Guinea Guinea-Bissau
## 12 12 12
## Haiti Honduras Hong Kong, China
## 12 12 12
## India Indonesia Iran
## 12 12 12
## Iraq Israel Jamaica
## 12 12 12
## Jordan Kenya Korea, Dem. Rep.
## 12 12 12
## Korea, Rep. Kuwait Lebanon
## 12 12 12
## (Other)
## 871
##
## $continent
## Africa Americas Asia Europe FSU Oceania
## 637 470 578 1302 139 187
##
## $year
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1950 1967 1982 1980 1996 2007
##
## $lifeExp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.60 58.33 69.61 65.24 73.66 82.67
##
## $pop
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.941e+04 2.680e+06 7.560e+06 3.177e+07 1.961e+07 1.319e+09
##
## $gdpPercap
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 241.2 2505.0 7826.0 11310.0 17360.0 113500.0
The data is at the country-year level: each row describes one country in one year.
nrow(unique(gapminder_unfiltered[,c("country","year")]))
## [1] 3313
The number of unique country-year combinations equals the total number of rows, so country and year together identify each row.
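The same check can be phrased as a search for duplicated keys, which has the advantage of showing the offending rows if any exist. This is a minimal sketch, not part of the original analysis; the name `duplicate_keys` is introduced here for illustration.

```r
library(dplyr)
library(gapminder)

# Count rows per (country, year) pair and keep only pairs that occur more than once.
duplicate_keys <- gapminder_unfiltered %>%
  count(country, year) %>%
  filter(n > 1)

# Zero rows here means (country, year) uniquely identifies each row.
nrow(duplicate_keys)
```

Given the row counts reported above, this should come back empty.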
gapminder_unfiltered %>%
group_by(country) %>%
summarize(years_for_country=n_distinct(year),num=n()) %>%
arrange(desc(years_for_country))
## # A tibble: 187 × 3
## country years_for_country num
## <fctr> <int> <int>
## 1 Czech Republic 58 58
## 2 Denmark 58 58
## 3 Finland 58 58
## 4 Iceland 58 58
## 5 Japan 58 58
## 6 Netherlands 58 58
## 7 Norway 58 58
## 8 Portugal 58 58
## 9 Slovak Republic 58 58
## 10 Spain 58 58
## # ... with 177 more rows
This gives the number of years captured for each country. The maximum is 58, i.e. for some countries every year from 1950 to 2007 is captured.
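Because the number of years differs so much between countries, it can help to look at the spacing between consecutive observations. The sketch below tabulates the gap, in years, between each observation and the previous one for the same country; `year_gaps`, `gap` and `n_intervals` are names chosen here for illustration.

```r
library(dplyr)
library(gapminder)

year_gaps <- gapminder_unfiltered %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%   # sort years within each country
  mutate(gap = year - lag(year)) %>%    # years since the previous observation
  ungroup() %>%
  filter(!is.na(gap)) %>%               # drop each country's first observation
  count(gap, name = "n_intervals")      # how often each gap length occurs

year_gaps
```

A gap of 1 corresponds to yearly data and a gap of 5 to the five-year pattern visible for Afghanistan in the `str()` output.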
gapminder_unfiltered %>%
group_by(year) %>%
summarize(country_count_for_years= n_distinct(country)) %>%
arrange(desc(country_count_for_years))
## # A tibble: 58 × 2
## year country_count_for_years
## <int> <int>
## 1 2002 187
## 2 1997 184
## 3 1992 183
## 4 2007 183
## 5 1977 171
## 6 1982 171
## 7 1987 171
## 8 1972 168
## 9 1967 156
## 10 1962 151
## # ... with 48 more rows
The year 2002 has the widest coverage, with all 187 countries captured.
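Coverage by year is easier to read as a chart than as a sorted table. The following is a small sketch that plots the same per-year country counts; the object names `coverage` and `coverage_plot` are assumptions made for this example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Number of distinct countries with data in each year.
coverage <- gapminder_unfiltered %>%
  group_by(year) %>%
  summarize(country_count = n_distinct(country))

# One bar per year; taller bars mean more countries were captured.
coverage_plot <- ggplot(coverage, aes(x = year, y = country_count)) +
  geom_col() +
  labs(x = "Year", y = "Number of countries with data")

coverage_plot
```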
gapminder_unfiltered %>%
filter(year==2007) %>%
ggplot()+
geom_histogram(mapping=aes(x=gdpPercap),col="red")
gapminder_unfiltered %>%
filter(year==2007) %>%
ggplot()+
geom_histogram(mapping=aes(x=gdpPercap),col="red") +
facet_wrap(~continent , nrow=3)
gapminder_unfiltered %>%
filter(year==2007) %>%
mutate(rank=dense_rank(desc(gdpPercap))) %>%
filter(rank<=10) %>%
select(country,gdpPercap) %>%
ggplot()+
geom_bar(mapping=aes(y=gdpPercap, x=reorder(country,-gdpPercap)),stat="identity",col="green")
gapminder_unfiltered %>%
filter(country=="India") %>%
ggplot(mapping=aes(x=year,y=gdpPercap))+
geom_point(col="red",size=3)+
geom_line()
gapminder_unfiltered %>%
filter(country=="India") %>%
arrange(year) %>%
mutate(percent_growth=100*(gdpPercap-lag(gdpPercap))/lag(gdpPercap)) %>%
ggplot()+
geom_point(mapping=aes(x=year,y=gdpPercap,size=percent_growth),col="blue")
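The growth figure computed above compares each observation with the previous one, so it measures growth over the whole interval between observations. India has 12 observations in this data, which suggests the usual five-year spacing, so each value would be a five-year growth rate. If a yearly rate is wanted, one option is to annualize it. The helper below is a hypothetical sketch, not part of the original report.

```r
# Hypothetical helper: average yearly growth rate between consecutive
# observations, given the values and the years they were observed in.
annualized_growth <- function(value, year) {
  n <- length(value)
  if (n < 2) return(rep(NA_real_, n))
  ratio <- value[-1] / value[-n]      # growth factor over each interval
  span  <- diff(year)                 # interval length in years
  c(NA_real_, ratio^(1 / span) - 1)   # first observation has no predecessor
}

# Toy check: 100 -> 121 over two years is 10% per year.
annualized_growth(c(100, 121), c(2000, 2002))
```

In the India pipeline this could be used as `mutate(annual_growth = annualized_growth(gdpPercap, year))` after `arrange(year)`.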