In this Week 4 homework, we perform exploratory data analysis on the ‘gapminder_unfiltered’ dataset, which ships with the ‘gapminder’ package. As usual, we first explore the data and prepare it for analysis. We then use functions from dplyr and ggplot2 (both part of the tidyverse) to compute summary statistics and visualize the data, in order to derive some insights.
The following packages are required:
library(gapminder) # Provides the data we are going to work on
library(tidyverse) # Provides dplyr for data manipulation and ggplot2 for visualization
There are 6 variables in the dataset: country, continent, year, lifeExp, pop, and gdpPercap. The variables country, continent, and year identify the country, the continent it belongs to, and the year of the observation. lifeExp is the life expectancy at birth (in years), and pop is the population of the country. gdpPercap is the per capita GDP (gross domestic product) in international dollars, “a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time” (2005, in this case).
The data codebook can be viewed with: ?gapminder_unfiltered
Once we load the data we take a look at it:
# Observe the dimensions of the data (number of observations and number of variables)
dim (gapminder_unfiltered)
## [1] 3313 6
#let us look at the variables present
names (gapminder_unfiltered)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
#looking at the structure of the data
str (gapminder_unfiltered)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3313 obs. of 6 variables:
## $ country : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
#looking at the first few rows of the dataset
gapminder_unfiltered %>% head ()
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
#Let us check and count missing values
gapminder_unfiltered[!complete.cases(gapminder_unfiltered),]
## # A tibble: 0 × 6
## # ... with 6 variables: country <fctr>, continent <fctr>, year <int>,
## # lifeExp <dbl>, pop <int>, gdpPercap <dbl>
#We obtain an empty tibble, so there are no missing values in the dataset.
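The check above lists the incomplete rows rather than counting them. As a minimal sketch, the counts per column and per row could be obtained with base R as follows; the names na_per_column and n_incomplete are illustrative and not part of the original analysis.

```r
library(gapminder)

# Number of missing values in each column
na_per_column <- colSums(is.na(gapminder_unfiltered))
na_per_column

# Number of rows that contain at least one missing value
n_incomplete <- sum(!complete.cases(gapminder_unfiltered))
n_incomplete
```

Both results should be zero here, in line with the empty tibble returned above.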
#We can view a summary of the continuous variables (lifeExp, pop, gdpPercap) using the function summary(). The categorical variables country and continent are already factors, so they appear in the summary table as level counts. The variable year is stored as an integer, so its numeric summary is not very informative unless it is converted to a factor. Instead, we look at it separately using count() (along with country and continent).
summary (gapminder_unfiltered)
## country continent year lifeExp
## Czech Republic: 58 Africa : 637 Min. :1950 Min. :23.60
## Denmark : 58 Americas: 470 1st Qu.:1967 1st Qu.:58.33
## Finland : 58 Asia : 578 Median :1982 Median :69.61
## Iceland : 58 Europe :1302 Mean :1980 Mean :65.24
## Japan : 58 FSU : 139 3rd Qu.:1996 3rd Qu.:73.66
## Netherlands : 58 Oceania : 187 Max. :2007 Max. :82.67
## (Other) :2965
## pop gdpPercap
## Min. :5.941e+04 Min. : 241.2
## 1st Qu.:2.680e+06 1st Qu.: 2505.3
## Median :7.560e+06 Median : 7825.8
## Mean :3.177e+07 Mean : 11313.8
## 3rd Qu.:1.961e+07 3rd Qu.: 17355.8
## Max. :1.319e+09 Max. :113523.1
##
gapminder_unfiltered %>%
count (continent)
## # A tibble: 6 × 2
## continent n
## <fctr> <int>
## 1 Africa 637
## 2 Americas 470
## 3 Asia 578
## 4 Europe 1302
## 5 FSU 139
## 6 Oceania 187
gapminder_unfiltered %>%
count (country)
## # A tibble: 187 × 2
## country n
## <fctr> <int>
## 1 Afghanistan 12
## 2 Albania 12
## 3 Algeria 12
## 4 Angola 12
## 5 Argentina 12
## 6 Armenia 4
## 7 Aruba 8
## 8 Australia 56
## 9 Austria 57
## 10 Azerbaijan 4
## # ... with 177 more rows
gapminder_unfiltered %>%
count (year)
## # A tibble: 58 × 2
## year n
## <int> <int>
## 1 1950 39
## 2 1951 24
## 3 1952 144
## 4 1953 24
## 5 1954 24
## 6 1955 24
## 7 1956 24
## 8 1957 144
## 9 1958 25
## 10 1959 25
## # ... with 48 more rows
Using a combination of data transformation and visualization techniques, we try to answer the following questions:
#Q1. For the year 2007, what is the distribution of GDP per capita across all countries?
#The GDP per capita of each country is shown in the table. The plot gives a general picture of how GDP per capita is spread across the countries. (The individual country names were removed to prevent clutter on the axis.)
gdpctry<-gapminder_unfiltered %>%
filter(year == 2007) %>% group_by(country) %>%
select(continent,country,gdpPercap)
gdpctry
## Source: local data frame [183 x 3]
## Groups: country [183]
##
## continent country gdpPercap
## <fctr> <fctr> <dbl>
## 1 Asia Afghanistan 974.5803
## 2 Europe Albania 5937.0295
## 3 Africa Algeria 6223.3675
## 4 Africa Angola 4797.2313
## 5 Americas Argentina 12779.3796
## 6 FSU Armenia 4942.5439
## 7 Americas Aruba 27230.6752
## 8 Oceania Australia 34435.3674
## 9 Europe Austria 36126.4927
## 10 Asia Azerbaijan 7708.6112
## # ... with 173 more rows
ggplot(gdpctry, mapping = aes(x = country, y=gdpPercap, size=gdpPercap)) + geom_point(color='green')+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
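The point plot above shows one value per country. As an alternative view of the same 2007 data, a histogram shows the shape of the distribution more directly. This is only a sketch: the names gdp_2007 and p_hist, the bin count, the colours, and the log-scaled axis are choices made here, not part of the original analysis.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep only the 2007 observations
gdp_2007 <- gapminder_unfiltered %>%
  filter(year == 2007)

# Histogram of GDP per capita; the log scale spreads out the many
# low-income countries that would otherwise pile up near zero
p_hist <- ggplot(gdp_2007, aes(x = gdpPercap)) +
  geom_histogram(bins = 30, fill = "darkgreen", colour = "white") +
  scale_x_log10() +
  labs(x = "GDP per capita (international dollars, log scale)",
       y = "Number of countries")
p_hist
```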
#Q2. For the year 2007, how do the distributions differ across the different continents?
#I have computed a table of the median GDP per capita for each continent and drawn box plots of the data, which display the median, the quartiles (and hence the IQR), and the outliers.
gdpcon<-gapminder_unfiltered %>%
filter (year == 2007) %>%
group_by (continent) %>%
summarize (con_gdp = median(gdpPercap))
gdpcon
## # A tibble: 6 × 2
## continent con_gdp
## <fctr> <dbl>
## 1 Africa 1463.249
## 2 Americas 9065.801
## 3 Asia 4889.250
## 4 Europe 25885.565
## 5 FSU 10273.774
## 6 Oceania 5143.615
gapminder_unfiltered %>%
filter(year == 2007) %>%
ggplot(aes(x = continent, y = gdpPercap)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/8) +
coord_cartesian(ylim = c(0, 45000))
#To show the distributions across the continents more clearly, I have zoomed in using coord_cartesian(), which limits the visible range without dropping any data. To view the same data including the outliers beyond the limit, remove that call and run the rest of the code.
#Q3. For the year 2007, what are the top 10 countries with the largest GDP per capita?
#The top 10 countries are as follows:
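The quantities that a box plot draws can also be tabulated per continent. The following sketch extends the median table with the number of countries and the quartiles for 2007; the object name continent_summary and its column names are illustrative.

```r
library(gapminder)
library(dplyr)

# Per-continent summary of GDP per capita in 2007:
# count, lower quartile, median, upper quartile, and interquartile range
continent_summary <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(
    n_countries = n(),
    q1          = quantile(gdpPercap, 0.25),
    median_gdp  = median(gdpPercap),
    q3          = quantile(gdpPercap, 0.75),
    iqr         = IQR(gdpPercap)
  )
continent_summary
```

The median_gdp column should reproduce the con_gdp values in the table above, while q1, q3, and iqr describe the spread within each continent.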
top10<-gapminder_unfiltered %>%
filter (year == 2007) %>%
select(country,gdpPercap) %>%
top_n(10, wt = gdpPercap) %>%
arrange(desc(gdpPercap))
top10
## # A tibble: 10 × 2
## country gdpPercap
## <fctr> <dbl>
## 1 Qatar 82010.98
## 2 Macao, China 54589.82
## 3 Norway 49357.19
## 4 Brunei 48014.59
## 5 Kuwait 47306.99
## 6 Singapore 47143.18
## 7 United States 42951.65
## 8 Ireland 40676.00
## 9 Hong Kong, China 39724.98
## 10 Switzerland 37506.42
ggplot(data=top10) + geom_point(mapping = aes(x = country, y = gdpPercap, size=gdpPercap),color='blue')+coord_flip()
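By default ggplot2 orders a discrete axis by factor level, so the countries in the plot above appear alphabetically rather than by rank. One possible way to order them by GDP per capita is reorder(). The sketch below rebuilds the top-10 table from the printed values so that it runs on its own; the object name p_top10 is illustrative.

```r
library(ggplot2)

# Top 10 countries by GDP per capita in 2007, copied from the table above
top10 <- data.frame(
  country = c("Qatar", "Macao, China", "Norway", "Brunei", "Kuwait",
              "Singapore", "United States", "Ireland",
              "Hong Kong, China", "Switzerland"),
  gdpPercap = c(82010.98, 54589.82, 49357.19, 48014.59, 47306.99,
                47143.18, 42951.65, 40676.00, 39724.98, 37506.42)
)

# Turn country into a factor whose levels are sorted by increasing gdpPercap
top10$country <- reorder(top10$country, top10$gdpPercap)

# After coord_flip(), the highest value ends up at the top of the plot
p_top10 <- ggplot(data = top10) +
  geom_point(mapping = aes(x = country, y = gdpPercap, size = gdpPercap),
             color = "blue") +
  coord_flip()
p_top10
```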
#Q4. Plot the GDP per capita for your country of origin for all years available.
#The plot of the GDP per capita of India for all years available:
gapminder_unfiltered %>%
filter(country=='India') %>%
ggplot( mapping = aes(x = year,y=gdpPercap)) +geom_smooth()
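#Q5. What was the percent growth (or decline) in GDP per capita in 2007?
#The table shows the change in India's GDP per capita in 2007 relative to the previous observation (2002, since India is recorded every five years). Note that the growth column is the absolute change in international dollars, not a percentage: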
gapminder_unfiltered %>%
filter(country=='India') %>%
group_by(country) %>%
mutate( growth = gdpPercap - lag(gdpPercap)) %>%
select(country,year,gdpPercap, growth) %>%
arrange(year) %>%
top_n(1,wt=year)
## Source: local data frame [1 x 4]
## Groups: country [1]
##
## country year gdpPercap growth
## <fctr> <int> <dbl> <dbl>
## 1 India 2007 2452.21 705.441
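Because the growth column is an absolute difference, the percent growth still has to be derived from it. A minimal sketch, using India's 2002 and 2007 GDP per capita values as printed in this report (the variable names are illustrative):

```r
# GDP per capita of India, taken from the tables in this report
gdp_2002 <- 1746.7695
gdp_2007 <- 2452.2104

# Absolute change, matching the growth column above
abs_change <- gdp_2007 - gdp_2002

# Percent change relative to the 2002 value
pct_change <- abs_change / gdp_2002 * 100
round(pct_change, 1)
```

This gives a growth of roughly 40 percent over the five years from 2002 to 2007. Inside the dplyr pipeline, the same quantity could be computed with mutate(pct_growth = (gdpPercap / lag(gdpPercap) - 1) * 100).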
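#Q6. What has been the historical growth (or decline) in GDP per capita for your country?
#The table shows the change in India's GDP per capita between consecutive observations (five years apart). The first value is set to 0 because it has no preceding observation. The curve plots that change over time: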
histgrowth<-gapminder_unfiltered %>%
filter(country=='India') %>%
group_by(country) %>%
mutate( growth = gdpPercap - lag(gdpPercap)) %>%
mutate(growth=ifelse (is.na(growth),0,growth)) %>%
select(country,year,gdpPercap, growth)
histgrowth
## Source: local data frame [12 x 4]
## Groups: country [1]
##
## country year gdpPercap growth
## <fctr> <int> <dbl> <dbl>
## 1 India 1952 546.5657 0.00000
## 2 India 1957 590.0620 43.49625
## 3 India 1962 658.3472 68.28515
## 4 India 1967 700.7706 42.42346
## 5 India 1972 724.0325 23.26192
## 6 India 1977 813.3373 89.30480
## 7 India 1982 855.7235 42.38621
## 8 India 1987 976.5127 120.78914
## 9 India 1992 1164.4068 187.89413
## 10 India 1997 1458.8174 294.41063
## 11 India 2002 1746.7695 287.95201
## 12 India 2007 2452.2104 705.44095
ggplot(data=histgrowth) +
geom_smooth(mapping = aes(x =year,y=growth),color='hotpink')