Synopsis

In this Week 4 homework, we perform exploratory data analysis on the ‘gapminder_unfiltered’ dataset from the ‘gapminder’ package. As usual, we first explore the data and prepare it for analysis. We then use dplyr (part of the tidyverse) to compute summary statistics and ggplot2 to visualize the data, in order to derive some insights from it.

Packages Required

The following packages are required:

library(gapminder) # Provides the data we are going to work on
library(tidyverse) # Provides dplyr for data manipulation and ggplot2 for visualization

Source Code

There are 6 variables in the dataset: country, continent, year, lifeExp, pop, and gdpPercap. The variables country, continent, and year identify the country, the continent it belongs to, and the year of the observation. lifeExp is the life expectancy at birth (in years), and pop is the population of the country. gdpPercap is the per capita GDP (gross domestic product) given in units of international dollars, “a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time” (2005, in this case).

The data codebook can be opened with: ?gapminder_unfiltered

Data Description

Once we load the data we take a look at it:

# Check the dimensions of the data (number of observations and number of variables)

dim (gapminder_unfiltered) 
## [1] 3313    6
#let us look at the variables present

names (gapminder_unfiltered)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
#looking at the structure of the data

str (gapminder_unfiltered)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3313 obs. of  6 variables:
##  $ country  : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
#Look at the first few rows of the dataset

gapminder_unfiltered %>% head ()
## # A tibble: 6 × 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134
#Let us check for rows with missing values

gapminder_unfiltered[!complete.cases(gapminder_unfiltered),]
## # A tibble: 0 × 6
## # ... with 6 variables: country <fctr>, continent <fctr>, year <int>,
## #   lifeExp <dbl>, pop <int>, gdpPercap <dbl>
#We obtain an empty tibble, so the dataset contains no missing values.
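The check above lists incomplete rows rather than counting them. As a minimal sketch, assuming the gapminder package is installed, a per-column count of missing values can be obtained with base R. The names na_per_column and total_na are introduced here purely for illustration.

```r
library(gapminder)

# Count the missing values in each column of the dataset
na_per_column <- colSums(is.na(gapminder_unfiltered))
na_per_column

# Total number of missing values across all columns
total_na <- sum(na_per_column)
total_na
```

A total of zero agrees with the empty tibble returned by complete.cases().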
#We can view a summary of the continuous variables (lifeExp, pop, gdpPercap) using the summary() function. The categorical variables country and continent are already factors, so they appear in the summary table as counts. The variable year is stored as an integer, so its numeric summary is not very informative unless it is converted to a factor. Instead, we view it separately using count() (along with country and continent).

summary (gapminder_unfiltered)
##            country        continent         year         lifeExp     
##  Czech Republic:  58   Africa  : 637   Min.   :1950   Min.   :23.60  
##  Denmark       :  58   Americas: 470   1st Qu.:1967   1st Qu.:58.33  
##  Finland       :  58   Asia    : 578   Median :1982   Median :69.61  
##  Iceland       :  58   Europe  :1302   Mean   :1980   Mean   :65.24  
##  Japan         :  58   FSU     : 139   3rd Qu.:1996   3rd Qu.:73.66  
##  Netherlands   :  58   Oceania : 187   Max.   :2007   Max.   :82.67  
##  (Other)       :2965                                                 
##       pop              gdpPercap       
##  Min.   :5.941e+04   Min.   :   241.2  
##  1st Qu.:2.680e+06   1st Qu.:  2505.3  
##  Median :7.560e+06   Median :  7825.8  
##  Mean   :3.177e+07   Mean   : 11313.8  
##  3rd Qu.:1.961e+07   3rd Qu.: 17355.8  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
gapminder_unfiltered %>% 
  count (continent)
## # A tibble: 6 × 2
##   continent     n
##      <fctr> <int>
## 1    Africa   637
## 2  Americas   470
## 3      Asia   578
## 4    Europe  1302
## 5       FSU   139
## 6   Oceania   187
gapminder_unfiltered %>% 
  count (country)
## # A tibble: 187 × 2
##        country     n
##         <fctr> <int>
## 1  Afghanistan    12
## 2      Albania    12
## 3      Algeria    12
## 4       Angola    12
## 5    Argentina    12
## 6      Armenia     4
## 7        Aruba     8
## 8    Australia    56
## 9      Austria    57
## 10  Azerbaijan     4
## # ... with 177 more rows
gapminder_unfiltered %>% 
  count (year)
## # A tibble: 58 × 2
##     year     n
##    <int> <int>
## 1   1950    39
## 2   1951    24
## 3   1952   144
## 4   1953    24
## 5   1954    24
## 6   1955    24
## 7   1956    24
## 8   1957   144
## 9   1958    25
## 10  1959    25
## # ... with 48 more rows

Exploratory Data Analysis

Using a combination of data transformation and visualization techniques, we try to answer the following questions:

#Q1. For the year 2007, what is the distribution of GDP per capita across all countries?

#The GDP per capita of each country is shown in the table. The plot gives a general picture of how GDP per capita varies across the countries. (The individual country names were removed to prevent clutter on the axis.)

gdpctry <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  group_by(country) %>%
  select(continent, country, gdpPercap)
gdpctry
## Source: local data frame [183 x 3]
## Groups: country [183]
## 
##    continent     country  gdpPercap
##       <fctr>      <fctr>      <dbl>
## 1       Asia Afghanistan   974.5803
## 2     Europe     Albania  5937.0295
## 3     Africa     Algeria  6223.3675
## 4     Africa      Angola  4797.2313
## 5   Americas   Argentina 12779.3796
## 6        FSU     Armenia  4942.5439
## 7   Americas       Aruba 27230.6752
## 8    Oceania   Australia 34435.3674
## 9     Europe     Austria 36126.4927
## 10      Asia  Azerbaijan  7708.6112
## # ... with 173 more rows
ggplot(gdpctry, mapping = aes(x = country, y = gdpPercap, size = gdpPercap)) +
  geom_point(color = 'green') +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
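A point per country shows the individual values, but a histogram shows the shape of the distribution more directly. The following is a sketch only, not part of the original analysis: the names gdp2007, gdp_quantiles and gdp_hist, the choice of 30 bins, and the log-scaled x-axis are all assumptions made for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep only the 2007 observations
gdp2007 <- gapminder_unfiltered %>%
  filter(year == 2007)

# Minimum, quartiles and maximum of GDP per capita in 2007
gdp_quantiles <- quantile(gdp2007$gdpPercap, probs = c(0, 0.25, 0.5, 0.75, 1))
gdp_quantiles

# Histogram of GDP per capita; the log scale spreads out the many
# low-income countries that would otherwise crowd the left edge
gdp_hist <- ggplot(gdp2007, aes(x = gdpPercap)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)", y = "Number of countries")
gdp_hist
```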

#Q2. For the year 2007, how do the distributions differ across the different continents?

#The table shows the median GDP per capita for each continent in 2007. The box plots display the spread of GDP per capita within each continent (median, IQR and outliers).

gdpcon <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(con_gdp = median(gdpPercap))
gdpcon
## # A tibble: 6 × 2
##   continent   con_gdp
##      <fctr>     <dbl>
## 1    Africa  1463.249
## 2  Americas  9065.801
## 3      Asia  4889.250
## 4    Europe 25885.565
## 5       FSU 10273.774
## 6   Oceania  5143.615
# Restrict the data to 2007 so that the box plots answer the question as posed
ggplot(filter(gapminder_unfiltered, year == 2007), aes(x = continent, y = gdpPercap)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/8) +
  coord_cartesian(ylim = c(0, 45000))

#To show the distributions across the continents more clearly, I have zoomed in using coord_cartesian(), which changes the visible range without dropping any data. To see the extreme values as well, remove the coord_cartesian() call and rerun the rest of the code.
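The statistics that a box plot encodes can also be tabulated directly. The sketch below extends the median table to the full five-number summary plus the mean and the number of countries per continent; the name continent_stats and its column names are hypothetical and not part of the original analysis.

```r
library(gapminder)
library(dplyr)

# Summary statistics of 2007 GDP per capita for each continent
continent_stats <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(
    n_countries = n(),
    min_gdp     = min(gdpPercap),
    q1_gdp      = quantile(gdpPercap, 0.25),
    median_gdp  = median(gdpPercap),
    mean_gdp    = mean(gdpPercap),
    q3_gdp      = quantile(gdpPercap, 0.75),
    max_gdp     = max(gdpPercap)
  )
continent_stats
```

Comparing mean_gdp with median_gdp gives a quick indication of skew: a mean well above the median suggests a few very high-income countries pulling the average up.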
#Q3. For the year 2007, what are the top 10 countries with the largest GDP per capita?

#The top 10 countries are as follows:

top10 <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  select(country, gdpPercap) %>%
  top_n(10, wt = gdpPercap) %>%
  arrange(desc(gdpPercap))
top10
## # A tibble: 10 × 2
##             country gdpPercap
##              <fctr>     <dbl>
## 1             Qatar  82010.98
## 2      Macao, China  54589.82
## 3            Norway  49357.19
## 4            Brunei  48014.59
## 5            Kuwait  47306.99
## 6         Singapore  47143.18
## 7     United States  42951.65
## 8           Ireland  40676.00
## 9  Hong Kong, China  39724.98
## 10      Switzerland  37506.42
ggplot(data = top10) +
  geom_point(mapping = aes(x = country, y = gdpPercap, size = gdpPercap), color = 'blue') +
  coord_flip()

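#Q4. Plot the GDP per capita for your country of origin for all years available.

#The plot shows the GDP per capita of India for all years available. Note that geom_smooth() draws a loess-smoothed trend through the 12 observations rather than the raw values: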

gapminder_unfiltered %>%
  filter(country == 'India') %>%
  ggplot(mapping = aes(x = year, y = gdpPercap)) +
  geom_smooth()

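#Q5. What was the percent growth (or decline) in GDP per capita in 2007?

#The code reports the change in GDP per capita for India between its two most recent observations (2002 and 2007). Note that growth here is an absolute difference in international dollars, not a percentage: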

gapminder_unfiltered %>%
  filter(country == 'India') %>%
  group_by(country) %>%
  mutate(growth = gdpPercap - lag(gdpPercap)) %>%
  select(country, year, gdpPercap, growth) %>%
  arrange(year) %>%
  top_n(1, wt = year)
## Source: local data frame [1 x 4]
## Groups: country [1]
## 
##   country  year gdpPercap  growth
##    <fctr> <int>     <dbl>   <dbl>
## 1   India  2007   2452.21 705.441
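Since the question asks for a percentage, the absolute change can be divided by the previous observation. This is a minimal sketch under the assumption that "growth in 2007" means growth relative to India's preceding observation, which in this dataset is 2002. The names india_pct and pct_growth are introduced here for illustration.

```r
library(gapminder)
library(dplyr)

# Percent change in GDP per capita relative to the previous observation
india_pct <- gapminder_unfiltered %>%
  filter(country == "India") %>%
  arrange(year) %>%
  mutate(
    prev_year  = lag(year),
    pct_growth = 100 * (gdpPercap - lag(gdpPercap)) / lag(gdpPercap)
  ) %>%
  filter(year == 2007) %>%
  select(country, prev_year, year, gdpPercap, pct_growth)
india_pct
```

Because the observations for India are five years apart, pct_growth covers the whole 2002 to 2007 period rather than a single year.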
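#Q6. What has been the historical growth (or decline) in GDP per capita for your country?

#The table lists the change in India's GDP per capita between consecutive observations (the first observation has no predecessor, so its growth is set to 0), and the curve shows how that change has evolved over time: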

histgrowth <- gapminder_unfiltered %>%
  filter(country == 'India') %>%
  group_by(country) %>%
  mutate(growth = gdpPercap - lag(gdpPercap)) %>%
  mutate(growth = ifelse(is.na(growth), 0, growth)) %>%
  select(country, year, gdpPercap, growth)
histgrowth
## Source: local data frame [12 x 4]
## Groups: country [1]
## 
##    country  year gdpPercap    growth
##     <fctr> <int>     <dbl>     <dbl>
## 1    India  1952  546.5657   0.00000
## 2    India  1957  590.0620  43.49625
## 3    India  1962  658.3472  68.28515
## 4    India  1967  700.7706  42.42346
## 5    India  1972  724.0325  23.26192
## 6    India  1977  813.3373  89.30480
## 7    India  1982  855.7235  42.38621
## 8    India  1987  976.5127 120.78914
## 9    India  1992 1164.4068 187.89413
## 10   India  1997 1458.8174 294.41063
## 11   India  2002 1746.7695 287.95201
## 12   India  2007 2452.2104 705.44095
ggplot(data=histgrowth) + 
  geom_smooth(mapping = aes(x =year,y=growth),color='hotpink')
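The historical growth can also be condensed into a single figure for the whole period. The sketch below computes the overall growth ratio between India's first and last observations and the implied compound annual growth rate. The name india_overall and its columns are hypothetical, and a constant annual rate is a simplifying assumption rather than something the data shows.

```r
library(gapminder)
library(dplyr)

# Overall and annualized growth in GDP per capita for India
india_overall <- gapminder_unfiltered %>%
  filter(country == "India") %>%
  arrange(year) %>%
  summarize(
    first_year        = first(year),
    last_year         = last(year),
    # Ratio of the last observed value to the first observed value
    growth_ratio      = last(gdpPercap) / first(gdpPercap),
    total_pct_growth  = 100 * (growth_ratio - 1),
    # Constant yearly rate that would produce the same overall ratio
    annual_pct_growth = 100 * (growth_ratio^(1 / (last_year - first_year)) - 1)
  )
india_overall
```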