Synopsis
This is my homework report for week 4. This week, the main focus is to investigate the Gapminder dataset through early exploratory data analysis, transforming and visualizing the data to answer questions about it.
We will do the following to explore the data:
- Use the dplyr package to perform common data transformation and manipulation tasks.
- Use the ggplot2 package to visually analyze the data and answer specific questions.
Source Code
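The report does not show the chunk that loads its packages. A minimal setup sketch follows, assuming the functions used below come from their usual packages. Attributing describe to Hmisc is an assumption, inferred from the shape of the describe output later in the report.

```r
# Assumed setup chunk; the original report does not show it.
library(Hmisc)      # describe() (assumed source); loaded first so dplyr's verbs are not masked
library(gapminder)  # provides gapminder and gapminder_unfiltered
library(dplyr)      # filter(), select(), arrange(), mutate(), %>%
library(ggplot2)    # ggplot() and the geom_*() layers
library(knitr)      # kable()
```

Loading Hmisc before dplyr is a deliberate choice in this sketch: both packages export a function named summarize, and the package attached last takes precedence.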
Details about the Data: The gapminder package gives us data on life expectancy, GDP per capita, and population by country. The main data frame, gapminder, has 1704 rows and 6 variables. The analysis below uses gapminder_unfiltered, a larger data frame from the same package with the same 6 variables.
The variables of gapminder are explained as follows:
- country - factor with 142 levels
- continent - factor with 5 levels
- year - ranges from 1952 to 2007 in increments of 5 years
- lifeExp - life expectancy at birth, in years
- pop - population
- gdpPercap - GDP per capita
Understanding the Data:
?gapminder
kable(head(gapminder))
| country | continent | year | lifeExp | pop | gdpPercap |
|:--|:--|--:|--:|--:|--:|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
Data Description
Displaying the first 5 and last 5 rows of the gapminder_unfiltered data set.
kable(head(gapminder_unfiltered,5))
| country | continent | year | lifeExp | pop | gdpPercap |
|:--|:--|--:|--:|--:|--:|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
kable(tail(gapminder_unfiltered,5))
| country | continent | year | lifeExp | pop | gdpPercap |
|:--|:--|--:|--:|--:|--:|
| Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.1573 |
| Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.4208 |
| Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.4500 |
| Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.0386 |
| Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.7093 |
Displaying the Names of the Variables
names(gapminder_unfiltered)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
Displaying the dimensions (number of rows and columns) of the dataset.
dim(gapminder_unfiltered)
## [1] 3313 6
Finding out the missing values in the data. As we can see, the data has no missing values.
sum(is.na(gapminder_unfiltered))
## [1] 0
which(is.na(gapminder_unfiltered), arr.ind = TRUE)
## row col
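The total above confirms there are no missing values overall. As a small complementary sketch, the same check can be broken down by column, which would show where gaps are if the data ever contained any. The variable name na_per_column is introduced here for illustration only.

```r
library(gapminder)

# Count missing values separately for each of the 6 columns
na_per_column <- colSums(is.na(gapminder_unfiltered))
na_per_column

# TRUE only if every row has a value in every column
all(complete.cases(gapminder_unfiltered))
```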
Displaying the number of and the values of unique countries, continents and years.
length(unique(gapminder_unfiltered$country))
## [1] 187
unique(gapminder_unfiltered$country)
## [1] Afghanistan Albania
## [3] Algeria Angola
## [5] Argentina Armenia
## [7] Aruba Australia
## [9] Austria Azerbaijan
## [11] Bahamas Bahrain
## [13] Bangladesh Barbados
## [15] Belarus Belgium
## [17] Belize Benin
## [19] Bhutan Bolivia
## [21] Bosnia and Herzegovina Botswana
## [23] Brazil Brunei
## [25] Bulgaria Burkina Faso
## [27] Burundi Cambodia
## [29] Cameroon Canada
## [31] Cape Verde Central African Republic
## [33] Chad Chile
## [35] China Colombia
## [37] Comoros Congo, Dem. Rep.
## [39] Congo, Rep. Costa Rica
## [41] Cote d'Ivoire Croatia
## [43] Cuba Cyprus
## [45] Czech Republic Denmark
## [47] Djibouti Dominican Republic
## [49] Ecuador Egypt
## [51] El Salvador Equatorial Guinea
## [53] Eritrea Estonia
## [55] Ethiopia Fiji
## [57] Finland France
## [59] French Guiana French Polynesia
## [61] Gabon Gambia
## [63] Georgia Germany
## [65] Ghana Greece
## [67] Grenada Guadeloupe
## [69] Guatemala Guinea
## [71] Guinea-Bissau Guyana
## [73] Haiti Honduras
## [75] Hong Kong, China Hungary
## [77] Iceland India
## [79] Indonesia Iran
## [81] Iraq Ireland
## [83] Israel Italy
## [85] Jamaica Japan
## [87] Jordan Kazakhstan
## [89] Kenya Korea, Dem. Rep.
## [91] Korea, Rep. Kuwait
## [93] Latvia Lebanon
## [95] Lesotho Liberia
## [97] Libya Lithuania
## [99] Luxembourg Macao, China
## [101] Madagascar Malawi
## [103] Malaysia Maldives
## [105] Mali Malta
## [107] Martinique Mauritania
## [109] Mauritius Mexico
## [111] Micronesia, Fed. Sts. Moldova
## [113] Mongolia Montenegro
## [115] Morocco Mozambique
## [117] Myanmar Namibia
## [119] Nepal Netherlands
## [121] Netherlands Antilles New Caledonia
## [123] New Zealand Nicaragua
## [125] Niger Nigeria
## [127] Norway Oman
## [129] Pakistan Panama
## [131] Papua New Guinea Paraguay
## [133] Peru Philippines
## [135] Poland Portugal
## [137] Puerto Rico Qatar
## [139] Reunion Romania
## [141] Russia Rwanda
## [143] Samoa Sao Tome and Principe
## [145] Saudi Arabia Senegal
## [147] Serbia Sierra Leone
## [149] Singapore Slovak Republic
## [151] Slovenia Solomon Islands
## [153] Somalia South Africa
## [155] Spain Sri Lanka
## [157] Sudan Suriname
## [159] Swaziland Sweden
## [161] Switzerland Syria
## [163] Taiwan Tajikistan
## [165] Tanzania Thailand
## [167] Timor-Leste Togo
## [169] Tonga Trinidad and Tobago
## [171] Tunisia Turkey
## [173] Turkmenistan Uganda
## [175] Ukraine United Arab Emirates
## [177] United Kingdom United States
## [179] Uruguay Uzbekistan
## [181] Vanuatu Venezuela
## [183] Vietnam West Bank and Gaza
## [185] Yemen, Rep. Zambia
## [187] Zimbabwe
## 187 Levels: Afghanistan Albania Algeria Angola Argentina Armenia ... Zimbabwe
length(unique(gapminder_unfiltered$continent))
## [1] 6
unique(gapminder_unfiltered$continent)
## [1] Asia Europe Africa Americas FSU Oceania
## Levels: Africa Americas Asia Europe FSU Oceania
length(unique(gapminder_unfiltered$year))
## [1] 58
unique(gapminder_unfiltered$year)
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 1950 1951
## [15] 1953 1954 1955 1956 1958 1959 1960 1961 1963 1964 1965 1966 1968 1969
## [29] 1970 1971 1973 1974 1975 1976 1978 1979 1980 1981 1983 1984 1985 1986
## [43] 1988 1989 1990 1991 1993 1994 1995 1996 1998 1999 2000 2001 2003 2004
## [57] 2005 2006
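The three counts above can also be collected in a single dplyr call. The sketch below does that with n_distinct, and additionally counts the rows per country, since the list of years suggests that not every country is observed at the same set of years. The names distinct_counts and obs_per_country are illustrative, not part of the original report.

```r
library(gapminder)
library(dplyr)

# Number of distinct countries, continents and years in one summary row
distinct_counts <- gapminder_unfiltered %>%
  summarise(
    countries  = n_distinct(country),
    continents = n_distinct(continent),
    years      = n_distinct(year)
  )
distinct_counts

# Number of observations (rows) available for each country
obs_per_country <- gapminder_unfiltered %>% count(country)
head(obs_per_country)
```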
Structure of the Data-Set
str(gapminder_unfiltered)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3313 obs. of 6 variables:
## $ country : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Basic Summary statistics of the Data Set.
describe(gapminder_unfiltered)
## gapminder_unfiltered
##
## 6 Variables 3313 Observations
## ---------------------------------------------------------------------------
## country
## n missing distinct
## 3313 0 187
##
## lowest : Afghanistan Albania Algeria Angola Argentina
## highest: Vietnam West Bank and Gaza Yemen, Rep. Zambia Zimbabwe
## ---------------------------------------------------------------------------
## continent
## n missing distinct
## 3313 0 6
##
## lowest : Africa Americas Asia Europe FSU
## highest: Americas Asia Europe FSU Oceania
##
## Africa (637, 0.192), Americas (470, 0.142), Asia (578, 0.174), Europe
## (1302, 0.393), FSU (139, 0.042), Oceania (187, 0.056)
## ---------------------------------------------------------------------------
## year
## n missing distinct Info Mean Gmd .05 .10
## 3313 0 58 0.998 1980 19.52 1952 1957
## .25 .50 .75 .90 .95
## 1967 1982 1996 2002 2007
##
## lowest : 1950 1951 1952 1953 1954, highest: 2003 2004 2005 2006 2007
## ---------------------------------------------------------------------------
## lifeExp
## n missing distinct Info Mean Gmd .05 .10
## 3313 0 2571 1 65.24 12.73 41.22 45.37
## .25 .50 .75 .90 .95
## 58.33 69.61 73.66 77.12 78.68
##
## lowest : 23.599 28.801 30.000 30.015 30.331
## highest: 82.208 82.270 82.360 82.603 82.670
## ---------------------------------------------------------------------------
## pop
## n missing distinct Info Mean Gmd .05
## 3313 0 3312 1 31773251 50168977 235605
## .10 .25 .50 .75 .90 .95
## 436150 2680018 7559776 19610538 56737055 121365965
##
## lowest : 59412 59461 60011 60427 61325
## highest: 1110396331 1164970000 1230075000 1280400000 1318683096
## ---------------------------------------------------------------------------
## gdpPercap
## n missing distinct Info Mean Gmd .05 .10
## 3313 0 3313 1 11314 11542 665.7 887.9
## .25 .50 .75 .90 .95
## 2505.3 7825.8 17355.7 26592.7 31534.9
##
## lowest : 241.1659 277.5519 298.8462 299.8503 312.1884
## highest: 82010.9780 95458.1118 108382.3529 109347.8670 113523.1329
## ---------------------------------------------------------------------------
Exploratory Data Analysis
For the year 2007, what is the distribution of GDP per capita across all countries?
# Keep only the 2007 observations and the columns needed for the plots
GDP_2007 <- gapminder_unfiltered %>% filter(year==2007) %>% select(continent, country, gdpPercap)
# Histogram: gdpPercap is on the x axis, the number of countries on the y axis
ggplot(GDP_2007, aes(x=gdpPercap)) + geom_histogram(fill="cyan", bins=40) +
xlab("GDP per Capita") +
ylab("Number of Countries") +
ggtitle("Distribution of GDP Per Capita for 2007 for all Countries")
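In 2007 the GDP per capita values range from a few hundred (Zimbabwe, about 470) to more than 80,000 (Qatar), so most countries are squeezed into the leftmost bins of a linear histogram. One possible variation, sketched here as a suggestion rather than as part of the original analysis, is to put the x axis on a log scale with scale_x_log10. The names gdp_2007 and p_log are illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# 2007 observations only
gdp_2007 <- gapminder_unfiltered %>% filter(year == 2007)

# Same histogram as above, but with a log10-scaled x axis
p_log <- ggplot(gdp_2007, aes(x = gdpPercap)) +
  geom_histogram(fill = "cyan", bins = 40) +
  scale_x_log10() +
  xlab("GDP per Capita (log scale)") +
  ylab("Number of Countries") +
  ggtitle("Distribution of GDP Per Capita for 2007 (log scale)")
p_log
```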

ggplot(GDP_2007, aes(x=country, y=gdpPercap)) + geom_point(aes(color = continent)) +
ylab("GDP per Capita") +
ggtitle("GDP Per Capita for Countries grouped by Continents for 2007") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())

For the year 2007, how do the distributions differ across the different continents?
ggplot(GDP_2007, aes(x=continent, y=gdpPercap)) + geom_bar(fill="firebrick", stat = "identity") +
xlab("Continents") +
ylab("GDP per Capita") +
ggtitle("GDP Per Capita vs Continents for 2007")

ggplot(GDP_2007, aes(x=continent, y=gdpPercap)) + geom_jitter(color="firebrick") +
xlab("Continents") +
ylab("GDP per Capita") +
ggtitle("GDP Per Capita vs Continents for 2007")
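A caveat on the bar chart above: with stat = "identity", geom_bar stacks one segment per country, so each bar's height is the sum of the per-country values in that continent rather than a typical value. A box plot is one common alternative for comparing distributions across groups. The sketch below is a suggestion, not the original author's plot, and the names gdp_2007, continent_summary and p_box are illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gdp_2007 <- gapminder_unfiltered %>% filter(year == 2007)

# Median GDP per capita and number of countries for each continent
continent_summary <- gdp_2007 %>%
  group_by(continent) %>%
  summarise(median_gdp = median(gdpPercap), n_countries = n())
continent_summary

# Box plot: one box per continent showing median, quartiles and outliers
p_box <- ggplot(gdp_2007, aes(x = continent, y = gdpPercap)) +
  geom_boxplot(fill = "firebrick", alpha = 0.5) +
  xlab("Continents") +
  ylab("GDP per Capita") +
  ggtitle("Distribution of GDP Per Capita by Continent for 2007")
p_box
```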

For the year 2007, what are the top 10 countries with the largest GDP per capita?
top_10_gdps <- GDP_2007[order(GDP_2007$gdpPercap, decreasing = TRUE),2:3][1:10,]
kable(top_10_gdps)
| country | gdpPercap |
|:--|--:|
| Qatar | 82010.98 |
| Macao, China | 54589.82 |
| Norway | 49357.19 |
| Brunei | 48014.59 |
| Kuwait | 47306.99 |
| Singapore | 47143.18 |
| United States | 42951.65 |
| Ireland | 40676.00 |
| Hong Kong, China | 39724.98 |
| Switzerland | 37506.42 |
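The top 10 selection above uses base R indexing with order. Since the report sets out to use dplyr for data manipulation, the same result can be sketched with dplyr verbs. This is an equivalent formulation under that assumption, not the code the report ran, and the name top_10 is illustrative.

```r
library(gapminder)
library(dplyr)

# Sort the 2007 rows by GDP per capita, largest first, and keep the first 10
top_10 <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  arrange(desc(gdpPercap)) %>%
  select(country, gdpPercap) %>%
  head(10)
top_10
```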
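# reorder() sorts the bars by GDP per capita (largest first) instead of alphabetically
ggplot(top_10_gdps, aes(x=reorder(country, -gdpPercap), y=gdpPercap)) + geom_bar(fill="palegreen2", stat = "identity") +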
xlab("Top 10 Countries") +
ylab("GDP per Capita") +
ggtitle("Top 10 GDP Per Capita vs Countries")

Plot the GDP per capita for your country of origin for all years available.
gdp_India <- gapminder_unfiltered %>% filter(country=="India") %>% select(year,gdpPercap)
ggplot(gdp_India, aes(x=year, y=gdpPercap)) + geom_point() + geom_smooth() +
xlab("Year") +
ylab("GDP per Capita") +
ggtitle("GDP Per Capita vs Year for INDIA")

What was the percent growth (or decline) in GDP per capita in 2007?
Percent.growth.India.2007 <- gapminder_unfiltered %>% arrange(year) %>% filter(country=="India")%>%
mutate(GDP_Growth = (gdpPercap - lag(gdpPercap))/lag(gdpPercap)*100) %>%
filter(year==2007)
Percent.growth.India.2007
## # A tibble: 1 x 7
## country continent year lifeExp pop gdpPercap GDP_Growth
## <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 India Asia 2007 64.698 1110396331 2452.21 40.38546
The percent growth in GDP per capita for India from 2002 to 2007 is about 40.4%.
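Because India has one observation every 5 years in this data, lag(gdpPercap) refers to the 2002 value, so the figure is a five-year growth rate. The arithmetic can be checked by hand with the two gdpPercap values copied from the report's output. The annualized rate at the end is an extra illustration that assumes constant compound growth over the 5 years; the variable names are illustrative.

```r
# gdpPercap values for India, copied from the output in this report
india_2002 <- 1746.7695
india_2007 <- 2452.2104

# Five-year percent growth: (new - old) / old * 100
growth_5yr <- (india_2007 - india_2002) / india_2002 * 100
growth_5yr

# Equivalent constant yearly rate, assuming compound growth over 5 years
annual_rate <- ((india_2007 / india_2002)^(1 / 5) - 1) * 100
annual_rate
```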
What has been the historical growth (or decline) in GDP per capita for your country?
Percent.growth.India.Historical <- gapminder_unfiltered %>% arrange(year) %>% filter(country=="India")%>%
mutate(GDP_Growth = (gdpPercap - lag(gdpPercap))/lag(gdpPercap)*100)
kable(Percent.growth.India.Historical)
| country | continent | year | lifeExp | pop | gdpPercap | GDP_Growth |
|:--|:--|--:|--:|--:|--:|--:|
| India | Asia | 1952 | 37.373 | 372000000 | 546.5657 | NA |
| India | Asia | 1957 | 40.249 | 409000000 | 590.0620 | 7.958100 |
| India | Asia | 1962 | 43.605 | 454000000 | 658.3472 | 11.572539 |
| India | Asia | 1967 | 47.193 | 506000000 | 700.7706 | 6.443935 |
| India | Asia | 1972 | 50.651 | 567000000 | 724.0325 | 3.319477 |
| India | Asia | 1977 | 54.208 | 634000000 | 813.3373 | 12.334362 |
| India | Asia | 1982 | 56.596 | 708000000 | 855.7235 | 5.211394 |
| India | Asia | 1987 | 58.553 | 788000000 | 976.5127 | 14.115440 |
| India | Asia | 1992 | 60.223 | 872000000 | 1164.4068 | 19.241341 |
| India | Asia | 1997 | 61.765 | 959000000 | 1458.8174 | 25.284173 |
| India | Asia | 2002 | 62.879 | 1034172547 | 1746.7695 | 19.738728 |
| India | Asia | 2007 | 64.698 | 1110396331 | 2452.2104 | 40.385464 |
ggplot(Percent.growth.India.Historical, aes(x=year, y=GDP_Growth)) + geom_line() +
xlab("Year") +
ylab("GDP Growth") +
ggtitle("GDP Growth vs Year for INDIA")
## Warning: Removed 1 rows containing missing values (geom_path).

Total.Percent.Growth <- gapminder_unfiltered %>% arrange(year) %>% filter(country=="India")%>%
mutate(GDP_Growth = (gdpPercap - first(gdpPercap))/first(gdpPercap)*100)%>%
filter(year==2007)
Total.Percent.Growth
## # A tibble: 1 x 7
## country continent year lifeExp pop gdpPercap GDP_Growth
## <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 India Asia 2007 64.698 1110396331 2452.21 348.6579
The historical percent growth in GDP per capita for India from 1952 to 2007 has been close to 350% (348.66%).
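The total growth and the per-period growth rates in the table are linked: multiplying the period growth factors (1 + rate) together gives the overall growth factor from 1952 to 2007. The sketch below checks this with the GDP_Growth values copied from the table, and also derives an average yearly rate under the assumption of constant compound growth over the 55 years. The variable names are illustrative.

```r
# Five-year GDP_Growth percentages for India (1957 to 2007), copied from the table
period_growth <- c(7.958100, 11.572539, 6.443935, 3.319477, 12.334362,
                   5.211394, 14.115440, 19.241341, 25.284173, 19.738728,
                   40.385464)

# Compounding the period growth factors gives the overall growth factor
overall_factor <- prod(1 + period_growth / 100)

# Total percent growth from 1952 to 2007
total_growth <- (overall_factor - 1) * 100
total_growth

# Average yearly rate over 2007 - 1952 = 55 years, assuming compound growth
yearly_rate <- (overall_factor^(1 / 55) - 1) * 100
yearly_rate
```

Note that the period rates cannot simply be added up: their plain sum is much smaller than the compounded total, which is why the factors are multiplied instead.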