When you need to explore “big data”, it is extremely important to:
create graphical visualisations;
generate basic summary statistics.
I will use R to accomplish both of these tasks with the gapminder data (available here).
TO BEGIN:
gDat <- read.delim("gapminderDataFiveYear.txt")
head(gDat, n = 3) #first three rows of gDat
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia   28.80     779.4
## 2 Afghanistan 1957  9240934      Asia   30.33     820.9
## 3 Afghanistan 1962 10267083      Asia   32.00     853.1
tail(gDat, n = 3) #last three rows of gDat
##       country year      pop continent lifeExp gdpPercap
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7
str(gDat) #data structure, one-line summary for each component within gDat
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
summary(gDat)
##         country          year           pop              continent
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09
##  (Other)    :1632
##     lifeExp       gdpPercap
##  Min.   :23.6   Min.   :   241
##  1st Qu.:48.2   1st Qu.:  1202
##  Median :60.7   Median :  3532
##  Mean   :59.5   Mean   :  7215
##  3rd Qu.:70.8   3rd Qu.:  9325
##  Max.   :82.6   Max.   :113523
##
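summary() describes each column over the whole data frame. A natural next step is to compute the same kind of statistic within groups. The sketch below is only an illustration: toy is a hypothetical six-row data frame typed in from the head() and tail() output above (using the rounded values as printed), so the numbers it produces describe those six rows, not the full data set. On the real data you would pass gDat instead of toy.

```r
# Hypothetical toy data: the six rows shown by head() and tail() above,
# with values as rounded in that printed output.
toy <- data.frame(
  country   = c(rep("Afghanistan", 3), rep("Zimbabwe", 3)),
  year      = c(1952, 1957, 1962, 1997, 2002, 2007),
  continent = c(rep("Asia", 3), rep("Africa", 3)),
  lifeExp   = c(28.80, 30.33, 32.00, 46.81, 39.99, 43.49),
  gdpPercap = c(779.4, 820.9, 853.1, 792.4, 672.0, 469.7)
)

# Mean life expectancy within each continent (one row per continent)
lifeByContinent <- aggregate(lifeExp ~ continent, data = toy, FUN = mean)
lifeByContinent

# Minimum and maximum GDP per capita within each continent
gdpRange <- tapply(toy$gdpPercap, toy$continent, range)
gdpRange
```

aggregate() returns a data frame, which is convenient for further plotting, while tapply() returns an array indexed by group, which is handy for quick look-ups such as gdpRange[["Africa"]].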
library(lattice)
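xyplot(lifeExp ~ year | continent, data = gDat) #life expectancy over time, one panel per continent
xyplot(lifeExp ~ gdpPercap, data = gDat, subset = year == 2002, groups = continent,
       auto.key = TRUE) #life expectancy vs GDP per capita in 2002, one colour per continent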
This is only a taste of the graphical visualisations you can create, but already we can observe a large spread in the data, some outliers, and trends that can be investigated further!
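One way to follow up is to put GDP per capita on a log scale. The summary() output shows a median of about 3,500 against a maximum above 113,000, so a few large values compress everything else on a linear axis. The sketch below uses lattice's scales argument for this. As an assumption for illustration, it runs on a hypothetical six-row data frame (toy) typed in from the printed head() and tail() rows rather than on the full data. With gDat you would swap in data = gDat and, if you like, the same subset = year == 2002 as before.

```r
library(lattice)

# Hypothetical toy data: the six rows shown by head() and tail(),
# with values as rounded in that printed output.
toy <- data.frame(
  country   = c(rep("Afghanistan", 3), rep("Zimbabwe", 3)),
  year      = c(1952, 1957, 1962, 1997, 2002, 2007),
  continent = c(rep("Asia", 3), rep("Africa", 3)),
  lifeExp   = c(28.80, 30.33, 32.00, 46.81, 39.99, 43.49),
  gdpPercap = c(779.4, 820.9, 853.1, 792.4, 672.0, 469.7)
)

# Same scatterplot as before, but with the x axis on a log10 scale
p <- xyplot(lifeExp ~ gdpPercap, data = toy, groups = continent,
            scales = list(x = list(log = 10)),
            auto.key = TRUE,
            xlab = "GDP per capita (log10 scale)",
            ylab = "Life expectancy (years)")
print(p) #lattice plots are objects; print() draws them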