This notebook summarizes DataCamp’s ‘Introduction to the Tidyverse’ course in R.
Loading necessary libraries.
library(gapminder)
library(dplyr)
library(ggplot2)
options(scipen=999,digits=3)
Data Wrangling
The tidyverse is a powerful collection of packages for manipulating, visualizing, and modeling data. This notebook uses the gapminder dataset, which tracks socio-economic indicators such as life expectancy, population, and GDP per capita across countries over time. Each row is uniquely identified by country and year. The filter verb keeps only the subset of observations that satisfy a given condition. To chain multiple verbs, connect them with the pipe (%>%), which passes the result on its left as the first argument of the function on its right.
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
## 7 Afghanistan Asia 1982 39.9 12881816 978
## 8 Afghanistan Asia 1987 40.8 13867957 852
## 9 Afghanistan Asia 1992 41.7 16317921 649
## 10 Afghanistan Asia 1997 41.8 22227415 635
## # ... with 1,694 more rows
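As a small illustration of the pipe described above, the sketch below checks that `x %>% f(y)` gives the same result as `f(x, y)`. The data frame `mini` is a hypothetical five-row excerpt of gapminder values typed in by hand, used only so the example runs on its own.

```r
library(dplyr)

# Hypothetical mini dataset: a few gapminder rows typed in by hand
mini <- data.frame(
  country = c("Afghanistan", "Afghanistan", "China", "United States", "United States"),
  year    = c(1952L, 2007L, 2002L, 1957L, 2007L),
  lifeExp = c(28.8, 43.8, 72.0, 69.5, 78.2)
)

# The pipe passes its left-hand side as the first argument of the call on its right
piped  <- mini %>% filter(year == 2007)
nested <- filter(mini, year == 2007)

# Both forms should produce the same data frame
identical(piped, nested)
```

The piped form tends to read better once several verbs are chained, because the steps appear in the order they are applied.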
# Filter with one condition
gapminder %>% filter(year==2007)
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975
## 2 Albania Europe 2007 76.4 3600523 5937
## 3 Algeria Africa 2007 72.3 33333216 6223
## 4 Angola Africa 2007 42.7 12420476 4797
## 5 Argentina Americas 2007 75.3 40301927 12779
## 6 Australia Oceania 2007 81.2 20434176 34435
## 7 Austria Europe 2007 79.8 8199783 36126
## 8 Bahrain Asia 2007 75.6 708573 29796
## 9 Bangladesh Asia 2007 64.1 150448339 1391
## 10 Belgium Europe 2007 79.4 10392226 33693
## # ... with 132 more rows
# Filter the gapminder dataset for the year 1957
gapminder %>% filter(year==1957)
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1957 30.3 9240934 821
## 2 Albania Europe 1957 59.3 1476505 1942
## 3 Algeria Africa 1957 45.7 10270856 3014
## 4 Angola Africa 1957 32.0 4561361 3828
## 5 Argentina Americas 1957 64.4 19610538 6857
## 6 Australia Oceania 1957 70.3 9712569 10950
## 7 Austria Europe 1957 67.5 6965860 8843
## 8 Bahrain Asia 1957 53.8 138655 11636
## 9 Bangladesh Asia 1957 39.3 51365468 662
## 10 Belgium Europe 1957 69.2 8989111 9715
## # ... with 132 more rows
gapminder %>% filter(country =="United States")
## # A tibble: 12 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 1952 68.4 157553000 13990
## 2 United States Americas 1957 69.5 171984000 14847
## 3 United States Americas 1962 70.2 186538000 16173
## 4 United States Americas 1967 70.8 198712000 19530
## 5 United States Americas 1972 71.3 209896000 21806
## 6 United States Americas 1977 73.4 220239000 24073
## 7 United States Americas 1982 74.6 232187835 25010
## 8 United States Americas 1987 75.0 242803533 29884
## 9 United States Americas 1992 76.1 256894189 32004
## 10 United States Americas 1997 76.8 272911760 35767
## 11 United States Americas 2002 77.3 287675526 39097
## 12 United States Americas 2007 78.2 301139947 42952
# Filter with 2 conditions
gapminder %>% filter(year==2007,country=="United States")
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952
# Filter for China in 2002
gapminder %>%
  filter(year==2002,country=="China")
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 China Asia 2002 72.0 1280400000 3119
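A short sketch of how multiple conditions combine in filter: conditions separated by commas must all hold, which is the same as joining them with `&`. The `|` and `%in%` variants are not used in this notebook and are shown only as an aside. `mini` is a hypothetical hand-typed excerpt of gapminder values.

```r
library(dplyr)

# Hypothetical mini dataset: a few gapminder rows typed in by hand
mini <- data.frame(
  country = c("Afghanistan", "Afghanistan", "China", "United States", "United States"),
  year    = c(1952L, 2007L, 2002L, 1957L, 2007L),
  lifeExp = c(28.8, 43.8, 72.0, 69.5, 78.2)
)

# Comma-separated conditions are combined with AND
both_comma <- mini %>% filter(year == 2007, country == "United States")
both_and   <- mini %>% filter(year == 2007 & country == "United States")

# Aside: | keeps rows matching either condition, %in% matches any value in a set
either   <- mini %>% filter(year == 2007 | country == "China")
selected <- mini %>% filter(country %in% c("China", "United States"))
```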
The arrange verb sorts the rows by one or more variables, in ascending order by default or in descending order with desc(). It is useful when you are interested in extreme values.
gapminder %>% arrange(gdpPercap)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Congo, Dem. Rep. Africa 2002 45.0 55379852 241
## 2 Congo, Dem. Rep. Africa 2007 46.5 64606759 278
## 3 Lesotho Africa 1952 42.1 748747 299
## 4 Guinea-Bissau Africa 1952 32.5 580653 300
## 5 Congo, Dem. Rep. Africa 1997 42.6 47798986 312
## 6 Eritrea Africa 1952 35.9 1438760 329
## 7 Myanmar Asia 1952 36.3 20092996 331
## 8 Lesotho Africa 1957 45.0 813338 336
## 9 Burundi Africa 1952 39.0 2445618 339
## 10 Eritrea Africa 1957 38.0 1542611 344
## # ... with 1,694 more rows
gapminder %>% arrange(desc(gdpPercap))
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1957 58.0 212846 113523
## 2 Kuwait Asia 1972 67.7 841934 109348
## 3 Kuwait Asia 1952 55.6 160000 108382
## 4 Kuwait Asia 1962 60.5 358266 95458
## 5 Kuwait Asia 1967 64.6 575003 80895
## 6 Kuwait Asia 1977 69.3 1140357 59265
## 7 Norway Europe 2007 80.2 4627926 49357
## 8 Kuwait Asia 2007 77.6 2505559 47307
## 9 Singapore Asia 2007 80.0 4553009 47143
## 10 Norway Europe 2002 79.0 4535591 44684
## # ... with 1,694 more rows
# Sort in ascending order of lifeExp
gapminder %>% arrange(lifeExp)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Rwanda Africa 1992 23.6 7290203 737
## 2 Afghanistan Asia 1952 28.8 8425333 779
## 3 Gambia Africa 1952 30.0 284320 485
## 4 Angola Africa 1952 30.0 4232095 3521
## 5 Sierra Leone Africa 1952 30.3 2143249 880
## 6 Afghanistan Asia 1957 30.3 9240934 821
## 7 Cambodia Asia 1977 31.2 6978607 525
## 8 Mozambique Africa 1952 31.3 6446316 469
## 9 Sierra Leone Africa 1957 31.6 2295678 1004
## 10 Burkina Faso Africa 1952 32.0 4469979 543
## # ... with 1,694 more rows
# Sort in descending order of lifeExp
gapminder %>% arrange(desc(lifeExp))
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725
## 3 Japan Asia 2002 82.0 127065841 28605
## 4 Iceland Europe 2007 81.8 301931 36181
## 5 Switzerland Europe 2007 81.7 7554661 37506
## 6 Hong Kong, China Asia 2002 81.5 6762476 30209
## 7 Australia Oceania 2007 81.2 20434176 34435
## 8 Spain Europe 2007 80.9 40448191 28821
## 9 Sweden Europe 2007 80.9 9031088 33860
## 10 Israel Asia 2007 80.7 6426679 25523
## # ... with 1,694 more rows
# Combining two verbs
gapminder %>% filter(year==2007) %>% arrange(desc(gdpPercap))
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Norway Europe 2007 80.2 4627926 49357
## 2 Kuwait Asia 2007 77.6 2505559 47307
## 3 Singapore Asia 2007 80.0 4553009 47143
## 4 United States Americas 2007 78.2 301139947 42952
## 5 Ireland Europe 2007 78.9 4109086 40676
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725
## 7 Switzerland Europe 2007 81.7 7554661 37506
## 8 Netherlands Europe 2007 79.8 16570613 36798
## 9 Canada Americas 2007 80.7 33390141 36319
## 10 Iceland Europe 2007 81.8 301931 36181
## # ... with 132 more rows
# Filter for the year 1957, then arrange in descending order of population
gapminder %>% filter(year==1957) %>% arrange(desc(pop))
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 China Asia 1957 50.5 637408000 576
## 2 India Asia 1957 40.2 409000000 590
## 3 United States Americas 1957 69.5 171984000 14847
## 4 Japan Asia 1957 65.5 91563009 4318
## 5 Indonesia Asia 1957 39.9 90124000 859
## 6 Germany Europe 1957 69.1 71019069 10188
## 7 Brazil Americas 1957 53.3 65551171 2487
## 8 United Kingdom Europe 1957 70.4 51430000 11283
## 9 Bangladesh Asia 1957 39.3 51365468 662
## 10 Italy Europe 1957 67.8 49182000 6249
## # ... with 132 more rows
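arrange also accepts several variables, in which case later variables break ties among earlier ones. A minimal sketch, again using a hypothetical hand-typed excerpt called `mini` rather than the full dataset:

```r
library(dplyr)

# Hypothetical mini dataset: a few gapminder rows typed in by hand
mini <- data.frame(
  country = c("Afghanistan", "Afghanistan", "China", "United States", "United States"),
  year    = c(1952L, 2007L, 2002L, 1957L, 2007L),
  lifeExp = c(28.8, 43.8, 72.0, 69.5, 78.2)
)

# Sort by country (ascending), then by year within each country (descending)
sorted <- mini %>% arrange(country, desc(year))
sorted
```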
The mutate verb modifies existing variables or creates new ones.
# Changing existing variables
gapminder %>% mutate(pop=pop/1000000)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8.43 779
## 2 Afghanistan Asia 1957 30.3 9.24 821
## 3 Afghanistan Asia 1962 32.0 10.3 853
## 4 Afghanistan Asia 1967 34.0 11.5 836
## 5 Afghanistan Asia 1972 36.1 13.1 740
## 6 Afghanistan Asia 1977 38.4 14.9 786
## 7 Afghanistan Asia 1982 39.9 12.9 978
## 8 Afghanistan Asia 1987 40.8 13.9 852
## 9 Afghanistan Asia 1992 41.7 16.3 649
## 10 Afghanistan Asia 1997 41.8 22.2 635
## # ... with 1,694 more rows
# Use mutate to change lifeExp to be in months
gapminder %>% mutate(lifeExp = lifeExp * 12)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 346 8425333 779
## 2 Afghanistan Asia 1957 364 9240934 821
## 3 Afghanistan Asia 1962 384 10267083 853
## 4 Afghanistan Asia 1967 408 11537966 836
## 5 Afghanistan Asia 1972 433 13079460 740
## 6 Afghanistan Asia 1977 461 14880372 786
## 7 Afghanistan Asia 1982 478 12881816 978
## 8 Afghanistan Asia 1987 490 13867957 852
## 9 Afghanistan Asia 1992 500 16317921 649
## 10 Afghanistan Asia 1997 501 22227415 635
## # ... with 1,694 more rows
# Adding new variables
gapminder %>% mutate(grossgdp = pop * gdpPercap)
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap grossgdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779 6567086330
## 2 Afghanistan Asia 1957 30.3 9240934 821 7585448670
## 3 Afghanistan Asia 1962 32.0 10267083 853 8758855797
## 4 Afghanistan Asia 1967 34.0 11537966 836 9648014150
## 5 Afghanistan Asia 1972 36.1 13079460 740 9678553274
## 6 Afghanistan Asia 1977 38.4 14880372 786 11697659231
## 7 Afghanistan Asia 1982 39.9 12881816 978 12598563401
## 8 Afghanistan Asia 1987 40.8 13867957 852 11820990309
## 9 Afghanistan Asia 1992 41.7 16317921 649 10595901589
## 10 Afghanistan Asia 1997 41.8 22227415 635 14121995875
## # ... with 1,694 more rows
# Use mutate to create a new column called lifeExpMonths
gapminder %>% mutate(lifeExpMonths = lifeExp * 12 )
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap lifeExpMonths
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779 346
## 2 Afghanistan Asia 1957 30.3 9240934 821 364
## 3 Afghanistan Asia 1962 32.0 10267083 853 384
## 4 Afghanistan Asia 1967 34.0 11537966 836 408
## 5 Afghanistan Asia 1972 36.1 13079460 740 433
## 6 Afghanistan Asia 1977 38.4 14880372 786 461
## 7 Afghanistan Asia 1982 39.9 12881816 978 478
## 8 Afghanistan Asia 1987 40.8 13867957 852 490
## 9 Afghanistan Asia 1992 41.7 16317921 649 500
## 10 Afghanistan Asia 1997 41.8 22227415 635 501
## # ... with 1,694 more rows
# Combining 3 verbs
gapminder %>% mutate(grossgdp = pop * gdpPercap) %>%
  filter(year==2007) %>%
  arrange(desc(grossgdp))
## # A tibble: 142 x 7
## country continent year lifeExp pop gdpPercap grossgdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952 1.29e13
## 2 China Asia 2007 73.0 1318683096 4959 6.54e12
## 3 Japan Asia 2007 82.6 127467972 31656 4.04e12
## 4 India Asia 2007 64.7 1110396331 2452 2.72e12
## 5 Germany Europe 2007 79.4 82400996 32170 2.65e12
## 6 United Kingdom Europe 2007 79.4 60776238 33203 2.02e12
## 7 France Europe 2007 80.7 61083916 30470 1.86e12
## 8 Brazil Americas 2007 72.4 190010647 9066 1.72e12
## 9 Italy Europe 2007 80.5 58147733 28570 1.66e12
## 10 Mexico Americas 2007 76.2 108700891 11978 1.30e12
## # ... with 132 more rows
# Find the countries with the highest life expectancy, in months, in the year 2007
# Filter, mutate, and arrange the gapminder dataset
gapminder %>% filter(year==2007) %>%
  mutate(lifeExpMonths = lifeExp * 12) %>%
  arrange(desc(lifeExpMonths))
## # A tibble: 142 x 7
## country continent year lifeExp pop gdpPercap lifeExpMonths
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Japan Asia 2007 82.6 1.27e8 31656 991
## 2 Hong Kong, China Asia 2007 82.2 6.98e6 39725 986
## 3 Iceland Europe 2007 81.8 3.02e5 36181 981
## 4 Switzerland Europe 2007 81.7 7.55e6 37506 980
## 5 Australia Oceania 2007 81.2 2.04e7 34435 975
## 6 Spain Europe 2007 80.9 4.04e7 28821 971
## 7 Sweden Europe 2007 80.9 9.03e6 33860 971
## 8 Israel Asia 2007 80.7 6.43e6 25523 969
## 9 France Europe 2007 80.7 6.11e7 30470 968
## 10 Canada Americas 2007 80.7 3.34e7 36319 968
## # ... with 132 more rows
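Two details of mutate are easy to miss: it returns a modified copy rather than changing the original data frame, and a column created in a mutate call can be used by later expressions in the same call. The sketch below illustrates both with a hypothetical three-row excerpt of 2007 values; the `lifeExpDays` column and its 30-day month are assumptions made purely for illustration.

```r
library(dplyr)

# Hypothetical mini dataset: 2007 life expectancies typed in by hand
mini <- data.frame(
  country = c("Afghanistan", "China", "United States"),
  lifeExp = c(43.8, 73.0, 78.2)
)

# mutate returns a modified copy; mini itself is unchanged unless reassigned
with_months <- mini %>%
  mutate(lifeExpMonths = lifeExp * 12,
         lifeExpDays   = lifeExpMonths * 30)  # rough 30-day months, illustration only

names(mini)         # the original columns only
names(with_months)  # the original columns plus the two new ones
```

To keep the new columns, assign the result to a name, as done here with `with_months`.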
Visualizing with ggplot2
When plotting more than two variables, a common choice is to map a categorical (factor) variable to color and a numeric variable to size.
# Does higher GDP lead to higher lifeExp (because of other factors like better healthcare etc.)?
ggplot(data=gapminder,aes(x=gdpPercap,y=lifeExp)) + geom_point()
# A log scale works better when one variable is densely concentrated over a narrow range of values
ggplot(data=gapminder,aes(x=log(gdpPercap),y=lifeExp)) + geom_point()
# A cleaner alternative is scale_x_log10(): the axis label stays gdpPercap and the ticks show the original values
# (x is mapped to gdpPercap itself here; keeping log(gdpPercap) would apply the logarithm twice)
ggplot(data=gapminder,aes(x=gdpPercap,y=lifeExp)) + geom_point() +
  scale_x_log10()
# Create gapminder_1952
gapminder_1952 <- filter(.data=gapminder,year==1952)
# Create gapminder_2007
gapminder_2007 <- filter(.data=gapminder,year==2007)
# Population vs GDP in 1952
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point()
# Cleaning it up
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
# Create a scatter plot with pop on the x-axis and lifeExp on the y-axis
ggplot(data=gapminder_1952,aes(x=pop,y=lifeExp)) + geom_point()
# To clean it up
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
# Working with >2 variables
ggplot(data=gapminder_2007,aes(x=gdpPercap,y=lifeExp,color=continent)) + geom_point() +
  scale_x_log10()
ggplot(data=gapminder_2007,aes(x=gdpPercap,y=lifeExp,color=continent,size=pop)) + geom_point() +
  scale_x_log10()
# Scatter plot comparing pop and lifeExp, with color representing continent
ggplot(data=gapminder_1952,aes(x=pop,y=lifeExp,color=continent)) +
  geom_point() +
  scale_x_log10()
# Add the size aesthetic to represent a country's gdpPercap
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent,
                           size=gdpPercap)) +
  geom_point() +
  scale_x_log10()
# Instead of showing every category in a single panel, faceting draws one subplot per category (here five, one per continent)
ggplot(data=gapminder_2007,aes(x=gdpPercap,y=lifeExp)) + geom_point() +
  scale_x_log10() +
  facet_wrap(~continent)
# Scatter plot comparing pop and lifeExp, faceted by continent
ggplot(data=gapminder_1952,aes(x=pop,y=lifeExp)) + geom_point() +
  scale_x_log10() +
  facet_wrap(~continent)
# Scatter plot comparing gdpPercap and lifeExp, with color representing continent
# and size representing population, faceted by year
ggplot(data=gapminder,aes(x=gdpPercap,y=lifeExp,color=continent,
                          size = pop)) + geom_point() +
  scale_x_log10() +
  facet_wrap(~year)
Grouping and summarizing
The summarize verb collapses many rows into one by aggregating them. To aggregate within groups instead of over the whole dataset, call group_by first. group_by and summarize are often used together.
# Finding mean life exp across all years all continents
gapminder %>% summarize(meanLifeExp = mean(lifeExp))
## # A tibble: 1 x 1
## meanLifeExp
## <dbl>
## 1 59.5
# Summarize to find the median life expectancy
gapminder %>% summarize(medianLifeExp = median(lifeExp))
## # A tibble: 1 x 1
## medianLifeExp
## <dbl>
## 1 60.7
# Avg Life Exp in 2007
gapminder %>% filter(year==2007) %>% summarize(meanLifeExp = mean(lifeExp))
## # A tibble: 1 x 1
## meanLifeExp
## <dbl>
## 1 67.0
# Avg life Exp and total pop in 2007
gapminder %>% filter(year==2007) %>% summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
## # A tibble: 1 x 2
## meanLifeExp totalPop
## <dbl> <dbl>
## 1 67.0 6251013179
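The as.numeric(pop) in the call above matters because pop is stored as an integer (shown as `<int>` in the printed tibbles) and R integers are 32-bit: a sum above roughly 2.1 billion overflows to NA with a warning. A minimal sketch using the 2007 populations of China and India that appear in the output earlier in this notebook:

```r
# 2007 populations of China and India, stored as 32-bit integers
pops <- c(1318683096L, 1110396331L)

# The integer sum exceeds .Machine$integer.max, so R warns and returns NA
int_total <- suppressWarnings(sum(pops))

# Converting to double first gives the correct total
dbl_total <- sum(as.numeric(pops))

int_total
dbl_total
```

The warning is suppressed here only to keep the example quiet; in normal use it is a helpful signal that a conversion is needed.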
# Filter for 1957 then summarize the median life expectancy
gapminder %>% filter(year==1957) %>% summarize(medianLifeExp = median(lifeExp))
## # A tibble: 1 x 1
## medianLifeExp
## <dbl>
## 1 48.4
# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>% filter(year==1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 1 x 2
## medianLifeExp maxGdpPercap
## <dbl> <dbl>
## 1 48.4 113523
# Avg life Exp and total pop in each year
gapminder %>% group_by(year) %>% summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
## # A tibble: 12 x 3
## year meanLifeExp totalPop
## <int> <dbl> <dbl>
## 1 1952 49.1 2406957150
## 2 1957 51.5 2664404580
## 3 1962 53.6 2899782974
## 4 1967 55.7 3217478384
## 5 1972 57.6 3576977158
## 6 1977 59.6 3930045807
## 7 1982 61.5 4289436840
## 8 1987 63.2 4691477418
## 9 1992 64.2 5110710260
## 10 1997 65.0 5515204472
## 11 2002 65.7 5886977579
## 12 2007 67.0 6251013179
# Avg life Exp and total pop in each continent in 2007
gapminder %>% filter(year==2007) %>%
  group_by(continent) %>%
  summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
## # A tibble: 5 x 3
## continent meanLifeExp totalPop
## <fct> <dbl> <dbl>
## 1 Africa 54.8 929539692
## 2 Americas 73.6 898871184
## 3 Asia 70.7 3811953827
## 4 Europe 77.6 586098529
## 5 Oceania 80.7 24549947
# Avg life Exp and total pop in each year and continent
gapminder %>% group_by(year,continent) %>%
  summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
## # A tibble: 60 x 4
## # Groups: year [?]
## year continent meanLifeExp totalPop
## <int> <fct> <dbl> <dbl>
## 1 1952 Africa 39.1 237640501
## 2 1952 Americas 53.3 345152446
## 3 1952 Asia 46.3 1395357351
## 4 1952 Europe 64.4 418120846
## 5 1952 Oceania 69.3 10686006
## 6 1957 Africa 41.3 264837738
## 7 1957 Americas 56.0 386953916
## 8 1957 Asia 49.3 1562780599
## 9 1957 Europe 66.7 437890351
## 10 1957 Oceania 70.3 11941976
## # ... with 50 more rows
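The `# Groups: year [?]` line in the output above shows that the result is still grouped by year: after grouping by two variables, summarize drops only the last grouping level. The sketch below reproduces this on a hypothetical eight-row excerpt (population values typed in by hand from the 1957 and 2007 outputs in this notebook) and uses ungroup() to remove the remaining grouping.

```r
library(dplyr)

# Hypothetical mini dataset: Afghanistan, Bangladesh, Albania, Austria in 1957 and 2007
mini <- data.frame(
  year      = c(1957L, 1957L, 1957L, 1957L, 2007L, 2007L, 2007L, 2007L),
  continent = c("Asia", "Asia", "Europe", "Europe", "Asia", "Asia", "Europe", "Europe"),
  pop       = c(9240934, 51365468, 1476505, 6965860,
                31889923, 150448339, 3600523, 8199783)
)

by_year_continent <- mini %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(pop))

group_vars(by_year_continent)           # grouping that remains after summarize
group_vars(ungroup(by_year_continent))  # no grouping left
```

Leftover grouping affects any verb applied afterwards, so calling ungroup() once the summary is complete is a reasonable habit.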
# Find median life expectancy and maximum GDP per capita in each year
gapminder %>% group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 12 x 3
## year medianLifeExp maxGdpPercap
## <int> <dbl> <dbl>
## 1 1952 45.1 108382
## 2 1957 48.4 113523
## 3 1962 50.9 95458
## 4 1967 53.8 80895
## 5 1972 56.5 109348
## 6 1977 59.7 59265
## 7 1982 62.4 33693
## 8 1987 65.8 31541
## 9 1992 67.7 34933
## 10 1997 69.4 41283
## 11 2002 70.8 44684
## 12 2007 71.9 49357
# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>% filter(year==1957) %>% group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 5 x 3
## continent medianLifeExp maxGdpPercap
## <fct> <dbl> <dbl>
## 1 Africa 40.6 5487
## 2 Americas 56.1 14847
## 3 Asia 48.3 113523
## 4 Europe 67.6 17909
## 5 Oceania 70.3 12247
# Find median life expectancy and maximum GDP per capita in each year/continent combination
gapminder %>% group_by(year,continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 60 x 4
## # Groups: year [?]
## year continent medianLifeExp maxGdpPercap
## <int> <fct> <dbl> <dbl>
## 1 1952 Africa 38.8 4725
## 2 1952 Americas 54.7 13990
## 3 1952 Asia 44.9 108382
## 4 1952 Europe 65.9 14734
## 5 1952 Oceania 69.3 10557
## 6 1957 Africa 40.6 5487
## 7 1957 Americas 56.1 14847
## 8 1957 Asia 48.3 113523
## 9 1957 Europe 67.6 17909
## 10 1957 Oceania 70.3 12247
## # ... with 50 more rows
Visualizing summarized data: a plot can be misleading when its y-axis does not start at zero. Adding expand_limits(y=0) forces the axis to include zero.
by_year <- gapminder %>% group_by(year) %>%
summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
by_year_continent <- gapminder %>% group_by(year,continent) %>%
summarize(meanLifeExp = mean(lifeExp),totalPop =sum(as.numeric(pop)))
# Visualizing population over time
ggplot(data=by_year, aes(x=year,y=totalPop)) + geom_point()
# Visualizing population over time, starting at zero
ggplot(data=by_year, aes(x=year,y=totalPop)) + geom_point() + expand_limits(y=0)
# Create a scatter plot showing the change in meanLifeExp over time
ggplot(data=by_year,aes(x=year,y=meanLifeExp)) + geom_point() +
  expand_limits(y=0)
# Visualizing population over time, starting at zero, for each continent
ggplot(data=by_year_continent, aes(x=year,y=totalPop,color=continent)) + geom_point() + expand_limits(y=0)
# Summarize medianGdpPercap within each continent within each year: by_year_continent2
by_year_continent2 <- gapminder %>% group_by(continent,year) %>%
summarize(medianGdpPercap=median(gdpPercap))
# Plot the change in medianGdpPercap in each continent over time
ggplot(data=by_year_continent2,aes(x=year,y=medianGdpPercap,color=continent)) + geom_point() + expand_limits(y=0)
# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>% filter(year==2007) %>%
group_by(continent) %>%
summarize(medianLifeExp = median(lifeExp),
medianGdpPercap = median(gdpPercap))
# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(data=by_continent_2007,aes(x=medianGdpPercap,y=medianLifeExp,color=continent)) + geom_point()
Types of visualization
Line plots are useful for showing a trend over time. Bar charts are good at showing a statistic for each of several categories. Histograms describe the distribution of a one-dimensional numeric variable. Boxplots compare the distribution of a numeric variable across categories.
# Summarize the median gdpPercap by year, then save it as by_year
by_year <- gapminder %>% group_by(year) %>%
summarize(medianGdpPercap = median(gdpPercap))
# Create a line plot showing the change in medianGdpPercap over time
ggplot(data=by_year,aes(x=year,y=medianGdpPercap)) + geom_line() +
  expand_limits(y=0)
# Summarize the median gdpPercap by year & continent, save as by_year_continent
by_year_continent <- gapminder %>% group_by(year,continent) %>%
summarize(medianGdpPercap=median(gdpPercap))
# Create a line plot showing the change in medianGdpPercap by continent over time
ggplot(data=by_year_continent,aes(x=year,y=medianGdpPercap,color=continent)) + geom_line() + expand_limits(y=0)
Bar plots are useful for exploring data across discrete categories.
by_continent <- gapminder %>% filter(year==2007) %>% group_by(continent) %>% summarize(meanLifeExp = mean(lifeExp))
by_continent
## # A tibble: 5 x 2
## continent meanLifeExp
## <fct> <dbl>
## 1 Africa 54.8
## 2 Americas 73.6
## 3 Asia 70.7
## 4 Europe 77.6
## 5 Oceania 80.7
ggplot(data=by_continent,aes(x=continent,y=meanLifeExp)) + geom_col()
# Summarize the median gdpPercap by continent in 1952
by_continent <- gapminder %>% filter(year==1952) %>% group_by(continent) %>% summarize(medianGdpPercap=median(gdpPercap))
# Create a bar plot showing medianGdp by continent
ggplot(data=by_continent,aes(x=continent,y=medianGdpPercap)) + geom_col()
# Filter for observations in the Oceania continent in 1952
oceania_1952 <- gapminder %>% filter(year==1952,continent=="Oceania") %>% group_by(country) %>% summarize(medianGdpPercap=median(gdpPercap))
# Create a bar plot of gdpPercap by country
ggplot(data=oceania_1952,aes(x=country,y=medianGdpPercap)) + geom_col()
A histogram shows the distribution of a single numeric variable. Every bar represents a bin, and the height of the bar is the number of observations in that bin. By default the bins are chosen automatically (30 bins), and the bin width strongly affects how the distribution appears. The binwidth argument of geom_histogram() controls it.
ggplot(data=gapminder_2007,aes(x=lifeExp)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# With binwidth = 5
ggplot(data=gapminder_2007,aes(x=lifeExp)) + geom_histogram(binwidth=5)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Create a histogram of population (pop)
ggplot(data=gapminder_1952,aes(x=pop)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# This histogram is hard to read: a few very populous countries stretch the x-axis, so almost all other countries are clumped into one bin. Showing x on a log scale spreads them out.
# Create a histogram of population (pop), with x on a log scale
ggplot(data=gapminder_1952,aes(x=pop)) + geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
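To see how binwidth changes what a histogram shows without looking at the picture, the sketch below builds two histograms of the same values and inspects the bins ggplot2 computed. The ten values are the 2007 life expectancies of the first ten countries listed earlier in this notebook, typed in by hand; `layer_data()` is used here only as a convenient way to read back the computed bins.

```r
library(ggplot2)

# 2007 life expectancy for the first ten countries shown earlier (typed in by hand)
life <- data.frame(lifeExp = c(43.8, 76.4, 72.3, 42.7, 75.3, 81.2, 79.8, 75.6, 64.1, 79.4))

narrow <- ggplot(life, aes(x = lifeExp)) + geom_histogram(binwidth = 1)
wide   <- ggplot(life, aes(x = lifeExp)) + geom_histogram(binwidth = 5)

# layer_data() returns the data computed for a layer, here one row per bin
narrow_bins <- layer_data(narrow)
wide_bins   <- layer_data(wide)

nrow(narrow_bins)     # many narrow bins
nrow(wide_bins)       # fewer, wider bins
sum(wide_bins$count)  # each observation is counted in exactly one bin
```

The same observations are counted either way; only how they are grouped into bars changes, which is why the apparent shape of the distribution depends on the bin width.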
To compare these distributions across continents, we can use boxplots. The line in the middle of each box is the median of that distribution. The top and bottom of the box mark the 75th and 25th percentiles. The whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn individually as outliers.
ggplot(data=gapminder_2007,aes(x=continent,y=lifeExp)) + geom_boxplot()
# Create a boxplot comparing gdpPercap among continents
ggplot(data=gapminder_1952,aes(x=continent,y=gdpPercap)) + geom_boxplot() + scale_y_log10()
# Add a title to this graph: "Comparing GDP per capita across continents"
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Comparing GDP per capita across continents")
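To make the parts of a boxplot concrete, the sketch below computes by hand the quantities that a default geom_boxplot() draws. It uses a made-up toy vector (not gapminder data) with one obvious extreme value, and assumes the default whisker rule of 1.5 times the interquartile range.

```r
# Toy values with one extreme point (hypothetical, not from gapminder)
x <- c(1, 2, 3, 4, 100)

q   <- quantile(x, c(0.25, 0.5, 0.75))  # box bottom, median line, box top
iqr <- q[[3]] - q[[1]]

# Points farther than 1.5 * IQR from the box are drawn as outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]

# Whiskers reach the most extreme points that are not outliers
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

q
outliers
whiskers
```

Here the box spans 2 to 4 with the median at 3, the whiskers reach 1 and 4, and the value 100 is plotted as a separate outlier point.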