iso: An identification variable used to distinguish between the 181 countries in the filtered dataset.
fert_rate: A numerical variable representing the country’s fertility rate in 2022 corresponding to the expected number of children born per woman in child-bearing years. This is the outcome variable y of interest.
life_exp: A numerical variable representing the country’s average life expectancy in 2022 in years. This is the primary explanatory variable x of interest.
obes_rate: A numerical variable representing the country’s obesity rate in 2016.
ggplot(UN_data_ch5, aes(x = life_exp, y = fert_rate)) +
  geom_point(alpha = 0.5) +
  labs(x = "Life Expectancy", y = "Fertility Rate")
Scatter plot style
See R for Data Science (2e), section 9, for the different point shapes available.
ggplot(UN_data_ch5, aes(x = life_exp, y = fert_rate)) +
  geom_point(color = "lawngreen", fill = "snow", shape = 23, size = 2) +
  labs(x = "Life Expectancy", y = "Fertility Rate")
Best fitting line
geom_smooth(method = "lm", se = FALSE)
ggplot(UN_data_ch5, aes(x = life_exp, y = fert_rate)) +
  geom_point(alpha = 0.1) +
  labs(x = "Life Expectancy", y = "Fertility Rate",
       title = "Relationship of life expectancy and fertility rate") +
  geom_smooth(method = "lm", se = TRUE)
# The band represents the confidence interval (95% by default in ggplot2)
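The band drawn by geom_smooth() can be approximated numerically with base R. The sketch below uses a small made-up data frame (toy, not the UN_data_ch5 values) and asks predict() for the fitted value together with the lower and upper 95% confidence limits at a few life expectancies. It illustrates the idea and is not necessarily the exact computation ggplot2 performs internally.

```r
# Hypothetical toy data for illustration only (not the UN_data_ch5 values)
toy <- data.frame(
  life_exp  = c(55, 60, 65, 70, 75, 80, 85),
  fert_rate = c(5.8, 5.1, 4.0, 3.1, 2.4, 1.8, 1.5)
)

# Simple linear regression of fertility rate on life expectancy
toy_model <- lm(fert_rate ~ life_exp, data = toy)

# Fitted value (fit) plus lower (lwr) and upper (upr) 95% confidence limits
new_x <- data.frame(life_exp = c(60, 70, 80))
band <- predict(toy_model, newdata = new_x,
                interval = "confidence", level = 0.95)
band
```

Each row of band corresponds to one x value: the line passes through fit, and the shaded band spans lwr to upr.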
lm(y ~ x, data = dataframe) fits the regression line
coef() returns the regression model coefficients
# Fit regression model:
demographics_model <- lm(fert_rate ~ life_exp, data = UN_data_ch5)
# Get regression coefficients:
coef(demographics_model)
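To see what the two coefficients mean, the sketch below fits the same kind of model on a small made-up data frame (toy, not UN_data_ch5) and computes a prediction by hand as intercept + slope * x. The value 72 is an arbitrary example input. The hand calculation should agree with predict().

```r
# Hypothetical toy data for illustration only (not the UN_data_ch5 values)
toy <- data.frame(
  life_exp  = c(55, 60, 65, 70, 75, 80, 85),
  fert_rate = c(5.8, 5.1, 4.0, 3.1, 2.4, 1.8, 1.5)
)
toy_model <- lm(fert_rate ~ life_exp, data = toy)

# Extract the intercept (b0) and the slope (b1) by name
b  <- coef(toy_model)
b0 <- b[["(Intercept)"]]
b1 <- b[["life_exp"]]

# Predicted fertility rate at a life expectancy of 72 years, by hand
by_hand <- b0 + b1 * 72

# The same prediction using predict()
by_predict <- predict(toy_model, newdata = data.frame(life_exp = 72))

all.equal(by_hand, unname(by_predict))
```

A negative slope b1 means that the fitted fertility rate falls as life expectancy rises.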
Continue with the gapminder2022 data frame you created in Exercise 1
Select life_exp and continent
Use tidy_summary() for a summary table
# A tibble: 7 × 11
  column        n group         type     min    Q1  mean median    Q3   max    sd
  <chr>     <int> <chr>         <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 life_exp    188 <NA>          nume…   53.6  69.4  73.8   75.2  78.4  89.6  6.93
2 continent    52 Africa        fact…   NA    NA    NA     NA    NA    NA   NA
3 continent    44 Asia          fact…   NA    NA    NA     NA    NA    NA   NA
4 continent    43 Europe        fact…   NA    NA    NA     NA    NA    NA   NA
5 continent    23 North America fact…   NA    NA    NA     NA    NA    NA   NA
6 continent    14 Oceania       fact…   NA    NA    NA     NA    NA    NA   NA
7 continent    12 South America fact…   NA    NA    NA     NA    NA    NA   NA
Gapminder Table
country: An identification variable of type character/text used to distinguish the 188 countries in the dataset.
life_exp: A numerical variable of that country’s life expectancy at birth. This is the outcome variable y of interest.
continent: A categorical variable with six levels. Here “levels” correspond to the possible categories: Africa, Asia, Europe, North America, Oceania, and South America. This is the explanatory variable x of interest.
gdp_per_capita: A numerical variable of that country’s GDP per capita in US inflation-adjusted dollars.
Histogram recap
Left-skewed distribution?
ggplot(gapminder2022, aes(x = life_exp)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(x = "Life expectancy", y = "Number of countries",
       title = "Histogram of distribution of worldwide life expectancies")
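A quick numeric hint for left skew is a mean that sits below the median, because the long left tail pulls the mean down. The summary table above shows this pattern for life_exp (mean 73.8, median 75.2). The sketch below shows the same heuristic on a made-up vector (life_exp_toy is hypothetical). It is a rule of thumb, not a formal test of skewness.

```r
# Hypothetical toy values with a long left tail (not the gapminder2022 data)
life_exp_toy <- c(54, 62, 68, 72, 74, 76, 78, 80, 82, 84)

toy_mean   <- mean(life_exp_toy)
toy_median <- median(life_exp_toy)

# A mean below the median suggests a left-skewed distribution
c(mean = toy_mean, median = toy_median)
toy_mean < toy_median
```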
How about separating by continent?
Recap with facet_wrap()
ggplot(gapminder2022, aes(x = life_exp)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(x = "Life expectancy", y = "Number of countries",
       title = "Histogram of distribution of worldwide life expectancies") +
  facet_wrap(~ continent, nrow = 2)
Are there any outliers?
Recall the boxplot
ggplot(gapminder2022, aes(x = continent, y = life_exp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Life expectancy",
       title = "Life expectancy by continent")
Categorical data with a scatter plot
ggplot(gapminder2022, aes(y = life_exp, x = continent)) +
  geom_point(alpha = 0.1) +
  labs(x = "continent", y = "Life expectancy",
       title = "Life expectancy estimation by continent")
Are there any outliers?
Recall the boxplot and mark the outliers
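A boxplot conventionally marks a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below applies that rule by hand to a small sample: the ten Asian life expectancies that appear in the regression-points output later in these notes. Ten values are not the full continent, so the fences here are illustrative only.

```r
# Small illustrative sample: first ten Asia rows shown later in these notes
life_exp_toy <- c(53.6, 76.1, 74.2, 79.9, 74.7, 72.3, 80.6, 70.6, 79.7, 77.5)

# First and third quartiles and the interquartile range
q   <- quantile(life_exp_toy, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences of the 1.5 * IQR rule
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- life_exp_toy[life_exp_toy < lower | life_exp_toy > upper]
outliers
```

In this sample only the lowest value, 53.6 (Afghanistan in the output), falls outside the fences.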
Mean and median
life_exp_by_continent <- gapminder2022 |>
  group_by(continent) |>
  summarize(median = median(life_exp), mean = mean(life_exp))
life_exp_by_continent
# A tibble: 6 × 3
  continent     median  mean
  <fct>          <dbl> <dbl>
1 Africa          66.1  66.3
2 Asia            75.4  74.9
3 Europe          81.5  79.9
4 North America   76.1  76.3
5 Oceania         74.6  74.4
6 South America   75.4  75.2
Setting the baseline for comparison
R uses the first factor level as the baseline. Levels are ordered alphabetically by default, so here the baseline is Africa.
life_exp_model <- lm(life_exp ~ continent, data = gapminder2022)
coef(life_exp_model)
           (Intercept)          continentAsia        continentEurope
             66.309808               8.639965              13.597867
continentNorth America       continentOceania continentSouth America
              9.985410               8.106621               8.917692
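The alphabetical default described above can be changed with relevel(), which moves a chosen level to the front so that it becomes the baseline. The sketch below uses a tiny made-up data frame (toy and the column continent_eu are hypothetical) to make Europe the baseline instead of Africa.

```r
# Hypothetical toy data for illustration only (not the gapminder2022 values)
toy <- data.frame(
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe")),
  life_exp  = c(64, 68, 74, 76, 79, 81)
)

# Default baseline: the first level in alphabetical order
levels(toy$continent)[1]

# Make Europe the baseline level instead
toy$continent_eu <- relevel(toy$continent, ref = "Europe")
model_eu <- lm(life_exp ~ continent_eu, data = toy)

# The intercept is now the Europe mean; the other coefficients are
# differences from Europe
coef(model_eu)
```

With Europe as the baseline, the Africa and Asia coefficients become negative because their toy means are below the Europe mean.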
How to read
(Intercept): the mean life expectancy of countries in Africa (the baseline), 66.31 years.
continentAsia: the difference from Africa; the mean life expectancy of countries in Asia is 66.31 + 8.64 = 74.95 years.
continentEurope: the mean life expectancy of countries in Europe is 66.31 + 13.60 = 79.91 years.
continentNorth America: the mean life expectancy of countries in North America is 66.31 + 9.99 = 76.30 years.
continentOceania: the mean life expectancy of countries in Oceania is 66.31 + 8.11 = 74.42 years.
continentSouth America: the mean life expectancy of countries in South America is 66.31 + 8.92 = 75.23 years.
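The reading above can be checked on a small example: with a single categorical explanatory variable, the intercept plus each offset reproduces the group means. The sketch below uses made-up values (toy is hypothetical, with only three continents) and compares the model-based means against tapply().

```r
# Hypothetical toy data for illustration only (not the gapminder2022 values)
toy <- data.frame(
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe")),
  life_exp  = c(64, 68, 74, 76, 79, 81)
)
toy_model <- lm(life_exp ~ continent, data = toy)
b <- coef(toy_model)

# Group means rebuilt from the coefficients: baseline plus offset
from_model <- c(
  Africa = b[["(Intercept)"]],
  Asia   = b[["(Intercept)"]] + b[["continentAsia"]],
  Europe = b[["(Intercept)"]] + b[["continentEurope"]]
)

# Group means computed directly from the data
group_means <- tapply(toy$life_exp, toy$continent, mean)

all.equal(as.numeric(from_model), as.numeric(group_means))
```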
Applying the regression model to each country
get_regression_points(life_exp_model, ID = "country")
# A tibble: 188 × 5
   country             life_exp continent     life_exp_hat residual
   <chr>                  <dbl> <fct>                <dbl>    <dbl>
 1 Afghanistan             53.6 Asia                  75.0  -21.3
 2 Albania                 79.5 Europe                79.9   -0.438
 3 Algeria                 78.0 Africa                66.3   11.7
 4 Andorra                 83.4 Europe                79.9    3.51
 5 Angola                  62.1 Africa                66.3   -4.2
 6 Antigua and Barbuda     77.8 North America         76.3    1.50
 7 Argentina               78.3 South America         75.2    3.08
 8 Armenia                 76.1 Asia                  75.0    1.18
 9 Australia               83.1 Oceania               74.4    8.67
10 Austria                 82.3 Europe                79.9    2.36
# ℹ 178 more rows
country_apply <- get_regression_points(life_exp_model, ID = "country")
Asia_model <- country_apply |>
  filter(continent == "Asia")
Asia_model
# A tibble: 44 × 5
   country     life_exp continent life_exp_hat residual
   <chr>          <dbl> <fct>            <dbl>    <dbl>
 1 Afghanistan     53.6 Asia              75.0   -21.3
 2 Armenia         76.1 Asia              75.0     1.18
 3 Azerbaijan      74.2 Asia              75.0    -0.8
 4 Bahrain         79.9 Asia              75.0     4.95
 5 Bangladesh      74.7 Asia              75.0    -0.25
 6 Bhutan          72.3 Asia              75.0    -2.64
 7 Brunei          80.6 Asia              75.0     5.64
 8 Cambodia        70.6 Asia              75.0    -4.3
 9 Cyprus          79.7 Asia              75.0     4.79
10 Georgia         77.5 Asia              75.0     2.55
# ℹ 34 more rows
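In the output above, life_exp_hat is the fitted value (the continent mean) and residual is the observed value minus the fitted value. The sketch below rebuilds the same kind of table with base R's fitted() and residuals() on made-up data. The data frame toy, its country labels, and reg_points are hypothetical. This is a rough stand-in for get_regression_points(), not its actual implementation.

```r
# Hypothetical toy data for illustration only (not the gapminder2022 values)
toy <- data.frame(
  country   = c("A", "B", "C", "D", "E", "F"),
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe")),
  life_exp  = c(64, 68, 74, 76, 79, 81)
)
toy_model <- lm(life_exp ~ continent, data = toy)

# Observed value, fitted value, and residual for every country
reg_points <- data.frame(
  country      = toy$country,
  life_exp     = toy$life_exp,
  continent    = toy$continent,
  life_exp_hat = unname(fitted(toy_model)),
  residual     = unname(residuals(toy_model))
)
reg_points

# Residual is observed minus fitted
all.equal(reg_points$residual, reg_points$life_exp - reg_points$life_exp_hat)
```

A negative residual means the country sits below its continent's mean, as Afghanistan does in the Asia output.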
Exercise 4
Following the same logic, can you fit a linear model of gdp_per_capita on continent? Can you explain the result?
life_gdp_model <- lm(gdp_per_capita ~ continent, data = gapminder2022)
coef(life_gdp_model)
           (Intercept)          continentAsia        continentEurope
              2637.100              13014.284              43061.350
continentNorth America       continentOceania continentSouth America
             13713.476              10031.048               8083.508
get_regression_points(life_gdp_model, ID = "country")
# A tibble: 188 × 5
   country             gdp_per_capita continent     gdp_per_capita_hat residual
   <chr>                        <dbl> <fct>                      <dbl>    <dbl>
 1 Afghanistan                   356. Asia                      15651.  -15296.
 2 Albania                      6810. Europe                    45698.  -38888.
 3 Algeria                      4343. Africa                     2637.    1706.
 4 Andorra                     41993. Europe                    45698.   -3706.
 5 Angola                       3000. Africa                     2637.     363.
 6 Antigua and Barbuda         19920. North America             16351.    3569.
 7 Argentina                   13651. South America             10721.    2930.
 8 Armenia                      7018. Asia                      15651.   -8633.
 9 Australia                   65100. Oceania                   12668.   52432.
10 Austria                     52085. Europe                    45698.    6386.
# ℹ 178 more rows