The Countries of the World dataset was compiled from US government data. It includes variables such as population, region, area, infant mortality, birth rate, death rate, GDP per capita, and more, so it supports many kinds of analysis. I plan to use it to compare birth rates across regions and to compare birth rates with GDP per capita. From these comparisons I can conclude that the regions with the poorest countries are the ones with the highest birth rates.
## Load the Dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/leikarayjoseph/Desktop/Data 110")
world_countries <- read_csv("countries of the world_cia_kaggle.csv")
Rows: 227 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Country, Region, Pop. Density (per sq. mi.), Coastline (coast/area...
dbl (3): Population, Area (sq. mi.), GDP ($ per capita)
num (6): Infant mortality (per 1000 births), Literacy (%), Other (%), Clima...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
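The file appears to write decimals with a comma (values such as `48,0` show up in the character columns), and the columns typed `num` are read with the comma treated as a grouping mark. A minimal sketch of how readr's `locale(decimal_mark = ",")` changes the parsing; the sample values, `sample_csv`, and the variable names are illustrative assumptions, not the real file:

```r
library(readr)

# Illustrative values written with a decimal comma, like the "48,0" seen in the file
raw_values <- c("46,6", "15,11")

# Default locale: the comma counts as a grouping mark and is dropped
default_parsed <- parse_number(raw_values)

# Decimal-comma locale: the comma is read as the decimal point
comma_locale <- locale(decimal_mark = ",")
comma_parsed <- parse_number(raw_values, locale = comma_locale)

# The same locale can be passed to read_csv();
# sample_csv is a hypothetical inline stand-in for the real file
sample_csv <- I('Country,Birthrate\n"Afghanistan","46,6"\n"Albania","15,11"\n')
sample_read <- read_csv(
  sample_csv,
  locale = comma_locale,
  col_types = cols(Country = col_character(), Birthrate = col_double())
)

print(default_parsed)
print(comma_parsed)
print(sample_read)
```

If the real file behaves like this sample, passing the same `locale` to the original `read_csv()` call would keep rates such as birthrate on their per-1000 scale instead of multiplying them by 10 or 100.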
Clean up Data
Make all headers lowercase
# put the headers in lowercase and remove the spaces
names(world_countries) <- tolower(names(world_countries))
names(world_countries) <- gsub(" ", "", names(world_countries))
head(world_countries)
# A tibble: 6 × 20
country region population `area(sq.mi.)` pop.density(persq.mi…¹
<chr> <chr> <dbl> <dbl> <chr>
1 Afghanistan ASIA (EX. NEA… 31056997 647500 48,0
2 Albania EASTERN EUROPE 3581655 28748 124,6
3 Algeria NORTHERN AFRI… 32930091 2381740 13,8
4 American Samoa OCEANIA 57794 199 290,4
5 Andorra WESTERN EUROPE 71201 468 152,1
6 Angola SUB-SAHARAN A… 12127071 1246700 9,7
# ℹ abbreviated name: ¹`pop.density(persq.mi.)`
# ℹ 15 more variables: `coastline(coast/arearatio)` <chr>, netmigration <chr>,
# `infantmortality(per1000births)` <dbl>, `gdp($percapita)` <dbl>,
# `literacy(%)` <dbl>, `phones(per1000)` <chr>, `arable(%)` <chr>,
# `crops(%)` <chr>, `other(%)` <dbl>, climate <dbl>, birthrate <dbl>,
# deathrate <dbl>, agriculture <chr>, industry <chr>, service <chr>
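Removing only the spaces leaves names such as `gdp($percapita)` that must be wrapped in backticks every time they are used. As a possible alternative (not what this project does), the names could be reduced to letters, digits, and underscores with base R. The sketch below uses four of the original column names; `raw_names` and `clean_names` are illustrative variable names:

```r
# Column names copied from the read_csv() column specification
raw_names <- c("Country", "Pop. Density (per sq. mi.)",
               "GDP ($ per capita)", "Literacy (%)")

clean_names <- tolower(raw_names)
# Replace every run of characters that is not a letter or digit with "_"
clean_names <- gsub("[^a-z0-9]+", "_", clean_names)
# Drop a leading or trailing "_" left over from punctuation
clean_names <- gsub("^_|_$", "", clean_names)

print(clean_names)
```

Names cleaned this way can be used without backticks, for example `gdp_per_capita` in `aes()` or `summarise()`.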
Summary of my dataset
summary(world_countries) # this gives us a little more insight into the dataset
country region population area(sq.mi.)
Length:227 Length:227 Min. :7.026e+03 Min. : 2
Class :character Class :character 1st Qu.:4.376e+05 1st Qu.: 4648
Mode :character Mode :character Median :4.787e+06 Median : 86600
Mean :2.874e+07 Mean : 598227
3rd Qu.:1.750e+07 3rd Qu.: 441811
Max. :1.314e+09 Max. :17075200
pop.density(persq.mi.) coastline(coast/arearatio) netmigration
Length:227 Length:227 Length:227
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
infantmortality(per1000births) gdp($percapita) literacy(%)
Min. : 19.0 Min. : 500 Min. : 176.0
1st Qu.: 631.2 1st Qu.: 1900 1st Qu.: 706.0
Median : 1731.0 Median : 5550 Median : 925.0
Mean : 3164.7 Mean : 9690 Mean : 828.4
3rd Qu.: 4929.8 3rd Qu.:15700 3rd Qu.: 980.0
Max. :19119.0 Max. :55100 Max. :1000.0
NA's :3 NA's :1 NA's :18
phones(per1000) arable(%) crops(%) other(%)
Length:227 Length:227 Length:227 Min. : 50
Class :character Class :character Class :character 1st Qu.:5608
Mode :character Mode :character Mode :character Median :8015
Mean :6813
3rd Qu.:9299
Max. :9998
NA's :2
climate birthrate deathrate agriculture
Min. : 1.000 Min. : 10 Min. : 22.0 Length:227
1st Qu.: 2.000 1st Qu.:1077 1st Qu.: 517.5 Class :character
Median : 2.000 Median :1800 Median : 713.0 Mode :character
Mean : 2.995 Mean :2043 Mean : 819.0
3rd Qu.: 3.000 3rd Qu.:2934 3rd Qu.:1025.5
Max. :25.000 Max. :5073 Max. :2974.0
NA's :22 NA's :3 NA's :4
industry service
Length:227 Length:227
Class :character Class :character
Mode :character Mode :character
Select the columns I want to work with
# A tibble: 227 × 11
country region population birthrate deathrate netmigration climate
<chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 Afghanistan ASIA (… 31056997 466 2034 23,06 1
2 Albania EASTER… 3581655 1511 522 -4,93 3
3 Algeria NORTHE… 32930091 1714 461 -0,39 1
4 American Samoa OCEANIA 57794 2246 327 -20,71 2
5 Andorra WESTER… 71201 871 625 6,6 3
6 Angola SUB-SA… 12127071 4511 242 0 NA
7 Anguilla LATIN … 13477 1417 534 10,76 2
8 Antigua & Barbuda LATIN … 69108 1693 537 -6,15 2
9 Argentina LATIN … 39921833 1673 755 0,61 3
10 Armenia C.W. O… 2976372 1207 823 -6,47 4
# ℹ 217 more rows
# ℹ 4 more variables: `infantmortality(per1000births)` <dbl>,
# `literacy(%)` <dbl>, agriculture <chr>, industry <chr>
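The chunk that produced this table is not shown in the output. A minimal sketch of how such a selection is typically written with `dplyr::select()`, using a small stand-in tibble; `toy` and `toy_selected` are hypothetical names and the Albania GDP value is made up for illustration:

```r
library(dplyr)

# Hypothetical two-row stand-in for world_countries after the name cleanup
toy <- tibble(
  country = c("Afghanistan", "Albania"),
  region = c("ASIA (EX. NEAR EAST)", "EASTERN EUROPE"),
  population = c(31056997, 3581655),
  `gdp($percapita)` = c(700, 4500),   # second value is illustrative only
  birthrate = c(466, 1511),
  deathrate = c(2034, 522)
)

# Keep only the columns needed for the later plots
toy_selected <- toy |>
  select(country, region, population, birthrate, deathrate)

print(toy_selected)
```

Assigning the result to a new object, as done here, leaves the full dataset available for the later region summaries.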
## Infant Mortality vs Death Rate
p1 <- ggplot(world_countries1, aes(x = `infantmortality(per1000births)`, y = deathrate)) +
  labs(title = "Infant Mortality versus Death Rate",
       caption = "Source: U.S. Government",
       x = "Infant Mortality",
       y = "Death Rate") +
  theme_minimal(base_size = 12) +
  geom_point() +   # add the points
  xlim(0, 2000) +  # specify the limits of the variables
  ylim(0, 3000) +
  geom_smooth()    # add a smoothed trend line
p1
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 109 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 109 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
p2
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 109 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 109 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 109 rows containing missing values or values outside the scale range
(`geom_point()`).
The linear regression here doesn’t really show a relationship between infant mortality and death rate.
Getting Linear regression Equation
Eq <- lm(`infantmortality(per1000births)` ~ birthrate, data = world_countries1)
summary(Eq) # looking for the linear regression equation
Call:
lm(formula = `infantmortality(per1000births)` ~ birthrate, data = world_countries1)
Residuals:
Min 1Q Median 3Q Max
-6432.5 -1423.3 -667.3 929.2 15880.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -380.3480 366.5023 -1.038 0.301
birthrate 1.7318 0.1535 11.284 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2799 on 221 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.3655, Adjusted R-squared: 0.3627
F-statistic: 127.3 on 1 and 221 DF, p-value: < 2.2e-16
Linear Regression
Infant mortality = 1.7318(birthrate) - 380.3480
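As a quick check of how the equation is used, the sketch below plugs the median birthrate from the summary table (1800, on the dataset's own scale as read by `read_csv()`) into the fitted line. The function name `predict_infant_mortality` is an illustrative assumption:

```r
# Coefficients copied from the summary(Eq) output
intercept <- -380.3480
slope <- 1.7318

# Predicted infant mortality for a given birthrate, on the dataset's own scale
predict_infant_mortality <- function(birthrate) {
  intercept + slope * birthrate
}

# 1800 is the median birthrate reported by summary(world_countries)
median_prediction <- predict_infant_mortality(1800)
print(median_prediction)
```

With the fitted model object available, `predict(Eq, newdata = data.frame(birthrate = 1800))` should give nearly the same value, differing only by the rounding of the printed coefficients.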
Create a visualization of region versus birthrate
world_countries2 <- world_countries |>
  group_by(region) |>
  summarise(birthrate = sum(birthrate, na.rm = TRUE))
# In this chunk I group by region and summarise with sum(), so each value is the regional total of the country birthrates rather than an average.
world_countries2
# A tibble: 11 × 2
region birthrate
<chr> <dbl>
1 ASIA (EX. NEAR EAST) 47272
2 BALTICS 2803
3 C.W. OF IND. STATES 16378
4 EASTERN EUROPE 10445
5 LATIN AMER. & CARIB 79988
6 NEAR EAST 38449
7 NORTHERN AFRICA 10407
8 NORTHERN AMERICA 5551
9 OCEANIA 42137
10 SUB-SAHARAN AFRICA 176426
11 WESTERN EUROPE 27732
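Because `sum()` adds up one birthrate per country, regions with more countries get larger values regardless of how high each country's rate is. A minimal sketch of the difference between the regional total and the regional average, using made-up data; the `toy` tibble and its values are hypothetical:

```r
library(dplyr)

# Hypothetical data: region "A" has two countries, region "B" has one
toy <- tibble(
  region = c("A", "A", "B"),
  birthrate = c(10, 30, 50)
)

toy_summary <- toy |>
  group_by(region) |>
  summarise(
    total = sum(birthrate, na.rm = TRUE),     # grows with the number of countries
    average = mean(birthrate, na.rm = TRUE),  # comparable across regions
    countries = n()
  )

print(toy_summary)
```

In this toy case region "A" has the larger total (40 versus 50 is close) but the much smaller average (20 versus 50), which is why `mean()` is usually the safer choice when regions differ in size.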
Bar plot of Region versus average Birthrate
## create a bar plot of region vs birthrate (the values in world_countries2 are regional sums)
p3 <- world_countries2 |>
  ggplot() +
  geom_bar(aes(x = region, y = birthrate, fill = region),
           position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Set3") +
  labs(fill = "Region",
       x = "Region",
       y = "Avrg_Birth Rate",
       title = "Region versus Avrg_Birth Rate",
       caption = "Source: US Government") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) # I googled this last line (so the region names wouldn't overlap each other).
p3
This plot tells us that the Sub-Saharan Africa region has the highest summed birth rate.
# choose the two columns that I want to work with and summarise GDP per capita with sum()
world_countries3 <- world_countries |>
  group_by(region, birthrate) |>
  summarise(GDP = sum(`gdp($percapita)`, na.rm = TRUE))
`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.
world_countries3
# A tibble: 225 × 3
# Groups: region [11]
region birthrate GDP
<chr> <dbl> <dbl>
1 ASIA (EX. NEAR EAST) 10 17800
2 ASIA (EX. NEAR EAST) 17 7000
3 ASIA (EX. NEAR EAST) 269 1900
4 ASIA (EX. NEAR EAST) 298 1900
5 ASIA (EX. NEAR EAST) 466 700
6 ASIA (EX. NEAR EAST) 729 28800
7 ASIA (EX. NEAR EAST) 848 19400
8 ASIA (EX. NEAR EAST) 934 23700
9 ASIA (EX. NEAR EAST) 937 28200
10 ASIA (EX. NEAR EAST) 1256 23400
# ℹ 215 more rows
Bar plot of Region versus GDP
# use option scipen to turn scientific notation off
options(scipen = 999)
# Creating the bar plot
p4 <- world_countries3 |>
  ggplot() +
  geom_bar(aes(x = region, y = GDP, fill = region),
           position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Paired") + # set the colors
  labs(fill = "Region",
       x = "Region",
       y = "GDP",
       title = "Region versus GDP per Capita",
       caption = "Source: US Government") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) # I googled this last line.
p4
Scatterplot of GDP per Capita versus Birthrate
p5 <- world_countries3 |>
  ggplot(aes(x = birthrate, y = GDP, col = as.factor(region))) +
  geom_point() +
  scale_color_brewer(palette = "Paired") + # choose the colors (a color scale, because the points are mapped with col)
  labs(color = "Region",
       x = "Birth Rate",
       y = "GDP",
       title = "GDP per Capita versus Birth Rate",
       caption = "Source: US Government") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) # I googled this part.
p5
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
Essay
In this project, I didn't do a lot of cleaning up. I started by using tolower() to put the column names in lowercase letters. I used summary() to get some information about the dataset, and then I used the select() function to pick the columns that I wanted to work with. For my first graph, I made a scatter plot of infant mortality versus death rate and added a trend line. I then made more plots using birthrate, region, and GDP per capita. What I got from the visualizations is that Sub-Saharan Africa, the region that contains most of the world's poorest countries, is the one with the highest birthrate and also the region with the lowest GDP per capita. While analyzing this dataset, I felt that I needed to do more, such as analyzing by country name and exploring the regions in more detail, because that would have been very informative. I also wanted to explore the climate and agriculture variables. This is my first project, and I feel that I could have done more with it.