Data 110 Project 1

Author

Leika Joseph

Project I

Countries of the World

Introduction

The Countries of the World dataset is compiled from US government data. It has variables such as population, region, area, infant mortality, birthrate, deathrate, GDP per capita, and more, so it supports many kinds of analysis. I plan to use this dataset to compare birthrates across regions and to compare birthrates against GDP per capita. From this, I expect to show that the regions with the poorest countries are the ones with the highest birthrates.

Load the Dataset

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("/Users/leikarayjoseph/Desktop/Data 110")
world_countries <- read_csv("countries of the world_cia_kaggle.csv")

Rows: 227 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Country, Region, Pop. Density (per sq. mi.), Coastline (coast/area...
dbl  (3): Population, Area (sq. mi.), GDP ($ per capita)
num  (6): Infant mortality (per 1000 births), Literacy (%), Other (%), Clima...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
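
The column specification above reads fields such as `Pop. Density (per sq. mi.)` as `chr` and several others as `num`, which is what readr tends to do when a file writes decimals with a comma. That is an assumption about this file, not something the output confirms. A minimal sketch of how readr's `locale()` changes the parsing, using made-up values rather than the real file:

```r
library(readr)

# Hypothetical values written with a comma as the decimal mark
raw <- c("48,0", "124,6", "13,8")

# Default locale: the comma is treated as a grouping mark and dropped
default_parse <- parse_number(raw)

# Comma-decimal locale: the comma is read as the decimal point
comma_locale <- locale(decimal_mark = ",")
comma_parse <- parse_number(raw, locale = comma_locale)

print(default_parse)
print(comma_parse)
```

If the assumption holds, passing the same `locale` to `read_csv()` through its `locale` argument would apply this rule while the file is loaded.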

Clean up Data

Make all headers lowercase and remove spaces

# putting the headers in lower case and removing the spaces
names(world_countries) <- tolower(names(world_countries))
names(world_countries) <- gsub(" ","",names(world_countries))
head(world_countries)

# A tibble: 6 × 20
  country        region         population `area(sq.mi.)` pop.density(persq.mi…¹
  <chr>          <chr>               <dbl>          <dbl> <chr>                 
1 Afghanistan    ASIA (EX. NEA…   31056997         647500 48,0                  
2 Albania        EASTERN EUROPE    3581655          28748 124,6                 
3 Algeria        NORTHERN AFRI…   32930091        2381740 13,8                  
4 American Samoa OCEANIA             57794            199 290,4                 
5 Andorra        WESTERN EUROPE      71201            468 152,1                 
6 Angola         SUB-SAHARAN A…   12127071        1246700 9,7                   
# ℹ abbreviated name: ¹`pop.density(persq.mi.)`
# ℹ 15 more variables: `coastline(coast/arearatio)` <chr>, netmigration <chr>,
#   `infantmortality(per1000births)` <dbl>, `gdp($percapita)` <dbl>,
#   `literacy(%)` <dbl>, `phones(per1000)` <chr>, `arable(%)` <chr>,
#   `crops(%)` <chr>, `other(%)` <dbl>, climate <dbl>, birthrate <dbl>,
#   deathrate <dbl>, agriculture <chr>, industry <chr>, service <chr>
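
Because the cleaned names still contain characters such as `(`, `%`, and `$`, they have to be wrapped in backticks later on. One possible alternative, shown here as a sketch on a few of the original header strings (the helper `clean_header()` is hypothetical and not part of the project), replaces every run of non-alphanumeric characters with an underscore:

```r
# Hypothetical helper: lower-case a header and turn punctuation/spaces into "_"
clean_header <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # runs of other characters become one "_"
  gsub("^_+|_+$", "", x)           # drop leading/trailing underscores
}

headers <- c("Infant mortality (per 1000 births)", "GDP ($ per capita)", "Literacy (%)")
cleaned <- clean_header(headers)
print(cleaned)
```

Names cleaned this way can be used in `select()` and `aes()` without backticks.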

Summary of my dataset

summary(world_countries) # this gives us a little more insight about the dataset

   country             region            population         area(sq.mi.)     
 Length:227         Length:227         Min.   :7.026e+03   Min.   :       2  
 Class :character   Class :character   1st Qu.:4.376e+05   1st Qu.:    4648  
 Mode  :character   Mode  :character   Median :4.787e+06   Median :   86600  
                                       Mean   :2.874e+07   Mean   :  598227  
                                       3rd Qu.:1.750e+07   3rd Qu.:  441811  
                                       Max.   :1.314e+09   Max.   :17075200  
                                                                             
 pop.density(persq.mi.) coastline(coast/arearatio) netmigration      
 Length:227             Length:227                 Length:227        
 Class :character       Class :character           Class :character  
 Mode  :character       Mode  :character           Mode  :character  
                                                                     
                                                                     
                                                                     
                                                                     
 infantmortality(per1000births) gdp($percapita)  literacy(%)    
 Min.   :   19.0                Min.   :  500   Min.   : 176.0  
 1st Qu.:  631.2                1st Qu.: 1900   1st Qu.: 706.0  
 Median : 1731.0                Median : 5550   Median : 925.0  
 Mean   : 3164.7                Mean   : 9690   Mean   : 828.4  
 3rd Qu.: 4929.8                3rd Qu.:15700   3rd Qu.: 980.0  
 Max.   :19119.0                Max.   :55100   Max.   :1000.0  
 NA's   :3                      NA's   :1       NA's   :18      
 phones(per1000)     arable(%)           crops(%)            other(%)   
 Length:227         Length:227         Length:227         Min.   :  50  
 Class :character   Class :character   Class :character   1st Qu.:5608  
 Mode  :character   Mode  :character   Mode  :character   Median :8015  
                                                          Mean   :6813  
                                                          3rd Qu.:9299  
                                                          Max.   :9998  
                                                          NA's   :2     
    climate         birthrate      deathrate      agriculture       
 Min.   : 1.000   Min.   :  10   Min.   :  22.0   Length:227        
 1st Qu.: 2.000   1st Qu.:1077   1st Qu.: 517.5   Class :character  
 Median : 2.000   Median :1800   Median : 713.0   Mode  :character  
 Mean   : 2.995   Mean   :2043   Mean   : 819.0                     
 3rd Qu.: 3.000   3rd Qu.:2934   3rd Qu.:1025.5                     
 Max.   :25.000   Max.   :5073   Max.   :2974.0                     
 NA's   :22       NA's   :3      NA's   :4                          
   industry           service         
 Length:227         Length:227        
 Class :character   Class :character  
 Mode  :character   Mode  :character

Select the columns I want to work with

world_countries1 <- world_countries |>
  select(country, region, population, birthrate, deathrate, netmigration, climate,
         `infantmortality(per1000births)`, `literacy(%)`, agriculture, industry)

world_countries1

# A tibble: 227 × 11
   country           region  population birthrate deathrate netmigration climate
   <chr>             <chr>        <dbl>     <dbl>     <dbl> <chr>          <dbl>
 1 Afghanistan       ASIA (…   31056997       466      2034 23,06              1
 2 Albania           EASTER…    3581655      1511       522 -4,93              3
 3 Algeria           NORTHE…   32930091      1714       461 -0,39              1
 4 American Samoa    OCEANIA      57794      2246       327 -20,71             2
 5 Andorra           WESTER…      71201       871       625 6,6                3
 6 Angola            SUB-SA…   12127071      4511       242 0                 NA
 7 Anguilla          LATIN …      13477      1417       534 10,76              2
 8 Antigua & Barbuda LATIN …      69108      1693       537 -6,15              2
 9 Argentina         LATIN …   39921833      1673       755 0,61               3
10 Armenia           C.W. O…    2976372      1207       823 -6,47              4
# ℹ 217 more rows
# ℹ 4 more variables: `infantmortality(per1000births)` <dbl>,
#   `literacy(%)` <dbl>, agriculture <chr>, industry <chr>

## Infant Mortality vs death rate
p1 <- ggplot(world_countries1, aes(x = `infantmortality(per1000births)`, y = deathrate)) +
  labs(title = "Infant Mortality versus Death Rate",
       caption = "Source: U.S. Government",
       x = "Infant Mortality",
       y = "Death Rate") +
  theme_minimal(base_size = 12) +
  geom_point() +   # add the points
  xlim(0, 2000) +  # specify the limits of the variables
  ylim(0, 3000) +
  geom_smooth()    # add a smoothed trend line
p1

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 109 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 109 rows containing missing values or values outside the scale range
(`geom_point()`).

p2 <- p1 + geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, linetype = "dotdash", color = "red", linewidth = 0.3)

p2

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 109 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 109 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 109 rows containing missing values or values outside the scale range
(`geom_point()`).

The linear regression here doesn’t really show a relationship between infant mortality and death rate.
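
The warnings above report 109 rows removed, which happens because `xlim()` and `ylim()` discard points outside the limits before the smoother is fitted. A sketch on made-up data (the data frame `toy` and its columns are hypothetical) of how to count the rows that fall outside a limit, and of `coord_cartesian()`, which zooms the view without dropping data:

```r
library(ggplot2)

# Made-up data: x runs well past the plotting limit of 2000
toy <- data.frame(x = seq(100, 4000, by = 100))
toy$y <- 0.5 * toy$x + 50

# Rows that xlim(0, 2000) would remove
n_outside <- sum(toy$x < 0 | toy$x > 2000)

# xlim() drops those rows before the fit; coord_cartesian() only zooms,
# so the lm smoother still sees every row
p_drop <- ggplot(toy, aes(x, y)) + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) + xlim(0, 2000)
p_zoom <- ggplot(toy, aes(x, y)) + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) + coord_cartesian(xlim = c(0, 2000))

print(n_outside)
```

Which of the two is appropriate depends on whether the out-of-range rows are real observations or parsing errors.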

Getting the Linear Regression Equation (infant mortality versus birthrate)

Eq <- lm(`infantmortality(per1000births)` ~ birthrate, data = world_countries1)
summary(Eq) # looking for the linear regression equation.


Call:
lm(formula = `infantmortality(per1000births)` ~ birthrate, data = world_countries1)

Residuals:
    Min      1Q  Median      3Q     Max 
-6432.5 -1423.3  -667.3   929.2 15880.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -380.3480   366.5023  -1.038    0.301    
birthrate      1.7318     0.1535  11.284   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2799 on 221 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.3655,    Adjusted R-squared:  0.3627 
F-statistic: 127.3 on 1 and 221 DF,  p-value: < 2.2e-16

Linear Regression

Infant mortality = 1.7318(birthrate) - 380.3480
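
The equation above is read off the `Estimate` column of the model summary. As a sketch on made-up data (the data frame `toy` and the names `slope`, `intercept`, and `pred` are hypothetical), the same numbers can be pulled out with `coef()` and used for a prediction:

```r
# Made-up data that lies exactly on the line y = 2 * x + 1
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 2 * toy$x + 1

fit <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" first, then the slope for x
intercept <- unname(coef(fit)["(Intercept)"])
slope <- unname(coef(fit)["x"])

# Predict y for a new x value using the fitted line
pred <- unname(predict(fit, newdata = data.frame(x = 10)))

print(c(intercept = intercept, slope = slope, prediction = pred))
```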

Create a visualization of region and birthrate

world_countries2 <- world_countries |>
  group_by(region) |>
  summarise(birthrate = sum(birthrate, na.rm = TRUE))
# in this chunk I group by region and summarise by the sum of the birthrates (mean() would give the average).
world_countries2

# A tibble: 11 × 2
   region               birthrate
   <chr>                    <dbl>
 1 ASIA (EX. NEAR EAST)     47272
 2 BALTICS                   2803
 3 C.W. OF IND. STATES      16378
 4 EASTERN EUROPE           10445
 5 LATIN AMER. & CARIB      79988
 6 NEAR EAST                38449
 7 NORTHERN AFRICA          10407
 8 NORTHERN AMERICA          5551
 9 OCEANIA                  42137
10 SUB-SAHARAN AFRICA      176426
11 WESTERN EUROPE           27732
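
The table above adds up the birthrates of all countries in a region, so a region with many countries gets a large value even if each country's birthrate is modest. A sketch on made-up data (the tibble `toy` is hypothetical) comparing `sum()` with `mean()`:

```r
library(dplyr)

# Made-up data: region A has many low values, region B has few high values
toy <- tibble(
  region = c("A", "A", "A", "A", "B", "B"),
  birthrate = c(10, 10, 10, 10, 15, 15)
)

by_region <- toy |>
  group_by(region) |>
  summarise(
    total = sum(birthrate, na.rm = TRUE),    # grows with the number of countries
    average = mean(birthrate, na.rm = TRUE)  # comparable across regions
  )

print(by_region)
```

In this toy example region A has the larger total while region B has the larger average, which is why the choice of summary function matters for a regional comparison.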

Bar plot of Region versus total Birthrate

## create a bar plot of region vs total (summed) birthrate
p3 <- world_countries2 |>
  ggplot() +
  geom_bar(aes(x = region, y = birthrate, fill = region),
           position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Set3") +
  labs(fill = "Region",
       x = "Region",
       y = "Total Birth Rate",
       title = "Region versus Total Birth Rate",
       caption = "Source: US Government") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) # I googled this last line (so the region names wouldn't overlap).
p3

This plot tells us that the Sub-Saharan Africa region has the highest total birthrate.

# choose the columns that I want to work with and summarise GDP per capita by region and birthrate (sum() is used here, not an average)
world_countries3 <- world_countries |>
  group_by(region, birthrate) |>
  summarise(GDP = sum(`gdp($percapita)` , na.rm = TRUE))

`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.

world_countries3

# A tibble: 225 × 3
# Groups:   region [11]
   region               birthrate   GDP
   <chr>                    <dbl> <dbl>
 1 ASIA (EX. NEAR EAST)        10 17800
 2 ASIA (EX. NEAR EAST)        17  7000
 3 ASIA (EX. NEAR EAST)       269  1900
 4 ASIA (EX. NEAR EAST)       298  1900
 5 ASIA (EX. NEAR EAST)       466   700
 6 ASIA (EX. NEAR EAST)       729 28800
 7 ASIA (EX. NEAR EAST)       848 19400
 8 ASIA (EX. NEAR EAST)       934 23700
 9 ASIA (EX. NEAR EAST)       937 28200
10 ASIA (EX. NEAR EAST)      1256 23400
# ℹ 215 more rows
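
Grouping by both `region` and `birthrate` keeps almost one row per country (225 rows above), because nearly every birthrate value is unique. A sketch on made-up data (the tibble `toy` and its values are hypothetical) of grouping by region alone, which gives one row per region:

```r
library(dplyr)

toy <- tibble(
  region = c("A", "A", "B", "B", "B"),
  birthrate = c(12, 14, 30, 32, 34),
  gdp = c(20000, 22000, 1500, 2500, 2000)
)

# Grouping by two columns: one row per distinct (region, birthrate) pair
per_pair <- toy |>
  group_by(region, birthrate) |>
  summarise(gdp = mean(gdp), .groups = "drop")

# Grouping by region only: one row per region, averaging both variables
per_region <- toy |>
  group_by(region) |>
  summarise(birthrate = mean(birthrate), gdp = mean(gdp), .groups = "drop")

print(nrow(per_pair))
print(per_region)
```

Setting `.groups = "drop"` also avoids the message about grouped output that `summarise()` printed above.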

Bar plot of Region versus GDP

# use option scipen to take scientific notation off
options(scipen = 999)

# Creating the bar plot
p4 <- world_countries3 |>
  ggplot() +
  geom_bar(aes(x = region, y = GDP, fill = region),
           position = "dodge", stat = "identity") +
  scale_fill_brewer(palette = "Paired") + # set the colors
  labs(fill = "Region",
       x = "Region",
       y = "GDP",
       title = "Region versus GDP per Capita",
       caption = "Source: US Government") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) # I googled this last line.
p4

Scatterplot of GDP per Capita versus Birthrate

p5 <- world_countries3 |>
  ggplot(aes(x = birthrate, y = GDP, color = as.factor(region))) +
  geom_point() +
  scale_color_brewer(palette = "Paired") + # choose the colors (the points use the color aesthetic, not fill)
  labs(color = "Region",
       x = "Birth Rate",
       y = "GDP",
       title = "GDP per Capita versus Birth Rate",
       caption = "Source: US Government") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) # I googled this part.
p5

Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).

Essay

In this project, I didn’t do a lot of cleaning up. I started by using tolower() to make the column names lowercase and gsub() to remove the spaces. I used summary() to get some information about the dataset, and then I used the select() function to pick the columns that I wanted to work with. For my first graph, I made a scatter plot of infant mortality versus deathrate, and I also added a regression line. I then made more plots using birthrate, region, and GDP per capita. What I got from the visualizations is that Sub-Saharan Africa, the region that contains most of the world’s poorest countries, is the one with the highest birthrate and also the region with the lowest GDP per capita. While analyzing this dataset, I felt that I needed to do more: analyzing by country name and exploring the regions further would have been very informative. I also wanted to explore the climate and agriculture variables. This is my first project, and I feel that I could have done more with it.