Project Two

Author

Emrick Dallo

Introduction

In this project, we explore country-level data from around the world, obtained from the US government. The collection includes a range of demographic and socioeconomic variables, including GDP, area, and literacy rates.

The data set is compiled from US government data, which is generally free to use within the US. It went through a couple of cleaning procedures, including handling missing values, to ensure accuracy. I have always been interested in maps and countries, so making visualizations about countries is always fun.

library(readr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(plotly)

Loading required package: ggplot2


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(ggplot2)

# Load Data-set
countries.of.the.world_cia_kaggle <- read_csv("~/DATA 110/countries of the world_cia_kaggle.csv")

Rows: 227 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Country, Region, Pop. Density (per sq. mi.), Coastline (coast/area...
dbl  (3): Population, Area (sq. mi.), GDP ($ per capita)
num  (6): Infant mortality (per 1000 births), Literacy (%), Other (%), Clima...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(countries.of.the.world_cia_kaggle)
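
The column specification above reports `Literacy (%)` as `num`, a type readr typically chooses when values contain grouping marks such as commas. If the file writes decimals with a comma (an assumption about this CSV; the value below is made up), the default locale drops the comma instead of treating it as a decimal point. A minimal sketch of the difference:

```r
library(readr)

# Hypothetical raw value, written with a decimal comma
raw_value <- "98,6"

# Default locale: the comma is read as a grouping mark and dropped
default_parse <- parse_number(raw_value)

# Locale with a decimal comma: the value keeps its fractional part
comma_parse <- parse_number(raw_value, locale = locale(decimal_mark = ","))

print(default_parse)
print(comma_parse)
```

If this applies to the file, passing `locale = locale(decimal_mark = ",")` to `read_csv()` would keep the percentage columns on a 0 to 100 scale.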

Let’s look at countries in Eastern Europe.

Eastern_europe <- countries.of.the.world_cia_kaggle %>%
  filter(Region == "EASTERN EUROPE")

Literacy vs. GDP

Cleaning the data by removing NAs

# Remove any rows with missing values in GDP or literacy
LitvGdp_data <- na.omit(Eastern_europe[c("GDP ($ per capita)", "Literacy (%)")])
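
As a small illustration of what this cleaning step does, the sketch below applies the same pattern to a made-up data frame. The column names and values are hypothetical and are not taken from the data set.

```r
# Made-up data frame; column names are illustrative only
toy <- data.frame(
  country  = c("A", "B", "C", "D"),
  gdp      = c(9000, NA, 12000, 15000),
  literacy = c(98, 97, NA, 99)
)

# Keep only the two columns of interest, then drop incomplete rows
toy_complete <- na.omit(toy[c("gdp", "literacy")])

nrow(toy)           # rows before cleaning
nrow(toy_complete)  # rows with both values present
```

Selecting the columns first matters: calling `na.omit()` on the full table would also drop countries that are missing unrelated variables.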

Fit a linear regression of GDP (gross domestic product) per capita on literacy rate for countries in Eastern Europe.

# Perform linear regression
model <- lm(`GDP ($ per capita)` ~ `Literacy (%)`, data = LitvGdp_data)

# Summarize the regression model
summary(model)


Call:
lm(formula = `GDP ($ per capita)` ~ `Literacy (%)`, data = LitvGdp_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4577.2 -3834.5  -751.4  3128.7  6650.6 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -70568.28   32060.79  -2.201   0.0636 .
`Literacy (%)`     83.17      32.99   2.521   0.0398 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4202 on 7 degrees of freedom
Multiple R-squared:  0.4759,    Adjusted R-squared:  0.401 
F-statistic: 6.355 on 1 and 7 DF,  p-value: 0.03976

# Create a scatterplot with the regression line
ggplot(LitvGdp_data, aes(x = `Literacy (%)`, y = `GDP ($ per capita)`)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Literacy (%)", y = "GDP ($ per capita)", title = "Regression Analysis of Literacy Rate vs. GDP")

`geom_smooth()` using formula = 'y ~ x'

Statistical component with analysis

Model Equation

y = 83.17x − 70568.28

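Here x is the `Literacy (%)` value and y is GDP per capita in dollars. The equation suggests that for every one-unit increase in `Literacy (%)`, GDP per capita is estimated to increase by $83.17. Literacy is the only predictor in this model, so no other variables are being held constant.

The intercept is not statistically significant at the conventional 5% level (p = 0.0636), but the literacy coefficient is (p = 0.0398). This points to a positive association between literacy and GDP per capita rather than a proven effect: the model is fit to only 9 countries and explains about 48% of the variance in GDP per capita (R-squared = 0.4759).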
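
To make the model equation concrete, the sketch below evaluates the fitted line using the coefficients printed by `summary(model)`. The helper `predict_gdp()` and the input value 990 are hypothetical and are used only for illustration.

```r
# Coefficients copied from the summary(model) output
intercept <- -70568.28
slope     <- 83.17

# Fitted line: predicted GDP per capita for a given predictor value
predict_gdp <- function(x) intercept + slope * x

# Predictor value at which the fitted line crosses zero
zero_crossing <- -intercept / slope

# Hypothetical input chosen only for illustration
example_prediction <- predict_gdp(990)

print(zero_crossing)
print(example_prediction)
```

The line only gives positive predictions for x above roughly 848, so the predictor values in the fitted data cannot be on a 0 to 100 scale. It is worth checking how `Literacy (%)` was parsed before reading the slope as dollars per percentage point.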

Exploring The Data

Ranking the Eastern European countries by size.

The reorder function arranges the countries by area; the minus sign in front of the area puts the largest country first.

# Create the bar chart
ggplot(Eastern_europe, aes(x = reorder(Country, -`Area (sq. mi.)`), y = `Area (sq. mi.)`)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(x = "Country", y = "Area in square miles", title = "Eastern European countries ranked by size") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
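
The effect of `reorder()` with a negated value can be seen on a small made-up example. The names and areas below are hypothetical.

```r
# Made-up names and areas, for illustration only
country <- c("A", "B", "C")
area    <- c(10, 30, 20)

# reorder() sorts factor levels by increasing value;
# negating the value puts the largest area first
by_area <- reorder(country, -area)

levels(by_area)
```

ggplot2 draws discrete axis categories in factor-level order, which is why the bars appear from largest to smallest.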

Checking whether death rate and net migration are correlated in Eastern Europe

Cleaning the data

cleaned_data <- na.omit(Eastern_europe[c("Net migration", "Deathrate")])

Scatterplot of death rate vs. net migration

ggplot(cleaned_data, aes(x = Deathrate, y = `Net migration`)) +
  geom_point() +
  labs(x = "Death Rate", y = "Net Migration", title = "Net Migration vs Death Rate")

The scatterplot shows no apparent correlation between death rate and net migration.
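
A visual impression like this can be backed up with a correlation coefficient. The sketch below uses made-up values, chosen so that the coefficient is close to zero; the real check would pass the two columns of `cleaned_data` instead.

```r
# Made-up values; the real check would use cleaned_data instead
toy <- data.frame(
  deathrate     = c(9, 10, 11, 12, 13),
  net_migration = c(1, -2, 0, 2, -1)
)

# Pearson correlation coefficient, between -1 and 1
r <- cor(toy$deathrate, toy$net_migration)

# Test of the null hypothesis that the true correlation is zero
result <- cor.test(toy$deathrate, toy$net_migration)

print(r)
print(result$p.value)
```

A coefficient near zero with a large p-value would be consistent with the reading of the scatterplot, although with so few countries the test has little power.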

Final Visualization

The size of each dot is based on the country's population.

Scatterplot of GDP per capita vs. infant mortality

p <- ggplot(Eastern_europe, aes(x = `GDP ($ per capita)`, y = `Infant mortality (per 1000 births)`, color = Country, size = Population)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Infant Mortality", title = "Scatterplot of GDP per Capita vs. Infant Mortality", caption = "Source: US Government") +
  theme_minimal() +
  guides(color = guide_legend(title = "Country"), size = guide_legend(title = "Population")) +
  theme(legend.position = "bottom")

# Add interactivity with plotly; no `text` aesthetic is mapped above,
# so show all mapped aesthetics in the tooltip
plot_interactive <- ggplotly(p, tooltip = "all")

# Show the interactive plot
plot_interactive
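
In `ggplotly()`, the `tooltip` argument selects which aesthetic mappings appear on hover, and a custom label can be supplied through the unofficial `text` aesthetic. A minimal sketch on made-up data (all names and values below are hypothetical):

```r
library(ggplot2)
library(plotly)

# Made-up data; column names are illustrative only
toy <- data.frame(
  country = c("A", "B", "C"),
  gdp     = c(5000, 9000, 15000),
  infant  = c(20, 12, 5)
)

# `text` is not a standard ggplot2 aesthetic,
# but ggplotly() uses it for the hover label
p_toy <- ggplot(toy, aes(x = gdp, y = infant,
                         text = paste("Country:", country))) +
  geom_point()

# Show only the custom label when hovering over a point
toy_interactive <- ggplotly(p_toy, tooltip = "text")
```

Printing `toy_interactive` displays the widget, with each point showing only its country label on hover.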

Brief Essay

The topic of this analysis is the relationship between infant mortality rates and GDP per capita in Eastern Europe. To clean up the data set, I removed any missing values from the variables of interest. I chose this topic because GDP and infant mortality rates are clear indicators of how developed a country is. Several of these Eastern European countries were once part of larger states, so I was curious about how they are doing on their own.

Infant mortality refers to the number of deaths of infants under one year of age per 1,000 live births. GDP per capita, on the other hand, measures the average economic output per person in a country. According to the World Bank, Eastern Europe has undergone significant economic and political changes since the fall of the Soviet Union (“EU Regular Economic Report - Living up to Potential in the Wake of Adverse Shocks: Part 1”).

To conclude, the visualization I made is a scatterplot of GDP per capita vs. infant mortality for countries in Eastern Europe. Each point on the plot represents a country, with the x-coordinate indicating GDP per capita and the y-coordinate indicating the infant mortality rate. The points are colored by country, and the size of each point reflects the country's population. I wish the data set included more variables to help distinguish the countries' well-being. Overall, this was a very fun project.