Project Stage-1

Author

Efe Şahin

Economic Questions

What demographic and regional characteristics are associated with differences in GNI per capita across countries and years?

Classification Question

Can observations be classified as high-income or low-income based on demographic and regional characteristics?

#Data Import

library(readr)
setwd("/Users/efesahin/Desktop/econ 495-ödev-1")
data <- read_csv("WB_WDI_NY_GNP_PCAP_KD.csv")

Rows: 6706 Columns: 41
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): STRUCTURE, STRUCTURE_ID, ACTION, FREQ, REF_AREA, INDICATOR, SEX, A...
dbl  (4): TIME_PERIOD, OBS_VALUE, DECIMALS, UNIT_MULT

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

# A tibble: 6 × 41
  STRUCTURE     STRUCTURE_ID         ACTION FREQ  REF_AREA INDICATOR SEX   AGE  
  <chr>         <chr>                <chr>  <chr> <chr>    <chr>     <chr> <chr>
1 datastructure WB.DATA360:DS_DATA3… I      A     VEN      WB_WDI_N… _T    _T   
2 datastructure WB.DATA360:DS_DATA3… I      A     VNM      WB_WDI_N… _T    _T   
3 datastructure WB.DATA360:DS_DATA3… I      A     PSE      WB_WDI_N… _T    _T   
4 datastructure WB.DATA360:DS_DATA3… I      A     AFE      WB_WDI_N… _T    _T   
5 datastructure WB.DATA360:DS_DATA3… I      A     CEB      WB_WDI_N… _T    _T   
6 datastructure WB.DATA360:DS_DATA3… I      A     EAR      WB_WDI_N… _T    _T   
# ℹ 33 more variables: URBANISATION <chr>, UNIT_MEASURE <chr>,
#   COMP_BREAKDOWN_1 <chr>, COMP_BREAKDOWN_2 <chr>, COMP_BREAKDOWN_3 <chr>,
#   TIME_PERIOD <dbl>, OBS_VALUE <dbl>, AGG_METHOD <chr>, UNIT_TYPE <chr>,
#   DECIMALS <dbl>, DATABASE_ID <chr>, TIME_FORMAT <chr>, UNIT_MULT <dbl>,
#   OBS_STATUS <chr>, OBS_CONF <chr>, FREQ_LABEL <chr>, REF_AREA_LABEL <chr>,
#   INDICATOR_LABEL <chr>, SEX_LABEL <chr>, AGE_LABEL <chr>,
#   URBANISATION_LABEL <chr>, UNIT_MEASURE_LABEL <chr>, …

#Variable Inspection

colnames(data)

 [1] "STRUCTURE"              "STRUCTURE_ID"           "ACTION"                
 [4] "FREQ"                   "REF_AREA"               "INDICATOR"             
 [7] "SEX"                    "AGE"                    "URBANISATION"          
[10] "UNIT_MEASURE"           "COMP_BREAKDOWN_1"       "COMP_BREAKDOWN_2"      
[13] "COMP_BREAKDOWN_3"       "TIME_PERIOD"            "OBS_VALUE"             
[16] "AGG_METHOD"             "UNIT_TYPE"              "DECIMALS"              
[19] "DATABASE_ID"            "TIME_FORMAT"            "UNIT_MULT"             
[22] "OBS_STATUS"             "OBS_CONF"               "FREQ_LABEL"            
[25] "REF_AREA_LABEL"         "INDICATOR_LABEL"        "SEX_LABEL"             
[28] "AGE_LABEL"              "URBANISATION_LABEL"     "UNIT_MEASURE_LABEL"    
[31] "COMP_BREAKDOWN_1_LABEL" "COMP_BREAKDOWN_2_LABEL" "COMP_BREAKDOWN_3_LABEL"
[34] "AGG_METHOD_LABEL"       "UNIT_TYPE_LABEL"        "DECIMALS_LABEL"        
[37] "DATABASE_ID_LABEL"      "TIME_FORMAT_LABEL"      "UNIT_MULT_LABEL"       
[40] "OBS_STATUS_LABEL"       "OBS_CONF_LABEL"

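The colnames(data) function was used to list all variables in the dataset. This helped identify the variables relevant to the analysis: country (REF_AREA_LABEL), year (TIME_PERIOD), age group (AGE_LABEL), sex (SEX_LABEL), urbanisation (URBANISATION_LABEL), and the income value (OBS_VALUE).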
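Beyond listing the column names, it can help to check how many distinct values each column takes, since a column with a single value (for example a SEX_LABEL that is always "Total") cannot explain any variation in income. The following is a minimal sketch on a small made-up data frame; the object names toy, n_distinct_values, and constant_cols are illustrative assumptions, and the values are not taken from the real file.

```r
# Toy stand-in for the real data; values are illustrative only.
toy <- data.frame(
  REF_AREA_LABEL = c("A", "B", "C", "A"),
  TIME_PERIOD    = c(1999, 1999, 2000, 2000),
  SEX_LABEL      = c("Total", "Total", "Total", "Total"),
  OBS_VALUE      = c(2390, 1050, 3297, 2500)
)

# Count the distinct values in each column.
n_distinct_values <- vapply(toy, function(col) length(unique(col)), integer(1))
print(n_distinct_values)

# Columns with only one distinct value carry no information for modeling.
constant_cols <- names(n_distinct_values)[n_distinct_values == 1]
print(constant_cols)
```

Running the same two steps on data instead of toy would show which of the 41 columns actually vary.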

#Data Cleaning

The dataset was reduced to the six variables needed for the analysis. Rows with a missing value in any of these variables were then removed using na.omit() to create a cleaner dataset for modeling.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

clean_data <- data %>%
  select(
    REF_AREA_LABEL,
    TIME_PERIOD,
    SEX_LABEL,
    AGE_LABEL,
    URBANISATION_LABEL,
    OBS_VALUE
  ) %>%
  na.omit()

head(clean_data)

# A tibble: 6 × 6
  REF_AREA_LABEL    TIME_PERIOD SEX_LABEL AGE_LABEL URBANISATION_LABEL OBS_VALUE
  <chr>                   <dbl> <chr>     <chr>     <chr>                  <dbl>
1 Venezuela, RB            1999 Total     All age … Total                  2390.
2 Vietnam                  1999 Total     All age … Total                  1050.
3 West Bank and Ga…        1999 Total     All age … Total                  3297.
4 Eastern & Southe…        2000 Total     All age … Total                  1112.
5 Central Europe a…        2000 Total     All age … Total                  7095.
6 Early-demographi…        2000 Total     All age … Total                  2146.
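Because na.omit() drops rows silently, it may be worth recording how many missing values each column has and how many rows are lost. A minimal sketch on made-up data follows; toy, na_per_column, and rows_dropped are hypothetical names, and the numbers do not come from the real dataset.

```r
# Toy stand-in with two missing income values; values are illustrative only.
toy <- data.frame(
  REF_AREA_LABEL = c("A", "B", "C", "D"),
  TIME_PERIOD    = c(1999, 1999, 2000, 2000),
  OBS_VALUE      = c(2390, NA, 3297, NA)
)

# Missing values per column, before any rows are removed.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that na.omit() would drop.
rows_dropped <- nrow(toy) - nrow(na.omit(toy))
print(rows_dropped)
```

The same two lines could be applied to the selected columns of data before the na.omit() step to document the effect of the cleaning.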

#Summary Statistics

summary(clean_data$OBS_VALUE)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   149.4   1644.8   4760.7  12528.8  17754.9 137688.9

sd(clean_data$OBS_VALUE)

[1] 16776.81

The minimum GNI per capita value is about 149, while the maximum is about 137,689, showing a very large spread between observations. The standard deviation (about 16,777) is larger than the mean (about 12,529), and the mean is much higher than the median (about 4,761). This suggests that the distribution is positively skewed: a relatively small number of observations with very high income values pull the mean upward.
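The gap between mean and median can be backed up with a direct skewness measure. The sketch below computes the moment-based sample skewness (third central moment divided by the 1.5 power of the second) in base R; sample_skewness is a helper written for this illustration, not a built-in function, and toy_income is a made-up vector rather than the real OBS_VALUE column.

```r
# Moment-based sample skewness: positive values indicate a long right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative right-skewed values (not taken from the dataset).
toy_income <- c(150, 900, 1600, 4800, 17800, 137000)
skew_raw <- sample_skewness(toy_income)
print(skew_raw)

# A symmetric vector has zero skewness, for comparison.
skew_symmetric <- sample_skewness(c(1, 2, 3, 4, 5))
print(skew_symmetric)
```

Calling sample_skewness(clean_data$OBS_VALUE) would give the corresponding figure for the actual data.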

#Histogram of GNI per Capita

library(ggplot2)

ggplot(clean_data, aes(x = OBS_VALUE)) +
  geom_histogram(bins = 30) +
  labs(x = "GNI per capita", y = "Number of observations") +
  theme_minimal()

The histogram shows that most observations are concentrated at lower GNI per capita values, while a smaller number of observations have very high income levels. This creates a positively skewed distribution with a long right tail.

#Log Transformation

clean_data$log_income <- log(clean_data$OBS_VALUE)

ggplot(clean_data, aes(x = log_income)) +
  geom_histogram(bins = 30) +
  labs(x = "Log of GNI per capita", y = "Number of observations") +
  theme_minimal()

A logarithmic transformation was applied to reduce the skewness of the distribution. After the transformation, the histogram is more symmetric and closer to a normal distribution. This suggests that GNI per capita may be approximately log-normally distributed.
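The visual impression from the two histograms can be checked numerically by comparing skewness before and after taking logs. The sketch below does this on simulated log-normal data, not on the real dataset; the parameters meanlog = 8.5 and sdlog = 1.5 are arbitrary assumptions chosen only to produce a strongly right-skewed sample, and sample_skewness is a helper defined here for illustration.

```r
# Moment-based sample skewness (helper defined for this illustration).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated log-normal "incomes"; parameters are arbitrary assumptions.
set.seed(1)
sim_income <- rlnorm(1000, meanlog = 8.5, sdlog = 1.5)

# Skewness on the raw scale and on the log scale.
skew_raw <- sample_skewness(sim_income)
skew_log <- sample_skewness(log(sim_income))
print(c(raw = skew_raw, log = skew_log))
```

If the real data behave similarly, sample_skewness(clean_data$log_income) should be much closer to zero than sample_skewness(clean_data$OBS_VALUE).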