Rows: 6706 Columns: 41
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): STRUCTURE, STRUCTURE_ID, ACTION, FREQ, REF_AREA, INDICATOR, SEX, A...
dbl (4): TIME_PERIOD, OBS_VALUE, DECIMALS, UNIT_MULT
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
# A tibble: 6 × 41
STRUCTURE STRUCTURE_ID ACTION FREQ REF_AREA INDICATOR SEX AGE
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 datastructure WB.DATA360:DS_DATA3… I A VEN WB_WDI_N… _T _T
2 datastructure WB.DATA360:DS_DATA3… I A VNM WB_WDI_N… _T _T
3 datastructure WB.DATA360:DS_DATA3… I A PSE WB_WDI_N… _T _T
4 datastructure WB.DATA360:DS_DATA3… I A AFE WB_WDI_N… _T _T
5 datastructure WB.DATA360:DS_DATA3… I A CEB WB_WDI_N… _T _T
6 datastructure WB.DATA360:DS_DATA3… I A EAR WB_WDI_N… _T _T
# ℹ 33 more variables: URBANISATION <chr>, UNIT_MEASURE <chr>,
# COMP_BREAKDOWN_1 <chr>, COMP_BREAKDOWN_2 <chr>, COMP_BREAKDOWN_3 <chr>,
# TIME_PERIOD <dbl>, OBS_VALUE <dbl>, AGG_METHOD <chr>, UNIT_TYPE <chr>,
# DECIMALS <dbl>, DATABASE_ID <chr>, TIME_FORMAT <chr>, UNIT_MULT <dbl>,
# OBS_STATUS <chr>, OBS_CONF <chr>, FREQ_LABEL <chr>, REF_AREA_LABEL <chr>,
# INDICATOR_LABEL <chr>, SEX_LABEL <chr>, AGE_LABEL <chr>,
# URBANISATION_LABEL <chr>, UNIT_MEASURE_LABEL <chr>, …
The column names were listed with colnames(data) to see all 41 variables in the dataset. This helped identify the variables needed for the analysis, such as country, year, age group, sex, and the income value.
# Data Cleaning
The dataset was reduced to the six variables used in the analysis: REF_AREA_LABEL, TIME_PERIOD, SEX_LABEL, AGE_LABEL, URBANISATION_LABEL, and OBS_VALUE. Rows with missing values were then removed with na.omit() to leave a complete dataset for modeling.
library(dplyr)
# A tibble: 6 × 6
REF_AREA_LABEL TIME_PERIOD SEX_LABEL AGE_LABEL URBANISATION_LABEL OBS_VALUE
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Venezuela, RB 1999 Total All age … Total 2390.
2 Vietnam 1999 Total All age … Total 1050.
3 West Bank and Ga… 1999 Total All age … Total 3297.
4 Eastern & Southe… 2000 Total All age … Total 1112.
5 Central Europe a… 2000 Total All age … Total 7095.
6 Early-demographi… 2000 Total All age … Total 2146.
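The code for the cleaning step is not shown in the output above, so the following is only a minimal sketch of how it could look. It assumes the selection was done with dplyr's select() on the six columns visible in the cleaned tibble. The small `data` object built here is a made-up stand-in with invented values, so the example runs on its own; it is not the real file.

```r
library(dplyr)

# Toy stand-in for `data`: real column names, invented values (assumption).
data <- tibble(
  REF_AREA_LABEL     = c("Country A", "Country B", "Country C"),
  TIME_PERIOD        = c(1999, 1999, 2000),
  SEX_LABEL          = "Total",
  AGE_LABEL          = "All ages (toy label)",
  URBANISATION_LABEL = "Total",
  OBS_VALUE          = c(2390, NA, 7095),
  OBS_STATUS         = c("A", "A", "A")  # extra column, dropped by select()
)

# Keep only the analysis variables, then drop rows containing any NA.
clean_data <- data %>%
  select(REF_AREA_LABEL, TIME_PERIOD, SEX_LABEL,
         AGE_LABEL, URBANISATION_LABEL, OBS_VALUE) %>%
  na.omit()

head(clean_data)
```

Note that na.omit() drops a row if any of the selected columns is missing, not only OBS_VALUE, so selecting columns first keeps rows that are incomplete only in unused variables.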
# Summary Statistics
summary(clean_data$OBS_VALUE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
149.4 1644.8 4760.7 12528.8 17754.9 137688.9
sd(clean_data$OBS_VALUE)
[1] 16776.81
GNI per capita ranges from a minimum of about 149 to a maximum of about 137,689, a very wide spread across observations. The mean (12,529) is much higher than the median (4,761), and the standard deviation (16,777) is larger than the mean. Together these indicate a positively skewed distribution, in which a relatively small number of observations with very high income pull the mean upward.
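As a quick check of the skew, the summary figures reported above can be combined into two simple indicators. This is an illustrative sketch and not part of the original analysis: the variable names are hypothetical, and the numbers are copied from the summary() and sd() output.

```r
# Summary figures copied from the summary() and sd() output above.
mean_gni   <- 12528.8
median_gni <- 4760.7
sd_gni     <- 16776.81

# Ratio of mean to median: values well above 1 point to a long right tail.
mean_median_ratio <- mean_gni / median_gni              # about 2.63

# Nonparametric skew: (mean - median) / sd, always between -1 and 1.
nonparametric_skew <- (mean_gni - median_gni) / sd_gni  # about 0.46

cat("mean / median:     ", round(mean_median_ratio, 2), "\n")
cat("nonparametric skew:", round(nonparametric_skew, 2), "\n")
```

Both indicators are positive and far from their symmetric reference values (1 and 0), which agrees with the reading of a right-skewed distribution.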
The histogram shows that most observations are concentrated at lower GNI per capita values, while a smaller number of observations have very high income levels. This is the shape of a positively skewed distribution with a long right tail.
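The plotting code is not included in the text, so here is a minimal sketch of such a histogram in base R. To stay self-contained it uses simulated right-skewed values (the `gni` vector and its parameters are assumptions chosen only for illustration); with the real data, clean_data$OBS_VALUE would take its place.

```r
set.seed(42)

# Simulated right-skewed values standing in for clean_data$OBS_VALUE (assumption).
gni <- rlnorm(5000, meanlog = 8.5, sdlog = 1.3)

# Histogram on the original scale; hist() also returns the bin counts.
h <- hist(gni,
          breaks = 50,
          main   = "GNI per capita (simulated)",
          xlab   = "GNI per capita")

# In a right-skewed sample the mean sits above the median.
c(mean = mean(gni), median = median(gni))
```

On data like this the leftmost bins hold most of the observations and the bars thin out quickly toward the right, which is the pattern described above.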
A logarithmic transformation was applied to reduce the skewness of the distribution. After the transformation, the histogram became more symmetric and closer to a normal distribution. This suggests that the original GNI per capita data may be approximately log-normally distributed.
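The effect of the transformation can be sketched as follows. This is a hedged illustration on simulated log-normal values, not the report's actual code: the `gni` vector, its parameters, and the helper `skewness()` are all assumptions introduced here (base R has no built-in skewness function, so a moment-based one is defined).

```r
set.seed(1)

# Simulated values standing in for clean_data$OBS_VALUE (assumption).
gni <- rlnorm(5000, meanlog = 8.5, sdlog = 1.3)

# Moment-based sample skewness: third central moment over variance^(3/2).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Natural-log transform; requires strictly positive values.
log_gni <- log(gni)

skew_raw <- skewness(gni)      # strongly positive on the original scale
skew_log <- skewness(log_gni)  # close to zero after the transform

hist(log_gni,
     breaks = 50,
     main   = "log(GNI per capita), simulated",
     xlab   = "log(GNI per capita)")

c(raw = skew_raw, log = skew_log)
```

A skewness near zero on the log scale is consistent with a log-normal shape but does not prove it; a normal Q-Q plot of the logged values (qqnorm() followed by qqline()) is a common follow-up check.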