The 10 variables are:
# read in necessary libraries
library(ggplot2)
library(ggpubr)
growth <- read.csv("https://raw.githubusercontent.com/D-hartog/csv_file/main/GrowthDJ.csv")
head(growth)
For the rest of the analysis I will not use the inter or oecd columns, or the column labeled X (an index column carried over from the csv file). I create a subset of the data with only the columns that are relevant to the questions above.
growth_og <- growth[, c("oil", "gdp60", "gdp85", "gdpgrowth", "popgrowth", "invest", "school", "literacy60")]
head(growth_og)
Looking at the summary of all the attributes in the data frame:
summary(growth_og)
## oil gdp60 gdp85 gdpgrowth
## Length:121 Min. : 383.0 Min. : 412 Min. :-0.900
## Class :character 1st Qu.: 973.2 1st Qu.: 1209 1st Qu.: 2.800
## Mode :character Median : 1962.0 Median : 3484 Median : 3.900
## Mean : 3681.8 Mean : 5683 Mean : 4.094
## 3rd Qu.: 4274.5 3rd Qu.: 7719 3rd Qu.: 5.300
## Max. :77881.0 Max. :25635 Max. : 9.200
## NA's :5 NA's :13 NA's :4
## popgrowth invest school literacy60
## Min. :0.300 Min. : 4.10 Min. : 0.400 Min. : 1.00
## 1st Qu.:1.700 1st Qu.:12.00 1st Qu.: 2.400 1st Qu.: 15.00
## Median :2.400 Median :17.70 Median : 4.950 Median : 39.00
## Mean :2.279 Mean :18.16 Mean : 5.526 Mean : 48.17
## 3rd Qu.:2.900 3rd Qu.:24.10 3rd Qu.: 8.175 3rd Qu.: 83.50
## Max. :6.800 Max. :36.90 Max. :12.100 Max. :100.00
## NA's :14 NA's :3 NA's :18
We can see that there are some NA values in the data frame, so we can take a closer look at how many are in each column.
sapply(growth_og, function(x) sum(is.na(x)))
## oil gdp60 gdp85 gdpgrowth popgrowth invest school
## 0 5 13 4 14 0 3
## literacy60
## 18
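As a cross-check, the same per-column counts can be obtained with colSums(). The sketch below runs on a small made-up data frame (toy is a hypothetical stand-in, not the growth data), so the numbers are for illustration only.

```r
# Toy data frame standing in for growth_og (values are made up)
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(10, 20, 30, 40)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# The sapply() approach used above gives the same counts
na_counts_sapply <- sapply(toy, function(x) sum(is.na(x)))
```

Either form works; colSums() is simply a little shorter when every column should be checked.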
My next step was to decide what to do about the NA values. I wanted to compare how the data would behave if I dropped the rows with NA values or replaced the NA values with the column median; the median is less likely than the mean to be affected by outliers.
I started by creating a new data frame and replacing the NA values with the median values.
growth_med <- growth_og
# Then fill in the median values into the new data frame growth_med
growth_med$gdp60[is.na(growth_med$gdp60)] <- median(growth_med$gdp60, na.rm = TRUE)
growth_med$gdp85[is.na(growth_med$gdp85)] <- median(growth_med$gdp85, na.rm = TRUE)
growth_med$gdpgrowth[is.na(growth_med$gdpgrowth)] <- median(growth_med$gdpgrowth, na.rm = TRUE)
growth_med$popgrowth[is.na(growth_med$popgrowth)] <- median(growth_med$popgrowth, na.rm = TRUE)
growth_med$school[is.na(growth_med$school)] <- median(growth_med$school, na.rm = TRUE)
growth_med$literacy60[is.na(growth_med$literacy60)] <- median(growth_med$literacy60, na.rm = TRUE)
# Recheck summary stats
summary(growth_med)
## oil gdp60 gdp85 gdpgrowth
## Length:121 Min. : 383 Min. : 412 Min. :-0.900
## Class :character 1st Qu.: 1009 1st Qu.: 1329 1st Qu.: 2.800
## Mode :character Median : 1962 Median : 3484 Median : 3.900
## Mean : 3611 Mean : 5447 Mean : 4.088
## 3rd Qu.: 3766 3rd Qu.: 6868 3rd Qu.: 5.200
## Max. :77881 Max. :25635 Max. : 9.200
## popgrowth invest school literacy60
## Min. :0.300 Min. : 4.10 Min. : 0.400 Min. : 1.0
## 1st Qu.:1.900 1st Qu.:12.00 1st Qu.: 2.400 1st Qu.: 16.0
## Median :2.400 Median :17.70 Median : 4.950 Median : 39.0
## Mean :2.293 Mean :18.16 Mean : 5.512 Mean : 46.8
## 3rd Qu.:2.800 3rd Qu.:24.10 3rd Qu.: 8.100 3rd Qu.: 75.0
## Max. :6.800 Max. :36.90 Max. :12.100 Max. :100.0
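The column-by-column replacement above could also be written as a loop over the numeric columns. This is only a sketch on made-up data: toy, toy_med, and num_cols are hypothetical names, and the values are not from the growth data.

```r
# Toy data frame standing in for growth_og (values are made up)
toy <- data.frame(
  oil = c("no", "no", "yes", "no", "no"),
  x   = c(1, NA, 3, 100, 5),
  y   = c(NA, 2, 4, 6, NA),
  stringsAsFactors = FALSE
)

toy_med <- toy

# Only numeric columns can have a median; skip character columns such as oil
num_cols <- names(toy_med)[sapply(toy_med, is.numeric)]

for (col in num_cols) {
  missing <- is.na(toy_med[[col]])
  # Compute the median from the observed values, then fill the gaps with it
  toy_med[[col]][missing] <- median(toy_med[[col]], na.rm = TRUE)
}

print(toy_med)
```

The loop avoids repeating the same line for each column, at the cost of being slightly less explicit about which columns are touched.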
Next I created another data frame that dropped any row with an NA value:
growth_dropna <- growth_og[rowSums(is.na(growth_og)) == 0, ]
# Recheck summary stats
summary(growth_dropna)
## oil gdp60 gdp85 gdpgrowth
## Length:100 Min. : 383 Min. : 412 Min. :-0.900
## Class :character 1st Qu.: 1001 1st Qu.: 1182 1st Qu.: 2.700
## Mode :character Median : 1945 Median : 3484 Median : 3.800
## Mean : 3837 Mean : 5634 Mean : 3.973
## 3rd Qu.: 4776 3rd Qu.: 7719 3rd Qu.: 5.125
## Max. :77881 Max. :25635 Max. : 9.200
## popgrowth invest school literacy60
## Min. :0.300 Min. : 4.10 Min. : 0.400 Min. : 1.00
## 1st Qu.:1.700 1st Qu.:11.62 1st Qu.: 2.400 1st Qu.: 15.00
## Median :2.400 Median :16.55 Median : 4.850 Median : 46.00
## Mean :2.274 Mean :17.43 Mean : 5.453 Mean : 49.43
## 3rd Qu.:2.900 3rd Qu.:23.32 3rd Qu.: 8.075 3rd Qu.: 84.00
## Max. :6.800 Max. :36.90 Max. :11.900 Max. :100.00
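For reference, base R offers two other common ways to keep only complete rows, complete.cases() and na.omit(). The sketch below compares them with the rowSums() filter on a small made-up data frame (toy is hypothetical).

```r
# Toy data frame with one NA in each of two different rows (values are made up)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z")
)

# Approach used above: keep rows whose NA count is zero
keep_rowsums <- toy[rowSums(is.na(toy)) == 0, ]

# complete.cases() returns TRUE for rows with no missing values
keep_cc <- toy[complete.cases(toy), ]

# na.omit() drops incomplete rows and records which ones were removed
keep_omit <- na.omit(toy)

print(keep_cc)
```

All three keep the same rows here; na.omit() additionally stores the dropped row indices in an attribute of the result.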
Box plots of GDP Growth
gdpgrowth_dropna <- ggplot(data = growth_dropna, aes(y = gdpgrowth, x = 1)) +
geom_boxplot() + ylab("gdp growth") +
ggtitle("NA values removed") +
theme(axis.title.y = element_text(color="blue", size=10),
plot.title = element_text(color="darkblue",
size=15))
gdpgrowth_med <- ggplot(data = growth_med, aes(y = gdpgrowth, x = 1)) +
geom_boxplot() + ylab("gdp growth") +
ggtitle("NA values replaced (median)") +
theme(axis.title.y = element_text(color="red", size=10),
plot.title = element_text(color="darkblue",
size=15))
ggarrange(gdpgrowth_dropna, gdpgrowth_med, labels = c("A", "B"), ncol = 2, nrow = 1)
Comparing GDP growth with NA values removed vs. NA values replaced with the median, we do not see much of a change in the quartiles or the overall distribution of the data. This is likely because the gdpgrowth column has only a small number of missing values (4).
Looking at the median of each:
Median gdp growth (NA values removed): 3.8
Median gdp growth (NA values replaced with median): 3.9
Box plots of population growth rates
pop_dropna <- ggplot(data = growth_dropna, aes(y = popgrowth, x = 1)) +
geom_boxplot() + ylab("Population growth") +
ggtitle("NA values removed") +
theme(axis.title.y = element_text(color="blue", size=10),
plot.title = element_text(color="darkblue",
size=15))
pop_med <- ggplot(data = growth_med, aes(y = popgrowth, x = 1)) +
geom_boxplot() + ylab("Population growth") +
ggtitle("NA values replaced (median)") +
theme(axis.title.y = element_text(color="red", size=10),
plot.title = element_text(color="darkblue",
size=15))
ggarrange(pop_dropna, pop_med, labels = c("A", "B"), ncol = 2, nrow = 1)
Median population growth (NA values removed): 2.4
Median population growth (NA values replaced with median): 2.4
Replacing the NA values with the median didn’t seem to have a big effect on the median or the mean of those variables (as seen in the summary statistics above). For this reason I decided to continue working with the data set where NA values were removed. Dropping those rows also eliminated most of the outliers; the one that remained was also discarded.
# remove the remaining outlier (the row with the maximum popgrowth value) from growth_dropna
growth_dropna <- growth_dropna[growth_dropna$popgrowth != max(growth_dropna$popgrowth),]
rownames(growth_dropna) <- NULL
dim(growth_dropna)
## [1] 99 8
The data frame now has 99 rows and 8 columns: the 100 rows that contain no NA values, minus the one outlier row that was removed.
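The points that a box plot draws as outliers are, by default, values more than 1.5 times the interquartile range beyond the hinges. As a rough sketch of how such points could be identified programmatically rather than by taking the maximum, base R's boxplot.stats() applies this rule. The vector x below is made up and only stands in for a column like popgrowth.

```r
# Made-up values standing in for a column such as popgrowth
x <- c(1.7, 2.1, 2.4, 2.6, 2.9, 3.0, 6.8)

# boxplot.stats() flags values beyond 1.5 * IQR from the hinges (coef = 1.5)
out <- boxplot.stats(x)$out
print(out)

# Keep only the values that were not flagged
x_trimmed <- x[!x %in% out]
print(x_trimmed)
```

Filtering on max() as done above removes exactly one value (plus any ties), while the boxplot.stats() rule removes every flagged point, so the two can differ when several outliers are present.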
Since I wanted to explore the relationship between GDP growth and population growth, I decided to take a closer look at the distribution of the average population growth by creating a histogram.
Histogram of the population growth
ggplot(data = growth_dropna, aes(x = popgrowth)) +
geom_histogram(bins = 15,
fill = "darkcyan",
color = "lightblue") +
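# label each bar with its count; after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
stat_bin(bins = 15, geom = "text", color = "white", size = 4,
aes(label = after_stat(count)), position=position_stack(vjust=0.5)) +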
ggtitle("Histogram of the Avg. Population Growth") +
xlab("Avg. Population Growth") +
ylab("Frequency") +
theme(axis.title.x = element_text(color="chocolate1",size=15),
axis.title.y = element_text(color="chocolate1",size=15),
plot.title = element_text(color="darkblue",
size=20))
Graphically, the distribution of the average population growth appears slightly skewed to the left. This was confirmed numerically with the skewness function from the moments library, which returned a negative value. I also wanted to compare this with the distribution of population growth when NA values were replaced with the median value. As you can see below, that version of the data is right skewed. Note that growth_med, as built above, still contains the maximum value of 6.8 that was removed from growth_dropna, which may account for part of this difference.
Skewness: -0.4455959
Skewness (NA replaced with median values): 0.4364279
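To make the sign of the skewness easier to interpret, here is a minimal base R sketch of the moment-based skewness (third central moment divided by the second central moment raised to the power 3/2). The helper skew() and the two vectors are hypothetical. I believe this is the same definition the moments package uses, so moments::skewness() should agree with it, but that is an assumption rather than something checked here.

```r
# Moment-based skewness: m3 / m2^(3/2), using population (divide-by-n) moments
skew <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)  # second central moment
  m3 <- mean((x - mean(x))^3)  # third central moment
  m3 / m2^(3 / 2)
}

# Made-up vectors: one with a long left tail, and its mirror image
left_tail  <- c(1, 5, 6, 6, 7, 7, 7)
right_tail <- 8 - left_tail

skew(left_tail)   # negative: the long tail is on the left
skew(right_tail)  # positive: the long tail is on the right
```

A negative value means the tail extends toward smaller values and a positive value means it extends toward larger values, which is why a single large observation can flip the sign.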
Next, I wanted to visualize the relationship between the average GDP growth and the average population growth. The most direct way to visualize this was by generating a scatter plot.
ggplot(growth_dropna,
aes(x = popgrowth, y = gdpgrowth)) +
geom_point(color = "blue") +
ggtitle("GDP Growth vs. Population Growth") +
xlab("Population Growth") +
ylab("GDP Growth") +
theme(axis.title.x = element_text(color="chocolate1",size=15),
axis.title.y = element_text(color="chocolate1",size=15),
plot.title = element_text(color="darkblue",
size=20))
Looking at the plot, the points appear to trend upward. The correlation coefficient did show a positive relationship between the average population growth and the GDP growth, although a weak one.
Correlation coefficient:
cor(growth_dropna$popgrowth, growth_dropna$gdpgrowth)
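When a correlation is weak, it can help to look at a confidence interval and p-value as well as the coefficient itself. The sketch below uses base R's cor.test() on made-up vectors (pop and gdp are hypothetical and are not the growth data), so the printed numbers say nothing about the real result.

```r
# Made-up values standing in for popgrowth and gdpgrowth
pop <- c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
gdp <- c(3.1, 4.0, 2.9, 4.4, 3.6, 4.2)

# Pearson correlation coefficient (the default method of cor())
r <- cor(pop, gdp)
print(r)

# cor.test() reports the same estimate plus a p-value and confidence interval
ct <- cor.test(pop, gdp)
print(ct$estimate)
print(ct$p.value)
print(ct$conf.int)
```

On the real data the equivalent call would be cor.test(growth_dropna$popgrowth, growth_dropna$gdpgrowth).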
Finally, I wanted to add other variables to the scatter plot as color layers, to see whether their values concentrate in particular regions of the plot. I separately looked at adding a literacy grade (explained below), whether the country was an oil-producing country, and the average ratio of investment.
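I created a new column called lit_grade, which splits the literacy60 column into three categories based on the 25th and 75th percentiles: "poor" below the 25th percentile, "fair" from the 25th up to (but not including) the 75th, and "good" at or above the 75th.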
growth_dropna$lit_grade[growth_dropna$literacy60 <
quantile(growth_dropna$literacy60, 0.25)] <- "poor"
growth_dropna$lit_grade[(growth_dropna$literacy60 >=
quantile(growth_dropna$literacy60, 0.25)) &
(growth_dropna$literacy60 < quantile(growth_dropna$literacy60, 0.75))] <- "fair"
growth_dropna$lit_grade[growth_dropna$literacy60 >=
quantile(growth_dropna$literacy60, 0.75)] <- "good"
head(growth_dropna)
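The same three-way split could be sketched with cut(), which bins a numeric vector in a single call. The vector lit below is made up and stands in for literacy60. Setting right = FALSE makes each interval closed on the left, which matches the < and >= comparisons used above.

```r
# Made-up literacy values standing in for growth_dropna$literacy60
lit <- c(5, 12, 20, 35, 50, 65, 80, 90, 95, 100)

# 25th and 75th percentiles, with the "25%" / "75%" names dropped
q <- unname(quantile(lit, c(0.25, 0.75)))

# Intervals are [-Inf, q25), [q25, q75), [q75, Inf) because right = FALSE
grade <- cut(
  lit,
  breaks = c(-Inf, q[1], q[2], Inf),
  labels = c("poor", "fair", "good"),
  right  = FALSE
)

table(grade)
```

cut() returns a factor with the levels in the order poor, fair, good, so a ggplot2 legend would follow that order instead of sorting alphabetically. It will stop with an error if the two percentiles happen to be equal, since the breaks must be unique.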
Adding the new literacy grade to the scatter plot
ggplot(growth_dropna,
aes(x = popgrowth, y = gdpgrowth, color = lit_grade)) +
geom_point() +
ggtitle("GDP Growth vs. Population Growth") +
xlab("Population Growth") +
ylab("GDP Growth") +
theme(axis.title.x = element_text(color="chocolate1",size=15),
axis.title.y = element_text(color="chocolate1",size=15),
plot.title = element_text(color="darkblue",
size=20))
Adding the oil column to the scatter plot
ggplot(growth_dropna,
aes(x = popgrowth, y = gdpgrowth, color = oil)) +
geom_point() +
ggtitle("GDP Growth vs. Population Growth") +
xlab("Population Growth") +
ylab("GDP Growth") +
theme(axis.title.x = element_text(color="chocolate1",size=15),
axis.title.y = element_text(color="chocolate1",size=15),
plot.title = element_text(color="darkblue",
size=20))
Adding the investment data to the scatter plot
ggplot(growth_dropna,
aes(x = popgrowth, y = gdpgrowth, color = invest)) +
geom_point() +
ggtitle("GDP Growth vs. Population Growth") +
xlab("Population Growth") +
ylab("GDP Growth") +
theme(axis.title.x = element_text(color="chocolate1",size=15),
axis.title.y = element_text(color="chocolate1",size=15),
plot.title = element_text(color="darkblue",
size=20))
Looking at the original problem statement: Did countries with higher average population growth rates show a higher average growth rate of per capita GDP from 1960 to 1985? The analysis found only a weak positive correlation between these two variables, so it did not show that higher population growth was related to higher GDP growth.
Can we visualize other relationships between GDP growth, population growth and the rest of the data? When layering the other variables onto the scatter plot, there were no obvious patterns or concentrations of the new variable on the plot. The oil-producing countries were only a small portion of the data (3 countries in total), but they did appear to have an average or above-average growth rate of per capita GDP.