R-programming: Homework wk3

Problem statement: Did countries with higher average population growth rates show a higher average growth rate of per capita GDP from 1960 to 1985? Can we visualize other relationships between GDP growth, population growth and the rest of the data?

Initial data exploration, relevant data wrangling and transformation

For the rest of the analysis I will not use the inter or oecd columns, or the column labeled X (an index column carried over from the csv file). I created a subset of the data with the columns relevant to the questions above.

growth_og <- growth[, c("oil", "gdp60", "gdp85", "gdpgrowth", "popgrowth", "invest", "school", "literacy60")]
head(growth_og)

Looking at the summary of all the attributes in the data frame

summary(growth_og)

##      oil                gdp60             gdp85         gdpgrowth     
##  Length:121         Min.   :  383.0   Min.   :  412   Min.   :-0.900  
##  Class :character   1st Qu.:  973.2   1st Qu.: 1209   1st Qu.: 2.800  
##  Mode  :character   Median : 1962.0   Median : 3484   Median : 3.900  
##                     Mean   : 3681.8   Mean   : 5683   Mean   : 4.094  
##                     3rd Qu.: 4274.5   3rd Qu.: 7719   3rd Qu.: 5.300  
##                     Max.   :77881.0   Max.   :25635   Max.   : 9.200  
##                     NA's   :5         NA's   :13      NA's   :4       
##    popgrowth         invest          school         literacy60    
##  Min.   :0.300   Min.   : 4.10   Min.   : 0.400   Min.   :  1.00  
##  1st Qu.:1.700   1st Qu.:12.00   1st Qu.: 2.400   1st Qu.: 15.00  
##  Median :2.400   Median :17.70   Median : 4.950   Median : 39.00  
##  Mean   :2.279   Mean   :18.16   Mean   : 5.526   Mean   : 48.17  
##  3rd Qu.:2.900   3rd Qu.:24.10   3rd Qu.: 8.175   3rd Qu.: 83.50  
##  Max.   :6.800   Max.   :36.90   Max.   :12.100   Max.   :100.00  
##  NA's   :14                      NA's   :3        NA's   :18

The summary shows NA values in several columns. The count of NA values in each column is:

sapply(growth_og, function(x) sum(is.na(x)))

##        oil      gdp60      gdp85  gdpgrowth  popgrowth     invest     school 
##          0          5         13          4         14          0          3 
## literacy60 
##         18
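The same per-column count can also be obtained with colSums(), since is.na() applied to a data frame returns a logical matrix. A minimal sketch on a small made-up data frame (toy_df and its values are illustrative and not taken from the growth data):

```r
# Hypothetical toy data frame standing in for growth_og
toy_df <- data.frame(
  gdp60     = c(500, NA, 1200, 3000),
  popgrowth = c(2.1, 1.8, NA, NA),
  invest    = c(10.5, 20.1, 15.0, 30.2)
)

# is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy_df))
print(na_counts)

# Share of missing values per column, helpful when choosing between dropping and imputing
na_share <- colMeans(is.na(toy_df))
print(round(na_share, 2))
```

Looking at the share rather than the raw count makes it easier to judge how much of each column would be affected by either strategy.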

My next step was to decide how to handle the NA values. I wanted to compare how the data would behave if I dropped the rows with NA values or replaced the NA values with the column median; the median is less likely than the mean to be affected by outliers.

I first started with creating a new data frame and then replacing the NA values with the median values

growth_med <- growth_og

# Then fill in the median values into the new data frame growth_med
growth_med$gdp60[is.na(growth_med$gdp60)] <- median(growth_med$gdp60, na.rm = TRUE) 
growth_med$gdp85[is.na(growth_med$gdp85)] <- median(growth_med$gdp85, na.rm = TRUE) 
growth_med$gdpgrowth[is.na(growth_med$gdpgrowth)] <- median(growth_med$gdpgrowth, na.rm = TRUE) 
growth_med$popgrowth[is.na(growth_med$popgrowth)] <- median(growth_med$popgrowth, na.rm = TRUE) 
growth_med$school[is.na(growth_med$school)] <- median(growth_med$school, na.rm = TRUE) 
growth_med$literacy60[is.na(growth_med$literacy60)] <- median(growth_med$literacy60, na.rm = TRUE) 

# Recheck summary stats 
summary(growth_med)

##      oil                gdp60           gdp85         gdpgrowth     
##  Length:121         Min.   :  383   Min.   :  412   Min.   :-0.900  
##  Class :character   1st Qu.: 1009   1st Qu.: 1329   1st Qu.: 2.800  
##  Mode  :character   Median : 1962   Median : 3484   Median : 3.900  
##                     Mean   : 3611   Mean   : 5447   Mean   : 4.088  
##                     3rd Qu.: 3766   3rd Qu.: 6868   3rd Qu.: 5.200  
##                     Max.   :77881   Max.   :25635   Max.   : 9.200  
##    popgrowth         invest          school         literacy60   
##  Min.   :0.300   Min.   : 4.10   Min.   : 0.400   Min.   :  1.0  
##  1st Qu.:1.900   1st Qu.:12.00   1st Qu.: 2.400   1st Qu.: 16.0  
##  Median :2.400   Median :17.70   Median : 4.950   Median : 39.0  
##  Mean   :2.293   Mean   :18.16   Mean   : 5.512   Mean   : 46.8  
##  3rd Qu.:2.800   3rd Qu.:24.10   3rd Qu.: 8.100   3rd Qu.: 75.0  
##  Max.   :6.800   Max.   :36.90   Max.   :12.100   Max.   :100.0
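The column-by-column replacement above can be written more compactly as a loop over the numeric columns. The sketch below is one possible way to do it; impute_median and toy_df are hypothetical names introduced for illustration, and the values are made up:

```r
# Hypothetical helper: replace NAs in every numeric column with that column's median
impute_median <- function(df) {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  for (col in numeric_cols) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
  }
  df
}

# Made-up data: one character column and two numeric columns with gaps
toy_df <- data.frame(
  oil       = c("no", "yes", "no", "no", "no"),
  gdp60     = c(500, NA, 1200, 3000, 800),
  popgrowth = c(2.1, 1.8, NA, 2.9, NA)
)

toy_med <- impute_median(toy_df)
print(toy_med)
```

Non-numeric columns such as oil are left untouched, which matches the manual approach where only the numeric columns with NA values were filled.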

Next I created another data frame that dropped any row with an NA value

growth_dropna <- growth_og[rowSums(is.na(growth_og)) == 0, ]

# Recheck summary stats
summary(growth_dropna)

##      oil                gdp60           gdp85         gdpgrowth     
##  Length:100         Min.   :  383   Min.   :  412   Min.   :-0.900  
##  Class :character   1st Qu.: 1001   1st Qu.: 1182   1st Qu.: 2.700  
##  Mode  :character   Median : 1945   Median : 3484   Median : 3.800  
##                     Mean   : 3837   Mean   : 5634   Mean   : 3.973  
##                     3rd Qu.: 4776   3rd Qu.: 7719   3rd Qu.: 5.125  
##                     Max.   :77881   Max.   :25635   Max.   : 9.200  
##    popgrowth         invest          school         literacy60    
##  Min.   :0.300   Min.   : 4.10   Min.   : 0.400   Min.   :  1.00  
##  1st Qu.:1.700   1st Qu.:11.62   1st Qu.: 2.400   1st Qu.: 15.00  
##  Median :2.400   Median :16.55   Median : 4.850   Median : 46.00  
##  Mean   :2.274   Mean   :17.43   Mean   : 5.453   Mean   : 49.43  
##  3rd Qu.:2.900   3rd Qu.:23.32   3rd Qu.: 8.075   3rd Qu.: 84.00  
##  Max.   :6.800   Max.   :36.90   Max.   :11.900   Max.   :100.00
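Base R offers a couple of equivalent ways to drop incomplete rows. The sketch below compares the rowSums() filter used above with complete.cases() and na.omit() on a made-up data frame (toy_df is illustrative only):

```r
# Hypothetical toy data with two incomplete rows
toy_df <- data.frame(
  gdp60     = c(500, NA, 1200, 3000),
  popgrowth = c(2.1, 1.8, NA, 2.9)
)

# Three ways to keep only the fully observed rows
by_rowsums  <- toy_df[rowSums(is.na(toy_df)) == 0, ]
by_complete <- toy_df[complete.cases(toy_df), ]
by_na_omit  <- na.omit(toy_df)

print(by_complete)

# Number of rows lost by dropping incomplete observations
rows_dropped <- nrow(toy_df) - nrow(by_complete)
print(rows_dropped)
```

Checking rows_dropped is a quick way to see how much data the drop strategy costs compared with imputation.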

Next, I used side-by-side box plots to visualize any differences between the two data frames.

Box plots of GDP Growth

library(ggplot2)  # plotting
library(ggpubr)   # provides ggarrange() for side-by-side panels

# Panel A: rows with NA values removed
gdpgrowth_dropna <- ggplot(data = growth_dropna, aes(y = gdpgrowth, x = 1)) + 
  geom_boxplot() + ylab("gdp growth") + 
  ggtitle("NA values removed") +
  theme(axis.title.y = element_text(color="blue", size=10), 
        plot.title = element_text(color="darkblue",
                                  size=15))

# Panel B: NA values replaced with the column median
gdpgrowth_med <- ggplot(data = growth_med, aes(y = gdpgrowth, x = 1)) + 
  geom_boxplot() + ylab("gdp growth") + 
  ggtitle("NA values replaced (median)") +
  theme(axis.title.y = element_text(color="red", size=10), 
        plot.title = element_text(color="darkblue",
                                  size=15))

ggarrange(gdpgrowth_dropna, gdpgrowth_med, labels = c("A", "B"), ncol = 2, nrow = 1)

Comparing gdp growth with NA values removed against gdp growth with NA values replaced by the median, we do not see much change in the quartiles or the distribution of the data. This may be because the gdpgrowth column has only a small number of missing values (4).

Looking at the median of each:

Median gdp growth (NA values removed): 3.8

Median gdp growth (NA values replaced with median): 3.9

Box plots of population growth rates

pop_dropna <- ggplot(data = growth_dropna, aes(y = popgrowth, x = 1))  +
  geom_boxplot() + ylab("Population growth") + 
  ggtitle("NA values removed") + 
  theme(axis.title.y = element_text(color="blue", size=10), 
        plot.title = element_text(color="darkblue",
                                  size=15))
                                  

pop_med <- ggplot(data = growth_med, aes(y = popgrowth, x = 1)) +
  geom_boxplot() + ylab("Population growth") + 
  ggtitle("NA values replaced (median)") + 
  theme(axis.title.y = element_text(color="red", size=10), 
        plot.title = element_text(color="darkblue",
                                  size=15))
                                  

ggarrange(pop_dropna, pop_med, labels = c("A", "B"), ncol = 2, nrow = 1)

Median population growth (NA values removed): 2.4

Median population growth (NA values replaced with median): 2.4

Replacing the NA values with the median did not have a large effect on the median or the mean of those variables (as seen in the summary statistics above). For this reason I decided to continue working with the data set where NA values were removed. Dropping those rows also eliminated most of the outliers; the one that remained was discarded as well.

# Remove the remaining outlier (the row with the maximum popgrowth) from growth_dropna
growth_dropna <- growth_dropna[growth_dropna$popgrowth != max(growth_dropna$popgrowth),]
rownames(growth_dropna) <- NULL
dim(growth_dropna)

## [1] 99  8

The data frame now has 99 rows, down from the original 121: the 21 rows containing NA values and the one outlier have been removed.
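Removing the maximum works here because only one point stood out. A more general option is to flag points with the 1.5 × IQR rule that box plots conventionally use for drawing outliers. The sketch below uses boxplot.stats() from base R on made-up values (pop is hypothetical); note that ggplot2 computes its hinges slightly differently from boxplot.stats(), so borderline points could be classified differently.

```r
# Hypothetical population growth rates; the last value is deliberately extreme
pop <- c(1.7, 2.0, 2.4, 2.4, 2.6, 2.9, 3.0, 6.8)

# boxplot.stats() flags points lying more than 1.5 * IQR beyond the hinges (coef = 1.5)
flagged <- boxplot.stats(pop)$out
print(flagged)

# Keep only the values that were not flagged
pop_clean <- pop[!pop %in% flagged]
print(length(pop_clean))
```

Unlike filtering on max(), this approach also handles outliers at the low end and cases where several points are flagged.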

Other exploratory visualizations

Since I wanted to explore the relationship between gdp growth and population growth, I decided to take a closer look at the distribution of the average population growth by creating a histogram.

Histogram of the population growth

ggplot(data = growth_dropna, aes(x = popgrowth)) + 
  geom_histogram(bins = 15,
                 fill = "darkcyan",
                 color = "lightblue") +
  # Label each bar with its count; after_stat(count) replaces the deprecated ..count..
  stat_bin(bins = 15, geom = "text", color = "white", size = 4,
           aes(label = after_stat(count)), position = position_stack(vjust = 0.5)) +
  ggtitle("Histogram of the Avg. Population Growth") +
  xlab("Avg. Population Growth") +
  ylab("Frequency") +
  theme(axis.title.x = element_text(color="chocolate1",size=15),
        axis.title.y = element_text(color="chocolate1",size=15),
        plot.title = element_text(color="darkblue",
                                  size=20))

The histogram suggests that the distribution of the average population growth is slightly skewed to the left. The skewness() function from the moments package confirms this with a negative value. I also wanted to compare this with the distribution of population growth when NA values were replaced with the median value. As the positive value below shows, that version of the data is right skewed.

Skewness: -0.4455959

Skewness (NA replaced with median values): 0.4364279
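To make the sign convention concrete, the sketch below computes a moment-based skewness, the third central moment divided by the second central moment raised to the power 3/2, in base R. This is assumed to be the same definition that moments::skewness() uses; skew_moment and the three vectors are hypothetical and only for illustration.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2), using population moments
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^(3 / 2)
}

left_tail  <- c(1, 6, 7, 8, 8, 9, 9, 10)  # long tail toward small values
right_tail <- c(1, 2, 2, 3, 3, 4, 5, 10)  # long tail toward large values
symmetric  <- c(1, 2, 3, 4, 5)

print(skew_moment(left_tail))   # negative value: left skewed
print(skew_moment(right_tail))  # positive value: right skewed
print(skew_moment(symmetric))   # zero: no skew
```

A single extreme value on one side is enough to flip the sign, which is worth keeping in mind when comparing data sets that differ in how outliers were handled.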

Next, I wanted to visualize the relationship between the average GDP growth and the average population growth. The most direct way to visualize this was by generating a scatter plot.

ggplot(growth_dropna, 
       aes(x = popgrowth, y = gdpgrowth)) + 
  geom_point(color = "blue") +
  ggtitle("GDP Growth vs. Population Growth") +
  xlab("Population Growth") + 
  ylab("GDP Growth") + 
  theme(axis.title.x = element_text(color="chocolate1",size=15),
        axis.title.y = element_text(color="chocolate1",size=15),
        plot.title = element_text(color="darkblue",
                                  size=20))

Looking at the plot, the points appear to trend upward. The correlation coefficient confirms a positive relationship between the average population growth and the gdp growth, although a weak one.

The correlation coefficient was calculated with cor(growth_dropna$popgrowth, growth_dropna$gdpgrowth).
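When a correlation is weak, a confidence interval helps show how uncertain the estimate is. The sketch below pairs cor() with cor.test() from base R on made-up vectors; the values are hypothetical and do not reproduce the growth data.

```r
# Hypothetical values, not taken from the growth data
popgrowth <- c(0.5, 1.2, 1.9, 2.4, 2.6, 2.9, 3.1, 3.5)
gdpgrowth <- c(3.1, 4.8, 2.2, 5.0, 3.4, 4.1, 5.6, 3.9)

# Pearson correlation (the default method)
r <- cor(popgrowth, gdpgrowth)

# cor.test() reports the same estimate plus a confidence interval and p-value
test <- cor.test(popgrowth, gdpgrowth)

print(round(r, 3))
print(test$conf.int)
print(test$p.value)
```

If the confidence interval includes zero, the data are also compatible with no linear relationship at all.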

Finally, I wanted to add a third variable to the scatter plot as a color layer, to see how other variables in the data set are distributed across it. I separately looked at a literacy grade (explained below), whether the country was an oil-producing country, and the average ratio of investment.

I created a new column called lit_grade, which splits the literacy60 values into three categories based on the 25th and 75th percentiles.

growth_dropna$lit_grade[growth_dropna$literacy60 < 
                          quantile(growth_dropna$literacy60, 0.25)] <- "poor"
growth_dropna$lit_grade[(growth_dropna$literacy60 >= 
                          quantile(growth_dropna$literacy60, 0.25)) &
                          (growth_dropna$literacy60 < quantile(growth_dropna$literacy60, 0.75))] <- "fair"
growth_dropna$lit_grade[growth_dropna$literacy60 >=
                          quantile(growth_dropna$literacy60, 0.75)] <- "good"

head(growth_dropna)
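The same three-way split can be expressed in one call with cut(). The sketch below is an alternative to the three assignments above, shown on a made-up vector (literacy is hypothetical); right = FALSE makes each interval closed on the left, so a value equal to the 25th percentile is graded "fair" and a value equal to the 75th percentile is graded "good", as in the code above.

```r
# Hypothetical literacy rates, not taken from the growth data
literacy <- c(5, 12, 15, 30, 46, 60, 84, 90, 100)

# 25th and 75th percentiles used as cut points
q <- unname(quantile(literacy, c(0.25, 0.75)))

# Intervals: [-Inf, q25) = poor, [q25, q75) = fair, [q75, Inf) = good
lit_grade <- cut(literacy,
                 breaks = c(-Inf, q[1], q[2], Inf),
                 labels = c("poor", "fair", "good"),
                 right = FALSE)

print(table(lit_grade))
```

cut() returns a factor with the levels in the order given, so the legend of a plot colored by this column would read poor, fair, good rather than alphabetically.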

Adding the new literacy grade to the scatter plot

ggplot(growth_dropna, 
       aes(x = popgrowth, y = gdpgrowth, color = lit_grade)) + 
  geom_point() +
  ggtitle("GDP Growth vs. Population Growth") +
  xlab("Population Growth") + 
  ylab("GDP Growth") + 
  theme(axis.title.x = element_text(color="chocolate1",size=15),
        axis.title.y = element_text(color="chocolate1",size=15),
        plot.title = element_text(color="darkblue",
                                  size=20))

Adding the oil column to the scatter plot

ggplot(growth_dropna, 
       aes(x = popgrowth, y = gdpgrowth, color = oil)) + 
  geom_point() +
  ggtitle("GDP Growth vs. Population Growth") +
  xlab("Population Growth") + 
  ylab("GDP Growth") + 
  theme(axis.title.x = element_text(color="chocolate1",size=15),
        axis.title.y = element_text(color="chocolate1",size=15),
        plot.title = element_text(color="darkblue",
                                  size=20))

Adding the investment data to the scatter plot

ggplot(growth_dropna, 
       aes(x = popgrowth, y = gdpgrowth, color = invest)) + 
  geom_point() +
  ggtitle("GDP Growth vs. Population Growth") +
  xlab("Population Growth") + 
  ylab("GDP Growth") + 
  theme(axis.title.x = element_text(color="chocolate1",size=15),
        axis.title.y = element_text(color="chocolate1",size=15),
        plot.title = element_text(color="darkblue",
                                  size=20))

Conclusion

Looking at the original problem statement: Did countries with higher average population growth rates show a higher average growth rate of per capita GDP from 1960 to 1985? The analysis found only a weak positive correlation between these two variables, so it did not show that higher population growth was related to higher GDP growth.

Can we visualize other relationships between GDP growth, population growth and the rest of the data? When the other variables were layered onto the scatter plot, there were no obvious patterns or concentrations of the new variable on the plot. The oil-producing countries, although a small group (3 countries in total), did appear to have an average or above-average growth rate of per capita GDP.

Dirk Hartog

2023-07-24

The data set that I am looking at is the growth regression data as provided by Durlauf & Johnson (1995). There are 10 variables that are captured for each country (observation).
