Data Description
The dataset selected for this assignment is the “World Development Indicators and Other World Bank Data”, loaded through the WDI R package. The data come from over 40 databases hosted by the World Bank, including the World Development Indicators (‘WDI’), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national Poverty indicators. The complete dataset has 16532 observations of 13 variables.
I then filtered the dataset to the United States to examine trends over time and, where possible, the impact of major world events on different indicators. A basic driver of the trends in this data is the socioeconomic condition of the country at any point in time. The variable of interest is the infant mortality rate. Variation in this variable is usually driven by a combination of GDP and years of schooling, which affect the standard of healthcare provided and the basic health knowledge of individuals.
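knitr::opts_chunk$set(echo = TRUE)
library(WDI)
library(ggplot2)
library(dplyr)  # provides %>%, filter(), and as_tibble()
# Download three indicators for all countries, 1960-2020
wdi_data = WDI(indicator = c('inf_mort' = "SP.DYN.IMRT.IN",      # infant mortality rate (per 1,000 live births)
                             'gdpPercap' = "NY.GDP.PCAP.KD",     # GDP per capita (constant US$)
                             'yrs_schooling' = 'BAR.SCHL.15UP'), # average years of schooling, age 15+
               start = 1960, end = 2020,
               extra = TRUE) %>%
  as_tibble()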
united_states_data = wdi_data %>%
filter(country == 'United States')
dim(united_states_data)
## [1] 61 13
Filtering the data to include just the United States data gives a total of 61 observations of 13 variables.
Plot of infant mortality rate vs time (in years)
ggplot(united_states_data, aes(x = year, y = inf_mort)) +
  geom_line() +
  labs(y = 'Infant Mortality Rate', x = 'Year') +
  ggtitle("Infant mortality rate across Years")
## Warning: Removed 1 row(s) containing missing values (geom_path).
The infant mortality rate shows a clear downward trend from 1960 to 2019.
Fitting a linear time trend to infant mortality rate
united_states_data %>%
ggplot()+
geom_line(aes(year,inf_mort))+
geom_smooth(aes(year,inf_mort),method='lm',color='red') +
theme_bw()+
xlab("Year")+
ylab("Infant Mortality")+
labs(title = "Infant Mortality in United States, 1960 - 2019")
## `geom_smooth()` using formula 'y ~ x'
The graph above indicates that a linear time trend fits the data relatively well. This can be substantiated by fitting a regression model of infant mortality rate as a function of time.
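One way to judge how well a straight line fits a declining series is to look at the residuals over time. The sketch below is illustrative only: it uses a synthetic, smoothly decaying series (not the World Bank data), and the variable names and decay parameters are assumptions chosen for the example. If the residuals are positive at both ends of the period and negative in the middle, the underlying trend is curved rather than strictly linear.

```r
# Synthetic, illustrative series only (NOT the World Bank data):
# a rate that decays smoothly over time.
year <- 1960:2019
rate <- 5 + 21 * exp(-0.045 * (year - 1960))

# Fit a straight-line time trend
fit <- lm(rate ~ year)
res <- resid(fit)

# Average residual by decade; a positive / negative / positive pattern
# across the decades signals curvature that the line does not capture
decade <- floor(year / 10) * 10
decade_means <- tapply(res, decade, mean)
print(round(decade_means, 2))
```

The same decade-by-decade check could be applied to `resid(mod)` once a model has been fitted to the real series.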
Checking for outliers in infant mortality rate
boxplot(united_states_data$inf_mort, main = "Checking for outliers in infant mortality rate in United States Data", ylab = "Infant Mortality Rate")
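The boxplot above shows no outliers in the infant mortality rate, meaning no yearly value lies more than 1.5 times the interquartile range beyond the quartiles. This indicates there are no unusual values in the variable. Note that the absence of outliers does not by itself show that the rate is normally distributed.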
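For reference, the rule that `boxplot()` uses to flag outliers can be reproduced by hand. The sketch below uses a small set of illustrative values (not the World Bank series) and base R only: `fivenum()` gives the hinges, and `boxplot.stats()` returns the points that would be drawn as outliers.

```r
# Illustrative values only (NOT the World Bank series)
x <- c(5.6, 6.1, 6.8, 7.4, 9.5, 12.0, 16.1, 20.3, 25.9)

# Hinges (lower and upper quartile as used by boxplot) and the 1.5 * IQR fences
hinges <- fivenum(x)[c(2, 4)]
step <- 1.5 * diff(hinges)
fences <- c(lower = hinges[1] - step, upper = hinges[2] + step)

# Points that boxplot() would draw individually as outliers
outliers <- boxplot.stats(x)$out

# Adding one extreme value shows what a flagged point looks like
x_extreme <- c(x, 60)
outliers_extreme <- boxplot.stats(x_extreme)$out

print(fences)
print(outliers)
print(outliers_extreme)
```

Here the first vector has no values outside the fences, while the added value of 60 is flagged.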
Summary of Infant Mortality Rate for United States
summary(united_states_data$inf_mort)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.600 6.775 9.550 11.930 16.100 25.900 1
Summary statistics for the United States infant mortality rate are shown above. Over 1960-2019, the rate ranged from a minimum of 5.6 to a maximum of 25.9 (the indicator is measured per 1,000 live births). The median (9.55) is lower than the mean (11.93), so the distribution is right-skewed: more than half of the yearly values lie below the mean, and a smaller number of high values pull the mean upward.
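The relationship between the mean and the median can be checked directly. The following is a minimal sketch on illustrative values (not the World Bank series); it computes both statistics and the share of observations that fall below the mean.

```r
# Illustrative right-skewed values only (NOT the World Bank series)
x <- c(5.6, 6.1, 6.8, 7.4, 9.5, 12.0, 16.1, 20.3, 25.9)

x_mean <- mean(x)
x_median <- median(x)

# Share of observations strictly below the mean;
# a value above 0.5 is consistent with a right-skewed distribution
share_below_mean <- mean(x < x_mean)

print(c(mean = x_mean, median = x_median, share_below_mean = share_below_mean))
```

On the real data, replacing `x` with `na.omit(united_states_data$inf_mort)` would give the corresponding figures.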
Regression Model
mod = lm(inf_mort~year,data=united_states_data)
summary(mod)
##
## Call:
## lm(formula = inf_mort ~ year, data = united_states_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9079 -2.2039 -0.2998 2.0452 3.8974
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 691.23452 33.75623 20.48 <2e-16 ***
## year -0.34144 0.01697 -20.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.276 on 58 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8747, Adjusted R-squared: 0.8726
## F-statistic: 405 on 1 and 58 DF, p-value: < 2.2e-16
From the regression summary, each additional year is associated with a decline of about 0.34 in the infant mortality rate. The p-value for the year coefficient is far below 0.05, so the trend is statistically significant. An R-squared of 0.87 means the linear time trend explains about 87% of the variation in the infant mortality rate, a relatively good fit.
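To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the rounded estimates from the `summary(mod)` output above, so results are approximate; the helper name `predict_rate` is hypothetical. It also checks that the reported R-squared and t value are consistent with the reported F-statistic, using the standard single-predictor identities R² = F / (F + residual df) and t² = F.

```r
# Rounded estimates copied from the summary(mod) output
intercept <- 691.23452
slope <- -0.34144

# Hypothetical helper: fitted infant mortality rate for a given year
predict_rate <- function(year) intercept + slope * year

fitted_1960 <- predict_rate(1960)
fitted_2019 <- predict_rate(2019)
change_per_decade <- slope * 10

# Consistency checks for a one-predictor regression
f_stat <- 405
df_resid <- 58
r_squared_from_f <- f_stat / (f_stat + df_resid)
t_squared <- (-20.12)^2

print(c(fitted_1960 = fitted_1960, fitted_2019 = fitted_2019,
        change_per_decade = change_per_decade))
print(c(r_squared_from_f = r_squared_from_f, t_squared = t_squared))
```

With these rounded coefficients the fitted values are roughly 22.0 for 1960 and 1.9 for 2019. The latter is well below the observed minimum of 5.6, which suggests the straight line undershoots the most recent years even though the overall fit is good.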