Final Research Project

#Abstract:

Given the alarming global child mortality rate, this project focused on two main objectives: 1) identifying countries with the highest and lowest child mortality rates and 2) developing a precise predictive model to understand the factors influencing child mortality and identify variables that can help reduce this rate. The methodology encompassed essential steps, including data collection, preprocessing, analysis, modeling techniques such as random forest, visualizations, and regression analysis. Child mortality variables were carefully constructed, and the top 10 countries with the highest child mortality rates were identified. Extensive data standardization and variable selection were conducted to ensure accurate analysis. The datasets were utilized to identify influential predictors and build a model, considering both raw and transformed data with a focus on optimizing the coefficient of determination (R-squared). Multiple linear regression and regularization models were explored, and the best regression model was chosen based on R-squared and root mean square error (RMSE) values. Notably, the analysis highlighted that a Random Forest model, applied to the transformed data, demonstrated the strongest results, suggesting its suitability among the considered models. Ultimately, the project aimed to develop a reliable predictive model for child mortality and provide recommendations tailored to specific countries to improve their child mortality rates.

#Key words:

Child mortality, leading causes of death, global health, data analysis, mortality trends.

#Problem Statement:

Child mortality is a pressing global public health concern, with millions of children losing their lives each year due to preventable causes. This study seeks to tackle this issue by analyzing relevant datasets to uncover patterns and trends in child mortality rates. The insights gained from this analysis can play a crucial role in guiding policymakers and healthcare professionals in formulating effective strategies and interventions to combat child mortality and improve child health outcomes on a global scale. The project’s main objective is to address the problem of high child mortality rates worldwide. To achieve this, the project aims to answer two pivotal questions. Firstly, it seeks to identify the countries with the highest and lowest child mortality rates, providing valuable insights into the global distribution of child mortality. Secondly, the project aims to develop an accurate predictive model for child mortality and identify the key variables that contribute to reducing the child mortality rate. By understanding the factors that impact child mortality and developing a reliable predictive model, the project aims to pave the way for evidence-based interventions and strategies that can effectively mitigate child mortality rates and improve child well-being.

#Literature review:

In the field of child mortality research, several studies have addressed similar problems and achieved significant findings. Bizzego et al. (2021) used machine learning to identify distal causes of child mortality in low- and middle-income countries (LMICs) and emphasized the importance of preventive interventions. Burstein et al. (2019) focused on quantifying subnational variations in child mortality rates and identifying high-mortality clusters and geographical inequalities to guide interventions. Ezbakhe and Pérez-Foguet (2020) proposed a compositional data analysis approach to estimate levels and trends in child mortality, highlighting the need for complete data and new estimation methods. Jaadla et al. (2020) explored the relationship between socioeconomic status and mortality in early 19th-century England, finding small and age-specific differentials. Jin et al. (2018) analyzed cause-specific child mortality reductions in different countries to inform policy strategies. Pritchard et al. (2019) compared child abuse-related deaths and child mortality rates in the USA and other developed nations, highlighting the impact of income inequality. Sommer (2020) examined the relationship between corruption, health expenditure, and child mortality in LMICs, emphasizing the need to reduce corruption for improved development effectiveness. Ummalla et al. (2022) investigated the effect of sanitation and safe drinking water on child mortality and life expectancy, emphasizing the importance of access to these facilities. Yasir (2019) discussed the potential impact of women’s education on child care and mortality, highlighting the need for empowerment and cultural awareness. Young and Duke (2020) conducted a systematic review on the implementation of child mortality reviews in LMICs, identifying facilitators and barriers. These studies have significantly contributed to understanding child mortality and identifying associated factors. 
They have used various methodologies such as machine learning, geostatistical models, compositional data analysis, and panel econometric techniques. While each study has its advantages and drawbacks, collectively, they provide insights into the causes, trends, and interventions related to child mortality. My project focuses on developing an accurate predictive model for child mortality and identifying variables that can contribute to reducing the child mortality rate. The key objective is to develop a robust and reliable predictive model that can accurately identify the factors contributing to child mortality and provide insights for policymakers and interventions aimed at reducing the child mortality rate.

#Methodology:

The methodology implemented in this project encompassed several crucial steps to achieve the research objectives. These steps are outlined below.

Data collection, preprocessing & cleaning

Comprehensive data was collected on child mortality rates, income levels of countries, and other pertinent variables relevant to the project. The collected data underwent meticulous preprocessing and cleaning procedures to ensure its consistency and accuracy. This involved addressing missing or duplicate data, standardizing variables, and formatting the data appropriately for analysis.

Data analysis, modeling & visualization

Statistical methods were employed to analyze the relationships between variables. Descriptive and inferential analysis techniques were applied, including the calculation of summary statistics, the utilization of modeling techniques like random forest, and the creation of visualizations. Visual representations, such as graphs, were generated to gain a better understanding of the patterns and trends in child mortality rates.

Interpretation of results

The findings obtained from the data analysis and visualization were interpreted to derive meaningful conclusions. This involved examining the variations in child mortality rates across different countries and races, as well as investigating the influence of income levels on child mortality. Through this interpretation, valuable insights were gained into the key factors affecting child mortality rates, leading to the development of conclusions for constructing the most effective predictive model based on the analyzed data.

Throughout the project, meticulous attention was given to feature engineering and the implementation of predictive modeling techniques, particularly random forest. These measures were undertaken to ensure the accuracy and reliability of the predictive model. By adhering to this methodology, the project aimed to successfully develop a precise predictive model for child mortality while identifying the variables that hold the potential for reducing the child mortality rate.

###Libraries

The analysis attaches the following packages (versions as reported at load time, where shown): tidyverse 2.0.0 (dplyr 1.1.1, forcats 1.0.0, ggplot2 3.4.2, lubridate 1.9.2, purrr 1.0.1, readr 2.1.4, stringr 1.5.0, tibble 3.2.1, tidyr 1.3.0), RCurl, rvest, kableExtra, BBmisc, NLP, gsubfn, proto, RSQLite, corrplot 0.92, MASS, lattice, caret, Matrix, glmnet 4.1-7, xgboost, randomForest 4.7-1.1, plyr, data.table, rtweet, plotly, maps, cowplot, ggthemes, scales, magrittr, here, reshape2, ModelMetrics, e1071, Hmisc, and pROC. here() reports the project root as C:/Users/Ivant/Desktop.

Loading the packages in this order produces a long list of masking messages. The ones most relevant to the code in this report are:

## dplyr::filter() and dplyr::lag() mask stats::filter() and stats::lag(); plotly later masks filter from stats again
## MASS masks select from dplyr; plotly then masks select from MASS
## xgboost masks slice from dplyr; plotly then masks slice from xgboost
## plyr masks arrange, count, desc, failwith, id, mutate, rename, summarise, summarize from dplyr
## plotly masks arrange, mutate, rename, summarise from plyr
## Hmisc masks src and summarize from dplyr, and is.discrete and summarize from plyr
## data.table masks between, first, last from dplyr and the date helpers (hour, month, year, ...) from lubridate
## ModelMetrics masks confusionMatrix, precision, recall, sensitivity, specificity from caret
## pROC masks auc from ModelMetrics and cov, smooth, var from stats
## maps masks map from purrr; magrittr masks set_names from purrr and extract from tidyr

R also warns that plyr was loaded after dplyr, which "is likely to cause problems", and recommends loading plyr first: library(plyr); library(dplyr).

#Data collection and preprocessing

In order to assess and compare the performance of different models, I utilized a function that calculated evaluation metrics for a given model. These metrics played a crucial role in determining the most appropriate model for further analysis or deployment. The function, shown in the provided code snippet, calculated several evaluation metrics, including the adjusted R-squared and root mean square error (RMSE). The adjusted R-squared value provided a measure of the model’s goodness of fit, considering the number of predictors used. The RMSE quantified the average difference between the predicted and actual values, providing an indication of the model’s accuracy. By employing these evaluation metrics, I was able to objectively compare the models and make informed decisions regarding their suitability for the specific analysis or deployment requirements.
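The evaluation function itself is not reproduced in this report, so the following is only a minimal sketch of what such a helper might look like. The name `eval_metrics`, its arguments, and the use of the built-in `mtcars` data are assumptions made for illustration; they are not the project's code or data.

```r
# Hypothetical helper (not the project's original function):
# returns adjusted R-squared and RMSE for a vector of predictions.
# `n_predictors` is the number of predictors in the model.
eval_metrics <- function(actual, predicted, n_predictors) {
  n <- length(actual)
  sse <- sum((actual - predicted)^2)      # residual sum of squares
  sst <- sum((actual - mean(actual))^2)   # total sum of squares
  r2 <- 1 - sse / sst
  adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
  rmse <- sqrt(sse / n)
  c(adj_r2 = adj_r2, rmse = rmse)
}

# Toy usage on built-in data, only to show the call pattern
fit <- lm(mpg ~ wt + hp, data = mtcars)
metrics <- eval_metrics(mtcars$mpg, predict(fit, mtcars), n_predictors = 2)
print(round(metrics, 4))
```

For a linear model with an intercept, the adjusted R-squared computed this way should agree with the value that summary() reports, which gives a quick sanity check for the helper.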

The operations below load the dataset, convert it into a more convenient tibble format, select relevant columns, and handle missing data. The resulting life_table tibble is ready for further analysis, such as visualization, modeling, or exploratory data analysis.

The following code performs data loading and preprocessing operations on the next dataset:

The code below loads the dataset, converts it into a tibble, selects relevant columns, and handles missing data.

The following code aids in understanding the child mortality rate data at the country level.

The summary statistics above offer an overview of the child mortality data at the country level, giving insights into the distribution and range of values.

The code L1 <- life_table$X2000.5 extracts the values of the variable X2000.5 from the life_table tibble and assigns them to a new vector L1. This variable represents children mortality. Each element of the L1 vector corresponds to the mortality value for a specific country.

The normalized values represent the relative positions of the original values within the range of the variable.
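A normalization that maps each value to its relative position within the variable's range is usually min-max scaling. The sketch below assumes that this is the method used; the helper name `min_max` is hypothetical, and the input vector simply reuses the minimum, median, mean, and maximum quoted in the summary statistics of this section as example values.

```r
# Hypothetical helper: min-max scaling to the [0, 1] interval.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)           # c(min, max), ignoring NA
  (x - rng[1]) / (rng[2] - rng[1])
}

# Example values taken from the summary statistics (min, median, mean, max)
rates <- c(0.210, 4.810, 7.564, 44.190)
n_rates <- min_max(rates)
print(round(n_rates, 4))
```

The smallest value maps to 0 and the largest to 1, and the ordering of the observations is preserved, so rankings based on the normalized values match rankings based on the raw values.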

The following histograms provide a visual representation of the distribution of deaths at year 2000.5, both in terms of raw values and normalized values. They can help us identify any patterns or trends in the data and gain insights into the age groups with higher or lower mortality rates. These summary statistics provide a general understanding of the mortality rates of children under 5 years old across different countries and years. The minimum value for the mortality rate is 0.210, indicating some countries had very low child mortality rates. The maximum value for the mortality rate is 44.190, indicating some countries had relatively high child mortality rates. The median mortality rate is 4.810, which can be considered as a measure of central tendency for the data. The mean mortality rate is 7.564, which is well above the median, suggesting a right-skewed distribution of mortality rates.

The variable in my next code represents the mortality rate per 1,000 live births for children under the age of 5. The extracted values are stored in the variable L2 for further analysis or manipulation.

Normalizing the data is useful for comparing variables with different scales and ensuring they are on a consistent scale for further analysis.

These plots below allow for visual comparison between the original and normalized data distributions, providing insights into how the data has been transformed through normalization. I used the code below to perform data transformations and normalization on the variables L1 and L2, and create new variables rate, nrate, o_rate, and n_rate.

The following statistics provide information about the distribution of the normalized rate metric. The summary statistics suggest that the distribution of the normalized rate metric is right-skewed, as the mean is greater than the median.

The following code is used to identify the top 12 countries with the highest mortality rate based on the normalized rate metric (n_rate). Note that the output below contains 15 rows: some countries appear for more than one year, and regional aggregates such as "IDA only" are included alongside individual countries.

## [1] 15

Entity	Code	Mortality.rate..under.5..per.1.000.live.births.
Niger	NER	1.0000000
Niger	NER	0.9346879
Niger	NER	0.8742180
Burundi	BDI	0.8379085
Papua New Guinea	PNG	0.7914707
IDA only		0.7914659
Sub-Saharan Africa (excluding high income)		0.7847013
Comoros	COM	0.7761873
Burundi	BDI	0.7667182
Trinidad and Tobago	TTO	0.7322477
Dominican Republic	DOM	0.7281330
Rwanda	RWA	0.7276921
IDA only		0.7173713
Papua New Guinea	PNG	0.7141083
Sub-Saharan Africa (excluding high income)		0.7106564
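Because the listing above mixes repeated countries with regional aggregates, a ranking of distinct countries needs one extra step. The sketch below shows one possible way to do it, assuming that aggregates can be recognized by an empty Code field (as they appear in the table) and that each country's highest value is the one to keep. The data frame `top` and the column name `n_rate` are illustrative; the rows are copied from the first six lines of the table.

```r
# Illustrative rows copied from the top of the table above
top <- data.frame(
  Entity = c("Niger", "Niger", "Niger", "Burundi", "Papua New Guinea", "IDA only"),
  Code   = c("NER", "NER", "NER", "BDI", "PNG", ""),
  n_rate = c(1.0000000, 0.9346879, 0.8742180, 0.8379085, 0.7914707, 0.7914659),
  stringsAsFactors = FALSE
)

countries <- top[top$Code != "", ]                       # drop aggregates (no country code)
countries <- countries[order(-countries$n_rate), ]       # highest rate first
countries <- countries[!duplicated(countries$Entity), ]  # keep each country's peak row
print(countries)
```

Taking the first N rows of `countries` then yields N distinct countries rather than N country-year records.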

Renaming the columns can provide more descriptive and meaningful names that accurately represent the data they contain.

The next code loads and examines the NCHS dataset, which contains information about infant mortality rates by race in the United States over a specific time period.

The following code loads and examines the CMbyIncome dataset, which contains information about child mortality rates by income level of countries.

The columns of the CMbyIncome dataset are then renamed to more meaningful names for further analysis or interpretation.

The following code loads and examines the data from the “Children-woman-death-vs-survival.csv” file.

The code below assigns more descriptive column names that better represent the data in each column.

By merging the data frames, we can analyze and explore the relationship between child mortality rates, income levels of countries, and the number of children who died and survived.
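The merge call is not shown here, so the following is a toy illustration rather than the project's code. It contrasts merging on the country alone with merging on country and year. The presence of a Year.x column and the repeated Afghanistan 1971 rows in the merged output would be consistent with a merge keyed on the country only, but that is an inference, not something the report states. All data values below are made up.

```r
# Toy data frames (invented values, for illustration only)
rates <- data.frame(Country = c("A", "A", "B"),
                    Year = c(1971, 1972, 1971),
                    Mortality.rate = c(29.5, 29.0, 3.2))
births <- data.frame(Country = c("A", "A", "B"),
                     Year = c(1971, 1972, 1971),
                     Died_Value = c(1.8, 1.7, 0.1),
                     Survived_Value = c(5.7, 5.8, 2.1))

# Keyed on Country only: every year of A in `rates` pairs with every year of A in `births`
by_country <- merge(rates, births, by = "Country")

# Keyed on Country and Year: one row per country-year
by_both <- merge(rates, births, by = c("Country", "Year"))

print(names(by_country))
print(nrow(by_country))
print(nrow(by_both))
```

If the intent is one record per country and year, including the year in the merge key keeps the row count in line with the inputs and avoids duplicated observations leaking into both the training and testing sets.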

The following code helps to understand the overall structure of the merged data frame.

## Rows: 690,507
## Columns: 5
## $ Country        <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
## $ Year.x         <int> 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1…
## $ Mortality.rate <dbl> 29.53, 29.53, 29.53, 29.53, 29.53, 29.53, 29.53, 29.53,…
## $ Died_Value     <dbl> 1.774590, 1.873675, 1.823760, 2.806421, 2.755433, 2.709…
## $ Survived_Value <dbl> 5.675410, 5.576325, 5.626240, 4.673579, 4.704567, 4.740…

##    Country              Year.x     Mortality.rate     Died_Value    
##  Length:690507      Min.   :1960   Min.   : 0.210   Min.   :0.0030  
##  Class :character   1st Qu.:1976   1st Qu.: 1.250   1st Qu.:0.0726  
##  Mode  :character   Median :1990   Median : 3.200   Median :0.3627  
##                     Mean   :1990   Mean   : 6.674   Mean   :0.6143  
##                     3rd Qu.:2004   3rd Qu.: 9.790   3rd Qu.:1.0315  
##                     Max.   :2017   Max.   :44.190   Max.   :3.4309  
##                     NA's   :112    NA's   :112      NA's   :1870    
##  Survived_Value 
##  Min.   :1.028  
##  1st Qu.:2.327  
##  Median :3.356  
##  Mean   :3.543  
##  3rd Qu.:4.761  
##  Max.   :7.689  
##  NA's   :1870

The below code is used to remove missing values and select the numeric variables from the merged data frame for further analysis.

The code below selects the first six columns of the merged_df dataset and assigns them to a new dataframe called df_num. It then displays the dimensions of df_num (number of rows and columns) and shows the first few rows of the dataframe.

The correlation matrix plot below helps to identify potential associations and dependencies between variables, which can be useful for exploring patterns and making data-driven decisions.

The first code block below is useful for understanding the extent of missing data in each variable of the df_num data frame. It helps identify variables that have a high percentage of missing values, which can be important for data cleaning and imputation processes.
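A per-variable missing-data summary can be computed in one line of base R. The sketch below uses a small invented data frame in place of df_num, so the column names and values are placeholders.

```r
# Placeholder data frame standing in for df_num
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(NA, NA, 1, 2),
                 c = 1:4)

# Percentage of missing values in each column
missing_pct <- colMeans(is.na(df)) * 100
print(missing_pct)

# Rows with no missing values in any column
complete <- df[complete.cases(df), ]
print(nrow(complete))
```

Columns with a high percentage are candidates for imputation or removal, while complete.cases() gives the row-wise filter used when missing values are simply dropped.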

The second code block is used for feature selection. By applying the Boruta algorithm, it identifies the significant variables that have a strong relationship with the Mortality.rate variable. This information can be valuable for building predictive models or conducting further analysis, as it focuses on the most relevant variables and reduces the dimensionality of the dataset.


#Data analysis, modeling & visualization

The plot below provides a visual representation of the importance of each variable in relation to the target variable. It helps to identify which variables are deemed significant (confirmed) or potentially significant (tentative) by the Boruta algorithm.

The next code excludes columns that have a weak correlation with the target or are affected by multicollinearity.

The following code visualizes the distribution of each variable.

The purpose of the next code is to have a portion of the data (the training set) for building and training a model, and another portion (the testing set) for evaluating the performance of the trained model.

## [1] 516393      5

## [1] 172132      5
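The dimensions reported above (516,393 training rows and 172,132 testing rows) are consistent with roughly a 75/25 split. The splitting code is not shown, so the sketch below is one plain base-R way to produce such a split on placeholder data; the seed, the proportion, and the data frame are assumptions.

```r
set.seed(123)  # arbitrary seed, for reproducibility of the sketch

# Placeholder data standing in for the modeling data frame
df <- data.frame(id = 1:100, y = rnorm(100))

# Sample 75% of the row indices for training; the rest form the testing set
train_idx <- sample(seq_len(nrow(df)), size = floor(0.75 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

print(dim(train))
print(dim(test))
```

A purely random split like this one treats every row as independent. When the same country-year combination appears in many rows, near-identical records can land in both sets, which tends to make test metrics look closer to training metrics than they would on genuinely new data.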

The following code performs a linear regression analysis (lm) on the training dataset (train) to explore the relationship between the response variable (Mortality.rate) and the predictor variables in the dataset. The summary function is then used to obtain statistical summary measures and assess the significance of the predictors in the model. This analysis helps understand how the predictor variables influence the mortality rate and provides insights into the strength and direction of their associations.

The normalization step below ensures that variables are on a comparable scale and can be compared and analyzed together without being dominated by variables with larger magnitudes.

The next step ensures consistent feature engineering on both training and testing data without duplicating the feature engineering steps.

The code below splits the final_df dataset into training and testing datasets, and then fits a linear regression model (model_2) using the training dataset to predict the Mortality.rate variable.

The next code fits a linear regression model (model_3) to predict the Mortality.rate variable using the variables Country, Year.x, Died_Value, and Survived_Value from the train2 dataset, and provides a summary of the model’s results.

The following code evaluates the performance of the model_3 by predicting the Mortality.rate on both the training and testing datasets and calculating evaluation metrics.

## [1] "0.8619"
## [1] "2.7996"

## [1] "0.8619"
## [1] "2.8058"

The following training dataset is used to train the model, while the testing dataset is used to evaluate the performance of the trained model.

## [1] 516393      5

## [1] 172132      5

The next code fits a linear regression model (raw_lm) to predict the Mortality.rate using all the predictor variables in the train_raw dataset. The summary() function provides a summary of the model, including coefficient estimates, p-values, and goodness-of-fit measures.

The code below evaluates the performance of the raw_lm model by calculating prediction metrics on both the training and testing data.

## [1] "0.8618"
## [1] "2.8004"

## [1] "0.8618"
## [1] "2.8035"

The code below trains the aic_raw_lm model using the specified formula and the training data, and provides a summary of the model’s results.
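How the formula for aic_raw_lm was chosen is not shown. One common route to an AIC-selected linear model is stepwise selection with step() from base R (MASS, which is loaded, offers the similar stepAIC()). The sketch below assumes that kind of workflow and uses the built-in mtcars data purely as a stand-in.

```r
# Full model on stand-in data (not the project's data)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Stepwise search in both directions, guided by AIC; trace = 0 silences the log
aic_fit <- step(full, direction = "both", trace = 0)

print(formula(aic_fit))
print(AIC(full))
print(AIC(aic_fit))
```

The identical metrics reported for the full and AIC-selected models in this section would be expected if the stepwise search kept every predictor, in which case the two models coincide.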

The next code predicts the target variable using the aic_raw_lm model on both the training and testing data, and evaluates the model’s performance using evaluation metrics specific to the ‘Mortality.rate’ variable.

## [1] "0.8618"
## [1] "2.8004"

## [1] "0.8618"
## [1] "2.8035"

The following code specifies the column names for a data frame called cr_raw. The column names are “Mortality.rate”, “Country”, “Year.x”, “Died_Value”, and “Survived_Value”.

The next code creates numeric model matrices (x and x_test) and target vectors (y_train and y_test) for the training and testing data, respectively. Finally, a sequence of lambda values for regularization is defined in the lambdas variable.
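The step described above can be sketched with base R's model.matrix(), which expands factor columns into dummy variables. The toy data frame below is invented. The lambda grid 10^seq(2, -3, by = -0.1) is an assumption: it was chosen for this sketch because it has 51 values ending at 0.001, which matches the length-51 lambda vector and the 0.001 value that appear in the outputs of this section, but the actual grid is not shown.

```r
# Toy training data (invented values)
train <- data.frame(
  Mortality.rate = c(29.5, 3.2, 10.1, 4.4),
  Country = factor(c("A", "B", "C", "A")),
  Year.x = c(1971, 1971, 1980, 1990)
)

# Numeric design matrix; the intercept column is dropped because glmnet adds its own
x <- model.matrix(Mortality.rate ~ ., data = train)[, -1]
y_train <- train$Mortality.rate

# Assumed grid of regularization strengths, from 100 down to 0.001
lambdas <- 10^seq(2, -3, by = -0.1)

print(dim(x))
print(length(lambdas))
print(range(lambdas))
```

In the outputs reported in this section the selected lambda is 0.001, which would be the smallest value on a grid like this one. When the optimum sits on the boundary of the grid, extending the grid toward smaller values is usually worth checking.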

## [1] 516393    182

## [1] 172132    182

The next code trains a Ridge regression model (raw_ridge_reg) using the glmnet function and computes the optimal lambda value (ol) through cross-validation using the cv.glmnet function.

## [1] 0.001

The eval_results function is useful for evaluating the performance of a predictive model by computing metrics such as RMSE and R-squared. These metrics provide insights into how well the model fits the data and how accurately it predicts the target variable. By calculating these metrics, we can assess the quality of the model and compare different models to determine which one performs better.

The following code is used to predict the target variable (Mortality.rate) using the trained raw_ridge_reg model on the training data. The eval_results function is then applied to evaluate the performance of the model by computing metrics such as RMSE and R-squared between the predicted values and the actual values of the target variable on the training data.

Ridge Regression is used to predict the target variable (Mortality.rate) using the trained raw_ridge_reg model on the testing data. The eval_results function is then applied to evaluate the performance of the model by computing metrics such as RMSE and R-squared between the predicted values and the actual values of the target variable on the testing data. Additionally, the code plots the lambda values used in ridge regression.

The below code is used to specify the column names for further analysis. The cols_reg vector contains the column names “Mortality.rate”, “Country”, “Code.x”, “Year.x”, “Died_Value”, and “Survived_Value”.

Dummy variables represent categorical data as binary variables, which can then be used as input in various statistical models.

## [1] 516393    182

## [1] 172132    182

The code below converts the dummy variables and the target variable into numeric matrices.

The following code is training a ridge regression model to predict the mortality rate based on the provided input features. Ridge regression is a regularization technique that helps prevent overfitting by adding a penalty term to the model’s objective function. By training the ridge regression model, we can obtain coefficient estimates and assess their significance in predicting the mortality rate.
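The penalty term mentioned above can be written out explicitly. For a Gaussian response, glmnet minimizes, to my understanding of its documentation, the elastic-net objective below, where $N$ is the number of observations, $\lambda$ is the regularization parameter, and $\alpha$ selects the penalty type:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^{2}
\;+\;\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
\;+\;\alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting $\alpha = 0$ leaves only the squared $\ell_2$ penalty (ridge regression), which shrinks coefficients toward zero without removing them. Setting $\alpha = 1$ leaves only the $\ell_1$ penalty (lasso regression), which can set some coefficients exactly to zero. As $\lambda$ approaches zero, both converge to the ordinary least-squares fit, which may explain why models fitted with a very small lambda such as 0.001 report metrics nearly identical to those of the unpenalized linear models.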

##           Length Class     Mode   
## a0          51   -none-    numeric
## beta      9282   dgCMatrix S4     
## df          51   -none-    numeric
## dim          2   -none-    numeric
## lambda      51   -none-    numeric
## dev.ratio   51   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         7   -none-    call   
## nobs         1   -none-    numeric

The following code is performing cross-validation to determine the optimal lambda (regularization parameter) for the ridge regression model. The cv.glmnet function is used to evaluate the model’s performance for different lambda values. The lambda with the lowest cross-validated error (estimated by mean squared error) is selected as the optimal lambda (ol2). This helps in tuning the model and finding the best regularization parameter for improved predictive performance.

## [1] 0.001

This code below helps assess the performance of the ridge regression model on the training data.

The following code is evaluating the performance of the ridge_reg model by predicting the target variable on the testing data and calculating the evaluation metrics, such as RMSE and R^2, to assess the model’s accuracy.

The following code is performing cross-validation using lasso regression (alpha = 1) on the training data to select the optimal lambda value from a range of lambdas (lambdas) and using 5-fold cross-validation (nfolds = 5).

The next code is computing the optimal lambda value (lambda_best) based on the lasso regression cross-validation results (lasso_reg$lambda.min). Then, it trains a lasso regression model (raw_lasso) using the optimal lambda on the training data. It predicts the target variable on both the training and testing data, and evaluates the model performance using the eval_results function. The code also generates lambda plots for visualizing the effect of lambda on variable selection.

## [1] 0.001

Lasso Regression

Lasso regression is a regularization technique used to perform feature selection and prevent overfitting in a regression model. By applying lasso regression, we can identify the most important features that contribute to predicting the target variable. The code calculates the optimal lambda (regularization parameter) using cross-validation and trains a lasso regression model based on that lambda. It then evaluates the performance of the model on both the training and testing data to assess its predictive capability.

## [1] 0.001

Random Forest Model

## [1] "RMSE: 0.413999940880968"

## [1] "R2_train: 0.996927644557393"

## [1] "RMSE: 2.84122812529806"

## [1] "R2_test: 0.859992910394767"

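The training RMSE (0.414) and training R-squared (0.997) indicate that the model’s predictions are very close to the actual values on the training data. On the testing data the RMSE rises to 2.841 and the R-squared falls to 0.860, so the model still explains a large portion of the variance in the target variable, but the gap between training and testing performance suggests some overfitting.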
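A train-versus-test comparison of this kind can be sketched as below. The data frame, its columns (group, year, rate), and the helper functions r2 and rmse are hypothetical and generated only for illustration; the 80/20 split and ntree = 100 follow the appendix.

```r
library(randomForest)

set.seed(123)

# Synthetic data (assumption): a categorical and a numeric predictor
n <- 500
df <- data.frame(
  group = factor(sample(letters[1:5], n, replace = TRUE)),
  year  = sample(1990:2020, n, replace = TRUE)
)
df$rate <- 50 - 0.8 * (df$year - 1990) + 3 * as.integer(df$group) + rnorm(n, sd = 3)

# 80/20 train-test split
idx <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- df[idx, ]
test_data  <- df[-idx, ]

rf_model <- randomForest(rate ~ group + year, data = train_data, ntree = 100)

# Hypothetical helpers for the two metrics
r2   <- function(true, pred) 1 - sum((true - pred)^2) / sum((true - mean(true))^2)
rmse <- function(true, pred) sqrt(mean((true - pred)^2))

pred_train <- predict(rf_model, newdata = train_data)
pred_test  <- predict(rf_model, newdata = test_data)

train_metrics <- c(RMSE = rmse(train_data$rate, pred_train), R2 = r2(train_data$rate, pred_train))
test_metrics  <- c(RMSE = rmse(test_data$rate, pred_test),  R2 = r2(test_data$rate, pred_test))
print(train_metrics)
print(test_metrics)
```

Reporting both sets of metrics side by side makes any gap between training and testing performance visible.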

Model Selection

The following code uses the kbl function from the kableExtra package to build a table summarizing the results of the different models. It includes the model number, method used (linear, ridge, lasso, or random forest), data type (raw or transformed), number of variables (Var_Num), R-squared values for training and testing data, and RMSE (Root Mean Squared Error) values for training and testing data. The table provides a concise comparison of the performance metrics for each model.

## Warning in cbind(Model, Method, Data, Var_Num, R2_train, RMSE_train, R2_test, :
## number of rows of result is not a multiple of vector length (arg 3)

Regression Model Comparison
Model	Method	Data	Var_Num	R2_train	RMSE_train	R2_test	RMSE_test
1	Linear	raw	0.001	0.8619	2.7996	0.8619	2.8058
2	Linear	raw(AIC)	0.001	0.8618	2.8004	0.8618	2.8035
3	Linear	transformed	0.001	0.8618	2.8004	0.8618	2.8035
4	Ridge	raw	0.001	0.8619	2.8004	0.8622	2.8035
5	Ridge	transformed	0.001	0.862	2.7996	0.862	2.8058
6	Lasso	raw	0.001	0.8619	2.8005	0.8622	2.8036
7	Lasso	transformed	0.001	0.862	2.7996	0.8619	2.8059
8	Random Forest	raw	0.001	0.9969	0.4139	0.8599	2.8412
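The warning printed above the table comes from cbind receiving vectors of different lengths: the shorter vectors are recycled to fill the missing rows. The sketch below reproduces the effect with small illustrative vectors (Data_short is a hypothetical name) and shows that data.frame rejects the same input instead of recycling it.

```r
Model      <- c("1", "2", "3")
Method     <- c("Linear", "Ridge", "Random Forest")
Data_short <- c("raw", "transformed")   # one entry missing on purpose

# cbind() recycles the short vector and only issues a warning
recycled <- withCallingHandlers(
  cbind(Model, Method, Data_short),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(recycled)

# data.frame() stops with an error when the lengths differ
result <- tryCatch(
  data.frame(Model, Method, Data_short),
  error = function(e) "length mismatch"
)
print(result)
```

Supplying one entry per model in every vector, or building the table with data.frame, avoids silently recycled values in the last row.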

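In summary, the Random Forest model (Model 8) shows the highest R-squared value for the training data (0.9969) and the lowest training RMSE (0.4139). Its testing R-squared (0.8599) is close to, though slightly below, that of the linear, ridge, and lasso models (0.8618 to 0.8622), so it performs well in predicting the target variable and may be the most optimal model among the options presented, provided the gap between training and testing performance is kept in mind.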

The following code creates a map visualization using the plotly package. It plots the Mortality Rate (Mortality.rate) data from the CMbyIncome dataset on a world map. Each country is represented by a color based on its mortality rate, and the map is animated over different years (frame=~CMbyIncome$Year).
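An animated choropleth of this kind can be sketched with plotly as follows. The data frame map_df and its values are made up for illustration, and the sketch assumes that the Code column holds ISO-3 country codes.

```r
library(plotly)

# Synthetic data (assumption): ISO-3 codes, two years, illustrative rates
map_df <- data.frame(
  Code = rep(c("USA", "BRA", "IND", "NGA"), times = 2),
  Year = rep(c(2000, 2010), each = 4),
  Mortality.rate = c(8, 35, 92, 185, 7, 19, 58, 130)
)

# One animation frame per year; fill color follows the mortality rate
map_plot <- plot_geo(map_df, locationmode = "ISO-3") %>%
  add_trace(
    type = "choropleth",
    z = ~Mortality.rate,
    locations = ~Code,
    frame = ~Year,
    color = ~Mortality.rate
  )

# Print map_plot in an interactive session to render the map
```

Using formulas such as ~Code refers to columns of the data frame passed to plot_geo, which keeps the call independent of the data frame's name.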

The next code reads a CSV file from the given URL and assigns it to the variable “CMR”. The CSV file contains data related to child mortality rates.

The following code creates an interactive scatter plot using Plotly for the “CMR” dataset. The scatter plot displays the relationship between the “Mortality.Rate” and “Country” variables. Each data point represents a country, and the color of the marker corresponds to the value of the “Mortality.Rate”. The plot includes tooltips that show the country name, x-axis value (country), and y-axis value (mortality rate of years) when hovering over a data point.
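A minimal sketch of such a scatter plot is shown below. The data frame cmr_demo is a synthetic stand-in for the CMR dataset and assumes columns named Country and Mortality.Rate; the text argument supplies the value used by %{text} in the hover template.

```r
library(plotly)

# Synthetic stand-in for the CMR dataset (assumption)
cmr_demo <- data.frame(
  Country = c("A", "B", "C", "D"),
  Mortality.Rate = c(4.2, 18.5, 61.0, 97.3)
)

scatter_plot <- plot_ly(
  cmr_demo,
  x = ~Country, y = ~Mortality.Rate, color = ~Mortality.Rate,
  text = ~Country,                      # feeds %{text} in the hover template
  type = "scatter", mode = "markers",
  hovertemplate = paste(
    "Country: %{text}",
    "<br>Mortality rate: %{y:.2f}",
    "<extra></extra>"                   # hides the secondary trace-name box
  )
) %>%
  layout(
    title = "Mortality rate by country",
    xaxis = list(title = "Country"),
    yaxis = list(title = "Mortality rate"),
    hovermode = "closest"
  )

# Print scatter_plot in an interactive session to render the figure
```

The numeric format specifier is applied only to the y value, since the x axis is categorical.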

#Results and/or experimentation:

The visualizations provided offer valuable insights into child mortality rates across countries and income levels. They demonstrate that child mortality rates are generally higher in low-income countries and have declined over time across all income levels. Sub-Saharan Africa, South Asia, and parts of Oceania exhibit the highest child mortality rates, while Europe, North America, and parts of Asia have the lowest rates. The interactive scatter plots enable comparisons between countries and help identify high-risk regions that require focused attention and resources to address child mortality.

The histograms shed light on infant mortality rates in the United States. The Mortality Rate by Race histogram reveals greater variability among the black population compared to the white population. The Mortality Rate by Gender histogram indicates higher mortality rates for males, particularly with a significant number exceeding 10 deaths per 1,000 live births. The Mortality Rate by Country histogram demonstrates that most countries have rates between 0-5 deaths per 1,000 children aged 10 to 14, although some fall within the 10-15 deaths per 1,000 range. The Mortality Rate by Income Level histogram shows that the majority of countries exhibit rates between 0-10 deaths per 1,000 live births, while a few have rates between 40-60 deaths per 1,000 live births.

The visualization of infant mortality rates by race in the United States shows significant disparities between racial groups: the Black population has the highest infant mortality rate and the White population the lowest.

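Examining the model comparison table, we observe that the Random Forest model (Model 8) exhibits the highest R-squared value for the training data (0.9969), while its testing R-squared (0.8599) is close to that of the other models (0.8618 to 0.8622). This suggests it may be the most suitable model among the options considered, provided the gap between training and testing performance is taken into account.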

#Summary and future works:

The visualizations and analyses presented suggest that child mortality rates have generally decreased over time across all income levels, but there are still significant disparities between countries and regions, with Sub-Saharan Africa, South Asia, and parts of Oceania experiencing the highest rates of child mortality.

Areas for future work could include examining the root causes of high child mortality rates in specific countries and regions, exploring interventions that have been successful in reducing child mortality, and identifying factors that contribute to the disparities in mortality rates between different racial and gender groups. Additionally, further research could investigate the impact of social, economic, and political factors on child mortality rates and explore ways to address these underlying issues to reduce child mortality globally.

#Bibliography:

Data

Valcho Valev. (n.d.). Child Mortality by Income Level of Country. Kaggle. https://www.kaggle.com/datasets/valchovalev/childmortalitybyincomelevelofcountry

Prata, M. (n.d.). Mortality among Children Aged 05-14 (by country). Kaggle. https://www.kaggle.com/datasets/mpwolke/cusersmarildownloadsdeathscsv

Jha, A. (n.d.). Global Child Mortality Rate Child Mortality(1 year to 4 years). Kaggle. https://www.kaggle.com/datasets/drateendrajha/global-child-mortality-rate

Programmerrdai. (n.d.). Child and Infant Mortality Dataset. Kaggle. https://www.kaggle.com/datasets/programmerrdai/child-and-infant-mortality

Jamagaryan, N. (n.d.). Child Mortality by Age/Country/Region Identify mortality issues and best practices. Kaggle. https://www.kaggle.com/datasets/jamnik99/child-mortality-by-agecountryregion

Centers for Disease Control and Prevention. (n.d.). NCHS - Death rates and life expectancy at birth. Data.gov. https://catalog.data.gov/dataset/nchs-death-rates-and-life-expectancy-at-birth

Literature

Bizzego, G. G., Bornstein, M. H., Deater-Deckard, K., Lansford, J. E., Bradley, R. H., Costa, M., & Esposito, G. (2021). Predictors of Contemporary under-5 Child Mortality in Low- and Middle-Income Countries: A Machine Learning Approach. International Journal of Environmental Research and Public Health, 18(3), 1315–. https://doi.org/10.3390/ijerph18031315

Burstein, H. N. J., Collison, M. L., Marczak, L. B., Sligar, A., Watson, S., Khan, M., Listl, S., Murray, C. J. ., & Hay, S. I. (2019). Mapping 123 million neonatal, infant and child deaths between 2000 and 2017. Nature, 574(7778), 353–+. https://doi.org/10.1038/s41586-019-1545-0

Ezbakhe, F., & Pérez-Foguet, A. (2020). Child mortality levels and trends: A new compositional approach. Demographic Research, 43, 1263–1296. https://doi.org/10.4054/DemRes.2020.43.43

Jaadla, H., Potter, E., Keibek, S., & Davenport, R. (2020). Infant and child mortality by socio-economic status in early nineteenth-century England. The Economic History Review, 73(4), 991–1022. https://doi.org/10.1111/ehr.12971

Jin, M., Mankadi, P. M., Rigotti, J. I., & Cha, S. (2018). Cause-specific child mortality performance and contributions to all-cause child mortality, and number of child lives saved during the Millennium Development Goals era: A country-level analysis. Global Health Action, 11(1), 1546095–20. https://doi.org/10.1080/16549716.2018.1546095

Pritchard, R., Williams, R., & Rosenorn‐Lanng, E. (2019). Child Abuse-related Deaths, Child Mortality (0–4 Years) and Income Inequality in the USA and Other Developed Nations 1989–91 v 2013–15: Speaking Truth to Power. Child Abuse Review (Chichester, England : 1992), 28(5), 339–352. https://doi.org/10.1002/car.2599

Sommer, J. (2020). Corruption and Health expenditure: A Cross-National Analysis on Infant and Child Mortality. European Journal of Development Research, 32(3), 690–717. https://doi.org/10.1057/s41287-019-00235-1

Schmidt. (2019). Accumulating birth histories across surveys for improved estimates of child mortality. ICF.

Ummalla, M., Samal, A., Zakari, A., & Lingamurthy, S. (2022). The effect of sanitation and safe drinking water on child mortality and life expectancy: Evidence from a global sample of 100 countries. Australian Economic Papers, 61(4), 778–797. https://doi.org/10.1111/1467-8454.12265

Yasir, A. (2019). Effects of Women’s Education on Child Care and Child Mortality. Indian Journal of Public Health Research and Development, 10(1), 810. https://doi.org/10.5958/0976-5506.2019.00159.1

Young, & Duke, T. (2020). The process of implementing child mortality reviews in low‐ and middle‐income countries: a narrative systematic review. Tropical Medicine & International Health, 25(7), 764–773. https://doi.org/10.1111/tmi.13403

#Appendix with Code

All code

###Libraries
library(tidyverse)
library(dplyr)
library(readr)
library(ggplot2)
library(RCurl)
library(rvest)
library(stringr)
library(tidyr)
library(kableExtra)
library(BBmisc)
library(rmarkdown)
library(tm)
library(sqldf)
library(inspectdf)
library(corrplot)
library(MASS)
library(caret)
library(glmnet)
library(Matrix)
library(Boruta)
library(gsubfn)
library(proto)
library(RSQLite)
library(graphics)
library(stats)
library(xgboost)
library(randomForest)
library(plyr)
library(knitr)
library(tinytex)
library(latexpdf)
library(data.table)
library(purrr)
library(rtweet)
library(tidytext)
library(plotly)
library(maps)
library(cowplot)
library(lubridate)
library(ggthemes)
library(scales)
library(htmlwidgets)
library(magrittr)
library(here)
library(reshape2)
library(ggrepel)
library(ModelMetrics)
library(e1071)
library(FactoMineR)
library(VIFCP)
library(Hmisc)
library(pROC)
library(binr)
#Data collection and preprocessing

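#Evaluation metrics used in Model Building / Selection section:
#Prints the adjusted R-squared of a fitted lm and the RMSE of its predictions on df.
#Note: the adjusted R-squared is read from summary(model), so it always describes
#the fit on the training data, even when df holds the testing data.
eval_metrics = function(model, df, predictions, target){
    resids = df[,target] - predictions  #residuals: observed minus predicted
    resids2 = resids^2                  #squared residuals
    N = length(predictions)
    r2 = as.character(round(summary(model)$r.squared, 4)) #computed but not printed
    adj_r2 = as.character(round(summary(model)$adj.r.squared, 4))
    print(adj_r2) #Adjusted R-squared
    print(as.character(round(sqrt(sum(resids2)/N), 4))) #RMSE
}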

DN<- read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA698/main/deaths_number_10to14.csv")
DN_table <- as_tibble(DN)
DN_table <- DN_table %>% dplyr::select(1:2,4:16) %>% na.omit()
#head(DN)

CMbyIncome <- read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA698/main/child-mortality-by-income-level-of-country.csv")
CMbyIncome_table <- as_tibble(CMbyIncome)
CMbyIncome_table <- CMbyIncome_table %>% dplyr::select(1:2,3:4) %>% na.omit()
#head(CMbyIncome)

deaths <- read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA698/main/deaths%20(1).csv",  sep=";"  )
deaths_table <- as_tibble(deaths)
deaths_table <- deaths_table %>% dplyr::select(1:2,3:6) %>% na.omit()
#head(deaths)

#Explore child mortality rates data at a country level:
#glimpse(DN_table)
#summary(DN_table)

#Extract variables of interest
#L1 <- DN_table$`X2000.5`

#n_L1 = (L1-min(L1))/(max(L1)-min(L1))

#par(mfrow=c(2,2))
#hist(L1, breaks=10, xlab="deaths (years)", col="lightblue", main="deaths_number,2000.5")
#hist(n_L1, breaks=10, xlab="deaths (years)", col="lightblue", main="deaths_number,2000.5")

#Explore Income data at a country level:
#glimpse(CMbyIncome_table)
#summary(CMbyIncome_table)

#Extract variables of Income
#L2 <- CMbyIncome_table$`Mortality.rate..under.5..per.1.000.live.births.`

#Normalize
#n_L2 = (L2-min(L2))/(max(L2)-min(L2))

#Histogram of original vs. normalized data
#par(mfrow=c(2,2))
#hist(L2, breaks=10, xlab="Income rate ", col="lightblue", main="Mortality.rate..under.5..per.1.000.live.births.")
#hist(n_L2, breaks=10, xlab="Income rate ", col="lightblue", main="Mortality.rate..under.5..per.1.000.live.births.")

#Add original and normalized variables together
#rate <- L1 + L2
#nrate <- n_L1 + n_L2 


#original rate to 0-1 range
#o_rate = (rate-min(rate))/(max(rate)-min(rate))

#Normalize rate to 0-1 range
#n_rate = (nrate-min(nrate))/(max(nrate)-min(nrate))
#head(rate)
#head(n_rate)

# Histogram of original vs. normalized data
#hist(o_rate, breaks=10, xlab="original Score", col="lightblue", main="original rate metric")
#hist(n_rate, breaks=10, xlab="Normalized Score", col="lightblue", main="Normalized rate metric")
#summary(n_rate)

#create new df  | Country | Mortality.rate
#starter_df <- CMbyIncome_table %>% 
    #dplyr::select(1:2)

#starter_df$Mortality.rate..under.5..per.1.000.live.births. <- n_rate
#Mortality.rate <- filter(starter_df, `Mortality.rate..under.5..per.1.000.live.births.` > 0.72) #top 12
#Mortality.rate <- Mortality.rate[order(-Mortality.rate$`Mortality.rate..under.5..per.1.000.live.births.`),] 

#head(starter_df)
#nrow(Mortality.rate) #12
#Mortality.rate %>%
  #kbl() %>%
  #kable_minimal() %>%
  #kable_styling(latex_options = "hold_position")

#rename columns
#colnames(starter_df)[1]<-("Country")
#colnames(starter_df)[3]<-("Mortality.rate.under.5 ")
#head(starter_df)

NCHS <- read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA698/main/NCHS_-_Infant_Mortality_Rates__by_Race__United_States__1915-2013.csv")
NCHS <- as_tibble(NCHS)
#dim(NCHS)
#head(NCHS)

CMbyIncome <- read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA698/main/child-mortality-by-income-level-of-country.csv")
CMbyIncome <- as_tibble(CMbyIncome)
#dim(CMbyIncome)
#head(CMbyIncome)

#rename columns
#colnames(CMbyIncome)[1]<-("Country")
#colnames(CMbyIncome)[4]<-("Mortality.rate")
#head(CMbyIncome)

CWDvsS <- read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA698/main/Children-woman-death-vs-survival.csv")
CWDvsS <- as_tibble(CWDvsS)
#dim(CWDvsS)
#head(CWDvsS)

#rename columns
#colnames(CWDvsS)[1]<-("Country")
#colnames(CWDvsS)[4]<-("Died_Value")
#colnames(CWDvsS)[5]<-("Survived_Value")
#head(CWDvsS)

# Merge data frames based on common column "Country"
#merged_df <- merge(CMbyIncome, CWDvsS, by = "Country",   all = TRUE)
#Remove extra Code, Country columns
#merged_df <- subset(merged_df, select=-c(2,5,6))
#head(merged_df)
#dim(merged_df)

#glimpse(merged_df)
#summary(merged_df)

#drop NAs from consideration
#merged_df <- na.omit(merged_df)
#dim(merged_df)

#Select numeric variables
#df_num <- as.data.frame(merged_df[1:5])
#dim(df_num)
#head(df_num)

#options(scipen = 9)
#set.seed(123)

#plot_corr_matrix <- function(dataframe, significance_threshold){
  #title <- paste0('Correlation Matrix for significance > ',
                  #significance_threshold)
  
  #df_cor <- dataframe %>% mutate_if(is.character, as.factor)
  #df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  #corr <- cor(df_cor, use = 'na.or.complete')
  #corr[lower.tri(corr,diag=TRUE)] <- NA 
  #corr[corr == 1] <- NA 
  #corr <- as.data.frame(as.table(corr))
  #corr <- na.omit(corr) 
  #corr <- subset(corr, abs(Freq) > significance_threshold) 
  #corr <- corr[order(-abs(corr$Freq)),] 
  #print table
  # print(corr)
  #turn corr back into matrix in order to plot with corrplot
  #mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  
  #plot correlations visually
  #corrplot(mtx_corr,
           #title=title,
           #mar=c(0,0,1,0),
           #method='color', 
           #tl.col="black", 
           #na.label= " ",
           #addCoef.col = 'black',
           #number.cex = .9)
#}

#Utilize custom-built correlation matrix generation function
#plot_corr_matrix(df_num, 0.2)

#Compute proportion of missing data per variable
#v <- colnames(df_num)
#incomplete <- function(x) sum(!complete.cases(x)) / 3004
#Missing_Data <- sapply(df_num[v], incomplete) 
#head(Missing_Data) #verify

# Perform Boruta search
#boruta_output <- Boruta(Mortality.rate ~ ., data=na.omit(df_num), doTrace=0, maxRuns = 10000)

#Get significant variables including tentatives
#boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)

# Plot Mortality.rate
#plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Mortality.rate")

#Feature selection (address weak target correlation and multicollinearity)
#select_df <- df_num[, c(1,2,3:4,5)]
#head(select_df) #verify

# Create histograms for all variables
#hist(select_df)

#Train-test split data
#dt = sort(sample(nrow(select_df), nrow(select_df)*.75))
#train <- select_df[dt,]
#test <- select_df[-dt,]
#dim(train) 
#dim(test)

#model_1 <- lm(Mortality.rate ~., data = train)
#summary(model_1)

norm_minmax <- function(x) {
  if(is.numeric(x)) {
    (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
  } else {
    x
  }
}

#norm_df <- as.data.frame(lapply(train, norm_minmax))
#head(norm_df)

#cooksD <- cooks.distance(model_1)
#influential <- as.numeric(names(cooksD)[(cooksD > (10 * mean(cooksD, na.rm = TRUE)))])
#train[influential,] #verify outliers - 10 rows
#out_df <- train[-influential,]

#combine datasets so we don't have to make features twice
#train$dataset <- 'train'
#test$dataset <- 'test'
#final_df <- rbind(train, test)

#train2 <- final_df %>% filter(dataset == 'train') %>% dplyr::select(-dataset)
#test2 <- final_df %>% filter(dataset == 'test') %>% dplyr::select(-dataset)
#dat <- final_df %>% dplyr::select(-dataset)
#model_2 <- lm(Mortality.rate ~., data = train2)
#summary(model_2)

#model_3 <- lm(formula = Mortality.rate ~ Country + Year.x + 
    #Died_Value + Survived_Value , 
    #data = train2)
#summary(model_3)

## Predict and evaluate raw_lm model on training data
#predictions = predict(model_3, newdata = train2)
#eval_metrics(model_3, train2, predictions, target = 'Mortality.rate')

# Predict and evaluate raw_lm model on testing data
#predictions = predict(model_3, newdata = test2)
#eval_metrics(model_3, test2, predictions, target = 'Mortality.rate')

#Train-test split data
#dt2 = sort(sample(nrow(df_num), nrow(df_num)*.75))
#train_raw <- df_num[dt2,]
#test_raw <- df_num[-dt2,] 
#dim(train_raw) 
#dim(test_raw)

#Train raw_lm model:
#raw_lm <- lm(Mortality.rate ~., data = train_raw)
#summary(raw_lm)

# Predict and evaluate raw_lm model on training data
#predictions = predict(raw_lm, newdata = train_raw)
#eval_metrics(raw_lm, train_raw, predictions, target = 'Mortality.rate') 

# Predict and evaluate raw_lm model on testing data
#predictions = predict(raw_lm, newdata = test_raw)
#eval_metrics(raw_lm, test_raw, predictions, target = 'Mortality.rate')

#Train aic_raw_lm model:
#aic_raw_lm <- lm(formula =  Mortality.rate ~ Country + Year.x + 
    #Died_Value + Survived_Value , data = train_raw)
#summary(aic_raw_lm) 

# Predict and evaluate aic_raw_lm model on training data
#predictions = predict(aic_raw_lm, newdata = train_raw)
#eval_metrics(aic_raw_lm, train_raw, predictions, target = 'Mortality.rate')

# Predict and evaluate aic_raw_lm model on testing data
#predictions = predict(aic_raw_lm, newdata = test_raw)
#eval_metrics(aic_raw_lm, test_raw, predictions, target = 'Mortality.rate')

#Specify column names
#cr_raw = c("Mortality.rate" , "Country", "Year.x" , "Died_Value" , "Survived_Value" )

#Generate dummy variables from data (if applicable)
#dummies <- dummyVars(Mortality.rate ~ ., data = df_num[,cr_raw])
#train_dummies = predict(dummies, newdata = train_raw[,cr_raw]) 
#test_dummies = predict(dummies, newdata = test_raw[,cr_raw]) 
#print(dim(train_dummies)); print(dim(test_dummies))

#Create numeric model matrices
#x = as.matrix(train_dummies)
#y_train = train_raw$Mortality.rate
#x_test = as.matrix(test_dummies)
#y_test = test_raw$Mortality.rate
#lambdas <- 10^seq(2, -3, by = -.1)

#Train model
#raw_ridge_reg = glmnet(x, y_train, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)
#summary(raw_ridge_reg)

#Compute optimal lambda
#raw_cv_ridge <- cv.glmnet(x, y_train, alpha = 0, lambda = lambdas)
#ol <- raw_cv_ridge$lambda.min
#ol

#Compute R^2 from true and predicted values
#eval_results <- function(true, predicted, df) {
  #SSE <- sum((predicted - true)^2)
  #SST <- sum((true - mean(true))^2)
  #R_square <- round((1 - SSE / SST),4)
  #RMSE = round(sqrt(SSE/nrow(df)),4)
  
  # Model performance metrics
#data.frame(RMSE = RMSE, Rsquare = R_square)
#}

#Predict and evaluate raw_ridge_reg model on training data
#predictions_train <- predict(raw_ridge_reg, s = ol, newx = x)
#eval_results(y_train, predictions_train, train_raw)

#Predict and evaluate raw_ridge_reg model on testing data
#predictions_test <- predict(raw_ridge_reg, s = ol, newx = x_test)
#eval_results(y_test, predictions_test, test_raw) 
#Ridge regression lambda plot
#plot(raw_ridge_reg)

#Specify column names
cols_reg = c("Mortality.rate" , "Country", "Year.x" , "Died_Value" , "Survived_Value")

#Generate dummy variables from data (if applicable)
#dummies <- dummyVars(Mortality.rate ~ ., data = dat[,cols_reg])
#train_dummies2 = predict(dummies, newdata = train2[,cols_reg]) 
#test_dummies2 = predict(dummies, newdata = test2[,cols_reg]) 
#print(dim(train_dummies2)); print(dim(test_dummies2))

#Create numeric model matrices
#x2 = as.matrix(train_dummies2)
#y_train2 = train2$Mortality.rate
#x_test2 = as.matrix(test_dummies2)
#y_test2 = test2$Mortality.rate

#Train model
#ridge_reg = glmnet(x2, y_train2, nlambda = 25, alpha = 0, family = 'gaussian', lambda = lambdas)
#summary(ridge_reg)

#Compute optimal lambda
#cv_ridge2 <- cv.glmnet(x2, y_train2, alpha = 0, lambda = lambdas)
#ol2 <- cv_ridge2$lambda.min
#ol2

#Predict and evaluate ridge_reg model on training data
#predictions_train2 <- predict(ridge_reg, s = ol2, newx = x2)
#eval_results(y_train2, predictions_train2, train2)

#Predict and evaluate ridge_reg model on testing data
#predictions_test2 <- predict(ridge_reg, s = ol2, newx = x_test2)
#eval_results(y_test2, predictions_test2, test2)

# Setting alpha = 1 implements lasso regression
#lasso_reg <- cv.glmnet(x, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)

#Compute optimal lambda
#lambda_best <- lasso_reg$lambda.min 
#lambda_best 

#raw_lasso <- glmnet(x, y_train, alpha = 1, lambda = lambda_best, standardize = TRUE)

#predictions_train <- predict(raw_lasso, s = lambda_best, newx = x)
#eval_results(y_train, predictions_train, train_raw) 

#predictions_test <- predict(raw_lasso, s = lambda_best, newx = x_test)
#eval_results(y_test, predictions_test, test_raw) 

#Lasso regression lambda plots
#op <- par(mfrow=c(1, 2))
#plot(lasso_reg$glmnet.fit, "norm",   label=TRUE)
#plot(lasso_reg$glmnet.fit, "lambda", label=TRUE)
#par(op)

# Setting alpha = 1 implements lasso regression
#lasso_reg2 <- cv.glmnet(x2, y_train2, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)

#Compute optimal lambda
#lambda_best2 <- lasso_reg2$lambda.min 
#lambda_best2

#lasso_model <- glmnet(x2, y_train2, alpha = 1, lambda = lambda_best2, standardize = TRUE)

#predictions_train2 <- predict(lasso_model, s = lambda_best2, newx = x2)
#eval_results(y_train2, predictions_train2, train2)

#predictions_test2 <- predict(lasso_model, s = lambda_best2, newx = x_test2)
#eval_results(y_test2, predictions_test2, test2) 



# Set the random seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
#train_indices <- sample(1:nrow(CMbyIncome), nrow(CMbyIncome)*0.8)
#train_data <- CMbyIncome[train_indices, ]
#test_data <- CMbyIncome[-train_indices, ]

# Specify the formula for the random forest model
#formula <- Mortality.rate ~ Country + Year

# Create the random forest model
#rf_model <- randomForest(formula, data = train_data, ntree = 100, mtry = sqrt(ncol(train_data)), importance = TRUE)

# Print the summary of the random forest model
#print(rf_model)

# Predict using the random forest model
#rf_predictions <- predict(rf_model, newdata = test_data)

# Calculate RMSE
#rmse <- sqrt(mean((rf_predictions - test_data$Mortality.rate)^2))

# Calculate R2_TEST
#rsquared <- 1 - sum((test_data$Mortality.rate - rf_predictions)^2) / sum((test_data$Mortality.rate - mean(test_data$Mortality.rate))^2)


# Predict using the random forest model
#rf_predictions_train <- predict(rf_model, newdata = train_data)

# Calculate RMSE 
#rmse_train <- sqrt(mean((rf_predictions_train - train_data$Mortality.rate)^2))

# Calculate R2_train 
#rsquared_train <- 1 - sum((train_data$Mortality.rate - rf_predictions_train)^2) / sum((train_data$Mortality.rate - mean(train_data$Mortality.rate))^2)

# Print the RMSE and R2_train 
#print(paste("RMSE:", rmse_train))
#print(paste("R2_train:", rsquared_train))
# Print the RMSE and R2_test
#print(paste("RMSE:", rmse))
#print(paste("R2_test:", rsquared))



#Create Kable table to succinctly summarize model optimization results
Model <- c('1', '2', '3', '4', '5', '6', '7', '8')
Method <- c('Linear', 'Linear', 'Linear', 'Ridge', 'Ridge', 'Lasso', 'Lasso','Random Forest' )
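#Each vector needs 8 entries, one per model; shorter vectors are recycled by cbind() with a warning.
#The 8th entries below match the values shown for Model 8 in the rendered table.
Data <- c('raw', 'raw(AIC)', 'transformed', 'raw', 'transformed', 'raw', 'transformed', 'raw')
Var_Num <- c(0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001)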
R2_train <- c(0.8619, 0.8618, 0.8618, 0.8619    , 0.862, 0.8619, 0.862, 0.9969)
RMSE_train <- c(2.7996, 2.8004, 2.8004, 2.8004  , 2.7996, 2.8005, 2.7996, 0.4139)
R2_test <- c(0.8619, 0.8618, 0.8618, 0.8622, 0.862, 0.8622, 0.8619, 0.8599  )
RMSE_test <- c(2.8058, 2.8035, 2.8035, 2.8035, 2.8058, 2.8036, 2.8059, 2.8412)

output <- cbind(Model, Method, Data, Var_Num, R2_train, RMSE_train, R2_test, RMSE_test)

output %>%
  kbl(caption = "Regression Model Comparison") %>%
  kable_minimal() %>%
  kable_styling(latex_options = "hold_position")

#create map for CMbyIncome
CMbyIncome_p <- plot_geo(CMbyIncome, locationmode = 'world') %>%
add_trace( z = ~CMbyIncome$Mortality.rate, locations = CMbyIncome$Code, frame=~CMbyIncome$Year,
color = ~CMbyIncome$Mortality.rate)
#CMbyIncome_p

#CMR <- read.csv("https://raw.githubusercontent.com/IvanGrozny88/DATA698/main/ChildMOrtalytRate.csv")

# Create an interactive scatter plot using Plotly for CMR
#plot_ly(CMR, x = ~Country, y = ~Mortality.Rate, color = ~Mortality.Rate,
       # type = "scatter", mode = "markers",
        #hovertemplate = paste("Country: %{text}",
                             # "<br>Country: %{x:.2f}",
                             # "<br>Mortality.Rate of Years: %{y:.2f}")) %>%
  #layout(title = "Mortality.Rate of Years vs. Country",
        # xaxis = list(title = "Country"),
        # yaxis = list(title = "Mortality.Rate of years"),
         #hovermode = "closest")

Final Research Project

IvanTikhonov

2023-05-01