Predicting Global Trade Patterns

Team Members: Divya Vemula

Rahul Chauhan

Mani Krishna Tippani

2024-12-11

Introduction

The project investigates the factors influencing which countries import the most commodities. With globalization, understanding export-import dynamics is crucial for optimizing trade strategies and supporting economic policies. Analyzing import patterns can provide insights into market dependencies, identify potential new markets, and enhance decision-making for exporters.

Dataset Description

We collect data from the U.S. Census Bureau’s International Trade API, which provides detailed information on exports and imports, including commodity codes and values. This serves as our primary data source.

Variables Included

Total 41 Variables

Project Investigation

Why is this data interesting?

Data Analysis Framework

Data Exploration (EDA):

Load Required Libraries

options(repos = c(CRAN = "https://cloud.r-project.org"))
if (!require('tidyverse')) install.packages('tidyverse'); library('tidyverse')
if (!require('h2o')) install.packages('h2o'); library('h2o')
if (!require('kableExtra')) install.packages('kableExtra'); library('kableExtra')
if (!require('DALEXtra')) install.packages('DALEXtra'); library('DALEXtra')
if (!require('skimr')) install.packages('skimr'); library('skimr')
if (!require('recipes')) install.packages('recipes'); library('recipes')
if (!require('janitor')) install.packages('janitor'); library('janitor')
if (!require('caret')) install.packages('caret'); library('caret')
if (!require('stringr')) install.packages('stringr'); library('stringr')
if (!require('DALEX')) install.packages('DALEX'); library('DALEX')
if (!require('ggplot2')) install.packages('ggplot2'); library('ggplot2')
if (!require('httr')) install.packages('httr'); library('httr')
if (!require('jsonlite')) install.packages('jsonlite'); library('jsonlite')
if (!require('tibble')) install.packages('tibble'); library('tibble')
if (!require('dplyr')) install.packages('dplyr'); library('dplyr')
if (!require('tidyr')) install.packages('tidyr'); library('tidyr')

Data Modelling

Read the post-processed data

api_url <- "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=DISTRICT,DIST_NAME,E_COMMODITY,E_COMMODITY_LDESC,ALL_VAL_MO,ALL_VAL_YR,VES_VAL_MO,VES_VAL_YR,AIR_VAL_MO,AIR_VAL_YR,CC_YR,QTY_1_YR,QTY_2_YR,CTY_CODE,CTY_NAME,COMM_LVL,DF,LAST_UPDATE,YEAR,MONTH,VES_WGT_YR&YEAR=2013&MONTH=12&DISTRICT=13"
response <- GET(api_url, timeout(60))

if (status_code(response) == 200) {
  data <- fromJSON(content(response, "text"))
  headers <- data[1, ]
  records <- data[-1, ]
  
  # Convert to tibble and assign meaningful headers
  meaningful_headers <- c(
    "District Code", "District Name", "Export Commodity Code", 
    "Export Commodity Long Description", "Total Monthly Export Value", 
    "Total Year-to-Date Export Value", "Monthly Vessel Export Value", 
    "Year-to-Date Vessel Export Value", "Monthly Air Export Value", 
    "Year-to-Date Air Export Value", "Year-to-Date Card Count", 
    "Quantity 1 Year-to-Date", "Quantity 2 Year-to-Date", 
    "Country Code", "Country Name", "Commodity Level", 
    "Domestic/Foreign Indicator", "Last Update", "Year", 
    "Month", "Year-to-Date Vessel Weight"
  )
  data_table <- as_tibble(records, .name_repair = "unique")
  colnames(data_table) <- meaningful_headers
  print("Data successfully collected and loaded!")
} else {
  stop(paste("Error:", status_code(response)))
}
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## [1] "Data successfully collected and loaded!"
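
The `New names` messages above list 24 unnamed columns, while `meaningful_headers` supplies 21 labels, and the `headers` row taken from the response is never used. A likely explanation (an assumption, not confirmed by the source) is that the API echoes the `YEAR`, `MONTH`, and `DISTRICT` filters back as extra columns. A minimal sketch of naming columns from the response's own header row, using a small hypothetical matrix in place of the real response:

```r
# Hypothetical stand-in for the parsed JSON: a character matrix whose
# first row is the header, with DISTRICT repeated as an echoed filter.
raw <- rbind(
  c("DISTRICT", "CTY_NAME", "ALL_VAL_MO", "YEAR", "MONTH", "DISTRICT"),
  c("13", "CHILE", "99489", "2013", "12", "13"),
  c("13", "COLOMBIA", "9975", "2013", "12", "13")
)

header  <- raw[1, ]
records <- raw[-1, , drop = FALSE]

# Name every column from the header row; make.unique() disambiguates repeats.
colnames(records) <- make.unique(header, sep = "_")
toy_tbl <- as.data.frame(records, stringsAsFactors = FALSE)

# Drop columns whose header is a repeat of an earlier one.
toy_tbl <- toy_tbl[, !duplicated(header), drop = FALSE]
print(names(toy_tbl))
```

Readable labels can then be mapped onto the de-duplicated names, which would avoid the `na`, `na_2`, and `na_3` columns that show up in the missing-value check.
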
# Clean column names to ensure they are valid
data_table <- data_table %>% clean_names()

# Limit the data to 20,000 records
if (nrow(data_table) > 20000) {
  data_table <- data_table %>% sample_n(20000)  # Randomly sample 20,000 records
}

# Print the total number of records 
total_records <- nrow(data_table)  
print(paste("Total number of records:", total_records))
## [1] "Total number of records: 20000"
# Step 2: Data Wrangling
# Check for missing values
missing_values <- sapply(data_table, function(x) sum(is.na(x)))
print(missing_values)
##                     district_code                     district_name 
##                                 0                                 0 
##             export_commodity_code export_commodity_long_description 
##                                 0                                 0 
##        total_monthly_export_value   total_year_to_date_export_value 
##                                 0                                 0 
##       monthly_vessel_export_value  year_to_date_vessel_export_value 
##                                 0                                 0 
##          monthly_air_export_value     year_to_date_air_export_value 
##                                 0                                 0 
##           year_to_date_card_count           quantity_1_year_to_date 
##                                 0                                 0 
##           quantity_2_year_to_date                      country_code 
##                                 0                                 0 
##                      country_name                   commodity_level 
##                                 0                                 0 
##        domestic_foreign_indicator                       last_update 
##                                 0                                 0 
##                              year                             month 
##                                 0                                 0 
##        year_to_date_vessel_weight                                na 
##                                 0                                 0 
##                              na_2                              na_3 
##                                 0                                 0
# Drop rows with missing data
data_table <- data_table %>% drop_na()

# Convert categorical variables to factors
data_table <- data_table %>%
  mutate(across(where(is.character), as.factor))

# Normalize numeric variables
numeric_cols <- data_table %>%
  select(where(is.numeric)) %>%
  colnames()
data_table <- data_table %>%
  mutate(across(all_of(numeric_cols), ~ (.-min(.))/(max(.)-min(.))))

# Filter columns to meet 20 predictors requirement
selected_columns <- c(
  "district_code", "district_name", "export_commodity_code", 
  "export_commodity_long_description", "total_monthly_export_value", 
  "total_year_to_date_export_value", "monthly_vessel_export_value", 
  "year_to_date_vessel_export_value", "monthly_air_export_value", 
  "year_to_date_air_export_value", "year_to_date_card_count", 
  "quantity_1_year_to_date", "quantity_2_year_to_date", 
  "country_code", "country_name", "commodity_level", 
  "domestic_foreign_indicator", "last_update", "year", 
  "month", "year_to_date_vessel_weight"
)
data_table <- data_table %>% select(all_of(selected_columns))
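
Every field parsed from the JSON response arrives as text, and the earlier step turns character columns into factors, so `where(is.numeric)` probably selects nothing and the min-max step leaves the data unchanged. The sketch below, on a hypothetical three-row data frame, converts a value column to numeric first and then applies the same min-max formula. The helper name `rescale01` is an illustrative assumption.

```r
# Hypothetical toy data: values stored as text, as they come out of the JSON.
toy <- data.frame(
  country_name = c("CHILE", "COLOMBIA", "LUXEMBOURG"),
  total_monthly_export_value = c("99489", "9975", "997803"),
  stringsAsFactors = FALSE
)

# Before conversion there is nothing numeric to normalize.
n_numeric_before <- sum(vapply(toy, is.numeric, logical(1)))

# Convert the value column explicitly, then rescale to [0, 1].
toy$total_monthly_export_value <- as.numeric(toy$total_monthly_export_value)
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
toy$value_scaled <- rescale01(toy$total_monthly_export_value)

print(n_numeric_before)
print(toy)
```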

Predictors Data

# List of predictors to convert to factors
predictors <- c("district_code", "export_commodity_code", "export_commodity_long_description", 
                "total_year_to_date_export_value", "monthly_vessel_export_value", 
                "year_to_date_vessel_export_value", "monthly_air_export_value", 
                "year_to_date_air_export_value", "year_to_date_card_count", 
                "quantity_1_year_to_date", "quantity_2_year_to_date", 
                "country_code", "country_name", "commodity_level", 
                "domestic_foreign_indicator", "year_to_date_vessel_weight")

# Convert predictors to factors in the source dataframe
for (predictor in predictors) {
  data_table[[predictor]] <- as.factor(data_table[[predictor]])
}

# Verify the column types in the source dataframe
str(data_table)
## tibble [20,000 × 21] (S3: tbl_df/tbl/data.frame)
##  $ district_code                    : Factor w/ 1 level "13": 1 1 1 1 1 1 1 1 1 1 ...
##  $ district_name                    : Factor w/ 1 level "BALTIMORE, MD": 1 1 1 1 1 1 1 1 1 1 ...
##  $ export_commodity_code            : Factor w/ 5898 levels "-","01","010121",..: 5768 679 3659 1629 5675 3819 3734 4780 3713 1099 ...
##  $ export_commodity_long_description: Factor w/ 5390 levels "ABRASIVE ARTICLES ON A BASE OF WOVEN TEXTILE FABRIC ONLY",..: 4925 3657 2115 584 535 398 2722 4740 2539 3950 ...
##  $ total_monthly_export_value       : Factor w/ 3494 levels "0","10000","100000",..: 1 2664 1 1393 1 1 1 1 2324 3488 ...
##  $ total_year_to_date_export_value  : Factor w/ 13658 levels "10000","100000",..: 972 11740 4434 2579 2691 9835 4903 11166 13560 13623 ...
##  $ monthly_vessel_export_value      : Factor w/ 3206 levels "0","10000","100000",..: 1 2441 1 1292 1 1 1 1 1 3201 ...
##  $ year_to_date_vessel_export_value : Factor w/ 12552 levels "0","10000","100000",..: 903 10805 4105 2386 2492 9052 4542 10276 8946 12520 ...
##  $ monthly_air_export_value         : Factor w/ 348 levels "0","10000","10001",..: 1 1 1 1 1 1 1 1 238 1 ...
##  $ year_to_date_air_export_value    : Factor w/ 1837 levels "0","10000","100130",..: 1 1 1 1 1 1 1 1 1212 1 ...
##  $ year_to_date_card_count          : Factor w/ 530 levels "1","10","100",..: 44 313 167 375 412 1 167 335 335 167 ...
##  $ quantity_1_year_to_date          : Factor w/ 3035 levels "0","1","10","100",..: 1 1322 1 1 1 2 1 1 1061 1 ...
##  $ quantity_2_year_to_date          : Factor w/ 206 levels "0","1","100",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ country_code                     : Factor w/ 200 levels "-","0001","0003",..: 4 103 60 12 103 103 51 39 9 4 ...
##  $ country_name                     : Factor w/ 200 levels "AFGHANISTAN",..: 144 67 133 6 67 67 7 41 131 144 ...
##  $ commodity_level                  : Factor w/ 5 levels "-","HS10","HS2",..: 2 2 2 4 4 2 5 5 2 5 ...
##  $ domestic_foreign_indicator       : Factor w/ 3 levels "-","1","2": 2 2 1 2 1 2 1 2 2 1 ...
##  $ last_update                      : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year                             : Factor w/ 1 level "2013": 1 1 1 1 1 1 1 1 1 1 ...
##  $ month                            : Factor w/ 1 level "12": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year_to_date_vessel_weight       : Factor w/ 10855 levels "0","1","10","100",..: 4218 5023 8777 7978 939 10201 390 6164 3002 989 ...
train_x_tbl <- data_table |> select(-total_monthly_export_value)
train_x_tbl_sorted <- train_x_tbl |> 
  arrange(desc(monthly_vessel_export_value))



kable(head(train_x_tbl_sorted, 10), format = "html", align = "l", caption = "Top 10 Rows of Training Predictor Variables") %>% 
  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), position = "center", font_size = 14) %>%
  column_spec(1, bold = TRUE, background = "#D3D3D3") %>% 
  row_spec(0, bold = TRUE, color = "white", background = "#4CAF50") %>% 
  
  footnote(general = "Top 10 rows", general_title = "Note: ", footnote_as_chunk = TRUE)
Top 10 Rows of Training Predictor Variables
| district_code | district_name | export_commodity_code | export_commodity_long_description | total_year_to_date_export_value | monthly_vessel_export_value | year_to_date_vessel_export_value | monthly_air_export_value | year_to_date_air_export_value | year_to_date_card_count | quantity_1_year_to_date | quantity_2_year_to_date | country_code | country_name | commodity_level | domestic_foreign_indicator | last_update | year | month | year_to_date_vessel_weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | BALTIMORE, MD | 8529904720 | RADAR APPARATUS PARTS | 2364716 | 997803 | 1302370 | 52198 | 1062346 | 31 | 0 | 0 | 0003 | EUROPEAN UNION | HS10 | - | 0 | 2013 | 12 | 4085 |
| 13 | BALTIMORE, MD | 8529904720 | RADAR APPARATUS PARTS | 1289944 | 997803 | 1289944 | 0 | 0 | 11 | 0 | 0 | 4239 | LUXEMBOURG | HS10 | 1 | 0 | 2013 | 12 | 4083 |
| 13 | BALTIMORE, MD | 340219 | ORGANIC SURFACE-ACTIVE AGENTS, WHETHER OR NOT PUT UP FOR RETAIL SALE, NESOI | 1738065 | 9975 | 1738065 | 0 | 0 | 38 | 0 | 0 | 3XXX | SOUTH AMERICA | HS6 | - | 0 | 2013 | 12 | 183189 |
| 13 | BALTIMORE, MD | 340219 | ORGANIC SURFACE-ACTIVE AGENTS, WHETHER OR NOT PUT UP FOR RETAIL SALE, NESOI | 112049 | 9975 | 112049 | 0 | 0 | 6 | 0 | 0 | 3010 | COLOMBIA | HS6 | - | 0 | 2013 | 12 | 38587 |
| 13 | BALTIMORE, MD | 3402195000 | ORGANIC SURFACE ACTIVE AGENTS,OTHER,NOT AROMATIC OR MODIFIED AROMATIC | 1716265 | 9975 | 1716265 | 0 | 0 | 37 | 173441 | 0 | 0024 | LAFTA | HS10 | - | 0 | 2013 | 12 | 182139 |
| 13 | BALTIMORE, MD | 8708998175 | PARTS AND ACCESSORIES, FOR MOTOR VEHICLES OF HEADINGS 8701 TO 8705, NESOI | 554068 | 99524 | 530633 | 3219 | 23435 | 42 | 0 | 0 | 0014 | PACIFIC RIM COUNTRIES | HS10 | 1 | 0 | 2013 | 12 | 238481 |
| 13 | BALTIMORE, MD | 470710 | WASTE AND SCRAP OF UNBLEACHED KRAFT PAPER OR PAPERBOARD OR OF CORRUGATED PAPER OR PAPERBOARD | 2386016 | 99489 | 2386016 | 0 | 0 | 70 | 0 | 0 | 3370 | CHILE | HS6 | - | 0 | 2013 | 12 | 6084953 |
| 13 | BALTIMORE, MD | 4707100000 | WASTE AND SCRAP OF UNBLEACHED KRAFT PAPER OR PAPERBOARD OR OF CORRUGATED PAPER OR PAPERBOARD | 2386016 | 99489 | 2386016 | 0 | 0 | 70 | 6395 | 0 | 3370 | CHILE | HS10 | - | 0 | 2013 | 12 | 6084953 |
| 13 | BALTIMORE, MD | 870333 | PASSENGER MOTOR VEHICLES WITH COMPRESSION-IGNITION INTERNAL COMBUSTION PISTON ENGINE (DIESEL), CYLINDER CAPACITY OVER 2,500 CC | 89218429 | 9942004 | 89218429 | 0 | 0 | 295 | 0 | 0 | 4XXX | EUROPE | HS6 | 2 | 0 | 2013 | 12 | 6939492 |
| 13 | BALTIMORE, MD | 320710 | PREPARED PIGMENTS, PREPARED OPACIFIERS, PREPARED COLORS AND SIMILAR PREPARATIONS | 99374 | 99374 | 99374 | 0 | 0 | 2 | 0 | 0 | 0014 | PACIFIC RIM COUNTRIES | HS6 | - | 0 | 2013 | 12 | 12393 |
Note: Top 10 rows

Outcome Variable Data

train_y_tbl <- data_table |> select(total_monthly_export_value)
train_y_tbl_sorted <- train_y_tbl |> 
  arrange(desc(total_monthly_export_value))


kable(head(train_y_tbl_sorted, 10), format = "html", align = "l", caption = "Top 10 Rows of Outcome Variable") %>% 
  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered", "highlight"), position = "center", font_size = 14) %>% 
  add_header_above(c("total_monthly_export_value Data" = 1)) %>% 
  row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>% 
  column_spec(1, color = "white", background = "#E74C3C") %>% 
  footnote(general = "This table displays the top 10 rows of the training target variable after preprocessing.", general_title = "Note: ", footnote_as_chunk = TRUE)
Top 10 Rows of Outcome Variable
total_monthly_export_value Data
total_monthly_export_value
99945
9982
9982
997803
9975
9975
9975
99489
99489
9942004
Note: This table displays the top 10 rows of the training target variable after preprocessing.
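
The ordering in this table (99945, 9982, 997803, ...) is what sorting text produces rather than sorting numbers, which is consistent with the outcome column being stored as a factor at this point. A small illustration using six of the values shown above (the variable names are illustrative):

```r
# Six values copied from the table above, kept as text.
values <- c("99945", "9982", "997803", "9975", "99489", "9942004")

# Sorting a factor follows its level order, which is alphabetical for text.
as_factor_sorted <- as.character(sort(factor(values), decreasing = TRUE))

# Sorting after numeric conversion gives the true largest values first.
as_numeric_sorted <- sort(as.numeric(values), decreasing = TRUE)

print(as_factor_sorted)
print(as_numeric_sorted)
```

Converting the column with `as.numeric(as.character(x))` before `arrange(desc(...))` would make the "Top 10" rows the numerically largest ones.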

Initialize the H2O instance

h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 hours 50 minutes 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.44.0.3 
##     H2O cluster version age:    11 months and 21 days 
##     H2O cluster name:           H2O_started_from_R_divyavemula_jsk830 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.40 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.4.2 (2024-10-31)

Load the saved H2O model

h2o_model <- h2o.loadModel("TeamProject-Group1-PredictingGlobalTradePatterns.h2o")
summary(h2o_model)
## Model Details:
## ==============
## 
## H2ORegressionModel: deeplearning
## Model Key:  TeamProject-Group1-PredictingGlobalTradePatterns 
## Status of Neuron Layers: predicting total_monthly_export_value, regression, gaussian distribution, Quadratic loss, 7,413,377 weights/biases, 58.9 MB, 803,441 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms momentum
## 1     1 57657     Input  0.00 %       NA       NA        NA       NA       NA
## 2     2   128 Rectifier  0.00 % 0.000010 0.000010  0.003836 0.000000 0.216069
## 3     3   128 Rectifier  0.00 % 0.000010 0.000010  0.003836 0.000000 0.216069
## 4     4   128 Rectifier  0.00 % 0.000010 0.000010  0.003836 0.000000 0.216069
## 5     5     1    Linear      NA 0.000010 0.000010  0.003836 0.000000 0.216069
##   mean_weight weight_rms mean_bias bias_rms
## 1          NA         NA        NA       NA
## 2   -0.000182   0.013013 -7.912375 3.348563
## 3   -0.256982   0.083695 -1.604274 1.476976
## 4   -0.137196   0.163916  0.247930 0.572892
## 5    0.095969   0.255025 -0.621036 0.000000
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 10041 samples **
## 
## MSE:  0.0153426
## RMSE:  0.1238652
## MAE:  0.01012971
## RMSLE:  0.02582092
## Mean Residual Deviance :  0.0153426
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
## 
## MSE:  0.5036592
## RMSE:  0.7096895
## MAE:  0.253089
## RMSLE:  0.07578916
## Mean Residual Deviance :  0.5036592
## 
## 
## 
## 
## Scoring History: 
##             timestamp          duration training_speed   epochs iterations
## 1 2024-12-11 11:21:24         0.000 sec             NA  0.00000          0
## 2 2024-12-11 11:21:26        44.350 sec    925 obs/sec  0.11231          1
## 3 2024-12-11 11:40:18 19 min 38.628 sec    735 obs/sec 50.01812        437
##         samples training_rmse training_deviance training_mae training_r2
## 1      0.000000            NA                NA           NA          NA
## 2   1804.000000       5.09314          25.94010      4.67226    -0.00787
## 3 803441.000000       0.12387           0.01534      0.01013     0.99940
##   validation_rmse validation_deviance validation_mae validation_r2
## 1              NA                  NA             NA            NA
## 2         5.13470            26.36519        4.70520      -0.00534
## 3         0.70969             0.50366        0.25309       0.98079
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##                                       variable relative_importance
## 1                monthly_vessel_export_value.0            1.000000
## 2                            country_code.0021            0.865046
## 3 country_name.TWENTY LATIN AMERICAN REPUBLICS            0.865030
## 4                 domestic_foreign_indicator.1            0.734723
## 5              year_to_date_air_export_value.0            0.734597
##   scaled_importance percentage
## 1          1.000000   0.054941
## 2          0.865046   0.047526
## 3          0.865030   0.047525
## 4          0.734723   0.040366
## 5          0.734597   0.040359
## 
## ---
##                                     variable relative_importance
## 57652   monthly_air_export_value.missing(NA)            0.000000
## 57653    quantity_2_year_to_date.missing(NA)            0.000000
## 57654               country_name.missing(NA)            0.000000
## 57655               country_code.missing(NA)            0.000000
## 57656            commodity_level.missing(NA)            0.000000
## 57657 domestic_foreign_indicator.missing(NA)            0.000000
##       scaled_importance percentage
## 57652          0.000000   0.000000
## 57653          0.000000   0.000000
## 57654          0.000000   0.000000
## 57655          0.000000   0.000000
## 57656          0.000000   0.000000
## 57657          0.000000   0.000000
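
The summary above reports 57,657 input units and importance entries such as `monthly_vessel_export_value.0`, which is consistent with numeric fields having been one-hot encoded as factors: each distinct value becomes its own input. The base R sketch below, on hypothetical values, shows how the same column yields several inputs as a factor and a single input as a number.

```r
# Hypothetical toy column: export values stored as text.
toy <- data.frame(
  monthly_air_export_value = c("0", "3219", "52198", "0", "23435"),
  stringsAsFactors = FALSE
)

# As a factor: one indicator column per distinct value.
as_factor <- model.matrix(~ factor(monthly_air_export_value) - 1, data = toy)

# As a number: a single continuous column.
as_numeric <- model.matrix(~ as.numeric(monthly_air_export_value) - 1, data = toy)

print(ncol(as_factor))
print(ncol(as_numeric))
```

Keeping value, quantity, and weight fields numeric would likely shrink the input layer considerably and let the network use the ordering of the values, though that would require retraining the model.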

Predictive performance of the model

Performance metrics

h2o_df <- as.h2o(data_table)
# Clean column names to ensure they are valid for H2O
colnames(h2o_df) <- gsub(" ", "_", colnames(h2o_df))  # Replace spaces with underscores
colnames(h2o_df) <- tolower(colnames(h2o_df))  # Convert to lowercase

# Check cleaned column names
print(colnames(h2o_df))
##  [1] "district_code"                     "district_name"                    
##  [3] "export_commodity_code"             "export_commodity_long_description"
##  [5] "total_monthly_export_value"        "total_year_to_date_export_value"  
##  [7] "monthly_vessel_export_value"       "year_to_date_vessel_export_value" 
##  [9] "monthly_air_export_value"          "year_to_date_air_export_value"    
## [11] "year_to_date_card_count"           "quantity_1_year_to_date"          
## [13] "quantity_2_year_to_date"           "country_code"                     
## [15] "country_name"                      "commodity_level"                  
## [17] "domestic_foreign_indicator"        "last_update"                      
## [19] "year"                              "month"                            
## [21] "year_to_date_vessel_weight"
# Ensure the target variable is numeric before applying the log transformation
h2o_df$total_monthly_export_value <- as.numeric(h2o_df$total_monthly_export_value)

# Apply log transformation to reduce the impact of outliers
h2o_df$total_monthly_export_value <- log1p(h2o_df$total_monthly_export_value)

# Convert the target variable to numeric to ensure it is treated as regression
h2o_df$total_monthly_export_value <- as.numeric(h2o_df$total_monthly_export_value)

# Split the dataset into training and testing sets (80% train, 20% test)
splits <- h2o.splitFrame(data = h2o_df, ratios = 0.8, seed = 123)
train_h2o <- splits[[1]] # training split (80%)
test_h2o <- splits[[2]] # test split (20%)

performance <- h2o.performance(h2o_model, newdata = test_h2o)
print(performance)
## H2ORegressionMetrics: deeplearning
## 
## MSE:  0.4504235
## RMSE:  0.671136
## MAE:  0.2340486
## RMSLE:  0.06440183
## Mean Residual Deviance :  0.4504235
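
For reference, metrics of this kind can be reproduced from residuals with a few lines of base R. The helper names `rmse` and `mae` and the toy vectors are illustrative assumptions. The check near the end only confirms that the RMSE printed above is the square root of the printed MSE. Because the target was transformed with `log1p`, these errors are on the log scale, and `expm1()` maps values back to the original scale.

```r
# Illustrative helpers (not part of the original analysis).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Hypothetical actual and predicted export values, compared on the log1p scale.
actual    <- log1p(c(0, 9975, 99489))
predicted <- log1p(c(0, 12000, 90000))

toy_rmse <- rmse(actual, predicted)
toy_mae  <- mae(actual, predicted)

# Consistency check on the figures reported above.
reported_mse  <- 0.4504235
reported_rmse <- 0.671136
rmse_matches  <- abs(sqrt(reported_mse) - reported_rmse) < 1e-5

# expm1() inverts log1p(), recovering the original value.
round_trip <- expm1(log1p(99489))

print(c(rmse = toy_rmse, mae = toy_mae))
print(rmse_matches)
```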

Plot: performance of the model

plot(h2o_model)

Plot Explanation:

Trend: Both lines show a general decreasing trend in RMSE as the number of epochs increases, which indicates that the model’s fit improves with more training.

Training vs. Validation RMSE: Initially, both training and validation RMSE decrease rapidly. Over time, the rate of decrease slows, which suggests the model is converging.

Model Generalization: If the validation RMSE closely follows the training RMSE, the model generalizes well to unseen data. A significant divergence between the two lines can indicate overfitting (the model performs well on training data but poorly on validation data). In the scoring history above, the final training RMSE is 0.12387 and the final validation RMSE is 0.70969, so some divergence of this kind is present here.
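
One way to quantify how far the two curves diverge is the ratio of validation RMSE to training RMSE at each scoring point. The sketch below uses the two scored rows from the Scoring History printed earlier; the column name `gap_ratio` is an illustrative assumption.

```r
# Values copied from the Scoring History output (rows 2 and 3).
history <- data.frame(
  epochs          = c(0.11231, 50.01812),
  training_rmse   = c(5.09314, 0.12387),
  validation_rmse = c(5.13470, 0.70969)
)

# A ratio near 1 means the curves track each other; larger values mean a wider gap.
history$gap_ratio <- history$validation_rmse / history$training_rmse
print(history)
```

The ratio starts close to 1 and ends above 5, so the validation error falls much less than the training error over the 50 epochs.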

Explain the model

Explainer

test_data <- as.data.frame(test_h2o)
 
# Assuming your dataset is stored in 'test_data' (converted from the H2O object)
set.seed(123)  # For reproducibility
sampled_data <- test_data[sample(nrow(test_data), 1000), ]
 
# Use DALEX to explain the trained model
explainer <- DALEX::explain(h2o_model, 
                            data = test_data[, predictors], 
                            y = test_data$total_monthly_export_value, 
                            label = "Deep Learning Model")
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'total_year_to_date_export_value' has levels not trained on:
## ["100085", "100127", "10024571", "10040", "10043274", "100662", "1007218",
## "100857", "1009500", "10141", ...1661 not listed..., "99462", "9946323",
## "99628", "9974", "9975", "998024", "99830", "9989", "999094", "999488"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_vessel_export_value' has levels not trained on:
## ["100085", "10024571", "10040", "10043274", "100662", "1007218", "100857",
## "1009500", "10140075", "10141", ...1534 not listed..., "99152", "9935",
## "9946323", "99628", "9974", "9975", "998024", "99830", "9989", "999094"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_vessel_weight' has levels not trained on: ["1001",
## "1002999", "10045", "1006", "101047", "101090", "1012816", "1019675", "10210",
## "102350", ...1252 not listed..., "9897", "9900", "99009", "99064", "993821",
## "9946", "9949", "9972", "9985822", "99898"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'export_commodity_code' has levels not trained on:
## ["0206290090", "0207140030", "020910", "021019", "0303260000", "030471",
## "0402100000", "040221", "0408", "071420", ...252 not listed..., "920510",
## "9303200035", "9305913010", "9306210000", "930629", "940290", "9403500000",
## "9506610000", "9610", "970500"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'export_commodity_long_description' has levels not trained on:
## ["AC GENERATORS (ALTERNATORS), OUTPUT EXCEEDING 750 KVA BUT NOT EXCEEDING
## 10,000 KVA", "AIR COMPRESSORS,RECIPROCATING, STATIONARY, EXCEEDING 11.19 KW BUT
## NOT EXCEEDING 74.6 KW", "AIR CONDITIONING EVAPORATOR COILS", "AIR GUN PELLETS
## AND PARTS OF SHOTGUN CARTRIDGES", "AIRCRAFT SPARK-IGNITION RECIPROCATING OR
## ROTARY INTERNAL COMBUSTION PISTON ENGINES", "ALUMINUM COLLAPSIBLE TUBULAR
## CONTAINERS CAP NT OV 300 LITERS", "ALUMINUM COLLAPSIBLE TUBULAR CONTAINERS, OF
## A CAPACITY NOT OVER 300 LITERS (79.30 GAL.)", "ALUMINUM FOIL, NOT OVER 0.2 MM
## THICK, NOT BACKED, ROLLED BUT NOT FURTHER WORKED", "ALUMINUM WIRE", "ALUMINUM,
## NOT ALLOYED, UNWROUGHT", ...207 not listed..., "WELDED LINK CHAIN OF IRON OR
## NONALLOY STEEL NESOI", "WIRE OF REFINED COPPER, WITH A MAXIMUM CROSS SECTIONAL
## DIMENSION NOT OVER 6 MM (.23 IN.)", "WOMEN'S OR GIRL'S ARTICLES OTHER THAN
## T-SHIRTS, SINGLETS AND TANK TOPS OF OTHER TEXTILE MATERIAL NESOI, KNITTED OR
## CROCHETED", "WOMEN'S OR GIRLS' OVERCOATS, RAINCOATS, CARCOATS, CAPES, CLOAKS
## AND SIMILAR ARTICLES OF TEXTILE MATERIALS NESOI, NOT KNITTED OR CROCHETED",
## "WOMEN'S OR GIRLS' SUITS, ENSEMBLES, SUIT-TYPE JACKETS, BLAZERS, DRESSES,
## SKIRTS, DIVIDED SKIRTS, TROUSERS, ETC. (NO SWIMWEAR), KNITTED OR CROCHETED",
## "WOODEN FURNITURE OF A KIND USED IN THE BEDROOM", "WOVEN FABRICS OF FLAX
## CONTAINING 85% OR MORE BY WEIGHT OF FLAX OTHER", "WOVEN PILE FABRICS AND
## CHENILLE FABRICS (OTHER THAN WOVEN TERRY OR TUFTED FABRICS AND NARROW WOVEN
## FABRICS NOT OVER 30 CM IN WIDTH) NESOI", "WRENCHES, ROTARY TYPE, PNEUMATIC
## HAND-DIRECTED TOOLS, NESOI", "YARN OF COMBED WOOL, NOT PUT UP FOR RETAIL SALE"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'monthly_vessel_export_value' has levels not trained on:
## ["10281", "10323", "104636", "104676", "1059761", "106246", "10753600",
## "1082214", "108863", "108982", ...337 not listed..., "914860", "91536541",
## "92980", "9551", "959799", "96619", "9876", "9886", "99489", "997803"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'quantity_1_year_to_date' has levels not trained on: ["1001",
## "10023", "10136", "10141", "10149", "101717", "101776", "1019", "10301",
## "104004", ...343 not listed..., "933", "95108", "95729", "9583", "95845",
## "96526", "977", "9772", "9812", "9975"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_air_export_value' has levels not trained on:
## ["10254", "103554", "10367", "10404", "10460", "107846", "108757", "11549",
## "116641", "1187382", ...170 not listed..., "9033", "907200", "931193", "9362",
## "96812", "96872", "96942", "99462", "995952", "999488"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_card_count' has levels not trained on: ["1147",
## "1153", "1583", "160", "1670", "240", "258", "274", "287", "2893", ...10 not
## listed..., "547", "579", "586", "588", "596", "6014", "631", "6478", "675",
## "700"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'monthly_air_export_value' has levels not trained on: ["10322",
## "106202", "108206", "108922", "129959", "14823", "15662", "16474", "18953",
## "23021", ...12 not listed..., "40575", "41042", "44756", "4517", "4695",
## "57672", "6931", "78110", "87860", "95790"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'quantity_2_year_to_date' has levels not trained on: ["1",
## "101", "10301", "1080", "108718", "126552", "140", "144", "159", "1866", ...11
## not listed..., "50", "55", "562", "592", "69029", "71", "825", "8424", "86",
## "9043"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country_name' has levels not trained on: ["BRITISH VIRGIN
## ISLANDS", "BURMA"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country_code' has levels not trained on: ["2482", "5460"]
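
The warnings above report factor levels in the test frame that the model never saw during training. A quick way to measure the size of that problem for any one column is to compare the two sets of values directly. The vectors below are hypothetical, apart from the two country names taken from the warning.

```r
# Hypothetical training levels and test values for one column.
train_levels <- c("CHILE", "COLOMBIA", "LUXEMBOURG")
test_values  <- c("CHILE", "BURMA", "BRITISH VIRGIN ISLANDS", "COLOMBIA")

# Levels present in the test data but absent from training.
unseen <- setdiff(unique(test_values), train_levels)

# Share of test rows affected by an unseen level.
share_unseen <- mean(!test_values %in% train_levels)

print(unseen)
print(share_unseen)
```

For columns such as the export values, where thousands of levels are unseen, a high share would suggest the column is better treated as numeric than as a factor.
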
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_card_count' has levels not trained on: ["1147",
## "1153", "1583", "160", "1670", "240", "258", "274", "287", "2893", ...10 not
## listed..., "547", "579", "586", "588", "596", "6014", "631", "6478", "675",
## "700"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'monthly_air_export_value' has levels not trained on: ["10322",
## "106202", "108206", "108922", "129959", "14823", "15662", "16474", "18953",
## "23021", ...12 not listed..., "40575", "41042", "44756", "4517", "4695",
## "57672", "6931", "78110", "87860", "95790"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'quantity_2_year_to_date' has levels not trained on: ["1",
## "101", "10301", "1080", "108718", "126552", "140", "144", "159", "1866", ...11
## not listed..., "50", "55", "562", "592", "69029", "71", "825", "8424", "86",
## "9043"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country_name' has levels not trained on: ["BRITISH VIRGIN
## ISLANDS", "BURMA"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country_code' has levels not trained on: ["2482", "5460"]
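
The warnings above list numeric-looking values (for example "10281" in `monthly_vessel_export_value`) as factor levels, which suggests that these columns reached H2O as categorical rather than numeric. One possible remedy is to convert such columns before building the H2O frame. The sketch below assumes a plain data frame; `toy_trade` and `to_numeric_if_possible()` are hypothetical names for illustration and are not part of the original pipeline.

```r
# Toy stand-in for the trade data (assumption: the real columns arrive as character)
toy_trade <- data.frame(
  monthly_vessel_export_value = c("10281", "0", "104636"),
  country_name = c("BURMA", "CANADA", "MEXICO"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert a character/factor column to numeric only when
# every non-missing value parses as a number; otherwise return it unchanged
to_numeric_if_possible <- function(x) {
  if (!is.character(x) && !is.factor(x)) return(x)
  x_chr <- as.character(x)
  parsed <- suppressWarnings(as.numeric(x_chr))
  if (any(is.na(parsed) & !is.na(x_chr))) x else parsed
}

toy_trade[] <- lapply(toy_trade, to_numeric_if_possible)
str(toy_trade)
```

Identifier columns that only look numeric, such as `country_code`, would also be converted by this helper, so they would need to be excluded explicitly if they should stay categorical.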

New observation

# Use the first test row as the observation to explain
new_observation <- test_h2o[1, ]

# Alternatively, extract a specific observation by row index
# new_observation <- test_h2o[specific_row_index, ]

# Convert to a data frame and coerce columns to factors with fixed levels.
# Note: any value that is not listed in `levels` becomes NA.
new_observation <- as.data.frame(new_observation)
new_observation <- new_observation %>%
  mutate(
    total_year_to_date_export_value = factor(total_year_to_date_export_value,
                                             levels = c("610000", "750000")),
    monthly_vessel_export_value = factor(monthly_vessel_export_value,
                                         levels = c("0", "200000")),
    year_to_date_vessel_export_value = factor(year_to_date_vessel_export_value,
                                              levels = c("610000", "720000")),
    monthly_air_export_value = factor(monthly_air_export_value,
                                      levels = c("0", "100000")),
    year_to_date_air_export_value = factor(year_to_date_air_export_value,
                                           levels = c("610000", "850000")),
    quantity_1_year_to_date = factor(quantity_1_year_to_date,
                                     levels = c("0")),
    quantity_2_year_to_date = factor(quantity_2_year_to_date,
                                     levels = c("0")),
    country_code = factor(country_code, levels = c("4XXX", "2XXX")),
    country_name = factor(country_name, levels = c("EUROPE", "CENTRAL AMERICA")),
    domestic_foreign_indicator = factor(domestic_foreign_indicator,
                                        levels = c("-", "1")),
    year_to_date_vessel_weight = factor(year_to_date_vessel_weight,
                                        levels = c("24551", "15000")),
    last_update = Sys.Date()  # Set 'last_update' to the current date
  )

print(new_observation)
##   district_code district_name export_commodity_code
## 1            13 BALTIMORE, MD            8424890000
##                                                                                       export_commodity_long_description
## 1 MECHANICAL APPLIANCES (WHETHER OR NOT HAND OPERATED) FOR PROJECTING, DISPERSING OR SPRAYING LIQUIDS OR POWDERS, NESOI
##   total_monthly_export_value total_year_to_date_export_value
## 1                   10.75045                            <NA>
##   monthly_vessel_export_value year_to_date_vessel_export_value
## 1                           0                             <NA>
##   monthly_air_export_value year_to_date_air_export_value
## 1                     <NA>                          <NA>
##   year_to_date_card_count quantity_1_year_to_date quantity_2_year_to_date
## 1                       4                    <NA>                       0
##   country_code country_name commodity_level domestic_foreign_indicator
## 1         <NA>         <NA>            HS10                          1
##   last_update year month year_to_date_vessel_weight
## 1  2024-12-11 2013    12                       <NA>
# Process the data
new_observation_tbl_skim = partition(skim(new_observation))
names(new_observation_tbl_skim)
## [1] "Date"    "factor"  "numeric"
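# Convert string columns to factors (skim found no character columns above,
# so this selection is empty and the step leaves the data unchanged)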
string_2_factor_names_new_observation <- new_observation_tbl_skim$character$skim_variable
rec_obj_new_observation <- recipe(~ ., data = new_observation) |>
  step_string2factor(all_of(string_2_factor_names_new_observation)) |>
  prep()
new_observation_processed_tbl <- bake(rec_obj_new_observation, new_observation)

# Prediction-ready dataset
tradePatterns_prediction = new_observation_processed_tbl
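
The printed observation above shows `<NA>` in several columns, which is what `factor()` returns for any value that is not among the supplied `levels`. One alternative to hard-coded levels is to reuse the levels seen during training. The sketch below assumes a training data frame is available; `train_toy` and `new_toy` are hypothetical stand-ins, not objects from this project.

```r
# Hypothetical training and new data (not the project's actual frames)
train_toy <- data.frame(country_name = factor(c("CANADA", "MEXICO", "CANADA")))
new_toy <- data.frame(country_name = c("MEXICO", "BURMA"), stringsAsFactors = FALSE)

# Reuse the training levels so known values are kept as they are
new_toy$country_name <- factor(new_toy$country_name,
                               levels = levels(train_toy$country_name))

# Values never seen in training become NA and can be flagged explicitly
unseen <- is.na(new_toy$country_name)
print(new_toy)
print(unseen)
```

Flagging unseen levels this way makes it visible which inputs the model cannot interpret, instead of silently turning known values into missing ones.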

XAI-Method-1 - SHAP

h2o_exp_shap <- predict_parts(
  explainer = explainer, new_observation = tradePatterns_prediction,
  type = "shap", B = 3  # B: number of random variable orderings to average over
)
plot(h2o_exp_shap) + ggtitle("SHAP explanation")
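
To make the role of `B` more concrete, the following is a rough base R sketch of the idea behind ordering-based SHAP attributions: features are switched to the observation's values one at a time along a random ordering, the change in the mean prediction is recorded, and the results are averaged over `B` orderings. It assumes a toy linear model; `toy`, `fit`, and `one_ordering()` are illustrative names, and this is not DALEX's internal implementation or the project's H2O model.

```r
set.seed(1)

# Toy data and model (assumptions for illustration only)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(200, sd = 0.1)
fit <- lm(y ~ x1 + x2 + x3, data = toy)

features <- c("x1", "x2", "x3")
obs <- toy[1, features]  # observation to explain

# Contributions along one ordering: fix features to the observation's values
# one by one and record the change in the average prediction
one_ordering <- function(ordering) {
  current <- toy[, features]
  previous_mean <- mean(predict(fit, current))
  contrib <- setNames(numeric(length(features)), features)
  for (f in ordering) {
    current[[f]] <- obs[[f]]
    new_mean <- mean(predict(fit, current))
    contrib[f] <- new_mean - previous_mean
    previous_mean <- new_mean
  }
  contrib
}

B <- 3  # number of random orderings, mirroring B = 3 above
orderings <- replicate(B, sample(features), simplify = FALSE)
shap_values <- Reduce(`+`, lapply(orderings, one_ordering)) / B

baseline <- mean(predict(fit, toy))
print(shap_values)
```

In this sketch the contributions sum to the difference between the observation's prediction and the average prediction, which is the property that makes such attributions readable as a decomposition of a single prediction.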

Key Insights of SHAP:

Significant Negative Impact:

Moderate Negative Impact:

Small Negative Impact:

Negligible Impact:

XAI-Method-2 - Ceteris-paribus Profiles

Ceteris paribus profiles provide insights into how individual observations respond to changes in specific features, allowing for a more detailed understanding of the model’s behavior at the level of individual data points.

h2o_exp_cp <- DALEX::predict_profile(
  explainer = explainer,        # The explainer object
  new_observation = as.data.frame(test_h2o[1, ])  # New observation for prediction
)
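
As a minimal illustration of what a ceteris paribus profile computes, the base R sketch below varies one feature over a grid while holding the observation's other feature fixed, then records the model's prediction at each grid point. The data, the model, and the names `toy`, `fit`, and `cp_profile` are assumptions made for this example and are unrelated to the trade data or the H2O model.

```r
set.seed(42)

# Toy data and model (assumptions for illustration only)
toy <- data.frame(x1 = runif(100, 0, 10), x2 = runif(100, 0, 5))
toy$y <- 3 * toy$x1 + toy$x2^2 + rnorm(100)
fit <- lm(y ~ x1 + I(x2^2), data = toy)

# Observation of interest
obs <- toy[1, c("x1", "x2")]

# Vary x2 over a grid while x1 stays at the observation's value
grid <- seq(min(toy$x2), max(toy$x2), length.out = 20)
cp_data <- obs[rep(1, length(grid)), ]
cp_data$x2 <- grid

cp_profile <- data.frame(x2 = grid, prediction = predict(fit, cp_data))
head(cp_profile)
```

Plotting `prediction` against `x2` would give the profile for this single observation; repeating the procedure for other features gives one curve per feature.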

XAI-Method-3 - Model Performance

# Model performance
mp_h2o <- model_performance(explainer)
plot(mp_h2o) + ggtitle("Model Performance")

Model Performance Plot Interpretation

The plot, titled “Scoring History,” shows the root mean square error (RMSE) across training epochs for both the training and validation datasets.

Key Elements of the Plot:

Insights:

Trend: RMSE generally decreases on both lines as the number of epochs increases, indicating that the model’s fit improves with further training.

Training vs. Validation RMSE:
- Initially, both training and validation RMSE decrease rapidly.
- Over time, the rate of decrease slows, suggesting that the model is approaching convergence.

Model Generalization:
- If the validation RMSE closely follows the training RMSE, the model likely generalizes well to unseen data.
- If the two lines diverge significantly, this may indicate overfitting (the model performs well on training data but poorly on validation data).
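
The comparison of training and validation RMSE can be sketched in base R. The example below does not reproduce H2O's per-epoch scoring history; as a stand-in, it compares a simple and a deliberately flexible polynomial model on toy data, so that model flexibility plays the role that additional epochs would play. All names (`toy`, `rmse()`, `rmse_table`) are assumptions for illustration.

```r
set.seed(123)

# RMSE between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy data split into training and validation parts (illustration only)
n <- 60
toy <- data.frame(x = runif(n, -2, 2))
toy$y <- sin(toy$x) + rnorm(n, sd = 0.3)
train <- toy[1:40, ]
valid <- toy[41:60, ]

# A simple model and a deliberately flexible one
simple_fit <- lm(y ~ x, data = train)
flexible_fit <- lm(y ~ poly(x, 10), data = train)

rmse_table <- data.frame(
  model = c("simple", "flexible"),
  train_rmse = c(rmse(train$y, predict(simple_fit, train)),
                 rmse(train$y, predict(flexible_fit, train))),
  valid_rmse = c(rmse(valid$y, predict(simple_fit, valid)),
                 rmse(valid$y, predict(flexible_fit, valid)))
)

# A large positive gap (validation minus training) is a sign of overfitting
rmse_table$gap <- rmse_table$valid_rmse - rmse_table$train_rmse
print(rmse_table)
```

The flexible model can never fit the training data worse than the simple one, so the informative quantity is the gap between validation and training RMSE rather than the training RMSE alone.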

Conclusion

This project aims to provide valuable insights into global trade dynamics, supporting stakeholders in making informed decisions about export strategies and market targeting.