Data Analysis Framework

Data Exploration (EDA):

Summary Statistics: Generate descriptive statistics to capture the essence of the data.
Visualizations: Create plots to visualize trends over time and identify any seasonal patterns in exports.
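
As a minimal sketch of these two steps, the base R snippet below summarizes a small table of monthly export values and averages it by calendar month so that a seasonal pattern becomes visible. The data frame `toy_exports` and all of its numbers are invented for illustration and are not taken from the Census data used later. Base graphics are used only to keep the sketch free of package dependencies.

```r
# Hypothetical toy data: two years of monthly export values (invented numbers)
toy_exports <- data.frame(
  year  = rep(c(2012, 2013), each = 12),
  month = rep(1:12, times = 2),
  value = c(10, 12, 15, 14, 18, 21, 25, 24, 19, 16, 13, 11,
            11, 13, 17, 15, 20, 23, 27, 26, 21, 17, 14, 12)
)

# Summary statistics: five-number summary plus the mean
value_summary <- summary(toy_exports$value)
print(value_summary)

# Average value per calendar month, to expose seasonality
monthly_mean <- aggregate(value ~ month, data = toy_exports, FUN = mean)
peak_month <- monthly_mean$month[which.max(monthly_mean$value)]
print(peak_month)

# Trend over time: one line per year (solid = 2012, dashed = 2013)
is_2012 <- toy_exports$year == 2012
plot(toy_exports$month[is_2012], toy_exports$value[is_2012],
     type = "l", xlab = "Month", ylab = "Export value",
     ylim = range(toy_exports$value))
lines(toy_exports$month[!is_2012], toy_exports$value[!is_2012], lty = 2)
```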

Load Required Libraries

options(repos = c(CRAN = "https://cloud.r-project.org"))
if (!require('tidyverse')) install.packages('tidyverse'); library('tidyverse')
if (!require('h2o')) install.packages('h2o'); library('h2o')
if (!require('kableExtra')) install.packages('kableExtra'); library('kableExtra')
if (!require('DALEXtra')) install.packages('DALEXtra'); library('DALEXtra')
if (!require('skimr')) install.packages('skimr'); library('skimr')
if (!require('recipes')) install.packages('recipes'); library('recipes')
if (!require('janitor')) install.packages('janitor'); library('janitor')
if (!require('caret')) install.packages('caret'); library('caret')
if (!require('stringr')) install.packages('stringr'); library('stringr')
if (!require('DALEX')) install.packages('DALEX'); library('DALEX')
if (!require('ggplot2')) install.packages('ggplot2'); library('ggplot2')
if (!require('httr')) install.packages('httr'); library('httr')
if (!require('jsonlite')) install.packages('jsonlite'); library('jsonlite')
if (!require('tibble')) install.packages('tibble'); library('tibble')
if (!require('dplyr')) install.packages('dplyr'); library('dplyr')
if (!require('tidyr')) install.packages('tidyr'); library('tidyr')

Data Modelling

Read the PostProcessed data

api_url <- "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=DISTRICT,DIST_NAME,E_COMMODITY,E_COMMODITY_LDESC,ALL_VAL_MO,ALL_VAL_YR,VES_VAL_MO,VES_VAL_YR,AIR_VAL_MO,AIR_VAL_YR,CC_YR,QTY_1_YR,QTY_2_YR,CTY_CODE,CTY_NAME,COMM_LVL,DF,LAST_UPDATE,YEAR,MONTH,VES_WGT_YR&YEAR=2013&MONTH=12&DISTRICT=13"
response <- GET(api_url, timeout(60))

if (status_code(response) == 200) {
  data <- fromJSON(content(response, "text", encoding = "UTF-8"))
  headers <- data[1, ]   # header row returned by the API (not used below)
  records <- data[-1, ]  # all remaining rows are data records
  
  # Convert to tibble and assign meaningful headers
  meaningful_headers <- c(
    "District Code", "District Name", "Export Commodity Code", 
    "Export Commodity Long Description", "Total Monthly Export Value", 
    "Total Year-to-Date Export Value", "Monthly Vessel Export Value", 
    "Year-to-Date Vessel Export Value", "Monthly Air Export Value", 
    "Year-to-Date Air Export Value", "Year-to-Date Card Count", 
    "Quantity 1 Year-to-Date", "Quantity 2 Year-to-Date", 
    "Country Code", "Country Name", "Commodity Level", 
    "Domestic/Foreign Indicator", "Last Update", "Year", 
    "Month", "Year-to-Date Vessel Weight"
  )
  data_table <- as_tibble(records, .name_repair = "unique")
  # The response has 24 columns (see the name-repair message) but only 21
  # headers are supplied, so the last three columns are left unnamed;
  # clean_names() later labels them na, na_2, and na_3
  colnames(data_table) <- meaningful_headers
  print("Data successfully collected and loaded!")
} else {
  stop(paste("Error:", status_code(response)))
}

## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`

## [1] "Data successfully collected and loaded!"
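
The name-repair message above lists 24 columns, while `meaningful_headers` holds 21 names. One way to avoid such a mismatch is to take the column names from the header row of the response itself. The sketch below uses a small hand-written character matrix as a stand-in for what `fromJSON()` returns. The matrix contents and the helper name `matrix_to_df` are illustrative assumptions, not the actual API response.

```r
# Stand-in for the parsed response: first row = header, remaining rows = records.
# The repeated "YEAR" mimics a field that appears twice in a response (assumption).
raw <- rbind(
  c("DISTRICT", "ALL_VAL_MO", "YEAR", "YEAR"),
  c("13",       "997803",     "2013", "2013"),
  c("13",       "0",          "2013", "2013")
)

# Hypothetical helper: name columns from the header row and drop repeated fields
matrix_to_df <- function(m) {
  header  <- m[1, ]
  records <- m[-1, , drop = FALSE]          # keep matrix shape even for one record
  df <- as.data.frame(records, stringsAsFactors = FALSE)
  names(df) <- make.unique(header)          # temporary unique names
  df[, !duplicated(header), drop = FALSE]   # keep the first copy of each field
}

export_df <- matrix_to_df(raw)
str(export_df)
```

With this approach the names always match the number of columns returned, and readable labels can still be applied afterwards through a lookup from API field name to label.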

# Clean column names to ensure they are valid
data_table <- data_table %>% clean_names()

# Limit the data to 20,000 records
if (nrow(data_table) > 20000) {
  set.seed(123)  # Fix the seed so the random sample is reproducible
  data_table <- data_table %>% sample_n(20000)  # Randomly sample 20,000 records
}

# Print the total number of records 
total_records <- nrow(data_table)  
print(paste("Total number of records:", total_records))

## [1] "Total number of records: 20000"

# Step 2: Data Wrangling
# Check for missing values
missing_values <- sapply(data_table, function(x) sum(is.na(x)))
print(missing_values)

##                     district_code                     district_name 
##                                 0                                 0 
##             export_commodity_code export_commodity_long_description 
##                                 0                                 0 
##        total_monthly_export_value   total_year_to_date_export_value 
##                                 0                                 0 
##       monthly_vessel_export_value  year_to_date_vessel_export_value 
##                                 0                                 0 
##          monthly_air_export_value     year_to_date_air_export_value 
##                                 0                                 0 
##           year_to_date_card_count           quantity_1_year_to_date 
##                                 0                                 0 
##           quantity_2_year_to_date                      country_code 
##                                 0                                 0 
##                      country_name                   commodity_level 
##                                 0                                 0 
##        domestic_foreign_indicator                       last_update 
##                                 0                                 0 
##                              year                             month 
##                                 0                                 0 
##        year_to_date_vessel_weight                                na 
##                                 0                                 0 
##                              na_2                              na_3 
##                                 0                                 0

# Drop rows with missing data
data_table <- data_table %>% drop_na()

# Convert categorical variables to factors
data_table <- data_table %>%
  mutate(across(where(is.character), as.factor))

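# Normalize numeric variables (min-max scaling to [0, 1])
# Note: every column was read as character and converted to a factor above,
# so where(is.numeric) selects nothing here and the data are left unchanged
numeric_cols <- data_table %>%
  select(where(is.numeric)) %>%
  colnames()
data_table <- data_table %>%
  mutate(across(all_of(numeric_cols), ~ (.x - min(.x)) / (max(.x) - min(.x))))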

# Filter columns to meet 20 predictors requirement
selected_columns <- c(
  "district_code", "district_name", "export_commodity_code", 
  "export_commodity_long_description", "total_monthly_export_value", 
  "total_year_to_date_export_value", "monthly_vessel_export_value", 
  "year_to_date_vessel_export_value", "monthly_air_export_value", 
  "year_to_date_air_export_value", "year_to_date_card_count", 
  "quantity_1_year_to_date", "quantity_2_year_to_date", 
  "country_code", "country_name", "commodity_level", 
  "domestic_foreign_indicator", "last_update", "year", 
  "month", "year_to_date_vessel_weight"
)
data_table <- data_table %>% select(all_of(selected_columns))
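
Because the API delivers every field as text, a numeric rescaling step only has an effect once the value columns have been converted to numbers. The sketch below shows that order of operations on a made-up data frame. The names `toy`, `value_cols`, and `min_max` are hypothetical, and the example does not settle whether rescaling suits the saved model, which was trained on factor inputs.

```r
# Hypothetical toy data: value columns arrive as character strings
toy <- data.frame(
  country_name  = c("CHILE", "COLOMBIA", "LUXEMBOURG"),
  monthly_value = c("99489", "9975", "997803"),
  ytd_value     = c("2386016", "112049", "1289944"),
  stringsAsFactors = FALSE
)

# Step 1: convert the value columns from character to numeric
value_cols <- c("monthly_value", "ytd_value")
toy[value_cols] <- lapply(toy[value_cols], as.numeric)

# Step 2: min-max scaling to [0, 1], guarding against constant columns
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # avoid division by zero
  (x - rng[1]) / (rng[2] - rng[1])
}
toy[value_cols] <- lapply(toy[value_cols], min_max)

print(toy)
```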

Predictors Data

# List of predictors to convert to factors
predictors <- c("district_code", "export_commodity_code", "export_commodity_long_description", 
                "total_year_to_date_export_value", "monthly_vessel_export_value", 
                "year_to_date_vessel_export_value", "monthly_air_export_value", 
                "year_to_date_air_export_value", "year_to_date_card_count", 
                "quantity_1_year_to_date", "quantity_2_year_to_date", 
                "country_code", "country_name", "commodity_level", 
                "domestic_foreign_indicator", "year_to_date_vessel_weight")

# Convert predictors to factors in the source dataframe
for (predictor in predictors) {
  data_table[[predictor]] <- as.factor(data_table[[predictor]])
}

# Verify the column types in the source dataframe
str(data_table)

## tibble [20,000 × 21] (S3: tbl_df/tbl/data.frame)
##  $ district_code                    : Factor w/ 1 level "13": 1 1 1 1 1 1 1 1 1 1 ...
##  $ district_name                    : Factor w/ 1 level "BALTIMORE, MD": 1 1 1 1 1 1 1 1 1 1 ...
##  $ export_commodity_code            : Factor w/ 5898 levels "-","01","010121",..: 5768 679 3659 1629 5675 3819 3734 4780 3713 1099 ...
##  $ export_commodity_long_description: Factor w/ 5390 levels "ABRASIVE ARTICLES ON A BASE OF WOVEN TEXTILE FABRIC ONLY",..: 4925 3657 2115 584 535 398 2722 4740 2539 3950 ...
##  $ total_monthly_export_value       : Factor w/ 3494 levels "0","10000","100000",..: 1 2664 1 1393 1 1 1 1 2324 3488 ...
##  $ total_year_to_date_export_value  : Factor w/ 13658 levels "10000","100000",..: 972 11740 4434 2579 2691 9835 4903 11166 13560 13623 ...
##  $ monthly_vessel_export_value      : Factor w/ 3206 levels "0","10000","100000",..: 1 2441 1 1292 1 1 1 1 1 3201 ...
##  $ year_to_date_vessel_export_value : Factor w/ 12552 levels "0","10000","100000",..: 903 10805 4105 2386 2492 9052 4542 10276 8946 12520 ...
##  $ monthly_air_export_value         : Factor w/ 348 levels "0","10000","10001",..: 1 1 1 1 1 1 1 1 238 1 ...
##  $ year_to_date_air_export_value    : Factor w/ 1837 levels "0","10000","100130",..: 1 1 1 1 1 1 1 1 1212 1 ...
##  $ year_to_date_card_count          : Factor w/ 530 levels "1","10","100",..: 44 313 167 375 412 1 167 335 335 167 ...
##  $ quantity_1_year_to_date          : Factor w/ 3035 levels "0","1","10","100",..: 1 1322 1 1 1 2 1 1 1061 1 ...
##  $ quantity_2_year_to_date          : Factor w/ 206 levels "0","1","100",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ country_code                     : Factor w/ 200 levels "-","0001","0003",..: 4 103 60 12 103 103 51 39 9 4 ...
##  $ country_name                     : Factor w/ 200 levels "AFGHANISTAN",..: 144 67 133 6 67 67 7 41 131 144 ...
##  $ commodity_level                  : Factor w/ 5 levels "-","HS10","HS2",..: 2 2 2 4 4 2 5 5 2 5 ...
##  $ domestic_foreign_indicator       : Factor w/ 3 levels "-","1","2": 2 2 1 2 1 2 1 2 2 1 ...
##  $ last_update                      : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year                             : Factor w/ 1 level "2013": 1 1 1 1 1 1 1 1 1 1 ...
##  $ month                            : Factor w/ 1 level "12": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year_to_date_vessel_weight       : Factor w/ 10855 levels "0","1","10","100",..: 4218 5023 8777 7978 939 10201 390 6164 3002 989 ...

train_x_tbl <- data_table |> select(-total_monthly_export_value)
# monthly_vessel_export_value is a factor at this point, so desc() orders rows
# by factor level (alphabetical on the digit strings), not by numeric value
train_x_tbl_sorted <- train_x_tbl |> 
  arrange(desc(monthly_vessel_export_value))

kable(head(train_x_tbl_sorted, 10), format = "html", align = "l", caption = "Top 10 Rows of Training Predictor Variables") %>% 
  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), position = "center", font_size = 14) %>%
  column_spec(1, bold = TRUE, background = "#D3D3D3") %>% 
  row_spec(0, bold = TRUE, color = "white", background = "#4CAF50") %>% 
  footnote(general = "Top 10 rows", general_title = "Note: ", footnote_as_chunk = TRUE)

Top 10 Rows of Training Predictor Variables
district_code	district_name	export_commodity_code	export_commodity_long_description	total_year_to_date_export_value	monthly_vessel_export_value	year_to_date_vessel_export_value	monthly_air_export_value	year_to_date_air_export_value	year_to_date_card_count	quantity_1_year_to_date	quantity_2_year_to_date	country_code	country_name	commodity_level	domestic_foreign_indicator	last_update	year	month	year_to_date_vessel_weight
13	BALTIMORE, MD	8529904720	RADAR APPARATUS PARTS	2364716	997803	1302370	52198	1062346	31	0	0	0003	EUROPEAN UNION	HS10		0	2013	12	4085
13	BALTIMORE, MD	8529904720	RADAR APPARATUS PARTS	1289944	997803	1289944	0	0	11	0	0	4239	LUXEMBOURG	HS10	1	0	2013	12	4083
13	BALTIMORE, MD	340219	ORGANIC SURFACE-ACTIVE AGENTS, WHETHER OR NOT PUT UP FOR RETAIL SALE, NESOI	1738065	9975	1738065	0	0	38	0	0	3XXX	SOUTH AMERICA	HS6		0	2013	12	183189
13	BALTIMORE, MD	340219	ORGANIC SURFACE-ACTIVE AGENTS, WHETHER OR NOT PUT UP FOR RETAIL SALE, NESOI	112049	9975	112049	0	0	6	0	0	3010	COLOMBIA	HS6		0	2013	12	38587
13	BALTIMORE, MD	3402195000	ORGANIC SURFACE ACTIVE AGENTS,OTHER,NOT AROMATIC OR MODIFIED AROMATIC	1716265	9975	1716265	0	0	37	173441	0	0024	LAFTA	HS10		0	2013	12	182139
13	BALTIMORE, MD	8708998175	PARTS AND ACCESSORIES, FOR MOTOR VEHICLES OF HEADINGS 8701 TO 8705, NESOI	554068	99524	530633	3219	23435	42	0	0	0014	PACIFIC RIM COUNTRIES	HS10	1	0	2013	12	238481
13	BALTIMORE, MD	470710	WASTE AND SCRAP OF UNBLEACHED KRAFT PAPER OR PAPERBOARD OR OF CORRUGATED PAPER OR PAPERBOARD	2386016	99489	2386016	0	0	70	0	0	3370	CHILE	HS6		0	2013	12	6084953
13	BALTIMORE, MD	4707100000	WASTE AND SCRAP OF UNBLEACHED KRAFT PAPER OR PAPERBOARD OR OF CORRUGATED PAPER OR PAPERBOARD	2386016	99489	2386016	0	0	70	6395	0	3370	CHILE	HS10		0	2013	12	6084953
13	BALTIMORE, MD	870333	PASSENGER MOTOR VEHICLES WITH COMPRESSION-IGNITION INTERNAL COMBUSTION PISTON ENGINE (DIESEL), CYLINDER CAPACITY OVER 2,500 CC	89218429	9942004	89218429	0	0	295	0	0	4XXX	EUROPE	HS6	2	0	2013	12	6939492
13	BALTIMORE, MD	320710	PREPARED PIGMENTS, PREPARED OPACIFIERS, PREPARED COLORS AND SIMILAR PREPARATIONS	99374	99374	99374	0	0	2	0	0	0014	PACIFIC RIM COUNTRIES	HS6		0	2013	12	12393
Note: Top 10 rows
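
The rows above are ordered by factor level, which sorts the digit strings alphabetically, so 997803 ranks above 9942004. A minimal base R sketch of the difference follows, using a few values taken from the table. The vector name `vessel_value` is made up for the example.

```r
# A factor built from digit strings: levels are sorted alphabetically
vessel_value <- factor(c("9942004", "997803", "99524", "9975", "0"))

# Ordering by factor level (what desc() does for a factor column)
by_level <- as.character(
  vessel_value[order(as.integer(vessel_value), decreasing = TRUE)]
)
print(by_level)

# Ordering by numeric value: convert via character first, never directly,
# because as.numeric() on a factor returns the level codes
numeric_value <- as.numeric(as.character(vessel_value))
by_value <- numeric_value[order(numeric_value, decreasing = TRUE)]
print(by_value)
```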

Outcome Variable Data

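train_y_tbl <- data_table |> select(total_monthly_export_value)
# The outcome is stored as a factor here, so desc() sorts by factor level
# (alphabetical on the digit strings), not by numeric value
train_y_tbl_sorted <- train_y_tbl |> 
  arrange(desc(total_monthly_export_value))
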
kable(head(train_y_tbl_sorted, 10), format = "html", align = "l", caption = "Top 10 Rows of Outcome Variable") %>% 
  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered", "highlight"), position = "center", font_size = 14) %>% 
  add_header_above(c("total_monthly_export_value Data" = 1)) %>% 
  row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>% 
  column_spec(1, color = "white", background = "#E74C3C") %>% 
  footnote(general = "This table displays the top 10 rows of the training target variable after preprocessing.", general_title = "Note: ", footnote_as_chunk = TRUE)

Top 10 Rows of Outcome Variable
total_monthly_export_value Data
total_monthly_export_value
99945
9982
9982
997803
9975
9975
9975
99489
99489
9942004
Note: This table displays the top 10 rows of the training target variable after preprocessing.

Initialize the H2O instance

h2o.init(nthreads = -1)

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 hours 50 minutes 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.44.0.3 
##     H2O cluster version age:    11 months and 21 days 
##     H2O cluster name:           H2O_started_from_R_divyavemula_jsk830 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.40 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.4.2 (2024-10-31)

Load the saved H2O model

h2o_model <- h2o.loadModel("TeamProject-Group1-PredictingGlobalTradePatterns.h2o")
summary(h2o_model)

## Model Details:
## ==============
## 
## H2ORegressionModel: deeplearning
## Model Key:  TeamProject-Group1-PredictingGlobalTradePatterns 
## Status of Neuron Layers: predicting total_monthly_export_value, regression, gaussian distribution, Quadratic loss, 7,413,377 weights/biases, 58.9 MB, 803,441 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms momentum
## 1     1 57657     Input  0.00 %       NA       NA        NA       NA       NA
## 2     2   128 Rectifier  0.00 % 0.000010 0.000010  0.003836 0.000000 0.216069
## 3     3   128 Rectifier  0.00 % 0.000010 0.000010  0.003836 0.000000 0.216069
## 4     4   128 Rectifier  0.00 % 0.000010 0.000010  0.003836 0.000000 0.216069
## 5     5     1    Linear      NA 0.000010 0.000010  0.003836 0.000000 0.216069
##   mean_weight weight_rms mean_bias bias_rms
## 1          NA         NA        NA       NA
## 2   -0.000182   0.013013 -7.912375 3.348563
## 3   -0.256982   0.083695 -1.604274 1.476976
## 4   -0.137196   0.163916  0.247930 0.572892
## 5    0.095969   0.255025 -0.621036 0.000000
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 10041 samples **
## 
## MSE:  0.0153426
## RMSE:  0.1238652
## MAE:  0.01012971
## RMSLE:  0.02582092
## Mean Residual Deviance :  0.0153426
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
## 
## MSE:  0.5036592
## RMSE:  0.7096895
## MAE:  0.253089
## RMSLE:  0.07578916
## Mean Residual Deviance :  0.5036592
## 
## 
## 
## 
## Scoring History: 
##             timestamp          duration training_speed   epochs iterations
## 1 2024-12-11 11:21:24         0.000 sec             NA  0.00000          0
## 2 2024-12-11 11:21:26        44.350 sec    925 obs/sec  0.11231          1
## 3 2024-12-11 11:40:18 19 min 38.628 sec    735 obs/sec 50.01812        437
##         samples training_rmse training_deviance training_mae training_r2
## 1      0.000000            NA                NA           NA          NA
## 2   1804.000000       5.09314          25.94010      4.67226    -0.00787
## 3 803441.000000       0.12387           0.01534      0.01013     0.99940
##   validation_rmse validation_deviance validation_mae validation_r2
## 1              NA                  NA             NA            NA
## 2         5.13470            26.36519        4.70520      -0.00534
## 3         0.70969             0.50366        0.25309       0.98079
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##                                       variable relative_importance
## 1                monthly_vessel_export_value.0            1.000000
## 2                            country_code.0021            0.865046
## 3 country_name.TWENTY LATIN AMERICAN REPUBLICS            0.865030
## 4                 domestic_foreign_indicator.1            0.734723
## 5              year_to_date_air_export_value.0            0.734597
##   scaled_importance percentage
## 1          1.000000   0.054941
## 2          0.865046   0.047526
## 3          0.865030   0.047525
## 4          0.734723   0.040366
## 5          0.734597   0.040359
## 
## ---
##                                     variable relative_importance
## 57652   monthly_air_export_value.missing(NA)            0.000000
## 57653    quantity_2_year_to_date.missing(NA)            0.000000
## 57654               country_name.missing(NA)            0.000000
## 57655               country_code.missing(NA)            0.000000
## 57656            commodity_level.missing(NA)            0.000000
## 57657 domestic_foreign_indicator.missing(NA)            0.000000
##       scaled_importance percentage
## 57652          0.000000   0.000000
## 57653          0.000000   0.000000
## 57654          0.000000   0.000000
## 57655          0.000000   0.000000
## 57656          0.000000   0.000000
## 57657          0.000000   0.000000

Predictive performance of the model

Performance metrics

h2o_df <- as.h2o(data_table)

# Clean column names to ensure they are valid for H2O
colnames(h2o_df) <- gsub(" ", "_", colnames(h2o_df))  # Replace spaces with underscores
colnames(h2o_df) <- tolower(colnames(h2o_df))  # Convert to lowercase

# Check cleaned column names
print(colnames(h2o_df))

##  [1] "district_code"                     "district_name"                    
##  [3] "export_commodity_code"             "export_commodity_long_description"
##  [5] "total_monthly_export_value"        "total_year_to_date_export_value"  
##  [7] "monthly_vessel_export_value"       "year_to_date_vessel_export_value" 
##  [9] "monthly_air_export_value"          "year_to_date_air_export_value"    
## [11] "year_to_date_card_count"           "quantity_1_year_to_date"          
## [13] "quantity_2_year_to_date"           "country_code"                     
## [15] "country_name"                      "commodity_level"                  
## [17] "domestic_foreign_indicator"        "last_update"                      
## [19] "year"                              "month"                            
## [21] "year_to_date_vessel_weight"

# Ensure the target variable is numeric before applying the log transformation
h2o_df$total_monthly_export_value <- as.numeric(h2o_df$total_monthly_export_value)

# Apply log1p, i.e. log(1 + x), to reduce the impact of outliers; zero values map to 0
h2o_df$total_monthly_export_value <- log1p(h2o_df$total_monthly_export_value)

# Re-assert the numeric type so the target is treated as a regression outcome
h2o_df$total_monthly_export_value <- as.numeric(h2o_df$total_monthly_export_value)

# Split the dataset into training and testing sets (80% train, 20% test)
splits <- h2o.splitFrame(data = h2o_df, ratios = 0.8, seed = 123)
train_h2o <- splits[[1]] # 80% training split
test_h2o <- splits[[2]] # 20% held-out test split

performance <- h2o.performance(h2o_model, newdata = test_h2o)
print(performance)

## H2ORegressionMetrics: deeplearning
## 
## MSE:  0.4504235
## RMSE:  0.671136
## MAE:  0.2340486
## RMSLE:  0.06440183
## Mean Residual Deviance :  0.4504235
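
Because the target was transformed with `log1p` before scoring, the metrics above are on the log scale. As a rough sketch of what that means, the snippet below computes RMSE and MAE by hand on made-up values and maps predictions back to the original units with `expm1`. The vectors `actual` and `predicted_log` are invented, and the helper names `rmse` and `mae` are hypothetical.

```r
# Invented export values and log-scale predictions with small errors
actual        <- c(0, 9975, 99489, 997803)
actual_log    <- log1p(actual)
predicted_log <- actual_log + c(0.2, -0.1, 0.05, -0.3)

# Hand-written metric helpers
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
mae  <- function(y, y_hat) mean(abs(y - y_hat))

# Metrics on the log scale (comparable to the H2O output above)
rmse_log <- rmse(actual_log, predicted_log)
mae_log  <- mae(actual_log, predicted_log)
print(c(rmse_log = rmse_log, mae_log = mae_log))

# Back-transform predictions with expm1 to get errors in original units
predicted  <- expm1(predicted_log)
rmse_units <- rmse(actual, predicted)
print(rmse_units)
```

A small error on the log scale can correspond to a large error in original units for high-value records, so both views are worth reporting.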

Plot: performance of the model

plot(h2o_model)

Plot Explanation:

Trend: Both lines show RMSE decreasing as the number of epochs increases, which indicates that the model’s fit improves with further training.

Training vs. Validation RMSE:
- Initially, both training and validation RMSE decrease rapidly.
- Over time, the rate of decrease slows, indicating that the model is converging.

Model Generalization:
- If the validation RMSE closely follows the training RMSE, the model generalizes well to unseen data.
- A significant divergence between the two lines can indicate overfitting (the model performs well on training data but poorly on validation data). In the scoring history above, the final training RMSE is 0.124 and the final validation RMSE is 0.710, so some divergence is present.
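
The train-versus-validation comparison can be sketched with the values printed in the scoring history of the model summary. In a live session the same table would come from `h2o.scoreHistory(h2o_model)`; here the numbers are typed in by hand so the snippet runs without an H2O cluster. The 2x threshold is an arbitrary illustration, not an established rule.

```r
# Values copied from the scoring history in the model summary
score_history <- data.frame(
  epochs          = c(0.11231, 50.01812),
  training_rmse   = c(5.09314, 0.12387),
  validation_rmse = c(5.13470, 0.70969)
)

# Ratio of validation to training RMSE at each scoring point
score_history$gap_ratio <-
  score_history$validation_rmse / score_history$training_rmse
print(score_history)

# Flag a possible overfit when the final validation error is far above training
final <- score_history[nrow(score_history), ]
possible_overfit <- final$gap_ratio > 2   # illustrative threshold (assumption)
print(possible_overfit)
```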

Explain the model

Explainer

test_data <- as.data.frame(test_h2o)
 
# Draw a reproducible 1,000-row sample of the test data
# (note: sampled_data is not passed to explain() below, which uses the full test_data)
set.seed(123)  # For reproducibility
sampled_data <- test_data[sample(nrow(test_data), 1000), ]
 
# Use DALEX to explain the trained model
explainer <- DALEX::explain(h2o_model, 
                            data = test_data[, predictors], 
                            y = test_data$total_monthly_export_value, 
                            label = "Deep Learning Model")

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'total_year_to_date_export_value' has levels not trained on:
## ["100085", "100127", "10024571", "10040", "10043274", "100662", "1007218",
## "100857", "1009500", "10141", ...1661 not listed..., "99462", "9946323",
## "99628", "9974", "9975", "998024", "99830", "9989", "999094", "999488"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_vessel_export_value' has levels not trained on:
## ["100085", "10024571", "10040", "10043274", "100662", "1007218", "100857",
## "1009500", "10140075", "10141", ...1534 not listed..., "99152", "9935",
## "9946323", "99628", "9974", "9975", "998024", "99830", "9989", "999094"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_vessel_weight' has levels not trained on: ["1001",
## "1002999", "10045", "1006", "101047", "101090", "1012816", "1019675", "10210",
## "102350", ...1252 not listed..., "9897", "9900", "99009", "99064", "993821",
## "9946", "9949", "9972", "9985822", "99898"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'export_commodity_code' has levels not trained on:
## ["0206290090", "0207140030", "020910", "021019", "0303260000", "030471",
## "0402100000", "040221", "0408", "071420", ...252 not listed..., "920510",
## "9303200035", "9305913010", "9306210000", "930629", "940290", "9403500000",
## "9506610000", "9610", "970500"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'export_commodity_long_description' has levels not trained on:
## ["AC GENERATORS (ALTERNATORS), OUTPUT EXCEEDING 750 KVA BUT NOT EXCEEDING
## 10,000 KVA", "AIR COMPRESSORS,RECIPROCATING, STATIONARY, EXCEEDING 11.19 KW BUT
## NOT EXCEEDING 74.6 KW", "AIR CONDITIONING EVAPORATOR COILS", "AIR GUN PELLETS
## AND PARTS OF SHOTGUN CARTRIDGES", "AIRCRAFT SPARK-IGNITION RECIPROCATING OR
## ROTARY INTERNAL COMBUSTION PISTON ENGINES", "ALUMINUM COLLAPSIBLE TUBULAR
## CONTAINERS CAP NT OV 300 LITERS", "ALUMINUM COLLAPSIBLE TUBULAR CONTAINERS, OF
## A CAPACITY NOT OVER 300 LITERS (79.30 GAL.)", "ALUMINUM FOIL, NOT OVER 0.2 MM
## THICK, NOT BACKED, ROLLED BUT NOT FURTHER WORKED", "ALUMINUM WIRE", "ALUMINUM,
## NOT ALLOYED, UNWROUGHT", ...207 not listed..., "WELDED LINK CHAIN OF IRON OR
## NONALLOY STEEL NESOI", "WIRE OF REFINED COPPER, WITH A MAXIMUM CROSS SECTIONAL
## DIMENSION NOT OVER 6 MM (.23 IN.)", "WOMEN'S OR GIRL'S ARTICLES OTHER THAN
## T-SHIRTS, SINGLETS AND TANK TOPS OF OTHER TEXTILE MATERIAL NESOI, KNITTED OR
## CROCHETED", "WOMEN'S OR GIRLS' OVERCOATS, RAINCOATS, CARCOATS, CAPES, CLOAKS
## AND SIMILAR ARTICLES OF TEXTILE MATERIALS NESOI, NOT KNITTED OR CROCHETED",
## "WOMEN'S OR GIRLS' SUITS, ENSEMBLES, SUIT-TYPE JACKETS, BLAZERS, DRESSES,
## SKIRTS, DIVIDED SKIRTS, TROUSERS, ETC. (NO SWIMWEAR), KNITTED OR CROCHETED",
## "WOODEN FURNITURE OF A KIND USED IN THE BEDROOM", "WOVEN FABRICS OF FLAX
## CONTAINING 85% OR MORE BY WEIGHT OF FLAX OTHER", "WOVEN PILE FABRICS AND
## CHENILLE FABRICS (OTHER THAN WOVEN TERRY OR TUFTED FABRICS AND NARROW WOVEN
## FABRICS NOT OVER 30 CM IN WIDTH) NESOI", "WRENCHES, ROTARY TYPE, PNEUMATIC
## HAND-DIRECTED TOOLS, NESOI", "YARN OF COMBED WOOL, NOT PUT UP FOR RETAIL SALE"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'monthly_vessel_export_value' has levels not trained on:
## ["10281", "10323", "104636", "104676", "1059761", "106246", "10753600",
## "1082214", "108863", "108982", ...337 not listed..., "914860", "91536541",
## "92980", "9551", "959799", "96619", "9876", "9886", "99489", "997803"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'quantity_1_year_to_date' has levels not trained on: ["1001",
## "10023", "10136", "10141", "10149", "101717", "101776", "1019", "10301",
## "104004", ...343 not listed..., "933", "95108", "95729", "9583", "95845",
## "96526", "977", "9772", "9812", "9975"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_air_export_value' has levels not trained on:
## ["10254", "103554", "10367", "10404", "10460", "107846", "108757", "11549",
## "116641", "1187382", ...170 not listed..., "9033", "907200", "931193", "9362",
## "96812", "96872", "96942", "99462", "995952", "999488"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'year_to_date_card_count' has levels not trained on: ["1147",
## "1153", "1583", "160", "1670", "240", "258", "274", "287", "2893", ...10 not
## listed..., "547", "579", "586", "588", "596", "6014", "631", "6478", "675",
## "700"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'monthly_air_export_value' has levels not trained on: ["10322",
## "106202", "108206", "108922", "129959", "14823", "15662", "16474", "18953",
## "23021", ...12 not listed..., "40575", "41042", "44756", "4517", "4695",
## "57672", "6931", "78110", "87860", "95790"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'quantity_2_year_to_date' has levels not trained on: ["1",
## "101", "10301", "1080", "108718", "126552", "140", "144", "159", "1866", ...11
## not listed..., "50", "55", "562", "592", "69029", "71", "825", "8424", "86",
## "9043"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country_name' has levels not trained on: ["BRITISH VIRGIN
## ISLANDS", "BURMA"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country_code' has levels not trained on: ["2482", "5460"]
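
Each warning above means the test frame contains factor levels that were absent when the model was trained. A small base R sketch of how such levels can be listed before scoring follows. The vectors `train_col` and `test_col` are made-up stand-ins for one column of the training and test data.

```r
# Hypothetical stand-ins for one factor column in the training and test data
train_col <- factor(c("CHILE", "COLOMBIA", "LUXEMBOURG"))
test_col  <- factor(c("CHILE", "BURMA", "BRITISH VIRGIN ISLANDS"))

# Levels present in the test data but never seen during training
unseen <- setdiff(levels(test_col), levels(train_col))
print(unseen)

# Share of test rows affected by unseen levels
share_unseen <- mean(as.character(test_col) %in% unseen)
print(share_unseen)
```

Columns that hold numeric values but are stored as factors produce many such levels, because almost every distinct value becomes its own level.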

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'quantity_2_year_to_date' has levels not trained on: ["1",
## "101", "10301", "1080", "108718", "126552", "140", "144", "159", "1866", ...11
## not listed..., "50", "55", "562", "592", "69029", "71", "825", "8424", "86",
## "9043"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country_name' has levels not trained on: ["BRITISH VIRGIN
## ISLANDS", "BURMA"]

## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country_code' has levels not trained on: ["2482", "5460"]
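
The warnings above suggest that several numeric-looking columns (for example monthly_vessel_export_value) reached the model as categorical columns, so every number not seen in training counts as an unknown level. The following is a minimal base-R sketch, using a small hypothetical train/test pair rather than the real data, of how unseen levels can be listed per column and how a numeric-looking factor can be converted back to numbers. The data values and the helper name to_numeric are assumptions made for illustration.

```r
# Hypothetical miniature train/test split; column names mirror the warnings above
train <- data.frame(
  country_name = factor(c("CANADA", "MEXICO", "CANADA")),
  monthly_vessel_export_value = factor(c("0", "10281", "0"))
)
test <- data.frame(
  country_name = factor(c("CANADA", "BURMA")),
  monthly_vessel_export_value = factor(c("0", "10323"))
)

# Levels present in the test set but never seen in training, per factor column
unseen_levels <- lapply(names(train), function(col) {
  setdiff(levels(test[[col]]), levels(train[[col]]))
})
names(unseen_levels) <- names(train)
print(unseen_levels)

# Numeric-looking factors can be converted back to numbers so that new values
# are no longer treated as unknown categories (hypothetical helper)
to_numeric <- function(x) as.numeric(as.character(x))
test$monthly_vessel_export_value <- to_numeric(test$monthly_vessel_export_value)
str(test)
```

Whether a column should be numeric or categorical is a modelling decision. Codes such as country_code are genuinely categorical, while export values and quantities are more naturally numeric.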

New observation

# Extract a single row for prediction (assuming you want the first row)
new_observation <- test_h2o[1, ]

# Alternatively, extract a specific observation by row index
# new_observation <- test_h2o[specific_row_index, ]

# Convert to a data frame, then recode columns as factors with fixed levels.
# Values that are not among the listed levels become NA.
new_observation <- as.data.frame(new_observation)
new_observation <- new_observation %>%
  mutate(
    total_year_to_date_export_value = factor(total_year_to_date_export_value, 
                                             levels = c("610000", "750000")),
    monthly_vessel_export_value = factor(monthly_vessel_export_value, 
                                         levels = c("0", "200000")),
    year_to_date_vessel_export_value = factor(year_to_date_vessel_export_value, 
                                              levels = c("610000", "720000")),
    monthly_air_export_value = factor(monthly_air_export_value, 
                                      levels = c("0", "100000")),
    year_to_date_air_export_value = factor(year_to_date_air_export_value, 
                                           levels = c("610000", "850000")),
    quantity_1_year_to_date = factor(quantity_1_year_to_date, 
                                     levels = c("0")),
    quantity_2_year_to_date = factor(quantity_2_year_to_date, 
                                     levels = c("0")),
    country_code = factor(country_code, levels = c("4XXX", "2XXX")),
    country_name = factor(country_name, levels = c("EUROPE", "CENTRAL AMERICA")),
    domestic_foreign_indicator = factor(domestic_foreign_indicator, 
                                        levels = c("-", "1")),
    year_to_date_vessel_weight = factor(year_to_date_vessel_weight, 
                                        levels = c("24551", "15000")),
    last_update = Sys.Date()  # Set 'last_update' to the current date
  )

print(new_observation)

##   district_code district_name export_commodity_code
## 1            13 BALTIMORE, MD            8424890000
##                                                                                       export_commodity_long_description
## 1 MECHANICAL APPLIANCES (WHETHER OR NOT HAND OPERATED) FOR PROJECTING, DISPERSING OR SPRAYING LIQUIDS OR POWDERS, NESOI
##   total_monthly_export_value total_year_to_date_export_value
## 1                   10.75045                            <NA>
##   monthly_vessel_export_value year_to_date_vessel_export_value
## 1                           0                             <NA>
##   monthly_air_export_value year_to_date_air_export_value
## 1                     <NA>                          <NA>
##   year_to_date_card_count quantity_1_year_to_date quantity_2_year_to_date
## 1                       4                    <NA>                       0
##   country_code country_name commodity_level domestic_foreign_indicator
## 1         <NA>         <NA>            HS10                          1
##   last_update year month year_to_date_vessel_weight
## 1  2024-12-11 2013    12                       <NA>
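
Most recoded columns in the printed row show <NA>. A likely reason is that factor() returns NA for any value not listed in levels. The sketch below illustrates this behaviour with hypothetical values. The two levels are copied from the mutate() call above, while the vector x and the training levels are assumptions.

```r
# Hypothetical values; the levels are copied from the mutate() call above
x <- c("0", "200000", "10281")
f_fixed <- factor(x, levels = c("0", "200000"))
print(f_fixed)              # the third value is not a listed level, so it is NA
print(sum(is.na(f_fixed)))  # number of values lost to NA

# Alternative sketch: reuse the levels observed in (hypothetical) training data
train_levels <- c("0", "10281", "200000")
f_train <- factor(x, levels = train_levels)
print(sum(is.na(f_train)))
```

Taking the levels from the training data, instead of hard-coding a short list, keeps the observation's original values whenever they were seen during training.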

# Process the data
new_observation_tbl_skim <- partition(skim(new_observation))
names(new_observation_tbl_skim)

## [1] "Date"    "factor"  "numeric"

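# Convert string columns to factors. The partition above has no "character"
# element, so this selection is empty and the step leaves the data unchanged.
string_2_factor_names_new_observation <- new_observation_tbl_skim$character$skim_variable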
rec_obj_new_observation <- recipe(~ ., data = new_observation) |>
  step_string2factor(all_of(string_2_factor_names_new_observation)) |>
  prep()
new_observation_processed_tbl <- bake(rec_obj_new_observation, new_observation)

# Prediction-ready dataset
tradePatterns_prediction <- new_observation_processed_tbl

XAI-Method-1 - SHAP

h2o_exp_shap <- predict_parts(
  explainer = explainer,
  new_observation = tradePatterns_prediction,
  type = "shap",
  B = 3  # number of random feature orderings to average over
)
plot(h2o_exp_shap) + ggtitle("SHAP explanation")

Key Insights of SHAP:

Significant Negative Impact:

monthly_vessel_export_value and monthly_air_export_value being zero.
year_to_date_card_count having a value of 1.

Moderate Negative Impact:

Missing country_code.

Small Negative Impact:

domestic_foreign_indicator being “-”.
export_commodity_long_description being “PASTA, PREPARED, NESOI”.
quantity_1_year_to_date being zero.

Negligible Impact:

Missing total_year_to_date_export_value.
Missing year_to_date_vessel_export_value.
Missing year_to_date_vessel_weight.
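
The SHAP call in this section uses B = 3, meaning that contributions are averaged over three random feature orderings. The following base-R sketch illustrates that idea on a toy model. It is not the DALEX implementation, and the model f, the background data, and the observation are all hypothetical.

```r
# Toy model with an interaction, so attributions depend on the feature ordering
f <- function(d) 2 * d$x1 + 3 * d$x2 + d$x1 * d$x2

set.seed(1)
background <- data.frame(x1 = rnorm(50), x2 = rnorm(50))  # hypothetical reference data
obs <- data.frame(x1 = 1, x2 = 2)                          # hypothetical observation

# Contributions for one ordering: fix features one at a time at the observed
# value and record the change in the mean prediction over the background data
contrib_one_order <- function(ord) {
  d <- background
  out <- setNames(numeric(length(ord)), ord)
  prev <- mean(f(d))
  for (v in ord) {
    d[[v]] <- obs[[v]]
    cur <- mean(f(d))
    out[v] <- cur - prev
    prev <- cur
  }
  out[names(background)]  # return in a fixed column order
}

# Average the per-ordering contributions over B random orderings
B <- 3
orders <- replicate(B, sample(names(background)), simplify = FALSE)
shap_est <- Reduce(`+`, lapply(orders, contrib_one_order)) / B
print(shap_est)

# The contributions add up to the prediction minus the average prediction
print(sum(shap_est))
print(f(obs) - mean(f(background)))
```

With only three orderings the individual attributions can vary noticeably between runs, so a larger B generally gives more stable estimates at a higher computational cost.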

XAI-Method-2 - Ceteris-paribus Profiles

Ceteris paribus profiles show how the model’s prediction for a single observation changes as one feature is varied while all other features are held fixed. This gives a detailed view of the model’s behavior at the level of individual data points.

h2o_exp_cp <- DALEX::predict_profile(
  explainer = explainer,        # The explainer object
  new_observation = as.data.frame(test_h2o[1, ])  # New observation for prediction
)
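
To make the idea concrete, the sketch below builds a ceteris paribus profile by hand. It uses a linear model on the built-in mtcars data as a stand-in for the H2O model, so the model, the chosen feature, and the grid are all assumptions made for illustration.

```r
# Stand-in model: a linear regression on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The observation to explain (first row), analogous to test_h2o[1, ]
obs <- mtcars[1, c("wt", "hp")]

# Vary one feature (wt) over a grid while holding the other (hp) fixed
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
profile <- data.frame(wt = wt_grid, hp = obs$hp)
profile$prediction <- predict(fit, newdata = profile)

head(profile)

# Draw the profile and mark the observation's actual wt value
plot(profile$wt, profile$prediction, type = "l",
     xlab = "wt", ylab = "Predicted mpg",
     main = "Ceteris paribus profile for wt")
abline(v = obs$wt, lty = 2)
```

For the DALEX object computed above, plot(h2o_exp_cp) draws the corresponding profiles for the model's numeric features.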

XAI-Method-3 - Model performance

# Model performance
mp_h2o <- model_performance(explainer)
plot(mp_h2o) + ggtitle("Model Performance")

Model Performance Plot Interpretation

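The interpretation below describes the model’s “Scoring History” plot, which tracks the root mean square error (RMSE) across training epochs for both the training and validation datasets. Note that plot(mp_h2o) from DALEX draws the distribution of residuals rather than a scoring history, so this description applies to the scoring-history plot of the underlying H2O model.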

Key Elements of the Plot:

X-Axis (Epochs): This axis represents the number of epochs, ranging from 0 to 50. An epoch is one complete pass through the entire training dataset.
Y-Axis (RMSE): This axis shows the RMSE values, which range from 0 to 5. RMSE is a measure of the differences between predicted and actual values. Lower values indicate better model performance.
Lines:
- Blue Line (Training RMSE): Represents the RMSE for the training dataset over the epochs.
- Orange Line (Validation RMSE): Represents the RMSE for the validation dataset over the epochs.
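
As a reminder of what the Y-axis measures, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch in base R, with made-up numbers rather than values from this model:

```r
# RMSE: square root of the mean squared difference between actual and predicted
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up example values (not taken from the trade data)
actual    <- c(10.2, 11.5, 9.8, 12.1)
predicted <- c(10.0, 11.0, 10.5, 12.0)
print(rmse(actual, predicted))

# A perfect prediction gives an RMSE of zero
print(rmse(actual, actual))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.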

Insights: