Identifying demographic and neighborhood characteristics associated with green space deprivation in England (Tidymodels and LASSO Regression)
My role: analysis design, analysis, data visualisation, reporting
Techniques used: linear regression (LASSO), feature engineering, parameter tuning, exploratory data analysis, data visualisation.
In summary
In this notebook, I run a regression analysis in an attempt to identify demographic and neighborhood characteristics associated with green space deprivation in England. This notebook focuses on the feature engineering and modelling.
Objective: to identify demographic and neighborhood characteristics associated with green space deprivation in England.
My initial (informal) hypothesis: green space deprivation might be associated with one or more demographic (e.g. percentage of the population from minority ethnic backgrounds) and neighborhood characteristics (e.g. population density of the neighborhood).
Terminology: In terms of green space deprivation, I focus on the percentage of the population within each neighborhood without easy walking access to public green space; more specifically, the percentage of the population without 2 ha of public green space within 300m. This would be approximately a five minute walk for someone walking slowly or with children.
Geospatial scale: In this analysis I work at the neighborhood scale, the Lower Super Output Area (LSOA) scale in the jargon of UK administrative terminology. To give a sense of scale, the average population of an LSOA is approximately 1,600. This was the smallest geospatial scale at which data on green space deprivation was available.
Data: The dataset used was released by Friends of the Earth in 2020, as part of their research on ‘England’s Green Space Gap’. The dataset includes some UK Government data, and some data which Friends of the Earth derived from analysis of the UK Government data.
Resources: I used a range of resources and information sources when putting together this notebook.
Key findings:
- Whether a neighborhood/LSOA was urban or rural was the most important variable in estimating the proportion of its population without good access to green space. The model estimated that, holding other explanatory variables constant, an urban neighborhood would have a roughly 25 percentage point lower share of its population without good access to green space.
- The amount of green space per capita in a neighborhood was the second most important variable in estimating the proportion of its population without good access to green space. The model predicts that neighborhoods with more green space per capita have higher percentages of people without good access to green space.
- Other variables positively associated with higher percentages of a neighborhood's population being without good access to public green space include:
- average income of the population;
- total population of the neighborhood;
- the percentage of the population from a black African, Caribbean or black British ethnic background.
- Applying the model to unseen data:
- the predictive accuracy of the model was relatively poor. The RMSE was approximately 24, on an outcome variable that ranges between 0 and 100.
- the R2 value was 0.46, which could be considered reasonably good for the problem domain and a first iteration of model development.
Next steps: If I invest more time in this analysis, I think exploring the following approaches/issues would be interesting/useful.
- Creating separate models for urban and rural neighborhoods/LSOAs.
- Consider what other variables could help improve the explanatory/predictive capabilities of the model.
- Consider whether alternative (non regression based) analytical methods could help in understanding the data better.
Setting up the notebook
This section of the notebook includes all the setup code including importing and documenting the packages used, and setting notebook options.
# ------------------------------------------------------------------------------
# Import packages used in the notebook
# ------------------------------------------------------------------------------
# for data manipulation and visualisation
library(dplyr)
library(ggplot2)
library(tidyr)
# for modeling
library(tidymodels)
# for string manipulation
library(stringr)
# for functional programming
library(purrr)
# for reading in data xslx files
library(readxl)
# for table formatting
library(knitr)
# for skewedness and kurtosis tests used in exploratory analysis
library(moments)
# for analysing the importance of explanatory/predictor variables
library(vip)
# for handling factor variables
library(forcats)

# ------------------------------------------------------------------------------
# Imported packages used in the notebook
# ------------------------------------------------------------------------------
subset(
  data.frame(sessioninfo::package_info()),
  attached == TRUE,
  c(package, loadedversion)
) %>%
  kable()

| | package | loadedversion |
|---|---|---|
| broom | broom | 1.0.3 |
| dials | dials | 1.1.0 |
| dplyr | dplyr | 1.1.0 |
| forcats | forcats | 1.0.0 |
| ggplot2 | ggplot2 | 3.4.0 |
| infer | infer | 1.0.4 |
| knitr | knitr | 1.42 |
| modeldata | modeldata | 1.1.0 |
| moments | moments | 0.14.1 |
| parsnip | parsnip | 1.0.3 |
| purrr | purrr | 1.0.1 |
| readxl | readxl | 1.4.1 |
| recipes | recipes | 1.0.4 |
| rsample | rsample | 1.1.1 |
| scales | scales | 1.2.1 |
| stringr | stringr | 1.5.0 |
| tibble | tibble | 3.1.8 |
| tidymodels | tidymodels | 1.0.0 |
| tidyr | tidyr | 1.3.0 |
| tune | tune | 1.0.1 |
| vip | vip | 0.3.2 |
| workflows | workflows | 1.1.2 |
| workflowsets | workflowsets | 1.0.0 |
| yardstick | yardstick | 1.1.0 |
# ------------------------------------------------------------------------------
# Set up notebook output and data viz styling
# ------------------------------------------------------------------------------
# set default Rmd Chunk options
knitr::opts_chunk$set(
echo = TRUE,
cache = TRUE,
warning = FALSE,
message = FALSE
)
# set default theme for exploratory plots
theme_set(theme_light())
# define colour palette for plots
colours <- c("#374E83", "#C94A54")

Reading in the data
The dataset used is from Friends of the Earth, released in 2020 as part of their research on ‘England’s Green Space Gap’. Data at different spatial resolutions (LSOA, MSOA and Local Authority) is included, but I focus only on the LSOA (Lower Super Output Area) scale. This was the finest grain resolution at which the data required for the analysis was available.
The code chunk below reads the data in from a local Excel file, sorts out the variable naming conventions and outputs an overview of the structure of the data, so I could see all the variables I had to work with and check that the variable types (character, numeric, logical etc.) looked correct.
foe_green_space_raw <- read_excel("../Data/(FOE) Green Space Consolidated Data - England - Version 2.1.xlsx",
sheet = "LSOAs V2.1"
) %>%
  # inconsistent naming conventions are used for variables in the source data,
  # so clean the names for consistency (variable names are now in snake_case)
  janitor::clean_names()

str(foe_green_space_raw)

## tibble [32,844 x 31] (S3: tbl_df/tbl/data.frame)
## $ x1 : num [1:32844] 1 2 3 4 5 6 7 8 9 10 ...
## $ lsoa_code : chr [1:32844] "E01000001" "E01000002" "E01000003" "E01000005" ...
## $ lsoa_name : chr [1:32844] "City of London 001A" "City of London 001B" "City of London 001C" "City of London 001E" ...
## $ msoa_code : chr [1:32844] "E02000001" "E02000001" "E02000001" "E02000001" ...
## $ msoa_name : chr [1:32844] "City of London 001" "City of London 001" "City of London 001" "City of London 001" ...
## $ msoa_name_house_of_commons : chr [1:32844] "City of London" "City of London" "City of London" "City of London" ...
## $ la_code : chr [1:32844] "E09000001" "E09000001" "E09000001" "E09000001" ...
## $ la_name : chr [1:32844] "City of London" "City of London" "City of London" "City of London" ...
## $ la_name_for_readability : chr [1:32844] "City of London" "City of London" "City of London" "City of London" ...
## $ area : num [1:32844] 133326 226199 57305 190745 144197 ...
## $ imd_st_areasha : num [1:32844] 133321 226191 57303 190739 144196 ...
## $ population_imd : num [1:32844] 1296 1156 1350 1121 2040 ...
## $ population_density : num [1:32844] 0.00972 0.00511 0.02356 0.00588 0.01415 ...
## $ total_pop_from_ethnicity_data : num [1:32844] 1465 1436 1346 985 1703 ...
## $ white_pop : num [1:32844] 1238 1274 1055 506 557 ...
## $ mixed_multiple_ethnic_group_pop : num [1:32844] 54 54 55 59 58 88 83 62 149 28 ...
## $ asian_asian_british_pop : num [1:32844] 128 95 168 274 861 ...
## $ black_african_caribbean_black_british_pop: num [1:32844] 11 4 45 100 177 389 574 228 580 143 ...
## $ other_ethnic_group_pop : num [1:32844] 34 9 23 46 50 28 53 49 94 47 ...
## $ bame_pop : num [1:32844] 227 162 291 479 1146 ...
## $ income_decile : num [1:32844] 10 10 6 2 5 2 2 3 3 3 ...
## $ unbuffrd_go_space_area : num [1:32844] 0 0 0 0 0 ...
## $ buffrd_go_space_area : num [1:32844] 0 80106 1858 0 27147 ...
## $ unbuffered_go_space_per_capita : num [1:32844] 0 0 0 0 0 ...
## $ pop_area : num [1:32844] 133326 226199 57305 190745 144197 ...
## $ pop_area_with_go_space_access : num [1:32844] 0 80106 1858 0 27147 ...
## $ pcnt_pop_area_with_go_space_access : num [1:32844] 0 35.41 3.24 0 18.83 ...
## $ pcnt_pop_without_go_space_access : num [1:32844] 100 64.6 96.8 100 81.2 ...
## $ pop_without_go_space_access : num [1:32844] 1296 747 1306 1121 1656 ...
## $ gsdi_avg_area : num [1:32844] 1 1 1 1 1 1 1 1 2 1 ...
## $ gsdi_access : num [1:32844] 1 2 1 1 1 2 2 3 3 4 ...
We also need some additional data on whether each neighborhood (i.e. LSOA) is urban or rural. This data comes from the Office for National Statistics.
urban_rural <- read_excel("../Data/Rural_Urban_Classification_2011_lookup_tables_for_small_area_geographies.xlsx",
sheet = "LSOA11", skip = 2
) %>%
janitor::clean_names() %>%
  select(-lower_super_output_area_2011_name)

kable(head(urban_rural))

| lower_super_output_area_2011_code | rural_urban_classification_2011_code | rural_urban_classification_2011_10_fold | rural_urban_classification_2011_2_fold |
|---|---|---|---|
| E01000001 | A1 | Urban major conurbation | Urban |
| E01000002 | A1 | Urban major conurbation | Urban |
| E01000003 | A1 | Urban major conurbation | Urban |
| E01000005 | A1 | Urban major conurbation | Urban |
| E01000006 | A1 | Urban major conurbation | Urban |
| E01000007 | A1 | Urban major conurbation | Urban |
Now, we just need to join the two datasets together.
foe_green_space_urban_rural <- foe_green_space_raw %>%
left_join(urban_rural, by = c("lsoa_code" = "lower_super_output_area_2011_code"))
kable(head(foe_green_space_urban_rural))

| x1 | lsoa_code | lsoa_name | msoa_code | msoa_name | msoa_name_house_of_commons | la_code | la_name | la_name_for_readability | area | imd_st_areasha | population_imd | population_density | total_pop_from_ethnicity_data | white_pop | mixed_multiple_ethnic_group_pop | asian_asian_british_pop | black_african_caribbean_black_british_pop | other_ethnic_group_pop | bame_pop | income_decile | unbuffrd_go_space_area | buffrd_go_space_area | unbuffered_go_space_per_capita | pop_area | pop_area_with_go_space_access | pcnt_pop_area_with_go_space_access | pcnt_pop_without_go_space_access | pop_without_go_space_access | gsdi_avg_area | gsdi_access | rural_urban_classification_2011_code | rural_urban_classification_2011_10_fold | rural_urban_classification_2011_2_fold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | E01000001 | City of London 001A | E02000001 | City of London 001 | City of London | E09000001 | City of London | City of London | 133325.89 | 133320.77 | 1296 | 0.0097205 | 1465 | 1238 | 54 | 128 | 11 | 34 | 227 | 10 | 0 | 0.000 | 0 | 133325.89 | 0.000 | 0.000000 | 100.00000 | 1296.0000 | 1 | 1 | A1 | Urban major conurbation | Urban |
| 2 | E01000002 | City of London 001B | E02000001 | City of London 001 | City of London | E09000001 | City of London | City of London | 226199.38 | 226191.27 | 1156 | 0.0051105 | 1436 | 1274 | 54 | 95 | 4 | 9 | 162 | 10 | 0 | 80106.352 | 0 | 226199.38 | 80106.352 | 35.414046 | 64.58595 | 746.6136 | 1 | 2 | A1 | Urban major conurbation | Urban |
| 3 | E01000003 | City of London 001C | E02000001 | City of London 001 | City of London | E09000001 | City of London | City of London | 57305.11 | 57302.97 | 1350 | 0.0235581 | 1346 | 1055 | 55 | 168 | 45 | 23 | 291 | 6 | 0 | 1857.635 | 0 | 57305.11 | 1857.635 | 3.241657 | 96.75834 | 1306.2376 | 1 | 1 | A1 | Urban major conurbation | Urban |
| 4 | E01000005 | City of London 001E | E02000001 | City of London 001 | City of London | E09000001 | City of London | City of London | 190745.29 | 190738.76 | 1121 | 0.0058769 | 985 | 506 | 59 | 274 | 100 | 46 | 479 | 2 | 0 | 0.000 | 0 | 190745.29 | 0.000 | 0.000000 | 100.00000 | 1121.0000 | 1 | 1 | A1 | Urban major conurbation | Urban |
| 5 | E01000006 | Barking and Dagenham 016A | E02000017 | Barking and Dagenham 016 | Barking East | E09000002 | Barking and Dagenham | Barking and Dagenham | 144196.94 | 144195.85 | 2040 | 0.0141473 | 1703 | 557 | 58 | 861 | 177 | 50 | 1146 | 5 | 0 | 27146.718 | 0 | 144196.94 | 27146.718 | 18.826140 | 81.17386 | 1655.9467 | 1 | 1 | A1 | Urban major conurbation | Urban |
| 6 | E01000007 | Barking and Dagenham 015A | E02000016 | Barking and Dagenham 015 | Barking Central | E09000002 | Barking and Dagenham | Barking and Dagenham | 198136.98 | 198134.81 | 2101 | 0.0106038 | 1391 | 439 | 88 | 447 | 389 | 28 | 952 | 2 | 0 | 83933.435 | 0 | 198136.98 | 83933.435 | 42.361317 | 57.63868 | 1210.9887 | 1 | 2 | A1 | Urban major conurbation | Urban |
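Before moving on, a quick check (a sketch) that every neighborhood matched an urban-rural classification in the join:
# count LSOAs left without an urban-rural classification after the join
foe_green_space_urban_rural %>%
  filter(is.na(rural_urban_classification_2011_2_fold)) %>%
  nrow()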
Data cleaning
Happily, there is no missing data. I was hoping this would be the case, given the data is mostly either from, or derived from, open government data.
visdat::vis_miss(foe_green_space_urban_rural, warn_large_data = FALSE)

Exploratory data analysis
I know this dataset pretty well from previous analysis, so I’ll do a fairly minimal exploratory analysis focusing on:
- identifying multicollinearity amongst the predictors;
- understanding how skewed the distribution of each variable is.
Multicollinearity
The correlation plot below helped me identify some collinearity between variables.
foe_green_space_variables <- foe_green_space_urban_rural %>%
select(
-contains(c("name", "code", "gsdi")),
-x1
)
foe_green_space_variables %>%
GGally::ggcorr(hjust = 1, layout.exp = 4, size = 3)
Based on inspection of the correlation plot above, in the code chunk below I drop some variables that I suspect we don’t need for the analysis. Variables are dropped either because they are highly correlated with another variable or because they are duplicate measures of the same underlying quantity from different datasets.
foe_green_space_urban_rural_focus <- foe_green_space_urban_rural %>%
# explicitly remove variables so I know what has been dropped
select(
-c(
x1, imd_st_areasha, population_imd, unbuffrd_go_space_area, buffrd_go_space_area,
pop_area, pop_area_with_go_space_access, pcnt_pop_area_with_go_space_access,
pop_without_go_space_access
),
# foe green space deprivation ratings are not going to help in the analysis
-contains("gsdi")
  )
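As a quick spot-check on the duplicate measures being dropped (a sketch; the variable pairs below are my reading of which columns duplicate each other), the correlations can be computed directly:
# spot-check: the dropped measures should correlate almost perfectly with the
# variables kept (e.g. imd_st_areasha vs area)
foe_green_space_urban_rural %>%
  summarise(
    cor_area = cor(area, imd_st_areasha),
    cor_population = cor(population_imd, total_pop_from_ethnicity_data)
  )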
Skew
I know this dataset pretty well from previous analysis, and remembered that a lot of the variables are highly skewed, an issue that I’ll pick up in the feature engineering. So, the code chunk below runs a skewness test for each numeric variable, and I’ll store the results as they will probably be useful later on.
# look at skewness for each variable
skew_tests <- foe_green_space_urban_rural_focus %>%
  # drop non numeric variables
  select(-contains(c("name", "code", "gsdi", "rural_urban"))) %>%
  # calculate skewness
  summarise(across(
    dplyr::everything(),
    \(x) skewness(x, na.rm = TRUE)
  )) %>%
  # make the table format more readable
  pivot_longer(cols = everything()) %>%
  rename(
    variable = name,
    skewness = value
  ) %>%
  arrange(skewness) %>%
  # classify variables by amount of skew (positive skewness indicates a right
  # skew, i.e. a long right tail; negative skewness indicates a left skew)
  mutate(skew_group = case_when(
    skewness > 5 ~ "extreme right",
    skewness > 2 ~ "strong right",
    skewness > 0.5 ~ "right",
    skewness > -0.5 ~ "limited",
    skewness > -2 ~ "left",
    skewness > -5 ~ "strong left",
    TRUE ~ "extreme left"
  ))
skew_tests %>%
  # format for rmarkdown
  kable()

| variable | skewness | skew_group |
|---|---|---|
| pcnt_pop_without_go_space_access | -0.5780064 | left |
| income_decile | -0.0000103 | limited |
| white_pop | 0.0229443 | limited |
| mixed_multiple_ethnic_group_pop | 1.8827047 | right |
| total_pop_from_ethnicity_data | 2.2291027 | strong right |
| bame_pop | 2.4637131 | strong right |
| population_density | 2.5508460 | strong right |
| asian_asian_british_pop | 3.7868011 | strong right |
| black_african_caribbean_black_british_pop | 3.8526981 | strong right |
| other_ethnic_group_pop | 4.9348319 | strong right |
| area | 12.0015978 | extreme right |
| unbuffered_go_space_per_capita | 39.8380888 | extreme right |
LASSO regression analysis
We can now move on to running the regression analysis itself. Some key details of the analysis approach are presented below, before getting into the modelling workflow itself.
Outcome variable: pcnt_pop_without_go_space_access, the percentage of the population (in a neighbourhood/LSOA) without 2 ha of public green space within 300m.
Predictor variables: all other relevant variables available.
Methods:
- I used LASSO regression to penalise the inclusion of variables, in case I missed any multicollinearity in the EDA above (see the objective function sketched after this list).
- I split the data into test and train sets, as I wanted to get a sense of how the model performs on unseen data.
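For reference, LASSO estimates coefficients by minimising the least-squares loss plus an L1 penalty on the coefficients (written here in its textbook form; glmnet scales the loss term slightly differently):

$$\hat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

The penalty $\lambda$ is the parameter tuned later in the notebook; at $\lambda = 0$ this reduces to ordinary least squares.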
Approach to coding/implementation: I work within the Tidymodels framework.
Preparing the data
In the code chunk below I do some final data preparation ahead of
splitting into test and train sets. The main data transformation is to
convert population variables from raw counts to percentages. For
example, moving from having a raw count variable for the BAME population
(bame_pop) in each neighborhood, to have a percentage BAME
population (perc_bame_pop).
foe_green_space_urban_rural_model <- foe_green_space_urban_rural_focus %>%
# keep only the variables needed for modelling
select(
lsoa_code,
where(is.numeric),
rural_urban_classification_2011_2_fold
) %>%
# calculate demographic variables as percentage of population rather than
# raw counts
mutate(
perc_bame_pop = bame_pop / total_pop_from_ethnicity_data,
perc_mixed_multiple_ethnic_group_pop = mixed_multiple_ethnic_group_pop / total_pop_from_ethnicity_data,
perc_asian_asian_british_pop = asian_asian_british_pop / total_pop_from_ethnicity_data,
perc_black_african_caribbean_black_british_pop = black_african_caribbean_black_british_pop / total_pop_from_ethnicity_data,
    perc_other_ethnic_group_pop = other_ethnic_group_pop / total_pop_from_ethnicity_data
) %>%
# remove population count data
select(-c(
white_pop, mixed_multiple_ethnic_group_pop,
asian_asian_british_pop,
black_african_caribbean_black_british_pop,
other_ethnic_group_pop,
bame_pop
  ))
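Before splitting, a quick sanity check (a sketch) that the new perc_* variables are proportions lying between 0 and 1:
# the perc_* variables are proportions of the neighbourhood population,
# so should all fall within [0, 1]
foe_green_space_urban_rural_model %>%
  select(starts_with("perc")) %>%
  summary()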
Splitting the data
Next, I split the data into test and training sets. I want each split to have a representative proportion of rural and urban neighborhoods, because I think, based on domain knowledge, that the urban-rural classification will be important. So, I stratify the sampling on this basis.
data_split <- initial_split(foe_green_space_urban_rural_model,
strata = rural_urban_classification_2011_2_fold
)
train_df <- training(data_split)
test_df <- testing(data_split)
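A quick check (a sketch) that the stratified sampling has given each split a similar urban/rural mix:
# urban/rural proportions should be near identical across the two splits
list(train = train_df, test = test_df) %>%
  map(~ count(.x, rural_urban_classification_2011_2_fold) %>%
    mutate(prop = n / sum(n)))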
Feature engineering
Next, I set up the feature engineering, which can be run within the modelling workflow. Mostly, this involves trying to reduce the skew of the explanatory variables, and making sure some of the prerequisites for LASSO regression are met.
green_space_recipe <- recipe(pcnt_pop_without_go_space_access ~ .,
data = train_df
) %>%
# allows lsoa_code to be retained as an id without affecting model results
update_role(lsoa_code, new_role = "ID") %>%
# for extreme right skewed data
step_inverse(
unbuffered_go_space_per_capita, area,
offset = 1
) %>%
# for right skewed data
step_log(
contains("perc"), population_density,
offset = 1
) %>%
# normalisation is required for LASSO
step_normalize(all_numeric(), -all_outcomes()) %>%
# transform urban-rural classification from a string into a format that can
# be used in the model
# 0 = rural, 1 = urban
step_string2factor(rural_urban_classification_2011_2_fold) %>%
step_dummy(rural_urban_classification_2011_2_fold)
# check what training data looks like after running the preprocessing recipe
green_space_prep <- green_space_recipe %>%
  prep(strings_as_factors = FALSE)
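The comment above mentions checking what the training data looks like after preprocessing; one way to do that (a sketch; bake() with new_data = NULL returns the training set retained by prep()) is:
# view the training data after the recipe steps have been applied
green_space_prep %>%
  bake(new_data = NULL) %>%
  glimpse()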
Parameter tuning
Everything is now set up with the data and the feature engineering specification. So, the next step is to tune a key parameter of the LASSO algorithm: the penalty for including a variable in the model. The penalty is non-negative (the default tuning grid used below covers values from effectively 0 up to 1), with a penalty of 0 meaning that a standard linear regression will be run. The modelling workflow is run for multiple values of the penalty, each evaluated on bootstrap resamples of the training set. Then all the performance metrics are collected, so we can see which value of the penalty to use in the final model fit.
# for reproducibility
set.seed(2023)
# bootstrap resamples required for tuning penalty parameter
green_space_boot <- bootstraps(train_df,
strata = rural_urban_classification_2011_2_fold
)
# specify the model
lasso_tuning_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# create a tuning grid
lambda_grid <- grid_regular(penalty(), levels = 50)
# setup the modelling workflow
wf <- workflow() %>%
add_recipe(green_space_recipe) %>%
add_model(lasso_tuning_spec)
# run the parameter tuning
lasso_grid <- tune_grid(
wf,
resamples = green_space_boot,
grid = lambda_grid
)
# view the performance of the model on the training set with different values
# of penalty
lasso_grid %>%
collect_metrics() %>%
ggplot(aes(penalty, mean, color = .metric)) +
geom_line() +
facet_wrap(~.metric, nrow = 2, scales = "free")

The chart above shows the model performing worse as the penalty is increased. I’ll come back to the implications of this later on in the notebook.
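As a numeric cross-check of the chart (a quick sketch), the best-performing penalty values can also be listed directly:
# the grid values with the lowest mean RMSE across the bootstrap resamples
lasso_grid %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  slice_min(mean, n = 5)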
Training the model
Everything is now in place to fit the model on the training data. Before doing that, I need to get the value of the penalty that gives the best model performance. This is very close to zero, so we’ll effectively be running a standard linear regression model.
# get the best performing value of penalty
lowest_rmse <- lasso_grid %>%
  select_best(metric = "rmse")
kable(lowest_rmse)

| penalty | .config |
|---|---|
| 0 | Preprocessor1_Model01 |
With the value of penalty decided upon, the chunk of code below runs the final model fit on the training data, and we can take a look at the model coefficients.
# set up the final modelling workflow (post tuning)
final_lasso <- finalize_workflow(
wf, lowest_rmse
)
# fit the model
lasso_fit <- final_lasso %>%
fit(data = train_df) %>%
extract_fit_parsnip()
# view model coefficients fit
lasso_fit %>%
tidy() %>%
  kable()

| term | estimate | penalty |
|---|---|---|
| (Intercept) | 85.912385 | 0 |
| area | -7.250327 | 0 |
| population_density | -1.148555 | 0 |
| total_pop_from_ethnicity_data | 1.473266 | 0 |
| income_decile | 2.611268 | 0 |
| unbuffered_go_space_per_capita | 21.023485 | 0 |
| perc_bame_pop | -2.177079 | 0 |
| perc_mixed_multiple_ethnic_group_pop | -2.456070 | 0 |
| perc_asian_asian_british_pop | 0.000000 | 0 |
| perc_black_african_caribbean_black_british_pop | 1.395428 | 0 |
| perc_other_ethnic_group_pop | 1.829372 | 0 |
| rural_urban_classification_2011_2_fold_Urban | -25.487460 | 0 |
Finally, in the analysis stage we can look at the importance of each variable included in the model. I’ll come back to the interpretation of these results in the Key findings section below.
p <- lasso_fit %>%
# get the importance of each variable
vi(lambda = lowest_rmse$penalty) %>%
# tidy up variables names for plot
mutate(
importance = abs(Importance),
variable = fct_reorder(Variable, Importance)
) %>%
# plot
ggplot(aes(importance, variable, fill = Sign)) +
geom_col() +
labs(y = NULL)
p

Running the model with test data
Having fit the model to the training data, it is straightforward to run it on the test data and take a look at its performance on unseen data. Again, I’ll come back to the interpretation of these results in the Key findings section immediately below.
metrics <- last_fit(
final_lasso,
data_split
) %>%
collect_metrics()
kable(metrics)

| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 23.6468916 | Preprocessor1_Model1 |
| rsq | standard | 0.4630332 | Preprocessor1_Model1 |
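To see where the error comes from, a quick plot of predicted against observed values on the test set can help (a sketch reusing the objects defined above):
# predicted vs observed values for the test set; points on the dashed line
# would be perfect predictions
last_fit(final_lasso, data_split) %>%
  collect_predictions() %>%
  ggplot(aes(pcnt_pop_without_go_space_access, .pred)) +
  geom_point(alpha = 0.1, colour = colours[1]) +
  geom_abline(linetype = "dashed") +
  labs(x = "Observed (%)", y = "Predicted (%)")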
Key findings
- Whether a neighborhood/LSOA was urban or rural was the most important variable in estimating the proportion of its population without good access to green space. The model estimated that, holding other explanatory variables constant, an urban neighborhood would have a roughly 25 percentage point lower share of its population without good access to green space.
- The amount of green space per capita in a neighborhood was the second most important variable in estimating the proportion of its population without good access to green space. The model predicts that neighborhoods with more green space per capita have higher percentages of people without good access to green space. Similar to the importance of the urban-rural classification noted above, this is likely to be a result of larger, rural LSOAs being less dense, so the distances to travel to publicly accessible green space are likely to be higher than in urban areas.
- Other variables positively associated with higher percentages of a neighborhood's population being without good access to public green space include:
- average income of the population;
- total population of the neighborhood;
- the percentage of the population from a black African, Caribbean or black British ethnic background.
- Applying the model to unseen data:
- the predictive accuracy of the model was relatively poor. The RMSE was approximately 24, on an outcome variable that ranges between 0 and 100.
- the R2 value was 0.46, which could be considered reasonably good for the problem domain and a first iteration of model development.
Next steps
Options to consider in further developing this analysis.
- Creating separate models for urban and rural neighborhoods/LSOAs (see the sketch after this list).
- Consider what other variables could help improve the explanatory/predictive capabilities of the model.
- Consider whether alternative (non regression based) analytical methods could help in understanding the data better.
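As a starting point for the first of these, a minimal sketch of fitting separate models to the urban and rural subsets is below. It is untested against this data and makes some assumptions: the urban-rural dummy has to be dropped from the recipe (it is constant within each subset), I skip the skew transformations, and I reuse the tuned penalty from above.
# a minimal sketch: fit a separate model to the urban and the rural subsets
fit_subset <- function(df) {
  subset_split <- initial_split(df)
  # rebuild the recipe without the urban-rural classification, which is
  # constant within each subset
  subset_recipe <- recipe(pcnt_pop_without_go_space_access ~ .,
    data = training(subset_split)
  ) %>%
    update_role(lsoa_code, new_role = "ID") %>%
    step_rm(rural_urban_classification_2011_2_fold) %>%
    step_normalize(all_numeric_predictors())
  workflow() %>%
    add_recipe(subset_recipe) %>%
    add_model(linear_reg(penalty = lowest_rmse$penalty, mixture = 1) %>%
      set_engine("glmnet")) %>%
    last_fit(subset_split) %>%
    collect_metrics()
}
# compare test-set performance for the two subsets
foe_green_space_urban_rural_model %>%
  split(.$rural_urban_classification_2011_2_fold) %>%
  map(fit_subset)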