Introduction:

The modern economy is inseparable from energy produced from fossil fuels, which are hazardous and combustible by nature. The recent diesel spill by the Russian company Norilsk Nickel in the Arctic and the Gulf of Mexico oil spill in 2010 were both irreversible disasters that caused permanent harm to the local environment and economy. Though a pipeline accident is an unlikely event, when one does occur, the economic and environmental damage is a reminder of the fragility of our ecosystem and the destructive power of human behavior.

To reduce such incidents, the Pipeline and Hazardous Materials Safety Administration (PHMSA) operates under the US Department of Transportation to enforce pipeline safety standards. All pipeline operators in the US are required to submit an incident report to PHMSA within 30 days of a pipeline incident or accident. The report collects information on the operating facility, location, cause of the incident, remedial response, and remedial costs. Using these reports, we investigate which variables in an incident are associated with higher remedial costs, while taking into account the different pipeline types and liquid types involved.

Methods:

We cleaned the raw PHMSA data, which contain pipeline incident reports filed by US pipeline operators from 2010 to 2017, to obtain the dataset used in our analysis. The raw data contain 2795 observations, each representing one incident report, and more than 48 variables giving detailed information on the pipeline operator, location, operating facility, incident details, and injuries and fatalities. Since we are interested in which factors of an incident influence the remedial costs, we removed variables such as location longitude/latitude and incident datetime, which are not essential to the purpose of our study.

We also removed variables that contain excessive 0 and N/A values, such as the number of fatalities and the number of public evacuations. The raw data also break down the total remedial costs into public/private property damage costs, emergency response costs, environmental remediation costs, and other costs. For simplicity, we take the sum of these four components as the response we want to investigate. The cleaned dataset contains 2229 rows.

In the analysis, we use the log-transformed total cost as the response variable, since the distribution of total costs is heavily right-skewed. We also log-transform the explanatory variable Unintentional Release Barrel because its distribution is right-skewed as well. Because both variables contain 0 entries, we take the logarithm of each value plus a constant of 1 so that the log-transformed values are defined.
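
As a minimal sketch of this transformation (the cost values below are made up for illustration and are not taken from the PHMSA data), Python's `math.log1p` computes log(1 + x), maps a value of 0 to 0, and is inverted by `math.expm1`:

```python
import math

# Hypothetical total costs in dollars, including a zero-cost report.
total_costs = [0.0, 5_000.0, 26_000.0, 1_200_000.0]

# log1p(x) = log(1 + x), so a zero cost maps to 0 instead of -infinity.
log_costs = [math.log1p(cost) for cost in total_costs]

# expm1 undoes the transform and recovers the original dollar amounts.
recovered = [math.expm1(value) for value in log_costs]

for cost, log_cost in zip(total_costs, log_costs):
    print(f"{cost:>12,.0f} -> {log_cost:.3f}")
```

The same pair of functions applies to the release volume, which also contains zeros.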

The following is a description of variables in the dataset:

Results:

Exploratory Data Analysis

Exploratory analysis reveals that, before log transformation, the total cost ranges from no cost at all to roughly 840 million dollars, which is eight times larger than the second most costly accident in the data. The median cost of all incidents is 26,064 dollars and the mean cost is 963,342 dollars. The large gap between the mean and the median suggests that the distribution is right-skewed. The log-transformed total cost, logCost, has a median of 10.168 and a mean of 10.141. Unintentional releases range from 0 to 30,565 barrels. The log-transformed release, logRelease, has a median of 1.224 and a mean of 2.031.

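| Variable | Min | Q1 | Median | Mean | Q3 | Max |
| --- | --- | --- | --- | --- | --- | --- |
| Total Cost (dollars) | 0 | 5533.000 | 26064.000 | 963341.741 | 130030.000 | 840526118 |
| log(Total Cost) | 0 | 8.619 | 10.168 | 10.141 | 11.776 | 20.550 |
| Total Release (barrels) | 0 | 0.500 | 2.400 | 211.502 | 20.000 | 30565.0 |
| log(Release) | 0 | 0.405 | 1.224 | 2.031 | 3.045 | 10.328 |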
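
Because log(1 + x) is monotone increasing, the quartiles and maximum of a log-transformed variable are the transforms of the raw quartiles and maximum, provided each quartile falls on an observed value rather than being interpolated. The sketch below uses that property as a consistency check on the raw Total Cost and Total Release figures reported above. The means are left out, since the mean of a logarithm is not the logarithm of the mean.

```python
import math

# Raw summary statistics as reported for the untransformed variables.
raw_summary = {
    "Total Cost": {"Q1": 5533.0, "Median": 26064.0, "Q3": 130030.0, "Max": 840526118.0},
    "Total Release": {"Q1": 0.5, "Median": 2.4, "Q3": 20.0, "Max": 30565.0},
}

# Apply log(1 + x) to each raw order statistic and round to three decimals.
log_summary = {
    name: {stat: round(math.log1p(value), 3) for stat, value in stats.items()}
    for name, stats in raw_summary.items()
}

for name, stats in log_summary.items():
    print(name, stats)
```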

A total of 229 pipeline operators filed the 2229 reports contained in this dataset. Six of the 229 operators filed more than 100 incident reports, while 137 filed fewer than 5. The 2229 observations represent pipeline incidents from 43 states. The number of incidents varies by state, depending on how many pipeline operations each state has. For example, operators in Texas filed 817 incident reports, the largest number of any state and roughly four times that of the second largest contributor. However, since we are not interested in investigating state-level factors, we do not remove the Texas observations from the data.

Exploratory analysis also indicates interesting relationships between logCost and each categorical explanatory variable. Among the 10 incidents with the highest logCost, all involve underground pipelines and 6 involve crude oil. Visualizations of the entire dataset likewise suggest that the distribution of logCost differs by pipeline type and by liquid type. As shown in Figure 2, logCost for underground pipelines has a higher center than for other pipeline types, suggesting that underground pipelines tend to produce more costly incidents. There are exceptions, such as the outliers at the bottom of the underground pipeline boxplot, which indicate incidents with minimal cost. logCost is also distributed differently by liquid type. Although the differences between the centers of the distributions are less pronounced, the crude oil category shows a large number of logCost outliers and a larger variance. The Carbon Dioxide category has the lowest center, which is consistent with the liquid’s greater chemical stability.

Pipeline.Type and Liquid.Type also interact with the indicator variables Liquid.Ignition, Liquid.Explosion, and Pipeline.Shutdown, which describe events that reflect the severity of an incident. Taking the interaction between Pipeline.Type and Pipeline.Shutdown as an example, Figure 3 shows that the center of the logCost distribution tends to be higher in incidents that led to a pipeline shutdown than in incidents that did not. Box plots of logCost by pipeline type and liquid type, faceted by Liquid.Ignition and Liquid.Explosion, show similar differences. In incidents where the liquid ignited or exploded, or where the pipeline was shut down, the center of the logCost distribution tends to be higher than in incidents where these events did not occur.

Another important predictor of logCost is logRelease, the log-transformed number of barrels unintentionally released. As the scatterplot in Figure 4 suggests, logCost is positively correlated with logRelease. The least squares lines in Figure 4 also suggest that logRelease interacts with Liquid.Ignition, Liquid.Explosion, and Pipeline.Shutdown: the lines relating logRelease to logCost are steeper when ignition, explosion, or shutdown occurred.

Visualizations of the relationship between logCost and logRelease also indicate that the influence of incident causes (Cause.Category) and sub-causes (Cause.Subcategory) is worth investigating. The data contain 38 levels of Cause.Subcategory, each nested within a level of the higher-level Cause.Category. Each least squares line in Figure 5 represents one level of Cause.Subcategory, and each plot represents one level of Cause.Category. The intercepts and slopes of the least squares lines vary, but lines within the same plot tend to be closer to parallel.

Final Model:

Given the insights gained from the exploratory analysis, we start with a two-level linear mixed effects model with Cause.Subcategory as the second level and explore the possibility of expanding it to a three-level model by including Cause.Category. After selecting the appropriate fixed effects and random effects, we attempted to add a random intercept at the level of Cause.Category on top of Cause.Subcategory. However, a likelihood ratio test did not support the additional third-level random intercept. An attempt to include a random slope at the level of Cause.Category on top of Cause.Subcategory also failed in simulation. We therefore keep the two-level model for the rest of the analysis.
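
The comparison of nested models described above can be sketched as follows. The log-likelihood values are placeholders rather than results from this study, and the function names are illustrative. The sketch assumes a single additional variance parameter, so the statistic is referred to a chi-square distribution with 1 degree of freedom. Because a variance is tested on the boundary of its parameter space, a common adjustment, also shown here, halves that p-value.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    # With 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0)) if x > 0 else 1.0

def likelihood_ratio_test(loglik_small, loglik_large):
    """Return the LRT statistic, the 1-df p-value, and the boundary-adjusted p-value."""
    stat = max(0.0, 2.0 * (loglik_large - loglik_small))
    p_naive = chi2_sf_1df(stat)
    # Boundary adjustment: 50:50 mixture of a point mass at 0 and chi-square(1).
    p_boundary = 0.5 * p_naive if stat > 0 else 1.0
    return stat, p_naive, p_boundary

# Placeholder log-likelihoods for the two-level and three-level models.
stat, p_naive, p_boundary = likelihood_ratio_test(-4000.0, -3999.6)
print(f"LRT statistic = {stat:.3f}, p = {p_naive:.3f}, adjusted p = {p_boundary:.3f}")
```

A large p-value under either version means the extra random effect does not improve the fit enough to justify the added level.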

At the fixed effects selection stage, we modified Pipeline.Type and Liquid.Type, since the t-statistics suggest that not all levels of these two variables are significant. In Pipeline.Type, we combined Transition Area, the level without a significant t-statistic, with the reference level Aboveground. Similarly, in Liquid.Type, we combined the insignificant levels with the reference level Biofuel/Alternative Fuel, so that the significant level Carbon Dioxide can be distinguished. While identifying the appropriate fixed and random effects, we also resolved a correlation of -1 between the random slope of logRelease and the random intercept at the Cause.Subcategory level. The approach we used did not noticeably change any of the fixed effect coefficients.

We used Cook’s distance and leverage to identify potential outliers. Based on Cook’s distance, we removed cases 2206, 2180, and 279. Removing these high Cook’s distance cases noticeably changed the fixed effect coefficients and the random effect standard deviations.
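
For intuition about what Cook’s distance measures, the sketch below computes it for an ordinary least squares line with a single predictor. This is a deliberate simplification: the analysis here uses a mixed effects model, for which influence diagnostics are computed differently. The data points and the 4/n flagging threshold are illustrative assumptions, not the values or the rule used in this study.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple linear regression y ~ x."""
    n = len(x)
    p = 2  # number of coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(r ** 2 for r in residuals) / (n - p)
    distances = []
    for xi, r in zip(x, residuals):
        # Leverage of a point in simple linear regression.
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        distances.append(r ** 2 / (p * mse) * leverage / (1.0 - leverage) ** 2)
    return distances

# Made-up data: the last point lies far from the trend of the others.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [7.0, 7.6, 8.1, 8.4, 9.1, 9.5, 6.0]

d = cooks_distances(x, y)
flagged = [i for i, di in enumerate(d) if di > 4 / len(x)]
print("flagged cases:", flagged)
```

A point is influential when it combines a large residual with high leverage, which is why the isolated last point stands out.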

The final model is presented below, where i indexes individual incident reports and j indexes Cause.Subcategory levels. Tank, Underground, and OtherLiquid are indicator variables for the corresponding levels of Pipeline.Type and Liquid.Type:

Level 1 Individual Case

\(Y_{ij} = a_{ij} + b_{ij}\,\text{logRelease}_{ij} + d_j\,\text{LiquidExplosion}_{ij} + e_j\,\text{LiquidIgnition}_{ij} + \epsilon_{ij}\)

Level 2 Cause.Subcategory

\(a_{ij} = \alpha_0 + \alpha_1\,\text{Tank}_{ij} + \alpha_2\,\text{Underground}_{ij} + \alpha_3\,\text{OtherLiquid}_{ij} + \alpha_4\,\text{Tank}_{ij} \times \text{LiquidIgnition}_{ij} + \alpha_5\,\text{Underground}_{ij} \times \text{LiquidIgnition}_{ij} + u_j\)

\(b_{ij} = \beta_0 + \beta_1\,\text{LiquidIgnition}_{ij} + v_j\)

\(d_j = \delta_0 + z_j\)

\(e_j = \theta_0 + m_j\)

Composite

\(Y_{ij} = \alpha_0 + \alpha_1\,\text{Tank}_{ij} + \alpha_2\,\text{Underground}_{ij} + \alpha_3\,\text{OtherLiquid}_{ij} + \alpha_4\,\text{Tank}_{ij} \times \text{LiquidIgnition}_{ij} + \alpha_5\,\text{Underground}_{ij} \times \text{LiquidIgnition}_{ij} + \beta_0\,\text{logRelease}_{ij} + \beta_1\,\text{LiquidIgnition}_{ij} \times \text{logRelease}_{ij} + \delta_0\,\text{LiquidExplosion}_{ij} + \theta_0\,\text{LiquidIgnition}_{ij} + u_j + v_j\,\text{logRelease}_{ij} + z_j\,\text{LiquidExplosion}_{ij} + m_j\,\text{LiquidIgnition}_{ij} + \epsilon_{ij}\)

The estimate, standard error, and t-value of each fixed effect in the model are shown below:

| Term | Estimate | Std. Error | t value |
| --- | --- | --- | --- |
| (Intercept) | 6.794 | 0.452 | 15.043 |
| logRelease | 0.532 | 0.034 | 15.571 |
| Pipeline.Type2 (TANK) | 0.794 | 0.139 | 5.710 |
| Pipeline.Type2 (UNDERGROUND) | 1.452 | 0.112 | 12.973 |
| Liquid.Type2 (Other Liquid) | 1.897 | 0.425 | 4.458 |
| Liquid.Ignition (YES) | 2.087 | 0.540 | 3.864 |
| Liquid.Explosion (YES) | -3.047 | 1.825 | -1.670 |
| logRelease:Liquid.Ignition (YES) | 0.093 | 0.123 | 0.759 |
| Pipeline.Type2 (TANK) * Liquid.Ignition (YES) | -2.217 | 0.797 | -2.781 |
| Pipeline.Type2 (UNDERGROUND) * Liquid.Ignition (YES) | -1.183 | 0.611 | -1.935 |
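
To make the table concrete, the following sketch evaluates only the fixed-effects part of the model for a single incident, with every Cause.Subcategory random effect set to zero. The function name, the argument names, and the example incident are illustrative assumptions. The coefficients are copied from the table above.

```python
import math

# Fixed-effect estimates copied from the table above.
COEF = {
    "intercept": 6.794,
    "log_release": 0.532,
    "tank": 0.794,
    "underground": 1.452,
    "other_liquid": 1.897,
    "ignition": 2.087,
    "explosion": -3.047,
    "log_release_x_ignition": 0.093,
    "tank_x_ignition": -2.217,
    "underground_x_ignition": -1.183,
}

def predict_log_cost(release_barrels, pipeline_type, other_liquid, ignition, explosion):
    """Fixed-effects prediction of logCost = log(1 + total cost)."""
    log_release = math.log1p(release_barrels)
    tank = pipeline_type == "TANK"
    underground = pipeline_type == "UNDERGROUND"
    # Booleans act as 0/1 indicator variables in the arithmetic below.
    return (
        COEF["intercept"]
        + COEF["log_release"] * log_release
        + COEF["tank"] * tank
        + COEF["underground"] * underground
        + COEF["other_liquid"] * other_liquid
        + COEF["ignition"] * ignition
        + COEF["explosion"] * explosion
        + COEF["log_release_x_ignition"] * log_release * ignition
        + COEF["tank_x_ignition"] * (tank and ignition)
        + COEF["underground_x_ignition"] * (underground and ignition)
    )

# Hypothetical incident: 20 barrels released from an underground pipeline
# carrying a liquid other than carbon dioxide, with no ignition or explosion.
log_cost = predict_log_cost(20.0, "UNDERGROUND", True, False, False)
print(f"predicted logCost = {log_cost:.3f}, cost = {math.expm1(log_cost):,.0f} dollars")
```

Back-transforming with `expm1` gives a typical cost on the dollar scale rather than an expected cost, because the model is fitted on the log scale.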

Our model indicates that logCost increases by 0.532 for every unit increase in logRelease; that is, the total cost is multiplied by about 1.70 when the number of barrels released increases by a factor of e = 2.718. Certain levels of pipeline type and liquid type are also useful predictors of logCost. logCost increases by 0.794 when the pipeline type is tank and by 1.452 when the pipeline type is underground, relative to the combined reference group of aboveground and transition area pipelines. On the original scale, the total remedial cost is about 2.212 times higher in incidents involving tanks and about 4.27 times higher in incidents involving underground pipelines. Similarly, logCost increases by 1.897 units when the liquid involved in the incident is classified as Other Liquid, a combined class we created that includes biofuel, crude oil, HVL/inflammable, and refined oil, as opposed to Carbon Dioxide. This suggests that when the liquid involved in the accident is not carbon dioxide, the total remedial cost is about 6.67 times higher.
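
The step from a coefficient on the log scale to a multiplicative effect can be written out as a short derivation. Since the response is log(1 + Cost), the multiplier applies exactly to 1 + Cost and approximately to Cost itself whenever costs are large relative to 1 dollar. For an indicator variable with coefficient \(\beta\), comparing an incident with the indicator on (cost \(C_1\)) to an otherwise identical incident with it off (cost \(C_0\)):

\(\log(1 + C_1) - \log(1 + C_0) = \beta \quad\Rightarrow\quad \frac{1 + C_1}{1 + C_0} = e^{\beta}\)

\(e^{0.794} \approx 2.212, \qquad e^{1.452} \approx 4.27, \qquad e^{1.897} \approx 6.67\)

For logRelease, with release volume \(R\) and the other variables held fixed, the same argument gives a power relationship:

\(1 + C \;\propto\; (1 + R)^{0.532}\)

so multiplying \(1 + R\) by \(e\) multiplies \(1 + C\) by \(e^{0.532} \approx 1.70\), and doubling \(1 + R\) multiplies \(1 + C\) by \(2^{0.532} \approx 1.45\).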

Events such as liquid ignition and liquid explosion can also affect the total remedial cost of an incident. In the exploratory analysis, the visualizations led us to suspect that pipeline shutdown would also affect the total cost. However, we did not include Pipeline.Shutdown in our final model, because its variability may already be accounted for by other variables such as Pipeline.Type or Liquid.Type. The model predicts that when a liquid explosion occurs, logCost decreases by 3.047 units, which corresponds to multiplying the total cost by about 0.048, although with a t-value of -1.670 this estimate is imprecise.

The final model also suggests that Liquid.Ignition interacts with both logRelease and Pipeline.Type. Although the coefficient of Liquid.Ignition is positive, its combined effect with Pipeline.Type is not always positive. The figures below hold logRelease at 0. When the pipeline type involved in the incident is neither tank nor underground, logCost increases by 2.087 units when liquid ignition occurs, which means the total cost is multiplied by 8.06. In incidents that involve tanks, we add the coefficient of the Pipeline.Type (Tank) and Liquid.Ignition interaction, -2.217, to the coefficient of Liquid.Ignition. The result is that logCost decreases by 0.13, so the total cost is multiplied by 0.878, a decrease of about 12%. When the pipeline type involved in the incident is underground, we likewise add the coefficient of the Pipeline.Type (Underground) and Liquid.Ignition interaction, -1.183, to the coefficient of Liquid.Ignition. In that case, logCost increases by a smaller amount, 0.904, which means the total cost is multiplied by 2.469.
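
The arithmetic behind these combined effects can be sketched as follows. The coefficients come from the fixed effects table. The label `"OTHER"` for the combined aboveground and transition area reference group and the function name are illustrative assumptions. The optional `log_release` argument shows how the logRelease interaction would enter; it defaults to 0.

```python
import math

IGNITION = 2.087                  # main effect of Liquid.Ignition (YES)
LOG_RELEASE_X_IGNITION = 0.093    # logRelease:Liquid.Ignition (YES)
IGNITION_BY_TYPE = {              # Pipeline.Type2 * Liquid.Ignition (YES)
    "OTHER": 0.0,                 # reference group, no interaction term
    "TANK": -2.217,
    "UNDERGROUND": -1.183,
}

def ignition_effect(pipeline_type, log_release=0.0):
    """Change in logCost when ignition occurs, for a pipeline type and release size."""
    return (
        IGNITION
        + IGNITION_BY_TYPE[pipeline_type]
        + LOG_RELEASE_X_IGNITION * log_release
    )

for pipeline_type in IGNITION_BY_TYPE:
    delta = ignition_effect(pipeline_type)
    # exp(delta) is the factor by which the total cost is multiplied.
    print(f"{pipeline_type:<12} delta logCost = {delta:+.3f}, multiplier = {math.exp(delta):.3f}")
```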

Discussion:

In this study we focused on predicting how costly a pipeline incident would be, based on information about the pipeline operation, such as liquid type and pipeline type, and details of the incident, such as liquid explosion, liquid ignition, and the number of barrels lost. The relationships we have found are relevant to policy makers and pipeline insurance companies when estimating the potential monetary loss in the event of an incident. The positive association between cost and the number of barrels released suggests that safety procedures that limit barrel loss during an incident are crucial. The positive association between total cost and certain liquid types and pipeline types also indicates that certain pipeline operations are more susceptible to costly incidents than others.

One limitation of this study is the limited set of variables and the missing values within them. Pipeline operations are complicated facilities that involve many metrics, and a pipeline incident involves various factors that are not accounted for in this study. Certain variables in the raw data contain many NA values and had to be excluded from the dataset used in the study. The self-reporting system that PHMSA imposes on operators also leaves room for inaccurate reporting. In addition, the 30-day reporting window suggests that the remedial costs reported may be underestimated.

In future studies, researchers could gather demographic and economic data at the state and city levels to fit other linear mixed effects models. Another potential study would gather data on pipeline operators and use the datetime information available in this dataset to build zero-inflated Poisson models that estimate the number of incidents expected to occur.