Oil Pipeline Leak or Spill Incidents Data Analysis

Introduction

A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Simply said, data frame is a table of data set with columns and rows. Each element of the column lists refers to variable, and of the row lists refers to observation. Unlike matrices, data frames can store different classes of object in each column (i.e. numeric, character, factor). In this section, I will perform basic operation with a data frame.

About data set

The data set that I’ll be working on includes a record for each oil pipeline leak or spill reported to the Pipeline and Hazardous Materials Safety Administration since 2010 until 2017. These records include the incident date and time, operator and pipeline, cause of incident, type of hazardous liquid and quantity lost, injuries and fatalities, and associated costs.

Variables Description
Operator Name (chr) E.g: Shell Pipeline Co., Mobil Corp
Pipeline/Facility Name (chr) E.g: Iowa Pump Station
Pipeline Location (chr) Onshore/ Offshore
Pipeline Type (chr) Underground/Above ground/Tank/Transition Area

Objectives

The objectives are:

  1. To find out the most common place of oil pipeline leak or spill incidents occurred using data visualization.
  2. To learn how to perform data preparation using dplyr package.
  3. To learn how to plot bar chart using ggplot2 package.

Packages Information

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Data Preparation

1. READ CSV FILES

database <- read.csv("database.csv")

2. PEEK AT THE DATA SETS

In data analysis, it is common to work with large data sets. It is important as a data analyst to develop a clear understanding of the structure and main elements of the data set. Let’s briefly look at the data sets. How can you do so?

  • The str() function gives you an overview of your data by telling you the following:

    • The total number of observations
    • The total number of variables
    • A full list of the variables names
    • The data type of each variable
    • The first observations
  • In the original data set, there are 2795 observations and 48 variables.

str(database)
## 'data.frame':    2795 obs. of  48 variables:
##  $ Report.Number                       : int  20100016 20100254 20100038 20100260 20100030 20100021 20110036 20100255 20100261 20100024 ...
##  $ Supplemental.Number                 : int  17305 17331 17747 18574 16276 17161 18052 18584 18050 18390 ...
##  $ Accident.Year                       : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Accident.Date.Time                  : chr  "1/1/2010 7:15 AM" "1/4/2010 8:30 AM" "1/5/2010 10:30 AM" "1/6/2010 7:30 PM" ...
##  $ Operator.ID                         : int  32109 15786 20160 11169 300 11169 26041 12624 26041 31684 ...
##  $ Operator.Name                       : chr  "ONEOK NGL PIPELINE LP" "PORTLAND PIPELINE CORP" "PETROLOGISTICS OLEFINS, LLC" "ENBRIDGE ENERGY, LIMITED PARTNERSHIP" ...
##  $ Pipeline.Facility.Name              : chr  "KINDER MORGAN JCT" "24-INCH MAIN LINE" "" "SUPERIOR TERMINAL" ...
##  $ Pipeline.Location                   : chr  "ONSHORE" "ONSHORE" "ONSHORE" "ONSHORE" ...
##  $ Pipeline.Type                       : chr  "ABOVEGROUND" "ABOVEGROUND" "ABOVEGROUND" "UNDERGROUND" ...
##  $ Liquid.Type                         : chr  "HVL OR OTHER FLAMMABLE OR TOXIC FLUID, GAS" "CRUDE OIL" "HVL OR OTHER FLAMMABLE OR TOXIC FLUID, GAS" "CRUDE OIL" ...
##  $ Liquid.Subtype                      : chr  "LPG (LIQUEFIED PETROLEUM GAS) / NGL (NATURAL GAS LIQUID)" "" "OTHER HVL" "" ...
##  $ Liquid.Name                         : chr  "" "" "ETHANE" "" ...
##  $ Accident.City                       : chr  "MCPHERSON" "RAYMOND" "SULPHER" "SUPERIOR" ...
##  $ Accident.County                     : chr  "MCPHERSON" "CUMBERLAND" "CALCASIEU" "DOUGLAS" ...
##  $ Accident.State                      : chr  "KS" "ME" "LA" "WI" ...
##  $ Accident.Latitude                   : num  38.7 43.9 30.2 46.7 33.6 ...
##  $ Accident.Longitude                  : num  -97.8 -70.5 -93.4 -92.1 -96.6 ...
##  $ Cause.Category                      : chr  "INCORRECT OPERATION" "MATERIAL/WELD/EQUIP FAILURE" "MATERIAL/WELD/EQUIP FAILURE" "NATURAL FORCE DAMAGE" ...
##  $ Cause.Subcategory                   : chr  "PIPELINE/EQUIPMENT OVERPRESSURED" "PUMP OR PUMP-RELATED EQUIPMENT" "DEFECTIVE OR LOOSE TUBING/FITTING" "TEMPERATURE" ...
##  $ Unintentional.Release..Barrels.     : num  21 0.12 2 0.48 700 ...
##  $ Intentional.Release..Barrels.       : num  0.1 0 0 0 NA 0 0 0 0 0 ...
##  $ Liquid.Recovery..Barrels.           : num  0 0.12 0 0.48 698 ...
##  $ Net.Loss..Barrels.                  : num  21 0 2 0 2 ...
##  $ Liquid.Ignition                     : chr  "NO" "NO" "NO" "NO" ...
##  $ Liquid.Explosion                    : chr  "NO" "NO" "NO" "NO" ...
##  $ Pipeline.Shutdown                   : chr  "NO" "" "" "" ...
##  $ Shutdown.Date.Time                  : chr  "" "" "" "" ...
##  $ Restart.Date.Time                   : chr  "" "" "" "" ...
##  $ Public.Evacuations                  : int  NA NA NA NA NA 0 NA NA NA NA ...
##  $ Operator.Employee.Injuries          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Operator.Contractor.Injuries        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Emergency.Responder.Injuries        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Other.Injuries                      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Public.Injuries                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ All.Injuries                        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Operator.Employee.Fatalities        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Operator.Contractor.Fatalities      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Emergency.Responder.Fatalities      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Other.Fatalities                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Public.Fatalities                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ All.Fatalities                      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Property.Damage.Costs               : int  110 4000 0 200 20000 76940 0 400 0 0 ...
##  $ Lost.Commodity.Costs                : int  1517 8 200 40 150 167775 400 13 336 50 ...
##  $ Public.Private.Property.Damage.Costs: int  0 0 0 0 0 150000 0 0 0 0 ...
##  $ Emergency.Response.Costs            : int  0 0 0 11300 7500 1800000 0 0 0 10000 ...
##  $ Environmental.Remediation.Costs     : int  0 0 0 0 2000 2000000 70000 0 40000 10000 ...
##  $ Other.Costs                         : int  0 0 0 0 0 0 0 0 0 10000 ...
##  $ All.Costs                           : int  1627 4008 200 11540 29650 4194715 70400 413 40336 30050 ...

3. CLEAN THE DATA SETS

Not all data in the data sets are needed in a project. It is important to understand the data required to answer your questions. In this case, I want to select the following columns only:

  • Operator Name
  • Pipeline Facility Name
  • Pipeline Location
  • Pipeline Type

using dplyr package and select() function.

df <- database %>%
  select(Operator.Name, 
         Pipeline.Facility.Name, 
         Pipeline.Location, 
         Pipeline.Type)

4. SHOW FIRST 6 ROWS

Another way to understand your data is to show only small part of the entire data set. How to do so in R?

  • Using head() function enables you to take a peek at the first 6 rows of the data. (default)
head(df)
##                          Operator.Name Pipeline.Facility.Name Pipeline.Location
## 1                ONEOK NGL PIPELINE LP      KINDER MORGAN JCT           ONSHORE
## 2               PORTLAND PIPELINE CORP      24-INCH MAIN LINE           ONSHORE
## 3          PETROLOGISTICS OLEFINS, LLC                                  ONSHORE
## 4 ENBRIDGE ENERGY, LIMITED PARTNERSHIP      SUPERIOR TERMINAL           ONSHORE
## 5                PLAINS PIPELINE, L.P.         RED RIVER EAST           ONSHORE
## 6 ENBRIDGE ENERGY, LIMITED PARTNERSHIP                                  ONSHORE
##   Pipeline.Type
## 1   ABOVEGROUND
## 2   ABOVEGROUND
## 3   ABOVEGROUND
## 4   UNDERGROUND
## 5   UNDERGROUND
## 6   UNDERGROUND

5. SHOW FIRST N ROWS

To show the first n rows of data set, you can add another argument, n in the head() function. In this case, I want to see the first 10 rows of the data set.

head(df, n = 10) 
##                           Operator.Name Pipeline.Facility.Name
## 1                 ONEOK NGL PIPELINE LP      KINDER MORGAN JCT
## 2                PORTLAND PIPELINE CORP      24-INCH MAIN LINE
## 3           PETROLOGISTICS OLEFINS, LLC                       
## 4  ENBRIDGE ENERGY, LIMITED PARTNERSHIP      SUPERIOR TERMINAL
## 5                 PLAINS PIPELINE, L.P.         RED RIVER EAST
## 6  ENBRIDGE ENERGY, LIMITED PARTNERSHIP                       
## 7   KINDER MORGAN LIQUID TERMINALS, LLC                       
## 8                            MOBIL CORP           HULL STATION
## 9   KINDER MORGAN LIQUID TERMINALS, LLC                       
## 10                       CONOCOPHILLIPS              TANK 1501
##    Pipeline.Location Pipeline.Type
## 1            ONSHORE   ABOVEGROUND
## 2            ONSHORE   ABOVEGROUND
## 3            ONSHORE   ABOVEGROUND
## 4            ONSHORE   UNDERGROUND
## 5            ONSHORE   UNDERGROUND
## 6            ONSHORE   UNDERGROUND
## 7            ONSHORE          TANK
## 8            ONSHORE   ABOVEGROUND
## 9            ONSHORE   ABOVEGROUND
## 10           ONSHORE          TANK

6. SHOW LAST N ROWS

Similar to head() function, you can use tail() function to show the last N row of data set.

tail(df, n = 10)
##                                Operator.Name
## 2786                  MARATHON PIPE LINE LLC
## 2787                CCPS TRANSPORTATION, LLC
## 2788           ENTERPRISE CRUDE PIPELINE LLC
## 2789                PHILLIPS 66 PIPELINE LLC
## 2790 HOLLY ENERGY PARTNERS - OPERATING, L.P.
## 2791          TC OIL PIPELINE OPERATIONS INC
## 2792                  EXXONMOBIL PIPELINE CO
## 2793       ENBRIDGE STORAGE (CUSHING) L.L.C.
## 2794                  MARATHON PIPE LINE LLC
## 2795           ENTERPRISE CRUDE PIPELINE LLC
##                        Pipeline.Facility.Name Pipeline.Location Pipeline.Type
## 2786                           ELWOOD STATION           ONSHORE   ABOVEGROUND
## 2787                      LINE 59 KEY STATION           ONSHORE   ABOVEGROUND
## 2788                            ECHO MANIFOLD           ONSHORE   ABOVEGROUND
## 2789                   NT-90, PREDO TO CARSON           ONSHORE   ABOVEGROUND
## 2790                          RUSSELL STATION           ONSHORE   ABOVEGROUND
## 2791                                 KEYSTONE           ONSHORE   ABOVEGROUND
## 2792 BRRF - CHOCTAW ETHANE/PROPANE MIX SYSTEM           ONSHORE   UNDERGROUND
## 2793                 CUSHING CENTRAL TERMINAL           ONSHORE          TANK
## 2794                          MIDLAND STATION           ONSHORE   UNDERGROUND
## 2795                         MARSHALL STATION           ONSHORE   ABOVEGROUND

7. NUMBER OF ALL OPERATORS

By using unique() function, you can return a vector of unique elements of a column. Then, use length() function to find out the total number of elements in a vector. From the “Operator Name” column, we find out that there are a total of 229 operators involved in oil pipeline leak or spill incidents.

cat("The total of operators involved in the oil pipeline leak/spill incidents: ", length(unique(df$Operator.Name))) 
## The total of operators involved in the oil pipeline leak/spill incidents:  229

8. TOP 10 OPERATORS

Get a list of top 10 operators in which their pipelines caused most incidents using:

  1. dplyr package
  2. count() funtion : To count the number of incident, n
  3. head() function : To show only top 10 elements
top_operators <- df %>%
  count(df$Operator.Name, sort =  TRUE) %>%
  head(n = 10)
top_operators
##                       df$Operator.Name   n
## 1        ENTERPRISE CRUDE PIPELINE LLC 195
## 2                 SUNOCO PIPELINE L.P. 180
## 3                PLAINS PIPELINE, L.P. 156
## 4    ENTERPRISE PRODUCTS OPERATING LLC 155
## 5        MAGELLAN PIPELINE COMPANY, LP 140
## 6                 COLONIAL PIPELINE CO 136
## 7                 BUCKEYE PARTNERS, LP 115
## 8  KINDER MORGAN LIQUID TERMINALS, LLC  87
## 9               MARATHON PIPE LINE LLC  87
## 10            PHILLIPS 66 PIPELINE LLC  80

9. PIPELINE/FACILITY NAME WITH INCIDENT FOR AN OPERATOR

  1. filter() function
    • To filter your selected operator.
  2. select() function
    • To select a column (use negative sign(-) to drop a column).
  3. mutate_if() function
    • To replace any blank row/column to NA.
  4. count() function
    • To count total of unique elements.
top_pipeline <- df %>%
  filter(df$Operator.Name == 'ENTERPRISE CRUDE PIPELINE LLC') %>%
  select(Pipeline.Facility.Name) %>%
  mutate_if(is.character, list(~na_if(.,""))) %>%
  count(Pipeline.Facility.Name, sort = TRUE)
head(top_pipeline)
##   Pipeline.Facility.Name  n
## 1                   <NA> 11
## 2        MIDLAND STATION 10
## 3    JONES CREEK STATION  9
## 4   TEXAS CITY TANK FARM  7
## 5          HOBBS STATION  6
## 6                    C18  5

In above data set, we can see that for Enterprise Crude Pipeline LLC, the incidents mostly occurred at Midland Station.

10. LOCATION WHERE MOST INCIDENTS OCCUR

Using similar function as the previous one, we can also clean the data set to get the location of pipeline data only (offshore/onshore).

top_location <- df %>%
  select(Pipeline.Location) %>%
  mutate_if(is.character, list(~na_if(.,""))) %>%
  count(Pipeline.Location, sort = TRUE)
top_location
##   Pipeline.Location    n
## 1           ONSHORE 2777
## 2          OFFSHORE   18

In above data set, most incidents occurred onshore.

11. TYPE OF PIPELINE WHERE MOST INCIDENTS OCCUR

Using similar function as the previous one, we can also clean the data set to get the type of pipeline data only.

top_pipeline_type <- df %>%
  select(Pipeline.Type) %>%
  mutate_if(is.character, list(~na_if(.,""))) %>%
  count(Pipeline.Type, sort = TRUE)
top_pipeline_type
##     Pipeline.Type    n
## 1     ABOVEGROUND 1475
## 2     UNDERGROUND  985
## 3            TANK  301
## 4            <NA>   18
## 5 TRANSITION AREA   16

In above data set, most incidents occurred mostly at above ground, and the least one occurred at transition area.

Data Analysis

ggplot(top_location, aes(x = reorder(Pipeline.Location, -n), y = n, color = reorder(Pipeline.Location, -n), fill = reorder(Pipeline.Location, -n))) + 
  geom_bar(stat = "identity") +
  geom_text(aes(label=n), vjust = -0.5, size = 2.5) +
  theme_minimal() +
  labs(title = "Pipeline Location Where Most Incidents Occurred", x = "Pipeline Location", y = "Total Number", color = "Pipeline Location", fill = "Pipeline Location") +
  theme(axis.text.x = element_blank(), plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(breaks = seq(from = 0, to = 2800, by = 400))

ggplot(top_pipeline_type, aes(x = reorder(Pipeline.Type, -n), y = n, color = reorder(Pipeline.Type, -n), fill = reorder(Pipeline.Type, -n))) + 
  geom_bar(stat = "identity") +
  geom_text(aes(label=n), vjust = -0.5, size = 2.5) +
  theme_minimal() +
  labs(title = "Pipeline Type Where Most Incidents Occurred", x = "Pipeline Type", y = "Total Number", color = "Pipeline Type", fill = "Pipeline Type") +
  theme(axis.text.x = element_blank(), plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(breaks = seq(from = 0, to = 1400, by = 200))

From graphs shown above, we can conclude that the oil pipeline leak or spill incidents from 2010 until 2017 mostly occurred onshore at above ground pipeline.