A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Simply said, data frame is a table of data set with columns and rows. Each element of the column lists refers to variable, and of the row lists refers to observation. Unlike matrices, data frames can store different classes of object in each column (i.e. numeric, character, factor). In this section, I will perform basic operation with a data frame.
The data set that I’ll be working on includes a record for each oil pipeline leak or spill reported to the Pipeline and Hazardous Materials Safety Administration since 2010 until 2017. These records include the incident date and time, operator and pipeline, cause of incident, type of hazardous liquid and quantity lost, injuries and fatalities, and associated costs.
| Variables | Description |
|---|---|
| Operator Name | (chr) E.g: Shell Pipeline Co., Mobil Corp |
| Pipeline/Facility Name | (chr) E.g: Iowa Pump Station |
| Pipeline Location | (chr) Onshore/ Offshore |
| Pipeline Type | (chr) Underground/Above ground/Tank/Transition Area |
The objectives are:
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)database <- read.csv("database.csv")In data analysis, it is common to work with large data sets. It is important as a data analyst to develop a clear understanding of the structure and main elements of the data set. Let’s briefly look at the data sets. How can you do so?
The str() function gives you an overview of your data by telling you the following:
In the original data set, there are 2795 observations and 48 variables.
str(database)## 'data.frame': 2795 obs. of 48 variables:
## $ Report.Number : int 20100016 20100254 20100038 20100260 20100030 20100021 20110036 20100255 20100261 20100024 ...
## $ Supplemental.Number : int 17305 17331 17747 18574 16276 17161 18052 18584 18050 18390 ...
## $ Accident.Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Accident.Date.Time : chr "1/1/2010 7:15 AM" "1/4/2010 8:30 AM" "1/5/2010 10:30 AM" "1/6/2010 7:30 PM" ...
## $ Operator.ID : int 32109 15786 20160 11169 300 11169 26041 12624 26041 31684 ...
## $ Operator.Name : chr "ONEOK NGL PIPELINE LP" "PORTLAND PIPELINE CORP" "PETROLOGISTICS OLEFINS, LLC" "ENBRIDGE ENERGY, LIMITED PARTNERSHIP" ...
## $ Pipeline.Facility.Name : chr "KINDER MORGAN JCT" "24-INCH MAIN LINE" "" "SUPERIOR TERMINAL" ...
## $ Pipeline.Location : chr "ONSHORE" "ONSHORE" "ONSHORE" "ONSHORE" ...
## $ Pipeline.Type : chr "ABOVEGROUND" "ABOVEGROUND" "ABOVEGROUND" "UNDERGROUND" ...
## $ Liquid.Type : chr "HVL OR OTHER FLAMMABLE OR TOXIC FLUID, GAS" "CRUDE OIL" "HVL OR OTHER FLAMMABLE OR TOXIC FLUID, GAS" "CRUDE OIL" ...
## $ Liquid.Subtype : chr "LPG (LIQUEFIED PETROLEUM GAS) / NGL (NATURAL GAS LIQUID)" "" "OTHER HVL" "" ...
## $ Liquid.Name : chr "" "" "ETHANE" "" ...
## $ Accident.City : chr "MCPHERSON" "RAYMOND" "SULPHER" "SUPERIOR" ...
## $ Accident.County : chr "MCPHERSON" "CUMBERLAND" "CALCASIEU" "DOUGLAS" ...
## $ Accident.State : chr "KS" "ME" "LA" "WI" ...
## $ Accident.Latitude : num 38.7 43.9 30.2 46.7 33.6 ...
## $ Accident.Longitude : num -97.8 -70.5 -93.4 -92.1 -96.6 ...
## $ Cause.Category : chr "INCORRECT OPERATION" "MATERIAL/WELD/EQUIP FAILURE" "MATERIAL/WELD/EQUIP FAILURE" "NATURAL FORCE DAMAGE" ...
## $ Cause.Subcategory : chr "PIPELINE/EQUIPMENT OVERPRESSURED" "PUMP OR PUMP-RELATED EQUIPMENT" "DEFECTIVE OR LOOSE TUBING/FITTING" "TEMPERATURE" ...
## $ Unintentional.Release..Barrels. : num 21 0.12 2 0.48 700 ...
## $ Intentional.Release..Barrels. : num 0.1 0 0 0 NA 0 0 0 0 0 ...
## $ Liquid.Recovery..Barrels. : num 0 0.12 0 0.48 698 ...
## $ Net.Loss..Barrels. : num 21 0 2 0 2 ...
## $ Liquid.Ignition : chr "NO" "NO" "NO" "NO" ...
## $ Liquid.Explosion : chr "NO" "NO" "NO" "NO" ...
## $ Pipeline.Shutdown : chr "NO" "" "" "" ...
## $ Shutdown.Date.Time : chr "" "" "" "" ...
## $ Restart.Date.Time : chr "" "" "" "" ...
## $ Public.Evacuations : int NA NA NA NA NA 0 NA NA NA NA ...
## $ Operator.Employee.Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Operator.Contractor.Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Emergency.Responder.Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Other.Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Public.Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ All.Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Operator.Employee.Fatalities : int NA NA NA NA NA NA NA NA NA NA ...
## $ Operator.Contractor.Fatalities : int NA NA NA NA NA NA NA NA NA NA ...
## $ Emergency.Responder.Fatalities : int NA NA NA NA NA NA NA NA NA NA ...
## $ Other.Fatalities : int NA NA NA NA NA NA NA NA NA NA ...
## $ Public.Fatalities : int NA NA NA NA NA NA NA NA NA NA ...
## $ All.Fatalities : int NA NA NA NA NA NA NA NA NA NA ...
## $ Property.Damage.Costs : int 110 4000 0 200 20000 76940 0 400 0 0 ...
## $ Lost.Commodity.Costs : int 1517 8 200 40 150 167775 400 13 336 50 ...
## $ Public.Private.Property.Damage.Costs: int 0 0 0 0 0 150000 0 0 0 0 ...
## $ Emergency.Response.Costs : int 0 0 0 11300 7500 1800000 0 0 0 10000 ...
## $ Environmental.Remediation.Costs : int 0 0 0 0 2000 2000000 70000 0 40000 10000 ...
## $ Other.Costs : int 0 0 0 0 0 0 0 0 0 10000 ...
## $ All.Costs : int 1627 4008 200 11540 29650 4194715 70400 413 40336 30050 ...
Not all data in the data sets are needed in a project. It is important to understand the data required to answer your questions. In this case, I want to select the following columns only:
using dplyr package and select() function.
df <- database %>%
select(Operator.Name,
Pipeline.Facility.Name,
Pipeline.Location,
Pipeline.Type)Another way to understand your data is to show only small part of the entire data set. How to do so in R?
head(df)## Operator.Name Pipeline.Facility.Name Pipeline.Location
## 1 ONEOK NGL PIPELINE LP KINDER MORGAN JCT ONSHORE
## 2 PORTLAND PIPELINE CORP 24-INCH MAIN LINE ONSHORE
## 3 PETROLOGISTICS OLEFINS, LLC ONSHORE
## 4 ENBRIDGE ENERGY, LIMITED PARTNERSHIP SUPERIOR TERMINAL ONSHORE
## 5 PLAINS PIPELINE, L.P. RED RIVER EAST ONSHORE
## 6 ENBRIDGE ENERGY, LIMITED PARTNERSHIP ONSHORE
## Pipeline.Type
## 1 ABOVEGROUND
## 2 ABOVEGROUND
## 3 ABOVEGROUND
## 4 UNDERGROUND
## 5 UNDERGROUND
## 6 UNDERGROUND
To show the first n rows of data set, you can add another argument, n in the head() function. In this case, I want to see the first 10 rows of the data set.
head(df, n = 10) ## Operator.Name Pipeline.Facility.Name
## 1 ONEOK NGL PIPELINE LP KINDER MORGAN JCT
## 2 PORTLAND PIPELINE CORP 24-INCH MAIN LINE
## 3 PETROLOGISTICS OLEFINS, LLC
## 4 ENBRIDGE ENERGY, LIMITED PARTNERSHIP SUPERIOR TERMINAL
## 5 PLAINS PIPELINE, L.P. RED RIVER EAST
## 6 ENBRIDGE ENERGY, LIMITED PARTNERSHIP
## 7 KINDER MORGAN LIQUID TERMINALS, LLC
## 8 MOBIL CORP HULL STATION
## 9 KINDER MORGAN LIQUID TERMINALS, LLC
## 10 CONOCOPHILLIPS TANK 1501
## Pipeline.Location Pipeline.Type
## 1 ONSHORE ABOVEGROUND
## 2 ONSHORE ABOVEGROUND
## 3 ONSHORE ABOVEGROUND
## 4 ONSHORE UNDERGROUND
## 5 ONSHORE UNDERGROUND
## 6 ONSHORE UNDERGROUND
## 7 ONSHORE TANK
## 8 ONSHORE ABOVEGROUND
## 9 ONSHORE ABOVEGROUND
## 10 ONSHORE TANK
Similar to head() function, you can use tail() function to show the last N row of data set.
tail(df, n = 10)## Operator.Name
## 2786 MARATHON PIPE LINE LLC
## 2787 CCPS TRANSPORTATION, LLC
## 2788 ENTERPRISE CRUDE PIPELINE LLC
## 2789 PHILLIPS 66 PIPELINE LLC
## 2790 HOLLY ENERGY PARTNERS - OPERATING, L.P.
## 2791 TC OIL PIPELINE OPERATIONS INC
## 2792 EXXONMOBIL PIPELINE CO
## 2793 ENBRIDGE STORAGE (CUSHING) L.L.C.
## 2794 MARATHON PIPE LINE LLC
## 2795 ENTERPRISE CRUDE PIPELINE LLC
## Pipeline.Facility.Name Pipeline.Location Pipeline.Type
## 2786 ELWOOD STATION ONSHORE ABOVEGROUND
## 2787 LINE 59 KEY STATION ONSHORE ABOVEGROUND
## 2788 ECHO MANIFOLD ONSHORE ABOVEGROUND
## 2789 NT-90, PREDO TO CARSON ONSHORE ABOVEGROUND
## 2790 RUSSELL STATION ONSHORE ABOVEGROUND
## 2791 KEYSTONE ONSHORE ABOVEGROUND
## 2792 BRRF - CHOCTAW ETHANE/PROPANE MIX SYSTEM ONSHORE UNDERGROUND
## 2793 CUSHING CENTRAL TERMINAL ONSHORE TANK
## 2794 MIDLAND STATION ONSHORE UNDERGROUND
## 2795 MARSHALL STATION ONSHORE ABOVEGROUND
By using unique() function, you can return a vector of unique elements of a column. Then, use length() function to find out the total number of elements in a vector. From the “Operator Name” column, we find out that there are a total of 229 operators involved in oil pipeline leak or spill incidents.
cat("The total of operators involved in the oil pipeline leak/spill incidents: ", length(unique(df$Operator.Name))) ## The total of operators involved in the oil pipeline leak/spill incidents: 229
Get a list of top 10 operators in which their pipelines caused most incidents using:
top_operators <- df %>%
count(df$Operator.Name, sort = TRUE) %>%
head(n = 10)
top_operators## df$Operator.Name n
## 1 ENTERPRISE CRUDE PIPELINE LLC 195
## 2 SUNOCO PIPELINE L.P. 180
## 3 PLAINS PIPELINE, L.P. 156
## 4 ENTERPRISE PRODUCTS OPERATING LLC 155
## 5 MAGELLAN PIPELINE COMPANY, LP 140
## 6 COLONIAL PIPELINE CO 136
## 7 BUCKEYE PARTNERS, LP 115
## 8 KINDER MORGAN LIQUID TERMINALS, LLC 87
## 9 MARATHON PIPE LINE LLC 87
## 10 PHILLIPS 66 PIPELINE LLC 80
top_pipeline <- df %>%
filter(df$Operator.Name == 'ENTERPRISE CRUDE PIPELINE LLC') %>%
select(Pipeline.Facility.Name) %>%
mutate_if(is.character, list(~na_if(.,""))) %>%
count(Pipeline.Facility.Name, sort = TRUE)
head(top_pipeline)## Pipeline.Facility.Name n
## 1 <NA> 11
## 2 MIDLAND STATION 10
## 3 JONES CREEK STATION 9
## 4 TEXAS CITY TANK FARM 7
## 5 HOBBS STATION 6
## 6 C18 5
In above data set, we can see that for Enterprise Crude Pipeline LLC, the incidents mostly occurred at Midland Station.
Using similar function as the previous one, we can also clean the data set to get the location of pipeline data only (offshore/onshore).
top_location <- df %>%
select(Pipeline.Location) %>%
mutate_if(is.character, list(~na_if(.,""))) %>%
count(Pipeline.Location, sort = TRUE)
top_location## Pipeline.Location n
## 1 ONSHORE 2777
## 2 OFFSHORE 18
In above data set, most incidents occurred onshore.
Using similar function as the previous one, we can also clean the data set to get the type of pipeline data only.
top_pipeline_type <- df %>%
select(Pipeline.Type) %>%
mutate_if(is.character, list(~na_if(.,""))) %>%
count(Pipeline.Type, sort = TRUE)
top_pipeline_type## Pipeline.Type n
## 1 ABOVEGROUND 1475
## 2 UNDERGROUND 985
## 3 TANK 301
## 4 <NA> 18
## 5 TRANSITION AREA 16
In above data set, most incidents occurred mostly at above ground, and the least one occurred at transition area.
ggplot(top_location, aes(x = reorder(Pipeline.Location, -n), y = n, color = reorder(Pipeline.Location, -n), fill = reorder(Pipeline.Location, -n))) +
geom_bar(stat = "identity") +
geom_text(aes(label=n), vjust = -0.5, size = 2.5) +
theme_minimal() +
labs(title = "Pipeline Location Where Most Incidents Occurred", x = "Pipeline Location", y = "Total Number", color = "Pipeline Location", fill = "Pipeline Location") +
theme(axis.text.x = element_blank(), plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(breaks = seq(from = 0, to = 2800, by = 400))ggplot(top_pipeline_type, aes(x = reorder(Pipeline.Type, -n), y = n, color = reorder(Pipeline.Type, -n), fill = reorder(Pipeline.Type, -n))) +
geom_bar(stat = "identity") +
geom_text(aes(label=n), vjust = -0.5, size = 2.5) +
theme_minimal() +
labs(title = "Pipeline Type Where Most Incidents Occurred", x = "Pipeline Type", y = "Total Number", color = "Pipeline Type", fill = "Pipeline Type") +
theme(axis.text.x = element_blank(), plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(breaks = seq(from = 0, to = 1400, by = 200))From graphs shown above, we can conclude that the oil pipeline leak or spill incidents from 2010 until 2017 mostly occurred onshore at above ground pipeline.