1 World Offshore Accident Database - Woad
2 Method
3 R Package
4 Functions
5 Fase 1
- 5.1 Cleaning Woad data
- 5.2 Tidying data
6 Exploratory analysis
7 Fase 2 - Diagnóstico

1 World Offshore Accident Database - Woad

Woad (World Offshore Accident Database) gives you access to accident data for diverse offshore facility types. It has been curated since 1975 by experts at DNV GL, providing accident causes, location, social and economic impacts, etc. that prove invaluable for a variety of risk management initiatives.

Offshore accident data for oil and gas facilities Access to the world’s most comprehensive available offshore accident data Information on over 6000 offshore accidents and incidents from 1970 to date Technical data on approximately 3700 offshore units The World Offshore Accident Database is a tool used extensively by key players in the offshore industry worldwide, including rig owners, drilling operators, insurance companies, consultants, salvage companies and regulatory authorities Continuously updated offshore accident data Knowledge of past accidents serves as important input to risk assessment initiatives and helps you ensure that known hazards are properly understood and effectively mitigated against. The World Offshore Accident Database contains a vast range of information covering the different aspects of an accident – technical, economic, social, operational etc. The range of useful applications of information from the World Offshore Accident Database is extensive. It shows the rate of accidents by unit type, geography, function, accident type and more - extremely useful for quantitative risk analysis (QRA).

2 Method

2.1 a) Fase 1 - Preparação, organização e limpeza de dados

O objetivo desta etapa é limpar os dados para preparar a organização de unidades relevantes de análise.

2.2 b) Fase 2 - Análise descritiva

O objetivo desta etapa é compreender as variáveis de dados, tanto em seus níveis de dados (univariado), quanto em suas relações com outras variáveis (bivariada e multivariada). Assim, a partir dessas análises buscar-se-á estabelecer um contexto referencial que busca responder as seguintes perguntas:

Qual o comportamento das variáveis de dados?
Quais as variáveis relevantes para descoberta de conhecimento na base de dados?
Quais os níveis de dados das variáveis mais relevantes (tendo em vista a completude e abrangência no domínio)

2.3 c) Fase 3 - Diagnóstico

O objetivo desta etapa é identificar fatores que modelam o contexto referencial.

O que os dados sumarizados revelam?
Há causas aparentes dos acidentes?

2.4 d) Fase 4 - Análise preditiva

O objetivo desta análise é aplicar modelos regressivos e preditivos para estabelecer possíveis cenários.

Quais variáveis aumentam os acidentes?
Quais diminuem?

2.5 e) Fase 5 - Análise prescritiva

O objetivo desta etapa é estabelecer um sistema capaz de oferecer alternativas de ações com base nos cenários estabelecidos.

O que fazer para diminuir os acidentes?
O que evitar?

3 R Package

library(rattle)
library(caret)
library(rpart)
library(rpart.plot)
library(corrplot)
library(randomForest)
library(RColorBrewer)
library(dplyr)
library(plotly)
library(ggplot2)
library(tidyr)
library(magrittr)
library(plotrix)
library(rgl)
library(lubridate)
library(ggplot2)
library(GGally)
library(corrplot)
library(corrgram)
library(ppcor)
library(readr)
library(ggvis)
library(gganimate)
library(gifski)
library(av)
library(magick)
library(viridis)
library(hrbrthemes)
library(caret)

4 Functions

panel.hist <- function(x, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}

# 1.2 by Melina de Souza Leite 
panel.lm <- function (x, y, col = par("col"), bg = NA, pch = par("pch"), 
                      cex = 1, col.line="red") {
  points(x, y, pch = pch, col = col, bg = bg, cex = cex)
  ok <- is.finite(x) & is.finite(y)
  if (any(ok)) {
    abline(lm(y[ok]~x[ok]), col = col.line)
  }
}

# 1.3 help(pairs) by Melina de Souza Leite 
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}


# Transform vector into a data frame with frequency of levels and proportion
.Unianalysis = function (x) {
    y <- as.data.frame(as.table(table(x)))
    y <- mutate(y, percentual = prop.table(y$Freq) *100)#Proportion
    y <- arrange(y, desc(y$Freq))
return(y)
}

.BarPlot <- function (z) {
      z %>% 
    ggplot( aes(x=x, y=Freq)) +
    geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
    coord_flip() +
    xlab("") +
    theme_bw()

}

5 Fase 1

5.1 Cleaning Woad data

Retirada: _ normalização dos rótulos de dados - Conversão para CSV - linhas 1 e 2 - linhas 3736 até 3738 - linhas vazias da coluna “Human Cause” foram trocadas por “not identified” - linha 418 por não ter dados completos - 3987 registros não que não possuiam dados de custo do evento foram tratados como 0, para que fosse possível fazer o somatório dos dados existentes.

A variável CustoHumano refere-se a soma de todas as fatalidades e ferimentos da tripulação e de terceiros, com peso 10 para fatalidades e peso 3 para ferimentos.

O gráfico abaixo mostra as ocorrências por tipo de unidade correlacionado com o custo humano e grau de danos.

setwd("~/Regression Models")
library(readr)
Cleaning_completo_woad <- read_delim("Cleaning_completo_woad.CSV", 
    ";", escape_double = FALSE, col_types = cols(
    DamageCost_million = col_number(),
    DrillDepth_km = col_number(), 
    WaterDepth_m = col_number(),
    WindSpeed_m_s= col_number(), 
    FatalitiesCrew = col_number(),
    Fatalities3rd_Party = col_number(),
    InjuriesCrew = col_number(),
    Injuries3rd_Party = col_number(),
    Accident_Date = col_date(format = "%m/%d/%Y")),
    trim_ws = TRUE)

5.2 Tidying data

trainVariable <- Cleaning_completo_woad %>%
              mutate(HumanCost = (FatalitiesCrew * 10) + (Fatalities3rd_Party * 10) + (InjuriesCrew * 3) +  (Injuries3rd_Party * 3)) %>%
              mutate(Year = year(Accident_Date)) %>% 
              mutate(Month =  month(Accident_Date))

by_Event_Function <-
        trainVariable %>% 
        group_by(MainEvent, Function,TypeofUnit, Damage,SpillType, AccidentCategory,Year, Month) %>%
        summarise(How_Many = n(),
                  Owner = n_distinct(Owner),
                  FatalitiesCrew = sum(FatalitiesCrew),
                  Fatalities3rd_Party = sum(Fatalities3rd_Party),
                  InjuriesCrew = sum(InjuriesCrew),
                  Injuries3rd_Party = sum(Injuries3rd_Party),
                  DamageCost_million= sum(DamageCost_million), 
                  DrillDepth_km = mean(DrillDepth_km), 
                  WaterDepth_m = mean(WaterDepth_m),
                  WindSpeed_m_s= sum(WindSpeed_m_s),
                  HumanCost = sum(HumanCost)) %>%
       mutate(Mean_HumanCost = (FatalitiesCrew + Fatalities3rd_Party + InjuriesCrew + Injuries3rd_Party)/How_Many) %>%
       arrange(desc(How_Many))
   

write.csv2(by_Event_Function, file = "by_Event_Function.CSV")
write.csv2(trainVariable, file = "trainVariable.CSV")



damage <- cbind(by_Event_Function$Damage,
                by_Event_Function$How_Many,
                by_Event_Function$HumanCost,
                by_Event_Function$Mean_HumanCost,
                by_Event_Function$DamageCost_million)

colnames(damage) <- cbind("Grau de danos","Ocorrências", "Total Custo Humano", "Média Custo Humano", "Danos em milões")

damage <- as.data.frame(na.omit(damage))


OverView <- apply(Cleaning_completo_woad, 2, .Unianalysis)

OverView2 <- sapply(OverView, distinct)



by_Year <- trainVariable %>%
          group_by(Year) %>%
          summarise(Freq = n(),
           Owner = n_distinct(Owner),
           FatalitiesCrew = sum(FatalitiesCrew),
           Fatalities3rd_Party = sum(Fatalities3rd_Party),
           InjuriesCrew = sum(InjuriesCrew),
           Injuries3rd_Party = sum(Injuries3rd_Party),
           DamageCost_million= sum(DamageCost_million), 
           HumanCost = sum(HumanCost),   
           WindSpeed_m_s= mean(WindSpeed_m_s)) 
  
  
by_HumanCost <- trainVariable %>%
          filter(HumanCost > 0) %>%
          group_by(HumanCost, AccidentCategory, Damage) %>%
          summarise(Freq = n(),
           Owner = n_distinct(Owner),
           FatalitiesCrew = sum(FatalitiesCrew),
           Fatalities3rd_Party = sum(Fatalities3rd_Party),
           InjuriesCrew = sum(InjuriesCrew),
           Injuries3rd_Party = sum(Injuries3rd_Party),
           DamageCost_million= sum(DamageCost_million), 
           WindSpeed_m_s= mean(WindSpeed_m_s))  %>% 
           filter(Freq > 20)
           

dim(by_HumanCost)

## [1]  3 11

by_Year<- by_Year %>% 
  mutate_all(replace_na, 0)

by_Event_Function <- by_Event_Function %>% 
  mutate_all(replace_na, 0)

6 Exploratory analysis

6.1 Checking Data Consistency

library(GGally)
ggpairs(trainVariable, columns = 24:30, ggplot2::aes(colour=Damage))

trainVariable_N <- trainVariable %>%
    dplyr::filter(HumanCost > 0)
summary(trainVariable_N$HumanCost)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    3.00   10.00   32.89   27.00 1850.00

dim(trainVariable_N)

## [1] 872  43

Apesar de ter sido retirado da amostra os registros de custo humano 0 e custo humano maior que o terceiro quartil, o histograma ainda mostra que os dados não estão normalizados, o que pode alterar a correlação de variáveis

histogram(trainVariable_N$HumanCost)

Assim optou-se por selecionar uma amostra com valores mais aproximados da média. O valor mínimo encontrado foi 250 para custo humano, o que gerou uma perda de registros para avaliação, devido a iregularidade dos dados, inviabilizando assim a análise de correlação que recomenda ao menos 40 registros.

trainVariable_N <- trainVariable %>%
    dplyr::filter(HumanCost > 250)

summary(trainVariable_N$HumanCost)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   260.0   363.0   720.0   735.7   875.0  1850.0

dim(trainVariable_N)

## [1] 11 43

histogram(trainVariable_N$HumanCost)

O p-valor foi de 0.. Isto quer dizer que os dados são normais, pois não diferem de uma curva normal. O p-valor for < 0.05 indica que os dados não apresentam normalidade.

shapiro.test(trainVariable_N$HumanCost)

## 
##  Shapiro-Wilk normality test
## 
## data:  trainVariable_N$HumanCost
## W = 0.87649, p-value = 0.09387

Assim, sugere-se realizar uma pesquisa qualitativa a partir desses registros de acidentes. Abaixo a descrição dos registros.

DT::datatable(trainVariable_N)

Contudo, mesmo não havendo normalidade de dados para a variável custo humano, algumas análises serão realizadas, desde que sejam observadas as limitações dos resultados obtidos.

6.2 Normality test variable Y “Frequency of events”

6.2.1 Histogram

summary(trainVariable$HumanCost)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    4.27    0.00 1850.00      15

histogram(trainVariable$HumanCost)

dim(trainVariable)

## [1] 6733   43

Não há normalidade de dados para análises de correlação.

6.3 Análise de variancia

6.3.1 Análise de variância simples

Há variancia de vento nos acidentes com maiores danos

aggregate(WindSpeed_m_s ~ Damage, trainVariable, var)

##               Damage WindSpeed_m_s
## 1 Insignif/no damage      24.54341
## 2       Minor damage     136.63356
## 3      Severe damage     251.19942
## 4 Significant damage     208.44087
## 5         Total loss     143.50564

aggregate(WindSpeed_m_s ~ Damage, trainVariable,mean)

##               Damage WindSpeed_m_s
## 1 Insignif/no damage     0.9644248
## 2       Minor damage     3.8438178
## 3      Severe damage     6.0176391
## 4 Significant damage     6.6126021
## 5         Total loss     5.2136752

aggregate(WindSpeed_m_s ~ Damage, trainVariable,summary)

##               Damage WindSpeed_m_s.Min. WindSpeed_m_s.1st Qu.
## 1 Insignif/no damage          0.0000000             0.0000000
## 2       Minor damage          0.0000000             0.0000000
## 3      Severe damage          0.0000000             0.0000000
## 4 Significant damage          0.0000000             0.0000000
## 5         Total loss          0.0000000             0.0000000
##   WindSpeed_m_s.Median WindSpeed_m_s.Mean WindSpeed_m_s.3rd Qu.
## 1            0.0000000          0.9644248             0.0000000
## 2            0.0000000          3.8438178             0.0000000
## 3            0.0000000          6.0176391             0.0000000
## 4            0.0000000          6.6126021             0.0000000
## 5            0.0000000          5.2136752             0.0000000
##   WindSpeed_m_s.Max.
## 1         60.0000000
## 2         83.0000000
## 3         62.0000000
## 4         83.0000000
## 5         82.0000000

6.3.2 Análise de variancia ANOVA

As análises de variância permiterm verificar em uma só análise se as diferenças amostrais observadas são reais (causadas por diferenças significativas nas populações observadas) ou casuais (decorrentes da mera variabilidade amostral).

O gráfico (Scale-Location) serve para indicar a distribuição de pontos no intervalo de valores previstos. A variação deve ser razoavelmente igual em todo o intervalo do preditor, no nosso caso existe uma variação considerável nos intervalos.

fit <- aov(HumanCost ~ WindSpeed_m_s + WaveHeight_m + DrillDepth_km + SpillAmount_m3 + as.factor(TypeofUnit) + as.factor(Function) + as.factor(MainEvent), data = trainVariable) 

summary(fit)

##                         Df  Sum Sq Mean Sq F value   Pr(>F)    
## WindSpeed_m_s            1   10482   10482   3.827   0.0507 .  
## WaveHeight_m             1   66806   66806  24.393 9.00e-07 ***
## DrillDepth_km            1       0       0   0.000   0.9987    
## SpillAmount_m3           1      33      33   0.012   0.9129    
## as.factor(TypeofUnit)   19  211884   11152   4.072 1.12e-08 ***
## as.factor(Function)     17   23642    1391   0.508   0.9504    
## as.factor(MainEvent)    20  180652    9033   3.298 1.32e-06 ***
## Residuals             1166 3193339    2739                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5506 observations deleted due to missingness

aov(fit)

## Call:
##    aov(formula = fit)
## 
## Terms:
##                 WindSpeed_m_s WaveHeight_m DrillDepth_km SpillAmount_m3
## Sum of Squares          10482        66806             0             33
## Deg. of Freedom             1            1             1              1
##                 as.factor(TypeofUnit) as.factor(Function) as.factor(MainEvent)
## Sum of Squares                 211884               23642               180652
## Deg. of Freedom                    19                  17                   20
##                 Residuals
## Sum of Squares    3193339
## Deg. of Freedom      1166
## 
## Residual standard error: 52.33271
## 2 out of 63 effects not estimable
## Estimated effects may be unbalanced
## 5506 observations deleted due to missingness

plot(fit)

## Warning: not plotting observations with leverage one:
##   100, 154, 193

## Warning: not plotting observations with leverage one:
##   100, 154, 193

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produzidos

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produzidos

6.4 Análise Univariada

O objetivo da análise univariada é compreender cada unidade de dados da base analisada, de forma a estabelecer um contexto referencial. Isso foi feito a partir de perguntas chaves descritas nas próximas seções.

6.4.1 Qual o perfil e a quantidade dos eventos que ocorreram?

Evento principal

DT::datatable(OverView2$MainEvent, colnames = c('Variavel', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$MainEvent)

Categoria de acidente

DT::datatable(OverView2$AccidentCategory, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView$AccidentCategory)

Tipo de dano

DT::datatable(OverView2$Damage, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$Damage)

6.4.2 Quais as possíveis causas dos eventos?

Causa humana

DT::datatable(OverView2$HumanCause, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$HumanCause)

Causa de equipamento

DT::datatable(OverView2$EquipmentCause, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$EquipmentCause)

6.4.3 Onde os eventos ocorrem na plataforma?

Tipo de unidade

DT::datatable(OverView2$TypeofUnit, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$TypeofUnit)

Funções

DT::datatable(OverView2$Function, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$Function)

Operação principal

DT::datatable(OverView2$MainOperation, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$MainOperation)

6.4.4 Quais modelos de gestão são mais frequentes nessas ocorrências?

Modelo de Gestão

DT::datatable(OverView2$Class_Society, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$Class_Society)

6.4.5 Qual o custo humano desses eventos?

Fatalidades da tripulação

DT::datatable(OverView2$FatalitiesCrew, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

 .BarPlot(OverView2$FatalitiesCrew)

Fatalidades de terceiros

DT::datatable(OverView2$Fatalities3rd_Party, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

 .BarPlot(OverView2$Fatalities3rd_Party)

Ferimentos da tripulação

DT::datatable(OverView2$FatalitiesCrew, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$FatalitiesCrew)

Ferimentos de terceiros

DT::datatable(OverView2$Fatalities3rd_Party, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$Fatalities3rd_Party)

6.4.6 Qual o custo financeiro desses eventos?

Custo dos danos em milhoes

DT::datatable(OverView2$DamageCost_million, colnames = c('Variável', 'Freq.','Percentual'), filter = c("top"))

.BarPlot(OverView2$DamageCost_million)

6.5 Analise bivariada

O Objetivo desta análise é identificar possíveis correlações entre duas unidades de análise e desta forma observar se exite possibilidade de correlação entre as variáveis.

6.5.1 Quais eventos ocorrem com mais frequencia pelo tipo de ocorrência?

trainVariable %>%
  ggplot( aes(x=AccidentCategory, y=MainEvent, fill=AccidentCategory)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Principais eventos por tipo de gravidade") +
    xlab("")

trainVariable %>%
  ggplot( aes(x=Damage, y=MainEvent, fill=Damage)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Principais eventos por tipo de danos") + theme(axis.text.x = element_text(angle = 35, vjust = 0.5, hjust=1)) +
    xlab("")

trainVariable %>%
  ggplot( aes(x=Damage, y=AccidentCategory, fill=Damage)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Categorias de acidentes por tipo de danos") + theme(axis.text.x = element_text(angle = 35, vjust = 0.5, hjust=1)) +
    xlab("")

6.5.2 Em quais funções esses eventos ocorrem?

trainVariable %>%
  ggplot( aes(x=AccidentCategory, y=Function, fill=AccidentCategory)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Ocorrências de eventos por Funções e gravidade") +
    xlab("")

trainVariable %>%
  ggplot( aes(x=Damage, y=Function, fill=Damage)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Ocorrências de eventos por Funções e danos") + theme(axis.text.x = element_text(angle = 35, vjust = 0.5, hjust=1)) +
    xlab("")

6.5.3 Em quais tipos de unidades esses eventos ocorrem?

trainVariable %>%
  ggplot( aes(x=AccidentCategory, y=TypeofUnit, fill=TypeofUnit)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Ocorrências por tipo de unidade e gravidade") + 
    xlab("")

trainVariable %>%
  ggplot( aes(x=Damage, y=TypeofUnit, fill=Damage)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position ="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Ocorrências por tipo de unidade e danos") +  theme(axis.text.x = element_text(angle = 35, vjust = 0.5, hjust=1)) +
    xlab("")

6.5.4 Qual a distribuição desses eventos por ano?

trainVariable %>%
  ggplot( aes(x=Year, y=AccidentCategory, fill=AccidentCategory)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Ocorrências por gravidade e ano") + 
    xlab("")

trainVariable %>%
  ggplot( aes(x=Year, y=Damage, fill=Damage)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position ="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Ocorrências por danos e ano") +  theme(axis.text.x = element_text( vjust = 0.5, hjust=1)) +
    xlab("")

6.5.4.1 Há diferenças significativas ao longo do tempo?

qplot(log(How_Many),    Year,   data    =   by_Event_Function,  facets  =   .   ~   Damage)

qplot(log(How_Many),    Year,   data    =   by_Event_Function,  facets  =   .   ~   AccidentCategory)

qplot(How_Many, as.factor(Month),   data    =   by_Event_Function,  facets  =   .   ~   Damage)

6.5.5 Há diferenças significativas dos ventos conforme a gravidade das ocorrências?

Aplicando-se uma análise de variância simples, é possível perceber que a variações que devem ser investigadas.

aggregate(WindSpeed_m_s ~ AccidentCategory, trainVariable, var)

##   AccidentCategory WindSpeed_m_s
## 1         Accident     177.34067
## 2  Incidnt/haz.sit      54.13867
## 3        Near miss      29.61625
## 4    Unsignificant      24.06966

bartlett.test(WindSpeed_m_s ~ AccidentCategory, trainVariable)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  WindSpeed_m_s by AccidentCategory
## Bartlett's K-squared = 1665.2, df = 3, p-value < 2.2e-16

summary(trainVariable$WindSpeed_m_s)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.857   0.000  83.000      15

6.6 Analise multivariada

Na análise multivariada é identificar quais as possíveis influências dos diversos fatores que permeiam o contexto de análise. Por meio do estudo multivariado é possível obter insumos para realizar diagnósticos, modelar análise preditivas e prescritivas.

6.6.1 Qual a relação do ano e mês, tipo de ocorrência e custo humano?

Por ano

qplot(log(HumanCost),   Year,   data    =   trainVariable,  facets  =   .   ~   AccidentCategory)

## Warning: Removed 15 rows containing missing values (geom_point).

qplot(as.factor(Month), as.factor(Year), data   =   trainVariable,  facets  =   .   ~   Damage)

qplot(as.factor(Month), as.factor(Year), data   =   trainVariable,  facets  =   .   ~   AccidentCategory)

Por mês

qplot(log(HumanCost),   as.factor(Month),   data    =   by_Event_Function, color    =   Damage, geom    =   c("point",  "smooth"),  method  =   "lm")

## Warning: Ignoring unknown parameters: method

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 4972 rows containing non-finite values (stat_smooth).

qplot(DamageCost_million,   as.factor(Month),   data    =   trainVariable, color    =   Damage, geom    =   c("point",  "smooth"),  method  =   "lm")

## Warning: Ignoring unknown parameters: method

## `geom_smooth()` using formula 'y ~ x'

qplot(DamageCost_million,   as.factor(Month),   data    =   trainVariable, color    =   Damage, geom    =   c("point",  "smooth"),  method  =   "lm")

## Warning: Ignoring unknown parameters: method

## `geom_smooth()` using formula 'y ~ x'

6.6.2 Correlação multivariável

Conforme observado no gráfico abaixo, não há correlação visiveis

by_Year2 <- trainVariable_N %>%
          group_by(Year) %>%
          summarise(Freq = n(),
           Owner = n_distinct(Owner),
           FatalitiesCrew = sum(FatalitiesCrew),
           Fatalities3rd_Party = sum(Fatalities3rd_Party),
           InjuriesCrew = sum(InjuriesCrew),
           Injuries3rd_Party = sum(Injuries3rd_Party),
           DamageCost_million= sum(DamageCost_million), 
           HumanCost = sum(HumanCost),   
           WindSpeed_m_s= mean(WindSpeed_m_s))

summary(by_Year[, 2:10])

##       Freq            Owner       FatalitiesCrew   Fatalities3rd_Party
##  Min.   :  2.00   Min.   : 1.00   Min.   :  0.00   Min.   : 0.000     
##  1st Qu.: 76.25   1st Qu.: 3.50   1st Qu.:  0.00   1st Qu.: 0.000     
##  Median :136.50   Median :40.50   Median : 15.00   Median : 3.000     
##  Mean   :146.37   Mean   :39.07   Mean   : 27.96   Mean   : 4.283     
##  3rd Qu.:188.50   3rd Qu.:62.00   3rd Qu.: 39.25   3rd Qu.: 6.000     
##  Max.   :484.00   Max.   :93.00   Max.   :180.00   Max.   :27.000     
##   InjuriesCrew    Injuries3rd_Party DamageCost_million   HumanCost     
##  Min.   :  0.00   Min.   : 0.000    Min.   :  0.00     Min.   :   0.0  
##  1st Qu.:  0.00   1st Qu.: 0.000    1st Qu.:  0.00     1st Qu.:   0.0  
##  Median : 18.00   Median : 1.500    Median : 30.02     Median : 274.5  
##  Mean   : 21.67   Mean   : 6.109    Mean   : 94.90     Mean   : 405.7  
##  3rd Qu.: 30.75   3rd Qu.: 7.000    3rd Qu.:107.43     3rd Qu.: 585.0  
##  Max.   :108.00   Max.   :90.000    Max.   :903.34     Max.   :2103.0  
##  WindSpeed_m_s   
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 1.048  
##  Mean   : 2.335  
##  3rd Qu.: 3.770  
##  Max.   :20.584

pairs(by_Year[,3:7],  
      diag.panel = panel.hist,
      upper.panel = panel.cor,
      lower.panel = panel.lm,
      main="Correlação Multivariável")

pairs(by_Year[, 3:10],  
      diag.panel = panel.hist,
      upper.panel = panel.cor,
      lower.panel = panel.lm,
      main="Correlação Multivariável")

6.6.3 Ocorrências por tipo de dano, categoria de acidentes e mês de ocorrência

Neste gráfico abaixo ilustra significante ocorrência de acidentes e incidentets nos meses de agosto e setembro ao longo das 5 décadas de ocorrências. Importante identificar se esses fatos devem-se aos outliers, as especificidades de algum país ou algum outro fator não frequente (mero acaso), ou há algum fator de relevância esses meses(significância).

qplot(log(How_Many),    as.factor(Month),   data    =   by_Event_Function, facets   =   .   ~   AccidentCategory, color =   Damage, geom    =   c("point",  "smooth"),  method  =   "lm")

## Warning: Ignoring unknown parameters: method

## `geom_smooth()` using formula 'y ~ x'

## Warning in qt((1 - level)/2, df): NaNs produzidos

## Warning in qt((1 - level)/2, df): NaNs produzidos

## Warning in max(ids, na.rm = TRUE): nenhum argumento não faltante para max;
## retornando -Inf

## Warning in max(ids, na.rm = TRUE): nenhum argumento não faltante para max;
## retornando -Inf

6.6.4 Dados da Categoria acidentes que mostra custo financeiro por custo humano e tipo de dano

No Gráfico abaixo são filtrados somentes os eventos chamados de Acidentes, os quais são observados a partir dos danos por milhões em relação ao custo humano, observando-se a correlação por tipo de dano. Assim é possível perceber uma relação positiva nas perdas totais, danos severos e danos significativos e negativa nos danos insignificantes.

qplot(HumanCost,    DamageCost_million, data    =   trainVariable[AccidentCategory = "Accident"], color =   Damage, geom    =   c("point",  "smooth"),  method  =   "lm")

## Warning: Ignoring unknown parameters: method

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 15 rows containing non-finite values (stat_smooth).

## Warning: Removed 15 rows containing missing values (geom_point).

6.6.5 Custo humano por tipo de unidade e grau de danos

No gráfico é possível observar uma correlação positiva nas unidades Jackup, jacket, semi-submersivel, helicopter com custo humano nas ocorrências com perda total.

trainVariable %>% ggvis(~HumanCost, ~TypeofUnit, fill = ~as.factor(Damage)) %>% layer_points()

# Most basic bubble plot
by_Event_Function %>%
   ggplot(aes(x= HumanCost, y= DamageCost_million, size= Mean_HumanCost, color=Damage)) +
    geom_point(alpha=0.5) +
    scale_size(range = c(.1, 20), name="Mean_HumanCost")

par(mfrow=c(1,3))
# Per Damage
ggplot(by_Event_Function, aes(fill=Damage, y=Function, x=How_Many)) + 
    geom_bar(position="fill", stat="identity") + ggtitle("Eventos por Função e danos")

ggplot(by_Event_Function, aes(fill=Damage, y=TypeofUnit, x=How_Many)) + 
    geom_bar(position="fill", stat="identity") + ggtitle("Eventos por Unidade e danos")

ggplot(by_Event_Function, aes(fill=Damage, y=MainEvent, x=How_Many)) + 
    geom_bar(position="fill", stat="identity") + ggtitle("Tipo de Eventos por danos")

# Per Accident Category

ggplot(by_Event_Function, aes(fill=AccidentCategory, y=Function, x=How_Many)) + 
    geom_bar(position="fill", stat="identity") + ggtitle("Eventos por Função e tipo")

ggplot(by_Event_Function, aes(fill=AccidentCategory, y=TypeofUnit, x=How_Many)) + 
    geom_bar(position="fill", stat="identity") + ggtitle("Eventos por Unidade e tipo")

ggplot(trainVariable, aes(fill=AccidentCategory, y=MainEvent, x=HumanCost)) + 
    geom_bar(position="fill", stat="identity") + ggtitle("Eventos por Categoria e tipo")

## Warning: Removed 15 rows containing missing values (position_stack).

## Warning: Removed 49 rows containing missing values (geom_bar).

6.7 Conclusões das análises exploratórias

As análises exploratórias tem o propósito de representar o contexto de um domínio de análise e por isso elas tendem a ser “rápidas e sujas”. “Rápidas”“, pois buscamos o maior número de informações sobre o contexto, utilizando gráficos que nos direcionam em óticas unitárias, binárias e multinéveis sobre os dados, o que acaba trazendo muita informação com ruido (”sujas").

Contudo, trata-se de um estratégia que visa resumir os dados (geralmente graficamente) e destacar qualquer recurso amplo.

Os dados sobre as ocorrências parecem terem sofrido modificações de nomenclatura no decorrer das 5 décadas analisas. Por exemplo, até o ano de 1882 não havia nenhum registro do termo “Near Miss”, indicando que talvez esse termo passou a ser utilizado a partir de 1983, provavelente por alguma razão relacionada a categorização dos eventos, que permitiram facilitar a gestão dessas ocorrências. Valeria nesse ponto uma investigação literária a respeito disso. Assim, abaixo segue a compilação dos dados coletados em quase 5 décadas (46 anos) em 91 países, com a participação de 559

Quanto ao perfil dos eventos:

Cerca de 57% de todas as ocorrências relatadas ocorrem por mal funcionamento de equipamentos (32,92%) e problemas no tempo (24,43%)
Cerca de 52% de todas as ocorrências tem por evento principal a Queda de carga / objeto descartado (20,41%), vazamento de fluido ou gás (17,88) e fogo (13,84%).
Um fato já esperado é que cerca de 85% de todas as ocorrências relatadas ocorrem nas funções de Perfuração (34,96%), Produção (29,39%), e Perfuração e produção (20,72%) 4 Ar de Transferência
Há correlação moderada (0,43) de custo financeiro e custo humano.
Há correlação moderada (0,49) da velocidade dos ventos em metro por segundo com ocorrência de lesões da tripulação.

(colocar mais achados)

Explorar perguntas e hipóteses básicas (e talvez excluir algumas)

Sugirir estratégias de modelagem para a "próxima etapa

7 Fase 2 - Diagnóstico

7.0.1 Predição de custo humano por Variáveis de condições climáticas

Somente a velocidade do vento medido em metros por segundo tem significância no modelo. Conforme mostra os gráficos, esse modelo não tem um bom ajuste aos dados.

7.0.1.1 Ajuste do modelo

fit1 <- glm(formula = HumanCost^2 ~ WindSpeed_m_s + WaterDepth_m + DrillDepth_km  -1, family = "poisson", data = trainVariable)

## Warning: glm.fit: algorithm did not converge

summary(fit1)

## 
## Call:
## glm(formula = HumanCost^2 ~ WindSpeed_m_s + WaterDepth_m + DrillDepth_km - 
##     1, family = "poisson", data = trainVariable)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -20225.5      -5.9      -2.7      -1.9    5108.9  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## WindSpeed_m_s 1.375e-01  1.437e-05    9568   <2e-16 ***
## WaterDepth_m  1.274e-02  5.016e-06    2540   <2e-16 ***
## DrillDepth_km 7.013e-02  4.868e-06   14405   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance:  81106021  on 945  degrees of freedom
## Residual deviance: 457449413  on 942  degrees of freedom
##   (5788 observations deleted due to missingness)
## AIC: 457450277
## 
## Number of Fisher Scoring iterations: 25

plot(fit1)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produzidos

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produzidos

7.0.1.2 Árvore de decisão do modelo

library(rpart)
library(rpart.plot)

arvore <- rpart(HumanCost ~ WindSpeed_m_s + WaterDepth_m + DrillDepth_km, method = "poisson", data = trainVariable)

summary(trainVariable$HumanCost)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    4.27    0.00 1850.00      15

summary(arvore)

## Call:
## rpart(formula = HumanCost ~ WindSpeed_m_s + WaterDepth_m + DrillDepth_km, 
##     data = trainVariable, method = "poisson")
##   n=6718 (15 observations deleted due to missingness)
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.05798458      0 1.0000000 1.0041095 0.1475681
## 2 0.02273193      1 0.9420154 0.9805919 0.1366798
## 3 0.02045564      2 0.9192835 0.9941205 0.1391663
## 4 0.01060825      5 0.8579166 1.0167146 0.1440153
## 5 0.01000000      6 0.8473083 1.0564948 0.1507965
## 
## Variable importance
## WindSpeed_m_s  WaterDepth_m DrillDepth_km 
##            48            39            13 
## 
## Node number 1: 6718 observations,    complexity param=0.05798458
##   events=28684,  estimated rate=4.269723 , mean deviance=27.79048 
##   left son=2 (6054 obs) right son=3 (664 obs)
##   Primary splits:
##       WindSpeed_m_s < 7.5   to the left,  improve=10825.520, (0 missing)
##       WaterDepth_m  < 144.5 to the right, improve= 3620.149, (3200 missing)
##       DrillDepth_km < 3.65  to the right, improve= 1204.526, (5216 missing)
## 
## Node number 2: 6054 observations
##   events=19422,  estimated rate=3.208168 , mean deviance=17.72852 
## 
## Node number 3: 664 observations,    complexity param=0.02273193
##   events=9262,  estimated rate=13.94538 , mean deviance=103.2266 
##   left son=6 (192 obs) right son=7 (472 obs)
##   Primary splits:
##       WaterDepth_m  < 26.5  to the left,  improve=5472.268, (262 missing)
##       DrillDepth_km < 3.1   to the left,  improve=4224.825, (336 missing)
##       WindSpeed_m_s < 52.5  to the right, improve=3155.478, (0 missing)
##   Surrogate splits:
##       WindSpeed_m_s < 55.5  to the right, agree=0.624, adj=0.09, (262 split)
## 
## Node number 6: 192 observations
##   events=294,  estimated rate=1.534586 , mean deviance=11.08198 
## 
## Node number 7: 472 observations,    complexity param=0.02045564
##   events=8968,  estimated rate=18.99269 , mean deviance=131.7177 
##   left son=14 (458 obs) right son=15 (14 obs)
##   Primary splits:
##       DrillDepth_km < 3.1   to the left,  improve=3824.745, (256 missing)
##       WaterDepth_m  < 144.5 to the right, improve=3539.002, (236 missing)
##       WindSpeed_m_s < 8.5   to the right, improve=2220.595, (0 missing)
## 
## Node number 14: 458 observations,    complexity param=0.02045564
##   events=7248,  estimated rate=15.81942 , mean deviance=112.8826 
##   left son=28 (429 obs) right son=29 (29 obs)
##   Primary splits:
##       WindSpeed_m_s < 8.5   to the right, improve=2830.1260, (0 missing)
##       WaterDepth_m  < 144.5 to the right, improve=2541.8540, (235 missing)
##       DrillDepth_km < 0.77  to the right, improve= 179.6027, (256 missing)
## 
## Node number 15: 14 observations
##   events=1720,  estimated rate=120.9059 , mean deviance=478.5848 
## 
## Node number 28: 429 observations,    complexity param=0.01060825
##   events=5353,  estimated rate=12.47338 , mean deviance=85.37741 
##   left son=56 (267 obs) right son=57 (162 obs)
##   Primary splits:
##       WaterDepth_m  < 98    to the right, improve=3590.4010, (221 missing)
##       WindSpeed_m_s < 12.5  to the left,  improve= 956.1647, (0 missing)
##       DrillDepth_km < 0.77  to the right, improve= 140.2840, (241 missing)
##   Surrogate splits:
##       WindSpeed_m_s < 30.5  to the left,  agree=0.582, adj=0.074, (221 split)
## 
## Node number 29: 29 observations,    complexity param=0.02045564
##   events=1895,  estimated rate=64.85553 , mean deviance=422.1828 
##   left son=58 (22 obs) right son=59 (7 obs)
##   Primary splits:
##       WaterDepth_m < 132.5 to the left,  improve=2819.918, (14 missing)
## 
## Node number 56: 267 observations
##   events=1722,  estimated rate=6.447528 , mean deviance=40.89064 
## 
## Node number 57: 162 observations
##   events=3631,  estimated rate=22.38739 , mean deviance=146.4727 
## 
## Node number 58: 22 observations
##   events=45,  estimated rate=2.068884 , mean deviance=8.404232 
## 
## Node number 59: 7 observations
##   events=1850,  estimated rate=255.8677 , mean deviance=1028.826

rpart.plot(arvore)

7.1 Predição de custo humano por multiplas variáveis

7.1.1 Predição de custo financeiro por custo humano

7.1.1.1 Ajuste do modelo

ajuste <- lm(DamageCost_million^2 ~ HumanCost^2, data = trainVariable)

plot(ajuste)

7.1.1.2 Sumário dos valores

summary(ajuste)

## 
## Call:
## lm(formula = DamageCost_million^2 ~ HumanCost^2, data = trainVariable)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7931    -65    -65    -65 211535 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   65.313     46.462   1.406     0.16    
## HumanCost      8.643      1.210   7.141 1.02e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3785 on 6716 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.007536,   Adjusted R-squared:  0.007388 
## F-statistic: 50.99 on 1 and 6716 DF,  p-value: 1.025e-12

7.1.1.3 Predição

trainVariable$predito <- predict(ajuste, newdata = trainVariable)

trainVariable %>% dplyr::select(HumanCost, DamageCost_million, predito) %>% head() %>% knitr::kable()

HumanCost	DamageCost_million	predito
0	0.5	65.31251
0	0.0	65.31251
0	0.2	65.31251
0	0.0	65.31251
0	0.0	65.31251
37	3.5	385.11135

7.1.1.4 Ajuste do modelo

Aplicando regressão de poisson com as variáveis

fit2 <- glm(formula = HumanCost ~ WindSpeed_m_s +  WaterDepth_m + DrillDepth_km + as.factor(AccidentCategory) + as.factor(Damage) + as.factor(MainEvent) - 1, family = "poisson", data = trainVariable)

summary(fit2)

## 
## Call:
## glm(formula = HumanCost ~ WindSpeed_m_s + WaterDepth_m + DrillDepth_km + 
##     as.factor(AccidentCategory) + as.factor(Damage) + as.factor(MainEvent) - 
##     1, family = "poisson", data = trainVariable)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -19.224   -1.234   -0.090    0.000   65.463  
## 
## Coefficients:
##                                                      Estimate Std. Error
## WindSpeed_m_s                                       2.234e-02  9.337e-04
## WaterDepth_m                                        5.409e-03  1.719e-04
## DrillDepth_km                                       5.642e-03  1.067e-03
## as.factor(AccidentCategory)Accident                -1.433e+01  7.118e+02
## as.factor(AccidentCategory)Incidnt/haz.sit         -1.770e+01  7.118e+02
## as.factor(AccidentCategory)Near miss               -3.521e+01  1.121e+03
## as.factor(AccidentCategory)Unsignificant           -3.360e+01  7.927e+02
## as.factor(Damage)Minor damage                      -8.415e-01  1.308e-01
## as.factor(Damage)Severe damage                      7.826e-01  1.248e-01
## as.factor(Damage)Significant damage                 7.824e-02  1.301e-01
## as.factor(Damage)Total loss                         2.921e+00  1.221e-01
## as.factor(MainEvent)Blowout                         1.471e+01  7.118e+02
## as.factor(MainEvent)Breakage or fatigue             1.216e+01  7.118e+02
## as.factor(MainEvent)Capsizing,overturn,toppling     1.565e+01  7.118e+02
## as.factor(MainEvent)Collision,not offshore units    1.296e+01  7.118e+02
## as.factor(MainEvent)Collision,offshore units        1.468e+01  7.118e+02
## as.factor(MainEvent)Crane accident                  1.682e+00  1.130e+03
## as.factor(MainEvent)Explosion                       1.829e+01  7.118e+02
## as.factor(MainEvent)Falling load / Dropped object   1.581e+01  7.118e+02
## as.factor(MainEvent)Fire                            1.473e+01  7.118e+02
## as.factor(MainEvent)Grounding                       1.400e+01  7.118e+02
## as.factor(MainEvent)Helicopter accident             1.782e+01  7.118e+02
## as.factor(MainEvent)Leakage into hull              -4.400e+00  1.071e+03
## as.factor(MainEvent)List, uncontrolled inclination -4.068e+00  9.217e+02
## as.factor(MainEvent)Loss of buoyancy or sinking     1.261e+01  7.118e+02
## as.factor(MainEvent)Machinery/propulsion failure    2.435e-01  5.762e+03
## as.factor(MainEvent)Other                           1.516e+01  7.118e+02
## as.factor(MainEvent)Release of fluid or gas         5.354e-01  7.185e+02
## as.factor(MainEvent)Towline failure/rupture         1.755e+01  7.118e+02
## as.factor(MainEvent)Well problem, no blowout       -1.795e+00  9.921e+02
##                                                    z value Pr(>|z|)    
## WindSpeed_m_s                                       23.923  < 2e-16 ***
## WaterDepth_m                                        31.466  < 2e-16 ***
## DrillDepth_km                                        5.288 1.24e-07 ***
## as.factor(AccidentCategory)Accident                 -0.020    0.984    
## as.factor(AccidentCategory)Incidnt/haz.sit          -0.025    0.980    
## as.factor(AccidentCategory)Near miss                -0.031    0.975    
## as.factor(AccidentCategory)Unsignificant            -0.042    0.966    
## as.factor(Damage)Minor damage                       -6.432 1.26e-10 ***
## as.factor(Damage)Severe damage                       6.269 3.64e-10 ***
## as.factor(Damage)Significant damage                  0.602    0.547    
## as.factor(Damage)Total loss                         23.924  < 2e-16 ***
## as.factor(MainEvent)Blowout                          0.021    0.984    
## as.factor(MainEvent)Breakage or fatigue              0.017    0.986    
## as.factor(MainEvent)Capsizing,overturn,toppling      0.022    0.982    
## as.factor(MainEvent)Collision,not offshore units     0.018    0.985    
## as.factor(MainEvent)Collision,offshore units         0.021    0.984    
## as.factor(MainEvent)Crane accident                   0.001    0.999    
## as.factor(MainEvent)Explosion                        0.026    0.980    
## as.factor(MainEvent)Falling load / Dropped object    0.022    0.982    
## as.factor(MainEvent)Fire                             0.021    0.983    
## as.factor(MainEvent)Grounding                        0.020    0.984    
## as.factor(MainEvent)Helicopter accident              0.025    0.980    
## as.factor(MainEvent)Leakage into hull               -0.004    0.997    
## as.factor(MainEvent)List, uncontrolled inclination  -0.004    0.996    
## as.factor(MainEvent)Loss of buoyancy or sinking      0.018    0.986    
## as.factor(MainEvent)Machinery/propulsion failure     0.000    1.000    
## as.factor(MainEvent)Other                            0.021    0.983    
## as.factor(MainEvent)Release of fluid or gas          0.001    0.999    
## as.factor(MainEvent)Towline failure/rupture          0.025    0.980    
## as.factor(MainEvent)Well problem, no blowout        -0.002    0.999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 48062  on 725  degrees of freedom
## Residual deviance: 16912  on 695  degrees of freedom
##   (6008 observations deleted due to missingness)
## AIC: 17324
## 
## Number of Fisher Scoring iterations: 16

plot(fit2)

## Warning: not plotting observations with leverage one:
##   318

## Warning: not plotting observations with leverage one:
##   318

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produzidos

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produzidos

7.1.1.5 Árvore regressiva do modelo

library(rpart)
library(rpart.plot)

arvore_V_E <- rpart(HumanCost ~ WindSpeed_m_s + WaterDepth_m + DrillDepth_km + as.factor(AccidentCategory) , data = trainVariable)

print("Esses são os percentis do custo humano", summary(trainVariable$HumanCost))

## [1] "Esses são os percentis do custo humano"

print("Esses são os percentis do custo humano", summary(trainVariable$WindSpeed_m_s))

## [1] "Esses são os percentis do custo humano"

summary(arvore_V_E)

## Call:
## rpart(formula = HumanCost ~ WindSpeed_m_s + WaterDepth_m + DrillDepth_km + 
##     as.factor(AccidentCategory), data = trainVariable)
##   n=6718 (15 observations deleted due to missingness)
## 
##           CP nsplit rel error   xerror      xstd
## 1 0.02661014      0 1.0000000 1.000305 0.4070489
## 2 0.02367356      4 0.8881897 1.090430 0.4010282
## 3 0.01817880      5 0.8645161 1.097191 0.4009100
## 4 0.01000000      6 0.8463373 1.134120 0.4022135
## 
## Variable importance
##                WaterDepth_m               WindSpeed_m_s 
##                          66                          22 
## as.factor(AccidentCategory) 
##                          12 
## 
## Node number 1: 6718 observations,    complexity param=0.02661014
##   mean=4.269723, MSE=1455.293 
##   left son=2 (4119 obs) right son=3 (2599 obs)
##   Primary splits:
##       as.factor(AccidentCategory) splits as  RLLL,      improve=0.0177621900, (0 missing)
##       WindSpeed_m_s               < 7.5   to the left,  improve=0.0070606170, (0 missing)
##       WaterDepth_m                < 144.5 to the right, improve=0.0010778260, (3200 missing)
##       DrillDepth_km               < 3.65  to the right, improve=0.0006550697, (5216 missing)
##   Surrogate splits:
##       WindSpeed_m_s < 0.5   to the left,  agree=0.638, adj=0.063, (0 split)
## 
## Node number 2: 4119 observations
##   mean=0.2311241, MSE=11.24496 
## 
## Node number 3: 2599 observations,    complexity param=0.02661014
##   mean=10.67026, MSE=3677.063 
##   left son=6 (2190 obs) right son=7 (409 obs)
##   Primary splits:
##       WindSpeed_m_s < 7.5   to the left,  improve=0.007185765, (0 missing)
##       WaterDepth_m  < 66.5  to the left,  improve=0.003849976, (1488 missing)
##       DrillDepth_km < 3.15  to the left,  improve=0.002623320, (1734 missing)
## 
## Node number 6: 2190 observations
##   mean=8.448858, MSE=694.4793 
## 
## Node number 7: 409 observations,    complexity param=0.02661014
##   mean=22.56479, MSE=19479.47 
##   left son=14 (361 obs) right son=15 (48 obs)
##   Primary splits:
##       WaterDepth_m  < 64.5  to the left,  improve=0.06229407, (181 missing)
##       DrillDepth_km < 3.05  to the left,  improve=0.02729433, (168 missing)
##       WindSpeed_m_s < 8.5   to the right, improve=0.01114019, (0 missing)
## 
## Node number 14: 361 observations
##   mean=9.33795, MSE=2758.833 
## 
## Node number 15: 48 observations,    complexity param=0.02661014
##   mean=122.0417, MSE=134021.2 
##   left son=30 (15 obs) right son=31 (33 obs)
##   Primary splits:
##       WaterDepth_m  < 144.5 to the right, improve=0.04860182, (0 missing)
##       WindSpeed_m_s < 35    to the left,  improve=0.02880777, (0 missing)
##       DrillDepth_km < 2.25  to the left,  improve=0.02426029, (22 missing)
## 
## Node number 30: 15 observations
##   mean=2.333333, MSE=18.08889 
## 
## Node number 31: 33 observations,    complexity param=0.02367356
##   mean=176.4545, MSE=185457.3 
##   left son=62 (25 obs) right son=63 (8 obs)
##   Primary splits:
##       WindSpeed_m_s < 35    to the left,  improve=0.03781780, (0 missing)
##       WaterDepth_m  < 95.5  to the left,  improve=0.02596779, (0 missing)
##       DrillDepth_km < 0.965 to the left,  improve=0.01239007, (14 missing)
## 
## Node number 62: 25 observations,    complexity param=0.0181788
##   mean=129.08, MSE=181618.7 
##   left son=124 (18 obs) right son=125 (7 obs)
##   Primary splits:
##       WaterDepth_m  < 98    to the left,  improve=0.03914308, (0 missing)
##       WindSpeed_m_s < 10.5  to the right, improve=0.03914308, (0 missing)
##       DrillDepth_km < 0.435 to the right, improve=0.02035497, (10 missing)
##   Surrogate splits:
##       WindSpeed_m_s < 8.5   to the right, agree=0.76, adj=0.143, (0 split)
## 
## Node number 63: 8 observations
##   mean=324.5, MSE=168521.8 
## 
## Node number 124: 18 observations
##   mean=76.5, MSE=79398.25 
## 
## Node number 125: 7 observations
##   mean=264.2857, MSE=419081.6

rpart.plot(arvore_V_E)

7.2 Sem outliers

7.2.1 Remover outliers

outliers <- boxplot(trainVariable$HumanCost, plot=FALSE)$out

no_outliers <- trainVariable 
no_outliers <-no_outliers[which(no_outliers$HumanCost %in% outliers),]

ver <- lm(formula = HumanCost ~ WindSpeed_m_s + as.factor(MainEvent) +  - 1, data = no_outliers)
dim(no_outliers)

## [1] 872  44

## Confidence interval
sumCoef <- summary(ver)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = ver$df) * sumCoef[1,2]

## [1] 3.59840 6.07494

sumCoef[2,1] + c(-1, 1) * qt(.975, df = ver$df) * sumCoef[2,2]

## [1] -76.54902 118.53901

arvore <- rpart(HumanCost ~ WindSpeed_m_s, data = no_outliers)

print("Esses são os percentis do custo humano", summary(trainVariable$HumanCost))

## [1] "Esses são os percentis do custo humano"

print("Esses são os percentis do custo humano", summary(trainVariable$WindSpeed_m_s))

## [1] "Esses são os percentis do custo humano"

summary(arvore)

## Call:
## rpart(formula = HumanCost ~ WindSpeed_m_s, data = no_outliers)
##   n= 872 
## 
##           CP nsplit rel error    xerror      xstd
## 1 0.10360931      0 1.0000000 1.0010964 0.4265433
## 2 0.03476275      1 0.8963907 1.0123428 0.4200013
## 3 0.02040746      2 0.8616279 1.0026011 0.4152389
## 4 0.01000000      3 0.8412205 0.9834447 0.4109972
## 
## Variable importance
## WindSpeed_m_s 
##           100 
## 
## Node number 1: 872 observations,    complexity param=0.1036093
##   mean=32.8945, MSE=10270.17 
##   left son=2 (811 obs) right son=3 (61 obs)
##   Primary splits:
##       WindSpeed_m_s < 7.5  to the left,  improve=0.1036093, (0 missing)
## 
## Node number 2: 811 observations
##   mean=23.94821, MSE=1551.368 
## 
## Node number 3: 61 observations,    complexity param=0.03476275
##   mean=151.8361, MSE=110976.1 
##   left son=6 (52 obs) right son=7 (9 obs)
##   Primary splits:
##       WindSpeed_m_s < 35.5 to the left,  improve=0.04598845, (0 missing)
## 
## Node number 6: 52 observations,    complexity param=0.02040746
##   mean=122.1154, MSE=99197.41 
##   left son=12 (45 obs) right son=13 (7 obs)
##   Primary splits:
##       WindSpeed_m_s < 9    to the right, improve=0.03543067, (0 missing)
## 
## Node number 7: 9 observations
##   mean=323.5556, MSE=144439.8 
## 
## Node number 12: 45 observations
##   mean=98.73333, MSE=46034.95 
## 
## Node number 13: 7 observations
##   mean=272.4286, MSE=414847.4

rpart.plot(arvore)

Modeloagem - Aprendizado Estatístico http://material.curso-r.com/modelos/

https://data.library.virginia.edu/diagnostic-plots/

Woad Database Analysis

Viviane Schneider

17 de Maio de 2020.