Submitted to:
Md. Atikur Rahman
Assistant Professor
Department of Statistics
Jahangirnagar University
EDA by: Jobaida Gulsanara Jarin (ID: 20215018)
Batch: 05(B)
EDA Date: 28th March 2022
Course: Data Mining (PM-ASDS18)
Program: Professional Masters in Applied Statistics and Data Science (ASDS)
Installing Packages
#install.packages("car")
#install.packages("tidyverse")
#install.packages("e1071")
#install.packages("caTools")
#install.packages("caret")
#install.packages("modeest")
#install.packages("ggplot2")
#install.packages("lubridate", dependencies=TRUE)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
Loading packages
library(e1071)
library(caTools)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(car)
## Loading required package: carData
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x dplyr::recode() masks car::recode()
## x lubridate::setdiff() masks base::setdiff()
## x purrr::some() masks car::some()
## x lubridate::union() masks base::union()
library(modeest)
## Registered S3 method overwritten by 'rmutil':
## method from
## print.response httr
##
## Attaching package: 'modeest'
## The following object is masked from 'package:e1071':
##
## skewness
library(here)
## here() starts at /cloud/project
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
Research Problem: We are required to predict the cost of constructing further reactors.
Since our main objective is to predict the cost of constructing further reactors, a model is required to make such predictions. Linear regression, logistic regression, and non-linear regression models are commonly applied to prediction problems of this kind. Each model rests on assumptions, and before using any model we must check the validity of all of its underlying assumptions. Based on the assumptions that hold, we can find an appropriate model for our dataset. Since the response variable (cost) depends on multiple variables, a multiple linear regression model is most suitable for this dataset.
At this point, exploratory data analysis can be performed as an initial investigation of the data: to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. The following sections present the EDA for our dataset.
Observing the Nuclear Reactor Construction Data:
Loading dataset:
project_data = read.csv("nr_construction_data.csv")
head(project_data)
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 1 Belleville May, 1980 3901 22 97 1310 1 1 1 0 13 1 3.591176
## 2 Belleville Aug, 1980 3744 25 101 1310 2 1 1 0 13 1 3.573336
## 3 Blayais Jan, 1977 1870 9 59 910 1 1 0 1 14 1 3.271842
## 4 Blayais Jan, 1977 1392 9 73 910 2 1 0 1 14 1 3.143639
## 5 Blayais Apr, 1978 2818 17 55 910 3 1 0 1 6 1 3.449941
## 6 Blayais Apr, 1978 2555 17 66 910 4 1 0 1 6 1 3.407391
skim_without_charts(project_data)
| Name | project_data |
| Number of rows | 58 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Location | 0 | 1 | 5 | 15 | 0 | 19 | 0 |
| D | 0 | 1 | 9 | 9 | 0 | 47 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Cost | 0 | 1 | 6097.28 | 9791.85 | 726.00 | 1586.5 | 3157.5 | 6224.00 | 57729.00 |
| T1 | 0 | 1 | 15.33 | 4.10 | 8.00 | 13.0 | 15.0 | 17.00 | 25.00 |
| T2 | 0 | 1 | 85.76 | 27.65 | 55.00 | 69.0 | 77.5 | 91.75 | 196.00 |
| S | 0 | 1 | 1088.45 | 224.78 | 880.00 | 910.0 | 915.0 | 1310.00 | 1500.00 |
| RN | 0 | 1 | 2.33 | 1.26 | 1.00 | 1.0 | 2.0 | 3.00 | 6.00 |
| North | 0 | 1 | 0.72 | 0.45 | 0.00 | 0.0 | 1.0 | 1.00 | 1.00 |
| CT | 0 | 1 | 0.69 | 0.47 | 0.00 | 0.0 | 1.0 | 1.00 | 1.00 |
| CP | 0 | 1 | 0.59 | 0.50 | 0.00 | 0.0 | 1.0 | 1.00 | 1.00 |
| N | 0 | 1 | 8.24 | 6.06 | 1.00 | 4.0 | 6.0 | 12.75 | 34.00 |
| JP | 0 | 1 | 0.55 | 0.50 | 0.00 | 0.0 | 1.0 | 1.00 | 1.00 |
| log.cost | 0 | 1 | 3.54 | 0.42 | 2.86 | 3.2 | 3.5 | 3.79 | 4.76 |
str(project_data)
## 'data.frame': 58 obs. of 13 variables:
## $ Location: chr "Belleville" "Belleville" "Blayais" "Blayais" ...
## $ D : chr "May, 1980" "Aug, 1980" "Jan, 1977" "Jan, 1977" ...
## $ Cost : int 3901 3744 1870 1392 2818 2555 1215 839 955 1114 ...
## $ T1 : int 22 25 9 9 17 17 21 13 11 12 ...
## $ T2 : int 97 101 59 73 55 66 76 66 61 66 ...
## $ S : int 1310 1310 910 910 910 910 910 910 880 880 ...
## $ RN : int 1 2 1 2 3 4 2 3 4 5 ...
## $ North : int 1 1 1 1 1 1 0 0 0 0 ...
## $ CT : int 1 1 0 0 0 0 1 1 1 1 ...
## $ CP : int 0 0 1 1 1 1 1 1 1 1 ...
## $ N : int 13 13 14 14 6 6 4 21 6 6 ...
## $ JP : int 1 1 1 1 1 1 0 0 1 1 ...
## $ log.cost: num 3.59 3.57 3.27 3.14 3.45 ...
There are 58 observations of 13 variables, where D (Date) is formatted as character and North, CP, CT, JP are formatted as integer. To satisfy the assumptions of the multiple linear regression model, we need to transform the D variable to integer format and North, CP, CT, JP to factor format.
# Transform Character to Date type
project_data$D <- lubridate::my(project_data$D)
# Transform Date to integer by extracting the year
project_data$D <- year(project_data$D)
project_data$D
## [1] 1980 1980 1977 1977 1978 1978 1972 1973 1974 1974 1979 1980 1982 1983 1977
## [16] 1977 1980 1981 1984 1984 1988 1991 1978 1978 1979 1979 1975 1975 1975 1975
## [31] 1971 1972 1979 1980 1982 1984 1975 1975 1975 1976 1979 1979 1981 1982 1977
## [46] 1978 1979 1980 1982 1984 1979 1979 1976 1976 1974 1974 1975 1975
A factor is a class that has multiple class labels, or levels. The North, CT, CP, and JP variables are categorical but formatted as integer. Now we will convert them to factors.
project_data[,'North'] <- as.factor(project_data[,'North'])
project_data[,'CT'] <- as.factor(project_data[,'CT'])
project_data[,'CP'] <- as.factor(project_data[,'CP'])
project_data[,'JP'] <- as.factor(project_data[,'JP'])
sapply(project_data, class)
## Location D Cost T1 T2 S
## "character" "numeric" "integer" "integer" "integer" "integer"
## RN North CT CP N JP
## "integer" "factor" "factor" "factor" "integer" "factor"
## log.cost
## "numeric"
Now the variables are in correct format to perform statistical analysis.
Summary statistics:
summary(project_data)
## Location D Cost T1
## Length:58 Min. :1971 Min. : 726 Min. : 8.00
## Class :character 1st Qu.:1975 1st Qu.: 1586 1st Qu.:13.00
## Mode :character Median :1978 Median : 3158 Median :15.00
## Mean :1978 Mean : 6097 Mean :15.33
## 3rd Qu.:1980 3rd Qu.: 6224 3rd Qu.:17.00
## Max. :1991 Max. :57729 Max. :25.00
## T2 S RN North CT CP
## Min. : 55.00 Min. : 880 Min. :1.000 0:16 0:18 0:24
## 1st Qu.: 69.00 1st Qu.: 910 1st Qu.:1.000 1:42 1:40 1:34
## Median : 77.50 Median : 915 Median :2.000
## Mean : 85.76 Mean :1088 Mean :2.328
## 3rd Qu.: 91.75 3rd Qu.:1310 3rd Qu.:3.000
## Max. :196.00 Max. :1500 Max. :6.000
## N JP log.cost
## Min. : 1.000 0:26 Min. :2.861
## 1st Qu.: 4.000 1:32 1st Qu.:3.200
## Median : 6.000 Median :3.497
## Mean : 8.241 Mean :3.540
## 3rd Qu.:12.750 3rd Qu.:3.794
## Max. :34.000 Max. :4.761
Data Description: The data file gives information relating to the construction of 58 nuclear reactors in France. The data has 58 rows and 13 columns.
Table 1 represents all the variables description with their value level, level of measurements, and suitable measures.
| Variable Name | Variable Description | Value Level | Level of measurements | Appropriate measures |
|---|---|---|---|---|
| Location | NR location | In France | Nominal | Mode |
| D | Construction start date | 1971 to 1991 | Interval | Mean, Median, Mode |
| Cost | Construction cost (million EUR) | 726 to 57729 million EUR | Ratio | Mean, Median, Mode |
| T1 | Length of application process (months) | 8 to 25 months | Interval | Median |
| T2 | Length of construction process (months) | 55 to 196 months | Interval | Median |
| S | Power plant net capacity (MWe) | 880 to 1500 megawatts electric (one million watts of electric capacity) | Interval | Median |
| RN | Reactor number | 1 to 6 | Ratio | Mean, Median, Mode |
| North | Plant constructed in the north of France | 0 = constructed outside the north, 1 = constructed in the north | Nominal | Mode |
| CT | Use of cooling tower | 0 = No, 1 = Yes | Nominal | Mode |
| CP | Reactor type | 0 = pressurized-water reactor (PWR), 1 = boiling-water reactor (BWR) | Nominal | Mode |
| N | Number of architects participating in a nuclear power plant construction | 1 to 34 architects | Ratio | Median |
| JP | Joint planning with another reactor | 0 = No, 1 = Yes | Nominal | Mode |
| log.cost | log(Cost) | Logarithmic value of cost | Ratio | Mean, Median, Mode |
Objectives for EDA: Construction cost, which depends on several independent variables, is the main response variable for our study, since we must predict the construction cost using the remaining variables. To work toward this main objective, we begin by computing summary measures for each variable.
Calculating standard deviation:
std_D = sd(project_data$D)
cat("Standard Deviation of D =", std_D, "\n")
## Standard Deviation of D = 3.875287
std_cost = sd(project_data$Cost)
cat("Standard Deviation of Cost =", std_cost, "\n")
## Standard Deviation of Cost = 9791.851
std_T1 = sd(project_data$T1)
cat("Standard Deviation of T1 =", std_T1, "\n")
## Standard Deviation of T1 = 4.097016
std_T2 = sd(project_data$T2)
cat("Standard Deviation of T2 =", std_T2, "\n")
## Standard Deviation of T2 = 27.6509
std_S = sd(project_data$S)
cat("Standard Deviation of S =", std_S, "\n")
## Standard Deviation of S = 224.7849
std_RN = sd(project_data$RN)
cat("Standard Deviation of RN =", std_RN, "\n")
## Standard Deviation of RN = 1.261966
std_N = sd(project_data$N)
cat("Standard Deviation of N =", std_N, "\n")
## Standard Deviation of N = 6.061991
std_log.cost = sd(project_data$log.cost)
cat("Standard Deviation of log.cost =", std_log.cost, "\n")
## Standard Deviation of log.cost = 0.4167527
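As a compact alternative, the eight calls above can be collapsed into a single sapply() over the numeric columns; a minimal sketch:

# Compute all standard deviations in one call
num_vars <- c("D", "Cost", "T1", "T2", "S", "RN", "N", "log.cost")
round(sapply(project_data[num_vars], sd), 4)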
Calculating Mode:
getmode <- function(v) {
  uniqv <- unique(v)                           # distinct values of v
  uniqv[which.max(tabulate(match(v, uniqv)))]  # value with the highest frequency
}
mfv(project_data$Location)
## [1] "Graveliness"
#dplyr::count(project_data, D, sort = TRUE)
mfv(project_data$D)
## [1] 1975 1979
cat("Most frequent Date =", mfv(project_data$D), "\n")
## Most frequent Date = 1975 1979
#(dplyr::count(project_data, Cost, sort = TRUE))
print("No mode for Cost")
## [1] "No mode for Cost"
print(mfv(project_data$T1))
## [1] 13 15
cat("Most frequent Length of application process =", mfv(project_data$T1), " months.", "\n")
## Most frequent Length of application process = 13 15 months.
#dplyr::count(project_data, T1, sort = TRUE)
print(mfv(project_data$T2))
## [1] 66
cat("Most frequent Length of Construction process =", mfv(project_data$T2), " months.", "\n")
## Most frequent Length of Construction process = 66 months.
#dplyr::count(project_data, T2, sort = TRUE)
print(mfv(project_data$S))
## [1] 910
cat("Most frequent Power plant net capacity (Mwe): S =", mfv(project_data$S), "Mwe", "\n")
## Most frequent Power plant net capacity (Mwe): S = 910 Mwe
#dplyr::count(project_data, S, sort = TRUE)
mode_RN = getmode(project_data$RN)
cat("Most frequent Reactor Number: RN =", mode_RN, "\n")
## Most frequent Reactor Number: RN = 2
#dplyr::count(project_data, RN, sort = TRUE)
mode_North = getmode(project_data$North)
cat("Most frequent Plant constructed in the north of France: North =", mode_North, "\n")
## Most frequent Plant constructed in the north of France: North = 2
#dplyr::count(project_data, North, sort = TRUE)
mode_CT = getmode(project_data$CT)
cat("Most frequent Use of cooling tower: CT =", mode_CT, "\n")
## Most frequent Use of cooling tower: CT = 2
#dplyr::count(project_data, CT, sort = TRUE)
mode_CP = getmode(project_data$CP)
cat("Most frequent Reactor Type: CP =", mode_CP, "\n")
## Most frequent Reactor Type: CP = 2
#dplyr::count(project_data, CP, sort = TRUE)
mode_N = getmode(project_data$N)
cat("Most frequent Number of power plants constructed by architect: N =", mode_N, "\n")
## Most frequent Number of power plants constructed by architect: N = 4
#dplyr::count(project_data, N, sort = TRUE)
mode_JP = getmode(project_data$JP)
cat("Most frequent Joint planning with another reactor: JP =", mode_JP, "\n")
## Most frequent Joint planning with another reactor: JP = 2
#dplyr::count(project_data, JP, sort = TRUE)
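Note that cat() coerces a factor to its internal integer codes, which is why the binary variables North, CT, CP, and JP print as 2 above: code 2 corresponds to the level "1". Wrapping the mode in as.character() prints the level label instead; a minimal sketch:

# as.character() returns the factor's level label ("1") rather than its code (2)
cat("Most frequent North =", as.character(mode_North), "\n")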
#dplyr::count(project_data, log.cost, sort = TRUE)
print("No mode for log.cost")
## [1] "No mode for log.cost"
| Variable Name | Mean | Median | Mode | Standard Deviation |
|---|---|---|---|---|
| Location (NR location) | - | - | Most NRs were constructed in "Graveliness". | - |
| D (Construction start date) | On average the construction year is 1978. | Half of the constructions started before 1978 and half after. | The most common construction years are 1975 and 1979. | Construction years vary around the mean by about 4 years. |
| Cost (Construction cost) | On average the construction cost is 6097 million EUR. | Half of the construction costs are below 3158 million EUR and half above. | No mode (all values are unique). | Costs vary around the mean by about 9791.851 million EUR. |
| T1 (Length of application process, months) | On average the application process takes 15.33 months. | Half of the application processes take less than 15.00 months and half take more. | The most common application lengths are 13 and 15 months. | Application lengths vary around the mean by about 4.097 months. |
| T2 (Length of construction process, months) | On average the construction process takes 85.76 months. | Half of the construction processes take less than 77.50 months and half take more. | The most common construction length is 66 months. | Construction lengths vary around the mean by about 27.651 months. |
| S (Power plant net capacity, MWe) | On average the net capacity is 1088 MWe. | Half of the plants have capacity below 915 MWe and half above. | The most common net capacity is 910 MWe. | Capacities vary around the mean by about 224.785 MWe. |
| RN (Reactor number) | On average the reactor number is slightly above 2. | Half of the reactor numbers are below 2 and half above. | The most common reactor number is 2. | Reactor numbers vary around the mean by about 1.26. |
| North (Plant constructed in the north of France) | - | - | 1: most NRs were constructed in the north of France. | - |
| CT (Use of cooling tower) | - | - | 1: most NRs use a cooling tower. | - |
| CP (Reactor type) | - | - | 1: the most common reactor type is BWR. | - |
| N (Number of architects participating in a construction) | On average about 8 architects participated in a nuclear power plant construction. | Half of the constructions involved fewer than 6 architects and half involved more. | Most commonly 4 architects participated in a construction. | The number of architects varies around the mean by about 6. |
| JP (Joint planning with another reactor) | - | - | 1: most NRs had joint planning with another reactor. | - |
| log.cost (log(Cost)) | On average log.cost is 3.540, i.e. a construction cost of about 10^3.540 million EUR. | Half of the constructions cost below 10^3.497 million EUR and half above. | No mode (all values are unique). | log.cost varies around the mean by about 0.417. |
Data Cleaning: Checking NA values:
sum(is.na(project_data))
## [1] 0
No NA values.
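For a per-column check, colSums() over is.na() confirms that no individual variable has missing values; a one-line sketch:

# All zeros means every column is complete
colSums(is.na(project_data))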
Detecting Outliers: We do not create box plots for the nominal (factor) variables here.
df_boxplot1 = data.frame(project_data$D, project_data$T1, project_data$T2, project_data$S, project_data$RN, project_data$N, project_data$log.cost)
head(df_boxplot1)
## project_data.D project_data.T1 project_data.T2 project_data.S project_data.RN
## 1 1980 22 97 1310 1
## 2 1980 25 101 1310 2
## 3 1977 9 59 910 1
## 4 1977 9 73 910 2
## 5 1978 17 55 910 3
## 6 1978 17 66 910 4
## project_data.N project_data.log.cost
## 1 13 3.591176
## 2 13 3.573336
## 3 14 3.271842
## 4 14 3.143639
## 5 6 3.449941
## 6 6 3.407391
# boxplot for each attribute on one image
par(mfrow=c(1,7))
for(i in 1:7) {
boxplot(df_boxplot1[,i], main=names(df_boxplot1)[i])
}
Outliers are visible in the D, T1, T2, N, and log.cost variables. We now remove these outliers from the dataset.
Removing outliers with IQR:
outliers_T1 <- boxplot(project_data$T1, main="Boxplot", horizontal=TRUE, xlab="T1", cex.axis = 1, cex.main = 1, cex.sub = 1)$out
text(boxplot.stats(project_data$T1)$stats, labels=boxplot.stats(project_data$T1)$stats, y = 1.25, cex = 1)
outliers_T1
## [1] 25 24 24
ls_T1 <- which(project_data$T1 %in% outliers_T1)
ls_T1
## [1] 2 15 16
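The $out component of boxplot() applies Tukey's rule: points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged. A small helper reproducing this rule directly (note that boxplot.stats() uses Tukey hinges rather than type-7 quantiles, so results can differ slightly in edge cases):

# Flag points outside the 1.5*IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * IQR(x)
  x[x < q[1] - fence | x > q[2] + fence]
}
iqr_outliers(project_data$T1)  # matches outliers_T1 above: 25 24 24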
outliers_T2 <- boxplot(project_data$T2, main="Boxplot", horizontal=TRUE, xlab="T2", cex.axis = 0.7, cex.main = 1, cex.sub = 0.8)$out
text(boxplot.stats(project_data$T2)$stats, labels=boxplot.stats(project_data$T2)$stats, y = 1.25, cex = 0.7)
outliers_T2
## [1] 196 196 159 132
ls_T2 <- which(project_data$T2 %in% outliers_T2)
ls_T2
## [1] 19 20 21 22
outliers_N <- boxplot(project_data$N, main="Boxplot", horizontal=TRUE, xlab="N", cex.axis = 0.7, cex.main = 1, cex.sub = 0.8)$out
text(boxplot.stats(project_data$N)$stats, labels=boxplot.stats(project_data$N)$stats, y = 1.25, cex = 0.7)
outliers_N
## [1] 34
ls_N <- which(project_data$N %in% outliers_N)
ls_N
## [1] 11
outliers_log.cost <- boxplot(project_data$log.cost, main="Boxplot", horizontal=TRUE, xlab="log.cost", cex.axis = 0.7, cex.main = 1, cex.sub = 0.8)$out
text(boxplot.stats(project_data$log.cost)$stats, labels=round(boxplot.stats(project_data$log.cost)$stats, 3), y = 1.25, cex = 0.7)
outliers_log.cost
## [1] 4.761394
ls_log.cost <- which(project_data$log.cost %in% outliers_log.cost)
ls_log.cost
## [1] 22
ls = c(ls_T1, ls_T2, ls_N, ls_log.cost)
ls
## [1] 2 15 16 19 20 21 22 11 22
drop_obs = unique(sort(ls))
drop_obs
## [1] 2 11 15 16 19 20 21 22
clean_data <- project_data[-c(drop_obs),]
head(clean_data)
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 1 Belleville 1980 3901 22 97 1310 1 1 1 0 13 1 3.591176
## 3 Blayais 1977 1870 9 59 910 1 1 0 1 14 1 3.271842
## 4 Blayais 1977 1392 9 73 910 2 1 0 1 14 1 3.143639
## 5 Blayais 1978 2818 17 55 910 3 1 0 1 6 1 3.449941
## 6 Blayais 1978 2555 17 66 910 4 1 0 1 6 1 3.407391
## 7 Bugey 1972 1215 21 76 910 2 0 1 1 4 0 3.084576
dim(clean_data)
## [1] 50 13
summary(clean_data)
## Location D Cost T1
## Length:50 Min. :1971 Min. : 726 Min. : 8.00
## Class :character 1st Qu.:1975 1st Qu.: 1564 1st Qu.:12.25
## Mode :character Median :1978 Median : 2752 Median :14.50
## Mean :1978 Mean : 3993 Mean :14.72
## 3rd Qu.:1980 3rd Qu.: 5819 3rd Qu.:17.00
## Max. :1984 Max. :14726 Max. :22.00
## T2 S RN North CT CP
## Min. : 55.00 Min. : 880 Min. :1.00 0:16 0:18 0:18
## 1st Qu.: 68.00 1st Qu.: 910 1st Qu.:1.00 1:34 1:32 1:32
## Median : 76.00 Median : 915 Median :2.00
## Mean : 78.56 Mean :1054 Mean :2.46
## 3rd Qu.: 87.50 3rd Qu.:1310 3rd Qu.:3.00
## Max. :113.00 Max. :1335 Max. :6.00
## N JP log.cost
## Min. : 1.00 0:23 Min. :2.861
## 1st Qu.: 4.00 1:27 1st Qu.:3.194
## Median : 6.00 Median :3.440
## Mean : 7.64 Mean :3.470
## 3rd Qu.:11.50 3rd Qu.:3.764
## Max. :21.00 Max. :4.168
df_boxplot2 = data.frame(clean_data$D, clean_data$T1, clean_data$T2, clean_data$S, clean_data$RN, clean_data$N, clean_data$log.cost)
# boxplot for each attribute on one image
par(mfrow=c(1,7))
for(i in 1:7) {
boxplot(df_boxplot2[,i], main=names(df_boxplot2)[i])
}
So no outliers remain visible in our clean_data data frame.
Plotting factor variables: North, CT, CP and JP:
(levels(clean_data$North))
## [1] "0" "1"
(levels(clean_data$CT))
## [1] "0" "1"
col_names_north = c("Outside North (0)","Inside North (1)")
percentage_north = (prop.table(table(clean_data$North)) * 100)
col_names_CT = c("No use of cooling tower (0)", "Use of cooling tower (1)")
percentage_CT = (prop.table(table(clean_data$CT)) * 100)
# set the plotting area into a 1*2 array
par(mfrow=c(1,2))
pie(table(clean_data$North), labels = paste(col_names_north, percentage_north, "%"), col=grey.colors(length(col_names_north)), main="North or not", cex= 0.8)
pie(table(clean_data$CT), labels = paste(col_names_CT, percentage_CT, "%"), col=grey.colors(length(col_names_CT)), main="Use of cooling tower", cex= 0.8)
North: 32% of plants were constructed outside the north (0) and 68% were constructed in the north of France (1).
CT: 36% of plants do not use a cooling tower (0) and 64% do (1).
(levels(clean_data$CP))
## [1] "0" "1"
(levels(clean_data$JP))
## [1] "0" "1"
col_names_CP = c("BWR (0)", "PWR (1)")
percentage_CP = round((prop.table(table(clean_data$CP)) * 100), 3)
col_names_JP = c("No Joint Planning (0)","Joint Planning (1)")
percentage_JP = round((prop.table(table(clean_data$JP)) * 100), 3)
# set the plotting area into a 1*2 array
par(mfrow=c(1,2))
pie(table(clean_data$CP), labels = paste(col_names_CP, percentage_CP, "%"), col=grey.colors(length(col_names_CP)), main="Water reactor", cex= 0.8)
pie(table(clean_data$JP), labels = paste(col_names_JP, percentage_JP, "%"), col=grey.colors(length(col_names_JP)), main="Joint planning with another reactor", cex=0.8)
CP: 36% are boiling-water reactors (BWR, coded 0) and 64% are pressurized-water reactors (PWR, coded 1).
JP: 46% do not have joint planning with another reactor (0), while 54% do (1).
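The janitor package loaded earlier offers a one-line tally that avoids the manual prop.table() arithmetic used above; a sketch for JP:

# tabyl() returns counts and proportions for each factor level
tabyl(clean_data, JP)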
Distribution of the observations with density curve:
m_T1 = mean(clean_data$T1)
std_T1 = sqrt(var(clean_data$T1))
m = mean(clean_data$T2)
std = sqrt(var(clean_data$T2))
m_S = mean(clean_data$S)
std_S = sqrt(var(clean_data$S))
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
hist(clean_data$T1, prob=T, main="T1", breaks = 50, cex.axis = 0.8, cex.lab = 0.9, cex.main=1, cex.sub = 0.8)
curve(dnorm(x, mean = m_T1, sd = std_T1), col = "darkblue", lwd =2, add = TRUE)
hist(clean_data$T2, prob=T, main="T2", breaks = 30, cex.axis = 0.8, cex.lab = 0.9, cex.main=1, cex.sub = 0.8)
curve(dnorm(x, mean = m, sd = std), col = "darkblue", lwd =2, add = TRUE)
hist(clean_data$S, prob=T, main="S", breaks = 100, cex.axis = 0.8, cex.lab = 0.9, cex.main=1, cex.sub = 0.8)
curve(dnorm(x, mean = m_S, sd = std_S), col = "darkblue", lwd =2, add = TRUE)
m_RN = mean(clean_data$RN)
std_RN = sqrt(var(clean_data$RN))
m_log.cost = mean(clean_data$log.cost)
std_log.cost = sqrt(var(clean_data$log.cost))
m_D = mean(clean_data$D)
std_D = sqrt(var(clean_data$D))
# set the plotting area into a 1*4 array
par(mfrow=c(1,3))
hist(clean_data$RN, prob=T, main="RN", breaks = 50, cex.axis = 0.7, cex.lab = 0.8, cex.main=1, cex.sub = 0.7)
curve(dnorm(x, mean = m_RN, sd = std_RN), col = "darkblue", lwd =2, add = TRUE)
hist(clean_data$log.cost, prob=T, main="log.cost", breaks = 100, cex.axis = 0.7, cex.lab = 0.8, cex.main=1, cex.sub = 0.7)
curve(dnorm(x, mean = m_log.cost, sd = std_log.cost), col = "darkblue", lwd =2, add = TRUE)
hist(clean_data$D, main="Date", breaks=50, cex.axis = 0.7, cex.lab = 0.8, cex.main=1, cex.sub = 0.7)
curve(dnorm(x, mean = m_D, sd = std_D), col = "darkblue", lwd =2, add = TRUE)
The variables are mostly approximately normally distributed.
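Symmetry can also be checked numerically. Since modeest masks e1071's skewness() (see the package-loading messages above), the namespace must be given explicitly; a minimal sketch:

# Skewness near 0 suggests an approximately symmetric distribution
sapply(clean_data[c("T1", "T2", "S", "RN", "N", "log.cost")],
       e1071::skewness)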
Multiple Linear Regression:
Fit a suitable model to predict the cost of nuclear reactor construction in France.
The multiple linear regression model will predict the dependent variable log.cost using a regression equation based on the other independent variables.
The equation of multiple linear regression is:
y = β0 + β1x1 + β2x2 + β3x3 + … + βnxn + ε
where β0 is the intercept, β1, β2, …, βn are the slope coefficients, and ε is the error (residual).
The equation above is used to predict the value of the target variable based on the given predictor variables.
When we run a multiple regression, there are several assumptions our data must meet in order for the analysis to be reliable and valid.
Assumption #1: The relationship between the IVs and the DV is linear.
The first assumption of Multiple Regression is that the relationship between the IVs and the DV can be characterised by a straight line. A simple way to check this is by producing scatterplots of the relationship between each of our IVs and our DV.
To fully test the assumption of linearity, we produce a scatterplot with a fitted regression line for each of the IVs against the DV in Figure-7, Figure-8 and Figure-9.
Looking at the scatterplots with regression lines, we can see that the relationship between each IV and the DV could be modelled by a straight line, suggesting that these relationships are linear.
Linearity with scatterplot:
x_D <- clean_data$D
x_T1 <- clean_data$T1
x_T2 <- clean_data$T2
y_log.cost <- clean_data$log.cost
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
# Plot with main and axis titles
# Change point shape (pch = 19) and remove frame.
# Add regression line
plot(x_D, y_log.cost, main = "Date - log.cost", xlab = "Date", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_D, data = clean_data), col = "blue")
plot(x_T1, y_log.cost, main = "T1 - log.cost", xlab = "T1", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_T1, data = clean_data), col = "blue")
plot(x_T2, y_log.cost, main = "T2 - log.cost", xlab = "T2", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_T2, data = clean_data), col = "blue")
x_S <- clean_data$S
x_RN <- clean_data$RN
x_North <- clean_data$North
y_log.cost <- clean_data$log.cost
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
# Plot with main and axis titles
# Change point shape (pch = 19) and remove frame.
# Add regression line
plot(x_S, y_log.cost, main = "S - log.cost", xlab = "S", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_S, data = clean_data), col = "blue")
plot(x_RN, y_log.cost, main = "RN - log.cost", xlab = "RN", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_RN, data = clean_data), col = "blue")
plot(x_North, y_log.cost, main = "North - log.cost", xlab = "North", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_North, data = clean_data), col = "blue")
x_CT <- clean_data$CT
x_CP <- clean_data$CP
x_N <- clean_data$N
y_log.cost <- clean_data$log.cost
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
# Plot with main and axis titles
# Change point shape (pch = 19) and remove frame.
# Add regression line
plot(x_CT, y_log.cost, main = "CT - log.cost", xlab = "CT", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_CT, data = clean_data), col = "blue")
plot(x_CP, y_log.cost, main = "CP - log.cost", xlab = "CP", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_CP, data = clean_data), col = "blue")
plot(x_N, y_log.cost, main = "N - log.cost", xlab = "N", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_N, data = clean_data), col = "blue")
x_JP <- clean_data$JP
y_log.cost <- clean_data$log.cost
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
# Plot with main and axis titles
# Change point shape (pch = 19) and remove frame.
# Add regression line
plot(x_JP, y_log.cost, main = "JP - log.cost", xlab = "JP", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_JP, data = clean_data), col = "blue")
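All of these panels can also be generated at once with base R's pairs(), which draws a scatterplot matrix of the numeric variables; a compact alternative sketch:

# Scatterplot matrix: each numeric predictor against log.cost (and each other)
pairs(clean_data[c("D", "T1", "T2", "S", "RN", "N", "log.cost")], pch = 19)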
Linear Relation with scatterplot and regression line:
| Variable (vs log.cost) | Association | Comment |
|---|---|---|
| D | Linear relationship, uphill (positive slope) | As D increases, log.cost tends to increase. |
| T1 | Linear relationship, uphill (positive slope) | As T1 increases, log.cost tends to increase. |
| T2 | Linear relationship, uphill (positive slope) | As T2 increases, log.cost tends to increase. |
| S | Linear relationship, uphill (positive slope) | As S increases, log.cost tends to increase. |
| RN | No linear relationship | - |
| North | No linear relationship | - |
| CT | No linear relationship | - |
| CP | Boxplot (factor variable) | log.cost tends to be higher when CP = 0 and lower when CP = 1. |
| JP | Boxplot (factor variable) | log.cost tends to be higher when JP = 0 and lower when JP = 1. |
Linear relation with Pearson's correlation coefficient: We compute correlation coefficients for the numeric variables D, T1, T2, S, RN, N, and log.cost, and skip the character and factor variables (Location, North, CT, CP, JP).
df_cor = data.frame(clean_data$D, clean_data$T1, clean_data$T2, clean_data$S, clean_data$RN, clean_data$N, clean_data$log.cost)
cor1 = round(cor(df_cor),3)
cor1
## clean_data.D clean_data.T1 clean_data.T2 clean_data.S
## clean_data.D 1.000 0.293 0.630 0.710
## clean_data.T1 0.293 1.000 0.236 0.256
## clean_data.T2 0.630 0.236 1.000 0.800
## clean_data.S 0.710 0.256 0.800 1.000
## clean_data.RN 0.009 0.083 -0.330 -0.307
## clean_data.N 0.022 -0.252 -0.195 -0.290
## clean_data.log.cost 0.953 0.362 0.707 0.782
## clean_data.RN clean_data.N clean_data.log.cost
## clean_data.D 0.009 0.022 0.953
## clean_data.T1 0.083 -0.252 0.362
## clean_data.T2 -0.330 -0.195 0.707
## clean_data.S -0.307 -0.290 0.782
## clean_data.RN 1.000 0.161 -0.080
## clean_data.N 0.161 1.000 -0.158
## clean_data.log.cost -0.080 -0.158 1.000
Interpretation of the correlation coefficients (Karl Pearson's correlation coefficient), Table-4:

| Corr. with log.cost | Corr. Value | Comment |
|---|---|---|
| D | 0.953 (strongly positive) | Where D is high, log.cost is high. |
| T1 | 0.362 (weakly positive) | Where T1 is high, log.cost is high. |
| T2 | 0.707 (moderately positive) | Where T2 is high, log.cost is high. |
| S | 0.782 (moderately positive) | Where S is high, log.cost is high. |
| RN | -0.080 (no correlation) | - |
| N | -0.158 (weakly negative) | Where N is high, log.cost is low. |
So, several independent variables have a linear correlation with the dependent variable, and we can therefore build a multiple linear regression model.
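Whether any single coefficient differs significantly from zero can be tested with cor.test(); a sketch for the strongest pair:

# Pearson test of H0: rho = 0 between D and log.cost
cor.test(clean_data$D, clean_data$log.cost, method = "pearson")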
Creating a Validation Dataset: Now we are going to split the dataset in training/test and validation dataset:
set.seed(3550)
# create a list of 80% of the rows in the original dataset we can use for training
validation_index = createDataPartition(clean_data$log.cost, p=0.80, list=FALSE)
# select 20% of the data for validation
validation = clean_data[-validation_index,]
# use the remaining 80% of data to training and testing the models
dataset = clean_data[validation_index,]
head(dataset)
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 1 Belleville 1980 3901 22 97 1310 1 1 1 0 13 1 3.591176
## 3 Blayais 1977 1870 9 59 910 1 1 0 1 14 1 3.271842
## 4 Blayais 1977 1392 9 73 910 2 1 0 1 14 1 3.143639
## 6 Blayais 1978 2555 17 66 910 4 1 0 1 6 1 3.407391
## 7 Bugey 1972 1215 21 76 910 2 0 1 1 4 0 3.084576
## 8 Bugey 1973 839 13 66 910 3 0 1 1 21 0 2.923762
dim(dataset)
## [1] 42 13
head(validation)
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 5 Blayais 1978 2818 17 55 910 3 1 0 1 6 1 3.449941
## 12 Cattenom 1980 6270 14 91 1300 2 1 1 0 2 0 3.797268
## 23 Cruas 1978 3497 17 68 915 1 0 1 1 4 1 3.543696
## 31 Fessenheim 1971 726 12 76 880 1 1 1 1 4 1 2.860937
## 33 Flamanville 1979 6086 20 84 1330 1 1 0 0 4 0 3.784332
## 38 Graveliness 1975 1555 11 69 910 2 1 0 1 14 1 3.191730
dim(validation)
## [1] 8 13
Multiple linear regression model building: Here we exclude the Location variable because it is nominal with many levels and cannot be coded as binary.
mlr_model1 <- lm(log.cost ~ D + T1 + T2 + S + RN + North + CT + CP + N + JP, data = dataset)
summary(mlr_model1)
##
## Call:
## lm(formula = log.cost ~ D + T1 + T2 + S + RN + North + CT + CP +
## N + JP, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.113194 -0.027026 -0.009452 0.026391 0.136811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.717e+02 1.102e+01 -15.581 3.27e-16 ***
## D 8.736e-02 5.670e-03 15.408 4.45e-16 ***
## T1 -1.805e-03 3.471e-03 -0.520 0.60674
## T2 2.104e-03 1.507e-03 1.396 0.17269
## S 1.724e-03 1.299e-03 1.327 0.19412
## RN 1.632e-03 1.043e-02 0.157 0.87662
## North1 3.939e-02 3.169e-02 1.243 0.22316
## CT1 8.513e-02 3.953e-02 2.153 0.03918 *
## CP1 7.426e-01 5.243e-01 1.416 0.16669
## N -1.090e-02 2.356e-03 -4.626 6.25e-05 ***
## JP1 -1.258e-01 3.368e-02 -3.735 0.00076 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06593 on 31 degrees of freedom
## Multiple R-squared: 0.9727, Adjusted R-squared: 0.9639
## F-statistic: 110.5 on 10 and 31 DF, p-value: < 2.2e-16
Comment on model1 report:
R-squared: The model explains 97% of the total variation in the response, indicating a good fit: 97% of the total variation in y (log.cost) is explained by the set of independent variables.
Adj. R-squared: 0.9639, i.e. the independent variables remain jointly relevant after adjusting for the number of predictors.
Prob (F-statistic): To check whether the above model is adequate for estimating the construction cost, we conduct the F-test with the following null and alternative hypotheses:
H0: β1 = β2 = … = β10 = 0 (model is not adequate); H1: at least one βi ≠ 0 (model is adequate)
We find that the p-value < 2.2e-16 < 0.05, so we reject the null hypothesis at the 5% level of significance.
Hence, the model is adequate for estimating the construction cost.
The T1, T2, S, RN, North, and CP features have p-values > 0.05 and small t-values, so they are not significant in the overall model (note that CT1 is significant here, with p = 0.039).
The D, N, and JP features have p-values < 0.05 and adequate t-values, so they are significant in the overall model.
Before dropping the insignificant variables to build a better model, we first check for multicollinearity, which can distort these individual tests.
Assumption #2: Minimal multicollinearity among independent variables.
Multicollinearity check:
- The best way to identify multicollinearity is to calculate the Variance Inflation Factor (VIF) for every independent variable in the dataset.
- VIF measures the strength of the correlation among the independent variables; it is computed by taking each variable and regressing it against every other independent variable.
- The VIF score of an independent variable represents how well that variable is explained by the other independent variables.

Variance Inflation Factor (VIF):
- VIF = 1: no multicollinearity
- 1 < VIF <= 5: low to moderate multicollinearity
- VIF > 5: high multicollinearity

We will remove the variables with the highest VIF scores, one at a time.
car::vif(mlr_model1)
## D T1 T2 S RN North CT
## 3.223553 1.393837 4.135044 668.693160 1.810899 2.156377 3.467027
## CP N JP
## 626.518738 1.402745 2.740366
In the results above, the variables S (VIF = 668.69) and CP (VIF = 626.52) have the highest VIF scores. So we drop S and re-check the VIF scores for the remaining variables.
mlr_model2 <- lm(log.cost ~ D + T1 + T2 + RN + North + CT + CP + N + JP, data = dataset)
summary(mlr_model2)
##
## Call:
## lm(formula = log.cost ~ D + T1 + T2 + RN + North + CT + CP +
## N + JP, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11921 -0.03760 -0.01140 0.03183 0.13834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.718e+02 1.115e+01 -15.411 2.32e-16 ***
## D 8.859e-02 5.659e-03 15.654 < 2e-16 ***
## T1 -8.691e-04 3.438e-03 -0.253 0.802066
## T2 2.071e-03 1.525e-03 1.358 0.184042
## RN -4.554e-03 9.437e-03 -0.483 0.632720
## North1 1.754e-02 2.740e-02 0.640 0.526507
## CT1 4.761e-02 2.796e-02 1.703 0.098331 .
## CP1 4.978e-02 5.029e-02 0.990 0.329617
## N -1.096e-02 2.383e-03 -4.598 6.37e-05 ***
## JP1 -1.409e-01 3.207e-02 -4.393 0.000115 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06671 on 32 degrees of freedom
## Multiple R-squared: 0.9712, Adjusted R-squared: 0.963
## F-statistic: 119.7 on 9 and 32 DF, p-value: < 2.2e-16
Comment on model2 report:
R-squared: 97%
Adj. R-squared: 96%
The T1, T2, RN, North, CT, and CP features still have p-values > 0.05 and small t-values, so they are not significant in the overall model.
The D, N, and JP features have p-values < 0.05 and adequate t-values, so they are significant in the overall model.
We will drop the insignificant variables to build a better model, after re-checking the VIF scores.
car::vif(mlr_model2)
## D T1 T2 RN North CT CP N
## 3.136932 1.336327 4.133888 1.449064 1.574523 1.694543 5.628677 1.402256
## JP
## 2.426907
In the results above, the variable CP (VIF = 5.63) has a VIF score > 5. So we drop CP and check the model performance and VIF scores for the remaining variables.
mlr_model3 <- lm(log.cost ~ D + T1 + T2 + RN + North + CT + N + JP, data = dataset)
summary(mlr_model3)
##
## Call:
## lm(formula = log.cost ~ D + T1 + T2 + RN + North + CT + N + JP,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.111653 -0.042496 -0.006608 0.032044 0.135571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.683e+02 1.056e+01 -15.937 < 2e-16 ***
## D 8.684e-02 5.373e-03 16.161 < 2e-16 ***
## T1 -6.580e-04 3.431e-03 -0.192 0.84908
## T2 1.309e-03 1.317e-03 0.995 0.32720
## RN -2.396e-03 9.179e-03 -0.261 0.79567
## North1 1.622e-02 2.736e-02 0.593 0.55741
## CT1 5.507e-02 2.692e-02 2.046 0.04882 *
## N -1.005e-02 2.200e-03 -4.570 6.52e-05 ***
## JP1 -1.274e-01 2.902e-02 -4.390 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06669 on 33 degrees of freedom
## Multiple R-squared: 0.9703, Adjusted R-squared: 0.9631
## F-statistic: 134.6 on 8 and 33 DF, p-value: < 2.2e-16
Comment on model3: There is no significant change in overall model performance compared with model2. So we can advance our analysis by eliminating the insignificant variables.
mlr_model4 <- lm(log.cost ~ D + RN + N + JP, data = dataset)
summary(mlr_model4)
##
## Call:
## lm(formula = log.cost ~ D + RN + N + JP, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.137460 -0.038170 -0.009338 0.036098 0.160221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.722e+02 8.332e+00 -20.665 < 2e-16 ***
## D 8.890e-02 4.210e-03 21.119 < 2e-16 ***
## RN -8.578e-03 8.247e-03 -1.040 0.305
## N -1.073e-02 2.107e-03 -5.090 1.07e-05 ***
## JP1 -1.381e-01 2.731e-02 -5.057 1.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06889 on 37 degrees of freedom
## Multiple R-squared: 0.9644, Adjusted R-squared: 0.9606
## F-statistic: 250.8 on 4 and 37 DF, p-value: < 2.2e-16
Comment on model4:
R-squared: 96%
Adj. R-squared: 96%
The D, N, and JP features have p-values < 0.05 and adequate t-values, so they are significant in the overall model; RN remains insignificant (p = 0.305) but is retained.
We now proceed with model4 as our fitted model.
Fitting model:
fitted_mlr = fitted.values(mlr_model4)
fitted_mlr
## 1 3 4 6 7 8 9 10
## 3.567120 3.289678 3.281099 3.438661 3.081933 2.979908 3.083040 3.074462
## 13 14 17 18 24 25 26 27
## 3.919503 4.021283 3.666606 3.800566 3.477271 3.663513 3.526216 3.208406
## 28 29 30 32 34 35 36 37
## 3.199828 3.212703 3.204125 2.943838 3.814630 3.979565 4.084438 3.111867
## 39 40 41 42 43 44 45 46
## 3.083984 3.164311 3.476082 3.467504 3.869207 3.852995 3.535038 3.626092
## 47 48 49 50 51 52 53 55
## 3.727873 3.786747 3.947386 4.116618 3.702122 3.725724 3.222225 3.162407
## 56 57
## 3.153829 3.159070
Predicting y:
y_predict = predict(mlr_model4, validation)
y_predict
## 5 12 23 31 33 38 54 58
## 3.447240 3.814630 3.485849 2.863511 3.712849 3.103289 3.213647 3.150492
Calculating Residuals:
Residual = validation$log.cost - y_predict
as.data.frame(Residual)
## Residual
## 5 0.002701416
## 12 -0.017362037
## 23 0.057846684
## 31 -0.002574418
## 33 0.071482888
## 38 0.088441748
## 54 0.033097456
## 58 0.056872995
Model Performance:
mse = mean((Residual)^2)
mae = mean(abs(Residual))
rmse = sqrt(mse)
R2 = 1-(sum((Residual)^2)/sum((validation$log.cost-mean(validation$log.cost))^2))
cat(" MAE:", mae, "\n", "MSE:", mse, "\n", "RMSE:", rmse, "\n", "R-squared:", R2)
## MAE: 0.04129746
## MSE: 0.002615416
## RMSE: 0.05114114
## R-squared: 0.9710001
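The already-loaded caret package computes the same validation metrics in one call; note that caret's R-squared is the squared correlation between predictions and observations, so it can differ slightly from the 1 - SSE/SST formula used above:

# Returns RMSE, Rsquared, and MAE for the validation predictions
postResample(pred = y_predict, obs = validation$log.cost)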
Assumption #3: Normality of Residual distribution.
#get list of residuals
res <- resid(mlr_model4)
res
## 1 3 4 6 7 8
## 0.024055539 -0.017835960 -0.137460241 -0.031270577 0.002643218 -0.056146282
## 9 10 13 14 17 18
## -0.103036447 -0.027576537 0.160221037 -0.017607353 0.011548307 0.053010506
## 24 25 26 27 28 29
## -0.042382736 0.058874006 0.043976723 -0.010299106 -0.008377000 0.004254091
## 30 32 34 35 36 37
## 0.128919006 -0.021632087 -0.121782659 -0.040445589 0.083646329 0.025803801
## 39 40 41 42 43 44
## 0.109140641 -0.042095403 -0.069542240 -0.023147202 -0.013687542 0.006323811
## 45 46 47 48 49 50
## 0.011751123 -0.077457090 -0.031341547 0.034439085 -0.029985049 -0.050292332
## 51 52 53 55 56 57
## -0.080115791 0.052427088 0.036651284 0.137318078 0.030578501 0.037934595
Plotting Residuals:
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res)
In the Q-Q plot, the x-axis displays the theoretical quantiles of a normal distribution and the y-axis displays the sample quantiles of the residuals; if the residuals are normally distributed, the points fall roughly along the straight diagonal line.
The residuals stray from the line somewhat near the tails, which could indicate a mild departure from normality, but not seriously enough to require changes to the model.
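A formal complement to the visual check is the Shapiro-Wilk test from base R's stats package; a minimal sketch:

# H0: residuals are normally distributed; p > 0.05 fails to reject normality
shapiro.test(res)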
Produce a density plot:
We can also produce a density plot, which is also useful for visually checking whether or not the residuals are normally distributed. If the plot is roughly bell-shaped, then the residuals likely follow a normal distribution.
#Create density plot of residuals
plot(density(res))
We can see that the density plot roughly follows a bell shape, supporting the conclusion that the residuals are approximately normally distributed.
Assumption #4: The variance of the residuals is constant (homoscedasticity).
The simplest way to determine whether this assumption is met is to create a plot of the residuals versus the fitted (predicted) values.
Once you fit a regression model to a dataset, you can create a scatter plot that shows the model's fitted values on the x-axis and the residuals on the y-axis.
If the points in the scatter plot exhibit a pattern, then heteroscedasticity is present.
#produce residual vs. fitted plot
plot(fitted(mlr_model4), res)
#add a horizontal line at 0
abline(0,0)
Notice that the residuals are scattered around zero with no clear pattern, so the constant-variance assumption appears to be satisfied.
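A formal complement to this visual check is the score test for non-constant error variance, available in the already-loaded car package; a minimal sketch:

# H0: constant error variance; a large p-value supports homoscedasticity
ncvTest(mlr_model4)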
Assumption #5: The values of the residuals are independent (Auto-correlation).
One of the key assumptions in linear regression is that there is no correlation between the residuals, e.g. the residuals are independent.
One way to determine if this assumption is met is to perform a Durbin-Watson test, which is used to detect the presence of autocorrelation in the residuals of a regression. This test uses the following hypotheses:
H0: There is no correlation among the residuals. HA: The residuals are autocorrelated.
Perform a Durbin-Watson test:
#perform Durbin-Watson test
durbinWatsonTest(mlr_model4)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1454758 1.697558 0.194
## Alternative hypothesis: rho != 0
From the output we can see that the test statistic is 1.6976 and the corresponding p-value is 0.194. Since this p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that there is no significant autocorrelation among the residuals of this regression model.
So our model fulfilled all the assumptions for multiple linear regression.
Verifying multiple linear regression by mathematical calculation:
df_prediction = data.frame(validation$log.cost, y_predict, Residual)
df_prediction
## validation.log.cost y_predict Residual
## 5 3.449941 3.447240 0.002701416
## 12 3.797268 3.814630 -0.017362037
## 23 3.543696 3.485849 0.057846684
## 31 2.860937 2.863511 -0.002574418
## 33 3.784332 3.712849 0.071482888
## 38 3.191730 3.103289 0.088441748
## 54 3.246745 3.213647 0.033097456
## 58 3.207365 3.150492 0.056872995
print(mlr_model4)
##
## Call:
## lm(formula = log.cost ~ D + RN + N + JP, data = dataset)
##
## Coefficients:
## (Intercept) D RN N JP1
## -1.722e+02 8.891e-02 -8.578e-03 -1.073e-02 -1.381e-01
validation[1, ]
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 5 Blayais 1978 2818 17 55 910 3 1 0 1 6 1 3.449941
Here,
x1 = value of D, x2 = value of RN, x3 = value of N, x4 = value of JP1 (joint planning with another reactor)
β0 = -1.722e+02, β1 = 8.891e-02, β2 = -8.578e-03, β3 = -1.073e-02, β4 = -1.381e-01
y = β0 + β1x1(D) + β2x2(RN) + β3x3(N) + β4x4(JP1)
row_predict = 5
y = -1.722e+02 + (8.891e-02*1978) + (-8.578e-03*3) + (-1.073e-02*6) + (-1.381e-01*1)
y
## [1] 3.435766
Construction_cost = 10^3.435766
cat("Construction cost:", Construction_cost, "Million EUR.")
## Construction cost: 2727.508 Million EUR.
So the hand-calculated value (3.4358) closely matches the model's predict() output for this row (3.4472); the small difference comes from rounding the coefficients.
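The same calculation can be performed with predict() on a one-row data frame, which uses the unrounded coefficients; a sketch mirroring validation row 5 (the JP value must be supplied as a factor with the training levels):

# Hypothetical reactor: started 1978, reactor number 3, 6 architects, joint planning
new_reactor <- data.frame(D = 1978, RN = 3, N = 6,
                          JP = factor(1, levels = c(0, 1)))
pred_log <- predict(mlr_model4, newdata = new_reactor)
10^pred_log  # back-transform log10(cost) to million EUR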
Using model4 as the appropriate regression model, we can predict the construction cost of nuclear reactors in France.
In conclusion, for a reactor whose construction started in 1978, as reactor number 3 at its site, with 6 architects participating and with joint planning with another reactor, the predicted construction cost is approximately 2727.508 million EUR.