Submitted to:
Md. Atikur Rahman
Assistant Professor
Department of Statistics
Jahangirnagar University
EDA by: Jobaida Gulsanara Jarin (ID: 20215018)
Batch: 05(B)
EDA Date: 28th March 2022
Course: Data Mining (PM-ASDS18)
Program: Professional Masters in Applied Statistics and Data Science (ASDS)
Installing Packages
#install.packages("car")
#install.packages("tidyverse")
#install.packages("e1071")
#install.packages("caTools")
#install.packages("caret")
#install.packages("modeest")
#install.packages("ggplot2")
#install.packages("lubridate", dependencies=TRUE)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
Loading packages
library(e1071)
library(caTools)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(car)
## Loading required package: carData
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x dplyr::recode() masks car::recode()
## x lubridate::setdiff() masks base::setdiff()
## x purrr::some() masks car::some()
## x lubridate::union() masks base::union()
library(modeest)
## Registered S3 method overwritten by 'rmutil':
## method from
## print.response httr
##
## Attaching package: 'modeest'
## The following object is masked from 'package:e1071':
##
## skewness
library(here)
## here() starts at /cloud/project
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
Research Problem: We are required to predict the cost of constructing further reactors.
Since our main objective is to predict the cost of constructing further reactors, a model is required to make such predictions. Linear regression, logistic regression, and non-linear regression models are commonly applied to prediction problems of this kind. Each model rests on assumptions, and before using any model we must check the validity of all of its underlying assumptions. Based on the assumptions that hold, we can find an appropriate model for our dataset. Since the response variable (cost) depends on multiple variables, a multiple linear regression model is most suitable for this dataset.
At this point, exploratory data analysis can be performed as an initial investigation of the data: to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. The following sections present the EDA for our dataset.
Observing the Nuclear Reactor Construction Data:
Loading dataset:
project_data = read.csv("nr_construction_data.csv")
head(project_data)
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 1 Belleville May, 1980 3901 22 97 1310 1 1 1 0 13 1 3.591176
## 2 Belleville Aug, 1980 3744 25 101 1310 2 1 1 0 13 1 3.573336
## 3 Blayais Jan, 1977 1870 9 59 910 1 1 0 1 14 1 3.271842
## 4 Blayais Jan, 1977 1392 9 73 910 2 1 0 1 14 1 3.143639
## 5 Blayais Apr, 1978 2818 17 55 910 3 1 0 1 6 1 3.449941
## 6 Blayais Apr, 1978 2555 17 66 910 4 1 0 1 6 1 3.407391
skim_without_charts(project_data)
| Name | project_data |
| Number of rows | 58 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Location | 0 | 1 | 5 | 15 | 0 | 19 | 0 |
| D | 0 | 1 | 9 | 9 | 0 | 47 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Cost | 0 | 1 | 6097.28 | 9791.85 | 726.00 | 1586.5 | 3157.5 | 6224.00 | 57729.00 |
| T1 | 0 | 1 | 15.33 | 4.10 | 8.00 | 13.0 | 15.0 | 17.00 | 25.00 |
| T2 | 0 | 1 | 85.76 | 27.65 | 55.00 | 69.0 | 77.5 | 91.75 | 196.00 |
| S | 0 | 1 | 1088.45 | 224.78 | 880.00 | 910.0 | 915.0 | 1310.00 | 1500.00 |
| RN | 0 | 1 | 2.33 | 1.26 | 1.00 | 1.0 | 2.0 | 3.00 | 6.00 |
| North | 0 | 1 | 0.72 | 0.45 | 0.00 | 0.0 | 1.0 | 1.00 | 1.00 |
| CT | 0 | 1 | 0.69 | 0.47 | 0.00 | 0.0 | 1.0 | 1.00 | 1.00 |
| CP | 0 | 1 | 0.59 | 0.50 | 0.00 | 0.0 | 1.0 | 1.00 | 1.00 |
| N | 0 | 1 | 8.24 | 6.06 | 1.00 | 4.0 | 6.0 | 12.75 | 34.00 |
| JP | 0 | 1 | 0.55 | 0.50 | 0.00 | 0.0 | 1.0 | 1.00 | 1.00 |
| log.cost | 0 | 1 | 3.54 | 0.42 | 2.86 | 3.2 | 3.5 | 3.79 | 4.76 |
str(project_data)
## 'data.frame': 58 obs. of 13 variables:
## $ Location: chr "Belleville" "Belleville" "Blayais" "Blayais" ...
## $ D : chr "May, 1980" "Aug, 1980" "Jan, 1977" "Jan, 1977" ...
## $ Cost : int 3901 3744 1870 1392 2818 2555 1215 839 955 1114 ...
## $ T1 : int 22 25 9 9 17 17 21 13 11 12 ...
## $ T2 : int 97 101 59 73 55 66 76 66 61 66 ...
## $ S : int 1310 1310 910 910 910 910 910 910 880 880 ...
## $ RN : int 1 2 1 2 3 4 2 3 4 5 ...
## $ North : int 1 1 1 1 1 1 0 0 0 0 ...
## $ CT : int 1 1 0 0 0 0 1 1 1 1 ...
## $ CP : int 0 0 1 1 1 1 1 1 1 1 ...
## $ N : int 13 13 14 14 6 6 4 21 6 6 ...
## $ JP : int 1 1 1 1 1 1 0 0 1 1 ...
## $ log.cost: num 3.59 3.57 3.27 3.14 3.45 ...
There are 58 observations of 13 variables, where D (Date) is formatted as character and North, CP, CT, JP are formatted as integer. To satisfy the assumptions of the multiple linear regression model, we need to transform the D variable to integer format and North, CP, CT, JP to factor format.
# Transform Character to Date type
project_data$D <- lubridate::my(project_data$D)
# Transform Date to integer by extracting the year
project_data$D <- year(project_data$D)
project_data$D
## [1] 1980 1980 1977 1977 1978 1978 1972 1973 1974 1974 1979 1980 1982 1983 1977
## [16] 1977 1980 1981 1984 1984 1988 1991 1978 1978 1979 1979 1975 1975 1975 1975
## [31] 1971 1972 1979 1980 1982 1984 1975 1975 1975 1976 1979 1979 1981 1982 1977
## [46] 1978 1979 1980 1982 1984 1979 1979 1976 1976 1974 1974 1975 1975
A factor is a class that has multiple class labels, or levels. The North, CT, CP, and JP variables are categorical but formatted as integer. Now we will convert them to factors.
project_data[,'North'] <- as.factor(project_data[,'North'])
project_data[,'CT'] <- as.factor(project_data[,'CT'])
project_data[,'CP'] <- as.factor(project_data[,'CP'])
project_data[,'JP'] <- as.factor(project_data[,'JP'])
sapply(project_data, class)
## Location D Cost T1 T2 S
## "character" "numeric" "integer" "integer" "integer" "integer"
## RN North CT CP N JP
## "integer" "factor" "factor" "factor" "integer" "factor"
## log.cost
## "numeric"
Now the variables are in correct format to perform statistical analysis.
Summary statistics:
summary(project_data)
## Location D Cost T1
## Length:58 Min. :1971 Min. : 726 Min. : 8.00
## Class :character 1st Qu.:1975 1st Qu.: 1586 1st Qu.:13.00
## Mode :character Median :1978 Median : 3158 Median :15.00
## Mean :1978 Mean : 6097 Mean :15.33
## 3rd Qu.:1980 3rd Qu.: 6224 3rd Qu.:17.00
## Max. :1991 Max. :57729 Max. :25.00
## T2 S RN North CT CP
## Min. : 55.00 Min. : 880 Min. :1.000 0:16 0:18 0:24
## 1st Qu.: 69.00 1st Qu.: 910 1st Qu.:1.000 1:42 1:40 1:34
## Median : 77.50 Median : 915 Median :2.000
## Mean : 85.76 Mean :1088 Mean :2.328
## 3rd Qu.: 91.75 3rd Qu.:1310 3rd Qu.:3.000
## Max. :196.00 Max. :1500 Max. :6.000
## N JP log.cost
## Min. : 1.000 0:26 Min. :2.861
## 1st Qu.: 4.000 1:32 1st Qu.:3.200
## Median : 6.000 Median :3.497
## Mean : 8.241 Mean :3.540
## 3rd Qu.:12.750 3rd Qu.:3.794
## Max. :34.000 Max. :4.761
Data Description: The data file gives information relating to the construction of 58 nuclear reactors in France. The data has 58 rows and 13 columns.
Table 1 represents all the variables description with their value level, level of measurements, and suitable measures.
| Variable Name | Variable Description | Value Level | Level of measurements | Appropriate measures |
|---|---|---|---|---|
| Location | NR location | In France | Nominal | Mode |
| D | Construction start date | 1971 to 1991 | Interval | Mean, Median, Mode |
| Cost | Construction cost (million EUR) | 726 to 57729 million EUR | Ratio | Mean, Median, Mode |
| T1 | Length of application process (months) | 8 to 25 months | Interval | Median |
| T2 | Length of construction process (months) | 55 to 196 months | Interval | Median |
| S | Power plant net capacity (MWe) | 880 to 1500 megawatts electric (one million watts of electric capacity) | Interval | Median |
| RN | Reactor number | 1 to 6 | Ratio | Mean, Median, Mode |
| North | Plant constructed in the north of France | 0 = constructed outside the north, 1 = constructed in the north | Nominal | Mode |
| CT | Use of cooling tower | 0 = No, 1 = Yes | Nominal | Mode |
| CP | Reactor type | 0 = pressurized-water reactor (PWR), 1 = boiling-water reactor (BWR) | Nominal | Mode |
| N | Number of architects participating in a nuclear power plant construction | 1 to 34 architects | Ratio | Median |
| JP | Joint planning with another reactor | 0 = No, 1 = Yes | Nominal | Mode |
| log.cost | log(Cost) | Logarithmic value of cost | Ratio | Mean, Median, Mode |
Objectives for EDA: Construction cost, which depends on several independent variables, is the main response variable for our study, since we must predict the construction cost using the remaining variables. To work toward this main objective, we begin by computing summary measures for each variable.
Calculating standard deviation:
std_D = sd(project_data$D)
cat("Standard Deviation of D =", std_D, "\n")
## Standard Deviation of D = 3.875287
std_cost = sd(project_data$Cost)
cat("Standard Deviation of Cost =", std_cost, "\n")
## Standard Deviation of Cost = 9791.851
std_T1 = sd(project_data$T1)
cat("Standard Deviation of T1 =", std_T1, "\n")
## Standard Deviation of T1 = 4.097016
std_T2 = sd(project_data$T2)
cat("Standard Deviation of T2 =", std_T2, "\n")
## Standard Deviation of T2 = 27.6509
std_S = sd(project_data$S)
cat("Standard Deviation of S =", std_S, "\n")
## Standard Deviation of S = 224.7849
std_RN = sd(project_data$RN)
cat("Standard Deviation of RN =", std_RN, "\n")
## Standard Deviation of RN = 1.261966
std_N = sd(project_data$N)
cat("Standard Deviation of N =", std_N, "\n")
## Standard Deviation of N = 6.061991
std_log.cost = sd(project_data$log.cost)
cat("Standard Deviation of log.cost =", std_log.cost, "\n")
## Standard Deviation of log.cost = 0.4167527
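As a compact alternative, the eight calls above can be collapsed into a single sapply() over the numeric columns; a minimal sketch:

# Compute all standard deviations in one call
num_vars <- c("D", "Cost", "T1", "T2", "S", "RN", "N", "log.cost")
round(sapply(project_data[num_vars], sd), 4)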
Calculating Mode:
getmode <- function(v) {
  uniqv <- unique(v)                           # distinct values of v
  uniqv[which.max(tabulate(match(v, uniqv)))]  # value with the highest frequency
}
mfv(project_data$Location)
## [1] "Graveliness"
#dplyr::count(project_data, D, sort = TRUE)
mfv(project_data$D)
## [1] 1975 1979
cat("Most frequent Date =", mfv(project_data$D), "\n")
## Most frequent Date = 1975 1979
#(dplyr::count(project_data, Cost, sort = TRUE))
print("No mode for Cost")
## [1] "No mode for Cost"
print(mfv(project_data$T1))
## [1] 13 15
cat("Most frequent Length of application process =", mfv(project_data$T1), " months.", "\n")
## Most frequent Length of application process = 13 15 months.
#dplyr::count(project_data, T1, sort = TRUE)
print(mfv(project_data$T2))
## [1] 66
cat("Most frequent Length of Construction process =", mfv(project_data$T2), " months.", "\n")
## Most frequent Length of Construction process = 66 months.
#dplyr::count(project_data, T2, sort = TRUE)
print(mfv(project_data$S))
## [1] 910
cat("Most frequent Power plant net capacity (Mwe): S =", mfv(project_data$S), "Mwe", "\n")
## Most frequent Power plant net capacity (Mwe): S = 910 Mwe
#dplyr::count(project_data, S, sort = TRUE)
mode_RN = getmode(project_data$RN)
cat("Most frequent Reactor Number: RN =", mode_RN, "\n")
## Most frequent Reactor Number: RN = 2
#dplyr::count(project_data, RN, sort = TRUE)
mode_North = getmode(project_data$North)
cat("Most frequent Plant constructed in the north of France: North =", mode_North, "\n")
## Most frequent Plant constructed in the north of France: North = 2
#dplyr::count(project_data, North, sort = TRUE)
mode_CT = getmode(project_data$CT)
cat("Most frequent Use of cooling tower: CT =", mode_CT, "\n")
## Most frequent Use of cooling tower: CT = 2
#dplyr::count(project_data, CT, sort = TRUE)
mode_CP = getmode(project_data$CP)
cat("Most frequent Reactor Type: CP =", mode_CP, "\n")
## Most frequent Reactor Type: CP = 2
#dplyr::count(project_data, CP, sort = TRUE)
mode_N = getmode(project_data$N)
cat("Most frequent Number of power plants constructed by architect: N =", mode_N, "\n")
## Most frequent Number of power plants constructed by architect: N = 4
#dplyr::count(project_data, N, sort = TRUE)
mode_JP = getmode(project_data$JP)
cat("Most frequent Joint planning with another reactor: JP =", mode_JP, "\n")
## Most frequent Joint planning with another reactor: JP = 2
#dplyr::count(project_data, JP, sort = TRUE)
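Note that cat() coerces a factor to its internal integer codes, which is why the binary variables North, CT, CP, and JP print as 2 above: code 2 corresponds to the level "1". Wrapping the mode in as.character() prints the level label instead; a minimal sketch:

# as.character() returns the factor's level label ("1") rather than its code (2)
cat("Most frequent North =", as.character(mode_North), "\n")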
#dplyr::count(project_data, log.cost, sort = TRUE)
print("No mode for log.cost")
## [1] "No mode for log.cost"
| Variable Name | Mean | Median | Mode | Standard Deviation |
|---|---|---|---|---|
| Location (NR location) | - | - | Most NRs were constructed in "Graveliness". | - |
| D (Construction start date) | On average the construction year is 1978. | Half of the constructions started before 1978 and half after. | The most common construction years are 1975 and 1979. | Construction years vary around the mean by about 4 years. |
| Cost (Construction cost) | On average the construction cost is 6097 million EUR. | Half of the construction costs are below 3158 million EUR and half above. | No mode (all values are unique). | Costs vary around the mean by about 9791.851 million EUR. |
| T1 (Length of application process, months) | On average the application process takes 15.33 months. | Half of the application processes take less than 15.00 months and half take more. | The most common application lengths are 13 and 15 months. | Application lengths vary around the mean by about 4.097 months. |
| T2 (Length of construction process, months) | On average the construction process takes 85.76 months. | Half of the construction processes take less than 77.50 months and half take more. | The most common construction length is 66 months. | Construction lengths vary around the mean by about 27.651 months. |
| S (Power plant net capacity, MWe) | On average the net capacity is 1088 MWe. | Half of the plants have capacity below 915 MWe and half above. | The most common net capacity is 910 MWe. | Capacities vary around the mean by about 224.785 MWe. |
| RN (Reactor number) | On average the reactor number is slightly above 2. | Half of the reactor numbers are below 2 and half above. | The most common reactor number is 2. | Reactor numbers vary around the mean by about 1.26. |
| North (Plant constructed in the north of France) | - | - | 1: most NRs were constructed in the north of France. | - |
| CT (Use of cooling tower) | - | - | 1: most NRs use a cooling tower. | - |
| CP (Reactor type) | - | - | 1: the most common reactor type is BWR. | - |
| N (Number of architects participating in a construction) | On average about 8 architects participated in a nuclear power plant construction. | Half of the constructions involved fewer than 6 architects and half involved more. | Most commonly 4 architects participated in a construction. | The number of architects varies around the mean by about 6. |
| JP (Joint planning with another reactor) | - | - | 1: most NRs had joint planning with another reactor. | - |
| log.cost (log(Cost)) | On average log.cost is 3.540, i.e. a construction cost of about 10^3.540 million EUR. | Half of the constructions cost below 10^3.497 million EUR and half above. | No mode (all values are unique). | log.cost varies around the mean by about 0.417. |
Data Cleaning: Checking NA values:
sum(is.na(project_data))
## [1] 0
No NA values.
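For a per-column check, colSums() over is.na() confirms that no individual variable has missing values; a one-line sketch:

# All zeros means every column is complete
colSums(is.na(project_data))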
Detecting Outliers: We do not create box plots for the nominal (factor) variables here.
df_boxplot1 = data.frame(project_data$D, project_data$T1, project_data$T2, project_data$S, project_data$RN, project_data$N, project_data$log.cost)
head(df_boxplot1)
## project_data.D project_data.T1 project_data.T2 project_data.S project_data.RN
## 1 1980 22 97 1310 1
## 2 1980 25 101 1310 2
## 3 1977 9 59 910 1
## 4 1977 9 73 910 2
## 5 1978 17 55 910 3
## 6 1978 17 66 910 4
## project_data.N project_data.log.cost
## 1 13 3.591176
## 2 13 3.573336
## 3 14 3.271842
## 4 14 3.143639
## 5 6 3.449941
## 6 6 3.407391
# boxplot for each attribute on one image
par(mfrow=c(1,7))
for(i in 1:7) {
boxplot(df_boxplot1[,i], main=names(df_boxplot1)[i])
}
Outliers are visible in the D, T1, T2, N, and log.cost variables. We now remove these outliers from the dataset.
Removing outliers with IQR:
outliers_T1 <- boxplot(project_data$T1, main="Boxplot", horizontal=TRUE, xlab="T1", cex.axis = 1, cex.main = 1, cex.sub = 1)$out
text(boxplot.stats(project_data$T1)$stats, labels=boxplot.stats(project_data$T1)$stats, y = 1.25, cex = 1)
outliers_T1
## [1] 25 24 24
ls_T1 <- which(project_data$T1 %in% outliers_T1)
ls_T1
## [1] 2 15 16
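The $out component of boxplot() applies Tukey's rule: points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged. A small helper reproducing this rule directly (note that boxplot.stats() uses Tukey hinges rather than type-7 quantiles, so results can differ slightly in edge cases):

# Flag points outside the 1.5*IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * IQR(x)
  x[x < q[1] - fence | x > q[2] + fence]
}
iqr_outliers(project_data$T1)  # matches outliers_T1 above: 25 24 24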
outliers_T2 <- boxplot(project_data$T2, main="Boxplot", horizontal=TRUE, xlab="T2", cex.axis = 0.7, cex.main = 1, cex.sub = 0.8)$out
text(boxplot.stats(project_data$T2)$stats, labels=boxplot.stats(project_data$T2)$stats, y = 1.25, cex = 0.7)
outliers_T2
## [1] 196 196 159 132
ls_T2 <- which(project_data$T2 %in% outliers_T2)
ls_T2
## [1] 19 20 21 22
outliers_N <- boxplot(project_data$N, main="Boxplot", horizontal=TRUE, xlab="N", cex.axis = 0.7, cex.main = 1, cex.sub = 0.8)$out
text(boxplot.stats(project_data$N)$stats, labels=boxplot.stats(project_data$N)$stats, y = 1.25, cex = 0.7)
outliers_N
## [1] 34
ls_N <- which(project_data$N %in% outliers_N)
ls_N
## [1] 11
outliers_log.cost <- boxplot(project_data$log.cost, main="Boxplot", horizontal=TRUE, xlab="log.cost", cex.axis = 0.7, cex.main = 1, cex.sub = 0.8)$out
text(boxplot.stats(project_data$log.cost)$stats, labels=round(boxplot.stats(project_data$log.cost)$stats, 3), y = 1.25, cex = 0.7)
outliers_log.cost
## [1] 4.761394
ls_log.cost <- which(project_data$log.cost %in% outliers_log.cost)
ls_log.cost
## [1] 22
ls = c(ls_T1, ls_T2, ls_N, ls_log.cost)
ls
## [1] 2 15 16 19 20 21 22 11 22
drop_obs = unique(sort(ls))
drop_obs
## [1] 2 11 15 16 19 20 21 22
clean_data <- project_data[-c(drop_obs),]
head(clean_data)
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 1 Belleville 1980 3901 22 97 1310 1 1 1 0 13 1 3.591176
## 3 Blayais 1977 1870 9 59 910 1 1 0 1 14 1 3.271842
## 4 Blayais 1977 1392 9 73 910 2 1 0 1 14 1 3.143639
## 5 Blayais 1978 2818 17 55 910 3 1 0 1 6 1 3.449941
## 6 Blayais 1978 2555 17 66 910 4 1 0 1 6 1 3.407391
## 7 Bugey 1972 1215 21 76 910 2 0 1 1 4 0 3.084576
dim(clean_data)
## [1] 50 13
summary(clean_data)
## Location D Cost T1
## Length:50 Min. :1971 Min. : 726 Min. : 8.00
## Class :character 1st Qu.:1975 1st Qu.: 1564 1st Qu.:12.25
## Mode :character Median :1978 Median : 2752 Median :14.50
## Mean :1978 Mean : 3993 Mean :14.72
## 3rd Qu.:1980 3rd Qu.: 5819 3rd Qu.:17.00
## Max. :1984 Max. :14726 Max. :22.00
## T2 S RN North CT CP
## Min. : 55.00 Min. : 880 Min. :1.00 0:16 0:18 0:18
## 1st Qu.: 68.00 1st Qu.: 910 1st Qu.:1.00 1:34 1:32 1:32
## Median : 76.00 Median : 915 Median :2.00
## Mean : 78.56 Mean :1054 Mean :2.46
## 3rd Qu.: 87.50 3rd Qu.:1310 3rd Qu.:3.00
## Max. :113.00 Max. :1335 Max. :6.00
## N JP log.cost
## Min. : 1.00 0:23 Min. :2.861
## 1st Qu.: 4.00 1:27 1st Qu.:3.194
## Median : 6.00 Median :3.440
## Mean : 7.64 Mean :3.470
## 3rd Qu.:11.50 3rd Qu.:3.764
## Max. :21.00 Max. :4.168
df_boxplot2 = data.frame(clean_data$D, clean_data$T1, clean_data$T2, clean_data$S, clean_data$RN, clean_data$N, clean_data$log.cost)
# boxplot for each attribute on one image
par(mfrow=c(1,7))
for(i in 1:7) {
boxplot(df_boxplot2[,i], main=names(df_boxplot2)[i])
}
So no outliers remain visible in our clean_data data frame.
Plotting factor variables: North, CT, CP and JP:
(levels(clean_data$North))
## [1] "0" "1"
(levels(clean_data$CT))
## [1] "0" "1"
col_names_north = c("Outside North (0)","Inside North (1)")
percentage_north = (prop.table(table(clean_data$North)) * 100)
col_names_CT = c("No use of cooling tower (0)", "Use of cooling tower (1)")
percentage_CT = (prop.table(table(clean_data$CT)) * 100)
# set the plotting area into a 1*2 array
par(mfrow=c(1,2))
pie(table(clean_data$North), labels = paste(col_names_north, percentage_north, "%"), col=grey.colors(length(col_names_north)), main="North or not", cex= 0.8)
pie(table(clean_data$CT), labels = paste(col_names_CT, percentage_CT, "%"), col=grey.colors(length(col_names_CT)), main="Use of cooling tower", cex= 0.8)
North: 32% of plants were constructed outside the north (0) and 68% were constructed in the north of France (1).
CT: 36% of plants do not use a cooling tower (0) and 64% do (1).
(levels(clean_data$CP))
## [1] "0" "1"
(levels(clean_data$JP))
## [1] "0" "1"
col_names_CP = c("BWR (0)", "PWR (1)")
percentage_CP = round((prop.table(table(clean_data$CP)) * 100), 3)
col_names_JP = c("No Joint Planning (0)","Joint Planning (1)")
percentage_JP = round((prop.table(table(clean_data$JP)) * 100), 3)
# set the plotting area into a 1*2 array
par(mfrow=c(1,2))
pie(table(clean_data$CP), labels = paste(col_names_CP, percentage_CP, "%"), col=grey.colors(length(col_names_CP)), main="Water reactor", cex= 0.8)
pie(table(clean_data$JP), labels = paste(col_names_JP, percentage_JP, "%"), col=grey.colors(length(col_names_JP)), main="Joint planning with another reactor", cex=0.8)
CP: 36% are boiling-water reactors (BWR, coded 0) and 64% are pressurized-water reactors (PWR, coded 1).
JP: 46% do not have joint planning with another reactor (0), while 54% do (1).
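The janitor package loaded earlier offers a one-line tally that avoids the manual prop.table() arithmetic used above; a sketch for JP:

# tabyl() returns counts and proportions for each factor level
tabyl(clean_data, JP)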
Distribution of the observations with density curve:
m_T1 = mean(clean_data$T1)
std_T1 = sqrt(var(clean_data$T1))
m = mean(clean_data$T2)
std = sqrt(var(clean_data$T2))
m_S = mean(clean_data$S)
std_S = sqrt(var(clean_data$S))
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
hist(clean_data$T1, prob=T, main="T1", breaks = 50, cex.axis = 0.8, cex.lab = 0.9, cex.main=1, cex.sub = 0.8)
curve(dnorm(x, mean = m_T1, sd = std_T1), col = "darkblue", lwd =2, add = TRUE)
hist(clean_data$T2, prob=T, main="T2", breaks = 30, cex.axis = 0.8, cex.lab = 0.9, cex.main=1, cex.sub = 0.8)
curve(dnorm(x, mean = m, sd = std), col = "darkblue", lwd =2, add = TRUE)
hist(clean_data$S, prob=T, main="S", breaks = 100, cex.axis = 0.8, cex.lab = 0.9, cex.main=1, cex.sub = 0.8)
curve(dnorm(x, mean = m_S, sd = std_S), col = "darkblue", lwd =2, add = TRUE)
m_RN = mean(clean_data$RN)
std_RN = sqrt(var(clean_data$RN))
m_log.cost = mean(clean_data$log.cost)
std_log.cost = sqrt(var(clean_data$log.cost))
m_D = mean(clean_data$D)
std_D = sqrt(var(clean_data$D))
# set the plotting area into a 1*4 array
par(mfrow=c(1,3))
hist(clean_data$RN, prob=T, main="RN", breaks = 50, cex.axis = 0.7, cex.lab = 0.8, cex.main=1, cex.sub = 0.7)
curve(dnorm(x, mean = m_RN, sd = std_RN), col = "darkblue", lwd =2, add = TRUE)
hist(clean_data$log.cost, prob=T, main="log.cost", breaks = 100, cex.axis = 0.7, cex.lab = 0.8, cex.main=1, cex.sub = 0.7)
curve(dnorm(x, mean = m_log.cost, sd = std_log.cost), col = "darkblue", lwd =2, add = TRUE)
hist(clean_data$D, main="Date", breaks=50, cex.axis = 0.7, cex.lab = 0.8, cex.main=1, cex.sub = 0.7)
curve(dnorm(x, mean = m_D, sd = std_D), col = "darkblue", lwd =2, add = TRUE)
The variables are mostly approximately normally distributed.
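Symmetry can also be checked numerically. Since modeest masks e1071's skewness() (see the package-loading messages above), the namespace must be given explicitly; a minimal sketch:

# Skewness near 0 suggests an approximately symmetric distribution
sapply(clean_data[c("T1", "T2", "S", "RN", "N", "log.cost")],
       e1071::skewness)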
Multiple Linear Regression:
Fit a suitable model to predict the cost of nuclear reactor construction in France.
The multiple linear regression model will predict the dependent variable log.cost using a regression equation based on the other independent variables.
The equation of multiple linear regression is:
y = β0 + β1x1 + β2x2 + β3x3 + … + βnxn + ε
where β0 is the intercept, β1, β2, …, βn are the slope coefficients, and ε is the error (residual).
The equation above is used to predict the value of the target variable based on the given predictor variables.
When we run a multiple regression, there are several assumptions our data must meet in order for the analysis to be reliable and valid.
Assumption #1: The relationship between the IVs and the DV is linear.
The first assumption of Multiple Regression is that the relationship between the IVs and the DV can be characterised by a straight line. A simple way to check this is by producing scatterplots of the relationship between each of our IVs and our DV.
To fully test the assumption of linearity, we produce a scatterplot with a fitted regression line for each of the IVs against the DV in Figure-7, Figure-8 and Figure-9.
Looking at the scatterplots with regression lines, we can see that the relationship between each IV and the DV could be modelled by a straight line, suggesting that these relationships are linear.
Linearity with scatterplot:
x_D <- clean_data$D
x_T1 <- clean_data$T1
x_T2 <- clean_data$T2
y_log.cost <- clean_data$log.cost
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
# Plot with main and axis titles
# Change point shape (pch = 19) and remove frame.
# Add regression line
plot(x_D, y_log.cost, main = "Date - log.cost", xlab = "Date", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_D, data = clean_data), col = "blue")
plot(x_T1, y_log.cost, main = "T1 - log.cost", xlab = "T1", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_T1, data = clean_data), col = "blue")
plot(x_T2, y_log.cost, main = "T2 - log.cost", xlab = "T2", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_T2, data = clean_data), col = "blue")
x_S <- clean_data$S
x_RN <- clean_data$RN
x_North <- clean_data$North
y_log.cost <- clean_data$log.cost
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
# Plot with main and axis titles
# Change point shape (pch = 19) and remove frame.
# Add regression line
plot(x_S, y_log.cost, main = "S - log.cost", xlab = "S", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_S, data = clean_data), col = "blue")
plot(x_RN, y_log.cost, main = "RN - log.cost", xlab = "RN", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_RN, data = clean_data), col = "blue")
plot(x_North, y_log.cost, main = "North - log.cost", xlab = "North", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_North, data = clean_data), col = "blue")
x_CT <- clean_data$CT
x_CP <- clean_data$CP
x_N <- clean_data$N
y_log.cost <- clean_data$log.cost
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
# Plot with main and axis titles
# Change point shape (pch = 19) and remove frame.
# Add regression line
plot(x_CT, y_log.cost, main = "CT - log.cost", xlab = "CT", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_CT, data = clean_data), col = "blue")
plot(x_CP, y_log.cost, main = "CP - log.cost", xlab = "CP", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_CP, data = clean_data), col = "blue")
plot(x_N, y_log.cost, main = "N - log.cost", xlab = "N", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_N, data = clean_data), col = "blue")
x_JP <- clean_data$JP
y_log.cost <- clean_data$log.cost
# set the plotting area into a 1*3 array
par(mfrow=c(1,3))
# Plot with main and axis titles
# Change point shape (pch = 19) and remove frame.
# Add regression line
plot(x_JP, y_log.cost, main = "JP - log.cost", xlab = "JP", ylab = "log.cost", pch = 19, frame = FALSE)
abline(lm(y_log.cost ~ x_JP, data = clean_data), col = "blue")
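All of these panels can also be generated at once with base R's pairs(), which draws a scatterplot matrix of the numeric variables; a compact alternative sketch:

# Scatterplot matrix: each numeric predictor against log.cost (and each other)
pairs(clean_data[c("D", "T1", "T2", "S", "RN", "N", "log.cost")], pch = 19)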
Linear Relation with scatterplot and regression line:
| Variable (vs log.cost) | Association | Comment |
|---|---|---|
| D | Linear relationship, uphill (positive slope) | As D increases, log.cost tends to increase. |
| T1 | Linear relationship, uphill (positive slope) | As T1 increases, log.cost tends to increase. |
| T2 | Linear relationship, uphill (positive slope) | As T2 increases, log.cost tends to increase. |
| S | Linear relationship, uphill (positive slope) | As S increases, log.cost tends to increase. |
| RN | No linear relationship | - |
| North | No linear relationship | - |
| CT | No linear relationship | - |
| CP | Boxplot (factor variable) | log.cost tends to be higher when CP = 0 and lower when CP = 1. |
| JP | Boxplot (factor variable) | log.cost tends to be higher when JP = 0 and lower when JP = 1. |
Linear relation with Pearson's correlation coefficient: We compute correlation coefficients for the numeric variables D, T1, T2, S, RN, N, and log.cost, and skip the character and factor variables (Location, North, CT, CP, JP).
df_cor = data.frame(clean_data$D, clean_data$T1, clean_data$T2, clean_data$S, clean_data$RN, clean_data$N, clean_data$log.cost)
cor1 = round(cor(df_cor),3)
cor1
## clean_data.D clean_data.T1 clean_data.T2 clean_data.S
## clean_data.D 1.000 0.293 0.630 0.710
## clean_data.T1 0.293 1.000 0.236 0.256
## clean_data.T2 0.630 0.236 1.000 0.800
## clean_data.S 0.710 0.256 0.800 1.000
## clean_data.RN 0.009 0.083 -0.330 -0.307
## clean_data.N 0.022 -0.252 -0.195 -0.290
## clean_data.log.cost 0.953 0.362 0.707 0.782
## clean_data.RN clean_data.N clean_data.log.cost
## clean_data.D 0.009 0.022 0.953
## clean_data.T1 0.083 -0.252 0.362
## clean_data.T2 -0.330 -0.195 0.707
## clean_data.S -0.307 -0.290 0.782
## clean_data.RN 1.000 0.161 -0.080
## clean_data.N 0.161 1.000 -0.158
## clean_data.log.cost -0.080 -0.158 1.000
Interpretation of the correlation coefficients (Karl Pearson's correlation coefficient), Table-4:

| Corr. with log.cost | Corr. Value | Comment |
|---|---|---|
| D | 0.953 (strongly positive) | Where D is high, log.cost is high. |
| T1 | 0.362 (weakly positive) | Where T1 is high, log.cost is high. |
| T2 | 0.707 (moderately positive) | Where T2 is high, log.cost is high. |
| S | 0.782 (moderately positive) | Where S is high, log.cost is high. |
| RN | -0.080 (no correlation) | - |
| N | -0.158 (weakly negative) | Where N is high, log.cost is low. |
So, several independent variables have a linear correlation with the dependent variable, and we can therefore build a multiple linear regression model.
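Whether any single coefficient differs significantly from zero can be tested with cor.test(); a sketch for the strongest pair:

# Pearson test of H0: rho = 0 between D and log.cost
cor.test(clean_data$D, clean_data$log.cost, method = "pearson")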
Creating a Validation Dataset: Now we are going to split the dataset in training/test and validation dataset:
set.seed(3550)
# create a list of 80% of the rows in the original dataset we can use for training
validation_index = createDataPartition(clean_data$log.cost, p=0.80, list=FALSE)
# select 20% of the data for validation
validation = clean_data[-validation_index,]
# use the remaining 80% of data to training and testing the models
dataset = clean_data[validation_index,]
head(dataset)
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 1 Belleville 1980 3901 22 97 1310 1 1 1 0 13 1 3.591176
## 3 Blayais 1977 1870 9 59 910 1 1 0 1 14 1 3.271842
## 4 Blayais 1977 1392 9 73 910 2 1 0 1 14 1 3.143639
## 6 Blayais 1978 2555 17 66 910 4 1 0 1 6 1 3.407391
## 7 Bugey 1972 1215 21 76 910 2 0 1 1 4 0 3.084576
## 8 Bugey 1973 839 13 66 910 3 0 1 1 21 0 2.923762
dim(dataset)
## [1] 42 13
head(validation)
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 5 Blayais 1978 2818 17 55 910 3 1 0 1 6 1 3.449941
## 12 Cattenom 1980 6270 14 91 1300 2 1 1 0 2 0 3.797268
## 23 Cruas 1978 3497 17 68 915 1 0 1 1 4 1 3.543696
## 31 Fessenheim 1971 726 12 76 880 1 1 1 1 4 1 2.860937
## 33 Flamanville 1979 6086 20 84 1330 1 1 0 0 4 0 3.784332
## 38 Graveliness 1975 1555 11 69 910 2 1 0 1 14 1 3.191730
dim(validation)
## [1] 8 13
Multiple linear regression model building: Here we exclude the Location variable because it is nominal with many levels and cannot be coded as binary.
mlr_model1 <- lm(log.cost ~ D + T1 + T2 + S + RN + North + CT + CP + N + JP, data = dataset)
summary(mlr_model1)
##
## Call:
## lm(formula = log.cost ~ D + T1 + T2 + S + RN + North + CT + CP +
## N + JP, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.113194 -0.027026 -0.009452 0.026391 0.136811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.717e+02 1.102e+01 -15.581 3.27e-16 ***
## D 8.736e-02 5.670e-03 15.408 4.45e-16 ***
## T1 -1.805e-03 3.471e-03 -0.520 0.60674
## T2 2.104e-03 1.507e-03 1.396 0.17269
## S 1.724e-03 1.299e-03 1.327 0.19412
## RN 1.632e-03 1.043e-02 0.157 0.87662
## North1 3.939e-02 3.169e-02 1.243 0.22316
## CT1 8.513e-02 3.953e-02 2.153 0.03918 *
## CP1 7.426e-01 5.243e-01 1.416 0.16669
## N -1.090e-02 2.356e-03 -4.626 6.25e-05 ***
## JP1 -1.258e-01 3.368e-02 -3.735 0.00076 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06593 on 31 degrees of freedom
## Multiple R-squared: 0.9727, Adjusted R-squared: 0.9639
## F-statistic: 110.5 on 10 and 31 DF, p-value: < 2.2e-16
Comment on model1 report:
R-squared: The model explains 97% of the total variation in the response, indicating a good fit: 97% of the total variation in y (log.cost) is explained by the set of independent variables.
Adj. R-squared: 0.9639, i.e. the independent variables remain jointly relevant after adjusting for the number of predictors.
Prob (F-statistic): To check whether the above model is adequate for estimating the construction cost, we conduct the F-test with the following null and alternative hypotheses:
H0: β1 = β2 = … = β10 = 0 (model is not adequate); H1: at least one βi ≠ 0 (model is adequate)
We find that the p-value < 2.2e-16 < 0.05, so we reject the null hypothesis at the 5% level of significance.
Hence, the model is adequate for estimating the construction cost.
The T1, T2, S, RN, North, and CP features have p-values > 0.05 and small t-values, so they are not significant in the overall model (note that CT1 is significant here, with p = 0.039).
The D, N, and JP features have p-values < 0.05 and adequate t-values, so they are significant in the overall model.
Before dropping the insignificant variables to build a better model, we first check for multicollinearity, which can distort these individual tests.
Assumption #2: Minimal multicollinearity among independent variables.
Multicollinearity check:
- The best way to identify multicollinearity is to calculate the Variance Inflation Factor (VIF) for every independent variable in the dataset.
- VIF measures the strength of the correlation among the independent variables; it is computed by taking each variable and regressing it against every other independent variable.
- The VIF score of an independent variable represents how well that variable is explained by the other independent variables.

Variance Inflation Factor (VIF):
- VIF = 1: no multicollinearity
- 1 < VIF <= 5: low to moderate multicollinearity
- VIF > 5: high multicollinearity

We will remove the variables with the highest VIF scores, one at a time.
car::vif(mlr_model1)
## D T1 T2 S RN North CT
## 3.223553 1.393837 4.135044 668.693160 1.810899 2.156377 3.467027
## CP N JP
## 626.518738 1.402745 2.740366
In the results above, the variables S (VIF = 668.69) and CP (VIF = 626.52) have the highest VIF scores. So we drop S and re-check the VIF scores for the remaining variables.
mlr_model2 <- lm(log.cost ~ D + T1 + T2 + RN + North + CT + CP + N + JP, data = dataset)
summary(mlr_model2)
##
## Call:
## lm(formula = log.cost ~ D + T1 + T2 + RN + North + CT + CP +
## N + JP, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11921 -0.03760 -0.01140 0.03183 0.13834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.718e+02 1.115e+01 -15.411 2.32e-16 ***
## D 8.859e-02 5.659e-03 15.654 < 2e-16 ***
## T1 -8.691e-04 3.438e-03 -0.253 0.802066
## T2 2.071e-03 1.525e-03 1.358 0.184042
## RN -4.554e-03 9.437e-03 -0.483 0.632720
## North1 1.754e-02 2.740e-02 0.640 0.526507
## CT1 4.761e-02 2.796e-02 1.703 0.098331 .
## CP1 4.978e-02 5.029e-02 0.990 0.329617
## N -1.096e-02 2.383e-03 -4.598 6.37e-05 ***
## JP1 -1.409e-01 3.207e-02 -4.393 0.000115 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06671 on 32 degrees of freedom
## Multiple R-squared: 0.9712, Adjusted R-squared: 0.963
## F-statistic: 119.7 on 9 and 32 DF, p-value: < 2.2e-16
Comment on model2 report:
R-squared: 97%
Adj. R-squared: 96%
The T1, T2, RN, North, CT, and CP features still have p-values > 0.05 and small t-values, so they are not significant in the overall model.
The D, N, and JP features have p-values < 0.05 and adequate t-values, so they are significant in the overall model.
We will drop the insignificant variables to build a better model, after re-checking the VIF scores.
car::vif(mlr_model2)
## D T1 T2 RN North CT CP N
## 3.136932 1.336327 4.133888 1.449064 1.574523 1.694543 5.628677 1.402256
## JP
## 2.426907
In the results above, the variable CP (VIF = 5.63) has a VIF score > 5. So we drop CP and check the model performance and VIF scores for the remaining variables.
mlr_model3 <- lm(log.cost ~ D + T1 + T2 + RN + North + CT + N + JP, data = dataset)
summary(mlr_model3)
##
## Call:
## lm(formula = log.cost ~ D + T1 + T2 + RN + North + CT + N + JP,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.111653 -0.042496 -0.006608 0.032044 0.135571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.683e+02 1.056e+01 -15.937 < 2e-16 ***
## D 8.684e-02 5.373e-03 16.161 < 2e-16 ***
## T1 -6.580e-04 3.431e-03 -0.192 0.84908
## T2 1.309e-03 1.317e-03 0.995 0.32720
## RN -2.396e-03 9.179e-03 -0.261 0.79567
## North1 1.622e-02 2.736e-02 0.593 0.55741
## CT1 5.507e-02 2.692e-02 2.046 0.04882 *
## N -1.005e-02 2.200e-03 -4.570 6.52e-05 ***
## JP1 -1.274e-01 2.902e-02 -4.390 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06669 on 33 degrees of freedom
## Multiple R-squared: 0.9703, Adjusted R-squared: 0.9631
## F-statistic: 134.6 on 8 and 33 DF, p-value: < 2.2e-16
Comment on model3: There is no significant change in overall model performance compared with model2. So we can advance our analysis by eliminating the insignificant variables.
mlr_model4 <- lm(log.cost ~ D + RN + N + JP, data = dataset)
summary(mlr_model4)
##
## Call:
## lm(formula = log.cost ~ D + RN + N + JP, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.137460 -0.038170 -0.009338 0.036098 0.160221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.722e+02 8.332e+00 -20.665 < 2e-16 ***
## D 8.890e-02 4.210e-03 21.119 < 2e-16 ***
## RN -8.578e-03 8.247e-03 -1.040 0.305
## N -1.073e-02 2.107e-03 -5.090 1.07e-05 ***
## JP1 -1.381e-01 2.731e-02 -5.057 1.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06889 on 37 degrees of freedom
## Multiple R-squared: 0.9644, Adjusted R-squared: 0.9606
## F-statistic: 250.8 on 4 and 37 DF, p-value: < 2.2e-16
Comment on model4:
R-squared: 96%
Adj. R-squared: 96%
The D, N, and JP features have p-values < 0.05 and adequate t-values, so they are significant in the overall model; RN remains insignificant (p = 0.305) but is retained.
We now proceed with model4 as our fitted model.
Fitting model:
fitted_mlr = fitted.values(mlr_model4)
fitted_mlr
## 1 3 4 6 7 8 9 10
## 3.567120 3.289678 3.281099 3.438661 3.081933 2.979908 3.083040 3.074462
## 13 14 17 18 24 25 26 27
## 3.919503 4.021283 3.666606 3.800566 3.477271 3.663513 3.526216 3.208406
## 28 29 30 32 34 35 36 37
## 3.199828 3.212703 3.204125 2.943838 3.814630 3.979565 4.084438 3.111867
## 39 40 41 42 43 44 45 46
## 3.083984 3.164311 3.476082 3.467504 3.869207 3.852995 3.535038 3.626092
## 47 48 49 50 51 52 53 55
## 3.727873 3.786747 3.947386 4.116618 3.702122 3.725724 3.222225 3.162407
## 56 57
## 3.153829 3.159070
Predicting y:
y_predict = predict(mlr_model4, validation)
y_predict
## 5 12 23 31 33 38 54 58
## 3.447240 3.814630 3.485849 2.863511 3.712849 3.103289 3.213647 3.150492
Calculating Residuals:
Residual = validation$log.cost - y_predict
as.data.frame(Residual)
## Residual
## 5 0.002701416
## 12 -0.017362037
## 23 0.057846684
## 31 -0.002574418
## 33 0.071482888
## 38 0.088441748
## 54 0.033097456
## 58 0.056872995
Model Performance:
mse = mean((Residual)^2)
mae = mean(abs(Residual))
rmse = sqrt(mse)
R2 = 1-(sum((Residual)^2)/sum((validation$log.cost-mean(validation$log.cost))^2))
cat(" MAE:", mae, "\n", "MSE:", mse, "\n", "RMSE:", rmse, "\n", "R-squared:", R2)
## MAE: 0.04129746
## MSE: 0.002615416
## RMSE: 0.05114114
## R-squared: 0.9710001
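The already-loaded caret package computes the same validation metrics in one call; note that caret's R-squared is the squared correlation between predictions and observations, so it can differ slightly from the 1 - SSE/SST formula used above:

# Returns RMSE, Rsquared, and MAE for the validation predictions
postResample(pred = y_predict, obs = validation$log.cost)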
Assumption #3: Normality of Residual distribution.
#get list of residuals
res <- resid(mlr_model4)
res
## 1 3 4 6 7 8
## 0.024055539 -0.017835960 -0.137460241 -0.031270577 0.002643218 -0.056146282
## 9 10 13 14 17 18
## -0.103036447 -0.027576537 0.160221037 -0.017607353 0.011548307 0.053010506
## 24 25 26 27 28 29
## -0.042382736 0.058874006 0.043976723 -0.010299106 -0.008377000 0.004254091
## 30 32 34 35 36 37
## 0.128919006 -0.021632087 -0.121782659 -0.040445589 0.083646329 0.025803801
## 39 40 41 42 43 44
## 0.109140641 -0.042095403 -0.069542240 -0.023147202 -0.013687542 0.006323811
## 45 46 47 48 49 50
## 0.011751123 -0.077457090 -0.031341547 0.034439085 -0.029985049 -0.050292332
## 51 52 53 55 56 57
## -0.080115791 0.052427088 0.036651284 0.137318078 0.030578501 0.037934595
Plotting Residuals:
#create Q-Q plot for residuals
qqnorm(res)
#add a straight diagonal line to the plot
qqline(res)
In the Q-Q plot, the x-axis displays the theoretical quantiles of a normal distribution and the y-axis displays the sample quantiles of the residuals; if the residuals are normally distributed, the points fall roughly along the straight diagonal line.
The residuals stray from the line somewhat near the tails, which could indicate a mild departure from normality, but not seriously enough to require changes to the model.
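A formal complement to the visual check is the Shapiro-Wilk test from base R's stats package; a minimal sketch:

# H0: residuals are normally distributed; p > 0.05 fails to reject normality
shapiro.test(res)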
Produce a density plot:
We can also produce a density plot, which is also useful for visually checking whether or not the residuals are normally distributed. If the plot is roughly bell-shaped, then the residuals likely follow a normal distribution.
#Create density plot of residuals
plot(density(res))
We can see that the density plot roughly follows a bell shape, supporting the conclusion that the residuals are approximately normally distributed.
Assumption #4: The variance of the residuals is constant (homoscedasticity).
The simplest way to determine whether this assumption is met is to create a plot of the residuals versus the fitted (predicted) values.
Once you fit a regression model to a dataset, you can create a scatter plot that shows the model's fitted values on the x-axis and the residuals on the y-axis.
If the points in the scatter plot exhibit a pattern, then heteroscedasticity is present.
#produce residual vs. fitted plot
plot(fitted(mlr_model4), res)
#add a horizontal line at 0
abline(0,0)
Notice that the residuals are scattered around zero with no clear pattern, so the constant-variance assumption appears to be satisfied.
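A formal complement to this visual check is the score test for non-constant error variance, available in the already-loaded car package; a minimal sketch:

# H0: constant error variance; a large p-value supports homoscedasticity
ncvTest(mlr_model4)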
Assumption #5: The values of the residuals are independent (Auto-correlation).
One of the key assumptions in linear regression is that there is no correlation between the residuals, e.g. the residuals are independent.
One way to determine if this assumption is met is to perform a Durbin-Watson test, which is used to detect the presence of autocorrelation in the residuals of a regression. This test uses the following hypotheses:
H0: There is no correlation among the residuals. HA: The residuals are autocorrelated.
Perform a Durbin-Watson test:
#perform Durbin-Watson test
durbinWatsonTest(mlr_model4)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1454758 1.697558 0.194
## Alternative hypothesis: rho != 0
From the output we can see that the test statistic is 1.6976 and the corresponding p-value is 0.194. Since this p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that there is no significant autocorrelation among the residuals of this regression model.
So our model fulfilled all the assumptions for multiple linear regression.
Verifying multiple linear regression by mathematical calculation:
df_prediction = data.frame(validation$log.cost, y_predict, Residual)
df_prediction
## validation.log.cost y_predict Residual
## 5 3.449941 3.447240 0.002701416
## 12 3.797268 3.814630 -0.017362037
## 23 3.543696 3.485849 0.057846684
## 31 2.860937 2.863511 -0.002574418
## 33 3.784332 3.712849 0.071482888
## 38 3.191730 3.103289 0.088441748
## 54 3.246745 3.213647 0.033097456
## 58 3.207365 3.150492 0.056872995
print(mlr_model4)
##
## Call:
## lm(formula = log.cost ~ D + RN + N + JP, data = dataset)
##
## Coefficients:
## (Intercept) D RN N JP1
## -1.722e+02 8.891e-02 -8.578e-03 -1.073e-02 -1.381e-01
validation[1, ]
## Location D Cost T1 T2 S RN North CT CP N JP log.cost
## 5 Blayais 1978 2818 17 55 910 3 1 0 1 6 1 3.449941
Here,
x1 = value of D, x2 = value of RN, x3 = value of N, x4 = value of JP1 (joint planning with another reactor)
β0 = -1.722e+02, β1 = 8.891e-02, β2 = -8.578e-03, β3 = -1.073e-02, β4 = -1.381e-01
y = β0 + β1x1(D) + β2x2(RN) + β3x3(N) + β4x4(JP1)
row_predict = 5
y = -1.722e+02 + (8.891e-02*1978) + (-8.578e-03*3) + (-1.073e-02*6) + (-1.381e-01*1)
y
## [1] 3.435766
Construction_cost = 10^3.435766
cat("Construction cost:", Construction_cost, "Million EUR.")
## Construction cost: 2727.508 Million EUR.
So the hand-calculated value (3.4358) closely matches the model's predict() output for this row (3.4472); the small difference comes from rounding the coefficients.
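The same calculation can be performed with predict() on a one-row data frame, which uses the unrounded coefficients; a sketch mirroring validation row 5 (the JP value must be supplied as a factor with the training levels):

# Hypothetical reactor: started 1978, reactor number 3, 6 architects, joint planning
new_reactor <- data.frame(D = 1978, RN = 3, N = 6,
                          JP = factor(1, levels = c(0, 1)))
pred_log <- predict(mlr_model4, newdata = new_reactor)
10^pred_log  # back-transform log10(cost) to million EUR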
Using model4 as the appropriate regression model, we can predict the construction cost of nuclear reactors in France.
In conclusion, for a reactor whose construction started in 1978, as reactor number 3 at its site, with 6 architects participating and with joint planning with another reactor, the predicted construction cost is approximately 2727.508 million EUR.