In semiconductor manufacturing, thin film deposition is a critical process where precise control over film thickness ensures optimal electrical properties such as resistance and conductivity. Small variations in deposition conditions can impact device performance, making it essential to analyze the relationship between film thickness (nm) and electrical resistance (mOhm).
This study focuses on Sputter Deposition, a common Physical Vapor Deposition (PVD) technique used for depositing conductive metal films. While this method provides high purity and precise thickness control, slight fluctuations in deposition rate or plasma conditions can lead to resistance variations, affecting the reliability of semiconductor components. Understanding this relationship helps assess process stability and optimize manufacturing parameters.
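This report applies Exploratory Data Analysis (EDA) and Linear Regression Modeling to determine whether film thickness is a strong predictor of electrical resistance. The analysis proceeds in three stages: exploratory data analysis, regression model fitting with assumption checking (including a Box-Cox transformation), and confidence and prediction intervals for the fitted model.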
Exploratory Data Analysis (EDA) is essential for understanding the underlying structure and patterns in the dataset before applying regression modeling. This section examines the distribution of variables, detects potential outliers, and evaluates the relationship between Film Thickness and Electrical Resistance. Boxplots and histograms are used to assess the spread and symmetry of each variable, while scatter plots and correlation analysis help determine the strength and direction of their relationship.
In this section, the dataset is imported and loaded into a structured format for analysis. The dataset is read from a CSV file and displayed to ensure proper loading. Following this, an initial examination of the data is conducted, including summary statistics, structural insights, and missing value assessment.
DATA LOADING AND PREPROCESSING
# Create the dataframe
# Raw GitHub URL
url <- "https://raw.githubusercontent.com/JairoRodriguezB/Datasets/main/semiconductor_SLR_dataset.csv"
# Read the CSV file
data <- read.csv(url)
head(data)
## Film_Thickness_nm Electrical_Resistance_mOhm
## 1 87.45 15.118
## 2 145.07 23.601
## 3 123.20 19.904
## 4 109.87 16.103
## 5 65.60 12.901
## 6 65.60 13.278
rmarkdown::paged_table(data)
INITIAL DATA ANALYSIS
This section provides an overview of the dataset, focusing on key statistical summaries and data structure. Summary statistics are computed for numerical variables to understand their distribution and central tendencies.
# Summary statistics for numerical variables
summary(data)
## Film_Thickness_nm Electrical_Resistance_mOhm
## Min. : 50.55 Min. :11.68
## 1st Qu.: 69.32 1st Qu.:13.60
## Median : 96.42 Median :15.74
## Mean : 97.02 Mean :16.80
## 3rd Qu.:123.02 3rd Qu.:19.82
## Max. :148.69 Max. :25.74
# Overview of the dataset structure
glimpse(data)
## Rows: 100
## Columns: 2
## $ Film_Thickness_nm <dbl> 87.45, 145.07, 123.20, 109.87, 65.60, 65.60…
## $ Electrical_Resistance_mOhm <dbl> 15.118, 23.601, 19.904, 16.103, 12.901, 13.…
# Count the number of missing values (NA) in each column
colSums(is.na(data))
## Film_Thickness_nm Electrical_Resistance_mOhm
## 0 0
Observations:
Analyzing the distribution of variables is an important step before applying regression modeling. Understanding the variability, central tendency, and potential deviations in the dataset helps assess whether key statistical assumptions are met.
By examining both X and Y, this analysis provides insights into their spread, symmetry, and possible irregularities. Identifying patterns in the data allows for a better evaluation of their relationship and the potential need for transformations to improve model performance.
Boxplots provide a clear visualization of the distribution, variability, and potential outliers in the dataset. In addition to the boxplot for Electrical Resistance (mOhm) (Y), a boxplot for Film Thickness (nm) (X) has been included to analyze its distribution and assess its potential impact on the linear regression model. A comparative boxplot further highlights differences in variability between both variables, ensuring a better understanding of their behavior before applying regression analysis.
INDIVIDUAL BOXPLOTS (VISUAL COMPARISON)
# INDIVIDUAL BOXPLOTS (VISUAL COMPARISON)
par(mfrow = c(1, 2), mar = c(5, 3, 4, 0.5) + 0.1, oma = c(2, 0, 2, 0))
# Boxplot 1: Electrical Resistance
boxplot(data$Electrical_Resistance_mOhm,
main = "Boxplot of Electrical Resistance",
xlab = "Electrical Resistance (mOhm)",
col = "lightblue", border = "black",
horizontal = TRUE)
# Boxplot 2: Film Thickness
boxplot(data$Film_Thickness_nm,
main = "Boxplot of Film Thickness",
xlab = "Film Thickness (nm)",
col = "lightgreen", border = "black",
horizontal = TRUE)
mtext("Comparison of Electrical Resistance and Film Thickness",
outer = TRUE, cex = 1.5, font = 2)
COMPARATIVE BOXPLOTS
# COMPARATIVE BOXPLOTS
boxplot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
names = c("Film Thickness", "Electrical Resistance"),
main = "Boxplots of Film Thickness and Electrical Resistance",
col = c("lightblue", "lightgreen"), border = "black")
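As a numeric complement to the boxplots, the 1.5 × IQR whisker rule that boxplot() applies can be checked directly. The following is a minimal sketch on made-up values (the vector `resistance` and its planted outlier are illustrative assumptions, not the report's dataset):

```r
# Hypothetical toy measurements with one planted extreme value
resistance <- c(12.1, 13.4, 14.0, 15.2, 15.9, 16.3, 17.8, 19.5, 21.0, 40.0)

# boxplot.stats() returns the five-number summary and the points beyond the whiskers
stats <- boxplot.stats(resistance)
outliers_box <- stats$out

# Similar manual fences based on quantile(); boxplot.stats() uses Tukey hinges,
# so the two sets of fences can differ slightly on small samples
q <- quantile(resistance, c(0.25, 0.75))
fences <- c(q[1] - 1.5 * IQR(resistance), q[2] + 1.5 * IQR(resistance))
outliers_manual <- resistance[resistance < fences[1] | resistance > fences[2]]

print(outliers_box)
print(outliers_manual)
```

Applied to the columns of `data`, an empty result from `boxplot.stats(...)$out` would correspond to a boxplot with no points drawn beyond the whiskers.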
Observations:
Histograms provide a visual representation of the distribution of both Electrical Resistance (Y) and Film Thickness (X). In addition to examining the behavior of Y, the inclusion of X allows for a more comprehensive analysis of its distribution and potential impact on the linear regression model. The density curve (red line) has been added to each histogram to better visualize the shape of the data, aiding in the evaluation of key statistical properties before applying regression analysis.
# HISTOGRAMS
par(mfrow = c(1, 2), mar = c(5, 4, 4, 3) + 0.1, oma = c(0, 0, 3, 0))
# Histogram 1: Electrical Resistance
hist_resistance <- hist(data$Electrical_Resistance_mOhm,
main = "Histogram of Electrical Resistance",
xlab = "Electrical Resistance (mOhm)",
col = "lightblue", border = "black",
breaks = 10, prob = TRUE)
# Add density curve (outline)
lines(density(data$Electrical_Resistance_mOhm), col = "red", lwd = 2)
# Histogram 2: Film Thickness
hist_thickness <- hist(data$Film_Thickness_nm,
main = "Histogram of Film Thickness",
xlab = "Film Thickness (nm)",
col = "lightgreen", border = "black",
breaks = 10, prob = TRUE)
# Add density curve
lines(density(data$Film_Thickness_nm), col = "red", lwd = 2)
mtext("Distribution of Electrical Resistance and Film Thickness",
outer = TRUE, cex = 1.5, font = 2)
Observations:
To assess the relationship between Film Thickness (X) and Electrical Resistance (Y), a scatter plot and correlation analysis were conducted. The scatter plot visually examines whether a linear relationship exists between the variables, while the correlation matrix quantifies the strength and direction of this relationship. A correlation coefficient close to 1 in absolute value would indicate a strong linear association, which supports the use of a linear regression model.
These analyses help determine whether Film Thickness (X) serves as a significant predictor of Electrical Resistance (Y), guiding the next steps in the regression modeling process.
SCATTER PLOT
Scatter plots provide a visual representation of the relationship between Film Thickness (X) and Electrical Resistance (Y). In addition to assessing the overall trend between the variables, this visualization helps identify potential deviations from linearity, outliers, or patterns that could impact the regression model. The distribution of points allows for a preliminary evaluation of whether a linear model is appropriate for describing the relationship, guiding further analysis before proceeding with regression assumptions and model fitting.
# Scatter Plot
plot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
main = "Scatter Plot: Film Thickness vs Electrical Resistance",
xlab = "Film Thickness (nm)",
ylab = "Electrical Resistance (mOhm)",
pch = 20)
CORRELATION MATRIX
Correlation analysis provides a quantitative measure of the strength and direction of the relationship between Film Thickness (X) and Electrical Resistance (Y). In addition to visually inspecting the relationship, computing the correlation coefficient helps confirm whether a strong linear association exists. The correlation matrix offers a structured representation of this relationship, aiding in assessing the suitability of a linear regression model.
# Correlation Matrix
cor_matrix <- cor(data)
ggcorrplot(cor_matrix,
lab = TRUE,
colors = c("red", "white", "#4A90E2"),
outline.color = "black",
show.legend = TRUE)
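Alongside the correlation plot, the coefficient can be reported numerically with a confidence interval and significance test. The following is a minimal sketch using base R's cor() and cor.test() on simulated values (the vectors `thickness` and `resistance` and their generating parameters are assumptions for illustration only):

```r
# Simulated stand-in data (hypothetical; not the report's dataset)
set.seed(42)
thickness  <- runif(60, min = 50, max = 150)
resistance <- 5 + 0.12 * thickness + rnorm(60, sd = 1)

# Pearson correlation coefficient
r <- cor(thickness, resistance)

# Test of H0: correlation = 0, with a 95% confidence interval
ct <- cor.test(thickness, resistance)
print(ct$estimate)
print(ct$conf.int)
print(ct$p.value)

# In simple linear regression, the squared correlation equals R^2
r_squared <- summary(lm(resistance ~ thickness))$r.squared
print(c(r^2, r_squared))
```

With the report's data, the equivalent call would be `cor.test(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm)`.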
Observations:
This section focuses on fitting a linear regression model to analyze the relationship between Film Thickness (X) and Electrical Resistance (Y) using the Least Squares method. The model is implemented in R using the lm() function, which estimates the regression coefficients and evaluates model performance through the coefficient of determination (\(R^2\)). However, for the model to be valid, key assumptions—normality of residuals and constant variance (homoscedasticity)—must be met. These assumptions are verified using the Residuals vs. Fitted Plot and Q-Q Plot. If violations are detected, a Box-Cox transformation is applied to stabilize variance and improve normality. The transformed model is then re-evaluated to ensure compliance with regression assumptions, leading to a more reliable and interpretable model.
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable (Y) and an independent variable (X). In this case, the goal is to analyze how Film Thickness (X) influences Electrical Resistance (Y) and determine the strength of their association.
The model follows the equation:
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
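Where \(Y\) is the electrical resistance (mOhm), \(X\) is the film thickness (nm), \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the random error term.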
LEAST SQUARES ESTIMATION
Since the parameters \(\beta_0\) and \(\beta_1\) are unknown, they must be estimated using the Ordinary Least Squares (OLS) method. This method finds the values \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squared residuals, giving the best linear fit to the observed data in the least-squares sense.
These estimates are computed as:
\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
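Where \(x_i\) and \(y_i\) are the observed film thickness and electrical resistance values, and \(\bar{x}\) and \(\bar{y}\) are their sample means.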
These estimates define the regression equation:
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]
Where \(\hat{Y}\) represents the predicted electrical resistance for a given film thickness.
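The closed-form estimates above can be checked against lm() directly. The following is a minimal sketch on simulated data (the vectors `x` and `y` and the generating coefficients are assumptions chosen for illustration):

```r
# Simulated stand-in data (hypothetical; not the report's dataset)
set.seed(1)
x <- runif(40, min = 50, max = 150)
y <- 5 + 0.12 * x + rnorm(40, sd = 1)

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Same estimates from lm()
fit <- lm(y ~ x)
print(c(intercept = b0, slope = b1))
print(coef(fit))
```

The two sets of coefficients should agree up to floating-point error, since lm() solves the same least squares problem (through a QR decomposition rather than these explicit sums).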
SUM OF SQUARES AND MODEL EVALUATION
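The variability in \(Y\) is described by three sums of squares: the total sum of squares (SST), the regression sum of squares (SSR), and the error sum of squares (SSE):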
\[ SST = \sum (y_i - \bar{y})^2 \]
\[ SSR = \sum (\hat{y}_i - \bar{y})^2 \]
\[ SSE = \sum (y_i - \hat{y}_i)^2 \]
These components are related by the fundamental equation:
\[ SST = SSR + SSE \]
Model performance is evaluated using the coefficient of determination (\(R^2\)), which measures the proportion of variance in \(Y\) accounted for by \(X\):
\[ R^2 = \frac{SSR}{SST} \]
A higher \(R^2\) value indicates a stronger relationship between the variables, meaning the model explains a greater portion of the variability in \(Y\).
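The decomposition and the resulting \(R^2\) can be reproduced by hand from a fitted model. The following is a minimal sketch, again on simulated data (`x`, `y`, and the generating parameters are illustrative assumptions):

```r
# Simulated stand-in data (hypothetical; not the report's dataset)
set.seed(1)
x <- runif(40, min = 50, max = 150)
y <- 5 + 0.12 * x + rnorm(40, sd = 1)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

SST <- sum((y - mean(y))^2)      # total variability
SSR <- sum((y_hat - mean(y))^2)  # variability explained by the regression
SSE <- sum((y - y_hat)^2)        # residual variability

R2 <- SSR / SST
print(c(SST = SST, SSR = SSR, SSE = SSE, R2 = R2))
print(summary(fit)$r.squared)
```

The identity SST = SSR + SSE holds here because the model includes an intercept.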
IMPLEMENTATION OF THE MODEL
The implementation of the linear regression model is performed using the lm() function in R, which applies the Least Squares Method to estimate the regression coefficients \(\hat{\beta}_1\) and \(\hat{\beta}_0\). This method minimizes the sum of squared residuals, ensuring the best possible linear fit to the data.
Additionally, calling summary() on the fitted model reports the coefficient of determination (\(R^2\)), which is derived from the Total Sum of Squares (SST), Regression Sum of Squares (SSR), and Error Sum of Squares (SSE) and quantifies the proportion of variance in \(Y\) explained by \(X\).
It is assumed that the adequacy criteria for linear regression—normality of residuals and constant variance (homoscedasticity)—are met, allowing for the proper application of lm().
The summary() function reports key statistical outputs such as coefficient estimates, standard errors, and \(R^2\), allowing for a comprehensive evaluation of the model’s performance.
# Linear Regression
model = lm(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = data)
summary(model)
##
## Call:
## lm(formula = Electrical_Resistance_mOhm ~ Film_Thickness_nm,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27640 -0.75508 -0.08631 0.70422 2.69671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.870489 0.356848 13.65 <2e-16 ***
## Film_Thickness_nm 0.122954 0.003518 34.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.041 on 98 degrees of freedom
## Multiple R-squared: 0.9257, Adjusted R-squared: 0.925
## F-statistic: 1221 on 1 and 98 DF, p-value: < 2.2e-16
The regression model evaluates the relationship between Film Thickness (X) and Electrical Resistance (Y) using the least squares method. From the Coefficients section:
\[ \hat{Y} = 4.870 + 0.123X \]
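As a worked illustration of how this equation is used, consider a film thickness of 100 nm (an illustrative value chosen here because it lies inside the observed range of 50.55 to 148.69 nm). Substituting into the fitted equation with the rounded coefficients gives:

\[ \hat{Y} = 4.870 + 0.123 \times 100 = 17.17 \text{ mOhm} \]

The slope can be read the same way: each additional nanometer of film thickness is associated with an increase of about 0.123 mOhm in predicted resistance on the original scale.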
Conclusions:
In the previous section, the linear regression model was fitted using the lm() function under the assumption that the key regression assumptions—normality of residuals and constant variance (homoscedasticity)—hold. However, these assumptions must be verified to ensure the model’s validity.
In this section, assumption checking is performed through a Q-Q plot of the residuals (normality) and a Residuals vs. Fitted plot (constant variance).
If these assumptions are violated, the previous model may be considered invalid, as its estimates, statistical significance, and predictive reliability could be compromised. In such cases, a data transformation (Box-Cox) will be required.
NORMALITY VERIFICATION (QQ RESIDUALS PLOT)
# ASSUMPTION CHECKING
# Normality verification (QQ residuals Plot)
plot(model,2)
CONSTANT VARIANCE VERIFICATION (RESIDUALS VS. FITTED PLOT)
# ASSUMPTION CHECKING
# Constant variance verification (Residuals vs. Fitted Plot)
plot(model,1)
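The visual checks above can be complemented with formal tests. The following is a minimal sketch, not part of the original analysis, using base R only: shapiro.test() for normality of the residuals, and a hand-computed studentized (Koenker) Breusch-Pagan statistic for constant variance. The data are simulated and the names `fit`, `aux`, `bp_stat`, and `bp_p` are illustrative assumptions:

```r
# Simulated stand-in data (hypothetical; not the report's dataset)
set.seed(1)
x <- runif(80, min = 50, max = 150)
y <- 5 + 0.12 * x + rnorm(80, sd = 1)

fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test; H0: the residuals are normally distributed
sw <- shapiro.test(res)

# Studentized Breusch-Pagan statistic: n * R^2 from regressing the squared
# residuals on the predictor, compared with a chi-square with 1 df.
# H0: constant variance
aux     <- lm(I(res^2) ~ x)
bp_stat <- length(res) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(sw$p.value)
print(c(statistic = bp_stat, p_value = bp_p))
```

Small p-values would point toward a violation of the corresponding assumption. These tests are a supplement to the diagnostic plots rather than a replacement, because with large samples they can flag deviations that are practically negligible.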
Conclusions:
IDENTIFYING THE OPTIMAL TRANSFORMATION
To address non-constant variance in the residuals, a Box-Cox transformation was applied. The plot shows the log-likelihood for different values of \(\lambda\), with the peak indicating the optimal transformation.
The estimated \(\lambda\) is approximately -0.81, suggesting an inverse power transformation.
boxcox(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = data)
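Rather than reading \(\lambda\) off the plot, the value that maximizes the profile log-likelihood can be extracted from the object returned by MASS::boxcox(). The following is a minimal sketch on simulated data (the generating model, in which \(y^{-0.5}\) is linear in \(x\), is an assumption chosen only so that a negative \(\lambda\) is plausible):

```r
library(MASS)

# Simulated stand-in data (hypothetical; not the report's dataset)
set.seed(1)
x <- runif(100, min = 50, max = 150)
y <- (0.3 - 0.001 * x + rnorm(100, sd = 0.005))^(-2)

# Evaluate the profile log-likelihood on a fine grid without plotting
bc <- boxcox(y ~ x, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
lambda_ci <- range(bc$x[bc$y > max(bc$y) - 0.5 * qchisq(0.95, df = 1)])

print(lambda_hat)
print(lambda_ci)
```

The interval corresponds to the dashed vertical lines drawn on the Box-Cox plot.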
APPLYING THE BOX-COX TRANSFORMATION
After determining an optimal \(\lambda \approx -0.81\), the transformation was applied to the response variable Electrical Resistance by raising it to the power \(\lambda\). A second Box-Cox plot was generated to verify whether the transformation improved variance stabilization.
The new plot shows an estimated \(\lambda \approx 1\), which indicates that the transformed response needs no further power transformation.
lambda = -0.81
data$new_Electrical_Resistance <- data$Electrical_Resistance_mOhm^lambda
boxcox(new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
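For reference, the standard Box-Cox family is defined as:

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases} \]

The code above uses the simple power \(y^{\lambda}\) instead. The two forms are linearly related:

\[ y^{\lambda} = 1 + \lambda \, y^{(\lambda)} \]

so a linear regression on either version has the same \(R^2\) and the same residual pattern. The difference is one of direction: for \(\lambda < 0\), \(y^{\lambda}\) decreases as \(y\) increases, while \(y^{(\lambda)}\) still increases with \(y\). With \(\lambda = -0.81\), larger resistance values therefore map to smaller transformed values.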
VISUALIZING DATA BEFORE AND AFTER TRANSFORMATION
In this section, the original data and its transformed counterpart are visualized to assess the impact of the Box-Cox transformation. The first comparison presents scatter plots of Film Thickness (X) vs. Electrical Resistance (Y) before and after transformation, allowing for an evaluation of the linearity and spread of the data. The second comparison uses histograms with density curves to examine changes in the distribution of Electrical Resistance, highlighting the effect of transformation on normality.
par(mfrow = c(1, 2), oma = c(0, 0, 3, 0))
# Original data plot
plot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
main = "Original Data",
xlab = "Film Thickness (nm)",
ylab = "Electrical Resistance (mOhm)",
col = "blue", pch = 16)
# Transformed data plot
plot(data$Film_Thickness_nm, data$new_Electrical_Resistance,
main = "Transformed Data",
xlab = "Film Thickness (nm)",
ylab = "Transformed Electrical Resistance",
col = "red", pch = 16)
mtext("Comparison of Original and Transformed Data", outer = TRUE, cex = 1.5, font = 2)
# HISTOGRAMS
par(mfrow = c(1, 2), mar = c(5, 4, 4, 3) + 0.1, oma = c(0, 0, 3, 0))
# Histogram 1: BEFORE
hist_resistance <- hist(data$Electrical_Resistance_mOhm,
main = "Electrical Resistance",
xlab = "Electrical Resistance (mOhm)",
col = "lightblue", border = "black",
breaks = 10, prob = TRUE)
# Add density curve (outline)
lines(density(data$Electrical_Resistance_mOhm), col = "red", lwd = 2)
# Histogram 2: AFTER
hist_thickness <- hist(data$new_Electrical_Resistance,
main = "Transformed Electrical Resistance",
xlab = "Transformed Electrical Resistance",
col = "lightgreen", border = "black",
breaks = 10, prob = TRUE)
# Add density curve
lines(density(data$new_Electrical_Resistance), col = "red", lwd = 2)
mtext("Comparison of Original and Transformed Data", outer = TRUE, cex = 1.5, font = 2)
Observations:
FITTING THE TRANSFORMED MODEL
After applying the Box-Cox transformation to stabilize variance and improve normality, a new linear regression model is fitted using the transformed response variable (new_Electrical_Resistance).
model2 = lm(new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
summary(model2)
##
## Call:
## lm(formula = new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0111971 -0.0019573 -0.0002081 0.0025100 0.0093906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.629e-01 1.342e-03 121.44 <2e-16 ***
## Film_Thickness_nm -5.932e-04 1.323e-05 -44.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003915 on 98 degrees of freedom
## Multiple R-squared: 0.9536, Adjusted R-squared: 0.9531
## F-statistic: 2012 on 1 and 98 DF, p-value: < 2.2e-16
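Written out from the Coefficients table, the transformed model and its back-transformed form are:

\[ \hat{Y}^{\lambda} = 0.1629 - 0.0005932 X, \qquad \lambda = -0.81 \]

\[ \hat{Y} = \left( 0.1629 - 0.0005932 X \right)^{1/\lambda} \]

As a worked check at \(X = 100\) nm (an illustrative value inside the observed thickness range):

\[ \hat{Y}^{\lambda} = 0.1629 - 0.0005932 \times 100 = 0.10358, \qquad \hat{Y} = 0.10358^{-1/0.81} \approx 16.43 \text{ mOhm} \]

The negative slope does not mean that resistance falls with thickness. Because \(\lambda\) is negative, the power transformation reverses the ordering of the response, so a decreasing line on the transformed scale corresponds to increasing resistance on the original scale.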
Observations:
EVALUATING ASSUMPTIONS POST-TRANSFORMATION
After fitting the transformed model, it is important to reassess whether the assumptions of linear regression are now better satisfied. In particular, we verify constant variance (homoscedasticity) by comparing the Residuals vs. Fitted plots before and after transformation.
# ASSUMPTION REVALIDATION
# Constant variance verification (Residuals vs. Fitted Plot)
# ORIGINAL MODEL
plot(model, 1, main = "Original Model")
# ASSUMPTION REVALIDATION
# Constant variance verification (Residuals vs. Fitted Plot)
# TRANSFORMED MODEL
plot(model2, 1, main = "Transformed Model")
Observations:
In linear regression analysis, confidence and prediction intervals quantify the uncertainty associated with estimating the response variable (\(Y\)) at a given predictor value (\(X\)). These intervals are important for understanding the reliability of predictions and assessing model performance.
CONFIDENCE INTERVAL ON THE EXPECTED RESPONSE
The confidence interval (CI) provides a range where the mean response \(\hat{Y}\) is expected to fall for a given \(X_j\), with a specified confidence level (95%). It accounts for the uncertainty in estimating the population mean response and is given by:
\[ \left( \hat{\beta}_0 + \hat{\beta}_1 x_j \right) \pm t_{1-\alpha/2, n-2} \left( \sqrt{ MSE \left[ \frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right] } \right) \]
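Where \(t_{1-\alpha/2, n-2}\) is the critical value of the t-distribution with \(n-2\) degrees of freedom, \(MSE\) is the mean squared error of the fit, \(n\) is the number of observations, and \(\bar{x}\) is the sample mean of \(X\).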
PREDICTION INTERVAL ON AN INDIVIDUAL RESPONSE
The prediction interval (PI) estimates the range where an individual future observation \(Y_j\) is likely to fall for a given \(X_j\) . Unlike the confidence interval, the prediction interval incorporates both the variability in estimating the mean response and the inherent variability of individual observations. The equation is:
\[ \left( \hat{\beta}_0 + \hat{\beta}_1 x_j \right) \pm t_{1-\alpha/2, n-2} \left( \sqrt{ MSE \left[ 1 + \frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right] } \right) \]
IMPLEMENTATION OF INTERVALS
Before computing confidence and prediction intervals, it is necessary to define a range of predictor values (\(X\)) over which these intervals will be evaluated. In the R implementation, this is achieved using:
x <- data$Film_Thickness_nm
y <- data$new_Electrical_Resistance
newx <- seq(min(x), max(x), 0.01)
This command generates a sequence of \(X\) values ranging from the minimum to the maximum observed film thickness in small increments (0.01). By defining newx, we obtain a continuous set of predictor values for which we can compute interval estimates, allowing for a smooth visual representation of the fitted regression model. Each value of newx is substituted into the regression equation:
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_j \]
where \(X_j\) represents each value in “newx”. The corresponding confidence and prediction intervals are then calculated for each point \(X_j\).
The predict() function automates these calculations based on the fitted regression model. Internally, it performs the following steps:
\[ MSE = \frac{SSE}{n-2} \]
Calculates the Standard Error for Each \(X_j\): Based on the variance formula in the interval equations.
Uses the t-Distribution Critical Value: Extracts \(t_{1-\alpha/2, n-2}\) to determine the interval width.
Generates Interval Estimates:
Confidence Intervals (interval = "confidence") estimate the expected mean response at each predictor value.
Prediction Intervals (interval = "prediction") account for the variability in individual observations.
Each row in the computed confidence and prediction intervals contains:
Estimated Response (\(\hat{Y}\)): The predicted value of the dependent variable at a given predictor value.
Lower Bound: The lower limit of the corresponding interval, representing the minimum plausible value within the specified confidence level.
Upper Bound: The upper limit of the corresponding interval, representing the maximum plausible value within the specified confidence level.
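The steps listed above can be reproduced by hand and compared with predict(). The following is a minimal sketch on simulated data (the vectors `x` and `y`, the point `x0`, and the helper names are illustrative assumptions), using the standard interval formulas with the squared deviation \((x_j - \bar{x})^2\) in the numerator:

```r
# Simulated stand-in data (hypothetical; not the report's dataset)
set.seed(1)
x <- runif(50, min = 50, max = 150)
y <- 5 + 0.12 * x + rnorm(50, sd = 1)

fit <- lm(y ~ x)
n   <- length(x)
x0  <- 100  # point at which the intervals are evaluated

# Mean squared error and t critical value for a 95% interval
mse    <- sum(residuals(fit)^2) / (n - 2)
t_crit <- qt(0.975, df = n - 2)

# Point estimate and the term 1/n + (x0 - mean)^2 / Sxx
y0 <- unname(coef(fit)[1] + coef(fit)[2] * x0)
h  <- 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2)

ci_manual <- y0 + c(-1, 1) * t_crit * sqrt(mse * h)        # mean response
pi_manual <- y0 + c(-1, 1) * t_crit * sqrt(mse * (1 + h))  # individual response

# Same intervals from predict(); columns are fit, lwr, upr
ci_pred <- predict(fit, data.frame(x = x0), interval = "confidence", level = 0.95)
pi_pred <- predict(fit, data.frame(x = x0), interval = "prediction", level = 0.95)

print(rbind(manual = ci_manual, predict = ci_pred[1, c("lwr", "upr")]))
print(rbind(manual = pi_manual, predict = pi_pred[1, c("lwr", "upr")]))
```

The prediction interval is always wider than the confidence interval at the same point, because of the extra 1 inside the square root.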
CONFIDENCE AND PREDICTION INTERVALS
This analysis models the relationship between film thickness (nm) and electrical resistance using linear regression, where the red line represents the fitted model. The response variable, electrical resistance, is used in its transformed form to improve linearity and model performance. To assess the reliability of predictions, 95% confidence intervals (green, dashed) indicate where the true mean response is expected to lie, while 95% prediction intervals (black, dashed) show the range for individual future observations.
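# CONFIDENCE AND PREDICT INTERVALS
# Note: this reassigns `model`, replacing the original (untransformed) fit
# with the transformed-response fit on the plain vectors x and y
model <- lm(y ~ x)
conf <- predict(model, data.frame(x=newx), interval="confidence", level=0.95)
pred <- predict(model, data.frame(x=newx), interval="prediction", level=0.95)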
plot(x, y, main = "Linear Regression with Confidence and Prediction Intervals",
xlab = "Film Thickness (nm)",
ylab = "Transformed Electrical Resistance",
pch = 16, col = "blue")
abline(model, col="red", lwd=2)
# Add confidence interval lines
lines(newx, conf[,2], col="darkgreen", lty=2, lwd=2) # Lower CI limit
lines(newx, conf[,3], col="darkgreen", lty=2, lwd=2) # Upper CI limit
# Add prediction interval lines
lines(newx, pred[,2], col="black", lty=2, lwd=2) # Lower PI limit
lines(newx, pred[,3], col="black", lty=2, lwd=2) # Upper PI limit
legend("topright", legend=c("Regression Line", "Confidence Interval", "Prediction Interval"),
col=c("red", "darkgreen", "black"), lty=c(1, 2, 2), lwd=c(2, 2, 2), bty="n")
ESTIMATED RESISTANCE AND VARIABILITY AT 100 NM
In semiconductor manufacturing, target film thickness depends on the application and the material being deposited. For conductive metal layers, typical thickness values range between 50 nm and 150 nm, with 100 nm frequently monitored as the midpoint of the standard process range and a reference for quality control. This analysis estimates the electrical resistance at 100 nm using the fitted transformed regression model and applies the inverse transformation to report the prediction in mOhm. To assess uncertainty, a 95% confidence interval (CI) gives the range where the true mean resistance is expected to lie, while a 95% prediction interval (PI) accounts for variability in individual observations. These estimates help evaluate the expected variation in resistance at this critical thickness, allowing engineers to assess process stability and compliance with design specifications. A scatter plot with the fitted regression line, confidence intervals, and prediction intervals then visualizes the model’s predictions and their uncertainty.
X0 <- data.frame(x = 100) # specific value for Film Thickness (100 nm)
ci_100 <- predict(model, X0, interval="confidence", level=0.95) # Confidence interval
pi_100 <- predict(model, X0, interval="prediction", level=0.95) # Prediction interval
# Results (back-transformed to mOhm)
# Because lambda < 0, the power transformation is decreasing, so the lower and
# upper bounds swap on the original scale: upr maps to the lower limit
cat("Prediction and Interval Estimates:\n",
"-> Predicted resistance at 100 nm:", (ci_100[1])^(1/lambda), "mOhm\n",
"-> Confidence Interval (95%):", (ci_100[3])^(1/lambda), "-", (ci_100[2])^(1/lambda), "mOhm\n",
"-> Prediction Interval (95%):", (pi_100[3])^(1/lambda), "-", (pi_100[2])^(1/lambda), "mOhm\n"
)
## Prediction and Interval Estimates:
## -> Predicted resistance at 100 nm: 16.43143 mOhm
## -> Confidence Interval (95%): 16.2798 - 16.58563 mOhm
## -> Prediction Interval (95%): 15.02146 - 18.10062 mOhm
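Because \(\lambda\) is negative, the inverse transformation \(z^{1/\lambda}\) is a decreasing function, so the lower bound on the transformed scale becomes the upper bound in mOhm. The following is a minimal sketch of a helper that back-transforms an interval and restores the lower/upper ordering (the function name `back_transform` and the numeric inputs are hypothetical values for illustration, not outputs of the report's model):

```r
lambda <- -0.81

# Inverse of the simple power transformation y^lambda
back_transform <- function(z, lambda) z^(1 / lambda)

# Hypothetical fit / lower / upper values on the transformed scale
interval_t <- c(fit = 0.1036, lwr = 0.1028, upr = 0.1044)

# Back-transform, then reorder so that lwr < fit < upr on the original scale
raw <- back_transform(interval_t, lambda)
interval_orig <- c(fit = raw[["fit"]],
                   lwr = min(raw[c("lwr", "upr")]),
                   upr = max(raw[c("lwr", "upr")]))

print(raw)            # bounds appear swapped here
print(interval_orig)  # bounds in the usual order
```

Taking min() and max() after the back-transformation keeps the helper correct for either sign of \(\lambda\).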
Now, we provide a refined view of the linear regression model, emphasizing the predicted resistance at 100 nm. This plot presents the regression line, along with confidence and prediction intervals, to illustrate the expected resistance variation and assess the model’s predictive uncertainty in this critical region.
# Limits for the zoom
zoom_x_min <- 80
zoom_x_max <- 120
zoom_y_min <- min(conf[,2])
zoom_y_max <- max(conf[,3])
# Filter data within the zoom range
x_zoom <- x[x >= zoom_x_min & x <= zoom_x_max]
y_zoom <- y[x >= zoom_x_min & x <= zoom_x_max]
# Define new x values within the zoom range
newx_zoom <- seq(zoom_x_min, zoom_x_max, 0.01)
# Compute confidence and prediction intervals within the zoom range
conf_zoom <- predict(model, data.frame(x=newx_zoom), interval="confidence", level=0.95)
pred_zoom <- predict(model, data.frame(x=newx_zoom), interval="prediction", level=0.95)
# Plot the zoomed-in region
plot(x_zoom, y_zoom, main="Detailed Zoom on Linear Regression",
xlab="Film Thickness (nm)", ylab="Transformed Electrical Resistance",
pch=16, col="blue", cex=1.4,
xlim=c(zoom_x_min, zoom_x_max), ylim=c(zoom_y_min, zoom_y_max))
abline(model, col="red", lwd=2)
# Plot the intervals
lines(newx_zoom, conf_zoom[,2], col="darkgreen", lty=2, lwd=2)
lines(newx_zoom, conf_zoom[,3], col="darkgreen", lty=2, lwd=2)
lines(newx_zoom, pred_zoom[,2], col="black", lty=2, lwd=2)
lines(newx_zoom, pred_zoom[,3], col="black", lty=2, lwd=2)
# Prediction point at X0 = 100 nm
points(100, ci_100[1], col="black", pch=18, cex=2, lwd=3)
text(100, ci_100[1] - 0.004, labels="100 nm", col="black", cex=1.2, font=2)
legend("topright", legend=c("Regression Line", "Confidence Interval", "Prediction Interval"),
col=c("red", "darkgreen", "black"), lty=c(1, 2, 2), lwd=c(2, 2, 2), bty="n")
Data Quality & Distribution:
EDA Findings:
Regression Model Performance:
Model Refinement with Transformation:
Prediction & Variability at 100 nm:
Strength of the Relationship Between Film Thickness and Electrical Resistance:
Predictive Power of the Model:
Implications for Process Control & Quality Improvement:
# Libraries
library(dplyr)
library(tidyr)
library(MASS)
library(psych)
library(ggcorrplot)
library(ggplot2)
# SECTION 1: EXPLORATORY DATA ANALYSIS (EDA) ----------------------------------------------------------
# DATA LOADING AND PREPROCESSING
# Create the dataframe
# Raw GitHub URL
url <- "https://raw.githubusercontent.com/JairoRodriguezB/Datasets/main/semiconductor_SLR_dataset.csv"
# Read the CSV file
data <- read.csv(url)
head(data)
# Summary statistics for numerical variables
summary(data)
# Overview of the dataset structure
glimpse(data)
# Count the number of missing values (NA) in each column
colSums(is.na(data))
# INDIVIDUAL BOXPLOTS (VISUAL COMPARISON)
par(mfrow = c(1, 2), mar = c(5, 3, 4, 0.5) + 0.1, oma = c(2, 0, 2, 0))
# Boxplot 1: Electrical Resistance
boxplot(data$Electrical_Resistance_mOhm,
main = "Boxplot of Electrical Resistance",
xlab = "Electrical Resistance (mOhm)",
col = "lightblue", border = "black",
horizontal = TRUE)
# Boxplot 2: Film Thickness
boxplot(data$Film_Thickness_nm,
main = "Boxplot of Film Thickness",
xlab = "Film Thickness (nm)",
col = "lightgreen", border = "black",
horizontal = TRUE)
mtext("Comparison of Electrical Resistance and Film Thickness",
outer = TRUE, cex = 1.5, font = 2)
# COMPARATIVE BOXPLOTS
boxplot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
names = c("Film Thickness", "Electrical Resistance"),
main = "Boxplots of Film Thickness and Electrical Resistance",
col = c("lightblue", "lightgreen"), border = "black")
# HISTOGRAMS
par(mfrow = c(1, 2), mar = c(5, 4, 4, 3) + 0.1, oma = c(0, 0, 3, 0))
# Histogram 1: Electrical Resistance
hist_resistance <- hist(data$Electrical_Resistance_mOhm,
main = "Histogram of Electrical Resistance",
xlab = "Electrical Resistance (mOhm)",
col = "lightblue", border = "black",
breaks = 10, prob = TRUE)
# Add density curve (outline)
lines(density(data$Electrical_Resistance_mOhm), col = "red", lwd = 2)
# Histogram 2: Film Thickness
hist_thickness <- hist(data$Film_Thickness_nm,
main = "Histogram of Film Thickness",
xlab = "Film Thickness (nm)",
col = "lightgreen", border = "black",
breaks = 10, prob = TRUE)
# Add density curve
lines(density(data$Film_Thickness_nm), col = "red", lwd = 2)
mtext("Distribution of Electrical Resistance and Film Thickness",
outer = TRUE, cex = 1.5, font = 2)
# RELATIONSHIP ANALYSIS
# Scatter Plot
plot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
main = "Scatter Plot: Film Thickness vs Electrical Resistance",
xlab = "Film Thickness (nm)",
ylab = "Electrical Resistance (mOhm)",
pch = 20)
# Correlation Matrix
cor_matrix <- cor(data)
ggcorrplot(cor_matrix,
lab = TRUE,
colors = c("red", "white", "#4A90E2"),
outline.color = "black",
show.legend = TRUE)
# -----------------------------------------------------------------------------------------------------
# REGRESSION MODEL FITTING AND ASSUMPTION CHECKING-----------------------------------------------------
# Linear Regression
model = lm(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = data)
summary(model)
# NORMALITY VERIFICATION
# Normality verification (QQ residuals Plot)
plot(model,2)
# Constant variance verification (Residuals vs. Fitted Plot)
plot(model,1)
# TRANSFORMING DATASET
boxcox(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = data)
lambda = -0.81
data$new_Electrical_Resistance <- data$Electrical_Resistance_mOhm^lambda
boxcox(new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
# VISUALIZING DATA BEFORE AND AFTER TRANSFORMATION
par(mfrow = c(1, 2), oma = c(0, 0, 3, 0))
# Original data plot
plot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
main = "Original Data",
xlab = "Film Thickness (nm)",
ylab = "Electrical Resistance (mOhm)",
col = "blue", pch = 16)
# Transformed data plot
plot(data$Film_Thickness_nm, data$new_Electrical_Resistance,
main = "Transformed Data",
xlab = "Film Thickness (nm)",
ylab = "Transformed Electrical Resistance",
col = "red", pch = 16)
mtext("Comparison of Original and Transformed Data", outer = TRUE, cex = 1.5, font = 2)
# HISTOGRAMS
par(mfrow = c(1, 2), mar = c(5, 4, 4, 3) + 0.1, oma = c(0, 0, 3, 0))
# Histogram 1: BEFORE
hist_resistance <- hist(data$Electrical_Resistance_mOhm,
main = "Electrical Resistance",
xlab = "Electrical Resistance (mOhm)",
col = "lightblue", border = "black",
breaks = 10, prob = TRUE)
# Add density curve (outline)
lines(density(data$Electrical_Resistance_mOhm), col = "red", lwd = 2)
# Histogram 2: AFTER
hist_thickness <- hist(data$new_Electrical_Resistance,
main = "Transformed Electrical Resistance",
xlab = "Transformed Electrical Resistance",
col = "lightgreen", border = "black",
breaks = 10, prob = TRUE)
# Add density curve
lines(density(data$new_Electrical_Resistance), col = "red", lwd = 2)
mtext("Comparison of Original and Transformed Data", outer = TRUE, cex = 1.5, font = 2)
# FITTING THE TRANSFORMED MODEL
model2 = lm(new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
summary(model2)
# EVALUATING ASSUMPTIONS POST-TRANSFORMATION
# Constant variance verification (Residuals vs. Fitted Plot)
# ORIGINAL MODEL
plot(model, 1, main = "Original Model")
# Constant variance verification (Residuals vs. Fitted Plot)
# TRANSFORMED MODEL
plot(model2, 1, main = "Transformed Model")
# -----------------------------------------------------------------------------------------------------
# INTERVALS -------------------------------------------------------------------------------------------
# IMPLEMENTATION OF INTERVALS
x <- data$Film_Thickness_nm
y <- data$new_Electrical_Resistance
newx <- seq(min(x), max(x), 0.01)
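# CONFIDENCE AND PREDICT INTERVALS
# Note: this reassigns `model`, replacing the original (untransformed) fit
# with the transformed-response fit on the plain vectors x and y
model <- lm(y ~ x)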
conf <- predict(model, data.frame(x=newx), interval="confidence", level=0.95)
pred <- predict(model, data.frame(x=newx), interval="prediction", level=0.95)
par(pin=c(6,4))
plot(x, y, main = "Linear Regression with Confidence and Prediction Intervals",
xlab = "Film Thickness (nm)",
ylab = "Transformed Electrical Resistance",
pch = 16, col = "blue")
abline(model, col="red", lwd=2)
# Add confidence interval lines
lines(newx, conf[,2], col="darkgreen", lty=2, lwd=2) # Lower CI limit
lines(newx, conf[,3], col="darkgreen", lty=2, lwd=2) # Upper CI limit
# Add prediction interval lines
lines(newx, pred[,2], col="black", lty=2, lwd=2) # Lower PI limit
lines(newx, pred[,3], col="black", lty=2, lwd=2) # Upper PI limit
legend("topright", legend=c("Regression Line", "Confidence Interval", "Prediction Interval"),
col=c("red", "darkgreen", "black"), lty=c(1, 2, 2), lwd=c(2, 2, 2), bty="n")
X0 <- data.frame(x = 100) # specific value for Film Thickness (100 nm)
ci_100 <- predict(model, X0, interval="confidence", level=0.95) # Confidence interval
pi_100 <- predict(model, X0, interval="prediction", level=0.95) # Prediction interval
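# Results (back-transformed to mOhm)
# Because lambda < 0, the power transformation is decreasing, so the lower and
# upper bounds swap on the original scale: upr maps to the lower limit
cat("Prediction and Interval Estimates:\n",
"-> Predicted resistance at 100 nm:", (ci_100[1])^(1/lambda), "mOhm\n",
"-> Confidence Interval (95%):", (ci_100[3])^(1/lambda), "-", (ci_100[2])^(1/lambda), "mOhm\n",
"-> Prediction Interval (95%):", (pi_100[3])^(1/lambda), "-", (pi_100[2])^(1/lambda), "mOhm\n"
)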
# Limits for the zoom
zoom_x_min <- 80
zoom_x_max <- 120
zoom_y_min <- min(conf[,2])
zoom_y_max <- max(conf[,3])
# Filter data within the zoom range
x_zoom <- x[x >= zoom_x_min & x <= zoom_x_max]
y_zoom <- y[x >= zoom_x_min & x <= zoom_x_max]
# Define new x values within the zoom range
newx_zoom <- seq(zoom_x_min, zoom_x_max, 0.01)
# Compute confidence and prediction intervals within the zoom range
conf_zoom <- predict(model, data.frame(x=newx_zoom), interval="confidence", level=0.95)
pred_zoom <- predict(model, data.frame(x=newx_zoom), interval="prediction", level=0.95)
# Plot the zoomed-in region
plot(x_zoom, y_zoom, main="Detailed Zoom on Linear Regression",
xlab="Film Thickness (nm)", ylab="Transformed Electrical Resistance",
pch=16, col="blue", cex=1.4,
xlim=c(zoom_x_min, zoom_x_max), ylim=c(zoom_y_min, zoom_y_max))
abline(model, col="red", lwd=2)
# Plot the intervals
lines(newx_zoom, conf_zoom[,2], col="darkgreen", lty=2, lwd=2)
lines(newx_zoom, conf_zoom[,3], col="darkgreen", lty=2, lwd=2)
lines(newx_zoom, pred_zoom[,2], col="black", lty=2, lwd=2)
lines(newx_zoom, pred_zoom[,3], col="black", lty=2, lwd=2)
# Prediction point at X0 = 100 nm
points(100, ci_100[1], col="black", pch=18, cex=2, lwd=3)
text(100, ci_100[1] - 0.004, labels="100 nm", col="black", cex=1.2, font=2)
legend("topright", legend=c("Regression Line", "Confidence Interval", "Prediction Interval"),
col=c("red", "darkgreen", "black"), lty=c(1, 2, 2), lwd=c(2, 2, 2), bty="n")
# -----------------------------------------------------------------------------------------------------