1 INTRODUCTION

In semiconductor manufacturing, thin film deposition is a critical process where precise control over film thickness ensures optimal electrical properties such as resistance and conductivity. Small variations in deposition conditions can impact device performance, making it essential to analyze the relationship between film thickness (nm) and electrical resistance (mOhm).

This study focuses on Sputter Deposition, a common Physical Vapor Deposition (PVD) technique used for depositing conductive metal films. While this method provides high purity and precise thickness control, slight fluctuations in deposition rate or plasma conditions can lead to resistance variations, affecting the reliability of semiconductor components. Understanding this relationship helps assess process stability and optimize manufacturing parameters.

This report applies Exploratory Data Analysis (EDA) and Linear Regression Modeling to determine whether film thickness is a strong predictor of electrical resistance. The analysis is structured as follows:

  1. Data Loading and Initial Analysis: Importing and summarizing the dataset to assess completeness and structure.
  2. Exploratory Data Analysis (EDA): Using statistical visualizations (boxplots, histograms, scatter plots) to examine distributions and detect patterns.
  3. Regression Model Fitting and Assumption Checking: Developing a linear model, testing assumptions (residual diagnostics, normality checks), and validating its applicability.
  4. Transformations and Model Refinement: Applying a Box-Cox transformation if needed to improve variance stability and normality.
  5. Prediction and Confidence Intervals: Estimating electrical resistance at 100 nm, with confidence and prediction intervals for reliability assessment.


2 EXPLORATORY DATA ANALYSIS (EDA)

Exploratory Data Analysis (EDA) is essential for understanding the underlying structure and patterns in the dataset before applying regression modeling. This section examines the distribution of variables, detects potential outliers, and evaluates the relationship between Film Thickness and Electrical Resistance. Boxplots and histograms are used to assess the spread and symmetry of each variable, while scatter plots and correlation analysis help determine the strength and direction of their relationship.

2.1 DATA LOADING AND INITIAL ANALYSIS

In this section, the dataset is imported and loaded into a structured format for analysis. The dataset is read from a CSV file and displayed to ensure proper loading. Following this, an initial examination of the data is conducted, including summary statistics, structural insights, and missing value assessment.


DATA LOADING AND PREPROCESSING

# Create the dataframe
# Raw GitHub URL
url <- "https://raw.githubusercontent.com/JairoRodriguezB/Datasets/main/semiconductor_SLR_dataset.csv"

# Read the CSV file
data <- read.csv(url)
head(data)
##   Film_Thickness_nm Electrical_Resistance_mOhm
## 1             87.45                     15.118
## 2            145.07                     23.601
## 3            123.20                     19.904
## 4            109.87                     16.103
## 5             65.60                     12.901
## 6             65.60                     13.278
rmarkdown::paged_table(data)


INITIAL DATA ANALYSIS

This section provides an overview of the dataset, focusing on key statistical summaries and data structure. Summary statistics are computed for numerical variables to understand their distribution and central tendencies.

# Summary statistics for numerical variables
summary(data)  
##  Film_Thickness_nm Electrical_Resistance_mOhm
##  Min.   : 50.55    Min.   :11.68             
##  1st Qu.: 69.32    1st Qu.:13.60             
##  Median : 96.42    Median :15.74             
##  Mean   : 97.02    Mean   :16.80             
##  3rd Qu.:123.02    3rd Qu.:19.82             
##  Max.   :148.69    Max.   :25.74
# Overview of the dataset structure (glimpse() comes from the dplyr package)
library(dplyr)
glimpse(data)
## Rows: 100
## Columns: 2
## $ Film_Thickness_nm          <dbl> 87.45, 145.07, 123.20, 109.87, 65.60, 65.60…
## $ Electrical_Resistance_mOhm <dbl> 15.118, 23.601, 19.904, 16.103, 12.901, 13.…
# Count the number of missing values (NA) in each column
colSums(is.na(data))
##          Film_Thickness_nm Electrical_Resistance_mOhm 
##                          0                          0

Observations:

  • The dataset consists of 100 observations and 2 numerical variables: Film_Thickness_nm and Electrical_Resistance_mOhm.
  • Film Thickness (nm): The thickness of the film ranges from 50.55 nm to 148.69 nm, with a median of 96.42 nm and a mean of 97.02 nm. The distribution appears relatively symmetric, with the first and third quartiles at 69.32 nm and 123.02 nm, respectively.
  • Electrical Resistance (mOhm): The resistance values vary between 11.68 mOhm and 25.74 mOhm, with a median of 15.74 mOhm and a mean of 16.80 mOhm. The higher mean compared to the median suggests a slight right-skewness in the data distribution.
  • Data Quality: There are no missing values in the dataset, which simplifies further analysis and ensures that all observations can be used without imputation or removal.


2.2 DISTRIBUTION VISUALIZATION

Analyzing the distribution of variables is an important step before applying regression modeling. Understanding the variability, central tendency, and potential deviations in the dataset helps assess whether key statistical assumptions are met.

By examining both X and Y, this analysis provides insights into their spread, symmetry, and possible irregularities. Identifying patterns in the data allows for a better evaluation of their relationship and the potential need for transformations to improve model performance.


2.2.1 BOXPLOTS

Boxplots provide a clear visualization of the distribution, variability, and potential outliers in the dataset. In addition to the boxplot for Electrical Resistance (mOhm) (Y), a boxplot for Film Thickness (nm) (X) has been included to analyze its distribution and assess its potential impact on the linear regression model. A comparative boxplot further highlights differences in variability between both variables, ensuring a better understanding of their behavior before applying regression analysis.


INDIVIDUAL BOXPLOTS (VISUAL COMPARISON)

# INDIVIDUAL BOXPLOTS (VISUAL COMPARISON)
par(mfrow = c(1, 2), mar = c(5, 3, 4, 0.5) + 0.1, oma = c(2, 0, 2, 0)) 

# Boxplot 1: Electrical Resistance
boxplot(data$Electrical_Resistance_mOhm,
        main = "Boxplot of Electrical Resistance",
        xlab = "Electrical Resistance (mOhm)",
        col = "lightblue", border = "black",
        horizontal = TRUE)

# Boxplot 2: Film Thickness
boxplot(data$Film_Thickness_nm,
        main = "Boxplot of Film Thickness",
        xlab = "Film Thickness (nm)",
        col = "lightgreen", border = "black",
        horizontal = TRUE)

mtext("Comparison of Electrical Resistance and Film Thickness", 
      outer = TRUE, cex = 1.5, font = 2)


COMPARATIVE BOXPLOTS

# COMPARATIVE BOXPLOTS
boxplot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
        names = c("Film Thickness", "Electrical Resistance"),
        main = "Boxplots of Film Thickness and Electrical Resistance",
        # Same colors as the individual boxplots: thickness green, resistance blue
        col = c("lightgreen", "lightblue"), border = "black")

Observations:

  • The boxplot of Electrical Resistance (Y) suggests a slight right skew (positive skewness), as the upper whisker extends further than the lower whisker. This indicates that some higher resistance values are present, although no extreme outliers are observed.
  • The boxplot of Film Thickness (X) appears symmetrically distributed, with a central tendency around 100 nm and no significant outliers.
  • The comparative boxplot highlights the difference in scale between X (Film Thickness) and Y (Electrical Resistance). Film thickness has a wider range (~50 to 150 nm), while electrical resistance is more constrained (~12 to 26 mOhm).
  • Film Thickness shows higher dispersion. In contrast, Electrical Resistance appears more concentrated, but its slight skewness suggests that some resistance values deviate from the central trend.
  • The skewness in Y may indicate the need for a transformation (Box-Cox) to improve normality before applying the regression model.
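
The visual outlier check can also be done numerically. The sketch below applies the usual 1.5 × IQR rule to the six resistance values printed by head(data), so it runs without downloading the file. The helper name iqr_outliers is hypothetical, and quantile() computes quartiles slightly differently from the hinges used by boxplot(), so borderline points may differ.

```r
# Six resistance values shown by head(data); an illustrative sample only
resistance <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

# Hypothetical helper: return values lying beyond k * IQR from the quartiles
iqr_outliers <- function(v, k = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  v[v < q[1] - fence | v > q[2] + fence]
}

# An empty result means no value is flagged in this sample
iqr_outliers(resistance)
```

On the full dataset, the same helper could be applied to data$Electrical_Resistance_mOhm and data$Film_Thickness_nm.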


2.2.2 HISTOGRAMS

Histograms provide a visual representation of the distribution of both Electrical Resistance (Y) and Film Thickness (X). In addition to examining the behavior of Y, the inclusion of X allows for a more comprehensive analysis of its distribution and potential impact on the linear regression model. The density curve (red line) has been added to each histogram to better visualize the shape of the data, aiding in the evaluation of key statistical properties before applying regression analysis.


# HISTOGRAMS
par(mfrow = c(1, 2), mar = c(5, 4, 4, 3) + 0.1, oma = c(0, 0, 3, 0))

# Histogram 1: Electrical Resistance
hist_resistance <- hist(data$Electrical_Resistance_mOhm, 
     main = "Histogram of Electrical Resistance", 
     xlab = "Electrical Resistance (mOhm)", 
     col = "lightblue", border = "black",
     breaks = 10, prob = TRUE)

# Add density curve (outline)
lines(density(data$Electrical_Resistance_mOhm), col = "red", lwd = 2)

# Histogram 2:  Film Thickness
hist_thickness <- hist(data$Film_Thickness_nm, 
     main = "Histogram of Film Thickness", 
     xlab = "Film Thickness (nm)", 
     col = "lightgreen", border = "black",
     breaks = 10, prob = TRUE)

# Add density curve
lines(density(data$Film_Thickness_nm), col = "red", lwd = 2)

mtext("Distribution of Electrical Resistance and Film Thickness", 
      outer = TRUE, cex = 1.5, font = 2)

Observations:

  • The Electrical Resistance (Y) histogram exhibits a right-skewed distribution, with most values concentrated between 12 and 18 mOhm and a longer tail extending towards higher resistance values (~20 to 25 mOhm). The density curve confirms this skewness, indicating that the data may not be normally distributed. This aligns with the boxplot analysis, which also suggested a lack of symmetry, reinforcing the potential need for a Box-Cox transformation to improve normality.
  • The Film Thickness (X) histogram shows a broader variability, with values ranging between ~50 nm and 150 nm.
  • The differences in distribution highlight that while Y is slightly skewed, X appears more dispersed and irregular. Linear regression places no distributional requirement on X itself, but the spread of X determines the range over which the fitted line is supported by data. The skewness in Y, as observed in both the boxplot and histogram, may require transformation (e.g., Box-Cox), while linearity and homoscedasticity are assessed later through residual diagnostics.


2.3 RELATIONSHIP ANALYSIS

To assess the relationship between Film Thickness (X) and Electrical Resistance (Y), a scatter plot and correlation analysis were conducted. The scatter plot visually examines whether a linear relationship exists between the variables, while the correlation matrix quantifies the strength and direction of this relationship. A high correlation value would indicate a strong association, which is crucial for validating the assumptions of a linear regression model.

These analyses help determine whether Film Thickness (X) serves as a significant predictor of Electrical Resistance (Y), guiding the next steps in the regression modeling process.


SCATTER PLOT

Scatter plots provide a visual representation of the relationship between Film Thickness (X) and Electrical Resistance (Y). In addition to assessing the overall trend between the variables, this visualization helps identify potential deviations from linearity, outliers, or patterns that could impact the regression model. The distribution of points allows for a preliminary evaluation of whether a linear model is appropriate for describing the relationship, guiding further analysis before proceeding with regression assumptions and model fitting.

# Scatter Plot
plot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
     main = "Scatter Plot: Film Thickness vs Electrical Resistance",
     xlab = "Film Thickness (nm)",
     ylab = "Electrical Resistance (mOhm)",
     pch = 20)


CORRELATION MATRIX

Correlation analysis provides a quantitative measure of the strength and direction of the relationship between Film Thickness (X) and Electrical Resistance (Y). In addition to visually inspecting the relationship, computing the correlation coefficient helps confirm whether a strong linear association exists. The correlation matrix offers a structured representation of this relationship, aiding in assessing the suitability of a linear regression model.

# Correlation Matrix (ggcorrplot() comes from the ggcorrplot package)
library(ggcorrplot)
cor_matrix <- cor(data)
ggcorrplot(cor_matrix, 
           lab = TRUE,               
           colors = c("red", "white", "#4A90E2"), 
           outline.color = "black",   
           show.legend = TRUE)

Observations:

  • The scatter plot shows a strong positive relationship between Film Thickness (X) and Electrical Resistance (Y). As film thickness increases, electrical resistance also increases in a nearly linear pattern, suggesting that a linear regression model may be appropriate.
  • The correlation matrix confirms this strong association, with a correlation coefficient of 0.96, indicating a very strong positive linear correlation. This suggests that Film Thickness is a strong predictor of Electrical Resistance.
  • Given the high correlation, the next steps involve fitting a linear regression model, followed by an assessment of residuals to verify assumptions such as linearity, homoscedasticity, and normality.
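
To make the correlation coefficient concrete, the following sketch computes Pearson's r from its definition and compares it with cor(). It uses only the six rows printed by head(data), so the value it returns is not the 0.96 of the full dataset; the variable names are illustrative. It also checks the identity that, in simple linear regression, \(R^2\) equals \(r^2\).

```r
# Six rows shown by head(data); an illustrative sample, not the full dataset
x <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
y <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

# Pearson correlation from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)

# With a single predictor, the model's R-squared equals r squared
r_squared <- summary(lm(y ~ x))$r.squared
all.equal(r_manual^2, r_squared)
```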


3 REGRESSION MODEL FITTING AND ASSUMPTION CHECKING

This section focuses on fitting a linear regression model to analyze the relationship between Film Thickness (X) and Electrical Resistance (Y) using the Least Squares method. The model is implemented in R using the lm() function, which estimates the regression coefficients and evaluates model performance through the coefficient of determination (\(R^2\)). However, for the model to be valid, key assumptions—normality of residuals and constant variance (homoscedasticity)—must be met. These assumptions are verified using the Residuals vs. Fitted Plot and Q-Q Plot. If violations are detected, a Box-Cox transformation is applied to stabilize variance and improve normality. The transformed model is then re-evaluated to ensure compliance with regression assumptions, leading to a more reliable and interpretable model.


3.1 REGRESSION MODEL

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable (Y) and an independent variable (X). In this case, the goal is to analyze how Film Thickness (X) influences Electrical Resistance (Y) and determine the strength of their association.

The model follows the equation:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Where:

  • \(\beta_0 \text{: the intercept, i.e., the expected value of Y when X = 0}\)
  • \(\beta_1 \text{: the slope, representing the rate of change in Y for a unit increase in X}\)
  • \(\epsilon \text{: the error term, capturing variability not explained by the model}\)


LEAST SQUARES ESTIMATION

Since the parameters \(\beta_0\) and \(\beta_1\) are unknown, they must be estimated using the Ordinary Least Squares (OLS) method. This method finds the values \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squared residuals, giving the best linear fit to the observed data in the least squares sense.

These estimates are computed as:

\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

Where:

  • \(\hat{\beta}_1 \text{(Estimated Slope): Measures the average change in Y for each unit increase in X}\)
  • \(\hat{\beta}_0 \text{(Estimated Intercept): Represents the predicted value of Y when X = 0}\)
  • \(\bar{x} \text{ and } \bar{y} \text{: Are the sample means of X and Y, respectively}\)

These estimates define the regression equation:

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]

Where \(\hat{Y}\) represents the predicted electrical resistance for a given film thickness.
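
The closed-form estimates above can be checked directly against lm(). This is a minimal sketch using the six rows printed by head(data), so the coefficients will differ from those of the full 100-row fit; the variable names are illustrative.

```r
# Six rows shown by head(data); an illustrative sample, not the full dataset
x <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
y <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

# Closed-form least squares estimates
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# lm() should reproduce the same intercept and slope
fit <- lm(y ~ x)
all.equal(c(beta0_hat, beta1_hat), unname(coef(fit)))
```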


SUM OF SQUARES AND MODEL EVALUATION

The variability in \(Y\) is described by three sums of squares, with the total variation splitting into an explained part and an unexplained part:

  • TOTAL SUM OF SQUARES (SST): Measures the total variation in \(Y\), representing the overall dispersion of the observed values around their mean. It is calculated as:

\[ SST = \sum (y_i - \bar{y})^2 \]

  • REGRESSION SUM OF SQUARES (SSR): Represents the variation explained by the model, quantifying how much the predicted values deviate from the overall mean of \(Y\). It is given by:

\[ SSR = \sum (\hat{y}_i - \bar{y})^2 \]

  • ERROR SUM OF SQUARES (SSE): Captures the unexplained variation, measuring how much the observed values deviate from the predicted values:

\[ SSE = \sum (y_i - \hat{y}_i)^2 \]

These components are related by the fundamental equation:

\[ SST = SSR + SSE \]

Model performance is evaluated using the COEFFICIENT OF DETERMINATION (\(R^2\)), which measures the proportion of variance in \(Y\) accounted for by \(X\).

\[ R^2 = \frac{SSR}{SST} \]

A higher \(R^2\) value indicates a stronger relationship between the variables, meaning the model explains a greater portion of the variability in \(Y\).
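
As a small numerical check of the decomposition, the sketch below computes the three sums of squares for a fitted model and confirms both \(SST = SSR + SSE\) and \(R^2 = SSR/SST\). It uses the six rows printed by head(data) for illustration, so the numbers are not those of the full dataset.

```r
# Six rows shown by head(data); an illustrative sample, not the full dataset
x <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
y <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

fit <- lm(y ~ x)
y_hat <- fitted(fit)

sst <- sum((y - mean(y))^2)      # total variation
ssr <- sum((y_hat - mean(y))^2)  # variation explained by the model
sse <- sum((y - y_hat)^2)        # unexplained variation

# The decomposition holds for least squares fits with an intercept
all.equal(sst, ssr + sse)

# R-squared as reported by summary() matches SSR / SST
all.equal(ssr / sst, summary(fit)$r.squared)
```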


IMPLEMENTATION OF THE MODEL

The implementation of the linear regression model is performed using the lm() function in R, which applies the Least Squares Method to estimate the regression coefficients \(\hat{\beta}_1\) and \(\hat{\beta}_0\). This method minimizes the sum of squared residuals, ensuring the best possible linear fit to the data.

Additionally, the coefficient of determination (\(R^2\)) reported by summary() is derived from the Total Sum of Squares (SST), Regression Sum of Squares (SSR), and Error Sum of Squares (SSE) of the fitted model, and quantifies the proportion of variance in \(Y\) explained by \(X\).

It is assumed that the adequacy criteria for linear regression—normality of residuals and constant variance (homoscedasticity)—are met, allowing for the proper application of lm().

The summary() function reports key statistical outputs such as coefficient estimates, standard errors, and \(R^2\), allowing for a comprehensive evaluation of the model’s performance.

# Linear Regression
model = lm(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = data)
summary(model)
## 
## Call:
## lm(formula = Electrical_Resistance_mOhm ~ Film_Thickness_nm, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.27640 -0.75508 -0.08631  0.70422  2.69671 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.870489   0.356848   13.65   <2e-16 ***
## Film_Thickness_nm 0.122954   0.003518   34.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.041 on 98 degrees of freedom
## Multiple R-squared:  0.9257, Adjusted R-squared:  0.925 
## F-statistic:  1221 on 1 and 98 DF,  p-value: < 2.2e-16

The regression model evaluates the relationship between Film Thickness (X) and Electrical Resistance (Y) using the least squares method. From the Coefficients section:

\[ \hat{Y} = 4.870 + 0.123X \]

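  • The intercept (4.870) represents the estimated electrical resistance when the film thickness is zero. Since the observed thicknesses span roughly 50 to 150 nm, this value is an extrapolation and should not be read as a physical prediction.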
  • The slope (0.123) indicates that for every 1 nm increase in film thickness, the electrical resistance increases by 0.123 mOhm on average.
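
As a quick worked example of reading the fitted equation, substituting a thickness of 100 nm (a value inside the observed range) into the rounded coefficients gives:

\[ \hat{Y}(100) = 4.870 + 0.123 \times 100 = 17.17 \text{ mOhm} \]

This figure comes from the untransformed model and uses rounded coefficients, so it is an approximation for illustration only.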


Conclusions:

  • Strong Relationship Between Variables: The model exhibits a high coefficient of determination (\(R^2 = 0.9257\)), indicating that 92.57% of the variability in Electrical Resistance (\(Y\)) is explained by Film Thickness (\(X\)). This suggests a very strong linear relationship between the two variables.
  • Residual Analysis Suggests a Good Fit: The residual standard error is relatively small (1.041 mOhm), indicating that the model’s predictions are generally close to the observed values. The residuals range from -2.276 to 2.697, showing some spread, but the median residual (-0.086) is close to zero.
  • The F-statistic (1221) and its small p-value (< 2.2e-16) confirm that the overall model is statistically significant.


3.2 ASSUMPTION CHECKING

In the previous section, the linear regression model was fitted using the lm() function under the assumption that the key regression assumptions—normality of residuals and constant variance (homoscedasticity)—hold. However, these assumptions must be verified to ensure the model’s validity.

In this section, assumption checking is performed through:

  • Residuals vs. Fitted Plot: To assess whether residuals exhibit a random pattern, confirming constant variance.
  • QQ Residuals Plot: To evaluate whether the residuals follow a normal distribution.

If these assumptions are violated, the previous model may be considered invalid, as its estimates, statistical significance, and predictive reliability could be compromised. In such cases, a data transformation (Box-Cox) will be required.

NORMALITY VERIFICATION (QQ RESIDUALS PLOT)

# ASSUMPTION CHECKING
# Normality verification (QQ residuals Plot)
plot(model,2)



CONSTANT VARIANCE VERIFICATION (RESIDUALS VS. FITTED PLOT)

# ASSUMPTION CHECKING
# Constant variance verification (Residuals vs. Fitted Plot)
plot(model,1)


Conclusions:

  • Normality of Residuals: The Q-Q plot shows that the residuals follow an approximately normal distribution: most points align closely with the diagonal reference line, with only slight deviations at the extremes. This suggests that the assumption of normality is met.
  • Constant Variance (Homoscedasticity): The Residuals vs. Fitted plot reveals a curved pattern rather than a random scatter around zero. Curvature of this kind points to a departure from linearity, and it is treated here as evidence that the constant variance assumption does not hold.
  • Since normality is satisfied but constant variance is violated, the validity of the linear regression model is compromised. This suggests that a Box-Cox transformation of the response variable \(Y\) is needed to stabilize variance and improve the model’s reliability.
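
The graphical checks can be complemented with simple numerical ones. The sketch below is an assumption about how this might be done, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and, as a rough informal check of constant variance, a Spearman rank correlation between the absolute residuals and the fitted values. It uses the six rows printed by head(data), so with so few points the p-values are only illustrative.

```r
# Six rows shown by head(data); an illustrative sample, not the full dataset
x <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
y <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

fit <- lm(y ~ x)
res <- residuals(fit)

# Normality: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
sw$p.value

# Constant variance (informal): a strong monotone association between
# |residuals| and fitted values would suggest the spread changes with the mean
sp <- cor.test(abs(res), fitted(fit), method = "spearman")
sp$p.value
```

On the full dataset, fit would be replaced by the model object fitted above.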



3.3 TRANSFORMING DATASET

IDENTIFYING THE OPTIMAL TRANSFORMATION

To address non-constant variance in the residuals, a Box-Cox transformation was applied. The plot shows the log-likelihood for different values of \(\lambda\), with the peak indicating the optimal transformation.

The estimated \(\lambda\) is approximately -0.81, suggesting an inverse power transformation.

# boxcox() comes from the MASS package
library(MASS)
boxcox(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = data)
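
Rather than reading \(\lambda\) off the plot, the maximizing value can be extracted from the object that boxcox() returns. This is a minimal sketch using the six rows printed by head(data); with so few points the resulting \(\lambda\) is not expected to match the -0.81 obtained from the full dataset, and the grid of candidate values is an arbitrary choice.

```r
library(MASS)  # provides boxcox()

# Six rows shown by head(data); an illustrative sample, not the full dataset
x <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
y <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

# Evaluate the profile log-likelihood on a grid without drawing the plot
bc <- boxcox(y ~ x, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# The lambda with the highest log-likelihood on the grid
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```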


APPLYING THE BOX-COX TRANSFORMATION

After determining an optimal \(\lambda \approx -0.81\), the response variable Electrical Resistance was transformed using the power \(Y^{\lambda}\). A second Box-Cox plot was generated to verify whether the transformation improved variance stabilization.

The new plot shows an estimated \(\lambda \approx 1\), indicating that no further power transformation of the transformed variable is needed.

lambda = -0.81
data$new_Electrical_Resistance <- data$Electrical_Resistance_mOhm^lambda
boxcox(new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
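
For reference, the standard Box-Cox family is usually written as shown below, whereas the code above applies the plain power \(Y^{\lambda}\). The second line is a short note on how the two relate.

\[ Y^{(\lambda)} = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln Y, & \lambda = 0 \end{cases} \]

\[ Y^{\lambda} = 1 + \lambda \, Y^{(\lambda)} \]

Because \(Y^{\lambda}\) is a linear function of \(Y^{(\lambda)}\), a straight-line fit has the same \(R^2\) under either form, and the residuals differ only by the constant factor \(\lambda\). With \(\lambda = -0.81\) that factor is negative, so the ordering of values is reversed: larger resistances map to smaller transformed values.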


VISUALIZING DATA BEFORE AND AFTER TRANSFORMATION

In this section, the original data and its transformed counterpart are visualized to assess the impact of the Box-Cox transformation. The first comparison presents scatter plots of Film Thickness (X) vs. Electrical Resistance (Y) before and after transformation, allowing for an evaluation of the linearity and spread of the data. The second comparison uses histograms with density curves to examine changes in the distribution of Electrical Resistance, highlighting the effect of transformation on normality.

par(mfrow = c(1, 2), oma = c(0, 0, 3, 0))

# Original data plot
plot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
     main = "Original Data",
     xlab = "Film Thickness (nm)",
     ylab = "Electrical Resistance (mOhm)",
     col = "blue", pch = 16)

# Transformed data plot
plot(data$Film_Thickness_nm, data$new_Electrical_Resistance,
     main = "Transformed Data",
     xlab = "Film Thickness (nm)",
     ylab = "Transformed Electrical Resistance",
     col = "red", pch = 16)

mtext("Comparison of Original and Transformed Data", outer = TRUE, cex = 1.5, font = 2)

# HISTOGRAMS
par(mfrow = c(1, 2), mar = c(5, 4, 4, 3) + 0.1, oma = c(0, 0, 3, 0))

# Histogram 1: BEFORE
hist_resistance <- hist(data$Electrical_Resistance_mOhm, 
     main = "Electrical Resistance", 
     xlab = "Electrical Resistance (mOhm)", 
     col = "lightblue", border = "black",
     breaks = 10, prob = TRUE)

# Add density curve (outline)
lines(density(data$Electrical_Resistance_mOhm), col = "red", lwd = 2)

# Histogram 2:  AFTER
hist_transformed <- hist(data$new_Electrical_Resistance, 
     main = "Transformed Electrical Resistance", 
     xlab = "Transformed Electrical Resistance", 
     col = "lightgreen", border = "black",
     breaks = 10, prob = TRUE)

# Add density curve
lines(density(data$new_Electrical_Resistance), col = "red", lwd = 2)

mtext("Comparison of Original and Transformed Data", outer = TRUE, cex = 1.5, font = 2)

Observations:

  • The scatter plot of the original data shows a strong positive correlation between Film Thickness and Electrical Resistance. After transformation, the trend is reversed: because \(\lambda \approx -0.81\) is negative, the power transformation maps larger resistance values to smaller transformed values.
  • The histogram of the transformed Electrical Resistance suggests a more symmetric distribution, reducing skewness and improving normality, which is important for meeting regression assumptions.
  • Overall, the transformation effectively improves normality, making the dataset more suitable for linear regression analysis.


FITTING THE TRANSFORMED MODEL

After applying the Box-Cox transformation to stabilize variance and improve normality, a new linear regression model is fitted using the transformed response variable (new_Electrical_Resistance).

model2 = lm(new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
summary(model2)
## 
## Call:
## lm(formula = new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0111971 -0.0019573 -0.0002081  0.0025100  0.0093906 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.629e-01  1.342e-03  121.44   <2e-16 ***
## Film_Thickness_nm -5.932e-04  1.323e-05  -44.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003915 on 98 degrees of freedom
## Multiple R-squared:  0.9536, Adjusted R-squared:  0.9531 
## F-statistic:  2012 on 1 and 98 DF,  p-value: < 2.2e-16

Observations:

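  • Model Fit: Adjusted \(R^2\) rose from 0.925 to 0.953. Because the two models use different response scales, this comparison is indicative rather than exact.
  • Residual Error: The residual standard error is 0.0039 for the transformed model versus 1.041 for the original. These values are expressed in different units (transformed scale vs. mOhm) and cannot be compared directly; the improvement in variance behavior is judged from the residual plots.
  • Coefficient Adjustments: The slope changed from 0.1230 to -0.0005932, with the sign change reflecting the negative power transformation, while significance remained strong (\(p < 2.2 \times 10^{-16}\)).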
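
One way to compare the two models on equal terms is to back-transform the fitted values of the transformed model to mOhm and compute the same error measure for both. The sketch below is an illustration under assumptions rather than part of the original analysis: it uses the six rows printed by head(data), the \(\lambda = -0.81\) reported above, and a hypothetical helper named rmse.

```r
# Six rows shown by head(data); an illustrative sample, not the full dataset
x <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
y <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)
lambda <- -0.81

fit_raw <- lm(y ~ x)             # original scale
fit_pow <- lm(I(y^lambda) ~ x)   # power-transformed response

# Fitted values in mOhm; the power transform is inverted with 1 / lambda
pred_raw <- fitted(fit_raw)
pred_pow <- fitted(fit_pow)^(1 / lambda)

# Hypothetical helper: root mean squared error on the original scale
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

errors <- c(raw = rmse(y, pred_raw), transformed = rmse(y, pred_pow))
errors
```

Both entries of errors are in mOhm, so they can be compared directly, unlike the two residual standard errors in the model summaries.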


EVALUATING ASSUMPTIONS POST-TRANSFORMATION

After fitting the transformed model, it is important to reassess whether the assumptions of linear regression are now better satisfied. In particular, we verify constant variance (homoscedasticity) by comparing the Residuals vs. Fitted plots before and after transformation.

# ASSUMPTION REVALIDATION
# Constant variance verification (Residuals vs. Fitted Plot)
# ORIGINAL MODEL
plot(model, 1, main = "Original Model")

# ASSUMPTION REVALIDATION
# Constant variance verification (Residuals vs. Fitted Plot)
# TRANSFORMED MODEL
plot(model2, 1, main = "Transformed Model")

Observations:

  • Original Model: Residuals show a curved pattern, indicating non-constant variance and potential model misspecification.
  • Transformed Model: Residuals are more randomly scattered, suggesting improved homoscedasticity and linearity.
  • The Box-Cox transformation effectively stabilized variance and improved model assumptions.



4 INTERVALS

In linear regression analysis, confidence and prediction intervals quantify the uncertainty associated with estimating the response variable (\(Y\)) at a given predictor value (\(X\)). These intervals are important for understanding the reliability of predictions and assessing model performance.


CONFIDENCE INTERVAL ON THE EXPECTED RESPONSE

The confidence interval (CI) provides a range in which the mean response \(E[Y \mid X = x_j]\) is expected to fall for a given \(x_j\), at a specified confidence level (95%). It accounts for the uncertainty in estimating the population mean response and is given by:

\[ \left( \hat{\beta}_0 + \hat{\beta}_1 x_j \right) \pm t_{1-\alpha/2, n-2} \left( \sqrt{ MSE \left[ \frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right] } \right) \]

Where:

  • \(\hat{\beta}_0 \text{ and } \hat{\beta}_1 \text{ are the estimated regression coefficients.}\)
  • \(t_{1-\alpha/2, n-2} \text{ is the critical value from the t-distribution with } n - 2 \text{ degrees of freedom}\)
  • \(MSE \text{ is the mean squared error of the model}\)
  • \(n \text{ is the sample size.}\)
  • \(\bar{x} \text{ is the mean of the predictor variable}\)


PREDICTION INTERVAL ON AN INDIVIDUAL RESPONSE

The prediction interval (PI) estimates the range where an individual future observation \(Y_j\) is likely to fall for a given \(x_j\). Unlike the confidence interval, the prediction interval incorporates both the variability in estimating the mean response and the inherent variability of individual observations. The equation is:

\[ \left( \hat{\beta}_0 + \hat{\beta}_1 x_j \right) \pm t_{1-\alpha/2, n-2} \left( \sqrt{ MSE \left[ 1 + \frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right] } \right) \]


IMPLEMENTATION OF INTERVALS

Before computing confidence and prediction intervals, it is necessary to define a range of predictor values (\(X\)) over which these intervals will be evaluated. In the R implementation, this is achieved using:

x <- data$Film_Thickness_nm 
y <- data$new_Electrical_Resistance
newx <- seq(min(x), max(x), 0.01)

This command generates a sequence of \(X\) values ranging from the minimum to the maximum observed film thickness in small increments (0.01). By defining newx, we obtain a continuous set of predictor values for which we can compute interval estimates, allowing for a smooth visual representation of the fitted regression model. Each value of newx is substituted into the regression equation:

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_j \]

where \(X_j\) represents each value in “newx”. The corresponding confidence and prediction intervals are then calculated for each point \(X_j\).

The predict() function automates these calculations based on the fitted regression model. Internally, it performs the following steps:

  • Uses the Estimated Regression Coefficients (\((\hat{\beta}_0 , \hat{\beta}_1)\)): Obtained from the “lm()” model fit.
  • Computes the Mean Squared Error (MSE): Estimated as:

\[ MSE = \frac{SSE}{n-2} \]

  • Calculates the Standard Error for Each \(X_j\): Based on the variance formula in the interval equations.

  • Uses the t-Distribution Critical Value: Extracts \(t_{1-\alpha/2, n-2}\) to determine the interval width.

  • Generates Interval Estimates:

    • Confidence Intervals (interval = "confidence") estimate the expected mean response at each predictor value.

    • Prediction Intervals (interval = "prediction") account for the variability in individual observations.

  • Each row in the computed confidence and prediction intervals contains:

    • Estimated Response (\(\hat{Y}\)): The predicted value of the dependent variable at a given predictor value.

    • Lower Bound: The lower limit of the corresponding interval, representing the minimum plausible value within the specified confidence level.

    • Upper Bound: The upper limit of the corresponding interval, representing the maximum plausible value within the specified confidence level.
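
The steps above can be reproduced by hand. The following is a minimal sketch, assuming simulated stand-in data (the vectors x and y below are generated for illustration and are not the study's dataset). It computes both intervals at a single point from the MSE, the standard errors, and the t critical value, then checks the result against predict().

```r
# Simulated stand-in data (hypothetical; not the study's dataset)
set.seed(1)
x <- runif(50, min = 50, max = 150)            # film thickness (nm)
y <- 0.20 - 0.001 * x + rnorm(50, sd = 0.004)  # transformed resistance

fit  <- lm(y ~ x)
n    <- length(x)
x0   <- 100                                    # point of interest
b    <- coef(fit)
yhat <- unname(b[1] + b[2] * x0)               # fitted value at x0

mse   <- sum(residuals(fit)^2) / (n - 2)       # MSE = SSE / (n - 2)
sxx   <- sum((x - mean(x))^2)
tcrit <- qt(1 - 0.05 / 2, df = n - 2)          # t_{1 - alpha/2, n - 2}

# Standard errors for the mean response and for a new observation
se_mean <- sqrt(mse * (1 / n + (x0 - mean(x))^2 / sxx))
se_pred <- sqrt(mse * (1 + 1 / n + (x0 - mean(x))^2 / sxx))

ci_manual <- c(fit = yhat, lwr = yhat - tcrit * se_mean, upr = yhat + tcrit * se_mean)
pi_manual <- c(fit = yhat, lwr = yhat - tcrit * se_pred, upr = yhat + tcrit * se_pred)

# Same quantities from predict()
ci_predict <- predict(fit, data.frame(x = x0), interval = "confidence", level = 0.95)
pi_predict <- predict(fit, data.frame(x = x0), interval = "prediction", level = 0.95)

# Stop with an error if the manual and predict() results disagree
stopifnot(isTRUE(all.equal(unname(ci_manual), as.numeric(ci_predict))))
stopifnot(isTRUE(all.equal(unname(pi_manual), as.numeric(pi_predict))))
```

The only difference between the two standard errors is the extra 1 inside the square root, which is why the prediction interval is always wider than the confidence interval at the same point.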


CONFIDENCE AND PREDICTION INTERVALS

This analysis models the relationship between film thickness (nm) and electrical resistance using linear regression, where the red line represents the fitted model. The response variable, electrical resistance, is used in its transformed form to improve linearity and model performance. To assess the reliability of predictions, 95% confidence intervals (green, dashed) indicate where the true mean response is expected to lie, while 95% prediction intervals (black, dashed) show the range for individual future observations.

# CONFIDENCE AND PREDICTION INTERVALS

model <- lm(y ~ x)

conf <- predict(model, data.frame(x=newx), interval="confidence", level=0.95)
pred <- predict(model, data.frame(x=newx), interval="prediction", level=0.95)

plot(x, y, main = "Linear Regression with Confidence and Prediction Intervals",
     xlab = "Film Thickness (nm)", 
     ylab = "Transformed Electrical Resistance",
     pch = 16, col = "blue") 

abline(model, col="red", lwd=2)

# Add confidence interval lines
lines(newx, conf[,2], col="darkgreen", lty=2, lwd=2)  # Lower CI limit
lines(newx, conf[,3], col="darkgreen", lty=2, lwd=2)  # Upper CI limit

# Add prediction interval lines
lines(newx, pred[,2], col="black", lty=2, lwd=2)  # Lower PI limit
lines(newx, pred[,3], col="black", lty=2, lwd=2)  # Upper PI limit


legend("topright", legend=c("Regression Line", "Confidence Interval", "Prediction Interval"),
       col=c("red", "darkgreen", "black"), lty=c(1, 2, 2), lwd=c(2, 2, 2), bty="n")


ESTIMATED RESISTANCE AND VARIABILITY AT 100 NM

In semiconductor manufacturing, target film thickness depends on the application and the material being deposited. For conductive metal layers, typical thicknesses range from 50 nm to 150 nm, and 100 nm is frequently monitored as the midpoint of the standard process range and as a quality-control reference. This analysis estimates the electrical resistance at 100 nm from the fitted linear regression model and applies the inverse transformation (raising the fitted value to the power \(1/\lambda\)) to express the result in mOhm. To assess uncertainty, a 95% confidence interval (CI) gives the range in which the true mean resistance is expected to lie, while a 95% prediction interval (PI) also accounts for the variability of individual observations. Because \(\lambda\) is negative, the power transformation is decreasing, so the upper limit on the transformed scale maps to the lower limit in mOhm, and vice versa. These estimates help engineers evaluate the expected variation in resistance at this critical thickness and judge process stability and compliance with design specifications. A scatterplot with the fitted regression line and both intervals then shows the model’s predictions and their uncertainty.

X0 <- data.frame(x = 100)  # specific value for Film Thickness (100 nm)

ci_100 <- predict(model, X0, interval="confidence", level=0.95)  # Confidence interval
pi_100 <- predict(model, X0, interval="prediction", level=0.95)  # Prediction interval

# Results: back-transform to mOhm. Because lambda < 0 the power transformation is
# decreasing, so the upper limit on the transformed scale ([3]) gives the lower limit in mOhm.
cat("Prediction and Interval Estimates:\n",
  "-> Predicted resistance at 100 nm:", (ci_100[1])^(1/lambda), "mOhm\n",
  "-> Confidence Interval (95%):", (ci_100[3])^(1/lambda), "-", (ci_100[2])^(1/lambda), "mOhm\n",
  "-> Prediction Interval (95%):", (pi_100[3])^(1/lambda), "-", (pi_100[2])^(1/lambda), "mOhm\n"
)
## Prediction and Interval Estimates:
##  -> Predicted resistance at 100 nm: 16.43143 mOhm
##  -> Confidence Interval (95%): 16.2798 - 16.58563 mOhm
##  -> Prediction Interval (95%): 15.02146 - 18.10062 mOhm
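
Since \(\lambda\) is negative, the transformation \(y^{\lambda}\) reverses order: a larger resistance corresponds to a smaller transformed value. The sketch below illustrates this with the confidence limits reported above. The helper back_transform() is a hypothetical name introduced only for this example; it is not part of the original script.

```r
lambda <- -0.81  # exponent used for the transformation in this report

# Hypothetical helper: invert the power transformation y_new = y^lambda
back_transform <- function(y_new, lambda) y_new^(1 / lambda)

# Confidence limits in mOhm as reported above, moved to the transformed scale
limits_mohm  <- c(16.2798, 16.58563)
limits_trans <- limits_mohm^lambda

# On the transformed scale the order is reversed:
# the larger resistance has the smaller transformed value
order_reversed <- limits_trans[1] > limits_trans[2]

# Take the (lwr, upr) pair on the transformed scale, back-transform it,
# and sort so the limits are reported in ascending order in mOhm
lwr_upr_trans <- sort(limits_trans)
limits_back   <- sort(back_transform(lwr_upr_trans, lambda))

print(order_reversed)
print(limits_back)
```

Sorting after the back-transformation keeps the reported interval in ascending order whatever the sign of \(\lambda\).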

The next plot zooms in on the 80 to 120 nm region around the 100 nm prediction. It shows the regression line with its confidence and prediction intervals, illustrating the expected resistance variation and the model’s predictive uncertainty in this critical region.

# Limits for the zoom
zoom_x_min <- 80   
zoom_x_max <- 120  
zoom_y_min <- min(conf[,2])  
zoom_y_max <- max(conf[,3])

# Filter data within the zoom range
x_zoom <- x[x >= zoom_x_min & x <= zoom_x_max]
y_zoom <- y[x >= zoom_x_min & x <= zoom_x_max]

# Define new x values within the zoom range
newx_zoom <- seq(zoom_x_min, zoom_x_max, 0.01)

# Compute confidence and prediction intervals within the zoom range
conf_zoom <- predict(model, data.frame(x=newx_zoom), interval="confidence", level=0.95)
pred_zoom <- predict(model, data.frame(x=newx_zoom), interval="prediction", level=0.95)

# Plot the zoomed-in region
plot(x_zoom, y_zoom, main="Detailed Zoom on Linear Regression",
     xlab="Film Thickness (nm)", ylab="Transformed Electrical Resistance",
     pch=16, col="blue", cex=1.4, 
     xlim=c(zoom_x_min, zoom_x_max), ylim=c(zoom_y_min, zoom_y_max))  

abline(model, col="red", lwd=2) 

# Plot the intervals
lines(newx_zoom, conf_zoom[,2], col="darkgreen", lty=2, lwd=2)  
lines(newx_zoom, conf_zoom[,3], col="darkgreen", lty=2, lwd=2)  

lines(newx_zoom, pred_zoom[,2], col="black", lty=2, lwd=2)  
lines(newx_zoom, pred_zoom[,3], col="black", lty=2, lwd=2)  

# Prediction point at X0 = 100 nm
points(100, ci_100[1], col="black", pch=18, cex=2, lwd=3)  
text(100, ci_100[1] - 0.004, labels="100 nm", col="black", cex=1.2, font=2)

legend("topright", legend=c("Regression Line", "Confidence Interval", "Prediction Interval"),
       col=c("red", "darkgreen", "black"), lty=c(1, 2, 2), lwd=c(2, 2, 2), bty="n")

5 CONCLUSIONS

Data Quality & Distribution:

  • The dataset contains 100 observations with no missing values.
  • Film Thickness is symmetrically distributed, while Electrical Resistance shows slight right-skewness.

EDA Findings:

  • Strong correlation (0.96) between Film Thickness and Electrical Resistance.
  • Electrical Resistance skewness suggests a need for transformation.

Regression Model Performance:

  • High explanatory power (\(R^2=0.9257\)), indicating a strong relationship.
  • Residual analysis detected non-constant variance (heteroscedasticity).

Model Refinement with Transformation:

  • Box-Cox transformation (\(\lambda = -0.81\)) improved normality and variance stability.
  • Adjusted \(R^2\) increased to 0.953, reducing residual error.
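
The script sets \(\lambda = -0.81\) as a constant after calling boxcox(). As a minimal sketch, assuming simulated stand-in data (thickness and resistance below are generated for illustration and are not the study's dataset), the same kind of value can be read directly from the grid that MASS::boxcox() returns, together with an approximate 95% interval for \(\lambda\).

```r
library(MASS)  # provides boxcox()

# Simulated stand-in data (hypothetical; not the study's dataset)
set.seed(42)
thickness  <- runif(100, min = 50, max = 150)
y_trans    <- 0.15 - 0.0005 * thickness + rnorm(100, sd = 0.002)
resistance <- y_trans^(1 / -0.8)  # positive response built with a known exponent

# Evaluate the profile log-likelihood on a fine grid without plotting
bc <- boxcox(resistance ~ thickness, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Grid value that maximizes the log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: grid values within qchisq(0.95, 1) / 2 of the maximum
in_ci     <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

cat("lambda_hat:", lambda_hat, " 95% interval:", lambda_ci[1], "to", lambda_ci[2], "\n")
```

Reading the maximum from the returned grid avoids transcribing a value from the plot, and the interval shows how sharply the data determine \(\lambda\).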

Prediction & Variability at 100 nm:

  • Predicted resistance: 16.43 mOhm.
  • 95% Confidence Interval: 16.28 – 16.59 mOhm (mean resistance estimate).
  • 95% Prediction Interval: 15.02 – 18.10 mOhm (expected range for individual measurements).

Strength of the Relationship Between Film Thickness and Electrical Resistance:

  • The high correlation (0.96) and model significance (\(p < 2.2 \times 10^{-16}\)) confirm a strong linear dependency.
  • After transformation, Film Thickness explains 95.3% of the variation in transformed Electrical Resistance, indicating that resistance tracks thickness closely.

Predictive Power of the Model:

  • The model provides accurate resistance predictions based on thickness, with a tight 95% confidence interval at 100 nm (about 0.31 mOhm wide), making it a reliable tool for estimating electrical properties.
  • The prediction interval suggests a controlled, but still present, variability in the deposition process, highlighting the importance of maintaining thickness consistency.

Implications for Process Control & Quality Improvement:

  • Precise thickness control is essential since small deviations significantly impact resistance, reinforcing the need for strict manufacturing tolerances.
  • The model can be used for real-time monitoring, allowing manufacturers to predict resistance and adjust deposition parameters proactively.
  • The defined resistance range at 100 nm (15.02 – 18.10 mOhm) can serve as a reference for quality assurance and defect reduction in semiconductor fabrication.
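
As an illustration of the quality-assurance use described above, the following minimal sketch flags measurements that fall outside the 95% prediction interval at 100 nm. The function outside_pi() and the measured values are hypothetical and serve only as an example; the interval limits are the ones reported in this study.

```r
# 95% prediction interval at 100 nm reported in this study (mOhm)
pi_lower <- 15.02
pi_upper <- 18.10

# Hypothetical helper: TRUE when a measurement falls outside the interval
outside_pi <- function(resistance_mohm, lower = pi_lower, upper = pi_upper) {
  resistance_mohm < lower | resistance_mohm > upper
}

# Hypothetical measurements taken on films deposited at 100 nm
measured <- c(15.8, 16.4, 18.6, 14.9, 17.2)

# Measurements that warrant a closer look
flagged <- measured[outside_pi(measured)]
print(flagged)
```

By construction, about 5% of measurements from a stable process are expected to fall outside a 95% prediction interval, so an isolated flag is a prompt for review rather than proof of a defect.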


6 COMPLETE R-CODE

# Libraries
library(dplyr)
library(tidyr)
library(MASS)
library(psych)
library(ggcorrplot)
library(ggplot2)
# SECTION 1: EXPLORATORY DATA ANALYSIS (EDA) ----------------------------------------------------------
# DATA LOADING AND PREPROCESSING
# Create the dataframe
# Raw GitHub URL
url <- "https://raw.githubusercontent.com/JairoRodriguezB/Datasets/main/semiconductor_SLR_dataset.csv"

# Read the CSV file
data <- read.csv(url)
head(data)

# Summary statistics for numerical variables
summary(data)  
# Overview of the dataset structure
glimpse(data)   
# Count the number of missing values (NA) in each column
colSums(is.na(data))

# INDIVIDUAL BOXPLOTS (VISUAL COMPARISON)
par(mfrow = c(1, 2), mar = c(5, 3, 4, 0.5) + 0.1, oma = c(2, 0, 2, 0)) 
# Boxplot 1: Electrical Resistance
boxplot(data$Electrical_Resistance_mOhm,
        main = "Boxplot of Electrical Resistance",
        xlab = "Electrical Resistance (mOhm)",
        col = "lightblue", border = "black",
        horizontal = TRUE)
# Boxplot 2: Film Thickness
boxplot(data$Film_Thickness_nm,
        main = "Boxplot of Film Thickness",
        xlab = "Film Thickness (nm)",
        col = "lightgreen", border = "black",
        horizontal = TRUE)
mtext("Comparison of Electrical Resistance and Film Thickness", 
      outer = TRUE, cex = 1.5, font = 2)

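# COMPARATIVE BOXPLOTS
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1, oma = c(0, 0, 0, 0))  # reset to a single panel with default margins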
boxplot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
        names = c("Film Thickness", "Electrical Resistance"),
        main = "Boxplots of Film Thickness and Electrical Resistance",
        col = c("lightblue", "lightgreen"), border = "black")


# HISTOGRAMS
par(mfrow = c(1, 2), mar = c(5, 4, 4, 3) + 0.1, oma = c(0, 0, 3, 0))
# Histogram 1: Electrical Resistance
hist_resistance <- hist(data$Electrical_Resistance_mOhm, 
     main = "Histogram of Electrical Resistance", 
     xlab = "Electrical Resistance (mOhm)", 
     col = "lightblue", border = "black",
     breaks = 10, prob = TRUE)
# Add density curve (outline)
lines(density(data$Electrical_Resistance_mOhm), col = "red", lwd = 2)
# Histogram 2:  Film Thickness
hist_thickness <- hist(data$Film_Thickness_nm, 
     main = "Histogram of Film Thickness", 
     xlab = "Film Thickness (nm)", 
     col = "lightgreen", border = "black",
     breaks = 10, prob = TRUE)
# Add density curve
lines(density(data$Film_Thickness_nm), col = "red", lwd = 2)

mtext("Distribution of Electrical Resistance and Film Thickness", 
      outer = TRUE, cex = 1.5, font = 2)

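# RELATIONSHIP ANALYSIS
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1, oma = c(0, 0, 0, 0))  # reset to a single panel with default margins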
# Scatter Plot
plot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
     main = "Scatter Plot: Film Thickness vs Electrical Resistance",
     xlab = "Film Thickness (nm)",
     ylab = "Electrical Resistance (mOhm)",
     pch = 20)

# Correlation Matrix
cor_matrix <- cor(data)
ggcorrplot(cor_matrix, 
           lab = TRUE,               
           colors = c("red", "white", "#4A90E2"), 
           outline.color = "black",   
           show.legend = TRUE)
# -----------------------------------------------------------------------------------------------------




# REGRESSION MODEL FITTING AND ASSUMPTION CHECKING-----------------------------------------------------

# Linear Regression
model <- lm(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = data)
summary(model)

# NORMALITY VERIFICATION
# Normality verification (QQ residuals Plot)
plot(model,2)
# Constant variance verification (Residuals vs. Fitted Plot)
plot(model,1)

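# TRANSFORMING DATASET
# Profile the Box-Cox log-likelihood to choose lambda
boxcox(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = data)
lambda <- -0.81
# Simplified power transform y^lambda (not the full (y^lambda - 1)/lambda form)
data$new_Electrical_Resistance <- data$Electrical_Resistance_mOhm^lambda
# Re-check: after transformation the profile should peak near lambda = 1
boxcox(new_Electrical_Resistance ~ Film_Thickness_nm, data = data)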

# VISUALIZING DATA BEFORE AND AFTER TRANSFORMATION
par(mfrow = c(1, 2), oma = c(0, 0, 3, 0))
# Original data plot
plot(data$Film_Thickness_nm, data$Electrical_Resistance_mOhm,
     main = "Original Data",
     xlab = "Film Thickness (nm)",
     ylab = "Electrical Resistance (mOhm)",
     col = "blue", pch = 16)
# Transformed data plot
plot(data$Film_Thickness_nm, data$new_Electrical_Resistance,
     main = "Transformed Data",
     xlab = "Film Thickness (nm)",
     ylab = "Transformed Electrical Resistance",
     col = "red", pch = 16)
mtext("Comparison of Original and Transformed Data", outer = TRUE, cex = 1.5, font = 2)

# HISTOGRAMS
par(mfrow = c(1, 2), mar = c(5, 4, 4, 3) + 0.1, oma = c(0, 0, 3, 0))
# Histogram 1: BEFORE
hist_resistance <- hist(data$Electrical_Resistance_mOhm, 
     main = "Electrical Resistance", 
     xlab = "Electrical Resistance (mOhm)", 
     col = "lightblue", border = "black",
     breaks = 10, prob = TRUE)
# Add density curve (outline)
lines(density(data$Electrical_Resistance_mOhm), col = "red", lwd = 2)
# Histogram 2:  AFTER
hist_transformed <- hist(data$new_Electrical_Resistance, 
     main = "Transformed Electrical Resistance", 
     xlab = "Transformed Electrical Resistance", 
     col = "lightgreen", border = "black",
     breaks = 10, prob = TRUE)
# Add density curve
lines(density(data$new_Electrical_Resistance), col = "red", lwd = 2)
mtext("Comparison of Original and Transformed Data", outer = TRUE, cex = 1.5, font = 2)

# FITTING THE TRANSFORMED MODEL
model2 <- lm(new_Electrical_Resistance ~ Film_Thickness_nm, data = data)
summary(model2)

# EVALUATING ASSUMPTIONS POST-TRANSFORMATION
# Constant variance verification (Residuals vs. Fitted Plot)
# ORIGINAL MODEL
plot(model, 1, main = "Original Model")
# Constant variance verification (Residuals vs. Fitted Plot)
# TRANSFORMED MODEL
plot(model2, 1, main = "Transformed Model")
# -----------------------------------------------------------------------------------------------------




# INTERVALS -------------------------------------------------------------------------------------------
# IMPLEMENTATION OF INTERVALS
x <- data$Film_Thickness_nm 
y <- data$new_Electrical_Resistance
newx <- seq(min(x), max(x), 0.01)

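# CONFIDENCE AND PREDICTION INTERVALS
# Refit on the transformed response; this overwrites the original-scale `model` (same fit as model2)
model <- lm(y ~ x)
conf <- predict(model, data.frame(x=newx), interval="confidence", level=0.95)
pred <- predict(model, data.frame(x=newx), interval="prediction", level=0.95)
# Reset the earlier two-panel layout and margins before drawing a single plot
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1, oma = c(0, 0, 0, 0), pin = c(6, 4))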
plot(x, y, main = "Linear Regression with Confidence and Prediction Intervals",
     xlab = "Film Thickness (nm)", 
     ylab = "Transformed Electrical Resistance",
     pch = 16, col = "blue") 
abline(model, col="red", lwd=2)
# Add confidence interval lines
lines(newx, conf[,2], col="darkgreen", lty=2, lwd=2)  # Lower CI limit
lines(newx, conf[,3], col="darkgreen", lty=2, lwd=2)  # Upper CI limit
# Add prediction interval lines
lines(newx, pred[,2], col="black", lty=2, lwd=2)  # Lower PI limit
lines(newx, pred[,3], col="black", lty=2, lwd=2)  # Upper PI limit
legend("topright", legend=c("Regression Line", "Confidence Interval", "Prediction Interval"),
       col=c("red", "darkgreen", "black"), lty=c(1, 2, 2), lwd=c(2, 2, 2), bty="n")

X0 <- data.frame(x = 100)  # specific value for Film Thickness (100 nm)
ci_100 <- predict(model, X0, interval="confidence", level=0.95)  # Confidence interval
pi_100 <- predict(model, X0, interval="prediction", level=0.95)  # Prediction interval
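# Results: back-transform to mOhm. Because lambda < 0 the power transformation is
# decreasing, so the upper limit on the transformed scale ([3]) gives the lower limit in mOhm.
cat("Prediction and Interval Estimates:\n",
  "-> Predicted resistance at 100 nm:", (ci_100[1])^(1/lambda), "mOhm\n",
  "-> Confidence Interval (95%):", (ci_100[3])^(1/lambda), "-", (ci_100[2])^(1/lambda), "mOhm\n",
  "-> Prediction Interval (95%):", (pi_100[3])^(1/lambda), "-", (pi_100[2])^(1/lambda), "mOhm\n"
)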

# Limits for the zoom
zoom_x_min <- 80   
zoom_x_max <- 120  
zoom_y_min <- min(conf[,2])  
zoom_y_max <- max(conf[,3])
# Filter data within the zoom range
x_zoom <- x[x >= zoom_x_min & x <= zoom_x_max]
y_zoom <- y[x >= zoom_x_min & x <= zoom_x_max]
# Define new x values within the zoom range
newx_zoom <- seq(zoom_x_min, zoom_x_max, 0.01)
# Compute confidence and prediction intervals within the zoom range
conf_zoom <- predict(model, data.frame(x=newx_zoom), interval="confidence", level=0.95)
pred_zoom <- predict(model, data.frame(x=newx_zoom), interval="prediction", level=0.95)
# Plot the zoomed-in region
plot(x_zoom, y_zoom, main="Detailed Zoom on Linear Regression",
     xlab="Film Thickness (nm)", ylab="Transformed Electrical Resistance",
     pch=16, col="blue", cex=1.4, 
     xlim=c(zoom_x_min, zoom_x_max), ylim=c(zoom_y_min, zoom_y_max))  
abline(model, col="red", lwd=2) 
# Plot the intervals
lines(newx_zoom, conf_zoom[,2], col="darkgreen", lty=2, lwd=2)  
lines(newx_zoom, conf_zoom[,3], col="darkgreen", lty=2, lwd=2)  
lines(newx_zoom, pred_zoom[,2], col="black", lty=2, lwd=2)  
lines(newx_zoom, pred_zoom[,3], col="black", lty=2, lwd=2)  
# Prediction point at X0 = 100 nm
points(100, ci_100[1], col="black", pch=18, cex=2, lwd=3)  
text(100, ci_100[1] - 0.004, labels="100 nm", col="black", cex=1.2, font=2)
legend("topright", legend=c("Regression Line", "Confidence Interval", "Prediction Interval"),
       col=c("red", "darkgreen", "black"), lty=c(1, 2, 2), lwd=c(2, 2, 2), bty="n")

# -----------------------------------------------------------------------------------------------------