Semiconductor manufacturing is an incredibly precise operation where even the tiniest variations can significantly impact the viability, performance, and properties of the wafers. One critical step in fabricating these devices is the deposition of thin films on the silicon wafers.
Engineers must continuously monitor and control film thickness to ensure that resistance remains within an acceptable range. If resistance is found to vary unpredictably with thickness, it may indicate future problems. This case study determines whether variations in film thickness are a strong predictor of electrical resistance. The correlation between the variables, the strength and nature of their relationship, and probable applications are studied further.
In this project, I was given realistic data collected from a wafer fabrication process, where film thickness (\(X\)) and electrical resistance (\(Y\)) have been measured. From this data, I was asked to determine whether a linear relationship exists between these two variables by fitting a simple linear regression model.
Before applying regression modeling to the given data, its distribution and reliability need to be assessed and validated. Understanding the variability, normality, accuracy, and potential deviations in the data set helps confirm the key statistical assumptions.
For this case, Electrical Resistance and its relationship to Film Thickness are the focus of the data analysis. Visualizing the range, distribution, uniformity, and possible deviations gives a better understanding of that relationship. Moreover, understanding the relationship and its deficiencies allows the data to be prepared so that the model is presented at its best.
In order to summarize the data set and begin to examine its properties and relationship, an exploratory data analysis (EDA) was performed. This process began by reading the data via the read.table command and displaying the first rows with head() as a confirmation check, followed by summarizing the data’s statistical characteristics.
df <- read.table("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/semiconductor_SLR_dataset.csv", header = TRUE, sep = ",")
head(df)
## Film_Thickness_nm Electrical_Resistance_mOhm
## 1 87.45 15.118
## 2 145.07 23.601
## 3 123.20 19.904
## 4 109.87 16.103
## 5 65.60 12.901
## 6 65.60 13.278
summary(df)
## Film_Thickness_nm Electrical_Resistance_mOhm
## Min. : 50.55 Min. :11.68
## 1st Qu.: 69.32 1st Qu.:13.60
## Median : 96.42 Median :15.74
## Mean : 97.02 Mean :16.80
## 3rd Qu.:123.02 3rd Qu.:19.82
## Max. :148.69 Max. :25.74
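Before plotting, a few quick sanity checks can complement head() and summary(). The sketch below is illustrative only: it rebuilds the six rows printed by head(df) above as a small data frame (the name df6 is introduced here and is not part of the original analysis) and counts missing values and repeated thickness readings.

```r
# Six rows copied from the head(df) output above (illustrative subset only)
df6 <- data.frame(
  Film_Thickness_nm = c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60),
  Electrical_Resistance_mOhm = c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)
)

# Column types and dimensions
str(df6)

# Number of missing values per column
n_missing <- colSums(is.na(df6))

# Number of repeated thickness readings (65.60 appears twice in these rows)
n_dup_x <- sum(duplicated(df6$Film_Thickness_nm))

# Sample standard deviation of each column
spread <- sapply(df6, sd)

print(n_missing)
print(n_dup_x)
print(spread)
```

On the full data set the same calls would be applied to df directly.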
Understanding the distribution of the data is crucial because it reveals underlying patterns, guides statistical analysis and model selection, and helps in making more accurate and reliable interpretations and predictions. Given that our primary focus is on Electrical Resistance (\(Y\)), a histogram and a box plot are constructed to show the distribution and variability of the data.
par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2,0,2,0))
hist(df$Electrical_Resistance_mOhm,
main = "Histogram of Resistance",
xlab = "Electrical Resistance (mOhm)",
col ="lightblue",
border = 'darkblue')
boxplot(df$Electrical_Resistance_mOhm,
main = "Boxplot of Resistance",
ylab = "Electrical Resistance (mOhm)",
col ="orange",
border = 'brown')
mtext("Distribution of Electrical Resistance", outer = TRUE, cex = 2, font = 4)
Observations:
Histogram - The tallest bars show that the data are concentrated at the lower end of the range, near the median of approximately 15, with a minimum and maximum of approximately 11 and 25, which is in line with the summary above. The concentration of data on the left, with a longer tail to the right, indicates a right-skewed (positively skewed) distribution.
Box Plot - The box plot displays the full extent of the data and how it is dispersed across the range. Because the upper quartile extends further than the lower one, the conclusion that the data is slightly right-skewed is upheld. Furthermore, the existence of the skew suggests that, before relying on the regression model, a transformation may be needed to improve its normality.
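The visual impression of right skew can be backed by a number. The sketch below computes the moment-based sample skewness by hand in base R and compares the mean with the median. It uses only the six resistance values printed by head(df), so it illustrates the calculation rather than reproducing the full-data result; the variable names are introduced here for illustration.

```r
# Six resistance values (mOhm) copied from the head(df) output
resistance <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

# Central moments
m  <- mean(resistance)
m2 <- mean((resistance - m)^2)   # second central moment
m3 <- mean((resistance - m)^3)   # third central moment

# Moment-based skewness: positive values indicate a longer right tail
skewness <- m3 / m2^(3 / 2)

# A mean above the median is another sign of right skew
mean_minus_median <- m - median(resistance)

print(skewness)
print(mean_minus_median)
```

The summary of the full data set points the same way, since its mean (16.80) exceeds its median (15.74).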
Having analyzed the individual data set, an investigation of the relationship and correlation between Film Thickness (X) and Electrical Resistance (Y) via a scatter plot will further the understanding and analysis of the data.
A scatter plot is used to visually assess the presence of a linear relationship between variables. The distribution of points not only depicts the overall trend but also helps identify potential deviations, outliers, or harmful patterns that could impact the regression model. Furthermore, a high correlation value signifies a strong linear association between the variables, which supports the linearity assumption of a linear regression model.
plot(df$Film_Thickness_nm,df$Electrical_Resistance_mOhm,
main = "Electrical Resistance vs. Film Thickness",
ylab = "Electrical Resistance (mOhm)",
xlab = "Film Thickness (nm)",
col ="darkgreen",
pch=19)
Observations: The roughly linear, upward-sloping pattern of the points expresses a strong positive relationship between electrical resistance and film thickness, meaning that as the thickness increases, the electrical resistance increases as well, as shown in the plot’s flow. The slight curvature of the point cloud is worth keeping in mind for the residual diagnostics. This supports applying a linear regression model.
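The strength of the association seen in the scatter plot can be quantified with the Pearson correlation coefficient. The following is a minimal sketch using cor() and cor.test() from base R on the six rows printed by head(df); the full data set would give a different value, and the variable names are chosen here for illustration.

```r
# Six (thickness, resistance) pairs copied from the head(df) output
thickness  <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
resistance <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

# Pearson correlation coefficient
r <- cor(thickness, resistance)

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(thickness, resistance)

# In simple linear regression, R-squared equals the squared correlation
r_squared_lm <- summary(lm(resistance ~ thickness))$r.squared

print(r)
print(ct$p.value)
print(c(r^2, r_squared_lm))
```

Because \(R^2 = r^2\) in simple linear regression, a correlation close to 1 in magnitude translates directly into a high coefficient of determination.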
This section focuses on applying a linear regression model to examine the relationship between Film Thickness (X) and Electrical Resistance (Y) using the Least Squares method. The model is implemented in R with the lm() function, which estimates the regression coefficients and assesses model performance. For the model to be valid, key assumptions, i.e. normality of residuals and constant variance (homoscedasticity), must be satisfied. These assumptions are evaluated using the diagnostic plots that R produces for the fitted model. If any violations are identified, a Box-Cox transformation will be performed on the response variable to stabilize variance and enhance normality. The transformed model is then reevaluated to ensure adherence to the regression assumptions, so as to improve the model’s accuracy, efficiency, and reliability.
While R calculates these values cleanly and without hassle, it is important to understand the mathematical reasoning behind the analysis of how Film Thickness (X) influences Electrical Resistance (Y).
This requires the use of the linear regression model:
\[ y= \beta_{0} + \beta_{1} x + \epsilon \]
where \(\beta_{0}\) is the intercept, \(\beta_{1}\) is the slope, \(\epsilon\) is the random error component, x is the predictor/regressor variable and y is the response variable. The regression coefficients \(\beta_{0}\) and \(\beta_{1}\) are unknown and need to be estimated using the Least Squares method. The estimates are obtained by requiring that the sum of the squares of the differences between the observations \(y_{i}\) and the straight line be a minimum.
Applied to the \(n\) observations, this gives the sample regression model:
\[ y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}, \quad i = 1, 2, \ldots, n \]
where the fitted value \(\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{i}\) represents the predicted electrical resistance for a given film thickness. Through the variability of the Electrical Resistance, aka Y, the coefficient of determination can be calculated:
\(SST=\sum (y_{i} - \overline{y})^2\) - Sum of Squares Total with \(n-1\) d.f.
\(SSE=\sum (y_{i} - \hat{y}_{i})^2\) - Sum of Squares Error with \(n-2\) d.f.
\(SSR=\sum (\hat{y}_{i} - \overline{y})^2\) - Sum of Squares Regression with \(1\) d.f.
\(SST= SSE + SSR\) - Partition of SST into SSR and SSE
\[ \Rightarrow R^2=\frac{SSR}{SST}, \; 0\leq R^2 \leq 1 \]
In simple linear regression, \(R^2\) equals the square of the correlation coefficient between X and Y. Accordingly, a higher \(R^2\) value indicates a stronger linear relationship between the variables, and therefore the model explains a greater portion of the variability in Y.
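To make the least squares estimates and the sum-of-squares partition concrete, the sketch below computes them by hand and checks them against lm(). It uses the standard closed-form estimates \(\hat{\beta}_{1} = S_{xy}/S_{xx}\) and \(\hat{\beta}_{0} = \overline{y} - \hat{\beta}_{1}\overline{x}\), which are not written out in the text above, and it runs on the six rows printed by head(df) only, so the numbers differ from the full-data fit.

```r
# Six (thickness, resistance) pairs copied from the head(df) output
x <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
y <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

# Closed-form least squares estimates
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1  <- Sxy / Sxx                 # slope
b0  <- mean(y) - b1 * mean(x)    # intercept

# Fitted values and the three sums of squares
y_hat <- b0 + b1 * x
SST <- sum((y - mean(y))^2)
SSE <- sum((y - y_hat)^2)
SSR <- sum((y_hat - mean(y))^2)

# Coefficient of determination
R2 <- SSR / SST

# Reference fit from lm() for comparison
fit <- lm(y ~ x)

print(c(b0 = b0, b1 = b1))
print(coef(fit))
print(c(SST = SST, SSE_plus_SSR = SSE + SSR, R2 = R2))
```

The hand-computed coefficients and \(R^2\) should agree with coef(fit) and summary(fit)$r.squared up to floating-point error.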
A simple linear regression model using least squares estimation is created with the lm( ) function by regressing Y on X (written Y ~ X), hence the code below. The Total Sum of Squares (SST), Regression Sum of Squares (SSR), and Error Sum of Squares (SSE) of the fit yield the coefficient of determination (\(R^2\)).
Though the visual assessment of the model is the primary one, the summary( ) command provides a numerical representation of key statistical outputs such as coefficient estimates, standard errors, etc., for a comprehensive picture of the model’s performance.
model = lm(df$Electrical_Resistance_mOhm ~ df$Film_Thickness_nm)
summary(model)
##
## Call:
## lm(formula = df$Electrical_Resistance_mOhm ~ df$Film_Thickness_nm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27640 -0.75508 -0.08631 0.70422 2.69671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.870489 0.356848 13.65 <2e-16 ***
## df$Film_Thickness_nm 0.122954 0.003518 34.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.041 on 98 degrees of freedom
## Multiple R-squared: 0.9257, Adjusted R-squared: 0.925
## F-statistic: 1221 on 1 and 98 DF, p-value: < 2.2e-16
Observations: An \(R^2\) of 0.9257 indicates that 92.57% of the variability in electrical resistance is explained by the regression on film thickness, suggesting that the model fits the data well. Furthermore, a p-value of <2.2e-16, which is far lower than the general threshold of 0.05, provides strong evidence against the null hypothesis of zero slope, so the linear term is statistically significant. Arguably the most important values, though, are the regression coefficients, which create the fitted linear regression equation:
\[ y= \beta_{0} + \beta_{1}x + \epsilon \; \Rightarrow \; \hat{y}=4.870+0.123x \]
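As a worked check of the summary output, the arithmetic below starts from the printed coefficient estimates and standard error and recovers the t value, the F-statistic, a 95% confidence interval for the slope, and the fitted resistance at 100 nm on the original (untransformed) scale. It is a sketch built only from the rounded numbers shown in the output above, so results agree with R’s own values only up to rounding.

```r
# Values copied from the summary(model) output above
b0       <- 4.870489   # intercept estimate
b1       <- 0.122954   # slope estimate
se_b1    <- 0.003518   # standard error of the slope
df_resid <- 98         # residual degrees of freedom

# t statistic for the slope and the equivalent F statistic (F = t^2 in simple regression)
t_b1   <- b1 / se_b1
F_stat <- t_b1^2

# 95% confidence interval for the slope: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci_b1  <- b1 + c(-1, 1) * t_crit * se_b1

# Fitted resistance (mOhm) at a thickness of 100 nm from the untransformed model
y_hat_100 <- b0 + b1 * 100

print(c(t = t_b1, F = F_stat))
print(ci_b1)
print(y_hat_100)
```

The ratio and its square reproduce, up to rounding, the t value (34.95) and F-statistic (1221) printed by summary(model), and the slope interval lies entirely above zero.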
Though these statistical values enhance the understanding of the linear regression model and whether it is acceptable, the diagnostic plots take precedence. As previously discussed, there are key assumptions that must be met in order to verify a model’s validity:
Residuals vs. Fitted Plot: To gauge whether the residuals exhibit a pattern of randomness, aka constant variance.
QQ Residuals Plot: To evaluate whether the residuals follow a normal distribution.
If either of these assumptions is unsatisfied, then the previous model may be considered invalid, as validity is based on both conditions being attained. Both conditions must hold because otherwise the model’s statistical calculations, predictions, and reliability could be compromised.
plot(model)
Observations:
Normality of Residuals: The Q-Q plot shows that the residuals follow a normal distribution with little deviation. Though a few points veer away, most points reside along the reference line, suggesting that normality is met.
Constant Variance (Homoscedasticity): The Residuals vs. Fitted plot reveals a curved pattern, whereas the desired trend is a flat line along the horizontal axis with evenly scattered points. This divergence from the expected outcome indicates non-constant variance.
Since constant variance is violated, the linear regression model cannot be accepted as valid. For further analysis to be completed, the variance must be stabilized. Therefore a transformation, specifically Box-Cox, is required to improve the reliability of the model.
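The visual diagnostics can be supplemented with simple numerical checks. The sketch below is not part of the original analysis: it simulates a hypothetical data set (the generating formula, seed, and variable names are assumptions made for illustration), then applies a Shapiro-Wilk test to the residuals, a studentized Breusch-Pagan-style check built from an auxiliary regression of the squared residuals, and an F test for an added quadratic term as a check for curvature.

```r
# Hypothetical data: resistance grows multiplicatively with thickness (assumption)
set.seed(1)
thickness  <- runif(100, min = 50, max = 150)
resistance <- exp(2.2 + 0.008 * thickness + rnorm(100, sd = 0.06))

fit <- lm(resistance ~ thickness)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test (small p-value suggests non-normality)
sw <- shapiro.test(res)

# Constant variance: regress squared residuals on the predictor;
# n * R^2 is compared with a chi-square distribution with 1 d.f.
aux     <- lm(I(res^2) ~ thickness)
lm_stat <- length(res) * summary(aux)$r.squared
bp_p    <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

# Curvature: does adding a quadratic term significantly reduce the residual sum of squares?
quad   <- lm(resistance ~ thickness + I(thickness^2))
curv_p <- anova(fit, quad)[2, "Pr(>F)"]

print(c(shapiro = sw$p.value, breusch_pagan = bp_p, curvature = curv_p))
```

For the plots themselves, plot(model, which = 1:2) restricts base R’s diagnostics to the Residuals vs. Fitted and Q-Q panels discussed here.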
To address non-constant variance in the residuals, a Box-Cox transformation is applied to find the power transformation of the response that will improve the model, after which the regression is refit. As seen below, this method is initialized by loading the “MASS” library and plotting the profile log-likelihood over a range of \(\lambda\) values for the original model.
library(MASS)
b<-boxcox(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = df)
Because \(\lambda = 1\) (no transformation) is not included in the 95% confidence interval shown on the plot, a transformation would be useful. Therefore, the Box-Cox output is separated into its x (\(\lambda\)) and y (log-likelihood) values, and which.max is used to find the \(\lambda\) that maximizes the likelihood.
lambda<-b$x
likelihood <- b$y
max_l<-which.max(likelihood)
lambda[max_l]
## [1] -0.8282828
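This value guides the power transformation of the electrical resistance: the code below raises resistance to the power -0.81, close to the \(\lambda\) of about -0.83 that maximized the likelihood under Box-Cox. The transformed variable is then used to display an additional Box-Cox plot, whose confidence interval should now include \(\lambda = 1\), indicating that no further transformation is needed.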
df$new_Electrical_Resistance_mOhm <- df$Electrical_Resistance_mOhm^-0.81
boxcox(new_Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = df)
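The confidence interval that boxcox( ) draws on its plot can also be extracted numerically. The Box-Cox family is \(y^{(\lambda)} = (y^{\lambda} - 1)/\lambda\) for \(\lambda \neq 0\) and \(\log y\) for \(\lambda = 0\), and the approximate 95% interval for \(\lambda\) contains every value whose profile log-likelihood lies within \(\chi^2_{0.95,1}/2\) of the maximum. The sketch below applies this to simulated, hypothetical data (the generating formula, seed, grid, and names are assumptions for illustration, not the wafer data).

```r
library(MASS)

# Hypothetical data: resistance grows multiplicatively with thickness (assumption)
set.seed(42)
thickness  <- runif(100, min = 50, max = 150)
resistance <- exp(2.2 + 0.008 * thickness + rnorm(100, sd = 0.06))
sim <- data.frame(thickness, resistance)

# Profile log-likelihood on a fine lambda grid, without drawing the plot
bc <- boxcox(resistance ~ thickness, data = sim,
             lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda that maximizes the log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% confidence interval: log-likelihood within qchisq(0.95, 1) / 2 of the maximum
cutoff    <- max(bc$y) - qchisq(0.95, df = 1) / 2
ci_lambda <- range(bc$x[bc$y >= cutoff])

# TRUE means "no transformation" (lambda = 1) is compatible with the data
one_in_ci <- ci_lambda[1] <= 1 && 1 <= ci_lambda[2]

print(lambda_hat)
print(ci_lambda)
print(one_in_ci)
```

Checking whether 1 falls inside ci_lambda gives the same decision as reading the interval off the plot, without relying on visual judgment.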
The benefit of transforming the electrical resistance is also visible in the summary of the refitted model:
model2 = lm(new_Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = df)
summary(model2)
##
## Call:
## lm(formula = new_Electrical_Resistance_mOhm ~ Film_Thickness_nm,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0111971 -0.0019573 -0.0002081 0.0025100 0.0093906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.629e-01 1.342e-03 121.44 <2e-16 ***
## Film_Thickness_nm -5.932e-04 1.323e-05 -44.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003915 on 98 degrees of freedom
## Multiple R-squared: 0.9536, Adjusted R-squared: 0.9531
## F-statistic: 2012 on 1 and 98 DF, p-value: < 2.2e-16
Observations: While the p-value stayed the same, the \(R^2\) value increased from 0.9257 to 0.9536, which indicates that a larger proportion of variance is explained and the model fits better. The residual standard error is also numerically lower (0.003915 versus 1.041), but the two values are on different scales (transformed versus original mOhm) and so are not directly comparable. The slope is now negative because raising resistance to a negative power reverses its ordering. The relationship remains strongly significant, as shown by its p-value of less than 2.2e-16.
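Because the two models are fitted on different response scales, one way to compare their accuracy fairly is to back-transform the fitted values of the transformed model and measure the error in original units. The sketch below does this on simulated, hypothetical data (the generating formula, seed, and names are assumptions; only the exponent -0.81 is taken from the analysis above), so it demonstrates the procedure rather than the result for the wafer data.

```r
# Hypothetical data: resistance grows multiplicatively with thickness (assumption)
set.seed(7)
thickness  <- runif(100, min = 50, max = 150)
resistance <- exp(2.2 + 0.008 * thickness + rnorm(100, sd = 0.06))

power <- -0.81   # exponent used for the transformed response in the analysis above

# Model on the original scale and model on the power-transformed scale
fit_raw <- lm(resistance ~ thickness)
fit_pow <- lm(I(resistance^power) ~ thickness)

# Fitted values, both expressed in original units
pred_raw <- fitted(fit_raw)
pred_pow <- fitted(fit_pow)^(1 / power)   # invert the power transformation

# Root mean squared error in original units
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_raw <- rmse(resistance, pred_raw)
rmse_pow <- rmse(resistance, pred_pow)

print(c(original_model = rmse_raw, transformed_model = rmse_pow))
```

Comparing the two RMSE values puts both models on the same footing, which the residual standard errors from summary() cannot do.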
After refitting the model on the transformed data, it is important to reassess whether the assumptions of linear regression have been met. Special focus is put on constant variance (homoscedasticity) since the original model failed to meet the expectation.
plot(model2)
After refitting the model, two of the many ways to assess adequacy by evaluating statistical assumptions are met:
Constant variance (homoscedasticity) - As seen in the “Residuals vs. Fitted” graph, the curvature of the smoothed line has decreased and it now resembles a straight line more closely than before. Additionally, the data points cluster more evenly around it with fewer outliers, supporting constant variance and, furthermore, adequacy.
Normality of residuals - In the “Q-Q Residuals” plot, more points reside along the dotted reference line than in the original plot, with a minimal number of outliers located away from it, supporting normality and, furthermore, adequacy.
Overall, this suggests that the Box-Cox transformation allowed the data to fit the regression line with more stable variance. Therefore, the model above is adequate for further analysis as seen below.
In semiconductor manufacturing, common target film thickness values depend on the specific application and material being deposited. For this analysis, I calculated a 95% confidence interval (CI) and prediction interval (PI) for resistance at 100 nm of thickness since it serves as a midpoint in the standard process range and is often used as a reference for quality control. This will allow engineers to determine whether the process is stable and meets design specifications.
This calculation begins with finding the minimum and maximum of the predictor value, which is the Film Thickness. These values are then used to create a range for a new sequence that begins at the minimum thickness and increments at a set value up to the maximum:
x <- df$Film_Thickness_nm
y <- df$new_Electrical_Resistance_mOhm
newx <- seq(min(x), max(x), 0.03)
Therefore, newx contains a fine grid of predictor values spanning the observed thickness range, at which interval estimates are calculated from the fitted regression model. Each grid value is substituted into the fitted regression equation:
\[ \hat{Y}= \hat{\beta}_{0} + \hat{\beta}_{1}x \: \Rightarrow \hat{Y}_{j}= \hat{\beta}_{0} + \hat{\beta}_{1}X_{j}\]
where \(X_{j}\) represents each value in the “newx” set and \(\hat{Y}_{j}\) is the estimated response, i.e. the predicted value of the dependent variable at the corresponding predictor value. It is also important to note that the two intervals answer different questions. Confidence Intervals estimate the expected mean response, while Prediction Intervals account for the variability of an individual future observation.
These intervals are computed for “y” (the transformed Electrical Resistance) at each “newx” value (the Film Thickness) using the predict( ) command:
model3 <- lm(y ~ x)
conf<- predict(model3, data.frame(x=newx), interval = "confidence",level=0.95)
pred<- predict(model3, data.frame(x=newx), interval = "prediction",level=0.95)
The resulting interval matrices are used to draw the confidence and prediction bounds on the plot. The data are plotted first, then the fitted regression line is added to show the trend, followed by the upper and lower bounds of the confidence and prediction intervals:
plot(x, y,
main = "Linear Regression with Confidence and Prediction Intervals",
ylab = "Transformed Electrical Resistance (mOhm^-0.81)",
xlab = "Film Thickness (nm)",
col ="lightpink",
pch = 19)
abline(model3, col = "magenta", lwd=2)
lines(newx, conf[,2],col="blue", lty=2, lwd=2)
lines(newx, conf[,3], col="blue", lty=2, lwd=2)
lines(newx, pred[,2], col="darkgreen", lty=2, lwd=2)
lines(newx, pred[,3], col="darkgreen", lty=2, lwd=2)
These intervals stem from the fitted simple linear regression model ( \(\hat{\beta}_{0} + \hat{\beta}_{1}x\) ), which gives a point estimate of the mean of y for a particular x. From this equation we can derive residuals, since the difference between the observed value \(y_{i}\) and the corresponding fitted value \(\hat{y}_{i}\) is a residual. If the errors are normally and independently distributed, as the diagnostics above suggest, the sampling distribution of \((\hat{\beta}_{1} - \beta_{1})/se(\hat{\beta}_{1})\) is t with n − 2 degrees of freedom. Therefore, a 100(1 − α) percent confidence interval (CI) on the slope \(\beta_{1}\) is given by
\[ \hat{\beta}_{1} - t_{\alpha /2, n-2}\, se(\hat{\beta}_{1})\leq \beta_{1} \leq \hat{\beta}_{1} + t_{\alpha /2, n-2}\, se(\hat{\beta}_{1}) \]
where the width of this confidence interval is a measure of the overall quality of the regression line. Furthermore, the fitted regression line is also used to derive the prediction interval for a future observation, noted as \(y_{0}\), at a point \(x_{0}\):
\[ \psi=y_{0} - \hat{y}_{0} \; \Rightarrow \; Var(\psi) = Var(y_{0} - \hat{y}_{0}) = \sigma^2\left[1+\frac{1}{n}+\frac{(x_{0} - \bar{x})^2}{S_{xx}}\right] \]
The random variable \(\psi\) is normally distributed with mean zero and the variance above, because the future observation \(y_{0}\) is independent of \(\hat{y}_{0}\). Thus, estimating \(\sigma^2\) by \(MS_{Res}\), the 100(1 − α) percent prediction interval on a future observation at \(x_{0}\) is:
\[ \hat{y_{0}}-t_{\alpha /2, n-2}\sqrt{MS_{Res}(1+\frac{1}{n}+\frac{(x_{0} - \bar{x})^2}{S_{xx}})}\leq y_{0} \leq \hat{y_{0}}+t_{\alpha /2, n-2}\sqrt{MS_{Res}(1+\frac{1}{n}+\frac{(x_{0} - \bar{x})^2}{S_{xx}})} \]
where the prediction interval is always wider than the confidence interval because the prediction interval depends on both the error from the fitted model and the error associated with future observations. This will also become apparent in the graphs plotted below.
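The interval formulas can be verified by computing them by hand and comparing with predict(). The sketch below does so at \(x_{0} = 100\) using the six rows printed by head(df) on the original scale, so the numbers are illustrative only. The confidence interval on the mean response uses the standard formula \(\hat{y}_{0} \pm t_{\alpha/2, n-2}\sqrt{MS_{Res}(\frac{1}{n}+\frac{(x_{0} - \bar{x})^2}{S_{xx}})}\), which is the prediction interval formula without the leading 1.

```r
# Six (thickness, resistance) pairs copied from the head(df) output
x <- c(87.45, 145.07, 123.20, 109.87, 65.60, 65.60)
y <- c(15.118, 23.601, 19.904, 16.103, 12.901, 13.278)

fit <- lm(y ~ x)
n   <- length(x)
x0  <- 100                                  # thickness of interest (nm)

# Point estimate at x0
b      <- coef(fit)
y0_hat <- unname(b[1] + b[2] * x0)

# Ingredients of the interval formulas
ms_res <- sum(residuals(fit)^2) / (n - 2)   # residual mean square
Sxx    <- sum((x - mean(x))^2)
t_crit <- qt(0.975, n - 2)

# Standard errors for the mean response and for a single future observation
se_mean <- sqrt(ms_res * (1 / n + (x0 - mean(x))^2 / Sxx))
se_pred <- sqrt(ms_res * (1 + 1 / n + (x0 - mean(x))^2 / Sxx))

ci_manual <- y0_hat + c(-1, 1) * t_crit * se_mean
pi_manual <- y0_hat + c(-1, 1) * t_crit * se_pred

# Same intervals from predict() for comparison
ci_r <- predict(fit, data.frame(x = x0), interval = "confidence", level = 0.95)
pi_r <- predict(fit, data.frame(x = x0), interval = "prediction", level = 0.95)

print(rbind(manual = ci_manual, predict = ci_r[1, 2:3]))
print(rbind(manual = pi_manual, predict = pi_r[1, 2:3]))
```

The extra 1 inside the square root is what makes the prediction interval wider than the confidence interval at every \(x_{0}\).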
In semiconductor manufacturing, target film thickness values vary depending on the specific application and material being deposited. For example, conductive metal layers have a thickness range between 50 nm and 150 nm, with 100 nm frequently monitored as a midpoint in the standard process range and a reference for quality control. In this case, an analysis of the electrical resistance at 100 nm using the fitted linear regression model is requested. To assess uncertainty, a 95% Confidence Interval (CI) and Prediction Interval (PI) are calculated: the CI bounds the true mean resistance, while the PI bounds an individual future observation. These estimates demonstrate the variation in resistance at the designated critical thickness (100 nm), allowing engineers to determine process stability and compliance with design specifications.
These intervals are calculated once a data frame holding the value 100 (nm) is created and passed to predict( ) in place of the grid of x values:
x_100 <- data.frame(x = 100)
CI_100 <- predict(model3, x_100, interval="confidence", level=0.95)
PI_100 <- predict(model3, x_100, interval="prediction", level=0.95)
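Once the intervals have been computed, the predicted value at 100 nm and the intervals are printed. Each interval is reported as a range from its lower to its upper limit. Because model3 was fitted to the transformed response, the printed values are on the transformed scale (resistance raised to the power -0.81) rather than in mOhm, despite the unit label in the output.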
cat("Predicted resistance at 100 nm is", CI_100[1], "mOhm\n")
## Predicted resistance at 100 nm is 0.1035865 mOhm
cat("Confidence Interval (95%) Range:", CI_100[2], "-", CI_100[3], "mOhm\n")
## Confidence Interval (95%) Range: 0.1028057 - 0.1043673 mOhm
cat("Prediction Interval (95%) Range:", PI_100[2], "-", PI_100[3], "mOhm\n")
## Prediction Interval (95%) Range: 0.09577858 - 0.1113944 mOhm
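To express these results in mOhm, the printed values can be back-transformed by inverting the power transformation. The sketch below assumes that the exponent -0.81 from the transformation step is the one that produced the output above; the function name back_transform is introduced here for illustration. Because the exponent is negative, the transformation reverses order, so the lower and upper limits swap and are re-sorted.

```r
power <- -0.81   # exponent applied to resistance before fitting (assumed from the code above)

# Invert z = y^power, i.e. y = z^(1 / power)
back_transform <- function(z) z^(1 / power)

# Values copied from the printed output (transformed scale)
fit_t <- 0.1035865
ci_t  <- c(0.1028057, 0.1043673)
pi_t  <- c(0.09577858, 0.1113944)

# Back-transformed values in mOhm; sort() restores lower-to-upper order
fit_mOhm <- back_transform(fit_t)
ci_mOhm  <- sort(back_transform(ci_t))
pi_mOhm  <- sort(back_transform(pi_t))

cat("Predicted resistance at 100 nm:", round(fit_mOhm, 2), "mOhm\n")
cat("Back-transformed 95% CI:", round(ci_mOhm, 2), "mOhm\n")
cat("Back-transformed 95% PI:", round(pi_mOhm, 2), "mOhm\n")
```

Under that assumption the point prediction comes out at roughly 16.4 mOhm, with a confidence interval of roughly 16.3 to 16.6 mOhm and a prediction interval of roughly 15.0 to 18.1 mOhm. The back-transformed confidence interval bounds the back-transformed mean of the transformed response, which is not exactly the mean resistance.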
The zoom limits are assigned to variables that are then used to filter the data within the designated range. New X-values are created over that range, as was done in the previous process above, in order to account for quality control, and are passed to the interval predictions.
# Define FT's Upper and Lower Limits (nm) - X AXIS
zoom_FT_min <- 80
zoom_FT_max <- 120
FT_zoom <- x[x >= zoom_FT_min & x <= zoom_FT_max]
ER_zoom <- y[x >= zoom_FT_min & x <= zoom_FT_max]
newx_zoom <- seq(zoom_FT_min, zoom_FT_max, 0.03)
conf_zoom <- predict(model3, data.frame(x=newx_zoom), interval="confidence", level=0.95)
pred_zoom <- predict(model3, data.frame(x=newx_zoom), interval="prediction", level=0.95)
# Define ER's Upper and Lower Limits (transformed scale) - Y AXIS
# Taken from the prediction band inside the zoom window so the y-axis is zoomed as well
zoom_ER_min <- min(pred_zoom[,2])
zoom_ER_max <- max(pred_zoom[,3])
Finally, the zoomed scatter plot is drawn with the regression line in purple, the confidence and prediction bounds in blue and pink, respectively, and a black dot marking the prediction at a Film Thickness of 100 nm along the line. The plot thereby displays the model’s fitted regression line, its prediction, and the associated uncertainty via the confidence and prediction intervals.
plot(FT_zoom, ER_zoom, main="Detailed Zoom on Linear Regression",
xlab="Film Thickness (nm)", ylab="Transformed Electrical Resistance (mOhm^-0.81)",
pch=16, col="orange", cex=1.5,
xlim=c(zoom_FT_min, zoom_FT_max), ylim=c(zoom_ER_min, zoom_ER_max)) # Zoom in!
abline(model3, col="purple", lwd=2)
lines(newx_zoom, conf_zoom[,2], col="blue", lty=2, lwd=2)
lines(newx_zoom, conf_zoom[,3], col="blue", lty=2, lwd=2)
lines(newx_zoom, pred_zoom[,2], col="pink", lty=2, lwd=2)
lines(newx_zoom, pred_zoom[,3], col="pink", lty=2, lwd=2)
# Prediction point at Film Thickness = 100 nm
points(100, CI_100[1], col="black", pch=16, cex=2, lwd=3)
Semiconductor manufacturing is an incredibly precise process where even the smallest variations can significantly impact the performance and properties of the wafers. In this case study, we focused on the deposition of thin films on silicon wafers, which can greatly impact the electrical resistance. Physically, resistance is generally expected to decrease as thickness increases, since a thicker layer provides more pathways for current flow; in this data set, however, the fitted slope of the original model is positive (about 0.123 mOhm per nm), so measured resistance rose with thickness over the observed range.
Therefore, engineers must continuously monitor and control film thickness as a way to ensure that resistance remains within an acceptable range. If resistance is found to vary unpredictably with thickness, it may indicate future problems in performance. The strength of the relationship between film thickness and electrical resistance was not known in advance; determining whether variations in film thickness are a strong predictor of electrical resistance was this report’s main focus.
Through further analysis of the relationship’s stability, effectiveness, and accuracy, the results became conclusive. There are many takeaways from this case study that should be remembered, but the questions below capture the key points about the data analysis of semiconductors:
How well does film thickness predict electrical resistance? - Film Thickness explains about 93% of the variability in Electrical Resistance in the original model (\(R^2\) = 0.9257) and about 95% after the transformation (\(R^2\) = 0.9536), so a known thickness can reliably and accurately predict the corresponding resistance.
How significant is the relationship between thickness and resistance? - The relationship is highly significant (p-value < 2.2e-16). In the original model there is a strong positive correlation between Film Thickness and Electrical Resistance, with resistance increasing by about 0.123 mOhm for each additional nanometer of thickness over the measured range.
What are the implications of your findings for process control and quality improvement? - Since deposition equipment is responsible for laying thin films on silicon wafers, it is important that these machines are monitored and evaluated so that a consistent film thickness is applied to each wafer, minimizing variability while optimizing product performance.
Below is the entirety of my code for this project. It has been commented out so it will not execute the commands again, but is here for your perusal. This concludes my case study.
# #Read in the data from the URL given
# df <- read.table("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/semiconductor_SLR_dataset.csv", header = TRUE, sep = ",")
#
# # A histogram and box plot of electrical resistance (Y) to examine its distribution
# par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2,0,2,0))
# hist(df$Electrical_Resistance_mOhm,
# main = "Histogram of Resistance",
# xlab = "Electrical Resistance (mOhm)",
# col ="lightblue",
# border = 'darkblue')
#
# boxplot(df$Electrical_Resistance_mOhm,
# main = "Boxplot of Resistance",
# ylab = "Electrical Resistance (mOhm)",
# col ="orange",
# border = 'brown')
#
# mtext("Distribution of Electrical Resistance", outer = TRUE, cex = 2, font = 4)
#
# # A scatterplot of resistance (Y) versus thickness (X) to visualize the potential relationship.
# plot(df$Film_Thickness_nm,df$Electrical_Resistance_mOhm,
# main = "Electrical Resistance vs. Film Thickness",
# ylab = "Electrical Resistance (mOhm)",
# xlab = "Film Thickness (nm)",
# col ="darkgreen",
# pch=19)
#
# # Use LSE linear regression model to determine the nature of the relationship between film thickness and resistance.
# model = lm(df$Electrical_Resistance_mOhm ~ df$Film_Thickness_nm)
# plot(model)
# summary(model)
#
# # Apply an appropriate transformation, aka boxcox, to improve the model and refit the regression.
#
# library(MASS)
#
# #To find out lambda
# b<-boxcox(Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = df)
#
# lambda<-b$x
# likelihood <- b$y
# max_l<-which.max(likelihood)
# lambda[max_l]
#
# #Re-plot with the transformed response (same exponent as in the report body)
# df$new_Electrical_Resistance_mOhm <- df$Electrical_Resistance_mOhm^-0.81
# boxcox(new_Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = df)
#
# model2 = lm(new_Electrical_Resistance_mOhm ~ Film_Thickness_nm, data = df)
# summary(model2)
# plot(model2)
#
#
# #Additionally, generate a scatterplot of the data with the fitted regression line, including both confidence and prediction intervals.
#
# x <- df$Film_Thickness_nm
# y <- df$new_Electrical_Resistance_mOhm
#
# newx <- seq(min(x), max(x), 0.03)
#
# model3 <- lm(y ~ x)
#
# conf<- predict(model3, data.frame(x=newx), interval = "confidence",level=0.95)
# pred<- predict(model3, data.frame(x=newx), interval = "prediction",level=0.95)
#
# plot(x, y,
# main = "Expected Variation in Resistance",
# ylab = "Electrical Resistance (mOhm)",
# xlab = "Film Thickness (nm)",
# col ="lightpink",
# pch = 19)
#
# abline(model3, col = "magenta", lwd=2)
#
# lines(newx, conf[,2],col="blue", lty=2, lwd=2)
# lines(newx, conf[,3], col="blue", lty=2, lwd=2)
# lines(newx, pred[,2], col="darkgreen", lty=2, lwd=2)
# lines(newx, pred[,3], col="darkgreen", lty=2, lwd=2)
#
#
#
#
# # Calculate a 95% confidence interval (CI) and prediction interval (PI) for resistance at 100 nm of thickness.
#
# x_100 <- data.frame(x = 100)
# CI_100 <- predict(model3, x_100, interval="confidence", level=0.95)
# PI_100 <- predict(model3, x_100, interval="prediction", level=0.95)
#
# cat("Predicted resistance at 100 nm:", CI_100[1], "mOhm\n")
# cat("Confidence Interval (95%):", CI_100[2], "-", CI_100[3], "mOhm\n")
# cat("Prediction Interval (95%):", PI_100[2], "-", PI_100[3], "mOhm\n")
#
#
# #Note to self: Will need to zoom in on FT = 100nm!
# # Define FT's Upper and Lower Limits (nm) - X AXIS
# zoom_FT_min <- 80
# zoom_FT_max <- 120
#
# # Define ER's Upper and Lower Limits (mOhms) - Y AXIS
# zoom_ER_min <- min(conf[,2])
# zoom_ER_max <- max(conf[,3])
#
# FT_zoom <- x[x >= zoom_FT_min & x <= zoom_FT_max]
# ER_zoom <- y[x >= zoom_FT_min & x <= zoom_FT_max]
#
# newx_zoom <- seq(zoom_FT_min, zoom_FT_max, 0.03)
#
#
# conf_zoom <- predict(model3, data.frame(x=newx_zoom), interval="confidence", level=0.95)
# pred_zoom <- predict(model3, data.frame(x=newx_zoom), interval="prediction", level=0.95)
#
#
# plot(FT_zoom, ER_zoom, main="Detailed Zoom on Linear Regression",
# xlab="Film Thickness (nm)", ylab="Electrical Resistance (mOhm)",
# pch=16, col="orange", cex=1.5,
# xlim=c(zoom_FT_min, zoom_FT_max), ylim=c(zoom_ER_min, zoom_ER_max)) # Zoom in!
#
# abline(model3, col="purple", lwd=2)
# lines(newx_zoom, conf_zoom[,2], col="blue", lty=2, lwd=2)
# lines(newx_zoom, conf_zoom[,3], col="blue", lty=2, lwd=2)
# lines(newx_zoom, pred_zoom[,2], col="pink", lty=2, lwd=2)
# lines(newx_zoom, pred_zoom[,3], col="pink", lty=2, lwd=2)
#
# # Prediction point at Film Thickness = 100 nm
# points(100, CI_100[1], col="black", pch=16, cex=2, lwd=3)