Introduction
The primary goal of this exercise is to learn and apply exploratory data analysis techniques. We will use descriptive statistics and visualizations, such as boxplots and scatter plots, to identify trends, patterns, and variability in the data. In particular, we will explore how different power settings (160W, 180W, 200W, and 220W) affect the performance of C2F6 gas in a controlled environment. By testing five wafers at each power level, we aim to understand the relationship between power settings and observations. The problem is provided below:
Problem
The engineer is interested in a particular gas (C2F6) and gap (0.80 cm) and wants to test four levels of power settings: 160W, 180W, 200W, and 220W. The engineer decided to test five wafers at each level of power. The experiment is replicated 5 times, with the runs made in random order.
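The problem statement does not say how the random run order was produced. As a minimal sketch of one possible approach, the base R code below lists the 20 runs (four power levels, five wafers each) and shuffles them with `sample()`. The object name `design`, the column `RunOrder`, and the seed are illustrative assumptions and are not part of the original experiment.

```r
# Hypothetical sketch: generate a randomized run order for the 4 x 5 design.
# The seed is arbitrary and only makes the shuffle reproducible.
set.seed(1)

power_levels <- c(160, 180, 200, 220)  # power settings in W
n_wafers <- 5                          # wafers tested at each setting

# One row per run: each power level appears n_wafers times
design <- data.frame(Power = rep(power_levels, each = n_wafers))

# Assign each run a position in a random permutation of 1..20
design$RunOrder <- sample(nrow(design))

# Sort so that the rows appear in the order the runs would be performed
design <- design[order(design$RunOrder), ]
print(design)
```

Shuffling all 20 runs together, rather than running one power level at a time, keeps slow drifts in the equipment from being mistaken for a power effect.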
Loading the data
First, we will load the data set using R syntax. The following R code snippet creates and displays a data frame named data to organize the experimental results. The data frame consists of columns representing different power settings and corresponding observations:
# Loading the dataset
data <- data.frame(
  # Column 'Power' with the four power settings (W)
  Power = c(160, 180, 200, 220),
  # Each 'Observation_k' column holds one wafer's result at each power
  # setting: four values, in the same order as 'Power'
  Observation_1 = c(575, 565, 600, 725),
  # Second observation at each power setting
  Observation_2 = c(542, 593, 651, 700),
  # Third observation at each power setting
  Observation_3 = c(530, 590, 610, 715),
  # Fourth observation at each power setting
  Observation_4 = c(539, 579, 637, 685),
  # Fifth and final observation at each power setting
  Observation_5 = c(570, 610, 629, 710)
)
# Print the data frame to view its contents
print(data)
## Power Observation_1 Observation_2 Observation_3 Observation_4 Observation_5
## 1 160 575 542 530 539 570
## 2 180 565 593 590 579 610
## 3 200 600 651 610 637 629
## 4 220 725 700 715 685 710
The table represents the results of an experiment conducted by an engineer to study the effect of four different power settings (160W, 180W, 200W, and 220W) on a particular gas (C2F6) with a gap of 0.80 cm. The engineer tested five wafers at each power level to observe how the power setting influences the outcome, so each row holds the five replicate observations for one power setting.
Here’s how the data is structured:
Power: The power setting applied during the experiment (160W, 180W, 200W, 220W).
Observation_1 to Observation_5: The results obtained from testing five wafers at each power setting.
The table allows for easy comparison of how the observations vary with changes in power levels. By examining the data, trends or patterns can be identified, such as whether higher power settings lead to more consistent or higher observation values.
Descriptive Statistics
This code snippet reshapes the data into a long format to simplify the analysis of observations across different power levels. It then calculates descriptive statistics to summarize the central tendency and variability of the observations for each power setting.
# Load the tidyr package for data reshaping
library(tidyr)
# Reshape the data from wide to long format for easier analysis
data_long <- gather(data, key = "Observation", value = "Value", -Power)
# Load the dplyr package for data manipulation
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Compute descriptive statistics for the reshaped data
summary_stats <- data_long %>%
# Group data by the 'Power' variable
group_by(Power) %>%
# Summarize the data with various statistics
summarize(
# Calculate the mean of 'Value' for each power level
Mean = mean(Value),
# Calculate the median of 'Value' for each power level
Median = median(Value),
# Calculate the minimum value of 'Value' for each power level
Min = min(Value),
# Calculate the maximum value of 'Value' for each power level
Max = max(Value),
# Calculate the standard deviation of 'Value' for each power level
SD = sd(Value),
# Calculate the 1st quartile of 'Value' for each power level
Q1 = quantile(Value, 0.25),
# Calculate the 3rd quartile of 'Value' for each power level
Q3 = quantile(Value, 0.75),
# Calculate the interquartile range of 'Value' for each power level
IQR = IQR(Value)
)
# Print the summary statistics to view the results
print(summary_stats)
## # A tibble: 4 × 9
## Power Mean Median Min Max SD Q1 Q3 IQR
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 160 551. 542 530 575 20.0 539 570 31
## 2 180 587. 590 565 610 16.7 579 593 14
## 3 200 625. 629 600 651 20.5 610 637 27
## 4 220 707 710 685 725 15.2 700 715 15
As the power level increases, both the mean and the median increase, with the mean rising from 551.2 at 160 W to 707 at 220 W. The standard deviation is of similar size at every power level (15.2 to 20.5) and shows no steady trend: it is lowest at 220 W (15.2) and highest at 200 W (20.5). The interquartile range, which measures the spread of the central 50% of the data, is widest at 160 W (31) and 200 W (27) and narrowest at 180 W (14) and 220 W (15), so it does not tighten steadily with power either. With only five observations per level, these spread estimates are rough.
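The summary table can be cross-checked without tidyr or dplyr. The sketch below retypes the same 20 measurements in long form and uses base R's `tapply()` to recompute the mean, standard deviation, and IQR per power level. The vector names `power` and `value` and the data frame `check` are illustrative and do not appear in the original code.

```r
# Same measurements as in the data frame above, entered in long form:
# five values per power level, in the order 160, 180, 200, 220 W.
power <- rep(c(160, 180, 200, 220), each = 5)
value <- c(575, 542, 530, 539, 570,
           565, 593, 590, 579, 610,
           600, 651, 610, 637, 629,
           725, 700, 715, 685, 710)

# tapply() applies a function to 'value' within each level of 'power'
group_mean <- tapply(value, power, mean)
group_sd   <- tapply(value, power, sd)
group_iqr  <- tapply(value, power, IQR)

# Collect the results in one data frame for comparison with summary_stats
check <- data.frame(
  Power = as.numeric(names(group_mean)),
  Mean  = as.vector(group_mean),
  SD    = round(as.vector(group_sd), 1),
  IQR   = as.vector(group_iqr)
)
print(check)
```

The means, standard deviations, and IQRs printed here should agree with the corresponding columns of the tibble above.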
Box-Plot
We create a boxplot to visually compare the distribution of observations across different power levels, highlighting trends and variability.
# Load necessary libraries
library(tidyr)
library(dplyr)
library(ggplot2)
# Reshape the data to long format
data_long <- gather(data, key = "Observation", value = "Value", -Power)
# Verify the reshaped data
print(data_long)
## Power Observation Value
## 1 160 Observation_1 575
## 2 180 Observation_1 565
## 3 200 Observation_1 600
## 4 220 Observation_1 725
## 5 160 Observation_2 542
## 6 180 Observation_2 593
## 7 200 Observation_2 651
## 8 220 Observation_2 700
## 9 160 Observation_3 530
## 10 180 Observation_3 590
## 11 200 Observation_3 610
## 12 220 Observation_3 715
## 13 160 Observation_4 539
## 14 180 Observation_4 579
## 15 200 Observation_4 637
## 16 220 Observation_4 685
## 17 160 Observation_5 570
## 18 180 Observation_5 610
## 19 200 Observation_5 629
## 20 220 Observation_5 710
# Boxplot of Observations by Power level
boxplot(Value ~ Power, data = data_long,
main = "Boxplot of Observations by Power Level",
xlab = "Power (W)",
ylab = "Observation",
col = "pink", border = "hotpink")
Increasing Median: As the power level increases, the median of the boxplot also increases, indicating higher central values.
Box Size: The size of the box, which represents the IQR, varies. The IQR is widest at 160 W (31), narrows sharply at 180 W (14), widens again at 200 W (27), and narrows at 220 W (15).
Whiskers: The range covered by the whiskers (from minimum to maximum) remains fairly consistent across power levels, between 40 and 51 units, showing similar spread.
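The numbers behind the boxes and whiskers can be inspected directly. As a sketch, the code below calls `boxplot.stats()` (the grDevices function that `boxplot()` relies on for its default statistics) on each power level and counts the points that would be drawn as outliers. The names `power`, `value`, `five_num`, and `n_outliers` are illustrative and not part of the original analysis.

```r
# Same measurements as above, in long form (five values per power level)
power <- rep(c(160, 180, 200, 220), each = 5)
value <- c(575, 542, 530, 539, 570,
           565, 593, 590, 579, 610,
           600, 651, 610, 637, 629,
           725, 700, 715, 685, 710)

# Compute the boxplot statistics separately for each power level
box_stats <- lapply(split(value, power), boxplot.stats)

# 'stats' holds: lower whisker, lower hinge, median, upper hinge, upper whisker
five_num <- t(sapply(box_stats, function(s) s$stats))
colnames(five_num) <- c("LowerWhisker", "LowerHinge", "Median",
                        "UpperHinge", "UpperWhisker")

# 'out' holds any points beyond 1.5 box lengths from the hinges
n_outliers <- sapply(box_stats, function(s) length(s$out))

print(five_num)
print(n_outliers)
```

If `n_outliers` is zero for every level, each whisker ends at the group minimum or maximum, which is what the whisker description above assumes.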
Scatter Plot
This code generates a scatterplot to explore the relationship between power levels and observations. By plotting individual observations and fitting a regression line, it visually examines how changes in power settings impact the observed values, highlighting any trends or patterns in the data.
# Load ggplot2 for creating plots
library(ggplot2)
# Create a scatterplot of observations versus power with a regression line
ggplot(data_long, aes(x = Power, y = Value)) +
# Add points to the scatterplot, colored by power level
geom_point(aes(color = as.factor(Power)), size = 3) +
# Add a linear regression line to the scatterplot without confidence interval shading
geom_smooth(method = "lm", se = FALSE, color = "hotpink") +
# Add labels and customize plot appearance
labs(title = "Scatterplot of Observations by Power Level",
x = "Power (W)",
y = "Observation",
color = "Power Level") +
# Apply a minimal theme for a cleaner look
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The spread is fairly consistent with some variation at each power level.
For example, at 160W, the observations range from 530 to 575, while at
220W, they range from 685 to 725. This shows that while higher power
settings yield higher observation values, there is still variability in
the results.
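The line drawn by `geom_smooth(method = "lm")` with formula `y ~ x` is an ordinary least-squares fit, so its coefficients can be recovered with `lm()`. The sketch below refits the line in base R and compares the fitted values with the group means. The names `power`, `value`, `fit`, and `comparison` are illustrative, and the comparison is only a descriptive check, not a formal test of linearity.

```r
# Same measurements as above, in long form (five values per power level)
power <- rep(c(160, 180, 200, 220), each = 5)
value <- c(575, 542, 530, 539, 570,
           565, 593, 590, 579, 610,
           600, 651, 610, 637, 629,
           725, 700, 715, 685, 710)

# Least-squares line of observation on power (same model as geom_smooth)
fit <- lm(value ~ power)
print(coef(fit))  # intercept and slope (change in observation per watt)

# Compare the fitted line with the observed mean at each power level
levels_w <- sort(unique(power))
comparison <- data.frame(
  Power     = levels_w,
  GroupMean = as.vector(tapply(value, power, mean)),
  Fitted    = as.vector(predict(fit, newdata = data.frame(power = levels_w)))
)
comparison$Difference <- comparison$GroupMean - comparison$Fitted
print(comparison)
```

Positive differences at the lowest and highest power levels together with negative ones in the middle would hint that the trend is not perfectly linear, although with five wafers per level this remains a rough impression.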
Interpretation
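Higher power settings tend to yield higher observation values: the mean rises from 551.2 at 160 W to 707 at 220 W. The spread is smallest at 220 W (SD 15.2) but does not shrink steadily with power, so the data give only weak evidence that results become more consistent at higher power. The variability within each power level remains notable, suggesting that while power settings influence the outcome, other factors might also play a role. The visualizations support these findings, showing a clear upward trend and some variability in the data.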