Introduction

The primary goal of this exercise is to learn and apply exploratory data analysis techniques. We will use descriptive statistics and visualizations, such as boxplots and scatter plots, to identify trends, patterns, and variability in the data. In particular, we will explore how different power settings (160W, 180W, 200W, and 220W) affect the performance of C2F6 gas in a controlled environment. By testing five wafers at each power level, we aim to understand the relationship between power settings and observations. The problem is provided below:

Problem

The engineer is interested in a particular gas (C2F6) and gap (0.80 cm) and wants to test four levels of power settings: 160W, 180W, 200W, and 220W. The engineer decided to test five wafers at each level of power. The experiment is replicated 5 times, with the runs made in random order.
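Since the runs are made in random order, a run sheet could be drawn up before the experiment starts. The following is only a minimal sketch in base R, not the engineer's actual procedure: the seed value and the names `design` and `run_order` are assumptions introduced here for illustration.

```r
# Hypothetical run sheet: 4 power levels x 5 wafers = 20 runs
set.seed(42)  # arbitrary seed, used only to make the example reproducible

# One row per (wafer, power) combination
design <- expand.grid(Wafer = 1:5, Power = c(160, 180, 200, 220))

# Shuffle the 20 rows so that the power levels are not run in blocks
run_order <- design[sample(nrow(design)), ]
run_order$Run <- seq_len(nrow(run_order))
rownames(run_order) <- NULL

# Show the first few runs of the randomized schedule
print(head(run_order))
```

Randomizing the order in this way helps keep any drift in the equipment or environment from being confused with the effect of the power setting.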

Loading the data

First, we will load the data set using R syntax. The following R code snippet creates and displays a data frame named data to organize the experimental results. The data frame has one row per power setting: a Power column holds the four settings, and five observation columns hold the results for the five wafers tested at each setting:

# Loading the dataset
data <- data.frame(
  # Column 'Power': the four power settings (W), one per row
  Power = c(160, 180, 200, 220),
  
  # Column 'Observation_1': first wafer tested at each power setting
  Observation_1 = c(575, 565, 600, 725),
  
  # Column 'Observation_2': second wafer tested at each power setting
  Observation_2 = c(542, 593, 651, 700),
  
  # Column 'Observation_3': third wafer tested at each power setting
  Observation_3 = c(530, 590, 610, 715),
  
  # Column 'Observation_4': fourth wafer tested at each power setting
  Observation_4 = c(539, 579, 637, 685),
  
  # Column 'Observation_5': fifth wafer tested at each power setting
  Observation_5 = c(570, 610, 629, 710)
)

# Print the data frame to view its contents
print(data)
##   Power Observation_1 Observation_2 Observation_3 Observation_4 Observation_5
## 1   160           575           542           530           539           570
## 2   180           565           593           590           579           610
## 3   200           600           651           610           637           629
## 4   220           725           700           715           685           710

The table represents the results of an experiment conducted by an engineer to study the effect of four different power settings (160W, 180W, 200W, and 220W) on a particular gas (C2F6) with a gap of 0.80 cm. The engineer tested five wafers at each power level to observe how the power setting influences the outcome. Each power setting was replicated with five different observations.

Here’s how the data is structured:

Power: The power setting applied during the experiment (160W, 180W, 200W, 220W).

Observation_1 to Observation_5: The results obtained from testing five wafers at each power setting.

The table allows for easy comparison of how the observations vary with changes in power levels. By examining the data, trends or patterns can be identified, such as whether higher power settings lead to more consistent or higher observation values.
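As a quick illustration of comparing rows of the wide table directly, the sketch below computes the mean and the range (maximum minus minimum) of the five observations at each power setting. It uses base R only and repeats the data so that it can be run on its own; the names `obs_cols` and `row_summary` are assumptions made for this example.

```r
# Same wide table as above: one row per power setting, one column per wafer
data <- data.frame(
  Power = c(160, 180, 200, 220),
  Observation_1 = c(575, 565, 600, 725),
  Observation_2 = c(542, 593, 651, 700),
  Observation_3 = c(530, 590, 610, 715),
  Observation_4 = c(539, 579, 637, 685),
  Observation_5 = c(570, 610, 629, 710)
)

# Select the five observation columns by their name pattern
obs_cols <- grep("^Observation_", names(data), value = TRUE)

# Row-wise mean and range (max - min) for each power setting
row_summary <- data.frame(
  Power = data$Power,
  Mean  = rowMeans(data[, obs_cols]),
  Range = apply(data[, obs_cols], 1, function(x) max(x) - min(x))
)
print(row_summary)
```

The row means increase with every step in power, while the ranges stay between 40 and 51, which already hints at the pattern examined in the sections that follow.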

Descriptive Statistics

This code snippet reshapes the data into a long format to simplify the analysis of observations across different power levels. It then calculates descriptive statistics to summarize the central tendency and variability of the observations for each power setting.

# Load the tidyr package for data reshaping
library(tidyr)

# Reshape the data from wide to long format for easier analysis
data_long <- gather(data, key = "Observation", value = "Value", -Power)

# Load the dplyr package for data manipulation
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Compute descriptive statistics for the reshaped data
summary_stats <- data_long %>%
  # Group data by the 'Power' variable
  group_by(Power) %>%
  # Summarize the data with various statistics
  summarize(
    # Calculate the mean of 'Value' for each power level
    Mean = mean(Value),
    # Calculate the median of 'Value' for each power level
    Median = median(Value),
    # Calculate the minimum value of 'Value' for each power level
    Min = min(Value),
    # Calculate the maximum value of 'Value' for each power level
    Max = max(Value),
    # Calculate the standard deviation of 'Value' for each power level
    SD = sd(Value),
    # Calculate the 1st quartile of 'Value' for each power level
    Q1 = quantile(Value, 0.25),
    # Calculate the 3rd quartile of 'Value' for each power level
    Q3 = quantile(Value, 0.75),
    # Calculate the interquartile range of 'Value' for each power level
    IQR = IQR(Value)
  )

# Print the summary statistics to view the results
print(summary_stats)
## # A tibble: 4 × 9
##   Power  Mean Median   Min   Max    SD    Q1    Q3   IQR
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   160  551.    542   530   575  20.0   539   570    31
## 2   180  587.    590   565   610  16.7   579   593    14
## 3   200  625.    629   600   651  20.5   610   637    27
## 4   220  707     710   685   725  15.2   700   715    15

As the power level increases, both the mean and the median increase: the mean rises from about 551 at 160W to 707 at 220W. The standard deviation is similar across power levels (roughly 15 to 21) and is lowest at 220W, but it does not fall steadily: it is 20.0 at 160W, 16.7 at 180W, 20.5 at 200W, and 15.2 at 220W. The interquartile range, which measures the spread of the central 50% of the data, behaves the same way: it is 31 at 160W, 14 at 180W, 27 at 200W, and 15 at 220W, so the middle of the data is not consistently tighter at higher power. With only five observations per level, these differences in spread should be read with caution.
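The figures in the summary table can be cross-checked without any add-on packages. The sketch below is one possible base R version, assuming the long-format data are entered directly; `base_stats` is a name introduced here for illustration only.

```r
# Long-format data entered directly: 20 rows, one per wafer
data_long <- data.frame(
  Power = rep(c(160, 180, 200, 220), times = 5),
  Value = c(575, 565, 600, 725,
            542, 593, 651, 700,
            530, 590, 610, 715,
            539, 579, 637, 685,
            570, 610, 629, 710)
)

# Mean, standard deviation, and IQR per power level using tapply()
base_stats <- data.frame(
  Power = sort(unique(data_long$Power)),
  Mean  = as.numeric(tapply(data_long$Value, data_long$Power, mean)),
  SD    = as.numeric(tapply(data_long$Value, data_long$Power, sd)),
  IQR   = as.numeric(tapply(data_long$Value, data_long$Power, IQR))
)

# Round to one decimal place for comparison with the tibble output
print(round(base_stats, 1))
```

The rounded values should agree with the Mean, SD, and IQR columns of the tibble above. Note that `sd()` uses the sample formula (dividing by n - 1) and `IQR()` uses R's default quantile method.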

Box-Plot

We create a boxplot to visually compare the distribution of observations across different power levels, highlighting trends and variability.

# Load necessary libraries
library(tidyr)
library(dplyr)
library(ggplot2)

# Reshape the data to long format
data_long <- gather(data, key = "Observation", value = "Value", -Power)

# Verify the reshaped data
print(data_long)
##    Power   Observation Value
## 1    160 Observation_1   575
## 2    180 Observation_1   565
## 3    200 Observation_1   600
## 4    220 Observation_1   725
## 5    160 Observation_2   542
## 6    180 Observation_2   593
## 7    200 Observation_2   651
## 8    220 Observation_2   700
## 9    160 Observation_3   530
## 10   180 Observation_3   590
## 11   200 Observation_3   610
## 12   220 Observation_3   715
## 13   160 Observation_4   539
## 14   180 Observation_4   579
## 15   200 Observation_4   637
## 16   220 Observation_4   685
## 17   160 Observation_5   570
## 18   180 Observation_5   610
## 19   200 Observation_5   629
## 20   220 Observation_5   710
# Boxplot of Observations by Power level
boxplot(Value ~ Power, data = data_long,
        main = "Boxplot of Observations by Power Level",
        xlab = "Power (W)",
        ylab = "Observation",
        col = "pink", border = "hotpink")

Increasing Median: As the power level increases, the median also increases (542, 590, 629, and 710), indicating higher central values.

Box Size: The height of the box represents the IQR. It is widest at 160W (31), narrows at 180W (14), widens again at 200W (27), and narrows at 220W (15), so it does not shrink steadily with power.

Whiskers: The range covered by the whiskers (from minimum to maximum) is fairly consistent across power levels, between 40 and 51 units, showing similar spread.
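The numbers behind each box can be inspected with `boxplot.stats()`, which returns the values that `boxplot()` draws by default. The following is a small sketch under the assumption that the long-format data are entered directly; `box_info`, `five_num`, and `n_out` are illustrative names.

```r
# Long-format data entered directly: 20 rows, one per wafer
data_long <- data.frame(
  Power = rep(c(160, 180, 200, 220), times = 5),
  Value = c(575, 565, 600, 725,
            542, 593, 651, 700,
            530, 590, 610, 715,
            539, 579, 637, 685,
            570, 610, 629, 710)
)

# boxplot.stats() gives lower whisker, lower hinge, median,
# upper hinge, and upper whisker, plus any points flagged as outliers
box_info <- lapply(split(data_long$Value, data_long$Power), boxplot.stats)

# Collect the five plotted values into a matrix, one column per power level
five_num <- sapply(box_info, function(b) b$stats)
rownames(five_num) <- c("LowerWhisker", "LowerHinge", "Median",
                        "UpperHinge", "UpperWhisker")
print(five_num)

# Number of points that would be drawn as outliers at each power level
n_out <- sapply(box_info, function(b) length(b$out))
print(n_out)
```

For these data no point is flagged as an outlier, so each whisker ends at the minimum or maximum observation for its power level.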

Scatter Plot

This code generates a scatterplot to explore the relationship between power levels and observations. By plotting individual observations and fitting a regression line, it visually examines how changes in power settings impact the observed values, highlighting any trends or patterns in the data.

# Load ggplot2 for creating plots
library(ggplot2)

# Create a scatterplot of observations versus power with a regression line
ggplot(data_long, aes(x = Power, y = Value)) +
  # Add points to the scatterplot, colored by power level
  geom_point(aes(color = as.factor(Power)), size = 3) +
  # Add a linear regression line to the scatterplot without confidence interval shading
  geom_smooth(method = "lm", se = FALSE, color = "hotpink") +
  # Add labels and customize plot appearance
  labs(title = "Scatterplot of Observations by Power Level",
       x = "Power (W)",
       y = "Observation",
       color = "Power Level") +
  # Apply a minimal theme for a cleaner look
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The spread is fairly consistent across power levels. For example, at 160W the observations range from 530 to 575 (a span of 45), while at 220W they range from 685 to 725 (a span of 40). Higher power settings therefore yield higher observation values, but wafers tested at the same setting still differ by several tens of units.
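The regression line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit, so its coefficients can be obtained with `lm()`. The sketch below assumes the long-format data are entered directly and uses `fit` as an illustrative name; it is meant as a descriptive summary of the trend, not a formal model.

```r
# Long-format data entered directly: 20 rows, one per wafer
data_long <- data.frame(
  Power = rep(c(160, 180, 200, 220), times = 5),
  Value = c(575, 565, 600, 725,
            542, 593, 651, 700,
            530, 590, 610, 715,
            539, 579, 637, 685,
            570, 610, 629, 710)
)

# Least-squares line of Value on Power, the same line the plot displays
fit <- lm(Value ~ Power, data = data_long)

# Intercept and slope of the fitted line
print(coef(fit))
```

The slope is about 2.53, meaning the fitted line rises by roughly 2.5 observation units per additional watt. The group means do not increase evenly (the step from 200W to 220W is about twice the earlier steps), so the straight line is best treated as a first approximation.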

Interpretation

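Higher power settings yield higher observation values: the mean rises at every step from 160W to 220W. The spread within each power level is similar, with standard deviations between about 15 and 21. It is lowest at 220W, but with five wafers per level this is weak evidence that consistency improves with power. The remaining wafer-to-wafer variability suggests that, while the power setting influences the outcome, other factors might also play a role. The visualizations support these findings, showing a clear upward trend and comparable variability at each level.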