R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed R code chunks anywhere in the document.

On a PC, press Ctrl+Enter to run code from a chunk; on a Mac, press Command+Enter.

Load Packages

In this section, we install and load the necessary packages.
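The chunk for this step does not appear in the rendered output. A minimal sketch of what it might look like is given below, assuming the only packages needed are dplyr and ggplot2 (the two that are loaded in Tasks 2 and 3) and assuming the CRAN cloud mirror is acceptable for installation.

```r
# Packages used later in this lab: dplyr (Task 2) and ggplot2 (Task 3).
pkgs <- c("dplyr", "ggplot2")

# Install only the packages that are not already available.
# Assumption: the CRAN cloud mirror is reachable from this machine.
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Attach the packages for the current session.
library(dplyr)
library(ggplot2)
```

Checking with requireNamespace() first avoids reinstalling packages every time the document is knitted.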

Import Data

In this section, we import the necessary data for this lab.
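The import chunk is also missing from the rendered output, and the name of the data file is not given anywhere in this document. The sketch below therefore assumes a CSV file and uses a temporary file as a stand-in: it writes the first six rows of the data (as printed in Task 1) to disk and reads them back with read.csv(). The names csv_path and quality_sample are hypothetical; in the lab itself the object is called quality and holds all 30 rows.

```r
# Assumption: the lab data are stored as a CSV file. The real file name is
# unknown, so a temporary file stands in for it here.
csv_path <- tempfile(fileext = ".csv")

# The first six rows of the dataset, as printed by head(quality) in Task 1.
writeLines(c(
  "temp,density,rate,am,defect",
  "0.97,32.08,177.7,0,0.2",
  "2.85,21.14,254.1,1,47.9",
  "2.95,20.65,272.6,1,50.9",
  "2.84,22.53,273.4,1,49.7",
  "1.84,27.43,210.8,0,11.0",
  "2.05,25.42,236.1,1,15.6"
), csv_path)

# Read the file into a data frame and confirm the column types.
quality_sample <- read.csv(csv_path)
str(quality_sample)
```

With the real file, only the path passed to read.csv() would change.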

Quality Control Case

Everybody seems to disagree about just why so many parts have to be fixed or thrown away after they are produced. Some say that it’s the temperature of the production process, which needs to be held constant (within a reasonable range). Others claim that it’s clearly the density of the product, and that if we could only produce a heavier material, the problems would disappear. Then there is Ole the site manager, who has been warning everyone forever to take care not to push the equipment beyond its limits. This problem would be the easiest to fix, simply by slowing down the production rate; however, this would increase costs. Unfortunately, rate is the only variable that the manager can control. Interestingly, many of the workers on the morning shift think that the problem is “those inexperienced workers in the afternoon,” who, curiously, feel the same way about the morning workers.

Ever since the factory was automated, with computer network communication and bar code readers at each station, data have been piling up. After taking the MGT585 class, you’ve finally decided to have a look. Your assistant aggregated the data into 4-hour blocks and typed in the AM/PM variable, and you found the following description of the variables:

temp: measures the temperature variability as a standard deviation during the time of measurement

density: indicates the density of the final product

rate: rate of production

am: 1 indicates morning and 0 afternoon

defect: average number of defects per 1000 produced

Do the following tasks and answer the questions below.

Task 1: Explore your data

Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail().

# Explore the dataset using 5 basic functions: dim(), str(), colnames(), head(), and tail()

# View the number of rows and columns
dim(quality)
## [1] 30  5
# See the structure of the dataset (variable names, types, and examples)
str(quality)
## 'data.frame':    30 obs. of  5 variables:
##  $ temp   : num  0.97 2.85 2.95 2.84 1.84 2.05 1.5 2.48 2.23 3.02 ...
##  $ density: num  32.1 21.1 20.6 22.5 27.4 ...
##  $ rate   : num  178 254 273 273 211 ...
##  $ am     : int  0 1 1 1 0 1 0 0 0 1 ...
##  $ defect : num  0.2 47.9 50.9 49.7 11 15.6 5.5 37.4 27.8 58.7 ...
# Display the column names
colnames(quality)
## [1] "temp"    "density" "rate"    "am"      "defect"
# Show the first 6 rows
head(quality)
##   temp density  rate am defect
## 1 0.97   32.08 177.7  0    0.2
## 2 2.85   21.14 254.1  1   47.9
## 3 2.95   20.65 272.6  1   50.9
## 4 2.84   22.53 273.4  1   49.7
## 5 1.84   27.43 210.8  0   11.0
## 6 2.05   25.42 236.1  1   15.6
# Show the last 6 rows
tail(quality)
##    temp density  rate am defect
## 25 2.92   22.50 260.0  1   55.4
## 26 2.44   23.47 236.0  0   36.7
## 27 1.87   26.51 237.3  0   24.5
## 28 1.45   30.70 221.0  1    2.8
## 29 2.82   22.30 253.2  1   60.8
## 30 1.74   28.47 207.9  0   10.5

Question 1: What do we learn about the data?

The initial exploration shows that the dataset contains 30 observations and 5 variables, each describing a different aspect of the production process: temp (temperature variability), density (density of the final product), rate (production rate), am (shift indicator, where 1 is morning and 0 is afternoon), and defect (average number of defects per 1,000 produced). Four variables are stored as num and am as int, so all five are numeric and can be used directly in correlation and regression analysis. Note that am is a 0/1 indicator rather than a measured quantity.

The first and last six rows show noticeable variation in temperature variability (0.97 to 2.95), density (20.65 to 32.08), and rate (177.7 to 273.4), and defects in those rows range from 0.2 to 60.8 per 1,000. This suggests that the production process is inconsistent and that temperature fluctuation, material density, and production speed might each be related to the number of defects. No missing or obviously invalid values appear in the rows shown, so the data look ready for an analysis of which variable is most strongly associated with defects.
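The claims about variable types, ranges, and cleanliness can be checked directly rather than read off the printed rows. The sketch below does this on an illustrative subset, quality_sample (a hypothetical name), built from the 12 rows printed by head() and tail(); running the same calls on the full quality data frame would cover all 30 observations.

```r
# Illustrative subset: the 12 rows printed by head(quality) and tail(quality).
quality_sample <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05, 2.92, 2.44, 1.87, 1.45, 2.82, 1.74),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42, 22.50, 23.47, 26.51, 30.70, 22.30, 28.47),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1, 260.0, 236.0, 237.3, 221.0, 253.2, 207.9),
  am      = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6, 55.4, 36.7, 24.5, 2.8, 60.8, 10.5)
)

# Storage class of each column.
col_classes <- sapply(quality_sample, class)
print(col_classes)

# Number of missing values in each column.
na_counts <- colSums(is.na(quality_sample))
print(na_counts)

# Minimum and maximum of each column.
col_ranges <- sapply(quality_sample, range)
print(col_ranges)

# The shift indicator should contain only 0 and 1.
print(table(quality_sample$am))
```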

Task 2: Run descriptive statistics

Compute the correlation between defect and temp, defect and density, and defect and rate. Feel free to use dplyr functions if needed.

# Task 2: Descriptive Statistics (using dplyr)
library(dplyr)

# Compute correlations between defect and temp, density, rate
quality %>%
  summarise(
    cor_def_temp = cor(defect, temp,    use = "complete.obs"),
    cor_def_dens = cor(defect, density, use = "complete.obs"),
    cor_def_rate = cor(defect, rate,    use = "complete.obs")
  )
##   cor_def_temp cor_def_dens cor_def_rate
## 1    0.9290726    -0.923365    0.8853499

Question 2: What do we learn about the data?

The correlations show strong linear relationships between each production variable and the number of defects. The correlation between defect and temp (0.93) is very high and positive: as temperature variability increases, defects also increase, which suggests that holding temperature stable may be important for quality control. The correlation between defect and density (-0.92) is strongly negative, so denser material is associated with fewer defects. The correlation between defect and rate (0.89) is also strongly positive, so faster production is associated with more defects, possibly because the equipment is being pushed too hard. These are associations only; correlation by itself does not establish which variable causes the defects.

The summary statistics (output not shown here) indicate that the factory averages about 27 defects per 1,000 units, with a standard deviation of around 19, so defect levels vary widely between 4-hour blocks. Temperature variability and rate also vary, with a mean temp of 2.20 and a mean rate of about 236. Overall, higher temperature variability and faster production are associated with more defects, while higher density is associated with fewer, which is consistent with the manager’s concern about pushing the equipment beyond its limits.
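The means, standard deviations, and correlations discussed above can be produced in a few lines of base R. The sketch below runs on quality_sample (a hypothetical name), an illustrative subset made of the 12 rows printed in Task 1, so its numbers will not match the full-data values quoted in the text. It also prints the full correlation matrix, which shows how strongly the predictors are correlated with one another, not only with defect.

```r
# Illustrative subset: the 12 rows printed by head(quality) and tail(quality).
quality_sample <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05, 2.92, 2.44, 1.87, 1.45, 2.82, 1.74),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42, 22.50, 23.47, 26.51, 30.70, 22.30, 28.47),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1, 260.0, 236.0, 237.3, 221.0, 253.2, 207.9),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6, 55.4, 36.7, 24.5, 2.8, 60.8, 10.5)
)

num_vars <- c("defect", "temp", "density", "rate")

# Mean and standard deviation of each variable, one row per variable.
summary_table <- data.frame(
  mean = sapply(quality_sample[num_vars], mean),
  sd   = sapply(quality_sample[num_vars], sd)
)
print(round(summary_table, 2))

# Full correlation matrix: defect against each predictor, and the
# predictors against each other.
cor_matrix <- cor(quality_sample[num_vars])
print(round(cor_matrix, 2))
```

If the predictors turn out to be highly correlated with each other in the full data, their individual effects on defects would be hard to separate in a later multiple regression.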

Task 3: Identify response and predictor and plot the scatter plot

Because we want to learn what variables impact the defect and how they impact the defect, the response (dependent variable) should be defect, and the 3 potential numerical predictors (independent variables) are temp, density, and rate.

Next, use ggplot() from the ggplot2 package to create a scatter plot of the response against one predictor of your choice. Set defect as the y axis and the chosen predictor as the x axis.

# Task 3: Identify response and predictor and plot scatter plot 
library(ggplot2)

# Response variable: defect
# Predictor variable: rate

ggplot(quality, aes(x = rate, y = defect)) +
  geom_point(color = "steelblue", size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(
    title = "Scatter Plot of Defect vs Production Rate",
    x = "Production Rate",
    y = "Average Number of Defects per 1000"
  ) +
  theme_minimal()

Question 3: What does the scatter plot show? Write one line for the relationship between the response and the predictor.

The scatter plot shows a strong positive linear relationship between production rate and defects: as the production rate increases, the average number of defects per 1,000 also increases.

Task 4: Simple Linear Regression

Use the response and the predictor selected in Task 3 to run regression analysis as instructed below.

First, use lm() to run a regression analysis with the predictor as X and the response as Y. Then, use the summary() function to summarize the regression analysis.

# Task 4: Simple Linear Regression
# Response variable: defect
# Predictor variable: rate

# Run linear regression model
model <- lm(defect ~ rate, data = quality)

# Summarize regression results
summary(model)
## 
## Call:
## lm(formula = defect ~ rate, data = quality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3159  -5.1129  -0.7204   7.6170  22.6529 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -128.90616   15.57665  -8.276 5.26e-09 ***
## rate           0.65977    0.06548  10.077 8.13e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.185 on 28 degrees of freedom
## Multiple R-squared:  0.7838, Adjusted R-squared:  0.7761 
## F-statistic: 101.5 on 1 and 28 DF,  p-value: 8.132e-11

Question 4: How do you interpret the results? Interpret (1) the coefficient estimates, (2) the p-value for beta1, (3) R-squared, and (4) the p-value for the F-statistic.

  1. Coefficient Estimates: The fitted regression equation is defect = −128.91 + 0.66 × rate. For each 1-unit increase in the production rate, the predicted number of defects rises by about 0.66 per 1,000 produced. The intercept (−128.91) is the predicted defect value at a rate of 0. It has no practical meaning, because a negative defect count is impossible and a rate of 0 is far below the rates in the rows shown (roughly 178 to 273), but it is needed to position the fitted line.

  2. p-value for β₁ (rate): The p-value for the rate coefficient is 8.13e-11, far below 0.05, so we reject the null hypothesis that β₁ = 0. The positive relationship between production rate and defects is statistically significant.

  3. R-squared (R²): The R² of 0.7838 means that about 78% of the variation in defects is explained by production rate alone. The model fits the data well and rate is a strong predictor of defects, although about 22% of the variation remains unexplained.

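  4. p-value for F-statistic: The F-test p-value (8.132e-11) tests the overall significance of the model. Because it is extremely small, the model as a whole is highly significant. With only one predictor, this test is equivalent to the t-test on β₁, which is why the two p-values match. Statistical significance alone does not prove that rate causes the defects, since temp and density are also strongly correlated with defect.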
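Several of the numbers interpreted above are linked by identities that hold for any simple linear regression. R² equals the squared correlation between response and predictor (0.8853499² ≈ 0.7838), and the F-statistic equals the squared t-value of the slope (10.077² ≈ 101.5). The sketch below checks these identities on quality_sample (a hypothetical name), an illustrative subset made of the 12 rows printed in Task 1, and then evaluates the reported full-data equation at a rate of 250, a value chosen only for illustration.

```r
# Illustrative subset: rate and defect from the 12 rows printed in Task 1.
quality_sample <- data.frame(
  rate   = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1, 260.0, 236.0, 237.3, 221.0, 253.2, 207.9),
  defect = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6, 55.4, 36.7, 24.5, 2.8, 60.8, 10.5)
)

fit <- lm(defect ~ rate, data = quality_sample)
s   <- summary(fit)

# Identity 1: R-squared equals the squared correlation.
r2_from_cor <- cor(quality_sample$defect, quality_sample$rate)^2
print(all.equal(s$r.squared, r2_from_cor))

# Identity 2: the F-statistic equals the squared t-value of the slope.
t_slope <- s$coefficients["rate", "t value"]
f_value <- unname(s$fstatistic["value"])
print(all.equal(f_value, t_slope^2))

# Identity 3: the F-test p-value equals the slope's t-test p-value.
p_f <- unname(pf(s$fstatistic["value"], s$fstatistic["numdf"],
                 s$fstatistic["dendf"], lower.tail = FALSE))
p_t <- s$coefficients["rate", "Pr(>|t|)"]
print(all.equal(p_f, p_t))

# Prediction from the coefficients reported in the full-data output above.
b0 <- -128.90616
b1 <- 0.65977
pred_250 <- b0 + b1 * 250  # about 36 defects per 1,000
print(pred_250)
```

The prediction is meaningful only for rates inside the observed range; extrapolating toward a rate of 0 gives the impossible negative value seen in the intercept.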