R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Use Ctrl+Enter to run a code chunk on a PC, or Command+Enter on a Mac.

Load Packages

In this section, we install and load the necessary packages.
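
The code chunk for this section does not appear in the rendered output. A minimal sketch of what it might contain, assuming the lab needs only dplyr (for `%>%` and `summarise()`) and ggplot2 (for the scatter plots), since those are the packages the later code calls. The package list and the CRAN mirror URL are assumptions:

```r
# Packages called later in this lab (an assumption based on the functions used below)
required_packages <- c("dplyr", "ggplot2")

# Find the packages that are not yet installed
missing_packages <- required_packages[
  !vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
]

# Install only what is missing (the mirror URL is an assumption)
if (length(missing_packages) > 0) {
  install.packages(missing_packages, repos = "https://cloud.r-project.org")
}

# Attach the packages so %>%, summarise() and ggplot() can be called directly
for (pkg in required_packages) {
  library(pkg, character.only = TRUE)
}
```

Installing only the missing packages keeps repeated knits fast, because nothing is downloaded once the packages are present.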

Import Data

In this section, we import the necessary data for this lab.
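
The import chunk is also hidden in the rendered output, so the actual file name and format are not known. A sketch, assuming the data arrive as a CSV file with a header row: the code writes the six rows that `head(quality)` prints later in this lab to a temporary file (a stand-in for the real data file) and reads them back with `read.csv()`. The names `csv_path` and `quality_demo` are illustrative only.

```r
# Stand-in for the real data file: a temporary CSV holding the six head() rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "temp,density,rate,am,defect",
  "0.97,32.08,177.7,0,0.2",
  "2.85,21.14,254.1,1,47.9",
  "2.95,20.65,272.6,1,50.9",
  "2.84,22.53,273.4,1,49.7",
  "1.84,27.43,210.8,0,11.0",
  "2.05,25.42,236.1,1,15.6"
), csv_path)

# read.csv() uses the first line as column names and infers column types
quality_demo <- read.csv(csv_path)
str(quality_demo)
```

With the real file, only the path passed to `read.csv()` would change.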

Quality Control Case

Everybody seems to disagree about just why so many parts have to be fixed or thrown away after they are produced. Some say that it’s the temperature of the production process, which needs to be held constant (within a reasonable range). Others claim that it’s clearly the density of the product, and that if we could only produce a heavier material, the problems would disappear. Then there is Ole the site manager, who has been warning everyone forever to take care not to push the equipment beyond its limits. This problem would be the easiest to fix, simply by slowing down the production rate; however, this would increase costs. Unfortunately, rate is the only variable that the manager can control. Interestingly, many of the workers on the morning shift think that the problem is “those inexperienced workers in the afternoon,” who, curiously, feel the same way about the morning workers.

Ever since the factory was automated, with computer network communication and bar code readers at each station, data have been piling up. After taking the MGT585 class, you’ve finally decided to have a look. Your assistant aggregated the data by 4-hour blocks and then typed in the AM/PM variable. You found the following description of the variables:

temp: measures the temperature variability as a standard deviation during the time of measurement

density: indicates the density of the final product

rate: rate of production

am: 1 indicates morning and 0 afternoon

defect: average number of defects per 1000 produced

Do the following tasks and answer the questions below.

Task 1: Explore your data

Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail().

# Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail()

dim(quality)
## [1] 30  5
str(quality)
## 'data.frame':    30 obs. of  5 variables:
##  $ temp   : num  0.97 2.85 2.95 2.84 1.84 2.05 1.5 2.48 2.23 3.02 ...
##  $ density: num  32.1 21.1 20.6 22.5 27.4 ...
##  $ rate   : num  178 254 273 273 211 ...
##  $ am     : int  0 1 1 1 0 1 0 0 0 1 ...
##  $ defect : num  0.2 47.9 50.9 49.7 11 15.6 5.5 37.4 27.8 58.7 ...
colnames(quality)
## [1] "temp"    "density" "rate"    "am"      "defect"
head(quality)
##   temp density  rate am defect
## 1 0.97   32.08 177.7  0    0.2
## 2 2.85   21.14 254.1  1   47.9
## 3 2.95   20.65 272.6  1   50.9
## 4 2.84   22.53 273.4  1   49.7
## 5 1.84   27.43 210.8  0   11.0
## 6 2.05   25.42 236.1  1   15.6
tail(quality)
##    temp density  rate am defect
## 25 2.92   22.50 260.0  1   55.4
## 26 2.44   23.47 236.0  0   36.7
## 27 1.87   26.51 237.3  0   24.5
## 28 1.45   30.70 221.0  1    2.8
## 29 2.82   22.30 253.2  1   60.8
## 30 1.74   28.47 207.9  0   10.5

Question 1: what do we learn about the data?

There are 30 rows (one per 4-hour block) and 5 columns. Four columns are numeric (temp, density, rate, and defect); am is an integer indicator that takes the value 1 or 0.

Task 2: Run descriptive statistics

Compute descriptive stats mean and sd (or any other stats you find relevant) for all continuous variables: temp, density, rate, and defect. Feel free to use dplyr functions if needed.

# Descriptive stats for continuous variables

# temp

quality  %>% summarise(mean=mean(temp), sd=sd(temp), min=min(temp), max=max(temp))
##    mean        sd  min  max
## 1 2.203 0.5834153 0.97 3.02
# density

quality  %>% summarise(mean=mean(density), sd=sd(density), min=min(density), max=max(density))
##       mean       sd   min   max
## 1 25.28533 3.361424 19.45 32.19
# rate

quality  %>% summarise(mean=mean(rate), sd=sd(rate), min=min(rate), max=max(rate))
##       mean       sd   min   max
## 1 236.5167 26.05077 177.7 281.9
# defect

quality  %>% summarise(mean=mean(defect), sd=sd(defect), min=min(defect), max=max(defect))
##    mean       sd min  max
## 1 27.14 19.41319   0 60.8
# compute correlation between defect and temp, defect and density, defect and rate

quality %>% summarise(correlation=cor(defect, temp))
##   correlation
## 1   0.9290726
quality %>% summarise(correlation=cor(defect, density))
##   correlation
## 1   -0.923365
quality %>% summarise(correlation=cor(defect, rate))
##   correlation
## 1   0.8853499
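
The four `summarise()` calls above repeat the same pattern. One possible way to compute the statistics in a single call, assuming dplyr 1.0 or later (for `across()`). The data frame `quality_demo` is illustrative and holds only the six rows printed by `head(quality)`, so its numbers differ from the full-data results above.

```r
library(dplyr)

# Illustrative subset: the six rows printed by head(quality)
quality_demo <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1),
  am      = c(0L, 1L, 1L, 1L, 0L, 1L),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6)
)

# Mean and sd of every continuous variable in one call;
# result columns are named <variable>_<statistic>, e.g. temp_mean
demo_stats <- quality_demo %>%
  summarise(across(c(temp, density, rate, defect),
                   list(mean = mean, sd = sd)))
demo_stats
```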

Question 2: what do we learn about the data?

The correlation between defect and temp is 0.93, a strong positive relationship: blocks with higher temperature variability tend to have more defects. The correlation between defect and rate is 0.89, also strong and positive: higher production rates go with more defects. The correlation between defect and density is -0.92, a strong negative relationship: denser product goes with fewer defects. The sign of a correlation gives the direction of the relationship, and its absolute value gives the strength.
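
All pairwise correlations can also be computed at once with `cor()` from base R. A sketch on the six rows printed by `head(quality)` (the illustrative data frame `quality_demo`), so the values differ from the full-data correlations reported above:

```r
# Illustrative subset: the six rows printed by head(quality)
quality_demo <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1),
  am      = c(0L, 1L, 1L, 1L, 0L, 1L),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6)
)

# Correlation matrix of the continuous variables
continuous_vars <- c("temp", "density", "rate", "defect")
cor_matrix <- cor(quality_demo[, continuous_vars])
round(cor_matrix, 2)

# Strength is judged by absolute value:
# a correlation near -1 is as strong as one near +1
abs(cor_matrix["defect", c("temp", "density", "rate")])
```

The matrix also shows how the predictors relate to one another, which the three separate `cor()` calls do not.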

Task 3: Identify response and predictors and plot the scatter plots

Identify a response (dependent variable) and numerical predictors (independent variables) from all the variables in the quality data set.

Hint: There are one response (dependent variable) and 3 potential (continuous numerical) predictors (independent variables).

Write the response here:

The dependent variable is defect (Y).

Write the predictors here:

The three predictors (X) are the temperature, density, and rate.

Next, use ggplot() from ggplot2 package to create scatter plots for the response and the predictors one by one. You need to set the response as the y axis and the predictor as the x axis.

## scatter plot using ggplot() for all predictors and the response

# scatter plot of response vs predictor 1

ggplot(quality, mapping = aes(x = temp, y = defect)) + 
  geom_point() + 
  geom_smooth(method = "lm",se = FALSE, colour = "blue") + 
  ggtitle("Impact of Temperature on Defects") + 
  xlab("Temp") +
  ylab("Defects")
## `geom_smooth()` using formula = 'y ~ x'

# scatter plot of response vs predictor 2

ggplot(quality, mapping = aes(x = density, y = defect)) + 
  geom_point() + 
  geom_smooth(method = "lm",se = FALSE, colour = "blue") + 
  ggtitle("Impact of Density on Defects") + 
  xlab("Density") +
  ylab("Defects")
## `geom_smooth()` using formula = 'y ~ x'

# scatter plot of response vs predictor 3

ggplot(quality, mapping = aes(x = rate, y = defect)) + 
  geom_point() + 
  geom_smooth(method = "lm",se = FALSE, colour = "blue") + 
  ggtitle("Impact of Rate on Defects") + 
  xlab("Rate") +
  ylab("Defects")
## `geom_smooth()` using formula = 'y ~ x'
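
The three plotting chunks differ only in the predictor. One possible way to avoid the repetition is a small helper function; `plot_defect_vs()` is a hypothetical name introduced here, and `quality_demo` holds only the six rows printed by `head(quality)`. The sketch assumes a ggplot2 version that supports the `.data` pronoun inside `aes()`.

```r
library(ggplot2)

# Illustrative subset: the six rows printed by head(quality)
quality_demo <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1),
  am      = c(0L, 1L, 1L, 1L, 0L, 1L),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6)
)

# Hypothetical helper: scatter plot of defect against one predictor,
# with a fitted least-squares line
plot_defect_vs <- function(data, predictor) {
  ggplot(data, aes(x = .data[[predictor]], y = defect)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "blue") +
    labs(title = paste("Defects vs.", predictor), x = predictor, y = "Defects")
}

# One plot per predictor, stored in a list
plots <- lapply(c("temp", "density", "rate"), plot_defect_vs, data = quality_demo)
plots[[1]]  # print the first plot; plots[[2]] and plots[[3]] hold the others
```

Passing `formula = y ~ x` explicitly also suppresses the `geom_smooth()` message shown in the output above.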

Question 3: What do the scatter plots show? Write one line for each pair of response and predictor.

As temp increases, the defect also increases.

As density increases, the defect decreases.

As rate increases, the defect also increases.

Task 4: Simple Linear Regression

Use the response and the predictors selected in Task 3 to run regression analyses as instructed below.

Task 4.1: First, use lm() to run a regression analysis with predictor 1 as X and the response as Y. Then, use the function summary() to summarize the regression analysis.

# The impact of predictor 1 on the response

reg_temp <- lm(defect ~ temp, data = quality)

summary(reg_temp)
## 
## Call:
## lm(formula = defect ~ temp, data = quality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5937  -4.9138  -0.6179   4.2113  15.1887 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -40.966      5.295  -7.736 1.99e-08 ***
## temp          30.915      2.326  13.291 1.29e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.308 on 28 degrees of freedom
## Multiple R-squared:  0.8632, Adjusted R-squared:  0.8583 
## F-statistic: 176.6 on 1 and 28 DF,  p-value: 1.29e-13

Question 4: How do you interpret the results? Interpret (1) the coefficient estimates, (2) the p-value for beta1, (3) R-squared, and (4) the p-value for the F-statistic.

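The size of the effect: beta1 = 30.915. A one-unit increase in temp is associated with about 31 more defects per 1000 produced. The intercept, beta0 = -40.966, is the predicted defect value at temp = 0, which lies outside the observed range (0.97 to 3.02) and has no practical meaning.

The significance of the effect: the p-value for beta1 is 1.29e-13, far below 0.05, so we can reject the null hypothesis that beta1 = 0.

The R-squared is high at 86%, which means the model explains 86% of the variance in defect.

The F-statistic is 176.6 and its p-value (1.29e-13) is below 0.05, so the model as a whole is statistically significant.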
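
The numbers interpreted above can also be pulled out of the fitted model directly, which makes the difference between the F-statistic and its p-value explicit. A sketch using base R only; `quality_demo` holds just the six rows printed by `head(quality)`, so the values differ from the full-data output, and the object names are illustrative.

```r
# Illustrative subset: the six rows printed by head(quality)
quality_demo <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1),
  am      = c(0L, 1L, 1L, 1L, 0L, 1L),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6)
)

demo_fit <- lm(defect ~ temp, data = quality_demo)
demo_summary <- summary(demo_fit)

coef(demo_summary)      # estimates, standard errors, t values, p-values
demo_summary$r.squared  # share of the variance in defect explained by temp

# The F-statistic and its p-value are two different numbers:
# summary() stores the statistic with its degrees of freedom,
# and pf() converts it into the upper-tail p-value
f_stat <- demo_summary$fstatistic
f_p_value <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                lower.tail = FALSE)
unname(f_p_value)
```

With a single predictor, the p-value of the F-test equals the p-value of the t-test for beta1, which is why the two match in the `summary()` output.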

Task 4.2: Then, use the regression model developed in the previous code chunk to predict the response for the mean of the predictor.

# Predict the response at the mean of the predictor
# choosing the new value for predictor 1: the sample mean of temp

temp_new <- data.frame(temp = mean(quality$temp))

# predict using the predict function

predict(reg_temp, newdata = temp_new)
##     1 
## 27.14
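
Task 4.2 asks for a prediction at the mean of the predictor. For a least-squares line with an intercept, that prediction equals the mean of the response, and values far outside the observed range of temp (0.97 to 3.02) are extrapolations that the data cannot support. A sketch of both points, including a 95% prediction interval; `quality_demo` holds only the six rows printed by `head(quality)`, and the object names are illustrative.

```r
# Illustrative subset: the six rows printed by head(quality)
quality_demo <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1),
  am      = c(0L, 1L, 1L, 1L, 0L, 1L),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6)
)

demo_fit <- lm(defect ~ temp, data = quality_demo)

# New data: the sample mean of the predictor
temp_mean <- data.frame(temp = mean(quality_demo$temp))

# Point prediction (fit) with a 95% prediction interval (lwr, upr)
pred_at_mean <- predict(demo_fit, newdata = temp_mean, interval = "prediction")
pred_at_mean

# For a least-squares line with an intercept, the fit at mean(x) equals mean(y)
all.equal(unname(pred_at_mean[, "fit"]), mean(quality_demo$defect))

# Check whether a candidate value lies inside the observed range before predicting
temp_candidate <- 100
temp_candidate >= min(quality_demo$temp) && temp_candidate <= max(quality_demo$temp)
```

The interval is usually more informative than the point prediction alone, because it shows how much an individual 4-hour block could deviate from the fitted line.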

Task 4.3: First, use lm() to run a regression analysis with predictor 2 as X and the response as Y. Then, use the function summary() to summarize the regression analysis.

# The impact of predictor 2 on the response

reg_density <- lm(defect ~ density, data = quality)

summary(reg_density)
## 
## Call:
## lm(formula = defect ~ density, data = quality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.802  -4.702  -0.803   4.419  17.740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  161.979     10.685   15.16 5.01e-15 ***
## density       -5.333      0.419  -12.73 3.68e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.585 on 28 degrees of freedom
## Multiple R-squared:  0.8526, Adjusted R-squared:  0.8473 
## F-statistic:   162 on 1 and 28 DF,  p-value: 3.677e-13

Question 5: How do you interpret the results? Interpret (1) the coefficient estimates, (2) the p-value for beta1, (3) R-squared, and (4) the p-value for the F-statistic.

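The size of the effect: beta1 = -5.333. A one-unit increase in density is associated with about 5.3 fewer defects per 1000 produced. The intercept, beta0 = 161.979, is the predicted defect value at density = 0, which lies outside the observed range (19.45 to 32.19) and has no practical meaning.

The significance of the effect: the p-value for beta1 is 3.68e-13, far below 0.05, so we can reject the null hypothesis that beta1 = 0.

The R-squared is high at 85%, which means the model explains 85% of the variance in defect.

The F-statistic is 162 and its p-value (3.677e-13) is below 0.05, so the model as a whole is statistically significant.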

Task 4.4: Then, use the regression model developed in the previous code chunk to predict the response for the mean of the predictor.

# Predict the response at the mean of the predictor
# choosing the new value for predictor 2: the sample mean of density
density_new <- data.frame(density = mean(quality$density))

# predict using the predict function
predict(reg_density, newdata = density_new)
##     1 
## 27.14

Task 4.5: First, use lm() to run a regression analysis with predictor 3 as X and the response as Y. Then, use the function summary() to summarize the regression analysis.

# The impact of predictor 3 on the response

reg_rate <- lm(defect ~ rate, data = quality)

summary(reg_rate)
## 
## Call:
## lm(formula = defect ~ rate, data = quality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3159  -5.1129  -0.7204   7.6170  22.6529 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -128.90616   15.57665  -8.276 5.26e-09 ***
## rate           0.65977    0.06548  10.077 8.13e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.185 on 28 degrees of freedom
## Multiple R-squared:  0.7838, Adjusted R-squared:  0.7761 
## F-statistic: 101.5 on 1 and 28 DF,  p-value: 8.132e-11

Question 6: How do you interpret the results? Interpret (1) the coefficient estimates, (2) the p-value for beta1, (3) R-squared, and (4) the p-value for the F-statistic.

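The size of the effect: beta1 = 0.660. A one-unit increase in rate is associated with about 0.66 more defects per 1000 produced. The intercept, beta0 = -128.906, is the predicted defect value at rate = 0, which lies outside the observed range (177.7 to 281.9) and has no practical meaning.

The significance of the effect: the p-value for beta1 is 8.13e-11, far below 0.05, so we can reject the null hypothesis that beta1 = 0.

The R-squared is fairly high at 78%, which means the model explains 78% of the variance in defect. This is the lowest of the three simple regressions.

The F-statistic is 101.5 and its p-value (8.132e-11) is below 0.05, so the model as a whole is statistically significant.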
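
Because rate is the one variable the manager can control, the slope can be turned into a rough what-if figure. A worked example using the coefficient reported in the `summary(reg_rate)` output above; the 10-unit slowdown is a hypothetical value chosen for illustration, and the result describes an association from a one-predictor model rather than a guaranteed effect.

```r
# Slope for rate, copied from the summary(reg_rate) output
beta1_rate <- 0.65977

# Hypothetical slowdown, in units of rate (an assumption for illustration)
rate_reduction <- 10

# Expected change in defects per 1000 produced along the fitted line
expected_change <- -beta1_rate * rate_reduction
expected_change
```

Under these assumptions, slowing production by 10 units corresponds to roughly 6.6 fewer defects per 1000 produced.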

Task 4.6: Then, use the regression model developed in the previous code chunk to predict the response for the mean of the predictor.

# Predict the response at the mean of the predictor
# choosing the new value for predictor 3: the sample mean of rate

rate_new <- data.frame(rate = mean(quality$rate))

# predict using the predict function

predict(reg_rate, newdata = rate_new)
##     1 
## 27.14