Use Ctrl+Enter to run a code chunk on a PC, or Command+Enter on a Mac.
In this section, we install and load the necessary packages.
In this section, we import the necessary data for this lab.
Everybody seems to disagree about just why so many parts have to be fixed or thrown away after they are produced. Some say that it’s the temperature of the production process, which needs to be held constant (within a reasonable range). Others claim that it’s clearly the density of the product, and that if we could only produce a heavier material, the problems would disappear. Then there is Ole the site manager, who has been warning everyone forever to take care not to push the equipment beyond its limits. This problem would be the easiest to fix, simply by slowing down the production rate; however, this would increase costs. Unfortunately, rate is the only variable that the manager can control. Interestingly, many of the workers on the morning shift think that the problem is “those inexperienced workers in the afternoon,” who, curiously, feel the same way about the morning workers.
Ever since the factory was automated, with computer network communication and bar code readers at each station, data have been piling up. After taking the MGT585 class, you’ve finally decided to have a look. Your assistant aggregated the data into 4-hour blocks and typed in the AM/PM variable, and you found the following description of the variables:
temp: measures the temperature variability as a standard deviation during the time of measurement
density: indicates the density of the final product
rate: rate of production
am: 1 indicates morning and 0 afternoon
defect: average number of defects per 1000 produced
Do the following tasks and answer the questions below.
Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail().
# Explore the dataset using 5 functions: dim(), str(), colnames(), head() and tail()
dim(quality)
## [1] 30 5
str(quality)
## 'data.frame': 30 obs. of 5 variables:
## $ temp : num 0.97 2.85 2.95 2.84 1.84 2.05 1.5 2.48 2.23 3.02 ...
## $ density: num 32.1 21.1 20.6 22.5 27.4 ...
## $ rate : num 178 254 273 273 211 ...
## $ am : int 0 1 1 1 0 1 0 0 0 1 ...
## $ defect : num 0.2 47.9 50.9 49.7 11 15.6 5.5 37.4 27.8 58.7 ...
colnames(quality)
## [1] "temp" "density" "rate" "am" "defect"
head(quality)
## temp density rate am defect
## 1 0.97 32.08 177.7 0 0.2
## 2 2.85 21.14 254.1 1 47.9
## 3 2.95 20.65 272.6 1 50.9
## 4 2.84 22.53 273.4 1 49.7
## 5 1.84 27.43 210.8 0 11.0
## 6 2.05 25.42 236.1 1 15.6
tail(quality)
## temp density rate am defect
## 25 2.92 22.50 260.0 1 55.4
## 26 2.44 23.47 236.0 0 36.7
## 27 1.87 26.51 237.3 0 24.5
## 28 1.45 30.70 221.0 1 2.8
## 29 2.82 22.30 253.2 1 60.8
## 30 1.74 28.47 207.9 0 10.5
Question 1: what do we learn about the data?
There are 30 rows and 5 columns. Four columns (temp, density, rate, and defect) are numeric; am is an integer indicator that takes the value 1 for morning and 0 for afternoon.
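Because am is stored as an integer, R treats it as a number unless told otherwise. A minimal sketch of checking the column and recoding it as a labelled factor follows. It assumes only the 12 rows printed by head() and tail() above; the names quality_sub and shift and the factor labels are illustrative and not part of the lab.

```r
# The 12 rows printed by head() and tail() (rows 1-6 and 25-30 of the 30).
quality_sub <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05, 2.92, 2.44, 1.87, 1.45, 2.82, 1.74),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42, 22.50, 23.47, 26.51, 30.70, 22.30, 28.47),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1, 260.0, 236.0, 237.3, 221.0, 253.2, 207.9),
  am      = c(0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6, 55.4, 36.7, 24.5, 2.8, 60.8, 10.5)
)

# Confirm that am holds nothing other than 0 and 1.
only_binary <- all(quality_sub$am %in% c(0L, 1L))

# Recode as a factor so that plots and models treat it as a category.
quality_sub$shift <- factor(quality_sub$am,
                            levels = c(0L, 1L),
                            labels = c("afternoon", "morning"))

# Count the 4-hour blocks per shift in this subset.
shift_counts <- table(quality_sub$shift)
print(only_binary)
print(shift_counts)
```

The counts describe only the 12 printed rows, not the full 30-row data set.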
Compute descriptive stats mean and sd (or any other stats you find relevant) for all continuous variables: temp, density, rate, and defect. Feel free to use dplyr functions if needed.
# Descriptive stats for continuous variables
# temp
quality %>% summarise(mean=mean(temp), sd=sd(temp), min=min(temp), max=max(temp))
## mean sd min max
## 1 2.203 0.5834153 0.97 3.02
# density
quality %>% summarise(mean=mean(density), sd=sd(density), min=min(density), max=max(density))
## mean sd min max
## 1 25.28533 3.361424 19.45 32.19
# rate
quality %>% summarise(mean=mean(rate), sd=sd(rate), min=min(rate), max=max(rate))
## mean sd min max
## 1 236.5167 26.05077 177.7 281.9
# defect
quality %>% summarise(mean=mean(defect), sd=sd(defect), min=min(defect), max=max(defect))
## mean sd min max
## 1 27.14 19.41319 0 60.8
# compute correlation between defect and temp, defect and density, defect and rate
quality %>% summarise(correlation=cor(defect, temp))
## correlation
## 1 0.9290726
quality %>% summarise(correlation=cor(defect, density))
## correlation
## 1 -0.923365
quality %>% summarise(correlation=cor(defect, rate))
## correlation
## 1 0.8853499
Question 2: what do we learn about the data?
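The correlation between defect and temp is 0.93, a strong positive relationship: blocks with higher temperature variability tend to have more defects. The correlation between defect and rate is 0.89, also strong and positive: higher production rates go with more defects. The correlation between defect and density is -0.92, a strong negative relationship: a denser product goes with fewer defects. The sign gives the direction of the relationship and the absolute value gives its strength, so a correlation near -1 is as strong as one near 1. Correlation alone does not show which variable, if any, causes the defects.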
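All three correlations can be read off a single correlation matrix, and ranking them by absolute value separates strength from direction. The sketch below assumes only the 12 rows printed by head() and tail() above (quality_sub is an illustrative name), so the values will differ somewhat from the 30-row results.

```r
# The 12 rows printed by head() and tail() (rows 1-6 and 25-30 of the 30).
quality_sub <- data.frame(
  temp    = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05, 2.92, 2.44, 1.87, 1.45, 2.82, 1.74),
  density = c(32.08, 21.14, 20.65, 22.53, 27.43, 25.42, 22.50, 23.47, 26.51, 30.70, 22.30, 28.47),
  rate    = c(177.7, 254.1, 272.6, 273.4, 210.8, 236.1, 260.0, 236.0, 237.3, 221.0, 253.2, 207.9),
  defect  = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6, 55.4, 36.7, 24.5, 2.8, 60.8, 10.5)
)

# Pairwise Pearson correlations between all continuous variables.
cors <- cor(quality_sub)

# Correlation of each predictor with the response.
with_defect <- cors["defect", c("temp", "density", "rate")]
print(round(with_defect, 2))

# Strength is the absolute value; the sign only gives the direction.
strength <- sort(abs(with_defect), decreasing = TRUE)
print(round(strength, 2))
```

The matrix also shows how strongly the predictors correlate with each other, which matters when deciding whether their separate effects can be told apart.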
Identify a response (dependent variable) and numerical predictors (independent variables) from all the variables in the quality data set.
Hint: There is one response (dependent variable) and there are 3 potential (continuous numerical) predictors (independent variables).
Write the response here:
The dependent variable is defect (Y).
Write the predictors here:
The three predictors (X) are the temperature, density, and rate.
Next, use ggplot() from the ggplot2 package to create scatter plots of the response against each predictor, one by one. Put the response on the y axis and the predictor on the x axis.
## scatter plot using ggplot() for all predictors and the response
# scatter plot of response vs predictor 1
ggplot(quality, mapping = aes(x = temp, y = defect)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  ggtitle("Impact of Temperature on Defects") +
  xlab("Temp") +
  ylab("Defects")
## `geom_smooth()` using formula = 'y ~ x'
# scatter plot of response vs predictor 2
ggplot(quality, mapping = aes(x = density, y = defect)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  ggtitle("Impact of Density on Defects") +
  xlab("Density") +
  ylab("Defects")
## `geom_smooth()` using formula = 'y ~ x'
# scatter plot of response vs predictor 3
ggplot(quality, mapping = aes(x = rate, y = defect)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  ggtitle("Impact of Rate on Defects") +
  xlab("Rate") +
  ylab("Defects")
## `geom_smooth()` using formula = 'y ~ x'
Question 3: What do the scatter plots show? Write one line for each pair of response and predictor.
As temp increases, defect increases (positive linear trend).
As density increases, defect decreases (negative linear trend).
As rate increases, defect increases (positive linear trend).
Use the response and the predictors selected in Task 3 to run regression analyses as instructed below.
Task 4.1: First, use lm() to run a regression analysis with predictor 1 as X and the response as Y. Then, use the summary() function to summarize the regression analysis.
# The impact of predictor 1 on the response
reg_temp <- lm(defect ~ temp, data = quality)
summary(reg_temp)
##
## Call:
## lm(formula = defect ~ temp, data = quality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.5937 -4.9138 -0.6179 4.2113 15.1887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.966 5.295 -7.736 1.99e-08 ***
## temp 30.915 2.326 13.291 1.29e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.308 on 28 degrees of freedom
## Multiple R-squared: 0.8632, Adjusted R-squared: 0.8583
## F-statistic: 176.6 on 1 and 28 DF, p-value: 1.29e-13
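Question 4: How do you interpret the results? Interpret (1) the coefficient estimates, (2) the p-value for beta1, (3) R-squared, and (4) the p-value for the F-statistic.
The size of the effect: beta1 = 30.915, so a one-unit increase in temp (the standard deviation of temperature) is associated with about 30.9 more defects per 1000 produced. The intercept, -40.966, is the fitted value at temp = 0, which lies outside the observed range (0.97 to 3.02) and has no practical meaning.
The significance of the effect: the p-value for beta1 is 1.29e-13, far below 0.05, so we reject the null hypothesis that beta1 = 0.
R-squared is 0.8632, so the model explains about 86% of the variance in defect.
The F-statistic is 176.6 with a p-value of 1.29e-13 (< 0.05), so the model as a whole is statistically significant.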
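The four quantities asked about can be pulled out of the summary() object directly instead of being read off the printout. The sketch below assumes only the 12 rows printed by head() and tail() above (quality_sub and fit are illustrative names), so the numbers differ from the 30-row output. It also checks a general property of regression with a single predictor: the F-statistic equals the squared t value of the slope, and the two p-values coincide. The reported output is consistent with this, since 13.291 squared is about 176.65.

```r
# The 12 rows printed by head() and tail() (rows 1-6 and 25-30 of the 30).
quality_sub <- data.frame(
  temp   = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05, 2.92, 2.44, 1.87, 1.45, 2.82, 1.74),
  defect = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6, 55.4, 36.7, 24.5, 2.8, 60.8, 10.5)
)

fit <- lm(defect ~ temp, data = quality_sub)
s   <- summary(fit)

# (1) and (2): coefficient table with estimates and p-values.
coefs   <- coef(s)   # columns: Estimate, Std. Error, t value, Pr(>|t|)
slope   <- coefs["temp", "Estimate"]
t_value <- coefs["temp", "t value"]
p_slope <- coefs["temp", "Pr(>|t|)"]

# (3): share of the variance in defect explained by the model.
r2 <- s$r.squared

# (4): F-statistic and its p-value (summary() prints it but does not store it).
f       <- s$fstatistic   # named vector: value, numdf, dendf
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# With one predictor, F = t^2 and the slope and model p-values agree.
same_stat <- isTRUE(all.equal(t_value^2, f[["value"]], tolerance = 1e-6))
same_p    <- isTRUE(all.equal(p_slope, p_model, tolerance = 1e-6))
print(c(slope = slope, r2 = r2, p_slope = p_slope, p_model = p_model))
```

This is why items (2) and (4) give the same p-value in each single-predictor model; the two tests only diverge once a model has more than one predictor.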
Task 4.2: Then, use the regression model developed in the previous code chunk to predict the response for the mean of the predictor.
## Predict the response at the mean of the predictor
# choosing the new value for predictor 1: the sample mean of temp
temp_new <- data.frame(temp = mean(quality$temp))
# predict using the predict function
predict(reg_temp, newdata = temp_new)
## 1
## 27.14
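For a least-squares line with an intercept, the prediction at the mean of the predictor always equals the mean of the response. With the reported coefficients and the reported mean of temp, -40.966 + 30.915 × 2.203 is about 27.14, which is the mean of defect from the descriptive statistics. The sketch below checks the same property on the 12 rows printed by head() and tail() (quality_sub and at_mean are illustrative names) and asks predict() for a 95% confidence interval as well.

```r
# The 12 rows printed by head() and tail() (rows 1-6 and 25-30 of the 30).
quality_sub <- data.frame(
  temp   = c(0.97, 2.85, 2.95, 2.84, 1.84, 2.05, 2.92, 2.44, 1.87, 1.45, 2.82, 1.74),
  defect = c(0.2, 47.9, 50.9, 49.7, 11.0, 15.6, 55.4, 36.7, 24.5, 2.8, 60.8, 10.5)
)

fit <- lm(defect ~ temp, data = quality_sub)

# New data must use the same column name as the predictor in the formula.
at_mean <- data.frame(temp = mean(quality_sub$temp))

# interval = "confidence" adds lower and upper bounds for the mean response.
pred <- predict(fit, newdata = at_mean, interval = "confidence")
print(pred)

# The fitted value at mean(temp) equals mean(defect) for this subset.
matches_mean <- isTRUE(all.equal(unname(pred[1, "fit"]), mean(quality_sub$defect)))

# Hand check with the coefficients and mean reported for the full data set.
hand_check <- -40.966 + 30.915 * 2.203
```

Values of temp far outside the observed range of 0.97 to 3.02 would still return a number from predict(), but the line has no support there.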
Task 4.3: First, use lm() to run a regression analysis with predictor 2 as X and the response as Y. Then, use the summary() function to summarize the regression analysis.
# The impact of predictor 2 on the response
reg_density <- lm(defect ~ density, data = quality)
summary(reg_density)
##
## Call:
## lm(formula = defect ~ density, data = quality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.802 -4.702 -0.803 4.419 17.740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161.979 10.685 15.16 5.01e-15 ***
## density -5.333 0.419 -12.73 3.68e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.585 on 28 degrees of freedom
## Multiple R-squared: 0.8526, Adjusted R-squared: 0.8473
## F-statistic: 162 on 1 and 28 DF, p-value: 3.677e-13
Question 5: How do you interpret the results? Interpret (1) the coefficient estimates, (2) the p-value for beta1, (3) R-squared, and (4) the p-value for the F-statistic.
The size of the effect: beta1 = -5.333, so a one-unit increase in density is associated with about 5.3 fewer defects per 1000 produced. The intercept, 161.979, is the fitted value at density = 0, which lies outside the observed range (19.45 to 32.19) and has no practical meaning.
The significance of the effect: the p-value for beta1 is 3.68e-13, far below 0.05, so we reject the null hypothesis that beta1 = 0.
R-squared is 0.8526, so the model explains about 85% of the variance in defect.
The F-statistic is 162 with a p-value of 3.677e-13 (< 0.05), so the model as a whole is statistically significant.
Task 4.4: Then, use the regression model developed in the previous code chunk to predict the response for the mean of the predictor.
## Predict the response at the mean of the predictor
# choosing the new value for predictor 2: the sample mean of density
density_new <- data.frame(density = mean(quality$density))
# predict using the predict function
predict(reg_density, newdata = density_new)
## 1
## 27.14
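Predictions from this line are only meaningful inside the observed range of density, which the descriptive statistics give as 19.45 to 32.19. The sketch below uses the rounded coefficients reported in the summary output rather than refitting the model. The helper names predict_defect and inside_range and the candidate value 100 are hypothetical, chosen to show what happens far outside that range.

```r
# Rounded coefficients and observed range, copied from the output above.
b0 <- 161.979
b1 <- -5.333
density_range <- c(19.45, 32.19)

# Fitted line: defect = b0 + b1 * density.
predict_defect <- function(density) b0 + b1 * density

# TRUE when a value lies inside the range the model was fitted on.
inside_range <- function(x, rng) x >= rng[1] & x <= rng[2]

# Observed minimum, sample mean, and a hypothetical value far outside the data.
candidates <- c(19.45, 25.28533, 100)

check <- data.frame(
  density   = candidates,
  predicted = round(predict_defect(candidates), 2),
  in_range  = inside_range(candidates, density_range)
)
print(check)
```

At a density of 100 the line returns a large negative defect count, which is impossible for a count per 1000 produced. At the sample mean it returns roughly the mean of defect, as expected for a least-squares line; the small gap comes from using rounded coefficients.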
Task 4.5: First, use lm() to run a regression analysis with predictor 3 as X and the response as Y. Then, use the summary() function to summarize the regression analysis.
# The impact of predictor 3 on the response
reg_rate <- lm(defect ~ rate, data = quality)
summary(reg_rate)
##
## Call:
## lm(formula = defect ~ rate, data = quality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3159 -5.1129 -0.7204 7.6170 22.6529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -128.90616 15.57665 -8.276 5.26e-09 ***
## rate 0.65977 0.06548 10.077 8.13e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.185 on 28 degrees of freedom
## Multiple R-squared: 0.7838, Adjusted R-squared: 0.7761
## F-statistic: 101.5 on 1 and 28 DF, p-value: 8.132e-11
Question 6: How do you interpret the results? Interpret (1) the coefficient estimates, (2) the p-value for beta1, (3) R-squared, and (4) the p-value for the F-statistic.
The size of the effect: beta1 = 0.66, so a one-unit increase in rate is associated with about 0.66 more defects per 1000 produced. The intercept, -128.906, is the fitted value at rate = 0, which lies outside the observed range (177.7 to 281.9) and has no practical meaning.
The significance of the effect: the p-value for beta1 is 8.13e-11, far below 0.05, so we reject the null hypothesis that beta1 = 0.
R-squared is 0.7838, so the model explains about 78% of the variance in defect.
The F-statistic is 101.5 with a p-value of 8.132e-11 (< 0.05), so the model as a whole is statistically significant.
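The three slopes cannot be compared directly because the predictors are measured in different units. One way to put them on a common scale is the standardized slope, beta1 × sd(x) / sd(y), which in a single-predictor regression equals the correlation between x and y. The sketch below collects the values reported in the outputs above into one table; the name models and its column names are illustrative.

```r
# Values copied from the summary() and descriptive-statistics output above.
sd_defect <- 19.41319

models <- data.frame(
  predictor = c("temp", "density", "rate"),
  slope     = c(30.915, -5.333, 0.65977),
  sd_x      = c(0.5834153, 3.361424, 26.05077),
  r_squared = c(0.8632, 0.8526, 0.7838),
  rse       = c(7.308, 7.585, 9.185)   # residual standard error
)

# Standardized slope: change in defect (in SDs) per one-SD change in x.
models$std_slope <- models$slope * models$sd_x / sd_defect

# Rank the single-predictor models from best to worst fit.
ranked <- models[order(-models$r_squared), ]
print(ranked)
```

The standardized slopes reproduce the correlations computed earlier (about 0.93, -0.92, and 0.89). By R-squared and residual standard error, temp fits best and rate worst, although all three single-predictor models explain most of the variance in defect.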
Task 4.6: Then, use the regression model developed in the previous code chunk to predict the response for the mean of the predictor.
## Predict the response at the mean of the predictor
# choosing the new value for predictor 3: the sample mean of rate
rate_new <- data.frame(rate = mean(quality$rate))
# predict using the predict function
predict(reg_rate, newdata = rate_new)
## 1
## 27.14
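Since rate is the one variable the manager can control, the fitted line can be used to compare a few production rates. The sketch below uses the coefficients reported in the summary output. The candidate rates are hypothetical settings chosen inside the observed range of 177.7 to 281.9. It is a single-predictor model that ignores temp and density, so the numbers describe an association and may overstate what slowing down alone would achieve.

```r
# Coefficients copied from the summary(reg_rate) output above.
b0 <- -128.90616
b1 <- 0.65977

# Fitted line: defect = b0 + b1 * rate.
predict_defect <- function(rate) b0 + b1 * rate

# Hypothetical rate settings, all inside the observed range 177.7 to 281.9.
rates <- c(200, 220, 240, 260)

scenarios <- data.frame(
  rate             = rates,
  predicted_defect = round(predict_defect(rates), 1)
)
print(scenarios)

# Predicted change in defects per 1000 from slowing production by 10 units.
change_per_10 <- b1 * -10
print(change_per_10)
```

Each 10-unit reduction in rate lowers the predicted defects by about 6.6 per 1000. The same line gives a negative count at a rate of 100, so values well below the observed minimum should not be read as forecasts.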