Loading the necessary R packages into the working session:
Research objective
The objective of this research is to identify the shape of the relationship between pH and volatile_acidity variables and to estimate the regression model. The analyzed dataset is wine quality available here. We will also, estimate and test the correlation indicators and we will estimate and test the regression parameters of the generated model.
Dataset description
The dataset analyzed in this research, wine quality, is available on the Machine Learning Repository platform here. It consists of two datasets - red_wines and white_wines - both available for research and analysis on the UCI (University of California Irvine) Machine Learning Repository website. These two datasets were merged into a single one (named wines_quality). The dataset contains information about physcal - chemical and sensory properties of red and white wine samples from the Vinho Verde region. It contains 6497 observations and 12 numerical variables. The data were collected over a 3-year period (from May 2004 to February 2007) using a computerized iLab method. After eliminating all duplicates, the final dataset consists of 5320 observations
Variables presentation:
In the data set are available the following variables:
To analysis the relationship between two variables we will use a
graphical representation (scatter plot or correlogram).
Although not many phenomena can be explained by a single independent variable, in the following we will analyse the relationship between two variables, pH and volatile acidity. The graphical representation suggests a linear relationship between analyzed variables. Therefore, we propose a linear regression model with the following general equation:
\[ y = \beta_0 + \beta_1 x + \epsilon\]
where y – is the dependent variable, x – is the independent variable, \(\epsilon\) - represents the error or residual variable. The estimation of the regression coefficients for the simple linear regression model is performed using the least squares method, which minimizes the sum of squared errors. By using lm() function to generate the regression model and summary() function to obtain the model statistics, we will be able to write the estimated regression model as follows.
##
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52602 -0.10488 -0.00899 0.09868 0.69225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.143742 0.004852 647.89 <0.0000000000000002 ***
## volatile_acidity 0.235149 0.012667 18.56 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1554 on 5318 degrees of freedom
## Multiple R-squared: 0.06085, Adjusted R-squared: 0.06068
## F-statistic: 344.6 on 1 and 5318 DF, p-value: < 0.00000000000000022
## (Intercept) volatile_acidity
## 3.143742 0.235149
If we take a look at the image above, we can observa that it provides various information about the estimated model, such as: parameter estimates, their statistical significance, the R-square and the adjusted R-square values, the statistical significance of the estimated model, and some information regarding the residual variable.
The estimated regression model equation is: \[ \text{pH} = 3.143742 + 0.235149 \cdot \text{volatile_acidity}\]
Two parameters appear in the estimated regression model, \(\beta_0\) and \(\beta_1\). The coefficient associated with the constant (is 1 in this case if we write the regression line as a polynomial of degree one: P(x)=\(\beta_0\)\(\cdot\)\(x^0\)+\(\beta_1\)\(\cdot\)\(x^1\), where \(x^0\) is allways one), \(b_0\), indicates the average value of the pH variable ( is 3.143742) when the volatile acidity variable is null. While the estimated parameter, \(b_1\), the regression slope, is associated with the volatile acidity variable indicates with how many units the pH increases if the volatile acidity increases by one unit (1 \(\text{g/dm}^3\)).
Because \(b_1\) \(> 0\) \(\Rightarrow\), it indicates that the relationship between analyzed variables is linear or positive (this means that when volatile acidity increase by one unit (1 \(\text{g/dm}^3\)), pH variable increases by 0.235149). Additionally, the regression slope is sub-unit or less than one (\(b_1\) \(< 1\)), which implais that the rate of change of the dependent variable is slower than that of the independent variable.
If a moment ago a point estimate of the parameters was made, we will now estimate the regression parameters of the model using confidence intervals. A confidence interval is a range of values within which the true value of a parameter is most likely to fall, given a specified level of confidence.
\[ \beta_i \in \left[ b_i \pm t_{\frac{\alpha}{2}, n-k} \cdot s_{\widehat{\beta_i}} \right] \]
The confidance interval of the estimated parameters was determined based on available data using confint() function and is as follows:
Therefore, with a 95% probabilitiy, we can say that the \(\beta_0\) parameter value lies within the range (3.134; 3.153). Additionally, we accept a 5% risk that the \(\beta_1\) parameter value may not fall within the range (0.210; 0.260).
The research question that we gonna answer is if the independent variable, volatile acidity, from the regression model is useful in predicting the pH of the wine. The steps involved in the testing process are:
Hypothesis statement: \[ H_0: \beta_0 = 0 \quad \text{și} \quad H_1: \beta_0 \neq 0 \] \[ H_0: \beta_1 = 0 \quad \text{și} \quad H_1: \beta_1 \neq 0 \]
Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations: \[ t_{\text{calc}} = \frac{b_i}{s_{\widehat{\beta_i}}}, \quad i = \overline{0, 1} \]
Estimating the \(b_0\) coefficient: \[ t_{\text{calc}} = \frac{b_0}{s_{\widehat{\beta_0}}} = \frac{3.143742}{0.004852} = 647.89 \]
Estimating the \(b_1\) coefficient: \[ t_{\text{calc}} = \frac{b_1}{s_{\widehat{\beta_1}}} = \frac{0.235149}{0.012667} = 18.56 \]
Theoretical value of the student test, \(t\): \[ t_{\frac{\alpha}{2}, n-k} = t_{\frac{0.05}{2}, 5320-2} = 1.96 \]
Decision rule: \[ \text{If } \left|t_{\text{calc}}\right| > t_{\frac{\alpha}{2}, n-k}, \quad \text{sig} < \alpha, \quad H_0 \text{ is ruject} \quad \text{with an asummed risk of 5%}. \] \[ \text{If } \left|t_{\text{calc}}\right| \leq t_{\frac{\alpha}{2}, n-k}, \quad \text{sig} \geq \alpha, \quad H_0\text{ is accepted} \quad \text{with a probability of 95%}. \]
Decision:
For \(b_0\): \[ \left|t_{\text{calc}}\right| = 647.89 > t_{\frac{\alpha}{2}, n-k} = 1.96, \quad p\text{-value} = 0.0000000000000002 < \alpha = 0.05 \] \[ \Rightarrow H_0 \text{ is rejected}, \text{ with a probability of 95%}. \]
For \(b_1\):
\[\left|t_{\text{calc}}\right| = 18.56 \> t\_{\frac{\alpha}{2}, n-k} = 1.96, \quad p\text{-value} = 0.0000000000000002 < \alpha = 0.05 \]
\[ \Rightarrow \text{The null hypothesis is rejected } (H_0) \text{ with a probability of 95%}.\]
Finally, with 95% probability, the estimated \(\beta_0\) and \(\beta_1\) parameters are statistically significant.
The steps involved in the testing process are:
Hypothesis statement: \[ H_0: \beta_0, \beta_1, \beta_2 = 0 \] \[ H_1: \text{At least one parameter is statistically significant (in our case } \beta_1 \neq 0 \text{)}. \]
Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations: \[ F_{\text{calc}} = \frac{\text{ESS}}{\text{RSS}} \cdot \frac{(n-k)}{(k-1)} \]
Decision rule:
If \(F_{\text{calc}} > F_{\alpha, k-1, n-k}\), \(\text{sig} < \alpha\), \(H_0\) is rejected, whit an assumed risk of 5%.
If \(F_{\text{calc}} \leq F_{\alpha, k-1, n-k}\), \(\text{sig} \geq \alpha\), \(H_0\) is accepted, with a probability of 95%.
p-value = 0.00000000000000022 (the probability associated with the \(F_{\text{calc}}\) = 344.6) < \(\alpha\) = 0.05 the null hypothesis (\(H_0\)) is ruled out, with a 5% risk.
With a 95% confidence level the regression model is statistically significant, indicating that there is a relationship between the outcome and the predictor variable.
correlation coefficient: In simple linear regression model the correlation coefficient is equal with Pearson correlation coefficient (\({|r|}\) = \({R}\) = \({0.2468}\)), indicating a weak relationship between the outcome (pH) and the predictor variable (volatile acidity).
Coefficient of determination (R-square): \({R^2}\) = 0.06085– 6.085% of the variation in the outcome is explained by the model.
\[ R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \]
Adjusted R-squared: \(\bar{R^2}\) = 0.06068 – 6.068% and has the same interpretation as \({R^2}\).
\[ \bar{R^2} = 1 - \left(1 - R^2\right) \cdot \frac{(n-1)}{(n-k)} = 1 - \frac{\frac{\text{RSS}}{n-k}}{\frac{\text{TSS}}{n-1}} \]
\[ F_{\text{calc}} = \frac{\text{ESS}}{\text{RSS}} \cdot \frac{(n-k)}{(k-1)} = \text{or} \ F_{\text{calc}} = \frac{R^2}{1 - R^2} \cdot \frac{(n-k)}{(k-1)} \]
\[ R = \sqrt{\frac{\text{ESS}}{\text{TSS}}} = \sqrt{1 - \frac{\text{RSS}}{\text{TSS}}} \]
pH | volatile_acidity | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|
3.51 | 0.7Â | 3.31 | 0.202Â | 0.00103Â | 0.155 | 0.000868 | 1.3Â Â Â |
3.2Â | 0.88 | 3.35 | -0.151Â | 0.0021Â Â | 0.155 | 0.000988 | -0.97Â Â |
3.26 | 0.76 | 3.32 | -0.0625 | 0.00134Â | 0.155 | 0.000108 | -0.402Â |
3.16 | 0.28 | 3.21 | -0.0496 | 0.000215 | 0.155 | 1.1e-05Â | -0.319Â |
3.51 | 0.66 | 3.3Â | 0.211Â | 0.000851 | 0.155 | 0.000786 | 1.36Â Â |
3.3Â | 0.6Â | 3.28 | 0.0152 | 0.000623 | 0.155 | 2.97e-06 | 0.0976 |
pH | volatile_acidity | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|
3.06 | 0.74 | 3.32 | -0.258 | 0.00123 | 0.155 | 0.00169 | -1.66 |
From the above graphic, it can be seen that not all points lie exactly on the regression line. The difference between the observed values and the values estimated by the regression model is captured in the residuals. In the figure above, the residual errors (the red-colored segments) clearly show the difference between the empirical and estimated values.
In the regression analysis, the components of variance are:
To obtain ESS, RSS and TSS values we can use aov() function.
The regression model obtained with the lm() function is the argument used by the aov() function:
# model de regresie liniară
aov(model)
## Call:
## aov(formula = model)
##
## Terms:
## volatile_acidity Residuals
## Sum of Squares 8.32564 128.48696
## Deg. of Freedom 1 5318
##
## Residual standard error: 0.1554373
## Estimated effects may be unbalanced
Generating the regression model with the aov() function
# testarea modelului MRLS
model_aov <- aov(pH ~ volatile_acidity, data = bd_sem1)
summary(model_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## volatile_acidity 1 8.33 8.326 344.6 <0.0000000000000002 ***
## Residuals 5318 128.49 0.024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## TSS (Total Sum of Squares): 136.8126
## RSS (Residual Sum of Squares): 128.487
## ESS (Explained Sum of Squares): 8.325637
pH | volatile_acidity | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|
3.51 | 0.7Â | 3.31 | 0.202Â | 0.00103Â | 0.155 | 0.000868 | 1.3Â Â Â |
3.2Â | 0.88 | 3.35 | -0.151Â | 0.0021Â Â | 0.155 | 0.000988 | -0.97Â Â |
3.26 | 0.76 | 3.32 | -0.0625 | 0.00134Â | 0.155 | 0.000108 | -0.402Â |
3.16 | 0.28 | 3.21 | -0.0496 | 0.000215 | 0.155 | 1.1e-05Â | -0.319Â |
3.51 | 0.66 | 3.3Â | 0.211Â | 0.000851 | 0.155 | 0.000786 | 1.36Â Â |
3.3Â | 0.6Â | 3.28 | 0.0152 | 0.000623 | 0.155 | 2.97e-06 | 0.0976 |
pH | volatile_acidity | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|
3.63 | 1.18 | 3.42 | 0.208 | 0.00488 | 0.155 | 0.0044 | 1.34 |
pH | volatile_acidity | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|
3.63 | 1.18 | 3.42 | 0.208 | 0.00488 | 0.155 | 0.0044 | 1.34 |
The above figure clearly shows the residual errors (the red-colored segments), indicating the difference between the empirical values and the values estimated by the model, as well as the residual variance (RSS), explained variance (ESS), and total variance (TSS).
The stochastic and deterministic component of the regression model must respect a series of assumptions:
Hypotheses regarding the deterministic component:
The independent variable, volatile acidity, must not be stochastic;
The independent variable, volatile acidity, has a finite variance
Hypotheses regarding the stochastic component:
The mean of the errors must be equal to zero \[ \text{M}(\varepsilon_i) = 0, \quad \forall i \]
Homoscedasticity assumption: the variance of the errors remains constant \[ \text{V}(\varepsilon_i) = \sigma^2, \quad \forall i \]
The normality assumption for the errors \[ \varepsilon_i \sim N(0, \sigma^2) \]
The assumption of uncorrelated errors: the errors do not influence each other \[ \text{cov}(\varepsilon_i, \varepsilon_j) = 0, \quad \forall i \neq j \]
The hypothesis regarding the absence of correlation between the independent variable and the residual variable \[ \text{cov}(\varepsilon_i, x_i) = 0, \quad \forall i \]
To verify the assumptions of the regression model, we can use diagnostic plots, such as those presented above. Diagnostic plots provide the opportunity for a visual check of the assumptions, including: checking the hypothesis of linearity between the two variables, testing the normality assumption of the errors (Q-Q plot), checking the assumption of homoscedasticity of the errors (scale-location plot), and identifying influential points (Residuals vs. Leverage plot).
The cor() function can be used to estimate the bivariate correlation coefficient:
Point estimation the correlation coefficient and its significance can be obtained with the rcorr() function:
## pH volatile_acidity
## pH NA 0
## volatile_acidity 0 NA
The estimated value of the correlation coefficient is 0.247, indicating that the relationship between the analyzed variables is weak and positive (as volatile acidity increases, the wine pH also increases).
With 95% probability, the value of the correlation coefficient is within the range [0.221, 0.272].
By standardizing the variables, we obtain:
lm.beta(model)
##
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
##
## Standardized Coefficients::
## (Intercept) volatile_acidity
## NA 0.2466867
The correlation coefficient can be calculated using the following relationship:
\[r = b_1 \cdot \frac{s_x}{s_y} = 0.2466867\] sau \[b_1 = r \cdot \frac{s_y}{s_x}\] where: - \({r}\) is Pearson correlation coefficient - \({b_1}\) -regression slope - \({s_x}\) - standard deviation of independent variable - \({s_y}\) - standard deviation of dependent variable
The steps involved in the testing process are:
\[H_0: \rho = 0 \text{ the correlation coefficient value is not statistically significant }\] \[H_0: \rho \neq 0 \text{ the correlation coefficient value is statistically significant}\]
\[\mathbf{t}_{\mathbf{calc}} = \frac{\mathbf{r}}{\sqrt{\frac{(1 - \mathbf{r}^2)}{(n - k)}}}\]
\[\text{If } \left| t_{\text{calc}} \right| < t_{\frac{\alpha}{2}, (n-k)}, \, \text{sig} < \alpha, \, \text{the null hypotesis } (H_0) \text{ is rejected} \text{ with a risk of 5% }\]
\[\text{If } \left| t_{\text{calc}} \right| \geq t_{\frac{\alpha}{2}, (n-k)}, \, \text{sig} \geq \alpha, \, \text{the null hypotesis } (H_0) \text{ is accepted} \, \text{with a probablitity of 95%}\]
p-value = 2.2 \({10^{-16}}\) < \(\alpha\) = 0.05 => H_0 \text{ with a probability of 95%}.
The point and confidence interval estimation of the correlation coefficient can be done with the cor.test() function. Additionally, its statistical significance can also be tested:
##
## Pearson's product-moment correlation
##
## data: bd_sem1$pH and bd_sem1$volatile_acidity
## t = 18.563, df = 5318, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2212809 0.2717578
## sample estimates:
## cor
## 0.2466867
The correlation coefficient value is:
R <- sqrt(summary(model)$r.squared)
cat("R (correlation coefficient):", R, "\n")
## R (correlation coefficient): 0.2466867
In a simple linear regression model, the absolute value of the Pearson correlation coefficient is the same as that of the correlation coefficient (|r| = R).
##
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52602 -0.10488 -0.00899 0.09868 0.69225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.143742 0.004852 647.89 <0.0000000000000002 ***
## volatile_acidity 0.235149 0.012667 18.56 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1554 on 5318 degrees of freedom
## Multiple R-squared: 0.06085, Adjusted R-squared: 0.06068
## F-statistic: 344.6 on 1 and 5318 DF, p-value: < 0.00000000000000022
The steps involved in the testing process are:
Hypothesis statement: \[ H_0: \eta = 0 \] \[ H_1: \eta \neq 0. \]
Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations: \[ F_{\text{calc}} = \frac{\text{ESS}}{\text{RSS}} \cdot \frac{(n-k)}{(k-1)} = 344.6 \]
Decision rule:
If \(F_{\text{calc}} > F_{\alpha, k-1, n-k}\), \(\text{sig} < \alpha\), the null hypothesis (\(H_0\)) is rejected with a risk of 5%.
If \(F_{\text{calc}} \leq F_{\alpha, k-1, n-k}\), \(\text{sig} \geq \alpha\), the null hypothesis (\(H_0\)) is accepted with a probability of 95%.
p-value = 0.00000000000000022 (associated with the \(F_{\text{calc}}\) = 344.6) < \(\alpha\) = 0.05 the null hypothesis (\(H_0\)) is rejected with a risk of 5%.
With 95% probability, the correlation coefficient is statistically significant.
The estimated coefficient of determination or R-square value is:
r <- summary(model)
r$r.squared
## [1] 0.06085432
##
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52602 -0.10488 -0.00899 0.09868 0.69225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.143742 0.004852 647.89 <0.0000000000000002 ***
## volatile_acidity 0.235149 0.012667 18.56 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1554 on 5318 degrees of freedom
## Multiple R-squared: 0.06085, Adjusted R-squared: 0.06068
## F-statistic: 344.6 on 1 and 5318 DF, p-value: < 0.00000000000000022
The coefficient of determination, \({R^2}\) = 0.0609, indicates that 6.085% of the variation in the dependent variable is explained by the regression model.
The steps involved in the testing process are:
Hypothesis statement: \[ H_0: \eta^2 = 0 \] \[ H_1: \eta^2 \neq 0. \]
Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations:\[ F_{\text{calc}} = \frac{\text{ESS}}{\text{RSS}} \cdot \frac{(n-k)}{(k-1)} = 344.6 \]
Decision rule:
If \(F_{\text{calc}} > F_{\alpha, k-1, n-k}\), \(\text{sig} < \alpha\), the null hypothesis (\(H_0\)) is rejected with a 5% risk.
If \(F_{\text{calc}} \leq F_{\alpha, k-1, n-k}\), \(\text{sig} \geq \alpha\), the null hypothesis (\(H_0\)) is accepted with a 95% probability.
p-value = 0.00000000000000022 (associated with \(F_{\text{calc}}\) = 344.6) < \(\alpha\) = 0.05 the null hypothesis (\(H_0\)) is rejected with a 5% risk.
With a 95% confidence level, the value of R-squared is statistically significant.
The value of the adjusted R-square is:
r <- summary(model)
r$adj.r.squared
## [1] 0.06067772
##
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52602 -0.10488 -0.00899 0.09868 0.69225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.143742 0.004852 647.89 <0.0000000000000002 ***
## volatile_acidity 0.235149 0.012667 18.56 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1554 on 5318 degrees of freedom
## Multiple R-squared: 0.06085, Adjusted R-squared: 0.06068
## F-statistic: 344.6 on 1 and 5318 DF, p-value: < 0.00000000000000022
Crt. | Variable | R_value | R_squared | R_squared_adj | F_value | p_value | Significance |
---|---|---|---|---|---|---|---|
1 | volatile_acidity | 0.247 | 0.0609 | 0.0607 | 345 | 1.39e-74 | Semnificativ |
##estimarea si testarea raportului de determinatie
R <- summary(model)
R$r.squared
## [1] 0.06085432
summary(model)$r.squared
## [1] 0.06085432
#estimarea variatiei explicate si residuale
anova(model)
Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|
1 | 8.33 | 8.33Â Â | 345 | 1.39e-74 |
5318 | 128Â Â Â | 0.0242 | Â Â Â Â Â Â Â |
Estimarea raportului de corelatie
sqrt(summary(model)$r.squared)
## [1] 0.2466867
In cazul MRLS coeficientul de corelatie estimat este egal cu raportul de corelatie
## [1] 0.2466867