1 Simple linear regression model

Loading the necessary R packages into the working session:

Research objective

The objective of this research is to identify the shape of the relationship between pH and volatile_acidity variables and to estimate the regression model. The analyzed dataset is wine quality available here. We will also, estimate and test the correlation indicators and we will estimate and test the regression parameters of the generated model.

Dataset description

The dataset analyzed in this research, wine quality, is available on the Machine Learning Repository platform here. It consists of two datasets - red_wines and white_wines - both available for research and analysis on the UCI (University of California Irvine) Machine Learning Repository website. These two datasets were merged into a single one (named wines_quality). The dataset contains information about physcal - chemical and sensory properties of red and white wine samples from the Vinho Verde region. It contains 6497 observations and 12 numerical variables. The data were collected over a 3-year period (from May 2004 to February 2007) using a computerized iLab method. After eliminating all duplicates, the final dataset consists of 5320 observations

Variables presentation:

In the data set are available the following variables:

2 Analysis of the relationship between variables

To analysis the relationship between two variables we will use a graphical representation (scatter plot or correlogram).

3 Simple linear regression model

3.1 General equation of MRLS

Although not many phenomena can be explained by a single independent variable, in the following we will analyse the relationship between two variables, pH and volatile acidity. The graphical representation suggests a linear relationship between analyzed variables. Therefore, we propose a linear regression model with the following general equation:

\[ y = \beta_0 + \beta_1 x + \epsilon\]

where y – is the dependent variable, x – is the independent variable, \(\epsilon\) - represents the error or residual variable. The estimation of the regression coefficients for the simple linear regression model is performed using the least squares method, which minimizes the sum of squared errors. By using lm() function to generate the regression model and summary() function to obtain the model statistics, we will be able to write the estimated regression model as follows.

3.2 Parameters estimation of MRLS

## 
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52602 -0.10488 -0.00899  0.09868  0.69225 
## 
## Coefficients:
##                  Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      3.143742   0.004852  647.89 <0.0000000000000002 ***
## volatile_acidity 0.235149   0.012667   18.56 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1554 on 5318 degrees of freedom
## Multiple R-squared:  0.06085,    Adjusted R-squared:  0.06068 
## F-statistic: 344.6 on 1 and 5318 DF,  p-value: < 0.00000000000000022

##      (Intercept) volatile_acidity 
##         3.143742         0.235149

If we take a look at the image above, we can observa that it provides various information about the estimated model, such as: parameter estimates, their statistical significance, the R-square and the adjusted R-square values, the statistical significance of the estimated model, and some information regarding the residual variable.

The estimated regression model equation is: \[ \text{pH} = 3.143742 + 0.235149 \cdot \text{volatile_acidity}\]

Two parameters appear in the estimated regression model, \(\beta_0\) and \(\beta_1\). The coefficient associated with the constant (is 1 in this case if we write the regression line as a polynomial of degree one: P(x)=\(\beta_0\)\(\cdot\)\(x^0\)+\(\beta_1\)\(\cdot\)\(x^1\), where \(x^0\) is allways one), \(b_0\), indicates the average value of the pH variable ( is 3.143742) when the volatile acidity variable is null. While the estimated parameter, \(b_1\), the regression slope, is associated with the volatile acidity variable indicates with how many units the pH increases if the volatile acidity increases by one unit (1 \(\text{g/dm}^3\)).

Because \(b_1\) \(> 0\) \(\Rightarrow\), it indicates that the relationship between analyzed variables is linear or positive (this means that when volatile acidity increase by one unit (1 \(\text{g/dm}^3\)), pH variable increases by 0.235149). Additionally, the regression slope is sub-unit or less than one (\(b_1\) \(< 1\)), which implais that the rate of change of the dependent variable is slower than that of the independent variable.

3.3 The Estimated Confidence Interval for the regression parameter

If a moment ago a point estimate of the parameters was made, we will now estimate the regression parameters of the model using confidence intervals. A confidence interval is a range of values within which the true value of a parameter is most likely to fall, given a specified level of confidence.

\[ \beta_i \in \left[ b_i \pm t_{\frac{\alpha}{2}, n-k} \cdot s_{\widehat{\beta_i}} \right] \]

The confidance interval of the estimated parameters was determined based on available data using confint() function and is as follows:

Therefore, with a 95% probabilitiy, we can say that the \(\beta_0\) parameter value lies within the range (3.134; 3.153). Additionally, we accept a 5% risk that the \(\beta_1\) parameter value may not fall within the range (0.210; 0.260).

3.4 MRLS Parameter Tests

The research question that we gonna answer is if the independent variable, volatile acidity, from the regression model is useful in predicting the pH of the wine. The steps involved in the testing process are:

Hypothesis statement: \[ H_0: \beta_0 = 0 \quad \text{și} \quad H_1: \beta_0 \neq 0 \] \[ H_0: \beta_1 = 0 \quad \text{și} \quad H_1: \beta_1 \neq 0 \]
Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations: \[ t_{\text{calc}} = \frac{b_i}{s_{\widehat{\beta_i}}}, \quad i = \overline{0, 1} \]

Estimating the \(b_0\) coefficient: \[ t_{\text{calc}} = \frac{b_0}{s_{\widehat{\beta_0}}} = \frac{3.143742}{0.004852} = 647.89 \]
Estimating the \(b_1\) coefficient: \[ t_{\text{calc}} = \frac{b_1}{s_{\widehat{\beta_1}}} = \frac{0.235149}{0.012667} = 18.56 \]

Theoretical value of the student test, \(t\): \[ t_{\frac{\alpha}{2}, n-k} = t_{\frac{0.05}{2}, 5320-2} = 1.96 \]

Decision rule: \[ \text{If } \left|t_{\text{calc}}\right| > t_{\frac{\alpha}{2}, n-k}, \quad \text{sig} < \alpha, \quad H_0 \text{ is ruject} \quad \text{with an asummed risk of 5%}. \] \[ \text{If } \left|t_{\text{calc}}\right| \leq t_{\frac{\alpha}{2}, n-k}, \quad \text{sig} \geq \alpha, \quad H_0\text{ is accepted} \quad \text{with a probability of 95%}. \]
Decision:

For \(b_0\): \[ \left|t_{\text{calc}}\right| = 647.89 > t_{\frac{\alpha}{2}, n-k} = 1.96, \quad p\text{-value} = 0.0000000000000002 < \alpha = 0.05 \] \[ \Rightarrow H_0 \text{ is rejected}, \text{ with a probability of 95%}. \]

For \(b_1\):

\[\left|t_{\text{calc}}\right| = 18.56 \> t\_{\frac{\alpha}{2}, n-k} = 1.96, \quad p\text{-value} = 0.0000000000000002 < \alpha = 0.05 \]

\[ \Rightarrow \text{The null hypothesis is rejected } (H_0) \text{ with a probability of 95%}.\]

Finally, with 95% probability, the estimated \(\beta_0\) and \(\beta_1\) parameters are statistically significant.

3.5 Testing the regression model

The steps involved in the testing process are:

Hypothesis statement: \[ H_0: \beta_0, \beta_1, \beta_2 = 0 \] \[ H_1: \text{At least one parameter is statistically significant (in our case } \beta_1 \neq 0 \text{)}. \]
Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations: \[ F_{\text{calc}} = \frac{\text{ESS}}{\text{RSS}} \cdot \frac{(n-k)}{(k-1)} \]
Decision rule:

If \(F_{\text{calc}} > F_{\alpha, k-1, n-k}\), \(\text{sig} < \alpha\), \(H_0\) is rejected, whit an assumed risk of 5%.
If \(F_{\text{calc}} \leq F_{\alpha, k-1, n-k}\), \(\text{sig} \geq \alpha\), \(H_0\) is accepted, with a probability of 95%.

Decision:

p-value = 0.00000000000000022 (the probability associated with the \(F_{\text{calc}}\) = 344.6) < \(\alpha\) = 0.05 the null hypothesis (\(H_0\)) is ruled out, with a 5% risk.

With a 95% confidence level the regression model is statistically significant, indicating that there is a relationship between the outcome and the predictor variable.

correlation coefficient: In simple linear regression model the correlation coefficient is equal with Pearson correlation coefficient (\({|r|}\) = \({R}\) = \({0.2468}\)), indicating a weak relationship between the outcome (pH) and the predictor variable (volatile acidity).

Coefficient of determination (R-square): \({R^2}\) = 0.06085– 6.085% of the variation in the outcome is explained by the model.

\[ R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \]

Adjusted R-squared: \(\bar{R^2}\) = 0.06068 – 6.068% and has the same interpretation as \({R^2}\).

\[ \bar{R^2} = 1 - \left(1 - R^2\right) \cdot \frac{(n-1)}{(n-k)} = 1 - \frac{\frac{\text{RSS}}{n-k}}{\frac{\text{TSS}}{n-1}} \]

\[ F_{\text{calc}} = \frac{\text{ESS}}{\text{RSS}} \cdot \frac{(n-k)}{(k-1)} = \text{or} \ F_{\text{calc}} = \frac{R^2}{1 - R^2} \cdot \frac{(n-k)}{(k-1)} \]

\[ R = \sqrt{\frac{\text{ESS}}{\text{TSS}}} = \sqrt{1 - \frac{\text{RSS}}{\text{TSS}}} \]

4 Explanation of RSS, ESS, and TSS

4.1 Graphical representation A

pH	volatile_acidity	.fitted	.resid	.hat	.sigma	.cooksd	.std.resid
3.51	0.7	3.31	0.202	0.00103	0.155	0.000868	1.3
3.2	0.88	3.35	-0.151	0.0021	0.155	0.000988	-0.97
3.26	0.76	3.32	-0.0625	0.00134	0.155	0.000108	-0.402
3.16	0.28	3.21	-0.0496	0.000215	0.155	1.1e-05	-0.319
3.51	0.66	3.3	0.211	0.000851	0.155	0.000786	1.36
3.3	0.6	3.28	0.0152	0.000623	0.155	2.97e-06	0.0976

pH	volatile_acidity	.fitted	.resid	.hat	.sigma	.cooksd	.std.resid
3.06	0.74	3.32	-0.258	0.00123	0.155	0.00169	-1.66

From the above graphic, it can be seen that not all points lie exactly on the regression line. The difference between the observed values and the values estimated by the regression model is captured in the residuals. In the figure above, the residual errors (the red-colored segments) clearly show the difference between the empirical and estimated values.

4.2 Introduction

In the regression analysis, the components of variance are:

Total Sum of Squares (TSS): \[{TSS = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}\]
Explained Sum of Squares (ESS): by the model \[ESS = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2\]
Residual Sum of Squares (RSS): is the variance that remains unexplained \[RSS = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2\]

4.3 How to calculate the RSS, ESS, and TSS values

To obtain ESS, RSS and TSS values we can use aov() function.

The regression model obtained with the lm() function is the argument used by the aov() function:

# model de regresie liniară
aov(model)

## Call:
##    aov(formula = model)
## 
## Terms:
##                 volatile_acidity Residuals
## Sum of Squares           8.32564 128.48696
## Deg. of Freedom                1      5318
## 
## Residual standard error: 0.1554373
## Estimated effects may be unbalanced

Generating the regression model with the aov() function

# testarea modelului MRLS
model_aov <- aov(pH ~ volatile_acidity, data = bd_sem1)
summary(model_aov)

##                    Df Sum Sq Mean Sq F value              Pr(>F)    
## volatile_acidity    1   8.33   8.326   344.6 <0.0000000000000002 ***
## Residuals        5318 128.49   0.024                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## TSS (Total Sum of Squares): 136.8126

## RSS (Residual Sum of Squares): 128.487

## ESS (Explained Sum of Squares): 8.325637

pH	volatile_acidity	.fitted	.resid	.hat	.sigma	.cooksd	.std.resid
3.51	0.7	3.31	0.202	0.00103	0.155	0.000868	1.3
3.2	0.88	3.35	-0.151	0.0021	0.155	0.000988	-0.97
3.26	0.76	3.32	-0.0625	0.00134	0.155	0.000108	-0.402
3.16	0.28	3.21	-0.0496	0.000215	0.155	1.1e-05	-0.319
3.51	0.66	3.3	0.211	0.000851	0.155	0.000786	1.36
3.3	0.6	3.28	0.0152	0.000623	0.155	2.97e-06	0.0976

4.4 Graphical representation B

pH	volatile_acidity	.fitted	.resid	.hat	.sigma	.cooksd	.std.resid
3.63	1.18	3.42	0.208	0.00488	0.155	0.0044	1.34

pH	volatile_acidity	.fitted	.resid	.hat	.sigma	.cooksd	.std.resid
3.63	1.18	3.42	0.208	0.00488	0.155	0.0044	1.34

The above figure clearly shows the residual errors (the red-colored segments), indicating the difference between the empirical values and the values estimated by the model, as well as the residual variance (RSS), explained variance (ESS), and total variance (TSS).

5 Testing the regression model

5.1 Assumptions of the regression model

The stochastic and deterministic component of the regression model must respect a series of assumptions:

Hypotheses regarding the deterministic component:
- The independent variable, volatile acidity, must not be stochastic;
- The independent variable, volatile acidity, has a finite variance
Hypotheses regarding the stochastic component:
- The mean of the errors must be equal to zero \[ \text{M}(\varepsilon_i) = 0, \quad \forall i \]
- Homoscedasticity assumption: the variance of the errors remains constant \[ \text{V}(\varepsilon_i) = \sigma^2, \quad \forall i \]
- The normality assumption for the errors \[ \varepsilon_i \sim N(0, \sigma^2) \]
- The assumption of uncorrelated errors: the errors do not influence each other \[ \text{cov}(\varepsilon_i, \varepsilon_j) = 0, \quad \forall i \neq j \]
- The hypothesis regarding the absence of correlation between the independent variable and the residual variable \[ \text{cov}(\varepsilon_i, x_i) = 0, \quad \forall i \]

5.2 Checking the assumptions of the regression model (Diagnostic plots)

To verify the assumptions of the regression model, we can use diagnostic plots, such as those presented above. Diagnostic plots provide the opportunity for a visual check of the assumptions, including: checking the hypothesis of linearity between the two variables, testing the normality assumption of the errors (Q-Q plot), checking the assumption of homoscedasticity of the errors (scale-location plot), and identifying influential points (Residuals vs. Leverage plot).

6 Testing the correlation indicators

6.1 Pearson correlation coefficient

6.1.1 Point estimation

The cor() function can be used to estimate the bivariate correlation coefficient:

Point estimation the correlation coefficient and its significance can be obtained with the rcorr() function:

##                  pH volatile_acidity
## pH               NA                0
## volatile_acidity  0               NA

The estimated value of the correlation coefficient is 0.247, indicating that the relationship between the analyzed variables is weak and positive (as volatile acidity increases, the wine pH also increases).

6.1.2 Confidence interval estimation

With 95% probability, the value of the correlation coefficient is within the range [0.221, 0.272].

By standardizing the variables, we obtain:

lm.beta(model)

## 
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
## 
## Standardized Coefficients::
##      (Intercept) volatile_acidity 
##               NA        0.2466867

The correlation coefficient can be calculated using the following relationship:

\[r = b_1 \cdot \frac{s_x}{s_y} = 0.2466867\] sau \[b_1 = r \cdot \frac{s_y}{s_x}\] where: - \({r}\) is Pearson correlation coefficient - \({b_1}\) -regression slope - \({s_x}\) - standard deviation of independent variable - \({s_y}\) - standard deviation of dependent variable

6.1.3 Testing the bivariate correlation coefficient

The steps involved in the testing process are:

Hypothesis statement:

\[H_0: \rho = 0 \text{ the correlation coefficient value is not statistically significant }\] \[H_0: \rho \neq 0 \text{ the correlation coefficient value is statistically significant}\]

Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations:

\[\mathbf{t}_{\mathbf{calc}} = \frac{\mathbf{r}}{\sqrt{\frac{(1 - \mathbf{r}^2)}{(n - k)}}}\]

Decision rule:

\[\text{If } \left| t_{\text{calc}} \right| < t_{\frac{\alpha}{2}, (n-k)}, \, \text{sig} < \alpha, \, \text{the null hypotesis } (H_0) \text{ is rejected} \text{ with a risk of 5% }\]

\[\text{If } \left| t_{\text{calc}} \right| \geq t_{\frac{\alpha}{2}, (n-k)}, \, \text{sig} \geq \alpha, \, \text{the null hypotesis } (H_0) \text{ is accepted} \, \text{with a probablitity of 95%}\]

Decision:

p-value = 2.2 \({10^{-16}}\) < \(\alpha\) = 0.05 => H_0 \text{ with a probability of 95%}.

The point and confidence interval estimation of the correlation coefficient can be done with the cor.test() function. Additionally, its statistical significance can also be tested:

## 
##  Pearson's product-moment correlation
## 
## data:  bd_sem1$pH and bd_sem1$volatile_acidity
## t = 18.563, df = 5318, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2212809 0.2717578
## sample estimates:
##       cor 
## 0.2466867

6.2 Correlation coefficient

6.2.1 Correlation coefficient estimation

The correlation coefficient value is:

R <- sqrt(summary(model)$r.squared)
cat("R (correlation coefficient):", R, "\n")

## R (correlation coefficient): 0.2466867

In a simple linear regression model, the absolute value of the Pearson correlation coefficient is the same as that of the correlation coefficient (|r| = R).

## 
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52602 -0.10488 -0.00899  0.09868  0.69225 
## 
## Coefficients:
##                  Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      3.143742   0.004852  647.89 <0.0000000000000002 ***
## volatile_acidity 0.235149   0.012667   18.56 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1554 on 5318 degrees of freedom
## Multiple R-squared:  0.06085,    Adjusted R-squared:  0.06068 
## F-statistic: 344.6 on 1 and 5318 DF,  p-value: < 0.00000000000000022

6.2.2 Testing the correlation coefficient

The steps involved in the testing process are:

Hypothesis statement: \[ H_0: \eta = 0 \] \[ H_1: \eta \neq 0. \]
Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations: \[ F_{\text{calc}} = \frac{\text{ESS}}{\text{RSS}} \cdot \frac{(n-k)}{(k-1)} = 344.6 \]
Decision rule:

If \(F_{\text{calc}} > F_{\alpha, k-1, n-k}\), \(\text{sig} < \alpha\), the null hypothesis (\(H_0\)) is rejected with a risk of 5%.
If \(F_{\text{calc}} \leq F_{\alpha, k-1, n-k}\), \(\text{sig} \geq \alpha\), the null hypothesis (\(H_0\)) is accepted with a probability of 95%.

Decision:

p-value = 0.00000000000000022 (associated with the \(F_{\text{calc}}\) = 344.6) < \(\alpha\) = 0.05 the null hypothesis (\(H_0\)) is rejected with a risk of 5%.

With 95% probability, the correlation coefficient is statistically significant.

6.3 The coefficient of determination

6.3.1 The coefficient of determination estimation

The estimated coefficient of determination or R-square value is:

r <- summary(model)
r$r.squared

## [1] 0.06085432

## 
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52602 -0.10488 -0.00899  0.09868  0.69225 
## 
## Coefficients:
##                  Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      3.143742   0.004852  647.89 <0.0000000000000002 ***
## volatile_acidity 0.235149   0.012667   18.56 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1554 on 5318 degrees of freedom
## Multiple R-squared:  0.06085,    Adjusted R-squared:  0.06068 
## F-statistic: 344.6 on 1 and 5318 DF,  p-value: < 0.00000000000000022

The coefficient of determination, \({R^2}\) = 0.0609, indicates that 6.085% of the variation in the dependent variable is explained by the regression model.

6.3.2 Testing the coefficient of determination

The steps involved in the testing process are:

Hypothesis statement: \[ H_0: \eta^2 = 0 \] \[ H_1: \eta^2 \neq 0. \]
Selecting the significance level (\(\alpha = 0.05\)), choosing the appropriate statistical test, and performing the calculations:\[ F_{\text{calc}} = \frac{\text{ESS}}{\text{RSS}} \cdot \frac{(n-k)}{(k-1)} = 344.6 \]
Decision rule:

If \(F_{\text{calc}} > F_{\alpha, k-1, n-k}\), \(\text{sig} < \alpha\), the null hypothesis (\(H_0\)) is rejected with a 5% risk.
If \(F_{\text{calc}} \leq F_{\alpha, k-1, n-k}\), \(\text{sig} \geq \alpha\), the null hypothesis (\(H_0\)) is accepted with a 95% probability.

Decision:

p-value = 0.00000000000000022 (associated with \(F_{\text{calc}}\) = 344.6) < \(\alpha\) = 0.05 the null hypothesis (\(H_0\)) is rejected with a 5% risk.

With a 95% confidence level, the value of R-squared is statistically significant.

6.4 Adjusted R-square

6.4.1 Estimation of the adjusted R-square

The value of the adjusted R-square is:

r <- summary(model)
r$adj.r.squared

## [1] 0.06067772

## 
## Call:
## lm(formula = pH ~ volatile_acidity, data = bd_sem1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52602 -0.10488 -0.00899  0.09868  0.69225 
## 
## Coefficients:
##                  Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      3.143742   0.004852  647.89 <0.0000000000000002 ***
## volatile_acidity 0.235149   0.012667   18.56 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1554 on 5318 degrees of freedom
## Multiple R-squared:  0.06085,    Adjusted R-squared:  0.06068 
## F-statistic: 344.6 on 1 and 5318 DF,  p-value: < 0.00000000000000022

Crt.	Variable	R_value	R_squared	R_squared_adj	F_value	p_value	Significance
1	volatile_acidity	0.247	0.0609	0.0607	345	1.39e-74	Semnificativ

##estimarea si testarea raportului de determinatie
R <- summary(model)
R$r.squared

## [1] 0.06085432

summary(model)$r.squared

## [1] 0.06085432

#estimarea variatiei explicate si residuale
anova(model)

Df	Sum Sq	Mean Sq	F value	Pr(>F)
1	8.33	8.33	345	1.39e-74
5318	128	0.0242

Estimarea raportului de corelatie

sqrt(summary(model)$r.squared)

## [1] 0.2466867

In cazul MRLS coeficientul de corelatie estimat este egal cu raportul de corelatie

## [1] 0.2466867

7 References:

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T.
D.B. Radosavljevic, S. Ilic, S.N. Pitulic, A data mining approach to wine quality prediction, 2019, International Scientific Conference UNITECHAt: Gabrovo, Bulgaria.
M. Taboga, Linear regression with standardized variables, https://www.statlect.com/fundamentals-of-statistics/linear-regression-with-standardized-variables accesat la data de 24.01.2025.
https://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/ accesat la data de 24.01.2025.
https://cran.r-project.org/web/packages/ggfortify/ggfortify.pdf, https://www.sthda.com/english/wiki/ggfortify-extension-to-ggplot2-to-handle-some-popular-packages-r-software-and-data-visualization accesat la data de 24.01.2025.
https://riffomonas.org/code_club/2024-11-11-venn-symbol 24.01.2025.
https://www.dummies.com/article/business-careers-money/business/accounting/calculation-analysis/test-the-estimated-regression-equation-using-the-coefficient-of-determination-r2-145936/
U. Ergun, A. Goksu, Applied econometrics with eviews applications, 2013.
A. Codirlasu, Econometrie aplicata utilizand Eviews 5.1, 2007.
C. Anghelescu, I. Partachi, A. Baltac, C. Paunescu, Estimarea parametrilor modelului liniar de regresie, 2012, https://www.revistadestatistica.ro/suplimente/2012/3/srrs3_2012a19.pdf accesat la data 31.01.2025.
M. Irimia, Modelul de regresie liniar simplu (Simple linear regression model), 2025, https://rpubs.com/mihaela123/1266378 accesat la data de 31.01.2025.

Modelul de regresie liniar simplu (Simple linear regression model)

Irimia Mihaela

2025-01-26

1 Simple linear regression model

2 Analysis of the relationship between variables

3 Simple linear regression model

3.1 General equation of MRLS

3.2 Parameters estimation of MRLS

3.3 The Estimated Confidence Interval for the regression parameter

3.4 MRLS Parameter Tests

3.5 Testing the regression model

4 Explanation of RSS, ESS, and TSS

4.1 Graphical representation A

4.2 Introduction

4.3 How to calculate the RSS, ESS, and TSS values

4.4 Graphical representation B

5 Testing the regression model

5.1 Assumptions of the regression model

5.2 Checking the assumptions of the regression model (Diagnostic plots)

6 Testing the correlation indicators

6.1 Pearson correlation coefficient

6.1.1 Point estimation

6.1.2 Confidence interval estimation

6.1.3 Testing the bivariate correlation coefficient

6.2 Correlation coefficient

6.2.1 Correlation coefficient estimation

6.2.2 Testing the correlation coefficient

6.3 The coefficient of determination

6.3.1 The coefficient of determination estimation

6.3.2 Testing the coefficient of determination

6.4 Adjusted R-square

6.4.1 Estimation of the adjusted R-square

7 References: