================================================================================================================
Regression analysis is a statistical technique for investigating and modeling the relationship between variables.
It is used to predict the value of the dependent variable for individuals for whom some information on the explanatory variables is available, or to estimate the effect of an explanatory variable on the dependent variable.
Linear regression is a statistical technique used to model and analyze the relationship between a dependent variable and one or more independent variables. The linear regression equation represents the estimated linear relationship between the variables.
The simple linear regression equation is
y = β0 + β1 x + ε
In this equation, the parameters β0 and β1 are usually called regression coefficients.
The random error term ε is included in the model to represent two phenomena:
* Random variation
* Missing or omitted variables, or collective predictors of the outcome
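As a quick illustration, the model can be simulated and fitted in R. This is a minimal sketch on made-up data; the variable names and the true parameter values (β0 = 2, β1 = 0.5) are assumptions chosen for the example:
set.seed(1)            # reproducible simulation
x <- runif(100, 0, 10) # explanatory variable (illustrative)
e <- rnorm(100)        # random error term epsilon
y <- 2 + 0.5 * x + e   # y = beta0 + beta1*x + epsilon, with assumed beta0 = 2, beta1 = 0.5
fit <- lm(y ~ x)       # least-squares fit
coef(fit)              # estimated regression coefficients beta_hat0 and beta_hat1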
The marginal effect is the change in the outcome (dependent) variable resulting from a small change in one of the explanatory (independent) variables, holding all other variables constant. It quantifies the impact of a specific independent variable on the dependent variable. The marginal effect is valuable for understanding the impact of individual variables and for making comparisons between groups or categories. It helps in interpreting the coefficients of regression models and in drawing conclusions about the relationship between variables.
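In a linear model the marginal effect of a predictor is simply its coefficient. A small check, reusing the simulated fit above (the evaluation points 5 and 6 are arbitrary):
p <- predict(fit, newdata = data.frame(x = c(5, 6))) # predictions one unit of x apart
diff(p)        # change in predicted y for a one-unit increase in x
coef(fit)["x"] # identical to the estimated slope beta_hat1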
‘Homoscedasticity’ is the assumption of equal or similar variances across the groups being compared: a collection of random variables is homoscedastic if all of them have the same finite variance.
If the dependent variable does not change when the independent variable changes, there is a lack of relationship between the variables: the independent variable has no effect or influence on the dependent variable. This condition is referred to as no effect, no relationship, or no association between the variables.
Types of data and models:
| Dependent variable (𝑦) | Independent variable (𝑥) | Name of Regression |
|---|---|---|
| Interval or ratio | Interval or ratio | Multiple |
| Interval or ratio | Ordinal, nominal | Multiple (Dummy Variable) |
| Nominal or ordinal | Interval or ratio | Logistic Regression |
A dummy variable is a numerical variable used in regression analysis to represent a subgroup of the sample under study. It takes only the values 0 and 1, indicating absence or presence, and is also known as an indicator variable. The number of dummy variables required is always one less than the number of levels of the categorical variable.
For example, we can create a dummy variable to represent education level:
* Define a dummy variable “D” for education level.
* “D” takes the value 1 if the individual has a college education, and 0 otherwise (indicating high-school education).
Here’s how the dummy variable “D” would look for a sample of individuals:
Example:
| Individual | Education Level | Income (in thousands) | Dummy Variable (D) |
|---|---|---|---|
| A | High School | 35 | 0 |
| B | College | 50 | 1 |
| C | High School | 28 | 0 |
| D | College | 60 | 1 |
| E | College | 45 | 1 |
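A small R sketch reproducing this example (the data frame below simply re-enters the table above by hand):
edu <- data.frame(
  education = c("High School", "College", "High School", "College", "College"),
  income    = c(35, 50, 28, 60, 45)  # income in thousands
)
edu$D <- ifelse(edu$education == "College", 1, 0) # dummy: 1 = college, 0 = high school
lm(income ~ D, data = edu) # intercept = mean high-school income; D coefficient = college-minus-high-school difference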
From the class slides:
Regression is used to answer questions such as:
* Does advertising expenditure affect the sales revenue of a store?
* Does a change in diet result in a change in cholesterol level, and if so, does the result depend on other characteristics such as age, gender, and amount of exercise?
================================================================================================================
Regression Model
This is called a multiple linear regression model because more than one regressor is involved. An important objective of regression analysis is to estimate the unknown parameters in the regression model; this process is also called fitting the model to the data (for example, the least-squares fit to the delivery time data). It is important to remember that regression analysis is part of a broader data-analytic approach to problem solving: the regression equation itself may not be the primary objective of the study. It is usually more important to gain insight and understanding concerning the system generating the data.
y = sales
x1 = YouTube expenditure
x2 = Facebook expenditure
x3 = newspaper expenditure
Step 1: check whether y is normally distributed.
Step 2: check whether x and y are linearly associated.
Step 3: check that the regressors are not correlated with one another (no multicollinearity).
# reading the adn.txt dataset
df <- read.table("./Data/adn.txt", header = TRUE)
head(df,5)
## Y X1 X2 X3
## 1 26.52 276.12 45.36 83.04
## 2 12.48 53.40 47.16 54.12
## 3 11.16 20.64 55.08 83.16
## 4 22.20 181.80 49.56 70.20
## 5 15.48 216.96 12.96 70.08
hist(df$Y);hist(df$X1);hist(df$X2);hist(df$X3)
The histogram of df$Y shows that the sales variable is approximately normally distributed.
Nonsense correlation / no correlation: when there is no linear relationship between the variables, the correlation coefficient is near zero and carries no meaningful information.
Here, R_j² is the R-squared of the model regressing one individual predictor on all the other predictors, and the variance inflation factor is
VIF_j = 1 / (1 − R_j²)
VIF = 1: no multicollinearity
VIF ≤ 5: low multicollinearity or moderately correlated
VIF > 5: high multicollinearity or highly correlated
For example, a VIF of 6 indicates that the existing multicollinearity is inflating the variance of that coefficient 6 times compared with the case of no multicollinearity.
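VIF can also be computed by hand from this definition: regress one predictor on the others and plug its R-squared into 1/(1 − R²). A sketch for X1 in the advertising data df loaded above:
r2_x1 <- summary(lm(X1 ~ X2 + X3, data = df))$r.squared # R-squared of X1 regressed on the other predictors
1 / (1 - r2_x1)                                         # VIF for X1; matches car::vif() used later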
Note: SE(x_bar) is the standard error of the sample mean, a measure of sampling error.
pairs(df) # observe the relationships between sales and the expenditure variables; for these data they look linear
cor(df)
## Y X1 X2 X3
## Y 1.0000000 0.78222442 0.57622257 0.22829903
## X1 0.7822244 1.00000000 0.05480866 0.05664787
## X2 0.5762226 0.05480866 1.00000000 0.35410375
## X3 0.2282990 0.05664787 0.35410375 1.00000000
From the scatter plots, sales is linearly associated with each of the expenditure variables; in particular, sales and YouTube expenditure are strongly positively correlated (r ≈ 0.78).
Least Squares Estimation
Given the data (𝑥1, 𝑦1), (𝑥2, 𝑦2), …, (𝑥𝑛, 𝑦𝑛), the fitted regression is
y_hat = 𝜷_hat0 + 𝜷_hat1 x
Slope: 𝜷_hat1 = Σ(xi − x_bar)(yi − y_bar) / Σ(xi − x_bar)²
Intercept: 𝜷_hat0 = y_bar − 𝜷_hat1 x_bar
Interpretation of 𝜷_hat1: the estimated change in y for a one-unit increase in x.
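As a check, these formulas can be applied directly. A sketch using Y and X1 from the advertising data (a simple-regression illustration with one predictor, not the multiple model fitted below):
b1 <- sum((df$X1 - mean(df$X1)) * (df$Y - mean(df$Y))) / sum((df$X1 - mean(df$X1))^2) # slope beta_hat1
b0 <- mean(df$Y) - b1 * mean(df$X1)                                                   # intercept beta_hat0
c(b0, b1) # agrees with coef(lm(Y ~ X1, data = df))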
irl <- lm(Y~X1+X2+X3, data=df)
summary(irl)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5932 -1.0690 0.2902 1.4272 3.3951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.526667 0.374290 9.422 <2e-16 ***
## X1 0.045765 0.001395 32.809 <2e-16 ***
## X2 0.188530 0.008611 21.893 <2e-16 ***
## X3 -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
The p-value associated with the F-statistic is < 2.2e-16, which is extremely small and far below 0.05 (taking a 5% level of significance). Therefore, even at the 5% level, we reject the null hypothesis and conclude that the regression model as a whole is highly statistically significant.
y_hat = 3.526667 + 0.045765 X1 + 0.188530 X2 - 0.001037 X3
Standard errors: SE(𝜷_hat0) = 0.374290, SE(𝜷_hat1) = 0.001395, SE(𝜷_hat2) = 0.008611, SE(𝜷_hat3) = 0.005871
X1, X2, and X3 enter the model as fixed effects.
library("car")
## Loading required package: carData
vif_value <- vif(irl)
vif_value
## X1 X2 X3
## 1.004611 1.144952 1.145187
All VIFs are greater than 1 and less than 5, indicating low multicollinearity: the predictors are at most moderately correlated.
plot(irl)
Consider the regression model: 𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽3 𝑥3 + ε
where:
y = Sales
𝑥1 = YouTube advertising budget
𝑥2 = spending on Facebook advertisements
𝑥3 = advertising cost in newspapers
Plot a histogram:
hist(df$Y)
The histogram appears approximately symmetric and bell-shaped, suggesting that the response variable (sales) is approximately normally distributed.
The coefficients 𝛽1, 𝛽2, and 𝛽3 represent the associations between the respective advertising costs (x1, x2, x3) and sales (y). If the coefficients 𝛽1, 𝛽2, and 𝛽3 are statistically significant and not equal to zero, it indicates that there is a linear association between the corresponding advertising cost and sales.
We can check the significance of the coefficients by looking at their p-values in the regression model output. A p-value less than the chosen significance level (e.g., 0.05) suggests that the coefficient is statistically significant.
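The p-values can be extracted programmatically from the fitted model (irl is the model fitted above):
summary(irl)$coefficients[, "Pr(>|t|)"] # p-value for each coefficient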
Based on the summary output, the explanatory variables (X1, X2, X3) in the regression model are jointly statistically significant: the F-statistic is F = 570.3 with a very small p-value (< 2.2e-16). Since the p-value is far below the 0.05 significance level, we conclude that the regression model as a whole is statistically significant.
𝑅𝑎𝑑𝑗^2 (adjusted R-squared) is a statistical measure that represents the proportion of the variation in the response variable (Y) explained by the regression model, adjusted for the number of predictors in the model. It is an adjusted version of the R-squared (𝑅^2) statistic.
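The adjustment can be reproduced from the formula 𝑅𝑎𝑑𝑗^2 = 1 − (1 − 𝑅^2)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A sketch for the model irl:
r2 <- summary(irl)$r.squared         # multiple R-squared (0.8972)
n <- nrow(df); p <- 3                # sample size and number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1) # adjusted R-squared (0.8956)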
In this case, the p-value is less than 0.05, so we reject the null hypothesis and consider the explanatory variable statistically significant. Had the p-value been greater than 0.05, we would fail to reject the null hypothesis, and the explanatory variable would not be considered statistically significant.
Based on the provided regression model summary, the best fitted model can be stated as follows:
Sales = 3.527 + 0.046 X1 + 0.189 X2 - 0.001 X3
Interpretation of the regression coefficients:
Intercept (β0): when the advertising budgets on YouTube (X1), Facebook (X2), and newspaper (X3) are all zero, the expected Sales value is 3.527 (in thousands of dollars). However, since it is unlikely to have zero advertising budgets in practice, the interpretation of the intercept may have limited practical significance.
X1 (YouTube advertising budget) coefficient (β1): For a one-unit increase in the YouTube advertising budget, holding other variables constant, we expect Sales to increase by 0.046 units (in thousands of dollars). This suggests that YouTube advertising has a positive and statistically significant impact on Sales.
X2 (Facebook advertising budget) coefficient (β2): For a one-unit increase in the spending on Facebook advertisements, while keeping other variables constant, we expect Sales to increase by 0.189 units (in thousands of dollars). This indicates that Facebook advertising also has a positive and statistically significant impact on Sales.
X3 (newspaper advertising cost) coefficient (β3): For a one-unit increase in advertising cost in newspapers, all other variables held constant, we expect Sales to decrease by 0.001 units (in thousands of dollars). However, the coefficient is not statistically significant (p-value of 0.86), indicating that there is no strong evidence to suggest a significant impact of newspaper advertising on Sales.
In summary, the best fitted model suggests that YouTube advertising (X1) and Facebook advertising (X2) have a positive and statistically significant impact on Sales, while newspaper advertising (X3) does not have a statistically significant impact.
The diagnostic plots are given in output #2; they are interpreted in the section below.
Estimate the mean sales of all products where 𝑥1 = 50, 𝑥2 = 40, and 𝑥3 = 20.
Using the regression equation:
Sales = 3.527 + 0.046 X1 + 0.189 X2 - 0.001 X3
Substituting the given values:
Sales = 3.527 + 0.046(50) + 0.189(40) - 0.001(20)
Sales = 3.527 + 2.3 + 7.56 - 0.02
Sales = 13.367 (approximately)
Therefore, the estimated mean sales of all products, when 𝑥1 = 50, 𝑥2 = 40, and 𝑥3 = 20, is approximately 13.367 thousand dollars.
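The same estimate can be obtained directly with predict(); because predict() uses the full-precision coefficients rather than the rounded ones, its answer will differ slightly from the hand calculation:
predict(irl, newdata = data.frame(X1 = 50, X2 = 40, X3 = 20)) # estimated mean sales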
=================================================================================================================
plot(irl)
Diagnostic plot # 1: Residuals vs Fitted values. If the red line running through the middle of the graph is nearly flat, the residuals show no systematic trend against the fitted values, indicating linearity. For this plot the line is nearly flat, so the linearity assumption appears to hold.
Diagnostic plot # 2: Normal Q-Q plot. The data points generally align with the straight diagonal line, supporting the assumption that the residuals are normally distributed, although there is a slight deviation at the end for observation #131.
Diagnostic plot # 3: Scale-Location plot. Also known as the Spread-Location or Variance-Location plot, this is a graphical tool for assessing the homoscedasticity assumption in linear regression. If the plot displays a roughly horizontal line with no clear trend, the residuals have constant variance, supporting homoscedasticity. If instead it shows a systematic pattern, such as a funnel shape or an increasing or decreasing trend, the variance of the residuals is not constant, indicating heteroscedasticity. Here the red line stays roughly horizontal throughout the plot, so the equal-variance (homoscedasticity) assumption is not violated.
Diagnostic plot # 4: Residuals vs Leverage plot. This plot helps identify influential observations that may have a substantial impact on the regression analysis: it shows the influence of individual data points on the model’s estimates and flags potential outliers or observations with extreme predictor values. If any point falls outside Cook’s distance (the dashed lines), it is an influential observation. In this plot no points fall outside Cook’s distance, so there are no influential observations.
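Each diagnostic can also be drawn individually via the which argument of plot.lm (1 = Residuals vs Fitted, 2 = Normal Q-Q, 3 = Scale-Location, 5 = Residuals vs Leverage):
par(mfrow = c(2, 2))             # 2 x 2 grid of panels
plot(irl, which = c(1, 2, 3, 5)) # the four plots interpreted above
par(mfrow = c(1, 1))             # restore the default layout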
Robust Regression
Robust regression is a statistical technique that aims to minimize the impact of outliers and violations of assumptions on the regression analysis. It is designed to provide reliable estimates of the regression parameters even when the data contain outliers or are not normally distributed.
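A hedged sketch of one common approach, M-estimation via rlm() from the MASS package (this call is illustrative and was not part of the class output):
library(MASS)                            # provides rlm()
rfit <- rlm(Y ~ X1 + X2 + X3, data = df) # iteratively reweighted least squares; downweights outliers
summary(rfit)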
Total sum of squares: SST = SSR + SSE.
If SSE decreases, SSR increases and the F-statistic increases; if SSE increases, SSR decreases and the F-statistic decreases.
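These quantities can be computed from the fitted model; a sketch for irl (p = 3 predictors), reproducing the F-statistic reported in its summary:
sse <- sum(residuals(irl)^2)         # error sum of squares (SSE)
sst <- sum((df$Y - mean(df$Y))^2)    # total sum of squares (SST)
ssr <- sst - sse                     # regression sum of squares (SSR)
(ssr / 3) / (sse / df.residual(irl)) # F-statistic, approx. 570.3 as in the summary above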
lr2 <- lm(Y~ X1+X2, data=df)
summary(lr2)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5572 -1.0502 0.2906 1.4049 3.3994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.50532 0.35339 9.919 <2e-16 ***
## X1 0.04575 0.00139 32.909 <2e-16 ***
## X2 0.18799 0.00804 23.382 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.018 on 197 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
## F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
All explanatory variables are statistically significant, so this is taken as the best fitted model.
lr3 <- lm(Y ~ X1+X2 + X1*X2, data=df) # note: X1*X2 already expands to X1 + X2 + X1:X2; the added term is the interaction X1:X2
summary(lr3)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X1 * X2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6039 -0.4833 0.2197 0.7137 1.8295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.100e+00 2.974e-01 27.233 <2e-16 ***
## X1 1.910e-02 1.504e-03 12.699 <2e-16 ***
## X2 2.886e-02 8.905e-03 3.241 0.0014 **
## X1:X2 9.054e-04 4.368e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.132 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16