ID: 20228034

================================================================================================================

First Class

DATE: 12.05.23

Applied Regression

Regression analysis is a statistical technique for investigating and modeling the relationship between variables.

Typically, a regression analysis is done for one of two purposes:

  • to predict the value of the dependent variable for individuals for whom some information concerning the explanatory variables is available, or
  • to estimate the effect of some explanatory variable on the dependent variable.

Sources of error in a regression model:

  • Random variation represents the inherent uncertainty in data,
  • model misspecification occurs when the model does not accurately represent the underlying relationship, and
  • missing variables refer to relevant variables that are not included in the analysis.

Linear regression is a statistical technique used to model and analyze the relationship between a dependent variable and one or more independent variables. The linear regression equation represents the estimated linear relationship between the variables.

y = β0 + β1x + ε ; ε ~ N(0,σ²)

In this equation:

  • “y” represents the dependent variable or response (e.g. Sales revenue)
  • “x” represents the independent variable or predictor (e.g. Advertising cost)
  • β0 represents the y-intercept, which is the value of the dependent variable when the independent variable is zero.
  • β1 represents the slope coefficient, which indicates the change in the dependent variable associated with a one-unit increase in the independent variable.
  • ε represents the random error term (residual), which accounts for the random variation or unexplained factors in the relationship.

The parameters β0 and β1 are usually called regression coefficients.

The random error term ε is included in the model to represent two phenomena:

  • random variation
  • missing or omitted variables (other predictors of the outcome)

What is the marginal effect?

The marginal effect refers to the change in the outcome or dependent variable resulting from a small change in one of the explanatory or independent variables, while holding all other variables constant. It quantifies the impact of a specific independent variable on the dependent variable. The marginal effect is valuable for understanding the impact of individual variables and for making comparisons between groups or categories. It helps in interpreting the coefficients of regression models and drawing conclusions about the relationships between variables.
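
A minimal sketch with simulated data (hypothetical, not from the class material), showing that in a linear model the marginal effect of x on the expected outcome is simply the slope coefficient β1:

# simulate data where the true marginal effect of x on y is 3
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
coef(fit)["x"]  # estimated marginal effect of x on y, close to 3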

Assumptions of the multiple linear regression (MLR) model

  • The relationship between the dependent and independent variables is linear
  • The errors ε are normally distributed
  • V(y|x) = V(ε) = σ² (constant conditional variance; homoscedasticity)
  • The regressors are not correlated with one another

‘Homoscedasticity’ is the assumption of equal or similar variances across the groups being compared; a model is homoscedastic if all its random error terms have the same finite variance.

If the dependent variable does not change when the independent variable changes, what is this condition called?

If the dependent variable does not change when the independent variable changes, it indicates a lack of relationship or association between the variables. In this condition, we would say that the independent variable has no effect or influence on the dependent variable; it is referred to as no effect, no relationship, or no association between the variables.

  • A scatter plot is used to assess the linearity of the relationship between the variables

Types of data and models:

Dependent variable (𝑦)   Independent variable (𝑥)   Name of Regression
Interval or ratio        Interval or ratio          Multiple
Interval or ratio        Ordinal, nominal           Multiple (Dummy Variable)
Nominal or ordinal       Interval or ratio          Logistic Regression

Dummy variable

It is a numerical variable used in regression analysis to represent subgroups of the sample under study. A dummy variable takes only the values 0 and 1, which indicate absence or presence. It is also known as an indicator variable. The number of dummy variables needed is always one less than the number of levels of the categorical variable.

For example, we can create a dummy variable to represent education level: define a dummy variable “D” that takes the value 1 if the individual has a college education, and 0 otherwise (indicating high school education).

Here’s how the dummy variable “D” would look for a sample of individuals:

  • Individual A: Education Level = High School -> D = 0
  • Individual B: Education Level = College -> D = 1
  • Individual C: Education Level = High School -> D = 0

Example:

Individual   Education Level   Income (in thousands)   Dummy Variable (D)
A            High School       35                      0
B            College           50                      1
C            High School       28                      0
D            College           60                      1
E            College           45                      1
  • A discrete or continuous variable can be either qualitative or quantitative
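
A minimal R sketch of the dummy-variable example above (values taken from the table; variable names edu, income, D are illustrative):

# encode education level as a 0/1 dummy and regress income on it
edu <- c("High School", "College", "High School", "College", "College")
income <- c(35, 50, 28, 60, 45)
D <- ifelse(edu == "College", 1, 0)  # 1 = college, 0 = high school
lm(income ~ D)  # slope = difference in mean income between the two groups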

From Ma'am's class slides:

  • Regression is the statistical technique of studying relationships between variables

It is used to answer questions such as:

  • Does advertising expenditure affect the sales revenue of a store?
  • Does a change in diet result in a change in cholesterol level, and if so, does the result depend on other characteristics such as age, gender, and amount of exercise?

Objectives

  • Explanatory: researchers want to understand whether the predictors are related to the outcome. In the first example, how does advertising expenditure affect sales revenue?
  • Prediction

Measure of (linear) relationship

  • Scatter plot (Graphical)
  • Correlation coefficient (analytical measure)

================================================================================================================

Second Class

DATE: 19.05.23

First quiz on 02/06/23

Regression Model

y = β0 + β1x1 + β2x2 +β3x3 + ε

This is called a multiple linear regression model because more than one regressor is involved. An important objective of regression analysis is to estimate the unknown parameters in the regression model. This process is also called fitting the model to the data (for example, the least-squares fit to the delivery time data). It is important to remember that regression analysis is part of a broader data-analytic approach to problem solving. That is, the regression equation itself may not be the primary objective of the study. It is usually more important to gain insight and understanding concerning the system generating the data.

y = sales; x1 = YouTube expenditure; x2 = Facebook expenditure; x3 = newspaper expenditure

Step 1: check whether y is normally distributed. Step 2: check whether x and y are linearly associated. Step 3: check that the regressors are not correlated with one another (no multicollinearity).

Assumptions of the multiple linear regression (MLR) model

  • The relationship between the dependent and independent variables is linear
  • The errors ε are normally distributed
  • V(y|x) = V(ε) = σ² (constant conditional variance; homoscedasticity)
  • The regressors are not correlated with one another
# reading the adn.txt dataset
df <- read.table("./Data/adn.txt", header = TRUE)
head(df, 5)
##       Y     X1    X2    X3
## 1 26.52 276.12 45.36 83.04
## 2 12.48  53.40 47.16 54.12
## 3 11.16  20.64 55.08 83.16
## 4 22.20 181.80 49.56 70.20
## 5 15.48 216.96 12.96 70.08
hist(df$Y); hist(df$X1); hist(df$X2); hist(df$X3)  # check each variable's distribution

From the histogram of df$Y, the sales variable appears approximately normally distributed.

Detection of Multicollinearity

  • Correlation Matrix
  • Variance Inflation Factor (VIF)
  • Tolerance
  1. Correlation coefficient 𝑟. Range: −1 ≤ 𝑟 ≤ +1. Sign: positive values of 𝑟 imply that y increases as x increases; negative values imply that y decreases as x increases.
  • If 0 < |𝑟| ≤ 0.5, there is a weak linear relationship
  • If 0.5 < |𝑟| ≤ 0.8, there is a moderate linear relationship
  • If |𝑟| > 0.8, there is a strong linear relationship

Nonsense correlation: no correlation means no linear relation, in which case the correlation coefficient carries no meaning.

  2. Variance Inflation Factor (VIF) = 1/(1 − (𝑅𝑗)^2)

Here, (𝑅𝑗)^2 is the R-squared of a model regressing one individual predictor against all the other predictors.

  • 𝑉𝐼𝐹 = 1: no multicollinearity
  • 1 < 𝑉𝐼𝐹 ≤ 5: low multicollinearity (moderately correlated)
  • 𝑉𝐼𝐹 > 5: high multicollinearity (highly correlated)

For example, a VIF of 6 indicates that the existing multicollinearity is inflating the variance of the coefficients 6 times compared to no multicollinearity

  3. Tolerance: 𝑇 = 1 − (𝑅𝑗)^2. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there almost certainly is.
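
As a sketch, VIF and tolerance for X1 can be computed by hand from these definitions, using the df data frame loaded above (r2_x1, vif_x1, tol_x1 are illustrative names):

# (Rj)^2: regress X1 on the remaining predictors, then apply the definitions
r2_x1 <- summary(lm(X1 ~ X2 + X3, data = df))$r.squared
vif_x1 <- 1 / (1 - r2_x1)  # variance inflation factor for X1
tol_x1 <- 1 - r2_x1        # tolerance for X1
c(VIF = vif_x1, Tolerance = tol_x1)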

*** Sampling error is measured by SE(x_bar)

pairs(df)  # observe the relationships between sales and the expenditure variables; for this dataset they look linear

cor(df)
##            Y         X1         X2         X3
## Y  1.0000000 0.78222442 0.57622257 0.22829903
## X1 0.7822244 1.00000000 0.05480866 0.05664787
## X2 0.5762226 0.05480866 1.00000000 0.35410375
## X3 0.2282990 0.05664787 0.35410375 1.00000000

From the scatter plots, the sales variable is linearly correlated with all of the expenditure variables. Sales and YouTube expenditure are positively and moderately correlated (r ≈ 0.78).

Fitted regression model

y_hat = 𝜷_hat0 + 𝜷_hat1x

Least Squares Estimation: given data (𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑛, 𝑦𝑛), the fitted regression is y_hat = 𝜷_hat0 + 𝜷_hat1x

Slope: 𝜷_hat1 = Σ(𝑥𝑖 − x_bar)(𝑦𝑖 − y_bar) / Σ(𝑥𝑖 − x_bar)²; Intercept: 𝜷_hat0 = y_bar − 𝜷_hat1 x_bar

In matrix form: y_hat = X𝜷_hat
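
A sketch of these formulas in R, applied to a simple regression of Y on X1 from the df data above (b0, b1 are illustrative names):

# least-squares slope and intercept from the closed-form formulas
x <- df$X1; y <- df$Y
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(intercept = b0, slope = b1)  # should match coef(lm(Y ~ X1, data = df))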

Interpretation of 𝜷_hat

  • 𝜷_hat0 = average of the response (outcome) when the values of the independent variables are zero
  • 𝜷_hat1 = average change in the response (outcome) for a one-unit change in 𝑥1, controlling for 𝑥2, 𝑥3, … , 𝑥k
irl <- lm(Y ~ X1 + X2 + X3, data = df)  # fit the multiple linear regression model
summary(irl)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5932  -1.0690   0.2902   1.4272   3.3951 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.526667   0.374290   9.422   <2e-16 ***
## X1           0.045765   0.001395  32.809   <2e-16 ***
## X2           0.188530   0.008611  21.893   <2e-16 ***
## X3          -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

The p-value associated with the F-statistic is < 2.2e-16, which is far smaller than 0.05 (taking a 5% level of significance). Therefore, at the 5% level of significance, we reject the null hypothesis and conclude that the regression model as a whole is highly statistically significant.

y_hat = 3.526667 + 0.045765 X1 + 0.188530 X2 - 0.001037 X3

Standard errors: SE(𝜷_hat0) = 0.374290, SE(𝜷_hat1) = 0.001395, SE(𝜷_hat2) = 0.008611, SE(𝜷_hat3) = 0.005871

The predictors X1, X2, X3 enter the model as fixed effects.

library("car")
## Loading required package: carData
vif_value <- vif(irl)
vif_value
##       X1       X2       X3 
## 1.004611 1.144952 1.145187

All VIFs are greater than 1 and less than 5, which means these variables exhibit low multicollinearity (they are only moderately correlated).

plot(irl)  # residual diagnostic plots

Example:

Consider the regression model: 𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽3 𝑥3 + ε

where: y = Sales; 𝑥1 = YouTube advertising budget; 𝑥2 = spending on Facebook advertisements; 𝑥3 = advertising cost in newspapers

  1. Verify the normality of the response variable

Plot a histogram:

hist(df$Y)

The histogram appears approximately symmetric and bell-shaped, which suggests that the response variable follows a normal distribution.
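
Beyond the histogram, base R offers further sketch checks of normality (both functions are in the stats package):

qqnorm(df$Y); qqline(df$Y)  # points near the reference line suggest normality
shapiro.test(df$Y)          # formal test; a large p-value gives no evidence against normality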

  2. Is there any linear association between sales and the three advertising costs?

The coefficients 𝛽1, 𝛽2, and 𝛽3 represent the associations between the respective advertising costs (x1, x2, x3) and sales (y). If the coefficients 𝛽1, 𝛽2, and 𝛽3 are statistically significant and not equal to zero, it indicates that there is a linear association between the corresponding advertising cost and sales.

We can check the significance of the coefficients by looking at their p-values in the regression model output. A p-value less than the chosen significance level (e.g., 0.05) suggests that the coefficient is statistically significant.

  3. Are the explanatory variables in this regression model, in total, statistically significant at the 5% level? What is meant by 𝑅𝑎𝑑𝑗^2?

Based on the summary output, the explanatory variables (X1, X2, X3) in the regression model are jointly statistically significant. The F-statistic is F = 570.3 with a very low p-value of < 2.2e-16. Since the p-value is much smaller than the significance level of 0.05, we can conclude that the regression model, as a whole, is statistically significant.

𝑅𝑎𝑑𝑗^2 (adjusted R-squared) is a statistical measure that represents the proportion of the variation in the response variable (Y) explained by the regression model, adjusted for the number of predictors in the model. It is an adjusted version of the R-squared (𝑅^2) statistic that penalizes the addition of predictors that do not improve the model.
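
As a sketch, 𝑅𝑎𝑑𝑗^2 can be reproduced from its formula, 𝑅𝑎𝑑𝑗^2 = 1 − (1 − 𝑅^2)(n − 1)/(n − p − 1), using the values in the summary output (n = 200 observations, p = 3 predictors, so n − p − 1 = 196 residual degrees of freedom):

# adjusted R-squared computed from the summary values
r2 <- 0.8972; n <- 200; p <- 3
1 - (1 - r2) * (n - 1) / (n - p - 1)  # approx. 0.8956, matching the output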

  4. Considered individually, which of the explanatory variables are significant? Explain.

For X1 and X2 the p-values (< 2e-16) are less than 0.05, so we reject the null hypothesis and consider these explanatory variables statistically significant. For X3 the p-value (0.86) is greater than 0.05, so we fail to reject the null hypothesis, and X3 is not considered statistically significant.

  5. State the best fitted model. Interpret the regression coefficients.

Based on the provided regression model summary, the best fitted model can be stated as follows:

Sales = 3.527 + 0.046*X1 + 0.189*X2 - 0.001*X3

Interpretation of the regression coefficients:

Intercept (β0): when the advertising budgets on YouTube (X1), Facebook (X2), and newspaper (X3) are all zero, the expected Sales value is 3.527 (in thousands of dollars). However, since it is unlikely to have zero advertising budgets in practice, the interpretation of the intercept may have limited practical significance.

X1 (YouTube advertising budget) coefficient (β1):For a one-unit increase in the YouTube advertising budget, holding other variables constant, we expect Sales to increase by 0.046 units (in thousands of dollars). This suggests that YouTube advertising has a positive and statistically significant impact on Sales.

X2 (Facebook advertising budget) coefficient (β2): For a one-unit increase in the spending on Facebook advertisements, while keeping other variables constant, we expect Sales to increase by 0.189 units (in thousands of dollars). This indicates that Facebook advertising also has a positive and statistically significant impact on Sales.

X3 (newspaper advertising cost) coefficient (β3): For a one-unit increase in advertising cost in newspapers, all other variables held constant, we expect Sales to decrease by 0.001 units (in thousands of dollars). However, the coefficient is not statistically significant (p-value of 0.86), indicating that there is no strong evidence to suggest a significant impact of newspaper advertising on Sales.

In summary, the best fitted model suggests that YouTube advertising (X1) and Facebook advertising (X2) have a positive and statistically significant impact on Sales, while newspaper advertising (X3) does not have a statistically significant impact.

  6. The diagnostic plots are given in output #2. Interpret the plots.

  7. Estimate the mean sales of all products where 𝑥1 = 50, 𝑥2 = 40 and 𝑥3 = 20.

use the regression equation:

Sales = 3.527 + 0.046*X1 + 0.189*X2 - 0.001*X3

Substituting the given values:

Sales = 3.527 + 0.046*(50) + 0.189*(40) - 0.001*(20)

Sales = 3.527 + 2.3 + 7.56 - 0.02

Sales = 13.367 (approximately)

Therefore, the estimated mean sales of all products, when 𝑥1 = 50, 𝑥2 = 40, and 𝑥3 = 20, is approximately 13.367 thousand dollars (using the rounded coefficients).
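
The same estimate can be obtained with predict() on the fitted model irl; small differences from the hand calculation come from rounding the coefficients:

# estimate mean sales at the given predictor values
newdata <- data.frame(X1 = 50, X2 = 40, X3 = 20)
predict(irl, newdata)                           # point estimate of mean sales
predict(irl, newdata, interval = "confidence")  # adds a confidence interval for the mean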

=================================================================================================================

Third Class

DATE: 26.05.23

plot(irl)  # produces the four diagnostic plots discussed below

Diagnostic plot # 1: Residuals vs Fitted values. If the red line running through the middle of the graph is nearly flat, the residuals show no systematic pattern, indicating that the linearity assumption holds. For this plot, the line is nearly flat, which indicates linearity.

Diagnostic plot # 2: Q-Q plot. This plot checks whether the residuals are normally distributed. The data points generally align with the straight diagonal line, although there is a slight deviation at the tail for observation #131.

Diagnostic plot # 3: Scale-Location plot. Also known as the Spread-Location plot or the Variance-Location plot, it is a graphical tool used to assess the homoscedasticity assumption in linear regression analysis. If the plot displays a roughly horizontal line or exhibits a consistent pattern with no clear trend, it suggests that the residuals have constant variance, satisfying the assumption of homoscedasticity. On the other hand, if the plot shows a systematic pattern, such as a funnel shape or an increasing or decreasing trend, it suggests heteroscedasticity, indicating that the variance of the residuals is not constant. We observe that the red line maintains a roughly horizontal orientation throughout the plot. This suggests that the assumption of equal variance, also known as homoscedasticity, is not violated.

Diagnostic plot # 4: Residuals vs Leverage plot. The Residuals vs Leverage plot helps in identifying influential observations that may have a substantial impact on the regression analysis. It allows us to assess the influence of individual data points on the model's estimates and helps identify potential outliers or observations with extreme predictor values. If any point in this plot falls outside of Cook's distance (the dashed lines), it is an influential observation. For this plot, no points fall outside of Cook's distance, so there are no influential observations.

Robust Regression

Robust regression is a statistical technique that aims to minimize the impact of outliers and violations of assumptions on the regression analysis. It is designed to provide reliable estimates of the regression parameters even when the data contains outliers or is not normally distributed.
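
As a hedged sketch (not covered in the class code), a robust fit for the same model could use rlm() from the MASS package, which downweights outlying observations via M-estimation:

library(MASS)
rfit <- rlm(Y ~ X1 + X2 + X3, data = df)  # robust alternative to lm()
summary(rfit)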

Total sum of squares: SST = SSR + SSE. If SSE decreases, SSR increases and the F-statistic increases; if SSE increases, SSR decreases and the F-statistic decreases.
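
A sketch of this decomposition for the fitted model irl from above:

sst <- sum((df$Y - mean(df$Y))^2)  # total sum of squares (SST)
sse <- sum(residuals(irl)^2)       # error sum of squares (SSE)
ssr <- sst - sse                   # regression sum of squares (SSR)
c(SST = sst, SSR = ssr, SSE = sse)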

  • When R² is close to 1, the model gives better performance in individual testing
  • How large the F-statistic is depends on the overall standard error
# refit without X3, which was not individually significant
lr2 <- lm(Y ~ X1 + X2, data = df)
summary(lr2)
## 
## Call:
## lm(formula = Y ~ X1 + X2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5572  -1.0502   0.2906   1.4049   3.3994 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.50532    0.35339   9.919   <2e-16 ***
## X1           0.04575    0.00139  32.909   <2e-16 ***
## X2           0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.018 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

All of the X variables are statistically significant, so this is called the best fitted model.

# interaction model: in R, X1*X2 expands to X1 + X2 + X1:X2
lr3 <- lm(Y ~ X1 + X2 + X1 * X2, data = df)
summary(lr3)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X1 * X2, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6039 -0.4833  0.2197  0.7137  1.8295 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.100e+00  2.974e-01  27.233   <2e-16 ***
## X1          1.910e-02  1.504e-03  12.699   <2e-16 ***
## X2          2.886e-02  8.905e-03   3.241   0.0014 ** 
## X1:X2       9.054e-04  4.368e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.132 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16
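
With the interaction term, the marginal effect of X1 is no longer constant: dE[Y]/dX1 = β1 + β3·X2. A sketch of evaluating it at a few X2 values (the chosen X2 values are illustrative):

# marginal effect of X1 depends on the level of X2
b <- coef(lr3)
b["X1"] + b["X1:X2"] * c(10, 30, 50)  # marginal effect of X1 at X2 = 10, 30, 50
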
  • If the intercept has a negative value, it does not have a sensible interpretation