Regression

Regression Analysis

The basic idea of regression analysis is to obtain a model for the functional relationship between a response variable (often referred to as the dependent variable) and one or more explanatory variables (often referred to as the independent variables). Regression models have a number of uses.

A regression model provides the user with a functional relationship between the response variable and the explanatory variables that can be used to determine which explanatory variables have an effect on the response. It also allows the user to explore what happens to the response variable for specified changes in the explanatory variables.

Regression Model

\(Y=\beta_{0}+\beta_{1} X+\varepsilon\)

with

\(Y\) : the response/dependent variable

\(X\) : the explanatory/independent variable

\(\beta_{0}\) and \(\beta_{1}\) : the regression parameters (intercept and slope, respectively).

\(\varepsilon\) : the random error term.
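To illustrate the model's components, the following R sketch simulates data from this model; the parameter values and variable names here are purely hypothetical and are not taken from the example later in this section.

set.seed(123)                        # for reproducibility
n     <- 50
beta0 <- 2                           # hypothetical intercept
beta1 <- 0.5                         # hypothetical slope
X     <- runif(n, 0, 10)             # explanatory variable
eps   <- rnorm(n, mean = 0, sd = 1)  # errors with mean 0 and constant variance
Y     <- beta0 + beta1 * X + eps     # response generated by the regression model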

Estimating Model Parameters

The intercept \(\beta_{0}\) and slope \(\beta_{1}\) in the regression model are population quantities that must be estimated from sample data. The standard method for estimating the regression parameters is the method of least squares, which chooses the estimates that minimize the sum of squared errors (SSE).
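For simple linear regression, minimizing the SSE gives the familiar closed-form estimates (with \(\bar{x}\) and \(\bar{y}\) the sample means):

\(\hat{\beta_{1}}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}=\frac{\sum x_{i}y_{i}-(\sum x_{i})(\sum y_{i})/n}{\sum x_{i}^{2}-(\sum x_{i})^{2}/n}\)

\(\hat{\beta_{0}}=\bar{y}-\hat{\beta_{1}}\bar{x}\)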

Assumptions

The formal assumptions of regression analysis (a residual-diagnostic check in R is sketched after the list) are:

  1. The relation is, in fact, linear, so that the errors all have expected value zero: \(E(\varepsilon_{i})=0\) for all i.

  2. The errors all have the same variance: \(Var(\varepsilon_{i})=\sigma^{2}\) for all i.

  3. The errors are independent of each other.

  4. The errors are all normally distributed; \(\varepsilon_{i}\) is normally distributed for all i.
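These assumptions are usually checked with residual diagnostics. A minimal sketch in R, assuming a fitted lm object such as the model_reg created in the example later in this section:

par(mfrow = c(2, 2))  # arrange the four default diagnostic plots in a 2 x 2 grid
plot(model_reg)       # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage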

Coefficient of Determination

The coefficient of determination \(R^{2}\) is the ratio of the explained variation to the total variation.

\(R^{2}=SSR/SST\)

F Test for Overall Significance (Simultaneous Test)

SS(Regression) is the sum of squared deviations of the predicted y values from the mean of y. SS(Residual) is the sum of squared deviations of the observed y values from the predicted y values. The formulas for SSR, SSE, and SST are given below.
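In the usual notation, with \(y_{i}\) the observed values, \(\hat{y}_{i}\) the fitted values, and \(\bar{y}\) the sample mean:

\(SSR=\sum_{i=1}^{n}(\hat{y}_{i}-\bar{y})^{2}\), \(SSE=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}\), \(SST=\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}=SSR+SSE\)

The corresponding mean squares are \(MSR=SSR/1\) and \(MSE=SSE/(n-2)\), and the test statistic is \(F=MSR/MSE\) with \(df_{1}=1\) and \(df_{2}=n-2\) degrees of freedom.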

If the model is useful, MSR will be large compared to the unexplained variation, MSE.

Hypothesis:

\(H_{0}:\beta_{1}=0\) (the regression model is not useful for predicting Y)

\(H_{1}:\beta_{1} \neq 0\) (the regression model is useful for predicting Y)

Rejection Region:

The null hypothesis is rejected if \(F>F_{df_{1};df_{2};\alpha}\), where \(df_{1}=1\) and \(df_{2}=n-2\) for simple linear regression.

t test for Individual Coefficient

t test for \(\beta_{1}\)

Hypothesis:

\(H_{0}:\beta_{1}=0\) (there is no significant linear relationship between X and Y)

\(H_{1}:\beta_{1} \neq 0\) (there is a significant linear relationship between X and Y)

The test statistic is \(t=\hat{\beta_{1}}/SE(\hat{\beta_{1}})\), which follows a t distribution with \(n-2\) degrees of freedom when \(H_{0}\) is true.

Rejection Region:

The null hypothesis is rejected if \(|t|>t_{\alpha/2 ; n-2}\).

Example

Data from a sample of 10 pharmacies are used to examine the relationship between prescription sales volume and the percentage of prescription ingredients purchased directly from the supplier. The sample data are shown in the table below.

The data can be downloaded here [Link to download]

  1. Find the least-squares estimates for the regression line \(\hat{Y}=\hat{\beta_{0}}+\hat{\beta_{1}} X\).

  2. Predict sales volume for a pharmacy that purchases 15% of its prescription ingredients directly from the supplier.

  3. Interpret \(\hat{\beta_{1}}\).

  4. Is the regression model useful to predict sales volume for a pharmacy (Y)?

  5. Find the coefficient of determination (\(R^{2}\)).

  6. Is there a significant linear relationship between prescription sales volume (Y) and the percentage of prescription ingredients purchased directly from the supplier (X)?

Answer:

Calculations for obtaining least-squares estimates

Substituting into the formulas for \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\):
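Using the sums computed from the ten observations (listed in the R output below), \(\sum x_{i}=338\), \(\sum y_{i}=713\), \(\sum x_{i}^{2}=14832\), and \(\sum x_{i}y_{i}=30814\), so that \(\bar{x}=33.8\) and \(\bar{y}=71.3\):

\(\hat{\beta_{1}}=\frac{30814-(338)(713)/10}{14832-(338)^{2}/10}=\frac{6714.6}{3407.6}\approx 1.9705\)

\(\hat{\beta_{0}}=\bar{y}-\hat{\beta_{1}}\bar{x}=71.3-1.9705(33.8)\approx 4.698\)

These hand calculations agree with the coefficients reported by R below.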

Using R:

library(readxl)
## Warning: package 'readxl' was built under R version 4.1.3
datareg1 <- read_excel("D:/MATERI KULIAH S2 IPB/ASPRAK 2/REGRESSION.xlsx")
datareg1
## # A tibble: 10 x 2
##    sales_volume_Y Ingredients_X
##             <dbl>         <dbl>
##  1             25            10
##  2             55            18
##  3             50            25
##  4             75            40
##  5            110            50
##  6            138            63
##  7             90            42
##  8             60            30
##  9             10             5
## 10            100            55
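# fit the simple linear regression of sales volume (Y) on the percentage purchased directly (X)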
model_reg <- lm(sales_volume_Y ~ Ingredients_X, data = datareg1)
summary(model_reg)
## 
## Call:
## lm(formula = sales_volume_Y ~ Ingredients_X, data = datareg1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.074  -4.403  -1.607   5.719  14.834 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.6979     5.9520   0.789    0.453    
## Ingredients_X   1.9705     0.1545  12.750 1.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.022 on 8 degrees of freedom
## Multiple R-squared:  0.9531, Adjusted R-squared:  0.9472 
## F-statistic: 162.6 on 1 and 8 DF,  p-value: 1.349e-06

The fitted regression model can be written as:

\(\hat{y}=4.6979+1.9705x\)

Predict sales volume

When x = 15%, the predicted sales volume is \(\hat{y}=4.70+1.97(15)=34.25\) (that is, $34,250).
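The same prediction can be obtained from the fitted model object in R:

# predicted sales volume when Ingredients_X = 15; should match the hand calculation above
predict(model_reg, newdata = data.frame(Ingredients_X = 15))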

Interpretation of \(\hat{\beta_{1}}\)

\(\hat{\beta_{1}}\) estimates the expected change in the mean value of Y when X increases by one unit.

Since \(\hat{\beta_{1}}=1.97\), we conclude that if a pharmacy increases the percentage of ingredients purchased directly by one percentage point, the estimated average sales volume increases by about $1,970.

F test

I.Hypotheses:

\(H_{0}:\beta_{1}=0\) (the regression model is not useful for predicting Y)

\(H_{1}:\beta_{1} \neq 0\) (the regression model is useful for predicting Y)

II.Significance Level: 5%

III.Test Statistics

Using R:

aov_table <- aov(sales_volume_Y ~ Ingredients_X, data = datareg1)
summary(aov_table)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## Ingredients_X  1  13231   13231   162.6 1.35e-06 ***
## Residuals      8    651      81                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

IV.Rejection Region:

The null hypothesis is rejected if \(F>F_{1;8;0.05}=5.32\).
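The tabulated critical value can be verified in R:

qf(0.95, df1 = 1, df2 = 8)  # upper 5% point of the F(1, 8) distribution; approximately 5.32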

V.Conclusion

The null hypothesis is rejected because \(F=162.6>F_{1;8;0.05}=5.32\). This means the regression model is useful for predicting prescription sales volume from the percentage of ingredients purchased directly from the supplier.

Coefficient of Determination

\(R^{2}=13231/(13231+651)=0.9531\)

\(R^{2}=95.31\%\)

About 95.31% of the variation in Y can be explained by X; the remaining 4.69% of the variation is due to sources other than X.
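The same value can be read directly from the fitted model in R:

summary(model_reg)$r.squared  # coefficient of determination; approximately 0.9531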

t test for \(\beta_{1}\)

I.Hypotheses:

\(H_{0}:\beta_{1}=0\) (there is no significant linear relationship between X and Y)

\(H_{1}:\beta_{1} \neq 0\) (there is a significant linear relationship between X and Y)

II.Significance Level: 5%

III.Test Statistics

summary(model_reg)
## 
## Call:
## lm(formula = sales_volume_Y ~ Ingredients_X, data = datareg1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.074  -4.403  -1.607   5.719  14.834 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.6979     5.9520   0.789    0.453    
## Ingredients_X   1.9705     0.1545  12.750 1.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.022 on 8 degrees of freedom
## Multiple R-squared:  0.9531, Adjusted R-squared:  0.9472 
## F-statistic: 162.6 on 1 and 8 DF,  p-value: 1.349e-06

IV.Rejection Region

The null hypothesis is rejected if \(|t|>t_{0.025; 8}=2.306\).
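The critical value can be verified in R:

qt(0.975, df = 8)  # upper 2.5% point of the t distribution with 8 df; approximately 2.306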

V.Conclusion

The null hypothesis is rejected because \(t=12.750>t_{0.025; 8}=2.306\). This means the percentage of prescription ingredients purchased directly from the supplier (X) has a significant linear effect on prescription sales volume (Y).

Exercise

Forest scientists are concerned with the decline in forest growth throughout the world. One aspect of this decline is the possible effect of emissions from coal-fired power plants. The scientists are particularly interested in the pH level of the soil and the resulting impact on tree growth retardation. They study various forests that are likely to be exposed to these emissions, measuring various aspects of growth associated with trees in a specified region along with the soil pH in the same region. The forest scientists then want to determine the impact on tree growth as the soil becomes more acidic. An index of growth retardation is constructed from the various measurements taken on the trees, with a high value indicating greater retardation in tree growth. A lower value of soil pH indicates a more acidic soil. Twenty tree stands that are exposed to the power plant emissions are selected for study. The values of the growth retardation index and average soil pH are recorded in the table below.

The data can be downloaded here [Link to download]

Using the above data and analysis using R, do the following:

  1. Identify the least-squares estimates for \(\beta_{0}\) and \(\beta_{1}\) in the model \(Y=\beta_{0}+\beta_{1} X\), where Y is the index of growth retardation and X is the soil pH.

  2. Predict the growth retardation for a soil pH of 4.0.

  3. Interpret \(\hat{\beta_{1}}\).

  4. Find the coefficient of determination (\(R^{2}\)).

  5. Is the regression model useful to predict the index of growth retardation (Y)? [F-Test]