The original clothing dataset was imported and sorted to include only the study variables. The code in some of these steps is rendered ineffective for ease of visual presentation. A summary of the study variables was undertaken along with a visual presentation of their distribution using histograms.
clothing <- read.csv("~/Downloads/clothing.csv", header = TRUE)
#sapply(clothing, class)#
#head(clothing)#
#ncol(clothing)#
#str(clothing)#
clothingfinal <- clothing[c(3,5,6,7,9,11,12,13)]
names(clothingfinal)
## [1] "sales" "nown" "nfull" "npart" "hoursw" "inv1" "inv2" "ssize"
Some columns (in particular inv1 & inv2) have very large relative differences between their means and medians. This is evident in the statistical summary output given by R. This means that at some point the specific column must be checked for skewness/kurtosis and possibly transformed.
summary(clothingfinal)
## sales nown nfull npart
## Min. : 300 Min. : 1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 3904 1st Qu.: 1.000 1st Qu.:1.923 1st Qu.:1.283
## Median : 5279 Median : 1.000 Median :1.956 Median :1.283
## Mean : 6335 Mean : 1.284 Mean :2.069 Mean :1.566
## 3rd Qu.: 7740 3rd Qu.: 1.295 3rd Qu.:2.066 3rd Qu.:2.000
## Max. :27000 Max. :10.000 Max. :8.000 Max. :9.000
## hoursw inv1 inv2 ssize
## Min. : 32.0 Min. : 1000 Min. : 350 Min. : 16.0
## 1st Qu.: 80.0 1st Qu.: 20000 1st Qu.: 10000 1st Qu.: 80.0
## Median :104.0 Median : 22207 Median : 22860 Median : 120.0
## Mean :121.1 Mean : 58257 Mean : 27829 Mean : 151.1
## 3rd Qu.:145.2 3rd Qu.: 62269 3rd Qu.: 22860 3rd Qu.: 190.0
## Max. :582.0 Max. :1500000 Max. :400000 Max. :1214.0
Statistical tests of the variable’s distributions were undertaken and many non-normal distributions were identified. Non-normal distributions were transformed where appropriate using both a log and outlier approach. All variables other than the number of staff (which are categorical variables) were logged in an attempt to make the data distribution more normal. Some alternative transformations (i.e square root) could have improved the distribution of the data, but this would have compromised the inferential power of the model. This is because of the ability of log-log model outputs to be interpreted in percentage terms. In log-log models one must think of the elasticity between variables and any transformations other than logs would compromise the latter inference.
clothingfinal$ssize <- log(clothingfinal$ssize)
clothingfinal$hoursw <- log(clothingfinal$hoursw)
clothingfinal$sales <- log(clothingfinal$sales)
clothingfinal$inv1 <- log(clothingfinal$inv1)
clothingfinal$inv2 <- log(clothingfinal$inv2)
One could log the nown, nfull and npart data but it does not make much sense (i.e one can not make much useful inference) in having a log-log regression output measuring the effect of a percentage increase in nown, nfull and npart on the percentage increase in sales. This is because nown, nfull and npart can not be understood in percentage terms - what is a one percent increase in part time staff for a data frame where the number of staff in shops range from 1-8? To normalise the number of staff variables (categorical variables), a visual approach was used whereby the above histograms were compared to a frequency distribution of each categorical variable. Logical cutoff points were chosen in accordance with the latter and the relevant rows were selected.
The code below was used to select the staff outlier cut-off points. The frequency code used can be seen below however the output is rendered ineffective for ease of visual presentation/digestion.
#sort(clothingfinal$nfull, decreasing = TRUE)#
#sort(clothingfinal$nown, decreasing = TRUE)#
#sort(clothingfinal$npart, decreasing = TRUE)#
Only 8 rows out of 400 were deleted when undertaking the outlier removal method described above which is unlikely to compromise the overall model.
subset(clothingfinal, nown > 3.5)
## sales nown nfull npart hoursw inv1 inv2 ssize
## 33 8.873868 4 3 1.0000 5.247024 12.10071 11.51293 4.941642
## 268 9.865991 10 4 2.2222 6.366470 12.58744 11.17844 4.605170
subset(clothingfinal, npart > 4.5)
## sales nown nfull npart hoursw inv1 inv2 ssize
## 361 8.405968 1 5 5 4.770685 9.750725 9.210340 5.940171
## 391 7.403741 2 5 5 5.953243 11.039223 9.718657 7.101676
## 397 9.721166 1 8 9 4.787492 12.587440 11.750366 5.703782
subset(clothingfinal, nfull > 5.5)
## sales nown nfull npart hoursw inv1 inv2 ssize
## 271 9.138020 1 7 1 5.883322 10.00816 10.037138 6.063785
## 282 8.964696 1 6 2 5.683580 11.03922 9.718657 5.857933
## 397 9.721166 1 8 9 4.787492 12.58744 11.750366 5.703782
clothingfinal <- clothingfinal[-c(33,268,361,391,397,271,282),]
The logged data distribution is better than before, however the modified data distribution is not ideal. As discussed before, alternative transformations (i.e square root) could have improved the distribution of the data, but this would have compromised the inferential power of the model. This is because of the ability of log-log model outputs to be interpreted in percentage terms.
In order to select the variables for the final linear model, we first conducted a visual observation of the relationship between all the variables. In the pairs matrix it is obvious that one must observe the relationship of sales (the dependent variable) to all the other variables by looking along the first row and/or first column. The observed visual relationships of the independent variables with the dependant variable (sales) must be confirmed in the below linear regression model. Namely, that hours worked has a strong positive relationship with sales and that store size has strong negative correlation with sales. At some point the data must also be checked for interactions between the independent variables.
The linear regression confirms the relationships observed in the pairs matrix above. Only hours worked and store size have statistically significant relationship with sales. The low p values indicate that if the study were to be replicated again, it would be highly probable that only store size and hours worked have an effect on sales.
summary(lm(sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + ssize, data = clothingfinal))
##
## Call:
## lm(formula = sales ~ nown + nfull + npart + hoursw + inv1 + inv2 +
## ssize, data = clothingfinal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91088 -0.22619 0.01063 0.29888 1.09616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.69279 0.31566 24.371 <2e-16 ***
## nown -0.02551 0.04986 -0.512 0.609
## nfull -0.01617 0.02692 -0.601 0.548
## npart 0.00726 0.04269 0.170 0.865
## hoursw 0.93755 0.06678 14.040 <2e-16 ***
## inv1 0.01868 0.02206 0.847 0.398
## inv2 -0.02039 0.02033 -1.003 0.317
## ssize -0.71228 0.04501 -15.825 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.425 on 385 degrees of freedom
## Multiple R-squared: 0.4446, Adjusted R-squared: 0.4345
## F-statistic: 44.03 on 7 and 385 DF, p-value: < 2.2e-16
To double check the above relationships, the same linear regression was run on the original untransformed data. The pairs matrix is not a useful indicator for the original data as none of the untransformed variables are centred. The regression results obtained with the original data differ drastically from the ones with the transformed data!
The below regression indicates a strong positive relationship of sales with hours worked, number of part time employees and number of full time employees; and a strong negative relationship with store size. Two added statistically significant effects over the transformed data can be observed when regressing using the original data. There are two hypothesis which can explain the latter, either a type II error has been detected or the outlier selection method used in question 1 is flawed.
A comparison of regression residuals between the original and transformed data should solve the latter problem (with a particular focus on comparing the Cook’s distance graphs). This is undertaken in question 5 and the transformed data is found to have a far more normal distribution. Moreover, the determining factor in choosing to use transformed data is that the R-squared value for the first model is more robust.
summary(lm(sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + ssize, data = clothing))
##
## Call:
## lm(formula = sales ~ nown + nfull + npart + hoursw + inv1 + inv2 +
## ssize, data = clothing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6413.8 -1868.6 -485.3 1204.4 17725.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.976e+03 4.939e+02 8.051 9.92e-15 ***
## nown -2.336e+02 2.591e+02 -0.902 0.36776
## nfull 5.284e+02 1.768e+02 2.989 0.00298 **
## npart 6.895e+02 2.294e+02 3.006 0.00281 **
## hoursw 3.426e+01 3.466e+00 9.886 < 2e-16 ***
## inv1 1.322e-04 1.561e-03 0.085 0.93257
## inv2 -8.848e-04 4.022e-03 -0.220 0.82597
## ssize -2.415e+01 1.702e+00 -14.194 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2914 on 392 degrees of freedom
## Multiple R-squared: 0.4035, Adjusted R-squared: 0.3929
## F-statistic: 37.89 on 7 and 392 DF, p-value: < 2.2e-16
In the covariance matrix below and the pairs matrix above, several notable interaction effects can be observed. Notably, the correlation between number of staff (full time) with hours worked and store size with hours worked.
cov(clothingfinal)
## sales nown nfull npart hoursw
## sales 0.319472052 0.014922553 0.073336565 -0.001641109 0.06850264
## nown 0.014922553 0.201073856 -0.002505334 0.016049997 0.05131224
## nfull 0.073336565 -0.002505334 0.799577191 0.045628934 0.16047857
## npart -0.001641109 0.016049997 0.045628934 0.301622664 0.07058451
## hoursw 0.068502642 0.051312243 0.160478565 0.070584515 0.20890419
## inv1 -0.005457953 0.009679002 0.209934438 0.182591112 0.09941484
## inv2 -0.002773046 -0.006075500 0.151930167 0.043410323 0.07893846
## ssize -0.111518624 0.040036766 0.091830693 0.100221016 0.17438573
## inv1 inv2 ssize
## sales -0.005457953 -0.002773046 -0.11151862
## nown 0.009679002 -0.006075500 0.04003677
## nfull 0.209934438 0.151930167 0.09183069
## npart 0.182591112 0.043410323 0.10022102
## hoursw 0.099414843 0.078938457 0.17438573
## inv1 1.155547990 0.275384906 0.15768494
## inv2 0.275384906 1.204331457 0.07775521
## ssize 0.157684935 0.077755209 0.38551628
summary(lm(sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + ssize + nfull:hoursw + hoursw:ssize , data = clothingfinal))
##
## Call:
## lm(formula = sales ~ nown + nfull + npart + hoursw + inv1 + inv2 +
## ssize + nfull:hoursw + hoursw:ssize, data = clothingfinal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06482 -0.21105 0.01022 0.25530 1.00777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.483141 1.617734 0.299 0.76537
## nown -0.001576 0.049318 -0.032 0.97452
## nfull -0.540059 0.306470 -1.762 0.07883 .
## npart 0.013081 0.041391 0.316 0.75215
## hoursw 2.463461 0.338590 7.276 1.97e-12 ***
## inv1 0.016255 0.021447 0.758 0.44896
## inv2 -0.011541 0.019772 -0.584 0.55977
## ssize 1.067148 0.345377 3.090 0.00215 **
## nfull:hoursw 0.113585 0.060356 1.882 0.06061 .
## hoursw:ssize -0.380208 0.073067 -5.204 3.19e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4117 on 383 degrees of freedom
## Multiple R-squared: 0.4816, Adjusted R-squared: 0.4695
## F-statistic: 39.54 on 9 and 383 DF, p-value: < 2.2e-16
In line with standard statistical methods of 95% confidence intervals and attempting to maintain the highest R-squared value possible; the final model is simplified and many non-valid variables are dropped including an interaction term.
“Everything should be made as simple as possible, but not simpler.”
-Albert Eienstien
The final model below is a less ‘over-fitted’ than the one above which includes all the variables and notable interactions. The R-squared value is slightly reduced, but the model can still predict almost half of the factors affecting sales - a large improvement on the original model with only managed to explain a third of the variance in sales.
summary(lm(sales ~ hoursw + ssize + hoursw:ssize , data = clothingfinal))
##
## Call:
## lm(formula = sales ~ hoursw + ssize + hoursw:ssize, data = clothingfinal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.01664 -0.21229 0.01204 0.27061 1.02196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.68511 1.47492 0.465 0.6425
## hoursw 2.42329 0.31916 7.593 2.35e-13 ***
## ssize 0.76355 0.30814 2.478 0.0136 *
## hoursw:ssize -0.31313 0.06516 -4.806 2.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4119 on 389 degrees of freedom
## Multiple R-squared: 0.4729, Adjusted R-squared: 0.4688
## F-statistic: 116.3 on 3 and 389 DF, p-value: < 2.2e-16
log(sales per sq.m) = B0 + B1 log(hours worked) + B2 log(store size) + B1,2 log(hours worked x store size) + Error
(sales) = 0.7 + 2.42 (hours worked) + 0.76 (store size) - 0.313(hours worked x store size)
Log-log models are interpreted in percentage terms. Therefore:
1% additional hours worked yields on average 2.42% of extra sales per square meter.
1% extra store size yields on average only 0.76% of sales per square meter. Therefore, increasing store size has a marginal effect on store sales per square meter.
However, because store size and hours worked are inter related i.e. a larger store needs more staff (see question 2 section on interactions), when the hours worked increases, the negative effect of increasing store size on sales also increases. The interaction coefficient represents the negative effect of increasing store size with increasing hours worked.
The latter interaction is illustrated in the plot (on the original unlogged data) below. The green line is the effect of store size on sales per square meter, the red line is the linear model including the interaction and the blue line excludes it. With the interaction term, the negative effects of larger store size values (despite more hours worked) decrease the positive effect of hours worked on sales. This most is evident for larger stores with more than 200 hours worked per week and can be considered an example of decreasing returns to scale. Question 7, will test the effect of decreasing returns to scale as store size becomes larger when making sales forecasts.
## Warning in abline(lm3, col = "red"): only using the first two of 4
## regression coefficients
## Warning in abline(lm4, col = "blue"): only using the first two of 3
## regression coefficients
Conceptually one can think of the latter as more hours worked will naturally increase sales (blue + red lines). However,if the store is bigger, even though more hours are worked, a lower level of customer service is expected than in a small store and this will reduce the relative effectiveness of working more hours.
In log-log models one must think of the elasticity between variables and thus the red line includes the effect of larger store size values on extra hours worked.
log(sales per sq.m) = B0 + B1 log(hours worked) + B2 log(store size) + B1,2 log(hours worked x store size) + Error
The partial/marginal t-test will check the significance of individual regression coefficients in the final linear model. This test measures the contribution of adding a variable whilst the remaining variables are included in the model. In the final model seen above, if the test for B1 is carried out then it will check the significance of adding B1 to the model given the inclusion of B2 and B1,2.
To calculate t-values for the individual coefficients, a variance-covariance matrix of the regression coefficients is used. This allows the manual computation of the standard error of each estimated coefficient.
finalmodel<- lm(sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
vcov(finalmodel)
## (Intercept) hoursw ssize hoursw:ssize
## (Intercept) 2.17540028 -0.46626060 -0.44687468 0.095060495
## hoursw -0.46626060 0.10186235 0.09429461 -0.020453373
## ssize -0.44687468 0.09429461 0.09494840 -0.019886224
## hoursw:ssize 0.09506049 -0.02045337 -0.01988622 0.004245694
The null hypothesises for each coefficient are:
H0: B1 = 0
H0: B2 = 0
H0: B1,B2 = 0
For the coefficents to have valid explanatory power and reject the null hypothesis:
Ha: B1 > or < 0
Ha: B2 > or < 0
Ha: B1,2 > or < 0
In order to compute t-values in the form of signal to noise ratios and reject the null-hypothesis, the following test statistics for each individual coefficient are conducted. Beta hat (signal) is divided by its standard error (noise):
t0 B1 = B1 hat / Se B1hat
t0 B2 = B2 hat / Se B2hat
t0 B1,2 = B1,2hat / Se B1,2hat
The standard error (Se) values are calculated as follows using the diagonal values from the variance-covariance matrix above:
#Se B1# =
sqrt(0.101862)
## [1] 0.3191583
#Se B2# =
sqrt(0.094948)
## [1] 0.3081363
#Se B1,2# =
sqrt(0.004245)
## [1] 0.06515366
The null-hypothesis t-values using the above data are as follows:
#T0 B1# =
(2.42329)/(sqrt(0.101862))
## [1] 7.592753
#T0 B2# =
(0.76355)/(sqrt(0.094948))
## [1] 2.477962
#T0 B1,2# =
(-0.31313)/(sqrt(0.004245))
## [1] -4.806023
However, in order to reject the null hypothesis a significance level (alpha) must be determined and the following formula is used where n is the degrees of freedom and k is the matrix size:
t(alpha)/2, n-(k+1)
t(0.5)/2, 398-(3+1)
t0.25, 394 = 1.96 (value obtained from T-table using infinity degrees of freedom and is postivie/negative)
The corresponding t-values must NOT lie between -1.96 and +1.96 so that the null hypothesis can be rejected. This is because the test seeks to find out whether the null-hypothesis is plausible.
t0 B1 = Null rejected @ significance level 0.5
t0 B2 = Null rejected @ significance level 0.5
t0 B1,2 = Null rejected @ significance level 0.5
All variables are significant.
There is little to no heteroskidacity in the model. The residuals seem mostly normally distributed and no specific pattern can be observed around them.
In order to compute White standard errors, the MASS package was installed and a robust linear regression was undertaken.
## Loading required package: MASS
The white standard errors slightly reduced the residual standard error of the overall model from 0.41 to 0.36 and adjusted some of the coefficients. The improvements observed are minor as the model was already corrected for outliers before (see questions 1 & 2).
#WHITE-HUBER#
summary(rlm(sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal))
##
## Call: rlm(formula = sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
## Residuals:
## Min 1Q Median 3Q Max
## -2.078700 -0.232034 -0.003183 0.261051 1.023657
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 0.5065 1.3811 0.3667
## hoursw 2.4550 0.2989 8.2149
## ssize 0.8479 0.2885 2.9388
## hoursw:ssize -0.3291 0.0610 -5.3935
##
## Residual standard error: 0.356 on 389 degrees of freedom
#OLS#
summary(lm(sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal))
##
## Call:
## lm(formula = sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.01664 -0.21229 0.01204 0.27061 1.02196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.68511 1.47492 0.465 0.6425
## hoursw 2.42329 0.31916 7.593 2.35e-13 ***
## ssize 0.76355 0.30814 2.478 0.0136 *
## hoursw:ssize -0.31313 0.06516 -4.806 2.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4119 on 389 degrees of freedom
## Multiple R-squared: 0.4729, Adjusted R-squared: 0.4688
## F-statistic: 116.3 on 3 and 389 DF, p-value: < 2.2e-16
The following code was used to compare the difference in residual distribution between the OLS and White linear models.
resOLS <-residuals(lm(sales ~ hoursw + ssize + hoursw:ssize, data = clothingfinal))
resWHITE <-residuals(rlm(sales ~ hoursw + ssize + hoursw:ssize, data = clothingfinal))
The density plots reveal slight differences in bandwidth between the OLS and White residuals. The larger bandwidth in the White residual distribution plot indicates that the White linear regression slightly improves the distribution of the errors around the mean (i.e. the data is more centred).
“The negative effects of larger store size values (despite more hours worked) will decrease the positive effect of hours worked on sales.” (Question 5 ref. ’Understanding the Interaction Term’)
Sales forecast: A store owner wants to find out if doubling the amount of hours worked has the same effect on sales for a small and large store.
N.B: The logged clothingfinal data had to be exponentiated in order to double hours values. The results are also expressed in actual sales values (exponentiated) in order to measure the effect of decreasing returns to scale as the store becomes larger.
#REAL SALES + HOURS + INV VALUES#
summary(exp(clothingfinal))
## sales nown nfull npart
## Min. : 300 Min. : 2.718 Min. : 2.718 Min. : 2.718
## 1st Qu.: 3904 1st Qu.: 2.718 1st Qu.: 6.842 1st Qu.: 3.609
## Median : 5238 Median : 2.718 Median : 7.068 Median : 3.609
## Mean : 6279 Mean : 3.973 Mean : 13.790 Mean : 5.656
## 3rd Qu.: 7714 3rd Qu.: 3.397 3rd Qu.: 7.389 3rd Qu.: 7.389
## Max. :27000 Max. :20.086 Max. :148.413 Max. :54.598
## hoursw inv1 inv2 ssize
## Min. : 32.0 Min. : 1000 Min. : 350 Min. : 16.0
## 1st Qu.: 80.0 1st Qu.: 20000 1st Qu.: 10000 1st Qu.: 80.0
## Median :104.0 Median : 22207 Median : 22860 Median :120.0
## Mean :118.1 Mean : 56929 Mean : 27398 Mean :146.3
## 3rd Qu.:144.0 3rd Qu.: 62269 3rd Qu.: 22860 3rd Qu.:187.0
## Max. :318.0 Max. :1500000 Max. :400000 Max. :700.0
#SALES LARGE STORE NORMAL HOURS#
m1 = lm(sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
pred.dat = data.frame(ssize=(log(200)), hoursw=(log(118.1)))
S1H1 <- predict(m1, newdata=pred.dat)
exp(S1H1)
## 1
## 4347.241
#SALES LARGE STORE DOUBLE HOURS#
m1 = lm(sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
pred.dat = data.frame(ssize=(log(200)), hoursw=(log(236.2)))
S1H2 <- predict(m1, newdata=pred.dat)
exp(S1H2)
## 1
## 7383.729
# % INCREASE IN SALES#
(((exp(S1H2))-(exp(S1H1)))/((exp(S1H1))))*100
## 1
## 69.84862
#SALES SMALL STORE NORMAL HOURS#
m1 = lm(sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
pred.dat = data.frame(ssize=(log(50)), hoursw=(log(50)))
S1H3 <- predict(m1, newdata=pred.dat)
exp(S1H3)
## 1
## 4272.952
#SALES SMALL STORE DOUBLE HOURS#
m1 = lm(sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
pred.dat = data.frame(ssize=(log(50)), hoursw=(log(100)))
S1H4 <- predict(m1, newdata=pred.dat)
exp(S1H4)
## 1
## 9805.34
# % INCREASE IN SALES#
(((exp(S1H4))-(exp(S1H3)))/((exp(S1H3))))*100
## 1
## 129.4746
It is obvious that the negative effects of larger store size values (despite more hours worked) decrease the positive effect of hours worked on sales. The larger 200 sq.mt store only has a 69% increase in sales when doubling hours worked compared with a 129% increase for the 50sq.mt store. The latter situation can be considered an example of decreasing returns to scale.
The distribution of the model is mostly Gaussian.
res <-residuals(lm(sales ~ hoursw + ssize + hoursw:ssize, data = clothingfinal))
hist(res, breaks = 80)
We seek to analyse whether Russian oil companies are undervalued through the use of financial valuation and political risk metrics. A historical comparison of ‘Price-Earnings’ and ‘Price-Book’ ratios of listed oil companies will seek to display whether Russian oil producers remain chronically undervalued in comparison to their peers given the relative political risk they face.
The political risk variable in this study is measured using the ‘World Bank Political Risk Guide’ (2000-2016).
Historical data on company earnings was extracted through Guru Focus (2017). The ‘Price to Earnings Ratios’ for a given year were calculated using yearly earnings per share (December release, Q1-Q4 included) and average share price in the given year (share price at trading closure / number of days traded). ‘Price-to-Book’ ratios in any given year were calculated using net assets per share (debt - tangible assets) for the reported year (Q4 December release) over average share price in the given year (share price at closing time / number of days traded).
Data was gathered on all listed oil companies across 6 major sub regions. The stock exchanges included were the USA (8 exchanges), Canada (3 exchanges), the UK and Ireland (2 exchanges), Europe (21 exchanges), Asia (12 exchanges), Latin America (9 exchanges) and Africa (10 exchanges). In order to be included in this study an oil company had to have been listed on the latter exchanges at any time between 2000-2016; and with a market capitalisation of $500 million+ at local December exchange rates in the given year.
The PE and PB values originally obtained confirm a common observation that the market displays too much volatility to gather any meaningful inferences from raw year-on-year PE and PB data. In line with accepted statistical methods, the Tukey’s interquartile range test was undertaken and upper bound outliers removed accordingly. Any upper bound outliers identified in the aggregate PE and PB data (all years together 2000-2016) were removed. The PE and PB values were then centred on a year to year basis. The latter was conducted by dividing the all the PE and PB values by avreage PE and PB in any given year.
In order to carry out the study, first the relationship between politcal risk and valuation is observed visually. As confirmed by plot below a more stable country commands higher valuation.
Then, in order to check that Russian companies are undervalued: a box plot with an abline displaying the mean valuation of all companies attempts show that Russian companies are undervalued in relation to thier peers WITHOUT considering the effects of political risk.
The regression/study will therefore seek to check if investors undervalue Russian oil companies GIVEN the relative levels of political risk they face.
Scaled PE/PB = B0 + B1(PROD) + B2(PRRS.STAB) + Error
Where PE and PB are valuation metrics scaled/centred on year-on-year basis and PROD is company production location and PRRS.STAB is the political risk metric. The formula includes all the aggregate valuation data combined for the time period 2000-2016.
Accordingly, the following regressions were run:
summary(lm(formula = PE.AVG.PE ~ PROD + PRRS.STAB, data = dat))
##
## Call:
## lm(formula = PE.AVG.PE ~ PROD + PRRS.STAB, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1200 -0.3865 -0.1191 0.2351 2.9250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.88813 0.40250 2.207 0.02772 *
## PRODAUS -0.02844 0.14660 -0.194 0.84623
## PRODBRA -0.32790 0.18115 -1.810 0.07078 .
## PRODCAN 0.08974 0.16250 0.552 0.58099
## PRODCHA 0.08978 0.13579 0.661 0.50876
## PRODCOL 0.14811 0.23454 0.632 0.52794
## PRODDEN 0.03419 0.61093 0.056 0.95539
## PRODGAB -0.36750 0.23113 -1.590 0.11235
## PRODGBR -0.35166 0.18656 -1.885 0.05992 .
## PRODHUN -0.45731 0.24136 -1.895 0.05861 .
## PRODIDO -0.19196 0.25613 -0.749 0.45386
## PRODIND -0.07124 0.19080 -0.373 0.70901
## PRODISR 0.57340 0.22413 2.558 0.01076 *
## PRODJAP 0.13768 0.18143 0.759 0.44822
## PRODKAZ -0.21494 0.21354 -1.007 0.31457
## PRODKEN -0.28508 0.28395 -1.004 0.31578
## PRODNIG -0.21116 0.21806 -0.968 0.33326
## PRODNOR -0.33289 0.18788 -1.772 0.07693 .
## PRODPAK -0.27380 0.22269 -1.230 0.21935
## PRODPOL -0.21036 0.17598 -1.195 0.23240
## PRODROM -0.09423 0.21133 -0.446 0.65584
## PRODRUS -0.32035 0.15707 -2.040 0.04183 *
## PRODSAF -0.28680 0.18822 -1.524 0.12809
## PRODSAU 0.79571 0.43704 1.821 0.06915 .
## PRODTHI -0.12243 0.17474 -0.701 0.48381
## PRODUAE -0.23401 0.23085 -1.014 0.31114
## PRODUKR -0.87034 0.32181 -2.705 0.00703 **
## PRODUSA -0.25342 0.14794 -1.713 0.08723 .
## PRODVIT -0.17538 0.28980 -0.605 0.54530
## PRRS.STAB 0.35981 0.49119 0.733 0.46413
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5952 on 600 degrees of freedom
## (2549 observations deleted due to missingness)
## Multiple R-squared: 0.1335, Adjusted R-squared: 0.09167
## F-statistic: 3.189 on 29 and 600 DF, p-value: 7.598e-08
summary(lm(formula = PE.AVG.PB ~ PROD + PRRS.STAB, data = dat))
##
## Call:
## lm(formula = PE.AVG.PB ~ PROD + PRRS.STAB, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3077 -0.3890 -0.1351 0.2635 2.9091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.31875 0.36342 0.877 0.380749
## PRODAUS 0.14853 0.15707 0.946 0.344663
## PRODBRA -0.04527 0.17918 -0.253 0.800628
## PRODCAN 0.55620 0.17816 3.122 0.001872 **
## PRODCHA 0.25801 0.14456 1.785 0.074732 .
## PRODCOL 0.73823 0.23699 3.115 0.001915 **
## PRODDEN 0.76283 0.63267 1.206 0.228337
## PRODGAB -0.10748 0.24389 -0.441 0.659584
## PRODGBR -0.29328 0.17121 -1.713 0.087167 .
## PRODHUN -0.30731 0.23428 -1.312 0.190052
## PRODIDO -0.10748 0.21015 -0.511 0.609217
## PRODIND 0.36445 0.18176 2.005 0.045332 *
## PRODISR 1.58029 0.29497 5.357 1.15e-07 ***
## PRODJAP -0.48801 0.18601 -2.624 0.008894 **
## PRODKAZ -0.05486 0.23379 -0.235 0.814530
## PRODKEN -0.49880 0.25629 -1.946 0.052033 .
## PRODNIG 0.22794 0.21566 1.057 0.290894
## PRODNOR 0.01577 0.20446 0.077 0.938556
## PRODPAK 0.74315 0.20538 3.618 0.000318 ***
## PRODPOL -0.53633 0.18021 -2.976 0.003020 **
## PRODROM -0.08807 0.21225 -0.415 0.678333
## PRODRUS -0.12890 0.16875 -0.764 0.445219
## PRODSAF 0.40399 0.19954 2.025 0.043296 *
## PRODSAU 0.27006 0.26681 1.012 0.311811
## PRODTHI 0.39266 0.17873 2.197 0.028352 *
## PRODUAE -0.57668 0.20977 -2.749 0.006132 **
## PRODUKR -0.34002 0.45794 -0.742 0.458045
## PRODUSA -0.07357 0.15483 -0.475 0.634836
## PRODVIT 1.07219 0.30397 3.527 0.000448 ***
## PRRS.STAB 0.95623 0.43585 2.194 0.028569 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.616 on 693 degrees of freedom
## (2456 observations deleted due to missingness)
## Multiple R-squared: 0.2499, Adjusted R-squared: 0.2185
## F-statistic: 7.959 on 29 and 693 DF, p-value: < 2.2e-16
Russian oil companies are estimated by the model to have a lower valuation relative to the mean of other companies given the politial risks they face. However, due to both the high volatility and noise in the data, the answer to the study’s central hypothesis still remains questionable. Only a few estimators are considered to be statistically significant.
Whether the regression confirms that Russian companies remain undervalued, given the relative political risks they face, depends on an individual’s tolerance to statistical fluctuations. For finanical data, having an R-squared value that explains 24% of the variance can be considered rather good in some cases.
Even after having corrected for outliers in the data, the data was found to be very noisy. Only certain regression results for individual countries display statistically significant linear correlations on the dependant variables (PE & PB). LeRoy and Porter (1981) and Shiller’s (1981) argument that the market displays too much volatility to gather any meaningful inferences may well be true in view of the results obtained. One way of further correcting the volatility issue can be to control/reduce the year on year variation of PE and PB ratios.