Oscar Ortmans & Sasha Bouloudnine



Question 1.



The original clothing dataset was imported and sorted to include only the study variables. The code in some of these steps is rendered ineffective for ease of visual presentation. A summary of the study variables was undertaken along with a visual presentation of their distribution using histograms.

clothing <- read.csv("~/Downloads/clothing.csv", header = TRUE)
#sapply(clothing, class)#
#head(clothing)#
#ncol(clothing)#
#str(clothing)#
clothingfinal <- clothing[c(3,5,6,7,9,11,12,13)]


Study variables:
names(clothingfinal)
## [1] "sales"  "nown"   "nfull"  "npart"  "hoursw" "inv1"   "inv2"   "ssize"


Descriptive statistics:

Some columns (in particular inv1 & inv2) have very large relative differences between their means and medians. This is evident in the statistical summary output given by R. This means that at some point the specific column must be checked for skewness/kurtosis and possibly transformed.

summary(clothingfinal)
##      sales            nown            nfull           npart      
##  Min.   :  300   Min.   : 1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 3904   1st Qu.: 1.000   1st Qu.:1.923   1st Qu.:1.283  
##  Median : 5279   Median : 1.000   Median :1.956   Median :1.283  
##  Mean   : 6335   Mean   : 1.284   Mean   :2.069   Mean   :1.566  
##  3rd Qu.: 7740   3rd Qu.: 1.295   3rd Qu.:2.066   3rd Qu.:2.000  
##  Max.   :27000   Max.   :10.000   Max.   :8.000   Max.   :9.000  
##      hoursw           inv1              inv2            ssize       
##  Min.   : 32.0   Min.   :   1000   Min.   :   350   Min.   :  16.0  
##  1st Qu.: 80.0   1st Qu.:  20000   1st Qu.: 10000   1st Qu.:  80.0  
##  Median :104.0   Median :  22207   Median : 22860   Median : 120.0  
##  Mean   :121.1   Mean   :  58257   Mean   : 27829   Mean   : 151.1  
##  3rd Qu.:145.2   3rd Qu.:  62269   3rd Qu.: 22860   3rd Qu.: 190.0  
##  Max.   :582.0   Max.   :1500000   Max.   :400000   Max.   :1214.0





Statistical tests of the variable’s distributions were undertaken and many non-normal distributions were identified. Non-normal distributions were transformed where appropriate using both a log and outlier approach. All variables other than the number of staff (which are categorical variables) were logged in an attempt to make the data distribution more normal. Some alternative transformations (i.e square root) could have improved the distribution of the data, but this would have compromised the inferential power of the model. This is because of the ability of log-log model outputs to be interpreted in percentage terms. In log-log models one must think of the elasticity between variables and any transformations other than logs would compromise the latter inference.



Visual distribution of study variables:








Variable transformation:


clothingfinal$ssize <- log(clothingfinal$ssize)
clothingfinal$hoursw <- log(clothingfinal$hoursw)
clothingfinal$sales <- log(clothingfinal$sales)
clothingfinal$inv1 <- log(clothingfinal$inv1)
clothingfinal$inv2 <- log(clothingfinal$inv2)



One could log the nown, nfull and npart data but it does not make much sense (i.e one can not make much useful inference) in having a log-log regression output measuring the effect of a percentage increase in nown, nfull and npart on the percentage increase in sales. This is because nown, nfull and npart can not be understood in percentage terms - what is a one percent increase in part time staff for a data frame where the number of staff in shops range from 1-8? To normalise the number of staff variables (categorical variables), a visual approach was used whereby the above histograms were compared to a frequency distribution of each categorical variable. Logical cutoff points were chosen in accordance with the latter and the relevant rows were selected.

The code below was used to select the staff outlier cut-off points. The frequency code used can be seen below however the output is rendered ineffective for ease of visual presentation/digestion.

#sort(clothingfinal$nfull, decreasing = TRUE)#
#sort(clothingfinal$nown, decreasing = TRUE)#
#sort(clothingfinal$npart, decreasing = TRUE)#




Selection and omission of outliers in categorical data:


Only 8 rows out of 400 were deleted when undertaking the outlier removal method described above which is unlikely to compromise the overall model.



subset(clothingfinal, nown > 3.5)
##        sales nown nfull  npart   hoursw     inv1     inv2    ssize
## 33  8.873868    4     3 1.0000 5.247024 12.10071 11.51293 4.941642
## 268 9.865991   10     4 2.2222 6.366470 12.58744 11.17844 4.605170
subset(clothingfinal, npart > 4.5)
##        sales nown nfull npart   hoursw      inv1      inv2    ssize
## 361 8.405968    1     5     5 4.770685  9.750725  9.210340 5.940171
## 391 7.403741    2     5     5 5.953243 11.039223  9.718657 7.101676
## 397 9.721166    1     8     9 4.787492 12.587440 11.750366 5.703782
subset(clothingfinal, nfull > 5.5)
##        sales nown nfull npart   hoursw     inv1      inv2    ssize
## 271 9.138020    1     7     1 5.883322 10.00816 10.037138 6.063785
## 282 8.964696    1     6     2 5.683580 11.03922  9.718657 5.857933
## 397 9.721166    1     8     9 4.787492 12.58744 11.750366 5.703782
clothingfinal <- clothingfinal[-c(33,268,361,391,397,271,282),]




The logged data distribution is better than before, however the modified data distribution is not ideal. As discussed before, alternative transformations (i.e square root) could have improved the distribution of the data, but this would have compromised the inferential power of the model. This is because of the ability of log-log model outputs to be interpreted in percentage terms.


Data Before Transformation:


Data After Transformation:







Question 2:




In order to select the variables for the final linear model, we first conducted a visual observation of the relationship between all the variables. In the pairs matrix it is obvious that one must observe the relationship of sales (the dependent variable) to all the other variables by looking along the first row and/or first column. The observed visual relationships of the independent variables with the dependant variable (sales) must be confirmed in the below linear regression model. Namely, that hours worked has a strong positive relationship with sales and that store size has strong negative correlation with sales. At some point the data must also be checked for interactions between the independent variables.


Pairs Matrix of Study Variables




The linear regression confirms the relationships observed in the pairs matrix above. Only hours worked and store size have statistically significant relationship with sales. The low p values indicate that if the study were to be replicated again, it would be highly probable that only store size and hours worked have an effect on sales.


summary(lm(sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + ssize,  data = clothingfinal))
## 
## Call:
## lm(formula = sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + 
##     ssize, data = clothingfinal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91088 -0.22619  0.01063  0.29888  1.09616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.69279    0.31566  24.371   <2e-16 ***
## nown        -0.02551    0.04986  -0.512    0.609    
## nfull       -0.01617    0.02692  -0.601    0.548    
## npart        0.00726    0.04269   0.170    0.865    
## hoursw       0.93755    0.06678  14.040   <2e-16 ***
## inv1         0.01868    0.02206   0.847    0.398    
## inv2        -0.02039    0.02033  -1.003    0.317    
## ssize       -0.71228    0.04501 -15.825   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.425 on 385 degrees of freedom
## Multiple R-squared:  0.4446, Adjusted R-squared:  0.4345 
## F-statistic: 44.03 on 7 and 385 DF,  p-value: < 2.2e-16



To double check the above relationships, the same linear regression was run on the original untransformed data. The pairs matrix is not a useful indicator for the original data as none of the untransformed variables are centred. The regression results obtained with the original data differ drastically from the ones with the transformed data!



The below regression indicates a strong positive relationship of sales with hours worked, number of part time employees and number of full time employees; and a strong negative relationship with store size. Two added statistically significant effects over the transformed data can be observed when regressing using the original data. There are two hypothesis which can explain the latter, either a type II error has been detected or the outlier selection method used in question 1 is flawed.

A comparison of regression residuals between the original and transformed data should solve the latter problem (with a particular focus on comparing the Cook’s distance graphs). This is undertaken in question 5 and the transformed data is found to have a far more normal distribution. Moreover, the determining factor in choosing to use transformed data is that the R-squared value for the first model is more robust.


summary(lm(sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + ssize,  data = clothing))
## 
## Call:
## lm(formula = sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + 
##     ssize, data = clothing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6413.8 -1868.6  -485.3  1204.4 17725.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.976e+03  4.939e+02   8.051 9.92e-15 ***
## nown        -2.336e+02  2.591e+02  -0.902  0.36776    
## nfull        5.284e+02  1.768e+02   2.989  0.00298 ** 
## npart        6.895e+02  2.294e+02   3.006  0.00281 ** 
## hoursw       3.426e+01  3.466e+00   9.886  < 2e-16 ***
## inv1         1.322e-04  1.561e-03   0.085  0.93257    
## inv2        -8.848e-04  4.022e-03  -0.220  0.82597    
## ssize       -2.415e+01  1.702e+00 -14.194  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2914 on 392 degrees of freedom
## Multiple R-squared:  0.4035, Adjusted R-squared:  0.3929 
## F-statistic: 37.89 on 7 and 392 DF,  p-value: < 2.2e-16




Interaction Effects:


In the covariance matrix below and the pairs matrix above, several notable interaction effects can be observed. Notably, the correlation between number of staff (full time) with hours worked and store size with hours worked.

cov(clothingfinal)
##               sales         nown        nfull        npart     hoursw
## sales   0.319472052  0.014922553  0.073336565 -0.001641109 0.06850264
## nown    0.014922553  0.201073856 -0.002505334  0.016049997 0.05131224
## nfull   0.073336565 -0.002505334  0.799577191  0.045628934 0.16047857
## npart  -0.001641109  0.016049997  0.045628934  0.301622664 0.07058451
## hoursw  0.068502642  0.051312243  0.160478565  0.070584515 0.20890419
## inv1   -0.005457953  0.009679002  0.209934438  0.182591112 0.09941484
## inv2   -0.002773046 -0.006075500  0.151930167  0.043410323 0.07893846
## ssize  -0.111518624  0.040036766  0.091830693  0.100221016 0.17438573
##                inv1         inv2       ssize
## sales  -0.005457953 -0.002773046 -0.11151862
## nown    0.009679002 -0.006075500  0.04003677
## nfull   0.209934438  0.151930167  0.09183069
## npart   0.182591112  0.043410323  0.10022102
## hoursw  0.099414843  0.078938457  0.17438573
## inv1    1.155547990  0.275384906  0.15768494
## inv2    0.275384906  1.204331457  0.07775521
## ssize   0.157684935  0.077755209  0.38551628



Plots of Interaction Effects:


Plot of Hours Worked and Number of Employees (Full-Time):

Plot of Hours Worked and Number of Employees (Full-Time):




Interaction Effect Model:


summary(lm(sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + ssize + nfull:hoursw + hoursw:ssize ,  data = clothingfinal))
## 
## Call:
## lm(formula = sales ~ nown + nfull + npart + hoursw + inv1 + inv2 + 
##     ssize + nfull:hoursw + hoursw:ssize, data = clothingfinal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.06482 -0.21105  0.01022  0.25530  1.00777 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.483141   1.617734   0.299  0.76537    
## nown         -0.001576   0.049318  -0.032  0.97452    
## nfull        -0.540059   0.306470  -1.762  0.07883 .  
## npart         0.013081   0.041391   0.316  0.75215    
## hoursw        2.463461   0.338590   7.276 1.97e-12 ***
## inv1          0.016255   0.021447   0.758  0.44896    
## inv2         -0.011541   0.019772  -0.584  0.55977    
## ssize         1.067148   0.345377   3.090  0.00215 ** 
## nfull:hoursw  0.113585   0.060356   1.882  0.06061 .  
## hoursw:ssize -0.380208   0.073067  -5.204 3.19e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4117 on 383 degrees of freedom
## Multiple R-squared:  0.4816, Adjusted R-squared:  0.4695 
## F-statistic: 39.54 on 9 and 383 DF,  p-value: < 2.2e-16





Selecting variables to use in final model:


In line with standard statistical methods of 95% confidence intervals and attempting to maintain the highest R-squared value possible; the final model is simplified and many non-valid variables are dropped including an interaction term.



“Everything should be made as simple as possible, but not simpler.”

-Albert Eienstien



The final model below is a less ‘over-fitted’ than the one above which includes all the variables and notable interactions. The R-squared value is slightly reduced, but the model can still predict almost half of the factors affecting sales - a large improvement on the original model with only managed to explain a third of the variance in sales.



summary(lm(sales ~ hoursw + ssize + hoursw:ssize ,  data = clothingfinal))
## 
## Call:
## lm(formula = sales ~ hoursw + ssize + hoursw:ssize, data = clothingfinal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01664 -0.21229  0.01204  0.27061  1.02196 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.68511    1.47492   0.465   0.6425    
## hoursw        2.42329    0.31916   7.593 2.35e-13 ***
## ssize         0.76355    0.30814   2.478   0.0136 *  
## hoursw:ssize -0.31313    0.06516  -4.806 2.21e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4119 on 389 degrees of freedom
## Multiple R-squared:  0.4729, Adjusted R-squared:  0.4688 
## F-statistic: 116.3 on 3 and 389 DF,  p-value: < 2.2e-16







Question 3:





Model:

log(sales per sq.m) = B0 + B1 log(hours worked) + B2 log(store size) + B1,2 log(hours worked x store size) + Error


Model Estimates in %:

(sales) = 0.7 + 2.42 (hours worked) + 0.76 (store size) - 0.313(hours worked x store size)


Log-log models are interpreted in percentage terms. Therefore:


  • 1% additional hours worked yields on average 2.42% of extra sales per square meter.

  • 1% extra store size yields on average only 0.76% of sales per square meter. Therefore, increasing store size has a marginal effect on store sales per square meter.


However, because store size and hours worked are inter related i.e. a larger store needs more staff (see question 2 section on interactions), when the hours worked increases, the negative effect of increasing store size on sales also increases. The interaction coefficient represents the negative effect of increasing store size with increasing hours worked.

The latter interaction is illustrated in the plot (on the original unlogged data) below. The green line is the effect of store size on sales per square meter, the red line is the linear model including the interaction and the blue line excludes it. With the interaction term, the negative effects of larger store size values (despite more hours worked) decrease the positive effect of hours worked on sales. This most is evident for larger stores with more than 200 hours worked per week and can be considered an example of decreasing returns to scale. Question 7, will test the effect of decreasing returns to scale as store size becomes larger when making sales forecasts.

Plot of Linear Models & Interactions:


## Warning in abline(lm3, col = "red"): only using the first two of 4
## regression coefficients
## Warning in abline(lm4, col = "blue"): only using the first two of 3
## regression coefficients


Conceptually one can think of the latter as more hours worked will naturally increase sales (blue + red lines). However,if the store is bigger, even though more hours are worked, a lower level of customer service is expected than in a small store and this will reduce the relative effectiveness of working more hours.

In log-log models one must think of the elasticity between variables and thus the red line includes the effect of larger store size values on extra hours worked.





Question 4:




Final model:


log(sales per sq.m) = B0 + B1 log(hours worked) + B2 log(store size) + B1,2 log(hours worked x store size) + Error



The partial/marginal t-test will check the significance of individual regression coefficients in the final linear model. This test measures the contribution of adding a variable whilst the remaining variables are included in the model. In the final model seen above, if the test for B1 is carried out then it will check the significance of adding B1 to the model given the inclusion of B2 and B1,2.


To calculate t-values for the individual coefficients, a variance-covariance matrix of the regression coefficients is used. This allows the manual computation of the standard error of each estimated coefficient.



finalmodel<- lm(sales ~ hoursw + ssize + ssize:hoursw,  data = clothingfinal)
vcov(finalmodel)
##              (Intercept)      hoursw       ssize hoursw:ssize
## (Intercept)   2.17540028 -0.46626060 -0.44687468  0.095060495
## hoursw       -0.46626060  0.10186235  0.09429461 -0.020453373
## ssize        -0.44687468  0.09429461  0.09494840 -0.019886224
## hoursw:ssize  0.09506049 -0.02045337 -0.01988622  0.004245694



The null hypothesises for each coefficient are:

H0: B1 = 0

H0: B2 = 0

H0: B1,B2 = 0



For the coefficents to have valid explanatory power and reject the null hypothesis:



Ha: B1 > or < 0

Ha: B2 > or < 0

Ha: B1,2 > or < 0



In order to compute t-values in the form of signal to noise ratios and reject the null-hypothesis, the following test statistics for each individual coefficient are conducted. Beta hat (signal) is divided by its standard error (noise):

t0 B1 = B1 hat / Se B1hat

t0 B2 = B2 hat / Se B2hat

t0 B1,2 = B1,2hat / Se B1,2hat



The standard error (Se) values are calculated as follows using the diagonal values from the variance-covariance matrix above:

   #Se B1# = 
          sqrt(0.101862)
## [1] 0.3191583
   #Se B2# = 
          sqrt(0.094948)
## [1] 0.3081363
   #Se B1,2# = 
          sqrt(0.004245)
## [1] 0.06515366


The null-hypothesis t-values using the above data are as follows:


   #T0 B1# = 
          (2.42329)/(sqrt(0.101862))
## [1] 7.592753
   #T0 B2# = 
          (0.76355)/(sqrt(0.094948))
## [1] 2.477962
   #T0 B1,2# = 
         (-0.31313)/(sqrt(0.004245))
## [1] -4.806023






However, in order to reject the null hypothesis a significance level (alpha) must be determined and the following formula is used where n is the degrees of freedom and k is the matrix size:


t(alpha)/2, n-(k+1)


t(0.5)/2, 398-(3+1)


t0.25, 394 = 1.96 (value obtained from T-table using infinity degrees of freedom and is postivie/negative)




The corresponding t-values must NOT lie between -1.96 and +1.96 so that the null hypothesis can be rejected. This is because the test seeks to find out whether the null-hypothesis is plausible.


t0 B1 = Null rejected @ significance level 0.5


t0 B2 = Null rejected @ significance level 0.5


t0 B1,2 = Null rejected @ significance level 0.5



Conclusion:


All variables are significant.





Question 5.





There is little to no heteroskidacity in the model. The residuals seem mostly normally distributed and no specific pattern can be observed around them.

Plots of Residuals:








Question 6.





In order to compute White standard errors, the MASS package was installed and a robust linear regression was undertaken.


## Loading required package: MASS

The white standard errors slightly reduced the residual standard error of the overall model from 0.41 to 0.36 and adjusted some of the coefficients. The improvements observed are minor as the model was already corrected for outliers before (see questions 1 & 2).


#WHITE-HUBER#
summary(rlm(sales ~ hoursw + ssize + ssize:hoursw,  data = clothingfinal))
## 
## Call: rlm(formula = sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -2.078700 -0.232034 -0.003183  0.261051  1.023657 
## 
## Coefficients:
##              Value   Std. Error t value
## (Intercept)   0.5065  1.3811     0.3667
## hoursw        2.4550  0.2989     8.2149
## ssize         0.8479  0.2885     2.9388
## hoursw:ssize -0.3291  0.0610    -5.3935
## 
## Residual standard error: 0.356 on 389 degrees of freedom



#OLS#
summary(lm(sales ~ hoursw + ssize + ssize:hoursw,  data = clothingfinal))
## 
## Call:
## lm(formula = sales ~ hoursw + ssize + ssize:hoursw, data = clothingfinal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01664 -0.21229  0.01204  0.27061  1.02196 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.68511    1.47492   0.465   0.6425    
## hoursw        2.42329    0.31916   7.593 2.35e-13 ***
## ssize         0.76355    0.30814   2.478   0.0136 *  
## hoursw:ssize -0.31313    0.06516  -4.806 2.21e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4119 on 389 degrees of freedom
## Multiple R-squared:  0.4729, Adjusted R-squared:  0.4688 
## F-statistic: 116.3 on 3 and 389 DF,  p-value: < 2.2e-16




The following code was used to compare the difference in residual distribution between the OLS and White linear models.


resOLS <-residuals(lm(sales ~ hoursw + ssize + hoursw:ssize,  data = clothingfinal))

resWHITE <-residuals(rlm(sales ~ hoursw + ssize + hoursw:ssize,  data = clothingfinal))



The density plots reveal slight differences in bandwidth between the OLS and White residuals. The larger bandwidth in the White residual distribution plot indicates that the White linear regression slightly improves the distribution of the errors around the mean (i.e. the data is more centred).



Plot of OLS vs White Linear Model Residuals:






Question 7.



“The negative effects of larger store size values (despite more hours worked) will decrease the positive effect of hours worked on sales.” (Question 5 ref. ’Understanding the Interaction Term’)

Sales forecast: A store owner wants to find out if doubling the amount of hours worked has the same effect on sales for a small and large store.

N.B: The logged clothingfinal data had to be exponentiated in order to double hours values. The results are also expressed in actual sales values (exponentiated) in order to measure the effect of decreasing returns to scale as the store becomes larger.


#REAL SALES + HOURS + INV VALUES#
summary(exp(clothingfinal))
##      sales            nown            nfull             npart       
##  Min.   :  300   Min.   : 2.718   Min.   :  2.718   Min.   : 2.718  
##  1st Qu.: 3904   1st Qu.: 2.718   1st Qu.:  6.842   1st Qu.: 3.609  
##  Median : 5238   Median : 2.718   Median :  7.068   Median : 3.609  
##  Mean   : 6279   Mean   : 3.973   Mean   : 13.790   Mean   : 5.656  
##  3rd Qu.: 7714   3rd Qu.: 3.397   3rd Qu.:  7.389   3rd Qu.: 7.389  
##  Max.   :27000   Max.   :20.086   Max.   :148.413   Max.   :54.598  
##      hoursw           inv1              inv2            ssize      
##  Min.   : 32.0   Min.   :   1000   Min.   :   350   Min.   : 16.0  
##  1st Qu.: 80.0   1st Qu.:  20000   1st Qu.: 10000   1st Qu.: 80.0  
##  Median :104.0   Median :  22207   Median : 22860   Median :120.0  
##  Mean   :118.1   Mean   :  56929   Mean   : 27398   Mean   :146.3  
##  3rd Qu.:144.0   3rd Qu.:  62269   3rd Qu.: 22860   3rd Qu.:187.0  
##  Max.   :318.0   Max.   :1500000   Max.   :400000   Max.   :700.0



Doubling Work Hours for a Large 200sq.mt Store:
#SALES LARGE STORE NORMAL HOURS#
m1 = lm(sales ~ hoursw + ssize + ssize:hoursw,  data = clothingfinal)
pred.dat = data.frame(ssize=(log(200)), hoursw=(log(118.1)))
S1H1 <- predict(m1, newdata=pred.dat)
exp(S1H1)
##        1 
## 4347.241


#SALES LARGE STORE DOUBLE HOURS#
m1 = lm(sales ~ hoursw + ssize + ssize:hoursw,  data = clothingfinal)
pred.dat = data.frame(ssize=(log(200)), hoursw=(log(236.2)))
S1H2 <- predict(m1, newdata=pred.dat)
exp(S1H2)
##        1 
## 7383.729
Resulting % Increase in Sales:
# % INCREASE IN SALES#
(((exp(S1H2))-(exp(S1H1)))/((exp(S1H1))))*100
##        1 
## 69.84862



Doubling Work Hours for a Small 50sq.mt Store:
#SALES SMALL STORE NORMAL HOURS#
m1 = lm(sales ~ hoursw + ssize + ssize:hoursw,  data = clothingfinal)
pred.dat = data.frame(ssize=(log(50)), hoursw=(log(50)))
S1H3 <- predict(m1, newdata=pred.dat)
exp(S1H3)
##        1 
## 4272.952


#SALES SMALL STORE DOUBLE HOURS#
m1 = lm(sales ~ hoursw + ssize + ssize:hoursw,  data = clothingfinal)
pred.dat = data.frame(ssize=(log(50)), hoursw=(log(100)))
S1H4 <- predict(m1, newdata=pred.dat)
exp(S1H4)
##       1 
## 9805.34
Resulting % Increase in Sales:
# % INCREASE IN SALES#
(((exp(S1H4))-(exp(S1H3)))/((exp(S1H3))))*100
##        1 
## 129.4746



It is obvious that the negative effects of larger store size values (despite more hours worked) decrease the positive effect of hours worked on sales. The larger 200 sq.mt store only has a 69% increase in sales when doubling hours worked compared with a 129% increase for the 50sq.mt store. The latter situation can be considered an example of decreasing returns to scale.




Question 8.




The distribution of the model is mostly Gaussian.

res <-residuals(lm(sales ~ hoursw + ssize + hoursw:ssize,  data = clothingfinal))
hist(res, breaks = 80)




Part 3:



Bring you Own Data: Russian Oil Companies



We seek to analyse whether Russian oil companies are undervalued through the use of financial valuation and political risk metrics. A historical comparison of ‘Price-Earnings’ and ‘Price-Book’ ratios of listed oil companies will seek to display whether Russian oil producers remain chronically undervalued in comparison to their peers given the relative political risk they face.


The political risk variable in this study is measured using the ‘World Bank Political Risk Guide’ (2000-2016).


Historical data on company earnings was extracted through Guru Focus (2017). The ‘Price to Earnings Ratios’ for a given year were calculated using yearly earnings per share (December release, Q1-Q4 included) and average share price in the given year (share price at trading closure / number of days traded). ‘Price-to-Book’ ratios in any given year were calculated using net assets per share (debt - tangible assets) for the reported year (Q4 December release) over average share price in the given year (share price at closing time / number of days traded).


Data was gathered on all listed oil companies across 6 major sub regions. The stock exchanges included were the USA (8 exchanges), Canada (3 exchanges), the UK and Ireland (2 exchanges), Europe (21 exchanges), Asia (12 exchanges), Latin America (9 exchanges) and Africa (10 exchanges). In order to be included in this study an oil company had to have been listed on the latter exchanges at any time between 2000-2016; and with a market capitalisation of $500 million+ at local December exchange rates in the given year.


The PE and PB values originally obtained confirm a common observation that the market displays too much volatility to gather any meaningful inferences from raw year-on-year PE and PB data. In line with accepted statistical methods, the Tukey’s interquartile range test was undertaken and upper bound outliers removed accordingly. Any upper bound outliers identified in the aggregate PE and PB data (all years together 2000-2016) were removed. The PE and PB values were then centred on a year to year basis. The latter was conducted by dividing the all the PE and PB values by avreage PE and PB in any given year.


In order to carry out the study, first the relationship between politcal risk and valuation is observed visually. As confirmed by plot below a more stable country commands higher valuation.


Linear Relationship Between Valuation and Political Stability:



Then, in order to check that Russian companies are undervalued: a box plot with an abline displaying the mean valuation of all companies attempts show that Russian companies are undervalued in relation to thier peers WITHOUT considering the effects of political risk.

Price Earnings and Production Country:

Price Book Ratio and Production Country:



The regression/study will therefore seek to check if investors undervalue Russian oil companies GIVEN the relative levels of political risk they face.


The following formula was tested:


Scaled PE/PB = B0 + B1(PROD) + B2(PRRS.STAB) + Error


Where PE and PB are valuation metrics scaled/centred on year-on-year basis and PROD is company production location and PRRS.STAB is the political risk metric. The formula includes all the aggregate valuation data combined for the time period 2000-2016.


Accordingly, the following regressions were run:

summary(lm(formula = PE.AVG.PE ~ PROD + PRRS.STAB, data = dat))
## 
## Call:
## lm(formula = PE.AVG.PE ~ PROD + PRRS.STAB, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1200 -0.3865 -0.1191  0.2351  2.9250 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.88813    0.40250   2.207  0.02772 * 
## PRODAUS     -0.02844    0.14660  -0.194  0.84623   
## PRODBRA     -0.32790    0.18115  -1.810  0.07078 . 
## PRODCAN      0.08974    0.16250   0.552  0.58099   
## PRODCHA      0.08978    0.13579   0.661  0.50876   
## PRODCOL      0.14811    0.23454   0.632  0.52794   
## PRODDEN      0.03419    0.61093   0.056  0.95539   
## PRODGAB     -0.36750    0.23113  -1.590  0.11235   
## PRODGBR     -0.35166    0.18656  -1.885  0.05992 . 
## PRODHUN     -0.45731    0.24136  -1.895  0.05861 . 
## PRODIDO     -0.19196    0.25613  -0.749  0.45386   
## PRODIND     -0.07124    0.19080  -0.373  0.70901   
## PRODISR      0.57340    0.22413   2.558  0.01076 * 
## PRODJAP      0.13768    0.18143   0.759  0.44822   
## PRODKAZ     -0.21494    0.21354  -1.007  0.31457   
## PRODKEN     -0.28508    0.28395  -1.004  0.31578   
## PRODNIG     -0.21116    0.21806  -0.968  0.33326   
## PRODNOR     -0.33289    0.18788  -1.772  0.07693 . 
## PRODPAK     -0.27380    0.22269  -1.230  0.21935   
## PRODPOL     -0.21036    0.17598  -1.195  0.23240   
## PRODROM     -0.09423    0.21133  -0.446  0.65584   
## PRODRUS     -0.32035    0.15707  -2.040  0.04183 * 
## PRODSAF     -0.28680    0.18822  -1.524  0.12809   
## PRODSAU      0.79571    0.43704   1.821  0.06915 . 
## PRODTHI     -0.12243    0.17474  -0.701  0.48381   
## PRODUAE     -0.23401    0.23085  -1.014  0.31114   
## PRODUKR     -0.87034    0.32181  -2.705  0.00703 **
## PRODUSA     -0.25342    0.14794  -1.713  0.08723 . 
## PRODVIT     -0.17538    0.28980  -0.605  0.54530   
## PRRS.STAB    0.35981    0.49119   0.733  0.46413   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5952 on 600 degrees of freedom
##   (2549 observations deleted due to missingness)
## Multiple R-squared:  0.1335, Adjusted R-squared:  0.09167 
## F-statistic: 3.189 on 29 and 600 DF,  p-value: 7.598e-08
summary(lm(formula = PE.AVG.PB ~ PROD + PRRS.STAB, data = dat))
## 
## Call:
## lm(formula = PE.AVG.PB ~ PROD + PRRS.STAB, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3077 -0.3890 -0.1351  0.2635  2.9091 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.31875    0.36342   0.877 0.380749    
## PRODAUS      0.14853    0.15707   0.946 0.344663    
## PRODBRA     -0.04527    0.17918  -0.253 0.800628    
## PRODCAN      0.55620    0.17816   3.122 0.001872 ** 
## PRODCHA      0.25801    0.14456   1.785 0.074732 .  
## PRODCOL      0.73823    0.23699   3.115 0.001915 ** 
## PRODDEN      0.76283    0.63267   1.206 0.228337    
## PRODGAB     -0.10748    0.24389  -0.441 0.659584    
## PRODGBR     -0.29328    0.17121  -1.713 0.087167 .  
## PRODHUN     -0.30731    0.23428  -1.312 0.190052    
## PRODIDO     -0.10748    0.21015  -0.511 0.609217    
## PRODIND      0.36445    0.18176   2.005 0.045332 *  
## PRODISR      1.58029    0.29497   5.357 1.15e-07 ***
## PRODJAP     -0.48801    0.18601  -2.624 0.008894 ** 
## PRODKAZ     -0.05486    0.23379  -0.235 0.814530    
## PRODKEN     -0.49880    0.25629  -1.946 0.052033 .  
## PRODNIG      0.22794    0.21566   1.057 0.290894    
## PRODNOR      0.01577    0.20446   0.077 0.938556    
## PRODPAK      0.74315    0.20538   3.618 0.000318 ***
## PRODPOL     -0.53633    0.18021  -2.976 0.003020 ** 
## PRODROM     -0.08807    0.21225  -0.415 0.678333    
## PRODRUS     -0.12890    0.16875  -0.764 0.445219    
## PRODSAF      0.40399    0.19954   2.025 0.043296 *  
## PRODSAU      0.27006    0.26681   1.012 0.311811    
## PRODTHI      0.39266    0.17873   2.197 0.028352 *  
## PRODUAE     -0.57668    0.20977  -2.749 0.006132 ** 
## PRODUKR     -0.34002    0.45794  -0.742 0.458045    
## PRODUSA     -0.07357    0.15483  -0.475 0.634836    
## PRODVIT      1.07219    0.30397   3.527 0.000448 ***
## PRRS.STAB    0.95623    0.43585   2.194 0.028569 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.616 on 693 degrees of freedom
##   (2456 observations deleted due to missingness)
## Multiple R-squared:  0.2499, Adjusted R-squared:  0.2185 
## F-statistic: 7.959 on 29 and 693 DF,  p-value: < 2.2e-16



Results:


Russian oil companies are estimated by the model to have a lower valuation relative to the mean of other companies given the politial risks they face. However, due to both the high volatility and noise in the data, the answer to the study’s central hypothesis still remains questionable. Only a few estimators are considered to be statistically significant.


Whether the regression confirms that Russian companies remain undervalued, given the relative political risks they face, depends on an individual’s tolerance to statistical fluctuations. For finanical data, having an R-squared value that explains 24% of the variance can be considered rather good in some cases.


Even after having corrected for outliers in the data, the data was found to be very noisy. Only certain regression results for individual countries display statistically significant linear correlations on the dependant variables (PE & PB). LeRoy and Porter (1981) and Shiller’s (1981) argument that the market displays too much volatility to gather any meaningful inferences may well be true in view of the results obtained. One way of further correcting the volatility issue can be to control/reduce the year on year variation of PE and PB ratios.