Festival Reward Prediction

The given dataset shows the last 10 instances of rewards paid during the festive season. We would like to establish a regression model for predicting the reward amount in the upcoming festival, starting with Simple Linear Regression.

dim(Festival)

## [1] 10  4

str(Festival)

## 'data.frame':    10 obs. of  4 variables:
##  $ Instance: int  1 2 3 4 5 6 7 8 9 10
##  $ Years   : int  4 3 1 3 3 2 6 5 1 3
##  $ Salary  : int  1700 5400 3200 4400 4950 2550 3600 6000 4500 5200
##  $ Amount  : int  250 850 550 400 700 250 600 900 450 650

summary(Festival)

##     Instance         Years          Salary         Amount     
##  Min.   : 1.00   Min.   :1.00   Min.   :1700   Min.   :250.0  
##  1st Qu.: 3.25   1st Qu.:2.25   1st Qu.:3300   1st Qu.:412.5  
##  Median : 5.50   Median :3.00   Median :4450   Median :575.0  
##  Mean   : 5.50   Mean   :3.10   Mean   :4150   Mean   :560.0  
##  3rd Qu.: 7.75   3rd Qu.:3.75   3rd Qu.:5138   3rd Qu.:687.5  
##  Max.   :10.00   Max.   :6.00   Max.   :6000   Max.   :900.0

attach(Festival)

With only the reward amounts to go on and no predictor variable, the best available estimate of the reward is the mean of the rewards over the last 10 instances. The mean reward amount is 560 rupees.
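
The mean-only baseline can be written as an intercept-only regression, which makes it easier to compare with the models fitted later. The sketch below is illustrative: the data frame is re-entered by hand from the str(Festival) output, and the names fit0, baseline and sst are assumptions that do not appear in the original analysis.

```r
# Data re-entered from the str(Festival) output so the sketch is self-contained.
Festival <- data.frame(
  Instance = 1:10,
  Years    = c(4, 3, 1, 3, 3, 2, 6, 5, 1, 3),
  Salary   = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount   = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)

# Intercept-only model: the fitted value for every instance is the sample mean.
fit0 <- lm(Amount ~ 1, data = Festival)
baseline <- unname(coef(fit0)[1])   # equals mean(Festival$Amount), i.e. 560

# Total sum of squares around the mean: the variation that any
# predictor has to explain in order to improve on this baseline.
sst <- sum((Festival$Amount - baseline)^2)

baseline
sst
```

Any model with a predictor is judged by how much of this total variation it removes.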

par(mfrow =c(2,2))
boxplot(Salary, horizontal = TRUE, main="Boxplot of Salary")
hist(Salary)
boxplot(Amount, horizontal = TRUE, main="Boxplot of Amount")
hist(Amount)

par(mfrow=c(1,1))

plot(Salary, Amount, xlab="Salary", ylab="Amount", col="Red")

cor.test(Salary, Amount)

## 
##  Pearson's product-moment correlation
## 
## data:  Salary and Amount
## t = 4.6391, df = 8, p-value = 0.001668
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4848411 0.9647888
## sample estimates:
##       cor 
## 0.8538222

We will now predict the festival reward amount (y variable) using salary (x variable).

fit1 <- lm(Amount~Salary, data=Festival)
summary(fit1)

## 
## Call:
## lm(formula = Amount ~ Salary, data = Festival)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -195.41  -77.22   31.85  104.21  124.56 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -27.79227  132.69660  -0.209  0.83934   
## Salary        0.14164    0.03053   4.639  0.00167 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124.7 on 8 degrees of freedom
## Multiple R-squared:  0.729,  Adjusted R-squared:  0.6951 
## F-statistic: 21.52 on 1 and 8 DF,  p-value: 0.001668

From the above regression model, we can infer the following:

Estimates for b0 and b1: b0 = -27.79, b1 = 0.14164, so the fitted line is Amount = -27.79 + 0.14164 * Salary.

Coefficient of determination (r-squared) = 0.729. This means that 72.9% of the variation in the reward amount is explained by the independent variable Salary.
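
To make the two inferences above concrete, the following sketch recomputes b0, b1 and r-squared by hand and checks them against lm(). It is a minimal illustration, not part of the original analysis: the data are re-entered from the str(Festival) output, and the names x, y, sse, sst and r2_manual are assumptions introduced here.

```r
# Salary and Amount re-entered from the str(Festival) output above.
Festival <- data.frame(
  Salary = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)
x <- Festival$Salary
y <- Festival$Amount

# Least-squares estimates computed by hand.
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# R-squared as the share of total variation removed by the fitted line.
sse <- sum((y - (b0 + b1 * x))^2)   # residual sum of squares
sst <- sum((y - mean(y))^2)         # total sum of squares
r2_manual <- 1 - sse / sst

# Cross-checks against lm() and against the squared Pearson correlation.
fit1 <- lm(Amount ~ Salary, data = Festival)
stopifnot(
  isTRUE(all.equal(unname(coef(fit1)), c(b0, b1))),
  isTRUE(all.equal(r2_manual, summary(fit1)$r.squared)),
  isTRUE(all.equal(r2_manual, cor(x, y)^2))
)

round(c(b0 = b0, b1 = b1, r2 = r2_manual), 5)
```

In simple linear regression r-squared is the square of the Pearson correlation, which is why the cor.test() estimate of 0.8538 and the r-squared of 0.729 tell the same story.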

Predicted <- predict(fit1, newdata = Festival)
Predicted

##        1        2        3        4        5        6        7        8 
## 212.9901 737.0459 425.4451 595.4092 673.3094 333.3813 482.0998 822.0279 
##        9       10 
## 609.5728 708.7185

Pred_Actual <- data.frame(Actual = Festival$Amount, Predicted)
Pred_Actual

##    Actual Predicted
## 1     250  212.9901
## 2     850  737.0459
## 3     550  425.4451
## 4     400  595.4092
## 5     700  673.3094
## 6     250  333.3813
## 7     600  482.0998
## 8     900  822.0279
## 9     450  609.5728
## 10    650  708.7185

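# Set ylim from both series so actual values above the largest prediction are not clipped
plot(Predicted, col="Red", ylim=range(c(Predicted, Amount)), xlab="Instance", ylab="Reward amount")
lines(Predicted, col="Red")
lines(Amount, col="Blue")
legend("topleft", legend=c("Predicted", "Actual"), col=c("Red", "Blue"), lty=1)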

The above plot shows the predicted (red) and actual (blue) values of the reward amount for each of the 10 instances.
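
The predictions so far are for the 10 instances already observed. For the upcoming festival, the same model can be applied to a salary that is not in the data. The sketch below assumes a hypothetical salary of 5000 rupees (not a value from the dataset) and also asks predict() for a 95% prediction interval; the names new_emp and pred are illustrative.

```r
# Salary and Amount re-entered from the str(Festival) output above.
Festival <- data.frame(
  Salary = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)
fit1 <- lm(Amount ~ Salary, data = Festival)

# Hypothetical new case: this salary is an assumption, not an observed value.
new_emp <- data.frame(Salary = 5000)

# Point prediction plus a 95% prediction interval for a single new reward.
# The result is a one-row matrix with columns fit, lwr and upr.
pred <- predict(fit1, newdata = new_emp, interval = "prediction", level = 0.95)
pred
```

With only 10 observations and a residual standard error of about 125 rupees, the interval is wide, so the point prediction should be read as a rough guide rather than an exact figure.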

Multiple Linear Regression

plot(Festival)

fit2 <- lm(Amount ~ Years, data = Festival)
summary(fit2)

## 
## Call:
## lm(formula = Amount ~ Years, data = Festival)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -350.87 -139.52   35.37  132.04  294.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   419.21     163.55   2.563   0.0335 *
## Years          45.41      47.41   0.958   0.3661  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 226.9 on 8 degrees of freedom
## Multiple R-squared:  0.1029, Adjusted R-squared:  -0.009237 
## F-statistic: 0.9176 on 1 and 8 DF,  p-value: 0.3661

The second x variable, number of years of experience, does not contribute significantly to the variation in the reward amount, as seen from the above SLR model: its coefficient has a p-value of 0.3661 and the r-squared value is only 0.1029.

fit3 <- lm(Amount ~ Salary+Years, data=Festival)
summary(fit3)

## 
## Call:
## lm(formula = Amount ~ Salary + Years, data = Festival)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -191.19  -55.02   11.78   31.84  185.49 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -105.47879  143.81006  -0.733  0.48711   
## Salary         0.13717    0.02988   4.591  0.00251 **
## Years         31.03888   25.49884   1.217  0.26294   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121.1 on 7 degrees of freedom
## Multiple R-squared:  0.7764, Adjusted R-squared:  0.7125 
## F-statistic: 12.15 on 2 and 7 DF,  p-value: 0.00529

The combined model using multiple independent variables is as shown above. The following inferences can be made:

The intercept value b0 = -105.47879.
Coefficient for Salary b1 = 0.13717. Keeping Years of service constant, every one-rupee increase in Salary increases the reward amount by 0.13717 rupees.
Coefficient for Years of service b2 = 31.03888. Keeping Salary constant, every additional year of service increases the reward amount by 31.04 rupees.
From the p-values, Salary is statistically significant (p = 0.00251) while Years of service is not (p = 0.26294), so Salary explains the variation in the reward amount better than Years of service.
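
One way to check whether adding Years to the Salary model is worthwhile is a nested-model F-test with anova(), alongside a comparison of adjusted r-squared. This is a sketch of that check under the assumption that the data match the str(Festival) output; the names cmp, p_years and adj_r2 are introduced here for illustration.

```r
# Data re-entered from the str(Festival) output above.
Festival <- data.frame(
  Years  = c(4, 3, 1, 3, 3, 2, 6, 5, 1, 3),
  Salary = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)
fit1 <- lm(Amount ~ Salary, data = Festival)           # Salary only
fit3 <- lm(Amount ~ Salary + Years, data = Festival)   # Salary and Years

# Partial F-test: does Years reduce the residual sum of squares
# by more than chance would, given Salary is already in the model?
cmp <- anova(fit1, fit3)
p_years <- cmp[["Pr(>F)"]][2]

# Adjusted r-squared penalises the extra coefficient.
adj_r2 <- c(salary_only  = summary(fit1)$adj.r.squared,
            salary_years = summary(fit3)$adj.r.squared)

cmp
adj_r2
```

Because only one term is added, the F-test p-value is the same as the p-value of the Years coefficient in summary(fit3). The small gain in adjusted r-squared is consistent with Salary carrying most of the explanatory power.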

Milind DESAI

September 15, 2018