The given dataset shows the last 10 instances of rewards paid during the festive season. We would like to establish a regression model for predicting the reward amount in the upcoming festival using Simple Linear Regression.

dim(Festival)
## [1] 10  4
str(Festival)
## 'data.frame':    10 obs. of  4 variables:
##  $ Instance: int  1 2 3 4 5 6 7 8 9 10
##  $ Years   : int  4 3 1 3 3 2 6 5 1 3
##  $ Salary  : int  1700 5400 3200 4400 4950 2550 3600 6000 4500 5200
##  $ Amount  : int  250 850 550 400 700 250 600 900 450 650
summary(Festival)
##     Instance         Years          Salary         Amount     
##  Min.   : 1.00   Min.   :1.00   Min.   :1700   Min.   :250.0  
##  1st Qu.: 3.25   1st Qu.:2.25   1st Qu.:3300   1st Qu.:412.5  
##  Median : 5.50   Median :3.00   Median :4450   Median :575.0  
##  Mean   : 5.50   Mean   :3.10   Mean   :4150   Mean   :560.0  
##  3rd Qu.: 7.75   3rd Qu.:3.75   3rd Qu.:5138   3rd Qu.:687.5  
##  Max.   :10.00   Max.   :6.00   Max.   :6000   Max.   :900.0
attach(Festival)

With only one variable available (the reward amount itself), the best available estimate of the next reward is the mean of the rewards for the last 10 instances. The mean reward amount is 560 rupees.
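
The mean-only baseline can be made concrete with a short sketch. It assumes the data frame is rebuilt by hand from the str(Festival) output above, since the loading step is not shown; the names baseline, sst and baseline_sd are illustrative only.

```r
# Data rebuilt from the str(Festival) output above (an assumption about
# how the data frame was originally loaded).
Festival <- data.frame(
  Instance = 1:10,
  Years    = c(4, 3, 1, 3, 3, 2, 6, 5, 1, 3),
  Salary   = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount   = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)

# Baseline: predict the mean reward for every instance.
baseline <- mean(Festival$Amount)

# Total sum of squares: squared error of the mean-only prediction.
sst <- sum((Festival$Amount - baseline)^2)

# Typical size of the baseline error (sample standard deviation).
baseline_sd <- sd(Festival$Amount)

print(c(mean = baseline, SST = sst, sd = baseline_sd))
```

Any regression model is worth keeping only if its residual standard error is clearly below this baseline spread.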

par(mfrow =c(2,2))
boxplot(Salary, horizontal = TRUE, main="Boxplot of Salary")
hist(Salary)
boxplot(Amount, horizontal = TRUE, main="Boxplot of Amount")
hist(Amount)

par(mfrow=c(1,1))

plot(Salary, Amount, xlab="Salary", ylab="Amount", col="Red")

cor.test(Salary, Amount)
## 
##  Pearson's product-moment correlation
## 
## data:  Salary and Amount
## t = 4.6391, df = 8, p-value = 0.001668
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4848411 0.9647888
## sample estimates:
##       cor 
## 0.8538222
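
The t statistic and p-value reported by cor.test can be reproduced from the correlation coefficient itself. The sketch below assumes the data frame is rebuilt from the str(Festival) output above; t_stat and p_value are illustrative names.

```r
# Data rebuilt from the str(Festival) output above (assumption).
Festival <- data.frame(
  Instance = 1:10,
  Years    = c(4, 3, 1, 3, 3, 2, 6, 5, 1, 3),
  Salary   = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount   = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)

n <- nrow(Festival)
r <- cor(Festival$Salary, Festival$Amount)

# Test statistic for H0: true correlation is 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(r = r, t = t_stat, p = p_value))
```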

We will now predict the festival reward amount (y variable) using salary (x variable).

fit1 <- lm(Amount~Salary, data=Festival)
summary(fit1)
## 
## Call:
## lm(formula = Amount ~ Salary, data = Festival)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -195.41  -77.22   31.85  104.21  124.56 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -27.79227  132.69660  -0.209  0.83934   
## Salary        0.14164    0.03053   4.639  0.00167 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124.7 on 8 degrees of freedom
## Multiple R-squared:  0.729,  Adjusted R-squared:  0.6951 
## F-statistic: 21.52 on 1 and 8 DF,  p-value: 0.001668
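
The coefficients in the summary can be cross-checked against the closed-form least squares formulas for a single predictor. This is a minimal sketch, assuming the data frame is rebuilt from the str(Festival) output above; slope and intercept are illustrative names.

```r
# Data rebuilt from the str(Festival) output above (assumption).
Festival <- data.frame(
  Instance = 1:10,
  Years    = c(4, 3, 1, 3, 3, 2, 6, 5, 1, 3),
  Salary   = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount   = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)

x <- Festival$Salary
y <- Festival$Amount

# Slope = sum of cross-deviations / sum of squared deviations of x.
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# The fitted line passes through the point of means.
intercept <- mean(y) - slope * mean(x)

# Compare with lm().
fit1 <- lm(Amount ~ Salary, data = Festival)
print(rbind(by_hand = c(intercept, slope), lm = unname(coef(fit1))))
```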

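From the above regression model, we can infer the following. The fitted line is Amount = -27.79 + 0.1416 x Salary, so each additional unit of salary is associated with an increase of about 0.14 in the reward amount. The Salary coefficient is significant (p = 0.00167), while the intercept is not (p = 0.839). The model explains about 72.9% of the variation in Amount (adjusted R-squared 0.6951), with a residual standard error of 124.7 on 8 degrees of freedom.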

The above plot shows the predicted and actual values of the reward amount.
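
The plotting code is not shown in this document. One possible way to draw actual against predicted values, and to predict the reward for a new case, is sketched below. It assumes the data frame is rebuilt from the str(Festival) output above; the salary of 5000 is a hypothetical input chosen only for illustration.

```r
# Data rebuilt from the str(Festival) output above (assumption).
Festival <- data.frame(
  Instance = 1:10,
  Years    = c(4, 3, 1, 3, 3, 2, 6, 5, 1, 3),
  Salary   = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount   = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)

fit1 <- lm(Amount ~ Salary, data = Festival)
predicted <- fitted(fit1)

# Actual values in red, predicted values in blue on the fitted line.
plot(Festival$Salary, Festival$Amount,
     xlab = "Salary", ylab = "Amount", col = "red", pch = 19)
points(Festival$Salary, predicted, col = "blue", pch = 4)
abline(fit1, col = "blue")
legend("topleft", legend = c("Actual", "Predicted"),
       col = c("red", "blue"), pch = c(19, 4))

# Prediction for a hypothetical salary, with a 95% prediction interval.
new_case <- data.frame(Salary = 5000)
pred <- predict(fit1, newdata = new_case, interval = "prediction")
print(pred)
```

With only 10 observations the prediction interval is wide, so the point estimate should be read as a rough guide rather than a precise figure.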

Multiple Linear Regression

plot(Festival)

fit2 <- lm(Amount ~ Years, data = Festival)
summary(fit2)
## 
## Call:
## lm(formula = Amount ~ Years, data = Festival)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -350.87 -139.52   35.37  132.04  294.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   419.21     163.55   2.563   0.0335 *
## Years          45.41      47.41   0.958   0.3661  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 226.9 on 8 degrees of freedom
## Multiple R-squared:  0.1029, Adjusted R-squared:  -0.009237 
## F-statistic: 0.9176 on 1 and 8 DF,  p-value: 0.3661

The second x variable, number of years of experience, does not contribute significantly to the variation in Amount, as seen from the above SLR model. Its coefficient is not significant (p = 0.3661), and the R-squared value is low (0.1029).

fit3 <- lm(Amount ~ Salary+Years, data=Festival)
summary(fit3)
## 
## Call:
## lm(formula = Amount ~ Salary + Years, data = Festival)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -191.19  -55.02   11.78   31.84  185.49 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -105.47879  143.81006  -0.733  0.48711   
## Salary         0.13717    0.02988   4.591  0.00251 **
## Years         31.03888   25.49884   1.217  0.26294   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121.1 on 7 degrees of freedom
## Multiple R-squared:  0.7764, Adjusted R-squared:  0.7125 
## F-statistic: 12.15 on 2 and 7 DF,  p-value: 0.00529

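The combined model using multiple independent variables is as shown above. The following inferences can be made. Salary remains significant (p = 0.00251), while Years is not significant (p = 0.26294) once Salary is in the model. R-squared rises from 0.729 to 0.7764 and adjusted R-squared from 0.6951 to 0.7125, a modest improvement over the salary-only model. The overall model is significant (F = 12.15 on 2 and 7 DF, p = 0.00529).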
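
Whether adding Years is worthwhile can be checked by comparing the salary-only and combined models directly. The sketch below assumes the data frame is rebuilt from the str(Festival) output above; adj_r2 is an illustrative helper, not part of the original analysis.

```r
# Data rebuilt from the str(Festival) output above (assumption).
Festival <- data.frame(
  Instance = 1:10,
  Years    = c(4, 3, 1, 3, 3, 2, 6, 5, 1, 3),
  Salary   = c(1700, 5400, 3200, 4400, 4950, 2550, 3600, 6000, 4500, 5200),
  Amount   = c(250, 850, 550, 400, 700, 250, 600, 900, 450, 650)
)

fit1 <- lm(Amount ~ Salary, data = Festival)
fit3 <- lm(Amount ~ Salary + Years, data = Festival)

# Adjusted R-squared: penalises R-squared for the number of predictors p.
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- length(residuals(fit))
  p  <- length(coef(fit)) - 1
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}
print(c(fit1 = adj_r2(fit1), fit3 = adj_r2(fit3)))

# Partial F-test for the nested models: does Years add explanatory power?
comparison <- anova(fit1, fit3)
print(comparison)
```

Because only one term is added, the p-value of this F-test matches the p-value of the Years coefficient in the combined model.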
