0.1 Group Assignment.


0.2 Group Members:

  1. Peris Wambui : SCM 224-0547/2019

    wambui.mwenja@students.jkuat.ac.ke

  2. John Leornard : SCM 224-0559/2019

    Johnleonardkioko@gmail.com

  3. Rodgers Kioko : SCM 224-0583/2017

    kioko.rodgers@students.jkuat.ac.ke

  4. Ebenezar Kyalo : SCM 224-0757/2017

    ebemkyalo710@gmail.com

  5. Francis Thairu : SCM 224-0648/2017

    frankithairu@gmail.com

  6. Richard Kinyua : SCM 224-0648/2017

    ritchiekinyua35@gmail.com

Loading the relevant libraries

library(ggplot2)

1 Reading Our Data

# Creating the data frame with numeric columns
df <- data.frame(x1 = c(-1, 2, 4, 6),
                 x2 = c(0, 0, 1, 1),
                 y  = c(0, 1, 5, 8))
df 
##   x1 x2 y
## 1 -1  0 0
## 2  2  0 1
## 3  4  1 5
## 4  6  1 8

1.1 Checking our data

#previewing our top entries
head(df)
##   x1 x2 y
## 1 -1  0 0
## 2  2  0 1
## 3  4  1 5
## 4  6  1 8
# checking data composition
str(df)
## 'data.frame':    4 obs. of  3 variables:
##  $ x1: num  -1 2 4 6
##  $ x2: num  0 0 1 1
##  $ y : num  0 1 5 8
#checking dimension of our dataset
dim(df)
## [1] 4 3
# our data frame has 4 rows and 3 columns

1.2 Data cleaning

1.2.1 Checking for outliers

boxplot(df$x1, main= 'Boxplot of x1',col="blue")

boxplot(df$x2, main= 'Boxplot of x2',col="grey")

boxplot(df$y, main= 'Boxplot of y',col="green")

We have no outliers in our dataset.
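As a numeric cross-check of the boxplots, base R's boxplot.stats() lists any points that fall beyond the whiskers; a minimal sketch:

# Points flagged as outliers by the boxplot rule (expected to be empty here)
boxplot.stats(df$x1)$out
boxplot.stats(df$x2)$out
boxplot.stats(df$y)$out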


1.3 Exploratory Data Analysis

summary(df)
##        x1              x2            y       
##  Min.   :-1.00   Min.   :0.0   Min.   :0.00  
##  1st Qu.: 1.25   1st Qu.:0.0   1st Qu.:0.75  
##  Median : 3.00   Median :0.5   Median :3.00  
##  Mean   : 2.75   Mean   :0.5   Mean   :3.50  
##  3rd Qu.: 4.50   3rd Qu.:1.0   3rd Qu.:5.75  
##  Max.   : 6.00   Max.   :1.0   Max.   :8.00

1.3.1 Scatter plots

plot(df$y, df$x1, pch=16, col='steelblue',
     main='y vs. x1',
     xlab='y', ylab='x1')

plot(df$y, df$x2, pch=16, col='green',
     main='y vs. x2',
     xlab='y', ylab='x2')

plot(df$x1, df$x2, pch=16, col='grey',
     main='x1 vs. x2',
     xlab='x1', ylab='x2')
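Since ggplot2 is loaded above, the same relationships can also be visualised with it; a minimal sketch for y against x1:

# Scatter plot of y against x1 using ggplot2
ggplot(df, aes(x = x1, y = y)) +
  geom_point(colour = "steelblue", size = 3) +
  labs(title = "y vs. x1", x = "x1", y = "y")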

1.3.2 Getting Correlation

cor_matrix <- cor(df, use = "complete.obs")
cor_matrix
##           x1        x2         y
## x1 1.0000000 0.8700628 0.9511669
## x2 0.8700628 1.0000000 0.9370426
## y  0.9511669 0.9370426 1.0000000
heatmap(cor_matrix)

There is a higher correlation between y and x1 (0.95) than between y and x2 (0.94).
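The two comparisons can also be read off directly; a minimal sketch (the values are the same as in the correlation matrix above):

# Pairwise correlations with y, for direct comparison
cor(df$y, df$x1)   # 0.9511669 in the matrix above
cor(df$y, df$x2)   # 0.9370426 in the matrix above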


1.3.3 Multicollinearity

pairs(df, lower.panel = NULL)
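As a numeric companion to the pairs plot: with only two predictors, the variance inflation factor (VIF) reduces to 1/(1 - r^2), where r is the correlation between x1 and x2 (0.87 in the matrix above); a minimal sketch in base R:

# VIF for two predictors; values above about 5-10 usually signal
# problematic multicollinearity
1 / (1 - cor(df$x1, df$x2)^2)   # about 4.1 here, so multicollinearity is moderate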

2 Modelling

model <- lm(y ~ x1 + x2, data = df)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Residuals:
##       1       2       3       4 
##  0.5385 -0.5385 -0.8077  0.8077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1538     1.0074   0.153    0.904
## x1            0.6923     0.5385   1.286    0.421
## x2            2.8846     2.7849   1.036    0.489
## 
## Residual standard error: 1.373 on 1 degrees of freedom
## Multiple R-squared:  0.954,  Adjusted R-squared:  0.8621 
## F-statistic: 10.38 on 2 and 1 DF,  p-value: 0.2144
# Creating a multiple linear model with y as the output variable.
  1. Call: shows the function call used to compute the regression model.

  2. Residuals: give a quick view of the distribution of the residuals, which by definition have a mean of zero. The median should therefore not be far from zero, and the minimum and maximum should be roughly equal in absolute value.

  3. Coefficients: shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked by stars.

  4. Residual standard error (RSE), R-squared (R2) and the F-statistic are metrics used to check how well the model fits our data; they can be extracted directly from the fitted model, as shown in the sketch after the coefficient table below.

summary(model)$coefficient
##              Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) 0.1538462  1.0073693 0.1527207 0.9035205
## x1          0.6923077  0.5384615 1.2857143 0.4208332
## x2          2.8846154  2.7849447 1.0357891 0.4888094
# Extracting the coefficient table from the model summary
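The fit metrics from point 4 above (RSE, R-squared and the F-statistic) can be pulled out of the fitted model directly; a minimal sketch:

# Extracting the individual fit metrics from the model object
sigma(model)                   # residual standard error (RSE)
summary(model)$r.squared       # R-squared
summary(model)$adj.r.squared   # adjusted R-squared
summary(model)$fstatistic      # F-statistic and its degrees of freedom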

We now evaluate our model.

2.0.1 Checking R-squared

We note from the model summary that the R-squared is 0.954, meaning that about 95% of the variation in y is explained by x1 and x2, which suggests that our model fits the data well.
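As a sanity check, the same R-squared can be computed from first principles as one minus the ratio of the residual sum of squares to the total sum of squares; a minimal sketch:

# R-squared from first principles: 1 - SS_residual / SS_total
1 - sum(residuals(model)^2) / sum((df$y - mean(df$y))^2)   # about 0.954, matching the summary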

2.0.2 Checking for RSE

# The RSE measures the error in prediction: the lower the RSE, the better our model.
# Dividing by the mean of y expresses it as an error rate relative to the typical value of y.

sigma(model)/mean(df$y)
## [1] 0.3922323
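As a usage example, the fitted model can also generate predictions for new observations; a small sketch (the x1 and x2 values below are made up purely for illustration):

# Predicting y for a hypothetical new observation
new_obs <- data.frame(x1 = 3, x2 = 1)
predict(model, newdata = new_obs)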

2.1 Challenging our solution

2.1.1 Using X1 as our output variable

model2 <- lm(x1 ~ y + x2, data = df)
summary(model2)
## 
## Call:
## lm(formula = x1 ~ y + x2, data = df)
## 
## Residuals:
##     1     2     3     4 
## -1.05  1.05  0.35 -0.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.050      1.161   0.043    0.973
## y              0.900      0.700   1.286    0.421
## x2            -0.900      4.482  -0.201    0.874
## 
## Residual standard error: 1.565 on 1 degrees of freedom
## Multiple R-squared:  0.9084, Adjusted R-squared:  0.7252 
## F-statistic: 4.959 on 2 and 1 DF,  p-value: 0.3026
# RSE relative to the mean of this model's response (x1)
sigma(model2)/mean(df$x1)
## [1] 0.5691809

2.1.2 Using X2 as our output variable

model3 <- lm(x2 ~ y + x1, data = df)
summary(model3)
## 
## Call:
## lm(formula = x2 ~ y + x1, data = df)
## 
## Residuals:
##        1        2        3        4 
## -0.03349 -0.08373  0.28469 -0.16746 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.009569     0.2540  -0.038    0.976
## y            0.179426     0.1732   1.036    0.489
## x1          -0.043062     0.2145  -0.201    0.874
## 
## Residual standard error: 0.3424 on 1 degrees of freedom
## Multiple R-squared:  0.8828, Adjusted R-squared:  0.6483 
## F-statistic: 3.765 on 2 and 1 DF,  p-value: 0.3424
# RSE relative to the mean of this model's response (x2)
sigma(model3)/mean(df$x2)
## [1] 0.6847626
  • We can see that when x1 and x2 are used as the output variable, the relative prediction errors (RSE divided by the mean of the response) are about 57% and 68% respectively, compared with about 39% when y is the output, which indicates lower-performing models.

  • We conclude that although the variables are all highly correlated, y should be the desired output variable, since it gives the best result when used as the response to fit our model.

  • We will therefore use model one (with y as the output variable) as our main model.

3 ANOVA Test

# We use a two-way ANOVA because we are modelling two input variables against one output variable
two.way <- aov(y ~ x1 + x2, data = df)

summary(two.way)
##             Df Sum Sq Mean Sq F value Pr(>F)
## x1           1  37.09   37.09  19.682  0.141
## x2           1   2.02    2.02   1.073  0.489
## Residuals    1   1.88    1.88
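The same sequential (Type I) sums of squares can also be obtained directly from the linear model fitted in section 2; a minimal sketch:

# anova() on the lm fit reproduces the ANOVA table above
anova(model)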

3.1 Checking for homoscedasticity

par(mfrow=c(2,2))
plot(two.way)

par(mfrow=c(1,1))

# The red line in the Residuals vs Fitted plot represents the mean of the residuals.
# The normal Q-Q plot compares the quantiles of our residuals with the quantiles expected under a normal distribution.
# The closer the points lie to the reference line (a slope close to 1), the better our model meets its assumptions.
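For a numeric check to accompany the plots, a common test for heteroscedasticity is the Breusch-Pagan test; a minimal sketch, assuming the lmtest package is installed (with only four observations the test has very little power, so treat the result as indicative at best):

# Breusch-Pagan test: the null hypothesis is constant residual variance
library(lmtest)
bptest(model)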

4 Conclusion

Since our model has both a low RSE and a high R-squared value, we conclude that it is a good model.