Peris Wambui : SCM 224-0547/2019
John Leornard : SCM 224-0559/2019
Rodgers Kioko : SCM 224-0583/2017
Ebenezar Kyalo : SCM 224-0757/2017
Francis Thairu : SCM 224-0648/2017
Richard Kinyua : SCM 224-0648/2017
library(ggplot2)
# creating the data frame (values entered as character, then converted to numeric)
df <- data.frame(x1 = c('-1', '2', '4', '6'),
                 x2 = c('0', '0', '1', '1'),
                 y  = c('0', '1', '5', '8'))
df$x1 <- as.numeric(df$x1)
df$x2 <- as.numeric(df$x2)
df$y  <- as.numeric(df$y)
df
## x1 x2 y
## 1 -1 0 0
## 2 2 0 1
## 3 4 1 5
## 4 6 1 8
# previewing the top entries of the data frame
head(df)
## x1 x2 y
## 1 -1 0 0
## 2 2 0 1
## 3 4 1 5
## 4 6 1 8
# checking data composition
str(df)
## 'data.frame': 4 obs. of 3 variables:
## $ x1: num -1 2 4 6
## $ x2: num 0 0 1 1
## $ y : num 0 1 5 8
# checking the dimensions of our dataset
dim(df)
## [1] 4 3
# our data frame has 4 rows and 3 columns
boxplot(df$x1, main = 'Boxplot of x1', col = 'blue')
boxplot(df$x2, main = 'Boxplot of x2', col = 'grey')
boxplot(df$y, main = 'Boxplot of y', col = 'green')
The boxplots show no outliers in our dataset.
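As a numeric cross-check of the boxplots, a small sketch using base R's boxplot.stats(), which lists points falling beyond the whiskers under the 1.5 * IQR rule; we would expect an empty result for each variable:

# points flagged as outliers by the boxplot whisker rule (none expected)
boxplot.stats(df$x1)$out
boxplot.stats(df$x2)$out
boxplot.stats(df$y)$out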
summary(df)
## x1 x2 y
## Min. :-1.00 Min. :0.0 Min. :0.00
## 1st Qu.: 1.25 1st Qu.:0.0 1st Qu.:0.75
## Median : 3.00 Median :0.5 Median :3.00
## Mean : 2.75 Mean :0.5 Mean :3.50
## 3rd Qu.: 4.50 3rd Qu.:1.0 3rd Qu.:5.75
## Max. : 6.00 Max. :1.0 Max. :8.00
plot(df$y, df$x1, pch=16, col='steelblue',
main='y vs. x1',
xlab='y', ylab='x1')
plot(df$y, df$x2, pch=16, col='green',
main='y vs. x2',
xlab='y', ylab='x2')
plot(df$x1, df$x2, pch=16, col='grey',
main='x1 vs. x2',
xlab='x1', ylab='x2')
cor_matrix <- cor(df, use = "complete.obs")
cor_matrix
## x1 x2 y
## x1 1.0000000 0.8700628 0.9511669
## x2 0.8700628 1.0000000 0.9370426
## y 0.9511669 0.9370426 1.0000000
heatmap(cor_matrix)
All the pairwise correlations are high; the correlation between y and x1 (about 0.95) is slightly higher than the correlation between y and x2 (about 0.94).
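Since ggplot2 is loaded above but not otherwise used, an optional sketch of the same correlation matrix as a ggplot2 tile plot is shown below; the long-format reshaping with as.table() is our own choice of presentation, not part of the original analysis:

# reshaping the correlation matrix to long form (columns Var1, Var2, Freq) and plotting it
cor_long <- as.data.frame(as.table(cor(df)))
ggplot(cor_long, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  labs(title = 'Correlation heatmap', x = '', y = '', fill = 'r')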
pairs(df, lower.panel = NULL)
model <- lm(y ~ x1 + x2, data = df)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
## 1 2 3 4
## 0.5385 -0.5385 -0.8077 0.8077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1538 1.0074 0.153 0.904
## x1 0.6923 0.5385 1.286 0.421
## x2 2.8846 2.7849 1.036 0.489
##
## Residual standard error: 1.373 on 1 degrees of freedom
## Multiple R-squared: 0.954, Adjusted R-squared: 0.8621
## F-statistic: 10.38 on 2 and 1 DF, p-value: 0.2144
# creating a multiple linear regression model with y as the output variable
Call shows the function call used to fit the regression model. Residuals gives a quick view of the distribution of the residuals, which by definition have a mean of zero; the median should therefore not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
Coefficients shows the regression beta coefficients and their statistical significance; predictor variables that are significantly associated with the outcome variable are marked with stars.
The residual standard error (RSE), R-squared (R2) and the F-statistic are metrics used to check how well the model fits our data.
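As a sketch, these same fit statistics can be pulled out of the summary object programmatically with base R accessors:

# extracting the key fit statistics from the summary object
s <- summary(model)
s$sigma # residual standard error (RSE)
s$r.squared # multiple R-squared
s$adj.r.squared # adjusted R-squared
s$fstatistic # F-statistic with its numerator and denominator degrees of freedom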
summary(model)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1538462 1.0073693 0.1527207 0.9035205
## x1 0.6923077 0.5384615 1.2857143 0.4208332
## x2 2.8846154 2.7849447 1.0357891 0.4888094
# getting the coefficient table of our model
Evaluating our model
We note from the model summary that the R-squared is about 0.95, meaning that roughly 95% of the variation in y is explained by x1 and x2, which suggests a well-performing model.
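As a sketch of where that figure comes from, R-squared can be reproduced by hand as the proportion of the total variation in y explained by the model:

# by-hand R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((df$y - mean(df$y))^2)
1 - ss_res / ss_tot # should match the Multiple R-squared reported above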
# This is the prediction error rate (RSE divided by the mean of y); the lower it is, the better our model.
sigma(model)/mean(df$y)
## [1] 0.3922323
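A sketch of the same error rate computed by hand from the residuals, to make the calculation explicit (it uses only the fitted model object from above):

# by-hand check of the residual standard error and the error rate
rse <- sqrt(sum(residuals(model)^2) / df.residual(model))
rse / mean(df$y) # should reproduce the value above, about 0.39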
model2 <- lm(x1 ~ y + x2, data = df)
summary(model2)
##
## Call:
## lm(formula = x1 ~ y + x2, data = df)
##
## Residuals:
## 1 2 3 4
## -1.05 1.05 0.35 -0.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.050 1.161 0.043 0.973
## y 0.900 0.700 1.286 0.421
## x2 -0.900 4.482 -0.201 0.874
##
## Residual standard error: 1.565 on 1 degrees of freedom
## Multiple R-squared: 0.9084, Adjusted R-squared: 0.7252
## F-statistic: 4.959 on 2 and 1 DF, p-value: 0.3026
sigma(model2)/mean(df$y)
## [1] 0.4472136
model3 <- lm(x2 ~ y + x1, data = df)
summary(model3)
sigma(model3)/mean(df$y)
## [1] 0.0978232
We can see that when x1 and x2 are used as the output variable we get prediction error rates (RSE relative to the mean of y) of about 45% and 10% respectively.
We conclude that although the variables are all highly correlated, y should be the desired output variable, since it gives us the best overall fit when used as the response variable.
We will therefore use model one (with y as the output variable) as our main model.
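As a sketch of how the chosen model would then be used, predictions for new observations can be obtained with predict(); the x1 and x2 values below are hypothetical and only for illustration:

# predicting y for hypothetical new inputs with the chosen model
new_obs <- data.frame(x1 = c(0, 5), x2 = c(0, 1))
predict(model, newdata = new_obs)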
# We use a two-way ANOVA test because we are modelling two input variables against one output variable
two.way <- aov(y ~ x1 + x2, data = df)
summary(two.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 37.09 37.09 19.682 0.141
## x2 1 2.02 2.02 1.073 0.489
## Residuals 1 1.88 1.88
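As a cross-check (a sketch, not part of the original analysis), the same sequential ANOVA table can also be obtained directly from the fitted linear model:

# sequential (Type I) sums of squares from the linear model; should match the aov table above
anova(model)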
par(mfrow=c(2,2))
plot(two.way)
par(mfrow=c(1,1))
# The red line in the residuals vs fitted plot shows the smoothed trend of the residuals; it should lie roughly flat around zero
# The normal Q-Q plot compares the quantiles of our model's residuals with the theoretical quantiles of a normal distribution
# The closer the points lie to the reference line (a slope close to 1), the better the normality assumption is satisfied
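As a complementary, and with only four observations very rough, numerical check of the normality assumption, a Shapiro-Wilk test could be run on the residuals (a sketch, not part of the original analysis):

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
shapiro.test(residuals(two.way))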
Since our model has both a low RSE and a high R-squared value, we conclude that it is a good model.