Peris Wambui : SCM 224-0547/2019
John Leornard : SCM 224-0559/2019
Rodgers Kioko : SCM 224-0583/2017
Ebenezar Kyalo : SCM 224-0757/2017
Francis Thairu : SCM 224-0648/2017
Richard Kinyua : SCM 224-0648/2017
library(ggplot2)
# creating the data frame (values entered as character, then converted to numeric)
df <- data.frame(x1 = c('-1', '2', '4', '6'),
                 x2 = c('0', '0', '1', '1'),
                 y  = c('0', '1', '5', '8'))
df$x1 <- as.numeric(df$x1)
df$x2 <- as.numeric(df$x2)
df$y  <- as.numeric(df$y)
df
## x1 x2 y
## 1 -1 0 0
## 2 2 0 1
## 3 4 1 5
## 4 6 1 8
# previewing the top entries of the data frame
head(df)
## x1 x2 y
## 1 -1 0 0
## 2 2 0 1
## 3 4 1 5
## 4 6 1 8
# checking data composition
str(df)
## 'data.frame': 4 obs. of 3 variables:
## $ x1: num -1 2 4 6
## $ x2: num 0 0 1 1
## $ y : num 0 1 5 8
# checking the dimensions of our dataset
dim(df)
## [1] 4 3
# our data frame has 4 rows and 3 columns
boxplot(df$x1, main = 'Boxplot of x1', col = 'blue')
boxplot(df$x2, main = 'Boxplot of x2', col = 'grey')
boxplot(df$y, main = 'Boxplot of y', col = 'green')
The boxplots show no outliers in our dataset.
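As a numeric cross-check of the boxplots, a small sketch using base R's boxplot.stats(), which lists points falling beyond the whiskers under the 1.5 * IQR rule; we would expect an empty result for each variable:

# points flagged as outliers by the boxplot whisker rule (none expected)
boxplot.stats(df$x1)$out
boxplot.stats(df$x2)$out
boxplot.stats(df$y)$out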
summary(df)
## x1 x2 y
## Min. :-1.00 Min. :0.0 Min. :0.00
## 1st Qu.: 1.25 1st Qu.:0.0 1st Qu.:0.75
## Median : 3.00 Median :0.5 Median :3.00
## Mean : 2.75 Mean :0.5 Mean :3.50
## 3rd Qu.: 4.50 3rd Qu.:1.0 3rd Qu.:5.75
## Max. : 6.00 Max. :1.0 Max. :8.00
plot(df$y, df$x1, pch=16, col='steelblue',
main='y vs. x1',
xlab='y', ylab='x1')
plot(df$y, df$x2, pch=16, col='green',
main='y vs. x2',
xlab='y', ylab='x2')
plot(df$x1, df$x2, pch=16, col='grey',
main='x1 vs. x2',
xlab='x1', ylab='x2')
cor_matrix <- cor(df, use = "complete.obs")
cor_matrix
## x1 x2 y
## x1 1.0000000 0.8700628 0.9511669
## x2 0.8700628 1.0000000 0.9370426
## y 0.9511669 0.9370426 1.0000000
heatmap(cor_matrix)
All the pairwise correlations are high; the correlation between y and x1 (about 0.95) is slightly higher than the correlation between y and x2 (about 0.94).
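Since ggplot2 is loaded above but not otherwise used, an optional sketch of the same correlation matrix as a ggplot2 tile plot is shown below; the long-format reshaping with as.table() is our own choice of presentation, not part of the original analysis:

# reshaping the correlation matrix to long form (columns Var1, Var2, Freq) and plotting it
cor_long <- as.data.frame(as.table(cor(df)))
ggplot(cor_long, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  labs(title = 'Correlation heatmap', x = '', y = '', fill = 'r')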
pairs(df, lower.panel = NULL)
model <- lm(y ~ x1 + x2, data = df)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
## 1 2 3 4
## 0.5385 -0.5385 -0.8077 0.8077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1538 1.0074 0.153 0.904
## x1 0.6923 0.5385 1.286 0.421
## x2 2.8846 2.7849 1.036 0.489
##
## Residual standard error: 1.373 on 1 degrees of freedom
## Multiple R-squared: 0.954, Adjusted R-squared: 0.8621
## F-statistic: 10.38 on 2 and 1 DF, p-value: 0.2144
# creating a multiple linear regression model with y as the output variable
Call shows the function call used to fit the regression model. Residuals gives a quick view of the distribution of the residuals, which by definition have a mean of zero; the median should therefore not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
Coefficients shows the regression beta coefficients and their statistical significance; predictor variables that are significantly associated with the outcome variable are marked with stars.
The residual standard error (RSE), R-squared (R2) and the F-statistic are metrics used to check how well the model fits our data.
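As a sketch, these same fit statistics can be pulled out of the summary object programmatically with base R accessors:

# extracting the key fit statistics from the summary object
s <- summary(model)
s$sigma # residual standard error (RSE)
s$r.squared # multiple R-squared
s$adj.r.squared # adjusted R-squared
s$fstatistic # F-statistic with its numerator and denominator degrees of freedom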
summary(model)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1538462 1.0073693 0.1527207 0.9035205
## x1 0.6923077 0.5384615 1.2857143 0.4208332
## x2 2.8846154 2.7849447 1.0357891 0.4888094
# getting the coefficient table of our model
Evaluating our model
We note from the model summary that the R-squared is about 0.95, meaning that roughly 95% of the variation in y is explained by x1 and x2, which suggests a well-performing model.
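As a sketch of where that figure comes from, R-squared can be reproduced by hand as the proportion of the total variation in y explained by the model:

# by-hand R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((df$y - mean(df$y))^2)
1 - ss_res / ss_tot # should match the Multiple R-squared reported above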
# This is the prediction error rate (RSE divided by the mean of y); the lower it is, the better our model.
sigma(model)/mean(df$y)
## [1] 0.3922323
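A sketch of the same error rate computed by hand from the residuals, to make the calculation explicit (it uses only the fitted model object from above):

# by-hand check of the residual standard error and the error rate
rse <- sqrt(sum(residuals(model)^2) / df.residual(model))
rse / mean(df$y) # should reproduce the value above, about 0.39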
model2 <- lm(x1 ~ y + x2, data = df)
summary(model2)
##
## Call:
## lm(formula = x1 ~ y + x2, data = df)
##
## Residuals:
## 1 2 3 4
## -1.05 1.05 0.35 -0.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.050 1.161 0.043 0.973
## y 0.900 0.700 1.286 0.421
## x2 -0.900 4.482 -0.201 0.874
##
## Residual standard error: 1.565 on 1 degrees of freedom
## Multiple R-squared: 0.9084, Adjusted R-squared: 0.7252
## F-statistic: 4.959 on 2 and 1 DF, p-value: 0.3026
sigma(model2)/mean(df$y)
## [1] 0.4472136
model3 <- lm(x2 ~ y + x1, data = df)
summary(model3)
sigma(model3)/mean(df$y)
## [1] 0.0978232
We can see that when x1 and x2 are used as the output variable we get prediction error rates (RSE relative to the mean of y) of about 45% and 10% respectively.
We conclude that although the variables are all highly correlated, y should be the desired output variable, since it gives us the best overall fit when used as the response variable.
We will therefore use model one (with y as the output variable) as our main model.
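As a sketch of how the chosen model would then be used, predictions for new observations can be obtained with predict(); the x1 and x2 values below are hypothetical and only for illustration:

# predicting y for hypothetical new inputs with the chosen model
new_obs <- data.frame(x1 = c(0, 5), x2 = c(0, 1))
predict(model, newdata = new_obs)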
# We use a two-way ANOVA test because we are modelling two input variables against one output variable
two.way <- aov(y ~ x1 + x2, data = df)
summary(two.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 37.09 37.09 19.682 0.141
## x2 1 2.02 2.02 1.073 0.489
## Residuals 1 1.88 1.88
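As a cross-check (a sketch, not part of the original analysis), the same sequential ANOVA table can also be obtained directly from the fitted linear model:

# sequential (Type I) sums of squares from the linear model; should match the aov table above
anova(model)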
par(mfrow=c(2,2))
plot(two.way)
par(mfrow=c(1,1))
# The red line in the residuals vs fitted plot shows the smoothed trend of the residuals; it should lie roughly flat around zero
# The normal Q-Q plot compares the quantiles of our model's residuals with the theoretical quantiles of a normal distribution
# The closer the points lie to the reference line (a slope close to 1), the better the normality assumption is satisfied
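As a complementary, and with only four observations very rough, numerical check of the normality assumption, a Shapiro-Wilk test could be run on the residuals (a sketch, not part of the original analysis):

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
shapiro.test(residuals(two.way))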
Since our model has both a low RSE and a high R-squared value, we conclude that it is a good model.