R Notebook

TUAN NURSHAFIEKA WAHIDA BINTI TUAN NADIN SD22030 02G LAB REPORT 1

# Set the working directory
setwd("C:\\Users\\shafi\\Downloads")

# Read the data from the whitespace-delimited text file
gmp_data <- read.table("gmp (1).txt", header = TRUE)

# View the first few rows of the data
head(gmp_data)

# Develop the multiple linear regression model
model <- lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11, data = gmp_data)

# Summary of the model
summary(model)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 
##     x10 + x11, data = gmp_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3441 -1.6711 -0.4486  1.4906  5.2508 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 17.339838  30.355375   0.571   0.5749  
## x1          -0.075588   0.056347  -1.341   0.1964  
## x2          -0.069163   0.087791  -0.788   0.4411  
## x3           0.115117   0.088113   1.306   0.2078  
## x4           1.494737   3.101464   0.482   0.6357  
## x5           5.843495   3.148438   1.856   0.0799 .
## x6           0.317583   1.288967   0.246   0.8082  
## x7          -3.205390   3.109185  -1.031   0.3162  
## x8           0.180811   0.130301   1.388   0.1822  
## x9          -0.397945   0.323456  -1.230   0.2344  
## x10         -0.005115   0.005896  -0.868   0.3971  
## x11          0.638483   3.021680   0.211   0.8350  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.227 on 18 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8355, Adjusted R-squared:  0.7349 
## F-statistic:  8.31 on 11 and 18 DF,  p-value: 5.231e-05

#PART A (i)
# Normal probability plot of residuals
qqnorm(residuals(model))
qqline(residuals(model))

Part A i) Construct a normal probability plot of the residuals. Does there seem to be any problem with the normality assumption? - The plot shows that the residuals are approximately normally distributed: the points fall close to the straight reference line, so there is no apparent problem with the normality assumption.

#(ii)
# Shapiro-Wilk normality test
shapiro_test <- shapiro.test(residuals(model))
shapiro_test

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model)
## W = 0.964, p-value = 0.3904

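Support your answer in (i) by using an appropriate normality test. - The Shapiro-Wilk test gives W = 0.964 with a p-value of 0.3904. Since the p-value is greater than 0.05, we fail to reject the null hypothesis that the residuals are normally distributed. There is no significant evidence against normality, which supports the answer in (i).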
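The decision rule used above can be sketched in code. This is only an illustration: the residuals here are simulated (an assumption, since gmp_data is not available outside the notebook), and the names sim_resid, sw, alpha and decision are hypothetical. In the notebook itself, residuals(model) would take the place of sim_resid.

```r
# Simulated stand-in for residuals(model); NOT the report's data.
set.seed(42)
sim_resid <- rnorm(30, mean = 0, sd = 3)

sw <- shapiro.test(sim_resid)   # returns an object of class "htest"
alpha <- 0.05                   # significance level used in the report

# Reject H0 (normality) only when the p-value falls below alpha.
decision <- if (sw$p.value < alpha) "reject H0" else "fail to reject H0"
cat(sprintf("W = %.3f, p = %.4f -> %s\n", sw$statistic, sw$p.value, decision))
```

Reading the p-value from sw$p.value avoids copying it by hand from the printed output.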

#(iii)
# Plot of residuals vs predicted response
plot(predict(model), residuals(model))
abline(h = 0, col = "red")

#PART B (i)
# Plot of influential observations by Cook's Distance
cook_distance <- cooks.distance(model)
plot(cook_distance, pch = "*", cex = 2, main = "Influential Observations by Cook's Distance")
abline(h = 4 * mean(cook_distance, na.rm = TRUE), col = "red")
text(x = 1:length(cook_distance) + 1, y = cook_distance,
     labels = ifelse(cook_distance > 4 * mean(cook_distance, na.rm = TRUE), names(cook_distance), ""),
     col = "red")

#(ii)
# Examination of outliers
outliers <- as.numeric(names(cook_distance)[(cook_distance>4*mean(cook_distance,na.rm=TRUE))])
head(gmp_data[outliers,])

Examine and list the point observation(s) which consider as outlier(s). Justify your answer.

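The flagged observations are at index 14 and 17, because their Cook's distances lie above the threshold of four times the mean Cook's distance (the red line in the plot). These points are potentially influential and may affect the model's fit.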
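Cook's distance measures influence, which combines an unusual response with unusual predictor values. A rough way to see which of the two drives a flagged point is to put Cook's distance next to the studentized residual and the leverage. The sketch below uses the built-in mtcars data as a stand-in (an assumption, since gmp_data is not available here); fit, cd, threshold, flagged and diagnostics are hypothetical names.

```r
# mtcars is a built-in stand-in; NOT the report's gmp_data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

cd <- cooks.distance(fit)
threshold <- 4 * mean(cd)             # same cutoff rule as the red line in the report
flagged <- names(cd)[cd > threshold]  # row names of points above the cutoff

# Side-by-side diagnostics for the flagged rows only
diagnostics <- data.frame(
  cooks_d  = cd[flagged],
  rstudent = rstudent(fit)[flagged],   # large |value| suggests an outlier in y
  leverage = hatvalues(fit)[flagged]   # large value suggests unusual x values
)
print(diagnostics)
```

In the notebook, replacing fit with model would give the same table for the report's flagged observations.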

#PART C (i)
# Construction of lack of fit test
lack_of_fit_model <- lm(y ~ x1 + x2 + x3 + x8 + x9 + x10, data = gmp_data)

# Summary of the lack of fit model
summary(lack_of_fit_model)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x8 + x9 + x10, data = gmp_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7829 -1.6308 -0.2023  1.7894  6.2575 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.891090  19.188065   1.662    0.110
## x1          -0.051858   0.044919  -1.154    0.260
## x2           0.001803   0.056564   0.032    0.975
## x3           0.031761   0.070140   0.453    0.655
## x8           0.129341   0.116709   1.108    0.279
## x9          -0.206554   0.275463  -0.750    0.461
## x10         -0.003947   0.004986  -0.792    0.437
## 
## Residual standard error: 3.206 on 23 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7383 
## F-statistic: 14.64 on 6 and 23 DF,  p-value: 7.75e-07

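By considering only x1, x2, x3, x8, x9 and x10, construct the lack of fit test. Interpret and justify your answer. - The F test statistic is 14.64 on 6 and 23 degrees of freedom, and the p-value is 7.75 x 10^-7. Since the p-value is lower than 0.05, we reject the null hypothesis that all slope coefficients are zero, so the overall model is statistically significant. This means that at least one of the predictors contributes to the model, even though none of the individual coefficients is significant on its own. Note that this F statistic tests the overall significance of the reduced model; it does not by itself compare the reduced model against the full model.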
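One way to compare the reduced model with the full model is a partial F test, which can be computed from the two summary() outputs already printed in this report. The sketch below is an approximation, because the residual standard errors are rounded to three decimals; the variable names are hypothetical. With the data loaded, anova(lack_of_fit_model, model) should give the same comparison directly.

```r
# Values copied from the two summary() outputs in this report
rse_full <- 3.227; df_full <- 18   # full model, x1 to x11
rse_red  <- 3.206; df_red  <- 23   # reduced model, x1, x2, x3, x8, x9, x10

# Residual sum of squares = (residual standard error)^2 * residual df
rss_full <- rse_full^2 * df_full
rss_red  <- rse_red^2  * df_red

df_diff <- df_red - df_full        # 5 predictors dropped
f_stat  <- ((rss_red - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

cat(sprintf("Partial F = %.3f on %d and %d DF, p = %.3f\n",
            f_stat, as.integer(df_diff), as.integer(df_full), p_value))
```

The partial F comes out at roughly 0.94, with a p-value well above 0.05. Under this test there is no significant evidence that dropping x4, x5, x6, x7 and x11 worsens the fit, which is consistent with the adjusted R-squared staying almost unchanged (0.7349 for the full model, 0.7383 for the reduced model).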
