Load Packages
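The package-loading code is not shown in this report. The later code calls `%>%` and `select()` (dplyr) and `corrplot()` (corrplot), so the sketch below assumes those two packages are the ones needed; the exact set the author loaded is not known.

```r
# Assumed package set, inferred from the functions used later in this report
needed <- c("dplyr", "corrplot")

for (pkg in needed) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # Attach the package so its functions can be called without a prefix
    library(pkg, character.only = TRUE)
  } else {
    # Do not install automatically; just tell the user what is missing
    message("Package '", pkg, "' is not installed; run install.packages(\"", pkg, "\") first.")
  }
}
```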
Problem Statement
Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Solution
The data I chose is the heart disease data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
Read Data
heart <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",header=FALSE,sep=",",na.strings = '?')
colnames(heart) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "outcome")
Correlation Plot
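Because the file is read with `na.strings = '?'`, any `?` entries become `NA`, and it is worth checking for them before computing correlations or fitting a model. The regression output later in this report uses 297 rows (292 residual degrees of freedom plus 5 coefficients), which suggests incomplete rows were dropped at some point, although that step is not shown. The sketch below illustrates one way to do this on a toy data frame; the values are made up and `toy` is a hypothetical stand-in for `heart`.

```r
# Toy stand-in for the heart data; the values are made up for illustration
toy <- data.frame(
  age  = c(63, 67, 41, 56),
  chol = c(233, 286, 204, 236),
  ca   = c(0, 3, NA, 0),
  thal = c(6, 3, 3, NA)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values in any column
toy_complete <- na.omit(toy)
print(nrow(toy_complete))

# Alternatively, cor() can skip incomplete rows itself
print(cor(toy$age, toy$ca, use = "complete.obs"))
```

Applied to the real data, the same pattern would be `heart <- na.omit(heart)`, assuming that dropping incomplete rows is acceptable for the analysis.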
data_frame <- heart %>% select(age,trestbps,chol,thalach,oldpeak)
M <- cor(data_frame)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),
         type="upper", order="hclust",
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, # Text label color and rotation
         # Combine with significance
         insig = "blank",
         # hide correlation coefficient on the principal diagonal
         diag=FALSE
)
Multiple Linear Regression
#lm(formula = Y(dependent) ~ X1(independent) + X2 + X3 + ..., data=table_name_optional)
# attach(heart) is not needed because lm() receives data = heart
MLR <- lm(formula = age ~ trestbps + chol + thalach + oldpeak, data=heart)
summary(MLR)
##
## Call:
## lm(formula = age ~ trestbps + chol + thalach + oldpeak, data = heart)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19.5862  -5.6714   0.2889   5.8933  23.5759
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.873743   5.026193  10.520  < 2e-16 ***
## trestbps     0.126013   0.026385   4.776 2.83e-06 ***
## chol         0.029522   0.008853   3.335 0.000964 ***
## thalach     -0.149234   0.021216  -7.034 1.43e-11 ***
## oldpeak      0.091232   0.424752   0.215 0.830082
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.85 on 292 degrees of freedom
## Multiple R-squared: 0.2578, Adjusted R-squared: 0.2476
## F-statistic: 25.36 on 4 and 292 DF, p-value: < 2.2e-16
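The summary statistics above can be cross-checked by hand. The short sketch below uses only numbers printed in the output; the number of rows, n = 297, is inferred from the 292 residual degrees of freedom plus 5 estimated coefficients. It recomputes the adjusted R-squared and the F-statistic from the multiple R-squared.

```r
r2       <- 0.2578          # Multiple R-squared from the summary
df_resid <- 292             # residual degrees of freedom from the summary
p        <- 4               # number of predictors in the model
n        <- df_resid + p + 1  # rows used in the fit (inferred, not printed)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over residual variance
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(round(adj_r2, 4))
print(round(f_stat, 2))
```

Both values agree with the summary to the printed precision. The R-squared itself shows that the four predictors together explain roughly a quarter of the variation in age.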
confint(MLR)
##                   2.5 %      97.5 %
## (Intercept) 42.98158533 62.76589970
## trestbps     0.07408423  0.17794128
## chol         0.01209884  0.04694573
## thalach     -0.19098955 -0.10747933
## oldpeak     -0.74473150  0.92719518
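Each 95% interval above is the coefficient estimate plus or minus a t quantile times its standard error. As a minimal sketch, the interval for `trestbps` can be reproduced from the estimate and standard error printed in the coefficient table, using the residual degrees of freedom from the summary:

```r
est      <- 0.126013   # trestbps estimate from the coefficient table
se       <- 0.026385   # trestbps standard error from the coefficient table
df_resid <- 292        # residual degrees of freedom from the summary

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

# Lower and upper bounds: estimate -/+ t_crit * standard error
ci <- est + c(-1, 1) * t_crit * se
print(ci)
```

The interval for `oldpeak` spans zero, which is consistent with its large p-value (0.83) in the coefficient table.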
Residual analysis
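Based on the residual analysis, I conclude that the linear model is appropriate. In the normal Q-Q plot, the points fall close to the straight reference line, which indicates that the residuals are approximately normally distributed. The residual quantiles in the model summary are also roughly symmetric around zero (median 0.29, quartiles -5.67 and 5.89). The fit is modest, however: the model explains only about 26% of the variance in age (R-squared 0.2578), and oldpeak is not a significant predictor (p = 0.83).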
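The plotting code for the residual analysis is not shown in this report. The sketch below is one common way to produce such diagnostics in base R, not necessarily what was used here; `residual_checks` is a hypothetical helper name. It draws a residuals-versus-fitted plot and a normal Q-Q plot, then returns a Shapiro-Wilk normality test of the residuals.

```r
# Hypothetical helper: basic residual diagnostics for a fitted lm object
residual_checks <- function(model) {
  res <- residuals(model)
  fit <- fitted(model)

  # Show two plots side by side and restore the old layout on exit
  op <- par(mfrow = c(1, 2))
  on.exit(par(op))

  # Residuals vs fitted: look for curvature or a funnel shape
  plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, lty = 2)

  # Normal Q-Q plot: points near the line suggest approximate normality
  qqnorm(res)
  qqline(res)

  # Formal normality test of the residuals (valid for 3 to 5000 observations)
  shapiro.test(res)
}
```

With the model fitted above, the call would be `residual_checks(MLR)`. A residuals-versus-fitted plot with no visible pattern supports linearity and constant variance, which the Q-Q plot alone does not address.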