Correlation and Regression Analysis
Probability Theory and Introductory Statistics
Northeastern University
Student: Jayakumar Moris Udayakumar
Instructor Name: Prof. Dee Chiluiza, PhD
Date: 16th December 2023


I. INTRODUCTION
  a. Correlation and Linear Regression
  In analytics, correlation analysis and linear regression analysis are two crucial concepts for assessing whether a relationship between variables exists, how strong it is, and how the variables depend on one another. Regression analysis is one of the most important statistical tools that helps us assess the relationship between two or more variables, at least one of which is independent (Riffenburgh, 2020). Correlation measures the association between two quantitative variables and is summarized by the correlation coefficient (Lalanne, 2017).
  b. Simple and Multiple Regression Analysis
  Simple regression analysis models the relationship between one dependent variable and one independent variable; the regression procedure uses the independent variable to predict the dependent variable (Carroll, 2023). Multiple regression analysis relates one dependent variable to a set of independent variables, and it is the most widely used multivariate method for examining causal relationships between variables (Nayebi, 2020).
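  As a minimal sketch (using R's built-in mtcars dataset purely for illustration), the two forms differ only in the number of predictors on the right-hand side of the model formula:

#simple regression: one independent variable
simple_model <- lm(mpg ~ wt, data = mtcars)

#multiple regression: two or more independent variables
multiple_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)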
  Practical applications in manufacturing:
Some practical applications of simple and multiple regression analysis are:
1. Predicting vendor performance based on their on-time delivery.
2. Analyzing the significant factors that influence the inventory.
3. Determining the impact of system changes on operations.
  c. Hypothesis Testing - Regression Analysis
  Hypothesis testing is a crucial statistical method used to evaluate the significance of relationships between variables. The null hypothesis posits no relationship (coefficients are zero), both for individual predictors and the overall model. The alternative hypothesis, on the other hand, suggests a meaningful association.
  The test statistic, typically a t-statistic for individual coefficients and an F-statistic for the overall model, is calculated to determine the probability of observing the data under the null hypothesis. A small p-value (typically below 0.05) leads to rejecting the null hypothesis, indicating a statistically significant relationship. Researchers then conclude that the observed associations in the regression model are unlikely to be due to chance. This process allows for rigorous assessment and inference about the validity of the regression results, providing a foundation for meaningful interpretations and predictions.
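  As a brief sketch of where these statistics appear in R output (using the built-in mtcars data for illustration), both tests can be read directly from the model summary:

#fitting a regression and reading its hypothesis-test statistics
model <- lm(mpg ~ wt, data = mtcars)
summary(model)$coefficients   #t-statistics and p-values for each coefficient
summary(model)$fstatistic     #F-statistic for the overall model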
  Some practical applications are:
1. Evaluating the effectiveness of a new system implementation on supply chain performance
2. Assessing the impact of a drug on patients' health
  d. Two Sample Hypothesis Testing - Dependent Vs Independent Variables
  Two-sample hypothesis testing divides into two cases based on whether the samples are related. With independent samples, the two groups are unrelated, which suits comparisons between entirely separate populations or different conditions within one population; a typical example is comparing mean test scores between students from two different schools. With dependent (paired) samples, each observation in one group is matched to an observation in the other, as with before-and-after measurements on the same subjects.
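  A minimal sketch of both cases in R, using hypothetical scores (the data values below are illustrative only):

#independent samples: two unrelated groups
group_a <- c(72, 85, 78, 90, 66)
group_b <- c(80, 88, 75, 95, 82)
t.test(group_a, group_b, paired = FALSE)

#dependent (paired) samples: before/after measurements on the same subjects
before <- c(120, 135, 128, 142, 131)
after  <- c(115, 130, 122, 138, 127)
t.test(before, after, paired = TRUE)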
  e. Importance of the final project
  The final project is a critical task for every student, since it measures the learning accumulated over the course period and lets students demonstrate their knowledge, skills, and experience.
  Significant milestones are
1. Students can recollect and revisit theoretical and practical concepts and share their understanding.
2. The final project should be completed rigorously enough that the report could be presented to recruiters.
  f. Advantages of using R for data analysis
Extensive Community Support:
1. Large and active community of statisticians, data scientists, and researchers.
2. Abundant packages and libraries contributed by the community enhance analytical capabilities.
Flexibility and Integration:
1. Versatile for diverse data analysis tasks, from manipulation to advanced modeling.
2. Seamless integration with other data science tools and platforms.
Reproducibility and Transparency:
1. Script-based coding allows for systematic documentation of analyses.
2. Facilitates reproducibility, collaboration, and scrutiny of results.
Credibility and Rigor:
1. Transparent analytical processes contribute to the credibility of findings.
2. Preferred choice for researchers and analysts aiming for rigorous and reproducible data analysis.
II. ANALYSIS SECTION
1. SIMPLE REGRESSION
  Simple regression analyzes the relationship between one independent variable and one dependent variable. In this Part 1 section, the mpg dataset is explored through descriptive statistics, the coefficients of correlation and determination, linear regression analysis, and a scatterplot.
1.1 Dataset description
  Presenting a brief descriptive statistics summary for all variables of the mpg (fuel economy) dataset.

#required packages (assumed loaded in the setup chunk): summarytools, psych, knitr, kableExtra, dplyr
#presenting the summary of mpg using the summarytools::descr function
shortsummary_dataset <- summarytools::descr(fueleconomy_data)

#formatting the table using the kable function
kable(shortsummary_dataset, table.attr = "style='width:90%;'", align = "c", format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "bordered", latex_options = "striped", font_size = NULL)
             cty       cyl       displ     hwy       year
Mean         16.859    5.889     3.472     23.440    2003.500
Std.Dev      4.256     1.612     1.292     5.955     4.510
Min          9.000     4.000     1.600     12.000    1999.000
Q1           14.000    4.000     2.400     18.000    1999.000
Median       17.000    6.000     3.300     24.000    2003.500
Q3           19.000    8.000     4.600     27.000    2008.000
Max          35.000    8.000     7.000     44.000    2008.000
MAD          4.448     2.965     1.334     7.413     6.672
IQR          5.000     4.000     2.200     9.000     9.000
CV           0.252     0.274     0.372     0.254     0.002
Skewness     0.786     0.112     0.439     0.365     0.000
SE.Skewness  0.159     0.159     0.159     0.159     0.159
Kurtosis     1.431     -1.464    -0.911    0.137     -2.009
N.Valid      234.000   234.000   234.000   234.000   234.000
Pct.Valid    100.000   100.000   100.000   100.000   100.000
#presenting the summary of mpg using the psych::describe function
extendedsummary_dataset <- t(psych::describe(fueleconomy_data))

#formatting the table using the kable function
kable(extendedsummary_dataset, table.attr = "style='width:90%;'", align = "c", format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = "bordered", latex_options = "striped", font_size = NULL)
          manufacturer*  model*   displ    year      cyl      trans*   drv*     cty      hwy      fl*      class*
vars      1.000          2.000    3.000    4.000     5.000    6.000    7.000    8.000    9.000    10.000   11.000
n         234.000        234.000  234.000  234.000   234.000  234.000  234.000  234.000  234.000  234.000  234.000
mean      7.765          19.090   3.472    2003.500  5.889    5.654    1.667    16.859   23.440   4.628    4.594
sd        5.132          11.147   1.292    4.510     1.612    2.879    0.662    4.256    5.955    0.695    1.990
median    6.000          18.500   3.300    2003.500  6.000    4.000    2.000    17.000   24.000   5.000    5.000
trimmed   7.681          18.979   3.394    2003.500  5.862    5.532    1.585    16.612   23.234   4.771    4.644
mad       5.930          14.085   1.334    6.672     2.965    1.483    1.483    4.448    7.413    0.000    2.965
min       1.000          1.000    1.600    1999.000  4.000    1.000    1.000    9.000    12.000   1.000    1.000
max       15.000         38.000   7.000    2008.000  8.000    10.000   3.000    35.000   44.000   5.000    7.000
range     14.000         37.000   5.400    9.000     4.000    9.000    2.000    26.000   32.000   4.000    6.000
skew      0.206          0.114    0.439    0.000     0.112    0.289    0.482    0.786    0.365    -2.254   -0.140
kurtosis  -1.629         -1.232   -0.911   -2.009    -1.464   -1.652   -0.755   1.431    0.137    5.764    -1.516
se        0.335          0.729    0.084    0.295     0.105    0.188    0.043    0.278    0.389    0.045    0.130

Observation:
  It is useful to obtain a brief descriptive statistics summary of the mpg (fuel economy) dataset for each of its variables. Some of the most notable points from the summary are:
  1. The number of observations is the same for every variable, n = 234.
  2. For the variable "cyl" (cylinders), the standard deviation of the cylinder counts is 1.612 and the mean is 5.89, which describes the distribution and central tendency; the values range from 4 to 8.
  3. For the variable "displ" (displacement), the standard deviation is 1.292 and the mean is 3.472, which likewise describes the distribution and central tendency; the values range from 1.6 to 7.
  4. Reviewing the full set of statistics, measures of central tendency, dispersion, and position, gives a first glimpse of the data that is helpful for more sophisticated analysis.

1.2 Statistics of a Dependent Variable
In this section, the descriptive statistics of the dependent variable "Displacement" are presented in tabular format, grouped by the independent variable "Cylinders".

#applying mpg (fuel economy) data set
#grouping by different cylinder numbers
#deriving descriptive statistics of displacement
displ_percyl = fueleconomy_data %>%
  group_by(No_of_Cylinders = cyl) %>%
  summarise(Mean = mean(displ),
            SD = sd(displ),
            Minimum = min(displ),
            Maximum = max(displ))

#formatting the table using kable functions
displ_percyl %>%
  kable(align = "c", caption = "Descriptive Statistics of displacement per cylinders", format = "html", digits = 2, table.attr = "style='width:75%;'") %>%
  kable_classic_2(bootstrap_options=c("hover","striped","condensed"), html_font = "Source Sans Pro", position = "center", font_size = 14) %>%
  add_header_above(c(" " = 1,"Vehicle Displacement" = 4))
Descriptive Statistics of displacement per cylinders

                        Vehicle Displacement
No_of_Cylinders   Mean   SD     Minimum   Maximum
4                 2.15   0.32   1.6       2.7
5                 2.50   0.00   2.5       2.5
6                 3.41   0.47   2.5       4.2
8                 5.13   0.59   4.0       7.0

Observation:
Based on the table, there are 4 categories of cylinder counts among the vehicles, and the statistics of the dependent variable "displacement" are presented for each category. Some notable points are:
  1. The mean and range of displacement increase as the number of cylinders increases.
  2. Vehicles with five cylinders show no variation in displacement, irrespective of the other variables; their standard deviation is zero.
  3. The standard deviation of displacement stays below 1 for every cylinder category.

1.3 Coefficient of Correlation
In this section, we analyze the correlation between the cylinders (x, independent) and displacement (y, dependent) variables, present the results in tabular format, and make observations.

#correlation table for first 5 values
table_correlation <- head(data.frame(
  x = fueleconomy_data$cyl,
  y = fueleconomy_data$displ,
  xy = fueleconomy_data$cyl * fueleconomy_data$displ,
  x2 = fueleconomy_data$cyl^2,
  y2 = fueleconomy_data$displ^2),5)

#formatting the table using kable function
kable(table_correlation, align = "c", format = "html", caption = "Correlation Table: Cylinder (x) vs Displacement (y)") %>%
  kable_styling(bootstrap_options=c("striped","bordered","condensed"), html_font = "Cambria", position = "center", font_size = 13)
Correlation Table: Cylinder (x) vs Displacement (y)

x   y     xy     x2    y2
4   1.8   7.2    16    3.24
4   1.8   7.2    16    3.24
4   2.0   8.0    16    4.00
4   2.0   8.0    16    4.00
6   2.8   16.8   36    7.84
#using the table, calculating the sum of x, y, xy, x2, and y2
sum_x <- sum(table_correlation$x)
sum_y <- sum(table_correlation$y)
sum_xy <- sum(table_correlation$xy)
sum_x2 <- sum(table_correlation$x2)
sum_y2 <- sum(table_correlation$y2)

# Display the sum values
cat("Σx:", sum_x, "|", "Σy:", sum_y, "|", "Σxy:", sum_xy, "|", "Σx²:", sum_x2, "|", "Σy²:", sum_y2, "\n")
## Σx: 22 | Σy: 10.4 | Σxy: 47.2 | Σx²: 100 | Σy²: 22.32

Observation:
Based on the analysis between cylinders (x) and displacement (y), the correlation table is presented for the first 5 observations. The column sums computed from this table are the inputs for determining the correlation coefficient in the next task, and the table already suggests the pattern: displacement rises with cylinder count.

1.4 Correlation and Determination
Computing the coefficients of correlation and determination using the correlation coefficient (r) formula and the sums of x and y obtained in the previous task 1.3.
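For reference, the computational formula applied in the code below is

r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²] · [nΣy² − (Σy)²]),

and the coefficient of determination is simply r².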

#number of data pairs
n=5

#computing numerator for coef of correlation
rnum = (n*sum_xy)-(sum_x*sum_y)

#computing denominator for coef of correlation
rden = sqrt((n*sum_x2 - (sum_x)^2)*(n*sum_y2 - sum_y^2))

#coefficient of correlation r between displacement and cylinders
coefcorrelation_r = rnum / rden

#coefficient of determination r2 between displacement and cylinders
coefdetermination_r2 = (coefcorrelation_r)^2

Observation:
  1. The coefficient of correlation between cylinders and displacement is 0.97.
  2. The coefficient of determination between cylinders and displacement is 0.942.
The correlation coefficient of 0.97 indicates a strong positive correlation between cylinders (x) and displacement (y). Accordingly, the coefficient of determination of 0.942 reveals that 94.2% of the variation in displacement (y) is explained by the independent variable cylinders (x), with the remaining 5.8% attributable to other factors (note that these coefficients are computed from the first five observations only).
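As a quick cross-check (a sketch using the same first five observations), the built-in cor() function should reproduce the manual result:

#cross-checking the manual computation with the built-in cor() function
cor(head(fueleconomy_data$cyl, 5), head(fueleconomy_data$displ, 5))
## expected output: approximately 0.97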

1.5 Linear Regression
Fitting the linear regression of displacement on cylinders and identifying the intercept and slope of the fitted line.

#fitting the linear regression of displacement on cylinders
linreg_cyl_displ <- lm(displ ~ cyl, data = fueleconomy_data)

#summarizing linear regression
summary_linreg <- summary(linreg_cyl_displ)

#finding intercept and slope
intercept_a <- coef(summary_linreg)[1]
slope_b <- coef(summary_linreg)[2]

Observation:
  y = -0.92 + 0.75x
  Based on the linear regression computed between cylinders and displacement, we obtain the regression equation above, which includes the intercept and slope. The line crosses the y-axis at the intercept of -0.92, and the slope of 0.75 indicates that predicted displacement increases by 0.75 for each additional cylinder.
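  The lm() estimates can also be verified from sample statistics, since the least-squares slope equals r multiplied by the ratio of the standard deviations and the intercept follows from the means (a sketch over the full dataset):

#verifying the slope and intercept from sample statistics
r_full <- cor(fueleconomy_data$cyl, fueleconomy_data$displ)
b_check <- r_full * sd(fueleconomy_data$displ) / sd(fueleconomy_data$cyl)
a_check <- mean(fueleconomy_data$displ) - b_check * mean(fueleconomy_data$cyl)
c(intercept = a_check, slope = b_check)   #should match approximately -0.92 and 0.75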

1.6 Scatterplot
In this section, a scatterplot is presented for the linear regression of the dependent variable displacement on the independent variable cylinders.

#framing scatterplot between displacement and cylinders
plot(fueleconomy_data$displ ~ fueleconomy_data$cyl, xlim = c(0,10), ylim = c(0,10), pch = 3, xlab = "Cylinders", ylab = "Displacement")

#adding regression line to the graph
abline(linreg_cyl_displ, lty=1, lwd=1, col="#99004C")

#adding mean of cylinders and displacement to the graph
mean_cyl <- mean(fueleconomy_data$cyl)
mean_displ <- mean(fueleconomy_data$displ)
abline(v=c(mean_cyl), col= "red")
abline(h=c(mean_displ), col= "darkblue")

#adding labels to the mean lines
text(x=mean_cyl, y=8, round(mean_cyl,2), cex=0.8, srt=90, adj = c(1,0))
text(x=mean_displ, y=3.8, round(mean_displ,2), cex=0.8, srt=360, adj = c(1,0))

Observation:
As presented in the scatterplot, we can recognize the data points, the linear regression line, the mean of displacement, and the mean of cylinders. Some significant observations are:
  1. The observed 'y' (displacement) values cluster closely within each cylinder count and there are no apparent outliers, which reflects the nature of the dataset.
  2. The mean lines of displacement and cylinders intersect at approximately 5.9 cylinders and a displacement of approximately 3.47.
  3. The residuals of the data points show only minor variation.
  4. As computed in task 1.4, the correlation between the variables is strong, meaning displacement depends heavily on the cylinder count of the vehicles.

1.7 Predicted Values and Residuals
In this section, the linear regression formula for the variables cylinders and displacement is applied to compute the predicted values of displacement and their residuals. The details are presented in tabular format.

#applying linear regression formula and mutate function, computing the table with x, y, observed y, predicted y, and residuals
prediction_table <- fueleconomy_data %>%
  mutate(
    x = cyl,
    observed_y = displ,
    predicted_y = predict(linreg_cyl_displ),
    residuals = residuals(linreg_cyl_displ))

#filtering the first 10 observations of the table
prediction_table <- head(prediction_table, 10)

#formatting the table using kable function
kable(prediction_table, format = "html", align = "c", digits = 3)%>%
  kable_styling(bootstrap_options = "bordered", font_size = 11, table.envir = "table")
manufacturer  model       displ  year  cyl  trans       drv  cty  hwy  fl  class    x  observed_y  predicted_y  residuals
audi          a4          1.8    1999  4    auto(l5)    f    18   29   p   compact  4  1.8         2.063        -0.263
audi          a4          1.8    1999  4    manual(m5)  f    21   29   p   compact  4  1.8         2.063        -0.263
audi          a4          2.0    2008  4    manual(m6)  f    20   31   p   compact  4  2.0         2.063        -0.063
audi          a4          2.0    2008  4    auto(av)    f    21   30   p   compact  4  2.0         2.063        -0.063
audi          a4          2.8    1999  6    auto(l5)    f    16   26   p   compact  6  2.8         3.555        -0.755
audi          a4          2.8    1999  6    manual(m5)  f    18   26   p   compact  6  2.8         3.555        -0.755
audi          a4          3.1    2008  6    auto(av)    f    18   27   p   compact  6  3.1         3.555        -0.455
audi          a4 quattro  1.8    1999  4    manual(m5)  4    18   26   p   compact  4  1.8         2.063        -0.263
audi          a4 quattro  1.8    1999  4    auto(l5)    4    16   25   p   compact  4  1.8         2.063        -0.263
audi          a4 quattro  2.0    2008  4    manual(m6)  4    20   28   p   compact  4  2.0         2.063        -0.063

Observation:
Comparing the predicted values of y (displacement) against the independent variable x (cylinders), we can recognize the differences between the observed and predicted y values. Some key points are:
  1. The predicted displacement depends only on the cylinder count: cylinder count 4 yields a predicted y of 2.063 regardless of the other variables, and cylinder count 6 yields 3.555.
  2. Among the first 10 observations, the residual smallest in magnitude is -0.063 and the largest in magnitude is -0.755.

1.8 Predicting Dependent Variable - Linear Regression
Based on the instructions, I chose the value 6 for the independent variable x (cylinders) and predicted the corresponding dependent value y by applying the linear regression formula.

#choosing the value for X_cylinder
X_cyl = 6

#predicting the y_displacement for the respective X_cyl value
predict_y_displ = (X_cyl * slope_b) + intercept_a

#presenting scatterplot to show regression between displacement and cylinders
plot(fueleconomy_data$displ ~ fueleconomy_data$cyl, xlim = c(0,10), ylim = c(0,10), pch = 4, xlab = "Cylinders", ylab = "Displacement")

#adding the chosen X_cyl value and its predicted y_displ value
abline(v=c(X_cyl), col= "red")
abline(h=c(predict_y_displ), col= "darkblue")

#adding label to the corresponding X and Y values
text(x=X_cyl, y=8, round(X_cyl,2), cex=0.8, srt=90, adj = c(1,0))
text(x=predict_y_displ, y=4, round(predict_y_displ,2), cex=0.7, srt=360, adj = c(1,0))

Observation:
The scatterplot above is based on the chosen x value of 6 (cylinder count) and the prediction of its corresponding y value (displacement). The predicted y value is 3.55, found using the linear regression formula y = a + bx: applying the intercept, slope, and x value determines the predicted y. In practical terms, for vehicles with 6 cylinders, the expected displacement is 3.55.
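Since the model was fitted with a data argument, the same prediction can also be obtained with the built-in predict() function (a sketch):

#equivalent prediction using predict() with new data
predict(linreg_cyl_displ, newdata = data.frame(cyl = 6))
## expected output: approximately 3.55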

2. MULTIPLE REGRESSION
Multiple regression analysis extends regression analysis to one dependent variable and two or more independent variables. In this section, multiple regression analysis is performed and the corresponding observations are presented.
2.1 Predictions
For the patient dataset, which contains the variables patient ID, systolic blood pressure, age, and weight, multiple regression analysis and hypothesis testing are performed in the following section.

#creating objects of the dataset
patient_ID = c("PK01", "PK02", "PK03", "PK04", "PK05", "PK06", "PK07", "PK08", "PK09", "PK10", "PK11", "PK12", "PK13", "PK14", "PK15")
SystolicBP = c(112, 156, 125, 145, 155, 162, 139, 144, 153, 126, 169, 132, 143, 153, 162)
Age = c(45, 60, 55, 60, 62, 71, 57, 59, 64, 42, 75, 52, 59, 67, 73) 
Weight = c(135, 182, 148, 182, 190, 232, 194, 182, 217, 171, 225, 173, 184, 194, 211)

#creating dataframe
patient_data <- data.frame(patient_ID, SystolicBP, Age, Weight)
patient_data_numeric <- data.frame(SystolicBP, Age, Weight)

#multiple regression
multip_linreg <- lm(SystolicBP ~ Age + Weight, data = patient_data)
summary_mulreg <- summary(multip_linreg)

# Calculate correlation between Age and SystolicBP
cor_age_sbp <- cor(patient_data$Age, patient_data$SystolicBP)
det_age_sbp <- (cor_age_sbp)^2

# Calculate correlation between Age and Weight
cor_age_weight <- cor(patient_data$Age, patient_data$Weight)
det_age_weight <- (cor_age_weight)^2

# Calculate correlation between Weight and SystolicBP
cor_weight_sbp <- cor(patient_data$Weight, patient_data$SystolicBP)
det_weight_sbp <- (cor_weight_sbp)^2

# Create a data frame to display the results
correlation_table <- data.frame(
  Variable_Pair = c("Age and SystolicBP", "Age and Weight", "Weight and SystolicBP"),
  Multiple_Correlation_Coefficient = c(cor_age_sbp, cor_age_weight, cor_weight_sbp), 
  Multiple_Determination_Coefficient = c(det_age_sbp, det_age_weight,det_weight_sbp)
)

# formatting the table using kable function
kable(correlation_table, align = "c", format = "html", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "bordered"), table.envir = "table", )
Variable_Pair           Multiple_Correlation_Coefficient   Multiple_Determination_Coefficient
Age and SystolicBP      0.925                              0.855
Age and Weight          0.838                              0.702
Weight and SystolicBP   0.898                              0.806
#null hypothesis: systolic blood pressure cannot be predicted using the variables age and weight
#alternative hypothesis: systolic blood pressure can be predicted using the variables age and weight

#computing the F value (from the model summary) and the critical value
F_value <- summary_mulreg$fstatistic[1]   #approximately 57.73
cv <- qf(0.05, 2, 12, lower.tail = FALSE)

#intercept and slope for the multiple regression
multireg_intercept_a <- coef(summary_mulreg)[1]
multireg_slope_ageb1 <- coef(summary_mulreg)[2]
multireg_slope_weib2 <- coef(summary_mulreg)[3]

#predicted systolic bp for a person age 25 and weight 135
predicted_sbp1 <- multireg_intercept_a + (multireg_slope_ageb1 * 25) + (multireg_slope_weib2 * 135)

#predicted systolic bp for a person age 80 and weight 175
predicted_sbp2 <- multireg_intercept_a + (multireg_slope_ageb1 * 80) + (multireg_slope_weib2 * 175)

Observation:
The hypothesis test result is "Reject the null hypothesis; there is a significant linear relationship between Age, Weight, and Systolic Blood Pressure," since the F value of 57.73 is greater than the critical value of 3.89.
  1. The predicted systolic blood pressure for a person of age 25 and weight 135 is 97.4
  2. The predicted systolic blood pressure for a person of age 80 and weight 175 is 161.41
In this section, we analyzed the multiple regression between the dependent variable "Systolic blood pressure" and the independent variables "Age" and "Weight". Some key points are:
  1. It gives a clear picture of the correlation and determination between these variables.
  2. The correlation between age and systolic blood pressure is the strongest positive association at 0.925, compared with 0.838 between age and weight and 0.898 between weight and systolic blood pressure.
  3. The determination coefficient of age and systolic blood pressure is also the highest at 85.5%; the remaining 14.5% is attributable to other factors.
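Both predictions can also be reproduced with predict() on the fitted model (a sketch, relying on the model having been fitted with a data argument):

#equivalent predictions using predict() with new data
predict(multip_linreg, newdata = data.frame(Age = c(25, 80), Weight = c(135, 175)))
## expected output: approximately 97.4 and 161.41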

2.2 Scatterplots
In this section, simple linear regression scatterplots are presented for each of the two independent variables against their common dependent variable.

#using the par function to place the two charts side by side
par(mfrow = c(1, 2), mai = c(0.6, 0.8, 0.5, 0.4), mar = c(4, 4, 1, 1))

#computing the linear regressions of systolic bp on age and of systolic bp on weight
linreg_age_bp <- lm(SystolicBP ~ Age, data = patient_data)
linreg_weig_bp <- lm(SystolicBP ~ Weight, data = patient_data)

#presenting scatterplot for the dependent variable systolic bp against the independent variable age
plot(SystolicBP ~ Age, data = patient_data, pch=18, col="orange")
abline(linreg_age_bp, lty=1, lwd=1, col="#660066")

#presenting scatterplot for the dependent variable systolic bp against the independent variable weight
plot(SystolicBP ~ Weight, data = patient_data, pch=8, col="purple")
abline(linreg_weig_bp, lty=1, lwd=1, col="#FF007F")

Observation:
Based on the linear regression plots of systolic BP against age and against weight, we can observe the following key points:
  1. The correlation and determination coefficients found in the previous task are consistent with the patterns of the data points in these scatterplots.
  2. The correlation between age and systolic BP is visibly stronger than the correlation between weight and systolic BP.
  3. On closer review, the residuals appear larger in the weight vs. systolic BP scatterplot than in the age vs. systolic BP scatterplot.

2.3 Predicted Values and Residuals
In this section, the predicted value of systolic BP (y) is computed and compared with x1 (age), x2 (weight), and the observed y.

#creating the prediction table for the patient dataset by computing x1, x2, observed_y, predicted_y, and residuals
patientdata_predtable <- patient_data %>%
  mutate(
    x1 = Age,
    x2 = Weight,
    observed_y = SystolicBP,
    predicted_y = predict(multip_linreg),
    residuals = residuals(multip_linreg)
  )

#formatting the table using kable function
kable(patientdata_predtable, format = "html", align = "c", digits = 3)%>%
  kable_styling(bootstrap_options = "bordered", font_size = 11, table.envir = "table")
patient_ID  SystolicBP  Age  Weight  x1  x2   observed_y  predicted_y  residuals
PK01        112         45   135     45  135  112         117.043      -5.043
PK02        156         60   182     60  182  156         143.504      12.496
PK03        125         55   148     55  148  125         130.111      -5.111
PK04        145         60   182     60  182  145         143.504      1.496
PK05        155         62   190     62  190  155         147.465      7.535
PK06        162         71   232     71  232  162         166.784      -4.784
PK07        139         57   194     57  194  139         143.551      -4.551
PK08        144         59   182     59  182  144         142.522      1.478
PK09        153         64   217     64  217  153         156.165      -3.165
PK10        126         42   171     42  171  126         123.077      2.923
PK11        169         75   225     75  225  169         168.967      0.033
PK12        132         52   173     52  173  132         133.400      -1.400
PK13        143         59   184     59  184  143         143.021      -0.021
PK14        153         67   194     67  194  153         153.375      -0.375
PK15        162         73   211     73  211  162         163.510      -1.510

Observation:
In the table above, the complete dataset is presented together with the prediction-table values x1, x2, observed_y, predicted_y, and residuals. Some key observations are:
  1. Residual magnitudes range from as low as 0.033 to as high as 12.496.
  2. The independent variables age and weight have a significant impact on the dependent variable systolic BP.
  3. Since age and weight strongly influence systolic blood pressure, changes in those variables make a substantial difference in the prediction.

2.4 Residuals
In this section, scatterplots of the residuals against age and against weight are presented in order to assess the fit of the regression.

#using the par function to present the two graphs side by side
par(mfrow = c(1, 2), mai = c(0.6, 0.8, 0.5, 0.4), mar = c(4, 4, 1, 1))

#scatterplot residuals vs age
plot(patientdata_predtable$residuals ~ patient_data$Age, xlab = "Age", ylab = "Residuals", pch=16)

#scatterplot residuals vs weight
plot(patientdata_predtable$residuals ~ patient_data$Weight, xlab = "Weight", ylab = "Residuals",pch=23)


Observation:
The two charts of residuals against age and residuals against weight give insight into how the residual values vary and how they are dispersed across the range.
  1. When a residual is low, i.e., close to zero, the predicted value is close to the observed value.
  2. When a residual is far from zero in either the negative or positive direction, the prediction deviates substantially from the observation.
  3. In the charts above, the residual patterns with respect to age and weight are broadly similar.
An article by Straume and Johnson (1992) states that the main purpose of residual analysis is to assess goodness of fit; it helps in understanding the quantitative relationship between the variables.
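Base R also offers standard diagnostic plots for the same purpose (a sketch; plot.lm is part of base R):

#built-in residual diagnostics for the fitted multiple regression model
par(mfrow = c(1, 2))
plot(multip_linreg, which = 1)   #residuals vs fitted values
plot(multip_linreg, which = 2)   #normal Q-Q plot of residuals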

III. CONCLUSION
In this report, we have explored fundamental concepts of correlation and linear regression analysis within the context of a dataset called “mpg,” focusing on variables related to fuel economy. The introduction emphasizes the importance of these analyses in understanding relationships, dependencies, and predictive capabilities between variables (Bluman, 2018). The practical applications of regression analysis, such as predicting vendor performance and assessing the impact of system changes, underscore its significance in real-world scenarios.
The analysis section begins with a detailed exploration of simple regression, employing descriptive statistics and correlation coefficients. The correlation table and subsequent determination coefficients offer insights into the strength and nature of the relationship between the number of cylinders and vehicle displacement. Linear regression is then applied, providing a formula representing the relationship between cylinders and displacement. A scatterplot visually depicts linear regression, emphasizing the close accumulation of data points and the absence of outliers.
The report extends its analysis to multiple regression, using a patient dataset featuring variables like Systolic Blood Pressure, age, and weight. Hypothesis testing is conducted to assess the significance of the relationships. The calculated correlation and determination coefficients between age, weight, and systolic blood pressure reveal strong positive associations. Scatterplots with linear regression lines visually represent these relationships. The report concludes by applying the multiple regression model to predict systolic blood pressure based on age and weight, emphasizing its practical implications in healthcare scenarios.
Throughout the report, we underscored the advantages of using the R programming language for data analysis, citing factors such as extensive community support, flexibility, integration capabilities, reproducibility, and transparency, contributing to the credibility and rigor of the findings. Including practical applications and hands-on analysis of datasets enhances the report’s educational value, providing a comprehensive overview of correlation and regression techniques in a practical context.

IV. BIBLIOGRAPHY
1. Straume, M., & Johnson, M. L. (1992). Analysis of residuals: Criteria for determining goodness-of-fit. Methods in Enzymology, 210, 87-105. Academic Press. https://doi.org/10.1016/0076-6879(92)10007-Z
2. Riffenburgh, R. H., & Gillen, D. L. (2020). Linear regression and correlation. In Statistics in Medicine (4th ed., pp. 357-390). Academic Press. https://doi.org/10.1016/B978-0-12-815328-4.00015-2
3. Lalanne, C., & Mesbah, M. (2017). Correlation, linear regression. In Biostatistics and Computer-based Analysis of Health Data using SAS (pp. 77-96). Elsevier. https://doi.org/10.1016/B978-1-78548-111-6.50004-6
4. Carroll, S. R., & Carroll, D. J. (2023). Simplifying Statistics for Graduate Students: Making the Use of Data Simple and User-Friendly. Rowman & Littlefield. ProQuest Ebook Central, https://www.proquest.com/legacydocview/EBC/7222367?accountid=12826
5. Nayebi, H. (2020). Advanced Statistics for Testing Assumed Causal Relationships: Multiple Regression Analysis, Path Analysis, Logistic Regression Analysis. Springer Nature.
6. Razak, F. A., Rashidah, N., Baharun, N., & Deraman, N. A. (2018). Hypothesis testing on regression: Investigating students' skill. International Journal of Engineering and Technology (UAE), 7(4), 45-48.
7. Franzese, R., & Kam, C. (2009). Modeling and Interpreting Interactive Hypotheses in Regression Analysis. University of Michigan Press.
8. Bluman, A. (2018). Elementary Statistics: A Step-by-Step Approach. Descriptive and inferential statistics (pp. 400-493).

V. APPENDIX
An R Markdown file is attached to this report. The name of the file is Final-Project_Correlation-and-Regression_Jayakumar.RMD

VI. ACKNOWLEDGEMENTS
I would like to thank my professor, Dee Chiluiza, for being such an inspiring teacher, and my fellow students for challenging me throughout the learning process.