ALY 6010
Northeastern University
Oladimeji Adegoke
Final Project Report
Instructor: Prof Dee Chiliuza
Data Analysis Using Simple and Multiple Regression to Determine Relationships
History of Correlation and Linear Regression
Correlation and linear regression were developed primarily by Francis Galton, Karl Pearson, and Sir Ronald A. Fisher. Galton introduced the concept of regression toward the mean, while Pearson formalized the mathematical framework for correlation. Linear regression is a method for determining the relationship between a dependent variable and one or more independent variables.
Correlation vs. Determination Coefficients
Correlation Coefficient: Measures the strength and direction of the linear relationship between the dependent and independent variables. It ranges from -1 to 1: a value of 1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship. It is represented by r.
Coefficient of Determination: Measures the proportion of the variation in the dependent variable that is explained by the independent variable(s). It ranges from 0 to 1, and a higher value indicates a better fit of the model to the data. It is represented by r^2 (written R^2 in multiple regression).
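As a minimal sketch in R (the vectors x and y are made up for illustration), cor() returns r, and squaring it gives the coefficient of determination:
# Hypothetical vectors for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
r <- cor(x, y)  # correlation coefficient r, always between -1 and 1
r               # close to 1: a strong positive linear relationship
r^2             # coefficient of determination, always between 0 and 1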
Simple vs. Multiple Regression Analysis
Simple Regression Analysis involves one independent variable and one dependent variable. It is used to understand and predict how changes in the independent variable affect the behavior of the dependent variable.
Multiple Regression Analysis involves one dependent variable and two or more independent variables. It is used to assess the effects of multiple factors on a response variable simultaneously, allowing for more complex models of reality.
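As a minimal sketch in R (dat, x1, x2, and y are hypothetical names), the two analyses differ only in the number of predictors on the right-hand side of the model formula:
# Hypothetical data: one response (y) and two predictors (x1, x2)
dat <- data.frame(x1 = c(1, 2, 3, 4, 5),
                  x2 = c(2, 1, 4, 3, 5),
                  y  = c(2.2, 2.9, 5.1, 5.8, 7.9))
fit_simple   <- lm(y ~ x1, data = dat)       # simple regression: one predictor
fit_multiple <- lm(y ~ x1 + x2, data = dat)  # multiple regression: two predictors
coef(fit_simple)
coef(fit_multiple)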
Importance and Applications in the Flour Mill Industry: Simple Regression could analyze the relationship between production volume and raw material costs, helping to forecast expenses based on planned production. Multiple Regression could predict flour quality based on various inputs, such as wheat quality, wheat type by country, milling temperature, and humidity.
Application of Hypothesis Testing in Regression Analysis
Hypothesis testing in regression analysis involves determining the significance of the coefficients in the model. For example, a multiple regression analysis could be conducted in the flour mill industry to determine whether grain moisture content, milling time, and machine speed significantly affect flour yield, as sketched below. Hypothesis tests ascertain whether changes in these predictors lead to statistically significant changes in yield, informing optimization strategies.
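A sketch of how such a test might look in R, using simulated stand-in data (mill_data and its columns are hypothetical names, and the simulated effects are illustrative only): summary() reports a t test for each coefficient, where the null hypothesis is that the coefficient equals zero.
# Simulated stand-in for flour-mill data (values are illustrative only)
set.seed(1)
mill_data <- data.frame(moisture  = runif(20, 10, 15),    # grain moisture (%)
                        mill_time = runif(20, 30, 60),    # milling time (min)
                        speed     = runif(20, 800, 1200)) # machine speed (rpm)
mill_data$yield <- 50 + 2 * mill_data$moisture +
  0.1 * mill_data$mill_time + rnorm(20)
fit <- lm(yield ~ moisture + mill_time + speed, data = mill_data)
summary(fit)  # small p-values (< 0.05) reject H0: coefficient = 0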
Independent vs. Dependent Samples
Independent Samples come from distinct, unrelated groups. In hypothesis testing, comparing means from independent samples (e.g., flour quality from two different mills) uses tests like the independent samples t-test.
Dependent Samples come from related or matched groups. For instance, testing flour quality before and after a new purification process in the same mill involves analyzing dependent samples using paired tests.
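A minimal sketch in R (all quality scores are made-up numbers):
# Independent samples: quality scores from two different mills
mill_A <- c(81, 79, 84, 78, 82)
mill_B <- c(85, 83, 88, 84, 86)
t.test(mill_A, mill_B)               # independent-samples t test

# Dependent samples: the same mill before and after a new purification process
before <- c(80, 78, 83, 79, 81)
after  <- c(84, 82, 86, 83, 85)
t.test(before, after, paired = TRUE) # paired t test for matched measurements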
Importance of the Final Project
The final project encapsulates the application of statistical concepts and analytical skills acquired during the class, showcasing the ability to apply theory to real-world problems. It provides a comprehensive platform to demonstrate proficiency in data collection, analysis, interpretation, and the effective use of software tools like R, integrating knowledge into a coherent analysis of a practical situation.
Advantages of Using R
R is a tool for statistical analysis, visualization, and data science. Its advantages include: Comprehensive Statistical Analysis Capabilities: Offers packages for a wide range of statistical methods.
Data Visualization: Advanced plotting libraries for creating professional graphs and charts.
Community Support: A vast community for learning resources, forums, and free packages.
Flexibility: Can handle data of different types and from various sources.
Integration: Easily integrates with other languages and tools, facilitating advanced analytics.
Using R in the final project enhances the ability to conduct sophisticated analyses, apply various statistical techniques, and communicate findings effectively, demonstrating high analytical proficiency.
References:
Bluman, A.G. (2009). Elementary Statistics: A Step-by-Step Approach (7th ed.). McGraw-Hill.
Wald, A. (1950). Selected Papers in Statistics and Probability. Stanford: Institute of Mathematical Statistics.
Grimmett, G.R., & Stirzaker, D.R. (1992). Probability and Random Processes. Oxford University Press.
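# Load the packages used throughout this report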
library(readxl)
library(dplyr)
library(psych)
library(knitr)
library(kableExtra)
library(tidyverse)
library(magrittr)
library(RColorBrewer)
library(brio)
library(ggplot2)
#Task 1.1
#use describe() from the psych package to summarize the mpg dataset
describe(mpg)
## vars n mean sd median trimmed mad min max range
## manufacturer* 1 234 7.76 5.13 6.0 7.68 5.93 1.0 15 14.0
## model* 2 234 19.09 11.15 18.5 18.98 14.08 1.0 38 37.0
## displ 3 234 3.47 1.29 3.3 3.39 1.33 1.6 7 5.4
## year 4 234 2003.50 4.51 2003.5 2003.50 6.67 1999.0 2008 9.0
## cyl 5 234 5.89 1.61 6.0 5.86 2.97 4.0 8 4.0
## trans* 6 234 5.65 2.88 4.0 5.53 1.48 1.0 10 9.0
## drv* 7 234 1.67 0.66 2.0 1.59 1.48 1.0 3 2.0
## cty 8 234 16.86 4.26 17.0 16.61 4.45 9.0 35 26.0
## hwy 9 234 23.44 5.95 24.0 23.23 7.41 12.0 44 32.0
## fl* 10 234 4.63 0.70 5.0 4.77 0.00 1.0 5 4.0
## class* 11 234 4.59 1.99 5.0 4.64 2.97 1.0 7 6.0
## skew kurtosis se
## manufacturer* 0.21 -1.63 0.34
## model* 0.11 -1.23 0.73
## displ 0.44 -0.91 0.08
## year 0.00 -2.01 0.29
## cyl 0.11 -1.46 0.11
## trans* 0.29 -1.65 0.19
## drv* 0.48 -0.76 0.04
## cty 0.79 1.43 0.28
## hwy 0.36 0.14 0.39
## fl* -2.25 5.76 0.05
## class* -0.14 -1.52 0.13
#transpose the describe() output so the statistics become rows
trans<-t(describe(mpg))
#create a table
table <- trans %>%
kable("html",digits = 1) %>%
kable_styling(full_width = FALSE)
table
|   | manufacturer* | model* | displ | year | cyl | trans* | drv* | cty | hwy | fl* | class* |
|---|---|---|---|---|---|---|---|---|---|---|---|
| vars | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 |
| n | 234.0 | 234.0 | 234.0 | 234.0 | 234.0 | 234.0 | 234.0 | 234.0 | 234.0 | 234.0 | 234.0 |
| mean | 7.8 | 19.1 | 3.5 | 2003.5 | 5.9 | 5.7 | 1.7 | 16.9 | 23.4 | 4.6 | 4.6 |
| sd | 5.1 | 11.1 | 1.3 | 4.5 | 1.6 | 2.9 | 0.7 | 4.3 | 6.0 | 0.7 | 2.0 |
| median | 6.0 | 18.5 | 3.3 | 2003.5 | 6.0 | 4.0 | 2.0 | 17.0 | 24.0 | 5.0 | 5.0 |
| trimmed | 7.7 | 19.0 | 3.4 | 2003.5 | 5.9 | 5.5 | 1.6 | 16.6 | 23.2 | 4.8 | 4.6 |
| mad | 5.9 | 14.1 | 1.3 | 6.7 | 3.0 | 1.5 | 1.5 | 4.4 | 7.4 | 0.0 | 3.0 |
| min | 1.0 | 1.0 | 1.6 | 1999.0 | 4.0 | 1.0 | 1.0 | 9.0 | 12.0 | 1.0 | 1.0 |
| max | 15.0 | 38.0 | 7.0 | 2008.0 | 8.0 | 10.0 | 3.0 | 35.0 | 44.0 | 5.0 | 7.0 |
| range | 14.0 | 37.0 | 5.4 | 9.0 | 4.0 | 9.0 | 2.0 | 26.0 | 32.0 | 4.0 | 6.0 |
| skew | 0.2 | 0.1 | 0.4 | 0.0 | 0.1 | 0.3 | 0.5 | 0.8 | 0.4 | -2.3 | -0.1 |
| kurtosis | -1.6 | -1.2 | -0.9 | -2.0 | -1.5 | -1.7 | -0.8 | 1.4 | 0.1 | 5.8 | -1.5 |
| se | 0.3 | 0.7 | 0.1 | 0.3 | 0.1 | 0.2 | 0.0 | 0.3 | 0.4 | 0.0 | 0.1 |
Observation
The engine displacement (displ) summary indicates that engine sizes in the dataset range from 1.6 to 7 liters, averaging about 3.47 liters. The close mean and median values suggest a roughly symmetric distribution of engine sizes, but the presence of engines as large as 7 liters indicates some right skew and variability in engine sizes among the vehicles.
#Task 1.2
#values of displacement per cylinder in mpg
mpg_d<-mpg %>%
group_by(Cylinder=cyl) %>%
summarise(mean=mean(displ),SD=sd(displ),
minimum=min(displ),maximum=max(displ))
mpg_d %>%
kable(align = "c",
caption = "Values of displacement per cylinders ",
format = "html",
digits = 1,
table.attr = "style='width:50%;'")%>%
kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
html_font = "Cambria",
position = "center",
font_size = 12) %>%
add_header_above(c(" " = 1,"Displacement" = 4))
Values of displacement per cylinder (mean, SD, minimum, and maximum of displ):

| Cylinder | mean | SD | minimum | maximum |
|---|---|---|---|---|
| 4 | 2.1 | 0.3 | 1.6 | 2.7 |
| 5 | 2.5 | 0.0 | 2.5 | 2.5 |
| 6 | 3.4 | 0.5 | 2.5 | 4.2 |
| 8 | 5.1 | 0.6 | 4.0 | 7.0 |
Observation
This table demonstrates a clear trend: the number of cylinders in an engine is positively associated with its displacement. As the number of cylinders increases, so does the average engine displacement, indicating larger engines. The variability (SD) in displacement also generally increases with the number of cylinders, except for the 5-cylinder category, which reflects limited data (every 5-cylinder car in the dataset has a 2.5-liter engine). This trend is consistent with automotive engineering principles, where more cylinders usually equate to larger engine displacement capable of generating more power.
#Task 1.3
#calculate the coefficient of correlation
#product column x*y
mpg_xy <- mpg$cyl * mpg$displ
#new column for x squared
mpg_x2 <- mpg$cyl^2
#new column for y squared
mpg_y2 <- mpg$displ^2
# Display the first 5 observations (the new columns are stored as separate vectors, not added to mpg)
head(mpg, 5)
## # A tibble: 5 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
# Calculate the sums
sum_x <- sum(mpg$cyl)
sum_y <- sum(mpg$displ)
sum_xy <- sum(mpg_xy)
sum_x2 <- sum(mpg_x2)
sum_y2 <- sum(mpg_y2)
# Present the sums
sum_x
## [1] 1378
sum_y
## [1] 812.4
sum_xy
## [1] 5235.4
sum_x2
## [1] 8720
sum_y2
## [1] 3209.4
Observations from the Coding Process
The first five rows show specifications and performance metrics for Audi A4 models across different years: engine displacement (displ), production year (year), number of cylinders (cyl), transmission type (trans), drive type (drv), city (cty) and highway (hwy) fuel efficiency, and fuel type (fl). The models range from 1999 to 2008, featuring both 4- and 6-cylinder engines with varying displacements and transmission options; the 2008 models show slightly higher highway fuel efficiency than their 1999 counterparts, while petrol (p) remains the fuel type throughout.
#Task 1.4
#use Bluman's formula to calculate the coefficients of correlation and determination
#coefficient of correlation
#NOTE: this line hard-codes n = 5, but the sums above run over all 234 rows;
#a corrected calculation follows the observation below
r=c(5*(sum_xy)-(sum_x)*(sum_y))/sqrt((5*(sum_x2)-(sum_x)^2)*((5*(sum_y2)-(sum_y)^2)))
r
## [1] -1.000261
#coefficient of determination
co_determ<-r^2
co_determ
## [1] 1.000522
Observation
The correlation coefficient r = -1.000261 lies outside the theoretical range of -1 to 1, and the coefficient of determination r^2 = 1.000522 likewise exceeds its maximum of 1, so both values are invalid. The cause is the hard-coded n = 5 in the formula: the sums were computed over all 234 observations, so n must be 234. With the correct sample size, the formula yields r ≈ 0.93, indicating a strong positive linear relationship between cylinders and displacement, and r^2 ≈ 0.87, consistent with the Multiple R-squared of 0.8653 reported by the regression in Task 1.5.
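A corrected sketch, replacing the hard-coded 5 with the actual number of observations, brings the coefficients back into their valid ranges:
# Corrected calculation: n is the number of observations, not 5
n <- nrow(mpg)  # 234
r_corrected <- (n*sum_xy - sum_x*sum_y) /
  sqrt((n*sum_x2 - sum_x^2) * (n*sum_y2 - sum_y^2))
r_corrected    # approximately 0.93, inside the valid range [-1, 1]
r_corrected^2  # approximately 0.87, matching the R-squared of 0.8653 in Task 1.5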
#Task 1.5
#linear regression model: cylinders (x_cyl) regressed on displacement (y_displ)
x_cyl<-c(mpg$cyl)
y_displ<-c(mpg$displ)
lm(x_cyl~y_displ)
##
## Call:
## lm(formula = x_cyl ~ y_displ)
##
## Coefficients:
## (Intercept) y_displ
## 1.86 1.16
linear_regre<-lm(x_cyl~y_displ)
#object storing summary
sumari_<- summary(linear_regre)
sumari_
##
## Call:
## lm(formula = x_cyl ~ y_displ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98276 -0.47433 -0.05137 0.54251 1.49822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.86048 0.11130 16.72 <2e-16 ***
## y_displ 1.16033 0.03005 38.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5927 on 232 degrees of freedom
## Multiple R-squared: 0.8653, Adjusted R-squared: 0.8647
## F-statistic: 1491 on 1 and 232 DF, p-value: < 2.2e-16
# Extract the coefficient table and the unscaled covariance matrix
# (note: despite its name, b_slope stores summary()$cov.unscaled;
# the slope itself is the y_displ estimate in the coefficient table)
a_inter <- sumari_$coefficients
b_slope <- sumari_$cov.unscaled
a_inter
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.860477 0.11130084 16.71575 1.064708e-41
## y_displ 1.160326 0.03005346 38.60872 5.598739e-103
b_slope
## (Intercept) y_displ
## (Intercept) 0.03526587 -0.008926900
## y_displ -0.00892690 0.002571264
#present the fitted linear regression model
linear_regre
##
## Call:
## lm(formula = x_cyl ~ y_displ)
##
## Coefficients:
## (Intercept) y_displ
## 1.86 1.16
Observation
This analysis reveals a strong relationship between the number of cylinders (x_cyl) and engine displacement (y_displ), with an intercept of 1.86 and a slope of 1.16. Because the model regresses cylinders on displacement, the slope means that each additional liter of displacement is associated with an increase of approximately 1.16 in the predicted number of cylinders. The model is highly significant, with p-values below 0.001 for both the intercept and the slope. The residuals range from -1.98 to 1.50 with a median close to zero, suggesting the model's predictions are reasonably accurate across the dataset, though with some variability; a quick numerical check of the slope follows below.
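A short sanity check of this interpretation (a sketch reusing the fitted linear_regre object from above) is to compare predictions at two displacement values one liter apart:
# Predicted cylinder counts at 2.0 L and 3.0 L of displacement
predict(linear_regre, newdata = data.frame(y_displ = c(2.0, 3.0)))
# about 1.86 + 1.16*2.0 = 4.18 and 1.86 + 1.16*3.0 = 5.34;
# the difference between the two predictions is the slope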
#Task 1.6
plot(x_cyl~y_displ,
xlab="displacement",
ylab = "cylinders",
las=2,col="#99004C",
main="linear regression",pch = 19,lty=1,lwd=1)
abline(linear_regre)
Observation
The plot visualizes the relationship between displacement (y_displ), shown on the horizontal axis as the predictor, and the number of cylinders (x_cyl), shown on the vertical axis as the response. The regression line added with abline(linear_regre) indicates the trend and predictive relationship: as displacement increases, the predicted number of cylinders increases, as denoted by the line's positive slope.
#Task 1.7
# Use mutate() to add the predicted y values and residuals to the data
dat_update <- mpg %>%
mutate(
predict_y = predict(linear_regre),
residual = resid(linear_regre))
# Display the first 10 observations
head(dat_update, 10)
## # A tibble: 10 × 13
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 2 more variables: predict_y <dbl>, residual <dbl>
Observation
The output shows the first ten observations of the mpg dataset after adding the predicted values from the linear_regre model (predict_y) and the corresponding residuals. These two columns provide insight into the model's performance by comparing actual versus predicted outcomes and quantifying the deviation (residual) for vehicles of varying models, years, cylinder counts, and transmission types.
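As an optional diagnostic (a sketch using the ggplot2 package loaded earlier), plotting residuals against predicted values can reveal patterns the summary table hides; a roughly patternless band around zero supports the linear fit:
# Residuals vs. predicted values for the linear_regre model
ggplot(dat_update, aes(x = predict_y, y = residual)) +
  geom_point(color = "#99004C") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. predicted values",
       x = "predicted cylinders", y = "residual")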
#Task 2.1
#create variable object
pat_ID<-c("pk01","pk02","pk03","pk04","pk05","pk06","pk07","pk08","pk09","pk10","pk11","pk12","pk13","pk14","pk15")
age_x1<-c(45,60,55,60,62,71,57,59,64,42,75,52,59,67,73)
weight_x2<-c(135,182,148,182,190,232,194,182,217,171,225,173,184,194,211)
systobp_y<-c(112,156,125,145,155,162,139,144,153,126,169,132,143,153,162)
#create data frame
data<-data.frame(age_x1,weight_x2,systobp_y)
cor(data)
## age_x1 weight_x2 systobp_y
## age_x1 1.0000000 0.8381118 0.9246241
## weight_x2 0.8381118 1.0000000 0.8980379
## systobp_y 0.9246241 0.8980379 1.0000000
round(cor(data),2)
## age_x1 weight_x2 systobp_y
## age_x1 1.00 0.84 0.92
## weight_x2 0.84 1.00 0.90
## systobp_y 0.92 0.90 1.00
#correlations taken from the rounded matrix above
ryx1<-0.92  # age_x1 vs. systobp_y
ryx2<-0.90  # weight_x2 vs. systobp_y
rx1x2<-0.84 # age_x1 vs. weight_x2
#multiple regression model
fit<-lm(systobp_y~age_x1+weight_x2)
summary(fit)
##
## Call:
## lm(formula = systobp_y ~ age_x1 + weight_x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1106 -3.8580 -0.3748 1.4868 12.4956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.15749 10.19923 3.839 0.00236 **
## age_x1 0.98241 0.27603 3.559 0.00393 **
## weight_x2 0.24946 0.09792 2.548 0.02557 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.287 on 12 degrees of freedom
## Multiple R-squared: 0.9059, Adjusted R-squared: 0.8902
## F-statistic: 57.73 on 2 and 12 DF, p-value: 6.963e-07
#correlation and determination coefficients
#multiple correlation coefficient
#NOTE: operator precedence makes this compute sqrt(...)/1 - rx1x2^2 instead of
#dividing by (1 - rx1x2^2) inside the square root; a corrected version follows the observation below
R=sqrt((ryx1^2+ryx2^2)-(2*ryx1*ryx2*rx1x2))/1-rx1x2^2
R
## [1] -0.1904689
#determination
determination<-R^2
determination
## [1] 0.03627842
#F test value
#NOTE: this hard-codes n = 5, but there are 15 patients; a corrected version follows the observation below
F<-(determination/2)/((1-determination)/(5-2-1))
F
## [1] 0.03764409
Observation
The correlation matrix reveals strong positive relationships between age, weight, and systolic blood pressure (systobp_y) among the 15 patients: as age and weight increase, so does systolic blood pressure. Specifically, the correlation coefficients of age and weight with systolic blood pressure are 0.92 and 0.90, respectively, indicating strong relationships.
The multiple regression analysis further quantifies this relationship, with age and weight both significant predictors of systolic blood pressure, as evidenced by p-values below 0.05. However, the manually computed multiple correlation (R = -0.19, determination = 0.036) contradicts the high individual correlations and the significant regression coefficients. The discrepancy stems from the two coding errors flagged above: the denominator (1 - rx1x2^2) must sit inside the square root, and the F statistic should use n = 15 rather than 5. With those corrections, shown below, R ≈ 0.95 and R-squared ≈ 0.90, consistent with the Multiple R-squared of 0.9059 from the fitted model.
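A corrected sketch, parenthesizing the denominator inside the square root and using n = 15 in the F test, recovers values consistent with the lm() output above:
# Corrected multiple correlation: (1 - rx1x2^2) belongs inside the sqrt
R_corrected <- sqrt((ryx1^2 + ryx2^2 - 2*ryx1*ryx2*rx1x2) / (1 - rx1x2^2))
R_corrected   # approximately 0.95
R2_corrected <- R_corrected^2
R2_corrected  # approximately 0.90, close to the Multiple R-squared of 0.9059

# Corrected F test: n = 15 patients, k = 2 predictors
n <- 15; k <- 2
F_corrected <- (R2_corrected/k) / ((1 - R2_corrected)/(n - k - 1))
F_corrected   # approximately 55, close to the reported F statistic of 57.73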
#Task 2.2
par(mfrow=c(1,2))
linear_regree<-lm(age_x1~systobp_y)
plot(age_x1~systobp_y,
xlab="systolic blood pressure",
ylab = "age",
las=2,col="#99004C",
main="linear regression",pch = 20,lty=1,lwd=1)
abline(linear_regree)
linear_regr<-lm(weight_x2~systobp_y)
plot(weight_x2~systobp_y,
xlab="systolic blood pressure",
ylab = "weight",
las=2,col="#99004C",
main="linear regression",pch = 20,lty=1,lwd=1)
abline(linear_regr)
The analysis reveals significant relationships among age, weight, and systolic blood pressure, as evidenced by high correlation coefficients and a multiple regression model that predicts systolic blood pressure from age and weight. The correlation matrix shows positive correlations between systolic blood pressure and both age (r = 0.92) and weight (r = 0.90), suggesting that increases in either age or weight are associated with increases in systolic blood pressure. The multiple regression analysis supports these findings, with both age and weight serving as significant predictors of systolic blood pressure. The originally calculated correlation (R = -0.19) and determination (0.036) coefficients, along with the near-zero F value, are artifacts of the coding errors identified in Task 2.1; the corrected values (R ≈ 0.95, R-squared ≈ 0.90, F ≈ 55) agree with the high R-squared (0.9059) of the regression model.
The data analysis underscores the positive relationships between age, weight, and systolic blood pressure in this sample. The regression model effectively captures these relationships, highlighting the significance of both age and weight as predictors of systolic blood pressure.
RECOMMENDATION
Given the clear impact of weight and age on systolic blood pressure, health interventions targeting weight management and monitoring blood pressure as individuals age could help manage the health risks associated with elevated blood pressure. The corrected correlation and determination coefficients should be used when reporting results, to ensure accurate interpretation of the data. Additionally, extending the analysis to include more variables and a larger sample size could provide deeper insight into the factors influencing systolic blood pressure.
References:
Bluman, A.G. (2009). Elementary Statistics: A Step-by-Step Approach (7th ed.). McGraw-Hill.
Kirkpatrick, E.G. (1974). Introductory Statistics and Probability for Engineering, Science, and Technology.
Wald, A. (1950). Selected Papers in Statistics and Probability. Stanford: Institute of Mathematical Statistics.
Hudson, D.J. (1963). Lectures on Elementary Statistics and Probability.
Lindgren, B.W. Introduction to Probability and Statistics.
Winkler, R.L., & Hays, W.L. Statistics: Probability, Inference, and Decision.
Hacking, I. (1975). The Emergence of Probability. Cambridge University Press.
An R Markdown file has been attached to this report. The name of the file is Adegoke_Aly6010_finalproject.RMD
I would like to express my appreciation to my instructor, Professor Dee Chiliuza, for her practical guidance throughout the course. I have gained a great deal from her teachings in the field of analytics and look forward to continuing to learn from her. I also want to thank my academic advisor, Naomi Cottrelle, for guiding me through course registration. Finally, I want to say thank you to all my classmates, especially my friend Abiram Magesh.