PROBABILITY THEORY AND INTRODUCTORY STATISTICS
SWAPNESH TIWARI
CORRELATION AND REGRESSION
Date : 12 October, 2022







Section 1 - Intro


Details on the data set

I will be using an open-source data set named mpg, which is built into ggplot2, a package in R. The data set consists of 11 columns and 234 observations and describes the configuration of each car: manufacturer, model, displacement, year, number of cylinders, transmission, drive type, city and highway fuel economy, fuel type and class of vehicle. It contains 6 categorical and 5 numerical variables. Two variables are used in the simple regression part of this project: cylinders, which is multi-valued discrete data, and displacement, which is continuous data. The cars in the data set have 4, 5, 6 or 8 cylinders. In the second part, on multiple regression, we will use three continuous variables from a separate patient data set: age, weight and systolic blood pressure.

History about Correlation and Linear Regression

Going back to the history of correlation and regression, the first tools for both were developed by Sir Francis Galton in 1888, Denis, D. (2001). Later, in 1898, Karl Pearson developed a mathematical equation for correlation, which is still used today to measure the relationship between two variables, Stigler, S. M. (1986). There is, however, a lot of confusion about whether the work of Auguste Bravais also contributed to the invention of correlation techniques, Denis, D. (2001). It is interesting to note that the first statistical treatment of errors was applied in astronomy in the 18th century. These techniques were mainly used to determine the shape of the Earth, and astronomers had a hard time finding a way to combine all their observations into a single value so that measurements would be more reliable and accurate, Denis, D. (2001). Robert Adrain, a well-known American mathematician, is also credited with describing the probability that errors in two variables occur together at a given point, Walker, H. M. (1929).

Figure 1 above shows Galton's “Table of Correlation”, Denis, D. (2001).


In the above diagram we can see that some elements are still missing from Galton's equation. In 1885 Galton set out to support his regression model with evidence, so he collected data on 928 adult offspring and their parents. He first generated the Table of Correlation presented above, which sets the “heights of mid-parents against the heights of offspring”. For example, if the mid-parent height was about 70 inches and the adult offspring had a height of 67 inches, Galton marked the observation in the cell where that row and column meet at a 90-degree angle.

Practical applications of correlation in medical industry

One of the best examples of regression analysis comes from the medical industry, where researchers use it to identify the relationship between a drug and its outcome. To illustrate, consider an example based on the study published by Azoulay, L. in 2017:

Take sulfonylurea, an anti-diabetic drug. A researcher wants to determine whether administering the drug lowers the blood glucose level, and uses regression analysis to find out. The study involves two variables: the predictor variable is the drug dose and the response variable is the blood glucose level. Putting these into the regression formula gives:

Blood Sugar Level = B0 + B1(Drug Dose)

Where the coefficient B0 is the expected blood glucose level when no drug is administered.

Where the coefficient B1 is the mean change in blood glucose level for each one-unit increase in drug dose.

If B1 is negative, the blood glucose level decreases as the dose increases.

If B1 is very close to zero, the dose has little or no effect on the blood glucose level.

If B1 is positive, the blood glucose level increases as the dose increases.

So the researcher's decision depends on the sign and size of B1.
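
The interpretation of B0 and B1 can be tried out on made-up numbers. The sketch below is only an illustration: the doses, the glucose values and the names DrugDose, BloodSugar and DoseModel are assumptions of mine and are not taken from the cited study.

```r
# Hypothetical, simulated dose-response data (not from the cited study)
set.seed(1)
DrugDose = rep(c(0, 5, 10, 15, 20), each = 6)   # assumed dose levels
BloodSugar = 180 - 2.5 * DrugDose + rnorm(length(DrugDose), sd = 8)

# Blood Sugar Level = B0 + B1(Drug Dose)
DoseModel = lm(BloodSugar ~ DrugDose)
B0 = coef(DoseModel)[["(Intercept)"]]   # expected glucose level at dose 0
B1 = coef(DoseModel)[["DrugDose"]]      # mean change in glucose per unit of dose

# Reading the sign of B1
Direction = ifelse(B1 < 0,
                   "glucose falls as dose rises",
                   "glucose does not fall as dose rises")

print(round(c(B0 = B0, B1 = B1), 2))
print(Direction)
```

Because the simulated data were generated with a negative slope, the fitted B1 should come out negative here; with real data the sign is exactly what the researcher is trying to learn.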

Hypothesis testing in the context of regression analysis

In linear regression, hypothesis testing is built on the relationship between the independent variable (the predictor) and the dependent variable (the response). Two tests are commonly used. The F-test checks whether the model as a whole explains a significant share of the variation in the response, with the null hypothesis that all slope coefficients are zero. The t-test checks whether an individual slope coefficient differs from zero, and it is used because the population standard deviation is unknown and has to be estimated from the sample. As analysts, we also need to make sure that the data we use are a good fit for the regression model, Xue, K., & Yao, F. (2021).
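
As a rough sketch of where these two tests appear in R, the example below fits a model to simulated data (the values of x and y are assumptions for illustration only) and reads the t-test for the slope and the F-test for the whole model from summary().

```r
# Simulated data for illustration only (assumed values)
set.seed(42)
x = seq(1, 20)
y = 3 + 0.8 * x + rnorm(20, sd = 2)

Fit = summary(lm(y ~ x))

# t-test for the slope: H0 is that the slope equals zero
SlopeT = Fit$coefficients["x", "t value"]
SlopeP = Fit$coefficients["x", "Pr(>|t|)"]

# F-test for the whole model: H0 is that all slopes equal zero
FStat = Fit$fstatistic[["value"]]
FP = pf(FStat, Fit$fstatistic[["numdf"]], Fit$fstatistic[["dendf"]],
        lower.tail = FALSE)

# With a single predictor the F statistic equals the squared t statistic
print(c(t = SlopeT, F = FStat, t_squared = SlopeT^2))
print(c(p_slope = SlopeP, p_model = FP))
```

With only one predictor the two tests answer the same question, which is why the F statistic matches the squared t statistic; they differ once a model has several predictors.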

Regarding hypothesis testing for correlation, Bluman, A. G. (2009) describes five possible relationships between variables. Three of them are:
1. The independent variable may directly cause the change in the dependent variable.
2. The reverse situation can also exist: a statistician may find that consuming too much nicotine causes nervousness, when it may be that nervousness causes the consumption of nicotine.
3. A third variable may be involved in the relationship.

Importance of Final Project

Analytics is a field in which new things are developed with the help of data, and data also provide evidence of whether an innovation will be effective; doing this requires vast amounts of data, Cooper, A. (2012). This course has taught me how to gather, interpret and analyze data and how to make decisions using it. The techniques covered include testing hypotheses with the critical value approach and with the P-value approach. These techniques are essential for me, since I will be entering the medical industry as an analyst and such tests are important when testing the action of drugs. This final project will help me develop skills in comparing two populations and in developing hypotheses based on correlation and regression models. Correlation is important in the medical industry when comparing the effects of drugs, with one variable treated as independent and the other as dependent, as in the example above.

Advantages of using R

R has many advantages for data analysis. It is easy to use, produces high-quality graphics for presenting data, and has many packages that make the output more informative. Its many statistical tools make the analysis more presentable and understandable. Best of all, R is open-source software and is supported by many operating systems. R can also work with SQL databases, can be used alongside Python and C++, and is compatible with other languages. It is stable and reliable for data analysis, Kabacoff, R. I. (2015).





# Library Used in this project

#Important Libraries
library(ggplot2)
library(dplyr)
library(knitr)

#Extra Packages used
library(magrittr)
library(kableExtra)
library(RColorBrewer)



# Open source data sets will be used in this project to practice analysis of data

#Creating a Data object

CarInfo = mpg




Section 2 - Simple Regression



1. Description of the data set


In this task I will use several functions to retrieve information about the mpg data set, which is part of ggplot2. To do this I need the ggplot2 library and the psych package. I will use the describe function and, to give the table a readable shape and size, rotate it with t, which transposes the table.

#Created a data set named CarInfo and used it to fetch details such as variation, number of samples, and measures of central tendency and dispersion.
CarInfo%>%
  dplyr::select("manufacturer", "model", "displ", "year","cyl", "trans", "drv", "cty", "hwy", "fl", "class")%>%
  psych::describe()%>% #Used Psych code to describe the data.
  t()%>% #used transpose to create a good orientation of table.
  round(2)%>% #rounded to two decimals
  knitr::kable( caption = " Table 1 - Descriptive statistics of MPG data set from ggplot2")%>% 
  kableExtra::kable_paper()
Table 1 - Descriptive statistics of MPG data set from ggplot2
statistic manufacturer* model* displ year cyl trans* drv* cty hwy fl* class*
vars 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00
n 234.00 234.00 234.00 234.00 234.00 234.00 234.00 234.00 234.00 234.00 234.00
mean 7.76 19.09 3.47 2003.50 5.89 5.65 1.67 16.86 23.44 4.63 4.59
sd 5.13 11.15 1.29 4.51 1.61 2.88 0.66 4.26 5.95 0.70 1.99
median 6.00 18.50 3.30 2003.50 6.00 4.00 2.00 17.00 24.00 5.00 5.00
trimmed 7.68 18.98 3.39 2003.50 5.86 5.53 1.59 16.61 23.23 4.77 4.64
mad 5.93 14.08 1.33 6.67 2.97 1.48 1.48 4.45 7.41 0.00 2.97
min 1.00 1.00 1.60 1999.00 4.00 1.00 1.00 9.00 12.00 1.00 1.00
max 15.00 38.00 7.00 2008.00 8.00 10.00 3.00 35.00 44.00 5.00 7.00
range 14.00 37.00 5.40 9.00 4.00 9.00 2.00 26.00 32.00 4.00 6.00
skew 0.21 0.11 0.44 0.00 0.11 0.29 0.48 0.79 0.36 -2.25 -0.14
kurtosis -1.63 -1.23 -0.91 -2.01 -1.46 -1.65 -0.76 1.43 0.14 5.76 -1.52
se 0.34 0.73 0.08 0.29 0.11 0.19 0.04 0.28 0.39 0.05 0.13



The table shows the mean, median and other measures for all 11 columns. All values are reported as numbers: for the categorical columns, marked with an asterisk, describe converts the categories to numeric codes, so their statistics are not meaningful as averages. Several columns show a negative kurtosis. Kurtosis measures how heavy the tails of a distribution are relative to its center, and a negative value indicates lighter tails than a normal distribution, Hogg, R. V. (1972).
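
To make the sign of kurtosis more concrete, here is a minimal sketch of the simple moment-based excess kurtosis. The function name ExcessKurtosis and the two test vectors are my own assumptions; psych::describe() applies a small-sample adjustment by default, so its values can differ slightly from this simpler version.

```r
# Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)
ExcessKurtosis = function(x) {
  d = x - mean(x)      # deviations from the mean
  m2 = mean(d^2)       # second central moment
  m4 = mean(d^4)       # fourth central moment
  m4 / m2^2 - 3
}

Flat = 1:100                      # evenly spread values, light tails
Peaked = c(rep(0, 98), -10, 10)   # mostly identical values, two extreme ones

print(ExcessKurtosis(Flat))       # negative: lighter tails than normal
print(ExcessKurtosis(Peaked))     # positive: heavier tails than normal
```

The evenly spread vector gives a negative value, much like year or cyl in Table 1, while the vector with two extreme values gives a large positive one.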






2. Statistics Table


Imagine we need to know whether the number of cylinders in a car affects its displacement, which in turn can affect performance. To find out, we check the relationship between displacement and cylinders. Using unique(CarInfo$cyl) we can see that the cars in the mpg data set have 4, 6, 8 or 5 cylinders.


NumOfCylinders = unique(CarInfo$cyl) 

#Calculating values for number of cylinders in car respect to displacement.

CarPlot = CarInfo %>% 
  group_by(Cylinders = cyl) %>%
  summarise(Mean_Value = mean(displ), 
            Standard_Deviation = sd(displ),
            Minimum_Value = min(displ),
            Maximum_Value = max(displ))

#Creating a table to describe all the values in one place
CarPlot %>%
  kable(align = "c",
        caption = "Table 2 - Statistics Of Displacement Per Cylinders In The Mpg Data Set",
        format = "html",
        digits = 2,
        table.attr = "style='width:75%;'")%>%
  kable_paper(bootstrap_options=c("hover","bordered"),
              html_font = "Times New Roman",
              position = "center",
              font_size = 14) %>%
  add_header_above(c(" " = 1,"Displacement" = 4)) #added header so that it accumulates the 4 columns under it.
Table 2 - Statistics Of Displacement Per Cylinders In The Mpg Data Set
Displacement
Cylinders Mean_Value Standard_Deviation Minimum_Value Maximum_Value
4 2.15 0.32 1.6 2.7
5 2.50 0.00 2.5 2.5
6 3.41 0.47 2.5 4.2
8 5.13 0.59 4.0 7.0



From the above table we can conclude that mean displacement rises with the number of cylinders, from 2.15 litres for 4 cylinders to 5.13 litres for 8 cylinders. The standard deviation for 5 cylinders is 0 because all four 5-cylinder cars in the data set have the same displacement of 2.5 litres. For a high-performance car it is essential to balance displacement with the number of cylinders, although there are exceptions at both ends: Ferrari built a 1.5-litre engine with 8 cylinders for the 1964 Ferrari 158, while the Chevrolet Corvette pairs a V8 engine with a very large displacement of 7.0 litres.






3. Finding Correlation and Determination


Let's assume I am a racer and I need a very fast car. I was told that the more cylinders a car has, the faster it is, and I want to know whether that is true. I therefore hire a data analyst, who checks the relationship between displacement and cylinders across the cars in the data set and hands me the results:

#Creating objects 

Displacement = (CarInfo$displ)
Cylinders = (CarInfo$cyl)
Model = (CarInfo$model)
Manufacturer = (CarInfo$manufacturer)

Model2 = head(Model,15)
Manufacturer2 = head(Manufacturer,15)

DisSum = sum(Displacement)
CylnSum = sum(Cylinders)

DispCyln = Displacement*Cylinders
SumDispCyln = sum(DispCyln)

DisplDet = (Displacement^2)
CylnDet = (Cylinders^2)

n = 234

#Using the Pearson formula for the coefficient of correlation
#r = (n*sum(xy) - sum(x)*sum(y)) / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))

R = (n*SumDispCyln - DisSum*CylnSum) /
  sqrt((n*sum(DisplDet) - DisSum^2) * (n*sum(CylnDet) - CylnSum^2))

#The coefficient of determination is the square of r
RSquared = R^2


#Creating objects to display the first 15 squared values used in the formula

DisplDet2 = head(DisplDet, 15 )
CylnDet2 = head(CylnDet, 15 )

#Creating matrix to create a good visualization of the values

#Table -1
Task1table1 = matrix(data = round(c(R, RSquared), 2),ncol = 2,byrow = F)
colnames(Task1table1) = c("Coefficient of Correlation", "Coefficient of Determination")
Matrix =  knitr::kable(Task1table1, caption = "Table 3 - Coefficients of Correlation and Determination Between Displacement and Cylinders") %>% 
  kableExtra::kable_paper()


#Table -2

Objects2  = c(Manufacturer2 , Model2,DisplDet2, CylnDet2)
Task1table2 = matrix(data = c(Objects2),ncol = 4,byrow = F)
colnames(Task1table2) = c("Manufacturer", "Car Names","Displacement Squared", "Cylinders Squared")
Matrix2 =  knitr::kable(Task1table2, caption = "Table 4 - First 15 Squared Values of Displacement and Cylinders Used in the Correlation Formula") %>% 
  kableExtra::kable_paper(full_width = FALSE)


Matrix
Table 3 - Coefficients of Correlation and Determination Between Displacement and Cylinders
Coefficient of Correlation Coefficient of Determination
0.93 0.87
Matrix2
Table 4 - First 15 Squared Values of Displacement and Cylinders Used in the Correlation Formula
Manufacturer Car Names Displacement Squared Cylinders Squared
audi a4 3.24 16
audi a4 3.24 16
audi a4 4 16
audi a4 4 16
audi a4 7.84 36
audi a4 7.84 36
audi a4 9.61 36
audi a4 quattro 3.24 16
audi a4 quattro 3.24 16
audi a4 quattro 4 16
audi a4 quattro 4 16
audi a4 quattro 7.84 36
audi a4 quattro 7.84 36
audi a4 quattro 9.61 36
audi a4 quattro 9.61 36


Now I have a basic idea of how displacement changes with the number of cylinders. The coefficient of correlation is a single value between -1 and +1, and the coefficient of determination is its square. For displacement and cylinders the coefficient of correlation is 0.93, which shows a strong positive linear relationship: cars with more cylinders tend to have a larger displacement. The coefficient of determination of 0.87 means that about 87% of the variation in displacement is explained by the number of cylinders. Table 4 lists the first few squared values that enter the formula. Therefore we can conclude that the number of cylinders is an important factor to consider when selecting a new car for performance.
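
The hand formula for Pearson's r can be cross-checked against R's built-in cor(). The sketch below is not part of the original analysis: it uses the mtcars data set that ships with base R (an assumption made so that it runs without ggplot2), with cyl and disp standing in for cylinders and displacement.

```r
# mtcars ships with base R; used here only so the sketch runs on its own
x = mtcars$cyl     # number of cylinders
y = mtcars$disp    # displacement in cubic inches
n = length(x)

# r = (n*sum(xy) - sum(x)*sum(y)) /
#     sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
Numerator = n * sum(x * y) - sum(x) * sum(y)
Denominator = sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

RManual = Numerator / Denominator
RBuiltIn = cor(x, y)

print(c(manual = RManual, built_in = RBuiltIn, r_squared = RManual^2))
```

If the two values agree, the formula was typed correctly. A result outside the range -1 to +1, or one value per row instead of a single number, is a sign that brackets are misplaced.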






4. Describe the DescTools


DescTools, short for tools for descriptive statistics, was designed by Andri Signorell and colleagues. It is a large collection of statistical functions for describing data effectively. It also includes functions for working with documents in other formats, such as Microsoft Word or Microsoft PowerPoint, and in particular for retrieving data from Microsoft Excel. Its functions range from basic to advanced statistical tools; they produce summaries for different types of variables and choose a suitable graph for each, Signorell, A., Aho, K., Alfons, A., Anderegg, N., Aragon, T., Arachchige, C., … & Bolker, B. (2016). DescTools is also a quick way to compute quantities such as confidence intervals. For example, in the study by Bakouny, Z. published in 2021, the Clopper-Pearson method in DescTools was used to compute 95% confidence intervals for the relationship between cancer and Covid-19.
One of the functions I am going to use in this task is Desc, which describes a variable according to its type and plots it; for dichotomous variables, for example, it produces a dot plot with the corresponding error bars. One similar function is summary.plot, Signorell, A., Aho, K., Alfons, A., Anderegg, N., Aragon, T., Arachchige, C., … & Bolker, B. (2016).

The variables of interest here are cylinders and displacement from the mpg data set.

#Using two variables of interest
DescTools::Desc(mpg$displ)
## ------------------------------------------------------------------------------ 
## mpg$displ (numeric)
## 
##   length       n    NAs  unique    0s  mean  meanCI'
##      234     234      0      35     0  3.47    3.31
##           100.0%   0.0%          0.0%          3.64
##                                                    
##      .05     .10    .25  median   .75   .90     .95
##     1.80    2.00   2.40    3.30  4.60  5.40    5.70
##                                                    
##    range      sd  vcoef     mad   IQR  skew    kurt
##     5.40    1.29   0.37    1.33  2.20  0.44   -0.91
##                                                    
## lowest : 1.6 (5), 1.8 (14), 1.9 (3), 2.0 (21), 2.2 (6)
## highest: 6.0, 6.1, 6.2 (2), 6.5, 7.0
## 
## heap(?): remarkable frequency (9.0%) for the mode(s) (= 2)
## 
## ' 95%-CI (classic)

#The two meanCI values are the lower and upper limits of the 95% confidence interval of the mean

DescTools::Desc(mpg$cyl)
## ------------------------------------------------------------------------------ 
## mpg$cyl (integer)
## 
##   length       n    NAs  unique    0s  mean  meanCI'
##      234     234      0       4     0  5.89    5.68
##           100.0%   0.0%          0.0%          6.10
##                                                    
##      .05     .10    .25  median   .75   .90     .95
##     4.00    4.00   4.00    6.00  8.00  8.00    8.00
##                                                    
##    range      sd  vcoef     mad   IQR  skew    kurt
##     4.00    1.61   0.27    2.97  4.00  0.11   -1.46
##                                                    
## 
##    value  freq   perc  cumfreq  cumperc
## 1      4    81  34.6%       81    34.6%
## 2      5     4   1.7%       85    36.3%
## 3      6    79  33.8%      164    70.1%
## 4      8    70  29.9%      234   100.0%
## 
## ' 95%-CI (classic)



Tables :
Desc created summaries and graphs for the two variables listed above. The first table contains a short summary of every measure for displacement in a two-dimensional layout. The second table, for cylinders, contains the same frequencies that are obtained in task 2.8, and shows that a large number of cars have a 4-cylinder or 6-cylinder configuration.
Figures :
Figure 1 shows three types of graph: a histogram with no gaps between the bars, a box plot of the data, and a line graph of cumulative frequency, which rises as the values increase. Figure 2 also shows a box plot, in which the data are spread fairly evenly between the upper and lower quartiles, and a line graph of cumulative frequency, which rises as displacement increases.






5. Linear Regression


Suppose we want to know how the number of cylinders affects displacement so that we can predict it. In this task we will fit a linear regression of displacement on cylinders.

DisSum = sum(Displacement)
CylnSum = sum(Cylinders)

#Using lm to fit the model: Displacement is the response and Cylinders is the predictor

RegAnalysis = lm(Displacement~Cylinders) #values for regression formula

#Using summary on the fitted model to get slope and intercept
SummaryL = summary(RegAnalysis)


#Y = a+bx
#therefore a is the intercept and b is the slope
#the values below are copied from the coefficients of RegAnalysis, rounded to four decimals

Intercept = -0.9199
Slope = 0.7458

#Calculating predicted values
YPredict = Intercept+Slope*Cylinders



The formula for the linear regression model is Y = a+bx, where Y is the dependent variable, a is the intercept, b is the slope and x is the independent variable. In our case, Predicted_value = Intercept+Slope*Cylinders. With an intercept of -0.9199, a slope of 0.7458 and Cylinders as the x variable, the first five predicted values are 2.0633, 2.0633, 2.0633, 2.0633 and 3.5549.
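
The intercept and slope returned by lm() can also be reproduced by hand from the least-squares formulas. This is a minimal sketch on the base R mtcars data (an assumption, chosen so the code runs without ggplot2); the names a, b and Fit are mine.

```r
x = mtcars$cyl     # independent variable: cylinders
y = mtcars$disp    # dependent variable: displacement

# Least-squares estimates for Y = a + bx
b = cov(x, y) / var(x)        # slope
a = mean(y) - b * mean(x)     # intercept

Fit = lm(y ~ x)

# Both rows should show the same intercept and slope
print(rbind(by_hand = c(a, b), lm = unname(coef(Fit))))
```

The same check can be run on Displacement and Cylinders to confirm the values typed in for Intercept and Slope.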






6. Scatter Plot


Suppose we need to know whether the number of cylinders can predict the displacement. To see this, we plot the observed values on a scatter plot together with the regression line. A detailed explanation follows in the observations below.

RegPlot = plot(Displacement~Cylinders,
               main = " Figure 1 - Regression Model of Displacement vs Cylinders",
     col= brewer.pal(6, "Set1"),
     pch = 20, 
     xlim = c(0,11),
     ylim = c(0,9))

abline(RegAnalysis, col = "#99004C", lty = 1, lwd = 2)



The scatter plot above shows Displacement, a continuous variable, as the dependent variable.
The independent variable is Cylinders, a multi-valued discrete variable.
There is a strong positive correlation between cylinders and displacement, so the linear model fits the data well.
As the number of cylinders increases, the displacement increases.
The strong relationship between the two variables suggests that displacement can be predicted from the number of cylinders.
Many other variables also affect the performance of a vehicle, such as the quality of the tires, driving style and road conditions.

Legend:
Regression line is marked by a purple line.
Dots represent the observed values.






7. Predicted values and residuals


In the task above we found the predicted values. Now we will also retrieve the residuals, which measure the error made when predicting displacement from the number of cylinders.

#linear regression formula

Data = data.frame( Model, Cylinders, Displacement )

#The predictor is Cylinders, and the residual is the observed value minus the predicted value
Table2.7 = Data %>% 
mutate(YPredict = Intercept + Slope * Cylinders, Residual = Displacement-YPredict)

#First 20 observations

Table2.7_1 = head(Table2.7, 20)


#Table -1
 
Table2.7_1 %>%
  kable(caption = "Table 5 - Predicted and Residual values of Displacement vs Cylinders") %>% 
kableExtra:: kable_paper()
Table 5 - Predicted and Residual values of Displacement vs Cylinders
Model Cylinders Displacement YPredict Residual
a4 4 1.8 2.0633 -0.2633
a4 4 1.8 2.0633 -0.2633
a4 4 2.0 2.0633 -0.0633
a4 4 2.0 2.0633 -0.0633
a4 6 2.8 3.5549 -0.7549
a4 6 2.8 3.5549 -0.7549
a4 6 3.1 3.5549 -0.4549
a4 quattro 4 1.8 2.0633 -0.2633
a4 quattro 4 1.8 2.0633 -0.2633
a4 quattro 4 2.0 2.0633 -0.0633
a4 quattro 4 2.0 2.0633 -0.0633
a4 quattro 6 2.8 3.5549 -0.7549
a4 quattro 6 2.8 3.5549 -0.7549
a4 quattro 6 3.1 3.5549 -0.4549
a4 quattro 6 3.1 3.5549 -0.4549
a6 quattro 6 2.8 3.5549 -0.7549
a6 quattro 6 3.1 3.5549 -0.4549
a6 quattro 8 4.2 5.0465 -0.8465
c1500 suburban 2wd 8 5.3 5.0465 0.2535
c1500 suburban 2wd 8 5.3 5.0465 0.2535



The table above shows the predicted displacement for each car. Because the slope is 0.7458, the predicted displacement increases by 0.7458 litres for every additional cylinder. For the first car, with 4 cylinders, the observed displacement is 1.8 and the predicted value is 2.0633, so the residual is 1.8 - 2.0633 = -0.2633: the model overestimates the displacement of this car by 0.2633 litres. Residuals are important for checking how reliable the predictions are; the smaller they are in absolute value, the closer the prediction is to the observed value. The same applies to the rest of the values.
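
Instead of typing the coefficients in by hand, the predicted values and residuals can be taken straight from the fitted model. The sketch below is an illustration on the base R mtcars data (an assumption so that it runs without ggplot2); the data frame name Check is mine.

```r
Fit = lm(disp ~ cyl, data = mtcars)

Check = data.frame(
  Cylinders = mtcars$cyl,
  Observed = mtcars$disp,
  Predicted = fitted(Fit),     # intercept + slope * cylinders
  Residual = residuals(Fit)    # observed - predicted
)

# Least-squares residuals sum to (numerically) zero
print(head(Check, 5))
print(sum(Check$Residual))
```

A quick sanity check on any residual table is that observed minus predicted reproduces the residual column, and that positive and negative residuals roughly balance out.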






8. Frequency and percentage table


I will obtain the frequency of cars for each number of cylinders, together with the cumulative frequency, percentage and cumulative percentage.

CarCyl = table(CarInfo$cyl)
CarFrame = as.data.frame(CarCyl)


names(CarFrame)[names(CarFrame) == "Var1"] = "Number of Cylinders"
names(CarFrame)[names(CarFrame) == "Freq"] = "Frequency_of_Cars"


CarFrame = mutate(CarFrame,
  Cumulative_Frequency = cumsum (CarFrame$Frequency_of_Cars), 
  Percentage = round ((Frequency_of_Cars/sum(Frequency_of_Cars)) * 100, 2), 
  Cumulative_Percentage = cumsum (Percentage))

knitr::kable(CarFrame, caption = "Table 6 - Cumulative Frequency and Percentage relative to number of Cylinders")%>%
  kableExtra::kable_paper()
Table 6 - Cumulative Frequency and Percentage relative to number of Cylinders
Number of Cylinders Frequency_of_Cars Cumulative_Frequency Percentage Cumulative_Percentage
4 81 81 34.62 34.62
5 4 85 1.71 36.33
6 79 164 33.76 70.09
8 70 234 29.91 100.00



The overall observations for this table are given in task 9.






9. Frequency and percentage Plot


We will use the data from the above table to plot four graphs.

par (mfrow=c(2,2))

# First 

CarFreq = barplot(
  CarFrame$Frequency_of_Cars,
  col = brewer.pal(4,"Dark2"),
  cex.names = 0.7, 
  ylim = c(0,100),
  names.arg = CarFrame$`Number of Cylinders`,
  main = "Cars - Frequencies",
  xlab = "Number of Cylinders", 
  ylab = "Frequency of Cars"
)
text(
  y = CarFrame$Frequency_of_Cars,
  CarFreq,
  CarFrame$Frequency_of_Cars,
  cex = 0.6,
  pos = 3
)
box(which = "plot", col = "red")
box(which = "figure", col = "green")

#Second 

CarCumFreq = barplot(
  CarFrame$Cumulative_Frequency,
  col= brewer.pal(4, "Dark2"),
  cex.names = 0.7, 
  ylim = c(0,280),
  names.arg = CarFrame$`Number of Cylinders`,
  main = "Cumulative Frequencies of Cars",
  xlab = "Number of Cylinders",
  ylab = "Cumulative Frequency of Cars"
)
text(
  y = CarFrame$Cumulative_Frequency,
  CarCumFreq,
  CarFrame$Cumulative_Frequency,
  cex = 0.6,pos = 3
)

box(which = "plot", col = "red")

#Third

CarPercent = barplot(
  CarFrame$Percentage,
  col= brewer.pal(4, "Set3"),
  cex.names = 0.7, 
  ylim = c(0,50),
  names.arg = CarFrame$`Number of Cylinders`,
  main = "Percentages of Cars",
  xlab = "Number of Cylinders",
  ylab = "Percentage of Cars"
)
text(
  y = CarFrame$Percentage,
  CarPercent,
  CarFrame$Percentage,
  cex = 0.6,pos = 3
)

box(which = "plot", col = "green")

#Fourth

CarCumPercent = barplot(
  CarFrame$Cumulative_Percentage,
  col= heat.colors(8),
  cex.names = 0.7, 
  ylim = c(0,120),
  names.arg = CarFrame$`Number of Cylinders`,
  main = "Cumulative Percent of Cars",
  xlab = "Number of Cylinders",
  ylab = "Cumulative Percentage of Cars"
)
text(
  y = CarFrame$Cumulative_Percentage,
  CarCumPercent,
  CarFrame$Cumulative_Percentage,
  cex = 0.6,
  pos = 3
)

box(which = "figure")


The four graphs above are drawn from the table in task 8 and display the frequencies, cumulative frequencies, percentages and cumulative percentages of cars by number of cylinders. The largest groups are the cars with 4 and 6 cylinders. Note that the last cumulative frequency equals the number of observations in the data set (234), because each bar adds its own frequency to the total of the bars before it; this is also called the less-than cumulative frequency. In the same way, the cumulative percentage reaches 100% at the last bar. The bar plots also show what percentage of cars have 4, 5, 6 and 8 cylinders. About 34.62 percent of cars have a 4-cylinder engine, which gives the automobile industry an idea of which cylinder configurations are common and which are rare.






10. Making predictions from the above values

Predictions if the car has 2 and 10 cylinders:
We will use the following formula: Y = a+bx

#as the formula stated above lets define the given values

X1 = 2 #Cylinders
X2 = 10 #Cylinders 

#therefore

Intercept = -0.9199
Slope = 0.7458

#What will be displacement if number of cylinders is equal to 2

PredictDispl2 = round(Intercept + Slope * X1,2) 

#What will be displacement if number of cylinders is equal to 10

PredictDispl10 = round(Intercept + Slope * X2,2)

#Table
Objects10  = c(PredictDispl2 ,PredictDispl10)
Task10table1 = matrix(data = c(Objects10),ncol = 2,byrow = F)
colnames(Task10table1) = c("Prediction for 2 Cylinder", "Prediction for 10 Cylinder")
Matrix10 =  knitr::kable(Task10table1, caption = "Table 7 - Prediction values for 2 and 10 cylinder engine") %>% 
  kableExtra::kable_paper()
Matrix10
Table 7 - Prediction values for 2 and 10 cylinder engine
Prediction for 2 Cylinder Prediction for 10 Cylinder
0.57 6.54



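These values are predicted displacements, not increases: the model predicts a displacement of 0.57 litres for a car with 2 cylinders and 6.54 litres for a car with 10 cylinders. Both cylinder counts lie outside the range of 4 to 8 cylinders in the data set, so these predictions are extrapolations and should be treated with caution.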
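
The same kind of prediction can be made with predict(), which also makes it easy to flag values that fall outside the range of the data. The sketch below is an illustration on the base R mtcars data (an assumption so that it runs without ggplot2); NewCars and Extrapolated are names I made up.

```r
Fit = lm(disp ~ cyl, data = mtcars)
NewCars = data.frame(cyl = c(2, 10))

# predict() applies the fitted intercept and slope to the new cylinder counts
Predicted = predict(Fit, newdata = NewCars)

# Flag cylinder counts outside the range used to fit the model
Extrapolated = NewCars$cyl < min(mtcars$cyl) | NewCars$cyl > max(mtcars$cyl)

print(data.frame(NewCars, Predicted, Extrapolated))
```

The column name in newdata has to match the predictor name used in the model formula, otherwise predict() cannot find the variable.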






Section 3 - Multiple Regression




1. Making predictions in multiple regression model


Assume we need to predict systolic blood pressure for patients of various ages and weights, so that drug doses can be prescribed appropriately based on an estimate of each patient's systolic blood pressure.

#Data Set
PatientID = c("PK01","PK02","PK03","PK04","PK05","PK06","PK07","PK08","PK09","PK10","PK11","PK12","PK13","PK14","PK15")
SystolicBP =  c(112,156,125,145,155,162,139,144,153,126,169,132,143,153,162)
AGE =  c(45,60,55,60,62,71,57,59,64,42,75,52,59,67,73)
Weight =  c(135,182,148,182,190,232,194,182,217,171,225,173,184,194,211)

#Creating Objects and tables
Objects3.1  = c(PatientID, SystolicBP, AGE, Weight)
Task1table3 = matrix(data = c(Objects3.1),ncol = 4,byrow = F)
colnames(Task1table3) = c("Patient ID", "Systolic BP", "Age of the patient (In Years)", "Weight of the Patient (In Pounds)" )
Matrix3.1 =  knitr::kable(Task1table3, caption = "Table 8 - Patient Details")%>% 
  kableExtra:: kable_paper()

#Calculating correlations:

BpAge = round(cor(SystolicBP,AGE),2)
BPWeight = round(cor(Weight,SystolicBP),2)

DetBPAge = round(BpAge^2, 2)
DetBPWeight = round(BPWeight^2, 2)

Matrix3.1.2 = matrix(data = c(BpAge, BPWeight, DetBPAge, DetBPWeight),nrow = 4,ncol = 1, byrow = T)
rownames(Matrix3.1.2) = c("Correlation between BP and Age", "Correlation between BP and Weight", "Determination of BP and Age", "Determination between BP and Weight")
MatrixTable1 = knitr::kable(Matrix3.1.2, caption = "Table 9 - Correlation and Determination between Systolic BP, Age and Weight")%>%  
kableExtra:: kable_paper()


#Using F-test for the regression model

Reg = lm(SystolicBP ~ AGE+Weight)
#summary(Reg)

#Multiple R squared taken from summary(Reg)
RegMulti = round(0.9059 , 2)

#Using R squared rounded to 0.906 in the F-test formula
#F = (R^2/k) / ((1-R^2)/(n-k-1)), with n = 15 patients and k = 2 predictors
n= 15
k = 2
alpha = 0.05

FTest3.1 = round((0.906/k) / ((1-0.906) / (n-k-1)), 2)

#Critical value from the F distribution with k and n-k-1 degrees of freedom
CrticalV2 = round(qf(1-alpha, k, n-k-1),2)

#Stating null hypothesis and alternative hypothesis:
#H0 = There is no relationship between systolic blood pressure and the two given variables.
#H1 = There is significant relationship between systolic blood pressure and the two given variables.

#Therefore testing for hypothesis using critical value approach
Hypothesis2 = ifelse(FTest3.1 > CrticalV2 ,"Reject H0", "Fail To Reject H0")



Hypotable1 = matrix(data = c(Hypothesis2),nrow =1 ,ncol = 1, byrow = T)
rownames(Hypotable1) = c("F-Test > Critical Value = TRUE")
Hypo1 = knitr::kable(Hypotable1)%>%  
kableExtra:: kable_paper()



Intercept3.1  = 39.1575
SlopeAge = 0.9824
SlopeWeight = 0.2495

#Systolic blood pressure of a patient of 30 years of age and weight of 148:

SystolicBp1 =  round((Intercept3.1 + SlopeAge * 30 + SlopeWeight* 148),2) 

#Systolic blood pressure of a patient of 75 years of age and weight of 196:

SystolicBp2 =  round((Intercept3.1 + SlopeAge * 75 + SlopeWeight* 196),2)



Matrix3.1.3 = matrix(data = c(RegMulti, FTest3.1, CrticalV2, SystolicBp1, SystolicBp2),nrow =5 ,ncol = 1, byrow = T)
rownames(Matrix3.1.3) = c("Multiple R Squared", "F-Test Value", "Critical Value", "Predicted Systolic BP of Age 30 yrs and Weight 148", "Predicted Systolic BP of Age 75 yrs and Weight 196")
Matrix3.1.4 = knitr::kable(Matrix3.1.3, caption = "Table 10 - Multiple R squared, F-test, critical value and predicted systolic BP")%>%  
kableExtra:: kable_paper()

Matrix3.1
Table 8 - Patient Details
Patient ID Systolic BP Age of the patient (In Years) Weight of the Patient (In Pounds)
PK01 112 45 135
PK02 156 60 182
PK03 125 55 148
PK04 145 60 182
PK05 155 62 190
PK06 162 71 232
PK07 139 57 194
PK08 144 59 182
PK09 153 64 217
PK10 126 42 171
PK11 169 75 225
PK12 132 52 173
PK13 143 59 184
PK14 153 67 194
PK15 162 73 211
MatrixTable1
Table 9 - Correlation and Determination between Systolic BP, Age and Weight
Correlation between BP and Age 0.92
Correlation between BP and Weight 0.90
Determination of BP and Age 0.85
Determination between BP and Weight 0.81
Matrix3.1.4
Table 10 - Multiple R squared, F-test, critical value and predicted systolic BP
Multiple R Squared 0.91
F-Test Value 57.83
Critical Value 3.89
Predicted Systolic BP of Age 30 yrs and Weight 148 105.56
Predicted Systolic BP of Age 75 yrs and Weight 196 161.74
Hypo1 
F-Test > Critical Value = TRUE Reject H0



In the above task we ran several tests to determine whether the systolic blood pressure of a patient can be predicted from the patient's age and weight. The conclusion is yes: the multiple regression explains about 91% of the variation in systolic blood pressure, and the predicted values are given in Table 10. The hypothesis test used the critical value approach: the F-test value is greater than the critical value, so we reject the null hypothesis and conclude that there is a significant relationship between systolic blood pressure and the two variables. Both slopes are positive, so systolic blood pressure increases as age increases and as weight increases. Because the model has a non-zero intercept, the relationship is linear rather than strictly proportional.
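The overall F-test described above can also be read directly from the fitted model instead of being typed in by hand. The following is a minimal sketch, assuming the patient values printed in Table 8; the names SumReg, Fstat, Fcrit, Pval and Decision are illustrative and do not appear in the original Rmd file.

```r
# Patient data copied from Table 8
SystolicBP = c(112,156,125,145,155,162,139,144,153,126,169,132,143,153,162)
AGE        = c(45,60,55,60,62,71,57,59,64,42,75,52,59,67,73)
Weight     = c(135,182,148,182,190,232,194,182,217,171,225,173,184,194,211)

Reg    = lm(SystolicBP ~ AGE + Weight)
SumReg = summary(Reg)

# fstatistic is a named vector: value, numdf (k) and dendf (n - k - 1)
Fstat = SumReg$fstatistic

# Critical value and p-value from the F distribution with the model's own degrees of freedom
Fcrit = qf(1 - 0.05, df1 = Fstat[["numdf"]], df2 = Fstat[["dendf"]])
Pval  = pf(Fstat[["value"]], Fstat[["numdf"]], Fstat[["dendf"]], lower.tail = FALSE)

Decision = ifelse(Fstat[["value"]] > Fcrit, "Reject H0", "Fail To Reject H0")
```

Because the statistic comes from the unrounded R-squared, it can differ slightly in the decimals from a hand calculation that uses a rounded R-squared.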






2. Scatter Plots


I will plot a scatter plot of systolic blood pressure against each of the two predictor variables, age and weight.

par(mfrow =c(1,2))

#Calculating and plotting regression analysis for Systolic Blood Pressure in regards to Age

RegAGE = lm(SystolicBP  ~  AGE) #Here AGE is our independent variable since Systolic Bp can be affected by age.

AgePlot = plot(AGE,SystolicBP,
                main = "Age and Sys BP",
                col= brewer.pal(8, "Accent"),
                pch = 19,
                ylim= c(110,180),
                xlim = c(35,85))

abline(lm(SystolicBP  ~  AGE), col = "#37920F", lty = 1, lwd = 2)

box(which = "figure", col="red")


#Calculating and plotting regression analysis for Systolic Blood Pressure in regards to Weight
RegWeight = lm(SystolicBP~Weight) #Here weight is our independent variable since obesity leads to Hypertension.

WeightPlot = plot(Weight,SystolicBP,
     col= brewer.pal(8, "Dark2"),
     main = "Weight and Sys BP",
     pch = 19,
     ylim = c(110,180),
     xlim = c(135,250)
  )

abline(lm(SystolicBP ~ Weight), col = "#800F92", lty = 1, lwd = 2)

box(which = "figure", col="green")



The two figures above show systolic BP against age (first graph) and systolic BP against weight (second graph). In both graphs systolic blood pressure is the dependent variable and age or weight is the independent variable; all three are continuous variables.
In the first graph the observed values lie close to the regression line, so the residuals are small and the line is a good fit: systolic blood pressure rises with the age of the patient (correlation 0.92, Table 9). The second graph shows the same pattern for weight (correlation 0.90, Table 9), with the observed values again close to the line.
Both relationships are therefore strong and positive. As either the age or the weight of a person increases, systolic blood pressure tends to increase, and each variable on its own can be used to predict systolic blood pressure.

Legend:
The regression line is green in the age plot and purple in the weight plot.
Dots represent the observed values.
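The strength of the two relationships shown in the scatter plots can be checked numerically. This is a small sketch, assuming the Table 8 values; cor.test() is used here as one possible way to attach a significance test to each correlation, and the object names are illustrative.

```r
# Patient data copied from Table 8
SystolicBP = c(112,156,125,145,155,162,139,144,153,126,169,132,143,153,162)
AGE        = c(45,60,55,60,62,71,57,59,64,42,75,52,59,67,73)
Weight     = c(135,182,148,182,190,232,194,182,217,171,225,173,184,194,211)

# Pearson correlation with a t-test of H0: correlation = 0
AgeTest    = cor.test(SystolicBP, AGE)
WeightTest = cor.test(SystolicBP, Weight)

rAge    = unname(AgeTest$estimate)
rWeight = unname(WeightTest$estimate)

# Coefficient of determination for each simple regression
DetAge    = rAge^2
DetWeight = rWeight^2

round(c(rAge = rAge, rWeight = rWeight, DetAge = DetAge, DetWeight = DetWeight), 2)
```

Squaring the unrounded correlation avoids the small rounding error that comes from squaring a value already rounded to two decimals.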

Overall:

  1. Predicted value for Y (systolic BP of a patient aged 30 years and weighing 148 pounds), where x1 is age and x2 is weight: a = 39.1575, b1 = 0.9824, b2 = 0.2495, x1 = 30, x2 = 148. Therefore, with the formula Y = a + b1x1 + b2x2 = 105.56 mmHg.
  2. Multiple R-squared = 0.91.
  3. The F-test value, F = (R^2/k) / ((1 - R^2)/(n - k - 1)) with R^2 = 0.906, k = 2 predictors and n = 15 patients, is 57.83.
  4. The critical value of F with 2 and 12 degrees of freedom at alpha = 0.05 is 3.89.
  5. Since systolic BP increases as age and weight increase, the relationship can be called positive.
  6. The null hypothesis states that there is no relationship between systolic BP and the two variables.
  7. Since 57.83 > 3.89 we reject the null hypothesis and conclude that there is enough evidence of a relationship between systolic BP and the two variables.
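The prediction formula Y = a + b1x1 + b2x2 can be applied with predict() so that the coefficients do not have to be copied by hand. A minimal sketch, assuming the Table 8 values; the names Patients, NewPatients, Predicted and OutsideRange are hypothetical helpers introduced only for this illustration.

```r
# Patient data copied from Table 8
Patients = data.frame(
  SystolicBP = c(112,156,125,145,155,162,139,144,153,126,169,132,143,153,162),
  AGE        = c(45,60,55,60,62,71,57,59,64,42,75,52,59,67,73),
  Weight     = c(135,182,148,182,190,232,194,182,217,171,225,173,184,194,211)
)

Reg = lm(SystolicBP ~ AGE + Weight, data = Patients)

# The two patients used in the report
NewPatients = data.frame(AGE = c(30, 75), Weight = c(148, 196))
Predicted   = round(predict(Reg, newdata = NewPatients), 2)

# Flag predictions that extrapolate beyond the observed ages
OutsideRange = NewPatients$AGE < min(Patients$AGE) | NewPatients$AGE > max(Patients$AGE)
```

predict() uses the unrounded coefficients, so its results can differ in the second decimal from the hand calculation with a = 39.1575, b1 = 0.9824 and b2 = 0.2495. The youngest patient in the data is 42, so the prediction for age 30 is an extrapolation and should be read with more caution than the one for age 75.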






3. Predicted values and Residuals


The scatter plots above show the observed values and the regression line. We now need the predicted values from each of the two variables, and their residuals, to see how reliably systolic blood pressure can be predicted.

Object3.2  = matrix(c(PatientID,AGE,Weight,SystolicBP),nrow = 15,ncol = 4,byrow = F)  
colnames(Object3.2) = c("Patient ID","Age of Patient","Weight of Patient","Systolic Blood Pressure")

#Creating dataframe to incorporate table
Table3 = as.data.frame(Object3.2)

#Using linear regression code 
AgeBP  = lm(SystolicBP ~ AGE)
SumAgeBP  = summary(AgeBP) #to get multiple R squared value.

#To get selected coefficients 
AgeIntercept2 = SumAgeBP$coefficients[1] #to get the estimated intercept
AgeSlope2 = SumAgeBP$coefficients[2,1] # to get the slope of the first variable AGE

WeightBP  = lm(SystolicBP ~ Weight)
SumWeightBP  = summary(WeightBP)

WeightIntercept2 = SumWeightBP$coefficients[1]
SlopeWeight2 = SumWeightBP$coefficients[2,1] # to get the slope of the first variable Weight

Matrix3.3  = Table3 %>%
           mutate(Predicted_by_Age = AgeIntercept2 + AgeSlope2 * AGE,
                  Predicted_by_Weight = WeightIntercept2 + SlopeWeight2 * Weight,
                  #Residual = observed systolic BP minus predicted systolic BP
                  Residual_by_Age = SystolicBP - Predicted_by_Age,
                  Residual_by_Weight = SystolicBP - Predicted_by_Weight)

knitr::kable(Matrix3.3) %>% 
  kableExtra::kable_paper()
Patient ID  Age of Patient  Weight of Patient  Systolic Blood Pressure  Predicted_by_Age  Predicted_by_Weight  Residual_by_Age  Residual_by_Weight
PK01        45              135                112                      121.3848          116.3651             -9.3848          -4.3651
PK02        60              182                156                      144.9619          141.8174             11.0381          14.1826
PK03        55              148                125                      137.1028          123.4051             -12.1028         1.5949
PK04        60              182                145                      144.9619          141.8174             0.0381           3.1826
PK05        62              190                155                      148.1055          146.1497             6.8945           8.8503
PK06        71              232                162                      162.2518          168.8944             -0.2518          -6.8944
PK07        57              194                139                      140.2465          148.3159             -1.2465          -9.3159
PK08        59              182                144                      143.3901          141.8174             0.6099           2.1826
PK09        64              217                153                      151.2491          160.7713             1.7509           -7.7713
PK10        42              171                126                      116.6694          135.8605             9.3306           -9.8605
PK11        75              225                169                      168.5390          165.1036             0.4610           3.8964
PK12        52              173                132                      132.3874          136.9436             -0.3874          -4.9436
PK13        59              184                143                      143.3901          142.9005             -0.3901          0.0995
PK14        67              194                153                      155.9645          148.3159             -2.9645          4.6841
PK15        73              211                162                      165.3954          157.5221             -3.3954          4.4779



In the above table, take the first patient: for an age of 45 years and a weight of 135 pounds the observed systolic blood pressure is 112. The value predicted from age is 121.38, giving a residual of 112 - 121.38 = -9.38, and the value predicted from weight is 116.37, giving a residual of -4.37. A negative residual means the model predicted a higher pressure than was observed. Across all 15 patients the residuals from age lie between -12.10 and 11.04 and the residuals from weight lie between -9.86 and 14.18, which is small compared with observed pressures of 112 to 169. Both age and weight therefore give reasonably reliable predictions of systolic blood pressure.
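A residual is the observed response minus the fitted response, so for these models it is always an observed systolic BP minus a predicted systolic BP. The sketch below, assuming the Table 8 values, takes fitted values and residuals from the model objects with fitted() and resid(); the column names mirror the table above, while ResidTable is an illustrative name.

```r
# Patient data copied from Table 8
SystolicBP = c(112,156,125,145,155,162,139,144,153,126,169,132,143,153,162)
AGE        = c(45,60,55,60,62,71,57,59,64,42,75,52,59,67,73)
Weight     = c(135,182,148,182,190,232,194,182,217,171,225,173,184,194,211)

AgeBP    = lm(SystolicBP ~ AGE)
WeightBP = lm(SystolicBP ~ Weight)

ResidTable = data.frame(
  SystolicBP          = SystolicBP,
  Predicted_by_Age    = unname(fitted(AgeBP)),
  Predicted_by_Weight = unname(fitted(WeightBP)),
  Residual_by_Age     = unname(resid(AgeBP)),     # SystolicBP - Predicted_by_Age
  Residual_by_Weight  = unname(resid(WeightBP))   # SystolicBP - Predicted_by_Weight
)

# Least-squares residuals of a model with an intercept sum to (numerically) zero
sum(ResidTable$Residual_by_Age)
sum(ResidTable$Residual_by_Weight)
```

Checking that each residual column sums to approximately zero is a quick way to catch a residual that was computed against the wrong variable.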






4. Residuals Plots


Residuals are a good statistical tool for judging the accuracy or reliability of the predictions that one variable makes about another.

par(mfrow = c(1,2))


#For Age

ResidualAge = c(Matrix3.3$Residual_by_Age)

Ageplot2  = plot(AGE,ResidualAge,
            main = "The Residuals vs Age Values",
            xlab = "Age of Patient (Years)",
            ylab = "Residuals (Age model)",
            col = brewer.pal(8,"Set3"),
            pch = 19)

box(which = "figure", col = "darkred")



#For Weight

ResidualWeight = c(Matrix3.3$Residual_by_Weight)

WeightPlot2 = plot(Weight, ResidualWeight,
            main = "The Residuals vs Weight Values",
            ylab = "Residuals (Weight model)",
            xlab = "Weight in Pounds",
            pch = 19,
            col = brewer.pal(8,"Dark2"))

box(which = "figure", col = "brown")



The two figures above plot the residuals of each simple regression against its predictor: the first shows the residuals of the age model against the observed age, and the second shows the residuals of the weight model against the observed weight. In both plots age and weight are the independent variables on the x-axis and the residual is on the y-axis; all of these are continuous variables.
A residual plot supports a linear model when the points scatter randomly above and below zero, with no curve and no funnel shape, and small residuals mean that the predictions are close to the observed values. Table 9 shows strong correlations with systolic blood pressure for both age (0.92) and weight (0.90), so the residual plots are used here to check how well each straight line fits rather than to rule either predictor out.

Legend:
Dots represent the observed values.
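A horizontal reference line at zero makes a residual plot easier to read, and the residual standard error gives one number for the typical size of a prediction error. The following is a sketch under the assumption that the Table 8 values are used; AgeModel, WeightModel, SigmaAge and SigmaWeight are illustrative names, and base graphics are used so that no extra packages are needed.

```r
# Patient data copied from Table 8
SystolicBP = c(112,156,125,145,155,162,139,144,153,126,169,132,143,153,162)
AGE        = c(45,60,55,60,62,71,57,59,64,42,75,52,59,67,73)
Weight     = c(135,182,148,182,190,232,194,182,217,171,225,173,184,194,211)

AgeModel    = lm(SystolicBP ~ AGE)
WeightModel = lm(SystolicBP ~ Weight)

par(mfrow = c(1, 2))

plot(AGE, resid(AgeModel),
     main = "Residuals vs Age", xlab = "Age of Patient (Years)",
     ylab = "Residual (mmHg)", pch = 19)
abline(h = 0, lty = 2)   # points should scatter on both sides of this line

plot(Weight, resid(WeightModel),
     main = "Residuals vs Weight", xlab = "Weight in Pounds",
     ylab = "Residual (mmHg)", pch = 19)
abline(h = 0, lty = 2)

# Residual standard error of each simple regression
SigmaAge    = summary(AgeModel)$sigma
SigmaWeight = summary(WeightModel)$sigma
```

On the Table 8 data the age model gives the smaller residual standard error, which agrees with its slightly higher correlation in Table 9.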






Conclusion

This conclusion makes an overall observation and compares one task with another: their similarities, the challenges encountered, and what their values meant in terms of predicting the outcome. This project uses the MPG dataset, which is part of an R package called ggplot2. I have described the dataset thoroughly and used it for different calculations which are a part of correlation and regression. The project is made up of two sections. One is for simple linear regression, which generally helps us to identify the relationship between two continuous variables, Zou, K. H., Tuncali, K., & Silverman, S. G. (2003). The other is for the multiple regression model, which helps us to identify the relationship between two independent variables and one dependent variable, Tranmer, M., & Elliot, M. (2008). For simple linear regression this project covers 9 tasks and for multiple regression it covers 4 tasks.
Calculations for simple linear regression were done using two variables, the displacement of a car and the number of cylinders in a car. Our car collection included some of the top car manufacturers with high-performance engines, including 4, 5, 6 and 8 cylinders and heavy displacement. I have analyzed the data of these two variables using different tables and graphs: a table of descriptive statistics for the two variables, graphs such as a bar plot, line graph and boxplot using DescTools, and a linear regression graph, also called an X and Y graph or scatter plot. To construct these graphs, formulas were applied in the code, different frequencies were obtained, and they were plotted on several graphs. Calculations for the multiple regression model were also done using tables and graphs, but with different methods from the simple regression model: the slope, intercept and R squared value were calculated from the summary of the fitted model in R, and along with the scatter plot a residual plot was generated, which showed us how reliable the prediction made by the independent variable is for the dependent variable.

Recommendations:

Simple Linear Regression
If this dataset were being used by a car company, several of the analyzed outcomes could help the automobile industry to build effective and efficient cars. Let's assume a manufacturer has analyzed this data set and wants to know whether increasing the number of cylinders produces a significant change in the displacement of the car. Having stated the alternative hypothesis, they can test it using the above techniques; from the data they then know what the effect on displacement will be and which factors are reliable enough to predict values for future reference. After analyzing the data, I would recommend that a car company seeking more power increase the number of cylinders, since this also increases displacement, and higher displacement gives higher power. The effect on fuel economy was not modelled in this project and would need to be checked against the city and highway mileage columns of the dataset before any claim about fuel savings is made.

Multiple Linear Regression.
Imagine this multiple regression analysis were carried out by the healthcare industry: it gives good evidence that the weight of a patient is strongly related to increased systolic blood pressure. Age is also strongly correlated with systolic blood pressure in this data (0.92, Table 9), but unlike weight it cannot be modified. So, the recommendations from this analysis are: lifestyle modifications should be encouraged for all obese patients, since they carry a high risk of developing hypertension, and healthcare awareness should be promoted in hospitals by sharing this kind of analysis across the healthcare industry to show that weight is associated with systolic blood pressure.






References

Azoulay, L., & Suissa, S. (2017). Sulfonylureas and the risks of cardiovascular events and death: a methodological meta-regression analysis of the observational studies. Diabetes care, 40(5), 706-714.

Bakouny, Z., Paciotti, M., Schmidt, A. L., Lipsitz, S. R., Choueiri, T. K., & Trinh, Q. D. (2021). Cancer screening tests and cancer diagnoses during the COVID-19 pandemic. JAMA oncology, 7(3), 458-460.

Bluman, A. G. (2009). Elementary statistics: A step by step approach. New York: McGraw-Hill Higher Education.

Cooper, A. (2012). What is analytics? Definition and essential characteristics. CETIS Analytics Series, 1(5), 1-10.

Denis, D. (2001). The origins of correlation and regression: Francis Galton or Auguste Bravais and the error theorists. History and Philosophy of Psychology Bulletin, 13(2), 36-44.

Hogg, R. V. (1972). More light on the kurtosis and related statistics. Journal of the American Statistical Association, 67(338), 422-424.

Javanmard, A., & Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1), 2869-2909.

Kabacoff, R. I. (2015). R in Action (2nd ed.). Manning Publications.

Miller, D., & Millar, I. (1996). The Cambridge dictionary of scientists. Cambridge University Press: Cambridge.

Pearson, K. (1920). Notes on the history of correlation. Biometrika,13, 25-45.

Ribeiro, M. L., Martins, T., Angélico, R. A., Vandepitte, D., & Tita, V. (2012). Progressive failure analysis of low energy impact in carbon fiber filament winding cylinders. In 10th World Congress on Computational Mechanics–WCCM.

Signorell, A., Aho, K., Alfons, A., Anderegg, N., Aragon, T., Arachchige, C., … & Bolker, B. (2016). DescTools: Tools for descriptive statistics. R package version 0.99. 18. R Foundation for Statistical Computing, Vienna, Austria.

Stigler, S. M. (1986). The history of statistics: the measurement of uncertainty before 1900. The Belknap Press of Harvard University Press: Cambridge.

Tranmer, M., & Elliot, M. (2008). Multiple linear regression. The Cathie Marsh Centre for Census and Survey Research (CCSR), 5(5), 1-5.

Walker, H. M. (1929). Studies in the history of statistical methods. The Williams & Wilkins Company: Baltimore.

Xue, K., & Yao, F. (2021). Hypothesis testing in large-scale functional linear regression. Statistica Sinica, 31, 1101-1123.

Zou, K. H., Tuncali, K., & Silverman, S. G. (2003). Correlation and simple linear regression. Radiology, 227(3), 617-628.






Appendix

An Rmd file named Tiwari_FinalReport.rmd has been attached along with this HTML report.