# Setting up libraries for Project 1
library(readxl)
library(rmarkdown)
library(tidyverse)
library(dplyr)
library(magrittr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(RColorBrewer)
library(wrMisc)
library(DT)
library(MASS)
library(summarytools)
library(agricolae)

dataset_final <- mpg
ds1 <- read_excel("~/Desktop/ds1.xlsx")

ALY6010 Probability and Statistical Theory
Northeastern University
Tsarina Patnaik
Date: 17th December, 2022
Final Project Report: Correlation and Regression Analysis
Instructor: Dr. Dee Chiluiza, PhD



INTRODUCTION
1. History of Correlation:
Correlation was invented by Sir Francis Galton. Galton, a relative of Charles Darwin, accomplished a great deal: he studied medicine, traveled to Africa, wrote books on psychology and anthropology, and created visual methods for mapping the weather. Like many others of his day, Galton also tried to explain heredity. By 1889 he had begun writing "co-relation" as "correlation" and had developed a fascination with fingerprints. His 1890 account of how he developed correlation would be his final substantial study on the topic.

Galton's friend and coworker Karl Pearson (also the father of Egon Pearson) pursued the development of correlation with such energy that the metric r, which Pearson termed the Galton coefficient of reversion and Galton dubbed the index of correlation, is now known as Pearson's r.
Correlation Coefficient: Correlation coefficients quantify the strength of the link or relationship between two variables. Human body mass and height, or a home's value and size, are well-known examples of correlated variables.

The Pearson correlation coefficient is one of the most popular correlation coefficients (usually denoted by r).
Coefficient of Determination: The coefficient of determination, in contrast to the Pearson correlation coefficient, assesses how closely the predicted values correspond with (as opposed to merely following) the actual values. It depends on how far the points are from the 1:1 line (as opposed to the best-fit line). The closer the data are to the 1:1 line, the greater the coefficient of determination.

R2 is frequently used to denote the coefficient of determination. For a simple linear regression it equals the square of Pearson's r, so its value ranges from 0 to 1; under the more general predicted-versus-actual definition it can even be negative for badly fitting models.
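As a minimal sketch (my own illustration, using the mpg data loaded at the top of this report), Pearson's r can be computed with R's built-in cor(), and for a simple linear regression the coefficient of determination is just its square:

# Pearson's r and its square, illustrated on two mpg variables
r <- cor(mpg$displ, mpg$hwy)   # Pearson correlation coefficient
r2 <- r^2                      # coefficient of determination for the simple regression
round(c(r = r, r_squared = r2), 3)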

2. Simple Regression:
To estimate the relationship between two quantitative variables, simple regression is used. You can use simple linear regression to find out:
1. The strength of the relationship between two variables.
2. The value of the dependent variable at a given independent variable value.


By fitting a line to the observed data, regression models describe the relationship between variables. A straight line is used in linear regression models, while a curved line is used in logistic and nonlinear regression models. Regression estimates the change in a dependent variable as the independent variable(s) change.

Simple linear regression assumptions: Simple linear regression is a parametric test, which means it makes certain assumptions about the data:
1. Homogeneity of variance: The magnitude of the error in our prediction does not significantly change as the independent variable’s values change.
2. Independence of observations: The dataset’s observations were gathered using methods of sampling that were statistically valid, and there are no unobserved relationships among them.
3. Normality: The data follow a normal distribution.
Linear regression makes one additional assumption:
4. Linearity: the relationship between the independent and dependent variables is linear, so a straight line is the best fit through the data points.
EXAMPLE: Assume that height was the only factor influencing body weight. If we plotted height (independent variable) versus body weight (dependent variable), we might find a very linear relationship. In this simple linear regression, we are looking at how one independent variable affects the outcome. If height were the only determinant of body weight, we would expect the points for individual subjects to lie close to the line. However, if other factors (independent variables) influenced body weight in addition to height (e.g., age, calorie intake, and exercise level), we would expect the points for individual subjects to be scattered more widely around the line, because we are only considering height.
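A minimal sketch of a simple linear regression in R (my own illustration, using the mpg data from later in this report, with hwy as the dependent and displ as the independent variable):

# fit a straight line to the observed data
fit_simple <- lm(hwy ~ displ, data = mpg)
summary(fit_simple)   # intercept, slope, R-squared, and p-values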

Multiple Regression: To estimate the relationship between two or more independent variables and one dependent variable, multiple linear regression is used. You can use multiple linear regression to find out:
1. The degree to which two or more independent variables are positively or negatively correlated with one dependent variable.
2. The value of the dependent variable at a given independent variable value.

Multiple linear regression assumptions
All the assumptions in multiple linear regression are the same as in simple linear regression:
1. Variance homogeneity: the size of the error in our prediction does not vary significantly across independent variable values.
2. Observational independence: the observations in the dataset were gathered using statistically valid sampling methods, and there are no hidden relationships between variables.
3. No multicollinearity: because some of the independent variables in multiple linear regression may be correlated with one another, it is critical to check this before developing the regression model. If two independent variables are overly correlated (r² > 0.6), only one should be included in the regression model.
4. Normality: The data is distributed normally.
5. Linearity: the best fit line through the data points is a straight line, not a curve or a grouping factor.


EXAMPLE: Assume an investigator created a scoring system that allowed her to predict an individual’s body mass index (BMI) based on information about what and how much they ate. The researcher wanted to put this new “diet score” to the test to see how closely it corresponded to actual BMI measurements. A small sample of subjects’ information is collected to compute their “diet score,” and each subject’s weight and height are measured in order to compute their BMI. The relationship between the new “diet score” and BMI suggests that the “diet score” is not a very good predictor (i.e., there is little if any relationship between the two).
While this is disheartening, the investigator believes that confounding by age and/or gender may be masking the true relationship between "diet score" and BMI. She first determines which subjects are over the age of 20; the scatter plot then shows that the younger and older subjects form clusters, distinguishing the adults from the children. Within each group, there appears to be a linear relationship between BMI and diet score.
Age and gender both meet the criteria for confounders. We can also see a clear linear relationship between “diet score” and BMI across all four age and gender groups. In other words, we can only see a relationship between diet score and BMI after “taking into account” these two confounding variables. These other factors muddled the true relationship.
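A minimal multiple regression sketch (my own illustration on the mpg data; the choice of cty modeled from displ and cyl is mine, not part of the example above):

# one dependent variable, two independent variables
fit_multi <- lm(cty ~ displ + cyl, data = mpg)
summary(fit_multi)

# check the multicollinearity assumption first:
# squared correlation between the two predictors
cor(mpg$displ, mpg$cyl)^2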

3. Hypothesis Testing in terms of Regression Analysis: In a linear regression model, hypothesis testing is done to determine whether the beta coefficients are significant. Every time we fit a linear regression model, we check whether the line is significant by testing its coefficients. Using data from a sample, hypothesis testing allows us to draw conclusions about population parameters.
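In R this check is read off the coefficient table of the model summary; a sketch (my own illustration, again assuming the mpg data):

fit_hw <- lm(hwy ~ displ, data = mpg)
coef(summary(fit_hw))                                # estimates, SEs, t values, p-values
coef(summary(fit_hw))["displ", "Pr(>|t|)"] < 0.05    # is the slope significant at alpha = 0.05?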

4. Analytical Skills Gained: In this final project for the course, I have gained a lot of knowledge and skills on how to use R efficiently. I learnt tools like hypothesis testing and regression analysis. I also learnt how to draw analyses and conclusions out of data models, how to present a formal report, and how to always be well equipped with information ahead of meetings.

5. Advantages of using R for Data Analysis:
The software R is free and designed for statistical computing and graphics. However, R is much more than a statistical package: it is a programming language that was created specifically for statistical analysis. The advantages of using R are:
1. Open source and free: R's free and open-source nature is likely the primary reason many scholars all over the world choose it. Anyone with access to the source code can look under the hood and see what it is doing. This also means that you, or anyone else with the desire and aptitude to do so, can immediately fix bugs and make any necessary changes, instead of waiting for a vendor to identify the bug, fix it, and release an updated version.
2. Reproducible research: Simply write scripts for each step of the analysis, beginning with loading the data into R and ending with creating the graphs and tables that report the results. Such scripts make it simple to replicate your research: numerous approaches can be tried quickly, errors can be fixed, and your analysis can be updated as necessary, all by changing a few lines of code and selecting "Run".
3. Extremely simple data manipulation: R has several packages that make getting your data ready for analysis incredibly simple. Your data may be saved in .csv or .txt files, Excel spreadsheets, relational databases, or SAS or Stata files; with just one line of code, R can load all these different kinds of files. Data cleaning and transformation are also simple: with one line of code, you can create a separate dataset with no missing values, and with another, you can apply multiple filters to your data (see the sketch after this list). With such powerful tools at your disposal, you can spend less time getting your data ready for analysis and more time doing the analysis.
4. Advanced visualizations: R's basic functionality allows you to create histograms, scatterplots, and line plots with just a few lines of code. These are extremely useful functions for visualizing your data before beginning any analysis; you can see your data and gain insights that are not visible from tabulated data alone in a matter of seconds. If you take the time to learn more advanced visualization packages, such as ggplot2, you will be able to create very professional-looking graphs, with a seemingly infinite number of ways to visualize your data and additional features such as maps and animation.
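A short sketch of the claims above (my own example; drop_na() and filter() are tidyverse functions, and ggplot2 does the plotting):

clean <- tidyr::drop_na(mpg)                              # one line: drop rows with missing values
six_plus <- dplyr::filter(mpg, year == 2008, cyl >= 6)    # one line: multiple filters at once

ggplot(mpg, aes(x = displ, y = hwy)) +                    # a few lines: a polished scatterplot
  geom_point() +
  labs(title = "Highway mpg vs engine displacement")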

Task 2.1
Dataset Description
Description: In this task, I have described the mpg dataset, which ships with an inbuilt library in R, using some functions.

About the Dataset: The mpg dataset is included in the tidyverse and explained in the corresponding documentation. It contains a sample of fuel economy data and only includes models that had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car. There are a total of 234 observations in the dataset, with 11 variables, namely:
manufacturer - manufacturer name
model - model name
displ - engine displacement, in litres
year - year of manufacture
cyl - number of cylinders
trans - type of transmission
drv - f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
cty - city miles per gallon
hwy - highway miles per gallon
fl - fuel type
class - "type" of car

# Task 2.1.a mpg data description using summarytools::descr()

dataset_final %>%
  dplyr::select(displ, cyl, cty, hwy) %>%
  summarytools::descr() %>%
  round(2) %>% 
   kbl(caption = "Table.2.1.a. Basic Descriptive Statistics", 
      font_size = 12, 
      align = "c",
      position ='center', 
      digits = 2) %>%
  kable_classic(full_width = T, 
                font_size = 12)
Table.2.1.a. Basic Descriptive Statistics
cty cyl displ hwy
Mean 16.86 5.89 3.47 23.44
Std.Dev 4.26 1.61 1.29 5.95
Min 9.00 4.00 1.60 12.00
Q1 14.00 4.00 2.40 18.00
Median 17.00 6.00 3.30 24.00
Q3 19.00 8.00 4.60 27.00
Max 35.00 8.00 7.00 44.00
MAD 4.45 2.97 1.33 7.41
IQR 5.00 4.00 2.20 9.00
CV 0.25 0.27 0.37 0.25
Skewness 0.79 0.11 0.44 0.36
SE.Skewness 0.16 0.16 0.16 0.16
Kurtosis 1.43 -1.46 -0.91 0.14
N.Valid 234.00 234.00 234.00 234.00
Pct.Valid 100.00 100.00 100.00 100.00
# Task 2.1.b mpg data description using psych::describe ()
dataset_final %>%
  dplyr::select(displ, cyl, cty, hwy) %>%
  psych::describe() %>%
  round(2) %>% 
  t() %>%
   kbl(caption = "Table.2.1.b. Basic Descriptive Statistics", 
      font_size = 12, 
      align = "c",
      position ='center', 
      digits = 2) %>%
  kable_classic(full_width = T, 
                font_size = 12)
Table.2.1.b. Basic Descriptive Statistics
displ cyl cty hwy
vars 1.00 2.00 3.00 4.00
n 234.00 234.00 234.00 234.00
mean 3.47 5.89 16.86 23.44
sd 1.29 1.61 4.26 5.95
median 3.30 6.00 17.00 24.00
trimmed 3.39 5.86 16.61 23.23
mad 1.33 2.97 4.45 7.41
min 1.60 4.00 9.00 12.00
max 7.00 8.00 35.00 44.00
range 5.40 4.00 26.00 32.00
skew 0.44 0.11 0.79 0.36
kurtosis -0.91 -1.46 1.43 0.14
se 0.08 0.11 0.28 0.39


Observations from Task 2.1:
In the above performed task, I have used the summarytools::descr() and psych::describe() functions to describe the dataset. I populated a table of the results, and in the second table I used the t() function to transpose it and improve its readability. From the tables it is clear that, across the 234 observations, the hwy variable has the highest mean, median, and standard deviation of the four selected variables.

Task 2.2
Statistical Table
Description: In this task, I have presented a table that displays descriptive statistics of displacement per number of cylinders for the given dataset; the corresponding scatter plot appears in Task 2.6. Scatter plots are diagrams that show the connection between two numerical variables using Cartesian coordinates. A table, in its most basic form, can store information for two variables: variable A and variable B.

# Task 2.2

eff = dataset_final %>% 
  group_by(cylinder = cyl) %>%
  summarise(Mean = mean(displ), 
            SD = sd(displ),
            Minimum = min(displ),
            Maximum = max(displ))

eff %>%
  kable(align = "c",
        caption = "Table.2.2.Descriptive values",
        format = "html",
        digits = 2,
        table.attr = "style='width:60%;'")%>%
  kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
              html_font = "Cambria",
              position = "center",
              font_size = 12) %>%
  add_header_above(c(" " = 1,"Displacement" = 4))
Table.2.2.Descriptive values
Displacement
cylinder Mean SD Minimum Maximum
4 2.15 0.32 1.6 2.7
5 2.50 0.00 2.5 2.5
6 3.41 0.47 2.5 4.2
8 5.13 0.59 4.0 7.0


Observations from Task 2.2:
In the above performed task, I have presented a table that displays the statistics for displacement per number of cylinders; the scatter plot in Task 2.6 visualizes the same relationship. Through it, we can see a positive correlation between displacement and the number of cylinders, showing that if one increases, the other should too. In general, the bigger the displacement of an engine, the more power it can produce, whereas the lower the displacement, the less fuel it consumes. This is because displacement directly affects how much gasoline must be pumped into a cylinder to generate power and keep the engine running.

Task 2.3
Finding coefficient of correlation and coefficient of determination
Description: In this task, I have calculated the coefficient of correlation and coefficient of determination for the displacement and cylinders.

#Task 2.3

#task 2.3.a finding the coefficient of correlation
n = 234
y = dataset_final$cyl
x = dataset_final$displ

#A
xy = x * y
sum_xy = sum(xy)
A = n * sum_xy
#B
sum_x = sum(x)
sum_y = sum(y)
B = sum_x * sum_y
#c
x_square = x^2
sum_x_square = sum(x_square)
C = ((n * sum_x_square) - (sum_x)^2)
#D
y_square = y^2
sum_y_square = sum(y_square)
D = ((n * sum_y_square) - (sum_y)^2)
#coefficients of correlation
r = (A - B)/sqrt((C)*(D))
r
## [1] 0.9302271
#task 2.3.b finding the coefficient of determination
r2 = r^2
r2
## [1] 0.8653225
#Populating the table for analysis
table_2_3 = matrix(c(x, y, xy, x_square, y_square), ncol=5, byrow= FALSE)
colnames(table_2_3) = c("Displacement", "Cylinders", "xy", "x2", "y2")

table_2_3 = head(table_2_3, 20)

table_2_3_bf = round(table_2_3, 3)

table_2_3_bf %>%
  kbl(caption = "Table.2.3. Analysis Results", 
      font_size = 12, 
      align = "c",
      position ='center', 
      digits = 2) %>%
  kable_classic(full_width = T, 
                font_size = 12)
Table.2.3. Analysis Results
Displacement Cylinders xy x2 y2
1.8 4 7.2 3.24 16
1.8 4 7.2 3.24 16
2.0 4 8.0 4.00 16
2.0 4 8.0 4.00 16
2.8 6 16.8 7.84 36
2.8 6 16.8 7.84 36
3.1 6 18.6 9.61 36
1.8 4 7.2 3.24 16
1.8 4 7.2 3.24 16
2.0 4 8.0 4.00 16
2.0 4 8.0 4.00 16
2.8 6 16.8 7.84 36
2.8 6 16.8 7.84 36
3.1 6 18.6 9.61 36
3.1 6 18.6 9.61 36
2.8 6 16.8 7.84 36
3.1 6 18.6 9.61 36
4.2 8 33.6 17.64 64
5.3 8 42.4 28.09 64
5.3 8 42.4 28.09 64


Observations from Task 2.3:
In the above performed task, I have found the coefficients of correlation and determination using the formula instead of the built-in function. I created separate variables to calculate the numerator and denominator components separately and then combined them in order to avoid errors, and populated the resulting values in a table. From the obtained values, I can observe that the coefficient of correlation is 0.930, showing a very high positive correlation since it is close to +1, while the coefficient of determination of 0.865 (86.5%) means that 86.5% of the variance in the number of cylinders can be explained by displacement.
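As a quick cross-check (my addition), R's built-in cor() should reproduce the manually computed coefficient:

r_check <- cor(dataset_final$displ, dataset_final$cyl)   # built-in Pearson correlation
round(c(r = r_check, r_squared = r_check^2), 4)          # should match 0.9302 and 0.8653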

Task 2.4
DescTools Library:
Description: In this task, I have used the “DescTools” library and performed certain functions for analysis.

#Task 2.4
#I used the following code in another markdown file in order to observe the results of using the DescTools functions.

##DescTools::Desc(mpg)

# description of Variable Manufacturer 
##DescTools::Desc(mpg$manufacturer)

# description of Variable Model 
##DescTools::Desc(mpg$model)

# description of Variable Display 
##DescTools::Desc(mpg$displ)

# description of Variable year 
##DescTools::Desc(mpg$year)

# description of Variable trans 
##DescTools::Desc(mpg$trans)

# task 2.4: presenting the outcomes of 2 variables:
# description of Variable fl 
fl <- DescTools::Desc(dataset_final$fl)
fl
dataset_final$fl (character)

  length      n    NAs  unique  levels  dupes
     234    234      0       5       5      y
         100.0%   0.0%

  level  freq   perc  cumfreq  cumperc
1     r   168  71.8%      168    71.8%
2     p    52  22.2%      220    94.0%
3     e     8   3.4%      228    97.4%
4     d     5   2.1%      233    99.6%
5     c     1   0.4%      234   100.0%

# description of Variable class
cl <- DescTools::Desc(dataset_final$class)
cl

dataset_final$class (character)

  length      n    NAs  unique  levels  dupes
     234    234      0       7       7      y
         100.0%   0.0%

        level  freq   perc  cumfreq  cumperc
1         suv    62  26.5%       62    26.5%
2     compact    47  20.1%      109    46.6%
3     midsize    41  17.5%      150    64.1%
4  subcompact    35  15.0%      185    79.1%
5      pickup    33  14.1%      218    93.2%
6     minivan    11   4.7%      229    97.9%
7     2seater     5   2.1%      234   100.0%


Observations from Task 2.4:
DescTools is a large collection of fundamental statistical functions and convenience wrappers for efficient data description that are not available in base R. The primary goal of this library is to handle the initial descriptive tasks in data analysis: computing descriptive statistics, producing graphical summaries, and reporting the findings. I have used it here to produce descriptive statistics for better data analysis. The two variables I chose for analysis in this task are fl and class, which are the fuel type and the type of car. Running the code, I could observe that each factor variable (fuel type and class) gets its own tables, plots, and graphs; for both, a frequency and a percentage horizontal bar plot can be seen. In the case of class, we can observe that SUVs have the highest frequency while 2-seaters have the lowest.

Task 2.5
Linear Regression:
Description: In this task, I have performed linear regression between cylinders (dependent) and displacement (independent) using the R code:

#Task 2.5

Linear_Reg = lm(dataset_final$cyl ~ dataset_final$displ)
model_summary=summary(Linear_Reg)
#intercept value
intercept_value <- model_summary$coefficients[1,1]
paste("The intercept value is:", round(intercept_value,2))
## [1] "The intercept value is: 1.86"
#slope value
slope_val <- model_summary$coefficients[2,1]
paste("The slope value is:", round(slope_val,2))
## [1] "The slope value is: 1.16"


Observations from Task 2.5:
Following is the general formula for linear regression:
Y_i = f(X_i, β) + e_i
where:
Y_i = dependent variable
f = function
X_i = independent variable
β = unknown parameters
e_i = error terms
The fitted line obtained here is Y = 1.16X + 1.86.

Through this task I have performed linear regression between cylinders (dependent) and displacement (independent). I can observe that the intercept value obtained is 1.86 and the slope is 1.16. In Task 2.3 the coefficients were derived manually from the correlation formula; here the built-in lm() function makes the work a little easier. A sketch of the same model written in the data-argument form follows.
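A sketch of that data-argument form (my variation), which also makes predicting for new data straightforward:

fit_cyl <- lm(cyl ~ displ, data = dataset_final)       # same model, data-argument form
coef(fit_cyl)                                          # intercept ~1.86, slope ~1.16
predict(fit_cyl, newdata = data.frame(displ = 3.0))    # predicted cylinders at displ = 3.0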

Task 2.6
Scatter Plot for displacement (dependent) and cylinders (independent):
Description: In this task, I have plotted a scatter plot to observe the relationship between displacement and cylinders.

# TASK 2.6 

plot(dataset_final$displ ~ dataset_final$cyl,
     pch = 8,
     xlab = "Cylinders",
     ylab = "Displacement",
     main = "Fig.2.6. scatter plot between displacement and cylinders")
reg_line = lm(dataset_final$displ ~ dataset_final$cyl)
abline(reg_line, 
       col = "#99004C", 
       lty = 1,
       lwd = 1)

abline(h = mean(dataset_final$displ),
       col = "red",
       lwd = 1.5)

text(x = 5,
     y = 3.8,
     paste("Mean =", mean(dataset_final$displ)),
     col = "red",
     cex = 0.7,
     pos = 4)

text(x = 7,
     y = 3,
     paste("Median =", median(dataset_final$displ)),
     col = "blue",
     cex = 0.7,
     pos = 2)

abline(h = median(dataset_final$displ),
       col = "blue",
       lwd = 1.5)


Observations from Task 2.6:
In the above performed task, I have plotted a scatter plot for displacement versus cylinders. I used a pch value of 8 for the data points and drew the mean and median lines on the scatter plot itself. Through the scatter plot I was able to observe that displacement has a positive linear relationship with cylinders, suggesting that as one increases, the other increases too, as already observed in the tasks above. The mean is found to be 3.47 and the median 3.3.

Task 2.7
Predicted and residual values for displacement:
Description: In this task, I have populated a table with the predicted values for displacement and their residuals.

# Task 2.7
# predicted (fitted) values from the Task 2.6 model
Predicted_value = predict(reg_line)
residual_val = resid(reg_line)

 

Table_task2_7 = matrix(c(Predicted_value,residual_val ), ncol = 2)
colnum = c("Predicted", "Residual")
colnames(Table_task2_7) = colnum

 

knitr::kable(data.frame(head(round(Table_task2_7,2),20)))
Predicted Residual
2.06 -0.26
2.06 -0.26
2.06 -0.06
2.06 -0.06
3.55 -0.75
3.55 -0.75
3.55 -0.45
2.06 -0.26
2.06 -0.26
2.06 -0.06
2.06 -0.06
3.55 -0.75
3.55 -0.75
3.55 -0.45
3.55 -0.45
3.55 -0.75
3.55 -0.45
5.05 -0.85
5.05 0.25
5.05 0.25


Observations from Task 2.7:
In the above performed task, I populated a table of the predicted displacement values and their corresponding residuals. It can be noticed that cars with the same number of cylinders share the same predicted value, and equal observed displacements then share the same residual. From the predicted values in the table, displacement increases by about 0.75 (1.49 per two cylinders) with each additional cylinder, so we can now anticipate displacement based on the number of cylinders. For example, for the first 4-cylinder car the predicted displacement is 2.06 and the residual is -0.26. It is important to use residuals to determine how reliable the predictions are.
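As a sanity check (my addition), the fitted values and residuals of the Task 2.6 model should add back up to the observed displacements:

all.equal(unname(fitted(reg_line) + resid(reg_line)),
          dataset_final$displ)    # TRUE if predicted + residual = observed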

Task 2.8
Frequency of cars based on cylinders
Description: In this task, I have populated a table with the frequency of cars based on cylinders and found their respective frequencies, cumulative frequencies.

#Task 2.8
table_2_8 = dataset_final$cyl %>%
  table() %>%
  as.data.frame() %>%
  rename(Frequency = Freq) %>%
  mutate(Cumulative_Frequency = cumsum(Frequency),
         Percentage = (Frequency / nrow(dataset_final)) * 100,
         Cumulative_Percentage = cumsum(Percentage))

knitr::kable(table_2_8,
             digits = 2,
             caption = "Table.2.8.Frequency of Car cylinders",
             format = "html",
             table.attr = "style='width:30%;'",
             align = 'c') %>%
  kable_classic(bootstrap_options = "hover",
                full_width = TRUE,
                position = "center",
                font_size = 13,
                lightable_options = "basic",
                html_font = "\"Arial Narrow\", \"Source Sans Pro\", sans-serif")
Table.2.8.Frequency of Car cylinders
. Frequency Cumulative_Frequency Percentage Cumulative_Percentage
4 81 81 34.62 34.62
5 4 85 1.71 36.32
6 79 164 33.76 70.09
8 70 234 29.91 100.00


Observations from Task 2.8:
In the above performed task, I have formed a frequency table with the respective cumulative frequencies for the cars based on cylinders. I used knitr::kable() and kableExtra to improve the visualisation of my results. From the formulated table I can observe that 4-cylinder cars are the most frequent (81 cars, 34.62%), followed by 6-cylinder (79) and 8-cylinder (70) cars, while 5-cylinder cars are the rarest (4 cars, 1.71%).

Task 2.9
Pie chart and ogive chart of cylinder frequencies
Description: In this task, I have plotted a pie chart and an ogive chart of the cylinder frequencies computed in Task 2.8.

#Task 2.9

# visualization 1
labels_2_9 = paste("Freq = ", 
             table_2_8$Frequency, 
             "\n",
             round(table_2_8$Percentage, digits = 2))

pie(table_2_8$Percentage, labels = labels_2_9,
    main = "Fig.2.9.a.Distribution of frequency",
    radius = 1,
    las = 2,
    cex =0.7,
    col = brewer.pal(4,"Set2") )

# visualization 2

Cum_freq = table_2_8$Cumulative_Frequency
value_bins <- graph.freq(Cum_freq, plot=FALSE)
values = ogive.freq(value_bins, frame=FALSE)

#create ogive chart
plot(values, xlab='Values', 
     ylab='Relative Cumulative Frequency',
     main='Fig.2.9.b.Ogive Chart', 
     col='green', 
     type='b', 
     pch=19, 
     las=1, 
     bty='l')

Observations from Task 2.9:
In the above performed task, I have plotted a pie chart and an ogive chart to observe the data. The graphs gave me a better understanding of the data and of the analysis to be drawn from it. Through the pie chart, which shows the frequency of cylinders, it is clearly visible that the largest slice, the frequency of 81 (4-cylinder cars), accounts for 34.62%. The ogive chart is a frequency polygon of the cumulative frequencies: the y-axis shows the cumulative frequencies, with the highest point between 200 and 250, and the x-axis shows the class, which is cylinders in this case.

Task 2.10
Predictions for two and ten cylinders
Description: Predicting engine displacement if a car has two or ten cylinders.

The aim is to forecast the displacement of the engine based on the number of cylinders, which should be between 2 and 10. We can notice a positive linear relationship between the cylinders and the displacement in the scatter plot, so if we need to anticipate engine displacement depending on the number of cylinders, we may use the scatter plot. The displacement for two cylinders appears to be between 1 and 2, which is the smallest of all displacements. Similarly, for ten cylinders, the displacement ranges from 6 to 8, which is the maximum. As a result, the scatter plot clearly shows that as the number of cylinders grows, so does the displacement of the engine.

Task 3.1
Multiple Regression Analysis and Hypothesis Testing
Description: In this task, I have performed multiple regression analysis and hypothesis testing on another dataset.

# TASK 3.1

cor_age_bp = cor(ds1$SystolicBP,ds1$Age)
coe_wt_bp = cor(ds1$SystolicBP,ds1$Weight)

det_age = cor_age_bp ^ 2
det_wt = coe_wt_bp ^ 2

cor_coeff = c(cor_age_bp, coe_wt_bp)
det_coeff = c(det_age, det_wt)

table_3_1 = matrix(c(cor_coeff, det_coeff ), nrow = 2, byrow = TRUE, ncol = 2)

rownum = c("Correlation Coefficient", "Determination Coefficient")
column = c("Age vs BP", "Weight vs BP")

colnames(table_3_1) = column
rownames(table_3_1) = rownum

table_3_1 %>%
  kbl(caption = "Table.3.1.Coefficient of correlation and determination", 
      font_size = 12, 
      align = "c",
      position ='center', 
      digits = 2) %>%
  kable_classic(full_width = T, 
                font_size = 12)
Table.3.1.Coefficient of correlation and determination
Age vs BP Weight vs BP
Correlation Coefficient 0.92 0.90
Determination Coefficient 0.85 0.81
#R squared and R
reg_table = lm(ds1$SystolicBP ~ Age + Weight, data = ds1)
summary(reg_table)
## 
## Call:
## lm(formula = ds1$SystolicBP ~ Age + Weight, data = ds1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1106 -3.8580 -0.3748  1.4868 12.4956 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 39.15749   10.19923   3.839  0.00236 **
## Age          0.98241    0.27603   3.559  0.00393 **
## Weight       0.24946    0.09792   2.548  0.02557 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.287 on 12 degrees of freedom
## Multiple R-squared:  0.9059, Adjusted R-squared:  0.8902 
## F-statistic: 57.73 on 2 and 12 DF,  p-value: 6.963e-07
reg_sum = summary(reg_table)


#Computation of F Test

r_3 = reg_sum$r.squared
paste("The Value of the R Square is :", r_3)
## [1] "The Value of the R Square is : 0.905854056470068"
r_val = sqrt(r_3)
paste("The Value of the R  is :", r_val)
## [1] "The Value of the R  is : 0.95176365578334"
IndependentVariables = 2
n = length(ds1$Age)
FTest = (r_3 / IndependentVariables) / ((1 - r_3) / (n - IndependentVariables - 1))

alpha = 0.05
degfreeNum = IndependentVariables          # numerator df = number of predictors
degfreeDen = n - IndependentVariables - 1  # denominator df
CriticalValue = qf(alpha, degfreeNum, degfreeDen, lower.tail = FALSE) 
FTest > CriticalValue
## [1] TRUE
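The p-value of this F statistic can also be computed directly (my addition); with 2 and 12 degrees of freedom it should match the 6.963e-07 reported by summary() above:

pf(FTest, IndependentVariables, n - IndependentVariables - 1,
   lower.tail = FALSE)   # p-value for the overall F test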
#Question 6

paste("Yes, we can predict the values from the two variables")
## [1] "Yes, we can predict the values from the two variables"
#Question 7 
new_dat1 = data.frame(Age = c(30), Weight = c(148))
q7 = predict(reg_table, newdata = new_dat1)
q7
##        1 
## 105.5503
#Question 8
new_dat2 = data.frame(Age = c(75), Weight = c(196))
q8 = predict(reg_table, newdata = new_dat2)
q8
##        1 
## 161.7331


Observations from Task 3.1:
About the dataset: The given dataset has a total of 4 variables and 15 observations. The variables are Patient ID, Systolic BP, Age, and Weight.
In the above task, I have performed a total of 8 subtasks. I derived the formula for multiple regression, found the correlation and determination coefficients between age and systolic blood pressure and between weight and systolic blood pressure, and populated a table with the results. I then found the F-test value and performed hypothesis testing for an alpha of 0.05; since the F statistic exceeds the critical value, the regression is significant. Furthermore, I found the expected systolic blood pressure for a person of age 30 and weight 148 (about 105.55) and for a person of age 75 and weight 196 (about 161.73).

Task 3.2
Scatter Plot
Description: In this task, I have presented scatter plots for age versus systolic blood pressure and weight versus systolic blood pressure:

# TASK 3.2

par(mfrow = c(1,2))

plot(ds1$SystolicBP ~ ds1$Age,
     col = "pink",
     xlab = "Age (Independent Variable)",
     ylab = "Systolic BP (Dependent Variable)",
     main = "Fig.3.2.a.Age Vs Systolic BP",
     pch = 19)

box(which = "figure", col = 4, lty = "solid")

abline(lm(ds1$SystolicBP ~ ds1$Age),
       lty = 1,
       lwd = 1)

plot(ds1$SystolicBP ~ ds1$Weight,
     col = "blue",
     xlab = "Weight (Independent Variable)",
     ylab = "Systolic BP (Dependent Variable)",
     main = "Fig.3.2.b.Weight Vs Systolic BP",
     pch = 19)

box(which = "figure", col = 4, lty = "solid")

abline(lm(ds1$SystolicBP ~ ds1$Weight),
       lty = 1,
       lwd = 1)


Observations from Task 3.2:
In the above performed task, I have plotted scatter plots for age versus systolic blood pressure and weight versus systolic blood pressure. Through the plots, it can be observed that systolic BP has a positive linear relationship with both age and weight. This means that the independent variables in this case, age and weight, are directly proportional to systolic BP, which increases as either of them increases.

Task 3.3
Residual Values:
Description: In this task, I have found the predicted values and their respective residual values.

# TASK 3.3
#Prediction
pred_val = data.frame(predict(reg_table))
colnames(pred_val) = "Predicted Value"

correspondingResidues = data.frame(ds1$SystolicBP - pred_val)
colnames(correspondingResidues) = "Corresponding Residuals Value"

ds2 = cbind(ds1, pred_val, correspondingResidues)

ds2 %>%
  kbl(caption = "Table.3.3. Analysis Results", 
      font_size = 12, 
      align = "c",
      position ='center', 
      digits = 2) %>%
  kable_classic(full_width = T, 
                font_size = 12)
Table.3.3. Analysis Results
PatientID SystolicBP Age Weight Predicted Value Corresponding Residuals Value
PK01 112 45 135 117.04 -5.04
PK02 156 60 182 143.50 12.50
PK03 125 55 148 130.11 -5.11
PK04 145 60 182 143.50 1.50
PK05 155 62 190 147.46 7.54
PK06 162 71 232 166.78 -4.78
PK07 139 57 194 143.55 -4.55
PK08 144 59 182 142.52 1.48
PK09 153 64 217 156.17 -3.17
PK10 126 42 171 123.08 2.92
PK11 169 75 225 168.97 0.03
PK12 132 52 173 133.40 -1.40
PK13 143 59 184 143.02 -0.02
PK14 153 67 194 153.37 -0.37
PK15 162 73 211 163.51 -1.51


Observations from Task 3.3:
In the above performed task, I have populated a table to show the dataset, the predicted values, and the corresponding residuals. A residual is the difference between an observed value of the response variable and the value predicted by the regression line. Since each data point has one residual (see the check after this list), residuals are:
1. Positive if they are above the regression line
2. Negative if they are below the regression line
3. Zero if the regression line actually passes through the point
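As a quick equivalence check (my addition), the manual subtraction above matches R's built-in resid() helper:

all.equal(resid(reg_table),
          ds1$SystolicBP - predict(reg_table))   # should be TRUE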

Task 3.4
Residual Plot:
Description: In this task, I have presented scatter plots to show
1. the residuals versus age values.
2. the residuals versus weight.

# TASK 3.4

par(mfrow = c(1,2))
plot(ds2$`Corresponding Residuals Value` ~ ds2$Age,
     col = "blue",
     xlab = "Age (Independent Variable)",
     ylab = "Corresponding Residuals Value",
     main = "Fig.3.4.a.Age Vs Residuals",
     pch = 4)
abline(lm(ds2$`Corresponding Residuals Value` ~ ds2$Age),
       lty = 1,
       lwd = 1)

plot(ds2$`Corresponding Residuals Value` ~ ds2$Weight,
     col = "pink",
     xlab = "Weight (Independent Variable)",
     ylab = "Corresponding Residuals Value",
     main = "Fig.3.4.b.Weight Vs Residuals",
     pch = 19)
abline(lm(ds2$`Corresponding Residuals Value` ~ ds2$Weight),
       lty = 1,
       lwd = 1)


Observations from Task 3.4:
In the above performed task, residual plots versus the predictor values, age and weight, are generated for the patient data. Taking the first plot into account, we can use it to assess whether we should include age as a predictor in the model for systolic blood pressure. For ages 60 and 75, the plot displays only a few data points, and they are close to zero. Since the fundamental goal of a regression line is to reduce the sum of squared residuals, a well-behaved plot will bounce randomly and form a roughly horizontal band around the residual line; age is therefore a reasonably accurate predictor of systolic blood pressure. The second figure depicts residual values versus weight; the majority of the data points are far from, and below, the zero line of the y-axis, indicating that the predicted values are too high for these data points. Weight alone is not a reliable predictor of systolic blood pressure.
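Base R also provides a built-in residuals-versus-fitted diagnostic (my addition) that condenses both panels above into a single plot:

plot(reg_table, which = 1)   # residuals vs fitted; a flat, patternless band is ideal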

CONCLUSION
Through this report, I have performed Hypothesis Testing on the given data set. The following are the things I was able to conclude:
To conclude, engine displacement, cylinder count, and drive train type all have an effect on fuel usage. The higher the displacement of an engine, the lower the mpg, which means more fuel usage. The number of cylinders in a vehicle determines its power and fuel consumption. The most energy-efficient and popular type is front-wheel drive.

This data analysis might be useful while deciding which vehicle to buy. There are, however, several more variables to consider before completing the purchase. Always examine your personal needs while keeping your own position in mind.

This report focused on examining the relationship between engine displacement and the number of cylinders. The data studied, a subset of the EPA's fuel efficiency statistics, revealed a significant positive correlation between engine displacement and cylinders: in general, the larger the number of cylinders in operation, the higher the engine displacement of a car.

Through this report, I also learned how hypothesis testing and regression analysis relate to each other, how to draw conclusions from them, and how to perform the analysis. I also learnt the various fields in which regression analysis can be used; for example, it is used to compute implicit costs and forecast reimbursement stakes in hospitals, among other things.

Through the final project, I learned how to manipulate my data according to need. From learning how to use functions like those in DescTools to enhancing visualizations, I learned how to work effectively in R.

Moreover, additional data should be acquired to evaluate whether the conclusions of this study are correct; because there was only one model of car in the two-seater class, I am unsure whether these findings can be extrapolated to other two-seater cars. Furthermore, because the data in this study only contain EPA data from 1999 to 2008, I would like to investigate other years to determine whether these preliminary conclusions remain true.

BIBLIOGRAPHY
The following are the references used in this report:
1. Chiluiza, D. https://rpubs.com/Dee_Chiluiza/816756, retrieved on 16th December, 2022.
2. Huecker, M. R. (September 24, 2021). Hypothesis Testing, P Values, Confidence Intervals, and Significance. https://www.ncbi.nlm.nih.gov/books/NBK557421/, retrieved on 16th December, 2022.
3. What Do Different Cylinder Numbers Mean in Regards to Engine Performance or Reliability? https://www.autoblog.com/2015/12/02/what-do-different-cylinder-numbers-mean-in-regards-to-engine-per/, retrieved on 16th December, 2022.
4. Enago (September 5, 2006). Quick Guide to Biostatistics in Clinical Research: Hypothesis Testing. https://www.enago.com/academy/quick-guide-to-biostatistics-in-clinical-research-hypothesis-testing/#:~:text=An%20example%20of%20a%20specific,of%20milk%20chocolate%20per%20day.%E2%80%9D, retrieved on 5th December, 2022.
5. Lewandowski, T. (June 15, 2019). Analysis using the mpg dataset. https://rpubs.com/tlewando/data101_mpgproject, retrieved on 16th December, 2022.
6. DescTools: a new R "misc package". https://www.r-bloggers.com/2014/09/desctools-a-new-r-misc-package/, retrieved on 16th December, 2022.
7. Ogive Graph / Cumulative Frequency Polygon in Easy Steps. https://www.statisticshowto.com/ogive-graph/, retrieved on 16th December, 2022.


APPENDIX
I have also attached my report "FinalProject_TsarinaPatnaik", which presents the R chunks, and a formal report named "FinalProject1_TsarinaPatnaik" without the R chunks.