# Setting up libraries for Project 1
library(readxl)
library(rmarkdown)
library(tidyverse)
library(dplyr)
library(magrittr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(RColorBrewer)
library(wrMisc)
library(DT)
library(MASS)
library(summarytools)
library(agricolae)
dataset_final <- mpg
ds1 <- read_excel("~/Desktop/ds1.xlsx")
INTRODUCTION
1. History of Correlation:
Correlation was invented by Sir Francis Galton. Galton, a relative of
Charles Darwin, accomplished a lot: he studied medicine, traveled to
Africa, wrote books on psychology and anthropology, and created visual
methods for mapping the weather. Like many others of his day, Galton
also tried to explain heredity. He began writing co-relation as
correlation in 1889 and developed a fascination with fingerprints. The
final substantial study Galton would write on the topic was his 1890
explanation of how he developed correlation. Galton's friend and
coworker Karl Pearson (also the father of Egon Pearson) pursued the
improvement of correlation with such energy that the metric r, which
Pearson termed the Galton coefficient of reversion and Galton dubbed
the index of correlation, is now known as Pearson's r.
Correlation Coefficient: Correlation coefficients quantify the strength
and direction of the relationship between two variables. Human body
mass and height, or a home's worth and its size, are well-known
examples of correlated variables. The Pearson correlation coefficient
(usually denoted by r) is one of the most popular correlation
coefficients.
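For reference, the standard computational formula for the sample Pearson coefficient (the same quantity computed by hand in Task 2.3 below) is:

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}$$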
Coefficient of Determination: The coefficient of determination, in
contrast to the Pearson correlation coefficient, assesses how closely
the predicted values correspond with (as opposed to merely following)
the actual values. It depends on how far the points are from the 1:1
line (as opposed to the best-fit line mentioned above). The closer the
data are to the 1:1 line, the greater the coefficient of determination.
2. Simple Regression:
To estimate the relationship between two quantitative variables, simple
regression is used. You can use simple linear regression to find out:
1. The strength of the relationship between two variables.
2. The value of the dependent variable at a given value of the independent variable.
Simple linear regression assumptions: Simple linear regression is a
parametric test, which means it makes certain assumptions about the
data. These are the assumptions:
1. Homogeneity of variance: the magnitude of the error in our prediction does not significantly change as the values of the independent variable change.
2. Independence of observations: the observations in the dataset were gathered using statistically valid sampling methods, and there are no unobserved relationships among them.
3. Normality: the data follows a normal distribution.
Linear regression makes one further assumption:
4. Linearity: the relationship between the independent and dependent variables is linear, as shown by the straight line that best fits the data points.
EXAMPLE: Assume that height was the only factor influencing body
weight. If we plotted height (independent variable) versus body weight
(dependent variable), we might find a very linear relationship. In this
simple linear regression, we are looking at how one independent
variable affects the outcome. We would expect the points for individual
subjects to be close to the line if height were the only determinant of
body weight. However, if there were other factors (independent
variables) that influenced body weight in addition to height (e.g.,
age, calorie intake, and exercise level), we might expect the points
for individual subjects to be scattered more widely around the line,
because we are only considering height. A minimal R sketch of such a
fit follows.
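This sketch simulates the height/weight example; the means, slope, and noise level are invented purely for illustration.

```r
# Simulated height/weight data with a linear trend plus noise
set.seed(1)
height <- rnorm(50, mean = 170, sd = 10)           # cm (made-up values)
weight <- 0.9 * height - 90 + rnorm(50, sd = 5)    # kg (made-up relationship)
fit_hw <- lm(weight ~ height)                      # simple linear regression
summary(fit_hw)$coefficients                       # slope, intercept, t tests
```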
Multiple Regression: To estimate the relationship between two or more
independent variables and one dependent variable, multiple linear
regression is used. You can use multiple linear regression to find out:
1. The degree to which two or more independent variables are positively or negatively correlated with one dependent variable.
2. The value of the dependent variable at a given value of the independent variables.
Multiple linear regression assumptions: The assumptions in multiple
linear regression are largely the same as in simple linear regression:
1. Variance homogeneity: the size of the error in our prediction does not vary significantly across independent variable values.
2. Observational independence: the observations in the dataset were gathered using statistically valid sampling methods, and there are no hidden relationships between variables.
3. No excessive multicollinearity: because some of the independent variables in multiple linear regression may be correlated with one another, it is critical to check this before developing the regression model. If two independent variables are overly correlated (r2 > 0.6), only one should be included in the regression model (a quick check is sketched below).
4. Normality: the data is distributed normally.
5. Linearity: the best-fit line through the data points is a straight line, not a curve or a grouping factor.
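As an illustrative version of the multicollinearity check in assumption 3, using the two mpg variables analyzed later in this report:

```r
# Squared pairwise correlation between two candidate predictors;
# a value above 0.6 suggests including only one of them
cor(dataset_final$displ, dataset_final$cyl)^2   # ~0.87 for the mpg data
```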
3. Hypothesis Testing in terms of Regression Analysis: In a linear regression model, hypothesis testing is used to determine whether our beta coefficients are significant. Every time we fit a linear regression model, we check whether the line is significant by examining the coefficients (as sketched below). Using data from a sample, hypothesis testing allows us to draw inferences about population parameters.
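A minimal sketch of that coefficient check (an illustrative fit, not one of the report's tasks):

```r
# t tests on the beta coefficients of a fitted line; the Pr(>|t|)
# column gives each coefficient's p-value
fit <- lm(hwy ~ displ, data = dataset_final)
coef(summary(fit))
```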
4. Analytical Skills Gained: In this final project for the course, I have gained a lot of knowledge and skills on how to use R efficiently. I learnt tools like hypothesis testing and regression analysis, and how to draw analyses and conclusions out of data models. I also learnt how to present a formal report and how to always be well equipped with information ahead of meetings.
5. Advantages of using R for Data Analysis:
R is free software designed for statistical computing and graphics.
However, R is much more than a statistical package: it is a programming
language that was created specifically for statistical analysis. The
advantages of using R are:
1. Open source and free: R's free and open-source nature is likely the
primary reason many scholars all over the world choose it. Anyone with
access to the source code can look under the hood and see what it is
doing. This also means that you, or anyone else with the desire and
aptitude, can immediately fix bugs and make any necessary changes,
rather than waiting for a vendor to identify the bug, fix it, and
release an updated version.
2. Reproducible research: Simply write scripts for each step of the
analysis, beginning with loading the data into R and ending with
creating the graphs and tables that report the results. Such a script
makes it simple to replicate your research: numerous different
approaches can be tested quickly, errors can be fixed, and your
analysis can be updated as necessary, all by changing a few lines of
code and selecting "Run".
3. Extremely simple data manipulation: R has several packages that make
getting your data ready for analysis incredibly simple. Your data may
be saved in .csv or .txt files, Excel spreadsheets, relational
databases, or SAS or Stata files; R can load all of these different
kinds of files with just one line of code. The data cleaning and
transformation process is also simple: with one line of code, you can
create a separate dataset with no missing values, and with another, you
can apply multiple filters to your data. With such powerful tools at
your disposal, you can spend less time getting your data ready for
analysis and more time doing the analysis. A small sketch of this
workflow follows.
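A minimal sketch of that workflow, assuming a hypothetical mydata.csv for the import step and using the mpg data already loaded for the cleaning steps:

```r
# One line to import (hypothetical file), one line per cleaning step
# df <- readr::read_csv("mydata.csv")
complete_cases <- tidyr::drop_na(dataset_final)     # rows with no missing values
filtered <- dplyr::filter(complete_cases,
                          year == 2008, cyl >= 6)   # multiple filters at once
```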
4. Advanced visualizations: R's basic functionality allows you to
create histograms, scatterplots, and line plots with just a few lines
of code. These are extremely useful functions for visualizing your data
before beginning any analysis; in a matter of seconds you can see your
data and gain insights that are not visible from tabulated data alone.
If you take the time to learn more advanced visualization packages,
such as ggplot2, you will be able to create some very impressive,
professional-looking graphs. R appears to have an infinite number of
ways to visualize your data, and you gain access to a slew of extra
features, such as the ability to add maps to your visualizations and
make them animated. An example is shown below.
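For instance, a few lines of ggplot2 (loaded above) are enough for a presentable scatterplot of the mpg data used in this report:

```r
# Scatterplot of displacement vs highway mpg, colored by car class
ggplot(dataset_final, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(title = "Fuel economy vs engine displacement",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon")
```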
Task 2.1
Dataset Description
Description: In this task, I have described the mpg dataset, which is
part of an inbuilt library in R, using some functions.
About the Dataset: The mpg dataset is included in the Tidyverse and
explained in the corresponding documentation. It contains a sample of
fuel economy data and only includes models that were released in every
year between 1999 and 2008 (this was used as a proxy for the car's
popularity). There are a total of 234 observations in the dataset, with
11 variables, namely:
manufacturer - manufacturer name
model - model name
displ - engine displacement, in litres
year - year of manufacture
cyl - number of cylinders
trans - type of transmission
drv - f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
cty - city miles per gallon
hwy - highway miles per gallon
fl - fuel type
class - "type" of car
# Task 2.1.a mpg data description using summarytools::descr()
dataset_final %>%
dplyr::select(displ, cyl, cty, hwy) %>%
summarytools::descr() %>%
round(2) %>%
kbl(caption = "Table.2.1.a. Basic Descriptive Statistics",
font_size = 12,
align = "c",
position ='center',
digits = 2) %>%
kable_classic(full_width = T,
font_size = 12)
| | cty | cyl | displ | hwy |
|---|---|---|---|---|
| Mean | 16.86 | 5.89 | 3.47 | 23.44 |
| Std.Dev | 4.26 | 1.61 | 1.29 | 5.95 |
| Min | 9.00 | 4.00 | 1.60 | 12.00 |
| Q1 | 14.00 | 4.00 | 2.40 | 18.00 |
| Median | 17.00 | 6.00 | 3.30 | 24.00 |
| Q3 | 19.00 | 8.00 | 4.60 | 27.00 |
| Max | 35.00 | 8.00 | 7.00 | 44.00 |
| MAD | 4.45 | 2.97 | 1.33 | 7.41 |
| IQR | 5.00 | 4.00 | 2.20 | 9.00 |
| CV | 0.25 | 0.27 | 0.37 | 0.25 |
| Skewness | 0.79 | 0.11 | 0.44 | 0.36 |
| SE.Skewness | 0.16 | 0.16 | 0.16 | 0.16 |
| Kurtosis | 1.43 | -1.46 | -0.91 | 0.14 |
| N.Valid | 234.00 | 234.00 | 234.00 | 234.00 |
| Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 |
# Task 2.1.b mpg data description using psych::describe()
dataset_final %>%
dplyr::select(displ, cyl, cty, hwy) %>%
psych::describe() %>%
round(2) %>%
t() %>%
kbl(caption = "Table.2.1.b. Basic Descriptive Statistics",
font_size = 12,
align = "c",
position ='center',
digits = 2) %>%
kable_classic(full_width = T,
font_size = 12)
| | displ | cyl | cty | hwy |
|---|---|---|---|---|
| vars | 1.00 | 2.00 | 3.00 | 4.00 |
| n | 234.00 | 234.00 | 234.00 | 234.00 |
| mean | 3.47 | 5.89 | 16.86 | 23.44 |
| sd | 1.29 | 1.61 | 4.26 | 5.95 |
| median | 3.30 | 6.00 | 17.00 | 24.00 |
| trimmed | 3.39 | 5.86 | 16.61 | 23.23 |
| mad | 1.33 | 2.97 | 4.45 | 7.41 |
| min | 1.60 | 4.00 | 9.00 | 12.00 |
| max | 7.00 | 8.00 | 35.00 | 44.00 |
| range | 5.40 | 4.00 | 26.00 | 32.00 |
| skew | 0.44 | 0.11 | 0.79 | 0.36 |
| kurtosis | -0.91 | -1.46 | 1.43 | 0.14 |
| se | 0.08 | 0.11 | 0.28 | 0.39 |
Observations from Task 2.1:
In the above performed task, I have used the summarytools::descr() and
psych::describe() functions to describe the dataset. I have populated a
table with the results, and in the second table I have used the t()
function to transpose the output and improve the visualization of the
table. From the tables it is clear that, across the 234 observations,
the hwy variable has the highest mean, median, and standard deviation
of the four selected variables.
Task 2.2
Statistical Table
Description: In this task, I have presented a table that displays
summary statistics for displacement per cylinder count for the given
dataset; the corresponding scatter plot is drawn in Task 2.6. Scatter
plots are diagrams that show the connection between two numerical
variables using Cartesian coordinates. A table, in its most basic form,
can store information for two variables: variable A and variable B.
# Task 2.2
eff = dataset_final %>%
group_by(cylinder = cyl) %>%
summarise(Mean = mean(displ),
SD = sd(displ),
Minimum = min(displ),
Maximum = max(displ))
eff %>%
kable(align = "c",
caption = "Table.2.2.Descriptive values",
format = "html",
digits = 2,
table.attr = "style='width:60%;'")%>%
kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
html_font = "Cambria",
position = "center",
font_size = 12) %>%
add_header_above(c(" " = 1,"Displacement" = 4))
| | Displacement | | | |
|---|---|---|---|---|
| cylinder | Mean | SD | Minimum | Maximum |
| 4 | 2.15 | 0.32 | 1.6 | 2.7 |
| 5 | 2.50 | 0.00 | 2.5 | 2.5 |
| 6 | 3.41 | 0.47 | 2.5 | 4.2 |
| 8 | 5.13 | 0.59 | 4.0 | 7.0 |
Observations from Task 2.2:
In the above performed task, I have presented a table that displays the
statistics for displacement per cylinder count for the given dataset.
Together with the scatter plot (Task 2.6), it shows a positive
correlation between displacement and the number of cylinders: if one
increases, the other should too. In general, the bigger the
displacement of an engine, the more power it can produce, whereas the
lower the displacement, the less fuel it consumes. This is because
displacement directly affects how much gasoline must be pumped into a
cylinder to generate power and keep the engine running.
Task 2.3
Finding coefficient of correlation and
coefficient of determination
Description: In
this task, I have calculated the coefficient of correlation and
coefficient of determination for the displacement and cylinders.
#Task 2.3
#task 2.3.a finding the coefficient of correlation
n = 234
y = dataset_final$cyl
x = dataset_final$displ
#A
xy = x * y
sum_xy = sum(xy)
A = n * sum_xy
#B
sum_x = sum(x)
sum_y = sum(y)
B = sum_x * sum_y
#c
x_square = x^2
sum_x_square = sum(x_square)
C = ((n * sum_x_square) - (sum_x)^2)
#D
y_square = y^2
sum_y_square = sum(y_square)
D = ((n * sum_y_square) - (sum_y)^2)
#coefficients of correlation
r = (A - B)/sqrt((C)*(D))
r
## [1] 0.9302271
#task 2.3.b finding the coefficient of determination
r2 = r^2
r2
## [1] 0.8653225
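As a quick cross-check (an aside, not required by the task), the built-in cor() function should reproduce the hand-computed values:

```r
# Built-in check of the hand-computed coefficients
cor(x, y)     # should match r  (~0.9302)
cor(x, y)^2   # should match r2 (~0.8653)
```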
#Populating the table for analysis
table_2_3 = matrix(c(x, y, xy, x_square, y_square), ncol=5, byrow= FALSE)
colnames(table_2_3) = c("Displacement", "Cylinders", "xy", "x2", "y2")
table_2_3 = head(table_2_3, 20)
table_2_3_bf = round(table_2_3, 3)
table_2_3_bf %>%
kbl(caption = "Table.2.3. Analysis Results",
font_size = 12,
align = "c",
position ='center',
digits = 2) %>%
kable_classic(full_width = T,
font_size = 12)
| Displacement | Cylinders | xy | x2 | y2 |
|---|---|---|---|---|
| 1.8 | 4 | 7.2 | 3.24 | 16 |
| 1.8 | 4 | 7.2 | 3.24 | 16 |
| 2.0 | 4 | 8.0 | 4.00 | 16 |
| 2.0 | 4 | 8.0 | 4.00 | 16 |
| 2.8 | 6 | 16.8 | 7.84 | 36 |
| 2.8 | 6 | 16.8 | 7.84 | 36 |
| 3.1 | 6 | 18.6 | 9.61 | 36 |
| 1.8 | 4 | 7.2 | 3.24 | 16 |
| 1.8 | 4 | 7.2 | 3.24 | 16 |
| 2.0 | 4 | 8.0 | 4.00 | 16 |
| 2.0 | 4 | 8.0 | 4.00 | 16 |
| 2.8 | 6 | 16.8 | 7.84 | 36 |
| 2.8 | 6 | 16.8 | 7.84 | 36 |
| 3.1 | 6 | 18.6 | 9.61 | 36 |
| 3.1 | 6 | 18.6 | 9.61 | 36 |
| 2.8 | 6 | 16.8 | 7.84 | 36 |
| 3.1 | 6 | 18.6 | 9.61 | 36 |
| 4.2 | 8 | 33.6 | 17.64 | 64 |
| 5.3 | 8 | 42.4 | 28.09 | 64 |
| 5.3 | 8 | 42.4 | 28.09 | 64 |
Observations from Task 2.3:
In the above performed task, I have found the coefficients of
correlation and determination using the formula instead of the built-in
function. I have created separate values to calculate the numerator and
denominator separately and then combined them in order to avoid errors.
I have then populated a table with the resulting values. Through the
obtained values, I can observe that the coefficient of correlation is
0.930, showing a very high positive correlation since it is close to
+1, while the coefficient of determination of 0.865 (86.5%) shows that
86.5% of the variance in displacement can be explained by the number of
cylinders.
Task 2.4
DescTools Library:
Description: In this task, I have used the “DescTools” library and
performed certain functions for analysis.
#Task 2.4
#I used the following codes in another markdown file in order to observe the results of using the DescTools function.
##DescTools::Desc(mpg)
# description of Variable Manufacturer
##DescTools::Desc(mpg$manufacturer)
# description of Variable Model
##DescTools::Desc(mpg$model)
# description of Variable displ
##DescTools::Desc(mpg$displ)
# description of Variable year
##DescTools::Desc(mpg$year)
# description of Variable trans
##DescTools::Desc(mpg$trans)
# task 2.4: presenting the outcomes of 2 variables:
# description of Variable fl
fl <- DescTools::Desc(dataset_final$fl)
fl
## dataset_final$fl (character)
##
##   length      n    NAs  unique  levels  dupes
##      234    234      0       5       5      y
##          100.0%   0.0%
##
##   level  freq   perc  cumfreq  cumperc
## 1     r   168  71.8%      168    71.8%
## 2     p    52  22.2%      220    94.0%
## 3     e     8   3.4%      228    97.4%
## 4     d     5   2.1%      233    99.6%
## 5     c     1   0.4%      234   100.0%
# description of Variable class
cl <- DescTools::Desc(dataset_final$class)
cl
## dataset_final$class (character)
##
##   length      n    NAs  unique  levels  dupes
##      234    234      0       7       7      y
##          100.0%   0.0%
##
##        level  freq   perc  cumfreq  cumperc
## 1        suv    62  26.5%       62    26.5%
## 2    compact    47  20.1%      109    46.6%
## 3    midsize    41  17.5%      150    64.1%
## 4  subcompact    35  15.0%      185    79.1%
## 5     pickup    33  14.1%      218    93.2%
## 6    minivan    11   4.7%      229    97.9%
## 7    2seater     5   2.1%      234   100.0%
Observations from Task 2.4:
DescTools is a large collection of fundamental statistical functions
and convenience wrappers for efficient data description that are not
available in base R. The primary goal of this library is to cover the
initial descriptive tasks in data analysis, which include computing
descriptive statistics, producing graphical summaries, and reporting
the findings. We have used this function to produce descriptive
statistics for better data analysis. The two variables I have chosen
for analysis in this task are fl and class, which are the fuel type and
the class ("type") of car. Running the code, I could observe that each
factor variable (fuel type and class) gets its own tables, plots, and
graphs; for both, horizontal bar plots of frequencies and percentages
are produced. In the case of class, we can observe that the SUV has the
highest frequency while the 2-seater has the lowest.
Task 2.5
Linear Regression:
Description: In this task, I have performed linear regression
between cylinders (dependent) and displacement (independent) using the R
code:
#Task 2.5
Linear_Reg = lm(dataset_final$cyl~ dataset_final$displ)
model_summary=summary(Linear_Reg)
#intercept value
intercept_value <- model_summary$coefficients[1,1]
paste("The intercept value is:", round(intercept_value,2))
## [1] "The intercept value is: 1.86"
#slope value
slope_val <- model_summary$coefficients[2,1]
paste("The slope value is:", round(slope_val,2))
## [1] "The slope value is: 1.16"
Observations from Task 2.5:
Following is the general formula for linear regression:
Y_i = f(X_i, β) + e_i
The fitted line here is Y = 1.16X + 1.86, where:
Y_i = dependent variable
f = function
X_i = independent variable
β = unknown parameters
e_i = error terms
Through this task I have performed linear regression between cylinders
(dependent) and displacement (independent). I can observe that the
intercept value obtained is 1.86 and the slope is 1.16. In Task 2.3 the
correlation was derived from the formula by hand, but we can also use
the built-in lm() function, as here, to make the work a little easier.
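As an aside, the same fit can be written with a data argument, which avoids the dataset_final$ prefixes and gives the same coefficients as the output above:

```r
# Equivalent model specification; intercept ~1.86, slope ~1.16
coef(lm(cyl ~ displ, data = dataset_final))
```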
Task 2.6
Scatter Plot for displacement (dependent) and
cylinders (independent):
Description: In this
task, I have plotted a scatter plot to observe the relationship between
displacement and cylinders.
# TASK 2.6
plot(dataset_final$displ ~ dataset_final$cyl,
pch = 8,
xlab = "Cylinders",
ylab = "Displacement",
main = "Fig.2.6. scatter plot between displacement and cylinders")
reg_line = lm(dataset_final$displ ~ dataset_final$cyl)
abline(reg_line,
col = "#99004C",
lty = 1,
lwd = 1)
abline(h = mean(dataset_final$displ),
col = "red",
lwd = 1.5)
text(x = 5,
y = 3.8,
paste("Mean =", mean(dataset_final$displ)),
col = "red",
cex = 0.7,
pos = 4)
text(x = 7,
y = 3,
paste("Median =", median(dataset_final$displ)),
col = "blue",
cex = 0.7,
pos = 2)
abline(h = median(dataset_final$displ),
col = "blue",
lwd = 1.5)
Observations from Task 2.6:
In the above performed task, I have plotted a scatter plot for
displacement and cylinders. I have used a pch value of 8 for the data
points and drawn the mean and median lines on the scatter plot itself.
Through the scatter plot I was able to observe that displacement has a
positive linear relationship with cylinders, suggesting that as one
increases, the other increases too, as already observed through the
above performed tasks. The mean is found to be 3.47 and the median at
3.3.
Task 2.7
Predicted values and residuals
Description: In this task, I have populated a table with the predicted
values for displacement and their corresponding residuals.
# Task 2.7
# Note: reg_line was fitted with dataset_final$ terms, so predict()
# returns the fitted values for the original observations
Predicted_value = predict(reg_line)
residual_val = resid(reg_line)
Table_task2_7 = matrix(c(Predicted_value,residual_val ), ncol = 2)
colnum = c("Predicted", "Residual")
colnames(Table_task2_7) = colnum
knitr::kable(data.frame(head(round(Table_task2_7,2),20)))
| Predicted | Residual |
|---|---|
| 2.06 | -0.26 |
| 2.06 | -0.26 |
| 2.06 | -0.06 |
| 2.06 | -0.06 |
| 3.55 | -0.75 |
| 3.55 | -0.75 |
| 3.55 | -0.45 |
| 2.06 | -0.26 |
| 2.06 | -0.26 |
| 2.06 | -0.06 |
| 2.06 | -0.06 |
| 3.55 | -0.75 |
| 3.55 | -0.75 |
| 3.55 | -0.45 |
| 3.55 | -0.45 |
| 3.55 | -0.75 |
| 3.55 | -0.45 |
| 5.05 | -0.85 |
| 5.05 | 0.25 |
| 5.05 | 0.25 |
Observations from Task 2.7:
In the above performed task, I have populated a table of the predicted
displacement values and their corresponding residuals. It can be
noticed that observations with the same cylinder count share the same
predicted value, so their residuals differ only where the observed
displacements differ. From the predictions in the table, moving from 4
to 6 cylinders raises the predicted displacement by 1.49 litres, i.e.,
roughly 0.75 litres per additional cylinder. As a result, we can now
anticipate displacement based on the number of cylinders: for a
4-cylinder car the predicted value is 2.06, with a residual of -0.26
for the first observation. It is important to use residuals to
determine how reliable the predictions are.
Task 2.8
Frequency of cars based on cylinders
Description: In this task, I have populated a table with the frequency
of cars based on cylinders, together with the corresponding cumulative
frequencies, percentages, and cumulative percentages.
#Task 2.8
table_2_8 = dataset_final$cyl %>%
table() %>%
as.data.frame() %>%
rename(Cylinders = 1, Frequency = Freq) %>%
mutate(Cumulative_Frequency = cumsum(Frequency),
Percentage = (Frequency/nrow(dataset_final))*100,
Cumulative_Percentage = cumsum(Percentage))
knitr::kable(table_2_8,
digits = 2,
caption = "Table.2.8.Frequency of Car cylinders",
format = "html",
table.attr = "style='width:30%;'",
align = 'c') %>%
kable_classic(bootstrap_options = "hover",
full_width = TRUE,
position= "center",
font_size = 13,
lightable_options = "basic",
html_font = "\"Arial Narrow\", \"Source Sans Pro\", sans-serif",
)
| Cylinders | Frequency | Cumulative_Frequency | Percentage | Cumulative_Percentage |
|---|---|---|---|---|
| 4 | 81 | 81 | 34.62 | 34.62 |
| 5 | 4 | 85 | 1.71 | 36.32 |
| 6 | 79 | 164 | 33.76 | 70.09 |
| 8 | 70 | 234 | 29.91 | 100.00 |
Observations from Task 2.8:
In the above performed task, I have formed a frequency table with the
respective cumulative frequencies for the cars based on cylinders. I
have used knitr::kable() and kableExtra code to improve the
visualization of my results. From the formulated table I can observe
that 4-cylinder cars are the most common (81 cars, 34.62%), closely
followed by 6-cylinder (79) and 8-cylinder (70) cars, while 5-cylinder
cars are rare (4 cars, 1.71%).
Task 2.9
Visualizing the frequency of cars based on cylinders
Description: In this task, I have visualized the cylinder frequency
table from Task 2.8 using a pie chart and an ogive (cumulative
frequency) chart.
#Task 2.9
# visualization 1
labels_2_9 = paste("Freq = ",
table_2_8$Frequency,
"\n",
round(table_2_8$Percentage, digits = 2))
pie(table_2_8$Percentage, labels = labels_2_9,
main = "Fig.2.9.a.Distribution of frequency",
radius = 1,
las = 2,
cex =0.7,
col = brewer.pal(4,"Set2") )
# visualization 2
Cum_freq = table_2_8$Cumulative_Frequency
value_bins <- graph.freq(Cum_freq, plot=FALSE)
values = ogive.freq(value_bins, frame=FALSE)
#create ogive chart
plot(values, xlab='Values',
ylab='Relative Cumulative Frequency',
main='Fig.2.9.b.Ogive Chart',
col='green',
type='b',
pch=19,
las=1,
bty='l')
Task 2.10
Predicting displacement for 2 and 10 cylinders
Description: Predictions if a car has two or ten cylinders.
The aim is to forecast the displacement of the engine based on the number of cylinders, here 2 and 10. We can notice a positive linear relationship between the cylinders and the displacement in the scatter plot, so if we need to anticipate engine displacement for a given number of cylinders, we may use the scatter plot. The displacement for two cylinders appears to be between 1 and 2, which is the smallest of all displacements. Similarly, for 10 cylinders, we can observe that the displacement ranges from 6 to 8, the maximum displacement. As a result, the scatter plot clearly shows that as the number of cylinders grows, so does the displacement of the engine. A model-based version of this prediction is sketched below.
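A hedged, model-based sketch of that prediction (a linear fit of displacement on cylinders; note that 2 and 10 cylinders lie outside the observed 4 to 8 range, so these are extrapolations):

```r
# Refit with a data argument so predict() accepts new cylinder counts
fit_displ <- lm(displ ~ cyl, data = dataset_final)
predict(fit_displ, newdata = data.frame(cyl = c(2, 10)))
```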
Task 3.1
Multiple Regression Analysis and Hypothesis Testing
Description: In this task, I have performed multiple regression
analysis and hypothesis testing on another dataset.
# TASK 3.1
cor_age_bp = cor(ds1$SystolicBP,ds1$Age)
coe_wt_bp = cor(ds1$SystolicBP,ds1$Weight)
det_age = cor_age_bp ^ 2
det_wt = coe_wt_bp ^ 2
cor_coeff = c(cor_age_bp, coe_wt_bp)
det_coeff = c(det_age, det_wt)
table_3_1 = matrix(c(cor_coeff, det_coeff ), nrow = 2, byrow = TRUE, ncol = 2)
rownum = c("Correlation Coefficient", "Determination Coefficient")
column = c("Age vs BP", "Weight vs BP")
colnames(table_3_1) = column
rownames(table_3_1) = rownum
table_3_1 %>%
kbl(caption = "Table.3.1.Coefficient of correlation and determination",
font_size = 12,
align = "c",
position ='center',
digits = 2) %>%
kable_classic(full_width = T,
font_size = 12)
| Age vs BP | Weight vs BP | |
|---|---|---|
| Correlation Coefficient | 0.92 | 0.90 |
| Determination Coefficient | 0.85 | 0.81 |
#R and R squared from the fitted model
reg_table = lm(ds1$SystolicBP ~ Age + Weight, data = ds1)
summary(reg_table)
##
## Call:
## lm(formula = ds1$SystolicBP ~ Age + Weight, data = ds1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1106 -3.8580 -0.3748 1.4868 12.4956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.15749 10.19923 3.839 0.00236 **
## Age 0.98241 0.27603 3.559 0.00393 **
## Weight 0.24946 0.09792 2.548 0.02557 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.287 on 12 degrees of freedom
## Multiple R-squared: 0.9059, Adjusted R-squared: 0.8902
## F-statistic: 57.73 on 2 and 12 DF, p-value: 6.963e-07
reg_sum = summary(reg_table)
#Computation of F Test
r_3 = reg_sum$r.squared
paste("The Value of the R Square is :", r_3)
## [1] "The Value of the R Square is : 0.905854056470068"
r_val = sqrt(r_3)
paste("The Value of the R is :", r_val)
## [1] "The Value of the R is : 0.95176365578334"
IndependentVariables = 2
n = length(ds1$Age)
FTest = (r_3 / IndependentVariables) / ((1 - r_3) / (n - IndependentVariables - 1))
alpha = 0.05
# numerator df = number of predictors; denominator df = n - k - 1
degfreeNum = IndependentVariables
degfreeDen = n - IndependentVariables - 1
CriticalValue = qf(alpha, degfreeNum, degfreeDen, lower.tail = FALSE)
FTest > CriticalValue
## [1] TRUE
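Equivalently (an illustrative aside), the overall p-value reported in the summary above can be recomputed from the stored F statistic:

```r
# Overall-significance p-value from the model's F statistic
fstat <- reg_sum$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)   # ~6.96e-07 < 0.05
```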
#Question 6
paste("Yes, we can predict the values from the two variables")
## [1] "Yes, we can predict the values from the two variables"
#Question 7
new_dat1 = data.frame(Age = c(30), Weight = c(148))
q7 = predict(reg_table, newdata = new_dat1)
q7
## 1
## 105.5503
#Question 8
new_dat2 = data.frame(Age = c(75), Weight = c(196))
q8 = predict(reg_table, newdata = new_dat2)
q8
## 1
## 161.7331
Observations from Task 3.1:
About the dataset:
The given dataset has a total of 4 variables and 15 observations.
The variables are namely PatientID, SystolicBP, Age, and Weight.
In the above task, I have performed a total of 8 subtasks. I have
derived the formula for multiple regression. Then I have found the
correlation and determination coefficients between age and systolic
blood pressure and between weight and systolic blood pressure, and
populated a table with the results. I have found the F test value and
performed hypothesis testing at an alpha of 0.05. Furthermore, I found
the expected systolic blood pressure for a person with age = 30 and
weight = 148, and for a person with age = 75 and weight = 196. From the
analysis, the F statistic exceeds the critical value, so the regression
is significant and systolic blood pressure can be predicted from age
and weight; the predicted values are 105.55 and 161.73, respectively.
Task 3.2
Scatter Plot
Description: In this task, I have presented scatter plots for age
versus systolic blood pressure and weight versus systolic blood
pressure:
# TASK 3.2
par(mfrow = c(1,2))
plot(ds1$SystolicBP ~ ds1$Age,
col = "pink",
xlab = "Age (Independent Variable)",
ylab = "Systolic BP (Dependent Variable)",
main = "Fig.3.2.a.Age Vs Systolic BP",
pch = 19)
box(which = "figure", col = 4, lty = "solid")
abline(lm(ds1$SystolicBP ~ ds1$Age),
lty = 1,
lwd = 1)
plot(ds1$SystolicBP ~ ds1$Weight,
col = "blue",
xlab = "Weight (Independent Variable)",
ylab = "Systolic BP (Dependent Variable)",
main = "Fig.3.2.b.Weight Vs Systolic BP",
pch = 19)
box(which = "figure", col = 4, lty = "solid")
abline(lm(ds1$SystolicBP ~ ds1$Weight),
lty = 1,
lwd = 1)
Observations from Task 3.2:
In the above performed task, I have plotted scatter plots for age
versus systolic blood pressure and weight versus systolic blood
pressure. Through the plots, it can be observed that systolic BP has a
positive linear relationship with both age and weight. This means that
the independent variables in this case, age and weight, are directly
proportional to the systolic BP, which increases as they increase.
Task 3.3
Residual Values:
Description: In this task, I have found the predicted values and
their respective residual values.
# TASK 3.3
#Prediction
pred_val = data.frame(predict(reg_table))
colnames(pred_val) = "Predicted Value"
correspondingResidues = data.frame(ds1$SystolicBP - pred_val)
colnames(correspondingResidues) = "Corresponding Residuals Value"
ds2 = cbind(ds1, pred_val, correspondingResidues)
ds2 %>%
kbl(caption = "Table.3.3. Analysis Results",
font_size = 12,
align = "c",
position ='center',
digits = 2) %>%
kable_classic(full_width = T,
font_size = 12)
| PatientID | SystolicBP | Age | Weight | Predicted Value | Corresponding Residuals Value |
|---|---|---|---|---|---|
| PK01 | 112 | 45 | 135 | 117.04 | -5.04 |
| PK02 | 156 | 60 | 182 | 143.50 | 12.50 |
| PK03 | 125 | 55 | 148 | 130.11 | -5.11 |
| PK04 | 145 | 60 | 182 | 143.50 | 1.50 |
| PK05 | 155 | 62 | 190 | 147.46 | 7.54 |
| PK06 | 162 | 71 | 232 | 166.78 | -4.78 |
| PK07 | 139 | 57 | 194 | 143.55 | -4.55 |
| PK08 | 144 | 59 | 182 | 142.52 | 1.48 |
| PK09 | 153 | 64 | 217 | 156.17 | -3.17 |
| PK10 | 126 | 42 | 171 | 123.08 | 2.92 |
| PK11 | 169 | 75 | 225 | 168.97 | 0.03 |
| PK12 | 132 | 52 | 173 | 133.40 | -1.40 |
| PK13 | 143 | 59 | 184 | 143.02 | -0.02 |
| PK14 | 153 | 67 | 194 | 153.37 | -0.37 |
| PK15 | 162 | 73 | 211 | 163.51 | -1.51 |
Observations from Task 3.3:
In the above performed task, I have populated a table showing the
dataset, the predicted values, and the corresponding residuals. A
residual is the difference between an observed value of the response
variable and the value predicted by the regression line. Each data
point has one residual, which is:
1. Positive if the point is above the regression line
2. Negative if the point is below the regression line
3. Zero if the regression line passes exactly through the point
A small numerical check follows.
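As that check (a hedged aside), the residuals of the least-squares fit from Task 3.1 should sum to approximately zero:

```r
# Residuals of an OLS fit with an intercept sum to ~0
sum(resid(reg_table))
```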
Task 3.4
Residual Plot:
Description: In this task, I have presented scatter plots to show
1. the residuals versus age values.
2. the residuals versus
weight.
# TASK 3.4
par(mfrow = c(1,2))
plot(ds2$`Corresponding Residuals Value` ~ ds2$Age,
col = "blue",
xlab = "Age (Independent Variable)",
ylab = "Corresponding Residuals Value",
main = "Fig.3.4.a.Age Vs Residuals",
pch = 4)
abline(lm(ds2$`Corresponding Residuals Value` ~ ds2$Age),
lty = 1,
lwd = 1)
plot(ds2$`Corresponding Residuals Value` ~ ds2$Weight,
col = "pink",
xlab = "Weight (Independent Variable)",
ylab = "Corresponding Residuals Value",
main = "Fig.3.4.b.Weight Vs Residuals",
pch = 19)
abline(lm(ds2$`Corresponding Residuals Value` ~ ds2$Weight),
lty = 1,
lwd = 1)
Observations from Task 3.4:
In the above performed task, residual plots versus the predictor values
(age and weight) are generated for the patient data. Taking the first
plot into account, we can use it to assess whether age should be
included in the model as a predictor of systolic blood pressure. For
ages 60 and 75, the plot displays a few data points that are close to
zero. As the fundamental goal of a regression line is to minimize the
sum of squared residuals, a well-behaved plot will bounce randomly and
form a roughly horizontal band around the zero line. Age is a
reasonably accurate predictor of systolic blood pressure. The second
figure depicts residual values versus weight; the majority of the data
points are far away from and below the zero line of the y-axis,
indicating that the predicted values are excessively high for these
data points. Weight alone is not a reliable predictor of systolic blood
pressure.
CONCLUSION
Through
this report, I have performed Hypothesis Testing on the given data set.
The following are the things I was able to conclude:
To conclude, engine displacement, cylinder count, and drivetrain type
all have an effect on fuel usage. The higher the displacement of an
engine, the lower the mpg, which means more fuel usage. The number of
cylinders in a vehicle determines its power and fuel consumption. The
most energy-efficient and popular type is front-wheel drive.
This data analysis might be useful when deciding which vehicle to buy. There are, however, several more variables to consider before completing the purchase. Always weigh your personal needs and circumstances.
This report focused on examining the relationship between engine displacement and the number of cylinders. The data studied, which was a subset of the EPA's fuel efficiency statistics, revealed a significant positive correlation between engine displacement and cylinders: in general, the larger the number of cylinders in operation, the higher the displacement of a car.
Through this report, I also learned how to connect hypothesis testing and regression analysis, how to perform the analysis and draw conclusions from it, and about the various fields in which regression analysis can be used; for example, it is used to compute implicit costs and forecast reimbursement stakes in hospitals, among other things.
Through the final project, I learned how to manipulate my data according to need. From learning how to use packages like DescTools to enhancing visualizations, I learned how to work effectively in R.
Moreover, additional data should be acquired to evaluate whether the conclusions of this study hold: because there was only one model of car in the two-seater class, I am unsure whether these findings can be extrapolated to other two-seater cars. Furthermore, because the data in this study only cover EPA data from 1999 to 2008, I would like to investigate other years to determine whether these preliminary conclusions remain true.
BIBLIOGRAPHY
The following are the references used in this report:
1. Chiluiza, D. https://rpubs.com/Dee_Chiluiza/816756, retrieved on 16th December, 2022.
2. Huecker, M. R., September 24, 2021. Hypothesis Testing, P Values, Confidence Intervals, and Significance. https://www.ncbi.nlm.nih.gov/books/NBK557421/, retrieved on 16th December, 2022.
3. What Do Different Cylinder Numbers Mean in Regards to Engine Performance or Reliability? https://www.autoblog.com/2015/12/02/what-do-different-cylinder-numbers-mean-in-regards-to-engine-per/, retrieved on 16th December, 2022.
4. Enago, September 5, 2006. Quick Guide to Biostatistics in Clinical Research: Hypothesis Testing. https://www.enago.com/academy/quick-guide-to-biostatistics-in-clinical-research-hypothesis-testing/#:~:text=An%20example%20of%20a%20specific,of%20milk%20chocolate%20per%20day.%E2%80%9D, retrieved on 5th December, 2022.
5. Lewandowski, T., June 15, 2019. Analysis using the mpg dataset. https://rpubs.com/tlewando/data101_mpgproject, retrieved on 16th December, 2022.
6. DescTools: a new R "misc package". https://www.r-bloggers.com/2014/09/desctools-a-new-r-misc-package/, retrieved on 16th December, 2022.
7. Ogive Graph / Cumulative Frequency Polygon in Easy Steps. https://www.statisticshowto.com/ogive-graph/, retrieved on 16th December, 2022.
APPENDIX
I have also attached my report "FinalProject_TsarinaPatnaik", which
presents the R chunks, and a formal report named
"FinalProject1_TsarinaPatnaik" without the R chunks.