A. Introduction

There are three types of analysis in scientific research:

1: Analysis of difference

t-test: the dependent variable is continuous; the independent variable has 2 groups
ANOVA: one continuous dependent variable; the independent variable has more than 2 groups (one-, two-, or three-way ANOVA)
MANOVA: multiple continuous dependent variables
Chi-square: both variables are categorical

2: Correlation analysis and Prediction

Correlation analysis
Linear regression analysis
Logistic regression

3: Association analysis

Linear regression includes simple linear regression and multiple linear regression

Input data

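library(readxl)
# Adjust the path to wherever Data.xlsx is stored on your machine
Data <- read_excel("C:/Users/Admin/Desktop/R/Data.xlsx",
                   col_types = c("text", "text", "text", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric"))
attach(Data)      # lets the code below refer to columns by name without Data$
library(ggplot2)  # plots
library(car)
library(psych)    # pairs.panels()
library(relaimpo) # calc.relimp(), boot.relimp()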
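
As an alternative to attach(), the model functions used in this tutorial accept a `data` argument. The sketch below assumes a small simulated data frame that stands in for Data.xlsx (the values are made up; only the column names follow the tutorial), and shows the same calls written without attaching anything.

```r
# Hypothetical stand-in for Data.xlsx: simulated values, same column names
set.seed(4)
liver <- runif(24, min = 20, max = 60)
Data <- data.frame(
  Gender = rep(c("Female", "Male"), each = 12),
  `Liver (g)` = liver,
  `Body Weight (g)` = 400 + 40 * liver + rnorm(24, sd = 300),
  check.names = FALSE  # keep the non-syntactic column names as they are
)

# Columns are looked up in Data, so attach() is not needed
fit <- lm(`Body Weight (g)` ~ `Liver (g)`, data = Data)
ct  <- cor.test(~ `Body Weight (g)` + `Liver (g)`, data = Data)

coef(fit)    # intercept and slope
ct$estimate  # Pearson correlation coefficient
```

Passing `data = Data` keeps the columns out of the search path, which avoids name clashes when several data sets are loaded in one session.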

B. SIMPLE LINEAR REGRESSION

B1: Simple linear regression with X = continuous variable

Purpose:

Estimate the relationship between 2 continuous variables
Estimate r (the correlation coefficient)
Estimate the effect of the independent (predictor) variable on the dependent one
Build a prediction model

Equation: Y = α + βX + ε, in which:

Y = dependent variable (must be continuous)
X = independent variable (continuous or not)
α = intercept (value of Y when X = 0)
β = slope (estimate): change in Y when X changes by 1 unit
ε = random error
R^2 = percentage of the variation in Y explained by the model

Assumption

X and Y have a relationship that is linear in the parameters
X is measured without error (may not be true in practice)
The observations of Y are independent of one another
ε follows a normal distribution with mean = 0
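
The assumption about ε can be checked on the residuals of a fitted model. The following is a minimal sketch on simulated data (the numbers are made up and are not the tutorial's dataset); it uses only base R.

```r
# Simulated data, for illustration only
set.seed(2)
x <- runif(40, min = 10, max = 60)
y <- 400 + 40 * x + rnorm(40, sd = 300)

fit <- lm(y ~ x)
res <- residuals(fit)

# With an intercept in the model, least-squares residuals average to zero
mean(res)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
sw$p.value
```

For a visual check, `plot(fit, which = 1:2)` draws the residuals-versus-fitted plot and the normal Q-Q plot of the same model.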

Principle

The line is estimated by the least squares method: the fitted line is the one that minimizes the sum of squared vertical distances (d^2) between the observed points and the line
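
To make the least squares principle concrete, the sketch below computes the slope and intercept from the standard closed-form formulas and compares them with lm(). The data are simulated for illustration, and the names `alpha_hat` and `beta_hat` are chosen here, not taken from the tutorial.

```r
# Simulated data, for illustration only
set.seed(1)
x <- runif(30, min = 10, max = 60)
y <- 400 + 40 * x + rnorm(30, sd = 300)

# Closed-form least-squares estimates
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)

c(alpha = alpha_hat, beta = beta_hat)
coef(fit)

# Sum of squared residuals: the quantity that least squares minimizes
sum(residuals(fit)^2)
```

Any other line through the same points would give a larger sum of squared residuals than the one printed on the last line.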

Run simple linear regression with X = continuous variable in R

cor.test(`Body Weight (g)`,`Liver (g)`)

## 
##  Pearson's product-moment correlation
## 
## data:  Body Weight (g) and Liver (g)
## t = 6.5252, df = 22, p-value = 1.455e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6076240 0.9155086
## sample estimates:
##       cor 
## 0.8119908

Simple_linear_CO=lm(`Body Weight (g)`~`Liver (g)`)
summary(Simple_linear_CO)

## 
## Call:
## lm(formula = `Body Weight (g)` ~ `Liver (g)`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.32 -225.54  -40.23  147.18  921.06 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  417.752    235.647   1.773   0.0901 .  
## `Liver (g)`   38.131      5.844   6.525 1.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 337.2 on 22 degrees of freedom
## Multiple R-squared:  0.6593, Adjusted R-squared:  0.6438 
## F-statistic: 42.58 on 1 and 22 DF,  p-value: 1.455e-06

p=ggplot(Data,aes(x=`Liver (g)`,y=`Body Weight (g)`))
p+geom_point()+theme_bw()+theme_classic()+geom_smooth(method="lm",formula= y~x)

Read the output

Liver weight has a significant effect on body weight (p = 1.46e-06)
Equation: Body Weight = 417.75 + 38.13*Liver
r (correlation coefficient) = 0.812 (p = 1.455e-06) => the two variables are significantly correlated
R-squared = r^2 = 0.6593: Liver explains 65.93% of the variation in Body weight
Using the equation, Body weight can be predicted from Liver
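
As a worked example of reading these numbers, the sketch below plugs the intercept and slope from the summary() output into the regression equation and squares the correlation coefficient from cor.test(). The liver weights are hypothetical values chosen only for illustration.

```r
# Coefficients and r copied from the output above
intercept <- 417.752
slope     <- 38.131
r         <- 0.8119908

# Hypothetical liver weights (g), for illustration only
liver <- c(20, 40, 60)

# Predicted body weight (g) from the fitted line
predicted_body_weight <- intercept + slope * liver
predicted_body_weight

# In simple linear regression, R-squared equals r squared
round(r^2, 4)
```

With the model object still in memory, `predict()` gives the same predictions without copying coefficients by hand.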

B2. Simple linear regression with X = categorical variable

Types of categorical variable:
1: Nominal: sex, location, nation
2: Ordinal: level of something, stage of something

A t-test can be used in this case, but it only estimates the difference between groups; it does not give a prediction equation with a signed (negative or positive) coefficient

Run simple linear regression with X = categorical variable in R

t.test(`Weight gain (g)`~Gender)

## 
##  Welch Two Sample t-test
## 
## data:  Weight gain (g) by Gender
## t = 2.6819, df = 16.312, p-value = 0.01617
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   14.32066 121.56268
## sample estimates:
## mean in group Female   mean in group Male 
##             207.5917             139.6500

simple_linear_CA=lm(`Weight gain (g)`~Gender)
summary(simple_linear_CA)

## 
## Call:
## lm(formula = `Weight gain (g)` ~ Gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.49  -45.81   -3.15   47.64  116.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   207.59      17.91  11.589 7.75e-11 ***
## GenderMale    -67.94      25.33  -2.682   0.0136 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.05 on 22 degrees of freedom
## Multiple R-squared:  0.2464, Adjusted R-squared:  0.2121 
## F-statistic: 7.193 on 1 and 22 DF,  p-value: 0.01362

Read the output

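The t-test and the linear regression give the same t statistic (2.68) and similar p-values (0.0162 vs 0.0136); the small difference arises because t.test() runs the Welch test by default, which does not assume equal variances
R-squared = 0.2464: Gender explains 24.64% of the variation in Weight gain
Male - Female = -67.94 => males gained on average 67.94 g less than females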
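
t.test() runs the Welch test unless `var.equal = TRUE` is set, while lm() pools the variance across groups. The sketch below, on simulated data with made-up group means, shows that the pooled-variance t-test reproduces the regression p-value exactly and that the regression coefficient is the difference in group means.

```r
# Simulated weight gain for two groups of 12, for illustration only
set.seed(3)
d <- data.frame(
  gender = rep(c("Female", "Male"), each = 12),
  gain   = c(rnorm(12, mean = 200, sd = 60), rnorm(12, mean = 140, sd = 60))
)

welch   <- t.test(gain ~ gender, data = d)                    # default: Welch
student <- t.test(gain ~ gender, data = d, var.equal = TRUE)  # pooled variance
fit     <- lm(gain ~ gender, data = d)

p_lm <- summary(fit)$coefficients["genderMale", "Pr(>|t|)"]
c(welch = welch$p.value, student = student$p.value, lm = p_lm)

# The coefficient is the Male mean minus the Female (reference) mean
group_means <- tapply(d$gain, d$gender, mean)
c(coef = unname(coef(fit)["genderMale"]),
  diff = unname(group_means["Male"] - group_means["Female"]))
```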

C. Multiple linear regression (Checking interaction before running)

m=lm(`Weight gain (g)`~Gender+`Liver (g)`+Gender:`Liver (g)`)
m1=lm(`Weight gain (g)`~Gender)
m2= lm(`Weight gain (g)`~`Liver (g)`)
summary(m)

## 
## Call:
## lm(formula = `Weight gain (g)` ~ Gender + `Liver (g)` + Gender:`Liver (g)`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.797  -46.847   -0.179   45.105  113.197 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)            188.7844    64.1593   2.942  0.00805 **
## GenderMale             -65.9907    90.7864  -0.727  0.47572   
## `Liver (g)`              0.4980     1.6252   0.306  0.76243   
## GenderMale:`Liver (g)`  -0.0699     2.2532  -0.031  0.97556   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.81 on 20 degrees of freedom
## Multiple R-squared:  0.2527, Adjusted R-squared:  0.1406 
## F-statistic: 2.254 on 3 and 20 DF,  p-value: 0.1133

summary(m1)

## 
## Call:
## lm(formula = `Weight gain (g)` ~ Gender)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.49  -45.81   -3.15   47.64  116.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   207.59      17.91  11.589 7.75e-11 ***
## GenderMale    -67.94      25.33  -2.682   0.0136 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.05 on 22 degrees of freedom
## Multiple R-squared:  0.2464, Adjusted R-squared:  0.2121 
## F-statistic: 7.193 on 1 and 22 DF,  p-value: 0.01362

summary(m2)

## 
## Call:
## lm(formula = `Weight gain (g)` ~ `Liver (g)`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103.69  -47.85  -18.55   36.56  148.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 163.4915    49.8999   3.276  0.00345 **
## `Liver (g)`   0.2626     1.2374   0.212  0.83387   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71.41 on 22 degrees of freedom
## Multiple R-squared:  0.002044,   Adjusted R-squared:  -0.04332 
## F-statistic: 0.04505 on 1 and 22 DF,  p-value: 0.8339

# colour (rather than fill) so that points and fitted lines are distinguished by gender
y=ggplot(Data,aes(x=`Liver (g)`,y=`Weight gain (g)`,colour=Gender))
y+geom_point()+theme_classic()+geom_smooth(method="lm",formula= y~x)

Read the output

No interaction between Gender and Liver on Weight gain (p = 0.976 for the interaction term)
Equation:

Male (GenderMale = 1): Weight gain = 188.78 + 0.498*Liver - 65.99 - 0.070*Liver = 122.79 + 0.428*Liver
Female (GenderMale = 0): Weight gain = 188.78 + 0.498*Liver

R-squared = 0.2527
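
The group-specific equations follow from the coefficients of model m: Female is the reference level, so the Male line adds the GenderMale term to the intercept and the interaction term to the slope. A small sketch of that arithmetic, with the coefficients copied from summary(m) and vector names chosen here for readability:

```r
# Coefficients copied from summary(m) above
b <- c(intercept = 188.7844, male = -65.9907, liver = 0.4980, male_liver = -0.0699)

# Female is the reference level: GenderMale = 0
female <- c(intercept = b[["intercept"]],
            slope     = b[["liver"]])

# Male: GenderMale = 1, so both Male terms are added
male <- c(intercept = b[["intercept"]] + b[["male"]],
          slope     = b[["liver"]] + b[["male_liver"]])

rbind(female, male)
```

The two slopes differ by only 0.0699, which is consistent with the non-significant interaction term.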

D. Estimate the most important independent variable among multiple variables

pairs.panels(Data)

h_model=lm(`Body Weight (g)`~`Liver (g)`+`Viscera (g)`+`Fillet (g)`+`Abdomial fat (g)`)
metrics=calc.relimp(h_model,type=c("lmg"))
metrics

## Response variable: Body Weight (g) 
## Total response variance: 319289.2 
## Analysis based on 24 observations 
## 
## 4 Regressors: 
## Liver (g) Viscera (g) Fillet (g) Abdomial fat (g) 
## Proportion of variance explained by model: 81.94%
## Metrics are not normalized (rela=FALSE). 
## 
## Relative importance metrics: 
## 
##                         lmg
## Liver (g)        0.21505179
## Viscera (g)      0.18711632
## Fillet (g)       0.33505603
## Abdomial fat (g) 0.08215074
## 
## Average coefficients for different model sizes: 
## 
##                         1X        2Xs        3Xs        4Xs
## Liver (g)        38.130890  20.412847   1.339063  -8.170211
## Viscera (g)       9.543310   8.424649   7.186276   6.675346
## Fillet (g)        2.415306   2.456220   2.372892   2.238671
## Abdomial fat (g) 16.844015 -10.936325 -11.761811 -11.719022

boot=boot.relimp(h_model,b=1000,type=c("lmg"),fixed = FALSE)
booteval.relimp(boot,typesel = c("lmg"),level=0.9,bty = "perc",nodiff=TRUE)

## Response variable: Body Weight (g) 
## Total response variance: 319289.2 
## Analysis based on 24 observations 
## 
## 4 Regressors: 
## Liver (g) Viscera (g) Fillet (g) Abdomial fat (g) 
## Proportion of variance explained by model: 81.94%
## Metrics are not normalized (rela=FALSE). 
## 
## Relative importance metrics: 
## 
##                         lmg
## Liver (g)        0.21505179
## Viscera (g)      0.18711632
## Fillet (g)       0.33505603
## Abdomial fat (g) 0.08215074
## 
## Average coefficients for different model sizes: 
## 
##                         1X        2Xs        3Xs        4Xs
## Liver (g)        38.130890  20.412847   1.339063  -8.170211
## Viscera (g)       9.543310   8.424649   7.186276   6.675346
## Fillet (g)        2.415306   2.456220   2.372892   2.238671
## Abdomial fat (g) 16.844015 -10.936325 -11.761811 -11.719022
## 
##  
##  Confidence interval information ( 1000 bootstrap replicates, bty= perc ): 
## Relative Contributions with confidence intervals: 
##  
##                                      Lower  Upper
##                      percentage 0.9  0.9    0.9   
## Liver (g).lmg        0.2151     _BC_ 0.1626 0.2777
## Viscera (g).lmg      0.1871     _BC_ 0.1305 0.2537
## Fillet (g).lmg       0.3351     A___ 0.2433 0.4171
## Abdomial fat (g).lmg 0.0822     ___D 0.0549 0.1456
## 
## Letters indicate the ranks covered by bootstrap CIs. 
## (Rank bootstrap confidence intervals always obtained by percentile method) 
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.

Read the output

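Fillet is the most important predictor: lmg = 0.335, i.e. it accounts for about 33.5% of the variance in Body weight (90% bootstrap CI 0.243 to 0.417), and it is the only variable ranked first (A)
Liver (21.5%) and Viscera (18.7%) share ranks 2 to 3 (BC), and Abdomial fat (8.2%) ranks last (D)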
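
The lmg values printed by calc.relimp() can be ranked directly, and because the metrics are not normalized (rela=FALSE) they add up to the model's R-squared. The sketch below copies the four values from the output above; the shortened names are chosen here for readability.

```r
# lmg shares copied from the calc.relimp() output above
lmg <- c(Liver          = 0.21505179,
         Viscera        = 0.18711632,
         Fillet         = 0.33505603,
         `Abdomial fat` = 0.08215074)

# Rank the predictors from most to least important
sort(lmg, decreasing = TRUE)

# Predictor with the largest share
names(which.max(lmg))

# The shares sum to the proportion of variance explained by the model (%)
round(100 * sum(lmg), 2)
```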

Linear regression

La Nguyễn Thế Hiển

3/24/2021
