Introduction

The human body is complex, and the growth of one body part is often related to the growth of others.
It is therefore interesting to see how a person’s height and chest size are related to each other:
an increase in height may be associated with an increase or a decrease in chest size,
or height may not affect chest size at all.
The data were collected from 507 individuals, which is a sufficiently large sample.
The data were explored using summary statistics, a data table, box plots, histograms, and Q-Q plots.
Several tests were performed on the data, including a correlation test and tests on the intercept and slope of the fitted regression line; confidence intervals and the p-values of the t-statistics were also calculated.
For this experiment, chest diameter and height are the variables of interest, and all tests were conducted on them only,
with height as the independent variable and chest diameter as the dependent variable.

Problem Statement

The problem statement is:
- To investigate if there is any statistically significant relationship between a person’s chest diameter (che.di) and height (hgt).
Based on the problem statement, there are two variables of interest:
- Person’s height in centimeters - hgt
- Person’s chest diameter in centimeters - che.di
Successfully identifying the relationship between these variables will tell us:
- Is there any relationship between the two variables?
- If there is a relationship, is it positive (both variables increase together) or negative (one decreases as the other increases)?
- How does a change in the independent variable ($x$) affect the dependent variable ($y$)?
- What is the rate of that effect?
- How much of the relationship can be explained by a linear, or straight-line, relationship?
If the relationship is not linear, linear regression should not be used.
For a simple linear regression line the equation is as follows: \[y = \alpha + \beta x + \epsilon\]
We will test and learn more about simple linear regression in upcoming slides.
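
To make the model equation concrete, the following minimal sketch simulates data from $y = \alpha + \beta x + \epsilon$ and recovers the coefficients with lm(). The sample size, parameter values, and simulated data are assumptions chosen purely for illustration; they are not taken from bdims.csv.

```r
set.seed(1)                             # reproducible illustration
n     <- 500                            # hypothetical sample size
alpha <- -3                             # hypothetical intercept
beta  <- 0.18                           # hypothetical slope
x   <- rnorm(n, mean = 170, sd = 9)     # simulated predictor values
eps <- rnorm(n, mean = 0, sd = 2)       # normally distributed errors
y   <- alpha + beta * x + eps           # the simple linear regression model
fit <- lm(y ~ x)                        # ordinary least squares fit
coef(fit)                               # estimates should lie near alpha and beta
```

The estimated slope lands close to the value used in the simulation, while the intercept is estimated less precisely because $x = 0$ lies far outside the simulated range.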

Data

The dataset of body girth measurements, “bdims.csv”, was provided.
It contains body girth measurements and skeletal diameter measurements, as well as age, weight, height, and gender, for 507 physically active individuals - 247 men and 260 women. [1]
The data have 25 features (columns) and 507 rows.
Of the 25 features, only two are variables of interest: person’s height in centimeters (hgt) and person’s chest diameter in centimeters (che.di).
The file bdims.csv is in CSV (comma-separated values) format and is read using the fread() function from the data.table library.
The variables of interest are then subsetted from the main data for further processing and testing.

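library(data.table)             # provides fread()
df <- fread('bdims.csv')        # read the full data set
df <- df[, c('che.di', 'hgt')]  # keep only the two variables of interest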

Descriptive Statistics and Visualisation

The median is slightly smaller than the mean for both variables, which suggests slight right skewness.
There are no missing values in either variable.
The first quartile is the value below which 25% of the data fall.
The third quartile is the value below which 75% of the data fall.
The small standard deviation and IQR of che.di indicate a narrow, sharply peaked distribution, whereas the larger standard deviation and IQR of hgt indicate a wider, flatter one.
The two variables are positively correlated, as shown by the correlation value of 0.63.

library(dplyr); library(tidyr); library(kableExtra)
# Reshape to long format: one row per (Parameter, value) pair
GatherDf <- df %>% gather(che.di, hgt, key = 'Parameter', value = 'value')
# Summary statistics per parameter, rendered as a styled HTML table
GatherDf %>%
  group_by(Parameter) %>%
  summarise(Min = min(value, na.rm = TRUE),
            Max = max(value, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(value)),
            Q1 = quantile(value, probs = .25, na.rm = TRUE),
            Median = median(value, na.rm = TRUE),
            Q3 = quantile(value, probs = .75, na.rm = TRUE),
            Mean = mean(value, na.rm = TRUE),
            SD = sd(value, na.rm = TRUE),
            IQR = IQR(value, na.rm = TRUE),
            Corr = cor(df$che.di, df$hgt)) %>%
  knitr::kable("html", caption = "Table 1: Descriptive Statistics", align = "llllllllllll",
               col.names = c("Peer Groups", "Minimum", "Maximum", "Sample Size", "Missing Count", "First Quartile", "Median", "Third Quartile", "Mean", "Standard Deviation", "IQR", "Correlation"),
               digits = 2) %>%
  kable_styling(latex_options = "HOLD_position") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(c(2,4,6,8,10,12), color = 'white', background = 'black')

Table 1: Descriptive Statistics
Peer Groups	Minimum	Maximum	Sample Size	Missing Count	First Quartile	Median	Third Quartile	Mean	Standard Deviation	IQR	Correlation
che.di	22.2	35.6	507	0	25.65	27.8	29.95	27.97	2.74	4.3	0.63
hgt	147.2	198.1	507	0	163.80	170.3	177.80	171.14	9.41	14.0	0.63

Descriptive Statistics and Visualisation (Cont.)

There are no noticeable outliers to remove in either variable.
The means of the two variables are far apart (as expected, since a person’s height cannot be equal to the size of their chest).
Height has a larger IQR than chest size (see the second plot for a better comparison).
The second figure shows side-by-side box plots on the same axis, which highlights the difference between a person’s height and chest size.
This plot shows that hgt has a larger IQR than che.di.
For chest size, the first and third quartiles lie closer to the mean.

library(ggplot2)
library(ggthemes)  # provides theme_economist()

# First figure: box plots in separate panels, each with its own y scale
ggplot(data = GatherDf, aes(x = Parameter, y = value)) +
  geom_boxplot(aes(fill = Parameter)) +
  facet_wrap(~ Parameter, scales = "free") +
  scale_y_continuous(name = 'Size in cm\n') +
  ggtitle("Side by side Boxplot for person's Height and Chest Size in cm\n (on different scales)\n") +
  theme(plot.title = element_text(family = "Tahoma", hjust = 0.5), axis.title = element_text(size = 14))

# Second figure: box plots on a shared y axis
ggplot(GatherDf, aes(x = Parameter, y = value)) +
  geom_boxplot(outlier.colour = "black", outlier.shape = 1, outlier.size = 1.5, fill = '#4271AE', color = "#1F3552") +
  theme_economist() +
  theme(plot.title = element_text(family = "Tahoma", hjust = 0.5), text = element_text(family = "Tahoma"), axis.title = element_text(size = 12)) +
  scale_x_discrete(name = "\nParameter") +
  ggtitle("Boxplot for person's Height and Chest Size in cm\n") +
  scale_y_continuous(name = 'Size in cm\n')

Descriptive Statistics and Visualisation (Cont.)

The scatterplot() function from the car library gives a more informative view.
It draws a box plot for each variable along its axis and plots the data points on the two-dimensional graph.
It also draws a straight fitted linear regression line alongside a dotted line that follows the observed data.
The scatter plot shows that the data roughly follow the linear regression line, with little deviation (we will study and test this further in later slides).

scatterplot(x = df$hgt, y = df$che.di, xlab = "Height(cm)", ylab = "Chest Diameter(cm)", main = "Scatter plot for relation between person's Chest and Height in cm\n")

Descriptive Statistics and Visualisation (Cont.)

The following distribution is for the person’s height.
The green dotted line is the theoretical normal distribution and the blue line shows the experimental distribution for the parameter.
The variance is high, so the peak is low.
The distribution is not a smooth normal distribution.
The mean is higher than the median, suggesting slight right skewness; this may be due to some bias in the sample, which a larger sample size could reduce.

hist(df$hgt, breaks = 20, probability = TRUE, xlab = 'Height(cm)', ylab = 'Density', main = "Histogram for person's Height in cm")  # probability = TRUE plots densities, not counts
abline(v = mean(df$hgt), col="red", lwd=2, lty=2)
abline(v = median(df$hgt), col="orange", lwd=2, lty=2)
text(x=174, y=0.0505, labels= 'μ = 171.14', cex = 0.72)
text(x=167, y=0.048, labels= 'Median = 170.3', cex = 0.73)
lines(density(df$hgt), col = 'Blue', lwd=2)
curve(dnorm(x, mean=mean(df$hgt), sd=sd(df$hgt)), yaxt="n", lty="dotted", col="darkgreen", lwd=4, add=TRUE)
legend("topright", legend = c("Density Curve for Sample", "Normal Curve", 'Mean', 'Median'), bty = "n", text.col = "black", horiz = F, pch=c(15,15, 15, 15), col = c('Blue', "darkgreen", 'red', 'orange'))

Descriptive Statistics and Visualisation (Cont.)

The following histogram is for a person’s chest.
The green dotted line is the theoretical normal distribution and the blue line shows the experimental distribution for the parameter.
The variance is low, so the peak is high.
The distribution is not a smooth normal distribution.
The mean is higher than the median, suggesting slight right skewness; this may be due to some bias in the sample, which a larger sample size could reduce.

hist(df$che.di, breaks = 20, probability = TRUE, xlab = 'Chest Diameter(cm)', ylab = 'Density', main = "Histogram for person's Chest Diameter in cm")  # probability = TRUE plots densities, not counts
abline(v = mean(df$che.di), col="red", lwd=2, lty=2)
abline(v = median(df$che.di), col="orange", lwd=2, lty=2)
text(x=28.7, y=0.16, labels= 'μ = 27.97', cex = 0.75)
text(x=27, y=0.17, labels= 'Median = 27.8', cex = 0.73)
lines(density(df$che.di), col = 'Blue', lwd=2)
curve(dnorm(x, mean=mean(df$che.di), sd=sd(df$che.di)), yaxt="n", lty="dotted", col="darkgreen", lwd=4, add=TRUE)
legend("topright", legend = c("Density Curve for Sample", "Normal Curve", 'Mean', 'Median'), bty = "n", text.col = "black", horiz = F, pch=c(15,15, 15, 15), col = c('Blue', "darkgreen", 'red', 'orange'))

Descriptive Statistics and Visualisation (Cont.)

A normal quantile-quantile (Q-Q) plot helps us compare the sample distribution of our data with a theoretical distribution.
Here we use it to compare our sample data with the theoretical normal distribution.
The side-by-side Q-Q plots show that both variables roughly follow the normal distribution with some skewness, which confirms what the previous histograms showed.

p1 <- ggqqplot(df$che.di, size = 0.5) + ggtitle("QQ Plot for the person's chest") + theme(plot.title = element_text(hjust = 0.5))
p2 <- ggqqplot(df$hgt, size = 0.5) +  ggtitle("QQ Plot for the person's height") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1, p2, nrow = 1)

Hypothesis Testing

Fitting a linear regression line to sample data is done using a method known as ordinary least squares (OLS).
OLS chooses the line that minimises the sum $S$ of the squared vertical distances $d_i$ between each observed data point and the fitted line.
The sum of squares is written as: \[ S = \sum\limits_{i=1}^n d_i^2\]
Height is taken as the independent variable and chest diameter as the dependent variable.
Based on this assumption, the quantities below are calculated.
The meaning of each value is given in a comment alongside the code.

sum_x <- sum(df$hgt) # raw sum for x
sum_y <- sum(df$che.di) # raw sum for y
sum_x_sq <- sum(df$hgt^2) # raw sum of squares for x
sum_y_sq <- sum(df$che.di^2) # raw sum of squares for y
sum_xy <- sum(df$hgt*df$che.di) # raw sum of the cross product
n <- length(df$hgt) # number of observations
Lxx <- sum_x_sq - (sum_x^2)/n # corrected sum of squares for x
Lyy <- sum_y_sq - (sum_y^2)/n # corrected sum of squares for y
Lxy <- sum_xy - (sum_x*sum_y)/n # corrected sum of the cross products
b <- Lxy/Lxx # slope
a <- mean(df$che.di) - b*mean(df$hgt) # intercept
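
As a sanity check on the formulas, the sketch below computes the slope and intercept from the corrected sums and compares them with lm(). It uses R's built-in `women` data set (height and weight) as a stand-in, which is an assumption made only so the example runs without bdims.csv.

```r
# Stand-in data: built-in `women` data set, not the bdims data
x <- women$height
y <- women$weight
n <- length(x)                             # number of observations
Lxx <- sum(x^2) - sum(x)^2 / n             # corrected sum of squares for x
Lxy <- sum(x * y) - sum(x) * sum(y) / n    # corrected sum of cross products
b <- Lxy / Lxx                             # slope
a <- mean(y) - b * mean(x)                 # intercept
fit <- lm(y ~ x)                           # reference fit from lm()
all.equal(unname(coef(fit)), c(a, b))      # TRUE when both approaches agree
```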

Hypothesis Testing (Cont.)

Based on these calculations, a scatter plot is produced that shows the relationship between a person’s height and chest size.
The x-axis shows the height and the y-axis shows the chest size.
The fitted linear regression line is also drawn using the abline() function.
The line is formed with intercept (a) and slope (b), with x = hgt and y = che.di.

plot(che.di ~ hgt, data = df, xlab = "Height", ylab = "Chest")
abline(a = a, b = b, col= "red")

Hypothesis Testing (Cont.)

Correlation and simple linear regression are used to examine the relationship between two quantitative (discrete or continuous) variables.
Table 1 (Descriptive Statistics) shows that the Pearson correlation is positive, i.e. $r = 0.63$.
The formula for correlation is: \[ r = \frac{L_{xy}}{\sqrt{L_{xx} L_{yy}}} \]
A person’s chest diameter and height are positively related, i.e. the slope is positive.
To verify that the correlation value is not merely due to chance or some bias in the data, we need to perform a hypothesis test on the Pearson correlation.
To do this we use the rcorr() function from the Hmisc library.
The reported correlation between a person’s height and chest diameter is $r = 0.63$, and the p-value for the test of correlation is reported as 0 after rounding, i.e. $p < 0.001$.
For the correlation test the following hypotheses are assumed:
- Null hypothesis ($H_0$): The correlation is equal to zero.
- Alternate hypothesis ($H_A$): The correlation is not equal to zero. \[H_0: r = 0 \\ H_A: r \neq 0\]

print(cor(df$che.di, df$hgt))

## [1] 0.6268931

rcorr(as.matrix(dplyr::select(df, che.di, hgt)), type = "pearson")

##        che.di  hgt
## che.di   1.00 0.63
## hgt      0.63 1.00
## 
## n= 507 
## 
## 
## P
##        che.di hgt
## che.di         0 
## hgt     0

r = 0.627
n = 507
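
The Pearson correlation can also be computed by hand as the corrected sum of cross products divided by the square root of the product of the two corrected sums of squares. The sketch below checks this against cor(), again using the built-in `women` data set as an assumed stand-in for the bdims data.

```r
# Stand-in data: built-in `women` data set, not the bdims data
x <- women$height
y <- women$weight
n <- length(x)
Lxx <- sum(x^2) - sum(x)^2 / n             # corrected sum of squares for x
Lyy <- sum(y^2) - sum(y)^2 / n             # corrected sum of squares for y
Lxy <- sum(x * y) - sum(x) * sum(y) / n    # corrected sum of cross products
r_manual <- Lxy / sqrt(Lxx * Lyy)          # Pearson correlation by hand
all.equal(r_manual, cor(x, y))             # TRUE when it matches cor()
```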

Hypothesis Testing (Cont.)

The p-value for $r$ can be readily calculated by converting $r$ to a $t$ statistic using: \[ t = r \sqrt{\frac{n - 2}{1 - r^2}}\]
Here $n = 507$ and $r = 0.627$.

t = r*(sqrt((n-2)/(1-(r)^2)))
2*pt(q = t,df = n - 2,lower.tail=FALSE)

## [1] 0.000000000000000000000000000000000000000000000000000000009623577

Here $df = n - 2 = 507 - 2 = 505$. The resulting p-value satisfies $p < 0.001$, which is less than our significance level $\alpha = 0.05$ (95% confidence), so we reject the null hypothesis $H_0: r = 0$.
Based on the p-value of the $t$ test, there is a statistically significant positive correlation between a person’s chest diameter and height.
The 95% confidence interval for the correlation is $CI[0.571, 0.677]$.

CIr(r = cor(df$che.di, df$hgt), n = n, level = 0.95)

## [1] 0.5709813 0.6770164

The correlation of 0.63 lies within the 95% confidence interval, and the interval does not capture the null value of zero, so we reject $H_0$.
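
A common way to build a confidence interval for a correlation is the Fisher z-transformation. The sketch below applies it by hand to the reported $r$ and $n$; that CIr() uses exactly this method internally is an assumption here, although the result agrees with the interval printed above.

```r
r <- 0.6268931              # sample correlation reported by cor()
n <- 507                    # sample size
z    <- atanh(r)            # Fisher z-transformation of r
se_z <- 1 / sqrt(n - 3)     # approximate standard error of z
crit <- qnorm(0.975)        # two-sided 95% critical value
ci   <- tanh(z + c(-1, 1) * crit * se_z)   # back-transform to the r scale
round(ci, 3)                # 0.571 0.677
```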

Hypothesis Testing (Cont.)

Simple linear regression assumes that a predictor variable, $x$ (person’s height), multiplied by the slope $\beta$, provides information about some dependent variable, $y$ (person’s chest diameter), together with the constant/intercept ($\alpha$) and error/residuals ($\epsilon$), and can be shown as: \[y = \alpha + \beta x + \epsilon\]
Linear regression is a parametric method because $\epsilon$ is assumed to be normally distributed with mean zero, $N(0, \sigma^2)$.
Linear regression also assumes that the relationship between the predictor and dependent variable is explained by a linear, or straight line, relationship.
If the relationship is not linear, linear regression should not be used.
Before applying the linear model we first need to confirm that there is some linear relationship between the independent and dependent variables.
The previous correlation test confirmed that there is a positive linear relationship between the two variables, so it is safe to proceed with the linear model.

Hypothesis Testing (Cont.)

$R^2$ reflects the proportion of variability in the dependent variable that can be explained by a linear relationship with the predictor variable.
A person’s height therefore explains 39.3% of the variability in a person’s chest size.
$R^2$ tends to overestimate the explained variability; the adjusted $R^2$ takes this overestimation into account and scales it down. Since it is also around 39.2%, either $R^2$ or adjusted $R^2$ can be used.
The $R^2$ is a measure of goodness of fit for linear regression. The better the line fits the data (i.e. the closer the data points sit on the line) the higher $R^2$ will be.
The model summary also reports an $F$ statistic which is used to test the overall regression model. The $F$ test for the linear regression has the following statistical hypotheses:
- Null Hypothesis($H_0$): The data do not fit the linear regression model
- Alternative Hypothesis($H_A$): The data fit the linear regression model
The p-value of the $F$ test ($F = 327$, with $df_1 = 1$ and $df_2 = 505$) is $p < 0.001$.
As the p-value is less than our significance level $\alpha = 0.05$ ($p < \alpha$), we reject our null hypothesis ($H_0$).
Therefore we conclude that there is statistically significant evidence that the data fit a linear regression model.

LinearModel <- lm(che.di ~ hgt, data = df) 
LinearModel %>% summary()

## 
## Call:
## lm(formula = che.di ~ hgt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3102 -1.4326 -0.0696  1.4168  6.8929 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -3.2947     1.7319  -1.902              0.0577 .  
## hgt           0.1827     0.0101  18.082 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.138 on 505 degrees of freedom
## Multiple R-squared:  0.393,  Adjusted R-squared:  0.3918 
## F-statistic:   327 on 1 and 505 DF,  p-value: < 0.00000000000000022
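
In simple linear regression with a single predictor, $R^2$ equals the squared Pearson correlation and the $F$ statistic equals the squared $t$ statistic of the slope. The sketch below checks both identities using only the values reported in this document; the variable names are illustrative.

```r
r       <- 0.6268931    # Pearson correlation reported by cor()
t_slope <- 18.082       # t value for hgt from the model summary
n       <- 507          # sample size
r_sq   <- r^2                            # should match Multiple R-squared
f_stat <- r_sq / (1 - r_sq) * (n - 2)    # F on 1 and n - 2 degrees of freedom
round(r_sq, 3)       # 0.393
round(f_stat)        # 327
round(t_slope^2)     # 327
pf(f_stat, 1, n - 2, lower.tail = FALSE) < 0.001   # TRUE
```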

Hypothesis Testing (Cont.)

The next step is the estimation and testing of the intercept ($a$) and slope ($b$). The linear model reported $a=-3.295$.
To test the statistical significance of the constant, hypothesis testing is performed with the following hypotheses:
- Null hypothesis, $H_0: \alpha = 0$
- Alternate hypothesis, $H_A: \alpha \neq 0$

LinearModel %>% summary() %>% coef()

##               Estimate Std. Error   t value
## (Intercept) -3.2946566  1.7318756 -1.902363
## hgt          0.1827027  0.0101042 18.081859
##                                                                     Pr(>|t|)
## (Intercept) 0.05769227415322646101980552657551015727221965789794921875000000
## hgt         0.00000000000000000000000000000000000000000000000000000001017694

The $t$ test resulted in $t = -1.902$ and $p = 0.058$. Because the p-value is only slightly greater than our significance level ($p \approx \alpha$), the decision is borderline, so we also check whether the confidence interval ($CI$) captures the $H_0$ value.
The reported 95% confidence interval is $CI[-6.697, 0.108]$, which captures $H_0: \alpha = 0$, so we fail to reject our null hypothesis.

LinearModel %>% confint()

##                  2.5 %    97.5 %
## (Intercept) -6.6972252 0.1079121
## hgt          0.1628512 0.2025541
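
These intervals can be reproduced by hand as the estimate plus or minus the critical $t$ value times the standard error. The sketch below uses the estimates and standard errors printed in the coefficient table; the variable names are illustrative.

```r
est <- c(intercept = -3.2946566, hgt = 0.1827027)   # estimates from the coefficient table
se  <- c(intercept = 1.7318756,  hgt = 0.0101042)   # standard errors from the same table
df_resid <- 505                                     # residual degrees of freedom (n - 2)
t_crit <- qt(0.975, df = df_resid)                  # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 3)
```

The interval for the intercept contains zero while the interval for hgt does not, which matches the conclusions drawn from the confint() output.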

The slope of the regression line was reported as $b=0.183$. This shows that a one-unit increase in $x$ (1 cm of height) is associated with an average increase of 0.183 in $y$ (cm of chest diameter).
The relationship is positive, and to test it the following hypotheses are used:
- Null hypothesis($H_0$): slope($\beta$) equals 0, $\beta = 0$
- Alternate hypothesis ($H_1$): slope($\beta$) not equal to 0, $\beta \neq 0$
For this a $t$ test is performed, with the $t$ value calculated from the estimated slope $b$, the hypothesised slope $\beta_0$, and the standard error of the slope $SE(b)$:[2] \[t = \frac{b - \beta_0}{SE(b)}\]
The reported p-value is less than the significance level, so we reject our null hypothesis($H_0$).
The 95% confidence interval is $CI[0.163, 0.203]$ and clearly does not capture our null hypothesis($H_0$).
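
The slope test can be worked through by hand from the reported estimate and standard error. This is a minimal sketch with a hypothesised slope of zero; the variable names are illustrative.

```r
b        <- 0.1827027   # estimated slope from the coefficient table
se_b     <- 0.0101042   # standard error of the slope
beta0    <- 0           # hypothesised slope under the null hypothesis
df_resid <- 505         # residual degrees of freedom (n - 2)
t_stat <- (b - beta0) / se_b                                       # t statistic
p_val  <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)   # two-sided p-value
round(t_stat, 2)    # 18.08
p_val < 0.001       # TRUE
```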

Discussion

Initially there were 25 variables, but only two were selected as the variables of interest.
We used 100% of our data, as there were no missing values, extreme values, or outliers.
Even though the median was slightly less than the mean, which suggests some right skewness, the data were found to be approximately normal with a sample size of 507.
We tested the four assumptions required for applying the simple linear regression model: (1) Independence (2) Linearity (3) Normality of residuals (4) Homoscedasticity
Our data meet all four assumptions: the residuals are homoscedastic and follow a normal distribution; in the scale-location graph the red line is close to flat and the variance in the square root of the standardised residuals is consistent across the predicted values; and all the values fall within the Cook’s distance bands.

plot(LinearModel)

A positive correlation of 0.63 was found between the two variables, with $CI[0.571, 0.677]$.
The $R^2$ value suggests that a person’s height explains 39.3% of the variability in a person’s chest size.
The value of the intercept is $a=-3.295$ with $CI[-6.697, 0.108]$.
Similarly, the value of the slope is $b=0.183$ with $CI[0.163, 0.203]$.

References

[1] “Exploring Relationships in Body Dimensions”, Journal of Statistics Education, [Online]. Available: https://ww2.amstat.org/publications/jse/v11n2/datasets.heinz.html [Accessed: 24-May-2020].

[2] “Testing Two-Sided Hypotheses Concerning the Slope Coefficient”, Econometrics with R, [Online]. Available: https://www.econometrics-with-r.org/5-1-testing-two-sided-hypotheses-concerning-the-slope-coefficient.html [Accessed: 24-May-2020].

MATH1324 Assignment 3

Test to see if there is any statistical significant relationship between a person’s chest diameter (che.di) and height (hgt)
