Loading the required libraries

library(data.table)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(readxl)

Loading the dataset from Excel

SAGE <- read_excel("SAGE.xlsx")

Getting overview of the data

format(glimpse(SAGE))
## Rows: 2,356
## Columns: 10
## $ ID          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ centre      <chr> "China", "China", "China", "China", "China", "China", "Chi…
## $ gender      <dbl> 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1…
## $ age         <dbl> 67, 53, 60, 50, 57, 55, 65, 73, 61, 77, 57, 77, 57, 58, 60…
## $ ideation    <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
## $ attempt     <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ health      <dbl> 1, 2, 2, 1, 3, 2, 3, 1, 1, 1, 1, 1, 3, 2, 2, 1, 3, 3, 2, 1…
## $ sleeplength <dbl> 22.0, 8.0, 7.0, 9.0, 10.0, 5.0, 8.0, 7.0, 7.5, 7.0, 8.0, 9…
## $ sleepqual   <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ married     <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2…
##  [1] "# A tibble: 2,356 × 10"
##  [2] "      ID centre gender   age ideation attempt health sleeplength sleep…¹ married"
##  [3] "   <dbl> <chr>   <dbl> <dbl>    <dbl>   <dbl>  <dbl>       <dbl>   <dbl>   <dbl>"
##  [4] " 1     1 China       1    67        0       0      1        22         0       2"
##  [5] " 2     2 China       2    53        0       0      2         8         0       2"
##  [6] " 3     3 China       1    60        1       1      2         7         0       1"
##  [7] " 4     4 China       2    50        0       0      1         9         0       2"
##  [8] " 5     5 China       1    57        0       0      3        10         0       2"
##  [9] " 6     6 China       2    55        1       0      2         5         1       2"
## [10] " 7     7 China       2    65        0       0      3         8         0       2"
## [11] " 8     8 China       1    73        0       0      1         7         0       2"
## [12] " 9     9 China       2    61        1       0      1         7.5       1       2"
## [13] "10    10 China       2    77        0       0      1         7         0       1"
## [14] "# … with 2,346 more rows, and abbreviated variable name ¹​sleepqual"
format(head(SAGE))
##  [1] "# A tibble: 6 × 10"
##  [2] "     ID centre gender   age ideation attempt health sleeplength sleepq…¹ married"
##  [3] "  <dbl> <chr>   <dbl> <dbl>    <dbl>   <dbl>  <dbl>       <dbl>    <dbl>   <dbl>"
##  [4] "1     1 China       1    67        0       0      1          22        0       2"
##  [5] "2     2 China       2    53        0       0      2           8        0       2"
##  [6] "3     3 China       1    60        1       1      2           7        0       1"
##  [7] "4     4 China       2    50        0       0      1           9        0       2"
##  [8] "5     5 China       1    57        0       0      3          10        0       2"
##  [9] "6     6 China       2    55        1       0      2           5        1       2"
## [10] "# … with abbreviated variable name ¹​sleepqual"
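
The overview above already hints at two data-quality issues worth checking before any modelling: the first row reports a sleep length of 22, and the regression later in this report drops 83 observations because of missing values. A minimal sketch of such a check is shown here. The `toy` data frame and the 2 to 16 hour plausibility range are assumptions made for illustration; the same calls could be applied to `SAGE`.

```r
# Hypothetical toy data standing in for SAGE (values are made up).
toy <- data.frame(
  age         = c(67, 53, 60, NA, 57, 55),
  sleeplength = c(22, 8, 7, 9, NA, 5)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows whose sleep length falls outside an assumed plausible range (2 to 16).
# which() ignores NA comparisons, so missing values are not flagged here.
implausible <- which(toy$sleeplength < 2 | toy$sleeplength > 16)
print(toy[implausible, ])
```

Whether a flagged value is an error or a genuine observation is a judgement call that depends on how the variable was recorded.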

The data is in a tidy format.

# Correlation Analysis

Now we can start with some basic analysis. First, we will make a pair plot to analyse the correlations between the variables.

# install.packages("GGally")
library(GGally)

ggpairs(SAGE,                 # data frame to plot
        columns = 1:4,        # ID, centre, gender, age
        aes(color = centre,   # colour by centre
            alpha = 0.5))     # semi-transparent layers

A pair plot is a convenient way to explore a dataset and spot correlations between variables. In the plot above, the correlation coefficients between the numeric variables are very low or negative for all 5 centres. The boxplots of age by centre show that average age is highest for South Africa and lowest for the Russian Federation, and that the age variable has some outliers. The density plots show that age is right skewed. The asterisks next to the correlation coefficients mark correlations that are statistically significant within individual centres. The data from India and the Russian Federation differ significantly from each other, which can also be read from the bar plot, where the values for the Russian centre are higher than those for the Indian centre.
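
The visual comparison of age across centres can be backed up with numbers. The sketch below computes the mean and median age per group with base R. The `toy` data frame and its centre labels are hypothetical; with the real data, `SAGE$age` and `SAGE$centre` would take their place (with `na.rm = TRUE` if age has missing values).

```r
# Hypothetical toy data; centre labels and ages are made up for illustration.
toy <- data.frame(
  centre = c("A", "A", "A", "B", "B", "B"),
  age    = c(50, 60, 70, 55, 65, 90)
)

# Mean and median age per centre.
mean_age   <- tapply(toy$age, toy$centre, mean)
median_age <- tapply(toy$age, toy$centre, median)
print(rbind(mean = mean_age, median = median_age))

# A mean clearly above the median is a simple hint of right skew.
print(mean_age - median_age)
```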

Following the requirements of the question, we choose age and sleep length as the variables whose correlation we analyse.

ggpairs(SAGE,                                # data frame to plot
        columns = c("age", "sleeplength"),   # the two variables of interest
        aes(color = centre,                  # colour by centre
            alpha = 0.5))                    # semi-transparent layers

The graph above shows that age and sleep length are not strongly correlated in any centre. The scatter plot shows a flat trend with no clear positive correlation. Moreover, the density plot shows that age is right skewed. The density plot for sleep length looks approximately normal for 4 countries, the exception being the Russian Federation. The overall correlation coefficient is 0.051, which is very small and indicates that shorter sleep length cannot be linked to the age of respondents in the 5 centres. Normality can be checked formally with the following test.
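
The correlation coefficient reported in the plot can also be computed and tested directly. The sketch below uses `cor.test()` from base R on a hypothetical `toy` data frame (simulated values, not the SAGE data); replacing `toy` with `SAGE` should give the overall coefficient discussed above.

```r
set.seed(1)
# Hypothetical toy data: two unrelated variables, simulated for illustration.
toy <- data.frame(
  age         = round(runif(200, 50, 90)),
  sleeplength = round(rnorm(200, mean = 7.5, sd = 1.5), 1)
)
toy$sleeplength[c(3, 17)] <- NA  # mimic missing values

# Pearson correlation with a test of H0: correlation = 0.
# cor.test() uses only the complete pairs.
ct <- cor.test(toy$age, toy$sleeplength)
print(ct$estimate)   # sample correlation r
print(ct$conf.int)   # 95% confidence interval for r
print(ct$p.value)

# r^2 equals the R-squared of a simple linear regression on the same pairs.
r_squared <- unname(ct$estimate)^2
print(r_squared)
```

The last step is a useful cross-check: squaring a correlation of about 0.051 gives roughly 0.0026, the order of magnitude one would expect for the R-squared of a simple regression between the same two variables.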

Before conducting the regression analysis, we can check the normality of our two columns with the Shapiro-Wilk test.

shapiro.test(SAGE$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  SAGE$age
## W = 0.95448, p-value < 2.2e-16

For the age column the p-value is below the chosen significance level of 0.05. Since normality is only retained when p > 0.05, we reject the null hypothesis of normality and conclude that age is not normally distributed. Now we can check the sleep length variable.

shapiro.test(SAGE$sleeplength)
## 
##  Shapiro-Wilk normality test
## 
## data:  SAGE$sleeplength
## W = 0.93587, p-value < 2.2e-16

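At the 5% significance level, the Shapiro-Wilk test shows that sleep length is not normally distributed either, since p < \(\alpha\). With a sample this large the test is sensitive to even small departures from normality, so this result is compatible with a density plot that looks roughly bell shaped.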
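
Because both Shapiro-Wilk tests return p-values below 2.2e-16, a rank-based correlation is one option we could consider alongside the Pearson coefficient, since it does not rely on normality. The sketch below is an illustration on simulated values (`x` and `y` are hypothetical stand-ins for age and sleep length), not a result from the SAGE data.

```r
set.seed(2)
# Hypothetical toy data, simulated for illustration.
x <- rexp(300, rate = 0.1) + 50          # right-skewed, age-like values
y <- rnorm(300, mean = 7.5, sd = 1.5)    # sleep-length-like values

# Spearman's rho is computed on ranks, so it does not assume normality.
# exact = FALSE requests the asymptotic p-value, which also avoids
# warnings when the data contain ties.
sp <- cor.test(x, y, method = "spearman", exact = FALSE)
print(sp$estimate)
print(sp$p.value)
```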

We can further analyse our two variables with a linear regression model.

model <- lm(SAGE$age~SAGE$sleeplength)

summary(model)
## 
## Call:
## lm(formula = SAGE$age ~ SAGE$sleeplength)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.711  -8.332  -1.580   6.546  41.919 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       62.0756     0.7538  82.350   <2e-16 ***
## SAGE$sleeplength   0.2507     0.1022   2.453   0.0142 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.793 on 2271 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.002643,   Adjusted R-squared:  0.002204 
## F-statistic: 6.018 on 1 and 2271 DF,  p-value: 0.01424

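The linear regression output gives the fitted equation

\[\widehat{age} = 62.08 + 0.2507 \times sleeplength\]

where 0.2507 is the slope estimate and 62.08 is the intercept. Note that the model as fitted has age as the response and sleep length as the predictor. Each additional unit of sleep length (presumably one hour) is associated with an expected increase of about 0.25 years in age, and 62.08 is the expected age when sleep length is 0. The slope is statistically significant (p = 0.0142), but the R-squared of 0.0026 shows that sleep length explains well under 1% of the variation in age.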
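
The call `lm(SAGE$age ~ SAGE$sleeplength)` treats age as the response, so its slope describes how expected age changes with sleep length. If the aim is to predict sleep length from age, the formula has to be reversed, and the slope is then a different number. The sketch below illustrates this on a hypothetical `toy` data frame (simulated values); it also uses the `data =` argument, which keeps coefficient names short.

```r
set.seed(3)
# Hypothetical toy data, simulated for illustration.
toy <- data.frame(age = round(runif(150, 50, 90)))
toy$sleeplength <- 7 + 0.01 * toy$age + rnorm(150, sd = 1.5)

# Age as the response (the direction used in the model above).
fit_age   <- lm(age ~ sleeplength, data = toy)
# Sleep length as the response (the direction needed to predict sleep from age).
fit_sleep <- lm(sleeplength ~ age, data = toy)

b_age   <- coef(fit_age)[["sleeplength"]]
b_sleep <- coef(fit_sleep)[["age"]]

# The two slopes are different quantities; their product equals r^2.
print(c(b_age = b_age, b_sleep = b_sleep))
print(b_age * b_sleep)
print(cor(toy$age, toy$sleeplength)^2)
```

Both directions share the same R-squared and the same p-value for the slope, so the conclusion about the strength of the relationship does not depend on the direction, only the interpretation of the coefficients does.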

The fitted equation can also be extracted directly from the model object:

library(equatiomatic)
extract_eq(model, use_coefs = TRUE)

\[ \operatorname{\widehat{SAGE\$age}} = 62.08 + 0.25(\operatorname{SAGE\$sleeplength}) \]

confint(model)
##                        2.5 %    97.5 %
## (Intercept)      60.59740503 63.553828
## SAGE$sleeplength  0.05029108  0.451088

The 95% confidence interval for the slope, 0.050 to 0.451, does not contain 0, which agrees with the significant p-value of 0.0142 in the model summary.
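
To see where these limits come from, the interval for the slope can be rebuilt by hand from the numbers printed in the model summary: the estimate plus or minus a t quantile times the standard error. The sketch below uses the rounded values from the summary output, so it should reproduce the `confint()` limits only up to rounding.

```r
# Values copied from the summary(model) output above.
estimate  <- 0.2507   # slope for sleep length
std_error <- 0.1022   # its standard error
df_resid  <- 2271     # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(round(ci, 3))
```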

Now we will produce the residual plot and other diagnostic plots for our model.

#define residuals
res <- resid(model)

#produce residual vs. fitted plot
plot(fitted(model), res)

#add a horizontal line at 0 
abline(0,0)

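The residuals are not symmetrically distributed around 0, with more large positive than large negative residuals, but their spread does not change systematically with the fitted values, so homoscedasticity does not appear to be violated.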
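
Reading homoscedasticity off a plot is subjective, so a numerical check can complement it. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic with base R only, by regressing the squared residuals on the predictor. It is an illustration on a hypothetical `toy` data frame with simulated values, not a result for the SAGE model.

```r
set.seed(4)
# Hypothetical toy data, simulated for illustration.
toy <- data.frame(sleeplength = rnorm(200, mean = 7.5, sd = 1.5))
toy$age <- 62 + 0.25 * toy$sleeplength + rnorm(200, sd = 9)

fit <- lm(age ~ sleeplength, data = toy)

# Auxiliary regression: squared residuals on the predictor.
aux <- lm(resid(fit)^2 ~ sleeplength, data = toy)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with 1 degree of freedom
# (one predictor).
bp_stat <- nobs(fit) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p.value = bp_p))
```

A small p-value would point towards non-constant residual variance; a large one gives no evidence against homoscedasticity.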

#create Q-Q plot for residuals
qqnorm(res)

#add a straight diagonal line to the plot
qqline(res) 

The Q-Q plot shows that the residuals are not normally distributed, since the points at both ends do not follow the straight line.