Exercise D5_TDTU Apr 2023

Ngày 5: Mô hình tăng trưởng tuyến tính và mô hình đa tầng

Việc 1: Bạn muốn đánh giá xem học sinh có tiến bộ theo thời gian bằng dữ liệu willett.csv

wil= read.csv ("D:\\OneDrive\\Statistical courses\\Ton Duc Thang Uni_April2023\\Datasets\\willett.csv")

1.1. Vẽ biểu đồ bánh tằm mô tả tiến bộ của học sinh qua thời gian

library (ggplot2)
ggplot(data=wil, aes(x=time, y=y, group=id, col=id)) + geom_line() + theme(legend.position="none")

1.2. Thực hiện mô hình tuyến tính y = alpha + beta*time

summary(lm(y ~ time, data=wil))

## 
## Call:
## lm(formula = y ~ time, data = wil)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.374 -25.584   1.186  28.926  64.746 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  164.374      5.035   32.65   <2e-16 ***
## time          26.960      2.691   10.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.6 on 138 degrees of freedom
## Multiple R-squared:  0.421,  Adjusted R-squared:  0.4168 
## F-statistic: 100.3 on 1 and 138 DF,  p-value: < 2.2e-16

Diễn giải ý nghĩa mô hình
Bạn nhận xét như thế nào về kết quả mô hình tuyến tính này?

1.3 Thực hiện mô hình tuyến tính y = alpha + beta*time cho từng học sinh

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

wil %>% group_by(id) %>% mutate(beta = lm(y ~ time)$coefficients[2]) %>% mutate(alpha = lm(y ~ time)$coefficients[1]) %>% summarize(mbeta=mean(beta), malpha=mean(alpha))

## # A tibble: 35 × 3
##       id mbeta malpha
##    <int> <dbl>  <dbl>
##  1     1  34.2  197. 
##  2     2  28.5  218  
##  3     3  47.9  151. 
##  4     4  21.9  206. 
##  5     5   9.9  200. 
##  6     6  29.6  166. 
##  7     7  32.6  160. 
##  8     8  37.9   92.9
##  9     9  22.7  139. 
## 10    10  26.8  220. 
## # … with 25 more rows

Diễn giải ý nghĩa mô hình
Bạn nhận xét như thế nào về kết quả mô hình tuyến tính này?

1.4 Thực hiện mô hình ảnh hưởng hỗn hợp Y_ij = [b₀₀ + b₁₀*time_ij] + [u_0j + u_1j*time_ij + r_ij]

library (lme4)

## Loading required package: Matrix

library(lmerTest)

## 
## Attaching package: 'lmerTest'

## The following object is masked from 'package:lme4':
## 
##     lmer

## The following object is masked from 'package:stats':
## 
##     step

model1 = lmer (y ~ time + (1 + time|id), data= wil)
summary(model1)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: y ~ time + (1 + time | id)
##    Data: wil
## 
## REML criterion at convergence: 1266.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27205 -0.60035  0.02632  0.60730  1.58439 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  id       (Intercept) 1198.8   34.62         
##           time         132.4   11.51    -0.45
##  Residual              159.5   12.63         
## Number of obs: 140, groups:  id, 35
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  164.374      6.119  34.000   26.86  < 2e-16 ***
## time          26.960      2.167  34.000   12.44 3.27e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## time -0.489

Viết công thức tính điểm cho học sinh i tại thời điểm j (Y_ij) bằng các ước số của mô hình
Diễn giải ý nghĩa của mô hình

1.5. Bạn nhận xét gì về kết quả từ 3 mô hình này?

Việc 2: Bạn muốn đánh giá xem sự tiến bộ của học sinh theo thời gian có bị ảnh hưởng bởi điểm của năm học trước (biến số định lượng covar) không?
2.1 Thực hiện mô hình tuyến tính y = alpha + beta₀Xtime + beta₁Xcovar

wil$ccovar = wil$covar - mean(wil$covar)
model2 = lm(y ~ time + ccovar, data = wil)
summary(model2)

## 
## Call:
## lm(formula = y ~ time + ccovar, data = wil)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.522 -24.591   3.165  28.033  70.018 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 164.3743     4.9646  33.109   <2e-16 ***
## time         26.9600     2.6537  10.159   <2e-16 ***
## ccovar        0.5357     0.2410   2.223   0.0278 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.11 on 137 degrees of freedom
## Multiple R-squared:  0.4412, Adjusted R-squared:  0.433 
## F-statistic: 54.08 on 2 and 137 DF,  p-value: < 2.2e-16

Diễn giải ý nghĩa của mô hình
Bạn nhận xét gì? Cụ thể kết quả mô hình có thể trả lời được câu hỏi nghiên cứu không? Tại sao.

2.2. Thực hiện mô hình tuyến tính y = alpha + beta₀ x time + beta₁ x covar x time

model2 = lm(y ~ time + ccovar + time:ccovar, data = wil)
summary(model2)

## 
## Call:
## lm(formula = y ~ time + ccovar + time:ccovar, data = wil)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.767 -26.040   3.676  26.224  73.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 164.3743     4.9089  33.485   <2e-16 ***
## time         26.9600     2.6239  10.275   <2e-16 ***
## ccovar       -0.1136     0.3987  -0.285   0.7762    
## time:ccovar   0.4329     0.2131   2.031   0.0442 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.71 on 136 degrees of freedom
## Multiple R-squared:  0.4576, Adjusted R-squared:  0.4457 
## F-statistic: 38.25 on 3 and 136 DF,  p-value: < 2.2e-16

Diễn giải ý nghĩa của mô hình. Giải thích điểm khác nhau giữa mô hình 2.1 và 2.2
Bạn nhận xét gì? Cụ thể kết quả mô hình có thể trả lời được câu hỏi nghiên cứu không? Tại sao?

2.3 Thực hiện mô hình hiệu quả hỗn hợp Y_ij = [b₀₀ + b₁₀xtime_ij + b₀₁xcovar_j + b₁₁xtime_ijxcovar_j] + [u_0i + u_1ixtime_ij + r_ij]

model3 = lmer(y ~ time + ccovar + time:ccovar + ( 1 + time | id), data=wil)
summary(model3)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: y ~ time + ccovar + time:ccovar + (1 + time | id)
##    Data: wil
## 
## REML criterion at convergence: 1260.3
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.24817 -0.61873  0.00428  0.61472  1.55688 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  id       (Intercept) 1236.4   35.16         
##           time         107.2   10.36    -0.49
##  Residual              159.5   12.63         
## Number of obs: 140, groups:  id, 35
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 164.3743     6.2061  33.0000  26.486  < 2e-16 ***
## time         26.9600     1.9939  33.0000  13.521  5.2e-15 ***
## ccovar       -0.1136     0.5040  33.0000  -0.225   0.8231    
## time:ccovar   0.4329     0.1619  33.0000   2.673   0.0116 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) time   ccovar
## time        -0.522              
## ccovar       0.000  0.000       
## time:ccovar  0.000  0.000 -0.522

Viết công thức tính điểm cho học sinh i tại thời điểm j (Y_ij) bằng các ước số của mô hình
Diễn giải ý nghĩa của mô hình

2.4 Bạn nhận xét gì về kết quả từ 3 mô hình này?

Việc 3: Sử dụng tập tin popular.csv để đánh giá xem điểm số của học sinh (biến số kết cục popular) có bị ảnh hưởng bởi ngoại cảnh (biến số tiên lượng extrov) và giới tính của học sinh (sex) không?

pop = read.csv("D:\\OneDrive\\Statistical courses\\Ton Duc Thang Uni_April2023\\Datasets\\popular.csv")

Đây là dữ liệu về điểm của 2000 học sinh từ 100 lớp học. Mỗi lớp có từ 16 đến 26 học sinh. Các biến số sử dụng trong bài tập này gồm:
pupil: Học sinh
class: Lớp học
sex: Giới tính học sinh (F, M)
extrov: Ảnh hưởng ngoại cảnh
texp: Kinh nghiệm của giáo viên
popular: Điểm (1-10)

3.1 Mô tả bằng biểu đồ: a. Phân bố của điểm học sinh

library (ggplot2)
ggplot(data=pop, aes(popular)) + geom_histogram(fill="blue", col="white")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Điểm học sinh nam có khác với học sinh nữ không?

ggplot(data=pop, aes(x=sex, y=popular, col=sex)) + geom_boxplot() + geom_jitter(alpha=0.1)

Ngoại cảnh ảnh hưởng như thế nào đến điểm học sinh? Có khác nhau giữa học sinh nam và nữ không?

p1= ggplot(data=pop, aes(x=factor(extrov), y=popular, col=factor(extrov))) + geom_boxplot() + geom_jitter(alpha=0.1) 
p2= ggplot(data=pop, aes(x=factor(extrov), y=popular, col=sex)) + geom_boxplot() + geom_jitter(alpha=0.1)
gridExtra::grid.arrange(p1, p2, nrow = 1)

3.2 Dữ liệu này có gì đặc biệt?

3.3 Điểm của học sinh giữa các lớp khác nhau như thế nào?

m1 = lmer(popular ~ 1 + (1 | class), data=pop)
summary(m1)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: popular ~ 1 + (1 | class)
##    Data: pop
## 
## REML criterion at convergence: 6330.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5655 -0.6975  0.0020  0.6758  3.3175 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.7021   0.8379  
##  Residual             1.2218   1.1053  
## Number of obs: 2000, groups:  class, 100
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  5.07786    0.08739 98.90973    58.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4 GIới tính của học sinh có ảnh hưởng như thế nào đến điểm số?
a. KHông khác biệt giữa các lớp

m2.1 = lmer(popular ~ 1 + Csex + (1 | class), data=pop)
summary(m2.1)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: popular ~ 1 + Csex + (1 | class)
##    Data: pop
## 
## REML criterion at convergence: 5564.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6890 -0.6415  0.0040  0.6788  3.1586 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.4839   0.6957  
##  Residual             0.8308   0.9115  
## Number of obs: 2000, groups:  class, 100
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept) 5.070e+00  7.252e-02 9.842e+01   69.91   <2e-16 ***
## Csex        1.350e+00  4.403e-02 1.947e+03   30.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## Csex -0.004

Có khác biệt giữa các lớp

m2.2 = lmer(popular ~ 1 + Csex + (1 + Csex | class), data=pop)
summary(m2.2)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: popular ~ 1 + Csex + (1 + Csex | class)
##    Data: pop
## 
## REML criterion at convergence: 5559.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5848 -0.6405  0.0062  0.6664  3.1420 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  class    (Intercept) 0.48299  0.6950        
##           Csex        0.05461  0.2337   -0.30
##  Residual             0.81896  0.9050        
## Number of obs: 2000, groups:  class, 100
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  5.07291    0.07254 96.94844   69.93   <2e-16 ***
## Csex         1.35218    0.05030 82.75135   26.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## Csex -0.144

3.5 Ngoại cảnh có ảnh hưởng như thế nào đến điểm số?
a. KHông khác biệt giữa các lớp

m3.1 = lmer(popular ~ 1 + Cextrav + (1 | class), data=pop)
summary(m3.1)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: popular ~ 1 + Cextrav + (1 | class)
##    Data: pop
## 
## REML criterion at convergence: 5832.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0644 -0.7267  0.0165  0.7088  3.3587 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.8406   0.9168  
##  Residual             0.9304   0.9646  
## Number of obs: 2000, groups:  class, 100
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept) 5.078e+00  9.421e-02 9.830e+01   53.90   <2e-16 ***
## Cextrav     4.863e-01  2.015e-02 1.965e+03   24.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##         (Intr)
## Cextrav 0.000

Có khác biệt giữa các lớp

m3.2 = lmer(popular ~ 1 + Cextrav + (1 + Cextrav | class), data=pop)
summary(m3.2)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: popular ~ 1 + Cextrav + (1 + Cextrav | class)
##    Data: pop
## 
## REML criterion at convergence: 5779.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1961 -0.7291  0.0146  0.6816  3.2217 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  class    (Intercept) 0.89178  0.9443        
##           Cextrav     0.02599  0.1612   -0.88
##  Residual             0.89492  0.9460        
## Number of obs: 2000, groups:  class, 100
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  5.03127    0.09702 97.07723   51.86   <2e-16 ***
## Cextrav      0.49286    0.02546 89.69832   19.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##         (Intr)
## Cextrav -0.552

3.6 Ngoại cảnh và giới tính của học sinh ảnh hưởng như thế nào đến điểm số?

m4 = lmer(popular ~ 1 + Cextrav + Csex + (1 + Cextrav + Csex | class), data = pop)

## boundary (singular) fit: see help('isSingular')

summary(m4)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: popular ~ 1 + Cextrav + Csex + (1 + Cextrav + Csex | class)
##    Data: pop
## 
## REML criterion at convergence: 4870.5
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.01902 -0.64955 -0.01055  0.67099  3.11761 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr       
##  class    (Intercept) 0.673941 0.82094             
##           Cextrav     0.029847 0.17276  -0.73      
##           Csex        0.005365 0.07325  -0.65 -0.03
##  Residual             0.552883 0.74356             
## Number of obs: 2000, groups:  class, 100
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)   5.02051    0.08411  95.86769   59.69   <2e-16 ***
## Cextrav       0.44300    0.02343  91.03920   18.91   <2e-16 ***
## Csex          1.24483    0.03728 504.13792   33.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##         (Intr) Cextrv
## Cextrav -0.533       
## Csex    -0.127 -0.065
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

3.7 Bạn có nhận xét gì về kết quả này?

Việc 4: Bạn hãy ghi lại tất cả những hàm/ lệnh trên trong RMarkdown và share trên mang rpubs.com/taikhoancuaban

Exercise D5_TDTU Apr 2023

Ha Tan Duc

2023-04-06