Read the data:

## Warning: package 'readxl' was built under R version 4.5.1

## # A tibble: 6 × 13
##   `Student ID`   Age Gender Height Weight `Blood Type`   BMI Temperature
##          <dbl> <dbl> <chr>   <dbl>  <dbl> <chr>        <dbl>       <dbl>
## 1            1    18 Female   162.   72.4 O             27.6        NA  
## 2            2    NA Male     152.   47.6 B             NA          98.7
## 3            3    32 Female   183.   55.7 A             16.7        98.3
## 4           NA    30 Male     182.   63.3 B             19.1        98.8
## 5            5    23 Female    NA    46.2 O             NA          98.5
## 6            6    32 <NA>     151.   68.6 B             29.9        99.7
## # ℹ 5 more variables: `Heart Rate` <dbl>, `Blood Pressure` <dbl>,
## #   Cholesterol <dbl>, Diabetes <chr>, Smoking <chr>

1. Basic information about the data:

1.1. Number of observations and variables:

dim(dataset)

## [1] 100000     13

1.2. Variable names:

names(dataset)

##  [1] "Student ID"     "Age"            "Gender"         "Height"        
##  [5] "Weight"         "Blood Type"     "BMI"            "Temperature"   
##  [9] "Heart Rate"     "Blood Pressure" "Cholesterol"    "Diabetes"      
## [13] "Smoking"

1.3. Check for duplicated observations:

sum(duplicated(dataset))

## [1] 0
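
Note that `duplicated()` applied to a whole data frame only flags rows that repeat in every column, so a result of 0 does not by itself show that each `Student ID` is unique. The following is a minimal sketch on made-up data (the `toy` data frame is hypothetical, not the course dataset) of checking the identifier column separately while ignoring missing IDs:

```r
# Hypothetical toy data: ID 2 appears twice with different ages
toy <- data.frame(id = c(1, 2, 2, NA, NA), age = c(20, 21, 22, 23, 24))

# Whole-row check: no row repeats in every column
full_row_dups <- sum(duplicated(toy))

# ID-only check: drop missing IDs first, then count repeated values
ids <- toy$id[!is.na(toy$id)]
id_dups <- sum(duplicated(ids))

full_row_dups  # 0
id_dups        # 1
```

Whether missing IDs should count as duplicates of each other is a design choice; this sketch excludes them.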

1.4. Number of missing values per variable:

colSums(is.na(dataset))

##     Student ID            Age         Gender         Height         Weight 
##          10007           9928           9917          10066           9958 
##     Blood Type            BMI    Temperature     Heart Rate Blood Pressure 
##          10008          10048          10015           9891          10072 
##    Cholesterol       Diabetes        Smoking 
##           9975          10116          10124
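
The counts above are easier to compare as shares: each column has roughly 10,000 missing values out of 100,000 rows, or about 10%. A minimal sketch on a made-up data frame (`toy` is hypothetical) of getting the proportion missing per column with `colMeans(is.na(...))`:

```r
# Hypothetical toy data with a few missing entries
toy <- data.frame(
  age    = c(18, NA, 32, 30),
  gender = c("Female", "Male", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; the column mean of TRUE/FALSE is the share missing
missing_share <- colMeans(is.na(toy))
round(100 * missing_share, 1)  # age 25, gender 50 (percent)
```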

1.5. Meaning of the variables in the dataset:

# Describe each column; the names match names(dataset) exactly
variable_meaning <- data.frame(
  Variable = c("Student ID", "Age", "Gender", "Height", "Weight",
               "Blood Type", "BMI", "Temperature", "Heart Rate",
               "Blood Pressure", "Cholesterol", "Diabetes", "Smoking"),
  Meaning = c("Student/patient ID", "Age (years)", "Gender (Male/Female)",
              "Height (cm)", "Weight (kg)", "Blood type (A, B, AB, O)",
              "Body mass index (kg/m^2)", "Body temperature (°F)",
              "Heart rate (bpm)", "Blood pressure (mmHg)",
              "Cholesterol (mg/dL)", "Diabetes status (Yes/No or 1/0)",
              "Smoking status (Yes/No or smoking history)"),
  stringsAsFactors = FALSE
)
library(knitr)

## Warning: package 'knitr' was built under R version 4.5.1

kable(variable_meaning, booktabs = TRUE)

Variable	Meaning
Student ID	Student/patient ID
Age	Age (years)
Gender	Gender (Male/Female)
Height	Height (cm)
Weight	Weight (kg)
Blood Type	Blood type (A, B, AB, O)
BMI	Body mass index (kg/m^2)
Temperature	Body temperature (°F)
Heart Rate	Heart rate (bpm)
Blood Pressure	Blood pressure (mmHg)
Cholesterol	Cholesterol (mg/dL)
Diabetes	Diabetes status (Yes/No or 1/0)
Smoking	Smoking status (Yes/No or smoking history)
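
Two of these descriptions can be sanity-checked against the first rows of the data. This is a hedged sketch, assuming `BMI` is weight in kg divided by squared height in m, and that `Temperature` is recorded in degrees Fahrenheit (the printed values, roughly 96 to 101, fit that scale rather than Celsius). The helper names `bmi_from()` and `fahrenheit_to_celsius()` are made up for illustration:

```r
# Hypothetical helpers, not part of the original analysis
bmi_from <- function(weight_kg, height_cm) weight_kg / (height_cm / 100)^2
fahrenheit_to_celsius <- function(temp_f) (temp_f - 32) * 5 / 9

# Row 1 of the printed data: height about 162 cm, weight 72.4 kg, BMI 27.6.
# The height is rounded in the printout, so expect only approximate agreement.
bmi_check <- bmi_from(72.4, 162)
bmi_check

# 98.6 degrees F, the median temperature in the data, is 37 degrees C
temp_check <- fahrenheit_to_celsius(98.6)
temp_check
```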

1.6. Handle missing values:

# Keep only the rows that have no missing value in any column
dataset_complete <- dataset[complete.cases(dataset), ]
colSums(is.na(dataset_complete))

##     Student ID            Age         Gender         Height         Weight 
##              0              0              0              0              0 
##     Blood Type            BMI    Temperature     Heart Rate Blood Pressure 
##              0              0              0              0              0 
##    Cholesterol       Diabetes        Smoking 
##              0              0              0
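
Dropping every row with any missing value is costly here: with about 10% missing in each of 13 columns, only about a quarter of the rows survive, which agrees with the 25,376 rows reported by `summary()` in section 1.8. The sketch below reproduces that figure from the per-column counts in section 1.4, under the assumption (not verified in this report) that values go missing independently across columns:

```r
n_rows <- 100000

# Missing counts per column, copied from the colSums(is.na(dataset)) output
missing_counts <- c(10007, 9928, 9917, 10066, 9958, 10008, 10048,
                    10015, 9891, 10072, 9975, 10116, 10124)

# Under independence, a row is complete with probability prod(1 - share missing)
p_complete <- prod(1 - missing_counts / n_rows)
expected_complete <- n_rows * p_complete

round(p_complete, 3)      # roughly a quarter
round(expected_complete)  # close to the 25,376 rows actually kept
```

If keeping more rows matters, restricting `complete.cases()` to the columns an analysis actually uses is a common alternative.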

1.7. Classify quantitative and qualitative variables:

sum(sapply(dataset_complete, is.numeric))

## [1] 9

sum(sapply(dataset_complete, function(x) is.factor(x) || is.character(x)))

## [1] 4

names(dataset_complete)[sapply(dataset_complete, is.numeric)]

## [1] "Student ID"     "Age"            "Height"         "Weight"        
## [5] "BMI"            "Temperature"    "Heart Rate"     "Blood Pressure"
## [9] "Cholesterol"
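
The numeric count includes `Student ID`, which is an identifier rather than a measurement, and the four character columns are categorical. A minimal sketch on a made-up data frame (`toy` and `id_col` are hypothetical names) of listing the measured numeric variables without the identifier and converting character columns to factors:

```r
# Hypothetical toy data mirroring three of the dataset's columns
toy <- data.frame(
  `Student ID` = c(1, 2, 3),
  Age          = c(18, 25, 32),
  Gender       = c("Female", "Male", "Female"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Numeric columns, excluding the identifier
id_col <- "Student ID"
is_num <- vapply(toy, is.numeric, logical(1))
measured <- setdiff(names(toy)[is_num], id_col)

# Convert character columns to factors so they are treated as categories
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

measured            # "Age"
levels(toy$Gender)  # "Female" "Male"
```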

1.8. Basic descriptive statistics:

summary(dataset_complete)

##    Student ID         Age           Gender              Height     
##  Min.   :    3   Min.   :18.00   Length:25376       Min.   :150.0  
##  1st Qu.:24733   1st Qu.:22.00   Class :character   1st Qu.:162.6  
##  Median :49777   Median :26.00   Mode  :character   Median :174.8  
##  Mean   :49874   Mean   :26.05                      Mean   :174.9  
##  3rd Qu.:74796   3rd Qu.:30.00                      3rd Qu.:187.5  
##  Max.   :99989   Max.   :34.00                      Max.   :200.0  
##      Weight        Blood Type             BMI         Temperature    
##  Min.   : 40.00   Length:25376       Min.   :10.07   Min.   : 96.40  
##  1st Qu.: 54.97   Class :character   1st Qu.:17.85   1st Qu.: 98.26  
##  Median : 70.09   Mode  :character   Median :22.73   Median : 98.60  
##  Mean   : 70.00                      Mean   :23.36   Mean   : 98.60  
##  3rd Qu.: 85.09                      3rd Qu.:28.04   3rd Qu.: 98.94  
##  Max.   :100.00                      Max.   :44.19   Max.   :100.77  
##    Heart Rate    Blood Pressure   Cholesterol      Diabetes        
##  Min.   :60.00   Min.   : 90.0   Min.   :120.0   Length:25376      
##  1st Qu.:70.00   1st Qu.:102.0   1st Qu.:152.0   Class :character  
##  Median :79.00   Median :115.0   Median :184.0   Mode  :character  
##  Mean   :79.44   Mean   :114.6   Mean   :184.5                     
##  3rd Qu.:89.00   3rd Qu.:127.0   3rd Qu.:218.0                     
##  Max.   :99.00   Max.   :139.0   Max.   :249.0                     
##    Smoking         
##  Length:25376      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

2. Subsetting and grouping the data:

2.1. Subset by blood type O:

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.5.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

bloodtype_O_data <- dataset_complete %>%
  filter(`Blood Type` == "O")
head(bloodtype_O_data)

## # A tibble: 6 × 13
##   `Student ID`   Age Gender Height Weight `Blood Type`   BMI Temperature
##          <dbl> <dbl> <chr>   <dbl>  <dbl> <chr>        <dbl>       <dbl>
## 1           36    21 Male     183.   61.5 O             18.3        99.3
## 2           87    34 Female   151.   90.6 O             39.8        97.6
## 3          123    29 Female   180.   69.6 O             21.4        98.7
## 4          129    29 Male     194.   71.7 O             19.1        99.3
## 5          143    24 Male     163.   48.7 O             18.3        98.6
## 6          144    26 Male     195.   44.7 O             11.7        98.1
## # ℹ 5 more variables: `Heart Rate` <dbl>, `Blood Pressure` <dbl>,
## #   Cholesterol <dbl>, Diabetes <chr>, Smoking <chr>

2.2. Subset by gender (female):

female_data <- dataset_complete %>%
  filter(Gender == "Female")
head(female_data)

## # A tibble: 6 × 13
##   `Student ID`   Age Gender Height Weight `Blood Type`   BMI Temperature
##          <dbl> <dbl> <chr>   <dbl>  <dbl> <chr>        <dbl>       <dbl>
## 1            3    32 Female   183.   55.7 A             16.7        98.3
## 2           12    34 Female   182.   76.4 AB            23.0        98.1
## 3           23    29 Female   180.   90.7 AB            28.0        98.8
## 4           25    27 Female   187.   81.2 AB            23.1        97.7
## 5           47    24 Female   156.   96.5 AB            39.7        98.4
## 6           61    26 Female   178.   89.5 B             28.1        99.1
## # ℹ 5 more variables: `Heart Rate` <dbl>, `Blood Pressure` <dbl>,
## #   Cholesterol <dbl>, Diabetes <chr>, Smoking <chr>

2.3. Subset of females with blood type O:

female_O_data <- dataset_complete %>%
  filter(Gender == "Female" & `Blood Type` == "O")
head(female_O_data)

## # A tibble: 6 × 13
##   `Student ID`   Age Gender Height Weight `Blood Type`   BMI Temperature
##          <dbl> <dbl> <chr>   <dbl>  <dbl> <chr>        <dbl>       <dbl>
## 1           87    34 Female   151.   90.6 O             39.8        97.6
## 2          123    29 Female   180.   69.6 O             21.4        98.7
## 3          280    29 Female   174.   53.8 O             17.8        99.9
## 4          285    29 Female   170.   69.4 O             24.0        98.5
## 5          305    31 Female   156.   83.3 O             34.2        98.4
## 6          326    22 Female   182.   88.6 O             26.8        99.3
## # ℹ 5 more variables: `Heart Rate` <dbl>, `Blood Pressure` <dbl>,
## #   Cholesterol <dbl>, Diabetes <chr>, Smoking <chr>
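
`head()` shows only the first six rows of each subset, not how large the subsets are. A minimal sketch on made-up data (`toy` is hypothetical) of counting every gender and blood type combination at once with base R's `table()`:

```r
# Hypothetical toy data with the same two column names as the dataset
toy <- data.frame(
  Gender       = c("Female", "Male", "Female", "Female", "Male"),
  `Blood Type` = c("O", "O", "A", "O", "B"),
  check.names = FALSE
)

# Cross-tabulate: rows are genders, columns are blood types
group_sizes <- table(toy$Gender, toy$`Blood Type`)
group_sizes

# Size of one cell, here females with blood type O
group_sizes["Female", "O"]  # 2
```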

2.4. Group by age band:

# right = FALSE gives the bins [0, 18), [18, 41), [41, Inf), so age 18 falls in "18-40"
dataset_complete <- dataset_complete %>%
  mutate(NhomTuoi = cut(Age, breaks = c(0, 18, 41, Inf),
                        labels = c("Under 18", "18-40", "Over 40"),
                        right = FALSE))
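
The boundary handling of `cut()` is easy to get wrong, because by default its intervals are closed on the right. A minimal sketch on four made-up ages (the vectors here are illustrative only) comparing the default with left-closed bins:

```r
ages <- c(17, 18, 40, 41)
band_labels <- c("Under 18", "18-40", "Over 40")

# Default (right = TRUE): bins are (0, 18], (18, 40], (40, Inf]
default_bins <- cut(ages, breaks = c(0, 18, 40, Inf), labels = band_labels)

# Left-closed: bins are [0, 18), [18, 41), [41, Inf)
closed_left <- cut(ages, breaks = c(0, 18, 41, Inf), labels = band_labels,
                   right = FALSE)

data.frame(ages, default_bins, closed_left)
# With the default, age 18 is labelled "Under 18";
# with right = FALSE and a break at 41, ages 18 and 40 are both "18-40".
```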

2.5. Subset of females with blood type O, aged 18-40:

female_O_18_40 <- dataset_complete %>%
  filter(Gender == "Female" & `Blood Type` == "O" & NhomTuoi == "18-40")
head(female_O_18_40)

## # A tibble: 6 × 14
##   `Student ID`   Age Gender Height Weight `Blood Type`   BMI Temperature
##          <dbl> <dbl> <chr>   <dbl>  <dbl> <chr>        <dbl>       <dbl>
## 1           87    34 Female   151.   90.6 O             39.8        97.6
## 2          123    29 Female   180.   69.6 O             21.4        98.7
## 3          280    29 Female   174.   53.8 O             17.8        99.9
## 4          285    29 Female   170.   69.4 O             24.0        98.5
## 5          305    31 Female   156.   83.3 O             34.2        98.4
## 6          326    22 Female   182.   88.6 O             26.8        99.3
## # ℹ 6 more variables: `Heart Rate` <dbl>, `Blood Pressure` <dbl>,
## #   Cholesterol <dbl>, Diabetes <chr>, Smoking <chr>, NhomTuoi <fct>

3. Analysis of the variables:

3.1. Analysis of the Age variable:

age_stats <- c(
  Min    = min(dataset_complete$Age, na.rm = TRUE),
  Max    = max(dataset_complete$Age, na.rm = TRUE),
  Mean   = mean(dataset_complete$Age, na.rm = TRUE),
  Median = median(dataset_complete$Age, na.rm = TRUE),
  SD     = sd(dataset_complete$Age, na.rm = TRUE),
  Var    = var(dataset_complete$Age, na.rm = TRUE)
)
age_stats

##       Min       Max      Mean    Median        SD       Var 
## 18.000000 34.000000 26.046501 26.000000  4.880116 23.815532

Minimum (Min): 18 - The youngest age in the data; the sample contains only individuals aged 18 or older.

Maximum (Max): 34 - The oldest age recorded in the sample.

Mean: 26.05 - The average age across the whole sample is about 26 years.

Median: 26 - Half of the individuals in the sample are aged 26 or younger.

Standard deviation (SD): 4.88 - Ages typically deviate from the mean by close to 5 years.

Variance (Var): 23.82 - The square of the standard deviation, in years squared; the spread of ages in the sample is moderate.
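
The relationship between the last two statistics can be checked directly from the reported values. A short sketch, using the numbers printed in `age_stats` above; the coefficient of variation is an extra measure added here for illustration and is not part of the original analysis:

```r
# Values copied from the age_stats output
age_mean <- 26.046501
age_sd   <- 4.880116
age_var  <- 23.815532

# The variance is the square of the standard deviation (up to print rounding)
sd_matches_var <- isTRUE(all.equal(age_sd^2, age_var, tolerance = 1e-5))
sd_matches_var  # TRUE

# Coefficient of variation: spread relative to the mean, as a percentage
cv_percent <- 100 * age_sd / age_mean
round(cv_percent, 1)  # about 18.7
```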

3.2. Analysis of the BMI variable:

bmi_stats <- c(
  Min    = min(dataset_complete$BMI, na.rm = TRUE),
  Max    = max(dataset_complete$BMI, na.rm = TRUE),
  Mean   = mean(dataset_complete$BMI, na.rm = TRUE),
  Median = median(dataset_complete$BMI, na.rm = TRUE),
  SD     = sd(dataset_complete$BMI, na.rm = TRUE),
  Var    = var(dataset_complete$BMI, na.rm = TRUE)
)
bmi_stats

##       Min       Max      Mean    Median        SD       Var 
## 10.074837 44.194021 23.355807 22.730699  7.059462 49.836003

Minimum (Min): 10.07 - The lowest BMI observed in the sample.

Median: 22.73 - Half of the sample has a BMI at or below this value.

Mean: 23.36 - The average BMI across the whole sample.

Maximum (Max): 44.19 - The highest BMI in the data.

Standard deviation (SD): 7.06 - The typical deviation of BMI from the mean.

Variance (Var): 49.84 - The square of the standard deviation; the overall variability of BMI in the sample.
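
The same six statistics are computed for Age and for BMI, so the repeated code can be wrapped in a small helper. This is a minimal sketch on made-up data; the function name `describe()` and the `toy` data frame are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: the six summary statistics for one numeric vector
describe <- function(x) {
  c(Min    = min(x, na.rm = TRUE),
    Max    = max(x, na.rm = TRUE),
    Mean   = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    SD     = sd(x, na.rm = TRUE),
    Var    = var(x, na.rm = TRUE))
}

# Made-up values, not taken from the dataset
toy <- data.frame(Age = c(18, 22, 26, 30, 34),
                  BMI = c(17.5, 21.0, 22.7, 28.0, 39.8))

# One column of statistics per variable
stats_table <- sapply(toy[c("Age", "BMI")], describe)
round(stats_table, 2)
```

With the real data, the same call on `dataset_complete[c("Age", "BMI")]` should return both sets of statistics side by side.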

Programming Languages in Data Analysis - M.Sc. Trần Mạnh Tường

Đặng Hoài Trinh & Trần Tấn Đức Tài

2025-10-13