In this assignment I analyze the dataset “Expanded_data_with_more_features”, which contains student scores in three subjects (Math, Reading, and Writing) together with background variables such as gender, ethnic group, parental education, and others. The variables are as follows:
Data Dictionary (column description)
# Import the data from the appropriate path
Expanded_data_with_more_features <- read.csv("C:/Users/Pongo/Downloads/Expanded_data_with_more_features.csv", sep=";")
# Show the structure of the data
str(Expanded_data_with_more_features)
## 'data.frame': 30641 obs. of 15 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Gender : chr "female" "female" "female" "male" ...
## $ EthnicGroup : chr "" "group C" "group B" "group A" ...
## $ ParentEduc : chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ LunchType : chr "standard" "standard" "standard" "free/reduced" ...
## $ TestPrep : chr "none" "" "none" "none" ...
## $ ParentMaritalStatus: chr "married" "married" "single" "married" ...
## $ PracticeSport : chr "regularly" "sometimes" "sometimes" "never" ...
## $ IsFirstChild : chr "yes" "yes" "yes" "no" ...
## $ NrSiblings : int 3 0 4 1 0 1 1 1 3 NA ...
## $ TransportMeans : chr "school_bus" "" "school_bus" "" ...
## $ WklyStudyHours : chr "< 5" "05-Oct" "< 5" "05-Oct" ...
## $ MathScore : int 71 69 87 45 76 73 85 41 65 37 ...
## $ ReadingScore : int 71 90 93 56 78 84 93 43 64 59 ...
## $ WritingScore : int 74 88 91 42 75 79 89 39 68 50 ...
# Show the first 6 rows of the data
head(Expanded_data_with_more_features)
## X Gender EthnicGroup ParentEduc LunchType TestPrep
## 1 0 female bachelor's degree standard none
## 2 1 female group C some college standard
## 3 2 female group B master's degree standard none
## 4 3 male group A associate's degree free/reduced none
## 5 4 male group C some college standard none
## 6 5 female group B associate's degree standard none
## ParentMaritalStatus PracticeSport IsFirstChild NrSiblings TransportMeans
## 1 married regularly yes 3 school_bus
## 2 married sometimes yes 0
## 3 single sometimes yes 4 school_bus
## 4 married never no 1
## 5 married sometimes yes 0 school_bus
## 6 married regularly yes 1 school_bus
## WklyStudyHours MathScore ReadingScore WritingScore
## 1 < 5 71 71 74
## 2 05-Oct 69 90 88
## 3 < 5 87 93 91
## 4 05-Oct 45 56 42
## 5 05-Oct 76 78 75
## 6 05-Oct 73 84 79
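The `str()` and `head()` output shows empty strings (`""`) in several character columns such as `EthnicGroup`, `TestPrep`, and `TransportMeans`. These most likely stand for missing answers. A minimal sketch of how they could be turned into `NA`, using a small made-up data frame (`toy`, a hypothetical stand-in for the real file):

```r
# Toy data frame standing in for the real file (values are illustrative only)
toy <- data.frame(
  EthnicGroup = c("", "group C", "group B"),
  TestPrep    = c("none", "", "none"),
  MathScore   = c(71L, 69L, 87L),
  stringsAsFactors = FALSE
)

# Replace empty strings with NA, in character columns only
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], function(col) {
  col[col %in% ""] <- NA_character_
  col
})

# Count missing values per column
missing_per_column <- colSums(is.na(toy))
missing_per_column
```

The same effect can probably be achieved at import time by passing `na.strings = c("", "NA")` to `read.csv()`; the printed output in this report would then show `NA` instead of `""`.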
# Summary statistics for the Math, Reading, and Writing scores
summary(Expanded_data_with_more_features[c("MathScore", "ReadingScore", "WritingScore")])
## MathScore ReadingScore WritingScore
## Min. : 0.00 Min. : 10.00 Min. : 4.00
## 1st Qu.: 56.00 1st Qu.: 59.00 1st Qu.: 58.00
## Median : 67.00 Median : 70.00 Median : 69.00
## Mean : 66.56 Mean : 69.38 Mean : 68.42
## 3rd Qu.: 78.00 3rd Qu.: 80.00 3rd Qu.: 79.00
## Max. :100.00 Max. :100.00 Max. :100.00
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
# Histogram for MathScore
p1 <- ggplot(Expanded_data_with_more_features, aes(x = MathScore)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  ggtitle("Distribution of Math Scores") +
  theme_minimal()
# Histogram for ReadingScore
p2 <- ggplot(Expanded_data_with_more_features, aes(x = ReadingScore)) +
  geom_histogram(binwidth = 5, fill = "green", color = "black") +
  ggtitle("Distribution of Reading Scores") +
  theme_minimal()
# Histogram for WritingScore
p3 <- ggplot(Expanded_data_with_more_features, aes(x = WritingScore)) +
  geom_histogram(binwidth = 5, fill = "red", color = "black") +
  ggtitle("Distribution of Writing Scores") +
  theme_minimal()
# Show the three plots together
grid.arrange(p1, p2, p3, ncol = 1)
# Boxplot of the three scores side by side
boxplot(Expanded_data_with_more_features[c("MathScore", "ReadingScore", "WritingScore")],
        main = "Boxplot of Student Scores",
        col = c("blue", "green", "red"),
        names = c("Math", "Reading", "Writing"))
Interpretation
The scores in all three subjects are widely spread: Math ranges from 0 to 100, Reading from 10 to 100, and Writing from 4 to 100.
There are several outliers, mainly at the low end (visible in the boxplot).
The mean Math score (66.56) is lower than the mean Reading score (69.38) and the mean Writing score (68.42).
Ways to check for outliers:
Visually, using a boxplot
Computing the IQR (interquartile range) and deriving lower and upper bounds
Using a statistical method such as the z-score
Ways to handle outliers:
Investigate whether the outlier is a data-entry error
If it is valid, consider keeping it, because it reflects natural variation
For some analyses, it can be removed or adjusted (winsorizing)
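The checks and the winsorizing step listed above can be sketched in base R. The vector `scores` below is made up for illustration. The 1.5 × IQR fences and the |z| > 3 cutoff are common conventions, not choices stated in this report.

```r
# Illustrative scores (made-up values, not taken from the dataset)
scores <- c(0, 41, 45, 56, 65, 67, 69, 71, 73, 76, 78, 85, 87, 100)

# IQR rule: flag values beyond 1.5 * IQR from the quartiles
q1  <- quantile(scores, 0.25, names = FALSE)
q3  <- quantile(scores, 0.75, names = FALSE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
iqr_outliers <- scores[scores < lower | scores > upper]

# z-score rule: flag values more than 3 standard deviations from the mean
z <- (scores - mean(scores)) / sd(scores)
z_outliers <- scores[abs(z) > 3]

# Winsorizing: pull values outside the IQR fences back to the fences
winsorized <- pmin(pmax(scores, lower), upper)

iqr_outliers
z_outliers
winsorized
```

The two rules need not agree: in a small sample a value can fall outside the IQR fences without exceeding the z-score cutoff, so it helps to state which rule was used.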
# Count the frequency of weekly study hours
study_hours <- table(Expanded_data_with_more_features$WklyStudyHours)
study_hours
##
## < 5 > 10 05-Oct
## 955 8238 5202 16246
# Visualization
ggplot(Expanded_data_with_more_features, aes(x = WklyStudyHours)) +
  geom_bar(fill = "purple") +
  ggtitle("Distribution of Weekly Study Hours") +
  xlab("Study Hours Category") +
  ylab("Number of Students") +
  theme_minimal()
Interpretation:
The “5 - 10” hours per week category (shown as “05-Oct” in the data) is the most common, with 16,246 students.
More students study fewer than 5 hours (8,238) than more than 10 hours (5,202).
A further 955 rows have no study-hours value.
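The label “05-Oct” looks like a spreadsheet date conversion of “5 - 10”, and the categories sort alphabetically rather than by amount. A minimal sketch of one way to clean this up, assuming that reading of the label is correct. The vector `hours` is a made-up sample, not the real column.

```r
# Made-up sample of the WklyStudyHours column
hours <- c("< 5", "05-Oct", "> 10", "", "05-Oct")

# Treat blanks as missing and restore the assumed original label
hours[hours %in% ""] <- NA
hours[hours %in% "05-Oct"] <- "5 - 10"

# Ordered levels so tables and bar charts run from least to most study time
hours <- factor(hours, levels = c("< 5", "5 - 10", "> 10"))

table(hours, useNA = "ifany")
```

With the levels set explicitly, `geom_bar()` would draw the bars in this order instead of alphabetical order.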
# Compute the average score per ethnic group
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ethnic_scores <- Expanded_data_with_more_features %>%
group_by(EthnicGroup) %>%
summarise(
Avg_Math = mean(MathScore, na.rm = TRUE),
Avg_Reading = mean(ReadingScore, na.rm = TRUE),
Avg_Writing = mean(WritingScore, na.rm = TRUE)
)
ethnic_scores
## # A tibble: 6 × 4
## EthnicGroup Avg_Math Avg_Reading Avg_Writing
## <chr> <dbl> <dbl> <dbl>
## 1 "" 66.2 68.9 67.9
## 2 "group A" 63.0 66.8 65.3
## 3 "group B" 63.5 67.3 65.9
## 4 "group C" 64.7 68.4 67.0
## 5 "group D" 67.7 70.4 70.9
## 6 "group E" 75.3 74.3 72.7
# Visualization
ggplot(ethnic_scores, aes(x = EthnicGroup, y = Avg_Math, fill = EthnicGroup)) +
  geom_bar(stat = "identity") +
  ggtitle("Average Math Score per Ethnic Group") +
  theme_minimal()
ggplot(ethnic_scores, aes(x = EthnicGroup, y = Avg_Reading, fill = EthnicGroup)) +
  geom_bar(stat = "identity") +
  ggtitle("Average Reading Score per Ethnic Group") +
  theme_minimal()
ggplot(ethnic_scores, aes(x = EthnicGroup, y = Avg_Writing, fill = EthnicGroup)) +
  geom_bar(stat = "identity") +
  ggtitle("Average Writing Score per Ethnic Group") +
  theme_minimal()
library(tidyr)
# Reshape the Math and Reading averages to long format.
# Two separate geom_bar() layers are drawn on top of each other, so
# position = "dodge" only works when both subjects are in one layer.
ethnic_long <- ethnic_scores %>%
  select(EthnicGroup, Avg_Math, Avg_Reading) %>%
  pivot_longer(c(Avg_Math, Avg_Reading),
               names_to = "Subject", names_prefix = "Avg_",
               values_to = "Average")
# Build the plot
ggplot(ethnic_long, aes(x = EthnicGroup, y = Average, fill = Subject)) +
  geom_col(position = "dodge") +
  labs(title = "Average Score per Ethnic Group",
       y = "Average Score",
       x = "Ethnic Group") +
  scale_fill_manual(values = c("Math" = "blue", "Reading" = "red")) +
  theme_minimal()
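Interpretation:
Group E has the highest average score in all three subjects (75.3 in Math, 74.3 in Reading, 72.7 in Writing).
Group A has the lowest average score in all three subjects (63.0, 66.8, and 65.3).
The ordering of the groups (A < B < C < D < E) is the same in all three subjects.
Rows with a blank ethnic group appear as a separate category and fall between groups C and D.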
# Compute the correlation matrix
cor_matrix <- cor(Expanded_data_with_more_features[c("MathScore", "ReadingScore", "WritingScore")],
                  use = "complete.obs")
# Heatmap visualization
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.3
melted_cor <- melt(cor_matrix)
ggplot(melted_cor, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1)) +
  geom_text(aes(label = round(value, 2)), color = "black", size = 4) +
  theme_minimal() +
  ggtitle("Correlation Heatmap of Student Scores")
Interpretation:
There is a strong positive correlation among all three scores:
Math and Reading: r = `r round(cor_matrix[1,2], 2)`
Math and Writing: r = `r round(cor_matrix[1,3], 2)`
Reading and Writing: r = `r round(cor_matrix[2,3], 2)`
The highest correlation is between Reading and Writing.
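As a rough illustration of how the pairwise values and the strongest pair can be read off a correlation matrix, the sketch below uses only the six rows printed by `head()` earlier in this report. Six rows are far too few to represent the full data, so the numbers it prints are not the report's results. The names `first_rows`, `cm`, and `strongest` are introduced here for illustration.

```r
# The six rows shown by head() above (not the full dataset)
first_rows <- data.frame(
  MathScore    = c(71, 69, 87, 45, 76, 73),
  ReadingScore = c(71, 90, 93, 56, 78, 84),
  WritingScore = c(74, 88, 91, 42, 75, 79)
)

cm <- cor(first_rows)
round(cm, 2)

# Keep each pair once by blanking the diagonal and the lower triangle
pairs_only <- cm
pairs_only[lower.tri(pairs_only, diag = TRUE)] <- NA

# Locate the largest remaining correlation
idx <- which(pairs_only == max(pairs_only, na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(pairs_only)[idx[1, "row"]],
               colnames(pairs_only)[idx[1, "col"]])
strongest
```

Applied to `cor_matrix` from the full data, the same steps would name the pair with the highest correlation.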
# Compute the average score by sport frequency
sport_scores <- Expanded_data_with_more_features %>%
group_by(PracticeSport) %>%
summarise(
Avg_Math = mean(MathScore, na.rm = TRUE),
Avg_Reading = mean(ReadingScore, na.rm = TRUE),
Avg_Writing = mean(WritingScore, na.rm = TRUE)
)
sport_scores
## # A tibble: 4 × 4
## PracticeSport Avg_Math Avg_Reading Avg_Writing
## <chr> <dbl> <dbl> <dbl>
## 1 "" 66.6 69.6 68.5
## 2 "never" 64.2 68.3 66.5
## 3 "regularly" 67.8 69.9 69.6
## 4 "sometimes" 66.3 69.2 68.1
# Visualization
ggplot(sport_scores, aes(x = PracticeSport, y = Avg_Math, fill = PracticeSport)) +
  geom_bar(stat = "identity") +
  ggtitle("Average Math Score by Sport Frequency") +
  theme_minimal()
# ANOVA test to check for significant differences
aov_result <- aov(MathScore ~ PracticeSport, data = Expanded_data_with_more_features)
summary(aov_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## PracticeSport 3 41751 13917 59.31 <2e-16 ***
## Residuals 30637 7188652 235
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
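Interpretation:
Students who practice sport “regularly” have the highest average score in all three subjects.
Students who “never” practice sport have the lowest average score in all three subjects.
The ANOVA indicates a statistically significant difference in Math scores across sport-frequency groups (F = 59.31, p < 2e-16). The blank category was included as a fourth group (Df = 3), and the gap between the highest and lowest group means is modest (67.8 versus 64.2).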
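A significant ANOVA result says only that at least one group mean differs. A possible follow-up, sketched below on made-up data, is to drop the blank category and run Tukey's HSD to see which pairs of groups differ. The data frame `toy` and its values are hypothetical, and this step is not part of the original analysis.

```r
# Made-up data: four scores per category, including a blank category
toy <- data.frame(
  PracticeSport = rep(c("never", "sometimes", "regularly", ""), each = 4),
  MathScore = c(60, 62, 65, 63,  66, 64, 68, 67,
                70, 69, 72, 71,  66, 65, 67, 68),
  stringsAsFactors = FALSE
)

# Treat blanks as missing so they are not analysed as a group of their own
toy$PracticeSport[toy$PracticeSport %in% ""] <- NA
toy$PracticeSport <- factor(toy$PracticeSport,
                            levels = c("never", "sometimes", "regularly"))

# One-way ANOVA; rows with NA are dropped by the default na.action
fit <- aov(MathScore ~ PracticeSport, data = toy)
summary(fit)

# Pairwise comparisons with Tukey's honest significant differences
tukey <- TukeyHSD(fit)
tukey
```

Each row of `tukey$PracticeSport` gives the difference between two groups with a confidence interval and an adjusted p-value.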
Recommendations:
Dedicated support programs for the ethnic groups with lower average scores.
Promotion of the importance of regular exercise for academic achievement.
Additional focus on improving Math scores.