Date: 2025-11-10
Student 1 Name: Om
Registration No.: 12309326
Roll No.: 61
Student 2 Name: Vishal Raj
Registration No.: 12321269
Roll No.: 59
Subject: CAP 484 - Data Analytics
Section: D2301
Purpose: Summarize structure, completeness, and basic distribution.
| Q# | Question | Type | Visualization |
|---|---|---|---|
| 1 | How many unique students are in the dataset after removing duplicates? | Descriptive | None |
| 2 | What are the column names and their data types in the dataset? | Descriptive | None |
| 3 | Are there any missing values in the dataset? | Descriptive | None |
| 4 | What are the min, max, and mean values for GPA, Study Time, and Absences? | Descriptive | None |
| 5 | How many students fall into each GPA or study category? | Descriptive | None |
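Questions 1 and 5 could be approached along the lines sketched below. This is a minimal illustration on a small invented data frame (`toy`), not the report's own code. The GPA cut points and letter labels are assumptions, chosen only because they agree with the `GradeClass` codes visible in the first rows of the dataset.

```r
# Toy data standing in for the real file (values are invented for illustration).
toy <- data.frame(
  StudentID = c(1001, 1002, 1002, 1003, 1004),
  GPA       = c(2.93, 3.04, 3.04, 0.11, 3.60)
)

# Q1: drop exact duplicate rows, then count distinct student IDs.
toy_unique <- toy[!duplicated(toy), ]
n_students <- length(unique(toy_unique$StudentID))

# Q5: bin GPA into letter bands. The cut points and labels are assumptions.
toy_unique$GpaBand <- cut(
  toy_unique$GPA,
  breaks = c(-Inf, 2.0, 2.5, 3.0, 3.5, Inf),
  labels = c("F", "D", "C", "B", "A"),
  right  = FALSE  # intervals are closed on the left, e.g. [3.0, 3.5)
)
band_counts <- table(toy_unique$GpaBand)

print(n_students)
print(band_counts)
```

Using `right = FALSE` places a GPA of exactly 3.0 in the higher band; whether that matches the dataset's own grading rule would need to be checked.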
Purpose: Analyze how academic and personal factors affect GPA.
| Q# | Question | Type | Visualization |
|---|---|---|---|
| 6 | What is the distribution of GPA among students? | Visual | Histogram |
| 7 | How does weekly study time vary across students? | Visual | Histogram |
| 8 | What is the absence distribution across students? | Visual | Histogram |
| 9 | What is the gender distribution in the dataset? | Visual | Pie Chart |
| 10 | How does parental education level relate to GPA? | Visual | Boxplot |
Purpose: Understand deeper academic relationships.
| Q# | Question | Type | Visualization |
|---|---|---|---|
| 11 | What is the relationship between study time and GPA? | Visual | Scatter Plot |
| 12 | How does parental support relate to GPA? | Visual | Boxplot |
| 13 | What patterns exist between absences and GPA? | Visual | Scatter Plot |
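Questions 12 and 13 could be drawn as sketched below, in the same ggplot2 style used elsewhere in the report. The data frame `toy` and all its values are invented for illustration; the object names `p_support` and `p_absences` are hypothetical.

```r
library(ggplot2)

# Invented values for illustration only; the real analysis would use `data`.
toy <- data.frame(
  ParentalSupport = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
  Absences        = c(25, 20, 14, 8, 2, 28, 18, 12, 6, 0),
  GPA             = c(0.6, 1.2, 1.9, 2.6, 3.4, 0.4, 1.5, 2.1, 2.9, 3.7)
)

# Q12: one box per support level; factor() makes the integer code categorical.
p_support <- ggplot(toy, aes(x = factor(ParentalSupport), y = GPA)) +
  geom_boxplot(fill = "lightpink") +
  labs(title = "Parental Support vs GPA", x = "Parental Support (coded)", y = "GPA")

# Q13: scatter of absences against GPA with a fitted straight line.
p_absences <- ggplot(toy, aes(x = Absences, y = GPA)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Absences vs GPA", x = "Absences", y = "GPA")

print(p_support)
print(p_absences)
```

Wrapping the coded column in `factor()` matters here: with a plain integer on the x axis, `geom_boxplot()` would not split the data into one box per level.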
Purpose: Identify numeric relationships between key variables.
| Q# | Question | Type | Visualization |
|---|---|---|---|
| 14 | How are numeric features correlated (Age, StudyTime, Absences, GPA)? | Visual | Correlation Heatmap |
Purpose: Summarize findings and key takeaways.
| Insight Area | Key Observation |
|---|---|
| Study Habits | Weekly study time shows a weak positive correlation with GPA (r ≈ 0.18). |
| Attendance | Absences show a strong negative correlation with GPA (r ≈ -0.92). |
| Parental Influence | Higher parental support and education appear to be associated with higher GPA. |
| Gender | Only minor differences in overall GPA distribution. |
Objective: Analyze how study time, attendance, and parental factors influence student GPA.
Dataset: Student academic performance dataset.
Goal: Discover patterns and visualize academic success factors.
# Packages used by the plotting and correlation code below.
library(ggplot2)
library(dplyr)
library(corrplot)
data_path <- "C:/Users/singh/OneDrive/Desktop/CAP484/Student_performance_data _.csv"
if (!file.exists(data_path)) {
  stop(paste("Data file not found at:", data_path))
}
data <- read.csv(data_path, stringsAsFactors = FALSE)
knitr::kable(head(data), caption = "First few rows of the dataset")

| StudentID | Age | Gender | Ethnicity | ParentalEducation | StudyTimeWeekly | Absences | Tutoring | ParentalSupport | Extracurricular | Sports | Music | Volunteering | GPA | GradeClass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 17 | 1 | 0 | 2 | 19.833723 | 7 | 1 | 2 | 0 | 0 | 1 | 0 | 2.9291956 | 2 |
| 1002 | 18 | 0 | 0 | 1 | 15.408756 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3.0429148 | 1 |
| 1003 | 15 | 0 | 2 | 3 | 4.210570 | 26 | 0 | 2 | 0 | 0 | 0 | 0 | 0.1126023 | 4 |
| 1004 | 17 | 1 | 0 | 3 | 10.028830 | 14 | 0 | 3 | 1 | 0 | 0 | 0 | 2.0542181 | 3 |
| 1005 | 17 | 1 | 0 | 2 | 4.672495 | 17 | 1 | 3 | 0 | 0 | 0 | 0 | 1.2880612 | 4 |
| 1006 | 18 | 0 | 0 | 1 | 8.191218 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 3.0841836 | 1 |
## 'data.frame': 2392 obs. of 15 variables:
## $ StudentID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Age : int 17 18 15 17 17 18 15 15 17 16 ...
## $ Gender : int 1 0 0 1 1 0 0 1 0 1 ...
## $ Ethnicity : int 0 0 2 0 0 0 1 1 0 0 ...
## $ ParentalEducation: int 2 1 3 3 2 1 1 4 0 1 ...
## $ StudyTimeWeekly : num 19.83 15.41 4.21 10.03 4.67 ...
## $ Absences : int 7 0 26 14 17 0 10 22 1 0 ...
## $ Tutoring : int 1 0 0 0 1 0 0 1 0 0 ...
## $ ParentalSupport : int 2 1 2 3 3 1 3 1 2 3 ...
## $ Extracurricular : int 0 0 0 1 0 1 0 1 0 1 ...
## $ Sports : int 0 0 0 0 0 0 1 0 1 0 ...
## $ Music : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Volunteering : int 0 0 0 0 0 0 0 0 1 0 ...
## $ GPA : num 2.929 3.043 0.113 2.054 1.288 ...
## $ GradeClass : num 2 1 4 3 4 1 2 4 2 0 ...
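The structure listing shows that categorical fields such as `Gender`, `ParentalEducation`, and `GradeClass` are stored as numbers. The sketch below shows one way such codes might be converted to factors so that grouping and plotting functions treat them as categories. The rows in `toy` are invented, and no meaning is assigned to the codes here.

```r
# Invented rows that mimic the integer coding seen in the structure listing.
toy <- data.frame(
  ParentalEducation = c(2L, 1L, 3L, 3L, 2L, 1L),
  GradeClass        = c(2, 1, 4, 3, 4, 1)
)

# Convert the coded columns to factors; levels are the sorted unique codes.
coded_cols <- c("ParentalEducation", "GradeClass")
toy[coded_cols] <- lapply(toy[coded_cols], factor)

str(toy)
print(table(toy$ParentalEducation))
```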
## StudentID Age Gender Ethnicity
## Min. :1001 Min. :15.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:1599 1st Qu.:15.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2196 Median :16.00 Median :1.0000 Median :0.0000
## Mean :2196 Mean :16.47 Mean :0.5109 Mean :0.8775
## 3rd Qu.:2794 3rd Qu.:17.00 3rd Qu.:1.0000 3rd Qu.:2.0000
## Max. :3392 Max. :18.00 Max. :1.0000 Max. :3.0000
## ParentalEducation StudyTimeWeekly Absences Tutoring
## Min. :0.000 Min. : 0.001056 Min. : 0.00 Min. :0.0000
## 1st Qu.:1.000 1st Qu.: 5.043079 1st Qu.: 7.00 1st Qu.:0.0000
## Median :2.000 Median : 9.705363 Median :15.00 Median :0.0000
## Mean :1.746 Mean : 9.771992 Mean :14.54 Mean :0.3014
## 3rd Qu.:2.000 3rd Qu.:14.408409 3rd Qu.:22.00 3rd Qu.:1.0000
## Max. :4.000 Max. :19.978094 Max. :29.00 Max. :1.0000
## ParentalSupport Extracurricular Sports Music
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :2.122 Mean :0.3834 Mean :0.3035 Mean :0.1969
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :4.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Volunteering GPA GradeClass
## Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1.175 1st Qu.:2.000
## Median :0.0000 Median :1.893 Median :4.000
## Mean :0.1572 Mean :1.906 Mean :2.984
## 3rd Qu.:0.0000 3rd Qu.:2.622 3rd Qu.:4.000
## Max. :1.0000 Max. :4.000 Max. :4.000
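For Question 4, the three columns of interest can be pulled out of a full summary into a compact table. The sketch below does this on an invented data frame; `key_cols` and `key_stats` are hypothetical names.

```r
# Invented values for illustration; the real analysis would use `data`.
toy <- data.frame(
  GPA             = c(1.0, 2.0, 3.0, 4.0),
  StudyTimeWeekly = c(2, 6, 10, 14),
  Absences        = c(0, 10, 20, 30)
)

# Q4: min, max, and mean for the three columns of interest.
key_cols <- c("GPA", "StudyTimeWeekly", "Absences")
key_stats <- sapply(toy[key_cols], function(x) {
  c(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE), mean = mean(x, na.rm = TRUE))
})

print(key_stats)
```

The result is a matrix with one column per variable and rows named `min`, `max`, and `mean`, so a single value can be read as `key_stats["mean", "GPA"]`.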
## Missing values per column:
## StudentID Age Gender Ethnicity
## 0 0 0 0
## ParentalEducation StudyTimeWeekly Absences Tutoring
## 0 0 0 0
## ParentalSupport Extracurricular Sports Music
## 0 0 0 0
## Volunteering GPA GradeClass
## 0 0 0
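A per-column missing-value count like the one above is commonly produced with `colSums(is.na(...))`. The sketch below shows the idea on an invented data frame with two deliberate gaps; it is not necessarily the exact code behind the report's output.

```r
# Invented rows with two deliberate gaps, for illustration only.
toy <- data.frame(
  StudentID = c(1001, 1002, 1003),
  GPA       = c(2.9, NA, 0.1),
  Absences  = c(7, 0, NA)
)

# Count NA entries per column; a zero means that column is complete.
missing_per_col <- colSums(is.na(toy))
cat("Missing values per column:\n")
print(missing_per_col)

# Number of rows with no missing value in any column.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```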
if ("GPA" %in% names(data)) {
  print(
    ggplot(data, aes(x = GPA)) +
      geom_histogram(fill = "pink", color = "black", bins = 15) +
      labs(title = "Distribution of GPA", x = "GPA", y = "Count")
  )
}

if ("StudyTimeWeekly" %in% names(data)) {
  print(
    ggplot(data, aes(x = StudyTimeWeekly)) +
      geom_histogram(fill = "lightpink", color = "black", bins = 15) +
      labs(title = "Distribution of Weekly Study Time", x = "Hours per Week", y = "Count")
  )
}

if ("Absences" %in% names(data)) {
  print(
    ggplot(data, aes(x = Absences)) +
      geom_histogram(fill = "deeppink", color = "black", bins = 15) +
      labs(title = "Distribution of Absences", x = "Absences", y = "Count")
  )
}

if ("Gender" %in% names(data)) {
  # Recode gender to labels. This assumes 1 = Male and 0 = Female; confirm against the dataset's codebook.
  data$Gender <- ifelse(data$Gender %in% c(1, "1", "M", "m"), "Male",
    ifelse(data$Gender %in% c(0, 2, "2", "F", "f"), "Female", as.character(data$Gender)))
  gender_summary <- data %>%
    count(Gender) %>%
    mutate(percentage = n / sum(n) * 100, label = paste0(Gender, "\n", round(percentage, 1), "%"))
  print(
    ggplot(gender_summary, aes(x = "", y = n, fill = Gender)) +
      geom_col(width = 1) +
      coord_polar("y") +
      geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
      labs(title = "Gender Distribution") +
      theme_void()
  )
}

if (all(c("ParentalEducation", "GPA") %in% names(data))) {
  # Treat the integer-coded education level as a category so each level gets its own box.
  print(
    ggplot(data, aes(x = factor(ParentalEducation), y = GPA, fill = Gender)) +
      geom_boxplot() +
      labs(title = "Parental Education vs GPA", x = "Parental Education", y = "GPA") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  )
}

if (all(c("StudyTimeWeekly", "GPA") %in% names(data))) {
  data_sample <- data %>% sample_n(min(100, nrow(data)))
  print(
    ggplot(data_sample, aes(x = StudyTimeWeekly, y = GPA, color = Gender)) +
      geom_point(size = 2) +
      geom_smooth(method = "lm", se = FALSE) +
      labs(title = "Study Time vs GPA (Sample)", x = "Weekly Study Time (hrs)", y = "GPA")
  )
}

num_cols <- c("Age", "StudyTimeWeekly", "Absences", "GPA")
present <- num_cols[num_cols %in% names(data)]
if (length(present) >= 2) {
  num_data <- data %>% select(all_of(present)) %>% mutate_all(as.numeric)
  corr_matrix <- cor(num_data, use = "pairwise.complete.obs")
  print(corrplot(corr_matrix, method = "color", addCoef.col = "black", number.cex = 0.7))
}

## $corr
## Age StudyTimeWeekly Absences GPA
## Age 1.0000000000 -0.006800031 -0.011510913 0.0002753882
## StudyTimeWeekly -0.0068000307 1.000000000 0.009325535 0.1792751269
## Absences -0.0115109127 0.009325535 1.000000000 -0.9193135764
## GPA 0.0002753882 0.179275127 -0.919313576 1.0000000000
##
## $corrPos
## xName yName x y corr
## 1 Age Age 1 4 1.0000000000
## 2 Age StudyTimeWeekly 1 3 -0.0068000307
## 3 Age Absences 1 2 -0.0115109127
## 4 Age GPA 1 1 0.0002753882
## 5 StudyTimeWeekly Age 2 4 -0.0068000307
## 6 StudyTimeWeekly StudyTimeWeekly 2 3 1.0000000000
## 7 StudyTimeWeekly Absences 2 2 0.0093255348
## 8 StudyTimeWeekly GPA 2 1 0.1792751269
## 9 Absences Age 3 4 -0.0115109127
## 10 Absences StudyTimeWeekly 3 3 0.0093255348
## 11 Absences Absences 3 2 1.0000000000
## 12 Absences GPA 3 1 -0.9193135764
## 13 GPA Age 4 4 0.0002753882
## 14 GPA StudyTimeWeekly 4 3 0.1792751269
## 15 GPA Absences 4 2 -0.9193135764
## 16 GPA GPA 4 1 1.0000000000
##
## $arg
## $arg$type
## [1] "full"
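To put the printed correlations in perspective, each coefficient can be squared to give the share of GPA variance that a straight-line fit on that variable alone would capture. The sketch below works only from the values in the matrix above; `r_with_gpa`, `r_squared`, and `ranked` are hypothetical names.

```r
# Correlations with GPA copied from the matrix printed above.
r_with_gpa <- c(
  Age             = 0.0002753882,
  StudyTimeWeekly = 0.1792751269,
  Absences        = -0.9193135764
)

# Squaring r gives the share of GPA variance a straight-line fit would capture.
r_squared <- round(r_with_gpa^2, 3)
print(r_squared)

# Rank the variables by the strength (absolute value) of their correlation.
ranked <- names(sort(abs(r_with_gpa), decreasing = TRUE))
print(ranked)
```

On these figures, absences alone account for roughly 85% of the linear variation in GPA, while weekly study time accounts for about 3% and age for essentially none. These are measures of association only and do not by themselves establish cause.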
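✅ Students who study more tend to achieve slightly higher GPAs (r ≈ 0.18 between weekly study time and GPA).
✅ Higher parental education and support appear to be associated with better performance.
✅ Absences are strongly associated with lower GPA (r ≈ -0.92).
✅ Gender differences in academic performance are minor.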
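The findings on parental support and gender concern categorical variables, which the numeric correlation matrix does not cover. One simple way to check such statements is to compare group means, as sketched below. All values in `toy` are invented, so the printed numbers illustrate the technique and say nothing about the real dataset.

```r
# Invented values for illustration; the real analysis would use `data`.
toy <- data.frame(
  ParentalSupport = c(0, 0, 2, 2, 4, 4),
  Gender          = c("Male", "Female", "Male", "Female", "Male", "Female"),
  GPA             = c(1.0, 1.2, 2.0, 2.2, 3.0, 3.2)
)

# Mean GPA per parental-support level.
gpa_by_support <- aggregate(GPA ~ ParentalSupport, data = toy, FUN = mean)
print(gpa_by_support)

# Mean GPA per gender label.
gpa_by_gender <- tapply(toy$GPA, toy$Gender, mean)
print(gpa_by_gender)
```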
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_American Samoa.utf8
## [2] LC_CTYPE=English_American Samoa.utf8
## [3] LC_MONETARY=English_American Samoa.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_American Samoa.utf8
##
## time zone: Asia/Calcutta
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] corrplot_0.95 dplyr_1.1.4 ggplot2_3.5.2
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 nlme_3.1-168 cli_3.6.5 knitr_1.50
## [5] rlang_1.1.6 xfun_0.52 generics_0.1.3 jsonlite_2.0.0
## [9] labeling_0.4.3 glue_1.8.0 htmltools_0.5.8.1 sass_0.4.10
## [13] scales_1.4.0 rmarkdown_2.29 grid_4.5.0 evaluate_1.0.3
## [17] jquerylib_0.1.4 tibble_3.2.1 fastmap_1.2.0 yaml_2.3.10
## [21] lifecycle_1.0.4 compiler_4.5.0 RColorBrewer_1.1-3 pkgconfig_2.0.3
## [25] mgcv_1.9-1 rstudioapi_0.17.1 lattice_0.22-6 farver_2.1.2
## [29] digest_0.6.37 R6_2.6.1 tidyselect_1.2.1 splines_4.5.0
## [33] pillar_1.10.2 magrittr_2.0.3 Matrix_1.7-3 bslib_0.9.0
## [37] withr_3.0.2 tools_4.5.0 gtable_0.3.6 cachem_1.1.0