Date: 2025-11-10

Student 1 Name: Om
Registration No.: 12309326
Roll No.: 61

Student 2 Name: Vishal Raj
Registration No.: 12321269
Roll No.: 59

Subject: CAP 484 - Data Analytics
Section: D2301


Overview of Analysis Phases

Phase 1: Dataset Understanding (Descriptive Only — No Visuals)

Purpose: Summarize structure, completeness, and basic distribution.

Q# | Question | Type | Visualization
1 | How many unique students are in the dataset after removing duplicates? | Descriptive | None
2 | What are the column names and their data types in the dataset? | Descriptive | None
3 | Are there any missing values in the dataset? | Descriptive | None
4 | What are the min, max, and mean values for GPA, Study Time, and Absences? | Descriptive | None
5 | How many students fall into each GPA or study category? | Descriptive | None

Phase 2: Student Performance Patterns

Purpose: Analyze how academic and personal factors affect GPA.

Q# | Question | Type | Visualization
6 | What is the distribution of GPA among students? | Visual | Histogram
7 | How does weekly study time vary across students? | Visual | Histogram
8 | What is the absence distribution across students? | Visual | Histogram
9 | What is the gender distribution in the dataset? | Visual | Pie Chart
10 | How does parental education level relate to GPA? | Visual | Boxplot

Phase 3: Academic Factors and GPA

Purpose: Understand deeper academic relationships.

Q# | Question | Type | Visualization
11 | What is the relationship between study time and GPA? | Visual | Scatter Plot
12 | How does parental support relate to GPA? | Visual | Boxplot
13 | What patterns exist between absences and GPA? | Visual | Scatter Plot

Phase 4: Correlation Analysis

Purpose: Identify numeric relationships between key variables.

Q# | Question | Type | Visualization
14 | How are numeric features correlated (Age, StudyTime, Absences, GPA)? | Visual | Correlation Heatmap

Phase 5: Conclusion & Insights

Purpose: Summarize findings and key takeaways.

Insight Area | Key Observation
Study Habits | Students who study more tend to score higher GPAs.
Attendance | Fewer absences correlate with higher GPA.
Parental Influence | Higher parental support and education improve GPA.
Gender | Minor differences in overall GPA distribution.

Phase 0: Project Overview

Objective: Analyze how study time, attendance, and parental factors influence student GPA.
Dataset: Student academic performance dataset.
Goal: Discover patterns and visualize academic success factors.


Phase 1: Dataset Understanding

Q1. Load and Preview the Dataset

# Packages used throughout the report: plots, data manipulation, correlation heatmap.
library(ggplot2)
library(dplyr)
library(corrplot)

data_path <- "C:/Users/singh/OneDrive/Desktop/CAP484/Student_performance_data _.csv"
if (!file.exists(data_path)) {
  stop(paste("Data file not found at:", data_path))
}
data <- read.csv(data_path, stringsAsFactors = FALSE)
knitr::kable(head(data), caption = "First few rows of the dataset")
First few rows of the dataset
StudentID Age Gender Ethnicity ParentalEducation StudyTimeWeekly Absences Tutoring ParentalSupport Extracurricular Sports Music Volunteering GPA GradeClass
1001 17 1 0 2 19.833723 7 1 2 0 0 1 0 2.9291956 2
1002 18 0 0 1 15.408756 0 0 1 0 0 0 0 3.0429148 1
1003 15 0 2 3 4.210570 26 0 2 0 0 0 0 0.1126023 4
1004 17 1 0 3 10.028830 14 0 3 1 0 0 0 2.0542181 3
1005 17 1 0 2 4.672495 17 1 3 0 0 0 0 1.2880612 4
1006 18 0 0 1 8.191218 0 0 1 1 0 0 0 3.0841836 1
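
Q1 in the overview asks how many unique students remain after removing duplicates, but the report does not show that step. The following is a minimal base-R sketch of one way to do it. It uses the StudentID and GPA values from the six preview rows above, plus one artificially duplicated row that is not in the dataset and is added only to demonstrate the check; the object names are hypothetical.

```r
# Six preview rows from the report (StudentID and GPA only), plus one
# artificially duplicated row added purely to demonstrate the check.
students <- data.frame(
  StudentID = c(1001, 1002, 1003, 1004, 1005, 1006, 1003),
  GPA       = c(2.9291956, 3.0429148, 0.1126023, 2.0542181,
                1.2880612, 3.0841836, 0.1126023)
)

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates <- sum(duplicated(students))

# Drop the repeats, then count distinct IDs among the remaining rows.
students_unique   <- students[!duplicated(students), ]
n_unique_students <- length(unique(students_unique$StudentID))

cat("Duplicate rows removed:", n_duplicates, "\n")
cat("Unique students:", n_unique_students, "\n")
```

On the full dataset, the str() and summary() output in Q2 reports 2392 rows and a StudentID range of 1001 to 3392, which spans exactly 2392 values. That is consistent with every ID appearing once, although only a check like the one above would confirm it.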

Q2. Structure and Summary

str(data)
## 'data.frame':    2392 obs. of  15 variables:
##  $ StudentID        : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Age              : int  17 18 15 17 17 18 15 15 17 16 ...
##  $ Gender           : int  1 0 0 1 1 0 0 1 0 1 ...
##  $ Ethnicity        : int  0 0 2 0 0 0 1 1 0 0 ...
##  $ ParentalEducation: int  2 1 3 3 2 1 1 4 0 1 ...
##  $ StudyTimeWeekly  : num  19.83 15.41 4.21 10.03 4.67 ...
##  $ Absences         : int  7 0 26 14 17 0 10 22 1 0 ...
##  $ Tutoring         : int  1 0 0 0 1 0 0 1 0 0 ...
##  $ ParentalSupport  : int  2 1 2 3 3 1 3 1 2 3 ...
##  $ Extracurricular  : int  0 0 0 1 0 1 0 1 0 1 ...
##  $ Sports           : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ Music            : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Volunteering     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ GPA              : num  2.929 3.043 0.113 2.054 1.288 ...
##  $ GradeClass       : num  2 1 4 3 4 1 2 4 2 0 ...
summary(data)
##    StudentID         Age            Gender         Ethnicity     
##  Min.   :1001   Min.   :15.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1599   1st Qu.:15.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2196   Median :16.00   Median :1.0000   Median :0.0000  
##  Mean   :2196   Mean   :16.47   Mean   :0.5109   Mean   :0.8775  
##  3rd Qu.:2794   3rd Qu.:17.00   3rd Qu.:1.0000   3rd Qu.:2.0000  
##  Max.   :3392   Max.   :18.00   Max.   :1.0000   Max.   :3.0000  
##  ParentalEducation StudyTimeWeekly        Absences        Tutoring     
##  Min.   :0.000     Min.   : 0.001056   Min.   : 0.00   Min.   :0.0000  
##  1st Qu.:1.000     1st Qu.: 5.043079   1st Qu.: 7.00   1st Qu.:0.0000  
##  Median :2.000     Median : 9.705363   Median :15.00   Median :0.0000  
##  Mean   :1.746     Mean   : 9.771992   Mean   :14.54   Mean   :0.3014  
##  3rd Qu.:2.000     3rd Qu.:14.408409   3rd Qu.:22.00   3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :19.978094   Max.   :29.00   Max.   :1.0000  
##  ParentalSupport Extracurricular      Sports           Music       
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2.000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :2.122   Mean   :0.3834   Mean   :0.3035   Mean   :0.1969  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :4.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   Volunteering         GPA          GradeClass   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:1.175   1st Qu.:2.000  
##  Median :0.0000   Median :1.893   Median :4.000  
##  Mean   :0.1572   Mean   :1.906   Mean   :2.984  
##  3rd Qu.:0.0000   3rd Qu.:2.622   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :4.000   Max.   :4.000
cat("Missing values per column:\n")
## Missing values per column:
print(colSums(is.na(data)))
##         StudentID               Age            Gender         Ethnicity 
##                 0                 0                 0                 0 
## ParentalEducation   StudyTimeWeekly          Absences          Tutoring 
##                 0                 0                 0                 0 
##   ParentalSupport   Extracurricular            Sports             Music 
##                 0                 0                 0                 0 
##      Volunteering               GPA        GradeClass 
##                 0                 0                 0
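
Q4 and Q5 in the overview (min, max, and mean of the key variables, and students per category) can be answered more directly than by reading the full summary() output. The sketch below assumes the six preview rows as a stand-in for the full data frame. The GPA cut points are hypothetical and chosen only for illustration; GradeClass is the category column already present in the data.

```r
# Six preview rows from the report; a stand-in for the full `data` frame.
students <- data.frame(
  StudyTimeWeekly = c(19.833723, 15.408756, 4.210570, 10.028830, 4.672495, 8.191218),
  Absences        = c(7, 0, 26, 14, 17, 0),
  GPA             = c(2.9291956, 3.0429148, 0.1126023, 2.0542181, 1.2880612, 3.0841836),
  GradeClass      = c(2, 1, 4, 3, 4, 1)
)

# Q4: min, max, and mean for the three numeric columns of interest.
key_cols <- c("GPA", "StudyTimeWeekly", "Absences")
stats <- sapply(students[key_cols],
                function(x) c(min = min(x), max = max(x), mean = mean(x)))
print(round(stats, 2))

# Q5: students per category. GradeClass is already coded in the data;
# the GPA bands are hypothetical cut points: [0,1), [1,2), [2,3), [3,4].
grade_counts <- table(students$GradeClass)
gpa_band     <- cut(students$GPA, breaks = c(0, 1, 2, 3, 4),
                    include.lowest = TRUE, right = FALSE)
band_counts  <- table(gpa_band)
print(grade_counts)
print(band_counts)
```

Replacing `students` with the full data frame gives the dataset-wide figures; the min, max, and mean values should then agree with the summary() output above.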

Phase 2: Student Performance Patterns

GPA Distribution

if ("GPA" %in% names(data)) {
  print(
    ggplot(data, aes(x = GPA)) +
      geom_histogram(fill = "pink", color = "black", bins = 15) +
      labs(title = "Distribution of GPA", x = "GPA", y = "Count")
  )
}


Study Time and Absences

if ("StudyTimeWeekly" %in% names(data)) {
  print(
    ggplot(data, aes(x = StudyTimeWeekly)) +
      geom_histogram(fill = "lightpink", color = "black", bins = 15) +
      labs(title = "Distribution of Weekly Study Time", x = "Hours per Week", y = "Count")
  )
}

if ("Absences" %in% names(data)) {
  print(
    ggplot(data, aes(x = Absences)) +
      geom_histogram(fill = "deeppink", color = "black", bins = 15) +
      labs(title = "Distribution of Absences", x = "Absences", y = "Count")
  )
}


Gender Distribution

if ("Gender" %in% names(data)) {
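  # Recode the numeric gender codes as labels. This assumes 1 = Male and
  # 0 = Female; verify the coding against the dataset's data dictionary,
  # because a reversed mapping would swap the labels in every plot below.
  data$Gender <- ifelse(data$Gender %in% c(1, "1", "M", "m"), "Male",
                        ifelse(data$Gender %in% c(0, 2, "2", "F", "f"), "Female", as.character(data$Gender)))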
  gender_summary <- data %>%
    count(Gender) %>%
    mutate(percentage = n / sum(n) * 100, label = paste0(Gender, "\n", round(percentage, 1), "%"))
  print(
    ggplot(gender_summary, aes(x = "", y = n, fill = Gender)) +
      geom_col(width = 1) +
      coord_polar("y") +
      geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
      labs(title = "Gender Distribution") +
      theme_void()
  )
}
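
The recoding and percentage steps behind the pie chart can also be written with factor() and prop.table(). This is a sketch using the Gender codes from the six preview rows; the label mapping simply mirrors the ifelse() above and is an assumption to confirm against the dataset's documentation.

```r
# Gender codes from the six preview rows in the report.
gender_code <- c(1, 0, 0, 1, 1, 0)

# factor() maps each code to a label in one step. The mapping mirrors the
# report's ifelse() (1 = "Male", 0 = "Female") and is an assumption.
gender <- factor(gender_code, levels = c(0, 1), labels = c("Female", "Male"))

# Counts and percentages per label.
gender_counts <- table(gender)
gender_pct    <- round(100 * prop.table(gender_counts), 1)
print(gender_counts)
print(gender_pct)
```

For a 0/1 column the mean equals the share of rows coded 1, so the Gender mean of 0.5109 in the Q2 summary already implies that about 51.1% of students carry code 1.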


Phase 3: Academic Factors and GPA

Parental Education vs GPA

if (all(c("ParentalEducation", "GPA") %in% names(data))) {
  print(
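    # ParentalEducation is stored as an integer; convert it to a factor so
    # geom_boxplot() draws one box per education level instead of pooling them.
    ggplot(data, aes(x = factor(ParentalEducation), y = GPA, fill = Gender)) +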
      geom_boxplot() +
      labs(title = "Parental Education vs GPA", x = "Parental Education", y = "GPA") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  )
}
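
A per-level table of GPA statistics gives a numeric complement to the boxplot. The sketch below assumes the six preview rows as a stand-in for the full data, so its numbers only illustrate the mechanics and say nothing about the real relationship; the object names are hypothetical.

```r
# ParentalEducation and GPA from the six preview rows in the report.
students <- data.frame(
  ParentalEducation = c(2, 1, 3, 3, 2, 1),
  GPA = c(2.9291956, 3.0429148, 0.1126023, 2.0542181, 1.2880612, 3.0841836)
)

# Treat the integer codes as categories, then summarise GPA within each one.
education <- factor(students$ParentalEducation)
gpa_summary <- data.frame(
  ParentalEducation = levels(education),
  n          = as.vector(table(education)),
  mean_gpa   = as.vector(tapply(students$GPA, education, mean)),
  median_gpa = as.vector(tapply(students$GPA, education, median))
)
print(gpa_summary)
```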


Study Time vs GPA

if (all(c("StudyTimeWeekly", "GPA") %in% names(data))) {
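  # Fix the random seed so the 100-row sample (and the plot) is reproducible;
  # 484 is an arbitrary value.
  set.seed(484)
  data_sample <- data %>% sample_n(min(100, nrow(data)))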
  print(
    ggplot(data_sample, aes(x = StudyTimeWeekly, y = GPA, color = Gender)) +
      geom_point(size = 2) +
      geom_smooth(method = "lm", se = FALSE) +
      labs(title = "Study Time vs GPA (Sample)", x = "Weekly Study Time (hrs)", y = "GPA")
  )
}
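
The trend line drawn by geom_smooth(method = "lm") corresponds to an ordinary least-squares fit, and its slope can be read off directly with lm(). This is a minimal sketch on the six preview rows only, so the estimates are illustrative and will differ from a fit on the full dataset.

```r
# StudyTimeWeekly and GPA from the six preview rows in the report.
students <- data.frame(
  StudyTimeWeekly = c(19.833723, 15.408756, 4.210570, 10.028830, 4.672495, 8.191218),
  GPA = c(2.9291956, 3.0429148, 0.1126023, 2.0542181, 1.2880612, 3.0841836)
)

# Simple linear regression of GPA on weekly study hours.
fit <- lm(GPA ~ StudyTimeWeekly, data = students)

# Slope: estimated change in GPA per additional study hour per week.
slope     <- coef(fit)[["StudyTimeWeekly"]]
# R-squared: share of GPA variance captured by the straight line.
r_squared <- summary(fit)$r.squared

cat("Slope (GPA per study hour):", round(slope, 3), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
```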


Parental Support vs GPA

if (all(c("ParentalSupport", "GPA") %in% names(data))) {
  print(
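    # ParentalSupport is stored as an integer; convert it to a factor so
    # geom_boxplot() draws one box per support level.
    ggplot(data, aes(x = factor(ParentalSupport), y = GPA, fill = Gender)) +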
      geom_boxplot() +
      labs(title = "Parental Support vs GPA", x = "Parental Support Level", y = "GPA")
  )
}
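
The overview lists an absences-versus-GPA scatter plot (Q13), but no such plot appears in this section. The sketch below shows one possible way to draw it in the same ggplot2 style as the other plots. It uses the six preview rows as a stand-in for the full data frame, and the object names are hypothetical.

```r
library(ggplot2)

# Absences and GPA from the six preview rows in the report.
students <- data.frame(
  Absences = c(7, 0, 26, 14, 17, 0),
  GPA = c(2.9291956, 3.0429148, 0.1126023, 2.0542181, 1.2880612, 3.0841836)
)

# Scatter plot with a least-squares trend line.
absence_plot <- ggplot(students, aes(x = Absences, y = GPA)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Absences vs GPA", x = "Absences", y = "GPA")
print(absence_plot)
```

With the full data, some transparency (the alpha setting) helps because Absences takes only integer values and many points overlap.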


Phase 4: Correlation Analysis

num_cols <- c("Age", "StudyTimeWeekly", "Absences", "GPA")
present <- num_cols[num_cols %in% names(data)]
if (length(present) >= 2) {
  num_data <- data %>% select(all_of(present)) %>% mutate(across(everything(), as.numeric))
  corr_matrix <- cor(num_data, use = "pairwise.complete.obs")
  # Print the matrix itself for reference.
  print(corr_matrix)
  # corrplot() draws the heatmap as a side effect; wrapping it in print()
  # would only dump its internal return list ($corr, $corrPos, $arg).
  corrplot(corr_matrix, method = "color", addCoef.col = "black", number.cex = 0.7)
}

##                           Age StudyTimeWeekly     Absences           GPA
## Age              1.0000000000    -0.006800031 -0.011510913  0.0002753882
## StudyTimeWeekly -0.0068000307     1.000000000  0.009325535  0.1792751269
## Absences        -0.0115109127     0.009325535  1.000000000 -0.9193135764
## GPA              0.0002753882     0.179275127 -0.919313576  1.0000000000
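
Squaring a Pearson correlation gives the share of variance that a straight-line relationship accounts for, which makes the coefficients above easier to compare. The sketch below copies the three GPA correlations from the printed matrix; the variable names are hypothetical.

```r
# Pearson correlations with GPA, copied from the matrix printed in the report.
r_with_gpa <- c(
  Age             = 0.0002753882,
  StudyTimeWeekly = 0.1792751269,
  Absences        = -0.9193135764
)

# r^2: proportion of GPA variance linearly associated with each variable.
r_squared <- r_with_gpa^2
print(round(100 * r_squared, 1))  # percentages
```

On these figures, absences account for roughly 85% of the variance in GPA in a linear sense, weekly study time for about 3%, and age for essentially none.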

Phase 5: Conclusion & Insights

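✅ Students who study more tend to achieve somewhat higher GPAs, but the linear relationship is weak (r ≈ 0.18).
✅ Parental education and support appear to be associated with better performance (see the Phase 3 boxplots).
✅ Absences show by far the strongest relationship with GPA: more absences go with markedly lower GPA (r ≈ -0.92).
✅ Gender differences in academic performance are minor.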


sessionInfo()
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_American Samoa.utf8 
## [2] LC_CTYPE=English_American Samoa.utf8   
## [3] LC_MONETARY=English_American Samoa.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_American Samoa.utf8    
## 
## time zone: Asia/Calcutta
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrplot_0.95 dplyr_1.1.4   ggplot2_3.5.2
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5        nlme_3.1-168       cli_3.6.5          knitr_1.50        
##  [5] rlang_1.1.6        xfun_0.52          generics_0.1.3     jsonlite_2.0.0    
##  [9] labeling_0.4.3     glue_1.8.0         htmltools_0.5.8.1  sass_0.4.10       
## [13] scales_1.4.0       rmarkdown_2.29     grid_4.5.0         evaluate_1.0.3    
## [17] jquerylib_0.1.4    tibble_3.2.1       fastmap_1.2.0      yaml_2.3.10       
## [21] lifecycle_1.0.4    compiler_4.5.0     RColorBrewer_1.1-3 pkgconfig_2.0.3   
## [25] mgcv_1.9-1         rstudioapi_0.17.1  lattice_0.22-6     farver_2.1.2      
## [29] digest_0.6.37      R6_2.6.1           tidyselect_1.2.1   splines_4.5.0     
## [33] pillar_1.10.2      magrittr_2.0.3     Matrix_1.7-3       bslib_0.9.0       
## [37] withr_3.0.2        tools_4.5.0        gtable_0.3.6       cachem_1.1.0