Write an R script to perform a practical student project distribution task.
┌───────────────────────────────┐
│ 12 Students │
│ Sorted by CGPA (High → Low) │
└───────────────────────────────┘
│
▼
┌───────────────────────────────┐
│ Divide into 4 Groups │
│ (Each group size = 3) │
└───────────────────────────────┘
│
▼
┌──────────────────┬──────────────────┬──────────────────┬───────────────────┐
│ Group 1 │ Group 2 │ Group 3 │ Group 4 │
│ Top 3 Students │ Next 3 Students │ Next 3 Students │ Bottom 3 Students │
└──────────────────┴──────────────────┴──────────────────┴───────────────────┘
│
▼
┌───────────────────────────────┐
│ FOR EACH GROUP │
│ │
│ • Take 3 students │
│ • Shuffle teachers A, B, C │
│ • Assign sequentially: │
│ Student 1 → Teacher 1 │
│ Student 2 → Teacher 2 │
│ Student 3 → Teacher 3 │
└───────────────────────────────┘
│
▼
┌───────────────────────────────┐
│ FINAL RESULT │
└───────────────────────────────┘
│
▼
┌──────────────────┬──────────────────┬──────────────────┐
│ Teacher A │ Teacher B │ Teacher C │
│ 1 Top + 1 Good │ 1 Top + 1 Good │ 1 Top + 1 Good │
│ +1 Avg +1 Weak │ +1 Avg +1 Weak │ +1 Avg +1 Weak │
└──────────────────┴──────────────────┴──────────────────┘
require(readxl)
library(writexl)
project.df<-read_excel("C:\\Users\\Dell\\OneDrive\\Desktop\\STATISTICS\\2nd Year\\R programming\\varsity\\Project_Distribution_Data.xlsx")
# View data structure
str(project.df)
head(project.df)
Step 1: Sort Data in descending order of CGPA
sorted_df <- project.df[order(project.df$CGPA, decreasing = TRUE), ]
sorted_df
Step 2: Vector of teacher initials (suppose we have currently 23 techers in active service)
teachers <- LETTERS[1:23]
teachers
Step 3: Assign students to groups sequentially, starting from 1, 2, 3… and repeating, based on the number of teachers
n_stu <- nrow(sorted_df)
n_tea <- length(teachers)
n_groups <- ceiling(n_stu / n_tea)
sorted_df$Group <- rep(1:n_groups, each = n_tea, length.out = n_stu)
sorted_df
Step 4: block-wise random assignment of teachers!
set.seed(209)
sorted_df$AssignedT <- NA
groups <- unique(sorted_df$Group) # Get all unique group numbers
for (g in groups) {
# Find indices of students in this group
idx <- which(Group == g)
n_stu_group <- length(idx)
# Randomly assign teachers to these students
# Since teachers repeat across groups, we can sample all teachers
sorted_df$AssignedT[idx] <- sample(teachers)[1:n_stu_group]
}
Step 5: Check Results
head(sorted_df, 20)
# Check distribution
table(sorted_df$AssignedT)
# Verify no teacher gets too many students
teacher_counts <- table(sorted_df$AssignedT)
summary(as.numeric(teacher_counts))
Step 6: Save/ Write df to an Excel file
write_xlsx(sorted_df, "Assigned_Projects.xlsx")
You are given a dataset containing student information including Exam Roll numbers, Sex, CGPA, Group assignment, and Assigned Teacher. Your task is to prepare this data for analysis and perform a comprehensive statistical investigation using R.
Before conducting any analysis, you must properly convert variable types in R. Explain your reasoning for each conversion in your code comments.
numericcharacter (or factor)characterfactornumericnumericnumericfactorcharacterfactorPerform the following analyses and report your findings:
Create the following plots with appropriate titles and labels:
Perform the following hypothesis tests and interpret the results:
For the multiple regression model
(CGPA ~ Sex + Group + AssignedTeacher), check:
Your R script should follow this structure:
Note: Each section in the R script should be clearly separated using comments and written in a logical, sequential manner for readability and grading.
┌───────────────────────────────────────────┐
│ START PROJECT │
│ Student Performance Dataset in R │
└───────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ LOAD PACKAGES │
│ readxl, dplyr │
└───────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ READ DATA │
│ Import Excel Student Dataset │
└───────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ PART 1: DATA PREPARATION │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Convert Variable Types │
│ │
│ ExamRoll → character (identifier) │
│ Sex → factor (categorical) │
│ CGPA → numeric (continuous) │
│ Group → factor (category) │
│ AssignedTeacher → factor (category) │
└────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ VERIFY STRUCTURE │
│ str(data) │
└───────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ PART 2A: DESCRIPTIVE STATISTICS │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Univariate Statistics (CGPA) │
│ Mean • Median • Standard Deviation │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Frequency Distributions │
│ Sex • Group • AssignedTeacher │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Group-wise CGPA Comparison │
│ CGPA by Sex (Mean, SD, Count) │
│ CGPA by Group (Mean, SD, Count) │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ PART 2B: DATA VISUALIZATION │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Distribution Plots │
│ Histogram of CGPA │
│ Density Plot of CGPA │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Categorical Bar Plots │
│ Sex • Group • AssignedTeacher │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Comparative Boxplots │
│ CGPA by Sex │
│ CGPA by Group │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ PART 2C: INFERENTIAL STATISTICS │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Mean Comparison Tests │
│ t-test : CGPA ~ Sex │
│ ANOVA : CGPA ~ Group │
│ ANOVA : CGPA ~ AssignedTeacher │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Association Tests │
│ Chi-square: Sex vs Group │
│ Chi-square: Sex vs AssignedTeacher │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ PART 2D: REGRESSION ANALYSIS │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Simple Regression Models │
│ CGPA ~ Sex │
│ CGPA ~ Group │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Multiple Regression Model │
│ CGPA ~ Sex + Group + AssignedTeacher │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ PART 3A: OUTLIER DETECTION │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ IQR Rule for CGPA │
│ Identify and Count Outliers │
│ Boxplot Visualization │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ PART 3B: REGRESSION DIAGNOSTICS │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ Diagnostic Checks │
│ Residual Plots │
│ Normality (Q-Q Plot, Shapiro-Wilk Test) │
└────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ END OF ANALYSIS │
│ Results Ready for Report │
└───────────────────────────────────────────┘
# ===================================================================
# Student Performance Analysis - R Script
# ===================================================================
# Load required packages
library(readxl)
library(dplyr)
# Read data
data <- read_excel("C:/Users/Dell/OneDrive/Desktop/STATISTICS/2nd Year/R programming/varsity/Project_Distribution.xlsx")
# Part 1: Data Preparation
# =========================
# Convert variable types
# ExamRoll: Convert to character (identifiers, not for calculations)
data$ExamRoll <- as.character(data$ExamRoll)
# Sex: Convert to factor (categorical variable with fixed levels)
data$Sex <- as.factor(data$Sex)
# CGPA: Keep as numeric (continuous quantitative variable)
# No conversion needed
# Group: Convert to factor (class categories, not measurable quantities)
data$Group <- as.factor(data$Group)
# AssignedTeacher: Convert to factor (teacher names as categorical groups)
data$AssignedTeacher <- as.factor(data$AssignedTeacher)
# Check conversions
str(data)
tibble [80 × 5] (S3: tbl_df/tbl/data.frame)
$ ExamRoll : chr [1:80] "2316669" "2316632" "2316615" "2316643" ...
$ Sex : Factor w/ 2 levels "F","M": 1 2 2 2 1 1 2 2 2 1 ...
$ CGPA : num [1:80] 3.99 3.95 3.94 3.94 3.91 ...
$ Group : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ AssignedTeacher: Factor w/ 23 levels "A","B","C","D",..: 2 15 3 6 13 20 11 9 5 7 ...
# Part 2A: Descriptive Statistics
# ===============================
# 1. Univariate Statistics for CGPA
cat("CGPA Statistics:\n")
CGPA Statistics:
cat("Mean:", mean(data$CGPA), "\n")
Mean: 3.415225
cat("Median:", median(data$CGPA), "\n")
Median: 3.484
cat("Standard Deviation:", sd(data$CGPA), "\n\n")
Standard Deviation: 0.4411517
# 2. Frequency Distributions
cat("Frequency Tables:\n")
Frequency Tables:
cat("Sex:\n")
Sex:
print(table(data$Sex))
F M
30 50
cat("\nGroup:\n")
Group:
print(table(data$Group))
1 2 3 4
23 23 23 11
cat("\nAssignedTeacher:\n")
AssignedTeacher:
print(table(data$AssignedTeacher))
A B C D E FT G H I J K L MT N O P Q R S T U V W
4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3
cat("\n")
# 3. Group Comparisons
cat("CGPA by Sex:\n")
CGPA by Sex:
print(data %>% group_by(Sex) %>%
summarise(Mean = mean(CGPA), SD = sd(CGPA), Count = n()))
cat("\nCGPA by Group:\n")
CGPA by Group:
print(data %>% group_by(Group) %>%
summarise(Mean = mean(CGPA), SD = sd(CGPA), Count = n()))
# Part 2B: Data Visualization
# ============================
# 1. Distribution Plots
# Histogram of CGPA
hist(data$CGPA, main = "Histogram of CGPA", xlab = "CGPA", ylab = "Frequency")
# Density plot of CGPA
plot(density(data$CGPA), main = "Density Plot of CGPA", xlab = "CGPA", ylab = "Density")
# 2. Categorical Data Plots
# Bar plot of Sex
barplot(table(data$Sex), main = "Count by Sex", xlab = "Sex", ylab = "Count")
# Bar plot of Group
barplot(table(data$Group), main = "Count by Group", xlab = "Group", ylab = "Count")
# Bar plot of AssignedTeacher
barplot(table(data$AssignedTeacher), main = "Count by Assigned Teacher",
xlab = "Teacher", ylab = "Count", las = 2)
# 3. Comparative Plots
# Boxplot of CGPA by Sex
boxplot(CGPA ~ Sex, data = data, main = "CGPA by Sex", xlab = "Sex", ylab = "CGPA")
# Boxplot of CGPA by Group
boxplot(CGPA ~ Group, data = data, main = "CGPA by Group", xlab = "Group", ylab = "CGPA")
# Part 2C: Inferential Statistics
# ===============================
# 1. Mean Comparisons
cat("\nT-test: CGPA by Sex\n")
T-test: CGPA by Sex
print(t.test(CGPA ~ Sex, data = data))
Welch Two Sample t-test
data: CGPA by Sex
t = 1.7854, df = 70.874, p-value = 0.07848
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
-0.01995516 0.36152849
sample estimates:
mean in group F mean in group M
3.521967 3.351180
cat("\nANOVA: CGPA by Group\n")
ANOVA: CGPA by Group
print(summary(aov(CGPA ~ Group, data = data)))
Df Sum Sq Mean Sq F value Pr(>F)
Group 3 13.670 4.557 203.1 <2e-16 ***
Residuals 76 1.705 0.022
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
cat("\nANOVA: CGPA by AssignedTeacher\n")
ANOVA: CGPA by AssignedTeacher
print(summary(aov(CGPA ~ AssignedTeacher, data = data)))
Df Sum Sq Mean Sq F value Pr(>F)
AssignedTeacher 22 1.78 0.08091 0.339 0.997
Residuals 57 13.60 0.23850
# 2. Association Tests
cat("\nChi-square: Sex vs Group\n")
Chi-square: Sex vs Group
print(chisq.test(table(data$Sex, data$Group)))
Pearson's Chi-squared test
data: table(data$Sex, data$Group)
X-squared = 4.0095, df = 3, p-value = 0.2604
cat("\nChi-square: Sex vs AssignedTeacher\n")
Chi-square: Sex vs AssignedTeacher
print(chisq.test(table(data$Sex, data$AssignedTeacher)))
Pearson's Chi-squared test
data: table(data$Sex, data$AssignedTeacher)
X-squared = 25.244, df = 22, p-value = 0.2855
# Part 2D: Regression Analysis
# ============================
# Simple Linear Regression
cat("\nRegression: CGPA ~ Sex\n")
Regression: CGPA ~ Sex
print(summary(lm(CGPA ~ Sex, data = data)))
Call:
lm(formula = CGPA ~ Sex, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.29818 -0.22513 0.09743 0.28478 0.60182
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.5220 0.0796 44.244 <2e-16 ***
SexM -0.1708 0.1007 -1.696 0.0938 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.436 on 78 degrees of freedom
Multiple R-squared: 0.03557, Adjusted R-squared: 0.02321
F-statistic: 2.877 on 1 and 78 DF, p-value: 0.09384
cat("\nRegression: CGPA ~ Group\n")
Regression: CGPA ~ Group
print(summary(lm(CGPA ~ Group, data = data)))
Call:
lm(formula = CGPA ~ Group, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.48527 -0.07246 0.00333 0.10026 0.44173
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.83774 0.03123 122.887 < 2e-16 ***
Group2 -0.26913 0.04417 -6.094 4.23e-08 ***
Group3 -0.57900 0.04417 -13.110 < 2e-16 ***
Group4 -1.29947 0.05491 -23.668 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1498 on 76 degrees of freedom
Multiple R-squared: 0.8891, Adjusted R-squared: 0.8847
F-statistic: 203.1 on 3 and 76 DF, p-value: < 2.2e-16
# Multiple Regression
cat("\nMultiple Regression: CGPA ~ Sex + Group + AssignedTeacher\n")
Multiple Regression: CGPA ~ Sex + Group + AssignedTeacher
full_model <- lm(CGPA ~ Sex + Group + AssignedTeacher, data = data)
print(summary(full_model))
Call:
lm(formula = CGPA ~ Sex + Group + AssignedTeacher, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.37629 -0.07189 0.01049 0.06907 0.29031
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.697292 0.079024 46.787 < 2e-16 ***
SexM 0.053618 0.042178 1.271 0.2092
Group2 -0.278455 0.043500 -6.401 4.14e-08 ***
Group3 -0.588325 0.043500 -13.525 < 2e-16 ***
Group4 -1.321622 0.058344 -22.652 < 2e-16 ***
AssignedTeacherB 0.225250 0.102817 2.191 0.0329 *
AssignedTeacherC 0.183846 0.103356 1.779 0.0810 .
AssignedTeacherD 0.066154 0.103356 0.640 0.5249
AssignedTeacherE 0.093191 0.104957 0.888 0.3786
AssignedTeacherFT 0.033096 0.103356 0.320 0.7501
AssignedTeacherG 0.041346 0.103356 0.400 0.6907
AssignedTeacherH 0.130846 0.103356 1.266 0.2111
AssignedTeacherI 0.007691 0.104957 0.073 0.9419
AssignedTeacherJ 0.260404 0.103356 2.519 0.0148 *
AssignedTeacherK 0.264250 0.102817 2.570 0.0130 *
AssignedTeacherL 0.005017 0.114247 0.044 0.9651
AssignedTeacherMT 0.202096 0.111863 1.807 0.0765 .
AssignedTeacherN -0.007777 0.112184 -0.069 0.9450
AssignedTeacherO 0.175557 0.112184 1.565 0.1236
AssignedTeacherP 0.143302 0.113299 1.265 0.2115
AssignedTeacherQ 0.021557 0.112184 0.192 0.8484
AssignedTeacherR 0.193096 0.111863 1.726 0.0901 .
AssignedTeacherS 0.087351 0.114247 0.765 0.4479
AssignedTeacherT 0.113557 0.112184 1.012 0.3160
AssignedTeacherU 0.023890 0.112184 0.213 0.8322
AssignedTeacherV 0.165351 0.114247 1.447 0.1537
AssignedTeacherW 0.211429 0.111863 1.890 0.0642 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1454 on 53 degrees of freedom
Multiple R-squared: 0.9271, Adjusted R-squared: 0.8914
F-statistic: 25.93 on 26 and 53 DF, p-value: < 2.2e-16
# Part 3A: Outlier Detection
# ==========================
# Detect outliers using IQR rule
Q1 <- quantile(data$CGPA, 0.25)
Q3 <- quantile(data$CGPA, 0.75)
IQR_val <- Q3 - Q1
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
outliers <- data[data$CGPA < lower | data$CGPA > upper, ]
cat("\nOutliers detected:", nrow(outliers), "\n")
Outliers detected: 4
# Visualize outliers
boxplot(data$CGPA, main = "Boxplot of CGPA", ylab = "CGPA")
# Part 3B: Regression Diagnostics
# ===============================
# Diagnostic plots for full model
par(mfrow = c(2, 2))
plot(full_model)
par(mfrow = c(1, 1))
# Shapiro-Wilk test
cat("\nShapiro-Wilk Test on Residuals:\n")
Shapiro-Wilk Test on Residuals:
print(shapiro.test(residuals(full_model)))
Shapiro-Wilk normality test
data: residuals(full_model)
W = 0.97797, p-value = 0.1824