Student Data Analysis

This report analyzes how the first names of students in our Data Science class are distributed by first letter.

Step 1: Creating the student data frame

Create a data frame containing each student's first name, last name, and mail ID:

students_data <- data.frame(
  full_name = c(
    "Gundapuneni, Srividya",
    "Leleina, Boaz",
    "Liu, Zehua",
    "Machiraju, Phaneendra sarath kumar",
    "Momanyi, Pauline nyaboke",
    "Mondi, Bhargav",
    "Muchiri, Dennis mwangi",
    "Muchue, Victoria Wanjiru",
    "Nalla, Lohith reddy",
    "Ndisya, Richard mulu",
    "Njoroge, John kiragu",
    "Oluko, Eddy kingstone",
    "Patel, Falgun yogeshkumar",
    "Putheti, Vamsi krishna",
    "Qiao, Feifan",
    "Sagwa, Susan",
    "Vemula, Harshith sai",
    "Vemula, Sai deepthi",
    "Wanjira, Janet gathoni"
  )
)

extract_names <- function(full_name) {
  parts <- strsplit(full_name, ", ")[[1]]
  last_name <- parts[1]
  first_name <- strsplit(parts[2], " ")[[1]][1]
  return(c(first_name, last_name))
}
names_list <- t(sapply(students_data$full_name, extract_names))
students_data$first_name <- names_list[, 1]
students_data$last_name <- names_list[, 2]
students_data$mail_id <- paste0(
  tolower(substr(students_data$first_name, 1, 1)),
  tolower(students_data$last_name),
  "@student.edu"
)
students_df <- students_data[, c("first_name", "last_name", "mail_id")]
knitr::kable(students_df, caption = "Student Information")
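
Student Information
first_name  last_name    mail_id
Srividya    Gundapuneni  sgundapuneni@student.edu
Boaz        Leleina      bleleina@student.edu
Zehua       Liu          zliu@student.edu
Phaneendra  Machiraju    pmachiraju@student.edu
Pauline     Momanyi      pmomanyi@student.edu
Bhargav     Mondi        bmondi@student.edu
Dennis      Muchiri      dmuchiri@student.edu
Victoria    Muchue       vmuchue@student.edu
Lohith      Nalla        lnalla@student.edu
Richard     Ndisya       rndisya@student.edu
John        Njoroge      jnjoroge@student.edu
Eddy        Oluko        eoluko@student.edu
Falgun      Patel        fpatel@student.edu
Vamsi       Putheti      vputheti@student.edu
Feifan      Qiao         fqiao@student.edu
Susan       Sagwa        ssagwa@student.edu
Harshith    Vemula       hvemula@student.edu
Sai         Vemula       svemula@student.edu
Janet       Wanjira      jwanjira@student.edu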
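
The name-splitting step can also be done without a helper function. The following is a minimal sketch, assuming every entry follows the "Last, First middle" pattern used above; the three-name sample vector is a shortened illustration rather than the full class list, and the duplicate check at the end is an added safeguard that is not part of the original analysis.

```r
# Shortened sample in the same "Last, First middle" format as full_name
full_name <- c("Liu, Zehua", "Momanyi, Pauline nyaboke", "Vemula, Sai deepthi")

# Everything before the first comma is the last name
last_name <- sub(",.*$", "", full_name)

# The first word after the comma is the first name; middle names are dropped
first_name <- sub("^[^,]*, *([^ ]+).*$", "\\1", full_name)

# Same mail ID rule as above: first initial + last name, all lower case
mail_id <- paste0(
  tolower(substr(first_name, 1, 1)),
  tolower(last_name),
  "@student.edu"
)

# Two students sharing an initial and a last name would get the same ID,
# so stop early if any ID is duplicated
stopifnot(anyDuplicated(mail_id) == 0)
```

Because sub() is vectorized, this handles the whole column in one call and avoids the sapply() and t() round trip.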

Step 2: Counting First Letters

Count how many student first names start with each letter, keeping letters with no students as zero:

first_letters <- toupper(substr(students_data$first_name, 1, 1))
all_letters <- data.frame(
  letter = LETTERS,
  count = 0
)

letter_counts <- table(first_letters)
# Copy each observed count into its row; letters that never occur stay at 0
for (i in seq_along(letter_counts)) {
  letter <- names(letter_counts)[i]
  all_letters$count[all_letters$letter == letter] <- letter_counts[letter]
}

knitr::kable(all_letters, caption = "Distribution of first letters in student names")
Distribution of first letters in student names
letter count
A 0
B 2
C 0
D 1
E 1
F 2
G 0
H 1
I 0
J 2
K 0
L 1
M 0
N 0
O 0
P 2
Q 0
R 1
S 3
T 0
U 0
V 2
W 0
X 0
Y 0
Z 1
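
The zero-filled table above can also be produced without a loop. This is a minimal sketch of one alternative, assuming base R only; the five-name vector is a shortened illustration, not the full class list.

```r
# Shortened sample of first names, used only to illustrate the technique
first_name <- c("Srividya", "Boaz", "Susan", "Bhargav", "Sai")
first_letters <- toupper(substr(first_name, 1, 1))

# Fixing the factor levels to LETTERS makes table() report a zero
# for every letter that does not occur
letter_counts <- table(factor(first_letters, levels = LETTERS))

# Convert the table into the same two-column layout as all_letters
all_letters <- data.frame(
  letter = names(letter_counts),
  count = as.integer(letter_counts)
)
```

Setting the levels up front keeps all 26 rows in alphabetical order, so no separate zero-initialized data frame is needed.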

Step 3: Visualizing the Distribution

Draw a bar plot of the number of students per first letter, with one color per letter:

library(ggplot2)

letter_plot <- ggplot(all_letters, aes(x = letter, y = count, fill = letter)) +
  geom_col(width = 0.7) +  # bar heights come straight from the count column
  geom_text(aes(label = ifelse(count > 0, count, "")), vjust = -0.5, size = 3.5) +
  scale_fill_viridis_d() +  # Colorful palette
  labs(
    title = "Count of Student Names by First Letter",
    x = "First Letter of Name",
    y = "Number of Students"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text = element_text(face = "bold"),
    legend.position = "none"
  ) +
  scale_y_continuous(breaks = seq(0, max(all_letters$count), by = 1))
print(letter_plot)

ggsave("student_name_distribution.png", letter_plot, width = 10, height = 5)

Conclusion

The analysis shows how the first letters of student first names are distributed in our class, which gives a quick view of the diversity of names in our classroom. Only 12 of the 26 letters occur as a first letter; the other 14 do not appear at all.

In particular, S is the most common starting letter, with 3 students, followed by B, F, J, P, and V with 2 students each.
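
The summary figures in this section can be recomputed from the first names listed in the Step 1 table. The sketch below assumes base R only; the variable names most_common and n_used are introduced here for illustration and do not appear in the analysis above.

```r
# First names as listed in the Step 1 table
first_name <- c(
  "Srividya", "Boaz", "Zehua", "Phaneendra", "Pauline", "Bhargav", "Dennis",
  "Victoria", "Lohith", "Richard", "John", "Eddy", "Falgun", "Vamsi",
  "Feifan", "Susan", "Harshith", "Sai", "Janet"
)

# Count first letters, including letters that never occur
counts <- table(factor(toupper(substr(first_name, 1, 1)), levels = LETTERS))

# Letter or letters with the highest count, and that count
most_common <- names(counts)[counts == max(counts)]
top_count <- max(counts)

# How many distinct letters occur at least once
n_used <- sum(counts > 0)

print(most_common)
print(top_count)
print(n_used)
```

Deriving these values in code rather than reading them off the plot keeps the written conclusion in step with the data if the class list changes.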