Q1- A list of at least 3 columns (or values) in your data which are unclear until you read the documentation

*Application Mode: The column contains numeric values (e.g., 1, 2, 5, 7) that represent various application modes, but without the documentation, it would be unclear what each of these numeric codes corresponds to. The documentation provides a clear description of each code.

*Previous Qualification: This column contains numeric values (e.g., 1, 2, 3) representing different levels of education, but it would be unclear without the documentation what each numeric code signifies. The documentation clarifies the educational levels associated with these codes.

*Age at Enrollment: While the column name is somewhat self-explanatory, the documentation is necessary to confirm what “Age of student at enrollment” actually means and how it’s calculated.
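
As a minimal sketch of how the documentation resolves these codes, the snippet below attaches text labels to a numeric column using a lookup table. The sample values are hypothetical, and the pairing of codes 1, 2, and 15 with these labels is an assumption that should be checked against the documentation.

```r
# Hypothetical sample of the Application Mode column (numeric codes only)
application_mode <- c(1, 15, 2, 1, 15, 1)

# Code-to-label lookup; the code/label pairs are assumed, verify them in the documentation
mode_labels <- c(
  "1"  = "1st phase - general contingent",
  "2"  = "Ordinance No. 612/93",
  "15" = "International student (bachelor)"
)

# Convert the codes to a labelled factor so tables and plots show readable names
application_mode_f <- factor(application_mode,
                             levels = as.numeric(names(mode_labels)),
                             labels = unname(mode_labels))

# Frequency of each labelled mode
table(application_mode_f)
```

Without the lookup table, the same frequency table would only show 1, 2, and 15, which is exactly the ambiguity described above.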

Regarding why the data was encoded this way:

Encoding data with numeric values instead of text labels can help reduce storage space and potentially improve processing speed, as numeric values are more efficient to handle in computational tasks.

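Numeric encodings can also be useful for statistical analysis and machine learning, as many algorithms require numeric input. For categorical codes like Application Mode, however, the numbers are only labels: their order and magnitude carry no meaning, so they usually need to be treated as categories (for example, as factors in R) before modeling.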

Without the documentation:

Interpreting the numeric values in these columns would be challenging and could lead to misinterpretations or errors in analysis.

Researchers and analysts might assume meanings for the numeric codes that are wrong, for example reading them as an ordered scale, and draw incorrect conclusions or make poorly informed decisions as a result.

It’s crucial to rely on the documentation to ensure accurate interpretation and meaningful analysis of the data.
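
A short illustration of this risk, using hypothetical code values: R will happily do arithmetic on category codes, even though the result has no interpretation. Converting the column to a factor makes the categorical nature explicit.

```r
# Hypothetical application-mode codes for five students
codes <- c(1, 1, 15, 39, 43)

# Misleading: the mean of category codes is computable but meaningless
mean(codes)

# Safer: treat the codes as unordered categories
codes_f <- factor(codes)

# Counts per category are meaningful; arithmetic on the factor is no longer possible
table(codes_f)
is.numeric(codes_f)
```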

Q2- At least one element of your data that is unclear even after reading the documentation

Application mode: This column records how each student applied, stored as integer codes. The documentation lists each code and its description, such as “1st phase - general contingent,” “Ordinance No. 612/93,” and “International student (bachelor),” among many others. However, it does not explain why these application modes exist, how a student ends up in one of them, or how the institution uses the information. More context on what each mode means in practice, and on how it affects a student’s educational journey, would make this column’s relevance within the dataset much clearer.

Q3- Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear.

library(ggplot2)

# Load the dataset
data <- read.csv('./Downloads/students_dropout_and_academic_success.csv')

# The column stores numeric codes, not text labels, so the mode is selected by code.
# Assumption: code 15 is "International student (bachelor)"; check the documentation.
selected_code <- 15
data$highlight <- data$Application_mode == selected_code

# Bar chart of application-mode frequencies, with the selected mode drawn in red
ggplot(data, aes(x = factor(Application_mode), fill = highlight)) +
  geom_bar() +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "red"), guide = "none") +
  labs(title = "Distribution of Application Modes",
       x = "Application Mode (numeric code)",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  # Label the highlighted bar; y = Inf pins the text to the top of the panel
  annotate("text", x = as.character(selected_code), y = Inf, vjust = 1.5,
           label = "Unclear Mode", color = "red", size = 4)

We’ve created a bar chart that shows each application mode (as its numeric code) on the x-axis and its frequency on the y-axis. The light blue bars represent the various application modes, and the “International student (bachelor)” mode is marked in red so that it stands out within the chart.

What is Unclear:

The issue with this column is that the documentation doesn’t provide clear information about why students select specific application modes or the significance of these modes in the educational context. For example, it’s unclear why some students choose the “International student (bachelor)” mode and how it differs from other modes. Without this context, it’s challenging to interpret the data related to this application mode fully.

Significant Risks:

The significant risk here is that decisions or analyses based on this column may lack a comprehensive understanding of why certain modes are chosen. This could lead to misinterpretations or incorrect conclusions, especially if the choice of application mode has a substantial impact on the students’ educational experiences or outcomes.
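
One way to check whether application mode is associated with outcomes is a simple cross-tabulation. The sketch below uses an invented toy data frame; the `outcome` column name and every value in it are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data: application-mode codes and an invented outcome column
toy <- data.frame(
  application_mode = c(1, 1, 1, 15, 15, 39, 39, 39),
  outcome = c("Graduate", "Graduate", "Dropout", "Graduate",
              "Dropout", "Dropout", "Dropout", "Graduate")
)

# Count each outcome within each application mode
counts <- table(toy$application_mode, toy$outcome)

# Row-wise proportions: the share of each outcome within a mode
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

If the shares differ noticeably between modes in the real data, the missing context about what each mode means becomes more consequential for any conclusion drawn from this column.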

Risk Mitigation:

To reduce the negative consequences of this issue, it’s essential to gather additional context from relevant stakeholders or educational institutions. Interviews, surveys, or consultations with students who have chosen different application modes could provide insights into their motivations and how these modes affect their educational journey. This additional qualitative information can complement the quantitative data and lead to more informed analyses and decision-making.