# Load the dplyr library
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(ggplot2)
mpg <- read_delim("C:/Users/kondo/OneDrive/Desktop/INTRO to Statistics and R/Data Set and work/data.csv", delim = ";", show_col_types = FALSE)
view(mpg)
  1. Three columns (or values) in your data which are unclear until you read the documentation

Below are the columns in the data that were unclear to me until I read the documentation:

  1. Column “Previous qualification”: This column holds each student’s previous educational qualification, but what each value represents is not clear without the documentation. Different institutions and datasets use different encoding schemes for educational qualifications, so the documentation is needed to interpret these values.

  2. Column “Previous qualification (grade)”: This column holds the grade or score associated with each student’s previous qualification. As with the column above, the encoding of these grades is not obvious without the documentation, and understanding the grading system used in this dataset is necessary for accurate analysis.

  3. Column “Course”: This column identifies the academic program or course each student is enrolled in, but the codes are not self-explanatory. Without the documentation, there is no way to tell which course or major each code corresponds to.

The dataset creators may have chosen to encode the data in this way to preserve privacy and reduce the risk of disclosing sensitive information about individual students. Encoding categorical variables like educational qualifications and courses into numerical codes is a common practice in data anonymization.
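To make the encoding point concrete, here is a minimal sketch of how coded columns could be inspected and then decoded with a lookup table. It runs on a small toy table rather than the real data, and both the codes and the labels (`label A`, `label B`, `label C`) are invented for illustration; the actual mapping has to come from the dataset documentation. The names `students`, `qualification_lookup`, and `decoded` are my own.

```r
library(dplyr)

# Toy data: the codes below are invented for illustration only.
students <- tibble(
  `Previous qualification` = c(1, 1, 2, 3, 1),
  Course = c(10, 20, 10, 30, 20)
)

# Step 1: list the distinct codes and how often each one occurs.
code_counts <- students %>%
  count(`Previous qualification`, name = "n_students")

# Step 2: attach readable labels from a lookup table.
# HYPOTHETICAL mapping; replace it with the one given in the documentation.
qualification_lookup <- tibble(
  `Previous qualification` = c(1, 2, 3),
  qualification_label = c("label A", "label B", "label C")
)

decoded <- students %>%
  left_join(qualification_lookup, by = "Previous qualification")

print(code_counts)
print(decoded)
```

Keeping the mapping in a separate lookup table, rather than overwriting the codes, leaves the original column intact so the decoding can be checked against the documentation later.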

  2. At least one element of your data that is unclear even after reading the documentation. You may need to do some digging, but is there anything about the data that your documentation does not explain?

“Target”: The documentation says this column defines a three-category classification task with the values “dropout,” “enrolled,” and “graduate,” but it does not explain the criteria (for example, grade thresholds or other conditions) that place a student in each category.
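One thing that can be checked directly is whether the column contains only the three documented categories. The sketch below does this on a made-up vector standing in for the “Target” column, so the values and the resulting shares are illustrative only. The comparison is case-insensitive because I am assuming the capitalization in the data may differ from the documentation.

```r
# Toy stand-in for the Target column; these values are made up.
target <- c("Graduate", "Dropout", "Enrolled", "Graduate", "Dropout")

# Categories named in the documentation (compared case-insensitively).
documented <- c("dropout", "enrolled", "graduate")

# Any value outside the documented set would point to an undocumented category.
undocumented <- setdiff(unique(tolower(target)), documented)

# Share of students in each category.
target_share <- prop.table(table(target))

print(undocumented)
print(round(target_share, 2))
```

An empty `undocumented` result would confirm that the labels match the documentation, though it still says nothing about the rules used to assign them.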

To illustrate this issue, we can plot the distribution of the “Target” feature as a bar chart and annotate it to flag the uncertainty:

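target_counts <- table(mpg$Target)
# Leave headroom above the tallest bar so the annotation sits clear of the bars
bar_mids <- barplot(target_counts, col = "skyblue", main = "Distribution of Target Categories", xlab = "Target Category", ylab = "Count", ylim = c(0, max(target_counts) * 1.2))
text(bar_mids[2], max(target_counts) * 1.1, "Unclear", col = "red", cex = 1.2)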

ggplot(mpg, aes(x = Target)) +
  geom_bar() +
  labs(title = "Distribution of Target",
       x = "Target",
       y = "Count")

ggplot(mpg, aes(x = `Curricular units 2nd sem (grade)`, y = Target)) +
  geom_point(aes(color = Target)) +
  labs(title = "Scatter Plot: Curricular Units 2nd Sem Grade vs. Target",
       x = "Curricular Units 2nd Sem Grade",
       y = "Target") +
  theme(legend.position = "none") 
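Because the scatter plot places a continuous grade against a categorical outcome, many points overlap and the plot can be hard to read. A per-category summary of the grades could complement it. The sketch below uses a small toy table with invented grades and outcomes, not the real data; only the column names are taken from the dataset. If the grade ranges of the categories overlap heavily in the real data, that would suggest grade alone does not determine “Target”.

```r
library(dplyr)

# Toy data: grades and outcomes are invented for illustration only.
students <- tibble(
  `Curricular units 2nd sem (grade)` = c(0, 10, 12, 13, 14, 11),
  Target = c("Dropout", "Dropout", "Enrolled", "Graduate", "Graduate", "Enrolled")
)

# Summarise the grade distribution within each Target category.
grade_by_target <- students %>%
  group_by(Target) %>%
  summarise(
    n_students = n(),
    min_grade = min(`Curricular units 2nd sem (grade)`),
    median_grade = median(`Curricular units 2nd sem (grade)`),
    max_grade = max(`Curricular units 2nd sem (grade)`)
  )

print(grade_by_target)
```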

To address this issue, it’s essential to seek clarification from the data source or provider to understand the criteria and rules used for classifying students into these categories. Documenting these criteria and conditions in the dataset documentation would reduce the ambiguity and mitigate potential risks associated with misinterpretation.