Data Dive 4


- A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.
  - E.g., this could be a column name, or just some value inside a cell of your data.
  - Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?
- At least one element of your data that is unclear even after reading the documentation.
  - You may need to do some digging, but is there anything about the data that your documentation does not explain?
- Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear.
  - You can use color or an annotation, but also make sure to explain your thoughts using Markdown.
  - Do you notice any significant risks? If so, what could you do to reduce negative consequences?

We start by loading all the packages and setting the working directory.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We load the data into a data frame named ‘data’.

# Load the semicolon-delimited dataset
data <- read_delim("data.csv", delim = ";")

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The column names in the data frame:

##  [1] "Marital status"                                
##  [2] "Application mode"                              
##  [3] "Application order"                             
##  [4] "Course"                                        
##  [5] "Daytime/evening attendance\t"                  
##  [6] "Previous qualification"                        
##  [7] "Previous qualification (grade)"                
##  [8] "Nacionality"                                   
##  [9] "Mother's qualification"                        
## [10] "Father's qualification"                        
## [11] "Mother's occupation"                           
## [12] "Father's occupation"                           
## [13] "Admission grade"                               
## [14] "Displaced"                                     
## [15] "Educational special needs"                     
## [16] "Debtor"                                        
## [17] "Tuition fees up to date"                       
## [18] "Gender"                                        
## [19] "Scholarship holder"                            
## [20] "Age at enrollment"                             
## [21] "International"                                 
## [22] "Curricular units 1st sem (credited)"           
## [23] "Curricular units 1st sem (enrolled)"           
## [24] "Curricular units 1st sem (evaluations)"        
## [25] "Curricular units 1st sem (approved)"           
## [26] "Curricular units 1st sem (grade)"              
## [27] "Curricular units 1st sem (without evaluations)"
## [28] "Curricular units 2nd sem (credited)"           
## [29] "Curricular units 2nd sem (enrolled)"           
## [30] "Curricular units 2nd sem (evaluations)"        
## [31] "Curricular units 2nd sem (approved)"           
## [32] "Curricular units 2nd sem (grade)"              
## [33] "Curricular units 2nd sem (without evaluations)"
## [34] "Unemployment rate"                             
## [35] "Inflation rate"                                
## [36] "GDP"                                           
## [37] "Target"

Many columns hold categorical values that have been encoded as numbers. This encoding helps with the classification problem the dataset was built for, but for our analysis we need to plot those categories and compare them with the numerical columns. To do that, we map each code to its description from the variable table on the dataset’s website: https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success

The three selected columns, all categorical columns that were encoded as numbers:

## # A tibble: 4,424 × 3
##    `Marital status` Nacionality Course
##               <dbl>       <dbl>  <dbl>
##  1                1           1    171
##  2                1           1   9254
##  3                1           1   9070
##  4                1           1   9773
##  5                2           1   8014
##  6                2           1   9991
##  7                1           1   9500
##  8                1           1   9254
##  9                1          62   9238
## 10                1           1   9238
## # ℹ 4,414 more rows

Because these column names contain whitespace and a spelling mistake ("Nacionality"), and all the categorical columns were converted to numeric codes, the descriptions were needed to understand the data.

For example, the nationality column was converted to numeric codes; below I have mapped the numeric values to the appropriate descriptions.

The same applies to Marital status and Course.
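The mapping itself is not shown in this write-up, so here is a minimal sketch of one way to do it in base R. It uses the first ten rows printed above as toy data. The label vectors are my transcription of the UCI variable table and should be checked against it; `decode`, `marital_labels`, and `nationality_labels` are hypothetical names introduced for illustration, and the nationality lookup is deliberately partial.

```r
# Toy sample: the first ten rows of two of the selected columns, as printed above.
sample_rows <- data.frame(
  marital_status = c(1, 1, 1, 1, 2, 2, 1, 1, 1, 1),
  nacionality    = c(1, 1, 1, 1, 1, 1, 1, 1, 62, 1)
)

# Code-to-label lookups (assumed transcription of the UCI variable table).
marital_labels <- c(
  "1" = "single", "2" = "married", "3" = "widower",
  "4" = "divorced", "5" = "facto union", "6" = "legally separated"
)
nationality_labels <- c("1" = "Portuguese", "62" = "Romanian")  # partial on purpose

# Index the named vector by the code (as character) to get its label.
decode <- function(codes, labels) {
  unname(labels[as.character(codes)])
}

sample_rows$marital_status_label <- decode(sample_rows$marital_status, marital_labels)
sample_rows$nacionality_label    <- decode(sample_rows$nacionality, nationality_labels)

# Codes missing from a lookup come back as NA, which makes gaps easy to spot.
print(sample_rows)
```

A named-vector lookup keeps the codes and their descriptions in one place, so a wrong or missing label shows up as `NA` instead of silently passing through.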

Because this is a classification dataset, the authors likely needed to convert all the categorical columns to numeric codes. Without the descriptions, these codes are easy to mistake for continuous values, e.g. in gender or marital status.

The attribute names contain whitespace and capital letters, which could cause issues during data analysis. Apart from that, the dataset and the classification problem were published in a peer-reviewed journal. The cleaned column names are:
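To make the risk concrete, the sketch below uses the ten Course codes printed earlier. R will happily average them as if they were measurements, even though the result has no meaning; treating them as categories gives counts instead. The variable names are illustrative only.

```r
# Course codes from the first ten rows printed earlier.
course <- c(171, 9254, 9070, 9773, 8014, 9991, 9500, 9254, 9238, 9238)

# R computes this without complaint, but the result is not a meaningful
# quantity: the codes are identifiers, not measurements.
mean_code <- mean(course)

# Treating the codes as categories gives something interpretable:
# the number of students per course code.
course_counts <- table(factor(course))

print(mean_code)
print(course_counts)
```

Because the columns are read in as `dbl`, nothing in the data itself signals that summary statistics are inappropriate; only the documentation does.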

##  [1] "marital_status"                                
##  [2] "application_mode"                              
##  [3] "application_order"                             
##  [4] "course"                                        
##  [5] "daytime_evening_attendance"                    
##  [6] "previous_qualification"                        
##  [7] "previous_qualification__grade_"                
##  [8] "nacionality"                                   
##  [9] "mother_s_qualification"                        
## [10] "father_s_qualification"                        
## [11] "mother_s_occupation"                           
## [12] "father_s_occupation"                           
## [13] "admission_grade"                               
## [14] "displaced"                                     
## [15] "educational_special_needs"                     
## [16] "debtor"                                        
## [17] "tuition_fees_up_to_date"                       
## [18] "gender"                                        
## [19] "scholarship_holder"                            
## [20] "age_at_enrollment"                             
## [21] "international"                                 
## [22] "curricular_units_1st_sem__credited_"           
## [23] "curricular_units_1st_sem__enrolled_"           
## [24] "curricular_units_1st_sem__evaluations_"        
## [25] "curricular_units_1st_sem__approved_"           
## [26] "curricular_units_1st_sem__grade_"              
## [27] "curricular_units_1st_sem__without_evaluations_"
## [28] "curricular_units_2nd_sem__credited_"           
## [29] "curricular_units_2nd_sem__enrolled_"           
## [30] "curricular_units_2nd_sem__evaluations_"        
## [31] "curricular_units_2nd_sem__approved_"           
## [32] "curricular_units_2nd_sem__grade_"              
## [33] "curricular_units_2nd_sem__without_evaluations_"
## [34] "unemployment_rate"                             
## [35] "inflation_rate"                                
## [36] "gdp"                                           
## [37] "target"
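The cleaning step is not shown in this write-up. The names above are consistent with trimming whitespace, lower-casing, and replacing every remaining non-alphanumeric character with an underscore, so a minimal base R sketch under that assumption looks like this (`clean_names` is a hypothetical helper here, not necessarily the function that was used):

```r
# Hypothetical helper that reproduces the naming pattern shown above.
clean_names <- function(x) {
  x <- trimws(x)             # drop leading/trailing whitespace, including tabs
  x <- tolower(x)            # lower-case everything
  gsub("[^a-z0-9]", "_", x)  # every other character becomes an underscore
}

original <- c(
  "Marital status",
  "Daytime/evening attendance\t",
  "Previous qualification (grade)",
  "Mother's qualification"
)

cleaned <- clean_names(original)
print(cleaned)

# On the full data frame this would be applied as:
# names(data) <- clean_names(names(data))
```

This rule explains the double underscore in names such as previous_qualification__grade_: the space and the opening parenthesis are each replaced separately.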

The only negative consequence of poorly written attribute names is felt during data analysis: names with spaces, capital letters, or stray whitespace have to be typed and quoted exactly, which slows the work down and makes it error-prone.
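One concrete instance is column 5 of the raw data, whose name ends in a tab character. The sketch below uses a one-row stand-in for the data frame (`raw` and the other variable names are illustrative) to show that the name as it appears on screen does not match any column, and how to list names with leading or trailing whitespace.

```r
# One-row stand-in for the raw data, keeping the original names untouched.
raw <- setNames(
  data.frame(1, 1),
  c("Daytime/evening attendance\t", "Marital status")
)

visible_name <- "Daytime/evening attendance"

# The name as it looks on screen is not actually a column.
found <- visible_name %in% names(raw)
value <- raw[[visible_name]]  # NULL for a missing column, with no error raised

# Listing names that start or end with whitespace exposes the culprit.
suspect <- names(raw)[grepl("^[[:space:]]|[[:space:]]$", names(raw))]

print(found)
print(is.null(value))
print(suspect)
```

Running a check like this before analysis is a cheap way to catch invisible characters that would otherwise surface later as confusing missing-column results.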


Pritesh Shah

2023-09-24
