Stage 1: Data Proposal and Probability Analysis Report

Authors

Ece Kurtoğlu and Halil Rıfat Başbuğ

Introduction

This Stage 1 report focuses on prediction. The project uses two different real-world datasets. The first dataset is used for a regression prediction problem because the target variable is continuous. The second dataset is used for a classification prediction problem because the target variable is binary.

The main purpose of this report is to import, clean, and explore both datasets before building predictive models in the next stages.

Dataset 1: Regression Dataset

Dataset Description and Source

The first dataset is the Student Performance Factors dataset. It contains information about students’ study habits, attendance, previous scores, family background, school-related factors, and final exam scores. The target variable is Exam_Score, which is a continuous numeric variable.

This dataset is relevant to economics because education is closely related to human capital. Predicting student performance can help schools and policymakers understand which factors may be useful for identifying students who need academic support.

Source: Kaggle / Student Performance Factors dataset.

The raw dataset contains 6,607 observations and 20 variables, including the target variable, which is well above the project minimum of 500 observations and 5 variables.

Economic Question

Which student characteristics best predict exam scores?

# Load packages (read_csv, ggplot2, dplyr; clean_names) and import the student performance dataset
library(tidyverse)
library(janitor)
student_raw <- read_csv("StudentPerformanceFactors.csv")

Rows: 6607 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Parental_Involvement, Access_to_Resources, Extracurricular_Activit...
dbl  (7): Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, Tutoring_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(student_raw)

Rows: 6,607
Columns: 20
$ Hours_Studied              <dbl> 23, 19, 24, 29, 19, 19, 29, 25, 17, 23, 17,…
$ Attendance                 <dbl> 84, 64, 98, 89, 92, 88, 84, 78, 94, 98, 80,…
$ Parental_Involvement       <chr> "Low", "Low", "Medium", "Low", "Medium", "M…
$ Access_to_Resources        <chr> "High", "Medium", "Medium", "Medium", "Medi…
$ Extracurricular_Activities <chr> "No", "No", "Yes", "Yes", "Yes", "Yes", "Ye…
$ Sleep_Hours                <dbl> 7, 8, 7, 8, 6, 8, 7, 6, 6, 8, 8, 6, 8, 8, 8…
$ Previous_Scores            <dbl> 73, 59, 91, 98, 65, 89, 68, 50, 80, 71, 88,…
$ Motivation_Level           <chr> "Low", "Low", "Medium", "Medium", "Medium",…
$ Internet_Access            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ Tutoring_Sessions          <dbl> 0, 2, 2, 1, 3, 3, 1, 1, 0, 0, 4, 2, 2, 2, 1…
$ Family_Income              <chr> "Low", "Medium", "Medium", "Medium", "Mediu…
$ Teacher_Quality            <chr> "Medium", "Medium", "Medium", "Medium", "Hi…
$ School_Type                <chr> "Public", "Public", "Public", "Public", "Pu…
$ Peer_Influence             <chr> "Positive", "Negative", "Neutral", "Negativ…
$ Physical_Activity          <dbl> 3, 4, 4, 4, 4, 3, 2, 2, 1, 5, 4, 2, 4, 3, 4…
$ Learning_Disabilities      <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ Parental_Education_Level   <chr> "High School", "College", "Postgraduate", "…
$ Distance_from_Home         <chr> "Near", "Moderate", "Near", "Moderate", "Ne…
$ Gender                     <chr> "Male", "Female", "Male", "Male", "Female",…
$ Exam_Score                 <dbl> 67, 61, 74, 71, 70, 71, 67, 66, 69, 72, 68,…

nrow(student_raw)

[1] 6607

# Clean variable names and remove missing values
student_clean <- student_raw |>
  clean_names() |>
  drop_na()

glimpse(student_clean)

Rows: 6,378
Columns: 20
$ hours_studied              <dbl> 23, 19, 24, 29, 19, 19, 29, 25, 17, 23, 17,…
$ attendance                 <dbl> 84, 64, 98, 89, 92, 88, 84, 78, 94, 98, 80,…
$ parental_involvement       <chr> "Low", "Low", "Medium", "Low", "Medium", "M…
$ access_to_resources        <chr> "High", "Medium", "Medium", "Medium", "Medi…
$ extracurricular_activities <chr> "No", "No", "Yes", "Yes", "Yes", "Yes", "Ye…
$ sleep_hours                <dbl> 7, 8, 7, 8, 6, 8, 7, 6, 6, 8, 8, 6, 8, 8, 8…
$ previous_scores            <dbl> 73, 59, 91, 98, 65, 89, 68, 50, 80, 71, 88,…
$ motivation_level           <chr> "Low", "Low", "Medium", "Medium", "Medium",…
$ internet_access            <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ tutoring_sessions          <dbl> 0, 2, 2, 1, 3, 3, 1, 1, 0, 0, 4, 2, 2, 2, 1…
$ family_income              <chr> "Low", "Medium", "Medium", "Medium", "Mediu…
$ teacher_quality            <chr> "Medium", "Medium", "Medium", "Medium", "Hi…
$ school_type                <chr> "Public", "Public", "Public", "Public", "Pu…
$ peer_influence             <chr> "Positive", "Negative", "Neutral", "Negativ…
$ physical_activity          <dbl> 3, 4, 4, 4, 4, 3, 2, 2, 1, 5, 4, 2, 4, 3, 4…
$ learning_disabilities      <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ parental_education_level   <chr> "High School", "College", "Postgraduate", "…
$ distance_from_home         <chr> "Near", "Moderate", "Near", "Moderate", "Ne…
$ gender                     <chr> "Male", "Female", "Male", "Male", "Female",…
$ exam_score                 <dbl> 67, 61, 74, 71, 70, 71, 67, 66, 69, 72, 68,…

nrow(student_clean)

[1] 6378
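Before dropping rows, it can help to see which columns hold the missing values. The following is a minimal sketch in base R on a small made-up data frame; `toy_students` and its values are hypothetical and are not taken from the project data. The same idea could be applied to student_raw.

```r
# Hypothetical toy data standing in for student_raw (values are made up)
toy_students <- data.frame(
  hours_studied   = c(23, 19, NA, 29),
  teacher_quality = c("Medium", NA, "High", "Medium"),
  exam_score      = c(67, 61, 74, 71)
)

# Count missing values in each column
missing_per_column <- colSums(is.na(toy_students))

# Count rows that would be removed by dropping incomplete rows
rows_removed <- nrow(toy_students) - nrow(na.omit(toy_students))

missing_per_column
rows_removed
```

In the project data, the same subtraction gives 6607 - 6378 = 229 rows removed by drop_na().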

# Compute summary statistics for the regression target variable
student_summary <- student_clean |>
  summarise(
    mean = mean(exam_score),
    median = median(exam_score),
    sd = sd(exam_score),
    min = min(exam_score),
    q1 = quantile(exam_score, 0.25),
    q3 = quantile(exam_score, 0.75),
    max = max(exam_score)
  )

student_summary

# A tibble: 1 × 7
   mean median    sd   min    q1    q3   max
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  67.3     67  3.91    55    65    69   101

Interpretation of Summary Statistics

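The target variable for the regression dataset is exam_score. The scores appear to be recorded as whole numbers, but they span a numeric range and can be treated as continuous, so this dataset is suitable for a regression prediction problem. After cleaning, the mean score is 67.3 and the median is 67, so the two measures of the centre nearly coincide. The standard deviation is 3.91 and the middle half of the scores lies between 65 and 69, which indicates that most students score within a narrow band. The scores range from 55 to 101. If the exam is graded out of 100, the maximum of 101 may be a data-entry error and should be checked before modeling.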
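One common way to read the quartiles and the range together is Tukey's 1.5 × IQR rule. The sketch below applies it to the values reported in student_summary; the variable names are illustrative, and the rule itself is only a rough screening device, not a test the report relies on.

```r
# Quartiles, minimum, and maximum as reported in student_summary
q1 <- 65
q3 <- 69
score_min <- 55
score_max <- 101

# Tukey's rule: values beyond 1.5 * IQR from the quartiles are flagged
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower_fence = lower_fence, upper_fence = upper_fence)
score_min < lower_fence  # TRUE: the minimum lies below the lower fence
score_max > upper_fence  # TRUE: the maximum lies above the upper fence
```

Under this rule the fences are 59 and 75, so both the minimum and the maximum would be flagged as potential outliers.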

# Create histogram of exam scores
ggplot(student_clean, aes(x = exam_score)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Exam Scores",
    x = "Exam Score",
    y = "Number of Students"
  )

Histogram Interpretation

The histogram shows the distribution of exam scores. The shape appears approximately normal: most students have scores concentrated in the middle range, while fewer students have very low or very high scores. There may be some right skewness, because the maximum (101) lies much further above the third quartile (69) than the minimum (55) lies below the first quartile (65).
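The skewness mentioned above can also be quantified rather than judged by eye. The sketch below defines a moment-based sample skewness in base R; `sample_skewness` is a hypothetical helper, and the two score vectors are made up for illustration. Applied to the project data, it could be called on student_clean$exam_score.

```r
# Moment-based sample skewness: g1 = m3 / m2^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^(3 / 2)
}

# Hypothetical toy scores (made up for illustration)
symmetric_scores <- c(60, 65, 67, 69, 74)
right_tailed_scores <- c(60, 65, 67, 69, 101)

sample_skewness(symmetric_scores)     # zero for a symmetric sample
sample_skewness(right_tailed_scores)  # positive when one high score stretches the right tail
```

A value near zero is consistent with a symmetric distribution, while a clearly positive value points to a longer right tail.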

# Apply log transformation to exam scores
student_clean <- student_clean |>
  mutate(log_exam_score = log(exam_score))

ggplot(student_clean, aes(x = log_exam_score)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Log Exam Scores",
    x = "Log of Exam Score",
    y = "Number of Students"
  )

Log Transformation Interpretation

After applying the log transformation, the distribution of exam scores changes only slightly. This is because exam scores are not strongly skewed in the original histogram and all lie in a fairly narrow positive range. Therefore, the original exam_score variable is already appropriate for later regression modeling.
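One way to see why the log transformation changes so little here is that the logarithm is close to a straight line over a narrow positive range. The sketch below checks this on an evenly spaced grid from the reported minimum (55) to the reported maximum (101); the grid is an assumption for illustration and is not the actual data.

```r
# Integer scores spanning the range reported in student_summary (55 to 101)
scores <- 55:101

# Correlation between the raw scores and their logs over this range
cor_raw_log <- cor(scores, log(scores))
cor_raw_log
```

A correlation very close to 1 means the log scale is almost a linear rescaling of the raw scale over this range, so the shape of the histogram is largely preserved.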

Theoretical Distribution

Based on the histogram, exam scores appear to be approximately normally distributed. Therefore, a normal distribution may be a reasonable approximation for the regression target variable.
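A rough numerical check of the normal approximation is to compare the quartiles implied by a normal distribution with the quartiles reported in student_summary. The sketch below uses only the reported mean, standard deviation, and maximum; the variable names are illustrative.

```r
# Mean and standard deviation as reported in student_summary
score_mean <- 67.3
score_sd <- 3.91

# Quartiles implied by a normal distribution with these parameters
normal_quartiles <- qnorm(c(0.25, 0.75), mean = score_mean, sd = score_sd)
round(normal_quartiles, 2)

# How many standard deviations the reported maximum (101) lies above the mean
z_max <- (101 - score_mean) / score_sd
round(z_max, 2)
```

The implied quartiles (about 64.7 and 69.9) are close to the observed quartiles (65 and 69). The maximum, however, lies more than 8 standard deviations above the mean, which would be extremely unlikely under a normal distribution, so the approximation appears to fit the centre of the distribution better than the upper tail.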

Dataset 2: Classification Dataset

Dataset Description and Source

The second dataset is the Loan Approval dataset. It contains information about loan applicants, including income, loan amount, loan term, credit score, number of dependents, employment status, education, and asset values. The target variable is loan_status, which shows whether the loan application was approved or rejected.

This dataset is relevant to economics and finance because loan approval decisions are important for credit markets. Predicting loan approval can help analyze how applicant characteristics are associated with credit access.

Source: Kaggle / Loan Approval Prediction dataset.

The raw dataset contains 4,269 observations and 13 variables, including the target variable, which is well above the project minimum of 500 observations and 5 variables.

Economic Question

Can applicant characteristics predict whether a loan application will be approved?

# Import the loan approval dataset
loan_raw <- read_csv("loan_approval_dataset.csv")

Rows: 4269 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): education, self_employed, loan_status
dbl (10): loan_id, no_of_dependents, income_annum, loan_amount, loan_term, c...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(loan_raw)

Rows: 4,269
Columns: 13
$ loan_id                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ no_of_dependents         <dbl> 2, 0, 3, 3, 5, 0, 5, 2, 0, 5, 4, 2, 3, 2, 1, …
$ education                <chr> "Graduate", "Not Graduate", "Graduate", "Grad…
$ self_employed            <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", …
$ income_annum             <dbl> 9600000, 4100000, 9100000, 8200000, 9800000, …
$ loan_amount              <dbl> 29900000, 12200000, 29700000, 30700000, 24200…
$ loan_term                <dbl> 12, 8, 20, 8, 20, 10, 4, 20, 20, 10, 2, 18, 1…
$ cibil_score              <dbl> 778, 417, 506, 467, 382, 319, 678, 382, 782, …
$ residential_assets_value <dbl> 2400000, 2700000, 7100000, 18200000, 12400000…
$ commercial_assets_value  <dbl> 17600000, 2200000, 4500000, 3300000, 8200000,…
$ luxury_assets_value      <dbl> 22700000, 8800000, 33300000, 23300000, 294000…
$ bank_asset_value         <dbl> 8000000, 3300000, 12800000, 7900000, 5000000,…
$ loan_status              <chr> "Approved", "Rejected", "Rejected", "Rejected…

nrow(loan_raw)

[1] 4269

# Clean variable names, remove missing values, and create binary target variable
loan_clean <- loan_raw |>
  clean_names() |>
  drop_na() |>
  mutate(
    loan_status = str_trim(loan_status),
    loan_approved = ifelse(loan_status == "Approved", 1, 0)
  )

glimpse(loan_clean)

Rows: 4,269
Columns: 14
$ loan_id                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ no_of_dependents         <dbl> 2, 0, 3, 3, 5, 0, 5, 2, 0, 5, 4, 2, 3, 2, 1, …
$ education                <chr> "Graduate", "Not Graduate", "Graduate", "Grad…
$ self_employed            <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", …
$ income_annum             <dbl> 9600000, 4100000, 9100000, 8200000, 9800000, …
$ loan_amount              <dbl> 29900000, 12200000, 29700000, 30700000, 24200…
$ loan_term                <dbl> 12, 8, 20, 8, 20, 10, 4, 20, 20, 10, 2, 18, 1…
$ cibil_score              <dbl> 778, 417, 506, 467, 382, 319, 678, 382, 782, …
$ residential_assets_value <dbl> 2400000, 2700000, 7100000, 18200000, 12400000…
$ commercial_assets_value  <dbl> 17600000, 2200000, 4500000, 3300000, 8200000,…
$ luxury_assets_value      <dbl> 22700000, 8800000, 33300000, 23300000, 294000…
$ bank_asset_value         <dbl> 8000000, 3300000, 12800000, 7900000, 5000000,…
$ loan_status              <chr> "Approved", "Rejected", "Rejected", "Rejected…
$ loan_approved            <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, …

nrow(loan_clean)

[1] 4269

table(loan_clean$loan_approved)


   0    1 
1613 2656
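The trimming step in the cleaning code matters because a label with stray whitespace does not equal "Approved" and would be recoded as 0. The sketch below illustrates this in base R with made-up labels; `raw_status` is hypothetical, and trimws() is used as the base R analogue of str_trim().

```r
# Hypothetical raw labels with stray whitespace (made up for illustration)
raw_status <- c(" Approved", "Rejected ", " Rejected", "Approved")

# Without trimming, padded labels do not match "Approved"
untrimmed <- ifelse(raw_status == "Approved", 1, 0)

# Trim first, then recode
trimmed <- ifelse(trimws(raw_status) == "Approved", 1, 0)

untrimmed
trimmed
```

Without trimming, the first label is wrongly coded as rejected. Comparing the resulting counts against the original loan_status labels is a simple way to confirm the recode.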

# Compute summary statistics for the classification target variable
loan_summary <- loan_clean |>
  summarise(
    mean = mean(loan_approved),
    median = median(loan_approved),
    sd = sd(loan_approved),
    min = min(loan_approved),
    q1 = quantile(loan_approved, 0.25),
    q3 = quantile(loan_approved, 0.75),
    max = max(loan_approved)
  )

loan_summary

# A tibble: 1 × 7
   mean median    sd   min    q1    q3   max
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.622      1 0.485     0     0     1     1

Interpretation of Summary Statistics

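The target variable for the classification dataset is loan_approved. It is a binary variable where 1 means the loan was approved and 0 means the loan was rejected. The mean of 0.622 is the share of approved applications: 2,656 of the 4,269 applications (62.2%) were approved and 1,613 were rejected. The median, quartiles, minimum, and maximum are less informative for a binary variable because they can only equal 0 or 1.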

# Create histogram of the binary loan approval target
ggplot(loan_clean, aes(x = loan_approved)) +
  geom_histogram(bins = 2) +
  labs(
    title = "Distribution of Loan Approval Status",
    x = "Loan Approved (0 = Rejected, 1 = Approved)",
    y = "Number of Applications"
  )

Histogram Interpretation

The histogram shows the distribution of the binary loan approval target. Since loan_approved only takes the values 0 and 1, the distribution is not normal. Instead, it shows how many loan applications were rejected and how many were approved.

# Apply a log(x + 1) transformation to the binary target variable (log(0) is not finite)
loan_clean <- loan_clean |>
  mutate(log_loan_approved = log(loan_approved + 1))

ggplot(loan_clean, aes(x = log_loan_approved)) +
  geom_histogram(bins = 2) +
  labs(
    title = "Distribution of Log Loan Approval Status",
    x = "Log of Loan Approved",
    y = "Number of Applications"
  )

Log Transformation Interpretation

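Because loan_approved is a binary variable, the log transformation does not make the distribution normal. Since log(0) is not finite, the code uses log(x + 1), which only changes the values from 0 and 1 to 0 and log(2) ≈ 0.693. The variable still takes just two values, so the shape of the distribution is unchanged. Therefore, log transformation is less useful for binary classification targets than for skewed continuous variables.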
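The effect of the transformation on a binary variable can be checked directly. The following minimal sketch applies log(x + 1) to the two possible values; the variable names are illustrative.

```r
# The two possible values of a binary target
binary_values <- c(0, 1)

# log(x + 1) maps 0 to 0 and 1 to log(2)
transformed <- log(binary_values + 1)
transformed

# The transformed variable still has exactly two distinct values
length(unique(transformed))
```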

Theoretical Distribution

The loan_approved variable follows a Bernoulli distribution because it has only two possible outcomes: 0 for rejected and 1 for approved. Its single parameter, the probability of approval, is estimated by the sample mean of about 0.622. Therefore, a normal or log-normal distribution is not appropriate for this target variable.
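For a Bernoulli variable with success probability p, the mean is p and the standard deviation is sqrt(p(1 - p)). The sketch below recomputes both from the counts reported by table(loan_clean$loan_approved) as a consistency check; the variable names are illustrative.

```r
# Counts as reported by table(loan_clean$loan_approved)
n_rejected <- 1613
n_approved <- 2656
n <- n_rejected + n_approved

# Estimated Bernoulli parameter: the share of approved applications
p_hat <- n_approved / n

# Bernoulli standard deviation, and the sample version with the n - 1 correction
sd_bernoulli <- sqrt(p_hat * (1 - p_hat))
sd_sample <- sd_bernoulli * sqrt(n / (n - 1))

round(c(p_hat = p_hat, sd_bernoulli = sd_bernoulli, sd_sample = sd_sample), 3)
```

The results round to 0.622 and 0.485, matching the mean and sd in loan_summary, which is what a Bernoulli variable with this approval share implies.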

Conclusion

In this Stage 1 report, two datasets were prepared for future predictive modeling. The first dataset is a regression dataset that predicts exam scores using student characteristics. The second dataset is a classification dataset that predicts loan approval status using applicant characteristics.

The probability analysis showed that exam scores are approximately normally distributed, while loan approval status follows a Bernoulli distribution. Both datasets satisfy the project requirements of more than 500 observations and at least 5 variables: the student dataset has 6,378 observations after removing 229 rows with missing values and 20 variables, and the loan dataset has 4,269 observations with no missing values and 13 variables. These datasets are now ready for Stage 2, where predictive models will be built, compared, and evaluated.