HW3: Showcasing CGPA and placement exam marks dataset using simple linear regression analysis

2024-03-19

Libraries

# Load the required libraries
library(ggplot2)
library(plotly)
library(dplyr)

Introduction

Simple Linear Regression Analysis

Simple linear regression is a basic statistical technique for modeling the relationship between two continuous variables.
It is used extensively in a wide range of fields, such as biology, engineering, economics, and the social sciences.
This presentation applies simple linear regression analysis to a dataset of CGPA and placement exam marks.

Background

Knowing the elements that lead to a good placement is essential when it comes to schooling and job advancement.
It is generally accepted that academic performance, which is often assessed using the Cumulative Grade Point Average (CGPA), is a good indicator of success in a number of settings, including job opportunities and placement exams.
Structured assessments known as placement exams are utilized by businesses and educational establishments to evaluate students’ preparedness for the workforce or for pursuing higher education.
To provide useful information to students, teachers, and recruiters, the challenge is to measure the correlation between academic achievement and placement exam success.

What we are going to do

For this presentation, we use a dataset that focuses on two variables: CGPA and placement exam marks.
It comes from Kaggle, a free platform for downloading datasets (https://www.kaggle.com/datasets/).
The dataset offers a chance to investigate the connection between these two important facets of a student’s educational and professional path.
It consists of actual student data, including their placement exam marks and CGPA values.

The Plan

To investigate any possible predictive association between a student’s CGPA and their placement exam results, we will use simple linear regression analysis.
We will first conduct exploratory data analysis and visualize the data to understand its distribution and underlying patterns.
We will then fit a linear regression model, interpret the model coefficients, and evaluate the accuracy and reliability of the model.
The primary question that we want to answer is: Is it possible to anticipate a student’s performance on placement examinations based on their CGPA?

These are the key features of the dataset

Cumulative Grade Point Average (CGPA): A measure of a student’s academic success expressed as a number.
Placement Exam Marks: The results of students’ placement examinations, which are essential to their future employment opportunities.

Loading the Data
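
The slides do not show the loading step, so the following is only a minimal sketch. The file name `placement.csv` is an assumption, as is the fallback that builds a small synthetic stand-in when the file is missing. The column names `cgpa` and `placement_exam_marks` are taken from the plotting code used in the later slides.

```r
# Hypothetical file name: replace with the path of the CSV downloaded from Kaggle.
data_path <- "placement.csv"

if (file.exists(data_path)) {
  placement_data <- read.csv(data_path)
} else {
  # Assumption: synthetic stand-in so that the later code can run without the file.
  # These values are random and say nothing about the real dataset.
  message("File not found, using synthetic stand-in data.")
  set.seed(1)
  placement_data <- data.frame(
    cgpa = round(runif(100, min = 5, max = 9), 2),
    placement_exam_marks = round(runif(100, min = 0, max = 100))
  )
}

# Check that the two columns used in the analysis are present.
stopifnot(all(c("cgpa", "placement_exam_marks") %in% names(placement_data)))
str(placement_data)
```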

Scatter Plot with ggplot2

This is a scatter plot of CGPA versus placement exam marks.
The plot is on the next slide due to lack of space.

ggplot(placement_data, aes(x = cgpa, y = placement_exam_marks)) +
  geom_point(color = "red") +
  theme_minimal() +
  labs(title = "Scatter Plot of CGPA vs. Placement Exam Marks",
       x = "CGPA", y = "Placement Exam Marks")

The Regression Equation

The simple linear regression model is defined by the equation:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

$Y$ is the dependent variable, which we aim to predict.
$X$ is the independent variable, used as the predictor.
$\beta_0$ is the y-intercept of the regression line.
$\beta_1$ is the slope of the regression line, indicating the change in $Y$ for each unit change in $X$.
$\epsilon$ represents the error term, capturing the variation in $Y$ not explained by $X$.
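
To make the roles of these terms concrete, the sketch below simulates data from the model and then fits it. The parameter values, sample size, and noise level are arbitrary assumptions chosen for illustration and are unrelated to the placement dataset.

```r
set.seed(42)

n      <- 200    # assumed sample size
beta_0 <- 2      # assumed intercept
beta_1 <- 0.5    # assumed slope
sigma  <- 1      # assumed standard deviation of the error term

x       <- runif(n, min = 0, max = 10)     # independent variable X
epsilon <- rnorm(n, mean = 0, sd = sigma)  # error term
y       <- beta_0 + beta_1 * x + epsilon   # dependent variable Y

# lm() should recover estimates close to the assumed beta_0 and beta_1.
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because the data were generated from the model itself, the estimates differ from the assumed values only through the random error term.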

Estimating the coefficients

The slope and intercept of the linear regression model describe how the independent and dependent variables are related to one another. They are estimated with the method of least squares. The slope $\beta_1$ is calculated as:

\[ \beta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \]

And the intercept $\beta_0$ is:

\[ \beta_0 = \bar{Y} - \beta_1\bar{X} \]

$\bar{X}$ and $\bar{Y}$ are the mean values of the independent ($X$) and dependent ($Y$) variables.

The slope $\beta_1$ indicates the average change in $Y$ for every one-unit increase in $X$.
The intercept $\beta_0$ represents the expected value of $Y$ when $X$ equals 0.
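
As a quick check of the least-squares formulas, the sketch below computes the slope and intercept by hand and compares them with `lm()`. The six data points are made up for illustration (an assumption, not the placement data).

```r
# Small made-up sample (assumption, not the placement data).
x <- c(6.1, 6.8, 7.0, 7.4, 7.9, 8.3)
y <- c(22, 35, 18, 41, 27, 30)

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by the sum of squared deviations of X.
beta_1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
# Intercept: forces the fitted line through the point (x_bar, y_bar).
beta_0 <- y_bar - beta_1 * x_bar

# lm() should give the same two numbers.
manual  <- c(beta_0, beta_1)
from_lm <- unname(coef(lm(y ~ x)))
all.equal(manual, from_lm)
```

The same comparison could be run on `placement_data` by replacing `x` and `y` with the `cgpa` and `placement_exam_marks` columns.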

Visualizing regression model

Here we visualize the fitted regression line on a scatter plot of CGPA versus placement exam marks.
The plot is on the next slide due to lack of space.

ggplot(placement_data, aes(x = cgpa, y = placement_exam_marks)) + 
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "blue") + 
  theme_minimal() + 
  labs(title = "Regression Line with CGPA vs. Placement Exam Marks", 
       x = "CGPA", y = "Placement Exam Marks")

Interactive visualization with Plotly

This is an interactive Plotly version of the scatter plot of CGPA versus placement exam marks.
The plot is on the next slide due to lack of space.

# Interactive Plotly scatter plot of CGPA vs. placement exam marks
p <- ggplot(placement_data, aes(x = cgpa, y = placement_exam_marks)) + 
  geom_point(color = 'orange') + theme_minimal() + 
  labs(title = "Interactive Scatter Plot of CGPA vs. Placement Exam Marks", 
       x = "CGPA", y = "Placement Exam Marks")
ggplotly(p)

Average placement exam marks by CGPA range
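
The output for this slide is not shown. One way such a summary could be computed is sketched below with dplyr. The bin edges and the simulated stand-in data are assumptions, so the printed averages say nothing about the real dataset.

```r
library(dplyr)

# Assumption: simulated stand-in with the same column names as placement_data.
set.seed(7)
demo_data <- data.frame(
  cgpa = runif(300, min = 5, max = 9),
  placement_exam_marks = runif(300, min = 0, max = 100)
)

avg_by_range <- demo_data %>%
  # Assumed bins of width 1 on the CGPA scale.
  mutate(cgpa_range = cut(cgpa, breaks = 5:9, include.lowest = TRUE)) %>%
  group_by(cgpa_range) %>%
  summarise(
    avg_marks  = mean(placement_exam_marks),
    n_students = n(),
    .groups = "drop"
  )

avg_by_range
```

With the real data, `demo_data` would be replaced by `placement_data` and the breaks adjusted to the observed CGPA range.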

Regression Analysis Results

Here we interpret the model’s output to understand the relationship between CGPA and placement exam marks.
The output is on the next slide due to lack of space.

fit <- lm(placement_exam_marks ~ cgpa, data = placement_data)
summary(fit)

## 
## Call:
## lm(formula = placement_exam_marks ~ cgpa, data = placement_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.099 -15.074  -3.917  11.853  66.915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.1434     6.8687   5.553  3.6e-08 ***
## cgpa         -0.8502     0.9829  -0.865    0.387    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.13 on 998 degrees of freedom
## Multiple R-squared:  0.0007492,  Adjusted R-squared:  -0.0002521 
## F-statistic: 0.7482 on 1 and 998 DF,  p-value: 0.3872
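
The figures in this output are linked by simple arithmetic, which the sketch below reproduces from the printed slope estimate, standard error, and residual degrees of freedom. The variable names are illustrative, and the 95% confidence interval is an addition that does not appear in the output itself.

```r
# Values copied from the summary(fit) output above.
estimate  <- -0.8502   # slope for cgpa
std_error <- 0.9829
df_resid  <- 998

t_value <- estimate / std_error             # about -0.865
p_value <- 2 * pt(-abs(t_value), df_resid)  # two-sided p-value, about 0.387
f_stat  <- t_value^2                        # with one predictor, F = t^2
r_sq    <- f_stat / (f_stat + df_resid)     # R-squared recovered from F

# 95% confidence interval for the slope; it includes zero.
ci <- estimate + c(-1, 1) * qt(0.975, df_resid) * std_error

round(c(t = t_value, p = p_value, F = f_stat, R2 = r_sq), 4)
round(ci, 3)
```

The interval covering zero is another way of seeing that the data are compatible with no linear relationship between CGPA and placement exam marks.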

Conclusion: Key Takeaways and Implications

Key Takeaways:

The high p-value for CGPA (0.387) indicates that CGPA is not a statistically significant linear predictor of placement exam marks.
The very low R-squared value (about 0.0007) shows that the model explains less than 0.1% of the variation in placement exam marks.
These findings suggest that placement exam performance depends on factors that this model does not capture.

Implications:

Placement exam marks appear to be influenced mainly by variables other than academic success as measured by CGPA.
Academic institutions and students may want to consider factors other than CGPA when assessing what determines placement success.

Future Directions for Research

Exploring Further:

Future studies may examine other factors, such as study habits, test-taking techniques, or student demographics, that affect placement exam marks.
A more comprehensive model may give a better understanding of the factors influencing exam success.

Broader Perspective:

This study emphasizes the necessity for educational systems to take into account a variety of factors related to student performance and placement exam readiness.
It also advocates a more holistic approach to student evaluation and preparation.