This assignment assesses your understanding of basic statistics, probability, Bayes’ Theorem and the linear regression model, covered in Modules 1 and 2. The assessment is worth 50 marks in total, including 5 marks for presentation. It contributes 20% to your final score.
You can complete your assignment using the code shared in the unit (i.e. Alexandria, videos, practical activities on Moodle) and this template as a basis. However, you should make sure the code you use is correct and relevant to the question.
Please follow the structure of this template as much as you can.
You can use the prepopulated code cells or change them if you prefer. However, please do not change the names of the key variables, functions, and parameters, e.g. ambient, coolant. This helps us to read and understand your submission more efficiently.
All your answers need to be put into this file. You can write equations in R Markdown; please refer to https://rmd4sci.njtierney.com/math.html#example-math-commands for more information and examples.
Although this is not a coding unit, we expect you to give proper comments when writing code snippets. You should also indent code blocks properly and avoid copying and pasting the same code block multiple times (write a function instead). Any violation of these code readability requirements will result in deductions from your presentation score (maximum of 5).
Two files are needed for this assignment: the .html file (for plagiarism checking) and the .Rmd file (for checking code). Failure to comply will result in a 20% penalty for each missing file. All details in your answers need to be accounted for in both the HTML and the R Markdown file. The files to be submitted should be named “StudentId.html” and “StudentId.Rmd”, e.g. “31234567.html” and “31234567.Rmd”. Files with the wrong naming format may incur a 10% penalty each.
Given that \(P(A) = 0.55,\;P(\overline{B}) = 0.35, \text{ and } P(A \cup B) = 0.75\), determine \(P(B)\) and \(P(A \cap B)\).
\[P(A) = 0.55 \]
\[P(\overline{B}) = 0.35 \] \[P(A \cup B) = 0.75 \]
\[ P(B) = 1 - P(\overline{B}) = 1 - 0.35 = 0.65 \]
\[ P(A \cap B) = P(A) + P(B) - P(A \cup B) \]
\[ P(A \cap B) = 0.55 + 0.65 - 0.75 = 0.45 \]
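The arithmetic above can be double-checked in R. This is a minimal sketch; the variable names `p_a`, `p_not_b` and `p_a_or_b` are illustrative and not part of the template.

```r
# Given probabilities
p_a      <- 0.55   # P(A)
p_not_b  <- 0.35   # P(complement of B)
p_a_or_b <- 0.75   # P(A union B)

# Complement rule: P(B) = 1 - P(complement of B)
p_b <- 1 - p_not_b

# Inclusion-exclusion: P(A and B) = P(A) + P(B) - P(A or B)
p_a_and_b <- p_a + p_b - p_a_or_b

print(c(P_B = p_b, P_A_and_B = p_a_and_b))
```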
Mr. and Mrs. Brown have two children. Assume that a boy and a girl are equally likely to be born and that the genders of the children are independent of each other. If one child is selected at random and we learn that she is a girl, what is the probability that both children are girls?
\[ P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)} \]
Where \(A\) is the event that both children are girls and \(B\) is the event that one randomly selected child is a girl.
Since the probability of each child being a girl is \(\frac{1}{2}\) and the events are independent, we have:
\[ P(A) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \]
A randomly selected child is a girl with probability \(\frac{1}{2}\), regardless of which child is picked. (The value \(1 - \frac{1}{4} = \frac{3}{4}\) is the probability of a different event, namely that at least one of the two children is a girl.) Therefore:
\[ P(B) = \frac{1}{2} \]
If both children are girls, the probability that one randomly selected child is a girl is:
\[ P(B | A) = 1 \]
Using Bayes’ Theorem to calculate \(P(A | B)\):
\[ P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)} \]
Substituting the values:
\[ P(A | B) = \frac{1 \times \frac{1}{4}}{\frac{1}{2}} = \frac{1}{2} \]
The probability that both children are girls given that one randomly selected child is a girl is \(\frac{1}{2}\).
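The answer depends on how the condition is read, and a brute-force enumeration makes the difference concrete. The sketch below lists the eight equally likely combinations of (first child, second child, which child is picked) and conditions on two different events. The data frame `outcomes` and all variable names are illustrative assumptions.

```r
# All equally likely outcomes: gender of each child and which child is picked
outcomes <- expand.grid(child1 = c("B", "G"),
                        child2 = c("B", "G"),
                        picked = 1:2,
                        stringsAsFactors = FALSE)

# Gender of the child that was actually picked
picked_gender <- ifelse(outcomes$picked == 1, outcomes$child1, outcomes$child2)

both_girls   <- outcomes$child1 == "G" & outcomes$child2 == "G"
at_least_one <- outcomes$child1 == "G" | outcomes$child2 == "G"

# Reading 1: the randomly picked child is a girl
p_given_picked_girl <- mean(both_girls[picked_gender == "G"])

# Reading 2: at least one of the two children is a girl
p_given_at_least_one <- mean(both_girls[at_least_one])

print(c(picked_girl = p_given_picked_girl, at_least_one = p_given_at_least_one))
```

Conditioning on the event \(B\) as defined in the question (a randomly selected child is a girl) gives \(\frac{1}{2}\), whereas conditioning on "at least one child is a girl" gives \(\frac{1}{3}\).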
Leon is marking a hard multiple choice question. A student may answer the question correctly by guessing with a probability of 1/4. If a student has a 50% chance of knowing the answer, what is the probability that this student indeed knows the answer given that the MCQ is answered correctly?
\[ P(K | C) = \frac{P(C | K) \cdot P(K)}{P(C)} \]
Where \(K\) is the event that the student knows the answer and \(C\) is the event that the student answers the question correctly.
Given the student has a 50% chance of knowing the answer, we have:
\[ P(K) = 0.5 \]
The student either knows the answer or guesses. Since \(P(K) = 0.5\), the probability that the student is guessing is:
\[ P(G) = 1 - P(K) = 0.5 \]
If the student knows the answer, they will definitely answer correctly, so:
\[ P(C | K) = 1 \]
If the student is guessing, there is a \(\frac{1}{4}\) chance of answering correctly, therefore:
\[ P(C | G) = \frac{1}{4} \]
To calculate \(P(C)\), we use the law of total probability:
\[ P(C) = P(C | K) \cdot P(K) + P(C | G) \cdot P(G) \]
Substituting the values:
\[ P(C) = (1 \times 0.5) + \left(\frac{1}{4} \times 0.5\right) = 0.5 + 0.125 = 0.625 \]
Apply Bayes’ Theorem:
\[ P(K | C) = \frac{P(C | K) \cdot P(K)}{P(C)} \]
Substitute the values:
\[ P(K | C) = \frac{1 \times 0.5}{0.625} = \frac{0.5}{0.625} = 0.8 \]
The probability that the student knows the answer given that they answered correctly is \(0.8\), or 80%.
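The same calculation can be wrapped in a small R function, which also makes it easy to see how the answer changes with other inputs. This is a sketch only: `posterior_knows` and its argument names are hypothetical, and it assumes, as above, that a student who knows the answer always answers correctly.

```r
# Posterior probability that the student knows the answer given a correct answer.
# p_know:          prior probability that the student knows the answer
# p_guess_correct: probability of a correct answer when guessing
posterior_knows <- function(p_know, p_guess_correct) {
  # Law of total probability: P(C) = P(C|K) P(K) + P(C|G) P(G), with P(C|K) = 1
  p_correct <- 1 * p_know + p_guess_correct * (1 - p_know)
  # Bayes' Theorem: P(K|C) = P(C|K) P(K) / P(C)
  (1 * p_know) / p_correct
}

p_k_given_c <- posterior_knows(p_know = 0.5, p_guess_correct = 1 / 4)
p_k_given_c
```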
A gambler has a fair coin (i.e., the coin has both the head and the tail side) and a two-tailed coin (i.e., both sides are tail) in his pocket. He selects one of the coins at random, and when he flips it, it shows tail. Then he flips the same coin a second time and again it shows tail. What is the probability that it is the two-tailed coin?
\[ P(C_{2T} | TT) = \frac{P(TT | C_{2T}) \cdot P(C_{2T})}{P(TT)} \]
Where \(C_{2T}\) is the event that the coin is the two-tailed coin and \(TT\) is the event that two tails are observed.
The gambler selects one of the two coins at random. Therefore:
\[ P(C_{2T}) = \frac{1}{2} \]
Similarly, the probability of selecting the fair coin is also:
\[ P(C_F) = \frac{1}{2} \]
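The two-tailed coin always shows a tail, so two tails are certain:
\[ P(TT | C_{2T}) = 1 \]
For the fair coin, the two flips are independent and each shows a tail with probability \(\frac{1}{2}\):
\[ P(TT | C_F) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \]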
To find the total probability of getting two tails, we use the law of total probability:
\[ P(TT) = P(TT | C_{2T}) \cdot P(C_{2T}) + P(TT | C_F) \cdot P(C_F) \]
Substituting the known values:
\[ P(TT) = (1 \times \frac{1}{2}) + \left(\frac{1}{4} \times \frac{1}{2}\right) = \frac{1}{2} + \frac{1}{8} = \frac{5}{8} \]
Now, we apply Bayes’ Theorem:
\[ P(C_{2T} | TT) = \frac{P(TT | C_{2T}) \cdot P(C_{2T})}{P(TT)} \]
Substitute the values:
\[ P(C_{2T} | TT) = \frac{1 \times \frac{1}{2}}{\frac{5}{8}} = \frac{1}{2} \times \frac{8}{5} = \frac{4}{5} \]
The probability that the gambler selected the two-tailed coin given that two tails were observed is \(\frac{4}{5}\) or 0.8.
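An equivalent way to reach this result is to update the belief one flip at a time, using the posterior after the first tail as the prior for the second. The sketch below illustrates this; the vector names `likelihood_tail` and `belief` are assumptions made for the example.

```r
# Probability of observing a tail on a single flip, for each coin
likelihood_tail <- c(two_tailed = 1, fair = 0.5)

# Prior belief: each coin is equally likely to have been selected
belief <- c(two_tailed = 0.5, fair = 0.5)

# Apply Bayes' Theorem once per observed tail
for (flip in 1:2) {
  unnormalised <- likelihood_tail * belief
  belief <- unnormalised / sum(unnormalised)
  cat("After tail", flip, ": P(two-tailed) =", belief[["two_tailed"]], "\n")
}
```

The first tail raises the probability of the two-tailed coin from \(\frac{1}{2}\) to \(\frac{2}{3}\), and the second tail raises it to \(\frac{4}{5}\).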
You toss 11 unfair coins where the chance of tossing a head is \(0.6\). What is the combined probability of getting exactly 5 or exactly 6 heads?
The number of heads follows a binomial distribution, which gives the probability of obtaining exactly \(k\) heads in \(n\) independent coin tosses:
\[ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \]
Where \(n = 11\) is the number of tosses, \(p = 0.6\) is the probability of a head on each toss, and \(X\) is the number of heads.
The cumulative probability of getting up to 5 heads, \(P(X \leq 5)\), can be computed in R with pbinom(5, n, p). To calculate the probability of getting exactly 5 heads, we subtract \(P(X \leq 4)\) from \(P(X \leq 5)\):
\[ P(X = 5) = P(X \leq 5) - P(X \leq 4) \]
Similarly, to calculate the probability of getting exactly 6 heads, we subtract \(P(X \leq 5)\) from \(P(X \leq 6)\):
\[ P(X = 6) = P(X \leq 6) - P(X \leq 5) \]
The combined probability of getting either 5 or 6 heads is the sum of the individual probabilities:
\[ P(X = 5 \text{ or } X = 6) = P(X = 5) + P(X = 6) \]
Now, let’s calculate this in R.
n <- 11
p <- 0.6
# Step 1: Calculate the probability of getting exactly 5 heads using pbinom
P_5_heads <- pbinom(5, n, p) - pbinom(4, n, p)
# Step 2: Calculate the probability of getting exactly 6 heads using pbinom
P_6_heads <- pbinom(6, n, p) - pbinom(5, n, p)
# Step 3: Calculate the combined probability of getting exactly 5 or exactly 6 heads
P_combined <- P_5_heads + P_6_heads
P_combined
## [1] 0.3678732
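As a cross-check, the same probability can be obtained directly from the binomial probability mass function with `dbinom()`, which evaluates \(P(X = k)\) without differencing cumulative probabilities. This is a minimal sketch; the names `P_combined_dbinom` and `P_combined_pbinom` are introduced here for illustration.

```r
n <- 11   # number of tosses
p <- 0.6  # probability of a head on each toss

# Sum the point probabilities P(X = 5) and P(X = 6)
P_combined_dbinom <- sum(dbinom(5:6, size = n, prob = p))

# Same quantity from the cumulative distribution: P(X <= 6) - P(X <= 4)
P_combined_pbinom <- pbinom(6, n, p) - pbinom(4, n, p)

print(c(dbinom = P_combined_dbinom, pbinom = P_combined_pbinom))
```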
The diameter (a continuous variable) of a certain disk follows the uniform distribution within a specific interval \([a,b]\), i.e., \(X\sim\mathcal{U}\left(a,b\right)\) with \(a=0\) and \(b=2\). Find the average area of the disk.
Hint: Make use of the relationship between the expectation and the variance. The formula for the area of a disk is \(\pi r^2\), where \(r\) is the radius.
The diameter of the disk follows a uniform distribution \(X \sim \mathcal{U}(a, b)\), where \(a = 0\) and \(b = 2\).
Since the area of a disk is \(A = \pi r^2\) and the radius is \(r = \frac{X}{2}\), we have \(r^2 = \left(\frac{X}{2}\right)^2 = \frac{X^2}{4}\), so we need \(E[X^2]\). For a uniform distribution, \(E[X^2] = \int_a^b \frac{x^2}{b - a} \, dx\), which evaluates to:
\[ E[X^2] = \frac{b^3 - a^3}{3(b - a)} \]
Substituting the values \(a = 0\) and \(b = 2\):
\[ E[X^2] = \frac{2^3 - 0^3}{3(2 - 0)} = \frac{8}{6} = \frac{4}{3} \]
Dividing by 4 gives the expected squared radius:
\[ E[r^2] = \frac{E[X^2]}{4} = \frac{\frac{4}{3}}{4} = \frac{1}{3} \]
The formula for the area of a disk is \(A = \pi r^2\). Therefore, the average area is:
\[ E[A] = \pi E[r^2] = \pi \times \frac{1}{3} = \frac{\pi}{3} \]
Now, let’s calculate this in R.
# Step 1: Define the bounds of the uniform distribution
a <- 0 # Lower bound of the uniform distribution
b <- 2 # Upper bound of the uniform distribution
# Step 2: Calculate the expected value of r^2
# E[X^2] = (b^3 - a^3) / (3 * (b - a)) and r = X / 2, so E[r^2] = E[X^2] / 4
E_r_squared <- (b^3 - a^3) / (3 * (b - a)) / 4
# Step 3: Calculate the average area of the disk
# E[A] = π * E[r^2]
average_area <- pi * E_r_squared
average_area
## [1] 1.047198
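The closed-form result can also be verified by numerical integration, computing \(E[A] = \int_a^b \pi \left(\frac{x}{2}\right)^2 f(x) \, dx\) with the uniform density \(f\). The sketch below uses base R's `integrate()` and `dunif()`; the function name `area_times_density` is an illustrative assumption.

```r
a <- 0  # lower bound of the diameter
b <- 2  # upper bound of the diameter

# Integrand: area of a disk with diameter x, weighted by the uniform density
area_times_density <- function(x) {
  pi * (x / 2)^2 * dunif(x, min = a, max = b)
}

# Numerically integrate over the support of the diameter
numeric_area <- integrate(area_times_density, lower = a, upper = b)$value

print(c(numeric = numeric_area, closed_form = pi / 3))
```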
Given \(K \sim N(\mu=0,\sigma^2=1)\), i.e, \(K\) is Gaussian distributed, what’s the probability that the equation \(x^2 + 2Kx +1 =0\) has real solutions? Hint: The solution to the quadratic equation \(ax^2+bx+c=0\) where \(a\), \(b\) and \(c\) are real constants and \(x\) is unknown, is \(x=\frac{-b\pm \sqrt{b^2-4ac}}{2a}\).
The quadratic equation \(x^2 + 2Kx + 1 = 0\) has real solutions if its discriminant is non-negative. The discriminant \(\Delta\) is given by:
\[ \Delta = b^2 - 4ac \]
where \(a = 1\), \(b = 2K\), and \(c = 1\). Substituting these values:
\[ \Delta = (2K)^2 - 4 \cdot 1 \cdot 1 = 4K^2 - 4 \]
For the quadratic equation to have real solutions, we need \(\Delta \geq 0\):
\[ 4K^2 - 4 \geq 0 \]
Solving this inequality:
\[ 4K^2 \geq 4 \] \[ K^2 \geq 1 \] \[ |K| \geq 1 \]
Thus, the quadratic equation will have real solutions if \(|K| \geq 1\).
To find the probability that \(|K| \geq 1\), we calculate:
\[ P(|K| \geq 1) = P(K \leq -1) + P(K \geq 1) \]
where \(K\) follows a standard normal distribution \(N(0, 1)\).
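1. Probability \(P(K \leq -1)\), where \(\Phi\) denotes the cumulative distribution function of the standard normal distribution:
\[ P(K \leq -1) = \Phi(-1) \]
2. Probability \(P(K \geq 1)\):
\[ P(K \geq 1) = 1 - \Phi(1) \]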
Here is the R code to compute these probabilities:
# Calculate P(K <= -1)
P_K_less_than_minus_1 <- pnorm(-1, mean = 0, sd = 1)
# Calculate P(K >= 1)
P_K_greater_than_1 <- 1 - pnorm(1, mean = 0, sd = 1)
# Total probability of |K| >= 1
P_real_solutions <- P_K_less_than_minus_1 + P_K_greater_than_1
P_real_solutions
## [1] 0.3173105
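Two shortcuts give the same number and serve as a sanity check. By symmetry of the standard normal distribution, \(P(|K| \geq 1) = 2\Phi(-1)\), and since \(K^2\) follows a chi-squared distribution with one degree of freedom, \(P(K^2 \geq 1)\) can be read from `pchisq()`. The variable names below are illustrative.

```r
# Two-tail sum, as in the derivation above
p_two_tail <- pnorm(-1) + (1 - pnorm(1))

# Shortcut 1: symmetry of the standard normal distribution
p_symmetry <- 2 * pnorm(-1)

# Shortcut 2: K^2 ~ chi-squared with 1 degree of freedom, so P(K^2 >= 1)
p_chisq <- pchisq(1, df = 1, lower.tail = FALSE)

print(c(two_tail = p_two_tail, symmetry = p_symmetry, chi_squared = p_chisq))
```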
Given a random variable \(X\) with density function \[ p(x) = \begin{cases} a+bx^2 &0 \leq x \leq 1\\ 0 & \text{otherwise} \end{cases} \] such that \(E[X] = 2/3\). What are the values of \(a\) and \(b\)?
We are given a random variable \(X\) with the following probability density function (PDF):
\[ p(x) = \begin{cases} a + bx^2 & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases} \]
We are also told that the expected value \(E[X] = \frac{2}{3}\). We need to find the values of \(a\) and \(b\).
The total probability must integrate to 1: \[ \int_0^1 (a + bx^2) \, dx = 1 \]
This gives the equation: \[ a \cdot 1 + \frac{b}{3} = 1 \quad \text{or} \quad a + \frac{b}{3} = 1 \] Rearranging for \(a\): \[ a = 1 - \frac{b}{3} \]
The expected value of \(X\) is defined as: \[ E[X] = \int_0^1 x(a + bx^2) \, dx \]
Calculating the integral: \[ E[X] = \frac{a}{2} + \frac{b}{4} \]
We are given that \(E[X] = \frac{2}{3}\), so: \[ \frac{a}{2} + \frac{b}{4} = \frac{2}{3} \]
Substitute \(a = 1 - \frac{b}{3}\) into the expected value equation:
\[ \frac{1 - \frac{b}{3}}{2} + \frac{b}{4} = \frac{2}{3} \]
Multiply the entire equation by 12 to eliminate fractions: \[ 6\left(1 - \frac{b}{3}\right) + 3b = 8 \]
Simplifying: \[ 6 - 2b + 3b = 8 \]
Solving for \(b\): \[ b = 2 \]
Substitute \(b = 2\) back into the first equation: \[ a = 1 - \frac{2}{3} = \frac{1}{3} \]
The values of \(a\) and \(b\) are: \[ a = \frac{1}{3}, \quad b = 2 \]
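The two conditions form a linear system in \(a\) and \(b\), so the result can be checked with `solve()` and then confirmed by integrating the resulting density numerically. This is a sketch under the equations derived above; `coef_matrix`, `density_fn` and the other names are illustrative assumptions.

```r
# Linear system:  a + b/3   = 1    (density integrates to 1)
#                 a/2 + b/4 = 2/3  (expected value is 2/3)
coef_matrix <- matrix(c(1,   1/3,
                        1/2, 1/4), nrow = 2, byrow = TRUE)
rhs <- c(1, 2/3)

solution <- solve(coef_matrix, rhs)
a <- solution[1]
b <- solution[2]

# Confirm both conditions numerically with the fitted density
density_fn <- function(x) a + b * x^2
total_prob <- integrate(density_fn, 0, 1)$value
mean_x     <- integrate(function(x) x * density_fn(x), 0, 1)$value

print(c(a = a, b = b, total_prob = total_prob, mean_x = mean_x))
```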
Let \(X_1,X_2,\ldots,X_n\) be independent, identically distributed random variables with common mean and variance. Find the values of \(c\) and \(d\) that will make the following formula true: \[ \mathbb{E}[(X_1+X_2+\ldots+X_n)^2] = c\;\mathbb{E}[X_1^2] + d\;\mathbb{E}[X_1]^2 \]
We are given that \(X_1, X_2, \ldots, X_n\) are independent, identically distributed random variables with common mean and variance. We want to find constants \(c\) and \(d\) such that the following equation holds: \[ \mathbb{E}[(X_1 + X_2 + \ldots + X_n)^2] = c \, \mathbb{E}[X_1^2] + d \, \mathbb{E}[X_1]^2 \]
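Expanding the square gives the squared terms plus the cross terms:
\[ \mathbb{E}[(X_1 + X_2 + \ldots + X_n)^2] = \mathbb{E}\left[\sum_{i=1}^n X_i^2 + 2 \sum_{1 \leq i < j \leq n} X_i X_j\right] \]
The expectation of the sum of squares, using linearity of expectation and the fact that the variables are identically distributed:
\[ \mathbb{E}\left[\sum_{i=1}^n X_i^2\right] = n \, \mathbb{E}[X_1^2] \]
The expectation of the cross terms: there are \(\binom{n}{2} = \frac{n(n-1)}{2}\) pairs with \(i < j\), and by independence \(\mathbb{E}[X_i X_j] = \mathbb{E}[X_i] \, \mathbb{E}[X_j] = \mathbb{E}[X_1]^2\), so:
\[ \mathbb{E}\left[2 \sum_{1 \leq i < j \leq n} X_i X_j\right] = 2 \sum_{1 \leq i < j \leq n} \mathbb{E}[X_1]^2 = n(n - 1) \, \mathbb{E}[X_1]^2 \]
Combining the two parts:
\[ \mathbb{E}[(X_1 + X_2 + \ldots + X_n)^2] = n \, \mathbb{E}[X_1^2] + n(n - 1) \, \mathbb{E}[X_1]^2 \]
Comparing this with the required form
\[ \mathbb{E}[(X_1 + X_2 + \ldots + X_n)^2] = c \, \mathbb{E}[X_1^2] + d \, \mathbb{E}[X_1]^2 \]
gives:
\[ c = n \] \[ d = n(n - 1) \]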
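The identity can be checked exactly on a small example by enumerating every combination of values. The sketch below assumes an arbitrary discrete distribution (values 0, 1, 3 with probabilities 0.2, 0.5, 0.3) and \(n = 3\); these numbers are made up purely for illustration.

```r
# Hypothetical discrete distribution for each X_i (illustration only)
vals  <- c(0, 1, 3)
probs <- c(0.2, 0.5, 0.3)
n     <- 3

# Enumerate all combinations of value indices for (X_1, ..., X_n)
grid <- expand.grid(rep(list(seq_along(vals)), n))

# Independence: the probability of a combination is the product of the marginals
combo_prob <- apply(grid, 1, function(idx) prod(probs[idx]))
combo_sum  <- apply(grid, 1, function(idx) sum(vals[idx]))

# Left-hand side: E[(X_1 + ... + X_n)^2] by direct enumeration
lhs <- sum(combo_prob * combo_sum^2)

# Right-hand side: n E[X^2] + n (n - 1) E[X]^2
m1  <- sum(vals * probs)
m2  <- sum(vals^2 * probs)
rhs <- n * m2 + n * (n - 1) * m1^2

print(c(lhs = lhs, rhs = rhs))
```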
A dataset for this part is shared with you. Please provide a correlation analysis of the variables (columns) and outline your findings. Please note that result visualisation is needed.
# You are free to install any library required for visualisation
# If you get an error that any of the libraries below is missing, install it via install.packages(...) or Tools -> Install Packages in RStudio
library(ISLR)
library(MASS)
library(dplyr)
data_new = read.csv('Assessment1_data.csv')
head(data_new)
## Marital.status.Application.mode.Application.order.Course.Daytime.evening.attendance..Previous.qualification.Previous.qualification..grade..Nacionality.Mother.s.qualification.Father.s.qualification.Mother.s.occupation.Father.s.occupation.Admission.gra ...
## 1 1;17;5;171;1;1;122.0;1;19;12;5;9;127.3;1;0;0;1;1;0;20;0;0;0;0;0;0.0;0;0;0;0;0;0.0;0;10.8;1.4;1.74;Dropout
## 2 1;15;1;9254;1;1;160.0;1;1;3;3;3;142.5;1;0;0;0;1;0;19;0;0;6;6;6;14.0;0;0;6;6;6;13.666666666666666;0;13.9;-0.3;0.79;Graduate
## 3 1;1;5;9070;1;1;122.0;1;37;37;9;9;124.8;1;0;0;0;1;0;19;0;0;6;0;0;0.0;0;0;6;0;0;0.0;0;10.8;1.4;1.74;Dropout
## 4 1;17;2;9773;1;1;122.0;1;38;37;5;3;119.6;1;0;0;1;0;0;20;0;0;6;8;6;13.428571428571429;0;0;6;10;5;12.4;0;9.4;-0.8;-3.12;Graduate
## 5 2;39;1;8014;0;1;100.0;1;37;38;9;9;141.5;0;0;0;1;0;0;45;0;0;6;9;5;12.333333333333334;0;0;6;6;6;13.0;0;13.9;-0.3;0.79;Graduate
## 6 2;39;1;9991;0;19;133.1;1;37;37;9;7;114.8;0;0;1;1;1;0;50;0;0;5;10;5;11.857142857142858;0;0;5;17;5;11.5;5;16.2;0.3;-0.92;Graduate
# The head() output above shows all fields packed into one column and separated by semicolons,
# so re-read the CSV with a semicolon delimiter
data_new <- read.csv('Assessment1_data.csv', sep = ";")
# Inspect the new column names
colnames(data_new)
## [1] "Marital.status"
## [2] "Application.mode"
## [3] "Application.order"
## [4] "Course"
## [5] "Daytime.evening.attendance."
## [6] "Previous.qualification"
## [7] "Previous.qualification..grade."
## [8] "Nacionality"
## [9] "Mother.s.qualification"
## [10] "Father.s.qualification"
## [11] "Mother.s.occupation"
## [12] "Father.s.occupation"
## [13] "Admission.grade"
## [14] "Displaced"
## [15] "Educational.special.needs"
## [16] "Debtor"
## [17] "Tuition.fees.up.to.date"
## [18] "Gender"
## [19] "Scholarship.holder"
## [20] "Age.at.enrollment"
## [21] "International"
## [22] "Curricular.units.1st.sem..credited."
## [23] "Curricular.units.1st.sem..enrolled."
## [24] "Curricular.units.1st.sem..evaluations."
## [25] "Curricular.units.1st.sem..approved."
## [26] "Curricular.units.1st.sem..grade."
## [27] "Curricular.units.1st.sem..without.evaluations."
## [28] "Curricular.units.2nd.sem..credited."
## [29] "Curricular.units.2nd.sem..enrolled."
## [30] "Curricular.units.2nd.sem..evaluations."
## [31] "Curricular.units.2nd.sem..approved."
## [32] "Curricular.units.2nd.sem..grade."
## [33] "Curricular.units.2nd.sem..without.evaluations."
## [34] "Unemployment.rate"
## [35] "Inflation.rate"
## [36] "GDP"
## [37] "Target"
# After ensuring proper column names, follow with cleaning if necessary:
colnames(data_new) <- make.names(colnames(data_new), unique = TRUE)
# Now check the cleaned column names
colnames(data_new)
## [1] "Marital.status"
## [2] "Application.mode"
## [3] "Application.order"
## [4] "Course"
## [5] "Daytime.evening.attendance."
## [6] "Previous.qualification"
## [7] "Previous.qualification..grade."
## [8] "Nacionality"
## [9] "Mother.s.qualification"
## [10] "Father.s.qualification"
## [11] "Mother.s.occupation"
## [12] "Father.s.occupation"
## [13] "Admission.grade"
## [14] "Displaced"
## [15] "Educational.special.needs"
## [16] "Debtor"
## [17] "Tuition.fees.up.to.date"
## [18] "Gender"
## [19] "Scholarship.holder"
## [20] "Age.at.enrollment"
## [21] "International"
## [22] "Curricular.units.1st.sem..credited."
## [23] "Curricular.units.1st.sem..enrolled."
## [24] "Curricular.units.1st.sem..evaluations."
## [25] "Curricular.units.1st.sem..approved."
## [26] "Curricular.units.1st.sem..grade."
## [27] "Curricular.units.1st.sem..without.evaluations."
## [28] "Curricular.units.2nd.sem..credited."
## [29] "Curricular.units.2nd.sem..enrolled."
## [30] "Curricular.units.2nd.sem..evaluations."
## [31] "Curricular.units.2nd.sem..approved."
## [32] "Curricular.units.2nd.sem..grade."
## [33] "Curricular.units.2nd.sem..without.evaluations."
## [34] "Unemployment.rate"
## [35] "Inflation.rate"
## [36] "GDP"
## [37] "Target"
data_new$Previous.qualification..grade. <- as.numeric(data_new$Previous.qualification..grade.)
data_new$Admission.grade <- as.numeric(data_new$Admission.grade)
data_new$Curricular.units.1st.sem..grade. <- as.numeric(data_new$Curricular.units.1st.sem..grade.)
data_new$Curricular.units.2nd.sem..grade. <- as.numeric(data_new$Curricular.units.2nd.sem..grade.)
data_new$Unemployment.rate <- as.numeric(data_new$Unemployment.rate)
data_new$Inflation.rate <- as.numeric(data_new$Inflation.rate)
data_new$GDP <- as.numeric(data_new$GDP)
# Generate summary statistics for the selected numeric columns
summary(data_new[c('Previous.qualification..grade.', 'Admission.grade',
                   'Curricular.units.1st.sem..grade.', 'Curricular.units.2nd.sem..grade.',
                   'Unemployment.rate', 'Inflation.rate', 'GDP')])
## Previous.qualification..grade. Admission.grade
## Min. : 95.0 Min. : 95.0
## 1st Qu.:125.0 1st Qu.:117.9
## Median :133.1 Median :126.1
## Mean :132.6 Mean :127.0
## 3rd Qu.:140.0 3rd Qu.:134.8
## Max. :190.0 Max. :190.0
## Curricular.units.1st.sem..grade. Curricular.units.2nd.sem..grade.
## Min. : 0.00 Min. : 0.00
## 1st Qu.:11.00 1st Qu.:10.75
## Median :12.29 Median :12.20
## Mean :10.64 Mean :10.23
## 3rd Qu.:13.40 3rd Qu.:13.33
## Max. :18.88 Max. :18.57
## Unemployment.rate Inflation.rate GDP
## Min. : 7.60 Min. :-0.800 Min. :-4.060000
## 1st Qu.: 9.40 1st Qu.: 0.300 1st Qu.:-1.700000
## Median :11.10 Median : 1.400 Median : 0.320000
## Mean :11.57 Mean : 1.228 Mean : 0.001969
## 3rd Qu.:13.90 3rd Qu.: 2.600 3rd Qu.: 1.790000
## Max. :16.20 Max. : 3.700 Max. : 3.510000
columns_to_plot <- c('Previous.qualification..grade.', 'Admission.grade',
                     'Curricular.units.1st.sem..grade.', 'Curricular.units.2nd.sem..grade.',
                     'Unemployment.rate', 'Inflation.rate', 'GDP')
# Create boxplots for the selected columns
par(mfrow = c(2, 4)) # Set up a 2 x 4 layout for multiple plots
for (column in columns_to_plot) {
  boxplot(data_new[[column]], main = column, ylab = column)
}
Previous qualification grade: The grades range from 95 to 190, with a median of 133.1 and an interquartile range of 125.0 to 140.0, so half of the students fall within this 15-point band. There are outliers both above and below the box, showing that a few students scored very high or very low compared to the majority.
Admission grade: Admission grades also range from 95 to 190, with a median of 126.1 (interquartile range 117.9 to 134.8). The maximum of 190 lies far above the upper quartile, so there are a few outliers at the high end, but most of the data points cluster near the middle of the distribution.
1st-semester grade: The grades for the first semester of curricular units range from 0 to 18.88, with a median of 12.29 (interquartile range 11.00 to 13.40). The mean (10.64) is well below the median, which suggests that a number of very low grades (the minimum is 0) pull the average down. Outliers can be seen both below and above the box, meaning some students performed exceptionally well or poorly.
2nd-semester grade: The range and distribution are similar to the 1st-semester grades, running from 0 to 18.57 with a median of 12.20 and a mean of 10.23. Again, there are outliers on both sides, indicating variability in student performance.
Unemployment rate: The unemployment rate ranges from 7.6% to 16.2%, with a median of 11.1%. The interquartile range (9.4% to 13.9%) covers about half of the full range, and the minimum and maximum lie close to the quartiles, so no extreme outliers are present.
Inflation rate: The inflation rate ranges from -0.8% to 3.7%, with a median of 1.4% and an interquartile range of 0.3% to 2.6%. No extreme outliers are present.
GDP: The GDP values range from -4.06 to 3.51, with a median of 0.32 and a mean close to 0. The quartiles (-1.70 and 1.79) are roughly symmetric around zero, so the values are spread fairly evenly between negative and positive.
Grades: The grades for previous qualifications, admissions, and curricular units (for both semesters) show moderate variability, with outliers in each case. This suggests a mix of student performance, with most students clustering around the median but a few outliers representing exceptionally high or low performance.
Economic Variables: For the unemployment rate, inflation rate, and GDP, the minimum and maximum in the summary statistics all lie within 1.5 interquartile ranges of the quartiles, so unlike the grades these variables show no boxplot outliers.
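The outlier statements above can be backed by a count rather than read off the plots. The sketch below defines a hypothetical helper, `count_outliers()`, built on `boxplot.stats()`, which applies the same 1.5 × IQR rule that `boxplot()` uses by default. The vector `example_grades` is made-up data for illustration, not taken from the assignment dataset.

```r
# Count the points that boxplot() would draw as outliers (default 1.5 * IQR rule)
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Made-up example values (illustration only, not from Assessment1_data.csv)
example_grades <- c(0, 10.5, 11, 11.8, 12.3, 12.9, 13.4, 14, 18.9)
n_outliers <- count_outliers(example_grades)
n_outliers

# Assuming data_new and columns_to_plot exist as defined earlier, the same
# helper could be applied to every column in one call:
# sapply(data_new[columns_to_plot], count_outliers)
```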
# Load necessary libraries
library(ggplot2)
# Step 1: Convert relevant columns to numeric (if not done already)
# columns_to_plot was defined above for the boxplots and lists the same seven columns
data_new[columns_to_plot] <- lapply(data_new[columns_to_plot], as.numeric)
# Step 2: Select continuous variables for analysis
selected_columns <- data_new[, c('Previous.qualification..grade.', 'Admission.grade',
                                 'Curricular.units.1st.sem..grade.', 'Curricular.units.2nd.sem..grade.',
                                 'Unemployment.rate', 'Inflation.rate', 'GDP')]
# Step 3: Calculate the correlation matrix
correlation_matrix <- cor(selected_columns, use = "complete.obs")
# Display the correlation matrix
print(correlation_matrix)
## Previous.qualification..grade. Admission.grade
## Previous.qualification..grade. 1.00000000 0.58044420
## Admission.grade 0.58044420 1.00000000
## Curricular.units.1st.sem..grade. 0.05943751 0.07386842
## Curricular.units.2nd.sem..grade. 0.05323897 0.07440153
## Unemployment.rate 0.04522227 0.03875569
## Inflation.rate 0.01871038 -0.02162358
## GDP -0.05262040 -0.01951948
## Curricular.units.1st.sem..grade.
## Previous.qualification..grade. 0.05943751
## Admission.grade 0.07386842
## Curricular.units.1st.sem..grade. 1.00000000
## Curricular.units.2nd.sem..grade. 0.83716974
## Unemployment.rate 0.01482095
## Inflation.rate -0.03390357
## GDP 0.05480076
## Curricular.units.2nd.sem..grade.
## Previous.qualification..grade. 0.053238974
## Admission.grade 0.074401530
## Curricular.units.1st.sem..grade. 0.837169741
## Curricular.units.2nd.sem..grade. 1.000000000
## Unemployment.rate 0.001461858
## Inflation.rate -0.038166042
## GDP 0.071269496
## Unemployment.rate Inflation.rate GDP
## Previous.qualification..grade. 0.045222268 0.01871038 -0.05262040
## Admission.grade 0.038755687 -0.02162358 -0.01951948
## Curricular.units.1st.sem..grade. 0.014820945 -0.03390357 0.05480076
## Curricular.units.2nd.sem..grade. 0.001461858 -0.03816604 0.07126950
## Unemployment.rate 1.000000000 -0.02888466 -0.33517812
## Inflation.rate -0.028884663 1.00000000 -0.11229464
## GDP -0.335178119 -0.11229464 1.00000000
# Step 4: Create scatterplots to visualize the relationships between variables
# pairs() draws one scatterplot for every pair of selected variables
pairs(selected_columns, main = "Scatterplots of Continuous Variables")
# Step 5: Analyze the strongest correlations
# Look at the correlation matrix and scatterplots to determine which pairs of variables are strongly correlated.
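Previous Qualification Grade vs. Admission Grade: A moderate positive correlation (\(r \approx 0.58\)) indicates that students with higher prior qualification grades tend to have higher admission grades.
Curricular Units 1st Sem Grade vs. 2nd Sem Grade: This is the strongest correlation in the matrix (\(r \approx 0.84\)), indicating that students who do well in the first semester tend to do well in the second.
Curricular Units (1st & 2nd Sem) vs. Admission Grade: The correlations are positive but very weak (\(r \approx 0.07\)), so the admission grade is at best a weak linear indicator of curricular unit grades.
Economic Variables (Unemployment Rate, Inflation, GDP): No strong correlations with academic performance (all \(|r| < 0.08\)). Among the economic variables themselves, unemployment rate and GDP show a moderate negative correlation (\(r \approx -0.34\)).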
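Reading the strongest pairs off a wide printed matrix is error-prone, so it can help to rank the pairs by absolute correlation. The sketch below defines a hypothetical helper, `rank_correlations()`, and demonstrates it on the built-in `mtcars` dataset as a stand-in, since the assignment data file is not bundled here.

```r
# Turn a correlation matrix into a table of variable pairs,
# sorted from strongest to weakest absolute correlation
rank_correlations <- function(cor_mat) {
  # Positions above the diagonal: each pair once, no self-correlations
  upper <- which(upper.tri(cor_mat), arr.ind = TRUE)
  pairs_df <- data.frame(
    var1 = rownames(cor_mat)[upper[, "row"]],
    var2 = colnames(cor_mat)[upper[, "col"]],
    correlation = cor_mat[upper],
    stringsAsFactors = FALSE
  )
  pairs_df[order(abs(pairs_df$correlation), decreasing = TRUE), ]
}

# Stand-in example using a built-in dataset (not the assignment data)
example_cor <- cor(mtcars[, c("mpg", "wt", "hp")])
ranked <- rank_correlations(example_cor)
print(ranked)

# Assuming correlation_matrix exists as computed earlier:
# head(rank_correlations(correlation_matrix), 5)
```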