ITO5197 Assessment 1

Probability, Bayes Theorem, and Correlation Analysis.

Student Name: Dhanya Veeraraghavan

Student Number: 29804841

Objectives

This assignment assesses your understanding of basic statistics, probability, Bayes’ Theorem, and the linear regression model, covered in Modules 1 and 2. The assessment is worth 50 marks in total, including 5 marks for presentation, and contributes 20% to your final score.

Important Note

  • You can complete your assignment using the code shared in the unit (i.e. Alexandria, videos, practical activities on Moodle) and this template as the basis. However, you should make sure the code you are using is correct and relevant to the question.

  • Please follow the structure of this template as much as you can.

  • You can use the prepopulated code cells or change them if you prefer. However, please do not change the names of the key variables, functions, and parameters, e.g. ambient, coolant. It helps us to read and understand your submission more efficiently.

  • All your answers need to be put into this file, and you can write equations in R markdown, please refer to https://rmd4sci.njtierney.com/math.html#example-math-commands for more information and examples.

  • Although this is not a coding unit, we expect you to write proper comments in your code snippets. You should also indent code blocks properly and avoid copying and pasting the same code block multiple times (write a function instead). Any violation of these code readability requirements will result in deductions from your presentation score (maximum of 5).

  • Two files are needed for this assignment, the first one is the .html file (For plagiarism checking), and the second one is the .Rmd file (For checking code). Failure to comply will result in 20% penalty on each missing file. All details in your answers need to be accounted for in both the HTML and the Rmarkdown. The naming format of the files to be submitted should be “StudentId.html” and “StudentId.Rmd”, e.g. “31234567.html” and “31234567.Rmd”. Files with the wrong format may incur 10% penalty each.

Part A: Probability theory [15 Marks]

Question 1 (3 Marks):

Given that \(P(A) = 0.55,\;P(\overline{B}) = 0.35, \text{ and } P(A \cup B) = 0.75\). Determine \(P(B)\) and \(P(A \cap B)\).

Given values:

\[P(A) = 0.55 \]
\[P(\overline{B}) = 0.35 \] \[P(A \cup B) = 0.75 \]

Calculate \(P(B)\) using the complement rule:

\[ P(B) = 1 - P(\overline{B}) = 1 - 0.35 = 0.65 \]

Solving for \(P(A \cap B)\) using the inclusion-exclusion rule:

\[ P(A \cap B) = P(A) + P(B) - P(A \cup B) \]

\[ P(A \cap B) = 0.55 + 0.65 - 0.75 = 0.45 \]
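The arithmetic above can be reproduced in R as a quick check. This is a minimal sketch; the variable names are illustrative and not part of the template.

```r
# Given quantities from the question
p_a <- 0.55        # P(A)
p_not_b <- 0.35    # P(complement of B)
p_a_or_b <- 0.75   # P(A union B)

# Complement rule, then inclusion-exclusion rearranged for the intersection
p_b <- 1 - p_not_b
p_a_and_b <- p_a + p_b - p_a_or_b

c(p_b = p_b, p_a_and_b = p_a_and_b)
```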

Question 2 (4 Marks):

Mr. and Mrs. Brown have two children. The probability of a boy or a girl being born is the same, and the genders of the children are independent of each other. One child is randomly selected, and we learn that she is a girl. What is the probability that both children are girls?

\[ P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)} \]

Where:

  • \(A\) is the event that both children are girls.

  • \(B\) is the event that one randomly selected child is a girl.

1. Probability that both children are girls \(P(A)\):

Since the probability of each child being a girl is \(\frac{1}{2}\) and the events are independent, we have:

\[ P(A) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \]

2. Probability that one randomly selected child is a girl \(P(B)\):

The selected child is a single child chosen at random, and each child is a girl with probability \(\frac{1}{2}\). Equivalently, by the law of total probability over the number of girls in the family (two, one, or none):

\[ P(B) = 1 \times \frac{1}{4} + \frac{1}{2} \times \frac{1}{2} + 0 \times \frac{1}{4} = \frac{1}{2} \]

Note that \(\frac{3}{4}\) is the probability that at least one of the two children is a girl, which is a different event from \(B\).

3. Probability that the randomly selected child is a girl given both are girls \(P(B | A)\):

If both children are girls, the probability that one randomly selected child is a girl is:

\[ P(B | A) = 1 \]

4. Applying Bayes’ Theorem:

Using Bayes’ Theorem to calculate \(P(A | B)\):

\[ P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)} \]

Substituting the values:

\[ P(A | B) = \frac{1 \times \frac{1}{4}}{\frac{1}{2}} = \frac{1}{2} \]

Final Answer:

The probability that both children are girls given that one randomly selected child is a girl is \(\frac{1}{2}\).
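As a cross-check, the sample space can be enumerated in R. The sketch below (all names are illustrative) lists the four equally likely families and applies Bayes’ Theorem under two readings of the evidence: a single randomly selected child turns out to be a girl, which is the wording of the question, and the weaker statement that at least one child is a girl.

```r
# All equally likely (first child, second child) gender combinations
families <- expand.grid(first = c("G", "B"), second = c("G", "B"),
                        stringsAsFactors = FALSE)
families$prob <- 1 / 4

# Number of girls in each family
families$girls <- (families$first == "G") + (families$second == "G")

# P(randomly selected child is a girl | family) = girls / 2
p_pick_girl <- families$girls / 2

# Bayes: P(both girls | the selected child is a girl)
both_girls <- families$girls == 2
p_selected <- sum(families$prob[both_girls] * p_pick_girl[both_girls]) /
  sum(families$prob * p_pick_girl)

# Contrast: P(both girls | at least one child is a girl)
p_at_least_one <- sum(families$prob[both_girls]) /
  sum(families$prob[families$girls >= 1])

c(selected_child_is_girl = p_selected, at_least_one_girl = p_at_least_one)
```

The two readings give different answers because selecting a child at random makes a two-girl family twice as likely to produce the observation as a one-girl family.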

Question 3 (4 Marks):

Leon is marking a hard multiple choice question. A student may answer the question correctly by guessing with a probability of 1/4. Suppose a student has a 50% chance of knowing the answer. What is the probability that this student indeed knows the answer, given that the question is answered correctly?

\[ P(K | C) = \frac{P(C | K) \cdot P(K)}{P(C)} \]

Where:

  • \(K\) is the event that the student knows the answer.

  • \(C\) is the event that the student answers the question correctly.

1. Probability that the student knows the answer \(P(K)\):

Given the student has a 50% chance of knowing the answer, we have:

\[ P(K) = 0.5 \]

2. Probability that the student is guessing \(P(G)\):

The student either knows the answer or guesses. Since \(P(K) = 0.5\), the probability that the student is guessing is:

\[ P(G) = 1 - P(K) = 0.5 \]

3. Probability that the student answers correctly given they know the answer \(P(C | K)\):

If the student knows the answer, they will definitely answer correctly, so:

\[ P(C | K) = 1 \]

4. Probability that the student answers correctly given they are guessing \(P(C | G)\):

If the student is guessing, there is a \(\frac{1}{4}\) chance of answering correctly, therefore:

\[ P(C | G) = \frac{1}{4} \]

5. Total probability of answering correctly \(P(C)\):

To calculate \(P(C)\), we use the law of total probability:

\[ P(C) = P(C | K) \cdot P(K) + P(C | G) \cdot P(G) \]

Substituting the values:

\[ P(C) = (1 \times 0.5) + \left(\frac{1}{4} \times 0.5\right) = 0.5 + 0.125 = 0.625 \]

6. Using Bayes’ Theorem to calculate \(P(K | C)\):

Apply Bayes’ Theorem:

\[ P(K | C) = \frac{P(C | K) \cdot P(K)}{P(C)} \]

Substitute the values:

\[ P(K | C) = \frac{1 \times 0.5}{0.625} = \frac{0.5}{0.625} = 0.8 \]

Answer:

The probability that the student knows the answer given that they answered correctly is \(0.8\), or 80%.
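The same calculation can be written as a small reusable function. This is a sketch only: `bayes_posterior` is a hypothetical helper name, and it assumes the hypotheses (knows, guesses) are mutually exclusive and exhaustive, as in the derivation above.

```r
# Posterior probabilities from priors and likelihoods (Bayes' Theorem)
bayes_posterior <- function(prior, likelihood) {
  joint <- prior * likelihood   # P(evidence | H) * P(H) for each hypothesis
  joint / sum(joint)            # divide by the total probability of the evidence
}

prior <- c(knows = 0.5, guesses = 0.5)        # P(K), P(G)
likelihood <- c(knows = 1, guesses = 1 / 4)   # P(C | K), P(C | G)

posterior <- bayes_posterior(prior, likelihood)
posterior["knows"]
```

Writing the rule as a function keeps the prior and the likelihood separate, so the same helper applies to any two-hypothesis problem without repeating the arithmetic.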

Question 4 (4 Marks):

A gambler has a fair coin (i.e., the coin has both the head and the tail side) and a two-tailed coin (i.e., both sides are tail) in his pocket. He selects one of the coins at random, and when he flips it, it shows tail. Then he flips the same coin a second time and again it shows tail. What is the probability that it is the two-tailed coin?

Bayes’ Theorem is:

\[ P(C_{2T} | TT) = \frac{P(TT | C_{2T}) \cdot P(C_{2T})}{P(TT)} \]

Where:

  • \(C_{2T}\) is the event that the coin is the two-tailed coin.

  • \(TT\) is the event that two tails are observed.

1. Probability of selecting the two-tailed coin \(P(C_{2T})\):

The gambler selects one of the two coins at random. Therefore:

\[ P(C_{2T}) = \frac{1}{2} \]

2. Probability of selecting the fair coin \(P(C_F)\):

Similarly, the probability of selecting the fair coin is also:

\[ P(C_F) = \frac{1}{2} \]

3. Probability of getting two tails given the two-tailed coin \(P(TT | C_{2T})\):

\[ P(TT | C_{2T}) = 1 \]

4. Probability of getting two tails given the fair coin \(P(TT | C_F)\):

\[ P(TT | C_F) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \]

5. Total probability of getting two tails \(P(TT)\):

To find the total probability of getting two tails, we use the law of total probability:

\[ P(TT) = P(TT | C_{2T}) \cdot P(C_{2T}) + P(TT | C_F) \cdot P(C_F) \]

Substituting the known values:

\[ P(TT) = (1 \times \frac{1}{2}) + \left(\frac{1}{4} \times \frac{1}{2}\right) = \frac{1}{2} + \frac{1}{8} = \frac{5}{8} \]

6. Applying Bayes’ Theorem to calculate \(P(C_{2T} | TT)\):

Now, we apply Bayes’ Theorem:

\[ P(C_{2T} | TT) = \frac{P(TT | C_{2T}) \cdot P(C_{2T})}{P(TT)} \]

Substitute the values:

\[ P(C_{2T} | TT) = \frac{1 \times \frac{1}{2}}{\frac{5}{8}} = \frac{1}{2} \times \frac{8}{5} = \frac{4}{5} \]

Answer:

The probability that the gambler selected the two-tailed coin given that two tails were observed is \(\frac{4}{5}\) or 0.8.
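A Monte Carlo simulation offers an independent check on this result. The sketch below is illustrative: the seed, the number of trials, and the variable names are arbitrary choices, and the estimate only approximates the exact value.

```r
set.seed(5197)  # arbitrary seed so the run is reproducible
n_trials <- 100000

# TRUE when the two-tailed coin is drawn, FALSE when the fair coin is drawn
two_tailed <- sample(c(TRUE, FALSE), n_trials, replace = TRUE)

# Probability of a tail on each flip depends on the coin drawn
p_tail <- ifelse(two_tailed, 1, 0.5)
flip1_tail <- runif(n_trials) < p_tail
flip2_tail <- runif(n_trials) < p_tail

# Keep only the trials where both flips showed tails
both_tails <- flip1_tail & flip2_tail
estimate <- mean(two_tailed[both_tails])
estimate
```

The estimate should be close to \(\frac{4}{5}\), with the gap shrinking as `n_trials` grows.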

Part B: Characteristics of random variable, distributions [20 Marks]

Question 1 (4 Marks):

You toss 11 unfair coins where the chance of tossing a head is \(0.6\). What is the combined probability of getting exactly 5 or exactly 6 heads?

1. Binomial Distribution

The binomial distribution gives us the probability of obtaining exactly \(k\) heads in \(n\) independent coin tosses:

\[ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \]

Where \(n = 11\) is the number of tosses, \(p = 0.6\) is the probability of a head on each toss, and \(X\) is the number of heads obtained.

2. Probability of Getting Exactly 5 Heads

Calculate the cumulative probability of getting up to 5 heads, \(P(X \leq 5)\), using pbinom(5, n, p). To calculate the probability of getting exactly 5 heads, we subtract \(P(X \leq 4)\) from \(P(X \leq 5)\):

\[ P(X = 5) = P(X \leq 5) - P(X \leq 4) \]

3. Probability of Getting Exactly 6 Heads

Similarly, to calculate the probability of getting exactly 6 heads, we subtract \(P(X \leq 5)\) from \(P(X \leq 6)\):

\[ P(X = 6) = P(X \leq 6) - P(X \leq 5) \]

4. Combined Probability of Getting Exactly 5 or 6 Heads

The combined probability of getting either 5 or 6 heads is the sum of the individual probabilities:

\[ P(X = 5 \text{ or } X = 6) = P(X = 5) + P(X = 6) \]

Now, let’s calculate this in R.

n <- 11   # Number of coin tosses
p <- 0.6  # Probability of a head on each toss

# Step 1: Calculate the probability of getting exactly 5 heads using pbinom
P_5_heads <- pbinom(5, n, p) - pbinom(4, n, p)

# Step 2: Calculate the probability of getting exactly 6 heads using pbinom
P_6_heads <- pbinom(6, n, p) - pbinom(5, n, p)

# Step 3: Calculate the combined probability of getting exactly 5 or exactly 6 heads
P_combined <- P_5_heads + P_6_heads
P_combined
## [1] 0.3678732
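As a cross-check, the same probability can be obtained from the probability mass function with `dbinom`, which returns \(P(X = k)\) directly. This is a sketch alongside the `pbinom` approach above, not a replacement for it; `p_direct` and `p_cumulative` are illustrative names.

```r
n <- 11    # number of coin tosses
p <- 0.6   # probability of a head on each toss

# dbinom gives P(X = k) directly, so P(X = 5) + P(X = 6) is a single sum
p_direct <- sum(dbinom(5:6, size = n, prob = p))

# Same quantity from the cumulative distribution: P(X <= 6) - P(X <= 4)
p_cumulative <- pbinom(6, n, p) - pbinom(4, n, p)

c(direct = p_direct, cumulative = p_cumulative)
```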

Question 2 (4 Marks):

The diameter (a continuous variable) of a certain disk follows the uniform distribution within a specific interval \([a,b]\) i.e., \(X\sim\mathcal{U}\left(a,b\right)\) with \(a=0\) and \(b=2\), find the average area of the disc.

Hint: You have to make use of relationship for the Expectation and Variance. And the formula to compute the area of a disk is \(\pi r^2\)(r means radius).

The diameter of the disk follows a uniform distribution \(X \sim \mathcal{U}(a, b)\), where \(a = 0\) and \(b = 2\).

Since the formula for the area of a disk is \(A = \pi r^2\), and the radius \(r = \frac{X}{2}\), we need to calculate \(E[r^2]\).

1. Expected Value of \(r^2\)

To calculate the expected value of \(r^2\), we first express the radius as \(r = \frac{X}{2}\), so \(r^2 = \left(\frac{X}{2}\right)^2 = \frac{X^2}{4}\). Now, the expected value of \(X^2\) for a uniform distribution is:

\[ E[X^2] = \frac{b^3 - a^3}{3(b - a)} \]

Substituting the values \(a = 0\) and \(b = 2\):

\[ E[X^2] = \frac{2^3 - 0^3}{3(2 - 0)} = \frac{8}{6} = \frac{4}{3} \]

Since \(r = \frac{X}{2}\), we have \(r^2 = \frac{X^2}{4}\), so:

\[ E[r^2] = \frac{E[X^2]}{4} = \frac{\frac{4}{3}}{4} = \frac{1}{3} \]

2. Average Area of the Disk

The formula for the area of a disk is \(A = \pi r^2\). Therefore, the average area is:

\[ E[A] = \pi E[r^2] = \pi \times \frac{1}{3} = \frac{\pi}{3} \]

Now, let’s calculate this in R.

# Step 1: Define the bounds of the uniform distribution
a <- 0  # Lower bound of the uniform distribution
b <- 2  # Upper bound of the uniform distribution

# Step 2: Calculate the expected value of r^2 for the uniform distribution
# E[X^2] = (b^3 - a^3) / (3 * (b - a)), and r = X / 2, so E[r^2] = E[X^2] / 4
E_r_squared <- (b^3 - a^3) / (3 * (b - a)) / 4

# Step 3: Calculate the average area of the disk
# E[A] = π * E[r^2]
average_area <- pi * E_r_squared
average_area  
## [1] 1.047198
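The closed-form result can also be checked by numerical integration of the area against the uniform density, \(E[A] = \int_a^b \pi (x/2)^2 \, f(x) \, dx\). This is a sketch using base R's `integrate` and `dunif`; the helper names are illustrative.

```r
a <- 0  # lower bound of the diameter
b <- 2  # upper bound of the diameter

# Area of a disk as a function of its diameter x
disk_area <- function(x) pi * (x / 2)^2

# Integrand: area weighted by the uniform density on [a, b]
integrand <- function(x) disk_area(x) * dunif(x, min = a, max = b)

numeric_area <- integrate(integrand, lower = a, upper = b)$value
c(numeric = numeric_area, closed_form = pi / 3)
```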

Question 3 (4 Marks):

Given \(K \sim N(\mu=0,\sigma^2=1)\), i.e, \(K\) is Gaussian distributed, what’s the probability that the equation \(x^2 + 2Kx +1 =0\) has real solutions? Hint: The solution to the quadratic equation \(ax^2+bx+c=0\) where \(a\), \(b\) and \(c\) are real constants and \(x\) is unknown, is \(x=\frac{-b\pm \sqrt{b^2-4ac}}{2a}\).

The quadratic equation \(x^2 + 2Kx + 1 = 0\) has real solutions if and only if its discriminant is non-negative. The discriminant \(\Delta\) is given by:

\[ \Delta = b^2 - 4ac \]

where \(a = 1\), \(b = 2K\), and \(c = 1\). Substituting these values:

\[ \Delta = (2K)^2 - 4 \cdot 1 \cdot 1 = 4K^2 - 4 \]

For the quadratic equation to have real solutions, we need \(\Delta \geq 0\):

\[ 4K^2 - 4 \geq 0 \]

Solving this inequality:

\[ 4K^2 - 4 \geq 0 \] \[ 4K^2 \geq 4 \] \[ K^2 \geq 1 \] \[ |K| \geq 1 \]

Thus, the quadratic equation will have real solutions if \(|K| \geq 1\).

To find the probability that \(|K| \geq 1\), we calculate:

\[ P(|K| \geq 1) = P(K \leq -1) + P(K \geq 1) \]

where \(K\) follows a standard normal distribution \(N(0, 1)\).

Calculations

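1. Probability \(P(K \leq -1)\)

\[ P(K \leq -1) = \Phi(-1) \]

2. Probability \(P(K \geq 1)\)

\[ P(K \geq 1) = 1 - \Phi(1) \]

Here \(\Phi\) denotes the cumulative distribution function of the standard normal distribution, which R provides as pnorm.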

Here is the R code to compute these probabilities:

# Calculate P(K <= -1)
P_K_less_than_minus_1 <- pnorm(-1, mean = 0, sd = 1)

# Calculate P(K >= 1)
P_K_greater_than_1 <- 1 - pnorm(1, mean = 0, sd = 1)

# Total probability of |K| >= 1
P_real_solutions <- P_K_less_than_minus_1 + P_K_greater_than_1
P_real_solutions
## [1] 0.3173105
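Two quick checks on this value are sketched below. The first uses the symmetry of the standard normal, \(P(|K| \geq 1) = 2\Phi(-1)\). The second draws random values of \(K\) and tests the discriminant condition directly. The seed and sample size are arbitrary choices.

```r
# By symmetry of the standard normal, P(|K| >= 1) = 2 * P(K <= -1)
p_symmetry <- 2 * pnorm(-1)

# Monte Carlo check: draw K and test whether the discriminant 4K^2 - 4 is >= 0
set.seed(1)  # arbitrary seed for reproducibility
k <- rnorm(200000)
p_simulated <- mean(4 * k^2 - 4 >= 0)

c(symmetry = p_symmetry, simulated = p_simulated)
```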

Question 4 (4 Marks):

Given a random variable \(X\) with density function \[ p(x) = \begin{cases} a+bx^2 &0 \leq x \leq 1\\ 0 & \text{otherwise} \end{cases} \] such that \(E[X] = 2/3\). What’s the value of a and b?

We are given a random variable \(X\) with the following probability density function (PDF):

\[ p(x) = \begin{cases} a + bx^2 & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases} \]

We are also told that the expected value \(E[X] = \frac{2}{3}\). We need to find the values of \(a\) and \(b\).

Step 1: Normalization Condition

The total probability must integrate to 1: \[ \int_0^1 (a + bx^2) \, dx = 1 \]

This gives the equation: \[ a + \frac{b}{3} = 1 \] Rearranging for \(a\): \[ a = 1 - \frac{b}{3} \]

Step 2: Expected Value Condition

We are also given that the expected value of \(X\) is \(\frac{2}{3}\): \[ E[X] = \int_0^1 x(a + bx^2) \, dx \]

Calculating the integral: \[ E[X] = \frac{a}{2} + \frac{b}{4} \]

We are given that \(E[X] = \frac{2}{3}\), so: \[ \frac{a}{2} + \frac{b}{4} = \frac{2}{3} \]

Step 3: Solve the System of Equations

Substitute \(a = 1 - \frac{b}{3}\) into the expected value equation:

\[ \frac{1 - \frac{b}{3}}{2} + \frac{b}{4} = \frac{2}{3} \]

Multiply the entire equation by 12 to eliminate fractions: \[ 6\left(1 - \frac{b}{3}\right) + 3b = 8 \]

Simplifying: \[ 6 - 2b + 3b = 8 \]

Solving for \(b\): \[ b = 2 \]

Substitute \(b = 2\) back into the first equation: \[ a = 1 - \frac{2}{3} = \frac{1}{3} \]

Answer

The values of \(a\) and \(b\) are: \[ a = \frac{1}{3}, \quad b = 2 \]
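The two conditions form a linear system in \(a\) and \(b\), so they can be solved and verified numerically. The sketch below uses base R's `solve` for the system and `integrate` to confirm that the resulting density integrates to 1 and has mean \(\frac{2}{3}\). The names `a_val` and `b_val` are illustrative.

```r
# The two linear conditions in (a, b):
#   normalisation:   a     + b / 3 = 1
#   expected value:  a / 2 + b / 4 = 2 / 3
lhs <- matrix(c(1,     1 / 3,
                1 / 2, 1 / 4), nrow = 2, byrow = TRUE)
rhs <- c(1, 2 / 3)

solution <- solve(lhs, rhs)
a_val <- solution[1]
b_val <- solution[2]

# Verify both conditions by numerical integration over [0, 1]
density_fn <- function(x) a_val + b_val * x^2
total_prob <- integrate(density_fn, 0, 1)$value
mean_x <- integrate(function(x) x * density_fn(x), 0, 1)$value

c(a = a_val, b = b_val, total_prob = total_prob, mean_x = mean_x)
```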

Question 5 (4 Marks):

Let \(X_1,X_2,\ldots,X_n\) be independent, identically distributed random variables with common mean and variance. Find the values of \(c\) and \(d\) that will make the following formula true: \[ \mathbb{E}[(X_1+X_2+\ldots+X_n)^2] = c\;\mathbb{E}[X_1^2] + d\;\mathbb{E}[X_1]^2 \]

We are given that \(X_1, X_2, \ldots, X_n\) are independent, identically distributed random variables with common mean and variance. We want to find constants \(c\) and \(d\) such that the following equation holds: \[ \mathbb{E}[(X_1 + X_2 + \ldots + X_n)^2] = c \, \mathbb{E}[X_1^2] + d \, \mathbb{E}[X_1]^2 \]

Step 1: Expand the square on the left-hand side:

\[ \mathbb{E}[(X_1 + X_2 + \ldots + X_n)^2] = \mathbb{E}\left[\sum_{i=1}^n X_i^2 + 2 \sum_{1 \leq i < j \leq n} X_i X_j\right] \]

Step 2: Taking the expectation:

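The expectation of the sum of squares: \[ \mathbb{E}\left[\sum_{i=1}^n X_i^2\right] = n \, \mathbb{E}[X_1^2] \]

The expectation of the cross terms: by independence, \(\mathbb{E}[X_i X_j] = \mathbb{E}[X_i] \, \mathbb{E}[X_j] = \mathbb{E}[X_1]^2\) for \(i \neq j\), and there are \(\frac{n(n-1)}{2}\) pairs with \(i < j\), so: \[ \mathbb{E}\left[2 \sum_{1 \leq i < j \leq n} X_i X_j\right] = 2 \sum_{1 \leq i < j \leq n} \mathbb{E}[X_1]^2 = n(n - 1) \, \mathbb{E}[X_1]^2 \]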

Step 3: Combine the two parts:

\[ \mathbb{E}[(X_1 + X_2 + \ldots + X_n)^2] = n \, \mathbb{E}[X_1^2] + n(n - 1) \, \mathbb{E}[X_1]^2 \]

Step 4: Matching the equation with the given form:

\[ \mathbb{E}[(X_1 + X_2 + \ldots + X_n)^2] = c \, \mathbb{E}[X_1^2] + d \, \mathbb{E}[X_1]^2 \]

Therefore, by comparing coefficients:

\[ c = n \] \[ d = n(n - 1) \]
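The identity can be verified exactly for a concrete case by enumerating every outcome. The sketch below assumes, purely for illustration, that each \(X_i\) is a fair six-sided die roll and that \(n = 3\). Neither choice comes from the question.

```r
n_vars <- 3     # number of i.i.d. variables (illustrative choice)
faces <- 1:6    # each X_i is a fair die roll (illustrative choice)

# Every equally likely outcome of (X_1, ..., X_n)
outcomes <- expand.grid(rep(list(faces), n_vars))

# Left-hand side: exact E[(X_1 + ... + X_n)^2]
lhs <- mean(rowSums(outcomes)^2)

# Right-hand side with c = n and d = n(n - 1)
rhs <- n_vars * mean(faces^2) + n_vars * (n_vars - 1) * mean(faces)^2

c(lhs = lhs, rhs = rhs)
```

Both sides agree, which is consistent with \(c = n\) and \(d = n(n - 1)\) for this example.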

Part C: Correlation analysis [10 Marks]

A dataset for this part is shared with you. Please provide a correlation analysis of the variables (columns) and outline your findings. Please note that result visualisation is needed.

# You are free to install any required library for visualisation

# If you get an error that any of these libraries is missing, please install it via install.packages(...) or Tools -> Install Packages in RStudio

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.1
library(MASS)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data_new = read.csv('Assessment1_data.csv')
head(data_new)
##   Marital.status.Application.mode.Application.order.Course.Daytime.evening.attendance..Previous.qualification.Previous.qualification..grade..Nacionality.Mother.s.qualification.Father.s.qualification.Mother.s.occupation.Father.s.occupation.Admission.gra ...
## 1                                                                                                                                                      1;17;5;171;1;1;122.0;1;19;12;5;9;127.3;1;0;0;1;1;0;20;0;0;0;0;0;0.0;0;0;0;0;0;0.0;0;10.8;1.4;1.74;Dropout
## 2                                                                                                                                     1;15;1;9254;1;1;160.0;1;1;3;3;3;142.5;1;0;0;0;1;0;19;0;0;6;6;6;14.0;0;0;6;6;6;13.666666666666666;0;13.9;-0.3;0.79;Graduate
## 3                                                                                                                                                      1;1;5;9070;1;1;122.0;1;37;37;9;9;124.8;1;0;0;0;1;0;19;0;0;6;0;0;0.0;0;0;6;0;0;0.0;0;10.8;1.4;1.74;Dropout
## 4                                                                                                                                  1;17;2;9773;1;1;122.0;1;38;37;5;3;119.6;1;0;0;1;0;0;20;0;0;6;8;6;13.428571428571429;0;0;6;10;5;12.4;0;9.4;-0.8;-3.12;Graduate
## 5                                                                                                                                   2;39;1;8014;0;1;100.0;1;37;38;9;9;141.5;0;0;0;1;0;0;45;0;0;6;9;5;12.333333333333334;0;0;6;6;6;13.0;0;13.9;-0.3;0.79;Graduate
## 6                                                                                                                                2;39;1;9991;0;19;133.1;1;37;37;9;7;114.8;0;0;1;1;1;0;50;0;0;5;10;5;11.857142857142858;0;0;5;17;5;11.5;5;16.2;0.3;-0.92;Graduate

(a) Obtain the basic statistical information like in Module 1 tutorial (e.g boxplot, summary, etc). Explain your observations based on your findings.[5 marks]

# Re-read the CSV with a semicolon delimiter: head() above shows all fields packed into one column separated by ';'
data_new <- read.csv('Assessment1_data.csv', sep = ";")

# Inspect the new column names
colnames(data_new)
##  [1] "Marital.status"                                
##  [2] "Application.mode"                              
##  [3] "Application.order"                             
##  [4] "Course"                                        
##  [5] "Daytime.evening.attendance."                   
##  [6] "Previous.qualification"                        
##  [7] "Previous.qualification..grade."                
##  [8] "Nacionality"                                   
##  [9] "Mother.s.qualification"                        
## [10] "Father.s.qualification"                        
## [11] "Mother.s.occupation"                           
## [12] "Father.s.occupation"                           
## [13] "Admission.grade"                               
## [14] "Displaced"                                     
## [15] "Educational.special.needs"                     
## [16] "Debtor"                                        
## [17] "Tuition.fees.up.to.date"                       
## [18] "Gender"                                        
## [19] "Scholarship.holder"                            
## [20] "Age.at.enrollment"                             
## [21] "International"                                 
## [22] "Curricular.units.1st.sem..credited."           
## [23] "Curricular.units.1st.sem..enrolled."           
## [24] "Curricular.units.1st.sem..evaluations."        
## [25] "Curricular.units.1st.sem..approved."           
## [26] "Curricular.units.1st.sem..grade."              
## [27] "Curricular.units.1st.sem..without.evaluations."
## [28] "Curricular.units.2nd.sem..credited."           
## [29] "Curricular.units.2nd.sem..enrolled."           
## [30] "Curricular.units.2nd.sem..evaluations."        
## [31] "Curricular.units.2nd.sem..approved."           
## [32] "Curricular.units.2nd.sem..grade."              
## [33] "Curricular.units.2nd.sem..without.evaluations."
## [34] "Unemployment.rate"                             
## [35] "Inflation.rate"                                
## [36] "GDP"                                           
## [37] "Target"
# After ensuring proper column names, follow with cleaning if necessary:
colnames(data_new) <- make.names(colnames(data_new), unique = TRUE)

# Now check the cleaned column names
colnames(data_new)
##  [1] "Marital.status"                                
##  [2] "Application.mode"                              
##  [3] "Application.order"                             
##  [4] "Course"                                        
##  [5] "Daytime.evening.attendance."                   
##  [6] "Previous.qualification"                        
##  [7] "Previous.qualification..grade."                
##  [8] "Nacionality"                                   
##  [9] "Mother.s.qualification"                        
## [10] "Father.s.qualification"                        
## [11] "Mother.s.occupation"                           
## [12] "Father.s.occupation"                           
## [13] "Admission.grade"                               
## [14] "Displaced"                                     
## [15] "Educational.special.needs"                     
## [16] "Debtor"                                        
## [17] "Tuition.fees.up.to.date"                       
## [18] "Gender"                                        
## [19] "Scholarship.holder"                            
## [20] "Age.at.enrollment"                             
## [21] "International"                                 
## [22] "Curricular.units.1st.sem..credited."           
## [23] "Curricular.units.1st.sem..enrolled."           
## [24] "Curricular.units.1st.sem..evaluations."        
## [25] "Curricular.units.1st.sem..approved."           
## [26] "Curricular.units.1st.sem..grade."              
## [27] "Curricular.units.1st.sem..without.evaluations."
## [28] "Curricular.units.2nd.sem..credited."           
## [29] "Curricular.units.2nd.sem..enrolled."           
## [30] "Curricular.units.2nd.sem..evaluations."        
## [31] "Curricular.units.2nd.sem..approved."           
## [32] "Curricular.units.2nd.sem..grade."              
## [33] "Curricular.units.2nd.sem..without.evaluations."
## [34] "Unemployment.rate"                             
## [35] "Inflation.rate"                                
## [36] "GDP"                                           
## [37] "Target"
# Convert the continuous columns to numeric in a single pass
numeric_columns <- c('Previous.qualification..grade.', 'Admission.grade', 
                     'Curricular.units.1st.sem..grade.', 'Curricular.units.2nd.sem..grade.', 
                     'Unemployment.rate', 'Inflation.rate', 'GDP')
data_new[numeric_columns] <- lapply(data_new[numeric_columns], as.numeric)

# Generate summary statistics for the selected numeric columns
summary(data_new[numeric_columns])
##  Previous.qualification..grade. Admission.grade
##  Min.   : 95.0                  Min.   : 95.0  
##  1st Qu.:125.0                  1st Qu.:117.9  
##  Median :133.1                  Median :126.1  
##  Mean   :132.6                  Mean   :127.0  
##  3rd Qu.:140.0                  3rd Qu.:134.8  
##  Max.   :190.0                  Max.   :190.0  
##  Curricular.units.1st.sem..grade. Curricular.units.2nd.sem..grade.
##  Min.   : 0.00                    Min.   : 0.00                   
##  1st Qu.:11.00                    1st Qu.:10.75                   
##  Median :12.29                    Median :12.20                   
##  Mean   :10.64                    Mean   :10.23                   
##  3rd Qu.:13.40                    3rd Qu.:13.33                   
##  Max.   :18.88                    Max.   :18.57                   
##  Unemployment.rate Inflation.rate        GDP           
##  Min.   : 7.60     Min.   :-0.800   Min.   :-4.060000  
##  1st Qu.: 9.40     1st Qu.: 0.300   1st Qu.:-1.700000  
##  Median :11.10     Median : 1.400   Median : 0.320000  
##  Mean   :11.57     Mean   : 1.228   Mean   : 0.001969  
##  3rd Qu.:13.90     3rd Qu.: 2.600   3rd Qu.: 1.790000  
##  Max.   :16.20     Max.   : 3.700   Max.   : 3.510000
columns_to_plot <- c('Previous.qualification..grade.', 'Admission.grade', 
                     'Curricular.units.1st.sem..grade.', 'Curricular.units.2nd.sem..grade.', 
                     'Unemployment.rate', 'Inflation.rate', 'GDP')

# Create boxplots for the selected columns
par(mfrow=c(2, 4))  # Set up the layout for multiple plots
for (column in columns_to_plot) {
    boxplot(data_new[[column]], main = column, ylab = column)
}

Previous Qualification Grade:

The grades range from 95 to 190, with a median of 133.1 and a mean of 132.6. The middle 50% of students fall between 125 and 140, so the box is narrow relative to the full range. Under the 1.5 × IQR rule used by boxplot(), values below about 102.5 or above about 162.5 are drawn as outliers, so outliers appear on both sides: a few students scored far above or far below the majority.

Admission Grade:

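Admission grades also range from 95 to 190, with a median of 126.1 and a mean of 127.0, slightly lower than the previous qualification grades. The interquartile range is 117.9 to 134.8. The upper fence is about 160, so the outliers are concentrated at the high end, while the minimum of 95 lies just inside the lower fence of about 92.6.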

Curricular Units 1st Semester (Grade):

The grades for the first semester of curricular units range from 0 to 18.88, with a median of 12.29 and an interquartile range of 11.00 to 13.40. The mean (10.64) is well below the median, which suggests that low values, including grades of 0, pull the average down. Outliers appear on both sides of the box: very low grades at the bottom and a few very high grades (above roughly 17) at the top.

Curricular Units 2nd Semester (Grade):

The range and distribution are similar to the 1st-semester grades: 0 to 18.57, with a median of 12.20, an interquartile range of 10.75 to 13.33, and a mean (10.23) below the median. Outliers again appear at both ends, indicating variability in student performance.

Unemployment Rate:

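The unemployment rate ranges from 7.6% to 16.2%, with a median of 11.1% and a mean of 11.57%. The interquartile range (9.4% to 13.9%) covers about half of the full range, so the values are spread fairly widely rather than clustered tightly around the median. No values fall beyond the 1.5 × IQR fences, so no outliers are present.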

Inflation Rate:

The inflation rate ranges from -0.8% to 3.7%, with a median of 1.4% and an interquartile range of 0.3% to 2.6%. The spread is wide relative to the range, but no outliers are present.

GDP:

The GDP values range from -4.06 to 3.51, with a median of 0.32 and a mean close to 0 (0.002). Because the mean sits below the median, the distribution is slightly skewed towards negative values rather than perfectly symmetric. No outliers are present.

General Observations:

Grades: The grades for previous qualifications, admissions, and curricular units (for both semesters) have narrow interquartile ranges relative to their full ranges, with outliers in each case. Most students cluster around the median, while a few represent exceptionally high or low performance. For the curricular unit grades, the means fall well below the medians because of very low grades, including zeros.

Economic Variables: The unemployment rate, inflation rate, and GDP show no outliers, and their boxes span a large share of their ranges. These are macro-level indicators rather than individual student attributes, so they vary within a bounded range instead of producing extreme values.
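Outlier observations like these can be read off numerically as well as from the plot. The sketch below wraps base R's `boxplot.stats`, which applies the same 1.5 × IQR rule as `boxplot()`. The helper name `outlier_summary` and the example vector are hypothetical and are not taken from the dataset.

```r
# Summarise the whisker limits and outlier count that boxplot() would draw
outlier_summary <- function(x) {
  stats <- boxplot.stats(x)   # same 1.5 * IQR rule as boxplot()
  c(lower_whisker = stats$stats[1],
    upper_whisker = stats$stats[5],
    n_outliers = length(stats$out))
}

# Hypothetical grades: most near 12, plus two zeros and one very high value
example_grades <- c(0, 0, 10, 11, 11.5, 12, 12, 12.5, 13, 13.5, 14, 19)
result <- outlier_summary(example_grades)
result
```

Assuming the objects defined earlier are available, `sapply(data_new[columns_to_plot], outlier_summary)` would give the same summary for each plotted column.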

(b) Plotting the dataset to find associations (correlations) among variables. Present and discuss your findings from each of the plots. You can put your findings below the plots. [5 marks]

# Load necessary libraries
library(ggplot2)

# Step 1: Convert relevant columns to numeric (if not done already)
data_new$Previous.qualification..grade. <- as.numeric(data_new$Previous.qualification..grade.)
data_new$Admission.grade <- as.numeric(data_new$Admission.grade)
data_new$Curricular.units.1st.sem..grade. <- as.numeric(data_new$Curricular.units.1st.sem..grade.)
data_new$Curricular.units.2nd.sem..grade. <- as.numeric(data_new$Curricular.units.2nd.sem..grade.)
data_new$Unemployment.rate <- as.numeric(data_new$Unemployment.rate)
data_new$Inflation.rate <- as.numeric(data_new$Inflation.rate)
data_new$GDP <- as.numeric(data_new$GDP)

# Step 2: Select continuous variables for analysis
selected_columns <- data_new[, c('Previous.qualification..grade.', 'Admission.grade', 
                                 'Curricular.units.1st.sem..grade.', 'Curricular.units.2nd.sem..grade.',
                                 'Unemployment.rate', 'Inflation.rate', 'GDP')]

# Step 3: Calculate the correlation matrix
correlation_matrix <- cor(selected_columns, use = "complete.obs")

# Display the correlation matrix
print(correlation_matrix)
##                                  Previous.qualification..grade. Admission.grade
## Previous.qualification..grade.                       1.00000000      0.58044420
## Admission.grade                                      0.58044420      1.00000000
## Curricular.units.1st.sem..grade.                     0.05943751      0.07386842
## Curricular.units.2nd.sem..grade.                     0.05323897      0.07440153
## Unemployment.rate                                    0.04522227      0.03875569
## Inflation.rate                                       0.01871038     -0.02162358
## GDP                                                 -0.05262040     -0.01951948
##                                  Curricular.units.1st.sem..grade.
## Previous.qualification..grade.                         0.05943751
## Admission.grade                                        0.07386842
## Curricular.units.1st.sem..grade.                       1.00000000
## Curricular.units.2nd.sem..grade.                       0.83716974
## Unemployment.rate                                      0.01482095
## Inflation.rate                                        -0.03390357
## GDP                                                    0.05480076
##                                  Curricular.units.2nd.sem..grade.
## Previous.qualification..grade.                        0.053238974
## Admission.grade                                       0.074401530
## Curricular.units.1st.sem..grade.                      0.837169741
## Curricular.units.2nd.sem..grade.                      1.000000000
## Unemployment.rate                                     0.001461858
## Inflation.rate                                       -0.038166042
## GDP                                                   0.071269496
##                                  Unemployment.rate Inflation.rate         GDP
## Previous.qualification..grade.         0.045222268     0.01871038 -0.05262040
## Admission.grade                        0.038755687    -0.02162358 -0.01951948
## Curricular.units.1st.sem..grade.       0.014820945    -0.03390357  0.05480076
## Curricular.units.2nd.sem..grade.       0.001461858    -0.03816604  0.07126950
## Unemployment.rate                      1.000000000    -0.02888466 -0.33517812
## Inflation.rate                        -0.028884663     1.00000000 -0.11229464
## GDP                                   -0.335178119    -0.11229464  1.00000000
# Step 4: Create scatterplots to visualize the relationships between variables

# Draw a scatterplot matrix for every pair of the selected variables
pairs(selected_columns, main = "Scatterplots of Continuous Variables")

# Step 5: Analyze the strongest correlations
# Look at the correlation matrix and scatterplots to determine which pairs of variables are strongly correlated.

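Previous Qualification Grade vs. Admission Grade: A moderate positive correlation (\(r \approx 0.58\)) indicates that students with higher prior qualification grades tend to have higher admission grades.

Curricular Units 1st Semester Grade vs. 2nd Semester Grade: This is the strongest correlation in the matrix (\(r \approx 0.84\)), so students who do well in the first semester tend to do well in the second.

Curricular Units (1st & 2nd Sem) vs. Admission Grade: The correlations are positive but very weak (\(r \approx 0.07\)), so admission grades say little about semester grades in this dataset.

Economic Variables (Unemployment Rate, Inflation, GDP): No meaningful correlations with academic performance (all \(|r| < 0.08\)). Among the economic variables themselves, unemployment rate and GDP show a weak-to-moderate negative correlation (\(r \approx -0.34\)).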
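To avoid reading a large matrix by eye, the variable pairs can be ranked by absolute correlation. The sketch below is illustrative: `rank_correlations` is a hypothetical helper, and `cor_subset` hard-codes the four grade variables from the correlation matrix printed above, with shortened names.

```r
# List each pair of variables once, sorted by absolute correlation
rank_correlations <- function(cor_mat) {
  idx <- which(upper.tri(cor_mat), arr.ind = TRUE)  # upper triangle only
  pairs_df <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r = cor_mat[idx],
    row.names = NULL
  )
  pairs_df[order(-abs(pairs_df$r)), ]
}

# Grade correlations copied from the printed matrix (names shortened)
vars <- c("prev_grade", "admission", "sem1_grade", "sem2_grade")
cor_subset <- matrix(c(
  1.00000000, 0.58044420, 0.05943751, 0.05323897,
  0.58044420, 1.00000000, 0.07386842, 0.07440153,
  0.05943751, 0.07386842, 1.00000000, 0.83716974,
  0.05323897, 0.07440153, 0.83716974, 1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

ranked <- rank_correlations(cor_subset)
ranked
```

Assuming the objects defined earlier are available, `rank_correlations(correlation_matrix)` would rank all 21 pairs from the full matrix.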