Problem

Independent variables: SSOverall, STOverall, and SCOverall
Dependent variable: GWA (1st sem SY: 2021-2022)

Q1. How many observations are at least 21 years old?
Q2. How many observations have a GWA from 1.25 to 1.75?
Q3. Provide the results of checking the assumptions of multiple regression analysis.
Q4. Which of the independent variables significantly predict the dependent variable?

library(readxl)
ken <- read_excel("D:/Regression Analysis/withage.xlsx")
ken

## # A tibble: 113 × 33
##      Age GWA  (1…¹   SS1   SS2   SS3   SS4   SS5   SS6   SS7   SS8 SSOve…²   ST1
##    <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
##  1    21      1.54     5     4     4     5     5     5     5     5    4.75     5
##  2    20      1.27     5     5     5     4     5     5     1     1    3.88     3
##  3    15      1.4      4     4     4     4     5     5     5     5    4.5      4
##  4    15      1.19     5     5     5     5     5     5     5     5    5        4
##  5    18      1.47     1     5     5     4     5     4     3     3    3.75     3
##  6    17      1.85     3     3     3     3     3     3     3     3    3        3
##  7    20      1.4      2     4     4     4     5     2     4     4    3.62     5
##  8    25      1.52     4     3     5     3     3     3     4     3    3.5      5
##  9    25      1.2      3     3     3     3     3     3     3     3    3        2
## 10    26      2        3     5     4     3     5     5     5     5    4.38     5
## # … with 103 more rows, 21 more variables: ST2 <dbl>, ST3 <dbl>, ST4 <dbl>,
## #   ST5 <dbl>, ST6 <dbl>, ST7 <dbl>, ST8 <dbl>, STOverall <dbl>, SC1 <dbl>,
## #   SC2 <dbl>, SC3 <dbl>, SC4 <dbl>, SC5 <dbl>, SC6 <dbl>, SC7 <dbl>,
## #   SC8 <dbl>, SCOverall <dbl>, Q1 <chr>, Q2 <chr>, Q3 <chr>, Q4 <chr>, and
## #   abbreviated variable names ¹`GWA  (1st sem SY: 2021-2022)`, ²SSOverall

Q1. How many observations are at least 21 years old?

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

ken%>%
  mutate(Agecode=ifelse(Age>=21, "At least 21 years old", "Less than 21 years old"))%>%
  group_by(Agecode)%>%
  summarise(count=n())%>%
  mutate(Percentage =round((count/sum(count)*100),2))

## # A tibble: 2 × 3
##   Agecode                count Percentage
##   <chr>                  <int>      <dbl>
## 1 At least 21 years old     72       63.7
## 2 Less than 21 years old    41       36.3

Based on the table above, 72 observations (63.7%) are at least 21 years old.
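The same count can be cross-checked without building a summary table. The following is a minimal sketch that uses only the ten Age values printed in the preview of ken above as a stand-in for the full 113-row data, so its result differs from 72.

```r
# Toy stand-in: the first ten Age values shown in the printed tibble.
age <- c(21, 20, 15, 15, 18, 17, 20, 25, 25, 26)

# A logical comparison yields TRUE/FALSE; sum() counts the TRUEs
# and mean() gives their proportion.
n_at_least_21 <- sum(age >= 21)
pct_at_least_21 <- round(mean(age >= 21) * 100, 2)

n_at_least_21    # 4 of these ten rows
pct_at_least_21  # 40
```

On the full data, sum(ken$Age >= 21) should agree with the count column of the table.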

Q2. How many observations have a GWA from 1.25 to 1.75?

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.5
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ readr   2.1.3     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggpubr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(dplyr)
ken%>%
  mutate(GWAcode=ifelse(`GWA  (1st sem SY: 2021-2022)`>=1.25 & `GWA  (1st sem SY: 2021-2022)`<=1.75, "GWA is in the interval [1.25, 1.75]", "Not in the given interval of GWA"))%>%
  group_by(GWAcode)%>%
  summarise(count=n())%>%
  mutate(Percentage =round((count/sum(count)*100),2))

## # A tibble: 2 × 3
##   GWAcode                             count Percentage
##   <chr>                               <int>      <dbl>
## 1 GWA is in the interval [1.25, 1.75]    92       81.4
## 2 Not in the given interval of GWA       21       18.6

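As shown in the table, 92 observations (81.4%) have a GWA in the closed interval [1.25, 1.75]. The code treats both endpoints as inclusive; if "above 1.25" is meant strictly, the first comparison should use > instead of >=.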
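Whether the endpoints are included can change the count. The sketch below uses hypothetical GWA values (not taken from ken) chosen to sit on and around the endpoints, to show how the two readings of the question differ.

```r
# Hypothetical GWA values, picked to land on and around 1.25 and 1.75.
gwa <- c(1.19, 1.25, 1.27, 1.54, 1.75, 1.85)

# Closed interval [1.25, 1.75]: both endpoints count.
closed <- gwa >= 1.25 & gwa <= 1.75

# Half-open interval (1.25, 1.75]: a GWA of exactly 1.25 is excluded.
half_open <- gwa > 1.25 & gwa <= 1.75

sum(closed)     # 4
sum(half_open)  # 3
```

If no observation in the real data has a GWA of exactly 1.25, both versions return the same count.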

Q3. Provide the results of checking the assumptions of multiple regression analysis.

head(ken)

## # A tibble: 6 × 33
##     Age GWA  (1s…¹   SS1   SS2   SS3   SS4   SS5   SS6   SS7   SS8 SSOve…²   ST1
##   <dbl>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1    21       1.54     5     4     4     5     5     5     5     5    4.75     5
## 2    20       1.27     5     5     5     4     5     5     1     1    3.88     3
## 3    15       1.4      4     4     4     4     5     5     5     5    4.5      4
## 4    15       1.19     5     5     5     5     5     5     5     5    5        4
## 5    18       1.47     1     5     5     4     5     4     3     3    3.75     3
## 6    17       1.85     3     3     3     3     3     3     3     3    3        3
## # … with 21 more variables: ST2 <dbl>, ST3 <dbl>, ST4 <dbl>, ST5 <dbl>,
## #   ST6 <dbl>, ST7 <dbl>, ST8 <dbl>, STOverall <dbl>, SC1 <dbl>, SC2 <dbl>,
## #   SC3 <dbl>, SC4 <dbl>, SC5 <dbl>, SC6 <dbl>, SC7 <dbl>, SC8 <dbl>,
## #   SCOverall <dbl>, Q1 <chr>, Q2 <chr>, Q3 <chr>, Q4 <chr>, and abbreviated
## #   variable names ¹`GWA  (1st sem SY: 2021-2022)`, ²SSOverall

multiple <- lm( `GWA  (1st sem SY: 2021-2022)`~ SSOverall +  STOverall + SCOverall, data = ken)
multiple

## 
## Call:
## lm(formula = `GWA  (1st sem SY: 2021-2022)` ~ SSOverall + STOverall + 
##     SCOverall, data = ken)
## 
## Coefficients:
## (Intercept)    SSOverall    STOverall    SCOverall  
##    1.996962    -0.047674    -0.067324    -0.005764

library(performance)
check_model(multiple)

check_model() produces a panel of diagnostic plots, each with a subtitle stating what pattern to expect when the corresponding assumption holds. On visual inspection, every plot matched the expectation given in its subtitle, so the conditions for applying multiple regression analysis are considered met.
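Visual checks can be complemented with numeric ones. The following is a rough sketch in base R, run on simulated data because the Excel file is not available here; the data frame d and the names x1, x2, x3, and y are hypothetical stand-ins for ken and its columns, and the results say nothing about the real data.

```r
set.seed(1)
n <- 113

# Simulated stand-in for ken (hypothetical columns x1, x2, x3, y).
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 - 0.05 * d$x1 - 0.07 * d$x2 + rnorm(n, sd = 0.2)

fit <- lm(y ~ x1 + x2 + x3, data = d)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal).
sw <- shapiro.test(res)

# Multicollinearity: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictors.
predictors <- c("x1", "x2", "x3")
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = d))$r.squared
  1 / (1 - r2)
})

# Homoscedasticity: studentized Breusch-Pagan-style check. Regress the
# squared residuals on the predictors; n * R^2 is compared with a
# chi-square distribution with one df per predictor.
aux <- lm(I(res^2) ~ x1 + x2 + x3, data = d)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = length(predictors), lower.tail = FALSE)

sw$p.value  # large p-value: no evidence against normality
vif         # values near 1: little collinearity
bp_p        # large p-value: no evidence of heteroscedasticity
```

Applied to the real model, fit would be replaced by multiple and the predictor names by SSOverall, STOverall, and SCOverall.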

Q4. Which of the independent variables significantly predict the dependent variable?

summary(multiple)

## 
## Call:
## lm(formula = `GWA  (1st sem SY: 2021-2022)` ~ SSOverall + STOverall + 
##     SCOverall, data = ken)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45511 -0.11195 -0.02104  0.10446  0.57345 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.996962   0.173046  11.540   <2e-16 ***
## SSOverall   -0.047674   0.038355  -1.243    0.217    
## STOverall   -0.067324   0.047959  -1.404    0.163    
## SCOverall   -0.005764   0.048030  -0.120    0.905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1993 on 109 degrees of freedom
## Multiple R-squared:  0.07516,    Adjusted R-squared:  0.0497 
## F-statistic: 2.953 on 3 and 109 DF,  p-value: 0.03583

The Coefficients table gives the estimate of each parameter (column Estimate) together with the p-value for the test that the parameter equals zero (column Pr(>|t|)).

For each coefficient β_j, the null and alternative hypotheses are:

H_0: β_j = 0
H_a: β_j ≠ 0.

Testing β_j = 0 is equivalent to asking whether the dependent variable is associated with the independent variable in question, all other things being equal, that is, with the other independent variables held constant.
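The t value and p-value columns can be reproduced from the Estimate and Std. Error columns. The sketch below copies the rounded numbers from the printed summary, so small rounding differences from the exact output are possible.

```r
# Estimates and standard errors copied from the summary(multiple) output.
est <- c(SSOverall = -0.047674, STOverall = -0.067324, SCOverall = -0.005764)
se  <- c(SSOverall =  0.038355, STOverall =  0.047959, SCOverall =  0.048030)
df_resid <- 109  # residual degrees of freedom reported in the output

# t statistic for H_0: beta_j = 0 is the estimate divided by its standard error.
t_val <- est / se

# Two-sided p-value from the t distribution with df_resid degrees of freedom.
p_val <- 2 * pt(-abs(t_val), df = df_resid)

round(t_val, 3)  # matches the "t value" column
round(p_val, 3)  # close to the "Pr(>|t|)" column
p_val < 0.05     # which coefficients are significant at the 5% level
```

A coefficient is declared significant at the 5% level when its p-value is below 0.05.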

Answer:

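At the 5% significance level, none of the independent variables significantly predicts GWA when the others are held constant: the p-values are 0.217 for SSOverall, 0.163 for STOverall, and 0.905 for SCOverall, all above 0.05. The overall F-test is significant (F = 2.953 on 3 and 109 DF, p-value = 0.03583), so the model as a whole fits better than an intercept-only model, although it explains only about 7.5% of the variation in GWA (Multiple R-squared = 0.07516). The coefficients shown are unstandardized, that is, expressed in the original units of each variable. STOverall has the estimate with the largest absolute value (-0.067324), but this alone does not establish that it is the most important independent variable.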
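The coefficients printed by summary() for an lm fit are in the original units of the variables. To compare predictors on a common scale, standardized coefficients can be computed separately. The following sketch uses simulated data (the data frame d and the names x1, x2, x3, and y are hypothetical) and shows two equivalent ways to obtain them.

```r
set.seed(2)
n <- 113

# Simulated stand-in for ken (hypothetical columns x1, x2, x3, y).
d <- data.frame(x1 = rnorm(n, 4, 0.6),
                x2 = rnorm(n, 4, 0.5),
                x3 = rnorm(n, 4, 0.5))
d$y <- 2 - 0.05 * d$x1 - 0.07 * d$x2 + rnorm(n, sd = 0.2)

fit <- lm(y ~ x1 + x2 + x3, data = d)

# Way 1: rescale each unstandardized slope by sd(x_j) / sd(y).
b <- coef(fit)[-1]  # drop the intercept
beta_std <- b * sapply(d[names(b)], sd) / sd(d$y)

# Way 2: refit the model after z-scoring every variable.
fit_z <- lm(y ~ x1 + x2 + x3, data = as.data.frame(scale(d)))

beta_std
coef(fit_z)[-1]  # same values as beta_std
```

A larger absolute standardized coefficient indicates a larger change in the outcome, in standard deviations, per standard deviation of the predictor; it does not by itself imply statistical significance.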

Analysis using Multiple Regression

Kyle Kenneth Ruaya

2022-12-04