MATH1324_2050_Assignment_2 Investigating significant insights into student performance through linear regression and the chi-squared test

FEI FEI YOU S3873510 YUAN HONG S3501537

25/10/2020

Introduction

Student performance is always an interesting topic, as it is often linked to gender, race, parental level of education and other factors.

“Females achieve higher scores than males at school”, “Asians are good at math”, “Parents with a higher education background have higher-scoring kids”… Are these just common beliefs, or is there statistical significance behind these statements? This report uses a chi-squared test to examine the relationship between parents’ education and students’ scores, and linear regression to find out whether students who are good at writing are also good at reading.

Problem Statement

There are two main questions we would like to answer in this small project:
1. Does parental level of education affect students’ scores?
2. Is there a relationship between reading or math scores and the other subjects?

This project examines students’ performance by using a chi-squared test and linear regression to find the relationship between two variables.

Data

The dataset “Students Performance” comes from https://www.kaggle.com/spscientist/students-performance-in-exams

This data set includes scores from three exams and a variety of personal, social, and economic factors that have interaction effects upon them.

The original dataset has 1000 observations and 8 variables: “gender”, “race/ethnicity”, “parental level of education”, “lunch”, “test preparation course”, “math score”, “reading score” and “writing score”.

Data Cont.

In this project, only “gender”, “parental level of education”, “math score”, “reading score” and “writing score” are chosen for analysis. The numeric variables should lie on a 0-100 scale. The “gender” column should be factored as “Female” and “Male”; “parental level of education” should be factored as “master’s degree”, “bachelor’s degree”, “associate’s degree”, “some college”, “high school” and “some high school”.
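The factoring step described above could look roughly like the sketch below. It runs on a small made-up data frame (`toy` and `edu_levels` are hypothetical names, not part of the report's code), and the low-to-high ordering of the education levels is our assumption rather than something recorded in the dataset.

```r
# Hypothetical toy rows standing in for the real dataset
toy <- data.frame(
  gender = c("female", "male", "female"),
  parental_edu = c("high school", "master's degree", "some college")
)

# Assumed ordering of the six education levels, lowest to highest
edu_levels <- c("some high school", "high school", "some college",
                "associate's degree", "bachelor's degree", "master's degree")

# Relabel gender and store education as an ordered factor
toy$gender <- factor(toy$gender, levels = c("female", "male"),
                     labels = c("Female", "Male"))
toy$parental_edu <- factor(toy$parental_edu, levels = edu_levels, ordered = TRUE)

str(toy)
```

Using an ordered factor is a design choice: it lets levels be compared with `<` and `>`, but the chi-squared test used later treats the levels as unordered categories either way.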

In addition, we will create a new attribute by binning the mean of the math, reading and writing scores into three categories: “under_60”, “60_to_80” and “over_80”.
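As a quick check on how the bins behave, the sketch below applies the same breaks used later in this report to a few made-up mean scores (`toy_means` and `toy_bins` are hypothetical names). By default `cut()` builds right-closed intervals, so with these breaks a mean of exactly 59 falls in “under_60”, while a mean of 59.33 already falls in “60_to_80”.

```r
# Made-up mean scores chosen to sit on or near the bin edges
toy_means <- c(59, 59.33, 79, 79.67, 100)

# Same breaks and labels as the report's cut() call;
# the intervals are (-1,59], (59,79] and (79,100]
toy_bins <- cut(toy_means,
                breaks = c(-1, 59, 79, 100),
                labels = c("under_60", "60_to_80", "over_80"))

as.character(toy_bins)
# "under_60" "60_to_80" "60_to_80" "over_80" "over_80"
```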

Linear regression and Visualisation

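Here we use two attributes, reading score and writing score, to show their linear relationship.

library(readr)  # provides read_csv()

# Load the dataset (local path on the author's machine)
StudentsPerformance <- read_csv("C:/Users/wei_s/Desktop/R file/StudentsPerformance.csv")
x <- StudentsPerformance$`reading score`  # predictor
y <- StudentsPerformance$`writing score`  # response
plot(x, y)  # scatter plot of writing score against reading score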

Linear regression

Simple linear regression can be used to describe the relationship between two quantitative variables, in our case reading score and writing score. From the previous plot, we can see that the two scores are strongly positively correlated, so we can proceed with fitting the linear regression model.

lm.sov<-lm(y~1 +x)
summary(lm.sov)
## 
## Call:
## lm(formula = y ~ 1 + x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9573  -2.9573   0.0363   3.1026  15.0557 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.667554   0.693792  -0.962    0.336    
## x            0.993531   0.009814 101.233   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.529 on 998 degrees of freedom
## Multiple R-squared:  0.9113, Adjusted R-squared:  0.9112 
## F-statistic: 1.025e+04 on 1 and 998 DF,  p-value: < 2.2e-16

Linear regression.

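The plot shows a positive relationship, but we still need to test whether the linear relationship is statistically significant. Here H0 is that the slope is zero, meaning there is no linear relationship between reading score and writing score. The p-value for the slope is far below 0.01 (p < 2e-16) and R-squared is 0.9113, so about 91% of the variation in writing score is explained by reading score. We therefore reject H0: there is a statistically significant positive linear relationship between reading score and writing score.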
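The quantities used in this conclusion can be pulled out of the fitted model directly. The sketch below does this on simulated scores, not the real data; every name starting with `toy_` is hypothetical. It also illustrates that in simple linear regression R-squared equals the squared correlation between the two variables.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-ins for reading and writing scores (not the real data)
toy_reading <- round(runif(200, min = 30, max = 100))
toy_writing <- toy_reading + rnorm(200, mean = 0, sd = 4.5)

toy_fit <- lm(toy_writing ~ toy_reading)
toy_sum <- summary(toy_fit)

# p-value of the t-test for H0: slope = 0
slope_p <- toy_sum$coefficients["toy_reading", "Pr(>|t|)"]

# R-squared, and the correlation whose square it should match
r_squared <- toy_sum$r.squared
r <- cor(toy_reading, toy_writing)

c(slope_p = slope_p, r_squared = r_squared, r_squared_from_cor = r^2)
```

Applied to the report's own output, the same relationship gives a correlation of about sqrt(0.9113) ≈ 0.95 between reading score and writing score.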

Chi-squared test

Create table for Chi-squared test.

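library(dplyr)  # select(), group_by(), summarise(), %>%
library(tidyr)  # spread()

# Mean of the three score columns (columns 6 to 8: math, reading, writing)
df <- StudentsPerformance %>% dplyr::select(6:8)
StudentsPerformance$mean <- rowMeans(df)

# Bin the mean score; cut() intervals are right-closed: (-1,59], (59,79], (79,100]
StudentsPerformance$mean <- cut(StudentsPerformance$mean, 
                                breaks = c(-1,59,79,100), labels = c("under_60","60_to_80", "over_80"))

# Count students per education level and score band
df_grouped <- StudentsPerformance %>% 
  dplyr::select(`parental level of education`, mean) %>%
  group_by(`parental level of education`, mean) %>% 
  summarise(count = n())

# Reshape to one row per education level and one column per score band
df_final <- df_grouped %>% spread(key = `mean`, value = count)

# Observed counts entered by hand as a 6 x 3 contingency table
parental_edu <- matrix(c(9,28,22,20,63,35,58,109,55,51,127,48,73,101,22,63,84,32),ncol=3,
                       byrow=TRUE)
colnames(parental_edu) <- c("under_60","60_to_80", "over_80")
rownames(parental_edu) <- c("master's degree", "bachelor's degree","associate's degree",
                            "some college","high school",       
                            "some high school" )
parental_edu <- as.table(parental_edu)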

Hypothesis Testing (Chi-squared test)

Chi-squared test for parental education background
H0: There is no association between the parents’ education and students’ results (the two variables are independent).
HA: There is an association between the parents’ education and students’ results.

parental_edu <- as.table(parental_edu)
chisq <- chisq.test(parental_edu)
knitr::kable(parental_edu)
                     under_60  60_to_80  over_80
master’s degree             9        28       22
bachelor’s degree          20        63       35
associate’s degree         58       109       55
some college               51       127       48
high school                73       101       22
some high school           63        84       32
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  parental_edu
## X-squared = 45.477, df = 10, p-value = 1.784e-06

Hypothesis Testing (Chi-squared test)

The p-value is 1.784e-06, which we report as p < .001. We reject H0 when χ2 > χ2crit. With df = 10 and a significance level of 0.05, χ2crit = 18.31; as 45.477 > 18.31, H0 is rejected. Equivalently, the p-value is less than the standard significance level of 0.05. There is therefore a statistically significant association between parental education level and students’ results.
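To make the decision rule concrete, the test statistic can be rebuilt by hand from the observed counts. The sketch below is only an illustration of the standard Pearson formula, with hypothetical variable names (`obs`, `expected_counts`, `stat`, `dof`, `crit`, `p_value`); it should agree with the `chisq.test()` output in this report.

```r
# Observed counts from the report (rows: education level, columns: score band)
obs <- matrix(c(9, 28, 22,
                20, 63, 35,
                58, 109, 55,
                51, 127, 48,
                73, 101, 22,
                63, 84, 32),
              ncol = 3, byrow = TRUE)

n <- sum(obs)

# Expected count under independence: row total * column total / n
expected_counts <- outer(rowSums(obs), colSums(obs)) / n

# Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells
stat <- sum((obs - expected_counts)^2 / expected_counts)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Critical value at the 0.05 level and the upper-tail p-value
crit <- qchisq(0.95, df = dof)
p_value <- pchisq(stat, df = dof, lower.tail = FALSE)

c(stat = stat, dof = dof, crit = crit, p_value = p_value)
```

The critical value depends on the degrees of freedom, which is why the 6 x 3 table here needs the df = 10 value rather than one taken from a smaller table.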

chisq$observed
##                    under_60 60_to_80 over_80
## master's degree           9       28      22
## bachelor's degree        20       63      35
## associate's degree       58      109      55
## some college             51      127      48
## high school              73      101      22
## some high school         63       84      32
chisq$expected
##                    under_60 60_to_80 over_80
## master's degree      16.166   30.208  12.626
## bachelor's degree    32.332   60.416  25.252
## associate's degree   60.828  113.664  47.508
## some college         61.924  115.712  48.364
## high school          53.704  100.352  41.944
## some high school     49.046   91.648  38.306
qchisq(p = .95,df = 10)
## [1] 18.30704
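One way to see which cells drive the result is to look at the Pearson residuals, (observed - expected) / sqrt(expected), which `chisq.test()` returns in its `residuals` component. The sketch below rebuilds the table under a hypothetical name (`obs_tab`) so that it runs on its own; it is an optional follow-up, not part of the original analysis.

```r
# Observed counts from the report, with row and column labels
obs_tab <- matrix(c(9, 28, 22,
                    20, 63, 35,
                    58, 109, 55,
                    51, 127, 48,
                    73, 101, 22,
                    63, 84, 32),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(
                    c("master's degree", "bachelor's degree", "associate's degree",
                      "some college", "high school", "some high school"),
                    c("under_60", "60_to_80", "over_80")))

res_test <- chisq.test(obs_tab)

# Pearson residuals: positive means more students than expected in that cell
pearson_res <- res_test$residuals
round(pearson_res, 2)

# The squared residuals add up to the chi-squared statistic
sum(pearson_res^2)
```

Cells with large positive or negative residuals contribute most to the statistic, which helps when describing the direction of the association rather than only its significance.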

Discussion

In this project, we found that reading and writing scores are strongly correlated. Our linear regression shows that a student with a good reading score is likely to have a good writing score as well. Math score was not included in the model, so we cannot draw the same conclusion for it.

A student’s score is associated with their parents’ education background: our chi-squared test rejects independence, and the observed counts in the over_80 category are above the expected counts for master’s and bachelor’s degrees and below them for high school and some high school. However, this sample has limitations: it contains only 1000 observations and may not represent the population. Future studies therefore need more diverse data from different schools.

Also, the data set does not record actual race or ethnicity; it only labels anonymous groups (A, B, C, D and so on), so analysing these groups would not be very meaningful. This data set therefore needs improvement.

References