FEI FEI YOU S3873510 YUAN HONG S3501537
25/10/2020
Student performance is a perennially interesting topic because it touches on gender, race, parental level of education, and other factors.
“Females achieve higher scores than males at school”, “Asians are good at math”, “Parents with more education have higher-scoring children”: are these just common beliefs, or is there statistical evidence behind them? This report examines the relationship between parents’ education and students’ scores using a chi-squared test. We also use linear regression to find out whether students who are good at writing are also good at reading.
This small project addresses two main questions: 1. Does parental level of education affect students’ scores? 2. Is there a relationship between reading or math scores and the other subjects?
The aim is to examine students’ performance by testing the relationship between pairs of variables with a chi-squared test and linear regression.
The dataset "Students Performance" comes from https://www.kaggle.com/spscientist/students-performance-in-exams
This data set includes scores from three exams and a variety of personal, social, and economic factors that have interaction effects upon them.
The original dataset has 1000 observations and 8 variables: “gender”, “race/ethnicity”, “parental level of education”, “lunch”, “test preparation course”, “math score”, “reading score”, “writing score”.
In this project, only “gender”, “parental level of education”, “math score”, “reading score” and “writing score” are chosen for analysis. The numeric variables should lie in the range 0-100. The “gender” column should be a factor with levels “Female” and “Male”; “parental level of education” should be a factor with levels “master’s degree”, “bachelor’s degree”, “associate’s degree”, “some college”, “high school”, “some high school”.
In addition, we create a new attribute by binning the mean of the math, reading and writing scores into three categories: “under_60”, “60_to_80”, “over_80”.
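The preparation steps described above could be sketched as follows. This is a minimal illustration on a three-row, made-up data frame rather than the Kaggle file; the names `students`, `edu_levels`, `score_cols` and `score_band` are hypothetical, the ordering of the education levels is an assumption, and the half-open bands [0, 60), [60, 80), [80, 100] are only one possible reading of the category labels.

```r
# Hypothetical mini data frame standing in for the Kaggle CSV (values are made up).
students <- data.frame(
  gender = c("female", "male", "female"),
  parental_education = c("some college", "master's degree", "high school"),
  math = c(72, 90, 47),
  reading = c(72, 95, 57),
  writing = c(74, 93, 44)
)

# Recode gender as a factor with the labels used in the report.
students$gender <- factor(students$gender,
                          levels = c("female", "male"),
                          labels = c("Female", "Male"))

# Ordered factor for parental education (this ordering is an assumption).
edu_levels <- c("some high school", "high school", "some college",
                "associate's degree", "bachelor's degree", "master's degree")
students$parental_education <- factor(students$parental_education,
                                      levels = edu_levels, ordered = TRUE)

# Check that every score lies in the expected 0-100 range.
score_cols <- c("math", "reading", "writing")
stopifnot(all(students[score_cols] >= 0 & students[score_cols] <= 100))

# Bin the mean score into [0, 60), [60, 80), [80, 100].
students$mean_score <- rowMeans(students[score_cols])
students$score_band <- cut(students$mean_score,
                           breaks = c(0, 60, 80, 100),
                           labels = c("under_60", "60_to_80", "over_80"),
                           right = FALSE, include.lowest = TRUE)

print(students[, c("gender", "parental_education", "mean_score", "score_band")])
```

With `right = FALSE`, a mean of exactly 60 falls in "60_to_80" and a mean of exactly 80 falls in "over_80", so non-integer means such as 59.5 are not pushed into the next band.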
Here we use the reading score and writing score attributes to show their linear relationship.
library(readr)  # provides read_csv()
StudentsPerformance <- read_csv("C:/Users/wei_s/Desktop/R file/StudentsPerformance.csv")
x <- StudentsPerformance$`reading score`  # predictor
y <- StudentsPerformance$`writing score`  # response
plot(x, y)
Simple linear regression can be used to describe the relationship between two quantitative variables, in our case the reading and writing scores. The scatter plot above shows a strong positive association, so we proceed with fitting the linear regression model.
model <- lm(y ~ 1 + x)  # same formula as shown in the Call line of the output
summary(model)
##
## Call:
## lm(formula = y ~ 1 + x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9573 -2.9573 0.0363 3.1026 15.0557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.667554 0.693792 -0.962 0.336
## x 0.993531 0.009814 101.233 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.529 on 998 degrees of freedom
## Multiple R-squared: 0.9113, Adjusted R-squared: 0.9112
## F-statistic: 1.025e+04 on 1 and 998 DF, p-value: < 2.2e-16
The plot shows a positive relationship, but we still need to test whether the linear relationship is statistically significant. Here H0 is that the slope is zero. The p-value for the slope is below 0.001 and R-squared is 0.9113, so we reject H0: there is a statistically significant positive linear relationship between reading score and writing score, with an estimated slope of about 0.99 writing points per reading point.
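As a quick consistency check, the headline figures in the summary can be recomputed from the printed slope estimate and standard error alone. The sketch below assumes only the values shown in the output above (the variable names are ours). It relies on two identities that hold for simple linear regression: the F statistic equals the squared t statistic of the slope, and R-squared equals t² / (t² + residual df).

```r
# Values copied from the lm() summary printed above.
slope    <- 0.993531
slope_se <- 0.009814
resid_df <- 998

# t statistic of the slope, then F and R-squared derived from it.
t_value   <- slope / slope_se
f_stat    <- t_value^2
r_squared <- f_stat / (f_stat + resid_df)

# 95% confidence interval for the slope, based on the t distribution.
ci <- slope + c(-1, 1) * qt(0.975, df = resid_df) * slope_se

print(round(c(t = t_value, F = f_stat, R2 = r_squared), 4))
print(round(ci, 3))
```

The recomputed values agree with the printed t value, F-statistic and Multiple R-squared up to rounding of the printed inputs, and the interval for the slope runs from roughly 0.97 to 1.01.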
Create the table for the chi-squared test.
library(dplyr)  # select(), group_by(), summarise(), %>%
library(tidyr)  # spread()
# Mean of the three score columns (math, reading, writing)
df <- StudentsPerformance %>% dplyr::select(6:8)
StudentsPerformance$mean <- rowMeans(df)
# cut() uses right-closed intervals here: (-1,59], (59,79], (79,100]
StudentsPerformance$mean <- cut(StudentsPerformance$mean,
  breaks = c(-1, 59, 79, 100), labels = c("under_60", "60_to_80", "over_80"))
# Count students per education level and score category
df_grouped <- StudentsPerformance %>%
  dplyr::select(`parental level of education`, mean) %>%
  group_by(`parental level of education`, mean) %>%
  summarise(count = n())
df_final <- df_grouped %>% spread(key = `mean`, value = count)
# Enter the counts as a 6 x 3 contingency table
parental_edu <- matrix(c(9,28,22,20,63,35,58,109,55,51,127,48,73,101,22,63,84,32), ncol = 3,
  byrow = TRUE)
colnames(parental_edu) <- c("under_60", "60_to_80", "over_80")
rownames(parental_edu) <- c("master's degree", "bachelor's degree", "associate's degree",
  "some college", "high school",
  "some high school")
parental_edu <- as.table(parental_edu)
Chi-squared test for parental education background
H0: There is no association between parents’ education and students’ results.
HA: There is an association between parents’ education and students’ results.
| | under_60 | 60_to_80 | over_80 |
|---|---|---|---|
| master’s degree | 9 | 28 | 22 |
| bachelor’s degree | 20 | 63 | 35 |
| associate’s degree | 58 | 109 | 55 |
| some college | 51 | 127 | 48 |
| high school | 73 | 101 | 22 |
| some high school | 63 | 84 | 32 |
##
## Pearson's Chi-squared test
##
## data: parental_edu
## X-squared = 45.477, df = 10, p-value = 1.784e-06
The p-value is 1.784e-06, which we report as p < .001. We reject H0 when χ² > χ²crit. With df = (6 − 1)(3 − 1) = 10 and α = 0.05, χ²crit = 18.31; as 45.477 > 18.31, H0 is rejected. Equivalently, the p-value is below the standard significance level of 0.05. There is therefore statistically significant evidence that parental education level is associated with students’ results. The observed counts, the expected counts under H0, and the critical value are shown below.
## under_60 60_to_80 over_80
## master's degree 9 28 22
## bachelor's degree 20 63 35
## associate's degree 58 109 55
## some college 51 127 48
## high school 73 101 22
## some high school 63 84 32
## under_60 60_to_80 over_80
## master's degree 16.166 30.208 12.626
## bachelor's degree 32.332 60.416 25.252
## associate's degree 60.828 113.664 47.508
## some college 61.924 115.712 48.364
## high school 53.704 100.352 41.944
## some high school 49.046 91.648 38.306
## [1] 18.30704
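The test and the supporting output above can be reproduced directly from the contingency table. The sketch below re-enters the counts from the table and runs `chisq.test()`. It also prints Pearson residuals, which the report itself does not show; that last step is our addition, included to indicate which cells drive the association.

```r
# Observed counts, copied from the contingency table in the report.
parental_edu <- matrix(
  c( 9,  28, 22,
    20,  63, 35,
    58, 109, 55,
    51, 127, 48,
    73, 101, 22,
    63,  84, 32),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("master's degree", "bachelor's degree", "associate's degree",
      "some college", "high school", "some high school"),
    c("under_60", "60_to_80", "over_80")
  )
)

# Pearson chi-squared test of independence.
test <- chisq.test(parental_edu)
print(test)

# Expected counts under H0 (row total * column total / grand total).
print(test$expected)

# Critical value at alpha = 0.05 with (rows - 1) * (cols - 1) degrees of freedom.
deg_free <- (nrow(parental_edu) - 1) * (ncol(parental_edu) - 1)
crit <- qchisq(0.95, df = deg_free)
print(crit)

# Pearson residuals: (observed - expected) / sqrt(expected).
print(round(test$residuals, 2))
```

Positive residuals in the over_80 column for the master’s and bachelor’s rows, and a negative one for the high school row, are consistent with children of more highly educated parents appearing in the top band more often than independence would predict.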
In this project we found that scores in different subjects are related: reading and writing scores are strongly correlated. Our linear regression indicates that students with good reading scores are likely to have good writing scores; the relationship with math score was not tested here.
A student’s results are associated with their parents’ education: the chi-squared test shows a significant association, and comparing observed with expected counts, children of parents with higher education fall in the over_80 category more often than expected. However, this sample has limitations: it contains only 1000 observations and may not represent the wider population. Future studies therefore need more diverse data from different schools.
Also, the data set does not give actual race/ethnicity information and only assigns students to labelled groups such as A, B, C, D, so analysing these groups alone would not be very meaningful. The data set therefore needs improvement in this respect.