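This analysis explores student exam performance data (1,000 students, 8 variables) using R.
We load the data, summarize it, plot the distribution of reading scores and of
math scores by gender, and fit a linear regression of math score on reading and
writing scores.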
library(readr)  # provides read_csv()
data <- read_csv("C:\\Users\\kalyani kumar\\OneDrive\\Desktop\\StudentsPerformance.csv")
## Rows: 1000 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, race/ethnicity, parental level of education, lunch, test pr...
## dbl (3): math score, reading score, writing score
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
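The message above suggests specifying the column types explicitly. A minimal sketch of that, assuming the readr package is installed: the temporary file and the object name `scores` are hypothetical stand-ins so the example runs without the original CSV, and the two data rows are copied from the output shown in this report.

```r
library(readr)

# Hypothetical stand-in for StudentsPerformance.csv: a header plus two rows
# taken from the printed output, written to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score",
  "female,group B,bachelor's degree,standard,none,72,72,74",
  "female,group C,some college,standard,completed,69,90,88"
), tmp)

# Declaring col_types up front documents the expected schema and
# suppresses the column specification message.
scores <- read_csv(
  tmp,
  col_types = cols(
    .default = col_character(),
    `math score` = col_double(),
    `reading score` = col_double(),
    `writing score` = col_double()
  )
)
scores
```

If the guessed types are acceptable and only the message is unwanted, passing `show_col_types = FALSE` to `read_csv()` is the lighter option.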
head(data)
## # A tibble: 6 × 8
##   gender `race/ethnicity` parental level of educa…¹ lunch test preparation cou…²
##   <chr>  <chr>            <chr>                     <chr> <chr>
## 1 female group B          bachelor's degree         stan… none
## 2 female group C          some college              stan… completed
## 3 female group B          master's degree           stan… none
## 4 male   group A          associate's degree        free… none
## 5 male   group C          some college              stan… none
## 6 female group B          associate's degree        stan… none
## # ℹ abbreviated names: ¹`parental level of education`,
## # ²`test preparation course`
## # ℹ 3 more variables: `math score` <dbl>, `reading score` <dbl>,
## # `writing score` <dbl>
summary(data)
##     gender          race/ethnicity     parental level of education
##  Length:1000        Length:1000        Length:1000
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##
##
##     lunch           test preparation course   math score     reading score
##  Length:1000        Length:1000             Min.   :  0.00   Min.   : 17.00
##  Class :character   Class :character        1st Qu.: 57.00   1st Qu.: 59.00
##  Mode  :character   Mode  :character        Median : 66.00   Median : 70.00
##                                             Mean   : 66.09   Mean   : 69.17
##                                             3rd Qu.: 77.00   3rd Qu.: 79.00
##                                             Max.   :100.00   Max.   :100.00
##  writing score
##  Min.   : 10.00
##  1st Qu.: 57.75
##  Median : 69.00
##  Mean   : 68.05
##  3rd Qu.: 79.00
##  Max.   :100.00
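For the five character columns, `summary()` reports only length, class, and mode. One way to get category counts instead is to convert those columns to factors first. The sketch below uses a small hypothetical data frame, `first_six`, built from the six rows printed by `head(data)`; it illustrates the idea and is not a step from the original analysis.

```r
# Hypothetical mini data set: values copied from the six rows shown by head(data).
first_six <- data.frame(
  gender = c("female", "female", "female", "male", "male", "female"),
  `test preparation course` = c("none", "completed", "none", "none", "none", "none"),
  check.names = FALSE  # keep the original column name with spaces
)

# Convert every column to a factor so summary() tabulates the levels.
first_six[] <- lapply(first_six, factor)

summary(first_six)
```

On the full data set the same conversion could be limited to the character columns, for example by selecting them with `sapply(data, is.character)`.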
str(data)
## spc_tbl_ [1,000 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender                     : chr [1:1000] "female" "female" "female" "male" ...
##  $ race/ethnicity             : chr [1:1000] "group B" "group C" "group B" "group A" ...
##  $ parental level of education: chr [1:1000] "bachelor's degree" "some college" "master's degree" "associate's degree" ...
##  $ lunch                      : chr [1:1000] "standard" "standard" "standard" "free/reduced" ...
##  $ test preparation course    : chr [1:1000] "none" "completed" "none" "none" ...
##  $ math score                 : num [1:1000] 72 69 90 47 76 71 88 40 64 38 ...
##  $ reading score              : num [1:1000] 72 90 95 57 78 83 95 43 64 60 ...
##  $ writing score              : num [1:1000] 74 88 93 44 75 78 92 39 67 50 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   `race/ethnicity` = col_character(),
##   ..   `parental level of education` = col_character(),
##   ..   lunch = col_character(),
##   ..   `test preparation course` = col_character(),
##   ..   `math score` = col_double(),
##   ..   `reading score` = col_double(),
##   ..   `writing score` = col_double()
##   .. )
## - attr(*, "problems")=<externalptr>
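Several column names contain spaces or a slash, which is why they must be wrapped in backticks wherever they are used. A possible alternative, sketched here as an assumption rather than something the original analysis does, is to derive syntactic names once. The vector `original_names` is copied from the `str()` output; `clean_names` is a hypothetical name.

```r
# Column names copied from the str() output above.
original_names <- c(
  "gender", "race/ethnicity", "parental level of education", "lunch",
  "test preparation course", "math score", "reading score", "writing score"
)

# Replace every run of non-alphanumeric characters with a single underscore.
clean_names <- gsub("[^A-Za-z0-9]+", "_", original_names)
clean_names

# On the real tibble this could be applied with:
# names(data) <- clean_names
```

After such a rename, `math score` would become `math_score` and could be used in formulas without backticks.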
hist(data$`reading score`, col = "skyblue", main = "Reading Score Distribution")
boxplot(`math score` ~ gender, data = data, col = "lightblue", main = "Math Score by Gender")
model <- lm(`math score` ~ `reading score` + `writing score`, data = data)
summary(model)
##
## Call:
## lm(formula = `math score` ~ `reading score` + `writing score`,
## data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.8779  -6.1750   0.2693   6.0184  24.8727
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      7.52409    1.32823   5.665 1.93e-08 ***
## `reading score`  0.60129    0.06304   9.538  < 2e-16 ***
## `writing score`  0.24942    0.06057   4.118 4.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.667 on 997 degrees of freedom
## Multiple R-squared: 0.674, Adjusted R-squared: 0.6733
## F-statistic: 1031 on 2 and 997 DF, p-value: < 2.2e-16
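Read as a fitted equation, the output says that each additional reading point is associated with about 0.60 more math points and each additional writing point with about 0.25 more, holding the other score fixed. Together the two predictors account for roughly 67% of the variance in math scores. The sketch below applies the reported coefficients by hand. The student with scores of 70 and 70 is a hypothetical example, not a row from the data.

```r
# Coefficients copied from the summary(model) output above.
coefs <- c(intercept = 7.52409, reading = 0.60129, writing = 0.24942)

# Hypothetical student: illustrative scores, not taken from the data set.
reading_score <- 70
writing_score <- 70

# Fitted value: intercept + slope * reading + slope * writing.
predicted_math <- coefs[["intercept"]] +
  coefs[["reading"]] * reading_score +
  coefs[["writing"]] * writing_score

predicted_math  # about 67.07
```

With the fitted `model` object available, `predict(model, newdata = ...)` should give the same value up to rounding of the printed coefficients. The residual standard error of 8.667 indicates that individual predictions are typically off by several points.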