Hi! My name is Caesar. Welcome to my Rmd :) In this LBB I will use the StudentsPerformance.csv data from https://www.kaggle.com. I hope you enjoy it!
Make sure the data file is placed in the data_input folder of our R project.
students <- read.csv("data_input/StudentsPerformance.csv")
names(students)
## [1] "gender" "race.ethnicity"
## [3] "parental.level.of.education" "lunch"
## [5] "test.preparation.course" "math.score"
## [7] "reading.score" "writing.score"
Next, we inspect the data.
dim(students)
## [1] 1000 8
head(students)
tail(students)
From the inspection above, we get a short description of the data: students consists of 1000 rows and 8 columns. Next, we need to check the data structure.
str(students)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : chr "female" "female" "female" "male" ...
## $ race.ethnicity : chr "group B" "group C" "group B" "group A" ...
## $ parental.level.of.education: chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ lunch : chr "standard" "standard" "standard" "free/reduced" ...
## $ test.preparation.course : chr "none" "completed" "none" "none" ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
As we can see, the ‘gender’, ‘race.ethnicity’, ‘parental.level.of.education’, ‘lunch’, and ‘test.preparation.course’ columns are stored as character, so we convert them to factor type.
students$gender <- as.factor(students$gender)
students$race.ethnicity <- as.factor(students$race.ethnicity)
students$parental.level.of.education <- as.factor(students$parental.level.of.education)
students$lunch <- as.factor(students$lunch)
students$test.preparation.course <- as.factor(students$test.preparation.course)
str(students)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ race.ethnicity : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
## $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
## $ lunch : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
## $ test.preparation.course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
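The five as.factor() calls above can also be written as a single step. This is only a minimal sketch: it assumes a small stand-in data frame built from the first four rows shown in the str() output, and the names students_demo and cat_cols are made up for illustration.

```r
# Stand-in for the real data: the first four rows shown by str(students).
students_demo <- data.frame(
  gender = c("female", "female", "female", "male"),
  lunch = c("standard", "standard", "standard", "free/reduced"),
  math.score = c(72L, 69L, 90L, 47L),
  stringsAsFactors = FALSE
)

# Columns that hold categories rather than free text (illustrative name).
cat_cols <- c("gender", "lunch")

# Convert every listed column to a factor in one assignment.
students_demo[cat_cols] <- lapply(students_demo[cat_cols], as.factor)

str(students_demo)
```

On the real data, cat_cols would list all five categorical columns; the score columns stay integer.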
Check the dataset for missing values
anyNA(students)
## [1] FALSE
Great! No missing values.
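anyNA() only answers yes or no. If it ever returned TRUE, a per-column count would show where the gaps are. A minimal sketch, assuming a tiny made-up data frame (na_demo) with one NA injected on purpose; the real dataset has none.

```r
# Made-up example data: the NA is injected deliberately for illustration.
na_demo <- data.frame(
  math.score = c(72, NA, 90),
  reading.score = c(72, 90, 95)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_per_column <- colSums(is.na(na_demo))
na_per_column
```

Running colSums(is.na(students)) on the real data should give zero for every column, in line with the anyNA() result.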
Now the StudentsPerformance dataset is ready to be processed and analyzed.
Let's check the statistical summary.
summary(students)
## gender race.ethnicity parental.level.of.education lunch
## female:518 group A: 89 associate's degree:222 free/reduced:355
## male :482 group B:190 bachelor's degree :118 standard :645
## group C:319 high school :196
## group D:262 master's degree : 59
## group E:140 some college :226
## some high school :179
## test.preparation.course math.score reading.score writing.score
## completed:358 Min. : 0.00 Min. : 17.00 Min. : 10.00
## none :642 1st Qu.: 57.00 1st Qu.: 59.00 1st Qu.: 57.75
## Median : 66.00 Median : 70.00 Median : 69.00
## Mean : 66.09 Mean : 69.17 Mean : 68.05
## 3rd Qu.: 77.00 3rd Qu.: 79.00 3rd Qu.: 79.00
## Max. :100.00 Max. :100.00 Max. :100.00
From the summary above, we can conclude a few things:
1. Female: 518, Male: 482; the gender distribution is roughly balanced.
2. Ethnicity is not balanced: ‘group C’ is the largest group (319 students) and ‘group A’ the smallest (89).
3. The education level of the parents is not evenly distributed. Parents with a master's degree are the smallest group (59), and ‘some college’ is the most common level (226), just ahead of ‘associate's degree’ (222).
4. In the ‘lunch’ feature, ‘standard’ (645) is almost twice as common as ‘free/reduced’ (355).
5. 642 students did not take the test preparation course, while 358 completed it.
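The points above can be backed with simple arithmetic on the counts printed by summary(). A minimal sketch; the counts are copied from the summary output, and the variable names are only for illustration.

```r
# Counts copied from the summary(students) output.
gender_counts <- c("female" = 518, "male" = 482)
lunch_counts <- c("free/reduced" = 355, "standard" = 645)

# Share of each category among all students.
gender_share <- gender_counts / sum(gender_counts)
lunch_share <- lunch_counts / sum(lunch_counts)

# How many 'standard' lunches there are per 'free/reduced' lunch.
lunch_ratio <- lunch_counts[["standard"]] / lunch_counts[["free/reduced"]]

gender_share
lunch_share
round(lunch_ratio, 2)
```

The ratio is a little under 2, which is why ‘standard’ is better described as nearly double than exactly double.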
1. We check the distribution of the scores in every subject by race/ethnicity
library(ggplot2) # provides ggplot() and the geoms used below
ggplot(students, aes(x = race.ethnicity, y = math.score, fill = race.ethnicity)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Distribution Math Score Based on Race Ethnicity",
       subtitle = "Students Performance",
       caption = "Source: https://www.kaggle.com",
       x = "Race Ethnicity", y = "Math Score")
ggplot(students, aes(x = race.ethnicity, y = reading.score, fill = race.ethnicity)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Distribution Reading Score Based on Race Ethnicity",
       subtitle = "Students Performance",
       caption = "Source: https://www.kaggle.com",
       x = "Race Ethnicity", y = "Reading Score")
ggplot(students, aes(x = race.ethnicity, y = writing.score, fill = race.ethnicity)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Distribution Writing Score Based on Race Ethnicity",
       subtitle = "Students Performance",
       caption = "Source: https://www.kaggle.com",
       x = "Race Ethnicity", y = "Writing Score")
Group E shows the best scores in all subjects, and group A the worst.
The boxplots suggest that the groups are ordered by score (A < B < C < D < E).
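The ordering read off the boxplots could be checked numerically by computing the median score per group. A minimal sketch, assuming only the first ten rows shown in the str() output (race_demo is an illustrative name); ten rows are far too few to confirm the ordering and contain no group E at all, so this only demonstrates the call.

```r
# First ten rows of race.ethnicity and math.score, as shown by str(students).
race_demo <- data.frame(
  race.ethnicity = c("group B", "group C", "group B", "group A", "group C",
                     "group B", "group B", "group B", "group D", "group B"),
  math.score = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38)
)

# Median math score per group, one row per group.
group_medians <- aggregate(math.score ~ race.ethnicity, data = race_demo, FUN = median)

# Sort from lowest to highest median to make the ordering visible.
group_medians[order(group_medians$math.score), ]
```

Replacing race_demo with students gives the medians that the boxplots draw as their centre lines.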
2. How often does each category of the ‘race/ethnicity’ feature occur?
library(dplyr) # provides count()
students.count <- count(students, race.ethnicity)
ggplot(data = students.count, mapping = aes(x = n, y = reorder(race.ethnicity, n))) +
  geom_col(aes(fill = n)) +
  scale_fill_gradient(low = "#030000", high = "#b00404") +
  geom_text(aes(label = n), color = "white", hjust = 1.5, size = 4) +
  labs(title = "Distribution Data Based On Race Ethnicity",
       subtitle = "Students Performance",
       x = "",
       y = "Race Ethnicity",
       fill = "Frequency",
       caption = "Source: https://www.kaggle.com")
As we see here, group C is the most frequent category in the ‘race/ethnicity’ feature.
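The bar chart shows raw counts; percentages make the imbalance easier to compare. A minimal sketch using the counts printed earlier by summary(students); the variable names are only for illustration.

```r
# Counts per group, copied from the summary(students) output.
race_counts <- c("group A" = 89, "group B" = 190, "group C" = 319,
                 "group D" = 262, "group E" = 140)

# Sort from most to least frequent, then convert to percentages.
race_sorted <- sort(race_counts, decreasing = TRUE)
race_percent <- round(100 * race_sorted / sum(race_sorted), 1)

race_percent
```

Group C accounts for roughly a third of the students, while group A is under a tenth.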
3. How often does each category of the ‘parental.level.of.education’ feature occur?
students.count2 <- count(students, parental.level.of.education)
ggplot(data = students.count2, mapping = aes(x = n, y = reorder(parental.level.of.education, n))) +
  geom_col(aes(fill = n)) +
  scale_fill_gradient(low = "#34d5eb", high = "#218f9e") +
  geom_text(aes(label = n), color = "black", hjust = 1.5, size = 4) +
  labs(title = "Distribution Data Based On Parental Level Of Education",
       subtitle = "Students Performance",
       x = "",
       y = "Parental Level Of Education",
       fill = "Frequency",
       caption = "Source: https://www.kaggle.com")
As we see here, ‘some college’ is the most frequent category in the ‘parental.level.of.education’ feature.
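By default the factor levels are alphabetical, so ‘associate's degree’ comes first. For some views it may be more natural to order the levels by amount of education. A minimal sketch using the counts from summary(students); the ordering in edu_order is my own assumption, not something the dataset defines.

```r
# Counts per level, copied from the summary(students) output (alphabetical order).
edu_counts <- c("associate's degree" = 222, "bachelor's degree" = 118,
                "high school" = 196, "master's degree" = 59,
                "some college" = 226, "some high school" = 179)

# Assumed ordering from least to most education (an assumption, not from the data).
edu_order <- c("some high school", "high school", "some college",
               "associate's degree", "bachelor's degree", "master's degree")

# Reorder the counts by indexing with the level names.
edu_counts_ordered <- edu_counts[edu_order]
edu_counts_ordered
```

On the real data, the same ordering could be applied with factor(students$parental.level.of.education, levels = edu_order) before plotting.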
4. How are the scores distributed in each subject?
students.countmath <- count(students, math.score)
ggplot(data = students.countmath, mapping = aes(x = math.score, y = n)) +
  geom_col(fill = "#059bff") +
  scale_x_continuous(breaks = seq(0, 100, 20)) +
  labs(title = "Distribution Math Score",
       subtitle = "Students Performance",
       x = "Math Score",
       y = "Frequency",
       caption = "Source: https://www.kaggle.com")
students.countreading <- count(students, reading.score)
ggplot(data = students.countreading, mapping = aes(x = reading.score, y = n)) +
  geom_col(fill = "#059bff") +
  scale_x_continuous(breaks = seq(0, 100, 20)) +
  labs(title = "Distribution Reading Score",
       subtitle = "Students Performance",
       x = "Reading Score",
       y = "Frequency",
       caption = "Source: https://www.kaggle.com")
students.countwriting <- count(students, writing.score)
ggplot(data = students.countwriting, mapping = aes(x = writing.score, y = n)) +
  geom_col(fill = "#059bff") +
  scale_x_continuous(breaks = seq(0, 100, 20)) +
  labs(title = "Distribution Writing Score",
       subtitle = "Students Performance",
       x = "Writing Score",
       y = "Frequency",
       caption = "Source: https://www.kaggle.com")
In the ‘math score’ feature, the scores cluster in the range of 60-80.
The reading scores are less concentrated than the math scores; the most common scores appear to fall around 75-80.
The distribution of the ‘writing score’ feature is similar to that of the reading score.
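Statements such as "scores cluster in 60-80" can be checked by binning the scores and counting. A minimal sketch, assuming only the first ten math scores shown in the str() output (math_demo is an illustrative name), so the numbers describe those ten rows and not the full dataset.

```r
# First ten math scores, as shown by str(students).
math_demo <- c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38)

# Bin the scores into ranges of width 20: [0,20], (20,40], ..., (80,100].
bins <- cut(math_demo, breaks = seq(0, 100, 20), include.lowest = TRUE)
bin_counts <- table(bins)

# Share of scores above 60 and up to 80.
share_60_80 <- mean(math_demo > 60 & math_demo <= 80)

bin_counts
share_60_80
```

The same two calls on students$math.score, students$reading.score and students$writing.score would put numbers on the three bar charts.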
5. For math and reading scores, who did better: boys or girls?
ggplot(data = students, aes(x = math.score, y = reading.score, col = gender)) +
  geom_point() +
  labs(title = "Correlation Math Score & Reading Score",
       subtitle = "Based on Gender",
       caption = "Source: https://www.kaggle.com",
       x = "Math Score", y = "Reading Score",
       col = "Gender")
The plot suggests that the boys are better than the girls at math, while the girls are better than the boys at reading.
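A scatter plot gives an impression; group means give a number to compare. A minimal sketch, assuming only the first ten rows shown in the str() output (gender_demo is an illustrative name). Ten rows are not representative and can easily point the other way from the full data, so this only shows how the comparison could be computed.

```r
# First ten rows of gender, math.score and reading.score, as shown by str(students).
gender_demo <- data.frame(
  gender = c("female", "female", "female", "male", "male",
             "female", "female", "male", "male", "female"),
  math.score = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38),
  reading.score = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60)
)

# Mean math and reading score for each gender.
by_gender <- aggregate(cbind(math.score, reading.score) ~ gender,
                       data = gender_demo, FUN = mean)
by_gender
```

Swapping gender_demo for students gives the averages behind the full scatter plot.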
6. How do students who completed the test preparation course compare with those who did not?
ggplot(data = students, aes(x = math.score, y = reading.score, col = test.preparation.course)) +
  geom_point() +
  labs(title = "Correlation Math Score & Reading Score",
       subtitle = "Based on Test Preparation Course",
       caption = "Source: https://www.kaggle.com",
       x = "Math Score", y = "Reading Score",
       col = "Test Preparation Course")
The conclusion is that, while it is perfectly possible to do well on the test without preparation, students who completed the preparation course perform on average at a higher level than those who did not.
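The phrase "on average at a higher level" can be made concrete with a difference of group means. A minimal sketch, assuming only the first ten rows shown in the str() output (prep_demo is an illustrative name); the result describes those ten rows, not the whole dataset.

```r
# First ten rows of test.preparation.course and math.score, as shown by str(students).
prep_demo <- data.frame(
  test.preparation.course = c("none", "completed", "none", "none", "none",
                              "none", "completed", "none", "completed", "none"),
  math.score = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38)
)

# Mean math score for each level of the preparation course.
prep_means <- tapply(prep_demo$math.score, prep_demo$test.preparation.course, mean)

# Positive value: students who completed the course scored higher on average.
prep_gap <- prep_means[["completed"]] - prep_means[["none"]]

prep_means
prep_gap
```

The highest score in these rows (90) belongs to a student without preparation, which matches the point that individuals can do well either way.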
7. Facet plot by gender and test preparation course
# geom_bar() counts the students in each group; frequency() is meant for time series and only returned 1 here
ggplot(students, aes(y = race.ethnicity)) +
  geom_bar(fill = "blue") +
  facet_grid(test.preparation.course ~ gender) +
  labs(title = "Comparison About Female and Male Students",
       subtitle = "Students Performance",
       x = "",
       y = "Race Ethnicity",
       caption = "Source: https://www.kaggle.com")
Based on the plot above, most students who completed the preparation course came from ethnicity group C, for both females and males.
Among the students who did not complete the preparation course, the largest group was group C for females and group D for males.
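The bar lengths in the facet plot are just counts per combination of group, gender and preparation course, so they can also be tabulated. A minimal sketch, assuming only the first ten rows shown in the str() output (facet_demo is an illustrative name); the counts describe those ten rows only.

```r
# First ten rows of the three categorical columns, as shown by str(students).
facet_demo <- data.frame(
  race.ethnicity = c("group B", "group C", "group B", "group A", "group C",
                     "group B", "group B", "group B", "group D", "group B"),
  gender = c("female", "female", "female", "male", "male",
             "female", "female", "male", "male", "female"),
  test.preparation.course = c("none", "completed", "none", "none", "none",
                              "none", "completed", "none", "completed", "none")
)

# Three-way contingency table, flattened to one row per combination.
facet_counts <- as.data.frame(table(facet_demo))

# Keep only the combinations that actually occur, largest first.
facet_counts <- facet_counts[facet_counts$Freq > 0, ]
facet_counts[order(-facet_counts$Freq), ]
```

Each remaining row corresponds to one bar in the facet plot, with Freq as its length.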
From the graphs above, we can draw some conclusions:
1. Math scores are lower overall than reading and writing scores (mean 66.09 versus 69.17 and 68.05 in the summary above).
2. While it is perfectly possible to do well on the test without preparation, students who completed the preparation course perform on average at a higher level than those who did not.
3. Group E shows the best scores in all subjects, and group A the worst.
The boxplots suggest that the groups are ordered by score (A < B < C < D < E).
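To compare the three subjects side by side, the mean and standard deviation of each score column can be put in one small table. A minimal sketch, assuming only the first ten rows shown in the str() output (scores_demo is an illustrative name); the values will differ from the full-data means reported by summary().

```r
# First ten rows of the three score columns, as shown by str(students).
scores_demo <- data.frame(
  math.score = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38),
  reading.score = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60),
  writing.score = c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)
)

# One column per subject, with the mean and standard deviation as rows.
score_stats <- sapply(scores_demo, function(x) c(mean = mean(x), sd = sd(x)))
round(score_stats, 2)
```

Running the same sapply() call on the three score columns of students gives the spread of each subject in addition to the means already shown in the summary.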