1 Explanation

Hi! My name is Caesar. Welcome to my Rmd :) In this LBB I will use the StudentsPerformance.csv data from https://www.kaggle.com. I hope you enjoy it!

2 Input Data

Make sure the data file is placed in our R project folder (here, in the data_input subfolder).
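A quick way to confirm that the file is where the script expects it, before calling read.csv(), is sketched below. The relative path is the one used in this Rmd; the message texts are only illustrative.

```r
# Path used in this Rmd, relative to the project folder
csv_path <- file.path("data_input", "StudentsPerformance.csv")

# file.exists() returns TRUE or FALSE, so we can report the problem early
# instead of waiting for read.csv() to fail
if (file.exists(csv_path)) {
  message("Found: ", csv_path)
} else {
  message("Not found: ", csv_path, " (working directory: ", getwd(), ")")
}
```

If the file is reported as missing, the working directory printed in the message is usually the first thing to check.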

2.1 Data Input & Structure

library(ggplot2)  # used for all plots below
library(dplyr)    # provides count()
students <- read.csv("data_input/StudentsPerformance.csv")

names(students)

## [1] "gender"                      "race.ethnicity"             
## [3] "parental.level.of.education" "lunch"                      
## [5] "test.preparation.course"     "math.score"                 
## [7] "reading.score"               "writing.score"

Next, we inspect the data.

dim(students)

## [1] 1000    8

head(students)

tail(students)

From the inspection above, we get a short description of the data: students consists of 1000 rows and 8 columns. Next, we need to check the data structure.

str(students)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : chr  "female" "female" "female" "male" ...
##  $ race.ethnicity             : chr  "group B" "group C" "group B" "group A" ...
##  $ parental.level.of.education: chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
##  $ lunch                      : chr  "standard" "standard" "standard" "free/reduced" ...
##  $ test.preparation.course    : chr  "none" "completed" "none" "none" ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

As we can see, the ‘gender’, ‘race.ethnicity’, ‘parental.level.of.education’, ‘lunch’ and ‘test.preparation.course’ columns are stored as character. We convert them to factor type.

students$gender <- as.factor(students$gender)
students$race.ethnicity <- as.factor(students$race.ethnicity)
students$parental.level.of.education <- as.factor(students$parental.level.of.education)
students$lunch <- as.factor(students$lunch)
students$test.preparation.course <- as.factor(students$test.preparation.course)

str(students)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...
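The five conversions above can also be written as a single step. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) instead of the real `students` data, so it runs on its own.

```r
# Toy data frame standing in for `students` (values are made up)
toy <- data.frame(
  gender     = c("female", "male", "female"),
  lunch      = c("standard", "free/reduced", "standard"),
  math.score = c(72L, 47L, 90L)
)

# Names of the columns that should become factors
cat_cols <- c("gender", "lunch")

# lapply() applies as.factor() to each selected column in turn
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

str(toy)
```

Another option is to pass `stringsAsFactors = TRUE` to read.csv(), which converts every character column while the file is read.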

2.2 Missing Data

Check the input dataset for missing values.

anyNA(students)

## [1] FALSE

Great! There are no missing values.
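Had anyNA() returned TRUE, a per-column count would show where the gaps are. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
# Toy data with two missing values (made up for illustration)
toy <- data.frame(
  gender        = c("female", NA, "male"),
  math.score    = c(72, 69, NA),
  reading.score = c(72, 90, 95)
)

# TRUE as soon as at least one value is missing
anyNA(toy)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(toy))
na_per_column
```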

Now the StudentsPerformance dataset is ready to be processed and analyzed.

2.3 Practical Statistics

Let's check the statistical summary.

summary(students)

##     gender    race.ethnicity     parental.level.of.education          lunch    
##  female:518   group A: 89    associate's degree:222          free/reduced:355  
##  male  :482   group B:190    bachelor's degree :118          standard    :645  
##               group C:319    high school       :196                            
##               group D:262    master's degree   : 59                            
##               group E:140    some college      :226                            
##                              some high school  :179                            
##  test.preparation.course   math.score     reading.score    writing.score   
##  completed:358           Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
##  none     :642           1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
##                          Median : 66.00   Median : 70.00   Median : 69.00  
##                          Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
##                          3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00

From the summary above, we can conclude a few things:
1. Female: 518, male: 482. The gender distribution is roughly balanced.
2. Ethnicity is not balanced: ‘group C’ is the largest group (319) and ‘group A’ the smallest (89).
3. The parents’ education level is not evenly distributed. Parents with a master’s degree are the smallest group (59), while ‘some college’ (226) and ‘associate’s degree’ (222) are the largest.
4. In the ‘lunch’ feature, ‘standard’ (645) is almost double ‘free/reduced’ (355).
5. 642 students did not take the test preparation course, while 358 completed it.
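The balance statements above can be checked with a little arithmetic on the counts printed by summary(). The sketch below types those counts in by hand; the vector names are only illustrative.

```r
# Counts copied from the summary() output above
gender <- c(female = 518, male = 482)
lunch  <- c("free/reduced" = 355, standard = 645)
prep   <- c(completed = 358, none = 642)

# Share of each gender
round(prop.table(gender), 3)

# How many times larger the 'standard' group is than 'free/reduced'
lunch_ratio <- unname(lunch["standard"] / lunch["free/reduced"])
round(lunch_ratio, 2)

# Share of students who completed the preparation course
round(prop.table(prep), 3)
```

The lunch ratio comes out at about 1.82, so ‘standard’ is close to, but not quite, twice as frequent.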

3 Case Study

1. How are the scores in each subject distributed across the race/ethnicity groups?

ggplot(students, aes(x = race.ethnicity, y = math.score, fill = race.ethnicity)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Distribution Math Score Based on Race Ethnicity",
       subtitle = "Students Performance",
       caption = "Source: https://www.kaggle.com",
       x = "Race Ethnicity", y = "Math Score")

ggplot(students, aes(x = race.ethnicity, y = reading.score, fill = race.ethnicity)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Distribution Reading Score Based on Race Ethnicity",
       subtitle = "Students Performance",
       caption = "Source: https://www.kaggle.com",
       x = "Race Ethnicity", y = "Reading Score")

ggplot(students, aes(x = race.ethnicity, y = writing.score, fill = race.ethnicity)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Distribution Writing Score Based on Race Ethnicity",
       subtitle = "Students Performance",
       caption = "Source: https://www.kaggle.com",
       x = "Race Ethnicity", y = "Writing Score")

Group E shows the best scores in all subjects, and group A the worst.
The boxplots suggest that the groups are ordered by score (A < B < C < D < E).
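An ordering read off boxplots can be backed up by computing the group medians directly. The sketch below uses a tiny made-up data frame (`toy`, with hypothetical scores), not the real data; running the same aggregate() call on `students` would give the actual medians.

```r
# Toy data: two students per group (scores are made up)
toy <- data.frame(
  race.ethnicity = c("group A", "group A", "group B", "group B",
                     "group E", "group E"),
  math.score     = c(50, 60, 62, 66, 70, 80)
)

# Median math score per group, the same statistic as the boxplot centre line
group_medians <- aggregate(math.score ~ race.ethnicity, data = toy, FUN = median)

# Sort from lowest to highest median to make the ordering explicit
group_medians[order(group_medians$math.score), ]
```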

2. How often does each category of the ‘race/ethnicity’ feature occur?

students.count <- count(students,race.ethnicity)

ggplot(data = students.count, mapping = aes(x = n, y = reorder(race.ethnicity, n))) +
  geom_col(aes(fill= n)) +
  scale_fill_gradient(low = "#030000", high = "#b00404") +
  geom_text(aes(label = n), color = "white", hjust = 1.5, size = 4) +
  labs(title = "Distribution Data Based On Race Ethnicity",
  subtitle = "Students Performance",
  x = "",
  y = "Race Ethnicity",
  fill = "Frequency",
  caption = "Source: https://www.kaggle.com")

As we can see, group C is the most frequent category in the ‘race/ethnicity’ feature.
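The same ranking can be read from the counts without a plot. The sketch below copies the group counts from the summary() output earlier in this document; the vector name is only illustrative.

```r
# Counts per group, copied from the summary() output
race_counts <- c("group A" = 89, "group B" = 190, "group C" = 319,
                 "group D" = 262, "group E" = 140)

# Largest group first, the same order the bar chart shows
sort(race_counts, decreasing = TRUE)

# Name of the most frequent group
names(which.max(race_counts))

# Percentage of all students in each group
round(100 * race_counts / sum(race_counts), 1)
```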

3. How often does each category of the ‘parental.level.of.education’ feature occur?

students.count2 <- count(students,parental.level.of.education)

ggplot(data = students.count2, mapping = aes(x = n , y = reorder(parental.level.of.education, n))) +
  geom_col(aes(fill= n)) +
  scale_fill_gradient(low = "#34d5eb", high = "#218f9e") +
  geom_text(aes(label = n), color = "black", hjust = 1.5, size = 4) +
  labs(title = "Distribution Data Based On Parental Level Of Education",
  subtitle = "Students Performance",
  x = "",
  y = "Parental Level Of Education",
  fill = "Frequency",
  caption = "Source: https://www.kaggle.com")

As we can see, ‘some college’ is the most frequent category in the ‘parental.level.of.education’ feature.

4. How are the scores distributed in each subject?

students.countmath <- count(students,math.score)

ggplot(data = students.countmath, mapping = aes( x = math.score, y = n )) +
  geom_col(fill = "#059bff") +
  scale_x_continuous(breaks = seq(0,100,20)) +
  labs(title = "Distribution Math Score",
  subtitle = "Students Performance",
  x = "Math Score",
  y = "Frequency",
  caption = "Source: https://www.kaggle.com")

students.countreading <- count(students,reading.score)

ggplot(data = students.countreading, mapping = aes( x = reading.score, y = n )) +
  geom_col(fill = "#059bff") +
  scale_x_continuous(breaks = seq(0,100,20)) +
  labs(title = "Distribution Reading Score",
  subtitle = "Students Performance",
  x = "Reading Score",
  y = "Frequency",
  caption = "Source: https://www.kaggle.com")

students.countwriting <- count(students,writing.score)

ggplot(data = students.countwriting, mapping = aes( x = writing.score, y = n )) +
  geom_col(fill = "#059bff") +
  scale_x_continuous(breaks = seq(0,100,20)) +
  labs(title = "Distribution Writing Score",
  subtitle = "Students Performance",
  x = "Writing Score",
  y = "Frequency",
  caption = "Source: https://www.kaggle.com")

In ‘math score’, the scores accumulate in the range of 60-80.
The reading scores are not as concentrated as the math scores; the most frequent scores appear to lie around 75-80.
The distribution of ‘writing score’ is similar to that of the reading score.
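Statements about score ranges are easier to check when the scores are grouped into bins. The sketch below bins only the first ten math scores shown in the str() output, so it illustrates the technique and says nothing reliable about the full distribution; applying the same calls to `students$math.score` would cover all 1000 rows.

```r
# First ten math scores, copied from the str() output
scores <- c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38)

# Cut the 0-100 range into 20-point bins; include.lowest = TRUE keeps a
# score of exactly 0 (the minimum in this dataset) inside the first bin
bins <- cut(scores, breaks = seq(0, 100, by = 20), include.lowest = TRUE)

# Number of scores per bin
table(bins)
```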

5. For math and reading scores, who performed better, boys or girls?

ggplot(data = students, aes(x = math.score, y = reading.score, col = gender)) +
  geom_point() +
  labs(title = "Correlation Math Score & Reading Score",
       subtitle = "Based on Gender",
       caption = "Source: https://www.kaggle.com",
       x = "Math Score", y = "Reading Score",
       col = "Gender")

The conclusion from this plot is that boys tend to be better than girls at math, while girls tend to be better than boys at reading.
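Since the plot title speaks of correlation, it can help to put a number on it with cor(). The sketch below uses only the first ten rows shown in the str() output (`first_ten` is an illustrative name), which is far too few to confirm or refute the conclusion; the same calls on `students` would use all rows.

```r
# First ten rows of the relevant columns, copied from the str() output
first_ten <- data.frame(
  gender        = c("female", "female", "female", "male", "male",
                    "female", "female", "male", "male", "female"),
  math.score    = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38),
  reading.score = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60)
)

# Pearson correlation between math and reading over all ten rows
overall_r <- cor(first_ten$math.score, first_ten$reading.score)
overall_r

# The same correlation computed separately for each gender
by_gender <- sapply(split(first_ten, first_ten$gender),
                    function(d) cor(d$math.score, d$reading.score))
by_gender
```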

6. How do students who completed the test preparation course compare with those who did not?

ggplot(data = students, aes(x = math.score, y = reading.score, col = test.preparation.course)) +
  geom_point() +
  labs(title = "Correlation Math Score & Reading Score",
       subtitle = "Based on Test Preparation Course",
       caption = "Source: https://www.kaggle.com",
       x = "Math Score", y = "Reading Score",
       col = "Test Preparation Course")

The conclusion is that, while it is perfectly possible for a student to do well on the test without preparation, students who completed the preparation course perform at a higher level on average than those who did not.
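The "on average" part of this conclusion can be checked by computing the group means. The sketch below again uses only the first ten rows from the str() output (`first_ten` and `prep_means` are illustrative names), so the numbers are a demonstration rather than a result for the whole dataset.

```r
# First ten rows of the relevant columns, copied from the str() output
first_ten <- data.frame(
  test.preparation.course = c("none", "completed", "none", "none", "none",
                              "none", "completed", "none", "completed", "none"),
  math.score    = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38),
  reading.score = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60)
)

# Mean math and reading score for each preparation group
prep_means <- aggregate(cbind(math.score, reading.score) ~ test.preparation.course,
                        data = first_ten, FUN = mean)
prep_means
```

Replacing `first_ten` with `students` gives the means over all 1000 rows.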

7. Facet plot based on gender

# geom_bar() counts the rows in each group, so no x aesthetic is needed
ggplot(students, aes(y = race.ethnicity)) +
  geom_bar(fill = "blue") +
  facet_grid(test.preparation.course ~ gender) +
  labs(title = "Comparison About Female and Male Students",
  subtitle = "Students Performance",
  x = "",
  y = "Race Ethnicity",
  caption = "Source: https://www.kaggle.com")

Based on the plot above, most of the students who completed the preparation course came from ethnicity group C, for both female and male students.
The highest number of students who did not complete the preparation course came from ethnicity group C for female students and group D for male students.
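The bar lengths in the facet plot are simply counts per combination of ethnicity, gender and preparation status, and they can be tabulated directly. The sketch below does this for the first ten rows from the str() output only (`first_ten` and `counts` are illustrative names); the same call on `students` would reproduce the numbers behind the plot.

```r
# First ten rows of the three grouping columns, decoded from the str() output
first_ten <- data.frame(
  race.ethnicity = c("group B", "group C", "group B", "group A", "group C",
                     "group B", "group B", "group B", "group D", "group B"),
  gender = c("female", "female", "female", "male", "male",
             "female", "female", "male", "male", "female"),
  test.preparation.course = c("none", "completed", "none", "none", "none",
                              "none", "completed", "none", "completed", "none")
)

# table() cross-tabulates the three columns; as.data.frame() turns the
# result into one row per combination with its count in column n
counts <- as.data.frame(table(first_ten), responseName = "n")

# Show only the combinations that actually occur, largest first
occurring <- counts[counts$n > 0, ]
occurring[order(-occurring$n), ]
```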

4 Final Conclusion

From all the graphs above, we can make some observations:
1. Math scores are distributed lower than reading and writing scores (mean 66.09 versus 69.17 and 68.05).
2. While it is perfectly possible for a student to do well on the test without preparation, students who completed the preparation course perform at a higher level on average than those who did not.
3. Group E shows the best scores in all subjects, and group A the worst. The groups appear to be ordered by score (A < B < C < D < E).

Data Visualization - Students Performance

Ibnu Caesar