A Step-by-step Approach From Data Importing to Cleaning and Visualization
Reproduced example from Joe Tran
Exploratory Data Analysis (EDA) is the process of analyzing and visualizing the data to get a better understanding of the data and glean insight from it. There are various steps involved when doing EDA but the following are the common steps that a data analyst can take when performing EDA:
This article focuses on EDA of a dataset, which means that it would involve all the steps mentioned above. Therefore, this article will walk you through all the steps required and the tools used in each step. So you would expect to find the followings in this article:
For beginners to EDA, if you do not have a lot of time and do not know where to start, I would recommend you to start with Tidyverse and ggplot2. You can do almost everything about EDA with these 2 packages. There are various resources online like DataCamp, Setscholars, and books like Introduction to Data Science and so on. In this article, I would walk you through the process of EDA through the analysis of the PISA score dataset which is available here.
Let’s get started, folks!!
Before importing the data into R for analysis, let’s look at how the data looks like
When importing this data into R, we want the last column to be ‘numeric’ and the rest to be ‘factor’. With this in mind, let’s look at the following 3 scenarios:
file<-"https://raw.githubusercontent.com/prasertcbs/basic-dataset/master/Pisa%20mean%20perfromance%20scores%202013%20-%202015%20Data.csv"
df.raw <- read.csv(file , fileEncoding="UTF-8-BOM", na.strings = '..')
df.raw1 <- read.csv(file)
df.raw2 <- read.csv(file, na.strings = '..')
These are 3 ways of importing the data into R. Usually, one with go for the df.raw1 because it seems to be the most convenient way of importing the data. Let’s see the structure of the imported data:
df.raw1 <- read.csv(file)
str(df.raw1)
## 'data.frame': 1161 obs. of 7 variables:
## $ ï..Country.Name: chr "Albania" "Albania" "Albania" "Albania" ...
## $ Country.Code : chr "ALB" "ALB" "ALB" "ALB" ...
## $ Series.Name : chr "PISA: Mean performance on the mathematics scale" "PISA: Mean performance on the mathematics scale. Female" "PISA: Mean performance on the mathematics scale. Male" "PISA: Mean performance on the reading scale" ...
## $ Series.Code : chr "LO.PISA.MAT" "LO.PISA.MAT.FE" "LO.PISA.MAT.MA" "LO.PISA.REA" ...
## $ X2013..YR2013. : chr ".." ".." ".." ".." ...
## $ X2014..YR2014. : chr ".." ".." ".." ".." ...
## $ X2015..YR2015. : chr "413.157" "417.750029482799" "408.545458736189" "405.2588" ...
There are 2 problems that we can spot immediately. The last column is ‘factor’ and not ‘numeric’ like what we desire. Secondly, the first column ‘Country name’ is encoded differently from the raw dataset.
Now let’s try the second case scenario
df.raw2 <- read.csv(file, na.strings = '..')
str(df.raw2)
## 'data.frame': 1161 obs. of 7 variables:
## $ ï..Country.Name: chr "Albania" "Albania" "Albania" "Albania" ...
## $ Country.Code : chr "ALB" "ALB" "ALB" "ALB" ...
## $ Series.Name : chr "PISA: Mean performance on the mathematics scale" "PISA: Mean performance on the mathematics scale. Female" "PISA: Mean performance on the mathematics scale. Male" "PISA: Mean performance on the reading scale" ...
## $ Series.Code : chr "LO.PISA.MAT" "LO.PISA.MAT.FE" "LO.PISA.MAT.MA" "LO.PISA.REA" ...
## $ X2013..YR2013. : logi NA NA NA NA NA NA ...
## $ X2014..YR2014. : logi NA NA NA NA NA NA ...
## $ X2015..YR2015. : num 413 418 409 405 435 ...
The last column is now ‘numeric’. However, the name of the first column is not imported correctly.
What about the last scenario?
df.raw <- read.csv(file, fileEncoding="UTF-8-BOM", na.strings = '..')
str(df.raw)
## 'data.frame': 1161 obs. of 7 variables:
## $ Country.Name : chr "Albania" "Albania" "Albania" "Albania" ...
## $ Country.Code : chr "ALB" "ALB" "ALB" "ALB" ...
## $ Series.Name : chr "PISA: Mean performance on the mathematics scale" "PISA: Mean performance on the mathematics scale. Female" "PISA: Mean performance on the mathematics scale. Male" "PISA: Mean performance on the reading scale" ...
## $ Series.Code : chr "LO.PISA.MAT" "LO.PISA.MAT.FE" "LO.PISA.MAT.MA" "LO.PISA.REA" ...
## $ X2013..YR2013.: logi NA NA NA NA NA NA ...
## $ X2014..YR2014.: logi NA NA NA NA NA NA ...
## $ X2015..YR2015.: num 413 418 409 405 435 ...
As you can see, the first column is now named properly and the last column is ‘numeric’.
na.strings = '..' allows R to replace those blanks in the dataset with NA. This will be useful and convenient later when we want to remove all the ‘NA’s.
fileEncoding="UTF-8-BOM" allows R, in the laymen term, to read the characters as correctly as they would appear on the raw dataset.
We want to do a few things to clean the dataset:
Make sure that each row in the dataset corresponds to ONLY one country: Use spread() function in tidyverse package
Make sure that only useful columns and rows are kept: Use drop_na() and data subsetting
Rename the Series Code column for meaningful interpretation: Use rename()
df <- df.raw[1:1161, c(1, 4, 7)]%>% #select relevant rows and cols
spread(key=Series.Code, value=X2015..YR2015.) %>%
rename(Maths = LO.PISA.MAT,
Maths.F = LO.PISA.MAT.FE,
Maths.M = LO.PISA.MAT.MA,
Reading = LO.PISA.REA,
Reading.F = LO.PISA.REA.FE,
Reading.M = LO.PISA.REA.MA,
Science = LO.PISA.SCI,
Science.F = LO.PISA.SCI.FE,
Science.M = LO.PISA.SCI.MA
) %>%
drop_na()
Now let’s see how the clean data looks like:
df %>%
head(10) %>%
kable()%>%
kable_styling() %>%
scroll_box(width = "100%", height = "400px")
| Country.Name | Maths | Maths.F | Maths.M | Reading | Reading.F | Reading.M | Science | Science.F | Science.M |
|---|---|---|---|---|---|---|---|---|---|
| Albania | 413.1570 | 417.7500 | 408.5455 | 405.2588 | 434.6396 | 375.7592 | 427.2250 | 439.4430 | 414.9576 |
| Algeria | 359.6062 | 363.0725 | 356.4951 | 349.8593 | 366.2082 | 335.1854 | 375.7451 | 383.2209 | 369.0352 |
| Argentina | 409.0333 | 400.4431 | 418.3884 | 425.3031 | 432.9581 | 416.9666 | 432.2262 | 424.9944 | 440.1020 |
| Australia | 493.8962 | 490.9855 | 496.7613 | 502.9006 | 518.8658 | 487.1855 | 509.9939 | 508.9216 | 511.0493 |
| Austria | 496.7423 | 483.1330 | 510.0982 | 484.8656 | 495.0752 | 474.8460 | 495.0375 | 485.5268 | 504.3712 |
| Belgium | 506.9844 | 499.7390 | 514.0026 | 498.5242 | 506.6386 | 490.6642 | 501.9997 | 496.0319 | 507.7805 |
| Brazil | 377.0695 | 369.5493 | 385.0406 | 407.3486 | 418.5617 | 395.4633 | 400.6821 | 398.7000 | 402.7830 |
| Bulgaria | 441.1899 | 442.1631 | 440.3189 | 431.7175 | 456.5986 | 409.4498 | 445.7720 | 453.9011 | 438.4966 |
| Canada | 515.6474 | 511.1417 | 520.1661 | 526.6678 | 539.7624 | 513.5355 | 527.7047 | 527.1562 | 528.2548 |
| Chile | 422.6714 | 413.4490 | 431.7981 | 458.5709 | 464.5616 | 452.6422 | 446.9561 | 439.6174 | 454.2186 |
Ranking of Maths Score by Countries
ggplot(data=df,aes(x=reorder(Country.Name,Maths),y=Maths)) +
geom_bar(stat ='identity',aes(fill=Maths))+
coord_flip() +
theme_grey() +
scale_fill_gradient(name="Maths Score Level")+
labs(title = 'Ranking of Countries by Maths Score',
y='Score',x='Countries')+
geom_hline(yintercept = mean(df$Maths),size = 1, color = 'blue')+
theme_bw()
If we use the dataset above, we will not be able to draw a boxplot. This is because boxplot needs only 2 variables x and y but in the cleaned data that we have, there are so many variables. So we need to combine those into 2 variables. We name this as df2
df2 = df[,c(1,3,4,6,7,9,10)] %>% # select relevant columns
pivot_longer(c(2,3,4,5,6,7),names_to = 'Score')
df2 %>%
head(10) %>%
kable()%>%
kable_styling() %>%
scroll_box(width = "100%", height = "400px")
| Country.Name | Score | value |
|---|---|---|
| Albania | Maths.F | 417.7500 |
| Albania | Maths.M | 408.5455 |
| Albania | Reading.F | 434.6396 |
| Albania | Reading.M | 375.7592 |
| Albania | Science.F | 439.4430 |
| Albania | Science.M | 414.9576 |
| Algeria | Maths.F | 363.0725 |
| Algeria | Maths.M | 356.4951 |
| Algeria | Reading.F | 366.2082 |
| Algeria | Reading.M | 335.1854 |
Great! Now we can make boxplots
ggplot(data = df2, aes(x=Score,y=value, color=Score)) +
geom_boxplot()+
scale_color_brewer(palette="Dark2") +
geom_jitter(shape=16, position=position_jitter(0.2))+
labs(title = 'Did males perform better than females?',
y='Scores',x='Test Type')+
theme_bw()
geom_jitter() allows you to plot the data points on the plot.
You can play around with the code above to get various plots. For example, I can change from ‘color = Score’ to ‘fill=Score’:
ggplot(data = df2, aes(x=Score,y=value, fill=Score)) +
geom_boxplot()+
scale_fill_brewer(palette="Green") +
geom_jitter(shape=16, position=position_jitter(0.2))+
labs(title = 'Did males perform better than females?',
y='Scores',x='Test Type')+
theme_bw()
The plot looks a bit messy. A better visualisation would be to separate Subjects and Genders and plot them side by side.
How can we do that?
Since we want to separate Subjects and Genders from a column containing ‘Subject.Gender’ (e.g. Maths.F), we need to use strsplit () to do this job for us.
S = numeric(408) # create an empty vector
for (i in 1:length(df2$Score)) {
S[i] = strsplit(df2$Score[i],".",fixed = TRUE)
}
Now S is a list of 408 components, each of which has 2 sub-components ‘Subject’ and ‘Gender’. We need to transform S into a data frame with 1 column of Subject and 1 column of Gender. We will name this data frame as df3
df3 = S%>%unlist() %>% matrix(ncol = 2, byrow = TRUE)%>% as.data.frame()
df3 %>%
kable()%>%
kable_styling() %>%
scroll_box(width = "100%", height = "400px")
| V1 | V2 |
|---|---|
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
| Maths | F |
| Maths | M |
| Reading | F |
| Reading | M |
| Science | F |
| Science | M |
We now need to combine this df3 with df2 we created earlier, and name the result as df4
df4 = cbind(df2,df3)
colnames(df4) = c('Country','Score','Value','Test','Gender')
df4$Score = NULL # since the 'Score' column is redundant
df4%>%
kable()%>%
kable_styling() %>%
scroll_box(width = "100%", height = "400px")
| Country | Value | Test | Gender |
|---|---|---|---|
| Albania | 417.7500 | Maths | F |
| Albania | 408.5455 | Maths | M |
| Albania | 434.6396 | Reading | F |
| Albania | 375.7592 | Reading | M |
| Albania | 439.4430 | Science | F |
| Albania | 414.9576 | Science | M |
| Algeria | 363.0725 | Maths | F |
| Algeria | 356.4951 | Maths | M |
| Algeria | 366.2082 | Reading | F |
| Algeria | 335.1854 | Reading | M |
| Algeria | 383.2209 | Science | F |
| Algeria | 369.0352 | Science | M |
| Argentina | 400.4431 | Maths | F |
| Argentina | 418.3884 | Maths | M |
| Argentina | 432.9581 | Reading | F |
| Argentina | 416.9666 | Reading | M |
| Argentina | 424.9944 | Science | F |
| Argentina | 440.1020 | Science | M |
| Australia | 490.9855 | Maths | F |
| Australia | 496.7613 | Maths | M |
| Australia | 518.8658 | Reading | F |
| Australia | 487.1855 | Reading | M |
| Australia | 508.9216 | Science | F |
| Australia | 511.0493 | Science | M |
| Austria | 483.1330 | Maths | F |
| Austria | 510.0982 | Maths | M |
| Austria | 495.0752 | Reading | F |
| Austria | 474.8460 | Reading | M |
| Austria | 485.5268 | Science | F |
| Austria | 504.3712 | Science | M |
| Belgium | 499.7390 | Maths | F |
| Belgium | 514.0026 | Maths | M |
| Belgium | 506.6386 | Reading | F |
| Belgium | 490.6642 | Reading | M |
| Belgium | 496.0319 | Science | F |
| Belgium | 507.7805 | Science | M |
| Brazil | 369.5493 | Maths | F |
| Brazil | 385.0406 | Maths | M |
| Brazil | 418.5617 | Reading | F |
| Brazil | 395.4633 | Reading | M |
| Brazil | 398.7000 | Science | F |
| Brazil | 402.7830 | Science | M |
| Bulgaria | 442.1631 | Maths | F |
| Bulgaria | 440.3189 | Maths | M |
| Bulgaria | 456.5986 | Reading | F |
| Bulgaria | 409.4498 | Reading | M |
| Bulgaria | 453.9011 | Science | F |
| Bulgaria | 438.4966 | Science | M |
| Canada | 511.1417 | Maths | F |
| Canada | 520.1661 | Maths | M |
| Canada | 539.7624 | Reading | F |
| Canada | 513.5355 | Reading | M |
| Canada | 527.1562 | Science | F |
| Canada | 528.2548 | Science | M |
| Chile | 413.4490 | Maths | F |
| Chile | 431.7981 | Maths | M |
| Chile | 464.5616 | Reading | F |
| Chile | 452.6422 | Reading | M |
| Chile | 439.6174 | Science | F |
| Chile | 454.2186 | Science | M |
| Colombia | 384.4883 | Maths | F |
| Colombia | 395.3911 | Maths | M |
| Colombia | 432.2819 | Reading | F |
| Colombia | 416.6816 | Reading | M |
| Colombia | 411.0316 | Science | F |
| Colombia | 420.9651 | Science | M |
| Costa Rica | 392.3129 | Maths | F |
| Costa Rica | 408.4516 | Maths | M |
| Costa Rica | 434.8748 | Reading | F |
| Costa Rica | 419.8605 | Reading | M |
| Costa Rica | 410.8349 | Science | F |
| Costa Rica | 428.6660 | Science | M |
| Croatia | 457.9612 | Maths | F |
| Croatia | 470.5987 | Maths | M |
| Croatia | 499.5858 | Reading | F |
| Croatia | 473.1367 | Reading | M |
| Croatia | 472.5863 | Science | F |
| Croatia | 478.4173 | Science | M |
| Cyprus | 439.5341 | Maths | F |
| Cyprus | 434.7064 | Maths | M |
| Cyprus | 468.6583 | Reading | F |
| Cyprus | 416.8271 | Reading | M |
| Cyprus | 440.9482 | Science | F |
| Cyprus | 424.1478 | Science | M |
| Czech Republic | 488.6656 | Maths | F |
| Czech Republic | 495.7942 | Maths | M |
| Czech Republic | 500.6527 | Reading | F |
| Czech Republic | 474.5475 | Reading | M |
| Czech Republic | 488.3983 | Science | F |
| Czech Republic | 497.0304 | Science | M |
| Denmark | 506.3748 | Maths | F |
| Denmark | 515.7565 | Maths | M |
| Denmark | 510.9516 | Reading | F |
| Denmark | 488.7816 | Reading | M |
| Denmark | 498.9027 | Science | F |
| Denmark | 504.9427 | Science | M |
| Dominican Republic | 329.7459 | Maths | F |
| Dominican Republic | 325.5866 | Maths | M |
| Dominican Republic | 372.7806 | Reading | F |
| Dominican Republic | 342.1682 | Reading | M |
| Dominican Republic | 330.8290 | Science | F |
| Dominican Republic | 332.4770 | Science | M |
| Estonia | 516.8728 | Maths | F |
| Estonia | 522.0804 | Maths | M |
| Estonia | 533.3620 | Reading | F |
| Estonia | 505.4863 | Reading | M |
| Estonia | 532.5228 | Science | F |
| Estonia | 535.7986 | Science | M |
| Finland | 514.9650 | Maths | F |
| Finland | 507.4528 | Maths | M |
| Finland | 550.5112 | Reading | F |
| Finland | 503.9746 | Reading | M |
| Finland | 540.5118 | Science | F |
| Finland | 521.4797 | Science | M |
| France | 489.9540 | Maths | F |
| France | 495.9317 | Maths | M |
| France | 513.7640 | Reading | F |
| France | 484.6293 | Reading | M |
| France | 494.0342 | Science | F |
| France | 495.9353 | Science | M |
| Georgia | 410.5960 | Maths | F |
| Georgia | 397.7478 | Maths | M |
| Georgia | 431.8820 | Reading | F |
| Georgia | 373.7585 | Reading | M |
| Georgia | 419.6164 | Science | F |
| Georgia | 403.4965 | Science | M |
| Germany | 497.5311 | Maths | F |
| Germany | 514.1177 | Maths | M |
| Germany | 519.6741 | Reading | F |
| Germany | 498.9021 | Reading | M |
| Germany | 503.8121 | Science | F |
| Germany | 514.2837 | Science | M |
| Greece | 453.5732 | Maths | F |
| Greece | 453.6821 | Maths | M |
| Greece | 486.4600 | Reading | F |
| Greece | 449.1362 | Reading | M |
| Greece | 459.4177 | Science | F |
| Greece | 450.5984 | Science | M |
| Hong Kong SAR, China | 546.7682 | Maths | F |
| Hong Kong SAR, China | 549.0658 | Maths | M |
| Hong Kong SAR, China | 540.9844 | Reading | F |
| Hong Kong SAR, China | 512.7113 | Reading | M |
| Hong Kong SAR, China | 523.7491 | Science | F |
| Hong Kong SAR, China | 522.8172 | Science | M |
| Hungary | 472.7395 | Maths | F |
| Hungary | 480.9055 | Maths | M |
| Hungary | 481.9596 | Reading | F |
| Hungary | 457.1377 | Reading | M |
| Hungary | 475.2484 | Science | F |
| Hungary | 478.2405 | Science | M |
| Iceland | 488.5870 | Maths | F |
| Iceland | 487.4457 | Maths | M |
| Iceland | 501.7167 | Reading | F |
| Iceland | 460.1036 | Reading | M |
| Iceland | 474.6556 | Science | F |
| Iceland | 471.7177 | Science | M |
| Indonesia | 387.4450 | Maths | F |
| Indonesia | 384.7793 | Maths | M |
| Indonesia | 408.9994 | Reading | F |
| Indonesia | 385.5642 | Reading | M |
| Indonesia | 405.1289 | Science | F |
| Indonesia | 401.0783 | Science | M |
| Ireland | 495.4450 | Maths | F |
| Ireland | 511.5797 | Maths | M |
| Ireland | 526.9491 | Reading | F |
| Ireland | 514.9914 | Reading | M |
| Ireland | 497.1740 | Science | F |
| Ireland | 507.7026 | Science | M |
| Israel | 465.5169 | Maths | F |
| Israel | 473.9902 | Maths | M |
| Israel | 490.1650 | Reading | F |
| Israel | 467.3026 | Reading | M |
| Israel | 464.4477 | Science | F |
| Israel | 468.7432 | Science | M |
| Italy | 479.8237 | Maths | F |
| Italy | 499.7621 | Maths | M |
| Italy | 492.7091 | Reading | F |
| Italy | 476.7038 | Reading | M |
| Italy | 472.1190 | Science | F |
| Italy | 489.0838 | Science | M |
| Japan | 525.4960 | Maths | F |
| Japan | 539.2673 | Maths | M |
| Japan | 522.6553 | Reading | F |
| Japan | 509.3740 | Reading | M |
| Japan | 531.5329 | Science | F |
| Japan | 545.1415 | Science | M |
| Jordan | 387.3772 | Maths | F |
| Jordan | 373.0013 | Maths | M |
| Jordan | 443.5964 | Reading | F |
| Jordan | 371.9124 | Reading | M |
| Jordan | 427.9967 | Science | F |
| Jordan | 388.9627 | Science | M |
| Kazakhstan | 460.5772 | Maths | F |
| Kazakhstan | 459.1038 | Maths | M |
| Kazakhstan | 435.3836 | Reading | F |
| Kazakhstan | 419.4282 | Reading | M |
| Kazakhstan | 457.8602 | Science | F |
| Kazakhstan | 455.1954 | Science | M |
| Korea, Rep. | 527.7567 | Maths | F |
| Korea, Rep. | 520.7669 | Maths | M |
| Korea, Rep. | 538.6092 | Reading | F |
| Korea, Rep. | 498.0690 | Reading | M |
| Korea, Rep. | 520.8329 | Science | F |
| Korea, Rep. | 511.2151 | Science | M |
| Latvia | 483.2573 | Maths | F |
| Latvia | 481.3558 | Maths | M |
| Latvia | 508.8251 | Reading | F |
| Latvia | 466.7552 | Reading | M |
| Latvia | 495.5792 | Science | F |
| Latvia | 484.8872 | Science | M |
| Lebanon | 385.9892 | Maths | F |
| Lebanon | 408.0571 | Maths | M |
| Lebanon | 353.2773 | Reading | F |
| Lebanon | 338.8062 | Reading | M |
| Lebanon | 385.5465 | Science | F |
| Lebanon | 387.5658 | Science | M |
| Lithuania | 479.0178 | Maths | F |
| Lithuania | 477.7677 | Maths | M |
| Lithuania | 492.2423 | Reading | F |
| Lithuania | 453.1567 | Reading | M |
| Lithuania | 479.1618 | Science | F |
| Lithuania | 471.7669 | Science | M |
| Luxembourg | 480.1360 | Maths | F |
| Luxembourg | 491.4831 | Maths | M |
| Luxembourg | 492.0348 | Reading | F |
| Luxembourg | 470.6970 | Reading | M |
| Luxembourg | 479.0131 | Science | F |
| Luxembourg | 486.6521 | Science | M |
| Macao SAR, China | 547.9227 | Maths | F |
| Macao SAR, China | 539.7238 | Maths | M |
| Macao SAR, China | 524.5389 | Reading | F |
| Macao SAR, China | 492.9614 | Reading | M |
| Macao SAR, China | 532.4157 | Science | F |
| Macao SAR, China | 524.7126 | Science | M |
| Macedonia, FYR | 374.7397 | Maths | F |
| Macedonia, FYR | 368.2374 | Maths | M |
| Macedonia, FYR | 375.8380 | Reading | F |
| Macedonia, FYR | 330.1350 | Reading | M |
| Macedonia, FYR | 394.4027 | Science | F |
| Macedonia, FYR | 374.0698 | Science | M |
| Malaysia | 449.1973 | Maths | F |
| Malaysia | 442.6725 | Maths | M |
| Malaysia | 445.3567 | Reading | F |
| Malaysia | 414.1259 | Reading | M |
| Malaysia | 444.7308 | Science | F |
| Malaysia | 440.9622 | Science | M |
| Malta | 480.6731 | Maths | F |
| Malta | 476.6767 | Maths | M |
| Malta | 468.0581 | Reading | F |
| Malta | 425.9101 | Reading | M |
| Malta | 470.2137 | Science | F |
| Malta | 459.5117 | Science | M |
| Mexico | 404.3349 | Maths | F |
| Mexico | 411.6105 | Maths | M |
| Mexico | 431.2322 | Reading | F |
| Mexico | 415.5398 | Reading | M |
| Mexico | 411.5536 | Science | F |
| Mexico | 419.7517 | Science | M |
| Moldova | 420.6077 | Maths | F |
| Moldova | 418.7273 | Maths | M |
| Moldova | 442.1850 | Reading | F |
| Moldova | 390.4922 | Reading | M |
| Moldova | 431.4186 | Science | F |
| Moldova | 424.6058 | Science | M |
| Montenegro | 417.7265 | Maths | F |
| Montenegro | 418.1326 | Maths | M |
| Montenegro | 444.1791 | Reading | F |
| Montenegro | 410.3504 | Reading | M |
| Montenegro | 413.8134 | Science | F |
| Montenegro | 408.9238 | Science | M |
| Netherlands | 511.0159 | Maths | F |
| Netherlands | 513.4978 | Maths | M |
| Netherlands | 514.7035 | Reading | F |
| Netherlands | 491.1367 | Reading | M |
| Netherlands | 506.5181 | Science | F |
| Netherlands | 510.6452 | Science | M |
| New Zealand | 490.9219 | Maths | F |
| New Zealand | 499.4705 | Maths | M |
| New Zealand | 525.5199 | Reading | F |
| New Zealand | 493.2258 | Reading | M |
| New Zealand | 510.7250 | Science | F |
| New Zealand | 515.8496 | Science | M |
| Norway | 502.8737 | Maths | F |
| Norway | 500.6149 | Maths | M |
| Norway | 533.3560 | Reading | F |
| Norway | 493.5367 | Reading | M |
| Norway | 497.0014 | Science | F |
| Norway | 499.9234 | Science | M |
| Peru | 381.7972 | Maths | F |
| Peru | 391.2786 | Maths | M |
| Peru | 401.4247 | Reading | F |
| Peru | 393.6952 | Reading | M |
| Peru | 391.6810 | Science | F |
| Peru | 401.6386 | Science | M |
| Poland | 498.6467 | Maths | F |
| Poland | 510.0877 | Maths | M |
| Poland | 520.6746 | Reading | F |
| Poland | 491.2446 | Reading | M |
| Poland | 498.3032 | Science | F |
| Poland | 504.4576 | Science | M |
| Portugal | 486.5997 | Maths | F |
| Portugal | 496.5606 | Maths | M |
| Portugal | 506.5518 | Reading | F |
| Portugal | 489.8629 | Reading | M |
| Portugal | 496.0552 | Science | F |
| Portugal | 506.0510 | Science | M |
| Qatar | 408.3214 | Maths | F |
| Qatar | 396.7547 | Maths | M |
| Qatar | 428.9546 | Reading | F |
| Qatar | 376.0766 | Reading | M |
| Qatar | 429.4479 | Science | F |
| Qatar | 406.3239 | Science | M |
| Russian Federation | 491.1219 | Maths | F |
| Russian Federation | 497.0891 | Maths | M |
| Russian Federation | 507.4833 | Reading | F |
| Russian Federation | 481.3743 | Reading | M |
| Russian Federation | 484.5143 | Science | F |
| Russian Federation | 488.8133 | Science | M |
| Singapore | 564.2545 | Maths | F |
| Singapore | 564.1290 | Maths | M |
| Singapore | 545.5585 | Reading | F |
| Singapore | 525.3153 | Reading | M |
| Singapore | 552.2730 | Science | F |
| Singapore | 558.6637 | Science | M |
| Slovak Republic | 472.2708 | Maths | F |
| Slovak Republic | 478.0148 | Maths | M |
| Slovak Republic | 470.8627 | Reading | F |
| Slovak Republic | 435.2488 | Reading | M |
| Slovak Republic | 461.2153 | Science | F |
| Slovak Republic | 460.3604 | Science | M |
| Slovenia | 507.9492 | Maths | F |
| Slovenia | 511.7637 | Maths | M |
| Slovenia | 527.5539 | Reading | F |
| Slovenia | 484.3099 | Reading | M |
| Slovenia | 515.7736 | Science | F |
| Slovenia | 510.1401 | Science | M |
| Spain | 477.8593 | Maths | F |
| Spain | 493.8485 | Maths | M |
| Spain | 505.6847 | Reading | F |
| Spain | 485.4411 | Reading | M |
| Spain | 489.4560 | Science | F |
| Spain | 496.1251 | Science | M |
| Sweden | 495.0526 | Maths | F |
| Sweden | 492.8041 | Maths | M |
| Sweden | 519.9498 | Reading | F |
| Sweden | 480.7183 | Reading | M |
| Sweden | 495.7183 | Science | F |
| Sweden | 491.1678 | Science | M |
| Switzerland | 514.9832 | Maths | F |
| Switzerland | 527.0155 | Maths | M |
| Switzerland | 505.3732 | Reading | F |
| Switzerland | 480.0795 | Reading | M |
| Switzerland | 502.3059 | Science | F |
| Switzerland | 508.4492 | Science | M |
| Thailand | 416.7297 | Maths | F |
| Thailand | 413.8092 | Maths | M |
| Thailand | 422.5533 | Reading | F |
| Thailand | 391.5848 | Reading | M |
| Thailand | 425.2518 | Science | F |
| Thailand | 416.2207 | Science | M |
| Trinidad and Tobago | 425.9341 | Maths | F |
| Trinidad and Tobago | 408.3298 | Maths | M |
| Trinidad and Tobago | 452.4412 | Reading | F |
| Trinidad and Tobago | 401.4599 | Reading | M |
| Trinidad and Tobago | 434.7075 | Science | F |
| Trinidad and Tobago | 414.2140 | Science | M |
| Tunisia | 363.9996 | Maths | F |
| Tunisia | 370.0814 | Maths | M |
| Tunisia | 372.6726 | Reading | F |
| Tunisia | 347.6038 | Reading | M |
| Tunisia | 384.6152 | Science | F |
| Tunisia | 388.4740 | Science | M |
| Turkey | 417.5270 | Maths | F |
| Turkey | 423.3777 | Maths | M |
| Turkey | 442.2461 | Reading | F |
| Turkey | 414.4396 | Reading | M |
| Turkey | 428.6548 | Science | F |
| Turkey | 422.3278 | Science | M |
| United Arab Emirates | 430.7020 | Maths | F |
| United Arab Emirates | 424.1734 | Maths | M |
| United Arab Emirates | 458.2870 | Reading | F |
| United Arab Emirates | 408.1053 | Reading | M |
| United Arab Emirates | 449.3386 | Science | F |
| United Arab Emirates | 423.7710 | Science | M |
| United Kingdom | 486.5746 | Maths | F |
| United Kingdom | 498.2005 | Maths | M |
| United Kingdom | 509.0904 | Reading | F |
| United Kingdom | 487.1961 | Reading | M |
| United Kingdom | 508.8409 | Science | F |
| United Kingdom | 509.5904 | Science | M |
| United States | 465.3685 | Maths | F |
| United States | 473.8857 | Maths | M |
| United States | 506.9752 | Reading | F |
| United States | 486.9014 | Reading | M |
| United States | 492.8651 | Science | F |
| United States | 499.6176 | Science | M |
| Uruguay | 411.5024 | Maths | F |
| Uruguay | 425.1379 | Maths | M |
| Uruguay | 447.7196 | Reading | F |
| Uruguay | 424.2969 | Reading | M |
| Uruguay | 431.3077 | Science | F |
| Uruguay | 439.8285 | Science | M |
| Vietnam | 496.0059 | Maths | F |
| Vietnam | 492.9591 | Maths | M |
| Vietnam | 499.0386 | Reading | F |
| Vietnam | 473.9191 | Reading | M |
| Vietnam | 525.9139 | Science | F |
| Vietnam | 523.3141 | Science | M |
Awesome! Now the data looks clean and neat. Let’s create multiple plots with the use of facet_wrap() function in ggplot2
ggplot(data = df4, aes(x=Test,y=Value, fill=Test)) +
geom_boxplot(notch = T)+
scale_fill_brewer(palette="Green") +
geom_jitter(shape=16, position=position_jitter(0.2))+
labs(title = 'Did males perform better than females?',
y='Scores',x='Test')+
facet_wrap(~Gender,nrow = 1)+
theme_bw()
Here, we categorized the plot by Gender, hence facet_wrap(~Gender,nrow = 1) We can also categorize the plot by Test by changing facet_wrap(~Test,nrow = 1)
ggplot(data = df4, aes(x=Gender,y=Value, fill=Gender)) +
geom_boxplot(notch=T)+
scale_fill_brewer(palette="Green") +
geom_jitter(shape=16, position=position_jitter(0.2))+
labs(title = 'Did males perform better than females?',
y='Scores',x='')+
facet_wrap(~Test,nrow = 1)+
theme_bw()
By looking at these plots, we can make some insights about the performance of males and females. Generally, males performed better in Science and Maths, but females performed better in Reading. However, it would be naive for us to make a conclusion only after looking at the boxplot. Let’s dive deeper into the data and see any other insights we can get after manipulating the dataset.
Since I want to compare the performance of Males and Females in each subject across all participating countries, I would need to calculate the % difference in terms of the score for each subject between males and females and then plot it out to visualize.
How would I do this in R? Use mutate() function
Let’s look at the original clean data set that we have, df
We will now use the mutate() function to calculate, for each country, the % difference between Males and Females for each subject.
df = df %>% mutate(Maths.Diff = ((Maths.M - Maths.F)/Maths.F)*100,
Reading.Diff = ((Reading.M - Reading.F)/Reading.F)*100,
Science.Diff = ((Science.M - Science.F)/Science.F)*100,
Total.Score = Maths + Reading + Science,
Avg.Diff = (Maths.Diff+Reading.Diff+Science.Diff)/3
)
df%>%
kable()%>%
kable_styling() %>%
scroll_box(width = "100%", height = "400px")
| Country.Name | Maths | Maths.F | Maths.M | Reading | Reading.F | Reading.M | Science | Science.F | Science.M | Maths.Diff | Reading.Diff | Science.Diff | Total.Score | Avg.Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | 413.1570 | 417.7500 | 408.5455 | 405.2588 | 434.6396 | 375.7592 | 427.2250 | 439.4430 | 414.9576 | -2.2033681 | -13.546953 | -5.5718993 | 1245.641 | -7.1074068 |
| Algeria | 359.6062 | 363.0725 | 356.4951 | 349.8593 | 366.2082 | 335.1854 | 375.7451 | 383.2209 | 369.0352 | -1.8115869 | -8.471338 | -3.7017041 | 1085.211 | -4.6615429 |
| Argentina | 409.0333 | 400.4431 | 418.3884 | 425.3031 | 432.9581 | 416.9666 | 432.2262 | 424.9944 | 440.1020 | 4.4813468 | -3.693538 | 3.5547951 | 1266.563 | 1.4475345 |
| Australia | 493.8962 | 490.9855 | 496.7613 | 502.9006 | 518.8658 | 487.1855 | 509.9939 | 508.9216 | 511.0493 | 1.1763777 | -6.105678 | 0.4180624 | 1506.791 | -1.5037459 |
| Austria | 496.7423 | 483.1330 | 510.0982 | 484.8656 | 495.0752 | 474.8460 | 495.0375 | 485.5268 | 504.3712 | 5.5813179 | -4.086078 | 3.8812368 | 1476.645 | 1.7921588 |
| Belgium | 506.9844 | 499.7390 | 514.0026 | 498.5242 | 506.6386 | 490.6642 | 501.9997 | 496.0319 | 507.7805 | 2.8541997 | -3.153025 | 2.3685191 | 1507.508 | 0.6898979 |
| Brazil | 377.0695 | 369.5493 | 385.0406 | 407.3486 | 418.5617 | 395.4633 | 400.6821 | 398.7000 | 402.7830 | 4.1919394 | -5.518524 | 1.0240879 | 1185.100 | -0.1008323 |
| Bulgaria | 441.1899 | 442.1631 | 440.3189 | 431.7175 | 456.5986 | 409.4498 | 445.7720 | 453.9011 | 438.4966 | -0.4170813 | -10.326091 | -3.3937938 | 1318.679 | -4.7123219 |
| Canada | 515.6474 | 511.1417 | 520.1661 | 526.6678 | 539.7624 | 513.5355 | 527.7047 | 527.1562 | 528.2548 | 1.7655377 | -4.858976 | 0.2083976 | 1570.020 | -0.9616804 |
| Chile | 422.6714 | 413.4490 | 431.7981 | 458.5709 | 464.5616 | 452.6422 | 446.9561 | 439.6174 | 454.2186 | 4.4380579 | -2.565731 | 3.3213455 | 1328.198 | 1.7312243 |
| Colombia | 389.6438 | 384.4883 | 395.3911 | 424.9052 | 432.2819 | 416.6816 | 415.7288 | 411.0316 | 420.9651 | 2.8356606 | -3.608826 | 2.4167274 | 1230.278 | 0.5478542 |
| Costa Rica | 400.2534 | 392.3129 | 408.4516 | 427.4875 | 434.8748 | 419.8605 | 419.6080 | 410.8349 | 428.6660 | 4.1137450 | -3.452567 | 4.3402113 | 1247.349 | 1.6671297 |
| Croatia | 464.0401 | 457.9612 | 470.5987 | 486.8632 | 499.5858 | 473.1367 | 475.3912 | 472.5863 | 478.4173 | 2.7595081 | -5.294206 | 1.2338441 | 1426.294 | -0.4336181 |
| Cyprus | 437.1443 | 439.5341 | 434.7064 | 442.8443 | 468.6583 | 416.8271 | 432.5964 | 440.9482 | 424.1478 | -1.0983500 | -11.059504 | -3.8100543 | 1312.585 | -5.3226359 |
| Czech Republic | 492.3254 | 488.6656 | 495.7942 | 487.2501 | 500.6527 | 474.5475 | 492.8300 | 488.3983 | 497.0304 | 1.4587808 | -5.214233 | 1.7674354 | 1472.406 | -0.6626722 |
| Denmark | 511.0876 | 506.3748 | 515.7565 | 499.8146 | 510.9516 | 488.7816 | 501.9369 | 498.9027 | 504.9427 | 1.8527224 | -4.338958 | 1.2106588 | 1512.839 | -0.4251921 |
| Dominican Republic | 327.7020 | 329.7459 | 325.5866 | 357.7377 | 372.7806 | 342.1682 | 331.6388 | 330.8290 | 332.4770 | -1.2613862 | -8.211916 | 0.4981496 | 1017.078 | -2.9917175 |
| Estonia | 519.5291 | 516.8728 | 522.0804 | 519.1429 | 533.3620 | 505.4863 | 534.1937 | 532.5228 | 535.7986 | 1.0075324 | -5.226405 | 0.6151492 | 1572.866 | -1.2012412 |
| Finland | 511.0769 | 514.9650 | 507.4528 | 526.4247 | 550.5112 | 503.9746 | 530.6612 | 540.5118 | 521.4797 | -1.4587702 | -8.453340 | -3.5211343 | 1568.163 | -4.4777482 |
| France | 492.9204 | 489.9540 | 495.9317 | 499.3061 | 513.7640 | 484.6293 | 494.9776 | 494.0342 | 495.9353 | 1.2200420 | -5.670840 | 0.3848057 | 1487.204 | -1.3553307 |
| Georgia | 403.8332 | 410.5960 | 397.7478 | 401.2881 | 431.8820 | 373.7585 | 411.1315 | 419.6164 | 403.4965 | -3.1291643 | -13.458177 | -3.8415909 | 1216.253 | -6.8096440 |
| Germany | 505.9713 | 497.5311 | 514.1177 | 509.1041 | 519.6741 | 498.9021 | 509.1406 | 503.8121 | 514.2837 | 3.3337763 | -3.997113 | 2.0784633 | 1524.216 | 0.4717088 |
| Greece | 453.6299 | 453.5732 | 453.6821 | 467.0395 | 486.4600 | 449.1362 | 454.8288 | 459.4177 | 450.5984 | 0.0239946 | -7.672529 | -1.9196790 | 1375.498 | -3.1894043 |
| Hong Kong SAR, China | 547.9310 | 546.7682 | 549.0658 | 526.6753 | 540.9844 | 512.7113 | 523.2774 | 523.7491 | 522.8172 | 0.4202205 | -5.226229 | -0.1779279 | 1597.884 | -1.6613123 |
| Hungary | 476.8309 | 472.7395 | 480.9055 | 469.5233 | 481.9596 | 457.1377 | 476.7475 | 475.2484 | 478.2405 | 1.7273862 | -5.150190 | 0.6295895 | 1423.102 | -0.9310714 |
| Iceland | 488.0332 | 488.5870 | 487.4457 | 481.5255 | 501.7167 | 460.1036 | 473.2301 | 474.6556 | 471.7177 | -0.2335942 | -8.294143 | -0.6189646 | 1442.789 | -3.0489006 |
| Indonesia | 386.1096 | 387.4450 | 384.7793 | 397.2595 | 408.9994 | 385.5642 | 403.0997 | 405.1289 | 401.0783 | -0.6880194 | -5.729900 | -0.9998441 | 1186.469 | -2.4725879 |
| Ireland | 503.7220 | 495.4450 | 511.5797 | 520.8148 | 526.9491 | 514.9914 | 502.5751 | 497.1740 | 507.7026 | 3.2566054 | -2.269230 | 2.1177000 | 1527.112 | 1.0350252 |
| Israel | 469.6695 | 465.5169 | 473.9902 | 478.9606 | 490.1650 | 467.3026 | 466.5528 | 464.4477 | 468.7432 | 1.8202116 | -4.664226 | 0.9248704 | 1415.183 | -0.6397146 |
| Italy | 489.7287 | 479.8237 | 499.7621 | 484.7580 | 492.7091 | 476.7038 | 480.5468 | 472.1190 | 489.0838 | 4.1553529 | -3.248432 | 3.5933370 | 1455.034 | 1.5000860 |
| Japan | 532.4399 | 525.4960 | 539.2673 | 515.9585 | 522.6553 | 509.3740 | 538.3948 | 531.5329 | 545.1415 | 2.6206217 | -2.541112 | 2.5602532 | 1586.793 | 0.8799211 |
| Jordan | 380.2590 | 387.3772 | 373.0013 | 408.1022 | 443.5964 | 371.9124 | 408.6691 | 427.9967 | 388.9627 | -3.7110814 | -16.159739 | -9.1201595 | 1197.030 | -9.6636599 |
| Kazakhstan | 459.8160 | 460.5772 | 459.1038 | 427.1410 | 435.3836 | 419.4282 | 456.4836 | 457.8602 | 455.1954 | -0.3199020 | -3.664675 | -0.5819995 | 1343.441 | -1.5221923 |
| Korea, Rep. | 524.1062 | 527.7567 | 520.7669 | 517.4367 | 538.6092 | 498.0690 | 515.8099 | 520.8329 | 511.2151 | -1.3244225 | -7.526835 | -1.8466308 | 1557.353 | -3.5659629 |
| Latvia | 482.3051 | 483.2573 | 481.3558 | 487.7581 | 508.8251 | 466.7552 | 490.2250 | 495.5792 | 484.8872 | -0.3934704 | -8.268039 | -2.1574710 | 1460.288 | -3.6063269 |
| Lebanon | 396.2497 | 385.9892 | 408.0571 | 346.5490 | 353.2773 | 338.8062 | 386.4854 | 385.5465 | 387.5658 | 5.7172328 | -4.096245 | 0.5237528 | 1129.284 | 0.7149137 |
| Lithuania | 478.3834 | 479.0178 | 477.7677 | 472.4066 | 492.2423 | 453.1567 | 475.4089 | 479.1618 | 471.7669 | -0.2609610 | -7.940328 | -1.5432985 | 1426.199 | -3.2481958 |
| Luxembourg | 485.7706 | 480.1360 | 491.4831 | 481.4391 | 492.0348 | 470.6970 | 482.8064 | 479.0131 | 486.6521 | 2.3633018 | -4.336643 | 1.5947328 | 1450.016 | -0.1262028 |
| Macao SAR, China | 543.8078 | 547.9227 | 539.7238 | 508.6905 | 524.5389 | 492.9614 | 528.5496 | 532.4157 | 524.7126 | -1.4963500 | -6.020060 | -1.4468080 | 1581.048 | -2.9877394 |
| Macedonia, FYR | 371.3114 | 374.7397 | 368.2374 | 351.7415 | 375.8380 | 330.1350 | 383.6824 | 394.4027 | 374.0698 | -1.7351727 | -12.160277 | -5.1553820 | 1106.735 | -6.3502772 |
| Malaysia | 446.1098 | 449.1973 | 442.6725 | 430.5782 | 445.3567 | 414.1259 | 442.9475 | 444.7308 | 440.9622 | -1.4525510 | -7.012545 | -0.8473885 | 1319.636 | -3.1041614 |
| Malta | 478.6448 | 480.6731 | 476.6767 | 446.6661 | 468.0581 | 425.9101 | 464.7819 | 470.2137 | 459.5117 | -0.8314104 | -9.004874 | -2.2759957 | 1390.093 | -4.0374267 |
| Mexico | 408.0235 | 404.3349 | 411.6105 | 423.2765 | 431.2322 | 415.5398 | 415.7099 | 411.5536 | 419.7517 | 1.7994103 | -3.638965 | 1.9920016 | 1247.010 | 0.0508156 |
| Moldova | 419.6635 | 420.6077 | 418.7273 | 416.2293 | 442.1850 | 390.4922 | 427.9978 | 431.4186 | 424.6058 | -0.4470866 | -11.690318 | -1.5791547 | 1263.891 | -4.5721865 |
| Montenegro | 417.9341 | 417.7265 | 418.1326 | 426.8845 | 444.1791 | 410.3504 | 411.3136 | 413.8134 | 408.9238 | 0.0972056 | -7.616006 | -1.1815980 | 1256.132 | -2.9001329 |
| Netherlands | 512.2528 | 511.0159 | 513.4978 | 502.9591 | 514.7035 | 491.1367 | 508.5748 | 506.5181 | 510.6452 | 0.4856807 | -4.578703 | 0.8147938 | 1523.787 | -1.0927430 |
| New Zealand | 495.2233 | 490.9219 | 499.4705 | 509.2707 | 525.5199 | 493.2258 | 513.3035 | 510.7250 | 515.8496 | 1.7413445 | -6.145166 | 1.0033901 | 1517.797 | -1.1334770 |
| Norway | 501.7298 | 502.8737 | 500.6149 | 513.1912 | 533.3560 | 493.5367 | 498.4811 | 497.0014 | 499.9234 | -0.4491643 | -7.465795 | 0.5879270 | 1513.402 | -2.4423439 |
| Peru | 386.5606 | 381.7972 | 391.2786 | 397.5414 | 401.4247 | 393.6952 | 396.6836 | 391.6810 | 401.6386 | 2.4833696 | -1.925508 | 2.5422909 | 1180.786 | 1.0333841 |
| Poland | 504.4693 | 498.6467 | 510.0877 | 505.6971 | 520.6746 | 491.2446 | 501.4353 | 498.3032 | 504.4576 | 2.2944181 | -5.652282 | 1.2350661 | 1511.602 | -0.7075992 |
| Portugal | 491.6270 | 486.5997 | 496.5606 | 498.1289 | 506.5518 | 489.8629 | 501.1001 | 496.0552 | 506.0510 | 2.0470288 | -3.294603 | 2.0150443 | 1490.856 | 0.2558233 |
| Qatar | 402.4007 | 408.3214 | 396.7547 | 401.8874 | 428.9546 | 376.0766 | 417.6112 | 429.4479 | 406.3239 | -2.8327436 | -12.327182 | -5.3845780 | 1221.899 | -6.8481678 |
| Russian Federation | 494.0600 | 491.1219 | 497.0891 | 494.6278 | 507.4833 | 481.3743 | 486.6310 | 484.5143 | 488.8133 | 1.2150130 | -5.144791 | 0.8872903 | 1475.319 | -1.0141627 |
| Singapore | 564.1897 | 564.2545 | 564.1290 | 535.1002 | 545.5585 | 525.3153 | 555.5747 | 552.2730 | 558.6637 | -0.0222471 | -3.710549 | 1.1571638 | 1654.865 | -0.8585440 |
| Slovak Republic | 475.2301 | 472.2708 | 478.0148 | 452.5143 | 470.8627 | 435.2488 | 460.7749 | 461.2153 | 460.3604 | 1.2162436 | -7.563542 | -0.1853410 | 1388.519 | -2.1775463 |
| Slovenia | 509.9196 | 507.9492 | 511.7637 | 505.2159 | 527.5539 | 484.3099 | 512.8636 | 515.7736 | 510.1401 | 0.7509620 | -8.197081 | -1.0922348 | 1527.999 | -2.8461179 |
| Spain | 485.8432 | 477.8593 | 493.8485 | 495.5764 | 505.6847 | 485.4411 | 492.7861 | 489.4560 | 496.1251 | 3.3460152 | -4.003201 | 1.3625490 | 1474.206 | 0.2351211 |
| Sweden | 493.9181 | 495.0526 | 492.8041 | 500.1556 | 519.9498 | 480.7183 | 493.4224 | 495.7183 | 491.1678 | -0.4541794 | -7.545254 | -0.9179621 | 1487.496 | -2.9724653 |
| Switzerland | 521.2506 | 514.9832 | 527.0155 | 492.1982 | 505.3732 | 480.0795 | 505.5058 | 502.3059 | 508.4492 | 2.3364591 | -5.004942 | 1.2230060 | 1518.955 | -0.4818258 |
| Thailand | 415.4638 | 416.7297 | 413.8092 | 409.1301 | 422.5533 | 391.5848 | 421.3373 | 425.2518 | 416.2207 | -0.7008067 | -7.328893 | -2.1236973 | 1245.931 | -3.3844656 |
| Trinidad and Tobago | 417.2434 | 425.9341 | 408.3298 | 427.2733 | 452.4412 | 401.4599 | 424.5905 | 434.7075 | 414.2140 | -4.1331188 | -11.268035 | -4.7143335 | 1269.107 | -6.7051623 |
| Tunisia | 366.8180 | 363.9996 | 370.0814 | 361.0555 | 372.6726 | 347.6038 | 386.4034 | 384.6152 | 388.4740 | 1.6708275 | -6.726742 | 1.0032943 | 1114.277 | -1.3508734 |
| Turkey | 420.4540 | 417.5270 | 423.3777 | 428.3351 | 442.2461 | 414.4396 | 425.4895 | 428.6548 | 422.3278 | 1.4012703 | -6.287562 | -1.4760145 | 1274.279 | -2.1207688 |
| United Arab Emirates | 427.4827 | 430.7020 | 424.1734 | 433.5423 | 458.2870 | 408.1053 | 436.7311 | 449.3386 | 423.7710 | -1.5158159 | -10.949842 | -5.6900329 | 1297.756 | -6.0518968 |
| United Kingdom | 492.4785 | 486.5746 | 498.2005 | 497.9719 | 509.0904 | 487.1961 | 509.2215 | 508.8409 | 509.5904 | 2.3893446 | -4.300667 | 0.1472813 | 1499.672 | -0.5880138 |
| United States | 469.6285 | 465.3685 | 473.8857 | 496.9351 | 506.9752 | 486.9014 | 496.2424 | 492.8651 | 499.6176 | 1.8302065 | -3.959520 | 1.3700639 | 1462.806 | -0.2530833 |
| Uruguay | 417.9919 | 411.5024 | 425.1379 | 436.5721 | 447.7196 | 424.2969 | 435.3630 | 431.3077 | 439.8285 | 3.3136089 | -5.231559 | 1.9755851 | 1289.927 | 0.0192117 |
| Vietnam | 494.5183 | 496.0059 | 492.9591 | 486.7738 | 499.0386 | 473.9191 | 524.6445 | 525.9139 | 523.3141 | -0.6142687 | -5.033571 | -0.4943479 | 1505.937 | -2.0473957 |
Now let’s plot this out to visualize the data better MATHS SCORE
ggplot(data=df, aes(x=reorder(Country.Name, Maths.Diff), y=Maths.Diff)) +
geom_bar(stat = "identity", aes(fill=Maths.Diff)) +
coord_flip() +
theme_light() +
geom_hline(yintercept = mean(df$Maths.Diff), size=1, color="black") +
scale_fill_gradient(name="% Difference Level") +
labs(title="Are Males better at math?", x="", y="% difference from female")+
theme_bw()
This plot represents the % difference in scores using Females as reference. A positive difference means males scored higher, while a negative difference means males scored lower. The black line represents the mean difference across all countries.
Some interesting insights one can draw from here:
In general, Males performed better than Females in Maths, especially in most of the DCs, males scored generally higher than females
Interestingly, in Singapore and Hongkong, females and males performed equally well, with the difference in score is around 0 in each of these countries. This is good insight maybe for policy-makers because we do not want a huge gap between males and females performance in education.
We can do the same thing for Reading and Science score.
df <- df[,c(1,3,4,6,7,9,10)] #select relevant columns
To create correlation plot, simply use cor():
res <- cor(df[,-1]) # -1 here means we look at all columns except the first column
res%>%
kable()%>%
kable_styling() %>%
scroll_box(width = "100%", height = "400px")
| Maths.F | Maths.M | Reading.F | Reading.M | Science.F | Science.M | |
|---|---|---|---|---|---|---|
| Maths.F | 1.0000000 | 0.9845874 | 0.9377498 | 0.9177645 | 0.9711420 | 0.9547097 |
| Maths.M | 0.9845874 | 1.0000000 | 0.9312678 | 0.9468078 | 0.9576479 | 0.9758210 |
| Reading.F | 0.9377498 | 0.9312678 | 1.0000000 | 0.9663211 | 0.9555957 | 0.9440488 |
| Reading.M | 0.9177645 | 0.9468078 | 0.9663211 | 1.0000000 | 0.9283812 | 0.9692920 |
| Science.F | 0.9711420 | 0.9576479 | 0.9555957 | 0.9283812 | 1.0000000 | 0.9736539 |
| Science.M | 0.9547097 | 0.9758210 | 0.9440488 | 0.9692920 | 0.9736539 | 1.0000000 |
We can calculate p-value to see whether the correlation is significant
library("Hmisc")
res2 <- rcorr(as.matrix(df[,-1]))
res2
## Maths.F Maths.M Reading.F Reading.M Science.F Science.M
## Maths.F 1.00 0.98 0.94 0.92 0.97 0.95
## Maths.M 0.98 1.00 0.93 0.95 0.96 0.98
## Reading.F 0.94 0.93 1.00 0.97 0.96 0.94
## Reading.M 0.92 0.95 0.97 1.00 0.93 0.97
## Science.F 0.97 0.96 0.96 0.93 1.00 0.97
## Science.M 0.95 0.98 0.94 0.97 0.97 1.00
##
## n= 68
##
##
## P
## Maths.F Maths.M Reading.F Reading.M Science.F Science.M
## Maths.F 0 0 0 0 0
## Maths.M 0 0 0 0 0
## Reading.F 0 0 0 0 0
## Reading.M 0 0 0 0 0
## Science.F 0 0 0 0 0
## Science.M 0 0 0 0 0
The smaller the p-value, the more significant the correlation.
The purpose of this section about correlation plot is to introduce to you how to calculate correlation between variables in R. For this dataset, it is obvious that all the variables are correlated
To visualize
#install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
The stronger the color and the bigger the size, the higher the correlation. The result is similar to the one we got earlier: All the variables are intercorrelated.
That is it! Hope you guys enjoyed and picked up something from this article. While this guide is not exhaustive, it will more or less give you some ideas of how to do some basic EDA with R.