Exploratory Data Analysis in R for beginners (Part 1)

A Step-by-step Approach From Data Importing to Cleaning and Visualization

Reproduced example from Joe Tran

Exploratory Data Analysis (EDA) is the process of analyzing and visualizing the data to get a better understanding of the data and glean insight from it. There are various steps involved when doing EDA but the following are the common steps that a data analyst can take when performing EDA:

  1. Import the data
  2. Clean the data
  3. Process the data
  4. Visualize the data

What would you expect to find in this article?

This article focuses on EDA of a dataset, which means that it would involve all the steps mentioned above. Therefore, this article will walk you through all the steps required and the tools used in each step. So you would expect to find the followings in this article:

  1. Tidyverse package for tidying up the data set
  2. ggplot2 package for visualizations
  3. Some other basic functions to manipulate data like strsplit (), cbind (), matrix () and so on.

For beginners to EDA, if you do not have a lot of time and do not know where to start, I would recommend you to start with Tidyverse and ggplot2. You can do almost everything about EDA with these 2 packages. There are various resources online like DataCamp, Setscholars, and books like Introduction to Data Science and so on. In this article, I would walk you through the process of EDA through the analysis of the PISA score dataset which is available here.

Let’s get started, folks!!

Importing the data

Before importing the data into R for analysis, let’s look at how the data looks like Alt text

When importing this data into R, we want the last column to be ‘numeric’ and the rest to be ‘factor’. With this in mind, let’s look at the following 3 scenarios:

file<-"https://raw.githubusercontent.com/prasertcbs/basic-dataset/master/Pisa%20mean%20perfromance%20scores%202013%20-%202015%20Data.csv"
df.raw <- read.csv(file , fileEncoding="UTF-8-BOM", na.strings = '..')
df.raw1 <- read.csv(file)
df.raw2 <- read.csv(file, na.strings = '..')

These are 3 ways of importing the data into R. Usually, one with go for the df.raw1 because it seems to be the most convenient way of importing the data. Let’s see the structure of the imported data:

df.raw1 <- read.csv(file)
str(df.raw1)
## 'data.frame':    1161 obs. of  7 variables:
##  $ ï..Country.Name: chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ Country.Code   : chr  "ALB" "ALB" "ALB" "ALB" ...
##  $ Series.Name    : chr  "PISA: Mean performance on the mathematics scale" "PISA: Mean performance on the mathematics scale. Female" "PISA: Mean performance on the mathematics scale. Male" "PISA: Mean performance on the reading scale" ...
##  $ Series.Code    : chr  "LO.PISA.MAT" "LO.PISA.MAT.FE" "LO.PISA.MAT.MA" "LO.PISA.REA" ...
##  $ X2013..YR2013. : chr  ".." ".." ".." ".." ...
##  $ X2014..YR2014. : chr  ".." ".." ".." ".." ...
##  $ X2015..YR2015. : chr  "413.157" "417.750029482799" "408.545458736189" "405.2588" ...

There are 2 problems that we can spot immediately. The last column is ‘factor’ and not ‘numeric’ like what we desire. Secondly, the first column ‘Country name’ is encoded differently from the raw dataset.

Now let’s try the second case scenario

df.raw2 <- read.csv(file, na.strings = '..')
str(df.raw2)
## 'data.frame':    1161 obs. of  7 variables:
##  $ ï..Country.Name: chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ Country.Code   : chr  "ALB" "ALB" "ALB" "ALB" ...
##  $ Series.Name    : chr  "PISA: Mean performance on the mathematics scale" "PISA: Mean performance on the mathematics scale. Female" "PISA: Mean performance on the mathematics scale. Male" "PISA: Mean performance on the reading scale" ...
##  $ Series.Code    : chr  "LO.PISA.MAT" "LO.PISA.MAT.FE" "LO.PISA.MAT.MA" "LO.PISA.REA" ...
##  $ X2013..YR2013. : logi  NA NA NA NA NA NA ...
##  $ X2014..YR2014. : logi  NA NA NA NA NA NA ...
##  $ X2015..YR2015. : num  413 418 409 405 435 ...

The last column is now ‘numeric’. However, the name of the first column is not imported correctly.

What about the last scenario?

df.raw <- read.csv(file, fileEncoding="UTF-8-BOM", na.strings = '..')
str(df.raw)
## 'data.frame':    1161 obs. of  7 variables:
##  $ Country.Name  : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ Country.Code  : chr  "ALB" "ALB" "ALB" "ALB" ...
##  $ Series.Name   : chr  "PISA: Mean performance on the mathematics scale" "PISA: Mean performance on the mathematics scale. Female" "PISA: Mean performance on the mathematics scale. Male" "PISA: Mean performance on the reading scale" ...
##  $ Series.Code   : chr  "LO.PISA.MAT" "LO.PISA.MAT.FE" "LO.PISA.MAT.MA" "LO.PISA.REA" ...
##  $ X2013..YR2013.: logi  NA NA NA NA NA NA ...
##  $ X2014..YR2014.: logi  NA NA NA NA NA NA ...
##  $ X2015..YR2015.: num  413 418 409 405 435 ...

As you can see, the first column is now named properly and the last column is ‘numeric’.

na.strings = '..' allows R to replace those blanks in the dataset with NA. This will be useful and convenient later when we want to remove all the ‘NA’s.

fileEncoding="UTF-8-BOM" allows R, in the laymen term, to read the characters as correctly as they would appear on the raw dataset.

We want to do a few things to clean the dataset:

  1. Make sure that each row in the dataset corresponds to ONLY one country: Use spread() function in tidyverse package

  2. Make sure that only useful columns and rows are kept: Use drop_na() and data subsetting

  3. Rename the Series Code column for meaningful interpretation: Use rename()

df <- df.raw[1:1161, c(1, 4, 7)]%>%  #select relevant rows and cols
     spread(key=Series.Code, value=X2015..YR2015.) %>%
      rename(Maths = LO.PISA.MAT,                        
         Maths.F = LO.PISA.MAT.FE,
         Maths.M = LO.PISA.MAT.MA,
         Reading = LO.PISA.REA,
         Reading.F = LO.PISA.REA.FE,
         Reading.M = LO.PISA.REA.MA,
         Science = LO.PISA.SCI,
         Science.F = LO.PISA.SCI.FE,
         Science.M = LO.PISA.SCI.MA
         ) %>%
  drop_na()

Now let’s see how the clean data looks like:

df %>%
  head(10) %>% 
  kable()%>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
Country.Name Maths Maths.F Maths.M Reading Reading.F Reading.M Science Science.F Science.M
Albania 413.1570 417.7500 408.5455 405.2588 434.6396 375.7592 427.2250 439.4430 414.9576
Algeria 359.6062 363.0725 356.4951 349.8593 366.2082 335.1854 375.7451 383.2209 369.0352
Argentina 409.0333 400.4431 418.3884 425.3031 432.9581 416.9666 432.2262 424.9944 440.1020
Australia 493.8962 490.9855 496.7613 502.9006 518.8658 487.1855 509.9939 508.9216 511.0493
Austria 496.7423 483.1330 510.0982 484.8656 495.0752 474.8460 495.0375 485.5268 504.3712
Belgium 506.9844 499.7390 514.0026 498.5242 506.6386 490.6642 501.9997 496.0319 507.7805
Brazil 377.0695 369.5493 385.0406 407.3486 418.5617 395.4633 400.6821 398.7000 402.7830
Bulgaria 441.1899 442.1631 440.3189 431.7175 456.5986 409.4498 445.7720 453.9011 438.4966
Canada 515.6474 511.1417 520.1661 526.6678 539.7624 513.5355 527.7047 527.1562 528.2548
Chile 422.6714 413.4490 431.7981 458.5709 464.5616 452.6422 446.9561 439.6174 454.2186

Visualizing the data

  1. Barplot

Ranking of Maths Score by Countries

ggplot(data=df,aes(x=reorder(Country.Name,Maths),y=Maths)) + 
  geom_bar(stat ='identity',aes(fill=Maths))+
  coord_flip() + 
  theme_grey() + 
  scale_fill_gradient(name="Maths Score Level")+
  labs(title = 'Ranking of Countries by Maths Score',
       y='Score',x='Countries')+ 
  geom_hline(yintercept = mean(df$Maths),size = 1, color = 'blue')+
  theme_bw()

  1. Boxplot

If we use the dataset above, we will not be able to draw a boxplot. This is because boxplot needs only 2 variables x and y but in the cleaned data that we have, there are so many variables. So we need to combine those into 2 variables. We name this as df2

df2 = df[,c(1,3,4,6,7,9,10)] %>%   # select relevant columns 
  pivot_longer(c(2,3,4,5,6,7),names_to = 'Score')

df2 %>% 
  head(10) %>% 
  kable()%>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
Country.Name Score value
Albania Maths.F 417.7500
Albania Maths.M 408.5455
Albania Reading.F 434.6396
Albania Reading.M 375.7592
Albania Science.F 439.4430
Albania Science.M 414.9576
Algeria Maths.F 363.0725
Algeria Maths.M 356.4951
Algeria Reading.F 366.2082
Algeria Reading.M 335.1854

Great! Now we can make boxplots

ggplot(data = df2, aes(x=Score,y=value, color=Score)) + 
  geom_boxplot()+
  scale_color_brewer(palette="Dark2") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  labs(title = 'Did males perform better than females?',
       y='Scores',x='Test Type')+
  theme_bw()

geom_jitter() allows you to plot the data points on the plot.

You can play around with the code above to get various plots. For example, I can change from ‘color = Score’ to ‘fill=Score’:

ggplot(data = df2, aes(x=Score,y=value, fill=Score)) + 
  geom_boxplot()+
  scale_fill_brewer(palette="Green") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  labs(title = 'Did males perform better than females?',
       y='Scores',x='Test Type')+
  theme_bw()

The plot looks a bit messy. A better visualisation would be to separate Subjects and Genders and plot them side by side.

How can we do that?

Since we want to separate Subjects and Genders from a column containing ‘Subject.Gender’ (e.g. Maths.F), we need to use strsplit () to do this job for us.

S = numeric(408)     # create an empty vector
for (i in 1:length(df2$Score)) {
  S[i] = strsplit(df2$Score[i],".",fixed = TRUE)
}

Now S is a list of 408 components, each of which has 2 sub-components ‘Subject’ and ‘Gender’. We need to transform S into a data frame with 1 column of Subject and 1 column of Gender. We will name this data frame as df3

df3 = S%>%unlist() %>% matrix(ncol = 2, byrow = TRUE)%>% as.data.frame()
df3 %>%  
  kable()%>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
V1 V2
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M
Maths F
Maths M
Reading F
Reading M
Science F
Science M

We now need to combine this df3 with df2 we created earlier, and name the result as df4

df4 = cbind(df2,df3) 
colnames(df4) = c('Country','Score','Value','Test','Gender')
df4$Score = NULL # since the 'Score' column is redundant

df4%>%  
  kable()%>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
Country Value Test Gender
Albania 417.7500 Maths F
Albania 408.5455 Maths M
Albania 434.6396 Reading F
Albania 375.7592 Reading M
Albania 439.4430 Science F
Albania 414.9576 Science M
Algeria 363.0725 Maths F
Algeria 356.4951 Maths M
Algeria 366.2082 Reading F
Algeria 335.1854 Reading M
Algeria 383.2209 Science F
Algeria 369.0352 Science M
Argentina 400.4431 Maths F
Argentina 418.3884 Maths M
Argentina 432.9581 Reading F
Argentina 416.9666 Reading M
Argentina 424.9944 Science F
Argentina 440.1020 Science M
Australia 490.9855 Maths F
Australia 496.7613 Maths M
Australia 518.8658 Reading F
Australia 487.1855 Reading M
Australia 508.9216 Science F
Australia 511.0493 Science M
Austria 483.1330 Maths F
Austria 510.0982 Maths M
Austria 495.0752 Reading F
Austria 474.8460 Reading M
Austria 485.5268 Science F
Austria 504.3712 Science M
Belgium 499.7390 Maths F
Belgium 514.0026 Maths M
Belgium 506.6386 Reading F
Belgium 490.6642 Reading M
Belgium 496.0319 Science F
Belgium 507.7805 Science M
Brazil 369.5493 Maths F
Brazil 385.0406 Maths M
Brazil 418.5617 Reading F
Brazil 395.4633 Reading M
Brazil 398.7000 Science F
Brazil 402.7830 Science M
Bulgaria 442.1631 Maths F
Bulgaria 440.3189 Maths M
Bulgaria 456.5986 Reading F
Bulgaria 409.4498 Reading M
Bulgaria 453.9011 Science F
Bulgaria 438.4966 Science M
Canada 511.1417 Maths F
Canada 520.1661 Maths M
Canada 539.7624 Reading F
Canada 513.5355 Reading M
Canada 527.1562 Science F
Canada 528.2548 Science M
Chile 413.4490 Maths F
Chile 431.7981 Maths M
Chile 464.5616 Reading F
Chile 452.6422 Reading M
Chile 439.6174 Science F
Chile 454.2186 Science M
Colombia 384.4883 Maths F
Colombia 395.3911 Maths M
Colombia 432.2819 Reading F
Colombia 416.6816 Reading M
Colombia 411.0316 Science F
Colombia 420.9651 Science M
Costa Rica 392.3129 Maths F
Costa Rica 408.4516 Maths M
Costa Rica 434.8748 Reading F
Costa Rica 419.8605 Reading M
Costa Rica 410.8349 Science F
Costa Rica 428.6660 Science M
Croatia 457.9612 Maths F
Croatia 470.5987 Maths M
Croatia 499.5858 Reading F
Croatia 473.1367 Reading M
Croatia 472.5863 Science F
Croatia 478.4173 Science M
Cyprus 439.5341 Maths F
Cyprus 434.7064 Maths M
Cyprus 468.6583 Reading F
Cyprus 416.8271 Reading M
Cyprus 440.9482 Science F
Cyprus 424.1478 Science M
Czech Republic 488.6656 Maths F
Czech Republic 495.7942 Maths M
Czech Republic 500.6527 Reading F
Czech Republic 474.5475 Reading M
Czech Republic 488.3983 Science F
Czech Republic 497.0304 Science M
Denmark 506.3748 Maths F
Denmark 515.7565 Maths M
Denmark 510.9516 Reading F
Denmark 488.7816 Reading M
Denmark 498.9027 Science F
Denmark 504.9427 Science M
Dominican Republic 329.7459 Maths F
Dominican Republic 325.5866 Maths M
Dominican Republic 372.7806 Reading F
Dominican Republic 342.1682 Reading M
Dominican Republic 330.8290 Science F
Dominican Republic 332.4770 Science M
Estonia 516.8728 Maths F
Estonia 522.0804 Maths M
Estonia 533.3620 Reading F
Estonia 505.4863 Reading M
Estonia 532.5228 Science F
Estonia 535.7986 Science M
Finland 514.9650 Maths F
Finland 507.4528 Maths M
Finland 550.5112 Reading F
Finland 503.9746 Reading M
Finland 540.5118 Science F
Finland 521.4797 Science M
France 489.9540 Maths F
France 495.9317 Maths M
France 513.7640 Reading F
France 484.6293 Reading M
France 494.0342 Science F
France 495.9353 Science M
Georgia 410.5960 Maths F
Georgia 397.7478 Maths M
Georgia 431.8820 Reading F
Georgia 373.7585 Reading M
Georgia 419.6164 Science F
Georgia 403.4965 Science M
Germany 497.5311 Maths F
Germany 514.1177 Maths M
Germany 519.6741 Reading F
Germany 498.9021 Reading M
Germany 503.8121 Science F
Germany 514.2837 Science M
Greece 453.5732 Maths F
Greece 453.6821 Maths M
Greece 486.4600 Reading F
Greece 449.1362 Reading M
Greece 459.4177 Science F
Greece 450.5984 Science M
Hong Kong SAR, China 546.7682 Maths F
Hong Kong SAR, China 549.0658 Maths M
Hong Kong SAR, China 540.9844 Reading F
Hong Kong SAR, China 512.7113 Reading M
Hong Kong SAR, China 523.7491 Science F
Hong Kong SAR, China 522.8172 Science M
Hungary 472.7395 Maths F
Hungary 480.9055 Maths M
Hungary 481.9596 Reading F
Hungary 457.1377 Reading M
Hungary 475.2484 Science F
Hungary 478.2405 Science M
Iceland 488.5870 Maths F
Iceland 487.4457 Maths M
Iceland 501.7167 Reading F
Iceland 460.1036 Reading M
Iceland 474.6556 Science F
Iceland 471.7177 Science M
Indonesia 387.4450 Maths F
Indonesia 384.7793 Maths M
Indonesia 408.9994 Reading F
Indonesia 385.5642 Reading M
Indonesia 405.1289 Science F
Indonesia 401.0783 Science M
Ireland 495.4450 Maths F
Ireland 511.5797 Maths M
Ireland 526.9491 Reading F
Ireland 514.9914 Reading M
Ireland 497.1740 Science F
Ireland 507.7026 Science M
Israel 465.5169 Maths F
Israel 473.9902 Maths M
Israel 490.1650 Reading F
Israel 467.3026 Reading M
Israel 464.4477 Science F
Israel 468.7432 Science M
Italy 479.8237 Maths F
Italy 499.7621 Maths M
Italy 492.7091 Reading F
Italy 476.7038 Reading M
Italy 472.1190 Science F
Italy 489.0838 Science M
Japan 525.4960 Maths F
Japan 539.2673 Maths M
Japan 522.6553 Reading F
Japan 509.3740 Reading M
Japan 531.5329 Science F
Japan 545.1415 Science M
Jordan 387.3772 Maths F
Jordan 373.0013 Maths M
Jordan 443.5964 Reading F
Jordan 371.9124 Reading M
Jordan 427.9967 Science F
Jordan 388.9627 Science M
Kazakhstan 460.5772 Maths F
Kazakhstan 459.1038 Maths M
Kazakhstan 435.3836 Reading F
Kazakhstan 419.4282 Reading M
Kazakhstan 457.8602 Science F
Kazakhstan 455.1954 Science M
Korea, Rep.  527.7567 Maths F
Korea, Rep.  520.7669 Maths M
Korea, Rep.  538.6092 Reading F
Korea, Rep.  498.0690 Reading M
Korea, Rep.  520.8329 Science F
Korea, Rep.  511.2151 Science M
Latvia 483.2573 Maths F
Latvia 481.3558 Maths M
Latvia 508.8251 Reading F
Latvia 466.7552 Reading M
Latvia 495.5792 Science F
Latvia 484.8872 Science M
Lebanon 385.9892 Maths F
Lebanon 408.0571 Maths M
Lebanon 353.2773 Reading F
Lebanon 338.8062 Reading M
Lebanon 385.5465 Science F
Lebanon 387.5658 Science M
Lithuania 479.0178 Maths F
Lithuania 477.7677 Maths M
Lithuania 492.2423 Reading F
Lithuania 453.1567 Reading M
Lithuania 479.1618 Science F
Lithuania 471.7669 Science M
Luxembourg 480.1360 Maths F
Luxembourg 491.4831 Maths M
Luxembourg 492.0348 Reading F
Luxembourg 470.6970 Reading M
Luxembourg 479.0131 Science F
Luxembourg 486.6521 Science M
Macao SAR, China 547.9227 Maths F
Macao SAR, China 539.7238 Maths M
Macao SAR, China 524.5389 Reading F
Macao SAR, China 492.9614 Reading M
Macao SAR, China 532.4157 Science F
Macao SAR, China 524.7126 Science M
Macedonia, FYR 374.7397 Maths F
Macedonia, FYR 368.2374 Maths M
Macedonia, FYR 375.8380 Reading F
Macedonia, FYR 330.1350 Reading M
Macedonia, FYR 394.4027 Science F
Macedonia, FYR 374.0698 Science M
Malaysia 449.1973 Maths F
Malaysia 442.6725 Maths M
Malaysia 445.3567 Reading F
Malaysia 414.1259 Reading M
Malaysia 444.7308 Science F
Malaysia 440.9622 Science M
Malta 480.6731 Maths F
Malta 476.6767 Maths M
Malta 468.0581 Reading F
Malta 425.9101 Reading M
Malta 470.2137 Science F
Malta 459.5117 Science M
Mexico 404.3349 Maths F
Mexico 411.6105 Maths M
Mexico 431.2322 Reading F
Mexico 415.5398 Reading M
Mexico 411.5536 Science F
Mexico 419.7517 Science M
Moldova 420.6077 Maths F
Moldova 418.7273 Maths M
Moldova 442.1850 Reading F
Moldova 390.4922 Reading M
Moldova 431.4186 Science F
Moldova 424.6058 Science M
Montenegro 417.7265 Maths F
Montenegro 418.1326 Maths M
Montenegro 444.1791 Reading F
Montenegro 410.3504 Reading M
Montenegro 413.8134 Science F
Montenegro 408.9238 Science M
Netherlands 511.0159 Maths F
Netherlands 513.4978 Maths M
Netherlands 514.7035 Reading F
Netherlands 491.1367 Reading M
Netherlands 506.5181 Science F
Netherlands 510.6452 Science M
New Zealand 490.9219 Maths F
New Zealand 499.4705 Maths M
New Zealand 525.5199 Reading F
New Zealand 493.2258 Reading M
New Zealand 510.7250 Science F
New Zealand 515.8496 Science M
Norway 502.8737 Maths F
Norway 500.6149 Maths M
Norway 533.3560 Reading F
Norway 493.5367 Reading M
Norway 497.0014 Science F
Norway 499.9234 Science M
Peru 381.7972 Maths F
Peru 391.2786 Maths M
Peru 401.4247 Reading F
Peru 393.6952 Reading M
Peru 391.6810 Science F
Peru 401.6386 Science M
Poland 498.6467 Maths F
Poland 510.0877 Maths M
Poland 520.6746 Reading F
Poland 491.2446 Reading M
Poland 498.3032 Science F
Poland 504.4576 Science M
Portugal 486.5997 Maths F
Portugal 496.5606 Maths M
Portugal 506.5518 Reading F
Portugal 489.8629 Reading M
Portugal 496.0552 Science F
Portugal 506.0510 Science M
Qatar 408.3214 Maths F
Qatar 396.7547 Maths M
Qatar 428.9546 Reading F
Qatar 376.0766 Reading M
Qatar 429.4479 Science F
Qatar 406.3239 Science M
Russian Federation 491.1219 Maths F
Russian Federation 497.0891 Maths M
Russian Federation 507.4833 Reading F
Russian Federation 481.3743 Reading M
Russian Federation 484.5143 Science F
Russian Federation 488.8133 Science M
Singapore 564.2545 Maths F
Singapore 564.1290 Maths M
Singapore 545.5585 Reading F
Singapore 525.3153 Reading M
Singapore 552.2730 Science F
Singapore 558.6637 Science M
Slovak Republic 472.2708 Maths F
Slovak Republic 478.0148 Maths M
Slovak Republic 470.8627 Reading F
Slovak Republic 435.2488 Reading M
Slovak Republic 461.2153 Science F
Slovak Republic 460.3604 Science M
Slovenia 507.9492 Maths F
Slovenia 511.7637 Maths M
Slovenia 527.5539 Reading F
Slovenia 484.3099 Reading M
Slovenia 515.7736 Science F
Slovenia 510.1401 Science M
Spain 477.8593 Maths F
Spain 493.8485 Maths M
Spain 505.6847 Reading F
Spain 485.4411 Reading M
Spain 489.4560 Science F
Spain 496.1251 Science M
Sweden 495.0526 Maths F
Sweden 492.8041 Maths M
Sweden 519.9498 Reading F
Sweden 480.7183 Reading M
Sweden 495.7183 Science F
Sweden 491.1678 Science M
Switzerland 514.9832 Maths F
Switzerland 527.0155 Maths M
Switzerland 505.3732 Reading F
Switzerland 480.0795 Reading M
Switzerland 502.3059 Science F
Switzerland 508.4492 Science M
Thailand 416.7297 Maths F
Thailand 413.8092 Maths M
Thailand 422.5533 Reading F
Thailand 391.5848 Reading M
Thailand 425.2518 Science F
Thailand 416.2207 Science M
Trinidad and Tobago 425.9341 Maths F
Trinidad and Tobago 408.3298 Maths M
Trinidad and Tobago 452.4412 Reading F
Trinidad and Tobago 401.4599 Reading M
Trinidad and Tobago 434.7075 Science F
Trinidad and Tobago 414.2140 Science M
Tunisia 363.9996 Maths F
Tunisia 370.0814 Maths M
Tunisia 372.6726 Reading F
Tunisia 347.6038 Reading M
Tunisia 384.6152 Science F
Tunisia 388.4740 Science M
Turkey 417.5270 Maths F
Turkey 423.3777 Maths M
Turkey 442.2461 Reading F
Turkey 414.4396 Reading M
Turkey 428.6548 Science F
Turkey 422.3278 Science M
United Arab Emirates 430.7020 Maths F
United Arab Emirates 424.1734 Maths M
United Arab Emirates 458.2870 Reading F
United Arab Emirates 408.1053 Reading M
United Arab Emirates 449.3386 Science F
United Arab Emirates 423.7710 Science M
United Kingdom 486.5746 Maths F
United Kingdom 498.2005 Maths M
United Kingdom 509.0904 Reading F
United Kingdom 487.1961 Reading M
United Kingdom 508.8409 Science F
United Kingdom 509.5904 Science M
United States 465.3685 Maths F
United States 473.8857 Maths M
United States 506.9752 Reading F
United States 486.9014 Reading M
United States 492.8651 Science F
United States 499.6176 Science M
Uruguay 411.5024 Maths F
Uruguay 425.1379 Maths M
Uruguay 447.7196 Reading F
Uruguay 424.2969 Reading M
Uruguay 431.3077 Science F
Uruguay 439.8285 Science M
Vietnam 496.0059 Maths F
Vietnam 492.9591 Maths M
Vietnam 499.0386 Reading F
Vietnam 473.9191 Reading M
Vietnam 525.9139 Science F
Vietnam 523.3141 Science M

Awesome! Now the data looks clean and neat. Let’s create multiple plots with the use of facet_wrap() function in ggplot2

ggplot(data = df4, aes(x=Test,y=Value, fill=Test)) + 
  geom_boxplot(notch = T)+
  scale_fill_brewer(palette="Green") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  labs(title = 'Did males perform better than females?',
       y='Scores',x='Test')+
  facet_wrap(~Gender,nrow = 1)+
  theme_bw()

Here, we categorized the plot by Gender, hence facet_wrap(~Gender,nrow = 1) We can also categorize the plot by Test by changing facet_wrap(~Test,nrow = 1)

ggplot(data = df4, aes(x=Gender,y=Value, fill=Gender)) + 
  geom_boxplot(notch=T)+
  scale_fill_brewer(palette="Green") + 
  geom_jitter(shape=16, position=position_jitter(0.2))+
  labs(title = 'Did males perform better than females?',
       y='Scores',x='')+
  facet_wrap(~Test,nrow = 1)+
  theme_bw()

By looking at these plots, we can make some insights about the performance of males and females. Generally, males performed better in Science and Maths, but females performed better in Reading. However, it would be naive for us to make a conclusion only after looking at the boxplot. Let’s dive deeper into the data and see any other insights we can get after manipulating the dataset.

Since I want to compare the performance of Males and Females in each subject across all participating countries, I would need to calculate the % difference in terms of the score for each subject between males and females and then plot it out to visualize.

How would I do this in R? Use mutate() function

Let’s look at the original clean data set that we have, df

We will now use the mutate() function to calculate, for each country, the % difference between Males and Females for each subject.

df = df %>% mutate(Maths.Diff = ((Maths.M - Maths.F)/Maths.F)*100,
         Reading.Diff = ((Reading.M - Reading.F)/Reading.F)*100,
         Science.Diff = ((Science.M - Science.F)/Science.F)*100,
         Total.Score = Maths + Reading + Science,
         Avg.Diff = (Maths.Diff+Reading.Diff+Science.Diff)/3
         )
df%>%  
  kable()%>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
Country.Name Maths Maths.F Maths.M Reading Reading.F Reading.M Science Science.F Science.M Maths.Diff Reading.Diff Science.Diff Total.Score Avg.Diff
Albania 413.1570 417.7500 408.5455 405.2588 434.6396 375.7592 427.2250 439.4430 414.9576 -2.2033681 -13.546953 -5.5718993 1245.641 -7.1074068
Algeria 359.6062 363.0725 356.4951 349.8593 366.2082 335.1854 375.7451 383.2209 369.0352 -1.8115869 -8.471338 -3.7017041 1085.211 -4.6615429
Argentina 409.0333 400.4431 418.3884 425.3031 432.9581 416.9666 432.2262 424.9944 440.1020 4.4813468 -3.693538 3.5547951 1266.563 1.4475345
Australia 493.8962 490.9855 496.7613 502.9006 518.8658 487.1855 509.9939 508.9216 511.0493 1.1763777 -6.105678 0.4180624 1506.791 -1.5037459
Austria 496.7423 483.1330 510.0982 484.8656 495.0752 474.8460 495.0375 485.5268 504.3712 5.5813179 -4.086078 3.8812368 1476.645 1.7921588
Belgium 506.9844 499.7390 514.0026 498.5242 506.6386 490.6642 501.9997 496.0319 507.7805 2.8541997 -3.153025 2.3685191 1507.508 0.6898979
Brazil 377.0695 369.5493 385.0406 407.3486 418.5617 395.4633 400.6821 398.7000 402.7830 4.1919394 -5.518524 1.0240879 1185.100 -0.1008323
Bulgaria 441.1899 442.1631 440.3189 431.7175 456.5986 409.4498 445.7720 453.9011 438.4966 -0.4170813 -10.326091 -3.3937938 1318.679 -4.7123219
Canada 515.6474 511.1417 520.1661 526.6678 539.7624 513.5355 527.7047 527.1562 528.2548 1.7655377 -4.858976 0.2083976 1570.020 -0.9616804
Chile 422.6714 413.4490 431.7981 458.5709 464.5616 452.6422 446.9561 439.6174 454.2186 4.4380579 -2.565731 3.3213455 1328.198 1.7312243
Colombia 389.6438 384.4883 395.3911 424.9052 432.2819 416.6816 415.7288 411.0316 420.9651 2.8356606 -3.608826 2.4167274 1230.278 0.5478542
Costa Rica 400.2534 392.3129 408.4516 427.4875 434.8748 419.8605 419.6080 410.8349 428.6660 4.1137450 -3.452567 4.3402113 1247.349 1.6671297
Croatia 464.0401 457.9612 470.5987 486.8632 499.5858 473.1367 475.3912 472.5863 478.4173 2.7595081 -5.294206 1.2338441 1426.294 -0.4336181
Cyprus 437.1443 439.5341 434.7064 442.8443 468.6583 416.8271 432.5964 440.9482 424.1478 -1.0983500 -11.059504 -3.8100543 1312.585 -5.3226359
Czech Republic 492.3254 488.6656 495.7942 487.2501 500.6527 474.5475 492.8300 488.3983 497.0304 1.4587808 -5.214233 1.7674354 1472.406 -0.6626722
Denmark 511.0876 506.3748 515.7565 499.8146 510.9516 488.7816 501.9369 498.9027 504.9427 1.8527224 -4.338958 1.2106588 1512.839 -0.4251921
Dominican Republic 327.7020 329.7459 325.5866 357.7377 372.7806 342.1682 331.6388 330.8290 332.4770 -1.2613862 -8.211916 0.4981496 1017.078 -2.9917175
Estonia 519.5291 516.8728 522.0804 519.1429 533.3620 505.4863 534.1937 532.5228 535.7986 1.0075324 -5.226405 0.6151492 1572.866 -1.2012412
Finland 511.0769 514.9650 507.4528 526.4247 550.5112 503.9746 530.6612 540.5118 521.4797 -1.4587702 -8.453340 -3.5211343 1568.163 -4.4777482
France 492.9204 489.9540 495.9317 499.3061 513.7640 484.6293 494.9776 494.0342 495.9353 1.2200420 -5.670840 0.3848057 1487.204 -1.3553307
Georgia 403.8332 410.5960 397.7478 401.2881 431.8820 373.7585 411.1315 419.6164 403.4965 -3.1291643 -13.458177 -3.8415909 1216.253 -6.8096440
Germany 505.9713 497.5311 514.1177 509.1041 519.6741 498.9021 509.1406 503.8121 514.2837 3.3337763 -3.997113 2.0784633 1524.216 0.4717088
Greece 453.6299 453.5732 453.6821 467.0395 486.4600 449.1362 454.8288 459.4177 450.5984 0.0239946 -7.672529 -1.9196790 1375.498 -3.1894043
Hong Kong SAR, China 547.9310 546.7682 549.0658 526.6753 540.9844 512.7113 523.2774 523.7491 522.8172 0.4202205 -5.226229 -0.1779279 1597.884 -1.6613123
Hungary 476.8309 472.7395 480.9055 469.5233 481.9596 457.1377 476.7475 475.2484 478.2405 1.7273862 -5.150190 0.6295895 1423.102 -0.9310714
Iceland 488.0332 488.5870 487.4457 481.5255 501.7167 460.1036 473.2301 474.6556 471.7177 -0.2335942 -8.294143 -0.6189646 1442.789 -3.0489006
Indonesia 386.1096 387.4450 384.7793 397.2595 408.9994 385.5642 403.0997 405.1289 401.0783 -0.6880194 -5.729900 -0.9998441 1186.469 -2.4725879
Ireland 503.7220 495.4450 511.5797 520.8148 526.9491 514.9914 502.5751 497.1740 507.7026 3.2566054 -2.269230 2.1177000 1527.112 1.0350252
Israel 469.6695 465.5169 473.9902 478.9606 490.1650 467.3026 466.5528 464.4477 468.7432 1.8202116 -4.664226 0.9248704 1415.183 -0.6397146
Italy 489.7287 479.8237 499.7621 484.7580 492.7091 476.7038 480.5468 472.1190 489.0838 4.1553529 -3.248432 3.5933370 1455.034 1.5000860
Japan 532.4399 525.4960 539.2673 515.9585 522.6553 509.3740 538.3948 531.5329 545.1415 2.6206217 -2.541112 2.5602532 1586.793 0.8799211
Jordan 380.2590 387.3772 373.0013 408.1022 443.5964 371.9124 408.6691 427.9967 388.9627 -3.7110814 -16.159739 -9.1201595 1197.030 -9.6636599
Kazakhstan 459.8160 460.5772 459.1038 427.1410 435.3836 419.4282 456.4836 457.8602 455.1954 -0.3199020 -3.664675 -0.5819995 1343.441 -1.5221923
Korea, Rep.  524.1062 527.7567 520.7669 517.4367 538.6092 498.0690 515.8099 520.8329 511.2151 -1.3244225 -7.526835 -1.8466308 1557.353 -3.5659629
Latvia 482.3051 483.2573 481.3558 487.7581 508.8251 466.7552 490.2250 495.5792 484.8872 -0.3934704 -8.268039 -2.1574710 1460.288 -3.6063269
Lebanon 396.2497 385.9892 408.0571 346.5490 353.2773 338.8062 386.4854 385.5465 387.5658 5.7172328 -4.096245 0.5237528 1129.284 0.7149137
Lithuania 478.3834 479.0178 477.7677 472.4066 492.2423 453.1567 475.4089 479.1618 471.7669 -0.2609610 -7.940328 -1.5432985 1426.199 -3.2481958
Luxembourg 485.7706 480.1360 491.4831 481.4391 492.0348 470.6970 482.8064 479.0131 486.6521 2.3633018 -4.336643 1.5947328 1450.016 -0.1262028
Macao SAR, China 543.8078 547.9227 539.7238 508.6905 524.5389 492.9614 528.5496 532.4157 524.7126 -1.4963500 -6.020060 -1.4468080 1581.048 -2.9877394
Macedonia, FYR 371.3114 374.7397 368.2374 351.7415 375.8380 330.1350 383.6824 394.4027 374.0698 -1.7351727 -12.160277 -5.1553820 1106.735 -6.3502772
Malaysia 446.1098 449.1973 442.6725 430.5782 445.3567 414.1259 442.9475 444.7308 440.9622 -1.4525510 -7.012545 -0.8473885 1319.636 -3.1041614
Malta 478.6448 480.6731 476.6767 446.6661 468.0581 425.9101 464.7819 470.2137 459.5117 -0.8314104 -9.004874 -2.2759957 1390.093 -4.0374267
Mexico 408.0235 404.3349 411.6105 423.2765 431.2322 415.5398 415.7099 411.5536 419.7517 1.7994103 -3.638965 1.9920016 1247.010 0.0508156
Moldova 419.6635 420.6077 418.7273 416.2293 442.1850 390.4922 427.9978 431.4186 424.6058 -0.4470866 -11.690318 -1.5791547 1263.891 -4.5721865
Montenegro 417.9341 417.7265 418.1326 426.8845 444.1791 410.3504 411.3136 413.8134 408.9238 0.0972056 -7.616006 -1.1815980 1256.132 -2.9001329
Netherlands 512.2528 511.0159 513.4978 502.9591 514.7035 491.1367 508.5748 506.5181 510.6452 0.4856807 -4.578703 0.8147938 1523.787 -1.0927430
New Zealand 495.2233 490.9219 499.4705 509.2707 525.5199 493.2258 513.3035 510.7250 515.8496 1.7413445 -6.145166 1.0033901 1517.797 -1.1334770
Norway 501.7298 502.8737 500.6149 513.1912 533.3560 493.5367 498.4811 497.0014 499.9234 -0.4491643 -7.465795 0.5879270 1513.402 -2.4423439
Peru 386.5606 381.7972 391.2786 397.5414 401.4247 393.6952 396.6836 391.6810 401.6386 2.4833696 -1.925508 2.5422909 1180.786 1.0333841
Poland 504.4693 498.6467 510.0877 505.6971 520.6746 491.2446 501.4353 498.3032 504.4576 2.2944181 -5.652282 1.2350661 1511.602 -0.7075992
Portugal 491.6270 486.5997 496.5606 498.1289 506.5518 489.8629 501.1001 496.0552 506.0510 2.0470288 -3.294603 2.0150443 1490.856 0.2558233
Qatar 402.4007 408.3214 396.7547 401.8874 428.9546 376.0766 417.6112 429.4479 406.3239 -2.8327436 -12.327182 -5.3845780 1221.899 -6.8481678
Russian Federation 494.0600 491.1219 497.0891 494.6278 507.4833 481.3743 486.6310 484.5143 488.8133 1.2150130 -5.144791 0.8872903 1475.319 -1.0141627
Singapore 564.1897 564.2545 564.1290 535.1002 545.5585 525.3153 555.5747 552.2730 558.6637 -0.0222471 -3.710549 1.1571638 1654.865 -0.8585440
Slovak Republic 475.2301 472.2708 478.0148 452.5143 470.8627 435.2488 460.7749 461.2153 460.3604 1.2162436 -7.563542 -0.1853410 1388.519 -2.1775463
Slovenia 509.9196 507.9492 511.7637 505.2159 527.5539 484.3099 512.8636 515.7736 510.1401 0.7509620 -8.197081 -1.0922348 1527.999 -2.8461179
Spain 485.8432 477.8593 493.8485 495.5764 505.6847 485.4411 492.7861 489.4560 496.1251 3.3460152 -4.003201 1.3625490 1474.206 0.2351211
Sweden 493.9181 495.0526 492.8041 500.1556 519.9498 480.7183 493.4224 495.7183 491.1678 -0.4541794 -7.545254 -0.9179621 1487.496 -2.9724653
Switzerland 521.2506 514.9832 527.0155 492.1982 505.3732 480.0795 505.5058 502.3059 508.4492 2.3364591 -5.004942 1.2230060 1518.955 -0.4818258
Thailand 415.4638 416.7297 413.8092 409.1301 422.5533 391.5848 421.3373 425.2518 416.2207 -0.7008067 -7.328893 -2.1236973 1245.931 -3.3844656
Trinidad and Tobago 417.2434 425.9341 408.3298 427.2733 452.4412 401.4599 424.5905 434.7075 414.2140 -4.1331188 -11.268035 -4.7143335 1269.107 -6.7051623
Tunisia 366.8180 363.9996 370.0814 361.0555 372.6726 347.6038 386.4034 384.6152 388.4740 1.6708275 -6.726742 1.0032943 1114.277 -1.3508734
Turkey 420.4540 417.5270 423.3777 428.3351 442.2461 414.4396 425.4895 428.6548 422.3278 1.4012703 -6.287562 -1.4760145 1274.279 -2.1207688
United Arab Emirates 427.4827 430.7020 424.1734 433.5423 458.2870 408.1053 436.7311 449.3386 423.7710 -1.5158159 -10.949842 -5.6900329 1297.756 -6.0518968
United Kingdom 492.4785 486.5746 498.2005 497.9719 509.0904 487.1961 509.2215 508.8409 509.5904 2.3893446 -4.300667 0.1472813 1499.672 -0.5880138
United States 469.6285 465.3685 473.8857 496.9351 506.9752 486.9014 496.2424 492.8651 499.6176 1.8302065 -3.959520 1.3700639 1462.806 -0.2530833
Uruguay 417.9919 411.5024 425.1379 436.5721 447.7196 424.2969 435.3630 431.3077 439.8285 3.3136089 -5.231559 1.9755851 1289.927 0.0192117
Vietnam 494.5183 496.0059 492.9591 486.7738 499.0386 473.9191 524.6445 525.9139 523.3141 -0.6142687 -5.033571 -0.4943479 1505.937 -2.0473957

Now let’s plot this out to visualize the data better MATHS SCORE

ggplot(data=df, aes(x=reorder(Country.Name, Maths.Diff), y=Maths.Diff)) +
  geom_bar(stat = "identity", aes(fill=Maths.Diff)) +
  coord_flip() +
  theme_light() +
  geom_hline(yintercept = mean(df$Maths.Diff), size=1, color="black") +
  scale_fill_gradient(name="% Difference Level") +
  labs(title="Are Males better at math?", x="", y="% difference from female")+
  theme_bw()

This plot represents the % difference in scores using Females as reference. A positive difference means males scored higher, while a negative difference means males scored lower. The black line represents the mean difference across all countries.

Some interesting insights one can draw from here:

  • In general, Males performed better than Females in Maths, especially in most of the DCs, males scored generally higher than females

  • Interestingly, in Singapore and Hongkong, females and males performed equally well, with the difference in score is around 0 in each of these countries. This is good insight maybe for policy-makers because we do not want a huge gap between males and females performance in education.

We can do the same thing for Reading and Science score.

  1. Correlation Plot
df <- df[,c(1,3,4,6,7,9,10)] #select relevant columns

To create correlation plot, simply use cor():

res <- cor(df[,-1]) # -1 here means we look at all columns except the first column
res%>%  
  kable()%>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
Maths.F Maths.M Reading.F Reading.M Science.F Science.M
Maths.F 1.0000000 0.9845874 0.9377498 0.9177645 0.9711420 0.9547097
Maths.M 0.9845874 1.0000000 0.9312678 0.9468078 0.9576479 0.9758210
Reading.F 0.9377498 0.9312678 1.0000000 0.9663211 0.9555957 0.9440488
Reading.M 0.9177645 0.9468078 0.9663211 1.0000000 0.9283812 0.9692920
Science.F 0.9711420 0.9576479 0.9555957 0.9283812 1.0000000 0.9736539
Science.M 0.9547097 0.9758210 0.9440488 0.9692920 0.9736539 1.0000000

We can calculate p-value to see whether the correlation is significant

library("Hmisc")

res2 <- rcorr(as.matrix(df[,-1]))
res2
##           Maths.F Maths.M Reading.F Reading.M Science.F Science.M
## Maths.F      1.00    0.98      0.94      0.92      0.97      0.95
## Maths.M      0.98    1.00      0.93      0.95      0.96      0.98
## Reading.F    0.94    0.93      1.00      0.97      0.96      0.94
## Reading.M    0.92    0.95      0.97      1.00      0.93      0.97
## Science.F    0.97    0.96      0.96      0.93      1.00      0.97
## Science.M    0.95    0.98      0.94      0.97      0.97      1.00
## 
## n= 68 
## 
## 
## P
##           Maths.F Maths.M Reading.F Reading.M Science.F Science.M
## Maths.F            0       0         0         0         0       
## Maths.M    0               0         0         0         0       
## Reading.F  0       0                 0         0         0       
## Reading.M  0       0       0                   0         0       
## Science.F  0       0       0         0                   0       
## Science.M  0       0       0         0         0

The smaller the p-value, the more significant the correlation.

The purpose of this section about correlation plot is to introduce to you how to calculate correlation between variables in R. For this dataset, it is obvious that all the variables are correlated

To visualize

#install.packages("corrplot")

library(corrplot)
corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

The stronger the color and the bigger the size, the higher the correlation. The result is similar to the one we got earlier: All the variables are intercorrelated.

That is it! Hope you guys enjoyed and picked up something from this article. While this guide is not exhaustive, it will more or less give you some ideas of how to do some basic EDA with R.