EDA

Performing Exploratory Data Analysis

This is an EDA of the dataset Pisa Scores 2013-2015.csv, which you can download from here.

PISA stands for “Programme for International Student Assessment”. It is administered to 15-year-old students around the world to assess their performance in Math, Reading and Science. Here, I analyze and explore the dataset and try to draw some meaningful insights from it.

The libraries I used in this process are:

tidyverse for cleaning the dataset and performing some transformations.
ggplot2 for plotting.
corrplot for the correlation plot.
Hmisc for the rcorr function, which gives the correlations along with their p-values (cor itself comes with base R).
ggmap for retrieving map tiles from online services.

Setting up the working directory and loading the libraries.

setwd("D:/R/Visualization TDS")
library(tidyverse)
library(ggmap)
library(ggplot2)
library(Hmisc)
library(corrplot)

Data Loading.

Loading the dataset from disk, then inspecting its structure and the number of rows and columns it contains.

# na.strings = ".." reads the ".." placeholders in the file as NA
data <- read.csv(file = "Pisa mean perfromance scores 2013 - 2015 Data.csv", encoding = "UTF-8-BOM", na.strings = "..")
str(data)

## 'data.frame':    1166 obs. of  7 variables:
##  $ ï..Country.Name: chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ Country.Code   : chr  "ALB" "ALB" "ALB" "ALB" ...
##  $ Series.Name    : chr  "PISA: Mean performance on the mathematics scale" "PISA: Mean performance on the mathematics scale. Female" "PISA: Mean performance on the mathematics scale. Male" "PISA: Mean performance on the reading scale" ...
##  $ Series.Code    : chr  "LO.PISA.MAT" "LO.PISA.MAT.FE" "LO.PISA.MAT.MA" "LO.PISA.REA" ...
##  $ X2013..YR2013. : logi  NA NA NA NA NA NA ...
##  $ X2014..YR2014. : logi  NA NA NA NA NA NA ...
##  $ X2015..YR2015. : num  413 418 409 405 435 ...

ncol(data)

## [1] 7

nrow(data)

## [1] 1166
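
The ï.. prefix on the first column name in the str() output looks like a leftover byte-order mark (BOM). Here is a minimal sketch, assuming that is the cause: passing fileEncoding = "UTF-8-BOM" instead of encoding asks R to strip the BOM while reading. The temporary file and its two columns are made up for illustration and are not part of the PISA dataset.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (illustrative data only)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # the three BOM bytes
writeBin(charToRaw("Country Name,Score\nAlbania,..\nAlgeria,360\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" removes the BOM, so the header is read cleanly
clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM", na.strings = "..")
names(clean)
clean
```

If the real file were read this way, the first column would be called Country.Name, so the rename() step in the pre-processing would have to use that name instead of ï..Country.Name.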

Data Pre-processing.

First, we select the columns that are useful to us. For example, the 2013 [YR2013] and 2014 [YR2014] columns contain only NAs, so we drop them along with a few others. We keep the country name (column 1), the series code (column 4) and the 2015 [YR2015] scores (column 7).

We then use the unique values of the Series.Code column as the column names of a new dataset (dataf), rename them for readability, and remove the rows that contain NA values. All of this is chained together with the pipe operator (%>%).

dataf <- data[1:1161, c(1, 4, 7)] %>%
  pivot_wider(names_from = Series.Code, values_from = X2015..YR2015.) %>%
  rename(CountryName = ï..Country.Name,
         Maths = LO.PISA.MAT, MathsF = LO.PISA.MAT.FE, MathsM = LO.PISA.MAT.MA,
         Reading = LO.PISA.REA, ReadingF = LO.PISA.REA.FE, ReadingM = LO.PISA.REA.MA,
         Science = LO.PISA.SCI, ScienceF = LO.PISA.SCI.FE, ScienceM = LO.PISA.SCI.MA) %>%
  drop_na()
head(dataf)

## # A tibble: 6 x 10
##   CountryName Maths MathsF MathsM Reading ReadingF ReadingM Science ScienceF
##   <chr>       <dbl>  <dbl>  <dbl>   <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1 Albania      413.   418.   409.    405.     435.     376.    427.     439.
## 2 Algeria      360.   363.   356.    350.     366.     335.    376.     383.
## 3 Argentina    409.   400.   418.    425.     433.     417.    432.     425.
## 4 Australia    494.   491.   497.    503.     519.     487.    510.     509.
## 5 Austria      497.   483.   510.    485.     495.     475.    495.     486.
## 6 Belgium      507.   500.   514.    499.     507.     491.    502.     496.
## # ... with 1 more variable: ScienceM <dbl>

A view of the newly formed dataset.
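
To make the reshaping step easier to follow, here is a small sketch of the same pivot_wider(), rename() and drop_na() chain on a toy table. The countries "A" and "B" and their scores are invented; only the series codes are taken from the dataset.

```r
library(tidyr)
library(dplyr)

# Toy long-format table: one row per country and series (values are made up)
long <- data.frame(
  Country     = c("A", "A", "B", "B"),
  Series.Code = c("LO.PISA.MAT", "LO.PISA.REA", "LO.PISA.MAT", "LO.PISA.REA"),
  Score       = c(413, 405, 360, NA)
)

wide <- long %>%
  # each unique Series.Code becomes its own column
  pivot_wider(names_from = Series.Code, values_from = Score) %>%
  # shorter, more readable column names
  rename(Maths = LO.PISA.MAT, Reading = LO.PISA.REA) %>%
  # country "B" has no reading score, so its row is dropped
  drop_na()

wide
```

Note that drop_na() removes a country entirely as soon as any one of its scores is missing, which is why dataf has fewer countries than the raw file.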

Visualization

Since the dataset covers many countries, we first plot all of them on a world map.

wrldmap <- map_data("world") 
mrgddata <- merge(wrldmap,dataf,by.x="region",by.y="CountryName")
mrgddata <- mrgddata[order(mrgddata$group,mrgddata$order),]
head(mrgddata)

##     region     long      lat group order subregion   Maths MathsF   MathsM
## 8  Albania 20.06396 42.54727     6   770      <NA> 413.157 417.75 408.5455
## 5  Albania 20.10352 42.52466     6   771      <NA> 413.157 417.75 408.5455
## 21 Albania 20.18574 42.42588     6   772      <NA> 413.157 417.75 408.5455
## 33 Albania 20.24053 42.33897     6   773      <NA> 413.157 417.75 408.5455
## 15 Albania 20.34824 42.30879     6   774      <NA> 413.157 417.75 408.5455
## 45 Albania 20.40830 42.27495     6   775      <NA> 413.157 417.75 408.5455
##     Reading ReadingF ReadingM Science ScienceF ScienceM
## 8  405.2588 434.6396 375.7592 427.225  439.443 414.9576
## 5  405.2588 434.6396 375.7592 427.225  439.443 414.9576
## 21 405.2588 434.6396 375.7592 427.225  439.443 414.9576
## 33 405.2588 434.6396 375.7592 427.225  439.443 414.9576
## 15 405.2588 434.6396 375.7592 427.225  439.443 414.9576
## 45 405.2588 434.6396 375.7592 427.225  439.443 414.9576

ggplot(mrgddata) +
  aes(x=long,y=lat,group=group) + geom_polygon() + aes(fill=region) +
  theme_dark()
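
One thing to watch out for: merge() keeps only the rows whose names match in both tables, so a country spelled differently in the two sources silently disappears from the map. The sketch below shows one way to spot and fix such mismatches. The two name vectors are small illustrative samples, not the full lists, and recode_map is a hypothetical helper table.

```r
# Illustrative samples of the names used by the map data and by the PISA file
map_regions    <- c("Albania", "USA", "UK", "Russia")
pisa_countries <- c("Albania", "United States", "United Kingdom", "Russian Federation")

# Names present in the PISA data but missing from the map data
unmatched <- setdiff(pisa_countries, map_regions)
unmatched

# Hypothetical lookup table: PISA spelling -> map spelling
recode_map <- c("United States"      = "USA",
                "United Kingdom"     = "UK",
                "Russian Federation" = "Russia")

# Replace the names found in the lookup table, keep the others unchanged
needs_fix <- pisa_countries %in% names(recode_map)
fixed <- pisa_countries
fixed[needs_fix] <- unname(recode_map[pisa_countries[needs_fix]])

# After recoding, nothing is left unmatched
setdiff(fixed, map_regions)
```

On the real data the same check would be setdiff(dataf$CountryName, unique(wrldmap$region)).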

Visualizing the Math Score Data with the Country Names.

ggplot(dataf) +
  aes(x=reorder(CountryName,Maths),y=Maths) +
  geom_bar(stat = 'identity') +
  aes(fill=Maths) +
  coord_flip() +
  scale_fill_gradient(name="Score Level") +
  theme_grey() +
  labs(x="Country Name",y="Math Score",title = "Graph Relation b/w Math & Countries") +
  geom_hline(yintercept = mean(dataf$Maths),size=1,col="salmon3")

Visualizing the Science Score Data with the Country Names.

ggplot(dataf) +
  aes(x=reorder(CountryName,Science),y=Science) +
  geom_bar(stat='identity') +
  aes(fill=Science) +
  coord_flip() +
  scale_fill_gradient("Score Level") +
  labs(x="Country Name",y="Science Score",title="Graph Relation b/w Science & Countries") +
  theme_bw() +
  geom_hline(yintercept = mean(dataf$Science),size=1,col="sienna3")

Visualizing the Reading Data with the Country Names.

ggplot(dataf) +
  aes(x=reorder(CountryName,Reading),y=Reading) +
  geom_bar(stat = 'identity') +
  aes(fill=Reading) +
  coord_flip() +
  scale_fill_gradient(name="Score Level") +
  theme_classic() +
  labs(x="Country Name",y="Reading Score",title="Graph Relation b/w Reading & Countries") +
  geom_hline(yintercept = mean(dataf$Reading),size=1,col="seashell4")

Subsetting the data into a new data frame by gender and subject, keeping only CountryName, SubjectGender and value as columns.

dataf2 <- dataf[,c(1,3,4,6,7,9,10)] %>%
  pivot_longer(c(2,3,4,5,6,7),names_to = 'SubjectGender')

Next, we visualize the scores for each subject by gender as box plots.

ggplot(dataf2)+
  aes(x=SubjectGender,y=value)+
  geom_point(position="jitter",alpha=0.25) +
  geom_boxplot(alpha=0.75) +
  aes(fill=SubjectGender) +
  scale_fill_manual(values = c("wheat4","tomato","wheat","violet","seashell3","salmon")) +
  theme_linedraw() +
  labs(x="Subject and Gender",y="Values/Scores")+
  facet_wrap(.~SubjectGender,scales = "free_x",nrow=3,ncol=2)

Because of the scales = "free_x" argument in facet_wrap(), each panel shows only its own category on the x-axis while all panels share the same y-axis, so the boxes can be compared directly.

It is early to judge, but the boxplots suggest that male students performed better in Maths and Science, while female students performed better in Reading.
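
A quick way to back up what the boxplots show is to compare group medians. The sketch below uses only the six countries printed by head(dataf) earlier, with their rounded scores, so it is an illustration and not a result for the full dataset.

```r
# Rounded scores copied from the head(dataf) output (six countries only)
scores <- data.frame(
  SubjectGender = rep(c("MathsF", "MathsM", "ReadingF", "ReadingM"), each = 6),
  value = c(418, 363, 400, 491, 483, 500,   # MathsF
            409, 356, 418, 497, 510, 514,   # MathsM
            435, 366, 433, 519, 495, 507,   # ReadingF
            376, 335, 417, 487, 475, 491)   # ReadingM
)

# Median score per subject/gender group
medians <- tapply(scores$value, scores$SubjectGender, median)
medians
```

On the full data the equivalent call would be tapply(dataf2$value, dataf2$SubjectGender, median).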

Now we plot the correlation graph.

Subsetting the relevant data and computing the correlation between the values using the Pearson method.

dataf3 <- dataf[,c(1,3,4,6,7,9,10)]
res <- cor(dataf3[,-1])
res

##             MathsF    MathsM  ReadingF  ReadingM  ScienceF  ScienceM
## MathsF   1.0000000 0.9845874 0.9377498 0.9177645 0.9711420 0.9547097
## MathsM   0.9845874 1.0000000 0.9312678 0.9468078 0.9576479 0.9758210
## ReadingF 0.9377498 0.9312678 1.0000000 0.9663211 0.9555957 0.9440488
## ReadingM 0.9177645 0.9468078 0.9663211 1.0000000 0.9283812 0.9692920
## ScienceF 0.9711420 0.9576479 0.9555957 0.9283812 1.0000000 0.9736539
## ScienceM 0.9547097 0.9758210 0.9440488 0.9692920 0.9736539 1.0000000

Note: the Pearson method measures the strength of the linear relationship between two variables. Its value is always between -1 and 1.
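
For anyone curious about what cor() does under the hood, here is a small sketch that computes the Pearson coefficient by hand, as the covariance term divided by the product of the spreads, and checks it against cor(). The two vectors are the rounded Maths and Reading scores of the first five countries from head(dataf), used purely as sample input.

```r
# Rounded Maths and Reading scores of the first five countries in head(dataf)
x <- c(413, 360, 409, 494, 497)
y <- c(405, 350, 425, 503, 485)

# Pearson r = sum of products of deviations / sqrt(product of sums of squared deviations)
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual
cor(x, y, method = "pearson")  # should agree with r_manual
```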

p-values for the correlations. The lower the p-value, the more significant the correlation.

rcorr(as.matrix(dataf3[,-1]))$P

##          MathsF MathsM ReadingF ReadingM ScienceF ScienceM
## MathsF       NA      0        0        0        0        0
## MathsM        0     NA        0        0        0        0
## ReadingF      0      0       NA        0        0        0
## ReadingM      0      0        0       NA        0        0
## ScienceF      0      0        0        0       NA        0
## ScienceM      0      0        0        0        0       NA
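
The zeros in this table are most likely p-values that are too small to be shown rather than exact zeros. As a rough illustration, assuming simulated data with a correlation of similar strength (the vectors x and y below are random numbers, not PISA scores), base R's cor.test() returns a tiny but non-zero p-value:

```r
set.seed(42)

# Two strongly correlated simulated variables (not PISA data)
x <- rnorm(70)
y <- x + rnorm(70, sd = 0.2)

# Pearson correlation test; the p-value is extremely small but not zero
p <- cor.test(x, y, method = "pearson")$p.value
p > 0

# format.pval() reports very small p-values as being below a threshold
format.pval(p, digits = 3)
```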

Visualizing the correlation. The stronger the color and the bigger the circle, the higher the correlation. Here, all the values are strongly positively correlated.

corrplot(res,order="hclust",type = "upper",tl.col = "black",tl.cex =0.5)

Creating another dataset from the present one that holds, for each country, the percentage difference between the female and male scores in each subject.

datafg <- mutate(dataf[,1],MathDiff = ((dataf$MathsF-dataf$MathsM)/dataf$MathsM)*100,
                 ScienceDiff = (dataf$ScienceF-dataf$ScienceM)/dataf$ScienceM*100,
                 ReadingDiff = (dataf$ReadingF-dataf$ReadingM)/dataf$ReadingM*100,
                 Total=dataf$Maths+dataf$Science+dataf$Reading,
                 AverageDiff=(MathDiff+ScienceDiff+ReadingDiff)/3)
head(datafg)

## # A tibble: 6 x 6
##   CountryName MathDiff ScienceDiff ReadingDiff Total AverageDiff
##   <chr>          <dbl>       <dbl>       <dbl> <dbl>       <dbl>
## 1 Albania         2.25       5.90        15.7  1246.       7.94 
## 2 Algeria         1.85       3.84         9.26 1085.       4.98 
## 3 Argentina      -4.29      -3.43         3.84 1267.      -1.30 
## 4 Australia      -1.16      -0.416        6.50 1507.       1.64 
## 5 Austria        -5.29      -3.74         4.26 1477.      -1.59 
## 6 Belgium        -2.77      -2.31         3.26 1508.      -0.611

A view of the new dataset.
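
As a sanity check on the formulas, the Albania row can be reproduced by hand from the values printed in the head(mrgddata) output earlier. A positive difference means the female score is higher, a negative one means the male score is higher.

```r
# Albania's scores as printed in the head(mrgddata) output
maths_f   <- 417.75;   maths_m   <- 408.5455
reading_f <- 434.6396; reading_m <- 375.7592
science_f <- 439.443;  science_m <- 414.9576

# Percentage difference relative to the male score
math_diff    <- (maths_f   - maths_m)   / maths_m   * 100
reading_diff <- (reading_f - reading_m) / reading_m * 100
science_diff <- (science_f - science_m) / science_m * 100

# Average of the three differences and the combined overall score
average_diff <- (math_diff + science_diff + reading_diff) / 3
total <- 413.157 + 427.225 + 405.2588

round(c(MathDiff = math_diff, ScienceDiff = science_diff,
        ReadingDiff = reading_diff, AverageDiff = average_diff), 2)
round(total)
```

The rounded results agree with the first row of datafg.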

Forming graphs with respect to the present differences in the dataset.

Graph Representation of Maths Score Difference between Female and Male students.

ggplot(datafg)+
  aes(x=reorder(CountryName,MathDiff),y=MathDiff)+
  geom_bar(stat='identity') + aes(fill=MathDiff) +
  scale_fill_gradient("Math Difference") +
  geom_hline(yintercept = mean(datafg$MathDiff),size=1,col="brown") +
  labs(x="Country Names",y="Math Score Diff %",title="Math Diff v/s Country") + coord_flip() +
  theme_bw()

From this plot it can be inferred that male students scored higher in Maths in most countries.
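
The same reading can be expressed as a simple count of the countries with a negative difference. The sketch below uses only the six MathDiff values shown in head(datafg), so the numbers illustrate the idea and are not the result for the full dataset.

```r
# MathDiff values of the six countries shown in head(datafg)
math_diff <- c(Albania = 2.25, Algeria = 1.85, Argentina = -4.29,
               Australia = -1.16, Austria = -5.29, Belgium = -2.77)

# A negative difference means the male score is higher
n_male_higher <- sum(math_diff < 0)
share_male_higher <- mean(math_diff < 0)

n_male_higher
share_male_higher
```

For all countries, the equivalent would be sum(datafg$MathDiff < 0).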

Graph Representation of Reading Score Difference between Female and Male students.

ggplot(datafg) +
  aes(x=reorder(CountryName,ReadingDiff),y=ReadingDiff) +
  geom_bar(stat='identity') + aes(fill=ReadingDiff) +
  scale_fill_gradient("Reading Difference Palette") +
  geom_hline(yintercept = mean(datafg$ReadingDiff),size=1,col="tan2") +
  labs(x="Country Names",y="Reading Score Diff %",title="Reading Diff v/s Country") +
  coord_flip() +
  theme_bw()

We can infer that females were better at reading in almost every country.

Graph Representation of Science Score Difference between Female and Male students.

ggplot(datafg) +
  aes(x=reorder(CountryName,ScienceDiff),y=ScienceDiff) +
  geom_bar(stat='identity') + aes(fill=ScienceDiff) +
  scale_fill_gradient("Science Difference") +
  geom_hline(yintercept = mean(datafg$ScienceDiff),size=1,col="purple") +
  labs(x="Country Names",y="Science Score Diff %",title="Science Diff v/s Country") +
  coord_flip() +
  theme_bw()

Science shows a trend similar to Maths.

Next, we plot the difference averaged across all three subjects for each country.

ggplot(datafg)+
  aes(x=reorder(CountryName,AverageDiff),y=AverageDiff) +
  geom_bar(stat='identity') + aes(fill=AverageDiff) +
  scale_fill_gradient("Average Diff") +
  geom_hline(yintercept = mean(datafg$AverageDiff),size=1,col="tomato3") +
  labs(x="Country Name",y="Average Diff",title="Average Diff v/s Country") + 
  coord_flip() + theme_bw()

The final plot shows how the average difference varies with the total combined score (maths + science + reading).

ggplot(datafg) +
  aes(x=AverageDiff,y=Total) +
  geom_jitter() +
  geom_smooth(fill="springgreen4",alpha=0.50,col="snow4") + theme_light() +
  labs(x="Average Diff",y="Total",title="Average Diff v/s Total")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

On average, the difference between male and female students is very small, i.e. close to 0.

Made with ❤️ by Shikhar.