Sample data

This time we are going to use the palmerpenguins dataset. The goal of this sample dataset is to provide a great example for data exploration & visualization, as an alternative to iris.

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The palmerpenguins package contains two datasets.

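library(palmerpenguins)
library(tidyverse)   # dplyr (glimpse, recode) and ggplot2
library(DescTools)   # CramerV()
library(GGally)      # ggpairs()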
data(package = 'palmerpenguins')
head(penguins)

## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge~           39.1          18.7              181        3750 male 
## 2 Adelie  Torge~           39.5          17.4              186        3800 fema~
## 3 Adelie  Torge~           40.3          18                195        3250 fema~
## 4 Adelie  Torge~           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge~           36.7          19.3              193        3450 fema~
## 6 Adelie  Torge~           39.3          20.6              190        3650 male 
## # ... with 1 more variable: year <int>

One is called penguins, and is a simplified version of the raw data; the other, penguins_raw, holds the raw data. See ?penguins and ?penguins_raw for more info. Both datasets contain data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.

The curated palmerpenguins::penguins dataset contains 8 variables (n = 344 penguins). You can read more about the variables by typing ?penguins.

glimpse(penguins)

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

The palmerpenguins::penguins data contains 333 complete cases; the remaining 11 rows hold a total of 19 missing values.

visdat::vis_dat(penguins)
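
To make the difference between complete cases and missing values concrete, here is a minimal base-R sketch on a made-up four-row data frame. The `toy` values are hypothetical and are not taken from penguins. `complete.cases()` flags rows without any `NA`, while `sum(is.na())` counts individual missing cells.

```r
# Hypothetical toy data: row 2 is entirely missing, row 3 lacks only sex
toy <- data.frame(
  bill = c(39.1, NA, 40.3, 36.7),
  mass = c(3750, NA, 3250, 3450),
  sex  = c("male", NA, NA, "female")
)

n_complete <- sum(complete.cases(toy))  # rows with no NA at all
n_missing  <- sum(is.na(toy))           # individual NA cells

n_complete
n_missing
```

Running the same two calls on penguins should give the counts quoted above.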

Exploring correlations

Bill dimensions

The culmen is the upper ridge of a bird’s bill. In the simplified penguins data, culmen length and depth are renamed as variables bill_length_mm and bill_depth_mm to be more intuitive.

For this penguin data, the culmen (bill) length and depth are measured as shown below.

Exercise 1.

Penguin mass vs. flipper length

## Warning: Removed 2 rows containing missing values (geom_point).
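
Only the warning from this exercise appears above. The following is one possible sketch, not necessarily the original solution: a scatterplot of body mass against flipper length, coloured by species, together with the Pearson correlation coefficient. The object names `mass_flipper` and `r_mass_flipper` and the styling choices are assumptions borrowed from the other exercises.

```r
library(palmerpenguins)
library(ggplot2)

# Scatterplot of the two measurements; rows with NA are dropped with a warning
mass_flipper <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.8) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  theme_minimal() +
  labs(x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Species")

mass_flipper

# Pearson correlation, using only rows where both measurements are present
r_mass_flipper <- cor(penguins$flipper_length_mm, penguins$body_mass_g,
                      use = "complete.obs")
r_mass_flipper
```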

Exercise 2.

Flipper length vs. Species

flipper_box <- ggplot(data = penguins, aes(x = species, y = flipper_length_mm)) +
  geom_boxplot(aes(color = species), width = 0.3, show.legend = FALSE) +
  geom_jitter(aes(color = species), alpha = 0.5, show.legend = FALSE, position = position_jitter(width = 0.2, seed = 0)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  theme_minimal() +
  labs(x = "Species",
       y = "Flipper length (mm)")

flipper_box

Exercise 3.

Penguin flipper and body mass vs. sex

ggplot(penguins, aes(x = flipper_length_mm,
                     y = body_mass_g)) +
  geom_point(aes(color = sex)) +
  theme_minimal() +
  scale_color_manual(values = c("darkorange","cyan4"), na.translate = FALSE) +
  labs(title = "Penguin flipper and body mass",
       subtitle = "Dimensions for male and female Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin sex") +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "white", color = NA),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot") +
  facet_wrap(~species)

## Warning: Removed 11 rows containing missing values (geom_point).

Sample data 2

The second sample dataset today is an educational dataset collected from a learning management system (LMS) called Kalboard 360. Kalboard 360 is a multi-agent LMS designed to facilitate learning through the use of leading-edge technology. The system gives users synchronous access to educational resources from any device with an Internet connection.
The dataset consists of 480 student records and 16 features. The features fall into three major categories: (1) demographic features, such as gender and nationality; (2) academic background features, such as educational stage, grade level and section; (3) behavioral features, such as raising a hand in class, opening resources, parents answering the survey, and school satisfaction.

The dataset consists of 305 males and 175 females. The students come from different countries: 179 from Kuwait, 172 from Jordan, 28 from Palestine, 22 from Iraq, 17 from Lebanon, 12 from Tunis, 11 from Saudi Arabia, 9 from Egypt, 7 from Syria, 6 each from the USA, Iran and Libya, 4 from Morocco and 1 from Venezuela.

The dataset was collected over two academic semesters: 245 student records come from the first semester and 235 from the second.

The dataset also includes a school attendance feature: students are classified into two categories based on their absence days, with 191 students exceeding 7 absence days and 289 students having fewer than 7.

Sample data 2 - attributes

1 Gender - student’s gender (nominal: ‘Male’ or ‘Female’)
2 Nationality - student’s nationality (nominal: ‘Kuwait’, ‘Lebanon’, ‘Egypt’, ‘SaudiArabia’, ‘USA’, ‘Jordan’, ‘Venezuela’, ‘Iran’, ‘Tunis’, ‘Morocco’, ‘Syria’, ‘Palestine’, ‘Iraq’, ‘Lybia’)
3 Place of birth - student’s place of birth (nominal: ‘Kuwait’, ‘Lebanon’, ‘Egypt’, ‘SaudiArabia’, ‘USA’, ‘Jordan’, ‘Venezuela’, ‘Iran’, ‘Tunis’, ‘Morocco’, ‘Syria’, ‘Palestine’, ‘Iraq’, ‘Lybia’)
4 Educational Stages - educational level the student belongs to (nominal: ‘lowerlevel’, ‘MiddleSchool’, ‘HighSchool’)
5 Grade Levels - grade the student belongs to (nominal: ‘G-01’, ‘G-02’, ‘G-03’, ‘G-04’, ‘G-05’, ‘G-06’, ‘G-07’, ‘G-08’, ‘G-09’, ‘G-10’, ‘G-11’, ‘G-12’)
6 Section ID - classroom the student belongs to (nominal: ‘A’, ‘B’, ‘C’)
7 Topic - course topic (nominal: ‘English’, ‘Spanish’, ‘French’, ‘Arabic’, ‘IT’, ‘Math’, ‘Chemistry’, ‘Biology’, ‘Science’, ‘History’, ‘Quran’, ‘Geology’)
8 Semester - school year semester (nominal: ‘First’, ‘Second’)
9 Parent responsible for student (nominal: ‘mom’, ‘father’)
10 Raised hand - how many times the student raises his/her hand in the classroom (numeric: 0-100)
11 Visited resources - how many times the student visits course content (numeric: 0-100)
12 Viewing announcements - how many times the student checks new announcements (numeric: 0-100)
13 Discussion groups - how many times the student participates in discussion groups (numeric: 0-100)
14 Parent Answering Survey - whether the parent answered the surveys provided by the school (nominal: ‘Yes’, ‘No’)
15 Parent School Satisfaction - the parent’s degree of satisfaction with the school (nominal: ‘Yes’, ‘No’)
16 Student Absence Days - the number of absence days for each student (nominal: above-7, under-7)

The students are classified into three numerical intervals based on their total grade/mark:
Low-Level: interval includes values from 0 to 69,
Middle-Level: interval includes values from 70 to 89,
High-Level: interval includes values from 90 to 100.
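
As an illustration of the three intervals, a minimal sketch with base R's `cut()` follows. The `marks` vector is hypothetical (the dataset already ships with the resulting class rather than the raw marks), and the labels L, M and H are assumed to match the class codes used in the exercises.

```r
# Hypothetical marks chosen to sit on the interval boundaries
marks <- c(0, 69, 70, 89, 90, 100)

# Right-closed intervals: up to 69 -> L, 70-89 -> M, 90-100 -> H
grade_class <- cut(marks,
                   breaks = c(-Inf, 69, 89, 100),
                   labels = c("L", "M", "H"))

data.frame(marks, grade_class)
```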

Exercise 4.

Student Absence Days vs. Total grade/mark (‘Class’):

tb <- table(edu$Class, edu$StudentAbsenceDays)
CramerV(tb)

## [1] 0.6849647
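
For readers who want to see what sits behind this number, here is a hedged base-R sketch of Cramér's V computed from the chi-squared statistic, V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The counts in `tb_counts` are an assumption: they were back-calculated from the row proportions and the absence totals (191 and 289) reported in this document, because the raw contingency table itself is not printed.

```r
# Assumed counts, reconstructed from the reported proportions and totals
tb_counts <- matrix(c(116,  11,
                       71, 140,
                        4, 138),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(Class = c("L", "M", "H"),
                                    Absence = c("Above-7", "Under-7")))

# Pearson chi-squared statistic (no continuity correction)
chi2 <- unname(chisq.test(tb_counts, correct = FALSE)$statistic)

n <- sum(tb_counts)            # total number of students
k <- min(dim(tb_counts)) - 1   # smaller table dimension minus one

cramers_v <- sqrt(chi2 / (n * k))
round(cramers_v, 3)
```

With these reconstructed counts the result agrees with the CramerV() output above to three decimals.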

cramer_v <- CramerV(tb)   # avoid naming the result `c`, which masks base::c()
sbtitle <- 
  sprintf("Cramer's V correlation coefficient = %.2f", cramer_v)

tb.prop <- prop.table(tb, 1)
tb.prop

##    
##        Above-7    Under-7
##   L 0.91338583 0.08661417
##   M 0.33649289 0.66350711
##   H 0.02816901 0.97183099

tb.df <- as.data.frame(tb.prop)
glimpse(tb.df)

## Rows: 6
## Columns: 3
## $ Var1 <fct> L, M, H, L, M, H
## $ Var2 <fct> Above-7, Above-7, Above-7, Under-7, Under-7, Under-7
## $ Freq <dbl> 0.91338583, 0.33649289, 0.02816901, 0.08661417, 0.66350711, 0.971~

names(tb.df) <- c("Totalgrade", "StudentAbsenceDays", "Frequency")
ggplot(tb.df, aes(x = Totalgrade, y = Frequency, fill = StudentAbsenceDays)) +
  geom_col(position = "dodge") +
  labs(title = "Final grade vs. Student Absence",
       subtitle = sbtitle,
       x = "Final grade", y = "Proportion of students within grade",
       fill = "Absence days")

Exercise 5.

Raised hands vs. Total grade/mark (‘Class’)

First - recode total grades into numerical values (1, 2, 3)
Second - make sure the recoded variable is stored as numeric
Third - calculate the rank correlation coefficient (Kendall’s Tau-b, which adjusts for ties)

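# Recode the ordered grade classes to 1, 2, 3 (dplyr::recode); Class lives in edu
edu$class_num <- recode(edu$Class, "L" = 1, "M" = 2, "H" = 3)
edu$class_num <- as.numeric(edu$class_num)
edu$raisedhands <- as.numeric(edu$raisedhands)
# cor() with method = "kendall" returns Tau-b when ties are present
rank_coef <- cor(edu$class_num, edu$raisedhands, method = "kendall")
sbtitle <- 
  sprintf("Kendall's Tau-b correlation coefficient = %.2f", rank_coef)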
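
To show what the tie adjustment does, here is a small self-contained sketch on made-up data (the `grade` and `hands` vectors are hypothetical). It converts the classes to integers through an ordered factor, which is an alternative to recoding rather than the approach used above, and compares Tau-b from `cor()` with an unadjusted Tau-a computed by hand.

```r
# Hypothetical data: each grade class appears twice, so grade has ties
grade <- c("L", "L", "M", "M", "H", "H")
hands <- c(10, 25, 30, 55, 70, 90)

# Map L < M < H to 1, 2, 3 via the factor level order
grade_num <- as.integer(factor(grade, levels = c("L", "M", "H")))

# Tau-b: base R's Kendall coefficient, which corrects for ties
tau_b <- cor(grade_num, hands, method = "kendall")

# Tau-a by hand: (concordant - discordant) / number of pairs, ties ignored
idx <- combn(length(hands), 2)
s <- sum(sign(grade_num[idx[1, ]] - grade_num[idx[2, ]]) *
         sign(hands[idx[1, ]] - hands[idx[2, ]]))
tau_a <- s / ncol(idx)

c(tau_a = tau_a, tau_b = tau_b)
```

Because the tied pairs shrink the denominator, Tau-b comes out larger than Tau-a here even though every untied pair is concordant.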

p <- ggplot(edu, aes(x = Class, y = raisedhands)) +
  geom_boxplot()
p + geom_jitter(shape = 16, position = position_jitter(0.2)) +
  labs(title = "Raised hands vs. Total grade/mark",
       subtitle = sbtitle,
       x = "Total grade/mark (Class)", y = "Raised hands")

Exercise 6.

Raised hands vs. Visited Resources (with and without controlling for Discussion in the class or/and Gender).

ggplot(edu, aes(x = VisITedResources,
                y = raisedhands)) +
  geom_point(aes(color = gender)) +
  theme_minimal()

Let’s take a look at the scatterplots together with linear correlation coefficients.

ggpairs(edu,
        columns = c("VisITedResources", "raisedhands", "Discussion"),
        title = "Bivariate analysis",
        upper = list(continuous = wrap("cor", size = 3)),
        lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1)),
        mapping = aes(color = gender))

If you are interested in the strength of this relationship on its own, free of the influence of some external factor(s), then you should calculate partial and/or semi-partial correlation coefficients:

##    estimate      p.value statistic   n gp  Method
## 1 0.6674844 4.475026e-63  19.57778 480  1 pearson

##    estimate      p.value statistic   n gp  Method
## 1 0.6828498 4.904621e-67  20.41405 480  1 pearson

##    estimate      p.value statistic   n gp  Method
## 1 0.6278673 7.010728e-54  17.61849 480  1 pearson

As you can see above, the relationship between raisedhands and VisITedResources changes only slightly when we control for Discussion or gender.
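
The idea behind a partial correlation can be sketched in base R without any extra package: regress both variables on the control variable and correlate the residuals. The example below uses simulated data (the variables `x`, `y`, `z` and their coefficients are made up for illustration and have nothing to do with the edu dataset), and checks the residual approach against the textbook first-order formula.

```r
set.seed(1)
n <- 200
z <- rnorm(n)                       # control variable
x <- 0.6 * z + rnorm(n)             # depends on z
y <- 0.5 * x + 0.4 * z + rnorm(n)   # depends on x and z

# Partial correlation of x and y given z: correlate what z cannot explain
res_x <- resid(lm(x ~ z))
res_y <- resid(lm(y ~ z))
partial_r <- cor(res_x, res_y)

# Same quantity from the pairwise Pearson coefficients
r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)
partial_formula <- (r_xy - r_xz * r_yz) /
  sqrt((1 - r_xz^2) * (1 - r_yz^2))

# Semi-partial correlation: z is removed from x only
semipartial_r <- cor(y, res_x)

c(raw = r_xy, partial = partial_r, semipartial = semipartial_r)
```

A partial coefficient close to the raw one, as in the output above for the edu data, suggests the control variable explains little of the association.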

Bivariate Analysis - exercises

Karol Flisikowski

28 04 2021