Here we are going to use R statistical environment to examine the relationship between variables. First, we are going to read data from an Excel file. After we’ve done this, we are going to do some basic data manipulation. That is, we are going to remove colummns in the R dataframe. After we’ve removed the columns, we are going to calculate some basic descriptive statistics. In the next sections, the relationship between multiple variables are going to be examine using a correlation matrix.

Before we go on, and have a look at what packages that we are going to use, I just want to mention that some of methods used here was found at Erik Marsja’s blog. That said, let’s continue by having a look at the packages.

Packages used

Here we are going to use tidyverse for reading the Excel file, removing columns, visualizing data, and calculating descriptive statistics. As you may know, you can install r-packages by typing install.packages(PACKAGENAME). For example, to follow this tutorial you need to run this code: install.packages(c('tidyverse'. 'Hmisc'))

First of all, we will load the packages;

library('tidyverse')

## -- Attaching packages -------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.1
## v tidyr   1.1.1     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library('readxl')

In the next section, we are going to read the data from an Excel file and make a quick exploration of the dataset.

How to Read an Excel (.xlsx) File in R

Here’s how to quickly load data from a .xls file using readxl and R:

dataf <- read_xls('PopularKids.xls')

When we have loaded the data (from the Excel file) we can make some basic exploration tasks. That’s exactly what we’re doing next.

Exploring the Dimension of the Dataset

Here we explore the data with `dim()’ function to get the number of columns and rows:

dim(dataf)

## [1] 478  11

As you can se in the output we have 478 rows and 11 columns. In the next example, we are going to use names(), an R function, to print the column names.

Getting, and printing, the Column Names:

Here we are going to have a look at the variables (column names) of the dataframe:

names(dataf)

##  [1] "Gender"      "Grade"       "Age"         "Race"        "Urban/Rural"
##  [6] "School"      "Goals"       "Grades"      "Sports"      "Looks"      
## [11] "Money"

Printing Parts of the Dataframe

Finally, here are we going to print parts (first 5 rows) of the dataframe:

head(dataf)

## # A tibble: 6 x 11
##   Gender Grade   Age Race  `Urban/Rural` School Goals  Grades Sports Looks Money
##   <chr>  <dbl> <dbl> <chr> <chr>         <chr>  <chr>   <dbl>  <dbl> <dbl> <dbl>
## 1 boy        5    11 White Rural         Elm    Sports      1      2     4     3
## 2 boy        5    10 White Rural         Elm    Popul~      2      1     4     3
## 3 girl       5    11 White Rural         Elm    Popul~      4      3     1     2
## 4 girl       5    11 White Rural         Elm    Popul~      2      3     4     1
## 5 girl       5    10 White Rural         Elm    Popul~      4      2     1     3
## 6 girl       5    11 White Rural         Elm    Popul~      4      2     1     3

Dropping (Removing) a Column in R

As we saw when printing the first 5 rows there’s a strange column named “Race”, let’s remove the column from the R dataframe using the select() function from dplyr (part of tidyverse):

dataf <- dataf %>% 
  select(-4)

What we did, above, was to delete the column by it’s index (it was the 4th column. Of course, you could remove the column by it’s name, as well. That is, `select(-Race)’ would have removed the same column.

Here’s the dataframe now:

head(dataf)

## # A tibble: 6 x 10
##   Gender Grade   Age `Urban/Rural` School Goals   Grades Sports Looks Money
##   <chr>  <dbl> <dbl> <chr>         <chr>  <chr>    <dbl>  <dbl> <dbl> <dbl>
## 1 boy        5    11 Rural         Elm    Sports       1      2     4     3
## 2 boy        5    10 Rural         Elm    Popular      2      1     4     3
## 3 girl       5    11 Rural         Elm    Popular      4      3     1     2
## 4 girl       5    11 Rural         Elm    Popular      2      3     4     1
## 5 girl       5    10 Rural         Elm    Popular      4      2     1     3
## 6 girl       5    11 Rural         Elm    Popular      4      2     1     3

As you can see we have deleted the column named “Race” from the R dataframe. ## Descriptive Statistics Here we are going to summarize the dataframe.

dataf %>% group_by(Gender) %>%
  summarise(sd = sd(Grades, na.rm = T),
            mean = mean(Grades, na.rm = T),
            range = paste(min(Grades, na.rm = T), "-", max(Grades, na.rm = T)),
            n = sum(!is.na(Grades)))

## # A tibble: 2 x 5
##   Gender    sd  mean range     n
##   <chr>  <dbl> <dbl> <chr> <int>
## 1 boy     1.05  2.65 1 - 4   227
## 2 girl    1.11  2.59 1 - 4   251

In the code above, we used the %>% (pipe) to group the data by Gender (using gorup_by()). After this, we used the summarise() function to calculate mean, standard deviation, range, and number of observations per group.

Correlation Matrix

Let’s have a look at the relationship between Grades, Sports, Looks, and Money:

library(Hmisc)
ds_cor <- dataf %>%
  select(c(Grades,  Sports, Looks, Money)) %>% 
  as.matrix() %>%
  rcorr(type = "spearman")

That was it! In the code above, we used dplyr (select()) to select the columns, we created a matrix using as.matrix(). Finally, we used rcor() to calculate the correlations among the variables we wanted to examine and stored it as ds_cor.

Here’s how to print the output:

ds_cor

##        Grades Sports Looks Money
## Grades   1.00  -0.15 -0.50 -0.45
## Sports  -0.15   1.00 -0.47 -0.32
## Looks   -0.50  -0.47  1.00 -0.06
## Money   -0.45  -0.32 -0.06  1.00
## 
## n= 478 
## 
## 
## P
##        Grades Sports Looks  Money 
## Grades        0.0011 0.0000 0.0000
## Sports 0.0011        0.0000 0.0000
## Looks  0.0000 0.0000        0.2021
## Money  0.0000 0.0000 0.2021

Correlation analysis in R

Wayan

2020-08-23