Here we are going to use R statistical environment to examine the relationship between variables. First, we are going to read data from an Excel file. After we’ve done this, we are going to do some basic data manipulation. That is, we are going to remove colummns in the R dataframe. After we’ve removed the columns, we are going to calculate some basic descriptive statistics. In the next sections, the relationship between multiple variables are going to be examine using a correlation matrix.
Before we go on, and have a look at what packages that we are going to use, I just want to mention that some of methods used here was found at Erik Marsja’s blog. That said, let’s continue by having a look at the packages.
Here we are going to use tidyverse for reading the Excel file, removing columns, visualizing data, and calculating descriptive statistics. As you may know, you can install r-packages by typing install.packages(PACKAGENAME). For example, to follow this tutorial you need to run this code: install.packages(c('tidyverse'. 'Hmisc'))
First of all, we will load the packages;
library('tidyverse')
## -- Attaching packages -------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library('readxl')
In the next section, we are going to read the data from an Excel file and make a quick exploration of the dataset.
Here’s how to quickly load data from a .xls file using readxl and R:
dataf <- read_xls('PopularKids.xls')
When we have loaded the data (from the Excel file) we can make some basic exploration tasks. That’s exactly what we’re doing next.
Here we explore the data with `dim()’ function to get the number of columns and rows:
dim(dataf)
## [1] 478 11
As you can se in the output we have 478 rows and 11 columns. In the next example, we are going to use names(), an R function, to print the column names.
Here we are going to have a look at the variables (column names) of the dataframe:
names(dataf)
## [1] "Gender" "Grade" "Age" "Race" "Urban/Rural"
## [6] "School" "Goals" "Grades" "Sports" "Looks"
## [11] "Money"
Finally, here are we going to print parts (first 5 rows) of the dataframe:
head(dataf)
## # A tibble: 6 x 11
## Gender Grade Age Race `Urban/Rural` School Goals Grades Sports Looks Money
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 boy 5 11 White Rural Elm Sports 1 2 4 3
## 2 boy 5 10 White Rural Elm Popul~ 2 1 4 3
## 3 girl 5 11 White Rural Elm Popul~ 4 3 1 2
## 4 girl 5 11 White Rural Elm Popul~ 2 3 4 1
## 5 girl 5 10 White Rural Elm Popul~ 4 2 1 3
## 6 girl 5 11 White Rural Elm Popul~ 4 2 1 3
As we saw when printing the first 5 rows there’s a strange column named “Race”, let’s remove the column from the R dataframe using the select() function from dplyr (part of tidyverse):
dataf <- dataf %>%
select(-4)
What we did, above, was to delete the column by it’s index (it was the 4th column. Of course, you could remove the column by it’s name, as well. That is, `select(-Race)’ would have removed the same column.
Here’s the dataframe now:
head(dataf)
## # A tibble: 6 x 10
## Gender Grade Age `Urban/Rural` School Goals Grades Sports Looks Money
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 boy 5 11 Rural Elm Sports 1 2 4 3
## 2 boy 5 10 Rural Elm Popular 2 1 4 3
## 3 girl 5 11 Rural Elm Popular 4 3 1 2
## 4 girl 5 11 Rural Elm Popular 2 3 4 1
## 5 girl 5 10 Rural Elm Popular 4 2 1 3
## 6 girl 5 11 Rural Elm Popular 4 2 1 3
As you can see we have deleted the column named “Race” from the R dataframe. ## Descriptive Statistics Here we are going to summarize the dataframe.
dataf %>% group_by(Gender) %>%
summarise(sd = sd(Grades, na.rm = T),
mean = mean(Grades, na.rm = T),
range = paste(min(Grades, na.rm = T), "-", max(Grades, na.rm = T)),
n = sum(!is.na(Grades)))
## # A tibble: 2 x 5
## Gender sd mean range n
## <chr> <dbl> <dbl> <chr> <int>
## 1 boy 1.05 2.65 1 - 4 227
## 2 girl 1.11 2.59 1 - 4 251
In the code above, we used the %>% (pipe) to group the data by Gender (using gorup_by()). After this, we used the summarise() function to calculate mean, standard deviation, range, and number of observations per group.
Let’s have a look at the relationship between Grades, Sports, Looks, and Money:
library(Hmisc)
ds_cor <- dataf %>%
select(c(Grades, Sports, Looks, Money)) %>%
as.matrix() %>%
rcorr(type = "spearman")
That was it! In the code above, we used dplyr (select()) to select the columns, we created a matrix using as.matrix(). Finally, we used rcor() to calculate the correlations among the variables we wanted to examine and stored it as ds_cor.
Here’s how to print the output:
ds_cor
## Grades Sports Looks Money
## Grades 1.00 -0.15 -0.50 -0.45
## Sports -0.15 1.00 -0.47 -0.32
## Looks -0.50 -0.47 1.00 -0.06
## Money -0.45 -0.32 -0.06 1.00
##
## n= 478
##
##
## P
## Grades Sports Looks Money
## Grades 0.0011 0.0000 0.0000
## Sports 0.0011 0.0000 0.0000
## Looks 0.0000 0.0000 0.2021
## Money 0.0000 0.0000 0.2021