Data structures in R

Most data in R are stored in a dataframe or a tibble. Keeping your data in one of these formats simplifies data analysis and makes many R libraries easier to use.

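library(readr)      # read csv files
library(readxl)     # read xls/xlsx files
library(dplyr)      # data manipulation verbs (filter, select, mutate, ...)
library(knitr)      # report generation, kable() tables
library(tidyverse)  # data manipulation
library(data.table) # fast file reading
library(kableExtra) # nice HTML table formatting
library(gridExtra)  # arranging ggplot in grid
library(caTools)    # train/test split
library(plotrix)    # extra plots, e.g. pie3D()
library(MASS)       # loaded after dplyr, so MASS::select() masks dplyr::select()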

Import data

Data used in this material are downloaded from this link: https://www.kaggle.com/sonujha090/bank-marketing

data_bank <- read.csv("~/ALDA DOC/ALDA 2020/Tirana bank-desktop/Tirana bank-trajnimi qershor Alda/Tirana Bank-R materials/Data _bank dataset/bank-full.csv", sep=";")

Looking at the dataset

View(data_bank)    # open the dataset in the viewer
head(data_bank)    # first six rows
summary(data_bank) # summary statistics
str(data_bank)     # structure of the dataset
nrow(data_bank)    # number of rows
ncol(data_bank)    # number of columns

If we want to code binary data

Let's try it on the housing variable, which has the categories yes and no. We are going to use the ifelse() function.

data_bank$housing <- ifelse(data_bank$housing == "yes", 1, 0)
head(data_bank$housing, 10)
table(data_bank$housing)
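
As a minimal sketch of the same recoding, the block below uses a small made-up data frame (toy_bank and its values are hypothetical, not taken from bank-full.csv). It stores the recoded values in new columns, so the original yes/no column is kept, and shows that converting a logical comparison to integer gives the same coding as ifelse().

```r
# Toy data standing in for data_bank (hypothetical values)
toy_bank <- data.frame(
  age     = c(30, 45, 23, 51),
  housing = c("yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# ifelse() version, stored in a new column so the original is kept
toy_bank$housing_bin <- ifelse(toy_bank$housing == "yes", 1, 0)

# Equivalent coding: a logical comparison converted to integer
toy_bank$housing_int <- as.integer(toy_bank$housing == "yes")

# Cross-check the original categories against the new coding
table(toy_bank$housing, toy_bank$housing_bin)
```

Writing to a new column is a design choice: the original categories stay available for later filters such as housing == "yes".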

Let’s go back to the dataset in its original format. We import it again here because the housing variable was recoded above.

data_bank <- read.csv("~/ALDA DOC/ALDA 2020/Tirana bank-desktop/Tirana bank-trajnimi qershor Alda/Tirana Bank-R materials/Data _bank dataset/bank-full.csv", sep=";")
head(data_bank,10)

Check for missing values (NA)

sum(!complete.cases(data_bank)) # number of rows with at least one NA; returns 0 because this dataset has been cleaned of NAs
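
When a dataset does contain NAs, it usually helps to know which columns they are in. A minimal sketch on a made-up data frame (toy and its values are hypothetical) that counts incomplete rows and then NAs per column:

```r
# Toy data with some missing values (hypothetical values)
toy <- data.frame(
  age     = c(30, NA, 23, 51),
  balance = c(100, 250, NA, NA),
  job     = c("management", "retired", NA, "services"),
  stringsAsFactors = FALSE
)

# Rows containing at least one NA
rows_with_na <- sum(!complete.cases(toy))

# Number of NAs in each column
na_per_column <- colSums(is.na(toy))

rows_with_na
na_per_column
```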

Check for duplicated rows

sum(duplicated(data_bank)) # number of duplicated rows; returns 0 here because the duplicates have already been removed
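
For a dataset that still has repeated rows, a minimal sketch of counting and dropping them with base R (toy and its values are hypothetical):

```r
# Toy data in which the third row repeats the first (hypothetical values)
toy <- data.frame(
  age = c(30, 45, 30, 23),
  job = c("management", "retired", "management", "student"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence of a row
n_dup <- sum(duplicated(toy))

# Keep only the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]

n_dup
toy_unique
```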

Descriptive statistics

summary(data_bank)
summary(data_bank$age)# only for age variable
summary(data_bank[,3:6])# only for variables in columns 3 up to 6

Data Wrangling

filter()

filter() is a function in dplyr that takes logical expressions and returns the rows for which all are TRUE.

library(dplyr)
# filter individuals of age less than 25 years old
filter(data_bank, age < 25) 

# filter individuals with housing == "yes" and age less than 25 years old
filter(data_bank, housing == "yes", age < 25)

# individuals with profession management and age equal to 30 years old
filter(data_bank, job == "management", age == 30)

# individuals with profession management and age less than 25 years old 
filter(data_bank, job == "management", age < 25)

# keep only the retired and management professions
filter(data_bank, job %in% c("retired", "management")) 
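
In filter(), conditions separated by a comma are combined with AND, while | combines them with OR. A minimal sketch of the difference on a made-up data frame (toy and its values are hypothetical):

```r
library(dplyr)

# Toy data standing in for data_bank (hypothetical values)
toy <- data.frame(
  age     = c(22, 30, 24, 61, 30),
  job     = c("student", "management", "management", "retired", "services"),
  housing = c("yes", "no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Comma (same as &): both conditions must hold
both <- filter(toy, housing == "yes", age < 25)

# | : at least one of the conditions must hold
either <- filter(toy, housing == "yes" | age < 25)

nrow(both)
nrow(either)
```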

Exercise a. What is the average age of “management” professionals with housing “yes”?

Hint: you can do this in 2 steps by assigning a variable and then using the mean() function.

Exercise b. What is the average balance for individuals with secondary education?

Solution a.

av.age <- filter(data_bank, job == "management", housing== "yes")  
mean(av.age$age)  

select()

We use select() to subset the data on variables or columns.

We can select multiple columns with a comma, after we specify the data frame (data_bank).

The logic:

select(df, A, B, C): Select the variables A, B and C from the df dataset.
select(df, A:C): Select all variables from A to C from the df dataset.
select(df, -C): Exclude C from the df dataset.
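
The three forms above can be tried on a small made-up data frame (toy and its columns A to D are hypothetical). The dplyr:: prefix is used so the sketch works even when another package also defines select().

```r
library(dplyr)

# Toy data frame with four columns (hypothetical)
toy <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)

pick_some  <- dplyr::select(toy, A, C)  # only A and C
pick_range <- dplyr::select(toy, A:C)   # every column from A to C
drop_one   <- dplyr::select(toy, -C)    # everything except C

names(pick_some)
names(pick_range)
names(drop_one)
```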

If you get the error below, it is likely that you are either using a package besides dplyr that also has a select() function (for example MASS, when it is loaded after dplyr), or you forgot to load the dplyr package with library(dplyr).

data_bank %>% select(education) # Error in select(., education) : unused argument (education)

To avoid the conflict, we can call select() with the dplyr:: prefix:

data_bank %>% dplyr::select(age)

# or select 1st variable (age)
head(data_bank[,1]) 

data_bank %>% dplyr::select(age, marital, duration) 

We can also use - (minus) to deselect columns. The code below will de-select from our dataset the variables age and marital.

data_bank %>% dplyr::select(-age, -marital) 

We can use the pipe to chain those two operations together.

We want to keep only the individuals whose profession (the job variable) is “management” and, for these individuals, show only the “education” and “loan” variables.

data_bank.1 <- data_bank %>%
  filter(job == "management") %>%
  dplyr::select(education, loan)

data_bank.1

Exercise: How can you do the same for individuals with a given “marital” status, showing their “age” and “balance”?

mutate() adds new variables to dataframe

Suppose we want to add a variable that combines two or more existing variables, or a new variable from an existing vector that is not part of the dataframe.

Let’s suppose I want to see the balance per unit of the duration period (balance/duration).

Adding a new variable, named “day.balance”, from two existing variables of the dataframe:

data_bank %>%
  mutate(day.balance = round(balance/duration,2))

I can also add a variable from a vector that exists outside the dataset.

X <- rnorm(nrow(data_bank)) # one random value per row
length(X)
data_bank %>% mutate(X)
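
Note that mutate() returns a new data frame and does not change the original one unless the result is assigned. A minimal sketch on made-up data (toy, its values and the column name ratio are hypothetical):

```r
library(dplyr)

# Toy data standing in for data_bank (hypothetical values)
toy <- data.frame(balance = c(100, 250, 40), duration = c(50, 100, 8))

# Without assignment the result is only printed; toy itself is unchanged
toy %>% mutate(ratio = round(balance / duration, 2))
has_ratio_before <- "ratio" %in% names(toy)

# Assign the result to keep the new column
toy <- toy %>% mutate(ratio = round(balance / duration, 2))
has_ratio_after <- "ratio" %in% names(toy)

toy
```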

group_by() operates on groups

summarize() keeps only the columns that are used for grouping or that are created by the summary.

ungroup() removes the grouping and it’s good to get in the habit of using it after a group_by().

data_bank %>%
  filter(age == 30) %>%
  group_by(marital) %>%
  summarize(total_duration = sum(duration)) %>%
  ungroup()
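
A minimal sketch of the same pattern with several named summaries in one summarize() call, on made-up data (toy, its values and the summary column names are hypothetical). n() counts the rows in each group.

```r
library(dplyr)

# Toy data standing in for data_bank (hypothetical values)
toy <- data.frame(
  age      = c(30, 30, 30, 30, 41),
  marital  = c("married", "single", "married", "single", "married"),
  duration = c(100, 200, 300, 50, 999),
  stringsAsFactors = FALSE
)

by_marital <- toy %>%
  filter(age == 30) %>%
  group_by(marital) %>%
  summarize(
    total_duration = sum(duration),   # sum within each group
    mean_duration  = mean(duration),  # mean within each group
    n_clients      = n()              # number of rows in each group
  ) %>%
  ungroup()

by_marital
```

Only marital and the three summary columns remain in the result; age and duration are dropped.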

arrange()

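arrange() sorts the rows by one or more columns. The default order is ascending (alphabetically from A to Z for text); wrap a column in desc() for descending order.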

# arrange by job
data_bank %>%
  group_by(marital, job) %>%
  summarize(mean_duration = mean(duration)) %>%
  arrange(job)

# arrange by marital status
data_bank %>%
  group_by(marital, job) %>%
  summarize(mean_duration = mean(duration)) %>%
  arrange(marital)
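
A minimal sketch of ascending and descending order on a made-up summary table (toy and its values are hypothetical):

```r
library(dplyr)

# Toy summary table (hypothetical values)
toy <- data.frame(
  job           = c("retired", "admin", "management"),
  mean_duration = c(250, 310, 180),
  stringsAsFactors = FALSE
)

# Ascending order: A to Z for a text column
by_job <- arrange(toy, job)

# Descending order: largest mean_duration first
by_duration <- arrange(toy, desc(mean_duration))

by_job
by_duration
```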

Try to combine the above functions by yourself.

Exercise What is the maximum duration for all jobs at age 30?

Exercise Try to work out what output the command below will give you.

library(tidyverse) # run install.packages('tidyverse') first if it is not installed
# minimum day.balance for each marital status
data_bank.2 <- data_bank %>% 
  dplyr::select(-contact, -poutcome) %>% 
  dplyr::group_by(marital) %>%
  dplyr::mutate(day.balance = round(balance/duration,2)) %>%
  dplyr::summarize(min_day.balance = min(day.balance)) %>%
  dplyr::ungroup() 

GRAPHICS

plot() function

Plot only one variable.

# not a good graphical presentation of age
plot(data_bank$age) 
# a better view 
plot(table(data_bank$age),ylab = "Frequency",xlab="Age",main="Age of bank data",col="red",lwd=5) 

Exercise Try to add text (“Graphics in R”) to your second graph at coordinates age 68 and frequency 1400.

histograms hist()

Observe the argument breaks=. For more arguments of hist() see ?hist.

hist(data_bank$age, main="Histogram for age variable",xlab="age",ylab="freq",col="red")
hist(data_bank$age, main="Histogram for age variable",xlab="",ylab="",col="red",breaks = 5)
hist(data_bank$age, main="Histogram for age variable",xlab="",ylab="",col="red",breaks = 10)
hist(data_bank$age, main="Histogram for age variable",xlab="",ylab="",col="red",breaks = 15)
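
A single number passed to breaks is treated by hist() as a suggestion, and R picks "pretty" breakpoints near that count. To control the bins exactly, pass a vector of breakpoints. A minimal sketch on simulated ages (the ages vector is made up, not taken from data_bank); plot = FALSE returns the histogram object without drawing it.

```r
# Simulated ages between 18 and 90 (hypothetical data)
set.seed(1)
ages <- round(runif(200, min = 18, max = 90))

# Explicit breakpoints: bins of width 10 from 10 to 90
h <- hist(ages, breaks = seq(10, 90, by = 10), plot = FALSE)

h$breaks       # the breakpoints actually used
h$counts       # number of observations in each bin
sum(h$counts)  # every observation falls in exactly one bin
```

Dropping plot = FALSE draws the histogram with the same bins.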

PIE Chart pie()

When applying pie() we need to pass the frequencies of the categories, because on raw data R will draw a separate slice for each individual (observation).

# not the appropriate way to obtain a pie chart
pie(data_bank$age)

# combined with the function table() 
# for housing 
pie(table(data_bank$housing),main="Housing or not",col=c("red","blue"))

# for marital variable
pie(table(data_bank$marital),main="Civil status",col=c("red","blue","yellow"))

# changing colors based on the categories of marital variable
pie(table(data_bank$marital),main="Civil status",col=rainbow(length(table(data_bank$marital))))

pie(table(data_bank$month),main="Month",col=rainbow(length(table(data_bank$month))))

pie(table(data_bank$job),main="Profession (job)",col=rainbow(length(table(data_bank$job))))

# Add percentage to labels
# create a vector of percentage 
perc<- round(100*(table(data_bank$marital)/length(data_bank$marital)))
perc
# create a vector of labels
label<-c("divorced","married","single")

# paste labels and percentages together
label <- paste(label, perc) # combine each label with its percentage
# add % to each label
label <- paste(label,"%",sep="")
# obtain the pie chart
pie(table(data_bank$marital),label,main="Civil status",col=rainbow(length(table(data_bank$marital))))
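
The labels can also be built from the table itself, so that they stay in the same order as the slices without being typed by hand. A minimal sketch on a made-up vector (marital and its values are hypothetical), using prop.table() for the proportions:

```r
# Toy marital status vector (hypothetical values)
marital <- c("married", "single", "married", "divorced", "married",
             "single", "married", "single", "married", "divorced")

counts <- table(marital)                  # frequencies, categories in alphabetical order
perc   <- round(100 * prop.table(counts)) # percentages for each category

# Category names come from the table, so labels and slices always match
label <- paste0(names(counts), " ", perc, "%")
label
```

The resulting label vector can be passed to pie() as its second argument, in the same way as above.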

# 3D Exploded Pie Chart
library(plotrix)
pie3D(table(data_bank$marital),explode=0.1,main="Civil status",col=c("red","blue","yellow"))
# the explode argument sets the distance between the slices; try explode=0.5

Boxplot

A boxplot can be drawn for a single numerical variable, or for a numerical variable split by the categories of a non-numerical variable.

boxplot(data_bank$age,col="green",main="Boxplot age",ylab="age")

boxplot(data_bank$balance,col="red",main="Boxplot balance",ylab="balance")

boxplot(data_bank$age~data_bank$marital,col=c("red","green","purple"),main="BoxPlot civil status",ylab="age")

boxplot(data_bank$duration~data_bank$marital,col=c("red","green","purple"),main="BoxPlot civil status",ylab="Duration",xlab="civil status")

Correlation plots

library(psych) # for pairs.panels()

# scatterplot matrix of three numerical variables
data_bank.1 <- data_bank %>% dplyr::select(duration, age, balance)
pairs(data_bank.1,col="red",main="Scatterplot for many variables")

# Calculate correlations between two variables
cor1=cor(data_bank$age,data_bank$balance)
cor1
cor2=cor(data_bank$age,data_bank$duration)
cor2
# columns 1, 6 and 12 are age, balance and duration
pairs.panels(data_bank[,c(1,6,12)])
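
Selecting columns by position breaks if the column order changes. As a minimal sketch, the numerical columns can instead be picked by type before computing the correlation matrix (toy and its values are hypothetical):

```r
# Toy data with one non-numerical column (hypothetical values)
toy <- data.frame(
  age      = c(25, 32, 47, 51, 38),
  job      = c("student", "admin", "management", "retired", "services"),
  balance  = c(120, 560, 1500, 1300, 700),
  duration = c(300, 180, 90, 120, 200),
  stringsAsFactors = FALSE
)

# TRUE for numerical columns, FALSE otherwise
num_cols <- sapply(toy, is.numeric)

# Correlation matrix of the numerical columns only
cor_mat <- cor(toy[, num_cols])
round(cor_mat, 2)
```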

library(corrplot) and library(PerformanceAnalytics)

Other libraries for correlation plots.

library(corrplot)
library("PerformanceAnalytics")
cor.mat=cor(data_bank[,c(1,6,12)])# correlation matrix 
cor.mat
chart.Correlation(data_bank[,c(1,6,12)],histogram=TRUE, pch=19)

library(corrplot) and library(RColorBrewer)

library(corrplot)
library(RColorBrewer)
M <-cor(data_bank[,c(1,6,12)]) #select only numerical variables
corrplot(M, type="upper", order="hclust",col=brewer.pal(n=8, name="RdYlBu"))

library(ggplot2) and library(GGally)

library(ggplot2)
library(GGally)

# display a pair plot of the three selected columns
GGally::ggpairs(data_bank[,c(1,6,12)])

ESQUISSE library

A graphical add-in in R-Studio

The purpose of this add-in is to let you explore your data quickly to extract the information they hold. You can create visualization with {ggplot2}, filter data with {dplyr} and retrieve generated code.

To run Esquisse in RStudio, copy and paste the following code (don’t forget to install “esquisse” first):

library(esquisse)
esquisse::esquisser()

To display esquisser, you can try one of the viewer options to launch it in your browser or in the pane:

esquisse::esquisser(data, viewer = "browser")

library(esquisse)
bank <- read.csv("~/CRYSTAL System-Data Science project/Didactic materials Crystal Solution/DATASET-Crystal/bank.csv")
# to display esquisser you can try one of the viewer options below
esquisse::esquisser(viewer = "browser")
# or
esquisse::esquisser(viewer = "pane")

Select data

Import any dataset available in your R session.

Create a plot

Start creating a plot by dragging any variable into the X and Y boxes and the other grouping boxes.

Labels and Titles

Add a title, a subtitle and axis names.

Plot Options

Modify the plot options

Data

Modify data from the dataset

Export and Code

Export the code to the R console and modify or reuse it by changing the arguments.

For more on Esquisse see:

https://cran.r-project.org/web/packages/esquisse/readme/README.html

https://cran.r-project.org/web/packages/esquisse/vignettes/get-started.html

11 January 2021

Eralda Gjika (Dhamo)

