Data structures in R
Most data in R are stored in a data frame or a tibble. Keeping your data in one of these formats simplifies the analysis and makes it easier to use many R packages.
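As a minimal sketch of what a data frame looks like, the toy example below builds one by hand in base R. The object `clients` and its values are invented for illustration and are not part of the bank dataset.

```r
# A data frame is a list of equal-length columns, each with its own type.
# `clients` is a hypothetical toy table, not the bank dataset.
clients <- data.frame(
  age     = c(34, 51, 27),
  job     = c("management", "retired", "student"),
  housing = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

str(clients)            # one line per column: name, type, first values
is.data.frame(clients)  # TRUE
dim(clients)            # number of rows and number of columns
sapply(clients, class)  # the class of each column
```

A tibble is the tidyverse variant of the same structure; if the tibble package is installed, tibble::as_tibble(clients) converts the toy table into one.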
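library(readr) # read csv files
library(readxl) # read xls/xlsx files
library(dplyr) # data manipulation verbs (filter, select, mutate, ...)
library(knitr) # dynamic reports and tables (kable)
library(tidyverse) # collection of data science packages (includes dplyr, ggplot2, readr)
library(data.table) # fast file reading
library(kableExtra) # nicer HTML table formatting
library(gridExtra) # arranging ggplot objects in a grid
library(caTools) # splitting data (e.g. sample.split)
library(plotrix) # extra plots such as pie3D()
library(MASS) # statistical functions and datasets; its select() masks dplyr::select()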
Import data
Data used in this material are downloaded from this link: https://www.kaggle.com/sonujha090/bank-marketing
data_bank <- read.csv("~/ALDA DOC/ALDA 2020/Tirana bank-desktop/Tirana bank-trajnimi qershor Alda/Tirana Bank-R materials/Data _bank dataset/bank-full.csv", sep=";")
Looking at the dataset
View(data_bank) # open the dataset in the data viewer
head(data_bank) # first six rows
summary(data_bank) # summary statistics
str(data_bank) # structure of the dataset
nrow(data_bank) # number of rows
ncol(data_bank) # number of columns
Coding binary data
Let’s try it on the housing variable, which has the categories yes and no. We are going to use the ifelse() function.
data_bank$housing <- ifelse(data_bank$housing == "yes", 1, 0)
head(data_bank$housing,10)
table(data_bank$housing)
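The recoding step can be tried on a small vector first. This is only a sketch: the vector `housing` below is a made-up stand-in for the housing column, and the as.integer() shortcut is an alternative that is not used elsewhere in this material.

```r
# Hypothetical toy vector standing in for data_bank$housing
housing <- c("yes", "no", "yes", "yes", "no")

# ifelse(test, value_if_true, value_if_false) works element by element
housing_bin <- ifelse(housing == "yes", 1, 0)
housing_bin

# Alternative shortcut: a logical comparison converted to integers
housing_int <- as.integer(housing == "yes")
housing_int

table(housing_bin)  # counts of 0s and 1s
```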
Let’s go back to the dataset in its original format. We import it again here because the housing variable was recoded above.
data_bank <- read.csv("~/ALDA DOC/ALDA 2020/Tirana bank-desktop/Tirana bank-trajnimi qershor Alda/Tirana Bank-R materials/Data _bank dataset/bank-full.csv", sep=";")
head(data_bank,10)
Check for missing values (NA)
sum(!complete.cases(data_bank)) # number of rows with at least one NA; returns 0 because this dataset has already been cleaned of NAs
Check for duplicated rows
sum(duplicated(data_bank)) # number of duplicated rows; returns 0 because the duplicates have already been removed
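Because both checks return 0 on the cleaned bank data, a toy example shows more clearly what they count. The data frame `toy` is invented for illustration, and the cleaning step at the end is one possible approach, not something this material prescribes.

```r
# Hypothetical toy data with one missing value and one repeated row
toy <- data.frame(
  age     = c(30, 45, NA, 30),
  marital = c("single", "married", "single", "single")
)

complete.cases(toy)        # FALSE for rows that contain at least one NA
sum(!complete.cases(toy))  # number of incomplete rows
colSums(is.na(toy))        # number of NAs per column

duplicated(toy)            # TRUE for rows that repeat an earlier row
sum(duplicated(toy))       # number of duplicated rows

# One possible cleaning step: drop incomplete rows, then repeated rows
toy_clean <- unique(na.omit(toy))
toy_clean
```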
Descriptive statistics
summary(data_bank)
summary(data_bank$age)# only for age variable
summary(data_bank[,3:6])# only for variables in columns 3 up to 6
Data Wrangling
filter()
filter() is a function in dplyr that takes logical expressions and returns the rows for which all are TRUE.
library(dplyr)
# filter individuals of age less than 25 years old
filter(data_bank, age < 25)
# filter individuals with housing == yes and age less than 25 years old
filter(data_bank, housing == "yes", age < 25)
# individuals with profession management and age =30 years old
filter(data_bank, job == "management", age == 30)
# individuals with profession management and age less than 25 years old
filter(data_bank, job == "management", age < 25)
# we want to display in job the retired and management professions
filter(data_bank, job %in% c("retired", "management"))
Exercise a. What is the average age of “management” professionals with housing “yes”?
Hint: you can do this in 2 steps by assigning a variable and then using the mean() function.
Exercise b. What is the average balance for individuals with secondary education?
Solution a.
av.age <- filter(data_bank, job == "management", housing== "yes")
mean(av.age$age)
select()
We use select() to subset the data on variables or columns.
We can select multiple columns with a comma, after we specify the data frame (data_bank).
The logic:
select(df, A, B, C): select the variables A, B and C from the df dataset.
select(df, A:C): select all variables from A to C from the df dataset.
select(df, -C): exclude C from the df dataset.
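The three forms can be checked on a small data frame. In this sketch `df` and its columns A to D are hypothetical names chosen to mirror the notation above.

```r
library(dplyr)

# Hypothetical toy data frame with columns A to D
df <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)

names(dplyr::select(df, A, B, C))  # the listed columns
names(dplyr::select(df, A:C))      # the range of columns from A to C
names(dplyr::select(df, -C))       # every column except C
```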
If you see the error below, another loaded package that also has a select() function is probably masking the dplyr one (MASS, loaded above after dplyr, is a common culprit), or you forgot to load the dplyr package with library(dplyr).
data_bank %>% select(education) # Error in select(., education) : unused argument (education)
We can avoid the conflict by calling select() with its package prefix, dplyr::select():
data_bank %>% dplyr::select(age)
# or select 1st variable (age)
head(data_bank[,1])
data_bank %>% dplyr::select(age, marital, duration)
We can also use - (minus) to deselect columns. The code below drops the variables age and marital from our dataset.
data_bank %>% dplyr::select(-age, -marital)
We can use the pipe to chain those two operations together.
For example, we keep the individuals whose profession (the job variable) is “management” and, for these individuals, show only the “education” and “loan” variables.
data_bank.1 <- data_bank %>%
  filter(job == "management") %>%
  dplyr::select(education, loan)
data_bank.1
Exercise How would you do the same for a given “marital” status, showing the “age” and “balance” variables?
mutate() adds new variables to a data frame
Suppose we want to add a variable that combines two or more existing variables, or a new variable from an existing vector that is not part of the data frame.
Suppose we want to see the balance per day of the duration period (balance/duration).
We add a new variable named “day.balance”, computed from two existing variables of the data frame. If duration is 0 in some row, the division returns Inf, -Inf or NaN for that row.
data_bank %>%
  mutate(day.balance = round(balance / duration, 2))
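The following sketch shows the same mutate() call on invented data in which one row has duration 0, so the effect of dividing by zero is visible. The data frame `toy` is hypothetical, and replacing the ratio with NA is only one possible safeguard, assumed here for illustration.

```r
library(dplyr)

# Hypothetical toy data; the second row has duration 0
toy <- data.frame(balance = c(1200, 300, -50), duration = c(60, 0, 25))

# Same computation as for the bank data; the second row becomes Inf
toy %>%
  mutate(day.balance = round(balance / duration, 2))

# One possible safeguard: set the ratio to NA when duration is 0
toy2 <- toy %>%
  mutate(day.balance = ifelse(duration == 0, NA_real_,
                              round(balance / duration, 2)))
toy2
```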
We can also add a variable from a vector that exists outside the data frame.
X <- rnorm(nrow(data_bank)) # one random value per row
length(X)
data_bank %>% mutate(X = X)
group_by() operates on groups
summarize() keeps only the grouping columns and the newly computed summary columns.
ungroup() removes the grouping, and it’s good to get in the habit of using it after a group_by().
data_bank %>%
  filter(age == 30) %>%
  group_by(marital) %>%
  summarize(total_duration = sum(duration)) %>%
  ungroup()
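To see which columns survive a summarize(), the sketch below runs the same chain on a few invented rows. The data frame `toy` and the summary names total_duration and n are assumptions made for this illustration.

```r
library(dplyr)

# Hypothetical toy data with the three columns used in the chain
toy <- data.frame(
  marital  = c("single", "married", "single", "married", "divorced"),
  age      = c(30, 30, 30, 41, 30),
  duration = c(100, 250, 50, 80, 120)
)

res <- toy %>%
  filter(age == 30) %>%                  # keep the 30-year-olds
  group_by(marital) %>%                  # one group per marital status
  summarize(total_duration = sum(duration),
            n = n()) %>%                 # one row per group
  ungroup()

res         # only marital, total_duration and n remain
names(res)
```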
arrange()
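arrange() sorts the rows by one or more columns, in ascending order by default (alphabetically from A to Z for text columns); wrap a column in desc() for descending order.
# arrange by job
data_bank %>%
  group_by(marital, job) %>%
  summarize(mean_duration = mean(duration)) %>%
  arrange(job)
# arrange by marital status
data_bank %>%
  group_by(marital, job) %>%
  summarize(mean_duration = mean(duration)) %>%
  arrange(marital)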
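A short sketch of the sorting options on invented data follows. The data frame `toy` is hypothetical; desc() is the dplyr helper for descending order.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  job      = c("retired", "admin", "management", "admin"),
  duration = c(300, 120, 210, 450)
)

arrange(toy, job)             # ascending: A to Z by job
arrange(toy, desc(duration))  # descending: longest duration first

# Sort by job first, then by descending duration within the same job
sorted <- arrange(toy, job, desc(duration))
sorted
```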
Try to combine the functions above by yourself.
Exercise What is the maximum duration for all jobs at age 30?
Exercise Try to work out what output the command below will give you.
library(tidyverse) # if needed, install it first with install.packages('tidyverse')
# summarize
data_bank.2 <- data_bank %>%
  dplyr::select(-contact, -poutcome) %>%
  dplyr::group_by(marital) %>%
  dplyr::mutate(day.balance = round(balance / duration, 2)) %>%
  dplyr::summarize(min_day.balance = min(day.balance)) %>%
  dplyr::ungroup()
GRAPHICS
plot() function
Plot only one variable.
# not a good graphical presentation of age
plot(data_bank$age)
# a better view
plot(table(data_bank$age),ylab = "Frequency",xlab="Age",main="Age of bank data",col="red",lwd=5)
Exercise Try to add text (“Graphics in R”) to your second graph at coordinates age 68 and frequency 1400.
histograms hist()
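Observe the argument breaks=. When it is given as a single number, it is only a suggestion for the number of bins, because R rounds the break points to “pretty” values. For more arguments of hist() see ?hist.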
hist(data_bank$age, main="Histogram for age variable",xlab="age",ylab="freq",col="red")
hist(data_bank$age, main="Histogram for age variable",xlab="",ylab="",col="red",breaks = 5)
hist(data_bank$age, main="Histogram for age variable",xlab="",ylab="",col="red",breaks = 10)
hist(data_bank$age, main="Histogram for age variable",xlab="",ylab="",col="red",breaks = 15)
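The effect of breaks= can be inspected without drawing anything, as in the sketch below. The vector `x` is simulated and only stands in for the age variable; passing a vector of break points is shown as one way to get exact bins.

```r
set.seed(1)                     # reproducible toy data
x <- round(runif(500, 18, 95))  # hypothetical ages between 18 and 95

# plot = FALSE returns the histogram object without drawing it
h5  <- hist(x, breaks = 5,  plot = FALSE)
h15 <- hist(x, breaks = 15, plot = FALSE)

length(h5$counts)   # number of bins actually used for breaks = 5
length(h15$counts)  # number of bins actually used for breaks = 15
h5$breaks           # the "pretty" break points R chose

# To force exact bins, pass the break points themselves
h_exact <- hist(x, breaks = seq(10, 100, by = 10), plot = FALSE)
h_exact$breaks
```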
PIE Chart pie()
pie() draws one slice per element of the vector it receives, so passing a raw variable produces a slice for every individual observation. We need to pass the frequencies of the categories instead, which table() computes.
# not the appropriate way to obtain a pie chart
pie(data_bank$age)
# combined with the function table()
# for housing
pie(table(data_bank$housing),main="Housing or not",col=c("red","blue"))
# for marital variable
pie(table(data_bank$marital),main="Civil status",col=c("red","blue","yellow"))
# changing colors based on the categories of marital variable
pie(table(data_bank$marital),main="Civil status",col=rainbow(length(table(data_bank$marital))))
pie(table(data_bank$month), main = "Month", col = rainbow(length(table(data_bank$month))))
pie(table(data_bank$job), main = "Profession (job)", col = rainbow(length(table(data_bank$job))))
# Add percentages to the labels
# create a vector of percentages
perc <- round(100 * table(data_bank$marital) / length(data_bank$marital))
perc
# create a vector of labels from the category names, so the order always matches the table
label <- names(perc)
# paste labels and percentages together
label <- paste(label, perc) # combine each civil status with its percentage
# add % to each label
label <- paste(label, "%", sep = "")
# obtain the pie chart
pie(table(data_bank$marital), labels = label, main = "Civil status", col = rainbow(length(table(data_bank$marital))))
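A compact variant of the percentage labels is sketched below on invented data. The vector `marital` is a toy stand-in for the marital column, and prop.table() with paste0() is an alternative to the step-by-step version, not the approach this material follows.

```r
# Hypothetical toy vector standing in for data_bank$marital
marital <- c("married", "single", "married", "divorced", "married", "single")

freq <- table(marital)
perc <- round(100 * prop.table(freq))  # percentages in the same order as the table

# Build "category percentage%" labels in one step
label <- paste0(names(perc), " ", perc, "%")
label

pie(freq, labels = label, main = "Civil status (toy data)",
    col = rainbow(length(freq)))
```

Taking the labels from names(perc) keeps them aligned with the slices even when the set of categories changes.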
# 3D Exploded Pie Chart
library(plotrix)
pie3D(table(data_bank$marital),explode=0.1,main="Civil status",col=c("red","blue","yellow"))
# the explode argument sets the distance between the slices; try explode = 0.5
Boxplot
A boxplot can be drawn for a single numerical variable, or for a numerical variable split by the categories of a non-numerical variable.
boxplot(data_bank$age,col="green",main="Boxplot age",ylab="age")
boxplot(data_bank$balance,col="red",main="Boxplot balance",ylab="balance")
boxplot(data_bank$age~data_bank$marital,col=c("red","green","purple"),main="BoxPlot civil status",ylab="age")
boxplot(data_bank$duration~data_bank$marital,col=c("red","green","purple"),main="BoxPlot civil status",ylab="Duration",xlab="civil status")
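The numbers behind a boxplot can be obtained with boxplot.stats(). The sketch below uses an invented vector `age`, not the bank data, to show the five values that define the box and whiskers and the points drawn as outliers.

```r
# Hypothetical toy ages with one extreme value
age <- c(22, 25, 31, 35, 38, 41, 44, 52, 58, 95)

stats <- boxplot.stats(age)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values beyond the whiskers, drawn as separate points

boxplot(age, col = "green", main = "Boxplot age (toy data)", ylab = "age")
```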
Correlation plots
# scatterplot matrix of the numerical variables
data_bank.1 <- data_bank %>% dplyr::select(duration, age, balance)
pairs(data_bank.1, col = "red", main = "Scatterplot for many variables")
# calculate the correlation between two variables
cor1 <- cor(data_bank$age, data_bank$balance)
cor1
cor2 <- cor(data_bank$age, data_bank$duration)
cor2
# scatterplots, histograms and correlations in one figure
library(psych)
pairs.panels(data_bank[, c(1, 6, 12)]) # columns 1, 6 and 12: age, balance, duration
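As a sketch of what cor() computes, the example below compares it with the definition of the Pearson coefficient on simulated data. The vectors `x` and `y` are invented; the Spearman variant and cor.test() are shown as optional extras that this material does not otherwise use.

```r
set.seed(42)               # reproducible toy data
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)  # y depends partly on x

r <- cor(x, y)             # Pearson correlation (the default method)
r_manual <- cov(x, y) / (sd(x) * sd(y))  # covariance over the product of standard deviations
c(r, r_manual)             # the two values agree

cor(x, y, method = "spearman")  # rank-based alternative
cor.test(x, y)$p.value          # test of the null hypothesis of zero correlation
```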
library(corrplot) and library(RColorBrewer)
library(corrplot)
library(RColorBrewer)
M <-cor(data_bank[,c(1,6,12)]) #select only numerical variables
corrplot(M, type="upper", order="hclust",col=brewer.pal(n=8, name="RdYlBu"))
library(ggplot2) and library(GGally)
library(ggplot2)
library(GGally)
# display a pair plot of the three selected columns
GGally::ggpairs(data_bank[,c(1,6,12)])
