2nd class - Benchmarking

Functions

C++ using the package “Rcpp”.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rowSumsC(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);

  for (int i = 0; i < nrow; i++) {
    double total = 0;
    for (int j = 0; j < ncol; j++) {
      total += x(i, j);
    }
    out[i] = total;
  }
  return out;
}

In all the codes below, big.matrix is an matrix argument and N is an integer that tells us the number of times to evaluate the expression. Both will be inserted by the user.

cppSum <- function(big.matrix, N) {
  stopifnot(exists("big.matrix"))
  stopifnot(exists("N"))
  Time <- microbenchmark({
    colsums <- rowSumsC(big.matrix)
  }, times = N)$time
  return(Time)  
}

By using the optimized “colSums” function from R:

ColSum <- function(big.matrix, N) {
  stopifnot(exists("big.matrix"))
  stopifnot(exists("N"))
  Time <- microbenchmark({
    colsums <- colSums(big.matrix)
  }, times = N)$time
  return(Time)  
}

By using the “apply” function from R:

Apply <- function(big.matrix, N) {
  stopifnot(exists("big.matrix"))
  stopifnot(exists("N"))
  Time <- microbenchmark({
    colsums <- apply(big.matrix, 2, sum)
  }, times = N)$time
  return(Time)  
}

By using matricidal multiplication:

matrixSum <- function(big.matrix, N) {
  stopifnot(exists("big.matrix"))
  stopifnot(exists("N"))
  Time <- microbenchmark({
    ones <- rep(1, dim(big.matrix)[1])
    colsums <- ones %*% big.matrix
  }, times = N)$time
  return(Time)  
}

The single loop function:

SingleLoop <- function(big.matrix, N) {
  stopifnot(exists("big.matrix"))
  stopifnot(exists("N"))
  colsums <- rep(NA, dim(big.matrix)[2])
  Time <- microbenchmark({
    for (i in 1:dim(big.matrix)[2]){
      colsums[i] <- sum(big.matrix[,i])
    }
  }, times = N)$time
  return(Time)
}

The double loop function:

DoubleLoop <- function(big.matrix, N) {
  stopifnot(exists("big.matrix"))
  stopifnot(exists("N"))
  Time <- microbenchmark({
  colsums <- rep(NA, dim(big.matrix)[2])    
      for (i in 1:dim(big.matrix)[2]){
        s <- 0
        for (j in 1:dim(big.matrix)[1]){
          s <- s + big.matrix[j, i]
        }
        colsums[i] <- s
      }
    }, times = N)$time
  return(Time)
} 

First experiment

The codes we have built previously are used to sum the columns of a matrix. Now, we are wondering in which function we spent more time on process. For this we are going to create a random process to evaluate the performance of each function.

Trials = 15 # Number of experiments
Experiments <- as.data.frame(matrix(NA, nrow= Trials, ncol=5)) # Initializing an empty dataframe
colnames(Experiments)<-c("CppSum", "SingleLoop", "Apply", "ColSum", "MatrixSum")  # Changing the name of the columns

for (n in 1:Trials) {                                       
  big.matrix <- matrix(runif(1e+6), nrow = 1000)
  Experiments[n,] = c(cppSum(big.matrix, N = 1),    # FILLING THE EMPTY DATAFRAME
                      SingleLoop(big.matrix, N = 1), 
                      Apply(big.matrix, N = 1), 
                      ColSum(big.matrix, N = 1), 
                      matrixSum(big.matrix, N = 1))
}

It follows the experiments table below, which one has the time spent for processing time in each function built in 15 trials:

CppSum SingleLoop Apply ColSum MatrixSum
11857844 13553991 14060222 1909235 2087521
14339624 11960463 10770208 1804205 1743486
4030401 8320680 12187318 1838304 1242288
3999986 6605255 12878942 1663743 1485341
5558623 7051670 13773656 1695383 1892197
5481298 19691736 15555191 1705745 3211198
4666651 9135062 11582805 1884365 1608423
4117916 6783961 10277630 1645628 5595849
3784278 6832791 10698034 1736938 7849448
4026158 7018728 12260245 1696375 1519902
4559054 6774388 12343171 1725949 1759426
3894739 7740347 13834733 1752531 1967901
3882907 6804470 10815676 1848055 1511235
4158680 7281457 18262559 2498912 57192659
4055408 8646319 31066072 1699679 1707201

Now, there is the box plot between the variables time spent and the method used to manipulate the operations on the matrix, where we can easily see that the code spent most time running the Apply function. The SigleLoop function also has a big percent of our processing time.

p  <-  Experiments  %>% gather("Method", "Time")  %>% 
       ggplot(aes(x =  reorder(Method, Time, FUN=median), y = Time, fill = Method)) 
p + geom_boxplot() + labs(x = "Method", y = "Time in nanoseconds")

3rd class - Using ggplot2

Eventually, there will be code here.

4th class - Importing database

Let’s import the database from INEP http://inep.gov.br/microdados by using readr.
On this class we will use the Censo da Educação Superior from 2016.
data2 is the DM_DOCENTE data frame;
data3 is the DM_IES data frame;
data4 is the modified DM_DOCENTE data frame.

Bringing data to Rstudio environment

library(readr)
data2 <- read_delim("dados/DM_DOCENTE.CSV", 
                    "|", escape_double = FALSE, col_types = cols(CO_DOCENTE = col_character(),
                      CO_DOCENTES_IES = col_character(), 
                      CO_MUNICIPIO_NASCIMENTO = col_character()), 
                    locale = locale(encoding = "ISO-8859-1"), 
                    trim_ws = TRUE)
data3 <- read_delim("dados/DM_IES.CSV", 
                    "|", escape_double = FALSE, 
                    locale = locale(encoding = "ISO-8859-1"),
                    trim_ws = TRUE)

Now we are wondering the percentage of each type of higher education institution per state.

library(tidyverse)
cont_uf <- data3 %>% 
  group_by(SGL_UF_IES, DS_CATEGORIA_ADMINISTRATIVA) %>% 
  summarise(cont = n())

After summarize the information, we may note that our object cont_ufis a “tidy” data object. To improve the visualization let’s make some changes on the layout.

cont_uf <- spread(cont_uf, DS_CATEGORIA_ADMINISTRATIVA, cont)
colnames(cont_uf) <- c("State", "Special", "Private with profits", "Private without profits", "Public State", "Public Federal", "Public County")

Setting up the table

Once setup, let’s finally see the table with the information about the percentages per state.

library(knitr)
library(kableExtra)
kable(cont_uf, "html") %>%
  kable_styling() %>%
  column_spec(1, bold = T, border_right = T) %>% 
  scroll_box(width = "800px", height = "400px")
State Special Private with profits Private without profits Public State Public Federal Public County
AC NA 5 4 NA 2 NA
AL NA 12 12 2 2 NA
AM NA 8 8 1 2 NA
AP NA 8 5 1 2 NA
BA NA 78 33 4 6 NA
CE NA 40 18 3 4 NA
DF NA 31 24 2 2 NA
ES NA 31 45 1 2 2
GO 3 56 24 1 3 1
MA NA 28 7 1 2 NA
MG NA 107 169 4 17 1
MS NA 8 22 1 3 NA
MT NA 34 22 1 2 NA
PA NA 30 11 1 5 NA
PB NA 28 8 1 3 NA
PE 3 34 39 1 5 19
PI NA 27 11 1 2 NA
PR 1 97 79 7 4 1
RJ 2 22 89 13 10 2
RN NA 15 8 2 3 NA
RO NA 12 18 NA 2 NA
RR NA 3 1 1 2 NA
RS NA 44 68 1 9 NA
SC 4 38 45 1 4 2
SE NA 13 3 NA 2 NA
SP 8 233 277 71 5 15
TO NA 10 9 1 2 2

From the table above, we can easily see that the state of São Paulo has the most institutions of the types Private with and without profits and Public State. The state of Minas Gerais leads in another important type such as Public Federal. In general most Public institutions are located in the south and southeast regions. Economic development is a candidate to explain this distribution of institutions, because it is straightforward that the richest region should have more structure in many areas, include education. Another interesting fact is that only a few states have institutions of the Special type; It could be theme for a further investigation and analysis. All the cells with “NA” means zero to the corresponding value.

5th class - Using readRDS

Reading data by using readRDS function

On previously class we have loaded the data3 by using the following code:

data3 <- read_delim("dados/DM_IES.CSV", 
                    "|", escape_double = FALSE, 
                    locale = locale(encoding = "ISO-8859-1"),
                    trim_ws = TRUE)

But now let’s use a function to improve your processing time to read and render the R Markdown document. Once we have loaded the data3, we create a new file from data3 using the saveRDS function and then restore it in a new object. So, now we have the same data but in a format that allows us to improve the processing time.

saveRDS(data3, file = "dados/data3.rds")

data3opt <- readRDS("data3.rds")

Bulding a bar plot

On the previously class we had summarize the categories of the higher education institutions per state in Brazil. Let’s now build a bar plot to summarize the same information but, this time, only with the state of Pernambuco.

bar_pe <- data3 %>% select(SGL_UF_IES, DS_CATEGORIA_ADMINISTRATIVA) %>%
  group_by(SGL_UF_IES, DS_CATEGORIA_ADMINISTRATIVA) %>%
  filter(SGL_UF_IES == "PE") %>% 
  count() %>% arrange(desc(n)) %>% 
  rename(State = SGL_UF_IES, Category = DS_CATEGORIA_ADMINISTRATIVA, N = n)  
pltbar_pe <- ggplot(bar_pe, aes(x = reorder(Category, N), y = N, fill = Category))
pltbar_pe +  geom_bar(stat = "Identity") + 
             ggtitle("Bar plot of the higher education institutions in Pernambuco ") + 
             xlab("Category") + ylab("Proportion") +
             theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

bar_pe <- data3 %>% select(SGL_UF_IES, DS_CATEGORIA_ADMINISTRATIVA) %>%
  group_by(SGL_UF_IES, DS_CATEGORIA_ADMINISTRATIVA) %>%
  filter(SGL_UF_IES == "PE") %>% 
  count() %>% arrange(desc(n)) %>% 
  rename(State = SGL_UF_IES, Category = DS_CATEGORIA_ADMINISTRATIVA, N = n) %>% 
  ungroup %>% 
  mutate(Frequency = N/sum(N)) %>% mutate(Category = factor(Category, levels = Category))
bar_pe$Category <- c("Private without profits", "Private with profits", "Public County", 
                     "Public Federal", "Special", "Public State")
pltbar_pe2 <- ggplot(bar_pe, aes(x = reorder(Category, N), y = N, fill = Category))
pltbar_pe2 +  geom_bar(stat = "Identity") + guides(fill = "none") +
              ggtitle("Bar plot of the higher education institutions
                      in Pernambuco ") + xlab("Category") +
              coord_flip() + geom_label(aes(label = round(Frequency*100))) + 
              scale_fill_brewer(palette = "YlGn", direction = -1)

As we can see in the graphic above, private with and without profits categories represents approximately 73% of the total. The public county category comes next with approximately 19% of the institutions. The categories public federal and state and special represents less than 9% of institutions presents in the state. In particular, the public state category has the smallest percentage which equals 1%, approximately.

Let’s now build another bar plot but now we are interesting in the comparison between the states of Pernambuco and Paraíba, both in the northeast region of Brazil.

bar_pepb <- data3 %>% select(SGL_UF_IES, DS_CATEGORIA_ADMINISTRATIVA) %>%
  group_by(SGL_UF_IES, DS_CATEGORIA_ADMINISTRATIVA) %>%
  filter(SGL_UF_IES == "PE" | SGL_UF_IES == "PB") %>% 
  count() %>% arrange(desc(n)) %>% 
  rename(State = SGL_UF_IES, Category = DS_CATEGORIA_ADMINISTRATIVA, N = n) %>% 
  ungroup %>% 
  mutate(Frequency = N/sum(N))
pltbar_pepb <- ggplot(bar_pepb, aes(x = reorder(Category, Frequency), y = Frequency, fill = State))
pltbar_pepb +  geom_bar(stat = "identity", position = "dodge") + 
               ggtitle("Bar plot of the higher education institutions 
    between Pernambuco and Paraíba") + xlab("Category") + coord_flip()

As we can see above, the state of Pernambuco has more institutions of higher education than the state of Paraíba in almost all categories, except the Public State category. The state of Paraíba has a expressive number in comparison to Pernambuco in Private with profits and Public Federal categories. One interesting fact is that the state of Paraíba hasn’t institutions of Public County and Special categories.

6th class - Estimating the density of the professors’ age

Now we are wondering estimate the density of the professors’ age to try understand how it is distributed. The first step is create a new data frame data4 containing the age data. In data2 we have already the information about the birth of each professor, so let’s now compute the age (in years) in a new variable called age.

data4 <-data2 %>% select(NU_ANO_DOCENTE_NASC, NU_MES_DOCENTE_NASC, NU_DIA_DOCENTE_NASC, 
       DS_ESCOLARIDADE_DOCENTE, IN_SEXO_DOCENTE) %>% 
       mutate(Age = (2016 - NU_ANO_DOCENTE_NASC) + 
                      (NU_MES_DOCENTE_NASC/12) + 
                      (NU_DIA_DOCENTE_NASC/365))   

Building density plots

Based on the data4 let’s create the first one density plot.

s <- ggplot(aes(x = Age), data = data4)
s + geom_density(color = "Red") + 
    ggtitle("Density function of the professors' age") +
    theme_linedraw()

We can see that the data are has a right asymmetry. Which indicates that possibly the mean and 1st quartile (or 25% of the observations) are nearly and that the mean has a greater value than median. To model the data there are the possible probability distributions candidates as Gamma, Exponential, Log-Gamma, Weibull.
Let’s see how age is distributed by group of degree:

w <- ggplot(aes(x = Age, color = Degree), data = data4)
w + geom_density() + 
    ggtitle("Density function of the professors' age by degree") +
    theme_linedraw()

The data divided by group follows the same asymmetric tendency. The undergraduate degree is the most asymmetric which can be explained by the idea that the undergraduate degree is the first step on the academic life. The post-graduate programs in Brazil are new if comparated to countries like USA, England, Canada, etc. Because of this fact there is a group of older professors with undergraduate degree. The Master and Specialist degrees seems to be equally distributed, once most people achieve these degrees thereafter undergraduate. We may note that there is a group included in “No degree” category, it may sounds strange! How could anyone become a teacher in a institution of higher education without a undergraduate degree, at least? It might be theme for a further analysis.

Building box plots

From the graphic, we can see that the variables age and degree are associated, because as the level of education also increases the age.

b <- ggplot(aes(x = reorder(Degree, -Age, FUN = median), y = Age, fill = Degree), data = data4)
b + geom_boxplot() + coord_flip() + guides(fill = "none") +
    ggtitle("Box plot between Dregree and Age ") + xlab("Degree")

Male x Female

Let’s now reproduce again the same staff but looking to the gender in a separate manner.

library(gridExtra)
d1 <- ggplot(aes(x = Age), data = data4) + 
      geom_density() + facet_grid(Gender~.)

w1 <- ggplot(aes(x = Age, color = Degree), data = data4) + 
      geom_density() + facet_grid(Gender~.)
grid.arrange(d1,w1, nrow = 1)

The distribution of the age is similar to both genders. However, when we look to the groups in relation to degrees we see the male gender has more observations to the undergraduate degree after 40 years old than female gender.

c <- ggplot(aes(x = Gender, y = Age, fill = Gender),data = data4)
c + geom_boxplot()+  coord_flip() + labs(fill = "Gender")

It seems the variables gender and age are not associated. Both male and female have the same minimum age and approximately the same median.
Above we have the box plot between degree and age by gender and we can see that the variables are associated for both genders, as we found out when we looked without considering the genders.

7th class - Regression Models

Most of the following content was extracted from https://www.listendata.com/2018/03/regression-analysis.html.

Linear regression

It is the simplest form of regression. It is a technique in which the dependent variables is continuous in nature. The relationship between the dependent variable and independent variables is assumed to be linear in nature.
When you have only 1 independent variable and 1 dependent variable, it is called simple linear regression:
\[y_i = \beta_1 + \beta_0 x_i + \epsilon_i\].
And when you have more than 1 independent variable and 1 dependent variable, it is called multiple linear regression:
\[Y = X \beta + \epsilon\].

Assumptions of linear regression:
1. There must be a linear relation between independent and dependent variables;
2. There should not be any outliers present;
3. No heteroscedasticity;
4. Sample observations should be independent;
5. Error terms should be normally distributed with mean 0 and constant variance;
6. Absence of multicollinearity and auto-correlation.

Example:
We consider the swiss data set for carrying out linear regression in R. We use lm() function in the base package. We try to estimate Fertility with the help of other variables.

library(datasets)
model <- lm(Fertility~., data = swiss)
summary(model)

Polynomial regression

It is a technique to fit a nonlinear equation by taking polynomial functions of independent variable. Thus a polynomial of degree k in one variable is written as:
\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i² + ... + \beta_k x_i^k + \epsilon_i\].

Example:
We are using poly.csv data for fitting polynomial regression where we try to estimate the Prices of the house given their area. Firstly we read the data using read.csv() and divide it into the dependent and independent variable.

data <- read.csv("poly.csv")
x <- data$Area
y <- data$Price
new_x <- cbind(x, x^2)
model2 <- lm(y~new_x)
summary(model2)

Logistic regression

In logistic regression, the dependent variable is binary in nature (having two categories). Independent variables can be continuous or binary. In multinomial logistic regression, you can have more than two categories in your dependent variable. Here my model is:
\[p = \dfrac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + ... + b_k x_k)}}\]

Quantile regression

Quantile regression is the extension of linear regression and we generally use it when outliers, high skewenness and heteroscedasticity exist in the data. In linear regression, we predict the mean of the dependent variable for given independent variables. Since mean does not describe the whole distribution, so modeling the mean is not a full description of a relationship between dependent and independent variables. So we can use quantile regression which predicts a quantile (or percentile) for given independent variables.
For qth quantile we have the following regression model:
\[Y = X' \beta + \epsilon\].
Advantages of quantile over linear regression:
1. Quite beneficial when heteroscedasticity is present in the data;
2. Robust to outliers;
3. Distribution of dependent variable can be described via various quantiles; 4. It is more useful than linear regression when the data is skewed.

Example:
We need to install quantreg package in order to carry out quantile regression. By using ’rq` function we try to predict the estimate the 25th quantile of Fertility Rate in Swiss data. For this we set \(\tau = 0.25\).

model4 <- rq(Fertility~)
summary(model4)

Support vector regression

Support vector regression can solve both linear and non-linear models. SVM uses non-linear kernel functions (such as polynomial) to find the optimal solution for non-linear models.The main idea of SVR is to minimize error, individualizing the hyperplane which maximizes the margin.
Example:

library(e1071)
model10 <- svm(Y ~ X , data)
pred <- predict(model10, data)
points(data$X, pred, col = "red", pch=4)

Ordinal regression

Ordinal Regression is used to predict ranked values. In simple words, this type of regression is suitable when dependent variable is ordinal in nature. Example of ordinal variables - Survey responses (1 to 6 scale), patient reaction to drug dose (none, mild, severe).
Example:

library(ordinal)
model11 <- clm(rating ~ ., data = wine)
summary(model11)

Poisson regression

Poisson regression is used when dependent variable has count data.
Application of poisson regression:
1. Predicting the number of calls in customer care related to a particular product; 2. Estimating the number of emergency service calls during an event.
The dependent variable must meet the following conditions:
1. The dependent variable has a Poisson distribution;
2. Counts cannot be negative; 3. This method is not suitable on non-whole numbers.
n the code below, we are using dataset named warpbreaks which shows the number of breaks in Yarn during weaving. In this case, the model includes terms for wool type, wool tension and the interaction between the two.

model12 <- glm(breaks~wool*tension, data = warpbreaks, family=poisson)
summary(model12)

Negative Binomial regression

Like Poisson Regression, it also deals with count data. The question arises “how it is different from Poisson regression”. The answer is negative binomial regression does not assume distribution of count having variance equal to its mean. While Poisson regression assumes the variance equal to its mean. Example:

library(MASS)
model13 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)
summary(model13)

Quasi poisson regression

It is an alternative to negative binomial regression. It can also be used for overdispersed count data. Both the algorithms give similar results, there are differences in estimating the effects of covariates. The variance of a quasi-Poisson model is a linear function of the mean while the variance of a negative binomial model is a quadratic function of the mean. Quasi-Poisson regression can handle both over-dispersion and under-dispersion.
Example:

model14 <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine,  family = "quasipoisson")

Cox regression

Cox regression is suitable for time-to-event data. See the examples below:
1. Time from customer opened the account until attrition;
2. Time after cancer treatment until death;
3. Time from first heart attack to the second.
As well as estimating the time it takes to reach a certain event, survival analysis can also be used to compare time-to-event for multiple groups. Dual targets are set for the survival model:
1. A continuous variable representing the time to event;
2. A binary variable representing the status whether event occurred or not. Example:

library(survival) # Lung cancer data
lung$SurvObj <- with(lung, Surv(time, status == 2)) # status: death = 2
model15 <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data = lung)
model15

8th class - Creating interactive maps

Today we are going to create maps using sp, leaflet packages.

Only latitude and longitude points

Let’s get the coordinates of the 3 campi of UFPE. The campi are located in Recife, Caruaru and Vitória, respectively. First of all we build an object with our coordinates, then by using sp we transform the data into spatial data as follows.

library(sp)
coordinates <- rbind(c(-34.950440,-8.051570),
                     c(-35.981222,-8.225415),
                     c(-35.298355,-8.116218))
spatial.points <- SpatialPoints(coordinates)
plot(spatial.points)

As we can see above, the plot is not good in many ways. So let’s use the package leafletto plot the coordinates in a real map.

library(leaflet)
leaflet() %>% addTiles() %>% addMarkers(data = spatial.points)

Dealing with polygons

Now we want to understand the proportion of people without degree in the state of Pernambuco. For this we’ll need to import the database from IBGE: ftp://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Resultados_Gerais_da_Amostra/Microdados/PE.zip and also import the spatial database of Brazil using raster.
The code of this application can be found at:
https://drive.google.com/drive/folders/1bIbI294KKJwo0hHs9_XQ9R6JQP3iH-hZ?usp=sharing
Eventually there will be the application: https://gabrielteotonio.shinyapps.io/NoDegree-PE/.

9th class - A robot

Let’s build a robot to get the share price of Facebook. The process must be automatic, but before we need to define our routine to get the values:

library(rvest)
library(stringr)
library(lubridate)
vignette("selectorgadget")

Time <- now()

site <- read_html("https://www.nasdaq.com/symbol/fb/real-time")

price <- site %>%
  html_nodes(xpath = '//*[(@id = "qwidget_lastsale")]') %>%
  html_text() %>% str_remove_all(pattern = "[R$]") %>%  
  str_replace(pattern = "[\\,]", replacement = "\\.") %>%
  as.numeric()

cat(as.character(Time), " ", price, "\n", file = "SC/robot/preçosFB.txt", append = TRUE)

We created a .txt document to write the prices down. Now we need to define to our machine when it should run the routine. In a linux machine, from the terminal:

crontab -e

And then edit by using nano editor:

*/5 8-17 * * 1-5 Rscript ~/SC/robot/robotFB.R

The command above is telling the machine to run this file , which I set the path, every 5 minutes from 8 am to 5 pm on every weekdays. With that staff setted up, we got the values in a .txt document.

library(readr)
pricesFB <- read_table2("~/SC/robot/preçosFB.txt")

After read the .txt document let’s now build a chart to visualize the variation (in USD) through a day.

10th class - package

Follow below the link to the source code of the package required.
https://drive.google.com/file/d/12KiO-2V_Rs9LmAn1IXoHuBciCMu7zcNi/view?usp=sharing