source('~/Desktop/Statistics-Fundamentals-with-R/ANALYSIS/libraries.R', echo=TRUE)
##
## > packages = c("dplyr", "ggfortify", "magrittr", "ggplot2",
## + "broom", "yardstick")
##
## > new <- packages[!(packages %in% installed.packages()[,
## + "Package"])]
##
## > if (length(new)) install.packages(new)
##
## > a = lapply(packages, require, character.only = TRUE)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggfortify
## Loading required package: ggplot2
## Loading required package: magrittr
## Loading required package: broom
## Loading required package: yardstick
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
##
## > library(dplyr)
##
## > library(ggfortify)
##
## > library(magrittr)
##
## > library(ggplot2)
##
## > library(ggfortify)
##
## > library(yardstick)
Descriptive statistics: focuses on describing and summarizing collected data.
Inferential statistics: uses sample data to draw conclusions about a larger population.
Type of data:
Numeric:
-Continuous (Measured):
Airplane speed, time spent waiting in line
-Discrete (Counted):
Number of pets, number of packages shipped.
Used types of summaries and plots:
Mean
Scatterplots
Categorical:
-Nominal (Unordered):
Married/unmarried, country of residence
-Ordinal (Ordered):
Multiple choice.
Used types of summaries and plots:
Counts
Barplots
Whether to use the mean or the median depends on whether the distribution is symmetric: if it is symmetric, the mean works well; if it is skewed, the median is the better choice.
In this chapter, you’ll be working with the 2018 Food Carbon Footprint Index from nu3. The food_consumption dataset contains information about the kilograms of food consumed per person per year in each country in each food category (consumption) as well as information about the carbon footprint of that food category (co2_emissions) measured in kilograms of carbon dioxide, or CO2, per person per year in each country.
In this exercise, you’ll compute measures of center to compare food consumption in the US and Belgium using your dplyr skills.
dplyr is loaded for you and food_consumption is available.
Create two data frames: one that holds the rows of food_consumption for “Belgium” and another that holds rows for “USA”. Call these belgium_consumption and usa_consumption. Calculate the mean and median of kilograms of food consumed per person per year for both countries.
# Filter for Belgium
belgium_consumption <- food_consumption %>%
filter(country == "Belgium")
# Filter for USA
usa_consumption <- food_consumption %>%
filter(country == "USA")
# Calculate mean and median consumption in Belgium
mean(belgium_consumption$consumption)
## [1] 42.13273
median(belgium_consumption$consumption)
## [1] 12.59
# Calculate mean and median consumption in USA
mean(usa_consumption$consumption)
## [1] 44.65
median(usa_consumption$consumption)
## [1] 14.58
Filter food_consumption for rows with data about Belgium and the USA. Group the filtered data by country. Calculate the mean and median of the kilograms of food consumed per person per year in each country. Call these columns mean_consumption and median_consumption.
food_consumption %>%
# Filter for Belgium and USA
filter(country %in% c("Belgium", "USA")) %>%
# Group by country
group_by(country) %>%
# Get mean_consumption and median_consumption
summarise(mean_consumption = mean(consumption),
median_consumption = median(consumption))
## # A tibble: 2 x 3
## country mean_consumption median_consumption
## <chr> <dbl> <dbl>
## 1 Belgium 42.1 12.6
## 2 USA 44.6 14.6
In the video, you learned that the mean is the sum of all the data points divided by the total number of data points, and the median is the middle value of the dataset where 50% of the data is less than the median, and 50% of the data is greater than the median. In this exercise, you’ll compare these two measures of center.
dplyr and ggplot2 are loaded and food_consumption is available.
Filter food_consumption to get the rows where food_category is “rice”. Create a histogram using ggplot2 of co2_emission for rice.
food_consumption %>%
# Filter for rice food category
filter(food_category == "rice") %>%
# Create histogram of co2_emission
ggplot(aes(co2_emission)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Take a look at the histogram of the CO2 emissions for rice you just calculated. Which of the following terms best describes the shape of the data?
Correct answer: Right-skewed
Filter food_consumption to get the rows where food_category is “rice”. Summarize the data to get the mean and median of co2_emission, calling them mean_co2 and median_co2.
food_consumption %>%
# Filter for rice food category
filter(food_category == "rice") %>%
# Get mean_co2 and median_co2
summarize(mean_co2 = mean(co2_emission),
median_co2 = median(co2_emission))
## # A tibble: 1 x 2
## mean_co2 median_co2
## <dbl> <dbl>
## 1 37.6 15.2
Given the skew of this data, what measure of central tendency best summarizes the kilograms of CO2 emissions per person per year for rice?
Correct answer: Median.
Variance: the average squared distance from each data point to the data’s mean.
The higher the variance, the more spread out the data is.
var()
Standard deviation: the square root of the variance; its units match the data, so it is usually easier to interpret.
Mean absolute deviation: takes the absolute value of the distances to the mean, and then takes the mean of those differences:
mean(abs())
Quartiles: they split the data into four equal-sized parts. quantile()
To see them visually we use boxplots.
Quantiles are the general version of percentiles; by default quantile() splits the data into four pieces (the quartiles), but we can adjust that by adding the probs argument and specifying where we want the splits. The seq() function is useful for specifying those divisions.
Interquartile range (IQR): the distance between the 75th and the 25th percentile. We can also calculate it as quantile(75%) - quantile(25%).
Outliers: data points that are substantially different from the others. How to find them:
Low outliers: data < (Q1 - 1.5 x IQR)
High outliers: data > (Q3 + 1.5 x IQR)
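As a quick illustration of these spread and outlier measures, here is a minimal sketch on a small made-up vector (the values in x are hypothetical, purely for illustration):
x <- c(2, 4, 4, 5, 7, 9, 11, 14, 40)
var(x)                  # variance
sd(x)                   # standard deviation
mean(abs(x - mean(x)))  # mean absolute deviation
quantile(x)                          # quartiles (default probs)
quantile(x, probs = seq(0, 1, 0.2))  # quintiles via seq()
iqr <- quantile(x, 0.75) - quantile(x, 0.25)
lower <- quantile(x, 0.25) - 1.5 * iqr
upper <- quantile(x, 0.75) + 1.5 * iqr
x[x < lower | x > upper]  # points flagged as outliers (here, 40)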
Quantiles are a great way of summarizing numerical data since they can be used to measure center and spread, as well as to get a sense of where a data point stands in relation to the rest of the dataset. For example, you might want to give a discount to the 10% most active users on a website.
In this exercise, you’ll calculate quartiles, quintiles, and deciles, which split up a dataset into 4, 5, and 10 pieces, respectively.
The dplyr package is loaded and food_consumption is available.
Calculate the quartiles of the co2_emission column of food_consumption.
# Calculate the quartiles of co2_emission
quantile(food_consumption$co2_emission)
## 0% 25% 50% 75% 100%
## 0.0000 5.2100 16.5300 62.5975 1712.0000
Calculate the six quantiles that split up the data into 5 pieces (quintiles) of the co2_emission column of food_consumption.
# Calculate the quintiles of co2_emission
quantile(food_consumption$co2_emission, probs = seq(0,1,0.2))
## 0% 20% 40% 60% 80% 100%
## 0.000 3.540 11.026 25.590 99.978 1712.000
Calculate the eleven quantiles of co2_emission that split up the data into ten pieces (deciles).
# Calculate the deciles of co2_emission
quantile(food_consumption$co2_emission, probs = seq(0,1,0.1))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.000 0.668 3.540 7.040 11.026 16.530 25.590 44.271
## 80% 90% 100%
## 99.978 203.629 1712.000
Variance and standard deviation are two of the most common ways to measure the spread of a variable, and you’ll practice calculating these in this exercise. Spread is important since it can help inform expectations. For example, if a salesperson sells a mean of 20 products a day, but has a standard deviation of 10 products, there will probably be days where they sell 40 products, but also days where they only sell one or two. Information like this is important, especially when making predictions.
Both dplyr and ggplot2 are loaded, and food_consumption is available.
Calculate the variance and standard deviation of co2_emission for each food_category by grouping by and summarizing variance as var_co2 and standard deviation as sd_co2.
# Calculate variance and sd of co2_emission for each food_category
food_consumption %>%
group_by(food_category) %>%
summarize(var_co2 = var(co2_emission),
sd_co2 = sd(co2_emission))
## # A tibble: 11 x 3
## food_category var_co2 sd_co2
## <fct> <dbl> <dbl>
## 1 beef 88748. 298.
## 2 eggs 21.4 4.62
## 3 fish 922. 30.4
## 4 lamb_goat 16476. 128.
## 5 dairy 17672. 133.
## 6 nuts 35.6 5.97
## 7 pork 3095. 55.6
## 8 poultry 245. 15.7
## 9 rice 2281. 47.8
## 10 soybeans 0.880 0.938
## 11 wheat 71.0 8.43
Create a histogram of co2_emission for each food_category using facet_wrap().
# Create subgraphs for each food_category: histogram of co2_emission
ggplot(food_consumption, aes(co2_emission)) +
# Create a histogram
geom_histogram() +
# Create a separate sub-graph for each food_category
facet_wrap(~ food_category)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Superb spread measurement! Beef has the biggest amount of variation in its CO2 emissions, while eggs, nuts, and soybeans have relatively small amounts of variation.
Outliers can have big effects on statistics like mean, as well as statistics that rely on the mean, such as variance and standard deviation. Interquartile range, or IQR, is another way of measuring spread that’s less influenced by outliers. IQR is also often used to find outliers. If a value is less than Q1 - 1.5 x IQR or greater than Q3 + 1.5 x IQR, it’s considered an outlier. In fact, this is how the lengths of the whiskers in a ggplot2 box plot are calculated.
In this exercise, you’ll calculate IQR and use it to find some outliers. Both dplyr and ggplot2 are loaded and food_consumption is available.
Calculate the total co2_emission per country by grouping by country and taking the sum of co2_emission. Call the sum total_emission and store the resulting data frame as emissions_by_country.
# Calculate total co2_emission per country: emissions_by_country
emissions_by_country <- food_consumption %>%
group_by(country) %>%
summarize(total_emission = sum(co2_emission))
emissions_by_country %>%
head()
## # A tibble: 6 x 2
## country total_emission
## <chr> <dbl>
## 1 Albania 1778.
## 2 Algeria 708.
## 3 Angola 413.
## 4 Argentina 2172.
## 5 Armenia 1110.
## 6 Australia 1939.
Compute the first and third quartiles of total_emission and store these as q1 and q3.
# Compute the first and third quantiles and IQR of total_emission
q1 <- quantile(emissions_by_country$total_emission, 0.25)
q3 <- quantile(emissions_by_country$total_emission, 0.75)
iqr <- q3-q1
Calculate the lower and upper cutoffs for outliers of total_emission, and store these as lower and upper.
# Calculate the lower and upper cutoffs for outliers
lower <- q1-1.5*iqr
upper <- q3+1.5*iqr
Use filter() to get countries with a total_emission greater than the upper cutoff or a total_emission less than the lower cutoff.
# Filter emissions_by_country to find outliers
emissions_by_country %>%
filter(total_emission < lower | total_emission > upper)
## # A tibble: 1 x 2
## country total_emission
## <chr> <dbl>
## 1 Argentina 2172.
Outstanding outlier detection! It looks like Argentina has a substantially higher amount of CO2 emissions per person than other countries in the world.
We can measure chance with probability: divide the number of ways an event can happen by the total number of possible outcomes. In dplyr we can draw random rows with sample_n(), passing the number of rows we want. Keep in mind that the rows are selected at random.
Reproducible sampling:
If we want to get the same result every time, we must call set.seed() first. The seed ensures that R starts its random selection from the same point; the specific number passed to the seed does not matter, what matters is using the same number every time we run the script.
Sampling without replacement:
This means the same row cannot be sampled twice; for example, sample_n(2) draws two distinct rows.
Sampling with replacement:
In this scenario each draw is put back, so the probabilities reset on every draw and the same row can be picked more than once:
sample_n(2, replace = TRUE)
Independent events:
Two events are independent if the probability of the second event isn’t affected by the outcome of the first event.
In general, with sampling with replacement, each pick is independent.
Dependent events:
Two events are dependent if the probability of the second event is affected by the outcome of the first event.
If we sample without replacement and pick one person the first time, that person cannot be picked again, so the second pick depends on the first.
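A minimal sketch of the difference, using a hypothetical data frame of people (the names are made up for illustration):
people <- data.frame(name = c("Ana", "Ben", "Cam", "Dev"))
set.seed(5)                              # makes the draws reproducible
people %>% sample_n(2)                   # without replacement: two distinct rows
people %>% sample_n(2, replace = TRUE)   # with replacement: the same row can repeat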
You’re in charge of the sales team, and it’s time for performance reviews, starting with Amir. As part of the review, you want to randomly select a few of the deals that he’s worked on over the past year so that you can look at them more deeply. Before you start selecting deals, you’ll first figure out what the chances are of selecting certain deals.
Recall that the probability of an event can be calculated by dividing the number of ways the event can happen by the total number of possible outcomes. dplyr is loaded and amir_deals is available.
Count the number of deals Amir worked on for each product type.
# Count the deals for each product
amir_deals %>%
count(product)
## product n
## 1 Product A 23
## 2 Product B 62
## 3 Product C 15
## 4 Product D 40
## 5 Product E 5
## 6 Product F 11
## 7 Product G 2
## 8 Product H 8
## 9 Product I 7
## 10 Product J 2
## 11 Product N 3
Create a new column called prob by dividing n by the total number of deals Amir worked on.
# Calculate probability of picking a deal with each product
amir_deals %>%
count(product) %>%
mutate(prob = n/sum(n))
## product n prob
## 1 Product A 23 0.12921348
## 2 Product B 62 0.34831461
## 3 Product C 15 0.08426966
## 4 Product D 40 0.22471910
## 5 Product E 5 0.02808989
## 6 Product F 11 0.06179775
## 7 Product G 2 0.01123596
## 8 Product H 8 0.04494382
## 9 Product I 7 0.03932584
## 10 Product J 2 0.01123596
## 11 Product N 3 0.01685393
If you randomly select one of Amir’s deals, what’s the probability that the deal will involve Product C?
Answer: 8.43%
In the previous exercise, you counted the deals Amir worked on. Now it’s time to randomly pick five deals so that you can reach out to each customer and ask if they were satisfied with the service they received. You’ll try doing this both with and without replacement.
Additionally, you want to make sure this is done randomly and that it can be reproduced in case you get asked how you chose the deals, so you’ll need to set the random seed before sampling from the deals.
dplyr is loaded and amir_deals is available.
Set the random seed to 31. Take a sample of 5 deals without replacement.
# Set random seed to 31
set.seed(31)
# Sample 5 deals without replacement
amir_deals %>%
sample_n(5)
## product client status amount num_users
## 1 Product D Current Lost 3086.88 55
## 2 Product C Current Lost 3727.66 19
## 3 Product D Current Lost 4274.80 9
## 4 Product B Current Won 4965.08 9
## 5 Product A Current Won 5827.35 50
Take a sample of 5 deals with replacement.
amir_deals %>%
sample_n(5, replace = TRUE)
## product client status amount num_users
## 1 Product A Current Won 6010.04 24
## 2 Product B Current Lost 5701.70 53
## 3 Product D Current Won 6733.62 27
## 4 Product F Current Won 6780.85 80
## 5 Product C Current Won -539.23 11
Question: What type of sampling is better to use for this situation? Answer: without replacement.
A probability distribution describes the probability of each possible outcome in a scenario. We can also talk about the expected value of a distribution, which is the mean of a distribution.
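For example, a minimal sketch of the expected value of a fair six-sided die, where each outcome has probability 1/6:
outcomes <- 1:6
probs <- rep(1/6, 6)
sum(outcomes * probs)  # expected value = 3.5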
A new restaurant opened a few months ago, and the restaurant’s management wants to optimize its seating space based on the size of the groups that come most often. On one night, there are 10 groups of people waiting to be seated at the restaurant, but instead of being called in the order they arrived, they will be called randomly. In this exercise, you’ll investigate the probability of groups of different sizes getting picked first. Data on each of the ten groups is contained in the restaurant_groups data frame.
Remember that expected value can be calculated by multiplying each possible outcome with its corresponding probability and taking the sum. The restaurant_groups data is available and dplyr and ggplot2 are loaded.
Create a histogram of the group_size column of restaurant_groups, setting the number of bins to 5.
# Create a histogram of group_size
ggplot(restaurant_groups, aes(group_size)) +
geom_histogram(bins = 5)
Count the number of each group_size in restaurant_groups, then add a column called probability that contains the probability of randomly selecting a group of each size. Store this in a new data frame called size_distribution.
# Create probability distribution
size_distribution <-restaurant_groups %>%
# Count number of each group size
count(group_size) %>%
# Calculate probability
mutate(probability = n / sum(n))
size_distribution
## group_size n probability
## 1 2 6 0.6
## 2 3 1 0.1
## 3 4 2 0.2
## 4 6 1 0.1
Calculate the expected value of the size_distribution, which represents the expected group size.
# Calculate expected group size
expected_val <- sum(size_distribution$group_size * size_distribution$probability)
expected_val
## [1] 2.9
Calculate the probability of randomly picking a group of 4 or more people by filtering and summarizing.
# Calculate probability of picking group of 4 or more
size_distribution %>%
# Filter for groups of 4 or larger
filter(group_size >= 4) %>%
# Calculate prob_4_or_more by taking sum of probabilities
summarize(prob_4_or_more = sum(probability))
## prob_4_or_more
## 1 0.3
The sales software used at your company is set to automatically back itself up, but no one knows exactly what time the back-ups happen. It is known, however, that back-ups happen exactly every 30 minutes. Amir comes back from sales meetings at random times to update the data on the client he just met with. He wants to know how long he’ll have to wait for his newly-entered data to get backed up. Use your new knowledge of continuous uniform distributions to model this situation and answer Amir’s questions.
To model how long Amir will wait for a back-up using a continuous uniform distribution, save his lowest possible wait time as min and his longest possible wait time as max. Remember that back-ups happen every 30 minutes.
# Min and max wait times for back-up that happens every 30 min
min <- 0
max <- 30
Calculate the probability that Amir has to wait less than 5 minutes, and store in a variable called prob_less_than_5.
# Calculate probability of waiting less than 5 mins
prob_less_than_5 <- punif(5, min = min, max = max)
prob_less_than_5
## [1] 0.1666667
Calculate the probability that Amir has to wait more than 5 minutes, and store in a variable called prob_greater_than_5.
# Calculate probability of waiting more than 5 mins
prob_greater_than_5 <- punif(30, min = min, max= max)- punif(5, min = min, max = max)
prob_greater_than_5
## [1] 0.8333333
Calculate the probability that Amir has to wait between 10 and 20 minutes, and store in a variable called prob_between_10_and_20.
# Calculate probability of waiting 10-20 mins
prob_between_10_and_20 <- punif(20, min = min, max = max) - punif(10, min = min, max = max)
prob_between_10_and_20
## [1] 0.3333333
To give Amir a better idea of how long he’ll have to wait, you’ll simulate Amir waiting 1000 times and create a histogram to show him what he should expect. Recall from the last exercise that his minimum wait time is 0 minutes and his maximum wait time is 30 minutes.
A data frame called wait_times is available and dplyr and ggplot2 are loaded.
Set the random seed to 334.
set.seed(334)
Generate 1000 wait times from the continuous uniform distribution that models Amir’s wait time. Add this as a new column called time in the wait_times data frame.
Create a histogram of the simulated wait times with 30 bins.
# Generate 1000 wait times between 0 and 30 mins, save in time column
wait_times %>%
mutate(time = runif(1000, min = 0, max = 30)) %>%
# Create a histogram of simulated times
ggplot(aes(time)) +
geom_histogram(bins = 30)
Superb simulating! Unless Amir figures out exactly what time each backup happens, he won’t be able to time his data entry so it gets backed up sooner, but it looks like he’ll wait about 15 minutes on average.
The binomial distribution describes the probability of the number of successes in a sequence of independent trials. In other words, it can tell us the probability of getting some number of heads in a sequence of coin flips. Note that this is a discrete distribution since we’re working with a countable outcome.
n = total number of trials, p = probability of success.
The probability, distribution, and quantile functions can be evaluated in R for a set of values with dbinom (probability), pbinom (distribution), and qbinom (quantile).
The rbinom function generates n random observations that follow a binomial distribution in R.
TO APPLY THIS DISTRIBUTION, EACH TRIAL MUST BE INDEPENDENT.
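A minimal sketch of the four functions, using coin flips as a stand-in example (10 flips of a fair coin, so size = 10 and prob = 0.5):
dbinom(5, size = 10, prob = 0.5)    # P(exactly 5 heads)
pbinom(5, size = 10, prob = 0.5)    # P(5 heads or fewer)
qbinom(0.5, size = 10, prob = 0.5)  # the median number of heads
rbinom(3, size = 10, prob = 0.5)    # simulate 3 series of 10 flips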
Assume that Amir usually works on 3 deals per week, and overall, he wins 30% of deals he works on. Each deal has a binary outcome: it’s either lost, or won, so you can model his sales deals with a binomial distribution. In this exercise, you’ll help Amir simulate a year’s worth of his deals so he can better understand his performance.
Set the random seed to 10 and simulate a single deal.
# Set random seed to 10
set.seed(10)
# Simulate a single deal
rbinom(1, 1, 0.3) # note that this is 1 week; we assume that because we want an initial probability, and we are told that this week we only work one deal, so N = 1, and the probability that the deal is a success stays at 30%.
## [1] 0
However, if we took a full week, i.e. one with 3 deals, the result would be different. Look:
# Set random seed to 10
set.seed(10)
# Simulate a single deal
rbinom(1, 3, 0.3) # It is 1 full week, so the first argument is 1; there are 3 deals, as the exercise states, so N = 3, and the probability that each deal is a success is still 30%. With that, the result is quite different.
## [1] 1
Simulate a typical week of Amir’s deals, or one week of 3 deals.
# Set random seed to 10
set.seed(10)
# Simulate 1 week of 3 deals
rbinom(1, 3, 0.3) # in this case it is a typical week, i.e., 1 week, which is why the first argument is 1; that week has 3 deals, so N = 3, and the probability that a deal is a success is 30%.
## [1] 1
Simulate a year’s worth of Amir’s deals, or 52 weeks of 3 deals each, and store in deals. Calculate the mean number of deals he won per week.
# Set random seed to 10
set.seed(10)
# Simulate 52 weeks of 3 deals
deals <- rbinom(52, 3, 0.3) # note that there are 52 repetitions, and each repetition contains three deals, i.e., the number of trials is 3, and the success probability is 30%.
# Calculate mean deals won per week
mean(deals)
## [1] 0.8076923
Just as in the last exercise, assume that Amir wins 30% of deals. He wants to get an idea of how likely he is to close a certain number of deals each week. In this exercise, you’ll calculate what the chances are of him closing different numbers of deals using the binomial distribution.
What’s the probability that Amir closes all 3 deals in a week?
# Probability of closing 3 out of 3 deals
dbinom(3, 3, 0.3) # here we use dbinom() because we want a single probability mass: exactly 3 deals won in a week. We want the probability of closing 100% of the deals.
## [1] 0.027
# Put another way: out of the 3 trials (deals), we want the probability that all 3 succeed, given that the success probability is 30%.
What’s the probability that Amir closes 1 or fewer deals in a week?
# Probability of closing <= 1 deal out of 3 deals
pbinom(1, 3, 0.3) # in this case we want more than a single probability: before we only computed the probability of winning exactly 3 deals in a week, i.e., a single case.
## [1] 0.784
# Now we want the probability that the number of successes is 1 or fewer out of 3 deals, given a 30% success probability.
What’s the probability that Amir closes more than 1 deal?
# Probability of closing > 1 deal out of 3 deals
pbinom(1, 3, 0.3, lower.tail = FALSE)
## [1] 0.216
# In this case we compute the probability of more than 1 success, i.e., 2 or even 3 successes out of three trials.
Now Amir wants to know how many deals he can expect to close each week if his win rate changes. Luckily, you can use your binomial distribution knowledge to help him calculate the expected value in different situations. Recall from the video that the expected value of a binomial distribution can be calculated by “n * p”.
Calculate the expected number of sales out of the 3 he works on that Amir will win each week if he maintains his 30% win rate.
# Expected number won with 30% win rate
won_30pct <- 3 * 0.3
won_30pct
## [1] 0.9
Calculate the expected number of sales out of the 3 he works on that he’ll win if his win rate drops to 25%.
# Expected number won with 25% win rate
won_25pct <- 3 *0.25
won_25pct
## [1] 0.75
Calculate the expected number of sales out of the 3 he works on that he’ll win if his win rate rises to 35%.
# Expected number won with 35% win rate
won_35pct <- 3 * 0.35
won_35pct
## [1] 1.05
The normal distribution is one of the most widely used distributions. It is symmetric and bell-shaped.
Alongside it there is the standard normal distribution, which has a mean of 0 and a standard deviation of 1.
Functions to keep in mind:
Goal: a probability
What percentage of women are shorter than 154 cm?
In this case we use pnorm(target value, mean = our mean, sd = our sd).
Goal: an exact quantity
What is the height of the women who are shorter than 90% of all women?
In this case we use qnorm(our probability, mean = our mean, sd = our standard deviation).
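A minimal sketch of both questions, assuming for illustration that women’s heights follow a normal distribution with mean 161 cm and sd 7 cm (these parameter values are made up):
pnorm(154, mean = 161, sd = 7)  # percentage of women shorter than 154 cm
qnorm(0.1, mean = 161, sd = 7)  # height below which the shortest 10% of women fall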
We can also generate random draws from a normal distribution with rnorm(number of points, mean = whatever we want, sd = whatever we want):
rnorm(10, mean = 0, sd = 1)
## [1] -0.68755543 -0.87215883 -0.10176101 -0.25378053 -1.85374045 -0.07794607
## [7] 0.96856634 0.18492596 -1.37994358 -1.43551436
Since each deal Amir worked on (both won and lost) was different, each was worth a different amount of money. These values are stored in the amount column of amir_deals. As part of Amir’s performance review, you want to be able to estimate the probability of him selling different amounts, but before you can do this, you’ll need to determine what kind of distribution the amount variable follows.
Both dplyr and ggplot2 are loaded and amir_deals is available.
Create a histogram with 10 bins to visualize the distribution of the amount.
amir_deals %>%
head()
## product client status amount num_users
## 1 Product F Current Won 7389.52 19
## 2 Product C New Won 4493.01 43
## 3 Product B New Won 5738.09 87
## 4 Product I Current Won 2591.24 83
## 5 Product E Current Won 6622.97 17
## 6 Product B New Won 5496.27 2
# Histogram of amount with 10 bins
ggplot(amir_deals, aes(amount))+
geom_histogram(bins = 10)
Which probability distribution do the sales amounts most closely follow?
Correct answer = normal.
Since each deal Amir worked on (both won and lost) was different, each was worth a different amount of money. These values are stored in the amount column of amir_deals and follow a normal distribution with a mean of 5000 dollars and a standard deviation of 2000 dollars. As part of his performance metrics, you want to calculate the probability of Amir closing a deal worth various amounts.
What’s the probability of Amir closing a deal worth less than $7500?
# Explanation: we use pnorm because we have the amount and we are looking for a probability.
# Probability of deal < 7500
pnorm(7500, mean = 5000, sd = 2000)
## [1] 0.8943502
What’s the probability of Amir closing a deal worth more than $1000?
# Probability of deal > 1000
1 - pnorm(1000, mean = 5000, sd = 2000)
## [1] 0.9772499
EXPLANATION: why do I compute 1 - ? The pnorm function looks at the lower tail, i.e., it would give me the probability that Amir closes a deal worth less than $1000. Everything that is not below $1000 is above it, so I subtract that probability from 1.
The probability that he closes a deal worth less than $1000 is:
pnorm(1000, mean = 5000, sd = 2000)
## [1] 0.02275013
It checks out.
What’s the probability of Amir closing a deal worth between $3000 and $7000?
# Probability of deal between 3000 and 7000
pnorm(7000, mean = 5000, sd = 2000) - pnorm(3000, mean = 5000, sd = 2000)
## [1] 0.6826895
What amount will 75% of Amir’s sales be more than?
# Calculate amount that 75% of deals will be more than
qnorm(0.75, mean =5000, sd = 2000, lower.tail = FALSE)
## [1] 3651.02
EXPLANATION: in the theory I mentioned there were two functions. This is the second one: here we look for the dollar amount corresponding to a probability. Like pnorm, this function looks at the left tail by default, i.e., it would return the amount that 75% of the sales fall below:
qnorm(0.75, mean =5000, sd = 2000)
## [1] 6348.98
However, that is not what we are looking for. The question asks what amount 75% of Amir’s sales will exceed: he will close deals above $3,651.02 in 75% of his sales.
The company’s financial analyst is predicting that next quarter, the worth of each sale will increase by 20% and the volatility, or standard deviation, of each sale’s worth will increase by 30%. To see what Amir’s sales might look like next quarter under these new market conditions, you’ll simulate new sales amounts using the normal distribution and store these in the new_sales data frame, which has already been created for you.
In addition, dplyr and ggplot2 are loaded.
new_sales <- seq(1, 36, 1) %>%
as.data.frame() %>%
set_names("sale_num")
Currently, Amir’s average sale amount is $5000. Calculate what his new average amount will be if it increases by 20% and store this in new_mean.
# Calculate new average amount
new_mean <- 5000 * 1.2
Amir’s current standard deviation is $2000. Calculate what his new standard deviation will be if it increases by 30% and store this in new_sd.
# Calculate new standard deviation
new_sd <- 2000 * 1.3
Add a new column called amount to the data frame new_sales, which contains 36 simulated amounts from a normal distribution with a mean of new_mean and a standard deviation of new_sd.
new_sales %>%
mutate(amount = rnorm(36, mean = new_mean, sd = new_sd))
## sale_num amount
## 1 1 6941.427
## 2 2 1426.374
## 3 3 5156.186
## 4 4 4305.936
## 5 5 8825.034
## 6 6 4017.383
## 7 7 3845.477
## 8 8 8169.632
## 9 9 3484.105
## 10 10 5925.080
## 11 11 6604.565
## 12 12 5216.857
## 13 13 4238.202
## 14 14 7703.592
## 15 15 4958.342
## 16 16 5130.153
## 17 17 9556.680
## 18 18 11558.194
## 19 19 7315.130
## 20 20 8044.490
## 21 21 3654.249
## 22 22 7385.532
## 23 23 4320.675
## 24 24 6756.567
## 25 25 2782.254
## 26 26 4813.942
## 27 27 3841.161
## 28 28 6884.301
## 29 29 8772.579
## 30 30 9161.927
## 31 31 7912.796
## 32 32 4748.858
## 33 33 7463.136
## 34 34 2759.569
## 35 35 6990.398
## 36 36 2280.889
# Simulate 36 sales
new_sales <- new_sales %>%
mutate(amount = rnorm(36, mean = new_mean, sd = new_sd))
Plot the distribution of the new_sales amounts using a histogram with 10 bins.
# Create histogram with 10 bins
ggplot(new_sales, aes(amount)) +
geom_histogram(bins = 10)
Successful simulating! Although the average sale amount went up, the variation also increased, so it’s not straightforward to decide whether these sales are better than his current ones. In the next exercise, you’ll explore the effects of higher variation.
The key metric that the company uses to evaluate salespeople is the percent of sales they make over $1000 since the time put into each sale is usually worth a bit more than that, so the higher this metric, the better the salesperson is performing.
Recall that Amir’s current sales amounts have a mean of $5000 and a standard deviation of $2000, and Amir’s predicted amounts in next quarter’s market have a mean of $6000 and a standard deviation of $2600.
Based only on the metric of percent of sales over $1000, does Amir perform better in the current market or the predicted market?
We need to see what the sales look like under each set of parameters:
Scenario 1: current market, mean = 5000, sd = 2000.
rnorm(36, 5000, 2000) %>%
as.data.frame() %>%
set_names("Sales") %>%
count(Sales>1000)
## Sales > 1000 n
## 1 TRUE 36
Scenario 2: predicted market, mean = 6000, sd = 2600.
rnorm(36, 6000, 2600) %>%
as.data.frame() %>%
set_names("Sales") %>%
count(Sales>1000)
## Sales > 1000 n
## 1 FALSE 1
## 2 TRUE 35
Great work! In the current market, Amir makes sales over $1000 about 97.7% of the time, and about 97.3% of the time in the predicted market, so there’s not much of a difference. However, his average sale amount is higher in the predicted market, so the company may want to consider other metrics as well.
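A sketch of the exact calculation behind those percentages, using pnorm instead of a 36-draw simulation:
# Current market: P(sale > 1000) with mean 5000 and sd 2000
pnorm(1000, mean = 5000, sd = 2000, lower.tail = FALSE)  # ~0.977
# Predicted market: P(sale > 1000) with mean 6000 and sd 2600
pnorm(1000, mean = 6000, sd = 2600, lower.tail = FALSE)  # ~0.973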
What this theory says is that the sampling distribution of a statistic (for example, the mean of several die rolls) gets closer to a normal distribution as the number of samples increases.
Example: we roll the die 5 times and take the mean.
die <- seq(1,6,1)
sample(die, 5, replace = T) %>%
mean()
## [1] 3.2
sample(die, 5, replace = T) %>%
mean()
## [1] 3.6
sample(die, 5, replace = T) %>%
mean()
## [1] 4
The means are quite different, but what happens if we repeat this 10 times?
sample_means <- replicate(10, sample(die, 5, replace = T) %>% mean())
sample_means
## [1] 4.0 3.8 4.6 4.4 3.2 2.0 4.0 2.6 3.4 1.8
Plotted, it looks like this:
sample_means %>%
hist()
Now, what happens if we increase it to 100 replications?
replicate(100, sample(die, 5, replace = T) %>% mean()) %>%
hist()
It is starting to look more like a bell.
And what if we do it with 1,000?
replicate(1000, sample(die, 5, replace = T) %>% mean()) %>%
hist()
This really does have a bell shape.
THIS IS THE CENTRAL LIMIT THEOREM.
IT SHOULD ONLY BE USED WHEN THE TRIALS ARE RANDOM AND INDEPENDENT.
Its effect on other statistics:
SD
replicate(1000, sample(die, 5, replace = T) %>% sd()) %>%
hist()
The sample standard deviations are themselves approximately normally distributed.
Proportion
That is, the proportion of a category in a sample.
sales_team <- c("Amir", "Brian", "Claire", "Damian")
If we take this vector and sample 10 names from it:
# Helper: computes the proportion of "Claire" in a sample of 10 names
vista_numerica_hist <- function(data){
data %>%
as.data.frame() %>%
set_names("Nombres") %>%
group_by(Nombres) %>%
summarise("veces" = n()) %>%
mutate(
Proporcion = veces/10
) %>%
filter(Nombres == "Claire") %>%
select(Proporcion)
}
sample(sales_team, 10, replace = T) %>%
vista_numerica_hist()
## # A tibble: 1 x 1
## Proporcion
## <dbl>
## 1 0.4
In this single sample, Claire’s proportion comes out at 0.4, even though her true share of the team is 0.25 (1 out of 4 salespeople).
And what if we repeat it about 1,000 times?
The resulting distribution of sample proportions would again look like a bell.
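A minimal sketch of that 1,000-replication version, computing Claire’s proportion directly instead of going through the helper function above:
sample_props <- replicate(
  1000,
  mean(sample(sales_team, 10, replace = TRUE) == "Claire")
)
hist(sample_props)  # roughly bell-shaped, centered near 0.25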
Try creating sampling distributions of different summary statistics from samples of different distributions. Which distribution does the central limit theorem not apply to?
Victorious visualizing! Regardless of the shape of the distribution you’re taking sample means from, the central limit theorem will apply if the sampling distribution contains enough sample means.
The central limit theorem states that a sampling distribution of a sample statistic approaches the normal distribution as you take more samples, no matter the original distribution being sampled from.
In this exercise, you’ll focus on the sample mean and see the central limit theorem in action while examining the num_users column of amir_deals more closely, which contains the number of people who intend to use the product Amir is selling.
Both dplyr and ggplot2 are loaded and amir_deals is available.
Create a histogram of the num_users column of amir_deals. Use 10 bins.
amir_deals %>%
head()
## product client status amount num_users
## 1 Product F Current Won 7389.52 19
## 2 Product C New Won 4493.01 43
## 3 Product B New Won 5738.09 87
## 4 Product I Current Won 2591.24 83
## 5 Product E Current Won 6622.97 17
## 6 Product B New Won 5496.27 2
# Create a histogram of num_users
amir_deals %>% ggplot(.,aes(num_users)) + geom_histogram(bins = 10)
Set the seed to 104. Take a sample of size 20 with replacement from the num_users column of amir_deals, and take the mean.
# Set seed to 104
set.seed(104)
# Sample 20 num_users with replacement from amir_deals
sample(amir_deals$num_users, size = 20, replace = TRUE) %>%
# Take mean
mean()
## [1] 30.35
Repeat this 100 times and store as sample_means. This will take 100 different samples and calculate the mean of each.
# Repeat the above 100 times
sample_means <- replicate(100, sample(amir_deals$num_users, size = 20, replace = TRUE) %>% mean())
A data frame called samples has been created for you with a column mean, which contains the values from sample_means. Create a histogram of the mean column with 10 bins.
# Create data frame for plotting
samples <- data.frame(mean = sample_means)
# Histogram of sample means
ggplot(samples,aes(mean)) + geom_histogram(bins= 10)
Fabulous job! You’ve just seen the central limit theorem at work. Even though the distribution of num_users is not normal, the distribution of its sample mean resembles the normal distribution.
You want to know what the average number of users (num_users) is per deal, but you want to know this number for the entire company so that you can see if Amir’s deals have more or fewer users than the company’s average deal. The problem is that over the past year, the company has worked on more than ten thousand deals, so it’s not realistic to compile all the data. Instead, you’ll estimate the mean by taking several random samples of deals, since this is much easier than collecting data from everyone in the company.
The user data for all the company’s deals is available in all_deals.
Set the random seed to 321. Take 30 samples of size 20 from all_deals$num_users and take the mean of each sample. Store the sample means in sample_means.
# Set seed to 321
set.seed(321)
# Take 30 samples of 20 values of num_users, take mean of each sample
sample_means <- replicate(30, sample(all_deals$num_users, 20) %>% mean())
Take the mean of sample_means.
# Calculate mean of sample_means
mean(sample_means)
## [1] 37.02667
Take the mean of the num_users column of amir_deals.
# Calculate mean of num_users in amir_deals
mean(amir_deals$num_users)
## [1] 37.65169
Magnificent mean calculation! Amir’s average number of users is very close to the overall average, so it looks like he’s meeting expectations. Make sure to note this in his performance review!
A Poisson distribution describes a series of events that happen at a known average rate but whose individual occurrences are completely random, for example:
1- The number of earthquakes in California each week. 2- The number of dog adoptions at an animal shelter.
What it describes is the probability of x events occurring in a given interval of time.
Example: P(X >= 5), the probability that 5 or more dogs are adopted at a shelter in a week.
The Poisson distribution is described by lambda: the mean number of events that happen in a time interval. Lambda is also the expected value of the distribution.
We will often already have an average of what happens, such as the mean number of earthquakes in California per week; that average is our lambda.
Exact probability: dpois(the value we want to investigate, lambda = our mean). This gives the probability that exactly that many events (the first argument) occur.
# our lambda
lambda = 8
# we want to calculate the probability that exactly 5 earthquakes occur in a week in California:
dpois(5, lambda = lambda)
## [1] 0.09160366
Probability of that many events or fewer
We use the ppois function; the syntax is the same.
Example:
What is the probability of 5 or fewer earthquakes in a week in California, given a mean of 8 earthquakes per week?
ppois(5, lambda = lambda)
## [1] 0.1912361
Probability of more than that many events
We simply add lower.tail = FALSE to ppois, since by default this function looks at the left (lower) tail of the distribution, not the right.
ppois(5, lambda = lambda, lower.tail = F)
## [1] 0.8087639
Which, if we think about it, is the same as doing:
1 - ppois(5, lambda = lambda)
## [1] 0.8087639
We can simulate Poisson draws with rpois(number of draws, lambda = our lambda).
That is, if we want to simulate how many adoptions an animal shelter will have in each of 10 weeks:
rpois(10, lambda = lambda)
## [1] 6 6 10 8 5 6 7 11 12 6
As for the distribution of the sample means, the more samples we take, the more bell-shaped it looks.
One year:
replicate(52, rpois(10, lambda = lambda) %>% mean()) %>%
hist()
Two years:
replicate(
(52*2), # two years, 52 weeks * 2
rpois(10, lambda = lambda) %>% mean()) %>%
hist()
The value of lambda changes the shape of the distribution:
lambda_1 <- 1
lambda_4 <- 4
lambda_8 <- 8
muestreo <- 1000
replicate(
muestreo,
rpois(10, lambda = lambda_1) %>% mean()) %>%
hist()
replicate(
muestreo,
rpois(10, lambda = lambda_4) %>% mean()) %>%
hist()
replicate(
muestreo,
rpois(10, lambda = lambda_8) %>% mean()) %>%
hist()
Marvelous matching! The Poisson distribution is a family of distributions, just like the uniform, binomial, or normal distributions.
Your company uses sales software to keep track of new sales leads. It organizes them into a queue so that anyone can follow up on one when they have a bit of free time. Since the number of lead responses is a countable outcome over a period of time, this scenario corresponds to a Poisson distribution. On average, Amir responds to 4 leads each day. In this exercise, you’ll calculate probabilities of Amir responding to different numbers of leads.
What’s the probability that Amir responds to 5 leads in a day, given that he responds to an average of 4?
# Probability of 5 responses
#nuestro lambda es de 4
dpois(5, lambda = 4)
## [1] 0.1562935
Amir’s coworker responds to an average of 5.5 leads per day. What is the probability that she answers 5 leads in a day?
# Probability of 5 responses from coworker
dpois(5, 5.5)
## [1] 0.1714007
What’s the probability that Amir responds to 2 or fewer leads in a day?
# Probability of 2 or fewer responses
ppois(2, 4)
## [1] 0.2381033
What’s the probability that Amir responds to more than 10 leads in a day?
# Probability of > 10 responses
ppois(10, 4, lower.tail = FALSE)
## [1] 0.002839766
Which is the same as doing:
1 - ppois(10, 4)
## [1] 0.002839766
Perfect Poisson probabilities! Note that if you provide dpois() or ppois() with a non-integer, it returns 0 and throws a warning since the Poisson distribution only applies to integers.
Its use: it describes the probability of a certain amount of time passing between two Poisson events.
When it can be used, examples:
The probability that more than one day passes between one adoption and the next.
The parameters keep the same meaning in this context.
It is a continuous distribution, since time is the variable involved.
The probability that less than a given amount of time passes
pexp(target, rate = our rate)
For example, what is the probability that we have to wait less than one minute for a ticket to be generated?
Here we assume a rate of 2 tickets per minute (note that rate is the number of events per unit of time, so “one ticket every 2 minutes” would instead be rate = 0.5).
pexp(1, rate = 2)
## [1] 0.8646647
# Careful: we are passing the rate (events per unit of time), not the mean waiting time.
The probability of waiting more than a given amount of time
That is, what is the probability that we have to wait more than 3 minutes for the next ticket?
pexp(3, rate = 2, lower.tail = FALSE)
## [1] 0.002478752
As before, using lower.tail = FALSE is equivalent to:
1 - pexp(3, rate = 2)
## [1] 0.002478752
The probability that the wait falls between two values
Since this works like the Poisson functions, i.e. looking at the left tail of the distribution, we have to write a bit more code:
What is the probability that we have to wait between 1 and 3 minutes for a ticket to be generated?
pexp(3, rate = 2) - pexp(1, rate = 2)
## [1] 0.1328565
In the Poisson distribution the rate translates into lambda, i.e., the mean number of events per unit of time (for example, per minute).
If we know that 1 ticket is generated every 2 minutes, then 0.5 tickets are generated per minute.
lambda_poisson <- 1/2
In the exponential distribution, however, the quantity of interest is the mean waiting time, which works the other way around:
1/lambda, which gives us the average time between tickets:
1/lambda_poisson
## [1] 2
That means that, on average, 2 minutes pass between the generation of one ticket and the next.
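A quick sketch to check that relationship by simulation (the number of draws is arbitrary):
rate <- 0.5                        # 0.5 tickets generated per minute
waits <- rexp(100000, rate = rate) # simulated waiting times between tickets
mean(waits)                        # close to 1 / rate = 2 minutes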
The t-distribution looks similar to the normal distribution, but there are differences.
Its parameter, known as the degrees of freedom, affects the thickness of the tails of the curve.
Low degrees of freedom result in thicker tails, i.e., a higher standard deviation.
High degrees of freedom result in a distribution that looks more like the normal.
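A minimal sketch comparing tail quantiles: with few degrees of freedom the t-distribution’s tails are much thicker than the normal’s, and with many they nearly coincide:
qnorm(0.975)         # ~1.96 for the standard normal
qt(0.975, df = 3)    # ~3.18: much further out with 3 degrees of freedom
qt(0.975, df = 100)  # ~1.98: close to the normal with 100 degrees of freedom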
The log-normal distribution covers cases whose logarithm is normally distributed.
These distributions are usually skewed to one side; although it may seem like something that doesn’t appear in the real world, there are cases that follow it:
1- Adult blood pressure. 2- How long a chess game can last.
To further evaluate Amir’s performance, you want to know how much time it takes him to respond to a lead after he opens it. On average, it takes 2.5 hours for him to respond. In this exercise, you’ll calculate probabilities of different amounts of time passing between Amir receiving a lead and sending a response.
What’s the probability it takes Amir less than an hour to respond to a lead?
# Probability response takes < 1 hour
pexp(1, rate = 1/2.5)
## [1] 0.32968
What’s the probability it takes Amir more than 4 hours to respond to a lead?
# Probability response takes > 4 hours
pexp(4, rate = 1/2.5, lower.tail = F)
## [1] 0.2018965
What’s the probability it takes Amir 3-4 hours to respond to a lead?
# Probability response takes 3-4 hours
pexp(4, rate = 1/2.5) - pexp(3, rate = 1/2.5)
## [1] 0.09929769
In this chapter, you’ll be working with a dataset world_happiness containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.
In this exercise, you’ll examine the relationship between a country’s life expectancy (life_exp) and happiness score (happiness_score) both visually and quantitatively. Both dplyr and ggplot2 are loaded and world_happiness is available.
Create a scatterplot of happiness_score vs. life_exp using ggplot2.
# Create a scatterplot of happiness_score vs. life_exp
ggplot(world_happiness, aes(life_exp, happiness_score))+
geom_point()
Add a linear trendline to the scatterplot, setting se to FALSE.
# Add a linear trendline to scatterplot
ggplot(world_happiness, aes(life_exp, happiness_score)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Calculate the correlation between life_exp and happiness_score.
# Correlation between life_exp and happiness_score
cor(world_happiness$life_exp, world_happiness$happiness_score)
## [1] 0.7737615
There can be relationships that are not linear; if the relationship is not linear, the (Pearson) correlation coefficient is not a useful summary of it.
Transformations can be applied, since some statistical methods depend on the variables having a linear relationship.
We apply the log() function to the column we want to transform. It is effective when the variable is skewed toward one extreme, for example:
hist(world_happiness$gdp_per_cap)
world_happiness %>%
mutate(log_gdp_per_cap = log(gdp_per_cap)) %>%
ggplot(aes(log_gdp_per_cap)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Other transformations (for example, the square root) can be applied in the same way as the logarithm, but in different situations, such as when the distribution is skewed to the right.
Correlation does not mean there is a cause-and-effect relationship. We have to check whether, in the wider context, there are other variables that actually drive the correlation; if a third variable explains the relationship better than the original one, that hidden variable is called a “confounding” variable.
While the correlation coefficient is a convenient way to quantify the strength of a relationship between two variables, it’s far from perfect. In this exercise, you’ll explore one of the caveats of the correlation coefficient by examining the relationship between a country’s GDP per capita (gdp_per_cap) and happiness score.
Both dplyr and ggplot2 are loaded and world_happiness is available.
Create a scatterplot showing the relationship between gdp_per_cap (on the x-axis) and life_exp (on the y-axis).
# Scatterplot of gdp_per_cap and life_exp
ggplot(world_happiness, aes(gdp_per_cap, life_exp))+ geom_point()
Calculate the correlation between gdp_per_cap and life_exp.
# Correlation between gdp_per_cap and life_exp
cor(world_happiness$gdp_per_cap, world_happiness$life_exp)
## [1] 0.7235027
When variables have skewed distributions, they often require a transformation in order to form a linear relationship with another variable so that correlation can be computed. In this exercise, you’ll perform a transformation yourself.
Both dplyr and ggplot2 are loaded and world_happiness is available.
Create a scatterplot of happiness_score vs. gdp_per_cap and calculate the correlation between them.
# Scatterplot of happiness_score vs. gdp_per_cap
ggplot(world_happiness, aes(gdp_per_cap, happiness_score)) + geom_point()
# Calculate correlation
cor(world_happiness$gdp_per_cap, world_happiness$happiness_score)
## [1] 0.7601853
Add a new column to world_happiness called log_gdp_per_cap that contains the log of gdp_per_cap. Create a scatterplot of log_gdp_per_cap vs. happiness_score and calculate the correlation between them.
# Create log_gdp_per_cap column
world_happiness <- world_happiness %>%
mutate(log_gdp_per_cap = log(gdp_per_cap))
# Scatterplot of log_gdp_per_cap vs. happiness_score
ggplot(world_happiness, aes(log_gdp_per_cap, happiness_score)) +
geom_point()
# Calculate correlation
cor(world_happiness$happiness_score, world_happiness$log_gdp_per_cap)
## [1] 0.7965484
Terrific transforming! The relationship between GDP per capita and happiness became more linear by applying a log transformation. Log transformations are great to use on variables with a skewed distribution, such as GDP.
A new column has been added to world_happiness called grams_sugar_per_day, which contains the average amount of sugar eaten per person per day in each country. In this exercise, you’ll examine the effect of a country’s average sugar consumption on its happiness score.
Both dplyr and ggplot2 are loaded and world_happiness is available.
Create a scatterplot showing the relationship between grams_sugar_per_day (on the x-axis) and happiness_score (on the y-axis).
# Scatterplot of grams_sugar_per_day and happiness_score
ggplot(world_happiness, aes(grams_sugar_per_day, happiness_score)) + geom_point()
Interpretation: Nice interpretation of correlation! If correlation always implied that one thing causes another, people may do some nonsensical things, like eat more sugar to be happier.
Calculate the correlation between grams_sugar_per_day and happiness_score.
# Correlation between grams_sugar_per_day and happiness_score
cor(world_happiness$grams_sugar_per_day, world_happiness$happiness_score)
## [1] 0.69391
Usually the data we encounter was collected for a specific purpose, so it is a good idea to start by analyzing the context in which the data was gathered before analyzing the data itself.
Experiments: what is the effect of a treatment on a response?
We have to keep in mind that experiments can involve confounding variables, so some adjustments are sometimes needed:
1.- Randomized controlled trial:
This ensures that participants are assigned to groups randomly, which removes, or at least reduces, confounding.
2- Use a placebo:
This can be used whenever the treatment consists of something physical.
3- Double-blind trial:
This takes the placebo one step further: not even the person administering the dose knows whether they are handing out the placebo or the real treatment.
With these measures we can assess cause and effect more reliably, because we are minimizing confounding.
Observational studies: the groups are not randomly assigned at all; they are self-selected or pre-existing.
In these cases we observe association, not causation.
Types of regression (see the sketch after this list):
Linear regression: used when the response has to be numeric.
Logistic regression: used when the response is logical, i.e., TRUE or FALSE.
Simple regression: used when there is only one explanatory variable.
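A minimal sketch of what each case looks like in R, on a hypothetical data frame df with a numeric response y_num, a logical response y_log, and one explanatory variable x (all made up for illustration):
df <- data.frame(
  x = 1:20,
  y_num = 2 + 3 * (1:20) + rnorm(20),
  y_log = sample(c(TRUE, FALSE), 20, replace = TRUE)
)
lm(y_num ~ x, data = df)                      # linear regression: numeric response
glm(y_log ~ x, data = df, family = binomial)  # logistic regression: TRUE/FALSE response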
Before you can run any statistical models, it’s usually a good idea to visualize your dataset. Here, we’ll look at the relationship between house price per area and the number of nearby convenience stores, using the Taiwan real estate dataset.
One challenge in this dataset is that the number of convenience stores contains integer data, causing points to overlap. To solve this, you will make the points transparent.
taiwan_real_estate is available, ggplot2 is loaded, and its black and white theme has been set.
Using taiwan_real_estate, draw a scatter plot of price_twd_msq (y-axis) versus n_convenience (x-axis).
# Draw a scatter plot of n_convenience vs. price_twd_msq
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point()
Update the plot to make the points 50% transparent by setting alpha to 0.5.
# Make points 50% transparent
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point(alpha = 0.5)
Update the plot by adding a trend line, calculated using a linear regression. You can omit the confidence ribbon.
# Add a linear trend line without a confidence ribbon
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm",
se = F)
## `geom_smooth()` using formula 'y ~ x'
The theory says that:
y = intercept + slope * x
Intercept: the value that y takes when x is 0.
Slope: the amount by which y increases when x increases by 1.
While ggplot can display a linear regression trend line using geom_smooth(), it doesn’t give you access to the intercept and slope as variables, or allow you to work with the model results as variables. That means that sometimes you’ll need to run a linear regression yourself.
Time to run your first model!
taiwan_real_estate is available. TWD is an abbreviation for Taiwan dollars.
Run a linear regression with price_twd_msq as the response variable, n_convenience as the explanatory variable, and taiwan_real_estate as the dataset.
# Run a linear regression of price_twd_msq vs. n_convenience
lm(price_twd_msq ~ n_convenience, data = taiwan_real_estate)
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience
## 8.2242 0.7981
Question The model had an (Intercept) coefficient of 8.2242. What does this mean?
Correct answer: On average, a house with zero convenience stores nearby had a price of 8.2242 TWD per square meter.
Question: The model had an n_convenience coefficient of 0.7981. What does this mean?
If you increase the number of nearby convenience stores by one, then the expected increase in house price is 0.7981 TWD per square meter.
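A small sketch of how the two coefficients combine into a prediction, using the values printed above (for example, a house with 5 nearby convenience stores):
intercept <- 8.2242
slope <- 0.7981
intercept + slope * 5  # expected price: about 12.21 TWD per square meter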
If the explanatory variable is categorical, the scatter plot that you used before to visualize the data doesn’t make sense. Instead, a good option is to draw a histogram for each category.
The Taiwan real estate dataset has a categorical variable in the form of the age of each house. The ages have been split into 3 groups: 0 to 15 years, 15 to 30 years, and 30 to 45 years.
taiwan_real_estate is available and ggplot2 is loaded.
Using taiwan_real_estate, plot a histogram of price_twd_msq with 10 bins. Facet the plot by house_age_years to give 3 panels.
# Using taiwan_real_estate, plot price_twd_msq
ggplot(taiwan_real_estate, aes(price_twd_msq)) +
# Make it a histogram with 10 bins
geom_histogram(bins = 10) +
# Facet the plot so each house age group gets its own panel
facet_wrap(vars(house_age_years))
A good way to explore categorical variables is to calculate summary statistics such as the mean for each category. Here, you’ll look at grouped means for the house prices in the Taiwan real estate dataset.
taiwan_real_estate is available and dplyr is loaded.
Group taiwan_real_estate by house_age_years. Summarize to calculate the mean price_twd_msq for each group, naming the column mean_by_group. Assign the result to summary_stats and look at the numbers.
summary_stats <- taiwan_real_estate %>%
# Group by house age
group_by(house_age_years) %>%
# Summarize to calculate the mean house price/area
summarise(mean_by_group = mean(price_twd_msq))
# See the result
summary_stats
## # A tibble: 3 x 2
## house_age_years mean_by_group
## <chr> <dbl>
## 1 0 to 15 12.6
## 2 15 to 30 9.88
## 3 30 to 45 11.4
Linear regressions also work with categorical explanatory variables. In this case, the code to run the model is the same, but the coefficients returned by the model are different. Here you’ll run a linear regression on the Taiwan real estate dataset.
taiwan_real_estate is available.
Run a linear regression with price_twd_msq as the response variable, house_age_years as the explanatory variable, and taiwan_real_estate as the dataset. Assign to mdl_price_vs_age.
# Run a linear regression of price_twd_msq vs. house_age_years
mdl_price_vs_age <- lm(
price_twd_msq ~ house_age_years,
data = taiwan_real_estate
)
# See the result
mdl_price_vs_age
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years, data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) house_age_years15 to 30 house_age_years30 to 45
## 12.637 -2.761 -1.244
Update the model formula so that no intercept is included in the model. Assign to mdl_price_vs_age_no_intercept.
# Update the model formula to remove the intercept
mdl_price_vs_age_no_intercept <- lm(
price_twd_msq ~ house_age_years + 0,
data = taiwan_real_estate
)
# See the result
mdl_price_vs_age_no_intercept
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years + 0, data = taiwan_real_estate)
##
## Coefficients:
## house_age_years0 to 15 house_age_years15 to 30 house_age_years30 to 45
## 12.637 9.877 11.393
Perhaps the most useful feature of statistical models like linear regression is that you can make predictions. That is, you specify values for each of the explanatory variables, feed them to the model, and you get a prediction for the corresponding response variable. The code flow is as follows.
Here, you’ll make predictions for the house prices in the Taiwan real estate dataset.
taiwan_real_estate is available. The linear regression model of house price versus number of convenience stores is available as mdl_price_vs_conv (print it and read the call to see how it was made); and dplyr is loaded.
mdl_price_vs_conv <- lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
Create a tibble of explanatory data, where the number of convenience stores, n_convenience, takes the integer values from zero to ten.
# Create a tibble with n_convenience column from zero to ten
explanatory_data <- tibble(n_convenience = 0:10)
Use the model mdl_price_vs_conv to make predictions from explanatory_data.
# From previous steps
explanatory_data <- tibble(
n_convenience = 0:10
)
# Edit this, so predictions are stored in prediction_data
predict(mdl_price_vs_conv, explanatory_data)
## 1 2 3 4 5 6 7 8
## 8.224237 9.022317 9.820397 10.618477 11.416556 12.214636 13.012716 13.810795
## 9 10 11
## 14.608875 15.406955 16.205035
Create a tibble of predictions named prediction_data.
Start with explanatory_data. Add an extra column, price_twd_msq, containing the predictions.
# Edit this, so predictions are stored in prediction_data
prediction_data<- explanatory_data %>%
mutate(
price_twd_msq = predict(mdl_price_vs_conv, explanatory_data)
)
# See the result
prediction_data
## # A tibble: 11 x 2
## n_convenience price_twd_msq
## <int> <dbl>
## 1 0 8.22
## 2 1 9.02
## 3 2 9.82
## 4 3 10.6
## 5 4 11.4
## 6 5 12.2
## 7 6 13.0
## 8 7 13.8
## 9 8 14.6
## 10 9 15.4
## 11 10 16.2
The prediction data you calculated contains a column of explanatory variable values and a column of response variable values. That means you can plot it on the same scatter plot of response versus explanatory data values.
prediction_data is available and ggplot2 is loaded. The code for the scatter plot with linear trend line you drew in Chapter 1 is shown below.
Extend the plotting code to include the point predictions in prediction_data. Color the points yellow.
# Add to the plot
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
# Add a point layer of prediction data, colored yellow
geom_point(data = prediction_data, color = "yellow")
## `geom_smooth()` using formula 'y ~ x'
In the last exercise you made predictions on some sensible, could-happen-in-real-life situations. That is, the cases where the number of nearby convenience stores was between zero and ten. To test the limits of the model’s ability to predict, try some impossible situations.
Use the console to try predicting house prices from mdl_price_vs_conv when there are -1 convenience stores. Do the same for 2.5 convenience stores. What happens in each case?
mdl_price_vs_conv is available and dplyr is loaded.
Create some impossible explanatory data. Define a tibble with one column, n_convenience, set to minus one, assigning to minus_one. Create another with n_convenience set to two point five, assigning to two_pt_five.
# Define a tibble where n_convenience is -1
minus_one <- tibble(n_convenience = -1)
# Define a tibble where n_convenience is 2.5
two_pt_five <- tibble(n_convenience = 2.5)
Try making predictions on your two impossible cases. What happens?
Let's try minus_one:
# First, create a helper function to make this easier
probando_explanatories <- function(explanatory){
# From previous steps
explanatory_data <- explanatory
prediction_data<- explanatory_data %>%
mutate(
price_twd_msq = predict(mdl_price_vs_conv, explanatory_data)
)
# Add to the plot
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
# Add a point layer of prediction data, colored yellow
geom_point(data = prediction_data, color = "yellow")
}
probando_explanatories(minus_one)
## `geom_smooth()` using formula 'y ~ x'
The prediction falls completely outside what makes sense: the number of convenience stores cannot be negative.
Now let's try 2.5:
probando_explanatories(two_pt_five)
## `geom_smooth()` using formula 'y ~ x'
Nor does a prediction based on a fractional value really make sense, since the number of stores is a discrete count.
Answer: The model successfully gives a prediction about cases that are impossible in real life.
The functions that make life easier when working with model objects are the following:
1. coefficients(): given the model object, returns the model coefficients.
2. fitted(): extracts the predictions on the original dataset, returned as a vector; it also takes the model object.
3. residuals(): extracts the model's inaccuracies, i.e. the differences between observed and fitted values.
4. summary(): gives a more general overview of what the model can offer.
4.1. Coefficients:
The first column holds the values we would get by calling coefficients() on the model object. The last column is the p-value, which refers to statistical significance.
What summary() shows is designed for a quick read of the situation; if we want to manipulate the results instead, the next function is the one to use.
5. broom::tidy(): returns the coefficient information from summary(), but as a tibble we can work with.
6. broom::augment(): returns observation-level information (fitted values, residuals, and so on) for each row of the data frame.
The variable returned by lm() that contains the model object has many elements. In order to perform further analysis on the model results, you need to extract the useful bits of it. The model coefficients, the fitted values, and the residuals are perhaps the most important bits of the linear model object.
mdl_price_vs_conv is available.
Print the coefficients of mdl_price_vs_conv.
# Get the model coefficients of mdl_price_vs_conv
coefficients(mdl_price_vs_conv)
## (Intercept) n_convenience
## 8.2242375 0.7980797
Print the fitted values of mdl_price_vs_conv.
# Get the fitted values of mdl_price_vs_conv
fitted(mdl_price_vs_conv)[1:10]
## 1 2 3 4 5 6 7 8
## 16.205035 15.406955 12.214636 12.214636 12.214636 10.618477 13.810795 13.012716
## 9 10
## 9.022317 10.618477
Print the residuals of mdl_price_vs_conv.
# Get the residuals of mdl_price_vs_conv
residuals(mdl_price_vs_conv)[1:10]
## 1 2 3 4 5 6 7
## -4.7375611 -2.6384224 2.0970130 4.3663019 0.8262112 -0.9059199 -1.6171495
## 8 9 10
## 1.1173901 -3.3339662 -3.9316385
Print a summary of mdl_price_vs_conv.
# Print a summary of mdl_price_vs_conv
summary(mdl_price_vs_conv)
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7132 -2.2213 -0.5409 1.8105 26.5299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.22424 0.28500 28.86 <2e-16 ***
## n_convenience 0.79808 0.05653 14.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.384 on 412 degrees of freedom
## Multiple R-squared: 0.326, Adjusted R-squared: 0.3244
## F-statistic: 199.3 on 1 and 412 DF, p-value: < 2.2e-16
You can manually calculate the predictions from the model coefficients. When making predictions in real life, it is better to use predict(), but doing this manually is helpful to reassure yourself that predictions aren’t magic – they are simply arithmetic.
In fact, for a simple linear regression, the predicted value is just the intercept plus the slope times the explanatory variable.
mdl_price_vs_conv and explanatory_data are available, and dplyr is loaded.
Get the coefficients of mdl_price_vs_conv, assigning to coeffs.
# Get the coefficients of mdl_price_vs_conv
coeffs <- coefficients(mdl_price_vs_conv)
Get the intercept, which is the first element of coeffs, assigning to intercept.
# Get the intercept
intercept <- coeffs[1]
Get the slope, which is the second element of coeffs, assigning to slope.
# Get the slope
slope <- coeffs[2]
Manually predict price_twd_msq using the intercept, slope, and n_convenience.
explanatory_data %>%
mutate(
# Manually calculate the predictions
price_twd_msq = intercept + slope * n_convenience
)
## # A tibble: 11 x 2
## n_convenience price_twd_msq
## <int> <dbl>
## 1 0 8.22
## 2 1 9.02
## 3 2 9.82
## 4 3 10.6
## 5 4 11.4
## 6 5 12.2
## 7 6 13.0
## 8 7 13.8
## 9 8 14.6
## 10 9 15.4
## 11 10 16.2
# Compare to the results from predict()
predict(mdl_price_vs_conv, explanatory_data)
## 1 2 3 4 5 6 7 8
## 8.224237 9.022317 9.820397 10.618477 11.416556 12.214636 13.012716 13.810795
## 9 10 11
## 14.608875 15.406955 16.205035
They are identical.
Many programming tasks are easier if you keep all your data inside data frames. This is particularly true if you are a tidyverse fan, where dplyr and ggplot2 require you to use data frames. The broom package contains functions that decompose models into three data frames: one for the coefficient-level elements (the coefficients themselves, as well as p-values for each coefficient), the observation-level elements (like fitted values and residuals), and the model-level elements (mostly performance metrics).
The functions in broom are generic. That is, they work with many model types, not just linear regression model objects. They also work with logistic regression model objects (as you’ll see in Chapter 4), and many other types of model.
mdl_price_vs_conv is available and broom is loaded.
Tidy the model to print the coefficient-level elements of mdl_price_vs_conv.
# Get the coefficient-level elements of the model
broom::tidy(mdl_price_vs_conv)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.22 0.285 28.9 5.81e-101
## 2 n_convenience 0.798 0.0565 14.1 3.41e- 37
Augment the model to print the observation-level elements of mdl_price_vs_conv.
# Get the observation-level elements of the model
broom::augment(mdl_price_vs_conv)
## # A tibble: 414 x 8
## price_twd_msq n_convenience .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11.5 10 16.2 -4.74 0.0121 3.38 1.22e-2 -1.41
## 2 12.8 9 15.4 -2.64 0.00913 3.39 2.83e-3 -0.783
## 3 14.3 5 12.2 2.10 0.00264 3.39 5.10e-4 0.621
## 4 16.6 5 12.2 4.37 0.00264 3.38 2.21e-3 1.29
## 5 13.0 5 12.2 0.826 0.00264 3.39 7.92e-5 0.244
## 6 9.71 3 10.6 -0.906 0.00275 3.39 9.91e-5 -0.268
## 7 12.2 7 13.8 -1.62 0.00477 3.39 5.50e-4 -0.479
## 8 14.1 6 13.0 1.12 0.00343 3.39 1.88e-4 0.331
## 9 5.69 1 9.02 -3.33 0.00509 3.38 2.49e-3 -0.988
## 10 6.69 3 10.6 -3.93 0.00275 3.38 1.87e-3 -1.16
## # … with 404 more rows
Glance at the model to print the model-level elements of mdl_price_vs_conv.
# Get the model-level elements of the model
broom::glance(mdl_price_vs_conv)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.326 0.324 3.38 199. 3.41e-37 1 -1091. 2188. 2200.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
So far we have seen that in this model: response value = fitted value + residual.
That is, what the model gives us is the predictions it has made plus an error term, the residual. Put another way: the part it could predict plus the part it could not predict.
The residual exists for several reasons: on one hand, your model is not perfect, and on the other, there is randomness in the data.
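A minimal sketch of this identity, assuming the mdl_price_vs_conv model and taiwan_real_estate data from earlier: the fitted values plus the residuals reconstruct the observed response exactly.
# response value = fitted value + residual
all.equal(
taiwan_real_estate$price_twd_msq,
unname(fitted(mdl_price_vs_conv) + residuals(mdl_price_vs_conv))
)  # should return TRUE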
Regression to the mean is also an important concept in investing. Here you’ll look at the annual returns from investing in companies in the Standard and Poor 500 index (S&P 500), in 2018 and 2019.
Using sp500_yearly_returns, draw a scatter plot of return_2019 vs. return_2018. Add an “A-B line”, colored “green”, with size 1. Add a smooth trend line made with the linear regression method, and no standard error ribbon. Fix the coordinates so distances along the x and y axes appear the same.
# Using sp500_yearly_returns, plot return_2019 vs. return_2018
ggplot(sp500_yearly_returns, aes(return_2018, return_2019)) +
# Make it a scatter plot
geom_point() +
# Add a line at y = x, colored green, size 1
geom_abline(color = "green", size = 1) +
# Add a linear regression trend line, no std. error ribbon
geom_smooth(method = "lm", se = FALSE) +
# Fix the coordinate ratio
coord_fixed()
## `geom_smooth()` using formula 'y ~ x'
Super scatter plotting! The regression trend line looks very different to the y equals x line. As the financial advisors say, “Past performance is no guarantee of future results.”
Let’s quantify the relationship between returns in 2019 and 2018 by running a linear regression and making predictions. By looking at companies with extremely high or extremely low returns in 2018, we can see if their performance was similar in 2019.
sp500_yearly_returns is available and dplyr is loaded.
Run a linear regression on return_2019 versus return_2018 using sp500_yearly_returns. Assign to mdl_returns.
# Run a linear regression on return_2019 vs. return_2018
# using sp500_yearly_returns
mdl_returns <- lm(return_2019 ~ return_2018, data = sp500_yearly_returns)
# See the result
mdl_returns
##
## Call:
## lm(formula = return_2019 ~ return_2018, data = sp500_yearly_returns)
##
## Coefficients:
## (Intercept) return_2018
## 0.31127 0.04691
Create a data frame (or tibble) named explanatory_data. Give it one column with 2018 returns set to a vector containing -1, 0, and 1. Use mdl_returns to predict with explanatory_data.
# From previous step
mdl_returns <- lm(
return_2019 ~ return_2018,
data = sp500_yearly_returns
)
# Create a data frame with return_2018 at -1, 0, and 1
explanatory_data <- tibble(
return_2018 = c(-1, 0, 1)
)
# Use mdl_returns to predict with explanatory_data
predict(mdl_returns, explanatory_data)
## 1 2 3
## 0.2643603 0.3112714 0.3581826
Incredible investment predictions! Investments that gained a lot in value in 2018 on average gained only a small amount in 2019. Similarly, investments that lost a lot of value in 2018 on average also gained a small amount in 2019.
Theoretical explanation: we created a vector of returns of -1, 0, and 1, meaning the investment lost 100% of its value, ended the year exactly where it started, or gained 100%.
If there is no straight line relationship between the response variable and the explanatory variable, it is sometimes possible to create one by transforming one or both of the variables. Here, you’ll look at transforming the explanatory variable.
You’ll take another look at the Taiwan real estate dataset, this time using the distance to the nearest MRT (metro) station as the explanatory variable. You’ll use code to make every commuter’s dream come true: shortening the distance to the metro station by taking the square root. Take that, geography!
taiwan_real_estate is available and ggplot2 and tibble are loaded.
Run the code provided, and look at the plot. Edit the plot so the x aesthetic is square root transformed. Look at the new plot. Notice how the numbers on the x-axis have changed. This is a different line to what was shown before. Do the points track the line more closely?
Without making a sqrt:
# Run the code to see the plot
# Edit so x-axis is square root of dist_to_mrt_m
ggplot(taiwan_real_estate, aes(dist_to_mrt_m, price_twd_msq)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Run the code to see the plot
# Edit so x-axis is square root of dist_to_mrt_m
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), price_twd_msq)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Run a linear regression of price_twd_msq versus the square root of dist_to_mrt_m using taiwan_real_estate.
# Run a linear regression of price_twd_msq vs.
# square root of dist_to_mrt_m using taiwan_real_estate
mdl_price_vs_dist <- lm(
price_twd_msq ~ sqrt(dist_to_mrt_m),
data = taiwan_real_estate
)
# See the result
mdl_price_vs_dist
##
## Call:
## lm(formula = price_twd_msq ~ sqrt(dist_to_mrt_m), data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) sqrt(dist_to_mrt_m)
## 16.7098 -0.1828
Create a data frame of prediction data named prediction_data. Start with explanatory_data, and add a column named after the response variable. Predict values using mdl_price_vs_dist and explanatory_data.
# From previous step
mdl_price_vs_dist <- lm(
price_twd_msq ~ sqrt(dist_to_mrt_m),
data = taiwan_real_estate
)
# Use this explanatory data
explanatory_data <- tibble(
dist_to_mrt_m = seq(0, 80, 10) ^ 2
)
# Use mdl_price_vs_dist to predict explanatory_data
prediction_data <- explanatory_data %>%
mutate(
price_twd_msq = predict(mdl_price_vs_dist, explanatory_data)
)
# See the result
prediction_data
## # A tibble: 9 x 2
## dist_to_mrt_m price_twd_msq
## <dbl> <dbl>
## 1 0 16.7
## 2 100 14.9
## 3 400 13.1
## 4 900 11.2
## 5 1600 9.40
## 6 2500 7.57
## 7 3600 5.74
## 8 4900 3.91
## 9 6400 2.08
Edit the plot to add a layer of points from prediction_data, colored “green”, size 5.
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), price_twd_msq)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
# Add points from prediction_data, colored green, size 5
geom_point(data = prediction_data, color = "green", size = 5)
## `geom_smooth()` using formula 'y ~ x'
The response variable can be transformed too, but this means you need an extra step at the end to undo that transformation. That is, you “back transform” the predictions.
In the video, you saw the first step of the digital advertising workflow: spending money to buy ads, and counting how many people see them (the “impressions”). The next step is determining how many people click on the advert after seeing it.
ad_conversion <- read.table("/Users/kaietiglesiasbaraibar/Desktop/Statistics-Fundamentals-with-R/DATABASE/ad_conversion.csv", sep = ";", header = T) %>%
select(-"X") %>%
select(c("spent_usd" , "n_impressions" ,"n_clicks" ))
ad_conversion is available and ggplot2 and tibble are loaded.
Run the code provided, and look at the plot.
# Run the code to see the plot
# Edit to raise x, y aesthetics to power 0.25
ggplot(ad_conversion, aes(n_impressions, n_clicks)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Edit the plot so the x and y aesthetics are transformed by raising them to the power 0.25. Look at the new plot. Do the points track the line more closely?
# Run the code to see the plot
# Edit to raise x, y aesthetics to power 0.25
ggplot(ad_conversion, aes(n_impressions^0.25, n_clicks^0.25)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Yes, the points track the line more closely after the transformation.
Run a linear regression of n_clicks to the power 0.25 versus n_impressions to the power 0.25 using ad_conversion. Each variable in the formula needs to be specified “as is”, using I().
# Run a linear regression of n_clicks to the power 0.25 vs.
# n_impressions to the power 0.25 using ad_conversion
mdl_click_vs_impression <- lm(I(n_clicks^0.25) ~ I(n_impressions^0.25), data = ad_conversion)
Complete the code for the prediction data. Use mdl_click_vs_impression to predict n_clicks to the power 0.25 from explanatory_data. Back transform by raising n_clicks_025 to the power 4 to get n_clicks.
# From previous step
mdl_click_vs_impression <- lm(
I(n_clicks ^ 0.25) ~ I(n_impressions ^ 0.25),
data = ad_conversion
)
# Use this explanatory data
explanatory_data <- tibble(
n_impressions = seq(0, 3e6, 5e5)
)
prediction_data <- explanatory_data %>%
mutate(
# Use mdl_click_vs_impression to predict n_clicks ^ 0.25
n_clicks_025 = predict(mdl_click_vs_impression, explanatory_data),
# Back transform to get n_clicks
n_clicks = n_clicks_025 ^ 4
)
Edit the plot to add a layer of points from prediction_data, colored “green”.
# From previous steps
mdl_click_vs_impression <- lm(
I(n_clicks ^ 0.25) ~ I(n_impressions ^ 0.25),
data = ad_conversion
)
explanatory_data <- tibble(
n_impressions = seq(0, 3e6, 5e5)
)
prediction_data <- explanatory_data %>%
mutate(
n_clicks_025 = predict(mdl_click_vs_impression, explanatory_data),
n_clicks = n_clicks_025 ^ 4
)
ggplot(ad_conversion, aes(n_impressions ^ 0.25, n_clicks ^ 0.25)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
# Add points from prediction_data, colored green
geom_point(data = prediction_data, color = "green")
## `geom_smooth()` using formula 'y ~ x'
Two-way transform! When the response variable is transformed you need to back transform the predictions.
How well the model's predictions really fit the data is measured with the coefficient of determination, or r-squared.
Functions:
1. broom::glance(): returns many metrics; used together with dplyr::pull() you can extract the value you need from the tibble.
2. cor(explanatory variable, response variable) ^ 2.
Residual standard error (RSE): a measure of how far off the predictions typically are.
1. summary() also reports the RSE.
2. We can use broom::glance() and pull out "sigma", which is the RSE.
3. Compute it manually: mutate(residuals_sq = residuals(model) ^ 2) %>% summarize(resid_sum_of_sq = sum(residuals_sq), deg_freedom = n() - 2, rse = sqrt(resid_sum_of_sq / deg_freedom)). Here deg_freedom is the number of observations minus the number of model coefficients.
Example result: 74. Interpretation: the difference between predicted bream masses and observed bream masses is typically about 74 g.
RMSE (root mean squared error) does almost the same job as the residual standard error, but it is worse for assessing how well models are performing; the only change is that you divide by n() instead of the degrees of freedom:
mutate(residuals_sq = residuals(model) ^ 2) %>% summarize(resid_sum_of_sq = sum(residuals_sq), n_obs = n(), rmse = sqrt(resid_sum_of_sq / n_obs))
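A worked sketch of these metrics, assuming the mdl_price_vs_conv model and taiwan_real_estate data from earlier, and that broom and dplyr are loaded:
# r-squared two ways
mdl_price_vs_conv %>% glance() %>% pull(r.squared)
cor(taiwan_real_estate$n_convenience, taiwan_real_estate$price_twd_msq) ^ 2
# RSE and RMSE from the residuals
taiwan_real_estate %>%
mutate(residuals_sq = residuals(mdl_price_vs_conv) ^ 2) %>%
summarize(
resid_sum_of_sq = sum(residuals_sq),
deg_freedom = n() - 2,  # observations minus number of model coefficients
rse = sqrt(resid_sum_of_sq / deg_freedom),  # residual standard error
rmse = sqrt(resid_sum_of_sq / n())  # RMSE divides by n instead
)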
The coefficient of determination is a measure of how well the linear regression line fits the observed values. For simple linear regression, it is equal to the square of the correlation between the explanatory and response variables.
Here, you’ll take another look at the second stage of the advertising pipeline: modeling the click response to impressions. Two models are available: mdl_click_vs_impression_orig models n_clicks versus n_impressions. mdl_click_vs_impression_trans is the transformed model you saw in Chapter 2. It models n_clicks ^ 0.25 versus n_impressions ^ 0.25.
mdl_click_vs_impression_orig <- lm(formula = n_clicks ~ n_impressions, data = ad_conversion)
mdl_click_vs_impression_trans <- lm(formula = I(n_clicks^0.25) ~ I(n_impressions^0.25), data = ad_conversion)
Print a summary of mdl_click_vs_impression_orig. Do the same for mdl_click_vs_impression_trans.
# Print a summary of mdl_click_vs_impression_orig
summary(mdl_click_vs_impression_orig)
##
## Call:
## lm(formula = n_clicks ~ n_impressions, data = ad_conversion)
##
## Residuals:
## Min 1Q Median 3Q Max
## -186.099 -5.392 -1.422 2.070 119.876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.683e+00 7.888e-01 2.133 0.0331 *
## n_impressions 1.718e-04 1.960e-06 87.654 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.91 on 934 degrees of freedom
## Multiple R-squared: 0.8916, Adjusted R-squared: 0.8915
## F-statistic: 7683 on 1 and 934 DF, p-value: < 2.2e-16
# Print a summary of mdl_click_vs_impression_trans
summary(mdl_click_vs_impression_trans)
##
## Call:
## lm(formula = I(n_clicks^0.25) ~ I(n_impressions^0.25), data = ad_conversion)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57061 -0.13229 0.00582 0.14494 0.46888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0717479 0.0172019 4.171 3.32e-05 ***
## I(n_impressions^0.25) 0.1115330 0.0008844 126.108 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1969 on 934 degrees of freedom
## Multiple R-squared: 0.9445, Adjusted R-squared: 0.9445
## F-statistic: 1.59e+04 on 1 and 934 DF, p-value: < 2.2e-16
Get the coefficient of determination for mdl_click_vs_impression_orig by glancing at the model, then pulling the r.squared value.
library(broom)
# Get coeff of determination for mdl_click_vs_impression_orig
mdl_click_vs_impression_orig %>%
# Get the model-level details
glance() %>%
# Pull out r.squared
pull(r.squared)
## [1] 0.8916135
Do the same for mdl_click_vs_impression_trans.
# Do the same for the transformed model
mdl_click_vs_impression_trans %>%
# Get the model-level details
glance() %>%
# Pull out r.squared
pull(r.squared)
## [1] 0.9445273
mdl_click_vs_impression_orig has a coefficient of determination of 0.89. Which statement about the model is true?
TRUE ANSWER = The number of impressions explains 89% of the variability in the number of clicks.
Which model does the coefficient of determination suggest gives a better fit?
The transformed model, mdl_click_vs_impression_trans.
Cool coefficient of determination calculating! The transformed model has a higher coefficient of determination than the original model, suggesting that it gives a better fit to the data.
Residual standard error (RSE) is a measure of the typical size of the residuals. Equivalently, it’s a measure of how badly wrong you can expect predictions to be. Smaller numbers are better, with zero being a perfect fit to the data.
Again, you’ll look at the models from the advertising pipeline, mdl_click_vs_impression_orig and mdl_click_vs_impression_trans. broom is loaded.
Get the residual standard error for mdl_click_vs_impression_orig by glancing at the model, then pulling the “sigma” value.
# Get RSE for mdl_click_vs_impression_orig
mdl_click_vs_impression_orig %>%
# Get the model-level details
glance() %>%
# Pull out sigma
pull(sigma)
## [1] 19.90584
Do the same for mdl_click_vs_impression_trans.
# Do the same for the transformed model
mdl_click_vs_impression_trans %>%
# Get the model-level details
glance() %>%
# Pull out sigma
pull(sigma)
## [1] 0.1969064
mdl_click_vs_impression_orig has an RSE of 20. Which statement is true?
The typical difference between observed number of clicks and predicted number of clicks is 20.
Which model does the RSE suggest gives more accurate predictions?
The transformed model, mdl_click_vs_impression_trans.
Rapid RSE wrangling! RSE is a measure of accuracy for regression models. It even works on other statistical model types like regression trees, so you can compare accuracy across different classes of models.
Residuals vs. fitted values: if the predictions are good, when you plot the residuals against the fitted values the trend line should sit on top of 0.
Q-Q plot: shows whether the residuals follow a normal distribution. The x-axis holds the quantiles of the normal distribution; the y-axis holds the standardized residuals.
Interpretation: if the plotted points lie on a straight line, the distribution is normal; if they don't, it isn't.
Scale-location plot: shows the square root of the standardized residuals on the y-axis and the fitted values on the x-axis.
This plot shows whether the size of the residuals grows or shrinks across the fitted values; it is important to draw a trend line to see how large this change is for each of the cases being investigated, so they can be compared.
Drawing them:
library(ggplot2)
library(ggfortify)
autoplot(model_object, which = ...)
which: 1 = residuals vs. fitted values, 2 = Q-Q plot, 3 = scale-location.
They can all be drawn at once like this:
autoplot(model_object, which = 1:3, nrow = 3, ncol = 1)
It’s time for you to draw these diagnostic plots yourself. Let’s go back to the Taiwan real estate dataset and the model of house prices versus number of convenience stores.
Recall that autoplot() lets you specify which diagnostic plots you are interested in.
1. residuals vs. fitted values
2. Q-Q plot
3. scale-location
mdl_price_vs_conv is available, and ggplot2 and ggfortify are loaded.
Plot the three diagnostic plots (numbered 1 to 3) for mdl_price_vs_conv. Use a layout of three rows and one column.
library(ggfortify)
# Plot the three diagnostics for mdl_price_vs_conv
autoplot(mdl_price_vs_conv,
which = 1:3,
nrow = 3,
ncol = 1)
When drawing a scatter plot with geom_point(), how can we tell a priori which points are outliers?
Extreme explanatory variables
On a geom_point() plot this can be done very simply: by eye.
Leverage is a measure of how extreme the explanatory variable values are.
Leverage measures how unusual or extreme the explanatory variables are for each observation. Very roughly, a high leverage means that the explanatory variable has values that are different to other points in the dataset. In the case of simple linear regression, where there is only one explanatory value, this typically means values with a very high or very low explanatory value.
Influence measures how much a model would change if each observation was left out of the model calculations, one at a time. That is, it measures how different the prediction line would look if you ran a linear regression on all data points except that point, compared to running a linear regression on the whole dataset.
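A minimal sketch of how to pull these quantities out of a model, assuming the mdl_price_vs_dist model from earlier; base R exposes the same values that broom::augment() reports as .hat and .cooksd:
# leverage of each observation
leverage <- hatvalues(mdl_price_vs_dist)
# Cook's distance, a measure of influence
influence <- cooks.distance(mdl_price_vs_dist)
# most extreme explanatory values, then most influential observations
head(sort(leverage, decreasing = TRUE))
head(sort(influence, decreasing = TRUE))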
In the last few exercises you explored which observations had the highest leverage and influence. Now you’ll extract those values from an augmented version of the model, and visualize them.
mdl_price_vs_dist is available. dplyr, ggplot2 and ggfortify are loaded.
Augment mdl_price_vs_dist, then arrange observations by descending leverage (.hat), and get the head of the results.
mdl_price_vs_dist %>%
# Augment the model
augment() %>%
# Arrange rows by descending leverage
arrange(desc(.hat)) %>%
# Get the head of the dataset
head()
## # A tibble: 6 x 7
## price_twd_msq `sqrt(dist_to_mrt_m)` .fitted .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.39 80.5 1.98 0.0267 2.82 0.00351 0.506
## 2 3.69 80.0 2.09 0.0261 2.82 0.00447 0.577
## 3 4.54 79.4 2.19 0.0256 2.82 0.00937 0.844
## 4 5.69 74.2 3.13 0.0211 2.82 0.00906 0.916
## 5 5.26 74.2 3.13 0.0211 2.82 0.00630 0.764
## 6 4.05 67.9 4.30 0.0163 2.82 0.0000644 -0.0882
Augment mdl_price_vs_dist, then arrange observations by descending influence (.cooksd), and get the head of the results.
mdl_price_vs_dist %>%
# Augment the model
augment() %>%
# Arrange rows by descending Cook's distance
arrange(desc(.cooksd)) %>%
# Get the head of the dataset
head()
## # A tibble: 6 x 7
## price_twd_msq `sqrt(dist_to_mrt_m)` .fitted .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 35.6 15.9 13.8 0.00385 2.61 0.116 7.73
## 2 13.6 61.5 5.47 0.0121 2.79 0.0524 2.92
## 3 14.1 56.3 6.41 0.00933 2.80 0.0354 2.74
## 4 23.7 13.7 14.2 0.00440 2.78 0.0251 3.37
## 5 2.30 19.8 13.1 0.00310 2.77 0.0228 -3.83
## 6 23.6 17.8 13.4 0.00344 2.78 0.0225 3.61
Plot the three outlier diagnostic plots (numbered 4 to 6) for mdl_price_vs_dist. Use a layout of three rows and one column.
# Plot the three outlier diagnostics for mdl_price_vs_dist
autoplot(mdl_price_vs_dist, which = 4:6, nrow = 3, ncol = 1)
In logistic regression, what the model measures is the probability that the outcome occurs.
It is used when the response variable is logical, and the fitted responses typically follow an S-shaped logistic curve.
Linear form:
It can be fitted with glm(), setting the family argument to gaussian to indicate that the errors follow a normal distribution.
Logistic form:
The same, but we set family to "binomial" to specify that the residuals follow a binomial distribution (see the sketch below).
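A minimal sketch of both forms, assuming the churn data used in the exercises below (mdl_gaussian and mdl_binomial are hypothetical names):
# linear form: glm() with family = gaussian (equivalent to lm())
mdl_gaussian <- glm(has_churned ~ time_since_first_purchase, data = churn, family = gaussian)
# logistic form: glm() with family = binomial fits the S-shaped logistic curve
mdl_binomial <- glm(has_churned ~ time_since_first_purchase, data = churn, family = binomial)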
When the response variable is logical, all the points lie on the y equals zero and y equals one lines, making it difficult to see what is happening. In the video, until you saw the trend line, it wasn’t clear how the explanatory variable was distributed on each line. This can be solved with a histogram of the explanatory variable, faceted on the response.
You will use these histograms to get to know the financial services churn dataset seen in the video.
churn is available and ggplot2 is loaded.
Using churn, plot time_since_last_purchase as a histogram with binwidth 0.25 faceted in a grid with has_churned on each row.
# Using churn, plot time_since_last_purchase
ggplot(churn, aes(time_since_last_purchase)) +
# as a histogram with binwidth 0.25
geom_histogram(binwidth = 0.25) +
# faceted in a grid with has_churned on each row
facet_grid(rows = vars(has_churned))
Redraw the plot with time_since_first_purchase. That is, using churn, plot time_since_first_purchase as a histogram with binwidth 0.25 faceted in a grid with has_churned on each row.
# Redraw the plot with time_since_first_purchase
ggplot(churn, aes(time_since_first_purchase)) +
geom_histogram(binwidth = 0.25) +
facet_grid(rows = vars(has_churned))
As with linear regressions, ggplot2 will draw model predictions for a logistic regression without you having to worry about the modeling code yourself. To see how the predictions differ for linear and logistic regressions, try drawing both trend lines side by side. Spoiler: you should see a linear (straight line) trend from the linear model, and a logistic (S-shaped) trend from the logistic model.
churn is available and ggplot2 is loaded.
Using churn plot has_churned vs. time_since_first_purchase as a scatter plot, adding a red linear regression trend line (without a standard error ribbon).
# Using churn plot has_churned vs. time_since_first_purchase
ggplot(churn, aes(time_since_first_purchase, has_churned)) +
# Make it a scatter plot
geom_point() +
# Add an lm trend line, no std error ribbon, colored red
geom_smooth(method = "lm",
se = FALSE,
color = "red")
## `geom_smooth()` using formula 'y ~ x'
Update the plot by adding a second trend line from logistic regression. (No standard error ribbon again).
ggplot(churn, aes(time_since_first_purchase, has_churned)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
# Add a glm trend line, no std error ribbon, binomial family
geom_smooth(method = "glm",
se = F,
method.args = list (family = "binomial"))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Linear regression and logistic regression are special cases of a broader type of models called generalized linear models (“GLMs”). A linear regression makes the assumption that the residuals follow a Gaussian (normal) distribution. By contrast, a logistic regression assumes that residuals follow a binomial distribution.
Here, you’ll model how the length of relationship with a customer affects churn.
churn is available.
Fit a logistic regression of has_churned versus time_since_first_purchase using the churn dataset. Assign to mdl_churn_vs_relationship.
# Fit a logistic regression of churn vs.
# length of relationship using the churn dataset
mdl_churn_vs_relationship <- glm(has_churned ~ time_since_first_purchase, data = churn, family = binomial)
# See the result
mdl_churn_vs_relationship
##
## Call: glm(formula = has_churned ~ time_since_first_purchase, family = binomial,
## data = churn)
##
## Coefficients:
## (Intercept) time_since_first_purchase
## -0.01518 -0.35479
##
## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual
## Null Deviance: 554.5
## Residual Deviance: 543.7 AIC: 547.7
Making predictions with extreme values
Create a tibble with the explanatory variable; that is, let's first see what our explanatory variable looks like.
We draw the histogram:
# Using churn, plot time_since_last_purchase
ggplot(churn, aes(time_since_last_purchase)) +
# as a histogram with binwidth 0.25
geom_histogram(binwidth = 0.25) +
# faceted in a grid with has_churned on each row
facet_grid(rows = vars(has_churned))
We find the maximum value (about 6):
max(churn$time_since_last_purchase) %>%
round(.,1)
## [1] 5.9
While we're at it, we also compute the minimum.
min(churn$time_since_last_purchase) %>%
round(.,1)
## [1] -0.9
We create the tibble:
tibble(time_since_last_purchase = seq(-1, # <- the minimum
6, # <- the maximum we just computed
0.25 # the step, matching the bin width
))
## # A tibble: 29 x 1
## time_since_last_purchase
## <dbl>
## 1 -1
## 2 -0.75
## 3 -0.5
## 4 -0.25
## 5 0
## 6 0.25
## 7 0.5
## 8 0.75
## 9 1
## 10 1.25
## # … with 19 more rows
And we make the prediction, setting type = "response" to obtain the predicted probabilities:
# create the model
mdl_recency <- glm( has_churned ~ time_since_last_purchase, data = churn, family = "binomial")
mdl_recency
##
## Call: glm(formula = has_churned ~ time_since_last_purchase, family = "binomial",
## data = churn)
##
## Coefficients:
## (Intercept) time_since_last_purchase
## -0.03502 0.26921
##
## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual
## Null Deviance: 554.5
## Residual Deviance: 546.4 AIC: 550.4
# create the tibble
explanatory_data <- tibble( time_since_last_purchase = seq(-1, 6, 0.25))
# make the prediction
prediction_data <- explanatory_data %>%
mutate( has_churned = predict(mdl_recency, explanatory_data, type = "response") )
# plot it, starting with the raw glm outcome
plot_churn_vs_recency_base <- ggplot(churn, aes(churn$time_since_last_purchase, churn$has_churned)) +
geom_point()
plot_churn_vs_recency_base
## Warning: Use of `churn$time_since_last_purchase` is discouraged. Use
## `time_since_last_purchase` instead.
## Warning: Use of `churn$has_churned` is discouraged. Use `has_churned` instead.
Now, this is how our prediction looks:
plot_churn_vs_recency_base <- plot_churn_vs_recency_base +
geom_point(
data = prediction_data, aes(prediction_data$time_since_last_purchase, prediction_data$has_churned),
color = "blue"
)
plot_churn_vs_recency_base
## Warning: Use of `churn$time_since_last_purchase` is discouraged. Use
## `time_since_last_purchase` instead.
## Warning: Use of `churn$has_churned` is discouraged. Use `has_churned` instead.
## Warning: Use of `prediction_data$time_since_last_purchase` is discouraged. Use
## `time_since_last_purchase` instead.
## Warning: Use of `prediction_data$has_churned` is discouraged. Use `has_churned`
## instead.
In the same way, we can get the model to return its most likely response for each value.
If the probability of churn is below 0.5, the customer most likely will not leave; if it is above 0.5, they most likely will.
# make the prediction
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(mdl_recency, explanatory_data, type = "response"),
most_likely_outcome = round(has_churned)
)
# finally, add the most likely outcomes to the plot
plot_churn_vs_recency_base +
geom_point(
aes(time_since_last_purchase, most_likely_outcome),
data = prediction_data,
color = "green"
)
## Warning: Use of `churn$time_since_last_purchase` is discouraged. Use
## `time_since_last_purchase` instead.
## Warning: Use of `churn$has_churned` is discouraged. Use `has_churned` instead.
As we can see, the most active customers do not churn, but the inactive ones do.
All of this can be wrapped up in a function:
source('~/Desktop/Statistics-Fundamentals-with-R/ANALYSIS/formula_glm_plot.R')
# extreme_prediction_plot(data = churn, columna_response = churn$has_churned, columna_explanatory = churn$time_since_last_purchase)
There is another way of talking about binary outcomes, used above all in gambling: the odds ratio, that is, the probability of something happening divided by the probability of it not happening:
odds ratio = probability / (1 - probability)
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(mdl_recency, explanatory_data, type = "response"),
most_likely_outcome = round(has_churned),
odds_ratio = has_churned / (1- has_churned)
)
ggplot(prediction_data,
aes(prediction_data$time_since_last_purchase, prediction_data$odds_ratio)) +
geom_line() +
geom_hline(yintercept = 1, linetype = "dotted")
## Warning: Use of `prediction_data$time_since_last_purchase` is discouraged. Use
## `time_since_last_purchase` instead.
## Warning: Use of `prediction_data$odds_ratio` is discouraged. Use `odds_ratio`
## instead.
In this plot, using the calculations above, we can see the relationship between time since last purchase and the odds ratio. At 1 on the y-axis, the probability of churning and of not churning is the same.
Below 1 on the y-axis, the odds of churning are smaller than the odds of not churning; that is, the customer stays.
Above 1, and especially above 4, for example, churning is more than 4 times as likely as not churning.
As far as the logistic predictions are concerned, we can also restate the calculations on the log scale:
prediction_data <- explanatory_data %>%
mutate(
# initial calculations on the PROBABILITY scale:
has_churned = predict(mdl_recency, explanatory_data, type = "response"),
most_likely_outcome = round(has_churned),
odds_ratio = has_churned / (1- has_churned),
# calculations on the LOG-ODDS scale:
log_odds_ratio = log(odds_ratio),
# in fact predict() returns results on the log-odds scale unless we specify type = "response"
log_odds_ratio2 = predict(mdl_recency, explanatory_data)
)
ggplot(prediction_data,
aes(prediction_data$time_since_last_purchase, prediction_data$log_odds_ratio)) +
geom_line() +
geom_hline(yintercept = 1, linetype = "dotted")
## Warning: Use of `prediction_data$time_since_last_purchase` is discouraged. Use
## `time_since_last_purchase` instead.
## Warning: Use of `prediction_data$log_odds_ratio` is discouraged. Use
## `log_odds_ratio` instead.
The interpretation of this plot is still pending here.
There are four main ways of expressing the prediction from a logistic regression model – we’ll look at each of them over the next four exercises. Firstly, since the response variable is either “yes” or “no”, you can make a prediction of the probability of a “yes”. Here, you’ll calculate and visualize these probabilities.
Three variables are available:
mdl_churn_vs_relationship is the logistic regression model of has_churned versus time_since_first_purchase. explanatory_data is a data frame of explanatory values. plt_churn_vs_relationship is a scatter plot of has_churned versus time_since_first_purchase with a smooth glm line. dplyr is loaded.
Use the model, mdl_churn_vs_relationship, and the explanatory data, explanatory_data, to predict the probability of churning. Assign the predictions to the has_churned column of a data frame, prediction_data. Remember to set the prediction type.
# Make a data frame of predicted probabilities
# create the tibble
min_explanatory_data <- min(churn$time_since_first_purchase) %>%
round()
max_explanatory_data <- max(churn$time_since_first_purchase) %>%
round()
explanatory_data <- tibble( time_since_first_purchase = seq(min_explanatory_data, max_explanatory_data,
0.25))
# create the model
mdl_churn_vs_relationship <- glm(formula = has_churned ~ time_since_first_purchase, family = binomial,
data = churn)
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(mdl_churn_vs_relationship, explanatory_data, type = "response")
)
# See the result
prediction_data
## # A tibble: 21 x 2
## time_since_first_purchase has_churned
## <dbl> <dbl>
## 1 -1 0.584
## 2 -0.75 0.562
## 3 -0.5 0.540
## 4 -0.25 0.518
## 5 0 0.496
## 6 0.25 0.474
## 7 0.5 0.452
## 8 0.75 0.430
## 9 1 0.409
## 10 1.25 0.387
## # … with 11 more rows
Update the plt_churn_vs_relationship plot to add points from prediction_data, colored yellow.
ggplot(churn,
aes(time_since_first_purchase, has_churned)
) +
geom_point() +
# add the logistic regression trend line
geom_smooth(method = "glm",
se = F,
method.args = list (family = "binomial")) +
# now add the requested layer
# Add points from prediction_data, colored yellow, size 2
geom_point(data = prediction_data, color = "yellow", size = 2)
## `geom_smooth()` using formula 'y ~ x'
When explaining your results to a non-technical audience, you may wish to side-step talking about probabilities and simply explain the most likely outcome. That is, rather than saying there is a 60% chance of a customer churning, you say that the most likely outcome is that the customer will churn. The tradeoff here is easier interpretation at the cost of nuance.
mdl_churn_vs_relationship, explanatory_data, and plt_churn_vs_relationship are available and dplyr is loaded.
Update prediction_data to add a column of the most likely churn outcome, most_likely_outcome.
# Update the data frame
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(mdl_churn_vs_relationship, explanatory_data, type = "response"),
# Add the most likely churn outcome
most_likely_outcome = round(has_churned)
)
# See the result
prediction_data
## # A tibble: 21 x 3
## time_since_first_purchase has_churned most_likely_outcome
## <dbl> <dbl> <dbl>
## 1 -1 0.584 1
## 2 -0.75 0.562 1
## 3 -0.5 0.540 1
## 4 -0.25 0.518 1
## 5 0 0.496 0
## 6 0.25 0.474 0
## 7 0.5 0.452 0
## 8 0.75 0.430 0
## 9 1 0.409 0
## 10 1.25 0.387 0
## # … with 11 more rows
Update plt_churn_vs_relationship, adding yellow points of size 2 with most_likely_outcome as the y aesthetic, using prediction_data.
ggplot(churn,
aes(time_since_first_purchase, has_churned)
) +
geom_point() +
# add the logistic regression trend line
geom_smooth(method = "glm",
se = F,
method.args = list (family = "binomial")) +
# Add most likely outcome points from prediction_data,
# colored yellow, size 2
geom_point(
data = prediction_data,
aes(y = most_likely_outcome), color = "yellow", size = 2
)
## `geom_smooth()` using formula 'y ~ x'
To make the explanation of the prediction simpler for a non-technical audience, we would tell them that if the time since the first purchase is below 0, the customer will most likely leave, but if it is larger, the customer will most likely stay.
Odds ratios compare the probability of something happening with the probability of it not happening. This is sometimes easier to reason about than probabilities, particularly when you want to make decisions about choices. For example, if a customer has a 20% chance of churning, it may be more intuitive to say “the chance of them not churning is four times higher than the chance of them churning”.
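As a quick check of the arithmetic behind that statement (the 20% figure is just the example above):
p_churn <- 0.20
p_churn / (1 - p_churn)  # odds of churning: 0.25
(1 - p_churn) / p_churn  # odds of not churning: 4, i.e. four times higher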
mdl_churn_vs_relationship, explanatory_data, and plt_churn_vs_relationship are available and dplyr is loaded.
Update prediction_data to add a column, odds_ratio, of the odds ratios.
# Update the data frame
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(
mdl_churn_vs_relationship, explanatory_data,
type = "response"
),
# Add the odds ratio
odds_ratio = has_churned / (1-has_churned)
)
# See the result
prediction_data
## # A tibble: 21 x 3
## time_since_first_purchase has_churned odds_ratio
## <dbl> <dbl> <dbl>
## 1 -1 0.584 1.40
## 2 -0.75 0.562 1.29
## 3 -0.5 0.540 1.18
## 4 -0.25 0.518 1.08
## 5 0 0.496 0.985
## 6 0.25 0.474 0.901
## 7 0.5 0.452 0.825
## 8 0.75 0.430 0.755
## 9 1 0.409 0.691
## 10 1.25 0.387 0.632
## # … with 11 more rows
Using prediction_data, draw a line plot of odds_ratio versus time_since_first_purchase. Add a dotted horizontal line at odds_ratio equal to 1.
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(mdl_churn_vs_relationship, explanatory_data, type = "response"),
odds_ratio = has_churned / (1 - has_churned)
)
# Using prediction_data, plot odds_ratio vs. time_since_first_purchase
ggplot(prediction_data, aes(time_since_first_purchase, odds_ratio)) +
# Make it a line plot
geom_line() +
# Add a dotted horizontal line at y = 1
geom_hline(yintercept = 1, linetype = "dotted")
This gives us the same information as the previous plot but in a simpler way: if the customer has essentially no purchase history with us, the odds strongly favor them leaving; they were never really our customer in the first place.
However, the more time that has passed since the first purchase, the lower the probability that they leave.
One downside to probabilities and odds ratios for logistic regression predictions is that the prediction lines for each are curved. This makes it harder to reason about what happens to the prediction when you make a change to the explanatory variable. The logarithm of the odds ratio (the “log odds ratio”) does have a linear relationship between predicted response and explanatory variable. That means that as the explanatory variable changes, you don’t see dramatic changes in the response metric - only linear changes.
Since the actual values of log odds ratio are less intuitive than (linear) odds ratio, for visualization purposes it’s usually better to plot the odds ratio and apply a log transformation to the y-axis scale.
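A minimal sketch, assuming the mdl_churn_vs_relationship model from earlier: because the log odds ratio is linear in the explanatory variable, it can be computed straight from the model coefficients.
coeffs <- coefficients(mdl_churn_vs_relationship)
# log odds ratio at time_since_first_purchase = 1
coeffs[1] + coeffs[2] * 1  # about -0.370, matching the predict() results below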
mdl_churn_vs_relationship, explanatory_data, and plt_churn_vs_relationship are available and dplyr is loaded.
Update prediction_data to add the log odds ratio calculated two ways. Calculate it from the odds_ratio, then directly using predict().
# Update the data frame
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(mdl_churn_vs_relationship, explanatory_data, type = "response"),
odds_ratio = has_churned / (1 - has_churned),
# Add the log odds ratio from odds_ratio
log_odds_ratio = log(odds_ratio),
# Add the log odds ratio using predict()
log_odds_ratio2 = predict(mdl_churn_vs_relationship, explanatory_data)
)
# See the result
prediction_data
## # A tibble: 21 x 5
## time_since_first_purch… has_churned odds_ratio log_odds_ratio log_odds_ratio2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1 0.584 1.40 0.340 0.340
## 2 -0.75 0.562 1.29 0.251 0.251
## 3 -0.5 0.540 1.18 0.162 0.162
## 4 -0.25 0.518 1.08 0.0735 0.0735
## 5 0 0.496 0.985 -0.0152 -0.0152
## 6 0.25 0.474 0.901 -0.104 -0.104
## 7 0.5 0.452 0.825 -0.193 -0.193
## 8 0.75 0.430 0.755 -0.281 -0.281
## 9 1 0.409 0.691 -0.370 -0.370
## 10 1.25 0.387 0.632 -0.459 -0.459
## # … with 11 more rows
Update the plot to use a logarithmic y-scale.
# Update the plot
ggplot(prediction_data, aes(time_since_first_purchase, odds_ratio)) +
geom_line() +
geom_hline(yintercept = 1, linetype = "dotted") +
# Use a logarithmic y-scale
scale_y_log10()
Here the result is the same as in the previous chart, but instead of the curved odds-ratio line, which is hard to interpret, the logarithmic y-scale lets us read it as a straight line.
Now let's look at how well the logistic model performs; for that we use the confusion matrix.
First of all we fit the model:
mdl_recency <- glm(formula = has_churned ~ time_since_last_purchase, family = "binomial",
data = churn)
Then we set aside the actual responses from the original data:
actual_response <- churn$has_churned
Next we extract the predicted responses:
predicted_response <- round(fitted(mdl_recency))
Finally we use table() to count the outcomes and obtain the confusion matrix:
outcomes <- table(
predicted_response,
actual_response
)
outcomes
## actual_response
## predicted_response 0 1
## 0 141 111
## 1 59 89
Interpretation: we correctly predicted that 141 customers stayed with the company and that 89 churned.
Let's look at the result as a plot:
library(ggplot2)
library(yardstick)
# convert the table into a confusion matrix object
consfusion <- conf_mat(outcomes)
# see the plot
autoplot(consfusion)
It is the same matrix as the numeric one above.
If we want to see the performance statistics:
summary(consfusion, event_level = "second")
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.575
## 2 kap binary 0.15
## 3 sens binary 0.445
## 4 spec binary 0.705
## 5 ppv binary 0.601
## 6 npv binary 0.560
## 7 mcc binary 0.155
## 8 j_index binary 0.150
## 9 bal_accuracy binary 0.575
## 10 detection_prevalence binary 0.37
## 11 precision binary 0.601
## 12 recall binary 0.445
## 13 f_meas binary 0.511
summary(consfusion) %>%
slice(1)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.575
summary(consfusion) %>%
slice(3)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sens binary 0.705
The higher the sensitivity, the better.
summary(consfusion) %>%
slice(4)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 spec binary 0.445
Specificity: when the response is false and the model indeed predicts false.
The higher the specificity, the better.
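A minimal sketch computing the headline metrics by hand from the counts in the outcomes table above, treating churned (the second factor level) as the event:
TN <- 141; FN <- 111  # predicted not churned
FP <- 59;  TP <- 89   # predicted churned
(TP + TN) / (TP + TN + FP + FN)  # accuracy: 0.575
TP / (TP + FN)                   # sensitivity: 0.445
TN / (TN + FP)                   # specificity: 0.705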
A confusion matrix (occasionally called a confusion table) is the basis of all performance metrics for models with a categorical response (such as a logistic regression). It contains the counts of each actual response-predicted response pair. In this case, where there are two possible responses (churn or not churn), there are four overall outcomes.
- The customer churned and the model predicted that.
- The customer churned but the model didn't predict that.
- The customer didn't churn but the model predicted they did.
- The customer didn't churn and the model predicted that.
churn and mdl_churn_vs_relationship are available.
Get the actual responses from the has_churned column of the dataset. Assign to actual_response.
# Get the actual responses from the dataset
actual_response <- churn$has_churned
Get the “most likely” predicted responses from the model. Assign to predicted_response.
# Get the "most likely" responses from the model
predicted_response <- round(fitted(mdl_churn_vs_relationship))
Create a table of counts from the actual and predicted response vectors. Assign to outcomes.
# Create a table of counts
outcomes <- table(
predicted_response,
actual_response
)
# See the result
outcomes
## actual_response
## predicted_response 0 1
## 0 112 76
## 1 88 124
Having the confusion matrix as a table object is OK, but a little hard to program with. By converting this to a yardstick confusion matrix object, you get methods for plotting and extracting performance metrics.
The confusion matrix, outcomes, is available as a table object. ggplot2 and yardstick are loaded, and the yardstick.event_first option is set to FALSE.
Convert outcomes to a yardstick confusion matrix. Assign to confusion. Automatically plot confusion. Get performance metrics from confusion, remembering that the positive response is in the second column.
# Convert outcomes to a yardstick confusion matrix
confusion <- conf_mat(outcomes)
# Plot the confusion matrix
autoplot(confusion)
# Get performance metrics for the confusion matrix
summary(confusion, event_level = "second")
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.59
## 2 kap binary 0.18
## 3 sens binary 0.62
## 4 spec binary 0.56
## 5 ppv binary 0.585
## 6 npv binary 0.596
## 7 mcc binary 0.180
## 8 j_index binary 0.180
## 9 bal_accuracy binary 0.59
## 10 detection_prevalence binary 0.53
## 11 precision binary 0.585
## 12 recall binary 0.62
## 13 f_meas binary 0.602