Correlation and Regression in R (DataCamp)

Ch. 1 - Visualizing two variables

Visualizing bivariate relationships

[Video]

Scatterplots

# Scatterplot of weight vs. weeks
ggplot(data = ncbirths, aes(x = weeks, y = weight)) + 
  geom_point()

## Warning: Removed 2 rows containing missing values (geom_point).

Boxplots as discretized/conditioned scatterplots

# Boxplot of weight vs. weeks
ggplot(data = ncbirths, 
       aes(x = cut(weeks, breaks = 5), y = weight)) + 
  geom_boxplot()

Characterizing bivariate relationships

[Video]

Creating scatterplots

# Mammals scatterplot
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point()

# Baseball player scatterplot
ggplot(data = mlbBat10, aes(x = OBP, y = SLG)) +
  geom_point()

# Body dimensions scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
  geom_point()

# Smoking scatterplot
ggplot(data = smoking, aes(x = age, y = amtWeekdays)) +
  geom_point()

## Warning: Removed 1270 rows containing missing values (geom_point).

Characterizing scatterplots

This scatterplot shows the relationship between the poverty rates and high school graduation rates of counties in the United States.

Describe the form, direction, and strength of this relationship.

Linear, positive, strong
Linear, negative, weak
[*] Linear, negative, moderately strong
Non-linear, negative, strong

Transformations

# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() + 
  coord_trans(x = "log10", y = "log10")

# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + 
  scale_y_log10()

Outliers

[Video]

Identifying outliers

# Filter for AB greater than or equal to 200
ab_gt_200 <- mlbBat10 %>% 
  filter(AB >= 200)

# Scatterplot of SLG vs. OBP
ggplot(ab_gt_200, aes(x = OBP, y = SLG)) +
  geom_point()

# Identify the outlying player
ab_gt_200 %>%
  filter(OBP < 0.2)

##     name team position  G  AB  R  H 2B 3B HR RBI TB BB SO SB CS   OBP   SLG
## 1 B Wood  LAA       3B 81 226 20 33  2  0  4  14 47  6 71  1  0 0.174 0.208
##     AVG
## 1 0.146

Ch. 2 - Correlation

Quantifying the strength of bivariate relationships

[Video]

Understanding correlation scale

In a scientific paper, three correlations are reported with the following values:

-0.395
1.827
0.738

Choose the correct interpretation of these findings.

1. is invalid.
[*] (2) is invalid.
1. is invalid.
Both (1) and (2) are invalid.
Both (2) and (3) are invalid.

Understanding correlation sign

In a scientific paper, three correlations are reported with the following values:

0.582
0.134
-0.795

Which of these values represents the strongest correlation?

0.582
0.134
[*] -0.795
Can’t tell!

Computing correlation

# Compute correlation
ncbirths %>%
  summarize(N = n(), r = cor(weight, mage))

##      N          r
## 1 1000 0.05506589

# Compute correlation for all non-missing pairs
ncbirths %>%
  summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))

##      N         r
## 1 1000 0.6701013

The Anscombe dataset

[Video]

Exploring Anscombe

# Compute properties of Anscombe
Anscombe %>%
  group_by(set) %>%
  summarize(
    N = n(), 
    mean_of_x = mean(x), 
    std_dev_of_x = sd(x), 
    mean_of_y = mean(y), 
    std_dev_of_y = sd(y), 
    correlation_between_x_and_y = cor(x, y)
  )

## # A tibble: 4 x 7
##     set     N mean_of_x std_dev_of_x mean_of_y std_dev_of_y correlation_between…
##   <int> <int>     <dbl>        <dbl>     <dbl>        <dbl>                <dbl>
## 1     1    11         9         3.32      7.50         2.03                0.816
## 2     2    11         9         3.32      7.50         2.03                0.816
## 3     3    11         9         3.32      7.5          2.03                0.816
## 4     4    11         9         3.32      7.50         2.03                0.817

Perception of correlation

Recall the scatterplot between the poverty rate of counties in the United States and the high school graduation rate in those counties from the previous chapter. Which of the following values is the correct correlation between poverty rate and high school graduation rate?

-0.861
[*] -0.681
-0.186
0.186
0.681
0.861

Perception of correlation (2)

# Run this and look at the plot
ggplot(data = mlbBat10, aes(x = OBP, y = SLG)) +
  geom_point()

# Correlation for all baseball players
mlbBat10 %>%
  summarize(N = n(), r = cor(OBP, SLG))

##      N         r
## 1 1199 0.8145628

# Run this and look at the plot
mlbBat10 %>% 
    filter(AB > 200) %>%
    ggplot(aes(x = OBP, y = SLG)) + 
    geom_point()

# Correlation for all players with at least 200 ABs
mlbBat10 %>%
  filter(AB >= 200) %>%
  summarize(N = n(), r = cor(OBP, SLG))

##     N         r
## 1 329 0.6855364

# Run this and look at the plot
ggplot(data = bdims, aes(x = hgt, y = wgt, color = factor(sex))) +
  geom_point()

# Correlation of body dimensions
bdims %>%
  group_by(sex) %>%
  summarize(N = n(), r = cor(hgt, wgt))

## # A tibble: 2 x 3
##     sex     N     r
##   <int> <int> <dbl>
## 1     0   260 0.431
## 2     1   247 0.535

# Run this and look at the plot
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() + scale_x_log10() + scale_y_log10()

# Correlation among mammals, with and without log
mammals %>%
  summarize(N = n(), 
            r = cor(BodyWt, BrainWt), 
            r_log = cor(log(BodyWt), log(BrainWt)))

##    N         r     r_log
## 1 62 0.9341638 0.9595748

Interpretation of correlation

[Video]

Interpreting correlation in context

Recall that you previously determined the value of the correlation coefficient between the poverty rate of counties in the United States and the high school graduation rate in those counties was -0.681. Choose the correct interpretation of this value.

People who graduate from high school are less likely to be poor.
Counties with lower high school graduation rates are likely to have lower poverty rates.
[*] Counties with lower high school graduation rates are likely to have higher poverty rates.
Because the correlation is negative, there is no relationship between poverty rates and high school graduate rates.
Having a higher percentage of high school graduates in a county results in that county having lower poverty rates.

Correlation and causation

In the San Francisco Bay Area from 1960-1967, the correlation between the birthweight of 1,236 babies and the length of their gestational period was 0.408. Which of the following conclusions is not a valid statistical interpretation of these results.

We observed that babies with longer gestational periods tended to be heavier at birth.
It may be that a longer gestational period contributes to a heavier birthweight among babies, but a randomized, controlled experiment is needed to confirm this observation.
[*] Staying in the womb longer causes babies to be heavier when they are born.
These data suggest that babies with longer gestational periods tend to be heavier at birth, but there are many potential confounding factors that were not taken into account.

Spurious correlation

[Video]

Spurious correlation in random data

# Create faceted scatterplot
ggplot(data = noise, aes(x = x, y = y)) +
  geom_point() + 
  facet_wrap(~ z)

# Compute correlations for each dataset
noise_summary <- noise %>%
  group_by(z) %>%
  summarize(N = n(), spurious_cor = cor(x, y))

# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
  filter(abs(spurious_cor) > 0.2)

## # A tibble: 3 x 3
##       z     N spurious_cor
##   <int> <int>        <dbl>
## 1     7    50        0.240
## 2    15    50        0.309
## 3    16    50        0.218

Ch. 3 - Simple linear regression

Visualization of linear models

[Video]

The “best fit” line

# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Uniqueness of least squares regression line

# Estimate optimal value of my_slope
add_line(my_slope = 1)

Understanding the linear model

[Video]

Regression model terminology

Consider a linear regression model of the form:

Y=β0+β1⋅X+ϵ, where ϵ∼N(0,σϵ).

The slope coefficient is:

Y
β0
[*] β1
ϵ

Regression model output terminology

The fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is:

povertyˆ=64.594−0.591⋅hs_grad

In Hampshire County in western Massachusetts, the high school graduation rate is 92.4%. These two facts imply that the poverty rate in Hampshire County is ___.

exactly 11.7%
exactly 10.0%
[*] expected to be about 10.0%
expected to be about 11.7%

Fitting a linear model “by hand”

# Print bdims_summary
bdims_summary

##     N         r mean_hgt   sd_hgt mean_wgt   sd_wgt
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576

# Add slope and intercept
bdims_summary %>%
  mutate(slope = r * sd_wgt / sd_hgt, 
         intercept = mean_wgt - slope * mean_hgt)

##     N         r mean_hgt   sd_hgt mean_wgt   sd_wgt    slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0113

Regression vs. regression to the mean

[Video]

Regression to the mean

# Height of children vs. height of father
ggplot(data = Galton_men, aes(x = father, y = height)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

# Height of children vs. height of mother
ggplot(data = Galton_women, aes(x = mother, y = height)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

“Regression” in the parlance of our time

In an opinion piece about nepotism published in The New York Times in 2015, economist Seth Stephens-Davidowitz wrote that:

"Regression to the mean is so powerful that once-in-a-generation talent basically never sires once-in-a-generation talent. It explains why Michael Jordan’s sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league."

The author is arguing that…

Because of regression to the mean, an outstanding basketball player is likely to have sons that are as good at basketball as him.
Because of regression to the mean, an outstanding basketball player is likely to have sons that are not good at basketball.
[*] Because of regression to the mean, an outstanding basketball player is likely to have sons that are good at basketball, but not as good as him.
Linear regression is incapable of evaluating musical or athletic talent.

Ch. 4 - Interpreting regression models

Interpretation of regression

[Video]

Interpretation of coefficients

Recall that the fitted model for the poverty rate of U.S. counties as a function of high school graduation rate is:

povertyˆ=64.594−0.591⋅hs_grad

Which of the following is the correct interpretation of the slope coefficient?

Among U.S. counties, each additional percentage point increase in the poverty rate is associated with about a 0.591 percentage point decrease in the high school graduation rate.
[*] Among U.S. counties, each additional percentage point increase in the high school graduation rate is associated with about a 0.591 percentage point decrease in the poverty rate.
Among U.S. counties, each additional percentage point increase in the high school graduation rate is associated with about a 0.591 percentage point increase in the poverty rate.
Among U.S. counties, a 1% increase in the high school graduation rate is associated with about a 0.591% decrease in the poverty rate.

Interpretation in context

A politician interpreting the relationship between poverty rates and high school graduation rates implores his constituents:

If we can lower the poverty rate by 59%, we’ll double the high school graduate rate in our county (i.e. raise it by 100%).

Which of the following mistakes in interpretation has the politician made?

Implying that the regression model establishes a cause-and-effect relationship.
Switching the role of the response and explanatory variables.
Confusing percentage change with percentage point change.
[*] All of the above.
None of the above.

Fitting simple linear models

# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Coefficients:
## (Intercept)          hgt  
##    -105.011        1.018

# Linear model for SLG as a function of OBP
lm(SLG ~ OBP, data = mlbBat10)

## 
## Call:
## lm(formula = SLG ~ OBP, data = mlbBat10)
## 
## Coefficients:
## (Intercept)          OBP  
##    0.009407     1.110323

# Log-linear model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(BrainWt), data = mammals)

## 
## Call:
## lm(formula = log(BodyWt) ~ log(BrainWt), data = mammals)
## 
## Coefficients:
##  (Intercept)  log(BrainWt)  
##       -2.509         1.225

Units and scale

In the previous examples, we fit two regression models:

wgtˆ=−105.011+1.018⋅hgt

and

SLGˆ=0.009+1.110⋅OBP.

Which of the following statements is incorrect?

A person who is 170 cm tall is expected to weigh about 68 kg.
[*] Because the slope coefficient for OBP is larger (1.110) than the slope coefficient for hgt (1.018), we can conclude that the association between OBP and SLG is stronger than the association between height and weight.
None of the above.

Your linear model object

[Video]

The lm summary output

# Show the coefficients
coef(mod)

## (Intercept)         hgt 
## -105.011254    1.017617

# Show the full output
summary(mod)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

Fitted values and residuals

# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))

## [1] TRUE

# Mean of the residuals
mean(residuals(mod))

## [1] -1.266971e-15

Tidying your linear model

# Load broom
library(broom)

# Create bdims_tidy
bdims_tidy <- augment(mod)

# Glimpse the resulting data frame
glimpse(bdims_tidy)

## Rows: 507
## Columns: 9
## $ wgt        <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6…
## $ hgt        <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 17…
## $ .fitted    <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 79.68619…
## $ .se.fit    <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.8183471, 0.6…
## $ .resid     <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -6.68660…
## $ .hat       <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576, 0.0077…
## $ .sigma     <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9.314716…
## $ .cooksd    <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e-03, 2.…
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063, -0.721…

Using your linear model

[Video]

Making predictions

# Print ben
ben

##    wgt   hgt
## 1 74.8 182.8

# Predict the weight of ben
predict(mod, newdata = ben)

##        1 
## 81.00909

Adding a regression line to a plot manually

# Add the line to the scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_abline(data = coefs, 
              aes(intercept = Intercept, slope = hgt),  
              color = "dodgerblue")

Ch. 5 - Model Fit

Assessing model fit

[Video]

RMSE

The residual standard error reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 4.67. What does this mean?

[*] The typical difference between the observed poverty rate and the poverty rate predicted by the model is about 4.67 percentage points.
The typical difference between the observed poverty rate and the poverty rate predicted by the model is about 4.67%.
The model explains about 4.67% of the variability in poverty rate among counties.
The model correctly predicted the poverty rate of 4.67% of the counties.

Standard error of residuals

# View summary of model
summary(mod)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

# Compute the mean of the residuals
mean(residuals(mod))

## [1] -1.266971e-15

# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))

## [1] 9.30804

Comparing model fits

[Video]

Assessing simple linear model fit

# View model summary
summary(mod)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

# Compute R-squared
bdims_tidy %>%
  summarize(var_y = var(wgt), var_e = var(.resid)) %>%
  mutate(R_squared = 1 - var_e / var_y)

## # A tibble: 1 x 3
##   var_y var_e R_squared
##   <dbl> <dbl>     <dbl>
## 1  178.  86.5     0.515

Interpretation of R^2

The R2 reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 0.464.

lm(formula = poverty ~ hs_grad, data = countyComplete) %>% summary()

How should this result be interpreted?

46.4% of the variability in high school graduate rate among U.S. counties can be explained by poverty rate.
[*] 46.4% of the variability in poverty rate among U.S. counties can be explained by high school graduation rate.
This model is 46.4% effective.
The correlation between poverty rate and high school graduation rate is 0.464.

Linear vs. average

# Compute SSE for null model
mod_null %>%
  summarize(SSE = sum(.resid^2))

##        SSE
## 1 90123.34

# Compute SSE for regression model
mod_hgt %>%
  summarize(SSE = sum(.resid^2))

##        SSE
## 1 43753.01

Unusual points

[Video]

Leverage

# Rank points of high leverage
mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  head()

## # A tibble: 6 x 9
##     wgt   hgt .fitted .se.fit .resid   .hat .sigma  .cooksd .std.resid
##   <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1  85.5  198.    96.6    1.26 -11.1  0.0182   9.30 0.0134       -1.20 
## 2  90.9  197.    95.6    1.21  -4.66 0.0170   9.31 0.00221      -0.505
## 3  49.8  147.    44.8    1.13   5.02 0.0148   9.31 0.00221       0.543
## 4  80.7  194.    91.9    1.07 -11.2  0.0131   9.30 0.00976      -1.21 
## 5  95.9  193     91.4    1.05   4.51 0.0126   9.32 0.00152       0.488
## 6  44.8  150.    47.1    1.04  -2.32 0.0124   9.32 0.000397     -0.251

Influence

# Rank influential points
mod %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  head()

## # A tibble: 6 x 9
##     wgt   hgt .fitted .se.fit .resid    .hat .sigma .cooksd .std.resid
##   <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
## 1  73.2  151.    48.8   0.974   24.4 0.0109    9.25  0.0386       2.64
## 2 116.   178.    75.9   0.507   40.5 0.00296   9.14  0.0282       4.36
## 3 104.   165.    63.0   0.491   41.1 0.00279   9.14  0.0273       4.42
## 4 109.   190.    88.8   0.946   19.8 0.0103    9.28  0.0238       2.13
## 5  67.3  152.    50.1   0.922   17.2 0.00982   9.29  0.0171       1.86
## 6  76.8  158.    55.3   0.729   21.5 0.00613   9.27  0.0166       2.32

Dealing with unusual points

[Video]

Removing outliers

# Create nontrivial_players
nontrivial_players <- mlbBat10 %>%
  filter(AB >= 10, OBP < 0.5)

# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)

# View model summary
summary(mod_cleaner)

## 
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31383 -0.04165 -0.00261  0.03992  0.35819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.043326   0.009823  -4.411 1.18e-05 ***
## OBP          1.345816   0.033012  40.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared:  0.6937, Adjusted R-squared:  0.6932 
## F-statistic:  1662 on 1 and 734 DF,  p-value: < 2.2e-16

# Visualize new model
ggplot(data = nontrivial_players, aes(x = OBP, y = SLG)) +
  geom_point() + 
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

High leverage points

# Rank high leverage points
mod %>%
  augment() %>%
  arrange(desc(.hat), .cooksd) %>%
  head()

## # A tibble: 6 x 9
##     wgt   hgt .fitted .se.fit .resid   .hat .sigma  .cooksd .std.resid
##   <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1  85.5  198.    96.6    1.26 -11.1  0.0182   9.30 0.0134       -1.20 
## 2  90.9  197.    95.6    1.21  -4.66 0.0170   9.31 0.00221      -0.505
## 3  49.8  147.    44.8    1.13   5.02 0.0148   9.31 0.00221       0.543
## 4  80.7  194.    91.9    1.07 -11.2  0.0131   9.30 0.00976      -1.21 
## 5  95.9  193     91.4    1.05   4.51 0.0126   9.32 0.00152       0.488
## 6  44.8  150.    47.1    1.04  -2.32 0.0124   9.32 0.000397     -0.251

Conclusion

[Video]

About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470