R Markdown

This presentation shows the main features of R. The roadmap for the session is:

  • Why R
  • What Can We Do with R

R community

  • R is open source, with a large community on Stack Overflow

R for Academia

  • R is very popular among academics

R for Business

  • R is in heavy use at several leading organizations that hire data scientists:
    • Google
    • Facebook
    • Twitter
    • Uber
    • Microsoft
    • World Bank

R for Business and Academia

  • As data science matures, data scientists in the business world will need to communicate more with academic scientists.

What Can We Do with R

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Slide with R Output

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Slide with math

We can write math equations:

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log(p^{(i)}) + (1-y^{(i)})\log(1-p^{(i)}) \right] \]
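
The cost above can also be computed directly in R. This is a minimal sketch, assuming `y` holds the 0/1 labels and `p` the predicted probabilities; the function name `cross_entropy` and the sample values are illustrative and not part of the original slides.

```r
# Binary cross-entropy cost J(theta), averaged over the m observations.
# `y`: vector of 0/1 labels; `p`: predicted probabilities strictly in (0, 1).
cross_entropy <- function(y, p) {
  stopifnot(length(y) == length(p), all(p > 0 & p < 1))
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Illustrative values only
y <- c(1, 0, 1)
p <- c(0.9, 0.2, 0.7)
cost <- cross_entropy(y, p)
print(cost)
```

Using `mean()` covers both the sum and the 1/m factor of the formula.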

Connect to the World Bank

Stata format

Import Data from Any Format

  • CSV file
##               X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
  • Stata
## # A tibble: 3 x 4
##   admit   gre   gpa  rank
##   <dbl> <dbl> <dbl> <dbl>
## 1     0   380  3.61     3
## 2     1   660  3.67     3
## 3     1   800  4.00     1

Import Data from Any Format

  • Excel
## # A tibble: 71 x 2
##    weight      feed
##     <dbl>     <chr>
##  1    179 horsebean
##  2    160 horsebean
##  3    136 horsebean
##  4    227 horsebean
##  5    217 horsebean
##  6    168 horsebean
##  7    108 horsebean
##  8    124 horsebean
##  9    143 horsebean
## 10    140 horsebean
## # ... with 61 more rows
  • R also reads SAS and SPSS files, and connects easily to Dropbox and Google Drive
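
The slides show only the imported data, not the calls that produced it. The sketch below is one plausible way to do the CSV import with base R, using a temporary file so that it runs anywhere. The commented lines for the other formats assume the `haven` and `readxl` packages are installed, and their file names are placeholders.

```r
# Round trip through a temporary CSV file using base R only.
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 3), csv_path)   # row names become an unnamed first column
cars_csv <- read.csv(csv_path)         # read.csv() names that column "X"
print(cars_csv)

# Other formats rely on add-on packages (assumed installed; paths are placeholders):
# stata_data <- haven::read_dta("my_file.dta")        # Stata
# excel_data <- readxl::read_excel("my_file.xlsx")    # Excel
# sas_data   <- haven::read_sas("my_file.sas7bdat")   # SAS
# spss_data  <- haven::read_sav("my_file.sav")        # SPSS
```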

Import Data from the Web

## The Greatest Showman is rated 7.9/10
## from Internet Movie Database and released in 20 Dec 2017

R pipeline

  • The R pipeline makes data manipulation clean, fast and less prone to error
  • A pipeline chains steps so that each result feeds the next, without saving intermediate results as separate objects
  • Let's consider a project where we need to:
    • Import two datasets
    • Merge them together
    • Create a new variable
    • Compute summary statistics within groups
  • The R pipeline is definitely the right candidate

Generate the data table

  • Table users
ID  name  age  occupation
1   John  21   Student
2   Jack  28   Employee
  • Table items
ID  price  quantity
1   1      1
3   4      5
5   10     5
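
The two tables can be rebuilt as data frames in base R. This is a sketch that covers only the rows displayed above (the full datasets used later clearly contain more users). The `merged` object and the base-R `merge()` call are illustrative additions. The pipeline example refers to a column spelled `Quantity`, so the column name may need adjusting to match.

```r
# Rebuild the two example tables from the rows shown on the slide.
users <- data.frame(
  ID = c(1, 2),
  name = c("John", "Jack"),
  age = c(21, 28),
  occupation = c("Student", "Employee")
)

items <- data.frame(
  ID = c(1, 3, 5),
  price = c(1, 4, 10),
  quantity = c(1, 5, 5)
)

# Base R left join: keep every row of `items`, add user columns where ID matches.
merged <- merge(items, users, by = "ID", all.x = TRUE)
print(merged)
```

Items whose `ID` has no match in `users` keep their row, with `NA` in the user columns.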

Example

library(dplyr)

items %>%
  # Merge with the users table on their shared columns
  left_join(users) %>%
  # Group by user name
  group_by(name) %>%
  # Quantity-weighted average price per user
  mutate(average_basket = crossprod(price, Quantity) / sum(Quantity)) %>%
  # Summarise each group
  summarise(average_basket = mean(average_basket),
            mean_price = mean(price),
            count = n()) %>%
  # Sort by ascending mean price
  arrange(mean_price)
## # A tibble: 5 x 4
##     name average_basket mean_price count
##   <fctr>          <dbl>      <dbl> <int>
## 1   Jack       3.000000   3.000000     1
## 2   John       4.333333   3.500000     2
## 3 AMELIA       7.000000   7.000000     1
## 4  ELLIS       7.222222   7.600000     5
## 5 SOPHIE       8.625000   7.666667     3

Graphics with R

  • Graphics in R, through the ggplot2 package, follow 'The Grammar of Graphics':
    • Data: the raw data that you want to plot
    • Geometries: the shapes that will represent the data
    • Aesthetics: for a given geometry, we have certain parameters we must set
    • Scale: the transformation which maps data to the aesthetic dimensions, such as the data range, plot width, or colors associated with different factors
  • Let's see some examples
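
As a first example, the four components listed above map onto a ggplot2 call roughly as follows. This is a minimal sketch, assuming the `ggplot2` package is installed; the built-in `cars` data, the colour and the axis labels are illustrative choices.

```r
library(ggplot2)

p <- ggplot(data = cars,                      # Data: the raw data to plot
            aes(x = speed, y = dist)) +       # Aesthetics: variables mapped to x and y
  geom_point(colour = "steelblue") +          # Geometry: points
  scale_y_continuous(name = "Stopping distance (ft)") +  # Scale: y-axis mapping and label
  labs(x = "Speed (mph)")

print(p)
```

Each layer is added with `+`, so a geometry or scale can be swapped without touching the rest of the plot.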

R and Econometrics

  • We can run regressions, collect the results and export them for publication
Regression Results
                     Dependent variable:
                     rating                                            high.rating
                     OLS                                               probit
                     (1)                      (2)                      (3)
complaints           0.692***                 0.682***
                     (0.149)                  (0.129)
privileges           -0.104                   -0.103
                     (0.135)                  (0.129)
learning             0.249                    0.238*                   0.164***
                     (0.160)                  (0.139)                  (0.053)
raises               -0.033
                     (0.202)
critical             0.015                                             -0.001
                     (0.147)                                           (0.044)
advance                                                                -0.062
                                                                       (0.042)
Constant             11.011                   11.258                   -7.476**
                     (11.704)                 (7.318)                  (3.570)
Observations         30                       30                       30
R2                   0.715                    0.715
Adjusted R2          0.656                    0.682
Log Likelihood                                                         -9.087
Akaike Inf. Crit.                                                      26.175
Residual Std. Error  7.139 (df = 24)          6.863 (df = 26)
F Statistic          12.063*** (df = 5; 24)   21.743*** (df = 3; 26)
Note: *p<0.1; **p<0.05; ***p<0.01
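
The code behind this table is not shown on the slide. The sketch below is one way such a table could be produced, assuming the models are fitted on R's built-in `attitude` data (whose columns match the variable names above) and exported with the `stargazer` package. The model names and the cutoff of 70 used to define `high.rating` are assumptions, not taken from the slides.

```r
# `attitude` ships with base R and has 30 observations.
ols_full  <- lm(rating ~ complaints + privileges + learning + raises + critical,
                data = attitude)
ols_small <- lm(rating ~ complaints + privileges + learning, data = attitude)

# Binary outcome for the probit model; the cutoff of 70 is an assumption.
attitude$high.rating <- attitude$rating > 70
probit <- glm(high.rating ~ learning + critical + advance,
              data = attitude, family = binomial(link = "probit"))

# Export the three models side by side, only if stargazer is installed.
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(ols_full, ols_small, probit,
                       type = "text", title = "Regression Results")
}
```

Switching `type` to `"latex"` or `"html"` gives output that can go straight into a paper or a web page.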

Create and Share Your App

  • We can share our results with the world thanks to Shiny apps.

  • This app was developed by Daniele Landolo. It presents the World Bank data nicely.
  • Another simple one for the GDP/cap from the World Bank: app
  • My app is hosted on my website
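
For readers who have not seen one, a Shiny app is an R script with a user interface and a server function. The sketch below is a minimal, hypothetical app and not one of the apps linked above. It assumes the `shiny` package is installed and uses the built-in `cars` data; all input and output names are illustrative.

```r
library(shiny)

# User interface: a slider and a plot placeholder
ui <- fluidPage(
  titlePanel("Stopping distance by speed"),
  sliderInput("max_speed", "Maximum speed (mph)",
              min = 5, max = 25, value = 25),
  plotOutput("scatter")
)

# Server: redraw the plot whenever the slider changes
server <- function(input, output) {
  output$scatter <- renderPlot({
    shown <- cars[cars$speed <= input$max_speed, ]
    plot(shown$speed, shown$dist,
         xlab = "Speed (mph)", ylab = "Distance (ft)")
  })
}

app <- shinyApp(ui = ui, server = server)
# Launch locally with: runApp(app)
```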

GitHub: Platform to Manage Projects

  • My GitHub
  • GitHub is a platform to manage, review and share projects.
  • Everyone in your project can see which edits have been made.
  • Project revisions can be discussed publicly, so a large pool of experts can contribute knowledge and collaborate to advance a project.
  • GitHub is widely used in data science. Academics may find its collaboration and sharing features particularly useful.