Introduction to R

Novia Permatasari

May 7, 2021

Overview of R

What is R?
R is a language and environment for statistical computing and graphics (https://www.r-project.org/).

What is R used for?

  • Basic and advanced statistical analysis
  • Machine learning algorithms
  • Data wrangling
  • Data scraping
  • Data visualization and dashboards
  • Reports and presentations (PDF, Word, PowerPoint)

Why use R?

  • Comprehensive set of statistical analysis tools
  • Open source
  • Supports various data types
  • Powerful graphics
  • Continuously growing

Disadvantages of R

  • Writing code takes time
  • Some packages may be of poor quality

Introduction to R Code

R vs RStudio vs R Package

  • R : A programming language.
  • RStudio : An IDE for using R.
  • R Package : A collection of functions and datasets used to do a specific task.

Installing R

RStudio

  • 01 : Script file, contains our code.
  • 02 : All objects we have defined in the environment.
  • 03 : Help files and plots.
  • 04 : The console, to type commands and display results.

R-Package

  • Packages can be downloaded from, and contributed to, CRAN: https://cran.r-project.org/
  • As of May 2021, 17,525 packages are available.
  • Install a package once, then load it in each session:
# install.packages("survey")
library(survey)
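
Before installing, it can help to check which packages are already present. The following is a minimal sketch that only reports missing packages and does not install anything; the helper name missing_packages is hypothetical and not part of any package.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "ggplot2", "shiny", "survey")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Run install.packages() for: ", paste(to_install, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Passing the result to install.packages(to_install) would then install only what is missing.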

Commonly used R packages:

  • tidyverse : data exploration, data wrangling
  • ggplot2 : data visualization
  • shiny : build dashboards
  • survey : analysis of complex survey samples

Variables and Data Structure

  • Variable
number <- 1
name <- "Novia Permatasari"
  • Vector
num_vect <- c(1,2,3,4)
num_vect
## [1] 1 2 3 4

  • DataFrame
BMI <-  data.frame(
   gender = c("Male", "Male","Female"), 
   height = c(152, 171.5, 165), 
   weight = c(81,93, 78),
   Age = c(42,38,26)
)
BMI
##   gender height weight Age
## 1   Male  152.0     81  42
## 2   Male  171.5     93  38
## 3 Female  165.0     78  26
  • Other data structures: matrix, list, array
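
The remaining structures can be sketched briefly as follows. The list fields scores and active are made up for illustration.

```r
# Matrix: two-dimensional, all elements share one type.
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)

# List: elements may have different types and lengths.
person <- list(name = "Novia Permatasari", scores = c(90, 85), active = TRUE)
person$scores[2]

# Array: generalizes the matrix to more than two dimensions.
a <- array(1:24, dim = c(2, 3, 4))
a[2, 3, 4]   # the last element
```

A data frame is itself a list of equal-length vectors, which is why columns can be accessed with $ in the same way as list elements.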

Applications at BPS

Tabulation

## [1] "Number of People Aged 5 Years and Over by Characteristics and Educational Status, 2020"
##   Sex 02 Never / not yet in school     03 SD   04 SMP 05 SMA and above
## 1   1                     3270.072 10384.928 6146.938         5735.927
## 2   2                     3618.676  8769.409 4430.642         8604.462
##   06 No longer in school
## 1               72952.29
## 2               69074.83
## [1] "Percentage of People Aged 5 Years and Over by Characteristics and Educational Status, 2020"
##   Sex 02 Never / not yet in school     03 SD   04 SMP 05 SMA and above
## 1   1                     3.320202 10.544128 6.241170         5.823858
## 2   2                     3.829367  9.279993 4.688609         9.105442
##   06 No longer in school
## 1               74.07064
## 2               73.09659
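
The percentages in the second table are row percentages of the counts in the first table. A minimal sketch of that step, using the printed counts and shortened column codes as labels (proportions() requires R 4.0.1 or later; prop.table() is the older equivalent):

```r
# Counts copied from the first table: rows = sex code, columns = status code.
counts <- matrix(
  c(3270.072, 10384.928, 6146.938, 5735.927, 72952.29,
    3618.676,  8769.409, 4430.642, 8604.462, 69074.83),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("1", "2"),
                  status = c("02", "03", "04", "05", "06"))
)

# Row percentages: each row sums to 100.
pct <- proportions(counts, margin = 1) * 100
round(pct, 2)
```

Using margin = 2 instead would give column percentages, that is, the share of each sex within an education status.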

Data Visualization

More visualization examples: https://www.r-graph-gallery.com/

Dashboard

https://gallery.shinyapps.io/nz-trade-dash/

Direct Estimation - Complex Sampling

##   Sex No Schooling (Proportion)          SE   CI LOWER   CI UPPER
## 1   1                0.03320202 0.006364916 0.02072702 0.04567703
## 2   2                0.03829367 0.005958246 0.02661572 0.04997162
##         RSE  RSE (%)          VAR      DEFF
## 1 0.1917027 19.17027 4.051215e-05 1.2415514
## 2 0.1555935 15.55935 3.550069e-05 0.9309919
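
Several columns in this output follow directly from the estimate and its standard error. The sketch below recomputes them from the printed values, assuming a normal-approximation 95% confidence interval, which is consistent with the printed bounds. The design effect (DEFF) needs the full survey design and is not reproduced here.

```r
est <- c(0.03320202, 0.03829367)    # estimated proportions from the table
se  <- c(0.006364916, 0.005958246)  # standard errors from the table

z <- qnorm(0.975)                   # about 1.96 for a 95% interval

result <- data.frame(
  est      = est,
  ci_lower = est - z * se,
  ci_upper = est + z * se,
  rse_pct  = 100 * se / est,        # relative standard error in percent
  var      = se^2                   # variance is the squared standard error
)
result
```

The relative standard error expresses the standard error as a share of the estimate, which makes precision comparable across estimates of different size.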

Statistical Method for Analysis

  • Statistical Analysis
  • Forecasting
  • Small Area Estimation

Hands-On

Percentage of People Aged 5 Years and Over by Characteristics and Educational Status, 2020

Preparation

  • R and RStudio
  • Packages:
    • tidyverse (data preparation)
    • ggplot2 (visualization)
    • survey (direct estimates)
# install.packages("tidyverse")
# install.packages("ggplot2")
# install.packages("survey")

Base functions

  • sum(), mean(), max(), min() : basic statistics
  • plot(), points(), lines(), hist() : visualization
  • xtabs(), proportions() : make a contingency table
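
A short sketch of the statistics and tabulation functions, using the height and gender values from the earlier data frame example:

```r
height <- c(152, 171.5, 165)
gender <- c("Male", "Male", "Female")

sum(height)    # 488.5
mean(height)
max(height)    # 171.5
min(height)    # 152

# One-way contingency table of counts, then the same table as proportions.
tab <- xtabs(~ gender)
tab
proportions(tab)
```

With a data frame, the same table can be built with xtabs(~ gender, data = BMI), and a second variable after the ~ (joined with +) gives a two-way table.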

tidyverse

  • %>% : the pipe operator, chains a sequence of operations
  • select() : select variables
  • filter() : filter records / rows
  • mutate() : compute and add new variable
  • group_by() & summarise() : aggregate data
  • left_join(), right_join(), inner_join() : join table
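
The verbs above can be chained as in the sketch below. It loads dplyr, the tidyverse package that provides them, and reuses the values from the earlier data frame example. Treating height as centimetres and weight as kilograms is an assumption, and the labels lookup table is hypothetical.

```r
library(dplyr)  # part of the tidyverse

bmi_data <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),   # assumed to be centimetres
  weight = c(81, 93, 78)         # assumed to be kilograms
)

summary_tbl <- bmi_data %>%
  select(gender, height, weight) %>%          # keep these variables
  filter(height > 150) %>%                    # keep rows meeting the condition
  mutate(bmi = weight / (height / 100)^2) %>% # add a computed variable
  group_by(gender) %>%
  summarise(n = n(), mean_bmi = mean(bmi))    # one row per gender

# Hypothetical lookup table, joined by the shared gender column.
labels <- data.frame(gender = c("Male", "Female"), code = c(1, 2))
joined <- left_join(summary_tbl, labels, by = "gender")
joined
```

Each step receives the result of the previous one, so the pipeline reads top to bottom in the order the operations are applied.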

ggplot2

  • ggplot() : define data and variables
  • geom_bar() : make a bar chart
  • geom_line() : make a line chart
  • geom_point() : make a scatter plot
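
A minimal sketch of how these pieces combine: ggplot() sets the data and variable mappings, and a geom layer draws them. Scatter plots in ggplot2 are drawn with geom_point(). The data values are reused from the earlier data frame example, and the axis labels are illustrative.

```r
library(ggplot2)

bmi_data <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78)
)

# Scatter plot of weight against height, coloured by gender.
p <- ggplot(bmi_data, aes(x = height, y = weight, colour = gender)) +
  geom_point(size = 3) +
  labs(x = "Height", y = "Weight", title = "Weight vs height")

print(p)
```

Swapping geom_point() for geom_line() or geom_bar() changes the chart type while the data and mappings stay the same.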

survey

  • svydesign() : define the survey design (PSU, SSU, strata, weights, data)
  • svyby() : compute direct estimates by subgroup
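
The two functions work together as sketched below. Since the BPS microdata is not included here, the example uses the api dataset that ships with the survey package (a one-stage cluster sample of California schools), so the variable names differ from those in the hands-on material.

```r
library(survey)

# Example data shipped with the survey package.
data(api)

design <- svydesign(
  ids     = ~dnum,     # primary sampling unit (school district)
  weights = ~pw,       # sampling weight
  fpc     = ~fpc,      # finite population correction
  data    = apiclus1
)

# Direct estimates of the mean API score in 2000 by school type,
# with standard errors, confidence intervals, and RSE in percent.
est <- svyby(~api00, ~stype, design, svymean,
             vartype = c("se", "ci", "cvpct"))
est
```

For a stratified design, a strata = ~variable argument would be added to svydesign(); the svyby() call stays the same.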

Let’s Go!

https://rpubs.com/n_statistics/IntroR_sharingBPS