Unlocking Statistics and Regression with R and RStudio Cloud

Motivation

Why should I learn Statistics with R? - GP initiative

  • Statistics is used everywhere.

    • Research, M.Sc, Ph.D, Business Analytics

    • Dependency on commercial software such as SPSS, SAS, Stata, IBM, MATLAB, etc.

    • R is open source and runs on all platforms

  • Current AI wave and data science jobs

    • Deep Learning = Domain Knowledge + Statistics + Coding Skill

    • Machine Learning = Domain Knowledge + Statistics + Coding Skill

      • Regression vs. Business Predictive Analytics

      • Classification

    • Data science salaries are quite high

      • Data Scientist

      • Data Analyst

      • Data Engineer

  • Need to help people learn statistics from a young age

Objectives

  • Assist and support all interested R learners

  • Provide Technical Assistance and services to them

    • Professional Analysis

    • Thesis, Dissertation and Business report writing and web publishing

    • Dashboard development

  • Teach statistics with R for a data science career

    • Machine Learning

    • Deep Learning

    • ChatGPT and Generative AI integration

Prerequisite - Software Installation

  • R

  • RStudio (Posit Cloud) IDE

  • Rtools (Github)

  • Quarto (Reporting)

  • Anaconda (Jupyter Notebook, JupyterLab, IPython)

Lecture 1: Introduction to R and Basic Coding

Introduction to R and Basic Coding

  • What is R? (brief overview of R and its uses)

  • Basic syntax and data types in R (vectors, variables, basic operators)

  • Example code for basic arithmetic operations

  • Hands-on exercise: Write a simple R script to perform basic arithmetic operations

  • Solutions to hands-on exercise



What is R?

R is a programming language and environment for statistical computing and graphics

R first appeared in 1993 as an implementation of the S language; version 1.0.0 was released in 2000

R runs on different platforms such as Windows, macOS, and Linux

R is widely used in academia, research, and industry for data analysis and visualization

R is free and open source. Current version at the time of writing: 4.4.1

R vs. RStudio vs Positron

R is the main workhorse, like the engine that makes a car run

RStudio is an Integrated Development Environment (IDE) and serves as the car body

Packages are the tools you work with, like a shovel or a knife, or the other parts of the car

There are 21,140 packages at the time of writing.

Top 10 packages and their downloads:

  • ggplot2: 147,070,480
  • rlang: 135,195,761
  • magrittr: 125,445,993
  • dplyr: 110,729,472
  • vctrs: 98,242,310
  • cli: 95,558,907
  • tibble: 93,010,991
  • devtools: 91,765,271
  • jsonlite: 91,013,906
  • Rcpp: 87,825,074

Positron is an IDE that supports both R and Python for data science projects

R packages downloads

Motivation

  • Data scientist has been called the sexiest job of the 21st century.

  • R, Python and Julia are the top languages for data science projects

  • Salary range 5000 USD - 12000 USD per month

  • What do you need to know?

    • Machine Learning

      • Classification

        • Supervised

        • Unsupervised

      • Regression

      • Reinforcement Learning

    • Statistics

    • Data Engineering

      • Big Data management

      • Data manipulation

Basic syntax and data types in R

Scalars: only one value

Vectors: a collection of values of the same type

Variables: a name given to a value or a vector

Basic operators: +, -, *, /, %%, >, >=, <, <=, !=, etc.
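The definitions above can be tried out directly in the R console. The following is a minimal sketch; the variable names (x, y, heights) and their values are made up for illustration.

```r
# Scalars: a single value (in R, a scalar is a vector of length 1)
x <- 10
y <- 3

# Arithmetic operators
x + y    # addition
x - y    # subtraction
x * y    # multiplication
x / y    # division
x %% y   # modulo: remainder of x divided by y

# Comparison operators return logical values (TRUE or FALSE)
x > y
x != y

# Vectors: operators are applied element by element
heights <- c(150, 160, 170)       # hypothetical heights in cm
heights_in_m <- heights / 100     # each element is divided by 100
heights >= 160                    # one TRUE/FALSE per element
```

Because operators work element by element, most calculations in R do not need an explicit loop.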

Used in

  • Mathematical calculation

  • Matrix (Linear Programming (lpSolve))

  • Statistics

  • Probability and Distribution

  • Text mining

  • Data Visualization

  • Spatial Data Analysis

  • Reporting

  • Dashboard

  • Business Analytics

Function

name_of_function <- function(){}

  • mean( ), median( ), mode( ), sd( ), summary( ), sample( ) # note: mode( ) returns the storage mode of an object, not the statistical mode

  • cor( )

  • lm( ) # \(y = \beta + \alpha x\)
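As a small sketch of the name_of_function <- function(){} pattern, the example below defines a helper for the statistical mode, since base R's mode() reports how an object is stored rather than its most frequent value. The name stat_mode and the scores vector are hypothetical and only for illustration.

```r
# Hypothetical helper: return the most frequent value(s) of a vector
stat_mode <- function(x) {
  counts <- table(x)                       # frequency of each distinct value
  names(counts)[counts == max(counts)]     # value(s) with the highest frequency
}

scores <- c(2, 4, 4, 5, 7, 4, 9)   # made-up data

mean(scores)       # arithmetic mean
median(scores)     # middle value
sd(scores)         # standard deviation
mode(scores)       # storage mode of the object: "numeric"
stat_mode(scores)  # most frequent value, returned as text: "4"
```

The helper returns a character vector because table() stores the distinct values as names; wrap the result in as.numeric() if a number is needed.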

Basic Syntax

Example code for basic arithmetic operations

https://webr.r-wasm.org/latest/

Lecture 2: Data Types and Structures

Data Types and Structures

  • Data types in R (numeric, character, logical, factor)

  • Data structures in R (vectors, matrices, data frames)

  • Example code for creating vectors, matrices, and data frames

  • Hands-on exercise: Create a data frame from vectors

  • Solutions to hands-on exercise

Data types in R

Numeric: 1, 2, 3, etc.

Character: “hello”, “world”, etc.

Logical: TRUE, FALSE, etc.

Factor: categorical data, e.g. “male”, “female”, etc.

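Data types in detail

  • String (character)

    Use single quotes ' ' or double quotes " "; the escape character is "\"

    read_csv("C:/Users/kyawmoeaung/Downloads/iris.csv") - forward slashes work on Mac, Linux and Windows

    read_csv("C:\\Users\\kyawmoeaung\\Downloads\\iris.csv") - Windows-style backslashes must be doubled, because "\" is the escape character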

  • Numeric

    • double (floating point, decimal)

    • int (integer)

  • Factor

    • binary (yes, no; gender- male, female)

    • ordinal (0 - 100, 101 - 200, 201 - 300, 301 - 400)

    • nominal (e.g. forest types, rice varieties, endangered species)

  • Logical

    • TRUE # must be written in capital letters

    • FALSE

Data structures in R

Vectors: a collection of values of the same type

Matrices: a two-dimensional array of values of the same type

Data frames: a collection of equal-length vectors (columns) that can have different types

Lists: can store objects of any type

Vectors

Vectors can be created with the c() function. The number of elements is given by length() (a scalar is simply a vector of length 1), and the type can be verified with typeof() and class().

  1. numeric vector
  2. integer vector
  3. logical vector
  4. factor
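The four kinds of vector listed above can be created and inspected as follows. This is a minimal sketch with made-up values.

```r
num_vec <- c(1.5, 2, 3.25)                        # numeric (double) vector
int_vec <- c(1L, 2L, 3L)                          # integer vector (note the L suffix)
lgl_vec <- c(TRUE, FALSE, TRUE)                   # logical vector
fct_vec <- factor(c("male", "female", "male"))    # factor (categorical data)

length(num_vec)   # number of elements
typeof(num_vec)   # "double"
typeof(int_vec)   # "integer"
class(lgl_vec)    # "logical"
typeof(fct_vec)   # "integer": factors are stored as integer codes
class(fct_vec)    # "factor"
levels(fct_vec)   # the categories, sorted alphabetically
```

typeof() reports how the values are stored, while class() reports how R treats the object, which is why the two differ for a factor.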

Matrices

Can be created with matrix() function.
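A short sketch of matrix(), using the numbers 1 to 6 as placeholder data:

```r
# By default a matrix is filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)       # rows and columns
m[2, 3]      # element in row 2, column 3
t(m)         # transpose: 3 rows, 2 columns

# byrow = TRUE fills the matrix row by row instead
m_by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_by_row
```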

Data frame

The most commonly used data structure. Can be created with the data.frame() function.
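One possible solution to the hands-on exercise of this lecture (create a data frame from vectors) is sketched below. The column names and values are invented for illustration.

```r
# Three vectors of equal length but different types
species      <- c("setosa", "versicolor", "virginica")   # character
petal_length <- c(1.4, 4.7, 6.0)                          # numeric
is_large     <- petal_length > 4                          # logical

# Combine them column-wise into a data frame
flowers <- data.frame(species, petal_length, is_large)

str(flowers)            # structure: one line per column with its type
flowers$petal_length    # access a single column with $
nrow(flowers)           # number of rows
ncol(flowers)           # number of columns
```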

List

A list can store any type of object, including vectors, matrices and data frames. The list() function is used to create a list.
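A brief sketch of a list holding objects of different kinds; the element names (numbers, mat, df, label) are arbitrary.

```r
my_list <- list(
  numbers = c(1, 2, 3),              # a numeric vector
  mat     = matrix(1:4, nrow = 2),   # a matrix
  df      = head(iris, 3),           # a data frame (first 3 rows of iris)
  label   = "anything"               # a single character value
)

names(my_list)             # names of the elements
length(my_list)            # number of elements
my_list$numbers            # access an element by name with $
my_list[["df"]]$Species    # [[ ]] also extracts an element
```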

Converting from one type to another

Use the as.*() functions.

For vectors

  • as.numeric()

  • as.character()

  • as.factor()

For matrix and Data Frame

  • as.data.frame()

  • as_tibble()

  • as.matrix()

Checking data type/structure with is.*()

  • is.numeric() # checking double or integer type

  • is.vector() # vector data

  • is.matrix() # matrix

  • is.data.frame() # data.frame

  • is_tibble() # tibble

  • is.character() # character

  • is.na() # NA value
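The conversion and checking functions work in pairs: as.*() converts, is.*() tests. A minimal sketch using made-up vectors and the built-in mtcars dataset:

```r
# Character to numeric
chr <- c("1", "2", "3")
num <- as.numeric(chr)
is.character(chr)    # TRUE
is.numeric(num)      # TRUE

# Character to factor and back
grp <- as.factor(c("low", "high", "low"))
is.factor(grp)       # TRUE
as.character(grp)

# Data frame to matrix and back
m  <- as.matrix(mtcars[1:3, c("mpg", "wt")])
is.matrix(m)         # TRUE
df <- as.data.frame(m)
is.data.frame(df)    # TRUE

# Detect missing values element by element
is.na(c(1, NA, 3))
```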

Lecture 3: Importing and Exporting Data

Importing and Exporting Data

  • Importing data into R (e.g. CSV, Excel)

  • Example code for importing data using read.csv() and readxl

  • Exporting data from R (e.g. CSV, Excel)

  • Example code for exporting data using write.csv() and write_xlsx()

  • Hands-on exercise: Import and export a sample dataset

Importing data into R

Import files using RStudio IDE (GUI)

  1. Go to the Environment pane
  2. Click on 'Import Dataset'
  3. Choose Text, Excel, SPSS, Stata or SAS
  4. Choose the file
  5. Click Import

Learn the code while importing

importing data

  • read_csv() # requires the readr package
  • read_tsv() # requires the readr package
  • read_excel() # requires the readxl package
  • read_sav() # requires the haven package
  • read_sas() # requires the haven package
  • read_dta() # requires the haven package

Exporting data

  • write_csv() # require readr
  • write_tsv() # require readr
  • write_xlsx() # require writexl package
  • write_sav() # require haven
  • write_sas() # require haven
  • write_dta() # require haven

Example code for importing data
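The following sketch shows a CSV round trip with the base R functions write.csv() and read.csv() named in the lecture outline. It writes the built-in iris dataset to a temporary file, so the file path is an assumption made only to keep the example self-contained; replace it with your own path.

```r
# A temporary file path; use your own path in practice
csv_path <- tempfile(fileext = ".csv")

# Export: write the data frame to CSV without the row-name column
write.csv(iris, csv_path, row.names = FALSE)

# Import: read the CSV back, turning text columns into factors
iris_back <- read.csv(csv_path, stringsAsFactors = TRUE)

dim(iris_back)    # rows and columns
head(iris_back)   # first six rows

# The readr equivalents would be:
#   readr::write_csv(iris, csv_path)
#   readr::read_csv(csv_path)
```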

Lecture 4: Summary Statistics and Data Visualization

Summary Statistics and Data Visualization

  • Summary statistics in R (mean, median, mode, standard deviation, variance)

  • Example code for calculating summary statistics using summary()

  • Data visualization in R (histograms, box plots)

  • Example code for creating visualizations using ggplot2

  • Hands-on exercise: Create summary statistics and visualizations for a sample dataset

Summary statistics in R

The gapminder dataset contains population, GDP per capita and life expectancy data for countries around the world.

Life expectancy, population and GDP summary

continent   mean_pop   mean_lifeExp   mean_gdp    sd_pop      sd_lifeExp   sd_gdp
Africa       9916003   48.86533        2193.755    15490923    9.150210     9.150210
Americas    24504795   64.65874        7136.110    50979430    9.345088     9.345088
Asia        77038722   60.06490        7902.150   206885205   11.864532    11.864532
Europe      17169765   71.90369       14469.476    20519438    5.433178     5.433178
Oceania      8874672   74.32621       18621.609     6506342    3.795611     3.795611

Note: the sd_gdp column is identical to sd_lifeExp, which indicates it was computed from lifeExp; the standard deviation of GDP per capita should be calculated as sd(gdpPercap).
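A table like the one above can be produced with a grouped summary. The sketch below assumes the dplyr and gapminder packages are installed; the column names follow the table, and sd_gdp is computed from gdpPercap so that it differs from sd_lifeExp.

```r
library(dplyr)       # group_by(), summarise()
library(gapminder)   # the gapminder dataset

continent_summary <- gapminder |>
  group_by(continent) |>
  summarise(
    mean_pop     = mean(pop),
    mean_lifeExp = mean(lifeExp),
    mean_gdp     = mean(gdpPercap),
    sd_pop       = sd(pop),
    sd_lifeExp   = sd(lifeExp),
    sd_gdp       = sd(gdpPercap)   # standard deviation of GDP per capita
  )

continent_summary
```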

Summary statistics in R

summary() calculates summary statistics. It can be applied to a whole dataset or to individual columns that we specify.

summary of gapminder data

country            continent      year           lifeExp         pop                 gdpPercap
Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60   Min.   :6.001e+04   Min.   :   241.2
Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20   1st Qu.:2.794e+06   1st Qu.:  1202.1
Algeria    :  12   Asia    :396   Median :1980   Median :60.71   Median :7.024e+06   Median :  3531.8
Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47   Mean   :2.960e+07   Mean   :  7215.3
Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85   3rd Qu.:1.959e+07   3rd Qu.:  9325.5
Australia  :  12                  Max.   :2007   Max.   :82.60   Max.   :1.319e+09   Max.   :113523.1
(Other)    :1632

Example code for calculating summary statistics

  • summary(gapminder$pop)
  • summary(gapminder$gdpPercap)
  • summary(gapminder$lifeExp)

[1] “Summary of Pop”

Min. 1st Qu. Median Mean 3rd Qu. Max.
60011 2793664 7023596 29601212 19585222 1318683096

[1] “Summary of lifeExp”

Min. 1st Qu. Median Mean 3rd Qu. Max.
23.60 48.20 60.71 59.47 70.85 82.60

[1] “Summary of gdp”

Min. 1st Qu. Median Mean 3rd Qu. Max.
241.2 1202.1 3531.8 7215.3 9325.5 113523.1

Visualization

  • Explore gapminder data

  • library(gapminder) # Load your package

  • library(tidyverse) # load tidyverse

  • library(knitr) # load knitr

  • glimpse(gapminder) # data structure and data type exploration

  • plot(gapminder) # looking into correlated variables, factor and numeric variables

  • ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +

  • add geom_point()

  • Because gdpPercap spans several orders of magnitude while lifeExp does not, we put the x axis on a base-10 logarithmic scale

  • add scale_x_log10()
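Put together, the steps above could look like the sketch below. It assumes the ggplot2 and gapminder packages are installed; the object names p_raw and p_log are arbitrary.

```r
library(ggplot2)     # plotting
library(gapminder)   # the gapminder dataset

# Scatter plot on the original scale: points crowd together at low gdpPercap
p_raw <- ggplot(data = gapminder,
                mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Same plot with the x axis on a base-10 logarithmic scale
p_log <- p_raw + scale_x_log10()

p_raw   # before scaling
p_log   # after scaling
```

Only the x axis is transformed; the y axis (lifeExp) stays on its original scale.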

Gapminder data visualization

Figure: Life expectancy vs. gdpPercap (before scaling)

  • Add scale_x_log10() to put the x axis (gdpPercap) on a base-10 logarithmic scale.

Figure: Life expectancy vs. gdpPercap (after scaling)

Figure: Life expectancy vs. gdpPercap (after scaling and adding color)

Figure: Life expectancy vs. gdpPercap (after scaling and adding color and size)

Figure: Life expectancy vs. gdpPercap (after scaling and adding color, size and shape)

Figure: Life expectancy vs. gdpPercap (after scaling and adding color and size), grouped by continent (1952 and 2007 only)

Figure: Life expectancy vs. gdpPercap (after scaling and adding color, with size mapped to life expectancy), grouped by continent (1952 and 2007 only)

Figure: Life expectancy vs. gdpPercap (after scaling and adding color, with size mapped to population), grouped by continent (1952 and 2007 only)
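The last plot in this sequence could be built roughly as follows. This is a sketch under assumptions: color is mapped to continent, size to population, and the two years are shown as separate panels with facet_wrap(); the original slides may have arranged the panels differently. It requires the ggplot2, dplyr and gapminder packages.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Keep only the first and last survey years
gap_two_years <- gapminder |>
  filter(year %in% c(1952, 2007))

p_bubble <- ggplot(gap_two_years,
                   aes(x = gdpPercap, y = lifeExp,
                       color = continent,   # one color per continent
                       size = pop)) +       # point size reflects population
  geom_point(alpha = 0.6) +                 # semi-transparent points reduce overplotting
  scale_x_log10() +                         # base-10 log scale on the x axis
  facet_wrap(~ year) +                      # one panel per year
  labs(x = "GDP per capita (log10 scale)", y = "Life expectancy")

p_bubble
```

Replacing size = pop with size = lifeExp gives the variant where point size follows life expectancy.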

Lecture 5: Correlation and Regression

Correlation and Regression

  • Correlation analysis in R (Pearson’s r, scatter plots)

  • Example code for calculating correlation using cor()

  • Simple linear regression in R

  • Example code for performing simple linear regression using lm()

  • Hands-on exercise: Perform correlation analysis and simple linear regression on a sample dataset

Correlation analysis in R

Correlation measures the strength and direction of the linear relationship between two variables.

  • The correlation coefficient ranges from -1 to 1.

  • The SPSS Survival Manual by Julie Pallant discusses how to interpret it. A rule of thumb for the absolute value of the coefficient:

    • 1 - perfect correlation
    • 0.7 and above - very strong correlation
    • 0.6 to 0.69 - strong correlation
    • 0.4 to 0.59 - medium correlation
    • 0.2 to 0.39 - weak correlation
    • 0.1 to 0.19 - very weak correlation (negligible)
    • 0 - no correlation

cor(): calculates the correlation between two variables

corrplot(): visualizes the correlations among variables (requires the corrplot package to be installed)

Example code for calculating correlation

cor(data$x, data$y)

cor() takes two arguments, the x and y variables. Each column is accessed by the name of the data frame followed by $ and the column name.

[The correlation between carat and price is: 0.92159130119347]

We can conclude that there is a very strong positive correlation between price and carat in the diamonds dataset.
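The carat and price result can be reproduced with the sketch below. It assumes the ggplot2 package is installed, since that package ships the diamonds dataset.

```r
library(ggplot2)   # provides the diamonds dataset

# Pearson correlation between two columns, accessed with $
r <- cor(diamonds$carat, diamonds$price)
r

# Print the result as a sentence
paste("The correlation between carat and price is:", r)
```

cor() computes Pearson's r by default; other coefficients can be requested with the method argument, for example method = "spearman".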

Correlation with cor() on palmerpenguins

The penguins data contains missing values, and cor() returns NA for any pair of columns that includes them. The output below assumes they were removed first, for example with penguins <- na.omit(penguins).

  • cor(penguins[3:6]) |> knitr::kable()

                    bill_length_mm   bill_depth_mm   flipper_length_mm   body_mass_g
bill_length_mm           1.0000000      -0.2286256           0.6530956     0.5894511
bill_depth_mm           -0.2286256       1.0000000          -0.5777917    -0.4720157
flipper_length_mm        0.6530956      -0.5777917           1.0000000     0.8729789
body_mass_g              0.5894511      -0.4720157           0.8729789     1.0000000

  • round(cor(penguins[3:6]), 2) |> knitr::kable()

                    bill_length_mm   bill_depth_mm   flipper_length_mm   body_mass_g
bill_length_mm                1.00           -0.23                0.65          0.59
bill_depth_mm                -0.23            1.00               -0.58         -0.47
flipper_length_mm             0.65           -0.58                1.00          0.87
body_mass_g                   0.59           -0.47                0.87          1.00

Correlation with corrplot()

The corrplot() function takes a correlation matrix (the output of cor()) and a method.

corrplot(corr, method)

method can be: 1. “circle”, 2. “square”, 3. “ellipse”, 4. “number”, 5. “shade”, 6. “color”, 7. “pie”.

Example code with corrplot()

corrplot() on penguin dataset

corrplot() on iris dataset

Rows: 150

Columns: 5

$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…

$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…

$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…

$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…

$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

corrplot() on iris dataset

               Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
Sepal.Length           1.00         -0.12           0.87          0.82
Sepal.Width           -0.12          1.00          -0.43         -0.37
Petal.Length           0.87         -0.43           1.00          0.96
Petal.Width            0.82         -0.37           0.96          1.00
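The iris correlation matrix and its plot can be produced with the following sketch, assuming the corrplot package is installed. The object name iris_cor is arbitrary.

```r
library(corrplot)   # install.packages("corrplot") if needed

# Correlation matrix of the four numeric columns (Species is dropped)
iris_cor <- cor(iris[, 1:4])
round(iris_cor, 2)

# Visualize the matrix: circles sized and colored by correlation
corrplot(iris_cor, method = "circle")

# Alternative: print the coefficients inside the cells
corrplot(iris_cor, method = "number")
```

corrplot() expects the correlation matrix rather than the raw data, so cor() is called first.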

corrplot() on iris dataset

iris correlation plot

Final exercise for correlation

  • Use the mtcars dataset with the cor() and corrplot() functions

  • Step 1. Look at your data

  • Step 2. Remove NA values using the na.omit() function

  • Step 3. Remove all character and factor columns

  • Step 4. Compute the correlations among the variables

  • Step 5. Show your corrplot()

Linear Regression - Introduction

Linear regression is widely used in research, academia, finance, banking and business to predict the value of one variable from others.

There are many regression models and some of them are:

  • Simple Linear Regression Model

  • Multiple Linear Regression

  • Simple Non-Linear Regression

  • Multiple Non-Linear Regression

Linear Regression formula in R

lm(formula, data)

glm(formula, family, data) # for example, family = binomial for binary data

Linear Model

lm : linear model

formula : y ~ x

data : dataset

Generalized Linear Model

glm : Generalized Linear Model

formula : y ~ x

family : gaussian, binomial or poisson

data : dataset

Linear Model formula
lm(y ~ x, data = data)

Generalized Linear Model formula
glm(y ~ x, family = family, data = data)
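Both calls can be tried on the built-in mtcars dataset. The choice of variables here is an assumption made for illustration: mpg ~ wt for the linear model, and the binary column am (0 = automatic, 1 = manual) as the response of a logistic regression.

```r
# Linear model: fuel economy explained by weight
fit_lm <- lm(mpg ~ wt, data = mtcars)
fit_lm

# Generalized linear model: binary response, so family = binomial
fit_glm <- glm(am ~ wt, family = binomial, data = mtcars)
summary(fit_glm)

# Coefficients of each model
coef(fit_lm)
coef(fit_glm)
```

With family = gaussian, glm() fits the same model as lm(); the family argument is what adapts it to binary or count responses.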

Some definitions

  • coefficients: intercept or slope terms

  • residuals : actual - predicted

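  • R-squared : proportion of the variance in the response explained by the model (model performance)

  • Adjusted R-squared : R-squared adjusted for the number of explanatory variables (model performance)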

  • RSE : Residual Standard Error

  • RMSE : Root Mean Squared Error
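To make these definitions concrete, the sketch below computes each quantity by hand for a simple model on mtcars and can be compared with what summary() reports. The model mpg ~ wt and the variable names are chosen for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Residuals: actual minus predicted
res <- mtcars$mpg - fitted(fit)

n      <- nrow(mtcars)   # number of observations
n_pred <- 1              # number of explanatory variables

ss_res <- sum(res^2)                               # residual sum of squares
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares

r2     <- 1 - ss_res / ss_tot                            # R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - n_pred - 1)      # adjusted R-squared
rse    <- sqrt(ss_res / (n - n_pred - 1))                # residual standard error
rmse   <- sqrt(mean(res^2))                              # root mean squared error

c(r2 = r2, adj_r2 = adj_r2, rse = rse, rmse = rmse)

# The same values as reported by summary()
summary(fit)$r.squared
summary(fit)$adj.r.squared
summary(fit)$sigma
```

RSE divides by the residual degrees of freedom while RMSE divides by n, so RSE is always slightly larger for the same model.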

Steps to analyze Linear Model

  • load packages - tidyverse, modelr, broom, palmerpenguins

  • datasets : mpg, iris, gapminder, penguins, mtcars, fish

  • take a look at the structure of the data and identify the dependent (response) variable and the independent (explanatory) variable

  • explore the data with glimpse(), head(), tail(), summary() and plot() to identify data types and correlation patterns among variables.

  • call the lm() function, add the formula (response ~ explanatory variable) and add the dataset.

  • assign the variable with fit or the name you like

  • run lm() model

  • print with summary() function.

Example code

mtcars data exploration

str(mtcars)

‘data.frame’: 32 obs. of 11 variables:

$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 …

$ cyl : num 6 6 4 6 8 6 8 4 4 6 …

$ disp: num 160 160 108 258 360 …

$ hp : num 110 110 93 110 175 105 245 62 95 123 …

$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 …

$ wt : num 2.62 2.88 2.32 3.21 3.44 …

$ qsec: num 16.5 17 18.6 19.4 17 …

$ vs : num 0 0 1 1 0 1 0 1 1 1 …

$ am : num 1 1 1 0 0 0 0 0 0 0 …

$ gear: num 4 4 4 3 3 3 3 4 4 4 …

$ carb: num 4 4 1 1 2 1 4 2 2 4 …

mtcars data exploration

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars data exploration

tail(mtcars)

                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

mtcars data exploration

summary(mtcars)

      mpg              cyl              disp             hp
 Min.   :10.40    Min.   :4.000    Min.   : 71.1    Min.   : 52.0
 1st Qu.:15.43    1st Qu.:4.000    1st Qu.:120.8    1st Qu.: 96.5
 Median :19.20    Median :6.000    Median :196.3    Median :123.0
 Mean   :20.09    Mean   :6.188    Mean   :230.7    Mean   :146.7
 3rd Qu.:22.80    3rd Qu.:8.000    3rd Qu.:326.0    3rd Qu.:180.0
 Max.   :33.90    Max.   :8.000    Max.   :472.0    Max.   :335.0

      drat             wt               qsec             vs
 Min.   :2.760    Min.   :1.513    Min.   :14.50    Min.   :0.0000
 1st Qu.:3.080    1st Qu.:2.581    1st Qu.:16.89    1st Qu.:0.0000
 Median :3.695    Median :3.325    Median :17.71    Median :0.0000
 Mean   :3.597    Mean   :3.217    Mean   :17.85    Mean   :0.4375
 3rd Qu.:3.920    3rd Qu.:3.610    3rd Qu.:18.90    3rd Qu.:1.0000
 Max.   :4.930    Max.   :5.424    Max.   :22.90    Max.   :1.0000

      am               gear             carb
 Min.   :0.0000   Min.   :3.000    Min.   :1.000
 1st Qu.:0.0000   1st Qu.:3.000    1st Qu.:2.000
 Median :0.0000   Median :4.000    Median :2.000
 Mean   :0.4062   Mean   :3.688    Mean   :2.812
 3rd Qu.:1.0000   3rd Qu.:4.000    3rd Qu.:4.000
 Max.   :1.0000   Max.   :5.000    Max.   :8.000

mtcars data exploration

plot(mtcars)

mpg: miles per (US) gallon

cyl: number of cylinders

disp: engine displacement (cubic inches)

hp: gross horsepower

drat: rear axle ratio

wt: weight (in units of 1,000 lb)

mtcars Simple Linear Regression

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt
     37.285       -5.344

response = Intercept + slope * explanatory variable

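In mtcars, wt is measured in units of 1,000 lb, so each additional 1,000 lb lowers the predicted fuel economy by 5.344 miles per gallon:

A weight of 1,000 lb (wt = 1) corresponds to 37.285 + (-5.344) * 1 = 31.941 miles per gallon.

A weight of 2,000 lb (wt = 2) corresponds to 37.285 + (-5.344) * 2 = 26.597 miles per gallon.

A weight of 3,000 lb (wt = 3) corresponds to 37.285 + (-5.344) * 3 = 21.253 miles per gallon.

A weight of 4,000 lb (wt = 4) corresponds to 37.285 + (-5.344) * 4 = 15.909 miles per gallon.

A weight of 5,000 lb (wt = 5) corresponds to 37.285 + (-5.344) * 5 = 10.565 miles per gallon.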
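Instead of working through the arithmetic by hand, the same predictions can be obtained with predict(). A minimal sketch; the data frame name new_cars is arbitrary, and its column must be called wt to match the formula.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# New weights to predict for, in the same units as mtcars$wt
new_cars <- data.frame(wt = 1:5)

# Predicted miles per gallon for each weight
pred <- predict(fit, newdata = new_cars)
pred

# The same calculation written out: intercept + slope * wt
coef(fit)[["(Intercept)"]] + coef(fit)[["wt"]] * new_cars$wt
```

The results differ from the hand calculation in the third decimal place because predict() uses the unrounded coefficients. The observed weights in mtcars range from 1.513 to 5.424, so the prediction for wt = 1 lies outside the data.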

Interpretation and visualization

Figure: mtcars, nonlinear smooth

Interpretation and visualization (change color to local variable)

Figure: mtcars, nonlinear loess smooth

Interpretation and visualization (change color to local variable and method to 'lm')

Figure: mtcars, linear fit
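The plots in this sequence could be drawn roughly as below, assuming the ggplot2 package is installed. Mapping color to the number of cylinders is an assumption for illustration; the slides do not state which variable was used. Setting color inside geom_point() makes it local to the points, so geom_smooth() draws a single line for all cars.

```r
library(ggplot2)

# Shared base: weight on the x axis, fuel economy on the y axis
p_base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Color is a local aesthetic of the points only; loess gives a nonlinear smooth
p_loess <- p_base +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "loess", formula = y ~ x)

# Same plot with a straight regression line (method = "lm")
p_lm <- p_base +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x)

p_loess
p_lm
```

If color were placed in the top-level aes() instead, geom_smooth() would fit one line per color group.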

Interpreting model performance with broom package

  • install the broom and modelr packages

  • load packages

  • fit the model

  • use augment(), glance(), pull()

  • pull r.squared, adj.r.squared (adjusted R-squared), sigma
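The steps above can be sketched as follows, assuming the broom and dplyr packages are installed. In the output of glance(), the adjusted R-squared column is named adj.r.squared and the residual standard error is named sigma.

```r
library(broom)   # glance(), augment()
library(dplyr)   # pull()

fit <- lm(mpg ~ wt, data = mtcars)

# One-row summary of model-level statistics
glance(fit)

# Extract individual performance measures
r2     <- glance(fit) |> pull(r.squared)
adj_r2 <- glance(fit) |> pull(adj.r.squared)
rse    <- glance(fit) |> pull(sigma)
c(r2 = r2, adj_r2 = adj_r2, rse = rse)

# Observation-level results: fitted values (.fitted) and residuals (.resid)
head(augment(fit))
```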


Linear Model exercise with palmerpenguins dataset

  • load required libraries

  • check the structure and data type of penguins dataset

  • identify two variables (response vs explanatory variables)

  • Remove NA values using the na.omit() function on penguins, overwriting the penguins dataset

  • write formula lm(formula, data) and assign to a variable you like

  • print the fitted variable or the model

  • print summary() of the model

  • Interpret model performance and visualize the data

Lecture 6: Cluster Analysis

Cluster Analysis

  • Cluster analysis in R (k-means, hierarchical clustering)

  • Example code for performing k-means clustering using kmeans()

  • Example code for performing hierarchical clustering using hclust

  • Example code for visualizing clusters using fviz_cluster()

  • Hands-on exercise: Perform k-means clustering and hierarchical clustering on a sample dataset

Cluster analysis in R

  • K-means clustering: partition data into k clusters based on similarity
  • Hierarchical clustering: build a hierarchy of clusters by merging or splitting existing clusters
  • kmeans() requires two arguments: the data (x) and the number of clusters (centers)

Example code for performing k-means and hierarchical clustering
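A minimal sketch of both methods on the built-in iris dataset, using only base R. The choices made here (scaling the variables, 3 clusters, complete linkage, the seed value) are assumptions for illustration. The fviz_cluster() call is shown as a comment because it needs the factoextra package.

```r
set.seed(123)   # k-means starts from random centers; fix the seed for reproducibility

# Use the four numeric columns, scaled to mean 0 and standard deviation 1
iris_scaled <- scale(iris[, 1:4])

# K-means: partition the data into 3 clusters, trying 25 random starts
km <- kmeans(iris_scaled, centers = 3, nstart = 25)
table(km$cluster, iris$Species)   # compare clusters with the known species

# Hierarchical clustering: distance matrix, then agglomerative merging
hc <- hclust(dist(iris_scaled), method = "complete")
plot(hc, labels = FALSE)          # dendrogram
hc_groups <- cutree(hc, k = 3)    # cut the tree into 3 groups
table(hc_groups, iris$Species)

# Optional cluster plot, if factoextra is installed:
# factoextra::fviz_cluster(km, data = iris_scaled)
```

Scaling matters because both methods rely on distances, and an unscaled variable with a large range would otherwise dominate the result.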

Additional Resources

  • R documentation: https://www.r-project.org/
  • R tutorials: https://www.datacamp.com/tutorial/r-tutorial
  • R cheat sheets: https://www.rstudio.com/online-learning/

General Tips

  • Use clear and concise language in your slides
  • Use images, diagrams, and charts to illustrate complex concepts
  • Use code examples to demonstrate how to perform tasks in R
  • Leave space for notes and comments
  • Use a consistent font and color scheme throughout the presentation