Run R and Python together


R vs. Python

Category

R

Python

General Coding

1. Execution blocks or coding conventions

use curly braces {} and parentheses (); R code can be spread over multiple rows

for (i in 1:5){
  print("code block 1")
  if (i > 3) {
    print("code block 2")
  }
  print("code block 1 again")
}
## [1] "code block 1"
## [1] "code block 1 again"
## [1] "code block 1"
## [1] "code block 1 again"
## [1] "code block 1"
## [1] "code block 1 again"
## [1] "code block 1"
## [1] "code block 2"
## [1] "code block 1 again"
## [1] "code block 1"
## [1] "code block 2"
## [1] "code block 1 again"

use indentation and colons; Python operations that span multiple rows must be declared using either \ or ()

for i in range(5):
  print("code block 1")
  if i > 3:
    print("code block 2")
  print("code block 1 again")
## code block 1
## code block 1 again
## code block 1
## code block 1 again
## code block 1
## code block 1 again
## code block 1
## code block 1 again
## code block 1
## code block 2
## code block 1 again
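
The two Python continuation styles (a trailing backslash or enclosing parentheses) can be sketched as follows. The variable names are illustrative only and do not come from the original page.

```python
# Explicit continuation: a trailing backslash joins the next line.
total_backslash = 1 + 2 + \
    3 + 4

# Implicit continuation: anything inside parentheses may span several lines.
total_parens = (
    1 + 2
    + 3 + 4
)

print(total_backslash, total_parens)  # both sums are 10
```

Parentheses are usually the safer choice, since a stray space after a backslash breaks the continuation.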

2. Range of numbers

use a colon : between two numbers to create a series, e.g. 1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

use the range function; it must be converted to a list to display its elements, list(range()), or use np.arange()

## [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
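
As a small illustration of how range behaves, the sketch below shows that it is a lazy sequence that can still be indexed, and how R's 1:10 could be mirrored. The variable names are hypothetical.

```python
r = range(10)
print(r)                      # a lazy range object, not a list

third = r[3]                  # elements can be indexed without converting
as_list = list(r)             # convert to a list to see every element
r_style = list(range(1, 11))  # start at 1, stop before 11: mirrors R's 1:10

print(third, as_list, r_style)
```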

3. Indexing

indexing starts at 1 and ends with the last element in the sequence included

for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

indexing starts at 0 and ends without the last element

for i in range(10):
  print(i)
## 0
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9

4. Functions

assign a function to a variable using the function keyword

func_name = function(arguments){
  return(sum(arguments))
}
func_name(c(1,2,3))
## [1] 6

use the def keyword

def func_name(args):
  return sum(args)

func_name([1,2,3])
## 6

5. Default arguments (parameters) in functions

specify default arguments regardless of their order

f1 = function(x1=3, x2, x3=5){
  return(x1+x2+x3)
}
f1(x2=5)
## [1] 13

default arguments must be after the non-default

def f1(x2, x1=3, x3=5):
  return x1+x2+x3

f1(x2=5)
## 13
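
To make the ordering rule concrete, here is a minimal sketch: keyword arguments can be passed in any order at call time, while a definition that puts a non-default parameter after a default one is rejected. The use of compile() is only a way to demonstrate the error without stopping the script.

```python
def f1(x2, x1=3, x3=5):
    return x1 + x2 + x3

# Keyword arguments can be given in any order when calling.
result = f1(x3=1, x2=5)  # x1 keeps its default of 3

# A non-default parameter after a default one is a syntax error.
try:
    compile("def bad(x1=3, x2): return x1 + x2", "<example>", "exec")
    definition_allowed = True
except SyntaxError:
    definition_allowed = False

print(result, definition_allowed)
```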

6. String concatenation

paste or paste0 or sprintf to concatenate strings and add variables to it

var1 = 5
var2 = 7
paste("The value of variable 1 is:", var1, ", and of variable 2 is:", var2)
## [1] "The value of variable 1 is: 5 , and of variable 2 is: 7"
sprintf("The value of variable 1 is: %d, and of variable 2 is: %d", var1, var2)
## [1] "The value of variable 1 is: 5, and of variable 2 is: 7"

use an f-string: print(f"")

var1 = 5
var2 = 7
print(f"The value of variable 1 is: {var1}, and of variable 2 is: {var2}")
## The value of variable 1 is: 5, and of variable 2 is: 7

7. Boolean

use T or F, or TRUE or FALSE

c(TRUE, FALSE, T, F)
## [1]  TRUE FALSE  TRUE FALSE

True or False

True, False
## (True, False)

8. Empty values

NULL

print_message = function(x=NULL) {
  if(!is.null(x)){
    print(paste("The message is:", x))
  } else {
    print("No message provided")
  }
}
print_message("Hello!")
## [1] "The message is: Hello!"
print_message()
## [1] "No message provided"

None

def print_message(x=None):
  if x is not None:
    print("The message is:", x)
  else:
    print("No message provided")

print_message("Hello!")
## The message is: Hello!
print_message()
## No message provided

9. Dictionary/ Hash-Table

no built-in dictionary, but most objects can be named (e.g. a list can be used). If the elements are all the same type, a regular vector can be used

dict1 = list(
  name = "John",
  age = 25,
  score = 8.9
)
dict1$age
## [1] 25
key = "age"
dict1[[key]]
## [1] 25
dict2 = c(45,30,55)
names(dict2) = c("key1","key2","key3")

key = "key2"
dict2[[key]]
## [1] 30

built-in dictionary: instantiate the dictionary, then access a certain element by key

dict1 = {
  "name": "John",
  "age": 25,
  "score": 8.9
}

key = "age"
dict1[key]
## 25
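
A few other common dictionary operations, sketched with the same dict1. The "height" and "city" keys and the "Oslo" value are hypothetical additions for illustration.

```python
dict1 = {"name": "John", "age": 25, "score": 8.9}

has_age = "age" in dict1        # membership test on keys
height = dict1.get("height")    # .get returns None instead of raising KeyError
dict1["city"] = "Oslo"          # assigning to a new key adds an entry
keys = list(dict1)              # keys keep their insertion order

print(has_age, height, keys)
```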

10. Loading dependencies

library or require: automatically loads all of the package's functions into the namespace; if some functions were masked, the originals can still be accessed with ::

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dplyr) # masks the stats filter function
# filter(df, Age>25)

x = 1:100
stats::filter(x, c(1,1,1)) # use the stats filter function
## Time Series:
## Start = 1 
## End = 100 
## Frequency = 1 
##   [1]  NA   6   9  12  15  18  21  24  27  30  33  36  39  42  45  48  51  54
##  [19]  57  60  63  66  69  72  75  78  81  84  87  90  93  96  99 102 105 108
##  [37] 111 114 117 120 123 126 129 132 135 138 141 144 147 150 153 156 159 162
##  [55] 165 168 171 174 177 180 183 186 189 192 195 198 201 204 207 210 213 216
##  [73] 219 222 225 228 231 234 237 240 243 246 249 252 255 258 261 264 267 270
##  [91] 273 276 279 282 285 288 291 294 297  NA

import: allows access to commands through the module name and “.”, e.g. numpy.arange; alias the library with import numpy as np; import specific functions with from numpy import arange; or import all functions with from numpy import *

import numpy
numpy.arange(10)
## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
import numpy as np
np.arange(10)
## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
from numpy import arange
arange(10)
## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

from numpy import *

11. Interacting with other code files

# source("utils.R")
# result <- my_function(10)
# print(result)
# from utils import *
# result = my_function(10)
# result

Data Handling

12. Vector, Matrix, Array

# vector
vec1 = c(1,2,3)
vec2 = c(4,5,6)
vec1 + vec2
## [1] 5 7 9
# matrix
mat = matrix(1:9, ncol=3, nrow=3)
mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# multi-dimensional arrays
arr = array(1:24, dim=c(3,4,2))
arr
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24
# vector
import numpy as np
vec = np.array([1,2,3,4,5])
result = vec * 2
result
## array([ 2,  4,  6,  8, 10])
# matrix
mat = np.array([[1,2,3],[4,5,6]])
mat
## array([[1, 2, 3],
##        [4, 5, 6]])
# array
np.arange(24).reshape(2,3,4)
## array([[[ 0,  1,  2,  3],
##         [ 4,  5,  6,  7],
##         [ 8,  9, 10, 11]],
## 
##        [[12, 13, 14, 15],
##         [16, 17, 18, 19],
##         [20, 21, 22, 23]]])

13. Matrix multiplication

mat %*% mat
##      [,1] [,2] [,3]
## [1,]   30   66  102
## [2,]   36   81  126
## [3,]   42   96  150
mat = np.arange(1,10).reshape(3,3)
mat @ mat
## array([[ 30,  36,  42],
##        [ 66,  81,  96],
##        [102, 126, 150]])

14. Matrix transpose

t(mat)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
mat.T
## array([[1, 4, 7],
##        [2, 5, 8],
##        [3, 6, 9]])

15. Reshape vector to matrix

matrix(1:6, nrow=2) # column-wise by default
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
matrix(1:6, nrow=2, byrow=T) # to switch to row-wise
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
np.arange(6).reshape(2,3) # row-wise by default
## array([[0, 1, 2],
##        [3, 4, 5]])
np.arange(6).reshape(2,3,order='F') # F for Fortran (col-wise)
## array([[0, 2, 4],
##        [1, 3, 5]])

16. Broadcasting

refers to using arithmetic operations with arrays of different dimensions, for example adding a vector to a matrix.

mat = matrix(1:6, nrow=2)
vec = c(10,20,30)
mat + vec # column-wise, not what you'd expect coming from python
##      [,1] [,2] [,3]
## [1,]   11   33   25
## [2,]   22   14   36
t(t(mat) + vec) # row-wise
##      [,1] [,2] [,3]
## [1,]   11   23   35
## [2,]   12   24   36
sweep(mat, 2, vec, "+") # or using sweep
##      [,1] [,2] [,3]
## [1,]   11   23   35
## [2,]   12   24   36

done row-wise

mat = np.ones((3,3)) # create a 3x3 matrix of 1's
vec = np.array([10,20,30])
mat + vec # row-wise
## array([[11., 21., 31.],
##        [11., 21., 31.],
##        [11., 21., 31.]])
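
For comparison with the R example, the sketch below applies NumPy broadcasting to the same 2x3 shape in both directions. The col_vec values are hypothetical; np.newaxis is used to turn a 1-D vector into a column so that it is added along rows instead.

```python
import numpy as np

mat = np.arange(1, 7).reshape(2, 3)      # [[1, 2, 3], [4, 5, 6]]
row_vec = np.array([10, 20, 30])         # one value per column
col_vec = np.array([100, 200])           # one value per row (hypothetical)

row_wise = mat + row_vec                 # vector is added to every row
col_wise = mat + col_vec[:, np.newaxis]  # shape (2, 1): added to every column

print(row_wise)
print(col_wise)
```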

17. Data frame

has built-in data frame capabilities, preferably use dplyr for handling the data

df = data.frame(
  Name = c("Alice","Bob","Charlie"),
  Age = c(25,30,22),
  Score = c(95,80,75)
)
df
library(dplyr)
df2 = df %>%
  mutate(Score_Doubled = Score * 2) %>% # create a new column/variable
  filter(Age > 23)                      # filtering rows
df2

doesn’t have built-in data frame capabilities; the most widely used data frame library is pandas

import pandas as pd
data = {
  'Name': ["Alice","Bob","Charlie"],
  'Age': [25,30,22],
  'Score': [95,80,75]
}
df = pd.DataFrame(data)
print(df)
##       Name  Age  Score
## 0    Alice   25     95
## 1      Bob   30     80
## 2  Charlie   22     75
df['Score_Doubled'] = df['Score']*2 # creating a new column/variable
df2 = df['Age'] > 23              # boolean mask of rows with Age > 23; df[df2] returns the filtered rows
df2
## 0     True
## 1     True
## 2    False
## Name: Age, dtype: bool

# Note: pandas could not be loaded on the Quarto platform

18. Accessing a column in a data frame

$ or specifying the column name in []

df$Age
## [1] 25 30 22
df["Age"]

. or specify a column name in []

df.Age
## 0    25
## 1    30
## 2    22
## Name: Age, dtype: int64
df['Age']
## 0    25
## 1    30
## 2    22
## Name: Age, dtype: int64

19. Stacking operations/ method chaining

using the dplyr pipe operator %>% (Ctrl+Shift+M)

result = df %>%
  filter(Age > 22) %>% 
  mutate(Score_Doubled = Score*2)
result

pandas, chaining methods with .

result = (
  df
  .query('Age > 22')
  .assign(Score_Doubled = lambda x: x['Score']*2)
  )
result
##     Name  Age  Score  Score_Doubled
## 0  Alice   25     95            190
## 1    Bob   30     80            160

20. Filtering rows

using vectorized boolean expressions or the dplyr filter command

filtered_df = df[df$Age >22 & df$Score >90,]
filtered_df
filtered_df = df %>% 
  filter(Age >22 & Score >90)
filtered_df
filtered_df = df[(df['Age'] >22) & (df['Score'] >90)]
filtered_df
##     Name  Age  Score  Score_Doubled
## 0  Alice   25     95            190
filtered_df = df.query('Age > 22 and Score >90')
filtered_df
##     Name  Age  Score  Score_Doubled
## 0  Alice   25     95            190

21. Subsetting/ Slicing

mat = matrix(1:16, ncol=4)

# arrays
mat[,2:4] # an empty row index selects all rows
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
## [4,]    8   12   16
# data frame: subsetting works as for arrays; R can also name the rows of a data frame (set labels as the index, then slice by them as in Python pandas)
df[,1:2]
rownames(df)=c("row1","row2","row3") 
df[c("row1","row3"),"Age"]
## [1] 25 22
mat = np.arange(16).reshape(4,4)

# arrays
mat[:,2:4] # an empty slice (:) selects all rows
## array([[ 2,  3],
##        [ 6,  7],
##        [10, 11],
##        [14, 15]])
# data frame, using `.loc` label-based indexing
df.loc['Alice':'Charlie','Age'] # empty here: df's index holds the integers 0-2, not the names; set it first with df.set_index('Name')
## Series([], Name: Age, dtype: int64)
# Note that, unlike normal Python slicing, .loc includes the last element of the slice
df.iloc[0:2,0] # or `.iloc` position-based indexing (normal Python behavior: returns only the 0th and 1st rows)
## 0    Alice
## 1      Bob
## Name: Name, dtype: object
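
Label-based slicing only matches when the labels are in the index. The sketch below rebuilds the same small data frame and sets Name as the index first; by_name is a hypothetical variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "Score": [95, 80, 75],
})

by_name = df.set_index("Name")  # names become the row labels

# .loc slices by label and includes the end label.
label_slice = by_name.loc["Alice":"Charlie", "Age"]

# .iloc slices by position and excludes the end position.
position_slice = by_name.iloc[0:2, 0]

print(label_slice)
print(position_slice)
```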

22. Plotting

using base-R or ggplot

set.seed(27)
x = runif(50)
y = 2*x + 0.1*rnorm(50)
plot(x,y)

library(ggplot2)
z = c(rep(1,25), rep(2,25))
df = data.frame(x,y,z)
ggplot(df, aes(x=x, y=y, color=z)) +
  geom_point()

ggplot() +
  geom_point(aes(x=x, y=y, color=z))

using matplotlib; plotly is a JavaScript-based library which can be used in both Python and R

import matplotlib.pyplot as plt

# generate data
np.random.seed(27)
x = np.random.rand(50)
y = 2*x + 0.1*np.random.randn(50)
z = np.r_[np.ones(25), np.ones(25)*2]

plt.scatter(x, y, c=z) # c for color

23. Setting random seeds

built-in random number generator

set.seed(27)
runif(1) # will give the same number each time
## [1] 0.9717502
runif(1) # will give a different number each time
## [1] 0.08375751

set the seed on the component that generates the numbers: with import random, use random.seed(42) then random.random(); with import numpy as np, use np.random.seed(42) then np.random.random()

import random
random.seed(27)
random.random() # running this cell will always output the same No.
## 0.6484972199788831
random.random() # will give a new number each time
## 0.7013695409686748
import numpy as np
np.random.seed(27) # set the seed on the right component that generates the random No. Otherwise it won't work
np.random.random()
## 0.4257214105188958
np.random.random()
## 0.8145837404945526
# Caution!!
random.seed(27) # setting the seed on the random library
np.random.random() # calling the numpy library random No. generator; result will be different each time
## 0.7353972901996796
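
One way to avoid seeding the wrong component is to carry the generator around as an object. The sketch below uses NumPy's default_rng, which is not used in the original page and is shown here only as an alternative to the global seed.

```python
import numpy as np

# Each generator owns its own seeded state, independent of np.random.seed
# and of the standard-library random module.
rng_a = np.random.default_rng(27)
rng_b = np.random.default_rng(27)

first = rng_a.random()
second = rng_b.random()   # same seed, same position: same number
third = rng_a.random()    # next number in rng_a's sequence

print(first == second, first == third)
```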

24. Length/ shape/ dim

length for vector, list; dim() for matrix, dataframe

vec = c(1,2,3)
length(vec)
## [1] 3
mat = matrix(1:16, ncol=4)
dim(mat)
## [1] 4 4
dim(df)
## [1] 50  3

len for lists; .shape in numpy for vectors and matrices; .shape in pandas for data frames

arr = [1,2,3]
len(arr)
## 3
vec = np.array([4,5,6])
vec.shape
## (3,)
mat = np.ones((3,3))
mat.shape
## (3, 3)
df.shape
## (3, 4)

25. Missing values

using NA or NaN

is.na(NaN)
## [1] TRUE

using np.nan

arr = np.array([1,2,np.nan,4,5])

np.isnan(arr)
## array([False, False,  True, False, False])
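
Two practical consequences of np.nan, sketched on the same array: NaN propagates through ordinary aggregations and never compares equal to itself, so np.isnan or NaN-aware functions such as np.nanmean are needed.

```python
import math
import numpy as np

arr = np.array([1, 2, np.nan, 4, 5])

plain_mean = arr.mean()        # NaN propagates into the result
nan_mean = np.nanmean(arr)     # ignores NaN: (1 + 2 + 4 + 5) / 4
nan_equal = np.nan == np.nan   # NaN is not equal to itself

print(plain_mean, nan_mean, nan_equal)
```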

26. Exponentiation

caret ^, double asterisk **

3^3
## [1] 27
3**3
## [1] 27

only **, as ^ is bit-wise xor

3**3
## 27
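
A short check of the difference between the two operators in Python; the second XOR example is added for illustration.

```python
power = 3 ** 3   # exponentiation: 27
xor_same = 3 ^ 3  # bit-wise XOR of a number with itself is 0
xor_diff = 5 ^ 3  # 0b101 XOR 0b011 = 0b110, i.e. 6

print(power, xor_same, xor_diff)
```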

Run chart

Four rules are used to help interpret run charts by detecting non-random patterns (i.e. signals) in the data.

  1. A run of 6 or more consecutive points above or below the centre line (a shift).
  2. A trend of 5 or more consecutive points all going up or down.
  3. Too few or too many runs
  • A run is a series of points in a row on one side of the centre line (median). Too few or too many runs is a signal of a non-random pattern.
  • To determine the number of runs count the number of times the line connecting the data points crosses the centre line and add 1.
  • Consult Table to determine whether the number of runs is within the expected range, or whether there are too few or too many (indicating a non-random pattern).
  4. An astronomical data point (an obviously different value that everyone would agree is highly unusual).
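
The counting behind the shift, trend and number-of-runs rules can be sketched as below. This is a minimal illustration, not a validated implementation: the function name is hypothetical, and skipping points that fall exactly on the median is an assumed convention. The expected range of runs still has to be looked up in the table.

```python
from statistics import median

def run_chart_counts(values):
    """Return (number of runs, longest run, longest trend) for a series."""
    centre = median(values)
    # Side of the centre line per point; points exactly on it are skipped
    # (assumed convention).
    sides = [v > centre for v in values if v != centre]

    runs = 1 if sides else 0
    longest_run = current = 1 if sides else 0
    for prev, cur in zip(sides, sides[1:]):
        if cur == prev:
            current += 1
        else:
            runs += 1      # the line crossed the centre line
            current = 1
        longest_run = max(longest_run, current)

    # Longest sequence of points all going up or all going down.
    longest_trend = trend = 1
    direction = 0
    for prev, cur in zip(values, values[1:]):
        step = (cur > prev) - (cur < prev)   # +1 up, -1 down, 0 flat
        if step != 0 and step == direction:
            trend += 1
        elif step != 0:
            trend = 2
        else:
            trend = 1
        direction = step
        longest_trend = max(longest_trend, trend)

    return runs, longest_run, longest_trend

runs, longest_run, longest_trend = run_chart_counts([5, 6, 7, 8, 9, 10, 1, 2, 3, 4])
print(runs, longest_run >= 6, longest_trend >= 5)  # runs, shift signal, trend signal
```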

Control chart

Five rules for identifying special cause in control charts to understand whether improvement is occurring:

  1. A single point outside the control limits.
  2. Eight or more consecutive points above or below the centre line.
  3. Six consecutive points increasing (trend up) or decreasing (trend down).
  4. Two out of three consecutive points near a control limit (outer one-third).
  5. Fifteen consecutive points close to the centre line (inner one-third).
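
Rules 1, 4 and 5 depend on zones between the centre line and the control limits, which can be sketched as follows. The function name is hypothetical, the centre line and limits are taken as given inputs, and requiring the two "outer third" points to be on the same side of the centre line is an assumption.

```python
def control_chart_signals(values, centre, lcl, ucl):
    """Check rules 1, 4 and 5 against precomputed control limits."""
    upper_outer = centre + 2 * (ucl - centre) / 3  # start of upper outer third
    lower_outer = centre - 2 * (centre - lcl) / 3  # start of lower outer third
    inner_hi = centre + (ucl - centre) / 3         # inner third bounds
    inner_lo = centre - (centre - lcl) / 3

    # Rule 1: any single point outside the control limits.
    rule1 = any(v > ucl or v < lcl for v in values)

    # Rule 4: two of three consecutive points in the same outer third
    # (points beyond the limit are counted as well).
    rule4 = False
    for i in range(len(values) - 2):
        window = values[i:i + 3]
        high = sum(v >= upper_outer for v in window)
        low = sum(v <= lower_outer for v in window)
        if high >= 2 or low >= 2:
            rule4 = True

    # Rule 5: fifteen consecutive points within the inner third.
    rule5 = False
    streak = 0
    for v in values:
        streak = streak + 1 if inner_lo <= v <= inner_hi else 0
        if streak >= 15:
            rule5 = True

    return {"rule1": rule1, "rule4": rule4, "rule5": rule5}

# Hypothetical data with centre line 10 and limits 4 and 16.
signals = control_chart_signals([10, 11, 14.5, 9, 15, 10, 17], 10, 4, 16)
print(signals)
```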

Numerical data

  • Continuous data is numerical data that can be measured
  • Discrete data is represented by exact values that result from counting

Line charts

  • to illustrate trends over time, line charts connect data points to reveal patterns and fluctuations

Histogram

  • to use for continuous data and distribution analysis
  • to show the frequency distribution of values in a dataset.
  • to be drawn with a chosen bin width and number of bins, as contiguous bars representing intervals (bins) that span from one number to another
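
How bins turn values into bar heights can be sketched with np.histogram. The data values and the choice of four bins are hypothetical.

```python
import numpy as np

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # hypothetical measurements

# Four equal-width bins spanning the data range 1 to 9.
counts, edges = np.histogram(data, bins=4)

print(edges)   # bin boundaries: each bar spans two neighbouring edges
print(counts)  # number of values falling in each bin
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.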

Boxplot

  • to describe statistical summaries such as median, quartiles, and outliers, aiding in data distribution analysis
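
The summaries a boxplot draws can be computed directly. In this sketch the data are hypothetical, and flagging outliers beyond 1.5 times the interquartile range is the common Tukey convention, assumed here rather than taken from the original page.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])   # hypothetical values

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, med, q3, outliers)
```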

Scatterplot

  • to showcase relationships between two continuous variables, identifying patterns and outliers through scattered data points.
  • Regression analysis can be visualised with scatterplots with regression lines, forest plots, heatmaps and coefficient charts

Categorical data

  • Categorical data are data forms that are in categories and describe characteristics, or qualities, of a category

Bar chart

  • to use for categorical data and group comparisons
  • to compare values across different categories

Pie chart

  • to display parts of a whole, pie charts offer a simple way to understand proportions and percentages

Pareto chart

  • is a chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line.
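
The two series of a Pareto chart (descending bars and the cumulative line) can be derived as in this sketch. The category names and counts are hypothetical.

```python
from itertools import accumulate

# Hypothetical counts per category.
counts = {"damaged": 25, "late delivery": 40, "other": 15, "wrong item": 20}

# Bars: categories sorted by value in descending order.
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Line: running total expressed as a percentage of the grand total.
total = sum(counts.values())
cumulative_pct = [100 * c / total for c in accumulate(v for _, v in ordered)]

for (name, value), pct in zip(ordered, cumulative_pct):
    print(name, value, pct)
```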

Statistical Tests

Machine Learning