Hope this cheatsheet helps you with your learning in ST1131!🌟
Please reach out to me if you have any questions.
Reserved letters: Don’t use these letters as your variable names!
Clear Workspace
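A common way to clear the workspace, shown as a minimal sketch (the object x is hypothetical and only there so that something can be removed):

```r
x = 5            #hypothetical object, just so there is something to remove
rm(list = ls())  #removes every object in the current workspace
ls()             #character(0), i.e. the workspace is now empty
```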
Logical Expressions
== #Equality
!= #Inequality
< #Less than
> #Greater than
<= #Less than or equal to
>= #Greater than or equal to
y & x #AND
y | x #OR
!y #NOT
Vector, matrix, dataframe, list
vector= c(1,2,3)
mat = matrix(1:25, 5, 5) #matrix(data, nrow, ncol, byrow = FALSE); set byrow = TRUE to fill row by row
df = data.frame(A = c("apr", "may","jul"),
B = c(1,2,3))
list = list("Cat", c(13,3,31), FALSE, 51.11, 23.4)
Seq, rep and sample:
#repeat a sequence
x = c(1,2,3)
rep(x, times = 3) #1 2 3 1 2 3 1 2 3
rep(x, times = c(1,2,3)) #1 2 2 3 3 3
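The heading also mentions seq and sample, so here is a short sketch of both. The seed value 1 is an arbitrary choice so that the random draws can be repeated, and the names s and s2 are only for illustration.

```r
#generate a sequence
seq(1, 10, by = 2)          #1 3 5 7 9
seq(0, 1, length.out = 5)   #0.00 0.25 0.50 0.75 1.00

#draw a random sample
set.seed(1)                 #arbitrary seed so the draws are repeatable
s = sample(1:10, size = 3)  #3 values from 1 to 10, without replacement
s2 = sample(c("H", "T"), size = 5, replace = TRUE) #5 coin flips, with replacement
```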
Loop
# for loop
x = c(1, 2, 3, 4, 5)
for (i in x) {
print(i)
}
# while loop
i = 1
while (i <= 5) {
print(i)
i = i + 1
}
#read in df
df = read.csv(filename, sep=",", header=TRUE, col.names=names) #names=c("colname1","colname2")
df = read.csv(filename) #sep="," and header=TRUE are the defaults
#df inspection
class(df) #"data.frame"
names(df) #column names
colnames(df) #column names
dim(df) #number of rows/columns for matrix & df only
length(df) #number of columns
nrow(df) #number of rows
ncol(df) #number of columns
head(df)
str(df) #internal structure of df
#select a column
df[,"Height"]
df$Height
#or
#access the column directly using its name
attach(df)
Height
#rename 1 or more columns
names(df) = c("new_name1", "new_name2")
attach(df)
#or
colnames(df)[1] = "new_name1"
#extract rows based on condition
df[df$Height > 150 & df$Weight < 50, ] #remember the comma!!
data[Gender == "M" & HW == "A",] #column names can be used directly after attach(data)
df[order(df$col), ] #arrange rows based on values in column
df[order(df$col, decreasing = TRUE), ] #for reverse (descending) order
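As a small illustration of row extraction and sorting, here is a sketch on a made-up data frame. The column names and values are hypothetical.

```r
#hypothetical data, for illustration only
df = data.frame(Name = c("A", "B", "C", "D"),
                Height = c(160, 148, 155, 170),
                Weight = c(48, 45, 60, 49))

tall_light = df[df$Height > 150 & df$Weight < 50, ]  #keeps rows A and D
asc = df[order(df$Height), ]                         #shortest to tallest
desc = df[order(df$Height, decreasing = TRUE), ]     #tallest to shortest
```

Note that the condition is built from df$Height and df$Weight; writing the column names in quotes would compare text with a number instead of the column values.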
df = df[, c(5, 4, 1, 2, 3)] #reorder columns by index or name
# Measures of Spread and Variability
sd()
var()
range(), max(), min()
IQR()
quantile(df$col, c(0.25, 0.75)) # Quartiles
What to report?
barplot(data,
xlab = "",
ylab = "",
main = "",
beside = TRUE, #bars are beside each other. Otherwise, they're stacked
ylim = c(0,70), #the range of the y-axis. same for xlim = c()
col = 5) #colour
What to report?
Outliers
boxplot(df$col)$out #gives values of outliers
out = which(df$col %in% boxplot(df$col)$out) #gives index of outliers
#or, without drawing the plot
out = which(df$col %in% boxplot.stats(df$col)$out)
new_df = df[!(df$col %in% boxplot(df$col)$out), ] #create new df without outliers
Groups
grp = boxplot(age ~ cancer)$group #group number of each outlier
which(grp == 1)
boxplot(age ~ cancer)$out[which(grp == 1)] #gives values of outliers in grp 1
What to report?
hist(df$col,
freq = FALSE, #to plot density instead of counts (same as probability = TRUE)
breaks = 10) #affects the number of bars (bins)
What to report?
cor(x, y) #to confirm the strength and direction of the linear association
abline(h=1) #to plot a horizontal line at y=1
abline(v=1) #to plot a vertical line at x=1
abline(lm(y~x, data = df)) #to plot a best-fit line
For confidence level 0.95
CI = c(p - qnorm(0.975) * sqrt(p * (1 - p) / n), #p = sample proportion, n = sample size
p + qnorm(0.975) * sqrt(p * (1 - p) / n))
Overall:
Distributions:
Tests:
Relevant Rcodes:
# Generate 5 random numbers from a Binomial distribution with 10 trials and 0.5 probability of success
rbinom(5, 10, 0.5)
# Calculate the cumulative probability P(X <= 0.1), which equals P(X = 0), in 10 trials with a 0.5 probability of success
pbinom(0.1, 10, 0.5)
# Find the quantile for a 0.5 probability in 10 trials with a 0.5 probability of success
qbinom(0.5, 10, 0.5)
# Generate 5 random numbers from a Normal distribution with a mean of 10 and standard deviation of 2
rnorm(5, mean = 10, sd = 2)
# Calculate cumulative probability for a value of 12 in a Normal distribution with a mean of 10 and standard deviation of 2
pnorm(12, mean = 10, sd = 2)
# Find the quantile for a probability of 0.25 in a Normal distribution with a mean of 10 and standard deviation of 2
qnorm(0.25, mean = 10, sd = 2)
Assume that the test value (hypothesised mean) is mu = 500
t.test(data, mu = 500) #two.sided (default)
t.test(data, mu = 500, alternative = "greater") #one sided
t.test(data, mu = 500, alternative = "less") #one sided
if variance_test$p.value <= 0.05 (variances differ: use var.equal = FALSE)
if variance_test$p.value > 0.05 (equal variances: use var.equal = TRUE)
Assumptions:
var.test()
Degrees of freedom:
Conclusion:
Get test statistic and p-value from
summary(M1):
Same as above, except that var.equal = FALSE
Test is similar to one-sample t-test.
Conclusion:
t.test(sampleBefore, sampleAfter, alternative = "two.sided", paired = TRUE, conf.level = 0.95) #alternative can also be "less" or "greater"
Calculation for test statistic (t.score):
sample.mean = mean(x)
sample.sd = sd(x)
n = length(x)
t.score = (sample.mean - 500)/(sample.sd / sqrt(n)) #500 is the test value. It will be given by the question.
To build a linear regression model, we need to include significant regressors and ensure that the model as a whole is significant, so that the regressors have a true impact on the prediction of the response variable. We also need to make sure that the assumptions are met; in other words, check that the model is adequate. Then, check the goodness of fit of the model.
Table of Contents:
lm()
predict()
summary(M1)
Assuming you have a linear regression model named M1
99% CI of mean of y at x=x0 predict() -> generic function for any model
99% CI of mean of y at x=x0 predict.lm() -> specifically for linear models created using the lm()
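A minimal sketch of getting the 99% CI of the mean of y at x = x0 with predict(). The data are simulated and x0 = 5 is an arbitrary assumption; interval = "confidence" and level = 0.99 are the standard arguments of predict() for models created with lm().

```r
set.seed(1)                      #arbitrary seed, simulated data for illustration
df = data.frame(x = 1:20)
df$y = 2 + 3 * df$x + rnorm(20)
M1 = lm(y ~ x, data = df)

new = data.frame(x = 5)          #x0 = 5; the column name must match the regressor name
ci = predict(M1, newdata = new, interval = "confidence", level = 0.99)
ci                               #columns: fit (estimate), lwr and upr (CI limits)
```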
Information to extract from summary_model:
#R-squared and Adjusted R-squared
summary_model$r.squared # Extract the R-squared value
summary_model$adj.r.squared
#Residuals
summary_model$residuals #the differences between the observed and predicted values.
#Degrees of freedom for the model
summary_model$df
#Information about the individual predictors
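summary_model$coefficients #table of estimates, standard errors, t values and p-values
summary_model$coefficients[, "Std. Error"]
summary_model$coefficients[, "t value"]
summary_model$coefficients[, "Pr(>|t|)"]
confint(M1, level=0.95) #Confidence Intervals for Model Parameters
Purpose: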
Assumptions:
Hypothesis:
Conclusion:
Get test statistic and p-value
from summary(M1):
Purpose:
Hypothesis:
Note:
In the simple model, the t-test for the significance of the slope and the F-test for the significance of the model have the same p-value, because the simple model has only one regressor.
Conclusion:
Get test statistic and p-value
from summary(M1):
Assumptions:
Scatterplot of Y~X
Suggested fixes:
Get residuals:
M1 = lm(y~x, data=df)
raw.res = M1$res # lists all the raw residuals of model M1
SR = rstandard(M1) # lists all the standardized residuals of model M1
Residual plots:
hist()
qqnorm() and qqline() (QQ plot)
shapiro.test()
1. Scatterplot for Y~X
Suggested fixes:
plot() again to see if the linear regression
assumptions are now met.
2. Scatterplot for plot(SR~Y) and
plot(SR~X)
We expect to see the points scatter randomly about 0, within the
interval (-3, 3).
If there’s a funnel shape, constant variance assumption
is violated.
Optional: to make the abline dashed and slightly transparent
abline(h =3, col = adjustcolor("black", alpha = 0.5), lty = 2)
abline(h =-3, col = adjustcolor("black", alpha = 0.5), lty = 2)
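A minimal sketch of the standardized residual plot with the dashed reference lines, on simulated data. Plotting SR against the fitted values is an assumption here about what plot(SR~Y) refers to, and the variable names are only for illustration.

```r
set.seed(1)                      #arbitrary seed, simulated data for illustration
x = runif(50, 0, 10)
y = 1 + 2 * x + rnorm(50)
M1 = lm(y ~ x)
SR = rstandard(M1)               #standardized residuals

#assumption: SR is plotted against the fitted values of M1
plot(M1$fitted.values, SR,
     xlab = "Fitted values", ylab = "Standardized residuals",
     ylim = range(c(-3.5, 3.5, SR)))   #make sure the lines at -3 and 3 are visible
abline(h = 0)
abline(h = 3, col = adjustcolor("black", alpha.f = 0.5), lty = 2)
abline(h = -3, col = adjustcolor("black", alpha.f = 0.5), lty = 2)

outliers = which(SR < -3 | SR > 3)     #index of points outside (-3, 3)
```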
# Residuals
raw.res <- residuals(M1) #or raw.res = M1$res
SR <- rstandard(M1)
# Check for outliers (values outside of -3 to 3)
outliers <- which(SR < -3 | SR > 3)
# Check for influential points using Cook's distance
cooksd <- cooks.distance(M1)
which(cooksd>1) # index of influential points
R2:
Get R2 and adjusted R2 from:
M1 = lm(y~x, data=df)
summary_model = summary(M1)
#R-squared and Adjusted R-squared
summary_model$r.squared
summary_model$adj.r.squared
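Putting the extraction steps together on simulated data, as a sketch. The data and the names r2, adj_r2, coef_table, slope_p, f and f_p are made up for illustration.

```r
set.seed(1)                      #arbitrary seed, simulated data for illustration
df = data.frame(x = 1:30)
df$y = 5 + 2 * df$x + rnorm(30, sd = 10)

M1 = lm(y ~ x, data = df)
summary_model = summary(M1)

r2 = summary_model$r.squared
adj_r2 = summary_model$adj.r.squared
coef_table = summary_model$coefficients     #columns: Estimate, Std. Error, t value, Pr(>|t|)
slope_p = coef_table["x", "Pr(>|t|)"]       #p-value of the t-test for the slope
f = summary_model$fstatistic                #F value, numerator df, denominator df
f_p = pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE) #p-value of the F-test
```

In this simple model slope_p and f_p agree, which matches the note that the t-test for the slope and the F-test for the model give the same p-value.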
This is the end!
All the best to your practicals and finals!
Do reach out to me (tele: @sekyichin06) if
you have any questions!!
😊✨