These notes detail several statistical tests and regression techniques I've found useful over time, drawing on the following sources:
R Tutorial, http://www.r-tutor.com/content/r-tutorial-ebook
Machine Learning A-Z, Kirill Eremenko and Hadelin de Ponteves
R-bloggers, Sammy Ngugi, https://www.r-bloggers.com/two-way-anova-in-r-exercises/
Analysis | # | Independent Vars | # | Dependent Vars
---|---|---|---|---
Chi-square | 1 | Categorical | 1 | Categorical
t-Test | 1 | Dichotomous | 1 | Continuous
ANOVA | 1+ | Categorical | 1 | Continuous
MANOVA | 1+ | Categorical | 2+ | Continuous
Correlation | 1 | Dichotomous or Continuous | 1 | Continuous
Multiple Regression | 2+ | Dichotomous or Continuous | 1 | Continuous
Logistic Regression | 1+ | Categorical or Continuous | 1 | Dichotomous
The chi-squared test checks whether two categorical variables are independent. In the MASS survey data, Smoke records each student's smoking habit and Exer denotes the student's exercise level.
Credit: R Tutorials
library(MASS)
# contingency table of smoking habit against exercise level
tbl = table(survey$Smoke, survey$Exer)
tbl
##
## Freq None Some
## Heavy 7 1 3
## Never 87 18 84
## Occas 12 3 4
## Regul 9 1 7
# test the independence of smoking habit and exercise level
chisq.test(tbl)
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 5.4885, df = 6, p-value = 0.4828
Since the p-value (0.4828) is greater than 0.05, we do not reject the null hypothesis that smoking habit is independent of exercise level. The warning arises because some cells have small expected counts (e.g. only one heavy smoker who never exercises), so the chi-squared approximation may be inaccurate.
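One way to sidestep the approximation warning is to let chisq.test estimate the p-value by Monte Carlo simulation instead of the chi-squared approximation; a minimal sketch:
# estimate the p-value by Monte Carlo simulation (B replicates)
chisq.test(tbl, simulate.p.value = TRUE, B = 2000)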
t-tests allow the comparison of two means.
Here we want an interval estimate of the difference between population means. The immer data record barley yields for the same fields in 1931 (Y1) and 1932 (Y2), so a paired t-test is appropriate.
Credit: R Tutorials
library(MASS)
head(immer)
## Loc Var Y1 Y2
## 1 UF M 81.0 80.7
## 2 UF S 105.4 82.3
## 3 UF V 119.7 80.4
## 4 UF T 109.7 87.2
## 5 UF P 98.3 84.2
## 6 W M 146.6 100.4
# paired t-test of the 1931 and 1932 yields
t.test(immer$Y1, immer$Y2, paired = TRUE)
##
## Paired t-test
##
## data: immer$Y1 and immer$Y2
## t = 3.324, df = 29, p-value = 0.002413
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.121954 25.704713
## sample estimates:
## mean of the differences
## 15.91333
Since the p-value (0.0024) is below 0.05, the mean yields differ significantly between the two years; the 95% confidence interval for the difference in mean barley yields runs from 6.12 to 25.70.
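As a sanity check, a paired t-test is equivalent to a one-sample t-test on the within-pair differences; this sketch should reproduce the result above:
# one-sample t-test on the yield differences; matches the paired test
t.test(immer$Y1 - immer$Y2)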
While t-tests compare the means of two groups, ANOVA extends the comparison to more than two groups: one or more categorical independent variables and one continuous dependent variable. What affects the DV? Are the differences between the group means real, or is the variation not statistically significant? ANOVA produces an F statistic; at the conventional significance level of α = 0.05, a p-value below α indicates a statistically significant difference between group means. R² gives the proportion of the variation in the dependent variable explained by the independent variables.
Credit: R-bloggers
library(car)
moth.experiment = read.csv("moth-trap-experiment.csv", header = TRUE)
# put number of moths on log scale
no.of.moth.log = log(moth.experiment$number.of.moths)
moth.experiment$no.of.moth.log = no.of.moth.log
# two-way ANOVA: location and type of lure, with their interaction
moth.anova = aov(moth.experiment$no.of.moth.log ~ moth.experiment$location * moth.experiment$type.of.lure)
# Type III tests via car::Anova
Anova(moth.anova, type = "III")
## Anova Table (Type III tests)
##
## Response: moth.experiment$no.of.moth.log
## Sum Sq Df F value
## (Intercept) 43.018 1 427.6842
## moth.experiment$location 1.302 3 4.3144
## moth.experiment$type.of.lure 0.102 2 0.5054
## moth.experiment$location:moth.experiment$type.of.lure 0.196 6 0.3245
## Residuals 4.828 48
## Pr(>F)
## (Intercept) < 2.2e-16 ***
## moth.experiment$location 0.008988 **
## moth.experiment$type.of.lure 0.606429
## moth.experiment$location:moth.experiment$type.of.lure 0.920916
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reading the table: location is statistically significant (p ≈ 0.009), while type of lure (p ≈ 0.61) and the location-by-lure interaction (p ≈ 0.92) are not. Moth counts differ by trap location, but the choice of lure makes no detectable difference.
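Since location is the significant factor, a natural follow-up is a post-hoc test to see which locations actually differ. A minimal sketch using Tukey's HSD on the fitted aov object (this assumes location was read in as a factor, as read.csv did by default before R 4.0):
# pairwise comparisons of trap locations with Tukey's HSD
TukeyHSD(moth.anova, which = "moth.experiment$location")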
correlation: r = covariance / (product of standard deviations), or simply cor(x, y, method = "...") where method can be "pearson" (the default), "kendall", or "spearman".
covariance: cov(x, y), where x and y must be numeric vectors of the same length.
# x increases by 2 while y decreases by 1, so they are perfectly negatively related
x = seq(from = 40, to = 78, by = 2)
y = seq(from = 20, to = 1, by = -1)
sdx = sd(x)
sdy = sd(y)
coVariance = cov(x, y)
coVariance
## [1] -70
# correlation from the definition: covariance over the product of standard deviations
correlationXY = coVariance / (sdx * sdy)
correlationXY
## [1] -1
cor(x, y) # the built-in cor() agrees
## [1] -1
A correlation of -1 indicates a perfect negative linear relationship: y decreases in exact proportion as x increases.
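The same call accepts the other methods mentioned above; because x and y here are perfectly monotonically related, all three methods should agree at -1. A minimal sketch:
# rank-based alternatives to the default Pearson correlation
cor(x, y, method = "kendall")
cor(x, y, method = "spearman")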
Credit: Eremenko, Ponteves
# Logistic Regression
# Importing the dataset
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
# (note: strictly, the test set should be scaled using the training-set
# centre and scale rather than scaled independently)
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Logistic Regression to the Training set
classifier = glm(formula = Purchased ~ .,
family = binomial,
data = training_set)
# Predicting the Test set results
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
# Visualising the Training set results
library(ElemStatLearn) # archived on CRAN; install from the CRAN archive if needed
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
main = 'Logistic Regression (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
main = 'Logistic Regression (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
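To put a single number on the classifier's performance, the confusion matrix built above can be turned into an accuracy estimate; a minimal sketch using the cm computed earlier:
# overall accuracy: correctly classified test observations over the total
accuracy = sum(diag(cm)) / sum(cm)
accuracy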
Credit: Eremenko, Ponteves
# Simple Linear Regression
# Importing the dataset
dataset = read.csv('Salary_Data.csv')
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Simple Linear Regression to the Training set
regressor = lm(formula = Salary ~ YearsExperience,
data = training_set)
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
# Visualising the Training set results
library(ggplot2)
ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Training set)') +
xlab('Years of experience') +
ylab('Salary')
# Visualising the Test set results
library(ggplot2)
ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
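One way to quantify how well the fitted line generalises is the root-mean-squared error of the test-set predictions; a minimal sketch using the y_pred computed above:
# RMSE of the test-set predictions
rmse = sqrt(mean((test_set$Salary - y_pred)^2))
rmse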
Credit: Eremenko, Ponteves
# Multiple Linear Regression
# Importing the dataset
dataset = read.csv('50_Startups.csv')
# Encoding categorical data
dataset$State = factor(dataset$State,
levels = c('New York', 'California', 'Florida'),
labels = c(1, 2, 3))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = Profit ~ .,
data = training_set)
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
# Building the optimal model using Backward Elimination
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
data = dataset)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend +
## State, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33504 -4736 90 6672 17338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.008e+04 6.953e+03 7.204 5.76e-09 ***
## R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
## Administration -2.700e-02 5.223e-02 -0.517 0.608
## Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
## State2 4.189e+01 3.256e+03 0.013 0.990
## State3 2.407e+02 3.339e+03 0.072 0.943
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
# Optional Step: Remove State2 only (as opposed to removing State directly)
# regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + factor(State, exclude = 2),
# data = dataset)
# summary(regressor)
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
data = dataset)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33534 -4795 63 6606 17275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.012e+04 6.572e+03 7.626 1.06e-09 ***
## R.D.Spend 8.057e-01 4.515e-02 17.846 < 2e-16 ***
## Administration -2.682e-02 5.103e-02 -0.526 0.602
## Marketing.Spend 2.723e-02 1.645e-02 1.655 0.105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
## F-statistic: 296 on 3 and 46 DF, p-value: < 2.2e-16
regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
data = dataset)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
## Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
regressor = lm(formula = Profit ~ R.D.Spend,
data = dataset)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34351 -4626 -375 6249 17188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.903e+04 2.538e+03 19.32 <2e-16 ***
## R.D.Spend 8.543e-01 2.931e-02 29.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
## F-statistic: 849.8 on 1 and 48 DF, p-value: < 2.2e-16
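The manual backward elimination above can also be automated. stats::step() drops terms by AIC rather than by p-value, so it will not necessarily land on the same final model, but it is a common shortcut; a minimal sketch:
# AIC-based backward elimination starting from the full model
full_model = lm(formula = Profit ~ ., data = dataset)
step(full_model, direction = "backward")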
Credit: Eremenko, Ponteves
# Polynomial Regression
# Importing the dataset
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]
# Splitting the dataset into the Training set and Test set
# # install.packages('caTools')
# library(caTools)
# set.seed(123)
# split = sample.split(dataset$Salary, SplitRatio = 2/3)
# training_set = subset(dataset, split == TRUE)
# test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Linear Regression to the dataset
lin_reg = lm(formula = Salary ~ .,
data = dataset)
# Fitting Polynomial Regression to the dataset
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
poly_reg = lm(formula = Salary ~ .,
data = dataset)
# Visualising the Linear Regression results
# install.packages('ggplot2')
library(ggplot2)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(lin_reg, newdata = dataset)),
colour = 'blue') +
ggtitle('Truth or Bluff (Linear Regression)') +
xlab('Level') +
ylab('Salary')
# Visualising the Polynomial Regression results
# install.packages('ggplot2')
library(ggplot2)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)),
colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') +
xlab('Level') +
ylab('Salary')
# Visualising the Regression Model results (for higher resolution and smoother curve)
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = x_grid, y = predict(poly_reg,
newdata = data.frame(Level = x_grid,
Level2 = x_grid^2,
Level3 = x_grid^3,
Level4 = x_grid^4))),
colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') +
xlab('Level') +
ylab('Salary')
# Predicting a new result with Linear Regression
predict(lin_reg, data.frame(Level = 6.5))
## 1
## 330378.8
# Predicting a new result with Polynomial Regression
predict(poly_reg, data.frame(Level = 6.5,
Level2 = 6.5^2,
Level3 = 6.5^3,
Level4 = 6.5^4))
## 1
## 158862.5
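Rather than adding the Level2 through Level4 columns by hand, the same quartic fit can be expressed with poly(); with raw = TRUE this reproduces the raw-power model above, so the prediction should match. A minimal sketch:
# equivalent quartic fit without manually created power columns
poly_reg2 = lm(formula = Salary ~ poly(Level, 4, raw = TRUE), data = dataset)
predict(poly_reg2, data.frame(Level = 6.5))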