This tutorial is to demonstrate how to R as a tool for data science.
Before we dive into R syntax, I want to clarify some of the common terminolories in statistics and R programming.
Variable=feature=column
Observation=row=case
Independent variable=predictors (variables)=regressors
Dependent variable=response variable
Vector= a single column or variable
A dataframe = a dataset with rows and columns
NA is missing values
Numeric data: Discrete data; continuous data
Categorical data: Nomial data, ordinal data
Factors: a categorical data with mathematical computation
Random with replacement: Who knows?
Let’s experience the first taste of R
– Type a random number in R console
4
[1] 4
# Store 4 in an object named four
four=4
# Call out the object 4
four
[1] 4
– Type a random character in R console
# Why ti throws an error?
# How can we write a character in R
"University"; 'Thai Nguyen'
[1] "University"
[1] "Thai Nguyen"
# Give it a name
mycha<-"University"
mycha<-c("University","Thainguyen")
mycha
[1] "University" "Thainguyen"
# how to create a list of 10 numbers from 1 to 10
mynum<-c(1:10)
mynum
[1] 1 2 3 4 5 6 7 8 9 10
Dive in R
- How to grenerate a list of randomly normal distributed numbers
x<-rnorm(1000)
# Check the length of x
length(x)
[1] 1000
# Plot the histogram of x
hist(x, xlab = "Values", ylab="Frequency", main="Histogram or Randomly distributed Values",col=rainbow(20))

- Some of the common notations used in R
=, <-, ==, !, is.na(), &, |, and >, >=
# = or <-
a<-5
a=5
# ==
a==4
[1] FALSE
# differ
a!=5 # a is different from 5
[1] FALSE
a!=8 # a is different from 5
[1] TRUE
# > or >=, <. <=, is.na(), A|B, A&B
A<-5
B<-6
A>B
[1] FALSE
A<B
[1] TRUE
# Check the current date
Sys.Date()
[1] "2018-01-22"
- If you want to know any functions that start with
lm, type the following command
apropos("norm")
[1] ".rs.normalizeKeyboardShortcut" ".rs.normalizePath"
[3] ".rs.validateAndNormalizeEncoding" "dlnorm"
[5] "dnorm" "norm"
[7] "normalizePath" "plnorm"
[9] "pnorm" "qlnorm"
[11] "qnorm" "qqnorm"
[13] "rlnorm" "rnorm"
How to read data in R
- How to mannually generate two columns of data with one column
hight and age
# set the fixed state
set.seed(112)
Height<-round(runif(20,140,190),1) # Unit in centemtter
Age<-round(runif(20,10,70))
df<-data.frame(Height,Age)
head(df)
# If you want to fix the dataset, you can type edit(df) or fix(df)
## ANother way of creating a dataset
set.seed(123)
num<-c(1:20) # Create a sequence of number starting from 1 and ending at 20
Letter<-sample(letters,20, replace = T) # Creating a list of characters randomly selected from 26 English letters
df1<-data.frame(num, Letter) # Form a dataframe
head(df1)
- Read data from our local computer and website, and in this case is
github
– Read data in .csv format
– Read dataset in .txt format in R
head(myland_survey) # Check the first few rows of the dataset
head(myland_survey) # Check the first few rows of the dataset
names(myland_survey)
[1] "A1" "A2" "A3" "A4" "A5" "YA" "XA" "B6" "B7" "B8" "B9" "B10" "YB" "XB"
[15] "C11" "C12" "C13" "C14" "YC" "XC" "D15" "D16" "YD" "XD" "E17" "E18" "E19" "E20"
[29] "YE" "XE" "F21" "F22" "F23" "YF" "XF" "G24" "G25" "YG" "XG" "H26" "H27" "YH"
[43] "XH" "Y.TB"
names(myland_survey)
[1] "A1" "A2" "A3" "A4" "A5" "YA" "XA" "B6" "B7" "B8" "B9" "B10" "YB" "XB"
[15] "C11" "C12" "C13" "C14" "YC" "XC" "D15" "D16" "YD" "XD" "E17" "E18" "E19" "E20"
[29] "YE" "XE" "F21" "F22" "F23" "YF" "XF" "G24" "G25" "YG" "XG" "H26" "H27" "YH"
[43] "XH" "Y.TB"
dim(myland_survey) # Showing the number of rows and columns
[1] 50 44
– We can read data from SPSS using foreign package
library(foreign)
# df2<-read.spss(file.choose(), to.data.frame = T, encoding="cp1252")
– Reading data in R using readr package (reference)
library(readr)
– converting a certain data type to another using parse_
a<-c(1,2,"Ha")
# Convert a to integer
parse_integer(a)
1 parsing failure.
row # A tibble: 1 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 3 NA an integer Ha
[1] 1 2 NA
attr(,"problems")
# parse_number()
parse_number("230%")
[1] 230
parse_number("this is 100$")
[1] 100
# Other parsers like parse_logical(), parse_character(), parse_double(),
- Data organization and manipulation
In this section, we will demonstrate the ability of R for data manipulation and organization. while this job can be done difficultly in excel or other software, R can implement it easily with the help of a powerful package called tidyrvese, dplyr and tidyr by Hadly Wickham
# Load packages
library(tidyverse)
mydata<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Land-Relocation-Survey-Data/master/Datafromspss.csv",header=T)
head(mydata)
– Get all MALES from the dataset
# Another one
males<- mydata %>% dplyr::filter(sex=="MALES")
head(males)
# Another dataset
head(diamonds)
# get all diamonds with cut==Ideal and Good, clarity= SI2 and VS1
cut_clarity<- diamonds %>% dplyr::filter(cut==c("Ideal","Good") & clarity==c("SI2","VS1") )
head(cut_clarity)
- How to select columns and rows together
mydata<- diamonds %>% dplyr::filter(cut=="Good") %>% dplyr::select(1:5)
head(mydata)
– How to encode a continous variable into a categorical variable
# if price from 0 to 5000 =Low, 5000<price<=8000=Medium, 8000<price<=10000 =High,>10000=Very high
mydf<-diamonds %>% mutate(price_Encode=cut(price,breaks = c(0, 5000, 8000,10000, 20000),labels=c("Low","Medium","High","Very High"),include.lowest = T))
Error in diamonds %>% mutate(price_Encode = cut(price, breaks = c(0, 5000, :
could not find function "%>%"
– How to create a factorial variable
# Create a dataframe
a<-c(1:20)
b<-c(21:40)
df3<-data.frame(a,b)
head(df3)
# Using `cut` to categorize the variable
df3$cut1<-cut(df3$a,3)
head(df3)
# Another
df3$cut2<-cut(df3$a,3, labels=c("Low","Middle","High"))
head(df3)
# Another way
df3$cut3<-cut(df3$a, breaks = c(0,5,10,15,21), labels=c("Low","Middle","Relative High","High"), include.lowest = T)
head(df3)
– cut an equal internval of values using cut2 function from Hmisc package
library(Hmisc)
package 㤼㸱Hmisc㤼㸲 was built under R version 3.4.3Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Attaching package: 㤼㸱Hmisc㤼㸲
The following objects are masked from 㤼㸱package:dplyr㤼㸲:
combine, src, summarize
The following objects are masked from 㤼㸱package:base㤼㸲:
format.pval, round.POSIXt, trunc.POSIXt, units
# devide a into 3 groups with an equal interval from smallest to biggest values
df3$cuth<-cut2(df3$a, g=3)
head(df3,20)
- Some of the mathematical operations
# e^123
exp(123)
[1] 2.619517e+53
# log_e()
log(12)
[1] 2.484907
# log10(100)
log10(100)
[1] 2
# cos(pi)
cos(pi/4)
[1] 0.7071068
# sum of square: tổng bình phương
x<-c(1:5)
sum(x^2) # sum of square = 1^2+2^2+3^2+4^2+5^2
[1] 55
– Adjusted sum of square: tổng bình phương điều chỉnh
\[ A= \sum_{i=1}^n (x_i-\bar{x})^2 \]
# Adjusted sum of square
x<-c(1:5)
sum((x-mean(x))^2)
[1] 10
– mean square: Tính sai số bình phương
\[ X= \sum_{i=1}^n (x_i-\bar{x})^2/n \]
x<-c(1:5) # Create a vector
sum((x-mean(x))^2)/length(x) # mean square
[1] 2
– Variance: Tính phương sai
\[ S^2 = \sum_{i=1}^n (x_i-\bar{x})^2/(n-1) \]
x<-c(1:5)
var1<-sum((x-mean(x))^2)/(length(x)-1) # Phương sai
var(x)
[1] 2.5
– Standard deviation: Độ lệch chuẩn
\[ S=\sqrt{S^2} \]
# Standard deviation of x
S=sqrt(var1)
S
sd(x) # another way
– How to create date time data
date1<-as.Date("17/12/23",format="%y/ %m/ %d" )
date1
# How to create of list date between 2010 and 2017
date2<-seq(as.Date("2010-01-01"), as.Date("2017-12-01"),by='month')
head(date2)
# by 2 days
date2<-seq(as.Date("2010-01-01"), as.Date("2017-12-01"),by='2 days')
head(date2)
# by 2 weeks
date2<-seq(as.Date("2010-01-01"), as.Date("2017-12-01"),by='2 weeks')
head(date2)
How to generate random numbers or levels using seq, rep and gl
# Create a sequential numbers
seq(1,10,by=0.2)
# Create a repetition list of numbers
rep(10,times=10)
rep("Ha Van Tuyen",times=10)
rep(c(1:10),3)
# create a list levels
gl(3, 6) # 3 levels and 6*3=18 items
# levels with notations
gl(3,5,length = 20, label=c("Ha","Van","Tuyen"))
# Create a repetation of every 2
rep(1:10, each=2)
– Create a matrix
# Create a squared matrix
A<-matrix(c(1:9),nrow=3,ncol=3,byrow = T)
A
– Transposed matrix: Ma trận chuyển vị
t(A) # This will transpose matrix A to transposed matrix
– Scalar matrix: Ma trận vô hướng
B<-matrix(0,4,4)
diag(B)<-1
B
– Matrix operations
C<-matrix(1:9, ncol=3,nrow=3)
D<-matrix(-1:-9,ncol=3, byrow = T)
# Addition
C+D
# Subtraction
C-D
# Multiplication
C%*%D
– Permutation: Hoán vị prod
# Having 5 students and 5 chairs. How many ways to organize five students in 5 chairs
prod(5:1)
– Combination: Tổ hợp
Graphics- Biểu Đồ or Data visualization (Base R)
# create a random values
N<-runif(1000, -5,5)
x<-N
y<-sin(x) + 0.5*rnorm(N)
# Scatter plot
plot(y~x, main="Scatter Plot", col=rainbow(1000))

# Hítogram
hist(x,main="Histogram")

# Barplot
barplot(N,main="Barplot",col=2)

# Boxplot
boxplot(N,main="Boxplot")
# Group four plots into one single box
par(mfrow=c(2,2))

plot(y~x, main="Scatter Plot", col=rainbow(1000))
hist(x,main="Histogram")
barplot(N,main="Barplot",col=2)
boxplot(y, main="Boxplot")

– Adding title, subtitle and others
# Create a normal distributed values
a<-rnorm(2000)
hist(a,xlab="Values",ylab="Frequency",col=rainbow(50),main="Histogram Plot")
title(sub="Figure 1: Histogram showing the randomly distributed values")

– Legends: Ghi chú
x<-runif(2000,-5,5)
y<- x + 0.5*rnorm(2000)
plot(y~x,main="Linear Regression",pch=16,col=4)
abline(lm(y~x),col=2)
legend(-5,5,c("Point patterns","Regression Line"),pch=16,lty=c(0,1),bty="n",col=c(4,2))

– Barplot
# Cars93 dataset
library(MASS)
head(Cars93)
myiris<-table(Cars93$Type)
# Tweek the plot position
par(mfrow=c(1,2))
barplot(myiris,col=c(2,4),main="Barplot")
barplot(myiris, main="Barplot", horiz = T,col=c(3,4))
title(sub="Figure 2: Three types of car ")

– Barplot with two categorical variables
library(Hmisc)
cut2<-cut2(Cars93$Price,g=4)
mytwo<-table(Cars93$Type,cut2)
# Check the order of levels of Type
levels(Cars93$Type)
[1] "Compact" "Large" "Midsize" "Small" "Sporty" "Van"
barplot(mytwo, beside = T,xlab="Price Group", col = as.numeric(Cars93$Type), ylab="Frequency")

– Density plot
# Create a randomly normal distributed dataset
a<-rnorm(2000,4,5)
hist(a,prob=T,col=3,border = "blue", main="Histogram Plot")
lines(density(a),lty=1, col=2,lwd=1)

– Scatter plot using car package
library(car)
scatterplot(cars$speed~cars$dist,smooth=F, main="Scatter Plot", ylab="Speed",xlab="Distance")
– Multiple interaction plots: Biểu đồ tương quan đa biến
library(psych)
df<-Cars93
head(df)
library(tidyverse)
df1<- df %>% dplyr::select(4:8)
pairs.panels(df1)
Graphics with ggplot2, a “masterpiece” of R data visualization
library(ggplot2)
a<-rnorm(2000)
df<-data.frame(a=a, b=gl)
# ggplot2
ggplot(data=df,aes(x=a)) + geom_histogram(binwidth = 0.1,aes(y=..density..),color=4,fill="green") + geom_density() + labs(x="Values",y="Density", title="Density Plot",caption="Tuyen, V. Ha - Massey University, NZ")
# Facets
ggplot(data=Cars93,aes(x=Price)) + geom_histogram(binwidth = 0.1,aes(y=..density..),color=4,fill="green") + geom_density() + labs(x="Values",y="Density", title="Density Plot",caption="Tuyen, V. Ha - Massey University, NZ") + facet_wrap(~Type)
- Descritive statistics: Thống kê mô tả
summary(df1)
num Letter
Min. : 1.00 y :3
1st Qu.: 5.75 b :2
Median :10.50 l :2
Mean :10.50 o :2
3rd Qu.:15.25 x :2
Max. :20.00 c :1
(Other):8
# Psych can summarise data as follow
library(psych) # this package can give standard error
describe(df1)
vars n mean sd median trimmed mad min max range skew kurtosis
num 1 20 10.50 5.92 10.5 10.50 7.41 1 20 19 0.00 -1.38
Letter* 2 20 8.15 4.46 8.5 8.31 5.93 1 14 13 -0.17 -1.40
se
num 1.32
Letter* 1.00
# Summarizing data under group (psych package)
df2<-Cars93[,c(3:8)]
Error: object 'Cars93' not found
– Categorical descritive statistics
# For one categorical variable
library(tables)
tabular(Type~1,data=Cars93) # Calculate the number of each type
tabular(Type~Origin*(n=1+ Percent("col")), data=Cars93) # Display both numbers and percent
– We can integrate different categorical variables to form a descrition statistics
library(tables)
tabular(Type*Origin~DriveTrain*(n=1 +Percent("col")),data=Cars93)
– Disriptive statistics for one variable
library(gmodels)
CrossTable(Type ,Origin, data=Cars93, prop.chisq = T,chisq = T,prop.r = T,prop.c = T,prop.t = T)
– tabular
tabular(Type~Origin* Width*(n=1+mean+sd+median),data=Cars93) # For one continous variable
tabular(Origin*Type~ (Width+Horsepower)*(n=1+mean+sd+median),data=Cars93)
- Test of Normal distribution
# Using Shapiro.test to check the normal distribution
shapiro.test(cars$speed)
# p-value>0.05 => It is normally distributed
hist(cars$speed)
# Another example
a<-rnorm(2000)
shapiro.test(a)
# p-value>0.05 => a is normally distributed
hist(a) # Check it out
– T-test: there are two types of t-test. Firstly, t-test for one variable while the second for two variables
- Example (t-test type 1): If my class’s mark average is known 7 for many years, but now I collect a recent class and calculate new average. Are there any difference between new and previous average marks?
*t-test formula
\[ t-test= \frac{\bar{x}-\mu}{S/\sqrt{n}} \]
Notes: n is sample size; S is standard deviation, x bar is an average of a sample; mu is a known population mean
# A known mark: 7
set.seed(123)
myclass<-rnorm(30, 7,1.5)
t.test(myclass, mu=7)
# There is a strong evidence to say that there is no difference between mean of 7 and myclass's mean (p-value=0.794)
# Assumption 2
t.test(myclass,mu=8)
# p-value=0.004166, and therefore there is difference between mean of 8 and myclass' mean
– t-test for two variables
Formula
\[ t-test= \frac{\bar{x_2}-\bar{x_1}}{SED} \] Note: SED=sqrt(SE1^2 + SE2^2) standard error
t.test(Cars93$Price~Cars93$Origin)
# It is likely to say that there is no difference in price between USA and Non-USA as p-value >0.005: Mức độ khác biệt giữa 2 nhóm không có ý nghĩa thống kê
# If variance is assummed to be equal
t.test(Cars93$Price~Cars93$Origin,var.equal=T)
– Test of variance
var.test(Cars93$Price~Cars93$Origin)
# There is difference between two origins. USA is greater 0.4779 non-USA and statistically significant (p-value<0.005)
– If our data is not normally distributed, we gonna have an alternative test called wilcox.test
# Check normal distribution of Price in Cars93 dataset
shapiro.test(Cars93$Price)
## Ho. Price is not normally distributed
##H1: Price is normally distributed
# there is a strong evidence to show that Price is not normally distributed due to a tiny small p-value
# Wilcoxon test
wilcox.test(Cars93$Price~Cars93$Origin)
# p-value >0.05 => no difference in price mean between two groups
– T.test for a pair of objects
Example: I want to test if any difference in satisfaction of a customer when they taste my product and other’s product.
myproduct<-c(56,60,58,70,95,82,62,56,76,87)
other<-c(87,56,65,67,48,46,53,59,71,63)
df<-data.frame(myproduct,other)
attach(df) # If you like
t.test(myproduct,other,paired = T)
# There is no difference in mean between myproduct and other
wilcox.test(myproduct,other,paired = T)
# The same result of t.test
– Test for Categorical variables
# A dataset derived from Massey University, NZ
df<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/CowMilk2008.csv",header=T,sep=";")
head(df)
– Frequency or counts
# Count the number of levels or things in each categorical variable
t<-table(df$Breed,df$Mast) # Number of breads in each mast
prop.table(t,1) # It means that in Breed group, 92% is from Mast_0 and 8% from Mast_1.
– Proportion test for one event
Example: There are 200 people in my study in which 112 males and 88 females. If I want to test if the proportion of male in this study is really larger than 50%?
prop.test(112,200,0.5)
# There is no evidence to say that the proportion of males in this study is not greater than 50% due to a large p-value
# Alternative test
binom.test(112,200,0.5) # The same result as above
– Proportion test for two groups
Example: A study included 100 people (A) using medicine and 120 (B) without medicine. After 2 years, it is observed that 13 people suffered from cancer from A, and 24 from B. Are there any difference in the proportion of suffers between two groups? or medicine effective?
suffer<-c(13,24)
total<-c(100,120)
prop.test(suffer,total)
# There is no difference between using medicine and without medicine
– Using different approaches to a multiple-proportion test prop.test(); binom.test(); chisq.test(); fisher.test
# Muốn biết tỷ lệ nữ giữa các nhóm có khác nhau ko? dung prop.test()
# Converting numeric values to factorical values for two categorical variables
df$Region<-as.factor(df$Region)
df$Mast<-as.factor(df$Mast)
# @ Want to know if any difference in proportion of Mast in each Region
table(df$Mast,df$Region)
Zero<-c(10245,9321,12496,13862)
total<-c(sum(10245,887),sum(9321,688),sum(12496,1686),sum(13862,913))
prop.test(Zero,total) # Test the difference in the proportion of Mast_0 for each region
#=> There is evidence to say that the proportion of Mast_0 is different in each Region
# Another example
table(df$Breed,df$Region)
CB<-c(5271,5142,7515,7473)
total<-c(sum(5271,4983,878),sum(5142,2945,1922),sum(7515,4795,1872),sum(7473,5145,2157))
prop.test(CB,total)
# There is evidence to say that there is difference in proportion of CB in each Region (p-value<0.05)
# Alternative test 1
chisq.test(df$Mast,df$Region)
# The same result as above
Linear Regression Model
– Corelation coefficients
# Create a simulation dataset
set.seed(123)
wt<-runif(100,69,110) # Weight for 100 people
ht<-wt+0.5*rnorm(100,150,10) # height for 100 people
df1<-round(data.frame(wt,ht))
head(df1)
– Find the correlation between ht and wt using Pearson’s correlation
\[ r=\sum_{i=1}^{n}\left( \frac{(x_i-\bar{x})*(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2*(\sum_{i=1}^{n}(y_i-\bar{y})^2)}} \right) \]
# Pearson's correlation
cor(wt,ht)
plot(ht~wt,col=3,pch=16)
– Correlation test
# Pearson's correlation test
cor.test(wt,ht)
# - H0: there is no correlation between wt and ht
# - H1: There is a correlation between wt and ht
#=> p-value<0.05 - there is evidence to say that correlation between ht and wt is statistically significant
– Kendall correlation test using non-normally distributed data
cor.test(wt,ht,method="kendall") # Non-parametric method
-Correlation is great, but we cannot use it to predict any output or a given data point. Therefore, we have to come up with an equation to estimate parameters and use it to predict a known x data point. For this reason, Linear regression model is very good choice for this problem.
Before, we gonna go further, let’s review some of the basic theory of a linear regression model. First, we look at a simple linear regression model. Mathematically, it is presented as follows:
\[ y_i=\alpha +\beta*x_i +\epsilon_i\]
The equation above is the population equation with epsilon is noise part of the model. In practice, we don’t know the alpha and beta parameters, and therefore we gonna use statistical or sample estimates alpha hat and beta hat to predict or inference the parameters. To do so, some assumptions are needed. For example, epsilon is normally distributed and mean is zero, and equally variance. A newly re-defined equation for sample data can be written as:
\[ \hat{y_i}= \hat{\alpha} + \hat{\beta}*x_i \]
In this approach, we have to find a line that fits data well. What I mean here is that it is believed that using least square method can be best to estimate the coefficients and minimize the model errors. Coeffients alpha hat and beta hat can be estimated as follow:
\[ \hat{\beta} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})*(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} ; \hat{\alpha}= \bar{y}-\hat{\beta}*\bar{x} \]
Notes: alpha hat and beta hat are sample estimates, y hat is predicted values using the above equation. Residuals can be estimated as
\[ Residuals=y_i -\hat{y_i} \]
and variance of residuals (s^2) is a estimate of epsilon and can be written as
\[ s^2 = \frac{\sum_{i=1}^{n}(y_i-\hat{y_i})}{n-2} \]
–Solving regression problem using R
This task can be easily achieved in R with a simple command line as follow
m1<-lm(ht~wt,data=df1)
# Get the summary of the model
summary(m1)
There are three important components in the above model: Residuals; Coefficients and residual standard error, R-squared and F-statistics
R-squared (coefficient of determination) can be estimated as:
\[R^2= \frac{\sum_{i=1}^{n}(\hat{y_i}-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \]
In linear regression context, it is assumed some criteria, and should be satisfied it.
epsilon follows normal distribution
epsilon has zero mean
Variance of epsilon should be fixed at any given point x
Values of epsilon is independent; epsilon 1 is independent from epsilon 2
– Check if our model is satisfied the condition
Investigate each plot to see if they are met or not?
par(mfrow=c(2,2))
plot(m1)
– Graphic plots
pred1<-predict.lm(m1,newdata = data.frame(wt=df1$wt), interval = "prediction")
mp<-data.frame(pred1,wt=df1$wt)
plot(df1$ht~df1$wt,col=3,pch=16,main="Height is a function of Weight",xlab="Weight",ylab="Height")
abline(m1)
lines(mp$lwr~mp$wt,col=2,lwd=1)
lines(mp$upr~mp$wt,col=4,lwd=1)
— Predict new given x values
# Create x data points
wt1<-data.frame(wt=c(67,90,78,88,82,57,110,108))
y_pred1<-predict.lm(m1,newdata = wt1,interval = "confidence") # Confidence interval
y_pred2<-predict.lm(m1,newdata = wt1,interval = "prediction") # Prediction iterval
plot(ht~wt,col=4,main="Height is a function of Weight",pch=16,xlab="Weight (kg)",ylab="Height (cm)")
# Adding some lines
res1<-cbind(y_pred1,wt1)
res2<-cbind(y_pred2,wt1)
lines(res1$fit~res1$wt,col=1)
lines(res1$upr~res1$wt,col=3)
lines(res1$lwr~res1$wt,col=3)
lines(res2$lwr~res2$wt,col=2,lty=2)
lines(res2$upr~res2$wt,col=2,lty=2)
- Multiple linear regression
The theory in multiple linear regression is similar to simple linear regression’s theory. However, more independent variables are adding to the model.
# Using the following dataset to predict ...
library(MASS)
head(Boston)
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
bh<-Boston[,c(1,7,8,13,14)]
head(bh)
# Multiple linea regression
ml<-lm(medv~., data=bh)
summary(ml)
Call:
lm(formula = medv ~ ., data = bh)
Residuals:
Min 1Q Median 3Q Max
-16.768 -3.853 -1.503 2.131 22.966
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.776557 1.651987 23.473 < 2e-16 ***
crim -0.106856 0.035921 -2.975 0.00307 **
age -0.003031 0.015743 -0.193 0.84741
dis -0.789790 0.196320 -4.023 6.63e-05 ***
lstat -0.999954 0.050022 -19.990 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.056 on 501 degrees of freedom
Multiple R-squared: 0.5698, Adjusted R-squared: 0.5664
F-statistic: 165.9 on 4 and 501 DF, p-value: < 2.2e-16
– Using transformation to refit the linear regression model
head(trees)
# Plot the scatter
plot(trees$Volume~trees$Height)
mt1<-lm(Volume~Height,data=trees) # Simple linear regression
summary(mt1)
Call:
lm(formula = Volume ~ Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-21.274 -9.894 -2.894 12.068 29.852
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.1236 29.2731 -2.976 0.005835 **
Height 1.5433 0.3839 4.021 0.000378 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.4 on 29 degrees of freedom
Multiple R-squared: 0.3579, Adjusted R-squared: 0.3358
F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784
# Another Example
head(Animals)
## Fitting a simple linear model
ma1<-lm(brain~body, data=Animals)
summary(ma1)
Call:
lm(formula = brain ~ body, data = Animals)
Residuals:
Min 1Q Median 3Q Max
-576.0 -554.1 -438.1 -156.3 5138.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.764e+02 2.659e+02 2.168 0.0395 *
body -4.326e-04 1.589e-02 -0.027 0.9785
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1360 on 26 degrees of freedom
Multiple R-squared: 2.853e-05, Adjusted R-squared: -0.03843
F-statistic: 0.0007417 on 1 and 26 DF, p-value: 0.9785
par(mfrow=c(2,2))

plot(ma1)
# It is clear that this model is not met. How about transformation both x and y
mt2<-lm(log(brain)~log(body),data=Animals)
summary(mt2)
Call:
lm(formula = log(brain) ~ log(body), data = Animals)
Residuals:
Min 1Q Median 3Q Max
-3.2890 -0.6763 0.3316 0.8646 2.5835
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.55490 0.41314 6.184 1.53e-06 ***
log(body) 0.49599 0.07817 6.345 1.02e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.532 on 26 degrees of freedom
Multiple R-squared: 0.6076, Adjusted R-squared: 0.5925
F-statistic: 40.26 on 1 and 26 DF, p-value: 1.017e-06
par(mfrow=c(2,2))

plot(mt2)

-Choosing a best model from a pool of possible models :)
It is obvious that there are so many possible models that we can build depending the number of independent variables. However, We only choose one model that can be best explained the variations in y.To do so, a indicator called AIC Akaike Information Criterion which can be used to sellect statistically significant variables. This creteria can be identified as
\[ AIC=log(\frac{RSS}{n})+\frac{2k}{n}; \ \ which \ \ RSS=\sum_{i=1}^{n}(\hat{y_i}-y_i)^2, \ \ k=number \ of \ independent \ variables, \ n=No \ of \ observations \] We choose a model with the smallest AIC
– Practice choosing a optimal model using AIC criteria
head(CO2)
Analysis of Variance
– One-way analysis of variance (ANOVA)
---
title: "R Notebook"
output: html_notebook
---

This tutorial is to demonstrate how to R as a tool for data science. 

- Before we dive into R syntax, I want to clarify some of the common terminolories in statistics and R programming.

* Variable=feature=column

* Observation=row=case

* Independent variable=predictors (variables)=regressors

* Dependent variable=response variable

* Vector= a single column or variable

* A dataframe = a dataset with rows and columns

* `NA` is missing values

* Numeric data: Discrete data; continuous data

* Categorical data: Nomial data, ordinal data 

* Factors: a categorical data with mathematical computation

* Random with replacement: Who knows?

# Let's experience the first taste of R 

-- Type a random number in R console 

```{r}
4
# Store 4 in an object named four

four=4
# Call out the object 4

four

```

-- Type a random character in R console 

```{r}

# Why ti throws an error?

# How can we write a character in R

"University"; 'Thai Nguyen'
# Give it a name

mycha<-"University"

mycha<-c("University","Thainguyen")

mycha

# how to create a list of 10 numbers from 1 to 10

mynum<-c(1:10)

mynum
```

# Dive in R 

- How to grenerate a list of randomly normal distributed numbers

```{r}
x<-rnorm(1000)
# Check the length of x

length(x)

# Plot the histogram of x

hist(x, xlab = "Values", ylab="Frequency", main="Histogram or Randomly distributed Values",col=rainbow(20))

```

- Some of the common notations used in R `=`, `<-`, `==`, `!`, `is.na()`, `&`, `|`, and `>, >=` 

```{r}
# = or <-

a<-5 

a=5

# ==

a==4

# differ

a!=5 # a is different from 5

a!=8 # a is different from 5

# > or >=, <. <=, is.na(), A|B, A&B

A<-5

B<-6

A>B
A<B

# Check the current date

Sys.Date()

```

- If you want to know any functions that start with `lm`, type the following command

```{r}

apropos("norm")

```

# How to read data in R

- How to mannually generate two columns of data with one column `hight` and `age`

```{r}
# set the fixed state

set.seed(112)

Height<-round(runif(20,140,190),1) # Unit in centemtter 

Age<-round(runif(20,10,70))

df<-data.frame(Height,Age)

head(df)

# If you want to fix the dataset, you can type edit(df) or fix(df)

## ANother way of creating a dataset

set.seed(123)

num<-c(1:20) # Create a sequence of number starting from 1 and ending at 20

Letter<-sample(letters,20, replace = T) # Creating a list of characters randomly selected from 26 English letters


df1<-data.frame(num, Letter) # Form a dataframe 

head(df1)

```

- Read data from our local computer and website, and in this case is `github` 

-- Read data in `.csv` format 

```{r}
# Read data from our local computer

# df<-read.csv(file.choose(),header=T)

# Read data in csv format from website

Blue_sherd<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/master/Blue_Sherd.csv",header=T)

head(Blue_sherd) # Check the first few rows of the dataset

tail(Blue_sherd) # View last few rows of the dataset

# Another dataset

Survey<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Land-Relocation-Survey-Data/Sample-Data/Sample_Data1.csv",header=T,sep=";")

head(Survey) # Check the first few rows of the dataset

tail(Survey) # Check the last few rows of the dataset


```

-- Read dataset in `.txt` format in R

```{r}

myland_survey<-read.table("https://raw.githubusercontent.com/tuyenhavan/Land-Relocation-Survey-Data/Sample-Data/Sample_Data1.txt",header=T)

head(myland_survey) # Check the first few rows of the dataset

# We can check the variable names

names(myland_survey)

dim(myland_survey) # Showing the number of rows and columns

```


-- We can read data from `SPSS` using `foreign` package

```{r}
library(foreign)

#  df2<-read.spss(file.choose(), to.data.frame = T, encoding="cp1252")

```

-- Reading data in R using `readr` package (reference)

```{r}
library(readr)

```

-- converting a certain data type to another using parse_

```{r}
a<-c(1,2,"Ha")

# Convert a to integer

parse_integer(a)

# parse_number()

parse_number("230%")

parse_number("this is 100$")

# Other parsers like parse_logical(), parse_character(), parse_double(), 
```


- Data organization and manipulation

In this section, we will demonstrate the ability of R for data manipulation and organization. while this job can be done difficultly in excel or other software, R can implement it easily with the help of a powerful package called `tidyrvese`, `dplyr` and `tidyr` by Hadly Wickham 

```{r}
# Load packages

library(tidyverse)

mydata<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Land-Relocation-Survey-Data/master/Datafromspss.csv",header=T)

head(mydata)

```

-- Get all `MALES` from the dataset 

```{r}
# Another one

males<- mydata %>% dplyr::filter(sex=="MALES")

head(males)

# Another dataset

head(diamonds)

# get all diamonds with cut==Ideal and Good, clarity= SI2 and VS1

cut_clarity<- diamonds %>% dplyr::filter(cut==c("Ideal","Good") & clarity==c("SI2","VS1") )

head(cut_clarity)
```

- How to select columns and rows together

```{r}
mydata<- diamonds %>% dplyr::filter(cut=="Good") %>% dplyr::select(1:5)

head(mydata)
```

-- How to encode a continous variable into a categorical variable 

```{r}

# if price from 0 to 5000 =Low, 5000<price<=8000=Medium, 8000<price<=10000 =High,>10000=Very high

mydf<-diamonds %>% mutate(price_Encode=cut(price,breaks = c(0, 5000, 8000,10000, 20000),labels=c("Low","Medium","High","Very High"),include.lowest = T))

head(mydf)

tail(mydf)

```


-- How to create a factorial variable

```{r}
# Create a dataframe

a<-c(1:20)

b<-c(21:40)

df3<-data.frame(a,b)

head(df3)

# Using `cut` to categorize the variable 

df3$cut1<-cut(df3$a,3)

head(df3)

# Another

df3$cut2<-cut(df3$a,3, labels=c("Low","Middle","High"))

head(df3)

# Another way

df3$cut3<-cut(df3$a, breaks = c(0,5,10,15,21), labels=c("Low","Middle","Relative High","High"), include.lowest = T)

head(df3)
```

-- cut an equal internval of values using `cut2` function from `Hmisc` package 

```{r}
library(Hmisc)
# devide a into 3 groups with an equal interval from smallest to biggest values
df3$cuth<-cut2(df3$a, g=3)

head(df3,20)

```

- Some of the mathematical operations

```{r}
# e^123

exp(123)

# log_e()

log(12)

# log10(100)

log10(100)

# cos(pi)

cos(pi/4)

# sum of square: tổng bình phương

x<-c(1:5)

sum(x^2) # sum of square = 1^2+2^2+3^2+4^2+5^2

```

-- Adjusted sum of square: tổng bình phương điều chỉnh

$$ A= \sum_{i=1}^n (x_i-\bar{x})^2   $$

```{r}
# Adjusted sum of square 

x<-c(1:5)

sum((x-mean(x))^2)

```

-- mean square: Tính sai số bình phương

$$ X= \sum_{i=1}^n (x_i-\bar{x})^2/n $$

```{r}
x<-c(1:5) # Create a vector

sum((x-mean(x))^2)/length(x) # mean square 

```

-- Variance: Tính phương sai

$$ S^2 = \sum_{i=1}^n (x_i-\bar{x})^2/(n-1)  $$

```{r}
x<-c(1:5)

var1<-sum((x-mean(x))^2)/(length(x)-1) # Phương sai 

var(x)

```

-- Standard deviation: Độ lệch chuẩn

$$ S=\sqrt{S^2} $$ 


```{r}
# Standard deviation of x

S=sqrt(var1)

S

sd(x) # another way 
```


-- How to create date time data

```{r}
date1<-as.Date("17/12/23",format="%y/ %m/ %d" )

date1

# How to create of list date between 2010 and 2017

date2<-seq(as.Date("2010-01-01"), as.Date("2017-12-01"),by='month')

head(date2)

# by 2 days

date2<-seq(as.Date("2010-01-01"), as.Date("2017-12-01"),by='2 days')

head(date2)

# by 2 weeks

date2<-seq(as.Date("2010-01-01"), as.Date("2017-12-01"),by='2 weeks')

head(date2)
```

# How to generate random numbers or levels using `seq`, `rep` and `gl`

```{r}
# Create a sequential numbers

seq(1,10,by=0.2)

# Create a repetition list of numbers

rep(10,times=10)

rep("Ha Van Tuyen",times=10)

rep(c(1:10),3)

# create a list levels

gl(3, 6) # 3 levels and 6*3=18 items

# levels with notations

gl(3,5,length = 20, label=c("Ha","Van","Tuyen"))

# Create a repetation of every 2

rep(1:10, each=2)


```

-- Create a matrix

```{r}
# Create a squared matrix

A<-matrix(c(1:9),nrow=3,ncol=3,byrow = T)

A
```

-- Transposed matrix: Ma trận chuyển vị

```{r}
t(A) # This will transpose matrix A to transposed matrix

```

-- Scalar matrix: Ma trận vô hướng

```{r}
B<-matrix(0,4,4)

diag(B)<-1

B
```

-- Matrix operations

```{r}
C<-matrix(1:9, ncol=3,nrow=3)

D<-matrix(-1:-9,ncol=3, byrow = T)

# Addition

C+D

# Subtraction

C-D

# Multiplication

C%*%D

```

-- Permutation: Hoán vị `prod`

```{r}
# Having 5 students and 5 chairs. How many ways to organize five students in 5 chairs

prod(5:1)

```

-- Combination: Tổ hợp 

# Graphics- Biểu Đồ or Data visualization (Base R)

```{r}
# create a random values

N<-runif(1000, -5,5)

x<-N

y<-sin(x) + 0.5*rnorm(N)
# Scatter plot

plot(y~x, main="Scatter Plot", col=rainbow(1000))

# Hítogram

hist(x,main="Histogram")

# Barplot

barplot(N,main="Barplot",col=2)

# Boxplot

boxplot(N,main="Boxplot")

# Group four plots into one single box

par(mfrow=c(2,2))

plot(y~x, main="Scatter Plot", col=rainbow(1000))

hist(x,main="Histogram")

barplot(N,main="Barplot",col=2)

boxplot(y, main="Boxplot")

```

-- Adding `title`, `subtitle` and others

```{r}
# Create a normal distributed values

a<-rnorm(2000)

hist(a,xlab="Values",ylab="Frequency",col=rainbow(50),main="Histogram Plot")

title(sub="Figure 1: Histogram showing the randomly distributed values")
```

-- Legends: Ghi chú

```{r}
x<-runif(2000,-5,5)

y<- x + 0.5*rnorm(2000)

plot(y~x,main="Linear Regression",pch=16,col=4)

abline(lm(y~x),col=2)

legend(-5,5,c("Point patterns","Regression Line"),pch=16,lty=c(0,1),bty="n",col=c(4,2))

```

-- Barplot

```{r}
# Cars93 dataset

library(MASS)

head(Cars93)

myiris<-table(Cars93$Type)

# Tweek the plot position

par(mfrow=c(1,2))

barplot(myiris,col=c(2,4),main="Barplot")

barplot(myiris, main="Barplot", horiz = T,col=c(3,4))

title(sub="Figure 2: Three types of car ")
```

-- Barplot with two categorical variables 

```{r}

library(Hmisc)

cut2<-cut2(Cars93$Price,g=4)

mytwo<-table(Cars93$Type,cut2)
# Check the order of levels of Type

levels(Cars93$Type)

barplot(mytwo, beside = T,xlab="Price Group", col = as.numeric(Cars93$Type), ylab="Frequency")
```

-- Density plot

```{r}
# Create a randomly normal distributed dataset

a<-rnorm(2000,4,5)

hist(a,prob=T,col=3,border = "blue", main="Histogram Plot")

lines(density(a),lty=1, col=2,lwd=1)
```

-- Scatter plot using `car` package 

```{r}
library(car)

scatterplot(cars$speed~cars$dist,smooth=F, main="Scatter Plot", ylab="Speed",xlab="Distance")
```

-- Multiple interaction plots: Biểu đồ tương quan đa biến

```{r}
library(psych)

df<-Cars93

head(df)

library(tidyverse)

df1<- df %>% dplyr::select(4:8)

pairs.panels(df1)
```

# Graphics with `ggplot2`, a "masterpiece" of R data visualization 

```{r}
library(ggplot2)

a<-rnorm(2000)

df<-data.frame(a=a, b=gl)

# ggplot2 

ggplot(data=df,aes(x=a)) + geom_histogram(binwidth = 0.1,aes(y=..density..),color=4,fill="green") + geom_density() + labs(x="Values",y="Density", title="Density Plot",caption="Tuyen, V. Ha - Massey University, NZ")

# Facets

ggplot(data=Cars93,aes(x=Price)) + geom_histogram(binwidth = 0.1,aes(y=..density..),color=4,fill="green") + geom_density() + labs(x="Values",y="Density", title="Density Plot",caption="Tuyen, V. Ha - Massey University, NZ") + facet_wrap(~Type)
```

- Descritive statistics: Thống kê mô tả

```{r}
summary(df1)
# Psych can summarise data as follow

library(psych) # this package can give standard error 

describe(df1)

# Summarizing data under group (psych package)

df2<-Cars93[,c(3:8)]

head(df2)

describeBy(df2,group = "Type")
```

-- Categorical descritive statistics

```{r}
# For one categorical variable

library(tables)

tabular(Type~1,data=Cars93) # Calculate the number of each type

tabular(Type~Origin*(n=1+ Percent("col")), data=Cars93) # Display both numbers and percent 

```

-- We can integrate different categorical variables to form a descrition statistics

```{r}
library(tables)

tabular(Type*Origin~DriveTrain*(n=1 +Percent("col")),data=Cars93)
```

-- Disriptive statistics for one variable

```{r}
library(gmodels)

CrossTable(Type ,Origin, data=Cars93, prop.chisq = T,chisq = T,prop.r = T,prop.c = T,prop.t = T)
```

-- `tabular`

```{r}
tabular(Type~Origin* Width*(n=1+mean+sd+median),data=Cars93) # For one continous variable

tabular(Origin*Type~ (Width+Horsepower)*(n=1+mean+sd+median),data=Cars93)

```

- Test of Normal distribution

```{r}
# Using Shapiro.test to check the normal distribution

shapiro.test(cars$speed)

# p-value>0.05 => It is normally distributed

hist(cars$speed)

# Another example

a<-rnorm(2000)

shapiro.test(a)

# p-value>0.05 => a is normally distributed

hist(a) # Check it out
```

-- T-test: there are two types of `t-test`. Firstly, t-test for one variable while the second for two variables

- Example (t-test type 1): If my class's mark average is known 7 for many years, but now I collect a recent class and calculate new average. Are there any difference between new and previous average marks?

*t-test formula

$$ t-test= \frac{\bar{x}-\mu}{S/\sqrt{n}} $$

Notes: `n` is sample size; S is standard deviation, x bar is an average of a sample; mu is a known population mean

```{r}
# A known mark: 7

set.seed(123)

myclass<-rnorm(30, 7,1.5)

t.test(myclass, mu=7)

# There is a strong evidence to say that there is no difference between mean of 7 and myclass's mean (p-value=0.794)

# Assumption 2

t.test(myclass,mu=8)

# p-value=0.004166, and therefore there is difference between mean of 8 and myclass' mean

```
-- t-test for two variables

 Formula 
 
$$  t-test= \frac{\bar{x_2}-\bar{x_1}}{SED} $$
Note: SED=sqrt(SE1^2 + SE2^2) standard error

```{r}
t.test(Cars93$Price~Cars93$Origin)

# It is likely to say that there is no difference in price between USA and Non-USA as p-value >0.005: Mức độ khác biệt giữa 2 nhóm không có ý nghĩa thống kê

# If variance is assummed to be equal

t.test(Cars93$Price~Cars93$Origin,var.equal=T)
```

-- Test of variance

```{r}

var.test(Cars93$Price~Cars93$Origin)
# There is difference between two origins. USA is greater 0.4779 non-USA and statistically significant (p-value<0.005)

```

-- If our data is not normally distributed, we gonna have an alternative test called `wilcox.test` 

```{r}
# Check normal distribution of Price in Cars93 dataset

shapiro.test(Cars93$Price)

## Ho. Price is not normally distributed

##H1: Price is normally distributed

# there is a strong evidence to show that Price is not normally distributed due to a tiny small p-value

# Wilcoxon test 

wilcox.test(Cars93$Price~Cars93$Origin)

# p-value >0.05 => no difference in price mean between two groups
```

-- T.test for a pair of objects

 Example: I want to test if any difference in satisfaction of a customer when they taste my product and other's product.
 
```{r}
myproduct<-c(56,60,58,70,95,82,62,56,76,87)


other<-c(87,56,65,67,48,46,53,59,71,63)

df<-data.frame(myproduct,other)

attach(df) # If you like 

t.test(myproduct,other,paired = T)

# There is no difference in mean between myproduct and other

wilcox.test(myproduct,other,paired = T)

# The same result of t.test
```

-- Test for Categorical variables

```{r}
# A dataset derived from Massey University, NZ

df<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/CowMilk2008.csv",header=T,sep=";")

head(df)
```

-- Frequency or counts

```{r}
# Count the number of levels or things in each categorical variable

t<-table(df$Breed,df$Mast) # Number of breads in each mast

prop.table(t,1) # It means that in Breed group, 92% is from Mast_0 and 8% from Mast_1.

```

-- Proportion test for one event

Example: There are 200 people in my study in which 112 males and 88 females. If I want to test if the proportion of male in this study is really larger than 50%?

```{r}
prop.test(112,200,0.5)
# There is no evidence to say that the proportion of males in this study is not greater than 50% due to a large p-value

# Alternative test

binom.test(112,200,0.5) # The same result as above
```
-- Proportion test for two groups

Example: A study included 100 people (A) using medicine and 120 (B) without medicine. After 2 years, it is observed that 13 people suffered from cancer from A, and 24 from B. Are there any difference in the proportion of suffers between two groups? or medicine effective?

```{r}
suffer<-c(13,24)

total<-c(100,120)

prop.test(suffer,total)

# There is no difference between using medicine and without medicine
```

-- Using different approaches to a multiple-proportion test  `prop.test()`; `binom.test()`; `chisq.test()`; `fisher.test`

```{r}
# Muốn biết tỷ lệ nữ giữa các nhóm có khác nhau ko? dung prop.test()

# Converting numeric values to factorical values for two categorical variables

df$Region<-as.factor(df$Region)

df$Mast<-as.factor(df$Mast)
# @ Want to know if any difference in proportion of Mast in each Region

table(df$Mast,df$Region)

Zero<-c(10245,9321,12496,13862)

total<-c(sum(10245,887),sum(9321,688),sum(12496,1686),sum(13862,913))

prop.test(Zero,total) # Test the difference in the proportion of Mast_0 for each region

#=> There is evidence to say that the proportion of Mast_0 is different in each Region

# Another example

table(df$Breed,df$Region)

CB<-c(5271,5142,7515,7473)

total<-c(sum(5271,4983,878),sum(5142,2945,1922),sum(7515,4795,1872),sum(7473,5145,2157))

prop.test(CB,total)

# There is evidence to say that there is difference in proportion of CB in each Region (p-value<0.05)

# Alternative test 1

chisq.test(df$Mast,df$Region)

# The same result as above
```

# Linear Regression Model

-- Corelation coefficients 

```{r}
# Create a simulation dataset

set.seed(123)

wt<-runif(100,69,110) # Weight for 100 people


ht<-wt+0.5*rnorm(100,150,10) # height for 100 people

df1<-round(data.frame(wt,ht))

head(df1)
```

-- Find the correlation between `ht` and `wt` using Pearson's correlation

$$  r=\sum_{i=1}^{n}\left( \frac{(x_i-\bar{x})*(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2*(\sum_{i=1}^{n}(y_i-\bar{y})^2)}} \right) $$
```{r}
# Pearson's correlation

cor(wt,ht)

plot(ht~wt,col=3,pch=16)
```

-- Correlation test

```{r}
# Pearson's correlation test

cor.test(wt,ht)

# - H0: there is no correlation between wt and ht

# - H1: There is a correlation between wt and ht

#=> p-value<0.05 - there is evidence to say that correlation between ht and wt is statistically significant  
```

-- Kendall correlation test using non-normally distributed data

```{r}
cor.test(wt,ht,method="kendall") # Non-parametric method

```

-Correlation is great, but we cannot use it to predict any output or a given data point. Therefore, we have to come up with an equation to estimate parameters and use it to predict a known x data point. For this reason, `Linear regression model` is very good choice for this problem. 

 Before, we gonna go further, let's review some of the basic theory of a linear regression model. First, we look at a simple linear regression model. Mathematically, it is presented as follows:
 
 $$ y_i=\alpha +\beta*x_i +\epsilon_i$$

 The equation above is the population equation with `epsilon` is noise part of the model. In practice, we don't know the `alpha` and `beta` parameters, and therefore we gonna use statistical or sample estimates `alpha hat` and `beta hat` to predict or inference the  parameters. To do so, some assumptions are needed. For example, `epsilon` is normally distributed and mean is zero, and equally variance. A newly re-defined equation for sample data can be written as:

$$ \hat{y_i}= \hat{\alpha} + \hat{\beta}*x_i   $$

In this approach, we have to find a line that fits data well. What I mean here is that it is believed that using `least square method` can be best to estimate the coefficients and minimize the model errors. Coeffients ` alpha hat` and `beta hat` can be estimated as follow:

$$  \hat{\beta} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})*(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}  ;     \hat{\alpha}= \bar{y}-\hat{\beta}*\bar{x} $$

Notes: `alpha hat` and `beta hat` are sample estimates, `y hat` is predicted values using the above equation. Residuals can be estimated as

$$ Residuals=y_i -\hat{y_i} $$

and variance of residuals (s^2) is a estimate of `epsilon` and can be written as 

$$ s^2 = \frac{\sum_{i=1}^{n}(y_i-\hat{y_i})}{n-2}  $$

--Solving regression problem using `R` 

 This task can be easily achieved in R with a simple command line as follow

```{r}
m1<-lm(ht~wt,data=df1)

# Get the summary of the model

summary(m1)
```

 There are three important components in the above model: ` Residuals`; `Coefficients` and `residual standard error, R-squared and F-statistics` 

 R-squared (coefficient of determination) can be estimated as:
 
$$R^2= \frac{\sum_{i=1}^{n}(\hat{y_i}-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} $$

 In linear regression context, it is assumed some criteria, and should be satisfied it. 
 
1. `epsilon` follows normal distribution

2. `epsilon` has zero mean

3. Variance of `epsilon` should be fixed at any given point x

4. Values of `epsilon` is independent; `epsilon 1` is independent from `epsilon 2`

-- Check if our model is satisfied the condition

 Investigate each plot to see if they are met or not?

```{r}
par(mfrow=c(2,2))

plot(m1)
```

-- Graphic plots

```{r}

pred1<-predict.lm(m1,newdata = data.frame(wt=df1$wt), interval = "prediction")

mp<-data.frame(pred1,wt=df1$wt)

plot(df1$ht~df1$wt,col=3,pch=16,main="Height is a function of Weight",xlab="Weight",ylab="Height")

abline(m1)

lines(mp$lwr~mp$wt,col=2,lwd=1)

lines(mp$upr~mp$wt,col=4,lwd=1)
```

--- Predict new given x values

```{r}
# Create x data points

wt1<-data.frame(wt=c(67,90,78,88,82,57,110,108))

y_pred1<-predict.lm(m1,newdata = wt1,interval = "confidence") # Confidence interval

y_pred2<-predict.lm(m1,newdata = wt1,interval = "prediction") # Prediction iterval

plot(ht~wt,col=4,main="Height is a function of Weight",pch=16,xlab="Weight (kg)",ylab="Height (cm)")

# Adding some lines

res1<-cbind(y_pred1,wt1)

res2<-cbind(y_pred2,wt1)

lines(res1$fit~res1$wt,col=1)

lines(res1$upr~res1$wt,col=3)

lines(res1$lwr~res1$wt,col=3)

lines(res2$lwr~res2$wt,col=2,lty=2)

lines(res2$upr~res2$wt,col=2,lty=2)

```

- Multiple linear regression

 The theory in multiple linear regression is similar to simple linear regression's theory. However, more independent variables are adding to the model.
 
```{r}
# Using the following dataset to predict ...

library(MASS)

head(Boston)

names(Boston)

bh<-Boston[,c(1,7,8,13,14)]

head(bh)

# Multiple linea regression

ml<-lm(medv~., data=bh)

summary(ml)
```

-- Using `transformation` to refit the linear regression model

```{r}

head(trees)
# Plot the scatter 

plot(trees$Volume~trees$Height)

mt1<-lm(Volume~Height,data=trees) # Simple linear regression

summary(mt1)

# Another Example
head(Animals)

## Fitting a simple linear model

ma1<-lm(brain~body, data=Animals)

summary(ma1)

par(mfrow=c(2,2))

plot(ma1)
# It is clear that this model is not met. How about transformation both x and y

mt2<-lm(log(brain)~log(body),data=Animals)

summary(mt2)

par(mfrow=c(2,2))

plot(mt2)
```

-Choosing a `best model` from a pool of possible models :)

 It is obvious that there are so many possible models that we can build depending the number of independent variables. However, We only choose one model that can be best explained the variations in `y`.To do so, a indicator called `AIC` Akaike Information Criterion which can be used to sellect statistically significant variables. This creteria can be identified as 
 
 $$ AIC=log(\frac{RSS}{n})+\frac{2k}{n}; \  \ which \ \  RSS=\sum_{i=1}^{n}(\hat{y_i}-y_i)^2, \   \ k=number \ of \ independent \ variables, \ n=No \ of \ observations $$ 
 We choose a model with the smallest AIC

-- Practice choosing a `optimal model` using AIC criteria

```{r}
head(CO2)

```

# Analysis of Variance 

-- One-way analysis of variance (ANOVA)
