#This recitation includes constructing a boxplot, histogram, two-way table, normal quantile plot, histogram with a fitted normal curve, and linear regression line, and calculating probabilities in a normal distribution.
#Recitation 2.1: Boxplot and histogram
#A study surveyed the number of words spoken per day by 79 people, 37 of whom are male (GenderMale1 = 1) and 42 of whom are female (GenderMale1 = 2).
#Step 1: import the data set
rm(list = ls()) #Clean previous data
Daily_Words_Spoken <- read.csv("Ch3_Daily Words Spoken.csv")
#Step 2: understand the data with the summary function
##We can use the summary function to get summary statistics (the five-number summary and the mean) of each variable, including WordsPerDay.
summary(Daily_Words_Spoken)
## ID GenderMale1 WordsPerDay
## Min. : 1.0 Min. :1.000 Min. : 695
## 1st Qu.:20.5 1st Qu.:1.000 1st Qu.: 8346
## Median :40.0 Median :2.000 Median :12460
## Mean :40.0 Mean :1.532 Mean :14186
## 3rd Qu.:59.5 3rd Qu.:2.000 3rd Qu.:18050
## Max. :79.0 Max. :2.000 Max. :36345
#Step 3: calculate the mean, median and standard deviation of WordsPerDay for all observations
mean(Daily_Words_Spoken$WordsPerDay)
## [1] 14186.01
median(Daily_Words_Spoken$WordsPerDay)
## [1] 12460
sd(Daily_Words_Spoken$WordsPerDay,na.rm = TRUE)
## [1] 7729.664
#Step 4: create a boxplot and histogram for WordsPerDay.
boxplot(Daily_Words_Spoken$WordsPerDay)
hist(Daily_Words_Spoken$WordsPerDay)
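As a rough check on which points the boxplot will flag, the 1.5 x IQR rule can be applied by hand to the quartiles printed by summary() above. This is only a sketch: the quartiles are the rounded values copied from that output, and boxplot() itself works from hinges, which can differ slightly from these quartiles.

```r
# Quartiles and maximum of WordsPerDay, copied (rounded) from the summary() output
q1 <- 8346
q3 <- 18050
max_words <- 36345

iqr <- q3 - q1                     # interquartile range
lower_fence <- q1 - 1.5 * iqr      # values below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr      # values above this are drawn as outliers

c(IQR = iqr, lower = lower_fence, upper = upper_fence)
max_words > upper_fence            # TRUE means the maximum lies beyond the upper fence
```

Since the lower fence is negative and word counts cannot be negative, any flagged points in this data set can only appear on the high side.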
#Step 5: create a pie chart and bar chart
counts <- table(Daily_Words_Spoken$GenderMale1)
counts
##
## 1 2
## 37 42
pie(counts)
barplot(counts,xlab = "Gender", ylab = "Counts", col = "blue", border = "red")
#Step 6: split the data set into two data sets: male and female
male <- subset(Daily_Words_Spoken,GenderMale1 == 1)
female <- subset(Daily_Words_Spoken,GenderMale1 == 2)
male
## ID GenderMale1 WordsPerDay
## 1 1 1 23871
## 2 2 1 5180
## 3 3 1 9951
## 4 4 1 12460
## 5 5 1 17155
## 6 6 1 10344
## 7 7 1 9811
## 8 8 1 12387
## 9 9 1 29920
## 10 10 1 21791
## 11 11 1 9789
## 12 12 1 31127
## 13 13 1 8572
## 14 14 1 6942
## 15 15 1 2539
## 16 16 1 36345
## 17 17 1 6858
## 18 18 1 24024
## 19 19 1 5488
## 20 20 1 9960
## 21 21 1 11118
## 22 22 1 4970
## 23 23 1 10710
## 24 24 1 15011
## 25 25 1 1569
## 26 26 1 23794
## 27 27 1 23689
## 28 28 1 11769
## 29 29 1 26846
## 30 30 1 17386
## 31 31 1 7987
## 32 32 1 25638
## 33 33 1 695
## 34 34 1 2366
## 35 35 1 16075
## 36 36 1 16789
## 37 37 1 9308
#Step 7: calculate the mean of WordsPerDay for males and females (Hint: use the tapply function, or the group_by + summarize functions)
#tapply function:
tapply(Daily_Words_Spoken$WordsPerDay,Daily_Words_Spoken$GenderMale1,mean)
## 1 2
## 14060.38 14296.69
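The hint in Step 7 also mentions group_by + summarize. A minimal sketch of that route is given here, assuming the dplyr package is installed; the data frame `toy` and its values are made up for illustration and only reuse the column names of the recitation file.

```r
library(dplyr)

# Toy data with the same column names as the recitation file (values are made up)
toy <- data.frame(
  GenderMale1 = c(1, 1, 1, 2, 2, 2),
  WordsPerDay = c(10000, 12000, 14000, 9000, 15000, 18000)
)

# One row per gender code, with the group mean and the group size
by_gender <- toy %>%
  group_by(GenderMale1) %>%
  summarise(mean_words = mean(WordsPerDay), n = n())

print(by_gender)
```

Replacing `toy` with Daily_Words_Spoken should give the same group means as the tapply call, returned as a data frame rather than a named vector.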
#Recitation 2.2: Two-way table
#Case introduction: On April 15, 1912, the Titanic, the largest passenger liner built at that time, sank after colliding with an iceberg during her maiden voyage, killing many passengers and crew. One of the reasons the shipwreck resulted in such loss of life was that there were not enough lifeboats for the passengers and crew.
##The “Ch3_Titanic.csv” file contains data for 1309 of the real Titanic passengers. Each row represents one person. The columns describe attributes of the person: passenger class, whether they survived, sex, and age.
##Note that first class on a passenger ship is usually on the upper decks and third class on the lower decks.
#Step 1: import the data set
rm(list = ls()) #Clean previous data
Titanic <- read.csv("Ch3_Titanic.csv")
#Step 2: preprocess the data set
#The variables pclass (passenger class) and survived (1=survived, 0=perished) are categorical variables. However, read.csv imports them as numeric variables because their values are coded as numbers. Run the following R code to summarize all the variables:
summary(Titanic)
## pclass survived sex age
## Min. :1.000 Min. :0.000 Length:1309 Min. : 0.1667
## 1st Qu.:2.000 1st Qu.:0.000 Class :character 1st Qu.:21.0000
## Median :3.000 Median :0.000 Mode :character Median :28.0000
## Mean :2.295 Mean :0.382 Mean :29.8811
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:39.0000
## Max. :3.000 Max. :1.000 Max. :80.0000
## NA's :263
#If a variable is categorical (a factor), the output displays just the categories and their counts. If the output is a five-number summary with the mean, R is treating the variable as numeric. Therefore, the variables “pclass”, “survived” and “age” are treated as numeric by default.
#To convert the variables “pclass” and “survived” to categorical, run the following code:
Titanic$pclass <- as.factor(Titanic$pclass)
Titanic$survived <- as.factor(Titanic$survived)
summary(Titanic)
## pclass survived sex age
## 1:323 0:809 Length:1309 Min. : 0.1667
## 2:277 1:500 Class :character 1st Qu.:21.0000
## 3:709 Mode :character Median :28.0000
## Mean :29.8811
## 3rd Qu.:39.0000
## Max. :80.0000
## NA's :263
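The effect of as.factor on summary() can be seen on a small vector. This is a sketch with made-up class codes, not the Titanic data.

```r
# Hypothetical class codes, stored as numbers
x <- c(1, 2, 3, 3, 1, 3)

summary(x)            # numeric: five-number summary plus the mean

f <- as.factor(x)     # same values, now treated as categories
summary(f)            # factor: one count per category
levels(f)             # the categories, stored as text labels
```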
#Step 3: generate a two-way table for “pclass” and “survived”
#Run the following code to generate a two-way table for “pclass” and “survived”.
#install.packages("gmodels")
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.2.3
CrossTable(Titanic$pclass,Titanic$survived)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1309
##
##
## | Titanic$survived
## Titanic$pclass | 0 | 1 | Row Total |
## ---------------|-----------|-----------|-----------|
## 1 | 123 | 200 | 323 |
## | 29.411 | 47.587 | |
## | 0.381 | 0.619 | 0.247 |
## | 0.152 | 0.400 | |
## | 0.094 | 0.153 | |
## ---------------|-----------|-----------|-----------|
## 2 | 158 | 119 | 277 |
## | 1.017 | 1.645 | |
## | 0.570 | 0.430 | 0.212 |
## | 0.195 | 0.238 | |
## | 0.121 | 0.091 | |
## ---------------|-----------|-----------|-----------|
## 3 | 528 | 181 | 709 |
## | 18.411 | 29.788 | |
## | 0.745 | 0.255 | 0.542 |
## | 0.653 | 0.362 | |
## | 0.403 | 0.138 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 809 | 500 | 1309 |
## | 0.618 | 0.382 | |
## ---------------|-----------|-----------|-----------|
##
##
Each inner cell of the generated two-way table has 5 entries: count, chi-square contribution, row %, column % and total %. The learning objective of this recitation is to use R to analyze a two-way table and its joint, marginal and conditional probabilities. There are 3 types of probabilities:
1. A joint probability (the total % in the inner cells) is the probability that two events occur together, calculated as cell / total. All the joint probabilities add up to 1.
#For example, the probability that a passenger was in first class “and” survived = 200/1309 = 15.28%.
2. A marginal probability lets us study one variable at a time. It is calculated as (row or column total) / total.
#The marginal probabilities for pclass are in the last column, labeled Row Total, and the marginal probabilities for survived are in the last row, labeled Column Total.
#For example, what percentage of the Titanic passengers survived? Answer: 500/1309 = 38.20%.
3. A conditional probability is the probability of an event (A) given that another event (B) has occurred, calculated as cell / (row or column total).
#GIVEN that a passenger survived (or not), the column % gives the probability that he/she was in pclass = 1, 2 or 3. For example, among all passengers who survived, 200/500 = 40.00% were in first class.
#GIVEN that a passenger was in pclass = 1, 2 or 3, the row % gives the probability that he/she survived (or not). For example, among all 2nd class passengers, 119/277 = 42.96% survived.
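The three kinds of probability can also be computed directly with base R's prop.table. The sketch below rebuilds the table from the counts printed by CrossTable rather than from the CSV file, so it runs on its own; the object names are chosen here for illustration.

```r
# Counts copied from the CrossTable output (rows: pclass, columns: survived)
counts <- matrix(c(123, 200,
                   158, 119,
                   528, 181),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(pclass = c("1", "2", "3"),
                                 survived = c("0", "1")))

joint          <- prop.table(counts)               # cell / table total
given_pclass   <- prop.table(counts, margin = 1)   # cell / row total (row %)
given_survived <- prop.table(counts, margin = 2)   # cell / column total (col %)
marg_pclass    <- rowSums(counts) / sum(counts)    # marginal for pclass
marg_survived  <- colSums(counts) / sum(counts)    # marginal for survived

round(joint["1", "1"], 4)            # first class and survived
round(marg_survived["1"], 4)         # survived overall
round(given_survived["1", "1"], 4)   # first class, given survived
round(given_pclass["2", "1"], 4)     # survived, given second class
```

With the imported data, table(Titanic$pclass, Titanic$survived) should produce the same counts and can be passed to prop.table in the same way.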
#Recitation 2.3: Calculate Probability in Normal Distribution.
\[P(Z \leq 2.05)\]
pnorm(2.05)
## [1] 0.9798178
##pnorm(x, mean = µ, sd = σ) returns the area under the normal curve from -∞ to x, that is, P(X ≤ x).
pnorm(6,mean = 3,sd=2)
## [1] 0.9331928
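The same probability can be reached by standardizing first: converting x to a z-score and using the standard normal curve. A minimal sketch with the values used above (µ = 3, σ = 2, x = 6):

```r
mu <- 3
sigma <- 2
x <- 6

z <- (x - mu) / sigma                          # z-score of x
p_direct   <- pnorm(x, mean = mu, sd = sigma)  # area under N(3, 2) up to x
p_standard <- pnorm(z)                         # area under N(0, 1) up to z

c(z = z, direct = p_direct, standardized = p_standard)
```

Both calls return the same area, which is why a single standard normal table is enough for any normal distribution.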
##The 68-95-99.7 rule:
##approximately 68% of the observations fall within 1σ of µ;
##approximately 95% of the observations fall within 2σ of µ;
##approximately 99.7% of the observations fall within 3σ of µ.
##For standard normal distribution (µ=0 and σ=1):
pnorm(1)-pnorm(-1)
## [1] 0.6826895
pnorm(2)-pnorm(-2)
## [1] 0.9544997
pnorm(3)-pnorm(-3)
## [1] 0.9973002
##For normal distribution (µ=3 and σ=2):
pnorm(5,mean=3,sd=2)-pnorm(1,mean=3,sd=2)
## [1] 0.6826895
pnorm(1,lower.tail = FALSE) #upper-tail area P(Z > 1), the same as 1 - pnorm(1)
## [1] 0.1586553
#Given a probability, qnorm returns the corresponding quantile (the z-score when mean=0 and sd=1)
qnorm(0.5)
## [1] 0
qnorm(0.84,mean = 3,sd=2)
## [1] 4.988916
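qnorm is the inverse of pnorm, and a quantile of any normal distribution is a rescaled z-score. The sketch below checks both statements for the values used above.

```r
p <- 0.84

z <- qnorm(p)                       # quantile of the standard normal
x <- qnorm(p, mean = 3, sd = 2)     # quantile of the normal with mean 3 and sd 2

x_from_z <- 3 + 2 * z               # rescale the z-score: mu + sigma * z
p_back   <- pnorm(x, mean = 3, sd = 2)  # feeding the quantile back into pnorm

c(x = x, x_from_z = x_from_z, p_back = p_back)
```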
rnorm(50) # by default mean=0, and sd=1 (random numbers that follow standard normal distribution)
## [1] 0.16133446 -1.68170703 2.05590100 -0.07092908 1.94965713 -2.44762291
## [7] 0.87375041 -0.58907058 1.48745694 1.46253779 0.39347253 -1.15227629
## [13] -0.63493778 -0.22032546 -1.92770850 0.20250070 -0.05419261 -0.03957231
## [19] 1.52298877 0.28696503 0.43657338 0.20947924 1.01907027 0.19931156
## [25] 1.48741923 0.11318637 -0.91860083 -0.41719646 0.96604666 -1.14170588
## [31] 0.47178793 -0.49772583 1.48108091 0.06317800 0.17950954 -1.35421677
## [37] -0.38019194 -1.70238007 1.62189911 -0.59841234 0.39508433 1.58990696
## [43] -1.23158350 -0.65923622 -1.80246408 1.20146959 0.36019853 -1.40978246
## [49] 0.09649710 -0.36607454
Test <- rnorm(100,mean = 3,sd=2) #generate 100 random numbers that follow a normal distribution with mean=3 and sd=2
hist(Test)
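Because rnorm draws new random numbers on every run, the values and the histogram change each time. One way to make a run repeatable is to fix the seed first; the seed value below is arbitrary and chosen only for illustration.

```r
set.seed(2024)                          # arbitrary seed
draw1 <- rnorm(100, mean = 3, sd = 2)

set.seed(2024)                          # same seed again
draw2 <- rnorm(100, mean = 3, sd = 2)

identical(draw1, draw2)                 # the same seed gives the same numbers
c(mean = mean(draw1), sd = sd(draw1))   # sample values near, but not exactly, 3 and 2
```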
#Recitation 2.4: Generate a normal quantile plot and a histogram with a fitted normal curve
#Step 1: import the data set
rm(list = ls()) #Clean previous data
mydata <- read.csv("Ch3_Data set for normality testing.csv")
#Step 2: load the package car and generate the normal quantile plot with the qqPlot() function
#install.packages("car")
library(car)
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.3
####The Q-Q plot (normal quantile plot) for the first variable, Var.1, is produced by the code below
qqPlot(mydata$Var.1)
## [1] 36 83
library(moments)
skewness(mydata$Var.1)
## [1] -0.01835039
###The rule of thumb seems to be:
#If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
#If the skewness is between -1 and -0.5 (negatively/left skewed) or between 0.5 and 1 (positively/right skewed), the data are moderately skewed.
#If the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed), the data are highly skewed.
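The rule of thumb above can be written as a small helper. In this sketch both function names are hypothetical. The skewness is computed in base R with the moment formula m3 / m2^(3/2), which should agree with skewness() from the moments package, although that is stated here as an assumption rather than checked against the package.

```r
# Moment-based sample skewness: m3 / m2^(3/2), with m_k the k-th central moment
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^(3/2)
}

# Hypothetical helper that applies the rule of thumb to a skewness value
classify_skewness <- function(s) {
  if (abs(s) <= 0.5) {
    "fairly symmetrical"
  } else if (abs(s) <= 1) {
    "moderately skewed"
  } else {
    "highly skewed"
  }
}

classify_skewness(-0.01835039)                          # skewness reported for Var.1
classify_skewness(sample_skewness(c(1, 2, 3, 4, 20)))   # made-up right-skewed values
```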
####We use the following code to get the histogram with a fitted normal curve for the numeric variable Var.1
#install.packages("rcompanion")
library(rcompanion)
## Warning: package 'rcompanion' was built under R version 4.2.3
## Registered S3 method overwritten by 'DescTools':
## method from
## reorder.factor gdata
plotNormalHistogram(mydata$Var.1, col = "red", main = "Histogram with a normal curve",linecol = "blue", lwd = 4)
#Recitation 2.5: Linear Regression Analysis
#Step 1: import the data set
rm(list = ls()) #Clean previous data
beer <- read.csv("Ch3_Beer.csv")
#Step 2: compute the correlation coefficient between the two variables PercentAlcohol and Carbohydrates.
cor(beer$PercentAlcohol,beer$Carbohydrates)
## [1] 0.5181569
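For reference, cor() returns the Pearson correlation by default, which is the covariance divided by the product of the two standard deviations. The sketch below checks this on made-up values, not on the beer data.

```r
# Made-up values for illustration only
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

r_formula <- cov(x, y) / (sd(x) * sd(y))   # Pearson correlation from its definition
r_builtin <- cor(x, y)                     # built-in function

c(formula = r_formula, builtin = r_builtin)
```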
#Step 3: build a linear regression model
The formula for our linear regression is:
\[\text{PercentAlcohol}\ (y) = b_0 + b_1 \times \text{Carbohydrates}\ (x)\]
Using the data set beer, we build a linear regression model. Run the code below to create the linear regression model linearModel.
linearModel <- lm(PercentAlcohol~Carbohydrates, data=beer)
linearModel
##
## Call:
## lm(formula = PercentAlcohol ~ Carbohydrates, data = beer)
##
## Coefficients:
## (Intercept) Carbohydrates
## 3.4247 0.1478
summary(linearModel)
##
## Call:
## lm(formula = PercentAlcohol ~ Carbohydrates, data = beer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9904 -0.7421 -0.3002 0.4338 5.6218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42474 0.26095 13.12 < 2e-16 ***
## Carbohydrates 0.14780 0.02019 7.32 1.53e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.211 on 146 degrees of freedom
## Multiple R-squared: 0.2685, Adjusted R-squared: 0.2635
## F-statistic: 53.59 on 1 and 146 DF, p-value: 1.532e-11
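The estimated coefficients give the fitted line PercentAlcohol = 3.42474 + 0.14780 * Carbohydrates. The sketch below uses the numbers printed above to compute one fitted value and to check that, in a regression with a single predictor, R-squared equals the squared correlation. The Carbohydrates value of 10 is a hypothetical input chosen for illustration.

```r
b0 <- 3.42474      # intercept, copied from the summary() output
b1 <- 0.14780      # slope for Carbohydrates, copied from the summary() output
r  <- 0.5181569    # correlation, copied from the cor() output

carbs <- 10                          # hypothetical input value
predicted <- b0 + b1 * carbs         # fitted PercentAlcohol at that value
predicted

r^2                                  # compare with Multiple R-squared (0.2685)
```

With the fitted model in memory, predict(linearModel, newdata = data.frame(Carbohydrates = 10)) should give the same fitted value up to rounding of the coefficients.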
Using the plot function, we create a scatter plot by simply plotting the two variables x (Carbohydrates) and y (PercentAlcohol).
main is the title of the plot.
xlab is the label for the x-axis.
ylab is the label for the y-axis.
col is used to specify the color.
pch = 18 draws filled diamond points (blue here because col = "blue").
cex scales points or text relative to the default size (cex = 1); for example, cex = 0.6 draws them at 60% of the default.
The text function adds text to a plot; when no labels argument is given, as here, it labels each point with its row index.
The abline function draws a straight line on the plot, here the fitted regression line.
pos sets the position of the text relative to the specified (x,y) coordinates: 1, 2, 3 and 4 mean below, to the left of, above and to the right of the point, respectively.
plot(beer$Carbohydrates, beer$PercentAlcohol,
main = "Regression Model",
xlab = "Carbohydrates",
ylab = "Alcohol",
pch = 18,
col = "blue"
)
text(beer$Carbohydrates, beer$PercentAlcohol,
cex = 0.6,
pos = 4,
col = "red"
)
abline(lm(PercentAlcohol~ Carbohydrates, data=beer))