#This recitation covers constructing boxplots, histograms, two-way tables, normal quantile plots, histograms with a fitted normal curve, and linear regression lines, and calculating probabilities in a normal distribution.

##Recitation 2.1: Boxplot and histogram
#A study surveys the number of words spoken per day by 79 people, 37 of whom are male (GenderMale1 = 1) and 42 of whom are female (GenderMale1 = 2).

#Step 1: import the data set

rm(list = ls())  #Clean previous data
Daily_Words_Spoken <- read.csv("Ch3_Daily Words Spoken.csv")

#Step 2: understand the data with the summary function
##We can use the summary function to get descriptive statistics (the five-number summary and the mean) for every variable, including WordsPerDay.

summary(Daily_Words_Spoken)

##        ID        GenderMale1     WordsPerDay   
##  Min.   : 1.0   Min.   :1.000   Min.   :  695  
##  1st Qu.:20.5   1st Qu.:1.000   1st Qu.: 8346  
##  Median :40.0   Median :2.000   Median :12460  
##  Mean   :40.0   Mean   :1.532   Mean   :14186  
##  3rd Qu.:59.5   3rd Qu.:2.000   3rd Qu.:18050  
##  Max.   :79.0   Max.   :2.000   Max.   :36345

#Step 3: calculate the mean, median and standard deviation of WordsPerDay for all observations

mean(Daily_Words_Spoken$WordsPerDay)

## [1] 14186.01

median(Daily_Words_Spoken$WordsPerDay)

## [1] 12460

sd(Daily_Words_Spoken$WordsPerDay,na.rm = TRUE)

## [1] 7729.664

#Step 4: create a boxplot and histogram for WordsPerDay.

boxplot(Daily_Words_Spoken$WordsPerDay)

hist(Daily_Words_Spoken$WordsPerDay)
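By default, boxplot() draws points that lie more than 1.5 × IQR beyond the box as potential outliers. The sketch below applies that rule by hand to the quartiles printed in Step 2. The variable names are illustrative, and boxplot() works from hinges, which can differ slightly from the quartiles reported by summary(), so treat the fences as approximate.

```r
# Quartiles and maximum of WordsPerDay, copied from the summary() output in Step 2
q1 <- 8346
q3 <- 18050
max_words <- 36345

iqr <- q3 - q1                  # interquartile range
lower_fence <- q1 - 1.5 * iqr   # values below this are potential outliers
upper_fence <- q3 + 1.5 * iqr   # values above this are potential outliers

c(IQR = iqr, lower = lower_fence, upper = upper_fence)
max_words > upper_fence         # TRUE means the maximum lies beyond the upper fence
```

Because the maximum exceeds the upper fence, the boxplot should show at least one point above the upper whisker.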

#Step 5: create a pie chart and a bar chart of the gender counts

counts <- table(Daily_Words_Spoken$GenderMale1)
counts

## 
##  1  2 
## 37 42

pie(counts)

barplot(counts,xlab = "Gender", ylab = "Counts", col = "blue", border = "red")
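The charts above label the bars with the codes 1 and 2. As a minimal sketch, the counts can be given readable names and converted to proportions. The counts are copied from the table() output; the names Male and Female, the ylim value, and the percentage labels are illustrative choices.

```r
# Counts copied from the table() output above: 1 = male, 2 = female
counts <- c(Male = 37, Female = 42)

# Proportion of each gender in the sample
props <- counts / sum(counts)
round(props, 3)

# Bar chart with readable category names and percentage labels above the bars
bar_mid <- barplot(counts, xlab = "Gender", ylab = "Counts",
                   col = "blue", border = "red", ylim = c(0, 50))
text(bar_mid, counts, labels = paste0(round(100 * props, 1), "%"), pos = 3)
```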

#Step 6: split the data set into two data sets: male and female

male <- subset(Daily_Words_Spoken,GenderMale1 == 1)
male

##    ID GenderMale1 WordsPerDay
## 1   1           1       23871
## 2   2           1        5180
## 3   3           1        9951
## 4   4           1       12460
## 5   5           1       17155
## 6   6           1       10344
## 7   7           1        9811
## 8   8           1       12387
## 9   9           1       29920
## 10 10           1       21791
## 11 11           1        9789
## 12 12           1       31127
## 13 13           1        8572
## 14 14           1        6942
## 15 15           1        2539
## 16 16           1       36345
## 17 17           1        6858
## 18 18           1       24024
## 19 19           1        5488
## 20 20           1        9960
## 21 21           1       11118
## 22 22           1        4970
## 23 23           1       10710
## 24 24           1       15011
## 25 25           1        1569
## 26 26           1       23794
## 27 27           1       23689
## 28 28           1       11769
## 29 29           1       26846
## 30 30           1       17386
## 31 31           1        7987
## 32 32           1       25638
## 33 33           1         695
## 34 34           1        2366
## 35 35           1       16075
## 36 36           1       16789
## 37 37           1        9308
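The female data set follows the same pattern with GenderMale1 == 2. The sketch below uses a small hypothetical data frame so that it runs without the CSV file: the male rows copy the first three rows printed above, and the female values are made up for illustration.

```r
# Hypothetical toy data with the same columns as Daily_Words_Spoken.
# Male rows copy the first three rows printed above; female values are made up.
toy <- data.frame(
  ID          = 1:6,
  GenderMale1 = c(1, 1, 1, 2, 2, 2),
  WordsPerDay = c(23871, 5180, 9951, 10000, 14000, 12000)
)

# Same pattern as the male subset, using the female code 2
female <- subset(toy, GenderMale1 == 2)
female

# Alternative: split() returns both groups at once in a named list
by_gender <- split(toy, toy$GenderMale1)
names(by_gender)
```

With the real data, replacing toy with Daily_Words_Spoken should give the 42 female rows.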

#Step 7: calculate the mean of WordsPerDay for male and female (Hint: use the tapply function, or the group_by + summarize functions)

#tapply function:
tapply(Daily_Words_Spoken$WordsPerDay,Daily_Words_Spoken$GenderMale1,mean)

##        1        2 
## 14060.38 14296.69
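The hint also mentions group_by + summarize. A minimal sketch of that route, assuming the dplyr package is installed; the data frame is a hypothetical toy example (the female values are made up), and the column name mean_words is an illustrative choice.

```r
library(dplyr)  # assumes dplyr is installed: install.packages("dplyr")

# Hypothetical toy data; the female values are made up for illustration
toy <- data.frame(
  GenderMale1 = c(1, 1, 1, 2, 2, 2),
  WordsPerDay = c(23871, 5180, 9951, 10000, 14000, 12000)
)

# One row per gender code, with the group mean of WordsPerDay
group_means <- toy %>%
  group_by(GenderMale1) %>%
  summarise(mean_words = mean(WordsPerDay))
group_means

# Same group means with tapply, for comparison
tapply(toy$WordsPerDay, toy$GenderMale1, mean)
```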

#Recitation 2.2: Two-way table
#Case introduction: On April 15, 1912, the Titanic, the largest passenger liner in service at the time, sank after colliding with an iceberg during her maiden voyage, killing many passengers and crew. One of the reasons that the shipwreck resulted in such loss of life was that there were not enough lifeboats for the passengers and crew.
##The “Ch3_Titanic.csv” file contains data for 1309 of the real Titanic passengers. Each row represents one person. The columns describe attributes of the person: passenger class, whether they survived, their sex, and their age.
##Note that first class on a passenger ship is usually on the upper decks and third class is on the lower decks.

#Step 1: import the data set

rm(list = ls())  #Clean previous data
Titanic <- read.csv("Ch3_Titanic.csv")

#Step 2: preprocess the data set
#The variables pclass (passenger class) and survived (1 = survived, 0 = perished) are categorical. However, because they are coded as numbers, R reads them as numeric variables when the file is imported. Run the following R code to summarize all the variables:

summary(Titanic)

##      pclass         survived         sex                 age         
##  Min.   :1.000   Min.   :0.000   Length:1309        Min.   : 0.1667  
##  1st Qu.:2.000   1st Qu.:0.000   Class :character   1st Qu.:21.0000  
##  Median :3.000   Median :0.000   Mode  :character   Median :28.0000  
##  Mean   :2.295   Mean   :0.382                      Mean   :29.8811  
##  3rd Qu.:3.000   3rd Qu.:1.000                      3rd Qu.:39.0000  
##  Max.   :3.000   Max.   :1.000                      Max.   :80.0000  
##                                                     NA's   :263

#If a variable is categorical (a factor), summary() displays the categories and their counts. If the output shows a five-number summary, R is treating the variable as numeric. Here, “pclass”, “survived” and “age” are all treated as numeric by default, although only “age” is truly numeric.
#To convert “pclass” and “survived” to categorical variables, run the following code:

Titanic$pclass <- as.factor(Titanic$pclass)
Titanic$survived <- as.factor(Titanic$survived)
summary(Titanic)

##  pclass  survived     sex                 age         
##  1:323   0:809    Length:1309        Min.   : 0.1667  
##  2:277   1:500    Class :character   1st Qu.:21.0000  
##  3:709            Mode  :character   Median :28.0000  
##                                      Mean   :29.8811  
##                                      3rd Qu.:39.0000  
##                                      Max.   :80.0000  
##                                      NA's   :263
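as.factor() keeps the codes 0 and 1 as category names. As an optional sketch, factor() with explicit labels gives more readable output. The vector below is a hypothetical toy example; the labels follow the coding stated above (1 = survived, 0 = perished).

```r
# Hypothetical toy vector using the same coding as the survived column
survived_code <- c(0, 1, 1, 0, 0, 1)

# factor() with explicit levels and labels gives readable category names
survived_f <- factor(survived_code, levels = c(0, 1),
                     labels = c("perished", "survived"))
table(survived_f)
```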

#Step 3: Generate a two-way table for “pclass” and “survived”
#Run the following code to generate the two-way table with CrossTable() from the gmodels package.

#install.packages("gmodels")
library(gmodels)

## Warning: package 'gmodels' was built under R version 4.2.3

CrossTable(Titanic$pclass,Titanic$survived)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1309 
## 
##  
##                | Titanic$survived 
## Titanic$pclass |         0 |         1 | Row Total | 
## ---------------|-----------|-----------|-----------|
##              1 |       123 |       200 |       323 | 
##                |    29.411 |    47.587 |           | 
##                |     0.381 |     0.619 |     0.247 | 
##                |     0.152 |     0.400 |           | 
##                |     0.094 |     0.153 |           | 
## ---------------|-----------|-----------|-----------|
##              2 |       158 |       119 |       277 | 
##                |     1.017 |     1.645 |           | 
##                |     0.570 |     0.430 |     0.212 | 
##                |     0.195 |     0.238 |           | 
##                |     0.121 |     0.091 |           | 
## ---------------|-----------|-----------|-----------|
##              3 |       528 |       181 |       709 | 
##                |    18.411 |    29.788 |           | 
##                |     0.745 |     0.255 |     0.542 | 
##                |     0.653 |     0.362 |           | 
##                |     0.403 |     0.138 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       809 |       500 |      1309 | 
##                |     0.618 |     0.382 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

#Each inner cell of the two-way table has 5 entries: count, chi-square contribution, row %, column % and total %. The learning objective of this recitation is to use R to analyze a two-way table and its joint, marginal and conditional probabilities. There are 3 types of probabilities:
#1. A joint probability (the total % in the inner cells) is the probability that two events occur together: cell count / table total. All the joint probabilities add up to 1.
#For example, the probability that a passenger was in first class “and” survived is 200/1309 = 15.28%.
#2. A marginal probability describes one variable at a time: row or column total / table total.
#The marginal probabilities for pclass are in the last column, labeled Row Total, and the marginal probabilities for survived are in the last row, labeled Column Total.
#For example, what percentage of the Titanic passengers survived? Answer: 500/1309 = 38.20%.
#3. A conditional probability is the probability of an event (A) given that another event (B) has occurred: cell count / row or column total.
#GIVEN that a passenger survived (or not), the column % gives the probability that he/she was in pclass = 1, 2 or 3. For example, among all passengers who survived, 200/500 = 40.00% were in first class.
#GIVEN that a passenger was in pclass = 1, 2 or 3, the row % gives the probability that he/she survived or not. For example, among all 2nd class passengers, 119/277 = 42.96% survived.
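The three kinds of probability can also be computed directly in base R with prop.table(). This is a minimal sketch that rebuilds the table from the counts printed in the CrossTable output, so it runs without the CSV file; the object names are illustrative.

```r
# Counts copied from the CrossTable output above
tab <- matrix(c(123, 200,
                158, 119,
                528, 181),
              nrow = 3, byrow = TRUE,
              dimnames = list(pclass = c("1", "2", "3"),
                              survived = c("0", "1")))

joint <- prop.table(tab)                   # cell / table total
row_cond <- prop.table(tab, margin = 1)    # cell / row total: P(survived | pclass)
col_cond <- prop.table(tab, margin = 2)    # cell / column total: P(pclass | survived)
marg_pclass <- rowSums(tab) / sum(tab)     # marginal distribution of pclass
marg_survived <- colSums(tab) / sum(tab)   # marginal distribution of survived

round(joint["1", "1"], 4)        # first class and survived
round(marg_survived["1"], 4)     # survived overall
round(col_cond["1", "1"], 4)     # first class, given survived
round(row_cond["2", "1"], 4)     # survived, given second class
```

With the imported data, table(Titanic$pclass, Titanic$survived) should produce the same counts as tab.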

#Recitation 2.3: Calculate Probability in Normal Distribution.

pnorm(z) returns the area under the standard normal curve from -∞ to z, that is, P(Z ≤ z).

\[P(Z \leq 2.05)\]

pnorm(2.05)

## [1] 0.9798178

##pnorm(x, mean = µ, sd = σ) returns the area from -∞ to x under the normal curve with mean µ and standard deviation σ.

pnorm(6,mean = 3,sd=2)

## [1] 0.9331928

Check the 68-95-99.7 rule:

##approximately 68% of the observations fall within σ of µ;
##approximately 95% of the observations fall within 2σ of µ;
##approximately 99.7% of the observations fall within 3σ of µ.

##For standard normal distribution (µ=0 and σ=1):

pnorm(1)-pnorm(-1)

## [1] 0.6826895

pnorm(2)-pnorm(-2)

## [1] 0.9544997

pnorm(3)-pnorm(-3)

## [1] 0.9973002

##For normal distribution (µ=3 and σ=2):

pnorm(5,mean=3,sd=2)-pnorm(1,mean=3,sd=2)

## [1] 0.6826895
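The same check can be run for 1, 2 and 3 standard deviations at once. A minimal sketch for µ = 3 and σ = 2; the names mu, sigma and within_k are illustrative.

```r
mu <- 3
sigma <- 2

# Area within k standard deviations of the mean, for k = 1, 2, 3
within_k <- sapply(1:3, function(k) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
})
within_k

# The same areas from the standard normal distribution
pnorm(1:3) - pnorm(-(1:3))
```

The two results agree because the area within kσ of µ does not depend on the particular values of µ and σ.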

To get the area to the right of z, set lower.tail = FALSE:

pnorm(1,lower.tail = FALSE)

## [1] 0.1586553

#Given a probability p, qnorm(p) returns the z-score (quantile) that has area p to its left

qnorm(0.5)

## [1] 0

qnorm(0.84,mean = 3,sd=2)

## [1] 4.988916
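qnorm is the inverse of pnorm. The sketch below shows the round trip and the link between a quantile of a general normal distribution and the standard normal quantile; the variable names are illustrative.

```r
# qnorm undoes pnorm: the round trip returns the original value
p <- pnorm(6, mean = 3, sd = 2)        # area to the left of x = 6
x_back <- qnorm(p, mean = 3, sd = 2)   # quantile with that area to its left
x_back

# A quantile of a normal distribution is mu + sigma * (standard normal quantile)
3 + 2 * qnorm(0.84)
qnorm(0.84, mean = 3, sd = 2)
```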

Generate random numbers that follow a normal distribution with a chosen mean and sd

rnorm(50) # by default mean=0, and sd=1 (random numbers that follow standard normal distribution)

##  [1]  0.16133446 -1.68170703  2.05590100 -0.07092908  1.94965713 -2.44762291
##  [7]  0.87375041 -0.58907058  1.48745694  1.46253779  0.39347253 -1.15227629
## [13] -0.63493778 -0.22032546 -1.92770850  0.20250070 -0.05419261 -0.03957231
## [19]  1.52298877  0.28696503  0.43657338  0.20947924  1.01907027  0.19931156
## [25]  1.48741923  0.11318637 -0.91860083 -0.41719646  0.96604666 -1.14170588
## [31]  0.47178793 -0.49772583  1.48108091  0.06317800  0.17950954 -1.35421677
## [37] -0.38019194 -1.70238007  1.62189911 -0.59841234  0.39508433  1.58990696
## [43] -1.23158350 -0.65923622 -1.80246408  1.20146959  0.36019853 -1.40978246
## [49]  0.09649710 -0.36607454

Test <- rnorm(100, mean = 3, sd = 2) #generate 100 random numbers that follow a normal distribution with mean=3 and sd=2
hist(Test)
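Because rnorm draws new random numbers on every call, the values and the histogram change from run to run. A minimal sketch of making the draw reproducible with set.seed(); the seed value 123 is arbitrary.

```r
set.seed(123)                           # 123 is an arbitrary seed value
draw1 <- rnorm(100, mean = 3, sd = 2)

set.seed(123)                           # same seed, same random numbers
draw2 <- rnorm(100, mean = 3, sd = 2)

identical(draw1, draw2)
c(mean = mean(draw1), sd = sd(draw1))   # close to, but not exactly, 3 and 2
```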

#Recitation 2.4: generate normal quantile plot, and histogram with fitted normal curve
### Step 1: Import the dataset

rm(list = ls())  #Clean previous data
mydata <- read.csv("Ch3_Data set for normality testing.csv")

Step 2: Install package `car` and generate the normal quantile plot with the `qqPlot()` function

#install.packages("car")
library(car)

## Warning: package 'car' was built under R version 4.2.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.2.3

####The Q-Q plot (normal quantile plot) for the first variable, Var.1, is produced by the code below

qqPlot(mydata$Var.1)

## [1] 36 83
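If the car package is not available, base R can draw a similar plot with qqnorm() and qqline(). This is a sketch on simulated data standing in for mydata$Var.1; the seed, mean and sd are arbitrary assumptions.

```r
set.seed(1)                            # arbitrary seed, for reproducibility
x <- rnorm(100, mean = 50, sd = 10)    # simulated stand-in for mydata$Var.1

qq <- qqnorm(x, main = "Normal Q-Q plot")  # sample quantiles vs. theoretical quantiles
qqline(x, col = "blue", lwd = 2)           # reference line through the first and third quartiles
```

Points that stay close to the reference line suggest that the data are approximately normal.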

#install.packages("moments")
library(moments)
skewness(mydata$Var.1)

## [1] -0.01835039

###A common rule of thumb:
#If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
#If the skewness is between -1 and -0.5 (negatively/left skewed) or between 0.5 and 1 (positively/right skewed), the data are moderately skewed.
#If the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed), the data are highly skewed.
#Here the skewness is about -0.018, so Var.1 is fairly symmetrical.
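The rule of thumb can be applied in code. The sketch below computes a moment-based skewness in base R and classifies a value; skew_manual and skew_label are hypothetical helper names, and the formula is assumed (not verified here) to match the definition used by skewness() in the moments package.

```r
# Moment-based skewness: third central moment / (second central moment)^(3/2)
skew_manual <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^(3 / 2)
}

# Apply the rule of thumb to a skewness value
skew_label <- function(s) {
  if (abs(s) <= 0.5) {
    "fairly symmetrical"
  } else if (abs(s) <= 1) {
    "moderately skewed"
  } else {
    "highly skewed"
  }
}

skew_manual(c(1, 2, 3, 4, 5))   # symmetric data, so the skewness is 0
skew_label(-0.01835039)         # skewness of Var.1 reported above
```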

Step 3: Generate histogram with fitted normal curve for variable `Var.1`

####We use the following code to get the histogram with a fitted normal curve of numeric variable Var.1

#install.packages("rcompanion")
library(rcompanion)

## Warning: package 'rcompanion' was built under R version 4.2.3

## Registered S3 method overwritten by 'DescTools':
##   method         from 
##   reorder.factor gdata

plotNormalHistogram(mydata$Var.1, col = "red", main = "Histogram with a normal curve",linecol = "blue", lwd = 4)
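A similar figure can be drawn in base R without rcompanion. This is a sketch on simulated data standing in for mydata$Var.1; the seed and the distribution parameters are arbitrary assumptions.

```r
set.seed(1)                               # arbitrary seed
vals <- rnorm(200, mean = 50, sd = 10)    # simulated stand-in for mydata$Var.1
m <- mean(vals)                           # sample mean for the fitted curve
s <- sd(vals)                             # sample sd for the fitted curve

# freq = FALSE puts the histogram on the density scale so the curve is comparable
h <- hist(vals, freq = FALSE, col = "red", main = "Histogram with a normal curve")

# Overlay the normal density with the sample mean and sd
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "blue", lwd = 4)
```

The mean and sd are computed before calling curve() because inside curve() the symbol x refers to the plotting grid, not to the data.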

#Recitation 2.5: Linear Regression Analysis

Step 1: Import the data set “Ch3_Beer.csv”

rm(list = ls())  #Clean previous data
beer <- read.csv("Ch3_Beer.csv")

#Step 2: Compute the correlation coefficient between the two variables PercentAlcohol and Carbohydrates.

cor(beer$PercentAlcohol,beer$Carbohydrates)

## [1] 0.5181569

#Step 3: Build a linear regression model
#The formula for our linear regression is:

\[\text{PercentAlcohol}\;(y) = b_0 + b_1 \cdot \text{Carbohydrates}\;(x)\]

Using the data set beer, we build a linear regression model. Run the code below to create the linear regression model linearModel.

linearModel <- lm(PercentAlcohol~Carbohydrates, data=beer)
linearModel

## 
## Call:
## lm(formula = PercentAlcohol ~ Carbohydrates, data = beer)
## 
## Coefficients:
##   (Intercept)  Carbohydrates  
##        3.4247         0.1478

Write down the formula for the regression function ŷ = b0 + b1 ∙ x. Answer: ŷ = 3.4247 + 0.1478 ∙ x
What is the coefficient of determination R²? Run the code below to find it.

summary(linearModel)

## 
## Call:
## lm(formula = PercentAlcohol ~ Carbohydrates, data = beer)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9904 -0.7421 -0.3002  0.4338  5.6218 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.42474    0.26095   13.12  < 2e-16 ***
## Carbohydrates  0.14780    0.02019    7.32 1.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.211 on 146 degrees of freedom
## Multiple R-squared:  0.2685, Adjusted R-squared:  0.2635 
## F-statistic: 53.59 on 1 and 146 DF,  p-value: 1.532e-11
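In a simple linear regression with one predictor, R² equals the square of the correlation coefficient, which ties this output back to Step 2. The sketch below checks that with the printed values and uses the fitted coefficients for one prediction. The value Carbohydrates = 10 is a hypothetical input chosen for illustration.

```r
r  <- 0.5181569          # correlation from Step 2
b0 <- 3.42474            # intercept from summary(linearModel)
b1 <- 0.14780            # slope from summary(linearModel)

r^2                      # about 0.2685, the Multiple R-squared above

# Predicted PercentAlcohol for a hypothetical beer with Carbohydrates = 10
carbs <- 10
b0 + b1 * carbs
```

With the fitted model in memory, predict(linearModel, newdata = data.frame(Carbohydrates = 10)) should give the same prediction up to rounding of the coefficients.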

Step 4: Create a graphical representation of the linear regression model

Using the plot function, we create a scatter plot of the two variables x (Carbohydrates) and y (PercentAlcohol).
main is the title of the plot.
xlab is the label for the x-axis.
ylab is the label for the y-axis.
col specifies the color.
pch = 18 draws filled diamonds; with col = "blue" they are blue.
cex scales the size of points or text relative to the default (cex = 1); for example, cex = 0.6 makes them 0.6 times the default size.
The text function adds text to a plot.
The abline function adds a straight line, here the fitted regression line, to the plot.

Shapes of the points

pch = 0 => square; pch = 1 => circle; pch = 2 => triangle point up; pch = 3 => plus; pch = 4 => cross; pch = 5 => diamond; pch = 6 => triangle point down;
pch = 7 => square cross; pch = 8 => star; pch = 9 => diamond plus; pch = 10 => circle plus; pch = 11 => triangles up and down; pch = 12 => square plus; pch = 13 => circle cross; pch = 14 => square and triangle down;
pch = 15 => filled square; pch = 16 => filled circle; pch = 17 => filled triangle point up; pch = 18 => filled diamond; pch = 19 => solid circle; pch = 20 => bullet (smaller circle);
pch = 21 => filled circle; pch = 22 => filled square; pch = 23 => filled diamond; pch = 24 => filled triangle point up; pch = 25 => filled triangle point down.
Symbols 15 to 20 are drawn in the color given by col; symbols 21 to 25 have a border colored by col and a fill colored by bg.
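The symbols are easier to compare when drawn side by side. A minimal sketch that plots pch 0 to 25 on a grid; the grid layout, the sizes, and the choice of col = "blue" and bg = "red" are illustrative.

```r
pch_values <- 0:25
cols <- (pch_values %% 7) + 1       # grid column (1 to 7)
rows <- 4 - (pch_values %/% 7)      # grid row, top row first

# col sets the symbol (or border) color; bg sets the fill of symbols 21 to 25
plot(cols, rows, pch = pch_values, cex = 2, col = "blue", bg = "red",
     xlim = c(0.5, 7.5), ylim = c(0.5, 4.5),
     axes = FALSE, xlab = "", ylab = "", main = "pch = 0 to 25")
text(cols, rows, labels = pch_values, pos = 1, offset = 1)  # number under each symbol
```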

Position of the added texts

The pos argument (position) can be 1, 2, 3 or 4, which place the text below, to the left of, above and to the right of the specified (x, y) coordinates, respectively.

plot(beer$Carbohydrates, beer$PercentAlcohol,
    main = "Regression Model",
    xlab = "Carbohydrates",
    ylab = "Alcohol",
    pch = 18,
    col = "blue"
)
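# With no labels argument, text() labels each point with its row index
text(beer$Carbohydrates, beer$PercentAlcohol,
    cex = 0.6,
    pos = 4,
    col = "red"
)
# Add the fitted regression line to the scatter plot
abline(lm(PercentAlcohol ~ Carbohydrates, data = beer))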

Recitation 2- Data Visualization and Summary Measures