Previously on Stat412:

  • Multiple imputation

  • Cross validation

  • Polynomial regression

  • Encoding categorical data

Today’s focus is on Dimension Reduction and Categorical Data Analysis

Dimension Reduction

Dimension reduction techniques are methods used to reduce the number of features in high-dimensional datasets. In other words, it involves transforming a dataset with a large number of dimensions into a dataset with fewer dimensions, ideally without losing significant information. Some common dimension reduction methods include Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Linear Discriminant Analysis (LDA).

Principal Component Analysis (PCA)

This is the most commonly used dimension reduction technique. It creates a new set of variables that are linear combinations of the original variables while preserving the variability in the data. The goal is to reduce the dimensionality of the data while retaining most of the variance.

Let’s recall the Boston data set:

library(MASS)
## Warning: package 'MASS' was built under R version 4.3.3
data("Boston")
head(Boston)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
dim(Boston) #506 obs and 14 var
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

All variables are numeric. Note also that some variables appear skewed; for example, crim has a mean (3.61) much larger than its median (0.26).

Now, we can apply PCA to the Boston data set with the prcomp() command. As with most machine learning algorithms, we should scale the variables with the argument scale = TRUE so that each variable has a mean of 0 and a standard deviation of 1 before the principal components are calculated.

Remember the steps for PCA (a short base-R sketch of these steps follows the list):

  • Standardize your data set.

  • Calculate the covariance matrix (for standardized data this is the correlation matrix).

  • Find the eigenvectors and eigenvalues of the covariance matrix. An eigenvector is a direction and an eigenvalue is a number that indicates how much variance the data has in that direction.

  • Sort the eigenvectors according to their eigenvalues in decreasing order.

  • Choose the first k eigenvectors; these will be the new k dimensions.

  • Transform the original n-dimensional data points into k dimensions.
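
As promised, here is a minimal sketch of these steps in base R, using the Boston data loaded above. This is only an illustration: eigenvector signs are arbitrary, so the scores may differ in sign from those produced by prcomp() below.

X = scale(Boston)                 # step 1: standardize (mean 0, sd 1)
S = cov(X)                        # step 2: covariance matrix of the standardized data (= correlation matrix)
e = eigen(S)                      # step 3: eigenvalues and eigenvectors (eigen() returns them sorted, step 4)
k = 3                             # step 5: keep the first k eigenvectors
scores = X %*% e$vectors[, 1:k]   # step 6: project the data onto the k new dimensions
sqrt(e$values)                    # should match pca$sdev reported by prcomp() below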

pca = prcomp(Boston, scale = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.5585 1.2843 1.16142 0.94156 0.92244 0.81241 0.73172
## Proportion of Variance 0.4676 0.1178 0.09635 0.06332 0.06078 0.04714 0.03824
## Cumulative Proportion  0.4676 0.5854 0.68174 0.74507 0.80585 0.85299 0.89123
##                            PC8    PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.63488 0.5266 0.50225 0.4613 0.42777 0.36607 0.24561
## Proportion of Variance 0.02879 0.0198 0.01802 0.0152 0.01307 0.00957 0.00431
## Cumulative Proportion  0.92003 0.9398 0.95785 0.9730 0.98612 0.99569 1.00000

In the output above, we see the standard deviation, proportion of variance, and cumulative proportion for each principal component.

Let’s look at the component standard deviations (the square roots of the eigenvalues), already sorted in descending order, and at the loadings:

pca$sdev  # standard deviations of the components (eigenvalues are pca$sdev^2)
##  [1] 2.5585132 1.2843410 1.1614241 0.9415625 0.9224421 0.8124105 0.7317177
##  [8] 0.6348831 0.5265582 0.5022524 0.4612919 0.4277704 0.3660733 0.2456149
pca$rotation # loadings (eigenvectors) of each component
##                  PC1          PC2          PC3          PC4          PC5
## crim     0.242284451 -0.065873108  0.395077419  0.100366211 -0.004957659
## zn      -0.245435005 -0.148002653  0.394545713  0.342958421 -0.114495002
## indus    0.331859746  0.127075668 -0.066081913 -0.009626936  0.022583692
## chas    -0.005027133  0.410668763 -0.125305293  0.700406497  0.535197817
## nox      0.325193880  0.254276363 -0.046475549  0.053707583 -0.194605570
## rm      -0.202816554  0.434005810  0.353406095 -0.293357309  0.008320481
## age      0.296976574  0.260303205 -0.200823078 -0.078426326 -0.149750092
## dis     -0.298169809 -0.359149977  0.157068710  0.184747787  0.106219480
## rad      0.303412754  0.031149596  0.418510334 -0.051374381  0.230352185
## tax      0.324033052  0.008851406  0.343232194 -0.026810695  0.163425820
## ptratio  0.207679535 -0.314623061  0.000399092 -0.342036328  0.615707380
## black   -0.196638358  0.026481032 -0.361375914 -0.201741185  0.367460674
## lstat    0.311397955 -0.201245177 -0.161060336  0.242621217 -0.178358870
## medv    -0.266636396  0.444924411  0.163188735 -0.180297553  0.050659893
##                 PC6          PC7         PC8          PC9         PC10
## crim     0.22462703 -0.777083366  0.15740140 -0.254211798 -0.071384615
## zn       0.33574694  0.274178365 -0.38031404 -0.382899480  0.245579673
## indus    0.08082495  0.340273839  0.17174578 -0.627048264 -0.254827026
## chas    -0.16264906 -0.074075775 -0.03292700  0.018642967 -0.041706916
## nox      0.14899191  0.198092965  0.04745838  0.043024391 -0.211620349
## rm      -0.13108056 -0.074084938 -0.43761566  0.003666947 -0.526133916
## age      0.06086960 -0.118580363 -0.58810569  0.043265822  0.245647942
## dis     -0.01162335  0.104397844 -0.12823060  0.175802196 -0.299412026
## rad      0.13493732  0.137080107  0.07464872  0.463439397  0.115793486
## tax      0.18847146  0.313984433  0.07099212  0.179446555 -0.008366413
## ptratio -0.27901731 -0.001485608 -0.28346960 -0.274525949  0.160474164
## black    0.78590728 -0.074842780 -0.04444175  0.060975651 -0.146292237
## lstat    0.09197211 -0.083213083 -0.35748247  0.171810921  0.066647267
## medv     0.05402804  0.009964973  0.15230879 -0.070751083  0.575547284
##                 PC11        PC12         PC13         PC14
## crim    -0.071068781 -0.06327612 -0.097032312  0.059114176
## zn      -0.127709065  0.22112210  0.132375830 -0.096296807
## indus    0.273797614 -0.34840828 -0.083716854 -0.235472877
## chas    -0.009968402  0.01903975  0.049917454  0.023488966
## nox     -0.437475550  0.44909357 -0.524974687  0.087649148
## rm       0.223951923  0.12560554  0.049893596  0.007190515
## age     -0.329630928 -0.48633905  0.051462562 -0.038227027
## dis     -0.114600078 -0.49356822 -0.552292172  0.047124029
## rad      0.042213348 -0.01863641  0.006278474 -0.634975332
## tax      0.042794054 -0.17042179  0.242987756  0.698822190
## ptratio -0.099991841  0.23214842 -0.188347079  0.055738160
## black    0.039194858  0.04152885  0.021078199 -0.016165280
## lstat    0.683032690  0.18189209 -0.249489863  0.083143795
## medv     0.242001064 -0.09828580 -0.469629324  0.134127182

Looking at the loadings, the variable “crim” has its largest absolute loading on component 7, while “chas” loads most heavily on components 4 and 5.
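
A quick way to see, for every variable, which component it loads on most heavily (this is just a convenience check on the table above):

apply(abs(pca$rotation), 1, which.max)  # index of the PC with the largest absolute loading, per variable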

Visualization of PCA:

biplot(pca, scale = 0, cex = c(0.3, 0.7))  # with scale = 0 the arrows represent the loadings

How to decide the number of components?

  1. Select the number of components having an eigenvalue greater than 1 (the Kaiser criterion); a short check appears after the scree plot code below.

  2. Use scree plots.

library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.3
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(pca,ncp = 14)
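
For the first rule, we can simply count the eigenvalues above 1; with the standard deviations reported earlier, this gives three components:

eigenvalues = pca$sdev^2   # eigenvalues of the correlation matrix
sum(eigenvalues > 1)       # number of components with eigenvalue > 1 (3 here)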

Let’s select 3 components:

selected_pca=pca$x[,1:3]
head(selected_pca) 
##         PC1        PC2         PC3
## 1 -2.085280  0.4923660 -0.33565863
## 2 -1.372024 -0.1707548 -0.96500929
## 3 -2.374204  0.9131235 -0.08993678
## 4 -2.834974  0.1946773  0.06048549
## 5 -2.770174  0.4328708  0.06397874
## 6 -2.298644 -0.3283548 -0.44993218
dim(selected_pca)  # 506 observations, each with scores on 3 components
## [1] 506   3
cor_check = cor(selected_pca, method="pearson")
cor_check
##               PC1          PC2           PC3
## PC1  1.000000e+00 6.605001e-15 -4.002711e-16
## PC2  6.605001e-15 1.000000e+00  4.736769e-16
## PC3 -4.002711e-16 4.736769e-16  1.000000e+00

Note: Remember that one of the remedies for multicollinearity is dimension reduction. As expected, the components are uncorrelated (the off-diagonal correlations above are numerically zero).

Principal Component Regression

Principal component regression (PCR) performs Principal Component Analysis on the regressors and uses the resulting components as the new regressors. Because the components are orthogonal (uncorrelated), the computations become easier and more stable.

PCA in linear regression has been used to serve two basic goals:

  • reduce the number of dimensions

  • prevent the multicollinearity

Note: No matter how many PCs are actually used, the regression equation will always contain all of the variables in X (since each PC is a linear combination of the variables in X formed by an eigenvector). A coef() check of this appears after the PCR summary below.

library(pls)
## Warning: package 'pls' was built under R version 4.3.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
model = pcr(medv~., data=Boston, scale=TRUE, validation="CV")  #apply pcr

summary(model)  #RMSEP: Root Mean Squared Error of Prediction
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.206    7.293    6.903    5.617    5.604     5.15    5.126
## adjCV        9.206    7.292    6.891    5.610    5.628     5.14    5.119
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       5.150    5.115    5.159     5.163     5.117     4.937     4.861
## adjCV    5.143    5.108    5.150     5.154     5.106     4.927     4.851
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.13    58.15    67.71    74.31    80.73    85.79    89.91    92.95
## medv    37.42    45.59    63.59    64.78    69.70    70.05    70.05    70.56
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.08     96.78     98.21     99.51    100.00
## medv    70.57     70.89     71.30     73.21     74.06

Using M = 1 component captures only 47.13% of the variance (information) in the predictors. Using M = 3 increases this to 67.71%, and using all 13 components captures 100%.
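
Related to the note above: even if we keep only 3 components, the fitted PCR model still implies a coefficient for every one of the 13 predictors. One way to see this is to extract the implied coefficients with coef():

coef(model, ncomp = 3)  # one coefficient for each of the 13 original predictors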

par(mfrow=c(1,3))
validationplot(model)
validationplot(model, val.type="MSEP")  #MSEP: mean squared error of prediction
validationplot(model, val.type="R2")

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
## 
##     R2
set.seed(1)
train_ind = Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE) 
train  = Boston[train_ind,]
test_x = Boston[-train_ind, -14]
test_y = Boston[-train_ind, 14]


#use model to make predictions on a test set
model = pcr(medv~., data=train, scale=TRUE, validation="CV")
pred = predict(model, test_x, ncomp=3)

#calculate RMSE
sqrt(mean((pred - test_y)^2))
## [1] 5.649384

If we use 3 components, the test RMSE is about 5.65.
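
To see how the test error changes with the number of components, here is a small sketch using the objects defined above:

rmse_by_ncomp = sapply(1:13, function(k) {
  p = predict(model, test_x, ncomp = k)   # predictions with k components
  sqrt(mean((p - test_y)^2))              # test RMSE
})
round(rmse_by_ncomp, 2)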

Comparison of PCR with classical linear regression:

library(dplyr)
library(caret)  
set.seed(1)
train_ind = Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE) 
train  = Boston[train_ind, ]
test = Boston[-train_ind, ]

model = lm(medv ~., data = train)
summary(model)
## 
## Call:
## lm(formula = medv ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6891  -2.6859  -0.5393   1.8871  25.5015 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.334452   5.582317   5.792 1.42e-08 ***
## crim         -0.130344   0.034279  -3.802 0.000166 ***
## zn            0.052932   0.015536   3.407 0.000724 ***
## indus        -0.023633   0.064464  -0.367 0.714109    
## chas          3.253159   0.957335   3.398 0.000748 ***
## nox         -16.050622   4.059695  -3.954 9.13e-05 ***
## rm            4.095705   0.456716   8.968  < 2e-16 ***
## age          -0.012316   0.014639  -0.841 0.400679    
## dis          -1.636081   0.221466  -7.387 9.05e-13 ***
## rad           0.278750   0.069715   3.998 7.62e-05 ***
## tax          -0.011576   0.003892  -2.975 0.003116 ** 
## ptratio      -0.817708   0.142322  -5.745 1.84e-08 ***
## black         0.010540   0.002888   3.650 0.000298 ***
## lstat        -0.494600   0.055517  -8.909  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.675 on 393 degrees of freedom
## Multiple R-squared:  0.7474, Adjusted R-squared:  0.7391 
## F-statistic: 89.46 on 13 and 393 DF,  p-value: < 2.2e-16
pred = model %>%  predict(test)
pred
##         17         18         21         23         24         30         32 
## 21.1688793 17.1753534 12.6968098 16.1798960 13.9698327 21.1720709 18.0476643 
##         37         41         43         44         49         50         53 
## 22.6121537 35.2432337 25.4565332 24.9381469  8.7293878 16.8453308 27.6538427 
##         58         61         65         69         73         76         82 
## 33.0657243 17.6170132 23.1103489 17.4728905 24.8064986 23.9424957 27.0306435 
##         86         92        104        118        129        132        138 
## 27.9718424 27.6330339 20.7445633 23.6268309 19.2088078 19.4771875 19.6093608 
##        145        154        159        162        167        173        174 
##  8.6089508 16.7847009 27.9820632 36.0994981 36.6126852 22.8296782 29.0444214 
##        188        192        197        199        202        205        213 
## 33.8880618 30.3774500 36.1220488 34.7115703 29.4787421 43.9505929 23.4967080 
##        235        236        237        240        241        245        250 
## 32.2620437 25.2392448 30.4894557 28.3591645 27.2554706 15.6226082 24.3309210 
##        255        270        272        277        278        281        283 
## 23.8356867 26.5211681 27.8396482 36.5863508 35.9047516 38.6861122 40.8183217 
##        286        287        293        308        309        310        311 
## 27.4136825 20.4149840 32.7831636 33.2204608 28.5239075 23.5577286 18.7980456 
##        312        314        315        318        321        323        327 
## 27.1455724 25.4286846 25.3793012 18.3886545 25.1382982 23.0241484 24.0288252 
##        347        348        350        369        380        383        386 
## 14.4948424 25.6349727 22.6311726 22.8927592 16.6366449 13.3040134  7.9591663 
##        387        390        391        393        399        407        413 
##  5.5717784 13.9730318 16.9735732  9.6425054  5.9662802  7.2391219  0.8205434 
##        415        418        420        424        429        434        436 
## -5.5122007  6.1870867 14.6408508 12.7430413 14.1227450 16.7201880 13.1950824 
##        445        460        462        476        485        486        494 
## 11.2771605 18.3721848 20.1494107 15.8629210 19.4486860 22.0785962 21.1953072 
##        500 
## 18.8193384
RMSE = RMSE(pred, test$medv)
RMSE
## [1] 5.120283

With the original predictors (ordinary least squares), the test RMSE is 5.12, slightly lower than the 3-component PCR fit above.

Categorical Data Analysis

The data to be used is “Credit” from ISLR package. It contains information regarding the credit card balance of customers. The binary target variable is “Married” (yes or no). The information obtained by categorical data analysis could be utilized by a bank to identify customers that may be suitable for a new couples credit card plan, for example.

library(ISLR)
data("Credit")
Credit=Credit[,-1]  #drop the ID column
summary(Credit)
##      Income           Limit           Rating          Cards      
##  Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000  
##  1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000  
##  Median : 33.12   Median : 4622   Median :344.0   Median :3.000  
##  Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958  
##  3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000  
##  Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000  
##       Age          Education        Gender    Student   Married  
##  Min.   :23.00   Min.   : 5.00    Male :193   No :360   No :155  
##  1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40   Yes:245  
##  Median :56.00   Median :14.00                                   
##  Mean   :55.67   Mean   :13.45                                   
##  3rd Qu.:70.00   3rd Qu.:16.00                                   
##  Max.   :98.00   Max.   :20.00                                   
##             Ethnicity      Balance       
##  African American: 99   Min.   :   0.00  
##  Asian           :102   1st Qu.:  68.75  
##  Caucasian       :199   Median : 459.50  
##                         Mean   : 520.01  
##                         3rd Qu.: 863.00  
##                         Max.   :1999.00
head(Credit)
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
14.891 3606 283 2 34 11 Male No Yes Caucasian 333
106.025 6645 483 3 82 15 Female Yes Yes Asian 903
104.593 7075 514 4 71 11 Male No No Asian 580
148.924 9504 681 3 36 11 Female No No Asian 964
55.882 4897 357 2 68 16 Male No Yes Caucasian 331
80.180 8047 569 4 77 10 Male No No Caucasian 1151

Examining variables

  • One-way examination:
table(Credit$Gender)
## 
##   Male Female 
##    193    207

We have 193 males and 207 females in our data set. Additionally, we can examine the gender proportions via prop.table().

gen_tab=table(Credit$Gender)
prop.table(gen_tab)
## 
##   Male Female 
## 0.4825 0.5175

About 52% of the observations in the data set are females.

  • Two-way examination:
gen_mar=table(Credit$Gender,Credit$Married) #first input is row, second input is column. 
gen_mar
##         
##           No Yes
##    Male   76 117
##   Female  79 128

The table does not show the marginal totals. If you are interested in the marginal totals, you can use the margin.table() function.

margin.table(gen_mar,1)  #for row sums
## 
##   Male Female 
##    193    207
margin.table(gen_mar,2) #for column sums
## 
##  No Yes 
## 155 245

If you are interested in row percentages for a 2x2 table, you can use prop.table(data, 1). If you are interested in column percentages for a 2x2 table, you can use prop.table(data, 2).

prop.table(gen_mar) # gives cell percentage
##         
##              No    Yes
##    Male  0.1900 0.2925
##   Female 0.1975 0.3200
prop.table(gen_mar,1) # gives row percentage
##         
##                 No       Yes
##    Male  0.3937824 0.6062176
##   Female 0.3816425 0.6183575
prop.table(gen_mar,2) # gives column percentage
##         
##                 No       Yes
##    Male  0.4903226 0.4775510
##   Female 0.5096774 0.5224490
  • Three-way examination:
gen_mar_eth=table(Credit$Gender,Credit$Married,Credit$Ethnicity)
#first input is the first row category.
#second input is the second row category.
#third input is the column category.
ftable(gen_mar_eth)
##             African American Asian Caucasian
##                                             
##  Male  No                 20    19        37
##        Yes                29    28        60
## Female No                 27    13        39
##        Yes                23    42        63
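
If conditional proportions are more informative than raw counts, prop.table() also works on this three-way table; for example, the proportions of each Gender and Married combination within each ethnicity (margin = 3 is just one possible choice):

ftable(prop.table(gen_mar_eth, margin = 3))  # proportions within each Ethnicity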

Visualization of Tables

library(ggplot2)
library(ggmosaic)
## Warning: package 'ggmosaic' was built under R version 4.3.3
ggplot(data=Credit)+geom_mosaic(aes(x = product(Gender, Married)))+labs(x = "Married", y="Gender",title='Mosaic plot for Gender vs Married')
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
##   Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Comment on the plot above.

mosaicplot(table(Credit$Married,Credit$Ethnicity),main="Mosaic plot for Married vs Ethnicity",col=c("darkgreen","royalblue","orange"))

Performance Measures

  • Odds Ratio
odds.ratio = function(x, conf.level=0.95) {
  OR = x[1,1] * x[2,2] / ( x[2,1] * x[1,2] )   # sample odds ratio (cross-product ratio)
  SE = sqrt(sum(1/x))                          # standard error of log(OR)
  CI = exp(log(OR) + c(-1,1) * qnorm(0.5*(1-conf.level), lower.tail=F) * SE )  # Wald CI on the log scale
  list(estimator=OR,
       SE=SE,
       conf.interval=CI,
       conf.level=conf.level)
}
or=table(Credit$Gender,Credit$Married)
or
##         
##           No Yes
##    Male   76 117
##   Female  79 128
odds.ratio(or)
## $estimator
## [1] 1.052472
## 
## $SE
## [1] 0.2053671
## 
## $conf.interval
## [1] 0.7037232 1.5740531
## 
## $conf.level
## [1] 0.95

According to the result (1.05), the odds of being married are about 1.05 times as high for females as for males. Since the 95% confidence interval (0.70, 1.57) contains 1, this difference is negligible.

  • Relative Risk:
rr = (or[1, 1] / (or[1, 1] + or[1, 2])) / (or[2, 1] / (or[2, 1] + or[2, 2]))
rr  
## [1] 1.03181

Note that rr here compares the first column (No) across the rows: the proportion of unmarried customers is about 3% higher among males than among females. (A sketch for the married column follows.)
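
If instead we want the relative “risk” of being married (the Yes column) for females relative to males, we can compute it the same way:

rr_yes = (or[2, 2] / (or[2, 1] + or[2, 2])) / (or[1, 2] / (or[1, 1] + or[1, 2]))
rr_yes  # about 1.02: the proportion married is roughly 2% higher among females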

  • Yule’s Q:

Yule’s Q, also known as Yule’s coefficient of association, is a statistical measure commonly used for a 2×2 contingency table. It summarizes the association between two variables that each have two levels or categories.
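
For a 2×2 table with cell counts \(n_{11}, n_{12}, n_{21}, n_{22}\), it is defined as

\(Q = \frac{n_{11}n_{22} - n_{12}n_{21}}{n_{11}n_{22} + n_{12}n_{21}} = \frac{OR - 1}{OR + 1},\)

i.e., the odds ratio rescaled to lie between -1 and 1.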

q = ((or[1, 1] * or[2, 2]) - (or[1, 2] * or[2, 1])) / ((or[1, 1] * or[2, 2]) + (or[1, 2] * or[2, 1]))
q
## [1] 0.02556534

Since Q is very close to 0, we can say there is essentially no association between gender and marital status.
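
As a quick check of the identity above, Q can also be recomputed from the odds ratio estimated earlier with the odds.ratio() function defined above:

OR = odds.ratio(or)$estimator
(OR - 1) / (OR + 1)  # same value as q above (about 0.026)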

Statistical tests for categorical data analysis

  • Chi-square test:

The chi-square statistic is commonly used for testing relationships between categorical variables. The null hypothesis of the chi-square test is that no relationship exists between the categorical variables in the population; that is, they are independent.

gen_mar
##         
##           No Yes
##    Male   76 117
##   Female  79 128

The hypotheses are as follows:

\(H_0\): Married and gender are independent.

\(H_1\): Married and gender are not independent.

You can conduct the test using the chisq.test() function. The input of this function should be a table.

chisq.test(gen_mar)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gen_mar
## X-squared = 0.021415, df = 1, p-value = 0.8837

Since the p-value (0.8837) is greater than \(\alpha = 0.05\), we fail to reject \(H_0\). Therefore, at the 5% significance level, there is no significant relationship between gender and marital status.
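
If you want to inspect the expected cell counts under \(H_0\) (independence), which the chi-square statistic is based on, they are stored in the test object:

chisq.test(gen_mar)$expected  # expected counts under independence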

  • Cramer’s V

Cramer’s V is a measure of the strength of association between two nominal variables.

It ranges from 0 to 1, where 0 indicates no association between the two variables and 1 indicates a perfect association.

library(vcd)
## Loading required package: grid
## 
## Attaching package: 'vcd'
## The following objects are masked from 'package:ggmosaic':
## 
##     mosaic, spine
## The following object is masked from 'package:ISLR':
## 
##     Hitters
assocstats(gen_mar)
##                       X^2 df P(> X^2)
## Likelihood Ratio 0.062012  1  0.80334
## Pearson          0.062018  1  0.80333
## 
## Phi-Coefficient   : 0.012 
## Contingency Coeff.: 0.012 
## Cramer's V        : 0.012

The coefficient is 0.012, so there is essentially no association between gender and marital status (note that Cramer’s V is always non-negative, so it says nothing about the direction of the relationship).
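
As a cross-check, Cramer’s V can be computed directly from the Pearson chi-square statistic (without the continuity correction) as \(V = \sqrt{X^2 / (n(\min(r, c) - 1))}\):

X2 = chisq.test(gen_mar, correct = FALSE)$statistic  # Pearson statistic without Yates correction
n  = sum(gen_mar)                                    # total number of observations
sqrt(X2 / (n * (min(dim(gen_mar)) - 1)))             # about 0.012, matching assocstats()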

  • Correlation between a continuous and a categorical variable

If we are interested in the relationship between one continuous variable and one binary categorical variable, we can use the point biserial correlation. To do so, we use the biserial.cor() function from the ltm package.

library(ltm)
## Warning: package 'ltm' was built under R version 4.3.3
## Loading required package: msm
## Warning: package 'msm' was built under R version 4.3.3
## Loading required package: polycor
## Warning: package 'polycor' was built under R version 4.3.3
biserial.cor(Credit$Age,Credit$Married)
## [1] 0.0731355

As you can see, the coefficient is about 0.07, which indicates a very weak relationship between age and marital status.

Good luck in your Stat412 exam! :)