Previously on Stat412:
Multiple imputation
Cross validation
Polynomial regression
Encoding categorical data
Today’s focus is on Dimension Reduction and Categorical Data Analysis
Dimension reduction techniques are methods used to reduce the number of features in high-dimensional datasets. In other words, it involves transforming a dataset with a large number of dimensions into a dataset with fewer dimensions, ideally without losing significant information. Some common dimension reduction methods include Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Linear Discriminant Analysis (LDA).
Principal Component Analysis (PCA) is the most commonly used dimension reduction technique. It creates a new set of variables that are linear combinations of the original variables while preserving as much of the variability in the data as possible. The goal is to reduce the dimensionality of the data while retaining most of the variance.
Let’s recall the Boston data set:
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| 0.02985 | 0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
## [1] 506 14
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
All variables are numeric. Note also that several variables are clearly right-skewed; for example, crim has a mean (3.61) far above its median (0.26).
Now we can apply PCA to the Boston data set with the prcomp() command (see the sketch after the steps below). Like most machine learning algorithms, PCA is sensitive to scale, so we set the argument scale. = TRUE so that each variable is standardized to mean 0 and standard deviation 1 before the principal components are computed.
Remember the steps for PCA:
Standardize the data set.
Calculate the covariance matrix.
Find the eigenvectors and eigenvalues of the covariance matrix. An eigenvector gives a direction, and its eigenvalue indicates how much of the variance in the data lies along that direction.
Sort the eigenvectors by their eigenvalues in decreasing order.
Choose the first k eigenvectors; these define the new k dimensions.
Transform the original data points onto these k dimensions.
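A minimal sketch of the commands behind the output below, assuming the analysis uses the Boston data from the MASS package with prcomp():

library(MASS) #Boston data
pca = prcomp(Boston, scale. = TRUE) #scale. = TRUE standardizes every variable before PCA
summary(pca) #standard deviation and proportion of variance for each component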
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5585 1.2843 1.16142 0.94156 0.92244 0.81241 0.73172
## Proportion of Variance 0.4676 0.1178 0.09635 0.06332 0.06078 0.04714 0.03824
## Cumulative Proportion 0.4676 0.5854 0.68174 0.74507 0.80585 0.85299 0.89123
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.63488 0.5266 0.50225 0.4613 0.42777 0.36607 0.24561
## Proportion of Variance 0.02879 0.0198 0.01802 0.0152 0.01307 0.00957 0.00431
## Cumulative Proportion 0.92003 0.9398 0.95785 0.9730 0.98612 0.99569 1.00000
In the output above, we see the standard deviation, proportion of variance, and cumulative proportion for each principal component.
Let’s look at the component standard deviations (the square roots of the eigenvalues) in decreasing order, followed by the loadings, which show how each original variable contributes to each component:
## [1] 2.5585132 1.2843410 1.1614241 0.9415625 0.9224421 0.8124105 0.7317177
## [8] 0.6348831 0.5265582 0.5022524 0.4612919 0.4277704 0.3660733 0.2456149
## PC1 PC2 PC3 PC4 PC5
## crim 0.242284451 -0.065873108 0.395077419 0.100366211 -0.004957659
## zn -0.245435005 -0.148002653 0.394545713 0.342958421 -0.114495002
## indus 0.331859746 0.127075668 -0.066081913 -0.009626936 0.022583692
## chas -0.005027133 0.410668763 -0.125305293 0.700406497 0.535197817
## nox 0.325193880 0.254276363 -0.046475549 0.053707583 -0.194605570
## rm -0.202816554 0.434005810 0.353406095 -0.293357309 0.008320481
## age 0.296976574 0.260303205 -0.200823078 -0.078426326 -0.149750092
## dis -0.298169809 -0.359149977 0.157068710 0.184747787 0.106219480
## rad 0.303412754 0.031149596 0.418510334 -0.051374381 0.230352185
## tax 0.324033052 0.008851406 0.343232194 -0.026810695 0.163425820
## ptratio 0.207679535 -0.314623061 0.000399092 -0.342036328 0.615707380
## black -0.196638358 0.026481032 -0.361375914 -0.201741185 0.367460674
## lstat 0.311397955 -0.201245177 -0.161060336 0.242621217 -0.178358870
## medv -0.266636396 0.444924411 0.163188735 -0.180297553 0.050659893
## PC6 PC7 PC8 PC9 PC10
## crim 0.22462703 -0.777083366 0.15740140 -0.254211798 -0.071384615
## zn 0.33574694 0.274178365 -0.38031404 -0.382899480 0.245579673
## indus 0.08082495 0.340273839 0.17174578 -0.627048264 -0.254827026
## chas -0.16264906 -0.074075775 -0.03292700 0.018642967 -0.041706916
## nox 0.14899191 0.198092965 0.04745838 0.043024391 -0.211620349
## rm -0.13108056 -0.074084938 -0.43761566 0.003666947 -0.526133916
## age 0.06086960 -0.118580363 -0.58810569 0.043265822 0.245647942
## dis -0.01162335 0.104397844 -0.12823060 0.175802196 -0.299412026
## rad 0.13493732 0.137080107 0.07464872 0.463439397 0.115793486
## tax 0.18847146 0.313984433 0.07099212 0.179446555 -0.008366413
## ptratio -0.27901731 -0.001485608 -0.28346960 -0.274525949 0.160474164
## black 0.78590728 -0.074842780 -0.04444175 0.060975651 -0.146292237
## lstat 0.09197211 -0.083213083 -0.35748247 0.171810921 0.066647267
## medv 0.05402804 0.009964973 0.15230879 -0.070751083 0.575547284
## PC11 PC12 PC13 PC14
## crim -0.071068781 -0.06327612 -0.097032312 0.059114176
## zn -0.127709065 0.22112210 0.132375830 -0.096296807
## indus 0.273797614 -0.34840828 -0.083716854 -0.235472877
## chas -0.009968402 0.01903975 0.049917454 0.023488966
## nox -0.437475550 0.44909357 -0.524974687 0.087649148
## rm 0.223951923 0.12560554 0.049893596 0.007190515
## age -0.329630928 -0.48633905 0.051462562 -0.038227027
## dis -0.114600078 -0.49356822 -0.552292172 0.047124029
## rad 0.042213348 -0.01863641 0.006278474 -0.634975332
## tax 0.042794054 -0.17042179 0.242987756 0.698822190
## ptratio -0.099991841 0.23214842 -0.188347079 0.055738160
## black 0.039194858 0.04152885 0.021078199 -0.016165280
## lstat 0.683032690 0.18189209 -0.249489863 0.083143795
## medv 0.242001064 -0.09828580 -0.469629324 0.134127182
The variable “crim” loads most heavily on component 7, while “chas” loads mainly on components 4 and 5.
Visualization of PCA:
biplot(pca, scale = 0,cex=c(0.3,0.7)) #scale=0: the arrows are scaled to represent the loadings
How to decide the number of components?
Select the components whose eigenvalue is greater than 1 (the Kaiser criterion).
Use a scree plot, as in the sketch below.
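A minimal sketch of a scree plot drawn with fviz_eig() from the factoextra package, together with the eigenvalues used by the greater-than-1 rule (the exact plotting call used in the notes is not shown):

library(factoextra)
fviz_eig(pca) #scree plot: percentage of variance explained by each component
pca$sdev^2 #eigenvalues; the Kaiser criterion keeps components with eigenvalue > 1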
Let’s select 3 components:
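A minimal sketch of the presumed commands behind the output below (the object name boston_pc3 is illustrative):

boston_pc3 = pca$x[, 1:3] #scores on the first three principal components
head(boston_pc3)
dim(boston_pc3)
cor(boston_pc3) #the components are uncorrelated by construction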
## PC1 PC2 PC3
## 1 -2.085280 0.4923660 -0.33565863
## 2 -1.372024 -0.1707548 -0.96500929
## 3 -2.374204 0.9131235 -0.08993678
## 4 -2.834974 0.1946773 0.06048549
## 5 -2.770174 0.4328708 0.06397874
## 6 -2.298644 -0.3283548 -0.44993218
## [1] 506 3
## PC1 PC2 PC3
## PC1 1.000000e+00 6.605001e-15 -4.002711e-16
## PC2 6.605001e-15 1.000000e+00 4.736769e-16
## PC3 -4.002711e-16 4.736769e-16 1.000000e+00
Note: Remember that one remedy for multicollinearity is dimension reduction. We expect no correlation between the components, and the correlation matrix above confirms this: the off-diagonal entries are numerically zero.
Principal Component Regression (PCR) performs PCA on the regressors and uses the resulting components as the new regressors. By doing so, the new explanatory variables are orthogonal (uncorrelated), which makes the computations easier and more stable.
PCA in linear regression has been used to serve two basic goals:
reduce the number of dimensions
prevent the multicollinearity
Note: No matter how many PCs are actually used, the regression equation will always contain all of the variables in X, since each PC is a linear combination of the variables in X formed by an eigenvector.
library(pls) #pcr() is provided by the pls package
set.seed(1)
model = pcr(medv~., data=Boston, scale=TRUE, validation="CV") #apply principal component regression with 10-fold cross-validation
summary(model) #RMSEP: Root Mean Squared Error of Prediction
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.206 7.293 6.903 5.617 5.604 5.15 5.126
## adjCV 9.206 7.292 6.891 5.610 5.628 5.14 5.119
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5.150 5.115 5.159 5.163 5.117 4.937 4.861
## adjCV 5.143 5.108 5.150 5.154 5.106 4.927 4.851
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.13 58.15 67.71 74.31 80.73 85.79 89.91 92.95
## medv 37.42 45.59 63.59 64.78 69.70 70.05 70.05 70.56
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.08 96.78 98.21 99.51 100.00
## medv 70.57 70.89 71.30 73.21 74.06
Using M = 1 component captures only 47.13% of the variance (information) in the predictors. In contrast, M = 3 increases this to 67.71%, and using all 13 components would capture 100%.
par(mfrow=c(1,3))
validationplot(model)
validationplot(model, val.type="MSEP") #MSEP: mean squared error of prediction
validationplot(model, val.type="R2")##
library(dplyr) #for the pipe operator %>%
library(caret) #for createDataPartition()
set.seed(1)
train_ind = Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train = Boston[train_ind,]
test_x = Boston[-train_ind, -14]
test_y = Boston[-train_ind, 14]
#use model to make predictions on a test set
model = pcr(medv~., data=train, scale=TRUE, validation="CV")
pred = predict(model, test_x, ncomp=3)
#calculate RMSE
sqrt(mean((pred - test_y)^2))
## [1] 5.649384
If we use 3 components, the test RMSE is about 5.65.
Comparison of PCR with classical linear regression:
library(dplyr)
library(caret)
set.seed(1)
train_ind = Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
train = Boston[train_ind, ]
test = Boston[-train_ind, ]
model = lm(medv ~., data = train)
summary(model)
##
## Call:
## lm(formula = medv ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6891 -2.6859 -0.5393 1.8871 25.5015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.334452 5.582317 5.792 1.42e-08 ***
## crim -0.130344 0.034279 -3.802 0.000166 ***
## zn 0.052932 0.015536 3.407 0.000724 ***
## indus -0.023633 0.064464 -0.367 0.714109
## chas 3.253159 0.957335 3.398 0.000748 ***
## nox -16.050622 4.059695 -3.954 9.13e-05 ***
## rm 4.095705 0.456716 8.968 < 2e-16 ***
## age -0.012316 0.014639 -0.841 0.400679
## dis -1.636081 0.221466 -7.387 9.05e-13 ***
## rad 0.278750 0.069715 3.998 7.62e-05 ***
## tax -0.011576 0.003892 -2.975 0.003116 **
## ptratio -0.817708 0.142322 -5.745 1.84e-08 ***
## black 0.010540 0.002888 3.650 0.000298 ***
## lstat -0.494600 0.055517 -8.909 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.675 on 393 degrees of freedom
## Multiple R-squared: 0.7474, Adjusted R-squared: 0.7391
## F-statistic: 89.46 on 13 and 393 DF, p-value: < 2.2e-16
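The test-set predictions and the RMSE reported below were presumably computed along these lines; a minimal sketch:

pred = predict(model, newdata = test) #predictions of medv from the fitted linear model
pred
sqrt(mean((pred - test$medv)^2)) #test RMSE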
## 17 18 21 23 24 30 32
## 21.1688793 17.1753534 12.6968098 16.1798960 13.9698327 21.1720709 18.0476643
## 37 41 43 44 49 50 53
## 22.6121537 35.2432337 25.4565332 24.9381469 8.7293878 16.8453308 27.6538427
## 58 61 65 69 73 76 82
## 33.0657243 17.6170132 23.1103489 17.4728905 24.8064986 23.9424957 27.0306435
## 86 92 104 118 129 132 138
## 27.9718424 27.6330339 20.7445633 23.6268309 19.2088078 19.4771875 19.6093608
## 145 154 159 162 167 173 174
## 8.6089508 16.7847009 27.9820632 36.0994981 36.6126852 22.8296782 29.0444214
## 188 192 197 199 202 205 213
## 33.8880618 30.3774500 36.1220488 34.7115703 29.4787421 43.9505929 23.4967080
## 235 236 237 240 241 245 250
## 32.2620437 25.2392448 30.4894557 28.3591645 27.2554706 15.6226082 24.3309210
## 255 270 272 277 278 281 283
## 23.8356867 26.5211681 27.8396482 36.5863508 35.9047516 38.6861122 40.8183217
## 286 287 293 308 309 310 311
## 27.4136825 20.4149840 32.7831636 33.2204608 28.5239075 23.5577286 18.7980456
## 312 314 315 318 321 323 327
## 27.1455724 25.4286846 25.3793012 18.3886545 25.1382982 23.0241484 24.0288252
## 347 348 350 369 380 383 386
## 14.4948424 25.6349727 22.6311726 22.8927592 16.6366449 13.3040134 7.9591663
## 387 390 391 393 399 407 413
## 5.5717784 13.9730318 16.9735732 9.6425054 5.9662802 7.2391219 0.8205434
## 415 418 420 424 429 434 436
## -5.5122007 6.1870867 14.6408508 12.7430413 14.1227450 16.7201880 13.1950824
## 445 460 462 476 485 486 494
## 11.2771605 18.3721848 20.1494107 15.8629210 19.4486860 22.0785962 21.1953072
## 500
## 18.8193384
## [1] 5.120283
With the original predictors (ordinary least squares on all 13 variables), the test RMSE is 5.12, slightly lower than the 5.65 obtained by PCR with 3 components.
The data set to be used is “Credit” from the ISLR package. It contains information regarding the credit card balances of customers. The binary variable of interest is “Married” (Yes or No). The information obtained from categorical data analysis could be used by a bank to identify customers who may be suitable for a new couples' credit card plan, for example.
## Income Limit Rating Cards
## Min. : 10.35 Min. : 855 Min. : 93.0 Min. :1.000
## 1st Qu.: 21.01 1st Qu.: 3088 1st Qu.:247.2 1st Qu.:2.000
## Median : 33.12 Median : 4622 Median :344.0 Median :3.000
## Mean : 45.22 Mean : 4736 Mean :354.9 Mean :2.958
## 3rd Qu.: 57.47 3rd Qu.: 5873 3rd Qu.:437.2 3rd Qu.:4.000
## Max. :186.63 Max. :13913 Max. :982.0 Max. :9.000
## Age Education Gender Student Married
## Min. :23.00 Min. : 5.00 Male :193 No :360 No :155
## 1st Qu.:41.75 1st Qu.:11.00 Female:207 Yes: 40 Yes:245
## Median :56.00 Median :14.00
## Mean :55.67 Mean :13.45
## 3rd Qu.:70.00 3rd Qu.:16.00
## Max. :98.00 Max. :20.00
## Ethnicity Balance
## African American: 99 Min. : 0.00
## Asian :102 1st Qu.: 68.75
## Caucasian :199 Median : 459.50
## Mean : 520.01
## 3rd Qu.: 863.00
## Max. :1999.00
| Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|
| 14.891 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 106.025 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 104.593 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 148.924 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |
| 80.180 | 8047 | 569 | 4 | 77 | 10 | Male | No | No | Caucasian | 1151 |
##
## Male Female
## 193 207
We have 193 males and 207 females in the data set. Additionally, we can examine the gender proportions via prop.table().
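A minimal sketch of the presumed calls behind the counts above and the proportions below, assuming the Credit data from the ISLR package:

library(ISLR) #Credit data
table(Credit$Gender) #counts per gender
prop.table(table(Credit$Gender)) #proportions per gender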
##
## Male Female
## 0.4825 0.5175
About 52% of the observations in the data set are female.
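The two-way table of Gender by Married shown below can be built with table(); a minimal sketch (later parts of the notes refer to this table as gen_mar):

gen_mar = table(Credit$Gender, Credit$Married) #rows: Gender, columns: Married
gen_mar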
##
## No Yes
## Male 76 117
## Female 79 128
The table does not show the marginal totals. If you are interested in the marginal sums, you can use the margin.table() function, as sketched below.
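A minimal sketch of the margin.table() calls matching the totals below:

margin.table(gen_mar, 1) #row totals: counts by gender
margin.table(gen_mar, 2) #column totals: counts by marital status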
##
## Male Female
## 193 207
##
## No Yes
## 155 245
If you are interested in the row percentages of a 2x2 table, you can use prop.table(data, 1); for the column percentages, use prop.table(data, 2). Calling prop.table(data) without a margin argument gives the joint (cell) proportions, which is what the first table below shows.
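A minimal sketch of the three prop.table() calls matching the output below:

prop.table(gen_mar) #joint (cell) proportions
prop.table(gen_mar, 1) #row proportions, within each gender
prop.table(gen_mar, 2) #column proportions, within each marital status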
##
## No Yes
## Male 0.1900 0.2925
## Female 0.1975 0.3200
##
## No Yes
## Male 0.3937824 0.6062176
## Female 0.3816425 0.6183575
##
## No Yes
## Male 0.4903226 0.4775510
## Female 0.5096774 0.5224490
gen_mar_eth=table(Credit$Gender,Credit$Married,Credit$Ethnicity)
#the first two inputs become the row variables in ftable()
#the third input becomes the column variable
ftable(gen_mar_eth)
##                  African American Asian Caucasian
##
## Male No 20 19 37
## Yes 29 28 60
## Female No 27 13 39
## Yes 23 42 63
library(ggmosaic) #geom_mosaic() for mosaic plots in ggplot2
ggplot(data=Credit)+geom_mosaic(aes(x = product(Gender, Married)))+labs(x = "Married", y="Gender",title='Mosaic plot for Gender vs Married')
Comment on the plot above.
mosaicplot(table(Credit$Married,Credit$Ethnicity),main="Mosaic plot for Married vs Ethnicity",col=c("darkgreen","royalblue","orange"))

Next, we define a small helper function that computes the odds ratio for a 2x2 table together with a Wald confidence interval:

odds.ratio = function(x, conf.level=0.95) {
OR = x[1,1] * x[2,2] / ( x[2,1] * x[1,2] ) #sample odds ratio
SE = sqrt(sum(1/x)) #standard error of log(OR)
CI = exp(log(OR) + c(-1,1) * qnorm(0.5*(1-conf.level), lower.tail=F) * SE ) #Wald interval on the log scale, back-transformed
list(estimator=OR,
SE=SE,
conf.interval=CI,
conf.level=conf.level)
}
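Applying the helper function to the Gender x Married table gives the output below; a minimal sketch, assuming the table was stored as gen_mar = table(Credit$Gender, Credit$Married):

gen_mar #the 2x2 table of Gender by Married
odds.ratio(gen_mar) #odds ratio with a 95% Wald confidence interval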
## No Yes
## Male 76 117
## Female 79 128
## $estimator
## [1] 1.052472
##
## $SE
## [1] 0.2053671
##
## $conf.interval
## [1] 0.7037232 1.5740531
##
## $conf.level
## [1] 0.95
According to the result (1.05), the odds of being married are about 1.05 times as high for females as for males.
## [1] 1.03181
This value corresponds to the ratio (76/193)/(79/207) ≈ 1.03, i.e. the proportion unmarried among males relative to females: the risk of being unmarried is about 3% higher for males than for females (equivalently, females are slightly more likely to be married).
Yule’s Q, also known as the coefficient of colligation, is a measure of association for 2×2 contingency tables, i.e. for two variables each having two levels or categories.
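For a 2×2 table with cell counts \(a, b, c, d\), Yule's Q is defined as \(Q = \frac{ad - bc}{ad + bc}\), which equals \((\mathrm{OR} - 1)/(\mathrm{OR} + 1)\) where OR is the odds ratio. The code below applies this formula to the Gender × Married table (the object named or in the code presumably holds that 2x2 table of counts).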
q = ((or[1, 1] * or[2, 2]) - (or[1, 2] * or[2, 1])) / ((or[1, 1] * or[2, 2]) + (or[1, 2] * or[2, 1]))
q
## [1] 0.02556534
Since Q is very close to zero, there is essentially no association between gender and marital status.
The Chi-Square statistic is commonly used for testing relationships between categorical variables. The null hypothesis of the Chi-Square test is that no relationship exists between the categorical variables in the population; that is, they are independent.
##
## No Yes
## Male 76 117
## Female 79 128
The hypotheses are as follows:
\(H_0\): Married and gender are independent.
\(H_1\): Married and gender are not independent.
You can conduct the test using the chisq.test() function. Its input should be a table.
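A minimal sketch of the presumed call (gen_mar is the Gender × Married table):

chisq.test(gen_mar) #for 2x2 tables, Yates' continuity correction is applied by default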
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gen_mar
## X-squared = 0.021415, df = 1, p-value = 0.8837
Since the p-value (0.8837) is greater than \(\alpha = 0.05\), we fail to reject \(H_0\): at the 5% significance level there is no evidence of a relationship between gender and marital status.
Cramer’s V is a measure of the strength of association between two nominal variables.
It ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association between the two variables.
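The output below appears to come from assocstats() in the vcd package, which reports chi-squared statistics together with the phi coefficient, the contingency coefficient, and Cramer's V; a minimal sketch:

library(vcd)
assocstats(gen_mar) #association measures for the Gender x Married table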
## X^2 df P(> X^2)
## Likelihood Ratio 0.062012 1 0.80334
## Pearson 0.062018 1 0.80333
##
## Phi-Coefficient : 0.012
## Contingency Coeff.: 0.012
## Cramer's V : 0.012
The coefficient is 0.012, so the association between gender and marital status is negligible (Cramer's V is unsigned, so it carries no information about direction).
If we are interested in the relationship between one continuous variable and one binary categorical variable, we can use the point-biserial correlation. To do so, we use the biserial.cor() function from the ltm package.
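A minimal sketch; the continuous variable actually paired with Married to obtain the value below is not shown in the notes, so Balance is assumed here purely for illustration:

library(ltm)
biserial.cor(Credit$Balance, Credit$Married) #point-biserial correlation; the level argument controls which category is the reference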
## [1] 0.0731355
As you can see, the coefficient is 0.07, which indicates a very weak relationship between the two variables.
Good luck in your Stat412 exam! :)