R Markdown

##Problem 2. This problem calculates the correlations between the input (predictive) attributes and the output (predictable) attribute in the a2-p2.csv dataset. Calculate the following correlations: correl(A1, A4), correl(A2, A4), correl(A3, A4).

library(readxl)
a <- read_excel("C:/Users/Baha/Downloads/a2-p2.xlsx")
##Correlation between variables A1 and A4
# cor.test() and cor() come from the stats package, which R attaches by default
cor.test(a$A1, a$A4)
## 
##  Pearson's product-moment correlation
## 
## data:  a$A1 and a$A4
## t = 1.5756, df = 98, p-value = 0.1183
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04048968  0.34300702
## sample estimates:
##       cor 
## 0.1571785

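##Conclusion The sample correlation between A1 and A4 is 0.1571785, a weak positive relationship. However, the p-value (0.1183) is greater than 0.05 and the 95% confidence interval (-0.0405, 0.3430) includes 0, so at the 5% level of significance we cannot reject the null hypothesis that the true correlation is zero.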
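The t statistic and p-value printed by cor.test() can be recovered from the sample correlation and the degrees of freedom. The following is a minimal sketch that copies r and df from the output above; the names `r`, `df`, `t_stat` and `p_val` are illustrative and not part of the original script.

```r
# Values copied from the cor.test() output for A1 and A4
r  <- 0.1571785   # sample correlation
df <- 98          # degrees of freedom, n - 2

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_val <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

round(t_stat, 4)   # should be close to the reported t = 1.5756
round(p_val, 4)    # should be close to the reported p-value = 0.1183
```

Because the p-value is above 0.05, the test does not reject a true correlation of zero, even though the sample estimate is positive.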

##Correlation between variables A2 and A4
cor.test(a$A2, a$A4)
## 
##  Pearson's product-moment correlation
## 
## data:  a$A2 and a$A4
## t = -1.2631, df = 98, p-value = 0.2096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.31514917  0.07163332
## sample estimates:
##        cor 
## -0.1265656

##Conclusion The sample correlation between A2 and A4 is -0.1265656, a weak negative relationship: in this sample, A4 tends to decrease slightly as A2 increases. However, the p-value (0.2096) is greater than 0.05 and the 95% confidence interval (-0.3151, 0.0716) includes 0, so at the 5% level of significance the correlation is not statistically significant.

##Correlation between variables A3 and A4
cor.test(a$A3, a$A4)
## 
##  Pearson's product-moment correlation
## 
## data:  a$A3 and a$A4
## t = 3.8463, df = 98, p-value = 0.0002134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1784321 0.5214805
## sample estimates:
##       cor 
## 0.3621576

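##Conclusion The sample correlation between A3 and A4 is 0.3621576, a moderate positive relationship. The p-value (0.0002134) is less than 0.05 and the 95% confidence interval (0.1784, 0.5215) excludes 0, so at the 5% level of significance the correlation is statistically significant.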

##Correlation matrix for all variables in the a2-p2 dataset
##Calculate the correlation matrix
correlation_matrix <- cor(a)
##Print the correlation matrix
print(correlation_matrix)
##            A1         A2         A3         A4
## A1  1.0000000 -0.1444047  0.2045155  0.1571785
## A2 -0.1444047  1.0000000 -0.2828461 -0.1265656
## A3  0.2045155 -0.2828461  1.0000000  0.3621576
## A4  0.1571785 -0.1265656  0.3621576  1.0000000

##Conclusion. The correlation matrix shows that A3 has the strongest correlation with A4 (0.3621576), followed by A1 with a weaker positive correlation (0.1571785). A2 has the weakest correlation with A4 in absolute value, and it is negative (-0.1265656).
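The ranking of the input attributes can be made explicit by sorting the A4 column on absolute value, so that a negative correlation is compared on strength rather than sign. This is a small sketch with the three values typed in from the matrix above; `cor_with_a4` and `ranked` are illustrative names.

```r
# Correlations with A4, copied from the A4 column of the printed matrix
cor_with_a4 <- c(A1 = 0.1571785, A2 = -0.1265656, A3 = 0.3621576)

# Order by absolute value so the sign does not affect the ranking
ranked <- cor_with_a4[order(abs(cor_with_a4), decreasing = TRUE)]

# Strongest to weakest association with A4
names(ranked)
```

With the data frame loaded, the same vector could be taken from `correlation_matrix[c("A1", "A2", "A3"), "A4"]` instead of being typed in.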

##Problem 3. This problem determines whether two nominal attributes are correlated, using the chi-square test. Consider the a2-p3.csv dataset. (1) Determine whether there is a correlation between attribute A1 and attribute A4.

##Using the chi-squared test, the correlation between attributes A1 and A4 is determined as follows.
library(readxl)
p <- read_excel("C:/Users/Baha/Downloads/a2-p3.xlsx")
## The contingency table is:
contingencytable <- table(p$A1, p$A4)
contingencytable
##         
##           No Yes
##   Middle  80 205
##   Old     36  77
##   Young    5  39
##Calculating the expected frequencies under independence: (row total * column total) / grand total
Expected <- outer(rowSums(contingencytable), colSums(contingencytable)) / sum(contingencytable)
round(Expected, 2)
##           No    Yes
## Middle 78.02 206.98
## Old    30.93  82.07
## Young  12.05  31.95
##Calculating the chi-squared test statistic is as follows
test_statistic <- sum((contingencytable - Expected)^2 / Expected)
round(test_statistic, 3)
## [1] 6.885
# Determining the degrees of freedom
df <- (dim(contingencytable)[1] - 1) * (dim(contingencytable)[2] - 1); df
## [1] 2
# Obtaining the p-value (upper tail of the chi-squared distribution)
p_value <- pchisq(test_statistic, df, lower.tail = FALSE)
signif(p_value, 2)
## [1] 0.032
# Interpreting the results
if (p_value < 0.05) {
  cat("There is a significant correlation between attribute A1 and A4.\n")
} else {
  cat("There is not enough evidence to conclude that there is a significant correlation between attribute A1 and A4.\n")
}
## There is a significant correlation between attribute A1 and A4.
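The test of independence can also be run with the built-in chisq.test(), which derives the expected counts from the row and column totals and reports the upper-tail p-value. The following is a minimal sketch that re-enters the counts from the A1 by A4 contingency table above by hand; the object names `tab_a1` and `result` are assumptions for illustration.

```r
# Observed counts typed in from the A1 by A4 contingency table (filled column by column)
tab_a1 <- matrix(c(80, 36, 5, 205, 77, 39), nrow = 3,
                 dimnames = list(A1 = c("Middle", "Old", "Young"),
                                 A4 = c("No", "Yes")))

# Pearson's chi-squared test of independence
# (the continuity correction only applies to 2 x 2 tables, so it is not used here)
result <- chisq.test(tab_a1)

result$expected    # expected counts: row total * column total / grand total
result$statistic   # chi-squared statistic, roughly 6.89
result$parameter   # degrees of freedom: (3 - 1) * (2 - 1) = 2
result$p.value     # upper-tail p-value, roughly 0.032
```

A p-value below 0.05 indicates that A1 and A4 are not independent at the 5% level of significance.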
(2) Determine whether there is a correlation between attribute A2 and attribute A4.
## The contingency table is:
contingencytable <- table(p$A2, p$A4)
contingencytable
##         
##           No Yes
##   High     8 103
##   Low     46  57
##   Middle  67 161
##Calculating the expected frequencies under independence: (row total * column total) / grand total
Expected <- outer(rowSums(contingencytable), colSums(contingencytable)) / sum(contingencytable)
round(Expected, 2)
##           No    Yes
## High   30.39  80.61
## Low    28.20  74.80
## Middle 62.42 165.58
##Calculating the chi-squared test statistic
test_statistic <- sum((contingencytable - Expected)^2 / Expected)
round(test_statistic, 3)
## [1] 38.651
# Determining the degrees of freedom
df <- (dim(contingencytable)[1] - 1) * (dim(contingencytable)[2] - 1); df
## [1] 2
# Obtaining the p-value (upper tail of the chi-squared distribution)
p_value <- pchisq(test_statistic, df, lower.tail = FALSE)
signif(p_value, 2)
## [1] 4e-09
# Interpreting the results
if (p_value < 0.05) {
  cat("There is a significant correlation between attribute A2 and A4.\n")
} else {
  cat("There is not enough evidence to conclude that there is a significant correlation between attribute A2 and A4.\n")
}
## There is a significant correlation between attribute A2 and A4.
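Since the same steps are repeated for each attribute, they can be wrapped in a small function and checked against chisq.test(). This is a sketch only: `chi_square_by_hand` is a hypothetical helper, not part of the original script, and the counts are re-entered by hand from the A2 by A4 contingency table above.

```r
# Hypothetical helper: chi-squared test of independence computed step by step
chi_square_by_hand <- function(observed) {
  # Expected counts under independence: row total * column total / grand total
  expected  <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  statistic <- sum((observed - expected)^2 / expected)
  df        <- (nrow(observed) - 1) * (ncol(observed) - 1)
  # Upper-tail probability of the chi-squared distribution
  p_value   <- pchisq(statistic, df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p_value = p_value)
}

# Observed counts typed in from the A2 by A4 contingency table (filled column by column)
tab_a2 <- matrix(c(8, 46, 67, 103, 57, 161), nrow = 3,
                 dimnames = list(A2 = c("High", "Low", "Middle"),
                                 A4 = c("No", "Yes")))

by_hand  <- chi_square_by_hand(tab_a2)
built_in <- chisq.test(tab_a2)

# The manual statistic and the built-in one should agree
all.equal(by_hand$statistic, unname(built_in$statistic))
```

The helper takes any two-way table of counts, so it could be called as `chi_square_by_hand(table(p$A1, p$A4))` once the data frame `p` is loaded.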