1). Explore the overall structure of the dataset using str(). When you read the data, set stringsAsFactors = TRUE to let R convert string variables to factors.
# load the vegas dataset into a dataframe
# header = TRUE indicates the first row contains column names
# note: stringsAsFactors = TRUE would convert the string columns to factors; it is left at the default here, so they are read as character
df <- read.csv('/Users/subhalaxmirout/CSC 532 - ML/vegashotels-1.csv', header = TRUE, sep = ";")
# explore its structure
str(df)
## 'data.frame': 504 obs. of 20 variables:
## $ User.country : chr "USA" "USA" "USA" "UK" ...
## $ Nr..reviews : int 11 119 36 14 5 31 45 2 24 12 ...
## $ Nr..hotel.reviews: int 4 21 9 7 5 8 12 1 3 7 ...
## $ Helpful.votes : int 13 75 25 14 2 27 46 4 8 11 ...
## $ Score : int 5 3 5 4 4 3 4 4 4 3 ...
## $ Period.of.stay : chr "Dec-Feb" "Dec-Feb" "Mar-May" "Mar-May" ...
## $ Traveler.type : chr "Friends" "Business" "Families" "Friends" ...
## $ Pool : chr "NO" "NO" "NO" "NO" ...
## $ Gym : chr "YES" "YES" "YES" "YES" ...
## $ Tennis.court : chr "NO" "NO" "NO" "NO" ...
## $ Spa : chr "NO" "NO" "NO" "NO" ...
## $ Casino : chr "YES" "YES" "YES" "YES" ...
## $ Free.internet : chr "YES" "YES" "YES" "YES" ...
## $ Hotel.name : chr "Circus Circus Hotel & Casino Las Vegas" "Circus Circus Hotel & Casino Las Vegas" "Circus Circus Hotel & Casino Las Vegas" "Circus Circus Hotel & Casino Las Vegas" ...
## $ Hotel.stars : chr "3" "3" "3" "3" ...
## $ Nr..rooms : int 3773 3773 3773 3773 3773 3773 3773 3773 3773 3773 ...
## $ User.continent : chr "North America" "North America" "North America" "Europe" ...
## $ Member.years : int 9 3 2 6 7 2 4 0 3 5 ...
## $ Review.month : chr "January" "January" "February" "February" ...
## $ Review.weekday : chr "Thursday" "Friday" "Saturday" "Friday" ...
2). Do any of the variables have missing values?
# check missing values
sum(is.na(df))
## [1] 0
3). Draw a histogram of the score variable. Interpret the histogram.
# draw a histogram of the Score variable with a title and axis labels
hist(df$Score, main = "Histogram of Score", xlab = "Score", ylab = "Frequency")
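The histogram shows that most reviews score 4 or 5, with relatively few low scores, so the distribution of Score is concentrated at the high end (left-skewed).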
4). Find the mode of “Score”.
# calculate mode and assign it
mode <- names(which.max(table(df$Score)))
mode
## [1] "5"
5). Use the “quantile” function to get the quantiles for score. What is the median of score?
# find the quantiles
quantiles <- quantile(df$Score)
quantiles
## 0% 25% 50% 75% 100%
## 1 4 4 5 5
# find median
median <- median(df$Score)
median
## [1] 4
6). Use the ifelse function to create a factor variable “sentiment” which takes the value “positive” if score is 4 or 5 and “negative” otherwise. You can use the method “factor” to create a factor from a character vector. Take a summary of “sentiment” to make sure that the frequencies of “positive” and “negative” categories are consistent with the frequency of the values in Score (e.g., the frequency for the “positive” sentiment should be equal to the combined frequency of 4 and 5 for the Score)
# sentiment
df$sentiment <- ifelse(df$Score >= 4, "positive", "negative")
df$sentiment <- as.factor(df$sentiment)
# check summary of sentiment
summary(df$sentiment)
## negative positive
## 113 391
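To confirm this is consistent with the Score frequencies (as the question asks), the counts can be compared against table(df$Score); a minimal check, using the column names shown in str(df) above:
# frequency of each Score value
table(df$Score)
# the combined count of scores 4 and 5 should equal the "positive" count (391)
sum(df$Score %in% c(4, 5))
# the count of scores 1-3 should equal the "negative" count (113)
sum(df$Score <= 3)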
7). Use appropriate statistical tests and plots to determine if sentiment is associated with any of the variables: “Pool”, “Gym”, “Free.internet”, “Period.of.stay”, “Traveler.type”, and “Hotel.stars” (assume the significance level alpha = 0.01). Interpret the plots and explain which of these variables are associated with sentiment.
# load the necessary library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# create a contingency table between sentiment and pool
table <- table(df$sentiment, df$Pool)
# perform chi-squared test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 12.748, df = 1, p-value = 0.0003565
# bar plot
# import ggplot
library(ggplot2)
ggplot(df, aes(x = sentiment, fill = Pool)) +
geom_bar(position = "dodge") +
xlab("Sentiment") +
ylab("Count") +
ggtitle("Bar Plot of Sentiment and Pool") +
scale_fill_discrete(name = "Pool")
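The p-value of 0.0003565 is below the significance level of 0.01, so we reject the null hypothesis of independence and conclude that Pool is associated with sentiment. The bar plot visualizes this difference in the positive/negative counts for hotels with and without a pool.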
# for variables sentiment and gym
# contingency table sentiment and Gym
table <- table(df$sentiment, df$Gym)
# perform chi-squared test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 0.0035647, df = 1, p-value = 0.9524
Based on the p value, we can say that there is not enough evidence to reject the null hypothesis that the variables “Sentiment” and “Gym” are independent.
The p-value of 0.9524 is greater than the significance level of 0.01, meaning the association between the two variables is not statistically significant.
In other words, we can say that the presence or absence of a gym in a hotel does not have a significant impact on the sentiment of the reviews.
# sentiment and gym
ggplot(df, aes(x = sentiment, fill = Gym)) +
geom_bar(position = "dodge") +
xlab("Sentiment") +
ylab("Count") +
ggtitle("Bar Plot of Sentiment and Gym") +
scale_fill_discrete(name = "Gym")
# sentiment vs Free.Internet
# contingency table for sentiment and Free.internet
table <- table(df$sentiment, df$Free.internet)
# perform chi-squared test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 9.4178, df = 1, p-value = 0.002149
As we can see, the p-value of 0.002149 is less than the significance level of 0.01, so we can reject the null hypothesis that the variables Sentiment and Free.Internet are independent.
It means that there is a significant association between these two variables, and it can be concluded that the availability of free internet has an impact on the sentiment of the travelers.
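The question also lists Period.of.stay, Traveler.type, and Hotel.stars. A sketch of the same chi-squared test and bar plot for those remaining variables (column names taken from str(df) above):
# chi-squared tests for the remaining variables
for (v in c("Period.of.stay", "Traveler.type", "Hotel.stars")) {
  cat("\n", v, "\n")
  print(chisq.test(table(df$sentiment, df[[v]])))
}
# example bar plot for one of them
ggplot(df, aes(x = sentiment, fill = Period.of.stay)) +
  geom_bar(position = "dodge") +
  xlab("Sentiment") +
  ylab("Count") +
  ggtitle("Bar Plot of Sentiment and Period of Stay")
Any variable whose p-value falls below 0.01 would be considered associated with sentiment.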
Problem 2: Exploring California Housing Dataset
1-(0.5 pt) Explore the structure of the dataset. How many numerical and categorical attributes are there in the dataset?
housing <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/housing-1.csv", sep = "," , header = TRUE)
# Use the str() function to explore the structure
str(housing)
## 'data.frame': 20640 obs. of 10 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num 129 1106 190 235 280 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : chr "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
The structure of the dataset shows 9 numerical attributes and 1 categorical attribute.
The numerical attributes are: “longitude”, “latitude”, “housing_median_age”, “total_rooms”, “total_bedrooms”, “population”, “households”, “median_income”, and “median_house_value”.
The categorical attribute is “ocean_proximity”.
2. Take summary statistics of the dataset. Do any of the variables have missing values?
summary(housing)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
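Yes: the summary shows that total_bedrooms has 207 missing values (NA's), while all other variables are complete.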
3-(0.5 pt) What percentage of the rows have missing values in one or more variables? You can use the function “complete.cases” to get a logical vector indicating whether each row is complete (has no missing value) or incomplete (has one or more missing values). Run ?complete.cases to get help on this function in R. Use the logical vector returned by complete.cases to count the number of incomplete cases.
#missing values
sapply(housing, function(x) sum(is.na(x)))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 207 0 0 0
## median_house_value ocean_proximity
## 0 0
# Calculate the number of complete cases
complete_cases <- sum(complete.cases(housing))
# Calculate the number of missing cases
missing_cases <- nrow(housing) - complete_cases
# Calculate the percentage of rows with missing values
percent_missing <- (missing_cases / nrow(housing)) * 100
# Print the result
cat("Percentage of rows with missing values:", percent_missing, "%")
## Percentage of rows with missing values: 1.002907 %
4. Remove the rows with missing values. You can use either “complete.cases” function or “na.omit” function for this purpose
# Remove rows with missing values using na.omit
housing_complete <- na.omit(housing)
#missing values
sapply(housing_complete, function(x) sum(is.na(x)))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 0 0 0 0
## median_house_value ocean_proximity
## 0 0
The result above shows no remaining missing values.
5. Draw a histogram and boxplot of the median_house_value.
# Load the ggplot2 library
library(ggplot2)
# Draw a histogram
ggplot(housing, aes(median_house_value)) +
geom_histogram(fill = "blue", alpha = 0.5, bins = 15) +
labs(x = "Median House Value", y = "Frequency") +
ggtitle("Histogram of Median House Value")
# Draw a boxplot
ggplot(housing, aes(x = "", y = median_house_value)) +
geom_boxplot(fill = "blue", alpha = 0.5) +
labs(x = "Median House Value", y = "") +
ggtitle("Boxplot of Median House Value")
6. Use Q-Q plot to examine if the distribution of median_house_value is approximately normal?
# Load the library for qqplot
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Create the Q-Q plot of median_house_value against a normal distribution
qqPlot(housing$median_house_value, main = "Q-Q Plot of Median House Value",
       ylab = "Median House Value")
The points deviate clearly from the straight reference line, so the distribution of median_house_value is not approximately normal: it is right-skewed, with the mean (206,856) well above the median (179,700).
7. Use IQR method to find outliers of the median_house_value attribute. How many outliers does this attribute have?
#Calculate the quartiles
Q1 <- quantile(housing$median_house_value, probs = 0.25)
Q2 <- quantile(housing$median_house_value, probs = 0.5)
Q3 <- quantile(housing$median_house_value, probs = 0.75)
#Next, we calculate the interquartile range (IQR) by subtracting Q1 from Q3:
#Calculate the IQR
IQR <- Q3 - Q1
#outliers
outliers <- subset(housing, housing$median_house_value < Q1 - 1.5 * IQR | housing$median_house_value > Q3 + 1.5 * IQR)
#Count the number of outliers
n_outliers <- nrow(outliers)
n_outliers
## [1] 1071
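So the median_house_value attribute has 1,071 outliers by the 1.5 × IQR rule, roughly 5% of the 20,640 rows.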
8. Do any of the numerical attributes have a positive or negative linear correlation with median_house_value? Use scatter plots and a correlation matrix.
library(dplyr)
library(tidyr)
housing_complete %>%
select(median_house_value, longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income) %>%
gather(variable, value, -median_house_value) %>%
ggplot(aes(x = value, y = median_house_value)) +
geom_point(color = 'steelblue', alpha = 0.35) +
facet_wrap(~variable, scales = "free_x") +
labs(title = "median_house_value Relationship with other Variables", x = element_blank())+
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
numeric_values <- housing_complete %>%
dplyr::select_if(is.numeric)
numeric_values <- numeric_values[complete.cases(numeric_values),] %>%
data.frame()
train_cor <- cor(numeric_values)
corrplot::corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt')
The correlation matrix shows the pairwise correlations between the numeric variables; a value near 1 indicates a strong positive linear relationship, while a value near -1 indicates a strong negative one.
From the matrix, median_house_value has a fairly strong positive correlation with median_income (0.69) and only a weak positive correlation with total_rooms (0.13).
It has weak negative correlations with latitude (-0.14) and longitude (-0.05), so median_income is the only attribute with a substantial linear relationship to median_house_value.
9. Is there a statistically significant difference between the mean of median_house_value across different levels of ocean_proximity? Use an appropriate plot and statistical test (with significance level alpha = 0.01).
# One way to test this is to use an analysis of variance (ANOVA) test, which tests the difference in mean values of a dependent variable (in this case, median_house_value) across multiple levels of an independent variable (in this case, ocean_proximity).
aov_test <- aov(median_house_value ~ ocean_proximity, data = housing)
summary(aov_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## ocean_proximity 4 6.544e+13 1.636e+13 1612 <2e-16 ***
## Residuals 20635 2.094e+14 1.015e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output shows the results of a one-way ANOVA test to determine if there is a statistically significant difference in the mean of median_house_value across different levels of ocean_proximity.
The F-value of 1612 and the very small p-value (< 2e-16), far below the significance level of 0.01, indicate a strong association between ocean_proximity and median_house_value: the mean of median_house_value differs significantly across the levels of ocean_proximity. The "***" significance code means the difference is significant even at the 0.001 level.
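The question also asks for an appropriate plot; a side-by-side boxplot of median_house_value across the levels of ocean_proximity is a natural choice. A minimal sketch:
# boxplot of median_house_value by ocean_proximity
ggplot(housing, aes(x = ocean_proximity, y = median_house_value)) +
  geom_boxplot(fill = "blue", alpha = 0.5) +
  labs(x = "Ocean Proximity", y = "Median House Value") +
  ggtitle("Median House Value by Ocean Proximity")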
10. Use ifelse function to create a new factor variable “Is_Inland” which takes the value “Yes” if the ocean_proximity is “Inland” and No” otherwise.
# Create the new factor variable "Is_Inland"
housing$Is_Inland <- ifelse(housing$ocean_proximity == "INLAND", "Yes", "No")
#head(housing)
table(housing$Is_Inland)
##
## No Yes
## 14089 6551
11. Use side by side box plot to visualize the relationship between “Is_Inland” and “median_house_value”. Is there a significant difference between the means of median_house_value for Inland vs not Inland houses?
# Load the ggplot2 library
library(ggplot2)
# Relabel ocean_proximity into two categories: Inland vs Not Inland
housing$Is_Inland_new <- ifelse(housing$ocean_proximity %in% c("INLAND"), "Inland", "Not Inland")
# Create the boxplot
#boxplot(median_house_value ~ Is_Inland, data = housing)
inland_plot <- ggplot(housing, aes(x = Is_Inland_new, y = median_house_value)) + geom_boxplot()
inland_plot
To determine whether the difference between the means of median_house_value for Inland vs Not Inland houses is significant, we perform a Welch two-sample t-test.
#oneway.test(median_house_value~Is_Inland_new, data=housing)
t.test(median_house_value~Is_Inland_new,alternative="two.sided", data=housing)
##
## Welch Two Sample t-test
##
## data: median_house_value by Is_Inland_new
## t = -93.704, df = 18987, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -122716.0 -117687.3
## sample estimates:
## mean in group Inland mean in group Not Inland
## 124805.4 245007.0
The p-value is far less than the significance level (α = 0.01), so we reject the null hypothesis and conclude that there is a statistically significant difference between the group means.
Problem 3: Exploring the Heart Disease Dataset
In this problem, you are going to explore the heart disease dataset from UCI. This dataset contains 76 attributes, but only 14 of them are relevant and used in publications. These 14 attributes have already been processed and extracted from the dataset.
1. Open these files and examine the data in them. Note that the files do not have a header and the missing values are marked by the “?” character. Each file contains the 14 attributes described here. Load each file into a dataframe (remember to set na.strings = "?" so that “?” is recognized as missing rather than as a data value).
# processed.cleveland.data
# Load the file into a data frame
df1 <- read.table("/Users/subhalaxmirout/CSC 532 - ML/processed.cleveland.data", header = F, na.strings = "?", sep = ",")
# Give the columns meaningful names
colnames(df1) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# Load the file into a data frame
df2 <- read.table("/Users/subhalaxmirout/CSC 532 - ML/processed.hungarian.data", header = F, na.strings = "?", sep = ",")
# Give the columns meaningful names
colnames(df2) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# Load the file into a data frame
df3 <- read.table("/Users/subhalaxmirout/CSC 532 - ML/processed.switzerland.data", header = F, na.strings = "?", sep = ",")
# Give the columns meaningful names
colnames(df3) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# Load the file into a data frame
df4 <- read.table("/Users/subhalaxmirout/CSC 532 - ML/processed.va.data", header = F, na.strings = "?", sep = ",")
# Give the columns meaningful names
colnames(df4) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
2. Use rbind function to combine the four dataframes into one dataframe and manually set the column names using colnames function
# Combine the four dataframes into one
df_all <- rbind(df1, df2, df3, df4)
# Set the column names
colnames(df_all) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
"exang", "oldpeak", "slope", "ca", "thal", "num")
head(df_all)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
3. Explore the overall structure of the dataset. What percentage of rows have missing values in one or more attributes?
missing_values <- colSums(is.na(df_all))
#percentage of missing values
percent_missing <- (missing_values / nrow(df_all)) * 100
#summarize the results
summary(percent_missing)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.05435 5.97826 13.65683 9.02174 66.41304
#Plot the results:
bp <- barplot(percent_missing,
main = "Percentage of Missing Values in Each Column",
col = c("lavender"),
ylab = "Percentage of Missing Values",
las = 2
)
text(bp, 0, round(percent_missing, 1),cex=1,pos=3)
#barplot(percent_missing, main="Percentage of Missing Values in Each Column", xlab="Column", ylab="Percentage of Missing Values")
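The per-column percentages above do not directly answer the question about rows; a sketch using complete.cases (as in Problem 2) gives the row-level percentage:
# percentage of rows with at least one missing value
incomplete_rows <- sum(!complete.cases(df_all))
cat("Percentage of incomplete rows:", round(incomplete_rows / nrow(df_all) * 100, 2), "%")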
4. Read the data description carefully. Decide which attributes are categorical and convert them to“factor” using factor function ( set the “labels” option to give meaningful names/labels to each level)
# Per the data description, the categorical attributes are:
# sex (#4), cp (#9), fbs (#16), restecg (#19), exang (#38), slope (#41), ca (#44), thal (#51), and num (#58)
# factor function
df_all$sex <- factor(df_all$sex)
df_all$cp <- factor(df_all$cp)
df_all$fbs <- factor(df_all$fbs)
df_all$restecg <- factor(df_all$restecg)
df_all$exang <- factor(df_all$exang)
df_all$slope <- factor(df_all$slope)
df_all$ca <- factor(df_all$ca)
df_all$thal <- factor(df_all$thal)
df_all$num <- factor(df_all$num)
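The labels option was not used above; a sketch of how meaningful labels could be attached, using the codings from the UCI data description quoted later in this report (the rest of the analysis keeps the original numeric codes):
# example: attach descriptive labels to a few of the categorical attributes
df_labeled <- df_all
df_labeled$sex <- factor(df_labeled$sex, levels = c(0, 1), labels = c("female", "male"))
df_labeled$cp <- factor(df_labeled$cp, levels = c(1, 2, 3, 4),
                        labels = c("typical angina", "atypical angina", "non-anginal pain", "asymptomatic"))
df_labeled$fbs <- factor(df_labeled$fbs, levels = c(0, 1), labels = c("<=120 mg/dl", ">120 mg/dl"))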
5. What are the median and mode of the age attribute?
median_age <- median(df_all$age)
#check
median_age
## [1] 54
#mode
age_table <- table(df_all$age)
mode_age <- names(which.max(age_table))
mode_age
## [1] "54"
6. Are female patients under-represented compared to male patients in this dataset?
# Count the number of male and female patients
male_count <- sum(df_all$sex == "1")
female_count <- sum(df_all$sex == "0")
# Calculate the proportion of male and female patients
male_prop <- male_count / nrow(df_all)
female_prop <- female_count / nrow(df_all)
# Print the results
cat("Male proportion:", round(male_prop*100),"%", "\n")
## Male proportion: 79 %
cat("Female proportion:", round(female_prop*100),"%", "\n")
## Female proportion: 21 %
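Yes: female patients are under-represented, making up only about 21% of the observations compared with 79% for male patients.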
7. Use “ifelse” and “factor” functions to create a new factor variable (call it “diagnosis”) which takes the value “No” if column 14 has the value zero and “Yes” otherwise. Replace column 14 of your dataframe with this new variable.
# Recode num as "No" (value 0) / "Yes" (any non-zero value)
df_all$num <- ifelse(df_all$num == 0, "No", "Yes")
# Replace column 14 of your dataframe with this new variable
colnames(df_all)[which(names(df_all) == "num")] <- "diagnosis"
df_all$diagnosis <- factor(df_all$diagnosis)
table(df_all$diagnosis)
##
## No Yes
## 411 509
8. (4 pts) Explore the relationship between the “diagnosis” variable and all other 13 attributes in the dataset. Which variables are associated with “diagnosis”? Use appropriate plots and statistical tests to answer this question. (Note: to get full credit for this question, you should use both an appropriate plot and an appropriate statistical test to examine the relationship between each of these variables and diagnosis. You should also interpret each plot). Please refer to the lectures to see what type of statistical test and plot you need to use for bivariate analysis based on the type of variables.
# copy the dataframe
df_all_new <- df_all
The dataframe has 920 observations of 14 variables. The factor variables are sex, cp, fbs, restecg, exang, slope, ca, thal, and diagnosis; age, trestbps, chol, thalach, and oldpeak are numeric. Below we examine the association between diagnosis and each of the other variables. Since diagnosis has two levels (Yes/No), we use boxplots for the numeric variables and mosaic plots (together with chi-squared tests) for the categorical ones.
Association between diagnosis and age
plot(df_all_new$age ~ df_all_new$diagnosis, col="sky blue", xlab="Diagnosis", ylab = "Age")
The median age of patients diagnosed with heart disease is about 55 or higher. There are some outliers in the heart disease = Yes group: diagnosed patients younger than about 35.
mytable_age <- table(df_all_new$age, df_all_new$diagnosis)
mosaicplot(mytable_age, ylab= "diagnosis", xlab="age", main = "Mosaic graph of age vs diagnosis", shade=TRUE)
The mosaic plot is quite cluttered, making it hard to tell which cells are significant, so let's bin the age column into categories.
# created age group
df_all_new <- df_all_new %>%
mutate(age_bracket = case_when(age >= 20 & age < 30 ~ "20+",
age >= 30 & age < 40 ~ "30+",
age >= 40 & age < 50 ~ "40+",
age >= 50 & age < 60 ~ "50+",
age >= 60 & age < 70 ~ "60+",
age >= 70 & age < 80 ~ "70+"
))
mytable_age_br <- table(df_all_new$age_bracket, df_all_new$diagnosis)
mosaicplot(mytable_age_br, ylab= "diagnosis", xlab="age", main = "Mosaic graph of age vs diagnosis", shade=TRUE)
The 60+ age bracket shows the highest proportion of heart-disease diagnoses.
Let's perform a chi-squared test to check the association.
chisq.test(mytable_age)
## Warning in chisq.test(mytable_age): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable_age
## X-squared = 125.91, df = 49, p-value = 1.079e-08
The p-value (1.079e-08) is much less than the significance level (0.05), so we reject the null hypothesis that age and diagnosis are independent and conclude that age and diagnosis are associated.
Association between sex and diagnosis
mytable_sex <- table(df_all_new$sex, df_all_new$diagnosis)
mosaicplot(mytable_sex, ylab= "diagnosis", xlab="sex", main = "Mosaic graph of sex vs diagnosis", shade=TRUE)
sex: sex (1 = male; 0 = female)
The mosaic plot shows that the sample contains more male than female patients, and that a larger proportion of the male patients are diagnosed with heart disease.
chisq.test(mytable_sex)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable_sex
## X-squared = 85.361, df = 1, p-value < 2.2e-16
The p-value (< 2.2e-16) is much less than the significance level (0.05), so we reject the null hypothesis that sex and diagnosis are independent and conclude that they are associated.
Association between cp and diagnosis
mytable_cp <- table(df_all_new$cp, df_all_new$diagnosis)
mosaicplot(mytable_cp, ylab= "diagnosis", xlab="cp", main = "Mosaic graph of cp vs diagnosis", shade=TRUE)
cp: chest pain type – Value 1: typical angina – Value 2: atypical angina – Value 3: non-anginal pain – Value 4: asymptomatic
The mosaic plot shows that asymptomatic chest pain (value 4) is the most common category and is associated with a higher proportion of heart-disease diagnoses than the other chest-pain types.
chisq.test(mytable_cp)
##
## Pearson's Chi-squared test
##
## data: mytable_cp
## X-squared = 268.35, df = 3, p-value < 2.2e-16
The p-value (< 2.2e-16) is much less than the significance level (0.05), so we reject the null hypothesis that cp and diagnosis are independent and conclude that chest pain type and diagnosis are associated.
Association trestbps and diagnosis
plot(df_all_new$trestbps ~ df_all_new$diagnosis, col="sky blue", xlab="Diagnosis", ylab = "trestbps")
The boxplot does not show a clear difference in resting blood pressure between patients diagnosed with heart disease and those who are not.
mytable_trestbps <- table(df_all_new$trestbps, df_all_new$diagnosis)
mosaicplot(mytable_trestbps, ylab= "diagnosis", xlab="trestbps", main = "Mosaic graph of Resting blood pressure vs diagnosis", shade=TRUE)
The mosaic plot is too cluttered to show how resting blood pressure affects the heart-disease diagnosis.
x <- df_all_new$trestbps
y <- df_all_new$diagnosis
# Plot with main and axis titles
plot(x, y, main = "Resting blood pressure vs diagnosis",
xlab = "Resting blood pressure", ylab = "Diagnosis",
pch = 19, frame = FALSE)
abline(lm(y ~ x, data = df_all_new), col = "blue")
## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
The scatter plot does not show a clear relationship between the two variables either.
chisq.test(mytable_trestbps)
## Warning in chisq.test(mytable_trestbps): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable_trestbps
## X-squared = 82.286, df = 60, p-value = 0.02969
The p-value is 0.02969, but R warns that the chi-squared approximation may be incorrect because the table of individual trestbps values has many sparse cells, so this result should be treated with caution.
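Since trestbps is numeric, a two-sample t-test comparing its mean across the two diagnosis groups is a more suitable test than a chi-squared test on the raw values; a minimal sketch:
# compare mean resting blood pressure between the diagnosis groups
t.test(trestbps ~ diagnosis, data = df_all_new)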
Association chol and diagnosis
plot(df_all_new$chol ~ df_all_new$diagnosis, col="sky blue", xlab="Diagnosis", ylab = "chol")
For patients diagnosed with heart disease, serum cholesterol shows a wider interquartile range and no outliers.
mytable_chol <- table(df_all_new$chol, df_all_new$diagnosis)
mosaicplot(mytable_chol, ylab= "diagnosis", xlab="chol", main = "Mosaic graph of chol vs diagnosis", shade=TRUE)
The mosaic plot is too cluttered to show how cholesterol levels affect the heart-disease diagnosis and does not reveal a clear relationship.
chisq.test(mytable_chol)
## Warning in chisq.test(mytable_chol): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable_chol
## X-squared = 327.64, df = 216, p-value = 1.411e-06
The p-value (1.411e-06) is much less than the significance level (0.05), so we reject the null hypothesis that chol and diagnosis are independent and conclude that cholesterol and diagnosis are associated, although the warning again indicates the chi-squared approximation may be unreliable for this sparse table.
Association fbs and diagnosis
mytable_fbs <- table(df_all_new$fbs, df_all_new$diagnosis)
mosaicplot(mytable_fbs, ylab= "diagnosis", xlab="fbs", main = "Mosaic graph of fbs vs diagnosis", shade=TRUE)
fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
The plot suggests that fasting blood sugar is associated with the diagnosis of heart disease.
chisq.test(mytable_fbs)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable_fbs
## X-squared = 16.112, df = 1, p-value = 5.972e-05
The p-value (5.972e-05) is much less than the significance level (0.05), so we reject the null hypothesis that fbs and diagnosis are independent and conclude that fasting blood sugar and diagnosis are associated.
Association restecg and diagnosis
mytable_restecg <- table(df_all_new$restecg, df_all_new$diagnosis)
mosaicplot(mytable_restecg, ylab= "diagnosis", xlab="restecg", main = "Mosaic graph of restecg vs diagnosis", shade=TRUE)
Resting electrocardiographic Result : Value 0: normal – Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) – Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
Most observations have restecg value 0 (normal). The plot suggests that ST-T wave abnormality (value 1) is associated with a higher proportion of heart-disease diagnoses.
chisq.test(mytable_restecg)
##
## Pearson's Chi-squared test
##
## data: mytable_restecg
## X-squared = 11.712, df = 2, p-value = 0.002863
The p-value (0.002863) is less than the significance level (0.05), so we reject the null hypothesis that restecg and diagnosis are independent and conclude that the resting electrocardiographic result and diagnosis are associated.
Association thalach and diagnosis
plot(df_all_new$thalach ~ df_all_new$diagnosis, col="sky blue", xlab="Diagnosis", ylab = "thalach")
Patients diagnosed with heart disease have a lower median thalach (maximum heart rate achieved) than those who are not.
mytable_thalach <- table(df_all_new$thalach, df_all_new$diagnosis)
mosaicplot(mytable_thalach, ylab= "diagnosis", xlab="thalach", main = "Mosaic graph of thalach vs diagnosis", shade=TRUE)
The mosaic plot is too cluttered to show how thalach levels affect the heart-disease diagnosis.
chisq.test(mytable_thalach)
## Warning in chisq.test(mytable_thalach): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable_thalach
## X-squared = 222.53, df = 118, p-value = 2.028e-08
thalach: maximum heart rate achieved
The p-value (2.028e-08) is much less than the significance level (0.05), so we reject the null hypothesis that thalach and diagnosis are independent and conclude that they are associated.
Association exang and diagnosis
mytable_exang <- table(df_all_new$exang, df_all_new$diagnosis)
mosaicplot(mytable_exang, ylab= "diagnosis", xlab="exang", main = "exang vs diagnosis", shade=TRUE)
exang: exercise induced angina (1 = yes; 0 = no)
The plot shows that exang = 0 is more common than exang = 1, and that patients with exercise-induced angina (exang = 1) have a noticeably higher proportion of heart-disease diagnoses.
chisq.test(mytable_exang)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable_exang
## X-squared = 184.02, df = 1, p-value < 2.2e-16
The p-value (< 2.2e-16) is less than the significance level (0.05), so we reject the null hypothesis that exang and diagnosis are independent and conclude that they are associated.
Association slope and diagnosis
mytable_slope <- table(df_all_new$slope, df_all_new$diagnosis)
mosaicplot(mytable_slope, ylab= "diagnosis", xlab="slope", main = "slope vs diagnosis", shade=TRUE)
slope: the slope of the peak exercise ST segment – Value 1: upsloping – Value 2: flat – Value 3: downsloping
The plot shows that slope value 2 (flat) is the most common category and is associated with a higher proportion of heart-disease diagnoses, while slope value 3 (downsloping) contributes relatively few observations.
chisq.test(mytable_slope)
##
## Pearson's Chi-squared test
##
## data: mytable_slope
## X-squared = 88.852, df = 2, p-value < 2.2e-16
The p-value (< 2.2e-16) is less than the significance level (0.05), so we reject the null hypothesis that slope and diagnosis are independent and conclude that they are associated.
Association ca and diagnosis
mytable_ca <- table(df_all_new$ca, df_all_new$diagnosis)
mosaicplot(mytable_ca, ylab= "diagnosis", xlab="ca", main = "ca vs diagnosis", shade=TRUE)
ca: number of major vessels (0-3) colored by flourosopy
The plot shows that ca value 0 has the most observations, and that values 2 and 3 are associated with a higher proportion of heart-disease diagnoses than value 0.
chisq.test(mytable_ca)
##
## Pearson's Chi-squared test
##
## data: mytable_ca
## X-squared = 73.2, df = 3, p-value = 8.806e-16
The p-value (8.806e-16) is less than the significance level (0.05), so we reject the null hypothesis that ca and diagnosis are independent and conclude that they are associated.
Association between thal and diagnosis
mytable_thal <- table(df_all_new$thal, df_all_new$diagnosis)
mosaicplot(mytable_thal, ylab= "diagnosis", xlab="thal", main = "thal vs diagnosis", shade=TRUE)
thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
The plot shows that thal value 6 (fixed defect) contributes relatively few observations, thal value 3 (normal) is associated with a lower proportion of heart-disease diagnoses, and thal value 7 (reversible defect) with a higher proportion.
chisq.test(mytable_thal)
##
## Pearson's Chi-squared test
##
## data: mytable_thal
## X-squared = 109.05, df = 2, p-value < 2.2e-16
The p-value (< 2.2e-16) is less than the significance level (0.05), so we reject the null hypothesis that thal and diagnosis are independent and conclude that thal and diagnosis are associated.
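One attribute, oldpeak (ST depression induced by exercise relative to rest), was not examined above. Since it is numeric, a comparable analysis would use a boxplot and a two-sample t-test; a sketch:
# boxplot of oldpeak by diagnosis
plot(df_all_new$oldpeak ~ df_all_new$diagnosis, col = "sky blue", xlab = "Diagnosis", ylab = "oldpeak")
# two-sample t-test of oldpeak across the diagnosis groups
t.test(oldpeak ~ diagnosis, data = df_all_new)
A p-value below the significance level would indicate that oldpeak and diagnosis are associated.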