1). Explore the overall structure of the dataset using str(). When you read the data, set stringsAsFactors = TRUE to let R convert string variables to factors.
# load the vegas dataset into a dataframe
# header = TRUE indicates the first row contains column names
# note: stringsAsFactors = TRUE would convert the string columns to factors; it is left at the default here, so they are read as character
df <- read.csv('/Users/subhalaxmirout/CSC 532 - ML/vegashotels-1.csv', header = TRUE, sep = ";")
# explore its structure
str(df)
## 'data.frame': 504 obs. of 20 variables:
## $ User.country : chr "USA" "USA" "USA" "UK" ...
## $ Nr..reviews : int 11 119 36 14 5 31 45 2 24 12 ...
## $ Nr..hotel.reviews: int 4 21 9 7 5 8 12 1 3 7 ...
## $ Helpful.votes : int 13 75 25 14 2 27 46 4 8 11 ...
## $ Score : int 5 3 5 4 4 3 4 4 4 3 ...
## $ Period.of.stay : chr "Dec-Feb" "Dec-Feb" "Mar-May" "Mar-May" ...
## $ Traveler.type : chr "Friends" "Business" "Families" "Friends" ...
## $ Pool : chr "NO" "NO" "NO" "NO" ...
## $ Gym : chr "YES" "YES" "YES" "YES" ...
## $ Tennis.court : chr "NO" "NO" "NO" "NO" ...
## $ Spa : chr "NO" "NO" "NO" "NO" ...
## $ Casino : chr "YES" "YES" "YES" "YES" ...
## $ Free.internet : chr "YES" "YES" "YES" "YES" ...
## $ Hotel.name : chr "Circus Circus Hotel & Casino Las Vegas" "Circus Circus Hotel & Casino Las Vegas" "Circus Circus Hotel & Casino Las Vegas" "Circus Circus Hotel & Casino Las Vegas" ...
## $ Hotel.stars : chr "3" "3" "3" "3" ...
## $ Nr..rooms : int 3773 3773 3773 3773 3773 3773 3773 3773 3773 3773 ...
## $ User.continent : chr "North America" "North America" "North America" "Europe" ...
## $ Member.years : int 9 3 2 6 7 2 4 0 3 5 ...
## $ Review.month : chr "January" "January" "February" "February" ...
## $ Review.weekday : chr "Thursday" "Friday" "Saturday" "Friday" ...
2). Do any of the variables have missing values?
# check missing values
sum(is.na(df))
## [1] 0
3). Draw a histogram of the score variable. Interpret the histogram.
# draw a histogram of the Score variable with a title and axis labels
hist(df$Score, main = "Histogram of Score", xlab = "Score", ylab = "Frequency")
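The histogram shows that most reviews score 4 or 5, with relatively few low scores, so the distribution of Score is concentrated at the high end (left-skewed).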
4). Find the mode of “Score”.
# calculate mode and assign it
mode <- names(which.max(table(df$Score)))
mode
## [1] "5"
5). Use the “quantile” function to get the quantiles for score. What is the median of score?
# find the quantiles
quantiles <- quantile(df$Score)
quantiles
## 0% 25% 50% 75% 100%
## 1 4 4 5 5
# find median
median <- median(df$Score)
median
## [1] 4
6). Use the ifelse function to create a factor variable “sentiment” which takes the value “positive” if score is 4 or 5 and “negative” otherwise. You can use the method “factor” to create a factor from a character vector. Take a summary of “sentiment” to make sure that the frequencies of “positive” and “negative” categories are consistent with the frequency of the values in Score (e.g., the frequency for the “positive” sentiment should be equal to the combined frequency of 4 and 5 for the Score)
# sentiment
df$sentiment <- ifelse(df$Score >= 4, "positive", "negative")
df$sentiment <- as.factor(df$sentiment)
# check summary of sentiment
summary(df$sentiment)
## negative positive
## 113 391
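To confirm this is consistent with the Score frequencies (as the question asks), the counts can be compared against table(df$Score); a minimal check, using the column names shown in str(df) above:
# frequency of each Score value
table(df$Score)
# the combined count of scores 4 and 5 should equal the "positive" count (391)
sum(df$Score %in% c(4, 5))
# the count of scores 1-3 should equal the "negative" count (113)
sum(df$Score <= 3)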
7). Use appropriate statistical tests and plots to determine if sentiment is associated with any of the variables: “Pool”, “Gym”, “Free.internet”, “Period.of.stay”, “Traveler.type”, and “Hotel.stars” (assume the significance level alpha = 0.01). Interpret the plots and explain which of these variables are associated with sentiment.
# load the necessary library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# create a contingency table between sentiment and pool
table <- table(df$sentiment, df$Pool)
# perform chi-squared test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 12.748, df = 1, p-value = 0.0003565
# bar plot
# import ggplot
library(ggplot2)
ggplot(df, aes(x = sentiment, fill = Pool)) +
geom_bar(position = "dodge") +
xlab("Sentiment") +
ylab("Count") +
ggtitle("Bar Plot of Sentiment and Pool") +
scale_fill_discrete(name = "Pool")
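The p-value of 0.0003565 is below the significance level of 0.01, so we reject the null hypothesis of independence and conclude that Pool is associated with sentiment. The bar plot visualizes this difference in the positive/negative counts for hotels with and without a pool.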
# for variables sentiment and gym
# contingency table sentiment and Gym
table <- table(df$sentiment, df$Gym)
# perform chi-squared test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 0.0035647, df = 1, p-value = 0.9524
Based on the p value, we can say that there is not enough evidence to reject the null hypothesis that the variables “Sentiment” and “Gym” are independent.
The p-value of 0.9524 is greater than the significance level of 0.01, meaning the association between the two variables is not statistically significant.
In other words, we can say that the presence or absence of a gym in a hotel does not have a significant impact on the sentiment of the reviews.
# sentiment and gym
ggplot(df, aes(x = sentiment, fill = Gym)) +
geom_bar(position = "dodge") +
xlab("Sentiment") +
ylab("Count") +
ggtitle("Bar Plot of Sentiment and Gym") +
scale_fill_discrete(name = "Gym")
# sentiment vs Free.Internet
# contingency table for sentiment and Free.internet
table <- table(df$sentiment, df$Free.internet)
# perform chi-squared test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 9.4178, df = 1, p-value = 0.002149
As we can see, the p-value of 0.002149 is less than the significance level of 0.01, so we can reject the null hypothesis that the variables Sentiment and Free.Internet are independent.
It means that there is a significant association between these two variables, and it can be concluded that the availability of free internet has an impact on the sentiment of the travelers.
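The question also lists Period.of.stay, Traveler.type, and Hotel.stars. A sketch of the same chi-squared test and bar plot for those remaining variables (column names taken from str(df) above):
# chi-squared tests for the remaining variables
for (v in c("Period.of.stay", "Traveler.type", "Hotel.stars")) {
  cat("\n", v, "\n")
  print(chisq.test(table(df$sentiment, df[[v]])))
}
# example bar plot for one of them
ggplot(df, aes(x = sentiment, fill = Period.of.stay)) +
  geom_bar(position = "dodge") +
  xlab("Sentiment") +
  ylab("Count") +
  ggtitle("Bar Plot of Sentiment and Period of Stay")
Any variable whose p-value falls below 0.01 would be considered associated with sentiment.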
Problem 2: Exploring California Housing Dataset
1-(0.5 pt) Explore the structure of the dataset. How many numerical and categorical attributes are there in the dataset?
housing <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/housing-1.csv", sep = "," , header = TRUE)
# Use the str() function to explore the structure
str(housing)
## 'data.frame': 20640 obs. of 10 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num 129 1106 190 235 280 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : chr "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
The structure of the dataset shows 9 numerical attributes and 1 categorical attribute.
The numerical attributes are: “longitude”, “latitude”, “housing_median_age”, “total_rooms”, “total_bedrooms”, “population”, “households”, “median_income”, and “median_house_value”.
The categorical attribute is “ocean_proximity”.
2. Take summary statistics of the dataset. Do any of the variables have missing values?
summary(housing)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
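Yes: the summary shows that total_bedrooms has 207 missing values (NA's), while all other variables are complete.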
3-(0.5 pt) What percentage of the rows have missing values in one or more variables? You can use the function “complete.cases” to get a logical vector indicating whether each row is complete (has no missing value) or incomplete (has one or more missing values). Run ?complete.cases to get help on this function in R. Use the logical vector returned by complete.cases to count the number of incomplete cases.
#missing values
sapply(housing, function(x) sum(is.na(x)))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 207 0 0 0
## median_house_value ocean_proximity
## 0 0
# Calculate the number of complete cases
complete_cases <- sum(complete.cases(housing))
# Calculate the number of missing cases
missing_cases <- nrow(housing) - complete_cases
# Calculate the percentage of rows with missing values
percent_missing <- (missing_cases / nrow(housing)) * 100
# Print the result
cat("Percentage of rows with missing values:", percent_missing, "%")
## Percentage of rows with missing values: 1.002907 %
4. Remove the rows with missing values. You can use either “complete.cases” function or “na.omit” function for this purpose
# Remove rows with missing values using na.omit
housing_complete <- na.omit(housing)
#missing values
sapply(housing_complete, function(x) sum(is.na(x)))
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 0 0 0 0
## median_house_value ocean_proximity
## 0 0
The result above shows no remaining missing values.
5. Draw a histogram and boxplot of the median_house_value.
# Load the ggplot2 library
library(ggplot2)
# Draw a histogram
ggplot(housing, aes(median_house_value)) +
geom_histogram(fill = "blue", alpha = 0.5, bins = 15) +
labs(x = "Median House Value", y = "Frequency") +
ggtitle("Histogram of Median House Value")
# Draw a boxplot
ggplot(housing, aes(x = "", y = median_house_value)) +
geom_boxplot(fill = "blue", alpha = 0.5) +
labs(x = "Median House Value", y = "") +
ggtitle("Boxplot of Median House Value")
6. Use Q-Q plot to examine if the distribution of median_house_value is approximately normal?
# Load the library for qqplot
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Create the Q-Q plot of median_house_value against a normal distribution
qqPlot(housing$median_house_value, main = "Q-Q Plot of Median House Value",
       ylab = "Median House Value")
The points deviate clearly from the straight reference line, so the distribution of median_house_value is not approximately normal: it is right-skewed, with the mean (206,856) well above the median (179,700).
7. Use IQR method to find outliers of the median_house_value attribute. How many outliers does this attribute have?
#Calculate the quartiles
Q1 <- quantile(housing$median_house_value, probs = 0.25)
Q2 <- quantile(housing$median_house_value, probs = 0.5)
Q3 <- quantile(housing$median_house_value, probs = 0.75)
#Next, we calculate the interquartile range (IQR) by subtracting Q1 from Q3:
#Calculate the IQR
IQR <- Q3 - Q1
#outliers
outliers <- subset(housing, housing$median_house_value < Q1 - 1.5 * IQR | housing$median_house_value > Q3 + 1.5 * IQR)
#Count the number of outliers
n_outliers <- nrow(outliers)
n_outliers
## [1] 1071
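So the median_house_value attribute has 1,071 outliers by the 1.5 × IQR rule, roughly 5% of the 20,640 rows.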
8. Do any of the numerical attributes have a positive or negative linear correlation with median_house_value? Use scatter plots and a correlation matrix.
library(dplyr)
library(tidyr)
housing_complete %>%
select(median_house_value, longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income) %>%
gather(variable, value, -median_house_value) %>%
ggplot(aes(x = value, y = median_house_value)) +
geom_point(color = 'steelblue', alpha = 0.35) +
facet_wrap(~variable, scales = "free_x") +
labs(title = "median_house_value Relationship with other Variables", x = element_blank())+
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
numeric_values <- housing_complete %>%
dplyr::select_if(is.numeric)
numeric_values <- numeric_values[complete.cases(numeric_values),] %>%
data.frame()
train_cor <- cor(numeric_values)
corrplot::corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt')
The correlation matrix shows the pairwise correlations between the numeric variables; a value near 1 indicates a strong positive linear relationship, while a value near -1 indicates a strong negative one.
From the matrix, median_house_value has a fairly strong positive correlation with median_income (0.69) and only a weak positive correlation with total_rooms (0.13).
It has weak negative correlations with latitude (-0.14) and longitude (-0.05), so median_income is the only attribute with a substantial linear relationship to median_house_value.
9. Is there a statistically significant difference between the mean of median_house_value across different levels of ocean_proximity? Use an appropriate plot and statistical test (with significance level alpha = 0.01).
# One way to test this is to use an analysis of variance (ANOVA) test, which tests the difference in mean values of a dependent variable (in this case, median_house_value) across multiple levels of an independent variable (in this case, ocean_proximity).
aov_test <- aov(median_house_value ~ ocean_proximity, data = housing)
summary(aov_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## ocean_proximity 4 6.544e+13 1.636e+13 1612 <2e-16 ***
## Residuals 20635 2.094e+14 1.015e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output shows the results of a one-way ANOVA test to determine if there is a statistically significant difference in the mean of median_house_value across different levels of ocean_proximity.
The F-value of 1612 and the very small p-value (< 2e-16), far below the significance level of 0.01, indicate a strong association between ocean_proximity and median_house_value: the mean of median_house_value differs significantly across the levels of ocean_proximity. The "***" significance code means the difference is significant even at the 0.001 level.
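The question also asks for an appropriate plot; a side-by-side boxplot of median_house_value across the levels of ocean_proximity is a natural choice. A minimal sketch:
# boxplot of median_house_value by ocean_proximity
ggplot(housing, aes(x = ocean_proximity, y = median_house_value)) +
  geom_boxplot(fill = "blue", alpha = 0.5) +
  labs(x = "Ocean Proximity", y = "Median House Value") +
  ggtitle("Median House Value by Ocean Proximity")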
10. Use ifelse function to create a new factor variable “Is_Inland” which takes the value “Yes” if the ocean_proximity is “Inland” and No” otherwise.
# Create the new factor variable "Is_Inland"
housing$Is_Inland <- ifelse(housing$ocean_proximity == "INLAND", "Yes", "No")
#head(housing)
table(housing$Is_Inland)
##
## No Yes
## 14089 6551
11. Use side by side box plot to visualize the relationship between “Is_Inland” and “median_house_value”. Is there a significant difference between the means of median_house_value for Inland vs not Inland houses?
# Load the ggplot2 library
library(ggplot2)
# Relabel ocean_proximity into two categories: Inland vs Not Inland
housing$Is_Inland_new <- ifelse(housing$ocean_proximity %in% c("INLAND"), "Inland", "Not Inland")
# Create the boxplot
#boxplot(median_house_value ~ Is_Inland, data = housing)
inland_plot <- ggplot(housing, aes(x = Is_Inland_new, y = median_house_value)) + geom_boxplot()
inland_plot
To determine whether the difference between the means of median_house_value for Inland vs Not Inland houses is significant, we perform a Welch two-sample t-test.
#oneway.test(median_house_value~Is_Inland_new, data=housing)
t.test(median_house_value~Is_Inland_new,alternative="two.sided", data=housing)
##
## Welch Two Sample t-test
##
## data: median_house_value by Is_Inland_new
## t = -93.704, df = 18987, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -122716.0 -117687.3
## sample estimates:
## mean in group Inland mean in group Not Inland
## 124805.4 245007.0
The p-value is far less than the significance level (α = 0.01), so we reject the null hypothesis and conclude that there is a statistically significant difference between the group means.
Problem 3: Exploring the Heart Disease Dataset
In this problem, you are going to explore the heart disease dataset from UCI. This dataset contains 76 attributes, but only 14 of them are relevant and used in publications. These 14 attributes have already been processed and extracted from the dataset.
1. Open these files and examine the data in them. Note that the files do not have a header and the missing values are marked by the “?” character. Each file contains the 14 attributes described here. Load each file into a dataframe (remember to set na.strings = "?" so that “?” is recognized as missing rather than as a data value).
# processed.cleveland.data
# Load the file into a data frame
df1 <- read.table("/Users/subhalaxmirout/CSC 532 - ML/processed.cleveland.data", header = F, na.strings = "?", sep = ",")
# Give the columns meaningful names
colnames(df1) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# Load the file into a data frame
df2 <- read.table("/Users/subhalaxmirout/CSC 532 - ML/processed.hungarian.data", header = F, na.strings = "?", sep = ",")
# Give the columns meaningful names
colnames(df2) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# Load the file into a data frame
df3 <- read.table("/Users/subhalaxmirout/CSC 532 - ML/processed.switzerland.data", header = F, na.strings = "?", sep = ",")
# Give the columns meaningful names
colnames(df3) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# Load the file into a data frame
df4 <- read.table("/Users/subhalaxmirout/CSC 532 - ML/processed.va.data", header = F, na.strings = "?", sep = ",")
# Give the columns meaningful names
colnames(df4) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
2. Use rbind function to combine the four dataframes into one dataframe and manually set the column names using colnames function
# Combine the four dataframes into one
df_all <- rbind(df1, df2, df3, df4)
# Set the column names
colnames(df_all) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
"exang", "oldpeak", "slope", "ca", "thal", "num")
head(df_all)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
3. Explore the overall structure of the dataset. What percentage of rows have missing values in one or more attributes?
missing_values <- colSums(is.na(df_all))
#percentage of missing values
percent_missing <- (missing_values / nrow(df_all)) * 100
#summarize the results
summary(percent_missing)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.05435 5.97826 13.65683 9.02174 66.41304
#Plot the results:
bp <- barplot(percent_missing,
main = "Percentage of Missing Values in Each Column",
col = c("lavender"),
ylab = "Percentage of Missing Values",
las = 2
)
text(bp, 0, round(percent_missing, 1),cex=1,pos=3)
#barplot(percent_missing, main="Percentage of Missing Values in Each Column", xlab="Column", ylab="Percentage of Missing Values")
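The per-column percentages above do not directly answer the question about rows; a sketch using complete.cases (as in Problem 2) gives the row-level percentage:
# percentage of rows with at least one missing value
incomplete_rows <- sum(!complete.cases(df_all))
cat("Percentage of incomplete rows:", round(incomplete_rows / nrow(df_all) * 100, 2), "%")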
4. Read the data description carefully. Decide which attributes are categorical and convert them to“factor” using factor function ( set the “labels” option to give meaningful names/labels to each level)
# Per the data description, the categorical attributes are:
# sex (#4), cp (#9), fbs (#16), restecg (#19), exang (#38), slope (#41), ca (#44), thal (#51), and num (#58)
# factor function
df_all$sex <- factor(df_all$sex)
df_all$cp <- factor(df_all$cp)
df_all$fbs <- factor(df_all$fbs)
df_all$restecg <- factor(df_all$restecg)
df_all$exang <- factor(df_all$exang)
df_all$slope <- factor(df_all$slope)
df_all$ca <- factor(df_all$ca)
df_all$thal <- factor(df_all$thal)
df_all$num <- factor(df_all$num)
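The labels option was not used above; a sketch of how meaningful labels could be attached, using the codings from the UCI data description quoted later in this report (the rest of the analysis keeps the original numeric codes):
# example: attach descriptive labels to a few of the categorical attributes
df_labeled <- df_all
df_labeled$sex <- factor(df_labeled$sex, levels = c(0, 1), labels = c("female", "male"))
df_labeled$cp <- factor(df_labeled$cp, levels = c(1, 2, 3, 4),
                        labels = c("typical angina", "atypical angina", "non-anginal pain", "asymptomatic"))
df_labeled$fbs <- factor(df_labeled$fbs, levels = c(0, 1), labels = c("<=120 mg/dl", ">120 mg/dl"))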
5. What are the median and mode of the age attribute?
median_age <- median(df_all$age)
#check
median_age
## [1] 54
#mode
age_table <- table(df_all$age)
mode_age <- names(which.max(age_table))
mode_age
## [1] "54"
6. Are female patients under-represented compared to male patients in this dataset?
# Count the number of male and female patients
male_count <- sum(df_all$sex == "1")
female_count <- sum(df_all$sex == "0")
# Calculate the proportion of male and female patients
male_prop <- male_count / nrow(df_all)
female_prop <- female_count / nrow(df_all)
# Print the results
cat("Male proportion:", round(male_prop*100),"%", "\n")
## Male proportion: 79 %
cat("Female proportion:", round(female_prop*100),"%", "\n")
## Female proportion: 21 %
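Yes: female patients are under-represented, making up only about 21% of the observations compared with 79% for male patients.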
7. Use “ifelse” and “factor” functions to create a new factor variable (call it “diagnosis”) which takes the value “No” if column 14 has the value zero and “Yes” otherwise. Replace column 14 of your dataframe with this new variable.
# Recode num as "No" (value 0) / "Yes" (any non-zero value)
df_all$num <- ifelse(df_all$num == 0, "No", "Yes")
# Replace column 14 of your dataframe with this new variable
colnames(df_all)[which(names(df_all) == "num")] <- "diagnosis"
df_all$diagnosis <- factor(df_all$diagnosis)
table(df_all$diagnosis)
##
## No Yes
## 411 509
8. (4 pts) Explore the relationship between the “diagnosis” variable and all other 13 attributes in the dataset. Which variables are associated with “diagnosis”? Use appropriate plots and statistical tests to answer this question. (Note: to get full credit for this question, you should use both an appropriate plot and an appropriate statistical test to examine the relationship between each of these variables and diagnosis. You should also interpret each plot). Please refer to the lectures to see what type of statistical test and plot you need to use for bivariate analysis based on the type of variables.
# copy the dataframe
df_all_new <- df_all
The dataframe has 920 observations of 14 variables. The factor variables are sex, cp, fbs, restecg, exang, slope, ca, thal, and diagnosis; age, trestbps, chol, thalach, and oldpeak are numeric. Below we examine the association between diagnosis and each of the other variables. Since diagnosis has two levels (Yes/No), we use boxplots for the numeric variables and mosaic plots (together with chi-squared tests) for the categorical ones.
Association between diagnosis and age
plot(df_all_new$age ~ df_all_new$diagnosis, col="sky blue", xlab="Diagnosis", ylab = "Age")
The median age of patients diagnosed with heart disease is about 55 or higher. There are some outliers in the heart disease = Yes group: diagnosed patients younger than about 35.
mytable_age <- table(df_all_new$age, df_all_new$diagnosis)
mosaicplot(mytable_age, ylab= "diagnosis", xlab="age", main = "Mosaic graph of age vs diagnosis", shade=TRUE)
The mosaic plot is quite cluttered, making it hard to tell which cells are significant, so let's bin the age column into categories.
# created age group
df_all_new <- df_all_new %>%
mutate(age_bracket = case_when(age >= 20 & age < 30 ~ "20+",
age >= 30 & age < 40 ~ "30+",
age >= 40 & age < 50 ~ "40+",
age >= 50 & age < 60 ~ "50+",
age >= 60 & age < 70 ~ "60+",
age >= 70 & age < 80 ~ "70+"
))
mytable_age_br <- table(df_all_new$age_bracket, df_all_new$diagnosis)
mosaicplot(mytable_age_br, ylab= "diagnosis", xlab="age", main = "Mosaic graph of age vs diagnosis", shade=TRUE)
The 60+ age bracket shows the highest proportion of heart-disease diagnoses.
Let's perform a chi-squared test to check the association.
chisq.test(mytable_age)
## Warning in chisq.test(mytable_age): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable_age
## X-squared = 125.91, df = 49, p-value = 1.079e-08
The p-value (1.079e-08) is much less than the significance level (0.05), so we reject the null hypothesis that age and diagnosis are independent and conclude that age and diagnosis are associated.
Association between sex and diagnosis
mytable_sex <- table(df_all_new$sex, df_all_new$diagnosis)
mosaicplot(mytable_sex, ylab= "diagnosis", xlab="sex", main = "Mosaic graph of sex vs diagnosis", shade=TRUE)
sex: sex (1 = male; 0 = female)
The mosaic plot shows that the sample contains more male than female patients, and that a larger proportion of the male patients are diagnosed with heart disease.
chisq.test(mytable_sex)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable_sex
## X-squared = 85.361, df = 1, p-value < 2.2e-16
The p-value (< 2.2e-16) is much less than the significance level (0.05), so we reject the null hypothesis that sex and diagnosis are independent and conclude that they are associated.
Association between cp and diagnosis
mytable_cp <- table(df_all_new$cp, df_all_new$diagnosis)
mosaicplot(mytable_cp, ylab= "diagnosis", xlab="cp", main = "Mosaic graph of cp vs diagnosis", shade=TRUE)
cp: chest pain type – Value 1: typical angina – Value 2: atypical angina – Value 3: non-anginal pain – Value 4: asymptomatic
The mosaic plot shows that asymptomatic chest pain (value 4) is the most common category and is associated with a higher proportion of heart-disease diagnoses than the other chest-pain types.
chisq.test(mytable_cp)
##
## Pearson's Chi-squared test
##
## data: mytable_cp
## X-squared = 268.35, df = 3, p-value < 2.2e-16
The p-value (< 2.2e-16) is much less than the significance level (0.05), so we reject the null hypothesis that cp and diagnosis are independent and conclude that chest pain type and diagnosis are associated.
Association trestbps and diagnosis
plot(df_all_new$trestbps ~ df_all_new$diagnosis, col="sky blue", xlab="Diagnosis", ylab = "trestbps")
The boxplot does not show a clear difference in resting blood pressure between patients diagnosed with heart disease and those who are not.
mytable_trestbps <- table(df_all_new$trestbps, df_all_new$diagnosis)
mosaicplot(mytable_trestbps, ylab= "diagnosis", xlab="trestbps", main = "Mosaic graph of Resting blood pressure vs diagnosis", shade=TRUE)
The mosaic plot is too cluttered to show how resting blood pressure affects the heart-disease diagnosis.
x <- df_all_new$trestbps
y <- df_all_new$diagnosis
# Plot with main and axis titles
plot(x, y, main = "Resting blood pressure vs diagnosis",
xlab = "Resting blood pressure", ylab = "Diagnosis",
pch = 19, frame = FALSE)
abline(lm(y ~ x, data = df_all_new), col = "blue")
## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
The scatter plot does not show a clear relationship between the two variables either.
chisq.test(mytable_trestbps)
## Warning in chisq.test(mytable_trestbps): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable_trestbps
## X-squared = 82.286, df = 60, p-value = 0.02969
The p-value is 0.02969, but R warns that the chi-squared approximation may be incorrect because the table of individual trestbps values has many sparse cells, so this result should be treated with caution.
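Since trestbps is numeric, a two-sample t-test comparing its mean across the two diagnosis groups is a more suitable test than a chi-squared test on the raw values; a minimal sketch:
# compare mean resting blood pressure between the diagnosis groups
t.test(trestbps ~ diagnosis, data = df_all_new)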
Association chol and diagnosis
plot(df_all_new$chol ~ df_all_new$diagnosis, col="sky blue", xlab="Diagnosis", ylab = "chol")
For patients diagnosed with heart disease, serum cholesterol shows a wider interquartile range and no outliers.
mytable_chol <- table(df_all_new$chol, df_all_new$diagnosis)
mosaicplot(mytable_chol, ylab= "diagnosis", xlab="chol", main = "Mosaic graph of chol vs diagnosis", shade=TRUE)
The mosaic plot is too cluttered to show how cholesterol levels affect the heart-disease diagnosis and does not reveal a clear relationship.
chisq.test(mytable_chol)
## Warning in chisq.test(mytable_chol): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable_chol
## X-squared = 327.64, df = 216, p-value = 1.411e-06
The p-value (1.411e-06) is much less than the significance level (0.05), so we reject the null hypothesis that chol and diagnosis are independent and conclude that cholesterol and diagnosis are associated, although the warning again indicates the chi-squared approximation may be unreliable for this sparse table.
Association fbs and diagnosis
mytable_fbs <- table(df_all_new$fbs, df_all_new$diagnosis)
mosaicplot(mytable_fbs, ylab= "diagnosis", xlab="fbs", main = "Mosaic graph of fbs vs diagnosis", shade=TRUE)
fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
The plot suggests that fasting blood sugar is associated with the diagnosis of heart disease.
chisq.test(mytable_fbs)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable_fbs
## X-squared = 16.112, df = 1, p-value = 5.972e-05
The p-value (5.972e-05) is much less than the significance level (0.05), so we reject the null hypothesis that fbs and diagnosis are independent and conclude that fasting blood sugar and diagnosis are associated.
Association restecg and diagnosis
mytable_restecg <- table(df_all_new$restecg, df_all_new$diagnosis)
mosaicplot(mytable_restecg, ylab= "diagnosis", xlab="restecg", main = "Mosaic graph of restecg vs diagnosis", shade=TRUE)
Resting electrocardiographic Result : Value 0: normal – Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) – Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
Most observations have restecg value 0 (normal). The plot suggests that ST-T wave abnormality (value 1) is associated with a higher proportion of heart-disease diagnoses.
chisq.test(mytable_restecg)
##
## Pearson's Chi-squared test
##
## data: mytable_restecg
## X-squared = 11.712, df = 2, p-value = 0.002863
The p-value (0.002863) is less than the significance level (0.05), so we reject the null hypothesis that restecg and diagnosis are independent and conclude that the resting electrocardiographic result and diagnosis are associated.
Association thalach and diagnosis
plot(df_all_new$thalach ~ df_all_new$diagnosis, col="sky blue", xlab="Diagnosis", ylab = "thalach")
Patients diagnosed with heart disease have a lower median thalach (maximum heart rate achieved) than those who are not.
mytable_thalach <- table(df_all_new$thalach, df_all_new$diagnosis)
mosaicplot(mytable_thalach, ylab= "diagnosis", xlab="thalach", main = "Mosaic graph of thalach vs diagnosis", shade=TRUE)
The mosaic plot is too cluttered to show how thalach levels affect the heart-disease diagnosis.
chisq.test(mytable_thalach)
## Warning in chisq.test(mytable_thalach): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable_thalach
## X-squared = 222.53, df = 118, p-value = 2.028e-08
thalach: maximum heart rate achieved
The p-value (2.028e-08) is much less than the significance level (0.05), so we reject the null hypothesis that thalach and diagnosis are independent and conclude that they are associated.
Association exang and diagnosis
mytable_exang <- table(df_all_new$exang, df_all_new$diagnosis)
mosaicplot(mytable_exang, ylab= "diagnosis", xlab="exang", main = "exang vs diagnosis", shade=TRUE)
exang: exercise induced angina (1 = yes; 0 = no)
The plot shows that exang = 0 is more common than exang = 1, and that patients with exercise-induced angina (exang = 1) have a noticeably higher proportion of heart-disease diagnoses.
chisq.test(mytable_exang)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable_exang
## X-squared = 184.02, df = 1, p-value < 2.2e-16
The p-value (< 2.2e-16) is less than the significance level (0.05), so we reject the null hypothesis that exang and diagnosis are independent and conclude that they are associated.
Association slope and diagnosis
mytable_slope <- table(df_all_new$slope, df_all_new$diagnosis)
mosaicplot(mytable_slope, ylab= "diagnosis", xlab="slope", main = "slope vs diagnosis", shade=TRUE)
slope: the slope of the peak exercise ST segment – Value 1: upsloping – Value 2: flat – Value 3: downsloping
The plot shows that slope value 2 (flat) is the most common category and is associated with a higher proportion of heart-disease diagnoses, while slope value 3 (downsloping) contributes relatively few observations.
chisq.test(mytable_slope)
##
## Pearson's Chi-squared test
##
## data: mytable_slope
## X-squared = 88.852, df = 2, p-value < 2.2e-16
The p-value (< 2.2e-16) is less than the significance level (0.05), so we reject the null hypothesis that slope and diagnosis are independent and conclude that they are associated.
Association ca and diagnosis
mytable_ca <- table(df_all_new$ca, df_all_new$diagnosis)
mosaicplot(mytable_ca, ylab= "diagnosis", xlab="ca", main = "ca vs diagnosis", shade=TRUE)
ca: number of major vessels (0-3) colored by flourosopy
The plot shows that ca value 0 has the most observations, and that values 2 and 3 are associated with a higher proportion of heart-disease diagnoses than value 0.
chisq.test(mytable_ca)
##
## Pearson's Chi-squared test
##
## data: mytable_ca
## X-squared = 73.2, df = 3, p-value = 8.806e-16
The p-value (8.806e-16) is less than the significance level (0.05), so we reject the null hypothesis that ca and diagnosis are independent and conclude that they are associated.
Association between thal and diagnosis
mytable_thal <- table(df_all_new$thal, df_all_new$diagnosis)
mosaicplot(mytable_thal, ylab= "diagnosis", xlab="thal", main = "thal vs diagnosis", shade=TRUE)
thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
The plot shows that thal value 6 (fixed defect) contributes relatively few observations, thal value 3 (normal) is associated with a lower proportion of heart-disease diagnoses, and thal value 7 (reversible defect) with a higher proportion.
chisq.test(mytable_thal)
##
## Pearson's Chi-squared test
##
## data: mytable_thal
## X-squared = 109.05, df = 2, p-value < 2.2e-16
The p-value (< 2.2e-16) is less than the significance level (0.05), so we reject the null hypothesis that thal and diagnosis are independent and conclude that thal and diagnosis are associated.
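One attribute, oldpeak (ST depression induced by exercise relative to rest), was not examined above. Since it is numeric, a comparable analysis would use a boxplot and a two-sample t-test; a sketch:
# boxplot of oldpeak by diagnosis
plot(df_all_new$oldpeak ~ df_all_new$diagnosis, col = "sky blue", xlab = "Diagnosis", ylab = "oldpeak")
# two-sample t-test of oldpeak across the diagnosis groups
t.test(oldpeak ~ diagnosis, data = df_all_new)
A p-value below the significance level would indicate that oldpeak and diagnosis are associated.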