Cirrhosis Data Analysis

## Introduction

This analysis explores the cirrhosis dataset using R. Data manipulation and statistical techniques are applied to understand the dataset.

## Load Required Libraries and Dataset

    library(dplyr)
    library(readr)
    library(tidyr)  # provides drop_na(), used in step 7

    # Load dataset
    cirrhosis <- read_csv("/Users/tiffanytoussaint/Downloads/cirrhosis.csv")
    View(cirrhosis)

## 1. Structure of Dataset

The structure of the dataset shows the type of each variable and the overall format of the dataset.

    str(cirrhosis)

## 2. Variables in Dataset

This step lists all the column names (variables) present in the dataset.

    colnames(cirrhosis)

## 3. Top 15 Rows

The first 15 rows of the dataset are displayed to get an overview of the data.

    head(cirrhosis, 15)

## 4. User-Defined Function

A function that doubles its input is defined and applied to the Albumin column.

    double_value <- function(x) x * 2
    cirrhosis$Double_Albumin <- double_value(cirrhosis$Albumin)

## 5. Filter Rows Based on Logical Criteria

Rows are filtered where Albumin is greater than 3 to analyze patients with higher albumin levels.

    filtered_cirrhosis <- cirrhosis[cirrhosis$Albumin > 3, ]
    head(filtered_cirrhosis)

## 6. Dependent and Independent Variables

Age is selected as the independent variable and Albumin as the dependent variable to study their relationship.

    age_albumin_data <- cirrhosis[, c("Age", "Albumin")]

## 7. Remove Missing Values

Rows containing missing values are removed so that later steps work on complete records.

    clean_data <- cirrhosis %>% drop_na()

## 8. Remove Duplicated Data

Duplicate rows are removed to avoid redundancy in the dataset. The step starts from clean_data rather than cirrhosis so that the missing-value removal from step 7 is kept.

    clean_data <- clean_data %>% distinct()
    nrow(clean_data)

## 9. Reorder Rows in Descending Order

The dataset is sorted in descending order based on Age.

    clean_data <- clean_data %>% arrange(desc(Age))

## 10. Rename Columns

The column "Age" is renamed to "Patient_Age" for better clarity.

    clean_data <- clean_data %>% rename(Patient_Age = Age)
    View(clean_data)

## 11. Add New Variable

A new variable is created by doubling the Patient_Age values.

    clean_data <- clean_data %>% mutate(Age_Double = Patient_Age * 2)

## 12. Create Training and Testing Sets

The dataset is split into training (80%) and testing (20%) sets using random sampling. The seed is fixed so that the split is reproducible.

    set.seed(314)
    train_index <- sample(nrow(clean_data), floor(0.8 * nrow(clean_data)))
    train_data <- clean_data %>% slice(train_index)
    test_data <- clean_data %>% slice(-train_index)
    View(test_data)

## 13. Summary Statistics

Summary statistics provide an overview of the dataset. For each numeric column, summary() reports the minimum, quartiles, median, mean, and maximum; it does not report the mode, which is computed separately in step 14.

    summary(clean_data)

## 14. Statistical Measures

### Mean

The mean represents the average Patient_Age.

    mean(clean_data$Patient_Age)

### Median

The median represents the middle value of Patient_Age.

    median(clean_data$Patient_Age)

### Mode

The mode represents the most frequent value in Patient_Age. If several values tie for the highest frequency, all of them are returned.

    mode_function <- function(x) {
      freq <- table(x)
      # names() returns the tied most-frequent values as character labels
      modes <- names(freq)[freq == max(freq)]
      return(modes)
    }
    mode_function(clean_data$Patient_Age)

### Range

range() returns the minimum and maximum Patient_Age; the difference between them is obtained with diff().

    range(clean_data$Patient_Age)
    diff(range(clean_data$Patient_Age))

## 15. Scatter Plot

A scatter plot is created to visualize the relationship between Patient_Age and Albumin.

    plot(clean_data$Patient_Age, clean_data$Albumin,
         main = "Age vs Albumin", xlab = "Age", ylab = "Albumin")

## 16. Bar Plot

A bar plot is created to show the distribution of patients across different stages.

    clean_data %>%
      count(Stage) %>%
      with(barplot(n, names.arg = Stage, main = "Stage Distribution",
                   xlab = "Stage", ylab = "Count"))

## 17. Correlation and Linear Regression

A least squares linear regression model is applied to examine the relationship between Patient_Age and Albumin.

### Correlation

    cor(clean_data$Patient_Age, clean_data$Albumin)

### Linear Regression Model

    model <- lm(Albumin ~ Patient_Age, data = clean_data)
    summary(model)

### Regression Plot

    plot(clean_data$Patient_Age, clean_data$Albumin)
    abline(model, col = "green")

## Conclusion

This analysis applied data manipulation, visualization, and statistical techniques to understand the cirrhosis dataset.
The relationship between Patient_Age and Albumin was explored using correlation and regression analysis.
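
As a possible follow-up, the training and testing sets from step 12 could be used to check how well the regression generalizes, since the model in step 17 is fitted on all rows. The sketch below is a minimal illustration under stated assumptions, not part of the original analysis: `demo_data` is a hypothetical synthetic stand-in for `clean_data` (the real CSV is not read here), and `train_model`, `predictions`, and `test_rmse` are illustrative names.

```r
# Synthetic stand-in for clean_data (hypothetical values, NOT the real dataset)
set.seed(314)
n <- 200
demo_data <- data.frame(Patient_Age = runif(n, min = 30, max = 75))
demo_data$Albumin <- 4 - 0.01 * demo_data$Patient_Age + rnorm(n, sd = 0.3)

# Same 80/20 random split as in step 12, written with base R indexing
train_index <- sample(nrow(demo_data), floor(0.8 * nrow(demo_data)))
train_data <- demo_data[train_index, ]
test_data <- demo_data[-train_index, ]

# Fit the least squares model on the training rows only
train_model <- lm(Albumin ~ Patient_Age, data = train_data)

# Predict Albumin for the held-out rows and compute the root mean squared error
predictions <- predict(train_model, newdata = test_data)
test_rmse <- sqrt(mean((test_data$Albumin - predictions)^2))
test_rmse
```

With the real data, `demo_data` would be replaced by `clean_data`. A test RMSE close to the residual standard error reported by `summary(train_model)` would suggest that the fit is not specific to the training rows.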