Introduction

When working with data, expect to spend a good share of your time on clean-up, even when the data is not ‘messy’ or unreadable. A data frame can be perfectly readable and still be organized in a way that is not useful for analysis. In these cases, we may have to transpose or reorganize the data frame to fit our needs.

The following data set was obtained from Kaggle.com and provides MRI (Magnetic Resonance Imaging) information on individuals across a wide range of ages and backgrounds, along with tracking data such as identification numbers. Here we compare several variables with each individual’s “CDR”, or Clinical Dementia Rating. CDR values here range from 0 to 2 on the scale’s steps of 0, 0.5, 1, and 2. The larger the CDR value, the more severe the dementia.

1. Loading Data: Loading the TSV File (Tab Separated Values)

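The data is loaded from a TSV file into the variable ‘oasis_tsv’. Because read.csv() splits fields on commas by default, each tab-separated row is read in as a single string.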

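rm(list = ls())
library(dplyr)    # %>%, mutate(), group_by(), summarize(), filter()
library(tidyr)    # separate()
library(ggplot2)  # ggplot()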

oasis_tsv <- read.csv("oasis.tsv")

head(oasis_tsv)

##           ID.M.F.Hand.Age.Educ.SES.MMSE.CDR.eTIV.nWBV.ASF.Delay
## 1 OAS1_0308_MR1\tF\tR\t78\t3\t3\t15\t2\t1401\t0.703\t1.253\tN/A
## 2 OAS1_0351_MR1\tM\tR\t86\t1\t4\t15\t2\t1512\t0.665\t1.161\tN/A
## 3 OAS1_0028_MR1\tF\tR\t86\t2\t4\t27\t1\t1449\t0.738\t1.211\tN/A
## 4 OAS1_0031_MR1\tM\tR\t88\t1\t4\t26\t1\t1419\t0.674\t1.236\tN/A
## 5 OAS1_0035_MR1\tF\tR\t84\t3\t2\t28\t1\t1402\t0.695\t1.252\tN/A
## 6   OAS1_0052_MR1\tF\tR\t78\t1\t5\t23\t1\t1462\t0.697\t1.2\tN/A
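The single-column result above follows from read.csv() splitting on commas. As a minimal sketch of an alternative, read.delim() splits on tabs by default. The header line below is an assumption reconstructed from the mangled column name in the output, and the two data rows are copied from that output into a temporary file.

```r
# Sample file: the header line is an assumption; the data rows come from the output above.
tsv_lines <- c(
  "ID\tM/F\tHand\tAge\tEduc\tSES\tMMSE\tCDR\teTIV\tnWBV\tASF\tDelay",
  "OAS1_0308_MR1\tF\tR\t78\t3\t3\t15\t2\t1401\t0.703\t1.253\tN/A",
  "OAS1_0351_MR1\tM\tR\t86\t1\t4\t15\t2\t1512\t0.665\t1.161\tN/A"
)
tmp <- tempfile(fileext = ".tsv")
writeLines(tsv_lines, tmp)

# read.csv() looks for commas, finds none, and returns a single column.
one_col <- read.csv(tmp)
ncol(one_col)

# read.delim() uses sep = "\t" by default; na.strings turns "N/A" into NA.
oasis_sample <- read.delim(tmp, na.strings = "N/A")
ncol(oasis_sample)
names(oasis_sample)
```

Read this way, the fields arrive already split, so the separate() step in the next section would not be needed. The column names would still have to be assigned afterwards.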

2. Loading Data: Cleaning

From above, we can see that every row landed in a single column, and the header fields were fused into one column name. Below I split that column on the tab character and assign new column names in the same step.

oasis_tsv <- oasis_tsv %>%
  separate(
    col = names(oasis_tsv)[1],
    into = c("ID", "Gender", "Hand_Dominance", "Age", "Educ", "Socio_Econ", "MMSE", "CDR", "eTIV", "nWBV", "ASF", "Delay"),
    sep = "\t"
  )

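# Round trip through CSV: read.csv() re-infers the column types, and the
# row names written by write.csv() come back as the extra column X
write.csv(oasis_tsv, "oasis.csv")

oasis_csv <- read.csv("oasis.csv")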

head(oasis_csv)

##   X            ID Gender Hand_Dominance Age Educ Socio_Econ MMSE CDR eTIV  nWBV
## 1 1 OAS1_0308_MR1      F              R  78    3          3   15   2 1401 0.703
## 2 2 OAS1_0351_MR1      M              R  86    1          4   15   2 1512 0.665
## 3 3 OAS1_0028_MR1      F              R  86    2          4   27   1 1449 0.738
## 4 4 OAS1_0031_MR1      M              R  88    1          4   26   1 1419 0.674
## 5 5 OAS1_0035_MR1      F              R  84    3          2   28   1 1402 0.695
## 6 6 OAS1_0052_MR1      F              R  78    1          5   23   1 1462 0.697
##     ASF Delay
## 1 1.253   N/A
## 2 1.161   N/A
## 3 1.211   N/A
## 4 1.236   N/A
## 5 1.252   N/A
## 6 1.200   N/A
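The extra X column in the output above holds the row names that write.csv() stores by default. The sketch below uses a small made-up data frame with character columns, as separate() would produce them. It shows how row.names = FALSE avoids that column, and how type.convert() can re-infer column types without writing a file. Both are offered as options, not as the method used in this report.

```r
# Character columns, as separate() would produce them (sample values only).
split_df <- data.frame(
  ID  = c("OAS1_0308_MR1", "OAS1_0351_MR1"),
  Age = c("78", "86"),
  CDR = c("2", "2"),
  stringsAsFactors = FALSE
)

# Default write.csv() stores the row names, which read.csv() returns as X.
tmp <- tempfile(fileext = ".csv")
write.csv(split_df, tmp)
with_x <- read.csv(tmp)
names(with_x)

# row.names = FALSE leaves the extra column out of the file.
write.csv(split_df, tmp, row.names = FALSE)
without_x <- read.csv(tmp)
names(without_x)

# In-memory alternative: re-infer the column types without a file round trip.
converted <- type.convert(split_df, as.is = TRUE)
sapply(converted, class)
```

Dropping X shifts every column position by one, so any later selection by column index would need to be adjusted to match.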

3. Removing Columns

In this step, I removed the columns I did not need and kept only the ID, Gender, Age, Education, Socio-Economic Status, and the CDR (Clinical Dementia Rating), which I renamed to DementiaRating.

dementia <- oasis_csv[,-c(1,4,8,10:13)]

colnames(dementia) <- c("ID", "Gender", "Age", "Educ", "Socio_Econ", "DementiaRating")

head(dementia)

##              ID Gender Age Educ Socio_Econ DementiaRating
## 1 OAS1_0308_MR1      F  78    3          3              2
## 2 OAS1_0351_MR1      M  86    1          4              2
## 3 OAS1_0028_MR1      F  86    2          4              1
## 4 OAS1_0031_MR1      M  88    1          4              1
## 5 OAS1_0035_MR1      F  84    3          2              1
## 6 OAS1_0052_MR1      F  78    1          5              1
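Negative column indices depend on the exact column order, so they break silently if a column is added or removed upstream. As a hedged alternative, the same subset can be selected by name. The sample data frame below mirrors the column layout printed earlier, with only two rows.

```r
# Two-row stand-in for oasis_csv with the same column layout.
oasis_sample <- data.frame(
  X = 1:2,
  ID = c("OAS1_0308_MR1", "OAS1_0351_MR1"),
  Gender = c("F", "M"),
  Hand_Dominance = c("R", "R"),
  Age = c(78L, 86L),
  Educ = c(3L, 1L),
  Socio_Econ = c(3L, 4L),
  MMSE = c(15L, 15L),
  CDR = c(2, 2),
  eTIV = c(1401L, 1512L),
  nWBV = c(0.703, 0.665),
  ASF = c(1.253, 1.161),
  Delay = c(NA, NA),
  stringsAsFactors = FALSE
)

# Select by name, then rename CDR to match the report's column name.
keep <- c("ID", "Gender", "Age", "Educ", "Socio_Econ", "CDR")
by_name <- oasis_sample[, keep]
names(by_name)[names(by_name) == "CDR"] <- "DementiaRating"

# The index-based selection from the report, for comparison.
by_index <- oasis_sample[, -c(1, 4, 8, 10:13)]
colnames(by_index) <- c("ID", "Gender", "Age", "Educ", "Socio_Econ", "DementiaRating")

all.equal(by_name, by_index)
```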

4. Data Types

Although the data frame looks usable, some columns do not have the types needed for analysis. I converted the categorical columns to factors and made sure Age and DementiaRating are numeric.

dementia_df <- dementia %>%
  mutate(
    Age = as.numeric(Age),
    Gender = as.factor(Gender),
    Educ = as.factor(Educ),
    Socio_Econ = as.factor(Socio_Econ),
    DementiaRating = as.numeric(DementiaRating)
  )

head(dementia_df)

##              ID Gender Age Educ Socio_Econ DementiaRating
## 1 OAS1_0308_MR1      F  78    3          3              2
## 2 OAS1_0351_MR1      M  86    1          4              2
## 3 OAS1_0028_MR1      F  86    2          4              1
## 4 OAS1_0031_MR1      M  88    1          4              1
## 5 OAS1_0035_MR1      F  84    3          2              1
## 6 OAS1_0052_MR1      F  78    1          5              1
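head() prints the same values before and after the conversion, so the change shows up only in the column classes. The sketch below checks the classes on a three-row sample. The starting classes are an assumption about what read.csv() would infer for these columns, and transform() stands in for mutate() to keep the example in base R.

```r
# Three sample rows; the starting classes are assumed (integer codes, character Gender).
dementia <- data.frame(
  ID = c("OAS1_0308_MR1", "OAS1_0351_MR1", "OAS1_0028_MR1"),
  Gender = c("F", "M", "F"),
  Age = c(78L, 86L, 86L),
  Educ = c(3L, 1L, 2L),
  Socio_Econ = c(3L, 4L, 4L),
  DementiaRating = c(2, 2, 1),
  stringsAsFactors = FALSE
)
before <- sapply(dementia, class)

# Same conversions as in the report, written with base R's transform().
dementia_df <- transform(
  dementia,
  Age = as.numeric(Age),
  Gender = as.factor(Gender),
  Educ = as.factor(Educ),
  Socio_Econ = as.factor(Socio_Econ),
  DementiaRating = as.numeric(DementiaRating)
)
after <- sapply(dementia_df, class)

# Side-by-side view of each column's class before and after.
data.frame(before, after)
```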

5. Comparing Variables & Visualization

(a) Gender vs CDR

Now that I’ve narrowed down my data, I group by one of the variables and compute the mean CDR within each group. I then use ggplot to visualize the comparison.

avg_cdr_gender <- dementia_df %>%
  group_by(Gender) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

avg_cdr_gender

## # A tibble: 2 × 2
##   Gender avg_dementia_rating
##   <fct>                <dbl>
## 1 F                    0.260
## 2 M                    0.335

ggplot(avg_cdr_gender, aes(x = Gender, y = avg_dementia_rating, fill = Gender)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Dementia Rating by Gender", x = "Gender", y = "Average Dementia Rating") +
  theme_minimal()
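A group mean is easier to interpret next to the number of observations behind it. The sketch below reproduces the group-then-average step with base R's tapply() on made-up values and adds a per-group count. The column name n is a hypothetical addition that does not appear in the report.

```r
# Made-up ratings for illustration only.
sample_df <- data.frame(
  Gender = factor(c("F", "F", "F", "M", "M")),
  DementiaRating = c(0, 0.5, 1, 0, 2)
)

# Mean rating per group, ignoring missing ratings.
group_mean <- tapply(sample_df$DementiaRating, sample_df$Gender, mean, na.rm = TRUE)

# Number of non-missing ratings per group.
group_n <- tapply(sample_df$DementiaRating, sample_df$Gender,
                  function(x) sum(!is.na(x)))

# Combine into a summary table shaped like avg_cdr_gender, plus the count.
gender_summary <- data.frame(
  Gender = names(group_mean),
  avg_dementia_rating = as.vector(group_mean),
  n = as.vector(group_n)
)
gender_summary
```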

(b) Age vs CDR

# Remove rows with NA in specific columns (e.g., CDR, Gender, Age)
dementia_df <- dementia_df %>%
  filter(!is.na(DementiaRating), !is.na(Gender), !is.na(Age))

# Calculate average CDR by Age
avg_cdr_age <- dementia_df %>%
  group_by(Age) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

avg_cdr_age

## # A tibble: 51 × 2
##      Age avg_dementia_rating
##    <dbl>               <dbl>
##  1    33                   0
##  2    39                   0
##  3    43                   0
##  4    45                   0
##  5    46                   0
##  6    47                   0
##  7    48                   0
##  8    49                   0
##  9    50                   0
## 10    51                   0
## # ℹ 41 more rows

ggplot(avg_cdr_age, aes(x = Age, y = avg_dementia_rating)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  labs(title = "Average Dementia Rating by Age", x = "Age", y = "Average Dementia Rating") +
  theme_minimal()
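The filter at the start of this section drops a row only when one of the three named columns is missing. A base R sketch of the same idea uses complete.cases() on just those columns. The rows below are made up to show which ones survive.

```r
# Made-up rows: each of rows 2 to 4 is missing one required value,
# and row 5 is missing only Socio_Econ, which the filter does not check.
sample_df <- data.frame(
  Age = c(78, NA, 86, 51, 84),
  Gender = factor(c("F", "M", NA, "M", "F")),
  DementiaRating = c(2, 1, 1, NA, 1),
  Socio_Econ = c(3, 4, 2, 2, NA)
)

# Keep rows that are complete in the required columns only.
required <- c("DementiaRating", "Gender", "Age")
kept <- sample_df[complete.cases(sample_df[, required]), ]

nrow(kept)
kept
```

Rows with a missing Socio_Econ value pass through this filter, which is why a missing-value group can still appear when grouping by that column.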

(c) Socio-Economic Status vs CDR

# Calculate average CDR by Socio-economic Status (SES)
avg_cdr_ses <- dementia_df %>%
  group_by(Socio_Econ) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

avg_cdr_ses

## # A tibble: 6 × 2
##   Socio_Econ avg_dementia_rating
##   <fct>                    <dbl>
## 1 1                        0.21 
## 2 2                        0.185
## 3 3                        0.265
## 4 4                        0.398
## 5 5                        0.5  
## 6 <NA>                     0.553

ggplot(avg_cdr_ses, aes(x = factor(Socio_Econ), y = avg_dementia_rating, fill = factor(Socio_Econ))) +
  geom_bar(stat = "identity") +
  labs(title = "Average Dementia Rating by Socio-economic Status", x = "Socio-economic Status", y = "Average Dementia Rating") +
  theme_minimal()
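The <NA> row in the table above appears because group_by() treats missing Socio_Econ values as a group of their own. The base R sketch below, on made-up values, contrasts three ways of handling that: dropping the missing group, keeping it explicitly with addNA(), and filtering the rows out first. Which one is appropriate is an analysis decision that this sketch does not settle.

```r
# Made-up values: two rows have no socio-economic status recorded.
sample_df <- data.frame(
  Socio_Econ = factor(c(1, 1, 2, NA, NA)),
  DementiaRating = c(0, 0.5, 0, 0.5, 1)
)

# tapply() silently drops rows whose group is NA.
without_na <- tapply(sample_df$DementiaRating, sample_df$Socio_Econ, mean)

# addNA() makes NA an explicit level, mirroring the <NA> row from group_by().
with_na <- tapply(sample_df$DementiaRating, addNA(sample_df$Socio_Econ), mean)

# Filtering first makes the exclusion visible in the code.
known_ses <- sample_df[!is.na(sample_df$Socio_Econ), ]

length(without_na)
length(with_na)
nrow(known_ses)
```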

(d) Education vs CDR

# Calculate average CDR by Education
avg_cdr_education <- dementia_df %>%
  group_by(Educ) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

avg_cdr_education

## # A tibble: 5 × 2
##   Educ  avg_dementia_rating
##   <fct>               <dbl>
## 1 1                   0.522
## 2 2                   0.359
## 3 3                   0.245
## 4 4                   0.24 
## 5 5                   0.167

ggplot(avg_cdr_education, aes(x = factor(Educ), y = avg_dementia_rating, fill = factor(Educ))) +
  geom_bar(stat = "identity") +
  labs(title = "Average Dementia Rating by Education Level", x = "Education Level", y = "Average Dementia Rating") +
  theme_minimal()
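Sections (a) through (d) repeat one pattern: group by a column, then average DementiaRating. As a minimal sketch, that pattern can be wrapped in a helper. The function name avg_cdr_by is hypothetical, and it uses base R's aggregate() where the report uses dplyr.

```r
# Hypothetical helper: mean DementiaRating for each value of one grouping column.
# Unlike dplyr's group_by(), aggregate() drops rows whose group value is NA.
avg_cdr_by <- function(df, group_col) {
  out <- aggregate(df$DementiaRating, by = list(df[[group_col]]),
                   FUN = mean, na.rm = TRUE)
  names(out) <- c(group_col, "avg_dementia_rating")
  out
}

# Made-up values for illustration only.
sample_df <- data.frame(
  Gender = factor(c("F", "F", "M", "M")),
  Educ = factor(c(1, 2, 1, 2)),
  DementiaRating = c(0, 1, 0.5, 2)
)

by_gender <- avg_cdr_by(sample_df, "Gender")
by_educ <- avg_cdr_by(sample_df, "Educ")
by_gender
by_educ
```

Each result has the same two-column shape as the summary tables above, so it could be passed to the same ggplot calls.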

DATA 607 - Project 2: Data Set 3 (Resubmit)

Julian Adames-Ng

2024-10-28