Introduction

When working with data, expect to spend a good amount of time on clean-up, even when the data is not ‘messy’ or unreadable. A data frame can be organized in a readable way and still not be useful in the format presented. In these cases, we may have to transpose or otherwise re-organize the data frame to fit our needs.

The following data set was obtained from Kaggle.com. It provides MRI (Magnetic Resonance Imaging) information on individuals from a wide range of ages and backgrounds, along with other tracking data such as identification numbers. Here we compare several variables with the corresponding “CDR”, or Clinical Dementia Rating, for each individual. The CDR takes values from 0 to 2 in this data set; on the standard scale the steps are 0, 0.5, 1, and 2 (with 3 as the most severe rating), so there is no 1.5. The larger the CDR value, the more severe the case of dementia.

1. Loading Data: Loading the TSV File (Tab Separated Values)

The data is loaded from a TSV file and assigned to ‘oasis_tsv’.

rm(list = ls())

library(tidyverse)  # provides %>%, separate(), mutate(), and ggplot()

oasis_tsv <- read.csv("oasis.tsv")

head(oasis_tsv)
##           ID.M.F.Hand.Age.Educ.SES.MMSE.CDR.eTIV.nWBV.ASF.Delay
## 1 OAS1_0308_MR1\tF\tR\t78\t3\t3\t15\t2\t1401\t0.703\t1.253\tN/A
## 2 OAS1_0351_MR1\tM\tR\t86\t1\t4\t15\t2\t1512\t0.665\t1.161\tN/A
## 3 OAS1_0028_MR1\tF\tR\t86\t2\t4\t27\t1\t1449\t0.738\t1.211\tN/A
## 4 OAS1_0031_MR1\tM\tR\t88\t1\t4\t26\t1\t1419\t0.674\t1.236\tN/A
## 5 OAS1_0035_MR1\tF\tR\t84\t3\t2\t28\t1\t1402\t0.695\t1.252\tN/A
## 6   OAS1_0052_MR1\tF\tR\t78\t1\t5\t23\t1\t1462\t0.697\t1.2\tN/A
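The output above shows every record packed into a single column, because read.csv() splits on commas while the file is tab-separated. As a minimal sketch of an alternative, a tab-aware reader splits the fields at load time. The two data rows are copied from the output above; the header line is an assumption reconstructed from the column name R printed, and the temporary file stands in for oasis.tsv.

```r
# Write a small tab-separated sample to a temporary file.
# The header is an assumed reconstruction; the rows come from the output above.
tmp <- tempfile(fileext = ".tsv")
writeLines(c(
  "ID\tM/F\tHand\tAge\tEduc\tSES\tMMSE\tCDR\teTIV\tnWBV\tASF\tDelay",
  "OAS1_0308_MR1\tF\tR\t78\t3\t3\t15\t2\t1401\t0.703\t1.253\tN/A",
  "OAS1_0351_MR1\tM\tR\t86\t1\t4\t15\t2\t1512\t0.665\t1.161\tN/A"
), tmp)

# read.delim() defaults to sep = "\t"; na.strings turns "N/A" into NA
oasis_demo <- read.delim(tmp, na.strings = "N/A")

# One column per field, with numeric columns already parsed as numbers
str(oasis_demo)
```

With this approach the fields arrive already separated and typed, so the manual splitting step would not be needed; the report keeps read.csv() to illustrate the clean-up.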

2. Loading Data: Cleaning

From the output above, we can see that all of the data is grouped into one column, with the header row collapsed into a single column name. Below, I separate that column into its fields and assign the column names, then write the result to a CSV file and read it back.

oasis_tsv <- oasis_tsv %>%
  separate(
    col = names(oasis_tsv)[1],
    into = c("ID", "Gender", "Hand_Dominance", "Age", "Educ", "Socio_Econ", "MMSE", "CDR", "eTIV", "nWBV", "ASF", "Delay"),
    sep = "\t"
  )

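# Round-trip through a CSV so that read.csv() re-parses the text columns as numbers;
# write.csv() also stores the row names, which come back as the extra column X
write.csv(oasis_tsv, "oasis.csv")

oasis_csv <- read.csv("oasis.csv")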

head(oasis_csv)
##   X            ID Gender Hand_Dominance Age Educ Socio_Econ MMSE CDR eTIV  nWBV
## 1 1 OAS1_0308_MR1      F              R  78    3          3   15   2 1401 0.703
## 2 2 OAS1_0351_MR1      M              R  86    1          4   15   2 1512 0.665
## 3 3 OAS1_0028_MR1      F              R  86    2          4   27   1 1449 0.738
## 4 4 OAS1_0031_MR1      M              R  88    1          4   26   1 1419 0.674
## 5 5 OAS1_0035_MR1      F              R  84    3          2   28   1 1402 0.695
## 6 6 OAS1_0052_MR1      F              R  78    1          5   23   1 1462 0.697
##     ASF Delay
## 1 1.253   N/A
## 2 1.161   N/A
## 3 1.211   N/A
## 4 1.236   N/A
## 5 1.252   N/A
## 6 1.200   N/A
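The X column in the output above comes from write.csv() storing the row names, and the Delay column keeps the literal text N/A. A small sketch of how both could be avoided, using a made-up three-column data frame rather than the real file (the names demo and demo_back are illustrative only):

```r
# Toy data frame standing in for the separated data: every column is character
demo <- data.frame(
  ID = c("OAS1_0308_MR1", "OAS1_0351_MR1"),
  Age = c("78", "86"),
  Delay = c("N/A", "N/A"),
  stringsAsFactors = FALSE
)

tmp <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file, so no X column appears
write.csv(demo, tmp, row.names = FALSE)

# Reading back re-parses the text: Age becomes integer and "N/A" becomes NA
demo_back <- read.csv(tmp, na.strings = c("NA", "N/A"))

names(demo_back)
sapply(demo_back, class)
```

Note that the positional column removal in the next step assumes X is present, so the indices there would need to shift if row.names = FALSE were used in the report itself.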

3. Removing Columns

In this step, I remove the columns that hold data not needed for the analysis and keep only the ID, Gender, Age, Education, Socio-Economic Status, and the CDR, or Clinical Dementia Rating.

# Drop X, Hand_Dominance, MMSE, eTIV, nWBV, ASF, and Delay by position
dementia <- oasis_csv[, -c(1, 4, 8, 10:13)]

colnames(dementia) <- c("ID", "Gender", "Age", "Educ", "Socio_Econ", "DementiaRating")

head(dementia)
##              ID Gender Age Educ Socio_Econ DementiaRating
## 1 OAS1_0308_MR1      F  78    3          3              2
## 2 OAS1_0351_MR1      M  86    1          4              2
## 3 OAS1_0028_MR1      F  86    2          4              1
## 4 OAS1_0031_MR1      M  88    1          4              1
## 5 OAS1_0035_MR1      F  84    3          2              1
## 6 OAS1_0052_MR1      F  78    1          5              1
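Dropping columns by position works here, but it silently selects the wrong columns if the column order ever changes. As a hedged alternative, the same six columns can be kept by name. The sketch below uses a two-row stand-in for oasis_csv with values taken from the output above; oasis_demo and dementia_demo are illustrative names.

```r
# Two-row stand-in for oasis_csv with the same column names
oasis_demo <- data.frame(
  X = 1:2,
  ID = c("OAS1_0308_MR1", "OAS1_0351_MR1"),
  Gender = c("F", "M"),
  Hand_Dominance = c("R", "R"),
  Age = c(78, 86),
  Educ = c(3, 1),
  Socio_Econ = c(3, 4),
  MMSE = c(15, 15),
  CDR = c(2, 2),
  eTIV = c(1401, 1512),
  nWBV = c(0.703, 0.665),
  ASF = c(1.253, 1.161),
  Delay = c(NA, NA)
)

# Keep columns by name, so the selection does not depend on column order
keep <- c("ID", "Gender", "Age", "Educ", "Socio_Econ", "CDR")
dementia_demo <- oasis_demo[, keep]

# The positional version selects the same columns for this layout
by_position <- oasis_demo[, -c(1, 4, 8, 10:13)]
identical(names(by_position), keep)

# Rename only CDR, leaving the other names untouched
names(dementia_demo)[names(dementia_demo) == "CDR"] <- "DementiaRating"
names(dementia_demo)
```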

4. Data Types

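Although the data frame seems usable, the data types are not all what the analysis needs. For example, Educ and Socio_Econ are category codes rather than quantities, so they should be factors. I change the data types of select columns so that I can do some analysis.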

dementia_df <- dementia %>%
  mutate(
    Age = as.numeric(Age),
    Gender = as.factor(Gender),
    Educ = as.factor(Educ),
    Socio_Econ = as.factor(Socio_Econ),
    DementiaRating = as.numeric(DementiaRating)
  )

head(dementia_df)
##              ID Gender Age Educ Socio_Econ DementiaRating
## 1 OAS1_0308_MR1      F  78    3          3              2
## 2 OAS1_0351_MR1      M  86    1          4              2
## 3 OAS1_0028_MR1      F  86    2          4              1
## 4 OAS1_0031_MR1      M  88    1          4              1
## 5 OAS1_0035_MR1      F  84    3          2              1
## 6 OAS1_0052_MR1      F  78    1          5              1
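The head() output looks the same before and after the conversion, because printing a data frame does not show column types. One way to confirm the change is to inspect the class of each column. This is a small sketch on a made-up data frame (demo is an illustrative name), using base R conversions equivalent to the mutate() call above.

```r
# Toy data frame with the types read.csv() would typically produce
demo <- data.frame(
  Gender = c("F", "M", "F"),
  Age = c(78L, 86L, 86L),
  Educ = c(3L, 1L, 2L),
  DementiaRating = c(2, 2, 1),
  stringsAsFactors = FALSE
)

# Class of each column before conversion
sapply(demo, class)

demo$Gender <- as.factor(demo$Gender)
demo$Age <- as.numeric(demo$Age)
demo$Educ <- as.factor(demo$Educ)

# Class of each column after conversion
sapply(demo, class)

# Factor levels are the distinct codes, stored as text and sorted
levels(demo$Educ)
```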

5. Comparing Variables & Visualization

(a) Gender vs CDR

Now that I’ve narrowed down my data, I group by one of the columns and take the mean Dementia Rating within each group. I then use ggplot to visualize the comparison.

avg_cdr_gender <- dementia_df %>%
  group_by(Gender) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

avg_cdr_gender
## # A tibble: 2 × 2
##   Gender avg_dementia_rating
##   <fct>                <dbl>
## 1 F                    0.260
## 2 M                    0.335

ggplot(avg_cdr_gender, aes(x = Gender, y = avg_dementia_rating, fill = Gender)) +
  geom_col() +
  labs(title = "Average Dementia Rating by Gender", x = "Gender", y = "Average Dementia Rating") +
  theme_minimal()
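A group mean is easier to judge when it is shown next to the number of ratings behind it and their spread. The following is a minimal sketch of that idea on made-up values, assuming dplyr is installed; demo, summary_demo, n_rated, and sd_dementia_rating are illustrative names, not part of the original analysis.

```r
library(dplyr)

# Toy data (made-up values) standing in for dementia_df
demo <- data.frame(
  Gender = factor(c("F", "F", "F", "M", "M")),
  DementiaRating = c(0, 0.5, NA, 1, 0)
)

summary_demo <- demo %>%
  group_by(Gender) %>%
  summarize(
    n_rated = sum(!is.na(DementiaRating)),                     # ratings actually used
    avg_dementia_rating = mean(DementiaRating, na.rm = TRUE),  # same statistic as above
    sd_dementia_rating = sd(DementiaRating, na.rm = TRUE)      # spread within the group
  )

summary_demo
```

A small difference between two means carries less weight when the groups are small or the ratings vary widely within each group.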

(b) Age vs CDR

# Remove rows with NA in specific columns (e.g., CDR, Gender, Age)
dementia_df <- dementia_df %>%
  filter(!is.na(DementiaRating), !is.na(Gender), !is.na(Age))

# Calculate average CDR by Age
avg_cdr_age <- dementia_df %>%
  group_by(Age) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

avg_cdr_age
## # A tibble: 51 × 2
##      Age avg_dementia_rating
##    <dbl>               <dbl>
##  1    33                   0
##  2    39                   0
##  3    43                   0
##  4    45                   0
##  5    46                   0
##  6    47                   0
##  7    48                   0
##  8    49                   0
##  9    50                   0
## 10    51                   0
## # ℹ 41 more rows

ggplot(avg_cdr_age, aes(x = Age, y = avg_dementia_rating)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  labs(title = "Average Dementia Rating by Age", x = "Age", y = "Average Dementia Rating") +
  theme_minimal()
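The code above both filters out rows with a missing rating and passes na.rm = TRUE to mean(). For the group means these two steps do the same job, as the following sketch on made-up values suggests. It uses base R's tapply() instead of group_by() to stay dependency-free; demo, with_na_rm, and after_filter are illustrative names.

```r
# Toy data (made-up values) with some missing ratings
demo <- data.frame(
  Age = c(60, 60, 70, 70, 70),
  DementiaRating = c(0, NA, 0.5, 1, NA)
)

# Option 1: keep all rows and let mean() skip the missing ratings
with_na_rm <- tapply(demo$DementiaRating, demo$Age, mean, na.rm = TRUE)

# Option 2: drop rows with a missing rating first, then average
complete <- demo[!is.na(demo$DementiaRating), ]
after_filter <- tapply(complete$DementiaRating, complete$Age, mean)

with_na_rm
after_filter
```

The two options differ only for a group whose ratings are all missing: option 1 reports NaN for it, while option 2 drops the group entirely.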

(c) Socio-Economic Status vs CDR

# Calculate average CDR by Socio-economic Status (SES)
avg_cdr_ses <- dementia_df %>%
  group_by(Socio_Econ) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

avg_cdr_ses
## # A tibble: 6 × 2
##   Socio_Econ avg_dementia_rating
##   <fct>                    <dbl>
## 1 1                        0.21 
## 2 2                        0.185
## 3 3                        0.265
## 4 4                        0.398
## 5 5                        0.5  
## 6 <NA>                     0.553

ggplot(avg_cdr_ses, aes(x = factor(Socio_Econ), y = avg_dementia_rating, fill = factor(Socio_Econ))) +
  geom_col() +
  labs(title = "Average Dementia Rating by Socio-economic Status", x = "Socio-economic Status", y = "Average Dementia Rating") +
  theme_minimal()
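The table above contains a sixth row, <NA>, for individuals whose socio-economic status is unknown, and that row is plotted as its own bar. If the unknown group should be left out of the comparison, one option is to filter it before grouping. This is a sketch on made-up values, assuming dplyr is installed; demo, with_na, and without_na are illustrative names.

```r
library(dplyr)

# Toy data (made-up values) with two individuals of unknown status
demo <- data.frame(
  Socio_Econ = factor(c(1, 1, 2, NA, NA)),
  DementiaRating = c(0, 0.5, 0, 1, 0.5)
)

# group_by() keeps NA as its own group, which then shows up as an extra bar
with_na <- demo %>%
  group_by(Socio_Econ) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

# Dropping rows with unknown status first leaves only the coded levels
without_na <- demo %>%
  filter(!is.na(Socio_Econ)) %>%
  group_by(Socio_Econ) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

nrow(with_na)     # 3 groups: 1, 2, NA
nrow(without_na)  # 2 groups: 1, 2
```

Whether to drop the unknown group is a judgment call; keeping it is reasonable when the missingness itself may be informative.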

(d) Education vs CDR

# Calculate average CDR by Education
avg_cdr_education <- dementia_df %>%
  group_by(Educ) %>%
  summarize(avg_dementia_rating = mean(DementiaRating, na.rm = TRUE))

avg_cdr_education
## # A tibble: 5 × 2
##   Educ  avg_dementia_rating
##   <fct>               <dbl>
## 1 1                   0.522
## 2 2                   0.359
## 3 3                   0.245
## 4 4                   0.24 
## 5 5                   0.167

ggplot(avg_cdr_education, aes(x = factor(Educ), y = avg_dementia_rating, fill = factor(Educ))) +
  geom_col() +
  labs(title = "Average Dementia Rating by Education Level", x = "Education Level", y = "Average Dementia Rating") +
  theme_minimal()
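Because education level is an ordered code, the downward trend in the table above can be summarized with a rank correlation. The sketch below applies Spearman's correlation to the five group means copied from that table. This step is an illustration added here, not part of the original analysis; educ_level, avg_cdr, and rho are illustrative names.

```r
# Group means copied from the table above
educ_level <- 1:5
avg_cdr <- c(0.522, 0.359, 0.245, 0.24, 0.167)

# Spearman's rank correlation uses only the ordering of the values
rho <- cor(educ_level, avg_cdr, method = "spearman")
rho  # -1, because the means fall at every step
```

This describes only the five group means. Computed on the individual ratings, the correlation would be expected to be weaker, since individuals vary within each education level.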

Conclusion

My analysis of this data set compared the CDR (Clinical Dementia Rating) with Age, Gender, Socio-Economic Status, and Education. The process involved using the filter() and group_by() functions to narrow down the data set, summarizing each group by its mean CDR, and looking for the pairs of variables that most clearly suggest a relationship.

When grouped by gender, males have a larger average CDR (0.335 versus 0.260 for females). This could reflect biological differences between males and females, although the averages alone do not establish a cause. When comparing CDR and Age, the average rises noticeably for those older than 60, as we would naturally expect.

When comparing CDR to Socio-Economic Status, the average CDR rises as the Socio_Econ code goes from 2 to 5. Reading a larger code as a higher status, one possibility is that higher status is associated with jobs that carry more responsibility and therefore more stress, which may contribute to dementia. This reading depends on how the variable is coded, however: if 1 denotes the highest status and 5 the lowest, as in the Hollingshead index, the same numbers mean that lower status goes with a higher average CDR.

In contrast, there is an inverse relationship between education level and CDR. This could be because more educated people tend to have more ‘active’ minds, which may delay the onset of dementia symptoms.