Load Libraries and Import Data Set

Load libraries:

library(tidyverse) # package for plotting/data management
## Warning: package 'tidyverse' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## Warning: package 'lubridate' was built under R version 4.1.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.4     v readr     2.1.5
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import the data:

# read in data

ra <- read_csv(file = "ArthritisTreatment.csv")
## Rows: 530 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (14): ID, Age, AgeGp, Sex, Yrs_From_Dx, CDAI, CDAI_YN, DAS_28, DAS28_YN,...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Take a look at the data (just top few rows)
head(ra)

Part 1: Study Design and Data

a) Study design

Do these data come from an experiment or an observational study?

Answer here (1a): observational

b) Sampling

How were the data sampled? That is, who was included in this study?

Answer here (1b): RA patients at a care clinic in Philadelphia (some excluding factors positive TB testing, HIV infection, active viral hepatitis, liver disease with AST/ALT 2x normal limits, history of organ transplantation, and/or active malignancy )

c) Consequences of study design

Considering your answers to parts 1a and 1b, how might the study design impact our ability to generalize the results from this sample to other Rheumatoid Arthritis patients?

Answer here (1c):there is some sampling bias in that there is only one group of patients from a clinic in one city, they are also the patients that came in to the clinic for treatments.

d) Making factor variables

Two variables of interest for this lab are Sex and DAS28_YN (an indicator of if disease activity score was measured). These two variables are coded as numbers (with 1s and 2s or 0s and 1s), but they represent categorical variables. For readability, create two new factor variables, Sex_f, and DAS28_YN_f, with appropriate labels for each of these variables. We will use for the remainder of the assignment.

# create new variables with labels


ra <- ra %>%  #new variable DAS28_YN_f tells Yes or No if DAS28 was measured
  mutate(DAS28_YN_f = factor(DAS28_YN, levels = c(1,2), 
                          labels = c("N","Y")))
ra <- ra %>% 
  mutate(Sex_f = factor(Sex, levels = c(0,1), #new variable Sex_f tells if patient is Female or Male
                          labels = c("F","M")))
ra %>%
  select(DAS28_YN, DAS28_YN_f, Sex, Sex_f) #this displays previous code to check it

Part 2: Univariate Descriptions

Create an appropriate visualization and numerical summary for each of the variables listed below. Using the graphical and numerical summaries you create, describe the variable.

a) Sex

# table of counts
xtabs( ~ Sex_f, data = ra) %>% addmargins()
## Sex_f
##   F   M Sum 
## 428 102 530
# table of proportions
xtabs( ~ Sex_f, data = ra) %>% prop.table()
## Sex_f
##         F         M 
## 0.8075472 0.1924528
# plot (histogram)
ra %>%
  ggplot(aes(x = Sex_f)) + # choose variable to plot
  geom_bar() + 
  theme_bw() + # background color
  labs(x = "Sex")

Description of variable distribution here: many more female, 4 times more which makes sense for RA

b) Age

# summary statistics
ra %>%
  summarise(mean = mean(Age),
            median = median(Age),
            min = min(Age),
            max = max(Age), 
            q25 = quantile(Age, 0.25),
            q75 = quantile(Age,0.75),
            sd = sd(Age)) %>%
mutate(IQR = q75-q25)
# histogram 
ra %>% ggplot(aes(x = Age)) + # plot age
  geom_histogram(binwidth = 1) + # each year gets its own bin, the larger the bin the smoother the histogram
  theme_bw() # background color

# boxplot
ra %>% 
  ggplot(aes(y = Age)) + # change to x = Age for a horizontal boxplot
  geom_boxplot() + 
  theme_bw() # background color

Description of variable distribution here: Data is slightly right skewed though it is very close to bellshaped. oddly there are no values between 71-74. There are some outliers based on the boxplot.75% of the patients seem to be above 54 years old. The median is 59 while the mean is 60 (60.68679). THe youngest is 42 while the oldest is 90.

c) Whether DAS_28 was measured

# table of counts

xtabs( ~ DAS28_YN, data = ra) %>% addmargins()
## DAS28_YN
##   1   2 Sum 
## 464  66 530
# table of proportions
xtabs( ~ DAS28_YN, data = ra) %>% prop.table()
## DAS28_YN
##         1         2 
## 0.8754717 0.1245283
# plot (histogram/bar plot)
ra %>% 
 #drop_na(DAS28_YN) %>%
  ggplot(aes(x = "DAS28_YN")) + 
  geom_bar() + 
  theme_bw() 
## Warning in geom_bar(): All aesthetics have length 1, but the data has 530 rows.
## i Please consider using `annotate()` or provide this layer with data containing
##   a single row.

Description of variable distribution here:464 people did not have DAS28 measured while 66 did and have coordinating values.

d) DAS_28 values

# summary stats
ra%>%
  drop_na(DAS_28) %>% # this is necessary so that you don't get all NA values below
  summarise(mean = mean(DAS_28),
            median = median(DAS_28),
            min = min(DAS_28),
            max = max(DAS_28), 
            q25 = quantile(DAS_28, 0.25),
            q75 = quantile(DAS_28,0.75),
            sd = sd(DAS_28)) %>%
mutate(IQR = q75-q25)
# histogram
ra %>% 
  ggplot(aes( x = DAS_28)) + 
  geom_histogram(binwidth = 2) + 
  theme_bw()
## Warning: Removed 464 rows containing non-finite outside the scale range
## (`stat_bin()`).

# boxplot
ra %>% 
  ggplot(aes( y = DAS_28)) + 
  geom_boxplot() + 
  theme_bw()
## Warning: Removed 464 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Description of variable distribution here:THe majority of patients (those that actually have a recording) are under 5, Mean is 2.922727 while the median is 2.5. Range is 0 to 23. 75% under 3.31.There is one high outlier above 20, the histo is skewed right.

e) Years from diagnosis

# I don't include any code hints in this chunk. Use the chunks you've already completed to finish this section!

# summary stats
ra%>%
  drop_na(Yrs_From_Dx) %>% # this is necessary so that you don't get all NA values below
  summarise(mean = mean(Yrs_From_Dx),
            median = median(Yrs_From_Dx),
            min = min(Yrs_From_Dx),
            max = max(Yrs_From_Dx), 
            q25 = quantile(Yrs_From_Dx, 0.25),
            q75 = quantile(Yrs_From_Dx,0.75),
            sd = sd(Yrs_From_Dx)) %>%
mutate(IQR = q75-q25)
# histogram

ra %>% 
  ggplot(aes( x = Yrs_From_Dx)) + 
  geom_histogram(binwidth = 2) + 
  theme_bw()
## Warning: Removed 15 rows containing non-finite outside the scale range
## (`stat_bin()`).

# boxplot

ra %>% 
  ggplot(aes( y = Yrs_From_Dx)) + 
  geom_boxplot() + 
  theme_bw()
## Warning: Removed 15 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Description of variable distribution here:Nearly all of the points are under 20 years since diagnosis, with the exception of outliers. Very heavily right skewed.

Part 3: Bivariate Descriptions

a) Years from Diagnosis and DAS 28

Create an appropriate visualization and numerical summary to examine the relationship between DAS_28 and years from diagnosis (Yrs_From_Dx). Using these descriptions, comment on the relationship between years since diagnosis and disease severity.

# visualization

ra %>% 
  ggplot(aes(x = DAS_28, y = Yrs_From_Dx)) + 
  geom_point() 
## Warning: Removed 464 rows containing missing values or values outside the scale range
## (`geom_point()`).

# numeric summary
cor(ra$DAS_28, ra$Yrs_From_Dx, use = "pairwise.complete")
## [1] -0.08041014

Response here (relationship between variables):the correlation is -0.0804 which is close to 0 or meaning there’s a weak linear relation

b) Possible problematic points

In the visualization you created above, are there any data points that you are concerned about that you might consider taking a second look at? If yes, what would happen if you excluded that point from the analysis?

# use this chunk to identiy any problematic points and exclude from the analysis you did above, if you found any potentially problematic points

max(ra$DAS_28, na.rm = TRUE)
## [1] 23
index_0 = which(ra$DAS_28==23)
print(index_0)
## [1] 257
cor(ra$DAS_28[-257], ra$Yrs_From_Dx[-257], use = "pairwise.complete")
## [1] 0.1029336

Response here:Yes there are extreme outliers but removing them didn't effect the correlation much


## c) Whether DAS_28 was measured and age

__Create an appropriate visualization and numerical summary to examine the relationship between whether the DAS_28 was measured and age (Age). Compare the age distribution between the two groups (those with and those without DAS_28 measured). Are there any differences?__


``` r
# summary stats
ra%>% 
  drop_na(DAS_28) %>%
  group_by(Age,DAS_28) %>%
   summarise(count = n()) %>% 
  ungroup() %>%
  mutate(prop = count/sum(count))
## `summarise()` has grouped output by 'Age'. You can override using the `.groups`
## argument.
cor(ra$Age, ra$DAS_28, use = "pairwise.complete")
## [1] 0.04215799
xtabs( ~ DAS28_YN + Age,ra) %>% addmargins()
##         Age
## DAS28_YN  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58
##      1     6   1   8   6   6  14  10   7  10  15  15  16  25  21  13  18  20
##      2     0   1   0   2   1   0   0   2   3   3   2   4   4   3   2   2   5
##      Sum   6   2   8   8   7  14  10   9  13  18  17  20  29  24  15  20  25
##         Age
## DAS28_YN  59  60  61  62  63  64  65  66  67  68  69  70  75  76  77  78  79
##      1    19  20  23  16  20  16  14  14   8  16  12   8   3   7  11   3   4
##      2     3   5   5   0   1   2   3   3   3   2   0   1   0   2   0   1   0
##      Sum  22  25  28  16  21  18  17  17  11  18  12   9   3   9  11   4   4
##         Age
## DAS28_YN  80  81  82  83  84  85  86  87  88  90 Sum
##      1     6   2   4   5   4   3   3   3   2   7 464
##      2     0   1   0   0   0   0   0   0   0   0  66
##      Sum   6   3   4   5   4   3   3   3   2   7 530
xtabs( ~ DAS28_YN + Age,ra) %>% prop.table(margin = 1)
##         Age
## DAS28_YN          42          43          44          45          46
##        1 0.012931034 0.002155172 0.017241379 0.012931034 0.012931034
##        2 0.000000000 0.015151515 0.000000000 0.030303030 0.015151515
##         Age
## DAS28_YN          47          48          49          50          51
##        1 0.030172414 0.021551724 0.015086207 0.021551724 0.032327586
##        2 0.000000000 0.000000000 0.030303030 0.045454545 0.045454545
##         Age
## DAS28_YN          52          53          54          55          56
##        1 0.032327586 0.034482759 0.053879310 0.045258621 0.028017241
##        2 0.030303030 0.060606061 0.060606061 0.045454545 0.030303030
##         Age
## DAS28_YN          57          58          59          60          61
##        1 0.038793103 0.043103448 0.040948276 0.043103448 0.049568966
##        2 0.030303030 0.075757576 0.045454545 0.075757576 0.075757576
##         Age
## DAS28_YN          62          63          64          65          66
##        1 0.034482759 0.043103448 0.034482759 0.030172414 0.030172414
##        2 0.000000000 0.015151515 0.030303030 0.045454545 0.045454545
##         Age
## DAS28_YN          67          68          69          70          75
##        1 0.017241379 0.034482759 0.025862069 0.017241379 0.006465517
##        2 0.045454545 0.030303030 0.000000000 0.015151515 0.000000000
##         Age
## DAS28_YN          76          77          78          79          80
##        1 0.015086207 0.023706897 0.006465517 0.008620690 0.012931034
##        2 0.030303030 0.000000000 0.015151515 0.000000000 0.000000000
##         Age
## DAS28_YN          81          82          83          84          85
##        1 0.004310345 0.008620690 0.010775862 0.008620690 0.006465517
##        2 0.015151515 0.000000000 0.000000000 0.000000000 0.000000000
##         Age
## DAS28_YN          86          87          88          90
##        1 0.006465517 0.006465517 0.004310345 0.015086207
##        2 0.000000000 0.000000000 0.000000000 0.000000000
# visualization(s)

ra %>%  
  
  ggplot(aes(x = Age, y = DAS_28)) + 
  geom_point()+
  theme_bw()
## Warning: Removed 464 rows containing missing values or values outside the scale range
## (`geom_point()`).

Description here: There seems to be a weak linear relaztionship between age and DSA_28, this seems weird which would lead me to believe my coding is wrong. I would imagine your DAS_28 would get worse with age, but there is a ton of missing data.

d) Whether DAS_28 was measured and sex

Create an appropriate visualization and numerical summary to examine the relationship between whether the DAS_28 was measured and sex (Sex_f). Calculate the difference in the proportion of subjects with disease activity scores measured between male and female study participants. Using these descriptions, comment on the relationship between age group and having DAS_28 measured.

# numeric summary

xtabs( ~ DAS28_YN + Sex_f,ra) %>% addmargins()
##         Sex_f
## DAS28_YN   F   M Sum
##      1   373  91 464
##      2    55  11  66
##      Sum 428 102 530
xtabs( ~ DAS28_YN + Sex_f,ra) %>% prop.table(margin = 1)
##         Sex_f
## DAS28_YN         F         M
##        1 0.8038793 0.1961207
##        2 0.8333333 0.1666667
# visualization

ra %>%

  drop_na(DAS28_YN, Sex_f) %>%
  ggplot(aes(x = Sex_f, fill = DAS28_YN ))+
  geom_bar(position = "fill")+
  labs(x= "Sex", y = "percent", fill = "DAS measured" )+
  theme_bw()
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## i This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## i Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

Description here:

e) Consequences of 3c-d

Consider what you found in parts 3c and 3d. How could this impact the results in part 3a and the conclusions we can draw from this study?

Response here:Looking at 3c and 3d I would say that age is not a factor is DAS28 score but sex is a factor in wether you get DAS28 measured. The issue is there is so much missing data to know wheter this is real or not, it is surprising and it’s much more likely that given the sex disparity that there is some type of bias. I am also confused about the age because in most immunologic diseases age is correlated with more severe disease, though this may be a result of bias and missing data in the study.

Summary

a) Summarize your findings

Summarize your findings. Describe the study design and the sample collected. What trends did you find? Are there any caveats you would want your audience to consider when interpreting the results you have shown here?

Response here: Based on the analysis (which is flawed in areas by my poor coding skills) there doesn’t seem to be much correlation between any of the variables that we are looking at. There is definitely a difference in the data between male and female participants in that there were drastically more females. There is also very little data on DAS28 which seems like a big problem in this kind of study, given that this is a direct measure of disease severity.

Session Information

Include this so if there are any issues we can check your R/package versions to see if this could be the source of any inconsistencies.

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [5] purrr_1.0.1     readr_2.1.5     tidyr_1.3.0     tibble_3.2.1   
##  [9] ggplot2_3.5.1   tidyverse_2.0.0 knitr_1.48     
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.11        bslib_0.8.0       compiler_4.1.2    pillar_1.9.0     
##  [5] jquerylib_0.1.4   tools_4.1.2       bit_4.0.5         digest_0.6.31    
##  [9] timechange_0.2.0  jsonlite_1.8.4    evaluate_0.24.0   lifecycle_1.0.4  
## [13] gtable_0.3.5      pkgconfig_2.0.3   rlang_1.1.0       cli_3.6.1        
## [17] rstudioapi_0.16.0 parallel_4.1.2    yaml_2.3.7        xfun_0.47        
## [21] fastmap_1.1.1     withr_3.0.1       hms_1.1.3         generics_0.1.3   
## [25] sass_0.4.9        vctrs_0.6.5       bit64_4.0.5       rprojroot_2.0.4  
## [29] grid_4.1.2        tidyselect_1.2.1  glue_1.6.2        R6_2.5.1         
## [33] fansi_1.0.4       vroom_1.6.1       rmarkdown_2.28    farver_2.1.1     
## [37] tzdb_0.3.0        magrittr_2.0.3    scales_1.3.0      htmltools_0.5.8.1
## [41] colorspace_2.1-0  labeling_0.4.3    utf8_1.2.3        stringi_1.7.12   
## [45] munsell_0.5.1     cachem_1.0.7      crayon_1.5.3