##Dataset 1

#Introduction

This data set contains the data behind the story “Marriage Isn’t Dead — Yet” from FiveThirtyEight and presents the proportions of people within certain social, economic, educational, racial, geographical, and employment groups that have never been married, over the years 1960 to 2012. I will tidy and clean the data set to bring to light the factors that are most strongly correlated with the decision to not get married.

#Reading in the csv file from github's raw data url

rawdata <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/marriage/both_sexes.csv")

head(rawdata)
##   X year       date  all_2534   HS_2534   SC_2534  BAp_2534  BAo_2534   GD_2534
## 1 1 1960 1960-01-01 0.1233145 0.1095332 0.1522818 0.2389952 0.2389952        NA
## 2 2 1970 1970-01-01 0.1269715 0.1094000 0.1495096 0.2187031 0.2187031        NA
## 3 3 1980 1980-01-01 0.1991767 0.1617313 0.2236916 0.2881646 0.2881646        NA
## 4 4 1990 1990-01-01 0.2968306 0.2777491 0.2780912 0.3612968 0.3656655 0.3474505
## 5 5 2000 2000-01-01 0.3450087 0.3316545 0.3249205 0.3874906 0.3939579 0.3691740
## 6 6 2001 2001-01-01 0.3527767 0.3446069 0.3341101 0.3835686 0.3925148 0.3590304
##   White_2534 Black_2534 Hisp_2534   NE_2534   MA_2534 Midwest_2534 South_2534
## 1  0.1164848  0.1621855 0.1393736 0.1504184 0.1628934    0.1121467  0.1090562
## 2  0.1179043  0.1855163 0.1298769 0.1517231 0.1640680    0.1153741  0.1126220
## 3  0.1824126  0.3137500 0.1885440 0.2414327 0.2505925    0.1828339  0.1688435
## 4  0.2639256  0.4838556 0.2962372 0.3500384 0.3623321    0.2755046  0.2639794
## 5  0.3127149  0.5144994 0.3180681 0.4091852 0.4175565    0.3308022  0.3099712
## 6  0.3183506  0.5437985 0.3321214 0.4200581 0.4294281    0.3344332  0.3182688
##   Mountain_2534 Pacific_2534 poor_2534   mid_2534 rich_2534   all_3544
## 1    0.09152117    0.1198758 0.1371597 0.07514929 0.2066776 0.07058157
## 2    0.10293602    0.1374964 0.1717202 0.08159207 0.1724093 0.06732520
## 3    0.17434230    0.2334279 0.3100591 0.14825303 0.1851082 0.06883378
## 4    0.25264326    0.3319579 0.4199108 0.24320008 0.2783226 0.11191800
## 5    0.30621032    0.3753061 0.5033676 0.30202036 0.2717386 0.15605881
## 6    0.30980779    0.3844799 0.5178771 0.31716118 0.2532041 0.15642529
##      HS_3544    SC_3544  BAp_3544  BAo_3544   GD_3544 White_3544 Black_3544
## 1 0.06860309 0.06663695 0.1326265 0.1326265        NA 0.06825586 0.08836728
## 2 0.06511964 0.06271724 0.1116899 0.1116899        NA 0.06250372 0.10290904
## 3 0.06429102 0.06531333 0.1056102 0.1056102        NA 0.05966739 0.13140081
## 4 0.11210043 0.09699372 0.1285172 0.1258567 0.1328018 0.09611312 0.22010298
## 5 0.16993703 0.13800404 0.1541238 0.1536299 0.1550970 0.13207032 0.30239381
## 6 0.16870156 0.13986044 0.1548151 0.1524923 0.1595169 0.13287455 0.30857796
##    Hisp_3544    NE_3544    MA_3544 Midwest_3544 South_3544 Mountain_3544
## 1 0.07307651 0.09194322 0.09347468   0.06863360 0.06026353    0.04739747
## 2 0.07070500 0.08570110 0.09040725   0.06156272 0.05966057    0.04651163
## 3 0.08110790 0.07997323 0.09744428   0.06070641 0.05914089    0.04880077
## 4 0.12194206 0.12785915 0.14354989   0.10157576 0.09637035    0.09189904
## 5 0.15469520 0.17327422 0.18819256   0.14539201 0.14230600    0.13584194
## 6 0.14953050 0.16653497 0.18315109   0.14794407 0.14312592    0.13943820
##   Pacific_3544 poor_3544   mid_3544  rich_3544   all_4554    HS_4554    SC_4554
## 1   0.05822486 0.1019749 0.04717272 0.08553870 0.07254649 0.06840792 0.07903755
## 2   0.06347796 0.1117548 0.04566838 0.06499159 0.05968794 0.05833439 0.05443478
## 3   0.07552538 0.1291426 0.05050321 0.04445951 0.05250871 0.05036563 0.04816180
## 4   0.13134638 0.2012208 0.09024739 0.06573916 0.05947824 0.05988244 0.04654087
## 5   0.17480047 0.2813137 0.12815751 0.08622046 0.08804394 0.09442809 0.07558786
## 6   0.17694864 0.2919112 0.13267625 0.06803283 0.08823342 0.09189007 0.07795481
##     BAp_4554   BAo_4554    GD_4554 White_4554 Black_4554  Hisp_4554    NE_4554
## 1 0.15360889 0.15360889         NA 0.07246692 0.06913249 0.06636058 0.10236412
## 2 0.10466047 0.10466047         NA 0.05754799 0.07899168 0.05810740 0.08028082
## 3 0.08623774 0.08623774         NA 0.04765354 0.08624602 0.06522951 0.06930253
## 4 0.07301884 0.06416529 0.08394886 0.05092552 0.11617699 0.07613556 0.07047502
## 5 0.09208417 0.09097472 0.09362802 0.07578174 0.17587334 0.09418009 0.10232170
## 6 0.09333365 0.09313480 0.09362876 0.07516912 0.18154531 0.09409896 0.09868408
##      MA_4554 Midwest_4554 South_4554 Mountain_4554 Pacific_4554 poor_4554
## 1 0.09264788   0.07285321 0.05977295    0.04754183   0.05996993 0.1030055
## 2 0.07860635   0.05791163 0.05174462    0.03970134   0.04826312 0.1016489
## 3 0.07508466   0.04807290 0.04485348    0.03374438   0.04958992 0.1003011
## 4 0.08373134   0.05398391 0.05043636    0.04459411   0.06461875 0.1148335
## 5 0.11269659   0.08302437 0.07631858    0.07637774   0.09896832 0.1718976
## 6 0.10953635   0.08207629 0.07886513    0.07405971   0.10119511 0.1759369
##     mid_4554  rich_4554 nokids_all_2534 kids_all_2534 nokids_HS_2534
## 1 0.05364421 0.07908591       0.4640564   0.002820625      0.4430148
## 2 0.04221637 0.05142867       0.4309043   0.009868596      0.4246779
## 3 0.03830266 0.03311296       0.4464304   0.025285667      0.4319342
## 4 0.04562332 0.03136386       0.5425242   0.060277451      0.5464881
## 5 0.07055672 0.03897342       0.5714531   0.099472713      0.5711395
## 6 0.07407508 0.02857320       0.5852213   0.110178467      0.6045475
##   nokids_SC_2534 nokids_BAp_2534 nokids_BAo_2534 nokids_GD_2534 kids_HS_2534
## 1      0.5000402       0.5619099       0.5619099             NA  0.003318886
## 2      0.4333479       0.4554766       0.4554766             NA  0.012465915
## 3      0.4505900       0.4719700       0.4719700             NA  0.031930752
## 4      0.5238446       0.5560765       0.5633301      0.5332628  0.078470444
## 5      0.5700042       0.5729677       0.5862213      0.5367160  0.127193577
## 6      0.5810912       0.5698644       0.5864967      0.5258800  0.141395652
##   kids_SC_2534 kids_BAp_2534 kids_BAo_2534 kids_GD_2534 nokids_poor_2534
## 1  0.001150824  0.0005751073  0.0005751073           NA        0.4933061
## 2  0.003699982  0.0014683425  0.0014683425           NA        0.5097742
## 3  0.018135401  0.0062544364  0.0062544364           NA        0.5740402
## 4  0.052032702  0.0171241042  0.0181766027   0.01374234        0.6546908
## 5  0.097625310  0.0370024452  0.0401009875   0.02761467        0.7055451
## 6  0.110030662  0.0399801447  0.0445838012   0.02645041        0.7147334
##   nokids_mid_2534 nokids_rich_2534 kids_poor_2534 kids_mid_2534 kids_rich_2534
## 1       0.4100080        0.4921184    0.008722711  0.0007532065   0.0008027331
## 2       0.3764538        0.4288948    0.029974945  0.0033771145   0.0030435661
## 3       0.3998250        0.3848089    0.077926214  0.0102368871   0.0068317224
## 4       0.5186604        0.4750156    0.170763774  0.0274655254   0.0182329127
## 5       0.5690228        0.4458023    0.256281918  0.0597845173   0.0295644698
## 6       0.5864741        0.4461111    0.280146488  0.0677954572   0.0336540502
#loading packages
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Data Cleaning

The data is relatively clean, although there are a few null values and two unnecessary columns. Removing the top four rows will allow me to make year over year comparisons of the proportions of people never married and addresses the null values. The first four rows contain data for years 1960, 1970, 1980, and 1990, then the remaining rows contain individual years of survey data from 2000 to 2012.

#Removing unnecessary columns X and Date (we are given the year in column 2). Removing rows 1 to 4.
rawdata <- rawdata[-c(1:4),-c(1,3)]

#Data Tidying

The first thing that I notice when looking to tidy the dataset here is that the column headers are values not variable names, and that the column headers contain multiple variables (an attribute of the observational unit and an age range).

Assuming that the column headers without the kids or nokids denomination show proportions of marriage taken from a combination of people with kids and no kids, I will create a variable for kids in our tidy data table.

#Analysis planning

For analysis I will make comparisons between the change in never married proportions for the different attributes and discover if any attributes are markedly above the change in proportion for all attributes, then compare if having kids at home makes a significant impact on these proportions.

#Data Tidying

#Seperating the table into attributes and kids tables to standardize the columns then bind back together

attributes <- rawdata %>%
  select(-matches("^kids"), -matches("^nokids"))

attributes <- attributes %>%
  pivot_longer(
    cols = 2:55,
    names_to = "attribute",
    values_to = "p_never.married",
    values_drop_na = TRUE
  )

#Splitting the out variables
attributes <- attributes %>%
  separate(attribute, into = c("attribute", "age.range"),sep = "_") %>%
  mutate(age.range = sub("(\\d{2})(\\d{2})", "\\1-\\2", age.range))

attributes <- attributes %>%
  mutate(kids = "combined") %>%
  relocate(kids, .after = year)
kids <- rawdata %>%
  select(matches("^(year|kids|nokids).*"))

#Pivoting longer the column headers

kids <- kids %>%
  pivot_longer(
    cols = 2:19,
    names_to = "attribute",
    values_to = "p_never.married",
    values_drop_na = TRUE
  )

#Splitting the column kids into kids, attribute, and age.range 
kids <- kids %>%
  separate(attribute, into = c("kids","attribute", "age.range"),sep = "_") %>%
  mutate(age.range = sub("(\\d{2})(\\d{2})", "\\1-\\2", age.range))

#Creating the final tidy table

#In this tidied final table I have the observation of proportion never married per row, and each observation is described by the variables of year survey was taken, status of kids, attributes specified by the American Community Survey, and the age range of the respondent group.

tidy.data <- bind_rows(attributes,kids)

head(tidy.data)
## # A tibble: 6 × 5
##    year kids     attribute age.range p_never.married
##   <int> <chr>    <chr>     <chr>               <dbl>
## 1  2000 combined all       25-34               0.345
## 2  2000 combined HS        25-34               0.332
## 3  2000 combined SC        25-34               0.325
## 4  2000 combined BAp       25-34               0.387
## 5  2000 combined BAo       25-34               0.394
## 6  2000 combined GD        25-34               0.369

#Data Analysis

To find which attributes are most correlated with an increasing never married proportion I will first explore the different attributes and the effect of kids.

library(ggplot2)
  ggplot(tidy.data, aes(x = ifelse(kids == "combined", year, NA), y = p_never.married, color = age.range)) +
  geom_point() +
  facet_wrap(vars(attribute)) +
  theme(axis.text.x = element_text(angle = 45))
## Warning: Removed 234 rows containing missing values or values outside the scale range
## (`geom_point()`).

For attributes that have the greatest never married proportion, Black, HS, MA, NE, Pacific, and poor stand out in the eye test.

kids <- tidy.data %>% filter(kids ==  "kids")
  ggplot(kids, aes(x = year, y = p_never.married, color = age.range)) +
  geom_point() +
  facet_wrap(vars(attribute)) +
  theme(axis.text.x = element_text(angle = 45))

nokids <- tidy.data %>% filter(kids ==  "nokids")
  ggplot(nokids, aes(x = year, y = p_never.married, color = age.range)) +
  geom_point() +
  facet_wrap(vars(attribute)) +
  theme(axis.text.x = element_text(angle = 45))

td.analysis <- tidy.data %>%
  filter(year %in% c(2000, 2012)) %>%
  group_by(attribute, age.range, kids) %>%
  reframe(difference = p_never.married[year == 2012] - p_never.married[year == 2000])
  
td.analysis %>%
  arrange(-difference)
## # A tibble: 72 × 4
##    attribute age.range kids     difference
##    <chr>     <chr>     <chr>         <dbl>
##  1 HS        25-34     combined      0.192
##  2 HS        25-34     nokids        0.174
##  3 rich      25-34     nokids        0.172
##  4 Black     25-34     combined      0.170
##  5 Hisp      25-34     combined      0.169
##  6 poor      25-34     kids          0.159
##  7 MA        25-34     combined      0.158
##  8 NE        25-34     combined      0.155
##  9 SC        25-34     combined      0.155
## 10 South     25-34     combined      0.155
## # ℹ 62 more rows
#Finding base rates for the increase of never married proportions from 2000 to 2012

td.analysisall <- td.analysis %>%
  filter(attribute == "all", kids == "combined") %>%
  group_by(attribute, age.range, kids)
  
td.analysisall %>%
  arrange(-difference)
## # A tibble: 3 × 4
## # Groups:   attribute, age.range, kids [3]
##   attribute age.range kids     difference
##   <chr>     <chr>     <chr>         <dbl>
## 1 all       25-34     combined     0.149 
## 2 all       35-44     combined     0.0600
## 3 all       45-54     combined     0.0547

My final tidied data table is tidy.data prior to analysis

#Conclusion

Tidying this data table was challenging, I explored the idea of breaking out attribute = all, and column headers starting with kids or no kids, into their own tables, as they are calculated from the same respondent groups as the proportions for column headers with education, race, location or income values, and breaking out each attribute (education, race, location, income) into their own tables as well. My original thoughts on this was that if I were to maintain these data tables, normalizing by breaking out the attributes into tables would be a better organization for additions, edits deletes, etc., then calculating out the kids/no kids and all attributes proportions from the normalized tables.

I ended up leaving leaving the final tidied table as one because I was able to format the table to have one observation per row, and the columns be an attribute of the observation (with the creation of the kids column).

As for the data analysis never married proportions are increasing fastest first for the age range of 25-35, the 22 groups in this dataset with the highest never married proportion increases are from the 25-35 age range. We can also see that people with highschool or below education levels, but also rich with no kids are the groups showing the greatest increases.

##Dataset 2

#Introduction

My second data set comes from the results of the International Mathematics Olympiad. It provides each country’s teams results in a wide and untidy format. I will attempt to analyze if there is any correlation between the number of females on a countries team and the team’s points scored.

library(stringr)
library(dplyr)
library(tidyr)
#reading in the csv
imoresults <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/refs/heads/master/data/2024/2024-09-24/country_results_df.csv")

head(imoresults)
##   year                    country team_size_all team_size_male team_size_female
## 1 2024   United States of America             6              5                1
## 2 2024 People's Republic of China             6              6                0
## 3 2024          Republic of Korea             6              6                0
## 4 2024                      India             6              6                0
## 5 2024                    Belarus             6              6                0
## 6 2024                  Singapore             6              6                0
##   p1 p2 p3 p4 p5 p6 p7 awards_gold awards_silver awards_bronze
## 1 42 41 19 40 35 15 NA           5             1             0
## 2 42 42 31 40 22 13 NA           5             1             0
## 3 42 37 18 42  7 22 NA           2             4             0
## 4 42 34 11 42 28 10 NA           4             1             0
## 5 42 30 10 42 36  5 NA           4             0             2
## 6 42 37  7 42 29  5 NA           1             5             0
##   awards_honorable_mentions                   leader    deputy_leader
## 1                         0              John Berman Carl Schildkraut
## 2                         0               Liang Xiao        Yijun Yao
## 3                         0             Suyoung Choi      Hwajong Yoo
## 4                         1 Krishnan Sivasubramanian      Rijul Saini
## 5                         0           David Zmiaikou  Dzmitry Bazyleu
## 6                         0           Yong Sheng Soh    Teck Kian Teo

#Data cleaning

Removing unnecessary columns for analysis: column p7 has no values, and leader and deputy_leader are not of interest.

#removing columns
imoresults <- imoresults[,-c(12:18)]

#removing rows with null points

imoresults <- imoresults %>% 
  filter(!is.na(p1))

#removing rows with null male and female counts
imoresults <- imoresults %>%
  filter(!(is.na(team_size_male) & is.na(team_size_female)))

#mutating rows to show male and female counts
imoresults <- imoresults %>%
  mutate(
    team_size_male = ifelse(is.na(team_size_male), team_size_all - team_size_female, team_size_male),
    team_size_female = ifelse(is.na(team_size_female), team_size_all - team_size_male, team_size_female)
  )
  
head(imoresults)
##   year                    country team_size_all team_size_male team_size_female
## 1 2024   United States of America             6              5                1
## 2 2024 People's Republic of China             6              6                0
## 3 2024          Republic of Korea             6              6                0
## 4 2024                      India             6              6                0
## 5 2024                    Belarus             6              6                0
## 6 2024                  Singapore             6              6                0
##   p1 p2 p3 p4 p5 p6
## 1 42 41 19 40 35 15
## 2 42 42 31 40 22 13
## 3 42 37 18 42  7 22
## 4 42 34 11 42 28 10
## 5 42 30 10 42 36  5
## 6 42 37  7 42 29  5

#Data Tidying

There are two observational units in our remaining table. I will split this table into imoteams and imopoints in tidy format before beginning analysis.

#Creating and tidying the teams table

tidy.imoteams <- imoresults[,c(1,2,4,5)]

tidy.imoteams <- tidy.imoteams %>%
  pivot_longer(
    cols = 3:4,
    names_to = "column",
    values_to = "count"
  )

tidy.imoteams <- tidy.imoteams %>%
  mutate(sex = str_extract(column, "male$|female$"))

tidy.imoteams <- tidy.imoteams[,-3] %>%
  relocate(sex, .after = country)

head(tidy.imoteams)
## # A tibble: 6 × 4
##    year country                    sex    count
##   <int> <chr>                      <chr>  <int>
## 1  2024 United States of America   male       5
## 2  2024 United States of America   female     1
## 3  2024 People's Republic of China male       6
## 4  2024 People's Republic of China female     0
## 5  2024 Republic of Korea          male       6
## 6  2024 Republic of Korea          female     0

The tidy version of the teams table contains the observation of count with the variables that describe the count. FOr example the first row is 2024 USA male count and the second row is 2024 USA female count.

#Creating and tidying the points table

tidy.imopoints <- imoresults[,c(1,2,6:11)] %>%
  pivot_longer(
    cols = 3:8,
    names_to = "round",
    values_to = "points")

#Removing the "p" from the values in the round column  

tidy.imopoints <- tidy.imopoints %>%
  mutate(round = gsub("[^0-9]", "", round))

head(tidy.imopoints)
## # A tibble: 6 × 4
##    year country                  round points
##   <int> <chr>                    <chr>  <int>
## 1  2024 United States of America 1         42
## 2  2024 United States of America 2         41
## 3  2024 United States of America 3         19
## 4  2024 United States of America 4         40
## 5  2024 United States of America 5         35
## 6  2024 United States of America 6         15

The tidy version of the points table contains the observation of points scored with the variables that describe the points. For example the first row is 2024 USA round 1 points scored.

#Data analysis

For my analysis I want to show the average number of points for each country when there are no females on the team versus when there are females on the team.

To do this I will first sum the points across all rounds in a particular year for each country, then find the years that each country has females on the team and show the average scores of years without females and years with females.

#Adding the points from each round to get the points scored in total from that year and country's team

points_year <- tidy.imopoints %>%
  group_by(year,country)%>%
  summarise(points = sum(points))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
#Filtering to be able to group for analysis later
female.imoteams <- tidy.imoteams %>%
  filter(sex == "female")
#Joining the points_year table and female.imoteams table

analysis <- inner_join(points_year,female.imoteams, by = c("year","country"))
#Creating tables for 
m.analysis <- analysis %>%
  filter(count == 0) %>%
  group_by(country) %>%
  summarize(avgpoints_maleonly = mean(points))
  
f.analysis <- analysis %>%
  filter(count != 0) %>%
  group_by(country) %>%
  summarize(avgpoints_female = mean(points))

avganalysis <- inner_join(m.analysis,f.analysis,by ="country")
avganalysis %>%
  summarise(count = sum(avgpoints_female > avgpoints_maleonly))
## # A tibble: 1 × 1
##   count
##   <int>
## 1    46

My final tidied data tables for data set 2 prior to analysis is tidy.imopoints and tidy.imoteams

Conclusion

The original data set had many null values and encompassed two observational units which needed to be broken out into two separate tables. Once I addressed the null values (ensure that male and female counts added to the total reported, and removed rows where male and female were both null or had null point values), I separated the table into two by indexing the original table.

I then pivoted longer the column headers where they contained values and not variable names. This put the two data tables in tidy form. Using this tidy form I was able to do an average point analysis for years of countries with and without females. For other analysis, for example: how well has the US done over the years, or which round is typically scored the lowest across all countries, the tidy format enables these to be done easily. From the original csv format the same analysis could be done, but in a much more complex way - any ad hoc analysis would be more difficult than it could be with tidy data.

For the analysis, out of 117 countries, 46 of them had better average scores from the total year’s performance with females on the team. There is not enough data to say if females on the team had a significant impact on the team’s average scores.

##Dataset 3

#Introduction

My third data set was synthetically created by a classmate. It contains 4 respondents answers to two different screening tools the GAD7 and PHQ9 pre and post intervention. For my analysis I want to compare the pre and post intervention scores to try to determine if the intervention was successful.

#Importing data
#loading libraries

library(tidyr)
library(dplyr)
fakedata <- read.csv("https://raw.githubusercontent.com/Chung-Brandon/607/refs/heads/main/Fake_untidy_data.csv")

head(fakedata)
##    id                      X                 X.1
## 1  NA                                           
## 2 202 consented on 5/14/2023 before intervention
## 3 202                                           
## 4 202                                           
## 5 202                                           
## 6 202                                           
##                                                             GAD.7.Questions
## 1                                                                          
## 2 how often have they been bothered by the following over the past 2 weeks?
## 3                                      Feeling nervous, anxious, or on edge
## 4                                Not being able to stop or control worrying
## 5                                  Worrying too much about different things
## 6                                                          Trouble relaxing
##   GAD.7 gad.response
## 1                   
## 2                   
## 3  gad1            2
## 4  gad2            0
## 5  gad3            2
## 6  gad4            3
##                                                                             PHQ.9.Questions
## 1                                                                                          
## 2 Over the last 2 weeks, how often have you been bothered by any of the following problems?
## 3                                               Little interest or pleasure in doing things
## 4                                                      Feeling down, depressed, or hopeless
## 5                                   Trouble falling or staying asleep, or sleeping too much
## 6                                                     Feeling tired or having little energy
##   PHQ.9 phq.response
## 1                   
## 2                  2
## 3  phq1            0
## 4  phq2            0
## 5  phq3            0
## 6  phq4            2

#Data tidying and cleaning

For this data set I will separate it into three different tables for each of the observational units. One for the patient id and consent date, one for the patient answers, and one for the questions of the two different screens.

library(dplyr)
#removing blank rows
fakedata <- fakedata[-c(1,22,23,44,45,66,67),]

#substituting point scale for string answers to screen questions

fakedata <- fakedata %>%
  mutate(phq.response = replace(phq.response, phq.response == "Not at all", 0),
         gad.response = replace(gad.response, gad.response == "Not at all", 0))

fakedata <- fakedata %>%
  mutate(phq.response = replace(phq.response, phq.response == "Several days", 1),
         gad.response = replace(gad.response, gad.response == "Several days", 1))


#creating table to hold patient consent dates

consent <- fakedata[,1:2]

consent <- consent %>%
  rename(consent_date = X)

consent <- consent %>%
  filter(consent_date != "")

consent <- consent %>%
  mutate(consent_date = str_extract(consent_date, "\\d{1,2}/\\d{1,2}/\\d{4}"))

#Creating tables to hold the questions of both screens

#Creating the gad question bank table
library(dplyr)
gadquestions <- fakedata[,c(4,5)]
gadquestions <- gadquestions[2:8,]

gadquestions <- gadquestions %>%
  rename(
       question = "GAD.7.Questions",
       question.number = "GAD.7")

gadquestions <- gadquestions %>%
  mutate(question.number = str_extract(question.number, "\\d+"))%>%
  mutate(question.number = as.integer(question.number))

gadquestions <- gadquestions %>%
  mutate(screen = "gad")%>%
  relocate(screen, .before = 1)
#Creating the PHQ 9 question bank table

phqquestions <- fakedata[,c(7,8)]
phqquestions <- phqquestions[2:10,]

phqquestions <- phqquestions %>%
  rename(
       question = "PHQ.9.Questions",
       question.number = "PHQ.9")

phqquestions <- phqquestions %>%
  mutate(question.number = str_extract(question.number, "\\d+"))%>%
  mutate(question.number = as.integer(question.number))

phqquestions <- phqquestions %>%
  mutate(screen = "phq")%>%
  relocate(screen, .before = 1)

#Binding screen questions to create table of screen questions

screenquestions <- bind_rows(phqquestions,gadquestions)

screenquestions <- screenquestions %>%
  relocate(question.number, .before = question)

#Creating patient response table

#Creating gad response table then phq responses to combine the tables into a screenresponse 

gadresponse <- fakedata[,c(1,3:6)]

gadresponse <- gadresponse %>%
  rename(
    intervention = "X.1",
    screen = "GAD.7.Questions",
    question.number = "GAD.7")

#removing unneccessary rows

gadresponse <- gadresponse %>%
  filter(gad.response != "")

#filling in blank intervention values  

gadresponse <- gadresponse %>%
  mutate(intervention = ifelse((row_number() - 1) %% 14 < 7, "before", intervention))

gadresponse <- gadresponse %>%
  mutate(intervention = ifelse((row_number() - 1) %% 14 >= 7 & row_number() >= 8, "after", intervention))

gadresponse <- gadresponse %>%
  mutate(question.number = str_extract(question.number, "\\d+"))

# filling in null question numbers and replacing screen column values with gad

gadresponse <- gadresponse %>%
  mutate(question.number = as.numeric(question.number)) %>%
  mutate(question.number = replace_na(question.number, 2)) %>%
  mutate(screen = "gad")
phqresponse <- fakedata[,c(1,3,7,8,9)]

phqresponse <- phqresponse %>%
  rename(
    intervention = "X.1",
    screen = "PHQ.9.Questions",
    question.number = "PHQ.9")

#removing unnecessary rows

phqresponse <- phqresponse %>%
  filter(!grepl("^Over the last 2 weeks", screen))

phqresponse <- phqresponse %>%
  filter(phq.response != "")


#filling in blank intervention values  

phqresponse <- phqresponse %>%
  mutate(intervention = ifelse((row_number() - 1) %% 18 < 9, "before", intervention))

phqresponse <- phqresponse %>%
  mutate(intervention = ifelse((row_number() - 10) %% 18 < 9 & row_number() >= 10, "after", intervention))

# extracting question numbers

phqresponse <- phqresponse %>%
  mutate(question.number = str_extract(question.number, "\\d+"))

# filling in null question numbers and replacing screen column values with gad

phqresponse <- phqresponse %>%
  mutate(question.number = as.numeric(question.number)) %>%
  mutate(question.number = replace_na(question.number, 2)) %>%
  mutate(screen = "phq")
#Combining the respose tables together to make a tidy response table
phqresponse <- phqresponse %>%
  rename(response = "phq.response")

gadresponse <- gadresponse %>%
  rename(response = "gad.response")

screenresponse <- bind_rows(gadresponse, phqresponse)

head(screenresponse)
##    id intervention screen question.number response
## 1 202       before    gad               1        2
## 2 202       before    gad               2        0
## 3 202       before    gad               3        2
## 4 202       before    gad               4        3
## 5 202       before    gad               5        2
## 6 202       before    gad               6        1
#Tidy Tables are the following
tidy.screenquestions <- screenquestions
tidy.consent <- consent
tidy.screenquestion <- screenquestions

In my final tidy format tables, I created a consent table which holds the patient id and consent date information, a screenquestions table to hold the questions corresponding to both the GAD and PHQ screen’s questions and question numbers, and a screenresponse table. The screenresponse table shows observations of patient response based on the patient id, before or after intervention, and which screen, and which question from the screen is being asked.

#Data Analysis

In the data analysis I will sum each patient response scores before and after intervention to see if the intervention was subjectively successful to the patient. This will also tell us which of the GAD or PHQ screens is shows a greater change.

screenresponse <- screenresponse %>%
  mutate(response = as.integer(response))


screenresponse %>%
  group_by(id,intervention)%>%
  summarise(response = sum(response))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 8 × 3
## # Groups:   id [4]
##      id intervention response
##   <int> <chr>           <int>
## 1   202 after              16
## 2   202 before             20
## 3   207 after              10
## 4   207 before             26
## 5   211 after              15
## 6   211 before             32
## 7   242 after              18
## 8   242 before             27

Conclusion

The cleaning of this data table was a challenge. There were many missing or values of “” in the data table, but also for normalization and tidying the creation of three separate tables was necessary.

After creating the tidy form data table, specifically screenresponse, I can see how the tidyform would make analysis easier via different groupings of variables. In an untidy table where the same response variable is in multiple columns, I’m not sure how analysis would be done, let alone adding, or editing the tables if I was in a database administrative role.

As for the data analysis we can see in the patient’s response that the intervention works. Across both the screens for all four patient the GAD and PHQ scores were reduced. With a sample size of four not much can be said in confidence, however patient 211 and 207 had a decrease of response scores of 17 and 16. These score reductions are larger than the other two patients and more research can should be done into these patients.