##Dataset 1
#Introduction
This data set contains the data behind the story “Marriage Isn’t Dead — Yet” from FiveThirtyEight and presents the proportions of people within certain social, economic, educational, racial, geographical, and employment groups that have never been married, over the years 1960 to 2012. I will tidy and clean the data set to bring to light the factors that are most strongly correlated with the decision to not get married.
#Reading in the csv file from github's raw data url
rawdata <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/marriage/both_sexes.csv")
head(rawdata)
## X year date all_2534 HS_2534 SC_2534 BAp_2534 BAo_2534 GD_2534
## 1 1 1960 1960-01-01 0.1233145 0.1095332 0.1522818 0.2389952 0.2389952 NA
## 2 2 1970 1970-01-01 0.1269715 0.1094000 0.1495096 0.2187031 0.2187031 NA
## 3 3 1980 1980-01-01 0.1991767 0.1617313 0.2236916 0.2881646 0.2881646 NA
## 4 4 1990 1990-01-01 0.2968306 0.2777491 0.2780912 0.3612968 0.3656655 0.3474505
## 5 5 2000 2000-01-01 0.3450087 0.3316545 0.3249205 0.3874906 0.3939579 0.3691740
## 6 6 2001 2001-01-01 0.3527767 0.3446069 0.3341101 0.3835686 0.3925148 0.3590304
## White_2534 Black_2534 Hisp_2534 NE_2534 MA_2534 Midwest_2534 South_2534
## 1 0.1164848 0.1621855 0.1393736 0.1504184 0.1628934 0.1121467 0.1090562
## 2 0.1179043 0.1855163 0.1298769 0.1517231 0.1640680 0.1153741 0.1126220
## 3 0.1824126 0.3137500 0.1885440 0.2414327 0.2505925 0.1828339 0.1688435
## 4 0.2639256 0.4838556 0.2962372 0.3500384 0.3623321 0.2755046 0.2639794
## 5 0.3127149 0.5144994 0.3180681 0.4091852 0.4175565 0.3308022 0.3099712
## 6 0.3183506 0.5437985 0.3321214 0.4200581 0.4294281 0.3344332 0.3182688
## Mountain_2534 Pacific_2534 poor_2534 mid_2534 rich_2534 all_3544
## 1 0.09152117 0.1198758 0.1371597 0.07514929 0.2066776 0.07058157
## 2 0.10293602 0.1374964 0.1717202 0.08159207 0.1724093 0.06732520
## 3 0.17434230 0.2334279 0.3100591 0.14825303 0.1851082 0.06883378
## 4 0.25264326 0.3319579 0.4199108 0.24320008 0.2783226 0.11191800
## 5 0.30621032 0.3753061 0.5033676 0.30202036 0.2717386 0.15605881
## 6 0.30980779 0.3844799 0.5178771 0.31716118 0.2532041 0.15642529
## HS_3544 SC_3544 BAp_3544 BAo_3544 GD_3544 White_3544 Black_3544
## 1 0.06860309 0.06663695 0.1326265 0.1326265 NA 0.06825586 0.08836728
## 2 0.06511964 0.06271724 0.1116899 0.1116899 NA 0.06250372 0.10290904
## 3 0.06429102 0.06531333 0.1056102 0.1056102 NA 0.05966739 0.13140081
## 4 0.11210043 0.09699372 0.1285172 0.1258567 0.1328018 0.09611312 0.22010298
## 5 0.16993703 0.13800404 0.1541238 0.1536299 0.1550970 0.13207032 0.30239381
## 6 0.16870156 0.13986044 0.1548151 0.1524923 0.1595169 0.13287455 0.30857796
## Hisp_3544 NE_3544 MA_3544 Midwest_3544 South_3544 Mountain_3544
## 1 0.07307651 0.09194322 0.09347468 0.06863360 0.06026353 0.04739747
## 2 0.07070500 0.08570110 0.09040725 0.06156272 0.05966057 0.04651163
## 3 0.08110790 0.07997323 0.09744428 0.06070641 0.05914089 0.04880077
## 4 0.12194206 0.12785915 0.14354989 0.10157576 0.09637035 0.09189904
## 5 0.15469520 0.17327422 0.18819256 0.14539201 0.14230600 0.13584194
## 6 0.14953050 0.16653497 0.18315109 0.14794407 0.14312592 0.13943820
## Pacific_3544 poor_3544 mid_3544 rich_3544 all_4554 HS_4554 SC_4554
## 1 0.05822486 0.1019749 0.04717272 0.08553870 0.07254649 0.06840792 0.07903755
## 2 0.06347796 0.1117548 0.04566838 0.06499159 0.05968794 0.05833439 0.05443478
## 3 0.07552538 0.1291426 0.05050321 0.04445951 0.05250871 0.05036563 0.04816180
## 4 0.13134638 0.2012208 0.09024739 0.06573916 0.05947824 0.05988244 0.04654087
## 5 0.17480047 0.2813137 0.12815751 0.08622046 0.08804394 0.09442809 0.07558786
## 6 0.17694864 0.2919112 0.13267625 0.06803283 0.08823342 0.09189007 0.07795481
## BAp_4554 BAo_4554 GD_4554 White_4554 Black_4554 Hisp_4554 NE_4554
## 1 0.15360889 0.15360889 NA 0.07246692 0.06913249 0.06636058 0.10236412
## 2 0.10466047 0.10466047 NA 0.05754799 0.07899168 0.05810740 0.08028082
## 3 0.08623774 0.08623774 NA 0.04765354 0.08624602 0.06522951 0.06930253
## 4 0.07301884 0.06416529 0.08394886 0.05092552 0.11617699 0.07613556 0.07047502
## 5 0.09208417 0.09097472 0.09362802 0.07578174 0.17587334 0.09418009 0.10232170
## 6 0.09333365 0.09313480 0.09362876 0.07516912 0.18154531 0.09409896 0.09868408
## MA_4554 Midwest_4554 South_4554 Mountain_4554 Pacific_4554 poor_4554
## 1 0.09264788 0.07285321 0.05977295 0.04754183 0.05996993 0.1030055
## 2 0.07860635 0.05791163 0.05174462 0.03970134 0.04826312 0.1016489
## 3 0.07508466 0.04807290 0.04485348 0.03374438 0.04958992 0.1003011
## 4 0.08373134 0.05398391 0.05043636 0.04459411 0.06461875 0.1148335
## 5 0.11269659 0.08302437 0.07631858 0.07637774 0.09896832 0.1718976
## 6 0.10953635 0.08207629 0.07886513 0.07405971 0.10119511 0.1759369
## mid_4554 rich_4554 nokids_all_2534 kids_all_2534 nokids_HS_2534
## 1 0.05364421 0.07908591 0.4640564 0.002820625 0.4430148
## 2 0.04221637 0.05142867 0.4309043 0.009868596 0.4246779
## 3 0.03830266 0.03311296 0.4464304 0.025285667 0.4319342
## 4 0.04562332 0.03136386 0.5425242 0.060277451 0.5464881
## 5 0.07055672 0.03897342 0.5714531 0.099472713 0.5711395
## 6 0.07407508 0.02857320 0.5852213 0.110178467 0.6045475
## nokids_SC_2534 nokids_BAp_2534 nokids_BAo_2534 nokids_GD_2534 kids_HS_2534
## 1 0.5000402 0.5619099 0.5619099 NA 0.003318886
## 2 0.4333479 0.4554766 0.4554766 NA 0.012465915
## 3 0.4505900 0.4719700 0.4719700 NA 0.031930752
## 4 0.5238446 0.5560765 0.5633301 0.5332628 0.078470444
## 5 0.5700042 0.5729677 0.5862213 0.5367160 0.127193577
## 6 0.5810912 0.5698644 0.5864967 0.5258800 0.141395652
## kids_SC_2534 kids_BAp_2534 kids_BAo_2534 kids_GD_2534 nokids_poor_2534
## 1 0.001150824 0.0005751073 0.0005751073 NA 0.4933061
## 2 0.003699982 0.0014683425 0.0014683425 NA 0.5097742
## 3 0.018135401 0.0062544364 0.0062544364 NA 0.5740402
## 4 0.052032702 0.0171241042 0.0181766027 0.01374234 0.6546908
## 5 0.097625310 0.0370024452 0.0401009875 0.02761467 0.7055451
## 6 0.110030662 0.0399801447 0.0445838012 0.02645041 0.7147334
## nokids_mid_2534 nokids_rich_2534 kids_poor_2534 kids_mid_2534 kids_rich_2534
## 1 0.4100080 0.4921184 0.008722711 0.0007532065 0.0008027331
## 2 0.3764538 0.4288948 0.029974945 0.0033771145 0.0030435661
## 3 0.3998250 0.3848089 0.077926214 0.0102368871 0.0068317224
## 4 0.5186604 0.4750156 0.170763774 0.0274655254 0.0182329127
## 5 0.5690228 0.4458023 0.256281918 0.0597845173 0.0295644698
## 6 0.5864741 0.4461111 0.280146488 0.0677954572 0.0336540502
#loading packages
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Data Cleaning
The data is relatively clean, although there are a few null values and two unnecessary columns. Removing the top four rows will allow me to make year over year comparisons of the proportions of people never married and addresses the null values. The first four rows contain data for years 1960, 1970, 1980, and 1990, then the remaining rows contain individual years of survey data from 2000 to 2012.
#Removing unnecessary columns X and Date (we are given the year in column 2). Removing rows 1 to 4.
rawdata <- rawdata[-c(1:4),-c(1,3)]
#Data Tidying
The first thing that I notice when looking to tidy the dataset here is that the column headers are values not variable names, and that the column headers contain multiple variables (an attribute of the observational unit and an age range).
Assuming that the column headers without the kids or nokids denomination show proportions of marriage taken from a combination of people with kids and no kids, I will create a variable for kids in our tidy data table.
#Analysis planning
For analysis I will make comparisons between the change in never married proportions for the different attributes and discover if any attributes are markedly above the change in proportion for all attributes, then compare if having kids at home makes a significant impact on these proportions.
#Data Tidying
#Seperating the table into attributes and kids tables to standardize the columns then bind back together
attributes <- rawdata %>%
select(-matches("^kids"), -matches("^nokids"))
attributes <- attributes %>%
pivot_longer(
cols = 2:55,
names_to = "attribute",
values_to = "p_never.married",
values_drop_na = TRUE
)
#Splitting the out variables
attributes <- attributes %>%
separate(attribute, into = c("attribute", "age.range"),sep = "_") %>%
mutate(age.range = sub("(\\d{2})(\\d{2})", "\\1-\\2", age.range))
attributes <- attributes %>%
mutate(kids = "combined") %>%
relocate(kids, .after = year)
kids <- rawdata %>%
select(matches("^(year|kids|nokids).*"))
#Pivoting longer the column headers
kids <- kids %>%
pivot_longer(
cols = 2:19,
names_to = "attribute",
values_to = "p_never.married",
values_drop_na = TRUE
)
#Splitting the column kids into kids, attribute, and age.range
kids <- kids %>%
separate(attribute, into = c("kids","attribute", "age.range"),sep = "_") %>%
mutate(age.range = sub("(\\d{2})(\\d{2})", "\\1-\\2", age.range))
#Creating the final tidy table
#In this tidied final table I have the observation of proportion never married per row, and each observation is described by the variables of year survey was taken, status of kids, attributes specified by the American Community Survey, and the age range of the respondent group.
tidy.data <- bind_rows(attributes,kids)
head(tidy.data)
## # A tibble: 6 × 5
## year kids attribute age.range p_never.married
## <int> <chr> <chr> <chr> <dbl>
## 1 2000 combined all 25-34 0.345
## 2 2000 combined HS 25-34 0.332
## 3 2000 combined SC 25-34 0.325
## 4 2000 combined BAp 25-34 0.387
## 5 2000 combined BAo 25-34 0.394
## 6 2000 combined GD 25-34 0.369
#Data Analysis
To find which attributes are most correlated with an increasing never married proportion I will first explore the different attributes and the effect of kids.
library(ggplot2)
ggplot(tidy.data, aes(x = ifelse(kids == "combined", year, NA), y = p_never.married, color = age.range)) +
geom_point() +
facet_wrap(vars(attribute)) +
theme(axis.text.x = element_text(angle = 45))
## Warning: Removed 234 rows containing missing values or values outside the scale range
## (`geom_point()`).
For attributes that have the greatest never married proportion, Black, HS, MA, NE, Pacific, and poor stand out in the eye test.
kids <- tidy.data %>% filter(kids == "kids")
ggplot(kids, aes(x = year, y = p_never.married, color = age.range)) +
geom_point() +
facet_wrap(vars(attribute)) +
theme(axis.text.x = element_text(angle = 45))
nokids <- tidy.data %>% filter(kids == "nokids")
ggplot(nokids, aes(x = year, y = p_never.married, color = age.range)) +
geom_point() +
facet_wrap(vars(attribute)) +
theme(axis.text.x = element_text(angle = 45))
td.analysis <- tidy.data %>%
filter(year %in% c(2000, 2012)) %>%
group_by(attribute, age.range, kids) %>%
reframe(difference = p_never.married[year == 2012] - p_never.married[year == 2000])
td.analysis %>%
arrange(-difference)
## # A tibble: 72 × 4
## attribute age.range kids difference
## <chr> <chr> <chr> <dbl>
## 1 HS 25-34 combined 0.192
## 2 HS 25-34 nokids 0.174
## 3 rich 25-34 nokids 0.172
## 4 Black 25-34 combined 0.170
## 5 Hisp 25-34 combined 0.169
## 6 poor 25-34 kids 0.159
## 7 MA 25-34 combined 0.158
## 8 NE 25-34 combined 0.155
## 9 SC 25-34 combined 0.155
## 10 South 25-34 combined 0.155
## # ℹ 62 more rows
#Finding base rates for the increase of never married proportions from 2000 to 2012
td.analysisall <- td.analysis %>%
filter(attribute == "all", kids == "combined") %>%
group_by(attribute, age.range, kids)
td.analysisall %>%
arrange(-difference)
## # A tibble: 3 × 4
## # Groups: attribute, age.range, kids [3]
## attribute age.range kids difference
## <chr> <chr> <chr> <dbl>
## 1 all 25-34 combined 0.149
## 2 all 35-44 combined 0.0600
## 3 all 45-54 combined 0.0547
My final tidied data table is tidy.data prior to analysis
#Conclusion
Tidying this data table was challenging, I explored the idea of breaking out attribute = all, and column headers starting with kids or no kids, into their own tables, as they are calculated from the same respondent groups as the proportions for column headers with education, race, location or income values, and breaking out each attribute (education, race, location, income) into their own tables as well. My original thoughts on this was that if I were to maintain these data tables, normalizing by breaking out the attributes into tables would be a better organization for additions, edits deletes, etc., then calculating out the kids/no kids and all attributes proportions from the normalized tables.
I ended up leaving leaving the final tidied table as one because I was able to format the table to have one observation per row, and the columns be an attribute of the observation (with the creation of the kids column).
As for the data analysis never married proportions are increasing fastest first for the age range of 25-35, the 22 groups in this dataset with the highest never married proportion increases are from the 25-35 age range. We can also see that people with highschool or below education levels, but also rich with no kids are the groups showing the greatest increases.
##Dataset 2
#Introduction
My second data set comes from the results of the International Mathematics Olympiad. It provides each country’s teams results in a wide and untidy format. I will attempt to analyze if there is any correlation between the number of females on a countries team and the team’s points scored.
library(stringr)
library(dplyr)
library(tidyr)
#reading in the csv
imoresults <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/refs/heads/master/data/2024/2024-09-24/country_results_df.csv")
head(imoresults)
## year country team_size_all team_size_male team_size_female
## 1 2024 United States of America 6 5 1
## 2 2024 People's Republic of China 6 6 0
## 3 2024 Republic of Korea 6 6 0
## 4 2024 India 6 6 0
## 5 2024 Belarus 6 6 0
## 6 2024 Singapore 6 6 0
## p1 p2 p3 p4 p5 p6 p7 awards_gold awards_silver awards_bronze
## 1 42 41 19 40 35 15 NA 5 1 0
## 2 42 42 31 40 22 13 NA 5 1 0
## 3 42 37 18 42 7 22 NA 2 4 0
## 4 42 34 11 42 28 10 NA 4 1 0
## 5 42 30 10 42 36 5 NA 4 0 2
## 6 42 37 7 42 29 5 NA 1 5 0
## awards_honorable_mentions leader deputy_leader
## 1 0 John Berman Carl Schildkraut
## 2 0 Liang Xiao Yijun Yao
## 3 0 Suyoung Choi Hwajong Yoo
## 4 1 Krishnan Sivasubramanian Rijul Saini
## 5 0 David Zmiaikou Dzmitry Bazyleu
## 6 0 Yong Sheng Soh Teck Kian Teo
#Data cleaning
Removing unnecessary columns for analysis: column p7 has no values, and leader and deputy_leader are not of interest.
#removing columns
imoresults <- imoresults[,-c(12:18)]
#removing rows with null points
imoresults <- imoresults %>%
filter(!is.na(p1))
#removing rows with null male and female counts
imoresults <- imoresults %>%
filter(!(is.na(team_size_male) & is.na(team_size_female)))
#mutating rows to show male and female counts
imoresults <- imoresults %>%
mutate(
team_size_male = ifelse(is.na(team_size_male), team_size_all - team_size_female, team_size_male),
team_size_female = ifelse(is.na(team_size_female), team_size_all - team_size_male, team_size_female)
)
head(imoresults)
## year country team_size_all team_size_male team_size_female
## 1 2024 United States of America 6 5 1
## 2 2024 People's Republic of China 6 6 0
## 3 2024 Republic of Korea 6 6 0
## 4 2024 India 6 6 0
## 5 2024 Belarus 6 6 0
## 6 2024 Singapore 6 6 0
## p1 p2 p3 p4 p5 p6
## 1 42 41 19 40 35 15
## 2 42 42 31 40 22 13
## 3 42 37 18 42 7 22
## 4 42 34 11 42 28 10
## 5 42 30 10 42 36 5
## 6 42 37 7 42 29 5
#Data Tidying
There are two observational units in our remaining table. I will split this table into imoteams and imopoints in tidy format before beginning analysis.
#Creating and tidying the teams table
tidy.imoteams <- imoresults[,c(1,2,4,5)]
tidy.imoteams <- tidy.imoteams %>%
pivot_longer(
cols = 3:4,
names_to = "column",
values_to = "count"
)
tidy.imoteams <- tidy.imoteams %>%
mutate(sex = str_extract(column, "male$|female$"))
tidy.imoteams <- tidy.imoteams[,-3] %>%
relocate(sex, .after = country)
head(tidy.imoteams)
## # A tibble: 6 × 4
## year country sex count
## <int> <chr> <chr> <int>
## 1 2024 United States of America male 5
## 2 2024 United States of America female 1
## 3 2024 People's Republic of China male 6
## 4 2024 People's Republic of China female 0
## 5 2024 Republic of Korea male 6
## 6 2024 Republic of Korea female 0
The tidy version of the teams table contains the observation of count with the variables that describe the count. FOr example the first row is 2024 USA male count and the second row is 2024 USA female count.
#Creating and tidying the points table
tidy.imopoints <- imoresults[,c(1,2,6:11)] %>%
pivot_longer(
cols = 3:8,
names_to = "round",
values_to = "points")
#Removing the "p" from the values in the round column
tidy.imopoints <- tidy.imopoints %>%
mutate(round = gsub("[^0-9]", "", round))
head(tidy.imopoints)
## # A tibble: 6 × 4
## year country round points
## <int> <chr> <chr> <int>
## 1 2024 United States of America 1 42
## 2 2024 United States of America 2 41
## 3 2024 United States of America 3 19
## 4 2024 United States of America 4 40
## 5 2024 United States of America 5 35
## 6 2024 United States of America 6 15
The tidy version of the points table contains the observation of points scored with the variables that describe the points. For example the first row is 2024 USA round 1 points scored.
#Data analysis
For my analysis I want to show the average number of points for each country when there are no females on the team versus when there are females on the team.
To do this I will first sum the points across all rounds in a particular year for each country, then find the years that each country has females on the team and show the average scores of years without females and years with females.
#Adding the points from each round to get the points scored in total from that year and country's team
points_year <- tidy.imopoints %>%
group_by(year,country)%>%
summarise(points = sum(points))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
#Filtering to be able to group for analysis later
female.imoteams <- tidy.imoteams %>%
filter(sex == "female")
#Joining the points_year table and female.imoteams table
analysis <- inner_join(points_year,female.imoteams, by = c("year","country"))
#Creating tables for
m.analysis <- analysis %>%
filter(count == 0) %>%
group_by(country) %>%
summarize(avgpoints_maleonly = mean(points))
f.analysis <- analysis %>%
filter(count != 0) %>%
group_by(country) %>%
summarize(avgpoints_female = mean(points))
avganalysis <- inner_join(m.analysis,f.analysis,by ="country")
avganalysis %>%
summarise(count = sum(avgpoints_female > avgpoints_maleonly))
## # A tibble: 1 × 1
## count
## <int>
## 1 46
My final tidied data tables for data set 2 prior to analysis is tidy.imopoints and tidy.imoteams
The original data set had many null values and encompassed two observational units which needed to be broken out into two separate tables. Once I addressed the null values (ensure that male and female counts added to the total reported, and removed rows where male and female were both null or had null point values), I separated the table into two by indexing the original table.
I then pivoted longer the column headers where they contained values and not variable names. This put the two data tables in tidy form. Using this tidy form I was able to do an average point analysis for years of countries with and without females. For other analysis, for example: how well has the US done over the years, or which round is typically scored the lowest across all countries, the tidy format enables these to be done easily. From the original csv format the same analysis could be done, but in a much more complex way - any ad hoc analysis would be more difficult than it could be with tidy data.
For the analysis, out of 117 countries, 46 of them had better average scores from the total year’s performance with females on the team. There is not enough data to say if females on the team had a significant impact on the team’s average scores.
##Dataset 3
#Introduction
My third data set was synthetically created by a classmate. It contains 4 respondents answers to two different screening tools the GAD7 and PHQ9 pre and post intervention. For my analysis I want to compare the pre and post intervention scores to try to determine if the intervention was successful.
#Importing data
#loading libraries
library(tidyr)
library(dplyr)
fakedata <- read.csv("https://raw.githubusercontent.com/Chung-Brandon/607/refs/heads/main/Fake_untidy_data.csv")
head(fakedata)
## id X X.1
## 1 NA
## 2 202 consented on 5/14/2023 before intervention
## 3 202
## 4 202
## 5 202
## 6 202
## GAD.7.Questions
## 1
## 2 how often have they been bothered by the following over the past 2 weeks?
## 3 Feeling nervous, anxious, or on edge
## 4 Not being able to stop or control worrying
## 5 Worrying too much about different things
## 6 Trouble relaxing
## GAD.7 gad.response
## 1
## 2
## 3 gad1 2
## 4 gad2 0
## 5 gad3 2
## 6 gad4 3
## PHQ.9.Questions
## 1
## 2 Over the last 2 weeks, how often have you been bothered by any of the following problems?
## 3 Little interest or pleasure in doing things
## 4 Feeling down, depressed, or hopeless
## 5 Trouble falling or staying asleep, or sleeping too much
## 6 Feeling tired or having little energy
## PHQ.9 phq.response
## 1
## 2 2
## 3 phq1 0
## 4 phq2 0
## 5 phq3 0
## 6 phq4 2
#Data tidying and cleaning
For this data set I will separate it into three different tables for each of the observational units. One for the patient id and consent date, one for the patient answers, and one for the questions of the two different screens.
library(dplyr)
#removing blank rows
fakedata <- fakedata[-c(1,22,23,44,45,66,67),]
#substituting point scale for string answers to screen questions
fakedata <- fakedata %>%
mutate(phq.response = replace(phq.response, phq.response == "Not at all", 0),
gad.response = replace(gad.response, gad.response == "Not at all", 0))
fakedata <- fakedata %>%
mutate(phq.response = replace(phq.response, phq.response == "Several days", 1),
gad.response = replace(gad.response, gad.response == "Several days", 1))
#creating table to hold patient consent dates
consent <- fakedata[,1:2]
consent <- consent %>%
rename(consent_date = X)
consent <- consent %>%
filter(consent_date != "")
consent <- consent %>%
mutate(consent_date = str_extract(consent_date, "\\d{1,2}/\\d{1,2}/\\d{4}"))
#Creating tables to hold the questions of both screens
#Creating the gad question bank table
library(dplyr)
gadquestions <- fakedata[,c(4,5)]
gadquestions <- gadquestions[2:8,]
gadquestions <- gadquestions %>%
rename(
question = "GAD.7.Questions",
question.number = "GAD.7")
gadquestions <- gadquestions %>%
mutate(question.number = str_extract(question.number, "\\d+"))%>%
mutate(question.number = as.integer(question.number))
gadquestions <- gadquestions %>%
mutate(screen = "gad")%>%
relocate(screen, .before = 1)
#Creating the PHQ 9 question bank table
phqquestions <- fakedata[,c(7,8)]
phqquestions <- phqquestions[2:10,]
phqquestions <- phqquestions %>%
rename(
question = "PHQ.9.Questions",
question.number = "PHQ.9")
phqquestions <- phqquestions %>%
mutate(question.number = str_extract(question.number, "\\d+"))%>%
mutate(question.number = as.integer(question.number))
phqquestions <- phqquestions %>%
mutate(screen = "phq")%>%
relocate(screen, .before = 1)
#Binding screen questions to create table of screen questions
screenquestions <- bind_rows(phqquestions,gadquestions)
screenquestions <- screenquestions %>%
relocate(question.number, .before = question)
#Creating patient response table
#Creating gad response table then phq responses to combine the tables into a screenresponse
gadresponse <- fakedata[,c(1,3:6)]
gadresponse <- gadresponse %>%
rename(
intervention = "X.1",
screen = "GAD.7.Questions",
question.number = "GAD.7")
#removing unneccessary rows
gadresponse <- gadresponse %>%
filter(gad.response != "")
#filling in blank intervention values
gadresponse <- gadresponse %>%
mutate(intervention = ifelse((row_number() - 1) %% 14 < 7, "before", intervention))
gadresponse <- gadresponse %>%
mutate(intervention = ifelse((row_number() - 1) %% 14 >= 7 & row_number() >= 8, "after", intervention))
gadresponse <- gadresponse %>%
mutate(question.number = str_extract(question.number, "\\d+"))
# filling in null question numbers and replacing screen column values with gad
gadresponse <- gadresponse %>%
mutate(question.number = as.numeric(question.number)) %>%
mutate(question.number = replace_na(question.number, 2)) %>%
mutate(screen = "gad")
phqresponse <- fakedata[,c(1,3,7,8,9)]
phqresponse <- phqresponse %>%
rename(
intervention = "X.1",
screen = "PHQ.9.Questions",
question.number = "PHQ.9")
#removing unnecessary rows
phqresponse <- phqresponse %>%
filter(!grepl("^Over the last 2 weeks", screen))
phqresponse <- phqresponse %>%
filter(phq.response != "")
#filling in blank intervention values
phqresponse <- phqresponse %>%
mutate(intervention = ifelse((row_number() - 1) %% 18 < 9, "before", intervention))
phqresponse <- phqresponse %>%
mutate(intervention = ifelse((row_number() - 10) %% 18 < 9 & row_number() >= 10, "after", intervention))
# extracting question numbers
phqresponse <- phqresponse %>%
mutate(question.number = str_extract(question.number, "\\d+"))
# filling in null question numbers and replacing screen column values with gad
phqresponse <- phqresponse %>%
mutate(question.number = as.numeric(question.number)) %>%
mutate(question.number = replace_na(question.number, 2)) %>%
mutate(screen = "phq")
#Combining the respose tables together to make a tidy response table
phqresponse <- phqresponse %>%
rename(response = "phq.response")
gadresponse <- gadresponse %>%
rename(response = "gad.response")
screenresponse <- bind_rows(gadresponse, phqresponse)
head(screenresponse)
## id intervention screen question.number response
## 1 202 before gad 1 2
## 2 202 before gad 2 0
## 3 202 before gad 3 2
## 4 202 before gad 4 3
## 5 202 before gad 5 2
## 6 202 before gad 6 1
#Tidy Tables are the following
tidy.screenquestions <- screenquestions
tidy.consent <- consent
tidy.screenquestion <- screenquestions
In my final tidy format tables, I created a consent table which holds the patient id and consent date information, a screenquestions table to hold the questions corresponding to both the GAD and PHQ screen’s questions and question numbers, and a screenresponse table. The screenresponse table shows observations of patient response based on the patient id, before or after intervention, and which screen, and which question from the screen is being asked.
#Data Analysis
In the data analysis I will sum each patient response scores before and after intervention to see if the intervention was subjectively successful to the patient. This will also tell us which of the GAD or PHQ screens is shows a greater change.
screenresponse <- screenresponse %>%
mutate(response = as.integer(response))
screenresponse %>%
group_by(id,intervention)%>%
summarise(response = sum(response))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 8 × 3
## # Groups: id [4]
## id intervention response
## <int> <chr> <int>
## 1 202 after 16
## 2 202 before 20
## 3 207 after 10
## 4 207 before 26
## 5 211 after 15
## 6 211 before 32
## 7 242 after 18
## 8 242 before 27
The cleaning of this data table was a challenge. There were many missing or values of “” in the data table, but also for normalization and tidying the creation of three separate tables was necessary.
After creating the tidy form data table, specifically screenresponse, I can see how the tidyform would make analysis easier via different groupings of variables. In an untidy table where the same response variable is in multiple columns, I’m not sure how analysis would be done, let alone adding, or editing the tables if I was in a database administrative role.
As for the data analysis we can see in the patient’s response that the intervention works. Across both the screens for all four patient the GAD and PHQ scores were reduced. With a sample size of four not much can be said in confidence, however patient 211 and 207 had a decrease of response scores of 17 and 16. These score reductions are larger than the other two patients and more research can should be done into these patients.