Data Prep

Load Libraries

# if you haven't used a given package before, you'll need to download it first
# delete the "#" before the install function and run it to download
# then run the library function calling that package

# install.packages("naniar")

library(naniar) # for the gg_miss_upset() command

Import Data

Import the full project data into a dataframe called “df”. Replace ‘DOWNLOADED FILE NAME’ with the actual file name of your dataset (either the ARC or the EAMMi2).

Note: If you saved the dataset in a folder with a different name, you will also need to replace ‘Data’ with the name of that folder.

# for the lab, you'll import your chosen project's full dataset CSV file you downloaded

# note: if you saved the project dataset in a folder with a different name, you will also need to replace 'Data' with the name of that folder

df <- read.csv(file="Data/arc_data_final_fall24.csv", header=TRUE)

Viewing Data

These commands are useful for viewing a data frame.

# you can also click the object (the little table picture) in the environment tab to view it in a new window

names(df)  # all the variable names in the data frame

##  [1] "X"                    "gender"               "trans"               
##  [4] "sexual_orientation"   "ethnicity"            "relationship_status" 
##  [7] "age"                  "urban_rural"          "income"              
## [10] "education"            "employment"           "treatment"           
## [13] "health"               "mhealth"              "sleep_hours"         
## [16] "exercise"             "pet"                  "covid_pos"           
## [19] "covid_neg"            "big5_open"            "big5_con"            
## [22] "big5_agr"             "big5_neu"             "big5_ext"            
## [25] "pswq"                 "iou"                  "mfq_26"              
## [28] "mfq_state"            "rse"                  "school_covid_support"
## [31] "school_att"           "pas_covid"            "pss"                 
## [34] "phq"                  "gad"                  "edeq12"              
## [37] "brs"                  "swemws"               "isolation_a"         
## [40] "isolation_c"          "support"

head(df)   # first 6 lines of data in the data frame

##    X gender trans    sexual_orientation                     ethnicity
## 1  1 female    no Heterosexual/Straight White - British, Irish, other
## 2 20   male    no Heterosexual/Straight White - British, Irish, other
## 3 30 female    no Heterosexual/Straight White - British, Irish, other
## 4 31 female    no Heterosexual/Straight White - British, Irish, other
## 5 32   <NA>  <NA>                  <NA>                          <NA>
## 6 33 female    no Heterosexual/Straight White - British, Irish, other
##                        relationship_status                 age urban_rural
## 1 In a relationship/married and cohabiting                <NA>        city
## 2                        Prefer not to say          1 under 18        city
## 3                        Prefer not to say          1 under 18        city
## 4 In a relationship/married and cohabiting 4 between 36 and 45        town
## 5                                     <NA>                <NA>        <NA>
## 6 In a relationship/married and cohabiting 4 between 36 and 45        city
##     income                              education               employment
## 1   3 high            6 graduate degree or higher               3 employed
## 2     <NA>                      prefer not to say 1 high school equivalent
## 3     <NA> 2 equivalent to high school completion 1 high school equivalent
## 4 2 middle                 5 undergraduate degree               3 employed
## 5     <NA>                                   <NA>                     <NA>
## 6 2 middle            6 graduate degree or higher               3 employed
##                    treatment                           health          mhealth
## 1 no psychological disorders something else or not applicable       none or NA
## 2               in treatment something else or not applicable anxiety disorder
## 3           not in treatment something else or not applicable       none or NA
## 4 no psychological disorders                   two conditions       none or NA
## 5                       <NA>                             <NA>       none or NA
## 6           not in treatment something else or not applicable       none or NA
##   sleep_hours exercise                   pet covid_pos covid_neg big5_open
## 1 3 7-8 hours      0.0                   cat         0         0  5.333333
## 2 2 5-6 hours      2.0                   cat         0         0  5.333333
## 3 3 7-8 hours      3.0                   dog         0         0  5.000000
## 4 2 5-6 hours      1.5               no pets         0         0  6.000000
## 5        <NA>       NA                  <NA>         0         0        NA
## 6 3 7-8 hours      1.0 multiple types of pet         0         0  5.000000
##   big5_con big5_agr big5_neu big5_ext     pswq      iou mfq_26 mfq_state rse
## 1 6.000000 4.333333 6.000000 2.000000 4.937500 3.185185   4.20     3.625 2.3
## 2 3.333333 4.333333 6.666667 1.666667 3.357143 4.000000   3.35     3.000 1.6
## 3 5.333333 6.666667 4.000000 6.000000 1.857143 1.592593   4.65     5.875 3.9
## 4 5.666667 4.666667 4.000000 5.000000 3.937500 3.370370   4.65     4.000 1.7
## 5       NA       NA       NA       NA       NA       NA     NA        NA  NA
## 6 6.000000 6.333333 2.666667       NA 2.625000 1.703704   4.50     4.625 3.9
##   school_covid_support school_att pas_covid  pss      phq      gad   edeq12 brs
## 1                   NA         NA  3.222222 3.25 1.333333 1.857143 1.583333  NA
## 2                   NA         NA  4.555556 3.75 3.333333 3.857143 1.833333  NA
## 3                   NA         NA  3.333333 1.00 1.000000 1.142857 1.000000  NA
## 4                   NA         NA  4.222222 3.25 2.333333 2.000000 1.666667  NA
## 5                   NA         NA        NA   NA       NA       NA       NA  NA
## 6                   NA         NA  3.222222 2.00 1.111111 1.428571 1.416667  NA
##     swemws isolation_a isolation_c  support
## 1 2.857143        2.25          NA 2.500000
## 2 2.285714          NA         3.5 2.166667
## 3 4.285714          NA         1.0 5.000000
## 4 3.285714        2.50          NA 2.500000
## 5       NA          NA          NA       NA
## 6 4.000000        1.75          NA 3.666667

str(df)    # shows all the variables in the data frame and their classification type (e.g., numeric, integer, character, etc.)

## 'data.frame':    2073 obs. of  41 variables:
##  $ X                   : int  1 20 30 31 32 33 48 49 57 58 ...
##  $ gender              : chr  "female" "male" "female" "female" ...
##  $ trans               : chr  "no" "no" "no" "no" ...
##  $ sexual_orientation  : chr  "Heterosexual/Straight" "Heterosexual/Straight" "Heterosexual/Straight" "Heterosexual/Straight" ...
##  $ ethnicity           : chr  "White - British, Irish, other" "White - British, Irish, other" "White - British, Irish, other" "White - British, Irish, other" ...
##  $ relationship_status : chr  "In a relationship/married and cohabiting" "Prefer not to say" "Prefer not to say" "In a relationship/married and cohabiting" ...
##  $ age                 : chr  NA "1 under 18" "1 under 18" "4 between 36 and 45" ...
##  $ urban_rural         : chr  "city" "city" "city" "town" ...
##  $ income              : chr  "3 high" NA NA "2 middle" ...
##  $ education           : chr  "6 graduate degree or higher" "prefer not to say" "2 equivalent to high school completion" "5 undergraduate degree" ...
##  $ employment          : chr  "3 employed" "1 high school equivalent" "1 high school equivalent" "3 employed" ...
##  $ treatment           : chr  "no psychological disorders" "in treatment" "not in treatment" "no psychological disorders" ...
##  $ health              : chr  "something else or not applicable" "something else or not applicable" "something else or not applicable" "two conditions" ...
##  $ mhealth             : chr  "none or NA" "anxiety disorder" "none or NA" "none or NA" ...
##  $ sleep_hours         : chr  "3 7-8 hours" "2 5-6 hours" "3 7-8 hours" "2 5-6 hours" ...
##  $ exercise            : num  0 2 3 1.5 NA 1 NA 2 2 1.7 ...
##  $ pet                 : chr  "cat" "cat" "dog" "no pets" ...
##  $ covid_pos           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ covid_neg           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ big5_open           : num  5.33 5.33 5 6 NA ...
##  $ big5_con            : num  6 3.33 5.33 5.67 NA ...
##  $ big5_agr            : num  4.33 4.33 6.67 4.67 NA ...
##  $ big5_neu            : num  6 6.67 4 4 NA ...
##  $ big5_ext            : num  2 1.67 6 5 NA ...
##  $ pswq                : num  4.94 3.36 1.86 3.94 NA ...
##  $ iou                 : num  3.19 4 1.59 3.37 NA ...
##  $ mfq_26              : num  4.2 3.35 4.65 4.65 NA 4.5 NA 4.3 5.25 4.45 ...
##  $ mfq_state           : num  3.62 3 5.88 4 NA ...
##  $ rse                 : num  2.3 1.6 3.9 1.7 NA 3.9 NA 2.4 1.8 NA ...
##  $ school_covid_support: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ school_att          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pas_covid           : num  3.22 4.56 3.33 4.22 NA ...
##  $ pss                 : num  3.25 3.75 1 3.25 NA 2 NA 2 4 1.25 ...
##  $ phq                 : num  1.33 3.33 1 2.33 NA ...
##  $ gad                 : num  1.86 3.86 1.14 2 NA ...
##  $ edeq12              : num  1.58 1.83 1 1.67 NA ...
##  $ brs                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ swemws              : num  2.86 2.29 4.29 3.29 NA ...
##  $ isolation_a         : num  2.25 NA NA 2.5 NA 1.75 NA 2 1.25 NA ...
##  $ isolation_c         : num  NA 3.5 1 NA NA NA NA NA NA 1 ...
##  $ support             : num  2.5 2.17 5 2.5 NA ...

Subsetting Data

Open your mini codebook and get the names of your variables (first column). Then enter this list of names within the “select=c()” argument to subset those columns from the dataframe “df” into a new one “d”.

Replace “variable1, variable2,…” with your variable names.

# Make sure to keep the ID variable (here, "X") first in the "select" argument

d <- subset(df, select=c(X, mhealth, age, big5_ext, pas_covid, mfq_state, brs))

# Your new data frame should contain 7 variables (the ID variable, + your 2 categorical, + your 4 continuous)
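The subsetting step can be sanity-checked with a few base R calls. The sketch below is only illustrative: `toy_df` and `toy_d` are hypothetical stand-ins with made-up values so the code runs on its own; with the real data you would run the same checks on `d`.

```r
# Hypothetical toy data with the same column names as the subset above,
# plus one extra column ("gender") that should be left out
toy_df <- data.frame(
  X         = 1:3,
  gender    = c("female", "male", NA),
  mhealth   = c("none or NA", "anxiety disorder", "none or NA"),
  age       = c("1 under 18", NA, "4 between 36 and 45"),
  big5_ext  = c(2.0, 1.7, 6.0),
  pas_covid = c(3.2, 4.6, 3.3),
  mfq_state = c(3.6, 3.0, 5.9),
  brs       = c(NA, 2.5, 3.0)
)

# Same subsetting pattern as in the homework
toy_d <- subset(toy_df, select = c(X, mhealth, age, big5_ext,
                                   pas_covid, mfq_state, brs))

n_vars    <- ncol(toy_d)          # number of variables kept
first_var <- names(toy_d)[1]      # should be the ID variable
var_types <- sapply(toy_d, class) # character = categorical, numeric = continuous

n_vars
first_var
var_types
```

If the number of variables is not 7, or the ID variable is not first, recheck the names inside `select=c()` against your mini codebook.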

Missing Data

# use the gg_miss_upset() command for a visualization of your missing data

gg_miss_upset(d[-1], nsets = 6)

# the [-1] tells the function to ignore the first column (i.e., variable). We do this because the first column is just the ID variable; there is no need to check it for missingness because everyone was assigned a random ID
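As a numeric complement to the plot, missingness can also be tabulated per variable with base R. This is a minimal sketch, not part of the assignment; `toy_d` is a hypothetical stand-in with made-up values so the code runs on its own.

```r
# Hypothetical toy data: an ID column plus two variables with some NAs
toy_d <- data.frame(
  X        = 1:5,
  mhealth  = c("none or NA", NA, "anxiety disorder", "none or NA", NA),
  big5_ext = c(2.0, 1.7, NA, 5.0, 3.3)
)

# Count and percentage of missing values per variable, skipping the ID column
miss_counts <- colSums(is.na(toy_d[-1]))
miss_pct    <- round(100 * colMeans(is.na(toy_d[-1])), 1)

# Number of participants with no missing values on any variable
n_complete <- sum(complete.cases(toy_d[-1]))

miss_counts
miss_pct
n_complete
```

Note that the literal string "none or NA" in `mhealth` is a real response category, not a missing value, so `is.na()` does not count it.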

# use the na.omit() command to create a new dataframe in which any participants with missing data are dropped from the dataframe
d2 <- na.omit(d)

# calculate the total number of participants dropped, then convert it to a percentage and insert both the number and the % in the text below.
# insert the total number of participants in your d2 in the text where it says N = #.

We looked at the missing data in our dataset, and found that 1757, or about 84.76%, of the participants in our sample skipped at least one item. We dropped these participants from our analysis, which is not advisable and runs the risk of dropping vulnerable groups or skewing results. However, we will proceed for the sake of this class using the reduced dataset, N = 316.
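The number and percentage of dropped participants can be computed from the row counts rather than typed in by hand. The sketch below assumes hypothetical toy objects (`toy_d`, `toy_d2`) so it runs on its own; with the real data, the same arithmetic applies to `d` and `d2`.

```r
# Hypothetical toy data standing in for the subsetted data frame
toy_d <- data.frame(
  X        = 1:5,
  mhealth  = c("none or NA", NA, "anxiety disorder", "none or NA", NA),
  big5_ext = c(2.0, 1.7, NA, 5.0, 3.3)
)

# Drop every participant with at least one missing value
toy_d2 <- na.omit(toy_d)

n_total     <- nrow(toy_d)                          # before dropping
n_kept      <- nrow(toy_d2)                         # after dropping (the N to report)
n_dropped   <- n_total - n_kept                     # participants removed
pct_dropped <- round(100 * n_dropped / n_total, 2)  # percentage removed

n_dropped
pct_dropped
n_kept
```

With the values reported above, the same calculation is 2073 - 316 = 1757 dropped, and 1757 / 2073 is about 84.76%.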

2073 - 316    # number of participants dropped: 1757
1757 / 2073   # proportion dropped: about 0.8476

Exporting Cleaned Data

Our last step is to export the data frame after we’ve dropped NAs so that it can be used in future HWs.

# use the "write.csv" function to export the cleaned data
# please keep the file name as 'projectdata'
# note: you only need to change 'Data' before the slash if you named your folder something else

write.csv(d2, file="Data/projectdata.csv", row.names = FALSE)
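One way to confirm that an export worked is to read the file back in and compare it to the data frame that was written. This is a minimal sketch under stated assumptions: `toy_d2` is hypothetical made-up data, and the file is written to a temporary path from `tempfile()` rather than to 'Data/projectdata.csv'.

```r
# Hypothetical cleaned data frame (no missing values)
toy_d2 <- data.frame(
  X        = c(1L, 4L),
  mhealth  = c("none or NA", "anxiety disorder"),
  big5_ext = c(2.0, 5.0)
)

# Write to a temporary file; row.names = FALSE keeps R from adding
# the row names as an extra first column in the CSV
tmp_path <- tempfile(fileext = ".csv")
write.csv(toy_d2, file = tmp_path, row.names = FALSE)

# Read the file back in and compare it to the original
check <- read.csv(file = tmp_path, header = TRUE)

same_names <- identical(names(check), names(toy_d2))
same_dims  <- identical(dim(check), dim(toy_d2))

same_names
same_dims
```

If either check is FALSE for your own export, open the CSV and look for an unexpected extra column or missing rows.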

P421 Data Prep HW

Matthew Howell

2024-09-17