Homework: Get means and counts for variables in a different data set. What questions do you all have?

Generate data quickly

ordVar = c(1,2,3,4,5)
binVar = c(0,1)
set.seed(123)
datWeekTwo = data.frame(outcome1 = rnorm(100), outcome2 = rnorm(100), satisfaction = sample(ordVar, 100, replace = TRUE), gender = sample(binVar, 100, replace = TRUE))
datWeekTwo[1:10,1] = NA
datWeekTwo[11:15,2] = -99
datWeekThree = datWeekTwo

Quick review.

Load data into R by first setting the working directory and then calling the read.csv function. Set header = TRUE so the first row of the file is used as the variable names, and if you use na.strings, list every missing-value indicator that appears in the file, including NA itself. Quick new feature: we can get rid of missing data with the na.omit function, which deletes any row with at least one missing value. Because of that, if you are working with a large data set, you may want to first subset only the variables you need for a particular analysis and then use na.omit on that smaller data set.
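
To see why subsetting before na.omit matters, here is a minimal sketch on a small made-up data frame. The names toy and toyAB and all the values are hypothetical and are not part of the course data.

```r
# Hypothetical toy data: column c is mostly missing and not needed for the analysis
toy = data.frame(a = c(1, 2, NA, 4), b = c(5, 6, 7, 8), c = c(NA, NA, NA, 9))
# na.omit on everything keeps only rows complete on all three columns
nrow(na.omit(toy))
# Keep just the columns we need, then drop incomplete rows
toyAB = toy[, c("a", "b")]
nrow(na.omit(toyAB))
```

Here the full data set keeps 1 row, while the two-column subset keeps 3, so fewer cases are thrown away.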

If we want to get means and sds, then we can load the prettyR package with library and use describe on the data set. If we want to get counts and percentages for categorical variables, then we can use the describe.factor function, which is also in prettyR. We also load the descr package, which has the compmeans function we use later.

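setwd("~/Desktop")
write.csv(datWeekThree, "datWeekThree.csv", row.names = FALSE)
# na.strings takes character strings; both "NA" and "-99" are read as missing
datWeekThree = read.csv("datWeekThree.csv", header = TRUE, na.strings = c("NA", "-99"))
# Drop every row with at least one missing value (rows 1 to 15 here)
datWeekThree = na.omit(datWeekThree)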
head(datWeekThree)
##      outcome1   outcome2 satisfaction gender
## 16  1.7869131  0.3011534            1      0
## 17  0.4978505  0.1056762            5      0
## 18 -1.9666172 -0.6407060            2      1
## 19  0.7013559 -0.8497043            3      0
## 20 -0.4727914 -1.0241288            1      1
## 21 -1.0678237  0.1176466            3      0
library(descr)
library(prettyR)
## 
## Attaching package: 'prettyR'
## The following object is masked from 'package:descr':
## 
##     freq
describe(datWeekThree)
## Description of datWeekThree
## 
##  Numeric 
##               mean median  var   sd valid.n
## outcome1      0.08   0.01 0.86 0.93      85
## outcome2     -0.07  -0.20 1.00 1.00      85
## satisfaction  2.93   3.00 2.09 1.45      85
## gender        0.59   1.00 0.25 0.50      85
describe.factor(datWeekThree$satisfaction)
##                          
## datWeekThree$satisfaction        1        2        4        5        3
##                   Count   19.00000 18.00000 18.00000 16.00000 14.00000
##                   Percent 22.35294 21.17647 21.17647 18.82353 16.47059
describe.factor(datWeekThree$gender)
##                    
## datWeekThree$gender        1        0
##             Count   50.00000 35.00000
##             Percent 58.82353 41.17647

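Sometimes you want to get cross tabs of different variables. The compmeans function from the descr package gives the mean, N, and standard deviation of one variable for each level of another, so we can use it to look at the means for males and females and then round the results.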

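Also, the round function does not always round a 5 up. R follows the IEC 60559 standard and goes to the even digit, so round(0.5) is 0 and round(2.5) is 2. See https://stat.ethz.ch/R-manual/R-devel/library/base/html/Round.html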
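
A few quick calls make the rounding rule concrete. This is only an illustration with made-up values, not part of the analysis.

```r
# A trailing 5 is rounded to the nearest even digit, not always up
round(0.5)    # 0
round(1.5)    # 2
round(2.5)    # 2
round(-1.5)   # -2
```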

round(compmeans(datWeekTwo$outcome1, datWeekTwo$gender),2)
## Warning in compmeans(datWeekTwo$outcome1, datWeekTwo$gender): Warning:
## "datWeekTwo$gender" was converted into factor!
## Warning in compmeans(datWeekTwo$outcome1, datWeekTwo$gender): 10 rows with
## missing values dropped

## Mean value of "datWeekTwo$outcome1" according to "datWeekTwo$gender"
##       Mean  N Std. Dev.
## 0     0.21 38      0.84
## 1     0.00 52      0.96
## Total 0.09 90      0.91
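
If you want to double-check a table like this, base R can produce the same kind of group summary. This is a minimal sketch on a small made-up data frame (toyGroups and its values are hypothetical, not the course data), using tapply for the group means and table for the group counts.

```r
# Hypothetical toy data, not the course data set
toyGroups = data.frame(score = c(2, 4, NA, 6, 8, 10),
                       group = c(0, 0, 0, 1, 1, 1))
# Mean of score within each group, ignoring missing values
groupMeans = tapply(toyGroups$score, toyGroups$group, mean, na.rm = TRUE)
groupMeans
# Number of non-missing scores in each group
groupN = table(toyGroups$group[!is.na(toyGroups$score)])
groupN
```

Group 0 has a mean of 3 from 2 valid scores and group 1 has a mean of 8 from 3 valid scores.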

Sometimes we want to subset the data. For example, imagine the satisfaction variable is on the following scale: 5 = strongly agree, 4 = agree, 3 = neutral, 2 = disagree, 1 = strongly disagree. We may not be sure what to do with the neutral category, so we may want to exclude those people. We can use the subset function in R. To subset the data where we exclude neutrals (i.e. 3’s), we need two arguments: the first is the data set that we want to subset and the second is the condition. In this example, we say satisfaction != 3 to exclude the 3’s. In the other example below I show how to include only the 5’s using the == operator.

datWeekTwo = subset(datWeekTwo, satisfaction != 3)
datWeekTwo$satisfaction
##  [1] 5 1 5 2 4 1 2 4 2 5 2 2 1 5 2 1 4 2 5 4 2 2 1 5 5 2 5 2 4 1 1 1 2 4 1
## [36] 5 4 4 1 2 4 2 2 1 2 5 5 4 1 1 5 1 2 2 4 2 5 1 5 4 4 4 1 2 4 2 5 2 4 5
## [71] 1 1 4 5 5 1 4 1 5 2 1 4 4
datWeekTwoExample = subset(datWeekTwo, satisfaction == 5)
datWeekTwoExample$satisfaction
##  [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
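
The same kind of rule can also be written with square brackets, and conditions can be combined with & (and). Here is a minimal sketch on made-up data; toySat, its id column, and its values are hypothetical.

```r
# Hypothetical toy data, not the course data set
toySat = data.frame(id = 1:6, satisfaction = c(5, 3, 1, 4, 3, 2))
# subset() version: drop the neutral 3's
noNeutral = subset(toySat, satisfaction != 3)
# Bracket version of the same rule: rows where the condition is TRUE, all columns
noNeutralBracket = toySat[toySat$satisfaction != 3, ]
noNeutral$id
noNeutralBracket$id
# Two conditions joined with &: agree or strongly agree, and id above 2
agreeLater = subset(toySat, satisfaction >= 4 & id > 2)
agreeLater
```

Both versions keep ids 1, 3, 4, and 6 here. One difference to keep in mind is that subset drops rows where the condition is NA, while the bracket version returns them as rows of NA.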

Sometimes we want to subset by certain dates. Let us first create a date variable and then combine that with our current data set. The format R likes is year, month, day. I will show you later how to change month, day, year into year, month, day format.

set.seed(123)
dateWeekThree = sample(seq(as.Date('2015-05-01'), as.Date('2018-05-01'), by="day"), 85)
head(dateWeekThree)
## [1] "2016-03-11" "2017-09-10" "2016-07-21" "2017-12-22" "2018-02-21"
## [6] "2015-06-19"
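# dateWeekThree is already a Date, so this call leaves it unchanged;
# the format string matches the year-month-day layout shown above
dateWeekThree = as.Date(dateWeekThree, format = "%Y-%m-%d")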

Now we are going to review how to add a variable to an existing data frame. There are several ways to do this, but I will just show you one. We can use the data.frame function to combine the original datWeekThree data set with the new variable dateWeekThree.

datWeekThree = data.frame(datWeekThree, dateWeekThree)
head(datWeekThree)
##      outcome1   outcome2 satisfaction gender dateWeekThree
## 16  1.7869131  0.3011534            1      0    2016-03-11
## 17  0.4978505  0.1056762            5      0    2017-09-10
## 18 -1.9666172 -0.6407060            2      1    2016-07-21
## 19  0.7013559 -0.8497043            3      0    2017-12-22
## 20 -0.4727914 -1.0241288            1      1    2018-02-21
## 21 -1.0678237  0.1176466            3      0    2015-06-19
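
For reference, two of the other common ways to add a variable are the $ operator and cbind. This is a minimal sketch on made-up data; toyDat, toyDat2, and visitDate are hypothetical names.

```r
# Hypothetical toy data, not the course data set
toyDat = data.frame(outcome = c(1.2, 0.4, -0.7))
visitDate = as.Date(c("2016-03-11", "2017-09-10", "2016-07-21"))
# Option 1: assign a new column by name with $
toyDat$visitDate = visitDate
# Option 2: cbind() binds the vector on as a new column
toyDat2 = cbind(data.frame(outcome = c(1.2, 0.4, -0.7)), visitDate)
names(toyDat)
names(toyDat2)
```

Whichever way you use, the new variable needs one value per row of the data frame.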

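Now we can subset the data. Let us say we only need data between 2017-04-01 and 2017-06-30. Wrapping the cutoffs in as.Date makes it explicit that we are comparing dates and not text.

datWeekThree = subset(datWeekThree, dateWeekThree >= as.Date("2017-04-01") & dateWeekThree <= as.Date("2017-06-30"))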
head(datWeekThree)
##      outcome1    outcome2 satisfaction gender dateWeekThree
## 28  0.1533731  0.07796085            2      0    2017-05-05
## 37  0.5539177 -1.46064007            1      0    2017-05-15
## 40 -0.3804710 -1.44389316            1      0    2017-04-03
## 41 -0.6947070  0.70178434            2      1    2017-05-29
## 73  1.0057385 -0.03406725            3      1    2017-06-22
## 86  0.3317820 -0.19717589            1      0    2017-06-13

Our dates are usually in month, day, year order, so if we want to change them we can use the format argument of as.Date. The format describes the layout the dates are currently in, and R then converts them to year, month, day for us. The Y in year needs to be capitalized because %Y means a four-digit year (2018), while lowercase %y means a two-digit year (18).

testDate = c("2/4/2018", "3/4/2018")
testDate = as.Date(testDate, format = "%m/%d/%Y")
testDate = subset(testDate, testDate > "2018-02-04")
testDate
## [1] "2018-03-04"
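
Here is a minimal sketch of the difference between %Y and %y, plus the format function for going back from a Date to month, day, year text. The names d4 and d2 are hypothetical.

```r
# Month/day/year text with a four-digit year uses %Y
d4 = as.Date("2/4/2018", format = "%m/%d/%Y")
# The same date written with a two-digit year uses lowercase %y
d2 = as.Date("2/4/18", format = "%m/%d/%y")
d4 == d2
# format() goes the other way: Date back to month/day/year text
format(d4, "%m/%d/%Y")
```

Both calls give the same date, 2018-02-04, and the last line returns the text "02/04/2018".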

Just like in Excel, sometimes we want to use an if else statement. If else statements allow us to change data based on some rules. For example, in our data set we may want to create a binary variable from the satisfaction variable where all agrees (strongly agree and agree) are 1 and all disagrees (strongly disagree and disagree) are 0. We can use an ifelse statement to change the satisfaction variable: the first argument is the rule, the second is the value when the rule is true, and the third is the value when it is false.

datWeekTwo$satisfaction = ifelse(datWeekTwo$satisfaction >=4, 1, 0)
datWeekTwo$satisfaction
##  [1] 1 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0
## [36] 1 1 1 0 0 1 0 0 0 0 1 1 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 0 1 1
## [71] 0 0 1 1 1 0 1 0 1 0 0 1 1
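
One thing worth knowing is what ifelse does with missing values. This is a minimal sketch on a made-up vector (toySatisfaction is hypothetical and assumes the 3's were already removed).

```r
# Hypothetical values; the last one is missing
toySatisfaction = c(5, 4, 2, 1, NA)
# 1 for agree or strongly agree, 0 otherwise
agree = ifelse(toySatisfaction >= 4, 1, 0)
agree
# Cross-check the recode against the original values
table(toySatisfaction, agree, useNA = "ifany")
```

The result is 1 1 0 0 NA, so a missing satisfaction value stays missing and is not turned into a 0.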

Now we are moving to a slightly more advanced function called apply. Apply has related versions such as lapply and mapply, but we will focus on apply. I think the best way to understand apply is through an example. Let us say that we have a PHQ-9 with nine columns of data and we want to create a total score. Let’s run the code below to create the fake data set.

Also, if you want to see the first six rows, you can use head(data set name)

ordvar = c(1,2,3,4,5)
set.seed(124)
PHQ9 = data.frame(item1 = sample(ordvar, 100, replace = TRUE), item2 = sample(ordvar, 100, replace = TRUE), item3 = sample(ordvar, 100, replace = TRUE), item4 = sample(ordvar, 100, replace = TRUE), item5 = sample(ordvar, 100, replace = TRUE), item6 = sample(ordvar, 100, replace = TRUE), item7 = sample(ordvar, 100, replace = TRUE), item8 = sample(ordvar, 100, replace = TRUE), item9 = sample(ordvar, 100, replace = TRUE))
head(PHQ9)
##   item1 item2 item3 item4 item5 item6 item7 item8 item9
## 1     1     2     2     2     5     4     3     2     3
## 2     3     3     5     4     2     1     2     2     5
## 3     3     3     1     1     2     2     5     1     3
## 4     2     1     5     4     5     4     4     4     3
## 5     2     1     5     2     3     4     5     4     4
## 6     2     5     2     2     1     5     1     3     4

Now we can use the apply function to sum across the nine items in each row. First we tell R which data set we want it to use, then we say 1 because we want it to sum across the rows (2 would work down the columns), then we tell it which function to use, which is the sum function. This creates a new variable, PHQ9Total, which we then combine with the original PHQ9 data set so that it has a PHQ9Total column.

PHQ9Total = apply(PHQ9, 1, sum)
head(PHQ9Total)
## [1] 24 27 21 32 30 25
PHQ9 = data.frame(PHQ9, PHQ9Total)
head(PHQ9)
##   item1 item2 item3 item4 item5 item6 item7 item8 item9 PHQ9Total
## 1     1     2     2     2     5     4     3     2     3        24
## 2     3     3     5     4     2     1     2     2     5        27
## 3     3     3     1     1     2     2     5     1     3        21
## 4     2     1     5     4     5     4     4     4     3        32
## 5     2     1     5     2     3     4     5     4     4        30
## 6     2     5     2     2     1     5     1     3     4        25
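
To make the 1 versus 2 argument concrete, here is a minimal sketch on a made-up three-item scale. The names toyItems, rowTotals, and itemMeans and all the values are hypothetical.

```r
# Hypothetical three-item scale, not the PHQ9 data
toyItems = data.frame(item1 = c(1, 2, 3), item2 = c(4, 5, 6), item3 = c(1, 1, 1))
# 1 = work across each row: one total per person
rowTotals = apply(toyItems, 1, sum)
# 2 = work down each column: one mean per item
itemMeans = apply(toyItems, 2, mean)
rowTotals
itemMeans
# rowSums() is a base R shortcut for the same row totals
rowSums(toyItems)
```

The row totals are 6, 8, and 10 and the item means are 2, 5, and 1. Note that once the total is stored in the data set, running apply on the whole data set again would add the total to the items, so select only the item columns first, for example PHQ9[, 1:9].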

Homework: try making a rule to subset your data and dichotomizing a variable with ifelse.