Homework: Loading a data set into R. What questions do you all have?
Brief review
Creating data frames and extracting data from the data frames. Data frames put the variables into a form that we are used to, with the variable names in the top row and the “people” or data points for each variable listed below its name. Remember that when we create a variable with multiple data points, we need to use c() with commas to separate them. We then use the data.frame function to combine those variables into a data frame.
When we have a data frame, we can access the variables inside it using $ or [].
datReview = data.frame(a = c(1,2,3), b = c(4,5,6))
datReview[,1]
## [1] 1 2 3
datReview$a
## [1] 1 2 3
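Here is a small sketch of a few more ways to pull values out of the same kind of data frame. The object name datSketch is made up for this example; it just mirrors datReview.

```r
# Hypothetical data frame with the same shape as datReview
datSketch = data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

colA_dollar  = datSketch$a        # column by name with $
colA_bracket = datSketch[, "a"]   # column by name with []
colA_double  = datSketch[["a"]]   # column by name with [[ ]]
rowTwo       = datSketch[2, ]     # second row, all columns (still a data frame)
oneCell      = datSketch[3, "b"]  # third row of column b

print(colA_dollar)
print(rowTwo)
print(oneCell)
```

Inside the brackets, the part before the comma picks rows and the part after the comma picks columns; leaving one side empty means "all of them".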
Just run this code. It creates a data set with four variables: two continuous variables, one ordinal variable, and one binary variable. It also sets some values to NA and some to -99, so I can show you how to deal with missing values.
We can pretend that we have two outcome variables, a 5-point Likert satisfaction score, and a gender variable with two options, male and female.
ordVar = c(1,2,3,4,5)
binVar = c(0,1)
set.seed(123)
datWeekTwo = data.frame(outcome1 = rnorm(100), outcome2 = rnorm(100), satisfaction = sample(ordVar, 100, replace = TRUE), gender = sample(binVar, 100, replace = TRUE))
datWeekTwo[1:10,1] = NA
datWeekTwo[11:15,2] = -99
datWeekTwo
## outcome1 outcome2 satisfaction gender
## 1 NA -0.71040656 5 0
## 2 NA 0.25688371 1 0
## 3 NA -0.24669188 5 0
## 4 NA -0.34754260 3 0
## 5 NA -0.95161857 2 0
## 6 NA -0.04502772 3 0
## 7 NA -0.78490447 4 1
## 8 NA -1.66794194 1 1
## 9 NA -0.38022652 2 1
## 10 NA 0.91899661 4 0
## 11 1.224081797 -99.00000000 2 0
## 12 0.359813827 -99.00000000 5 0
## 13 0.400771451 -99.00000000 2 1
## 14 0.110682716 -99.00000000 3 0
## 15 -0.555841135 -99.00000000 2 1
## 16 1.786913137 0.30115336 1 0
## 17 0.497850478 0.10567619 5 0
## 18 -1.966617157 -0.64070601 2 1
## 19 0.701355902 -0.84970435 3 0
## 20 -0.472791408 -1.02412879 1 1
## 21 -1.067823706 0.11764660 3 0
## 22 -0.217974915 -0.94747461 4 0
## 23 -1.026004448 -0.49055744 3 0
## 24 -0.728891229 -0.25609219 2 0
## 25 -0.625039268 1.84386201 5 1
## 26 -1.686693311 -0.65194990 4 1
## 27 0.837787044 0.23538657 2 1
## 28 0.153373118 0.07796085 2 0
## 29 -1.138136937 -0.96185663 1 1
## 30 1.253814921 -0.07130809 5 0
## 31 0.426464221 1.44455086 5 1
## 32 -0.295071483 0.45150405 3 1
## 33 0.895125661 0.04123292 2 0
## 34 0.878133488 -0.42249683 5 1
## 35 0.821581082 -2.05324722 2 0
## 36 0.688640254 1.13133721 4 1
## 37 0.553917654 -1.46064007 1 0
## 38 -0.061911711 0.73994751 3 1
## 39 -0.305962664 1.90910357 1 1
## 40 -0.380471001 -1.44389316 1 0
## 41 -0.694706979 0.70178434 2 1
## 42 -0.207917278 -0.26219749 3 0
## 43 -1.265396352 -1.57214416 4 1
## 44 2.168955965 -1.51466765 1 0
## 45 1.207961998 -1.60153617 5 1
## 46 -1.123108583 -0.53090652 4 1
## 47 -0.402884835 -1.46175558 4 0
## 48 -0.466655354 0.68791677 1 1
## 49 0.779965118 2.10010894 2 1
## 50 -0.083369066 -1.28703048 4 0
## 51 0.253318514 0.78773885 3 0
## 52 -0.028546755 0.76904224 2 0
## 53 -0.042870457 0.33220258 2 1
## 54 1.368602284 -1.00837661 1 1
## 55 -0.225770986 -0.11945261 2 1
## 56 1.516470604 -0.28039534 5 1
## 57 -1.548752804 0.56298953 5 1
## 58 0.584613750 -0.37243876 4 1
## 59 0.123854244 0.97697339 1 1
## 60 0.215941569 -0.37458086 1 0
## 61 0.379639483 1.05271147 5 0
## 62 -0.502323453 -1.04917701 3 1
## 63 -0.333207384 -1.26015524 3 1
## 64 -1.018575383 3.24103993 1 0
## 65 -1.071791226 -0.41685759 3 1
## 66 0.303528641 0.29822759 2 1
## 67 0.448209779 0.63656967 2 0
## 68 0.053004227 -0.48378063 4 1
## 69 0.922267468 0.51686204 2 1
## 70 2.050084686 0.36896453 5 1
## 71 -0.491031166 -0.21538051 1 1
## 72 -2.309168876 0.06529303 5 1
## 73 1.005738524 -0.03406725 3 1
## 74 -0.709200763 2.12845190 4 0
## 75 -0.688008616 -0.74133610 4 1
## 76 1.025571370 -1.09599627 3 1
## 77 -0.284773007 0.03778840 4 0
## 78 -1.220717712 0.31048075 1 0
## 79 0.181303480 0.43652348 2 1
## 80 -0.138891362 -0.45836533 4 0
## 81 0.005764186 -1.06332613 2 1
## 82 0.385280401 1.26318518 5 1
## 83 -0.370660032 -0.34965039 2 1
## 84 0.644376549 -0.86551286 4 0
## 85 -0.220486562 -0.23627957 5 1
## 86 0.331781964 -0.19717589 1 0
## 87 1.096839013 1.10992029 1 0
## 88 0.435181491 0.08473729 4 1
## 89 -0.325931586 0.75405379 3 1
## 90 1.148807618 -0.49929202 5 1
## 91 0.993503856 0.21444531 5 1
## 92 0.548396960 -0.32468591 1 0
## 93 0.238731735 0.09458353 3 1
## 94 -0.627906076 -0.89536336 4 0
## 95 1.360652449 -1.31080153 1 0
## 96 -0.600259587 1.99721338 5 1
## 97 2.187332993 0.60070882 2 1
## 98 1.532610626 -1.25127136 1 0
## 99 -0.235700359 -0.61116592 4 1
## 100 -1.026420900 -1.18548008 4 0
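Before doing anything else with data like this, it can help to count how many values are missing or carry a missing-value code. The sketch below uses a small made-up data frame (datCheck, score1, and score2 are names invented for the example) and also shows one possible way to recode -99 to NA directly in R.

```r
# Small made-up data frame; -99 is used as a missing-value code in score2
datCheck = data.frame(score1 = c(NA, 0.5, 1.2, NA),
                      score2 = c(-99, 0.3, -99, 1.1))

# Count the NA values in each column
naCounts = colSums(is.na(datCheck))

# Count how many times the -99 code appears (ignoring any NAs)
codeCount = sum(datCheck$score2 == -99, na.rm = TRUE)

# Recode -99 to NA directly, without exporting and re-reading the data
datCheck$score2[datCheck$score2 == -99] = NA

print(naCounts)
print(codeCount)
print(colSums(is.na(datCheck)))
```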
The first thing you want to do is set the working directory. This tells R where to read data sets from and where to save them. Go to the Session menu, choose Set Working Directory, then choose the directory. Then you can copy that path into your code with setwd so you don’t have to do that every time.
Note: I am working on a Mac, so if you have a PC, don’t copy and paste the setwd line directly from this page. Find the specific file path for your own computer instead.
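As a rough sketch of how the working directory behaves, the code below checks it with getwd, changes it, and builds a file path. Here tempdir() only stands in for your own folder, and example.csv is a made-up file name.

```r
# Remember where we started so we can go back afterwards
oldDir = getwd()

# tempdir() stands in for your own project folder in this sketch
setwd(tempdir())
print(getwd())

# file.path() joins the pieces with a separator that works on Mac and Windows
csvPath = file.path(getwd(), "example.csv")
print(csvPath)

# Return to the original working directory
setwd(oldDir)
```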
Let’s first export the data set that we have to a csv file, because that is the easiest file type to work with. We can use the write.csv function to do that. You will usually want row.names = FALSE so that R does not write the row numbers out as an extra column.
Then you can read the csv file using the read.csv function. Most of the time the first row in the data set will be the variable names, so header should be TRUE.
Sometimes you have multiple values that signal a missing value (e.g. -99, -77). To turn all of them into NA when you load the data, use the na.strings argument of read.csv: concatenate the list of values using c(), separate the values with commas, and put each character (i.e. non-numeric) value in quotation marks.
For data that is left blank you can use "" (nothing between the quotes) or " " (one space), depending on how the blank was stored. Matt will provide an example during the workshop.
If you want to get rid of missing values you can use the na.omit function. This function deletes any row that has a missing value for at least one variable.
setwd("~/Desktop")
#setwd("C:/Users/Matthew.Hanauer/Desktop")
write.csv(datWeekTwo, "datWeekTwo.csv", row.names = FALSE)
datWeekTwo = read.csv("datWeekTwo.csv", header = TRUE, na.strings = c("NA",-99, " "))
datWeekTwo
## outcome1 outcome2 satisfaction gender
## 1 NA -0.71040656 5 0
## 2 NA 0.25688371 1 0
## 3 NA -0.24669188 5 0
## 4 NA -0.34754260 3 0
## 5 NA -0.95161857 2 0
## 6 NA -0.04502772 3 0
## 7 NA -0.78490447 4 1
## 8 NA -1.66794194 1 1
## 9 NA -0.38022652 2 1
## 10 NA 0.91899661 4 0
## 11 1.224081797 NA 2 0
## 12 0.359813827 NA 5 0
## 13 0.400771451 NA 2 1
## 14 0.110682716 NA 3 0
## 15 -0.555841135 NA 2 1
## 16 1.786913137 0.30115336 1 0
## 17 0.497850478 0.10567619 5 0
## 18 -1.966617157 -0.64070601 2 1
## 19 0.701355902 -0.84970435 3 0
## 20 -0.472791408 -1.02412879 1 1
## 21 -1.067823706 0.11764660 3 0
## 22 -0.217974915 -0.94747461 4 0
## 23 -1.026004448 -0.49055744 3 0
## 24 -0.728891229 -0.25609219 2 0
## 25 -0.625039268 1.84386201 5 1
## 26 -1.686693311 -0.65194990 4 1
## 27 0.837787044 0.23538657 2 1
## 28 0.153373118 0.07796085 2 0
## 29 -1.138136937 -0.96185663 1 1
## 30 1.253814921 -0.07130809 5 0
## 31 0.426464221 1.44455086 5 1
## 32 -0.295071483 0.45150405 3 1
## 33 0.895125661 0.04123292 2 0
## 34 0.878133488 -0.42249683 5 1
## 35 0.821581082 -2.05324722 2 0
## 36 0.688640254 1.13133721 4 1
## 37 0.553917654 -1.46064007 1 0
## 38 -0.061911711 0.73994751 3 1
## 39 -0.305962664 1.90910357 1 1
## 40 -0.380471001 -1.44389316 1 0
## 41 -0.694706979 0.70178434 2 1
## 42 -0.207917278 -0.26219749 3 0
## 43 -1.265396352 -1.57214416 4 1
## 44 2.168955965 -1.51466765 1 0
## 45 1.207961998 -1.60153617 5 1
## 46 -1.123108583 -0.53090652 4 1
## 47 -0.402884835 -1.46175558 4 0
## 48 -0.466655354 0.68791677 1 1
## 49 0.779965118 2.10010894 2 1
## 50 -0.083369066 -1.28703048 4 0
## 51 0.253318514 0.78773885 3 0
## 52 -0.028546755 0.76904224 2 0
## 53 -0.042870457 0.33220258 2 1
## 54 1.368602284 -1.00837661 1 1
## 55 -0.225770986 -0.11945261 2 1
## 56 1.516470604 -0.28039534 5 1
## 57 -1.548752804 0.56298953 5 1
## 58 0.584613750 -0.37243876 4 1
## 59 0.123854244 0.97697339 1 1
## 60 0.215941569 -0.37458086 1 0
## 61 0.379639483 1.05271147 5 0
## 62 -0.502323453 -1.04917701 3 1
## 63 -0.333207384 -1.26015524 3 1
## 64 -1.018575383 3.24103993 1 0
## 65 -1.071791226 -0.41685759 3 1
## 66 0.303528641 0.29822759 2 1
## 67 0.448209779 0.63656967 2 0
## 68 0.053004227 -0.48378063 4 1
## 69 0.922267468 0.51686204 2 1
## 70 2.050084686 0.36896453 5 1
## 71 -0.491031166 -0.21538051 1 1
## 72 -2.309168876 0.06529303 5 1
## 73 1.005738524 -0.03406725 3 1
## 74 -0.709200763 2.12845190 4 0
## 75 -0.688008616 -0.74133610 4 1
## 76 1.025571370 -1.09599627 3 1
## 77 -0.284773007 0.03778840 4 0
## 78 -1.220717712 0.31048075 1 0
## 79 0.181303480 0.43652348 2 1
## 80 -0.138891362 -0.45836533 4 0
## 81 0.005764186 -1.06332613 2 1
## 82 0.385280401 1.26318518 5 1
## 83 -0.370660032 -0.34965039 2 1
## 84 0.644376549 -0.86551286 4 0
## 85 -0.220486562 -0.23627957 5 1
## 86 0.331781964 -0.19717589 1 0
## 87 1.096839013 1.10992029 1 0
## 88 0.435181491 0.08473729 4 1
## 89 -0.325931586 0.75405379 3 1
## 90 1.148807618 -0.49929202 5 1
## 91 0.993503856 0.21444531 5 1
## 92 0.548396960 -0.32468591 1 0
## 93 0.238731735 0.09458353 3 1
## 94 -0.627906076 -0.89536336 4 0
## 95 1.360652449 -1.31080153 1 0
## 96 -0.600259587 1.99721338 5 1
## 97 2.187332993 0.60070882 2 1
## 98 1.532610626 -1.25127136 1 0
## 99 -0.235700359 -0.61116592 4 1
## 100 -1.026420900 -1.18548008 4 0
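The same round trip can be tried on a tiny data set with more than one missing-value code. This is only a sketch: datRaw, the -77 code, and the temporary file are assumptions made for the example.

```r
# Made-up data: -99 and -77 both mean "missing" in this sketch
datRaw = data.frame(id = 1:4, score = c(10, -99, 12, -77))

# Write to a temporary file so nothing is left behind in your folders
tmpFile = tempfile(fileext = ".csv")
write.csv(datRaw, tmpFile, row.names = FALSE)

# Every value listed in na.strings is read in as NA
datClean = read.csv(tmpFile, header = TRUE,
                    na.strings = c("NA", "-99", "-77", "", " "))
print(datClean)

unlink(tmpFile)  # remove the temporary file
```

Writing the codes as "-99" and "-77" in quotation marks makes it explicit that na.strings compares text; an unquoted -99 also works because c() converts it to text when it is combined with "NA".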
datWeekTwo = na.omit(datWeekTwo)
datWeekTwo
## outcome1 outcome2 satisfaction gender
## 16 1.786913137 0.30115336 1 0
## 17 0.497850478 0.10567619 5 0
## 18 -1.966617157 -0.64070601 2 1
## 19 0.701355902 -0.84970435 3 0
## 20 -0.472791408 -1.02412879 1 1
## 21 -1.067823706 0.11764660 3 0
## 22 -0.217974915 -0.94747461 4 0
## 23 -1.026004448 -0.49055744 3 0
## 24 -0.728891229 -0.25609219 2 0
## 25 -0.625039268 1.84386201 5 1
## 26 -1.686693311 -0.65194990 4 1
## 27 0.837787044 0.23538657 2 1
## 28 0.153373118 0.07796085 2 0
## 29 -1.138136937 -0.96185663 1 1
## 30 1.253814921 -0.07130809 5 0
## 31 0.426464221 1.44455086 5 1
## 32 -0.295071483 0.45150405 3 1
## 33 0.895125661 0.04123292 2 0
## 34 0.878133488 -0.42249683 5 1
## 35 0.821581082 -2.05324722 2 0
## 36 0.688640254 1.13133721 4 1
## 37 0.553917654 -1.46064007 1 0
## 38 -0.061911711 0.73994751 3 1
## 39 -0.305962664 1.90910357 1 1
## 40 -0.380471001 -1.44389316 1 0
## 41 -0.694706979 0.70178434 2 1
## 42 -0.207917278 -0.26219749 3 0
## 43 -1.265396352 -1.57214416 4 1
## 44 2.168955965 -1.51466765 1 0
## 45 1.207961998 -1.60153617 5 1
## 46 -1.123108583 -0.53090652 4 1
## 47 -0.402884835 -1.46175558 4 0
## 48 -0.466655354 0.68791677 1 1
## 49 0.779965118 2.10010894 2 1
## 50 -0.083369066 -1.28703048 4 0
## 51 0.253318514 0.78773885 3 0
## 52 -0.028546755 0.76904224 2 0
## 53 -0.042870457 0.33220258 2 1
## 54 1.368602284 -1.00837661 1 1
## 55 -0.225770986 -0.11945261 2 1
## 56 1.516470604 -0.28039534 5 1
## 57 -1.548752804 0.56298953 5 1
## 58 0.584613750 -0.37243876 4 1
## 59 0.123854244 0.97697339 1 1
## 60 0.215941569 -0.37458086 1 0
## 61 0.379639483 1.05271147 5 0
## 62 -0.502323453 -1.04917701 3 1
## 63 -0.333207384 -1.26015524 3 1
## 64 -1.018575383 3.24103993 1 0
## 65 -1.071791226 -0.41685759 3 1
## 66 0.303528641 0.29822759 2 1
## 67 0.448209779 0.63656967 2 0
## 68 0.053004227 -0.48378063 4 1
## 69 0.922267468 0.51686204 2 1
## 70 2.050084686 0.36896453 5 1
## 71 -0.491031166 -0.21538051 1 1
## 72 -2.309168876 0.06529303 5 1
## 73 1.005738524 -0.03406725 3 1
## 74 -0.709200763 2.12845190 4 0
## 75 -0.688008616 -0.74133610 4 1
## 76 1.025571370 -1.09599627 3 1
## 77 -0.284773007 0.03778840 4 0
## 78 -1.220717712 0.31048075 1 0
## 79 0.181303480 0.43652348 2 1
## 80 -0.138891362 -0.45836533 4 0
## 81 0.005764186 -1.06332613 2 1
## 82 0.385280401 1.26318518 5 1
## 83 -0.370660032 -0.34965039 2 1
## 84 0.644376549 -0.86551286 4 0
## 85 -0.220486562 -0.23627957 5 1
## 86 0.331781964 -0.19717589 1 0
## 87 1.096839013 1.10992029 1 0
## 88 0.435181491 0.08473729 4 1
## 89 -0.325931586 0.75405379 3 1
## 90 1.148807618 -0.49929202 5 1
## 91 0.993503856 0.21444531 5 1
## 92 0.548396960 -0.32468591 1 0
## 93 0.238731735 0.09458353 3 1
## 94 -0.627906076 -0.89536336 4 0
## 95 1.360652449 -1.31080153 1 0
## 96 -0.600259587 1.99721338 5 1
## 97 2.187332993 0.60070882 2 1
## 98 1.532610626 -1.25127136 1 0
## 99 -0.235700359 -0.61116592 4 1
## 100 -1.026420900 -1.18548008 4 0
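To see what na.omit is about to remove before you commit to it, one option is complete.cases, which flags the rows that have no missing values. A minimal sketch with made-up data (datMiss is a hypothetical name):

```r
# Made-up data with one NA in each column
datMiss = data.frame(x = c(1, NA, 3, 4), y = c("a", "b", NA, "d"))

keepRow = complete.cases(datMiss)  # TRUE where a row has no NA
datFull = na.omit(datMiss)         # keeps only the rows flagged TRUE

print(keepRow)
print(nrow(datMiss) - nrow(datFull))  # number of rows dropped
print(datFull)
```

Note that na.omit keeps the original row labels, which is why the output above starts at row 16 rather than row 1.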
To get some summary statistics we will need some additional packages. This means we need to use the install.packages function to install the psych, prettyR, and descr packages and then load them with library. You only need to install packages once (unless you get an updated version of R), but you need to library the package every time you restart R.
You can also get summary statistics fairly quickly using summary and/or describe for continuous variables and describe.factor for ordinal, categorical, and binary types. Because prettyR is loaded last, its describe replaces the one from psych, as the message below shows.
describe.factor only works with a single variable; however, we will learn how to use it to provide counts and percentages for several variables at a time.
#install.packages("psych")
#install.packages("prettyR")
#install.packages("descr")
library(descr)
library(psych)
library(prettyR)
##
## Attaching package: 'prettyR'
## The following objects are masked from 'package:psych':
##
## describe, skew
## The following object is masked from 'package:descr':
##
## freq
summary(datWeekTwo)
## outcome1 outcome2 satisfaction gender
## Min. :-2.309169 Min. :-2.05325 Min. :1.000 Min. :0.0000
## 1st Qu.:-0.491031 1st Qu.:-0.84970 1st Qu.:2.000 1st Qu.:0.0000
## Median : 0.005764 Median :-0.19718 Median :3.000 Median :1.0000
## Mean : 0.079468 Mean :-0.06676 Mean :2.929 Mean :0.5882
## 3rd Qu.: 0.701356 3rd Qu.: 0.51686 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. : 2.187333 Max. : 3.24104 Max. :5.000 Max. :1.0000
describe(datWeekTwo)
## Description of datWeekTwo
##
## Numeric
## mean median var sd valid.n
## outcome1 0.08 0.01 0.86 0.93 85
## outcome2 -0.07 -0.20 1.00 1.00 85
## satisfaction 2.93 3.00 2.09 1.45 85
## gender 0.59 1.00 0.25 0.50 85
# satisfaction is the ordinal column; ordVar was only the vector used to build it
describe.factor(datWeekTwo$satisfaction)
Also, to better understand a function you can use the help function.
help("summary")
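Counts and percentages for several variables at once can also be sketched in base R, without any extra packages. This is not how describe.factor works internally; countPercent and datCat are names made up for the example.

```r
# Made-up categorical data for the sketch
datCat = data.frame(satisfaction = c(1, 2, 2, 5, 5, 5),
                    gender = c(0, 1, 1, 0, 1, 1))

# Hypothetical helper: counts and percentages for one variable
countPercent = function(x) {
  counts = table(x)
  out = rbind(Count = as.vector(counts),
              Percent = round(100 * as.vector(prop.table(counts)), 1))
  colnames(out) = names(counts)  # label each column with its category
  out
}

# Apply the helper to every column of the data frame at once
allTables = lapply(datCat, countPercent)
print(allTables)
```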
Sometimes you want to compare a variable across groups. We can look at the means for males and females using the compmeans function from the descr package and then round the results.
Also, note that the round function does not always round a 5 up. It follows the IEC 60559 standard and rounds an exact 5 to the even digit (e.g. round(2.5) is 2): https://stat.ethz.ch/R-manual/R-devel/library/base/html/Round.html
round(compmeans(datWeekTwo$outcome1, datWeekTwo$gender),2)
## Warning in compmeans(datWeekTwo$outcome1, datWeekTwo$gender): Warning:
## "datWeekTwo$gender" was converted into factor!
## Mean value of "datWeekTwo$outcome1" according to "datWeekTwo$gender"
## Mean N Std. Dev.
## 0 0.18 35 0.86
## 1 0.01 50 0.97
## Total 0.08 85 0.93
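A similar table of group means can be sketched in base R with tapply, and the rounding rule is easy to check on a few values. The data below are made up, and this is only an approximation of what compmeans reports.

```r
# Made-up outcome and grouping variable
datGroup = data.frame(outcome = c(1.0, 2.0, 3.0, 4.0, 6.0),
                      gender = c(0, 0, 1, 1, 1))

# Mean, N, and standard deviation of outcome within each gender group
groupMean = tapply(datGroup$outcome, datGroup$gender, mean)
groupN    = tapply(datGroup$outcome, datGroup$gender, length)
groupSD   = tapply(datGroup$outcome, datGroup$gender, sd)
print(round(cbind(Mean = groupMean, N = groupN, SD = groupSD), 2))

# round() sends an exact .5 to the nearest even digit
halfRounded = round(c(0.5, 1.5, 2.5))
print(halfRounded)
```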
Sometimes we want to subset the data. For example, we can imagine the satisfaction variable is on the following scale: 5 = strongly agree, 4 = agree, 3 = neutral, 2 = disagree, 1 = strongly disagree. We may not be sure what to do with the neutral category, so we may want to exclude those people. We can use the subset function in R. It needs two arguments: the first is the data set that we want to subset and the second is the condition. To exclude the neutrals (i.e. the 3’s), we write satisfaction != 3. In the other example below I show how to keep only the 5’s using the == operator.
datWeekTwo = subset(datWeekTwo, satisfaction != 3)
datWeekTwo$satisfaction
## [1] 1 5 2 1 4 2 5 4 2 2 1 5 5 2 5 2 4 1 1 1 2 4 1 5 4 4 1 2 4 2 2 1 2 5 5
## [36] 4 1 1 5 1 2 2 4 2 5 1 5 4 4 4 1 2 4 2 5 2 4 5 1 1 4 5 5 1 4 1 5 2 1 4
## [71] 4
datWeekTwoExample = subset(datWeekTwo, satisfaction == 5)
datWeekTwoExample$satisfaction
## [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
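Conditions can also be combined, and subset can keep only some columns. The sketch below uses made-up data (datSub and the id column are assumptions) and shows the equivalent bracket indexing for comparison.

```r
# Made-up data for the sketch
datSub = data.frame(id = 1:6,
                    satisfaction = c(1, 3, 5, 4, 3, 5),
                    gender = c(0, 1, 1, 0, 0, 1))

# Two conditions joined with & (and); select keeps only the named columns
highGroup1 = subset(datSub, satisfaction >= 4 & gender == 1,
                    select = c(id, satisfaction))
print(highGroup1)

# The same rows and columns using [] instead of subset
sameRows = datSub[datSub$satisfaction >= 4 & datSub$gender == 1,
                  c("id", "satisfaction")]
print(sameRows)
```

Use | instead of & when a row should be kept if either condition is true.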
Just like in Excel, sometimes we want to use an if-else statement. If-else statements allow us to change data based on some rules. For example, in our data set we may want to create a binary variable from the satisfaction variable where all agrees (strongly agree and agree) are 1 and all disagrees (strongly disagree and disagree) are 0. Because we already dropped the 3’s, every remaining value is either a 4 or 5 (agree) or a 1 or 2 (disagree), so we can use a single ifelse statement to change the satisfaction variable.
datWeekTwo$satisfaction = ifelse(datWeekTwo$satisfaction >=4, 1, 0)
datWeekTwo$satisfaction
## [1] 0 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 0 1 1
## [36] 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1
## [71] 1
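If the neutral 3’s or missing values are still in the data, a single ifelse would count them as 0. One possible sketch, assuming you want them to become NA instead, nests a second ifelse and stores the result in a new variable so the original is not overwritten (agreeBin is a made-up name).

```r
# Made-up satisfaction scores, including a missing value and a neutral 3
satisfaction = c(1, 2, 4, 5, NA, 3)

# 1 for agree (4 or 5), 0 for disagree (1 or 2), NA for everything else
agreeBin = ifelse(satisfaction >= 4, 1,
                  ifelse(satisfaction <= 2, 0, NA))
print(agreeBin)

# Cross-check the recode against the original values
print(table(satisfaction, agreeBin, useNA = "ifany"))
```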
Week two homework. Get means for continuous variables and counts and percentages for at least one binary or ordinal variable from the data you loaded into R.