Today we learn how to use RStudio to get a sense of our dataset. This means learning about the dataset's size, the number of variables, the number of observations, the types of variables, the number of missing values, and so forth. This is an initial step in any data analysis project. We need to know what we have before we can decide what to do.
Datasets come in many formats. A very common one is the csv format. CSV stands for comma separated values, which surprisingly(!) means values are separated by commas. But what values? A dataset consists of variables (e.g., Age, Sex) and specific values (e.g., 12, Male). A variable is the name we give to the data on a certain characteristic. So if I have 30 people in a room and collect data on their age (12,23,12,…), sex (Male, Female, Other), and location (Manhattan, Brooklyn, Queens,…), add a made up variable called subject (1,2,3,…) to number the people, and store everything in csv format, it will look like this:
subject, age, sex, location
1,12, Male, Brooklyn
2,23, Male, Queens
3,12, Female, Manhattan
4,14, Male, Brooklyn
…
The commas here act as separators between values. Usually, we open csv files using MS Excel or other spreadsheet programs, where we find the data in a table with the commas replaced by lines. This looks much better to the human eye, doesn’t it?
| Subject | age | sex | location |
|---|---|---|---|
| 1 | 12 | Male | Brooklyn |
| 2 | 23 | Male | Queens |
| 3 | 12 | Female | Manhattan |
| 4 | 14 | Male | Brooklyn |
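To see that the comma-separated text and the table hold the same data, here is a minimal sketch that hands the four example rows to read.csv() as text instead of as a file. The object names csv_text and toy are made up for illustration.

```r
# The four example rows, typed in as one piece of text (illustrative only)
csv_text <- "subject,age,sex,location
1,12,Male,Brooklyn
2,23,Male,Queens
3,12,Female,Manhattan
4,14,Male,Brooklyn"

# text = tells read.csv() to read from the string rather than from a file
toy <- read.csv(text = csv_text)

# Printing the object shows the values arranged in rows and columns
print(toy)
```

The first line of the text becomes the column names, and every following line becomes one row.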
Ok. Let’s learn how to read a csv file into R. The command is as intuitive as anything else in R: read.csv(). Remember, in R we have two types of things: objects (e.g., numbers, letters, tables) and functions (e.g., read.csv(), sum()). Functions take objects as inputs and produce objects as outputs. Because the output of a function is an object, it can be used as input to another function. So for example, sum(1,3) gives the sum of 1 and 3. Running this command produces the output 4. This output can be used as an input to another function. For example, the output of sum(sum(1,3), 7) is the sum of the number 7 and the sum of the numbers 1 and 3. Which is… 11.
What happens if we run these lines of code:
sum(1)
[1] 1
sum(1,2,3,4)
[1] 10
sum(1:5)
[1] 15
a <- 3
b <- 7
sum(a,b)
[1] 10
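Because every result is itself an object, it can be stored under a name or passed straight into another function. A short sketch of both styles, using made-up object names:

```r
# Store the output of one function in an object...
inner <- sum(1, 3)        # 4

# ...and use that object as an input to another call
total <- sum(inner, 7)    # 11

# The same thing in one step, by nesting the calls
nested <- sum(sum(1, 3), 7)

# Nesting works across different functions too
root <- sqrt(sum(9, 16))  # square root of 25

print(c(total, nested, root))
```

Storing intermediate results tends to be easier to read; nesting is shorter. Both give the same answer.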
The command below reads a csv file, and gives it a name, lab2.sample:
lab2.sample <- read.csv("./sample.dataset.csv")
Let’s use common sense to decipher what this command is doing. If I asked you to open a file, what would be the first thing you’d ask me? Exactly: What file and from where? This is precisely what we are telling R in this command. We are giving it the address and the name of the file. Because we are using the command read.csv, R already knows to treat the file as a csv file (i.e., commas separate values). Once you run the command, RStudio creates a table like this:
Note that there is a line underneath the variables’ names row which either says <int> or <fctr>. This line is not in the dataset but is added to show us what the data types in each column are. <int> means the column has integers (e.g., 0,1,2,3…) as values, and <fctr> means the column is a factor (i.e., content is a list of words — Brooklyn, Manhattan, Queens, Bronx, Staten Island).
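One caveat, in case your table shows <chr> where this handout shows <fctr>: since R 4.0.0, read.csv() keeps text columns as plain character strings unless told otherwise. The sketch below uses a few made-up rows (not the lab file) to show how the stringsAsFactors argument changes the column type.

```r
# A few made-up rows, just for illustration
csv_text <- "subject,age,location
1,17,Queens
2,20,Brooklyn
3,20,Queens"

# Ask explicitly for text columns to become factors
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Ask explicitly for text columns to stay as character strings
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# class() reports the type of each column
print(sapply(as_factor, class))
print(sapply(as_text, class))
```

If you want the factor behaviour described here on a recent R version, adding stringsAsFactors = TRUE to the read.csv() call is one way to get it.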
You can also use the command below to view your datasets:
View(lab2.sample)
Inputs to a function are called arguments. To learn about a function’s arguments, you can type ? followed by the name of the function. This shows you the help page for that function. The help page opens in the bottom right box. Help articles in R have a specific structure which you will get used to over time. But for now, type in ?sum and see what the help article tells you. It starts by telling you which package the function is a part of, and a short description of what it does. So in the case of sum, you see the following:
In the Usage section, you can see the arguments that a function takes and below that, there is a brief definition for each argument. For the function sum, we have the argument ... defined as numeric, complex, or logical vectors. This is a fancy way of saying a list of things you want to add up. Then there is na.rm, which can be set to TRUE or FALSE. This argument tells the function what to do when there are missing values. We call these missing values NA (not available). Let’s try the two lines of code below and see what happens:
sum(NA, 3, 4)
[1] NA
sum(NA, 3, 4, na.rm = T)
[1] 7
When you set na.rm=T, you are telling R to remove NAs and give you the sum of whatever is left. By default, na.rm is set to FALSE, which is why the first line of code above returns NA.
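The na.rm argument also appears in other functions such as mean(), and is.na() (a base R function not covered above) can be used to count how many values are missing. A small sketch with a made-up vector:

```r
# A made-up vector of ages with two missing values
ages <- c(17, NA, 20, NA, 18)

# is.na() marks each missing value with TRUE; summing the TRUEs counts them
n_missing <- sum(is.na(ages))

# Like sum(), mean() returns NA unless the NAs are removed
mean_with_na <- mean(ages)
mean_without_na <- mean(ages, na.rm = TRUE)

print(n_missing)
print(mean_with_na)
print(mean_without_na)
```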
Tips:
You can see the example in the console using the command below:
example(sum)
sum> ## Pass a vector to sum, and it will add the elements together.
sum> sum(1:5)
[1] 15
sum> ## Pass several numbers to sum, and it also adds the elements.
sum> sum(1, 2, 3, 4, 5)
[1] 15
sum> ## In fact, you can pass vectors into several arguments, and everything gets added.
sum> sum(1:2, 3:5)
[1] 15
sum> ## If there are missing values, the sum is unknown, i.e., also missing, ....
sum> sum(1:5, NA)
[1] NA
sum> ## ... unless we exclude missing values explicitly:
sum> sum(1:5, NA, na.rm = TRUE)
[1] 15
Now that we have the data file loaded in, we want to know things about the dataset. Here are some possible functions to get us started:
dim() is a function that tells us how many rows and columns (i.e., how many observations and variables) our dataset has. It shows us the dimensions of the dataset.
dim(lab2.sample)
[1] 30 4
So our dataset has 30 rows and 4 columns. That seems right.
head() shows you the first few lines. You can set what "few" means by giving a number as the second argument.
head(lab2.sample)
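If you only need one of the two numbers, nrow() and ncol() return them separately, and names() lists the column names. A sketch with a small made-up data frame standing in for lab2.sample:

```r
# A small stand-in data frame (made up for illustration)
toy <- data.frame(
  subject = 1:3,
  age = c(17, 20, 20),
  location = c("Queens", "Queens", "Brooklyn")
)

print(dim(toy))    # rows first, then columns
print(nrow(toy))   # number of rows (observations)
print(ncol(toy))   # number of columns (variables)
print(names(toy))  # the variable names
```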
head(lab2.sample, 10)
summary() - gives you info about each variable (e.g., type, values, frequencies, NAs, etc.)
summary(lab2.sample)
    Subject           age            sex             location
 Min.   : 1.00   Min.   :15.00   Female:17   Bronx        : 2
 1st Qu.: 8.25   1st Qu.:16.25   Male  : 7   Brooklyn     : 6
 Median :15.50   Median :19.00   Other : 6   Manhattan    :10
 Mean   :15.50   Mean   :20.03               Queens       : 6
 3rd Qu.:22.75   3rd Qu.:23.75               Staten Island: 6
 Max.   :30.00   Max.   :28.00
str() - similar to the two above. Also shows you the type of each variable and its first few values.
str(lab2.sample)
'data.frame': 30 obs. of 4 variables:
$ Subject : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 17 20 20 18 16 15 26 26 16 24 ...
$ sex : Factor w/ 3 levels "Female","Male",..: 2 1 1 1 1 1 1 1 1 1 ...
$ location: Factor w/ 5 levels "Bronx","Brooklyn",..: 4 4 2 4 3 2 3 3 5 5 ...
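The functions above describe each column broadly. To get only the number of missing values per column, one option is to combine is.na() with colSums(), both from base R. The data frame below is made up and deliberately contains a few NAs; it is not the lab data.

```r
# A small stand-in data frame with some missing values (made up)
toy <- data.frame(
  subject = 1:4,
  age = c(17, NA, 20, 18),
  location = c("Queens", "Brooklyn", NA, NA)
)

# is.na() turns the table into TRUE/FALSE; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Type of each column, similar to what str() reports
print(sapply(toy, class))
```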
So far, we have run functions on an object that happened to be a dataset. What if we want to access a specific variable in the dataset? For example, what if we are interested only in the age variable? To access specific columns, we can use either of the following: dataset.name$variable.name or dataset.name[["variable.name"]]. Let’s run the two lines of code below:
lab2.sample$age
[1] 17 20 20 18 16 15 26 26 16 24 18 26 20 16 28 18 19 18 26 23 20 16 27 15 20 26 15 15 19 18
lab2.sample[["age"]]
[1] 17 20 20 18 16 15 26 26 16 24 18 26 20 16 28 18 19 18 26 23 20 16 27 15 20 26 15 15 19 18
Same goes for other variables (i.e., columns):
lab2.sample$sex
[1] Male Female Female Female Female Female Female Female Female Female Other Female Female Female Female Female Other Male Male Other
[21] Male Male Female Male Other Female Female Other Other Male
Levels: Female Male Other
lab2.sample$location
[1] Queens Queens Brooklyn Queens Manhattan Brooklyn Manhattan Manhattan Staten Island Staten Island
[11] Brooklyn Queens Staten Island Manhattan Manhattan Queens Manhattan Manhattan Bronx Staten Island
[21] Staten Island Brooklyn Staten Island Queens Brooklyn Bronx Manhattan Manhattan Manhattan Brooklyn
Levels: Bronx Brooklyn Manhattan Queens Staten Island
Notice that for location and sex, there is a line at the end that starts with Levels. Remember when we talked about integers and factors? These two variables are factors, so each value is an item from a list. What list? The list of values a factor can take is called its levels.
Now that we know how to access specific variables in a dataset, we want to find ways to describe them. Let’s focus on age. This variable is an integer. You can check by running the command is(lab2.sample$age). So let’s see what the **mean** age is:
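To look at the levels directly, levels() lists them and table() counts how often each one occurs. A sketch with a short made-up factor rather than the lab data:

```r
# A short made-up factor
sex <- factor(c("Male", "Female", "Female", "Other", "Female"))

# The list of values the factor can take (sorted alphabetically by default)
print(levels(sex))

# How many levels there are
print(nlevels(sex))

# How many observations fall in each level
counts <- table(sex)
print(counts)
```

Running table(lab2.sample$sex) on the lab data should give the same counts that summary() printed for that column.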
mean(lab2.sample$age)
[1] 20.03333
What about the **median**?
median(lab2.sample$age)
[1] 19
Is this the median? Hmmm, let’s check. The **median** is the number that is exactly in the middle of a sorted list of values or, if we have an even number of values, the average of the two middle numbers. So how can we find the middle number? Remember, in R everything is intuitive. One thing we’d have to do first is to sort the numbers. But how? Is there a function for sorting values? Let’s try the function below.
apropos("sort")
[1] ".doSortWrap" ".rs.sortCompletions" "is.unsorted" "sort" "sort.default" "sort.int"
[7] "sort.list" "sort.POSIXlt" "sortedXyData"
The apropos command tells R to give us anything whose name contains the word sort. We put the word in quotation marks because we are passing it as text, not as the name of an object. The results show us a bunch of functions, but one is called sort. Let’s see what it does:
?sort
Oh look, the help page says, ‘Sort (or order) a vector or factor (partially) into ascending or descending order’. That seems like what we are looking for.
So let’s try it on the variable age:
sort(lab2.sample$age)
[1] 15 15 15 15 16 16 16 16 17 18 18 18 18 18 19 19 20 20 20 20 20 23 24 26 26 26 26 26 27 28
This seems right. So how do we find the middle numbers? We know that we have 30 rows, so the median is the average of the two middle values, the ones in positions 15 and 16. The values in these two spots are 19 and 19. The average of them is 19. So our median is correct.
But what if we had a much larger dataset? How would we find a specific value? Let’s try this:
sort(lab2.sample$age)[15]
sort(lab2.sample$age)[16]
sort(lab2.sample$age)[30]
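Combining sort() with [] gives a way to repeat the median check without counting by eye. The sketch below retypes the 30 age values printed earlier into a vector called ages (a made-up name) so that it runs without the data file.

```r
# The 30 age values from lab2.sample$age, retyped so the example runs on its own
ages <- c(17, 20, 20, 18, 16, 15, 26, 26, 16, 24,
          18, 26, 20, 16, 28, 18, 19, 18, 26, 23,
          20, 16, 27, 15, 20, 26, 15, 15, 19, 18)

sorted_ages <- sort(ages)
n <- length(sorted_ages)            # 30, an even number

# For an even n, the two middle positions are n/2 and n/2 + 1
middle <- sorted_ages[c(n / 2, n / 2 + 1)]
by_hand <- mean(middle)

print(middle)
print(by_hand)
print(by_hand == median(ages))      # does it agree with median()?
```

This version only handles an even number of values; with an odd number, the median is simply the single value in the middle position.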
This is amazing!! We can use the [] to access the values within the columns. So for example, let’s look at the age data for the first person in our dataset:
lab2.sample$age[1]
How do we read this line of code? We are telling R to take the first row of the column named age of the dataset named lab2.sample. Remember, the dataset in R looks like a table, so it has rows and columns. When you type dataset.name[1,4], you are referring to row 1 and column 4. If you refer to the column using the $ sign, then all you need to do is specify the row number. Let’s try a few things.
lab2.sample$age
lab2.sample$age[1]
lab2.sample$age[5]
lab2.sample$age[2:5]
So now we know how to find individual datapoints, specific variables, and observations.
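The dataset.name[row, column] form mentioned above does not appear in the commands, so here is a small sketch of it. The data frame toy is a stand-in built from the first four rows shown in the earlier output.

```r
# First four rows of the lab data, retyped as a stand-in data frame
toy <- data.frame(
  subject = 1:4,
  age = c(17, 20, 20, 18),
  sex = c("Male", "Female", "Female", "Female"),
  location = c("Queens", "Queens", "Brooklyn", "Queens")
)

print(toy[1, 4])        # row 1, column 4
print(toy[2, "age"])    # row 2 of the column named age
print(toy[3, ])         # all of row 3 (column left blank)
print(toy$age[2:3])     # rows 2 to 3 of the age column
```

Leaving the part before or after the comma empty means "all rows" or "all columns", which is how a whole observation can be pulled out at once.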
Let’s try the commands below to get some descriptive statistics split by the sex variable:
by(lab2.sample$age, lab2.sample$sex, mean)
by(lab2.sample$age, lab2.sample$sex, sd)
by(lab2.sample$age, lab2.sample$sex, IQR)
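by() prints one block per group. If you would rather have the group results as a single named vector, tapply() from base R is one alternative. A sketch on a small made-up data frame (not the lab data):

```r
# A small made-up data frame with two people in each group
toy <- data.frame(
  age = c(17, 20, 20, 18, 26, 24),
  sex = c("Male", "Female", "Female", "Other", "Male", "Other")
)

# Mean age within each level of sex, returned as a named vector
group_means <- tapply(toy$age, toy$sex, mean)
print(group_means)

# Same pattern with a different summary function
group_sds <- tapply(toy$age, toy$sex, sd)
print(group_sds)
```

The argument order is the same as in by(): the values to summarize, then the grouping variable, then the function to apply.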
Another way to summarize all this is to use the commands below:
summary(lab2.sample$age)
hist(lab2.sample$age)
boxplot(age ~ sex, data = lab2.sample)