Data Matrices, Observations, and Variables

Key vocabulary

A data matrix is a matrix that stores data. Typically the observations, also known as cases, are stored as rows of the data matrix, while the variables, also known as characteristics, are stored as columns.
An observation is a single case in the data being collected. For example, if we were collecting data on students in a class, each individual student would be an observation.
A variable is an attribute or characteristic of the observation that we record in our data. Again, if we are collecting data on the students in our class, and each student is an observation, then a variable might be ‘eye color’ or ‘height’.
A continuous variable is a numerical variable that takes on real number values. An example of this would be the temperature of the room at a given point in time.
A discrete variable is a numerical variable that only takes on integer values, such as {1, 2, 3, ...}. An example of this might be the number of days since the school year started.
A nominal variable is a categorical variable that has no natural ordering. An example of this might be zip code, or the team that a professional soccer player plays on.
An ordinal variable is a categorical variable that has a natural ordering. For example, a student’s grade level is categorical since it places students in a grade category, but it is also ordinal because there is a natural ‘order’ to grades {1, 2, 3, ...}.
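
As a minimal sketch of the vocabulary above, the Python snippet below stores a small data matrix as a list of rows. The students and their values are made up for illustration, and the layout (one dictionary per observation) is only one assumed way to represent a data matrix.

```python
# Hypothetical class data: each dictionary is one observation (a student),
# and each key is a variable recorded for that student.
students = [
    {"eye_color": "brown", "height_cm": 162.5, "siblings": 2, "grade_level": 9},
    {"eye_color": "blue",  "height_cm": 170.1, "siblings": 0, "grade_level": 10},
    {"eye_color": "green", "height_cm": 158.0, "siblings": 1, "grade_level": 9},
]

# The type and subtype of each variable, following the definitions above.
variable_types = {
    "eye_color": "categorical, nominal",
    "height_cm": "numerical, continuous",
    "siblings": "numerical, discrete",
    "grade_level": "categorical, ordinal",
}

# Rows of the data matrix are observations; columns are variables.
n_observations = len(students)
variables = list(students[0].keys())

# Pull out one column: every recorded value of a single variable.
heights = [student["height_cm"] for student in students]

print(n_observations, "observations")
for name in variables:
    print(name, "->", variable_types[name])
print("heights:", heights)
```

Note that `grade_level` is stored as a number but is still treated as categorical here, so the storage format alone does not determine a variable's type.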

Case studies

Case study 1: The cars dataset

The cars dataset was constructed by randomly selecting 54 cars and then collecting data on various attributes of these cars. The first six observations of the data can be seen in the data matrix below:

##      type price mpgCity driveTrain passengers weight
## 1   small  15.9      25      front          5   2705
## 2 midsize  33.9      18      front          5   3560
## 3 midsize  37.7      19      front          6   3405
## 4 midsize  30.0      22       rear          4   3640
## 5 midsize  15.7      22      front          6   2880
## 6   large  20.8      19      front          6   3470

What are the observations in this data matrix?
What are the variables in this data matrix?
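
One way to explore these questions is to enter the six displayed rows into a program and inspect them by row and by column. The sketch below does this in plain Python; the `column` helper is a hypothetical convenience function and is not part of the original dataset or of any library.

```python
# Column names and the six rows shown in the cars data matrix above.
columns = ["type", "price", "mpgCity", "driveTrain", "passengers", "weight"]
rows = [
    ["small",   15.9, 25, "front", 5, 2705],
    ["midsize", 33.9, 18, "front", 5, 3560],
    ["midsize", 37.7, 19, "front", 6, 3405],
    ["midsize", 30.0, 22, "rear",  4, 3640],
    ["midsize", 15.7, 22, "front", 6, 2880],
    ["large",   20.8, 19, "front", 6, 3470],
]


def column(name):
    """Return every value recorded for one variable (hypothetical helper)."""
    idx = columns.index(name)
    return [row[idx] for row in rows]


# Each row is one car; each column is one attribute recorded for every car.
print(len(rows), "rows shown;", len(columns), "columns")
print("mpgCity:", column("mpgCity"))
print("driveTrain:", column("driveTrain"))
```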

Case study 2: The ipod dataset

The ipod dataset was constructed by looking at 3000 songs on Bradshaw’s iPod and recording the length of each of these songs. The first six observations of the data can be seen in the data matrix below:

##   songLength
## 1  0.9047763
## 2  0.2007149
## 3  0.8378343
## 4  0.6075439
## 5  0.3441459
## 6  0.9662882

What are the observations in the data matrix?
What is the variable in the data matrix?

Case study 3: Flipping a coin

Suppose you flip a coin five times and record the outcome (heads or tails) each time.

What would be the observations?
What is the variable? (note there is only one)
Make up fake coin flip data and record it in a data matrix.
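
If you would rather generate the fake data than invent it by hand, the following Python sketch simulates five flips and prints them as a one-variable data matrix. The seed value and the name `outcome` are arbitrary choices made for this illustration.

```python
import random

# Fixed seed (an arbitrary choice) so the fake data is reproducible.
rng = random.Random(1)

# Each flip is one observation; "outcome" is the single variable.
# The flip number only labels the row and is not a recorded variable.
flips = [
    {"flip": i, "outcome": rng.choice(["heads", "tails"])}
    for i in range(1, 6)
]

# Print the data matrix with one row per observation.
print("flip outcome")
for row in flips:
    print(f'{row["flip"]:>4} {row["outcome"]}')
```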

Case study 4: Deck of cards

Suppose you draw five cards out of a standard 4-suit, 52 card deck.

What would be the observations?
What are the variables? (note there are two variables)
Make up fake card data and record it in a data matrix.
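
As with the coin flips, the fake data can be simulated. The Python sketch below builds a 52-card deck and draws five cards without replacement; the column names `rank` and `suit` and the seed are assumptions made for this illustration.

```python
import random

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]

# A standard deck: every combination of the 13 ranks and 4 suits.
deck = [(rank, suit) for suit in suits for rank in ranks]

# Fixed seed (an arbitrary choice) so the fake data is reproducible.
rng = random.Random(2)

# Draw five cards without replacement, so no card can appear twice.
hand = rng.sample(deck, 5)

# Each drawn card is one observation with two recorded variables.
print("draw rank suit")
for i, (rank, suit) in enumerate(hand, start=1):
    print(f"{i:>4} {rank:>4} {suit}")
```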

Case study 5: Opposite gender friends and academic achievement

In a recent study, researchers wondered whether the percentage of a student’s friends who were of the opposite sex negatively affected how well the student did in school due to distraction. The researchers collected data on 20,769 students. They recorded which school the student attended, the percent of the student’s friends who were of the opposite sex, whether the student hung out with friends outside of school, and the student’s cumulative GPA.

What is the research question being investigated in this study?
What are the observations in this study?
Identify each of the variables in this study, as well as their types and subtypes.

Case study 6: Email and spam

Oftentimes, Google engineers collect data from a large number of emails in order to build spam filters. Below are some of the variables that are collected from emails in order to detect spam. For each of these variables, determine the type of variable as well as its subtype.

Spam indicates whether or not the email was spam
Multiple indicates whether the email was sent to more than one person
Image records the number of images attached to the email
Time records the time the email was sent
Viagra records whether or not the word ‘viagra’ appeared in the email
Dollar records the number of times a dollar sign or the word dollar appears in the email

Case study 7: Major League Baseball

Sports statistics are probably among the most widely used data among the general public. Below is a data matrix displaying several MLB players and their characteristics.

##            player                 team       position salary
## 16   Blaine Boyer Arizona Diamondbacks        Pitcher  725.0
## 81  Dirk Hayhurst    Toronto Blue Jays        Pitcher  405.0
## 29     Tim Hudson       Atlanta Braves        Pitcher 9000.0
## 84     Mike McCoy    Toronto Blue Jays Second Baseman  400.7
## 34  Takashi Saito       Atlanta Braves        Pitcher 3200.0
## 77 Brandon Morrow    Toronto Blue Jays        Pitcher  409.8

What are the variables in the data matrix?
Identify the type and subtype of each of the variables in the data matrix.