Setup

# This is a chunk where you can load the necessary packages required to reproduce the report 

library(readr) # Useful for importing data
library(knitr) # Useful for creating nice tables
#library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
#library(rvest) # Useful for scraping HTML data
#library(dplyr)

Data Description

This particular set of data has been obtained from the Kaggle web site as stated below,

Data Obtained from :

https://www.kaggle.com/annavictoria/speed-dating-experiment/version/1#Speed%20Dating%20Data.csv

This file was downloaded from the site in the form of a .csv file.

The assignment sets minimum requirements for what the data set should include.

To meet them, I need to shrink my data set considerably, because the original data set comprises 195 variables. I will reduce it to 7 variables, a mix of numeric and categorical, after the Read/Import Data section.

Read/Import Data

# This is an R chunk for importing the data. Provide your R codes here:

library(readr)  # already loaded in Setup; read.csv() below is base R and does not need it
Speed_Dating_Data <- read.csv("C:/Users/dan/Desktop/a Pre Processing/Speed_Dating_Data.csv")
#View(Speed_Dating_Data)

The data set contains 195 variables. I’m going to use the code below to shrink these to 7.

Shrinking the data set

##head(Speed_Dating_Data)

speedDating_1  <- cbind.data.frame(Speed_Dating_Data$age_o, Speed_Dating_Data$gender, Speed_Dating_Data$samerace, Speed_Dating_Data$from, Speed_Dating_Data$zipcode, Speed_Dating_Data$income, Speed_Dating_Data$career)
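
As an alternative sketch, selecting the columns by name with `[` keeps the original short column names, so the long `Speed_Dating_Data$...` headers never appear. The `toy` data frame and its `extra_col` column are hypothetical stand-ins for the imported file; only the seven column names are taken from the report.

```r
# Hypothetical stand-in for the imported data (values are illustrative)
toy <- data.frame(
  extra_col = 1:3,                       # stands in for the columns that get dropped
  age_o     = c(27L, 22L, 22L),
  gender    = c(0L, 0L, 0L),
  samerace  = c(0L, 0L, 1L),
  from      = c("Chicago", "Chicago", "Chicago"),
  zipcode   = c("60,521", "60,521", "60,521"),
  income    = c("69,487.00", "69,487.00", "69,487.00"),
  career    = c("lawyer", "lawyer", "lawyer"),
  stringsAsFactors = TRUE
)

# The seven variables to keep, in the same order as the cbind.data.frame() call
keep <- c("age_o", "gender", "samerace", "from", "zipcode", "income", "career")

toy_small <- toy[, keep]   # select by name; the original column names are preserved

names(toy_small)
dim(toy_small)
```

Selecting by name also makes the column order explicit, which helps when the columns are renamed by position later.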

The first thing to do is glance at the Environment pane and check that the data frame has been created and that the observation count matches the original data set. If all is good, do a head and structure check.

head(speedDating_1)
##   Speed_Dating_Data$age_o Speed_Dating_Data$gender
## 1                      27                        0
## 2                      22                        0
## 3                      22                        0
## 4                      23                        0
## 5                      24                        0
## 6                      25                        0
##   Speed_Dating_Data$samerace Speed_Dating_Data$from
## 1                          0                Chicago
## 2                          0                Chicago
## 3                          1                Chicago
## 4                          0                Chicago
## 5                          0                Chicago
## 6                          0                Chicago
##   Speed_Dating_Data$zipcode Speed_Dating_Data$income
## 1                    60,521                69,487.00
## 2                    60,521                69,487.00
## 3                    60,521                69,487.00
## 4                    60,521                69,487.00
## 5                    60,521                69,487.00
## 6                    60,521                69,487.00
##   Speed_Dating_Data$career
## 1                   lawyer
## 2                   lawyer
## 3                   lawyer
## 4                   lawyer
## 5                   lawyer
## 6                   lawyer
str(speedDating_1)
## 'data.frame':    8378 obs. of  7 variables:
##  $ Speed_Dating_Data$age_o   : int  27 22 22 23 24 25 30 27 28 24 ...
##  $ Speed_Dating_Data$gender  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Speed_Dating_Data$samerace: int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Speed_Dating_Data$from    : Factor w/ 270 levels "","94115","alabama",..: 56 56 56 56 56 56 56 56 56 56 ...
##  $ Speed_Dating_Data$zipcode : Factor w/ 410 levels "","0","1,040",..: 262 262 262 262 262 262 262 262 262 262 ...
##  $ Speed_Dating_Data$income  : Factor w/ 262 levels "","106,663.00",..: 239 239 239 239 239 239 239 239 239 239 ...
##  $ Speed_Dating_Data$career  : Factor w/ 368 levels "","?","??","a research position",..: 185 185 185 185 185 185 185 185 185 185 ...

I’ve noticed the column names are very long, so this section renames them to something shorter

# Rename a column in R
names(speedDating_1)[1]<-"Age"
names(speedDating_1)[2]<-"Gender"      ## column 2 was bound from gender
names(speedDating_1)[3]<-"SameRace"    ## column 3 was bound from samerace
names(speedDating_1)[4]<-"From"
names(speedDating_1)[5]<-"zipcode"
names(speedDating_1)[6]<-"Income"
names(speedDating_1)[7]<-"Career"

Recheck all of the above for the new data set

head(speedDating_1)
##   Age Gender SameRace    From zipcode    Income Career
## 1  27      0        0 Chicago  60,521 69,487.00 lawyer
## 2  22      0        0 Chicago  60,521 69,487.00 lawyer
## 3  22      0        1 Chicago  60,521 69,487.00 lawyer
## 4  23      0        0 Chicago  60,521 69,487.00 lawyer
## 5  24      0        0 Chicago  60,521 69,487.00 lawyer
## 6  25      0        0 Chicago  60,521 69,487.00 lawyer
str(speedDating_1)
## 'data.frame':    8378 obs. of  7 variables:
##  $ Age     : int  27 22 22 23 24 25 30 27 28 24 ...
##  $ Gender  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SameRace: int  0 0 1 0 0 0 0 0 0 0 ...
##  $ From    : Factor w/ 270 levels "","94115","alabama",..: 56 56 56 56 56 56 56 56 56 56 ...
##  $ zipcode : Factor w/ 410 levels "","0","1,040",..: 262 262 262 262 262 262 262 262 262 262 ...
##  $ Income  : Factor w/ 262 levels "","106,663.00",..: 239 239 239 239 239 239 239 239 239 239 ...
##  $ Career  : Factor w/ 368 levels "","?","??","a research position",..: 185 185 185 185 185 185 185 185 185 185 ...

My new data set looks much cleaner and is much easier to work with now. However, I need to define what “0” means in the Gender column. As the data set came with no guidelines, I’m going to state that:

“0” = Female (this is confirmed in row 81 of the original data set)

“1” = Male

I could of course write code to correct this, but for now I’ll leave it as this
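
For reference, the recoding mentioned above could be sketched as follows. It assumes the coding stated in the text (0 = Female, 1 = Male); the `gender_code` vector is illustrative and is not taken from the data set.

```r
# Illustrative codes only; 0 = Female and 1 = Male as stated in the text
gender_code <- c(0L, 0L, 1L, 0L, 1L)

# factor() maps each code in `levels` to the label in the same position
gender_label <- factor(gender_code,
                       levels = c(0, 1),
                       labels = c("Female", "Male"))

gender_label
table(gender_label)
```

Applied to the report's data frame, the same `factor()` call would take the gender column as its first argument.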

View(speedDating_1)  ##  Final view as a data frame before moving on

I have essentially partially cleaned my data set, ready for pre-processing at this point

————————————————————————

Objectives :

Report sections:

  1. Report title and student details [YAML input].

  2. Data Description [Plain text].

  3. Read/Import Data [Plain text & R code & Output].

  4. Inspect and Understand [Plain text & R code & Output].

  5. Subsetting I [Plain text & R code & Output].

  6. Subsetting II [Plain text & R code & Output].

  7. Create a New Data frame [Plain text & R code & Output].

A more thorough explanation of the Objectives is at the end of this Assignment

—————————————————————————

Objectives 1, 2, 3, and 4 are completed above to this point

Objectives 5, 6, and 7 are completed below

—————————————————————————

Subsetting I

Subset the data frame using first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure.

# This is a chunk to subset your data and convert it to a matrix 

subsetting_I <- speedDating_1[1:10, ]  ##  Subsetting the data set, this reduces data set to 10 observations
#head(subsetting_I)
#View(subsetting_I)                     ##  <<===    used as a check that 10 observations show only
## Convert subsetting_I to a matrix

subset_1m <- as.matrix.data.frame(subsetting_I)
subset_1M <- as.matrix(subsetting_I)  ##  as.matrix() dispatches to as.matrix.data.frame() for a
                                      ##  data frame, so the two calls create identical matrices
str(subset_1m)
##  chr [1:10, 1:7] "27" "22" "22" "23" "24" "25" "30" "27" "28" "24" "0" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "1" "2" "3" "4" ...
##   ..$ : chr [1:7] "Age" "Gender" "SameRace" "From" ...
str(subset_1M)
##  chr [1:10, 1:7] "27" "22" "22" "23" "24" "25" "30" "27" "28" "24" "0" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "1" "2" "3" "4" ...
##   ..$ : chr [1:7] "Age" "Gender" "SameRace" "From" ...

I created two to compare. Both are character matrices, including the columns that were integers in the data frame.

Check Matrix Structure

str(subset_1m)  ##  I'm going with this one , both structures are the same
##  chr [1:10, 1:7] "27" "22" "22" "23" "24" "25" "30" "27" "28" "24" "0" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "1" "2" "3" "4" ...
##   ..$ : chr [1:7] "Age" "Gender" "SameRace" "From" ...

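Answer: every value has been converted to character. A matrix can store only one data type, and the data frame mixes integer columns with factor columns (From, zipcode, Income, Career), so as.matrix() falls back to character, the one type that can represent all of them.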
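
The coercion can be checked on a small example. This is a minimal sketch using a made-up `toy` data frame, not the report's data: with a factor column present the matrix is character, while the integer-only columns on their own keep their numeric type.

```r
# Made-up data frame mixing integer and factor columns
toy <- data.frame(
  Age    = c(27L, 22L, 22L),
  Gender = c(0L, 0L, 0L),
  Career = c("lawyer", "lawyer", "lawyer"),
  stringsAsFactors = TRUE
)

m_all <- as.matrix(toy)                         # mixed columns -> character matrix
m_num <- as.matrix(toy[, c("Age", "Gender")])   # integer-only columns -> integer matrix

typeof(m_all)
typeof(m_num)
```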

—————————————————————————

Subsetting II

Subset the data frame including only first and the last variable in the data set, save it as an R object file (.RData). Provide the R codes with outputs and explain everything that you do in this step.

subsetting_II  <- cbind.data.frame(subsetting_I$Age, subsetting_I$Career) 
## bind the first and last variables, then check the result

names(subsetting_II)[1]<-"Age_2"       ## Names are huge so I'm creating new names here
names(subsetting_II)[2]<-"Career_2"    ## the last variable is Career, not SameRace

head(subsetting_II)
##   Age_2 Career_2
## 1    27   lawyer
## 2    22   lawyer
## 3    22   lawyer
## 4    23   lawyer
## 5    24   lawyer
## 6    25   lawyer
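
The task also asks for the subset to be saved as an R object file (.RData), which is not shown above. A minimal sketch of that step follows. The `toy` data frame, the object name `first_last` and the temporary file path are all hypothetical; in the report, the path would point to the chosen project folder.

```r
# Made-up data frame standing in for the report's data
toy <- data.frame(
  Age    = c(27L, 22L, 22L),
  Gender = c(0L, 0L, 0L),
  Career = c("lawyer", "lawyer", "lawyer")
)

# Keep only the first and the last variable, by position
first_last <- toy[, c(1, ncol(toy))]

# Hypothetical location; a temporary folder keeps the example self-contained
rdata_path <- file.path(tempdir(), "first_last.RData")
save(first_last, file = rdata_path)

# Load into a fresh environment to confirm the object was stored intact
check_env <- new.env()
loaded_names <- load(rdata_path, envir = check_env)
loaded_names
identical(check_env$first_last, first_last)
```

`load()` returns the names of the restored objects, which makes it easy to confirm what the file contains.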

Create a new Data Frame

Create a data frame with 2 variables and 4 observations. Your data frame has to contain one numeric variable (integer) and one nominal (categorical) variable.

In this part I’m creating two vectors: a numeric one, coerced from character strings, and a character one holding the categories

Numeric <-  as.numeric( c('100','200','300','400'))   ## coerce the character strings to numeric
Nominal <- c('yes', 'no', 'maybe', 'all')  ## Nominal means categorical
                                           ## Nominal will show in structure as a character

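In this part I’m binding the 2 columns in the order shown, then checking the result with head() and the structure of everything with str(). Because cbind() on two vectors returns a matrix, which holds a single type, the numeric values appear as character in the output.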

speedDating_2 <- cbind(Numeric, Nominal)

head(speedDating_2)
##      Numeric Nominal
## [1,] "100"   "yes"  
## [2,] "200"   "no"   
## [3,] "300"   "maybe"
## [4,] "400"   "all"
str(speedDating_2)
##  chr [1:4, 1:2] "100" "200" "300" "400" "yes" "no" "maybe" "all"
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "Numeric" "Nominal"
str(Numeric)
##  num [1:4] 100 200 300 400
str(Nominal)
##  chr [1:4] "yes" "no" "maybe" "all"

##  Optional: save the objects to a particular file / location

## write.csv(speedDating_1, file = "C:/Users/dan/Desktop/a Visualization/speedDating_1", row.names = FALSE)
## write.csv(speedDating_2, file = "C:/Users/dan/Desktop/a Visualization/speedDating_2", row.names = FALSE)  
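
Since the task asks for a data frame with an integer variable and an ordered factor, a version closer to that wording could look like the sketch below. The object name `speedDating_df` and the level ordering are assumptions made for illustration; the report does not define an order for the categories.

```r
Numeric <- c(100L, 200L, 300L, 400L)   # the L suffix creates integers

# Assumed ordering, for illustration only: no < maybe < yes < all
Nominal <- factor(c("yes", "no", "maybe", "all"),
                  levels  = c("no", "maybe", "yes", "all"),
                  ordered = TRUE)

# data.frame() keeps each column's own type, unlike cbind() on two vectors
speedDating_df <- data.frame(Numeric, Nominal)

str(speedDating_df)
levels(speedDating_df$Nominal)
```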

In this part I’m checking the number of rows and columns in the new data set

nr <- nrow(speedDating_2)  ## Shows number of rows
nc <- ncol(speedDating_2)  ## Shows number of columns
paste("Number of Rows =    " ,nr) 
## [1] "Number of Rows =     4"
paste("Number of Columns = " ,nc) 
## [1] "Number of Columns =  2"

Of note: the instruction “Create a numeric vector and use cbind() to add it to your data frame” could be read as meaning that a 6x7 matrix should be added to a 4x2 matrix, as in my case, which cannot be done because the row counts differ. However, adding a 4x2 matrix to a 4x1 matrix is doable, and that is what I’ve done above.
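
One reading of that instruction can be sketched as follows: when the first argument to `cbind()` is a data frame and the vector has one value per row, the result is still a data frame and each column keeps its type. The `Score` vector and the object names are hypothetical.

```r
# Small data frame with 4 observations (values as used above)
df <- data.frame(
  Numeric = c(100L, 200L, 300L, 400L),
  Nominal = factor(c("yes", "no", "maybe", "all"))
)

Score <- c(1.5, 2.5, 3.5, 4.5)   # hypothetical numeric vector, one value per row

# With a data frame as an argument, cbind() uses the data frame method
df_bound <- cbind(df, Score)

dim(df_bound)          # rows and columns
attributes(df_bound)   # names, class and row.names
```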

—————————————————–

References

As noted at the beginning of this Assignment

This particular set of data has been obtained from the Kaggle web site as stated below,

Data Obtained from :

https://www.kaggle.com/annavictoria/speed-dating-experiment/version/1#Speed%20Dating%20Data.csv

This file was downloaded from the site in the form of a .csv file.

—————————————————–

Objectives : (Explained)

Report sections:

  1. Report title and student details [YAML input]: You can add the title of your report (i.e. Assignment 1) and student number by updating the “title” and “author” entries in the YAML header (located at the top of the R Markdown Template).

  2. Data Description [Plain text]: A clear description of data and its source (i.e. URL of the web site) should be provided.

  3. Read/Import Data [Plain text & R code & Output]: Read/Import the data into R, then save it as a data frame. You can use Base R functions or readr, xlsx, readxl, foreign, rvest packages for this purpose. In this section, you must provide the R codes with outputs (i.e. head of data set) and explain everything that you do in order to import/read/scrape the data set.

  4. Inspect and Understand [Plain text & R code & Output]: Summarize the types of variables and data structures, check the attributes in the data. Provide the R codes with outputs and explain everything that you do in this step.

  5. Subsetting I [Plain text & R code & Output]: Subset the data frame using first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure. Provide the R codes with outputs and explain everything that you do in this step.

  6. Subsetting II [Plain text & R code & Output]: Subset the data frame including only first and the last variable in the data set, save it as an R object file (.RData). Provide the R codes with outputs and explain everything that you do in this step.

  7. Create a New Data frame [Plain text & R code & Output]: Create a data frame with 2 variables and 4 observations. Your data frame has to contain one numeric variable (integer) and one nominal (categorical) variable. Nominal variable has to be a factor and ordered. Make sure you name your variables. Show the structure of your variables and the levels of the nominal variable. Create a numeric vector and use cbind() to add it to your data frame. Check the attributes and the dimension of your new data frame. Provide the R codes with outputs and explain everything that you do in this step.

—————————————————-

IMPORTANT NOTE:

This report must be uploaded to Turnitin as a PDF with your code chunks showing. The easiest way to achieve this is to Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.