# This is a chunk where you can load the necessary packages required to reproduce the report
library(readr) # Useful for importing data
library(knitr) # Useful for creating nice tables
#library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
#library(rvest) # Useful for scraping HTML data
#library(dplyr)
This data set was obtained from the Kaggle website.
The original data set comprises 195 variables, so I will need to shrink it considerably. After the Import / Read section I reduce it to 7 variables, a mix of numeric and categorical variables.
# This is an R chunk for importing the data.
# read.csv() is a base R function, so readr (already loaded above) is not required for this call.
Speed_Dating_Data <- read.csv("C:/Users/dan/Desktop/a Pre Processing/Speed_Dating_Data.csv")
View(Speed_Dating_Data)
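The `Factor` columns that show up in the `str()` output of this report come from how `read.csv()` handles text: before R 4.0.0 it converted strings to factors by default. The sketch below makes that choice explicit. It is only an illustration under assumptions: the tiny CSV, the `tmp_csv` path and the `demo_*` names are made up so that the example runs without the Kaggle file.

```r
# Hypothetical miniature of the speed dating file, written to a temporary path
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("age_o,gender,career",
             "27,0,lawyer",
             "22,0,lawyer",
             "30,1,economist"), tmp_csv)

# stringsAsFactors = TRUE reproduces the pre-4.0.0 default: text becomes a factor
demo_factor <- read.csv(tmp_csv, stringsAsFactors = TRUE)

# stringsAsFactors = FALSE keeps text columns as character
demo_char <- read.csv(tmp_csv, stringsAsFactors = FALSE)

class(demo_factor$career)
class(demo_char$career)
str(demo_factor)
```

Setting the argument explicitly means the import behaves the same way regardless of the R version used to knit the report.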
The data set contains 195 variables. The code below reduces them to 7.
# Shrinking the data set
##head(Speed_Dating_Data)
speedDating_1 <- cbind.data.frame(Speed_Dating_Data$age_o, Speed_Dating_Data$gender, Speed_Dating_Data$samerace, Speed_Dating_Data$from, Speed_Dating_Data$zipcode, Speed_Dating_Data$income, Speed_Dating_Data$career)
head(speedDating_1)
## Speed_Dating_Data$age_o Speed_Dating_Data$gender
## 1 27 0
## 2 22 0
## 3 22 0
## 4 23 0
## 5 24 0
## 6 25 0
## Speed_Dating_Data$samerace Speed_Dating_Data$from
## 1 0 Chicago
## 2 0 Chicago
## 3 1 Chicago
## 4 0 Chicago
## 5 0 Chicago
## 6 0 Chicago
## Speed_Dating_Data$zipcode Speed_Dating_Data$income
## 1 60,521 69,487.00
## 2 60,521 69,487.00
## 3 60,521 69,487.00
## 4 60,521 69,487.00
## 5 60,521 69,487.00
## 6 60,521 69,487.00
## Speed_Dating_Data$career
## 1 lawyer
## 2 lawyer
## 3 lawyer
## 4 lawyer
## 5 lawyer
## 6 lawyer
str(speedDating_1)
## 'data.frame': 8378 obs. of 7 variables:
## $ Speed_Dating_Data$age_o : int 27 22 22 23 24 25 30 27 28 24 ...
## $ Speed_Dating_Data$gender : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Speed_Dating_Data$samerace: int 0 0 1 0 0 0 0 0 0 0 ...
## $ Speed_Dating_Data$from : Factor w/ 270 levels "","94115","alabama",..: 56 56 56 56 56 56 56 56 56 56 ...
## $ Speed_Dating_Data$zipcode : Factor w/ 410 levels "","0","1,040",..: 262 262 262 262 262 262 262 262 262 262 ...
## $ Speed_Dating_Data$income : Factor w/ 262 levels "","106,663.00",..: 239 239 239 239 239 239 239 239 239 239 ...
## $ Speed_Dating_Data$career : Factor w/ 368 levels "","?","??","a research position",..: 185 185 185 185 185 185 185 185 185 185 ...
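The long `Speed_Dating_Data$...` column names arise because `cbind.data.frame()` labels each column with the expression passed to it. One alternative, sketched here on a made-up three-row stand-in (`demo_data` and its values are assumptions, not the Kaggle data), is to subset by column name so the short original names are kept, and then rename through a lookup that ties each new label to its source column.

```r
# Hypothetical stand-in for Speed_Dating_Data with a few of its columns
demo_data <- data.frame(
  iid      = 1:3,
  age_o    = c(27L, 22L, 22L),
  gender   = c(0L, 0L, 0L),
  samerace = c(0L, 0L, 1L),
  career   = c("lawyer", "lawyer", "lawyer")
)

# Subsetting by name keeps the original column names
keep <- c("age_o", "gender", "samerace", "career")
demo_subset <- demo_data[, keep]
names(demo_subset)

# A named lookup maps each source column to its new label,
# so a label cannot land on the wrong column by position
new_names <- c(age_o = "Age", gender = "Gender",
               samerace = "SameRace", career = "Career")
names(demo_subset) <- unname(new_names[names(demo_subset)])
str(demo_subset)
```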
# Rename the columns. Position 2 holds gender and position 3 holds samerace,
# so the labels are assigned in that order.
names(speedDating_1) <- c("Age", "Gender", "SameRace", "From", "zipcode", "Income", "Career")
# Reorder so that SameRace comes before Gender
speedDating_1 <- speedDating_1[, c("Age", "SameRace", "Gender", "From", "zipcode", "Income", "Career")]
head(speedDating_1)
## Age SameRace Gender From zipcode Income Career
## 1 27 0 0 Chicago 60,521 69,487.00 lawyer
## 2 22 0 0 Chicago 60,521 69,487.00 lawyer
## 3 22 1 0 Chicago 60,521 69,487.00 lawyer
## 4 23 0 0 Chicago 60,521 69,487.00 lawyer
## 5 24 0 0 Chicago 60,521 69,487.00 lawyer
## 6 25 0 0 Chicago 60,521 69,487.00 lawyer
str(speedDating_1)
## 'data.frame': 8378 obs. of 7 variables:
## $ Age : int 27 22 22 23 24 25 30 27 28 24 ...
## $ SameRace: int 0 0 1 0 0 0 0 0 0 0 ...
## $ Gender : int 0 0 0 0 0 0 0 0 0 0 ...
## $ From : Factor w/ 270 levels "","94115","alabama",..: 56 56 56 56 56 56 56 56 56 56 ...
## $ zipcode : Factor w/ 410 levels "","0","1,040",..: 262 262 262 262 262 262 262 262 262 262 ...
## $ Income : Factor w/ 262 levels "","106,663.00",..: 239 239 239 239 239 239 239 239 239 239 ...
## $ Career : Factor w/ 368 levels "","?","??","a research position",..: 185 185 185 185 185 185 185 185 185 185 ...
View(speedDating_1) ## Final view as a data frame before moving on
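`Income` is read as a factor because its values contain thousands separators such as `69,487.00`. If a numeric income is needed later, one possible conversion is sketched below. `income_raw` is a hypothetical stand-in for the real column, using values that appear in the output above.

```r
# Hypothetical stand-in for the Income column, including one empty entry
income_raw <- factor(c("69,487.00", "69,487.00", "106,663.00", ""))

# Convert to character first: as.numeric() on a factor returns the level codes
income_chr <- as.character(income_raw)

# Strip the thousands separators, then convert; the empty string becomes NA
income_num <- as.numeric(gsub(",", "", income_chr, fixed = TRUE))
income_num
```

The same separator issue affects `zipcode`, although a zip code is an identifier rather than a quantity, so keeping it as character or factor is probably the better choice there.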
Report sections:
Report title and student details [YAML input].
Data Description [Plain text].
Read/Import Data [Plain text & R code & Output].
Inspect and Understand [Plain text & R code & Output].
Subsetting I [Plain text & R code & Output].
Subsetting II [Plain text & R code & Output].
Create a New Data frame [Plain text & R code & Output].
Subset the data frame using first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure.
# This is a chunk to subset your data and convert it to a matrix
subsetting_I <- speedDating_1[1:10, ] ## Subsetting the data set, this reduces data set to 10 observations
#head(subsetting_I)
#View(subsetting_I) ## <<=== used as a check that 10 observations show only
## Convert subsetting_I to a matrix
subset_1m <- as.matrix.data.frame(subsetting_I)
subset_1M <- as.matrix(subsetting_I) ## as.matrix() dispatches to as.matrix.data.frame() for a data frame,
## so the two calls produce identical matrices
str(subset_1m) ## Either object could be used; I'm going with this one
## chr [1:10, 1:7] "27" "22" "22" "23" "24" "25" "30" "27" "28" "24" "0" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:10] "1" "2" "3" "4" ...
## ..$ : chr [1:7] "Age" "SameRace" "Gender" "From" ...
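The matrix is character because a matrix can hold only one type. When a data frame contains any non-numeric column (here the factors `From`, `zipcode`, `Income` and `Career`), `as.matrix()` converts every column to character, which is why the ages appear as `"27"`, `"22"` and so on. A small sketch with made-up values (`demo_df` is an assumption, not the real data) shows the same effect:

```r
# Hypothetical two-row data frame mixing integer and factor columns
demo_df <- data.frame(
  Age    = c(27L, 22L),
  Gender = c(0L, 0L),
  Career = factor(c("lawyer", "lawyer"))
)

# With a factor column present, everything is coerced to character
mixed_m <- as.matrix(demo_df)
typeof(mixed_m)

# With only the integer columns, the matrix stays integer
numeric_m <- as.matrix(demo_df[, c("Age", "Gender")])
typeof(numeric_m)
```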
Subset the data frame including only first and the last variable in the data set, save it as an R object file (.RData). Provide the R codes with outputs and explain everything that you do in this step.
subsetting_II <- cbind.data.frame(subsetting_I$Age, subsetting_I$Career) ## bind first and last
head(subsetting_II) ## check it
## subsetting_I$Age subsetting_I$Career
## 1 27 lawyer
## 2 22 lawyer
## 3 22 lawyer
## 4 23 lawyer
## 5 24 lawyer
## 6 25 lawyer
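The task also asks for the subset to be saved as an `.RData` file. A minimal sketch of that step follows, assuming a small stand-in data frame (`small_df`) and a temporary file path (`rdata_path`); in the report the real data frame and a chosen location would be used instead.

```r
# Hypothetical stand-in with the first and last variables at either end
small_df <- data.frame(
  Age    = c(27L, 22L, 22L),
  Gender = c(0L, 0L, 0L),
  Career = c("lawyer", "lawyer", "lawyer")
)

# Select the first and last columns by position; their names are preserved
first_last <- small_df[, c(1, ncol(small_df))]
names(first_last)

# Save the object to an .RData file (temporary location for this sketch)
rdata_path <- file.path(tempdir(), "first_last.RData")
save(first_last, file = rdata_path)

# load() restores the object under its saved name; a separate
# environment is used here so the check does not overwrite anything
check_env <- new.env()
load(rdata_path, envir = check_env)
identical(check_env$first_last, first_last)
```

Indexing the data frame directly, rather than binding `$` columns together, keeps the column names `Age` and `Career` instead of names like `subsetting_I$Age`.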
Create a data frame with 2 variables and 4 observations. Your data frame has to contain one numeric variable (integer) and one nominal (categorical) variable.
Nominal variable has to be a factor and ordered. Make sure you name your variables.
Show the structure of your variables and the levels of the nominal variable.
Create a numeric vector and use cbind() to add it to your data frame.
Check the attributes and the dimension of your new data frame.
Provide the R codes with outputs and explain everything that you do in this step.
Numeric <- as.factor( c(1,2,3,4)) ## coerce to show as a factor
Nominal <- c('yes', 'no', 'maybe', 'all') ## Nominal means categorical
## Nominal will show in structure as a character
speedDating_2 <- cbind(Numeric, Nominal) ## cbind() on two vectors returns a character matrix, not a data frame
# View(speedDating_2) # I'll use this to verify the data frame
head(speedDating_2)
## Numeric Nominal
## [1,] "1" "yes"
## [2,] "2" "no"
## [3,] "3" "maybe"
## [4,] "4" "all"
# str(speedDating_2)
## (speedDating_1) ## for saving if I wanted to , to a particular file / location
## (speedDating_2)
## (speedDating_3)
## write.csv(speedDating_1, file = "C:/Users/dan/Desktop/a Visualization/speedDating_1", row.names = FALSE)
## write.csv(speedDating_2, file = "C:/Users/dan/Desktop/a Visualization/speedDating_2", row.names = FALSE)
## write.csv(speedDating_3, file = "C:/Users/dan/Desktop/a Visualization/speedDating_3", row.names = FALSE)
Numeric_2 <- c(100,200,300,400)
Numeric_2 <- as.numeric(Numeric_2) ## already numeric, so this coercion changes nothing
#speedDating_3 <- cbind(Numeric, Numeric_2, Nominal)
speedDating_3 <- cbind(Numeric_2, Nominal)
View(speedDating_3)
head(speedDating_3)
## Numeric_2 Nominal
## [1,] "100" "yes"
## [2,] "200" "no"
## [3,] "300" "maybe"
## [4,] "400" "all"
str(Numeric)
## Factor w/ 4 levels "1","2","3","4": 1 2 3 4
str(Numeric_2)
## num [1:4] 100 200 300 400
str(Nominal)
## chr [1:4] "yes" "no" "maybe" "all"
nrow(speedDating_3) ## Shows number of rows
## [1] 4
ncol(speedDating_3) ## Shows number of columns
## [1] 2
This step could be read as adding a 6x7 matrix to a 4x2 matrix, which in my case cannot be done because the row counts differ. Adding a 4x1 column to a 4x2 matrix can be done, and that is what I have done above.
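Because `cbind()` applied to plain vectors returns a matrix, the objects above are character matrices rather than data frames. A sketch closer to the brief is given below, under assumptions: the names `demo_frame`, `Count`, `Answer` and `Score`, and the order chosen for the factor levels, are illustrative choices and not part of the original report.

```r
# Data frame with one integer variable and one ordered factor
demo_frame <- data.frame(
  Count  = c(1L, 2L, 3L, 4L),
  Answer = factor(c("yes", "no", "maybe", "all"),
                  levels = c("no", "maybe", "yes", "all"),  # assumed ordering
                  ordered = TRUE)
)

# Structure of the variables and levels of the nominal variable
str(demo_frame)
levels(demo_frame$Answer)

# A numeric vector of the same length, added as a third column;
# cbind() with a data frame as first argument returns a data frame
Score <- c(100, 200, 300, 400)
demo_frame2 <- cbind(demo_frame, Score)

# Attributes (names, class, row.names) and dimensions of the result
attributes(demo_frame2)
dim(demo_frame2)
```

Building the object with `data.frame()` keeps each column's own type, so the integer stays an integer and the factor keeps its levels.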
Report sections:
Report title and student details [YAML input]: You can add the title of your report (i.e. Assignment 1) and student number by updating the “title” and “author” entries in the YAML header (located at the top of the R Markdown Template).
Data Description [Plain text]: A clear description of data and its source (i.e. URL of the web site) should be provided.
Read/Import Data [Plain text & R code & Output]: Read/Import the data into R, then save it as a data frame. You can use Base R functions or readr, xlsx, readxl, foreign, rvest packages for this purpose. In this section, you must provide the R codes with outputs (i.e. head of data set) and explain everything that you do in order to import/read/scrape the data set.
Inspect and Understand [Plain text & R code & Output]: Summarize the types of variables and data structures, check the attributes in the data. Provide the R codes with outputs and explain everything that you do in this step.
Subsetting I [Plain text & R code & Output]: Subset the data frame using first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure. Provide the R codes with outputs and explain everything that you do in this step.
Subsetting II [Plain text & R code & Output]: Subset the data frame including only first and the last variable in the data set, save it as an R object file (.RData). Provide the R codes with outputs and explain everything that you do in this step.
Create a New Data frame [Plain text & R code & Output]: Create a data frame with 2 variables and 4 observations. Your data frame has to contain one numeric variable (integer) and one nominal (categorical) variable. Nominal variable has to be a factor and ordered. Make sure you name your variables. Show the structure of your variables and the levels of the nominal variable. Create a numeric vector and use cbind() to add it to your data frame. Check the attributes and the dimension of your new data frame. Provide the R codes with outputs and explain everything that you do in this step.
This report must be uploaded to Turnitin as a PDF with your code chunks showing. The easiest way to achieve this is to Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.