SETUP
Begin by setting the working directory.
- setwd("C:/Users/Wajahath/Desktop/Data pre-processing/Assignment 1") - This sets the working directory
Load the required packages, installing them first if necessary.
- library(readr) - This is used for importing data
- library(foreign) - This is used for importing SPSS, SAS, Stata and similar data files
- library(gdata) - This is used for manipulating data
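Before calling library(), it can help to check which of the packages are actually installed. The following is a minimal sketch, not part of the original workflow; the variable names needed and not_installed are illustrative, and the install step is left commented out so that nothing is downloaded unintentionally.

```r
# Packages listed in the setup section above
needed <- c("readr", "foreign", "gdata")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
is_available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_available]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)  # uncomment to install the missing packages
}
```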
READ/IMPORT DATA
Step 1: WR <- read.csv("worldRecords.csv") - The file worldRecords, which is in CSV format, is imported into R.
Step 2: head(WR) - The function head() returns the first six rows of the data by default, which gives a quick preview of WR.
WR <- read.csv("worldRecords.csv")
head(WR)
Step 3: WR.df <- data.frame(WR) - The imported data is saved as a data frame. Note that read.csv() already returns a data frame, so this step makes a copy rather than a conversion.
WR.df <- data.frame(WR)
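The class of an imported object can be checked directly. The sketch below uses a small inline sample as a hypothetical stand-in for worldRecords.csv (the text argument of read.csv() reads from a string instead of a file), so the values are illustrative only.

```r
# Hypothetical two-row sample standing in for worldRecords.csv
csv_text <- ",Distance,Time\n1,0.1,0.163\n2,0.15,0.247"
sample_wr <- read.csv(text = csv_text)

is.data.frame(sample_wr)  # TRUE: read.csv() already returns a data frame
class(sample_wr)          # "data.frame"
head(sample_wr, n = 1)    # n controls how many rows head() returns
```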
INSPECT and UNDERSTAND
This step examines the data frame with respect to its dimensions, data types and structure.
The dimensions of the data frame are obtained with dim(WR.df), which returns the number of rows followed by the number of columns. The name must not be quoted: dim("WR.df") returns NULL, because a character string has no dimensions.
dim(WR.df)
[1] 40 6
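The two components of dim() are also available separately. A minimal sketch on a hypothetical toy data frame (toy is not part of the original data):

```r
# Hypothetical toy data frame with 3 rows and 2 columns
toy <- data.frame(id = 1:3, dist = c(0.1, 0.15, 0.2))

dim(toy)     # 3 2 (rows, then columns)
nrow(toy)    # 3
ncol(toy)    # 2
dim("toy")   # NULL: a quoted name is just a character string
```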
The data type describes the kind of values a variable holds, for example numeric, character, integer, factor or logical. The function typeof() returns the internal storage type of an object. Applied to the whole data frame it returns "list", because a data frame is stored as a list of columns.
typeof(WR.df)
[1] "list"
attributes(WR.df)
$names
[1] "X" "Distance" "roadORtrack" "Place" "Time" "Date"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
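To inspect an individual column, refer to it through the data frame. Quoting the name, as in typeof("X"), tests the string literal and always returns "character". Factors are stored internally as integer codes, so typeof() reports "integer" for them.
typeof(WR.df$X)
[1] "integer"
typeof(WR.df$Distance)
[1] "double"
typeof(WR.df$roadORtrack)
[1] "integer"
typeof(WR.df$Place)
[1] "integer"
typeof(WR.df$Time)
[1] "double"
typeof(WR.df$Date)
[1] "integer"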
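All columns can be inspected in one call with sapply(). The sketch below uses a hypothetical toy data frame and contrasts typeof(), which reports the storage type, with class(), which reports how R treats the column.

```r
# Hypothetical toy data frame; surface is created explicitly as a factor
toy <- data.frame(
  id      = 1:3,
  dist    = c(0.1, 0.15, 0.2),
  surface = factor(c("track", "road", "track"))
)

storage_types <- sapply(toy, typeof)  # "integer" "double"  "integer"
col_classes   <- sapply(toy, class)   # "integer" "numeric" "factor"
storage_types
col_classes
```

The factor column shows the difference most clearly: its class is "factor", while its storage type is "integer".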
Categorical (factor) variables use labels to group observations into categories. These labels are stored as levels, which can be renamed or reordered.
rORt <- factor(WR$roadORtrack, labels = c("road", "track"), levels = c("road", "track"))
levels(rORt)
[1] "road" "track"
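Reordering and renaming levels can be sketched as follows. The vector surface is a hypothetical example, not the roadORtrack column itself.

```r
# Hypothetical factor with explicitly ordered levels
surface <- factor(c("track", "road", "track"), levels = c("road", "track"))
levels(surface)                 # "road" "track"

# Reorder: list the existing levels in the new order
surface2 <- factor(surface, levels = c("track", "road"))
levels(surface2)                # "track" "road"

# Rename: assign new labels position by position
levels(surface2) <- c("Track", "Road")
as.character(surface2)          # "Track" "Road" "Track"
```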
The column names of a data frame are obtained with the function colnames().
colnames(WR.df)
[1] "X" "Distance" "roadORtrack" "Place" "Time" "Date"
The first column was named "X" because its header in the CSV file was empty. It can be renamed to "Sl.No" with the syntax below.
colnames(WR.df)[1] <- c("Sl.No")
colnames(WR.df)
[1] "Sl.No" "Distance" "roadORtrack" "Place" "Time" "Date"
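Renaming by name instead of by position is slightly more robust if the column order ever changes. A minimal sketch, using a hypothetical inline CSV whose first header field is empty:

```r
# Hypothetical sample: the first header field is empty, so read.csv() names it "X"
toy <- read.csv(text = ",Distance\n1,0.1\n2,0.15")
colnames(toy)                                    # "X" "Distance"

# Rename whichever column is currently called "X"
colnames(toy)[colnames(toy) == "X"] <- "Sl.No"
colnames(toy)                                    # "Sl.No" "Distance"
```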
SUBSET 1
Subsetting a data frame inclusive of all variables.
WR.sub.df <- WR.df[1:10, ]
WR.sub.df
Conversion of data frame to matrix.
WR.mat <- matrix(WR.sub.df)
WR.mat
[,1]
[1,] Integer,10
[2,] Numeric,10
[3,] factor,10
[4,] factor,10
[5,] Numeric,10
[6,] factor,10
Structure of the matrix:
str(WR.mat)
List of 6
$ : int [1:10] 1 2 3 4 5 6 7 8 9 10
$ : num [1:10] 0.1 0.15 0.2 0.3 0.4 0.5 0.6 0.8 1 1.5
$ : Factor w/ 2 levels "road","track": 2 2 2 2 2 2 2 2 2 2
$ : Factor w/ 33 levels "Alphen aan den Rijn",..: 2 9 3 27 31 7 30 13 28 29
$ : num [1:10] 0.163 0.247 0.322 0.514 0.72 ...
$ : Factor w/ 37 levels "1978-10-28","1980-06-07",..: 32 5 14 26 23 7 9 18 24 21
- attr(*, "dim")= int [1:2] 6 1
- A subset of the first 10 observations, including all variables, is created and then passed to matrix(). The structure of the result is revealing: it is a 6 x 1 matrix of type list, with one cell per column of the data frame. This happens because a data frame is stored as a list of columns, and matrix() does not combine those columns into a common type. A conventional matrix must hold values of a single type, which is what as.matrix() produces by coercing every column to a common type.
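The difference between matrix() and as.matrix() can be sketched on a hypothetical toy data frame (the names toy, m_list, m_num and m_mixed are illustrative):

```r
# Hypothetical toy data frame with two numeric columns
toy <- data.frame(id = 1:3, dist = c(0.1, 0.15, 0.2))

m_list <- matrix(toy)      # one list cell per column
dim(m_list)                # 2 1
typeof(m_list)             # "list"

m_num <- as.matrix(toy)    # rows and columns preserved
dim(m_num)                 # 3 2
typeof(m_num)              # "double"

# With a non-numeric column, as.matrix() coerces everything to character
m_mixed <- as.matrix(data.frame(id = 1:2, place = c("A", "B")))
typeof(m_mixed)            # "character"
```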
SUBSET 2
Subsetting a data frame with only first and last variable.
WR.sub1.df <- WR.df[, c(1,6)]
WR.sub1.df
Saving as R object file.
save(WR.sub1.df, file = "WR.sub1.df.rdata")
- A new subset of the data frame is created here. It contains all observations for two variables, the first and the last (i.e. Sl.No and Date), and is obtained with WR.sub1.df <- WR.df[, c(1, 6)]. The subset is then saved as an R object file.
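A saved object can be restored with load(), which recreates it under its original name. The sketch below is a round trip on a hypothetical toy data frame written to a temporary file, so it does not touch the working directory.

```r
# Hypothetical toy data frame standing in for WR.sub1.df
toy <- data.frame(Sl.No = 1:2, Date = c("1978-10-28", "1980-06-07"))

path <- tempfile(fileext = ".rdata")  # temporary file, removed when R exits
save(toy, file = path)

rm(toy)                        # remove the object from the workspace
loaded_names <- load(path)     # load() returns the names of the restored objects
loaded_names                   # "toy"
nrow(toy)                      # 2: the object is available again
```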
CREATING A NEW DATA FRAME
A new data frame with 2 variables and 4 observations is created here. The variables are Building and Level.
newdf <- data.frame(Building = 80:83, Level = c("A", "B", "C", "D"))
newdf
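The structure of each variable and the levels of Level are obtained as follows. The columns are referenced through newdf, because quoting a name, as in str("Level"), inspects the string literal instead of the column. The output assumes that character columns are stored as factors, as in the read.csv() results above.
str(newdf$Building)
int [1:4] 80 81 82 83
str(newdf$Level)
Factor w/ 4 levels "A","B","C","D": 1 2 3 4
levels(newdf$Level)
[1] "A" "B" "C" "D"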
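For a variable to be treated as ordinal, the factor has to be created as ordered. A minimal sketch, assuming the intended order is A < B < C < D (the source does not state the order explicitly):

```r
# Assumed order A < B < C < D; ordered = TRUE makes the factor ordinal
level <- factor(c("A", "B", "C", "D"),
                levels = c("A", "B", "C", "D"),
                ordered = TRUE)

is.ordered(level)     # TRUE
levels(level)         # "A" "B" "C" "D"
level[1] < level[4]   # TRUE: comparisons are defined for ordered factors
```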
Creating a numeric vector and adding it to data frame using cbind().
Num <- c(1, 2, 3, 4)
newdf1 <- cbind(newdf, Num)
newdf1
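A column can also be added by direct assignment, which gives the same result as cbind() here. A sketch on a hypothetical toy data frame:

```r
# Hypothetical toy data frame mirroring the Building column
toy <- data.frame(Building = 80:83)

toy$Num <- c(1, 2, 3, 4)   # the new vector must match the number of rows
dim(toy)                   # 4 2
colnames(toy)              # "Building" "Num"
```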
Attributes and dimensions of the new data frame:
attributes(newdf1)
$names
[1] "Building" "Level" "Num"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4
dim(newdf1)
[1] 4 3
