SETUP

Begin by setting the working directory for the session.

Install/Load the required packages.

DATA DESCRIPTION

As per the given task, the data comes from the following open-source collection of datasets.

https://vincentarelbundock.github.io/Rdatasets/datasets.html?fbclid=IwAR1zE57wqKeMzPiTb8uFFK0nuIRkmMSXfcDFuifyP4BhCwTy_jEdncyVZIg

File download link - https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/worldRecords.csv

The file is named “worldRecords” and is in csv format. It holds world-record times for track and road running races, together with the place and date of each record.

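Some characteristics of the selected dataset are:

The file has 6 variables (columns) and 40 observations (rows).

The variable roadORtrack is categorical.

The remaining variables are X (a row index), Distance, Place, Time and Date.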

READ/IMPORT DATA

Step 1: WR <- read.csv("worldRecords.csv") - The csv file worldRecords is imported into R as a data frame.

Step 2: head(WR) - The function head() returns the first six rows of its argument, so the top of WR can be inspected.


WR <- read.csv("worldRecords.csv")
head(WR)

Step 3: WR.df <- data.frame(WR) - read.csv() already returns a data frame, so this step simply stores a copy of it under the name WR.df.


WR.df <- data.frame(WR)

INSPECT and UNDERSTAND

This step examines the data frame's dimensions, data types and structure.

The dimensions of the data frame are obtained with dim(WR.df), which returns the number of rows followed by the number of columns.


dim(WR.df)
[1] 40  6

The data type is the kind of value a variable holds, such as numeric, character, integer, factor or logical. The function typeof() returns the internal storage type of an object.


typeof(WR.df)
[1] "list"

attributes(WR.df)
$names
[1] "X"           "Distance"    "roadORtrack" "Place"       "Time"        "Date"       

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

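typeof(WR.df$X)
[1] "integer"

typeof(WR.df$Distance)
[1] "double"

typeof(WR.df$roadORtrack)
[1] "integer"

typeof(WR.df$Place)
[1] "integer"

typeof(WR.df$Time)
[1] "double"

typeof(WR.df$Date)
[1] "integer"

The factor columns (roadORtrack, Place and Date) report "integer" because a factor is stored as integer codes.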

Categorical (factor) variables use labels to group observations. The labels correspond to the levels of the factor, which can be renamed or reordered.


rORt <- factor(WR$roadORtrack, labels = c("road", "track"), levels = c("road", "track"))
levels(rORt)    
[1] "road"  "track"

Column names of a data frame could be obtained by function colnames().


colnames(WR.df)
[1] "X"           "Distance"    "roadORtrack" "Place"       "Time"        "Date"       

The first column was named “X” by read.csv() because its header in the csv file is empty. It can be renamed to “Sl.No” as follows.


colnames(WR.df)[1] <- c("Sl.No")   
colnames(WR.df)    
[1] "Sl.No"       "Distance"    "roadORtrack" "Place"       "Time"        "Date"       

SUBSET 1

Subsetting a data frame inclusive of all variables.


WR.sub.df <- WR.df[1:10, ]
WR.sub.df 

Conversion of data frame to matrix.


WR.mat <- matrix(WR.sub.df)
WR.mat    
     [,1]      
[1,] Integer,10
[2,] Numeric,10
[3,] factor,10 
[4,] factor,10 
[5,] Numeric,10
[6,] factor,10 

Structure of matrix,

str(WR.mat)
List of 6
 $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ : num [1:10] 0.1 0.15 0.2 0.3 0.4 0.5 0.6 0.8 1 1.5
 $ : Factor w/ 2 levels "road","track": 2 2 2 2 2 2 2 2 2 2
 $ : Factor w/ 33 levels "Alphen aan den Rijn",..: 2 9 3 27 31 7 30 13 28 29
 $ : num [1:10] 0.163 0.247 0.322 0.514 0.72 ...
 $ : Factor w/ 37 levels "1978-10-28","1980-06-07",..: 32 5 14 26 23 7 9 18 24 21
 - attr(*, "dim")= int [1:2] 6 1

SUBSET 2

Subsetting a data frame with only first and last variable.


WR.sub1.df <- WR.df[, c(1,6)]
WR.sub1.df

Saving as R object file.


save(WR.sub1.df, file = "WR.sub1.df.rdata")

CREATING A NEW DATA FRAME

A new data frame with 2 variables, Building and Level, and 4 observations is created here.


newdf <- data.frame(Building = 80:83, Level = c("A", "B", "C", "D"))
newdf

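Level is first converted to an ordered factor so that it is a true ordinal variable. The structure of each variable and the levels of the ordinal variable are then obtained as follows:


newdf$Level <- factor(newdf$Level, levels = c("A", "B", "C", "D"), ordered = TRUE)
str(newdf$Building)
 int [1:4] 80 81 82 83

str(newdf$Level)
 Ord.factor w/ 4 levels "A"<"B"<"C"<"D": 1 2 3 4

levels(newdf$Level)
[1] "A" "B" "C" "D"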

Creating a numeric vector and adding it to data frame using cbind().


Num <- c(1, 2, 3, 4)
newdf1 <- cbind(newdf, Num)
newdf1

Attributes and dimension of the new data frame,


attributes(newdf1)
$names
[1] "Building" "Level"    "Num"     

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4

dim(newdf1)
[1] 4 3




---
title: "MATH2349 Semester 1, 2019"
author: "Syed Wajahath - s3750039"
subtitle: Assignment 1
output: 
  html_notebook: default
---

## SETUP

Begin by setting the working directory for the session.

* setwd("C:/Users/Wajahath/Desktop/Analytics/Sem 1/Data pre-processing/Assignment 1") - This sets the working directory

Install/Load the required packages.

* library(readr) - This is used for importing data
* library(foreign) - This is used for importing SPSS, SAS, STATA etc data files
* library(gdata) - This is used for manipulating data

```{r include=FALSE, warning=FALSE}

setwd("C:/Users/Wajahath/Desktop/Analytics/Sem 1/Data pre-processing/Assignment 1") #Sets the working directory

library(readr)   # Used for importing data
library(foreign) # Used for importing SPSS, SAS, STATA etc. data files
library(gdata)   # Used for manipulating data
```


## DATA DESCRIPTION

As per the given task, the data comes from the following open-source collection of datasets.

https://vincentarelbundock.github.io/Rdatasets/datasets.html?fbclid=IwAR1zE57wqKeMzPiTb8uFFK0nuIRkmMSXfcDFuifyP4BhCwTy_jEdncyVZIg

File download link - https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/worldRecords.csv

The file is named "worldRecords" and is in csv format. It holds world-record times for track and road running races, together with the place and date of each record.

Some characteristics of the selected dataset are:

* The file has 6 variables (columns) and 40 observations (rows).

* The variable **roadORtrack** is categorical.

* The remaining variables are X (a row index), Distance, Place, Time and Date.


## READ/IMPORT DATA

Step 1: **WR <- read.csv("worldRecords.csv")** - The csv file worldRecords is imported into R as a data frame.

Step 2: **head(WR)** - The function head() returns the first six rows of its argument, so the top of WR can be inspected.

```{r}

WR <- read.csv("worldRecords.csv")
head(WR)

```

Step 3: **WR.df <- data.frame(WR)** - read.csv() already returns a data frame, so this step simply stores a copy of it under the name WR.df.

```{r}

WR.df <- data.frame(WR)

```
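As a minimal sketch of the import step, the example below reads a small CSV held in a string instead of a file, so it runs without worldRecords.csv on disk. The object names `csv_text` and `demo` are hypothetical, and the three rows are only an illustrative excerpt.

```r
# Hypothetical three-column excerpt, supplied as text so no file is needed
csv_text <- "Distance,roadORtrack,Time\n0.1,track,0.163\n0.15,track,0.247\n0.2,track,0.322"

demo <- read.csv(text = csv_text)

is.data.frame(demo)   # TRUE: read.csv() already returns a data frame
head(demo, n = 2)     # head() shows the first rows; n controls how many
```

Because the result is already a data frame, wrapping it in data.frame() is optional.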


## INSPECT and UNDERSTAND

This step examines the data frame's dimensions, data types and structure.

The dimensions of the data frame are obtained with **dim(WR.df)**, which returns the number of rows followed by the number of columns.

```{r}

dim(WR.df)

```
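The sketch below illustrates dim() on a small, hypothetical data frame (`demo` is not part of the assignment data). It also shows why the object must be passed unquoted: a quoted name is just a character string, which has no dimensions.

```r
# Hypothetical data frame with 4 rows and 2 columns
demo <- data.frame(id = 1:4, value = c(2.5, 3.1, 4.8, 5.0))

dim(demo)      # rows first, then columns
nrow(demo)     # number of rows only
ncol(demo)     # number of columns only

dim("demo")    # NULL: this is the string "demo", not the data frame
```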

The data type is the kind of value a variable holds, such as numeric, character, integer, factor or logical. The function **typeof()** returns the internal storage type of an object.

* typeof(WR.df)
```{r}

typeof(WR.df)

```
* attributes(WR.df)
```{r}

attributes(WR.df)

```
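* typeof(WR.df$X)
```{r}

typeof(WR.df$X)

```
* typeof(WR.df$Distance)
```{r}

typeof(WR.df$Distance)

```
* typeof(WR.df$roadORtrack)
```{r}

typeof(WR.df$roadORtrack)

```
* typeof(WR.df$Place)
```{r}

typeof(WR.df$Place)

```
* typeof(WR.df$Time)
```{r}

typeof(WR.df$Time)

```
* typeof(WR.df$Date)
```{r}

typeof(WR.df$Date)

```

Columns stored as factors report "integer", because a factor is held internally as integer codes.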
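To inspect every column in one call, sapply() can apply typeof() or class() across a data frame. The following is a sketch on a hypothetical data frame `demo`; it also contrasts a column reference with a quoted column name, which is only a string.

```r
# Hypothetical data frame with an integer, a numeric and a factor column
demo <- data.frame(
  id       = 1:3,
  distance = c(0.1, 0.15, 0.2),
  surface  = factor(c("track", "road", "track"))
)

typeof("distance")       # "character": the type of the string literal itself
typeof(demo$distance)    # "double": the type of the column

sapply(demo, typeof)     # storage type of each column
sapply(demo, class)      # class of each column; the factor shows as "factor"
```

class() is often the more readable of the two, since typeof() reports a factor as its underlying integer codes.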


Categorical (factor) variables use labels to group observations. The labels correspond to the levels of the factor, which can be renamed or reordered.

```{r}

rORt <- factor(WR$roadORtrack, labels = c("road", "track"), levels = c("road", "track"))
levels(rORt)    

```
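Renaming and reordering levels can be sketched as follows. The vector `surface` and the capitalised labels are hypothetical and chosen only for illustration.

```r
# Hypothetical factor; levels default to alphabetical order
surface <- factor(c("track", "road", "track", "road"))
levels(surface)

# Reorder the levels without changing the underlying values
reordered <- factor(surface, levels = c("track", "road"))
levels(reordered)

# Rename the levels; labels are matched to levels by position
relabelled <- factor(surface, levels = c("road", "track"), labels = c("Road", "Track"))
as.character(relabelled)
```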
  
Column names of a data frame could be obtained by function **colnames()**.

* colnames(WR.df)
```{r}

colnames(WR.df)

```

The first column was named **"X"** by read.csv() because its header in the csv file is empty. It can be renamed to "Sl.No" as follows.

    
```{r}

colnames(WR.df)[1] <- c("Sl.No")   
colnames(WR.df)    

```
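Renaming by position works here, but it silently renames the wrong column if the column order ever changes. A possible alternative, sketched on a hypothetical data frame `demo`, is to select the column by its current name.

```r
# Hypothetical data frame whose first column carries the default name "X"
demo <- data.frame(X = 1:3, Distance = c(0.1, 0.15, 0.2))

# Rename whichever column is currently called "X"
names(demo)[names(demo) == "X"] <- "Sl.No"

names(demo)
```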


## SUBSET 1

Subsetting a data frame inclusive of all variables.

* WR.sub.df <- WR.df[1:10, ]

    WR.sub.df
    
```{r}

WR.sub.df <- WR.df[1:10, ]
WR.sub.df 


```
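The general form of this indexing is `df[rows, columns]`, where leaving a position empty keeps everything along that dimension. The sketch below uses a hypothetical data frame `demo` to show selection by row position, by column name and by a logical condition.

```r
# Hypothetical data frame with 5 rows
demo <- data.frame(
  id       = 1:5,
  distance = c(0.1, 0.15, 0.2, 0.3, 0.4),
  surface  = c("track", "track", "road", "road", "track")
)

first_three <- demo[1:3, ]                      # rows 1 to 3, all columns
by_name     <- demo[, c("id", "surface")]      # all rows, two named columns
track_only  <- demo[demo$surface == "track", ] # rows that meet a condition
```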

Conversion of data frame to matrix.

* WR.mat <- matrix(WR.sub.df)

    WR.mat
    
```{r}

WR.mat <- matrix(WR.sub.df)
WR.mat    

```

Structure of matrix,
```{r}
str(WR.mat)

```

* A subset of the first 10 observations, including all variables, is created. The subset is then passed to **matrix()** and the structure of the result is inspected. The result has type **list** and dimensions 6 x 1: a data frame is internally a list of columns, and **matrix()** arranges those 6 columns as the cells of a one-column matrix instead of combining their values. A matrix of values must hold a single data type, so **as.matrix()** is the function that converts a data frame by coercing every column to a common type.
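The difference between the two functions can be seen on a small, hypothetical data frame (`demo` below is not the assignment data and has only numeric columns, so the coerced matrix stays numeric).

```r
# Hypothetical data frame with 3 rows and 2 numeric columns
demo <- data.frame(id = 1:3, distance = c(0.1, 0.15, 0.2))

list_matrix <- matrix(demo)     # treats the data frame as a list of columns
typeof(list_matrix)             # "list"
dim(list_matrix)                # 2 1: one cell per column

num_matrix <- as.matrix(demo)   # coerces all columns to one common type
typeof(num_matrix)              # "double"
dim(num_matrix)                 # 3 2: same shape as the data frame
```

If any column were a character or factor column, as.matrix() would be expected to return a character matrix instead.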


## SUBSET 2

Subsetting a data frame with only first and last variable.

* WR.sub1.df <- WR.df[, c(1,6)]

    WR.sub1.df
```{r}

WR.sub1.df <- WR.df[, c(1,6)]
WR.sub1.df

```

Saving as R object file.

* save(WR.sub1.df, file = "WR.sub1.df.rdata")
```{r}

save(WR.sub1.df, file = "WR.sub1.df.rdata")

```


* A new subset of the data frame is made here. It keeps all the observations but only 2 variables, the first and the last (i.e. Sl.No and Date), using **WR.sub1.df <- WR.df[, c(1,6)]**. The subset is then saved as an R object file.
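A saved object can be restored with load(), which recreates it under its original name. The round trip below is a sketch that writes to a temporary file so nothing is left in the working directory; `demo` and `path` are hypothetical names.

```r
# Hypothetical two-row data frame to save and restore
demo <- data.frame(id = 1:2, date = c("1978-10-28", "1980-06-07"))

path <- tempfile(fileext = ".RData")
save(demo, file = path)

rm(demo)                  # remove the object from the workspace
restored <- load(path)    # load() returns the names of the objects it restored
restored
exists("demo")            # the data frame is back under its original name

unlink(path)              # delete the temporary file
```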





## CREATING A NEW DATA FRAME

A new data frame with 2 variables, **Building and Level**, and 4 observations is created here.

```{r}

newdf <- data.frame(Building = 80:83, Level = c("A", "B", "C", "D"))
newdf

```
    
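Level is first converted to an ordered factor so that it is a true ordinal variable. The structure of each variable and the levels of the ordinal variable are then obtained as follows:

```{r}

# Make Level an ordered factor, then inspect the column (not the string "Building")
newdf$Level <- factor(newdf$Level, levels = c("A", "B", "C", "D"), ordered = TRUE)
str(newdf$Building)

```

```{r}

str(newdf$Level)

```

```{r}

levels(newdf$Level)

```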
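What distinguishes an ordinal variable from a plain categorical one is that its levels can be compared. The following sketch uses a hypothetical vector `level` with the same four values.

```r
# Hypothetical ordered factor with levels A < B < C < D
level <- factor(c("A", "B", "C", "D"),
                levels = c("A", "B", "C", "D"),
                ordered = TRUE)

is.ordered(level)    # TRUE
level < "C"          # element-wise comparison against a level
as.integer(level)    # position of each value in the level ordering
```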

Creating a numeric vector and adding it to data frame using cbind().

* Num <- c(1, 2, 3, 4)

    newdf1 <- cbind(newdf, Num)
    
    newdf1
```{r}

Num <- c(1, 2, 3, 4)
newdf1 <- cbind(newdf, Num)
newdf1

```
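A column can also be added by assigning to a new name with `$`, which should give the same result as cbind() here. This is a sketch on a hypothetical copy of the data frame; `with_cbind` and `with_dollar` are illustrative names.

```r
# Hypothetical copy of the data frame built above
demo <- data.frame(Building = 80:83, Level = c("A", "B", "C", "D"))

# Option 1: bind a named vector as a new column
with_cbind <- cbind(demo, Num = c(1, 2, 3, 4))

# Option 2: assign the vector to a new column name
with_dollar <- demo
with_dollar$Num <- c(1, 2, 3, 4)

all.equal(with_cbind, with_dollar)
```

In both cases the new vector needs one value per row of the data frame.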

Attributes and dimension of the new data frame,


```{r}

attributes(newdf1)

```


```{r}

dim(newdf1)

```




<br>
<br>
<br>
