Introduction

Assignment 1 - Locate open data from the web, import it into R, inspect the data using Rfunctions, apply tidy data principles on the data and manipulate the data appropriately.

Setup

Install and load the necessary packages to reproduce the report here:

# Load necessary packages required to reproduce the report 

library(readr) # Useful for importing data
library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
library(rvest) # Useful for scraping HTML data
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(knitr) # Useful for creating nice tables
library(tidyverse) # Used for data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v dplyr   1.0.3
## v tibble  3.0.4     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()         masks stats::filter()
## x rvest::guess_encoding() masks readr::guess_encoding()
## x dplyr::lag()            masks stats::lag()
## x purrr::pluck()          masks rvest::pluck()
library(dplyr) # Useful for data manipulation
library(ggplot2) # Useful for data plotting
library(openxlsx) # Useful for importing
library(tidyr) # Useful for data manipulation
library(magrittr) # Useful for data manipulation and structure
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract

Step 1 - Locate Data

Data Set Downloaded from the Kaggle in CSV format

Site Called - Emissions by Cars- Carbon Dioxide Emission by Cars - Data Renamed to “CarsCO2.csv” when saved to local location from original file name “Sample.csv”

This data set shows the differnt fuel consumption and CO2 emission levels for a variety of cars.

Step 2 - Read/Import Data

Import the Data - Saved the file into the project directory as a CSV to enable direct read.table function to be utilised

Cars <- read.table("CarsCO2.csv", header = TRUE, stringsAsFactor = FALSE, sep = ",", fill = TRUE)

Actions undertaken in Step 2:

  • Renamed the “Sample.csv” downloaded from Kaggle to “CarsCO2.csv” and saved it in the project environment
  • Wrote the name of a new dataframe “Cars”
  • Used the read.table function to import the data from the “CarsCO2.csv” into the “Cars” data frame
  • Used header = TRUE to assign top row of data as the header
  • stringsAsFactor = FALSE splits the data by columns and separate by any commas using sep = “,”
  • Fill = TRUE means that any blanks headers will be filled

Step 3 - Data description

Provide a clear description of the data and its source (i.e. URL of the web site). Provide variable descriptions.

Data sourced from - https://www.kaggle.com/prathamtripathi/co2-emissions-by-cars-in-canada

This data set contains various information about a set of cars that were manufactured with set of factory parameters like cylinder size, number of cylinders, fuel consumption, Carbon dioxide emissions

This data set allows for brand comparison for both fuel consumption and CO2 emissions.

Description of the variables

  • MODEL - Year of car manufacture
  • MAKE - Brand of car (manufacturer)
  • MODEL.1 - Manufacturer sub-line of car
  • VEHICLE.CLASS - Provides detail on the variety of car and size
  • ENGINE_SIZE - The engine capacity of the car in litres
  • CYLINDERS - THe number of cylinders the car engine has
  • TRANSMISSION - The type of transmission the car has (automatic/ manual and the number of gears)
  • FUEL - TYpe of fuel required to run the car
  • FUEL_CONSUMPTION - Level of fuel consumption in litres per 100kms of driving
  • X - Unknown data point to be removed from data table
  • X.1 - Unknown data point to be removed from data table
  • X.2 - Unknown data point to be removed from data table
  • CO2_EMISSIONS - The amount of CO2 in grams released per km of driving

Step 4 - Inspect dataset and variables

### Step 4.a - Check the dimensions of the dataframe

nrow(Cars)
## [1] 679
ncol(Cars)
## [1] 13
dim(Cars)
## [1] 679  13

Step 4.a - Actions for dimension check

  • Checked the number of rows using the nrow function
  • Checked the number of columns the ncol function
  • Checked the overall dimension of the dataframe using the dim function (shows both rows and columns)
### Step 4.b - Summarise the types of variables and review for appropriateness 

str(Cars)
## 'data.frame':    679 obs. of  13 variables:
##  $ MODEL            : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ MAKE             : chr  "ACURA" "ACURA" "ACURA" "ACURA" ...
##  $ MODEL.1          : chr  "1.7EL" "1.7EL" "3.2CL" "3.2TL" ...
##  $ VEHICLE.CLASS    : chr  "COMPACT" "COMPACT" "COMPACT" "MID-SIZE" ...
##  $ ENGINE_SIZE      : num  1.7 1.7 3.2 3.2 3.5 1.8 1.8 1.8 3.5 3 ...
##  $ CYLINDERS        : int  4 4 6 6 6 4 4 4 6 6 ...
##  $ TRANSMISSION     : chr  "A4" "M5" "AS5" "AS5" ...
##  $ FUEL             : chr  "X" "X" "Z" "Z" ...
##  $ FUEL_CONSUMPTION.: num  9.3 8.9 13.7 13.8 15 11.4 10.6 10.8 15.5 15.3 ...
##  $ X                : num  7.2 7.4 8.8 8.8 10.9 8.5 8.3 8.5 11.1 10.9 ...
##  $ X.1              : num  8.3 8.3 11.5 11.6 13.1 10.1 9.6 9.7 13.5 13.3 ...
##  $ X.2              : int  34 34 25 24 22 28 29 29 21 21 ...
##  $ CO2_EMISSIONS    : int  191 191 265 267 301 232 221 223 311 306 ...

Step 4.b - Actions for variable summary

  • Use the str function to show the data frame structure table to allow for review of the variable types.
## Step 4.c - Changing variable type

### Checking this data frame for type information they all variables have been assigned the correct type. 

### To illustrate how to change type I will use one of the variables that is set to be deleted (X.2)

### The change of type test conducted to to be 
#### 1. Convert from Integer to Character 
#### 2. Convert from Character to Number
#### 3. Convert from Number to Integer (original type)


Cars$X.2 <-as.character(Cars$X.2)
#### Check change 1
str(Cars) 
## 'data.frame':    679 obs. of  13 variables:
##  $ MODEL            : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ MAKE             : chr  "ACURA" "ACURA" "ACURA" "ACURA" ...
##  $ MODEL.1          : chr  "1.7EL" "1.7EL" "3.2CL" "3.2TL" ...
##  $ VEHICLE.CLASS    : chr  "COMPACT" "COMPACT" "COMPACT" "MID-SIZE" ...
##  $ ENGINE_SIZE      : num  1.7 1.7 3.2 3.2 3.5 1.8 1.8 1.8 3.5 3 ...
##  $ CYLINDERS        : int  4 4 6 6 6 4 4 4 6 6 ...
##  $ TRANSMISSION     : chr  "A4" "M5" "AS5" "AS5" ...
##  $ FUEL             : chr  "X" "X" "Z" "Z" ...
##  $ FUEL_CONSUMPTION.: num  9.3 8.9 13.7 13.8 15 11.4 10.6 10.8 15.5 15.3 ...
##  $ X                : num  7.2 7.4 8.8 8.8 10.9 8.5 8.3 8.5 11.1 10.9 ...
##  $ X.1              : num  8.3 8.3 11.5 11.6 13.1 10.1 9.6 9.7 13.5 13.3 ...
##  $ X.2              : chr  "34" "34" "25" "24" ...
##  $ CO2_EMISSIONS    : int  191 191 265 267 301 232 221 223 311 306 ...
Cars$X.2 <-as.numeric(Cars$X.2)

Cars$X.2 <-as.integer(Cars$X.2)

Step 4.c - Actions for dimension check

  • Change variable type by referencing the variable from the dataframe and transforming it by stating the variable type needed for the change and stating the variable which is to be changed.
  • I performed this a further two times on the Cars$X.2 variable to test the function and revert it back to the original type.
### Step 4.d - Change variable types to factors

### When reviewing the data frame for factors I have identified 3 variables which could be converted to factors.
#### 1. VEHICLE.CLASS
#### 2. CYLINDERS
#### 3. FUEL

Cars$VEHICLE.CLASS <- factor(Cars$VEHICLE.CLASS)
Cars$CYLINDERS <- factor(Cars$CYLINDERS)
Cars$FUEL <- factor(Cars$FUEL)

#### Check variable type and the levels
str(Cars) 
## 'data.frame':    679 obs. of  13 variables:
##  $ MODEL            : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ MAKE             : chr  "ACURA" "ACURA" "ACURA" "ACURA" ...
##  $ MODEL.1          : chr  "1.7EL" "1.7EL" "3.2CL" "3.2TL" ...
##  $ VEHICLE.CLASS    : Factor w/ 15 levels "","COMPACT","FULL-SIZE",..: 2 2 2 4 4 11 11 11 12 13 ...
##  $ ENGINE_SIZE      : num  1.7 1.7 3.2 3.2 3.5 1.8 1.8 1.8 3.5 3 ...
##  $ CYLINDERS        : Factor w/ 7 levels "3","4","5","6",..: 2 2 4 4 4 2 2 2 4 4 ...
##  $ TRANSMISSION     : chr  "A4" "M5" "AS5" "AS5" ...
##  $ FUEL             : Factor w/ 6 levels "","D","E","N",..: 5 5 6 6 6 5 5 6 6 6 ...
##  $ FUEL_CONSUMPTION.: num  9.3 8.9 13.7 13.8 15 11.4 10.6 10.8 15.5 15.3 ...
##  $ X                : num  7.2 7.4 8.8 8.8 10.9 8.5 8.3 8.5 11.1 10.9 ...
##  $ X.1              : num  8.3 8.3 11.5 11.6 13.1 10.1 9.6 9.7 13.5 13.3 ...
##  $ X.2              : int  34 34 25 24 22 28 29 29 21 21 ...
##  $ CO2_EMISSIONS    : int  191 191 265 267 301 232 221 223 311 306 ...
levels(Cars$VEHICLE.CLASS)
##  [1] ""                         "COMPACT"                 
##  [3] "FULL-SIZE"                "MID-SIZE"                
##  [5] "MINICOMPACT"              "MINIVAN"                 
##  [7] "PICKUP TRUCK - SMALL"     "PICKUP TRUCK - STANDARD" 
##  [9] "STATION WAGON - MID-SIZE" "STATION WAGON - SMALL"   
## [11] "SUBCOMPACT"               "SUV"                     
## [13] "TWO-SEATER"               "VAN - CARGO"             
## [15] "VAN - PASSENGER"
levels(Cars$CYLINDERS)
## [1] "3"  "4"  "5"  "6"  "8"  "10" "12"
levels(Cars$FUEL)
## [1] ""  "D" "E" "N" "X" "Z"

Step 4.d - Actions for chnaging variables within a dataframe to factors

  • Applied the same change type process as was undertaken in 4.c, however stated the type as factor to convert the variables to factors.
  • Following the conversion to factors it was possible to check the associated levels of each of the factors using the levels function.
### Step 4.e - Rename the levels for the Fuel variable

levels(Cars$FUEL) <- c("NA","Diesel", "E10", "N91", "N95", "N98")

#### Check

levels(Cars$FUEL)
## [1] "NA"     "Diesel" "E10"    "N91"    "N95"    "N98"

Step 4.e - Actions for renaming levels

  • Identify the levels which need to be changed and then write the transform code and state the revised names
  • Check that the names changed as desired
### Step 4.f - Check the variable names

colnames(Cars)
##  [1] "MODEL"             "MAKE"              "MODEL.1"          
##  [4] "VEHICLE.CLASS"     "ENGINE_SIZE"       "CYLINDERS"        
##  [7] "TRANSMISSION"      "FUEL"              "FUEL_CONSUMPTION."
## [10] "X"                 "X.1"               "X.2"              
## [13] "CO2_EMISSIONS"
### Desired to change the name of the following variables:
#### 1. MODEL to YEAR
#### 2. MODEL.1 to MODEL

Cars <- Cars %>% 
  rename(
    YEAR = MODEL,
    MODEL = MODEL.1)

#### Check

colnames(Cars) 
##  [1] "YEAR"              "MAKE"              "MODEL"            
##  [4] "VEHICLE.CLASS"     "ENGINE_SIZE"       "CYLINDERS"        
##  [7] "TRANSMISSION"      "FUEL"              "FUEL_CONSUMPTION."
## [10] "X"                 "X.1"               "X.2"              
## [13] "CO2_EMISSIONS"

Step 4.f - Actions to check variable names

  • Use the rename transform function an state the variable which need to be renamed in the code
  • Check the columns to ensure the name changes align.

Step 5 - Tidy data

Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format. If the data is in a tidy format, you will be expected to explain why the data is originally ‘tidy’.

Clean to the data so that the data conforms with tidy data principles

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. value must have its own cell.
Cars ## Review

Based upon the review of the Cars2 Data Frame it can be concluded that the data conforms with all 3 data tidy principles.

  1. Each variable is correctly assigned to a column with data separated correctly - The columns have been distinguished correctly by variable with headers assigned - This would be as a result of the CSV format used for upload - If a HTML scrape of text file was used there may need to be more separating of data strings o clean they data

  2. There is an individual row for each observation - Each row can be distinctly identified with no data concatenating

  3. Each value is correctly assigned to a cell - there is no data which has been uploaded as a string and assigned to the incorrect cells

Other considerations - Based upon review of the data the following areas for data frame alteration have been identified;

  1. Remove unknown and unnecessary data columns (X, X.1, X.2)
  2. Remove observation rows which have incomplete/ blank data cells.
#### 1. Create a  new dataframe with the unnecessary columns removed

Cars2 <- Cars[, c(1:9, 13)]

View(Cars2) ## Check

#### 2. Omit blank data to enable clean grouping and analysis 

Cars3 <- na.omit(Cars2)
dim(Cars3) ## Dataframe has now reduced in size 
## [1] 642  10
Cars3 ## Check

Step 5 - Actions to tidy the data set

  • Create a new dataframe with columns removed by selecting the columns that are set to be retained form the original Cars data set
  • Omit any blank data cells to not cause for data errors using the na.omit function and creating a new dataframe called Cars3 to store this
  • Check the revised dataframe dimensions.

Step 6 - Summary statistics

Provide summary statistics (mean, median, minimum, maximum, standard deviation) of numeric variables grouped by one of the qualitative (categorical) variable. For example, if your categorical variable is age groups and quantitative variable is income, provide summary statistics of income grouped by the age groups.

### 1. MAKE with CO2_EMISSIONS Summary - Assess what the CO2 emission summary is by MAKE of car.

MAKE_CO2 <- Cars3 %>%
  group_by(MAKE) %>%
  summarise(Mean = mean(CO2_EMISSIONS),
            Median = median(CO2_EMISSIONS),
            Min = min(CO2_EMISSIONS),
            Max = max(CO2_EMISSIONS),
            SD = sd(CO2_EMISSIONS))

### 2. MAKE with FUEL_CONSUMPTION Summary - Assess what the Fuel Consumption summary is by MAKE of car.

MAKE_FUEL <- Cars3 %>%
  group_by(MAKE) %>%
  summarise(Mean = mean(FUEL_CONSUMPTION.),
            Median = median(FUEL_CONSUMPTION.),
            Min = min(FUEL_CONSUMPTION.),
            Max = max(FUEL_CONSUMPTION.),
            SD = sd(FUEL_CONSUMPTION.))

### 3. VEHICLE.CLASS with CO2_EMISSIONS Summary - Assess what the CO2 emission summary is by CLASS of car.

CLASS_CO2 <- Cars3 %>%
  group_by(VEHICLE.CLASS) %>%
  summarise(Mean = mean(CO2_EMISSIONS),
            Median = median(CO2_EMISSIONS),
            Min = min(CO2_EMISSIONS),
            Max = max(CO2_EMISSIONS),
            SD = sd(CO2_EMISSIONS))

### 4. VEHICLE.CLASS with FUEL_CONSUMPTION Summary - Assess what the Fuel Consumption summary is by CLASS of car.

CLASS_FUEL <- Cars3 %>%
  group_by(VEHICLE.CLASS) %>%
  summarise(Mean = mean(FUEL_CONSUMPTION.),
            Median = median(FUEL_CONSUMPTION.),
            Min = min(FUEL_CONSUMPTION.),
            Max = max(FUEL_CONSUMPTION.),
            SD = sd(FUEL_CONSUMPTION.))

Step 6 - Actions to create summary statistics

  • Create new data tables for the relevant summaries using the Cars3 Dataframe as the base, then the group_by function used to summarize the relevant data by mean, median, min, max and SD. MAKE and CLASS were used as the category groups, while FUEL consumption and CO2 emissions were used as the variable to create the summary statistics to review.

View all the newly created summary statistic tables

MAKE_CO2
MAKE_FUEL
CLASS_CO2
CLASS_FUEL

Step 7 - Create a list

Create a list that contains a numeric value for each response to the categorical variable. Typically, they are numbered from 1-n.

### Assign the Numeric Value to MAKE categories

### Generate a list of the MAKE categories from one of the previous compressed lists (MAKE_CO2)

ListMAKE <- list(MAKE_CO2$MAKE)

### Generate a standard numeric list for assignment to the categories
ListNum <- list(1:34)

### Convert the lists to a set of values

MAKE <- unlist(ListMAKE)
NUM <- unlist(ListNum)

### Use the values to create a new dataframe

MAKEDF <- data.frame(MAKE, NUM)

View(MAKEDF) ### Check

Step 7 - Actions to create a list

  • Assign the Numeric Value to MAKE categories
  • Generate a list of the MAKE categories from one of the previous compressed lists (MAKE_CO2)
  • Convert the lists to a set of values
  • Use the values to create a new dataframe

Step 8 - Join the list

Join this list on using a join of your choice. Remember that this has to keep the numeric variable, as well as matching to your categorical variable.

### Merge the two data frames (Cars3 and MAKEDF) to create a new Dataframe (Cars4) which now contains a number assigned to each MAKE (the categorical variable)

Cars4 <- merge(Cars3, MAKEDF, 'by' = 'MAKE')

View(Cars4) ## Check

Step 8 - Actions to join the list

  • Assign the Numeric Value to MAKE categories to allow for merge commonality
  • Use the merge function using MAKE as the common variable

Step 9 - Subsetting I

Subset the data frame using first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure.

### Use the filter function to extract the top 10 rows of the Cars4 Dataframe.

CarsSub <- Cars4 %>% filter(row(Cars4) > 0 & row(Cars4) < 11)

### Convert the data frame to a matrix using the data.matrix function 

CarsMat <- data.matrix(CarsSub, rownames.force = NA)

View(CarsMat) ### Check
str(CarsMat) ### Check
##  num [1:10, 1:11] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:11] "MAKE" "YEAR" "MODEL" "VEHICLE.CLASS" ...
### This function has converted and assigned all non integer/ numeric values a numeric value. The variable headers have been retained as character.

Step 9 - Actions to subset (part 1)

  • Create a new dataframe with the subset content using the filter function and identifying the rows for retention in the subset dataframe.
  • Use the data.matrix function to convert the dataframe to a matrix - By converting to a matrix the previous dataframe has been changed so that all non integer/ numeric values now become numeric value. The variable headers have been retained as character.

Step 10 - Subsetting II

Subset the data frame including only first and the last variable in the data set and save it as an R object file (.RData).

### Use the Cars3 dataframe to create the subset - Use the header extract function are name the columns which are to be subset and put in the number of rows desired for the dataframe. 

ColSub1 <- head(Cars3[,c("MAKE","CO2_EMISSIONS")],642)

### Create a Rdata file of the dataframe - Use the save function and identify the dataframe and assign a file name. 

save(ColSub1, file = "ColSub1.Rdata")

Step 10 - Actions to subset (part 2)

  • Create a new dataframe with the subset of columns by identifying the columns for retention and the number of rows which are required within the subset (in this instance all rows were retained, while only 2 columns were subset).
  • The file was saved using the save function by stating the dataframe and allocating it a name.

Step 11 - Create a new Data Frame

Create a data frame with 2 variables. Your data frame has to contain one integer variable and one ordinal variable.

## Create the variables
### Ordinal variable age groups
AgeGroup <- c('0-20', '21-30', '31-40', '41-50', '51-60', '61-70', '70+')
### Integer variable population
ID <- c(1:7)

NewDF <- data.frame(AgeGroup, ID)

View(NewDF) ## Check
str(NewDF) ## Check
## 'data.frame':    7 obs. of  2 variables:
##  $ AgeGroup: chr  "0-20" "21-30" "31-40" "41-50" ...
##  $ ID      : int  1 2 3 4 5 6 7
## Convert Age to Factor
NewDF$AgeGroup <- factor(NewDF$AgeGroup)

levels(NewDF$AgeGroup) ## Check levels
## [1] "0-20"  "21-30" "31-40" "41-50" "51-60" "61-70" "70+"
str(NewDF) ## Check
## 'data.frame':    7 obs. of  2 variables:
##  $ AgeGroup: Factor w/ 7 levels "0-20","21-30",..: 1 2 3 4 5 6 7
##  $ ID      : int  1 2 3 4 5 6 7
### Create a new numeric variable
Pop <- c(20000, 75000, 100000, 110000, 130000, 80000, 50000)

### Combine the NewDF and Pop vector - use the Cbid function and write in the NewDF name and Pop vector name
Combined <- cbind(NewDF, Pop)

View(Combined) ##Check

Step 11 - Actions to create a new dataframe

  • Create two new values in the form of “AgeGroup” and “ID”
  • Create a dataframe with these two value sets combining them so that the “AgeGroup” gets assigned an “ID”
  • Convert “AgeGroup” to a factor and check the levels
  • Create a new value set for the numeric values called “PoP”
  • Combine the “NewDF” previously created (AgeGroup and ID) and the “PoP” value set. This is done using the cbind function and stating the dataframe and value set.
  • Check that the new combined dataframe has been created as desired.

Step 12 - Create another Data Frame

Create another data frame with a common variable to the dataset created in step 11.

## Create 2 new vectors (Have ID2 as the same as ID from previous dataframe to allow for merge)
ID2 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,6,6,6,6,6,6,7,7,7,7)
Age <- c(12,13,15,12,19,21,24,25,26,25,32,39,33,35,44,41,47,49,55,51,66,65,67,69,67,63,74,77,71,72)

## Create new dataframe from new variable vectors
NewDF2 <- data.frame(Age, ID2)

## Rename the ID2 variable to ID to allow for merge based upon naming convention.
NewDF2 <- NewDF2 %>% 
  rename(
    ID = ID2)

## Merge the two dataframe using ID as the reference. 
NewDF3 <- merge(NewDF, NewDF2, 'by' = 'ID')

View(NewDF3) ##Check

Step 12 - Actions to create a new dataframe and merge with a common variable

  • Create two new values in the form of “ID2” and “Age”
  • Create a dataframe with these two value sets combining them so that the “Age” gets assigned to “ID2” - Note that the “ID2” values are consistent with those used from ID in the dataframe previously created.
  • Rename the “ID2” variable to “ID” in the “NewDF2” dataframe so it is consistent with the naming convention of “ID” per the “ID” variable in the “DFNew” dataframe, to allow for the merge to occur.
  • Merge the two dataframes (DFNew and DFNew2) to create a new dataframe with all variable contained (DFNew3) - this merge is done by common variable “ID”.
  • Check the merge has occurred as desired.

All Steps Complete

Thank you for revewing my project - By Jake Harry - s3875020