Assignment 1 - Locate open data from the web, import it into R, inspect the data using Rfunctions, apply tidy data principles on the data and manipulate the data appropriately.
Install and load the necessary packages to reproduce the report here:
# Load necessary packages required to reproduce the report
library(readr) # Useful for importing data
library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
library(rvest) # Useful for scraping HTML data
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(knitr) # Useful for creating nice tables
library(tidyverse) # Used for data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v dplyr 1.0.3
## v tibble 3.0.4 v stringr 1.4.0
## v tidyr 1.1.2 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x rvest::guess_encoding() masks readr::guess_encoding()
## x dplyr::lag() masks stats::lag()
## x purrr::pluck() masks rvest::pluck()
library(dplyr) # Useful for data manipulation
library(ggplot2) # Useful for data plotting
library(openxlsx) # Useful for importing
library(tidyr) # Useful for data manipulation
library(magrittr) # Useful for data manipulation and structure
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
Data Set Downloaded from the Kaggle in CSV format
Site Called - Emissions by Cars- Carbon Dioxide Emission by Cars - Data Renamed to “CarsCO2.csv” when saved to local location from original file name “Sample.csv”
This data set shows the differnt fuel consumption and CO2 emission levels for a variety of cars.
Import the Data - Saved the file into the project directory as a CSV to enable direct read.table function to be utilised
Cars <- read.table("CarsCO2.csv", header = TRUE, stringsAsFactor = FALSE, sep = ",", fill = TRUE)
Provide a clear description of the data and its source (i.e. URL of the web site). Provide variable descriptions.
Data sourced from - https://www.kaggle.com/prathamtripathi/co2-emissions-by-cars-in-canada
This data set contains various information about a set of cars that were manufactured with set of factory parameters like cylinder size, number of cylinders, fuel consumption, Carbon dioxide emissions
This data set allows for brand comparison for both fuel consumption and CO2 emissions.
### Step 4.a - Check the dimensions of the dataframe
nrow(Cars)
## [1] 679
ncol(Cars)
## [1] 13
dim(Cars)
## [1] 679 13
### Step 4.b - Summarise the types of variables and review for appropriateness
str(Cars)
## 'data.frame': 679 obs. of 13 variables:
## $ MODEL : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ MAKE : chr "ACURA" "ACURA" "ACURA" "ACURA" ...
## $ MODEL.1 : chr "1.7EL" "1.7EL" "3.2CL" "3.2TL" ...
## $ VEHICLE.CLASS : chr "COMPACT" "COMPACT" "COMPACT" "MID-SIZE" ...
## $ ENGINE_SIZE : num 1.7 1.7 3.2 3.2 3.5 1.8 1.8 1.8 3.5 3 ...
## $ CYLINDERS : int 4 4 6 6 6 4 4 4 6 6 ...
## $ TRANSMISSION : chr "A4" "M5" "AS5" "AS5" ...
## $ FUEL : chr "X" "X" "Z" "Z" ...
## $ FUEL_CONSUMPTION.: num 9.3 8.9 13.7 13.8 15 11.4 10.6 10.8 15.5 15.3 ...
## $ X : num 7.2 7.4 8.8 8.8 10.9 8.5 8.3 8.5 11.1 10.9 ...
## $ X.1 : num 8.3 8.3 11.5 11.6 13.1 10.1 9.6 9.7 13.5 13.3 ...
## $ X.2 : int 34 34 25 24 22 28 29 29 21 21 ...
## $ CO2_EMISSIONS : int 191 191 265 267 301 232 221 223 311 306 ...
## Step 4.c - Changing variable type
### Checking this data frame for type information they all variables have been assigned the correct type.
### To illustrate how to change type I will use one of the variables that is set to be deleted (X.2)
### The change of type test conducted to to be
#### 1. Convert from Integer to Character
#### 2. Convert from Character to Number
#### 3. Convert from Number to Integer (original type)
Cars$X.2 <-as.character(Cars$X.2)
#### Check change 1
str(Cars)
## 'data.frame': 679 obs. of 13 variables:
## $ MODEL : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ MAKE : chr "ACURA" "ACURA" "ACURA" "ACURA" ...
## $ MODEL.1 : chr "1.7EL" "1.7EL" "3.2CL" "3.2TL" ...
## $ VEHICLE.CLASS : chr "COMPACT" "COMPACT" "COMPACT" "MID-SIZE" ...
## $ ENGINE_SIZE : num 1.7 1.7 3.2 3.2 3.5 1.8 1.8 1.8 3.5 3 ...
## $ CYLINDERS : int 4 4 6 6 6 4 4 4 6 6 ...
## $ TRANSMISSION : chr "A4" "M5" "AS5" "AS5" ...
## $ FUEL : chr "X" "X" "Z" "Z" ...
## $ FUEL_CONSUMPTION.: num 9.3 8.9 13.7 13.8 15 11.4 10.6 10.8 15.5 15.3 ...
## $ X : num 7.2 7.4 8.8 8.8 10.9 8.5 8.3 8.5 11.1 10.9 ...
## $ X.1 : num 8.3 8.3 11.5 11.6 13.1 10.1 9.6 9.7 13.5 13.3 ...
## $ X.2 : chr "34" "34" "25" "24" ...
## $ CO2_EMISSIONS : int 191 191 265 267 301 232 221 223 311 306 ...
Cars$X.2 <-as.numeric(Cars$X.2)
Cars$X.2 <-as.integer(Cars$X.2)
### Step 4.d - Change variable types to factors
### When reviewing the data frame for factors I have identified 3 variables which could be converted to factors.
#### 1. VEHICLE.CLASS
#### 2. CYLINDERS
#### 3. FUEL
Cars$VEHICLE.CLASS <- factor(Cars$VEHICLE.CLASS)
Cars$CYLINDERS <- factor(Cars$CYLINDERS)
Cars$FUEL <- factor(Cars$FUEL)
#### Check variable type and the levels
str(Cars)
## 'data.frame': 679 obs. of 13 variables:
## $ MODEL : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ MAKE : chr "ACURA" "ACURA" "ACURA" "ACURA" ...
## $ MODEL.1 : chr "1.7EL" "1.7EL" "3.2CL" "3.2TL" ...
## $ VEHICLE.CLASS : Factor w/ 15 levels "","COMPACT","FULL-SIZE",..: 2 2 2 4 4 11 11 11 12 13 ...
## $ ENGINE_SIZE : num 1.7 1.7 3.2 3.2 3.5 1.8 1.8 1.8 3.5 3 ...
## $ CYLINDERS : Factor w/ 7 levels "3","4","5","6",..: 2 2 4 4 4 2 2 2 4 4 ...
## $ TRANSMISSION : chr "A4" "M5" "AS5" "AS5" ...
## $ FUEL : Factor w/ 6 levels "","D","E","N",..: 5 5 6 6 6 5 5 6 6 6 ...
## $ FUEL_CONSUMPTION.: num 9.3 8.9 13.7 13.8 15 11.4 10.6 10.8 15.5 15.3 ...
## $ X : num 7.2 7.4 8.8 8.8 10.9 8.5 8.3 8.5 11.1 10.9 ...
## $ X.1 : num 8.3 8.3 11.5 11.6 13.1 10.1 9.6 9.7 13.5 13.3 ...
## $ X.2 : int 34 34 25 24 22 28 29 29 21 21 ...
## $ CO2_EMISSIONS : int 191 191 265 267 301 232 221 223 311 306 ...
levels(Cars$VEHICLE.CLASS)
## [1] "" "COMPACT"
## [3] "FULL-SIZE" "MID-SIZE"
## [5] "MINICOMPACT" "MINIVAN"
## [7] "PICKUP TRUCK - SMALL" "PICKUP TRUCK - STANDARD"
## [9] "STATION WAGON - MID-SIZE" "STATION WAGON - SMALL"
## [11] "SUBCOMPACT" "SUV"
## [13] "TWO-SEATER" "VAN - CARGO"
## [15] "VAN - PASSENGER"
levels(Cars$CYLINDERS)
## [1] "3" "4" "5" "6" "8" "10" "12"
levels(Cars$FUEL)
## [1] "" "D" "E" "N" "X" "Z"
### Step 4.e - Rename the levels for the Fuel variable
levels(Cars$FUEL) <- c("NA","Diesel", "E10", "N91", "N95", "N98")
#### Check
levels(Cars$FUEL)
## [1] "NA" "Diesel" "E10" "N91" "N95" "N98"
### Step 4.f - Check the variable names
colnames(Cars)
## [1] "MODEL" "MAKE" "MODEL.1"
## [4] "VEHICLE.CLASS" "ENGINE_SIZE" "CYLINDERS"
## [7] "TRANSMISSION" "FUEL" "FUEL_CONSUMPTION."
## [10] "X" "X.1" "X.2"
## [13] "CO2_EMISSIONS"
### Desired to change the name of the following variables:
#### 1. MODEL to YEAR
#### 2. MODEL.1 to MODEL
Cars <- Cars %>%
rename(
YEAR = MODEL,
MODEL = MODEL.1)
#### Check
colnames(Cars)
## [1] "YEAR" "MAKE" "MODEL"
## [4] "VEHICLE.CLASS" "ENGINE_SIZE" "CYLINDERS"
## [7] "TRANSMISSION" "FUEL" "FUEL_CONSUMPTION."
## [10] "X" "X.1" "X.2"
## [13] "CO2_EMISSIONS"
Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format. If the data is in a tidy format, you will be expected to explain why the data is originally ‘tidy’.
Cars ## Review
Based upon the review of the Cars2 Data Frame it can be concluded that the data conforms with all 3 data tidy principles.
Each variable is correctly assigned to a column with data separated correctly - The columns have been distinguished correctly by variable with headers assigned - This would be as a result of the CSV format used for upload - If a HTML scrape of text file was used there may need to be more separating of data strings o clean they data
There is an individual row for each observation - Each row can be distinctly identified with no data concatenating
Each value is correctly assigned to a cell - there is no data which has been uploaded as a string and assigned to the incorrect cells
Other considerations - Based upon review of the data the following areas for data frame alteration have been identified;
#### 1. Create a new dataframe with the unnecessary columns removed
Cars2 <- Cars[, c(1:9, 13)]
View(Cars2) ## Check
#### 2. Omit blank data to enable clean grouping and analysis
Cars3 <- na.omit(Cars2)
dim(Cars3) ## Dataframe has now reduced in size
## [1] 642 10
Cars3 ## Check
Provide summary statistics (mean, median, minimum, maximum, standard deviation) of numeric variables grouped by one of the qualitative (categorical) variable. For example, if your categorical variable is age groups and quantitative variable is income, provide summary statistics of income grouped by the age groups.
### 1. MAKE with CO2_EMISSIONS Summary - Assess what the CO2 emission summary is by MAKE of car.
MAKE_CO2 <- Cars3 %>%
group_by(MAKE) %>%
summarise(Mean = mean(CO2_EMISSIONS),
Median = median(CO2_EMISSIONS),
Min = min(CO2_EMISSIONS),
Max = max(CO2_EMISSIONS),
SD = sd(CO2_EMISSIONS))
### 2. MAKE with FUEL_CONSUMPTION Summary - Assess what the Fuel Consumption summary is by MAKE of car.
MAKE_FUEL <- Cars3 %>%
group_by(MAKE) %>%
summarise(Mean = mean(FUEL_CONSUMPTION.),
Median = median(FUEL_CONSUMPTION.),
Min = min(FUEL_CONSUMPTION.),
Max = max(FUEL_CONSUMPTION.),
SD = sd(FUEL_CONSUMPTION.))
### 3. VEHICLE.CLASS with CO2_EMISSIONS Summary - Assess what the CO2 emission summary is by CLASS of car.
CLASS_CO2 <- Cars3 %>%
group_by(VEHICLE.CLASS) %>%
summarise(Mean = mean(CO2_EMISSIONS),
Median = median(CO2_EMISSIONS),
Min = min(CO2_EMISSIONS),
Max = max(CO2_EMISSIONS),
SD = sd(CO2_EMISSIONS))
### 4. VEHICLE.CLASS with FUEL_CONSUMPTION Summary - Assess what the Fuel Consumption summary is by CLASS of car.
CLASS_FUEL <- Cars3 %>%
group_by(VEHICLE.CLASS) %>%
summarise(Mean = mean(FUEL_CONSUMPTION.),
Median = median(FUEL_CONSUMPTION.),
Min = min(FUEL_CONSUMPTION.),
Max = max(FUEL_CONSUMPTION.),
SD = sd(FUEL_CONSUMPTION.))
MAKE_CO2
MAKE_FUEL
CLASS_CO2
CLASS_FUEL
Create a list that contains a numeric value for each response to the categorical variable. Typically, they are numbered from 1-n.
### Assign the Numeric Value to MAKE categories
### Generate a list of the MAKE categories from one of the previous compressed lists (MAKE_CO2)
ListMAKE <- list(MAKE_CO2$MAKE)
### Generate a standard numeric list for assignment to the categories
ListNum <- list(1:34)
### Convert the lists to a set of values
MAKE <- unlist(ListMAKE)
NUM <- unlist(ListNum)
### Use the values to create a new dataframe
MAKEDF <- data.frame(MAKE, NUM)
View(MAKEDF) ### Check
Join this list on using a join of your choice. Remember that this has to keep the numeric variable, as well as matching to your categorical variable.
### Merge the two data frames (Cars3 and MAKEDF) to create a new Dataframe (Cars4) which now contains a number assigned to each MAKE (the categorical variable)
Cars4 <- merge(Cars3, MAKEDF, 'by' = 'MAKE')
View(Cars4) ## Check
Subset the data frame using first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure.
### Use the filter function to extract the top 10 rows of the Cars4 Dataframe.
CarsSub <- Cars4 %>% filter(row(Cars4) > 0 & row(Cars4) < 11)
### Convert the data frame to a matrix using the data.matrix function
CarsMat <- data.matrix(CarsSub, rownames.force = NA)
View(CarsMat) ### Check
str(CarsMat) ### Check
## num [1:10, 1:11] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:11] "MAKE" "YEAR" "MODEL" "VEHICLE.CLASS" ...
### This function has converted and assigned all non integer/ numeric values a numeric value. The variable headers have been retained as character.
Subset the data frame including only first and the last variable in the data set and save it as an R object file (.RData).
### Use the Cars3 dataframe to create the subset - Use the header extract function are name the columns which are to be subset and put in the number of rows desired for the dataframe.
ColSub1 <- head(Cars3[,c("MAKE","CO2_EMISSIONS")],642)
### Create a Rdata file of the dataframe - Use the save function and identify the dataframe and assign a file name.
save(ColSub1, file = "ColSub1.Rdata")
Create a data frame with 2 variables. Your data frame has to contain one integer variable and one ordinal variable.
## Create the variables
### Ordinal variable age groups
AgeGroup <- c('0-20', '21-30', '31-40', '41-50', '51-60', '61-70', '70+')
### Integer variable population
ID <- c(1:7)
NewDF <- data.frame(AgeGroup, ID)
View(NewDF) ## Check
str(NewDF) ## Check
## 'data.frame': 7 obs. of 2 variables:
## $ AgeGroup: chr "0-20" "21-30" "31-40" "41-50" ...
## $ ID : int 1 2 3 4 5 6 7
## Convert Age to Factor
NewDF$AgeGroup <- factor(NewDF$AgeGroup)
levels(NewDF$AgeGroup) ## Check levels
## [1] "0-20" "21-30" "31-40" "41-50" "51-60" "61-70" "70+"
str(NewDF) ## Check
## 'data.frame': 7 obs. of 2 variables:
## $ AgeGroup: Factor w/ 7 levels "0-20","21-30",..: 1 2 3 4 5 6 7
## $ ID : int 1 2 3 4 5 6 7
### Create a new numeric variable
Pop <- c(20000, 75000, 100000, 110000, 130000, 80000, 50000)
### Combine the NewDF and Pop vector - use the Cbid function and write in the NewDF name and Pop vector name
Combined <- cbind(NewDF, Pop)
View(Combined) ##Check
Create another data frame with a common variable to the dataset created in step 11.
## Create 2 new vectors (Have ID2 as the same as ID from previous dataframe to allow for merge)
ID2 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,6,6,6,6,6,6,7,7,7,7)
Age <- c(12,13,15,12,19,21,24,25,26,25,32,39,33,35,44,41,47,49,55,51,66,65,67,69,67,63,74,77,71,72)
## Create new dataframe from new variable vectors
NewDF2 <- data.frame(Age, ID2)
## Rename the ID2 variable to ID to allow for merge based upon naming convention.
NewDF2 <- NewDF2 %>%
rename(
ID = ID2)
## Merge the two dataframe using ID as the reference.
NewDF3 <- merge(NewDF, NewDF2, 'by' = 'ID')
View(NewDF3) ##Check