Introduction

Project 2 requires to create 3 tidy datasets by either using the untidy datasets from week5 discussion or choose any of our own dataset. It requires the data set to be wide and untidy so that we read the data from a CSV and transform and tidy the datasets. we have used 3 of the datasets from the discussion and tried to transform and tidy the data. We have analysed the data over plots using the ggplot library.

Dataset2 (NYC per-capita fuel consumption and CO2 emissions) by Peter Fernandes

Analysis :

         Calculate the average of energy consumption and CO2 emissions.

Loading of the required packages

knitr::opts_chunk$set(eval = TRUE, results = FALSE, 
                      fig.show = "hide", message = FALSE)
if (!require("tidyr")) install.packages('tidyr')

## Loading required package: tidyr

if (!require("dplyr")) install.packages('dplyr')

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

if (!require("stringr")) install.packages('stringr')

## Loading required package: stringr

if (!require("DT")) install.packages('DT')

## Loading required package: DT

if (!require("ggplot2")) install.packages('ggplot2')

## Loading required package: ggplot2

Reading untidy dataset

csv <- read.csv("https://raw.githubusercontent.com/petferns/607-Project2/main/energy.csv", na.strings = c("", "NA"))
head(csv)

Renaming of the columns

names(csv)[1] <- "Category"
names(csv)[2] <- "Building and industry consumption"
names(csv)[3] <- "Building and industry Emission"
names(csv)[4] <- "Transportation consumption"
names(csv)[5] <- "Transportation Emission"
head(csv)

Exclude non required rows

csv <- csv[-c(1),]
csv

Add Counties column to make it wide

csv$Counties <-  gsub(".*\\(+(.*)+\\)","\\1",csv$Category)
csv

Convert into large from wide dataset based on column 2 to 5

csv1<- csv %>% gather("Type", "Count", 2:5)
csv1

Create a extra column to identify if Consumption or Emission

csv1$Utils <- sub("^.+ ", "", csv1$Type)

csv1$Division <- str_extract(csv1$Type, "[^ ]+")
csv1$Division

Analyze and create plot

We see the consumption and emission average across different segments using the below plot

csv1$Count <- as.numeric(csv1$Count)
overall_dt <- csv1 %>% group_by(Category, Type) %>% mutate(overall_avg = mean(`Count`))
overall_dt
ggplot(overall_dt ,aes(x=  Type, y=overall_avg, fill= Type)) +
    geom_bar(stat="identity", position=position_dodge()) + theme(axis.text.x = element_text(angle = 90))

Analyse and plot the required result - Overal Consumption and Emission

We see from the analysis and plotting that the overall consumption is much higher than the Emissions across NY.

overall_dt <- csv1 %>% group_by(Utils) %>% mutate(overall_avg = mean(`Count`))
overall_dt
ggplot(overall_dt ,aes(x=  Utils, y=overall_avg, fill= Utils)) +
    geom_bar(stat="identity", position=position_dodge()) + theme(axis.text.x = element_text(angle = 90))

Porject 2 - Data Transformation- Dataset2

10/2/2020