Lessons learned in Tasks 1 and 2

Packages needed

RMySQL, dplyr, ggplot2, lubridate, plotly, ggfortify, forecast

Task 1

SQL in R

#Obtaining data using SQL
library(RMySQL)
library(dplyr)
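#Connecting to the database (the credentials below are placeholders, not the ones used in the task)
#con <- dbConnect(MySQL(), user = "...", password = "...", dbname = "...", host = "...")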

#Listing the tables in our database
dbListTables(con)
## [1] "iris"    "yr_2006" "yr_2007" "yr_2008" "yr_2009" "yr_2010"
#Listing the attributes within a table
dbListFields(con,'iris')
## [1] "id"            "SepalLengthCm" "SepalWidthCm"  "PetalLengthCm"
## [5] "PetalWidthCm"  "Species"
#Use an asterisk to select all attributes for download
irisALL <- dbGetQuery(con, "SELECT * FROM iris")

# Use attribute names to specify specific attributes for download
irisSELECT <- dbGetQuery(con, "SELECT SepalLengthCm, SepalWidthCm FROM iris")

I learned how to select specific features from a dataset with SQL, as well as how to join tables on a shared attribute.
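The join itself is not shown in the chunk above; the sketch below only illustrates the idea with a SQL INNER JOIN. The table and column names (table_a, table_b, id, value_a, value_b) are hypothetical.

#Joining two hypothetical tables on a shared id column
#joined <- dbGetQuery(con, "SELECT a.id, a.value_a, b.value_b
#                           FROM table_a a INNER JOIN table_b b ON a.id = b.id")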

Working with dates

#Creating a new column with Dates up to second and including the time zone

#Combining date and time
yr_all <- cbind(yr_all,paste(yr_all$Date,yr_all$Time),
                stringsAsFactors=FALSE)

#renaming the column
colnames(yr_all)[6] <-"DateTime"

#moving the attribute
yr_all <- yr_all[,c(ncol(yr_all), 1:(ncol(yr_all)-1))]

#Converting DateTime from character to POSIXct
yr_all$DateTime <- as.POSIXct(yr_all$DateTime,
                              "%d/%m/%Y %H:%M:%S",
                              tz = "UTC")

# Pericles' suggestion

#Combining date and time
#yr_all$DateTime <- paste(yr_all$Date, yr_all$Time)

#Converting DateTime from character to POSIXct
#yr_all$DateTime <- as.POSIXct(yr_all$DateTime,
#                              "%d/%m/%Y %H:%M:%S",
#                              tz = "UTC")

#Adding the time zone
attr(yr_all$DateTime, "tzone") <- "Europe/Paris"

This is how I learned to combine Date and Time into a single column, move a column within the dataset, and apply a time zone to the new DateTime attribute.

Lubridate

#Learning how to use the package lubridate
library(lubridate)
#Extracting attributes from the Date time and making them new variables
yr_all$year <- year(yr_all$DateTime)
yr_all$quarter_db <- quarter(yr_all$DateTime)
yr_all$month_db <- month(yr_all$DateTime)
yr_all$week_db <- week(yr_all$DateTime)
yr_all$day_db <- day(yr_all$DateTime)
yr_all$weekday_db <- weekdays(yr_all$DateTime)
yr_all$hour_db <- hour(yr_all$DateTime)
yr_all$minute_db <- minute(yr_all$DateTime)

I learned how to use the lubridate package to extract new columns, from year down to minute, out of the DateTime attribute created earlier.

Dplyr

# Using dplyr to filter, group and summarise data by mean, sum, median, min and max

#For this example we will use sum, but it can be changed to be median or others
yr_all %>% 
  filter(Sub_metering_2 > 1)%>% 
  group_by(year) %>% 
  summarise(sum_sub2 = sum(Sub_metering_2))
## # A tibble: 5 x 2
##    year sum_sub2
##   <dbl>    <dbl>
## 1  2006    45413
## 2  2007   763576
## 3  2008   577865
## 4  2009   498820
## 5  2010   397133

dplyr is a very useful package. In this task I learned how to join tables, filter and group data, and summarise it by sum, mean, median, min and max. Other features I learned were selecting specific features of interest from a dataset, filtering out irrelevant data to keep only the observations of interest, and mutating a dataset by adding new features (see the sketch below).
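Below is a minimal sketch of the select, filter and mutate verbs mentioned above, using the yr_all data and the three sub-metering columns that appear elsewhere in this report. The join line uses hypothetical table and key names, purely as an illustration.

#Select features of interest, keep only 2008, and add a combined sub-metering feature
yr_small <- yr_all %>%
  select(DateTime, year, Sub_metering_1, Sub_metering_2, Sub_metering_3) %>%
  filter(year == 2008) %>%
  mutate(total_sub = Sub_metering_1 + Sub_metering_2 + Sub_metering_3)

#A dplyr join follows the same idea (table and key names here are hypothetical)
#joined <- inner_join(table_a, table_b, by = "id")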

Task 2

#Code for the graphs
full_year <- yr_all %>% 
  filter(between(year, 2007,2009))

houseDay <- filter(full_year, year == 2008 & month_db == 1 & day_db == 9)

### Visualisation

#Subsetting dates for plotting
houseWeek <- filter(full_year, year == 2008 & week_db == 2)
plot(houseWeek$Sub_metering_1)

I learned how to filter the data by date so that only specific dates are plotted.

Plotly

#Creating good looking graphs with Plotly package
library(plotly)
### First graph
plot_ly(houseDay, x = ~DateTime, y = ~Sub_metering_1, name = 'Kitchen', type = 'scatter', mode = 'lines') %>%
 add_trace(y = ~Sub_metering_2, name = 'Laundry Room', mode = 'lines') %>%
 add_trace(y = ~Sub_metering_3, name = 'Water Heater & AC', mode = 'lines') %>%
 layout(title = "Power Consumption January 9th, 2008",
 xaxis = list(title = "Time"),
 yaxis = list(title = "Power (watt-hours)"))
#Reducing granularity to create a better graph, in this case from one observation per minute to one observation every 10 minutes
houseDay10 <- filter(full_year, year == 2008 & month_db == 1 & day_db == 9 &
                       minute_db %in% c(0, 10, 20, 30, 40, 50))
### Improved graph.
plot_ly(houseDay10, x = ~DateTime, y = ~Sub_metering_1, name = 'Kitchen', type = 'scatter', mode = 'lines') %>%
 add_trace(y = ~Sub_metering_2, name = 'Laundry Room', mode = 'lines') %>%
 add_trace(y = ~Sub_metering_3, name = 'Water Heater & AC', mode = 'lines') %>%
 layout(title = "Power Consumption January 9th, 2008",
 xaxis = list(title = "Time"),
 yaxis = list(title = "Power (watt-hours)"))

Plotly is a very useful package I learned to use; it makes it easy to create polished, interactive graphs.

Time series

library(ggplot2)
library(ggfortify)


#Filtering the data
house070809weekly <- filter(full_year, week_db == 2 & hour_db == 20 & minute_db == 1)

tsSM3_070809weekly <- ts(house070809weekly$Sub_metering_3, frequency=7, start=c(2007,1))

#Plot of the Time series
autoplot(tsSM3_070809weekly)

I learned how to create a time series object, which can later be used for forecasting or visualisation.

Forecast

#Forecasting a time series
library(forecast)

#Fitting a linear model to the time series with trend and seasonal components
fitSM3 <- tslm(tsSM3_070809weekly ~ trend + season) 
#Summary of the fitted model
summary(fitSM3)
## 
## Call:
## tslm(formula = tsSM3_070809weekly ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.667  -5.667   1.429   7.095  10.667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.7483     6.1471   1.586    0.137
## trend        -0.5102     0.3593  -1.420    0.179
## season2       6.5102     7.6926   0.846    0.413
## season3       6.3537     7.7178   0.823    0.425
## season4       7.5306     7.7595   0.971    0.350
## season5       8.0408     7.8175   1.029    0.322
## season6       2.2177     7.8915   0.281    0.783
## season7       2.7279     7.9810   0.342    0.738
## 
## Residual standard error: 9.411 on 13 degrees of freedom
## Multiple R-squared:  0.2424, Adjusted R-squared:  -0.1656 
## F-statistic: 0.5942 on 7 and 13 DF,  p-value: 0.7503
# Create the forecast for sub-meter 3. Forecast ahead 20 time periods 
forecastfitSM3 <- forecast(fitSM3, h=20)

### Plotting the forecast; the blue shaded areas show the prediction intervals around the forecast
plot(forecastfitSM3)

# Create a forecast with confidence levels 80 and 90
forecastfitSM3c <- forecast(fitSM3, h=20, level=c(80,90))

### Plotting the forecast with a limited y-axis and axis labels
plot(forecastfitSM3c, ylim = c(0, 20), ylab= "Watt-Hours", xlab="Time")

Forecasting is a very useful skill I learned: it allows me to predict future values of a time series for dates ahead of the observed data.

Decomposing

#Decomposing the time series

# Decompose time series into trend, seasonal and remainder
components070809SM3weekly <- decompose(tsSM3_070809weekly)
# Plot decomposed sub-meter 3 
plot(components070809SM3weekly)

# Check summary statistics for decomposed sub-meter 3 
summary(components070809SM3weekly)
##          Length Class  Mode     
## x        21     ts     numeric  
## seasonal 21     ts     numeric  
## trend    21     ts     numeric  
## random   21     ts     numeric  
## figure    7     -none- numeric  
## type      1     -none- character

Decomposing is useful because it separates the seasonal, trend and random components of our data, which makes it easier to understand its ups and downs.

Removing decompositions

#Removing components from a decomposition

#General pattern: adjusted_ts <- timeseries_object - decomposed_object$component

#For example, removing the random component from sub-meter 3
tsSM3_070809NoRandom <- tsSM3_070809weekly - components070809SM3weekly$random
autoplot(tsSM3_070809NoRandom)

Removing components from a decomposition is a useful process: it can improve the accuracy of our forecast, and we can remove the random component since it can be driven by outliers.

Holt-Winters forecasting

#Holt-Winters forecasting

# Seasonal adjusting sub-meter 3 by subtracting the seasonal component & plot
tsSM3_070809Adjusted <- tsSM3_070809weekly - components070809SM3weekly$seasonal
#Plot
autoplot(tsSM3_070809Adjusted)

#plot with the decomposed again
plot(decompose(tsSM3_070809Adjusted))

# Holt Winters Exponential Smoothing & Plot
tsSM3_HW070809 <- HoltWinters(tsSM3_070809Adjusted, beta=FALSE, gamma=FALSE)
plot(tsSM3_HW070809, ylim = c(0, 25))

# HoltWinters forecast & plot
tsSM3_HW070809for <- forecast(tsSM3_HW070809, h=25)
plot(tsSM3_HW070809for, ylim = c(0, 20), ylab= "Watt-Hours", xlab="Time - Sub-meter 3")

Knowing Holt-Winters forecasting is important because it is a way to model and predict the behavior of a sequence of values over time, that is, a time series. Holt-Winters is one of the most popular forecasting techniques for time series.

Extra

Random forest algorithm

Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest outputs a class prediction, and the class with the most votes becomes the model's prediction. Random forest uses bagging and feature randomness when building each individual tree, in order to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any single tree.
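A minimal sketch of what this could look like in R, using the built-in iris data and the randomForest package. This package was not part of Tasks 1 and 2, so the code is only an illustration of the idea.

#Fitting a random forest classifier on the iris data
library(randomForest)
set.seed(123)
rf_model <- randomForest(Species ~ ., data = iris, ntree = 500)
#Each tree votes for a class; the majority vote becomes the prediction
predict(rf_model, iris[1:5, ])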

K-nn algorithm

k-NN is an approach to data classification that estimates how likely a data point is to belong to one group or another based on the groups of the data points nearest to it. In other words, the k-nearest-neighbours algorithm determines the class of a data point by looking at the classes of the k data points around it.
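A minimal sketch with the knn() function from the class package (again an assumption, since this package was not used in the tasks):

#Classifying iris flowers with k-nearest neighbours
library(class)
set.seed(123)
train_idx <- sample(nrow(iris), 100)
#Each test point gets the majority class of its 5 nearest training points
knn_pred <- knn(train = iris[train_idx, 1:4],
                test  = iris[-train_idx, 1:4],
                cl    = iris$Species[train_idx],
                k     = 5)
table(knn_pred, iris$Species[-train_idx])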

Gradient boosting tree algorithm

The gradient boosting algorithm (gbm) can be most easily explained by first introducing the AdaBoost algorithm. AdaBoost begins by training a decision tree in which each observation is assigned an equal weight. After evaluating the first tree, we increase the weights of the observations that are difficult to classify and lower the weights of those that are easy to classify. The second tree is then grown on this weighted data, with the idea of improving upon the predictions of the first tree. Our new model is therefore Tree 1 + Tree 2. We then compute the classification error of this two-tree ensemble and grow a third tree to predict the revised residuals. We repeat this process for a specified number of iterations; subsequent trees help us classify observations that were not well classified by the previous trees. The prediction of the final ensemble model is therefore the weighted sum of the predictions made by the individual trees.
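A minimal sketch with the gbm package (also hypothetical here, as it was not used in the tasks), fitting a small boosted regression model on the iris data:

#Gradient boosting: each new tree is fit to the residuals of the current ensemble
library(gbm)
set.seed(123)
gbm_model <- gbm(Sepal.Length ~ ., data = iris,
                 distribution = "gaussian",
                 n.trees = 100, interaction.depth = 2, shrinkage = 0.1)
#The prediction is the weighted sum of the predictions of all trees
predict(gbm_model, iris[1:5, ], n.trees = 100)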

Support Vector Machine algorithm

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly separates the data points of the different classes.
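A minimal sketch with the svm() function from the e1071 package (an assumption, since this package was not used in the tasks):

#Fitting a support vector machine that separates the iris classes
library(e1071)
svm_model <- svm(Species ~ ., data = iris, kernel = "linear")
#Predicting the class of the first five flowers
predict(svm_model, iris[1:5, ])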