Title: Prediction of Solar Power System Output Using Regression and Classification

Introduction

Solar power systems, also known as photovoltaic (PV) systems, are a popular choice for generating clean, renewable energy as they use solar panels to convert sunlight into electricity. These systems can be used for a variety of applications, making them an efficient and cost-effective source of electricity. In recent years, the development of solar power technology has been an active area of research and innovation, with efforts focused on improving the efficiency of solar power systems.

There are several factors that impact the efficiency of solar power systems, including the size and type of the system, the location and weather conditions, and the efficiency of the solar panels. The type and size of the solar power system are important considerations, as different systems are suited for different applications. For example, large systems may be more suitable for commercial or industrial applications, while small systems may be more suitable for residential use. The location and weather conditions also play a role in the efficiency of solar power systems, as the amount of sunlight received can vary significantly depending on the location and time of year. Finally, the efficiency of the solar panels themselves is an important factor, as more efficient panels can produce more electricity from the same amount of sunlight.

Objective

The objectives of the project are as follows:

Identify the correlations between the available features and solar power generation: By analyzing the data and determining which factors have the most significant impact on solar power generation, we can better understand how to optimize the performance of a solar power system.
Predict the output of a solar power system based on past performance: By using machine learning algorithms and historical data, we can develop a model that accurately predicts the output of a solar power system based on various input parameters. This can help optimize the performance and efficiency of the system and make informed decisions about its operation and maintenance.

Load Library

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(graphics)
library(stats)
library(reshape2) # for melt()
library(ggplot2) # for visualization
library(tidyr) # for pivot_longer()

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

Data Import & Exploration

We begin by importing the dataset and inspecting its contents.

    solar <- read.csv("solar.csv") # import the dataset
    View(solar) # inspect the dataframe table
    str(solar) # check the data types of the columns

## 'data.frame':    2970 obs. of  16 variables:
##  $ Day.of.Year                         : int  245 245 245 245 245 245 245 245 246 246 ...
##  $ Year                                : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ Month                               : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ Day                                 : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ First.Hour.of.Period                : int  1 4 7 10 13 16 19 22 1 4 ...
##  $ Is.Daylight                         : logi  FALSE FALSE TRUE TRUE TRUE TRUE ...
##  $ Distance.to.Solar.Noon              : num  0.8599 0.6285 0.3972 0.1658 0.0656 ...
##  $ Average.Temperature..Day.           : int  69 69 69 69 69 69 69 69 72 72 ...
##  $ Average.Wind.Direction..Day.        : int  28 28 28 28 28 28 28 28 29 29 ...
##  $ Average.Wind.Speed..Day.            : num  7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 6.8 6.8 ...
##  $ Sky.Cover                           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Visibility                          : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ Relative.Humidity                   : int  75 77 70 33 21 20 36 49 67 49 ...
##  $ Average.Wind.Speed..Period.         : int  8 5 0 0 3 23 15 6 6 0 ...
##  $ Average.Barometric.Pressure..Period.: num  29.8 29.9 29.9 29.9 29.9 ...
##  $ Power.Generated                     : int  0 0 5418 25477 30069 16280 515 0 0 0 ...

    names(solar) # check the names of the columns

##  [1] "Day.of.Year"                         
##  [2] "Year"                                
##  [3] "Month"                               
##  [4] "Day"                                 
##  [5] "First.Hour.of.Period"                
##  [6] "Is.Daylight"                         
##  [7] "Distance.to.Solar.Noon"              
##  [8] "Average.Temperature..Day."           
##  [9] "Average.Wind.Direction..Day."        
## [10] "Average.Wind.Speed..Day."            
## [11] "Sky.Cover"                           
## [12] "Visibility"                          
## [13] "Relative.Humidity"                   
## [14] "Average.Wind.Speed..Period."         
## [15] "Average.Barometric.Pressure..Period."
## [16] "Power.Generated"

    summary(solar) # check the statistical summary for each column

##   Day.of.Year         Year          Month             Day       
##  Min.   :  1.0   Min.   :2008   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 92.0   1st Qu.:2008   1st Qu.: 4.000   1st Qu.: 8.00  
##  Median :183.5   Median :2009   Median : 7.000   Median :16.00  
##  Mean   :183.6   Mean   :2009   Mean   : 6.531   Mean   :15.73  
##  3rd Qu.:275.0   3rd Qu.:2009   3rd Qu.:10.000   3rd Qu.:23.00  
##  Max.   :366.0   Max.   :2009   Max.   :12.000   Max.   :31.00  
##  NA's   :6       NA's   :1      NA's   :1                       
##  First.Hour.of.Period Is.Daylight     Distance.to.Solar.Noon
##  Min.   : 1.0         Mode :logical   Min.   :0.0504        
##  1st Qu.: 4.0         FALSE:1133      1st Qu.:0.2560        
##  Median :11.5         TRUE :1831      Median :0.4814        
##  Mean   :11.5         NA's :6         Mean   :0.5037        
##  3rd Qu.:19.0                         3rd Qu.:0.7387        
##  Max.   :22.0                         Max.   :1.1414        
##  NA's   :4                                                  
##  Average.Temperature..Day. Average.Wind.Direction..Day.
##  Min.   :42.0              Min.   : 1.00               
##  1st Qu.:53.0              1st Qu.:25.00               
##  Median :59.0              Median :27.00               
##  Mean   :58.5              Mean   :24.96               
##  3rd Qu.:63.0              3rd Qu.:29.00               
##  Max.   :78.0              Max.   :36.00               
##  NA's   :2                 NA's   :4                   
##  Average.Wind.Speed..Day.   Sky.Cover       Visibility     Relative.Humidity
##  Min.   : 1.1             Min.   :0.000   Min.   : 0.000   Min.   : 14.00   
##  1st Qu.: 6.6             1st Qu.:1.000   1st Qu.:10.000   1st Qu.: 65.00   
##  Median :10.0             Median :2.000   Median :10.000   Median : 77.00   
##  Mean   :10.1             Mean   :1.989   Mean   : 9.556   Mean   : 73.56   
##  3rd Qu.:13.1             3rd Qu.:3.000   3rd Qu.:10.000   3rd Qu.: 84.00   
##  Max.   :26.6             Max.   :4.000   Max.   :10.000   Max.   :100.00   
##  NA's   :3                NA's   :8       NA's   :3        NA's   :1        
##  Average.Wind.Speed..Period. Average.Barometric.Pressure..Period.
##  Min.   : 0.00               Min.   :29.48                       
##  1st Qu.: 5.00               1st Qu.:29.92                       
##  Median : 9.00               Median :30.00                       
##  Mean   :10.13               Mean   :30.02                       
##  3rd Qu.:15.00               3rd Qu.:30.11                       
##  Max.   :40.00               Max.   :30.53                       
##  NA's   :5                   NA's   :4                           
##  Power.Generated
##  Min.   :    0  
##  1st Qu.:    0  
##  Median :  385  
##  Mean   : 6959  
##  3rd Qu.:12668  
##  Max.   :36580  
##  NA's   :3

Next, we perform some data cleaning to ensure that the data is usable for analysis.

print(sum(duplicated(solar))) # check for duplicated rows

## [1] 0

# count missing values (NA) in each column
solar %>% summarise(across(everything(), ~ sum(is.na(.x))))

##   Day.of.Year Year Month Day First.Hour.of.Period Is.Daylight
## 1           6    1     1   0                    4           6
##   Distance.to.Solar.Noon Average.Temperature..Day. Average.Wind.Direction..Day.
## 1                      0                         2                            4
##   Average.Wind.Speed..Day. Sky.Cover Visibility Relative.Humidity
## 1                        3         8          3                 1
##   Average.Wind.Speed..Period. Average.Barometric.Pressure..Period.
## 1                           5                                    4
##   Power.Generated
## 1               3

# remove rows with missing values (NA)
solar <- solar[complete.cases(solar),]
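
The two cleaning steps above can be tried on a small example. The sketch below uses a made-up data frame (`toy`, with invented values) rather than the solar dataset, and base R only. In this document, the density() output further down reports 2919 observations against the 2970 rows loaded, so the step removes 51 rows.

```r
# Toy data frame standing in for `solar`; all values are made up
toy <- data.frame(
  temp  = c(69, NA, 72, 70),
  wind  = c(7.5, 6.8, NA, 5.0),
  power = c(0, 5418, 25477, NA)
)

# Missing values per column (a base R counterpart of the summarise call)
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows that have no missing value in any column
toy_clean <- toy[complete.cases(toy), ]
cat("Rows before:", nrow(toy), "- rows after:", nrow(toy_clean), "\n")
```

Dropping incomplete rows is the simplest option and costs little here because few rows are affected; imputation would be an alternative if many more values were missing.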

We can also visualize the data using boxplots to check for outliers.

# copy the data frame, since the boxplots need a reshaped (long) version
solar_bp <- solar

# pivot to long format so that one column (Var) holds the feature names
solar_bp <- solar_bp %>%
  select(Average.Temperature..Day., Average.Wind.Direction..Day., Average.Wind.Speed..Day., Average.Wind.Speed..Period., Power.Generated) %>%
  pivot_longer(cols = everything(), names_to = "Var", values_to = "Val")

# faceted boxplots with a free y-axis scale; x-axis labels are removed to avoid overlapping text
ggplot(solar_bp, aes(x = Var, y = Val)) +
  geom_boxplot() +
  facet_wrap(~Var, scales = "free_y") +
  theme(axis.text.x = element_blank())
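
A boxplot marks as outliers the points that lie more than 1.5 times the interquartile range beyond the box. As a rough sketch of how those points could be listed numerically, the example below uses a made-up vector (`wind_speed` is illustrative, not taken from the dataset) and compares `boxplot.stats()` with the rule written out by hand.

```r
# Made-up sample with one obvious extreme value
wind_speed <- c(5, 6, 7, 8, 9, 10, 11, 12, 40)

# boxplot.stats() returns the points beyond the whiskers in $out
outliers <- boxplot.stats(wind_speed)$out
print(outliers)

# The same rule written out with quartiles
q <- unname(quantile(wind_speed, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
manual <- wind_speed[wind_speed < q[1] - fence | wind_speed > q[2] + fence]
print(manual)
```

`boxplot.stats()` works with hinges, which can differ slightly from the quartiles returned by `quantile()` for some sample sizes, so the two approaches may not always agree exactly.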

Data Visualization

To understand the factors that influence solar energy generation, we can compute the correlations between the different features and the output (Power.Generated).

con_data <- c('Day.of.Year','Year','Month','Day','Distance.to.Solar.Noon','Average.Temperature..Day.','Average.Wind.Direction..Day.','Average.Wind.Speed..Day.','Visibility','Relative.Humidity','Average.Wind.Speed..Period.','Average.Barometric.Pressure..Period.','Power.Generated')
cat_data <- c('First.Hour.of.Period','Is.Daylight','Sky.Cover')
# all_of() selects the columns named in an external character vector
solar_condata <- select(solar, all_of(con_data))
solar_catdata <- select(solar, all_of(cat_data))

For EDA purposes, we can plot the continuous data using histograms and density plots.

hist(solar_condata$Power.Generated, main = "Power.Generated", xlab = "Power.Generated")

density(solar_condata$Power.Generated) # plot labels such as main and xlab belong in plot(), not density()

## 
## Call:
##  density.default(x = solar_condata$Power.Generated)
## 
## Data: solar_condata$Power.Generated (2919 obs.); Bandwidth 'bw' = 1733
## 
##        x               y            
##  Min.   :-5198   Min.   :3.820e-09  
##  1st Qu.: 6546   1st Qu.:8.340e-06  
##  Median :18290   Median :1.090e-05  
##  Mean   :18290   Mean   :2.125e-05  
##  3rd Qu.:30034   3rd Qu.:1.478e-05  
##  Max.   :41778   Max.   :1.358e-04

d <- density(solar_condata$Power.Generated)
plot(d, main = "Power.Generated")
polygon(d, col = "red", border = "blue")

# plot the distribution of each continuous feature
for (col in names(solar_condata)) {
    dist_plot <- ggplot(solar_condata, aes(x = .data[[col]])) +
        geom_histogram(bins = 30, fill = "purple") +
        ggtitle("Distribution of Features") +
        theme(plot.title = element_text(hjust = 0.5)) +
        labs(x = col)
    print(dist_plot)
}

boxplot(solar_condata$Power.Generated, main = "Boxplot of Power Generated")

Data Preparation

Label encoding & Normalization

# Convert the logical vector to numeric (label encoding): TRUE -> 1, FALSE -> 0
solar$Is.Daylight <- as.integer(solar$Is.Daylight)

# Min-max normalization: rescale each column to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# quick check on the integers 1 to 16
normalize(1:16)

##  [1] 0.00000000 0.06666667 0.13333333 0.20000000 0.26666667 0.33333333
##  [7] 0.40000000 0.46666667 0.53333333 0.60000000 0.66666667 0.73333333
## [13] 0.80000000 0.86666667 0.93333333 1.00000000

solar_n <- as.data.frame(lapply(solar, normalize))
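
Min-max normalization divides by the column's range, so a constant column would yield 0/0. The sketch below shows one way this could be guarded against; `normalize_safe` and the `toy` data frame are hypothetical names introduced here for illustration and are not part of the analysis above.

```r
# Min-max scaling with two safeguards added for illustration:
# NA values are ignored when finding the range, and a constant column
# maps to 0 instead of producing NaN from a 0/0 division
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up data: column a varies, column b is constant
toy <- data.frame(a = c(10, 20, 30), b = c(5, 5, 5))
toy_n <- as.data.frame(lapply(toy, normalize_safe))
print(toy_n)
```

In the solar data every column varies and missing rows have already been removed, so the plain `normalize()` function is sufficient there.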

Regression Problem

1. Features Selection for regression problem (Predict Power.Generated)

# Correlation
regression_solar <- cor(solar_n, solar_n$Power.Generated)
ggplot(melt(regression_solar), aes(x=Var1, y='Power.Generated', fill=value)) +
  geom_tile(color="white") +
  geom_text(aes(label=round(value, 2)), color="black", size=3) +
  scale_fill_gradient(low="white", high="steelblue", name="Correlation",
                      limits=c(-1, 1), breaks=seq(-1, 1, by=0.2),
                      labels=seq(-1, 1, by=0.2), guide=guide_colorbar(barheight=15, barwidth=1,
                                                                      title.position = "top", title.hjust=0.5)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Most features correlate only weakly with the target (Power.Generated). Pearson correlation measures only linear association, so a low coefficient does not by itself rule a feature out. We therefore keep most of the features for prediction and drop only those that appear least relevant to the target: Day.of.Year, Month, Day, Visibility and Average.Barometric.Pressure..Period.
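
Two properties of the correlation coefficient are worth keeping in mind when reading the heatmap. The sketch below, on made-up data, suggests that min-max scaling leaves the Pearson correlation unchanged, and that a feature can have near-zero correlation with a target it fully determines when the relationship is not linear. The variable names and values are illustrative only.

```r
set.seed(1)
# Made-up feature and target on very different scales
feature <- runif(50, min = 0, max = 100)
target  <- 300 * feature + rnorm(50, sd = 5000)

min_max <- function(x) (x - min(x)) / (max(x) - min(x))

r_raw    <- cor(feature, target)
r_scaled <- cor(min_max(feature), min_max(target))
cat("raw:", r_raw, " scaled:", r_scaled, "\n")

# A symmetric, non-linear relationship: y depends entirely on x,
# yet the linear correlation is essentially zero
x <- seq(-1, 1, length.out = 21)
y <- x^2
r_nonlinear <- cor(x, y)
cat("non-linear case:", r_nonlinear, "\n")
```

This is one reason a tree-based model such as Random Forest can still make use of features whose correlation with the target looks small.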

# Features selection
regression_features <-c('Year','First.Hour.of.Period','Is.Daylight','Distance.to.Solar.Noon','Average.Temperature..Day.','Average.Wind.Direction..Day.','Average.Wind.Speed..Day.','Sky.Cover','Relative.Humidity','Average.Wind.Speed..Period.')
solar_regression <-solar_n[,regression_features]

2. Train test split (Predict Power.Generated)

# train_test_split solar dataframe
library(dplyr)
X_r <- solar_regression #features
y_r <- solar_n %>% select(Power.Generated) #target
set.seed(42)
# Load the caret package
library(caret)

## Loading required package: lattice

indices_r <- createDataPartition(solar_n$Power.Generated, p = 0.7, list = FALSE) # 70% training 30 % testing

# Create the x training set
x_train_r <- X_r[indices_r, ]

# Create the x test set
x_test_r <- X_r[-indices_r, ]

# Create the y training set
y_train_r <- y_r[indices_r, ]

# Create the y test set
y_test_r <- y_r[-indices_r, ]
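
For comparison, a plain random 70/30 split can be written in base R alone. This is a simplified sketch on made-up data, not a replacement for the code above: as we understand it, `createDataPartition()` samples within groups of the outcome so that the training and test sets have similar target distributions, which `sample()` does not attempt.

```r
set.seed(42)
# Made-up data standing in for the normalized features and target
toy <- data.frame(x1 = runif(100), x2 = runif(100), y = runif(100))

# Draw 70% of the row indices at random for training
n_train   <- round(0.7 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

# Rows in train_idx form the training set; all remaining rows form the test set
x_train <- toy[train_idx,  c("x1", "x2")]
x_test  <- toy[-train_idx, c("x1", "x2")]
y_train <- toy$y[train_idx]
y_test  <- toy$y[-train_idx]

cat("train rows:", nrow(x_train), " test rows:", nrow(x_test), "\n")
```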

3. Machine learning development

The machine learning model used for the regression problem is Random Forest.

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

modelRF_r <- randomForest(x_train_r, y_train_r, ntree = 100)
y_pred_rf_r <- predict(modelRF_r, x_test_r)
# Squared correlation between predicted and actual test values (an R-squared-style score)
rf_model_r <- cor(y_pred_rf_r, y_test_r)^2
print(rf_model_r)

## [1] 0.9343499

plot(y_test_r,y_pred_rf_r)
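
The score printed above is the squared Pearson correlation between predictions and actual values. It often tracks the coefficient of determination, 1 - SSE/SST, but the two are not the same quantity. The sketch below, with made-up numbers and the hypothetical helpers `sq_cor` and `r2`, illustrates that squared correlation ignores a constant bias in the predictions while 1 - SSE/SST penalizes it.

```r
# Made-up actual values and two sets of predictions
actual      <- c(0.10, 0.20, 0.30, 0.40, 0.50)
pred_good   <- c(0.12, 0.18, 0.33, 0.41, 0.47)
pred_biased <- pred_good + 0.3   # same pattern, shifted upwards

# Squared Pearson correlation, as used in the analysis above
sq_cor <- function(a, p) cor(a, p)^2
# Coefficient of determination: 1 - SSE / SST
r2 <- function(a, p) 1 - sum((a - p)^2) / sum((a - mean(a))^2)

cat("good:   cor^2 =", sq_cor(actual, pred_good),
    " R2 =", r2(actual, pred_good), "\n")
cat("biased: cor^2 =", sq_cor(actual, pred_biased),
    " R2 =", r2(actual, pred_biased), "\n")
```

Reporting error metrics such as MAE and RMSE alongside the squared correlation helps cover this case, since they are sensitive to bias.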

4. Machine learning evaluation

The model is evaluated on the test set using mean absolute error (MAE), root mean squared error (RMSE) and root mean squared logarithmic error (RMSLE).

library(Metrics)

## 
## Attaching package: 'Metrics'

## The following objects are masked from 'package:caret':
## 
##     precision, recall

# Evaluation metrics
rf_mae_valid_r <- mae(y_test_r, y_pred_rf_r)
rf_rmsle_valid_r <- rmsle(y_test_r, y_pred_rf_r)
rf_rmse_valid_r <- rmse(y_test_r, y_pred_rf_r)

# Print the evaluation metrics
cat("RF - MAE      Valid:", rf_mae_valid_r, "\n")

## RF - MAE      Valid: 0.03476183

cat("RF - RMSE      Valid:", rf_rmse_valid_r, "\n")

## RF - RMSE      Valid: 0.07281934

cat("RF - RMSLE    Valid:", rf_rmsle_valid_r, "\n")

## RF - RMSLE    Valid: 0.05293914
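
To make the three metrics concrete, they can be written out in base R as below. The vectors are made up for illustration. Note that the values reported above are computed on the min-max normalized target, so they are in units of the [0, 1] scale; MAE and RMSE could be converted back to the original units by multiplying by the range (maximum minus minimum) of Power.Generated in the cleaned data.

```r
# Made-up actual and predicted values on a [0, 1] scale
actual    <- c(0.00, 0.25, 0.50, 1.00)
predicted <- c(0.05, 0.20, 0.60, 0.90)

# Mean absolute error: average size of the errors
mae_manual <- mean(abs(actual - predicted))

# Root mean squared error: penalizes large errors more heavily
rmse_manual <- sqrt(mean((actual - predicted)^2))

# Root mean squared logarithmic error: RMSE on log(1 + x) values,
# which emphasizes relative rather than absolute differences
rmsle_manual <- sqrt(mean((log1p(actual) - log1p(predicted))^2))

cat("MAE:", mae_manual, " RMSE:", rmse_manual, " RMSLE:", rmsle_manual, "\n")
```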

Classification Problem

1. Features Selection for classification problem (Predict Is.Daylight)

# Correlation
class_solar <- cor(solar_n, solar_n$Is.Daylight)
ggplot(melt(class_solar), aes(x=Var1, y='Is.Daylight', fill=value)) +
  geom_tile(color="white") +
  geom_text(aes(label=round(value, 2)), color="black", size=3) +
  scale_fill_gradient(low="white", high="steelblue", name="Correlation",
                      limits=c(-1, 1), breaks=seq(-1, 1, by=0.2),
                      labels=seq(-1, 1, by=0.2), guide=guide_colorbar(barheight=15, barwidth=1,
                                                                      title.position = "top", title.hjust=0.5)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Most features correlate only weakly with the target (Is.Daylight). Pearson correlation measures only linear association, so a low coefficient does not by itself rule a feature out. We therefore keep most of the features for prediction and drop only those that appear least related to the target: Day.of.Year, Month, Day, Visibility and Average.Barometric.Pressure..Period.

# Features selection
class_features <-c('Year','First.Hour.of.Period','Distance.to.Solar.Noon','Average.Temperature..Day.','Average.Wind.Direction..Day.','Average.Wind.Speed..Day.','Sky.Cover','Relative.Humidity','Average.Wind.Speed..Period.','Power.Generated')
solar_class <-solar_n[,class_features]

2. Train test split (Predict Is.Daylight)

# train_test_split solar dataframe
library(dplyr)
X_c <- solar_class #features
y_c <- solar_n %>% select(Is.Daylight) #target
set.seed(221)
# Load the caret package
library(caret)
indices_c <- createDataPartition(solar_n$Is.Daylight, p = 0.7, list = FALSE) # 70% training 30 % testing

# Create the x training set
x_train_c <- X_c[indices_c, ]

# Create the x test set
x_test_c <- X_c[-indices_c, ]

# Create the y training set
y_train_c <- y_c[indices_c, ]

# Create the y test set
y_test_c <- y_c[-indices_c, ]

# convert y to factors so that randomForest performs classification
y_test_c <- factor(y_test_c, levels = c(0, 1))
y_train_c <- factor(y_train_c, levels = c(0, 1))

3. Machine learning development

The machine learning model used for the classification problem is also Random Forest.

library(randomForest)
modelRF_c <- randomForest(x_train_c, y_train_c, ntree = 100, proximity = FALSE)
y_pred_rf_c <- predict(modelRF_c, x_test_c)
# Convert the factor predictions and labels back to numeric 0/1, then check performance on the test data
y_pred_rf_c <- as.numeric(as.character(y_pred_rf_c))
y_test_c <- as.numeric(as.character(y_test_c))
rf_model_c <- cor(y_pred_rf_c, y_test_c)^2
print(rf_model_c)

## [1] 0.9903508

4. Machine learning evaluation

The model is evaluated on the test set using F1, precision, recall and AUC.

library(Metrics)
library(cvms)
library(tibble)

# Evaluation metrics
rf_f1_valid <- f1(y_test_c, y_pred_rf_c)
rf_precise_valid <- precision(y_test_c, y_pred_rf_c)
rf_auc_valid <- auc(y_test_c, y_pred_rf_c)
rf_recall_valid <- recall(y_test_c, y_pred_rf_c)

# Print the evaluation metrics
cat("RF - F1      Valid:", rf_f1_valid, "\n")

## RF - F1      Valid: 1

cat("RF - PRECISE      Valid:", rf_precise_valid, "\n")

## RF - PRECISE      Valid: 1

cat("RF - AUC      Valid:", rf_auc_valid, "\n")

## RF - AUC      Valid: 0.9981584

cat("RF - RECALL    Valid:", rf_recall_valid, "\n")

## RF - RECALL    Valid: 0.9963168
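
Note that F1 is reported as exactly 1 although recall is below 1. One possible explanation, which should be checked against the Metrics documentation, is that `f1()` in that package is described as an information-retrieval score computed on sets of unique elements rather than on binary labels. As a cross-check, binary precision, recall and F1 can be computed directly from the confusion counts. The sketch below uses made-up labels, not the model's predictions.

```r
# Made-up labels: 1 = daylight, 0 = not daylight
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0)

# Confusion counts for the positive class
tp <- sum(actual == 1 & predicted == 1)  # true positives
fp <- sum(actual == 0 & predicted == 1)  # false positives
fn <- sum(actual == 1 & predicted == 0)  # false negatives

precision_manual <- tp / (tp + fp)
recall_manual    <- tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1_manual <- 2 * precision_manual * recall_manual /
  (precision_manual + recall_manual)

cat("precision:", precision_manual, " recall:", recall_manual,
    " F1:", f1_manual, "\n")
```

Because F1 is a harmonic mean, it can equal 1 only when both precision and recall equal 1; applying the same formula to the precision and recall printed above would give a value slightly below 1.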

#Confusion Matrix
d_binomial <- tibble("target" = y_test_c,
                     "prediction" = y_pred_rf_c)
basic_table <- table(d_binomial)
cfm <- as_tibble(basic_table)

plot_confusion_matrix(cfm, 
                      target_col = "target", 
                      prediction_col = "prediction",
                      counts_col = "n")

## Warning in plot_confusion_matrix(cfm, target_col = "target", prediction_col =
## "prediction", : 'ggimage' is missing. Will not plot arrows and zero-shading.

## Warning in plot_confusion_matrix(cfm, target_col = "target", prediction_col =
## "prediction", : 'rsvg' is missing. Will not plot arrows and zero-shading.

Conclusion

In conclusion, we carried out a number of data exploration and visualization steps to understand the data and prepare it for analysis. Specifically, we imported and cleaned the data, and used correlation analysis, distribution plots, and boxplots to explore it. These steps allowed us to identify potential trends and relationships within the data, as well as key features that may be relevant for predicting solar power generation.

Based on the Random Forest models developed and evaluated above, Power.Generated and Is.Daylight are both predicted well, with squared correlations between predicted and actual values of about 0.93 for the regression problem and 0.99 for the classification problem. These are R-squared-style scores rather than accuracy in the strict sense. For the regression problem, MAE, RMSLE and RMSE are all below 0.1 on the min-max normalized target; for the classification problem, F1, AUC, precision and recall all exceed 0.99.

Hence, Random Forest performed well at predicting the output of a solar power system on this dataset, and models of this kind could support efforts to improve the performance and efficiency of solar power systems in the future.