Solar power systems, also known as photovoltaic (PV) systems, use solar panels to convert sunlight into electricity and are a popular source of clean, renewable energy. They serve a range of applications, from residential to commercial and industrial use. In recent years, solar power technology has been an active area of research and innovation, with much of the effort focused on improving system efficiency.
Several factors affect the efficiency of a solar power system: the size and type of the system, its location and the local weather, and the efficiency of the panels themselves. Size and type matter because different systems suit different applications; large systems tend to suit commercial or industrial use, while small systems suit residential use. Location and weather matter because the amount of sunlight received varies significantly with location and time of year. Finally, more efficient panels produce more electricity from the same amount of sunlight.
The objectives of the project are as follows:
Identify the correlations between the available features and solar power generation: By analyzing the data and determining which factors have the most significant impact on solar power generation, we can better understand how to optimize the performance of a solar power system.
Predict the output of a solar power system based on past performance: By using machine learning algorithms and historical data, we can develop a model that predicts the output of a solar power system from various input parameters. Such predictions can help optimize the performance and efficiency of the system and support informed decisions about its operation and maintenance.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(graphics)
library(stats)
library(reshape2) # for melt(), used in the correlation heatmaps
library(ggplot2) # for visualization
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
We begin by importing the dataset and inspecting its contents.
solar <- read.csv("solar.csv") # import the dataset
View(solar) # inspect the dataframe table
str(solar) # check the data types of the columns
## 'data.frame': 2970 obs. of 16 variables:
## $ Day.of.Year : int 245 245 245 245 245 245 245 245 246 246 ...
## $ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ Month : int 9 9 9 9 9 9 9 9 9 9 ...
## $ Day : int 1 1 1 1 1 1 1 1 2 2 ...
## $ First.Hour.of.Period : int 1 4 7 10 13 16 19 22 1 4 ...
## $ Is.Daylight : logi FALSE FALSE TRUE TRUE TRUE TRUE ...
## $ Distance.to.Solar.Noon : num 0.8599 0.6285 0.3972 0.1658 0.0656 ...
## $ Average.Temperature..Day. : int 69 69 69 69 69 69 69 69 72 72 ...
## $ Average.Wind.Direction..Day. : int 28 28 28 28 28 28 28 28 29 29 ...
## $ Average.Wind.Speed..Day. : num 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 6.8 6.8 ...
## $ Sky.Cover : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Visibility : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Relative.Humidity : int 75 77 70 33 21 20 36 49 67 49 ...
## $ Average.Wind.Speed..Period. : int 8 5 0 0 3 23 15 6 6 0 ...
## $ Average.Barometric.Pressure..Period.: num 29.8 29.9 29.9 29.9 29.9 ...
## $ Power.Generated : int 0 0 5418 25477 30069 16280 515 0 0 0 ...
names(solar) # check the names of the columns
## [1] "Day.of.Year"
## [2] "Year"
## [3] "Month"
## [4] "Day"
## [5] "First.Hour.of.Period"
## [6] "Is.Daylight"
## [7] "Distance.to.Solar.Noon"
## [8] "Average.Temperature..Day."
## [9] "Average.Wind.Direction..Day."
## [10] "Average.Wind.Speed..Day."
## [11] "Sky.Cover"
## [12] "Visibility"
## [13] "Relative.Humidity"
## [14] "Average.Wind.Speed..Period."
## [15] "Average.Barometric.Pressure..Period."
## [16] "Power.Generated"
summary(solar) # check the statistical summary for each column
## Day.of.Year Year Month Day
## Min. : 1.0 Min. :2008 Min. : 1.000 Min. : 1.00
## 1st Qu.: 92.0 1st Qu.:2008 1st Qu.: 4.000 1st Qu.: 8.00
## Median :183.5 Median :2009 Median : 7.000 Median :16.00
## Mean :183.6 Mean :2009 Mean : 6.531 Mean :15.73
## 3rd Qu.:275.0 3rd Qu.:2009 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :366.0 Max. :2009 Max. :12.000 Max. :31.00
## NA's :6 NA's :1 NA's :1
## First.Hour.of.Period Is.Daylight Distance.to.Solar.Noon
## Min. : 1.0 Mode :logical Min. :0.0504
## 1st Qu.: 4.0 FALSE:1133 1st Qu.:0.2560
## Median :11.5 TRUE :1831 Median :0.4814
## Mean :11.5 NA's :6 Mean :0.5037
## 3rd Qu.:19.0 3rd Qu.:0.7387
## Max. :22.0 Max. :1.1414
## NA's :4
## Average.Temperature..Day. Average.Wind.Direction..Day.
## Min. :42.0 Min. : 1.00
## 1st Qu.:53.0 1st Qu.:25.00
## Median :59.0 Median :27.00
## Mean :58.5 Mean :24.96
## 3rd Qu.:63.0 3rd Qu.:29.00
## Max. :78.0 Max. :36.00
## NA's :2 NA's :4
## Average.Wind.Speed..Day. Sky.Cover Visibility Relative.Humidity
## Min. : 1.1 Min. :0.000 Min. : 0.000 Min. : 14.00
## 1st Qu.: 6.6 1st Qu.:1.000 1st Qu.:10.000 1st Qu.: 65.00
## Median :10.0 Median :2.000 Median :10.000 Median : 77.00
## Mean :10.1 Mean :1.989 Mean : 9.556 Mean : 73.56
## 3rd Qu.:13.1 3rd Qu.:3.000 3rd Qu.:10.000 3rd Qu.: 84.00
## Max. :26.6 Max. :4.000 Max. :10.000 Max. :100.00
## NA's :3 NA's :8 NA's :3 NA's :1
## Average.Wind.Speed..Period. Average.Barometric.Pressure..Period.
## Min. : 0.00 Min. :29.48
## 1st Qu.: 5.00 1st Qu.:29.92
## Median : 9.00 Median :30.00
## Mean :10.13 Mean :30.02
## 3rd Qu.:15.00 3rd Qu.:30.11
## Max. :40.00 Max. :30.53
## NA's :5 NA's :4
## Power.Generated
## Min. : 0
## 1st Qu.: 0
## Median : 385
## Mean : 6959
## 3rd Qu.:12668
## Max. :36580
## NA's :3
Next, we perform some data cleaning to ensure that the data is usable for analysis.
print(sum(duplicated(solar))) # check for duplicated rows
## [1] 0
# count missing values (NA) in each column
solar %>% summarise(across(everything(), ~ sum(is.na(.x))))
## Day.of.Year Year Month Day First.Hour.of.Period Is.Daylight
## 1 6 1 1 0 4 6
## Distance.to.Solar.Noon Average.Temperature..Day. Average.Wind.Direction..Day.
## 1 0 2 4
## Average.Wind.Speed..Day. Sky.Cover Visibility Relative.Humidity
## 1 3 8 3 1
## Average.Wind.Speed..Period. Average.Barometric.Pressure..Period.
## 1 5 4
## Power.Generated
## 1 3
# remove rows with null values
solar <- solar[complete.cases(solar),]
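To make the effect of this cleaning step concrete, here is a minimal base-R sketch on a made-up data frame (`toy` and its columns are illustrative assumptions, not part of the solar dataset). It counts missing values per column and reports how many rows `complete.cases()` removes.

```r
# Toy data frame with a few missing values (illustrative, not the solar data)
toy <- data.frame(
  temp  = c(69, NA, 72, 70),
  wind  = c(7.5, 6.8, NA, 5.0),
  power = c(0, 5418, 25477, NA)
)

# Missing values per column, a base-R equivalent of the per-column NA count
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only rows without any missing value and report how many were dropped
toy_clean <- toy[complete.cases(toy), ]
rows_dropped <- nrow(toy) - nrow(toy_clean)
cat("Rows dropped:", rows_dropped, "of", nrow(toy), "\n")
```

Going by the outputs in this report, the same step reduces the solar data from 2970 rows to the 2919 observations reported later, so less than 2% of the rows are lost.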
We can also visualize the data using boxplots to check for outliers.
# Columns to inspect for outliers
bp_cols <- c("Average.Temperature..Day.", "Average.Wind.Direction..Day.",
             "Average.Wind.Speed..Day.", "Average.Wind.Speed..Period.",
             "Power.Generated")
# Reshape to long format so that one column (Var) holds the feature names
solar_bp <- solar %>%
  select(all_of(bp_cols)) %>%
  pivot_longer(cols = everything(), names_to = "Var", values_to = "Val")
# Faceted boxplots with free y-axis scales; x-axis labels are hidden to avoid overlapping text
ggplot(solar_bp, aes(x = Var, y = Val)) +
  geom_boxplot() +
  facet_wrap(~Var, scales = "free_y") +
  theme(axis.text.x = element_blank())
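The boxplots show outliers visually; a numeric count can complement them. The sketch below is one possible way to do this, assuming the usual 1.5 × IQR whisker rule that `boxplot.stats()` applies by default. The helper `count_outliers` and the toy columns are hypothetical and not part of the original analysis.

```r
# Count points beyond the boxplot whiskers (1.5 * IQR rule) in a numeric vector.
# `count_outliers` is a hypothetical helper, not part of the original analysis.
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Made-up values for illustration
toy <- data.frame(
  wind_speed  = c(5, 6, 7, 8, 9, 10, 40),   # 40 lies far from the rest
  temperature = c(55, 57, 58, 59, 60, 61, 63)
)

outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

Applied to the solar data, the same call could be run as `sapply(solar[bp_cols], count_outliers)` for whichever numeric columns are of interest (`bp_cols` here stands for a character vector of column names).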
To understand the factors that influence solar energy generation, we can compute the correlations between the different features and the output (Power.Generated).
con_data <- c('Day.of.Year', 'Year', 'Month', 'Day', 'Distance.to.Solar.Noon',
              'Average.Temperature..Day.', 'Average.Wind.Direction..Day.',
              'Average.Wind.Speed..Day.', 'Visibility', 'Relative.Humidity',
              'Average.Wind.Speed..Period.', 'Average.Barometric.Pressure..Period.',
              'Power.Generated')
cat_data <- c('First.Hour.of.Period', 'Is.Daylight', 'Sky.Cover')
# all_of() selects columns named in an external character vector
solar_condata <- select(solar, all_of(con_data))
solar_catdata <- select(solar, all_of(cat_data))
We can plot the continuous data using histograms and density plots.
hist(solar_condata$Power.Generated, main = "Power.Generated", xlab = "Power.Generated")
density(solar_condata$Power.Generated)
##
## Call:
## density.default(x = solar_condata$Power.Generated)
##
## Data: solar_condata$Power.Generated (2919 obs.); Bandwidth 'bw' = 1733
##
## x y
## Min. :-5198 Min. :3.820e-09
## 1st Qu.: 6546 1st Qu.:8.340e-06
## Median :18290 Median :1.090e-05
## Mean :18290 Mean :2.125e-05
## 3rd Qu.:30034 3rd Qu.:1.478e-05
## Max. :41778 Max. :1.358e-04
d <- density(solar_condata$Power.Generated)
plot(d, main = "Power.Generated")
polygon(d, col = "red", border = "blue")
# Bar chart of value counts for each continuous feature
for (con in seq_along(con_data)) {
  nd_condata <- ggplot(data = solar_condata, aes(x = .data[[con_data[con]]])) +
    geom_bar(fill = "purple") +
    ggtitle("Distribution of Features") +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(x = con_data[con])
  print(nd_condata)
}
boxplot(solar_condata$Power.Generated, main = "Boxplot of Power Generated")
# Convert the logical column to numeric (label encoding)
solar$Is.Daylight <- as.numeric(solar$Is.Daylight)
# Min-max normalization: rescale each column to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
normalize(1:16) # quick check on a known sequence
## [1] 0.00000000 0.06666667 0.13333333 0.20000000 0.26666667 0.33333333
## [7] 0.40000000 0.46666667 0.53333333 0.60000000 0.66666667 0.73333333
## [13] 0.80000000 0.86666667 0.93333333 1.00000000
solar_n <- as.data.frame(lapply(solar, normalize))
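One edge case of min-max scaling is worth keeping in mind: a column whose values are all identical has a range of zero, so the formula divides zero by zero and returns `NaN`. None of the columns summarised earlier is constant, so this does not affect the results here, but a guarded variant could look like the sketch below. `normalize_safe` is a hypothetical helper, and mapping constant columns to 0 is an assumption about the desired behaviour.

```r
# Min-max scaling that guards against zero-range (constant) columns.
# `normalize_safe` is a hypothetical variant of the `normalize` helper.
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))  # constant column: map every value to 0
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled   <- normalize_safe(c(10, 20, 30))  # 0.0 0.5 1.0
constant <- normalize_safe(c(5, 5, 5))     # 0 0 0 instead of NaN
print(scaled)
print(constant)
```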
# Correlation of each feature with Power.Generated
regression_solar <- cor(solar_n, solar_n$Power.Generated)
ggplot(melt(regression_solar), aes(x = Var1, y = 'Power.Generated', fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  scale_fill_gradient(low = "white", high = "steelblue", name = "Correlation",
                      limits = c(-1, 1), breaks = seq(-1, 1, by = 0.2),
                      labels = seq(-1, 1, by = 0.2),
                      guide = guide_colorbar(barheight = 15, barwidth = 1,
                                             title.position = "top", title.hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Most features show only a weak linear correlation with the target (Power.Generated). Pearson correlation is unaffected by min-max scaling and captures only linear relationships, so a low value does not rule out a useful nonlinear effect. We therefore keep most of the features for prediction and drop only those that appear least relevant to the target output: Day.of.Year, Month, Day, Visibility, and Average.Barometric.Pressure..Period.
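As a quick check of how Pearson correlation behaves under rescaling and for nonlinear relationships, the sketch below uses synthetic data (the variables `x` and `y` are made up for illustration and are not the solar features). It shows that min-max scaling leaves the correlation unchanged, and that a feature can be strongly related to a target while its linear correlation stays close to zero.

```r
# Same min-max helper as in the report, repeated so the example is self-contained
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

set.seed(1)
x <- runif(200, min = -1, max = 1)   # synthetic feature
y <- x^2 + rnorm(200, sd = 0.05)     # strong but nonlinear dependence on x

r_raw    <- cor(x, y)                         # Pearson on the raw values
r_scaled <- cor(normalize(x), normalize(y))   # Pearson after min-max scaling
r_square <- cor(x^2, y)                       # Pearson with a transformed feature

cat("raw:", r_raw, " scaled:", r_scaled, " squared feature:", r_square, "\n")
```

This is one reason a tree-based model such as Random Forest can still make good use of features whose correlation coefficients look small.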
# Features selection
regression_features <-c('Year','First.Hour.of.Period','Is.Daylight','Distance.to.Solar.Noon','Average.Temperature..Day.','Average.Wind.Direction..Day.','Average.Wind.Speed..Day.','Sky.Cover','Relative.Humidity','Average.Wind.Speed..Period.')
solar_regression <-solar_n[,regression_features]
# Split the data into training and test sets
X_r <- solar_regression # features
y_r <- solar_n %>% select(Power.Generated) # target
set.seed(42)
# Load the caret package
library(caret)
## Loading required package: lattice
indices_r <- createDataPartition(solar_n$Power.Generated, p = 0.7, list = FALSE) # 70% training 30 % testing
# Create the x training set
x_train_r <- X_r[indices_r, ]
# Create the x test set
x_test_r <- X_r[-indices_r, ]
# Create the y training set
y_train_r <- y_r[indices_r, ]
# Create the y test set
y_test_r <- y_r[-indices_r, ]
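The split can be sanity-checked with a few lines of base R. The sketch below is an illustration on made-up data and uses a plain random split with `sample()` rather than `createDataPartition()`, which additionally balances the target distribution between the two sets. It also shows that indexing a one-column data frame by rows returns a plain vector, which is why the target sets above end up as vectors.

```r
set.seed(42)
n <- 100
toy <- data.frame(feature = rnorm(n), target = runif(n))  # made-up data

# Simple random 70/30 split (no stratification, unlike createDataPartition)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Sanity checks: the sizes add up and no row appears in both sets
stopifnot(nrow(train) + nrow(test) == n)
stopifnot(length(intersect(rownames(train), rownames(test))) == 0)

# Row-indexing a one-column data frame drops it to a plain vector
y <- toy["target"]
y_train <- y[train_idx, ]
print(is.numeric(y_train))
```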
We use a Random Forest model for the regression task.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
modelRF_r <- randomForest(x_train_r, y_train_r, ntree = 100)
y_pred_rf_r <- predict(modelRF_r, x_test_r)
# Squared Pearson correlation between predicted and actual values on the test set
rf_model_r <- cor(y_pred_rf_r, y_test_r)^2
print(rf_model_r)
## [1] 0.9343499
plot(y_test_r,y_pred_rf_r)
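The score above is the squared correlation between predictions and actual values. This is not always the same as the coefficient of determination, because correlation ignores any constant offset or rescaling in the predictions. The sketch below illustrates the difference on made-up numbers; `r_squared` is a hypothetical helper that computes 1 - SS_res / SS_tot.

```r
# Coefficient of determination: 1 - SS_res / SS_tot.
# `r_squared` is a hypothetical helper written for this illustration.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

actual <- c(0.0, 0.2, 0.4, 0.6, 0.8)
biased <- actual + 0.3               # perfectly correlated, but shifted upward

r2_cor <- cor(actual, biased)^2      # correlation ignores the shift
r2_det <- r_squared(actual, biased)  # the shift counts as prediction error

cat("squared correlation:", r2_cor, "\n")
cat("1 - SS_res/SS_tot:  ", r2_det, "\n")
```

When predictions are unbiased the two numbers tend to be close, so reporting both is a cheap way to detect systematic over- or under-prediction.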
We evaluate the regression model using MAE, RMSE, and RMSLE.
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
# Evaluation metrics
rf_mae_valid_r <- mae(y_test_r, y_pred_rf_r)
rf_rmsle_valid_r <- rmsle(y_test_r, y_pred_rf_r)
rf_rmse_valid_r <- rmse(y_test_r, y_pred_rf_r)
# Print the evaluation metrics
cat("RF - MAE Valid:", rf_mae_valid_r, "\n")
## RF - MAE Valid: 0.03476183
cat("RF - RMSE Valid:", rf_rmse_valid_r, "\n")
## RF - RMSE Valid: 0.07281934
cat("RF - RMSLE Valid:", rf_rmsle_valid_r, "\n")
## RF - RMSLE Valid: 0.05293914
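For reference, the three metrics can be written out in a few lines of base R. The sketch below is a minimal illustration of the standard definitions on made-up values; the `*_base` function names are hypothetical and are not the `Metrics` implementations.

```r
# Standard definitions of the three error metrics (hypothetical helper names)
mae_base   <- function(actual, predicted) mean(abs(actual - predicted))
rmse_base  <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmsle_base <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

actual    <- c(0, 0.5, 1)     # made-up normalized targets
predicted <- c(0, 0.4, 0.8)   # made-up predictions

mae_val   <- mae_base(actual, predicted)
rmse_val  <- rmse_base(actual, predicted)
rmsle_val <- rmsle_base(actual, predicted)
cat("MAE:", mae_val, " RMSE:", rmse_val, " RMSLE:", rmsle_val, "\n")
```

Because the target was min-max scaled, the MAE and RMSE reported above are on a 0 to 1 scale; multiplying them by the range of Power.Generated converts them back to the original units. If the cleaned target still spans the 0 to 36580 range shown in the summary, an MAE of 0.0348 corresponds to roughly 1270 units of generated power.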
# Correlation of each feature with Is.Daylight
class_solar <- cor(solar_n, solar_n$Is.Daylight)
ggplot(melt(class_solar), aes(x = Var1, y = 'Is.Daylight', fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  scale_fill_gradient(low = "white", high = "steelblue", name = "Correlation",
                      limits = c(-1, 1), breaks = seq(-1, 1, by = 0.2),
                      labels = seq(-1, 1, by = 0.2),
                      guide = guide_colorbar(barheight = 15, barwidth = 1,
                                             title.position = "top", title.hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
As before, most features show only a weak linear correlation with the target (Is.Daylight), and a low Pearson correlation does not rule out a useful nonlinear relationship. We therefore keep most of the features for prediction and drop only those that appear least related to the target output: Day.of.Year, Month, Day, Visibility, and Average.Barometric.Pressure..Period.
# Features selection
class_features <-c('Year','First.Hour.of.Period','Distance.to.Solar.Noon','Average.Temperature..Day.','Average.Wind.Direction..Day.','Average.Wind.Speed..Day.','Sky.Cover','Relative.Humidity','Average.Wind.Speed..Period.','Power.Generated')
solar_class <-solar_n[,class_features]
# Split the data into training and test sets
X_c <- solar_class # features
y_c <- solar_n %>% select(Is.Daylight) # target
set.seed(221)
indices_c <- createDataPartition(solar_n$Is.Daylight, p = 0.7, list = FALSE) # 70% training 30 % testing
# Create the x training set
x_train_c <- X_c[indices_c, ]
# Create the x test set
x_test_c <- X_c[-indices_c, ]
# Create the y training set
y_train_c <- y_c[indices_c, ]
# Create the y test set
y_test_c <- y_c[-indices_c, ]
# Convert the targets to factors so that randomForest fits a classifier
y_test_c <- factor(y_test_c, levels = c(0, 1))
y_train_c <- factor(y_train_c, levels = c(0, 1))
We again use Random Forest, this time as a classifier.
modelRF_c <- randomForest(x_train_c, y_train_c, ntree = 100, proximity = FALSE)
y_pred_rf_c <- predict(modelRF_c, x_test_c)
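# Convert the factor labels back to numeric 0/1 for the metric functions
y_pred_rf_c <- as.numeric(as.character(y_pred_rf_c))
y_test_c <- as.numeric(as.character(y_test_c))
# Squared Pearson correlation between predicted and actual labels on the test set
rf_model_c <- cor(y_pred_rf_c, y_test_c)^2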
print(rf_model_c)
## [1] 0.9903508
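For a classifier, the share of correctly predicted labels (accuracy) is a more common headline figure than a squared correlation. A minimal sketch on made-up 0/1 labels is shown below; the vectors are illustrative and are not the model's actual predictions.

```r
# Made-up 0/1 labels for illustration
actual    <- c(1, 1, 0, 0, 1, 0, 1, 1)
predicted <- c(1, 1, 0, 1, 1, 0, 1, 0)

# Accuracy: fraction of positions where prediction and label agree
accuracy <- mean(actual == predicted)
print(accuracy)
```

With the objects from this report, the equivalent call would be `mean(y_pred_rf_c == y_test_c)`.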
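We evaluate the classifier using F1, precision, recall, and AUC.
library(cvms)
library(tibble)
# Evaluation metrics
# Caution: Metrics::f1() compares the sets of unique values (information-retrieval F1);
# Metrics::fbeta_score() computes the binary-classification F1
rf_f1_valid <- f1(y_test_c, y_pred_rf_c)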
rf_precise_valid <- precision(y_test_c, y_pred_rf_c)
rf_auc_valid <- auc(y_test_c, y_pred_rf_c)
rf_recall_valid <- recall(y_test_c, y_pred_rf_c)
# Print the evaluation metrics
cat("RF - F1 Valid:", rf_f1_valid, "\n")
## RF - F1 Valid: 1
cat("RF - PRECISE Valid:", rf_precise_valid, "\n")
## RF - PRECISE Valid: 1
cat("RF - AUC Valid:", rf_auc_valid, "\n")
## RF - AUC Valid: 0.9981584
cat("RF - RECALL Valid:", rf_recall_valid, "\n")
## RF - RECALL Valid: 0.9963168
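The binary F1 score is the harmonic mean of precision and recall, so it can be cross-checked directly from the confusion counts. The sketch below does this on made-up labels; `f1_binary` is a hypothetical helper that treats 1 as the positive class.

```r
# Binary F1 from actual and predicted 0/1 labels (positive class = 1).
# `f1_binary` is a hypothetical helper written for this illustration.
f1_binary <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)  # true positives
  fp <- sum(actual == 0 & predicted == 1)  # false positives
  fn <- sum(actual == 1 & predicted == 0)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

actual    <- c(1, 1, 1, 1, 0, 0)   # made-up labels
predicted <- c(1, 1, 1, 0, 0, 0)   # precision 1, recall 0.75

f1_val <- f1_binary(actual, predicted)
print(f1_val)
```

Applying the same formula to the precision (1) and recall (0.9963168) printed above gives an F1 of about 0.998, slightly below the value of 1 reported by `f1()`. To our understanding, this is because `Metrics::f1()` is defined for information-retrieval problems and works on the unique values of each vector; it is worth confirming against the package documentation.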
# Confusion matrix
d_binomial <- tibble("target" = y_test_c,
                     "prediction" = y_pred_rf_c)
basic_table <- table(d_binomial)
cfm <- as_tibble(basic_table)
plot_confusion_matrix(cfm,
                      target_col = "target",
                      prediction_col = "prediction",
                      counts_col = "n")
## Warning in plot_confusion_matrix(cfm, target_col = "target", prediction_col =
## "prediction", : 'ggimage' is missing. Will not plot arrows and zero-shading.
## Warning in plot_confusion_matrix(cfm, target_col = "target", prediction_col =
## "prediction", : 'rsvg' is missing. Will not plot arrows and zero-shading.
In conclusion, we carried out several data exploration and visualization steps to understand and prepare the data for analysis. Specifically, we imported and cleaned the data, then used correlation analysis, distribution plots, and boxplots to explore it. These steps helped us identify potential trends and relationships within the data, as well as the features most likely to be relevant for predicting solar power generation.
The Random Forest models predict both Power.Generated (regression) and Is.Daylight (classification) well on the held-out test data: the squared correlation between predicted and actual values is about 0.93 and 0.99, respectively. On the normalized target, the regression errors are low (MAE, RMSLE, and RMSE all below 0.1), and the classifier scores above 0.99 on F1, AUC, precision, and recall.
Hence, Random Forest predicts the output of this solar power system well on the available data, and such predictions can support efforts to improve the performance and efficiency of solar power systems in the future.