In this practice, we will explore wind turbine data in Canada, examine the factors affecting a turbine’s capacity, and apply a decision tree to predict turbine capacity from turbine characteristics.
A detailed description of the data frame: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-10-27/readme.md
Let’s load the necessary packages and data.
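The loading code is not included here; below is a minimal sketch, assuming the data come from the raw wind-turbine.csv file in the TidyTuesday repository and that theme_ipsum_rc() used in the final plot comes from the hrbrthemes package:
library(tidyverse)    # readr, dplyr, forcats, ggplot2, ...
library(tidymodels)   # rsample, parsnip, tune, yardstick, ...
library(hrbrthemes)   # theme_ipsum_rc() for the final plot

# read the raw turbine data from the TidyTuesday repository (assumed path)
turbines <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-27/wind-turbine.csv")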
# check for missing data in the data frame
is.na(turbines) %>% colSums()
## objectid province_territory
## 0 0
## project_name total_project_capacity_mw
## 1 0
## turbine_identifier turbine_number_in_project
## 0 0
## turbine_rated_capacity_k_w rotor_diameter_m
## 220 0
## hub_height_m manufacturer
## 0 0
## model commissioning_date
## 0 0
## latitude longitude
## 0 0
## notes
## 6064
Overall, the data frame is in good shape: the notes column is mostly empty, and 220 rows are missing turbine_rated_capacity_k_w. We need to do some cleaning and transformation before training and fitting our model.
Transform the turbines data downloaded from the link above.
First, we rename some variables to more convenient names, such as turbine_capacity and hub_height.
Second, the model and province_territory variables contain too many levels, so we lump each into its 10 most frequent levels.
turbines_df <- turbines %>%
transmute(
turbine_capacity = turbine_rated_capacity_k_w,
hub_height = hub_height_m,
rotor_diameter = rotor_diameter_m,
commissioning_date = parse_number(commissioning_date),
model = fct_lump_n(model, n=10),
province_territory = fct_lump_n(province_territory, n=10)
) %>%
filter(!is.na(turbine_capacity)) %>%
mutate_if(is.character, factor)
turbines_df
## # A tibble: 6,478 x 6
## turbine_capacity hub_height rotor_diameter commissioning_d~ model
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 150 30 23 1993 Other
## 2 600 40 44 1997 Other
## 3 600 50 44 1998 Other
## 4 600 50 44 1998 Other
## 5 600 50 44 1998 Other
## 6 660 50 47 2000 V47/~
## 7 1300 46 60 2001 Other
## 8 1300 46 60 2001 Other
## 9 1300 46 60 2001 Other
## 10 1300 46 60 2001 Other
## # ... with 6,468 more rows, and 1 more variable: province_territory <fct>
The following steps build, fit, and tune the model.
First, we split the data into training and testing sets.
Second, we create resamples of the training data for evaluation later on (sketched below).
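The splitting and resampling code is not shown here; a minimal sketch, assuming the object names turbines_split, turbines_training, and turbines_test used later, and a bootstrap resamples object named turbines_folds (the resample ids in collect_predictions() below start with "Boots~"):
set.seed(123)
# split the cleaned data into training and testing sets
turbines_split <- initial_split(turbines_df, strata = turbine_capacity)
turbines_training <- training(turbines_split)
turbines_test <- testing(turbines_split)

set.seed(234)
# create bootstrap resamples of the training data for tuning
turbines_folds <- bootstraps(turbines_training, strata = turbine_capacity)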
We use decision_tree() to specify the parameters that need to be tuned: cost_complexity, tree_depth, and min_n.
Then we specify the engine and the mode to be used.
model_engine <- decision_tree(
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart") %>%
set_mode("regression")
# set up a regular grid of candidate parameter values for tuning
tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 2)
This step tunes the model using the tune_grid() command.
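The tuning call itself is not shown above; here is a minimal sketch of what it could look like, assuming the turbines_folds resamples from the sketch above and that the mae and mape metrics seen in the output were requested via metric_set():
set.seed(345)
tuned_model <- tune_grid(
  model_engine,                 # the tunable decision tree specification
  turbine_capacity ~ .,         # predict capacity from all other variables
  resamples = turbines_folds,
  grid = tree_grid,
  metrics = metric_set(rmse, rsq, mae, mape)
)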
tuned_model %>% collect_metrics() # show metrics for every parameter combination in the grid
## # A tibble: 32 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 1 2 mae standard 388. 25 1.93
## 2 0.0000000001 1 2 mape standard 28.1 25 0.490
## 3 0.0000000001 1 2 rmse standard 507. 25 1.57
## 4 0.0000000001 1 2 rsq standard 0.310 25 0.00286
## 5 0.1 1 2 mae standard 388. 25 1.93
## 6 0.1 1 2 mape standard 28.1 25 0.490
## 7 0.1 1 2 rmse standard 507. 25 1.57
## 8 0.1 1 2 rsq standard 0.310 25 0.00286
## 9 0.0000000001 15 2 mae standard 12.9 25 0.712
## 10 0.0000000001 15 2 mape standard 0.634 25 0.0333
## # ... with 22 more rows, and 1 more variable: .config <chr>
tuned_model %>% show_best() # show the best-performing parameter combinations for the final model
## Warning: No value of `metric` was given; metric 'rmse' will be used.
## # A tibble: 5 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 15 2 rmse standard 61.1 25 2.10
## 2 0.0000000001 15 40 rmse standard 79.8 25 1.81
## 3 0.1 15 2 rmse standard 351. 25 3.18
## 4 0.1 15 40 rmse standard 351. 25 3.18
## 5 0.0000000001 1 2 rmse standard 507. 25 1.57
## # ... with 1 more variable: .config <chr>
tuned_model %>% collect_predictions() # collect the out-of-sample predictions
## # A tibble: 358,672 x 8
## id .pred .row cost_complexity tree_depth min_n turbine_capacity .config
## <chr> <dbl> <int> <dbl> <int> <int> <dbl> <chr>
## 1 Boots~ 1363. 3 0.0000000001 1 2 600 Preproc~
## 2 Boots~ 1363. 5 0.0000000001 1 2 600 Preproc~
## 3 Boots~ 1363. 7 0.0000000001 1 2 1300 Preproc~
## 4 Boots~ 1363. 15 0.0000000001 1 2 1300 Preproc~
## 5 Boots~ 1363. 21 0.0000000001 1 2 660 Preproc~
## 6 Boots~ 2168. 22 0.0000000001 1 2 1800 Preproc~
## 7 Boots~ 1363. 23 0.0000000001 1 2 660 Preproc~
## 8 Boots~ 1363. 26 0.0000000001 1 2 660 Preproc~
## 9 Boots~ 1363. 29 0.0000000001 1 2 660 Preproc~
## 10 Boots~ 1363. 31 0.0000000001 1 2 660 Preproc~
## # ... with 358,662 more rows
# finalize the model specification with the best parameters (by RMSE)
final_model <- finalize_model(model_engine, select_best(tuned_model, "rmse"))
# fit the final model on the training data
final_fit <- fit(final_model, turbine_capacity ~ ., turbines_training)
# last_fit() refits on the training data and evaluates on the test data
final_rs <- last_fit(final_model, turbine_capacity ~ ., turbines_split)
# see how well the model works
final_rs %>% collect_metrics()
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 73.6 Preprocessor1_Model1
## 2 rsq standard 0.985 Preprocessor1_Model1
First, we predict on the testing data using final_fit.
Second, we merge the predictions with the true values from the testing data frame.
predict(final_fit, turbines_test)
## # A tibble: 1,618 x 1
## .pred
## <dbl>
## 1 1300
## 2 1300
## 3 1300
## 4 1300
## 5 1300
## 6 1300
## 7 1300
## 8 1300
## 9 660
## 10 660
## # ... with 1,608 more rows
check <- data.frame(predict(final_fit, turbines_test),
                    turbines_test$turbine_capacity)
check <- check %>%
  transmute(
    truth = turbines_test.turbine_capacity,
    estimated = .pred)
check %>% ggplot(aes(truth, estimated)) +
  geom_point(alpha = 0.5, color = "midnightblue") +
  geom_abline(lty = 2, color = "gray50") +
  labs(
    title = "Assessing the model performance",
    subtitle = "Predicted and true turbine capacity in the testing dataset",
    x = "True capacity",
    y = "Predicted capacity",
    caption = ":)"
  ) +
  theme_ipsum_rc()
Reference: https://juliasilge.com/blog/wind-turbine/