Explore turbine data

In this exercise, we explore wind turbine data from Canada, look at the factors affecting a turbine’s capacity, and apply a decision tree to predict turbine capacity from turbine characteristics.

Detailed description of the data frame: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-10-27/readme.md

Let’s load the necessary packages and data.
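The loading code is not shown in this post, so here is a minimal sketch of what it might look like. The CSV location is an assumption based on the TidyTuesday repository linked above, and hrbrthemes is assumed only because theme_ipsum_rc() is used in the final plot.

```r
library(tidyverse)   # readr, dplyr, forcats, ggplot2
library(tidymodels)  # rsample, parsnip, tune, dials, yardstick
library(hrbrthemes)  # theme_ipsum_rc(), used in the final plot

# Assumed location of the raw CSV in the TidyTuesday repository
turbines <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-27/wind-turbine.csv"
)
```

Reading the file needs network access; a local copy of the CSV would work the same way with read_csv().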

# count missing values in each column of the data frame
is.na(turbines) %>% colSums()
##                   objectid         province_territory 
##                          0                          0 
##               project_name  total_project_capacity_mw 
##                          1                          0 
##         turbine_identifier  turbine_number_in_project 
##                          0                          0 
## turbine_rated_capacity_k_w           rotor_diameter_m 
##                        220                          0 
##               hub_height_m               manufacturer 
##                          0                          0 
##                      model         commissioning_date 
##                          0                          0 
##                   latitude                  longitude 
##                          0                          0 
##                      notes 
##                       6064

Overall, the data frame is fairly complete: 220 rows are missing turbine_rated_capacity_k_w, one row is missing project_name, and most rows have no notes. We still need to clean and transform the data before fitting our model.

Let’s do some cleaning

We transform the turbines data downloaded from the link above.

First, we rename some variables for convenience, for example turbine_capacity and hub_height.

Second, the model and province_territory variables have too many levels, so we keep the 10 most frequent levels of each and lump the rest into "Other" :))

Third, we drop the rows with a missing turbine capacity.

turbines_df <- turbines %>%
  transmute(
    turbine_capacity = turbine_rated_capacity_k_w, 
    hub_height = hub_height_m, 
    rotor_diameter = rotor_diameter_m,
    commissioning_date = parse_number(commissioning_date),
    model = fct_lump_n(model, n=10),
    province_territory = fct_lump_n(province_territory, n=10)
    ) %>%
  filter(!is.na(turbine_capacity)) %>%
  mutate_if(is.character, factor)

turbines_df
## # A tibble: 6,478 x 6
##    turbine_capacity hub_height rotor_diameter commissioning_d~ model
##               <dbl>      <dbl>          <dbl>            <dbl> <fct>
##  1              150         30             23             1993 Other
##  2              600         40             44             1997 Other
##  3              600         50             44             1998 Other
##  4              600         50             44             1998 Other
##  5              600         50             44             1998 Other
##  6              660         50             47             2000 V47/~
##  7             1300         46             60             2001 Other
##  8             1300         46             60             2001 Other
##  9             1300         46             60             2001 Other
## 10             1300         46             60             2001 Other
## # ... with 6,468 more rows, and 1 more variable: province_territory <fct>
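To make the lumping step more concrete, here is a small toy example of fct_lump_n(). The input vector is made up for illustration and has nothing to do with the turbine data.

```r
library(forcats)

# Made-up labels, for illustration only
x <- c("a", "a", "a", "b", "b", "c", "d")

# Keep the 2 most frequent levels and lump everything else into "Other"
lumped <- fct_lump_n(x, n = 2)
table(lumped)
```

Here the rare levels c and d are merged into Other, which is what happens to the less common turbine models and provinces above.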

Build a model

The following steps build, tune and fit the model.

Data preparation

First, we split the data into training and testing sets.

Second, we create resamples of the training data, which are used to evaluate candidate models during tuning.
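The code for these two steps is not shown, so the following is only a sketch. The names turbines_split, turbines_training and turbines_test are the ones used later in this post; turbines_boot and the seeds are my own choices. The tuning output further down lists 25 "Boots" resamples, which matches the defaults of bootstraps(), so that is what is assumed here. The small stand-in data frame is only there so the sketch runs on its own.

```r
library(tidymodels)

# Stand-in data so the sketch runs on its own; with the real data,
# turbines_df from the cleaning step is used instead.
if (!exists("turbines_df")) {
  set.seed(1)
  turbines_df <- tibble(
    hub_height = runif(400, 30, 120),
    rotor_diameter = runif(400, 20, 140),
    turbine_capacity = 20 * rotor_diameter + rnorm(400, sd = 50)
  )
}

set.seed(123)  # arbitrary seed, for reproducibility
turbines_split    <- initial_split(turbines_df)  # default: 75% training
turbines_training <- training(turbines_split)
turbines_test     <- testing(turbines_split)

set.seed(234)  # arbitrary seed
turbines_boot <- bootstraps(turbines_training)   # default: 25 resamples
```

Stratifying the split on the outcome with strata = turbine_capacity is a common option as well; it would change the exact row counts slightly.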

Engine preparation

We use decision_tree() to mark the hyperparameters to be tuned: cost_complexity, tree_depth and min_n.

Then we set the engine (rpart) and the mode (regression).

model_engine <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# regular grid of candidate values: 2 levels per parameter, 8 combinations
tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 2)

Tune the model

This step tunes the model with the tune_grid() function.
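The tune_grid() call itself is not shown, so here is a hedged sketch of what it might look like. The metric set is inferred from the four metrics in the output below, and save_pred = TRUE is assumed because collect_predictions() is called afterwards. The resample name turbines_boot and the seed are my own, and the stand-in objects exist only so the sketch runs on its own.

```r
library(tidymodels)

# Stand-in objects so the sketch runs on its own; skip this part when the
# objects from the earlier steps already exist in the session.
if (!exists("turbines_boot")) {
  set.seed(1)
  turbines_training <- tibble(
    hub_height = runif(200, 30, 120),
    rotor_diameter = runif(200, 20, 140),
    turbine_capacity = 20 * rotor_diameter + rnorm(200, sd = 50)
  )
  turbines_boot <- bootstraps(turbines_training, times = 5)
}
if (!exists("model_engine")) {
  model_engine <- decision_tree(
    cost_complexity = tune(), tree_depth = tune(), min_n = tune()
  ) %>%
    set_engine("rpart") %>%
    set_mode("regression")
  tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 2)
}

set.seed(345)  # arbitrary seed
tuned_model <- tune_grid(
  model_engine,
  turbine_capacity ~ .,
  resamples = turbines_boot,
  grid = tree_grid,
  metrics = metric_set(rmse, rsq, mae, mape),  # the four metrics in the output
  control = control_grid(save_pred = TRUE)     # needed for collect_predictions()
)
```

Each of the 8 grid combinations is fitted on every resample, so the run time grows with both the grid size and the number of resamples.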

Evaluate the model

tuned_model %>% collect_metrics() # metrics for each candidate tree specification
## # A tibble: 32 x 9
##    cost_complexity tree_depth min_n .metric .estimator    mean     n std_err
##              <dbl>      <int> <int> <chr>   <chr>        <dbl> <int>   <dbl>
##  1    0.0000000001          1     2 mae     standard   388.       25 1.93   
##  2    0.0000000001          1     2 mape    standard    28.1      25 0.490  
##  3    0.0000000001          1     2 rmse    standard   507.       25 1.57   
##  4    0.0000000001          1     2 rsq     standard     0.310    25 0.00286
##  5    0.1                   1     2 mae     standard   388.       25 1.93   
##  6    0.1                   1     2 mape    standard    28.1      25 0.490  
##  7    0.1                   1     2 rmse    standard   507.       25 1.57   
##  8    0.1                   1     2 rsq     standard     0.310    25 0.00286
##  9    0.0000000001         15     2 mae     standard    12.9      25 0.712  
## 10    0.0000000001         15     2 mape    standard     0.634    25 0.0333 
## # ... with 22 more rows, and 1 more variable: .config <chr>
tuned_model %>% show_best() # best specifications, ranked by RMSE by default here
## Warning: No value of `metric` was given; metric 'rmse' will be used.
## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         15     2 rmse    standard    61.1    25    2.10
## 2    0.0000000001         15    40 rmse    standard    79.8    25    1.81
## 3    0.1                  15     2 rmse    standard   351.     25    3.18
## 4    0.1                  15    40 rmse    standard   351.     25    3.18
## 5    0.0000000001          1     2 rmse    standard   507.     25    1.57
## # ... with 1 more variable: .config <chr>
tuned_model %>% collect_predictions() # out-of-sample predictions from every resample
## # A tibble: 358,672 x 8
##    id     .pred  .row cost_complexity tree_depth min_n turbine_capacity .config 
##    <chr>  <dbl> <int>           <dbl>      <int> <int>            <dbl> <chr>   
##  1 Boots~ 1363.     3    0.0000000001          1     2              600 Preproc~
##  2 Boots~ 1363.     5    0.0000000001          1     2              600 Preproc~
##  3 Boots~ 1363.     7    0.0000000001          1     2             1300 Preproc~
##  4 Boots~ 1363.    15    0.0000000001          1     2             1300 Preproc~
##  5 Boots~ 1363.    21    0.0000000001          1     2              660 Preproc~
##  6 Boots~ 2168.    22    0.0000000001          1     2             1800 Preproc~
##  7 Boots~ 1363.    23    0.0000000001          1     2              660 Preproc~
##  8 Boots~ 1363.    26    0.0000000001          1     2              660 Preproc~
##  9 Boots~ 1363.    29    0.0000000001          1     2              660 Preproc~
## 10 Boots~ 1363.    31    0.0000000001          1     2              660 Preproc~
## # ... with 358,662 more rows

Fit the final model

# finalize the model specification with the best hyperparameters (lowest RMSE)
final_model <- finalize_model(model_engine, select_best(tuned_model, metric = "rmse"))

# fit the finalized model on the training data
final_fit <- fit(final_model, turbine_capacity ~ ., data = turbines_training)

# fit on the training set and evaluate on the test set in one step
final_rs <- last_fit(final_model, turbine_capacity ~ ., split = turbines_split)

# see how well the model works on the test set
final_rs %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      73.6   Preprocessor1_Model1
## 2 rsq     standard       0.985 Preprocessor1_Model1

Let’s see how well our model works on the testing data

First, we predict turbine capacity for the testing data with final_fit.

Second, we combine the predictions with the true values from the testing data frame.

predict(final_fit, turbines_test)
## # A tibble: 1,618 x 1
##    .pred
##    <dbl>
##  1  1300
##  2  1300
##  3  1300
##  4  1300
##  5  1300
##  6  1300
##  7  1300
##  8  1300
##  9   660
## 10   660
## # ... with 1,608 more rows
check <- data.frame(predict(final_fit,turbines_test),
                    turbines_test$turbine_capacity)
check <- check %>% 
  transmute(
    truth = turbines_test.turbine_capacity,
    estimated = .pred)

check %>% ggplot(aes(truth, estimated)) +
  geom_point(alpha = 0.5, color = "midnightblue") +
  geom_abline(lty = 2, color = "gray50") +
  labs(
    title = "Assessing the model performance",
    subtitle = "Predicted and true turbine capacity in the testing dataset",
    x = "True capacity",
    y = "Predicted capacity",
    caption = ":)"
  ) +
  theme_ipsum_rc()

Reference: https://juliasilge.com/blog/wind-turbine/