Neste artigo, exploramos um conjunto de dados de compartilhamento de bicicletas usando a linguagem de programação R para realizar uma análise exploratória, preparar os dados, e construir um modelo de regressão linear. Utilizamos diversas bibliotecas do ecossistema R, como tidyverse, lubridate, summarytools, patchwork, paletteer, e tidymodels, para facilitar o processo de análise de dados e modelagem.
Inicialmente, configuramos o ambiente R para suprimir avisos e mensagens e garantir uma saída limpa. Utilizamos o pacote pacman para carregar dinamicamente outros pacotes necessários para nossa análise.
suppressWarnings(if(!require("pacman")) install.packages("pacman"))
## Carregando pacotes exigidos: pacman
pacman::p_load('tidyverse', 'tidymodels', 'lubridate', 'summarytools', 'patchwork', 'paletteer', 'lubridate')
Importamos o conjunto de dados usando read_csv do tidyverse e exploramos suas primeiras linhas e estrutura geral usando slice_head e glimpse, respectivamente.
bike_data <- read_csv("https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/daily-bike-share.csv", show_col_types = FALSE)
bike_data %>% slice_head(n = 7)
## # A tibble: 7 × 14
## instant dteday season yr mnth holiday weekday workingday weathersit temp
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1/1/20… 1 0 1 0 6 0 2 0.344
## 2 2 1/2/20… 1 0 1 0 0 0 2 0.363
## 3 3 1/3/20… 1 0 1 0 1 1 1 0.196
## 4 4 1/4/20… 1 0 1 0 2 1 1 0.2
## 5 5 1/5/20… 1 0 1 0 3 1 1 0.227
## 6 6 1/6/20… 1 0 1 0 4 1 1 0.204
## 7 7 1/7/20… 1 0 1 0 5 1 2 0.197
## # ℹ 4 more variables: atemp <dbl>, hum <dbl>, windspeed <dbl>, rentals <dbl>
glimpse(bike_data)
## Rows: 731
## Columns: 14
## $ instant <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday <chr> "1/1/2011", "1/2/2011", "1/3/2011", "1/4/2011", "1/5/2011",…
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
## $ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ rentals <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
Com a biblioteca lubridate, analisamos as datas e extraímos os dias para facilitar análises temporais. Selecionamos colunas específicas e as transformamos conforme necessário para melhor representar os dados.
bike_data <- bike_data %>%
mutate(dteday = mdy(dteday)) %>%
mutate(day = day(dteday))
bike_data %>%
slice_head(n = 10)
## # A tibble: 10 × 15
## instant dteday season yr mnth holiday weekday workingday weathersit
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## 7 7 2011-01-07 1 0 1 0 5 1 2
## 8 8 2011-01-08 1 0 1 0 6 0 2
## 9 9 2011-01-09 1 0 1 0 0 0 1
## 10 10 2011-01-10 1 0 1 0 1 1 1
## # ℹ 6 more variables: temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>,
## # rentals <dbl>, day <int>
Utilizamos summarytools para obter estatísticas resumidas das variáveis numéricas e ggplot2 (parte do tidyverse) para visualizar a distribuição dos aluguéis com histogramas e box plots. Com o patchwork, combinamos gráficos para uma comparação direta.
bike_data %>%
# Seleciona recursos e rótulo
select(c(temp, atemp, hum, windspeed, rentals)) %>%
# Estatísticas resumidas
descr(order = "preserve",
stats = c('mean', 'sd', 'min', 'q1', 'med', 'q3', 'max'),
round.digits = 6)
## Descriptive Statistics
## bike_data
## N: 731
##
## temp atemp hum windspeed rentals
## ------------- ---------- ---------- ---------- ----------- -------------
## Mean 0.495385 0.474354 0.627894 0.190486 848.176471
## Std.Dev 0.183051 0.162961 0.142429 0.077498 686.622488
## Min 0.059130 0.079070 0.000000 0.022392 2.000000
## Q1 0.336667 0.337746 0.520000 0.134950 315.000000
## Median 0.498333 0.486733 0.626667 0.180975 713.000000
## Q3 0.655833 0.609229 0.730417 0.233221 1097.000000
## Max 0.861667 0.840896 0.972500 0.507463 3410.000000
library(patchwork)
library(paletteer)
theme_set(theme_light())
hist_plt <- bike_data %>%
ggplot(mapping = aes(x = rentals)) +
geom_histogram(bins = 100, fill = "midnightblue", alpha = 0.7) +
geom_vline(aes(xintercept = mean(rentals), color = 'Média'), linetype = "dashed", linewidth = 1.3) +
geom_vline(aes(xintercept = median(rentals), color = 'Mediana'), linetype = "dashed", linewidth = 1.3 ) +
xlab("") +
ylab("Frequência") +
scale_color_manual(name = "", values = c(Média = "red", Mediana = "yellow")) +
theme(legend.position = c(0.9, 0.9), legend.background = element_blank())
hist_plt
box_plt <- bike_data %>%
ggplot(aes(x = rentals, y = 1)) +
geom_boxplot(fill = "#E69F00", color = "gray23", alpha = 0.7) +
# Adiciona títulos e rótulos
xlab("Aluguéis")+
ylab("")
box_plt
(hist_plt / box_plt) +
plot_annotation(title = 'Distribuição de Aluguéis',
theme = theme(plot.title = element_text(hjust = 0.5)))
Transformamos os dados para um formato longo, facilitando comparações e visualizações de várias variáveis numéricas e categóricas. Utilizamos pivot_longer, facet_wrap, e paletas de cores do paletteer para visualizações distintas.
numeric_features <- bike_data %>%
select(c(temp, atemp, hum, windspeed, rentals))
numeric_features
## # A tibble: 731 × 5
## temp atemp hum windspeed rentals
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.344 0.364 0.806 0.160 331
## 2 0.363 0.354 0.696 0.249 131
## 3 0.196 0.189 0.437 0.248 120
## 4 0.2 0.212 0.590 0.160 108
## 5 0.227 0.229 0.437 0.187 82
## 6 0.204 0.233 0.518 0.0896 88
## 7 0.197 0.209 0.499 0.169 148
## 8 0.165 0.162 0.536 0.267 68
## 9 0.138 0.116 0.434 0.362 54
## 10 0.151 0.151 0.483 0.223 41
## # ℹ 721 more rows
numeric_features <- numeric_features %>%
pivot_longer(!rentals, names_to = "recursos", values_to = "valores") %>%
group_by(recursos) %>%
mutate(Média = mean(valores),
Mediana = median(valores))
numeric_features
## # A tibble: 2,924 × 5
## # Groups: recursos [4]
## rentals recursos valores Média Mediana
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 331 temp 0.344 0.495 0.498
## 2 331 atemp 0.364 0.474 0.487
## 3 331 hum 0.806 0.628 0.627
## 4 331 windspeed 0.160 0.190 0.181
## 5 131 temp 0.363 0.495 0.498
## 6 131 atemp 0.354 0.474 0.487
## 7 131 hum 0.696 0.628 0.627
## 8 131 windspeed 0.249 0.190 0.181
## 9 120 temp 0.196 0.495 0.498
## 10 120 atemp 0.189 0.474 0.487
## # ℹ 2,914 more rows
numeric_features %>%
ggplot() +
geom_histogram(aes(x = valores, fill = recursos), bins = 100, alpha = 0.7, show.legend = F) +
facet_wrap(~ recursos, scales = 'free')+
paletteer::scale_fill_paletteer_d("ggthemes::excel_Parallax") +
geom_vline(aes(xintercept = Média, color = "Média"), linetype = "dashed", linewidth = 1.3 ) +
geom_vline(aes(xintercept = Mediana, color = "Mediana"), linetype = "dashed", linewidth = 1.3 ) +
scale_color_manual(name = "", values = c(Média = "red", Mediana = "yellow"))
categorical_features <- bike_data %>%
select(c(season, mnth, holiday, weekday, workingday, weathersit, day, rentals))
categorical_features
## # A tibble: 731 × 8
## season mnth holiday weekday workingday weathersit day rentals
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 1 0 6 0 2 1 331
## 2 1 1 0 0 0 2 2 131
## 3 1 1 0 1 1 1 3 120
## 4 1 1 0 2 1 1 4 108
## 5 1 1 0 3 1 1 5 82
## 6 1 1 0 4 1 1 6 88
## 7 1 1 0 5 1 2 7 148
## 8 1 1 0 6 0 2 8 68
## 9 1 1 0 0 0 1 9 54
## 10 1 1 0 1 1 1 10 41
## # ℹ 721 more rows
categorical_features <- categorical_features %>%
pivot_longer(!rentals, names_to = "recursos", values_to = "valores") %>%
group_by(recursos) %>%
mutate(valores = factor(valores))
categorical_features
## # A tibble: 5,117 × 3
## # Groups: recursos [7]
## rentals recursos valores
## <dbl> <chr> <fct>
## 1 331 season 1
## 2 331 mnth 1
## 3 331 holiday 0
## 4 331 weekday 6
## 5 331 workingday 0
## 6 331 weathersit 2
## 7 331 day 1
## 8 131 season 1
## 9 131 mnth 1
## 10 131 holiday 0
## # ℹ 5,107 more rows
categorical_features %>%
ggplot() +
geom_bar(aes(x = valores, fill = recursos), alpha = 0.7, show.legend = F) +
facet_wrap(~ recursos, scales = 'free') +
paletteer::scale_fill_paletteer_d("ggthemr::solarized") +
theme(
panel.grid = element_blank(),
axis.text.x = element_text(angle = 90))
numeric_features %>%
mutate(coef_corr = cor(valores, rentals)) %>%
mutate(recursos = paste(recursos, ' vs aluguéis, r = ', coef_corr, sep = '')) %>%
ggplot(aes(x = valores, y = rentals, color = recursos)) +
geom_point(alpha = 0.7, show.legend = F) +
facet_wrap(~ recursos, scales = 'free')+
paletteer::scale_color_paletteer_d("ggthemes::excel_Parallax")
numeric_features %>%
summarise(coef_corr = cor(valores, rentals))
## # A tibble: 4 × 2
## recursos coef_corr
## <chr> <dbl>
## 1 atemp 0.544
## 2 hum -0.0770
## 3 temp 0.543
## 4 windspeed -0.168
categorical_features %>%
ggplot() +
geom_boxplot(aes(x = valores, y = rentals, fill = recursos), alpha = 0.9, show.legend = F) +
facet_wrap(~ recursos, scales = 'free') +
paletteer::scale_fill_paletteer_d("tvthemes::simpsons")+
theme(
panel.grid = element_blank(),
axis.text.x = element_text(angle = 90))
bike_select <- bike_data %>%
select(c(season, mnth, holiday, weekday, workingday, weathersit,
temp, atemp, hum, windspeed, rentals)) %>%
mutate(across(1:6, factor))
glimpse(bike_select)
## Rows: 731
## Columns: 11
## $ season <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ mnth <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday <fct> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
## $ workingday <fct> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <fct> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ rentals <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
Realizamos previsões no conjunto de teste e avaliamos a precisão do modelo usando métricas como RMSE e R². Visualizamos a relação entre valores previstos e reais para entender a performance do modelo.
set.seed(2056)
bike_split <- bike_select %>%
initial_split(prop = 0.7,
strata = holiday)
bike_train <- training(bike_split)
bike_test <- testing(bike_split)
lm_spec <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
lm_mod <- lm_spec %>%
fit(rentals ~ ., data = bike_train)
lm_mod
## parsnip model object
##
##
## Call:
## stats::lm(formula = rentals ~ ., data = data)
##
## Coefficients:
## (Intercept) season2 season3 season4 mnth2 mnth3
## 890.95 302.62 204.77 74.71 -81.60 137.13
## mnth4 mnth5 mnth6 mnth7 mnth8 mnth9
## 46.33 -44.03 -258.90 -213.47 -200.88 39.64
## mnth10 mnth11 mnth12 holiday1 weekday1 weekday2
## 202.04 116.12 -53.85 523.43 -700.04 -769.93
## weekday3 weekday4 weekday5 weekday6 workingday1 weathersit2
## -778.10 -765.64 -600.51 223.64 NA -23.31
## weathersit3 temp atemp hum windspeed
## -447.03 2516.10 -462.18 -737.63 -1265.53
pred <- lm_mod %>%
predict(new_data = bike_test)
## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response", : prediction from rank-deficient fit; consider predict(.,
## rankdeficient="NA")
pred %>%
slice_head(n = 5)
## # A tibble: 5 × 1
## .pred
## <dbl>
## 1 992.
## 2 791.
## 3 83.7
## 4 -138.
## 5 770.
results <- bike_test %>%
bind_cols(lm_mod %>%
predict(new_data = bike_test) %>%
rename(previsões = .pred))
## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response", : prediction from rank-deficient fit; consider predict(.,
## rankdeficient="NA")
results
## # A tibble: 220 × 12
## season mnth holiday weekday workingday weathersit temp atemp hum
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl>
## 1 1 1 0 6 0 2 0.344 0.364 0.806
## 2 1 1 0 0 0 2 0.363 0.354 0.696
## 3 1 1 0 5 1 2 0.197 0.209 0.499
## 4 1 1 0 1 1 1 0.151 0.151 0.483
## 5 1 1 0 0 0 1 0.232 0.234 0.484
## 6 1 1 0 5 1 1 0.178 0.158 0.457
## 7 1 1 0 0 0 1 0.0965 0.0988 0.437
## 8 1 1 0 3 1 3 0.218 0.204 0.862
## 9 1 1 0 6 0 1 0.197 0.212 0.652
## 10 1 2 0 2 1 1 0.221 0.198 0.538
## # ℹ 210 more rows
## # ℹ 3 more variables: windspeed <dbl>, rentals <dbl>, previsões <dbl>
results %>%
select(c(rentals, previsões)) %>%
slice_head(n = 10)
## # A tibble: 10 × 2
## rentals previsões
## <dbl> <dbl>
## 1 331 992.
## 2 131 791.
## 3 148 83.7
## 4 41 -138.
## 5 251 770.
## 6 75 -120.
## 7 150 454.
## 8 34 -889.
## 9 123 846.
## 10 64 -351.
eval_metrics <- metric_set(rmse, mae, rsq)
eval_metrics(data = results,
truth = rentals,
estimate = previsões)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 395.
## 2 mae standard 296.
## 3 rsq standard 0.680
results %>%
ggplot(mapping = aes(x = rentals, y = previsões)) +
geom_point(size = 1.6, color = "steelblue") +
geom_smooth(method = "lm", se = F, color = 'magenta') +
ggtitle("Previsões do Aluguel Diário de Bicicletas") +
xlab("Rótulos Reais") +
ylab("Rótulos Previstos") +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
A análise de dados de compartilhamento de bicicletas demonstrou o poder das ferramentas de R para manipulação de dados, análise exploratória, e modelagem preditiva. Através de visualizações detalhadas, conseguimos insights significativos sobre os padrões de aluguel de bicicletas, e o modelo de regressão linear nos permitiu prever aluguéis com base em características do dia. Este processo ilustra como a ciência de dados pode ser aplicada para informar decisões em sistemas de compartilhamento de bicicletas e outras aplicações urbanas.
Este artigo foi uma introdução à análise e modelagem de dados com R, demonstrando a facilidade e eficácia de usar R para ciência de dados.