#Load the libraries needed for the analysis
## 'data.frame': 10886 obs. of 12 variables:
## $ datetime : chr "2011-01-01 00:00:00" "2011-01-01 01:00:00" "2011-01-01 02:00:00" "2011-01-01 03:00:00" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : int 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num 0 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ count : int 16 40 32 13 1 1 2 3 8 14 ...
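The call that produced this structure listing is not shown. A minimal sketch of the loading step, assuming the data comes from a CSV file: the temporary file below is only a stand-in holding the first four rows printed above, and `bike` is a hypothetical object name.

```r
# Stand-in for the real file: the first four rows from the str() output.
# (atemp is shown rounded there, so these values are approximate.)
csv_text <- c(
  "datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count",
  "2011-01-01 00:00:00,1,0,0,1,9.84,14.4,81,0,3,13,16",
  "2011-01-01 01:00:00,1,0,0,1,9.02,13.6,80,0,8,32,40",
  "2011-01-01 02:00:00,1,0,0,1,9.02,13.6,80,0,5,27,32",
  "2011-01-01 03:00:00,1,0,0,1,9.84,14.4,75,0,3,10,13"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# In the real analysis, `path` would point at the full 10886-row file.
bike <- read.csv(path, stringsAsFactors = FALSE)
str(bike)

# In the rows shown, count equals casual + registered.
all(bike$casual + bike$registered == bike$count)
```

Note that `datetime` is read as character; it would need an explicit conversion before any time-based analysis.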
## # A tibble: 110 × 3
## var1 var2 coef_corr
## <fct> <fct> <dbl>
## 1 holiday season 0.0294
## 2 workingday season -0.00813
## 3 weather season 0.00888
## 4 temp season 0.259
## 5 atemp season 0.265
## 6 humidity season 0.191
## 7 windspeed season -0.147
## 8 casual season 0.0968
## 9 registered season 0.164
## 10 count season 0.163
## # … with 100 more rows
#Keeps each pair of variables only once (55 of the 110 correlation rows)
## # A tibble: 55 × 3
## var1 var2 coef_corr
## <fct> <fct> <dbl>
## 1 holiday season 0.0294
## 2 workingday season -0.00813
## 3 weather season 0.00888
## 4 temp season 0.259
## 5 atemp season 0.265
## 6 humidity season 0.191
## 7 windspeed season -0.147
## 8 casual season 0.0968
## 9 registered season 0.164
## 10 count season 0.163
## # … with 45 more rows
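The drop from 110 to 55 rows is consistent with keeping each variable pair once: 11 variables give 11 x 10 = 110 ordered pairs and 55 unordered ones. A base-R sketch of that reduction on synthetic data follows; the data frame `df` is made up for illustration, and this is not claimed to be how dlookr does it internally.

```r
set.seed(1)
# Synthetic stand-in with 11 numeric columns, like the bike data.
df <- as.data.frame(matrix(rnorm(11 * 50), ncol = 11))

# Long format: one row per ordered pair, mirroring var1 / var2 / coef_corr.
long <- as.data.frame(as.table(cor(df)))
names(long) <- c("var1", "var2", "coef_corr")

# Drop the diagonal (a variable paired with itself).
all_pairs <- long[as.character(long$var1) != as.character(long$var2), ]

# Keep only one ordering of each pair (lower triangle of the matrix).
half_pairs <- long[as.integer(long$var1) > as.integer(long$var2), ]

nrow(all_pairs)   # 110
nrow(half_pairs)  # 55
```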
#Univariate
#The following is a list of the EDA functions included in the dlookr package.
#Provides descriptive statistics for numerical data.
## # A tibble: 11 × 26
## described_…¹ n na mean sd se_mean IQR skewness kurto…² p00
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 season 10886 0 2.51e+0 1.12 0.0107 2 -0.00708 -1.36 1
## 2 holiday 10886 0 2.86e-2 0.167 0.00160 0 5.66 30.0 0
## 3 workingday 10886 0 6.81e-1 0.466 0.00447 1 -0.776 -1.40 0
## 4 weather 10886 0 1.42e+0 0.634 0.00607 1 1.24 0.396 1
## 5 temp 10886 0 2.02e+1 7.79 0.0747 12.3 0.00369 -0.915 0.82
## 6 atemp 10886 0 2.37e+1 8.47 0.0812 14.4 -0.103 -0.850 0.76
## 7 humidity 10886 0 6.19e+1 19.2 0.184 30 -0.0863 -0.760 0
## 8 windspeed 10886 0 1.28e+1 8.16 0.0783 10.0 0.589 0.630 0
## 9 casual 10886 0 3.60e+1 50.0 0.479 45 2.50 7.55 0
## 10 registered 10886 0 1.56e+2 151. 1.45 186 1.52 2.63 0
## 11 count 10886 0 1.92e+2 181. 1.74 242 1.24 1.30 1
## # … with 16 more variables: p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## # p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## # p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>, and
## # abbreviated variable names ¹​described_variables, ²​kurtosis
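Several of these columns are simple functions of the others. For `count`, for example, 181 / sqrt(10886) is about 1.73, in line with the reported `se_mean` of 1.74 given that `sd` is displayed rounded. A base-R sketch of those relationships on a synthetic vector (not the bike data, and not dlookr's own code):

```r
set.seed(42)
x <- rpois(1000, lambda = 20)   # synthetic counts

n       <- length(x)
na      <- sum(is.na(x))
se_mean <- sd(x) / sqrt(n)      # standard error of the mean
iqr     <- IQR(x)               # distance between the 25th and 75th percentiles
pcts    <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

c(n = n, na = na, mean = mean(x), sd = sd(x), se_mean = se_mean, IQR = iqr)
pcts
```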
#Performs a normality test (Shapiro-Wilk) on numerical data; the results can also be visualized.
## # A tibble: 11 × 4
## vars statistic p_value sample
## <chr> <dbl> <dbl> <dbl>
## 1 season 0.857 2.32e-55 5000
## 2 holiday 0.160 8.97e-92 5000
## 3 workingday 0.590 6.16e-76 5000
## 4 weather 0.659 4.15e-72 5000
## 5 temp 0.982 5.74e-25 5000
## 6 atemp 0.982 6.88e-25 5000
## 7 humidity 0.982 1.89e-24 5000
## 8 windspeed 0.956 2.34e-36 5000
## 9 casual 0.702 2.43e-69 5000
## 10 registered 0.855 1.05e-55 5000
## 11 count 0.878 1.10e-52 5000
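The `statistic` and `p_value` columns read like Shapiro-Wilk results, and the `sample` value of 5000 matches the upper limit accepted by base R's `shapiro.test()`. A sketch of the same idea on synthetic right-skewed data; the subsampling step is an assumption about how a 10886-row column would be handled.

```r
set.seed(123)
x <- rexp(10886, rate = 1 / 36)   # synthetic, right-skewed values

# shapiro.test() accepts between 3 and 5000 values, so subsample first.
x_sub <- sample(x, size = 5000)
res   <- shapiro.test(x_sub)

data.frame(
  statistic = unname(res$statistic),
  p_value   = res$p.value,
  sample    = length(x_sub)
)
```

A statistic close to 1 with a large p-value would be compatible with normality; none of the variables in the table above comes close on the p-value.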
#Bivariate
#Select the variable to compute
#Diagnoses the outliers of the numeric (continuous and discrete) variables.
##     variables outliers_cnt outliers_ratio outliers_mean   with_mean without_mean
## 1      season            0    0.000000000           NaN   2.5066140    2.5066140
## 2     holiday          311    2.856880397       1.00000   0.0285688    0.0000000
## 3  workingday            0    0.000000000           NaN   0.6808745    0.6808745
## 4     weather            1    0.009186111       4.00000   1.4184273    1.4181902
## 5        temp            0    0.000000000           NaN  20.2308598   20.2308598
## 6       atemp            0    0.000000000           NaN  23.6550841   23.6550841
## 7    humidity           22    0.202094433       0.00000  61.8864597   62.0117820
## 8   windspeed          227    2.085247106      36.58932  12.7993954   12.2927519
## 9      casual          749    6.880396840     181.91856  36.0219548   25.2419848
## 10 registered          423    3.885724784     631.32624 155.5521771  136.3174998
## 11      count          300    2.755833180     751.11667 191.5741319  175.7170792
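The columns of this table can be reproduced in spirit with the boxplot rule (points beyond 1.5 x IQR from the hinges). The sketch below uses base R's `boxplot.stats()`; that dlookr applies exactly this rule is an assumption, and `diagnose_one` is a hypothetical helper name.

```r
diagnose_one <- function(x) {
  out <- boxplot.stats(x)$out   # points beyond 1.5 * IQR from the hinges
  data.frame(
    outliers_cnt   = length(out),
    outliers_ratio = 100 * length(out) / length(x),  # in percent
    outliers_mean  = mean(out),                      # NaN when there are none
    with_mean      = mean(x),
    without_mean   = mean(x[!x %in% out])
  )
}

# Mostly-zero indicator, similar in shape to `holiday`.
x <- c(rep(0, 97), rep(1, 3))
diagnose_one(x)
```

This also explains the `holiday` row: when almost every value is 0, the IQR is 0 and every 1 is flagged, so the "outliers" there are simply the holidays rather than suspicious measurements.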
#Calculate the correlation coefficient between two numerical data and provide visualization.
## # A tibble: 110 × 3
## var1 var2 coef_corr
## <fct> <fct> <dbl>
## 1 holiday season 0.0294
## 2 workingday season -0.00813
## 3 weather season 0.00888
## 4 temp season 0.259
## 5 atemp season 0.265
## 6 humidity season 0.191
## 7 windspeed season -0.147
## 8 casual season 0.0968
## 9 registered season 0.164
## 10 count season 0.163
## # … with 100 more rows
#Defines the target variable
#Describes the relationship with the variables of interest corresponding to the target variable.
##
## Call:
## lm(formula = formula_str, data = data)
##
## Coefficients:
## (Intercept) windspeed
## 2.76405 -0.02011
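The `Call:` line shows that the relationship is summarized with `lm()`. Which target was defined is not shown, but a least-squares line passes through the point of means, and 2.76405 - 0.02011 x 12.7994 is about 2.5067, which matches the mean of `season` reported earlier; so `season` was presumably the target and `windspeed` the predictor. A base-R sketch of the equivalent fit on synthetic data (only the column names are taken from the output):

```r
set.seed(7)
# Synthetic stand-in for the bike data.
bike <- data.frame(windspeed = runif(500, min = 0, max = 40))
bike$season <- 2.76 - 0.02 * bike$windspeed + rnorm(500, sd = 1.1)

fit <- lm(season ~ windspeed, data = bike)
coef(fit)

# An OLS line with an intercept always passes through the point of means.
pred_at_mean <- coef(fit)[["(Intercept)"]] +
  coef(fit)[["windspeed"]] * mean(bike$windspeed)
all.equal(pred_at_mean, mean(bike$season))
```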
#Visualizes the relationship between the variable of interest and the target variable.
#dlookr provides two automated EDA reports:
#eda_web_report(): web page-based dynamic reports that support in-depth analysis through visualizations and statistical tables.
#eda_paged_report(): static reports, generated as pdf or html files, that can be archived as the output of a data analysis.
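A sketch of how the two report functions might be called. The argument names (`target`, `output_dir`, `output_file`, `output_format`, `browse`) are written from memory of the dlookr documentation and should be checked against the installed version; `make_reports`, the file names, and the `bike` object are hypothetical.

```r
make_reports <- function(data, target, out_dir = tempdir()) {
  if (!requireNamespace("dlookr", quietly = TRUE)) {
    stop("Package 'dlookr' is required for the EDA reports.")
  }
  # Dynamic, web page-based report.
  dlookr::eda_web_report(
    data, target = target,
    output_dir = out_dir, output_file = "eda_web.html",
    browse = FALSE
  )
  # Static, paginated report (pdf or html).
  dlookr::eda_paged_report(
    data, target = target, output_format = "pdf",
    output_dir = out_dir, output_file = "eda_paged.pdf",
    browse = FALSE
  )
  invisible(out_dir)
}

# Usage (not run here, since rendering is slow and needs extra tooling):
# make_reports(bike, target = "count")
```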