Here we try automating exploratory data analysis with DataExplorer.
We start by loading a hospital dataset from Kaggle into RStudio.
library(readr)
healthcare<-read_csv("C:/Users/email/Downloads/Healthcare_Investments_and_Hospital_Stay (1).csv")
Rows: 518 Columns: 6
── Column specification ─────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Location
dbl (5): Time, Hospital_Stay, MRI_Units, CT_Scanners, Hospital_Beds
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(healthcare)
spc_tbl_ [518 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Location : chr [1:518] "AUS" "AUS" "AUS" "AUS" ...
$ Time : num [1:518] 1992 1994 1995 1996 1997 ...
$ Hospital_Stay: num [1:518] 6.6 6.4 6.5 6.4 6.2 6.1 6.2 6.1 6.2 6.2 ...
$ MRI_Units : num [1:518] 1.43 2.36 2.89 2.96 3.53 4.51 6.01 3.52 3.79 3.74 ...
$ CT_Scanners : num [1:518] 16.7 18.5 20.6 21.9 23.3 ...
$ Hospital_Beds: num [1:518] 1.43 2.36 2.89 2.96 3.53 4.51 6.01 3.52 3.79 3.74 ...
- attr(*, "spec")=
.. cols(
.. Location = col_character(),
.. Time = col_double(),
.. Hospital_Stay = col_double(),
.. MRI_Units = col_double(),
.. CT_Scanners = col_double(),
.. Hospital_Beds = col_double()
.. )
- attr(*, "problems")=<externalptr>
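The message above suggests specifying the column types, so here is a minimal sketch of what that could look like. It reads a tiny inline sample rather than the Kaggle file; the rows in `sample_csv` are made up for illustration, and treating `Time` as an integer is our own assumption, not something the dataset requires. Passing literal data through `I()` needs readr 2.0 or later.

```r
library(readr)

# Hypothetical two-row sample with the same header as the Kaggle file.
# The values are invented purely to make the example self-contained.
sample_csv <- I("Location,Time,Hospital_Stay,MRI_Units,CT_Scanners,Hospital_Beds
AAA,2000,6.5,1.0,10.0,3.0
BBB,2001,7.0,2.0,12.0,4.0
")

# Explicit column specification: supplying col_types also silences
# the "Column specification" message that read_csv() prints by default.
healthcare_spec <- cols(
  Location      = col_character(),
  Time          = col_integer(),   # assumption: years stored as whole numbers
  Hospital_Stay = col_double(),
  MRI_Units     = col_double(),
  CT_Scanners   = col_double(),
  Hospital_Beds = col_double()
)

toy <- read_csv(sample_csv, col_types = healthcare_spec)
str(toy)
```

For the real file, the same `col_types = healthcare_spec` argument could be added to the `read_csv()` call shown earlier.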
Now we load DataExplorer, which will create a report with some high-level summaries and graphs:
library(DataExplorer)
create_report(healthcare)
processing file: report.rmd
|
| | 0%
|
|. | 2%
|
|.. | 5% [global_options]
|
|... | 7%
|
|.... | 10% [introduce]
|
|..... | 12%
|
|...... | 14% [plot_intro]
|
|....... | 17%
|
|........ | 19% [data_structure]
|
|......... | 21%
|
|.......... | 24% [missing_profile]
|
|.......... | 26%
|
|........... | 29% [univariate_distribution_header]
|
|............ | 31%
|
|............. | 33% [plot_histogram]
|
|.............. | 36%
|
|............... | 38% [plot_density]
|
|................ | 40%
|
|................. | 43% [plot_frequency_bar]
|
|.................. | 45%
|
|................... | 48% [plot_response_bar]
|
|.................... | 50%
|
|..................... | 52% [plot_with_bar]
|
|...................... | 55%
|
|....................... | 57% [plot_normal_qq]
|
|........................ | 60%
|
|......................... | 62% [plot_response_qq]
|
|.......................... | 64%
|
|........................... | 67% [plot_by_qq]
|
|............................ | 69%
|
|............................. | 71% [correlation_analysis]
|
|.............................. | 74%
|
|.............................. | 76% [principal_component_analysis]
|
|............................... | 79%
|
|................................ | 81% [bivariate_distribution_header]
|
|................................. | 83%
|
|.................................. | 86% [plot_response_boxplot]
|
|................................... | 88%
|
|.................................... | 90% [plot_by_boxplot]
|
|..................................... | 93%
|
|...................................... | 95% [plot_response_scatterplot]
|
|....................................... | 98%
|
|........................................| 100% [plot_by_scatterplot]
"C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/pandoc" +RTS -K512m -RTS "C:\Users\email\OneDrive\Documents\report.knit.md" --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output pandocd601c233cc0.html --lua-filter "C:\Users\email\AppData\Local\R\win-library\4.2\rmarkdown\rmarkdown\lua\pagebreak.lua" --lua-filter "C:\Users\email\AppData\Local\R\win-library\4.2\rmarkdown\rmarkdown\lua\latex-div.lua" --embed-resources --standalone --variable bs3=TRUE --section-divs --table-of-contents --toc-depth 6 --template "C:\Users\email\AppData\Local\R\win-library\4.2\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable theme=yeti --mathjax --variable "mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --include-in-header "C:\Users\email\AppData\Local\Temp\RtmpAPyRrA\rmarkdown-strd6013262276.html"
This one function creates Basic Statistics, Raw Counts, Percentages, Data Structure, Missing Data Profile, Univariate Distribution, Histogram, Bar Chart (with frequency), QQ Plots, Correlation Analysis, and a Principal Component Analysis.
The missing data profile is nice. In less than 2 minutes, we get a high-level view of any missing values in the dataset.
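If only the missing data profile is needed, it can probably be produced without building the whole report. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the healthcare data) and computes the counts in base R first, then calls DataExplorer's `profile_missing()` and `plot_missing()` if the package is installed.

```r
# Toy data frame with invented values and a few deliberate NAs.
toy <- data.frame(
  Location      = c("AAA", "AAA", "BBB", "BBB"),
  Hospital_Stay = c(6.5, NA, 7.0, 7.2),
  MRI_Units     = c(1.0, 2.0, NA, NA)
)

# Base R: number and share of missing values per column.
missing_profile <- data.frame(
  feature     = names(toy),
  num_missing = unname(colSums(is.na(toy))),
  pct_missing = unname(colMeans(is.na(toy)))
)
print(missing_profile)

# The same idea with DataExplorer, guarded so the sketch still runs
# when the package is not available.
if (requireNamespace("DataExplorer", quietly = TRUE)) {
  print(DataExplorer::profile_missing(toy))  # table of missing counts
  DataExplorer::plot_missing(toy)            # bar chart of missing share
}
```

On the real data, `plot_missing(healthcare)` would be the direct equivalent of the last call.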
DataExplorer has other helpful functions. For example, to get introduced to your newly loaded dataset:
introduce(healthcare)
To visualize the output of introduce() as a plot (with some light analysis):
plot_intro(healthcare)
To visualize frequency distributions for all discrete features:
plot_bar(healthcare)
To visualize distributions for all continuous features:
plot_histogram(healthcare)
To visualize a correlation heatmap for all features, after dropping rows with missing values:
plot_correlation(na.omit(healthcare), maxcat = 5L)
Warning: Ignored all discrete features since `maxcat` set to 5 categories!
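The warning appears because discrete features with more than `maxcat` categories are left out of the heatmap, and here that presumably means Location. If only the numeric columns are of interest, one option is to ask for them explicitly with `type = "continuous"`. The sketch below uses a made-up data frame (`toy` is hypothetical) and also shows the equivalent base R correlation matrix, so the numbers behind the heatmap can be inspected directly.

```r
# Toy data frame with invented values; one row has a missing MRI_Units.
toy <- data.frame(
  Location      = c("AAA", "AAA", "BBB", "BBB", "CCC"),
  Hospital_Stay = c(6.5, 6.4, 7.0, 7.2, 5.9),
  MRI_Units     = c(1.0, 2.0, 3.5, NA, 4.0),
  CT_Scanners   = c(10, 12, 15, 16, 18)
)

# Drop rows with any missing value, as in the call above.
complete <- na.omit(toy)

# Base R: correlation matrix of the numeric columns only.
numeric_cols <- complete[sapply(complete, is.numeric)]
cor_matrix <- cor(numeric_cols)
print(round(cor_matrix, 2))

# DataExplorer: restrict the heatmap to continuous features, so no
# discrete column has to be ignored. Guarded in case the package is missing.
if (requireNamespace("DataExplorer", quietly = TRUE)) {
  DataExplorer::plot_correlation(complete, type = "continuous")
}
```

Raising `maxcat` instead would keep Location in the plot, at the cost of one row and column per category.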