I have tested and reviewed a few Python packages for data processing and/or exploratory data analysis (EDA). Most of these packages attempt to automate parts of the data processing and/or EDA process, or provide a suite of functions to manipulate and visualize data.
The main objective here is to review and explore Python packages that will shorten the time needed for data processing and/or exploratory data analysis.
In summary, sweetviz may be the best option in a business setting (with a focus on business understanding), while autoviz and pandas-profiling are solid choices in an R&D setting (with a focus on deep-dive analysis).
Comparison
Package | sweetviz | pandas-profiling | autoviz | lux-api | dtale | dataprep |
---|---|---|---|---|---|---|
Version | 2.1.3 | 3.0.0 | 0.0.83 | 0.3.2 | 1.56.0 | 0.3.0 |
Recommended for exploration | Yes | No | Yes | No | No | Yes |
Recommended for production | No | No | No | No | No | No |
Ease of use | Yes | Yes | Yes | Yes | Yes | Yes |
Computation speed | Fast | Medium | Fast | Fast | Fast | Fast |
Installation complexity | Low | Low | Low | Low | Medium | Low |
Target variable-centric | Yes | No | Yes | Yes | No | Yes |
Missing data check | No | Yes | No | No | Yes | Yes |
Per variable summary statistics | Yes | Yes | No | No | Yes | Yes |
AutoEDA focus | No | Yes | Yes | No | No | Yes |
Score | 4 | 3 | 2 | 2 | 1 | 1 |
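To make three of the criteria above concrete (missing data check, per-variable summary statistics, target variable-centric analysis), here is a minimal pandas sketch of what such a report computes when done by hand. The helper `quick_eda`, the toy DataFrame and its column names are hypothetical and only for illustration; none of the reviewed packages is claimed to work this way internally.

```python
import pandas as pd


def quick_eda(df: pd.DataFrame, target: str) -> dict:
    """Hand-rolled versions of three criteria from the comparison table.

    Assumes `target` names a numeric column of `df`.
    """
    numeric = df.select_dtypes("number")
    return {
        # Missing data check: fraction of missing values per column.
        "missing_ratio": df.isna().mean(),
        # Per-variable summary statistics for the numeric columns.
        "summary": numeric.describe(),
        # Target-centric view: correlation of each numeric feature with the target.
        "target_corr": numeric.corr()[target].drop(target),
    }


# Hypothetical toy data, for illustration only.
demo = pd.DataFrame(
    {
        "age": [25, 32, None, 51],
        "income": [30.0, 45.0, 52.0, 80.0],
        "churned": [0, 0, 1, 1],
    }
)

result = quick_eda(demo, target="churned")
print(result["missing_ratio"])
```

The packages in the table automate and visualise checks of this kind, so the rows mainly indicate which of them each package produces without extra code.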
Unique Points of Each Library
sweetviz
- Clean, simple visualisations that work well for explaining data to a business audience.
- Limited features beyond simple visualisation.
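A target-centric sweetviz report can be sketched roughly as follows. `sv.analyze` with `target_feat` and `show_html` are part of the sweetviz API; the wrapper `build_sweetviz_report`, the toy data and the output file name are assumptions made for this example.

```python
import pandas as pd


def build_sweetviz_report(df: pd.DataFrame, target: str):
    # Imported lazily so the helper can be defined even where sweetviz
    # is not installed; the import only runs when the function is called.
    import sweetviz as sv

    # target_feat asks sweetviz to relate every feature to the target column.
    return sv.analyze(df, target_feat=target)


# Hypothetical toy data; sweetviz accepts numerical or boolean targets.
demo = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "income": [30.0, 45.0, 52.0, 80.0],
        "churned": [0, 0, 1, 1],
    }
)
```

In a notebook or script, `build_sweetviz_report(demo, "churned").show_html("sweetviz_report.html", open_browser=False)` should write a standalone HTML report; a real dataset would normally replace the four-row toy frame.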
pandas-profiling
- Good for generic data quality checks.
- No option to specify target variable for tailored EDA.
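For comparison, a generic data quality report with pandas-profiling might look like the sketch below. `ProfileReport`, its `title` and `minimal` arguments, and `to_file` exist in the library; the wrapper name `build_profile` and the default title are hypothetical. Note that the call takes no target column.

```python
import pandas as pd


def build_profile(df: pd.DataFrame, title: str = "Data quality report"):
    # Imported lazily so the helper can be defined without pandas-profiling
    # installed; the import only runs when the function is called.
    from pandas_profiling import ProfileReport

    # minimal=True skips the more expensive computations,
    # which helps keep the report fast on wider tables.
    return ProfileReport(df, title=title, minimal=True)
```

Calling `build_profile(df).to_file("profile.html")` on a DataFrame `df` should then write the report to disk.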
autoviz
- Holistic visualisations that are good for deep dive analysis.
- Claims to select plots/analyses smartly, but I have yet to see the impact.
lux-api
- Generates very simple distribution plots for single variables or pairs of variables.
- Plots of interest have to be selected manually.
- Charts are available only as widgets and are not integrated into the notebook.
dtale
- Very good as a Python-based data manipulation tool with GUI.
- Contains a full suite of tools for data manipulation/analysis/visualisation, almost matching similar commercial data analytics tools.
- Setting up all the required analyses/plots takes too much manual effort.
- No dashboard deployment support.
- No option to specify target variable for tailored EDA.
- Complex dependencies on many libraries.
dataprep
- The EDA module is like an extension of pandas-profiling and is very comprehensive.
- Contains 3 modules, to collect data, explore data and clean data.
- Potentially faster on large data due to use of Dask.
- Possible to deep dive into selected columns/variables.
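The column-level deep dive mentioned above could look roughly like this. `plot` and `plot_missing` are functions in `dataprep.eda`; the wrapper `explore_column` is hypothetical, and the exact rendering behaviour may differ between dataprep versions.

```python
import pandas as pd


def explore_column(df: pd.DataFrame, column: str):
    # Imported lazily so the helper can be defined without dataprep
    # installed; the import only runs when the function is called.
    from dataprep.eda import plot, plot_missing

    # plot(df, column) focuses the analysis on one column instead of the
    # whole table; plot_missing(df, column) looks at the effect of that
    # column's missing values.
    return plot(df, column), plot_missing(df, column)
```

In a notebook, the two returned objects can be displayed one after the other to inspect a single variable in detail.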
Hi Yee Lim,
Thanks for your writeup on the various EDA libraries that you assessed. Do you plan to include the dataprep EDA library in the future?
https://dataprep.ai/