Comparing Exploratory Data Analysis Libraries

I have tested and reviewed a few Python packages for data processing and/or exploratory data analysis (EDA). Most of these packages attempt to automate parts of the data processing and/or EDA process, or provide a suite of functions to manipulate and visualize data.

The main objective here is to review and explore Python packages that will shorten the time needed for data processing and/or exploratory data analysis.

In summary, sweetviz may be the best option in a business setting (where the focus is on business understanding), while autoviz and pandas-profiling are solid choices in an R&D setting (where the focus is on deep-dive analysis).

Comparison

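Package                         | sweetviz | pandas-profiling | autoviz | lux-api | dtale  | dataprep
--------------------------------|----------|------------------|---------|---------|--------|---------
Version                         | 2.1.3    | 3.0.0            | 0.0.83  | 0.3.2   | 1.56.0 | 0.3.0
Recommended for exploration     | Yes      | No               | Yes     | No      | No     | Yes
Recommended for production      | No       | No               | No      | No      | No     | No
Ease of use                     | Yes      | Yes              | Yes     | Yes     | Yes    | Yes
Computation speed               | Fast     | Medium           | Fast    | Fast    | Fast   | Fast
Installation complexity         | Low      | Low              | Low     | Low     | Medium | Low
Target variable-centric         | Yes      | No               | Yes     | Yes     | No     | Yes
Missing data check              | No       | Yes              | No      | No      | Yes    | Yes
Per variable summary statistics | Yes      | Yes              | No      | No      | Yes    | Yes
AutoEDA focus                   | No       | Yes              | Yes     | No      | No     | Yes
Score                           | 4        | 3                | 2       | 2       | 1      | 1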
Table comparing autoEDA libraries for exploratory data analysis.
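
To shortlist packages by the features you need, the Yes/No rows of the table can be encoded as a small lookup. This is only a sketch: the dictionary keys and the `shortlist` helper are names made up for illustration, and the flags are transcribed from four rows of the table above.

```python
# Feature flags transcribed from the comparison table.
# Key names and the helper below are illustrative, not part of any library.
FEATURES = {
    "sweetviz":         {"target_centric": True,  "missing_data_check": False, "per_variable_stats": True,  "auto_eda": False},
    "pandas-profiling": {"target_centric": False, "missing_data_check": True,  "per_variable_stats": True,  "auto_eda": True},
    "autoviz":          {"target_centric": True,  "missing_data_check": False, "per_variable_stats": False, "auto_eda": True},
    "lux-api":          {"target_centric": True,  "missing_data_check": False, "per_variable_stats": False, "auto_eda": False},
    "dtale":            {"target_centric": False, "missing_data_check": True,  "per_variable_stats": True,  "auto_eda": False},
    "dataprep":         {"target_centric": True,  "missing_data_check": True,  "per_variable_stats": True,  "auto_eda": True},
}


def shortlist(required):
    """Return the packages that offer every feature named in `required`."""
    return [
        name
        for name, flags in FEATURES.items()
        if all(flags[feature] for feature in required)
    ]


# Packages that are target variable-centric and also check for missing data.
print(shortlist(["target_centric", "missing_data_check"]))
```

With the flags above, only dataprep satisfies both the target-centric and the missing-data requirement.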

Unique Points of Each Library

sweetviz

  • Beautiful and simple visualisation that is good for business explanation. 
  • Limited features beyond simple visualisation.
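
As a rough illustration of the target variable-centric workflow, a sweetviz report can be generated in a couple of calls. This is a minimal sketch, assuming sweetviz is installed and `df` is a pandas DataFrame; the wrapper function, its argument names and the output file name are hypothetical.

```python
def build_sweetviz_report(df, target=None, out_path="sweetviz_report.html"):
    """Write a sweetviz HTML report for `df`, optionally centred on `target`.

    `df` is assumed to be a pandas DataFrame and `target` the name of one of
    its columns (or None for a report without a target variable).
    """
    import sweetviz as sv  # third-party; imported lazily so the sketch loads without it

    report = sv.analyze(df, target_feat=target)
    # open_browser=False keeps the call usable on a headless machine.
    report.show_html(filepath=out_path, open_browser=False)
    return out_path
```

For example, `build_sweetviz_report(df, target="churn")` would relate each variable to a hypothetical `churn` column in the report.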

pandas-profiling

  • Good for generic data quality check.
  • No option to specify target variable for tailored EDA.
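
A generic data quality report of the kind described above can be produced as follows. This is a sketch under the assumption that pandas-profiling 3.x is installed and `df` is a pandas DataFrame; the wrapper function and the file name are illustrative. The package has since been renamed ydata-profiling, so the import may differ in newer releases.

```python
def build_profile_report(df, out_path="profile_report.html", minimal=False):
    """Write a pandas-profiling HTML report for the DataFrame `df`.

    minimal=True switches off the more expensive computations, which can
    help on larger datasets.
    """
    from pandas_profiling import ProfileReport  # third-party, imported lazily

    profile = ProfileReport(df, title="Data quality report", minimal=minimal)
    profile.to_file(out_path)
    return out_path
```

Note that there is no target argument here, which matches the limitation listed above.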

autoviz

  • Holistic visualisations that are good for deep dive analysis. 
  • Claims to select plots/analyses intelligently, but I have yet to see the impact.

lux-api

  • Generates very simple single-variable or pairwise distribution plots.
  • Plot of interest selection is manual.
  • Charts are not embedded in the notebook output (they are available as widgets only).

dtale

  • Very good as a Python-based data manipulation tool with GUI.
  • Contains a full suite of tools for data manipulation/analysis/visualisation, almost matching similar commercial data analytics tools.
  • Too much manual effort is required to set up all the analyses/plots needed.
  • No dashboard deployment support.
  • No option to specify target variable for tailored EDA.
  • Complex dependencies on many libraries.

dataprep

  • EDA module is like an extension to pandas-profiling. Very comprehensive.
  • Contains three modules: one each to collect, explore and clean data.
  • Potentially faster on large data due to use of Dask.
  • Possible to investigate selected columns/variables in depth.
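
The column-level exploration in dataprep's EDA module can be sketched as below, assuming dataprep is installed and `df` is a pandas DataFrame. The wrapper function and its argument names are hypothetical; only `plot` from `dataprep.eda` is the library's own.

```python
def explore_columns(df, column=None, second_column=None):
    """Run dataprep's `plot` at increasing levels of detail.

    - no column given: overview of every column in `df`
    - one column:      detailed view of that column
    - two columns:     relationship between the two columns
    """
    from dataprep.eda import plot  # third-party, imported lazily

    if column is None:
        return plot(df)
    if second_column is None:
        return plot(df, column)
    return plot(df, column, second_column)
```

In a notebook, the returned object renders inline, so `explore_columns(df, "age")` would show the detailed view for a hypothetical `age` column.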
