Get instant insights and key takeaways from this YouTube video by PyData.
Clustering Algorithm Taxonomy
- Clustering algorithms are broadly categorized as flat vs. hierarchical, depending on whether they produce a single partition or a nested hierarchy of resolutions, and by method as centroid-based/parametric or density-based.
- Centroid-based methods (like K-means) make assumptions about cluster shape (e.g., Gaussian spheres) and work well on small datasets or in low dimensions.
- Density-based methods (like DBSCAN) let the data define cluster shapes but typically require more data to be effective.
- The complete taxonomy thus has four quadrants: flat/centroid-based, flat/density-based, hierarchical/centroid-based, and hierarchical/density-based; one representative of each quadrant is sketched below.
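To make the quadrants concrete, here is a minimal sketch, assuming scikit-learn >= 1.3 (which ships HDBSCAN) and using AgglomerativeClustering as a stand-in for the hierarchical/centroid-based quadrant; the parameter values are illustrative only.

```python
# One representative estimator per quadrant of the clustering taxonomy.
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, HDBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

quadrants = {
    "flat / centroid-based":         KMeans(n_clusters=2, n_init="auto"),
    "flat / density-based":          DBSCAN(eps=0.2, min_samples=5),
    "hierarchical / centroid-based": AgglomerativeClustering(n_clusters=2),
    "hierarchical / density-based":  HDBSCAN(min_cluster_size=10),
}

for name, model in quadrants.items():
    labels = model.fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # label -1 marks noise in density methods
    print(f"{name}: {n_clusters} clusters")
```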
HDBSCAN: Hierarchical Density-Based Clustering
- HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is presented as a solution that combines the advantages of density-based clustering (robustness to noise, clusters of variable shape) with a hierarchical structure (addressing the resolution-parameter problem).
- The core concept is to estimate the underlying probability density function and define clusters as connected components of level sets of that function.
- The algorithm efficiently approximates the density function locally, using concepts similar to DBSCAN's `eps` and `min_points` parameters, then builds a hierarchical tree (dendrogram) by observing how connected components merge as the density threshold drops.
- The merging process uses the mutual reachability distance, which incorporates both Euclidean distance and density considerations, allowing the resulting tree structure to support robust cuts for final cluster selection; a sketch of this distance follows below.
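Concretely, with core_k(x) denoting the distance from x to its k-th nearest neighbor, the mutual reachability distance is d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)). A minimal NumPy sketch, assuming Euclidean distance (the function name is illustrative, not from any library):

```python
import numpy as np
from scipy.spatial.distance import cdist

def mutual_reachability(X, k=5):
    """Pairwise mutual reachability distances for the rows of X."""
    dist = cdist(X, X)  # plain Euclidean distances
    # Core distance: distance to the k-th nearest neighbor. Each sorted row
    # includes the point itself at distance 0, so the k-th neighbor is index k.
    core = np.sort(dist, axis=1)[:, k]
    # d_mreach(a, b) = max(core(a), core(b), d(a, b))
    return np.maximum(dist, np.maximum(core[:, None], core[None, :]))

X = np.random.default_rng(0).normal(size=(100, 2))
print(mutual_reachability(X).shape)  # (100, 100)
```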
Computational Efficiency and Comparison
- The computational bottleneck in building the hierarchical tree is the minimum spanning tree calculation, traditionally O(n^2) when run over the complete graph of pairwise distances.
- The presentation introduces an accelerated approach using the dual-tree Boruvka algorithm combined with spatial indexing trees (like KD-trees), achieving a theoretical complexity closer to O(n log n) instead of O(n^2).
- Experimental comparisons showed this accelerated HDBSCAN implementation was comparable in timing to a single, optimally parameterized run of traditional DBSCAN, suggesting that one run of HDBSCAN delivers meaningful results for the cost of one well-tuned DBSCAN run.
- Because it inherently generates a hierarchy, HDBSCAN is recommended even when a flat cut (like the one DBSCAN provides) is desired, since it eliminates the need to search for the optimal resolution parameter beforehand; a usage sketch follows below.
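A minimal usage sketch, assuming the standalone `hdbscan` package (`pip install hdbscan`), which exposes the dual-tree Boruvka construction as an algorithm option; the data and parameters here are illustrative:

```python
import numpy as np
import hdbscan

X = np.random.default_rng(0).normal(size=(2000, 2))

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    algorithm="boruvka_kdtree",  # dual-tree Boruvka over a KD-tree index
)
labels = clusterer.fit_predict(X)
print("clusters:", len(set(labels) - {-1}),
      "| noise points:", int((labels == -1).sum()))
```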
Key Points & Insights
- HDBSCAN is the recommended cutting-edge algorithm because it addresses the limitations of traditional methods like K-means (shape assumptions) and DBSCAN (a single resolution-parameter setting).
- Researchers should consider UMAP over t-SNE for visualizing data prior to clustering, as t-SNE embeddings can be unstable across runs, undermining user trust in the subsequent clustering results.
- The HDBSCAN implementation is integrated into scikit-learn, allowing immediate substitution for existing clustering algorithms like K-means or hierarchical clustering in current workflows; a drop-in sketch follows below.
- A cluster, in this context, is defined as a connected component of a level set of the underlying, unknown probability density function governing the data.
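A minimal drop-in sketch, assuming scikit-learn >= 1.3 and the `umap-learn` package; swapping a K-means step for HDBSCAN requires no other changes to the workflow, and the parameters shown are illustrative:

```python
import numpy as np
from sklearn.cluster import HDBSCAN  # was: from sklearn.cluster import KMeans
from umap import UMAP

X = np.random.default_rng(0).normal(size=(1000, 50))

# Reduce with UMAP (rather than t-SNE) for a more stable embedding, then cluster.
embedding = UMAP(n_components=2, random_state=42).fit_transform(X)
labels = HDBSCAN(min_cluster_size=20).fit_predict(embedding)  # was: KMeans(...)
print("clusters found:", len(set(labels) - {-1}))
```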
Video summarized with SummaryTube.com on Jan 06, 2026, 08:02 UTC
Full video URL: youtube.com/watch?v=dGsxd67IFiU
Duration: 34:05