
DBSCAN Algorithm in Python | Density-Based Clustering Algorithm

By admin / October 7, 2024


DBSCAN is a clustering algorithm that groups data points that lie close together. It looks for dense areas and separates them from areas with fewer points. Unlike methods that need you to specify the number of groups in advance, DBSCAN works this out on its own. This article describes the DBSCAN algorithm in depth and walks through an implementation in Python.

DBSCAN Explained

Clustering is a way to group similar data points. It looks for patterns and differences in the data and puts similar things into the same group. Unlike supervised methods, clustering doesn't need labels to tell it what groups to make; it tries to find the natural structure of the data on its own. There are many different ways to do clustering, each with its strengths and weaknesses.

There are many different ways to group data points, but most follow the same general idea. First, they measure how similar the points are to each other. Then, they use this information to put the points into groups. One popular method is the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise). It looks for regions where points are packed closely together and separates them from regions where points are sparse.

Why Density-Based Clustering Algorithm?

The DBSCAN algorithm groups data points based on how close they are to each other. It looks for dense areas, where many points lie close together, and separates them from areas with fewer points. To form part of a dense area, a point needs at least a certain number of other points nearby.

Some methods, like K-means and hierarchical clustering, are good at finding groups that are round or compact. However, they don't work well when the groups are not well separated, have irregular shapes, or when some points don't fit into any group. These methods can also be easily affected by noise and outliers.

Real data is messy and doesn't always follow the rules. Sometimes, groups of data points can be shaped in strange ways, like a squiggle or a blob. There might also be some extra points that don't fit into any group, like noise. A density-based method such as DBSCAN is designed for exactly these situations.
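To make the contrast concrete, here is a minimal sketch that compares K-means and DBSCAN on a "two moons" shape. The dataset (make_moons), the parameter values (eps=0.2, min_samples=5), and the use of the Adjusted Rand Index as the comparison score are all assumptions chosen for illustration, not settings from this article.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a classic example of non-round clusters.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means must be told the number of clusters; DBSCAN is not.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index: 1.0 means perfect agreement with the true grouping.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI:  {dbscan_ari:.2f}")
```

On shapes like this, K-means tends to cut each moon in half, while a density-based method can follow the curve of each one.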

Parameters For DBSCAN Algorithm

DBSCAN needs two parameters to work:

Eps: This is the radius of the neighborhood around each point. If the distance between two points is at most eps, they are considered neighbors. If eps is too small, many points will be labeled as outliers. If eps is too big, clusters merge and most points end up in the same group.
MinPts: This is the minimum number of points within eps that a point needs in order to count as a core point of a cluster. Larger or noisier datasets generally call for a larger MinPts. A good rule of thumb is to set MinPts to at least D+1, where D is the number of dimensions in the data. In scikit-learn this parameter is called min_samples, and the count includes the point itself.
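The effect of eps can be seen in a small experiment. The sketch below is only an illustration: the synthetic data, the helper name summarize, and the three eps values are assumptions picked to show the extremes.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two well-separated synthetic blobs (illustrative data, not from the article).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5]],
                  cluster_std=0.5, random_state=0)

def summarize(eps, min_pts=5):
    """Return (number of clusters, number of noise points) for a given eps."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# A tiny eps, a moderate eps, and a huge eps.
for eps in (0.01, 0.5, 20.0):
    n_clusters, n_noise = summarize(eps)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With the tiny eps no point has enough neighbors, so everything is noise; with the huge eps every point is a neighbor of every other point, so everything lands in one cluster.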

There are three types of points in DBSCAN:

Core point: A point with at least MinPts neighbors within eps.
Border point: A point with fewer than MinPts neighbors within eps that lies within eps of a core point.
Noise or outlier: A point that is neither a core point nor a border point.
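These three point types can be recovered from a fitted scikit-learn model. The sketch below assumes synthetic blob data and arbitrary parameter values; it relies on the core_sample_indices_ and labels_ attributes of DBSCAN, where a label of -1 marks noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data and parameters (assumptions, not values from the article).
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [4, 4]],
                  cluster_std=0.6, random_state=1)
model = DBSCAN(eps=0.4, min_samples=5).fit(X)

# Core points: listed explicitly by the fitted model.
is_core = np.zeros(len(X), dtype=bool)
is_core[model.core_sample_indices_] = True

# Noise points: labeled -1.
is_noise = model.labels_ == -1

# Border points: assigned to a cluster but not core themselves.
is_border = ~is_core & ~is_noise

print("core:", is_core.sum(), "border:", is_border.sum(), "noise:", is_noise.sum())
```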

How Does the DBSCAN Clustering Work?

The DBSCAN algorithm works in these steps:

Find core points: First, look for points that have at least MinPts neighbors within eps.
Create clusters: For each core point that isn't already in a group, start a new group.
Find connected points: Find all the points that are connected to the core point by a chain of core points, each within eps of the next, and put these points in the same group.
Label noise: In the end, any points that aren't in a group are considered noise.
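The steps above can be sketched in plain NumPy. This is a simplified teaching version under stated assumptions, not scikit-learn's implementation: the names simple_dbscan and region_query are hypothetical, distances are Euclidean, the neighbor count includes the point itself, and every neighborhood is found by brute force.

```python
import numpy as np

NOISE = -1
UNVISITED = -2

def region_query(X, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    distances = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero(distances <= eps)

def simple_dbscan(X, eps, min_pts):
    X = np.asarray(X, dtype=float)
    labels = np.full(len(X), UNVISITED)
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reachable from a core point: border
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(X, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core too, keep expanding
        cluster_id += 1
    return labels

# Two tight groups of four points each, plus one isolated point.
points = [[0, 0], [0, 0.1], [0.1, 0], [0.1, 0.1],
          [5, 5], [5, 5.1], [5.1, 5], [5.1, 5.1],
          [10, 10]]
labels = simple_dbscan(points, eps=0.3, min_pts=3)
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The brute-force neighbor search keeps the code short but costs time proportional to the square of the number of points; library implementations typically use spatial indexes instead.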

DBSCAN Implementation in Python

DBSCAN was proposed in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. In 2014, it received the Test of Time Award at the KDD data mining conference, which recognized it as a very important and influential method in data mining.

Step 1: Import Appropriate Libraries

To start, we'll need scikit-learn (sklearn), which provides the DBSCAN implementation, along with pandas for loading the data and matplotlib for plotting.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

Step 2: Load and Explore Our Dataset

We'll use a famous dataset about iris flowers. It has information about the size of their petals and sepals and also tells us what kind of iris they are. Its columns are sepal_length, sepal_width, petal_length, petal_width, and species.

df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
print(df.head())

To see how our data is spread out, we can use the pandas describe() method, which gives us a summary of the numeric columns. From this summary we can see how many rows there are, as well as the average value, the range, and how spread out the numbers are for each measurement. We could also compare these things for each type of iris flower. It's also helpful to look at how pairs of measurements are related to each other, to find any interesting patterns that might be useful in our analysis.
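A minimal sketch of this exploration step follows. To keep it runnable without a network connection, it assumes the copy of the iris data bundled with scikit-learn (load_iris) instead of the CSV above, and renames the columns to match the names used in this article.

```python
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled iris data rather than the CSV URL.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})
df["species"] = iris.target_names[iris.target.to_numpy()]
df = df.drop(columns="target")

numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Count, mean, standard deviation, min, quartiles, and max per measurement.
summary = df.describe()
print(summary)

# The same kind of summary per species, here just the means.
per_species = df.groupby("species")[numeric_cols].mean()
print(per_species)

# How pairs of measurements relate to each other.
correlations = df[numeric_cols].corr()
print(correlations)
```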

Step 3: Executing a DBSCAN Model

To make it easier to see what the DBSCAN algorithm is doing, we'll only use two measurements: sepal length and sepal width. A scatterplot of these two measurements shows how the points are spread out. We then fit DBSCAN with its default settings:

dbscan = DBSCAN()
dbscan.fit(df[['sepal_length', 'sepal_width']])

Output:

DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=5, n_jobs=None, p=None)

In a scatterplot colored by cluster label, the different colors show the different groups that DBSCAN found. When we run DBSCAN with the default settings (eps=0.5, min_samples=5), it finds two groups: one big group that most of the points are in, and one small group that might be made up of points that don't fit in very well. In scikit-learn, points that DBSCAN treats as noise get the label -1. The DBSCAN algorithm doesn't need to be told how many groups to make; it uses two parameters, minPts (min_samples in scikit-learn) and epsilon (eps), to decide how to group the points.
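One way to check what the default model found is to count the labels it assigned. This sketch assumes the iris data bundled with scikit-learn, where the first two columns are sepal length and sepal width, so it runs without downloading the CSV.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# Columns 0 and 1 of the bundled iris data: sepal length and sepal width (cm).
X = load_iris().data[:, :2]

# Default settings: eps=0.5, min_samples=5.
labels = DBSCAN().fit(X).labels_

# Cluster labels are 0, 1, 2, ...; the label -1 is reserved for noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters found:", n_clusters)
print("points labeled as noise (-1):", n_noise)
```

To draw the colored scatterplot, the labels array can be passed as the c argument of plt.scatter, for example plt.scatter(X[:, 0], X[:, 1], c=labels).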

Next, we'll try to find the best values for minPts and epsilon to help DBSCAN find the points that don't fit in well. To find a good value for epsilon, we'll use a K-distance graph. To make this graph, we need to find the distance between each point and its closest neighbor (or, more generally, its k-th closest neighbor), sort these distances, and plot them. A sharp bend in the curve suggests a reasonable value for epsilon. We can use a tool called NearestNeighbors from sklearn to compute the distances.
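A minimal sketch of the K-distance computation is shown below. It assumes the bundled iris data (sepal length and width only) and k = 5 to match the default min_samples; both are illustrative choices rather than settings prescribed by this article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data[:, :2]  # sepal length and sepal width
k = 5                        # assumption: tie k to min_samples

# When the training points themselves are queried, each point's nearest
# "neighbor" is itself at distance 0, so column 0 is all zeros.
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the k-th neighbor (counting the point itself), sorted ascending.
k_distances = np.sort(distances[:, -1])

print("smallest k-distance:", k_distances[0])
print("median k-distance:  ", np.median(k_distances))
print("largest k-distance: ", k_distances[-1])
```

Plotting the sorted values, for example with plt.plot(k_distances), gives the K-distance graph; the distance at which the curve starts to rise steeply is a candidate for epsilon.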

DBSCAN Applications

The DBSCAN algorithm groups data points based on how close they are to each other, which makes it useful in many areas:

Spatial data analysis: Finds clusters in geographic data (e.g., hotspots in urban planning).
Anomaly detection: Identifies unusual patterns in data (e.g., fraudulent activities).
Customer segmentation: Groups customers with similar buying behaviors.
Traffic analysis: Identifies traffic congestion hotspots and cluster routes.
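Machine learning and data mining: Uncovers groupings in unlabeled data as an exploratory step.
Pattern recognition: Finds recurring structures among feature vectors.
Image processing: Groups pixels or image features with similar properties, for example for segmentation.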

In all of these cases, DBSCAN partitions the data into dense regions (clusters) separated by less dense areas.

DBSCAN Clustering Python Example

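Here, we will walk through a DBSCAN clustering example on synthetic data. Evaluation scores for a run of this kind look like the following; the exact values depend on eps and min_samples:

Homogeneity: 0.953
Completeness: 0.883
V-measure: 0.917
Adjusted Rand Index: 0.952
Adjusted Mutual Information: 0.916
Silhouette Coefficient: 0.626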

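from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate three synthetic blobs and standardize the features
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(
    n_samples=750, centers=centers, cluster_std=0.4, random_state=0
)
X = StandardScaler().fit_transform(X)

# Fit DBSCAN on the standardized data
model = DBSCAN(eps=0.4, min_samples=10).fit(X)
colors = model.labels_  # one cluster label per point, usable as plot colors

# Points labeled -1 are the ones DBSCAN treats as noise
outliers = X[model.labels_ == -1]
print(outliers)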
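The scores listed at the start of this section can be computed with functions from sklearn.metrics. The sketch below is one way to do that. Note that eps=0.3 here is an assumption borrowed from the scikit-learn DBSCAN demo, not the eps=0.4 used above, and the numbers you get will change with these parameters.

```python
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(
    n_samples=750, centers=centers, cluster_std=0.4, random_state=0
)
X = StandardScaler().fit_transform(X)

# Assumption: eps=0.3 as in the scikit-learn demo; try other values too.
labels = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_

# The first five scores compare against the known true labels;
# the silhouette coefficient uses only the data and the predicted labels.
scores = {
    "Homogeneity": metrics.homogeneity_score(labels_true, labels),
    "Completeness": metrics.completeness_score(labels_true, labels),
    "V-measure": metrics.v_measure_score(labels_true, labels),
    "Adjusted Rand Index": metrics.adjusted_rand_score(labels_true, labels),
    "Adjusted Mutual Information": metrics.adjusted_mutual_info_score(labels_true, labels),
    "Silhouette Coefficient": metrics.silhouette_score(X, labels),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Most of these scores need ground-truth labels, which real clustering problems usually lack; the silhouette coefficient is the one that can be used without them.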


Concluding Thoughts

DBSCAN is one option for tasks such as customer segmentation, alongside alternatives like K-means and hierarchical clustering. Clustering algorithms are used in recommendation engines, market segmentation, social network analysis, and document analysis. This blog covered the basics of DBSCAN and how to use it with scikit-learn. You can improve DBSCAN's results by searching for good eps and min_samples values, for example by computing the silhouette score over a grid of settings and viewing the grid as a heatmap.
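As a rough sketch of that tuning idea, the code below scores a small grid of eps and min_samples values with the silhouette score. The grid values and the synthetic data are assumptions for illustration, and scoring only the non-noise points is one possible design choice among several.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=750, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# Hypothetical search grid; adjust to the scale of your own data.
eps_values = [0.2, 0.3, 0.4, 0.5]
min_samples_values = [5, 10, 15]

# Rows: eps, columns: min_samples. NaN marks settings with fewer than 2 clusters.
scores = np.full((len(eps_values), len(min_samples_values)), np.nan)
for i, eps in enumerate(eps_values):
    for j, min_samples in enumerate(min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        clustered = labels != -1  # ignore noise points when scoring
        n_clusters = len(set(labels[clustered]))
        if n_clusters >= 2:
            scores[i, j] = silhouette_score(X[clustered], labels[clustered])

best_i, best_j = np.unravel_index(np.nanargmax(scores), scores.shape)
print("best eps:", eps_values[best_i])
print("best min_samples:", min_samples_values[best_j])
print("best silhouette:", round(float(scores[best_i, best_j]), 3))
```

The scores array can be shown as a heatmap, for example with plt.imshow(scores). Because noise points are excluded from the score, a setting that discards many points can look deceptively good, so it is worth checking the noise count alongside the score.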

Frequently Asked Questions

Q1. What is DBSCAN used for?

Ans. DBSCAN is a method for grouping data points based on how close they are to each other. It's also good at finding points that don't fit in well (outliers), which makes it useful for cleaning up data and finding unusual things.

Q2. Is DBSCAN supervised or unsupervised?

Ans. DBSCAN is unsupervised. Clustering algorithms are a type of machine learning that doesn't need to be told what the right answers are. Instead, they look for patterns in the data and group similar things together.