Cluster analysis, or simply clustering, is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects within a cluster are similar to one another yet dissimilar to objects in other clusters.

The set of clusters resulting from a cluster analysis can be referred to as a clustering. In this context, different clustering methods may generate different clusterings on the same data set. Clustering is therefore useful in that it can lead to the discovery of previously unknown groups within the data. Cluster analysis is used in many applications, such as business intelligence, image pattern recognition, Web search, biology, and security.

In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics. This facilitates the development of business strategies for enhanced customer relationship management. Clustering is also called data segmentation in some applications because it partitions large data sets into groups according to their similarity.

Data clustering research spans many fields, including data mining, statistics, machine learning, spatial database technology, information retrieval, Web search, biology, and marketing. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages and systems, such as S-Plus, SPSS, and SAS.

Classification is known as supervised learning in machine learning because the class label information is given; that is, the learning algorithm is supervised in that it is told the class membership of each training tuple. Clustering is known as unsupervised learning because the class label information is not present. For this reason, clustering is a form of learning by observation, rather than learning by examples.

A cluster of data objects can be treated as one group. In cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups. An advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Overview Of Basic Clustering Methods:

Partitioning methods:

Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster. That is, it divides the data into k groups such that each group contains at least one object. In other words, partitioning methods conduct one-level partitioning on data sets. The basic partitioning methods typically adopt exclusive cluster separation; that is, each object must belong to exactly one group.

Most partitioning methods are distance-based. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are “close” or related to each other, whereas objects in different clusters are “far apart” or very different.

There are various other criteria for judging the quality of partitions. Traditional partitioning methods can be extended for subspace clustering, rather than searching the full data space. This is useful when there are many attributes and the data are sparse. Most partitioning applications adopt popular heuristic methods, such as the greedy k-means and k-medoids algorithms, which work well for finding spherical-shaped clusters in small- to medium-size databases.
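
As an illustration of the iterative relocation idea, the following is a minimal k-means sketch using scikit-learn; the synthetic data, the choice of k = 3, and all parameter values are illustrative assumptions, not part of the original text.

```python
# A minimal k-means sketch using scikit-learn (data and parameters are illustrative).
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three loose groups of points.
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k is chosen by the user; the algorithm then iteratively relocates
# objects between the k groups to reduce within-cluster distances.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

print(kmeans.labels_[:10])      # cluster membership of the first 10 objects
print(kmeans.cluster_centers_)  # one centroid per cluster
```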

Hierarchical methods:

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all the groups are merged into one (the topmost level of the hierarchy), or a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster.

In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in one cluster, or a termination condition holds. Hierarchical clustering methods can be distance-based or density- and continuity-based. Various extensions of hierarchical methods consider clustering in subspaces as well.

Agglomerative versus Divisive Hierarchical Clustering:

A hierarchical clustering method can be either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion. Let’s have a closer look at these strategies.

An agglomerative hierarchical clustering method uses a bottom-up strategy. It typically starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or certain termination conditions are satisfied. The single cluster becomes the hierarchy’s root.

For the merging step, it finds the two clusters that are closest to each other (according to some similarity measure), and combines the two to form one cluster. Because two clusters are merged per iteration, where each cluster contains at least one object, an agglomerative method requires at most n iterations.

A divisive hierarchical clustering method employs a top-down strategy. It starts by placing all objects in one cluster, which is the hierarchy’s root. It then divides the root cluster into several smaller subclusters, and recursively partitions those clusters into smaller ones.

The partitioning process continues until each cluster at the lowest level is coherent enough, either containing only one object or consisting of objects that are sufficiently similar to each other. In either agglomerative or divisive hierarchical clustering, a user can specify the desired number of clusters as a termination condition.
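
The bottom-up strategy can be sketched with SciPy's hierarchical clustering utilities; the sample points, the choice of Ward linkage, and the cut into two clusters below are illustrative assumptions rather than anything prescribed by the text.

```python
# A minimal sketch of agglomerative (bottom-up) hierarchical clustering
# with SciPy; data, linkage method, and cut level are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D objects; each starts in its own singleton cluster.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# linkage() repeatedly merges the two closest clusters, recording each merge;
# for n objects the full hierarchy is built in n - 1 merge steps.
merges = linkage(points, method="ward")

# Terminate by asking for a desired number of clusters (here 2).
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```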

Density-based methods:

Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shape. Other clustering methods have been developed based on the notion of density. The idea is to continue growing a given cluster as long as the density (number of objects or data points) in the “neighbourhood” exceeds some threshold.

For example, for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise or outliers and to discover clusters of arbitrary shape.

Density-based methods can divide a set of objects into multiple exclusive clusters, or a hierarchy of clusters. Typically, density-based methods consider exclusive clusters and do not consider fuzzy clusters. Moreover, density-based methods can be extended from full space to subspace clustering.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Density-Based Clustering Based on Connected Regions with High Density.

“How can we find dense regions in density-based clustering?” The density of an object o can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighbourhoods. It connects core objects and their neighbourhoods to form dense regions as clusters.

“How does DBSCAN quantify the neighbourhood of an object?” A user-specified parameter ε > 0 is used to specify the radius of the neighbourhood we consider for every object. The ε-neighbourhood of an object o is the space within a radius ε centered at o.

Because of the fixed neighbourhood size parameterized by ε, the density of a neighbourhood can be measured simply by the number of objects in the neighbourhood. To determine whether a neighbourhood is dense or not, DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions. An object is a core object if the ε-neighbourhood of the object contains at least MinPts objects. Core objects are the pillars of dense regions.
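
The core-object test can be sketched directly from these definitions. The helper below is a hypothetical illustration (not a full DBSCAN implementation), and its sample points and parameter values are assumptions chosen only to show the ε-neighbourhood count.

```python
# A small sketch of DBSCAN's core-object test (illustrative only).
import numpy as np

def is_core_object(data, index, eps, min_pts):
    """Return True if the eps-neighbourhood of data[index] contains
    at least min_pts objects (counting the point itself)."""
    distances = np.linalg.norm(data - data[index], axis=1)
    neighbourhood_size = np.sum(distances <= eps)
    return neighbourhood_size >= min_pts

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
print(is_core_object(points, index=0, eps=0.2, min_pts=3))  # True: dense neighbourhood
print(is_core_object(points, index=3, eps=0.2, min_pts=3))  # False: isolated point
```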

The DBSCAN algorithm requires two parameters:

eps (ε): It defines the neighbourhood around a data point; for example, if the distance between two points is less than or equal to eps, they are considered neighbours. If the eps value is chosen too small, then a large part of the data will be considered as outliers.

MinPts: Minimum number of neighbours (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts ≥ D + 1.
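
Putting the two parameters together, here is a minimal DBSCAN sketch with scikit-learn, where min_samples plays the role of MinPts; the synthetic data and the values eps=0.5 and min_samples=5 are illustrative assumptions.

```python
# A minimal DBSCAN sketch with scikit-learn (data and parameters are illustrative).
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should be labelled as noise.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(40, 2)),
    rng.normal(loc=(4, 4), scale=0.2, size=(40, 2)),
    [[10.0, 10.0]],
])

# eps is the neighbourhood radius; min_samples corresponds to MinPts.
db = DBSCAN(eps=0.5, min_samples=5).fit(data)

# Labels 0, 1, ... mark clusters of arbitrary shape; -1 marks noise/outliers.
print(set(db.labels_))
print("noise points:", np.sum(db.labels_ == -1))
```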
