New Trick Data Mining In Grid-Based and K-Means Methods
Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid structure (i.e., on the quantized space).
The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space. Using grids is often an efficient approach to many spatial data mining problems, including clustering. Therefore, grid-based methods can be integrated with other clustering methods such as density-based methods and hierarchical methods.
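As a rough illustration of quantization, the sketch below maps points to grid cells and counts the objects per cell. The names `quantize`, `cell_counts`, and `dense_cells`, the cell size, and the `min_points` density threshold are assumptions made for this example and do not come from any particular grid-based algorithm.

```python
import math
from collections import Counter


def quantize(points, cell_size):
    """Map each point to the integer index of the grid cell that contains it."""
    return [tuple(math.floor(x / cell_size) for x in p) for p in points]


def cell_counts(points, cell_size):
    """Count how many points fall into each occupied cell."""
    return Counter(quantize(points, cell_size))


def dense_cells(points, cell_size, min_points):
    """Return the cells holding at least min_points objects (hypothetical threshold)."""
    counts = cell_counts(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_points}


# Hypothetical 2-D data, used only for illustration.
points = [(0.2, 0.3), (0.7, 0.9), (0.4, 0.1), (5.1, 5.2), (5.8, 5.5), (9.9, 0.1)]
print(cell_counts(points, cell_size=1.0))
print(dense_cells(points, cell_size=1.0, min_points=2))
```

Once the counts are built, later steps need to look only at the occupied cells rather than at the individual objects, which is where the speed advantage comes from.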
K-Means: A Centroid-Based Technique:
An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point.
The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster. The difference between an object p in Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y.
The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in Ci and the centroid ci, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2$$

where E is the sum of the squared error for all objects in the data set; p is the point in space representing a given object; and ci is the centroid of cluster Ci (both p and ci are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the squared distances are summed.
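The within-cluster variation can be computed directly from its description. The following is a minimal sketch that assumes the centroid is the mean of the cluster; the function names and the sample clusters are chosen for illustration only.

```python
import math


def dist(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(cluster):
    """Mean of the points in a non-empty cluster (one possible centroid definition)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def within_cluster_variation(clusters):
    """Sum, over all clusters, of squared distances from each object to its centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(dist(p, c) ** 2 for p in cluster)
    return total


# Hypothetical partitioning into two clusters.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
E = within_cluster_variation(clusters)
print(E)
```

Here every object lies at distance 1 from its cluster centroid, so E is 4. A lower E means more compact clusters.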
This objective function tries to make the resulting k clusters as compact and as separate as possible. “How does the k-means algorithm work?” The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster.
It proceeds as follows. First, it randomly selects k of the objects in the data set D, each of which initially represents a cluster mean or center. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes the new mean using the objects assigned to the cluster in the previous iteration.
All the objects are then reassigned using the updated means as the new cluster centers. The iterations continue until the assignment is stable, that is, until the clusters no longer change from one round to the next. Example: clustering by k-means partitioning. Consider a set of objects located in 2-D space, as depicted in figure (a) above. Let k = 3, that is, the user would like the objects to be partitioned into three clusters.
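A minimal sketch of this procedure with k = 3 on 2-D points follows. The function names, the fixed random seed, the `max_iter` safety cap, and the sample points are assumptions for illustration and are not taken from the figure. The sketch assigns every object to its nearest center, including the objects chosen as initial centers, which lie at distance zero from themselves.

```python
import math
import random


def dist(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after iterative relocation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k objects picked as the initial means
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster whose center is nearest.
        new_assignment = [
            min(range(k), key=lambda i: dist(p, centers[i])) for p in points
        ]
        if new_assignment == assignment:  # stable: nothing was reassigned
            break
        assignment = new_assignment
        # Recompute each mean from the objects currently in the cluster.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = mean_point(members)
    return centers, assignment


# Hypothetical 2-D objects forming three loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5),
          (1.0, 9.0), (2.0, 8.5), (1.5, 9.5)]
centers, assignment = k_means(points, k=3)
print(centers)
print(assignment)
```

Because the initial centers are chosen at random, the result can depend on the seed, and the procedure may settle on a local optimum rather than the best possible partitioning.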
According to the algorithm in figure 9 above, we arbitrarily choose three objects as the three initial cluster centers, where cluster centers are marked by a +. Each object is assigned to a cluster based on the cluster center to which it is the nearest. Such a distribution forms silhouettes encircled by dotted curves, as shown in figure (a) above.
Next, the cluster centers are updated. That is, the mean value of each cluster is recalculated based on the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is the nearest. Such a redistribution forms new silhouettes encircled by dashed curves, as shown in figure (b) above.
This process iterates, leading to figure (c) above. The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation. Eventually, no reassignment of the objects in any cluster occurs, and so the process terminates. The resulting clusters are returned by the clustering process. The k-means method can be applied only when the mean of a set of objects is defined.
This may not be the case in some applications such as when data with nominal attributes are involved. The k-modes method is a variant of k-means, which extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes.
It uses new dissimilarity measures to deal with nominal objects and a frequency-based method to update modes of clusters. The k-means and the k-modes methods can be integrated to cluster data with mixed numeric and nominal values.
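The sketch below illustrates the k-modes idea on nominal data. It assumes simple matching (the number of attributes on which two objects differ) as the dissimilarity measure, which is one common choice, and it takes the initial modes as an argument. The function names and the sample objects are hypothetical.

```python
from collections import Counter


def mismatch(x, y):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(a != b for a, b in zip(x, y))


def mode_of(cluster):
    """Frequency-based update: most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


def k_modes(objects, initial_modes, max_iter=100):
    """Return (modes, assignment) once the assignment is stable."""
    modes = list(initial_modes)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(modes)), key=lambda i: mismatch(o, modes[i]))
            for o in objects
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for i in range(len(modes)):
            members = [o for o, a in zip(objects, assignment) if a == i]
            if members:  # an empty cluster keeps its previous mode
                modes[i] = mode_of(members)
    return modes, assignment


# Hypothetical nominal objects: (color, size, shape).
objects = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]
modes, assignment = k_modes(objects, initial_modes=[objects[0], objects[3]])
print(modes)       # [('red', 'small', 'round'), ('blue', 'large', 'square')]
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The loop has the same shape as k-means; only the distance function and the way a cluster representative is updated change.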
One approach to making the k-means method more efficient on large data sets is to use a good-sized set of samples in clustering. Another is to employ a filtering approach that uses a spatial hierarchical data index to save costs when computing means. A third approach explores the micro clustering idea, which first groups nearby objects into “micro clusters” and then performs k-means clustering on the micro clusters.
Data Mining Applications:
Data Mining for Financial Data Analysis:
Most banks and financial institutions offer a wide variety of banking, investment, and credit services. Some also offer insurance and stock investment services. Financial data collected in the banking and financial industry are often relatively complete, reliable, and of high quality, which facilitates systematic data analysis and data mining. Here we present a few typical cases.
Design and construction of data warehouses for multidimensional data analysis and Data mining:
Multidimensional data analysis methods should be used to analyze the general properties of such data. For example, a company’s financial officer may want to view the debt and revenue changes by month, region, sector, and other factors, along with maximum, minimum, total, average, trend, deviation, and other statistical information. Data warehouses, data cubes, characterization and class comparisons, clustering, and outlier analysis will all play important roles in financial data analysis and mining.
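A small sketch of this kind of multidimensional summary is shown below. The `roll_up` helper, the record layout, and the revenue figures are all hypothetical. In practice a data warehouse or data cube engine would perform this aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical revenue records; real data would come from the warehouse.
records = [
    {"month": "2024-01", "region": "North", "sector": "Retail", "revenue": 120.0},
    {"month": "2024-01", "region": "North", "sector": "Energy", "revenue": 80.0},
    {"month": "2024-01", "region": "South", "sector": "Retail", "revenue": 100.0},
    {"month": "2024-02", "region": "North", "sector": "Retail", "revenue": 150.0},
    {"month": "2024-02", "region": "South", "sector": "Retail", "revenue": 90.0},
]


def roll_up(records, dims, measure):
    """Group records by the chosen dimensions and summarize one measure."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[d] for d in dims)].append(r[measure])
    return {
        key: {"min": min(v), "max": max(v), "total": sum(v), "average": mean(v)}
        for key, v in groups.items()
    }


by_region = roll_up(records, dims=("region",), measure="revenue")
by_month_region = roll_up(records, dims=("month", "region"), measure="revenue")
print(by_region[("North",)]["total"])                 # 350.0
print(by_month_region[("2024-01", "North")]["total"])  # 200.0
```

Changing the `dims` tuple corresponds to viewing the same measure at a different level of detail, for example by region only or by month and region together.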
Classification and clustering of customers for targeted marketing:
Classification and clustering methods can be used for customer group identification and targeted marketing. Customers with similar behaviours regarding loan payments may be identified by multidimensional clustering techniques. These can help identify customer groups and associate a new customer with an appropriate customer group.
Detection of money laundering and other financial crimes:
To detect money laundering and other financial crimes, it is important to integrate information from multiple, heterogeneous databases (e.g., bank transaction databases and federal or state crime history databases), as long as they are potentially related to the study. Multiple data analysis tools can then be used to detect unusual patterns, such as large amounts of cash flow at certain periods, by certain groups of customers.
Useful tools include data visualization, linkage and information network analysis tools to identify links among different customers and activities, classification tools (to filter unrelated attributes and rank the highly related ones), clustering tools (to group different cases), outlier analysis tools (to detect unusual amounts of fund transfers or other activities), and sequential pattern analysis tools (to characterize unusual access sequences). These tools may identify important relationships and patterns of activities and help investigators focus on suspicious cases for further detailed examination.
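As one illustration of outlier analysis on transfer amounts, the sketch below flags values with a large modified z-score based on the median absolute deviation. The text does not prescribe a method, so this technique, the 3.5 cutoff, the function name, and the amounts are assumptions for the example.

```python
from statistics import median


def unusual_amounts(amounts, threshold=3.5):
    """Return amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)  # median absolute deviation
    if mad == 0:
        return []  # no spread around the median, so nothing can be ranked
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]


# Hypothetical fund transfer amounts for one group of customers.
transfers = [120, 95, 130, 110, 105, 9800, 115, 100]
print(unusual_amounts(transfers))  # [9800]
```

A flagged amount is only a candidate for review. As the text notes, such tools help investigators focus on suspicious cases for further detailed examination.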
Data Mining for Retail and Telecommunication Industries:
Today, most major chain stores also have websites where customers can make purchases online. Some businesses, such as Amazon.com, exist solely online, without any physical store locations. Retail data provide a rich source for data mining. Retail data mining can help identify customer buying behaviours, discover customer shopping patterns and trends, improve the quality of customer service, achieve better customer retention and satisfaction, enhance goods consumption ratios, design more effective goods transportation and distribution policies, and reduce the cost of business.
Design and construction of data warehouses:
Because retail data cover a wide range, including sales, customers, employees, goods transportation, consumption, and services, there can be many ways to design a data warehouse for this industry. The levels of detail to include can vary considerably. The outcome of preliminary data mining exercises can be used to help guide the design and development of data warehouse structures.
Multidimensional analysis of sales, customers, products, time, and region:
The retail industry requires timely information regarding customer needs, product sales, trends, and fashions, as well as the quality, cost, profit, and service of commodities. It is therefore important to provide powerful multidimensional analysis and visualization tools, including the construction of sophisticated data cubes according to the needs of data analysis.
Customer retention - analysis of customer loyalty:
Here, customer loyalty and purchase trends can be analyzed systematically. Goods purchased at different periods by the same customers can be grouped into sequences. Sequential pattern mining can then be used to investigate changes in customer consumption or loyalty and suggest adjustments on the pricing and variety of goods to help retain customers and attract new ones.
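The grouping step can be sketched as follows. The purchase records, the one-item-per-period simplification, and the `build_sequences` and `support` helpers are hypothetical. A real sequential pattern miner would search for frequent patterns instead of checking a single given one.

```python
from collections import defaultdict

# Hypothetical purchase log: (customer, period, item).
purchases = [
    ("c1", "2024-03", "printer"),
    ("c1", "2024-01", "laptop"),
    ("c2", "2024-02", "phone"),
    ("c1", "2024-02", "mouse"),
    ("c2", "2024-04", "phone case"),
]


def build_sequences(purchases):
    """Group each customer's items into one sequence ordered by period."""
    by_customer = defaultdict(list)
    for customer, _period, item in sorted(purchases, key=lambda r: (r[0], r[1])):
        by_customer[customer].append(item)
    return dict(by_customer)


def support(sequences, pattern):
    """Fraction of customers whose sequence contains the pattern in order."""
    def contains(seq, pat):
        it = iter(seq)
        return all(x in it for x in pat)  # consumes the iterator, so order matters
    return sum(contains(s, pattern) for s in sequences.values()) / len(sequences)


sequences = build_sequences(purchases)
print(sequences["c1"])                          # ['laptop', 'mouse', 'printer']
print(support(sequences, ["laptop", "printer"]))  # 0.5
```

The period strings sort chronologically here only because they use a year-month format; other formats would need to be parsed into dates first.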
Fraud analysis and the identification of unusual patterns:
It is important, first, to identify potentially fraudulent users and their atypical usage patterns; second, to detect attempts to gain fraudulent entry or unauthorized access to individual and organizational accounts; and third, to discover unusual patterns that may need special attention. Many of these patterns can be discovered by multidimensional analysis, cluster analysis, and outlier analysis.
Data Mining in Science and Engineering:
Data warehouses and data preprocessing:
Data preprocessing and data warehouses are critical for information exchange and data mining. Creating a warehouse often requires finding means for resolving inconsistent or incompatible data collected in multiple environments and at different time periods. Methods are needed for integrating data from heterogeneous sources and for identifying events.
Mining in complex data types:
Scientific data sets are heterogeneous in nature. They include, for example, multimedia data as well as data with sophisticated, deeply hidden semantics (e.g., genomic and proteomic data). Robust and dedicated analysis methods are needed for handling spatiotemporal data, biological data, related concept hierarchies, and complex semantic relationships. For example, in bioinformatics, a research problem is to identify regulatory influences on genes.
Gene regulation refers to how genes in a cell are switched on or off to determine the cell’s functions. Thus, to understand a biological process we need to identify the participating genes and their regulators. This requires the development of sophisticated data mining methods to analyze large biological data sets for clues about regulatory influences on specific genes, by finding DNA segments (“regulatory sequences”) mediating such influence.