New Journey of Data Mining and Data Warehouse

Data mining and data warehouse algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and is increasingly ubiquitous. Before data mining develops into a conventional, mature, and trusted discipline, many pending issues have to be addressed. Some of these issues are discussed below. Note that these issues are not exclusive and are not ordered in any way.

There are five types of issues in data mining:

Security and social issues –

Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behaviour understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored.

This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining.

Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.


User interface issues –

The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many data exploratory analysis tasks are significantly facilitated by the ability to set data in an appropriate visual presentation. There are many visualization ideas and proposals for effective data graphical presentation.

However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are "screen real-estate", information rendering, and interaction. Interactivity with the data and data mining results is crucial, since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.


Mining methodology issues –

These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices.

For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users' needs differently.

Most algorithms assume the data to be noise-free. This is, of course, a strong assumption. Most datasets contain exceptions, or invalid or incomplete information, which complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results.

As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.
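As a minimal sketch of such a cleaning pass, the example below drops incomplete and invalid records before mining. The field names and validity rules are illustrative assumptions, not part of any standard:

```python
# Minimal data-cleaning sketch: discard records that are incomplete
# (missing required fields) or invalid (impossible values, treated
# as noise). Field names and validity rules are assumptions.

def clean(records):
    cleaned = []
    for rec in records:
        # Incomplete: a required field is missing.
        if rec.get("age") is None or rec.get("income") is None:
            continue
        # Invalid: an out-of-range value is treated as noise.
        if not (0 <= rec["age"] <= 120):
            continue
        cleaned.append(rec)
    return cleaned

raw = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 40000},   # incomplete
    {"age": 180, "income": 61000},    # invalid
    {"age": 27, "income": 38000},
]
print(len(clean(raw)))  # 2 of the 4 records survive cleaning
```

Real preprocessing pipelines also transform data (normalization, discretization, imputation), but the filtering step above is where exceptions and invalid entries are typically handled first.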

More than the size of the data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space, and it usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.
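The exponential growth described above is easy to see numerically. In the sketch below, each dimension of the domain is discretized into a fixed number of bins (the bin count is an arbitrary assumption for the example), so a grid over the domain has one cell per combination of bins:

```python
# Curse of dimensionality: with b bins per dimension, a grid over a
# d-dimensional domain has b**d cells, so the search space grows
# exponentially with the number of dimensions.

def search_space_size(dimensions, bins_per_dim=10):
    return bins_per_dim ** dimensions

for d in (1, 2, 5, 10):
    print(d, search_space_size(d))
# With 10 bins per dimension, 1 dimension gives 10 cells,
# while 10 dimensions already give 10,000,000,000 cells.
```

At the same data size, the cells of a high-dimensional grid are almost all empty, which is why density- and grid-based methods degrade so quickly as dimensionality rises.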

Performance issues –

Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data.

Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining; linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and the choice of samples may arise.
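A hedged sketch of the sampling idea follows: a mining task is run on a random sample rather than the full dataset, trading completeness for speed. Estimating a mean stands in for a real mining task, and the dataset and sample size are arbitrary assumptions:

```python
import random

# Sampling sketch: mine a random sample instead of the whole dataset.
# Estimating the mean is a stand-in for a real mining task.

def sample_mean(data, sample_size, seed=0):
    rng = random.Random(seed)            # fixed seed for reproducibility
    sample = rng.sample(data, sample_size)
    return sum(sample) / len(sample)

data = list(range(1_000_000))            # full dataset; true mean is 499999.5
estimate = sample_mean(data, 10_000)     # 1% sample
print(estimate)                          # close to, but not exactly, the true mean
```

The gap between the estimate and the true mean illustrates the completeness concern: rare patterns may be missed entirely by a sample, which is why the choice and size of samples matter.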

Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available, without having to re-analyze the complete dataset.
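Both ideas rest on the same property: partial results can be merged. The sketch below assumes a simple frequency-counting task; each partition is mined independently (and could run in parallel), the partial counts are merged, and the same merge step absorbs newly arriving data without re-scanning old partitions:

```python
from collections import Counter

# Parallel/incremental mining sketch for a frequency-counting task.
# Each partition is mined independently and partial results merged;
# the same merge handles an incremental update.

def mine_partition(items):
    return Counter(items)

partitions = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
partials = [mine_partition(p) for p in partitions]   # could run in parallel
total = sum(partials, Counter())                     # merge step

new_data = ["a", "d"]                # data arriving after the initial run
total += mine_partition(new_data)    # incremental update, no re-scan
print(total["a"], total["d"])        # prints: 4 1
```

Not every mining result merges this cleanly (clusterings and decision trees do not), which is exactly why incremental and parallel variants of those algorithms are research topics in their own right.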

Data source issues –

There are many issues related to the data sources, some are practical such as the diversity of data types, while others are philosophical like the data glut problem.

We certainly have an excess of data since we already have more data than we can handle and we are still collecting data at an even higher rate.

If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting.

The current practice is to collect as much data as possible now and process it, or try to process it, later.

The concern is whether we are collecting the right data at the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types.

We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies.

Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types.

A versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.


Data Warehouse: Basic Concepts

DATA WAREHOUSE

A database management system stores data in the form of tables, uses the ER model, and aims to fulfill the ACID properties. For example, a database system of an institute has tables for courses, students, faculty members, etc.

  • A Data Warehouse, in contrast, is separate from the DBMS. It stores a huge amount of data, which is typically collected from multiple heterogeneous sources like files, DBMSs, etc. The goal is to produce statistical results that may help in decision making.
  • For example, an institute might want to see overall results, or how course-wise placement records have improved over the last few years. An ordinary database can store MBs to GBs of data for a specific purpose. For storing data of TB size, storage shifts to a Data Warehouse.
  • A transactional database does not lend itself to analysis. To perform analysis effectively, an organization keeps a central Data Warehouse to closely study its business by organizing, understanding, and using its historical data for taking strategic decisions and scrutinizing trends.
  • The data warehouse enables the organization to make use of an enterprise-wide data store to link information from diverse sources and make the information accessible to users for strategic analysis, including trend analysis, forecasting, competitive analysis, targeted market research, etc.
  • The basic concept of a Data Warehouse is to facilitate a single version of truth for a company for decision making and forecasting. A Data Warehouse is an information system that contains historical and cumulative data from single or multiple sources.
  • The Data Warehouse concept simplifies the reporting and analysis process of the organization. A data warehouse allows business users to quickly access critical data from several sources all in one place.
  • It provides consistent information on various cross-functional activities. It also supports ad-hoc reporting and queries. It helps integrate many sources of data, reducing stress on the production system and reducing the total turnaround time for analysis and reporting.
  • It allows users to access critical data from a number of sources in a single place, saving users the time of retrieving data from multiple sources. It stores a large amount of historical data, which helps users analyze different time periods and trends to make future predictions.
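The trend-analysis use above can be made concrete with a toy warehouse-style aggregation. The placement records and field names below are invented for the example; the point is that grouping historical facts by time period exposes the trend:

```python
from collections import defaultdict

# Warehouse-style trend sketch: aggregate historical placement records
# by year. Records and field names are illustrative assumptions.

placements = [
    {"year": 2021, "course": "CS", "placed": 120},
    {"year": 2021, "course": "EE", "placed": 80},
    {"year": 2022, "course": "CS", "placed": 150},
    {"year": 2022, "course": "EE", "placed": 95},
]

by_year = defaultdict(int)
for rec in placements:
    by_year[rec["year"]] += rec["placed"]

for year in sorted(by_year):
    print(year, by_year[year])   # 2021 -> 200, 2022 -> 245: an upward trend
```

In a real warehouse the same query would run in SQL over a fact table holding years of history, but the group-by-and-aggregate shape of the analysis is identical.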
