Data cleaning is one of the most important parts of machine learning and plays a significant role in building a model. It is one of those things that everyone does but no one really talks about. It surely isn't the fanciest part of machine learning, and there aren't any hidden tricks or secrets to uncover. However, proper data cleaning can make or break your project: professional data scientists usually spend a very large portion of their time on this step.
If we have a well-cleaned dataset, we can get the desired results even with a very simple algorithm, which can prove very beneficial at times. Obviously, different types of data will require different types of cleaning. However, the systematic approach below can always serve as a good starting point.
How do you clean data?
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to map out a framework for your
organization.
Step 1: Remove duplicate or irrelevant observations:
Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations. Duplicate observations will happen most often during data collection. When you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate data.
De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are those that do not fit into the specific problem you are trying to analyze.
For example, if you want to analyze data regarding millennial customers, but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient and minimize distraction from your primary target, as well as creating a more manageable and more performant dataset.
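A minimal sketch of this step using pandas follows; the column names ("customer_id", "birth_year", "spend") and the millennial birth-year range are illustrative assumptions, not from the text.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "birth_year":  [1990, 1990, 1975, 1995],
    "spend":       [120.0, 120.0, 80.0, 60.0],
})

# Remove exact duplicate observations created during collection or merging.
df = df.drop_duplicates()

# Remove irrelevant observations, e.g. keep only millennial customers
# (here, loosely, customers born between 1981 and 1996 -- an assumed cut-off).
df = df[df["birth_year"].between(1981, 1996)]

print(df)
```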
Step 2: Fix structural errors:
The errors that arise during measurement, transfer of data, or other similar situations are called structural errors. Structural errors include typos in feature names, the same attribute appearing under different names, and mislabelled classes, i.e. separate classes that should really be one because they represent the same value, or red, yellow and red-yellow being treated as three different classes even though one of them could be folded into the other two.
For example, the model will treat "America" and "america" as different classes or values, even though they represent the same thing. These structural errors make our model inefficient and give poor results.
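A small sketch of fixing such structural errors with pandas is shown below; the "country" column and the specific inconsistent values are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"country": ["America", "america", "AMERICA ", "USA", "Canada"]})

# Normalise case and whitespace so "America" and "america" become one class.
df["country"] = df["country"].str.strip().str.title()

# Merge mislabelled classes that represent the same value.
df["country"] = df["country"].replace({"Usa": "America"})

print(df["country"].value_counts())
```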
Step 3: Filter unwanted outliers:
Often, there will be one-off observations that, at a glance, do not appear to fit within the data you are analysing. If you have a legitimate reason to remove an outlier, like improper data entry, doing so will help the performance of the data you are working with. However, sometimes it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider removing it.
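One common way to flag candidate outliers is the interquartile-range (IQR) rule; the sketch below is a hedged illustration of that idea, with the 1.5 multiplier and the "value" column being conventional, assumed choices rather than anything prescribed by the text.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 300]})  # 300 looks like a data-entry error

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Inspect before dropping: an outlier is not automatically wrong.
outliers = df[(df["value"] < lower) | (df["value"] > upper)]
cleaned = df[df["value"].between(lower, upper)]

print(outliers)
print(cleaned)
```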
Step 4: Handle missing data:
You can't ignore missing data because many algorithms will not accept missing values. There are a few ways to deal with missing data. None of them is optimal, but all can be considered (a short sketch follows the list below).
1. As a first option, you can drop observations that have missing values, but doing this will drop or lose information, so be mindful of this before you remove it.
2. As a second option, you can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data because you may be operating from assumptions and not actual observations.
3. As a third option, you might alter the way the data is used to effectively navigate null values.
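A minimal sketch of the three options above using pandas; the column names, the median/mode imputation choices, and the missing-value flag are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31], "city": ["NY", "LA", None, "NY"]})

# Option 1: drop observations with missing values (loses information).
dropped = df.dropna()

# Option 2: impute missing values from other observations (adds assumptions).
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# Option 3: keep the nulls but make them explicit so downstream code can
# treat "missing" as its own signal.
flagged = df.assign(age_missing=df["age"].isna())

print(dropped, imputed, flagged, sep="\n\n")
```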
Step 5: Validate and QA:
At the end of the data cleaning process, you should be able to answer these questions as part of basic validation:
- Does the data follow the appropriate rules for its field?
- Does it prove or disprove your working theory, or bring any insight to light?
- Can you find trends in the data to help you form your next theory?
- If not, is that because of a data quality issue?
Components of quality data:
Determining the quality of data requires an examination of its characteristics and then weighing those characteristics according to what is most important to your organization and the application(s) for which the data will be used. Characteristics of quality data include:
1. Validity – The degree to which your data conforms to defined business rules or constraints.
2. Accuracy – Ensure your data is close to the true values.
3. Completeness – The degree to which all required data is known.
4. Consistency – Ensure your data is consistent within the same dataset and/or across multiple datasets.
5. Uniformity – The degree to which the data uses the same unit of measure.
Benefits of data cleaning:
Having clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits include the removal of errors when multiple sources of data are at play; fewer errors make for happier clients and less-frustrated employees.
The ability to map the different functions and understand what your data is intended to do makes it easier to fix incorrect or corrupt data for future applications. Using tools for data cleaning will make for more efficient business practices and quicker decision-making.
Overview of Data Reduction Strategies
Data reduction strategies in data mining: Data reduction strategies are applied to huge data sets. Complex analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data. Strategies for data reduction include the following:
1. Data Cube Aggregation: This technique is used to aggregate data in a simpler form. For example, imagine that the information you gathered for your analysis covers the years 2012 to 2014, and that data includes the revenue of your company every three months. If you are interested in annual rather than quarterly figures, you can aggregate the data in such a way that the resulting data summarizes the total sales per year instead.
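A small sketch of this aggregation with pandas; the revenue figures below are made up for illustration.

```python
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    "quarter": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "revenue": [100, 120, 90, 150, 110, 130, 95, 160, 115, 140, 100, 170],
})

# Aggregate away the quarter dimension: total sales per year.
yearly = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(yearly)
```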
2. Dimensionality reduction: Whenever we come across data in which some attributes are only weakly important, we keep just the attributes required for our analysis. This reduces data size as it eliminates outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step we add the best of the remaining original attributes to the set, based on their relevance to the analysis (in statistics this relevance is commonly judged with a p-value).
Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute in the set.
Combination of Forward and Backward Selection –
This allows us to remove the worst and select the best attributes at each step, saving time and making the process faster.
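As a rough illustration of forward and backward selection (not part of the original text), the sketch below uses scikit-learn's SequentialFeatureSelector; the iris dataset and the logistic-regression estimator are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and add the best attribute each step.
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward"
).fit(X, y)

# Backward selection: start from all attributes and drop the worst each step.
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward"
).fit(X, y)

print("forward keeps:", forward.get_support())
print("backward keeps:", backward.get_support())
```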
3. Data Compression: Data compression reduces the size of files using different encoding mechanisms (e.g. Huffman encoding and run-length encoding). We can divide it into two types based on the compression technique used:
• Lossless Compression – Encoding techniques (such as Run-Length Encoding) allow a simple and minimal reduction in data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
• Lossy Compression – Methods such as the Discrete Wavelet Transform and PCA (Principal Component Analysis) are examples of this type of compression. In lossy data compression, the decompressed data may differ from the original data but is useful enough to retrieve information from it. For example, the JPEG image format is a lossy compression, but we can find meaning equivalent to that of the original image.
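A minimal sketch of lossless run-length encoding, as mentioned above; this toy encoder/decoder round-trips the data exactly.

```python
from itertools import groupby

def rle_encode(s: str) -> list[tuple[str, int]]:
    # Collapse each run of identical characters into (character, run length).
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    # Expand the runs back into the original string.
    return "".join(ch * count for ch, count in pairs)

data = "aaaabbbccd"
encoded = rle_encode(data)
assert rle_decode(encoded) == data  # lossless: original data restored exactly
print(encoded)
```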
4. Numerosity Reduction: In this technique the actual data is replaced with a mathematical model or a smaller representation of the data; with a parametric method it is enough to store only the model parameters, while non-parametric methods include clustering, histograms and sampling.
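A brief sketch of non-parametric numerosity reduction by random sampling; the synthetic data and the 10% sampling fraction are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
big = pd.DataFrame({"x": rng.normal(size=100_000)})

# Keep a 10% random sample as a smaller representation of the data.
sample = big.sample(frac=0.10, random_state=42)
print(len(big), "->", len(sample))
```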
5. Discretization & Concept Hierarchy Operation: Techniques of data discretization are used to divide attributes of a continuous nature into data with intervals. We replace the many constant values of the attributes with labels of small intervals. This means that mining results are presented in a concise and easily understandable way.
Top-down discretization – If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of values and then repeat this method on the resulting intervals up to the end, the process is known as top-down discretization, also known as splitting.
Bottom-up discretization – If you first consider all of the constant values as split points and then discard some of them by merging neighbourhood values into intervals, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing low-level concepts (such as numeric values for age) with high-level concepts (categorical labels such as middle-aged or senior).
For numeric data, the following techniques can be followed:
Binning – Binning is the process of changing numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user.
Histogram analysis – Like binning, a histogram is used to partition the values of an attribute X into disjoint ranges called brackets.
There are several partitioning rules:
1. Equal-Frequency Partitioning: Partitioning the values based on their number of occurrences in the data set, so each partition holds roughly the same number of values.
2. Equal-Width Partitioning: Partitioning the values into bins of a fixed width based on the number of bins, e.g. a set of values ranging from 0–20.
3. Clustering: Grouping similar data together.
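A brief sketch of the first two partitioning rules with pandas; the ages and the number of bins are illustrative assumptions.

```python
import pandas as pd

ages = pd.Series([5, 12, 19, 23, 31, 40, 58, 63, 77])

# Equal-width partitioning: bins of fixed width across the value range.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency partitioning: each bin holds roughly the same count.
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```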