10 Reasons Why Data Cleaning Is So Important in Data Science

Whenever we hear the term data science, we think of large datasets and machine learning algorithms that help data scientists predict values or classify outcomes into two or more classes.

But if you ask any data scientist, read a few textbooks, or work on enough projects, you will realise that data people spend 70–80 percent of their time on data cleaning.

What is Data cleaning?

It's the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database. It means identifying the incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty or coarse data.
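To make that definition a bit more concrete, here is a minimal sketch in plain Python. The records, the field names, the 0 to 120 age range, and the choice to fill bad ages with the median are all assumptions made up for illustration, not a standard recipe.

```python
from statistics import median

# Hypothetical raw records; names and values are invented for this example.
raw_records = [
    {"name": "Asha", "age": 34, "city": "pune "},    # messy text: stray space, lowercase
    {"name": "Ben", "age": None, "city": "Mumbai"},  # incomplete: age is missing
    {"name": "Chen", "age": -5, "city": "delhi"},    # incorrect: age cannot be negative
    {"name": "Dia", "age": 40, "city": "Pune"},
    {"name": "Asha", "age": 34, "city": "pune "},    # irrelevant: exact duplicate
]


def is_valid_age(age):
    """Assumed rule: an age is usable if it is present and between 0 and 120."""
    return age is not None and 0 <= age <= 120


def clean(records):
    """Return a cleaned copy of the records without changing the input."""
    # Delete: drop exact duplicates while keeping the original order.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # Replace: fill missing or impossible ages with the median of the valid ones.
    fill_age = median(r["age"] for r in unique if is_valid_age(r["age"]))
    for rec in unique:
        if not is_valid_age(rec["age"]):
            rec["age"] = fill_age
        # Modify: normalise the city text.
        rec["city"] = rec["city"].strip().title()
    return unique


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Each step maps to one verb from the definition: the duplicate is deleted, the bad ages are replaced, and the city names are modified. Whether the median is a sensible replacement depends entirely on your data.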

Why bother with data cleaning?

The data you download from a server or the internet may have been compiled for some other specific task, or it may be user-collected data! The values in these datasets might not follow your own standards, and there might be many missing data points.

Let me give you an example of how missing data can arise. Say a company is surveying its customers about their financial status. When the customers receive the survey in their mail, they start filling it in, and there is a high chance that many of them will not answer every question. They will only fill in the information they feel like sharing for research purposes. And guys… that's just one way missing data ends up in our dataset.
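If you had such a survey in hand, a quick first check is how often each question was skipped. Here is a small sketch, assuming the responses are stored as a list of dictionaries with None for a skipped question. The question names and numbers are hypothetical.

```python
# Hypothetical survey responses; None marks a question the customer skipped.
responses = [
    {"income": 52000, "savings": 8000, "has_loan": True},
    {"income": None, "savings": 3000, "has_loan": False},
    {"income": None, "savings": None, "has_loan": True},
    {"income": 41000, "savings": None, "has_loan": None},
]


def missing_share(rows):
    """Return the fraction of skipped answers for each question."""
    questions = rows[0].keys()
    return {q: sum(r[q] is None for r in rows) / len(rows) for q in questions}


shares = missing_share(responses)
for question, share in shares.items():
    print(f"{question}: {share:.0%} missing")
```

A question with a high share of skipped answers deserves a closer look before you feed it to any model.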

But this is just one case, right? Not all datasets are surveys! What about those datasets?

Indeed, my friend! Several other cases are possible, e.g.:

(i) the value is missing because it was forgotten or lost.

(ii) the value is missing because it was not applicable to the instance.

(iii) the value is missing because it is of no interest to the instance.

If we were to put this in a medical context: [1]

(i) the variable is measured but for some unidentifiable reason the values are not electronically recorded, e.g. disconnection of sensors, errors in communicating with the database server, accidental human omission, electricity failures, and others.

(ii) the variable is not measured during a certain period of time due to an identifiable reason, for instance the patient is disconnected from the ventilator because of a medical decision.

(iii) the variable is not measured because it is unrelated to the patient's condition and provides no clinically useful information to the physician.
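One way to keep these three cases apart in code is to record why a value is missing instead of storing a bare blank. The sketch below is only an illustration: the MissingReason names and the readings are hypothetical and do not come from the cited source.

```python
from collections import Counter
from enum import Enum


class MissingReason(Enum):
    """Hypothetical labels for the three medical cases listed above."""
    NOT_RECORDED = "measured, but the value was never recorded"
    NOT_MEASURED = "not measured for an identifiable reason"
    NOT_RELEVANT = "not measured because it is unrelated to the patient's condition"


# Invented hourly readings: numbers are real values, enum members explain the gaps.
readings = [
    72,
    MissingReason.NOT_RECORDED,  # e.g. a sensor was disconnected
    75,
    MissingReason.NOT_MEASURED,  # e.g. patient taken off the ventilator
    MissingReason.NOT_MEASURED,
    70,
]


def missing_summary(values):
    """Count how many gaps fall under each reason."""
    return Counter(v for v in values if isinstance(v, MissingReason))


summary = missing_summary(readings)
for reason, count in summary.items():
    print(f"{reason.name}: {count}")
```

Keeping the reason next to the gap lets a later cleaning step treat each case differently instead of lumping them together.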

Now let's get back to the question… OKAY, so we have dirty data. Why clean it? Why not just run machine learning algorithms over the messed-up data? We ultimately have to predict, right?

WELL… no.

Having clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits include:

  • Removal of errors when multiple sources of data are at play.
  • Fewer errors make for happier clients and less-frustrated employees.
  • Ability to map the different functions and what your data is intended to do.
  • Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data for future applications.
  • Using tools for data cleaning will make for more efficient business practices and quicker decision-making.

So yes, I would definitely support the argument that data cleaning is a much more important step than running machine learning algorithms on the acquired data!

I hope you liked this article! If you did, motivate me to write more articles like this by following me and liking this article! ❤

Written by

Data Enthusiast, Geek, part-time blogger.
