
10 Reasons Why Data Cleaning Is So Important in Data Science

Dhaval Thakur
3 min read · Oct 24, 2020


Whenever we hear the term data science, we think of large datasets and machine learning algorithms that help data scientists predict values or classify outcomes into two or more classes.

But if you ask any data scientist, read a few textbooks, or work on enough projects, you will realise that data people spend 70–80 percent of their time on data cleaning.

What is data cleaning?

It is the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database. In practice, that means identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty or coarse data.
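To make the definition concrete, here is a minimal sketch in plain Python. The record set, the field names, and the rules (treat a repeated `id` as a duplicate, delete rows with no name, blank out ages outside 0 to 120) are all assumptions made up for illustration, not a standard recipe.

```python
# Hypothetical record set; field names and validity rules are illustrative assumptions.
records = [
    {"id": 1, "name": "Asha", "age": 34},
    {"id": 2, "name": "Ben", "age": -5},    # inaccurate: negative age
    {"id": 3, "name": None, "age": 41},     # incomplete: missing name
    {"id": 1, "name": "Asha", "age": 34},   # irrelevant: duplicate of id 1
]


def clean(rows):
    """Return cleaned copies of rows; the input list is left untouched."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        # Delete: skip rows whose id has already been seen (duplicates).
        if row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])

        # Delete: skip rows that are missing a required field.
        if row["name"] is None:
            continue

        # Replace: an implausible age becomes None rather than a guessed value.
        fixed = dict(row)
        if fixed["age"] is not None and not (0 <= fixed["age"] <= 120):
            fixed["age"] = None
        cleaned.append(fixed)
    return cleaned


cleaned_records = clean(records)
print(cleaned_records)
```

Whether a bad value should be deleted, replaced, or corrected depends on the dataset and the question you are asking; the choices above are only one possibility.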

Clean data has a better idea

Why bother with data cleaning?

The data you download from a server or the internet may have been compiled for some other specific task, or it may be user-collected data. The values in these datasets might not follow your own standards, and many data points might be missing.
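Both problems, missing data points and values that do not follow your own standards, can be checked with a few lines of plain Python. This is a rough sketch under stated assumptions: the rows, the field names, and the `COUNTRY_ALIASES` mapping are hypothetical and only stand in for whatever conventions your own project uses.

```python
from collections import Counter

# Hypothetical downloaded rows; every value here is invented for illustration.
rows = [
    {"customer": "C1", "income": "52000", "country": "india"},
    {"customer": "C2", "income": "", "country": "IN"},
    {"customer": "C3", "income": None, "country": " India "},
]

# Assumed in-house standard: store countries as two-letter codes.
COUNTRY_ALIASES = {"india": "IN", "in": "IN"}


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def missing_counts(data):
    """Count missing values per field."""
    counts = Counter()
    for row in data:
        for field, value in row.items():
            if is_missing(value):
                counts[field] += 1
    return dict(counts)


def standardize_country(value):
    """Map known spellings to the assumed standard code; leave unknown values trimmed."""
    key = value.strip().lower()
    return COUNTRY_ALIASES.get(key, value.strip())


missing = missing_counts(rows)
countries = [standardize_country(row["country"]) for row in rows]
print(missing)
print(countries)
```

Counting the gaps first tells you how much of each field is actually usable before you decide how to handle it.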

Let me give you an example of how missing data can arise. Suppose a company is surveying its customers about their financial status…



Written by Dhaval Thakur

Data enthusiast, geek, part-time blogger. One new Data Science/Product Management story every week 🖥 I also write on Python, scripting & blockchain
