10 Reasons Why Data Cleaning Is So Important in Data Science
Whenever we hear the term data science, we think of large datasets and machine learning algorithms that help data scientists predict values or classify outcomes into two or more classes.
But if you ask any data scientist, read a few textbooks, or work on enough projects, you will find that data people are commonly said to spend 70–80 percent of their time on data cleaning.
What is data cleaning?
It is the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty or coarse data.
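To make the definition concrete, here is a minimal sketch in plain Python of the three actions it names: modifying, replacing, and deleting. The records, the field names, and the `clean_record` helper are all made up for illustration, and the rules (drop rows with no name, treat anything that is not a non-negative whole number as an unknown age) are just one possible set of assumptions, not a standard recipe.

```python
# Hypothetical raw records, invented purely for illustration.
raw_records = [
    {"name": " Alice ", "age": "34", "city": "Pune"},   # stray whitespace
    {"name": "Bob", "age": "-5", "city": "Delhi"},      # inaccurate: negative age
    {"name": "", "age": "41", "city": "Mumbai"},        # incomplete: no name
    {"name": "Carol", "age": "n/a", "city": None},      # incomplete: age and city unknown
]


def clean_record(record):
    """Return a cleaned copy of the record, or None if it should be deleted."""
    # Modify: trim stray whitespace around the name.
    name = (record.get("name") or "").strip()
    if not name:
        # Delete: a record without a name is unusable under our assumed rules.
        return None

    # Replace: keep the age only if it is a non-negative whole number,
    # otherwise mark it as unknown (None).
    age_text = (record.get("age") or "").strip()
    age = int(age_text) if age_text.isdecimal() else None

    # Replace: fill a missing city with an explicit placeholder.
    city = record.get("city") or "unknown"

    return {"name": name, "age": age, "city": city}


cleaned = [c for c in map(clean_record, raw_records) if c is not None]

for row in cleaned:
    print(row)
```

In a real project you would more likely reach for a dataframe library, but the idea is the same: decide on rules for what counts as dirty, then apply them consistently to every record.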
Why bother with data cleaning?
The data you download from a server or the internet may have been compiled for some other specific task, or it may be user-collected data. The values in these datasets might not follow your own standards, and there might be many missing data points.
Let me give you one example of how missing data can arise. Suppose a company is surveying its customers about their financial status…