Splitting Datasets: From 0 to Hero

Learn about data splitting

Whenever you are dealing with a large amount of data, or building any kind of model (a predictive model or a classification model, for instance), you need to know these terms in order to evaluate your model properly.

BONUS: I have talked about a concept called data wrangling in another post, and you might like it!

First, let's see why we need to split the data at all. The reason is simple: we want to evaluate our model's performance.

Why split the dataset in the first place?

The main purpose of splitting data into separate sets is to avoid overfitting, where the model pays attention to minor details and noise that only improve accuracy on the training set. We want a model that performs well on data it has never seen (unknown data points); this ability is called generalization.

But, what is overfitting?

Okay, so I won't be going into the mathematics here; instead, I will explain it in simple, layman's terms.

Overfitting happens when your model learns all (or most) of the details and noise in the training data, to the extent that it negatively impacts the model's performance on new data (the test set). In other words, the noise and random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data, which hurts the model's ability to generalize.

Solution? Splitting your big dataset!

Splitting your dataset into small chunks

So.. okay, I told you the answer: you split the dataset. Now, how do you go about that? Here is the nutshell answer: we normally split the data into three chunks: the training set, the validation set, and the test set.

What are these sets anyway, Dhaval? Let's dive into these concepts one by one.

Training set: the set of examples used for learning, i.e. to fit the parameters of the classifier. In the multilayer perceptron (MLP) case, we would use the training set to find the "optimal" weights with the backpropagation rule. In layman's terms, you use this chunk to train your model, and in most cases it is the largest chunk of the data (around 70–80% of your whole dataset).

Validation set: a set of examples used to tune the hyperparameters of a classifier. In the MLP case, we would use the validation set to find the "optimal" number of hidden units, or to determine a stopping point for the backpropagation algorithm.

Test set: a set of examples used only to assess the performance of a fully trained classifier. In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, you must NOT tune the model any further!
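To make the three chunks concrete, here is a minimal sketch of a shuffle-and-split in plain Python. The function name and the 70/15/15 ratios are just illustrative choices, not the code from my upcoming post:

```python
import random

def train_val_test_split(data, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a list and split it into train/validation/test chunks."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = data[:]              # work on a copy; leave the original intact
    rng.shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return (shuffled[:train_end],          # training set: fit the model
            shuffled[train_end:val_end],   # validation set: tune hyperparameters
            shuffled[val_end:])            # test set: final, one-time evaluation

train, val, test = train_val_test_split(list(range(100)))
# 100 examples split as 70 / 15 / 15 items
```

Shuffling before cutting matters: if the data are ordered (say, by date or by class), taking the first 70% without shuffling would give you unrepresentative chunks.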

Now, to explain this in a fun way, I found a story on Quora that illustrates it well, so here it goes:

Train your Dog example!

Let's say you want to teach your dog a few tricks: sit, stay, roll over, etc. You can achieve this by giving the command and showing your dog what he needs to do when you say this command (training data). If you provide your dog with enough clear instructions on what he is supposed to learn, your dog might reach a point where he obeys your command almost every time (high training accuracy).

You can brag to your friends that your dog can perform a lot of tricks. However, will your dog do the correct thing if your friend gives the command (testing data)? If your dog rolls over when your friend tells him to sit, it might mean that your dog is only good at performing a trick when you (training data) give the command (low testing accuracy). This is an example of overfitting.

The reasons why your dog only responds correctly when you give the command can vary, but it comes down to your training data. Your friend might not be showing the same body language as you do when giving the command. If your friend isn't giving exactly the same input data (tone of voice, pronunciation of the command, hand gesture, body language, etc.) that the dog was trained on for a specific command, you might get a different outcome than you would expect.

So.. I hope you liked this story and got a basic understanding of the concept of splitting data. But before we wrap up, let's look at some important things:

Important things to keep in mind:

Never EVER train on test data. If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. For example, suspiciously high accuracy might indicate that test data has leaked into the training set.

For example, consider a model that predicts whether an email is spam, using the subject line, email body, and sender’s email address as features. We apportion the data into training and test sets, with an 80–20 split. After training, the model achieves 99% precision on both the training set and the test set. We’d expect a lower precision on the test set, so we take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set (we neglected to scrub duplicate entries for the same spam email from our input database before splitting the data).
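One hedged sketch of a guard against exactly that leakage: deduplicate before splitting, so the same email can never land in both sets. The helper name and the toy addresses below are made up for illustration:

```python
import random

def dedupe_then_split(emails, train_frac=0.8, seed=0):
    """Remove exact duplicates *before* splitting, so one spam email
    cannot appear in both the training and the test set."""
    unique = list(dict.fromkeys(emails))  # drops duplicates, keeps first-seen order
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * train_frac)
    return unique[:cut], unique[cut:]

# Toy example: "a@x" and "b@x" each appear twice in the raw data.
train, test = dedupe_then_split(["a@x", "b@x", "a@x", "c@x", "b@x", "d@x"],
                                train_frac=0.5)
# After deduplication, train and test share no emails.
```

Real pipelines often need fuzzier matching (the "same" spam email can differ by a tracking token), but the principle is identical: scrub duplicates first, split second.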

Large datasets vs. small datasets: which is better?

Large datasets tend to be preferable to small ones, since the larger your sample, the more precise your estimates will be. Small data does have a few benefits, though. For instance, visualization, inspection, and understanding what's going on in the data are much easier with small data than with large data. If you have 20,000 observations and 50 variables, it isn't easy to look at the data manually, so to speak, whereas 10 observations on 2 variables is much easier.

On the flip side, small datasets lead to lower precision in your estimates and lower statistical power, and carry a much larger risk that comparison groups differ by chance on some important background characteristic, making comparisons between the groups unfair, even if the data come from a randomized trial. To me, these drawbacks outweigh the benefits of having a small dataset.

Furthermore, if you have a large dataset, evaluating your models is easier, as you can split your data into training and evaluation sets. This means you can test your model on data that was not used to estimate its parameters. If your dataset is small, this might not be possible, since every observation then matters for estimating the parameters. Leave-one-out cross-validation is an option, but the resulting tests will be highly dependent on one another.
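Leave-one-out cross-validation can be sketched with a deliberately trivial model that just predicts the mean of the remaining observations. This is only an illustration of the mechanic (each point is held out once and predicted from the rest), not a recommendation of that model:

```python
def loo_cv_error(values):
    """Leave-one-out CV for a trivial 'predict the mean' model.

    Each observation is held out once; the 'model' is refit on the
    remaining n-1 points, and the squared error on the held-out point
    is recorded. Returns the mean squared error over all folds.
    """
    errors = []
    for i, held_out in enumerate(values):
        rest = values[:i] + values[i + 1:]     # training data for this fold
        prediction = sum(rest) / len(rest)      # refit on all-but-one
        errors.append((held_out - prediction) ** 2)
    return sum(errors) / len(errors)

mse = loo_cv_error([1.0, 2.0, 3.0])
```

Note the dependency problem mentioned above: any two folds share all but two observations, so their fitted models (and hence their errors) are strongly correlated.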

I hope you liked this article. In the next article, I will post the coding part: how you can easily split your datasets like this in Python.

Data enthusiast, geek, part-time blogger.
