How to Handle Missing Values in Machine Learning? Simplest Explanation

Missing values in data are a common problem that can reduce model accuracy. If you want to know how to handle missing values in machine learning, read this full article. In it, I explain three methods for handling them.

So without any further ado, let’s get started-

How to Handle Missing Values in Machine Learning?

Real-world data commonly has some missing values. If we ignore them and carry on, our model becomes less accurate. So we need to handle these missing values. There are various methods for doing so, but in this article I’m going to discuss the 3 main techniques.

Suppose this is our dataset (just for your reference)-

A	B	C	D	E
1	10	2345	30	80
2	20	4567	31	81
3	30	2358	34	82
–	–	–	54	–
–	40	6782	63	85
6	50	9846	45	87
7	60	2095	61	88
8	80	–	56	89

In this dataset, “–” represents a missing or null value. A, B, C, D, and E are the 5 features. So how do we handle these missing values?
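Before choosing a technique, it can help to see how many values each feature is missing. The following is a minimal sketch of the dataset above in plain Python, assuming `None` stands in for “–”; the names `columns`, `rows`, and `missing_per_feature` are hypothetical and chosen only for illustration.

```python
# The example dataset; None marks a missing value (shown as "–" in the table).
columns = ["A", "B", "C", "D", "E"]
rows = [
    [1, 10, 2345, 30, 80],
    [2, 20, 4567, 31, 81],
    [3, 30, 2358, 34, 82],
    [None, None, None, 54, None],
    [None, 40, 6782, 63, 85],
    [6, 50, 9846, 45, 87],
    [7, 60, 2095, 61, 88],
    [8, 80, None, 56, 89],
]

# Count the missing entries in each feature (column).
missing_per_feature = {
    name: sum(row[i] is None for row in rows)
    for i, name in enumerate(columns)
}
print(missing_per_feature)
```

Feature D is the only one with no gaps, while A and C are each missing two values.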

For this, there are 3 techniques-

Remove the record
Create a sub-model
Statistical methods (mean, median, and mode)

So let’s understand these techniques in detail-

1. Remove the Record

In this technique, we remove the record that has missing values from our dataset. For example, in this dataset the fourth record is missing four of its five features (only D is present), so we just remove this record from our dataset.

But this technique is good only if you have a huge dataset. A huge dataset means millions of records; in that case we can remove the records that have missing values. If you have a small dataset, don’t apply this technique, because in a small dataset each record contains information that helps the model learn. If you remove records from a small dataset, your model will be less accurate.
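Here is a minimal sketch of record removal in plain Python, assuming `None` marks a missing value. The helper `drop_sparse_records` and its `max_missing` threshold are hypothetical names introduced for this illustration; they are not from any library.

```python
# The example dataset (columns A..E); None marks a missing value.
rows = [
    [1, 10, 2345, 30, 80],
    [2, 20, 4567, 31, 81],
    [3, 30, 2358, 34, 82],
    [None, None, None, 54, None],
    [None, 40, 6782, 63, 85],
    [6, 50, 9846, 45, 87],
    [7, 60, 2095, 61, 88],
    [8, 80, None, 56, 89],
]


def drop_sparse_records(records, max_missing=1):
    """Keep only the records that have at most `max_missing` missing values."""
    return [r for r in records if sum(v is None for v in r) <= max_missing]


# Tolerate one gap per record: only the fourth record is removed.
kept = drop_sparse_records(rows)

# Tolerate no gaps: every record with any missing value is removed.
complete = drop_sparse_records(rows, max_missing=0)

print(len(rows), len(kept), len(complete))
```

With a threshold of one, only the record with many gaps is dropped. The stricter setting removes three of the eight records, which shows how quickly a small dataset shrinks under this technique.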

Now that you understand when to apply this technique, let’s move to the next technique, which is-

2. Create Sub-Model

So what is a sub-model? It means that before creating your main model, you create a separate model to predict the missing values in the dataset. But this technique takes more time and more computational effort.

Now let’s see how to create a sub-model. Suppose we want to know the missing value of feature A in the fifth record-

So for this, we need to create a sub-model. As we know, for building a machine learning model, we need training data and test data.

Right?

So in this case, B, C, D, and E are our non-target attributes (input) and A is our target attribute (output).

Why?

Because we need to find the missing value of feature A. So we pass the target and non-target attributes of the records where A is known to the training phase, and our model is trained on this training data.

After the training phase, we will give the values of B, C, D, and E (40, 6782, 63, 85) of that particular record to the model as input. Now the model predicts the value of A.

You can understand the whole procedure with the help of this image-

As you can see in this image, we have given the values of B, C, D, and E of the record whose A value we have to predict as input. The model has predicted the value of A as 5. Now you can fill in 5 as the value of A in your dataset.

So this is how you can construct a model for handling missing values in machine learning. But this is computationally expensive, so it is best suited to a small dataset.
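The sub-model procedure can be sketched in plain Python as follows. The article does not say which kind of model to use, so this sketch assumes a small k-nearest-neighbours regressor as a stand-in. The names `predict_a` and `scale` and the choice `k=2` are hypothetical. Features are standardized first so that the large values of C do not dominate the distance.

```python
import math
from statistics import mean, pstdev

# The example dataset (columns A..E); None marks a missing value.
rows = [
    (1, 10, 2345, 30, 80),
    (2, 20, 4567, 31, 81),
    (3, 30, 2358, 34, 82),
    (None, None, None, 54, None),
    (None, 40, 6782, 63, 85),
    (6, 50, 9846, 45, 87),
    (7, 60, 2095, 61, 88),
    (8, 80, None, 56, 89),
]

# Training data: only records where A and all of B..E are known.
train = [r for r in rows if None not in r]
X_train = [r[1:] for r in train]  # inputs: B, C, D, E
y_train = [r[0] for r in train]   # target: A

# Per-feature mean and standard deviation, used to standardize inputs.
feature_columns = list(zip(*X_train))
means = [mean(col) for col in feature_columns]
stds = [pstdev(col) for col in feature_columns]


def scale(x):
    """Standardize one input vector with the training statistics."""
    return [(v - m) / s for v, m, s in zip(x, means, stds)]


def predict_a(x, k=2):
    """Predict A as the average A of the k nearest training records."""
    query = scale(x)
    neighbours = sorted(
        (math.dist(query, scale(xt)), yt) for xt, yt in zip(X_train, y_train)
    )
    return mean(y for _, y in neighbours[:k])


# The record whose A is missing: B=40, C=6782, D=63, E=85.
predicted_a = predict_a((40, 6782, 63, 85))
print(predicted_a)
```

The predicted number depends on the model and its settings, so it will not necessarily match the value 5 used in the illustration above. In practice you would also check the sub-model on held-out records before trusting its fills.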

Now let’s move to the next technique which is-

3. Statistical Methods (Mean, Median, and Mode)

In this technique, we use the mean, median, or mode to fill in missing values. For example, suppose we have to fill the missing value of feature B using the mean. What we do is add up all the known values of feature B and divide by the number of known values.

Let’s understand how to use mean to find out the missing values of feature B-

Mean

B-> 10, 20, 30, -, 40, 50, 60, 80

Mean = (10+20+30+40+50+60+80)/7

= 290/7 ≈ 41.4

So we will replace the missing value with 41.4.
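The same calculation as a minimal Python sketch, using the standard `statistics` module and assuming `None` marks the missing value; the variable names are hypothetical.

```python
from statistics import mean

# Feature B; None marks the missing value.
b = [10, 20, 30, None, 40, 50, 60, 80]

# Average only the known values, then use that average to fill the gap.
known = [v for v in b if v is not None]
b_mean = mean(known)
b_filled = [b_mean if v is None else v for v in b]

print(round(b_mean, 1))
```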

So this is how you can find missing values using mean. Now let’s see how to use median to find out missing values-

Median-

Suppose we have to find the missing value of feature B using the median. First, we need to arrange all the known values of feature B in increasing order-

B-> 10, 20, 30, 40, 50, 60, 80

For the median we take the middle value, and in this case the middle value is 40 (the fourth of the seven values). So we will fill in 40 in place of the missing value.
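A minimal Python sketch of the median fill, again assuming `None` marks the missing value; `statistics.median` sorts the values itself, so no manual ordering is needed. The variable names are hypothetical.

```python
from statistics import median

# Feature B; None marks the missing value.
b = [10, 20, 30, None, 40, 50, 60, 80]

# Take the middle of the known values and use it to fill the gap.
known = [v for v in b if v is not None]
b_median = median(known)
b_filled = [b_median if v is None else v for v in b]

print(b_median)
```

Unlike the mean, the median is not pulled upward by a single large value such as 80, which is one reason it is often preferred for skewed features.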

Now let’s see how to use mode for finding missing value-

Mode-

In mode, we check the frequency of each value. The value that is repeated most often is the mode. For example, suppose the known values of feature B were these instead-

B-> 10, 20, 20, 40, 50, 60, 80

As you can see here, 20 is repeated 2 times and every other value appears only once, so we will replace the missing value with 20.
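A minimal Python sketch of the mode fill, using the illustrative list in which 20 appears twice. It also checks the original values of feature B, where every value is unique and therefore no single mode exists. The variable names are hypothetical.

```python
from statistics import mode, multimode

# Illustrative values for feature B, with 20 repeated twice.
b_example = [10, 20, 20, 40, 50, 60, 80]
b_mode = mode(b_example)  # the most frequent value
print(b_mode)

# The known values of B in the original dataset are all unique,
# so multimode returns every value: there is no single winner.
b_original = [10, 20, 30, 40, 50, 60, 80]
candidates = multimode(b_original)
print(len(candidates))
```

Mode is most useful for categorical features or features with repeated values; when all values are distinct, mean or median is the more sensible fill.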

So that’s all. I hope you understood all the 3 techniques for handling missing values in machine learning. Now it’s time to wrap up.

Conclusion

In this article, I have discussed the 3 techniques to handle missing values in machine learning. If you have any questions, feel free to ask me in the comment section.

If you found this article helpful, kindly share it with others.

All the Best!

Happy Learning!

Thank YOU!

Thought of the Day…

‘Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps learning stays young.’

– Henry Ford

Written By Aqsa Zafar

Founder of MLTUT, Machine Learning Ph.D. scholar at Dayananda Sagar University. Research on social media depression detection. Create tutorials on ML and data science for diverse applications. Passionate about sharing knowledge through website and social media.