Handling Missing Values
In real-world data, a particular element is sometimes absent for reasons such as corrupt data, a failure to load the information, or incomplete extraction. Handling missing values is one of the greatest challenges analysts face, because the choice of how to handle them affects how robust the resulting data models are.
Missing At Random:
Data are missing at random (MAR) when the probability that a value is missing in a variable depends only on the available information in other predictors.
For example, when men and women are asked “Have you ever taken parental leave?”, men tend to skip the question at a different rate than women.

Missing Not At Random:
Data are missing not at random (MNAR) when the probability that a value is missing depends on information that has not been recorded, and this information also predicts the missing values.
For example, in a survey, cheaters are less likely to respond when asked if they have ever cheated.
Another example is an IQ variable in which only the people with low IQ values have missing observations. A problem with the MNAR mechanism is that it is impossible to verify that scores are MNAR without knowing the missing values.

Missing Completely At Random:
MCAR occurs when the probability of missing values in a variable is the same for all samples.
For example, in a survey, values may be randomly missed when being entered into the computer, or a respondent may choose not to respond to a question.
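The three mechanisms can be compared on simulated data. The sketch below is only an illustration: the column names (`sex`, `leave_taken`) and the missingness probabilities are assumptions made up for this example and do not come from any real survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: "sex" is always observed, "leave_taken" is the answer of interest.
df = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "leave_taken": rng.integers(0, 2, size=n).astype(float),
})

# MCAR: every row has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of being missing depends only on an observed column (sex).
mar_mask = rng.random(n) < np.where(df["sex"] == "male", 0.30, 0.05)

# MNAR: the chance of being missing depends on the unrecorded answer itself.
mnar_mask = rng.random(n) < np.where(df["leave_taken"] == 1, 0.30, 0.05)

# Series.mask replaces the values where the mask is True with NaN.
df["mcar"] = df["leave_taken"].mask(mcar_mask)
df["mar"] = df["leave_taken"].mask(mar_mask)
df["mnar"] = df["leave_taken"].mask(mnar_mask)

# Missing rate per sex: similar for both groups under MCAR, different under MAR.
missing_rate_by_sex = df[["mcar", "mar", "mnar"]].isna().groupby(df["sex"]).mean()
print(missing_rate_by_sex)
```

Under MNAR the missing rate per sex also looks balanced, which is why the mechanism cannot be detected from the observed columns alone; the bias only shows up in the mean of the remaining answers.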
1. Deleting Rows/Columns
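A minimal sketch of deletion with pandas, assuming a small hypothetical DataFrame. The column names and the 50% threshold are illustrative choices, not values taken from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries in two columns.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Share of missing values per column, useful for deciding what to drop.
missing_share = df.isna().mean()

# Drop columns where more than half of the values are missing
# (the 0.5 threshold is an arbitrary choice for this sketch).
cols_dropped = df.loc[:, missing_share <= 0.5]

# Drop any remaining rows that still contain a missing value.
rows_dropped = cols_dropped.dropna(axis=0, how="any")
print(rows_dropped)
```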
Pros:
Cons:
2. Filling Null Values:
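One way to fill nulls is with a fixed placeholder per column, or by carrying the last observed value forward. The sketch below assumes a hypothetical DataFrame; the placeholder values `0` and `"Unknown"` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries in a numeric and a text column.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, "E46", None],
})

# Fill each column with a fixed placeholder chosen per column.
filled_constant = df.fillna({"Age": 0, "Cabin": "Unknown"})

# Alternatively, carry the last observed value forward
# (only sensible when the row order is meaningful).
age_forward = df["Age"].ffill()

print(filled_constant)
print(age_forward)
```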

Pros:
Cons:
3. Mean/Median/Mode Imputation:

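Here, the standard deviation of the Age column is 14.526497332334042, and the standard deviation of the Age_median column is 13.019696550973201. The spread shrinks because every missing entry receives the same central value.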
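The effect on the standard deviation can be reproduced on a small scale. The sketch below uses made-up ages rather than the original dataset, so its numbers differ from the ones quoted above; only the column names `Age` and `Age_median` follow the text.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0]})

# The median is computed from the observed values only (NaN is skipped).
median_age = df["Age"].median()

# Replace every missing entry with that single value.
df["Age_median"] = df["Age"].fillna(median_age)

# Compare the spread before and after imputation.
std_original = df["Age"].std()
std_imputed = df["Age_median"].std()
print(std_original, std_imputed)
```

Swapping `median()` for `mean()` or `mode()[0]` gives the mean and mode variants of the same approach.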

Pros:
Cons:
4. Random Sample Imputation:
It assumes that the data are missing completely at random (MCAR).
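A minimal sketch of random sample imputation, assuming a hypothetical `Age` column: each missing entry is replaced by a value drawn at random from the observed entries. The column name `Age_random` and the fixed `random_state` are choices made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0]})

missing = df["Age"].isna()

# Draw one observed value (with replacement) for every missing row.
sampled = df["Age"].dropna().sample(n=int(missing.sum()), replace=True, random_state=0)

# Re-label the draws with the index of the missing rows so they align on assignment.
sampled.index = df.index[missing]

df["Age_random"] = df["Age"]
df.loc[missing, "Age_random"] = sampled
print(df)
```

Because the fill values come from the observed distribution, the spread of the column is roughly preserved, unlike with a single median fill.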

Comparison

Pros:
Cons:
5. Capturing NaN Values With New Features
It works well if the data are not missing completely at random.
Create a new feature that captures the importance of missing values by placing 1 where the value is missing and 0 elsewhere.
Here, row 5 is missing, so it is imputed with the median, 28.
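The indicator feature can be sketched as follows. The ages are made up, and were chosen so that row 5 is missing and the median of the observed values is 28, mirroring the description above. The column name `Age_nan` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; row 5 is missing.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 28.0, 14.0]})

# New feature: 1 where Age was missing, 0 otherwise.
# It must be created before the missing values are filled.
df["Age_nan"] = np.where(df["Age"].isna(), 1, 0)

# Fill the missing entry with the median of the observed values.
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df)
```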
Pros:
Cons:




