Missing data can provide insights into the data collection process. It’s important to determine whether the missing data is randomly distributed or specific to certain features. Filling in data is a type of Data Transformation.
Identifying Missing Data
How do you find which features have the most missing data?
To find where missing values (NA) are located in your dataset, use the following commands:
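df.isnull().sum()  # number of missing values in each column
df.isna().sum()  # same result; isna is an alias of isnull
df[df.columns[df.isnull().sum() > 0].tolist()].info()  # info() for only the columns that contain missing values

Treating Missing Values (Imputation Techniques)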
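Before treating anything, it can help to answer the earlier question of which features have the most missing data. The following is a minimal sketch, assuming a small made-up DataFrame (the column names `age`, `city`, and `score` are hypothetical), that ranks columns by their count and percentage of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Oslo", "Lima", None, "Pune"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Count missing values per column and express them as a share of all rows.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
})

# Keep only columns with at least one missing value, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```

Sorting by `n_missing` puts the most affected features at the top, which is one way to decide where to look first.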
There are two main strategies for handling missing values: removing them or replacing them.
Remove Missing Values:
- dropna: Drops rows with missing values.
  - df.dropna(inplace=True): Drops rows with NA values and updates the DataFrame in place.
  - df.reset_index(inplace=True, drop=True): Resets the index after dropping rows.
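As a rough illustration of the removal strategy, the sketch below drops incomplete rows and then renumbers the index. The data is made up, and it assigns the result to a new variable instead of using `inplace=True`, which is a stylistic choice here so that the original DataFrame stays available for comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({
    "var1": [1.0, np.nan, 3.0, 4.0],
    "var2": ["a", "b", None, "d"],
})

# Drop every row that contains at least one missing value.
cleaned = df.dropna()

# The surviving rows keep their old labels (0 and 3), so renumber from 0.
cleaned = cleaned.reset_index(drop=True)

# Alternative: drop rows only when a specific column is missing.
cleaned_var1 = df.dropna(subset=["var1"]).reset_index(drop=True)

print(cleaned)
print(cleaned_var1)
```

Restricting the drop with `subset` tends to discard fewer rows when only some columns matter for the analysis.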
Replace Missing Values:
- fillna: Fills missing values with a specified value.
  - Example: df['var1'] = df['var1'].fillna(df['var1'].mean()) fills missing values in var1 with the column’s average.
- isnull: Checks for missing values.
- df.reindex: Reindexes the DataFrame.
- Imputation: methods for filling in missing data so that the filled-in values have a higher likelihood of being true.
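The replacement strategy can be sketched in the same way. The example below applies the mean fill described above to a numeric column. Filling a categorical column with its most frequent value (the mode) is an added illustration, not something this note prescribes, and the data is made up:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
df = pd.DataFrame({
    "var1": [2.0, np.nan, 4.0, 6.0],
    "var2": ["a", None, "a", "b"],
})

# Numeric column: replace missing values with the column mean
# (mean() skips missing values by default).
df["var1"] = df["var1"].fillna(df["var1"].mean())

# Categorical column (illustrative assumption): replace missing values
# with the most frequent value.
df["var2"] = df["var2"].fillna(df["var2"].mode()[0])

print(df)
```

A mean fill keeps the column average unchanged but shrinks its variance, so it is probably best treated as a simple baseline and not as a default.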