Missing data can provide insights into the data collection process. It’s important to determine whether the missing data is randomly distributed or specific to certain features. Filling in data is a type of Data Transformation.
In DE_Toolssee:
Resources:
Identifying Missing Data
How do you find which features have the most missing data?
To find where missing values (NA) are located in your dataset, use the following commands:
df.isnull().sum()
df.isna().sum()
df[df.columns[df.isnull().sum() > 0].tolist()].info()
Treating Missing Values (Imputation Techniques)
There are two main strategies for handling missing values: removing them or replacing them.
Remove Missing Values:
dropna
: Drops rows with missing values.df.dropna(inplace=True)
: Drops rows with NA values and updates the DataFrame in place.df.reset_index(inplace=True, drop=True)
: Resets the index after dropping rows.
Replace Missing Values:
fillna
: Fills missing values with specified values.- Example:
df['var1'] = df['var1'].fillna(df['var1'].mean())
fills missing values invar1
with the column’s average.
- Example:
isnull
: Checks for missing values.df.reindex
: Reindexes the DataFrame.- Imputation methods for filling in missing data so that it has a higher likelyhood of being true.