Missing Value Imputation, Explained: A Visual Guide with Code Examples for Beginners

DATA PREPROCESSING

One (tiny) dataset, six imputation methods

Let’s talk about something that every data scientist, analyst, or curious number-cruncher has to deal with sooner or later: missing values. Now, I know what you’re thinking — “Oh great, another missing value guide.” But hear me out. I’m going to show you how to tackle this problem using not one, not two, but six different imputation methods, all on a single dataset (with helpful visuals as well!). By the end of this, you’ll see why domain knowledge is worth its weight in gold (something even our AI friends might struggle to replicate).

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

What Are Missing Values and Why Do They Occur?

Before we get into our dataset and imputation methods, let’s take a moment to understand what missing values are and why they’re such a common headache in data science.

What Are Missing Values?

Missing values, often represented as NaN (Not a Number) in pandas or NULL in databases, are essentially holes in your dataset. They’re the empty cells in your spreadsheet, the blanks in your survey responses, the data points that got away. In the world of data, not all absences are created equal, and understanding the nature of your missing values is crucial for deciding how to handle them.

Why Do Missing Values Occur?

Missing values can sneak into your data for a variety of reasons. Here are some common ones:

- Data Entry Errors: Sometimes it’s just human error. Someone might forget to input a value or accidentally delete one.
- Sensor Malfunctions: In IoT or scientific experiments, a faulty sensor might fail to record data at certain times.
- Survey Non-Response: In surveys, respondents might skip questions they’re uncomfortable answering or don’t understand.
- Merged Datasets: When combining data from multiple sources, some entries might not have corresponding values in all datasets.
- Data Corruption: During data transfer or storage, some values might get corrupted and become unreadable.
- Intentional Omissions: Some data might be intentionally left out due to privacy concerns or irrelevance.
- Sampling Issues: The data collection method might systematically miss certain types of data.
- Time-Sensitive Data: In time series data, values might be missing for periods when data wasn’t collected (e.g., weekends, holidays).

Types of Missing Data

Understanding the type of missing data you’re dealing with can help you choose the most appropriate imputation method. Statisticians generally categorize missing data into three types (see the sketch after this list):

- Missing Completely at Random (MCAR): The missingness is totally random and doesn’t depend on any other variable. For example, a lab sample was accidentally dropped.
- Missing at Random (MAR): The probability of missing data depends on other observed variables but not on the missing data itself. For example, men might be less likely to answer questions about emotions in a survey.
- Missing Not at Random (MNAR): The missingness depends on the value of the missing data itself. For example, people with high incomes might be less likely to report their income in a survey.
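To make these three mechanisms concrete, here’s a small sketch that injects each kind of missingness into a made-up survey table. The column names and probabilities are invented purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
survey = pd.DataFrame({
    'gender': rng.choice(['M', 'F'], size=100),
    'income': rng.normal(50_000, 15_000, size=100),
    'mood_score': rng.integers(1, 11, size=100).astype(float),
})

# MCAR: every mood score has the same 10% chance of going missing,
# regardless of anything else in the data
survey.loc[rng.random(100) < 0.10, 'mood_score'] = np.nan

# MAR: missingness in mood_score depends on an observed column (gender),
# not on the mood value itself
mar_mask = (survey['gender'] == 'M') & (rng.random(100) < 0.30)
survey.loc[mar_mask, 'mood_score'] = np.nan

# MNAR: missingness depends on the value that disappears, so
# high earners decline to report their income
survey.loc[survey['income'] > 70_000, 'income'] = np.nan
```

The slightly unsettling part: you generally can’t confirm MNAR from the observed data alone, because the values you’d need to check are exactly the ones that are missing.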
Why Care About Missing Values?

Missing values can significantly impact your analysis:

- They can introduce bias if not handled properly.
- Many machine learning algorithms can’t handle missing values out of the box.
- They can lead to loss of important information if instances with missing values are simply discarded.
- Improperly handled missing values can lead to incorrect conclusions or predictions.

That’s why it’s crucial to have a solid strategy for dealing with missing values. And that’s exactly what we’re going to explore in this article!

The Dataset

First things first, let’s introduce our dataset. We’ll be working with a golf course dataset that tracks various factors affecting the crowdedness of the course. This dataset has a bit of everything — numerical data, categorical data, and yes, plenty of missing values.

This dataset is artificially made by the author (inspired by [1]) to promote learning.

```python
import pandas as pd
import numpy as np

# Create the dataset as a dictionary
data = {
    'Date': ['08-01', '08-02', '08-03', '08-04', '08-05', '08-06', '08-07', '08-08',
             '08-09', '08-10', '08-11', '08-12', '08-13', '08-14', '08-15', '08-16',
             '08-17', '08-18', '08-19', '08-20'],
    'Weekday': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5],
    'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 1.0,
                0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'Temp': [25.1, 26.4, np.nan, 24.1, 24.7, 26.5, 27.6, 28.2, 27.1, 26.7,
             np.nan, 24.3, 23.1, 22.4, np.nan, 26.5, 28.6, np.nan, 27.0, 26.9],
    'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0, 98.0, 78.0, np.nan, 70.0, 75.0,
                 np.nan, 77.0, 77.0, 89.0, 80.0, 88.0, 76.0, np.nan, 73.0, 73.0],
    'Wind': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan,
             0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
    'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy', np.nan, 'rainy',
                'rainy', 'overcast', 'sunny', np.nan, 'overcast', 'sunny', 'rainy',
                'sunny', 'rainy', np.nan, 'rainy', 'overcast', 'sunny'],
    'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20, 0.32, 0.72, 0.61, np.nan, 0.54,
                    np.nan, 0.67, 0.66, 0.38, 0.46, np.nan, 0.52, np.nan, 0.62, 0.81]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display basic information about the dataset
df.info()

# Display the first few rows of the dataset
print(df.head())

# Display the count of missing values in each column
print(df.isnull().sum())
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         20 non-null     object 
 1   Weekday      20 non-null     int64  
 2   Holiday      19 non-null     float64
 3   Temp         16 non-null     float64
 4   Humidity     17 non-null     float64
 5   Wind         19 non-null     float64
 6   Outlook      17 non-null     object 
 7   Crowdedness  15 non-null     float64
dtypes: float64(5), int64(1), object(2)
memory usage: 1.3+ KB
    Date  Weekday  Holiday  Temp  Humidity  Wind   Outlook  Crowdedness
0  08-01        0      0.0  25.1      99.0   0.0     rainy         0.14
1  08-02        1      0.0  26.4       NaN   0.0     sunny          NaN
2  08-03        2      0.0   NaN      96.0   0.0     rainy         0.21
3  08-04        3      0.0  24.1      68.0   0.0  overcast         0.68
4  08-05        4      NaN  24.7      98.0   0.0     rainy         0.20
Date           0
Weekday        0
Holiday        1
Temp           4
Humidity       3
Wind           1
Outlook        3
Crowdedness    5
dtype: int64
```

As we can see, our dataset contains 20 rows and 8 columns:

- Date: The date of the observation
- Weekday: Day of the week (0–6, where 0 is Monday)
- Holiday: Boolean indicating if it’s a holiday (0 or 1)
- Temp: Temperature in Celsius
- Humidity: Humidity percentage
- Wind: Wind condition (0 or 1, possibly indicating calm or windy)
- Outlook: Weather outlook (sunny, overcast, or rainy)
- Crowdedness: Percentage of course occupancy

And look at that! We’ve got missing values in every column except Date and Weekday. Perfect for our imputation party.
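Before picking imputation strategies, it often helps to see exactly where the gaps sit. The original article illustrates this with custom visuals; here’s a quick matplotlib sketch (assuming matplotlib is installed, which is not required by the rest of the walkthrough) that draws a simple missingness map of df:

```python
import matplotlib.pyplot as plt

# Dark cells mark missing entries; each image row is one dataset row
plt.figure(figsize=(8, 4))
plt.imshow(df.isnull(), aspect='auto', cmap='gray_r', interpolation='nearest')
plt.xticks(range(len(df.columns)), df.columns, rotation=45, ha='right')
plt.ylabel('Row index')
plt.title('Missing values in the golf course dataset')
plt.tight_layout()
plt.show()
```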
Now that we have our dataset loaded, let’s tackle these missing values with six different imputation methods. We’ll use a different strategy for each type of data.

Method 1: Listwise Deletion

Listwise deletion, also known as complete case analysis, involves removing entire rows that contain any missing values. This method is simple and preserves the distribution of the data, but it can lead to a significant loss of information if many rows contain missing values.

👍 Common Use: Listwise deletion is often used when the number of missing values is small and the data is missing completely at random (MCAR). It’s also useful when you need a complete dataset for certain analyses that can’t handle missing values.

In Our Case: We’re using listwise deletion for rows that have at least 4 missing values. These rows might not provide enough reliable information, and removing them can help us focus on the more complete data points. However, we’re being cautious and only removing rows with significant missing data to preserve as much information as possible.

```python
# Count missing values in each row
missing_count = df.isnull().sum(axis=1)

# Keep only rows with fewer than 4 missing values, resetting the index
# so the positional assignment in Method 6 lines up correctly
df_clean = df[missing_count < 4].reset_index(drop=True).copy()
```

We’ve removed 2 rows that had too many missing values. Now let’s move on to imputing the remaining missing data.

Method 2: Simple Imputation — Mean and Mode

Simple imputation involves replacing missing values with a summary statistic of the observed values. Common approaches include using the mean, median, or mode of the non-missing values in a column.

👍 Common Use: Mean imputation is often used for continuous variables when the data is missing at random and the distribution is roughly symmetric. Mode imputation is typically used for categorical variables.

In Our Case: We’re using mean imputation for Humidity and mode imputation for Holiday. For Humidity, assuming the missing values are random, the mean provides a reasonable estimate of the typical humidity. For Holiday, since it’s a binary variable (holiday or not), the mode gives us the most common state, which is a sensible guess for missing values.

```python
# Mean imputation for Humidity
df_clean['Humidity'] = df_clean['Humidity'].fillna(df_clean['Humidity'].mean())

# Mode imputation for Holiday
df_clean['Holiday'] = df_clean['Holiday'].fillna(df_clean['Holiday'].mode()[0])
```
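If you’d rather keep these steps inside scikit-learn (for example, to reuse them later in a Pipeline), SimpleImputer expresses the same mean and mode fills. This is an equivalent alternative, sketched here for reference rather than used in the rest of the walkthrough:

```python
from sklearn.impute import SimpleImputer

# strategy='mean' mirrors the mean fill used for Humidity
mean_imputer = SimpleImputer(strategy='mean')
df_clean[['Humidity']] = mean_imputer.fit_transform(df_clean[['Humidity']])

# strategy='most_frequent' mirrors the mode fill used for Holiday
mode_imputer = SimpleImputer(strategy='most_frequent')
df_clean[['Holiday']] = mode_imputer.fit_transform(df_clean[['Holiday']])
```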
Method 3: Linear Interpolation

Linear interpolation estimates missing values by assuming a linear relationship between known data points. It’s particularly useful for time series data or data with a natural ordering.

👍 Common Use: Linear interpolation is often used for time series data, where missing values can be estimated based on the values before and after them. It’s also useful for any data where there’s expected to be a roughly linear relationship between adjacent points.

In Our Case: We’re using linear interpolation for Temperature. Since temperature tends to change gradually over time and our data is ordered by date, linear interpolation can provide reasonable estimates for the missing temperature values based on the temperatures recorded on nearby days.

```python
# Estimate each missing Temp as the straight-line value between its neighbors
df_clean['Temp'] = df_clean['Temp'].interpolate(method='linear')
```

Method 4: Forward/Backward Fill

Forward fill (or “last observation carried forward”) propagates the last known value forward to fill gaps, while backward fill does the opposite. This method assumes that the missing value is likely to be similar to the nearest known value.

👍 Common Use: Forward/backward fill is often used for time series data, especially when the value is likely to remain constant until changed (like in financial data) or when the most recent known value is the best guess for the current state.

In Our Case: We’re using a combination of forward and backward fill for Outlook. Weather conditions often persist for several days, so it’s reasonable to assume that a missing Outlook value might be similar to the Outlook of the previous or following day.

```python
# Carry the previous day's Outlook forward, then fill any leading gap backward
df_clean['Outlook'] = df_clean['Outlook'].ffill().bfill()
```

Method 5: Constant Value Imputation

This method involves replacing all missing values in a variable with a specific constant value. This constant could be chosen based on domain knowledge or a safe default value.

👍 Common Use: Constant value imputation is often used when there’s a logical default value for missing data, or when you want to explicitly flag that a value was missing (by using a value outside the normal range of the data).

In Our Case: We’re using constant value imputation for the Wind column, replacing missing values with -1. This approach explicitly flags imputed values (since -1 is outside the normal 0–1 range for Wind) and preserves the information that these values were originally missing.

```python
# Flag missing Wind readings with -1, outside the normal 0-1 range
df_clean['Wind'] = df_clean['Wind'].fillna(-1)
```

Method 6: KNN Imputation

K-Nearest Neighbors (KNN) imputation estimates missing values by finding the K most similar samples in the dataset (just like KNN as a classification algorithm) and using their values to impute the missing data. This method can capture complex relationships between variables.

👍 Common Use: KNN imputation is versatile and can be used for both continuous and categorical variables. It’s particularly useful when there are expected to be complex relationships between variables that simpler methods might miss.

In Our Case: We’re using KNN imputation for Crowdedness. Crowdedness likely depends on a combination of factors (like temperature, holiday status, etc.), and KNN can capture these complex relationships to provide more accurate estimates of missing crowdedness values.

```python
from sklearn.impute import KNNImputer

# One-hot encode the 'Outlook' column
outlook_encoded = pd.get_dummies(df_clean['Outlook'], prefix='Outlook')

# Prepare features for KNN imputation
features_for_knn = ['Weekday', 'Holiday', 'Temp', 'Humidity', 'Wind']
knn_features = pd.concat([df_clean[features_for_knn], outlook_encoded], axis=1)

# Apply KNN imputation
knn_imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(pd.concat([knn_features, df_clean[['Crowdedness']]], axis=1)),
    columns=list(knn_features.columns) + ['Crowdedness']
)

# Update the cleaned dataframe with the imputed Crowdedness values
df_clean['Crowdedness'] = df_imputed['Crowdedness']
```
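One caveat worth knowing: KNN is distance-based, so features on large scales (Humidity runs up to 99) can drown out features on small scales (Wind is 0 or 1) when choosing neighbors. The block above uses raw features; a variation that standardizes everything first, meant as a replacement for that block rather than an addition to it, might look like this sketch:

```python
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Combine predictors and target exactly as in the block above
knn_all = pd.concat([knn_features, df_clean[['Crowdedness']]], axis=1)

# Standardize so every feature contributes comparably to the distance;
# scikit-learn scalers skip NaNs when fitting and pass them through
scaler = StandardScaler()
scaled = scaler.fit_transform(knn_all)

# Impute in the scaled space, then map values back to original units
imputed_scaled = KNNImputer(n_neighbors=3).fit_transform(scaled)
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                       columns=knn_all.columns, index=knn_all.index)

df_clean['Crowdedness'] = imputed['Crowdedness']
```

On our tiny dataset the difference may be small, but on mixed-scale data it can change which neighbors get picked, and therefore the imputed values.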
Conclusion: The Power of Choice (and Knowledge)

So, there you have it! Six different ways to handle missing values, all applied to our golf course dataset. Now, all missing values are filled!

Let’s recap how each method tackled our data:

- Listwise Deletion: Helped us focus on more complete data points by removing rows with extensive missing values.
- Simple Imputation: Filled in Humidity with average values and Holiday with the most common occurrence.
- Linear Interpolation: Estimated missing Temperature values based on the trend of surrounding days.
- Forward/Backward Fill: Guessed missing Outlook values from adjacent days, reflecting the persistence of weather patterns.
- Constant Value Imputation: Flagged missing Wind data with -1, preserving the fact that these values were originally unknown.
- KNN Imputation: Estimated Crowdedness based on similar days, capturing complex relationships between variables.

Each method tells a different story about our missing data, and the “right” choice depends on what we know about our golf course operations and what questions we’re trying to answer.

The key takeaway? Don’t just blindly apply imputation methods. Understand your data, consider the context, and choose the method that makes the most sense for your specific situation.

⚠️ Warning: The Purpose and Limitations of Missing Value Imputation

While we’ve explored various imputation techniques, we need to understand their purpose and limitations:

- Not a Magic Solution: Imputation is not a cure-all for missing data. It’s a tool to make your data usable, not to create perfect data.
- Potential for Bias: Imputed values are educated guesses. They can introduce bias if not done carefully, especially if the data is Missing Not at Random (MNAR).
- Loss of Uncertainty: Most simple imputation methods don’t account for the uncertainty in the missing values, which can lead to overconfident models.
- Data Distortion: Aggressive imputation can distort relationships in your data. Always check whether imputation has significantly altered your data’s distribution or correlations.
- Document Your Process: Always clearly document your imputation methods. This transparency is crucial for reproducibility and for others to understand potential biases in your results.

Again, the goal of imputation is to make your data usable while minimizing bias and information loss. It’s not about creating perfect data, but about making the best use of the information you have. Always approach imputation with caution and critical thinking.
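One way to acknowledge that uncertainty rather than hide it is multiple imputation: build several plausible completed datasets and check how much your downstream numbers move between them. Here’s a minimal sketch using scikit-learn’s IterativeImputer, which is experimental and needs an explicit enable import; the choice of columns and of five repetitions is arbitrary, for illustration only:

```python
import numpy as np
# Side-effect import: required while IterativeImputer is experimental
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Work on the original numeric columns, before any single imputation
numeric = df[['Weekday', 'Holiday', 'Temp', 'Humidity', 'Wind', 'Crowdedness']]

# sample_posterior=True draws each imputation from a predictive
# distribution, so different seeds give different plausible completions
completions = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(numeric)
    for seed in range(5)
]

# The spread across completions is a rough per-cell measure of how
# uncertain the imputed values really are
spread = np.std(completions, axis=0)
```

Cells that were observed have zero spread; cells with a large spread are exactly the ones your downstream model should trust least.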
🌟 Missing Value Imputation Summarized

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Create the dataset as a dictionary
data = {
    'Date': ['08-01', '08-02', '08-03', '08-04', '08-05', '08-06', '08-07', '08-08',
             '08-09', '08-10', '08-11', '08-12', '08-13', '08-14', '08-15', '08-16',
             '08-17', '08-18', '08-19', '08-20'],
    'Weekday': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5],
    'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 1.0,
                0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'Temp': [25.1, 26.4, np.nan, 24.1, 24.7, 26.5, 27.6, 28.2, 27.1, 26.7,
             np.nan, 24.3, 23.1, 22.4, np.nan, 26.5, 28.6, np.nan, 27.0, 26.9],
    'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0, 98.0, 78.0, np.nan, 70.0, 75.0,
                 np.nan, 77.0, 77.0, 89.0, 80.0, 88.0, 76.0, np.nan, 73.0, 73.0],
    'Wind': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan,
             0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
    'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy', np.nan, 'rainy',
                'rainy', 'overcast', 'sunny', np.nan, 'overcast', 'sunny', 'rainy',
                'sunny', 'rainy', np.nan, 'rainy', 'overcast', 'sunny'],
    'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20, 0.32, 0.72, 0.61, np.nan, 0.54,
                    np.nan, 0.67, 0.66, 0.38, 0.46, np.nan, 0.52, np.nan, 0.62, 0.81]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# 1. Listwise Deletion
df_clean = df[df.isnull().sum(axis=1) < 4].reset_index(drop=True).copy()

# 2. Simple Imputation
df_clean['Humidity'] = df_clean['Humidity'].fillna(df_clean['Humidity'].mean())
df_clean['Holiday'] = df_clean['Holiday'].fillna(df_clean['Holiday'].mode()[0])

# 3. Linear Interpolation
df_clean['Temp'] = df_clean['Temp'].interpolate(method='linear')

# 4. Forward/Backward Fill
df_clean['Outlook'] = df_clean['Outlook'].ffill().bfill()

# 5. Constant Value Imputation
df_clean['Wind'] = df_clean['Wind'].fillna(-1)

# 6. KNN Imputation
# One-hot encode the 'Outlook' column
outlook_encoded = pd.get_dummies(df_clean['Outlook'], prefix='Outlook')

# Prepare features for KNN imputation
features_for_knn = ['Weekday', 'Holiday', 'Temp', 'Humidity', 'Wind']
knn_features = pd.concat([df_clean[features_for_knn], outlook_encoded], axis=1)

# Apply KNN imputation
knn_imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(pd.concat([knn_features, df_clean[['Crowdedness']]], axis=1)),
    columns=list(knn_features.columns) + ['Crowdedness']
)

# Update the cleaned dataframe with the imputed Crowdedness values
df_clean['Crowdedness'] = df_imputed['Crowdedness'].round(2)

print("Before:")
print(df)
print("\n\nAfter:")
print(df_clean)
```

Further Reading

For a detailed explanation of the KNNImputer and its implementation in scikit-learn, readers can refer to the official documentation [2], which provides comprehensive information on its usage and parameters.

Technical Environment

This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.

About the Illustrations

Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.

For a concise visual summary of Missing Value Imputation, check out the companion Instagram post.

References

[1] T. M. Mitchell, Machine Learning (1997), McGraw-Hill Science/Engineering/Math, p. 59.
[2] F. Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html