Clean Data with Arctic Fox
In this post, we'll walk through cleaning data using Arctic Fox. Specifically, we're going to fill in missing values using interpolation, a simple and powerful method to estimate missing data based on nearby values.
This is just one of many techniques for cleaning data - but it's quick, intuitive, and perfect for numeric columns with trends or sequences. Let's get started!
Load and View Data
First, let's load in our dataset, which contains some missing values we want to fix.
We'll use the Data kit to load the MessyIMDBDataset.csv file into a pandas dataframe.
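If you're curious what this step roughly amounts to in plain pandas, here's a minimal sketch. It is not Arctic Fox's own code: the in-memory CSV and its rows are made up so the snippet runs on its own, and the Title column is an assumption (this post only mentions Year and Duration).

```python
import io

import pandas as pd

# Hypothetical stand-in for MessyIMDBDataset.csv; these rows are invented for illustration.
sample_csv = io.StringIO(
    "Title,Year,Duration\n"
    "Film A,1994,142\n"
    "Film B,2008,\n"
    "Film C,1972,175\n"
)

# With the real file you would pass its path instead, e.g. pd.read_csv("MessyIMDBDataset.csv").
df = pd.read_csv(sample_csv)

# The empty Duration field is read as NaN, which makes the column floating point.
print(df)
```

The empty field in the second row is what later shows up as NaN.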

Next, let's view the data to understand what we're working with. Add the Visualize kit, which prints out the dataframe so we can inspect it directly.

Run the script. You'll see your data printed out in the file as comments. Pay special attention to the Duration column - you'll notice it has some missing values that show up as NaN.
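Scanning printed output for NaN works for a small table, but it's easy to miss gaps in a long one. As a rough pandas sketch (the sample values below are invented, not taken from the real dataset), you can count the missing values per column and pull out the affected rows:

```python
import numpy as np
import pandas as pd

# Invented sample data standing in for the loaded dataset.
df = pd.DataFrame(
    {
        "Year": [1994, 2008, 1972, 2010],
        "Duration": [142.0, np.nan, 175.0, np.nan],
    }
)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Only the rows where Duration is missing.
rows_missing_duration = df[df["Duration"].isna()]
print(rows_missing_duration)
```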

Let's fix that!
Sort the Data Before Interpolation
We're going to interpolate the Duration column, but before we do, it's a good idea to sort the data by year. Interpolation estimates each missing value from the rows next to it, so row order matters: once the rows are sorted by year, each gap is filled from movies released around the same time, which is more meaningful if durations drift over time.
Add the RowSort kit and set it to sort the rows by the Year column in ascending order.

Run the script again. You should now see the dataframe with rows ordered from the earliest to the latest year.
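For comparison, sorting by year in plain pandas might look like the sketch below. This is an assumed equivalent of what the RowSort kit does, not its actual implementation, and the rows are again invented.

```python
import numpy as np
import pandas as pd

# Invented sample data, deliberately out of chronological order.
df = pd.DataFrame(
    {
        "Title": ["Film B", "Film C", "Film A"],
        "Year": [2008, 1972, 1994],
        "Duration": [np.nan, 175.0, 142.0],
    }
)

# Sort rows by Year, earliest first, and renumber the index to match the new order.
sorted_df = df.sort_values("Year", ascending=True).reset_index(drop=True)
print(sorted_df)
```

Resetting the index is optional, but it keeps the row labels in line with the new order, which makes the printed table easier to read.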

Perfect - now we're ready to clean!
Interpolate the Duration Column
Let's fill in those missing durations!
To do that, we'll use the CleanData kit with the --interpolate option. We'll also pass in the column we want to clean - in this case, Duration.

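Run the script once more. The NaN values in Duration will be interpolated between the nearest non-missing values. With linear interpolation, a single missing value becomes the average of the rows directly above and below it (in our sorted order). When several rows in a row are missing, the filled values are spaced evenly between the known values on either side of the gap.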
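To make the arithmetic concrete, here's a small pandas sketch of linear interpolation. It assumes the --interpolate option behaves like pandas' default linear method, which this post doesn't confirm, and the durations are invented.

```python
import numpy as np
import pandas as pd

# Invented durations, already in sorted order: one single gap and one two-row gap.
durations = pd.Series([100.0, np.nan, 120.0, np.nan, np.nan, 150.0])

# Linear interpolation treats rows as evenly spaced and draws a straight line
# between the known values on either side of each gap.
filled = durations.interpolate(method="linear")
print(filled.tolist())
# → [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
```

The single gap becomes the midpoint of 100 and 120, while the two-row gap is split into equal steps between 120 and 150. Note that this method uses row position only, so it ignores how many years actually separate neighbouring rows.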

Boom - those gaps are filled, and your data is ready for analysis!
Recap: What We Did
In this guide, we:
- Loaded a dataset using the Data kit
- Viewed the dataset using the Visualize kit
- Spotted missing values in the Duration column
- Sorted the dataset by year using the RowSort kit
- Cleaned the missing values via interpolation with the CleanData kit
Thanks for following along!