KitDocumentation

RowDuplicates

Counts or removes the duplicate rows within a dataframe. This kit can be applied to all columns of the dataframe, such that a row is only counted or removed if every column is the same for two records. Alternatively, it can be applied to only specified columns of the dataframe, such that a row is counted or removed if every specified column is the same for two records, regardless of the values in the other columns.
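As a minimal sketch of the difference between the two modes, the pandas snippet below assumes that restricting the check to specified columns corresponds to the `subset` argument of `DataFrame.duplicated`. That mapping, the sample data, and the variable names are illustrative assumptions and are not taken from the kit itself.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 match in every column,
# while row 2 matches them only in the "name" column.
df = pd.DataFrame({
    "name": ["ann", "ann", "ann", "bob"],
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
})

# All columns: a row is flagged only if every column matches an earlier row.
all_columns_mask = df.duplicated()

# Specified columns only (assumed equivalent of the columns option):
# a row is flagged if the "name" value matches an earlier row.
name_only_mask = df.duplicated(subset=["name"])

all_columns_count = int(all_columns_mask.sum())
name_only_count = int(name_only_mask.sum())
```

With this data, the all-columns check flags one row and the name-only check flags two, since the third row repeats a name but not a city.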

Options

columns: Specifies columns to check for duplicates
count: Specifies whether to count duplicates
remove: Specifies whether to remove duplicates

Examples

Example 1 - Count the Number of Duplicate Rows

It’s useful to identify how many rows in a dataset are exact duplicates. This example counts the number of duplicated rows so you can assess the extent of redundancy in the dataframe.
#> RowDuplicates --count
AFLEFT 
duplicateRowsCount = df[df.duplicated()].shape[0] AFRIGHT
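One detail worth knowing when reading this count: by default, pandas `duplicated()` does not flag the first occurrence of a repeated row, so the number reflects the extra copies rather than every row that belongs to a duplicate group. The sketch below uses hypothetical sample data to contrast the default with `keep=False`, which flags every member of a group.

```python
import pandas as pd

# Hypothetical data: the value 1 appears three times, the value 2 once.
df = pd.DataFrame({"x": [1, 1, 1, 2]})

# Default (keep='first'): the first occurrence is not counted as a duplicate.
extra_copies = df[df.duplicated()].shape[0]

# keep=False: every row that has at least one twin is counted.
all_group_members = df[df.duplicated(keep=False)].shape[0]
```

Here the default count is 2 (the second and third copies), while `keep=False` yields 3.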

Example 2 - Extract Duplicate Rows from a Dataframe

Alternatively, you may want to examine duplicate rows directly. This example extracts all duplicated rows from the dataframe so they can be inspected or processed separately.
#> RowDuplicates
AFLEFT 
duplicateRows = df[df.duplicated()] AFRIGHT

Example 3 - Remove Duplicate Rows

Instead, you may want to remove duplicate rows from a dataframe to make the dataframe unique. This operation removes all repeated rows while keeping only the first occurrence of each duplicated value set.
#> RowDuplicates --remove
AFLEFT 
df = df.drop_duplicates(keep='first') AFRIGHT
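The keep-first behavior described for this example can be checked on a small dataframe. This is a sketch with hypothetical sample data and variable names, not output from the kit.

```python
import pandas as pd

# Hypothetical data: rows 0/1 and rows 2/3 are exact duplicates of each other.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "value": ["a", "a", "b", "b", "c"],
})

# Keep the first occurrence of each repeated row and drop the later copies.
deduplicated = df.drop_duplicates(keep="first")

# The original index labels of the surviving rows are preserved.
remaining_index = deduplicated.index.tolist()
```

The surviving rows keep their original index labels (0, 2, 4 here). If a contiguous index is needed afterwards, `reset_index(drop=True)` can be applied to the result.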