RowDuplicates
Counts or removes duplicate rows within a dataframe. This kit can be applied to all columns of the dataframe, so that a row is counted or removed only if every column matches another record. Alternatively, it can be applied to only specified columns of the dataframe, so that a row is counted or removed if the specified columns match another record, regardless of the values in the remaining columns.
Options
columns: Specifies columns to check for duplicates
count: Specifies whether to count duplicates
remove: Specifies whether to remove duplicates
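As a rough pandas sketch of what restricting the check to specified columns could look like: this assumes the columns option corresponds to the `subset` argument of `duplicated` and `drop_duplicates`, which this page does not confirm. The sample data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Bob"],
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "age": [30, 31, 40, 40],
})

# All columns: only the second "Bob" row fully repeats an earlier row.
all_columns_count = int(df.duplicated().sum())

# Only "name" and "city" (assumed mapping of the columns option to subset):
# the second "Ann" row is now flagged as well, despite the different age.
subset_count = int(df.duplicated(subset=["name", "city"]).sum())

# Removing on the same subset keeps the first row of each (name, city) pair.
deduplicated = df.drop_duplicates(subset=["name", "city"], keep="first")
```

Checking a subset is the looser test, so it can flag rows that the all-columns check leaves alone.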
Examples
Example 1 - Count the Number of Duplicate Rows
It’s useful to identify how many rows in a dataset are exact duplicates. This example counts the number of duplicated rows so you can assess the extent of redundancy in the dataframe.
#> RowDuplicates --count
AFLEFT
duplicateRowsCount = df[df.duplicated()].shape[0] AFRIGHT
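Note that `df.duplicated()` with its default `keep='first'` flags only the second and later occurrences, so the count excludes the first row of each duplicate group. A small sketch on made-up data shows how this differs from counting every row that belongs to a duplicate group; the name `rowsInDuplicateGroups` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical sample data: one group of three identical rows, one group of two.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 3, 3],
    "b": ["x", "x", "x", "y", "z", "z"],
})

# Default keep='first': counts repeats only (rows 1, 2 and 5).
duplicateRowsCount = df[df.duplicated()].shape[0]

# keep=False: counts every row that has at least one identical twin
# (rows 0, 1, 2, 4 and 5).
rowsInDuplicateGroups = int(df.duplicated(keep=False).sum())
```

The first figure answers "how many rows would be removed", while the second answers "how many rows are involved in duplication".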
Example 2 - Extract Duplicate Rows from a Dataframe
Alternatively, you may want to examine duplicate rows directly. This example extracts the duplicated rows (every occurrence after the first) from the dataframe so they can be inspected or processed separately.
#> RowDuplicates
AFLEFT
duplicateRows = df[df.duplicated()] AFRIGHT
Example 3 - Remove Duplicate Rows
Instead, you may want to remove duplicate rows so that every remaining row in the dataframe is unique. This operation removes all repeated rows while keeping only the first occurrence of each duplicated set of values.
#> RowDuplicates --remove
AFLEFT
df = df.drop_duplicates(keep='first') AFRIGHT
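A minimal, self-contained sketch of the removal step on made-up data, using pandas `drop_duplicates`. The `reset_index` call and the name `uniqueRows` are additions for illustration, not something the kit is known to produce.

```python
import pandas as pd

# Hypothetical sample data with repeated rows.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3, 3],
    "value": ["a", "b", "b", "c", "c", "c"],
})

# Keep the first occurrence of each distinct row and drop the later repeats.
uniqueRows = df.drop_duplicates(keep="first")

# Optionally renumber the index, since the dropped rows leave gaps (0, 1, 3).
uniqueRows = uniqueRows.reset_index(drop=True)
```

Assigning the result to a new name keeps the original dataframe available for comparison; reassigning to `df` works equally well when the original is no longer needed.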