CleanData
Removes rows from the dataframe according to one of the following:
- Remove rows at specified indexes
- Remove rows from a start index through a stop index
- Remove rows that meet a specified criterion
- Remove rows that have missing data
- Remove rows that do not have missing data
Options
columns: Specifies the columns to check when removing rows; use with either --missing or --notMissing
where: Specifies a condition for removing rows
index: Specifies the index for removing rows
indexStart: Specifies the start index for removing rows
indexStop: Specifies the stop index for removing rows
missing: Specifies whether to remove rows where the specified columns have missing values
notMissing: Specifies whether to remove rows where the specified columns do not have missing values
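The row-removal options above correspond to ordinary pandas operations. The sketch below is only an illustration. The sample pizzeriasDf is hypothetical, and the mapping from each option to a pandas call is an assumption, not the code CleanData is documented to generate. Treating indexStop as inclusive is also an assumption.

```python
import pandas as pd

# Hypothetical sample data; the Name column and all values are assumptions.
pizzeriasDf = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Rating": [4.5, None, 3.0, None, 5.0],
})

# index: drop the rows at specific index labels.
by_index = pizzeriasDf.drop(index=[0, 2]).reset_index(drop=True)

# indexStart / indexStop: drop a contiguous range of rows
# (the stop index is treated as inclusive here, which is an assumption).
start, stop = 1, 3
by_range = pizzeriasDf.drop(index=list(range(start, stop + 1))).reset_index(drop=True)

# where: drop the rows that meet a condition, here Rating < 4.
# Rows with a missing Rating do not meet the condition, so they are kept.
by_where = pizzeriasDf[~(pizzeriasDf["Rating"] < 4)].reset_index(drop=True)

# missing + columns: drop rows where the chosen column is missing.
by_missing = pizzeriasDf.dropna(subset=["Rating"]).reset_index(drop=True)

# notMissing + columns: drop rows where the chosen column is NOT missing.
by_not_missing = pizzeriasDf[pizzeriasDf["Rating"].isna()].reset_index(drop=True)
```

Each result is a new dataframe, so the original pizzeriasDf is left unchanged and the variants can be compared side by side.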
Examples
Example 1 - Remove Rows with Missing Cells
One of the easiest ways to clean a dataframe with missing cells is to remove the rows with missing cells entirely, using CleanData with the --removeMissing option.
#> CleanData --removeMissing
pizzeriasDf = pizzeriasDf.dropna().reset_index(drop=True)
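To see what the generated line does, here is a minimal sketch on a hypothetical three-row pizzeriasDf (the sample data is assumed). dropna() removes every row that contains a missing cell, and reset_index(drop=True) renumbers the remaining rows from 0 without keeping the old index as a column.

```python
import pandas as pd

# Hypothetical sample data; the middle row has a missing Rating.
pizzeriasDf = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Rating": [4.5, None, 3.0],
})

# Drop rows with any missing cell, then renumber the index from 0.
pizzeriasDf = pizzeriasDf.dropna().reset_index(drop=True)
print(pizzeriasDf)
```

Without reset_index, the surviving rows would keep their original labels 0 and 2, leaving a gap in the index.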
Example 2 - Fill Missing Cells with Mean
Rather than removing rows with missing cells from a dataframe, it may be beneficial to keep those rows and fill the missing cells with a statistical value. In this case, we fill the missing cells with the mean of the column.
#> CleanData --fill --columns Rating --mean
pizzeriasDf[['Rating']] = pizzeriasDf[['Rating']].apply(lambda col: col.fillna(col.mean()))
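As a small worked illustration of the mean fill, the sketch below applies the same pattern to a hypothetical pizzeriasDf with two numeric columns. The Price column and all values are assumptions made for the example. Each column is filled with its own mean, which pandas computes over the non-missing values only.

```python
import pandas as pd

# Hypothetical sample data with gaps in both columns.
pizzeriasDf = pd.DataFrame({
    "Rating": [4.0, None, 2.0, None],
    "Price": [10.0, 20.0, None, 30.0],
})

# Fill each listed column with that column's own mean.
cols = ["Rating", "Price"]
pizzeriasDf[cols] = pizzeriasDf[cols].apply(lambda col: col.fillna(col.mean()))
print(pizzeriasDf)
```

Here the Rating mean is 3.0 and the Price mean is 20.0, so those are the values that replace the gaps in their respective columns.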
Example 3 - Interpolate Missing Cells
Another approach to cleaning a data set with missing cells is to fill the cells with a midpoint value based on the surrounding cells. This process is known as interpolation. Rather than being replaced with a statistical value, each cell receives an interpolated value based on the cells around it.
#> CleanData --interpolate --columns Rating
pizzeriasDf[['Rating']] = pizzeriasDf[['Rating']].interpolate()
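The following sketch shows how interpolation differs from a mean fill, using a hypothetical Rating column (the values are assumed). With no arguments, pandas interpolate() uses linear interpolation and treats the rows as equally spaced, so a run of missing cells is filled with evenly stepped values between its known neighbors.

```python
import pandas as pd

# Hypothetical sample data with one single gap and one two-cell gap.
pizzeriasDf = pd.DataFrame({"Rating": [1.0, None, 3.0, None, None, 6.0]})

# Linear interpolation between the nearest known values on each side.
pizzeriasDf[["Rating"]] = pizzeriasDf[["Rating"]].interpolate()
print(pizzeriasDf["Rating"].tolist())
```

A mean fill would have put the same value (about 3.33) in every gap, while interpolation follows the trend of the neighboring rows.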