CleanData

Removes rows from the dataframe in one of the following ways:
- Removes rows at specified indexes
- Removes rows in a range that starts at one index and ends at a second index
- Removes rows that meet a specified condition
- Removes rows that have missing data
- Removes rows that do not have missing data
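
The first three removal modes can be sketched directly in pandas. This is only an illustration of the idea, not necessarily the code CleanData generates: the sample dataframe, the variable names, and the choice to treat the stop position as exclusive are all assumptions.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Rating": [4.5, 3.0, 2.5, 4.0, 1.5, 5.0],
})

# Remove rows at specified index labels.
by_index = df.drop(index=[0, 2])

# Remove a run of rows by position.
# Assumption: the stop position is exclusive, as in a Python slice.
start, stop = 1, 4
by_range = df.drop(index=df.index[start:stop])

# Remove rows that meet a condition by keeping the rows that do not meet it.
by_condition = df[~(df["Rating"] < 3.0)]

print(by_index, by_range, by_condition, sep="\n\n")
```

None of these calls renumber the remaining rows; chaining `.reset_index(drop=True)` does that when a contiguous index is wanted.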

Options

columns: Specifies the columns to check when removing rows; use with either --missing or --notMissing

where: Specifies a condition for removing rows

index: Specifies the index for removing rows

indexStart: Specifies the start index for removing rows

indexStop: Specifies the stop index for removing rows

missing: Specifies whether to remove rows where the specified columns have missing values

notMissing: Specifies whether to remove rows where the specified columns do not have missing values
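
As a rough pandas equivalent of combining columns with missing or notMissing, the sketch below restricts the missing-value check to a chosen set of columns. The dataframe and the `columns` list are made up for illustration, and the reading of notMissing (drop rows whose chosen columns are all present) is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in two columns.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Rating": [4.5, np.nan, 3.0, np.nan],
    "Price": [np.nan, 2.0, 1.0, 3.0],
})

# Columns to check; gaps in other columns are ignored.
columns = ["Rating"]

# "missing": remove rows where any of the chosen columns is missing.
without_missing = df.dropna(subset=columns)

# "notMissing" (assumed meaning): remove rows where the chosen columns are
# all present, which leaves only the rows that have a gap in them.
only_missing = df[df[columns].isna().any(axis=1)]

print(without_missing)
print(only_missing)
```

Row 0 survives the first call even though its Price is missing, because Price is not in the checked columns.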

Examples

Example 1 - Remove Rows with Missing Cells

One of the easiest ways to clean a dataframe with missing cells is to remove the rows with missing cells entirely. To do this, call CleanData with the --removeMissing option.

#> CleanData --removeMissing
AFLEFT 
pizzeriasDf = pizzeriasDf.dropna().reset_index(drop=True) AFRIGHT
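
To see what the generated line does, it can be run against a small stand-in for pizzeriasDf. The contents of the dataframe below are invented for illustration; only the dropna/reset_index line comes from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pizzeriasDf; one row has a missing Rating.
pizzeriasDf = pd.DataFrame({
    "Name": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Rating": [4.5, np.nan, 3.0, 4.0],
})

# Drop every row that contains at least one missing cell, then renumber
# the remaining rows from 0 without keeping the old index as a column.
pizzeriasDf = pizzeriasDf.dropna().reset_index(drop=True)

print(pizzeriasDf)
```

The row for "Bravo" is removed, and the three remaining rows are renumbered 0 through 2.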

Example 2 - Fill Missing Cells with Mean

Rather than removing rows with missing cells from a dataframe, it may be beneficial to keep those rows and fill the missing cells with a statistical value. In this case, we fill the missing cells with the mean of the column.

#> CleanData --fill --columns Rating --mean
AFLEFT 
pizzeriasDf[ ['Rating'] ] = pizzeriasDf[ ['Rating'] ].apply(lambda col: col.fillna(col.mean())) AFRIGHT
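
A small worked case makes the effect of the generated line easier to check by hand. The data below is invented; the fill line is the one shown in the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pizzeriasDf; the second Rating is missing.
pizzeriasDf = pd.DataFrame({
    "Name": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Rating": [4.0, np.nan, 2.0, 3.0],
})

# mean() skips missing values, so the mean here is (4.0 + 2.0 + 3.0) / 3 = 3.0.
pizzeriasDf[["Rating"]] = pizzeriasDf[["Rating"]].apply(
    lambda col: col.fillna(col.mean())
)

print(pizzeriasDf)
```

All four rows are kept, and the missing Rating becomes 3.0. Because the lambda runs once per column, each column listed is filled with its own mean.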

Example 3 - Interpolate Missing Cells

Another approach to cleaning a data set with missing cells is to fill the cells with a value estimated from the surrounding cells. This process is known as interpolation. Rather than being replaced with a statistical value, each cell receives an interpolated value based on the cells around it.

#> CleanData --interpolate --columns Rating
AFLEFT 
pizzeriasDf[['Rating']] = pizzeriasDf[['Rating']].interpolate() AFRIGHT
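
The sketch below runs the generated line on made-up data to show how the gaps are filled. It relies on the pandas default for interpolate(), which is linear interpolation that treats rows as equally spaced; the dataframe contents are an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pizzeriasDf with one single gap and one double gap.
pizzeriasDf = pd.DataFrame({
    "Rating": [1.0, np.nan, 3.0, np.nan, np.nan, 6.0],
})

# Default method is linear: each gap is filled along a straight line
# between the nearest known values before and after it.
pizzeriasDf[["Rating"]] = pizzeriasDf[["Rating"]].interpolate()

print(pizzeriasDf)
```

The single gap between 1.0 and 3.0 becomes 2.0, and the two-cell gap between 3.0 and 6.0 becomes 4.0 and 5.0, so a filled value is the exact midpoint only when a single cell is missing.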