KitDocumentation

RemoveWeirdCharacters

Not all datasets are clean from the start. One way data can be dirty is in its column names. We created the RemoveWeirdCharacters kit to remove non-standard characters from column names. The kit can remove weird characters from either the specified columns or the entire dataframe.
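
As a rough illustration of what "non-standard characters" means here, the sketch below applies the same non-ASCII pattern used in the examples in this document to a list of column names. The column names are hypothetical sample values, and the snippet uses only Python's standard `re` module; it is not the kit's own implementation.

```python
import re

# Same pattern the examples in this document use: one or more characters
# outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

# Hypothetical column names containing encoding debris:
# a non-breaking space (\u00a0) and a curly apostrophe (\u2019).
column_names = ['Label\u00a0(Grouping)', 'Texas\u2019 Total', 'Estimate']

# Strip every non-ASCII run from each name.
cleaned_names = [NON_ASCII.sub('', name) for name in column_names]
print(cleaned_names)
```

Note that this pattern drops every non-ASCII character, so a non-breaking space disappears without leaving a regular space behind, and legitimate accented letters such as "é" are removed as well.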

Options

columns: Specifies the columns to remove weird characters from. If omitted, weird characters are removed from the entire dataframe.

Examples

Example 1 - Remove Non-ASCII Characters from Entire DataFrame

Cleans the full dataset by stripping out any characters outside the ASCII range. This is especially helpful when working with data where encoding issues may introduce stray symbols or formatting artifacts across all columns.
#> RemoveWeirdCharacters
AFLEFT 
for column in texasCensusDf.columns:
    texasCensusDf[column] = texasCensusDf[column].str.replace(r'[^\x00-\x7F]+', '', regex=True) AFRIGHT
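
The loop above calls `.str` on every column. In pandas, the `.str` accessor raises an AttributeError on non-string columns, so a dataframe that mixes text and numeric columns may need a guard. The following is a minimal sketch of one way to add it, assuming a small hypothetical dataframe (`sample_df` and its values are made up for illustration; the real texasCensusDf is not shown in this document).

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
sample_df = pd.DataFrame({
    'Label (Grouping)': ['Total\u00a0population', 'Male\u2020'],
    'Estimate': [100, 50],
})

for column in sample_df.columns:
    # Only string-like columns support the .str accessor; skip the rest.
    if pd.api.types.is_string_dtype(sample_df[column]):
        sample_df[column] = sample_df[column].str.replace(
            r'[^\x00-\x7F]+', '', regex=True
        )

print(sample_df)
```

The numeric `Estimate` column passes through untouched, while the text column loses its non-ASCII characters.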

Example 2 - Remove Non-ASCII Characters from Specific Columns

Targets and removes strange or corrupted characters only from selected columns that are known to contain bad encodings. This provides a more precise fix when only a subset of the dataset is affected.
#> RemoveWeirdCharacters Label (Grouping) Texas Total Margin of Error
AFLEFT 
texasCensusDf['Label (Grouping)'] = texasCensusDf['Label (Grouping)'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
texasCensusDf['Texas Total Margin of Error'] = texasCensusDf['Texas Total Margin of Error'].str.replace(r'[^\x00-\x7F]+', '', regex=True) AFRIGHT
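
The two examples differ only in which columns are cleaned, which mirrors the `columns` option. The sketch below wraps that choice in a single function. `remove_non_ascii` is a hypothetical helper written for this illustration, not part of the kit, and the sample dataframe is invented.

```python
import pandas as pd

NON_ASCII_PATTERN = r'[^\x00-\x7F]+'


def remove_non_ascii(df, columns=None):
    """Hypothetical helper: strip non-ASCII characters from text columns.

    If `columns` is None, every column is considered; otherwise only the
    listed columns are. Non-string columns are left unchanged.
    """
    targets = list(df.columns) if columns is None else list(columns)
    missing = [name for name in targets if name not in df.columns]
    if missing:
        raise KeyError(f'Columns not found: {missing}')

    result = df.copy()  # leave the caller's dataframe untouched
    for name in targets:
        if pd.api.types.is_string_dtype(result[name]):
            result[name] = result[name].str.replace(
                NON_ASCII_PATTERN, '', regex=True
            )
    return result


# Hypothetical sample data, invented for illustration only.
sample_df = pd.DataFrame({
    'Label (Grouping)': ['Total\u00a0population'],
    'Texas Total Margin of Error': ['\u00b1500'],
    'Notes': ['caf\u00e9'],
})

cleaned = remove_non_ascii(
    sample_df, ['Label (Grouping)', 'Texas Total Margin of Error']
)
print(cleaned)
```

Because only two columns are named, the `Notes` column keeps its accented character. Calling `remove_non_ascii(sample_df)` with no column list would clean all three.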