RemoveWeirdCharacters
Not all datasets are clean from the start. One of the ways data can be dirty is its column names. We created the RemoveWeirdCharacters kit to remove non-standard characters from column names. The kit can remove weird characters from either the specified columns or the entire dataframe.
Options
columns: Specifies the columns to remove weird characters from. If omitted, weird characters are removed from the entire dataframe.
Examples
Example 1 - Remove Non-ASCII Characters from Entire DataFrame
Cleans the full dataset by stripping out any characters outside the ASCII range. This is especially helpful when working with data where encoding issues may introduce stray symbols or formatting artifacts across all columns.
#> RemoveWeirdCharacters
AFLEFT
# Strip non-ASCII characters from every column (assumes each column holds strings)
for column in texasCensusDf.columns:
    texasCensusDf[column] = texasCensusDf[column].str.replace(r'[^\x00-\x7F]+', '', regex=True) AFRIGHT
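To make the pattern concrete, the sketch below applies the same regular expression to a few plain strings using Python's standard `re` module. The sample values are made up for illustration and are not taken from the Texas census data.

```python
import re

# Same pattern as in the example above: one or more characters outside the ASCII range
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

# Hypothetical sample values, chosen only to illustrate the behavior
samples = ['Total:\u00a0households', 'Margin of Error \u00b15', 'caf\u00e9']

# Each run of non-ASCII characters is replaced with an empty string
cleaned = [NON_ASCII.sub('', s) for s in samples]
print(cleaned)
```

Matched characters are deleted rather than replaced, so a non-breaking space between two words leaves them joined, and legitimate accented letters are removed along with genuine encoding debris.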
Example 2 - Remove Non-ASCII Characters from Specific Columns
Targets and removes strange or corrupted characters only from selected columns that are known to contain bad encodings. This provides a more precise fix when only a subset of the dataset is affected.
#> RemoveWeirdCharacters Label (Grouping) Texas Total Margin of Error
AFLEFT
texasCensusDf['Label (Grouping)'] = texasCensusDf['Label (Grouping)'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
texasCensusDf['Texas Total Margin of Error'] = texasCensusDf['Texas Total Margin of Error'].str.replace(r'[^\x00-\x7F]+', '', regex=True) AFRIGHT
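The two examples differ only in which columns are cleaned, so the behavior of the columns option can be sketched as a single function. This is a minimal illustration, not the kit's actual implementation: the name `remove_weird_characters`, its signature, and the sample dataframe are all assumptions made for the example. Unlike the `.str.replace` calls above, this sketch skips non-string cells so that numeric columns pass through unchanged.

```python
import re

import pandas as pd

NON_ASCII_PATTERN = r'[^\x00-\x7F]+'


def remove_weird_characters(df, columns=None):
    """Hypothetical helper: return a copy of df with non-ASCII characters stripped."""
    result = df.copy()
    # With no columns given, fall back to every column in the dataframe
    targets = list(df.columns) if columns is None else list(columns)
    for column in targets:
        # Clean only string cells; numbers and missing values are left as they are
        result[column] = result[column].map(
            lambda value: re.sub(NON_ASCII_PATTERN, '', value)
            if isinstance(value, str)
            else value
        )
    return result


# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'Label (Grouping)': ['Total:\u00a0households', 'Under 18'],
    'Texas Total': [100, 200],
})

cleaned_all = remove_weird_characters(df)  # entire dataframe
cleaned_some = remove_weird_characters(df, columns=['Label (Grouping)'])  # selected columns
```

Returning a copy leaves the original dataframe untouched, which makes it easy to compare the values before and after cleaning.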