KitDocumentation

ColumnCorrelation

Calculates the correlation between numerical columns in a dataframe, either across all numerical columns or across a specified subset. Correlation measures how closely the values in two columns move together. A high correlation (a coefficient close to 1 or -1) means a change in one column is accompanied by a predictable change in the other. A low correlation (a coefficient close to 0) means changes in one column do not predict changes in the other.

Additionally, a where clause can be used to restrict the calculation to rows that meet a given condition.
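
To make the difference between high and low correlation concrete, here is a minimal sketch using pandas directly. The dataframe and its column names (base, doubled, inverse, unrelated) are hypothetical and chosen only for illustration; the default method of DataFrame.corr() is Pearson correlation.

```python
import pandas as pd

# Hypothetical data, made up for illustration only.
df = pd.DataFrame({
    "base":      [1, 2, 3, 4, 5],
    "doubled":   [2, 4, 6, 8, 10],  # rises exactly in step with "base"
    "inverse":   [5, 4, 3, 2, 1],   # falls exactly as "base" rises
    "unrelated": [1, 3, 2, 3, 1],   # no linear relationship with "base"
})

# Pairwise Pearson correlation between every pair of columns.
corr = df.corr()

print(corr.loc["base", "doubled"])    # close to 1: strong positive correlation
print(corr.loc["base", "inverse"])    # close to -1: strong negative correlation
print(corr.loc["base", "unrelated"])  # close to 0: no linear correlation
```

A coefficient near -1 is still a strong correlation; only the direction of the relationship is reversed.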

Options

columns: Specifies the columns to calculate the correlation for. If not specified, the correlation is computed for all numeric columns in the dataframe.
where: Specifies a condition that rows must meet to be included in the calculation

Examples

Example 1 - Correlation of All Numeric Columns

Dataframes often contain many numeric columns, and we may want to see the correlation between all of them. Using ColumnCorrelation without any options computes the correlation between all numeric columns in a dataframe.
#> ColumnCorrelation
appleStockDfNumericOnly = appleStockDf.select_dtypes(include=['number'])
appleStockDfCorrelation = appleStockDfNumericOnly.corr()
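
The generated code can be tried on a small stand-in dataframe. The sketch below assumes a hypothetical appleStockDf with a text Date column and made-up prices; it is meant only to show that select_dtypes drops the non-numeric column before the correlation is computed.

```python
import pandas as pd

# Hypothetical stand-in for appleStockDf; all values are made up.
appleStockDf = pd.DataFrame({
    "Date":   ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"],
    "Open":   [74.0, 74.3, 73.4, 75.0],
    "High":   [75.2, 75.1, 75.0, 75.2],
    "Low":    [73.8, 74.1, 73.2, 74.4],
    "Volume": [135000, 146000, 118000, 108000],
})

# Keep only numeric columns; the text "Date" column is excluded.
appleStockDfNumericOnly = appleStockDf.select_dtypes(include=["number"])
appleStockDfCorrelation = appleStockDfNumericOnly.corr()

# The result is a square matrix labelled by the numeric column names.
print(list(appleStockDfCorrelation.columns))
print(appleStockDfCorrelation.shape)
```

With this data the result has one row and one column for each of Open, High, Low, and Volume, and Date does not appear.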

Example 2 - Correlation Between Specific Columns

Rather than getting the correlation between all columns in the dataframe, sometimes we only want to see how correlated specific columns are. This can be done by passing in the name of each column to include in the correlation.
#> ColumnCorrelation --columns High Low Volume
appleStockDfCorrelation = appleStockDf[['High', 'Low', 'Volume']].corr()
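
As a rough illustration of how to read the result, the sketch below runs the same column selection on a hypothetical appleStockDf with made-up values. The result is a symmetric matrix with one row and one column per selected name, and a single coefficient can be looked up by label or computed directly with Series.corr.

```python
import pandas as pd

# Hypothetical stand-in for appleStockDf; all values are made up.
appleStockDf = pd.DataFrame({
    "Open":   [74.0, 74.3, 73.4, 75.0, 74.6],
    "High":   [75.2, 75.1, 75.0, 75.9, 76.1],
    "Low":    [73.8, 74.1, 73.2, 74.4, 74.3],
    "Volume": [135000, 146000, 118000, 108000, 132000],
})

selectedColumns = ["High", "Low", "Volume"]
appleStockDfCorrelation = appleStockDf[selectedColumns].corr()

# Look up one pair in the matrix by row and column label.
highLowFromMatrix = appleStockDfCorrelation.loc["High", "Low"]

# The same coefficient computed for just that pair of columns.
highLowDirect = appleStockDf["High"].corr(appleStockDf["Low"])

print(appleStockDfCorrelation)
print(highLowFromMatrix, highLowDirect)
```

Because the matrix is symmetric, the entry at (High, Low) equals the entry at (Low, High).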

Example 3 - Correlation Only Between Desired Rows

Columns are not always correlated across every row; sometimes only specific sections of the data are correlated. Specifying a where clause limits the calculation to rows that meet the condition, so the result shows the correlation within those rows only.
#> ColumnCorrelation --columns High Low Volume --where High > 1.10 * Open
appleStockDfCorrelation = appleStockDf[['High', 'Low', 'Volume']][appleStockDf['High'] > 1.10 * appleStockDf['Open']].corr()
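
The row filter can be checked on a small example. The sketch below uses a hypothetical appleStockDf with made-up values in which three of six rows satisfy High > 1.10 * Open. It also shows an equivalent form using .loc, which is offered here as an alternative spelling and not as what the tool generates.

```python
import pandas as pd

# Hypothetical stand-in for appleStockDf; all values are made up.
appleStockDf = pd.DataFrame({
    "Open":   [100.0, 100.0, 50.0, 80.0, 60.0, 90.0],
    "High":   [112.0, 105.0, 56.0, 81.0, 70.0, 95.0],
    "Low":    [99.0, 98.0, 49.0, 79.0, 58.0, 88.0],
    "Volume": [900, 400, 700, 300, 800, 350],
})

# Boolean mask: True for rows where High exceeds Open by more than 10%.
rowFilter = appleStockDf["High"] > 1.10 * appleStockDf["Open"]

# Form shown in the example above: select columns, then filter rows.
chainedCorrelation = appleStockDf[["High", "Low", "Volume"]][rowFilter].corr()

# Equivalent single-step form: filter rows and select columns together.
locCorrelation = appleStockDf.loc[rowFilter, ["High", "Low", "Volume"]].corr()

print(int(rowFilter.sum()))  # number of rows that passed the filter
print(locCorrelation)
```

A restrictive condition can leave very few rows, and a correlation computed from a handful of rows is not a reliable estimate, so it is worth checking how many rows passed the filter.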