Usually, when we try to select the best features of a dataset, we try different spells (algorithms), compare the results, and pick the best one. So why not mix them all up and combine their power in a single spell? This is where MixerSelector comes in.
Using Kydavra MixerSelector.
To install Kydavra, we just have to type the following in the command line:
pip install kydavra
Next, we need to import the model, create the selector, and apply it to our data:
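from kydavra import PointBiserialCorrSelector
from kydavra import ANOVASelector
from kydavra import MixerSelector
from kydavra import MUSESelector
# df is a pandas DataFrame that holds the features and the target column
mixer = MixerSelector([PointBiserialCorrSelector(), ANOVASelector()], strategy='intersection')
cols = mixer.select(df, 'diagnosis')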
The MixerSelector() accepts 2 parameters:
- selectors: list, list of initialized selectors.
- strategy: str. If set to ‘union’, the selector returns the union of the columns chosen by the individual selectors. If set to ‘intersection’, it returns their intersection.
The select() function takes as parameters the pandas DataFrame and the name of the target column.
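To make the two strategies concrete, here is a minimal sketch in plain Python of how the column lists returned by several selectors could be combined. It does not use Kydavra's internals: `combine_columns` and the sample column lists are hypothetical, made up for illustration, and the real MixerSelector may order the resulting columns differently.

```python
def combine_columns(selections, strategy="union"):
    """Combine lists of selected column names (hypothetical helper, not Kydavra code)."""
    if not selections:
        return []
    if strategy == "union":
        # Keep every column chosen by at least one selector.
        combined = set().union(*selections)
    elif strategy == "intersection":
        # Keep only the columns chosen by every selector.
        combined = set(selections[0]).intersection(*selections[1:])
    else:
        raise ValueError("strategy must be 'union' or 'intersection'")
    # Sort so the result is deterministic.
    return sorted(combined)


# Made-up outputs of two selectors, for illustration only.
selector_a = ["radius_mean", "area_worst", "texture_mean"]
selector_b = ["radius_mean", "area_worst", "concavity_mean"]

union_cols = combine_columns([selector_a, selector_b], strategy="union")
intersection_cols = combine_columns([selector_a, selector_b], strategy="intersection")
print(union_cols)
print(intersection_cols)
```

The union can only grow as more selectors are added, while the intersection can only shrink, which is why the two strategies end up with such different numbers of features.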
Let’s see an example:
I chose the Breast cancer dataset where we have to predict the ‘diagnosis’ column. The dataset has the following features.
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
The simple LogisticRegression gives a cross_val_score mean of 0.952569476.
When trying to mix 2 different spells with the ‘intersection’ strategy, we get:
'concave points_mean', 'area_worst', 'perimeter_mean','radius_mean']
and the mean accuracy became 0.9455364073901567. That is slightly lower, but the number of features dropped from 30 to 8, which is amazing.
The next strategy is the ‘union’, with the same data and spells.
'radius_mean', 'radius_worst', 'concave points_mean', 'compactness_se',
'radius_se', 'concave points_se', 'concavity_mean', 'perimeter_se',
'texture_worst', 'compactness_worst', 'concavity_worst', 'texture_mean',
'concave points_worst', 'fractal_dimension_worst', 'smoothness_worst',
'area_se', 'area_mean',
So now we got 25 features out of 30 with an accuracy of 0.952569476, the same accuracy LogisticRegression reached on all 30 features. That makes it a great selection.
In general, it all depends on the purpose of your model. If you need better accuracy, set the strategy parameter to ‘union’. If you need a fast algorithm, ‘intersection’ is what you need.
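The trade-off can be read straight from the numbers reported above. The short sketch below only does the arithmetic on those figures. The `summarize` helper and the dictionary layout are illustrative choices, not part of Kydavra.

```python
TOTAL_FEATURES = 30
BASELINE_ACCURACY = 0.952569476  # LogisticRegression on all 30 features

# Feature counts and mean accuracies reported in this article.
results = {
    "intersection": {"features": 8, "accuracy": 0.9455364073901567},
    "union": {"features": 25, "accuracy": 0.952569476},
}


def summarize(features, accuracy):
    """Return the share of features removed and the change in accuracy vs. the baseline."""
    removed_share = 1 - features / TOTAL_FEATURES
    accuracy_change = accuracy - BASELINE_ACCURACY
    return removed_share, accuracy_change


summary = {name: summarize(**values) for name, values in results.items()}

for name, (removed_share, accuracy_change) in summary.items():
    print(f"{name}: removed {removed_share:.1%} of features, "
          f"accuracy change {accuracy_change:+.4f}")
```

With these figures, ‘intersection’ removes about 73% of the columns at a cost of roughly 0.007 in mean accuracy, while ‘union’ removes about 17% of the columns with no change in accuracy.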
Thanks for reading!