# Using Crucio SMOTENC for unbalanced datasets

## Unbalanced data

In simple terms, an unbalanced dataset is one in which the target variable has more observations in one specific class than the others. An example can be a dataset for transactions: usually, 99% of this set consists of simple, not fraudulent transactions, and only 1% of them are fraud.

## How does SMOTENC work?

SMOTENC (Synthetic Minority Over-sampling Technique for Nominal and Continuous) is very similar to classic SMOTE but, unlike SMOTE, SMOTE-NC can work with both numerical and categorical features.

SMOTE-NC slightly changes the way a new sample is generated by performing something specifically for the categorical features. The categories of a newly generated sample are decided by picking the most frequent category of the nearest neighbors present during the generation, and while calculating Euclidian distance when searching for k-nearest-neighbors, the algorithm just adds the square of the median of standard deviations of every column that represent numerical values.

## Using Crucio SMOTENC

In case you didn’t install ** Crucio** yet, then use in the terminal the following:

Now we can import the algorithm and create the ** SMOTENC** balancer.

smotenc = SMOTENC()

balanced_df = smotenc.balance(df, 'target')

The ** SMOTENC**() initialization constructor can contain the following arguments:

(int > 0, default = 5) : The number of nearest neighbors from which SMOTE will sample data points.**k**(int, default = 45): The number used to initialize the random number generator.**seed**(list, default = None): The list of binary columns from the data set, so sampled data be approximated to the nearest binary value.**binary_columns**

The ** balance()** method takes as parameters the panda’s data frame and the name of the target column.

# Example:

So I chose a data set where we have to predict the type of a ** Pokemon** (

**or not), the Legendary class constitutes**

**Legendary****out of all dataset, so it is an imbalanced dataset.**

**8%**We recommend you, before applying any module from ** Crucio**, to first split your data into train and test data sets and balance only the train set. In such a way you will test the performance of the model only on natural data.

The basic ** Random Forest** algorithm gives an accuracy of approximately

**by training on imbalanced data, so now it’s time to test out SMOTE algorithm.**

**97%**new_df = smotenc.balance(df,'Legendary')

After balancing data, we increased accuracy from ** 97% **to

**100%.**# Conclusion:

SMOTENC is a very good technique to balance datasets when you have categorical columns and you don’t wanna lose their predicting power. I encourage you to test it and many other balancing methods from Crucio such as SMOTETOMEK, SCUT, SLS, ADASYN, and ICOTE.

**Thank you for reading!**

Follow Sigmoid on Facebook, Instagram, and LinkedIn:

https://www.facebook.com/sigmoidAI

https://www.instagram.com/sigmo.ai/

https://www.linkedin.com/company/sigmoid/