In machine learning, an unbalanced dataset is a dataset in which the classes contain unequal numbers of samples. For example, if a dataset has two classes, one containing 95% of the samples and the other only 5%, the dataset is unbalanced.
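As a quick illustration of the 95%/5% example above, the class proportions can be computed directly from the labels. This is a minimal sketch; the function name `class_proportions` and the label values 0 and 1 are assumptions made for illustration.

```python
from collections import Counter

def class_proportions(labels):
    """Return a dict mapping each class label to its share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels: 95 majority samples (0) and 5 minority samples (1).
labels = [0] * 95 + [1] * 5
proportions = class_proportions(labels)

# Ratio of the largest class share to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = max(proportions.values()) / min(proportions.values())
print(proportions, imbalance_ratio)
```

An imbalance ratio close to 1 indicates a roughly balanced dataset; here the majority class is about 19 times larger than the minority class.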
A balanced dataset, on the other hand, is a dataset with approximately equal numbers of samples in each class. Balanced datasets are desirable because they help keep the machine learning model from becoming biased toward the majority class.
Unbalanced datasets can be addressed using a variety of techniques, such as resampling, modifying cost functions, and using different algorithms. Resampling is the process of changing the number of samples in a dataset by oversampling a minority class or undersampling a majority class.
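To make "modifying cost functions" concrete, one option is to weight each sample's loss by a per-class weight so that errors on the minority class count for more. The sketch below uses the common heuristic weight n_samples / (n_classes * class_count); the function names, the binary labels 0 and 1, and the constant predictions are assumptions made for illustration, not a prescribed method.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

def weighted_log_loss(labels, predicted_probs, weights):
    """Mean binary cross-entropy with per-class weights.

    predicted_probs[i] is the predicted probability that labels[i] == 1.
    """
    total = 0.0
    for y, p in zip(labels, predicted_probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
    return total / len(labels)

# Hypothetical data: 95 majority samples (0), 5 minority samples (1),
# and a model that predicts probability 0.5 for every sample.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
loss = weighted_log_loss(labels, [0.5] * len(labels), weights)
print(weights, loss)
```

With these weights, the 5 minority samples contribute as much to the total loss as the 95 majority samples, so the model can no longer lower its loss substantially by ignoring the minority class.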
Oversampling is the process of increasing the number of minority samples, while undersampling is the process of decreasing the number of majority samples.
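The simplest way to do either is to pick samples at random. The following is a minimal sketch using only the standard library; the function names, the `target_count` parameter, and the toy data are hypothetical and chosen for illustration.

```python
import random
from collections import Counter

def random_oversample(samples, labels, minority_label, target_count, seed=0):
    """Duplicate randomly chosen minority samples until the class has target_count samples."""
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    n_extra = max(0, target_count - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]  # sampling with replacement
    return list(samples) + extra, list(labels) + [minority_label] * n_extra

def random_undersample(samples, labels, majority_label, target_count, seed=0):
    """Keep only target_count randomly chosen majority samples; keep all other samples."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    keep = set(rng.sample(majority_idx, min(target_count, len(majority_idx))))
    kept = [i for i, y in enumerate(labels) if y != majority_label or i in keep]
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Hypothetical data: sample values 0..99, the last five belong to the minority class.
samples = list(range(100))
labels = [0] * 95 + [1] * 5

over_samples, over_labels = random_oversample(samples, labels, minority_label=1, target_count=95)
under_samples, under_labels = random_undersample(samples, labels, majority_label=0, target_count=5)
print(Counter(over_labels), Counter(under_labels))
```

Oversampling keeps all of the original information but repeats minority samples exactly, while undersampling discards majority samples; which trade-off is acceptable depends on how much data is available.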
There are a variety of oversampling and undersampling techniques, including:
Random oversampling: This technique randomly duplicates minority-class samples until the desired class balance is reached.
Random undersampling: This technique randomly removes samples from the majority class until the desired class balance is reached.
SMOTE (Synthetic Minority Oversampling Technique): SMOTE creates a synthetic sample by selecting a minority-class sample and one of its nearest minority-class neighbors, then generating a new sample by interpolating between them, that is, as a linear combination of the two.
Tomek Links: Tomek links are pairs of samples from different classes that are each other's nearest neighbors. Undersampling with Tomek links removes the majority-class sample from each such pair.
ADASYN (Adaptive Synthetic Sampling): ADASYN generates synthetic minority-class samples adaptively, based on the local density of majority-class samples: minority samples with more majority-class neighbors, which are harder to learn, receive more synthetic samples.
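Two of the techniques listed above, SMOTE-style interpolation and Tomek link detection, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a reference implementation: samples are tuples of numbers, distance is Euclidean, and the function names, parameters, and toy points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic points on line segments between a minority sample and a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbors of the base sample (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

def tomek_links(samples, labels):
    """Return index pairs (i, j), i < j, of mutual nearest neighbors with different labels."""
    def nearest(i):
        candidates = (j for j in range(len(samples)) if j != i)
        return min(candidates, key=lambda j: math.dist(samples[i], samples[j]))
    nn = [nearest(i) for i in range(len(samples))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Hypothetical minority points at the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_synthetic=6, k=2)

# Hypothetical labelled points: index 1 is the only minority sample (label 1).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (9.0, 9.0)]
point_labels = [0, 1, 0, 0, 0]
links = tomek_links(points, point_labels)

# Undersampling step: drop the majority-class member (label 0) of each link.
to_remove = {i if point_labels[i] == 0 else j for i, j in links}
print(new_points, links, to_remove)
```

In the SMOTE sketch, every synthetic point lies between two existing minority samples, so it stays inside the region the minority class already occupies. In the Tomek link example, only the first two points form a link, so only the majority-class point of that pair is marked for removal.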
In classification, oversampling and undersampling are used to balance the dataset before training a model. The goal is to keep the model from being biased toward the majority class, which can lead to poor performance on the minority class.