Upsampling And Downsampling: Correcting The Imbalances In Data

By mastering the art of upsampling and downsampling, we not only correct skewed
datasets but also pave the way for fairer, more accurate machine learning models.
This second article in the two-part series focuses on the uses of downsampling.

In the first part of this article series, which was published in the March 2024 edition of OSFY, we learnt how to manage imbalanced datasets by using upsampling techniques. Now, let us see how downsampling works.

Downsampling

Downsampling is used to decrease the size of a dataset by removing some of its instances. When the data is imbalanced, data points of the majority class are removed. This is often done to reduce computational complexity and training time, and to eliminate redundant or irrelevant instances from the dataset. However, when doing this, we need to ensure that important information is not lost and that the retained sample still represents the entire dataset.
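As a quick illustration of the idea, here is a minimal sketch using pandas (the column and label names are made up for this example): the majority class is simply sampled down to the size of the minority class.

import pandas as pd

# Hypothetical imbalanced dataset: 900 'majority' rows and 100 'minority' rows
df = pd.DataFrame({
    'feature': range(1000),
    'label': ['majority'] * 900 + ['minority'] * 100,
})

minority = df[df['label'] == 'minority']
majority = df[df['label'] == 'majority']

# Randomly keep only as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=42)

balanced = pd.concat([majority_down, minority])
print(balanced['label'].value_counts())   # both classes now have 100 rows

The methods below refine this basic idea, differing mainly in how they decide which majority-class records to drop or how to summarise them.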

Methods used in downsampling

Downsampling is typically used when the dataset is large, as it reduces the computational complexity and training time of the model. If the goal is to improve model efficiency or reduce the risk of overfitting, downsampling is the better option. A few methods used for downsampling are explained here, though this list is not exhaustive.

Resample

  • The resample method, which was used for upsampling in the first part, can be used for downsampling as well. The simplest way of downsampling the majority class is to randomly remove records from that category.
  • The logic is the same as for upsampling: instead of duplicating random records, we remove them.
  • Here is sample code:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.utils import resample
from matplotlib import pyplot
from numpy import where
import pandas as pd
import numpy as np

X, y = make_classification(n_classes=2, class_sep=2,
    weights=[0.1, 0.9], n_features=4, n_clusters_per_class=1,
    n_samples=1000, random_state=10)

# summarize class distribution
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 894, 0: 106})

# scatter plot of examples by class label
for label, _ in Counter(y).items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

# perform downsampling
df = pd.DataFrame(X, columns=['Column_A', 'Column_B', 'Column_C', 'Column_D'])
df['Y'] = y
over_samples = df[df["Y"] == 1]    # majority class
under_samples = df[df["Y"] == 0]   # minority class
print("Original Over Sample Shape", over_samples.shape)
# Original Over Sample Shape (894, 5)

# downsample the majority class to the size of the minority class
# (replace=False, because records are removed rather than duplicated)
df_res = resample(over_samples,
    replace=False,
    n_samples=len(under_samples),
    random_state=42)
print("Resampled Over Sample Shape", df_res.shape)
# Resampled Over Sample Shape (106, 5)

X_Res = np.append(df_res[['Column_A', 'Column_B', 'Column_C', 'Column_D']].to_numpy(),
    under_samples[['Column_A', 'Column_B', 'Column_C', 'Column_D']].to_numpy(), axis=0)
y_res = np.append(df_res['Y'].to_numpy(), under_samples['Y'].to_numpy())
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({1: 106, 0: 106})

# scatter plot of examples by class label
for label, _ in Counter(y_res).items():
    row_ix = where(y_res == label)[0]
    pyplot.scatter(X_Res[row_ix, 0], X_Res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Figure 1: Scatter plot of examples by class label before downsampling (the same data is used for all the downsampling methods; observe how it changes after each technique is applied)
Figure 2: Scatter plot of examples by class label after downsampling with resample

RandomUnderSampler

  • In this method, rows from the majority class are randomly selected and removed until the desired class balance is reached.
  • The logic is the same as that used for upsampling, except that the strategy now acts on the majority class instead of the minority class. (A ratio other than 1:1 can also be requested, as shown in the sketch after Figure 3.)
  • The source code is:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from matplotlib import pyplot
from numpy import where

X, y = make_classification(n_classes=2, class_sep=2,
    weights=[0.1, 0.9], n_features=4, n_clusters_per_class=1,
    n_samples=1000, random_state=10)

# summarize class distribution
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 894, 0: 106})

Now we will carry out downsampling:

rus = RandomUnderSampler(random_state=42)
X_Res, y_res = rus.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 106, 1: 106})

# scatter plot of examples by class label
for label, _ in Counter(y_res).items():
    row_ix = where(y_res == label)[0]
    pyplot.scatter(X_Res[row_ix, 0], X_Res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Figure 3: Scatter plot of examples by class label after downsampling with RandomUnderSampler
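RandomUnderSampler does not have to balance the classes completely. As a minimal sketch (continuing with the same X, y and Counter as above), a float passed to sampling_strategy sets the desired ratio of minority to majority samples after resampling:

from imblearn.under_sampling import RandomUnderSampler

# Keep the majority class at twice the size of the minority class
# (sampling_strategy is the minority/majority ratio after resampling)
rus_partial = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_part, y_part = rus_partial.fit_resample(X, y)
print('Partially downsampled shape %s' % Counter(y_part))
# With 106 minority samples, the majority class is reduced to 212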

Cluster centroid

  • The cluster centroid algorithm uses the K-Means method to replace the samples of the majority class.
  • This method handles imbalanced datasets by undersampling the majority class: groups of majority samples are replaced by the cluster centroids computed by a K-Means algorithm. The newly generated set is therefore made up of K-Means centroids instead of the original samples, so the majority class(es) are transformed while the minority class remains unchanged.
  • The logic is (a sketch of this idea follows Figure 4):
    • Set the number of clusters to the number of samples the majority class should be reduced to (by default, the size of the minority class).
    • Run K-Means on the majority class: points are assigned to their nearest centroid by Euclidean distance and the centroids are recomputed, repeating until the clusters stabilise.
    • Use the resulting cluster centroids, instead of the original samples, to represent the majority class.
    • Combine these centroids with the untouched minority class to obtain the desired class proportion.
  • The sample code is:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot
from numpy import where

X, y = make_classification(n_classes=2, class_sep=2,
    weights=[0.1, 0.9], n_features=4, n_clusters_per_class=1,
    n_samples=1000, random_state=10)

# summarize class distribution
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 894, 0: 106})

Now we will carry out downsampling:

cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0), random_state=42
)
X_Res, y_res = cc.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 106, 1: 106})

# scatter plot of examples by class label
for label, _ in Counter(y_res).items():
    row_ix = where(y_res == label)[0]
    pyplot.scatter(X_Res[row_ix, 0], X_Res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Figure 4: Scatter plot of examples by class label after downsampling with cluster centroids
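To make the 'centroids instead of original samples' idea concrete, here is a minimal sketch that builds the reduced majority class directly with scikit-learn's KMeans, reusing the X and y generated above. It illustrates the principle rather than imbalanced-learn's exact implementation:

import numpy as np
from sklearn.cluster import KMeans

X_minority = X[y == 0]
X_majority = X[y == 1]

# Fit one cluster per desired majority sample; the centroids become
# the new, synthetic majority-class points
kmeans = KMeans(n_clusters=len(X_minority), n_init=10, random_state=0)
kmeans.fit(X_majority)

X_balanced = np.vstack([kmeans.cluster_centers_, X_minority])
y_balanced = np.array([1] * len(X_minority) + [0] * len(X_minority))
print(X_balanced.shape, np.bincount(y_balanced))   # (212, 4) [106 106]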

NearMiss

  • In this method, the majority class instances that are near the minority class instances are removed.
  • There are three versions of this algorithm, and each version has a different criterion for selecting which majority class instances to keep and which ones to discard.
    • NearMiss-1: The data is balanced by selecting the samples from the majority class for which the average distance of the k nearest samples of the minority class is the smallest.
    • NearMiss-2: The data is balanced by selecting the samples from the majority class for which the average distance to the farthest samples of the minority class is the smallest.
    • NearMiss-3: This is a 2-step algorithm: first, the m nearest-neighbours are kept for each minority sample; then, the majority samples for which the average distance to the k nearest neighbours is the largest are selected.
  • The logic is (a simplified sketch of NearMiss-1 follows Figure 5):
    • This algorithm has two important parameters: the version and the number of neighbours (k).
    • It calculates the distances between all the points in the larger class and the points in the smaller class.
    • Elimination is done based on the version passed as a parameter.
      • NearMiss-1: This version keeps instances from the majority class that are closest to the instances in the minority class.
      • NearMiss-2: This version removes instances from the majority class that are farthest from the minority class.
      • NearMiss-3: This version is somewhat opposite to NearMiss-1; it keeps instances from the majority class that are farthest from the instances in the minority class.
  • The sample code is:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss
from matplotlib import pyplot
from numpy import where

X, y = make_classification(n_classes=2, class_sep=2,
    weights=[0.1, 0.9], n_features=4, n_clusters_per_class=1,
    n_samples=1000, random_state=10)

# summarise class distribution
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 894, 0: 106})

We will now perform downsampling:

nm = NearMiss()
X_Res, y_res = nm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({0: 106, 1: 106})

# scatter plot of examples by class label
for label, _ in Counter(y_res).items():
    row_ix = where(y_res == label)[0]
    pyplot.scatter(X_Res[row_ix, 0], X_Res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Figure 5: Scatter plot of examples by class label after downsampling with NearMiss
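To see the distance-based selection at work, here is a minimal sketch of the NearMiss-1 idea, reusing X and y from above. It is a simplified illustration rather than imbalanced-learn's implementation: for every majority sample we compute the average distance to its k nearest minority samples and keep the majority samples with the smallest averages.

import numpy as np
from sklearn.neighbors import NearestNeighbors

X_min, X_maj = X[y == 0], X[y == 1]
k = 3

# For each majority sample, find its k nearest minority samples
nn = NearestNeighbors(n_neighbors=k).fit(X_min)
dist, _ = nn.kneighbors(X_maj)       # distances, shape (n_majority, k)
avg_dist = dist.mean(axis=1)

# Keep the majority samples closest (on average) to the minority class
keep = np.argsort(avg_dist)[:len(X_min)]
X_maj_kept = X_maj[keep]
print(X_maj_kept.shape)              # (106, 4)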

Tomek links (T-links)

  • Here, pairs of data points from different classes that are nearest neighbours of each other are identified; the objective is to drop the member of each pair that belongs to the majority class and thereby reduce the count of the dominating label.
  • This also widens the border space between the two labels and thus improves classification performance.
  • Here's the logic (a short sketch of the pairing check follows Figure 6).
    • Identifying Tomek links:
      • A Tomek link is a pair of instances (x, y) where x is from the majority class and y is from the minority class, and the two are each other's nearest neighbours, i.e., no other instance lies closer to either of them.
    • Removing instances in Tomek links:
      • Once Tomek links are identified, the instances from the majority class in these links are removed.
      • This helps in creating a cleaner and more separable boundary between the classes.
    • Resulting effect:
      • The removal of instances involved in Tomek links can enhance the performance of a machine learning model by reducing noise in the dataset and improving the separation between classes.
  • The source code is:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks
from matplotlib import pyplot
from numpy import where

X, y = make_classification(n_classes=2, class_sep=2,
    weights=[0.1, 0.9], n_features=4, n_clusters_per_class=1,
    n_samples=1000, random_state=10)

# summarize class distribution
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 894, 0: 106})

Next, we will carry out downsampling:

tl = TomekLinks()
X_Res, y_res = tl.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({1: 890, 0: 106})

# scatter plot of examples by class label
for label, _ in Counter(y_res).items():
    row_ix = where(y_res == label)[0]
    pyplot.scatter(X_Res[row_ix, 0], X_Res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Figure 6: Tomek representation (Image credits: https://www.analyticsvidhya.com/)
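The pairing rule itself is easy to express in code. Here is a minimal sketch that finds Tomek links by checking whether a majority sample and a minority sample are each other's nearest neighbour, reusing X and y from above (an illustration of the definition, not imbalanced-learn's implementation):

from sklearn.neighbors import NearestNeighbors

# Nearest neighbour of every sample (index 0 is the sample itself)
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)
nearest = idx[:, 1]

# (i, j) is a Tomek link if i and j belong to different classes
# and each is the other's nearest neighbour
tomek_pairs = [(i, j) for i, j in enumerate(nearest)
               if y[i] != y[j] and nearest[j] == i and i < j]
print('Tomek links found:', len(tomek_pairs))

Removing the majority-class member of each such pair is the clean-up that TomekLinks performs, which is why only a handful of majority samples (894 down to 890 in the run above) are removed rather than hundreds.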

AllKNN (All K-Nearest-Neighbours)

  • This method works by removing some of the instances from the majority class that are considered to be ‘misclassified’ or near the decision boundary.
  • The goal is to create a more balanced dataset by eliminating instances from the majority class that may be causing confusion for the classifier.
  • The logic is (a comparison of the three related methods is sketched after Figure 7):
    • AllKNN is a modification of RepeatedEditedNearestNeighbours (RENN) that expands the size of the neighbourhood being considered each time it is run.
    • The RENN algorithm runs the EditedNearestNeighbours (ENN) algorithm a specified number of times, removing more outlier points with each rerun.
    • ENN removes misclassified instances near the decision boundary from the majority class.
  • The sample code is:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import AllKNN
from matplotlib import pyplot
from numpy import where

X, y = make_classification(n_classes=2, class_sep=2,
    weights=[0.1, 0.9], n_features=4, n_clusters_per_class=1,
    n_samples=1000, random_state=10)

# summarise class distribution
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 894, 0: 106})

# perform downsampling
allknn = AllKNN()
X_Res, y_res = allknn.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({1: 872, 0: 106})

# scatter plot of examples by class label
for label, _ in Counter(y_res).items():
    row_ix = where(y_res == label)[0]
    pyplot.scatter(X_Res[row_ix, 0], X_Res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Figure 7: Scatter plot of examples by class label after downsampling with Tomek links
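Since AllKNN builds on ENN and RENN, it can be instructive to run all three on the same data and compare how many majority samples each removes. A minimal sketch, reusing X, y and Counter from above:

from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours,
                                     AllKNN)

for sampler in (EditedNearestNeighbours(),
                RepeatedEditedNearestNeighbours(),
                AllKNN()):
    _, y_cleaned = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_cleaned))
# Each method prunes only the majority samples it considers noisy,
# so the counts stay far closer to the original than with the
# fully balancing methods above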

Pros and cons of downsampling

Pros

  • Helps in reducing the model's bias towards the over-populated class.
  • Reduces noise and improves accuracy.
  • Saves training time and reduces the computational power needed.

Cons

  • Can result in information loss.
  • The smaller dataset that remains after downsampling may no longer represent the full population, which can lead to overfitting or biased results.
Figure 8: Scatter plot of examples by class label after downsampling with AllKNN

To sum up, upsampling increases the minority class while downsampling does the reverse: it decreases the majority class. These equalisation processes prevent the machine learning model from leaning towards the majority class. Remember that upsampling and downsampling need to preserve the distribution of the data, and the boundary between the target classes must remain the same. Also, upsampling and downsampling are applied only to the training dataset; no changes are made to the validation and test data.
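To make that last point concrete, here is a minimal sketch of the recommended workflow: split first, then resample only the training portion. RandomUnderSampler is used here, but the same pattern applies to any of the methods above.

from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# Split first, keeping the original class ratio in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Resample the training data only; the test set stays untouched
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)

print('Train (resampled):', Counter(y_train_res))
print('Test (unchanged):', Counter(y_test))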
