Machine Learning - Label Encoding
This article explains label encoding in machine learning.
Label Encoding #
Label encoding converts categorical features into numerical values. For example, product categories such as 'TV', 'fridge', 'microwave', 'computer', 'fan', and 'mixer' are converted into numerical codes such as TV: 1, fridge: 2, and so on. Note that string codes such as '01' and '02' should also be converted to plain numeric values, dropping any leading zeros.
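Conceptually, label encoding is nothing more than a mapping from each distinct category to an integer. Before turning to scikit-learn, here is a minimal sketch using a plain dictionary (the sorted order used here is an assumption for determinism, not part of any library API):

```python
# Minimal sketch: label encoding as a category-to-integer mapping.
items = ['TV', 'fridge', 'microwave', 'computer',
         'fan', 'fan', 'mixer', 'mixer']

# Assign each distinct category an integer code, sorting for a stable order.
mapping = {category: code for code, category in enumerate(sorted(set(items)))}
encoded = [mapping[item] for item in items]

print(mapping)
print(encoded)
```

Because the categories are sorted before codes are assigned, this sketch happens to produce the same codes as scikit-learn's `LabelEncoder`, which also sorts its classes.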
Scikit-learn Implementation #
Scikit-learn implements label encoding through the `LabelEncoder` class. You create a `LabelEncoder` object and perform label encoding by calling `fit()` and `transform()`.
```python
from sklearn.preprocessing import LabelEncoder

items = ['TV', 'fridge', 'microwave', 'computer',
         'fan', 'fan', 'mixer', 'mixer']

# Create a LabelEncoder object and
# conduct label encoding with fit() and transform()
encoder = LabelEncoder()
encoder.fit(items)
labels = encoder.transform(items)
print('Encoded label:', labels)
```

```
Encoded label: [0 3 4 1 2 2 5 5]
```
This results in `Encoded label: [0 3 4 1 2 2 5 5]`, where 'TV' is 0, 'fridge' is 3, 'microwave' is 4, 'computer' is 1, 'fan' is 2, and 'mixer' is 5. If it's unclear which string values correspond to which numerical codes, you can check the `classes_` attribute of the `LabelEncoder` object.
```python
print('Encoding Class:', encoder.classes_)
```

```
Encoding Class: ['TV' 'computer' 'fan' 'fridge' 'microwave' 'mixer']
```
The `classes_` attribute holds the original values in order, corresponding to the encoded numbers starting from 0. Therefore, it can be determined that 'TV' is encoded as 0, 'computer' as 1, 'fan' as 2, 'fridge' as 3, 'microwave' as 4, and 'mixer' as 5. For decoding, `inverse_transform()` can be used to revert the encoded values back to the original strings.
```python
print('Decoding original:', encoder.inverse_transform([4, 2, 5, 0, 3, 3, 2, 2]))
```

```
Decoding original: ['microwave' 'fan' 'mixer' 'TV' 'fridge' 'fridge' 'fan' 'fan']
```
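The fit-and-transform steps above can also be combined: `LabelEncoder` provides `fit_transform()`, which fits the encoder and encodes the data in a single call. A short sketch, including the decoding round trip:

```python
from sklearn.preprocessing import LabelEncoder

items = ['TV', 'fridge', 'microwave', 'computer',
         'fan', 'fan', 'mixer', 'mixer']

# fit_transform() combines fit() and transform() in one call.
encoder = LabelEncoder()
labels = encoder.fit_transform(items)
print('Encoded label:', labels)

# Round trip: inverse_transform() recovers the original strings.
decoded = encoder.inverse_transform(labels)
print('Decoding original:', decoded)
```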
When product data consists of two attributes, product category and price, applying label encoding to the product category can transform it as follows.
**Original Data**

| Product Category | Price |
|---|---|
| TV | 1,000,000 |
| fridge | 1,500,000 |
| microwave | 200,000 |
| computer | 800,000 |
| fan | 100,000 |
| fan | 100,000 |
| mixer | 50,000 |
| mixer | 50,000 |
**Data with Encoded Product Category**

| Product Category | Price |
|---|---|
| 0 | 1,000,000 |
| 3 | 1,500,000 |
| 4 | 200,000 |
| 1 | 800,000 |
| 2 | 100,000 |
| 2 | 100,000 |
| 5 | 50,000 |
| 5 | 50,000 |
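In practice, tabular data like this is usually held in a pandas DataFrame, and the encoder is applied to a single column. A sketch under that assumption (the DataFrame and column names here are illustrative, not from the original):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative product data matching the tables above.
df = pd.DataFrame({
    'product_category': ['TV', 'fridge', 'microwave', 'computer',
                         'fan', 'fan', 'mixer', 'mixer'],
    'price': [1_000_000, 1_500_000, 200_000, 800_000,
              100_000, 100_000, 50_000, 50_000],
})

# Encode only the categorical column; the price column is untouched.
encoder = LabelEncoder()
df['product_category'] = encoder.fit_transform(df['product_category'])
print(df)
```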
Label encoding converts string values into numeric category codes, which simplifies handling categorical data. However, because these codes carry an inherent order and magnitude, some machine learning algorithms may treat them as meaningful quantities and produce distorted predictions.
For instance, the numeric codes might imply an unintended order or importance among categories (e.g., 'computer' encoded as 1 might be treated as "less" than 'fan' encoded as 2). Therefore, label encoding is not recommended for models such as linear regression that interpret numeric magnitudes.
It is, however, suitable for tree-based algorithms, which do not rely on the order of numeric values. One-hot encoding addresses these issues by representing categories without any ordinal implication.
We will explore One-Hot Encoding in the next section.