Machine Learning - Label Encoding
This article explains label encoding in machine learning.
Label Encoding #
Label encoding converts categorical features into numerical values. For example, product categories such as 'TV', 'fridge', 'microwave', 'computer', 'fan', and 'mixer' are converted into numerical codes such as TV: 1, fridge: 2, and so on. Note that string codes such as '01' and '02' should also be converted to plain numeric values, dropping any leading zeros.
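Conceptually, label encoding is nothing more than a mapping from each distinct category to an integer. Before turning to scikit-learn, here is a minimal sketch using a plain dictionary (the sorted order used here is an assumption for determinism, not part of any library API):

```python
# Minimal sketch: label encoding as a category-to-integer mapping.
items = ['TV', 'fridge', 'microwave', 'computer',
         'fan', 'fan', 'mixer', 'mixer']

# Assign each distinct category an integer code, sorting for a stable order.
mapping = {category: code for code, category in enumerate(sorted(set(items)))}
encoded = [mapping[item] for item in items]

print(mapping)
print(encoded)
```

Because the categories are sorted before codes are assigned, this sketch happens to produce the same codes as scikit-learn's `LabelEncoder`, which also sorts its classes.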
Scikit-learn Implementation #
Scikit-learn implements label encoding through the `LabelEncoder` class. You create a `LabelEncoder` object and perform label encoding by calling `fit()` and `transform()`.
```python
from sklearn.preprocessing import LabelEncoder

items = ['TV', 'fridge', 'microwave', 'computer',
         'fan', 'fan', 'mixer', 'mixer']

# Create a LabelEncoder object and
# conduct label encoding with fit() and transform()
encoder = LabelEncoder()
encoder.fit(items)
labels = encoder.transform(items)
print('Encoded label:', labels)
```

```
Encoded label: [0 3 4 1 2 2 5 5]
```
This results in `Encoded label: [0 3 4 1 2 2 5 5]`, where 'TV' is 0, 'fridge' is 3, 'microwave' is 4, 'computer' is 1, 'fan' is 2, and 'mixer' is 5. If it's unclear which string values correspond to which numerical codes, you can check the `classes_` attribute of the `LabelEncoder` object.
```python
print('Encoding Class:', encoder.classes_)
```

```
Encoding Class: ['TV' 'computer' 'fan' 'fridge' 'microwave' 'mixer']
```
The `classes_` attribute holds the original values in order, corresponding to the encoded numbers starting from 0. Therefore, it can be determined that 'TV' is encoded as 0, 'computer' as 1, 'fan' as 2, 'fridge' as 3, 'microwave' as 4, and 'mixer' as 5. For decoding, `inverse_transform()` can be used to revert the encoded values back to the original strings.
```python
print('Decoding original:', encoder.inverse_transform([4, 2, 5, 0, 3, 3, 2, 2]))
```

```
Decoding original: ['microwave' 'fan' 'mixer' 'TV' 'fridge' 'fridge' 'fan' 'fan']
```
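The fit-and-transform steps above can also be combined: `LabelEncoder` provides `fit_transform()`, which fits the encoder and encodes the data in a single call. A short sketch, including the decoding round trip:

```python
from sklearn.preprocessing import LabelEncoder

items = ['TV', 'fridge', 'microwave', 'computer',
         'fan', 'fan', 'mixer', 'mixer']

# fit_transform() combines fit() and transform() in one call.
encoder = LabelEncoder()
labels = encoder.fit_transform(items)
print('Encoded label:', labels)

# Round trip: inverse_transform() recovers the original strings.
decoded = encoder.inverse_transform(labels)
print('Decoding original:', decoded)
```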
When product data consists of two attributes, product category and price, applying label encoding to the product category can transform it as follows.
**Original Data**

| Product Category | Price |
|---|---|
| TV | 1,000,000 |
| fridge | 1,500,000 |
| microwave | 200,000 |
| computer | 800,000 |
| fan | 100,000 |
| fan | 100,000 |
| mixer | 50,000 |
| mixer | 50,000 |
**Data with Encoded Product Category**

| Product Category | Price |
|---|---|
| 0 | 1,000,000 |
| 3 | 1,500,000 |
| 4 | 200,000 |
| 1 | 800,000 |
| 2 | 100,000 |
| 2 | 100,000 |
| 5 | 50,000 |
| 5 | 50,000 |
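In practice, tabular data like this is usually held in a pandas DataFrame, and the encoder is applied to a single column. A sketch under that assumption (the DataFrame and column names here are illustrative, not from the original):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative product data matching the tables above.
df = pd.DataFrame({
    'product_category': ['TV', 'fridge', 'microwave', 'computer',
                         'fan', 'fan', 'mixer', 'mixer'],
    'price': [1_000_000, 1_500_000, 200_000, 800_000,
              100_000, 100_000, 50_000, 50_000],
})

# Encode only the categorical column; the price column is untouched.
encoder = LabelEncoder()
df['product_category'] = encoder.fit_transform(df['product_category'])
print(df)
```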
Label encoding converts string values into numeric category codes, which simplifies handling categorical data. However, because these codes carry an inherent order and magnitude, some machine learning algorithms may treat them as meaningful quantities and produce distorted predictions.
For instance, the numeric codes might imply an unintended order or importance among categories (e.g., 'computer' encoded as 1 might be treated as "less" than 'fan' encoded as 2). Therefore, label encoding is not recommended for models such as linear regression that interpret numeric magnitudes.
It is, however, suitable for tree-based algorithms, which do not rely on the order of numeric values. One-hot encoding addresses these issues by representing categories without any ordinal implication.
We will explore One-Hot Encoding in the next section.