Machine Learning - One-Hot Encoding
This article explains what one-hot encoding is in machine learning.
One-Hot Encoding
One-hot encoding replaces a categorical feature with one new binary feature per unique value. For each record, the column that corresponds to the record's value is set to 1 and all other columns are set to 0. A single feature with many unique values thus becomes several binary features, and the distinct values that were spread across rows become columns.
Original Data |
---|
Product Category |
TV |
fridge |
microwave |
computer |
fan |
fan |
mixer |
mixer |
One-Hot Encoding | |||||
---|---|---|---|---|---|
Product Category_TV | Product Category_computer | Product Category_fan | Product Category_fridge | Product Category_microwave | Product Category_mixer |
1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 1 |
The original data contains 8 records with 6 unique values: [‘TV’, ‘computer’, ‘fan’, ‘fridge’, ‘microwave’, ‘mixer’]. In the label encoding example, ‘TV’ was encoded as 0, ‘computer’ as 1, and so on. One-hot encoding instead converts the single Product Category feature into 6 binary features, one per unique value. If a record’s category is ‘TV’, then ‘Product Category_TV’ is marked 1 and all others 0. Similarly, if a record’s category is ‘fridge’, then ‘Product Category_fridge’ is 1 and the others are 0. The name one-hot encoding comes from the fact that exactly one of these features is 1 in each record.
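The mapping described above can be sketched by hand with NumPy. This is only an illustration of the idea, not how any particular library implements it; the variable names `categories`, `index`, and `one_hot` are chosen here for the example.

```python
import numpy as np

items = ['TV', 'fridge', 'microwave', 'computer', 'fan', 'fan', 'mixer', 'mixer']

# The sorted unique values define the column order.
categories = sorted(set(items))
# Map each category to its column index.
index = {cat: i for i, cat in enumerate(categories)}

# Start from all zeros, then set exactly one 1 per row.
one_hot = np.zeros((len(items), len(categories)), dtype=int)
for row, item in enumerate(items):
    one_hot[row, index[item]] = 1

print(categories)
print(one_hot)
```

Every row of `one_hot` sums to 1, which is the defining property of the encoding.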
Scikit-learn Implementation
One-hot encoding in Scikit-learn is performed with the OneHotEncoder class. Unlike LabelEncoder, it requires the input data to be in a 2D format. In addition, OneHotEncoder returns a sparse matrix by default, which can be converted to a dense array with the toarray() method for ease of use. The result is a numerical representation of the categorical data, suitable for machine learning algorithms that require numerical input.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
items = ['TV', 'fridge', 'microwave', 'computer', 'fan', 'fan', 'mixer', 'mixer']
# Convert to a 2-dimensional ndarray, as OneHotEncoder requires
items = np.array(items).reshape(-1, 1)
# Applying One-Hot Encoding
oh_encoder = OneHotEncoder()
oh_encoder.fit(items)
oh_labels = oh_encoder.transform(items)
# The result of the conversion using OneHotEncoder is a sparse matrix, so we use toarray() to convert it into a dense matrix.
print('One-Hot Encoded data:')
print(oh_labels.toarray())
print('Dimensions of One-Hot Encoded data:')
print(oh_labels.shape)
One-Hot Encoded data:
[[1. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 1. 0.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 1.]]
Dimensions of One-Hot Encoded data:
(8, 6)
The original data, consisting of 8 records and 1 column, becomes a dataset with 8 records and 6 columns through one-hot encoding. OneHotEncoder sorts the categories, so ‘TV’ corresponds to column 0, ‘computer’ to column 1, ‘fan’ to column 2, ‘fridge’ to column 3, ‘microwave’ to column 4, and ‘mixer’ to column 5. Hence, because the first record of the original data is ‘TV’, the first column of the first transformed record is 1 and all other columns are 0. Unlike integer codes, these binary columns do not imply any order among the categories. The tables below show the same steps for data that also has a Price column.
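The column order can be checked on the fitted encoder rather than taken on trust. The sketch below assumes a reasonably recent scikit-learn and uses the `categories_` attribute and the `inverse_transform()` method; the variable names are chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

items = np.array(['TV', 'fridge', 'microwave', 'computer',
                  'fan', 'fan', 'mixer', 'mixer']).reshape(-1, 1)

oh_encoder = OneHotEncoder()
encoded = oh_encoder.fit_transform(items).toarray()

# categories_ holds one array of sorted categories per input column.
categories = oh_encoder.categories_[0]
print(categories)

# Position of the 1 in each row, i.e. the column index of each record.
column_index = encoded.argmax(axis=1)
print(column_index)

# inverse_transform maps the one-hot rows back to the original labels.
restored = oh_encoder.inverse_transform(encoded)
print(restored.ravel())
```

The positions in `column_index` match the integer codes listed above, and `inverse_transform()` recovers the original strings, so no information is lost by the encoding.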
Original Data | |
---|---|
Product Category | Price |
TV | 1,000,000 |
fridge | 1,500,000 |
microwave | 200,000 |
computer | 800,000 |
fan | 100,000 |
fan | 100,000 |
mixer | 50,000 |
mixer | 50,000 |
↓
Label Encoding | |
---|---|
Product Category | Price |
0 | 1,000,000 |
3 | 1,500,000 |
4 | 200,000 |
1 | 800,000 |
2 | 100,000 |
2 | 100,000 |
5 | 50,000 |
5 | 50,000 |
↓
One-Hot Encoding | ||||||
---|---|---|---|---|---|---|
Product Category_TV | Product Category_computer | Product Category_fan | Product Category_fridge | Product Category_microwave | Product Category_mixer | Price |
1 | 0 | 0 | 0 | 0 | 0 | 1,000,000 |
0 | 0 | 0 | 1 | 0 | 0 | 1,500,000 |
0 | 0 | 0 | 0 | 1 | 0 | 200,000 |
0 | 1 | 0 | 0 | 0 | 0 | 800,000 |
0 | 0 | 1 | 0 | 0 | 0 | 100,000 |
0 | 0 | 1 | 0 | 0 | 0 | 100,000 |
0 | 0 | 0 | 0 | 0 | 1 | 50,000 |
0 | 0 | 0 | 0 | 0 | 1 | 50,000 |
Pandas offers a simpler API for one-hot encoding through the get_dummies() function. Unlike Scikit-learn’s OneHotEncoder, it works directly on a DataFrame: there is no need to reshape the data to 2D, fit an encoder, or convert a sparse matrix, and the resulting columns are named after the category values.
import pandas as pd
df = pd.DataFrame({'item':['TV', 'fridge', 'microwave',
'computer', 'fan', 'fan',
'mixer', 'mixer']})
pd.get_dummies(df)
Index | item_TV | item_computer | item_fan | item_fridge | item_microwave | item_mixer |
---|---|---|---|---|---|---|
0 | True | False | False | False | False | False |
1 | False | False | False | True | False | False |
2 | False | False | False | False | True | False |
3 | False | True | False | False | False | False |
4 | False | False | True | False | False | False |
5 | False | False | True | False | False | False |
6 | False | False | False | False | False | True |
7 | False | False | False | False | False | True |
get_dummies() returns a DataFrame with one column per category, named after the source column and the category value. Recent pandas versions fill these columns with booleans, as in the output above, where True corresponds to 1 and False to 0.