Machine Learning - One-Hot Encoding

This article explains what one-hot encoding is in machine learning.

One-Hot Encoding

One-hot encoding creates one new binary feature for each unique value of a categorical feature. In every record, the column that matches the record’s value is set to 1 and all other columns are set to 0. A single feature with several unique values therefore becomes several binary features: values that used to vary from row to row are spread across columns instead.

Original Data
Product Category
TV
fridge
microwave
computer
fan
fan
mixer
mixer

One-Hot Encoding
Product Category_TV	Product Category_computer	Product Category_fan	Product Category_fridge	Product Category_microwave	Product Category_mixer
1	0	0	0	0	0
0	0	0	1	0	0
0	0	0	0	1	0
0	1	0	0	0	0
0	0	1	0	0	0
0	0	1	0	0	0
0	0	0	0	0	1
0	0	0	0	0	1

The original data contains 8 records with 6 unique values: [‘TV’, ‘computer’, ‘fan’, ‘fridge’, ‘microwave’, ‘mixer’]. From the label encoding example, we know ‘TV’ is encoded as 0, ‘computer’ as 1, and so on. With one-hot encoding, the single Product Category feature is instead converted into 6 binary features, one per unique value. If a record’s category is ‘TV’, then ‘Product Category_TV’ is marked 1 and all others 0. Similarly, if a record’s category is ‘fridge’, then ‘Product Category_fridge’ is 1 and the others are 0. Because exactly one column is “hot” (set to 1) in each record, the method is called one-hot encoding.
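The mapping described above can be sketched by hand in plain Python, without any library. This is only an illustration of the idea, not how Scikit-learn or pandas implement it; the names `column_of` and `one_hot` are made up for this example.

```python
items = ['TV', 'fridge', 'microwave', 'computer', 'fan', 'fan', 'mixer', 'mixer']

# The sorted unique values define the column order.
# Uppercase letters sort before lowercase ones, so 'TV' comes first.
categories = sorted(set(items))

# Map each category to the index of its column.
column_of = {category: index for index, category in enumerate(categories)}


def one_hot(value):
    """Return a row with 1 in the column for `value` and 0 elsewhere."""
    row = [0] * len(categories)
    row[column_of[value]] = 1
    return row


encoded = [one_hot(item) for item in items]

print(categories)
for item, row in zip(items, encoded):
    print(f"{item:<10}", row)
```

Every row contains exactly one 1, and the position of that 1 is the integer a label encoder would have assigned to the same value.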

Scikit-learn Implementation

One-hot encoding in Scikit-learn is performed with the OneHotEncoder class. Unlike LabelEncoder, it requires the input data to be 2-dimensional (one row per record, one column per feature). In addition, OneHotEncoder returns a sparse matrix by default, which can be converted to a dense array with the toarray() method when you want to print or inspect it. The result is a numerical representation of the categorical data that machine learning algorithms requiring numerical input can use.


from sklearn.preprocessing import OneHotEncoder
import numpy as np
items = ['TV', 'fridge', 'microwave', 'computer', 'fan', 'fan', 'mixer', 'mixer']
# Converting to a 2-dimensional ndarray (8 rows, 1 column)
items = np.array(items).reshape(-1, 1)
# Applying One-Hot Encoding
oh_encoder = OneHotEncoder()
oh_encoder.fit(items)
oh_labels = oh_encoder.transform(items)
# The result of the conversion using OneHotEncoder is a sparse matrix, so we use toarray() to convert it into a dense matrix.
print('One-Hot Encoded data:')
print(oh_labels.toarray())
print('Dimensions of One-Hot Encoded data:')
print(oh_labels.shape)

One-Hot Encoded data:
[[1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 1.]]
Dimensions of One-Hot Encoded data:
(8, 6)

The original data, 8 records in 1 column, becomes 8 records in 6 columns after one-hot encoding. The columns follow the sorted order of the categories: ‘TV’ is column 0, ‘computer’ is column 1, ‘fan’ is 2, ‘fridge’ is 3, ‘microwave’ is 4, and ‘mixer’ is 5. Hence, because the first record of the original data is ‘TV’, the first column of the first transformed record is 1 and all other columns are 0. The dataset gains dimensions, but unlike the integer codes of label encoding, the binary columns do not suggest any order or magnitude among the categories.

Original Data
Product Category	Price
TV	1,000,000
fridge	1,500,000
microwave	200,000
computer	800,000
fan	100,000
fan	100,000
mixer	50,000
mixer	50,000

↓

Label Encoding
Product Category	Price
0	1,000,000
3	1,500,000
4	200,000
1	800,000
2	100,000
2	100,000
5	50,000
5	50,000

↓

One-Hot Encoding
Product Category_TV	Product Category_computer	Product Category_fan	Product Category_fridge	Product Category_microwave	Product Category_mixer	Price
1	0	0	0	0	0	1,000,000
0	0	0	1	0	0	1,500,000
0	0	0	0	1	0	200,000
0	1	0	0	0	0	800,000
0	0	1	0	0	0	100,000
0	0	1	0	0	0	100,000
0	0	0	0	0	1	50,000
0	0	0	0	0	1	50,000
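To check which category each output column stands for, the fitted encoder can be inspected. The sketch below assumes a reasonably recent Scikit-learn release; it uses the `categories_` attribute of the fitted encoder and NumPy's `argmax` to recover the column index of the single 1 in each row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

items = np.array(['TV', 'fridge', 'microwave', 'computer',
                  'fan', 'fan', 'mixer', 'mixer']).reshape(-1, 1)

oh_encoder = OneHotEncoder()
oh_labels = oh_encoder.fit_transform(items).toarray()

# categories_ holds one array per input column;
# position i in that array corresponds to output column i.
categories = oh_encoder.categories_[0]
print(list(categories))

# The position of the 1 in each row is the column index of that record.
column_index = oh_labels.argmax(axis=1)
print(column_index.tolist())
```

The recovered indices match the integer codes in the label-encoded table above, which shows that the two encodings carry the same information in different shapes.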

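Pandas offers an easier API for one-hot encoding through the get_dummies() function. It takes a DataFrame directly, so there is no need to reshape the input into a 2D array or to convert a sparse result, and it returns a DataFrame whose column names combine the original column name with each category value. (Older Scikit-learn releases also required string categories to be converted into numbers before using OneHotEncoder; current releases accept strings, as the example above shows.)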

import pandas as pd
df = pd.DataFrame({'item':['TV', 'fridge', 'microwave', 
                           'computer', 'fan', 'fan',
                           'mixer', 'mixer']})
pd.get_dummies(df)

	item_TV	item_computer	item_fan	item_fridge	item_microwave	item_mixer
0	True	False	False	False	False	False
1	False	False	False	True	False	False
2	False	False	False	False	True	False
3	False	True	False	False	False	False
4	False	False	True	False	False	False
5	False	False	True	False	False	False
6	False	False	False	False	False	True
7	False	False	False	False	False	True

Note that the output above contains True/False values rather than 1/0: recent pandas versions return boolean columns from get_dummies() by default. Passing dtype=int produces the 0/1 values used in the earlier tables.
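As a small sketch of how this applies to the Product Category and Price table shown earlier, the example below encodes only the categorical column and leaves the numeric one untouched. The DataFrame is rebuilt here from the values in that table; `columns` and `dtype` are existing parameters of `get_dummies()`, and `dtype=int` is used so the result shows 0/1 regardless of the pandas default.

```python
import pandas as pd

df = pd.DataFrame({
    'Product Category': ['TV', 'fridge', 'microwave', 'computer',
                         'fan', 'fan', 'mixer', 'mixer'],
    'Price': [1_000_000, 1_500_000, 200_000, 800_000,
              100_000, 100_000, 50_000, 50_000],
})

# Encode only the categorical column; Price is passed through unchanged.
encoded = pd.get_dummies(df, columns=['Product Category'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

In the result, the untouched Price column comes first and the new binary columns follow it, so the column order differs from the table above even though the content is the same.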