CatBoost's Categorical Encoding: One-Hot vs. Target Encoding - GeeksforGeeks (2024)

Last Updated : 04 Jun, 2024


CatBoost is a powerful gradient boosting algorithm that excels in handling categorical data. It incorporates unique methods for encoding categorical features, including one-hot encoding and target encoding. Understanding these encoding techniques is crucial for effectively utilizing CatBoost in machine learning tasks.

In real-world datasets, we quite often deal with categorical data. The cardinality of a categorical feature, i.e. the number of distinct values the feature can take, varies drastically across features and datasets, from just a few to thousands or even millions of values. The values of a categorical feature can be distributed almost uniformly, or there may be values whose frequencies differ by orders of magnitude. CatBoost supports some traditional methods of categorical data preprocessing, such as one-hot encoding and frequency encoding. However, one of the signature features of this package is its original solution for encoding categorical features.

Table of Content

  • One-Hot Encoding in CatBoost
  • Target Encoding in CatBoost
  • Implementing One-hot encoding and Target encoding in CatBoost
    • 1. Implementing One-Hot Encoding in CatBoost
    • 2. Demonstrating Target Encoding in CatBoost
  • Advantages and Disadvantages of One-Hot Encoding and Target Encoding

One-Hot Encoding in CatBoost

One-hot encoding is a common technique used to convert categorical variables into a format that can be provided to machine learning algorithms. In one-hot encoding, each category is represented as a binary vector, where only one element is "1" (indicating the presence of the category) and all other elements are "0".

One-hot encoding converts a categorical variable into a binary matrix in which each category gets its own binary feature: the column for the observed category is set to '1' (or 'True'), and all other columns are '0' (or 'False'). This is particularly useful for categorical features with a small number of unique values.

Each category is represented as a binary vector.

  • Example: For a feature with categories “Red”, “Green”, and “Blue”:
  1. Red: [1, 0, 0]
  2. Green: [0, 1, 0]
  3. Blue: [0, 0, 1]
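The "Red"/"Green"/"Blue" example above can be reproduced outside CatBoost with pandas; `get_dummies` is one common way to build such binary columns (the column name `color` is just for illustration):

```python
import pandas as pd

# Toy column matching the example above
colors = pd.Series(["Red", "Green", "Blue"], name="color")

# get_dummies creates one binary column per category;
# each row has exactly one "hot" (True/1) entry
encoded = pd.get_dummies(colors)
print(encoded)
```

Note that pandas orders the resulting columns alphabetically ("Blue", "Green", "Red"), so the vectors may appear in a different column order than in the example.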

CatBoost uses one-hot encoding for categorical features with a small number of unique values. The default threshold for applying one-hot encoding depends on various conditions, such as the training mode and the availability of target data. For instance:

  • Default Thresholds:
    • GPU Training: 255 unique values if the selected Ctr (Categorical Target Statistics) types require target data that is not available during training.
    • Ranking Mode: 10 unique values.
    • Other Conditions: 2 unique values if none of the above conditions are met.
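The threshold rule itself is simple: a feature is one-hot encoded only when its number of unique values does not exceed the active threshold. A minimal illustrative sketch (not CatBoost's actual internals):

```python
# Illustrative only: the decision rule behind one_hot_max_size.
# A feature qualifies for one-hot encoding when its unique-value
# count is at or below the threshold; otherwise CatBoost falls
# back to its target-statistics encoding.
def uses_one_hot(n_unique_values, one_hot_max_size):
    return n_unique_values <= one_hot_max_size

print(uses_one_hot(3, 2))    # default threshold of 2 -> False
print(uses_one_hot(3, 255))  # GPU-style threshold of 255 -> True
```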

Target Encoding in CatBoost

Target encoding, sometimes referred to as mean encoding, replaces each categorical value with the mean of the target variable for that category. CatBoost uses a more advanced variation known as ordered target encoding.

Each category is replaced by the mean target value for that category.

  • Example: For binary target values, a feature with categories “A”, “B”, and “C”:
  1. Category A: mean(target|A)
  2. Category B: mean(target|B)
  3. Category C: mean(target|C)
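Plain (unordered) mean encoding can be sketched with a pandas `groupby`; the data below is a hypothetical toy example, not from the article:

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target
df = pd.DataFrame({
    "feature1": ["A", "B", "C", "A", "B", "C"],
    "target":   [1, 0, 1, 1, 0, 0],
})

# Replace each category with mean(target | category),
# computed over ALL rows of that category (no ordering)
means = df.groupby("feature1")["target"].mean()
df["feature1_encoded"] = df["feature1"].map(means)
print(df)
```

This naive version looks at the whole column, including the current row's own target, which is exactly the leakage that CatBoost's ordered variant (described next) avoids.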

CatBoost uses a variant of target encoding called “ordered encoding” to avoid target leakage. Ordered encoding calculates the target statistics for a categorical feature based on the observed history, i.e., only from the rows (observations) before the current one. This approach mimics time series data validation and helps prevent overfitting.

Steps in Ordered Encoding

  1. TargetCount: Sum of the target values for the categorical feature up to the current observation.
  2. Prior: A constant value determined by the sum of target values in the entire dataset divided by the total number of observations.
  3. FeatureCount: Total number of observations with the same categorical feature value up to the current observation.

The encoded value for a category is calculated using the formula:

EncodedValue = (TargetCount + Prior) / (FeatureCount + 1)

To reduce the variance in the first few observations, CatBoost uses multiple random permutations of the data and averages the target statistics across these permutations.
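The three quantities and the formula above can be sketched for a single permutation; this is a simplified illustration of the idea, not CatBoost's actual implementation:

```python
# Minimal sketch of ordered target encoding for ONE permutation.
# CatBoost itself averages over several random permutations.
def ordered_target_encode(categories, targets):
    prior = sum(targets) / len(targets)  # Prior: global target mean
    target_count = {}                    # TargetCount: running target sum per category
    feature_count = {}                   # FeatureCount: running row count per category
    encoded = []
    for cat, t in zip(categories, targets):
        tc = target_count.get(cat, 0.0)
        fc = feature_count.get(cat, 0)
        # Only history BEFORE the current row is used -> no target leakage
        encoded.append((tc + prior) / (fc + 1))
        target_count[cat] = tc + t
        feature_count[cat] = fc + 1
    return encoded

vals = ordered_target_encode(["A", "B", "A", "A", "B"], [1, 0, 1, 0, 1])
print(vals)  # first occurrence of each category falls back to the prior
```

Note that the first row of each category gets only the prior (its history is empty), which is why averaging across multiple permutations is needed to reduce variance.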

Implementing One-hot encoding and Target encoding in CatBoost

  1. Install CatBoost: If not already installed, use the command pip install catboost.
  2. Prepare Data: Create a pandas DataFrame with your dataset.
  3. Specify Categorical Features: Use the cat_features parameter to indicate which features are categorical.
  4. Train the Model: Initialize the CatBoost model with the necessary parameters and train it using the fit method.
  5. Evaluate the Model: Use the predict method to evaluate the model on the validation set and print the predictions.

1. Implementing One-Hot Encoding in CatBoost

One-Hot Encoding Example: The feature 'feature1' with categories ['Red', 'Green', 'Blue'] will be one-hot encoded, since its 3 unique values do not exceed the threshold set by one_hot_max_size=3. The predictions are based on the transformed binary vectors for the categorical feature.

Python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Load dataset
data = {
    'feature1': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue'],
    'feature2': [1, 2, 3, 4, 5, 6],
    'target': [0, 1, 0, 1, 0, 1]
}

# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Specify categorical features
cat_features = ['feature1']

# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1,
                           cat_features=cat_features, one_hot_max_size=3)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# Model evaluation
predictions = model.predict(X_val)
print(predictions)

Output:

[0 1]

Here, the model is predicting the classes for the two samples in the validation set.

2. Demonstrating Target Encoding in CatBoost

Target Encoding Example: The feature ‘feature1’ with categories [‘A’, ‘B’, ‘C’] will use ordered target encoding. The encoding will replace each category with the mean target value for that category, computed using only the preceding data points to avoid data leakage.

Python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Load dataset
data = {
    'feature1': ['A', 'B', 'C', 'A', 'B', 'C'],
    'feature2': [10, 20, 30, 40, 50, 60],
    'target': [1, 0, 1, 0, 1, 0]
}

# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Specify categorical features
cat_features = ['feature1']

# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1,
                           cat_features=cat_features)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# Model evaluation
predictions = model.predict(X_val)
print(predictions)

Output:

[1 1]

In this case, the model is predicting the classes for the two samples in the validation set.

Advantages and Disadvantages of One-Hot Encoding and Target Encoding

  • One-Hot Encoding:
    • Advantage: Simple and effective for categorical features with a small number of unique values.
    • Disadvantage: Can lead to high-dimensional data and is not suitable for features with many unique values.
  • Target Encoding:
    • Advantage: Captures the relationship between categorical features and the target variable, handles high-cardinality features effectively.
    • Disadvantage: Prone to overfitting if not implemented correctly, requires careful handling to avoid target leakage.

Conclusion

CatBoost’s ability to handle categorical data directly through one-hot encoding and target encoding makes it a versatile tool for machine learning tasks. One-hot encoding is suitable for features with a small number of unique values, while target encoding is effective for high-cardinality features. By leveraging these encoding techniques, CatBoost enhances model performance and generalization, making it a valuable asset in data preprocessing and machine learning.


