CatBoost's Categorical Encoding: One-Hot vs. Target Encoding - GeeksforGeeks (2024)

Last Updated : 04 Jun, 2024


CatBoost is a powerful gradient boosting algorithm that excels in handling categorical data. It incorporates unique methods for encoding categorical features, including one-hot encoding and target encoding. Understanding these encoding techniques is crucial for effectively utilizing CatBoost in machine learning tasks.

In real-world datasets, we quite often deal with categorical data. The cardinality of a categorical feature, i.e. the number of distinct values the feature can take, varies drastically across features and datasets, from just a few to thousands or even millions of values. The values of a categorical feature can be distributed almost uniformly, or there may be values whose frequencies differ by orders of magnitude. CatBoost supports some traditional methods of categorical data preprocessing, such as one-hot encoding and frequency encoding. However, one of the signature features of this package is its original solution for encoding categorical features.

Table of Content

  • One-Hot Encoding in CatBoost
  • Target Encoding in CatBoost
  • Implementing One-hot encoding and Target encoding in CatBoost
    • 1. Implementing One-Hot Encoding in CatBoost
    • 2. Demonstrating Target Encoding in CatBoost
  • Advantages and Disadvantages of One-Hot Encoding and Target Encoding

One-Hot Encoding in CatBoost

One-hot encoding is a common technique used to convert categorical variables into a format that can be provided to machine learning algorithms. In one-hot encoding, each category is represented as a binary vector, where only one element is "1" (indicating the presence of the category) and all other elements are "0".

One-hot encoding converts a categorical variable into a binary matrix in which each category gets its own binary feature: the column for the observed category is set to '1' (or 'True'), and all other columns are '0' (or 'False'). This is particularly useful for categorical features with a small number of unique values.

Each category is represented as a binary vector.

  • Example: For a feature with categories “Red”, “Green”, and “Blue”:
  1. Red: [1, 0, 0]
  2. Green: [0, 1, 0]
  3. Blue: [0, 0, 1]
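The "Red"/"Green"/"Blue" example above can be reproduced outside CatBoost with pandas; `get_dummies` is one common way to build such binary columns (the column name `color` is just for illustration):

```python
import pandas as pd

# Toy column matching the example above
colors = pd.Series(["Red", "Green", "Blue"], name="color")

# get_dummies creates one binary column per category;
# each row has exactly one "hot" (True/1) entry
encoded = pd.get_dummies(colors)
print(encoded)
```

Note that pandas orders the resulting columns alphabetically ("Blue", "Green", "Red"), so the vectors may appear in a different column order than in the example.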

CatBoost uses one-hot encoding for categorical features with a small number of unique values. The default threshold for applying one-hot encoding depends on various conditions, such as the training mode and the availability of target data. For instance:

  • Default Thresholds:
    • GPU Training: 255 unique values if the selected Ctr (Categorical Target Statistics) types require target data that is not available during training.
    • Ranking Mode: 10 unique values.
    • Other Conditions: 2 unique values if none of the above conditions are met.
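The threshold rule itself is simple: a feature is one-hot encoded only when its number of unique values does not exceed the active threshold. A minimal illustrative sketch (not CatBoost's actual internals):

```python
# Illustrative only: the decision rule behind one_hot_max_size.
# A feature qualifies for one-hot encoding when its unique-value
# count is at or below the threshold; otherwise CatBoost falls
# back to its target-statistics encoding.
def uses_one_hot(n_unique_values, one_hot_max_size):
    return n_unique_values <= one_hot_max_size

print(uses_one_hot(3, 2))    # default threshold of 2 -> False
print(uses_one_hot(3, 255))  # GPU-style threshold of 255 -> True
```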

Target Encoding in CatBoost

Target encoding, sometimes referred to as mean encoding, replaces each categorical value with the mean of the target variable for that category. CatBoost uses a more advanced variation known as ordered target encoding.

Each category is replaced by the mean target value for that category.

  • Example: For binary target values, a feature with categories “A”, “B”, and “C”:
  1. Category A: mean(target|A)
  2. Category B: mean(target|B)
  3. Category C: mean(target|C)
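Plain (unordered) mean encoding can be sketched with a pandas `groupby`; the data below is a hypothetical toy example, not from the article:

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target
df = pd.DataFrame({
    "feature1": ["A", "B", "C", "A", "B", "C"],
    "target":   [1, 0, 1, 1, 0, 0],
})

# Replace each category with mean(target | category),
# computed over ALL rows of that category (no ordering)
means = df.groupby("feature1")["target"].mean()
df["feature1_encoded"] = df["feature1"].map(means)
print(df)
```

This naive version looks at the whole column, including the current row's own target, which is exactly the leakage that CatBoost's ordered variant (described next) avoids.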

CatBoost uses a variant of target encoding called “ordered encoding” to avoid target leakage. Ordered encoding calculates the target statistics for a categorical feature based on the observed history, i.e., only from the rows (observations) before the current one. This approach mimics time series data validation and helps prevent overfitting.

Steps in Ordered Encoding

  1. TargetCount: Sum of the target values for the categorical feature up to the current observation.
  2. Prior: A constant value determined by the sum of target values in the entire dataset divided by the total number of observations.
  3. FeatureCount: Total number of observations with the same categorical feature value up to the current observation.

The encoded value for a category is calculated using the formula:

EncodedValue = (TargetCount + Prior) / (FeatureCount + 1)

To reduce the variance in the first few observations, CatBoost uses multiple random permutations of the data and averages the target statistics across these permutations.
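The three quantities and the formula above can be sketched for a single permutation; this is a simplified illustration of the idea, not CatBoost's actual implementation:

```python
# Minimal sketch of ordered target encoding for ONE permutation.
# CatBoost itself averages over several random permutations.
def ordered_target_encode(categories, targets):
    prior = sum(targets) / len(targets)  # Prior: global target mean
    target_count = {}                    # TargetCount: running target sum per category
    feature_count = {}                   # FeatureCount: running row count per category
    encoded = []
    for cat, t in zip(categories, targets):
        tc = target_count.get(cat, 0.0)
        fc = feature_count.get(cat, 0)
        # Only history BEFORE the current row is used -> no target leakage
        encoded.append((tc + prior) / (fc + 1))
        target_count[cat] = tc + t
        feature_count[cat] = fc + 1
    return encoded

vals = ordered_target_encode(["A", "B", "A", "A", "B"], [1, 0, 1, 0, 1])
print(vals)  # first occurrence of each category falls back to the prior
```

Note that the first row of each category gets only the prior (its history is empty), which is why averaging across multiple permutations is needed to reduce variance.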

Implementing One-hot encoding and Target encoding in CatBoost

  1. Install CatBoost: If not already installed, use the command pip install catboost.
  2. Prepare Data: Create a pandas DataFrame with your dataset.
  3. Specify Categorical Features: Use the cat_features parameter to indicate which features are categorical.
  4. Train the Model: Initialize the CatBoost model with the necessary parameters and train it using the fit method.
  5. Evaluate the Model: Use the predict method to evaluate the model on the validation set and print the predictions.

1. Implementing One-Hot Encoding in CatBoost

One-Hot Encoding Example: The feature 'feature1' with categories ['Red', 'Green', 'Blue'] will be one-hot encoded, since its 3 unique values do not exceed the threshold set by one_hot_max_size=3. The predictions are based on the transformed binary vectors for the categorical feature.

Python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Load dataset
data = {
    'feature1': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue'],
    'feature2': [1, 2, 3, 4, 5, 6],
    'target': [0, 1, 0, 1, 0, 1]
}

# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Specify categorical features
cat_features = ['feature1']

# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1,
                           cat_features=cat_features, one_hot_max_size=3)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# Model evaluation
predictions = model.predict(X_val)
print(predictions)

Output:

[0 1]

Here, the model is predicting the classes for the two samples in the validation set.

2. Demonstrating Target Encoding in CatBoost

Target Encoding Example: The feature ‘feature1’ with categories [‘A’, ‘B’, ‘C’] will use ordered target encoding. The encoding will replace each category with the mean target value for that category, computed using only the preceding data points to avoid data leakage.

Python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Load dataset
data = {
    'feature1': ['A', 'B', 'C', 'A', 'B', 'C'],
    'feature2': [10, 20, 30, 40, 50, 60],
    'target': [1, 0, 1, 0, 1, 0]
}

# Prepare data
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']]
y = df['target']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Specify categorical features
cat_features = ['feature1']

# Initialize and train CatBoost model
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1,
                           cat_features=cat_features)
model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)

# Model evaluation
predictions = model.predict(X_val)
print(predictions)

Output:

[1 1]

In this case, the model is predicting the classes for the two samples in the validation set.

Advantages and Disadvantages of One-Hot Encoding and Target Encoding

  • One-Hot Encoding:
    • Advantage: Simple and effective for categorical features with a small number of unique values.
    • Disadvantage: Can lead to high-dimensional data and is not suitable for features with many unique values.
  • Target Encoding:
    • Advantage: Captures the relationship between categorical features and the target variable, handles high-cardinality features effectively.
    • Disadvantage: Prone to overfitting if not implemented correctly, requires careful handling to avoid target leakage.

Conclusion

CatBoost’s ability to handle categorical data directly through one-hot encoding and target encoding makes it a versatile tool for machine learning tasks. One-hot encoding is suitable for features with a small number of unique values, while target encoding is effective for high-cardinality features. By leveraging these encoding techniques, CatBoost enhances model performance and generalization, making it a valuable asset in data preprocessing and machine learning.


