Multi-Label Classification Example with MultiOutputClassifier and XGBoost in Python
The scikit-learn API provides a MultiOutputClassifier class that helps to classify multi-output data. In this tutorial, we’ll learn how to classify multi-output (multi-label) data with this method in Python. In multi-output data, each X input is associated with more than one y label. The tutorial covers:
- Preparing the data
- Defining the model
- Predicting and accuracy check
- Source code listing
We’ll start by loading the required libraries for this tutorial.
Preparing the data
We can generate multi-output data with the make_multilabel_classification function. The target dataset contains 20 features (x), 5 classes (y), and 10000 samples; we define these in the function’s parameters.
x, y = make_multilabel_classification(n_samples=10000, n_features=20, n_classes=5, random_state=88)
The generated dataset contains 20 feature columns and 5 label columns.
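A quick shape check (not part of the original listing) confirms the dimensions:

```python
from sklearn.datasets import make_multilabel_classification

x, y = make_multilabel_classification(n_samples=10000, n_features=20,
                                      n_classes=5, random_state=88)
print(x.shape)  # (10000, 20)
print(y.shape)  # (10000, 5)
```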
Next, we’ll split the data into the train and test parts.
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8, random_state=88)
Defining the model
We’ll define the model with the MultiOutputClassifier class of sklearn. As an estimator, we’ll use XGBClassifier, and pass it to the MultiOutputClassifier class.
We can check the parameters of the model with the print command.
We’ll fit the model with training data and check the training accuracy.
Next, we’ll evaluate the prediction with several accuracy metrics. Remember, there are five output labels in both the ytest and the yhat data, so we need to handle them accordingly.
First, we’ll check the area under the ROC with the roc_auc_score function.
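roc_auc_score accepts multilabel indicator arrays directly and macro-averages over the labels by default. A tiny illustration with made-up arrays (not the tutorial’s actual predictions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ytest = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
yhat  = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

# AUC is computed per label column, then averaged (average='macro')
print(roc_auc_score(ytest, yhat))
```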
The second method is to check the confusion matrices.
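For multilabel targets, scikit-learn provides multilabel_confusion_matrix, which returns one 2x2 matrix per label. Another tiny illustration with made-up arrays:

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

ytest = np.array([[1, 0], [0, 1], [1, 1]])
yhat  = np.array([[1, 0], [0, 0], [1, 1]])

# One [[tn, fp], [fn, tp]] matrix per label column
print(multilabel_confusion_matrix(ytest, yhat))
```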
Finally, we’ll check the classification report with the classification_report function.
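classification_report also accepts multilabel indicator arrays and reports per-label precision, recall, and F1, plus micro/macro averages. Again, a small made-up example:

```python
import numpy as np
from sklearn.metrics import classification_report

ytest = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
yhat  = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

# Each label column is reported as its own row in the output
print(classification_report(ytest, yhat))
```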
In this tutorial, we’ve briefly learned how to classify multi-label data with MultiOutputClassifier and XGBoost in Python.