Boost Up With XGBoost

eXtreme Gradient Boosting in Python 3

2020-12-07

One of the trendy libraries

There are lots of articles out there talking about XGBoost and using it for models.
And why shouldn’t there be? It is a really powerful tool, proven to obtain great results in a wide variety of settings, and it particularly shines with heterogeneous, tabular data. It has implementations in several languages, but in this article we are going to follow the trend of the previous ones and use the Python 3 implementation.

What is eXtreme Gradient Boosting?

The fancy name of the library comes from the algorithm it uses to train the model, but how does it work? Let’s go backwards and see what each word means.

Boosting is a very well known ensemble strategy for building complex models out of simple ones.
What it does is generate several weak models in sequence, using the errors made by the models trained so far to “boost” the next batch of models. Training comes to an end after a set number of rounds, or once new models stop improving the ensemble in a meaningful way.
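To make the idea concrete, here is a minimal sketch of that error-correcting loop, using plain Scikit Learn decision stumps rather than XGBoost; the data, the 20 rounds and the 0.3 step size are all made up purely for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

prediction = np.zeros_like(y)
for _ in range(20):                       # a fixed number of boosting rounds
    residuals = y - prediction            # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += 0.3 * stump.predict(X)  # small step, akin to "eta" below

Each stump is a very weak model on its own, but each one is fitted to the errors left by the previous ones, so the combined prediction keeps improving.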

Gradient is another of the hot buzzwords heavily used in machine learning.
Essentially, Gradient Descent is an optimization algorithm, very popular in neural networks, that improves a model by repeatedly taking small steps against the gradient of an objective function. Here, it gives the algorithm the means, the “objective”, to improve or to boost each iteration of the generated trees. We won’t go into the mathematical depths of Gradient Boosting in this article, but for the most curious, here is one of the original articles on how it works.
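As a quick, generic illustration (not XGBoost code), here is the textbook gradient descent update minimizing a toy objective:

# Plain gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x = 0.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (x - 3)         # derivative of the objective at the current x
    x -= learning_rate * gradient  # take a small step against the gradient
print(x)  # converges towards 3, the minimum of the objective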

But what makes XGBoost so extreme?
Well, it has several elements that differ from other Gradient Boosting algorithms, most notably the way it boosts the trees, its regularization and its loss function. First of all, instead of regular gradient descent boosting, it uses the Newton–Raphson method. By using the second order derivative, it knows better how to optimize each tree, even if it is a bit more computationally costly. That ties in with its loss function, which is modified to work with that Newton boosting. Finally, it applies regularization to the leaf scores, whereas regression algorithms usually regularize the feature coefficients. In practice, this favors shallower trees, which helps against overfitting.
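The second order information shows up in a very compact formula: in the original XGBoost paper, the optimal score of a leaf is the negative sum of gradients divided by the sum of hessians plus the regularization term. A minimal, purely illustrative sketch:

# Optimal leaf score from the XGBoost paper: w* = -G / (H + lambda),
# where G and H are the sums of gradients and hessians of the examples
# that fall into the leaf, and lambda is the L2 term on leaf scores.
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    return -grad_sum / (hess_sum + reg_lambda)

print(leaf_weight(10.0, 5.0, reg_lambda=0.0))  # -2.0, no regularization
print(leaf_weight(10.0, 5.0, reg_lambda=5.0))  # -1.0, shrunk towards zero

Notice how a larger lambda shrinks the leaf score towards zero, which is exactly the conservative effect described above.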

This is all much more complex than I just explained, but with this we have an overview of what we are doing, rather than just pasting Python commands into a terminal and hoping they work. If you want to know about this method in a deeper way, a very good article can be found here

Loading Data and the DMatrix Structure

Not everything is theory; the actual implementation is just as important, so let’s get to it. For this article, we are going to use this mushroom dataset and try to predict whether a certain mushroom is poisonous or not. It will also be a good occasion to see how to preprocess text data.

The first thing we will do is download the files. We could do it by hand, but this is a good place to revise our web skills in Python. We will use urlretrieve() from urllib:


from urllib.request import urlretrieve

url_data = "https://raw.githubusercontent.com/jboscomendoza/xgboost_en_python/master/agaricus-lepiota.data"
url_names = "https://raw.githubusercontent.com/jboscomendoza/xgboost_en_python/master/agaricus-lepiota.names"

urlretrieve(url_data, "agaricus-lepiota.data")
urlretrieve(url_names, "agaricus-lepiota.names")

However, before putting the data in a DataFrame, we should have a look at the files. The first one contains the data as we expect it, one letter per column. The second file, however, contains all the documentation of the dataset, not just the names of the columns. Reading through it, we can extract the column names, knowing that the target is the first one, and put them in a list to use later:


names = [
    "target", "cap_shape", "cap_surface", "cap_color", "bruises", "odor", 
    "gill_attachment", "gill_spacing", "gill_size", "gill_color", "stalk_shape",
    "stalk_root", "stalk_surface_above_ring", "stalk_surface_below_ring", 
    "stalk_color_above_ring", "stalk_color_below_ring", "veil_type", 
    "veil_color", "ring_number", "ring_type", "spore_print_color", "population",
    "habitat"
  ]

Now, we are ready to create a DataFrame:


import pandas as pd
df = pd.read_csv('agaricus-lepiota.data', names=names)
df.head(5)

  target cap_shape cap_surface cap_color  ... ring_type spore_print_color population habitat
0      p         x           s         n  ...         p                 k          s       u
1      e         x           s         y  ...         p                 n          n       g
2      e         b           s         w  ...         p                 n          n       m
3      p         x           y         w  ...         p                 k          s       u
4      e         x           s         g  ...         e                 n          a       g

[5 rows x 23 columns]

Now we need to encode the data as integers. The easiest way is to take the unique values of each column and build a mapping where the first value becomes 0, the second 1, and so on. For that, we are going to use a LabelEncoder from Scikit Learn and a defaultdict from the Python standard library:


from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
labels = defaultdict(LabelEncoder)
encoded_df = df.apply(lambda x: labels[x.name].fit_transform(x))

What is going on here and why so much hassle? We are using a default dictionary that creates every missing key as a fresh LabelEncoder, which does exactly what we described above. But a LabelEncoder only takes a single array, so we need to apply it to each one of the columns of the DataFrame. The fit_transform function is the one that “trains” the encoder and then performs the transformation. Finally, we are storing each encoder in a dictionary so we can reverse the encoding later, if we want to understand the model. Label encoding is convenient, but it is neither the only nor the most powerful way to encode a categorical value. Have a look at OneHotEncoder, if you are curious about more!
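For instance, since we kept every fitted encoder in the labels dictionary, reversing the encoding is a one-liner per column. A quick sketch:

# Decode a single column back to its original letters...
original_odor = labels["odor"].inverse_transform(encoded_df["odor"])
# ...or decode the whole DataFrame at once.
decoded_df = encoded_df.apply(lambda x: labels[x.name].inverse_transform(x))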

Finally, we are going to do a train/test split, for convenience:


from sklearn.model_selection import train_test_split

X_cols = names[1:]
y_col = 'target'

X = encoded_df[X_cols]
y = encoded_df[y_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

After this, we have to transform the data into the special format XGBoost uses: DMatrix. Luckily, both numpy arrays and pandas DataFrames are very easy to transform. For that, we will use the DMatrix constructor provided by XGBoost:


import xgboost as xgb
train_mat = xgb.DMatrix(X_train, label=y_train)
test_mat = xgb.DMatrix(X_test, label=y_test)

Training the model

After all of this, it is time to train the model. Let’s go over the most important hyperparameters:

  • booster: Can be dart, gbtree or gblinear. The first two use trees, while the third uses a linear function. We will use gbtree for this problem.

  • objective: The objective function that will be optimized. For a binary classification problem, we will use binary:logistic.

  • max_depth: When trees are used for the booster, it sets the maximum depth they can have.

  • eta: Akin to the learning rate in neural networks: if it is too big the model might never converge, but the smaller it is, the slower the model will train.

  • nthread: Number of parallel threads used for training, really important now that CPUs with high thread counts are commonplace.

  • num_boost_round: Number of boosting iterations that will be performed (this is the third argument we will pass to xgb.train() below). As usual with these algorithms, more rounds will help with training, but there is always the risk of overfitting.

Let’s build a dictionary with all the parameters we will use, as well as the number of rounds:


parameters = {
    "booster":"gbtree", 
    "max_depth": 2, 
    "eta": 0.3, 
    "objective": "binary:logistic", 
    "nthread":4}
rounds = 10

Also, we need to define a list with tuples of the sets that will be used to evaluate the model in each round:


eval_tuples = [(test_mat, "eval"), (train_mat, "train")]

Finally, we train our model:


model = xgb.train(parameters, train_mat, rounds, eval_tuples)

[0]     eval-error:0.09661      train-error:0.08543
[1]     eval-error:0.09064      train-error:0.07955
[2]     eval-error:0.05147      train-error:0.04593
[3]     eval-error:0.03021      train-error:0.02774
[4]     eval-error:0.03021      train-error:0.02701
[5]     eval-error:0.02536      train-error:0.02499
[6]     eval-error:0.02536      train-error:0.02499
[7]     eval-error:0.01306      train-error:0.01121
[8]     eval-error:0.01679      train-error:0.01598
[9]     eval-error:0.01306      train-error:0.01121

It gives us the training and test (or eval) error for each iteration. Thanks to the tuples we defined, we can follow the progress of the training and verify or debug our model.
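Since we are already passing evaluation sets, we could also let the training stop by itself when the evaluation error stalls, using the early_stopping_rounds argument of xgb.train(). A minimal sketch; note that early stopping monitors the last set in the list, so we put the evaluation set at the end:

# Stop if "eval" error does not improve for 5 consecutive rounds.
watchlist = [(train_mat, "train"), (test_mat, "eval")]
model_es = xgb.train(parameters, train_mat, 100, watchlist,
                     early_stopping_rounds=5)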

Evaluating the model

It is very easy to get the accuracy of the model with XGBoost:


model.eval(test_mat)

'[0]    eval-error:0.013055'

We can also use Scikit Learn to get additional insight into the model. For that, we first get the predictions for the test set:


pred = model.predict(test_mat)

array([0.04565517, 0.9624026 , 0.963244  , ..., 0.04554567, 0.17563151,
       0.04554567], dtype=float32)

Our output should be binary, right? XGBoost returns the probability of each element being the positive class, so let’s quickly encode this array into a binary one:


threshold = 0.5
pred = [1 if i > threshold else 0 for i in pred]

We can tweak that threshold, but for now 0.5 is enough. Then, we will use classification_report from Scikit to get some more data:


from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

                precision   recall  f1-score    support
              
0               0.98        1.00    0.99        1378
1               1.00        0.97    0.99        1303
                                    
accuracy                            0.99        2681
macro avg       0.99        0.99    0.99        2681
weighted avg    0.99        0.99    0.99        2681

With this we can see the precision per class, along with other scores such as the recall or the f1-score. It seems we got a really great model here. One final thing to do, then.
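If we want to know exactly where those few errors happen, a confusion matrix is a nice complement to the report. A quick sketch with Scikit Learn:

from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns the predicted ones;
# with our encoding, 0 is edible ("e") and 1 is poisonous ("p").
print(confusion_matrix(y_test, pred))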

Saving and Loading models for Production

We can easily export and import these models. Given that inference is usually much cheaper to perform than training, it is common to train a model on a powerful machine while being able to deploy the trained model on simple servers or even mobile applications. To save an XGBoost model, we just do the following:


model.save_model('mushroom.model')

This saves it in a binary format. To retrieve it later, we create an empty model and load it there:


imported_model = xgb.Booster()
imported_model.load_model('mushroom.model')
imported_model.eval(test_mat)

[0]    eval-error:0.013055

Bonus: Scikit-Learn API

In the Scikit Learn article we mentioned that one of its best features was how easily it combines with other tools, and of course a library as popular as XGBoost could not be left out. It has a wrapper that makes it work just like any other Scikit algorithm. To do this, just use these classes:


from xgboost import XGBClassifier, XGBRegressor

They work with the same syntax as any other Scikit model, and with all the elements mentioned in the Scikit article. They can be integrated seamlessly into any pipeline, and underneath they call the corresponding native methods, even transforming the data into a DMatrix on their own.
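As a minimal sketch, the mushroom model from above could be rewritten with the wrapper like this; note that some hyperparameter names change (eta becomes learning_rate, the number of rounds becomes n_estimators, and nthread becomes n_jobs):

from xgboost import XGBClassifier

clf = XGBClassifier(max_depth=2, learning_rate=0.3, n_estimators=10,
                    objective="binary:logistic", n_jobs=4)
clf.fit(X_train, y_train)          # plain DataFrames, no DMatrix needed
print(clf.score(X_test, y_test))   # mean accuracy, like any Scikit model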

Conclusion

There are countless machine learning algorithms out there, but they do not all work the same way, nor do they all work well for every problem. In this article we have seen how XGBoost can handle classification problems, giving you one more tool to approach new problems.