
How to achieve maximum parallel processing capabilities with XGBoost 1.0.0?


Another post starts with you beautiful people!
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. XGBoost has recently been released in its newer version 1.0.0, which has improvements like better performance scaling on multi-core CPUs, an improved installation experience on Mac OSX, the availability of distributed XGBoost on Kubernetes, etc. In this post we are going to explore its multi-threading capabilities on a real-world ML problem, the Otto Group Product Classification Challenge. At the end of the post I will also share my Kaggle kernel link so that you can explore my complete code.

Once you go to the challenge link on Kaggle and start your kernel, you first need to enable the Internet option in the notebook, since the version of XGBoost currently installed in the kernel notebook is 0.90. To upgrade it, run pip install --upgrade xgboost in your Anaconda prompt, or !pip install --upgrade xgboost in your Kaggle kernel. After running the command you will see output confirming successful installation of the library.
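You can confirm the upgrade from within the notebook by printing the installed version (a quick sanity check; xgboost.__version__ is the standard attribute for this)-

import xgboost

# Should print 1.0.0 (or newer) once the upgrade has taken effect.
print(xgboost.__version__)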

Let's import the dataset-
This dataset describes the 93 features of 61,878 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the 9 categories, and models are evaluated using multi-class logarithmic loss (also called cross entropy).
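Here is a minimal loading sketch; the file path is an assumption based on how Kaggle mounts this competition's data, so adjust it to your own setup-

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed path for a Kaggle kernel attached to the Otto challenge.
data = pd.read_csv('../input/otto-group-product-classification-challenge/train.csv')

# Drop the id column, split features from the target, and encode
# the Class_1 ... Class_9 labels as integers 0-8.
X = data.drop(['id', 'target'], axis=1)
y = LabelEncoder().fit_transform(data['target'])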

Please note that we get multi-threading support by default with XGBoost. But depending on our Python environment (e.g. Python 3) we may need to explicitly enable multi-threading support for XGBoost. We can confirm that XGBoost multi-threading support is working by building a number of different XGBoost models, specifying the number of threads each time, and timing how long it takes to build each model. The trend will both show us that multi-threading support is enabled and give us an indication of the effect it has when building models. Below is a code snippet showing how you can check this-
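This is a minimal sketch of the timing loop, assuming the X and y arrays prepared in the loading step above; the thread counts in num_threads are just an example-

import time
from xgboost import XGBClassifier

# Thread counts to try; update this list to match your own CPU.
num_threads = [1, 2, 3, 4]
results = []
for n in num_threads:
    start = time.time()
    model = XGBClassifier(nthread=n)
    model.fit(X, y)
    elapsed = time.time() - start
    print('%d thread(s): %.2f seconds' % (n, elapsed))
    results.append(elapsed)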

You can update the list of thread counts based on your system configuration. The recommended way to set the number of threads (nthread) is to make it equal to the number of physical CPU cores in your machine. After running the above code cell, you will see the training time printed for each thread count, and on my machine the time drops as the thread count goes up.

We can also plot the above trend in the following way-
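A plotting sketch that reuses the num_threads and results lists from the timing loop above-

import matplotlib.pyplot as plt

# Plot training time against the number of threads used.
plt.plot(num_threads, results)
plt.title('XGBoost training time vs number of threads')
plt.xlabel('Number of threads')
plt.ylabel('Training time (seconds)')
plt.show()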
From the above plot we can see a nice trend of decreasing execution time as the number of threads is increased. You can run the same code on a machine with many more cores and reduce the model training time further. Now you know how to configure the number of threads for XGBoost on your machine. But there is one more important thing we can tune. We always do cross-validation to avoid overfitting in our model, and this step is also time-consuming. So is there any way to tune this process? The answer is absolutely yes! We can enable multi-threading in both XGBoost and cross-validation.

The k-fold cross-validation in scikit-learn also supports multi-threading. For example, the n_jobs argument of the cross_val_score() function allows us to specify the number of parallel jobs to run. By default this is set to 1, but it can be set to -1 to use all of the CPU cores on our system. In the following code snippet we will check three configurations of cross-validation and XGBoost multi-threading and then compare the output of all three-
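Here is a sketch of the three configurations, again assuming X and y from the loading step; the fold count and random seed are arbitrary choices, and nthread=-1 / n_jobs=-1 mean "use all cores"-

import time
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# 1) Single-threaded XGBoost, parallel cross-validation.
start = time.time()
model = XGBClassifier(nthread=1)
cross_val_score(model, X, y, cv=kfold, scoring='neg_log_loss', n_jobs=-1)
print('XGBoost 1 thread, CV parallel: %.2f seconds' % (time.time() - start))

# 2) Parallel XGBoost, single-threaded cross-validation.
start = time.time()
model = XGBClassifier(nthread=-1)
cross_val_score(model, X, y, cv=kfold, scoring='neg_log_loss', n_jobs=1)
print('XGBoost parallel, CV 1 thread: %.2f seconds' % (time.time() - start))

# 3) Parallel XGBoost and parallel cross-validation.
start = time.time()
model = XGBClassifier(nthread=-1)
cross_val_score(model, X, y, cv=kfold, scoring='neg_log_loss', n_jobs=-1)
print('XGBoost parallel, CV parallel: %.2f seconds' % (time.time() - start))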

In the above code snippet you can see that the only configurable parameters required for multi-threading are nthread in XGBoost and n_jobs in cross-validation. For illustration I am using 1 thread in places, but you can change it to your number of cores. After running the above cell, the best result is achieved by enabling multi-threading within XGBoost and not in cross-validation.

So if you are going to do cross-validation with any library other than XGBoost 1.0.0, don't forget to enable the multi-threading feature in cross-validation; and if you are going to use XGBoost (which you should!), don't forget to check the number of threads and enable multi-threading within XGBoost. For the complete solution, you can find my kernel at the following link: my kernel
Fork my kernel and start experimenting with it, and if you would like to learn more about XGBoost, follow the tutorials of this amazing guy: Jason Brownlee PhD. Till then, go chase your dreams, have an awesome day, make every second count, and see you later in my next post.


