
How to achieve maximum parallel processing capabilities with XGBoost-1.0.0?


Another post starts with you beautiful people!
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. XGBoost has recently been released in its new version 1.0.0, which brings improvements like better performance scaling on multi-core CPUs, an improved installation experience on Mac OSX, availability of distributed XGBoost on Kubernetes, etc. In this post we are going to explore its multi-processing capabilities on a real-world machine learning problem: the Otto Group Product Classification Challenge. At the end of the post I will also share my Kaggle kernel link so that you can explore my complete code.

Once you go to the challenge link on Kaggle and start your kernel, you first need to enable the Internet option in the notebook, since the version of XGBoost currently installed in kernel notebooks is 0.90. To upgrade it, run pip install --upgrade xgboost in your Anaconda prompt, or !pip install --upgrade xgboost inside your Kaggle kernel. After running the command you will see output confirming successful installation of the library.
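To confirm the upgrade worked, you can print the installed version from Python:

import xgboost
print(xgboost.__version__)   # should now show 1.0.0 (or later)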

Let's import the dataset-
This dataset describes 93 obfuscated features of 61,878 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the 9 categories, and models are evaluated using multi-class logarithmic loss (also called cross entropy).
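Here is a minimal loading sketch; it assumes the competition's train.csv is available in the working directory and encodes the Class_1..Class_9 target strings as integers:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the Otto training data; the path may differ on Kaggle,
# e.g. '../input/otto-group-product-classification-challenge/train.csv'
data = pd.read_csv('train.csv')

# 93 numeric feature columns (feat_1 .. feat_93) plus id and a string target
X = data.drop(['id', 'target'], axis=1).values
y = LabelEncoder().fit_transform(data['target'])   # Class_1..Class_9 -> 0..8

print(X.shape, y.shape)   # expect (61878, 93) and (61878,)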

Please note that we get multi-threading support by default with XGBoost. But depending on our Python environment (e.g. Python 3) we may need to explicitly enable multi-threading support for XGBoost. We can confirm that XGBoost multi-threading support is working by building a number of different XGBoost models, specifying the number of threads for each, and timing how long it takes to build each model. The trend will both show us that multi-threading support is enabled and give us an indication of the effect it has when building models. Below is a code snippet showing how you can check this-
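This is a sketch rather than exact notebook output; it assumes the X and y arrays prepared in the loading step above:

import time
from xgboost import XGBClassifier

# Example thread counts - adjust to the cores available on your machine
num_threads = [1, 2, 3, 4]
results = []
for n in num_threads:
    start = time.time()
    model = XGBClassifier(nthread=n)
    model.fit(X, y)
    elapsed = time.time() - start
    print("%d thread(s): %f seconds" % (n, elapsed))
    results.append(elapsed)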

You can update the list of thread counts based on your system configuration. The recommended way to set the number of threads (nthread) is to make it equal to the number of physical CPU cores in your machine. After running the above code cell you will see the training time printed for each thread count.

We can also plot the above trend in the following way-
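A simple matplotlib sketch, using the num_threads and results lists from the timing loop above:

import matplotlib.pyplot as plt

# Plot training time against the number of threads used
plt.plot(num_threads, results)
plt.ylabel('Training time (seconds)')
plt.xlabel('Number of threads')
plt.title('XGBoost training speed vs number of threads')
plt.show()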
From the above plot we can see a nice trend: execution time decreases as the number of threads is increased. Run the same code on a machine with more cores and you can decrease the model training time further. Now you know how to configure the number of threads with XGBoost on your machine. But there is one more important tuning step we can do. We always perform cross validation to avoid overfitting in our model, and this step is also time-consuming. So is there any way to speed this process up? The answer is absolutely yes! We can enable multi-threading both in XGBoost and in the cross validation itself.

k-fold cross-validation also supports multi-threading. For example, the n_jobs argument of the cross_val_score() function allows us to specify the number of parallel jobs to run. By default this is set to 1, but it can be set to -1 to use all of the CPU cores on our system. In the following code snippet we will check three configurations of cross validation and XGBoost multi-threading and then compare the outputs of all three-
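Here is one way to set up the three configurations, again assuming the X and y arrays from earlier; scoring='neg_log_loss' is scikit-learn's name for the competition's multi-class log loss metric:

import time
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# 1) Parallel cross validation, single-threaded XGBoost
start = time.time()
cross_val_score(XGBClassifier(nthread=1), X, y, cv=kfold,
                scoring='neg_log_loss', n_jobs=-1)
print("Parallel CV, 1 XGBoost thread: %f seconds" % (time.time() - start))

# 2) Single-threaded cross validation, parallel XGBoost
start = time.time()
cross_val_score(XGBClassifier(nthread=-1), X, y, cv=kfold,
                scoring='neg_log_loss', n_jobs=1)
print("Single-threaded CV, parallel XGBoost: %f seconds" % (time.time() - start))

# 3) Parallel cross validation and parallel XGBoost
start = time.time()
cross_val_score(XGBClassifier(nthread=-1), X, y, cv=kfold,
                scoring='neg_log_loss', n_jobs=-1)
print("Parallel CV and parallel XGBoost: %f seconds" % (time.time() - start))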

In the above code snippet you can see that the only configurable parameters required for multi-threading are nthread in XGBoost and n_jobs in cross validation. In the example I am using 1 thread for the single-threaded settings and -1 (all cores) for the parallel ones, but you can change these to match your number of cores. After running the cell, the best result is achieved by enabling multi-threading within XGBoost and not in cross-validation.

So if you are going to do cross validation with any library other than XGBoost 1.0.0, don't forget to enable the multi-threading feature in cross validation; and if you are going to use XGBoost (which you should!), don't forget to check the number of threads and enable multi-threading within XGBoost itself. For the complete solution, you can find my kernel at the following link: my kernel
Fork my kernel and start experimenting with it, and if you would like to learn more about XGBoost, follow the tutorials of this amazing guy: Jason Brownlee PhD. Till then: go chase your dreams, have an awesome day, make every second count and see you later in my next post.


