
LightGBM and Kaggle's Mercari Price Suggestion Challenge


Another post starts with you beautiful people!
I hope you enjoyed and learned something from my previous two posts about real-world machine learning problems on Kaggle.
As I said earlier, Kaggle is a great platform to apply your machine learning skills and enhance your knowledge; today I will again share what I learned there with all of you!

In this post we will work on an online machine learning competition where we need to predict the price of products for Japan’s biggest community-powered shopping app. The main attraction of this challenge is that it is a Kernels-only competition: the datasets can be downloaded only in stage 1. In the final stage they will be available only in Kernels.

What kind of problem is this? Since our goal is to predict the price (which is a number), it will be a regression problem.

Data: You can see the datasets here

Exploring the datasets: The datasets are provided as zipped 'tsv' (tab-separated values) files. So how can we read such data? Pandas has the answer!

#loading the dataset
Here I have used 'c' as the engine parameter value because the 'c' engine is faster than the python engine, and in a competition speed really matters.
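The loading step can be sketched roughly as follows. This is a minimal example, not the original snippet: the tiny in-memory sample and its column values are made up so the code runs anywhere, and on Kaggle you would pass the path of the real file instead.

```python
import io
import pandas as pd

# Made-up stand-in for the training file; the real data would be read from disk.
sample_tsv = (
    "train_id\tname\tcategory_name\tprice\n"
    "0\tRed shirt\tMen/Tops/T-shirts\t10.0\n"
    "1\tBlue lamp\tHome/Decor/Lamps\t25.0\n"
)

# sep="\t" marks the file as tab-separated; engine="c" selects the faster C parser.
train = pd.read_csv(io.StringIO(sample_tsv), sep="\t", engine="c")

# A quick peek at the first rows and the overall shape.
print(train.head())
print(train.shape)
```

When the file is a '.zip' archive containing a single file, pandas can infer the compression from the extension, so the same call works with the archive path.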

#peek of the training dataset

The test dataset has all the columns of the training dataset except the target variable, 'price'.

#Checking for missing data
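A minimal sketch of such a check, using a made-up three-row frame (the column name 'brand_name' is an assumption, and the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in two columns.
train = pd.DataFrame({
    "name": ["Red shirt", "Blue lamp", "Old phone"],
    "category_name": ["Men/Tops/T-shirts", np.nan, "Electronics/Phones/Smartphones"],
    "brand_name": [np.nan, np.nan, "Apple"],
})

# isnull() flags every missing cell; sum() counts them per column.
missing_counts = train.isnull().sum()
print(missing_counts)
```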



From the above it is quite clear that two columns of the datasets have a lot of missing data, which cannot be ignored and should be handled with care. Our next step will be to fill the missing values.

Approach to handle category name: First we will look at category_name. If you think of any ecommerce app, you will notice that the category name looks almost the same in all of them. It follows the format Root Category/Category/Subcategory. The given dataset follows the same pattern, so we need to split the category and save each part in a separate column. A 'lambda function' is quite useful for this splitting, and I used one to apply my logic.

#splitting of category_name

Here I have not named any column 'category', because pandas already uses 'category' as the name of a data type and reusing it for a column can cause confusion.
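The splitting idea can be sketched like this. The helper 'split_category' and the new column names 'root_cat', 'sub_cat1' and 'sub_cat2' are hypothetical choices for illustration, and treating malformed values as 'missing' is an assumption rather than the original logic.

```python
import pandas as pd

def split_category(text):
    """Return [root, category, subcategory], or three 'missing' labels for absent or malformed values."""
    if isinstance(text, str):
        # Split on the first two slashes only, so any deeper levels stay in the third part.
        parts = text.split("/", 2)
        if len(parts) == 3:
            return parts
    return ["missing", "missing", "missing"]

df = pd.DataFrame({"category_name": ["Men/Tops/T-shirts", None, "Home/Decor"]})

# Apply the helper through a lambda and expand the result into three new columns.
parts = pd.DataFrame(
    df["category_name"].apply(lambda x: split_category(x)).tolist(),
    columns=["root_cat", "sub_cat1", "sub_cat2"],
    index=df.index,
)
df = df.join(parts)
print(df)
```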

#filling missing values in categories
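One simple way to do the filling is sketched below. The placeholder value 'missing' and the column names 'root_cat' and 'brand_name' are assumptions made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "root_cat": ["Men", np.nan, "Home"],
    "brand_name": [np.nan, "Nike", np.nan],
})

# fillna accepts a dict that maps each column to its replacement value.
df = df.fillna({"root_cat": "missing", "brand_name": "missing"})
print(df)
```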

Since most machine learning models do not accept categorical variables directly, we need to convert them to numeric values or to the pandas category type.
#converting categorical variables into pandas category data type
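A short sketch of the conversion, with a made-up frame and the assumed column name 'root_cat':

```python
import pandas as pd

df = pd.DataFrame({
    "root_cat": ["Men", "Home", "Men"],
    "item_condition_id": [1, 3, 1],
})

# astype("category") stores each distinct value once and keeps integer codes per row.
for col in ["root_cat", "item_condition_id"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
print(df["root_cat"].cat.codes.tolist())
```

The integer codes behind a category column are what make it cheap to store and easy to hand to models that understand categorical features.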
In this problem our target variable is 'price'. When I analyzed this column, I found some products with a price of zero, but not many of them. So I decided to remove the zero-priced products from the training dataset.
#remove zero priced products
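A minimal sketch of this filter on made-up data:

```python
import pandas as pd

train = pd.DataFrame({
    "name": ["Red shirt", "Free sticker", "Blue lamp"],
    "price": [10.0, 0.0, 25.0],
})

# Keep only rows with a strictly positive price, then renumber the index.
train = train[train["price"] > 0].reset_index(drop=True)
print(train)
```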

#combine the datasets and separate the target variables
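The idea here is to set the target aside and stack train and test so that every later transformation sees the same vocabulary and categories. A rough sketch with made-up rows (the variable names 'n_train', 'y_train' and 'combined' are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({"name": ["Red shirt", "Blue lamp"], "price": [10.0, 25.0]})
test = pd.DataFrame({"name": ["Old phone"]})

# Remember where the training rows end so the stacked frame can be split again later.
n_train = len(train)

# Separate the target, then stack the remaining columns of both datasets.
y_train = train["price"]
combined = pd.concat([train.drop(columns="price"), test], ignore_index=True)
print(combined)
```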

Next, one of the most important steps is to handle the text in the name, category, brand name and description of the products. To deal with this we have a powerful package: sklearn. Using this package we first work on the name and category columns and convert them to a matrix of token counts, which gives us a sparse representation of the counts.
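Token counting of this kind is what sklearn's CountVectorizer does. A small sketch on made-up product names (the 'min_df=1' setting is only an illustrative choice):

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["red cotton shirt", "blue cotton shirt", "desk lamp"]

# fit_transform learns the vocabulary and returns a sparse document-term count matrix.
cv = CountVectorizer(min_df=1)
X_name = cv.fit_transform(names)

print(X_name.shape)
print(sorted(cv.vocabulary_))
```

Only the non-zero counts are stored, which matters a lot when there are hundreds of thousands of products and a large vocabulary.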

For handling the description column we will convert it to a matrix of TF-IDF features with TfidfVectorizer, which is equivalent to CountVectorizer followed by TfidfTransformer.
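As a quick illustration, the sketch below builds TF-IDF features for a few made-up descriptions and checks that TfidfVectorizer gives the same result as CountVectorizer followed by TfidfTransformer when both use default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

descriptions = [
    "brand new never worn",
    "used but in good condition",
    "new in box with charger",
]

# One step: raw text straight to TF-IDF weights.
X_desc = TfidfVectorizer().fit_transform(descriptions)

# Two steps: token counts first, then TF-IDF re-weighting of those counts.
counts = CountVectorizer().fit_transform(descriptions)
X_two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(X_desc.toarray(), X_two_step.toarray())
print(X_desc.shape, same)
```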

To deal with the brand name we will convert the multi-class labels to binary labels in a one-vs-all fashion.
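This one-vs-all encoding matches what sklearn's LabelBinarizer does. A small sketch with made-up brand values, using 'sparse_output=True' so the result stays sparse:

```python
from sklearn.preprocessing import LabelBinarizer

brands = ["Nike", "missing", "Apple", "Nike"]

# Each distinct brand becomes one binary column; each row has a single 1.
lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(brands)

print(list(lb.classes_))
print(X_brand.toarray())
```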

Next, we will convert the categorical variables item_condition_id and shipping into dummy/indicator variables and then merge them. For efficient merging we will use a Compressed Sparse Row (CSR) matrix.
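A minimal sketch of this step, assuming pandas' get_dummies for the indicator columns and scipy's csr_matrix for the sparse conversion (the sample values are made up):

```python
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({"item_condition_id": [1, 3, 1], "shipping": [0, 1, 1]})

# One indicator column per distinct value of each listed column.
dummies = pd.get_dummies(df, columns=["item_condition_id", "shipping"])

# Convert the dense indicator table into a CSR matrix.
X_dummies = csr_matrix(dummies.to_numpy(dtype="float64"))

print(list(dummies.columns))
print(X_dummies.shape)
```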

Finally, now that all the variables are cleaned, we will stack the sparse matrices horizontally (column-wise) and convert the result to Compressed Sparse Row format.
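The stacking can be sketched with scipy.sparse.hstack. The three small matrices below are made-up stand-ins for the name, brand and dummy features, and 'n_train' is a hypothetical count of training rows used to split the stacked matrix back into train and test parts.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Made-up feature blocks with the same number of rows (2 train rows + 1 test row).
X_name = csr_matrix(np.array([[1, 0], [0, 2], [1, 1]]))
X_brand = csr_matrix(np.array([[1], [0], [1]]))
X_dummies = csr_matrix(np.array([[0, 1], [1, 0], [1, 0]]))

# hstack joins the blocks column-wise; tocsr() makes row slicing efficient.
X_all = hstack((X_name, X_brand, X_dummies)).tocsr()

n_train = 2
X_train, X_test = X_all[:n_train], X_all[n_train:]
print(X_all.shape, X_train.shape, X_test.shape)
```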

Now we have cleaned data and we are ready for modeling.
For fitting our model I have used sklearn.linear_model.Ridge.
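A rough sketch of fitting Ridge on a sparse matrix. The random data, the weights behind the target and 'alpha=0.1' are all made up for the example and are not the settings used in the submission.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge

# Synthetic sparse features and a target that is a linear function of them.
rng = np.random.RandomState(0)
X = csr_matrix(rng.rand(50, 5))
true_w = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
y = X @ true_w

# alpha controls the strength of the L2 penalty on the coefficients.
model = Ridge(alpha=0.1)
model.fit(X, y)

preds = model.predict(X)
print(preds[:3])
print(model.score(X, y))
```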

Next, we will split the training dataset and hold out a validation part, so that we can check the model on data it has not seen and notice overfitting.
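A small sketch of such a split with sklearn's train_test_split, which also accepts sparse matrices. The 80/20 ratio and the 'random_state' value are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

X = csr_matrix(np.arange(20).reshape(10, 2))
y = np.arange(10)

# Hold out 20% of the rows for validation; random_state makes the split repeatable.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_val.shape)
```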

The most important part of modeling is the training, and for this I have chosen LightGBM, a fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM or MART) framework.

For more details of this framework please read the official LightGBM documentation.

With the above approach I submitted my result to Kaggle and found myself in the top 16%.

So what I have learnt from various competitions is that obtaining a very good score and ranking depends on two things: first, the EDA of the data, and second, the machine learning model with fine parameter tuning.

For parameter tuning I found a very good article here: lightgbm parameter tuning

If you are interested, you can find the whole code here: submission-to-mercari-price-suggestion-challenge.
I suggest you download the code, analyze the data more, do some parameter tuning and improve the score further.

Meanwhile Friends! Go chase your dreams, have an awesome day, make every second count and see you later in my next post.
