
Predicting user clicks using XGBoost!

Another post starts with you beautiful people!
I hope you are enjoying our machine learning journey. After working through many real-world problems in earlier posts, you now know that with this skill you can make the world a better place to live!
Continuing our journey, today we are going to analyze a problem from China's largest big data service platform, known as TalkingData.
About The Problem-
TalkingData covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Yes, you read that right! 90% of clicks are fraud, and this causes them unnecessary server load.
Our Challenge-
As data scientists, our task is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
Data-
To support our modeling, TalkingData has provided a generous dataset covering approximately 200 million clicks over 4 days, which you can download/see from here- Dataset!

From the challenge description it may seem a little confusing how a click prediction can tell us whether a user is fraudulent. But if we can predict the click pattern, TalkingData can use our model to decide whether a click is coming from a valid IP or not. So let's focus on predicting the click.
Since this is a Big Data problem, be ready to deal with large files. How can we find the size of the data? Not a big deal; we can achieve this as below:-
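A minimal sketch of that check (the helper name `list_file_sizes` is mine, and the `'../input'` path assumes the Kaggle layout) could look like this:

```python
import os

def list_file_sizes(data_dir):
    """Return {filename: size in MB} for every file in data_dir."""
    sizes = {}
    for name in os.listdir(data_dir):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):
            sizes[name] = os.path.getsize(path) / (1024 ** 2)
    return sizes

# On Kaggle the competition files live under '../input';
# point this at your own download folder when running locally.
if os.path.isdir('../input'):
    for name, mb in sorted(list_file_sizes('../input').items()):
        print(f'{name}: {mb:.1f} MB')
```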

Here '../input' is the directory where all the data resides in Kaggle; if you have downloaded the data to your local machine, just change the path and run the code in your notebook. In my notebook it shows the following result-
Except for train_sample.csv, all the files are very big! The smaller dataset is provided in case you don't want to download the full dataset.
See, the data is huge, and if you try to load it all on your local machine it will cause memory problems. So is there any library that can help us get a glimpse of the statistics? Yes, Python provides a module named subprocess, and we can use it as below:
The above code tells us that there are about 185 million rows in the training dataset and 19 million rows in the test dataset. It's not easy to work with all those rows, so we will work on the first 1 million rows of the dataset.
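One common way to do this with subprocess is to shell out to `wc -l`, which counts rows without ever loading the file into memory (the `count_rows` helper name is my own):

```python
import subprocess

def count_rows(path):
    """Count rows in a CSV without loading it into memory, using `wc -l`."""
    result = subprocess.run(['wc', '-l', path],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.split()[0])

# Example (paths assume the Kaggle '../input' layout):
# print(count_rows('../input/train.csv'))
# print(count_rows('../input/test.csv'))
```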
Let's load the datasets:-
Let's see the content of the datasets:-
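A sketch of both steps, loading only a sample and previewing it (the `load_sample` wrapper is my own; the paths assume the Kaggle layout):

```python
import pandas as pd

def load_sample(path, nrows=1_000_000):
    """Read only the first `nrows` rows so the file fits in memory,
    parsing click_time as a proper datetime."""
    return pd.read_csv(path, nrows=nrows, parse_dates=['click_time'])

# Adjust the paths if you downloaded the data locally:
# train = load_sample('../input/train.csv')
# test = load_sample('../input/test.csv')
# print(train.head())
# print(test.head())
```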


As per the TalkingData the above columns are described as below:-
  1. ip: ip address of click
  2. app: app id for marketing
  3. device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
  4. os: os version id of user mobile phone
  5. channel: channel id of mobile ad publisher
  6. click_time: timestamp of click (UTC)
  7. attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
  8. is_attributed: the target to be predicted, indicating whether the app was downloaded
If you look closely at the columns above, you will find that every column is numeric; it looks like the values are stored as IDs or in encoded form because of security concerns.
Let's explore each column and its unique values by plotting:-
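One way to get at those unique-value counts (the `unique_counts` helper and the toy frame are mine, just to illustrate the idea):

```python
import pandas as pd

def unique_counts(df, cols):
    """Number of distinct values in each feature column."""
    return df[cols].nunique()

# Toy frame shaped like the TalkingData columns, for illustration only:
toy = pd.DataFrame({'ip': [1, 1, 2], 'app': [3, 4, 3], 'device': [1, 1, 1]})
print(unique_counts(toy, ['ip', 'app', 'device']))
# On the real data you could plot the same counts as a bar chart:
# unique_counts(train, ['ip', 'app', 'device', 'os', 'channel']).plot(kind='bar')
```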


Since all attributes are numeric IDs, feature engineering is tough; but if you can do it, it will surely improve your model. Let's start the real work-
Loading the dataset into pandas dataframes:-
Since 'click_time' is a timestamp, let's write a function to convert it into useful pandas series:-
Let's apply this function to our train and test data:-
Next, separate the target attribute and drop it from the training dataset, along with any other unwanted columns:-
Next, separate the click_id from the test dataset, since this attribute will be needed in the submission file:-
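The steps above can be sketched as follows. The day/hour features in `add_time_features` are my choice of time-derived columns, not necessarily the notebook's exact ones:

```python
import pandas as pd

def add_time_features(df):
    """Expand the click_time timestamp into day and hour columns,
    then drop the raw timestamp."""
    ts = pd.to_datetime(df['click_time'])
    df = df.copy()
    df['day'] = ts.dt.day
    df['hour'] = ts.dt.hour
    return df.drop(columns=['click_time'])

# Putting the preparation steps together (assuming train/test are loaded):
# train = add_time_features(train)
# test = add_time_features(test)
# y = train['is_attributed']                              # target
# X = train.drop(columns=['is_attributed', 'attributed_time'])
# click_ids = test['click_id']                            # kept for submission
# X_test = test.drop(columns=['click_id'])
```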
That's it; our data is now ready for any machine learning model. I am going to apply XGBoost.
Why XGBoost? Because it is efficient, accurate, feasible, and a proven winner in past competitions.
Always remember that XGBoost is powerful only when you apply the right parameters, and the rule of thumb for finding the right parameters is to try again and again with different values!

Let's apply our XGBoost model to the split data:-

And finally, make predictions with the model and submit them as per the competition rules:-

With this approach I was able to get a score of 0.9487, while the top score is 0.9709, which shows there is a lot of room for improvement and research for you guys! So don't wait: visit my complete notebook HERE and practice it in your kernel.

With that said my friends- Go chase your dreams, have an awesome day, make every second count and see you later in my next post.







