
Predicting user clicks using XGBoost!

Another post starts with you beautiful people!
I hope you are enjoying our machine learning journey! After becoming familiar with many real-world problems in the earlier posts, you now know that with this skill you can make the world a better place to live!
To continue our journey, today we are going to analyze a problem from China's largest Big Data service platform, known as TalkingData.
About The Problem-
TalkingData covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Yes, you read it right! 90% of clicks are fraudulent, and they cause unnecessary server load.
Our Challenge-
As a data scientist our task is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
Data-
To support our modeling, TalkingData has provided a generous dataset covering approximately 200 million clicks over 4 days, which you can download/see from here- Dataset!

From the challenge it may be a little confusing how a click prediction can tell us whether a user is fraudulent or not! But if we can predict the click pattern, TalkingData can use our model to decide whether a click is coming from a valid IP or not. So let's focus on predicting the click.
Since this is a Big Data problem, be ready to deal with big files. How can we find the size of the data? Not a big deal, we can achieve this as below:-
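Since the original snippet is not visible here, a minimal sketch of how you could list the file sizes (the `file_sizes_mb` helper name is my own; on Kaggle the competition files live under '../input'):

```python
import os

def file_sizes_mb(data_dir):
    """Return a {filename: size in MB} mapping for every file in data_dir."""
    return {name: os.path.getsize(os.path.join(data_dir, name)) / (1024 ** 2)
            for name in os.listdir(data_dir)}

# On Kaggle the competition files live under '../input'
if os.path.isdir('../input'):
    for name, size in sorted(file_sizes_mb('../input').items()):
        print(f'{name}: {size:,.1f} MB')
```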

Here '../input' is the directory where all the data resides on Kaggle; if you have downloaded the data to your local machine, just change the path and run the code in your notebook. In my notebook it shows the following result-
Except for train_sample.csv, all the files are very big! The smaller dataset is provided in case you don't want to download the full dataset.
See, the data is huge, and if you try to load all of it on your local machine, it will cause memory problems. So is there any library which can help us get a glimpse of the statistics? Yes, Python provides a module named subprocess, and we can use it as below:
The above code reveals that there are 185 million rows in the training dataset and 19 million rows in the test dataset. It's not easy to work on all the rows, so we will work on the first 1 million rows of the dataset.
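A sketch of the row-counting trick with subprocess, calling the shell's `wc -l` so the file never has to fit in memory (`count_lines` is an illustrative helper name):

```python
import subprocess

def count_lines(path):
    """Count rows with the shell's `wc -l`, without loading the file into memory."""
    result = subprocess.run(['wc', '-l', path],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.split()[0])

# count_lines('../input/train.csv')  # roughly 185 million rows (one is the header)
```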
Let's load the datasets:-
Let's see the content of the datasets:-
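A possible loading step. The compact integer dtypes are my assumption to keep memory low (every feature is a small integer ID), and `load_clicks` is an illustrative helper, not the original code:

```python
import pandas as pd

# Compact dtypes: every feature is an integer ID, so small unsigned ints suffice
DTYPES = {'ip': 'uint32', 'app': 'uint16', 'device': 'uint16',
          'os': 'uint16', 'channel': 'uint16', 'is_attributed': 'uint8'}

def load_clicks(path, nrows=None):
    """Read a TalkingData CSV with memory-friendly dtypes; dtype keys absent
    from the file (e.g. is_attributed in test.csv) are simply ignored."""
    return pd.read_csv(path, nrows=nrows, dtype=DTYPES, parse_dates=['click_time'])

# train = load_clicks('../input/train.csv', nrows=1_000_000)  # first 1M rows
# test  = load_clicks('../input/test.csv')
# print(train.head()); print(test.head())
```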


According to TalkingData, the above columns are described as below:-
  1. ip: ip address of click
  2. app: app id for marketing
  3. device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
  4. os: os version id of user mobile phone
  5. channel: channel id of mobile ad publisher
  6. click_time: timestamp of click (UTC)
  7. attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
  8. is_attributed: the target that is to be predicted, indicating the app was downloaded
If you ponder the above columns, you will find that every column is numeric; it looks like the values are stored as IDs or in encoded form because of security concerns.
Let's explore each column and their unique values by plotting:-
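One way to compute the unique counts behind such a plot (the helper name is mine; the plotting lines are commented out since they need a display):

```python
def unique_counts(df, cols=('ip', 'app', 'device', 'os', 'channel')):
    """Number of distinct values in each categorical column."""
    return df[list(cols)].nunique()

# counts = unique_counts(train)
# import matplotlib.pyplot as plt
# counts.plot(kind='bar', logy=True, title='Unique values per feature')
# plt.show()
```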


Since all attributes are in encoded digit form, it is very tough to do feature engineering, but if you can, it will surely improve your model. Let's start the real work-
Loading the dataset into pandas dataframes:-
Since 'click_time' is a timestamp, let's write a function to convert it into pandas series:-
Let's apply this function to our train and test data:-
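A sketch of such a conversion function; extracting day/hour/minute/second columns is a common choice for this competition, though not necessarily the original author's exact features:

```python
import pandas as pd

def add_time_features(df):
    """Expand click_time into day/hour/minute/second series and drop the original."""
    ts = pd.to_datetime(df['click_time'])
    df = df.assign(day=ts.dt.day.astype('uint8'),
                   hour=ts.dt.hour.astype('uint8'),
                   minute=ts.dt.minute.astype('uint8'),
                   second=ts.dt.second.astype('uint8'))
    return df.drop(columns=['click_time'])

# train = add_time_features(train)
# test = add_time_features(test)
```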
Next, separate the target attribute and drop it from the train dataset along with any other unwanted columns:-
Next, separate the click_id from the test dataset, since this attribute will be needed in the submission format:-
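The two separation steps might look like this (the helper names are illustrative):

```python
def split_target(train):
    """Pull out the label and drop columns that would leak it."""
    y = train['is_attributed']
    X = train.drop(columns=['is_attributed', 'attributed_time'], errors='ignore')
    return X, y

def split_click_id(test):
    """Keep click_id aside; it is only needed for the submission file."""
    return test.drop(columns=['click_id']), test['click_id']

# X, y = split_target(train)
# X_test, click_ids = split_click_id(test)
```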
That's it! Our data is now ready for applying any machine learning model. I am going to apply XGBoost.
Why XGBoost? Because it is efficient, accurate, feasible and a proven winner in past competitions.
Always remember, XGBoost is powerful only when you apply the right parameters with it, and the thumb rule for getting the right parameters is to try again and again with different values!
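For reference, a starting parameter set might look like the one below; these are illustrative defaults to tune from, not the values behind the score reported later:

```python
# Illustrative starting values for XGBoost -- tune from here
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',        # the competition is scored on ROC AUC
    'eta': 0.3,                  # learning rate
    'max_depth': 4,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'scale_pos_weight': 99,      # downloads are rare, so upweight the positive class
    'seed': 42,
}
```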

Let's apply our XGBoost model to the split data:-

And finally, generate predictions and submit them as per the competition rules:-

With this approach I am able to get a score of 0.9487, while the top score is 0.9709, which indicates there is a lot of room for improvement and research for you guys! So don't wait: visit my complete notebook HERE and practice it in your kernel.

With that said my friends- Go chase your dreams, have an awesome day, make every second count and see you later in my next post.






