
Predicting user clicks using XGBoost!

Another post starts with you beautiful people!
I hope you are enjoying our machine learning journey. After working through the real-world problems in earlier posts, you know that with this skill you can make the world a better place to live!
To continue our journey, today we are going to analyze a problem from China's largest Big Data service platform, known as TalkingData.
About The Problem-
TalkingData covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Yes, you read that right! 90% of clicks are potentially fraudulent, and that puts unnecessary load on their servers.
Our Challenge-
As data scientists, our task is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
Data-
To support our modeling, TalkingData has provided a generous dataset covering approximately 200 million clicks over 4 days, which you can download or browse here: Dataset!

From the challenge it is a little confusing how a download prediction can tell us whether a click is fraudulent or not! But if we can predict the click pattern, TalkingData can use our model to decide whether a click is coming from a valid IP. So let's focus on predicting whether a click leads to a download.
Since this is a Big Data problem, be ready to deal with very large files. How can we find the size of the data? Not a big deal, we can achieve this as below:-
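
Here is a minimal sketch of one way to list the data files with their sizes, using only the standard library. The helper name `list_file_sizes` is my own for illustration, and '../input' assumes the Kaggle directory layout:

```python
import os


def list_file_sizes(data_dir):
    """Return {filename: size in MB} for every regular file in data_dir."""
    sizes = {}
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):
            # os.path.getsize returns bytes; convert to megabytes
            sizes[name] = round(os.path.getsize(path) / 1e6, 2)
    return sizes


data_dir = "../input"  # assumption: Kaggle layout; change this for a local copy
if os.path.isdir(data_dir):
    for name, size_mb in list_file_sizes(data_dir).items():
        print(f"{name:30s} {size_mb} MB")
```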

Here '../input' is the directory where all the data resides on Kaggle; if you have downloaded the data to your local machine, just change the path and run the code in your notebook. In my notebook it shows the following result:-
Except for train_sample.csv, all the files are very big! The smaller dataset is provided in case you don't want to download the full dataset.
See, the data is huge, and if you try to load all of it on your local machine, it will cause memory problems. So is there any library which can give us a glimpse of the statistics? Yes, Python provides a module named subprocess, and we can use it as below:
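
A small sketch of the idea, assuming a Unix-like system where the `wc` command is available (as it is on Kaggle). The helper name `count_lines` is hypothetical; it counts lines without loading the file into memory:

```python
import subprocess


def count_lines(path):
    """Count the lines of a file by calling `wc -l` (Unix-like systems only)."""
    result = subprocess.run(
        ["wc", "-l", path], capture_output=True, text=True, check=True
    )
    # wc prints "<count> <path>"; keep only the count
    return int(result.stdout.split()[0])


# Example usage (paths assume the Kaggle layout):
# print(count_lines("../input/train.csv"))
# print(count_lines("../input/test.csv"))
```

Note that the count includes the header line of each CSV file.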
The above line of code reports about 185 million rows in the training dataset and 19 million rows in the test dataset. It's not easy to work on all of those rows, so we will work on the first 1 million rows of the dataset.
Let's load the datasets:-
Let's see the content of the datasets:-


As per TalkingData, the above columns are described as below:-
  1. ip: ip address of click
  2. app: app id for marketing
  3. device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
  4. os: os version id of user mobile phone
  5. channel: channel id of mobile ad publisher
  6. click_time: timestamp of click (UTC)
  7. attributed_time: if the user downloaded the app after clicking an ad, this is the time of the app download
  8. is_attributed: the target that is to be predicted, indicating the app was downloaded
If you ponder the above columns, you will find that every column holds numbers; it looks like the values are stored as IDs or in encoded form because of security concerns.
Let's explore each column and their unique values by plotting:-
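
As a rough sketch, the number of distinct values per column can be computed with pandas before plotting. The tiny `sample` frame below is made up purely for illustration, and `unique_counts` is a hypothetical helper name:

```python
import pandas as pd


def unique_counts(df, columns=("ip", "app", "device", "os", "channel")):
    """Return the number of distinct values in each listed column, largest first."""
    return df[list(columns)].nunique().sort_values(ascending=False)


# Made-up rows, only to show the shape of the result
sample = pd.DataFrame({
    "ip": [1, 2, 3, 1],
    "app": [3, 3, 9, 9],
    "device": [1, 1, 1, 1],
    "os": [13, 19, 13, 13],
    "channel": [379, 379, 245, 245],
})

counts = unique_counts(sample)
print(counts)
# With matplotlib installed, counts.plot.bar(logy=True) would chart these values.
```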


Since all attributes are in numeric form, it is very tough to do feature engineering, but if you can, it will surely improve your model. Let's start the real work:-
Loading the dataset into pandas dataframes:-
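
A minimal sketch of loading only the first 1 million rows with pandas. The compact dtypes are my own assumption to save memory (they are not from the original notebook), and `load_clicks` is a hypothetical helper name:

```python
import pandas as pd

# Assumed compact dtypes for the ID columns; adjust if your values do not fit
DTYPES = {
    "ip": "uint32",
    "app": "uint16",
    "device": "uint16",
    "os": "uint16",
    "channel": "uint16",
    "is_attributed": "uint8",
    "click_id": "uint32",
}


def load_clicks(path, nrows=1_000_000):
    """Read the first `nrows` rows of a click log CSV into a DataFrame."""
    # Read only the header first so we pass dtypes for columns that exist
    header = pd.read_csv(path, nrows=0).columns
    dtypes = {col: typ for col, typ in DTYPES.items() if col in header}
    return pd.read_csv(path, nrows=nrows, dtype=dtypes, parse_dates=["click_time"])


# Example usage (paths assume the Kaggle layout):
# train = load_clicks("../input/train.csv")
# test = load_clicks("../input/test.csv")
```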
Since 'click_time' is a timestamp, let's write a function to convert it to a pandas datetime series that a model can use:-
Let's apply this function to our train and test data:-
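
One plausible version of such a function is sketched below. The exact features in the original notebook are not shown here, so the choice of day, hour, minute and second is my assumption, and `add_time_features` is a hypothetical name:

```python
import pandas as pd


def add_time_features(df):
    """Replace 'click_time' with integer day/hour/minute/second columns."""
    out = df.copy()
    click_time = pd.to_datetime(out["click_time"])
    out["day"] = click_time.dt.day.astype("uint8")
    out["hour"] = click_time.dt.hour.astype("uint8")
    out["minute"] = click_time.dt.minute.astype("uint8")
    out["second"] = click_time.dt.second.astype("uint8")
    # The raw timestamp is dropped because the model needs numeric inputs
    return out.drop(columns=["click_time"])


# Example usage on both frames:
# train = add_time_features(train)
# test = add_time_features(test)
```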
Next, separate the target attribute and drop it from the train dataset along with any other unwanted columns:-
Next, separate the click_id from the test dataset, since this attribute will be needed in the submission format:-
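
Both steps can be sketched as below. I am assuming 'attributed_time' is the other unwanted column, since it is only filled when a download happened and it does not exist in the test data; the helper names are hypothetical:

```python
import pandas as pd


def split_train(train):
    """Return (features, target) from the training frame."""
    y = train["is_attributed"]
    # errors="ignore" keeps this working if 'attributed_time' was already dropped
    X = train.drop(columns=["is_attributed", "attributed_time"], errors="ignore")
    return X, y


def split_test(test):
    """Return (features, click_ids) from the test frame."""
    click_ids = test["click_id"]
    X_test = test.drop(columns=["click_id"])
    return X_test, click_ids
```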
That's it, our data is now ready for any machine learning model. I am going to apply XGBoost.
Why XGBoost? Because it is efficient, accurate, feasible, and a proven winner in past competitions.
Always remember that XGBoost is powerful only when you give it the right parameters, and the rule of thumb for finding the right parameters is to try again and again with different values!

Let's apply our XGBoost model to the split data:-

And finally, predict with the model and prepare the submission as per the competition rules:-
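
The submission file needs two columns, click_id and is_attributed. A small sketch of writing it is shown below; `write_submission` is a hypothetical helper, and `preds` is assumed to be the array of predicted probabilities from the trained model:

```python
import pandas as pd


def write_submission(click_ids, preds, path="submission.csv"):
    """Write click_id / is_attributed pairs to a CSV file and return the frame."""
    submission = pd.DataFrame({
        "click_id": list(click_ids),
        "is_attributed": list(preds),
    })
    submission.to_csv(path, index=False)
    return submission


# Example usage, assuming `model`, `X_test` and `click_ids` from the earlier steps:
# preds = model.predict(xgb.DMatrix(X_test))
# write_submission(click_ids, preds)
```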

With this approach I was able to get a score of 0.9487, while the top score is 0.9709, which shows there is a lot of room for improvement and research for you guys to apply! So don't wait, visit my complete notebook HERE and practice it in your own kernel.

With that said, my friends: go chase your dreams, have an awesome day, make every second count, and see you later in my next post.






