Predicting user clicks using XGBoost!

Another post starts with you beautiful people!
I hope you are enjoying our machine learning journey. Having worked through many real world problems in the earlier posts, you now know that with this skill you can make the world a better place to live!
To continue our journey, today we are going to analyze a problem from China's largest Big Data service platform, known as TalkingData.
About The Problem-
TalkingData covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Yes, you read that right! 90% of the clicks are potentially fraudulent, and they cause unnecessary server load.
Our Challenge-
As a data scientist our task is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
Data-
To support our modeling, TalkingData has provided a generous dataset covering approximately 200 million clicks over 4 days which you can download/see from here- Dataset!

From the challenge it is a little confusing how predicting downloads can tell us whether a click is fraudulent or not! But if we can predict which clicks end in a download, TalkingData can use our model to decide whether a click is coming from a valid ip or not. So let's focus on that prediction.
Since this is a Big Data problem, be ready to deal with large files. How can we find the size of the data? Not a big deal, we can achieve this as below:-
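The original code cell is not reproduced here, so what follows is only a minimal sketch of one way to list file sizes with the standard library. The helper names `list_file_sizes` and `print_file_sizes` are made up for illustration.

```python
import os

def list_file_sizes(directory):
    """Return {filename: size in MB} for every regular file in `directory`."""
    sizes = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            sizes[name] = os.path.getsize(path) / 1e6  # bytes -> megabytes
    return sizes

def print_file_sizes(directory):
    """Print one aligned line per file: name followed by its size in MB."""
    for name, size_mb in list_file_sizes(directory).items():
        print(f"{name:<30}{size_mb:>12.2f} MB")

# Example call (uncomment where the data folder exists):
# print_file_sizes("../input")
```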

Here '../input' is the directory where all the data resides in Kaggle; if you have downloaded the data to your local machine, just change the path and run the code in your notebook. In my notebook it shows the following result-
Except for train_sample.csv, all the files are very big! The smaller dataset is provided in case you don't want to download the full dataset.
See, the data is huge, and if you try to load it on your local machine it can cause memory problems. So is there any library which can help us get a glimpse of the statistics without loading everything? Yes, Python provides a module named subprocess, and we can use it as below:
The above line of code returns that there are about 185 million rows in the training dataset and 19 million rows in the test dataset. It's not easy to work on all the rows, so we will work on the first 1 million rows of the dataset.
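The exact command from the notebook is not shown here. One common approach, sketched below as an assumption, is to ask the `wc -l` shell utility for the line count through subprocess, so the file is never loaded into Python memory. This needs a Unix-like system, and `count_lines` is a hypothetical helper name.

```python
import subprocess

def count_lines(path):
    """Count the lines of a file with `wc -l` instead of reading it in Python."""
    result = subprocess.run(
        ["wc", "-l", path], capture_output=True, text=True, check=True
    )
    # wc prints "<count> <path>", so the first token is the line count
    return int(result.stdout.split()[0])

# Example call (uncomment where the file exists):
# print(count_lines("../input/train.csv"))
```

Keep in mind that the count includes the CSV header line.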
Let's load the datasets:-
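As a rough sketch of loading only the first rows, pandas' `read_csv` accepts an `nrows` argument. The helper name `load_clicks` and the tiny in-memory sample are made up so the snippet runs without the real files; with the competition data you would pass a path such as '../input/train.csv' instead.

```python
import io
import pandas as pd

def load_clicks(source, nrows=1_000_000):
    """Read at most `nrows` rows from a CSV path or file-like object."""
    return pd.read_csv(source, nrows=nrows)

# Made-up sample with the same columns as the training file
sample_csv = io.StringIO(
    "ip,app,device,os,channel,click_time,attributed_time,is_attributed\n"
    "1,3,1,13,379,2017-11-06 14:32:21,,0\n"
    "2,3,1,19,379,2017-11-06 14:33:34,,0\n"
    "3,9,1,13,244,2017-11-06 14:34:12,2017-11-06 14:40:00,1\n"
)
train = load_clicks(sample_csv, nrows=2)  # only the first 2 data rows are read
print(train.shape)  # (2, 8)
```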
Let's see the content of the datasets:-


As per TalkingData, the above columns are described as below:-
  1. ip: ip address of click
  2. app: app id for marketing
  3. device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
  4. os: os version id of user mobile phone
  5. channel: channel id of mobile ad publisher
  6. click_time: timestamp of click (UTC)
  7. attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
  8. is_attributed: the target that is to be predicted, indicating the app was downloaded
If you ponder the above columns, you will find that, apart from the two timestamps, every column holds numbers; it looks like the values are stored as ids or in encoded form because of security concerns.
Let's explore each column and their unique values by plotting:-
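The original plot is not reproduced here. As a small sketch of the numbers such a plot would show, pandas' `nunique` counts the distinct values per column. The sample frame is made up, and `unique_counts` is a hypothetical helper name.

```python
import pandas as pd

def unique_counts(df, columns):
    """Number of distinct values in each of the given columns."""
    return df[columns].nunique()

# Made-up sample standing in for the first rows of the training data
sample = pd.DataFrame({
    "ip": [1, 1, 2, 3],
    "app": [3, 3, 3, 9],
    "device": [1, 1, 1, 1],
    "os": [13, 19, 13, 13],
    "channel": [379, 379, 244, 244],
})
uniques = unique_counts(sample, ["ip", "app", "device", "os", "channel"])
print(uniques)

# With matplotlib installed, the same counts can be drawn as a bar chart:
# uniques.plot(kind="bar", logy=True)
```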


Since all attributes are encoded as numbers, it is very tough to do feature engineering, but if you can, it will likely improve your model. Let's start the real work-
Loading the dataset into pandas dataframes:-
Since 'click_time' is a date stamp, let's write a function to convert it with pandas:-
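The author's function is not shown here, so the sketch below is only one possible version. It assumes we want plain numeric columns (day, hour, minute, second) in place of the raw timestamp; the name `add_time_features` and the choice of features are my assumptions.

```python
import pandas as pd

def add_time_features(df):
    """Return a copy of df where click_time is replaced by numeric time columns."""
    out = df.copy()
    ts = pd.to_datetime(out["click_time"])  # string -> datetime64 Series
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    out["minute"] = ts.dt.minute
    out["second"] = ts.dt.second
    return out.drop(columns=["click_time"])

# Made-up rows; the same call works for both the train and the test frame
clicks = pd.DataFrame({
    "ip": [1, 2],
    "click_time": ["2017-11-06 14:32:21", "2017-11-07 03:05:09"],
})
print(add_time_features(clicks))
```

Working on a copy keeps the original frame untouched, which is handy when you want to try different time features later.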
Let's apply this function to our train and test data:-
Next, separate the target attribute and drop it from the train dataset along with any other unwanted columns:-
Next separate the click_id from the test dataset since this attribute will be needed in the submission format:-
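The two separation steps above can be sketched as below. I am assuming that 'attributed_time' is the unwanted column: it is filled only when a download happened, so keeping it would leak the target. The helper names and sample frames are made up.

```python
import pandas as pd

def split_target(train, target="is_attributed", unwanted=("attributed_time",)):
    """Return (features, target), dropping the target and any unwanted columns."""
    y = train[target]
    X = train.drop(columns=[target, *unwanted], errors="ignore")
    return X, y

def split_click_id(test, id_column="click_id"):
    """Return (ids, features) so the ids can be attached to the predictions later."""
    ids = test[id_column]
    return ids, test.drop(columns=[id_column])

# Made-up frames with a subset of the real columns
train = pd.DataFrame({
    "ip": [1, 2],
    "app": [3, 9],
    "attributed_time": [None, "2017-11-06 14:40:00"],
    "is_attributed": [0, 1],
})
test = pd.DataFrame({"click_id": [0, 1], "ip": [5, 6], "app": [3, 3]})

X, y = split_target(train)
ids, X_test = split_click_id(test)
print(list(X.columns), list(X_test.columns))  # ['ip', 'app'] ['ip', 'app']
```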
That's it, our data is now ready for applying any machine learning model. I am going to apply XGBoost.
Why xgboost? Because it is efficient, accurate, and a proven winner in past competitions.
Always remember that xgboost is powerful only when you apply the right parameters, and the rule of thumb for getting the right parameters is to try again and again with different values!

Let's fit our xgboost model on the split data:-

And finally, predict with the model and submit as per the competition rules:-
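A minimal sketch of the submission step, assuming the file needs the two columns 'click_id' and 'is_attributed' (the id we set aside earlier plus the predicted probability). The helper name `make_submission`, the file name, and the sample values are made up.

```python
import pandas as pd

def make_submission(click_ids, probabilities, path=None):
    """Pair each click_id with its predicted probability; optionally save as CSV."""
    submission = pd.DataFrame({
        "click_id": list(click_ids),
        "is_attributed": list(probabilities),
    })
    if path is not None:
        submission.to_csv(path, index=False)  # no index column in the file
    return submission

# Made-up ids and probabilities standing in for the model's test predictions
submission = make_submission([0, 1, 2], [0.01, 0.85, 0.10])
print(submission)

# To write the file for upload:
# make_submission(ids, preds, path="submission.csv")
```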

With this approach I am able to get a score of 0.9487, while the top score is 0.9709, which indicates that there is a lot of room for improvement and research for you guys to apply! So don't wait, visit my complete notebook HERE and practice it in your kernel.

With that said my friends- Go chase your dreams, have an awesome day, make every second count and see you later in my next post.