Another post starts with you beautiful people!
I hope you are enjoying our machine learning journey. After working through many real-world problems in earlier posts, you know by now that with this skill you can make the world a better place to live!
To continue our journey, today we are going to analyze a problem from TalkingData, China's largest big data service platform.
About The Problem-
TalkingData covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Yes, you read that right! 90% of clicks are fraudulent, and they cause unnecessary server load.
Our Challenge-
As data scientists, our task is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
Data-
To support our modeling, TalkingData has provided a generous dataset covering approximately 200 million clicks over 4 days, which you can download or view here: Dataset!
From the challenge statement it may be a little confusing how a click prediction can tell us whether a user is fraudulent. But if we can model the click pattern, TalkingData can use our model to decide whether a click is coming from a valid IP or not. So let's focus on predicting the click.
Since this is a big data problem, be ready to deal with large files. How can we find the size of the data? Not a big deal; we can do it as below:-
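Something like this does the trick (a minimal sketch using only the standard library):

```python
import os

# Print every file under the competition's input directory with its size in MB.
# '../input' is where Kaggle mounts the data; adjust the path if running locally.
for f in sorted(os.listdir('../input')):
    size_mb = os.path.getsize(os.path.join('../input', f)) / 1024 ** 2
    print(f'{f:20s} {size_mb:10.2f} MB')
```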
Here '../input' is the directory where all the data resides on Kaggle; if you have downloaded the data to your local machine, just change the path and run the code in your notebook. In my notebook it shows the following result:
Except for train_sample.csv, every file is very big! The smaller sample is provided in case you don't want to download the full dataset.
See, the data is huge, and if you try to load all of it on your local machine, it will cause memory problems. So is there anything that can give us a glimpse of the basic statistics without loading the files? Yes, Python provides a module named subprocess, and we can use it as below:
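A sketch of that check (it assumes a Unix-like environment such as a Kaggle kernel, where the `wc` utility is available):

```python
import subprocess

# Count rows without loading the files into memory by shelling out to `wc -l`.
print(subprocess.check_output(['wc', '-l', '../input/train.csv']).decode())
print(subprocess.check_output(['wc', '-l', '../input/test.csv']).decode())
```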
The above code shows that there are about 185 million rows in the training dataset and 19 million rows in the test dataset. It's not easy to work on all those rows, so we will work on the first 1 million rows of the training dataset.
Let's load the datasets:-
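For a first look we can keep it simple, with the 1 million row cap we just decided on:

```python
import pandas as pd

# Load only the first 1 million training rows plus the full test file.
train = pd.read_csv('../input/train.csv', nrows=1_000_000)
test = pd.read_csv('../input/test.csv')
```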
Let's see the content of the datasets:-
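A quick peek with head():

```python
# Show the first five rows of each dataframe.
print(train.head())
print(test.head())
```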
As per TalkingData, the above columns are described as follows:-
- ip: ip address of click
- app: app id for marketing
- device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- os: os version id of user mobile phone
- channel: channel id of mobile ad publisher
- click_time: timestamp of click (UTC)
- attributed_time: if the user downloads the app after clicking an ad, this is the time of the download
- is_attributed: the target to be predicted, indicating whether the app was downloaded
If you ponder the columns above, you will find that every value is numeric; it looks like the values are stored as IDs or in encoded form because of security concerns.
Let's explore each column and its unique values by plotting:-
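A sketch of one such plot (the log scale is my own choice, useful because some columns have vastly more unique values than others):

```python
import matplotlib.pyplot as plt

# Bar chart of the number of distinct values in each categorical column.
cols = ['ip', 'app', 'device', 'os', 'channel']
plt.bar(cols, [train[c].nunique() for c in cols])
plt.yscale('log')  # unique counts can span several orders of magnitude
plt.ylabel('unique values (log scale)')
plt.title('Unique values per feature')
plt.show()
```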
Since all attributes are in encoded numeric form, feature engineering is quite tough, but if you can manage it, it will surely improve your model. Let's start the real work-
Loading the datasets into memory-efficient pandas dataframes:-
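A sketch of a memory-friendly load; the unsigned integer dtypes are my own choice based on the columns' value ranges, not something dictated by the competition:

```python
# Explicit unsigned integer dtypes keep the dataframes small in memory.
train_dtypes = {'ip': 'uint32', 'app': 'uint16', 'device': 'uint16',
                'os': 'uint16', 'channel': 'uint16', 'is_attributed': 'uint8'}
test_dtypes = {'ip': 'uint32', 'app': 'uint16', 'device': 'uint16',
               'os': 'uint16', 'channel': 'uint16', 'click_id': 'uint32'}
train = pd.read_csv('../input/train.csv', nrows=1_000_000, dtype=train_dtypes)
test = pd.read_csv('../input/test.csv', dtype=test_dtypes)
```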
Since 'click_time' is a raw timestamp, let's write a function to convert it into pandas datetime features:-
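One plausible version of such a function; the name and the day/hour feature choice are illustrative assumptions:

```python
def add_time_features(df):
    # Parse the raw string into a pandas datetime, then extract the
    # components that can capture daily and hourly click patterns.
    df['click_time'] = pd.to_datetime(df['click_time'])
    df['day'] = df['click_time'].dt.day.astype('uint8')
    df['hour'] = df['click_time'].dt.hour.astype('uint8')
    return df
```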
Let's apply this function to our train and test data:-
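Applying it is one line per dataframe:

```python
train = add_time_features(train)
test = add_time_features(test)
```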
Next, separate the target attribute and drop it from the train dataset, along with any other unwanted columns:-
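A sketch; treating 'attributed_time' and the raw 'click_time' as the unwanted columns is my reading, since the former is only known after a download happens and the latter has already been expanded into day/hour:

```python
# Keep the target separately, then drop it along with the unwanted columns.
y = train['is_attributed']
train = train.drop(['is_attributed', 'attributed_time', 'click_time'], axis=1)
```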
Next, separate click_id from the test dataset, since this attribute will be needed in the submission file:-
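For example:

```python
# Save click_id for the submission file, then drop what the model won't use.
click_id = test['click_id']
test = test.drop(['click_id', 'click_time'], axis=1)
```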
That's it! Our data is now ready for any machine learning model. I am going to apply XGBoost.
Why XGBoost? Because it is efficient, accurate, practical, and a proven winner in past competitions.
Always remember: XGBoost is powerful only when you feed it the right parameters, and the rule of thumb for finding them is to experiment again and again with different values!
Let's split the data and apply our XGBoost model:-
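A sketch of the split-and-train step; the 90/10 split and the parameter values below are illustrative starting points, not the tuned values behind my final score:

```python
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Hold out 10% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    train, y, test_size=0.1, random_state=42)

# The competition metric is ROC AUC; scale_pos_weight compensates
# for how rare downloads are compared to clicks.
params = {'objective': 'binary:logistic', 'eval_metric': 'auc',
          'eta': 0.3, 'max_depth': 4, 'subsample': 0.9,
          'colsample_bytree': 0.7, 'scale_pos_weight': 9}
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
model = xgb.train(params, dtrain, num_boost_round=200,
                  evals=[(dvalid, 'valid')],
                  early_stopping_rounds=20, verbose_eval=10)
```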
And finally, generate predictions and create the submission file as per the competition rules:-
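A sketch of the final step, reusing the click_id series saved earlier:

```python
# Predict download probabilities for the test clicks and write the
# submission in the required (click_id, is_attributed) format.
sub = pd.DataFrame({'click_id': click_id,
                    'is_attributed': model.predict(xgb.DMatrix(test))})
sub.to_csv('submission.csv', index=False)
```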
With this approach I was able to get a score of 0.9487, while the top score is 0.9709, which shows there is a lot of room for improvement and research for you to apply! So don't wait: visit my complete notebook HERE and practice it in your own kernel.
With that said, my friends: go chase your dreams, have an awesome day, make every second count, and see you later in my next post.