
Machine Learning - Decision Trees and Random Forests


Another post starts with you beautiful people!
I hope after reading my previous post about Linear and Logistic Regression your confidence level is up and you are now ready to move one step ahead in the Machine Learning arena.
In this post we will be going over Decision Trees and Random Forests.
In order for you to understand this exercise completely there is some required reading.
I suggest you read the following blog post before going further - A Must Read!
After reading the blog post you should have a basic layman's (or laywoman's!) understanding of how decision trees and random forests work. A quick intro is below-

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.
A Random Forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).

Let's see how we can implement them in Python!
Python provides a powerful library - sklearn - for decision trees and random forests, and we will use it in our exercise. You can find more details about it here - Decision Tree with Python and Random Forest with Python


Creating Decision Trees:-
We know from the A Must Read! blog post that Decision Trees utilize binary splitting to make decisions based on features (the questions we ask).
So let's go ahead and create some data using some built-in functions in Scikit-Learn:
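The code itself isn't shown above, so here is a minimal sketch of what this step could look like (the parameter values below are my own illustrative choices, not necessarily the original ones):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# generate toy 2-D classification data: 500 points in 4 Gaussian blobs
# (n_samples, centers, cluster_std and random_state are illustrative choices)
X, y = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=101)

# scatter plot of the blobs, coloured by cluster label
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='jet')
plt.show()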

Please note here-
make_blobs is used to generate isotropic Gaussian blobs for clustering. Its main parameters are:

  • n_samples : int, optional (default=100). The total number of points equally divided among clusters.
  • centers : int or array of shape [n_centers, n_features], optional (default=3). The number of centers to generate, or the fixed center locations.
  • cluster_std : float or sequence of floats, optional (default=1.0). The standard deviation of the clusters.
  • random_state : int, RandomState instance or None, optional (default=None). If int, random_state is the seed used by the random number generator; if a RandomState instance, it is the random number generator itself; if None, the random number generator is the RandomState instance used by np.random.

Output:-

Visualization Function:-
Before we begin implementing the Decision Tree, let's create a nice function to plot out the decision boundaries using a mesh grid (a technique common in the Scikit-Learn documentation). From the NumPy documentation, np.meshgrid:

  • Returns coordinate matrices from coordinate vectors.
  • Makes N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D grids, given one-dimensional coordinate arrays x1, x2, ..., xn.


If you need the above function's code, let me know in the comments section.
I will share it!
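In the meantime, here is a minimal sketch of what such a helper could look like, following the usual mesh-grid pattern from the Scikit-Learn documentation (the name plot_tree_boundary and the step size h are my own illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

def plot_tree_boundary(clf, X, y, h=0.02):
    # fit the classifier on the full data set
    clf.fit(X, y)
    # build a mesh grid covering the feature space with a small margin
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # predict a class for every grid point and colour the regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='jet')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='jet')
    plt.show()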

Let's plot out a Decision Tree boundary with a max depth of two levels:-
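With a helper like the sketch above, that could look like this (max_depth is a standard Scikit-Learn parameter):

from sklearn.tree import DecisionTreeClassifier

# a shallow tree: only two levels of binary splits
plot_tree_boundary(DecisionTreeClassifier(max_depth=2), X, y)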
Output-
How about 4 levels deep?
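Going deeper is just a parameter change (again a sketch using the helper above):

# the same call, but four levels of splits this time
plot_tree_boundary(DecisionTreeClassifier(max_depth=4), X, y)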

Notice how changing the depth of the decision tree causes the boundaries to change substantially!
If we pay close attention to the second model we can begin to see evidence of over-fitting.
This basically means that if we were to try to predict a new point, the result would be influenced more by the noise than by the signal.

So how do we address this issue? 
The answer is by creating an ensemble of decision trees-Random Forests.

Random Forests:-
Ensemble Methods essentially average the results of many individual estimators which over-fit the data. The resulting estimates are much more robust and accurate than the individual estimates which make them up!
One of the most common ensemble methods is the Random Forest, in which the ensemble is made up of many decision trees which are in some way perturbed.
Let's see how we can use Scikit-Learn to create a random forest (it's actually very simple!)
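A minimal sketch (n_estimators=100 is an illustrative choice, and plot_tree_boundary is the hypothetical helper from earlier):

from sklearn.ensemble import RandomForestClassifier

# an ensemble of 100 randomized decision trees
plot_tree_boundary(RandomForestClassifier(n_estimators=100), X, y)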


Note that n_estimators stands for the number of trees to use. Intuitively we might expect that more decision trees would always be better, but after a certain number of trees (somewhere between 100 and 400, depending on the data) the accuracy gained by adding more estimators diminishes significantly, and they just become a load on your CPU.


We can see that the random forest has been able to pick up features that the Decision Tree was not able to (although we must be careful of over-fitting with Random Forests too!).
While a visual is nice, a better way to evaluate our model would be a train/test split if we had real data!
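For completeness, a sketch of what that evaluation could look like (the 70/30 split and the seed are illustrative choices):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
print(accuracy_score(y_test, forest.predict(X_test)))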

Random Forest Regression:-
We can also use Random Forests for Regression! Let's see a quick example!

Let's imagine we have some sort of weather data that's sinusoidal in nature with some noise. It has a slow oscillation component, a fast oscillation component, and then a random noise component.
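Here is a sketch of data matching that description (the frequencies, amplitudes, noise scale and seed are my own illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)  # illustrative seed

x = 10 * rng.rand(200)           # 200 random sample points
slow = np.sin(0.5 * x)           # slow oscillation component
fast = 0.5 * np.sin(5 * x)       # fast oscillation component
noise = 0.1 * rng.randn(200)     # random noise component
y = slow + fast + noise

plt.errorbar(x, y, 0.1, fmt='o')
plt.show()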


Now let's use a Random Forest Regressor to create a fitted regression. Obviously a standard linear regression approach wouldn't work here, and if we didn't know anything about the true nature of the model, polynomial or sinusoidal regression would be tedious.
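A sketch of the regression fit, reusing the toy data above (n_estimators=200 is an illustrative choice):

from sklearn.ensemble import RandomForestRegressor

xfit = np.linspace(0, 10, 1000)

forest = RandomForestRegressor(n_estimators=200)
forest.fit(x[:, None], y)            # sklearn expects a 2-D feature array
yfit = forest.predict(xfit[:, None])

# plot the noisy samples, the forest's fit, and the true noise-free curve
ytrue = np.sin(0.5 * xfit) + 0.5 * np.sin(5 * xfit)
plt.errorbar(x, y, 0.1, fmt='o', alpha=0.5)
plt.plot(xfit, yfit, '-r', label='random forest fit')
plt.plot(xfit, ytrue, '-k', alpha=0.5, label='true curve')
plt.legend()
plt.show()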

Output-

As you can see, the non-parametric random forest model is flexible enough to fit the multi-period data, without us even specifying a multi-period model!
This is the tradeoff: we give up the simplicity of a parametric model in exchange for not having to know in advance what form our data actually takes.

Here are some more resources for Random Forests:
A whole webpage from the inventors themselves, Leo Breiman and Adele Cutler - Random Forests
It's strange to think Random Forests is actually trademarked!

That's it for today! Please try the above functions with some modifications in your notebook and explore more.

