
Machine Learning - Decision Trees and Random Forests


Another post starts with you beautiful people!
I hope after reading my previous post about Linear and Logistic Regression your confidence level is up and you are now ready to move one step ahead in the Machine Learning arena.
In this post we will be going over Decision Trees and Random Forests.
In order for you to understand this exercise completely there is some required reading.
I suggest you read the following blog post before going further - A Must Read!
After reading the blog post you should have a basic layman's (or laywoman's!) understanding of how decision trees and random forests work. A quick intro is below-

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.
A Random Forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).

Let's see how we can implement them in Python!
Python provides a powerful library - sklearn - for decision trees and random forests, and we will use it in our exercise. You can find more details about it here - Decision Tree with Python and Random Forest with Python


Creating Decision Trees:-
We know from the A Must Read! blog post that Decision Trees utilize binary splitting to make decisions based on features (the questions we ask).
So let's go ahead and create some data using some built-in functions in Scikit-Learn:
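The code itself isn't shown above, so here is a minimal sketch of what this step could look like (the parameter values below are my own illustrative choices, not necessarily the original ones):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# generate toy 2-D classification data: 500 points in 4 Gaussian blobs
# (n_samples, centers, cluster_std and random_state are illustrative choices)
X, y = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=101)

# scatter plot of the blobs, coloured by cluster label
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='jet')
plt.show()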

Please note here-
make_blobs is used to generate isotropic Gaussian blobs for clustering. Its main parameters are:

  • n_samples : int, optional (default=100). The total number of points equally divided among clusters.
  • centers : int or array of shape [n_centers, n_features], optional (default=3). The number of centers to generate, or the fixed center locations.
  • cluster_std : float or sequence of floats, optional (default=1.0). The standard deviation of the clusters.
  • random_state : int, RandomState instance or None, optional (default=None). If int, random_state is the seed used by the random number generator; if a RandomState instance, it is the random number generator itself; if None, the random number generator is the RandomState instance used by np.random.

Output:-

Visualization Function:-
Before we begin implementing the Decision Tree, let's create a nice function to plot out the decision boundaries using a mesh grid (a technique common in the Scikit-Learn documentation). From the NumPy documentation, np.meshgrid:

  • Returns coordinate matrices from coordinate vectors.
  • Makes N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D grids, given one-dimensional coordinate arrays x1, x2, ..., xn.


If you need the above function's code, let me know in the comments section.
I will share it!
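In the meantime, here is a minimal sketch of what such a helper could look like, following the usual mesh-grid pattern from the Scikit-Learn documentation (the name plot_tree_boundary and the step size h are my own illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

def plot_tree_boundary(clf, X, y, h=0.02):
    # fit the classifier on the full data set
    clf.fit(X, y)
    # build a mesh grid covering the feature space with a small margin
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # predict a class for every grid point and colour the regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='jet')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='jet')
    plt.show()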

Let's plot out a Decision Tree boundary with a max depth of two levels:-
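With a helper like the sketch above, that could look like this (max_depth is a standard Scikit-Learn parameter):

from sklearn.tree import DecisionTreeClassifier

# a shallow tree: only two levels of binary splits
plot_tree_boundary(DecisionTreeClassifier(max_depth=2), X, y)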
Output-
How about 4 levels deep?
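Going deeper is just a parameter change (again a sketch using the helper above):

# the same call, but four levels of splits this time
plot_tree_boundary(DecisionTreeClassifier(max_depth=4), X, y)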

Notice how changing the depth of the decision tree causes the boundaries to change substantially!
If we pay close attention to the second model we can begin to see evidence of over-fitting.
This basically means that if we were to try to predict a new point, the result would be influenced more by the noise than by the signal.

So how do we address this issue? 
The answer is by creating an ensemble of decision trees-Random Forests.

Random Forests:-
Ensemble Methods essentially average the results of many individual estimators which over-fit the data. The resulting estimates are much more robust and accurate than the individual estimates which make them up!
One of the most common ensemble methods is the Random Forest, in which the ensemble is made up of many decision trees which are in some way perturbed.
Let's see how we can use Scikit-Learn to create a random forest (it's actually very simple!)
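A minimal sketch (n_estimators=100 is an illustrative choice, and plot_tree_boundary is the hypothetical helper from earlier):

from sklearn.ensemble import RandomForestClassifier

# an ensemble of 100 randomized decision trees
plot_tree_boundary(RandomForestClassifier(n_estimators=100), X, y)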


Note that n_estimators stands for the number of trees to use. Intuitively we might expect that more decision trees would always be better, but after a certain number of trees (somewhere between 100 and 400, depending on the data) the accuracy gained by adding more estimators diminishes significantly, and they just become a load on your CPU.


We can see that the random forest has been able to pick up features that the Decision Tree was not able to (although we must be careful of over-fitting with Random Forests too!).
While a visual is nice, a better way to evaluate our model would be a train/test split if we had real data!
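For completeness, a sketch of what that evaluation could look like (the 70/30 split and the seed are illustrative choices):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
print(accuracy_score(y_test, forest.predict(X_test)))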

Random Forest Regression:-
We can also use Random Forests for Regression! Let's see a quick example!

Let's imagine we have some sort of weather data that's sinusoidal in nature with some noise. It has a slow oscillation component, a fast oscillation component, and then a random noise component.
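Here is a sketch of data matching that description (the frequencies, amplitudes, noise scale and seed are my own illustrative choices):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)  # illustrative seed

x = 10 * rng.rand(200)           # 200 random sample points
slow = np.sin(0.5 * x)           # slow oscillation component
fast = 0.5 * np.sin(5 * x)       # fast oscillation component
noise = 0.1 * rng.randn(200)     # random noise component
y = slow + fast + noise

plt.errorbar(x, y, 0.1, fmt='o')
plt.show()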


Now let's use a Random Forest Regressor to create a fitted regression. Obviously a standard linear regression approach wouldn't work here, and if we didn't know anything about the true nature of the model, polynomial or sinusoidal regression would be tedious.
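A sketch of the regression fit, reusing the toy data above (n_estimators=200 is an illustrative choice):

from sklearn.ensemble import RandomForestRegressor

xfit = np.linspace(0, 10, 1000)

forest = RandomForestRegressor(n_estimators=200)
forest.fit(x[:, None], y)            # sklearn expects a 2-D feature array
yfit = forest.predict(xfit[:, None])

# plot the noisy samples, the forest's fit, and the true noise-free curve
ytrue = np.sin(0.5 * xfit) + 0.5 * np.sin(5 * xfit)
plt.errorbar(x, y, 0.1, fmt='o', alpha=0.5)
plt.plot(xfit, yfit, '-r', label='random forest fit')
plt.plot(xfit, ytrue, '-k', alpha=0.5, label='true curve')
plt.legend()
plt.show()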

Output-

As you can see, the non-parametric random forest model is flexible enough to fit the multi-period data, without us even specifying a multi-period model!
This is the tradeoff: we give up the simplicity of a parametric model in exchange for not having to know in advance what form our data actually takes.

Here are some more resources for Random Forests:
A whole webpage from the inventors themselves, Leo Breiman and Adele Cutler - Random Forests
It's strange to think Random Forests is actually trademarked!

That's it for today! Please try the above functions with some modifications in your notebook and explore more.

