Posts

Showing posts from September, 2017

Machine Learning::Cross Validation & ROC Curve

Another post starts with you beautiful people! Hope you enjoyed my previous post about improving your model performance with the confusion matrix. Today we will continue our performance-improvement journey and learn about Cross Validation (k-fold cross validation) & ROC in Machine Learning. A common practice in data science competitions is to iterate over various models to find a better performing one. However, it becomes difficult to tell whether an improvement in score comes from capturing the relationship better or from simply over-fitting the data. To answer this question, we use the cross-validation technique. This method helps us achieve more generalized relationships. What is Cross Validation? Cross validation is a technique that involves reserving a particular sample of a dataset on which we do not train the model. Later, we test the model on this sample before finalizing it. Here are the steps involved in cross validation…
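To make the idea concrete, here is a minimal k-fold cross-validation sketch using scikit-learn. The Iris dataset and logistic-regression model are illustrative assumptions, not the data from this post; the same pattern works with any estimator.

```python
# A minimal k-fold cross-validation sketch with scikit-learn.
# The dataset and model below are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Evaluate on 5 folds: each fold is held out once for testing
# while the model trains on the remaining 4 folds.
scores = cross_val_score(model, X, y, cv=5)
print("Per-fold accuracy:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

Averaging over the folds gives a less optimistic, more stable estimate of performance than a single train/test split, which is exactly why it helps separate genuine improvement from over-fitting.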

Machine Learning::Confusion Matrix

Another post starts with you beautiful people! Thanks for your overwhelming response to my previous post about decision trees and random forests. Today we will continue our Machine Learning journey and discover how to interpret the confusion matrix for use in machine learning. After reading this post we will know: what the confusion matrix is and why we need it, how to calculate a confusion matrix, and how to create one. A confusion matrix is a technique for summarizing the performance of a classification algorithm. Classification accuracy (the ratio of correct predictions to total predictions made) alone can be misleading if we have an unequal number of observations in each class or if we have more than two classes in our dataset. For a quick revision, remember the following formula: error rate = (1 - (correct predictions / total predictions)) * 100. The main problem with classification accuracy…
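As a quick sketch, here is how the confusion matrix and the error-rate formula above can be computed with scikit-learn; the toy label vectors are made up for illustration.

```python
# A minimal confusion-matrix sketch with scikit-learn.
# The label vectors below are made-up illustrative data.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]] for binary labels ordered [0, 1].
print(confusion_matrix(y_true, y_pred))

# Error rate = (1 - (correct / total)) * 100, matching the formula above.
accuracy = accuracy_score(y_true, y_pred)
print("Error rate: %.1f%%" % ((1 - accuracy) * 100))
```

Reading the matrix cell by cell shows exactly which class the model confuses with which, information a single accuracy number hides.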

Case Study::Decision Trees & Random Forests::Machine Learning::Kaggle

Another post starts with you beautiful people! I am very happy that you enjoyed my previous post about Decision Trees and Random Forests, and many aspiring data scientists like you have asked me questions such as: can you share a case study, and how do we apply our knowledge to a competition like Kaggle? So here I am! In this exercise we will work on a great dataset, the Titanic disaster, which we studied in one of my earlier posts. I suggest a quick revision of that post here: Let me revise Titanic disaster. Then we will submit our prediction to Kaggle. When the Titanic sank, 1502 of the 2224 passengers and crew were killed. One of the main reasons for this high level of casualties was the lack of lifeboats on this self-proclaimed "unsinkable" ship. In this post, we will learn how to apply machine learning techniques to predict a passenger's chance of surviving. Let's start by loading the training and testing sets into our Python environment. We will…
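A minimal sketch of that first loading step with pandas is below; it assumes the train.csv and test.csv files from the Kaggle Titanic competition page have been downloaded into the working directory (the paths are an assumption, not part of the original post).

```python
# A sketch of loading the Kaggle Titanic train/test sets with pandas.
# File locations are an assumption: download train.csv and test.csv
# from the Kaggle Titanic competition into the working directory.
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Quick sanity check: dataset shapes and the survival breakdown
# (the test set has no "Survived" column, since that is what we predict).
print(train.shape, test.shape)
print(train["Survived"].value_counts())
```

With the data in two DataFrames, the usual next steps are exploring the features, handling missing values, and fitting a first model before preparing a submission file for Kaggle.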