
Machine Learning-Logistic Regression


Another post starts with you beautiful people!
Thank you for your interest in the Machine Learning track. I hope you enjoyed my previous post about Linear Regression, where we learned the concept through a case study of a bike sharing system.
Today we will continue our Data Science journey and learn about Logistic Regression.
Like all regression analyses, logistic regression is a predictive analysis.
Linear regression produces estimates on a continuous numeric scale. To classify observations, we need a more suitable measure, such as the probability of class membership.
The following formula transforms a linear regression estimate into a probability, which is better suited to describing how well a class fits an observation:

probability of a class = exp(r) / (1+exp(r))

  • r is the regression result (the sum of the variables weighted by the coefficients) 
  • exp is the exponential function. 
  • exp(r) corresponds to Euler’s number e raised to the power of r. 
  • A linear regression that uses this formula (the logistic function, which is the inverse of the logit link function) to transform its results into probabilities is a logistic regression.
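
To make the formula concrete, here is a minimal sketch in plain Python. The function name class_probability is my own choice for illustration, and the two branches are just algebraically equal forms of exp(r) / (1+exp(r)) picked so that very large or very small values of r do not overflow.

```python
import math

def class_probability(r):
    """Map a linear-regression score r to a probability via exp(r) / (1 + exp(r))."""
    # Both branches are the same formula, rearranged so exp() never overflows.
    if r >= 0:
        return 1.0 / (1.0 + math.exp(-r))
    e = math.exp(r)
    return e / (1.0 + e)

for r in (-2.0, 0.0, 2.0):
    print(r, round(class_probability(r), 3))
# -2.0 0.119
# 0.0 0.5
# 2.0 0.881
```

A score of 0 maps to a probability of exactly 0.5, negative scores map below 0.5, and positive scores map above it.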


Logistic regression is set up much like linear regression. The main practical difference is the y data, which should contain integer values indicating the class of each observation.
Agenda of this exercise-

For this exercise we will work on a very interesting dataset: the Affairs dataset.
The extramarital affair data is used to explain how an individual allocates time among work, a spouse, and a paramour. It is often used as an example of regression with censored data.
It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs.


Description of Variables-
The dataset contains 6366 observations of 9 variables:

Dataset description:-
  • rate_marriage: woman's rating of her marriage (1 = very poor, 5 = very good)
  • age: woman's age
  • yrs_married: number of years married
  • children: number of children
  • religious: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
  • educ: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
  • occupation: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
  • occupation_husb: husband's occupation (same coding as above)
  • affairs: time spent in extra-marital affairs
Problem Statement:-
We will treat this as a classification problem by creating a new binary variable affair (did the woman have at least one affair?) and trying to predict the classification for each woman.

Importing the required modules-

Data Pre-Processing-
First, let's load the dataset and add a binary 'affair' column.
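
A minimal sketch of this step, assuming the data is read from the copy bundled with statsmodels (sm.datasets.fair). The helper names load_affairs and add_affair_flag are hypothetical, chosen only for this example.

```python
import pandas as pd

def load_affairs():
    """Load the affairs data bundled with statsmodels as a DataFrame."""
    import statsmodels.api as sm  # imported lazily; only needed for loading
    return sm.datasets.fair.load_pandas().data

def add_affair_flag(df):
    """Return a copy with affair = 1 if any time was spent in affairs, else 0."""
    out = df.copy()
    out["affair"] = (out["affairs"] > 0).astype(int)
    return out
```

With these in place, something like dta = add_affair_flag(load_affairs()) should give a DataFrame with the nine original columns plus the new binary affair column.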


Data Exploration-

We can see that on average, women who have affairs rate their marriages lower, which is to be expected. 

Let's take another look at the rate_marriage variable.

It appears that an increase in age, yrs_married, and children correlates with a declining marriage rating.
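
Both comparisons above are group averages, so a small pandas sketch is enough to reproduce them. The helper name mean_by is hypothetical, and dta in the usage note stands for the DataFrame that already holds the binary affair column.

```python
import pandas as pd

def mean_by(df, key):
    """Average every numeric column within each level of `key`."""
    return df.groupby(key).mean(numeric_only=True)
```

Calling mean_by(dta, "affair") compares women with and without affairs, and mean_by(dta, "rate_marriage") shows how age, yrs_married, and children vary across marriage ratings.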

Data Visualization-


Let's take a look at the distribution of marriage ratings for those having affairs versus those not having affairs.

Let's use a stacked barplot to look at the percentage of women having affairs by number of years of marriage.
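
One way to build such a chart is to compute the share of each affair value within every group first and then plot it. This is a sketch with hypothetical helper names (affair_share_by, plot_stacked). It assumes a DataFrame with an affair column, and the plotting helper needs matplotlib installed.

```python
import pandas as pd

def affair_share_by(df, column):
    """Row-normalised crosstab: share of affair = 0 / 1 within each level of `column`."""
    return pd.crosstab(df[column], df["affair"], normalize="index")

def plot_stacked(share, title):
    """Draw the shares as a stacked bar chart (needs matplotlib installed)."""
    ax = share.plot(kind="bar", stacked=True, title=title)
    ax.set_ylabel("Share of women")
    return ax
```

For example, plot_stacked(affair_share_by(dta, "yrs_married"), "Affair share by years married") gives the stacked view. For the marriage rating comparison, the raw counts from pd.crosstab(dta.rate_marriage, dta.affair) can be plotted with kind="bar" in the same way.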

Prepare Data for Logistic Regression-

To prepare the data, we will add an intercept column as well as dummy variables for occupation and occupation_husb, since we are treating them as categorical variables. 
The dmatrices function from the patsy module can do that using formula language.

The column names for the dummy variables are ugly, so let's rename those-

We also need to flatten y into a 1-D array, so that scikit-learn will properly understand it as the response variable.
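
The three preparation steps can be sketched together as follows. The formula string, the short names occ_2 and occ_husb_2, and the helper names tidy_name and build_design are my assumptions for illustration. Only dmatrices and its return_type="dataframe" option come from patsy itself.

```python
import re
import numpy as np
from patsy import dmatrices

# Assumed formula: every predictor, with both occupations treated as categorical.
FORMULA = ("affair ~ rate_marriage + age + yrs_married + children + religious"
           " + educ + C(occupation) + C(occupation_husb)")

def tidy_name(name):
    """Turn 'C(occupation)[T.2.0]' into 'occ_2' and the husband's version into 'occ_husb_2'."""
    m = re.fullmatch(r"C\((occupation(?:_husb)?)\)\[T\.(\d+)(?:\.0)?\]", name)
    if m is None:
        return name  # Intercept and numeric columns keep their names
    prefix = "occ_husb" if m.group(1) == "occupation_husb" else "occ"
    return f"{prefix}_{m.group(2)}"

def build_design(df):
    """Return (X, y): X has an Intercept column plus dummies, y is a flat 1-D array."""
    y, X = dmatrices(FORMULA, df, return_type="dataframe")
    X = X.rename(columns=tidy_name)
    return X, np.ravel(y)
```

The first occupation level gets no dummy column of its own, so it acts as the baseline that the other occupation coefficients are compared against.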

Logistic Regression-
Let's go ahead and run logistic regression on the entire data set, and see how accurate it is!
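
A minimal sketch of this step with scikit-learn. The helper name fit_on_everything is hypothetical, and max_iter=1000 is my own choice to give the solver room to converge. Note that scikit-learn's LogisticRegression applies L2 regularization by default.

```python
from sklearn.linear_model import LogisticRegression

def fit_on_everything(X, y):
    """Fit logistic regression on all rows and report accuracy on those same rows."""
    model = LogisticRegression(max_iter=1000)  # max_iter is an assumption, not a tuned value
    model.fit(X, y)
    return model, model.score(X, y)
```

Because the model is scored on the same rows it was trained on, the accuracy returned here is an optimistic estimate.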

73% accuracy seems good, but what's the null error rate?


Only 32% of the women had affairs, which means that we could obtain 68% accuracy by always predicting "no". 
So we're doing better than this baseline (a null error rate of 32%), but not by much.
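
The baseline is easy to compute from the response variable alone. Here is a small sketch, with the hypothetical helper name null_accuracy, that assumes y is an array of 0s and 1s.

```python
import numpy as np

def null_accuracy(y):
    """Accuracy of always predicting the more common class in a 0/1 array."""
    positive_rate = float(np.mean(y))
    return max(positive_rate, 1.0 - positive_rate)
```

The null error rate is simply one minus this value.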
Let's examine the coefficients to see what we learn-
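
One way to view the coefficients is to pair each column name with its fitted value. This is a sketch with a hypothetical helper name, coefficient_table, that assumes a fitted binary scikit-learn model whose coef_ has one row.

```python
import numpy as np
import pandas as pd

def coefficient_table(model, feature_names):
    """Pair each feature name with its coefficient, sorted from most negative to most positive."""
    coefs = np.ravel(model.coef_)
    table = pd.DataFrame({"feature": list(feature_names), "coefficient": coefs})
    return table.sort_values("coefficient").reset_index(drop=True)
```

Negative coefficients lower the predicted log-odds of an affair and positive coefficients raise them.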


From the above output we can say that increases in marriage rating and religiousness correspond to a decrease in the likelihood of having an affair.

For both the wife's occupation and the husband's occupation, the lowest likelihood of having an affair corresponds to the baseline occupation (student), since all of the dummy coefficients are positive.

Model Evaluation Using a Validation Set-
So far, we have trained and tested on the same set. Let's instead split the data into a training set and a testing set.

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.
As you can see, the classifier is predicting a 1 (having an affair) any time the probability in the second column is greater than 0.5.
Now let's generate some evaluation metrics-
The accuracy is 73%, which is the same as we experienced when training and predicting on the same data.
We can also see the confusion matrix and a classification report with other metrics-
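
The whole validation step can be sketched in one helper. The name evaluate_on_holdout, the 30% test size, and the seed are assumptions for illustration. The scikit-learn calls themselves (train_test_split, predict, predict_proba, and the metrics functions) are standard.

```python
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def evaluate_on_holdout(X, y, test_size=0.3, seed=0):
    """Fit on a training split and report metrics on the held-out rows."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predicted = model.predict(X_test)    # hard 0/1 class labels
    probs = model.predict_proba(X_test)  # columns: P(class 0), P(class 1)
    return {
        "accuracy": metrics.accuracy_score(y_test, predicted),
        "confusion_matrix": metrics.confusion_matrix(y_test, predicted),
        "report": metrics.classification_report(y_test, predicted),
        "predicted": predicted,
        "probs": probs,
    }
```

In the returned dictionary, predicted is 1 wherever the second column of probs is above 0.5, which is the thresholding described above.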

Model Evaluation Using Cross-Validation-
Now let's try 10-fold cross-validation, to see if the accuracy holds up more rigorously.
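
A minimal sketch of the cross-validation step using scikit-learn's cross_val_score. The helper name cv_accuracy and max_iter=1000 are my assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, folds=10):
    """Return the per-fold accuracies and their mean for k-fold cross-validation."""
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             scoring="accuracy", cv=folds)
    return scores, scores.mean()
```

Looking at the individual fold scores as well as the mean helps show whether the accuracy is stable or depends heavily on which rows are held out.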

Looks good. It's still performing at 73% accuracy. So our model is ready for prediction!

Can we predict the probability of an affair using our model?
Let's predict the probability of an affair for a random woman not present in the dataset. 
Assume she's a 25-year-old housewife who graduated from college, has been married for 3 years, has 1 child, rates herself as strongly religious, rates her marriage as fair, and whose husband is a farmer.
From our model, the predicted probability of an affair is 23%.
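
A prediction like this needs a single row laid out in exactly the same column order as the training matrix. The sketch below builds that row from named values. The helper name affair_probability and the short dummy names (occ_husb_2 and so on) are assumptions, so adjust them to whatever your own X.columns contains.

```python
import pandas as pd

def affair_probability(model, columns, **features):
    """P(affair = 1) for one woman; columns not named in `features` default to 0."""
    unknown = set(features) - set(columns)
    if unknown:
        raise KeyError(f"not in the design matrix: {sorted(unknown)}")
    # One row, ordered like the training columns, with unspecified dummies set to 0.
    row = pd.DataFrame([features]).reindex(columns=list(columns), fill_value=0)
    return float(model.predict_proba(row)[0, 1])
```

For the woman described above, a call might look like affair_probability(model, X.columns, Intercept=1, occ_husb_2=1, rate_marriage=3, age=25, yrs_married=3, children=1, religious=4, educ=16). Here I am assuming that "fair" means a rating of 3 and that a farmer falls under husband occupation code 2. The occupation coding has no housewife entry, so the wife's occupation dummy is left for you to choose. Leaving all of them at 0 corresponds to the baseline category.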

Looks cool, right? We can make many improvements to our model, such as:
  • including interaction terms
  • removing features
  • regularization techniques
  • using a non-linear model
It's time to try it yourself and improve our model.

In my next post I will share when and where to use Linear or Logistic regression.
