
Machine Learning-Linear Regression

Another post starts with you beautiful people!
In my previous posts we learned Python basics, advanced Python, and statistics techniques for the Data Science track. I suggest you read a previous post for just 10-15 minutes before sleeping daily, and then no obstacle can stop you from becoming a great Data Scientist.
In this post we will start our Machine Learning track with the topic of Linear Regression. I have highlighted both terms, so please click on the links to see their formal definitions.

Machine learning, more specifically the field of predictive modeling, is primarily concerned with minimizing the error of a model, or making the most accurate predictions possible, at the expense of explainability.
In applied machine learning we will borrow, reuse and steal algorithms from many different fields, including statistics, and use them towards these ends.

Linear Regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables, but it has been borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm. For more details please see here- tell me more!

Form of linear regression-

  • y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
  • y is the response
  • β₀ is the intercept
  • β₁ is the coefficient for x₁ (the first feature)
  • βₙ is the coefficient for xₙ (the nth feature)

The β values are called the model coefficients:




  • These values are estimated (or "learned") during the model fitting process using the least squares criterion.
  • Specifically, we find the line (mathematically) which minimizes the sum of squared residuals (or "sum of squared errors").
  • And once we've learned these coefficients, we can use the model to predict the response.

Agenda of this post-
Introducing the bikeshare dataset

  • Reading in the data
  • Visualizing the data

Linear regression basics

  • Form of linear regression
  • Building a linear regression model
  • Using the model for prediction
  • Does the scale of the features matter?

Working with multiple features

  • Visualizing the data (part 2)
  • Adding more features to the model

Choosing between models

  • Feature selection
  • Evaluation metrics for regression problems
  • Comparing models with train/test split and RMSE
  • Comparing testing RMSE with null RMSE
  • Creating features

Handling categorical features

  • Feature engineering
  • Advantages/Disadvantages


In this exercise we'll be working with a dataset from Capital Bikeshare that was used in a Kaggle competition.
This dataset contains the hourly and daily count of rental bikes between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Bike sharing systems are a new generation of traditional bike rentals, where the whole process from membership to rental and return has become automatic.
Through these systems, a user is able to easily rent a bike from a particular position and return it at another position.

Importing the Dataset and Basic Data Exploration-
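The original code appeared here as a screenshot; below is a minimal sketch of the same step, assuming the Kaggle file has been saved locally as 'bikeshare.csv' (the file name and path are my assumption):

```python
import pandas as pd

# read the bikeshare data; 'bikeshare.csv' is an assumed local path to the Kaggle file
bikes = pd.read_csv('bikeshare.csv', index_col='datetime', parse_dates=True)

# basic exploration: shape, first rows, and summary statistics
print(bikes.shape)
print(bikes.head())
print(bikes.describe())
```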


The structure of our dataset will look like below- it has columns for season, holiday, workingday, weather, temp, atemp, humidity, windspeed, casual, registered, and count.

We will deal with the following questions in our exercise-

  • What does each observation represent?
  • What is the response variable (as defined by Kaggle)?
  • How many features are there?
If you look at the data structure of our dataset, there is a column named 'count', which is also a pandas method name. So it's good to rename it first-
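A one-line sketch of that rename (the new name 'total' is my choice; the original screenshot may have used a different one):

```python
# 'count' clashes with the DataFrame.count() method, so rename the column to 'total'
bikes.rename(columns={'count': 'total'}, inplace=True)
```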


Let's visualize the data-
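A sketch of one such plot using the pandas plotting API (plotting temperature against rentals is an assumption, based on the discussion that follows):

```python
import matplotlib.pyplot as plt

# scatter plot of temperature vs. total rentals
bikes.plot(kind='scatter', x='temp', y='total', alpha=0.3)
plt.show()
```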

Result-

Now, to see how the response variable y changes as an explanatory variable x changes, we will draw the regression line. In simple words, a regression line is the straight line that best fits the data points in a set.
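One way to do this is seaborn's lmplot, which draws the scatter plot together with a fitted regression line; a sketch, continuing from the data above:

```python
import seaborn as sns

# scatter plot of temperature vs. rentals with a fitted regression line
sns.lmplot(x='temp', y='total', data=bikes, aspect=1.5, scatter_kws={'alpha': 0.3})
plt.show()
```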

Result-

Next, we are going to build a linear regression model-
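A minimal sketch with scikit-learn, using 'temp' as the only feature (matching the single-feature discussion above):

```python
from sklearn.linear_model import LinearRegression

# use 'temp' as the single feature and 'total' as the response
feature_cols = ['temp']
X = bikes[feature_cols]
y = bikes.total

# fit the model and inspect the learned intercept and coefficient
linreg = LinearRegression()
linreg.fit(X, y)
print(linreg.intercept_, linreg.coef_)
```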


Points to think about from the above code: what do the intercept and the 'temp' coefficient tell us about rentals?

Using the above model for prediction-
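A sketch of a prediction call; the input temperature of 25°C is a hypothetical value for illustration:

```python
import pandas as pd

# predict rentals for a hypothetical temperature of 25 degrees Celsius
X_new = pd.DataFrame({'temp': [25]})
print(linreg.predict(X_new))
```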

Does the scale of the features matter?

Let's say that temperature was measured in Fahrenheit, rather than Celsius. How would that affect the model?
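We can test this directly by converting the column and refitting; a sketch:

```python
from sklearn.linear_model import LinearRegression

# create a Fahrenheit column from the Celsius temperatures and refit
bikes['temp_F'] = bikes.temp * 9 / 5 + 32

linreg_F = LinearRegression()
linreg_F.fit(bikes[['temp_F']], bikes.total)
print(linreg_F.intercept_, linreg_F.coef_)
```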

Result-

Let's plot again with this Fahrenheit change to see the effect-
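A sketch, reusing the earlier lmplot call on the Fahrenheit column:

```python
# same regression-line plot, but on the Fahrenheit column
sns.lmplot(x='temp_F', y='total', data=bikes, aspect=1.5, scatter_kws={'alpha': 0.3})
plt.show()
```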




Conclusion: The scale of the features is irrelevant for linear regression models. When changing the scale, we simply change our interpretation of the coefficients.

Let's explore other features-

We can also use the pandas library to build multiple scatter plots-
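A sketch of that, drawing one scatter subplot per candidate feature (the feature list is my assumption, based on the columns discussed below):

```python
import matplotlib.pyplot as plt

# one scatter plot per candidate feature, sharing the y-axis
feature_cols = ['temp', 'season', 'weather', 'humidity']
fig, axs = plt.subplots(1, len(feature_cols), sharey=True, figsize=(16, 4))
for index, feature in enumerate(feature_cols):
    bikes.plot(kind='scatter', x=feature, y='total', ax=axs[index], alpha=0.3)
plt.show()
```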

Are you seeing anything that you did not expect? Yes, the regression line is missing.

Now we will analyze our categorical data (categorical data consists of variables that are separated into mutually exclusive categories) with the help of the cross tabulation tool. tell me more about crosstab!
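A sketch of a crosstab call, tabulating season against the month of each observation:

```python
import pandas as pd

# cross-tabulation of season vs. the month taken from the datetime index
print(pd.crosstab(bikes.season, bikes.index.month))
```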
Let's plot a box plot of rentals with respect to season-
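A sketch using pandas' built-in boxplot:

```python
# box plot of rentals, grouped by season
bikes.boxplot(column='total', by='season')
plt.show()
```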


From the plot we notice that:

  • A line can't capture a non-linear relationship.
  • There are more rentals in winter than in spring.
Let's draw a line plot of rentals to see how they change over time-
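A sketch, plotting the response over the datetime index:

```python
# line plot of rentals over time
bikes.total.plot()
plt.show()
```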



Cool! There are more rentals in the winter than the spring, but only because the system is experiencing overall growth and the winter months happen to come after the spring months.

Let's find out the correlation and draw a heat map-
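A sketch using pandas' corr() with seaborn's heatmap:

```python
import seaborn as sns

# correlation matrix of the features, drawn as an annotated heat map
sns.heatmap(bikes.corr(), annot=True)
plt.show()
```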



Look at the above plot and think: what relationships do you notice?

Adding more features to the model-
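A sketch of the refit (the feature list is inferred from the coefficient interpretations below; the numbers quoted there are from the author's run):

```python
from sklearn.linear_model import LinearRegression

# refit the model with multiple features
feature_cols = ['temp', 'season', 'weather', 'humidity']
X = bikes[feature_cols]

linreg = LinearRegression()
linreg.fit(X, bikes.total)

# pair each feature name with its learned coefficient
print(list(zip(feature_cols, linreg.coef_)))
```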


Interpreting the coefficients:
  • Holding all other features fixed, a 1 unit increase in temperature is associated with a rental increase of 7.86 bikes.
  • Holding all other features fixed, a 1 unit increase in season is associated with a rental increase of 22.5 bikes.
  • Holding all other features fixed, a 1 unit increase in weather is associated with a rental increase of 6.67 bikes.
  • Holding all other features fixed, a 1 unit increase in humidity is associated with a rental decrease of 3.12 bikes.
Does anything look incorrect?

Feature selection-
How do we choose which features to include in the model? For this we're going to use train/test split (and eventually cross-validation).

Why not use p-values or R-squared for feature selection?
  • Linear models rely upon a lot of assumptions (such as the features being independent), and if those assumptions are violated, p-values and R-squared are less reliable. Train/test split relies on fewer assumptions.
  • Features that are unrelated to the response can still have significant p-values.
  • Adding features to your model that are unrelated to the response will always increase the R-squared value, and adjusted R-squared does not sufficiently account for this.
  • p-values and R-squared are proxies for our goal of generalization, whereas train/test split and cross-validation attempt to directly estimate how well the model will generalize to out-of-sample data.
Evaluation metrics for regression problems-
Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. We need evaluation metrics designed for comparing continuous values.
Here are three common evaluation metrics for regression problems:

  • Mean Absolute Error (MAE): the mean of the absolute errors, (1/n) Σ |yᵢ − ŷᵢ|
  • Mean Squared Error (MSE): the mean of the squared errors, (1/n) Σ (yᵢ − ŷᵢ)²
  • Root Mean Squared Error (RMSE): the square root of the MSE, √((1/n) Σ (yᵢ − ŷᵢ)²)

Don't be afraid of these formulas; Python's sklearn library makes them simple to use, as below-
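A sketch of those calls, on hypothetical true and predicted values chosen only for illustration:

```python
from sklearn import metrics
import numpy as np

# hypothetical true and predicted values, just to illustrate the metric functions
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

print('MAE: ', metrics.mean_absolute_error(true, pred))
print('MSE: ', metrics.mean_squared_error(true, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred)))
```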
Comparing these metrics:
  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
All of these are loss functions, because we want to minimize them.

Comparing models with train/test split and RMSE-
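A sketch of a small helper that does the split, fit, and RMSE computation for any list of features (the random_state value and the candidate feature sets are my assumptions):

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# helper: fit on a training split, return RMSE on the testing split
def train_test_rmse(feature_cols):
    X = bikes[feature_cols]
    y = bikes.total
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# compare a few candidate feature sets
print(train_test_rmse(['temp', 'season', 'weather', 'humidity']))
print(train_test_rmse(['temp', 'season', 'weather']))
print(train_test_rmse(['temp', 'season', 'humidity']))
```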
Comparing testing RMSE with null RMSE-
Null RMSE is the RMSE that could be achieved by always predicting the mean response value. It is a benchmark against which you may want to measure your regression model.
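A sketch of computing null RMSE, using the same split convention as the helper above:

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split

# null prediction: always predict the mean of the training response
X_train, X_test, y_train, y_test = train_test_split(
    bikes[['temp']], bikes.total, random_state=123)
y_null = np.full(len(y_test), y_train.mean())
print('Null RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_null)))
```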

Handling categorical features-
scikit-learn expects all features to be numeric, so how do we include a categorical feature in our model?
  • Ordered categories: transform them to sensible numeric values (example: small=1, medium=2, large=3)
  • Unordered categories: use dummy encoding (0/1)
What are the categorical features in our dataset?
  • Ordered categories: weather (already encoded with sensible numeric values)
  • Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)
For season, we can't simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship.
Instead, we create multiple dummy variables:
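A sketch with pandas' get_dummies:

```python
import pandas as pd

# one dummy column per season, e.g. season_1 ... season_4
season_dummies = pd.get_dummies(bikes.season, prefix='season')
print(season_dummies.head())
```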

However, we actually only need three dummy variables (not four), and thus we'll drop the first dummy variable.
Why? Because three dummies capture all of the "information" about the season feature, and implicitly define spring (season 1) as the baseline level:
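A sketch of dropping the first dummy and attaching the rest to the DataFrame:

```python
# drop the first dummy column, making spring (season 1) the implicit baseline
season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)

# attach the remaining dummy columns to the original DataFrame (axis=1 joins columns)
bikes = pd.concat([bikes, season_dummies], axis=1)
```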

In general, if you have a categorical feature with k possible values, you create k-1 dummy variables.
If that's confusing, think about why we only need one dummy variable for holiday, not two dummy variables (holiday_yes and holiday_no).
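A sketch of the refit with the dummy columns in place of 'season' (the exact feature list is my assumption; the coefficients quoted below are from the author's run):

```python
from sklearn.linear_model import LinearRegression

# rebuild the model, replacing 'season' with the three dummy columns
feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity']
X = bikes[feature_cols]

linreg = LinearRegression()
linreg.fit(X, bikes.total)
print(list(zip(feature_cols, linreg.coef_)))
```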



How do we interpret the season coefficients? 
They are measured against the baseline (spring):
  • Holding all other features fixed, summer is associated with a rental decrease of 3.39 bikes compared to the spring.
  • Holding all other features fixed, fall is associated with a rental decrease of 41.7 bikes compared to the spring.
  • Holding all other features fixed, winter is associated with a rental increase of 64.4 bikes compared to the spring.
Would it matter if we changed which season was defined as the baseline?
No, it would simply change our interpretation of the coefficients.

Important: Dummy encoding is relevant for all machine learning models, not just linear regression models.

Compare original season variable with dummy variables-
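A sketch of that comparison, reusing the train_test_rmse helper sketched earlier:

```python
# testing RMSE with the raw season column vs. the dummy-encoded seasons
print(train_test_rmse(['temp', 'season', 'humidity']))
print(train_test_rmse(['temp', 'season_2', 'season_3', 'season_4', 'humidity']))
```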

That's the end of our model, and we have come to know the following-
Advantages of linear regression:
  • Simple to explain
  • Highly interpretable
  • Model training and prediction are fast
  • No tuning is required (excluding regularization)
  • Features don't need scaling
  • Can perform well with a small number of observations
  • Well-understood
Disadvantages of linear regression:
  • Presumes a linear relationship between the features and the response
  • Performance is (generally) not competitive with the best supervised learning methods due to high bias
  • Can't automatically learn feature interactions
Hope you enjoyed this and learned something today!
Please try the above code in your notebook and explore more.










