
Statistically Thinking as a Data Scientist-2

Another post starts with you beautiful people!
I hope you learnt something from my previous post about computing the ECDF, plotting and comparing ECDFs, and the mean, percentiles, and covariance. If you haven't seen that post, please visit it here: statistically thinking as a data scientist-1

In this post we will learn the following topics-

  • Computing the Pearson correlation coefficient
  • Bernoulli trials
  • Binomial distribution
  • Poisson distributions
  • Normal PDF/CDF



Computing the Pearson correlation coefficient-
The Pearson correlation coefficient, also called Pearson's r, is often easier to interpret than the covariance. I suggest you read more about it here- A Must Read about Pearson Correlation Coefficient
It is computed using the np.corrcoef() function. Like np.cov(), it takes two arrays as arguments and returns a 2D array.
Entries [0,0] and [1,1] are necessarily equal to 1 (can you think about why?), and the value we are after is entry [0,1].
In this exercise, we will write a function, pearson_r(x, y), that takes in two arrays and returns the Pearson correlation coefficient.
We will then use this function to compute it for the petal lengths and widths of I. versicolor.
Again, we include the scatter plot we generated in my previous post to remind you how the petal width and length are related.
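A minimal sketch of such a function, with small toy arrays standing in for the I. versicolor petal data (which this post assumes from the previous one):

```python
import numpy as np

def pearson_r(x, y):
    """Compute the Pearson correlation coefficient between two arrays."""
    corr_mat = np.corrcoef(x, y)  # 2x2 correlation matrix
    return corr_mat[0, 1]         # the off-diagonal entry is the value we want

# Toy arrays standing in for the petal length/width measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])
print(pearson_r(x, y))
```

With the real iris data, simply pass versicolor_petal_length and versicolor_petal_width instead of the toy arrays.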

Try the above code in your notebook and find the value of the Pearson correlation coefficient!

Generating random numbers using the np.random module-
We will be hammering the np.random module for the rest of this post and its sequels.
Actually, we will probably call functions from this module more than any other while wearing our hacker statistician hat.
Let's start by taking its simplest function, np.random.random(), for a test spin. The function returns a random number between zero and one.
Call np.random.random() a few times in the IPython shell. You should see numbers jumping around between zero and one.
In this exercise, we'll generate lots of random numbers between zero and one, and then plot a histogram of the results. 
If the numbers are truly random, all bars in the histogram should be of (close to) equal height.
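A sketch of this experiment (the sample size of 100,000 and the bin count are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Draw 100,000 random numbers between zero and one
random_numbers = np.empty(100000)
for i in range(100000):
    random_numbers[i] = np.random.random()

# Plot a histogram; density=True normalizes the bar heights
plt.hist(random_numbers, bins=20, density=True)
plt.xlabel('random number')
plt.ylabel('probability density')
plt.show()
```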

Result-

Good work! The histogram is almost exactly flat across the top,
indicating that there is an equal chance that a randomly-generated number falls in any of the bins of the histogram.

The np.random module and Bernoulli trials-
We can think of a Bernoulli trial as a flip of a possibly biased coin. You can find more details about this here- Tell me more about Bernoulli trials
Specifically, each coin flip has a probability p of landing heads (success) and probability 1 − p of landing tails (failure).
In this exercise, we will write a function to perform n Bernoulli trials, perform_bernoulli_trials(n, p), 
which returns the number of successes out of n Bernoulli trials, each of which has probability p of success.
To perform each Bernoulli trial, use the np.random.random() function, which returns a random number between zero and one.
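A straightforward sketch of this function:

```python
import numpy as np

def perform_bernoulli_trials(n, p):
    """Perform n Bernoulli trials with success probability p
    and return the number of successes."""
    n_success = 0
    for i in range(n):
        random_number = np.random.random()
        # A success is when the random number is less than p
        if random_number < p:
            n_success += 1
    return n_success

np.random.seed(42)
print(perform_bernoulli_trials(100, 0.05))
```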

How many defaults might we expect?-
Let's say a bank made 100 mortgage loans. It is possible that anywhere between 0 and 100 of the loans will be defaulted upon.
We would like to know the probability of getting a given number of defaults, given that the probability of a default is p = 0.05.
To investigate this, we will do a simulation. We will perform 100 Bernoulli trials using the perform_bernoulli_trials() function we wrote in the previous exercise and record how many defaults we get. Here, a success is a default.
(Remember that the word "success" just means that the Bernoulli trial evaluates to True, i.e., did the loan recipient default?)
We will do this for another 100 Bernoulli trials. And again and again until we have tried it 1000 times.
Then, we will plot a histogram describing the probability of the number of defaults.
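A sketch of the simulation (with a vectorized one-liner standing in for perform_bernoulli_trials() so the snippet is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt

def perform_bernoulli_trials(n, p):
    # Count successes out of n trials, each with success probability p
    return np.sum(np.random.random(size=n) < p)

np.random.seed(42)

# Repeat the 100-loan simulation 1000 times
n_defaults = np.empty(1000)
for i in range(1000):
    n_defaults[i] = perform_bernoulli_trials(100, 0.05)

plt.hist(n_defaults, density=True)
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('probability')
plt.show()
```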

Result-

Nice work! This is actually not an optimal way to plot a histogram when the results are known to be integers. We will revisit this in forthcoming exercises.

Will the bank fail?
If interest rates are such that the bank will lose money if 10 or more of its loans are defaulted upon, what is the probability that the bank will lose money?
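A sketch of the calculation (n_defaults comes from the previous simulation; it is regenerated here with np.random.binomial purely so the snippet is self-contained):

```python
import numpy as np

np.random.seed(42)

# Equivalent to 1000 runs of 100 Bernoulli trials with p = 0.05
n_defaults = np.random.binomial(100, 0.05, size=1000)

# Probability of losing money = fraction of simulations with >= 10 defaults
n_lose_money = np.sum(n_defaults >= 10)
print('Probability of losing money =', n_lose_money / len(n_defaults))
```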

Result-
As we might expect, we most likely get 5/100 defaults. But we still have about a 2% chance of getting 10 or more defaults out of 100 loans.

Sampling out of the Binomial distribution-
Compute the probability mass function for the number of defaults we would expect for 100 loans as in the last section, but instead of simulating all of the Bernoulli trials, perform the sampling using np.random.binomial().
This is identical to the calculation we did in the last set of exercises using our custom-written perform_bernoulli_trials() function, but far more computationally efficient. Given this extra efficiency, we will take 10,000 samples instead of 1,000.
After taking the samples, plot the CDF as last time. This CDF that we are plotting is that of the Binomial distribution.

Note: For this exercise and all going forward, the random number generator is pre-seeded for you (with np.random.seed(42)) to save you typing that each time.
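A sketch of the sampling and the ECDF plot (the ecdf computation follows the previous post):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Draw 10,000 samples directly from the Binomial distribution
n_defaults = np.random.binomial(n=100, p=0.05, size=10000)

# ECDF: sorted samples vs. the fraction of data at or below each value
x = np.sort(n_defaults)
y = np.arange(1, len(x) + 1) / len(x)

plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('CDF')
plt.show()
```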

Result-

Great work! If you know the story, using built-in algorithms to directly sample out of the distribution is much faster.

Plotting the Binomial PMF-
Plotting a nice looking PMF requires a bit of matplotlib trickery that we will not go into here. 
Instead, we will plot the PMF of the Binomial distribution as a histogram with skills you have already learned. 
The trick is setting up the edges of the bins to pass to plt.hist() via the bins keyword argument. 
We want the bins centered on the integers. So, the edges of the bins should be -0.5, 0.5, 1.5, 2.5, ..., up to max(n_defaults) + 1.5.
We can generate an array like this using np.arange() and then subtracting 0.5 from the array.
We have already sampled out of the Binomial distribution during our exercises on loan defaults, and the resulting samples are in the NumPy array n_defaults.
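A sketch of the bin trick (n_defaults is regenerated here so the snippet is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n_defaults = np.random.binomial(100, 0.05, size=10000)

# Bin edges centered on the integers: -0.5, 0.5, ..., max(n_defaults) + 1.5
bins = np.arange(0, max(n_defaults) + 2.5) - 0.5

plt.hist(n_defaults, bins=bins, density=True)
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('PMF')
plt.show()
```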

Result-

Relationship between Binomial and Poisson distributions-
We just heard that the Poisson distribution is a limit of the Binomial distribution for rare events.
This makes sense if you think about the stories. Say we do a Bernoulli trial every minute for an hour, each with a success probability of 0.1.
We would do 60 trials, and the number of successes is Binomially distributed; we would expect to get about 6 successes.
So, the Poisson distribution with arrival rate equal to np approximates a Binomial distribution for n Bernoulli trials with probability p of success (with n large and p small).
Importantly, the Poisson distribution is often simpler to work with because it has only one parameter instead of two for the Binomial distribution.
Let's explore these two distributions computationally. We will compute the mean and standard deviation of samples from a Poisson distribution with an arrival rate of 10. Then, we will compute the mean and standard deviation of samples from a Binomial distribution with parameters n and p such that np = 10.
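A sketch of this comparison (the specific n and p values below are illustrative choices that keep np = 10):

```python
import numpy as np

np.random.seed(42)

# 10,000 samples from a Poisson distribution with arrival rate 10
samples_poisson = np.random.poisson(10, size=10000)
print('Poisson:', samples_poisson.mean(), samples_poisson.std())

# Binomial distributions with np = 10, for n increasing and p decreasing
n = [20, 100, 1000]
p = [0.5, 0.1, 0.01]
for i in range(3):
    samples_binomial = np.random.binomial(n[i], p[i], size=10000)
    print('n =', n[i], 'Binom:', samples_binomial.mean(), samples_binomial.std())
```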
The means are all about the same, which can be shown to be true by doing some pen-and-paper work. 
The standard deviation of the Binomial distribution gets closer and closer to that of the Poisson distribution as the probability p gets lower and lower.

The Normal PDF-
In this exercise, we will explore the Normal PDF and also learn a way to plot a PDF of a known distribution using hacker statistics. 

Specifically, we will plot a Normal PDF for various values of the variance.
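A sketch of this exercise, approximating each PDF with a normalized histogram of samples (the mean of 20 and standard deviations of 1, 3, and 10 match the results discussed below):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# 100,000 samples each from Normals with mean 20 and std 1, 3, 10
samples_std1 = np.random.normal(20, 1, size=100000)
samples_std3 = np.random.normal(20, 3, size=100000)
samples_std10 = np.random.normal(20, 10, size=100000)

# Normalized histograms approximate the PDFs
plt.hist(samples_std1, bins=100, density=True, histtype='step')
plt.hist(samples_std3, bins=100, density=True, histtype='step')
plt.hist(samples_std10, bins=100, density=True, histtype='step')
plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.show()
```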

Result-

Great work! You can see how the different standard deviations result in PDFs of different widths. The peaks are all centered at the mean of 20.

The Normal CDF-
Now that we have a feel for how the Normal PDF looks, let's consider its CDF.
Using the samples we generated in the last exercise (in our namespace as samples_std1, samples_std3, and samples_std10), generate and plot the CDFs.
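A sketch of the CDF plot (the samples are regenerated here, and the ecdf() helper follows the previous post, so the snippet is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(data):
    """Return x (sorted data) and y (fraction of points at or below x)."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

np.random.seed(42)
samples_std1 = np.random.normal(20, 1, size=100000)
samples_std3 = np.random.normal(20, 3, size=100000)
samples_std10 = np.random.normal(20, 10, size=100000)

for samples in (samples_std1, samples_std3, samples_std10):
    x, y = ecdf(samples)
    plt.plot(x, y, marker='.', linestyle='none')
plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right')
plt.show()
```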

Result-

Great work! The CDFs all pass through the mean at the 50th percentile; the mean and median of a Normal distribution are equal. 

The width of the CDF varies with the standard deviation.

Key Note-
Remember that the mean of a Normal, or Gaussian, distribution is the location of the PDF's maximum along the x-axis, not the maximal value itself, while the standard deviation is about half the width of the PDF at about 2/3 of the way up to the peak.
Note the three lines dividing the distribution: the central line denotes where the mean is, while the lines on either side of it represent one standard deviation from the mean.

Try the above code in your notebook and explore other datasets too.




