
Statistically Thinking as a Data Scientist-2

Another post starts with you beautiful people!
I hope you learnt something from my previous post about computing the ECDF, plotting and comparing ECDFs, and the mean, percentiles, and covariance. If you haven't seen that post, please visit it here: statistically thinking as a data scientist-1

In this post we will learn the following topics-

  • Computing the Pearson correlation coefficient
  • Bernoulli trials
  • Binomial distribution
  • Poisson distributions
  • Normal PDF/CDF



Computing the Pearson correlation coefficient-
The Pearson correlation coefficient, also called Pearson's r, is often easier to interpret than the covariance. I suggest you read more about it here- A Must Read about Pearson Correlation Coefficient
It is computed using the np.corrcoef() function. Like np.cov(), it takes two arrays as arguments and returns a 2D array.
Entries [0,0] and [1,1] are necessarily equal to 1 (can you think about why?), and the value we are after is entry [0,1].
In this exercise, we will write a function, pearson_r(x, y), that takes in two arrays and returns the Pearson correlation coefficient.
We will then use this function to compute it for the petal lengths and widths of I. versicolor.
Again, we include the scatter plot we generated in my previous post to remind you how the petal width and length are related.
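A minimal sketch of such a function, with small toy arrays standing in for the I. versicolor petal data (which this post assumes from the previous one):

```python
import numpy as np

def pearson_r(x, y):
    """Compute the Pearson correlation coefficient between two arrays."""
    corr_mat = np.corrcoef(x, y)  # 2x2 correlation matrix
    return corr_mat[0, 1]         # the off-diagonal entry is the value we want

# Toy arrays standing in for the petal length/width measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])
print(pearson_r(x, y))
```

With the real iris data, simply pass versicolor_petal_length and versicolor_petal_width instead of the toy arrays.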

Try the above code in your notebook and find the value of the Pearson correlation coefficient!

Generating random numbers using the np.random module-
We will be hammering the np.random module for the rest of this post and its sequels.
Actually, we will probably call functions from this module more than any other while wearing our hacker statistician hat.
Let's start by taking its simplest function, np.random.random(), for a test spin. The function returns a random number between zero and one.
Call np.random.random() a few times in the IPython shell. You should see numbers jumping around between zero and one.
In this exercise, we'll generate lots of random numbers between zero and one, and then plot a histogram of the results. 
If the numbers are truly random, all bars in the histogram should be of (close to) equal height.
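A sketch of this experiment (the sample size of 100,000 and the bin count are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Draw 100,000 random numbers between zero and one
random_numbers = np.empty(100000)
for i in range(100000):
    random_numbers[i] = np.random.random()

# Plot a histogram; density=True normalizes the bar heights
plt.hist(random_numbers, bins=20, density=True)
plt.xlabel('random number')
plt.ylabel('probability density')
plt.show()
```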

Result-

Good work! The histogram is almost exactly flat across the top,
indicating that there is an equal chance that a randomly-generated number falls in any of the bins of the histogram.

The np.random module and Bernoulli trials-
We can think of a Bernoulli trial as a flip of a possibly biased coin. You can find more details about this here- Tell me more about Bernoulli trials
Specifically, each coin flip has a probability p of landing heads (success) and probability 1 − p of landing tails (failure).
In this exercise, we will write a function to perform n Bernoulli trials, perform_bernoulli_trials(n, p), 
which returns the number of successes out of n Bernoulli trials, each of which has probability p of success.
To perform each Bernoulli trial, use the np.random.random() function, which returns a random number between zero and one.
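A straightforward sketch of this function:

```python
import numpy as np

def perform_bernoulli_trials(n, p):
    """Perform n Bernoulli trials with success probability p
    and return the number of successes."""
    n_success = 0
    for i in range(n):
        random_number = np.random.random()
        # A success is when the random number is less than p
        if random_number < p:
            n_success += 1
    return n_success

np.random.seed(42)
print(perform_bernoulli_trials(100, 0.05))
```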

How many defaults might we expect?-
Let's say a bank made 100 mortgage loans. It is possible that anywhere between 0 and 100 of the loans will be defaulted upon.
We would like to know the probability of getting a given number of defaults, given that the probability of a default is p = 0.05.
To investigate this, we will do a simulation. We will perform 100 Bernoulli trials using the perform_bernoulli_trials() function we wrote in the previous exercise and record how many defaults we get. Here, a success is a default.
(Remember that the word "success" just means that the Bernoulli trial evaluates to True, i.e., did the loan recipient default?)
We will do this for another 100 Bernoulli trials. And again and again until we have tried it 1000 times.
Then, we will plot a histogram describing the probability of the number of defaults.
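A sketch of the simulation (with a vectorized one-liner standing in for perform_bernoulli_trials() so the snippet is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt

def perform_bernoulli_trials(n, p):
    # Count successes out of n trials, each with success probability p
    return np.sum(np.random.random(size=n) < p)

np.random.seed(42)

# Repeat the 100-loan simulation 1000 times
n_defaults = np.empty(1000)
for i in range(1000):
    n_defaults[i] = perform_bernoulli_trials(100, 0.05)

plt.hist(n_defaults, density=True)
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('probability')
plt.show()
```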

Result-

Nice work! This is actually not an optimal way to plot a histogram when the results are known to be integers. We will revisit this in forthcoming exercises.

Will the bank fail?
If interest rates are such that the bank will lose money if 10 or more of its loans are defaulted upon, what is the probability that the bank will lose money?
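A sketch of the calculation (n_defaults comes from the previous simulation; it is regenerated here with np.random.binomial purely so the snippet is self-contained):

```python
import numpy as np

np.random.seed(42)

# Equivalent to 1000 runs of 100 Bernoulli trials with p = 0.05
n_defaults = np.random.binomial(100, 0.05, size=1000)

# Probability of losing money = fraction of simulations with >= 10 defaults
n_lose_money = np.sum(n_defaults >= 10)
print('Probability of losing money =', n_lose_money / len(n_defaults))
```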

Result-
As we might expect, we most likely get 5/100 defaults. But we still have about a 2% chance of getting 10 or more defaults out of 100 loans.

Sampling out of the Binomial distribution-
Compute the probability mass function for the number of defaults we would expect for 100 loans as in the last section, but instead of simulating all of the Bernoulli trials, perform the sampling using np.random.binomial().
This is identical to the calculation we did in the last set of exercises using our custom-written perform_bernoulli_trials() function, but far more computationally efficient. Given this extra efficiency, we will take 10,000 samples instead of 1,000.
After taking the samples, plot the CDF as last time. This CDF that we are plotting is that of the Binomial distribution.

Note: For this exercise and all going forward, the random number generator is pre-seeded for you (with np.random.seed(42)) to save you typing that each time.
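A sketch of the sampling and the ECDF plot (the ecdf computation follows the previous post):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Draw 10,000 samples directly from the Binomial distribution
n_defaults = np.random.binomial(n=100, p=0.05, size=10000)

# ECDF: sorted samples vs. the fraction of data at or below each value
x = np.sort(n_defaults)
y = np.arange(1, len(x) + 1) / len(x)

plt.plot(x, y, marker='.', linestyle='none')
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('CDF')
plt.show()
```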

Result-

Great work! If you know the story, using built-in algorithms to directly sample out of the distribution is much faster.

Plotting the Binomial PMF-
Plotting a nice looking PMF requires a bit of matplotlib trickery that we will not go into here. 
Instead, we will plot the PMF of the Binomial distribution as a histogram with skills you have already learned. 
The trick is setting up the edges of the bins to pass to plt.hist() via the bins keyword argument. 
We want the bins centered on the integers. So, the edges of the bins should be -0.5, 0.5, 1.5, 2.5, ..., up to max(n_defaults) + 1.5.
We can generate an array like this using np.arange() and then subtracting 0.5 from the array.
We have already sampled out of the Binomial distribution during our exercises on loan defaults, and the resulting samples are in the NumPy array n_defaults.
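A sketch of the bin trick (n_defaults is regenerated here so the snippet is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n_defaults = np.random.binomial(100, 0.05, size=10000)

# Bin edges centered on the integers: -0.5, 0.5, ..., max(n_defaults) + 1.5
bins = np.arange(0, max(n_defaults) + 2.5) - 0.5

plt.hist(n_defaults, bins=bins, density=True)
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('PMF')
plt.show()
```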

Result-

Relationship between Binomial and Poisson distributions-
We just heard that the Poisson distribution is a limit of the Binomial distribution for rare events.
This makes sense if you think about the stories. Say we do a Bernoulli trial every minute for an hour, each with a success probability of 0.1.
We would do 60 trials, and the number of successes is Binomially distributed; we would expect to get about 6 successes.
So, the Poisson distribution with arrival rate equal to np approximates a Binomial distribution for n Bernoulli trials with probability p of success (with n large and p small).
Importantly, the Poisson distribution is often simpler to work with because it has only one parameter instead of two for the Binomial distribution.
Let's explore these two distributions computationally. We will compute the mean and standard deviation of samples from a Poisson distribution with an arrival rate of 10. Then, we will compute the mean and standard deviation of samples from a Binomial distribution with parameters n and p such that np = 10.
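A sketch of this comparison (the specific n and p values below are illustrative choices that keep np = 10):

```python
import numpy as np

np.random.seed(42)

# 10,000 samples from a Poisson distribution with arrival rate 10
samples_poisson = np.random.poisson(10, size=10000)
print('Poisson:', samples_poisson.mean(), samples_poisson.std())

# Binomial distributions with np = 10, for n increasing and p decreasing
n = [20, 100, 1000]
p = [0.5, 0.1, 0.01]
for i in range(3):
    samples_binomial = np.random.binomial(n[i], p[i], size=10000)
    print('n =', n[i], 'Binom:', samples_binomial.mean(), samples_binomial.std())
```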
The means are all about the same, which can be shown to be true by doing some pen-and-paper work. 
The standard deviation of the Binomial distribution gets closer and closer to that of the Poisson distribution as the probability p gets lower and lower.

The Normal PDF-
In this exercise, we will explore the Normal PDF and also learn a way to plot a PDF of a known distribution using hacker statistics. 

Specifically, we will plot a Normal PDF for various values of the variance.
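A sketch of this exercise, approximating each PDF with a normalized histogram of samples (the mean of 20 and standard deviations of 1, 3, and 10 match the results discussed below):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# 100,000 samples each from Normals with mean 20 and std 1, 3, 10
samples_std1 = np.random.normal(20, 1, size=100000)
samples_std3 = np.random.normal(20, 3, size=100000)
samples_std10 = np.random.normal(20, 10, size=100000)

# Normalized histograms approximate the PDFs
plt.hist(samples_std1, bins=100, density=True, histtype='step')
plt.hist(samples_std3, bins=100, density=True, histtype='step')
plt.hist(samples_std10, bins=100, density=True, histtype='step')
plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.show()
```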

Result-

Great work! You can see how the different standard deviations result in PDFs of different widths. The peaks are all centered at the mean of 20.

The Normal CDF-
Now that we have a feel for how the Normal PDF looks, let's consider its CDF.
Using the samples we generated in the last exercise (in our namespace as samples_std1, samples_std3, and samples_std10), generate and plot the CDFs.
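A sketch of the CDF plot (the samples are regenerated here, and the ecdf() helper follows the previous post, so the snippet is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(data):
    """Return x (sorted data) and y (fraction of points at or below x)."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

np.random.seed(42)
samples_std1 = np.random.normal(20, 1, size=100000)
samples_std3 = np.random.normal(20, 3, size=100000)
samples_std10 = np.random.normal(20, 10, size=100000)

for samples in (samples_std1, samples_std3, samples_std10):
    x, y = ecdf(samples)
    plt.plot(x, y, marker='.', linestyle='none')
plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right')
plt.show()
```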

Result-

Great work! The CDFs all pass through the mean at the 50th percentile; the mean and median of a Normal distribution are equal. 

The width of the CDF varies with the standard deviation.

Key Note-
Remember that the mean of a Normal, or Gaussian, distribution is the location of the PDF's maximum along the x-axis, not the maximal value itself, while the standard deviation is about half the width of the PDF at about 2/3 of the way up to the peak.
Note the three lines dividing the distribution: the central line denotes where the mean is, while the lines on either side of it represent one standard deviation from the mean.

Try the above code in your notebook and explore other datasets too.




