Skip to main content

Central Limit Theorem and Hypothesis Testing

Another post starts with you beautiful people!
Today we will learn about an important topic related to statistics.

Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. Inferential statistical analysis infers properties about a population: this includes testing hypotheses and deriving estimates.
Statistics are helpful in analyzing most collections of data. Hypothesis testing can justify conclusions even when no scientific theory exists.

You can find more about this here- tell me more about Statistical_hypothesis_testing

Here our case study will be Average Experience of Data Science Specialization(DSS) batch taught in a leading University with Statistical Inference.
We will aim to study how accurately can we characterize the actual average participant experience (population mean) from the samples of data (sample mean). We can quantify the certainty of outcome through the confidence intervals.


Let's plot the distribution of experience-

Result:-

Estimating DSS Experience from samples:-

We are now drawing samples for 1000 times (NUM_TRIALS) and compute the mean each time. The distribution is plotted to identify range of values it can take. The original data has experience raging between 0 years and 20 years and spread across it.
Result:-


The above plot is histogram of mean of samples for any given n. 
As you vary n from 1 to 128 you will see the following summary for the Sampling Distribution of Mean chart-
n = 1, Mean = 10.301, Std Dev = 5.689, 5% Pct = 1.000, 95% Pct = 19.000
n = 2, Mean = 10.431, Std Dev = 3.984, 5% Pct = 4.000, 95% Pct = 17.000
n = 4, Mean = 10.388, Std Dev = 2.878, 5% Pct = 5.500, 95% Pct = 15.000
n = 8, Mean = 10.407, Std Dev = 2.031, 5% Pct = 7.000, 95% Pct = 13.875
n = 10, Mean = 10.455, Std Dev = 1.794, 5% Pct = 7.600, 95% Pct = 13.405
n = 16, Mean = 10.467, Std Dev = 1.372, 5% Pct = 8.188, 95% Pct = 12.688
n = 32, Mean = 10.470, Std Dev = 0.984, 5% Pct = 8.844, 95% Pct = 12.125
n = 64, Mean = 10.446, Std Dev = 0.693, 5% Pct = 9.297, 95% Pct = 11.609

n = 128, Mean = 10.451, Std Dev = 0.502, 5% Pct = 9.625, 95% Pct = 11.258

One important point is to note the trend in values of Std Dev in the above table as n increases.

As n increases the Std Dev of Sampling distribution of Mean reduces from 5.689 (n = 1) to 0.502 (n = 128). 
As n increases the Sampling distribution of Mean looks more normal. 
It is observed that when n >= 30 irrespective of the original distribution the Sampling distribution of Mean becomes normal distribution.

Definition of Central Limit Theorem:-
In probability theory, the central limit theorem (CLT) establishes that, for the most commonly studied scenarios, when independent random variables are added, their sum tends toward a normal distribution (commonly known as a bell curve) even if the original variables themselves are not normally distributed [Ref: Wikipedia]


The key point to note is that irrespective of the original distribution then sum (and hence average as well) of large samples tend towards normal distribution. This gives power to estimate the confidence intervals.

Normal distribution has following characteristics:_


Function to check if the true mean lies within 90% Confidence Interval:-

Getting the average experience estimate from sample and Confidence Intervals:-
Now given the sample size n we can estimate the sample mean and confidence interval. The confidence interval is estimated assuming normal distribution which really holds good when n >= 30.

When n is increased the confidence interval becomes smaller which implies that results are obtained with higher certainty.

Execute the code below multiple times and check how often the population mean of 10.435 will lie within 90% confidence interval. It should be on average 9 out of 10 times i.e 90%-

Hypothesis Testing(HT):-
Let us define the Hypotheses as follows:

  • H0 : Average Experience of Current Batch & Previous batch are same
  • H1 : Average Experience of Current Batch & Previous batch are different
Repeat the same steps for previous batch-




Result:-


Below code describes to do hypothesis testing based on the intuition from Sampling distribution of mean. There will be differences with exact math based on various conditions and assumptions made.

Show the frequency distribution of experience:-

Code for dumping the experiment results to a csv file:-

Demonstrate how the standard deviation of sample mean reduce and the distribution goes closer to normal distribution:-


Stats functions for Hypothesis Testing taken from here- hypothesis-testing-comparing-two-groups


Comments


  1. After reading this blog i very strong in this topics and this blog really helpful to all.Data Science online Course Bangalore



    ReplyDelete

Post a Comment

Popular posts from this blog

How to use opencv-python with Darknet's YOLOv4?

Another post starts with you beautiful people 😊 Thank you all for messaging me your doubts about Darknet's YOLOv4. I am very happy to see in a very short amount of time my lovely aspiring data scientists have learned a state of the art object detection and recognition technique. If you are new to my blog and to computer vision then please check my following blog posts one by one- Setup Darknet's YOLOv4 Train custom dataset with YOLOv4 Create production-ready API of YOLOv4 model Create a web app for your YOLOv4 model Since now we have learned to use YOLOv4 built on Darknet's framework. In this post, I am going to share with you how can you use your trained YOLOv4 model with another awesome computer vision and machine learning software library-  OpenCV  and of course with Python 🐍. Yes, the Python wrapper of OpenCV library has just released it's latest version with support of YOLOv4 which you can install in your system using below command- pip install opencv-pyt...

How to convert your YOLOv4 weights to TensorFlow 2.2.0?

Another post starts with you beautiful people! Thank you all for your overwhelming response in my last two posts about the YOLOv4. It is quite clear that my beloved aspiring data scientists are very much curious to learn state of the art computer vision technique but they were not able to achieve that due to the lack of proper guidance. Now they have learnt exact steps to use a state of the art object detection and recognition technique from my last two posts. If you are new to my blog and want to use YOLOv4 in your project then please follow below two links- How to install and compile Darknet code with GPU? How to train your custom data with YOLOv4? In my  last post we have trained our custom dataset to identify eight types of Indian classical dance forms. After the model training we have got the YOLOv4 specific weights file as 'yolo-obj_final.weights'. This YOLOv4 specific weight file cannot be used directly to either with OpenCV or with TensorFlow currently becau...

Detecting Credit Card Fraud As a Data Scientist

Another post starts with you beautiful people! Hope you have learnt something from my previous post about  machine learning classification real world problem Today we will continue our machine learning hands on journey and we will work on an interesting Credit Card Fraud Detection problem. The goal of this exercise is to anonymize credit card transactions labeled as fraudulent or genuine. For your own practice you can download the dataset from here-  Download the dataset! About the dataset:  The datasets contains transactions made by credit cards in September 2013 by european cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions. Let's start our analysis with loading the dataset first:- As per the  official documentation -  features V1, V2, ... V28 are the principal compo...