Another post starts with you beautiful people!
I hope you enjoyed my previous post about CRUD operations in a relational database.
As we saw in one of my previous posts about EDA, exploring data by plotting it and then computing simple summary statistics is a crucial first step in the statistical analysis of data.
For the exercises in this section, we will use a classic data set, iris, collected by botanist Edgar Anderson and made famous by Ronald Fisher, one of the most prolific statisticians in history.
Anderson carefully measured the anatomical properties of samples of three different species of iris: Iris setosa, Iris versicolor, and Iris virginica.
We will learn the following topics with our dataset-
- Computing the ECDF
- Plotting the ECDF
- Comparison of ECDFs
- Computing means
- Computing percentiles
- Comparing percentiles to ECDF
- Computing the variance
- Standard deviation
- Computing the covariance
The subset of the data set containing the Iris versicolor petal lengths in units of centimeters (cm) is stored in the NumPy array versicolor_petal_length.
First we load our dataset into a variable and then store the required attributes of this dataset in appropriate variables for further use in this exercise-
Our dataframe will look like below-
Let's check the species column-
It's always good to convert the dataframe columns to NumPy arrays before plotting-
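The loading steps above can be sketched as follows. This is only a minimal sketch: it assumes the copy of the iris data bundled with scikit-learn and its column names (such as "petal length (cm)"), which may differ from the file used in the original notebook. The variable versicolor_petal_width is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: we use the iris data shipped with scikit-learn
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]  # map 0/1/2 to species names

print(df.head())               # first rows of the dataframe
print(df["species"].unique())  # check the species column

# Keep only the versicolor rows and convert the columns to NumPy arrays
versicolor = df[df["species"] == "versicolor"]
versicolor_petal_length = versicolor["petal length (cm)"].to_numpy()
versicolor_petal_width = versicolor["petal width (cm)"].to_numpy()
```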
Computing the ECDF-
In this exercise, we will write a function that takes a 1D array of data as input and returns the x and y values of the ECDF (empirical cumulative distribution function).
We will use this function over and over again throughout this course and its sequel. ECDFs are among the most important plots in statistical analysis.
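One possible way to write such a function is shown here. It is a sketch rather than the only correct definition: the x values are the sorted data, and the y values are the fraction of points less than or equal to each x.

```python
import numpy as np

def ecdf(data):
    """Compute the ECDF for a one-dimensional array of measurements."""
    n = len(data)                # number of data points
    x = np.sort(data)            # x-data: the measurements in ascending order
    y = np.arange(1, n + 1) / n  # y-data: evenly spaced from 1/n up to 1
    return x, y
```

For example, ecdf(np.array([3, 1, 2])) returns the sorted values 1, 2, 3 together with the cumulative fractions 1/3, 2/3 and 1.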
Plotting the ECDF-
We will now use our ecdf() function to compute the ECDF for the petal lengths of Anderson's Iris versicolor flowers. We will then plot the ECDF. Recall that our ecdf() function returns two arrays so we will need to unpack them.
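A minimal plotting sketch with Matplotlib could look like this. The short petal_length array holds a few illustrative values only; in the notebook you would pass the full versicolor_petal_length array instead. The ecdf() function is repeated so that the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(data):
    """Return the sorted values x and the cumulative fractions y."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Illustrative values only; replace with versicolor_petal_length
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3])

# ecdf() returns two arrays, so we unpack them
x_vers, y_vers = ecdf(petal_length)

fig, ax = plt.subplots()
ax.plot(x_vers, y_vers, marker=".", linestyle="none")  # points, no connecting line
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("ECDF")
ax.margins(0.02)  # keep the points away from the plot edges
```

In a notebook the figure is displayed at the end of the cell; in a plain script you would call plt.show() afterwards.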
Result-
Comparison of ECDFs-
ECDFs also allow us to compare two or more distributions (though plots get cluttered if we have too many). Here, we will plot ECDFs for the petal lengths of all three iris species. We already wrote a function to generate ECDFs so we can put it to good use!
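The comparison can be sketched by calling ecdf() once per species and drawing all three on the same axes. The samples dictionary below contains a handful of illustrative values per species, not the full data set; it is an assumption made to keep the snippet self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(data):
    """Return the sorted values x and the cumulative fractions y."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Illustrative petal lengths (cm); replace with the full arrays per species
samples = {
    "setosa": np.array([1.4, 1.3, 1.5, 1.7, 1.4]),
    "versicolor": np.array([4.7, 4.5, 4.9, 4.0, 4.6]),
    "virginica": np.array([6.0, 5.1, 5.9, 5.6, 5.8]),
}

fig, ax = plt.subplots()
for species, values in samples.items():
    x, y = ecdf(values)
    ax.plot(x, y, marker=".", linestyle="none", label=species)

ax.legend(loc="lower right")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("ECDF")
```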
Result-
The ECDFs expose clear differences among the species.
Setosa petals are much shorter and show less absolute variability in length than those of versicolor and virginica.
Computing means-
The mean of all measurements gives an indication of the typical magnitude of a measurement. It is computed using np.mean().
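As a small sketch, the mean can be computed in one line. The array below holds a few illustrative values; in the notebook you would use versicolor_petal_length.

```python
import numpy as np

# Illustrative values only; replace with versicolor_petal_length
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6])

# np.mean() adds up all measurements and divides by their count
mean_length = np.mean(petal_length)
print("I. versicolor mean petal length:", mean_length, "cm")
```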
Computing percentiles-
In this exercise, we will compute the percentiles of petal length of Iris versicolor.
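A sketch using np.percentile() is shown here. The choice of the 2.5th, 25th, 50th, 75th and 97.5th percentiles is an assumption made for illustration, and the data array again contains illustrative values only.

```python
import numpy as np

# Illustrative values only; replace with versicolor_petal_length
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3])

# Assumed choice of percentiles for this example
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# np.percentile() returns one value per requested percentile
ptiles_vers = np.percentile(petal_length, percentiles)
print(ptiles_vers)
```

The 50th percentile is the median, so ptiles_vers[2] matches np.median(petal_length).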
Comparing percentiles to ECDF-
To see how the percentiles relate to the ECDF, we will plot the percentiles of Iris versicolor petal lengths we calculated in the last exercise on the ECDF plot.
Note that to ensure the Y-axis of the ECDF plot remains between 0 and 1, you will need to rescale the percentiles array accordingly - in this case, dividing it by 100.
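The overlay could be sketched as follows, with the percentiles drawn as red diamonds on top of the ECDF. The percentile choice and the data values are illustrative assumptions, as before.

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(data):
    """Return the sorted values x and the cumulative fractions y."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Illustrative values only; replace with versicolor_petal_length
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3])
percentiles = np.array([2.5, 25, 50, 75, 97.5])  # assumed choice
ptiles_vers = np.percentile(petal_length, percentiles)

x_vers, y_vers = ecdf(petal_length)

fig, ax = plt.subplots()
ax.plot(x_vers, y_vers, marker=".", linestyle="none")
# Divide by 100 so the percentiles share the 0 to 1 scale of the ECDF
ax.plot(ptiles_vers, percentiles / 100, marker="D", color="red", linestyle="none")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("ECDF")
```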
Result-
Box-and-whisker plot-
Making a box plot for the petal lengths is unnecessary because the iris data set is not too large and the bee swarm plot works fine.
However, it is always good to get some practice. Make a box plot of the iris petal lengths.
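A box plot can be sketched with plain Matplotlib as shown here. The original notebook may well have used a different plotting library, so treat this as one possible approach; the samples dictionary holds illustrative values only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative petal lengths (cm); replace with the full arrays per species
samples = {
    "setosa": np.array([1.4, 1.3, 1.5, 1.7, 1.4]),
    "versicolor": np.array([4.7, 4.5, 4.9, 4.0, 4.6]),
    "virginica": np.array([6.0, 5.1, 5.9, 5.6, 5.8]),
}

fig, ax = plt.subplots()
# One box per species; boxes are placed at x = 1, 2, 3
box = ax.boxplot(list(samples.values()))
ax.set_xticklabels(list(samples.keys()))
ax.set_xlabel("species")
ax.set_ylabel("petal length (cm)")
```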
Result-
Please refer to the screenshot below of the 2008 US election results, which makes the key properties of a box plot easy to understand-
Computing the variance-
It is important to have some understanding of what commonly-used functions are doing under the hood.
Though we may already know how to compute variances, in this exercise, we will explicitly compute the variance of the petal length of Iris versicolor using the equations.
We will then use np.var() to compute it.
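The explicit calculation can be sketched as the mean of the squared differences from the mean, followed by the NumPy one-liner for comparison. The data values are illustrative only.

```python
import numpy as np

# Illustrative values only; replace with versicolor_petal_length
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3])

# Step 1: difference of each measurement from the mean
differences = petal_length - np.mean(petal_length)

# Step 2: square the differences and average them
variance_explicit = np.mean(differences ** 2)

# The same quantity computed by NumPy (population variance, the default)
variance_np = np.var(petal_length)

print(variance_explicit, variance_np)
```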
The standard deviation and the variance-
The standard deviation is the square root of the variance.
We will see this for ourselves by computing the standard deviation using np.std() and comparing it to what we get by computing the variance with np.var() and then taking the square root.
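A short sketch of that comparison, again with illustrative values in place of the full array:

```python
import numpy as np

# Illustrative values only; replace with versicolor_petal_length
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3])

variance = np.var(petal_length)

# Both lines should print the same number
std_from_variance = np.sqrt(variance)
std_direct = np.std(petal_length)
print(std_from_variance)
print(std_direct)
```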
Scatter plots-
When we made bee swarm plots, box plots, and ECDF plots in previous exercises, we compared the petal lengths of different species of iris.
But what if we want to compare two properties of a single species? This is exactly what we will do in this exercise.
We will make a scatter plot of the petal length and width measurements of Anderson's Iris versicolor flowers.
If the flower scales (that is, it preserves its proportion as it grows), we would expect the length and width to be correlated.
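The scatter plot could be sketched like this. The two arrays hold a few illustrative length and width pairs; in the notebook you would use the full versicolor arrays (the name versicolor_petal_width is an assumption).

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative values only; replace with the full versicolor arrays
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6])
petal_width = np.array([1.4, 1.5, 1.5, 1.3, 1.5])

fig, ax = plt.subplots()
# One point per flower: length on the x-axis, width on the y-axis
ax.plot(petal_length, petal_width, marker=".", linestyle="none")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.margins(0.02)
```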
Result-
Indeed, we see some correlation. Longer petals also tend to be wider.
Key Note-
If you have multiple plots, look at the spread in the x-direction: the plot with the largest spread is the one with the highest variance in x.
High positive covariance means that when x is high, y tends to be high too, and when x is low, y tends to be low.
Negative covariance means that when x is high, y tends to be low, and when x is low, y tends to be high.
Computing the covariance-
The covariance may be computed using the NumPy function np.cov().
For example, if we have two sets of data x and y, np.cov(x, y) returns a 2D array in which entries [0,1] and [1,0] are the covariances.
Entry [0,0] is the variance of the data in x, and entry [1,1] is the variance of the data in y.
This 2D output array is called the covariance matrix, since it organizes the variances (self-covariances) and the covariance.
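A sketch of the calculation with a few illustrative length and width pairs follows. One detail worth knowing: by default np.cov() divides by n - 1, while np.var() divides by n, so the diagonal entries match np.var(..., ddof=1) rather than the plain np.var() result.

```python
import numpy as np

# Illustrative values only; replace with the full versicolor arrays
petal_length = np.array([4.7, 4.5, 4.9, 4.0, 4.6])
petal_width = np.array([1.4, 1.5, 1.5, 1.3, 1.5])

# 2x2 covariance matrix of the two variables
covariance_matrix = np.cov(petal_length, petal_width)
print(covariance_matrix)

# Entry [0, 1] (equal to entry [1, 0]) is the covariance of length and width
petal_cov = covariance_matrix[0, 1]
print(petal_cov)
```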
Compute the covariance matrix for the versicolor petal lengths and widths in your notebook and read off the covariance of length and width from entry [0,1].
In the next chapter, Statistically Thinking as a Data Scientist-2, we will learn about computing the Pearson correlation coefficient, Bernoulli trials, the Binomial distribution, Poisson distributions and the Normal PDF/CDF.