
When dealing with problems in statistics and machine learning, one of the most frequently encountered concepts is covariance. While most of us know that variance represents the variation of values within a single variable, we may be less sure what covariance stands for. Knowing covariance provides much more information for solving multivariate problems: most methods for preprocessing and predictive analysis depend on it. Multivariate outlier detection, dimensionality reduction, and regression can be given as examples.

In this article, I am going to explain five things that you should know about covariance. Instead of starting from the Wikipedia definition, we will try to understand it from its formula. After reading this article, you will be able to answer the following questions.

  • How is covariance calculated?
  • What does covariance tell us?
  • What is a strong covariance?
  • What does the covariance matrix tell you?
  • What do the eigenvectors and eigenvalues of the covariance matrix give us?

1 — The Formula of Variance and Covariance

It is better to go over variance first in order to understand covariance. The variance describes how the values vary within a single variable: it depends on how far the values are from each other. Take a look at Formula 1 to see how variance is calculated.

Formula 1 — Variance formulas according to the known and unknown population mean

In the formula, the mean of the variable is subtracted from each value. After the differences are squared, their sum is divided by the number of values (N) in that variable. So what happens when the variance is low or high? You can see Figure 1 to understand the difference.

Figure 1 — Difference between high and low variance (Image by author)
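As a quick numerical sketch of Formula 1 (the values here are made up for illustration), the two versions of the variance differ only in the divisor, N for a known population and N-1 for a sample:

```python
import numpy as np

# Hypothetical values; any 1-D array works.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Population variance: divide by N (use when the whole population is known).
var_population = np.var(x)          # ddof=0 is NumPy's default

# Sample variance: divide by N-1 (use when only a sample is available).
var_sample = np.var(x, ddof=1)

print(var_population)  # 5.0
print(var_sample)      # 6.666...
```

The `ddof` ("delta degrees of freedom") argument is how NumPy switches between the two divisors.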

Now, it is time to look at the covariance formula. It is as simple as the variance formula. Unlike variance, covariance is calculated between two different variables. Its purpose is to produce a value that indicates how these two variables vary together. In the covariance formula, each variable's difference from its own mean is taken, and the two differences are multiplied. You can see Formula 2 to understand it clearly.

Formula 2 — Covariance formulas according to the known and unknown population mean

The only difference between variance and covariance is that the values and means of two variables are used instead of one. Now, let’s take a look at the second thing that you should know.

Note: As you can see from Formula 1 and Formula 2, there are two versions of each formula, depending on whether the population mean is known. When we work on sample data, we don’t know the population mean, only the sample mean. That’s why we should use the formula with N-1. When we have the whole population, we can use the one with N.
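The sample version of Formula 2 can be written out directly and checked against NumPy (the data here are made-up illustration values):

```python
import numpy as np

# Hypothetical paired samples.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Sample covariance (divide by N-1), following Formula 2 with sample means.
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov also divides by N-1 by default; [0, 1] is the off-diagonal entry.
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)  # both 3.333...
```

If you ever need the population version (divide by N), `np.cov(x, y, bias=True)` does that instead.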

2— The Covariance Matrix

The second thing that you should know is the covariance matrix. Because covariance can only be calculated between two variables, a covariance matrix represents the covariance value of each pair of variables in multivariate data. Also, the covariance of a variable with itself equals its variance, so the diagonal shows the variance of each variable. Suppose there are two variables, x and y, in our data set. The covariance matrix then looks like Formula 3.

Formula 3 – 2 and 3-dimensional covariance matrices

It is a symmetric matrix that shows the covariance of each pair of variables. The values in the covariance matrix show the magnitude and direction of the spread of multivariate data in multidimensional space. By inspecting these values, we can tell how the data spread across the dimensions.
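These properties are easy to verify numerically. In this sketch (synthetic data, generated only for illustration), the diagonal of `np.cov` matches the individual variances and the matrix is symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 500 observations of two correlated variables.
x = rng.normal(0, 1, 500)
y = 0.8 * x + rng.normal(0, 0.5, 500)

# Each argument is one variable; the result is the 2x2 covariance matrix.
cov = np.cov(x, y)

print(cov[0, 0], np.var(x, ddof=1))  # diagonal entry equals the variance of x
print(cov[0, 1], cov[1, 0])          # off-diagonal entries are equal (symmetry)
```

With more than two variables, you can pass a single 2-D array to `np.cov`; by default each row is treated as one variable.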

3 — Positive, Negative, and Zero States of the Covariance

The third thing that you should know about covariance is its positive, negative, and zero states. We can go over the formula to understand them. When Xi-Xmean and Yi-Ymean are both negative or both positive at the same time, their product is positive. If the sum of these products is positive, the covariance comes out positive. It means variables X and Y vary in the same direction. In other words, when a value in variable X is high, the corresponding value in variable Y is expected to be high too. In short, there is a positive relationship between them. A negative covariance is interpreted as exactly the opposite: there is a negative relationship between the two variables.

The covariance can only be zero when the sum of the products of Xi-Xmean and Yi-Ymean is zero. In practice, the sample covariance comes out near zero when the positive and negative products cancel each other out. In such a scenario, there is no linear relationship between the variables. To understand it clearly, you can see the following Figure 2.

Figure 2 — Positive, negative, and near-zero covariance (Image by author)

As another possible scenario, we can have a distribution like the one in Figure 3. It happens when the covariance is near zero but the variances of the two variables differ.

Figure 3 — Different variances and near-zero covariance (Image by author)

4 — The Size of the Covariance Value

Unlike correlation, covariance values are not limited to the range between -1 and 1. Therefore, it may be wrong to conclude that there is a strong relationship between variables just because the covariance is high. The size of the covariance depends on the scale of the values in the variables. For instance, if the values range between 1000 and 2000, it is possible to get a high covariance. However, if the values range between 1 and 2 in both variables, the covariance will be low. Therefore, we can’t say the relationship in the first example is stronger than in the second. The covariance indicates only the joint variation and the direction of the relation between two variables. You can understand it from Figure 4.

Figure 4 — High covariance values versus low covariance values (Image by author)

Although the covariance in the first plot is much larger, the relationship in the second plot can be just as strong or stronger. (The values in Figure 4 are given as examples; they aren’t from any data set.)
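To illustrate the point numerically (synthetic data, not the values from Figure 4): rescaling both variables by 1000 multiplies the covariance by a factor of one million, while the correlation, which normalizes away the scale, stays the same.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 1000)
y = x + rng.normal(0, 0.3, 1000)

# Identical relationship, just expressed in larger units.
cov_small = np.cov(x, y)[0, 1]
cov_big = np.cov(1000 * x, 1000 * y)[0, 1]

corr_small = np.corrcoef(x, y)[0, 1]
corr_big = np.corrcoef(1000 * x, 1000 * y)[0, 1]

print(cov_big / cov_small)   # ~1_000_000: covariance scales with the units
print(corr_small, corr_big)  # identical: correlation is scale-free
```

This is exactly why a large covariance on its own does not prove a strong relationship.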

5 — Eigenvalues and Eigenvectors of Covariance Matrix

What do eigenvalues and eigenvectors tell us? They are an essential part of working with the covariance matrix. Methods that use the covariance matrix to find the magnitude and direction of the data points rely on its eigenvalues and eigenvectors. For example, in PCA the eigenvalues represent the magnitude of the spread in the direction of the principal components. In Figure 5, the first and second plots show the distribution of points when the covariance is near zero; in that case the eigenvalues are directly equal to the variances. The third and fourth plots show the distribution when the covariance is different from zero; unlike the first two, the eigenvalues and eigenvectors must be computed for these.

Figure 5 — Eigenvalues and Eigenvectors of covariance and their effects on direction and magnitude (Image by author)

As can be seen from Figure 5, the eigenvalues represent the magnitude of the spread for both variables x and y, and the eigenvectors show the direction. When the covariance is positive, the angle of the spread can be found from the arccosine of the value v[0,0]. If the covariance is negative, the cosine of the value v[0,0] gives the spread direction.

How do you find eigenvalues and eigenvectors from the covariance matrix? You can find both using NumPy in Python. First, compute the covariance matrix with numpy.cov(). Then pass it to numpy.linalg.eig() to obtain the eigenvalues and eigenvectors.
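Putting those two calls together (the data are synthetic, generated only for this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 1000)
y = 0.6 * x + rng.normal(0, 0.4, 1000)

cov = np.cov(x, y)                               # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)

# Each eigenvalue gives the magnitude of the spread along one direction;
# the matching column of `eigenvectors` is that (unit-length) direction.
print(eigenvalues)
print(eigenvectors)

# The total variance is preserved: the eigenvalues sum to the trace,
# i.e. the sum of the two variances on the diagonal.
print(eigenvalues.sum(), cov[0, 0] + cov[1, 1])
```

Since a covariance matrix is symmetric, `numpy.linalg.eigh()` is an equally valid (and numerically preferable) choice here.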

You can read my other article to find out how eigenvalues are used in principal component analysis.

PCA: Where to Use and How to Use — Understanding how PCA works, in a visual way (sergencansiz.medium.com)

Conclusions

Covariance is one of the most widely used measurements in data science. Knowing covariance in detail provides many opportunities for understanding multivariate data. That is why I wanted to share with you these five things that you should know about covariance. Please feel free to leave a comment if you have any questions or recommendations.
