The correlation coefficient measures the degree of association between two variables. Calculating it indicates whether two data series depend on each other. Unlike regression, correlation does not by itself allow you to predict the values of a quantity. Nevertheless, calculating the coefficient is an important step in preliminary statistical analysis. Suppose, for example, we find that the correlation coefficient between the level of foreign direct investment and the GDP growth rate is high. This suggests that a favorable climate for foreign entrepreneurs in particular may matter for prosperity, although the correlation alone does not prove it. Not such an obvious conclusion at first glance!
Correlation and causality
Perhaps no other area of statistics has entered our lives so firmly. The correlation coefficient is used in every field of social knowledge. Its main danger is that high values are often exploited to persuade people and make them accept certain conclusions. In fact, however, a strong correlation does not by itself indicate a causal relationship between the quantities.
Correlation coefficient: Pearson and Spearman formulas
There are several key indicators that characterize the relationship between two variables. Historically, the first is the Pearson linear correlation coefficient. It is first taught at school. It was developed by Karl Pearson and G. Udny Yule on the basis of the work of Francis Galton. This coefficient shows the relationship between numerical variables that change linearly with respect to each other. It always lies between -1 and 1, inclusive. A negative value indicates an inverse relationship: as one variable grows, the other tends to fall. If the coefficient is zero, there is no linear relationship between the variables. A positive value indicates a direct relationship between the quantities under study: they tend to grow together. Spearman's rank correlation coefficient simplifies the calculation by ordering the values of each variable and working with their ranks instead of the raw values.
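The two coefficients described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `pearson`, `ranks`, and `spearman` and the sample data are made up for this example. Pearson's coefficient is computed as the sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations. Spearman's coefficient is computed here as Pearson's coefficient applied to ranks, with tied values sharing an average rank.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))   # sum of products of deviations
    var_x = sum(a * a for a in dx)             # sum of squared deviations of x
    var_y = sum(b * b for b in dy)             # sum of squared deviations of y
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Replace each value by its rank (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson's coefficient computed on ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y grows with x, but along a curve rather than a line.
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]
print(round(pearson(xs, ys), 3))   # below 1: the trend is not a straight line
print(round(spearman(xs, ys), 3))  # 1.0: the ordering of the values matches exactly
```

On this sample the rank-based coefficient reaches 1 because the two orderings coincide, while the linear coefficient stays below 1 because the points do not lie on a straight line.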
Relationship between variables
Correlation helps to answer two questions. First, is the relationship between the variables positive or negative? Second, how strong is the dependence? Correlation analysis is a powerful tool for obtaining this information. It is easy to see that family income and expenses rise and fall together. Such a relationship is considered positive. Conversely, as the price of a good rises, demand for it falls. Such a relationship is called negative. The values of the correlation coefficient lie in the range from -1 to 1. Zero means that there is no dependence between the quantities under study. The closer the coefficient is to either extreme, the stronger the relationship (negative or positive). As a rule of thumb, a coefficient between -0.1 and 0.1 is read as an absence of dependence. It is important to understand that such a value indicates only the absence of a linear relationship: the variables may still be related in a nonlinear way.
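The three situations described above (a positive relationship, a negative one, and a coefficient of zero that hides a nonlinear dependence) can be illustrated with a short sketch. All numbers are hypothetical and chosen only for illustration, and the helper `pearson` is an assumption of this example rather than a library function.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation of two equal-length samples."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical family budgets: expenses rise together with income.
income = [20, 30, 40, 50]
expenses = [18, 25, 33, 41]
r_positive = pearson(income, expenses)

# Hypothetical market: demand falls as the price rises.
price = [1, 2, 3, 4]
demand = [90, 70, 55, 30]
r_negative = pearson(price, demand)

# y is fully determined by x (y = x squared), yet the relationship is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]
r_curved = pearson(x, y)

print(r_positive > 0, r_negative < 0)  # True True
print(abs(r_curved) < 1e-12)           # True: no linear relationship is detected
```

The last case shows why a coefficient near zero should not be read as "the variables are unrelated": here one variable is an exact function of the other.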
The use of both coefficients comes with certain caveats. First, a strong relationship does not mean that one quantity determines the other. There may well be a third quantity that determines each of them. Second, a high Pearson correlation coefficient does not indicate a causal relationship between the variables under study. Third, it captures only linear dependence. Correlation is suitable for meaningful quantitative data (for example, atmospheric pressure or air temperature), not for categories such as gender or favorite color.
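The first caveat, a hidden third quantity that drives both variables, can be demonstrated with a small simulation. This is a sketch under assumed conditions: the variables `z`, `x`, and `y`, the noise levels, the sample size, and the seed are all invented for this example. Here `x` and `y` never influence each other, yet they correlate strongly because both follow `z`.

```python
import math
import random


def pearson(xs, ys):
    """Pearson linear correlation of two equal-length samples."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so that the run is repeatable
n = 2000

z = [rng.gauss(0, 1) for _ in range(n)]    # hidden third quantity
x = [zi + rng.gauss(0, 0.5) for zi in z]   # driven by z plus its own noise
y = [zi + rng.gauss(0, 0.5) for zi in z]   # driven by z plus its own noise, never by x

r_xy = pearson(x, y)

# Subtract the contribution of z: what remains is independent noise.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = pearson(x_rest, y_rest)

print(round(r_xy, 2))    # strong correlation although x does not cause y
print(round(r_rest, 2))  # close to zero once z is accounted for
```

In real data the third quantity is usually unknown or unmeasured, so it cannot simply be subtracted as it is here; the simulation only shows how such a variable can produce a strong coefficient on its own.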
Multiple correlation coefficient
The Pearson and Spearman coefficients describe the relationship between two variables. But what should you do when there are three or more? This is where the multiple correlation coefficient comes to the rescue. For example, gross domestic product is influenced not only by foreign direct investment, but also by the government's monetary and fiscal policies, as well as by the level of exports. The growth rate and volume of GDP result from the interaction of a number of factors. However, it should be understood that the multiple correlation model rests on a number of simplifications and assumptions. First, multicollinearity is ruled out: the explanatory variables must not be strongly correlated with one another. Second, the relationship between the dependent variable and the variables affecting it is assumed to be linear.
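For the simplest case of one dependent variable and two explanatory variables, the multiple correlation coefficient R can be written in terms of the three pairwise Pearson coefficients: R squared equals (r_y1^2 + r_y2^2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12^2). The sketch below applies this standard two-predictor formula. The function names and the data are assumptions made for illustration, and the data are constructed so that `y` is an exact linear combination of `x1` and `x2`.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation of two equal-length samples."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


def multiple_r(y, x1, x2):
    """Multiple correlation of y with two predictors, from pairwise coefficients."""
    r_y1 = pearson(y, x1)
    r_y2 = pearson(y, x2)
    r_12 = pearson(x1, x2)
    # The denominator vanishes when the predictors are collinear,
    # which is why multicollinearity has to be ruled out.
    if abs(r_12) > 0.999999:
        raise ValueError("predictors are (almost) perfectly collinear")
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    # Clamp to [0, 1] to absorb floating-point rounding.
    return math.sqrt(min(max(r_squared, 0.0), 1.0))


# Hypothetical stand-ins for two factors and an outcome; y = 2*x1 + 3*x2 exactly.
x1 = [1, 2, 3, 4]
x2 = [1, -1, -1, 1]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

big_r = multiple_r(y, x1, x2)
print(round(big_r, 6))                                        # 1.0
print(round(pearson(y, x1), 3), round(pearson(y, x2), 3))     # each below 1
```

Neither factor alone correlates perfectly with the outcome, but together they explain it completely, which is the situation the multiple coefficient is meant to capture.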
Applications of correlation and regression analysis
This method of finding relationships between quantities is widely used in statistics. It is most often applied in three main cases:
- To test for a causal relationship between the values of two variables. Here the researcher hopes to find a linear relationship and derive a formula that describes it. The two quantities may be measured in different units.
- To check whether the quantities are related. In this case, no variable is designated as dependent. It may turn out that the values of both quantities are determined by some other factor.
- To derive an equation. Once it is obtained, you can simply substitute known numbers into it and find the value of the unknown variable.
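The third case, deriving an equation and substituting numbers into it, can be sketched with ordinary least squares for a straight line y = slope * x + intercept. The slope is the sum of products of deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The function name `fit_line` and the data are assumptions for this example, and the data are deliberately exactly linear so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("x must take at least two distinct values")
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that follow y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)          # 2.0 1.0

# Substitute a new x into the derived equation to estimate the unknown y.
predicted = slope * 10 + intercept
print(predicted)                 # 21.0
```

With real, noisy data the fitted line will not pass through every point, and a prediction far outside the observed range of x should be treated with caution.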
The human search for causation
The human mind is built in such a way that we feel a need to explain the events happening around us. A person is always looking for a connection between their picture of the world and the information they receive. Often the brain creates order out of chaos. It can easily see a causal relationship where there is none. Scientists have to learn specifically to overcome this tendency. The ability to assess relationships between data objectively is essential in an academic career.
Consider how a correlation can be misinterpreted. A group of British students with behavior problems were asked whether their parents smoked. The results were then published in a newspaper. They showed a strong correlation between parents' smoking and their children's misbehavior. The professor who conducted the study even suggested putting a warning on cigarette packs. However, there are a number of problems with this conclusion. First, correlation does not show which of the quantities is independent. It is therefore just as possible to suppose that the parents' smoking habit is caused by their children's disobedience. Second, it is impossible to say with certainty that both problems did not arise from some third factor, for example low family income. The emotional side of the professor's initial conclusions should also be noted. He was an ardent opponent of smoking, so it is not surprising that he interpreted the results of his research in this way.
Misinterpreting correlation as a causal relationship between two variables can lead to embarrassing errors in research. The problem is that this tendency lies at the very core of human thinking. Many marketing tricks are built on precisely this feature. Understanding the difference between causation and correlation makes it possible to analyze information rationally, both in everyday life and in a professional career.