Introduction
Partial and semipartial correlation are two techniques that are closely related both to simple correlation and to regression and multiple regression.
Correlation versus Regression
Let’s begin by reviewing the difference between correlation and regression. The two techniques are related: the mathematics behind the scenes is very similar, and you can get one from the other. Nonetheless, their goals are different.
Correlation:
- tests for an association between variables, while making no assumption about causality, i.e. whether one variable is dependent on the other.
- the variables under consideration are often interdependent, perhaps controlled by some other third process that we cannot measure.
Regression:
- is used when you have good mathematical, business, or physical intuition to believe that one variable causes the change in another, for example when you want to predict one variable from measurements of the other.
- attempts to describe the dependence of a response variable on an explanatory variable, and implicitly assumes a one-way causal effect.
- the independent variable is often manipulated experimentally, and we want to predict the dependent variable from the independent one.
In reality you often see people use these methods interchangeably. Partial correlation does indeed blur the line between pure association (correlation) and assumed one-way dependence (regression), because it brings confounding variables into the picture.
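To illustrate the point that you can get one technique from the other, here is a minimal sketch on synthetic data (the data and variable names are made up for illustration). It shows that the least-squares slope of a simple regression equals the correlation coefficient rescaled by the ratio of the two standard deviations.

```python
import numpy as np

# Synthetic data (hypothetical): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Pearson correlation: symmetric in x and y, no direction implied.
r = np.corrcoef(x, y)[0, 1]

# Least-squares regression of y on x (np.polyfit returns the slope first).
slope, intercept = np.polyfit(x, y, 1)

# The same slope recovered from r and the two sample standard deviations.
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(f"r = {r:.3f}, slope = {slope:.3f}, slope from r = {slope_from_r:.3f}")
```

The correlation is unchanged if x and y are swapped, whereas the regression slope is not, which reflects the different goals of the two methods.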
Partial Correlation
Partial correlation is often useful because natural systems are complex, with many interacting and interdependent processes.
For example, the abundance of clouds over the ocean is often correlated with the amount of aerosol particles in the atmosphere: when there is a lot of aerosol there are a lot of clouds (and vice versa). However, both aerosols and cloud cover can be correlated with wind speed, so wind speed could be what we would call a mediating or confounding variable.
It could be that some (or maybe all) of the correlation between aerosols and cloud cover occurs because both are correlated with this third factor, wind speed. What we really want to know is: what is the strength of the correlation between aerosols and cloud cover after accounting for the fact that wind speed might affect both of them?
Goals of Partial Correlation
Type of data: To perform partial correlation you need two continuous main variables (the ones you are really interested in) and at least one potentially mediating / confounding variable (usually continuous, but it can also be categorical).
Purpose: To test for an association between the two main variables after accounting for / controlling for the effect(s) of one or more potentially mediating / confounding variable(s).
Mathematics
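The partial correlation coefficient between our two variables ($x$ and $y$) after controlling for a confounding variable ($z$) is:

$$r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{\left(1 - r_{xz}^{2}\right)\left(1 - r_{yz}^{2}\right)}}$$

It is the simple correlation coefficient (that is $r_{xy}$) minus the product of the correlations of each main variable with the confounding variable ($r_{xz}\, r_{yz}$), divided by a term that rescales for the variance the confounding variable leaves unexplained in each main variable.
The equation here has a single confounding variable. For hypothesis testing, the coefficient that you get can be converted into something that follows a t-distribution, where the degrees of freedom are $n - 3$ for $n$ observations (more generally $n - 2 - k$ for $k$ confounding variables).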
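As a minimal sketch of the single-confounder case, the snippet below computes the first-order partial correlation from the three pairwise Pearson coefficients and converts it to a t statistic with n - 3 degrees of freedom. The function names (`partial_corr`, `t_statistic`) and the synthetic aerosol / cloud / wind data are hypothetical, chosen only to mirror the example in the text.

```python
import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def t_statistic(r, n, k=1):
    """t statistic and degrees of freedom for a partial correlation
    with k confounding variables (df = n - 2 - k)."""
    df = n - 2 - k
    return r * np.sqrt(df / (1 - r**2)), df

# Synthetic data (hypothetical): wind drives both aerosol and cloud,
# and there is no direct link between aerosol and cloud.
rng = np.random.default_rng(1)
n = 500
wind = rng.normal(size=n)
aerosol = 0.8 * wind + rng.normal(size=n)
cloud = 0.8 * wind + rng.normal(size=n)

r_simple = np.corrcoef(aerosol, cloud)[0, 1]
r_partial = partial_corr(aerosol, cloud, wind)
t, df = t_statistic(r_partial, n)

print(f"simple r = {r_simple:.3f}, partial r = {r_partial:.3f}")
print(f"t = {t:.3f} on {df} degrees of freedom")
# A two-sided p-value could then be obtained from the t-distribution,
# e.g. 2 * scipy.stats.t.sf(abs(t), df) if SciPy is available.
```

In this simulated setting the simple correlation is clearly positive while the partial correlation is close to zero, because the only link between the two main variables runs through wind speed.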
Interpreting the Partial Correlation Coefficient
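Another way to think about partial correlation is with a Venn diagram example. Suppose the blue circle shows the range of values of variable $x$, and a second circle shows the range of values of variable $y$. The area where the two circles overlap represents the variance that the two variables share, which is what the simple correlation measures.
So then in partial correlation, we add a third confounding variable $z$ (the orange circle), which overlaps the original two variables. This reduces the central overlap area, which now represents the variance in variable $x$ that is shared with variable $y$ but not with $z$.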
Alternative interpretation of the Partial Correlation Coefficient
Another way to think about partial correlation is in the context of regression: the partial correlation coefficient is the same as the simple correlation between the residuals of our two main variables, each regressed on the confounding variable. Suppose we take the residuals from a regression of aerosol amount on wind speed, where the residuals are the differences between the observed values and the best-fit line (e.g. the vertical red line shown). The residuals tell us what is left over in the aerosol value after taking account of the effect of wind speed: what aerosols are versus what they should be, given the wind speed.
We can do the same thing for cloud abundance. We get the residuals of cloud abundance after accounting for wind speed and the residuals of aerosols after accounting for wind speed, plot one against the other (see the larger graph on the right-hand side of the figure), and finally calculate the correlation between the two. Since partial correlation is the relationship between the two variables after accounting for the effect of the confounding variable, it is the same thing as the simple correlation between these residuals.
If you do this, you find that the p-value for the correlation between the residuals is not the same as the p-value for the partial correlation. This is because the simple correlation test on the residuals does not handle the degrees of freedom correctly: we would have to subtract an extra degree of freedom, used up when fitting the confounding variable, that is not accounted for in the diagram. Conceptually, though, the correlation coefficient that you get is the same whether you use the partial correlation formula or calculate the simple correlation between the residuals.
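The residual view can be checked numerically. The sketch below uses synthetic data and hypothetical helper names. It regresses each main variable on wind speed, correlates the residuals, and compares the result with the three-correlation formula. It also shows how the t statistic changes when the degrees of freedom are corrected from n - 2 to n - 3.

```python
import numpy as np

def residuals(v, z):
    """Residuals of a straight-line least-squares fit of v on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Synthetic data (hypothetical), kept small so the df correction is visible.
rng = np.random.default_rng(2)
n = 30
wind = rng.normal(size=n)
aerosol = 0.7 * wind + rng.normal(size=n)
cloud = 0.5 * wind + 0.4 * aerosol + rng.normal(size=n)

# Route 1: simple correlation between the two sets of residuals.
r_resid = np.corrcoef(residuals(aerosol, wind), residuals(cloud, wind))[0, 1]

# Route 2: partial correlation formula from the pairwise correlations.
r_ac = np.corrcoef(aerosol, cloud)[0, 1]
r_aw = np.corrcoef(aerosol, wind)[0, 1]
r_cw = np.corrcoef(cloud, wind)[0, 1]
r_formula = (r_ac - r_aw * r_cw) / np.sqrt((1 - r_aw**2) * (1 - r_cw**2))

# Same coefficient, different test statistic depending on the df used.
t_naive = r_resid * np.sqrt((n - 2) / (1 - r_resid**2))    # residuals treated as raw data
t_partial = r_resid * np.sqrt((n - 3) / (1 - r_resid**2))  # one df removed for wind

print(f"residual route r = {r_resid:.6f}, formula route r = {r_formula:.6f}")
print(f"t with n-2 df = {t_naive:.3f}, t with n-3 df = {t_partial:.3f}")
```

The two coefficients agree to floating-point precision, while the naive t statistic is slightly larger in magnitude, which is why the p-value from a plain correlation test on the residuals is a little too optimistic.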
Semipartial Correlation
Partial correlation is the more commonly used of the two methods; semipartial correlation is used less often.
The two are quite similar. The only difference is that with semipartial correlation the confounding variable is thought to influence, and is therefore controlled for in, only one of the two main variables and not both of them (as in partial correlation).
So for example we might have wind speed affecting only aerosol amount but not cloud abundance; that is, wind speed is a confounding variable for aerosol amount but not for cloud abundance. In this case we run a semipartial correlation to assess the correlation between aerosols and cloud cover while accounting for the confounding effect of wind speed on aerosol abundance only. In other words, it tests how much aerosols add to our knowledge of cloud abundance above and beyond what we would expect from the effect of wind speed on aerosols. That is: “do aerosols add anything useful to our knowledge?”
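A minimal sketch of the semipartial case, again on synthetic data with hypothetical names: wind speed is removed from aerosol amount only, and the aerosol residuals are then correlated with the raw cloud values. For comparison, the snippet also evaluates the standard closed-form expression for the semipartial coefficient and the full partial correlation.

```python
import numpy as np

def residuals(v, z):
    """Residuals of a straight-line least-squares fit of v on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Synthetic data (hypothetical): wind influences aerosol, aerosol influences cloud.
rng = np.random.default_rng(3)
n = 300
wind = rng.normal(size=n)
aerosol = 0.7 * wind + rng.normal(size=n)
cloud = 0.5 * aerosol + rng.normal(size=n)

# Semipartial: wind is removed from aerosol only; cloud is left untouched.
r_semi = np.corrcoef(residuals(aerosol, wind), cloud)[0, 1]

# Closed form: (r_ac - r_aw * r_cw) / sqrt(1 - r_aw^2).
r_ac = np.corrcoef(aerosol, cloud)[0, 1]
r_aw = np.corrcoef(aerosol, wind)[0, 1]
r_cw = np.corrcoef(cloud, wind)[0, 1]
r_semi_formula = (r_ac - r_aw * r_cw) / np.sqrt(1 - r_aw**2)

# Partial, for comparison: wind is removed from both variables.
r_part = np.corrcoef(residuals(aerosol, wind), residuals(cloud, wind))[0, 1]

print(f"semipartial r = {r_semi:.4f} (formula: {r_semi_formula:.4f})")
print(f"partial r     = {r_part:.4f}")
```

The semipartial coefficient is never larger in magnitude than the partial one, because its denominator keeps all of the variance in cloud abundance rather than only the part left unexplained by wind speed.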
Partial Correlation Assumptions
The assumptions for partial correlation and semipartial correlation are the same as those for simple correlation.
For parametric (Pearson’s) partial correlation coefficient:
- All pairs of variables are assumed to have a linear relationship (not curved, and with no outliers).
- Points are assumed to be independent of each other (no time series, or spatial correlation).
- Pairs of variables are assumed to be bivariate normal. So rather than a bell curve you effectively have a bell-shaped mountain, but in practice you just assess whether each main variable is approximately normal on its own.
Just like in simple (or “zero-order”) correlation, you can do non-parametric (Spearman or Kendall) partial correlation for non-linear (but still monotonic) and non-normal data.
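The text does not say how the non-parametric version is computed. One common approach, assumed here, is to replace each variable by its ranks and apply the Pearson partial correlation formula to the ranks. The sketch below does this for Spearman. The helper names are hypothetical and the simple ranking function assumes there are no tied values.

```python
import numpy as np

def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    return np.argsort(np.argsort(v)) + 1.0

def partial_corr(x, y, z):
    """First-order Pearson partial correlation of x and y, controlling for z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def spearman_partial(x, y, z):
    """Spearman partial correlation: Pearson partial correlation of the ranks."""
    return partial_corr(ranks(x), ranks(y), ranks(z))

# Synthetic data (hypothetical): monotonic but strongly non-linear relationships.
rng = np.random.default_rng(4)
n = 200
wind = rng.normal(size=n)
aerosol = np.exp(wind + 0.5 * rng.normal(size=n))
cloud = wind**3 + rng.normal(size=n)

rho_partial = spearman_partial(aerosol, cloud, wind)
print(f"Spearman partial rho = {rho_partial:.3f}")
```

Because only the ranks enter the calculation, the result is unchanged by any strictly increasing transformation of a variable (for example taking the logarithm of the aerosol values), which is what makes the approach suitable for curved but monotonic relationships.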
Reporting the results of partial correlation
You should give:
- Description of the relationship you are assessing - including the confounding variable(s) - and for semipartial correlation you should specify which one of the main variables is being influenced by the confounding ones.
- Partial (or semipartial) correlation coefficient (specifying which one you used out of Pearson r, Spearman rho, or Kendall tau). This is important because of course it tells you the strength of the actual relationship.
- The p-value, i.e. the statistical significance, which goes in tandem with the strength of the relationship.
- Perhaps a scatter plot showing the results (using residuals).
For example:
There is no significant correlation between aerosols and cloud abundance after controlling for wind speed, which was our confounding variable (Pearson partial correlation $r = \ldots$, $p = \ldots$).