#### Statistical thinking: Correlation

### The correlation coefficient

Students recognize and interpret the **correlation coefficient, r**.

The coefficient *r* is a single number between −1 and 1, with no units.

It describes the strength and direction of the linear association between two variables. For example, age and height in children are two variables that are associated.

A correlation of 1 is a perfect positive correlation: the data points fall exactly on a rising straight line, so a change in one variable is always matched by a proportional change in the other variable in the same direction.

A correlation of −1 is a perfect negative correlation: the data points fall exactly on a falling straight line, so a change in one variable is always matched by a proportional change in the other variable in the opposite direction.

A correlation of 0 means that the two variables have no linear association with each other. They may still be related in a nonlinear way.
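To make these properties concrete, here is a minimal sketch that computes *r* from its usual definition (the sum of products of deviations, divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the age and height values are made up for illustration; they are not taken from any study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical ages (years) and heights (cm) for six children.
ages = [2, 4, 6, 8, 10, 12]
heights_cm = [86, 102, 115, 128, 138, 149]
heights_in = [h / 2.54 for h in heights_cm]

r_cm = pearson_r(ages, heights_cm)
r_in = pearson_r(ages, heights_in)  # same value: r has no units

# Points that lie exactly on a straight line give r = 1 or r = -1.
r_up = pearson_r(ages, [3 * a + 1 for a in ages])     # rising line
r_down = pearson_r(ages, [50 - 2 * a for a in ages])  # falling line

print(round(r_cm, 3), round(r_in, 3), round(r_up, 3), round(r_down, 3))
```

Converting heights from centimetres to inches leaves *r* unchanged, which is what "no units" means in practice.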

### Implications of correlation

Students recognize that **correlation does not imply causation**.

- A correlation could be caused by a third variable that can explain both variables.

- The arrow of causation could go in either direction.

- A correlation could be a coincidence of unrelated phenomena.
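The third-variable case can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variable `z` stands in for a hidden common cause (say, daily temperature), `x` and `y` are each built from `z` plus independent noise, and the sample size and noise levels are arbitrary choices. The partial correlation formula used at the end is one standard way to ask how much association remains once `z` is accounted for.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
n = 2000

# z is the hidden third variable.
z = [random.gauss(0, 3) for _ in range(n)]
# x and y each depend on z plus their own noise; neither depends on the other.
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)
r_xz = pearson_r(x, z)
r_yz = pearson_r(y, z)

# Partial correlation of x and y after accounting for z.
r_xy_given_z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print("r(x, y) =", round(r_xy, 2))
print("r(x, y | z) =", round(r_xy_given_z, 2))
```

In this setup `x` and `y` are strongly correlated even though the generating code contains no link between them, and the association largely disappears once `z` is taken into account.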

### Regression analysis

Students recognize that a **regression analysis** allows researchers to isolate the relationship between two variables--such as income and age--while holding constant the effects of other variables (such as gender, educational level, or years in the workforce).
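As a rough illustration of "holding constant", the sketch below fits income on age alone and then on age and education together, using the closed-form least-squares solution for two explanatory variables. All numbers are invented: income is constructed as exactly `10 + 0.5 * age + 2.0 * education`, so the true age effect is known to be 0.5. The helper name `centered_sum` is an assumption of this example.

```python
def centered_sum(us, vs):
    """Sum of products of deviations from the means."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))

# Made-up data: older people here also tend to have more education.
age = [25, 30, 35, 40, 45, 50]
education = [12, 16, 12, 16, 14, 18]
income = [10 + 0.5 * a + 2.0 * e for a, e in zip(age, education)]

s_aa = centered_sum(age, age)
s_ee = centered_sum(education, education)
s_ae = centered_sum(age, education)
s_ay = centered_sum(age, income)
s_ey = centered_sum(education, income)

# Simple regression of income on age alone: education is left out,
# so its effect is partly credited to age.
slope_age_only = s_ay / s_aa

# Two-variable least squares (closed-form solution of the normal equations):
# the age coefficient now holds education constant.
det = s_aa * s_ee - s_ae ** 2
b_age = (s_ay * s_ee - s_ey * s_ae) / det
b_edu = (s_ey * s_aa - s_ay * s_ae) / det

print(round(slope_age_only, 2), round(b_age, 2), round(b_edu, 2))
```

With education omitted, the age slope is inflated above 0.5 because age and education rise together in this data; with education included, the fit recovers the coefficients used to build the data.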

Students recognize that a regression analysis needs a reasonable theoretical underpinning in order to be useful. They analyze findings based on a regression analysis for common problems.

- Assuming correlation implies causation. Regression tells us how strongly an explanatory variable is associated with a dependent variable, but it does not show that one causes the other. The two might not be causally related at all.
- Inferring causality in the wrong direction.
- Omitting a crucial variable.
- Including highly correlated explanatory variables (multicollinearity): If two variables are very highly correlated with each other, then the equation might not reflect the true effect of just one of them, because ‘controlling for’ the other one will wipe out much of the effect. It is better to include only one of the variables, or to combine them into a composite variable.
- Extrapolating beyond the data: assuming that the results are valid for populations that may have significant differences from the population on which the study has been done.
- Data mining with too many variables: If too many variables are put into a regression equation, one of them may be deemed significant just by chance.
- Using a regression analysis that looks for linear relationships on a nonlinear relationship.
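The last pitfall in the list above is easy to demonstrate. In this sketch, with made-up values, `y` is completely determined by `x` through `y = x ** 2`, yet both the correlation coefficient and the slope of the fitted straight line come out as zero because the relationship is symmetric rather than linear.

```python
from math import sqrt

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y depends entirely on x, but not linearly

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

r = sxy / sqrt(sxx * syy)  # correlation coefficient: 0 for this data
slope = sxy / sxx          # slope of the least-squares line: also 0

print(r, slope)
```

A linear analysis would report "no relationship" here, which is why it helps to plot the data before choosing a model.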

### Anscombe's quartet

A side note: Statistician Francis Anscombe constructed four data sets with nearly identical means, variances, correlations and regression lines, but with very different data points. Plotting the data is therefore an important part of data analysis, and so is checking for outliers, which can dominate a fitted line.
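The quartet's summary statistics can be checked directly. The values below are the commonly reproduced figures for the four data sets, transcribed here by hand, so they are worth verifying against a trusted copy before reuse. The function name `summarize` is an assumption of this sketch.

```python
from math import sqrt

# Anscombe's quartet as commonly reproduced (sets I-III share the same x).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Means, correlation, and least-squares line for one data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }

results = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}

for name, stats in results.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four rows of output agree to about two decimal places, while scatter plots of the same data show a noisy line, a curve, a line with one outlier, and a vertical cluster with one extreme point.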
