
Quantitative Literacy and the Humanities



Statistical thinking: Correlation

Correlation is an important concept for consumers of information. Students understand, interpret, and evaluate correlation and regression analyses, even if they do not calculate them.


The correlation coefficient


Students recognize and interpret the correlation coefficient, r.

The coefficient r is a single number between −1 and 1, with no units.

r describes the direction and strength of the association between two variables. For example, age and height in children are two variables that are associated.

A correlation of 1 is a perfect positive correlation: the data fall exactly on an upward-sloping line, so a change in one variable is always accompanied by a proportional change in the other variable in the same direction.

A correlation of −1 is a perfect negative correlation: the data fall exactly on a downward-sloping line, so a change in one variable is always accompanied by a proportional change in the other variable in the opposite direction.

A correlation of 0 means that the two variables have no linear association; it does not rule out a nonlinear relationship between them.
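To make r concrete, here is a minimal Python sketch; the ages and heights are invented purely for illustration:

```python
import numpy as np

# Hypothetical ages (years) and heights (cm) for eight children,
# made up for this example.
ages = np.array([4, 5, 6, 7, 8, 9, 10, 11])
heights = np.array([102, 109, 115, 121, 128, 133, 138, 144])

# Pearson's r is the covariance of the two variables divided by the
# product of their standard deviations, so it has no units: it comes
# out the same whether height is measured in centimeters or inches.
r = np.corrcoef(ages, heights)[0, 1]
print(f"r = {r:.3f}")  # close to 1: a strong positive association
```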


Implications of correlation


Students recognize that correlation does not imply causation.
  • A correlation could be caused by a third variable that drives both of the correlated variables, as the sketch after this list illustrates. 
  • The arrow of causation could go in either direction.
  • A correlation could be a coincidence of unrelated phenomena.
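The third-variable case is easy to see in a simulation. In the sketch below, with entirely made-up numbers, shoe size and reading score have no direct link to each other, yet both depend on a child's age, and that alone produces a strong correlation between them:

```python
import numpy as np

rng = np.random.default_rng(0)

# The third variable: 500 children's ages, invented for illustration.
age = rng.uniform(5, 12, size=500)

# Shoe size and reading score each depend on age plus independent noise;
# neither has any direct effect on the other.
shoe_size = 0.9 * age + rng.normal(0, 0.5, size=500)
reading_score = 10 * age + rng.normal(0, 5, size=500)

# The two are strongly correlated...
print(np.corrcoef(shoe_size, reading_score)[0, 1])  # roughly 0.9

# ...but the correlation largely vanishes once age is held (nearly)
# constant by looking only at children in a narrow age band.
band = (age > 8) & (age < 8.5)
print(np.corrcoef(shoe_size[band], reading_score[band])[0, 1])  # near 0
```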


Regression analysis


Students recognize that a regression analysis allows researchers to isolate the relationship between two variables, such as income and age, while holding constant the effects of other variables (such as gender, educational level, or years in the workforce). 
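A minimal sketch of "holding constant," using synthetic data and the statsmodels formula API; the variable names and coefficients are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000

# Synthetic data: in this fake sample, older workers tend to have fewer
# years of schooling, and income rises with both age and schooling.
age = rng.uniform(22, 65, size=n)
education = 22 - 0.15 * age + rng.normal(0, 1.5, size=n)
income = 300 * age + 3000 * education + rng.normal(0, 4000, size=n)

df = pd.DataFrame({"income": income, "age": age, "education": education})

# Without the control, education's effect leaks into the age coefficient,
# and the income-age relationship even looks negative.
print(smf.ols("income ~ age", data=df).fit().params["age"])              # about -150
# Holding education constant isolates the income-age association.
print(smf.ols("income ~ age + education", data=df).fit().params["age"])  # about +300
```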

Students recognize that a regression analysis needs a reasonable theoretical underpinning in order to be useful. They analyze findings based on a regression analysis for common problems:

  • Assuming correlation implies causation. Regression tells us how much a given variable explains a dependent variable, but it does not establish causation; the two might not be causally related at all.
  • Inferring causality in the wrong direction.
  • Omitting a crucial variable.
  • Including highly correlated explanatory variables (multicollinearity): if two explanatory variables are very highly correlated with each other, the equation may not reflect the true effect of either one, because ‘controlling for’ one wipes out much of the effect of the other. It is better to include only one of them, or to combine them into a composite variable.
  • Extrapolating beyond the data: assuming the results hold for populations that may differ in important ways from the one actually studied.
  • Data mining with too many variables: if too many variables are put into a regression equation, some may appear significant purely by chance, as the sketch after this list shows.
  • Fitting a regression that looks for linear relationships to data whose underlying relationship is nonlinear.
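The data-mining pitfall is easy to demonstrate. In the sketch below, a response and forty candidate explanatory variables are all pure random noise, yet a handful of the variables still pass a conventional significance test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_obs, n_vars = 100, 40

# A response and 40 explanatory variables, all pure noise:
# nothing here is genuinely related to anything else.
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_vars))

significant = []
for j in range(n_vars):
    r, p = stats.pearsonr(X[:, j], y)
    if p < 0.05:
        significant.append(j)

# At the 5% level we expect about 40 * 0.05 = 2 spurious "findings".
print(f"{len(significant)} of {n_vars} noise variables look significant")
```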


Anscombe's quartet

A side note: the statistician Francis Anscombe constructed four data sets with the same means, variances, correlation, and regression line, but with very different data points. The quartet is a reminder that visualization is essential in data analysis, and that outliers can dominate summary statistics.
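The quartet's matching statistics can be checked directly. A short sketch using the values published in Anscombe (1973):

```python
import numpy as np

# Anscombe's four data sets: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

for label, x, y in [("I", x123, y1), ("II", x123, y2),
                    ("III", x123, y3), ("IV", x4, y4)]:
    x, y = np.asarray(x), np.asarray(y)
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)
    print(f"Set {label}: mean(y) = {y.mean():.2f}, var(y) = {y.var(ddof=1):.2f}, "
          f"r = {r:.3f}, line: y = {intercept:.2f} + {slope:.2f}x")

# Every set prints essentially the same summary (mean about 7.50, variance
# about 4.12, r about 0.816, line y = 3.00 + 0.50x), yet plotting them
# reveals a clean linear trend, a curve, a line distorted by one outlier,
# and a vertical cluster with a single influential point.
```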
