Pearson correlation in R

By Data Tricks, 28 July 2020

What is the Pearson correlation coefficient?

The Pearson correlation coefficient, or Pearson’s r, is a statistic which measures the linear correlation between two variables. It has a value between -1 and +1, where 0 indicates no linear correlation, -1 indicates a perfect negative linear correlation, and +1 a perfect positive linear correlation.
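As a quick illustration of these three reference values, here is a minimal sketch using hypothetical toy vectors (they are not part of this article's example data). The last case is worth noting: y is completely determined by x, yet the coefficient is zero because the relationship is not linear.

```r
# Hypothetical toy vector, chosen to be symmetric around zero
x <- -5:5

r_pos  <- cor(x,  2 * x + 1)   # points lie exactly on an increasing straight line
r_neg  <- cor(x, -3 * x + 5)   # points lie exactly on a decreasing straight line
r_none <- cor(x, x^2)          # symmetric curve: y depends on x, but not linearly

print(c(positive = r_pos, negative = r_neg, none = r_none))
```

The first two values should come out as +1 and -1 (up to floating-point rounding), and the third as 0.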

Example in R

Let’s create some example data:

set.seed(150)
data <- data.frame(x = rnorm(50, mean = 50, sd = 10),
                   random = sample(c(-10:10), 50, replace = TRUE))
data$y <- data$x + data$random

To calculate the Pearson correlation of x and y in data, we can use the cor function:

correlation <- cor(data$x, data$y, method = 'pearson')

Checking the results:

> correlation
[1] 0.9025428

The Pearson correlation coefficient is 0.90, which indicates a strong positive linear correlation between x and y.
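To see what cor is doing, the coefficient can also be computed by hand from its definition: the sum of the products of the deviations of x and y from their means, divided by the square root of the product of their sums of squared deviations. The sketch below recreates the example data with the same seed; the names dx, dy, r_manual and r_builtin are introduced here for illustration only.

```r
# Recreate the example data with the same seed as above
set.seed(150)
data <- data.frame(x = rnorm(50, mean = 50, sd = 10),
                   random = sample(c(-10:10), 50, replace = TRUE))
data$y <- data$x + data$random

# Deviations of each variable from its own mean
dx <- data$x - mean(data$x)
dy <- data$y - mean(data$y)

# Pearson's r computed directly from its definition
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in calculation for comparison
r_builtin <- cor(data$x, data$y, method = "pearson")

# Compare the two values within floating-point tolerance
print(all.equal(r_manual, r_builtin))
```

The two values should agree, so the final line should print TRUE.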

How to interpret the Pearson correlation

A common misconception about the Pearson correlation is that it provides information on the slope of the relationship between the two variables being tested. This is incorrect: the Pearson correlation measures only the strength and direction of the linear relationship between the two variables, not how steep it is. To illustrate this, consider the following example:

set.seed(150)
xvalues <- rnorm(50, mean = 50, sd = 10)
random <- sample(c(10:30), 50, replace = TRUE)
data <- data.frame(x = rep(xvalues, 2),
                   random = rep(random, 2),
                   category = rep(c("One","Two"), each = 50))
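# Category "One": y changes by x / random
data$y[data$category=="One"] <- 20 + data$x[data$category=="One"]/data$random[data$category=="One"]
# Category "Two": same x and random values, but the x term is divided by a further factor of 5
data$y[data$category=="Two"] <- 20 + data$x[data$category=="Two"]/(5*data$random[data$category=="Two"])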

correlation.one <- cor(data$x[data$category=="One"], data$y[data$category=="One"], method = 'pearson')
correlation.two <- cor(data$x[data$category=="Two"], data$y[data$category=="Two"], method = 'pearson')

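The y values in category Two are a linear rescaling of those in category One (each is 16 plus one fifth of the corresponding category One value), so the Pearson correlation coefficient of these two sets of x and y values is exactly the same: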

> correlation.one
[1] 0.462251
> correlation.two
[1] 0.462251

However, when we plot these x and y values on a chart with a linear fit for each category, the slopes are clearly different:

library(ggplot2)
gg <- ggplot(data, aes(x, y, colour = category))
gg <- gg + geom_point()
gg <- gg + geom_smooth(alpha=0.3, method="lm")
print(gg)
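The difference in slope can also be checked numerically, without a chart. The following is a minimal sketch that rebuilds the same x and y values as plain vectors and fits a separate linear model to each category with lm; the names y_one, y_two, slope_one and slope_two are introduced here for illustration only.

```r
# Rebuild the x and y values from the example above
set.seed(150)
xvalues <- rnorm(50, mean = 50, sd = 10)
random <- sample(c(10:30), 50, replace = TRUE)

y_one <- 20 + xvalues / random        # category "One"
y_two <- 20 + xvalues / (5 * random)  # category "Two"

# Slope of the least-squares line for each category
slope_one <- unname(coef(lm(y_one ~ xvalues))["xvalues"])
slope_two <- unname(coef(lm(y_two ~ xvalues))["xvalues"])

# The slopes differ by a factor of 5...
print(c(slope_one = slope_one, slope_two = slope_two,
        ratio = slope_one / slope_two))

# ...while the correlations are identical
print(all.equal(cor(xvalues, y_one), cor(xvalues, y_two)))
```

Because y_two is 16 plus one fifth of y_one, the fitted slope for category Two should be one fifth of the slope for category One, while the correlation is unchanged.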

Is Pearson correlation the right test?

Use our interactive tool to help you choose the right statistical test or read our article on how to choose the right statistical test.
