Pearson correlation in R

By Data Tricks, 28 July 2020

What is the Pearson correlation coefficient?

The Pearson correlation coefficient, or Pearson’s r, is a statistic which measures the linear correlation between two variables. It has a value between -1 and +1, where 0 indicates no linear correlation, -1 indicates a perfect negative linear correlation, and +1 a perfect positive linear correlation.
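For reference, the usual sample formula for Pearson's r is written out below. The notation is an assumption made for this sketch and does not come from the original post: x_i and y_i are the paired observations, \bar{x} and \bar{y} are their sample means, and n is the number of pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator captures how x and y vary together, and the denominator rescales it by the spread of each variable, which is why r always lies between -1 and +1.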

Example in R

Let’s create some example data:

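# Fix the random seed so that the example is reproducible
set.seed(150)
# x is normally distributed; random is integer noise between -10 and 10
data <- data.frame(x = rnorm(50, mean = 50, sd = 10),
                   random = sample(-10:10, 50, replace = TRUE))
# y is x plus the noise, so x and y should be strongly correlated
data$y <- data$x + data$random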

To calculate the Pearson correlation of x and y in data, we can use the following code (method = 'pearson' is the default for cor(), so it is stated here only for clarity):

correlation <- cor(data$x, data$y, method = 'pearson')

Checking the results:

> correlation
[1] 0.9025428

The Pearson correlation coefficient is 0.90, which indicates a strong positive linear correlation between x and y.
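As a rough check of what cor() is doing, the coefficient can also be computed by hand as the covariance of x and y divided by the product of their standard deviations. The sketch below recreates the example data so that it runs on its own. The names r_manual and r_builtin are illustrative and not part of the original example.

```r
# Recreate the example data so that this block runs on its own
set.seed(150)
x <- rnorm(50, mean = 50, sd = 10)
y <- x + sample(-10:10, 50, replace = TRUE)

# Pearson's r as the covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in calculation for comparison
r_builtin <- cor(x, y, method = "pearson")

# The two values should agree up to floating-point tolerance
print(all.equal(r_manual, r_builtin))
```

Both cov() and sd() use the same n - 1 denominator, so it cancels in the ratio and the manual value matches the built-in one.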

How to interpret the Pearson correlation

A common misconception about the Pearson correlation is that it provides information on the slope of the relationship between the two variables being tested. This is incorrect: the Pearson correlation measures only the strength and direction of the linear relationship between the two variables, not how steeply one changes with the other. To illustrate this, consider the following example:

set.seed(150)
xvalues <- rnorm(50, mean = 50, sd = 10)
random <- sample(c(10:30), 50, replace = TRUE)
data <- data.frame(x = rep(xvalues, 2),
                   random = rep(random, 2),
                   category = rep(c("One","Two"), each = 50))
# Both categories share the same x and random values; only the divisor differs.
# Category "Two" divides x by 5 times the divisor used for category "One".
divisor <- ifelse(data$category == "One", data$random, 5 * data$random)
data$y <- 20 + data$x / divisor

correlation.one <- cor(data$x[data$category=="One"], data$y[data$category=="One"], method = 'pearson')
correlation.two <- cor(data$x[data$category=="Two"], data$y[data$category=="Two"], method = 'pearson')

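The Pearson correlation coefficient of these two sets of x and y values is exactly the same, because the y values in category Two are just a positive linear rescaling of those in category One (each equals 16 plus one fifth of the corresponding category One value):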

> correlation.one
[1] 0.462251
> correlation.two
[1] 0.462251

However, when we plot these x and y values on a chart with a fitted regression line for each category, the two relationships look very different:

library(ggplot2)
gg <- ggplot(data, aes(x, y, colour = category))
gg <- gg + geom_point()
gg <- gg + geom_smooth(alpha=0.3, method="lm")
print(gg)
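The difference in slope can also be checked numerically rather than by eye. The sketch below rebuilds the two sets of y values as plain vectors, fits a simple linear regression to each with lm(), and compares the slopes with the correlations. The names y_one, y_two, slope_one and slope_two are illustrative and not part of the original example.

```r
# Recreate the two categories as plain vectors so that this block runs on its own
set.seed(150)
x <- rnorm(50, mean = 50, sd = 10)
random <- sample(10:30, 50, replace = TRUE)
y_one <- 20 + x / random
y_two <- 20 + x / (5 * random)

# Slope of the least-squares line of y on x for each category
slope_one <- unname(coef(lm(y_one ~ x))["x"])
slope_two <- unname(coef(lm(y_two ~ x))["x"])

# Pearson correlation for each category
cor_one <- cor(x, y_one)
cor_two <- cor(x, y_two)

print(c(slope_one = slope_one, slope_two = slope_two))

# The correlations agree, but the first slope is five times the second
print(all.equal(cor_one, cor_two))
print(all.equal(slope_one, 5 * slope_two))

# For simple linear regression the slope equals r * sd(y) / sd(x),
# so it depends on the spread of y as well as on the correlation
print(all.equal(slope_one, cor_one * sd(y_one) / sd(x)))
```

In other words, the correlation fixes only how tightly the points follow a line. The steepness of that line also depends on the scale of y relative to x, which is why the two categories share a coefficient but not a slope.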
