Pearson correlation in R

By Data Tricks, 28 July 2020

Statistics

What is the Pearson correlation coefficient?

The Pearson correlation coefficient, or Pearson’s r, is a statistic that measures the strength and direction of the linear relationship between two variables. It takes a value between -1 and +1, where 0 indicates no linear correlation, -1 a perfect negative linear correlation, and +1 a perfect positive linear correlation.

Example in R

Let’s create some example data:

set.seed(150)
data <- data.frame(x = rnorm(50, mean = 50, sd = 10),
                   random = sample(c(-10:10), 50, replace = TRUE))
data$y <- data$x + data$random

To calculate the Pearson correlation of x and y in data, we can use the following code:

correlation <- cor(data$x, data$y, method = 'pearson')

Checking the results:

> correlation
[1] 0.9025428

The Pearson correlation coefficient is 0.90, which indicates a strong positive linear correlation between x and y.
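
As a cross-check on what cor() computes, Pearson’s r is the covariance of the two variables divided by the product of their standard deviations. The sketch below rebuilds the example data and compares that hand calculation with cor(). The names manual_r and builtin_r are introduced here purely for illustration.

```r
# Recreate the example data from above
set.seed(150)
data <- data.frame(x = rnorm(50, mean = 50, sd = 10),
                   random = sample(c(-10:10), 50, replace = TRUE))
data$y <- data$x + data$random

# Pearson's r = covariance / (sd of x * sd of y)
manual_r <- cov(data$x, data$y) / (sd(data$x) * sd(data$y))

# Built-in calculation for comparison
builtin_r <- cor(data$x, data$y, method = "pearson")

# The two values should agree up to floating-point error
print(all.equal(manual_r, builtin_r))
```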

How to interpret the Pearson correlation

A common misconception about the Pearson correlation is that it provides information on the slope of the relationship between the two variables being tested. This is incorrect: the Pearson correlation measures only the strength and direction of the linear relationship between the two variables, not how steep it is. To illustrate this, consider the following example:

set.seed(150)
xvalues <- rnorm(50, mean = 50, sd = 10)
random <- sample(c(10:30), 50, replace = TRUE)
data <- data.frame(x = rep(xvalues, 2),
                   random = rep(random, 2),
                   category = rep(c("One","Two"), each = 50))
# Category "Two" uses the same x and random values as category "One",
# but the x/random term is divided by 5
one <- data$category == "One"
two <- data$category == "Two"
data$y[one] <- 20 + data$x[one] / data$random[one]
data$y[two] <- 20 + data$x[two] / (5 * data$random[two])

correlation.one <- cor(data$x[one], data$y[one], method = 'pearson')
correlation.two <- cor(data$x[two], data$y[two], method = 'pearson')

The Pearson correlation coefficient of these two sets of x and y values is exactly the same:

> correlation.one
[1] 0.462251
> correlation.two
[1] 0.462251
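
The two coefficients match because y in category Two is a positive linear rescaling of y in category One, and Pearson’s r is unchanged when a constant is added to a variable or the variable is multiplied by a positive constant. The sketch below shows this with made-up data; the names x, y, y_rescaled and the r_* variables are illustrative only and are not part of the example above.

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

# Shift y and shrink it by a positive factor
y_rescaled <- 20 + y / 5

r_original <- cor(x, y, method = "pearson")
r_rescaled <- cor(x, y_rescaled, method = "pearson")

# A negative multiplier flips the sign of r but not its magnitude
r_flipped <- cor(x, -2 * y, method = "pearson")

print(all.equal(r_original, r_rescaled))
print(all.equal(r_flipped, -r_original))
```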

However, when we plot these x and y values on a chart, the relationship looks very different:

library(ggplot2)
gg <- ggplot(data, aes(x, y, colour = category))
gg <- gg + geom_point()
gg <- gg + geom_smooth(alpha=0.3, method="lm")
print(gg)
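
The difference in steepness can also be checked numerically rather than by eye. The sketch below rebuilds the same x and random values and fits a separate least-squares line to each category with lm(). By construction, the x/random term for category Two is one fifth of that for category One, so the fitted slopes should differ by a factor of 5 while the correlations stay equal. The names y.one, y.two, slope.one and slope.two are introduced here for illustration.

```r
set.seed(150)
xvalues <- rnorm(50, mean = 50, sd = 10)
random <- sample(c(10:30), 50, replace = TRUE)

# Same construction as the example, written as two plain vectors
y.one <- 20 + xvalues / random
y.two <- 20 + xvalues / (5 * random)

# Slope of the least-squares line of y on x for each category
slope.one <- unname(coef(lm(y.one ~ xvalues))["xvalues"])
slope.two <- unname(coef(lm(y.two ~ xvalues))["xvalues"])

# Ratio of the slopes (should be 5 up to floating-point error)
print(slope.one / slope.two)

# Ratio of the correlations (should be 1 up to floating-point error)
print(cor(xvalues, y.one) / cor(xvalues, y.two))
```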

Is Pearson correlation the right test?

Use our interactive tool to help you choose the right statistical test or read our article on how to choose the right statistical test.

Tags: correlation, pearson, statistics
