Feature scaling in R: five simple methods

By Data Tricks, 18 November 2020

What is feature scaling?

Feature scaling is the process of putting the variables in a dataset on a common scale so that they no longer depend on their units of measurement. It is often carried out to improve the accuracy of a machine learning algorithm. For example, a dataset may contain Age with a range of 18 to 60 years, and Weight with a range of 50 to 110 kg. The aim of feature scaling would be to transform both variables to the same scale, for example 0 to 1.

Standardisation or normalisation?

Standardisation and normalisation are two of the most popular methods of feature scaling. The terms are sometimes used interchangeably, but they mean quite different things. Standardisation transforms the data so that each variable has a mean of 0 and a standard deviation of 1, whereas normalisation transforms each variable to a range of 0 to 1. You may come across standardisation being referred to as z-score scaling and normalisation as min-max scaling.
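
Written as formulas, the two transformations look as follows. The notation is an assumption made for illustration: x_i is a single value of a feature, \bar{x} and s are the sample mean and sample standard deviation of that feature, z_i is the standardised value and x_i' is the normalised value. These are the same expressions applied manually in Method 2.

```latex
z_i = \frac{x_i - \bar{x}}{s}
\qquad\qquad
x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}
```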

When to use standardisation or normalisation

Whether to use standardisation or normalisation depends on the application, and there is no strict rule for choosing one over the other. But feature scaling in general (either standardisation or normalisation) is especially important for machine learning algorithms that rely on distances between observations, such as k-nearest neighbours and support vector machines, and for algorithms trained by gradient descent, such as neural networks.

An important difference between standardisation and normalisation in the context of machine learning is that, unlike normalisation, standardisation does not have a bounding range. For that reason it is often considered better suited to data containing outliers, although the mean and standard deviation are themselves sensitive to extreme values. Normalisation may be the better option if you know that your data is not normally distributed, and it is helpful for machine learning algorithms that do not assume a normal distribution, such as k-nearest neighbours and neural networks. Standardisation, on the other hand, is often used in clustering analyses and principal component analysis.
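
To see how a single extreme value affects each method, the following is a small sketch using made-up data: 99 typical ages plus one hypothetical outlier of 500. The variable names are illustrative only. It prints the range occupied by the typical values after each transformation.

```r
# Illustrative data: 99 typical ages plus one extreme value (hypothetical)
set.seed(123)
age <- c(rnorm(99, 50, 8), 500)

standardised <- (age - mean(age)) / sd(age)
normalised   <- (age - min(age)) / (max(age) - min(age))

# Range of the 99 typical values after each transformation
range(standardised[1:99])
range(normalised[1:99])

# Min-max scaling pins the outlier at exactly 1 ...
normalised[100]
# ... whereas standardisation leaves it as a large, unbounded z-score
standardised[100]
```

With min-max scaling the typical values end up squeezed into a small part of the 0 to 1 interval. Standardisation is also affected, because the outlier shifts the mean and inflates the standard deviation, but the result is not forced into a fixed interval.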

In summary, choosing which method of feature scaling to use is perhaps best accomplished by training your machine learning algorithm on the raw, standardised and normalised data and evaluating which version gives the highest accuracy on data held out from training.
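
One way such a comparison might look is sketched below. The choices here are assumptions made for illustration and do not come from a particular workflow: the built-in iris data, a k-nearest neighbours classifier from the class package (usually installed alongside R), k = 5, and a single 45-row hold-out set. The scaling parameters are estimated on the training rows only and then applied to the test rows.

```r
library(class)  # provides knn()

set.seed(123)
# Hold out 45 of the 150 iris rows for testing
test_idx <- sample(nrow(iris), size = 45)
train_x <- iris[-test_idx, 1:4]
test_x  <- iris[test_idx, 1:4]
train_y <- iris$Species[-test_idx]
test_y  <- iris$Species[test_idx]

# Scaling parameters estimated on the training set only
centre <- colMeans(train_x)
spread <- apply(train_x, 2, sd)
mins   <- apply(train_x, 2, min)
ranges <- apply(train_x, 2, max) - mins

# Three versions of the same data: raw, standardised, normalised
versions <- list(
  raw = list(train = train_x, test = test_x),
  standardised = list(
    train = scale(train_x, center = centre, scale = spread),
    test  = scale(test_x,  center = centre, scale = spread)),
  normalised = list(
    train = scale(train_x, center = mins, scale = ranges),
    test  = scale(test_x,  center = mins, scale = ranges))
)

# Hold-out accuracy of the same classifier on each version
accuracy <- sapply(versions, function(v) {
  pred <- knn(train = v$train, test = v$test, cl = train_y, k = 5)
  mean(pred == test_y)
})
accuracy
```

A single split gives a noisy estimate, so in practice repeating the comparison with cross-validation would give a more reliable answer.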

How to apply feature scaling in R

The following outlines five simple methods to apply feature scaling in R. Each example starts with creating a data frame of Age and Weight values so that all code is reproducible.

Method 1: scale function (standardisation only)

set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- as.data.frame(scale(data))
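
If new observations need to be scaled in the same way later on, the means and standard deviations used by scale() can be recovered from the attributes of its result. The sketch below assumes a hypothetical data frame new_data with the same columns. The names train, new_data, centre and spread are illustrative.

```r
set.seed(123)
train <- data.frame(Age = rnorm(500, 50, 8),
                    Weight = rnorm(500, 80, 10))
# Hypothetical new observations to be scaled consistently with `train`
new_data <- data.frame(Age = c(25, 60), Weight = c(70, 95))

scaled_train <- scale(train)

# scale() stores the column means and standard deviations as attributes
centre <- attr(scaled_train, "scaled:center")
spread <- attr(scaled_train, "scaled:scale")

# Apply the training means and standard deviations to the new rows
scaled_new <- as.data.frame(scale(new_data, center = centre, scale = spread))
scaled_new
```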

Method 2: manually apply a formula

Standardisation:

set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- as.data.frame(sapply(data, function(x) (x-mean(x))/sd(x)))

Normalisation:

set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- as.data.frame(sapply(data, function(x) (x-min(x))/(max(x)-min(x))))
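
A quick way to confirm that the two formulas behave as described is to inspect the results. This is a minimal check rather than part of the method itself, and the names standardised and normalised are illustrative.

```r
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))

standardised <- as.data.frame(sapply(data, function(x) (x - mean(x)) / sd(x)))
normalised   <- as.data.frame(sapply(data, function(x) (x - min(x)) / (max(x) - min(x))))

# Standardised columns: mean approximately 0, standard deviation 1
round(colMeans(standardised), 10)
sapply(standardised, sd)

# Normalised columns: minimum 0 and maximum 1
sapply(normalised, range)

# The manual standardisation agrees with scale() from Method 1
all.equal(standardised$Age, as.numeric(scale(data$Age)))
```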

Method 3: using the caret package

Standardisation:

library(caret)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data.pre <- preProcess(data, method=c("center", "scale"))
data <- predict(data.pre, data)

Normalisation:

library(caret)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data.pre <- preProcess(data, method="range")
data <- predict(data.pre, data)
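
Because preProcess() stores the scaling parameters separately from the data, the same object can be applied to more than one dataset. The sketch below assumes a hypothetical 400/100 split into training and test rows. The names train, test and pre are illustrative.

```r
library(caret)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))

# Hypothetical split: 400 training rows, 100 test rows
train_idx <- sample(nrow(data), size = 400)
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# Estimate the scaling parameters on the training rows only
pre <- preProcess(train, method = c("center", "scale"))

# Apply the same parameters to both sets
train_scaled <- predict(pre, train)
test_scaled  <- predict(pre, test)

colMeans(train_scaled)  # approximately 0 by construction
colMeans(test_scaled)   # close to, but generally not exactly, 0
```

Fitting on the training rows only keeps information from the test rows out of the scaling step.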

Method 4: using the dplyr package (standardisation only)

library(dplyr)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
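# across() requires dplyr 1.0.0 or later.
# as.numeric() keeps each column a plain vector; scale() alone returns a one-column matrix.
data <- data %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))

Tip: if you want to standardise specific columns in a data frame, name them inside across() instead of using everything(), as follows:

data <- data %>%
  mutate(across(Weight, ~ as.numeric(scale(.x))))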
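
dplyr has no built-in normalisation function, but the min-max formula from Method 2 can be supplied as a custom function. The following is a sketch under that assumption. The helper min_max is hypothetical and not part of dplyr, and across() requires dplyr 1.0.0 or later.

```r
library(dplyr)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))

# Hypothetical helper implementing the min-max formula from Method 2
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the helper to every column
data <- data %>%
  mutate(across(everything(), min_max))

# Each column now runs from 0 to 1
sapply(data, range)
```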

Method 5: using the BBmisc package

Standardisation:

library(BBmisc)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- normalize(data, method="standardize")

Normalisation:

library(BBmisc)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- normalize(data, method="range", range=c(0,1))
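
The range argument is not limited to 0 and 1. As an illustration, the sketch below rescales both columns to the interval from -1 to 1. The choice of interval is an arbitrary assumption made for the example.

```r
library(BBmisc)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))

# Rescale every column to the interval [-1, 1] instead of [0, 1]
data <- normalize(data, method = "range", range = c(-1, 1))

# Inspect the new minimum and maximum of each column
sapply(data, range)
```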


