Feature scaling in R: five simple methods

By Data Tricks, 18 November 2020

What is feature scaling?

Feature scaling is the process of putting the variables in a dataset onto a common scale so that their units of measurement no longer matter. It is often carried out to improve the accuracy of a machine learning algorithm. For example, a dataset may contain Age with a range of 18 to 60 years, and Weight with a range of 50 to 110 kg. The aim of feature scaling would be to transform these ranges to the same scale of, say, 0 to 1 for both Age and Weight.

Standardisation or normalisation?

Standardisation and normalisation are two of the most popular methods of feature scaling. Sometimes the terms are used interchangeably but they mean quite different things. Standardisation is the process of transforming data so that the new data will have a mean of 0 and standard deviation of 1, whereas normalisation transforms the data to a range of 0 to 1. You may come across standardisation being referred to as the z-score and normalisation as min-max scaling.
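
The two definitions above can be written out directly in base R. The following is a minimal sketch on a made-up three-value vector (x is an assumption chosen so the arithmetic is easy to check by hand, not data from this article):

```r
# Toy vector chosen so the results are easy to verify by hand
x <- c(20, 30, 40)

# Standardisation (z-score): subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# Normalisation (min-max): subtract the minimum, divide by the range
m <- (x - min(x)) / (max(x) - min(x))

print(z)  # -1  0  1
print(m)  # 0.0 0.5 1.0
```

Here the mean is 30 and the standard deviation is 10, so the standardised values are centred on 0, while the normalised values run from exactly 0 to exactly 1.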

When to use standardisation or normalisation

Whether to use standardisation or normalisation depends on the application, and there is no strict rule for choosing one over the other. Feature scaling in general (either standardisation or normalisation) is especially important for machine learning algorithms that rely on distances between observations, such as k-nearest neighbours and support vector machines, and for those trained by gradient descent, such as neural networks.

An important difference between standardisation and normalisation in the context of machine learning is that standardisation, unlike normalisation, has no bounding range, so a single extreme value does not squeeze the remaining values into a narrow part of a fixed interval. For this reason standardisation is generally considered better suited to data containing outliers, and it is often used in clustering analyses and principal component analysis. Normalisation may be the better option if you know that your data is not normally distributed, and it is commonly used with algorithms that do not assume a normal distribution, such as k-nearest neighbours and neural networks.
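
To make the point about outliers concrete, here is a small sketch in base R. The weight vector is hypothetical, with one extreme value added on purpose:

```r
# Hypothetical weights in kg, with one extreme value (250) added as an outlier
weight <- c(60, 65, 70, 75, 80, 250)

normalised   <- (weight - min(weight)) / (max(weight) - min(weight))
standardised <- (weight - mean(weight)) / sd(weight)

# Under min-max scaling the five typical values are squeezed into a narrow band near 0
print(round(normalised, 3))

# Standardised values have no fixed bounds; the outlier simply gets a large z-score
print(round(standardised, 3))
```

Note that the outlier still inflates the mean and standard deviation used for standardisation, so neither method is immune to extreme values. The difference is that min-max scaling forces the outlier to define the top of the 0 to 1 range.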

In summary, choosing which method of feature scaling to use is perhaps best accomplished by training your machine learning algorithm on the raw, standardised and normalised data and evaluating which version gives the highest accuracy on data that was not used for training.
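
One way such a comparison might look is sketched below. It assumes the class package (a recommended package bundled with most R installations) and uses made-up data in which a hypothetical Income column has a far larger range than Age; neither the data nor the choice of k-nearest neighbours comes from this article.

```r
library(class)  # provides knn.cv(): k-nearest neighbours with leave-one-out cross-validation

set.seed(123)
n <- 200
# Hypothetical data: the label depends on Age, while Income has a much larger range
toy <- data.frame(Age = rnorm(n, 40, 10),
                  Income = rnorm(n, 50000, 15000))
label <- factor(ifelse(toy$Age > 40, "older", "younger"))

# Three versions of the same features: raw, standardised and normalised
versions <- list(
  raw          = toy,
  standardised = as.data.frame(scale(toy)),
  normalised   = as.data.frame(lapply(toy, function(x) (x - min(x)) / (max(x) - min(x))))
)

# Leave-one-out cross-validated accuracy of 5-nearest neighbours on each version
accuracy <- sapply(versions, function(d) {
  predicted <- knn.cv(d, label, k = 5)
  mean(as.character(predicted) == as.character(label))
})
print(round(accuracy, 3))
```

The same pattern applies to any model and metric: keep the data split and the algorithm fixed, and vary only the scaling.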

How to apply feature scaling in R

The following outlines five simple methods to apply feature scaling in R. Each example starts with creating a data frame of Age and Weight values so that all code is reproducible.

Method 1: scale function (standardisation only)

set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- as.data.frame(scale(data))
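
If new observations need to be scaled in the same way later (for example a test set), the parameters that scale() used can be reused. The sketch below is one way to do this in base R; train and new_data are hypothetical names introduced for illustration:

```r
set.seed(123)
train <- data.frame(Age = rnorm(500, 50, 8),
                    Weight = rnorm(500, 80, 10))
# Hypothetical new observations to be scaled with the training parameters
new_data <- data.frame(Age = c(30, 50), Weight = c(60, 95))

scaled_train <- scale(train)

# scale() stores the column means and standard deviations as attributes
centre <- attr(scaled_train, "scaled:center")
spread <- attr(scaled_train, "scaled:scale")

# Reuse them so that new data is transformed in exactly the same way
scaled_new <- as.data.frame(scale(new_data, center = centre, scale = spread))

# Check: each training column now has mean ~0 and standard deviation 1
print(round(colMeans(scaled_train), 10))
print(apply(scaled_train, 2, sd))
```

Scaling the new data with its own mean and standard deviation would put it on a different scale from the training data, which is why the stored attributes are passed in explicitly.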

Method 2: manually apply a formula

Standardisation:

set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- as.data.frame(sapply(data, function(x) (x-mean(x))/sd(x)))

Normalisation:

set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- as.data.frame(sapply(data, function(x) (x-min(x))/(max(x)-min(x))))
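
The min-max formula divides by max(x) - min(x), which is zero for a column whose values are all the same. The following sketch wraps the formula in a small helper that guards against this and ignores missing values. min_max() is a hypothetical function written for this example, not part of base R, and returning 0 for a constant column is just one possible convention:

```r
# min_max() is a hypothetical helper, not part of base R
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  width <- rng[2] - rng[1]
  # A constant column has zero range; return 0 rather than dividing by zero
  if (width == 0) return(ifelse(is.na(x), NA_real_, 0))
  (x - rng[1]) / width
}

set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10),
                   Constant = 1)  # extra column added to exercise the guard
data[] <- lapply(data, min_max)   # data[] keeps the data frame structure
print(sapply(data, range))
```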

Method 3: using the caret package

Standardisation:

library(caret)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data.pre <- preProcess(data, method=c("center", "scale"))
data <- predict(data.pre, data)

Normalisation:

library(caret)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data.pre <- preProcess(data, method="range")
data <- predict(data.pre, data)

Method 4: using the dplyr package (standardisation only)

library(dplyr)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
# scale() returns a one-column matrix; as.numeric() keeps plain numeric columns
data <- data %>%
  mutate_all(~ as.numeric(scale(.)))

Tip: if you want to standardise specific columns in a dataframe, use mutate_at instead of mutate_all as follows:

data <- data %>%
  mutate_at(vars("Weight"), ~ as.numeric(scale(.)))

Method 5: using the BBmisc package

Standardisation:

library(BBmisc)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- normalize(data, method="standardize")

Normalisation:

library(BBmisc)
set.seed(123)
data <- data.frame(Age = rnorm(500, 50, 8),
                   Weight = rnorm(500, 80, 10))
data <- normalize(data, method="range", range=c(0,1))
