One-hot encoding in R: three simple methods

By Data Tricks, 3 July 2019

Cleaning and preparing data is one of the most effective ways of improving the accuracy of machine learning predictions. If you’re working with categorical variables, you’ll probably want to recode them into a numeric format that machine learning algorithms can work with more easily.

What is one-hot encoding?

One-hot encoding is the process of converting a categorical variable with multiple categories into multiple variables, each with a value of 1 or 0.

For example, it involves taking this:

ID	Colour
001	Red
002	Blue
003	Red
004	Green

and converting it into this:

ID	Red	Green	Blue
001	1	0	0
002	0	0	1
003	1	0	0
004	0	1	0
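
To make the transformation concrete, here is a minimal base R sketch that reproduces the table above without any packages. The object names (colours, categories, indicators, encoded) are illustrative only, and this is not one of the three methods covered in this tutorial.

```r
# Toy data from the example above (IDs kept as strings to preserve leading zeros)
colours <- data.frame(
  ID = c("001", "002", "003", "004"),
  Colour = c("Red", "Blue", "Red", "Green"),
  stringsAsFactors = FALSE
)

# One indicator column per category: 1 where the row matches, 0 otherwise
categories <- c("Red", "Green", "Blue")
indicators <- sapply(categories, function(category) {
  as.integer(colours$Colour == category)
})

# Put the ID column back in front of the indicator columns
encoded <- cbind(colours["ID"], indicators)
print(encoded)
```

Each row has exactly one 1, which is where the name "one-hot" comes from.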

For the methods outlined below, the following simple dataframe will be required:

set.seed(555)
data <- data.frame(
  Outcome = seq(1,100,by=1),
  Variable = sample(c("Red","Green","Blue"), 100, replace = TRUE)
)

Method 1: one_hot in mltools package

library(mltools)
library(data.table)

newdata <- one_hot(as.data.table(data))

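Update 10 December 2021: one_hot only encodes factor columns. Since R 4.0.0, data.frame() keeps strings as character columns by default, so the code above may return the data unchanged. If that happens, convert the Variable column to a factor first: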

data$Variable <- as.factor(data$Variable)
newdata <- one_hot(as.data.table(data))
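
Whether the conversion is needed depends on how the data frame was built. A quick way to check the column classes, and to convert every character column in one go, is sketched below in base R. The helper name to_factors is made up for this example, and the small data frame is a stand-in for the one used in the tutorial.

```r
# Small stand-in for the tutorial's data frame
data <- data.frame(
  Outcome = 1:6,
  Variable = c("Red", "Green", "Blue", "Red", "Blue", "Green"),
  stringsAsFactors = FALSE
)

# Inspect the class of each column; "character" columns need converting
print(sapply(data, class))

# Hypothetical helper: turn every character column into a factor
to_factors <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], as.factor)
  }
  df
}

data <- to_factors(data)
print(sapply(data, class))
```

After the conversion, Variable is reported as a factor and Outcome is left untouched.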

Method 2: dummyVars in caret package

library(caret)

dummy <- dummyVars(" ~ .", data=data)
newdata <- data.frame(predict(dummy, newdata = data))
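
As a cross-check that does not need caret, base R's model.matrix() can build the same kind of indicator columns. This is a different technique from dummyVars, shown only for comparison. The "- 1" in the formula removes the intercept so that every category gets its own column.

```r
set.seed(555)
data <- data.frame(
  Outcome = seq(1, 100, by = 1),
  Variable = sample(c("Red", "Green", "Blue"), 100, replace = TRUE)
)

# "- 1" removes the intercept, so no category is dropped as a reference level
indicators <- model.matrix(~ Variable - 1, data = data)

# Attach the indicator columns to the untouched Outcome column
newdata <- data.frame(Outcome = data$Outcome, indicators)
head(newdata)
```

The resulting columns are prefixed with the variable name (for example VariableRed), so they may need renaming to match the output of the other methods.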

Method 3: dcast in reshape2 package

library(reshape2)

newdata <- dcast(data = data, Outcome ~ Variable, fun.aggregate = length, value.var = "Variable")
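
The length aggregation works because dcast counts how many rows fall into each Outcome/Variable combination, and each Outcome value appears exactly once here. The same counting can be sketched in base R with table(). This illustrates the idea only and is not a description of what reshape2 does internally. Note that if the identifier column contained duplicates, the counts could exceed 1.

```r
set.seed(555)
data <- data.frame(
  Outcome = seq(1, 100, by = 1),
  Variable = sample(c("Red", "Green", "Blue"), 100, replace = TRUE)
)

# Cross-tabulate: rows are Outcome values, columns are categories, cells are counts
counts <- table(data$Outcome, data$Variable)

# Convert the table to a data frame and restore Outcome as a regular column
newdata <- data.frame(
  Outcome = as.numeric(rownames(counts)),
  as.data.frame.matrix(counts)
)
head(newdata)
```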

Applying one-hot encoding to multiple variables at the same time

For the following examples, we’ll modify the dataframe to introduce another variable:

set.seed(555)
data <- data.frame(ID = seq(1,100,by=1),
  Colour = sample(c("Red","Green","Blue"), 100, replace = TRUE),
  Quality = sample(c("Poor","Average","Good"), 100, replace = TRUE)
  )

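If you’re using the one_hot function in the mltools package, convert the character columns to factors first, as one_hot only encodes factor columns:

dt <- as.data.table(lapply(data, function(x) if (is.character(x)) as.factor(x) else x))
newdata <- one_hot(dt)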

For the dummyVars function in the caret package:

dummy <- dummyVars(" ~ .", data=data)
newdata <- data.frame(predict(dummy, newdata = data))

For the dcast function in the reshape2 package:

newdata <- dcast(data = melt(data, id.vars = "ID"), ID ~ variable + value, length)

Note that we have to melt the data before we cast it.
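
Melting is needed because dcast expects one row per ID and variable, with the category held in a single value column. The base R sketch below builds that long layout by hand for a tiny stand-in data frame, to show roughly what melt(data, id.vars = "ID") produces. The variable and value column names match reshape2's defaults; everything else is illustrative.

```r
# Tiny stand-in for the two-variable data frame
data <- data.frame(
  ID = 1:3,
  Colour = c("Red", "Green", "Blue"),
  Quality = c("Poor", "Good", "Poor"),
  stringsAsFactors = FALSE
)

# Long layout: one row per (ID, variable) pair
long <- data.frame(
  ID = rep(data$ID, times = 2),
  variable = rep(c("Colour", "Quality"), each = nrow(data)),
  value = c(data$Colour, data$Quality),
  stringsAsFactors = FALSE
)
print(long)

# Counting rows per ID and variable_value pair gives the one-hot columns
wide <- table(long$ID, paste(long$variable, long$value, sep = "_"))
print(wide)
```

Each ID ends up with one 1 per original variable, so every row of the wide table sums to the number of encoded variables.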

I hope you found this quick tutorial helpful. Happy encoding!

Tags: caret, encoding, mltools, one-hot encoding, R, reshape2

8 thoughts on “One-hot encoding in R: three simple methods”

Abzetdin Adamov says:

February 26, 2021 at 10:43 pm

In the data frame generation code the factor function is missing. It should be
Variable = factor(sample(c("Red","Green","Blue"), 100, replace = TRUE))

1. Data Tricks says:
  
  December 10, 2021 at 3:19 pm
  
  Good spot, thank you!
  
SG says:

April 1, 2021 at 8:52 am

Yes, indeed, it is necessary to convert columns to factor for Method 1.

1. Data Tricks says:
  
  December 10, 2021 at 3:19 pm
  
  Thanks, good spot. I have updated the tutorial.
  
Megan Taylor says:

December 7, 2021 at 2:36 pm

Could you double check the code for mltools? It did not generate a data table that looked similar to the second panel in the tutorial intro or the other two methods. Thanks!

1. Data Tricks says:
  
  December 10, 2021 at 3:14 pm
  
  Hi Megan,
  
  Thanks for your question. As per the comments above I think if you convert the Variable column to a factor first it will work. I have added a note to the tutorial under Method 1 to reflect this.
  
  Thanks,
  Tom
  
