One-hot encoding in R: three simple methods

By Data Tricks, 3 July 2019

Cleaning and preparing data is one of the most effective ways of improving the accuracy of machine learning predictions. If you’re working with categorical variables, you’ll probably want to recode them into a format that machine learning algorithms can handle more easily.

What is one-hot encoding?

One-hot encoding is the process of converting a categorical variable with multiple categories into multiple variables, each with a value of 1 or 0.

For example, it involves taking this:

ID    Colour
001   Red
002   Blue
003   Red
004   Green

and converting it into this:

ID    Red   Green   Blue
001   1     0       0
002   0     0       1
003   1     0       0
004   0     1       0
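To make the definition concrete, here is a minimal base R sketch that builds the second table by hand from the first. It is not one of the three methods below, and the names toy, categories, indicators and encoded are only illustrative.

```r
# The four-row example table from above
toy <- data.frame(
  ID = c("001", "002", "003", "004"),
  Colour = c("Red", "Blue", "Red", "Green")
)

# One indicator column per category: 1 where the row matches, 0 otherwise
categories <- c("Red", "Green", "Blue")
indicators <- sapply(categories, function(category) {
  as.integer(toy$Colour == category)
})

# Put the ID back next to the three new 0/1 columns
encoded <- data.frame(ID = toy$ID, indicators)
print(encoded)
```

Each row ends up with exactly one 1 across the Red, Green and Blue columns, which is where the name "one-hot" comes from.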

For the methods outlined below, the following simple dataframe will be required:

set.seed(555)
data <- data.frame(
  Outcome = seq(1, 100, by = 1),
  # Stored as a factor because one_hot (Method 1) only encodes factor columns
  Variable = factor(sample(c("Red", "Green", "Blue"), 100, replace = TRUE))
)

Method 1: one_hot in mltools package

library(mltools)
library(data.table)

newdata <- one_hot(as.data.table(data))

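Update 10/12/2021. one_hot only encodes factor columns, and since R 4.0.0 data.frame() no longer converts strings to factors by default. If the Variable column is stored as character, the code above will leave it unencoded, so convert the column to a factor first: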

data$Variable <- as.factor(data$Variable)
newdata <- one_hot(as.data.table(data))
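If you are not sure whether a column is already a factor, a quick check in base R can help. The sketch below uses a small hypothetical data frame called raw rather than the tutorial data, and it assumes nothing beyond base R.

```r
# A tiny illustrative data frame with a text column
raw <- data.frame(Variable = c("Red", "Green", "Blue"))

# "character" in R >= 4.0.0, "factor" in earlier versions
print(class(raw$Variable))

# as.factor is safe in both cases: it leaves an existing factor untouched
raw$Variable <- as.factor(raw$Variable)

print(is.factor(raw$Variable))
print(levels(raw$Variable))
```

After the conversion, the levels of the factor are the categories that would each get their own column.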

Method 2: dummyVars in caret package

library(caret)

dummy <- dummyVars(" ~ .", data=data)
newdata <- data.frame(predict(dummy, newdata = data)) 

Method 3: dcast in reshape2 package

library(reshape2)

newdata <- dcast(data = data, Outcome ~ Variable, length)

Applying one-hot encoding to multiple variables at the same time

For the following examples, we’ll modify the dataframe to introduce another variable:

set.seed(555)
data <- data.frame(ID = seq(1,100,by=1),
  Colour = sample(c("Red","Green","Blue"), 100, replace = TRUE),
  Quality = sample(c("Poor","Average","Good"), 100, replace = TRUE)
  )

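If you’re using the one_hot function in the mltools package, convert the character columns to factors first, because one_hot only encodes factor columns:

factordata <- as.data.frame(unclass(data), stringsAsFactors = TRUE)
newdata <- one_hot(as.data.table(factordata))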

For the dummyVars function in the caret package:

dummy <- dummyVars(" ~ .", data=data)
newdata <- data.frame(predict(dummy, newdata = data))

For the dcast function in the reshape2 package:

newdata <- dcast(data = melt(data, id.vars = "ID"), ID ~ variable + value, length)

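Note that we have to melt the data before we cast it. Melting stacks the Colour and Quality columns into two new columns, variable (the original column name) and value (the category), which is what the formula ID ~ variable + value refers to.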
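To see what the melt step produces, here is a small sketch on a hypothetical two-row data frame called small. It assumes the reshape2 package is installed, and it passes value.var explicitly, which the code above leaves for dcast to guess.

```r
library(reshape2)

# Two rows are enough to see the shape of the intermediate data
small <- data.frame(
  ID = 1:2,
  Colour = c("Red", "Blue"),
  Quality = c("Good", "Poor")
)

# melt gives one row per ID and original column:
# the column name goes into `variable`, the cell content into `value`
long <- melt(small, id.vars = "ID")
print(long)

# Casting on variable + value counts each column/category pair per ID
wide <- dcast(long, ID ~ variable + value, length, value.var = "value")
print(wide)
```

The long data frame has four rows (two IDs times two melted columns), and the cast result has one 0/1 column for each column/category combination that appears in the data.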

I hope you found this quick tutorial helpful. Happy encoding!


8 thoughts on “One-hot encoding in R: three simple methods”

  1. Abzetdin Adamov says:

    In the data frame generation code the factor function is missing. It should be
    Variable = factor(sample(c("Red","Green","Blue"), 100, replace = TRUE))

    1. Data Tricks says:

      Good spot, thank you!

  2. SG says:

    Yes, indeed, it is necessary to convert columns to factor for Method 1.

    1. Data Tricks says:

      Thanks, good spot. I have updated the tutorial.

  3. Megan Taylor says:

    Could you double check the code for mltools? It did not generate a data table that looked similar to the second panel in the tutorial intro or the other two methods. Thanks!

    1. Data Tricks says:

      Hi Megan,

      Thanks for your question. As per the comments above I think if you convert the Variable column to a factor first it will work. I have added a note to the tutorial under Method 1 to reflect this.

      Thanks,
      Tom
