One-hot encoding in R: three simple methods

By Data Tricks, 3 July 2019

Cleaning and preparing data is one of the most effective ways of boosting the accuracy of machine learning predictions. If you’re working with categorical variables, you’ll probably want to recode them into a format that machine learning algorithms can work with directly.

One-hot encoding is the process of converting a categorical variable with multiple categories into multiple binary variables, one per category, each taking the value 1 or 0.

For example, it involves taking this:

ID    Colour
001   Red
002   Blue
003   Red
004   Green

and converting it into this:

ID    Red   Green   Blue
001   1     0       0
002   0     0       1
003   1     0       0
004   0     1       0
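
As a rough illustration of the conversion above, the same four rows can be encoded in base R with model.matrix(). This is only a sketch to make the idea concrete, not one of the three methods covered in this post; the object names colours, encoded and result are made up for the example.

```r
# The four-row example from the tables above
colours <- data.frame(
  ID = c("001", "002", "003", "004"),
  Colour = factor(c("Red", "Blue", "Red", "Green"),
                  levels = c("Red", "Green", "Blue"))
)

# "- 1" drops the intercept so every level gets its own 0/1 column
encoded <- model.matrix(~ Colour - 1, data = colours)

# Replace the default names (ColourRed, ...) with the plain level names
colnames(encoded) <- levels(colours$Colour)

result <- data.frame(ID = colours$ID, encoded)
print(result)
```

Setting the factor levels explicitly fixes the column order to Red, Green, Blue; without it, R would order the levels alphabetically.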

For the methods outlined below, the following simple dataframe will be required:

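set.seed(555)
data <- data.frame(
  Outcome = seq(1, 100, by = 1),
  Variable = sample(c("Red", "Green", "Blue"), 100, replace = TRUE),
  # R 4.0+ no longer converts strings to factors by default;
  # one_hot() in Method 1 only encodes factor columns
  stringsAsFactors = TRUE
)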

Method 1: one_hot in mltools package

library(mltools)
library(data.table)

# one_hot() expects a data.table and encodes its unordered factor columns
newdata <- one_hot(as.data.table(data))

Method 2: dummyVars in caret package

library(caret)

# Build the encoder from the formula, then apply it to the data
dummy <- dummyVars(" ~ .", data = data)
newdata <- data.frame(predict(dummy, newdata = data))

Method 3: dcast in reshape2 package

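library(reshape2)

# Count the rows for each Outcome/Variable pair; naming value.var avoids the guessing message
newdata <- dcast(data = data, Outcome ~ Variable, fun.aggregate = length, value.var = "Variable")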
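
Method 3 works because it counts the rows for each combination of Outcome and Variable, and every Outcome value appears exactly once, so each count is either 0 or 1. The base R sketch below mimics that counting step with table() to make the logic visible. It is an illustration under that uniqueness assumption, not a replacement for dcast; the names counts and check are made up for the example.

```r
# Same example data as in the post
set.seed(555)
data <- data.frame(
  Outcome = seq(1, 100, by = 1),
  Variable = sample(c("Red", "Green", "Blue"), 100, replace = TRUE)
)

# Cross-tabulate: one row per Outcome, one column per colour,
# each cell holding the number of matching rows
counts <- as.data.frame.matrix(table(data$Outcome, data$Variable))

# Outcome is unique, so each row holds a single 1 and 0s elsewhere
stopifnot(all(rowSums(counts) == 1))

# Put the Outcome values back as a regular column
check <- cbind(Outcome = as.numeric(rownames(counts)), counts)
head(check)
```

If the identifier column contained repeated values, the counts could exceed 1 and the result would no longer be a one-hot encoding, so it is worth confirming uniqueness before relying on this approach.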
