By Data Tricks, 3 July 2019
Cleaning and preparing data is one of the most effective ways of boosting the accuracy of predictions through machine learning. If you’re working with categorical variables, you’ll probably want to recode them to a format more friendly to machine learning algorithms.
One-hot encoding is the process of converting a categorical variable with multiple categories into multiple variables, each with a value of 1 or 0.
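For example, it involves taking a single column whose values are Red, Green or Blue and converting it into three columns, one per colour, each holding 1 where the row had that colour and 0 otherwise.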
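A minimal base R sketch of that idea may help before turning to the packages. The four-element vector `colours` and the names `levels_found` and `encoded` are made up for illustration and are not part of the original example.

```r
# Hypothetical input: one categorical variable with three categories.
colours <- c("Red", "Green", "Blue", "Green")

# The distinct categories, in alphabetical order.
levels_found <- sort(unique(colours))

# One column per category: 1 where the value matches, 0 otherwise.
encoded <- sapply(levels_found, function(level) as.integer(colours == level))

print(encoded)
```

Each row of `encoded` contains exactly one 1, which is where the name "one-hot" comes from.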
For the methods outlined below, the following simple dataframe will be required:
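data <- data.frame(
  Outcome = seq(1, 100, by = 1),
  Variable = sample(c("Red", "Green", "Blue"), 100, replace = TRUE),
  # From R 4.0 strings are no longer converted to factors by default;
  # one_hot in Method 1 only encodes factor columns.
  stringsAsFactors = TRUE
)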
Method 1: one_hot in mltools package
library(mltools)
library(data.table)
# one_hot expects a data.table and encodes its unordered factor columns
newdata <- one_hot(as.data.table(data))
Method 2: dummyVars in caret package
library(caret)
# Build the encoder from every column, then apply it to the data
dummy <- dummyVars(" ~ .", data = data)
newdata <- data.frame(predict(dummy, newdata = data))
Method 3: dcast in reshape2 package
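library(reshape2)
# length counts the rows for each Outcome/Variable pair; because every
# Outcome is unique, each count is either 1 or 0
newdata <- dcast(data = data, Outcome ~ Variable, length, value.var = "Variable")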
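Whichever method is used, it is worth checking that the result really is one-hot: each row should have exactly one 1 across the new columns. The sketch below shows one way to do that. To stay free of package dependencies it builds `newdata` with base R's `model.matrix`, which is a stand-in and not one of the three methods above; the 20-row data frame, the seed and the name `indicator_cols` are likewise assumptions made for brevity.

```r
# Smaller, reproducible version of the example data frame (assumed for brevity).
set.seed(42)
data <- data.frame(
  Outcome = seq(1, 20, by = 1),
  Variable = factor(sample(c("Red", "Green", "Blue"), 20, replace = TRUE))
)

# Base R stand-in for the package methods: "- 1" drops the intercept so that
# every level gets its own column instead of one acting as the reference.
indicators <- model.matrix(~ Variable - 1, data = data)
newdata <- data.frame(Outcome = data$Outcome, indicators)

# Every column except Outcome is an indicator column.
indicator_cols <- setdiff(names(newdata), "Outcome")

# Check 1: exactly one 1 per row.
stopifnot(all(rowSums(newdata[indicator_cols]) == 1))

# Check 2: column totals match the original category counts.
stopifnot(all(unname(colSums(newdata[indicator_cols])) ==
              as.vector(table(data$Variable))))
```

The same two checks should apply to the output of the three methods above, provided `indicator_cols` is set to the names of the columns each method generates.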