Wordclouds in R

By Data Tricks, 9 November 2017

Last year saw the release of a very cool R package, wordcloud2, an interface for wordcloud2.js. In this post we take you through some simple steps to produce a great-looking wordcloud from your text documents.

1 Load packages and read in data

Create a folder called Texts in your working directory and save your text files there. You can save as many files as you want, as the code will read them all in; however, to avoid problems it is best to save them as plain .txt files.

rm(list=ls())
setwd("your_working_directory_path")
library(wordcloud2)
library(tm)
library(tidytext)

#Read in all the text files in the Texts folder
cname <- file.path("your_working_directory_path", "Texts")
docs <- Corpus(DirSource(cname))
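
If other file types end up in the Texts folder, they will be read in too. The sketch below shows one way to check the folder and restrict the read to .txt files using the pattern argument of DirSource. The temporary folder, file names and the docs_demo object are made up for illustration only.

```r
library(tm)

# Hypothetical demo folder; in practice this would be your own Texts folder
texts_dir <- file.path(tempdir(), "Texts")
dir.create(texts_dir, showWarnings = FALSE)
writeLines("A short example document.", file.path(texts_dir, "speech1.txt"))
writeLines("Another example document.", file.path(texts_dir, "speech2.txt"))
writeLines("word,count", file.path(texts_dir, "notes.csv"))

# See what is in the folder before reading anything in
list.files(texts_dir)

# Only read files whose names end in .txt
docs_demo <- Corpus(DirSource(texts_dir, pattern = "\\.txt$"))

# Number of documents in the corpus
length(docs_demo)
```

The pattern is a regular expression, so "\\.txt$" matches file names that end in .txt and skips notes.csv.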

2 Clean the text files

You can use the tm package to clean up the text files and remove words that you want to exclude from the wordcloud such as “the” and “a”.

#Clean the text file
docs <- tm_map(docs, removePunctuation) #remove punctuation
docs <- tm_map(docs, removeNumbers) #remove numbers
docs <- tm_map(docs, content_transformer(tolower)) #convert all characters to lower case
docs <- tm_map(docs, removeWords, stopwords("english")) #remove common words
docs <- tm_map(docs, removeWords, c("will", "people", "britain", "british", "country", "thats")) #remove any additional specific words
docs <- tm_map(docs, stripWhitespace) #remove whitespaces

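In the code above, stopwords(“english”) supplies a standard list of stop words such as “the” and “and”. Any additional words can be removed with a second removeWords call, as in the line that lists “will”, “people” and so on. Because punctuation is stripped first, a contraction such as “that’s” becomes “thats” and is no longer matched by the stop word list, which is why “thats” appears in the custom list. You will probably need to come back to this line to add or remove words once you have seen your wordcloud.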
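
To see what each cleaning step does before running it on your own files, it can help to try the steps on a single sentence. This is a minimal sketch: the sentence and the toy and cleaned objects are invented for illustration, and the custom word list is a shortened version of the one above.

```r
library(tm)

# Hypothetical one-sentence corpus
toy <- Corpus(VectorSource("The British people will build 2 new homes, that's the plan!"))

toy <- tm_map(toy, removePunctuation)                  # "that's" becomes "thats"
toy <- tm_map(toy, removeNumbers)                      # drops the "2"
toy <- tm_map(toy, content_transformer(tolower))       # lower case
toy <- tm_map(toy, removeWords, stopwords("english"))  # drops "the", "will", ...
toy <- tm_map(toy, removeWords, c("people", "british", "thats"))  # custom words
toy <- tm_map(toy, stripWhitespace)                    # collapse leftover spaces

# Extract the cleaned text of the first (and only) document
cleaned <- as.character(toy[[1]])
cleaned
```

Printing cleaned after each step is a quick way to spot words that still need adding to the custom list.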

3 Create a document term matrix and convert to a dataframe

The wordcloud2 package requires a data frame as its input, with the words or phrases in the first column and a numerical value (i.e. how many times each word appears in the text) in the second column, which determines the size of each word in the wordcloud.

#Create a document term matrix
dtm <- DocumentTermMatrix(docs)

#Convert document term matrix into a dataframe, total the counts across files and sort by frequency
df <- tidy(dtm)
df <- aggregate(count ~ term, data = df, FUN = sum)
df <- df[order(-df$count), ]
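
When more than one file is read in, tidy() returns one row per document and term, so the same word can appear several times with a separate count for each file. The sketch below uses two made-up documents (the toy objects are illustrative only) to show that shape and one way to total the counts per word with base R's aggregate().

```r
library(tm)
library(tidytext)

# Two hypothetical documents standing in for two text files
toy <- Corpus(VectorSource(c("housing housing economy",
                             "housing economy economy economy")))
toy_dtm <- DocumentTermMatrix(toy)

# One row per document-term pair: columns document, term, count
toy_df <- tidy(toy_dtm)
toy_df

# Total the counts for each word across documents, then sort by frequency
toy_totals <- aggregate(count ~ term, data = toy_df, FUN = sum)
toy_totals <- toy_totals[order(-toy_totals$count), ]
toy_totals
```

The result has the word in the first column and its total frequency in the second, which is the layout wordcloud2 expects.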

4 Run the wordcloud

You’re now ready to run the wordcloud. In our example we’ve used the UK Prime Minister’s calamitous speech in Manchester 2017.

wordcloud2(data = df, shape = 'circle')
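
The appearance of the wordcloud can be adjusted through further arguments to wordcloud2, and because the result is an HTML widget it can be saved to a file with the htmlwidgets package. The following is a sketch only: the demo_df data frame, its values and the output file name are made up, and the argument values are just one possible choice.

```r
library(wordcloud2)

# Hypothetical data in the shape wordcloud2 expects:
# words in the first column, frequencies in the second
demo_df <- data.frame(
  word = c("economy", "housing", "dream", "future", "work"),
  freq = c(12, 9, 7, 5, 3)
)

# size scales the words; color and backgroundColor set the palette
wc <- wordcloud2(data = demo_df, size = 0.8, color = "random-dark",
                 backgroundColor = "white", shape = "circle")

# print(wc)  # shows the wordcloud in the RStudio viewer or a browser

# Save the widget as an HTML file (written to a temporary folder here)
out_file <- file.path(tempdir(), "wordcloud.html")
htmlwidgets::saveWidget(wc, out_file, selfcontained = FALSE)
```

With selfcontained = FALSE the JavaScript dependencies are written to a folder next to the HTML file rather than embedded in it.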

Tags: data art, tagcloud, text, tidytext, tm, wordcloud, wordcloud2
