Wordclouds in R

By Data Tricks, 9 November 2017

Last year saw the release of a very cool R package, wordcloud2, an interface for wordcloud2.js. In this post we take you through some simple steps to produce a great-looking wordcloud from your text documents.

1 Load packages and read in data

Create a folder called Texts in your working directory and save your text files there. You can save as many files as you want, as the code will read them all in; to avoid problems, save them as plain .txt files.

rm(list=ls())
setwd("your_working_directory_path")
library(wordcloud2)
library(tm)
library(tidytext)

#Read in all text files saved in the Texts folder
cname <- file.path("your_working_directory_path", "Texts")
docs <- Corpus(DirSource(cname))
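
If you want to check what was actually read in, the sketch below may help. It is only an illustration: the folder sits in R's temporary directory, and the file names and their contents are made up for the demo. It also uses the pattern argument of DirSource so that only files ending in .txt are picked up, which is one way to guard against stray files in the folder.

```r
library(tm)

# Hypothetical demo folder inside the session's temporary directory
texts_dir <- file.path(tempdir(), "Texts")
dir.create(texts_dir, showWarnings = FALSE)

# Two small .txt files plus one file that should be ignored (all made up)
writeLines("The first speech talks about the economy.", file.path(texts_dir, "speech1.txt"))
writeLines("The second speech talks about housing.", file.path(texts_dir, "speech2.txt"))
writeLines("not a text document", file.path(texts_dir, "notes.csv"))

# pattern restricts DirSource to file names ending in .txt
demo_docs <- Corpus(DirSource(texts_dir, pattern = "\\.txt$"))

length(demo_docs)           # number of documents read in
meta(demo_docs[[1]], "id")  # ID of the first document (taken from its file name)
```

The same two checks can be run on your own docs object before moving on to the cleaning step.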

2 Clean the text files

You can use the tm package to clean up the text files and remove words that you want to exclude from the wordcloud such as “the” and “a”.

#Clean the text file
docs <- tm_map(docs, removePunctuation) #remove punctuation
docs <- tm_map(docs, removeNumbers) #remove numbers
docs <- tm_map(docs, content_transformer(tolower)) #convert all characters to lower case; content_transformer() keeps each document a tm text document
docs <- tm_map(docs, removeWords, stopwords("english")) #remove common words
docs <- tm_map(docs, removeWords, c("will", "people", "britain", "british", "country", "thats")) #remove any additional specific words
docs <- tm_map(docs, stripWhitespace) #remove whitespaces

In the fifth line of the block above (counting the comment line), stopwords("english") provides a list of stop words such as “the” and “and”. Any additional words can be removed as in the sixth line; you might need to return to this line to add or remove words depending on how your wordcloud turns out.
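
To see what the cleaning steps do, it can help to run them on a single made-up sentence. The sketch below is only an illustration: the sentence and the short list of extra words are invented, and tolower is wrapped in content_transformer(), which recent versions of tm expect for functions that are not tm transformations. Note the order of the steps: punctuation is removed first, so “That's” becomes “thats”, which no longer matches the stop word “that's”. This is why a word like “thats” may need to be added to your own list.

```r
library(tm)

# A made-up sentence held in a one-document corpus
sample_text <- "That's 2 great ideas, and the people will love them!"
toy <- VCorpus(VectorSource(sample_text))

toy <- tm_map(toy, removePunctuation)                  # "That's" becomes "Thats"
toy <- tm_map(toy, removeNumbers)                      # drops the "2"
toy <- tm_map(toy, content_transformer(tolower))       # lower case
toy <- tm_map(toy, removeWords, stopwords("english"))  # common stop words
toy <- tm_map(toy, removeWords, c("will", "people"))   # our own extra words
toy <- tm_map(toy, stripWhitespace)                    # collapse repeated spaces

# Inspect the cleaned text and split it into words
cleaned <- content(toy[[1]])
cleaned_words <- strsplit(cleaned, " ")[[1]]
print(cleaned)
```

Printing the text after each step is a quick way to find out which words still need to go into the custom list.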

3 Create a document term matrix and convert to a dataframe

The wordcloud2 package requires a dataframe as its input, with words or phrases in the first column and a numerical value (i.e. how many times the word appears in the text files) in the second column, which determines the size of each word in the wordcloud.
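
As a minimal sketch of that input format, the dataframe below is typed in by hand. The words, the counts and the column names are all made up for illustration; what matters is that the words come first and the numbers second.

```r
library(wordcloud2)

# Hand-made input: words in the first column, counts in the second
toy_df <- data.frame(
  word = c("economy", "housing", "jobs", "future"),
  freq = c(12, 7, 5, 2),
  stringsAsFactors = FALSE
)

# Words with larger counts are drawn larger
toy_cloud <- wordcloud2(data = toy_df)
toy_cloud
```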

#Create a document term matrix
dtm <- DocumentTermMatrix(docs)

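#Convert document term matrix into a dataframe and sort by frequency
df <- tidy(dtm) #one row per document/term pair, with columns document, term and count
df <- aggregate(count ~ term, data = df, FUN = sum) #add up the counts across all text files
df <- df[order(-df$count), ] #most frequent words first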
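
If you read in more than one text file, it is worth looking at what tidy() returns: one row per document and term, so a word that appears in several files gets several rows. The sketch below uses two tiny made-up documents to show this, and sums the counts per term with aggregate() as one possible way to get a single row per word. All object names here are hypothetical.

```r
library(tm)
library(tidytext)

# Two tiny made-up documents that share the word "tax"
toy_docs <- VCorpus(VectorSource(c("tax tax housing", "tax jobs")))
toy_dtm <- DocumentTermMatrix(toy_docs)

# One row per document/term pair: columns document, term, count
toy_tidy <- tidy(toy_dtm)
print(toy_tidy)

# Sum the counts per term so each word appears once, then sort
toy_totals <- aggregate(count ~ term, data = toy_tidy, FUN = sum)
toy_totals <- toy_totals[order(-toy_totals$count), ]
print(toy_totals)
```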

4 Run the wordcloud

You’re now ready to run the wordcloud. In our example we’ve used the UK Prime Minister’s calamitous speech in Manchester in 2017.

wordcloud2(data = df, shape = 'circle')
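
Once the basic wordcloud works, you may want to change its look or save it. The sketch below is one way to do that, under a few assumptions: it uses demoFreq, the sample dataframe that ships with wordcloud2, in place of your own df; the argument values are arbitrary choices; and the output file name is made up. Saving relies on the htmlwidgets package, since wordcloud2 produces an HTML widget.

```r
library(wordcloud2)
library(htmlwidgets)

# demoFreq is sample data bundled with wordcloud2 (swap in your own df)
cloud <- wordcloud2(
  data = demoFreq,
  size = 0.8,                  # overall scaling of the words
  shape = "star",              # other shapes include "circle" and "diamond"
  color = "random-light",      # light word colours to suit a dark background
  backgroundColor = "black"
)

# Save as an HTML page; hypothetical file name in the temporary directory
out_file <- file.path(tempdir(), "wordcloud.html")
saveWidget(cloud, out_file, selfcontained = FALSE)
```

Opening the saved file in a browser shows the same interactive wordcloud that appears in the RStudio viewer.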
