By Data Tricks, 9 November 2017
Last year saw the release of a very cool R package, wordcloud2, an interface for wordcloud2.js. In this post we take you through some simple steps to produce a great-looking wordcloud from your own text documents.
1 Load packages and read in data
Create a folder called Texts in your working directory and save your .txt files there. You can save as many text files as you want, since the code reads them all in, but to avoid problems make sure every file is saved as a plain .txt file.
rm(list = ls())
setwd("your_working_directory_path")
library(wordcloud2)
library(tm)
library(tidytext)

#Read in text files
cname <- file.path("your_working_directory_path", "Texts")
docs <- Corpus(DirSource(cname))
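Before reading anything in, it can help to check that the folder really contains only .txt files. The sketch below uses base R only; the temporary folder and the two file names are made up for illustration and stand in for your own Texts folder.

```r
# Sketch only: a throwaway folder stands in for the real "Texts" directory.
texts_dir <- file.path(tempdir(), "Texts")
dir.create(texts_dir, showWarnings = FALSE)

# Two hypothetical example files, one with the wrong extension.
writeLines("Some example text.", file.path(texts_dir, "speech.txt"))
writeLines("Not a plain text file.", file.path(texts_dir, "notes.docx"))

# List only the files whose names end in .txt
txt_files <- list.files(texts_dir, pattern = "\\.txt$", ignore.case = TRUE)
print(txt_files)

# Flag anything else that is sitting in the folder
other_files <- setdiff(list.files(texts_dir), txt_files)
if (length(other_files) > 0) {
  message("Non-.txt files found: ", paste(other_files, collapse = ", "))
}
```

DirSource() also takes a pattern argument, so the same regular expression could be passed there to skip stray files instead of just reporting them.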
2 Clean the text files
You can use the tm package to clean up the text files and remove words that you want to exclude from the wordcloud such as “the” and “a”.
#Clean the text files
docs <- tm_map(docs, removePunctuation) #remove punctuation
docs <- tm_map(docs, removeNumbers) #remove numbers
docs <- tm_map(docs, content_transformer(tolower)) #convert all characters to lower case (content_transformer keeps each document as text)
docs <- tm_map(docs, removeWords, stopwords("english")) #remove common words
docs <- tm_map(docs, removeWords, c("will", "people", "britain", "british", "country", "thats")) #remove any additional specific words
docs <- tm_map(docs, stripWhitespace) #remove extra whitespace
In the fifth line above, stopwords("english") provides a list of stop words such as “the” and “and”. Any additional words can be removed as in the sixth line above. You might need to come back to this line to add or remove words from the list, depending on how your wordcloud turns out.
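To see why a word like “thats” ends up in the extra list, here is a rough base R imitation of the cleaning steps on a single made-up sentence. It is only an illustration: it is not how tm implements these functions, and the short stop word list is hand-picked rather than the real stopwords("english") list.

```r
# Base R illustration of the cleaning steps; not how tm implements them.
text <- "That's the plan: Britain will build 25000 homes, and people will benefit."

cleaned <- gsub("[[:punct:]]", "", text)     # strip punctuation ("That's" becomes "Thats")
cleaned <- gsub("[[:digit:]]", "", cleaned)  # strip numbers
cleaned <- tolower(cleaned)                  # convert to lower case

# Split into individual words on whitespace
words <- strsplit(cleaned, "[[:space:]]+")[[1]]

# A tiny hypothetical stop word list, far shorter than stopwords("english")
stop_words <- c("the", "and", "that's")
# Extra words to drop, including the apostrophe-free "thats"
extra_words <- c("will", "people", "britain", "thats")

kept <- words[!words %in% c(stop_words, extra_words)]
print(kept)
```

Because the punctuation is stripped before the stop words are removed, “that's” has already turned into “thats” by the time the stop word list is applied, so it only disappears if you add it yourself.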
3 Create a document term matrix and convert to a dataframe
The wordcloud2 package requires a dataframe as the input, with words or phrases in the first column and a numerical value (i.e. how many times the word appears in the text files) in the second column, which determines the size of each word in the wordcloud.
#Create a document term matrix
dtm <- DocumentTermMatrix(docs)

#Convert document term matrix into a dataframe and sort by frequency
df <- tidy(dtm)
df <- df[order(-df$count), c(2, 3)]
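If you read in more than one text file, tidy(dtm) returns one row per document and term, so the same word can appear several times in df. One way to deal with this, sketched below in base R, is to add the counts up per word before sorting. The small data frame is invented for illustration and simply mimics the document, term and count columns.

```r
# Hypothetical data in the shape tidy(dtm) returns: one row per document/term pair.
tidy_counts <- data.frame(
  document = c("a.txt", "a.txt", "b.txt", "b.txt"),
  term     = c("economy", "housing", "economy", "schools"),
  count    = c(5, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Sum the counts for each term across all documents
word_freq <- aggregate(count ~ term, data = tidy_counts, FUN = sum)

# Sort by descending frequency; the columns are already (word, frequency)
word_freq <- word_freq[order(-word_freq$count), ]
rownames(word_freq) <- NULL
print(word_freq)
```

With a single text file this step changes nothing, so it should be safe to leave in either way.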
4 Run the wordcloud
You’re now ready to run the wordcloud. In our example we’ve used the UK Prime Minister’s calamitous speech in Manchester in 2017.
wordcloud2(data = df, shape = 'circle')
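Long documents can produce thousands of distinct words, so you may want to keep only the most frequent ones and tweak the look of the cloud. The sketch below uses a small invented frequency table in place of df, and the values chosen for size, color and backgroundColor are just examples to experiment with.

```r
# Hypothetical frequency table standing in for df from step 3
df <- data.frame(
  term  = c("economy", "schools", "housing", "future", "work"),
  count = c(8, 4, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Keep only the most frequent words (3 here; a few hundred is more typical)
top_words <- head(df[order(-df$count), ], 3)

# Build the cloud only if the wordcloud2 package is installed
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  cloud <- wordcloud2::wordcloud2(
    data = top_words,
    size = 0.8,                 # overall scaling of the font sizes
    color = "random-dark",      # colour scheme for the words
    backgroundColor = "white",  # background colour of the widget
    shape = "circle"
  )
  # Typing cloud at the console displays the widget
}
```

Trimming the table first keeps the widget quick to render and stops the rarest words from cluttering the picture.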