To visualise and compare the emphasis of each political party's election manifesto, we have text-mined them into Word Clouds. The seven manifestos shown here were all treated in exactly the same way, so that the comparison is as objective as possible.
The process of producing these Word Clouds was as follows:
- the manifestos were downloaded from each party's web site as PDF files;
- all the text each one contained was exported into a text file (note: text embedded within images could not be exported from the PDF files);
- page headers and/or footers (where present) were removed;
- each chapter in a manifesto was grouped into a single block of text;
- a two-column CSV file was constructed containing all the manifestos, with the party name in the first column and the chapter texts in the second (a total of 121 chapters, approximately 189,000 words);
- the CSV file was loaded into the open-source statistical package R, using the text mining package tm and the package wordcloud;
- the text was transformed into a corpus (a data type for analysing texts) and further cleaned: all text was converted to lower case, all punctuation and numbers were removed, and unwanted words such as 'a', 'the' and 'we', as well as political party names, were removed;
- the remaining text for each party was turned into its own Word Cloud, using exactly the same parameters for all parties: more frequent words are drawn larger, at most the 200 most frequently used words are shown, and the plot has a maximum size (smaller plots occur where the total word count of the manifesto is smaller).
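The cleaning and counting steps above can be sketched in a few lines of code. The Word Clouds shown here were produced in R with the tm and wordcloud packages; what follows is not that code but a minimal Python sketch using only the standard library, intended to illustrate the idea. The function names, the very short stop-word list and the sample data are all assumptions made for illustration, and drawing the cloud itself is left out.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption);
# a real analysis would use a much fuller list.
STOP_WORDS = {"a", "the", "we", "and", "of", "to", "in", "for", "on", "will", "all"}


def clean(text, extra_stop_words=()):
    """Lower-case the text, drop punctuation and numbers, remove unwanted words."""
    text = text.lower()
    # Keep only letters and whitespace; this removes punctuation and digits.
    text = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    unwanted = STOP_WORDS | set(extra_stop_words)
    return [word for word in text.split() if word not in unwanted]


def top_words_by_party(csv_file, max_words=200):
    """Return {party: [(word, count), ...]} from a two-column CSV.

    Column 1 is the party name, column 2 is the text of one chapter.
    """
    rows = [row for row in csv.reader(csv_file) if len(row) == 2]
    # Party names are treated as unwanted words, one word at a time.
    party_words = {w for party, _ in rows for w in party.lower().split()}
    counts = defaultdict(Counter)
    for party, chapter in rows:
        counts[party].update(clean(chapter, extra_stop_words=party_words))
    # Keep only the most frequent words, as in the 200-word limit above.
    return {party: c.most_common(max_words) for party, c in counts.items()}


# Made-up sample data standing in for the real CSV file.
sample = io.StringIO(
    'Alpha,"We will build 300 homes. Homes for all!"\n'
    'Alpha,"The homes plan and the jobs plan."\n'
    'Beta,"Jobs, jobs, jobs: we back Alpha on jobs."\n'
)
result = top_words_by_party(sample, max_words=3)
print(result["Beta"][0])  # ('jobs', 4)
```

The resulting word counts are what a Word Cloud plots: the higher the count, the larger the word is drawn. Using the same `max_words` value and the same stop-word list for every party mirrors the requirement that all manifestos are treated identically.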
Further text mining will be carried out to produce more analytical visualisations.
This type of text mining is just a small part of what our students learn on the MSc and Professional Doctorate in Data Science at the University of East London.