Visual and Predictive Analytics: FAKE/TRUE NEWS Classifier
FAKE NEWS... A hot topic that has been around for many years and has become even hotter amid the COVID-19 pandemic. But what exactly is fake news? Fake news is false or misleading information presented as news. It often aims to damage the reputation of a person or entity or to make money through advertising revenue.
Why is it relevant, particularly for business, to analyze, characterize, and PREDICT fake news? Besides being considered a serious threat to real journalism, fake news has also been linked to stock market fluctuations and massive trades. For example, a few years ago a fake news report claiming that Barack Obama had been injured in an explosion wiped out 130 billion dollars in stock value.
As I mentioned in my first post, in this and future posts I'll dive a little deeper into relevant examples of Data Analytics applied to real-life situations. In this post, I'll describe how Advanced Visualization techniques can be used to visually explore and characterize fake and true (real) news articles and to quickly extract insight that can reinforce PREDICTIONS made with, for example, Neural Natural Language Modeling for binary text classification using the R package ruimtehol.
To facilitate the exposition, I've divided it into three sections:
- Data preparation in TRIFACTA
- QUANTEDA Visualization Tools: Descriptive Analytics
- Neural Natural Language Modeling - RUIMTEHOL R Package: Predictive Analytics
Now, it's time to unlock another real-world use case. I hope you find it both interesting and informative. Please don't forget to leave your comments below and share.
Data Preparation in TRIFACTA
The dataset for this example is from a Kaggle competition; it comprises two files in CSV format containing news from November 2016 to December 2017 (fake news articles file 61.32 MB, 23,500 fake news articles; real news articles file 52.33 MB, 21,417 true news articles). Both files have the same features/columns: news title/headline, the text of the news, news subject, and dates.
A quick note before moving forward. There is a common misconception that Analytics techniques are only suited to high volumes of data, or "big data". This example (like the use case to be discussed in the next post) shows that most Analytics methods and techniques are flexible enough to adapt smoothly to both small data volumes and "big data". The same flexibility applies to modern data preparation workflows and tools like TRIFACTA software (available on-premise and in the cloud). Still, in this particular example, why not prepare the data with an Excel spreadsheet or a programmed script instead of TRIFACTA?
First, as the next post's data file will show, a small file doesn't mean a simple one. Second, in collaborative tools like TRIFACTA, users of any level of expertise (only domain knowledge and general math/statistics are required) can visually assess the data and interact with it intuitively throughout the preparation process, knowing exactly what is happening to each data point at any time.
Data preparation, including blending disparate and complex file types, becomes an easily auditable, explainable, flexible, and reproducible experience, no matter the file size, and the process can be fully automated. Yes, the once very time-consuming and tedious data preparation job can now be done better, easier, and faster in TRIFACTA. Achieving the same goals with cryptic programmed scripts or Excel spreadsheets would be impractical, difficult, and very time-consuming.
Figure 1: Data preparation in TRIFACTA of TRUE-News text data.
Figure 1 above depicts the true news text data loaded in TRIFACTA. To the right is the recipe with the logic implemented to clean the data (eliminating punctuation, unnecessary symbols, etc.) and perform all the necessary transformations (converting to lowercase, for example), including the addition of an extra column with the tag or label (class) TRUE.
Figure 2: Data preparation in TRIFACTA of FAKE-News text data.
Figure 2 above illustrates the same for the fake news data; a column with the tag or label (class) FAKE has also been added. Figure 3 below shows the result of blending the refined true news and fake news data into a single dataset. Only the necessary features have been kept: a document/news ID, news title, news text, and the label/class, namely TRUE or FAKE. This tag or class will be used later in the workflow, together with the news text, to train a Machine Learning model for a binary classification task. At this point, it's also possible to connect directly to the refined blended data or download it in the appropriate format. The data is ready for visual exploration/Descriptive Analytics, Predictive Analytics, and more.
Figure 3: Blending of FAKE/TRUE News text data.
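For readers who want to see the logic of the recipes spelled out, here is a minimal Python sketch of the same three steps: clean, label, and blend. It is not the TRIFACTA recipe itself; the function names, the toy rows, and the exact cleaning rules (lowercase, strip ASCII punctuation, collapse whitespace) are assumptions made for illustration.

```python
import re
import string

def clean_text(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def label_rows(rows, label):
    """Keep only title and text, clean both, and attach the class label."""
    return [
        {"title": clean_text(r["title"]), "text": clean_text(r["text"]), "label": label}
        for r in rows
    ]

def blend(true_rows, fake_rows):
    """Stack both labeled sets and assign a sequential document ID."""
    blended = label_rows(true_rows, "TRUE") + label_rows(fake_rows, "FAKE")
    return [{"doc_id": i, **row} for i, row in enumerate(blended, start=1)]

# Hypothetical toy rows standing in for the two Kaggle CSV files.
true_rows = [{"title": "Senate Passes Budget", "text": "The Senate voted on Friday.",
              "subject": "politics", "date": "2017-10-20"}]
fake_rows = [{"title": "SHOCKING!!!", "text": "You won't BELIEVE this...",
              "subject": "news", "date": "2017-01-05"}]

news = blend(true_rows, fake_rows)
for row in news:
    print(row)
```

The subject and date columns are dropped on purpose, mirroring the blended dataset described above, which keeps only the ID, title, text, and label.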
It is important to point out that if any issue is detected downstream or additional features are required to be included in the analysis, it is pretty easy to go back and edit the recipe(s), solve the problem, and move on again in the pipeline.
With the data ready, it is time to explore it visually and gain some useful insight; it's time for Descriptive Analytics. This is where QUANTEDA's powerful visualization functions come in handy. Let's build some informative and useful visualizations!
QUANTEDA Visualization Tools: Descriptive Analytics
The QUANTEDA package is a fast, flexible, and comprehensive framework for graphic and quantitative text analysis. The package is designed for R users needing to apply Natural Language Processing (NLP) to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. Therefore, the package greatly benefits researchers, students, and other analysts with fewer financial resources. I was particularly interested in some of the package's rich graphic toolbox functions, to visually explore the fake/true news dataset and quickly surface relevant and useful insight.
Now for a few examples of QUANTEDA graphic tools used to visually assess the fake/true news text. Figure 4 below shows a comparison of the fake news and true news Word Clouds. The word distribution of fake news is sharper than that of true news: fake news tends to use fewer distinct words and to repeat some of them ("trump", for example) more frequently than true news. The "transition" zone (in light red and green) between low-frequency words (in red) and high-frequency words (in blue) is notably broader in true news; see the Word Cloud to the right.
Figure 4: FAKE and TRUE News Word Clouds comparison.
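The "sharper distribution" reading of the Word Clouds can also be checked numerically. The sketch below, in plain Python rather than QUANTEDA, counts word frequencies per class and computes a type-token ratio (distinct words divided by total words); a lower ratio means fewer distinct words repeated more often. The toy texts and the function name are hypothetical, and stopword removal is left out for brevity.

```python
from collections import Counter

def word_profile(texts, top_n=3):
    """Summarize how concentrated the vocabulary of a set of texts is."""
    tokens = [tok for text in texts for tok in text.split()]
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),                      # total words
        "types": len(counts),                       # distinct words
        "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
        "top": counts.most_common(top_n),           # most repeated words
    }

# Hypothetical, already-cleaned toy texts for each class.
fake_texts = ["trump said trump will win", "trump shock claim"]
true_texts = ["senate passes budget bill", "court reviews trade ruling"]

fake_profile = word_profile(fake_texts)
true_profile = word_profile(true_texts)
print("FAKE:", fake_profile)
print("TRUE:", true_profile)
```

On real data the ratio depends heavily on document length, so it is best compared on samples of similar size.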
Figure 5 below depicts a comparison of the fake news and true news Sentiment Word Clouds; the sentiments considered are fear, anticipation, trust, surprise, and sadness. More subtle differences between fake and true news can now be observed. For example, under sadness, fake news stresses words like "black", while true news stresses "tax"; under anticipation, fake news shows "boiler", while true news shows "white" and "house"; and so on. It could therefore be said that, by relying on specific sentiments, fake news tries to create or exploit a particular idea or opinion, sometimes making the reader's experience quite uneasy, whereas true news aims mostly to inform and is based on facts.
Figure 5: FAKE and TRUE News Sentiment Word Clouds comparison.
To further corroborate some of the above observations, I went ahead and built the Wordfish plots shown in Figure 6 below. A Wordfish plot makes it possible to track the words most often associated with a certain type of content or specific subject: the higher a word sits in the plot, the more likely it is to be related to it.
It's clear from the plots, and in line with the previous observations, that true news, unlike fake news, is richer in words, resulting in a broader word distribution that is quite noticeable in the Wordfish plot to the right (see figure below).
Figure 6: FAKE and TRUE News Wordfish plots comparison.
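As background on what a Wordfish plot encodes, the standard Wordfish model treats the count of word j in document i as Poisson-distributed. The symbols below follow the usual presentation of the model, not anything stated in this post, and reading the vertical axis as the word fixed effect assumes the plots are feature-level scaling plots of the kind QUANTEDA provides.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \, \theta_i
```

Here alpha_i is a document fixed effect (it absorbs document length), psi_j is a word fixed effect (it absorbs overall word frequency), beta_j is the weight of word j in discriminating between positions, and theta_i is the estimated position of document i. Under that assumption, words plotted higher have a larger psi_j, meaning they are used more often overall, while their horizontal placement reflects beta_j.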
For a clearer comparison, Figure 7 and Figure 8 below show the fake news and true news Wordfish plots again, respectively. Consider, for example, the words "black" and "police": in the fake news plot they appear very close together and quite near the top, whereas in the true news Wordfish plot the two words are notably separated and "black" shows up lower in the plot.
Figure 7: FAKE News - Wordfish plot discussion.
Figure 8: TRUE News - Wordfish plot discussion.
In summary, the visual exploration so far, using some of QUANTEDA's powerful visualization tools, suggests that fake news focuses mainly on particularly sensitive topics and specific words (sometimes making the reader's experience pretty uneasy) and, at most, on partially true opinions, while real news is more informative, richer in words, and very often based on verifiable facts.
With the visual exploration finished and some useful insight gained, the next step is to build a content-based news classifier by training a Machine Learning binary classification algorithm to predict whether a fresh news article is fake or true. This task can be achieved, for example, by performing Neural Natural Language Modeling with the R package RUIMTEHOL.
Neural Natural Language Modeling - RUIMTEHOL R Package: Predictive Analytics
RUIMTEHOL is a comprehensive R package that wraps the StarSpace C++ library. It's a Neural Natural Language Modeling toolkit that allows you to perform:
- Text classification.
- Finding sentence or document similarity.
- Content-based recommendations (e.g. recommend text/music based on the content).
- Collaborative filtering-based recommendations (e.g. recommend text/music based on interest).
- And much more.
As mentioned at the end of the last section, a binary text classification task will be carried out. Each news article, as noted in the data preparation section, has already been labeled with the tag/class FAKE or TRUE. A binary classification model is now built that can be used to tag fresh news articles (articles not considered in the modeling). For a detailed discussion and explanation of the modeling process using RUIMTEHOL functions, please explore this LINK.
Figure 9 below illustrates an example of the prediction of FAKE news. The model correctly classifies the fresh fake news article (news text included in the figure) as FAKE. The tabulated similarity values can be read loosely as a prediction probability: the closer the value is to 1 (0.9999447), the higher the likelihood that this news sample is FAKE; a negative value close to -1 (-0.999511) indicates a very low likelihood that the news is TRUE.
Figure 9: NNLM prediction - FAKE News example.
Now, Figure 10 below illustrates an example of the prediction of TRUE news. Again, the model correctly classifies the fresh true news article (news text included in the figure) as TRUE. As before, the tabulated similarity values can be read loosely as a prediction probability: the closer the value is to 1 (0.9999447), the higher the likelihood that the news sample is TRUE; a negative value close to -1 (-0.999511) indicates a very low likelihood that the news is FAKE.
Figure 10: NNLM prediction - TRUE News example.
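To make the similarity values more concrete, here is a small Python sketch of how a score between -1 and 1 can arise. It assumes the tabulated similarity is a cosine similarity between a document embedding and one embedding per label, which the range of the values suggests but this post does not confirm. The two-dimensional vectors are hypothetical, and this is not RUIMTEHOL code.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 = same direction, -1 = opposite direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(doc_vec, label_vecs):
    """Score the document against every label and pick the most similar one."""
    scores = {label: cosine(doc_vec, vec) for label, vec in label_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical learned embeddings: the two labels point in opposite directions.
label_vecs = {"FAKE": [1.0, 0.0], "TRUE": [-1.0, 0.0]}
doc_vec = [0.9, 0.1]  # hypothetical embedding of a fresh article

best, scores = classify(doc_vec, label_vecs)
print(best, scores)
```

Read this way, a score is a measure of closeness in the embedding space rather than a calibrated probability, which is why a very high score for one label tends to come with a very low score for the other.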
Reading the text samples depicted in Figure 9 and Figure 10 above, it is possible to identify several features similar to those surfaced in the visual exploration carried out in the previous section, characteristics that typify fake and real news, respectively. As would be expected, the intuition and the observations are consistent with, and reinforce, the predictions made by the Neural Natural Language model just implemented.
Summary
Wrapping up, a dataset from a Kaggle competition, comprising samples of fake and true news articles (23,500 and 21,417 articles, respectively), was cleaned, structured, reshaped, and blended in TRIFACTA software. The resulting refined data was the input to some of the powerful graphic tools available in the QUANTEDA R package, which delivered a handful of compelling and informative visualizations. The refined data was also used to train a Neural Natural Language algorithm available in the RUIMTEHOL R package; a binary classification model was implemented to predict whether fresh news articles (not considered in the modeling process) were true or fake. Reader intuition and the results of both Visual Analytics and Predictive Analytics are consistent and reinforce each other.
As a bonus, here is a video walkthrough of the presented use case, showcasing, together with additional details and relevant information, some interesting Topic Modeling findings obtained using the stm R package:
In future posts, I'll present and discuss more interesting and relevant real-life use cases. Please stay tuned so you don't miss them, and kindly leave your comments below and share. Thank you!