Fake news has existed for as long as there has been news, spreading first through rumor, then printed media, radio, and television. More recently, the information age, with its communications and Internet breakthroughs, has exacerbated the spread of fake news. Moreover, aside from e-commerce, the current Internet economy depends on advertisements, views, and clicks, which prompts many developers to bait end users into clicking links or ads. Consequently, the rapid spread of fake news through social media networks has affected real-world issues, from elections to 5G adoption and the handling of the COVID-19 pandemic. Efforts to detect and thwart fake news, from fact checkers to artificial intelligence-based detectors, are as old as fake news itself, and solutions continue to evolve as fake news propagators adopt more sophisticated techniques. In this paper, R code is used to study and visualize a modern fake news dataset. We use clustering, classification, correlation, and various plots to analyze and present the data. The experiments show the high efficiency of classifiers in telling real news apart from fake news.
Point-of-view analysis and sentiment classification methods showed that the identification of fake news depends on the language's content. Tweets were collected and a probabilistic Naïve Bayes (NB) model was used. The NB model was compared to Neural Network and Support Vector Machine (SVM) classifiers; while NB was faster, it was less accurate and often required a vast number of records to achieve good outcomes. In contrast, Cybenko and Cybenko did not believe that artificial intelligence or machine learning would be able to categorize and measure what is true news and what is fake news. They raised the question of how programmers will avoid inheriting biases in the software, and argued that ethical issues will arise in cybersecurity as computers decide for humans what is true and what is false.
III. METHODOLOGY AND DATASET
Before suggesting solutions for fake news detection, one must decide which dataset to use from the large number of publicly available repositories. The LIAR dataset combines metadata with text, which provides a significant improvement during analysis. To help solve this problem, we provide R code to study the LIAR dataset and examine the most common counts it includes. The counts in the data fall into five major categories: mostly true counts, false counts, half true counts, barely true counts, and pants on fire counts.
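As a minimal sketch of this step, the following R code loads the LIAR training split and totals the five count categories. It assumes the split is available locally as a tab-separated file named train.tsv (the dataset's usual distribution format); the column names follow the attribute list given in Section B below.

    # Load the LIAR training split (assumed to be a local TSV file named "train.tsv").
    liar <- read.delim("train.tsv", header = FALSE, quote = "",
                       stringsAsFactors = FALSE)
    colnames(liar) <- c("id", "label", "statement", "subjects", "speaker",
                        "job_title", "state", "party", "barely_true_counts",
                        "false_counts", "half_true_counts", "mostly_true_counts",
                        "pants_on_fire_counts", "context")

    # Total each of the five credit-history counts across all statements.
    colSums(liar[, c("barely_true_counts", "false_counts", "half_true_counts",
                     "mostly_true_counts", "pants_on_fire_counts")])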
A. Dataset
The LIAR dataset was collected from political debates, TV ads, Facebook posts, tweets, interviews, news releases, and other sources. It includes 12,800 human-labeled short statements from PolitiFact.com's API, and each statement was evaluated by a PolitiFact.com editor for its truthfulness. After initial analysis, the authors found duplicate labels and merged full-flop, half-flip, and no-flip into false, half-true, and true, respectively. In LIAR, there are six labels for truthfulness ratings: pants-fire, false, barely-true, half-true, mostly-true, and true. Moreover, the LIAR dataset includes a mix of Democrats, Republicans, and posts from online social media. In addition, there is rich metadata for each speaker, including current job, home state, party affiliation, and credit history.
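As a quick sanity check, one can tabulate the label column to confirm the six truthfulness ratings; this short sketch reuses the liar data frame loaded in the previous listing.

    # Count statements per truthfulness rating; six levels are expected:
    # pants-fire, false, barely-true, half-true, mostly-true, true.
    table(liar$label)

    # The same distribution as proportions of the dataset.
    prop.table(table(liar$label))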
B. Attributes Discovery
The LIAR dataset has 14 attributes, where columns 1 through 8 represent the following.
1. Column 1: the ID of the statement.
2. Column 2: the label.
3. Column 3: the statement.
4. Column 4: the subject(s).
5. Column 5: the speaker.
6. Column 6: the speaker's job title.
7. Column 7: the state info.
8. Column 8: the party affiliation.
On the other hand, columns 9 through 13 represent the total credit history count, including the current statement, as follows.
9. Column 9: barely true counts.
10. Column 10: false counts.
11. Column 11: half true counts.
12. Column 12: mostly true counts.
13. Column 13: pants on fire counts.
14. Column 14: the context (venue / location of the speech or statement).
This layout can be verified programmatically, as shown in the sketch after this list.
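The following illustrative sketch (again reusing the liar data frame loaded earlier) checks that the file parsed into the 14 columns described above and inspects two of the speaker metadata attributes.

    # Confirm the 14-column layout described above, then summarize each column.
    stopifnot(ncol(liar) == 14)
    str(liar, vec.len = 1)

    # Inspect two of the speaker metadata attributes: party and speaker.
    head(sort(table(liar$party), decreasing = TRUE))
    head(sort(table(liar$speaker), decreasing = TRUE))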