Fake NEWS detection using Data Analytics



Do you think all the news that spreads across the internet is true and realistic? Not at all. Fake news has become a serious issue in the digital world. It spreads like wildfire, quickly and without limits, and affects the lives of millions of people. So how can we deal with fake news? It is not as easy as turning to a simple fact-checker, because such news is intentionally crafted on a story-by-story basis. This is where Python can help us.



Project Description

Before going deep into the fake news detection project, let’s get familiar with some terms related to this project.

To get statistics about the news, we need to count how often each word appears in a document. One issue with raw word counts is that words like ‘the’ appear many times in almost every document, so their counts carry little meaning in the encoded vector.
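The problem with raw counts can be seen in a minimal sketch. The sentence below is a made-up example, not taken from the project dataset, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import Counter

# Hypothetical one-sentence "document", made up for illustration.
document = "the minister said the report on the budget was false"

# Naive bag-of-words: lowercase, split on whitespace, count each token.
counts = Counter(document.lower().split())

# The most frequent token is a stop word that says nothing about the topic.
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)
```

The word with the highest count is ‘the’, while informative words such as ‘budget’ appear only once.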

One solution is to weight each word's frequency instead of using the raw count. The method used for this is TF-IDF, which stands for “Term Frequency – Inverse Document Frequency”.

  • Term Frequency: It indicates how many times a word appears in a document. A higher value means the word appears more often in that document.
  • Inverse Document Frequency: IDF measures how rare a term is across all the documents in the collection. Words that occur many times in one document may occur many times in the others as well; such words carry little information and receive a low IDF.
In short, TF-IDF is a word-weighting scheme that tries to highlight the interesting words. The TF-IDF Vectorizer tokenizes the documents, learns the vocabulary and the IDF weights, and converts the raw text into a TF-IDF matrix. It can then encode new documents in the same way.
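As a rough illustration of the two terms, here is a sketch that computes TF-IDF by hand using the textbook definition, tf × log(N / df). The three-sentence corpus and the helper names `tf`, `idf` and `tf_idf` are made up for this example. scikit-learn's TfidfVectorizer uses a smoothed and normalized variant of the formula, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, made up for illustration.
corpus = [
    "the senator denied the claim",
    "the claim was shared widely",
    "the vaccine study was retracted",
]
docs = [text.split() for text in corpus]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(N / number of documents with `term`).

    Assumes `term` occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Textbook TF-IDF weight of `term` in `doc`."""
    return tf(term, doc) * idf(term, docs)


score_the = tf_idf("the", docs[0], docs)
score_senator = tf_idf("senator", docs[0], docs)
print(f"the: {score_the:.3f}, senator: {score_senator:.3f}")
# prints "the: 0.000, senator: 0.220"
```

Although ‘the’ is the most frequent word in the first document, it appears in every document, so its IDF is log(1) = 0 and its weight vanishes. ‘senator’ appears in only one document and therefore gets the higher score.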

Modules used for this Project

  • NumPy: NumPy stands for ‘Numerical Python’. It is a library for scientific computing that provides a high-performance multidimensional array object, along with routines for linear algebra, random number generation and Fourier transforms. NumPy arrays can also be used as a multidimensional container for generic data.
  • pandas: pandas is an open-source library built on top of NumPy, so NumPy must be installed on your machine to run pandas. It is used for data manipulation in Python and provides an easy and efficient way to slice, merge, concatenate and reshape data.
  • sklearn: scikit-learn (imported as sklearn) is an open-source Python library that includes a wide range of machine learning algorithms, together with preprocessing, cross-validation and some visualization utilities.

Project Implementation

The dataset used for this project is news.csv. It has a shape of 7796 × 4 (rows × columns).

The dataset has four columns: the first identifies the news item, the second and third are the title and the text, and the fourth is the label denoting FAKE or REAL.

Follow the steps below to complete the project:

  • Make the necessary imports.
  • Read the data into a DataFrame and get the shape of the data.
  • Get the labels from the DataFrame.
  • Split the dataset into training and testing sets.
  • Initialize the TfidfVectorizer with English stop words and a maximum document frequency of 0.7, so that terms appearing in more than 70% of the documents are discarded.
  • Fit and transform the vectorizer on the training set, and transform the testing set.
  • Initialize the PassiveAggressiveClassifier and fit it on the TF-IDF features of the training set.
  • Predict on the testing set and calculate the accuracy.
  • Finally, print the confusion matrix to see the number of true and false positives and negatives.
  • After completing the project, we get an accuracy of 92.82%.
Software requirements: PyCharm Community Edition.

Programming languages and modules: Python 3, NumPy, pandas, sklearn.
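
The implementation steps listed above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the column names `text` and `label`, the function name `detect_fake_news`, and the values of `test_size`, `random_state` and `max_iter` are all assumptions. A tiny made-up DataFrame stands in for news.csv so the sketch runs on its own, which means the accuracy it prints says nothing about the 92.82% figure.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def detect_fake_news(df, test_size=0.2, random_state=7):
    """Train and evaluate a TF-IDF + PassiveAggressive model on `df`.

    Assumes `df` has a 'text' column and a 'label' column (FAKE or REAL).
    """
    labels = df["label"]

    # Split the articles and labels into training and testing sets.
    x_train, x_test, y_train, y_test = train_test_split(
        df["text"], labels, test_size=test_size, random_state=random_state
    )

    # Drop English stop words and terms found in more than 70% of documents.
    vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
    tfidf_train = vectorizer.fit_transform(x_train)  # learn vocabulary and IDF
    tfidf_test = vectorizer.transform(x_test)        # reuse them on test data

    classifier = PassiveAggressiveClassifier(max_iter=50, random_state=random_state)
    classifier.fit(tfidf_train, y_train)

    y_pred = classifier.predict(tfidf_test)
    accuracy = accuracy_score(y_test, y_pred)
    # Rows are true labels, columns are predictions, ordered FAKE then REAL.
    matrix = confusion_matrix(y_test, y_pred, labels=["FAKE", "REAL"])
    return accuracy, matrix


# Made-up stand-in data; with the real file, use df = pd.read_csv("news.csv").
df = pd.DataFrame({
    "text": [
        "aliens secretly control the city council says anonymous blog",
        "miracle fruit cures every disease overnight doctors stunned",
        "celebrity replaced by clone shocking leaked photos reveal",
        "moon landing staged in basement claims viral video",
        "secret potion makes people invisible insiders confirm",
        "city council approves budget for road repairs next year",
        "health ministry publishes annual vaccination statistics",
        "central bank keeps interest rates unchanged this quarter",
        "university researchers publish study on crop yields",
        "parliament passes bill on public transport funding",
    ],
    "label": ["FAKE"] * 5 + ["REAL"] * 5,
})

accuracy, matrix = detect_fake_news(df)
print(f"Accuracy: {accuracy:.2%}")
print(matrix)
```

Passing the `labels` argument to `confusion_matrix` fixes the row and column order, so the matrix is always 2 × 2 even if the testing set happens to contain only one class.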

