Tableau Time

Our next data analysis tool is Tableau. I used Tableau Public to create some initial charts exploring the Fake News dataset from Kaggle.

Stories per Country

As mentioned in the previous post, the subset is heavily weighted toward the US via entries in the “Country” field. The pie chart below breaks down the countries by the sum of their associated “Stories”, or rows. Note how disproportionate the US is with 5,087 stories; DE (Germany) is second with 172 and GB (Great Britain) is third with 143. Again, it is not entirely clear whether the Country field refers to the publishing website’s country of origin or how it was assigned.

[Screenshot: pie chart of stories per country]


Popularity by Website

The following chart examines the popularity of the websites represented in the data subset, based on the “Domain Rank” field. The rows of the chart are the “Site URL” values and the columns the “Domain Rank”. The top site, Amren, has a domain rank of about 8.7 million. My research on domain ranking suggested a more limited scale (e.g. 1-100), so the nature of these scores is unclear. Within the context of the dataset, however, the chart demonstrates 1) the websites included and 2) their Domain Rank relative to each other. Note that the charts required screen captures for inclusion here, and this one extended down multiple screens’ worth.

[Screenshot: bar chart of Domain Rank by Site URL, part 1]

[Screenshot: bar chart of Domain Rank by Site URL, part 2]

Popular Websites and Stories Count

Building on the bar chart above, I added the number of records to Rows, i.e. the sum of “Stories”. The scatterplot below shows not only the most popular websites by Domain Rank, but also how many stories each contributed to the dataset. Amren, the most popular site, contributed nearly the most stories at 93. Liberal America contributed the most at 95 but is much less popular, with a Domain Rank of around 2.8 million.
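The data behind that scatterplot can be approximated in pandas with a groupby: one row per site, carrying its domain rank and its story count. The site names and ranks below are illustrative stand-ins, not rows from the actual file:

```python
import pandas as pd

# Illustrative rows only; the real subset has ~6,000 stories.
df = pd.DataFrame({
    "site_url": ["amren.com", "amren.com", "liberalamerica.org"],
    "domain_rank": [8_700_000, 8_700_000, 2_800_000],
})

# One row per site: its domain rank plus how many stories it
# contributed -- the two axes of the scatterplot.
per_site = (df.groupby("site_url")
              .agg(domain_rank=("domain_rank", "first"),
                   stories=("domain_rank", "size"))
              .reset_index())
print(per_site)
```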


[Screenshot: scatterplot of Domain Rank vs. story count by Site URL]


SQL Time

We are beginning to employ stronger tools for analyzing our dataset. This time, it is SQL. I used Google BigQuery to load a subset of the Fake News dataset.

It was necessary to cut down the dataset, initially because of the misaligned and bogus rows I have been mentioning. Looking closer while preparing the upload, there were a lot of these. If you download the dataset from Kaggle and browse it, you will see what I mean. For simplicity, I manually extracted around 6,000 rows from the dataset for import into BigQuery. I also removed the “Text” field from the import, which holds the full text of each story. Those are huge strings of text that we can do without for now.
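The same trim could be scripted in pandas rather than done by hand. The frame below is a tiny stand-in for the Kaggle export (the real file has about 20 columns), so the shape is an assumption:

```python
import pandas as pd

# Tiny stand-in for the Kaggle export; the real file has ~20 columns
# and thousands of rows.
df = pd.DataFrame({
    "title": ["Story A", "Story B", "Story C"],
    "text": ["long body ...", "long body ...", "long body ..."],
    "country": ["US", "DE", "US"],
})

# Drop the bulky full-text column, then keep the first N rows --
# the same trim done manually before the BigQuery import.
subset = df.drop(columns=["text"]).head(2)
print(list(subset.columns))  # ['title', 'country']
print(len(subset))           # 2
```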

Following import, I ran a few initial queries around the “Country” field, combined with “Replies_count” and “Likes”. It is currently unclear whether “Country” refers to the site’s country of origin or otherwise, but based on our subset the list is heavily weighted to the U.S. Over 5,000 of the roughly 6,000 rows have ‘US’ listed.
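Tallying the Country field is a one-liner in pandas. The rows below are a stand-in, but the real subset shows the same US dominance:

```python
import pandas as pd

# Stand-in rows; in the real subset, over 5,000 of ~6,000 are 'US'.
df = pd.DataFrame({"country": ["US", "US", "US", "DE", "GB", "US"]})

counts = df["country"].value_counts()
print(counts.idxmax(), counts.max())  # US 4
```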

Eventually, I settled on the query below as a starting point. It selects a few chosen fields and shows the first 100 stories that received 100 or more Likes on Facebook. First, however, it eliminates rows where the story title is blank (NULL), so we have some context around which titles received the most likes. The results are ordered by the number of likes.

SELECT title, likes, site_url, author, country
FROM [engaged-tape-165216:CIS395.fake_Import_NoText]
WHERE title IS NOT NULL
AND likes >= 100
ORDER BY likes DESC
LIMIT 100
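For anyone working locally instead of in BigQuery, the same filter and sort can be sketched in pandas. The toy rows here are invented, and the column names follow the query above:

```python
import pandas as pd

# A few invented rows standing in for the ~6,000-row import.
df = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "likes": [250, 500, 90, 120],
    "site_url": ["x.com", "y.com", "z.com", "x.com"],
})

# Non-null title, 100+ likes, most-liked first, capped at 100 rows --
# the pandas equivalent of the WHERE / ORDER BY / LIMIT clauses.
top = (df[df["title"].notna() & (df["likes"] >= 100)]
         .sort_values("likes", ascending=False)
         .head(100))
print(top["title"].tolist())  # ['A', 'D']
```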

The query result set is below. Enjoy!

results-20170420-164702.csv

Note, I would have preferred to upload an Excel or CSV version (the link is to a PDF), but inserting those document types failed on WordPress for some reason. This result set is sorted by number of likes, but it was also interesting to sort by Site URL to see which sites are most represented.


ER Diagramming

One method to analyze our data, useful for grouping and understanding relationships within it, is a simple Entity Relationship (“ER”) diagram. ER diagrams were originally developed for database design. Lucid Charts provides a good tutorial, including the history of ER diagrams, here.

In our example, we have three Entities and two Relationships, described below.

Entities

  1. “News” Story – a story written and published online.
  2. Website – a website hosting a story or stories.
  3. Social Media Posts – a posting about a story, including Likes, Comments or Shares on Facebook.

Relationships

  1. A “News” Story can be published on many Websites. Cardinality is One to Many.
  2. A “News” Story is referenced by many Social Media Posts. Cardinality is One to Many.
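As a rough sketch, the three entities and their one-to-many relationships could be modeled in Python like this. The class and field names loosely follow the dataset’s columns; this is an illustration, not the actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SocialMediaPost:
    # Engagement counts from Facebook.
    likes: int = 0
    comments: int = 0
    shares: int = 0

@dataclass
class Website:
    site_url: str
    domain_rank: int = 0

@dataclass
class NewsStory:
    title: str
    # One story -> many websites; one story -> many posts.
    websites: list = field(default_factory=list)
    posts: list = field(default_factory=list)

story = NewsStory(title="Example headline")
story.websites.append(Website(site_url="example.com"))
story.posts.append(SocialMediaPost(likes=100))
print(len(story.websites), len(story.posts))  # 1 1
```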

ER Diagram

Origins of Fake News Pt. 1

As we continue to analyze the Fake News dataset, another field of interest is “Author”. Below is a subset copied from the Author field after removing duplicates, which left over 4,000 unique Author entries. However, the misaligned-rows issue continues to be a factor: I estimate only between 25% and 40% of those 4,000 unique entries are actual names like the ones below, as opposed to long strings of text.
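A sketch of that dedup step in pandas, plus one crude way to separate real-looking names from the long misaligned strings. The sample values are stand-ins, and the 40-character cutoff is an arbitrary assumption, not a rule from the dataset:

```python
import pandas as pd

# Stand-in values: real-looking names, a duplicate, and one of the
# long misaligned strings the raw file contains.
authors = pd.Series([
    "Ralph Nader", "Pepe Escobar", "Ralph Nader",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed...",
])

unique = authors.dropna().drop_duplicates()
# Crude heuristic: treat short entries as plausible author names.
names = unique[unique.str.len() <= 40]
print(len(unique), len(names))  # 3 2
```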

This subset was chosen because a particular name jumped out. HINT: he has run for President on the Green Party ticket.

Suffice it to say, it may not be easy to determine credible origins based on the Author field, but we will look closer.

Kenneth Surin
José Tirado
Joe Sammut
Mel Gurtov
Julian Vigo
Gilbert Mercier
Chandra Muzaffar
Murray Dobbin
David Macaray
Ralph Nader
Pepe Escobar
Dean Baker
Renee Parsons
Mike Whitney
CP Editor
Wahid Azal
Lynne Wycherley
Aidan O’Brien
Contributed Daily News Bin Staff
dailouk
Andrew Anglin

Getting Familiar with our Data

The Fake News dataset from Kaggle contains over 7,500 rows, or entries. I noticed some misaligned or improperly rendered rows, but generally this is our data size. As for fields, there are 20 columns, one of which is “Type”, which categorizes each fake news entry. There seems to be some overlap among types, and “BS”, the most common type, is fairly ambiguous. Still, you can get an idea of the kinds of fake news compiled.

Below are the total counts per Fake News Type:


Type         Count
bias           443
bs           11444
conspiracy     430
fake            19
hate           245
junksci        101
satire         146
state          121
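A tally like the one above comes straight from value_counts in pandas. The rows here are a small stand-in for the real “Type” column:

```python
import pandas as pd

# Stand-in rows; the real 'type' column yields the table above.
df = pd.DataFrame({"type": ["bs", "bias", "bs", "conspiracy", "bs"]})

counts = df["type"].value_counts()
print(counts.idxmax(), counts.max())  # bs 3
```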