Final Project

For the final project, I created a Tableau Story viewable on Tableau Public. This story examines the popularity of Fake News using a “Drill Down” approach. In other words, we begin by assessing the websites’ popularity and iteratively narrow down on the popularity of the stories posted.

There are three Story Points:

  1. Website Domain Rank Popularity
  2. Website Domain Rank and Published Content Volume
  3. Story Titles by Facebook Likes

We begin by examining the websites and how much traffic they receive, based on the Domain Rank field. Next, we look at how much published content each site is contributing, based on the number of story records. Lastly, we glimpse the published story titles receiving the most Facebook Likes, sorted in descending order, and hypothesize a genre theme.

ENJOY!

Tableau Time

Our next data analysis tool is Tableau. I used Tableau Public to create some initial charts exploring the Fake News dataset from Kaggle.

Stories per Country

As mentioned in the previous post, the subset dataset is heavily weighted to the US via entries in the “Country” field. The pie chart below depicts the countries by the sum of their associated “Stories” or rows. Note the disproportion to the US with 5,087. DE (Germany) is  second with 172 and GB (Great Britain) has third most with 143, respectively. Again, it is not entirely clear if the Country field refers to the publishing website’s origin or how it was assigned.

Screen Shot 2017-04-21 at 4.28.16 PM

 

Popularity by Website

The following chart examines the popularity of the websites represented in the data subset, based on the “Domain Rank” field. The rows of the chart are represented by the “Site URL” and the columns by “Domain Rank”. The top site, Amren, has a domain rank of about 8.7 million. My research on domain ranking seemed to indicate a more limited scale (e.g. 1-100), so it is unclear the nature of these scores. However, within the context of the dataset, the chart demonstrates 1) the Websites included and 2) their Domain Rank relative to each other. Note, the charts required screen captures for inclusion here, and this one extended down multiple screens worth.

Screen Shot 2017-04-21 at 3.34.38 PM

Screen Shot 2017-04-21 at 3.45.48 PM

Popular Websites and Stories Count

Building off the bar chart above, I added the number of records to Rows, i.e. the Sum of “Stories”. The scatterplot chart below shows us not only the most popular websites by Domain Rank, but also how many stories each contributed to the dataset. Amren, the most popular site contributed near the most stories at 93. Liberal America contributed the most at 95 but is much less popular, having around 2.8 million Domain Rank.

 

Screen Shot 2017-04-21 at 4.13.30 PM

 

 

 

SQL Time

We are beginning to employ stronger tools for analyzing our dataset. This time, it is SQL. I used Google Big Query to load a subset of the Fake News dataset.

It was necessary to cut down the dataset, initially because of the misaligned and bogus rows I have been mentioning. Looking closer and preparing to upload the dataset, there were ALOT of these. If you download the dataset from Kaggle and browse it, you will see what I mean. Anyhow, for simplicity I manually extracted around 6,000 rows from the dataset for import into Big Query. I also removed the “Text” field from the import, which is ostensibly the full text of each story. Those are huge strings of text that we can do without for now.

Following import, I ran a few initial queries around the “Country” field, combined with “Replies_count” and “Likes”. It is currently unclear whether “Country” refers to the site’s country of origin or otherwise, but based on our subset the list is heavily weighted to the U.S. Over 5,000 of the roughly 6,000 rows have ‘US’ listed.

Eventually, I settled on the query below as a starting point. It selects a few chosen fields and shows us the first 100 results of stories that received 100 or more Likes on Facebook. First however, it eliminates rows where the story title is blank or ‘NULL’. This was done so we have some context around what titles received the most likes. The results are sorted or ordered by the number of likes.

SELECT title, likes, site_url, author, country
FROM [engaged-tape-165216:CIS395.fake_Import_NoText]
WHERE title is not NULL
AND likes >= 100
ORDER BY likes
LIMIT 100

The Query result set is below….enjoy!

results-20170420-164702.csv – results-20170420-164702.csv

Note, I would have preferred to upload an Excel or CSV version (link is to PDF) but inserting those document types failed on WordPress for some reason. This result set is sorted by number  of likes, but it was interesting to sort by the Site URLs to see which are most represented.