SQL Time

We are beginning to employ stronger tools for analyzing our dataset. This time it is SQL: I used Google BigQuery to load a subset of the Fake News dataset.

It was necessary to cut the dataset down, initially because of the misaligned and bogus rows I have been mentioning. Looking closer while preparing the upload, there were a LOT of these. If you download the dataset from Kaggle and browse it, you will see what I mean. Anyhow, for simplicity I manually extracted around 6,000 rows from the dataset for import into BigQuery. I also dropped the “Text” field from the import, which is ostensibly the full text of each story; those are huge strings of text that we can do without for now.

Following the import, I ran a few initial queries around the “Country” field, combined with “Replies_count” and “Likes”. It is currently unclear whether “Country” refers to the site’s country of origin or something else, but based on our subset the list is heavily weighted toward the U.S.: over 5,000 of the roughly 6,000 rows have ‘US’ listed.
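For reference, the country breakdown can be pulled with a query along these lines (legacy BigQuery SQL to match the table reference used below; the exact field names are my reconstruction of the import schema, so adjust to taste):

```sql
-- Rows and total engagement per country, most common countries first
SELECT country,
       COUNT(*) AS stories,
       SUM(replies_count) AS total_replies,
       SUM(likes) AS total_likes
FROM [engaged-tape-165216:CIS395.fake_Import_NoText]
GROUP BY country
ORDER BY stories DESC
```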

Eventually, I settled on the query below as a starting point. It selects a few chosen fields and shows us the first 100 stories that received 100 or more Likes on Facebook. First, however, it filters out rows where the story title is blank or NULL, so we have some context around which titles received the most likes. The results are sorted by the number of likes.

SELECT title, likes, site_url, author, country
FROM [engaged-tape-165216:CIS395.fake_Import_NoText]
WHERE title IS NOT NULL
AND likes >= 100
ORDER BY likes DESC
LIMIT 100

The query result set is below… enjoy!

results-20170420-164702.csv

Note: I would have preferred to upload an Excel or CSV version (the link is to a PDF), but inserting those document types failed on WordPress for some reason. This result set is sorted by the number of likes, but it was also interesting to sort by Site URL to see which sites are most represented.
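If you would rather let SQL do that sorting, a grouped variant of the same query shows which sites are most represented among the highly-liked stories (again a sketch, using the same assumed field names):

```sql
-- Which sites appear most often among stories with 100+ Likes?
SELECT site_url,
       COUNT(*) AS story_count,
       SUM(likes) AS total_likes
FROM [engaged-tape-165216:CIS395.fake_Import_NoText]
WHERE title IS NOT NULL
  AND likes >= 100
GROUP BY site_url
ORDER BY story_count DESC
```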