Who’s behind this website? A checklist

2022-04-07 11:29


By Priyanjana Bengani (@acookiecrumbles) and Jon Keegan (@jonkeegan)
IRE NICAR Conference – March 4, 2022
Slides: English | Russian

WHAT IS THIS?

This checklist is a reporting tool to help journalists and researchers find out who published a website. Use it in conjunction with offline reporting techniques.

Following this checklist does not guarantee that you can unmask an owner of a website who does not want to be found, but it can help surface crucial clues and connections that can act as leads for further reporting.

Strong recommendation: while running through this checklist, create a data diary. It can be a TextEdit doc, a Google Doc, just the Notes app, whatever. It is important to be able to retrace your steps.

 

SITE CONTENT


Text
  •  ✍️ Are there any authors listed?

  •  Are there any email addresses or contact information?

  •  What’s the server’s local time?

    • Look at the datetime attribute of <time> elements on WordPress sites. The UTC offset at the end of the timestamp can reveal the time zone the site is set to: <time class="updated" datetime="2022-03-04T10:21:40+06:00">March 4, 2022</time>
  •  Does the website have a privacy policy or terms and conditions that mentions an LLC or which regional laws apply?

  •  Does the website have an RSS feed?

    • Does the RSS feed give any additional information about authors / stories that isn’t visible on the site?
    • You can pull RSS article links into Google Sheets using the IMPORTFEED function
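
The datetime and RSS checks above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a definitive tool: the function names, the sample HTML, and the sample feed are all made up for illustration, and it assumes the feed is RSS 2.0 with authors in either <dc:creator> or <author>.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime

# Matches the datetime attribute of <time> elements.
TIME_ATTR = re.compile(r'<time\b[^>]*\bdatetime="([^"]+)"', re.IGNORECASE)

# Namespace of the <dc:creator> element many feeds use for author names.
DC = "{http://purl.org/dc/elements/1.1/}"


def utc_offsets(html):
    """Count the UTC offsets found in <time datetime="..."> attributes."""
    offsets = Counter()
    for value in TIME_ATTR.findall(html):
        try:
            parsed = datetime.fromisoformat(value)
        except ValueError:
            continue  # skip values this parser cannot read
        if parsed.tzinfo is not None:
            offsets[parsed.strftime("%z")] += 1
    return offsets


def feed_authors(rss_xml):
    """Return (title, author) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    rows = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        author = item.findtext(DC + "creator") or item.findtext("author") or ""
        rows.append((title.strip(), author.strip()))
    return rows


# Hypothetical sample data, for illustration only.
SAMPLE_HTML = '<time class="updated" datetime="2022-03-04T10:21:40+06:00">March 4, 2022</time>'
SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example</title>
    <item><title>First story</title><dc:creator>Jane Doe</dc:creator></item>
    <item><title>Second story</title><author>editor@example.com</author></item>
  </channel>
</rss>"""

print(utc_offsets(SAMPLE_HTML))
print(feed_authors(SAMPLE_RSS))
```

A consistent offset across many posts is only a lead: it reflects how the site is configured, which may not match where the publisher actually is.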

Features and functionality
  •  Does the website have a newsletter?
    • Check for the physical postal address (required by the CAN-SPAM Act in the US)
  •  Does the website collect donations?
  •  Does the website have an e-commerce store? Or, does it sell products?
    • Try walking through the checkout process (without paying). Sometimes the real payee name is revealed just before you confirm the payment.

Links
  •  What domains does the website link to most? (Requires scraping)
  •  ❤️ Who links to the domain most often?
    • Google search operator: “link:yourwebsite.com”
    • Check backlinks on ahrefs.com
  •  Do the links have UTM codes?
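
For the scraping step and the UTM check, a rough sketch along these lines may help once you have saved a page's HTML. It uses only Python's standard library; the class and function names and the sample markup are hypothetical, and a real crawl would need to loop over many pages.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def summarize_links(html, page_url):
    """Return (external domain counts, UTM parameter counts) for one page."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlsplit(page_url).hostname
    domains, utm = Counter(), Counter()
    for href in collector.hrefs:
        parts = urlsplit(urljoin(page_url, href))  # resolve relative links
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue  # ignore mailto:, javascript:, and similar
        if parts.hostname != own_host:
            domains[parts.hostname] += 1
        for key, value in parse_qsl(parts.query):
            if key.lower().startswith("utm_"):
                utm[(key.lower(), value)] += 1
    return domains, utm


# Hypothetical sample markup, for illustration only.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="https://shop.example.net/item?utm_source=newsletter&amp;utm_campaign=spring">Shop</a>
<a href="https://shop.example.net/cart">Cart</a>
<a href="https://partner.example.org/?utm_source=newsletter">Partner</a>
<a href="mailto:tips@example.com">Email</a>
"""

domains, utm = summarize_links(SAMPLE_HTML, "https://yourwebsite.com/")
print(domains.most_common())
print(utm.most_common())
```

Repeated UTM values (the same utm_source or utm_campaign on several sites) can hint at a shared marketing setup, which is worth noting in your data diary.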

Photos, images, and documents
  •  Are there author photos?
    • Use reverse image search to see if the same images appear elsewhere
    • Check sensity.ai to see if the image is GAN-generated
    • Read more about spotting GAN-generated images here.
  •  Do the images have EXIF data?
    • Instructions here.
  •  Do the images have any other identifying information?
    • Run through the list here
  •  Where are the images hosted?
    • If on AWS S3, the bucket name can be revealing, or you might find the bucket isn’t secure.
  •  Are there PDFs hosted on the site?
    • On a search engine, “filetype:pdf site:<yourwebsite.com>”
    • If you find some, check the metadata with “Get Info” in your PDF viewer.
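
To illustrate the hosting check, here is a small sketch that pulls the bucket name out of an image URL when it points at Amazon S3. It covers the common virtual-hosted and path-style URL forms only; the function name and the sample URLs are invented for the example, and buckets served through a custom domain or CDN will not be detected this way.

```python
import re
from urllib.parse import urlsplit

# Virtual-hosted style: <bucket>.s3.amazonaws.com, <bucket>.s3.<region>.amazonaws.com,
# and the older <bucket>.s3-<region>.amazonaws.com form.
VIRTUAL_HOSTED = re.compile(r"^(?P<bucket>.+)\.s3[.-](?:[a-z0-9-]+\.)?amazonaws\.com$")

# Path style: s3.amazonaws.com/<bucket>/... or s3.<region>.amazonaws.com/<bucket>/...
PATH_STYLE = re.compile(r"^s3[.-](?:[a-z0-9-]+\.)?amazonaws\.com$")


def s3_bucket(url):
    """Return the bucket name if the URL points at Amazon S3, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if PATH_STYLE.match(host):
        first_segment = parts.path.lstrip("/").split("/", 1)[0]
        return first_segment or None
    match = VIRTUAL_HOSTED.match(host)
    return match.group("bucket") if match else None


# Hypothetical image URLs, for illustration only.
SAMPLE_URLS = [
    "https://media-example.s3.amazonaws.com/img/a.jpg",
    "https://s3.us-east-2.amazonaws.com/newsroom-assets/logo.png",
    "https://yourwebsite.com/wp-content/uploads/a.jpg",
]

for url in SAMPLE_URLS:
    print(url, "->", s3_bucket(url))
```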

 

 

SOCIAL MEDIA

If there are any social media profiles mentioned on the site, they are worth investigating.

  •  Are there any social media accounts in the <meta> section of the HTML?
  •  When were the individual accounts created? Does it line up with the site history?
  •  What platform has the biggest reach?
  •  Is the messaging different across platforms?
  •  Do they have completely distinct account names across social media platforms or are they more or less the same?
    • Note: just because you find the same account name across platforms doesn’t necessarily mean they belong to the same person!
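
As a rough illustration of the <meta> check, the sketch below scans a page's HTML for meta tags that often carry social handles or page IDs. The list of keys is an assumption chosen for the example and is not exhaustive, and the class name, function name, and sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Meta keys that commonly carry social handles, profile URLs, or page IDs.
# This list is an assumption for illustration and is not exhaustive.
SOCIAL_KEYS = {
    "twitter:site",
    "twitter:creator",
    "article:publisher",
    "article:author",
    "fb:app_id",
    "fb:pages",
    "fb:admins",
}


class SocialMetaParser(HTMLParser):
    """Collect the content of social-related <meta> tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Twitter tags use name="...", Open Graph tags use property="...".
        key = (attrs.get("name") or attrs.get("property") or "").lower()
        content = attrs.get("content")
        if key in SOCIAL_KEYS and content:
            self.found.setdefault(key, []).append(content)


def social_meta(html):
    """Return a dict mapping each social meta key to its content values."""
    parser = SocialMetaParser()
    parser.feed(html)
    return parser.found


# Hypothetical sample markup, for illustration only.
SAMPLE_HTML = """
<head>
  <meta name="twitter:site" content="@examplenews">
  <meta property="article:publisher" content="https://www.facebook.com/examplenews" />
  <meta property="og:title" content="Example headline">
</head>
"""

print(social_meta(SAMPLE_HTML))
```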

Facebook

On the Facebook profile, go to Page Transparency:

  •  ☎️ Is there an address and phone number for the page?
  •  ⏪ Does the page history reveal a different name?
    • Has the page shifted topics?
  •  When was the Facebook page created?
  •  Is the page running any groups?
  •  Has the page run any ads? Has the page run political ads?
  •  Does Facebook flag any “related pages” for the given page? Rely on Facebook’s algorithms to find connections!

Twitter

On Twitter, the account might be part of a pod or network that boosts it. Using en.whotwi.com, it’s worth checking:

  •  Who is the account engaging with?
  •  What are the account’s tweeting patterns?
  •  #️⃣ What hashtags are associated with the account?
  •  Who were the account’s first follows / followers?
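
If you have already collected an account's post timestamps (for example, copied by hand into your data diary), one simple way to look at tweeting patterns is to count posts per hour of the day. This is only a sketch under that assumption: the function name and sample timestamps are made up, and it does not fetch anything from Twitter.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_hour(timestamps):
    """Count posts per hour of day (UTC) from ISO 8601 timestamp strings."""
    hours = Counter()
    for stamp in timestamps:
        moment = datetime.fromisoformat(stamp)
        if moment.tzinfo is None:
            # Assumption: timestamps without an offset are already in UTC.
            moment = moment.replace(tzinfo=timezone.utc)
        hours[moment.astimezone(timezone.utc).hour] += 1
    return hours


# Hypothetical timestamps, for illustration only.
SAMPLE_TIMESTAMPS = [
    "2022-03-01T08:15:00+00:00",
    "2022-03-01T09:40:00+00:00",
    "2022-03-01T14:05:00+00:00",
    "2022-03-02T08:55:00+00:00",
    "2022-03-02T16:30:00+02:00",
]

hours = posts_per_hour(SAMPLE_TIMESTAMPS)
for hour in range(24):
    print(f"{hour:02d}:00 UTC {'#' * hours[hour]}")
```

A long daily gap in activity can suggest when the operator sleeps, but scheduled posts and multiple operators can blur the picture, so treat it as a lead rather than proof.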

Other platforms

Don’t forget to check whether the site has accounts on YouTube, Instagram, Reddit, GitHub…

 

 

INFRASTRUCTURE

  •  Have you archived the website? (You always should!)

    • You can do this on archive.org or with their browser extension.
    • You can grab the whole website in Terminal with wget: wget -mpEk <yourwebsite.com>
  •  What is the website built with?

    • Is it using WordPress, Squarespace, something else?
  •  ☁️ Where is it hosted?

    • Is it on Google Cloud, AWS, Cloudflare, something else?
  •  Are there any trackers present?

  •  How is the site monetized?

    • Are there any affiliate links (Amazon, etc.)?
  •  What are the various tracking identifiers, and are those shared with other domains?

    • Check Google Analytics, Facebook Pixel, Quantcast, New Relic, etc.
    • Use tools like builtwith, RiskIQ, or Dnslytics to see if other domains share the same ID.
  •  Are there any relevant subdomains?

  •  Are there historic WHOIS records?

  •  ⌛️ Has the site changed over time?

    • Look at archive.org to see whether the site changed dramatically and, if so, when.
  •  Did the earlier version of the site have more information?

    • Owners sometimes remove identifying information after a site has been up for a while.
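
The tracking-identifier check can also be done by hand on saved page source. The sketch below is illustrative only: the regular expressions cover a few common ID formats and will miss variants, and the function names, domains, and IDs are invented. It keeps only the account part of Universal Analytics IDs (UA-<account>-<property>) so that different properties under one account compare as equal.

```python
import re

# Patterns for a few common tracking identifiers. These are illustrative
# and will not catch every variant a site might use.
ID_PATTERNS = {
    # Keep only the account part so UA-123-1 and UA-123-2 compare equal.
    "google_analytics_ua": re.compile(r"\b(UA-\d{4,10})-\d{1,4}\b"),
    "google_analytics_ga4": re.compile(r"\bG-[A-Z0-9]{8,12}\b"),
    "google_tag_manager": re.compile(r"\bGTM-[A-Z0-9]{4,8}\b"),
    "google_adsense": re.compile(r"\bca-pub-\d{10,20}\b"),
    "facebook_pixel": re.compile(r"fbq\(\s*['\"]init['\"]\s*,\s*['\"](\d{5,20})['\"]"),
}


def tracking_ids(html):
    """Return a dict mapping tracker type to the sorted IDs found in the HTML."""
    found = {}
    for name, pattern in ID_PATTERNS.items():
        ids = sorted(set(pattern.findall(html)))
        if ids:
            found[name] = ids
    return found


def shared_ids(pages):
    """Map each identifier to the domains using it, keeping only shared ones."""
    owners = {}
    for domain, html in pages.items():
        for ids in tracking_ids(html).values():
            for tracker_id in ids:
                owners.setdefault(tracker_id, set()).add(domain)
    return {tid: doms for tid, doms in owners.items() if len(doms) > 1}


# Hypothetical saved page sources, for illustration only.
PAGES = {
    "yourwebsite.com": "<script>ga('create', 'UA-12345678-1', 'auto'); fbq('init', '1234567890123456');</script>",
    "othersite.example": "<script>ga('create', 'UA-12345678-2', 'auto');</script>",
    "third.example": "<script>fbq('init', '1234567890123456');</script>",
}

for tracker_id, found_on in shared_ids(PAGES).items():
    print(tracker_id, sorted(found_on))
```

Running the same function over archived copies of a site can surface identifiers that have since been removed from the live pages.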

 

 

RESOURCES & TOOLS


Books

Open Source Intelligence Techniques – Michael Bazzell https://inteltechniques.com/book1.html

Verification Handbook – edited by Craig Silverman https://datajournalism.com/read/handbook/verification-3

Website Infrastructure
  • Blacklight: The Markup’s real-time website privacy inspector.
  • builtwith.com: gives you the infrastructure of the site, including IP addresses, analytics codes, tech stack, etc. Freemium model.
  • DNSDBScout: allows you to search and “flexible search” for passive DNS lookups including IP <-> domain mapping.
  • Dnslytics: offers a range of tools including reverse Analytics and reverse DNS lookups, as well as WHOIS data. Freemium.
  • RiskIQ: a “threat intelligence” tool that allows you to get reverse IP, reverse analytics, WHOIS, SSL, subdomains, etc.
  • Whoxy: a tool that lets you see historical WHOIS registrations. Free.
  • The Internet Archive browser extension.

Social Media Accounts
  • Sensity AI: check if an image is GAN-generated or not. Freemium.
  • whotwi.com: create a profile-at-a-glance for any account on Twitter. Free.

 
