Building a simple TikTok scraping API 🤖

August 14, 2022

Note: this no longer works as TikTok now blocks cross-origin media requests.

As part of a project centered around TikTok (coming soon!) I needed a way to get a video's media. The closest thing that TikTok officially provides is their oEmbed API, but unfortunately this only includes basic information like title, thumbnail_url, etc. The unofficial APIs I found either cost money or require Python, neither of which I wanted. So I needed another way to retrieve this info.

Leveraging SSR patterns 📦

Since TikTok handles large volumes of traffic, I figured they had some sort of SSR/client-side hydration going on. This generally involves shipping data to the client in an object inlined inside a <script> tag, which reduces the number of client-side requests by populating components directly with this data. In Next.js, for example, this happens via server-side props.
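To make the pattern concrete, here is a minimal sketch of the round trip: a server serializes some state into a script tag, and the client parses it back out before rendering. The tag id `APP_STATE`, the `PageState` shape, and both helper names are made up for illustration; real frameworks differ in the details.

```typescript
// Hypothetical shape of the state a server might inline for one page.
interface PageState {
  videoId: string;
  title: string;
}

// Server side: serialize the state into a JSON <script> tag.
// Escaping "<" keeps a "</script>" inside the data from closing the tag early.
function renderStateTag(state: PageState): string {
  const json = JSON.stringify(state).replace(/</g, "\\u003c");
  return `<script id="APP_STATE" type="application/json">${json}</script>`;
}

// Client side (simulated on a string here): pull the JSON back out and parse it.
// In a browser this would be document.getElementById("APP_STATE").textContent.
function readStateTag(html: string): PageState | null {
  const match = html.match(/<script id="APP_STATE"[^>]*>([\s\S]*?)<\/script>/);
  const json = match?.[1];
  return json ? (JSON.parse(json) as PageState) : null;
}

const html = renderStateTag({ videoId: "123", title: "a </script> in a title" });
const state = readStateTag(html);
console.log(state?.title);
```

The point is only that the data arrives with the markup, so anything that can download the HTML can read the same object the page's components use.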

And lo and behold! There's a handy blob provided alongside the page markup for every video. You can confirm this for yourself by visiting any TikTok video's page (https://www.tiktok.com/:user/video/:id) and inspecting the contents of the script tag with the id SIGI_STATE in the console:

JSON.parse(document.getElementById("SIGI_STATE").innerText);

Knowing this, I built a simple handler using Deno to fetch this blob and return it directly.

Enter cheerio & deno 🥣🦕

If the video URL is attached to the request (via POST body or params), a simple handler can fetch (scrape) the content for the page:

const response = await fetch(`https://tiktok.com/${video}`);
const html = await response.text();
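As a rough sketch of the "via params" case, the helper below reads a `video` query parameter from the incoming request URL and builds the page URL to fetch. The parameter name `video`, the helper name `targetUrlFromRequest`, and the strict path pattern are all assumptions for illustration; the pattern follows the `:user/video/:id` shape of a video page URL and assumes a numeric id.

```typescript
// Assumed shape of the path: "<user>/video/<numeric id>", e.g. "@someuser/video/123".
const VIDEO_PATH = /^@?[\w.-]+\/video\/\d+$/;

// Reads the hypothetical `video` query parameter from a request URL and returns
// the page URL to fetch, or null if the parameter is missing or malformed.
function targetUrlFromRequest(requestUrl: string): string | null {
  const video = new URL(requestUrl).searchParams.get("video");
  if (video === null || !VIDEO_PATH.test(video)) return null;
  return `https://tiktok.com/${video}`;
}

console.log(targetUrlFromRequest("https://example.com/?video=@someuser/video/123"));
console.log(targetUrlFromRequest("https://example.com/?video=../login"));
```

Inside a handler this could be called with `request.url`, responding with a 400 when it returns null, so arbitrary input never gets interpolated into the fetch URL.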

Then, using cheerio, this raw text can be turned into a queryable jQuery-like API. This provides an entrypoint to get the content of the server-generated blob within the script tag:

const $ = cheerio.load(html);
const appContext = $("#SIGI_STATE").text();

From there, it's only a matter of parsing this as JSON and returning the pertinent data (ItemModule) in the response. The data is currently keyed by a unique identifier, so Object.keys is used to grab the first entry without needing to know that key:

const json = JSON.parse(appContext);
const key = Object.keys(json.ItemModule)[0];
const data = json.ItemModule[key];

return new Response(JSON.stringify({ ...data }), {
  headers: { 'Content-Type': 'application/json' },
  status: 200,
});
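Since `JSON.parse` throws on an empty string and `ItemModule` could disappear if the page changes, a slightly more defensive version of the parsing step might look like the sketch below. The helper name `extractItem` is hypothetical and not part of the original handler.

```typescript
// Returns the first entry of ItemModule, or null when the blob is empty,
// is not valid JSON, or does not have the expected shape.
function extractItem(appContext: string): unknown {
  let json: unknown;
  try {
    json = JSON.parse(appContext);
  } catch {
    return null;
  }
  if (typeof json !== "object" || json === null) return null;
  const items = (json as Record<string, unknown>).ItemModule;
  if (typeof items !== "object" || items === null) return null;
  const values = Object.values(items);
  return values.length > 0 ? values[0] : null;
}

console.log(extractItem('{"ItemModule":{"123":{"id":"123"}}}'));
console.log(extractItem(""));
```

A handler could then return an error status when this yields null instead of letting the exception surface as an opaque failure.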

And that's it! Now I can make a client-side request to get up-to-date media for a video and display it directly. This also gets around TikTok's CDN expiration, which would otherwise prevent saving a reference to this media.

Side-note: comparing this to the output for one of the unofficial API options I came across, the data looks identical. So it's likely they're doing the same thing (but charging almost 00/mo for unlimited usage)!


Anyways, this is fragile and could break if they decide to rename the script tag, rate-limit, etc. But this demonstrates how easy it is to spin up an instance for your own project. And maybe with enough bot traffic TikTok will create a public API that accomplishes this instead of making us scrape for it! 😇
