Building APOD color search part I: Image analysis in Rust

February 23, 2023

Kicking off the first of a series about how I built APOD color search! For an introduction to the project (and some background on Astronomy Picture of the Day) go here:

At a high level, a search like this can't be done on the fly: it requires a prebuilt index of image and color information to return results. So the first step is to devise a way to extract the relevant color information from each image.
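For a concrete picture of what that index holds, here's a minimal sketch of one record: a color, the image it appears in, and how strongly it appears there. The field and function names are illustrative, not the project's actual Postgres schema.

```rust
// Hypothetical shape of one record in the static index (the real schema
// lives in Postgres and differs): a dominant color, the image it belongs
// to, and how many pixels matched it.
#[derive(Debug, PartialEq)]
pub struct IndexEntry {
    pub apod_date: String, // which image, e.g. "2023-02-23"
    pub hex: String,       // a dominant color, e.g. "#1a2b3c"
    pub frequency: usize,  // pixels assigned to this color's cluster
}

// A color search scans entries and ranks matches by frequency.
pub fn search<'a>(entries: &'a [IndexEntry], hex: &str) -> Vec<&'a IndexEntry> {
    let mut hits: Vec<&IndexEntry> =
        entries.iter().filter(|e| e.hex == hex).collect();
    hits.sort_by(|a, b| b.frequency.cmp(&a.frequency));
    hits
}

fn main() {
    let entries = vec![
        IndexEntry { apod_date: "1995-06-16".into(), hex: "#1a2b3c".into(), frequency: 40 },
        IndexEntry { apod_date: "2023-02-23".into(), hex: "#1a2b3c".into(), frequency: 90 },
    ];
    let hits = search(&entries, "#1a2b3c");
    // The image with more matching pixels ranks first.
    assert_eq!(hits[0].apod_date, "2023-02-23");
    println!("{} matching images", hits.len());
}
```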

Gathering the data 🚜

On to populating the dataset. Image analysis is a low-level, pixel-by-pixel operation, and with nearly three decades' worth of images to process (APOD has run since June 1995), it needed to be fast.

The full source code can be found on GitHub, but I'll go through some of the essential bits here.

Processing data for an APOD 🌌

To populate the searchable color-based index of each APOD, three things must be done:

  1. Fetch APOD information via apod-api.
  2. Extract color information from image.
  3. Store color metadata in database.
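Stubbed out, those three steps chain into a single per-day pipeline. The helper names below are illustrative, not the project's actual functions; the real versions hit the apod-api, decode the image with the image crate, and write to Postgres.

```rust
// Illustrative stubs for the three steps of the pipeline.
fn fetch_day(date: &str) -> String {
    format!("APOD metadata for {}", date) // 1. apod-api request
}

fn extract_colors(_day: &str) -> Vec<String> {
    vec!["#1a2b3c".to_string()] // 2. pretend analysis found one dominant color
}

fn store_colors(colors: &[String]) -> usize {
    colors.len() // 3. pretend we wrote this many rows to the database
}

// One day's APOD, start to finish.
fn process_apod(date: &str) -> usize {
    let day = fetch_day(date);
    let colors = extract_colors(&day);
    store_colors(&colors)
}

fn main() {
    assert_eq!(process_apod("1995-06-16"), 1);
    println!("pipeline ok");
}
```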

One of my goals for this project was to use cloud-first (and free) resources whenever possible to save headaches later on with deployments & environments. For the database above I created a Postgres instance using supabase's free tier.

Getting a day's APOD information

Fetching this data is easy enough using reqwest and the apod-api (just need an API key):

let api_url = "https://api.nasa.gov/planetary/apod";
let api_key = std::env::var("APOD_API_KEY").unwrap();
let request_url = format!(
  "{}?api_key={}&start_date={}&end_date={}",
  api_url, api_key, start_date, end_date
);

let resp = reqwest::get(request_url).await?;
let body = resp.text().await?;

To do more with this data however, Rust requires that it be properly typed. serde streamlines this: given a matching type definition, it handles the JSON deserialization for you. Here's the type I added to correspond to the API response:

use serde::{Deserialize, Serialize};

#[derive(Debug, Deserialize, Serialize)]
pub struct Day {
  id: Option<u32>,
  copyright: Option<String>,
  date: String,
  explanation: Option<String>,
  hdurl: Option<String>,
  media_type: String,
  service_version: Option<String>,
  title: String,
  url: Option<String>,
}

Then calling serde_json::from_str will deserialize it to the typed data structure:

let days: Vec<Day> = serde_json::from_str(&body).unwrap();

Lastly, once we have a Day object to work with, we need to fetch the actual image bytes to do pixel-based analysis:

let img_bytes = reqwest::get(&image_url).await?.bytes().await?;
let img = image::load_from_memory(&img_bytes)?;

Processing the colors 🔬

Now all that's left is some low-level pixel processing. This isn't the most efficient algorithm; I'm still a novice Rustacean, so it's the best I could do. 😇

Because these images tend to be massive, the most important pieces are the ones that discard noisy data to avoid unnecessary computation: analyzing only the significant pixels, then counting, ranking, and grouping them.

Many images have minimal color information, being either grayscale or mostly black due to the vast emptiness of space. Since "color" is in the name, the project is meant for finding colorful pictures, not specific shades of black or gray. To avoid wasting computation on these, I filtered out the non-colored pixels:

let gray_pixels: HashSet<Rgba<u8>> =
  img.grayscale().pixels().map(|p| p.2).collect();
let all_pixels: Vec<Rgba<u8>> = img.pixels().map(|p| p.2).collect();

let colored_pixels: Vec<Rgba<u8>> = all_pixels
  .into_iter()
  .filter(|p| !gray_pixels.contains(p))
  .collect();

Then, using a relative luminance function, I kept only the most luminous pixels:

let luminous_pixels: Vec<Rgba<u8>> = colored_pixels
  .into_iter()
  .filter(|p| get_luminance(*p) > 50.)
  .collect();

Now we're left with a cleaner dataset to work on.
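The snippet above leans on get_luminance, which isn't shown in the post. Here's a minimal sketch assuming the standard Rec. 709 luma weights (the project's actual coefficients may differ), written against a plain [u8; 4] RGBA array rather than image::Rgba<u8> to keep it self-contained:

```rust
// Hedged sketch of a relative-luminance function using Rec. 709 luma
// weights; takes a plain [u8; 4] RGBA array for illustration, whereas
// the project's version operates on image::Rgba<u8>.
pub fn get_luminance(p: [u8; 4]) -> f32 {
    0.2126 * p[0] as f32 + 0.7152 * p[1] as f32 + 0.0722 * p[2] as f32
}

fn main() {
    // Pure black carries no luminance; pure white is maximal (~255).
    assert_eq!(get_luminance([0, 0, 0, 255]), 0.0);
    assert!((get_luminance([255, 255, 255, 255]) - 255.0).abs() < 1e-3);
    // Green contributes far more perceived brightness than blue.
    assert!(get_luminance([0, 255, 0, 255]) > get_luminance([0, 0, 255, 255]));
    println!("luminance checks passed");
}
```

With weights like these, the `> 50.` threshold in the filter above keeps pixels with roughly 20% or more of maximum luminance.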

Generate frequency array

To get the most frequent colors of the image, the primary goal of this analysis, a frequency hash can be used: put simply, a map from color values to how many times they occur. For simpler map keys, each pixel is converted from RGBA to a String hex value:

use colorsys::Rgb;

pub fn generate_hex(pixel: Rgba<u8>) -> String {
  Rgb::from((
    pixel[0] as f32, 
    pixel[1] as f32, 
    pixel[2] as f32)
  ).to_hex_string()
}

Vec::from_iter(input)
  .into_iter()
  .map(generate_hex)
  .collect::<Vec<String>>()

We can then generate a BTreeMap<String, usize> (hex value → occurrence count) by iterating over this list and incrementing the count whenever the same hex value is found.
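As a sketch, the single-threaded version of that map is just an entry-and-increment loop; the parallel version the project actually uses follows below.

```rust
use std::collections::BTreeMap;

// Single-threaded sketch of the frequency map: walk the hex strings and
// bump a counter per distinct value.
pub fn frequency(input: &[String]) -> BTreeMap<String, usize> {
    let mut map = BTreeMap::new();
    for hex in input {
        *map.entry(hex.clone()).or_insert(0usize) += 1;
    }
    map
}

fn main() {
    let pixels: Vec<String> = ["#ffffff", "#000000", "#ffffff"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let freq = frequency(&pixels);
    assert_eq!(freq["#ffffff"], 2);
    assert_eq!(freq["#000000"], 1);
    println!("{:?}", freq);
}
```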

To optimize for performance, the function splits the input into chunks and spawns multiple threads to process them in parallel, joining them once complete. I experimented with different values for worker_count and landed on 5:

use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};
use std::thread;

pub fn get_frequency(
  input: Vec<String>, 
  worker_count: usize
) -> BTreeMap<String, usize> {
  let result = Arc::new(Mutex::new(BTreeMap::<String, usize>::new()));

  let handles: Vec<_> = input
    .chunks((input.len() as f64 / worker_count as f64).ceil() as usize)
    .map(|chunk| {
      let chunk = chunk.to_vec();
      let result = Arc::clone(&result);

      // Each worker tallies its chunk into the shared map.
      thread::spawn(move || {
        chunk.iter().for_each(|h| {
          result
            .lock()
            .unwrap()
            .entry(h.to_string())
            .and_modify(|e| *e += 1)
            .or_insert(1);
        })
      })
    })
    // Collect eagerly so every thread is spawned before any is joined;
    // joining inside the lazy map would run the workers one at a time.
    .collect();

  handles.into_iter().for_each(|handle| handle.join().unwrap());

  Arc::try_unwrap(result).unwrap().into_inner().unwrap()
}

Once the most frequent color values are found, similar ones can be grouped together. I refer to these as Clusters; if a color has R, G, & B values within a certain threshold of each other they are combined into the same Cluster.

The threshold algorithm is a simple series of conditions that only checks for the green value if the value for red is within the threshold, and so on:

pub fn within_threshold(
  a: &Rgba<u8>, 
  b: &Rgba<u8>, 
  color: usize, 
  threshold: i64
) -> bool {
  let color1 = a.0[color] as i64;
  let color2 = b.0[color] as i64;

  let mut min = 0;
  let mut max = 255;

  if color2 >= threshold {
    min = color2 - threshold;
  }

  if color2 <= (255 - threshold) {
    max = color2 + threshold
  }

  color1 >= min && color1 <= max
}

pub fn assign_clusters(
  input: Vec<(Rgba<u8>, usize)>, 
  threshold: i64
) -> HashMap<Rgba<u8>, usize> {
  let mut result = HashMap::<Rgba<u8>, usize>::new();

  for item in input {
    // Rgba<u8> is Copy, so copy the keys rather than consuming the map.
    let s_r: Vec<Rgba<u8>> = result
      .keys()
      .copied()
      .filter(|p| within_threshold(p, &item.0, 0, threshold))
      .collect();

    ...

The closest color value matches are then added to the Cluster. Once the clusters are finalized, the last step here is to only return the most popular ones. This is done with a simple sort_by call:

let mut sorted_result = Vec::from_iter(result);
sorted_result.sort_by(|(_, f_a), (_, f_b)| f_b.cmp(f_a));
let size = std::cmp::min(num_clusters, sorted_result.len());
sorted_result[0..size].to_vec()

And now we have the most significant clusters! This makes it possible to search for a color and map to images that contain many pixels with that color.

One month at a time 🗓️

Processing a single APOD is one thing, but the end goal is to process all of them. The cleanest way to group batches of days was by month. Since the apod-api supports start_date and end_date parameters, I just used the first and last days of the month.

Since I knew I'd be running this via the command line, I first checked that the provided arguments (year and month) corresponded to a valid date for an APOD. Mapping the raw numbers to chrono::Date objects takes a little parsing:

let args: Vec<String> = env::args().collect();
let first_apod = Utc.ymd(1995, 6, 16);
let today = Utc::today();

let numbers: Vec<u32> = args.iter().flat_map(|x| x.parse()).collect();
let day = Utc.ymd(numbers[0] as i32, numbers[1], 1);

if day < first_apod || day > today {
  Err(format!(
    "Out of range, date must be between {} and {}.",
    first_apod.format("%b %e, %Y"),
    today.format("%b %e, %Y")
  ))?;
}

Then, given a valid month, we can iterate over each day. I added a fetch_month function that generates the list of Days for a month, taking the first day as a chrono::Date built from the arguments above:

async fn fetch_month(
  first_day: chrono::Date<Utc>
) -> Result<Vec<api::Day>, Box<dyn Error>> {
  let first_day_formatted = first_day.format("%Y-%m-%d").to_string();
  let today = Utc::today();

  let mut last_day = (
    first_day + Duration::days(31)
  ).with_day(1).unwrap() - Duration::days(1);

  if last_day > today {
    last_day = today;
  }

  let last_day_formatted = last_day.format("%Y-%m-%d").to_string();

  let apods = api::get_days(&first_day_formatted, &last_day_formatted).await?;

  Ok(apods)
}

After getting the data for an entire month it's as simple as iterating over each day and processing it:

let apods = fetch_month(day).await?;

for apod in apods {
  process_apod(apod).await?;
}

I didn't include this in the code snippets, but things are saved to the database via PostgREST along the way, most importantly the Colors and Clusters used to perform searches. Feel free to have a look at the full source to see these.


Thanks for reading & stay tuned for the next part: using GitHub Actions as a free provider to run it in parallel remotely!
