Photo Studio — API & capabilities

01 What it does

Five capability groups behind a small HTTP API. Everything runs on CPU and is built for batch jobs on the scraper server. iPhone HEIC inputs are accepted everywhere.

✂️ Cutout

AI background removal via rembg. Five models from fast people-segmentation to BiRefNet portrait matting with clean hair edges. Returns a transparent PNG.

🪄 Compose

Drop one or more subjects onto a new background. Face/profile detection auto-decides side, scale and gaze — or place manually, or let a vision model decide.

🎛️ Edit

An ordered pipeline of raster ops: resize, crop, rotate, brightness/contrast/saturation/sharpness, blur, grayscale, sepia, autocontrast, vignette, borders, format convert.

🎬 Motion

Six reel-native effects rendered to MP4: Ken Burns, parallax, kinetic captions, karaoke subtitles, count-up data, whip/glitch transitions — for a stills + TTS news pipeline.

🗞️ Styles

Three editorial "news illustration" looks rendered from a photo + its cutout mask: torn-paper split-face, red sticker cutout, and neon money glow — ready for social posts.

02 Background removal

One call strips the background and returns a transparent PNG. BiRefNet keeps fine hair and edge detail clean.

Cutout (BiRefNet portrait), transparent background

# model: u2net_human_seg (default, fast) · birefnet-portrait (best edges)
curl -F file=@photo.jpg "https://foto-api.autonomousmedia.io/v1/cutout?model=birefnet-portrait" -o cutout.png

03 Smart compositing

Place a cut-out subject onto a new scene. In auto mode, face & profile detection picks the side, scale, and which way the subject should face — so the gaze leads into the frame and the scene's focal point stays visible.

Auto mode — zero config

Auto result: placed front-left, facing the path & water

curl -F background=@path.jpg -F subjects=@man.jpg \
     -F 'params={"mode":"auto","cutout_model":"birefnet-portrait"}' \
     https://foto-api.autonomousmedia.io/v1/compose -o out.jpg

Multiple subjects & manual control

Pass several subjects in one call. Manual mode gives per-subject control of scale, position, facing and draw order (depth).

Two subjects composited onto one scene (manual placement, draw order = depth)

curl -F background=@path.jpg -F subjects=@man.jpg -F subjects=@woman.jpg \
     -F 'params={"mode":"manual","cutout_model":"birefnet-portrait","items":[
        {"scale":0.86,"target_frac":0.26,"draw_order":1,"warm":1.01},
        {"scale":0.60,"target_frac":0.60,"base_frac":0.99,"draw_order":0}
     ]}' \
     https://foto-api.autonomousmedia.io/v1/compose -o combo.jpg

Three placement modes

Mode	How placement is decided	Cost
`auto`	Face + profile detection (OpenCV). Picks side, scale, gaze. Default.	Free and fast; runs locally with no per-image API cost
`manual`	You supply explicit per-subject params (`items`).	Free
`llm`	A vision model looks at the scene and decides placement. Opt-in.	Tokens per image — needs ANTHROPIC_API_KEY

Manual items fields: scale (0–1 of frame height) · target_frac (0–1, where the face sits horizontally) · base_frac (0–1, vertical position of the subject's feet/bottom — lower value = further back) · facing · flip · draw_order (lower = behind) · feather · warm (<1 cooler, >1 warmer) · color_match (0–0.4, pull toward scene colour).
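The ranges in the field list above are easy to get wrong by hand. The following is a minimal sketch, not part of the service, that checks the documented ranges before serialising the `params` form field. The helper name `manual_params` is hypothetical, and it assumes that `items` entries map to uploaded subjects in order.

```python
import json

# Ranges copied from the field list above; the helper itself is illustrative.
_RANGES = {
    "scale": (0.0, 1.0),
    "target_frac": (0.0, 1.0),
    "base_frac": (0.0, 1.0),
    "color_match": (0.0, 0.4),
}


def manual_params(items, cutout_model="birefnet-portrait"):
    """Return the JSON string for the `params` form field in manual mode.

    `items` is a list of dicts, one per subject (assumed: in upload order).
    Raises ValueError when a ranged field is outside its documented bounds.
    """
    for index, item in enumerate(items):
        for field, (low, high) in _RANGES.items():
            if field in item and not low <= item[field] <= high:
                raise ValueError(
                    f"items[{index}].{field}={item[field]} is outside {low}..{high}"
                )
    return json.dumps(
        {"mode": "manual", "cutout_model": cutout_model, "items": items}
    )


# The two-subject example from above, rebuilt through the helper.
params_json = manual_params([
    {"scale": 0.86, "target_frac": 0.26, "draw_order": 1, "warm": 1.01},
    {"scale": 0.60, "target_frac": 0.60, "base_frac": 0.99, "draw_order": 0},
])
```

Pass `params_json` as the `params` form field of `/v1/compose`, exactly as the curl example does.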

04 General edits

Send an ordered list of operations; they apply in sequence. Good for thumbnails, normalisation, watermark-free filters, and format conversion.

curl -F file=@photo.jpg \
     -F 'ops=[{"op":"resize","max":1600},{"op":"autocontrast"},{"op":"saturation","factor":1.3},{"op":"vignette","strength":0.45}]' \
     -F output_format=jpeg -F quality=90 \
     https://foto-api.autonomousmedia.io/v1/edit -o edited.jpg

Available operations

op	params
`resize`	`w` / `h` / `max` (bounds long edge, keeps aspect)
`crop`	`box=[l,t,r,b]` or `aspect="16:9"` (centre crop)
`rotate`	`deg`
`flip`	`axis="h"|"v"`
`brightness` · `contrast` · `saturation` · `sharpness`	`factor` (1.0 = unchanged)
`blur`	`radius`
`grayscale` · `sepia` · `invert` · `equalize`	—
`autocontrast`	`cutoff` (percent clipped, default 1)
`posterize`	`bits` (1–8)
`vignette`	`strength` (0–1)
`border`	`size`, `color="#rrggbb"`
`convert`	`format="jpeg"|"png"|"webp"`
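Because the pipeline is order-sensitive, it can help to build the `ops` array in code rather than by hand. This is a small sketch under the assumption that the table above is the complete op list; `build_ops` is a hypothetical helper, and it checks op names only, not their params.

```python
import json

# Op names taken from the table above.
KNOWN_OPS = {
    "resize", "crop", "rotate", "flip", "brightness", "contrast",
    "saturation", "sharpness", "blur", "grayscale", "sepia", "invert",
    "equalize", "autocontrast", "posterize", "vignette", "border", "convert",
}


def build_ops(*steps):
    """Serialise (name, params) pairs, in order, into the `ops` JSON array."""
    ops = []
    for name, params in steps:
        if name not in KNOWN_OPS:
            raise ValueError(f"unknown op: {name!r}")
        ops.append({"op": name, **params})
    return json.dumps(ops)


# Same pipeline as the curl example: resize, autocontrast, saturation, vignette.
ops_json = build_ops(
    ("resize", {"max": 1600}),
    ("autocontrast", {}),
    ("saturation", {"factor": 1.3}),
    ("vignette", {"strength": 0.45}),
)
```

A typo such as `"sharpen"` then fails locally instead of costing a round trip to the server.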

05 Motion / reels

Six reel-native effects rendered server-side to 9:16 MP4 — built for a stills + TTS news pipeline (no After Effects, no manual editing). Each is a pure function of time → frame; the same model maps 1:1 to a Remotion useCurrentFrame() setup if you move rendering to Node later. The clips below are produced by the service.

Ken Burns: photo → motion · slow zoom + drift on a still

Parallax: reuses the cutout · subject moves faster than the background

Kinetic captions: highest ROI · word-pop with spring overshoot

Karaoke subtitles: TTS word-sync highlight

Count-up data: number ticks up + bars grow in

Whip / glitch: fast slide + blur spike + RGB split
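To make the "pure function of time → frame" idea concrete, here is a sketch of a Ken Burns crop window computed from the time alone. It is an illustration, not the service's renderer: the smoothstep easing and the numeric `pan` vector are assumptions (the API takes `pan` as a string such as "up-left"), and only the default `zoom` of 1.16 comes from the params table.

```python
def ease_in_out(u):
    """Smoothstep easing: maps 0 to 0 and 1 to 1 with zero slope at both ends."""
    return u * u * (3.0 - 2.0 * u)


def ken_burns_crop(t, duration, width, height, zoom=1.16, pan=(-1.0, -1.0)):
    """Return the (left, top, right, bottom) crop box at time `t` seconds.

    The box starts as the full frame and shrinks to 1/zoom of it, while
    drifting toward `pan` (x and y in -1..1; (-1, -1) means up-left).
    """
    u = ease_in_out(min(max(t / duration, 0.0), 1.0))
    scale = 1.0 + (zoom - 1.0) * u  # current zoom factor
    crop_w, crop_h = width / scale, height / scale
    # Free space around the crop window; the drift never uses more than this.
    slack_x, slack_y = width - crop_w, height - crop_h
    left = slack_x / 2.0 * (1.0 + pan[0] * u)
    top = slack_y / 2.0 * (1.0 + pan[1] * u)
    return (left, top, left + crop_w, top + crop_h)


# One crop box per frame of a 4 s clip at 30 fps on a 1080x1920 canvas.
boxes = [ken_burns_crop(frame / 30, 4.0, 1080, 1920) for frame in range(120)]
```

Because the box depends only on `t`, any frame can be rendered independently, which is the same property a frame-indexed renderer relies on.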

Two effects depend on word-level timing. Kinetic captions and karaoke land best when driven off per-word timestamps — pass words:[{"w":"Räntan","t":0.0},…] from your ElevenLabs with_timestamps response (or WhisperX). Without timings they fall back to an even beat.
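TTS alignment data is often per character rather than per word, so a small conversion step may be needed before filling `words`. The sketch below assumes the alignment arrives as two parallel lists, characters and their start times in seconds (the ElevenLabs response is assumed to expose these as `characters` and `character_start_times_seconds`; check your own payload). `words_from_alignment` is a hypothetical helper.

```python
def words_from_alignment(characters, start_times):
    """Collapse per-character start times into [{"w": word, "t": start}, ...].

    A word is a run of non-whitespace characters; its time is the start
    time of its first character, rounded to milliseconds.
    """
    words, current, current_start = [], [], None
    for char, start in zip(characters, start_times):
        if char.isspace():
            if current:
                words.append({"w": "".join(current), "t": round(current_start, 3)})
                current, current_start = [], None
        else:
            if not current:
                current_start = start
            current.append(char)
    if current:  # flush the final word
        words.append({"w": "".join(current), "t": round(current_start, 3)})
    return words


# Illustrative input only; real start times come from your TTS response.
demo = words_from_alignment(list("Hi you"), [0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
```

The result can be placed under `words` in `params`. Note that this groups strictly by whitespace, so a multi-word caption chunk such as "med 0,25" would need to be merged afterwards.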

POST /v1/motion/{effect}

field	type	notes
`effect`	path	`ken_burns` · `parallax` · `captions` · `karaoke` · `countup` · `transition`
`image`	file	required for `ken_burns`/`parallax`; optional background for the others
`image2`	file	second scene for `transition` (optional)
`params`	form (JSON)	common: `w`, `h`, `fps`, `duration` · plus per-effect (below)

→ video/mp4 (H.264, 9:16 by default).

Per-effect params

effect	params
`ken_burns`	`zoom` (1.16) · `pan` ("up-left"…) · `kicker` · `headline`
`parallax`	`cutout_model` · `kicker` · `headline`
`captions`	`words` (list) · `highlight` (index) · `eyebrow` · `beat` (s)
`karaoke`	`words` (list or `[{w,t}]`) · `gap` (s) · `lead` (s)
`countup`	`title` · `corner` · `value` · `delta` · `bars` (list) · `labels` (list)
`transition`	`scene_a`/`scene_b` `{kicker,headline,tint}` — or pass `image`+`image2`

# Ken Burns on a still, with a lower-third headline
curl -F image=@photo.jpg \
     -F 'params={"duration":4,"zoom":1.18,"kicker":"Stockholm · 06 juni","headline":"Stadshuset i kvällsljus"}' \
     https://foto-api.autonomousmedia.io/v1/motion/ken_burns -o kenburns.mp4

# Kinetic captions from a plain word list (no timings, so words fall on an even beat)
curl -F 'params={"words":["Räntan","sänks","med 0,25","i juni"],"highlight":2}' \
     https://foto-api.autonomousmedia.io/v1/motion/captions -o captions.mp4

# Count-up data card from your pipeline JSON
curl -F 'params={"title":"OMXS30 · stängning","value":2487.6,"delta":1.4,
        "bars":[-0.8,1.2,0.9,-0.4,1.4],"labels":["VOLVO","EVO","SBB","SINCH","ATCO"]}' \
     https://foto-api.autonomousmedia.io/v1/motion/countup -o data.mp4
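The curl calls above can be mirrored in Python. This is a sketch rather than an official client: `motion_request` and `render` are hypothetical names, the image requirement is taken from the field table, and `render` relies on the third-party `requests` package.

```python
import json

BASE_URL = "https://foto-api.autonomousmedia.io"
EFFECTS = {"ken_burns", "parallax", "captions", "karaoke", "countup", "transition"}
NEEDS_IMAGE = {"ken_burns", "parallax"}  # per the field table above


def motion_request(effect, params, image_path=None):
    """Return (url, form_fields) for POST /v1/motion/{effect} without sending it."""
    if effect not in EFFECTS:
        raise ValueError(f"unknown effect: {effect!r}")
    if effect in NEEDS_IMAGE and image_path is None:
        raise ValueError(f"{effect} requires an image")
    url = f"{BASE_URL}/v1/motion/{effect}"
    # ensure_ascii=False keeps text such as "Räntan" readable in the payload.
    return url, {"params": json.dumps(params, ensure_ascii=False)}


def render(effect, params, image_path=None, out_path="out.mp4"):
    """Send the request with `requests` and save the returned MP4."""
    import requests  # imported lazily so the builder above stays stdlib-only

    url, form = motion_request(effect, params, image_path)
    if image_path is None:
        # (None, value) makes requests send a plain multipart field, like curl -F.
        response = requests.post(
            url, files={"params": (None, form["params"])}, timeout=300
        )
    else:
        with open(image_path, "rb") as image:
            response = requests.post(
                url, data=form, files={"image": image}, timeout=300
            )
    response.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(response.content)
    return out_path
```

For example, `render("ken_burns", {"duration": 4, "zoom": 1.18}, image_path="photo.jpg", out_path="kenburns.mp4")` corresponds to the first curl call. The 300-second timeout is an arbitrary choice for CPU rendering, not a documented limit.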

06 Editorial styles

Three "news illustration" looks applied automatically to an ordinary photo. The subject is cut out (rembg) and the art — duotone colour ramps, torn-paper seams, sticker strokes, neon edge glow, halftone, scanlines — is composited around the mask. One photo, three social-ready looks:

**split**: torn-paper warm/cool split-face on rust · 1080×1080

**red**: white-outline sticker, red duotone, splatter bg · 1080×1350

**money**: teal duotone, neon-green glow, $-tile, scanlines · 1600×900

Mask source. The original experiments used an Apple Vision Swift tool for the cutout; here the mask is the alpha from the built-in rembg cutout, so it runs anywhere (and keeps every person in multi-subject shots). Tune via cutout_model — u2net_human_seg (default, fast) or birefnet-portrait (cleaner edges).

POST /v1/style/{style}

field	type	notes
`style`	path	`split` · `red` · `money`
`file`	file	the photo (JPG/PNG/HEIC)
`cutout_model`	query	default `u2net_human_seg`
`quality`	query	JPEG quality, default 92

→ image/jpeg at the style's native social size.

curl -F file=@photo.jpg "https://foto-api.autonomousmedia.io/v1/style/red" -o red.jpg
curl -F file=@photo.heic "https://foto-api.autonomousmedia.io/v1/style/money?cutout_model=birefnet-portrait" -o money.jpg
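To produce all three looks from one photo, the three calls can be planned in a loop. The sketch below only builds the request URLs and output names; `style_jobs` is a hypothetical helper, and the sizes are copied from the captions above for reference.

```python
from urllib.parse import urlencode

BASE_URL = "https://foto-api.autonomousmedia.io"
# Native output sizes as listed in the style captions above.
STYLE_SIZES = {"split": (1080, 1080), "red": (1080, 1350), "money": (1600, 900)}


def style_jobs(stem, cutout_model=None, quality=None):
    """Return one (url, output_name, size) tuple per style for a photo stem."""
    query = {}
    if cutout_model is not None:
        query["cutout_model"] = cutout_model
    if quality is not None:
        query["quality"] = quality
    suffix = f"?{urlencode(query)}" if query else ""
    return [
        (f"{BASE_URL}/v1/style/{style}{suffix}", f"{stem}_{style}.jpg", size)
        for style, size in STYLE_SIZES.items()
    ]


jobs = style_jobs("photo", cutout_model="birefnet-portrait")
```

Each URL is then POSTed with the photo in the `file` field, as in the curl examples.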

07 Gallery

A mixed bag of effects across varied source photos — portraits, live music, architecture, coast — all produced by the endpoints above.

motion · ken_burns
a street still → motion

motion · parallax
cutout depth on a portrait

compose · auto
a subject dropped onto a beach

edit · grade
autocontrast + saturation + vignette

08 API reference

All endpoints return the resulting media bytes (PNG/JPEG/WebP/MP4) except the JSON discovery routes. Interactive Swagger UI is at /swagger.

Base URL: the examples use the live endpoint https://foto-api.autonomousmedia.io (public, TLS via Let's Encrypt). For local development, swap it for http://localhost:8000.

method	route	returns
`GET`	`/health`	JSON — liveness + capabilities
`GET`	`/v1/models`	JSON — models, ops, effects, modes
`POST`	`/v1/cutout`	image/png (transparent)
`POST`	`/v1/compose`	image/jpeg (or png/webp)
`POST`	`/v1/edit`	image/jpeg (or png/webp)
`POST`	`/v1/motion/{effect}`	video/mp4 (9:16)
`POST`	`/v1/style/{style}`	image/jpeg (editorial look)

GET /health

Liveness + capability snapshot (models, ops, whether LLM mode is enabled).

GET /v1/models

Lists cutout models, edit ops (with param hints), and available compose modes.

POST /v1/cutout

field	type	notes
`file`	file	the image (multipart)
`model`	query	`u2net_human_seg` (default) · `u2net` · `isnet-general-use` · `birefnet-portrait` · `birefnet-general`
`alpha_matting`	query	bool, default true — softer, cleaner edges

→ image/png with alpha.

POST /v1/compose

field	type	notes
`background`	file	the scene
`subjects`	file[]	one or more (repeat the field). Raw photos are auto-cut; pre-cut RGBA PNGs are used as-is.
`params`	form (JSON)	`mode`, `cutout_model`, `auto_cutout`, `output_format`, `quality`, `items[]`

→ image/jpeg (or PNG/WebP via output_format).
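Since `subjects` treats raw photos and pre-cut RGBA PNGs differently, it can be useful to know which kind a file is before uploading. How the service itself decides is not documented here; the sketch below only inspects the PNG header locally, using the fixed layout of the PNG signature and IHDR chunk. `png_is_rgba` and `is_precut` are hypothetical names.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_is_rgba(header):
    """Return True if the first 26 bytes of a file describe an RGBA PNG.

    Layout: 8-byte signature, 4-byte chunk length, b"IHDR", width (4),
    height (4), bit depth (1), colour type (1). Colour type 6 is RGBA.
    Palette images with a tRNS chunk are not detected by this check.
    """
    if len(header) < 26:
        return False
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return False
    return header[25] == 6


def is_precut(path):
    """Read just enough of `path` to apply png_is_rgba."""
    with open(path, "rb") as handle:
        return png_is_rgba(handle.read(26))
```

A JPEG or an RGB PNG returns False here, which corresponds to the "raw photo" case in the table.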

POST /v1/edit

field	type	notes
`file`	file	the image
`ops`	form (JSON)	array of `{"op":...}` objects, applied in order
`output_format` · `quality`	form	default `jpeg` · `94`

POST /v1/motion/{effect}

Renders a 9:16 MP4. Full fields & per-effect params in §05 Motion. → video/mp4.

POST /v1/style/{style}

Applies an editorial look (split/red/money). Details in §06 Editorial styles. → image/jpeg.

Python client

import json
import requests

B = "https://foto-api.autonomousmedia.io"

# 1) cut a subject out (returns a transparent PNG)
with open("me.jpg", "rb") as f:
    r = requests.post(f"{B}/v1/cutout", params={"model": "birefnet-portrait"},
                      files={"file": f}, timeout=120)
r.raise_for_status()  # fail loudly instead of saving an error body as an image
with open("me.png", "wb") as out:
    out.write(r.content)

# 2) composite onto a new scene (auto placement)
with open("scene.jpg", "rb") as bg, open("me.jpg", "rb") as subj:
    r = requests.post(f"{B}/v1/compose",
                      files=[("background", bg), ("subjects", subj)],
                      data={"params": json.dumps({"mode": "auto"})}, timeout=120)
r.raise_for_status()
with open("out.jpg", "wb") as out:
    out.write(r.content)

09 Deploy

Local

docker compose up -d --build
# → http://localhost:8000  (docs at /, Swagger at /swagger)
curl -s localhost:8000/health | jq

Scraper server

Heavy/batch image work belongs on the scraper. Ship the repo to /opt/apps/photo-studio/ and bring it up with Compose.

rsync -az --exclude .git ./ sandenskog@SCRAPER_IP:/opt/apps/photo-studio/
ssh sandenskog@SCRAPER_IP "cd /opt/apps/photo-studio && docker compose up -d --build"

Model storage. The default model is baked into the image; BiRefNet (~1 GB) downloads on first use into the models Docker volume so it persists across restarts. To bake it in too, build with --build-arg PREFETCH_BIREFNET=1.

Enable the vision-LLM placement mode (optional)

# in docker-compose.yml → environment:
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
PLACEMENT_MODEL: claude-haiku-4-5   # cheap default; bump to claude-opus-4-8 for max quality

When set, {"mode":"llm"} becomes available on /v1/compose and /health reports llm_placement:true.
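A client can check this flag before requesting `llm` placement and fall back to `auto` otherwise. The sketch assumes `llm_placement` is a top-level boolean in the parsed `/health` JSON, which the line above suggests but does not spell out; `placement_mode` is a hypothetical helper.

```python
def placement_mode(health, prefer_llm=False):
    """Pick a compose mode from a parsed /health response (a dict).

    Returns "llm" only when the caller asks for it and the server reports
    llm_placement as true; otherwise returns the default "auto".
    """
    if prefer_llm and health.get("llm_placement") is True:
        return "llm"
    return "auto"


# Illustrative payloads; real ones come from GET /health.
mode_enabled = placement_mode({"llm_placement": True}, prefer_llm=True)
mode_disabled = placement_mode({"llm_placement": False}, prefer_llm=True)
```

Keeping `auto` as the fallback means a missing API key degrades placement quality rather than failing the job.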

10 Notes & limits

CPU inference. On the scraper, cutout runs on CPU — fast for u2net_human_seg, a few seconds for BiRefNet. GPU is not required.
Auto-placement is a heuristic. It reads faces & profiles; a raised arm or an ambiguous pose can fool the gaze guess. Override with manual, or enable llm mode for hard cases.
Foreground framing. Close-up/bust subjects compose believably as foreground figures (anchored near the bottom). You can't shrink a bust into a tiny distant figure — there are no legs to show.
Motion = render-to-MP4. Effects are rendered frame-by-frame (PIL/numpy) and encoded with bundled ffmpeg — H.264, 9:16, CPU only. Fonts are configurable via PHOTO_FONT_BOLD/PHOTO_FONT_MONO; the Docker image ships DejaVu. For production reels, move rendering to Remotion (Node) — the time→frame model maps 1:1.
Parallax depth is approximate. The subject is cut out and moved over a blurred copy of the original; small movements read as depth without an inpainting/depth-map step. Keep motion subtle.
Size guard. Inputs above MAX_PIXELS (default 50 MP) are rejected with 413.
No external calls by default. Cutout, compose-auto/manual and edit are fully local. Only llm mode sends images to the Anthropic API.
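For the size guard above, a batch job can check dimensions locally before uploading and avoid a 413. This sketch assumes "50 MP" means 50,000,000 pixels and that the limit applies to width × height; both helper names are hypothetical.

```python
import math

MAX_PIXELS = 50_000_000  # documented default of 50 MP, read here as 50 million


def fits_size_guard(width, height, max_pixels=MAX_PIXELS):
    """True when width * height does not exceed the pixel limit."""
    return width * height <= max_pixels


def downscale_to_fit(width, height, max_pixels=MAX_PIXELS):
    """Return (w, h) scaled down, aspect ratio kept, so that w * h fits."""
    if fits_size_guard(width, height, max_pixels):
        return width, height
    factor = math.sqrt(max_pixels / (width * height))
    # Flooring both sides keeps the product at or below the limit.
    return max(1, math.floor(width * factor)), max(1, math.floor(height * factor))


target = downscale_to_fit(10000, 10000)
```

The returned size can be fed to any local resizer before the upload; images already under the limit pass through unchanged.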