SOL-001
CAPCAT TUI/CLI - ETHICAL WEB SCRAPER
01
Summary
Capcat is a dual-mode command-line tool designed for ethical web scraping and content preservation. It allows users to archive articles from various online sources into a local, searchable library, ensuring content remains accessible even if the original websites go offline.
Core Functionality:
Capcat operates in two distinct modes to suit different workflows:
CLI Mode: A fast, scriptable interface optimized for power users, automation, cron jobs, and integration into existing technical workflows.
TUI Mode: A visual, guided exploration mode that allows users to discover sources and test workflows without needing to memorize commands.
The tool fetches content from over 17 pre-configured sources (such as Hacker News, BBC, The Guardian, and IEEE Spectrum) or custom RSS feeds.
It is a working open-source tool, live at capcat.org.
Beyond the main product goal, this project was an ideal opportunity to learn LLM-assisted generative coding with context engineering and spec-driven development. (Technically: context engineering is the practice of structuring what an LLM receives before it generates. The architecture of the input determines the quality of the output.)
clig.dev sets the standard for the CLI experience. The TUI follows Nielsen's usability heuristics. Two interfaces run on one shared backend. The CLI takes flags, pipes, and scripts. The TUI walks users through a visual menu. Both produce the same output. Ethical scraping is part of the system architecture as a constraint.
Tools: Claude Code, Gemini, Figma, Drawio, Affinity Designer, Procreate.
Process: UX research, JTBD, heuristic analysis, product design, illustration, hand drawing, PRD, TDD, spec-driven implementation and context engineering in iterative cycles
Update: User testing surfaced two unmet needs: portable installation and Obsidian integration. I shipped solutions for both. pipx install capcat replaces the bash wrapper. A full YAML configuration abstraction layer gives users control over themes and sources. Obsidian frontmatter and back-linking are available for specific sources.
02
Challenge and Context
I read across several sources a week and keep what matters in all the popular tools. Over the years, this built several collections with no connection between them.
I searched online for things I had already read. Sometimes the page was gone. Browser bookmarks took one click to save, but told me nothing about the broader context and time.
When I needed something from my own archive, I could not find it easily. Time that should have gone to reading went to searching.
Every tool I tried solved one of these problems. None solved both capturing and categorizing.
The task was four things that had to work together:
- archive content before it disappears
- catalogue it so search works
- find it on demand
- share it without depending on an external service.
User need:
Permanent, searchable, knowledge archive
Current solution space:
- Search online
- Browser bookmarks
- Pocket/Instapaper
- Manual copying and pasting
- Evernote and Notion
Defined pain points:
- Data locked in
- Unsorted links
- Manual curation
- Disappearing content
03
Discovery and User Insights
I was the primary research subject. I mapped my own workflow before the design process started. A heuristic analysis using Nielsen's 10 principles and Laws of UX followed.
Before building, I ran informal conversations with three people in my network: a developer, a researcher, and an ML engineer. I asked each of them to walk me through how they currently save and retrieve things they read online. I did not describe the problem or the solution. Two of them independently mentioned losing content when the original page went down. That confirmed the core job-to-be-done before building a design solution and writing a PRD.
That process changed how I saw the problem. Saving articles was one of the goals. The general product definition was to own a permanent archive. The actual work: organizing, searching, getting content back out in a format that works.
The same problem belongs to four groups of people:
- Design engineers building information pipelines want to automate content collection so they can focus on systems, not manual gathering.
- Developers maintaining technical archives want searchable local copies so they can find reference material without depending on a website staying up.
- ML engineers tracking research across sources want structured, parseable output so they can feed articles into tools and agents.
- CLI-familiar users who want speed and control also want flags and scripts so they can skip the GUI entirely.
All four share the same gap: no reliable path from intent to content.
04
Information Architecture and Strategy
The first version of Capcat was a single scraping command: ./scangrab fetch hn, which pulled articles from Hacker News and saved them as Markdown. It worked, but it had no error handling and no way to add new sources without editing code.
That first version shipped under a different name: ScanGrab.
I stopped, ran a full JTBD and heuristic analysis, wrote a minimal PRD, and restarted. The name changed because the thinking changed. The new name, Cap(ture)Cat(alogue), encoded the two core jobs the tool actually performed.
The PRD locked three non-negotiable constraints:
1. The archive belongs to the user, not a service. Technically: all output is self-contained files with no server dependency. I can open the Capcat archive on a different machine years from now, and every article, document, and image will be accessible.
2. The tool meets users where they are, guided or direct. Technically: a visual step-by-step TUI and a flag-based CLI on one shared backend. Neither path is a reduced version of the other.
3. Ethical scraping is built in, not opted in. Technically: the EthicalScrapingManager sits at the architecture level. Every source passes through it before touching the network. The user has no toggle because it is not a choice.
Every feature request during development was evaluated against these three. The GUI was cut because a quality native implementation required Swift, which is outside the current stack, and the TUI already covered the guided experience job.
The CLI came first because the target users live in the terminal. DX requirements covered:
- documentation plugin architecture
- automation support
- HTML generation with easy theming
- self-contained output for sharing
- Markdown structured for Obsidian and LLM agents.
An expert user who learns the commands can type ./capcat bundle tech --count 10 --html, and get things done in three steps.
But not everyone who needs this tool thinks in flags and arguments. Once the CLI was stable, I had to decide how the TUI interactive mode should relate to it.
A wrapper sitting on top of the CLI would inherit its vocabulary. That works for someone who already knows the commands. A novice user does not know the commands. They need to see their options, follow guided steps, and have error prevention built into the flow.
One interface trying to serve both would compromise both. So I built the TUI as a separate surface on the same processing core. The novice path walks through eight steps: launch, see options, select, choose a bundle, decide on HTML, review, confirm, execute. The expert path takes three. Both produce identical files in the same folder structure.
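One way to picture two surfaces on one processing core is a single request type that both interfaces construct. The sketch below is illustrative only: FetchRequest, expert_path, and novice_path are hypothetical names, and the default count of 10 is an assumption, not something taken from Capcat's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRequest:
    """Hypothetical request object that both interfaces hand to the shared core."""
    bundle: str
    count: int
    html: bool


def expert_path(bundle: str, count: int = 10, html: bool = False) -> FetchRequest:
    """CLI side: values arrive already parsed from flags and arguments."""
    return FetchRequest(bundle=bundle, count=count, html=html)


def novice_path(answers: dict) -> FetchRequest:
    """TUI side: values arrive as answers collected step by step from the menu."""
    return FetchRequest(
        bundle=answers["bundle"],
        count=answers.get("count", 10),  # assumed default
        html=answers["generate_html"],
    )
```

Because both paths end in the same value, everything downstream (fetching, writing files, building HTML) cannot tell which interface was used, which is what makes identical output possible.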
Archive structure as information architecture.
The folder hierarchy was designed to match the way a user thinks about their own content. Batch fetches land in News/news_DD-MM-YYYY/, organized by date at the top level and by source below it. Single article captures land in Capcats/DD-MM-YYYY-Article-Title/, with the date prefix grouping them by capture day regardless of source. Within each article folder, the structure is consistent: article.md and article.html for the content, comments.md and comments.html for the discussion, an images/ sub-folder for downloaded media, and a pdfs/ sub-folder when applicable.
The naming conventions carry meaning deliberately. The date format is DD-MM-YYYY rather than YYYY-MM-DD because the archive is read by people and used primarily on a daily basis. Article titles in folder names are slugified but kept readable, so a user browsing the archive in Finder or a file manager can identify content without opening anything.
This matters because the archive is not meant to be accessed only through Capcat. A user might browse it in their file manager, search it with a script, open it in Obsidian, or index it with a local search tool. The folder structure has to be legible in all of those contexts without Capcat acting as an intermediary. The IA decision was to make the archive self-documenting: the structure itself tells you what is in it and when it arrived.
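The path conventions above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the slug rule (letters and digits joined by hyphens), and the use of the source ID as the sub-folder name are all hypothetical and may differ from what Capcat actually does.

```python
import re
from datetime import date
from pathlib import Path


def slugify(title: str) -> str:
    """Readable slug: keep runs of letters and digits, join them with hyphens."""
    return "-".join(re.findall(r"[A-Za-z0-9]+", title))


def batch_folder(root: Path, day: date, source: str) -> Path:
    """News/news_DD-MM-YYYY/<source>/ for batch fetches."""
    return root / "News" / f"news_{day:%d-%m-%Y}" / source


def single_folder(root: Path, day: date, title: str) -> Path:
    """Capcats/DD-MM-YYYY-Article-Title/ for single article captures."""
    return root / "Capcats" / f"{day:%d-%m-%Y}-{slugify(title)}"
```

For example, a capture of "Why Rust? A Short Guide" on 7 March 2025 would land in Capcats/07-03-2025-Why-Rust-A-Short-Guide/ under this sketch.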
05
Iterative Design and Testing
Before writing and generating any code, I designed the CLI vocabulary. Every command follows a verb-first structure: fetch, bundle, single, list, catch.
I wanted someone who has used git or docker to open Capcat and already have a rough sense of how it works. The reference standard was clig.dev, which documents what human-centered command-line design looks like in practice.
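A verb-first vocabulary of this kind maps naturally onto subcommands. The sketch below uses Python's standard argparse module to model three of the verbs; it is not Capcat's actual parser, and the default count of 10 and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for a few verb-first commands (not the real one)."""
    parser = argparse.ArgumentParser(prog="capcat")
    sub = parser.add_subparsers(dest="command", required=True)

    fetch = sub.add_parser("fetch", help="fetch from named sources")
    # Comma-separated source IDs, e.g. "hn,bbc", become a list.
    fetch.add_argument("sources", type=lambda s: s.split(","))
    fetch.add_argument("--count", type=int, default=10)  # assumed default
    fetch.add_argument("--html", action="store_true")

    bundle = sub.add_parser("bundle", help="fetch a predefined bundle")
    bundle.add_argument("name")
    bundle.add_argument("--count", type=int, default=10)  # assumed default
    bundle.add_argument("--html", action="store_true")

    sub.add_parser("catch", help="launch the interactive menu")
    return parser
```

Parsing ["fetch", "hn,bbc", "--count", "20", "--html"] with this parser yields the command name, a two-item source list, the count, and the HTML switch as separate fields.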
The harder problem was what happens when someone opens the tool and does not know what to type. In a GUI, there is usually a menu or a home screen. In a terminal, there is a blinking cursor and nothing else. An uncertain user has nowhere obvious to go. I needed to solve that before building the TUI.
From the Application folder, ./capcat catch launches the full interactive menu. Six options appear on screen. Arrow keys navigate. Every sub-menu carries an explicit "Back to Main Menu" option. During any text input, Ctrl+C returns to the menu instead of crashing the program. I did not want a single moment where someone feels stuck.
H3, User Control and Freedom: Users often perform actions by mistake. They need a clearly marked "emergency exit" to leave the unwanted action without having to go through an extended process.
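The "emergency exit" behavior for text input can be sketched as a prompt wrapper that treats Ctrl+C as a navigation event rather than a crash. The names ask and single_article_flow, and the string return values, are hypothetical and exist only to show the idea.

```python
from typing import Callable, Optional


def ask(prompt: str, read: Callable[[str], str] = input) -> Optional[str]:
    """Return the typed text, or None if the user pressed Ctrl+C."""
    try:
        return read(prompt)
    except KeyboardInterrupt:
        return None


def single_article_flow(read: Callable[[str], str] = input) -> str:
    """Ask for a URL; an interrupt sends the user back to the main menu."""
    url = ask("Please enter the article URL: ", read)
    if url is None:
        return "main_menu"
    return f"fetch:{url}"
```

Passing the reader in as a parameter keeps the flow testable without a real terminal.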
The same principle shaped the relationship between the two interfaces: neither locks the user in, both are independently complete, and they are separate tools that share a backend rather than modes you switch between mid-task.
The TUI follows H1 and H6. Every fetch operation reports progress as it runs, so the user always knows what the system is doing. The main menu shows all six paths on launch, so nothing requires memorization. Arrow keys move between options. Space toggles check boxes. Enter confirms. The interaction vocabulary is small enough to learn in one session.
H1, Visibility of System Status: The design should always keep users informed about what is going on, through appropriate feedback within a reasonable amount of time.
H6, Recognition Rather Than Recall: Minimize the user's memory load by making elements, actions, and options visible. The user should not have to remember information from one part of the interface to another.
On the CLI side, I applied H7. An expert user who types ./capcat fetch hn,bbc --count 20 --html gets the same output as someone who walked through every TUI step. The CLI is the accelerator layer that the novice never needs to see, but the power user depends on.
H7, Flexibility and Efficiency of Use: Shortcuts - hidden from novice users - may speed up the interaction for the expert user so that the design can cater to both inexperienced and experienced users. Allow users to tailor frequent actions.
The --html flag generates self-contained HTML files with embedded CSS and JavaScript. Each article page includes syntax highlighting for code blocks, dark and light theme support, responsive layout, breadcrumb navigation, and back/forward links between articles. The templates are built on a minimal design system with CSS variables, so changing the look of every generated page takes one file edit. The HTML is self-contained, meaning you can send an article folder to someone and they can open it in a browser with no dependencies, no server, no internet connection.
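What "self-contained" means in practice can be shown with a small renderer that inlines the stylesheet instead of linking to it. This is a simplified sketch, not Capcat's template code: render_article is a hypothetical function, and the real templates also embed JavaScript, navigation, and theming.

```python
from html import escape


def render_article(title: str, body_html: str, css: str) -> str:
    """Return one HTML document with the stylesheet inlined and no external URLs."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        f"<style>\n{css}\n</style>\n"
        "</head>\n<body>\n"
        f"<article>\n<h1>{escape(title)}</h1>\n{body_html}\n</article>\n"
        "</body>\n</html>\n"
    )
```

Because the page carries its own CSS and references nothing over the network, the resulting file opens the same way from a USB stick, a zip attachment, or a machine with no internet connection.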
06
The Final Solution
Bundles are the daily driver. They are predefined source groups organized by category: tech, techpro, news, science, ai, sports. I pick a bundle, choose whether to generate HTML, review a summary, confirm, and execute. Output lands in ../News/news_DD-MM-YYYY/, organized by source and date. The list flow works the same way but lets me hand-pick individual sources with check boxes instead of committing to a full bundle.
The single article path handles any URL. I paste it and the system figures out what to do. If the URL matches a known source, Capcat uses that source's config. If it matches a platform like Medium or Substack, a dedicated handler checks for paywall content and falls back when needed. If the URL is unknown, a generic scraper auto-detects selectors. The user does not see any of this. Output always lands in ../Capcats/DD-MM-YYYY-Article-Title/ as clean Markdown with downloaded images, self-contained and ready to share.
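The three-way decision described above (known source, known platform, generic fallback) can be sketched as a small dispatch function. The lookup tables and return strings below are hypothetical placeholders; in a real system the entries would come from the source configs.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables, for illustration only.
KNOWN_SOURCES = {"news.ycombinator.com": "hn", "www.bbc.com": "bbc"}
PLATFORM_SUFFIXES = ("medium.com", "substack.com")


def choose_handler(url: str) -> str:
    """Pick a handler label for a URL: source config, platform handler, or generic."""
    host = (urlparse(url).hostname or "").lower()
    if host in KNOWN_SOURCES:
        return f"source:{KNOWN_SOURCES[host]}"
    for suffix in PLATFORM_SUFFIXES:
        # Match the bare domain and any subdomain, e.g. someone.substack.com.
        if host == suffix or host.endswith("." + suffix):
            return f"platform:{suffix}"
    return "generic"
```

The order of the checks matters: the most specific match wins, and the generic scraper is reached only when nothing else applies.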
Source management handles the third case. Add a new RSS source by pasting a feed URL. The system validates it, suggests an ID, asks for a category, optionally assigns it to a bundle, and optionally runs a test fetch.
For complex sites, an interactive wizard generates a full YAML config with selectors, rate limiting, and skip patterns.
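The first two steps of the RSS flow, validating the feed and suggesting an ID, might look roughly like the sketch below. Both functions are hypothetical: the validation only checks that the document parses as XML with an RSS, Atom, or RDF root element, and the ID rule (first meaningful host label) is an assumption rather than Capcat's actual logic.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def looks_like_feed(xml_text: str) -> bool:
    """True if the text parses as XML and its root is an RSS, Atom, or RDF element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    tag = root.tag.rsplit("}", 1)[-1].lower()  # drop any XML namespace prefix
    return tag in {"rss", "feed", "rdf"}


def suggest_id(feed_url: str) -> str:
    """Suggest a short source ID from the feed's host name."""
    host = (urlparse(feed_url).hostname or "").lower()
    parts = [p for p in host.split(".") if p not in {"www", "feeds", "rss"}]
    name = parts[0] if parts else "source"
    return re.sub(r"[^a-z0-9]", "", name) or "source"
```

A suggestion is only a starting point; the flow described above still lets the user confirm or change the ID before the source is saved.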
07
HTML Output
The HTML output is produced by a TUI fetch with the HTML option selected, or by the --html flag in the CLI. It is also the most visible surface Capcat produces, the one a reader actually opens. The design was inspired by the Bauhaus minimal aesthetic and Dieter Rams's use of accent color.
The archive as publication. The index page presents each day's fetch as an organized collection: sources listed as sections, each article as a titled entry with its comment count and metadata. The reader can scan a day's content the way they would scan a newspaper front page, then go deeper into individual pieces. This is a different mental model from a bookmarks list or a folder of files. The archive has an editorial structure and feeling built in.
Ownership and portability as the primary constraint. Every design decision in the HTML output was evaluated against one question: does this work when the folder is closed and reopened on a different machine with no internet connection? This ruled out external fonts, hosted scripts, and CDN dependencies of any kind. It also shaped the image strategy. Images are downloaded into the article folder at fetch time, so the article is visually complete offline. The result is an archive that belongs to the user in a literal sense. It does not require a service to remain running.
The HTML output is a folder. The user can move it, copy it, zip it and email it, or open it on any device. Nothing about how it was generated imposes constraints on how it is used.
Reading surface design. The article page is designed for sustained reading. Content width is constrained to a comfortable line length. Headings use a serif typeface at a light weight, which creates a visual distinction between article content and interface chrome without requiring the reader to consciously register it. A reading progress bar at the top of the sticky header answers the implicit question any reader asks on a long article: how far through am I? Navigation links appear at both the top and bottom of the article, so the reader never has to scroll back to find their way out. A dedicated Go to top button is also available when needed.
Theme as user preference, not a feature. The dark and light theme toggle is a persistent choice, not a per-session setting. The user sets it once and the entire archive respects it, across every article and every navigation.
The design system behind both themes is fully open. Every color, spacing decision, and typographic choice in the output traces back to a single design-system.css file that ships with Capcat. A user who wants to change how the archive looks edits that file. The change propagates to every article on the next fetch. This is user control at the output level. The minimal design also serves as a good starting point for customization.
Comment hierarchy. Comments are one of the harder design problems in the output. A Hacker News discussion can run hundreds of replies deep, with threads that branch and re-branch. The challenge was making depth legible without making the page feel cluttered. The solution uses a left border on each comment thread that shifts slightly in tone as depth increases. The hierarchy is visible at a glance. The reader can see at once which replies are top-level and which are nested responses, without needing to count indent levels or read author names to orient themselves.
Author anonymization is built into the output, not offered as a setting. Names are replaced at write time. The visual design of the comment header reflects this: it shows a neutral label rather than drawing attention to identity.
Code syntax themes. Technical articles frequently contain code blocks, and the syntax highlighting was designed as two distinct themes rather than one inverted palette. The dark theme follows the One Dark convention: keywords in purple, strings in green, numbers and attributes in orange, functions and built-ins in blue, variables in red, classes in yellow. The palette is warm overall, with high contrast against the near-black code background. The light theme follows the GitHub convention: keywords in red with added weight, functions and classes in purple, strings in deep navy, numbers and constants in blue.
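The depth-to-tone idea can be sketched as two small functions: one walks a nested thread and records each comment's depth, the other turns a depth into a border color. The data shape, the HSL values, and the cap are all assumptions chosen for illustration; the actual palette lives in Capcat's design system.

```python
def border_tone(depth: int, base: int = 30, step: int = 8, cap: int = 70) -> str:
    """CSS gray whose lightness rises with nesting depth, up to a cap."""
    lightness = min(base + depth * step, cap)
    return f"hsl(0, 0%, {lightness}%)"


def flatten(comments: list[dict], depth: int = 0) -> list[tuple[int, str]]:
    """Walk a nested comment tree depth-first, recording each comment's depth."""
    rows = []
    for comment in comments:
        rows.append((depth, comment["text"]))
        rows.extend(flatten(comment.get("replies", []), depth + 1))
    return rows
```

Capping the lightness keeps very deep threads from fading into the background, so depth stays visible without the border disappearing.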
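Replacing names at write time can be as simple as building a stable mapping before any comment is written. The sketch below is one possible scheme, not Capcat's: the function name and the "Commenter N" label format are hypothetical.

```python
def anonymize(authors: list[str]) -> dict[str, str]:
    """Map each distinct author to a neutral label, in order of first appearance."""
    labels: dict[str, str] = {}
    for name in authors:
        if name not in labels:
            labels[name] = f"Commenter {len(labels) + 1}"
    return labels
```

Keeping the mapping consistent within a thread preserves the flow of a conversation (the same person keeps the same label) while the original names never reach the files on disk.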
PDF integration. For sources that publish PDFs alongside articles, a linked reference bar appears at the top of the article page. The PDFs are stored locally in the article folder and linked by filename. The reader can open them without leaving the archive or depending on the original source remaining available. This closes the gap between reading an article that references a paper and actually having that paper.
H2, Match Between System and the World: The design should speak the users' language. Use words, phrases, and concepts familiar to the user. Follow real-world conventions, making information appear in a natural and logical order.
08
Validation and Impact
Every choice in the TUI was counted. Hick's Law says decision time increases with the number of options. I kept the main menu at six:
What would you like me to do?
> Catch articles from a bundle of sources
  Catch articles from a list of sources
  Catch from a single source
  Catch a single article by URL
  Manage Sources (add/remove/configure)
  Exit
Each sub-level adds two to three more, never a wall of choices. The bundle selection:
Select a news bundle and hit Enter for activation.
> tech - Technology News
    (IEEE, Mashable)
  news - General News
    (BBC News, The Guardian)
  science - Science News
    (Nature News, Scientific American)
  ai - AI & Machine Learning
    (MIT News)
  sports - Sports News
    (BBC Sport)
The source list uses check boxes instead of single selection, so the user builds a custom set without navigating back and forth:
Select sources with <space> and press Enter to continue:
[ ] hn Hacker News
[x] lb Lobsters
[x] iq InfoQ
[ ] bbc BBC News
[ ] guardian The Guardian
(Use <space> to select, <enter> to confirm)
The single article flow strips the interface down to two questions:
(Use Ctrl+C to go to the Main Menu)
Please enter the article URL: https://example.com/article
Generate HTML for web browsing?
> Yes
  No
And source management keeps its full capability behind one sub-menu with a clear exit:
Source Management - Select an option:
> Add New Source from RSS Feed
  Generate Custom Source Config
  Remove Existing Sources
  List All Sources
  Test a Source
  ────────────────
  Back to Main Menu
The user never sees more than they need at any given step. Complexity is there, but it reveals itself as you go deeper.
Hick's Law, Progressive Disclosure: The time it takes to make a decision increases with the number and complexity of choices.
Miller's Law puts working memory capacity at around seven items, plus or minus two. Recent cognitive science suggests four plus or minus one for complex information. The CLI has nine primary commands: single, fetch, bundle, list, config, add-source, remove-source, generate-config, catch. That sits slightly above the threshold, but Capcat targets technically capable users who routinely hold more in working memory. Every command is a verb. Every verb maps to one action.
Miller's Law: The average person can hold about 7 (±2) items in working memory at once. Recent cognitive science suggests 4 ±1 may be more accurate for complex information. The key principle: reduce cognitive burden by organizing information into meaningful, manageable chunks.
Jakob's Law says users bring mental models from every other tool they have used. Fighting those models creates friction.
./capcat fetch hn,bbc --count 20 maps to git fetch. The verb, the comma-separated targets, the flag structure. ./capcat bundle tech --count 10 maps to git bundle. ./capcat list sources maps to docker ps. ./capcat remove-source maps to how every package manager handles removal. A CLI-familiar user opens Capcat and already has a rough sense of how it works before reading any documentation.
Jakob's Law, Pattern Transfer: Users spend most of their time on other sites. This means that users prefer your site to work the same way as all the other sites they already know.
09
Retrospective and Learnings
The System Architecture diagram has four layers: User Interface, Source System, Processing Pipeline, Output. The placement that matters most is Ethical Scraping. It sits inside the Source System at the same level as Source Factory and Source Registry.
Every source goes through the EthicalScrapingManager before it touches the network. Robots.txt is cached, rate limiting is enforced per domain, and the user agent identifies itself honestly. Ethical behavior is a constraint the architecture enforces, not a feature the user toggles.
This is an ethical constraint with a UX consequence. The user never has to wonder whether Capcat is behaving lawfully. That uncertainty is removed by design. And the tool identifies itself honestly in the network. The product has integrity in both senses of the word.
In summary, Ethical Scraping:
- Respects robots.txt
- Rate Limiting (1 request per 10 seconds)
- Prefers RSS/APIs over HTML Scraping
- No Paywall Circumvention
- Proper Source Attribution
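A gate of this kind can be sketched with Python's standard urllib.robotparser. The class below is not the EthicalScrapingManager itself: EthicalGate, its method names, and the user agent string are hypothetical, and denying requests for domains whose robots.txt has not been loaded yet is a design assumption made for this sketch. Only the 10-second interval comes from the list above.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleArchiver/0.1"  # hypothetical; a real tool names itself honestly
MIN_INTERVAL = 10.0  # seconds between requests to one domain


class EthicalGate:
    """Illustrative gate: cached robots.txt rules plus per-domain rate limiting."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._robots: dict[str, RobotFileParser] = {}
        self._last_request: dict[str, float] = {}

    def add_robots(self, domain: str, robots_txt: str) -> None:
        """Parse and cache robots.txt rules for a domain."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[domain] = parser

    def allowed(self, url: str) -> bool:
        """Deny by default until the domain's robots.txt has been cached."""
        parser = self._robots.get(urlparse(url).netloc)
        return parser is not None and parser.can_fetch(USER_AGENT, url)

    def wait_time(self, url: str) -> float:
        """Seconds to wait before the next request to this URL's domain."""
        last = self._last_request.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, MIN_INTERVAL - (self._clock() - last))

    def record(self, url: str) -> None:
        """Note that a request to this URL's domain was just made."""
        self._last_request[urlparse(url).netloc] = self._clock()
```

A caller would check allowed, sleep for wait_time, make the request with the declared user agent, and then call record. Putting all three checks behind one object is what lets the rest of the code stay unaware of them.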
A CLI and a TUI are two separate design problems that share a backend. And ethical constraints belong in the architecture, not in the documentation.
10
Product Website
The website was built with the same process as the tool: context engineering, a PRD, iterative development with an LLM. Most of the CSS was written by hand. The design system, the color decisions, the spacing, the typography scale.
capcat.org serves three audiences. A user evaluating the tool sees the feature set, the interface options, the preconfigured sources, and a three-step getting started flow. A developer reads the architecture docs, the source development guide, and the API reference. A contributor finds the GitHub link, the issue tracker, and the ethical scraping guidelines. Primary navigation covers Features, How It Works, Tutorials, Case Study, Get Started, and Ethical Scraping. The footer carries a parallel layer grouped by function: Documentation, Resources, About.
The design system runs on CSS variables. Typography uses a system font stack with no external dependencies, which was a performance and accessibility choice, not a default. The color palette is built around a nine-step orange scale. The primary accent color is the same orange used in the terminal output coloring, so the brand and the interface share one identity. Cream base background, dark ink for text, hover states, tints, and semantic aliases all derive from the same scale. The same tokens drive the Mermaid diagrams across all documentation pages.
Eight custom SVG icons illustrate the feature section, one per capability: Command-Line Mode, Interactive Menu, Bulk RSS Fetching, Local Markdown Storage, HTML Generation, Offline Accessibility, Add Your Own Sources, and the Capcat mascot.
The footer carries additional information for the project and documentation.
11
Branding and Illustration
The name is a compression. Cap from Capture. Cat from Catalogue. Those are the two core actions the tool performs. Hidden inside the same four letters are two more layers: Cat, the mascot that gives the product a face (a FOSS tradition), and cat, the Unix command that reads and outputs file contents. The name works on first encounter and holds more the longer you look at it.
The logotype is a custom serif drawn by hand. The capital C opens with a pronounced curved terminal that no standard typeface carries, giving the word an immediate identity at the first letter.
The white space inside the aperture curves in a way that reads as a cat's tail, so the mascot is present in the very first letter without being illustrated. The logotype carries the brand identity and the character reference in the same stroke.
The lowercase descender on the p is deliberately deep, anchoring the word to the baseline with a tail that echoes the mascot a second time, without depicting it. Ink traps at the stroke joints keep the forms clean at small sizes. The whole word sits on a precise construction grid, visible in the presentation versions, which shows the spacing and optical alignment were considered rather than assumed.
Four versions cover the practical range of contexts where the logotype appears.
The mascot is a cat dressed as a baseball player, catching a loading ball from a progress bar. Behind it, a crowd of computers cheers from the stands. Many FOSS projects have a mascot. I wanted one that carried the product's name, referenced its function, and had enough personality to anchor a brand.
The style is inspired by Top Cat, the American animated sitcom produced by Hanna-Barbera Productions.
The illustration was drawn by hand on paper, refined in Procreate, and vectorized in Affinity Designer.
The slogan ties the most important functionality together in one line:
"Archive Articles with Confidence. Share without Limits."