Everything under the hood.
The homepage shows you what TinyFirm does. This page shows you how. Mechanisms, architecture, and the engineering decisions that make the system reliable.
Five models. Three providers. One consensus.
Before any phase advances in a Build project, TinyFirm runs a multi-model adversarial review. Five AI models from different providers independently analyze the phase output. They do not see each other’s results. They do not collaborate.
The models
- DeepSeek-R1 70B (local via Ollama, free)
- Llama 3.3 70B (via Groq, free tier)
- Gemini 2.5 Flash (via Google, free tier)
- Qwen 3 235B (via Cerebras, free tier)
- o4-mini (via OpenAI, ~$0.04 per review)
When multiple models independently flag the same issue, it is almost certainly real. A single model hallucinating a false positive gets outvoted. Consensus findings surface genuine problems: logic errors, security gaps, performance regressions, missing edge cases.
This is not an optional code review you can dismiss. It is a mechanical gate. The phase does not advance until findings are addressed. It cannot be skipped, edited, or waved through.
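In principle, the consensus rule is just a vote count across independent reviews. A minimal TypeScript sketch, assuming findings are normalized to comparable issue keys (the shapes and the two-vote threshold are illustrative, not TinyFirm's actual code):

```typescript
// Sketch: treat a finding as "consensus" when at least two of the five
// independent reviews flag the same normalized issue.
interface Finding {
  model: string;  // e.g. "deepseek-r1-70b"
  issue: string;  // normalized key, e.g. "auth/missing-rate-limit"
  detail: string;
}

function consensusFindings(reviews: Finding[][], minVotes = 2): Map<string, Finding[]> {
  const byIssue = new Map<string, Finding[]>();
  for (const review of reviews) {
    for (const f of review) {
      const list = byIssue.get(f.issue) ?? [];
      list.push(f);
      byIssue.set(f.issue, list);
    }
  }
  // Keep only issues flagged by >= minVotes distinct models;
  // a single hallucinating model gets outvoted here.
  for (const [issue, findings] of byIssue) {
    const models = new Set(findings.map((f) => f.model));
    if (models.size < minVotes) byIssue.delete(issue);
  }
  return byIssue;
}
```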
Cost: Four of the five models run free. Total cost per phase-gate review is approximately $0.04. The quality ceiling this buys would cost thousands in human review time.
Who this matters to: Engineering leaders who need accountability at scale. When your team ships without you reviewing every line, you need a system that catches what individuals miss. Five independent perspectives are better than one.
Your tenth project benefits from the first nine.
Every TinyFirm project generates a structured HQ report when you save progress: what was built, what worked, what did not, team effectiveness, and generalizable lessons. These reports accumulate in your workspace.
Say “sync big brain.” The system aggregates every HQ report into cross-project intelligence: which agent configurations perform best, which patterns produce reliable output, which anti-patterns waste cycles, and how team size should scale with scope.
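The aggregation step is mechanical: fold structured reports into portfolio-level statistics. A hedged TypeScript sketch with an illustrative report shape (TinyFirm's real HQ schema may differ):

```typescript
// Sketch: aggregate per-project HQ reports into cross-project stats.
// The HqReport shape below is hypothetical, for illustration only.
interface HqReport {
  project: string;
  agents: { name: string; role: string; effectiveness: number }[]; // 0-1 score
  lessons: string[];
}

function syncBigBrain(reports: HqReport[]) {
  const byAgent = new Map<string, { total: number; runs: number }>();
  for (const report of reports) {
    for (const agent of report.agents) {
      const stats = byAgent.get(agent.name) ?? { total: 0, runs: 0 };
      stats.total += agent.effectiveness;
      stats.runs += 1;
      byAgent.set(agent.name, stats);
    }
  }
  // Rank agents by average effectiveness across the whole portfolio.
  return [...byAgent]
    .map(([name, s]) => ({ name, avg: s.total / s.runs, runs: s.runs }))
    .sort((a, b) => b.avg - a.avg);
}
```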
What compounds
- Agent playbook. Proven configurations, team-size guidelines, role combinations that work (and ones that do not).
- Pattern library. Architectural decisions that shipped well across projects. Decisions that caused rework.
- Team calibration. Which agents performed highly, which need refinement, which roles overlap.
This is not per-project memory (that compounds within a single engagement). This is portfolio intelligence. The kind of institutional knowledge that usually requires a decade of engineering leadership to accumulate, delivered structurally across every project you run.
Who this matters to: Technical ICs who want the system to improve without manual tuning. Founders running multiple products. Anyone whose next project should be meaningfully easier than the last.
Five layers. Three automated. None rely on self-assessment.
TinyFirm does not trust agents to evaluate their own output. Quality enforcement is structural: deterministic gates that block bad code, independent reviewers that catch what authors miss, and security tools that scan for what both miss.
Layer 1: Agent Self-Review
Every code-owning agent has a domain-specific quality checklist in their definition file. Security agents check for hardcoded credentials, XSS vectors, and auth bypasses. React agents check dependency arrays, render-loop risks, and server/client boundaries. Backend agents check input validation, rate limiting, and query safety. Before writing their summary, agents review their work against this checklist.
This is the weakest layer. It exists because it catches obvious mistakes cheaply. It is not trusted.
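For illustration, a checklist like this could live in an agent definition as structured data. The items below paraphrase the examples above; the shape itself is hypothetical:

```typescript
// Sketch: a backend agent's self-review checklist as structured data.
const backendChecklist: { id: string; check: string }[] = [
  { id: "input-validation", check: "All request bodies validated before use" },
  { id: "rate-limiting",    check: "Public routes enforce rate limits" },
  { id: "query-safety",     check: "No string-built SQL; parameterized queries only" },
];
```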
Layer 2: Deterministic Quality Gate
After every agent task, a shell script runs the project’s TypeScript typecheck, linter, and test suite. This is mechanical. It reads compiler output. It does not interpret, it does not make judgment calls, it does not get tired. Code that produces type errors does not get committed. Code that fails tests does not get committed. There is no override flag.
If the gate fails, the agent is re-delegated automatically with the failure output. Fix and re-run. No human intervention required for routine errors.
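The gate itself is a shell script; for consistency with the other examples on this page, here is the same logic as a TypeScript sketch (the commands are typical defaults, not necessarily the ones your project uses):

```typescript
// Sketch of a deterministic gate: run typecheck, lint, and tests in
// sequence; any nonzero exit blocks the commit. There is no override flag.
import { spawnSync } from "node:child_process";

const steps: [string, string[]][] = [
  ["npx", ["tsc", "--noEmit"]], // typecheck
  ["npx", ["eslint", "."]],     // lint
  ["npm", ["test"]],            // test suite
];

for (const [cmd, args] of steps) {
  const result = spawnSync(cmd, args, { stdio: "inherit" });
  if (result.status !== 0) {
    // Failure output is handed back to the agent for a fix-and-rerun.
    console.error(`Gate failed at: ${cmd} ${args.join(" ")}`);
    process.exit(1);
  }
}
console.log("Quality gate passed.");
```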
Layer 3: Snyk Security Scan
After the quality gate passes, Snyk scans the changed files for known vulnerabilities. High and critical findings are reported before commit. The team does not proceed on insecure code without explicit human acknowledgment.
This is automatic. No approval required to run. No cost.
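Conceptually, the step is one CLI invocation plus an exit-code check. A sketch using real Snyk CLI commands; the wiring around them is illustrative:

```typescript
// Sketch: run Snyk after the quality gate passes and pause on
// high/critical findings. `snyk code test` and --severity-threshold
// are real CLI options; Snyk exits nonzero when issues are found.
import { spawnSync } from "node:child_process";

const scan = spawnSync("snyk", ["code", "test", "--severity-threshold=high"], {
  stdio: "inherit",
});
if (scan.status !== 0) {
  // Findings at or above the threshold: do not proceed without
  // explicit human acknowledgment.
  console.error("Snyk reported high/critical issues. Awaiting acknowledgment.");
  process.exit(1);
}
```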
Layer 4: Phase-Gate Multi-Agent Review
Before advancing from one phase to the next, Ace dispatches 2-3 agents in parallel to review the phase output. A backend engineer reviews a frontend agent’s API integration. A security agent reviews auth flows. A QA agent reviews test coverage. Each reviewer produces findings independently.
Critical and should-fix findings must be resolved before the phase advances. This is where cross-cutting issues surface: the security gap that the backend agent did not think about, the edge case the frontend agent did not test.
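A minimal sketch of the dispatch-and-block logic, where `reviewAgent` is a hypothetical stand-in for the actual delegation call:

```typescript
// Sketch: dispatch reviewers in parallel; the phase advances only when
// no critical or should-fix findings remain.
type Severity = "critical" | "should-fix" | "nit";
interface ReviewFinding { severity: Severity; summary: string }

// Hypothetical: stands in for delegating a review to a specialist agent.
declare function reviewAgent(role: string, phaseOutput: string): Promise<ReviewFinding[]>;

async function phaseGate(phaseOutput: string): Promise<boolean> {
  const roles = ["backend", "security", "qa"];
  const results = await Promise.all(roles.map((r) => reviewAgent(r, phaseOutput)));
  const blocking = results.flat().filter((f) => f.severity !== "nit");
  if (blocking.length > 0) {
    console.error(`${blocking.length} blocking finding(s); phase does not advance.`);
    return false;
  }
  return true;
}
```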
Layer 5: Security Hardening Phase
Before any Build project deploys to production, the security agent runs a full audit:
- OWASP ZAP (free): spider + active scan covering the OWASP Top 10.
- Nuclei (free): 9,000+ vulnerability templates, known CVEs.
- AgentShield (free): agent configuration audit, 102 static rules checking for prompt injection and unsafe permissions.
- Shannon (enhanced tier, ~$50/run): autonomous AI pentester. Real exploit proofs. Zero false positives. “No exploit, no report.”
The project does not deploy until the security agent produces a signed-off report with a GO/NO-GO recommendation.
The result: Five layers. Three are fully automated (the deterministic quality gate, the Snyk scan, and the multi-model phase-gate review). Two involve agent judgment but with mechanical enforcement. No single point of failure. No “the agent said it looked fine.”
Who this matters to: Engineering leaders responsible for production reliability. Anyone who has been burned by AI-generated code that compiled but did not work correctly.
Not everything is a software build.
TinyFirm generates teams for four types of work. The intake interview determines the track. The team composition adapts.
Build
Full software development. Frontend, backend, security, QA, documentation, DevOps. The intake maps your tech stack, compliance requirements, architecture preferences, and testing strategy. The team ships production code with persistent memory, quality gates, and security scanning baked in.
Example team: Pixel (Frontend), Wrench (Backend), Sentinel (Security), Atlas (QA), Quill (Docs). 5-7 agents typical.
Ideation
No code. The team brainstorms, explores, and evaluates. Business model canvases, market sizing, concept validation, competitive positioning. The deliverable is a recommendation you can act on, not a repository.
Example team: Scout (Research), Prism (Strategy), Beacon (Analysis). 3-4 agents typical.
Research
Deep investigation. Market analysis, competitive intelligence, data synthesis, trend mapping. Primary research (surveys, interviews) or secondary (reports, public data, market analysis). The deliverable is structured knowledge: reports, presentations, spreadsheets, or executive summaries.
Example team: Scout (Research Lead), Beacon (Quantitative Analysis), Quill (Report Writing). 2-3 agents typical.
Hybrid
Blend any of the above. Research a market, then build the product. Ideate three concepts, then prototype the winner. The tracks are composable. Phases adapt to what the project needs at each stage.
Example team: Varies. A hybrid project might start with 3 research agents and expand to 6 build agents when the scope crystallizes.
Who this matters to: Founders who need more than code. Technical ICs exploring adjacent projects. Agencies scoping engagements before committing to a build.
Full audit trail. Zero manual commits.
Every agent task is automatically committed to a local Git repository when it completes. Commit messages follow conventional commit format and include the agent’s name:
- feat: rate limiting on public API routes | Wrench
- fix: auth token refresh race condition | Bolt
- refactor: extract validation middleware | Sentinel
- test: add integration tests for user endpoints | Atlas
The full history is browsable in Cursor’s Source Control panel. Click any commit to see exactly what changed, line by line. Compare any two points in time. Revert to any previous state with a single command.
This is not opt-in. It is not configured per-project. It happens automatically after every task completes, enforced by the delegation protocol. The working tree must be clean before any new delegation proceeds. Uncommitted changes block the pipeline.
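The enforcement reduces to two git operations. A TypeScript sketch (the git commands are standard; the wrapper around them is illustrative):

```typescript
// Sketch: require a clean tree before any new delegation, and commit
// each completed task with a conventional-commit message plus agent name.
import { execSync } from "node:child_process";

function assertCleanTree(): void {
  const status = execSync("git status --porcelain").toString().trim();
  if (status.length > 0) {
    throw new Error("Uncommitted changes block the pipeline. Commit or revert first.");
  }
}

function commitTask(type: string, summary: string, agent: string): void {
  execSync("git add -A");
  execSync(`git commit -m "${type}: ${summary} | ${agent}"`);
}

// Usage: commitTask("feat", "rate limiting on public API routes", "Wrench");
```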
What this means in practice
- Complete audit trail of who changed what and why.
- Instant rollback when something breaks.
- Blame any line to see which agent wrote it and during which task.
- Full confidence to experiment. Every checkpoint is recoverable.
Purely local. Nothing is pushed to GitHub, GitLab, or any remote service unless you explicitly configure it. Your code history stays on your machine.
Who this matters to: Engineering leaders who need accountability. Anyone who has ever asked “what changed?” after a regression.
Nothing happens without your say-so.
Every action in TinyFirm follows a three-step cycle: Present, Approve, Execute.
Present. Ace describes what will be done, which agent handles it, and the expected outcome. This is a plan, not a notification.
Approve. You review the plan and give the go-ahead. Work does not begin on silence, ambiguity, or assumed intent.
Execute. Only after your explicit approval does Ace delegate to the specialist. The agent works within the scope you approved.
Batch approval for efficiency. Approve a multi-step plan and the team works through it without interrupting you at every step. Five bug fixes? Approve the batch. The team handles them in sequence while you do other things.
New tasks restart the cycle. If you raise something new mid-session, even while approved work is in progress, the cycle resets. Present. Approve. Execute. No assumptions.
Example interaction
You: “The login page needs rate limiting and the API needs input validation.”

Ace: “Here’s my plan:
- Wrench adds rate limiting to /api/auth endpoints (express-rate-limit, 5 attempts per 15 minutes)
- Wrench adds Zod validation schemas to all public API routes
Wrench handles both since they’re backend tasks. Expected output: committed code with passing tests for both changes. Want me to proceed?”

You: “Go ahead.”

Ace: [delegates to Wrench with full context, key files, and expected output]
This is not a convenience feature. It is a structural constraint. The delegation protocol enforces it mechanically. Ace cannot delegate without presenting first. The system is designed so that autonomous action is architecturally impossible.
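One way to picture that constraint: delegation only accepts a plan that has passed through explicit approval. A TypeScript sketch with illustrative names, showing the shape of the enforcement rather than TinyFirm's actual code:

```typescript
// Sketch: make "delegate without approval" unrepresentable.
interface Plan { agent: string; task: string; expectedOutput: string }
interface ApprovedPlan extends Plan { approvedBy: "user" }

// The only way to obtain an ApprovedPlan is through explicit user approval.
function approve(plan: Plan, userSaidYes: boolean): ApprovedPlan | null {
  return userSaidYes ? { ...plan, approvedBy: "user" } : null;
}

// delegate() accepts only an ApprovedPlan, so skipping Present/Approve
// is impossible at the call site.
function delegate(plan: ApprovedPlan): void {
  console.log(`Delegating to ${plan.agent}: ${plan.task}`);
}
```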
Who this matters to: Engineering leaders who need to trust the system. Anyone who has seen an AI tool make changes they did not ask for.
Queue work overnight. Wake up to results.
Optional. NightOwl is a macOS menu bar app that schedules local LLM tasks via Ollama. Queue analysis jobs, data processing, or batch transformations before you leave for the day. They run locally on your machine using free, open-source models. Results are ready by morning.
How it works
- Click the menu bar icon. Add a task to the queue.
- Select a model (DeepSeek-R1 70B, Qwen 2.5 32B, Llama 3.3, or any Ollama-compatible model).
- Set the schedule: run immediately, run at a specific time, or run when the machine is idle.
- Results are written to a local output directory or delivered to a remote server via SCP (a sketch of the core call follows this list).
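Under the hood, a queued job reduces to a call against Ollama's local HTTP API with the result written to disk. A sketch using Ollama's documented /api/generate route; the wrapper, model tag, and paths are illustrative:

```typescript
// Sketch: one queued NightOwl task as a local Ollama call. No API keys,
// no cloud dependencies; the model runs entirely on your machine.
import { writeFile } from "node:fs/promises";

async function runQueuedTask(model: string, prompt: string, outPath: string) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  const { response } = await res.json();
  await writeFile(outPath, response); // lands in the local output directory
}

// Example: runQueuedTask("deepseek-r1:70b", "Summarize this codebase...", "out/summary.md");
```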
Use cases
- Analyze a large codebase overnight and have a summary ready in the morning.
- Process a batch of data files through a local model without tying up Cursor.
- Run multiple analysis passes with different models and compare results.
- Deliver processed output to a remote staging server automatically.
NightOwl runs independently of Cursor. You do not need an active editor session. The app sits in your menu bar, manages the queue, and handles model loading and unloading for you.
Cost: $0. Local inference only. No API calls, no cloud dependencies, no usage fees.
System Requirements
- macOS Sonoma (14) or newer with Apple Silicon.
- 48GB unified memory for full 70B model access (M4 Pro, M3 Max, M4 Max, or M3 Ultra).
- 32GB unified memory for 32B models only (M1 Pro/Max, M2 Pro/Max, or newer).
- 70GB free disk space for model downloads.
- Ollama installed (free, open-source).
NightOwl runs overnight batch jobs. Even at the 70B entry spec (48GB, ~5 tok/s), a complex analysis completes in minutes, not hours. Speed matters less when the machine works while you sleep.
Who this matters to: Technical ICs who want to leverage local hardware for batch work without manual orchestration. Power users who want their machine working while they sleep.
Deep analysis for large codebases.
Optional. When delegated files exceed 100KB total, standard single-model analysis loses fidelity. The RLM Analyzer uses a two-model pipeline to maintain depth at scale.
The pipeline
1. Fast model (Qwen 2.5 32B): Reads the full file set and writes analysis code: chunking strategies, extraction patterns, and domain-specific queries tailored to the task.
2. Reasoning model (DeepSeek-R1 70B): Executes the analysis code against each chunk, performing deep reasoning with full chain-of-thought on complex sections.
The result is a structured findings JSON that surfaces what a single-pass model would miss: cross-file dependencies, subtle logic errors, architectural drift, and dormant technical debt.
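A hedged sketch of the two-step pipeline, again using Ollama's /api/generate route; the model tags are real Ollama tags, but the prompts and chunking are illustrative:

```typescript
// Sketch: fast model plans the analysis, reasoning model executes it
// per file/chunk. Both calls hit the local Ollama API.
async function generate(model: string, prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  return (await res.json()).response;
}

async function rlmAnalyze(files: { path: string; text: string }[]) {
  // Step 1: fast model designs chunking + extraction queries for this file set.
  const plan = await generate(
    "qwen2.5:32b",
    `Design chunking and extraction queries for:\n${files.map((f) => f.path).join("\n")}`,
  );
  // Step 2: reasoning model applies the plan to each file with full chain-of-thought.
  const findings: string[] = [];
  for (const f of files) {
    findings.push(await generate(
      "deepseek-r1:70b",
      `Apply this analysis plan:\n${plan}\n\nTo file ${f.path}:\n${f.text}`,
    ));
  }
  return { findings }; // structured findings JSON in the real pipeline
}
```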
When it runs
- •Automatically during phase-gate reviews when key files exceed the 100KB threshold. Ace detects this and includes analyzer instructions in the delegation.
- •Manually when you want deep analysis on a specific file or directory. Run the script directly.
Both models run locally via Ollama. No API costs. No data leaves your machine.
System Requirements
- 48GB unified memory (Apple Silicon) or 42GB+ VRAM (NVIDIA dual-GPU / workstation GPU).
- Works on macOS, Linux, and Windows wherever Ollama runs.
- 70GB free disk space for both models.
Fallback: Users with 32GB memory can swap to 32B models with minimal quality loss (within 1-3% on code/math benchmarks).
Who this matters to: Technical ICs working with large, mature codebases. Anyone whose project has grown past the point where a single model can hold the full context.
Improve the system itself.
TinyFirm is not a black box. Every agent definition, memory protocol, quality gate, and delegation rule lives in editable files inside your workspace. Specialist Hat Mode is how you improve them.
Say “wear your specialist hats.” Ace activates a mode where your team reviews and refines its own configuration: agent scopes, constraint lists, memory structures, and quality checklists. Every improvement is tested against the current project and, if validated, written back to the workspace.
What you can customize
- •Agent personality, scope, and constraint definitions.
- •Quality gate thresholds and checklist items.
- •Memory protocol structure and condensation rules.
- •Delegation protocols and reporting formats.
Why this matters: Every change you make to the workspace is inherited by all future projects created from it. A better security checklist today means better security audits on every project you start tomorrow. A refined agent definition means more reliable output across the board.
You are not locked into the defaults. You own the configuration. You improve it over time.
Who this matters to: Technical ICs who want control over how the system works, not just what it produces. Anyone who has ever wanted to tune an AI tool’s behavior and been told they cannot.
Three layers of compounding intelligence.
TinyFirm gets smarter in three distinct ways. Each layer compounds independently.
Layer 1: Per-project memory.
Within a single project, agents accumulate knowledge across every session. What was built, which patterns work, what to avoid, your preferences, your constraints. Session 47 is as informed as session 1. This is the foundation.
Layer 2: Cross-project intelligence.
The Big Brain system aggregates lessons across all your projects. Agent effectiveness, team configurations that work, architectural decisions that shipped well, anti-patterns that wasted time. Your tenth project starts with the collective intelligence of the previous nine.
Layer 3: Workspace-level improvement.
Every change you make to agent definitions, quality gates, memory protocols, or delegation rules inside the workspace is inherited by every future project you create. Better defaults compound forever. A security checklist refined in March protects every project started in April, May, and beyond.
The result: You are not just using a tool. You are building an institutional knowledge base that compounds with every project, every session, and every improvement you make. The system is better tomorrow than it is today. Not because we shipped an update. Because you used it.
Ready to see how it works for your project?
Every team is custom-generated. The discovery call is where we figure out what yours looks like.
Apply for a Discovery Call
$555/mo + $2,200 one-time setup. Cursor subscription required separately. Full pricing details at /pricing.