RAG in a company: build a knowledge search with permissions and auth

RAG (Retrieval-Augmented Generation) is simple in concept: an LLM answers a question, but before it answers it receives a small set of relevant snippets from your internal documents. Done right, you get a knowledge search that:

  • works on internal content (Confluence, Google Drive, Notion, Slack, PDFs),
  • respects permissions (who can see what),
  • is auditable (who asked what, which sources were used),
  • and reduces hallucinations by grounding answers in real sources.

Most RAG projects fail not because of the model, but because of identity, access control, ingestion and operations. Here is an implementation plan that holds up in a real company environment.

1) Define the product: RAG is a system, not a prompt

A minimal production-grade RAG includes:

  • Data connectors + refresh schedules,
  • Processing pipeline (cleanup, chunking, metadata),
  • Search indexes (vector + often keyword),
  • Authorization layer (ACLs, groups, inheritance),
  • Retriever (how you pick snippets),
  • Generator (LLM) with citation rules,
  • Observability (logs, metrics, feedback, evaluation),
  • UX: question, answer, sources, feedback, follow-ups.

If you only build “LLM + embeddings”, you will get nice demos and ugly problems: data leaks, inconsistent quality and zero debuggability.

2) Permissions: the part you cannot postpone

Without access control, RAG is either useless (people can’t trust it) or dangerous (it leaks data). Two common approaches:

A) Pre-filtering (filter before retrieval)

This is the safest option:

  1. User signs in (SSO/OIDC) → you get identity + groups/roles.
  2. You build an access filter: which documents/chunks the user can see.
  3. You retrieve only within that allowed subset.

Pro: the model never sees forbidden context. Con: you must keep ACL metadata consistent in the index.

B) Post-filtering (retrieve, then filter)

Post-filtering is easier to implement, but risky. If you retrieve top-K chunks first and filter later, you may:

  • end up with too few good chunks,
  • or accidentally send restricted content to the LLM.

If you must do it, enforce filtering before building the LLM context and add regression tests for leakage scenarios.
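A leakage scenario can be pinned down as a regression test. The sketch below is illustrative: `filter_chunks`, `build_context` and the sample documents are hypothetical stand-ins for your own post-filtering step, but the invariant is the point — restricted text must never reach the LLM context.

```python
# Sketch of a leakage regression test: the LLM context must never
# contain a chunk the user cannot see. Function names and sample
# data are hypothetical stand-ins for your own pipeline.

def filter_chunks(chunks: list[dict], allowed_doc_ids: set[str]) -> list[dict]:
    """Drop restricted chunks BEFORE the context is assembled."""
    return [c for c in chunks if c["doc_id"] in allowed_doc_ids]

def build_context(chunks: list[dict]) -> str:
    """Join the surviving chunk texts into the LLM context."""
    return "\n\n".join(c["text"] for c in chunks)

def test_no_leakage():
    retrieved = [
        {"doc_id": "public-1", "text": "Onboarding steps ..."},
        {"doc_id": "hr-salary", "text": "Salary bands ..."},  # restricted
    ]
    allowed = {"public-1"}
    context = build_context(filter_chunks(retrieved, allowed))
    assert "Salary bands" not in context
```

Run a suite of these for every access pattern you support (private docs, group docs, cross-tenant) on each release.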

How to represent ACLs in the index

A practical metadata model per chunk:

  • doc_id, source, url, title, updated_at
  • acl_users: user IDs (if per-user access)
  • acl_groups: group/role IDs
  • acl_policy: public / private / restricted
  • tenant_id (for multi-tenant setups)

Your filter becomes: tenant_id == X AND (acl_policy == public OR acl_groups overlaps user.groups OR acl_users contains user.id).
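That predicate can be written down directly. The sketch below uses the metadata field names from the model above (`tenant_id`, `acl_policy`, `acl_groups`, `acl_users`); adapt them to your index schema.

```python
# Sketch of the per-chunk access predicate described above.
# Field names follow the metadata model in this section; they are
# a suggested convention, not a fixed standard.

def chunk_visible(chunk_meta: dict, user_id: str,
                  user_groups: set[str], tenant_id: str) -> bool:
    """Return True if the user may see this chunk."""
    if chunk_meta.get("tenant_id") != tenant_id:
        return False                      # hard tenant isolation first
    if chunk_meta.get("acl_policy") == "public":
        return True
    if user_groups & set(chunk_meta.get("acl_groups", [])):
        return True                       # group overlap
    return user_id in chunk_meta.get("acl_users", [])
```

In production, express the same predicate as a metadata filter pushed into the vector store, so forbidden chunks are never retrieved at all (pre-filtering), rather than evaluating it in application code after retrieval.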

3) Authentication: integrate with SSO (OIDC) instead of creating new accounts

To avoid friction, plug into what the company already uses:

  • OIDC (Okta, Auth0, Azure AD/Entra, Keycloak) is usually the best path.
  • SAML exists in many enterprises; often bridged to OIDC.

Auth checklist:

  • Short-lived access tokens (15–60 minutes) + refresh if needed.
  • Group/role mapping in token claims.
  • Enforce tenant_id from the session (never from a client-provided field).
  • Audit trail: userId, tenantId, timestamp, query, sources used.
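A sketch of the session-building step, assuming the token's signature and expiry were already verified by your OIDC library (e.g., PyJWT's `jwt.decode` against the IdP's JWKS). Claim names like `groups` and `tenant_id` vary per identity provider and are assumptions here; the key rule is that `tenant_id` comes from verified claims, never from request parameters.

```python
# Sketch: derive the session identity from VERIFIED token claims.
# Assumes signature/expiry checks already happened upstream.
# Claim names ("groups", "tenant_id") are IdP-specific assumptions.

class AuthError(Exception):
    pass

def session_from_claims(claims: dict) -> dict:
    if "sub" not in claims:
        raise AuthError("missing subject claim")
    tenant = claims.get("tenant_id")
    if not tenant:
        # never fall back to a client-provided tenant field
        raise AuthError("missing tenant claim")
    return {
        "user_id": claims["sub"],
        "tenant_id": tenant,
        "groups": set(claims.get("groups", [])),
    }
```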

4) Ingestion: make content searchable (and maintainable)

The most common mistake is indexing “whatever you can extract”. RAG needs a consistent pipeline.

Common sources

  • Confluence / Jira / SharePoint
  • Google Drive / OneDrive
  • Notion
  • Code repositories (README, ADRs, docs)
  • PDF procedures, contracts, slide decks
  • Slack/Teams (optional; high noise)

Cleanup and normalization

  • Remove navigation, repeated footers and boilerplate.
  • Preserve headings (H1/H2/H3): they are semantic anchors.
  • Extract tables and lists properly (often the highest signal).
  • For scanned PDFs: OCR + quality scoring.

Chunking (splitting into snippets)

Typical production settings:

  • 300–900 tokens per chunk,
  • 10–20% overlap,
  • section-based chunking (header → content), not “every N characters”.

Every chunk must carry metadata: document title, path, section, URL, updated date and ACL.
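A minimal sketch of section-based chunking with overlap. For simplicity it approximates tokens with whitespace-separated words; a real pipeline would count tokens with the tokenizer of your embedding model.

```python
# Sketch: split one section (heading + content) into overlapping
# chunks. "Tokens" are approximated by words here; swap in your
# embedding model's tokenizer for real counts.

def chunk_section(heading: str, text: str,
                  max_tokens: int = 300, overlap: int = 50) -> list[dict]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunks.append({
            "section": heading,              # carried as metadata
            "text": " ".join(words[start:end]),
        })
        if end == len(words):
            break
        start = end - overlap                # ~10-20% overlap
    return chunks
```

In the full pipeline, attach the rest of the metadata (title, URL, updated date, ACL) to each returned chunk before indexing.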

5) Retrieval: get the right snippets (so the model doesn’t guess)

In practice, the best results come from hybrid search:

  • Vector search for meaning and paraphrases,
  • Keyword/BM25 for proper nouns, error codes and IDs.

Only-vector setups often fail on “ERR-4132”-style queries. Hybrid retrieval is the pragmatic fix.
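One pragmatic way to merge the two result lists is Reciprocal Rank Fusion (RRF), which combines rankings without needing comparable scores. A minimal sketch:

```python
# Sketch: merge vector and keyword (BM25) rankings with Reciprocal
# Rank Fusion. Each list is ordered best-first; k=60 is the commonly
# used smoothing constant.

def rrf_merge(vector_ids: list[str], keyword_ids: list[str],
              k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both rankings float to the top, which is exactly what you want for "ERR-4132 in the payments service"-style queries.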

Re-ranking

If you can afford it, add a re-ranker:

  1. Retrieve top 30,
  2. re-rank to top 5–8,
  3. send only that to the LLM.
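The two stages can be sketched as one function. `search` and `score` are injectable stand-ins: in practice `search` is your hybrid retriever and `score` a cross-encoder re-ranker.

```python
# Sketch of the two-stage pipeline: wide cheap retrieval, then an
# expensive re-rank, keeping only the head for the LLM. `search`
# and `score` are hypothetical stand-ins for your components.

def retrieve_for_llm(query, search, score, wide_k: int = 30, top_n: int = 5):
    candidates = search(query, wide_k)          # stage 1: top-30
    ranked = sorted(candidates,
                    key=lambda c: score(query, c["text"]),
                    reverse=True)               # stage 2: re-rank
    return ranked[:top_n]                       # only this reaches the LLM
```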

6) Answer generation: rules that protect quality

Company-grade RAG should enforce:

  • Citations: show sources and links.
  • “I don’t know” when sources are weak; ask clarifying questions.
  • Facts vs recommendations: label what comes from documents vs interpretation.
  • Short vs detailed modes for different audiences.

A proven answer format

  • TL;DR (2–4 sentences)
  • Steps / checklist
  • Sources (links + document names)
  • What to clarify (1–3 clarifying questions)
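These rules belong in the system prompt, not in hope. The template below is a hypothetical sketch of how to encode them; the wording is illustrative, not a tuned prompt.

```python
# A hypothetical system prompt encoding the rules above.
# Wording is illustrative; tune it against your evaluation set.

ANSWER_PROMPT = """You answer ONLY from the provided sources.

Rules:
- Cite every factual claim with its source id, e.g. [doc-123].
- If the sources are weak or missing, say "I don't know" and ask a clarifying question.
- Label interpretation separately from facts found in the documents.

Format:
1. TL;DR (2-4 sentences)
2. Steps / checklist
3. Sources (links + document names)
4. What to clarify (1-3 questions)
"""
```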

7) Observability: you can’t improve what you can’t see

To avoid “it sometimes works”, log:

  • query, userId, tenantId
  • retrieved chunks (doc_id + scores)
  • whether sources were shown
  • user feedback (👍/👎 + optional comment)
  • latency: retrieval vs generation
  • token usage and cost per query

Then build an offline evaluation set (30–100 real questions + expected sources) to run regression tests on chunking/retrieval changes.
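The core regression metric is simple to compute: the fraction of evaluation questions whose expected source appears in the top-k retrieved documents (recall@k). A sketch, with `retrieve` as a stand-in for your retriever:

```python
# Sketch: recall@k over an offline eval set of
# (question, expected_doc_id) pairs. `retrieve` is your retriever,
# returning doc ids best-first.

def recall_at_k(eval_set: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    hits = sum(
        1 for question, expected_doc in eval_set
        if expected_doc in retrieve(question)[:k]
    )
    return hits / len(eval_set)
```

Track this number on every chunking or retrieval change; a drop is a regression even if demo queries still look fine.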

8) Security baseline for internal data

  • PII/secrets: detection and at least tagging.
  • Encryption in transit and at rest.
  • Retention policy for queries and context logs.
  • RBAC for admin tools.
  • Guardrails: export limits, response caps, prompt injection defenses.

Prompt injection (practical)

Because documents are user-controlled, someone can embed malicious instructions. Defenses:

  • Treat retrieved text as data, not instructions.
  • Detect and flag suspicious patterns.
  • Require that claims are supported by cited sources.
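Detection can start with simple pattern flagging on retrieved chunks. The pattern list below is illustrative and must be tuned to your corpus; the point is to flag (for review or down-ranking), not to rely on it as the only defense.

```python
# Sketch: flag retrieved chunks containing common injection
# phrasings. The pattern list is illustrative, not exhaustive.
import re

SUSPICIOUS = re.compile(
    r"ignore (all|previous|the above) instructions"
    r"|disregard your system prompt"
    r"|you are now",
    re.IGNORECASE,
)

def flag_injection(chunk_text: str) -> bool:
    """True if the chunk looks like it carries instructions, not data."""
    return bool(SUSPICIOUS.search(chunk_text))
```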

9) A realistic 4-week delivery plan

Week 1: technical MVP

  • SSO/OIDC login
  • 1–2 sources (e.g., Confluence + Drive)
  • Index + basic retriever
  • UI with sources

Week 2: permissions and auditing

  • ACL propagation to chunks
  • Pre-filtering
  • Query and source logs

Week 3: quality improvements

  • Hybrid retrieval + optional re-ranking
  • Better chunking
  • Offline evaluation baseline

Week 4: operations

  • Monitoring and alerting
  • Admin panel for sources and re-indexing
  • Retention and backups

Summary

RAG succeeds in companies when you build it like a real system: SSO, permissions, auditability, quality controls and operations. The model matters, but the surrounding engineering is what makes it safe and reliable.

Read also: AI in product: how to start with a simple MVP and avoid burning budget
