Architecture Proposal — Phase 1 Discovery

Pearl Platform

Test Environment Architecture Proposal

A comprehensive walkthrough of the Pearl platform, why a fully isolated test environment is critical, and the architecture we recommend — designed for MessageDirect's technical leadership.

Version: 2.0 | Date: 6 March 2026 | Status: Phase 1, Discovery Complete
01

What Is Pearl?

Pearl is the enterprise-grade Telephone Answering Service (TAS) platform powering MessageDirect — a leading UK 24/7 virtual receptionist and contact centre business.

Core Mission

Enable call centre operators to answer phone calls on behalf of hundreds of subscribing client companies — capturing caller details, recording messages, triggering escalations via SMS/email/push, and providing clients with a self-service portal to view messages, manage rotas, and pay invoices.

📞

24/7 Contact Centre

Operators answer calls around the clock on behalf of client companies using dynamic answering scripts

💬

Message Handling

Capture caller details, record messages, and escalate to the right contact via SMS, email, or push

🌐

Client Self-Service

108+ portal pages for clients to view messages, manage rotas, search callers, and handle billing

💳

Billing & Payments

Automated billing lifecycle — usage tracking, invoice generation, card & DD payments, Xero accounting sync

🤖

AI & Voice

AI chatbots, voice assistants (ElevenLabs/Twilio), speech analytics, and GPT-powered QC scoring

📊

Multi-Brand

Operates MessageDirect, JAM, Answer.co.uk, Argyll, VirtuallyThere — all from one platform

The Core Flow: How a Call Becomes a Message

sequenceDiagram
    participant Caller
    participant Genesys as Genesys Cloud CX
    participant Pearl as Pearl Web App
    participant Op as Operator Browser
    participant DB as SQL MI (17 DBs)
    participant Esc as Escalation Engine
    participant Client as Client (SMS/Email)
    Caller->>Genesys: Dials client's number
    Genesys->>Pearl: DDI lookup (GET /exposed/genesys_ddilookupv2.aspx)
    Pearl->>DB: Lookup PhysicalDDIs → Company
    Pearl-->>Genesys: JSON routing instructions
    Genesys->>Op: Route call to operator
    Pearl->>Op: Screen pop via Totem (long-poll)
    Note over Op: Operator sees greeting script,<br/>data fields, special instructions
    Op->>Pearl: Save message (caller, body, "call for")
    Pearl->>DB: Write to Messages, Callers, ScreenInits
    Pearl->>Esc: PutIntoDispatchQueue()
    Esc->>DB: Resolve rota → contacts
    Esc->>Client: Deliver via SMS/Email/Push
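The screen pop step depends on Totem's long-poll model (/register, /poll, /notify). The following is a minimal in-memory sketch of that idea in Python, not the actual Totem implementation, which is a .NET 3.5 socket server. The class name, method names, and the topic strings are hypothetical.

```python
import queue
import threading


class TotemSketch:
    """Hypothetical in-memory stand-in for a register/poll/notify service."""

    def __init__(self):
        self._lock = threading.Lock()
        # session_id -> (subscribed topics, queue of pending scripts)
        self._sessions = {}

    def register(self, session_id, topics):
        # A browser session subscribes to the topics it cares about.
        with self._lock:
            self._sessions[session_id] = (set(topics), queue.Queue())

    def notify(self, topic, script):
        # A state change fans out to every session subscribed to the topic.
        with self._lock:
            targets = [q for topics, q in self._sessions.values() if topic in topics]
        for q in targets:
            q.put(script)
        return len(targets)

    def poll(self, session_id, timeout=30.0):
        # Long-poll: block until a script arrives or the timeout elapses.
        with self._lock:
            _, pending = self._sessions[session_id]
        try:
            return pending.get(timeout=timeout)
        except queue.Empty:
            return None  # the browser would simply poll again
```

Because all state lives in memory, a restart of such a service drops every subscription, which is one reason the test environment needs its own Totem instance rather than sharing production's.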
02

Platform at Scale

A mature, organically-grown platform handling significant operational complexity.

489+ Database Tables (across 17 databases on SQL MI)
321+ Service Endpoints (/exposed/ internal API pages)
278+ Background Jobs (/utilityservices/ job pages)
304+ Admin Tools (/Tools/ admin pages)
108+ Client Portal Pages (/usercontrolpanel/ self-service)
22 External Integrations (from Stripe to Genesys to AI)
8 Application Components (web apps, workers, services)
24 Hours / 7 Days (platform never sleeps)
03

Current Production Architecture

The production layout powering 24/7 operations today.

flowchart TB
    subgraph Internet["Internet / Users"]
        Operators["Call Centre Operators<br/>(24/7)"]
        Clients["Client Portal Users"]
        CTI["Genesys Cloud CX<br/>(Telephony)"]
    end
    subgraph AzureProd["Azure Production Environment (UK South)"]
        subgraph WebTier["Web Tier — IIS Servers"]
            Pearl3["pearl3.private.pearl<br/>(IIS Web Server 1)"]
            Pearl4["pearl4.private.pearl<br/>(IIS Web Server 2)"]
        end
        subgraph InternalSvc["Internal Services Layer"]
            PearlInternal["pearlinternal.private.pearl<br/>pearl-webservices-azure<br/>+ utility-server"]
            Memcached["memcached.private.pearl:11211<br/>(Distributed Cache)"]
        end
        subgraph Workers["Background Workers"]
            QueueProc["queue-processor-azure<br/>(Job Queue HTTP Executor)"]
            SystemCheck["system-checker<br/>(Health Monitor + Alerting)"]
            AISpooler["ai-spooler<br/>(6-Lane AI QC Spooler)"]
            Totem["totem-2-cloud-nosql<br/>(Long-Poll Notification Socket)"]
        end
        subgraph DBTier["Database Tier"]
            SQLMI["Azure SQL Managed Instance<br/>(Business Critical)<br/>17 databases • 489+ tables<br/>2 SQL accounts (pearl, utility)"]
        end
    end
    subgraph ExtSvc["External Services (22 Integrations)"]
        direction LR
        S3["Amazon S3"]
        Solr["Apache Solr<br/>(5 Search Cores)"]
        Stripe["Stripe"]
        GC["GoCardless"]
        SagePay["SagePay"]
        Mailgun["Mailgun"]
        SMSGw["SMS Gateways<br/>(MediaBurst, MessageBird, ClickSend)"]
        Zoho["Zoho Desk/CRM"]
        Twilio["Twilio"]
        AzureAI["Azure OpenAI"]
        EL["ElevenLabs"]
        Xero["Xero Accounting"]
        BQ["BigQuery"]
    end
    Operators --> Pearl3
    Operators --> Pearl4
    Clients --> Pearl3
    CTI --> PearlInternal
    Pearl3 --> SQLMI
    Pearl4 --> SQLMI
    Pearl3 <--> Memcached
    Pearl4 <--> Memcached
    Pearl3 --> PearlInternal
    PearlInternal --> SQLMI
    PearlInternal --> S3
    PearlInternal --> Solr
    QueueProc --> SQLMI
    QueueProc --> PearlInternal
    SystemCheck --> SQLMI
    AISpooler --> PearlInternal
    Totem <--> Pearl3
    PearlInternal --> Stripe
    PearlInternal --> GC
    PearlInternal --> Mailgun
    PearlInternal --> SMSGw
    PearlInternal --> Zoho
    PearlInternal --> AzureAI
    PearlInternal --> Twilio
    PearlInternal --> EL
    PearlInternal --> Xero
    PearlInternal --> BQ
04

Technology Stack

The confirmed technology landscape powering every layer of Pearl.

Runtime: .NET Framework 4.8
Languages: VB.NET (~95%), C#
Web Framework: ASP.NET Web Forms ("Web Site" model)
UI Library: Telerik RadControls for ASP.NET AJAX
Database: Azure SQL MI (Business Critical)
Caching: Memcached (BeIT client, port 11211)
Session: SQL Server Session (5-hour timeout)
Authentication: ASP.NET Forms Auth (cookie) + 2FA/TOTP
Search: Apache Solr (5 cores)
Azure Region: UK South (London)

ASP.NET "Web Site" Compilation Model

This is not a Web Application project: the folder structure is the project. ASP.NET compiles App_Code/ dynamically at runtime, so source .vb and .aspx files can be deployed directly to the server. For production, the site is precompiled with aspnet_compiler.exe.

05

Application Components — Deep Dive

8 distinct components, each with unique runtime characteristics.

Component | Type | Framework | Role | Database Access
pearl-azure | ASP.NET Web Forms | .NET 4.8 | Main UI: operators, admins, client portal. 321+ exposed endpoints, 304+ admin tools, 108+ portal pages | All 17 databases
pearl-webservices-azure | ASP.NET Web App | .NET 4.8 | Background services: 278+ utility job pages, billing, stats, search indexing, AI QC endpoints, job scheduler | All 17 databases
utility-server | ASP.NET Web Forms (3 sub-apps) | .NET 4.8 | PCI-isolated payments portal (Stripe), Xero accounting sync, multi-brand reporting | PearlBilling, PearlData, PearlOperations
queue-processor-azure | WinForms (.exe) | .NET 4.8 | Job queue worker: claims rows from Process_JobQueue, executes HTTP calls with turn-based coordination | PearlQueues, PearlData, PearlBilling, PearlLog
system-checker | WinForms (.exe) | .NET 4.8 | Health monitoring: ICMP ping, TCP, HTTP probe, SQL query, disk space checks with transition-based alerts | Checking, PearlOperations, PearlData
ai-spooler | WinForms (.exe) | .NET 4.8.1 | AI QC spooler: 6-lane conveyor belt for speech analytics, round-robin distribution, 55s backoff on empty | Via HTTP to pearl-webservices
totem-2-cloud-nosql | Console App (Socket Server) | .NET 3.5 | Real-time browser notifications via long-poll. /register, /poll, /notify protocol. All state in-memory | None (in-memory only)
alpha-code-generator | WinForms (.exe) | .NET 4.8 | Batch generator for unique 9-char alphanumeric codes (base-31 encoding) | FreeAlphaCodes table
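To make the "9-char alphanumeric codes (base-31 encoding)" entry concrete, here is a small Python sketch of fixed-width base-31 encoding. The 31-symbol alphabet below is an assumption (digits 2-9 plus letters without I, L, and O); the alphabet the real alpha-code-generator uses is not documented here.

```python
# Assumed alphabet: 8 digits + 23 letters = 31 symbols (hypothetical choice).
ALPHABET = "23456789ABCDEFGHJKMNPQRSTUVWXYZ"
BASE = len(ALPHABET)
CODE_LENGTH = 9


def encode(n: int) -> str:
    """Encode an integer as a fixed-width, 9-character base-31 code."""
    if not 0 <= n < BASE ** CODE_LENGTH:
        raise ValueError("value out of range for a 9-character code")
    chars = []
    for _ in range(CODE_LENGTH):
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def decode(code: str) -> int:
    """Invert encode(): turn a base-31 code back into its integer."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

With nine symbols of base 31, the code space holds 31^9 values (roughly 2.6 × 10^13), so a batch generator can draw distinct integers and encode them without running short.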

Core Business Logic Modules (App_Code)

The backbone of Pearl's logic — VB.NET classes auto-compiled at runtime.

⚙️

PearlOperations.vb

~557 KB — Screen XML, message processing, DDI management, screen pop, real-time signalling

Core Engine
🖥️

PearlControls.vb

~338 KB — Dynamic UI generation from XML config. Renders answering screens, data grids, forms

UI Renderer
👤

PearlUserManagement.vb

~220 KB — User CRUD, login, permissions, shift tracking, password management

Identity
🏢

PearlCompanyManagement.vb

~153 KB — Client onboarding, company config, setup wizards

Clients
🔔

PearlEscalation.vb

~97 KB — Escalation rules, notification routing, on-call rota resolution

Dispatch
💳

PearlPayments.vb

~90 KB — Stripe, SagePay, GoCardless — gateway integrations & payment processing

Billing
06

How Data Flows Through Pearl

Five interconnected data flows that power the entire platform.

flowchart LR
    subgraph CallFlow["1. Call-to-Message Flow"]
        direction TB
        CF1["Genesys CTI Event"] --> CF2["DDI Lookup"]
        CF2 --> CF3["Screen Pop via Totem"]
        CF3 --> CF4["Operator Captures Message"]
        CF4 --> CF5["Save to PearlData"]
        CF5 --> CF6["Escalation Queue"]
        CF6 --> CF7["SMS / Email / Push"]
    end
    subgraph BillingFlow["2. Billing Flow"]
        direction TB
        BF1["Message Saved"] --> BF2["BillItem Created"]
        BF2 --> BF3["Rate Calculation"]
        BF3 --> BF4["Invoice Generation"]
        BF4 --> BF5["Stripe / GoCardless / SagePay"]
        BF5 --> BF6["Xero Sync"]
    end
    subgraph QueueFlow["3. Background Job Flow"]
        direction TB
        QF1["Job Scheduler<br/>(5-min cycle)"] --> QF2["Process_JobQueue"]
        QF2 --> QF3["Queue Processor Claims"]
        QF3 --> QF4["HTTP Execution"]
        QF4 --> QF5["Result Logged"]
    end
    subgraph AIFlow["4. AI QC Flow"]
        direction TB
        AF1["Message Created"] --> AF2["AI Spooler Fetches"]
        AF2 --> AF3["6-Lane Round-Robin"]
        AF3 --> AF4["Speech Analytics"]
        AF4 --> AF5["GPT-4o Scoring"]
    end
    subgraph RTFlow["5. Real-Time Flow"]
        direction TB
        RF1["State Change"] --> RF2["/notify to Totem"]
        RF2 --> RF3["Match Subscribed Sessions"]
        RF3 --> RF4["Return Script to Browser"]
    end
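The background job flow (queue, claim, execute, log) can be illustrated with a small claim-and-complete sketch. This uses an in-memory SQLite table purely for illustration; the real Process_JobQueue schema and the turn-based coordination logic are not documented here, so the table name, columns, and URLs below are all hypothetical.

```python
import sqlite3


def make_queue(urls):
    """Create a throwaway in-memory queue table seeded with pending jobs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_queue ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', claimed_by TEXT, result TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO job_queue (url) VALUES (?)", [(u,) for u in urls])
    return conn


def claim_batch(conn, worker_id, batch_size=5):
    """Mark up to batch_size pending jobs as claimed and return (id, url) rows."""
    with conn:  # commit the claims together
        rows = conn.execute(
            "SELECT id, url FROM job_queue WHERE status = 'pending' "
            "ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        for job_id, _ in rows:
            conn.execute(
                "UPDATE job_queue SET status = 'claimed', claimed_by = ? "
                "WHERE id = ? AND status = 'pending'",
                (worker_id, job_id),
            )
    return rows


def complete(conn, job_id, result):
    """Record the outcome of the HTTP execution for a claimed job."""
    with conn:
        conn.execute(
            "UPDATE job_queue SET status = 'done', result = ? WHERE id = ?",
            (result, job_id),
        )
```

The point of the claim step is that a second worker asking for work only sees rows that are still pending, so the same job page is not executed twice.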
07

17 Databases at a Glance

Azure SQL Managed Instance (Business Critical) — the data backbone.

PearlData 165 tables — Core ops data
PearlUsers 79 tables — Users & companies
PearlBilling 76 tables — Invoices & payments
PearlQueues 39 tables — Job queues & dispatch
PearlLog 30 tables — Audit & access logs
PearlSwitch 30 tables — DDI & call routing
PearlAnalysis 23 tables — QC & text analysis
PearlArchive 22 tables — Historic archival
PearlOperations 22 tables — Screen setups & config
PearlSearch 3 tables — Solr sync tokens
SMSBroadcast SMS delivery spool
Messages Message storage (legacy)
MSGView Message viewing portal
LookupDBs Reference & postcodes
ASPNET Session state
ASPStateInMemory Legacy OLTP (disabled)
Checking Health check definitions

Database Access Pattern

Two SQL accounts: pearl (main apps — web & workers) and utility (utility-server & system-checker). Cross-database queries use 3-part naming. The ConfigStrings table in PearlOperations holds all connection strings, API keys, and feature flags — the central configuration hub.

08

22 External Integrations

Every external dependency Pearl relies on — from telephony to AI.

☎️ Telephony & Voice

Genesys Cloud CX: Primary CTI (DDI routing, call stats, screen pops, speech analytics)
Twilio: Programmable voice for AI assistants
ElevenLabs: Text-to-speech / conversational AI

💳 Payments

Stripe: Primary card processor (Checkout + auto-pay)
GoCardless: Direct debit collections
SagePay/Opayo: PAYG card payments (Answer.co.uk)

📨 Communications

Mailgun: Transactional email delivery
MediaBurst (Route 21): SMS provider with failover
MessageBird (Route 22): SMS provider
ClickSend (Route 23): SMS provider

🧠 AI & Analytics

Azure OpenAI: GPT-4o-mini for QC scoring
Genesys Speech Analytics: Transcript + sentiment
BigQuery: Analytics data export

🔧 Business Tools

Xero: Accounting (invoice & payment sync)
Zoho Desk + CRM: Support & customer management
Amazon S3: Backups, recordings, AI QA archives
Apache Solr: Full-text search (5 cores)
09

The Pain Points

A complex, mission-critical platform with no isolated test environment. Every change is a risk to the 24/7 production service.

⚠️

No Test Isolation

All development and testing happens against or very near production. Every deploy risks the live 24/7 service that operators and clients depend on around the clock.

🔗

22 Live Integrations at Risk

A test against the wrong config could trigger real Stripe charges, send SMS to real customers, or disrupt live Genesys call routing. No safety net exists.

🔄

No Repeatable Regression

Cannot wipe and rebuild a clean test state. No way to validate that a change doesn't break any of the 489+ tables, 321+ endpoints, or 278+ background jobs.

🔒

PII Exposure & GDPR Risk

Any test data access risks exposing real customer PII — names, phone numbers, billing details, message content. No masking or anonymisation layer exists.

🏗️

Legacy Architecture Constraints

.NET Framework 4.8 with WinForms workers, raw sockets (.NET 3.5 Totem), and hardcoded IPs — not cloud-native, cannot use modern PaaS services without refactoring.

📋

No Release Process

Deployments are robocopy-based file syncs with no approval gates, no rollback mechanism, no audit trail. Manual and error-prone.

The Bottom Line

Every code change, database migration, or configuration update is deployed directly to production with no safety net. For a 24/7 contact centre handling calls for hundreds of client companies, this is an unacceptable operational risk that must be resolved.

10

Client's Request for Proposal

MessageDirect issued an RFP to design and deliver a secure, fully isolated, repeatable test environment. The RFP can only be formally responded to once Phase 1 (Discovery) is finalised.

1

Safe Releases

Develop, deploy, and validate changes without any risk to production

2

Repeatability

Wipe and rebuild the test environment and reload test data on demand

3

Full Isolation

Complete isolation from production systems and data; private-only connectivity

4

Test Data

Clean, anonymised dataset (no production PII) with weekly refresh procedure

5

Scalability

Spin up multiple test environments per feature branch with minimal overhead

💰

Budget

£25,000 (GBP) total cap, covering discovery and implementation

📅

Timeline

Test environment ready by early to mid-May 2026

🔐

Compliance

ISO 27001 aligned + GDPR data controls

📄

Deliverables

IaC, CI/CD, runbooks, SOPs, handover walkthrough

11

Architecture Options Evaluated

We evaluated 4 architecture options against Pearl's specific constraints.

★ SELECTED

3-VM Split + Azure SQL MI (General Purpose)

VM1 — Web Tier: IIS (pearl-azure + pearl-webservices + utility-server) + Memcached + Solr

VM2 — Worker Tier: queue-processor + system-checker + ai-spooler + totem-2-cloud-nosql

VM3 — Build/Dev: GitHub Actions runner + MSBuild 17 + .NET 4.8 SDK + restore tools

DB: Azure SQL MI (General Purpose, 4 vCores) — all 17 databases masked

Estimated implementation: 425h (incl. 20% buffer)

✅ Pros
  • Mirrors production layout — reliable test results
  • Zero code changes needed
  • Clear role separation (web vs worker vs build)
  • Familiar ops model (Windows Server + IIS)
  • Maps cleanly to IaC (Terraform)
⚠️ Trade-offs
  • Highest VM count of viable options
  • Moderate running cost (~£700/mo at 24/7)

2-VM + Azure Functions Hybrid

VM1 = Web + Memcached

VM2 = GitHub runner

Workers converted to Azure Functions

DB = Azure SQL MI (General Purpose)

Estimated implementation: 665h (incl. 20% buffer)

✅ Pros
  • Lower VM cost
  • Functions scale automatically
❌ Why Not
  • Requires refactoring WinForms workers to Functions
  • queue-processor uses in-process timer/state with turn-based coordination
  • ai-spooler uses 6-thread conveyor model with round-robin
  • totem uses raw .NET 3.5 sockets — incompatible with Functions
  • Significant engineering effort — explicitly out of RFP scope

Single VM — Everything on One Box

All components on one VM: web + workers + runner + Memcached + Solr

Totem as Azure Function (requires refactor)

Estimated implementation: 390h (incl. 20% buffer)

✅ Pros
  • Cheapest option (~£400/mo)
❌ Why Not
  • No isolation between web/worker/build processes
  • Resource contention — build jobs starve web tier
  • Doesn't replicate production topology
  • Test results unreliable for production prediction

4-VM Full Separation

VM1 = Web only

VM2 = Workers only

VM3 = Totem + Memcached + Solr

VM4 = Build runner

Estimated implementation: 460h (incl. 20% buffer)

✅ Pros
  • Maximum isolation per role
❌ Why Not
  • Over-engineered for a test environment
  • Highest cost (~£900/mo) — exceeds budget tolerance
  • Extra VM provides marginal benefit for testing
12

Weighted Comparison

5 criteria. 4 options. One clear winner.

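Criteria and weights:
  • Production Fidelity (25%)
  • Cost Efficiency (25%)
  • Time to Deliver (20%)
  • Operational Simplicity (15%)
  • Scalability (15%)

[Chart: each criterion scored for Options A, B, C, and D; per-option scores not reproduced here]

Overall ranking as displayed:
  1. Option A: 3-VM (★ WINNER)
  2. Option D: 4-VM
  3. Option C: Single
  4. Option B: Hybrid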
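For readers who want to re-run the comparison, a weighted total is just the sum of each criterion score multiplied by its weight. The sketch below uses the five weights stated above; the 0-10 scoring scale and the function names are assumptions, and no actual option scores are implied.

```python
# Weights as stated in the comparison (they sum to 1.0).
WEIGHTS = {
    "production_fidelity": 0.25,
    "cost_efficiency": 0.25,
    "time_to_deliver": 0.20,
    "operational_simplicity": 0.15,
    "scalability": 0.15,
}


def weighted_score(scores):
    """Combine per-criterion scores (assumed 0-10 scale) into one total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank(options):
    """Return option names ordered from highest to lowest weighted score."""
    return sorted(options, key=lambda name: weighted_score(options[name]), reverse=True)
```

Because fidelity and cost together carry half the weight, an option that does well on both is hard to beat on the remaining three criteria alone.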
13

Why 3-VM Split Is the Right Answer

The architecture is dictated by Pearl's actual runtime constraints.

1

No Code Changes Required

WinForms workers (queue-processor, system-checker, ai-spooler) are architecturally bound to the Windows desktop runtime. Totem uses raw .NET 3.5 sockets. Converting to Azure Functions would be a major rewrite — explicitly out of the RFP scope.

2

Mirrors Production Topology

3-VM layout replicates the actual production separation: web tier (IIS), internal services tier (workers), and a dedicated build server. Test results reliably predict production behaviour.

3

Fastest Path to Delivery

Deploy existing compiled binaries via robocopy — the current deployment method. No new toolchain, no recompilation model, no replatforming. Ship in weeks, not months.

4

Operationally Familiar

Windows Server 2022 + IIS + Windows Services. The team already knows how to operate, troubleshoot, and deploy this stack. Zero learning curve.

5

Budget Appropriate

~£600-700/month operational cost fits comfortably within the budget. Business-hours auto-shutdown drops this to ~£550-650/month. The majority of the £25k goes to implementation, not infrastructure.

6

IaC-Ready for Repeatability

3 VMs + SQL MI + networking maps cleanly to Terraform modules. The entire environment can be torn down and rebuilt from code, meeting the RFP's repeatability requirement.

14

Target Architecture

The complete test environment design — fully isolated from production.

flowchart TB
    subgraph DevAccess["Developer / QA Access"]
        Dev["Developer Workstation"]
        QA["QA Tester"]
    end
    subgraph GitHub["GitHub"]
        Repo["GitHub Repository"]
        Actions["GitHub Actions CI/CD"]
    end
    subgraph AzureTest["Azure Test Subscription (UK South)"]
        subgraph HubVNet["Hub VNet"]
            Bastion["Azure Bastion<br/>(Secure RDP only)"]
            FW["Azure Firewall<br/>(Approved outbound only)"]
        end
        subgraph SpokeVNet["Test Spoke VNet"]
            VM1["VM1 — Web Tier<br/>D4s v5 • 4 vCPU • 16 GB<br/>IIS + Memcached + Solr"]
            VM2["VM2 — Worker Tier<br/>D2s v5 • 2 vCPU • 8 GB<br/>Queue Proc + System Check<br/>AI Spooler + Totem"]
            VM3["VM3 — Build / Restore Tier<br/>D2s v5 • 2 vCPU • 8 GB<br/>GitHub Runner + MSBuild<br/>Restore + Masking Tools"]
            TestSQL["Test SQL MI (GP, 4 vCores)<br/>17 masked databases"]
            Blob["Azure Blob Storage<br/>(Prod backup staging)"]
            KV["Azure Key Vault"]
        end
    end
    subgraph ProdRO["Production (Read Only)"]
        ProdSQL["Prod SQL MI<br/>(Weekly backup source)"]
    end
    subgraph Sandboxes["Approved Sandboxes"]
        GS["Genesys Sandbox"]
        ST["Stripe Test"]
        GC["GoCardless Sandbox"]
        MG["Mailgun Sandbox"]
        S3T["S3 Test Bucket"]
    end
    Dev -->|"Bastion RDP"| Bastion
    QA -->|"Bastion RDP"| Bastion
    Bastion --> VM1 & VM2 & VM3
    Repo --> Actions
    Actions -->|"Self-hosted runner"| VM3
    VM3 -->|"Deploy"| VM1 & VM2
    ProdSQL -->|"Weekly .bak to Blob"| Blob
    Blob -->|"Restore + mask"| VM3
    VM3 -->|"Restore to"| TestSQL
    VM1 --> TestSQL
    VM2 --> TestSQL
    KV --> VM1 & VM2 & VM3
    VM1 -. outbound via firewall .-> FW
    VM2 -. outbound via firewall .-> FW
    FW -. controlled egress .-> GS & ST & GC & MG & S3T
    VM2 -->|"HTTP jobs"| VM1
    VM1 -. "Totem notify" .-> VM2

Portable diagram asset: target-architecture.png

How This Architecture Works End to End

The design uses a private hub-and-spoke Azure layout so administrator access, application workload, and outbound internet traffic are controlled separately. Azure Bastion is the only RDP entry point, Azure Firewall is the single outbound checkpoint, and the spoke VNet hosts the actual Pearl workload across VM1 for IIS and local cache/search, VM2 for background workers, and VM3 for build, restore, and masking automation.

The single test SQL Managed Instance stores all 17 masked databases used by the environment. Production never connects directly to the test estate; it only places weekly backup files into Blob Storage, and VM3 restores, masks, and validates those backups before VM1 and VM2 use them. Azure Key Vault keeps the environment secrets out of the servers, and every external dependency is redirected to sandboxes such as Genesys, Stripe, GoCardless, Mailgun, and the test S3 bucket so the platform behaves like production without touching live customer data, live payments, or live telephony.

15

Phase 1 — Discovery & Planning

Phase 1 is the foundation. The RFP response to the client cannot be submitted until Phase 1 is decided and finalised. This is where we confirm everything about the current system, size the target, and commit to the plan.

🔍

Review Production VM Setup

Confirm current IIS configuration, server roles, installed components, Windows features, and service accounts on pearl3, pearl4, pearlinternal

🗄️

Review SQL Server & Backup Size

Measure actual database sizes for all 17 databases. Confirm Business Critical tier specifics. Estimate .bak sizes for backup/restore pipeline

📦

Review Runtime Dependencies

Catalog all .NET Framework versions (.NET 4.8, 4.8.1, 3.5), Telerik licence requirements, NuGet packages, Bin/ DLLs, and third-party assemblies

⚙️

Identify Environment Configs

Map all ConfigStrings entries, web.config connection strings, hardcoded IPs (10.0.0.12, 10.0.1.44), hostnames, and file paths that need repointing

🔒

Define Network & Security

Design hub-spoke topology, subnet addressing (10.1.x.x hub, 10.2.x.x spoke), NSG rules, Azure Bastion access, firewall egress whitelist

☁️

Confirm Azure Sizing

Finalise VM SKUs, SQL MI tier and vCores, storage requirements, region (UK South). This sizing recommendation drives the cost model.

Phase 1 Deliverables

📋

Architecture Diagram

Current production setup documented with all components, connections, and dependencies mapped

📐

Azure Sizing Recommendation

Finalised SKUs, vCores, storage tiers — the basis for the cost model and RFP response

⚠️

Risk Assessment Summary

All identified risks with likelihood, impact, and proposed mitigations

16

Azure Sizing Recommendation

The sizing recommendation is the key Phase 1 output — it determines the cost model and drives the RFP response to the client.

VM1 — Web Tier

D4s v5
vCPUs: 4 | RAM: 16 GB | Disk: 128 GB Premium SSD (P10) | OS: Windows Server 2022 Datacenter
Why This Size?
  • IIS hosts 3 web applications — pearl-azure (321+ endpoints + 304+ admin tools + 108+ portal pages), pearl-webservices-azure (278+ job pages), and utility-server (3 sub-apps)
  • Memcached requires ~2-4 GB RAM — distributed cache serving all web requests
  • Apache Solr requires ~1-2 GB RAM — 5 search cores (messageanalytics, callers, faqs, elements, search)
  • ASP.NET Web Forms JIT compilation — first-request compilation of App_Code/ modules (PearlOperations.vb is 557 KB alone) is CPU-intensive
  • 4 vCPUs provide headroom for concurrent IIS requests, Solr indexing, and cache operations

VM2 — Worker Tier

D2s v5
vCPUs: 2 | RAM: 8 GB | Disk: 64 GB Premium SSD (P6) | OS: Windows Server 2022 + .NET 3.5 Feature
Why This Size?
  • queue-processor — timer-based poll loop, claims batches from Process_JobQueue, executes HTTP calls. Low CPU, moderate memory
  • ai-spooler — 6 concurrent worker threads + fetcher thread. Each thread holds one HTTP connection. Moderate parallel I/O
  • system-checker — 10-second timer loop running health checks (ICMP, TCP, HTTP, SQL). Low resource usage
  • totem-2-cloud-nosql — .NET 3.5 socket server handles long-poll connections. In-memory state only. Needs .NET 3.5 Framework feature enabled
  • 2 vCPUs sufficient — workers are I/O-bound (HTTP calls, SQL queries), not CPU-bound
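The ai-spooler's lane model (6 lanes, round-robin distribution, 55-second backoff when nothing is fetched) can be sketched in a few lines of Python. This is an illustration of the scheduling idea only; the function names are hypothetical, and the assumption that a non-empty fetch is followed immediately by another fetch is ours.

```python
from itertools import cycle

LANES = 6                    # worker lanes, as described above
EMPTY_BACKOFF_SECONDS = 55   # wait applied when a fetch returns nothing


def distribute(items, lanes=LANES):
    """Deal fetched items across the lanes in round-robin order."""
    assignment = [[] for _ in range(lanes)]
    for lane, item in zip(cycle(range(lanes)), items):
        assignment[lane].append(item)
    return assignment


def next_delay(fetched_count):
    """Back off after an empty fetch; otherwise fetch again straight away (assumed)."""
    return EMPTY_BACKOFF_SECONDS if fetched_count == 0 else 0
```

Each lane holds at most one HTTP request in flight, so the work is dominated by waiting on I/O, which is consistent with sizing VM2 at 2 vCPUs.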

VM3 — Build/Dev

D2s v5
vCPUs: 2 | RAM: 8 GB | Disk: 128 GB Premium SSD (P10) | OS: Windows Server 2022 Datacenter
Why This Size?
  • GitHub Actions self-hosted runner — runs CI/CD workflows triggered by repo events
  • MSBuild 17 + .NET 4.8 SDK — compiles all 7 components plus aspnet_compiler.exe precompilation
  • PowerShell restore tooling — downloads .bak files from Blob, runs RESTORE DATABASE commands, executes masking scripts
  • 128 GB disk — needs capacity for .bak downloads (all 17 databases), build artefacts, and deployment staging
  • 2 vCPUs adequate — builds run sequentially (not parallel), restore is I/O-bound

Test SQL Managed Instance

General Purpose
vCores: 4 | Storage: Determined by actual DB sizes (Phase 1 measurement) | Tier: General Purpose (not Business Critical) | Compute: PAYG
Why General Purpose (Not Business Critical)?
  • Production uses Business Critical for HA (Always On, low latency) — test doesn't need this
  • General Purpose costs ~60% less than Business Critical for equivalent vCores
  • 4 vCores handles functional testing workload (not performance testing)
  • Supports all 17 databases with cross-database queries (3-part naming)
  • Storage tier sized after Phase 1 measurement of actual production DB sizes

Supporting Resources

Resource | SKU / Config | Justification
Azure Blob Storage | Hot tier, LRS, ~500 GB | Weekly backup staging: 4 weekly copies of all 17 databases with 28-day retention
Azure Bastion | Standard SKU | Secure RDP to all VMs: no public IPs, no VPN needed. Audit-logged access
Azure Firewall | Standard SKU | Egress filtering: allowlist-only outbound to sandbox endpoints. Prevents accidental production contact
Azure Key Vault | Standard | All connection strings, API keys, secrets. Managed identity access. Versioned secret rotation
Log Analytics | Per-GB ingestion | Centralised audit logging: VM diagnostics, SQL MI audit, access records for ISO 27001 alignment
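The firewall's allowlist-only egress rule comes down to a default-deny check: a destination is reachable only if it matches an approved entry. The Python sketch below shows that logic in isolation. It is not Azure Firewall configuration, and the hostnames are placeholder examples, not the real sandbox endpoints.

```python
# Placeholder allowlist entries (hypothetical, for illustration only).
EGRESS_ALLOWLIST = ("sandbox.payments.example", "sandbox.telephony.example")


def is_allowed(host, allowlist=EGRESS_ALLOWLIST):
    """Default deny: allow a host only if it equals or is a subdomain of an entry."""
    host = host.lower().rstrip(".")
    return any(host == entry or host.endswith("." + entry) for entry in allowlist)
```

Matching on a leading dot avoids the classic suffix mistake where a lookalike such as evilsandbox.payments.example would pass a plain endswith check.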
17

Database Backup & Restore Strategy

Dual-mode data pipeline: Testing Mode for regression and Debug Mode for real-world investigation.

Source: RFP Section 5.2

CLEAN & DETERMINISTIC

Testing Mode — Regression Testing

The database is wiped and reloaded with a known synthetic seed dataset, ensuring repeatable, deterministic regression test results every time.

  1. DROP all user databases on Test SQL MI
  2. CREATE fresh databases from schema-only scripts (versioned in Git)
  3. EXECUTE seed data scripts — synthetic companies, operators, callers, DDIs, messages
  4. CLEAR Memcached + Solr indexes
  5. REBUILD Solr indexes from seed data
  6. VALIDATE + notify — "Testing Mode ready"
✅ Key Characteristics
  • Deterministic — Same data every time for reliable assertions
  • Fast — Schema + seed scripts in minutes (no large .bak downloads)
  • Zero PII risk — All data is synthetic, no masking needed
  • Versioned — Seed scripts in Git alongside application code

Debug Mode — Anonymised Production Data

Loads anonymised production data to investigate real-world issues that cannot be reproduced with synthetic data. Weekly automated pipeline.

sequenceDiagram
    participant ProdMI as Prod SQL MI
    participant Blob as Azure Blob Storage
    participant VM3 as VM3 (Restore Tool)
    participant TestMI as Test SQL MI
    Note over ProdMI: Weekly backup (Sat 02:00 UTC)
    ProdMI->>Blob: BACKUP DATABASE TO URL<br/>17 databases → .bak files
    Blob->>Blob: Encrypted at rest (SSE)<br/>Retention: 4 weekly copies
    Note over VM3: Saturday 06:00 UTC
    VM3->>Blob: Download .bak files via SAS token
    VM3->>TestMI: RESTORE DATABASE<br/>(all 17 databases sequentially)
    Note over VM3: Post-restore masking
    VM3->>TestMI: Execute T-SQL masking scripts<br/>(PII anonymisation per database)
    VM3->>TestMI: Repoint ConfigStrings to test endpoints
    VM3->>TestMI: Validate row counts + key queries
    VM3->>VM3: Log + email notification
✅ Key Characteristics
  • Real-world data shape — Actual distributions, edge cases
  • Anonymised — All PII masked via T-SQL scripts
  • Weekly refresh — Automated Saturday pipeline
  • Approval-gated — Requires team lead to switch modes
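The retention rule for the weekly backups (4 copies, 28 days) can be expressed as a small pruning function. This Python sketch only decides which backup timestamps fall outside the policy; in practice a Blob Storage lifecycle rule might do the same job, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backups_to_delete(backup_times, now, keep=4, max_age_days=28):
    """Return timestamps to prune: anything beyond the newest `keep` copies
    or older than `max_age_days`."""
    newest_first = sorted(backup_times, reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    retained = {t for t in newest_first[:keep] if t >= cutoff}
    return [t for t in newest_first if t not in retained]
```

For example, with six Saturday 02:00 UTC backups on record and the restore job running at 06:00 UTC on the newest Saturday, the two oldest copies are returned for deletion and the four most recent are kept.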

Restore Tool — Mode Selection

A single CLI tool supports both modes: --mode testing (clean + seed) or --mode debug (backup + restore + mask). Both modes repoint ConfigStrings to test endpoints and validate data integrity before marking the environment ready.
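A minimal sketch of how such a mode switch might be wired up, using Python's argparse. The program name restore-tool and the step descriptions are illustrative assumptions that paraphrase the two procedures above; the sketch only builds the plan and does not touch any database.

```python
import argparse

# Mode-specific steps, paraphrased from the two procedures above.
STEPS = {
    "testing": [
        "drop user databases",
        "create schema from Git scripts",
        "run seed data scripts",
        "clear Memcached and Solr",
        "rebuild Solr indexes",
    ],
    "debug": [
        "download .bak files from Blob",
        "restore databases",
        "run masking scripts",
    ],
}
# Steps both modes share before the environment is marked ready.
COMMON_STEPS = ["repoint ConfigStrings", "validate data integrity", "notify"]


def build_parser():
    parser = argparse.ArgumentParser(prog="restore-tool")  # hypothetical name
    parser.add_argument("--mode", choices=sorted(STEPS), required=True)
    return parser


def plan(argv):
    """Return the ordered list of steps for the requested mode."""
    args = build_parser().parse_args(argv)
    return STEPS[args.mode] + COMMON_STEPS
```

Keeping the shared steps in one list means neither mode can skip the ConfigStrings repoint, which is the step that stops a freshly restored database from pointing at production endpoints.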

Database | Tables | PII Fields | Masking Approach
PearlData | Callers, CallerHistory | Name, Phone, Email, Address | Faker-generated UK data
PearlUsers | Users, CompanyContacts | Name, Email, Phone, Password | Hashed/randomised
PearlBilling | Invoices, Payments | CustomerName, BankDetails, CardRefs | Synthetic replacement
Messages / SMS | MessageContent, SMSSpoolOutgoing | CallerName, Phone, MessageText, Mobile | Faker replacement + anonymised
PearlLog | Various log tables | PII embedded in log payloads | Truncated/replaced

✅ No Masking Required (config/reference data only)

PearlOperations, PearlSwitch, PearlAnalysis, PearlSearch, LookupDBs, ASPNET, Checking
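The proposal performs masking with T-SQL scripts; the Python sketch below only illustrates one property worth preserving, namely deterministic replacement, so the same real value always maps to the same fake value and joins across tables still line up. The key, function names, and output formats are assumptions. The phone range 07700 900000 to 900999 is reserved by Ofcom for drama use, so masked numbers cannot reach a real handset.

```python
import hashlib
import hmac

MASKING_KEY = b"rotate-me-per-refresh"  # hypothetical secret, kept out of source control


def _digest(value: str) -> bytes:
    """Keyed hash so masked values cannot be reversed by hashing guesses."""
    return hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).digest()


def mask_phone(phone: str) -> str:
    """Map a phone number to a stable fake in the 07700 900xxx drama range."""
    suffix = int.from_bytes(_digest(phone)[:4], "big") % 1000
    return f"07700900{suffix:03d}"


def mask_email(email: str) -> str:
    """Map an email address to a stable, non-routable placeholder."""
    token = _digest(email.lower()).hex()[:12]
    return f"user-{token}@example.invalid"
```

The drama range holds only 1,000 numbers, so distinct callers will sometimes share a masked number; that is acceptable for functional testing but worth knowing when a test depends on phone uniqueness.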
18

Integration Safety Model — All 22 Services

Every external integration safely sandboxed, stubbed, or disabled. Full inventory from third-party dependencies index.

flowchart LR
    subgraph TestVM1["Test VM1 (Web Tier)"]
        Pearl["Pearl Test Instance"]
    end
    subgraph Sandbox["Sandbox Mode ✅ (5)"]
        G["Genesys Cloud CX"]
        ST["Stripe (sk_test)"]
        GCS["GoCardless"]
        SP["SagePay Simulator"]
        XR["Xero"]
    end
    subgraph Disabled["Disabled ⛔ (9)"]
        EL["ElevenLabs"]
        SMS1["MediaBurst"]
        SMS2["MessageBird"]
        SMS3["ClickSend"]
        BQ["BigQuery"]
        ZD["Zoho Desk/CRM"]
        TW2["Twitter (X)"]
        TP["Trustpilot"]
        SX["Sinerix"]
    end
    subgraph Local["Local / Isolated (5)"]
        Solr["Solr (localhost)"]
        MC["Memcached (localhost)"]
        TM["Totem (VM2)"]
        AI["Azure OpenAI (test)"]
        TW["Twilio (test creds)"]
    end
    subgraph Storage["Separate Resource (1)"]
        S3["S3 Test Bucket"]
    end
    Pearl --> G
    Pearl --> ST
    Pearl --> GCS
    Pearl --> SP
    Pearl --> XR
    Pearl -.->|"disabled"| EL
    Pearl -.->|"disabled"| SMS1
    Pearl -.->|"disabled"| SMS2
    Pearl -.->|"disabled"| SMS3
    Pearl -.->|"disabled"| BQ
    Pearl -.->|"disabled"| ZD
    Pearl -.->|"disabled"| TW2
    Pearl -.->|"disabled"| TP
    Pearl -.->|"disabled"| SX
    Pearl --> Solr
    Pearl --> MC
    Pearl --> TM
    Pearl --> AI
    Pearl --> TW
    Pearl --> S3

Complete Integration Inventory

Source: 53-third-party-dependencies-index.md + Phase 1 codebase analysis

☎️ Telephony & Voice (3)

Genesys Cloud CX → Sandbox org Risk: Test triggers live call routing, screen pops hit production operators
Twilio → Test credentials Risk: Real voice calls placed, real SMS sent, charges incurred
ElevenLabs → Disabled Risk: AI voice calls to real numbers, TTS API credit consumption

💳 Payments (3)

Stripe → Test mode (sk_test) Risk: Real credit card charges, production webhooks contaminated
GoCardless → Sandbox Risk: Real DD mandates created, customer bank accounts debited
SagePay/Opayo → Simulator Risk: Real card payments via Answer.co.uk brand

📨 Communications (4)

Mailgun → Sandbox domain Risk: Real emails to customers — notifications, invoices, alerts
MediaBurst (Route 21) → Disabled Risk: Real SMS to customer mobiles
MessageBird (Route 22) → Disabled Risk: SMS failover fires to real numbers
ClickSend (Route 23) → Disabled Risk: Third SMS failover route sends to real numbers

🧠 AI & Analytics (3)

Azure OpenAI → Isolated instance Risk: QC scoring pollutes production analytics, API credits consumed
Genesys Speech Analytics → Sandbox org Risk: Results written to production Genesys, corrupting real QC data
BigQuery → Disabled Risk: Test data exported to production dataset, corrupts analytics

🔧 Business Tools (4)

Xero → Sandbox Risk: Invoices in production Xero, reconciliation corrupted
Zoho Desk/CRM → Disabled Risk: Test tickets in production Zoho, CRM stats corrupted
Twitter (X) → Disabled Risk: Test posts to production company accounts
Trustpilot → Disabled Risk: Review invitations sent to real customers

🗄️ Storage & Infrastructure (5)

Amazon S3 → Separate bucket Risk: Test writes to prod S3, backups/recordings overwritten
Apache Solr → Local (localhost:8983) Risk: Test indexing corrupts production search indexes
Memcached → Local (localhost:11211) Risk: Test cache writes corrupt production cache state
Totem → Test instance (VM2) Risk: Screen pops sent to production operator browsers
Sinerix → Disabled Risk: E-signature requests to real sessions
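
The integration list above can also be kept as a machine-checkable matrix. The following is a minimal sketch rather than the project's actual tooling: the service names and strategies are transcribed from the list, while the dictionary layout and the helper names `strategy_counts` and `unassigned` are illustrative assumptions.

```python
from collections import Counter

# Service -> test strategy, transcribed from the integration list above.
INTEGRATIONS = {
    "Genesys Cloud CX": "Sandbox org",
    "Twilio": "Test credentials",
    "ElevenLabs": "Disabled",
    "Stripe": "Test mode (sk_test)",
    "GoCardless": "Sandbox",
    "SagePay/Opayo": "Simulator",
    "Mailgun": "Sandbox domain",
    "MediaBurst (Route 21)": "Disabled",
    "MessageBird (Route 22)": "Disabled",
    "ClickSend (Route 23)": "Disabled",
    "Azure OpenAI": "Isolated instance",
    "Genesys Speech Analytics": "Sandbox org",
    "BigQuery": "Disabled",
    "Xero": "Sandbox",
    "Zoho Desk/CRM": "Disabled",
    "Twitter (X)": "Disabled",
    "Trustpilot": "Disabled",
    "Amazon S3": "Separate bucket",
    "Apache Solr": "Local (localhost:8983)",
    "Memcached": "Local (localhost:11211)",
    "Totem": "Test instance (VM2)",
    "Sinerix": "Disabled",
}


def strategy_counts(matrix):
    """Tally how many services use each test strategy."""
    return Counter(matrix.values())


def unassigned(matrix):
    """Return services that have no strategy recorded (should be empty)."""
    return [name for name, strategy in matrix.items() if not strategy.strip()]


if __name__ == "__main__":
    for strategy, count in strategy_counts(INTEGRATIONS).most_common():
        print(f"{count:2d}  {strategy}")
```

A check like `unassigned(INTEGRATIONS)` could run in the pipeline so that a newly discovered integration cannot be deployed to the test environment without a recorded strategy.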

Integration Safety Summary

Categories: Sandbox, Disabled, Local / Isolated, Separate Resource, Deprecated (No Action)
19

CI/CD Pipeline & Infrastructure as Code

GitHub Actions with self-hosted runner, Terraform for multi-environment spawning, approval gates and rollback.

Source: RFP Section 5.4 & 5.5

Terraform — Multi-Environment Spawning

Per the RFP requirement, Terraform is the primary IaC tool for its superior multi-environment capabilities. A single terraform apply with variable overrides can provision N parallel test environments — e.g., one per feature branch or test cycle.

terraform/
├── modules/
│   ├── networking/      # Hub-Spoke VNets, NSGs, Bastion, Firewall
│   ├── compute/         # VM1 (Web), VM2 (Worker), VM3 (Build)
│   ├── database/        # SQL MI (General Purpose)
│   ├── security/        # Key Vault, RBAC, Managed Identities
│   ├── monitoring/      # Log Analytics, Azure Monitor, Audit
│   └── storage/         # Blob Storage, SAS policies
├── environments/
│   ├── test-01/         # Primary test environment
│   ├── test-02/         # Feature branch environment
│   └── test-N/          # N-th on-demand environment
├── main.tf              # Root module composition
├── variables.tf         # Parameterised config
└── backend.tf           # Azure Blob remote state
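
To illustrate how variable overrides could describe N parallel environments, here is a small Python sketch that generates one override set per environment. It is an assumption-laden illustration, not the proposal's Terraform code: the base range `10.100.0.0/16`, the /24 spoke size, and the keys `environment` and `spoke_cidr` are hypothetical. Only the `test-NN` naming and the `Environment=Test, Project=Pearl` tags come from this document.

```python
import ipaddress


def environment_vars(count, base_cidr="10.100.0.0/16"):
    """Build one variable-override dict per environment (test-01..test-N).

    base_cidr is a hypothetical address range; each environment receives
    its own non-overlapping /24 spoke carved out of it.
    """
    spokes = list(ipaddress.ip_network(base_cidr).subnets(new_prefix=24))
    if count > len(spokes):
        raise ValueError("base range too small for that many environments")
    return [
        {
            "environment": f"test-{index:02d}",
            "spoke_cidr": str(spoke),
            "tags": {"Environment": "Test", "Project": "Pearl"},
        }
        for index, spoke in enumerate(spokes[:count], start=1)
    ]


if __name__ == "__main__":
    for env in environment_vars(3):
        print(env["environment"], env["spoke_cidr"])
```

Each dict could be written out as a per-environment variable file under `environments/`, which keeps names and address spaces predictable when more than one environment exists.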

flowchart TB
    subgraph DevSub["Developer"]
        D1["Push / Pull Request"]
    end
    subgraph Terraform["Terraform IaC"]
        TF1["terraform plan"]
        TF2["terraform apply"]
        TF3["Provision N environments"]
    end
    subgraph GHActions["GitHub Actions Workflows"]
        subgraph BuildWF["build.yml"]
            B1["Checkout code"]
            B2["NuGet restore"]
            B3["MSBuild all 7 components"]
            B4["aspnet_compiler precompile"]
            B5["Package artefacts"]
        end
        subgraph DeployWF["deploy-test.yml (manual)"]
            D2["🔒 Approval gate"]
            D3["Select target env (test-01..N)"]
            D4["Stop IIS + services"]
            D5["Backup → _rollback/"]
            D6["Robocopy artefacts → VMs"]
            D7["Start IIS + restart"]
            D8["Health check"]
        end
        subgraph DBMode["db-mode-switch.yml"]
            DB1["Select: testing / debug"]
            DB2["Execute restore tool"]
            DB3["Validate + notify"]
        end
        subgraph RollbackWF["Rollback"]
            R1["Restore _rollback/"]
            R2["Restart + notify"]
        end
    end
    D1 --> B1
    B1 --> B2 --> B3 --> B4 --> B5
    B5 --> D2
    D2 --> D3 --> D4 --> D5 --> D6 --> D7 --> D8
    D8 -->|"Failure"| R1
    R1 --> R2
    TF1 --> TF2 --> TF3

Deployment Strategy

Web Apps

Robocopy artefacts to IIS physical paths — mirrors current production method

Rollback: Previous build backed up to _rollback/

Workers

Stop Windows Service → copy binaries → restart service

Rollback: Same backup/restore approach

Database Modes

Testing mode: clean + seed scripts. Debug mode: backup/restore + mask. Switchable via pipeline.

Rollback: Re-run mode switch to restore state

Infrastructure

Terraform plan → apply with approval. Entire environment from code. Multi-env via workspaces.

Rollback: terraform destroy + terraform apply
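
The web-app deployment and rollback steps described above (back up to `_rollback/`, copy artefacts, health check, restore on failure) can be sketched in Python. This is a simplified model under stated assumptions: the real pipeline uses Robocopy and stops and starts IIS and Windows services, which this sketch omits, and the function name `deploy` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable


def deploy(artefacts: Path, site: Path, health_check: Callable[[], bool]) -> bool:
    """Copy artefacts over the site; restore the previous build if the check fails."""
    rollback = site.parent / "_rollback"
    # Keep exactly one previous build, replacing any older backup.
    if rollback.exists():
        shutil.rmtree(rollback)
    shutil.copytree(site, rollback)
    # Overlay the new build onto the existing site directory.
    shutil.copytree(artefacts, site, dirs_exist_ok=True)
    if health_check():
        return True
    # Health check failed: put the previous build back.
    shutil.rmtree(site)
    shutil.copytree(rollback, site)
    return False
```

Passing the health check in as a callable keeps the copy-and-restore logic testable without a running web server.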
20

Security Controls — ISO 27001 Aligned

15 controls covering network, identity, data, and audit layers.

Source: RFP Section 5.6

Required

Subscription Isolation

Separate Azure subscription for test environment

Required

No Public IPs

All VM access via Azure Bastion only — no exposed endpoints

Required

NSG Micro-Segmentation

Per-subnet NSGs with least-privilege port rules

Required

Egress Filtering

Azure Firewall with allowlisted outbound only

Required

No Prod Connectivity

No VNet peering to production subscription — air gap

Required

Secrets in Key Vault

All connection strings and API keys in Key Vault

Required

Managed Identities

System-assigned MI for Key Vault & Blob access

Required

RBAC Least Privilege

Custom role definitions per persona (see RBAC section)

Required

Disk Encryption

ADE (BitLocker) with Customer-Managed Keys

Required

SQL MI TLS

Encrypt=True, TrustServerCertificate=False

Required

Audit Logging

Dedicated PearlAudit database + Log Analytics

Required

Resource Tagging

Environment=Test, Project=Pearl enforced tags

Required

Data Masking

PII anonymisation executed on every Debug Mode restore

Required

GDPR Lifecycle

Documented retention, purpose limitation, access controls

Required

Key Rotation

90-day secret rotation, 365-day CMK rotation via Key Vault
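
Some of the controls above lend themselves to automated verification. As one example, the SQL MI TLS control (`Encrypt=True, TrustServerCertificate=False`) could be checked against each connection string before deployment. The sketch below is illustrative only: the function names are hypothetical, and it deliberately requires both settings to be stated explicitly rather than relying on driver defaults.

```python
def parse_connection_string(conn: str) -> dict:
    """Split 'Key=Value;Key=Value' into a dict with lower-cased keys."""
    pairs = (part.split("=", 1) for part in conn.split(";") if "=" in part)
    return {key.strip().lower(): value.strip() for key, value in pairs}


def tls_violations(conn: str) -> list:
    """Return the TLS rules a connection string breaks (empty list if none)."""
    settings = parse_connection_string(conn)
    problems = []
    if settings.get("encrypt", "").lower() != "true":
        problems.append("Encrypt must be True")
    if settings.get("trustservercertificate", "").lower() != "false":
        problems.append("TrustServerCertificate must be False")
    return problems


if __name__ == "__main__":
    # Hypothetical server name, for illustration only.
    sample = ("Server=tcp:sqlmi.example.internal;Database=PearlData;"
              "Encrypt=True;TrustServerCertificate=False")
    print(tls_violations(sample))
```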

20b

Audit Logging & Trail System

Dedicated PearlAudit database with separate tables for every category of change — complete, tamper-resistant audit trail.

Source: RFP Section 9.2 Option D

flowchart TB
    subgraph Apps["Application Layer"]
        PA["pearl-azure"]
        PW["pearl-webservices"]
        WK["Workers"]
    end
    subgraph Collector["Audit Collector"]
        AC["Structured Logging API"]
    end
    subgraph AuditDB["PearlAudit Database (Dedicated)"]
        T1["Audit_ChangeLog<br/>INSERT / UPDATE / DELETE"]
        T2["Audit_AccessLog<br/>Login, page views, API calls"]
        T3["Audit_ConfigChanges<br/>ConfigString modifications"]
        T4["Audit_SecurityEvents<br/>RBAC changes, auth failures"]
        T5["Audit_IntegrationLog<br/>External API calls (sanitised)"]
        T6["Audit_DeploymentLog<br/>CI/CD events + artefact hashes"]
        T7["Audit_DataAccessLog<br/>PII access tracking"]
        T8["Audit_SystemEvents<br/>Infrastructure changes"]
    end
    subgraph Export["Long-Term Storage"]
        LA["Azure Log Analytics<br/>90 days hot / 365 days archive"]
    end
    PA --> AC
    PW --> AC
    WK --> AC
    AC --> T1
    AC --> T2
    AC --> T3
    AC --> T4
    AC --> T5
    AC --> T6
    AC --> T7
    AC --> T8
    T1 --> LA
    T4 --> LA
    T6 --> LA

📝

SQL Triggers

State-changing tables from the audit scope below write INSERT/UPDATE/DELETE events to Audit_ChangeLog with before/after JSON values.

🌐

Application Middleware

Global.asax captures page access and API call events with user identity, IP address, and correlation IDs

🔗

Integration Wrapper

All external API calls logged with PII-sanitised payloads, response codes, and timing data

🚀

CI/CD Hooks

GitHub Actions posts deployment events via secure webhook — artefact hashes, approver identity, timestamps
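
To make the "before/after JSON values" idea concrete, the sketch below models the shape of a single change record. The production capture would live in T-SQL triggers, so this Python is only a model of the record. The field names (`changed_columns`, `before_json`, and so on) and the `change_event` helper are assumptions, not the actual Audit_ChangeLog schema.

```python
import json
from datetime import datetime, timezone


def change_event(table, action, before, after, user):
    """Build one Audit_ChangeLog-style record with before/after JSON."""
    if action not in {"INSERT", "UPDATE", "DELETE"}:
        raise ValueError(f"unsupported action: {action}")
    before = before or {}  # INSERT has no previous row
    after = after or {}    # DELETE has no resulting row
    changed = sorted(
        column
        for column in before.keys() | after.keys()
        if before.get(column) != after.get(column)
    )
    return {
        "table": table,
        "action": action,
        "user": user,
        "changed_columns": changed,
        "before_json": json.dumps(before, sort_keys=True),
        "after_json": json.dumps(after, sort_keys=True),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing the list of changed columns next to the full JSON snapshots makes it cheap to answer "who changed this setting" without parsing every payload.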

Which Tables Need Direct Audit Coverage?

Not all 489 tables receive synchronous SQL triggers. The direct audit scope focuses on the 28 high-risk tables that can change configuration, permissions, customer data, financial records, queue execution, or regulated communication content. The rest remain covered through access logs, integration logs, deployment logs, and system events unless later discovery promotes them into the trigger scope.

Source DB | Tables in direct audit scope | Primary audit route | Why in scope
PearlOperations | ConfigStrings | SQL trigger → Audit_ConfigChanges | Controls API keys, endpoints, feature flags, and runtime behaviour.
PearlUsers | Users, Permissions, LoginLogs, Companies, CompanyInfo, CompanyContacts, Rotas, Shifts | SQL trigger → Audit_ChangeLog / Audit_SecurityEvents | Identity, tenancy, rota, and permission changes determine who can access the system and how escalations are routed.
PearlData | Messages, Callers, CallerHistory, ScreenInits, PhysicalDDIs | SQL trigger → Audit_ChangeLog / Audit_DataAccessLog | Core message-taking tables hold caller PII, operator-entered content, screen state, and DDI routing context.
PearlQueues | DispatchQueue, Process_JobQueue, Process_MachineStates, JobSchedules | SQL trigger → Audit_ChangeLog | These tables control background execution, dispatch timing, and worker coordination.
PearlBilling | Invoices, BillItems, Payments | SQL trigger → Audit_ChangeLog / Audit_IntegrationLog | Financial correctness, payment actions, and accounting exports depend on these records.
Messages | MessageContent | SQL trigger → Audit_DataAccessLog | Legacy message body storage contains customer communication content.
SMSBroadcast | SMSSpoolOutgoing | SQL trigger → Audit_IntegrationLog | Outbound customer communications need clear send intent and change history.
PearlLog | PageAccessLogs, APILogs, ProcessLogs, SecurityExceptions | Async logging → Audit_AccessLog / Audit_SecurityEvents | Evidence of user journeys, API misuse, worker faults, and suspicious activity.
PearlSwitch | CallRecordings | SQL trigger → Audit_DataAccessLog | Call recording metadata is sensitive operational evidence and needs attributable access history.
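
The audit scope above can be expressed as data so that the 28-table count and the trigger-versus-async split are checkable. This is a sketch under assumptions: the table names are transcribed from the scope table, the split of the SMS row into database `SMSBroadcast` and table `SMSSpoolOutgoing` is an interpretation of the flattened source text, and the helper `in_trigger_scope` is hypothetical.

```python
# Database -> tables in direct audit scope, transcribed from the table above.
AUDIT_SCOPE = {
    "PearlOperations": ["ConfigStrings"],
    "PearlUsers": ["Users", "Permissions", "LoginLogs", "Companies",
                   "CompanyInfo", "CompanyContacts", "Rotas", "Shifts"],
    "PearlData": ["Messages", "Callers", "CallerHistory", "ScreenInits",
                  "PhysicalDDIs"],
    "PearlQueues": ["DispatchQueue", "Process_JobQueue",
                    "Process_MachineStates", "JobSchedules"],
    "PearlBilling": ["Invoices", "BillItems", "Payments"],
    "Messages": ["MessageContent"],
    # Assumption: database/table split inferred from the flattened row.
    "SMSBroadcast": ["SMSSpoolOutgoing"],
    "PearlLog": ["PageAccessLogs", "APILogs", "ProcessLogs",
                 "SecurityExceptions"],
    "PearlSwitch": ["CallRecordings"],
}

# Databases whose tables are captured by async logging instead of triggers.
ASYNC_DATABASES = {"PearlLog"}


def in_trigger_scope(database, table):
    """True if the table should receive a synchronous SQL trigger."""
    return (database not in ASYNC_DATABASES
            and table in AUDIT_SCOPE.get(database, []))


TOTAL_IN_SCOPE = sum(len(tables) for tables in AUDIT_SCOPE.values())

if __name__ == "__main__":
    print(TOTAL_IN_SCOPE)
```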
20c

RBAC Hardening Strategy

Custom Azure role definitions with least-privilege access and Privileged Identity Management.

Source: RFP Section 5.1

Role | Scope | Permissions | Assigned To
Pearl-TestEnv-Admin | Subscription | Full Contributor + KV admin + SQL MI admin | Infra team lead (PIM-gated)
Pearl-TestEnv-Developer | Resource Group | VM Contributor + KV Secret Reader + SQL MI Read/Write | Development team
Pearl-TestEnv-QA | Resource Group | VM Reader + SQL MI Data Reader (read-only) | QA/testing team
Pearl-TestEnv-Deployer | Resource Group | VM Contributor (start/stop) + Blob Reader + KV Secret Reader | GitHub Actions SP
Pearl-TestEnv-DBA | SQL MI | SQL MI Contributor + KV Secret Reader | Database administrators
Pearl-TestEnv-Auditor | Log Analytics | Log Analytics Reader + Audit DB read-only | Compliance / audit

Privileged Identity Management (PIM)

🔓

Admin Elevation

4-hour max duration. Requires tech lead approval. For infrastructure changes and Key Vault management.

🗄️

SQL MI Direct Access

2-hour max duration. Requires DBA lead approval. For emergency database operations only.

🔍

Debug Mode Activation

8-hour max duration. Requires DPO approval. For loading unmasked production data.
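
The three elevation rules above amount to a small policy table. The sketch below shows one way to validate an elevation request against it. Azure PIM enforces such rules natively, so this is only an illustration of the policy, and the identifiers (`ElevationPolicy`, `validate_request`, the policy keys) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElevationPolicy:
    max_hours: int
    approver: str


# Durations and approvers as listed in the PIM cards above.
POLICIES = {
    "admin": ElevationPolicy(max_hours=4, approver="tech lead"),
    "sql_mi_direct": ElevationPolicy(max_hours=2, approver="DBA lead"),
    "debug_mode": ElevationPolicy(max_hours=8, approver="DPO"),
}


def validate_request(kind, hours, approved_by):
    """Return a list of policy violations for an elevation request."""
    policy = POLICIES[kind]
    errors = []
    if hours > policy.max_hours:
        errors.append(f"duration exceeds {policy.max_hours}h maximum")
    if approved_by != policy.approver:
        errors.append(f"requires approval by {policy.approver}")
    return errors
```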

20d

Key Vault & Encryption Management

Customer-Managed Keys for all encryption layers with automated rotation policies.

Source: RFP Section 9.2 Option D

💿

VM Disk Encryption

ADE (BitLocker) with Customer-Managed Key (CMK) stored in Key Vault. All VM disks encrypted at rest.

🗄️

SQL MI TDE

Transparent Data Encryption with CMK. SQL MI TDE protector key rotated annually via Key Vault policy.

📦

Blob Storage SSE

Server-Side Encryption with CMK. Backup .bak files encrypted at rest with customer-controlled keys.

🔑

Secret Rotation

90-day rotation for API keys and service principal secrets. 365-day rotation for encryption CMKs. 14-day expiry alerts.

Rotation Schedule

Key / Secret Type | Rotation Period | Method | Alert Threshold
Integration API keys | 90 days | Manual rotate + KV version | 14 days before expiry
Service principal secrets | 90 days | Auto-rotate via KV policy | 14 days before expiry
SQL MI connection string | 90 days | Auto-rotate via Azure Function | 14 days before expiry
Disk encryption CMK | 365 days | Auto-rotate via KV policy | 30 days before expiry
SQL MI TDE protector | 365 days | Auto-rotate via KV policy | 30 days before expiry
Blob encryption CMK | 365 days | Auto-rotate via KV policy | 30 days before expiry
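
The rotation schedule above translates directly into date arithmetic. The following sketch computes the next rotation date and the alert date for each key type. It assumes that "expiry" coincides with the rotation due date, and the `rotation_dates` helper is illustrative rather than part of any Key Vault API.

```python
from datetime import date, timedelta

# Key type -> (rotation period in days, alert lead time in days).
SCHEDULE = {
    "Integration API keys": (90, 14),
    "Service principal secrets": (90, 14),
    "SQL MI connection string": (90, 14),
    "Disk encryption CMK": (365, 30),
    "SQL MI TDE protector": (365, 30),
    "Blob encryption CMK": (365, 30),
}


def rotation_dates(kind, last_rotated):
    """Return (rotation due date, alert date) for a key type."""
    rotation_days, alert_days = SCHEDULE[kind]
    due = last_rotated + timedelta(days=rotation_days)
    return due, due - timedelta(days=alert_days)


if __name__ == "__main__":
    due, alert = rotation_dates("Integration API keys", date(2026, 3, 6))
    print(due, alert)  # 2026-06-04 2026-05-21
```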
20e

Delivery Plan

RFP-aligned weekly plan targeting MVTE readiness by mid-May 2026.

Source: RFP Section 7

W1

Week 1 — Discovery & Validation

Current-state validation. Full integration inventory (all 22 services). Data strategy selection (dual-mode). Terraform module design.

✅ Deliverables: Discovery report + Integration safety matrix + Agreed design

W2-3

Weeks 2–3 — Environment Build & Isolation

Azure infrastructure via Terraform. Hub-Spoke VNets, VMs, SQL MI, Bastion, Firewall, Key Vault. Initial app deployment via CI/CD. Integration sandbox/stub configuration.

✅ Deliverables: Running test environment + Initial deployment validated

W4

Week 4 — Data Pipeline & Security

Dual-mode data pipeline (testing + debug modes). Weekly refresh automation. RBAC hardening and audit logging system. Runbooks and SOPs.

✅ Deliverables: Data pipeline operational + Runbooks delivered + Security controls verified

W5

Week 5 — Stabilisation & Handover

Stabilisation and defect fixes. Multi-environment spawning verification. Acceptance evidence pack. Handover session and recorded walkthrough.

✅ Deliverables: Acceptance criteria met + Handover complete

21

Implementation Roadmap

425 hours / 53 working days on the selected 3-VM + 1 DB design across 5 weeks. Target: mid-May 2026.

Source: RFP Section 7, 9.2

Discovery

Current-state validation

40h (5 days) ✅ Completed

RFP Option A — MVTE Build

3-VM + 1 DB foundation

105h (13 days)
  • 18h — Hub-spoke network, subnets, Bastion, Firewall, NSGs
  • 22h — VM1 web tier, IIS, Memcached, and Solr baseline
  • 14h — VM2 worker tier and Windows runtime hosting
  • 16h — VM3 build runner, MSBuild, and restore tooling
  • 15h — SQL MI provisioning and private connectivity
  • 20h — CI/CD workflows, rollback path, and smoke test

RFP Option B — Data Pipeline

Restore, mask, and refresh automation

96h (12 days)
  • 22h — Blob intake and restore orchestration
  • 20h — Restore sequencing for 17 databases
  • 28h — PII masking for the six high-risk databases
  • 12h — Testing-mode seed data and debug-mode controls
  • 14h — Weekly refresh automation, validation, and reporting

RFP Option C — Multi-env

Cloneable environment pattern

32h (4 days)
  • 12h — Parameterise IaC for extra 3-VM + 1 DB stacks
  • 6h — Naming, address-space, and DNS conventions
  • 8h — Environment-specific secrets and runner targeting
  • 6h — Provisioning verification and demo of an extra environment

RFP Option D — Security

Hardening and audit evidence

56h (7 days)
  • 14h — Azure RBAC, PIM roles, and admin segregation
  • 12h — Key Vault secret governance and rotation policy
  • 18h — Audit logging database, triggers, and application capture
  • 12h — Monitoring evidence, policy checks, and control validation

Validation

Testing & handover

56h (7 days)
  • End-to-end deploy + restore test
  • Smoke test key user journeys
  • Integration safety verification
  • Documentation + recorded handover

Buffer

Stabilisation

40h (5 days)
🎯 MVTE Ready — Mid-May 2026 — 425 hours total
22

Hours Estimate

Three-perspective comparison plus the detailed RFP Option A-D breakdown on the selected 3-VM + 1 DB design.

Source: RFP Section 9.2 — Mandatory Table Format

RFP 9.2 — Mandatory Hours Breakdown

Work Package (RFP 9.2) | Original Est. | RFP-Adjusted | Final Proposed
Discovery | 24-32h | 32-40h | 40h
RFP Option A — MVTE Build | 64-88h | 88-112h | 105h
RFP Option B — Data Pipeline | 64-80h | 80-104h | 96h
RFP Option C — Multi-env Capability | 0h (not estimated) | 24-32h | 32h
RFP Option D — Enhanced Security | 16-24h | 48-64h | 56h
Validation & Handover | 32-40h | 48-64h | 56h
Buffer | 40h | 40h | 40h
TOTAL | 240-304h | 360-456h | 425h (53 days)
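
The totals in the table above can be re-derived with a few lines of Python. This is a simple arithmetic check; the only assumption is an 8-hour working day, which reproduces the day counts quoted in this proposal.

```python
HOURS_PER_DAY = 8  # assumption: 8-hour working day

# Final proposed hours per work package, from the table above.
FINAL_HOURS = {
    "Discovery": 40,
    "RFP Option A": 105,
    "RFP Option B": 96,
    "RFP Option C": 32,
    "RFP Option D": 56,
    "Validation & Handover": 56,
    "Buffer": 40,
}

# (low, high) ranges for the two earlier estimates, in the same order.
ORIGINAL = [(24, 32), (64, 88), (64, 80), (0, 0), (16, 24), (32, 40), (40, 40)]
RFP_ADJUSTED = [(32, 40), (88, 112), (80, 104), (24, 32), (48, 64), (48, 64), (40, 40)]


def to_days(hours):
    """Convert hours to whole working days, rounded to the nearest day."""
    return round(hours / HOURS_PER_DAY)


def range_total(ranges):
    """Sum the low and high ends of a list of (low, high) estimates."""
    return sum(low for low, _ in ranges), sum(high for _, high in ranges)


if __name__ == "__main__":
    total = sum(FINAL_HOURS.values())
    print(total, to_days(total))
    print(range_total(ORIGINAL), range_total(RFP_ADJUSTED))
```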

Estimate Comparison

Original Team (phases.txt) 240-304h
RFP-Adjusted (+ Options C, D) 360-456h
Final Proposed 425h

Why the increase? The uplift comes from standing up the selected 3-VM + 1 DB foundation, adding the restore-and-mask data pipeline, and then layering on multi-environment capability plus the enhanced security controls required by the RFP.

Architecture option totals for topology comparison (each includes a 20% buffer and is rounded to the nearest 5 hours):

Option A — 3-VM

425h/53 days

Baseline RFP-compliant path

Option B — 2-VM + Functions

665h/83 days

Includes worker refactor to Functions

Option C — Single VM

390h/49 days

Lower infra effort, less isolation

Option D — 4-VM Full

460h/58 days

Extra tiering and deployment complexity

Detailed RFP Option A-D Breakdown on the Selected 3-VM + 1 DB Design

These are work packages delivered on top of the chosen architecture, not alternative topologies. RFP Option A carries the 105h baseline because it creates the private landing zone, provisions VM1, VM2, VM3, wires the single SQL Managed Instance, and proves the first repeatable deployment path.

RFP Option A: MVTE Build

Sub-activity | Hours | Why it is needed
Hub-spoke network, subnets, NSGs, Bastion, and firewall rules | 18h | Create the private landing zone before any VM or SQL resource is attached.
VM1 web tier provisioning and IIS baseline | 22h | VM1 hosts pearl-azure, pearl-webservices, utility-server, Memcached, and Solr together.
VM2 worker tier provisioning and service hosting | 14h | Keep the Windows workers separate from browser traffic and give them the required runtime dependencies.
VM3 build tier, GitHub runner, and toolchain setup | 16h | Provide a dedicated build and restore server for deploys, restores, and masking workflows.
Azure SQL MI GP provisioning, private access, and base configuration | 15h | Create the single managed database tier that must hold all 17 masked databases.
CI/CD workflows, initial deployment, rollback, and smoke test | 20h | Make the environment usable by proving a repeatable deployment path into VM1 and VM2.
Total | 105h | 13 days

RFP Option B: Data Pipeline

Sub-activity | Hours | Why it is needed
Backup intake from Blob and restore orchestration | 22h | Use production backup files as the only approved path into the test SQL MI.
Restore sequencing for 17 databases and dependency handling | 20h | Script restore order, logins, jobs, and cross-database checks inside the single managed instance.
PII masking for the six high-risk databases | 28h | Make the shared test data safe before any non-production access is allowed.
Testing-mode seed data and debug-mode controls | 12h | Support both repeatable regression data and a governed debugging path on the same estate.
Weekly refresh automation, validation checks, and reporting | 14h | Turn refresh into a repeatable operational run rather than a one-off manual restore.
Total | 96h | 12 days

RFP Option C: Multi-env

Sub-activity | Hours | Why it is needed
Parameterise IaC for extra 3-VM + 1 DB stacks | 12h | Clone the selected pattern instead of inventing a new topology for each extra environment.
Naming, address-space, and DNS conventions per environment | 6h | Keep each environment predictable and isolated when more than one exists.
Environment-specific secrets, runner targeting, and configuration selection | 8h | Ensure the build tier can deploy to the right VM and SQL set without leaking credentials.
Provisioning verification and demo of an extra environment | 6h | Prove that one additional environment can be created and validated from the template.
Total | 32h | 4 days

RFP Option D: Security

Sub-activity | Hours | Why it is needed
Azure RBAC, PIM roles, and admin segregation | 14h | Separate infrastructure, database, and deployment access inside the 3-VM estate.
Key Vault secret governance and rotation policy | 12h | Formalise rotation, access scoping, and recovery procedures for the secrets already used by the design.
Audit logging database, triggers, and application capture | 18h | Collect evidence of privileged activity, data changes, and release actions across web, worker, and data tiers.
Monitoring evidence, policy checks, and control validation | 12h | Provide proof points and alerting so the controls are supportable and reviewable.
Total | 56h | 7 days
22b

Azure Running Costs

Monthly infrastructure cost per environment, optimised with auto-shutdown.

Monthly Cost Profiles

Business Hours

£550–650/mo

VMs off 19:00–07:00 + weekends

Recommended

Full-Time (24/7)

£700–750/mo

For intensive test periods

Minimal (Idle)

~£290/mo

SQL MI + Blob only, VMs off

Resource Breakdown (per environment)

Resource | SKU | Est. £/mo
VM1 — Web (D4s v5) | PAYG, auto-shutdown | ~£160
VM2 — Worker (D2s v5) | PAYG, auto-shutdown | ~£80
VM3 — Build (D2s v5) | PAYG, auto-shutdown | ~£80
SQL MI (GP, 4 vCores) | PAYG | ~£280
Blob Storage (500 GB) | Hot LRS | ~£10
Azure Bastion | Standard (shared hub) | ~£45
Azure Firewall | Standard (shared hub) | ~£40
Key Vault | Standard | ~£1
Log Analytics | Per-GB | ~£5
TOTAL (24/7) | | ~£700/mo

Multi-environment scaling: Each additional Terraform workspace adds ~£291/mo with its VMs switched off (SQL MI ~£280, Blob ~£10, Key Vault ~£1); running its three VMs 24/7 adds a further ~£320/mo. Hub resources (Bastion, Firewall) are shared.
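
The cost figures above can be cross-checked with a small model. The per-resource estimates are taken from the resource breakdown; the grouping into VM and idle components and the function names are assumptions made for illustration. The model does not attempt to reproduce the business-hours profile, which depends on details not listed in the breakdown.

```python
# Estimated monthly cost per resource in GBP, from the resource breakdown.
MONTHLY_GBP = {
    "VM1 Web (D4s v5)": 160,
    "VM2 Worker (D2s v5)": 80,
    "VM3 Build (D2s v5)": 80,
    "SQL MI (GP, 4 vCores)": 280,
    "Blob Storage (500 GB)": 10,
    "Azure Bastion": 45,
    "Azure Firewall": 40,
    "Key Vault": 1,
    "Log Analytics": 5,
}


def total_24_7():
    """All resources running around the clock."""
    return sum(MONTHLY_GBP.values())


def vm_total():
    """Combined cost of the three VMs when running 24/7."""
    return sum(cost for name, cost in MONTHLY_GBP.items() if name.startswith("VM"))


def idle():
    """SQL MI and Blob only, with the VMs switched off."""
    return MONTHLY_GBP["SQL MI (GP, 4 vCores)"] + MONTHLY_GBP["Blob Storage (500 GB)"]


if __name__ == "__main__":
    print(total_24_7(), vm_total(), idle())  # 701 320 290
```

Adding the Key Vault line to the idle figure gives 291, which matches the per-workspace increment quoted above.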

23

Risk Assessment

Key risks identified during Phase 1 discovery.

R1

Hardcoded IPs in Source Code

High Likelihood Medium Impact

Risk: AI spooler uses http://10.0.0.12, reporting uses 10.0.1.44, queue-processor references pearlsqlmi2, totem reads from C:\totemscripts\*.txt

Mitigation: Audit all source for hardcoded IPs/hostnames. Create web.config overrides or hosts file entries on test VMs. Provision required filesystem paths.
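
The source audit in this mitigation could start from a simple scanner. The sketch below flags IPv4 literals and known hostnames line by line. It is a starting point rather than a complete audit: the regular expression also matches four-part version strings, and the hostname list contains only the one name cited in this risk.

```python
import re

# Matches dotted-quad literals such as 10.0.0.12 (no range validation).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hostnames already known to be hardcoded; extend as the audit finds more.
KNOWN_HOSTS = ("pearlsqlmi2",)


def find_hardcoded(text):
    """Return (line number, match) pairs for hardcoded IPs and hostnames."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4.findall(line):
            hits.append((lineno, match))
        for host in KNOWN_HOSTS:
            if host in line.lower():
                hits.append((lineno, host))
    return hits


if __name__ == "__main__":
    sample = 'url = "http://10.0.0.12/spool"\nconn = "Server=pearlsqlmi2;"'
    print(find_hardcoded(sample))
```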

R2

Data Masking Incompleteness

Medium Likelihood High Impact

Risk: PII may exist in unexpected columns/tables across 489+ tables.

Mitigation: Data discovery audit before go-live. Deny-by-default approach — mask all string columns in PII-sensitive tables. Review PearlLog payloads.
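
The deny-by-default approach in this mitigation means that a string column is masked unless someone has explicitly marked it safe. A minimal sketch of that rule follows. The keyed hash (HMAC-SHA256) is one possible masking technique chosen for illustration, not necessarily the one the pipeline will use, and the column names in the usage example are hypothetical.

```python
import hashlib
import hmac


def mask_row(row, key, allowlist=frozenset()):
    """Mask every string column unless it is explicitly allowlisted.

    key is a secret byte string; the same input and key always yield the
    same token, so joins on masked values keep working.
    """
    masked = {}
    for column, value in row.items():
        if isinstance(value, str) and column not in allowlist:
            digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
            masked[column] = f"masked_{digest[:12]}"
        else:
            masked[column] = value
    return masked


if __name__ == "__main__":
    row = {"CallerId": 7, "Name": "Jane Doe", "Status": "open"}
    print(mask_row(row, key=b"test-only-key", allowlist={"Status"}))
```

Because new string columns are masked automatically, a schema change cannot silently expose PII. The cost is that genuinely harmless columns must be added to the allowlist by hand.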

R3

Integration Credential Leakage

Medium Likelihood High Impact

Risk: Test/sandbox API keys (Stripe, Genesys, etc.) must not leak into production-visible systems.

Mitigation: All secrets stored in Azure Key Vault. No secrets in code, config files on disk, or repository. Managed Identity access only.

R4

SQL MI Backup Size Unknown

Medium Likelihood Medium Impact

Risk: Actual DB sizes not yet measured — restore pipeline duration and storage costs are estimates.

Mitigation: Measure actual sizes during Phase 1 completion. Consider trimming PearlLog/PearlArchive for test. Adjust Blob Storage tier if needed.

R5

Telerik Licence Coverage

Medium Likelihood Medium Impact

Risk: RadControls for ASP.NET AJAX require a valid licence for the test environment.

Mitigation: Verify existing licence covers non-production use. Contact Telerik/Progress if additional licence needed.

R6

Totem .NET 3.5 Dependency

Low Likelihood Low Impact

Risk: totem-2-cloud-nosql targets .NET Framework 3.5 — must be installed as a Windows feature.

Mitigation: Enable .NET 3.5 feature on VM2 via DISM during provisioning script. Also provision C:\totemscripts\ directory with script templates.

R10

Terraform State Corruption

Medium Likelihood High Impact

Risk: Remote state stored in Azure Blob could be corrupted by concurrent operations or manual changes outside Terraform.

Mitigation: Enable state locking via Azure Blob lease. Blob versioning for state file recovery. terraform plan mandatory before apply in CI/CD.

R11

Audit Logging Performance Impact

Medium Likelihood Medium Impact

Risk: SQL triggers on high-volume tables (e.g., PearlLog) could add latency to application transactions.

Mitigation: Triggers only on critical configuration/security tables. High-volume access logging via async middleware, not triggers. Performance test during Phase 4.

R12

Key Rotation Service Disruption

Low Likelihood High Impact

Risk: Automatic key rotation could temporarily break SQL MI TDE or Blob encryption if applications cache old keys.

Mitigation: Key Vault rotation events trigger Azure Function that validates connectivity. Dual-key overlap period (old key retained for 24h). Rotation during maintenance window.

R13

Multi-Environment SQL MI Cost Scaling

Medium Likelihood Medium Impact

Risk: Each additional Terraform workspace spawns a separate SQL MI instance (~£280/mo), which could exceed budget if many environments run simultaneously.

Mitigation: Auto-shutdown policy for non-primary environments. Spawn-on-demand, destroy-after-use workflow. Budget alerts at 80% threshold.

R14

Regression Seed Data Drift

Medium Likelihood Low Impact

Risk: Testing Mode synthetic seed data may diverge from production schema changes over time, causing false test failures.

Mitigation: Seed data scripts version-controlled in repo. Schema diff check included in CI/CD pipeline. Quarterly seed data review process.
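
The schema diff check mentioned in this mitigation can be as simple as comparing column sets per table. The sketch below assumes the live schema and the seed scripts have each been reduced to a list of column names; how those lists are extracted is outside its scope, and the function name is hypothetical.

```python
def schema_drift(schema_columns, seed_columns):
    """Compare a table's live columns with the columns its seed script fills."""
    schema = set(schema_columns)
    seed = set(seed_columns)
    return {
        # Columns added to the schema that the seed data does not populate.
        "missing_from_seed": sorted(schema - seed),
        # Columns the seed script still references but the schema dropped.
        "stale_in_seed": sorted(seed - schema),
    }


def has_drift(schema_columns, seed_columns):
    """True if the seed script and schema disagree on any column."""
    report = schema_drift(schema_columns, seed_columns)
    return bool(report["missing_from_seed"] or report["stale_in_seed"])
```

Failing the pipeline when `has_drift` returns true turns silent seed decay into an explicit, early signal instead of a false test failure later.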

R15

Integration Sandbox Availability

Medium Likelihood Medium Impact

Risk: Not all 22 integrations offer sandbox/test environments. Some vendors may charge extra or have limited test APIs.

Mitigation: Integration inventory categorises each service's test strategy. Disabled-by-default for services without sandbox. Confirm sandbox access during Week 1 discovery.

24

Summary & Next Steps

Phase 1 Discovery is complete. The architecture is designed, risks are mapped, and hours estimates follow the RFP 9.2 mandatory format. Once approved, the RFP response can be formalised.

Architecture 3-VM Split + SQL MI GP
Weighted Score 4.15 / 5
Estimated Hours 425h (53 days)
IaC Approach Terraform / Bicep
Monthly Run Cost ~£600-700/mo
Target Ready Date Mid-May 2026

Immediate Next Steps

1

Approve Phase 1 findings and architecture recommendation

2

Measure actual production DB sizes to finalise SQL MI storage tier

3

Confirm Azure subscription and Terraform state backend location

4

Verify Telerik licence covers non-production environment use

5

Confirm integration sandbox access with Genesys, Stripe, Chargebee, OpenAI

6

Define seed data scope for Testing Mode synthetic database content

7

Prioritise audit table implementation — which tables in Week 4 vs later

8

Confirm RBAC personas and PIM approval chains with client IT team

9

Formalise RFP response to client once Phase 1 is signed off

Infrastructure Engineering Team — March 2026