AI Auditing Tools & Techniques (21% of exam)
Domains 1 and 2 taught you what AI is and how it is built and run. Domain 3 is about doing the audit: how to scope and plan it, how to design tests and sample, how to gather evidence that holds up, how to use data analytics, and how to report findings that change behavior, all while protecting your independence.
Audit Planning & Design D3 · A
How to scope, risk-rate, set objectives and criteria, break the AI system into auditable units, fix materiality, and build a risk-based audit program: the phase that decides whether the rest of the audit is even pointed at the right target.
Domain 3 loves "what should the auditor do first?" and the answer is almost always a planning act: understand the use case, assess its risk tier, set criteria. You cannot test what you have not scoped, and you cannot conclude against criteria you never agreed.
1.1 Scoping the AI audit
Scoping defines the boundary of the engagement: which AI system(s), which environments (training, staging, production), which time period, and which auditable units are in and out. For AI this is harder than a classic app because the "system" sprawls across data pipelines, a model, MLOps infrastructure, governance processes, and downstream decisions, and a third-party foundation model may sit at its core.
Auditor's angle: a vague scope is the root of most failed AI audits: you either boil the ocean or miss the unit that mattered. The control to look for is a documented, risk-aligned scope agreed in the engagement letter; the evidence is the model inventory entry, architecture diagram, and data-flow map you use to draw the boundary.
- Fix the system identity precisely: name, version family, owner, and the decision it drives.
- State exclusions explicitly (e.g., "the third-party LLM's internal weights are out of scope; its integration and outputs are in scope").
Scoping an audit of a résumé-screening model
HR has deployed a model that ranks job applicants. You are asked to "audit the AI." That phrasing is uselessly broad; you must convert it into a defensible scope.
- Pull the model inventory record
Confirm the exact system, its owner, the business decision it influences (shortlisting), and whether it is internal or a vendor product.
- Map the data flow end to end
Trace from applicant data ingestion → feature engineering → model scoring → recruiter dashboard, listing every component you could audit.
- List candidate auditable units
Data, model, pipeline/MLOps, governance, outputs/use, monitoring; note which of these the team can credibly evaluate.
- Risk-rank the units and draw the line
Put fairness of outputs and training-data representativeness in scope; agree that the cloud vendor's SOC 2 covers raw infrastructure, marking it out of scope but relied upon.
- Set the period and the version
Scope production scoring from Jan to Jun 2026 against model version v4.2, the version actually live now.
- Record it in the engagement letter
Document scope, exclusions, and reliance on third-party assurance so there is no dispute later.
1.2 Understanding the use case & its risk tier
Before any control is touched, understand what the AI actually does and what it decides: the population affected, the consequence of a wrong output, and the degree of human oversight (human-in-the-loop, on-the-loop, or fully automated). This maps to the risk-tier language from Domain 1 (e.g., EU AI Act prohibited / high-risk / limited / minimal), and the tier should drive audit depth.
Auditor's angle: matching audit effort to risk tier is risk-based planning. The risk is over-auditing a marketing-copy generator while under-auditing a credit-decisioning model. Look for an approved risk classification in the model inventory; if none exists, that absence is itself a governance finding.
- A model that recommends to a human is lower risk than one that decides autonomously.
- Consequence severity (credit, health, employment, safety) outranks transaction volume when tiering.
Tiering an automated credit-decline model
A bank deploys a model that automatically declines credit-card applications with no human review. You must establish its risk tier to size the audit.
- Identify the decision and its automation level
It is a fully automated adverse decision affecting individuals, with no human-in-the-loop on declines.
- Identify who is affected and the harm
Applicants denied access to credit; harm is financial exclusion plus potential discrimination.
- Map to a regulatory tier
Automated creditworthiness assessment is named high-risk under the EU AI Act and is regulated (e.g., adverse-action/fair-lending rules).
- Confirm the org's internal classification
Check the inventory: does the bank itself rate it high-risk? Compare with your assessment.
- Set audit depth from the tier
High-risk → deep testing of bias, robustness, governance gates, explainability, and monitoring, not a light-touch review.
1.3 Audit objectives & criteria
Every engagement needs objectives (what the audit will conclude on) and criteria (the measurable benchmark). For AI, criteria come from internal AI policy and model-risk standards, regulation (EU AI Act, sector rules), recognized frameworks (NIST AI RMF, ISO/IEC 42001), and contractual/SLA commitments. Without explicit, agreed criteria there is no objective basis for a finding.
Auditor's angle: "the model feels biased" is an opinion; "the protected group's selection rate was 0.62 of the reference group's rate, below the org's own 0.80 four-fifths threshold" is a finding. The independence trap: the auditor agrees and applies criteria but must not invent the control standard management should have set, because doing so means auditing your own benchmark next year.
- Criteria must be defined before testing, in writing, and agreed with management.
- If no criterion exists, the missing criterion is a reportable governance gap, not something you fill in.
"Just tell us if the model is fair"
During planning for a hiring-model audit, management says "just tell us if it's fair." Fairness is not auditable without a benchmark.
- Ask for the existing fairness criterion
Request the policy, threshold, and the protected attributes management has defined.
- Surface the gap if there is none
If no metric or threshold exists, note that absence; it is a governance finding in its own right.
- Propose recognized criteria for agreement
Point to a four-fifths/adverse-impact rule and a chosen fairness metric drawn from policy and regulation, and have management adopt it.
- Define the objective precisely
"Conclude whether the ratios of the model's selection rates across protected groups meet the agreed 0.80 four-fifths threshold for Jan to Jun 2026 production decisions."
- Record criteria and objective in the engagement letter
Lock them before testing so the conclusion has an objective anchor.
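To make the benchmark concrete, the four-fifths check can be computed in a few lines. The sketch below is illustrative only: the applicant counts are hypothetical (chosen so the ratio lands on 0.62), the function names are not from any library, and 0.80 stands in for whatever threshold management actually agrees.

```python
def selection_rate(selected: int, applicants: int) -> float:
    """Share of applicants in a group who were selected."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return selected / applicants


def impact_ratio(group_rate: float, reference_rate: float) -> float:
    """Ratio of a group's selection rate to the reference group's rate."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return group_rate / reference_rate


# Hypothetical counts, not taken from any real engagement.
reference_rate = selection_rate(selected=200, applicants=400)  # 0.50
protected_rate = selection_rate(selected=31, applicants=100)   # 0.31

THRESHOLD = 0.80  # assumed to be the criterion agreed with management
ratio = impact_ratio(protected_rate, reference_rate)
meets_criterion = ratio >= THRESHOLD

print(f"impact ratio = {ratio:.2f}; meets {THRESHOLD:.2f} threshold: {meets_criterion}")
```

The point of the sketch is the order of operations: the threshold is a fixed input agreed before testing, and the auditor only computes and compares.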
1.4 Identifying auditable units
A defining Domain 3 skill is breaking an AI system into its auditable units so nothing material is missed: data, model, infrastructure/MLOps, governance, outputs/use, and monitoring. Each unit carries its own risks, controls, and evidence.
| Auditable unit | What lives here | What the auditor looks for |
|---|---|---|
| Data | Training/validation/test/production data, labels, features | Quality, representativeness, lineage, privacy basis, bias, drift |
| Model | Algorithm, weights, hyperparameters, versions | Validation evidence, performance vs criteria, explainability, robustness, pinned version |
| Infra / MLOps | Compute, pipelines, CI/CD, APIs, third-party models | Access control, change mgmt, segregation of duties, vendor assurance |
| Governance | Policy, approval gates, roles, inventory, risk acceptance | Sign-offs, committee minutes, an accurate model inventory |
| Outputs & use | Predictions, decisions, overrides, downstream actions | Output quality, override/appeal, decision logging |
| Monitoring | Drift/fairness/performance dashboards, alerts, owners | Alerts configured, fired, investigated, and actioned |
Auditor's angle: the risk is a "complete-looking" audit that silently omitted a unit (commonly monitoring or third-party models). The control is a unit-by-unit risk-and-control matrix; the evidence is that each in-scope unit has at least one designed procedure.
Decomposing a fraud-detection model
You must ensure your fraud-model audit covers everything material. Build the auditable-unit breakdown.
- Enumerate the six units
Data, model, infra/MLOps, governance, outputs/use, monitoring.
- Attach a top risk to each
E.g., data → unrepresentative training set; monitoring → drift alerts that nobody owns.
- Map the expected control to each risk
Data → representativeness checks; monitoring → alert thresholds with named owners and escalation.
- Note the evidence to request per unit
Datasheet, validation report, change tickets, approval minutes, decision logs, monitoring exports.
- Risk-rank units to allocate effort
Spend most hours on outputs/fairness and monitoring; rely on vendor SOC report for base infra.
- Confirm no unit is silently dropped
Have a reviewer challenge the matrix specifically for omitted units before fieldwork.
1.5 Materiality & the audit program
Materiality for AI is rarely a dollar figure; it is the threshold at which an issue becomes significant enough to report, expressed as number of affected individuals, severity of a wrong decision, regulatory exposure, or a tolerated error/bias rate. The audit program then translates objectives and risks into specific procedures, evidence to obtain, sample sizes, and timing. The engagement letter also fixes the engagement type (assurance vs advisory).
Auditor's angle: without a materiality definition you cannot rate findings or decide what to report; the control to look for is a documented materiality basis tied to harm and regulation, and an audit program that traces every procedure back to a risk and criterion.
- For AI, "qualitative materiality" (harm to a protected group, safety) often dominates quantitative thresholds.
- The program should state, for each procedure, the assertion tested and the evidence sought.
Setting materiality for a medical-triage model
A model triages emergency-room patients by acuity. There is no dollar figure that captures the risk; you must define materiality another way.
- Identify the harm dimension
A wrong "low acuity" output can delay urgent care: a patient-safety harm, not a financial one.
- Translate harm into a threshold
Define any systematic mis-triage of a high-acuity case as material, and a false "low" rate above the clinically agreed tolerance as reportable.
- Add a population dimension
Treat disparities across age/ethnicity subgroups above the agreed margin as material regardless of count.
- Tie procedures to the threshold in the program
Write a procedure to compute subgroup false-negative rates against the clinical tolerance.
- Document the basis in the engagement letter
Record the materiality rationale so finding severity is defensible later.
1.6 Competencies & when to use specialists
You do not need to be a data scientist, but you must understand the system well enough to audit it. Where the team cannot competently evaluate a fairness metric, a robustness test, or a model's statistics, the auditor should engage a specialist (internal data-science assurance or an external expert) and remain responsible for scoping their work and integrating their conclusions.
Auditor's angle: pretending to assess something you cannot understand is an audit-quality failure; so is letting a specialist's opinion pass unchallenged. The control is a documented competency assessment during planning; the evidence is the specialist's terms of reference and the auditor's evaluation of their work.
- The auditor remains accountable for the conclusion even when a specialist did the technical test.
- A specialist informs the audit; they do not relieve the auditor of judgment or independence.
The team without ML skills
Your team has deep control-testing experience but little ML background, and the engagement requires evaluating model robustness and a fairness metric. How do you proceed competently?
- Assess the skills gap in planning
Document that fairness/robustness evaluation exceeds current team competence.
- Decide the response
Engage an internal data-science assurance specialist (independent of the model team) or an external expert.
- Scope the specialist's work
Write terms of reference: which model version, which metric, which thresholds, what deliverable.
- Preserve their independence
Confirm the specialist did not build or tune the model under audit.
- Evaluate, don't just accept, their output
Challenge assumptions, confirm the data and version match scope, and integrate the result into your conclusion.
- Stay accountable
The audit opinion remains the auditor's, supported by, not delegated to, the specialist.
Audit Testing & Sampling Methodologies D3 · B
Designing tests of design versus operating effectiveness for AI controls, choosing statistical versus judgmental sampling, building representative/stratified samples of outputs and data, reperforming model results, testing governance and monitoring gates, and getting frequency and timing right for a system that drifts.
2.1 Tests of design vs operating effectiveness
The bedrock distinction the exam tests constantly: a test of design (TOD) asks "is the control capable of working if it operates as intended?"; a test of operating effectiveness (TOE) asks "did the control actually work, consistently, over the whole period?" Both apply to AI controls.
| AI control | Test of design | Test of operating effectiveness |
|---|---|---|
| Pre-deployment approval gate | Policy requires validation + independent sign-off before release? | For a sample of releases, was each validated and signed off before go-live? |
| Bias/fairness review | A metric, threshold, and review step are defined? | Reperform/inspect fairness results; confirm reviews happened and breaches were actioned |
| Drift monitoring | Metrics, thresholds, alerts, owners defined? | Inspect logs over the period; confirm alerts fired and were investigated |
Auditor's angle: the classic trap is concluding from design alone. "There is a drift dashboard" proves design; it does not prove anyone read it or acted on an alert. Always push to operating effectiveness with period-spanning evidence.
A policy or dashboard existing is design evidence. Proof it operated all period is operating effectiveness. When an answer choice stops at "the control exists," it is usually the distractor.
The approval gate that may not have operated
Policy requires every model release to be validated and independently signed off before go-live. You must test whether this works.
- Run the test of design first
Read the policy: confirm it mandates validation + independent sign-off pre-release, with a named approver role.
- Pull the population of releases
List every model deployment in the audit period from the deployment/CI-CD log.
- Select releases to test
Sample (or take all) releases, including any emergency/hotfix deployments, since those are where gates get skipped.
- Compare timestamps
For each, verify the validation report date and sign-off date both precede the go-live timestamp.
- Check independence of the approver
Confirm the signer was not the model's developer.
- Conclude on operating effectiveness
If two emergency releases went live before sign-off, the control is well-designed but not operating effectively, which is a reportable finding.
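The timestamp and independence checks in these steps lend themselves to a simple script over the release population. The following is a minimal sketch under stated assumptions: the release records, field names, and people are all hypothetical, and a real deployment log would need to be extracted and reconciled for completeness first.

```python
from datetime import datetime

# Hypothetical release records; field names and values are illustrative only.
releases = [
    {"id": "rel-01", "developer": "alice", "approver": "bob",
     "validated": "2026-02-01T10:00", "signed_off": "2026-02-02T09:00",
     "go_live": "2026-02-03T08:00"},
    {"id": "rel-02-hotfix", "developer": "alice", "approver": "carol",
     "validated": "2026-03-10T12:00", "signed_off": "2026-03-11T15:00",
     "go_live": "2026-03-10T18:00"},
    {"id": "rel-03", "developer": "dave", "approver": "dave",
     "validated": "2026-04-01T09:00", "signed_off": "2026-04-02T09:00",
     "go_live": "2026-04-03T09:00"},
]


def gate_exceptions(release: dict) -> list:
    """Return the reasons a release fails the approval gate (empty if it passes)."""
    ts = {k: datetime.fromisoformat(release[k])
          for k in ("validated", "signed_off", "go_live")}
    issues = []
    if ts["validated"] >= ts["go_live"]:
        issues.append("validation not before go-live")
    if ts["signed_off"] >= ts["go_live"]:
        issues.append("sign-off not before go-live")
    if release["approver"] == release["developer"]:
        issues.append("approver is the developer")
    return issues


# Keep only the releases with at least one exception.
exceptions = {}
for release in releases:
    issues = gate_exceptions(release)
    if issues:
        exceptions[release["id"]] = issues

for release_id, issues in exceptions.items():
    print(release_id, "->", "; ".join(issues))
```

Running the check over every release, rather than a sample, is often feasible because deployment logs are small; each exception still needs to be followed up with the underlying tickets before it becomes a finding.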
2.2 Statistical vs judgmental sampling
Statistical sampling uses random selection and lets you project results to the whole population with measurable confidence; use it when you must quantify an error or bias rate defensibly. Judgmental (non-statistical) sampling targets high-risk items (edge cases, the protected subgroup, post-retrain decisions); it is efficient but not statistically projectable.
Auditor's angle: the method must match the conclusion. If you intend to state "the error rate is X% with 95% confidence," you need a statistical sample; if you only want to probe whether a specific risk exists, judgmental selection is fine, but you must not over-claim projectability from a judgmental sample.
- Statistical → defensible quantified rate; needs proper sample-size calculation.
- Judgmental → fast risk probing; cannot be extrapolated to the population.
Choosing the sampling method for a claims model
An insurer's model auto-approves low-value claims. You must report a defensible exception rate to the audit committee.
- Define the conclusion you must support
A projectable statement: "the rate of wrongly auto-approved claims is X% ± margin."
- Choose the method to fit it
That conclusion requires statistical sampling, not a hand-picked set.
- Set confidence and tolerable error
Pick 95% confidence and the tolerable error tied to materiality, then compute sample size.
- Select randomly from the full population
Draw the sample with a documented random method from all in-period approvals.
- Add a judgmental layer for known risk
Separately hand-pick edge cases (just-below-threshold claims) to probe a specific risk, reported as targeted testing, not projected.
- Report each correctly
Project the statistical result; describe the judgmental findings as illustrative, not population estimates.
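The "compute sample size" step can be illustrated with one common textbook formula for estimating a proportion, with a finite-population correction. This is a sketch, not a prescribed audit method: many audit methodologies use attribute-sampling tables instead, and the population of 20,000 approvals, the ±5% margin, and the conservative 50% expected rate are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(population: int, confidence: float = 0.95,
                               margin: float = 0.05,
                               expected_rate: float = 0.5) -> int:
    """Sample size to estimate a proportion within +/- margin at a confidence level.

    Uses n0 = z^2 * p * (1 - p) / margin^2, then applies the
    finite-population correction n = n0 / (1 + (n0 - 1) / N).
    """
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 < margin < 1 or not 0 < confidence < 1:
        raise ValueError("margin and confidence must be between 0 and 1")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-value
    n0 = (z ** 2) * expected_rate * (1 - expected_rate) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Hypothetical population: 20,000 in-period auto-approved claims.
required = sample_size_for_proportion(population=20_000)
print(f"items to select at random: {required}")
```

Tightening the margin or raising the confidence level increases the required sample quickly, which is why the tolerable error should be tied to materiality before the number is computed.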
2.3 Stratified / representative sampling
The heart of AI sampling is representativeness. A sample of model outputs that omits minority subgroups can hide exactly the bias you are looking for. Stratify the population by subgroup, time period, and decision type so the sample mirrors the risks, not just the volume.
Auditor's angle: a simple random sample drawn from a skewed population reproduces the skew: if the protected group is 3% of records, 60 random items would contain about two of them on average and may contain none, so "no errors found" proves nothing about fairness. The control is stratification aligned to the audit objective; the evidence is the documented sampling plan showing each stratum's coverage.
- Stratify by the dimensions that carry the risk (protected attribute, retrain epoch, decision outcome).
- Ensure each high-risk stratum has enough items to support a conclusion about it.
The unrepresentative loan sample
An auditor randomly sampled 60 of 50,000 loan decisions, found no errors, and concluded the model is fair. A reviewer objects.
- Identify the audit objective
The objective is fairness across protected groups, so the sample must let you conclude per group.
- Examine the population composition
The protected subgroup is a small fraction, so 60 random items may contain only a handful or none.
- Diagnose the flaw
"No errors found" reflects the absence of the at-risk group from the sample, not fairness.
- Re-design with stratification
Stratify by protected attribute, by time/retrain period, and by decision (approve vs decline).
- Size each stratum to the conclusion
Allocate enough items to the protected-group/decline stratum to compute its outcome rate.
- Re-test and compare subgroup rates
Compute selection rates per stratum against the agreed threshold.
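The redesign can be sketched as follows. Everything here is hypothetical: the 50,000-record population is synthetic with a 3% protected share, the allocation of 100 items per stratum is an arbitrary illustration, and `stratified_sample` is a helper written for this example rather than a library function. The chance of an empty protected group is approximated with a binomial calculation.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed):
    """Draw up to per_stratum records at random from each stratum defined by key."""
    rng = random.Random(seed)  # documented seed so the selection can be re-run
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    return {name: rng.sample(items, min(per_stratum, len(items)))
            for name, items in sorted(strata.items())}


# Synthetic population: 3% protected group, one in four decisions a decline.
population = [
    {"id": i,
     "group": "protected" if i % 100 < 3 else "other",
     "decision": "decline" if i % 4 == 0 else "approve"}
    for i in range(50_000)
]

# Why a simple random sample of 60 is weak for this objective.
protected_share = sum(r["group"] == "protected" for r in population) / len(population)
expected_in_srs = 60 * protected_share      # average protected items in the sample
p_none = (1 - protected_share) ** 60        # approximate chance of zero protected items

# Stratify by protected attribute and decision outcome.
sample = stratified_sample(population,
                           key=lambda r: (r["group"], r["decision"]),
                           per_stratum=100, seed=2026)

print(f"expected protected items in a random 60: {expected_in_srs:.1f}")
print(f"approximate chance of none: {p_none:.0%}")
for stratum, items in sample.items():
    print(stratum, len(items))
```

Because each stratum gets the same number of items regardless of its size, per-stratum rates can be compared directly, but any population-wide estimate would need to be weighted back by stratum size.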
2.4 Reperformance of model results
Reperformance (independently re-running the model or a control and comparing results) is the most powerful AI test: re-score a sample of inputs and check the outputs match what production produced and what validation reported. It is also where reproducibility bites: if you cannot reproduce a result because the model version, data, or random seed was not pinned, that inability is a finding.
Auditor's angle: reperformance is auditor-generated evidence, so it sits at the top of the reliability hierarchy. The risk is non-reproducibility masking poor control; the evidence to demand is a pinned model version, a data snapshot, and a runnable environment.
- Match auditor-reproduced outputs against both production logs and the validation report.
- Treat "we can't reproduce it" as a control finding, not a footnote.
Reperforming a fairness metric
Management reports the model's fairness metric is within tolerance. You want auditor-generated proof rather than trusting the number.
- Pin the exact model version
Obtain version v4.2 with its weights, hyperparameters, and the seed used.
- Obtain the data snapshot
Get the exact validation dataset (hash recorded) management used for the reported metric.
- Recreate the environment
Stand up the pinned runtime/dependencies so the run is reproducible.
- Re-score and recompute the metric
Independently compute the fairness metric per subgroup.
- Compare three numbers
Your result vs production logs vs management's validation report.
- Conclude on reproducibility too
If you cannot reproduce because the version/data/seed was not pinned, raise a reproducibility finding regardless of the number.
2.5 Testing governance gates & monitoring controls
Testing governance gates means confirming approvals genuinely preceded deployment and were given by an independent, authorized role. Testing monitoring controls means confirming alerts are configured, fire when thresholds breach, and are acted on, not merely that a dashboard exists.
Auditor's angle: both are prime "design vs operating effectiveness" territory. The risk is a beautiful dashboard nobody watches and an approval workflow that is bypassed for "urgent" releases. The evidence is alert-fire records tied to investigation tickets, and approval timestamps preceding go-live.
- For monitoring, trace a real breach all the way to its investigation and resolution.
- For gates, test the exceptions (emergency changes) where controls usually fail.
Does the drift alert actually do anything?
A team proudly shows a drift-monitoring dashboard. You must test whether the monitoring control operates, not just exists.
- Confirm thresholds and ownership
Verify a drift threshold, an alert, and a named owner with an escalation path are defined.
- Find a period where drift breached
Inspect the monitoring history for an actual threshold breach.
- Confirm the alert fired
Check the alerting log that a notification was generated at that breach.
- Trace to investigation
Follow the alert to an incident/ticket showing someone investigated.
- Confirm action taken
Verify the outcome β retrain, rollback, or accepted risk β was recorded.
- Conclude
If breaches fired alerts that nobody actioned, the control fails operating effectiveness despite the dashboard.
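The breach-to-resolution trace can be expressed as a join across three evidence sets. The sketch below assumes hypothetical exports: a daily drift score history, an alert log keyed by date, and a ticket register keyed by alert ID. The threshold value and all identifiers are invented for illustration.

```python
DRIFT_THRESHOLD = 0.20  # assumed threshold taken from the monitoring configuration

# Hypothetical monitoring export: one drift score per day.
metric_history = [
    {"date": "2026-03-01", "drift_score": 0.05},
    {"date": "2026-03-15", "drift_score": 0.27},
    {"date": "2026-04-20", "drift_score": 0.31},
    {"date": "2026-05-05", "drift_score": 0.24},
]
# Hypothetical alert log (date -> alert ID) and ticket register (alert ID -> ticket).
alerts = {"2026-03-15": "ALERT-101", "2026-04-20": "ALERT-102"}
tickets = {"ALERT-101": {"ticket": "INC-9", "outcome": "retrain"}}


def trace_breaches(history, alerts, tickets, threshold):
    """For each threshold breach, report how far the control chain got."""
    results = []
    for row in history:
        if row["drift_score"] <= threshold:
            continue  # not a breach, nothing to trace
        alert_id = alerts.get(row["date"])
        ticket = tickets.get(alert_id) if alert_id else None
        if alert_id is None:
            status = "no alert fired"
        elif ticket is None:
            status = "alert not investigated"
        elif not ticket.get("outcome"):
            status = "no recorded outcome"
        else:
            status = "actioned"
        results.append((row["date"], status))
    return results


trace = trace_breaches(metric_history, alerts, tickets, DRIFT_THRESHOLD)
for breach_date, status in trace:
    print(breach_date, status)
```

Any status other than "actioned" is a lead for the operating-effectiveness conclusion, to be confirmed by inspecting the underlying alert and ticket records.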
2.6 Frequency & timing of testing
Because models drift and are retrained, point-in-time testing ages quickly. The exam expects you to recognize that AI often calls for more frequent or continuous testing, and that timing matters: test the model version actually in production now, and align sample periods with retraining events.
Auditor's angle: the risk is concluding on a version that has since been replaced, or sampling a period that predates a retrain that changed behavior. The control is testing cadence tied to the retrain schedule; the evidence is alignment between your sample window and the deployment/retrain log.
- Always confirm which version is live today before concluding.
- Anchor sample periods to retrain boundaries so before/after behavior is comparable.
Testing the wrong version
An auditor validated model v3.0 in March and is writing the conclusion in June. You suspect a timing problem.
- Check the current production version
Query the deployment log: production now runs v4.2, retrained in May.
- Identify the gap
The March testing concluded on a version no longer in use.
- Re-scope to the live version
Re-aim testing at v4.2, the version actually making decisions.
- Align the sample period to the retrain
Sample post-May decisions so results reflect current behavior.
- Set a cadence going forward
Recommend testing at each retrain plus at least annually, given drift risk.
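The version and sample-window checks in this example can be automated against the deployment log. This is a sketch under assumptions: the log entries and exact dates are invented (the scenario gives only the months), and the helper names are hypothetical.

```python
from datetime import date

# Hypothetical deployment log: (version, go-live date). Exact dates are invented.
deployment_log = [("v3.0", date(2026, 1, 10)), ("v4.2", date(2026, 5, 12))]


def live_version(log, on):
    """Return the most recently deployed version on or before the given date."""
    eligible = [(went_live, version) for version, went_live in log if went_live <= on]
    if not eligible:
        raise ValueError("no version was live on that date")
    return max(eligible)[1]


def window_spans_retrain(log, start, end):
    """True if a deployment falls inside (start, end], so the sample mixes versions."""
    return any(start < went_live <= end for _, went_live in log)


tested_version = "v3.0"
conclusion_date = date(2026, 6, 15)

current = live_version(deployment_log, conclusion_date)
stale = tested_version != current

# A March-to-June window straddles the May retrain; a post-retrain window does not.
mixed = window_spans_retrain(deployment_log, date(2026, 3, 1), conclusion_date)
clean = window_spans_retrain(deployment_log, date(2026, 5, 13), conclusion_date)

print(f"live version: {current}; tested version stale: {stale}")
print(f"Mar-Jun window spans a retrain: {mixed}; post-retrain window: {clean}")
```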
Audit Evidence Collection Techniques D3 · C
What counts as sufficient and appropriate evidence for AI, the reliability hierarchy (reperformance > inspection > observation > inquiry), the rich new AI evidence sources, why you must pin the exact model version and data snapshot, and how reproducibility itself becomes evidence.
3.1 Sufficient & appropriate evidence for AI
The standard is unchanged: evidence must be sufficient (enough of it) and appropriate (relevant + reliable). A conclusion is only as good as the evidence behind it, and one weak source rarely supports a strong opinion.
Auditor's angle: the risk for AI is concluding on thin or self-reported evidence (a single interview, a self-authored model card). The control mindset is to combine sources so that relevance and reliability are both met, and to corroborate anything self-reported.
- Sufficiency = quantity and coverage; appropriateness = relevance + reliability.
- More high-quality evidence is needed where risk and materiality are higher.
Is one report enough?
An auditor plans to conclude that a high-risk model is robust based solely on a vendor's marketing whitepaper. You assess sufficiency and appropriateness.
- Judge relevance
Does the whitepaper address this deployment, version, and use case? Marketing claims rarely do.
- Judge reliability
It is vendor self-promotion β low reliability, not independent.
- Judge sufficiency for the risk tier
A high-risk conclusion needs more and stronger evidence than one document.
- Add corroborating sources
Obtain an independent validation/robustness report and reperform a robustness test on the pinned version.
- Re-evaluate the combined evidence
Conclude only when relevant, reliable evidence is sufficient for the risk.
3.2 The evidence reliability hierarchy
The four classic techniques all apply, and the exam wants you to rank their reliability and pick the right one for the assertion: reperformance (strongest: the auditor generates it) > inspection (documents/records) > observation (only proves it happened while you watched) > inquiry (weakest: must be corroborated).
| Evidence | Reliability | What it proves (and its limit) |
|---|---|---|
| Auditor reperformance of scoring | Highest | The model produces claimed outputs; only for the version/data you ran |
| Version control / git history | High | What changed, when, by whom; lineage of code/config |
| Independent validation report | High | Met criteria at validation; may be stale now |
| Approval / sign-off records | High | A gate was passed; authorization, not model quality |
| Monitoring dashboard export | Medium to High | Metrics over time; only if computation is trusted and alerts actioned |
| Model card / datasheet | Medium | Intended use, data, limits; self-reported, verify |
| Interview with the team | Lowest | Context and leads; must be corroborated |
Auditor's angle: when two pieces of evidence conflict, prefer the more reliable source. The trap is resting a conclusion on inquiry alone.
Auditor-produced (reperformance) > independent systems/third parties (logs, git, SOC reports) > auditee's own documents (model cards) > what someone says (interviews).
"The model is fair: the team told us"
An auditor's only evidence that a model meets the fairness threshold is an interview in which the lead data scientist said the metric was within tolerance.
- Classify the evidence
This is inquiry, the weakest and least reliable form.
- Check appropriateness for the assertion
A fairness conclusion needs reliable, ideally auditor-generated evidence; inquiry alone is not appropriate.
- Inspect to corroborate
Obtain and read the independent validation report showing the metric and threshold.
- Reperform if material
Recompute the fairness metric on a pinned version and data snapshot.
- Resolve any conflict by reliability
If reperformance disagrees with the interview, the reperformed result governs.
3.3 AI-specific evidence sources
AI adds rich new evidence sources the auditor must know how to use and verify: model cards, datasheets for datasets, training logs, validation reports, monitoring dashboards, version-control history, approval records, data lineage, and decision logs.
Auditor's angle: each source proves something different and has a different reliability: a model card states intended use (self-reported), while git history and decision logs are independent system records. The risk is treating a self-authored artifact as proof of reality; verify the card against logs and reperformance.
Verifying a model card against reality
A model card claims the model was trained only on post-2023 data and is "not used for high-stakes decisions." You must test those claims.
- Read the card's specific claims
Note the asserted training-data date range and the intended-use statement.
- Corroborate the data range with training logs
Inspect training logs/lineage to confirm the actual date range of inputs.
- Check actual use against decision logs
Inspect production decision logs to see whether outputs feed a high-stakes workflow despite the card.
- Cross-check the version
Confirm the card corresponds to the deployed version via version history.
- Resolve discrepancies
If logs show pre-2023 data or high-stakes use, the card is inaccurate, which is a finding about governance accuracy.
3.4 Pinning the model version & data snapshot
The AI-specific habit that separates a credible audit from a weak one: record exactly which model version, data snapshot, and configuration you tested, with hashes or version IDs. Because models are retrained and inputs shift, evidence is meaningless without pinning the version.
Auditor's angle: the risk is "evidence drift": your finding refers to a model that no longer exists, so it cannot be defended or remediated. The control is version pinning; the evidence is a documented version ID and data-snapshot hash attached to every test result.
- Capture version ID, data-snapshot hash, config, and (where relevant) the random seed.
- Tie every workpaper result to the exact artifacts it was produced from.
Pinning before you test
You are about to reperform scoring on a recommendation model that retrains nightly. Without pinning, your result will be unrepeatable.
- Record the production version ID
Capture the exact deployed version/build identifier at the moment of test.
- Snapshot and hash the input data
Freeze the input dataset and record its hash so it cannot silently change.
- Record the configuration and seed
Note hyperparameters, feature config, and any random seed.
- Attach these to the workpaper
Stamp the version ID and data hash onto the test evidence.
- Re-run and confirm match
Reproduce the result against the pinned artifacts and confirm consistency.
- Reference the pin in the finding
Any finding cites the precise version and snapshot it relates to.
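One way to record the pin is a small "evidence stamp" built from cryptographic hashes. The sketch below uses Python's standard `hashlib` and `json` modules; the stamp layout, the function names, the build identifier, and the sample data are assumptions for illustration. For large files, the data would normally be read and hashed in chunks rather than held in memory.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def evidence_stamp(model_version: str, data: bytes, config: dict, seed: int) -> dict:
    """Bundle the identifiers that tie a test result to exact artifacts."""
    # Serialize the config with sorted keys so the same settings always hash the same.
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "model_version": model_version,
        "data_sha256": sha256_hex(data),
        "config_sha256": sha256_hex(config_bytes),
        "seed": seed,
    }


# Hypothetical frozen input snapshot and configuration.
snapshot = b"user_id,item_id,score\n1,42,0.91\n2,17,0.33\n"
config = {"top_k": 10, "feature_set": "v7"}

stamp = evidence_stamp("build-2026-06-01", snapshot, config, seed=1234)
print(json.dumps(stamp, indent=2))

# Re-hashing later detects any silent change to the snapshot.
tampered = snapshot + b"3,99,0.50\n"
unchanged = sha256_hex(snapshot) == stamp["data_sha256"]
changed = sha256_hex(tampered) != stamp["data_sha256"]
```

Attaching the stamp to each workpaper gives a reviewer a mechanical way to confirm that a re-run used the same artifacts as the original test.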
3.5 Reproducibility as evidence
Reproducibility itself is evidence. If the team can hand you a pinned environment and you reproduce their validation results, that strongly corroborates their governance. If results cannot be reproduced, the inability is a finding about controls, not a footnote.
Auditor's angle: reproducibility tests the maturity of the MLOps and governance controls behind the model, not just the number. The risk is a team that cannot recreate its own results, signaling weak versioning and change control. The evidence is a successful (or failed) independent reproduction.
- Successful reproduction corroborates governance; failed reproduction is a control finding.
- Ask the team to reproduce their own result first; inability is itself revealing.
When results can't be reproduced
You ask the data-science team to reproduce the validation accuracy they reported six months ago. They cannot: the data and code state were never captured.
- Request the pinned artifacts
Ask for the model version, data snapshot, config, and environment from the original validation.
- Attempt reproduction
Re-run the validation procedure with whatever the team provides.
- Document the failure cause
Establish that the version/data/seed was never pinned, so the original result cannot be recreated.
- Frame it as a control finding
Weak reproducibility = weak versioning/change control over the model lifecycle.
- Assess the downstream effect
Note that prior governance sign-offs rested on results that cannot now be verified.
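One way to make the outcome of a reproduction attempt consistent across engagements is a small decision helper like the sketch below. The artifact names, the 0.005 tolerance, and the example metric values are assumptions chosen for illustration; the actual tolerance would be agreed during planning.

```python
from typing import Optional

# Illustrative labels for the artifacts requested from the team.
REQUIRED_ARTIFACTS = ("model_version", "data_snapshot", "config", "environment")


def assess_reproduction(
    artifacts: dict,
    reported: float,
    reproduced: Optional[float],
    tolerance: float = 0.005,  # assumed acceptable gap; set per engagement
) -> str:
    """Classify a reproduction attempt as corroboration, discrepancy, or control finding."""
    missing = [name for name in REQUIRED_ARTIFACTS if not artifacts.get(name)]
    if missing:
        return "control finding: cannot reproduce, missing " + ", ".join(missing)
    if reproduced is None:
        return "control finding: reproduction did not complete"
    if abs(reported - reproduced) <= tolerance:
        return "corroborated: reproduced within tolerance"
    return "discrepancy: reproduced value differs, investigate"


# Hypothetical case matching the scenario: data and config were never captured.
never_pinned = {"model_version": "v4.2", "data_snapshot": None, "config": None, "environment": "py3.11"}
print(assess_reproduction(never_pinned, reported=0.91, reproduced=None))

# Hypothetical contrast case where everything was pinned.
fully_pinned = {"model_version": "v4.2", "data_snapshot": "snap-01", "config": "cfg-01", "environment": "py3.11"}
print(assess_reproduction(fully_pinned, reported=0.91, reproduced=0.912))
```

The missing-artifact branch is checked first on purpose: when nothing was pinned, the finding is about versioning and change control, whatever number a re-run happens to produce.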
Audit Data Quality & Data Analytics D3 · D
Assessing the quality of the data the AI relies on, using CAATs and full-population testing, deciding when sampling versus analytics is appropriate, using AI/ML to assist the audit while managing its risks (over-reliance, confidentiality, explainability of your own tools), and standing up continuous auditing.
4.1 Assessing the AI's data quality
Garbage in, garbage out is a control objective here. Auditors assess the data the model trains and runs on against the classic dimensions: completeness (missing records, fields, periods; missing data on a subgroup bakes in bias), accuracy (do values and labels reflect reality; mislabeled training data corrupts the model silently), validity (formats, ranges, business rules), and timeliness (stale data + a shifting world = drift).
Auditor's angle: data quality is upstream of every model risk, so a weak dataset undermines even a well-governed model. The control is documented data-quality checks in the pipeline; the evidence is reconciliation, profiling, and label-audit results, which the auditor can reperform.
- Subgroup completeness is a fairness issue, not just a data-hygiene issue.
- Label accuracy is the silent killer: sample and re-check labels against ground truth.
Auditing training-data quality
A fraud model underperforms on a particular region. You suspect a data-quality root cause and test the four dimensions.
- Completeness
Profile records by region/subgroup; find the suspect region is severely under-represented in training data.
- Accuracy
Sample labels and recheck against confirmed fraud outcomes; measure mislabel rate.
- Validity
Run range/format checks; flag impossible values (e.g., negative transaction amounts) that skew features.
- Timeliness
Check the data's recency against the current fraud patterns the model must catch.
- Link defects to model behavior
Connect the under-representation to the regional underperformance.
- Conclude
Raise a data-completeness finding driving biased outcomes, with evidence from profiling and label audit.
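The four checks in this walkthrough can be reperformed with very little code. The sketch below uses ten made-up records (regions, amounts, labels, confirmed outcomes, and dates are all fabricated for illustration) and hypothetical function names; a real test would run the same logic over the full training extract.

```python
from collections import Counter
from datetime import date

FIELDS = ("region", "amount", "label", "confirmed", "txn_date")
# Made-up records: (region, amount, training label, confirmed outcome, date).
RAW = [
    ("north", 120.0, 0, 0, date(2024, 5, 2)),
    ("north", 75.5, 0, 0, date(2024, 4, 11)),
    ("north", 310.0, 1, 1, date(2024, 3, 9)),
    ("north", 42.0, 0, 0, date(2024, 2, 20)),
    ("north", 980.0, 1, 1, date(2024, 1, 5)),
    ("north", 15.0, 0, 0, date(2023, 12, 1)),
    ("north", 230.0, 0, 1, date(2023, 11, 17)),  # label disagrees with outcome
    ("north", 64.0, 0, 0, date(2023, 10, 3)),
    ("south", -50.0, 0, 0, date(2022, 6, 1)),    # impossible negative amount
    ("south", 410.0, 0, 1, date(2022, 5, 1)),    # label disagrees with outcome
]
rows = [dict(zip(FIELDS, r)) for r in RAW]


def share_by_group(records, key):
    """Completeness: each group's share of the records."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def mislabel_rate(records):
    """Accuracy: fraction of labels that disagree with the confirmed outcome."""
    return sum(1 for r in records if r["label"] != r["confirmed"]) / len(records)


def invalid_amounts(records):
    """Validity: records that break the business rule amount >= 0."""
    return [r for r in records if r["amount"] < 0]


def stale_share(records, as_of, max_age_days):
    """Timeliness: fraction of records older than the allowed age."""
    stale = sum(1 for r in records if (as_of - r["txn_date"]).days > max_age_days)
    return stale / len(records)


print(share_by_group(rows, "region"))
print(mislabel_rate(rows))
print(len(invalid_amounts(rows)))
print(stale_share(rows, as_of=date(2024, 6, 30), max_age_days=365))
```

In this toy data the under-represented region also carries the invalid and stale records, which is the kind of pattern that links a data defect to the regional underperformance.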
4.2 Full-population testing & CAATs
Computer-assisted audit techniques (CAATs) and analytics let the auditor move beyond samples. Because AI data and decision logs are already digital and large, the auditor can often test the full population (every decision, every record), which gives far stronger coverage of bias and exceptions than a sample. Anomaly analytics (including Benford-style tests) can surface manipulated, fabricated, or out-of-pattern values.
Auditor's angle: sampling is a response to populations you cannot economically test in full; when full-population analysis is feasible, it is more defensible, especially for rare-event bias that a sample would miss. The control/evidence is a documented, repeatable CAAT script run over the complete log.
- Prefer full-population testing when data is digital and complete.
- Full-population analysis catches rare disparities a sample of 60 never would.
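As an illustration of a Benford-style test, the sketch below compares observed leading-digit frequencies with the Benford expectation, log10(1 + 1/d) for digit d. The function names are hypothetical, and powers of two (a sequence known to track Benford's law closely) stand in for real amounts. A large deviation is a lead to investigate, not proof of manipulation, and the test only suits data that spans several orders of magnitude.

```python
import math
from collections import Counter


def first_digit(value) -> int:
    """Leading non-zero digit, read from Python's decimal string form of the number."""
    return int(str(abs(value)).lstrip("0.")[0])


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` under Benford's law."""
    return math.log10(1 + 1 / digit)


def first_digit_deviation(values) -> dict:
    """Observed minus expected share for each leading digit 1..9 (zeros are skipped)."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) - benford_expected(d) for d in range(1, 10)}


# Stand-in population: powers of two instead of real transaction amounts.
sample = [2 ** k for k in range(1, 101)]
deviation = first_digit_deviation(sample)
for digit in range(1, 10):
    print(digit, round(deviation[digit], 3))
```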
Sample vs full population for pricing bias
You must determine whether an automated pricing model ever charged a protected group more than others. Decision logs hold 2 million fully digital records.
- Assess feasibility of full population
The data is digital, complete, and queryable, so full-population testing is feasible.
- Write a CAAT to compute outcomes per group
Script average price and price distribution by protected attribute across all 2M records.
- Run anomaly detection for rare disparities
Flag individual decisions where the protected group was charged outside expected ranges.
- Validate the CAAT logic
Test the script on known cases to confirm it computes correctly.
- Quantify the disparity
Measure the gap across the entire population, not an estimate.
- Conclude with full coverage
Report exact, population-wide results rather than a sample-based projection.
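A CAAT for this scenario can be quite small. The sketch below is one possible shape, with hypothetical function names, made-up group labels and prices, and an assumed expected price range; it is first run on a handful of known cases (the "validate the CAAT logic" step) before being pointed at the full decision log.

```python
from collections import defaultdict
from statistics import mean


def price_by_group(decisions):
    """Summarise every decision by group: count, mean price, and maximum price."""
    grouped = defaultdict(list)
    for group, price in decisions:
        grouped[group].append(price)
    return {g: {"n": len(p), "mean": mean(p), "max": max(p)} for g, p in grouped.items()}


def out_of_range(decisions, low, high):
    """Individual decisions priced outside the expected range (an assumed business rule)."""
    return [(g, p) for g, p in decisions if not low <= p <= high]


# Known cases with hand-checkable answers, used to validate the script itself.
known_cases = [("A", 100.0), ("A", 110.0), ("B", 100.0), ("B", 130.0), ("B", 250.0)]
summary = price_by_group(known_cases)
gap = summary["B"]["mean"] - summary["A"]["mean"]

print(summary)
print(gap)
print(out_of_range(known_cases, low=50.0, high=200.0))
```

Because the same functions then run over all 2 million records, the reported gap is a measured population figure rather than a projection from a sample.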
4.3 Using AI/ML to assist the audit, and its risks
An auditor may use AI/ML tools to triage evidence, summarize documents, or detect anomalies, but doing so introduces its own risks the auditor must manage: over-reliance (treating output as conclusion rather than a lead to verify), explainability of the audit's own tools (if you can't explain why it flagged something, you can't defend the finding), confidentiality (feeding client data into an external model), and bias/error in the tool itself.
The AI tool is an assistant, never the auditor. Document the tool, its limitations, and how its output was validated. AI assists; it does not replace professional judgment or evidence.
Auditor's angle: the trap the exam plants is reporting an AI tool's output as a finding without independent verification. Always verify a sample (or all) of the flags and be able to explain how the tool reached them.
"The auditor used an AI tool that flagged 200 transactions and reported them as exceptions." The trap is over-reliance: the auditor must independently verify the flags and explain the method before reporting.
The over-trusted assistant
An external auditor pastes a client's customer dataset into a public generative-AI chatbot to "find anomalies," then reports the anomalies it returned. Identify and fix the problems.
- Spot the confidentiality breach
Client data was sent to an uncontrolled external model: a serious confidentiality and data-protection failure.
- Spot the over-reliance
The flagged items were reported without independent verification.
- Spot the explainability gap
The auditor cannot explain or reproduce why those items were flagged.
- Switch to an approved, controlled tool
Use an approved tool with appropriate data protection (no client data to public models).
- Validate the tool's output
Independently verify a sample (or all) of the flags against source evidence.
- Retain accountability
Document the tool, its limits, and the validation; the conclusion stays the auditor's.
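The "validate the tool's output" step can be made reproducible by drawing the verification sample with a recorded seed and logging how many flags held up against source evidence. The sketch below assumes 200 flags, a sample of 25, and a seed of 7, all arbitrary illustrative choices; the function names are hypothetical.

```python
import random


def draw_verification_sample(flag_ids, size, seed):
    """Pick flags to verify by hand; the recorded seed lets a reviewer redraw the same sample."""
    rng = random.Random(seed)
    return sorted(rng.sample(flag_ids, min(size, len(flag_ids))))


def confirmation_rate(verified: dict) -> float:
    """Share of hand-verified flags that source evidence confirmed as real exceptions."""
    return sum(verified.values()) / len(verified)


flag_ids = list(range(1, 201))  # stand-in IDs for 200 tool-generated flags
to_verify = draw_verification_sample(flag_ids, size=25, seed=7)
print(to_verify)

# Hypothetical verification results: flag ID -> confirmed against source evidence?
verified = {1: True, 2: False, 3: True, 4: True}
print(confirmation_rate(verified))
```

Only the flags the auditor has personally confirmed belong in the report as exceptions; the confirmation rate documents how reliable the tool proved to be as a source of leads.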
4.4 Continuous auditing / monitoring
Analytics enable continuous auditing/monitoring: automated checks that flag drift, fairness breaches, or anomalous outputs on an ongoing basis rather than once a year. For systems that retrain and drift, point-in-time assurance ages too fast to be enough alone.
Auditor's angle: the risk is annual-only assurance on a model whose behavior changes monthly. The control is auditor-run (or auditor-relied-upon) continuous tests; the evidence is the ongoing exception log and the response to each flag. Independence note: if the auditor relies on management's monitoring, that is testing a control, not the auditor's own continuous audit.
- Continuous auditing suits AI because drift makes one-off testing stale quickly.
- Keep the auditor's continuous tests independent of the controls they assess.
Standing up continuous fairness monitoring
A high-risk lending model retrains monthly. Annual fairness testing is clearly insufficient, so you design continuous auditing.
- Define the continuous test
Automate a monthly subgroup selection-rate computation against the agreed threshold.
- Set the exception trigger
Flag any month where a subgroup breaches the four-fifths threshold.
- Pin versions automatically
Have the test record the live model version and data snapshot each run.
- Route exceptions to the auditor
Send flags to the audit function's log for independent review, separate from management's own monitoring.
- Verify each flag
Investigate flagged months before treating them as findings.
- Report on a rolling basis
Provide ongoing assurance instead of a single stale annual point.
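A minimal version of the monthly test might look like the sketch below. It assumes the four-fifths check is computed as the lowest subgroup selection rate divided by the highest; the group names, rates, version strings, and snapshot IDs are invented for illustration, and `monthly_check` is a hypothetical name.

```python
FOUR_FIFTHS = 0.80  # agreed threshold for the impact ratio


def impact_ratio(selection_rates: dict) -> float:
    """Lowest subgroup selection rate divided by the highest."""
    highest = max(selection_rates.values())
    if highest == 0:
        raise ValueError("no group has a positive selection rate")
    return min(selection_rates.values()) / highest


def monthly_check(month, model_version, data_snapshot_id, selection_rates) -> dict:
    """One run of the continuous test, with the version and snapshot pinned in the record."""
    ratio = impact_ratio(selection_rates)
    return {
        "month": month,
        "model_version": model_version,
        "data_snapshot_id": data_snapshot_id,
        "impact_ratio": round(ratio, 3),
        "exception": ratio < FOUR_FIFTHS,
    }


# Invented monthly inputs for two retrains of the lending model.
runs = [
    ("2024-01", "v4.2", "snap-2024-01", {"group_a": 0.50, "group_b": 0.45}),
    ("2024-02", "v4.3", "snap-2024-02", {"group_a": 0.50, "group_b": 0.38}),
]
audit_log = [monthly_check(*run) for run in runs]
exceptions = [entry["month"] for entry in audit_log if entry["exception"]]
print(exceptions)
```

Each log entry is a flag to investigate, not yet a finding, and the log belongs to the audit function rather than to management's monitoring.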
AI Audit Outputs & Reports D3 · E
Structuring findings with the 4 Cs plus a recommendation, rating severity, reporting AI-specific issues (bias, drift, weak explainability, governance gaps) to technical and non-technical audiences, communicating limitations and residual risk, tracking remediation, distinguishing assurance from advisory, and protecting independence: the auditor never designs or owns controls.
5.1 The 4 Cs finding structure + recommendation
Every well-formed finding has the same anatomy: Condition (what is), Criteria (what should be), Cause (why the gap exists), Effect (so what: the risk/impact), plus a Recommendation owned by management.
| Element | Question | AI example |
|---|---|---|
| Condition | What is | Fraud model live 14 months with no fairness re-test since launch |
| Criteria | What should be | Policy requires a fairness re-test at every retrain and at least annually |
| Cause | Why | No owner assigned for periodic re-testing; monitoring assumed to cover it |
| Effect | So what | Undetected drift could produce biased declines, breaching regulation and harming customers |
| Recommendation | What to do | Management assigns an owner and schedules re-tests each retrain and annually |
Auditor's angle: weak findings stop at condition vs criteria; strong ones nail the root cause (so the fix is real) and the effect (so management cares). The boundary: the auditor recommends; management decides and owns the action.
The half-written finding
A report states: "The recommendation engine has no drift monitoring. We recommend implementing drift monitoring." Management asks why to prioritize it. What's missing?
- Check the elements present
Only condition and a recommendation appear; criteria, cause, and effect are absent.
- Add the criteria
Cite the policy/framework requiring drift monitoring for this risk tier.
- Establish the cause
Identify why it's missing, e.g., no owner, monitoring deprioritized at launch.
- Articulate the effect
"Undetected drift may degrade quality and surface biased/unsafe outputs to customers, with regulatory and reputational exposure."
- Rate severity
Assign high/medium/low from likelihood and impact against materiality.
- Assign ownership
Let management own the remediation and timeline; the auditor only recommends.
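The element check in this walkthrough is mechanical enough to automate as a drafting aid. The sketch below is one hypothetical way to do it: a `Finding` record whose `missing_elements` method lists any of the five parts left blank, applied to the half-written finding from the scenario.

```python
from dataclasses import dataclass, fields


@dataclass
class Finding:
    """The 4 Cs plus a recommendation; empty strings mark elements not yet written."""
    condition: str = ""
    criteria: str = ""
    cause: str = ""
    effect: str = ""
    recommendation: str = ""

    def missing_elements(self) -> list:
        """Names of elements still blank, in the order they should appear."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


half_written = Finding(
    condition="The recommendation engine has no drift monitoring.",
    recommendation="Implement drift monitoring.",
)
print(half_written.missing_elements())
```

A check like this only confirms that each element is present; whether the cause is the real root cause and the effect is stated in terms management cares about remains a matter of judgment.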
5.2 Severity rating & AI-specific issue types
Findings are rated by severity (high/medium/low or critical/significant/minor) from likelihood and impact against materiality. AI introduces issue types you must name and explain: bias/unfairness, drift, lack of explainability, weak or absent governance, data-quality defects, and third-party/model-supply-chain risk.
Auditor's angle: severity must be consistent and tied to the materiality basis set in planning, not to how alarming the issue feels. The risk is rating by gut; the evidence is a rating rubric mapping likelihood × impact to a defined band.
- Tie each rating back to the planning-phase materiality definition.
- Name the AI issue type precisely: "bias" vs "drift" vs "explainability" drive different fixes.
Rating two AI findings consistently
You have two findings: (a) a minor formatting issue in a model card, and (b) a protected group facing higher decline rates on a high-risk lending model. You must rate both defensibly.
- Recall the materiality basis
Materiality is anchored to harm to affected individuals and regulatory exposure.
- Assess likelihood and impact of (a)
Low impact, no harm → minor/low severity.
- Assess likelihood and impact of (b)
High impact (discrimination, regulatory breach), plausible likelihood → critical/high.
- Name the issue types
(a) governance/documentation; (b) bias/unfairness on a high-risk system.
- Apply the rubric, not instinct
Map both via the likelihood × impact rubric for consistency.
- Sequence reporting
Escalate (b) promptly; bundle (a) into routine recommendations.
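To show what "apply the rubric, not instinct" can mean in practice, here is a sketch of a likelihood × impact rubric. The three-point scales and the band cut-offs (6 and above is high, 3 to 4 is medium, otherwise low) are assumptions made for this example; a real rubric would come from the materiality basis agreed in planning. Reading "plausible" likelihood as medium is likewise an interpretive choice.

```python
# Assumed three-point scales; a real rubric is defined in planning.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def rate_severity(likelihood: str, impact: str) -> str:
    """Map likelihood x impact to a severity band using fixed, documented cut-offs."""
    score = LEVELS[likelihood] * LEVELS[impact]
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


# (a) model-card formatting issue: no harm to individuals.
rating_a = rate_severity(likelihood="medium", impact="low")
# (b) higher decline rates for a protected group on a high-risk lending model.
rating_b = rate_severity(likelihood="medium", impact="high")
print(rating_a, rating_b)
```

Running both findings through the same function is what makes the ratings defensible: anyone who disputes a rating has to dispute an input or the rubric, not the auditor's mood.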
5.3 Reporting to technical & non-technical stakeholders
The skill the exam tests is translating AI findings for two audiences: a technical team needs the metric, the model version, and the test detail; the board needs the business risk in plain language ("the model may be systematically declining a protected group, exposing us to regulatory penalty and reputational harm").
Auditor's angle: the risk is a report that is unreadable to the people who must act: technical jargon for the board, or hand-waving for the engineers. The control is a layered report (executive summary + technical detail); the evidence is that both audiences can act on it.
- Same finding, two registers: business impact up, technical specifics down.
- Never bury a critical bias finding in statistics the board cannot parse.
One finding, two audiences
You found that a hiring model's selection rate for women is 0.62 versus 0.90 for men, an impact ratio of about 0.69 (0.62 / 0.90), below the 0.80 four-fifths threshold. You must report to both the data-science team and the board.
- Write the technical version
State the metric, the model version v4.2, the data snapshot, the subgroup rates, and the threshold breached.
- Translate the business impact
"The model appears to disadvantage female applicants, creating discrimination, legal, and reputational risk."
- Layer the report
Lead with an executive summary for the board; attach technical detail as an appendix for engineers.
- State severity in both registers
Critical, expressed as both a metric breach and a business exposure.
- Give each audience its next step
Engineers: investigate root cause; board: oversee remediation and risk decision.
5.4 Communicating limitations & residual risk
An honest AI report states its scope limitations (e.g., "we could not reproduce the May validation results; vendor model internals were not accessible") and the residual risk that remains after recommended actions. Reporting residual risk respects that management, not the auditor, owns the decision to accept it.
Auditor's angle: the risk is an over-stated conclusion that hides what you could not test, exposing the auditor and misleading management. The control is explicit limitation and residual-risk statements; the evidence is their presence in the report.
- Disclose what you could not access or reproduce; never imply coverage you didn't have.
- State residual risk; let management formally accept (or reject) it.
Reporting around an inaccessible vendor model
You audited a system built on a third-party LLM whose internals the vendor would not disclose. You must report honestly without over-claiming.
- State the scope limitation
Note that the vendor's model internals/training data were inaccessible.
- Describe what you could test
Integration, input/output controls, monitoring, and output-level fairness on real decisions.
- Identify the residual risk
Hidden bias or instability inside the vendor model that output testing may not fully reveal.
- Recommend compensating evidence
Obtain the vendor's independent assurance/model documentation as a control.
- Hand the risk decision to management
Present residual risk for management's formal acceptance, not the auditor's.
5.5 Follow-up & remediation tracking
Reporting is not the end: the auditor tracks remediation to closure and re-tests that fixes work. A finding is only resolved when evidence shows the control now operates effectively, not when management says it's done.
Auditor's angle: the risk is "closed-on-promise": items marked resolved on management's word without re-testing. The control is a tracked remediation log with auditor re-verification; the evidence is the re-test result confirming operating effectiveness.
- Re-test the fix; don't close on management's assertion alone.
- Track each finding with owner, due date, and verification status.
Closing the fairness re-test finding
Last year you found no periodic fairness re-testing. Management says it's now fixed. You must decide whether to close it.
- Confirm the design fix
Verify a policy/schedule now assigns an owner and a re-test cadence.
- Test operating effectiveness
Obtain evidence that a re-test actually ran at the last retrain.
- Reperform a spot check
Independently recompute the fairness metric for the latest version.
- Compare to the original gap
Confirm the specific deficiency is genuinely remediated.
- Update the tracker
Close only with re-test evidence; otherwise keep it open with status.
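The closure rule can be encoded directly in the tracker so that an item cannot be closed on assertion alone. The sketch below is a hypothetical record layout, with an invented finding ID, owner, and due date; the only logic that matters is that `status` ignores management's claim unless all three verification steps are evidenced.

```python
from dataclasses import dataclass


@dataclass
class RemediationItem:
    """One tracked finding; field names are illustrative."""
    finding_id: str
    owner: str
    due_date: str
    management_says_fixed: bool = False
    design_verified: bool = False
    operating_evidence_obtained: bool = False
    retest_passed: bool = False

    def status(self) -> str:
        """Closed only when design, operation, and the auditor's re-test are all evidenced."""
        if self.design_verified and self.operating_evidence_obtained and self.retest_passed:
            return "closed"
        if self.management_says_fixed:
            return "open: awaiting auditor re-test"
        return "open"


item = RemediationItem("F-01", owner="Model Risk", due_date="2025-03-31", management_says_fixed=True)
print(item.status())

item.design_verified = True
item.operating_evidence_obtained = True
item.retest_passed = True
print(item.status())
```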
5.6 Assurance vs advisory & auditor independence
Two heavily tested distinctions close the domain. Assurance gives an independent opinion against criteria; advisory helps management improve. Advisory can erode independence if the auditor later has to audit what they helped design, so the engagement type must be set up front and disclosed. And the cardinal rule: the auditor never designs, owns, operates, or signs off the controls under audit. They recommend; management decides, owns, and accepts residual risk.
The most common Domain 3 distractor is the auditor fixing the problem: "designed the new fairness control," "selected the model threshold," "wrote the monitoring rules." All impair independence.
Auditor's angle: objectivity is impaired if the audit function configured the very controls under review, or used an AI tool built/tuned by the team it now audits; disclose and manage it, and use a different auditor for the assurance.
The helpful auditor
While auditing AI governance, the auditor finds the bias-testing process weak and offers to write the new bias-testing standard so the team "doesn't have to." Next year, the same auditor is scheduled to audit that standard.
- Identify the role being assumed
Writing the standard makes the auditor the owner/designer of the control.
- Spot the independence impairment
They cannot independently audit their own work next year; objectivity is compromised.
- Define the proper role
The auditor may recommend what a good standard contains and cite frameworks, but management must author and own it.
- If management insists on help
Reclassify it as an advisory engagement and disclose it explicitly.
- Protect next year's assurance
Assign a different, independent auditor to review the standard.
Exam focus β quick recap
At 21% (~18–19 of 90 questions), Domain 3 is highly "what should the auditor do?" The right answer is usually the one that follows proper sequence, rests on the most reliable evidence, and protects independence. When two answers look right, pick the one an independent auditor (not management) would do, that rests on the strongest evidence, and that comes next in order.
- Sequence wins: understand & scope → risk-tier → set criteria → plan risk-based → design tests → gather evidence → conclude → report → follow up. "First/next/best" answers live in this order.
- Risk tier drives depth: match audit effort to the use case's risk; high-risk automated decisions get the heaviest testing.
- Design vs operating effectiveness: a policy/dashboard existing is design; proof it operated all period is operating effectiveness. Always push to the latter.
- Representative sampling: stratify by subgroup/time; prefer full-population CAAT testing when data is digital and complete.
- Evidence hierarchy: reperformance > independent system logs/third-party > auditee documents > interview. Corroborate inquiry; pin the model version and data snapshot.
- Reproducibility is evidence: inability to reproduce a result is a control finding, not a footnote.
- AI tools assist, never decide: beware over-reliance, explainability of your own tools, and client-data confidentiality. Verify output; you stay accountable.
- Findings = 4 Cs + recommendation: Condition, Criteria, Cause, Effect, Recommendation. Cause and effect drive severity and action.
- Independence is sacred: the auditor never designs, owns, operates, or signs off the controls under audit. Recommend; management owns and accepts residual risk.
- Report to two audiences: technical detail for the team, business risk for the board. State limitations and residual risk; track remediation to re-tested closure.