Data – FOI Forest

FOI Forest collects disclosure log data from federal agencies and processes it into a single, consistently structured dataset. That dataset powers the search on this site and is also available for download here. If you prefer to work with the data exactly as scraped, before any processing occurs, it is also available as separate files for each agency.

All datasets are released under CC0 1.0. You can use them for any purpose without restriction. Attribution is appreciated but not required.

Unified dataset

↓ Download foiforest.csv

The unified dataset contains the records from all scraped agencies, normalised into a consistent schema. It is rebuilt from scratch each weekday.

Schema

Field	Description
`request_id`	Stable identifier for each logical FOI request. It is unique per request rather than per row: rows kept separate during deduplication, such as staged releases, share one value. For records with a reference number, this takes the form `{agency_key}:{reference_number}` (e.g. `afp:LEX-1234`). For records without a reference number, it is derived from a hash of the agency key, date, and description, in the form `{agency_key}:nref:{hash}`.
`reference_number`	The FOI reference number as published by the agency. Missing for a small proportion of records, most of which come from agencies that do not assign reference numbers (notably AUSTRAC).
`date`	The best available date for the request: `date_of_access` where present, otherwise `date_published`. See `date_source` for provenance. Present for nearly all records.
`date_source`	How `date` was determined. `date_of_access`: taken directly from the agency log (present in the majority of records); `date_published`: the date the entry appeared on the disclosure log; `date_published_month`: month-level publication date, day set to 1; `ref_year`: year extracted from the reference number; `url_path`: inferred from a dated path segment in a document URL; `year_of_access`: year inferred from a heading in the page HTML; `none`: no date recoverable.
`date_of_access`	The date the applicant received the decision or documents, as recorded by the agency. Missing where agencies do not publish this field.
`date_published`	The date the entry appeared on the disclosure log.
`date_first_scraped`	The date FOI Forest first collected this record. Useful for tracking when entries were added to the log, independent of the decision date.
`description`	The request description as published by the agency.
`access_outcome`	Standardised access outcome: `Full` (full release), `Partial` (partial release or some documents withheld), or empty. Many agencies do not publish outcome information on their disclosure logs, and this field reflects what is available in the source data rather than a complete picture of outcomes. Notably, requests that are refused in full do not appear in disclosure logs.
`document_urls`	Semicolon-separated URLs to released documents, where published. Some agencies publish documents on request rather than linking directly. These are the links present when the page was scraped and may no longer be functional.
`notes`	Agency-specific fields that do not map to the standard schema but are worth preserving for search and analysis. Formatted as `key: value` pairs, newline-separated.
`agency`	Full agency name. Note that agency names reflect the department’s name at the time of the most recent scrape, not at the time the request was decided. If a department is renamed, all historical records for that agency will appear under the new name in subsequent data exports.
`canonical_url`	URL of the agency’s disclosure log page where this record was found, with pagination parameters stripped for stability.
`subpage_url`	URL of the individual entry page, where the agency publishes records on separate pages rather than in a single table.
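
As a rough illustration of how this schema might be consumed, the sketch below reads rows with Python’s standard `csv` module and unpacks the two multi-value fields, `document_urls` and `notes`. The helper names (`split_urls`, `parse_notes`, `load_records`) and the sample values are hypothetical, not part of FOI Forest. It also assumes that the first colon on each line of `notes` separates the key from the value.

```python
import csv
import io


def split_urls(value):
    """Split the semicolon-separated document_urls field into a list."""
    return [u.strip() for u in value.split(";") if u.strip()]


def parse_notes(value):
    """Parse newline-separated 'key: value' pairs into a dict.

    Assumption: the first colon on each line separates key from value.
    Lines without a colon are skipped.
    """
    notes = {}
    for line in value.splitlines():
        key, sep, val = line.partition(":")
        if sep:
            notes[key.strip()] = val.strip()
    return notes


def load_records(handle):
    """Yield each CSV row as a dict with the multi-value fields unpacked."""
    for row in csv.DictReader(handle):
        row["document_urls"] = split_urls(row.get("document_urls") or "")
        row["notes"] = parse_notes(row.get("notes") or "")
        yield row


# Tiny in-memory sample with made-up values, standing in for foiforest.csv.
sample = io.StringIO(
    "request_id,document_urls,notes\n"
    'afp:LEX-1234,https://example.org/a.pdf;https://example.org/b.pdf,'
    '"pages: 12\nfee: waived"\n'
)
records = list(load_records(sample))
```

To read the downloaded file instead of the sample, something like `open("foiforest.csv", newline="", encoding="utf-8")` should work, assuming the file is UTF-8 encoded.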

Notes on the data

Date quality. Dates come directly from agency logs, but date formats vary considerably across agencies and have changed over time within agencies. Some records have dates that required manual correction during processing due to clear typographical errors in the source data (such as invalid month names or transposed digits).

Reference numbers. Most agencies assign a reference number to each FOI request, but not all. Where present, reference numbers are reproduced as is. A small number of agencies appear to have assigned the same reference number to genuinely distinct requests. If you are relying on reference numbers to deduplicate or link records, treat them with some caution.

Access outcomes. The access_outcome field is only as reliable as the source data. Agencies use inconsistent language to describe outcomes, and the standardisation applied here (Full / Partial) is necessarily approximate. A substantial proportion of records have no outcome recorded, either because the agency does not publish this information or because the field could not be reliably parsed.

Archived sources. For some agencies, records are drawn from historical snapshots in the National Library of Australia’s Australian Web Archive or the Internet Archive Wayback Machine in addition to the current live disclosure log. This extends coverage back further than the live log alone. Document links for these records point to archived copies rather than the original agency site, which means they are more likely to remain accessible long-term.

Agency names and machinery of government. The agency field in FOI Forest reflects the agency’s current name, applied uniformly across all historical records including those that predate the current name. This is a pragmatic choice forced by the fact that agencies handle their own disclosure log history inconsistently. Some carry their full history forward under the current name with no indication of any discontinuity, whereas others start fresh when restructured and drop older records from their log entirely. A small number do indicate where entries come from a predecessor, but even these have inconsistencies. Because there is no reliable way to determine which historical predecessor a given record rightfully belongs to, using the current agency name seems to me the least bad option.

The consequence is that filtering by agency name may not return a coherent institutional unit, particularly for departments that have gained or lost functions over time without changing name. The canonical_url field can be helpful here, as a record whose URL points to a different domain than the current agency’s site is likely a historical record that predates the current department’s form.
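
The domain comparison described above could be sketched roughly as follows. This is only a heuristic under stated assumptions: the function names are hypothetical, the hosts in the usage lines are placeholders, and the current host for each agency has to be supplied by you, since it is not a field in the dataset.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def looks_historical(canonical_url, current_host):
    """True when the record's log URL sits on a different host than the
    agency's current site. An empty or unparseable URL returns False."""
    host = host_of(canonical_url)
    return bool(host) and host != current_host.lower()


# Placeholder hosts for illustration only.
flag_old = looks_historical("https://old.example.gov.au/foi/log", "example.gov.au")
flag_new = looks_historical("https://www.example.gov.au/foi/log", "example.gov.au")
```

A flagged record is worth a closer look rather than proof of a predecessor agency, since a URL can differ from the current site for other reasons.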

Some agencies have undergone complete institutional succession across multiple names. The Administrative Review Tribunal is the clearest example: records from the Migration Review Tribunal, the Refugee Review Tribunal, and the Administrative Appeals Tribunal are all attributed to the ART, reflecting a continuous institutional function across three names. Similarly, records from the Department of Human Services appear under Services Australia.

Some agencies are more challenging to deal with because two currently distinct agencies share a predecessor that combined the functions of both. This affects the Department of Agriculture, Fisheries and Forestry and the Department of Climate Change, Energy, the Environment and Water, and also the Department of Education and the Department of Employment and Workplace Relations. In these cases the same record often appears in the logs of both agencies following the split, producing duplicates in the dataset.

To reduce this duplication in the unified dataset, where a reference number appears in both the Department of Agriculture, Fisheries and Forestry and the Department of Climate Change, Energy, the Environment and Water logs, the former’s copy is attributed to the latter and the duplicate removed. For the Department of Education and the Department of Employment and Workplace Relations, a creator field preserved in the raw scraped data identifies which historical predecessor published each record and is used to route it to the correct modern successor. These changes apply only to the unified dataset. The raw files reflect the data as scraped and will contain the duplicate records.

The attribution in these cases was done to reduce duplication, not to make a claim about institutional history. If you are researching historical records for any of these four agencies, I suggest broadening your search to agencies that historically shared the same functions.

Raw files

Raw files are available for each agency. These files contain the data exactly as scraped from each agency’s disclosure log, before any normalisation, date parsing, deduplication, or schema standardisation is applied. Column names and structures vary by agency and reflect each agency’s own disclosure log format.

Raw files are re-uploaded each weekday in the same run that produces the unified dataset, so the two are always in sync.

Processing

The processing pipeline has three stages.

Scraping

Each agency’s disclosure log is fetched on a weekday schedule, using a headless browser where required because many agencies block simpler HTTP requests. The scraper extracts structured data from tables, list elements, or custom page layouts depending on the agency.

Transformation

Each agency’s raw data is processed through a common pipeline that handles date parsing, URL normalisation, and schema mapping. Agencies use at least eight distinct date formats overall, often mixed within the same log. Dates are parsed using a combination of format specifications and fallback strategies. Where dates cannot be parsed from the decision date field, FOI Forest falls back to publication date, URL path segments, year headings in the page HTML, or years embedded in reference numbers, in that order of preference. The date_source field records which strategy was used for each record. Agency-specific fields that do not map to the standard schema are preserved in the notes field rather than discarded.
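
To make the fallback order concrete, here is a minimal sketch of how such a date resolver might look. It is not the actual FOI Forest pipeline. The format list, the regular expressions, the raw field names `document_url` and `year_heading`, and the choice to set year-only dates to 1 January are all assumptions made for illustration.

```python
import re
from datetime import date, datetime

# Assumed formats for illustration; the real pipeline's list is not given here.
FORMATS = ["%d/%m/%Y", "%d %B %Y", "%d %b %Y", "%Y-%m-%d"]


def parse_date(text):
    """Try each format in turn; return a date, or None if none match."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def resolve_date(raw):
    """Return (date, date_source) following the fallback order in the text.

    `raw` is a hypothetical dict of raw scraped fields.
    """
    for field in ("date_of_access", "date_published"):
        parsed = parse_date(raw.get(field) or "")
        if parsed:
            return parsed, field
    # A dated path segment such as /2019/07/ in a document URL (assumed pattern).
    path = re.search(r"/((?:19|20)\d{2})/(0[1-9]|1[0-2])/", raw.get("document_url") or "")
    if path:
        return date(int(path.group(1)), int(path.group(2)), 1), "url_path"
    # A year from a page heading; day and month default to 1 January (assumption).
    heading = re.search(r"\b(?:19|20)\d{2}\b", raw.get("year_heading") or "")
    if heading:
        return date(int(heading.group(0)), 1, 1), "year_of_access"
    # A four-digit year embedded in the reference number.
    ref = re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", raw.get("reference_number") or "")
    if ref:
        return date(int(ref.group(0)), 1, 1), "ref_year"
    return None, "none"
```

Returning the source label alongside the date mirrors the way the date_source field lets you filter out low-precision dates before doing any time-based analysis.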

Deduplication

Some agencies publish the same request multiple times, for example when a decision is updated, when documents are released in tranches, or through an apparent data entry error.

FOI Forest resolves duplicate reference numbers within an agency as follows:

Records with the same reference number, description (case-insensitive), and URLs are collapsed to the earliest record. Any later dates are noted in the notes field.
Records with the same reference number, description, and dates but different URLs are merged, with URLs combined.
Records with the same reference number and description but both different dates and different URLs are kept as separate rows to preserve tranche release history.
Records where a reference number appears to belong to genuinely distinct requests (identified by the reference number appearing in another record’s description) are kept as separate rows.
Everything else with a duplicate reference number is kept as separate rows.
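
The first three rules above can be sketched as follows for a group of records that already share one reference number within one agency. This is a simplified interpretation, not the actual pipeline: the record shape (a dict with `description`, an ISO-format `date` string, and a `urls` list) and the wording of the note are assumptions, and the check for genuinely distinct requests is left out.

```python
from collections import defaultdict


def resolve_group(records):
    """Apply the first three rules to records sharing one reference number."""
    by_description = defaultdict(list)
    for rec in records:
        by_description[rec["description"].strip().lower()].append(rec)

    resolved = []
    for same_desc in by_description.values():
        # Rule 1: identical URLs -> keep the earliest record, note later dates.
        by_urls = defaultdict(list)
        for rec in same_desc:
            by_urls[frozenset(rec["urls"])].append(rec)
        collapsed = []
        for group in by_urls.values():
            group.sort(key=lambda r: r["date"])  # ISO strings sort by date
            keeper = dict(group[0])
            later = sorted({r["date"] for r in group[1:]} - {keeper["date"]})
            if later:
                keeper["notes"] = "later dates: " + ", ".join(later)
            collapsed.append(keeper)

        # Rule 2: identical dates but different URLs -> merge, combining URLs.
        by_date = defaultdict(list)
        for rec in collapsed:
            by_date[rec["date"]].append(rec)
        for group in by_date.values():
            merged = dict(group[0])
            merged["urls"] = sorted({u for r in group for u in r["urls"]})
            resolved.append(merged)
        # Rule 3 needs no code: rows differing in both date and URLs were
        # never grouped together, so they stay as separate rows.
    return resolved
```

Records with different descriptions never meet in the same inner group, which corresponds to the final catch-all rule of keeping everything else as separate rows.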

The request_id field provides a stable identifier that reflects these decisions. It takes one of two forms:

{agency_key}:{reference_number} for records with a reference number, where agency_key is a short internal key (e.g. afp, health) rather than the full agency name. This means request_id is stable even if a department is renamed. All rows belonging to the same logical FOI request share a request_id. This includes staged releases kept as separate rows, and ambiguous cases that have not been resolved. In these cases request_id is therefore not unique per row, but unique per logical request.
{agency_key}:nref:{hash} for records without a reference number. The hash is derived from the agency key, date, and description. It is stable across scrape runs as long as neither the description nor the date changes. This means that a record that initially has no date but later acquires one will get a new request_id.
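
The two forms can be illustrated with a short sketch. The hash function, the truncation length, and the separator between the hashed parts are not specified on this page, so SHA-256, 12 hex characters, and `|` are assumptions here, as is the function name `make_request_id`.

```python
import hashlib


def make_request_id(agency_key, reference_number, date, description):
    """Build a request_id in one of the two documented forms.

    Assumptions: SHA-256 truncated to 12 hex characters, with '|' joining
    the hashed parts. The real hashing scheme may differ.
    """
    if reference_number:
        return f"{agency_key}:{reference_number}"
    payload = "|".join([agency_key, date or "", description or ""])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{agency_key}:nref:{digest}"


with_ref = make_request_id("afp", "LEX-1234", "2021-03-01", "Some request")
undated = make_request_id("austrac", "", "", "Some request")
dated = make_request_id("austrac", "", "2021-03-01", "Some request")
```

Because the date is part of the hashed input, `undated` and `dated` differ, which is the behaviour described above for a record that later acquires a date.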

A consequence of this design is that reference_number and request_id do not always align cleanly. Some agencies reuse reference numbers across genuinely unrelated requests; where this has been detected, both records are kept as separate rows sharing the same request_id. There are also cases where the same reference number appears to belong to unrelated requests but the relationship is ambiguous and has not been resolved. These also share a request_id. If you are using reference_number to identify unique requests, be aware that it may not be unique within an agency, and that request_id may inherit this ambiguity rather than resolving it.
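
One way to see how much this matters for your own analysis is to count the rows behind each request_id. The sketch below assumes rows are dicts keyed by the schema field names, for example as produced by `csv.DictReader`. The function name and the sample rows are hypothetical.

```python
from collections import Counter


def multi_row_requests(rows):
    """Return {request_id: row_count} for request_ids spanning several rows."""
    counts = Counter(row["request_id"] for row in rows)
    return {rid: n for rid, n in counts.items() if n > 1}


# Made-up rows for illustration.
rows = [
    {"request_id": "afp:LEX-1234", "date": "2021-03-01"},
    {"request_id": "afp:LEX-1234", "date": "2021-05-01"},
    {"request_id": "afp:LEX-2000", "date": "2021-06-01"},
]
shared = multi_row_requests(rows)
```

Any request_id in the result covers either a staged release or a reused reference number, so those rows are worth reading individually before being counted as one request.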

Source disclosure logs

The table below lists the primary disclosure log URL scraped for each agency. These live logs are the authoritative sources for current data; if something looks wrong, checking the source log is the best place to start. For some agencies, historical snapshots from the National Library of Australia’s Australian Web Archive or the Internet Archive Wayback Machine are also used to extend coverage.

Contact

Found an error or want to suggest an agency to add? Reach me at gabrielle@foiforest.org.