Science Museum Group Datasets

9th April 2025

We have made available below a number of bulk exports and datasets for academic or personal research use.

We hope these datasets will remove the need to bulk query our API, which we have had to rate limit heavily in response to overly aggressive crawling.

If you use one of our datasets for your project we would love to hear about it. It’s always interesting to learn how our data is being used, and it helps us justify our time and investment in maintaining the datasets.

Hopefully the datasets will be somewhat self-describing; the best way to get a handle on their structure and the data they contain is to compare a small subset of records with their matching online records.

Using our Data

Metadata

Data in the title, made, maker and details fields are released under a Creative Commons Zero licence.

Descriptions and textual content

Descriptions and all other text content are licensed under a Creative Commons Attribution 4.0 licence.

Attribution should be made to ‘© The Board of Trustees of the Science Museum’ along with a link back to the Science Museum Collection website https://collection.sciencemuseumgroup.org.uk

Images

Each image referenced within the dataset has its own copyright and licence information, for example:

          "legal": {
            "rights": [
              {
                "licence": "CC BY-NC-SA 4.0",
                "copyright": "© The Board of Trustees of the Science Museum"
              }
            ]
          },
          "credit": { "value": "Science Museum Group Collection" }

The full path of an image can be constructed by appending the path component of a multimedia location element (for example /241/152/large_thumbnail_1901_0006__0004_.jpg) to the base URL of our image server, https://coimages.sciencemuseumgroup.org.uk. The medium_thumbnail size (approximately 240px × 240px) is likely ideal for the purposes of training basic ML models.

Please do not unnecessarily download the larger image sizes in bulk or you may be rate limited.
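As a minimal sketch of the path construction described above, the hypothetical helper build_image_url below joins the image server base URL with a location value. The function name is for illustration only and is not part of any SMG API; the example path is the one quoted above.

```python
IMAGE_BASE_URL = "https://coimages.sciencemuseumgroup.org.uk"


def build_image_url(location: str) -> str:
    """Join the image server base URL and a multimedia location path."""
    # Normalise slashes so exactly one separates the base URL and the path.
    return IMAGE_BASE_URL.rstrip("/") + "/" + location.lstrip("/")


print(build_image_url("/241/152/large_thumbnail_1901_0006__0004_.jpg"))
```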

Alternatively, you can download a single dataset containing all the medium_thumbnail images as a single .zip file, in which case you will simply need to match up the image paths to wherever you placed that image folder locally.

smg_all_medium_thumbnail_images_09_04_2025.zip (1.19GB)
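One possible way to match image paths to an extracted copy of this archive is sketched below. The folder layout inside the .zip is not described here, so the sketch avoids relying on it: it indexes every extracted file by file name and looks up the final component of each location value. It assumes file names are unique across the archive, which you should verify for your copy. The helper names index_images and find_local_image are hypothetical.

```python
from pathlib import Path, PurePosixPath


def index_images(image_root):
    """Map each file name found under image_root to its local path."""
    # Assumption: file names are unique; a later duplicate would overwrite an earlier one.
    return {p.name: p for p in Path(image_root).rglob("*") if p.is_file()}


def find_local_image(index, location):
    """Return the local path for a dataset image location, or None if absent."""
    # Only the final path component (the file name) is used for the lookup.
    return index.get(PurePosixPath(location).name)
```

Build the index once with index_images, passing the folder where you extracted the archive, then call find_local_image for each image path taken from the JSON or CSV files.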

Images must not be harvested from our IIIF and Zoom endpoints. We will block access and may act against anyone doing so.

Copyright of images

Please check the licence of each image, only use images which fall under one of the following three licences, and attribute accordingly.

Always check the licence

Many records contain more than one image, and while the first image may be made available under an open licence, that does not necessarily hold true for every other image on that record. Always check the licence for each individual image.
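The per-image check could be sketched as follows. This is an illustration under assumptions, not a definitive implementation: it assumes the legal block shown earlier sits on each item of a record's multimedia list, and the set of acceptable licences is supplied by the caller. The helper name images_with_licence is hypothetical.

```python
def images_with_licence(record, allowed_licences):
    """Return the multimedia items whose licences are all in allowed_licences."""
    usable = []
    for item in record.get("multimedia", []):
        # Assumption: each multimedia item carries its own legal.rights list.
        rights = item.get("legal", {}).get("rights", [])
        licences = [entry.get("licence") for entry in rights]
        # Keep an image only if it states at least one licence and all are allowed.
        if licences and all(lic in allowed_licences for lic in licences):
            usable.append(item)
    return usable
```

An image with no licence information is excluded rather than assumed to be open, for example when calling images_with_licence(record, {"CC BY-NC-SA 4.0"}).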

Attribution

Attribution should be made to ‘© The Board of Trustees of the Science Museum’ along with a link back to the Science Museum Collection website https://collection.sciencemuseumgroup.org.uk where applicable.

Sources of data

Our collection of 7 million items has been added to and documented since 1851. Some historic records within the collection may contain illustrations and descriptions that are offensive or reflect outdated ideas and analysis. This material does not reflect our values as an organisation that is Open for All.

We are actively improving these historic records, including by adding context, correcting errors, changing language and calling out offensive characterisations. Sometimes we have retained original content as evidence of historic racism.

Notice & Takedown Policy

The Science Museum Group has made all reasonable efforts to ensure that images and other content on our collection website are reproduced with the consent of their copyright holders. However, there are some copyright holders who so far have proved to be untraceable. For more information, please view our Notice & Takedown Policy.

SMG Collection website

If you are simply looking to browse our collection online, you can find material published from our object and archival collections on the Science Museum Group collection website. We have over 500,000 objects and 50,000 archival documents published online.

Datasets

The following datasets are made available for academic or personal research use.

JSON

Note: Some of the following files are .zip compressed.
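If you would rather not extract a compressed file to disk first, a sketch along these lines reads the JSON straight from the archive using the Python standard library. It assumes each .zip contains a single .json file, which is not confirmed here; the helper name load_json_from_zip is hypothetical.

```python
import json
import zipfile


def load_json_from_zip(zip_path):
    """Load the first .json member of a .zip archive without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip macOS resource-fork entries, which can also end in .json.
        json_names = [
            name for name in archive.namelist()
            if name.endswith(".json") and not name.startswith("__MACOSX/")
        ]
        if not json_names:
            raise ValueError(f"No .json file found in {zip_path}")
        with archive.open(json_names[0]) as f:
            return json.load(f)
```

Call it with the path to a downloaded archive, for example load_json_from_zip('smg_object_records_with_CC_images_09_04_2025.json.zip'). The whole file is still parsed into memory, so the larger exports may need a few gigabytes of RAM.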

Object records

These are by and large what you think of as physical objects. Trains, planes and automobiles, household appliances, TV sets, typewriters, models, command modules and so much more.

smg_object_records_with_CC_images_09_04_2025.json.zip (150,355 records) 145MB
smg_object_records_all_09_04_2025.json.zip (525,595 records) 322.8MB

Documents records

These are largely 2D paper and print based objects in our collection, i.e. archives, plans, drawings, diaries, books, maps, trade journals.

smg_document_records_with_CC_images_09_04_2025.json (7,017 records) 159.2MB
smg_document_records_all_09_04_2025.json (77,409 records) 560.8MB

People and Company records

These are the people and company records related to the objects and documents we have published online and made available in the above datasets.

smg_people_and_company_records_09_04_2025.json (23,684 records) 61.4MB

Note #1: There may be duplication and overlap in these records, check @admin.source if you want to de-dupe the dataset yourself.

Note #2: There is currently no means to easily distinguish between people and company records. Hence the somewhat loose naming of this export.
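As a starting point for the de-duplication mentioned in Note #1, the sketch below groups record uids by their @admin.source value so that you can inspect the overlap. It assumes source and uid are simple values nested under an @admin object, as the dotted name suggests. The values this field takes are not documented here, so treat the grouping as exploratory only; the helper name group_by_source is hypothetical.

```python
from collections import defaultdict


def group_by_source(records):
    """Group record uids by their @admin.source value."""
    groups = defaultdict(list)
    for record in records:
        # Assumption: source and uid live under the record's "@admin" object.
        admin = record.get("@admin", {})
        groups[admin.get("source")].append(admin.get("uid"))
    return dict(groups)
```

Records with no @admin.source are collected under the key None, so they are not silently dropped.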

CSV

These are simplified CSV files containing a subset of key fields.

There is quite a lot of granular nested detail in our records. Depending on your use case, you may want to spend some time carefully parsing out and cleaning specific fields from the JSON files.

Downloads

smg_object_records_with_CC_images_09_04_2025.csv (150,344 records) 48.9MB
smg_object_records_all_09_04_2025.csv (525,595 records) 135.8MB

Field names

uid (a URL can be constructed from this https://collection.sciencemuseumgroup.org.uk/objects/co8538830)
identifier (SMG accession number)
title
description
category (a high level museum category)
material
object_name (a basic taxonomy classification, some overlap with Getty AAT)
date (made)
place (made)
maker
image (a medium size image; ties up with image data set below)
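A short sketch of reading one of these CSV files with the Python standard library and building a record URL from each uid, following the URL pattern shown in the field list. It assumes the files are UTF-8 encoded with a header row using the field names above; the helper names object_url and read_rows_with_urls are hypothetical.

```python
import csv

COLLECTION_BASE_URL = "https://collection.sciencemuseumgroup.org.uk"


def object_url(uid):
    """Build the collection website URL for an object uid."""
    return f"{COLLECTION_BASE_URL}/objects/{uid}"


def read_rows_with_urls(csv_path):
    """Yield each CSV row as a dict with an added 'url' key."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["url"] = object_url(row["uid"])
            yield row
```

Because read_rows_with_urls is a generator, rows are processed one at a time and the larger file does not need to fit in memory.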

Images

A single .zip file containing all the medium sized thumbnails referenced in the files above.

smg_all_medium_thumbnail_images_09_04_2025.zip (1.19GB)

Useful code snippets

Load a JSON file as a pandas DataFrame

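import json

import pandas as pd

# Unzip the download first if you fetched the .json.zip version
input_file = 'smg_object_records_with_CC_images_09_04_2025.json'

try:
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    print(f"JSON contains {len(data)} top-level records")

    # Flatten nested objects into dotted column names, e.g. '@admin.uid'
    df = pd.json_normalize(data)

    print("\nData loaded successfully. First few rows:")
    print(df.head())

    # Summarise columns and dtypes (only reached if the load succeeded)
    df.info()

except Exception as e:
    print(f"Error loading file: {e}")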

Examine the nested fields in the first record

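# Columns whose first value is a list or dict still hold nested data
nested_fields = [col for col in df.columns if isinstance(df[col].iloc[0], (list, dict))]

print("\nNested fields in first record:")
for col in nested_fields:
    print(f"\nField: {col}")
    print(df[col].iloc[0])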

Compare this structure to a website record:

Rocket Locomotive - object (HTML / JSON)
Robert Stephenson - person (HTML / JSON)
Side elevation of the Rocket locomotive engine - document (HTML / JSON)

Create a simple CSV file from the larger nested JSON exports:

Note: This can take a long time to process

There is quite a lot of granular nested detail in the records. Depending on your use-case you may want to spend some time carefully parsing out and cleaning specific fields from the JSON. The following is provided as an example only.

import pandas as pd
import json

def extract_primary_value(field):
    """Extract primary value from a list of dictionaries"""
    if isinstance(field, list) and len(field) > 0:
        for item in field:
            if isinstance(item, dict) and item.get('primary', False):
                return item.get('value', '')
        return field[0].get('value', '') if isinstance(field[0], dict) else ''
    return field

def extract_materials(field):
    """Extract semicolon-separated materials/names"""
    if isinstance(field, list):
        return "; ".join([item.get('value', '') for item in field if isinstance(item, dict)])
    return field

def extract_image_url(field):
    """Extract medium thumbnail image path"""
    if isinstance(field, list) and len(field) > 0:
        first_item = field[0]
        if isinstance(first_item, dict) and '@processed' in first_item:
            return first_item['@processed'].get('medium_thumbnail', {}).get('location', '')
    return ''

def process_dataframe(df):
    """Process the DataFrame to extract required fields"""
    processed_df = pd.DataFrame()

    # Extract and transform each field
    processed_df['uid'] = df['@admin.uid']
    processed_df['identifier'] = df['identifier'].apply(extract_primary_value)
    processed_df['title'] = df['title'].apply(extract_primary_value)
    processed_df['description'] = df['description'].apply(extract_primary_value)
    processed_df['category'] = df['category'].apply(lambda x: x[0]['name'] if isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict) else x)
    processed_df['material'] = df['material'].apply(extract_materials)
    processed_df['object_name'] = df['name'].apply(extract_materials)
    processed_df['date'] = df['creation.date'].apply(extract_primary_value)

    # Extract place (first summary.title)
    processed_df['place'] = df['creation.place'].apply(
        lambda x: x[0]['summary']['title'] if isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict) and 'summary' in x[0] else ''
    )

    # Extract maker (first summary.title)
    processed_df['maker'] = df['creation.maker'].apply(
        lambda x: x[0]['summary']['title'] if isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict) and 'summary' in x[0] else ''
    )

    # Extract image URL
    processed_df['image'] = df['multimedia'].apply(extract_image_url)

    return processed_df

def main():
    # Load JSON file
    input_file = 'smg_object_records_with_CC_images_09_04_2025.json'  # Change to your input file path
    output_file = 'smg_object_records_with_CC_images_09_04_2025.csv'  # Change to your desired output file path

    try:
        # Read JSON file
        with open(input_file, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # Create DataFrame (assuming JSON is a list of records)
        df = pd.json_normalize(data)

        # Process the DataFrame
        processed_df = process_dataframe(df)

        # Export to CSV
        processed_df.to_csv(output_file, index=False, columns=[
            'uid', 'identifier', 'title', 'description', 'category', 
            'material', 'object_name', 'date', 'place', 'maker', 'image'
        ])

        print(f"Successfully processed and saved to {output_file}")

        # Print an example row (the second record) as a quick sanity check
        print("\nExample row:")
        for column, value in processed_df.iloc[1].items():
            print(f"{column}:", value)

    except Exception as e:
        print(f"Error processing file: {str(e)}")

if __name__ == "__main__":
    main()